Should the /psc URL work on both HA Platform Services nodes?

I recently ran into a strange issue following the enablement of two Platform Services Controller (PSC) 6.5 nodes in an HA configuration, as part of a larger rolling upgrade from vCenter 5.5.

NB – all URLs shown are internal, in use within my lab environment only.

During the migration of the existing customer's vCenter environment we had to rehearse the externalisation of the PSC from an initial embedded SSO instance. As part of this process the first PSC node in a new site was migrated from an original Windows vCenter 5.5 SSO to PSC 6.5, and a second new node was subsequently joined to the first site so that replication could be established.

I used a Citrix NetScaler to load balance the configuration, and noticed at some point after the HA repointing had completed successfully that I was unable to access the https://hosso01.sbcpureconsult.internal/psc URL.

The second node, https://hosso2.sbcpureconsult.internal/psc, worked correctly and redirected to the load-balanced address psc-ha-vip.sbcpureconsult.internal for authentication before displaying the PSC client UI.

Irrespective of which node was selected, I was able to log in to vCenter, choose Administration, System Configuration, select a node and then Manage, Settings or CA without receiving any errors.

If I deliberately dropped the first node out of the load-balancing configuration on the NetScaler I had no issues accessing the /psc URL by either host name or load-balancer name, but if I tried to connect to the first node directly by its own DNS name or IP address I received an HTTP 400 error and the following entry in:

/storage/log/vmware/psc-client/psc-client.log

[2018-10-08 12:05:20.347] [ERROR] tomcat-http--3 com.vmware.vsphere.client.security.websso.MetadataGeneratorImpl - Error when creating idp metadata.
java.lang.RuntimeException: java.io.IOException: HTTPS hostname wrong:  should be <psc-ha-vip.sbcpureconsult.internal>

It appeared that the HTTP 400 error arose because the psc-client Tomcat application no longer started up correctly on the first node, accompanied by an error in:

/storage/log/vmware/rhttpproxy/rhttpproxy.log

2018-10-08T13:27:10.691Z warning rhttpproxy[7FEA4B941700] [Originator@6876 sub=Default] SSL Handshake failed for stream <SSL(<io_obj p:0x00007fea2c098010, h:27, <TCP '192.168.0.117:443'>, <TCP '192.168.0.121:26417'>>)>: N7Vmacore3Ssl12SSLExceptionE(SSL Exception: error:140000DB:SSL routines:SSL routines:short read)
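
Given the hostname-mismatch and SSL handshake errors above, a quick way to see which certificate each node actually presents on port 443 is openssl, run from any machine that can reach them (a small sketch using the lab host names from this post):

# print the subject and issuer of the certificate presented by each PSC node
echo | openssl s_client -connect hosso01.sbcpureconsult.internal:443 2>/dev/null | openssl x509 -noout -subject -issuer
echo | openssl s_client -connect hosso2.sbcpureconsult.internal:443 2>/dev/null | openssl x509 -noout -subject -issuer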

I repeated in my lab environment the same series of steps I had carried out on the customer site, and was able to confirm the same behaviour. I should explain at this point that all other vCenter functionality was correct, and the issue only affected the /psc URL.

Could this be deemed ‘correct’ behaviour?

If I chose https://psc-ha-vip.sbcpureconsult.internal/psc (the load-balancer address) I was initially only able to connect if the second node was online and happened to be selected.

Before signing off on the work I wanted to confirm whether it should be possible to access the /psc URL on each node deliberately.

After what seemed like a lot of internal dialogue between me and my inner tech support department (sleepless nights!), I was left wondering what could be going wrong, especially since this was the documented procedure from VMware.

Good news: I was able to roll back my lab and re-run the updateSSOConfig.py and UpdateLsEndpoint.py scripts, only to find that the /psc URL did indeed load successfully on both nodes with the NetScaler load balancing in place!
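
For reference, this is roughly how those two scripts are invoked on the PSC appliance. The path and the UpdateLsEndpoint.py parameters below are reconstructed from my recollection of the VMware 6.5 HA repointing procedure rather than captured from this environment, so treat them as assumptions and check the current documentation before running them:

# run from the appliance shell; path and parameters assumed from the 6.5 HA repointing procedure
cd /usr/lib/vmware-sso/bin
python updateSSOConfig.py --lb-fqdn=psc-ha-vip.sbcpureconsult.internal
python UpdateLsEndpoint.py --lb-fqdn=psc-ha-vip.sbcpureconsult.internal --user=Administrator@vsphere.local --password='<password>'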

So at least I knew what the correct behaviour was: you should be able to open the /psc URL on both appliances.

By examining my snapshots at different stages I was able to identify a difference between the original migration node and the clean appliance:

When you run the updateSSOConfig.py Python script to repoint the SSO URL to the load-balanced address, its output shows that hostname.txt and server.xml are modified:

# python updateSSOConfig.py --lb-fqdn=psc-ha-vip.sbcpureconsult.internal
script version:1.1.0
executing vmafd-cli command
Modifying hostname.txt
modifying server.xml
Executing StopService --all
Executing StartService --all

I was able to locate hostname.txt files (generally containing the load-balancer address) in:

  • /etc/vmware/service-state/vmidentity/hostname.txt
  • /etc/vmware-sso/keys/hostname.txt (missing on node 2, but contained the local name on node 1)
  • /etc/vmware-sso/hostname.txt

The second of these files was missing on the second node. Why is this? I suspect it is used transiently during the script execution to inject the correct value into the server.xml file.
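
To compare the two appliances quickly, the same three paths can be checked from the bash shell on each node, for example:

# print the contents of each hostname.txt, flagging any that are missing
for f in /etc/vmware/service-state/vmidentity/hostname.txt \
         /etc/vmware-sso/keys/hostname.txt \
         /etc/vmware-sso/hostname.txt; do
  echo "== $f"
  cat "$f" 2>/dev/null || echo "(missing)"
done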

The server.xml file is located at:

/usr/lib/vmware-sso/vmware-sts/conf/server.xml

My faulty node contained the following certificate store entries under the connector definition:

..store="STS_INTERNAL_SSL_CERT"
certificateKeystoreFile="STS_INTERNAL_SSL_CERT"..

My working node contained:

..store="MACHINE_SSL_CERT"
certificateKeystoreFile="MACHINE_SSL_CERT"..

So I was able to simply copy the server.xml file from the working node (overwriting the original on the faulty node) and remove the /etc/vmware-sso/keys/hostname.txt file so that the configuration matched.
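
For completeness, the fix on the faulty node boiled down to something like the following (a sketch, assuming bash and SSH access on both appliances; keep a backup of the original file):

# back up the existing server.xml on the faulty node
cp /usr/lib/vmware-sso/vmware-sts/conf/server.xml /usr/lib/vmware-sso/vmware-sts/conf/server.xml.bak
# copy the known-good server.xml across from the working node
scp root@hosso2.sbcpureconsult.internal:/usr/lib/vmware-sso/vmware-sts/conf/server.xml /usr/lib/vmware-sso/vmware-sts/conf/server.xml
# remove the stray hostname file so both nodes match, then reboot
rm /etc/vmware-sso/keys/hostname.txt
reboot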

Following a reboot, my first SSO node responded correctly by redirecting https://hosso01.sbcpureconsult.internal/psc to https://psc-ha-vip.sbcpureconsult.internal/websso to obtain its SAML token before ultimately displaying the PSC client UI.
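
If you want to confirm the redirect without a browser, a quick check with curl against each node should show it (lab host names again; -k is only acceptable here because these are internal lab certificates):

# -I requests headers only, -k skips certificate verification (lab use only)
curl -kI https://hosso01.sbcpureconsult.internal/psc
# a healthy node should answer with a 3xx redirect whose Location header points at
# https://psc-ha-vip.sbcpureconsult.internal/websso/... rather than returning HTTP 400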

As a follow-up, by examining the STS_INTERNAL_SSL_CERT store I could see that the machine certificate in use had been issued by the original Windows vCenter Server 5.5 SSO CA to the subject name:

ssoserver,dc=vsphere,dc=local

This store was not present on the other node, so the correct load-balancing certificate replacement must somehow be omitted by one of the upgrade scripts in this scenario (Windows 5.5 SSO migrated to PSC 6.5).
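
If you want to check for the same condition on your own appliances, the VECS certificate stores and their contents can be inspected with vecs-cli, which ships on the appliance; a minimal sketch:

# list all certificate stores present on this node
/usr/lib/vmware-vmafd/bin/vecs-cli store list
# if STS_INTERNAL_SSL_CERT exists, dump its entries and show subject/issuer
/usr/lib/vmware-vmafd/bin/vecs-cli entry list --store STS_INTERNAL_SSL_CERT --text | grep -E 'Subject:|Issuer:'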

I hope that VMware fixes this bug in due course, particularly as more customers move to the appliance-based model of vCenter 6.x, but in the meantime this workaround should at least be considered if you run into a similar problem.

NB: This post is adapted from a longer discussion on the VMware Communities forum, available at https://communities.vmware.com/thread/598140.
