PowerCLI Get-Tag fails with “Could not load file or assembly ‘Newtonsoft.Json, Version=10.0.0.0’”

Here’s a simple scenario that I came across today. You would like to work with your vSphere environment using the latest PowerCLI, but discover that v6.5.1 is the latest downloadable version on VMware’s website. Hearing that distribution of the modules has now moved to the PowerShell Gallery, you open a PowerShell prompt and enter:

PS:\> Install-Module VMware.PowerCLI

The modules are downloaded and installed successfully, and you are able to connect to your vCenter environment:

Connect-VIServer -server vcenterserver.com -user 'DOMAIN\username'

But when you attempt to use a simple command such as:

Get-Tag

you receive an error similar to:

get-tag : 11/10/2018 21:06:20   Get-Tag         Could not load file or assembly 'Newtonsoft.Json, Version=10.0.0.0,
Culture=neutral, PublicKeyToken=30ad4fe6b2a6aeed' or one of its dependencies. The system cannot find the file
specified.
At line:1 char:1

In my case I found that other system components on my VM (e.g. the Citrix Virtual Desktop Agent) were using an older version of Newtonsoft.Json.dll, and that copy sat earlier in the assembly search path than the PowerShell module’s own location.

Searching for the file conflict using ProcMon, I noticed that the Connect-VIServer cmdlet does indeed find and load a version of this .dll during the connection process, e.g. the one located in:

C:\Windows\Microsoft.NET\assembly\GAC_MSIL\Newtonsoft.Json\v4.0_4.5.0.0__30ad4fe6b2a6aeed\Newtonsoft.Json.dll

However, this version is 5.0.5.16108 on my Windows Server 2016 platform, and we’re looking for 10.0.0.0 or newer.
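
If you want to confirm which copies are present and which one your session has actually loaded, a quick PowerShell check along these lines will do (the paths are from my environment, so adjust the module folder to suit):

$gacDll    = 'C:\Windows\Microsoft.NET\assembly\GAC_MSIL\Newtonsoft.Json\v4.0_4.5.0.0__30ad4fe6b2a6aeed\Newtonsoft.Json.dll'
$moduleDll = "$HOME\Documents\WindowsPowerShell\Modules\VMware.VimAutomation.Common\net45\Newtonsoft.Json.dll"

# Compare the file versions of the two copies
Get-Item $gacDll, $moduleDll | Select-Object FullName, @{n='FileVersion';e={$_.VersionInfo.FileVersion}}

# Show which copy (if any) the current session has loaded
[AppDomain]::CurrentDomain.GetAssemblies() |
    Where-Object { $_.GetName().Name -eq 'Newtonsoft.Json' } |
    Select-Object FullName, Location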

Work-around

Retrieve the newer version of the file (supplied with the PowerCLI modules), located for instance in:

C:\Users\username\Documents\WindowsPowerShell\Modules\VMware.VimAutomation.Common\net45

and place a copy somewhere PowerShell is likely to find it, e.g.:

C:\Windows\System32\WindowsPowerShell\v1.0\Newtonsoft.Json.dll
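
For reference, a minimal PowerShell sketch of the copy (assuming a per-user module install; run from an elevated prompt and back up any existing file first):

$source = "$HOME\Documents\WindowsPowerShell\Modules\VMware.VimAutomation.Common\net45\Newtonsoft.Json.dll"
$target = 'C:\Windows\System32\WindowsPowerShell\v1.0\Newtonsoft.Json.dll'

# Keep a backup of anything already in place, then copy the newer DLL across
if (Test-Path $target) { Copy-Item $target "$target.bak" }
Copy-Item $source $target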

This simple work-around proved successful for me, but of course you should verify any other functionality that might depend on this file before making a similar change in a production environment.

Should the /psc URL work on both HA Platform Services nodes?

I recently ran into a strange issue following the enablement of two PSC 6.5 nodes in an HA configuration, as part of a larger rolling upgrade from vCenter 5.5.

NB – all URLs shown are internal, in use within my lab environment only.

During the migration of the existing customer’s vCenter environment we had to rehearse the externalisation of the PSC from an initial embedded SSO instance. As part of this process the first PSC node in a new site was migrated from an original Windows vCenter 5.5 SSO to PSC 6.5, and subsequently a second new node was joined to the first site in order for replication to be established.

I used a Citrix NetScaler to load balance the configuration and noticed, some time after the HA repointing had completed successfully, that I was unable to access the https://hosso01.sbcpureconsult.internal/psc URL.

The second node, https://hosso2.sbcpureconsult.internal/psc, worked correctly, redirecting to the load-balanced address psc-ha-vip.sbcpureconsult.internal for authentication before displaying the PSC client UI.

Irrespective of which node was selected, I was able to log in to vCenter, then choose Administration, System Configuration, select a node and then Manage, Settings or CA without receiving any errors.

If I deliberately dropped the first node out of the load balancing configuration on the NetScaler, I had no issues accessing the /psc URL by either host name or load balancer name; but if I tried to connect to the first node by its own DNS name or IP address, I received an HTTP 400 error and the following entry in:

/storage/log/vmware/psc-client/psc-client.log

[2018-10-08 12:05:20.347] [ERROR] tomcat-http--3 com.vmware.vsphere.client.security.websso.MetadataGeneratorImpl - Error when creating idp metadata.
java.lang.RuntimeException: java.io.IOException: HTTPS hostname wrong:  should be <psc-ha-vip.sbcpureconsult.internal>

It appeared that the HTTP 400 error arose because the psc-client Tomcat application no longer started up correctly on the first node, accompanied by an error in:

/storage/log/vmware/rhttpproxy/rhttpproxy.log

2018-10-08T13:27:10.691Z warning rhttpproxy[7FEA4B941700] [Originator@6876 sub=Default] SSL Handshake failed for stream <SSL(<io_obj p:0x00007fea2c098010, h:27, <TCP '192.168.0.117:443'>, <TCP '192.168.0.121:26417'>>)>: N7Vmacore3Ssl12SSLExceptionE(SSL Exception: error:140000DB:SSL routines:SSL routines:short read)
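
To watch both logs while reproducing the error against the first node, something as simple as this (run on the affected appliance) is enough:

tail -f /storage/log/vmware/psc-client/psc-client.log /storage/log/vmware/rhttpproxy/rhttpproxy.log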

I repeated in my lab environment the same series of steps I had carried out on the customer site, and was able to confirm the same behaviour. I should explain at this point that all other vCenter functionality was correct; our issue only affected the /psc URL.

Could this be deemed ‘correct’ behaviour?

If I chose https://psc-ha-vip.sbcpureconsult.internal/psc (the load balancer address) I was initially only able to connect if the second node was online and happened to be selected.

Before signing off on the work, I wanted to confirm whether it should be possible to access the /psc URL on each node deliberately.

After what seemed like a lot of internal dialogue between myself and my inner tech support dept. (sleepless nights!) I was left wondering what could be going wrong, especially as this was the documented procedure from VMware.

Good news: I was able to roll back my lab and re-run the updateSSOConfig.py and UpdateLsEndpoint.py scripts, only to find that the /psc URL did indeed load successfully on both nodes with the NetScaler load balancing in place!

So at least I knew the correct behaviour: you should be able to open /psc on both appliances.

By examining my snapshots at different stages I was able to identify a difference between the original migration node and the clean appliance:

When you run the updateSSOConfig.py Python script to repoint the SSO URL to the load-balanced address, it reports that hostname.txt and server.xml were modified:

# python updateSSOConfig.py --lb-fqdn=psc-ha-vip.sbcpureconsult.internal
script version:1.1.0
executing vmafd-cli command
Modifying hostname.txt
modifying server.xml
Executing StopService --all
Executing StartService --all

I was able to locate hostname.txt files (containing the load balancer address) in:

  • /etc/vmware/service-state/vmidentity/hostname.txt
  • /etc/vmware-sso/keys/hostname.txt (missing on node 2, but contained the local name on node 1)
  • /etc/vmware-sso/hostname.txt

As noted above, this second hostname file was missing on the second node. Why is this? I guess that it is used transiently during script execution to inject the correct value into the server.xml file.
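
To compare the nodes quickly, a small loop like this (run over SSH on each appliance) prints each copy of hostname.txt or flags it as missing:

for f in /etc/vmware/service-state/vmidentity/hostname.txt \
         /etc/vmware-sso/keys/hostname.txt \
         /etc/vmware-sso/hostname.txt; do
    echo "== $f =="
    if [ -f "$f" ]; then cat "$f"; else echo "(missing)"; fi
done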

The server XML file is located in the folder:

/usr/lib/vmware-sso/vmware-sts/conf/server.xml

My faulty node contained the following certificate entries under the connector definition:

..store="STS_INTERNAL_SSL_CERT"
certificateKeystoreFile="STS_INTERNAL_SSL_CERT"..

My working node contained:

..store="MACHINE_SSL_CERT"
certificateKeystoreFile="MACHINE_SSL_CERT"..
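
A quick way to see which store a given node’s connector references, without opening the file in an editor, is a grep along these lines:

grep -o 'certificateKeystoreFile="[^"]*"' /usr/lib/vmware-sso/vmware-sts/conf/server.xml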

So I was able to simply copy the server.xml file from the working node (overwriting the original on the faulty node) and also remove the /etc/vmware-sso/keys/hostname.txt file to match the configuration.
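
As a rough sketch of that fix (the host name is from my lab; back up the original file before overwriting it):

# On the faulty node (hosso01): preserve the original, pull server.xml across from the
# working node, remove the stray hostname.txt, then reboot
cp /usr/lib/vmware-sso/vmware-sts/conf/server.xml /usr/lib/vmware-sso/vmware-sts/conf/server.xml.bak
scp root@hosso2.sbcpureconsult.internal:/usr/lib/vmware-sso/vmware-sts/conf/server.xml /usr/lib/vmware-sso/vmware-sts/conf/server.xml
rm /etc/vmware-sso/keys/hostname.txt
reboot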

Following a reboot my first SSO node then responded correctly by redirecting https://hosso01.sbcpureconsult.internal/psc to https://psc-ha-vip.sbcpureconsult.internal/websso to obtain its SAML token before ultimately displaying the PSC client UI.

As a follow-up, by examining the STS_INTERNAL_SSL_CERT store I could see that the machine certificate in use had been issued by the original Windows vCenter Server 5.5 SSO CA to the subject name:

ssoserver,dc=vsphere,dc=local

This store was not present on the other node, and so the correct load balancing certificate replacement must somehow be omitted by one of the upgrade scripts when this scenario occurs (5.5 SSO to 6.5 PSC).
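
If you want to check your own nodes, the VECS command-line tool on the appliance can list the stores present and dump the certificate subjects, for example (the vecs-cli path is as found on a 6.5 appliance):

/usr/lib/vmware-vmafd/bin/vecs-cli store list
/usr/lib/vmware-vmafd/bin/vecs-cli entry list --store STS_INTERNAL_SSL_CERT --text | grep 'Subject:'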

I hope that VMware fixes this bug in due course, particularly as more customers move to the appliance-based model of vCenter 6.x, but this workaround and method should at least be considered if you run into a similar problem.

NB This post is adapted from a longer discussion on the VMware Communities page available at https://communities.vmware.com/thread/598140.

Checking VMware Platform Services Controller 6.5 replication

Following installation of a second Platform Services Controller node in a site, how will you know whether replication is functioning correctly?

Assuming that you’ve got time to wait 30 seconds for each change to be replicated, you could first try creating a test user on each node within the vsphere.local domain to verify bidirectional communication (a quick sketch of this appears below). But if you prefer to be a little more scientific, or to repeat the process programmatically, you can follow a simple sequence of steps.
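
For that quick test-user check, a hedged sketch using dir-cli (which ships on the PSC appliance) might look like the following; the account name and passwords are arbitrary examples:

# On the first node: create a throw-away account in vsphere.local
/usr/lib/vmware-vmafd/bin/dir-cli user create --account repltest01 --first-name repl --last-name test --user-password 'VMware1!' --login administrator@vsphere.local

# Roughly 30 seconds later, on the second node: confirm the account has replicated
/usr/lib/vmware-vmafd/bin/dir-cli user find-by-name --account repltest01 --login administrator@vsphere.local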

The following article from VMware explains the process; however, it omits the period (.) character at the beginning of the Linux commands, so the steps can’t be followed verbatim.

https://kb.vmware.com/s/article/2127057

I’ve rewritten the steps that I generally follow below:

Log in to the PSC appliance over SSH as the root user.

Enter the following commands to change directory and execute the vdcrepadmin tool (bearing in mind that the administrator user here is from the single sign-on vsphere.local domain):

cd /usr/lib/vmware-vmdir/bin

./vdcrepadmin -f showservers -h hopsc01.xyz.company.com -u administrator -w password

This command lists out all of the PSC nodes which have joined the single-sign-on domain:

cn=hopsc01.xyz.company.com,cn=Servers,cn=HeadOffice,cn=Sites,cn=Configuration,dc=vsphere,dc=local
cn=hopsc02.xyz.company.com,cn=Servers,cn=HeadOffice,cn=Sites,cn=Configuration,dc=vsphere,dc=local

Repeat this step on the second (or additional) PSC nodes:
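
For example, pointing the -h parameter at the second node:

./vdcrepadmin -f showservers -h hopsc02.xyz.company.com -u administrator -w password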

cn=hopsc01.xyz.company.com,cn=Servers,cn=HeadOffice,cn=Sites,cn=Configuration,dc=vsphere,dc=local
cn=hopsc02.xyz.company.com,cn=Servers,cn=HeadOffice,cn=Sites,cn=Configuration,dc=vsphere,dc=local

Enter the following commands to display the replication partners for each node:

./vdcrepadmin -f showpartners -h hopsc01.xyz.company.com -u administrator -w password

ldap://HOPSC02.xyz.company.com

./vdcrepadmin -f showpartners -h hopsc02.xyz.company.com -u administrator -w password

ldap://hopsc01.xyz.company.com

Enter the following commands to display the replication status of each node with its counterpart replication partners:

./vdcrepadmin -f showpartnerstatus -h hopsc01.xyz.company.com -u administrator -w password

Partner: HOPSC02.xyz.company.com
Host available: Yes
Status available: Yes
My last change number: 4676
Partner has seen my change number: 4676
Partner is 0 changes behind.

./vdcrepadmin -f showpartnerstatus -h hopsc02.xyz.company.com -u administrator -w password

Partner: hopsc01.xyz.company.com
Host available: Yes
Status available: Yes
My last change number: 8986
Partner has seen my change number: 8986
Partner is 0 changes behind.

In these examples the change numbers (unique sequence numbers) are specific to the local host, and they will not necessarily match if the nodes were introduced to the site at different times. The important thing to watch is whether the replication partner reports any changes that have not yet been communicated, or whether the partner is unavailable.