Should the /psc URL work on both HA Platform Services nodes?

I recently ran into a strange issue following the enablement of two PSC 6.5 nodes in an HA configuration, as part of a larger rolling upgrade from vCenter 5.5.

NB – all URLs shown are internal, in use within my lab environment only.

During the migration of the existing customer’s vCenter environment we had to rehearse the externalisation of the PSC from an initial embedded SSO instance. As part of this process the first PSC node in a new site was migrated from the original Windows vCenter 5.5 SSO to PSC 6.5, and a second new node was subsequently joined to the first site so that replication could be established.

I used a Citrix NetScaler to load balance the configuration and noticed, at some point after the HA repointing had completed successfully, that I was unable to access the https://hosso01.sbcpureconsult.internal/psc URL.

The second node, https://hosso2.sbcpureconsult.internal/psc, worked correctly and redirected to the load balanced address psc-ha-vip.sbcpureconsult.internal for authentication before displaying the PSC client UI.

Irrespective of which node was selected, I was able to log in to vCenter, choose Administration, System Configuration, select a node and then open Manage, Settings or CA without receiving any errors.

If I deliberately dropped the first node out of the load balancing config on the NetScaler I didn’t have any issues when accessing the /psc URL by either host name or load balancer name, but if I tried to connect to the first node by its own DNS name or IP I received an HTTP 400 error and the following entry in:

/storage/log/vmware/psc-client/psc-client.log

[2018-10-08 12:05:20.347] [ERROR] tomcat-http--3 com.vmware.vsphere.client.security.websso.MetadataGeneratorImpl - Error when creating idp metadata.
java.lang.RuntimeException: java.io.IOException: HTTPS hostname wrong:  should be <psc-ha-vip.sbcpureconsult.internal>

It appeared that the HTTP 400 error occurred because the psc-client Tomcat application no longer started up correctly on the first node. There was also an error in:

/storage/log/vmware/rhttpproxy/rhttpproxy.log

2018-10-08T13:27:10.691Z warning rhttpproxy[7FEA4B941700] [Originator@6876 sub=Default] SSL Handshake failed for stream <SSL(<io_obj p:0x00007fea2c098010, h:27, <TCP '192.168.0.117:443'>, <TCP '192.168.0.121:26417'>>)>: N7Vmacore3Ssl12SSLExceptionE(SSL Exception: error:140000DB:SSL routines:SSL routines:short read)
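The "HTTPS hostname wrong" message suggests that a certificate was presented whose names did not include the load balanced FQDN. A minimal sketch of how that kind of name check could be reproduced offline is shown here. It assumes a certificate dictionary shaped like the one Python's ssl.SSLSocket.getpeercert() returns for a validated connection; the function names and the example certificate are hypothetical, and the matching is deliberately simplified rather than a copy of what the Java client does.

```python
from typing import Any, Dict, Set


def certificate_dns_names(cert: Dict[str, Any]) -> Set[str]:
    """Collect DNS names from a dict shaped like ssl.SSLSocket.getpeercert()."""
    names = {
        value.lower()
        for kind, value in cert.get("subjectAltName", ())
        if kind == "DNS"
    }
    # Fall back to the subject commonName only when no SAN DNS entries exist.
    if not names:
        for rdn in cert.get("subject", ()):
            for key, value in rdn:
                if key == "commonName":
                    names.add(value.lower())
    return names


def hostname_matches(cert: Dict[str, Any], hostname: str) -> bool:
    """Simplified check: exact match, or a single left-most wildcard label."""
    host = hostname.lower()
    for name in certificate_dns_names(cert):
        if name == host:
            return True
        if name.startswith("*.") and "." in host and host.split(".", 1)[1] == name[2:]:
            return True
    return False


# Hypothetical certificate that only lists the load balanced name.
EXAMPLE_CERT = {
    "subject": ((("commonName", "psc-ha-vip.sbcpureconsult.internal"),),),
    "subjectAltName": (("DNS", "psc-ha-vip.sbcpureconsult.internal"),),
}

if __name__ == "__main__":
    for host in ("psc-ha-vip.sbcpureconsult.internal",
                 "hosso01.sbcpureconsult.internal"):
        print(host, hostname_matches(EXAMPLE_CERT, host))
```

Running the same check against the certificate presented by each node, and by the load balancer address, shows quickly which endpoint is presenting an unexpected name.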

I repeated in my lab environment the same series of steps that I had carried out on the customer site, and was able to confirm the same behaviour. I should explain at this point that all other vCenter functionality was working correctly; the issue only affected the /psc URL.

Could this be deemed ‘correct’ behaviour?

If I chose https://psc-ha-vip.sbcpureconsult.internal/psc (which is the load balancer address) I was initially only able to connect if the second node was online and happened to be selected.

Before signing off on the work I wanted to confirm whether it should be possible to access the /psc URL on each node individually.

After what seemed like a lot of internal dialogue between myself and my inner tech support dept. (sleepless nights!) I was left wondering what could be going wrong, especially as this was the documented procedure from VMware.

Good news, I was able to roll back my lab and re-run the updateSSOConfig.py and UpdateLsEndpoint.py scripts – only to find that the /psc URL did indeed load successfully on both nodes with the NetScaler load balancing in place!

So at least I knew that the correct behaviour is that you should be able to open /psc on both appliances.

By examining my snapshots at different stages I was able to identify a difference between the original migration node and the clean appliance:

When you run the updateSSOConfig.py Python script to repoint the SSO URL to the load balanced address, it reports that hostname.txt and server.xml were modified:

# python updateSSOConfig.py --lb-fqdn=psc-ha-vip.sbcpureconsult.internal
script version:1.1.0
executing vmafd-cli command
Modifying hostname.txt
modifying server.xml
Executing StopService --all
Executing StartService --all

I was able to locate hostname.txt files (containing the load balancer address) in:

  • /etc/vmware/service-state/vmidentity/hostname.txt
  • /etc/vmware-sso/keys/hostname.txt (missing on node 2, but contained the local name on node 1)
  • /etc/vmware-sso/hostname.txt

The second of these files was missing on the second node. Why is this? My guess is that it is used transiently during the script execution in order to inject the correct value into the server.xml file.

The server.xml file is located at:

/usr/lib/vmware-sso/vmware-sts/conf/server.xml

My faulty node contained the following certificate entries under the connector definition:

..store="STS_INTERNAL_SSL_CERT"
certificateKeystoreFile="STS_INTERNAL_SSL_CERT"..

My working node contained:

..store="MACHINE_SSL_CERT"
certificateKeystoreFile="MACHINE_SSL_CERT"..

So I was able to simply copy the server.xml file from the working node (overwriting the original on the faulty node) and also remove the /etc/vmware-sso/keys/hostname.txt file to match the configuration.
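Before overwriting anything it can help to confirm exactly which keystore references differ between the two nodes. The following is a minimal sketch, assuming you have copied the text of each node's server.xml somewhere you can read it. It uses a regular expression rather than an XML parser because only the two attribute names shown above are of interest; the function names and the wrapper element in the sample fragments are hypothetical.

```python
import re
from typing import Dict, List, Tuple

# The two attribute names that differed between my nodes.
ATTRIBUTES = ("store", "certificateKeystoreFile")


def keystore_attributes(server_xml: str) -> Dict[str, List[str]]:
    """Return every value found for each attribute of interest."""
    found: Dict[str, List[str]] = {}
    for attribute in ATTRIBUTES:
        pattern = r"\b" + re.escape(attribute) + r'="([^"]*)"'
        found[attribute] = re.findall(pattern, server_xml)
    return found


def differing_attributes(
    xml_a: str, xml_b: str
) -> Dict[str, Tuple[List[str], List[str]]]:
    """Return only the attributes whose values differ between two files."""
    a, b = keystore_attributes(xml_a), keystore_attributes(xml_b)
    return {name: (a[name], b[name]) for name in ATTRIBUTES if a[name] != b[name]}


# Sample fragments; the element name "Example" is a placeholder, not the real one.
FAULTY = '<Example store="STS_INTERNAL_SSL_CERT" certificateKeystoreFile="STS_INTERNAL_SSL_CERT" />'
WORKING = '<Example store="MACHINE_SSL_CERT" certificateKeystoreFile="MACHINE_SSL_CERT" />'

if __name__ == "__main__":
    for name, (faulty, working) in differing_attributes(FAULTY, WORKING).items():
        print(f"{name}: faulty={faulty} working={working}")
```

An empty result means the two files agree on these attributes, in which case the cause probably lies elsewhere.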

Following a reboot my first SSO node then responded correctly by redirecting https://hosso01.sbcpureconsult.internal/psc to https://psc-ha-vip.sbcpureconsult.internal/websso to obtain its SAML token before ultimately displaying the PSC client UI.

As a follow up, by examining the STS_INTERNAL_SSL_CERT store I could see that the machine certificate being used was issued by the original Windows vCenter Server 5.5 SSO CA to the subject name:

ssoserver,dc=vsphere,dc=local

This store was not present on the other node, so it appears that the certificate replacement needed for load balancing is omitted by one of the upgrade scripts when this scenario occurs (5.5 SSO to 6.5 PSC).

I hope that this bug gets fixed by VMware in due course, particularly as more customers are moving to the appliance based model of vCenter 6.x, but this workaround is at least worth considering if you run into a similar problem.

NB This post is adapted from a longer discussion on the VMware Communities page, available at https://communities.vmware.com/thread/598140.

Checking VMware Platform Services Controller 6.5 replication

Following installation of a second Platform Services Controller node in a site how will you know if replication is functioning correctly?

Assuming that you’ve got time to wait 30 seconds for each change to be replicated you could first try creating a test user on each node within the vsphere.local domain to verify bidirectional communication. But if you prefer to be a little more scientific or repeat the process programmatically you can follow a simple sequence of steps.

The following article from VMware explains the process; however, it omits the period (.) character at the beginning of the Linux commands, so the steps can’t be followed verbatim.

https://kb.vmware.com/s/article/2127057

I’ve rewritten the steps that I generally follow below:

Log in to the PSC appliance over SSH as the root user

Enter the following commands to change directory and execute the vdcrepadmin tool (bearing in mind here that the administrator user is from the single-sign-on vsphere.local domain)

cd /usr/lib/vmware-vmdir/bin

./vdcrepadmin -f showservers -h hopsc01.xyz.company.com -u administrator -w password

This command lists out all of the PSC nodes which have joined the single-sign-on domain:

cn=hopsc01.xyz.company.com,cn=Servers,cn=HeadOffice,cn=Sites,cn=Configuration,dc=vsphere,dc=local
cn=hopsc02.xyz.company.com,cn=Servers,cn=HeadOffice,cn=Sites,cn=Configuration,dc=vsphere,dc=local

Repeat this step on the second (or additional) PSC nodes:

cn=hopsc01.xyz.company.com,cn=Servers,cn=HeadOffice,cn=Sites,cn=Configuration,dc=vsphere,dc=local
cn=hopsc02.xyz.company.com,cn=Servers,cn=HeadOffice,cn=Sites,cn=Configuration,dc=vsphere,dc=local
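If you want to compare the two listings programmatically rather than by eye, a short sketch along the following lines could be used. It assumes the output format shown above (one distinguished name per line), and the function names are my own.

```python
from typing import List, Tuple


def parse_server_dn(dn: str) -> Tuple[str, str]:
    """Return (host, site) from one DN line in the showservers listing."""
    parts = [part.strip() for part in dn.strip().split(",")]
    # Expected layout: cn=<host>, cn=Servers, cn=<site>, cn=Sites, ...
    if (
        len(parts) < 4
        or parts[1].lower() != "cn=servers"
        or parts[3].lower() != "cn=sites"
    ):
        raise ValueError(f"unexpected DN layout: {dn!r}")
    host = parts[0].split("=", 1)[1].lower()
    site = parts[2].split("=", 1)[1]
    return host, site


def parse_server_list(output: str) -> List[Tuple[str, str]]:
    """Parse a whole listing into a sorted list of (host, site) pairs."""
    return sorted(parse_server_dn(line) for line in output.splitlines() if line.strip())


def same_server_list(output_a: str, output_b: str) -> bool:
    """True when two nodes report the same set of servers and sites."""
    return parse_server_list(output_a) == parse_server_list(output_b)


SAMPLE = """\
cn=hopsc01.xyz.company.com,cn=Servers,cn=HeadOffice,cn=Sites,cn=Configuration,dc=vsphere,dc=local
cn=hopsc02.xyz.company.com,cn=Servers,cn=HeadOffice,cn=Sites,cn=Configuration,dc=vsphere,dc=local
"""

if __name__ == "__main__":
    for host, site in parse_server_list(SAMPLE):
        print(f"{host} is in site {site}")
```

If the listings from the two nodes differ, one node has not yet learned about a server that the other knows about.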

Enter the following commands to display the replication partners for each node:

./vdcrepadmin -f showpartners -h hopsc01.xyz.company.com -u administrator -w password

ldap://HOPSC02.xyz.company.com

./vdcrepadmin -f showpartners -h hopsc02.xyz.company.com -u administrator -w password

ldap://hopsc01.xyz.company.com

Enter the following commands to display the replication status of each node with its counterpart replication partners:

./vdcrepadmin -f showpartnerstatus -h hopsc01.xyz.company.com -u administrator -w password

Partner: HOPSC02.xyz.company.com
Host available: Yes
Status available: Yes
My last change number: 4676
Partner has seen my change number: 4676
Partner is 0 changes behind.

./vdcrepadmin -f showpartnerstatus -h hopsc02.xyz.company.com -u administrator -w password

Partner: hopsc01.xyz.company.com
Host available: Yes
Status available: Yes
My last change number: 8986
Partner has seen my change number: 8986
Partner is 0 changes behind.

In these examples the change numbers (unique sequence numbers) are specific to each host, so they will not necessarily match between nodes, for instance if the nodes were introduced to the site at different times. The important things to check are whether the replication partner is shown as being any changes behind, and whether the partner is unavailable.
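To repeat this check programmatically, the showpartnerstatus output can be parsed and any partner that is unavailable or behind can be flagged. This is a minimal sketch that assumes the output format shown above; the function names are my own and the lagging example is invented for illustration.

```python
import re
from typing import Dict, List, Union

Status = Dict[str, Union[str, bool, int]]


def _is_yes(line: str) -> bool:
    """Interpret a 'Label: Yes' style line as a boolean."""
    return line.split(":", 1)[1].strip().lower() == "yes"


def parse_partner_status(output: str) -> List[Status]:
    """Turn showpartnerstatus text into one dict per replication partner."""
    partners: List[Status] = []
    for raw in output.splitlines():
        line = raw.strip()
        if line.startswith("Partner:"):
            partners.append({"partner": line.split(":", 1)[1].strip()})
        elif not partners:
            continue  # ignore anything before the first partner block
        elif line.startswith("Host available:"):
            partners[-1]["host_available"] = _is_yes(line)
        elif line.startswith("Status available:"):
            partners[-1]["status_available"] = _is_yes(line)
        else:
            match = re.match(r"Partner is (\d+) changes? behind", line)
            if match:
                partners[-1]["changes_behind"] = int(match.group(1))
    return partners


def partners_needing_attention(partners: List[Status]) -> List[str]:
    """Names of partners that are unavailable, lagging, or missing data."""
    return [
        str(p["partner"])
        for p in partners
        if not p.get("host_available")
        or not p.get("status_available")
        or p.get("changes_behind") != 0
    ]


SAMPLE = """\
Partner: HOPSC02.xyz.company.com
Host available: Yes
Status available: Yes
My last change number: 4676
Partner has seen my change number: 4676
Partner is 0 changes behind.
"""

if __name__ == "__main__":
    print(partners_needing_attention(parse_partner_status(SAMPLE)))
    # Invented example of a partner that has fallen behind.
    lagging = SAMPLE.replace("Partner is 0 changes", "Partner is 12 changes")
    print(partners_needing_attention(parse_partner_status(lagging)))
```

A partner whose block lacks the "changes behind" line is treated as needing attention, on the basis that missing data is worth a manual look.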

Repointing vCenter Server to external PSC on load balanced FQDN fails

I have been planning a migration project for a customer for a while which involves moving from an embedded SSO instance on vCenter 5.5 to an external Platform Services Controller instance on 6.5. Suffice to say, plenty of ‘how to’ guides exist alongside the documentation from VMware; however, there is generally only a scant outline of what steps to take when repointing your vCenter to the new load balanced PSC virtual IP. The topic of this post is what happens when you follow the available load balancing documentation and your VMware Update Manager service fails to start afterwards.

I’ll include the reference articles up front, in case these are the ones which you might also have referred to:

Reference articles:

Configuring HA PSC load balancing on Citrix NetScaler – VMware KB article

Repoint vCenter Server to Another External Platform Services Controller in the Same Domain – VMware KB article

The repoint command:

At the step where you are reminded to repoint your vCenter instances at the new load balanced VIP address you’ll need to use the command:

cmsso-util repoint --repoint-psc psc-ha-vip.sbcpureconsult.internal

However, if you’ve followed the steps precisely, you’re likely to run into the following output when the repoint script attempts to restart the Update Manager service:

What happens:

Validating Provided Configuration …
Validation Completed Successfully.
Executing repointing steps. This will take few minutes to complete.
Please wait …
Stopping all the services …
All services stopped.
Starting all the services …

[… truncated …]

Stderr = Service-control failed. Error Failed to start vmon services.vmon-cli RC=2, stderr=Failed to start updatemgr services. Error: Service crashed while starting

Failed to start all the services. Error {
“resolution”: null,
“detail”: [
{
“args”: [
“Stderr: Service-control failed. Error Failed to start vmon services.vmon-cli RC=2, stderr=Failed to start updatemgr services. Error: Service crashed while starting\n\n”
],
“id”: “install.ciscommon.command.errinvoke”,
“localized”: “An error occurred while invoking external command : ‘Stderr: Service-control failed. Error Failed to start vmon services.vmon-cli RC=2, stderr=Failed to start updatemgr services. Error: Service crashed while starting\n\n'”,
“translatable”: “An error occurred while invoking external command : ‘%(0)s'”
}
],
“componentKey”: null,
“problemId”: null
}

Following this issue you might reboot or attempt to start all services directly on the vCenter appliance afterwards and receive:

service-control --start --all

Service-control failed. Error Failed to start vmon services.vmon-cli RC=2, stderr=Failed to start updatemgr services. Error: Service crashed while starting

This again is fairly unhelpful output and gives no indication of the cause of the issue. After much investigation, it turns out that the list of TCP port numbers given in the load balancing documentation is not complete, which causes the service startup to fail. Because we’re not running any other applications on the PSC hosts, it’s possible to simplify the configuration on the NetScaler by using wildcard port services for each server.
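One way to spot a port that the load balancer is not forwarding is to compare what answers on a PSC node directly with what answers through the VIP. The sketch below is a rough illustration only: the port list is the one used in the per-port configuration at the end of this post and may not be exhaustive for your environment, the function names are hypothetical, and a plain TCP connect says nothing about whether the service behind the port is healthy.

```python
import socket
from typing import Dict, Iterable, List

# Ports taken from the per-port NetScaler configuration at the end of this post.
PSC_PORTS = (80, 88, 389, 443, 514, 636, 1514, 2012, 2014, 2015, 2020, 5480, 7444)


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def scan(host: str, ports: Iterable[int] = PSC_PORTS, timeout: float = 2.0) -> Dict[int, bool]:
    """Map each port to whether it accepted a connection."""
    return {port: port_is_open(host, port, timeout) for port in ports}


def missing_via_vip(direct: Dict[int, bool], via_vip: Dict[int, bool]) -> List[int]:
    """Ports that answer on the node itself but not through the load balancer."""
    return sorted(
        port for port, is_open in direct.items()
        if is_open and not via_vip.get(port, False)
    )


if __name__ == "__main__":
    # Host names from my lab; substitute your own node and VIP names.
    direct = scan("hosso01.sbcpureconsult.internal")
    via_vip = scan("psc-ha-vip.sbcpureconsult.internal")
    print("Open on the node but not via the VIP:", missing_via_vip(direct, via_vip))
```

Any port listed in the result is a candidate for adding to the load balancer configuration.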

NetScaler configuration commands (specific to PSC load balancing):

The following alternative configuration ensures that any PSC service requested by your vCenter Server (or other solutions) will remain persistently connected on a ‘per host’ basis for up to 1440 minutes, which is the default lifetime of a vCenter Web Client session. This differs from VMware’s documented approach, which load balances each service individually but evidently misses at least one crucial port.

add server hosso01.sbcpureconsult.internal 192.168.0.117
add server hosso02.sbcpureconsult.internal 192.168.0.116

add service hosso01.sbcpureconsult.internal_TCP_ANY hosso01.sbcpureconsult.internal TCP * -gslb NONE -maxClient 0 -maxReq 0 -cip DISABLED -usip NO -useproxyport YES -sp OFF -cltTimeout 9000 -svrTimeout 9000 -CKA NO -TCPB NO -CMP NO

add service hosso02.sbcpureconsult.internal_TCP_ANY hosso02.sbcpureconsult.internal TCP * -gslb NONE -maxClient 0 -maxReq 0 -cip DISABLED -usip NO -useproxyport YES -sp OFF -cltTimeout 9000 -svrTimeout 9000 -CKA NO -TCPB NO -CMP NO

add lb vserver lb_hosso01_02_TCP_ANY TCP 192.168.0.122 * -persistenceType SOURCEIP -timeout 1440 -cltTimeout 9000

bind lb vserver lb_hosso01_02_TCP_ANY hosso01.sbcpureconsult.internal_TCP_ANY

bind lb vserver lb_hosso01_02_TCP_ANY hosso02.sbcpureconsult.internal_TCP_ANY

Once this configuration is put in place you’ll find that the vCenter Update Manager service will start correctly and your repoint will be successful.

Edit: Following the above configuration steps to get past the installation issue, I’ve since improved the list of ports that are load balanced by NetScaler to extend the list that VMware published for vCenter in their docs page. By enhancing the original series of ports I think we can resolve the initial issue without resorting to IP based wildcard load balancing.

I’ve included the full configuration below for reference:

add server hosso01.sbcpureconsult.internal 192.168.0.117
add server hosso02.sbcpureconsult.internal 192.168.0.116
add service hosso01_TCP80 hosso01.sbcpureconsult.internal TCP 80 -gslb NONE -maxClient 0 -maxReq 0 -cip DISABLED -usip NO -useproxyport YES -sp OFF -cltTimeout 9000 -svrTimeout 9000 -CKA NO -TCPB NO -CMP NO
add service hosso01_TCP88 hosso01.sbcpureconsult.internal TCP 88 -gslb NONE -maxClient 0 -maxReq 0 -cip DISABLED -usip NO -useproxyport YES -sp OFF -cltTimeout 9000 -svrTimeout 9000 -CKA NO -TCPB NO -CMP NO
add service hosso01_TCP389 hosso01.sbcpureconsult.internal TCP 389 -gslb NONE -maxClient 0 -maxReq 0 -cip DISABLED -usip NO -useproxyport YES -sp OFF -cltTimeout 9000 -svrTimeout 9000 -CKA NO -TCPB NO -CMP NO
add service hosso01_TCP443 hosso01.sbcpureconsult.internal TCP 443 -gslb NONE -maxClient 0 -maxReq 0 -cip DISABLED -usip NO -useproxyport YES -sp OFF -cltTimeout 9000 -svrTimeout 9000 -CKA NO -TCPB NO -CMP NO
add service hosso01_TCP514 hosso01.sbcpureconsult.internal TCP 514 -gslb NONE -maxClient 0 -maxReq 0 -cip DISABLED -usip NO -useproxyport YES -sp OFF -cltTimeout 9000 -svrTimeout 9000 -CKA NO -TCPB NO -CMP NO
add service hosso01_TCP636 hosso01.sbcpureconsult.internal TCP 636 -gslb NONE -maxClient 0 -maxReq 0 -cip DISABLED -usip NO -useproxyport YES -sp OFF -cltTimeout 9000 -svrTimeout 9000 -CKA NO -TCPB NO -CMP NO
add service hosso01_TCP1514 hosso01.sbcpureconsult.internal TCP 1514 -gslb NONE -maxClient 0 -maxReq 0 -cip DISABLED -usip NO -useproxyport YES -sp OFF -cltTimeout 9000 -svrTimeout 9000 -CKA NO -TCPB NO -CMP NO
add service hosso01_TCP2012 hosso01.sbcpureconsult.internal TCP 2012 -gslb NONE -maxClient 0 -maxReq 0 -cip DISABLED -usip NO -useproxyport YES -sp OFF -cltTimeout 9000 -svrTimeout 9000 -CKA NO -TCPB NO -CMP NO
add service hosso01_TCP2014 hosso01.sbcpureconsult.internal TCP 2014 -gslb NONE -maxClient 0 -maxReq 0 -cip DISABLED -usip NO -useproxyport YES -sp OFF -cltTimeout 9000 -svrTimeout 9000 -CKA NO -TCPB NO -CMP NO
add service hosso01_TCP2015 hosso01.sbcpureconsult.internal TCP 2015 -gslb NONE -maxClient 0 -maxReq 0 -cip DISABLED -usip NO -useproxyport YES -sp OFF -cltTimeout 9000 -svrTimeout 9000 -CKA NO -TCPB NO -CMP NO
add service hosso01_TCP2020 hosso01.sbcpureconsult.internal TCP 2020 -gslb NONE -maxClient 0 -maxReq 0 -cip DISABLED -usip NO -useproxyport YES -sp OFF -cltTimeout 9000 -svrTimeout 9000 -CKA NO -TCPB NO -CMP NO
add service hosso01_TCP5480 hosso01.sbcpureconsult.internal TCP 5480 -gslb NONE -maxClient 0 -maxReq 0 -cip DISABLED -usip NO -useproxyport YES -sp OFF -cltTimeout 9000 -svrTimeout 9000 -CKA NO -TCPB NO -CMP NO
add service hosso01_TCP7444 hosso01.sbcpureconsult.internal TCP 7444 -gslb NONE -maxClient 0 -maxReq 0 -cip DISABLED -usip NO -useproxyport YES -sp OFF -cltTimeout 9000 -svrTimeout 9000 -CKA NO -TCPB NO -CMP NO
add service hosso02_TCP80 hosso02.sbcpureconsult.internal TCP 80 -gslb NONE -maxClient 0 -maxReq 0 -cip DISABLED -usip NO -useproxyport YES -sp OFF -cltTimeout 9000 -svrTimeout 9000 -CKA NO -TCPB NO -CMP NO
add service hosso02_TCP88 hosso02.sbcpureconsult.internal TCP 88 -gslb NONE -maxClient 0 -maxReq 0 -cip DISABLED -usip NO -useproxyport YES -sp OFF -cltTimeout 9000 -svrTimeout 9000 -CKA NO -TCPB NO -CMP NO
add service hosso02_TCP389 hosso02.sbcpureconsult.internal TCP 389 -gslb NONE -maxClient 0 -maxReq 0 -cip DISABLED -usip NO -useproxyport YES -sp OFF -cltTimeout 9000 -svrTimeout 9000 -CKA NO -TCPB NO -CMP NO
add service hosso02_TCP443 hosso02.sbcpureconsult.internal TCP 443 -gslb NONE -maxClient 0 -maxReq 0 -cip DISABLED -usip NO -useproxyport YES -sp OFF -cltTimeout 9000 -svrTimeout 9000 -CKA NO -TCPB NO -CMP NO
add service hosso02_TCP514 hosso02.sbcpureconsult.internal TCP 514 -gslb NONE -maxClient 0 -maxReq 0 -cip DISABLED -usip NO -useproxyport YES -sp OFF -cltTimeout 9000 -svrTimeout 9000 -CKA NO -TCPB NO -CMP NO
add service hosso02_TCP636 hosso02.sbcpureconsult.internal TCP 636 -gslb NONE -maxClient 0 -maxReq 0 -cip DISABLED -usip NO -useproxyport YES -sp OFF -cltTimeout 9000 -svrTimeout 9000 -CKA NO -TCPB NO -CMP NO
add service hosso02_TCP1514 hosso02.sbcpureconsult.internal TCP 1514 -gslb NONE -maxClient 0 -maxReq 0 -cip DISABLED -usip NO -useproxyport YES -sp OFF -cltTimeout 9000 -svrTimeout 9000 -CKA NO -TCPB NO -CMP NO
add service hosso02_TCP2012 hosso02.sbcpureconsult.internal TCP 2012 -gslb NONE -maxClient 0 -maxReq 0 -cip DISABLED -usip NO -useproxyport YES -sp OFF -cltTimeout 9000 -svrTimeout 9000 -CKA NO -TCPB NO -CMP NO
add service hosso02_TCP2014 hosso02.sbcpureconsult.internal TCP 2014 -gslb NONE -maxClient 0 -maxReq 0 -cip DISABLED -usip NO -useproxyport YES -sp OFF -cltTimeout 9000 -svrTimeout 9000 -CKA NO -TCPB NO -CMP NO
add service hosso02_TCP2015 hosso02.sbcpureconsult.internal TCP 2015 -gslb NONE -maxClient 0 -maxReq 0 -cip DISABLED -usip NO -useproxyport YES -sp OFF -cltTimeout 9000 -svrTimeout 9000 -CKA NO -TCPB NO -CMP NO
add service hosso02_TCP2020 hosso02.sbcpureconsult.internal TCP 2020 -gslb NONE -maxClient 0 -maxReq 0 -cip DISABLED -usip NO -useproxyport YES -sp OFF -cltTimeout 9000 -svrTimeout 9000 -CKA NO -TCPB NO -CMP NO
add service hosso02_TCP5480 hosso02.sbcpureconsult.internal TCP 5480 -gslb NONE -maxClient 0 -maxReq 0 -cip DISABLED -usip NO -useproxyport YES -sp OFF -cltTimeout 9000 -svrTimeout 9000 -CKA NO -TCPB NO -CMP NO
add service hosso02_TCP7444 hosso02.sbcpureconsult.internal TCP 7444 -gslb NONE -maxClient 0 -maxReq 0 -cip DISABLED -usip NO -useproxyport YES -sp OFF -cltTimeout 9000 -svrTimeout 9000 -CKA NO -TCPB NO -CMP NO
add lb vserver lb_hosso01_02_80 TCP 192.168.0.122 80 -timeout 1440 -cltTimeout 9000
add lb vserver lb_hosso01_02_88 TCP 192.168.0.122 88 -timeout 1440 -cltTimeout 9000
add lb vserver lb_hosso01_02_389 TCP 192.168.0.122 389 -timeout 1440 -cltTimeout 9000
add lb vserver lb_hosso01_02_443 TCP 192.168.0.122 443 -timeout 1440 -cltTimeout 9000
add lb vserver lb_hosso01_02_514 TCP 192.168.0.122 514 -timeout 1440 -cltTimeout 9000
add lb vserver lb_hosso01_02_636 TCP 192.168.0.122 636 -timeout 1440 -cltTimeout 9000
add lb vserver lb_hosso01_02_1514 TCP 192.168.0.122 1514 -timeout 1440 -cltTimeout 9000
add lb vserver lb_hosso01_02_2012 TCP 192.168.0.122 2012 -timeout 1440 -cltTimeout 9000
add lb vserver lb_hosso01_02_2014 TCP 192.168.0.122 2014 -timeout 1440 -cltTimeout 9000
add lb vserver lb_hosso01_02_2015 TCP 192.168.0.122 2015 -timeout 1440 -cltTimeout 9000
add lb vserver lb_hosso01_02_2020 TCP 192.168.0.122 2020 -timeout 1440 -cltTimeout 9000
add lb vserver lb_hosso01_02_5480 TCP 192.168.0.122 5480 -timeout 1440 -cltTimeout 9000
add lb vserver lb_hosso01_02_7444 TCP 192.168.0.122 7444 -timeout 1440 -cltTimeout 9000
bind lb vserver lb_hosso01_02_80 hosso01_TCP80
bind lb vserver lb_hosso01_02_80 hosso02_TCP80
bind lb vserver lb_hosso01_02_88 hosso01_TCP88
bind lb vserver lb_hosso01_02_88 hosso02_TCP88
bind lb vserver lb_hosso01_02_389 hosso01_TCP389
bind lb vserver lb_hosso01_02_389 hosso02_TCP389
bind lb vserver lb_hosso01_02_443 hosso01_TCP443
bind lb vserver lb_hosso01_02_443 hosso02_TCP443
bind lb vserver lb_hosso01_02_514 hosso01_TCP514
bind lb vserver lb_hosso01_02_514 hosso02_TCP514
bind lb vserver lb_hosso01_02_636 hosso01_TCP636
bind lb vserver lb_hosso01_02_636 hosso02_TCP636
bind lb vserver lb_hosso01_02_1514 hosso01_TCP1514
bind lb vserver lb_hosso01_02_1514 hosso02_TCP1514
bind lb vserver lb_hosso01_02_2012 hosso01_TCP2012
bind lb vserver lb_hosso01_02_2012 hosso02_TCP2012
bind lb vserver lb_hosso01_02_2014 hosso01_TCP2014
bind lb vserver lb_hosso01_02_2014 hosso02_TCP2014
bind lb vserver lb_hosso01_02_2015 hosso01_TCP2015
bind lb vserver lb_hosso01_02_2015 hosso02_TCP2015
bind lb vserver lb_hosso01_02_2020 hosso01_TCP2020
bind lb vserver lb_hosso01_02_2020 hosso02_TCP2020
bind lb vserver lb_hosso01_02_5480 hosso01_TCP5480
bind lb vserver lb_hosso01_02_5480 hosso02_TCP5480
bind lb vserver lb_hosso01_02_7444 hosso01_TCP7444
bind lb vserver lb_hosso01_02_7444 hosso02_TCP7444
add lb group pg_hosso_01_02 -persistenceType SOURCEIP -timeout 1440
bind lb group pg_hosso_01_02 lb_hosso01_02_80
bind lb group pg_hosso_01_02 lb_hosso01_02_88
bind lb group pg_hosso_01_02 lb_hosso01_02_389
bind lb group pg_hosso_01_02 lb_hosso01_02_443
bind lb group pg_hosso_01_02 lb_hosso01_02_514
bind lb group pg_hosso_01_02 lb_hosso01_02_636
bind lb group pg_hosso_01_02 lb_hosso01_02_1514
bind lb group pg_hosso_01_02 lb_hosso01_02_2012
bind lb group pg_hosso_01_02 lb_hosso01_02_2014
bind lb group pg_hosso_01_02 lb_hosso01_02_2015
bind lb group pg_hosso_01_02 lb_hosso01_02_2020
bind lb group pg_hosso_01_02 lb_hosso01_02_5480
bind lb group pg_hosso_01_02 lb_hosso01_02_7444
set lb group pg_hosso_01_02 -persistenceType SOURCEIP -timeout 1440
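Because the per-port configuration is so repetitive, it could equally be generated by a short script, which makes it easier to add or remove a port later. This is a minimal sketch that reproduces the command text listed above for my lab values; the function name and its parameters are hypothetical, and the output should be reviewed before it is pasted into a NetScaler.

```python
from typing import Dict, Iterable, List, Tuple

# Options copied verbatim from the "add service" commands above.
SERVICE_OPTIONS = (
    "-gslb NONE -maxClient 0 -maxReq 0 -cip DISABLED -usip NO "
    "-useproxyport YES -sp OFF -cltTimeout 9000 -svrTimeout 9000 "
    "-CKA NO -TCPB NO -CMP NO"
)

PORTS = (80, 88, 389, 443, 514, 636, 1514, 2012, 2014, 2015, 2020, 5480, 7444)

# Short name -> (FQDN, IP address), in the order the nodes should appear.
SERVERS: Dict[str, Tuple[str, str]] = {
    "hosso01": ("hosso01.sbcpureconsult.internal", "192.168.0.117"),
    "hosso02": ("hosso02.sbcpureconsult.internal", "192.168.0.116"),
}


def build_config(
    servers: Dict[str, Tuple[str, str]],
    vip: str,
    ports: Iterable[int],
    vserver_prefix: str,
    group: str,
) -> List[str]:
    """Build the NetScaler command lines in the same order as listed above."""
    ports = list(ports)
    lines = [f"add server {fqdn} {ip}" for fqdn, ip in servers.values()]
    for short, (fqdn, _ip) in servers.items():
        for port in ports:
            lines.append(
                f"add service {short}_TCP{port} {fqdn} TCP {port} {SERVICE_OPTIONS}"
            )
    for port in ports:
        lines.append(
            f"add lb vserver {vserver_prefix}_{port} TCP {vip} {port} "
            "-timeout 1440 -cltTimeout 9000"
        )
    for port in ports:
        for short in servers:
            lines.append(f"bind lb vserver {vserver_prefix}_{port} {short}_TCP{port}")
    lines.append(f"add lb group {group} -persistenceType SOURCEIP -timeout 1440")
    for port in ports:
        lines.append(f"bind lb group {group} {vserver_prefix}_{port}")
    lines.append(f"set lb group {group} -persistenceType SOURCEIP -timeout 1440")
    return lines


if __name__ == "__main__":
    print("\n".join(
        build_config(SERVERS, "192.168.0.122", PORTS, "lb_hosso01_02", "pg_hosso_01_02")
    ))
```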

Oracle licensing on hyper-converged platforms such as Nutanix, VSAN etc.

I recently posted on Michael Webster of Nutanix’ blog about Oracle licensing on VMware clusters and wanted to link back to it here as it’s something I’ve been involved with several times now.

With VMware vSphere 5.5 the vMotion boundary is defined by the individual datacenter object in vCenter, which means that you cannot move an individual VM between datacenters without exporting it, removing it from the inventory, and reimporting it somewhere else. This currently means that even if you deploy Oracle DB on an ESXi cluster with just two nodes, you could be required by Oracle to license all of the other CPU sockets in the datacenter!

This rule is due to Oracle’s stance that they do not support soft partitioning or any kind of host or CPU affinity rules. If a VM could run on a processor socket through some kind of administrative operation, then that socket should be licensed. This doesn’t seem fair, and VMware even suggest that this can be counteracted by simply defining host affinity rules, but let’s be clear: the final say has to be down to Oracle’s licensing agreement and not whether VMware thinks it should be acceptable.

http://www.vmware.com/files/pdf/techpaper/vmw-understanding-oracle-certification-supportlicensing-environments.pdf

So the only current solution is to build Oracle dedicated clusters with separate shared storage and separate vCenter instances consisting only of Oracle DB servers. This means that you are able to define exactly which CPU sockets should be licensed, in effect all those which make up part of one or more ESXi clusters within the vCenter datacenter object.

Now, with vSphere ESXi 6 there was a new feature introduced called long distance vMotion which facilitates being able to migrate a VM between cities, or even continents – even if they are managed by different vCenter instances. An excellent description of the new features can be found here.

This rather complicates the matter, since Oracle will now need to consider how this affects the ‘reach’ of any particular VM instance, which would now appear to be limited only by the scope of your single sign-on domain, rather than by how many hosts or clusters are defined within your datacenter. I will be interested to see how this develops and will certainly post back here if anything moves us further towards clarity on this subject.

Permalink to Michael’s original article

Optimising Oracle DB with VMware’s vFlash Read Cache feature

This post is slightly different from the ones I’ve usually made, simply because it is more notes based than editorial or comment; however, I hope that the simple steps and data captured here will be useful. In fact it’s taken me a while to get this data out, but even though it’s about a year old now the performance improvement should be even better with ESXi 6.x. In this test we were basically interested in evaluating whether VMware’s new Flash Read Cache (vFRC) feature released in ESXi 5.5 would benefit read heavy virtual workloads such as Oracle DB.

Test scenario:

Oracle 11g 11.2.0.1 DB with 4vCPU, 8,192MB RAM and 200GB Oracle ASM disk for database
HP DL380 G7 with 2 x Intel Xeon 5650 6C 2.67GHz CPU and 128GB RAM, locally attached 4 x 7.2K SAS RAID array
VMware ESXi 5.5 Enterprise Plus license with vFlash Read Cache capability.

Creating a baseline (before applying vFRC)

Using esxtop to establish typical baseline values:

Disk latency typical across measured virtual machines: 11.97ms

Correlation of baseline latency and commands per second values with vCenter Operations Manager:

High and low water disk latency – between 4 and 16ms (using 7.2K RPM drives in 4 disk RAID5 array).

Disk usage was negligible following VM boot and Oracle DB startup:

In order to set the vFlash Read Cache block size correctly we need to find out the typical write block size (so that small writes do not consume too large a cache block if it is set higher than the mean).

Using vscsiStats to measure the frequency of different sized I/O commands:

Highlighted frequency values (above) show that 4,096 byte I/Os were the most common across both write and read buckets, and therefore the overall number of operations peaked in the same window.

In order to establish the baseline Oracle performance an I/O calibration script was run several times.

Oracle DB I/O metrics calculation:

Max IOPs were found to lie between 576 and 608 per second using a 200GB VMDK located on the 4 disk RAID array.

The high water mark for disk latency rose to 28ms during the test, versus 12ms when the instance was idle, indicating contention on the spindles during read/write activity.

During the I/O calibration test the high water mark for disk throughput rose to 76,000 KBps, versus 3,450 KBps when the instance was idle. This shows that the array throughput max is around 74MB/s.
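The conversion behind the 74MB/s figure is simple enough to check, assuming that 1 MB is taken as 1,024 KB. The function name below is my own.

```python
def kbps_to_mbps(kilobytes_per_second: float) -> float:
    """Convert a throughput in KBps to MB/s, taking 1 MB as 1,024 KB."""
    return kilobytes_per_second / 1024


if __name__ == "__main__":
    # Peak and idle throughput figures measured during the calibration test.
    print(f"peak: {kbps_to_mbps(76_000):.1f} MB/s")
    print(f"idle: {kbps_to_mbps(3_450):.1f} MB/s")
```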

Having established that the majority of writes during the above test were in fact using an 8KB block size (not the 4KB shown in the screenshot, which was taken from a different test), the vFRC was enabled only on the 200GB ASM disk using an arbitrary 50GB reservation (25% of total disk size). No reboot was required; VMware inserts the cache in front of the disk storage transparently to the VM.

With Flash Read Cache enabled on 200GB ASM disk

After adding a locally attached 200GB SATA SSD disk to the ESXi server and claiming the storage for Flash Read Cache a 50GB vFRC cache was enabled on the Oracle ASM data disk within the guest OS configuration:

Once the vFRC function was enabled the Oracle I/O calibration script was run again, and surprisingly the first pass was considerably slower than previous runs (max IOPs 268). This is because each read from the SSD cache initially misses, as prior writes have not yet primed the cache. Because data is written to SSD before being committed to disk (write-through caching), it is continually added to the vFRC cache such that performance should improve over time:

Esxcli was used to view the resulting cache efficiency after running I/O calibration (showing 29% read hit rate via SSD cache vs reads from SAS disk):

In the example above, no blocks have been evicted from the cache yet, meaning that the 50GB cache assigned to this VMDK still offers room for growth. When all of the cache blocks are exhausted the ESXi storage stack will begin to remove older blocks in favour of storing more relevant up to date data.

The resulting I/O calibration performance is shown below, both before and after enabling the vFRC feature.

In brief conclusion, the vFlash Read Cache feature is an excellent way to add in-line SSD based read caching for specific virtual machines and volumes. You must enable the option on specific VMs only, and then track their usage and cache effectiveness over time in order to make sure that you have allocated neither too much nor too little cache. However, once the cache is primed with data there is a marked and positive improvement to the read throughput, and a much reduced number of IOPS needing to be dealt with by the physical storage array. For Oracle servers which are read biased this should significantly improve performance where non-SSD storage arrays are being utilised.