NSX-T Manager appliance high-CPU whilst idle

I run VMware NSX-T in a small lab environment based on Intel NUCs, but I’ve noticed recently that even when not being challenged e.g. following initial boot and being essentially idle, the Manager appliance suffers continual high CPU usage which leads eventually to an uncomfortably warm office.

Even though running the correct minimum virtual machine hardware for the appliance has been configured, i.e. 4 vCPU and 16GB RAM, it was regularly using ~4.5GHz of physical CPU.

Here’s a good example of an otherwise idle appliance showing 40% CPU usage.

~40% CPU on a 4 vCPU virtual appliance

After connecting over SSH as the ‘admin’ user and entering ‘get process monitor’ it’s quickly apparent from the top output that ‘rngd’ is responsible for the majority of the CPU utilisation:

‘get process monitor’ whilst logged in as NSX-T admin console user

But what is this? A quick search of more general Linux resources informs us that it is a random number generator used in ensuring sufficient ‘entropy’ is available during creation of certificates, SSH keys etc.

In order to discover more about the purpose of this daemon we can inspect the description of the installed version (5-0ubuntu4nn1) under the current Ubuntu 18.04.4 LTS release.

apt show rng-tools/now

Description: Daemon to use a Hardware TRNG
The rngd daemon acts as a bridge between a Hardware TRNG (true random number
generator) such as the ones in some Intel/AMD/VIA chipsets, and the kernel’s
PRNG (pseudo-random number generator).
.
It tests the data received from the TRNG using the FIPS 140-2 (2002-10-10)
tests to verify that it is indeed random, and feeds the random data to the
kernel entropy pool.
.
This increases the bandwidth of the /dev/random device, from a source that
does not depend on outside activity. It may also improve the quality
(entropy) of the randomness of /dev/random.
.
A TRNG kernel module such as hw_random, or some other source of true
entropy that is accessible as a device or fifo, is required to use this
package.
.
This is an unofficial version of rng-tools which has been extensively
modified to add multithreading and a lot of new functionality.

So we know that this is a helper daemon which improves the speed of providing near-truly random numbers when applications ask for them. What version do we currently have installed in the NSX-T 3.1.2 manager appliance?

apt search rng-tools
Sorting... Done
Full Text Search... Done
rng-tools/now 5-0ubuntu4nn1 amd64 [installed,local]
  Daemon to use a Hardware TRNG

This appears to be the latest available version. In order to examine the status of the rngd daemon itself, log in to the appliance console as the root user and use:

systemctl list-units rng-tools.service

The service is shown as running,

Name of rngd random number generator service
root@nsx-manager:~# systemctl status rng-tools.service
rng-tools.service
Loaded: loaded (/etc/init.d/rng-tools; enabled; vendor preset: enabled)
Active: active (running) since Tue 2021-10-05 09:31:59 UTC; 17min ago
Docs: man:systemd-sysv-generator(8)
Process: 886 ExecStart=/etc/init.d/rng-tools start (code=exited, status=0/SUCCESS)
Tasks: 1 (limit: 4915)
CGroup: /system.slice/rng-tools.service
`-934 /usr/sbin/rngd -r /dev/hwrng
Oct 05 09:31:59 nsx-manager systemd[1]: Starting rng-tools.serviceā€¦
Oct 05 09:31:59 nsx-manager rng-tools[886]: Starting Hardware RNG entropy gatherer daemon: /etc/init.d/rng-tools: assigning /dev/hwrng to access rdrand on cpu
Oct 05 09:31:59 nsx-manager rng-tools[886]: crw-rw-rw- 1 root root 1, 8 Oct 5 09:31 /dev/random
Oct 05 09:31:59 nsx-manager rng-tools[886]: rngd.
Oct 05 09:31:59 nsx-manager systemd[1]: Started rng-tools.service.

What else can you find out about what it is doing in the background?

rngd -v

Two instances of ‘read error’ are output, followed by two further entropy sources being the Intel/AMG hardware random number generator and AES digital random number generator (RNDG). The ‘read error’ issue appears to be normal behaviour as the package attempts to read sources which don’t exist. Both of the displayed sources indicate that the CPU instruction set includes the necessary flags to tell the VM that it can access hardware random number generation.

Verbose output from rngd daemon

I must say, at this point it’s not clear whether NSX-T requires this service to be running permanently or whether it’s a component which Linux uses as a background service in order only to optimise the generation of a random number feed. It seems that stopping the service does appear to eventually cause problems in my lab – so please attempt the next section with CAUTION.

systemctl stop rng-tools.service

This leads to a significant reduction in CPU consumption and running temperature of my ESXi nodes.

CPU usage decreases after stopping rngd service

It may also be possible to disable the service permanently, but since I don’t have a full explanation of the purpose of this service from an NSX-T point of view I would stop short currently from doing this.

systemctl disable rng-tools.service

In the meantime I am hoping that I can get someone within the NSX-T development team to investigate these findings and provide some more permanent kind of workaround.

Further investigation

Further reading around the subject led me to find an issue has been reported on certain CPUs leading to activity spikes, https://github.com/nhorman/rng-tools/issues/136 and newer versions promise to fix this problem. The article mentioned suggests adding the -x jitter option to the start command but this is not available in the version installed in NSX-T.

RNGD_OPTS="-x jitter -r /dev/hwrng"

You can locate and edit the startup parameters by altering the service definition:

vi /etc/init.d/rng-tools

and potentially altering the default kernel values which are referenced by:

RNGDOPTIONS=
[ -r /etc/default/rng-tools ] && . /etc/default/rng-tools

Edit using:

vi /etc/default/rng-tools

However until the version of rng-tools used in NSX-T is updated to resolve this apparent issue it remains a personal choice as to whether or not the service can be stopped intermittently when a lab environment is not needed.


https://en.wikipedia.org/wiki/RDRAND
https://www.exoscale.com/syslog/random-numbers-generation-in-virtual-machines/

Citrix Gateway 13.0 Registry value EPA scan examples

If you’re having trouble with getting Citrix Endpoint Analysis scans of client device registry values to work properly (on Citrix Gateway) you may come across the following issue I experienced in the latest versions of firmware.

It appears that the EPA scan functionality in the NS 13.0 GUI (this article relates to 13.0.82.45) has been merged so that the numeric/non-numeric registry scan types now coalesce into one type of scan: REG_PATH; whereas in previous versions string values were interpreted using REG_NON_NUM_PATH.

Here’s a screen shot of the new expression editor drop down for Windows client EPA scans

NS13.0.82.45 drop down for Windows EPA scans

In comparison to the previous version (NS13.0.71.44).

NS13.0.71.44 drop down for Windows EPA scans

Here’s a screenshot of the registry scan entry panel where you can enter registry path and value, plus comparison or presence operators. Note the tooltip box which says that numeric comparisons will be done when using <,>,== etc.

NS13.0 registry scan value/comparison entry GUI

The convergence of these two types of scan into one appears to hide a reduction in comparison functionality, which only emerges once you attempt to use a string based registry value comparison using REG_PATH. You cannot use == anymore with string values such as REG_SZ.

This is a quick summary of the new behaviour following my own testing:

Numeric comparisons

Scans based upon REG_DWORD, REG_QWORD, REG_BINARY values will only work when carrying out boolean comparisons on numeric values with operators such as ==, !=, >=

e.g.

sys.client_expr("sys_0_REG_PATH_==HKEY\\_LOCAL\\_MACHINE\\\\SOFTWARE\\\\Classes\\\\YourRegistryKeyLocation\\\\YourRegistryValueName_VALUE==_12345[COMMENT: Registry]")

will result in a successful scan when YourRegistryValueName == 12345.

String comparisons

However when using the newly merged functionality, scans based upon REG_SZ values will only work when carrying out comparisons on string values using operators such as ‘contains’, ‘notcontains’.

If you try to use == as the operator on a string comparison the EPA scan logs will result in:

2021-09-28 09:25:38.883 Boolean compare failed. Value false operator ==
2021-09-28 09:25:38.883 Scan 'REG_PATH_==_HKEY\_LOCAL\_MACHINE\\SOFTWARE\\Classes\\YourRegistryKeyLocation\\YourRegistryValueName_VALUE_==_12345' failed for method 'VALUE'

Therefore modify your EPA action expression to fit the following example using ‘contains’:

sys.client_expr("sys_0_REG_PATH_==_HKEY\\\\_LOCAL\\\\_MACHINE\\\\\\\\SOFTWARE\\\\\\\\Classes\\\\\\\\YourRegistryKeyLocation\\\\\\\\YourRegistryValueName_VALUE_contains_12345[COMMENT: Registry]")

There are several other comparisons which do not appear to work properly, e.g. a numeric registry comparison of a REG_QWORD value which is longer than that allowed by the Citrix EPA plugin BUT is allowed within the 64 bytes of the Windows Registry value.

So my advice would be to consider whether the version of Citrix ADC you’re currently using actually offers the type of scan which you’re intending to use (REG_NON_NUM_PATH, REG_PATH), and NOT to rely upon documented examples without determining if the operator matches the value type correctly.

Further reading

https://support.citrix.com/article/CTX209148 – How to enable client EPA logging/troubleshooting

https://docs.citrix.com/en-us/citrix-gateway/current-release/vpn-user-config/advanced-endpoint-analysis-policies/advanced-endpoint-analysis-policy-expression-reference.html