NSX-T Manager appliance high-CPU whilst idle

I run VMware NSX-T in a small lab environment based on Intel NUCs, but I’ve noticed recently that even when not being challenged e.g. following initial boot and being essentially idle, the Manager appliance suffers continual high CPU usage which leads eventually to an uncomfortably warm office.

Even though running the correct minimum virtual machine hardware for the appliance has been configured, i.e. 4 vCPU and 16GB RAM, it was regularly using ~4.5GHz of physical CPU.

Here’s a good example of an otherwise idle appliance showing 40% CPU usage.

~40% CPU on a 4 vCPU virtual appliance

After connecting over SSH as the ‘admin’ user and entering ‘get process monitor’ it’s quickly apparent from the top output that ‘rngd’ is responsible for the majority of the CPU utilisation:

‘get process monitor’ whilst logged in as NSX-T admin console user

But what is this? A quick search of more general Linux resources informs us that it is a random number generator used in ensuring sufficient ‘entropy’ is available during creation of certificates, SSH keys etc.

In order to discover more about the purpose of this daemon we can inspect the description of the installed version (5-0ubuntu4nn1) under the current Ubuntu 18.04.4 LTS release.

apt show rng-tools/now

Description: Daemon to use a Hardware TRNG
The rngd daemon acts as a bridge between a Hardware TRNG (true random number
generator) such as the ones in some Intel/AMD/VIA chipsets, and the kernel’s
PRNG (pseudo-random number generator).
.
It tests the data received from the TRNG using the FIPS 140-2 (2002-10-10)
tests to verify that it is indeed random, and feeds the random data to the
kernel entropy pool.
.
This increases the bandwidth of the /dev/random device, from a source that
does not depend on outside activity. It may also improve the quality
(entropy) of the randomness of /dev/random.
.
A TRNG kernel module such as hw_random, or some other source of true
entropy that is accessible as a device or fifo, is required to use this
package.
.
This is an unofficial version of rng-tools which has been extensively
modified to add multithreading and a lot of new functionality.

So we know that this is a helper daemon which improves the speed of providing near-truly random numbers when applications ask for them. What version do we currently have installed in the NSX-T 3.1.2 manager appliance?

apt search rng-tools
Sorting... Done
Full Text Search... Done
rng-tools/now 5-0ubuntu4nn1 amd64 [installed,local]
  Daemon to use a Hardware TRNG

This appears to be the latest available version. In order to examine the status of the rngd daemon itself, log in to the appliance console as the root user and use:

systemctl list-units rng-tools.service

The service is shown as running,

Name of rngd random number generator service
root@nsx-manager:~# systemctl status rng-tools.service
rng-tools.service
Loaded: loaded (/etc/init.d/rng-tools; enabled; vendor preset: enabled)
Active: active (running) since Tue 2021-10-05 09:31:59 UTC; 17min ago
Docs: man:systemd-sysv-generator(8)
Process: 886 ExecStart=/etc/init.d/rng-tools start (code=exited, status=0/SUCCESS)
Tasks: 1 (limit: 4915)
CGroup: /system.slice/rng-tools.service
`-934 /usr/sbin/rngd -r /dev/hwrng
Oct 05 09:31:59 nsx-manager systemd[1]: Starting rng-tools.serviceā€¦
Oct 05 09:31:59 nsx-manager rng-tools[886]: Starting Hardware RNG entropy gatherer daemon: /etc/init.d/rng-tools: assigning /dev/hwrng to access rdrand on cpu
Oct 05 09:31:59 nsx-manager rng-tools[886]: crw-rw-rw- 1 root root 1, 8 Oct 5 09:31 /dev/random
Oct 05 09:31:59 nsx-manager rng-tools[886]: rngd.
Oct 05 09:31:59 nsx-manager systemd[1]: Started rng-tools.service.

What else can you find out about what it is doing in the background?

rngd -v

Two instances of ‘read error’ are output, followed by two further entropy sources being the Intel/AMG hardware random number generator and AES digital random number generator (RNDG). The ‘read error’ issue appears to be normal behaviour as the package attempts to read sources which don’t exist. Both of the displayed sources indicate that the CPU instruction set includes the necessary flags to tell the VM that it can access hardware random number generation.

Verbose output from rngd daemon

I must say, at this point it’s not clear whether NSX-T requires this service to be running permanently or whether it’s a component which Linux uses as a background service in order only to optimise the generation of a random number feed. It seems that stopping the service does appear to eventually cause problems in my lab – so please attempt the next section with CAUTION.

systemctl stop rng-tools.service

This leads to a significant reduction in CPU consumption and running temperature of my ESXi nodes.

CPU usage decreases after stopping rngd service

It may also be possible to disable the service permanently, but since I don’t have a full explanation of the purpose of this service from an NSX-T point of view I would stop short currently from doing this.

systemctl disable rng-tools.service

In the meantime I am hoping that I can get someone within the NSX-T development team to investigate these findings and provide some more permanent kind of workaround.

Further investigation

Further reading around the subject led me to find an issue has been reported on certain CPUs leading to activity spikes, https://github.com/nhorman/rng-tools/issues/136 and newer versions promise to fix this problem. The article mentioned suggests adding the -x jitter option to the start command but this is not available in the version installed in NSX-T.

RNGD_OPTS="-x jitter -r /dev/hwrng"

You can locate and edit the startup parameters by altering the service definition:

vi /etc/init.d/rng-tools

and potentially altering the default kernel values which are referenced by:

RNGDOPTIONS=
[ -r /etc/default/rng-tools ] && . /etc/default/rng-tools

Edit using:

vi /etc/default/rng-tools

However until the version of rng-tools used in NSX-T is updated to resolve this apparent issue it remains a personal choice as to whether or not the service can be stopped intermittently when a lab environment is not needed.


https://en.wikipedia.org/wiki/RDRAND
https://www.exoscale.com/syslog/random-numbers-generation-in-virtual-machines/