SBC PureConsult

September 17, 2025 (updated September 30, 2025)

VCF 9 and VVF 9 version mismatch between vCenter and ESX lab installation

Update following the GA release of ESX 9.0.1.0.24957456 and vCenter Server 9.0.1.0.24957454: this issue appears to have been resolved by the coordinated release of updated ESX and vCenter versions in VCF 9.0.1.0.

Following on from the excitement of VMware Explore in Las Vegas and receiving my new VMUG Advantage licenses for VCF 9 I have been testing various deployment choices for a new VVF 9 environment in my lab using the GA release of vCenter (9.0.0.0.24755230).

When downloading the various files to populate the offline depot for my cloud installer appliance to consume I noticed that there were two versions of ESX, an older one in both .ISO form (installer) and .ZIP (offline bundle), and a newer one in only the offline bundle.

Here’s what the GA version files look like in the drop down when displaying the original 9.0.0.0 version:

9.0.0.0 offline bundle and ISO downloads

The small drop down box on the top left of the VMware ESX panel can be used to select which version should be displayed, but as you will see below, when selecting the option for 9.0.0.0100 the only file available is an offline bundle.

I had decided to build a custom ISO including some of the VMware Flings (which I thought would also be required to get a working VCF 9 deployment in the future) and used the newer version of ESX (9.0.0.0100) to install my ESX servers prior to executing the VVF installer process.

Mostly this has been a trouble-free choice, other than the warning displayed by the VVF cloud installer that I was using an unexpected version of ESX on my target hosts. I didn’t have any real reason to doubt this process (because the newer version of ESX fixes some critical CVEs) and continued with setting up my environment successfully.

However, during several repeated installations of my Supervisor cluster I ran into issues where the spherelet VIB didn’t always uninstall/reinstall correctly, and I was required to remediate the cluster against the ‘autogen-software-spec-1’ lifecycle manager image, which is autogenerated using the version of ESX and any vendor plugins or components used at installation time.

autogen-software-spec-1 lifecycle manager image choices

At this point I noticed that the remediation would often fail because it skipped each of the hosts due to a ‘supposed’ hardware incompatibility and did not complete the rest of the remediation. These warnings can be silenced within the vCenter UI under cluster object, Monitor tab, vSAN, Skyline Health.

Silence those alerts if you’re not interested in maintaining compatibility with the HCL (because it’s a lab environment for instance).

In trying to chase down the further cause of the problem I saw the orange ribbon displayed on the Hardware Compatibility tab:

“Requested target version is not supported for the cluster.”

Select the cluster object, choose Updates tab and examine the Hardware Compatibility.

This seems strange, because the remediation will eventually complete anyway: I am not using the remediation option to enforce hardware compatibility before starting. The following screen shows the same message in a different position.

Successful remediation messages on the cluster tab

Why isn’t the offline-depot from the 9.0.0.0100 version actually supported, even though the auto-gen profile is created automatically and the offline-depot file is imported? It seems that this just comes down to the release dates of vCenter and the subsequent ESX update. You can see this in more detail by examining the following file on the vCenter appliance:

/var/log/vmware/vmware-updatemgr/vum-server/hcl_python_lib.log

Here are the relevant points which stood out to me:

Called to discover Hardware Compatibility List
Unknown version 9.0.0-0100.24813472
Cannot find target version (9.0.0-0100.24813472) in VCG

2025-09-17T04:30:13.461Z INFO report.hcl Called to discover HCL for hostId host-34 with target version 9.0.0-0100.24813472, vSanHclConstraints = True.
2025-09-17T04:30:13.462Z ERROR compatibility.releases Got exception while using cache.
Traceback (most recent call last):
  File "/usr/lib/vmware-updatemgr/python/hcl/compatibility/releases.py", line 37, in getEsxiReleaseByVersion
    result = getCacheFactory().getReleasesCache().getByVersion(version)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/vmware-updatemgr/python/hcl/compatibility/cache/release_cache.py", line 72, in getByVersion
    return ProductRelease(releaseId=1, version=version)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/vmware-updatemgr/python/hcl/compatibility/vvs/models/product.py", line 38, in __init__
    raise ValueError("Unknown version %s" % version)
ValueError: Unknown version 9.0.0-0100.24813472
2025-09-17T04:30:13.462Z ERROR report.hcl Cannot find target version (9.0.0-0100.24813472) in VCG.

On the face of it this seems strange, because the software depot includes the version which is already deployed to the cluster, and vCenter already knows that version, having presumably extracted it from the host(s) when they were first added to the cluster.
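If you copy hcl_python_lib.log off the appliance, a few lines of Python will list every target version that was reported as missing from the VCG. This is only a sketch: it assumes the ERROR line keeps the exact wording shown in the excerpt above, and the function name is my own for illustration.

```python
import re

# Pattern taken from the ERROR line in the log excerpt above; if the wording
# differs in another vCenter build this expression will need adjusting.
MISSING_VERSION = re.compile(r"Cannot find target version \(([^)]+)\) in VCG")


def unknown_target_versions(log_lines):
    """Return each distinct version reported as missing, in first-seen order."""
    seen = []
    for line in log_lines:
        match = MISSING_VERSION.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


# Example usage against a local copy of the log (the file name is whatever
# you saved it as):
# with open("hcl_python_lib.log", encoding="utf-8", errors="replace") as log:
#     print(unknown_target_versions(log))
```

Collapsing repeats matters here because the same error is logged once per host each time the compatibility check runs.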

Is it as simple as the vCenter version is older than the ESX release, and that Broadcom haven’t yet released a patch for vCenter which recognises new versions of ESX?

Seems like we just need to get vCenter to agree that it’s not using the latest data and to search online for the updated information.

Eventually I stumbled across the menu option to synchronise the hardware compatibility settings and the release data:

Main vCenter menu dropdown, Lifecycle Manager, Actions drop down, Compatibility Data, Sync

The menu option to update the list of release versions which are supported by vCenter

Returning again to /var/log/vmware/vmware-updatemgr/vum-server/hcl_python_lib.log

2025-09-17T10:32:39.412Z INFO __main__ Loading VVS database from file /storage/updatemgr/patch-store/vvs/vvs-consolidated-bundle-download.json
2025-09-17T10:32:39.966Z INFO __main__ VVS database json loaded in-memory
2025-09-17T10:32:39.966Z INFO compatibility.cache Creating CacheFactory...
2025-09-17T10:32:39.970Z INFO compatibility.cache.executor Datastore locked
2025-09-17T10:32:39.970Z INFO __main__ Loading releases into datastore
2025-09-17T10:32:39.970Z INFO __main__ Releases to be loaded: 9
2025-09-17T10:32:39.971Z INFO __main__ Releases loaded
2025-09-17T10:32:39.971Z INFO __main__ Loading cpuseries into datastore
2025-09-17T10:32:39.971Z INFO __main__ Cpuseries to be loaded: 97
2025-09-17T10:32:39.972Z INFO __main__ Cpuseries loaded
2025-09-17T10:32:39.972Z INFO __main__ Loading servers into datastore
2025-09-17T10:32:40.365Z INFO __main__ Servers to be loaded: 5636
2025-09-17T10:32:40.709Z INFO __main__ Servers loaded
2025-09-17T10:32:40.709Z INFO __main__ Loading devices into datastore
2025-09-17T10:32:41.539Z INFO __main__ Devices to be loaded: 7033
2025-09-17T10:32:42.361Z INFO __main__ Devices loaded
2025-09-17T10:32:42.361Z INFO __main__ Loading OEM vendors into datastore
2025-09-17T10:32:42.361Z INFO __main__ OEM vendors to be loaded: 7
2025-09-17T10:32:42.362Z INFO __main__ OEM vendors loaded
2025-09-17T10:32:42.362Z INFO __main__ Updating datastore time to 1758091862563
2025-09-17T10:32:42.513Z INFO compatibility.cache.executor Datastore unlocked

In this case perhaps the issue isn’t that the versions aren’t being published necessarily but that the compatibility cache doesn’t yet include build ‘9.0.0.0100.24813472’.

Never afraid to reference a William Lam article, I would question why this release is supported for VCF 9 as described here: https://williamlam.com/2025/07/applying-1st-esx-live-patch-using-vcf-9-0-operations.html but not actually within VVF yet?

A quick check through the /storage/updatemgr/patch-store/vvs/vvs-consolidated-bundle-download.json file on my vCenter doesn’t show any release matching that version – so perhaps the initial error was to assume that this patch was also eligible at all for VVF 9 (and not as it seems for VCF 9 only) at this stage.
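I don’t know the schema of the VVS bundle file, so for anyone wanting to repeat that check here is a structure-agnostic sketch: it walks whatever JSON it is given and reports where a string occurs. The function names are mine, and searching for just the build number (24813472) is an assumption that avoids guessing how the file formats the full version string.

```python
import json


def find_paths(node, needle, path="$"):
    """Yield the location of every string value that contains needle."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from find_paths(value, needle, f"{path}.{key}")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from find_paths(value, needle, f"{path}[{index}]")
    elif isinstance(node, str) and needle in node:
        yield path


def search_file(json_path, needle):
    """Load a JSON file and return every matching location as a list."""
    with open(json_path, encoding="utf-8") as handle:
        return list(find_paths(json.load(handle), needle))


# Example usage against a local copy of the bundle file:
# print(search_file("vvs-consolidated-bundle-download.json", "24813472"))
```

An empty list means the string does not appear in any string value in the file. Note that this sketch only inspects string values, so a build number stored as a JSON number would not be found.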

Perhaps we’ll find out with the next minor or patch release of vCenter if this omission is resolved, but at present it would appear that the best way to avoid these remediation problems in VVF is to only use the GA 9.0.0.0 release of the ESX ISO file.

September 6, 2025 (updated September 8, 2025)

How much space does VCF 9 offline depot require?

During my own deployment testing of VMware Cloud Foundation 9, and its smaller counterpart VMware vSphere Foundation 9, I came across the need to determine exactly how much space is required if you want to set up an offline depot to install your first fleet member. In offline/airgap or edge locations where connectivity to the internet is either too slow or completely inaccessible VMware provides a great solution to be able to prepare a portable depot which could either be a virtual machine or laptop operating close to where you are going to run the VCF installer.

Previously the Cloud Builder appliance was very large (~27GB) because the image required a lot of space just to store the various files which it will need for the installation. The new combined VMware Cloud Foundation Installer/SDDC Manager .OVA download in VCF 9 replaces this element with a much smaller 2.15GB file because it doesn’t by default include the other components which will be installed later, e.g. VCF Operations, VCF Automation etc.

Instead you can choose to create an offline depot on a laptop or virtual machine and use this to bootstrap the VMware Cloud Foundation Installer, rather than having it download everything from the Broadcom site using your download token. In my own experience this scenario is far more likely, as downloading tens of GB on arrival at a greenfield site (for any solution) is often difficult and can take hours.

The product documentation states that the installer appliance requires a minimum of 914GB storage if thick provisioned, but this post will cover what is the bare minimum in order to get a working VM which will deploy the whole VCF suite.

Doing it the William Lam way

William Lam has several excellent articles which describe (1) enabling a simple HTTP server using Python to serve the files that are imported into the SDDC Manager when it is first deployed, and (2) setting up the depot itself (perhaps you’ll use a VMUG Advantage entitlement to download the files if it’s for a lab environment):

How to deploy VCF 9 using VMUG Advantage licenses: https://williamlam.com/2025/07/how-to-deploy-vvf-vcf-9-0-using-vmug-advantage-vcp-vcf-certification-entitlement.html

Using HTTP protocol to host the depot files: https://williamlam.com/2025/06/using-http-with-vcf-9-0-installer-for-offline-depot.html

Creating a simple Python web server: https://williamlam.com/2025/01/quick-tip-easily-host-vmware-cloud-foundation-vcf-offline-depot-using-python-simplehttpserver-with-authentication.html

Disable 10Gbit ethernet adapter speed check: https://williamlam.com/2025/06/disable-10gbe-nic-pre-check-in-the-vcf-9-0-installer.html

How much space?!

Rather than duplicating any of the guidance above, this post concentrates on exactly how much space you’ll need for either a VCF 9 or VVF 9 based offline depot, ideally stored on a laptop or virtual machine acting as a web server. It could be costly in disk space terms to host these files permanently on your laptop, but there’s no reason why you can’t use an external SSD for this purpose. The question is: what size will you need?

Also worth considering is where you’re going to put the VCF Installer virtual machine when you’re getting ready to bootstrap a vSAN environment. In this case you’ll need to find a VMFS datastore which is large enough for the installer plus the data which will be imported into the VM itself. The rest of your disks are probably going to be cleared of any partitions so you can’t use those to store data.

A fully populated offline depot which is capable of serving both VVF and VCF products to the SDDC Manager will require ~56GB on disk when combined with the other parts of the depot. The thin provisioned VM will use approximately 80GB when the offline-depot files have been uploaded to it, so if you’ve got a VMFS partition on a 250GB SSD or NVMe disk that is local to your ESXi server this should be sufficient to hold the installer VM before you eventually migrate it to vSAN.

Here’s how the space breaks down into the two products, each is neatly defined within the user interface of the VMware Cloud Foundation Installer:

VMware vSphere Foundation 9

You will need a total of 16.67GB free space to store the three files (.ova and .iso) comprising the three elements stored in the offline-depot.

VVF 9 space requirements in the offline-depot

VMware Cloud Foundation 9

You will need a total of 52GB free space to store the nine files (.ova, .tar, .vlcm and .iso) comprising the seven application elements stored in the offline-depot.

VCF 9 space requirements in the offline-depot

So there you have it, the whole VVF/VCF stack can be installed from these binaries using the VMware Cloud Foundation Installer, and the only element that you’ll need to install onto your target ESXi server initially is the 2GB installer VM (which can optionally become the actual SDDC Manager for the cluster once complete).

Product name                      Fully populated depot size
VMware vSphere Foundation (VVF)   ~16GB
VMware Cloud Foundation (VCF)     ~56GB
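If you want to check your own depot folder against these figures before copying it to an external SSD, a short Python sketch like the one below will total the files and compare the result with the free space on the target. The function names and the 10% headroom are my own assumptions, and sizes are reported in decimal GB (1000^3 bytes), which may differ slightly from what your file manager shows.

```python
import os
import shutil


def depot_size_bytes(root):
    """Total size of every file underneath root, in bytes."""
    total = 0
    for folder, _subfolders, files in os.walk(root):
        for name in files:
            total += os.path.getsize(os.path.join(folder, name))
    return total


def to_gb(size_bytes):
    """Convert bytes to decimal gigabytes."""
    return size_bytes / 1000**3


def fits(required_bytes, target_path, headroom=0.10):
    """True if target_path has room for required_bytes plus some headroom."""
    free = shutil.disk_usage(target_path).free
    return required_bytes * (1 + headroom) <= free


# Example usage (both paths are placeholders for your own locations):
# size = depot_size_bytes("/path/to/offline-depot")
# print(f"{to_gb(size):.2f} GB, fits on SSD: {fits(size, '/path/to/ssd')}")
```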

August 5, 2025

Connecting Draytek Vigor 2927 to Azure VPN: Step-by-Step Guide

This week, I set out to re-connect my Draytek Vigor 2927 router to an Azure Virtual Network using a site-to-site VPN. While the process had a few challenges, the end result was a simple, reliable connection between my on-premises network and Azure.

In this post, I’ll walk through the main steps, highlight a few tips, and share the PowerShell commands I used once the Azure PowerShell module was installed.

In this brief outline I am only attempting to create an IPsec site-to-site VPN tunnel using a pre-shared key; alternative steps are required to use an X.509 certificate for authentication.

Today, it is only possible to deploy the lowest cost ‘Basic’ SKU VPN Gateway in Azure using command line tools, and this gateway type requires a Basic public IP address. From September 2025 only Standard IPs will be available for deployment, so this guide will likely need to be revised.

NB This guidance should only be followed in a test or lab environment where you can control the networking outcomes caused by reconfiguring your equipment and restarting it where necessary. Use with caution.

Step 1: Install and Connect Azure PowerShell

Before running any commands, make sure the Azure PowerShell module is installed. If it is not, you can install it with:

Install-Module -Name Az -AllowClobber -Scope CurrentUser

Once installed, open PowerShell and log in to your Azure account:

Connect-AzAccount

Select the correct subscription if you have multiple under your tenant.

Step 2: Define initial PowerShell variables

Using my example values as a guide, define the variables which you will need during the process. At minimum you will need to replace the location, subnet prefix, local network prefix and pre-shared key:

$rg = 'SBC-Infrastructure-PAYG' #This is the resource group where the new resources will be created

$location = 'UK South'

$publicIpName = 'pubip-sbcvpngw1'

$vnetName = 'sbcazurevnet'

$subnetName = 'GatewaySubnet'

$subnetPrefix = '172.20.0.0/24' #This is the prefix assigned to the GatewaySubnet in Step 3; it must sit inside the VNet address space (172.20.0.0/16 in my case)

$gatewayName = 'vpngw-sbcvpngw1'

$gatewaySku = 'Basic'

$lngRouterName = 'lng-sbcrouterip1'

$lngRouterIP = '80.xx.yy.zz' #This is my on-premises Draytek router’s public IP

$lngSubnetPrefix = '192.168.0.0/24' #This is my on-premises network range

$preSharedKey = 'yourpresharedkey'
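Before creating anything it is worth sanity-checking the address plan, because a site-to-site tunnel cannot route sensibly between overlapping ranges. The following is a minimal sketch using Python’s standard ipaddress module; the function name is hypothetical and the three ranges are simply the example values used in this post (including the 172.20.0.0/16 address space given to the VNet when it is created).

```python
import ipaddress


def check_address_plan(vnet_space, gateway_subnet, on_prem):
    """Return a list of problems found in the address plan (empty if none)."""
    vnet = ipaddress.ip_network(vnet_space)
    gateway = ipaddress.ip_network(gateway_subnet)
    lan = ipaddress.ip_network(on_prem)

    problems = []
    # The GatewaySubnet has to be carved out of the VNet address space.
    if not gateway.subnet_of(vnet):
        problems.append(f"{gateway} is not inside the VNet range {vnet}")
    # The on-premises LAN must not overlap the Azure side of the tunnel.
    if vnet.overlaps(lan):
        problems.append(f"on-premises range {lan} overlaps the VNet range {vnet}")
    return problems


# Example values from this post
print(check_address_plan("172.20.0.0/16", "172.20.0.0/24", "192.168.0.0/24"))
```

An empty list means the three ranges are consistent with each other; it says nothing about whether Azure will accept the subnet size you have chosen.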

Step 3: Set Up the Azure Virtual Networking elements

After logging in, create the required resource group, virtual network, and gateway subnet. Replace the variable values as needed for your environment.

Create a new resource group for the VPN gateway:

New-AzResourceGroup -Name $rg -Location $location

Create a new public IP address in the resource group:

$gwip = New-AzPublicIpAddress -Name $publicIpName -ResourceGroupName $rg -Location $location -AllocationMethod Dynamic -Sku Basic

Create a subnet in the virtual network for the VPN gateway:

$subnet = New-AzVirtualNetworkSubnetConfig -Name $subnetName -AddressPrefix $subnetPrefix

Create a new virtual network before creating the VPN gateway (if it doesn’t exist):

$vnet = New-AzVirtualNetwork -Name $vnetName -ResourceGroupName $rg -Location $location -AddressPrefix "172.20.0.0/16" -Subnets $subnet

If you have already implemented a VNet, subnet and public IP address and want to retrieve those objects, you can follow the sub-step below to populate them using PowerShell (this is a good way to validate all of your variables):

Retrieve the virtual network and subnet to ensure they are correctly set up:

$vnet = Get-AzVirtualNetwork -Name $vnetName -ResourceGroupName $rg

$subnet = Get-AzVirtualNetworkSubnetConfig -Name 'GatewaySubnet' -VirtualNetwork $vnet

$gwip = Get-AzPublicIpAddress -Name $publicIpName -ResourceGroupName $rg

Step 4: Create the VPN Gateway

Now you can create the VPN gateway with a basic SKU and route-based IPv4 configuration

$ngwIpConfig = New-AzVirtualNetworkGatewayIpConfig -Name "GatewayIpConfig" -Subnet $subnet -PublicIpAddress $gwip

$azvng = New-AzVirtualNetworkGateway -Name $gatewayName -ResourceGroupName $rg -Location $location -IpConfigurations $ngwIpConfig -GatewayType "Vpn" -VpnType "RouteBased" -GatewaySku $gatewaySku

Note: Creating a gateway can take up to 45 minutes, although in my recent experience it has been closer to 10 minutes. Be patient!

Once the command has run to completion you should have a variable containing the details of the virtual network gateway. If you need to repopulate that variable later (in a new session, for instance), run:

$azvng = Get-AzVirtualNetworkGateway -Name $gatewayName -ResourceGroupName $rg

Step 5: Configure the Local Network Gateway

Create a local network gateway which defines the connection target for your on-premises router

$azlng = New-AzLocalNetworkGateway -Name $lngRouterName -ResourceGroupName $rg -Location $location -GatewayIpAddress $lngRouterIP -AddressPrefix $lngSubnetPrefix

Otherwise, you can retrieve an existing local network gateway using

$azlng = Get-AzLocalNetworkGateway -Name $lngRouterName -ResourceGroupName $rg

Step 6: Establish the VPN Connection

Create a VPN connection between the Azure VPN gateway and the local network gateway (this is a multi-line command):

New-AzVirtualNetworkGatewayConnection -Name conn-AZ-to-SBC-vpn -ResourceGroupName $rg `
  -Location $location -VirtualNetworkGateway1 $azvng -LocalNetworkGateway2 $azlng `
  -ConnectionType IPsec `
  -SharedKey $preSharedKey `
  -ConnectionProtocol IKEv2

Step 7: Configure the Draytek Vigor 2927

On the Draytek side, log in to the router web interface and configure an IPsec VPN profile.

Input the pre-shared key, remote gateway (Azure’s public IP), and remote subnets as configured above. Use the correct proposal algorithms for compatibility (usually AES256/SHA256 and DH Group 2 for Azure).

These screen elements relate to the latest firmware at time of publishing (V2927_20250804_DrayTek_4462).

It is very important to complete/validate the first step because enabling the IPsec VPN service requires a reboot, and whilst this may have been done previously there’s no way to know if you will be able to make a new tunnel until it is enabled then rebooted.

Here are the exact configuration elements which I finally validated successfully:

Open VPN and Remote Access > Enable IPsec VPN Service (reboot required otherwise no tunnel will be established)
Open VPN and Remote Access > LAN to LAN > Add (new profile)
Set the following Common settings:
- Enable this profile: Yes
- Name: Azure_VPN_tunnel
Set the following Dial-Out settings:
- Call direction: Both
- Dial out through: WAN1
- Dial out settings:
- IPsec tunnel: IKEv2
- Server IP/Host Name: public IP address of the Azure VPN gateway (the address assigned to pubip-sbcvpngw1)
IKE phase 1 settings:
- Authentication method: Pre-shared key
- Pre-shared key: yourpresharedkey
- Proposal Encryption: AES-256
- Proposal DH Group: Group 2 (1024 bit)
- Proposal Authentication: SHA-256
IKE phase 2 settings:
- Security protocol: ESP (High)
- Proposal Encryption: AES-256
- Proposal Authentication: SHA-256
IKE Advanced settings:
- Phase 1 Key lifetime: 28800 seconds
- Phase 2 Key lifetime: 3599 seconds
- Enable Perfect Forward Secret: Not checked
Set the following Dial-In parameters:
- Allowed VPN Type: IPsec Tunnel(IKEv1/IKEv2)
- Specify Remote VPN Gateway:
- Remote IP address: (e.g. public IP of the Azure VPN gateway)
- Allowed IKE Authentication Method:
- Pre-shared key: yourpresharedkey
- Allowed IPsec Security method:
- AH, ESP-DES, ESP-3DES, ESP-AES
TCP/IP Network Settings:
- Local Network IP: 192.168.0.0
- Local Network mask: 255.255.255.0
- Remote Network IP: 172.20.0.0
- Remote Network mask: 255.255.0.0
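The Draytek pages ask for a network address and a dotted mask, whereas the Azure side was defined using CIDR prefixes. If your ranges differ from mine, this small sketch (the function name is my own) converts one form to the other using Python’s standard ipaddress module.

```python
import ipaddress


def draytek_fields(cidr):
    """Convert a CIDR prefix into the (network IP, network mask) pair."""
    network = ipaddress.ip_network(cidr)
    return str(network.network_address), str(network.netmask)


# The two ranges used in this post
print(draytek_fields("192.168.0.0/24"))  # local (on-premises) network
print(draytek_fields("172.20.0.0/16"))   # remote (Azure VNet) network
```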

October 18, 2024

macOS Sequoia firewall dropping Citrix ICA and MS RDP connections

Here’s a quick post to describe an issue I ran into after upgrading to macOS Sequoia on an M1 Studio. After the upgrade was complete I was prompted to enable the firewall (previously I used an additional product). Everything seemed fine and I was able to get on with using my system normally. I am currently running Sequoia 15.0.1 (24A348).

Around the same time I also upgraded to Citrix Workspace App 2409, especially because it introduced support for macOS Sequoia https://docs.citrix.com/en-us/citrix-workspace-app-for-mac/whats-new#support-for-macos-15-sequoia.

At some point later that day I started to experience disconnections from Citrix sessions at the 30 and 60 minute points of the session lifetime, which incorrectly sent me down the route of checking timeout settings first and then other suspicious network observations which turned out to be unrelated.

What followed, to my regret, was a lengthy sequence of troubleshooting steps including Wireshark, building Azure VMs in different geo-locations, downgrading the Citrix ADC VPX Gateway and testing on a Windows VM on my Mac; throwing the kitchen sink at it, in fact.

Only when I began investigating further, by building a Windows 11 VM in the same geographical location (in order to eliminate inter-country internet links), did I find that the Citrix session within the desktop was rock solid. Instead, the problem shifted to my macOS based RDP connection to the said temporary desktop, along with error 0x407 “Your session ended because of a data encryption error”.

At which point I realised that the session resets occurring periodically under ICA were also a symptom of the underlying session terminating, in a similar way to the RDP protocol.

Searching more widely I came across a few articles, linked below (with thanks) and discovered that disabling the Mac firewall then immediately resolved the problem with both RDP and ICA connections. I upgraded the MS Remote Desktop app with the new version (now called Windows App) and retested my session stability. I checked my Citrix Workspace App for Mac was updated fully and everything is now working fine, but only with the firewall switched off and a third-party app doing its job instead.

In summary – there appears to be a problem currently with the firewall implementation in Sequoia, and I look forward to seeing this patched soon.

https://discussions.apple.com/thread/255799644
https://www.reddit.com/r/MacOS/comments/1fizxc9/ms_rdp_broken_on_macos_sequoia/
https://www.theregister.com/2024/09/23/security_in_brief

May 2, 2024

Installing correct Python version for VMware PowerCLI 13 ImageBuilder for ESXi 8 custom images

This quick post outlines the successful approach for installing Python on Windows in order to use PowerCLI for making a custom ESXi 8 image.

As you’ll know already, this requires PowerCLI 13 as a minimum to be able to handle ESXi 8 images. However, try as I might, there were persistent problems installing and configuring Python on my Windows 10 VM (mainly caused by a failure to recognise OpenSSL).

Here’s what I know, having been through several troubleshooting steps:

- You don’t need to install OpenSSL separately; it is distributed in the pyopenssl package installed by Pip
- The -PythonPath parameter of the Set-PowerCLIConfiguration cmdlet needs to include python.exe and be surrounded by double quotes if the path contains spaces
- You can run pip commands from Command Prompt or PowerShell, but the VMware instructions have you run the versioned command, e.g. pip3.11.exe, from within the Python installation

Python 3.12.3 did not work with PowerCLI 13.2.1, even when using the process outlined below. It would never detect the correct version of OpenSSL. I was only able to make Python 3.11.9 work successfully with this release of PowerCLI.

Here is a brief outline of what I did in order to resolve my problems – I feel that starting on a fresh installation was important here.

Outline of installation steps

Built a clean VM in Azure running Windows 11. This was an important step to eliminate any problems which might have been caused by upgrading from previous PowerCLI versions.

Opened PowerShell 5.1, which is pre-installed with Windows so there is no need to install it

$PSVersionTable.PSVersion

Major Minor Build Revision
----- ----- ----- --------
5     1     22621 2506

Installed PowerCLI using

Install-Module VMware.PowerCLI -Scope CurrentUser

Installed Python installer for Windows (https://www.python.org/ftp/python/3.11.9/python-3.11.9-amd64.exe) from https://www.python.org/downloads/release/python-3119/ and deliberately installed it for ‘all users’ using an administrator privileged installation.

Newer versions of Python include the Pip package manager, so it’s not necessary to use the ‘get-pip.py’ script to install the additional packages.

Returned to the PowerShell window and configured the Python path, noting that there are double quotes around the value because of the space character in ‘Program Files’.

Set-PowerCLIConfiguration -PythonPath "C:\Program Files\Python311\python.exe" -Scope User

Based upon the general instructions here: https://developer.vmware.com/docs/15315/GUID-F98FF88D-D31F-48F0-8C3A-1C6492CD8AFB.html

It was then straightforward to install the necessary additional packages via the command line (I used Windows Command Prompt)

cd "C:\Program Files\Python311\Scripts"
pip3.11.exe install six psutil lxml pyopenssl
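As an optional cross-check, you can ask the interpreter itself whether the four packages are importable. This is a sketch of my own rather than anything from the VMware instructions: the function name is hypothetical, and the only subtlety is that the pyopenssl package is imported under the module name OpenSSL. Save it to a file and run it with the same python.exe that PowerCLI has been pointed at.

```python
import importlib
import sys

# pip package name -> module name used at import time
REQUIRED = {
    "six": "six",
    "psutil": "psutil",
    "lxml": "lxml",
    "pyopenssl": "OpenSSL",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of any packages that cannot be imported."""
    missing = []
    for package, module in required.items():
        try:
            importlib.import_module(module)
        except ImportError:
            missing.append(package)
    return missing


# Show which interpreter is being tested, then call missing_packages()
# to see whether anything still needs installing.
print("Python", sys.version.split()[0], "at", sys.executable)
```

An empty list from missing_packages() only proves the imports work in that interpreter; Get-EsxImageProfile remains the real test of whether ImageBuilder is happy.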

Close Command Prompt and PowerShell, then reopen PowerShell. Test the ability of ImageBuilder to access the Python packages it needs by using:

Get-EsxImageProfile

This command will generate a red error in PowerShell if any of the elements are missing, but read the output carefully as there are several different errors which all begin in a similar way.

Building custom ESXi image – additional material

Here are some example commands for what you might do next once PowerCLI and Python are working properly. The process here shows what you would do in order to install the VMware Fling for the USB network driver into an ESXi 8.0U2b generic offline depot file.

Only two files are referenced here, ESXi80U2-VMKUSB-NIC-FLING-67561870-component-22416446.zip and VMware-ESXi-8.0U2b-23305546-depot.zip which can be obtained from the VMware Flings page and VMware Customer Connect portal.

Make sure you log in first to access the Flings page otherwise none of the download options will be visible.

Once the .ISO file is generated then you can use a tool like Rufus to write the image to a bootable USB drive for instance.

Add-EsxSoftwareDepot .\ESXi80U2-VMKUSB-NIC-FLING-67561870-component-22416446.zip
Add-EsxSoftwareDepot .\VMware-ESXi-8.0U2b-23305546-depot.zip

Get-EsxSoftwareDepot | fl

Depot Url
---------
zip:C:\Users\vmadmin\Downloads\ESXi80U2-VMKUSB-NIC-FLING-67561870-component-22416446.zip?index.xml
zip:C:\Users\vmadmin\Downloads\VMware-ESXi-8.0U2b-23305546-depot.zip?index.xml

Get-EsxImageProfile | ft Name

Name
----
ESXi-8.0U2b-23305546-standard
ESXi-8.0U2sb-23305545-standard
ESXi-8.0U2b-23305546-no-tools
ESXi-8.0U2sb-23305545-no-tools

Get-EsxSoftwarePackage -Name *usb*

Name                     Version                        Vendor     Creation Date
----                     -------                        ------     -------------
vmkusb-nic-fling         1.12-2vmw.802.0.0.67561870     VMW        9/7/2023 8:53...
vmkusb                   0.1-18vmw.802.0.0.22380479     VMW        9/4/2023 9:34...
vmkusb-esxio             0.1-18vmw.802.0.0.22380479     VMW        9/4/2023 9:33...

New-EsxImageProfile -CloneProfile "ESXi-8.0U2b-23305546-standard" -Name "ESXi-8.0U2b-23305546-standard-usb-fling" -vendor "VMware"

Name                           Vendor          Last Modified   Acceptance Level
----                           ------          -------------   ----------------
ESXi-8.0U2b-23305546-standa... VMware          2/29/2024 12... PartnerSupported

Add-EsxSoftwarePackage -ImageProfile "ESXi-8.0U2b-23305546-standard-usb-fling" -SoftwarePackage "vmkusb-nic-fling" -Force

Export-EsxImageProfile -ImageProfile "ESXi-8.0U2b-23305546-standard-usb-fling" -FilePath "VMware-ESXi-8.0U2b-23305546-USB-NICs.iso" -ExportToIso -Force

April 19, 2024

Maintaining long lived Tanzu clusters

Very few resources provide real guidance on what to do after creating a cluster using Tanzu TKG, particularly in terms of ongoing maintenance beyond the initial handoff to a developer.

What often happens next is that you learn of a problem long after the system has become stable and been adopted for general use, which immediately puts you on the back foot in terms of overcoming the issue.

This post concentrates on the kinds of problems you might run into during operational management of a cluster, however it doesn’t claim to capture every such problem, just those which I’ve personally been involved in troubleshooting.

Of course, the VMware documentation should be your go-to place in the first instance, so do check through the known issues in the release notes for your specific version before continuing with any other activity.

You can find a link to the general product documentation for Tanzu TKG in the final section, which includes the most recent versions by default, and even an archive for older versions.

Do cluster credentials even expire?

Well, this is something which we’re unfortunately going to have to learn, most likely the hard way, and to my mind this is not sufficiently sign-posted within VMware TKG documentation.

Here’s a description of this scenario and the task to update your credentials, using the TKG 2.1 documentation as an example: https://docs.vmware.com/en/VMware-Tanzu-Kubernetes-Grid/2.1/tkg-deploy-mc/mgmt-manage-index.html

This approach assumes that you can still access the cluster using kubectl commands in order to retrieve the current kubeconfig data of your CLUSTER-NAME:

kubectl -n tkg-system get secrets CLUSTER-NAME-kubeconfig -o 'go-template={{ index .data "value"}}' | base64 -d > mc_kubeconfig.yaml

Once you have obtained this data you’ll want to know when the credentials expire, so use the following method to decode just the client certificate element first and then use openssl to extract the date elements:

kubectl -n tkg-system get secrets CLUSTER-NAME-kubeconfig -o 'go-template={{ index .data "value"}}' | base64 -d | grep client-certificate-data | awk '{print $2}' | base64 -d | openssl x509 -noout -dates

notBefore=Aug 21 15:26:11 2023 GMT
notAfter=Feb 19 03:31:32 2025 GMT

Compare these dates (obtained from the cluster) to what you have stored locally (held within your current kubeconfig context) using:

kubectl config view --raw --minify | grep client-certificate-data | awk '{print $2}' | base64 -d | openssl x509 -noout -dates

The output should be the same. If it is not, you can update your local kubeconfig copy of the cluster’s data using the mc_kubeconfig.yaml file created earlier.

Now is a good time to put a date in the diary to either upgrade your Tanzu TKG implementation or manually rotate the certificates before the notAfter date arrives. Please refer to the general guidance on rotation at kb.vmware.com.
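To put a number on how long you have left, the notAfter line can be turned into a day count. This is only a minimal sketch, not something taken from the VMware documentation: days_until_expiry is a hypothetical helper name, and it assumes GNU date (the -d option), so it will not work unmodified on macOS or BSD.

```shell
# days_until_expiry is a hypothetical helper, not part of kubectl or tanzu.
# $1: a "notAfter=..." line as printed by openssl x509 -noout -dates
# $2: optional reference date (defaults to now), in any format GNU date accepts
days_until_expiry() (
  not_after="${1#notAfter=}"                       # drop the "notAfter=" prefix
  expiry_s=$(date -u -d "$not_after" +%s) || exit 1
  ref_s=$(date -u -d "${2:-now}" +%s) || exit 1
  echo $(( (expiry_s - ref_s) / 86400 ))           # whole days, negative once expired
)

# The notAfter value shown above, measured from a fixed date ten days earlier
days_until_expiry "notAfter=Feb 19 03:31:32 2025 GMT" "Feb 9 03:31:32 2025 GMT"
```

Called with only the notAfter line it measures from today, and a negative result means the certificate has already expired.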

Thankfully this issue has been resolved in TKG 2.1.x via the auto-renew feature, which can be enabled retrospectively by editing the cluster object:

https://docs.vmware.com/en/VMware-Tanzu-Kubernetes-Grid/2.1/using-tkg/workload-clusters-secret.html#auto-renew

Misplacing the keys to the castle

Losing cluster admin credentials

Individual client certificates stored within kubeconfig files generally expire after 6 months, and the kubeadm-generated certs (seen below) expire 365 days after the cluster is built. Only the three certificate-authority certs created within the Kubernetes cluster last 10 years by default.

Here is the output generated from a control-plane node using the command below (on Kubernetes releases older than 1.20 the equivalent was kubeadm alpha certs check-expiration):

kubeadm certs check-expiration

This shows the output from a cluster created a few minutes ago, hence just under 365 days of residual time remaining.

CERTIFICATE                EXPIRES                  RESIDUAL TIME   CERTIFICATE AUTHORITY   EXTERNALLY MANAGED
admin.conf                 Mar 22, 2025 10:11 UTC   364d            ca                      no
apiserver                  Mar 22, 2025 10:11 UTC   364d            ca                      no
apiserver-etcd-client      Mar 22, 2025 10:11 UTC   364d            etcd-ca                 no
apiserver-kubelet-client   Mar 22, 2025 10:11 UTC   364d            ca                      no
controller-manager.conf    Mar 22, 2025 10:11 UTC   364d            ca                      no
etcd-healthcheck-client    Mar 22, 2025 10:11 UTC   364d            etcd-ca                 no
etcd-peer                  Mar 22, 2025 10:11 UTC   364d            etcd-ca                 no
etcd-server                Mar 22, 2025 10:11 UTC   364d            etcd-ca                 no
front-proxy-client         Mar 22, 2025 10:11 UTC   364d            front-proxy-ca          no
scheduler.conf             Mar 22, 2025 10:11 UTC   364d            ca                      no

CERTIFICATE AUTHORITY   EXPIRES                  RESIDUAL TIME   EXTERNALLY MANAGED
ca                      Mar 25, 2033 11:44 UTC   9y              no
etcd-ca                 Mar 25, 2033 11:44 UTC   9y              no
front-proxy-ca          Mar 25, 2033 11:44 UTC   9y              no

Upcoming kubeadm certs expiry

If you are approaching the expiry date of the kubeadm certificates and they have not been renewed automatically, renew them either by upgrading to a newer Tanzu version or by scaling the control plane, in which case they are rotated automatically while the cluster remains at its current version.

The manual rotation process is described here: https://kb.vmware.com/s/article/86251. Crucially, this requires SSH access to at least one of your control plane nodes. See the section below on losing access for some possible recovery options.

Once you have rotated the certificates on the control plane, don’t forget to also refresh the content of the following two fields:

client-certificate-data: [updated data from admin.conf]
client-key-data: [updated data from admin.conf]

Do this in BOTH of the following files on your tanzu CLI machine (used for kubectl workload contexts and tanzu login management contexts respectively):

~/.kube/config
~/.kube-tkg/config
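The two values can be lifted out of the rotated admin.conf with a little awk. This is a minimal sketch: kubeconfig_field is a hypothetical helper name, and it assumes each field sits on a single line in key: value form, which is the usual layout of admin.conf.

```shell
# kubeconfig_field is a hypothetical helper, not part of kubectl or tanzu.
# $1: path to a kubeconfig-style file, e.g. a copy of /etc/kubernetes/admin.conf
# $2: field name without the trailing colon
kubeconfig_field() {
  # Print the value of the first line whose first word is "<field>:"
  awk -v k="$2:" '$1 == k { print $2; exit }' "$1"
}

# Example, run against a copy of admin.conf fetched from the control-plane node:
# kubeconfig_field ./admin.conf client-certificate-data
# kubeconfig_field ./admin.conf client-key-data
```

Each printed value can then be pasted over the matching field in both kubeconfig files. Only the first match is returned, so a file holding several users needs more care than this sketch takes.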

Once the management cluster side of things is refreshed correctly you can be confident of another year’s administrative access. Now move on to rotate the certificates on any worker clusters which that management cluster oversees, using the same process. You don’t need to update the tanzu CLI copy of the kubeconfig file (~/.kube-tkg/config) for worker clusters, since you don’t log in to them.

Retrieve admin credentials before expiry

As your individual kubeconfig client certificate approaches expiry, you simply need to retrieve a new file using the tanzu cluster kubeconfig get clustername --admin --export-file new_file_name command (or the management cluster equivalent). This provides a new kubeconfig file which typically lasts 6 months from the date of issue.

However, if you don’t refresh your admin credentials periodically you may eventually find, after 365 days of operating a stable cluster, that you no longer have access to it at all via kubectl or tanzu login commands.

In situations where your computers are kept isolated from the internet, you might run into a problem where, without prior notice, you can’t use your kubeconfig file to talk to your cluster. This is especially true if you’re not updating your system more than once a year.

It is recommended to maintain awareness of the cluster’s certificate expiry dates and complete rotation beforehand, including refreshing your kubeconfig file via the tanzu CLI. If you only find out that a cluster’s access has expired after the fact, all is not lost provided you still have SSH access. Connect to the cluster control plane, carry out the manual rotation, then retrieve the updated content from the /etc/kubernetes/admin.conf file and place the data into your local kubeconfig file.

Losing SSH access

What if you lose access via SSH? In the TKG 1.6.x release a security hardening issue (https://kb.vmware.com/s/article/90368) can cause attempts to log on via capv@controlplaneIPaddress to fail, requiring additional edits to the cluster’s kcp and kubeadmtemplate before you will be able to log in.

An alternative approach might be to scale your control plane nodes vertically, for instance by modifying the size of a node’s attached hard disk, CPU or memory spec. This process is described here: https://kb.vmware.com/s/article/91164. Scaling your control plane causes new VMs to be spun up, each of which offers a fresh window of SSH access: 60 days on Ubuntu or 90 days on Photon. This security feature can be disabled (see Method 2 in the referenced document), but only once you have regained connectivity.

More desperate measures might be required if both kubeconfig and tanzu login access are no longer possible. I have not confirmed, and do not recommend, manually removing a control-plane VM via vCenter when you have lost both SSH and tanzu CLI access. My suspicion is that if there is more than one control-plane node, the running state will no longer match the KubeadmControlPlane spec and a new node will be provisioned with SSH access reinstated. This is something I aim to test further.

Contour package certificate expiry

VMware’s documentation for installing the Contour package into a worker cluster is rather generic, but one of the ways in which you can extend your workloads is by installing the Contour/Envoy ingress controller with some default values. It is delivered as a CLI-managed package, and the basic installation process is shown below (for TKG 1.6):

tanzu package available list contour.tanzu.vmware.com
tanzu package available get contour.tanzu.vmware.com/1.20.2+vmware.1-tkg.1 --generate-default-values-file

Within the generated default values file is a snippet which defines the lifetime of the TLS certificate used by the Envoy and Contour pods when communicating with each other over gRPC.

certificates:
  duration: 8760h
  renewBefore: 360h

The first value defines the period after which these internal-use Contour certificates will expire, and the second defines how long before expiry they should be renewed. However, this can cause a strange problem with Envoy pods if you installed Contour a couple of days after spinning up a new cluster. You will see something like:

StreamListeners gRPC config stream closed: 14, upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: TLS error: 268436498:SSL routines:OPENSSL_internal:SSLV3_ALERT_BAD_CERTIFICATE

https://kb.vmware.com/s/article/90811 details the solution, which is simply to remove the secrets and have the package recreate them automatically. I have also uninstalled the package using the tanzu CLI and reinstalled it without any other problems. However, you should be aware that your cluster’s certificates might expire before Contour has recreated its own certificates.
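For reference, the two durations in the default values snippet above are easier to reason about in days. A minimal sketch, with hours_to_days as a hypothetical helper name; it assumes the value is a whole number of hours with an h suffix, as in the generated file.

```shell
# hours_to_days is a hypothetical helper for reading the Contour values file.
# $1: a duration such as 8760h (whole hours, trailing "h")
hours_to_days() {
  echo $(( ${1%h} / 24 ))   # strip the "h" suffix, then integer-divide by 24
}

hours_to_days 8760h   # certificates.duration
hours_to_days 360h    # certificates.renewBefore
```

In other words the internal certificates last 365 days and are due for renewal 15 days before they expire, which puts their expiry close to that of the 365-day kubeadm certificates described earlier.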

List of resources and useful pages

https://docs.vmware.com/en/VMware-Tanzu/index.html

https://docs.vmware.com/en/VMware-Tanzu-Kubernetes-Grid/index.html

https://docs.vmware.com/en/VMware-Tanzu-Packages/2024.2.1/tanzu-packages/ref.html

https://docs.vmware.com/en/VMware-Tanzu-Kubernetes-Grid/services/tkg-doc-archive-2x.zip – this is a ZIP file archive of the TKG 2.x version documentation

https://kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-certs

Avi NSX licensing resources

https://avinetworks.com/docs/latest/nsx-alb-license-editions

https://docs.vmware.com/en/VMware-NSX-Advanced-Load-Balancer/22.1/Administration_Guide/GUID-B5EC8F3B-A75E-4809-A653-6EBE08CFED81.html – Avi licensing

February 13, 2024

How to clean up NSX Advanced Load Balancer following replacement of a failed Tanzu control plane node

How do I clean up a missing control plane node in the Avi load balancer console?

This post outlines an approach I used to solve a problem which has occurred in several environments I’ve worked in recently. I haven’t seen a similar set of instructions anywhere yet, but that doesn’t mean this is the only way to solve the problem. If you’re having a production problem, check with VMware Support, and don’t follow this guidance without properly understanding the type of problem you’re experiencing.

If you have found this page because you’re stuck with a similar problem, it is probably because one or more of the control plane nodes in a Tanzu Kubernetes Grid (TKG) cluster have failed and been replaced automatically, leaving a broken IP pool entry in the NSX Advanced Load Balancer user interface.

For example, you log in and find that one of your IP pools, which defines the control plane endpoints, has a server offline (shown as 3/4 servers up).

Clicking into the cluster will provide further detail of the missing control plane endpoints

In this case, one of the existing control plane nodes (172.20.11.45) became frozen and went offline, eventually losing its DHCP lease before it could be converted into a permanent reservation. Tanzu’s vSphere integration automatically provisioned a new node, and the old IP address now belongs to a new VM somewhere outside of Tanzu.

However, despite this situation occurring some days previously the Avi Kubernetes Operator (ako) has not cleaned up, perhaps expecting that the VM might be recovered eventually.

If you’re in a similar situation you will now know the name of the environment, and should still be able to determine the IP addresses of your current control plane nodes:

kubectl config use-context [name of your management cluster context]
kubectl get nodes -o wide

In this case we are only interested in the IP addresses belonging to nodes having the control-plane role (the first three in the output below).

There aren’t any more ‘missing’ control plane endpoints shown above, so Kubernetes appears satisfied that it is in a workable state.

As a validation, check that the endpoints listed within the Kubernetes service map onto the current working list of nodes.

List the endpoints for the Kubernetes service (in default namespace)

kubectl get ep kubernetes -o json

The JSON output above is quite simple to read vertically, and confirms that there are three IP addresses within a subset of endpoints serving the Kubernetes API service on port 6443 (via the Avi Load Balancer vserver) that is defined in your ~/.kube/config file.

These match the output which the NSX Advanced Load Balancer showed previously.

What puzzled me for a very long time now seems obvious: you cannot edit or remove defunct entries from the Avi IP pool using the UI, because the operator synchronises the list of endpoints for each service. Fix the condition in Kubernetes and the operator will take care of the pool content itself.

This is the way.

Obtain the list of services in the tkg-system namespace

kubectl get svc -n tkg-system

Now use the cluster-specific named control plane service to output the list of endpoints for the control plane

Aha, there’s the 172.20.11.45 control plane node which no longer exists in the cluster.
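Spotting the stale address by eye is fine for three or four nodes, but the comparison can also be scripted. This is a minimal sketch: stale_endpoints is a hypothetical helper name, both arguments are assumed to be space-separated IP lists, and every address below other than 172.20.11.45 is made up for illustration.

```shell
# stale_endpoints is a hypothetical helper, not part of kubectl or AKO.
# $1: space-separated IPs of the current control plane nodes
# $2: space-separated IPs listed in the control plane endpoint
# Prints each endpoint IP that has no matching node.
stale_endpoints() {
  for ip in $2; do
    case " $1 " in
      *" $ip "*) ;;            # still backed by a current node
      *) echo "$ip" ;;         # left behind by a replaced node
    esac
  done
}

# Illustrative addresses only (172.20.11.45 is the node lost in this post)
nodes="172.20.11.41 172.20.11.42 172.20.11.43"
endpoints="172.20.11.41 172.20.11.42 172.20.11.43 172.20.11.45"
stale_endpoints "$nodes" "$endpoints"
```

Against a live cluster the endpoint list could plausibly be gathered with kubectl get ep -o jsonpath='{.subsets[*].addresses[*].ip}', which prints the addresses separated by spaces, but I would treat that as a starting point to verify in your own environment rather than a tested recipe.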

Edit the endpoint and manually remove the missing address from the subset addresses section

kubectl edit ep [tkg-system-tkg-mgmt-projit-control-plane] -n tkg-system

Using the vi editor, remove the two lines declaring the ip and nodeName entries for the missing cluster node

Save the changes and close the file; the endpoint will be updated.

Refresh the Avi load balancer UI and if everything is well the pool will be updated dynamically when the ako operator detects the updated list of endpoints.

Further confirmation of the status update can be found in the ako-0 pod logs, which show that a change has been detected between the cached copy of the virtual server object and the updated relationship which is computed from the graph database.

kubectl logs ako-0 -n avi-system

It then resynchronises the pool content with Avi.

I’d be very pleased to hear if you run into a similar scenario, as I do not think that this element of ako is described anywhere in the official documentation of either Tanzu or AKO – and the DHCP lease re-issue will often crop up if an admin did not take care of making a permanent reservation after a node is added. Often this is because Tanzu will discover a broken node and intervene without anyone being aware of the problem, but this does not always make sense if addresses are not reserved permanently by default in your subnet.

Thanks for reading.

March 23, 2023 / March 24, 2023

How much space does an air gap installation of Tanzu TKG 2.1.1 need?

As a follow-up to my earlier post, How much space does an air gap installation of Tanzu TKG 1.6.0 need?, I thought it would be useful to expand on the initial summary to include an upgrade to TKG 2.1.1.

In the previous 1.6.0 example there was a total of 157 images (881 artifacts) requiring 9.7GB of storage space. However, the download process has been modified and no longer uses a shell script to download files for an air gap registry, but rather a command such as:

tanzu isolated-cluster download-bundle --source-repo projects.registry.vmware.com/tkg --tkg-version v2.1.1

This results in 244 tar files being downloaded for a single version of TKG and 45GB of space needed.

When uploading these tar files I experienced several problems caused by a Redis bug when using Harbor 1.10.x, and the upload command only succeeded once I had upgraded to Harbor 2.5.0.

tanzu isolated-cluster upload-bundle --source-directory ./ --destination-repo registry.sbcpureconsult.internal/tkg --ca-certificate /tmp/ca.crt

In total (for the combination of both TKG 1.6.0 and 2.1.1 releases) there are 177 repositories requiring 20.58GB of storage space.

Subtracting the 9.7GB used by TKG 1.6.0 from this combined figure indicates that TKG 2.1.1 requires a further 10.88GB.
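The same subtraction can be repeated for your own registry figures. A minimal sketch, with registry_delta as a hypothetical helper name; the two numbers are the combined total and the TKG 1.6.0 total quoted in these posts.

```shell
# registry_delta is a hypothetical helper: space added by the newer release.
# $1: combined registry size in GB, $2: size in GB before the newer release
registry_delta() {
  awk -v total="$1" -v before="$2" 'BEGIN { printf "%.2f\n", total - before }'
}

registry_delta 20.58 9.7   # combined total minus the TKG 1.6.0 figure
```

Note that this measures what ends up stored in the registry, which is a good deal less than the 45GB of tar files that had to be downloaded first.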

March 17, 2023

How much space does an air gap installation of Tanzu TKG 1.6.0 need?

I have implemented several air gapped installations of Tanzu Kubernetes Grid 1.6 now using Harbor registry so thought it might be worth recording how many images are stored and the space required.

Example clean registry with only TKG 1.6 files

Short on time? I should caveat that my results only record the space needed for a single version of Kubernetes (1.23.8). This is the newest supported build of Kubernetes in the TKG 1.6.0 release.

During the air gap installation it is possible to reduce the file set required to be stored in your registry by extracting the Bill of Materials for a specific version only:

export DOWNLOAD_TKRS="v1.23.8_vmware.2-tkg.1"

In total (for this specific release) there are 157 images (881 artifacts) requiring 9.7GB of storage space.

I have tested the deployment of a management and worker cluster from the air gap registry and confirm successful installation.

Over time you may accumulate older versions in your registry which are no longer required. However, there’s no information available currently on how you could reduce the number of images stored, so I would recommend keeping the image-copy file produced during each iteration of the air gap registry preparation phase so that you can remove them manually at a later date.

February 10, 2022 / March 23, 2023

Comparing YAML documents with Beyond Compare

I’m a huge fan of Scooter Software’s magnificent Beyond Compare tool, having used it for years in many different scenarios. It does a brilliant job of lining up the similarities and differences in multiple file types – especially text based file formats.

A Kubernetes container orchestration project likely involves defining multiple YAML manifests for creating different objects in the API so I often spend many happy hours comparing differences between folders full of files in different environments.

Unfortunately within Beyond Compare there isn’t any inbuilt YAML file support so it sometimes makes mistakes when aligning functionally equivalent but structurally different files.

Here’s an example:

Default file comparison using simple everything else text format

By default red text shows differences, blue text means minor differences and hatched areas show where there is a lack of alignment possible. I have defined certain non-important grammar elements to ignore, e.g. secretName: so these are shown in blue, but why is it not possible to line up the other elements?

In discussion with Scooter Software they explained that they would need a YAML parsing algorithm to be developed in order to pre-sort such lists into the correct layout, and this isn’t possible right now.

Alternatively they suggested creating a custom File Type which uses an external program to do the sorting work before it attempts to line up and search for differences.

With that in mind I came across the yq project on GitHub by Mike Farah, which provides, amongst other things, a sorting tool for YAML documents along the lines of jq (for JSON). The following brief instructions show how to create a custom file type in Beyond Compare to have yq sort the keys and values in a temporary file before showing the comparison view. You can then choose to integrate and save changes on either side.

Follow preparation steps

Download yq_windows_amd64.exe from the release page and rename the file to yq.exe for ease of use
Create a folder called YAML in %USERPROFILE%\AppData\Roaming\Scooter Software\Beyond Compare 4\Helpers
Copy the yq.exe file into the above folder
Create a new file within the same folder called yaml_sorted.bat containing:

@echo off
Helpers\YAML\yq.exe e -M "sort_keys(…)" %1 >>%2

Open Beyond Compare and configure as follows

Configure Beyond Compare

Choose Tools, File Formats

Choose the [+] button and select Text Format

On the General tab enter a file mask to use when applying the File Format correctly, e.g. *.yaml,*.yml

Click on the Conversion tab and select External program (Unicode filenames) then enter the following path and parameters

Helpers\YAML\yaml_sorted.bat %s %t

Optionally, choose whether to Disable editing (when displaying the resulting comparison) and whether to ‘Trim trailing whitespace’ when saving – I find this helpful sometimes.

Click on the Misc tab and choose ‘Insert spaces instead of tabs’ and tab stop 8.

Click Save and Close.

Now find two YAML files to compare and double click in Beyond Compare. It will use the file type mask to apply the custom file format you just created.

In the background it will run the yq executable twice (via the batch file), once for each input file (left and right comparison). The executable takes the input file (%1) then sorts the keys and values alphabetically in sequence first before outputting the content to a new temporary file (%2).

Now open a YAML file comparison and you’ll find a much improved alignment based upon first the sorting of individual keys, then subsequent scalar values within each.

e.g. yq e -M -P "keys" C:\Users\stwa\desktop\example-ingress1.yaml

Here is the resulting comparison after processing with yq. I think you’ll agree that it’s much easier to spot the differences and also allow standardisation of your files going forward.

Resulting file comparison between left and right – after sorting via yq

There’s much more that you can do with both yq and Beyond Compare to further tune the formatting order, but hopefully this gives you some sort of light at the end of the tunnel when attempting comparisons of YAML documents.

I would like to investigate whether it’s possible to define a template format in yq with a custom key order matching the specification used by the Kubernetes API. In addition, Beyond Compare allows line weighting within the algorithm so that it can assign preferential weights to certain grammar elements, although this has not been particularly fruitful when used according to the example below.

Final thoughts

Custom file types don’t just affect comparisons – If you right click on a file and choose Open With, Text Edit you will also see the sorted version of the file – not the original.

NB – If you save the modified content it will be saved using the new sorted format.

Documentation for yq is available here: https://mikefarah.gitbook.io/yq