Very few resources provide real guidance on what to do after creating a cluster with Tanzu TKG, particularly around ongoing maintenance beyond the initial handoff to a developer.
What often happens is that you only learn of a problem long after the system has become stable and adopted for general use, which puts you straight onto the back foot when it comes to overcoming the issue.
This post concentrates on the kinds of problems you might run into during operational management of a cluster; it doesn’t claim to capture every such problem, just those which I’ve personally been involved in troubleshooting.
Of course, the VMware documentation should be your first port of call, so do check the known issues in the release notes for your specific version before continuing with any other activity.
You can find a link to the general product documentation for Tanzu TKG in the final section, which includes the most recent versions by default, and even an archive for older versions.
Do cluster credentials even expire?
Unfortunately, yes, and it is something many of us will learn the hard way; to my mind this is not sufficiently signposted within the VMware TKG documentation.
Here’s a description of this scenario and the task to update your credentials, using the TKG 2.1 documentation as an example: https://docs.vmware.com/en/VMware-Tanzu-Kubernetes-Grid/2.1/tkg-deploy-mc/mgmt-manage-index.html
This approach assumes that you can still access the cluster using kubectl commands in order to retrieve the current kubeconfig data of your CLUSTER-NAME:
kubectl -n tkg-system get secrets CLUSTER-NAME-kubeconfig -o 'go-template={{ index .data "value"}}' | base64 -d > mc_kubeconfig.yaml
Once you have obtained this data you’ll want to know when the credentials expire, so use the following method to decode just the client certificate element first and then use openssl to extract the date elements:
kubectl -n tkg-system get secrets CLUSTER-NAME-kubeconfig -o 'go-template={{ index .data "value"}}' | base64 -d | grep client-certificate-data | awk '{print $2}' | base64 -d | openssl x509 -noout -dates
notBefore=Aug 21 15:26:11 2023 GMT
notAfter=Feb 19 03:31:32 2025 GMT
Compare these dates (obtained from the cluster) to what you have stored locally (held within your current kubeconfig context) using:
kubectl config view --raw --minify | grep client-certificate-data | awk '{print $2}' | base64 -d | openssl x509 -noout -dates
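If you want an at-a-glance warning rather than comparing dates by eye, openssl’s -checkend flag can do the threshold test for you. A minimal sketch, assuming your current kubectl context is the cluster in question and using an illustrative 30-day window:

```shell
# Warn if the client certificate in the current kubectl context expires
# within 30 days (2592000 seconds); -checkend exits non-zero in that case.
kubectl config view --raw --minify \
  | grep client-certificate-data | awk '{print $2}' | base64 -d \
  | openssl x509 -noout -checkend 2592000 \
  && echo "OK: certificate valid for at least another 30 days" \
  || echo "WARNING: certificate expires within 30 days"
```

This runs against your local context only, so it is safe to drop into a cron job as an early-warning check.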
The output should be the same; if it is not, you can update your local kubeconfig copy of the cluster’s data using the mc_kubeconfig.yaml file output earlier.
Now is a good time to make a date in the diary to either upgrade your Tanzu TKG implementation or manually rotate the certificates before that date arrives. Please refer to the general guidance on kb.vmware.com concerning rotation.
Thankfully this issue has been resolved in TKG 2.1.x via the auto-renew feature, which can be enabled retrospectively by editing the cluster object.
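As a sketch of what that edit involves (the variable name and fields below reflect my reading of the TKG 2.x documentation and should be verified against the docs for your exact version), you add a controlPlaneCertificateRotation variable under spec.topology.variables of the Cluster object:

```yaml
# Assumed field names from the TKG 2.x docs; verify for your version.
# Enables auto-renewal of machine certificates 90 days before expiry.
- name: controlPlaneCertificateRotation
  value:
    activate: true
    daysBefore: 90
```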
Misplacing the keys to the castle
Losing cluster admin credentials
Individual client certificates stored within Kubeconfig files generally expire after 6 months, and the kubeadm generated certs (seen below) automatically expire within 365 days of the cluster being built. Only the three certificate-authority certs created within the Kubernetes cluster last 10 years by default.
Here is the output generated from a control-plane node using the command:
kubeadm alpha certs check-expiration
(On Kubernetes 1.19 and later this subcommand has graduated out of alpha, so on newer nodes use kubeadm certs check-expiration instead.)
This shows the output from a cluster created a few minutes ago, hence <365d residual time remaining.
CERTIFICATE EXPIRES RESIDUAL TIME CERTIFICATE AUTHORITY EXTERNALLY MANAGED
admin.conf Mar 22, 2025 10:11 UTC 364d ca no
apiserver Mar 22, 2025 10:11 UTC 364d ca no
apiserver-etcd-client Mar 22, 2025 10:11 UTC 364d etcd-ca no
apiserver-kubelet-client Mar 22, 2025 10:11 UTC 364d ca no
controller-manager.conf Mar 22, 2025 10:11 UTC 364d ca no
etcd-healthcheck-client Mar 22, 2025 10:11 UTC 364d etcd-ca no
etcd-peer Mar 22, 2025 10:11 UTC 364d etcd-ca no
etcd-server Mar 22, 2025 10:11 UTC 364d etcd-ca no
front-proxy-client Mar 22, 2025 10:11 UTC 364d front-proxy-ca no
scheduler.conf Mar 22, 2025 10:11 UTC 364d ca no
CERTIFICATE AUTHORITY EXPIRES RESIDUAL TIME EXTERNALLY MANAGED
ca Mar 25, 2033 11:44 UTC 9y no
etcd-ca Mar 25, 2033 11:44 UTC 9y no
front-proxy-ca Mar 25, 2033 11:44 UTC 9y no
Upcoming kubeadm certs expiry
If you are now approaching the expiry date of the kubeadm certificates and they have not been renewed automatically, do so either by completing an upgrade to a newer Tanzu version, or by scaling the control plane, in which case the certificates are rotated automatically while the cluster remains on its current version.
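As a sketch, scaling the control plane with the tanzu CLI looks like the following (the cluster name and node count are illustrative); the rolling node replacement is what triggers the reissue of the kubeadm certificates:

```shell
# Scale the control plane; replacement nodes boot with freshly
# issued kubeadm certificates.
tanzu cluster scale my-cluster --controlplane-machine-count 3
```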
The manual rotation process is described here: https://kb.vmware.com/s/article/86251 – BUT crucially this requires SSH access to at least one of your control-plane nodes. See the section below on losing access for some possible recovery options.
Once you have rotated the certificates on the control plane, don’t forget to also refresh the content of the following two fields,
client-certificate-data: [updated data from admin.conf]
client-key-data: [updated data from admin.conf]
in BOTH of the files below on your tanzu CLI machine (used for kubectl workload contexts and tanzu login management contexts respectively):
~/.kube/config
~/.kube-tkg/config
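To pull the refreshed values out of admin.conf on a control-plane node, something like the following will do (a sketch; copy the two base64 strings it prints into the matching fields of both local files):

```shell
# On the control-plane node, after rotation: print the two fields to
# copy into ~/.kube/config and ~/.kube-tkg/config on the CLI machine.
sudo grep -E 'client-(certificate|key)-data' /etc/kubernetes/admin.conf
```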
After the management cluster side of things is refreshed correctly you can have confidence of another year’s administrative access. Now move on to rotate the certificates on any workload clusters which that management cluster oversees, using the same process (but you don’t need to update the tanzu CLI copy of the kubeconfig file for workload clusters, since you don’t log in to them).
Retrieve admin credentials before expiry
As the expiry of your individual kubeconfig client certificate approaches, you simply need to retrieve a new file using the command tanzu cluster kubeconfig get clustername --admin --export-file new_file_name (or the management cluster equivalent). This will provide a new kubeconfig file which will typically last 6 months from the date of issue.
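It is worth confirming the expiry of the freshly exported file before filing it away, reusing the earlier openssl one-liner against the new file:

```shell
# Check the notAfter date of the client certificate in the new kubeconfig
KUBECONFIG=new_file_name kubectl config view --raw --minify \
  | grep client-certificate-data | awk '{print $2}' | base64 -d \
  | openssl x509 -noout -enddate
```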
However, if you don’t refresh your admin credentials periodically then you may eventually find that, after 365 days of operating a stable cluster, you no longer have access to it at all via kubectl or tanzu login commands.
Air-gapped environments are especially at risk here: if your systems are isolated from the internet and you upgrade less than once a year, you may find, without any prior notice, that your kubeconfig file can no longer be used to talk to your cluster.
It is recommended to maintain awareness of the cluster’s certificate expiry dates and complete rotation beforehand, including refreshing your kubeconfig file via the tanzu CLI. If you are only told about a cluster’s access expiring after the fact, but you still have SSH access, then all is not lost: connect to the cluster control plane, carry out the manual rotation, then retrieve the updated content from the /etc/kubernetes/admin.conf file and place the data into your local kubeconfig file.
Losing SSH access
What if you lose access via SSH? In the TKG 1.6.x release a security hardening issue (https://kb.vmware.com/s/article/90368) can cause attempts to log on via capv@controlplaneIPaddress to fail, requiring additional edits to the cluster’s KubeadmControlPlane (kcp) and KubeadmConfigTemplate objects before you will be able to log in.
An alternative approach might be to scale your control-plane nodes vertically, e.g. by modifying the size of a node’s attached hard disk, CPU or memory spec. This process is described here: https://kb.vmware.com/s/article/91164. Scaling the control plane spins up new VMs, each giving you a fresh window of SSH access: 60 days on Ubuntu nodes or 90 days on Photon. This security feature can be disabled (see Method 2 in the referenced document), but only once you have regained connectivity.
More desperate measures might be required if both your kubeconfig and tanzu login access are no longer possible. I can neither confirm nor recommend manually removing a control-plane VM via vCenter in the event that you lose both SSH and tanzu CLI access, but my suspicion is that, if there is more than one control-plane node, the KubeadmControlPlane will no longer match the running spec and a new node will be provisioned, with SSH access reinstated. This is something I aim to test further.
Contour package certificate expiry
VMware’s documentation for installing the Contour package into a workload cluster is rather generic, but one of the ways you can extend your workloads is by installing the Contour/Envoy ingress controller along with some default values. This scenario is described as a CLI-managed package; the basic installation process (for TKG 1.6) is detailed below:
tanzu package available list contour.tanzu.vmware.com
tanzu package available get contour.tanzu.vmware.com/1.20.2+vmware.1-tkg.1 --generate-default-values-file
Within the generated default values file is a snippet which defines the lifetime of the TLS certificates used by the Envoy and Contour pods when communicating with each other over the gRPC protocol:
certificates:
duration: 8760h
renewBefore: 360h
The duration value defines the period after which these internal-use Contour certificates will expire, and renewBefore the window before expiry in which they should be renewed. However, this can cause a strange problem with the Envoy pods if you installed Contour a couple of days after spinning up a new cluster; you will see something like:
StreamListeners gRPC config stream closed: 14, upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: TLS error: 268436498:SSL routines:OPENSSL_internal:SSLV3_ALERT_BAD_CERTIFICATE
https://kb.vmware.com/s/article/90811 details the solution, which is simply to delete the secrets and have the package automatically recreate them. I have also uninstalled the package using the tanzu CLI and reinstalled it without any other problems; however, you should be aware that your cluster’s certificates might expire before Contour has recreated them.
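For reference, the fix amounts to something like the following; the secret names and namespace below are those I believe a default TKG Contour package install uses, so list the secrets first and adjust if yours differ:

```shell
# Assumed names for a default TKG Contour package install; verify first.
kubectl -n tanzu-system-ingress get secrets
kubectl -n tanzu-system-ingress delete secret contourcert envoycert
# Restart the pods so they pick up the recreated certificates.
kubectl -n tanzu-system-ingress rollout restart deployment contour
kubectl -n tanzu-system-ingress rollout restart daemonset envoy
```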
List of resources and useful pages
https://docs.vmware.com/en/VMware-Tanzu/index.html
https://docs.vmware.com/en/VMware-Tanzu-Kubernetes-Grid/index.html
https://docs.vmware.com/en/VMware-Tanzu-Packages/2024.2.1/tanzu-packages/ref.html
https://docs.vmware.com/en/VMware-Tanzu-Kubernetes-Grid/services/tkg-doc-archive-2x.zip – this is a ZIP file archive of the TKG 2.x version documentation
https://kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-certs
Avi NSX licensing resources
https://avinetworks.com/docs/latest/nsx-alb-license-editions
https://docs.vmware.com/en/VMware-NSX-Advanced-Load-Balancer/22.1/Administration_Guide/GUID-B5EC8F3B-A75E-4809-A653-6EBE08CFED81.html – Avi licensing