How do I clean up a missing control plane node in the Avi load balancer console?
This post outlines an approach I used to solve a problem which has occurred in several environments I’ve worked in recently. I haven’t seen a similar set of instructions anywhere yet, but that doesn’t mean they are the only way to solve the problem. Check with VMware Support if you’re having a production problem; don’t follow this guidance without properly understanding the type of problem you’re experiencing.
If you have found this page because you’re stuck with a similar problem, it is probably because one or more of the control plane nodes in a Tanzu Kubernetes Grid (TKG) cluster have failed and been replaced automatically, leaving a broken IP pool entry in the NSX Advanced Load Balancer user interface.
For example, you log in and find that one of the IP pools which defines your control plane endpoints is degraded (shown as 3/4 servers up).
Clicking into the cluster provides further detail of the missing control plane endpoints.
In this case, one of the existing control plane nodes (172.20.11.45) became frozen and went offline, eventually losing its DHCP lease before it could be converted into a permanent reservation. Tanzu’s vSphere integration automatically provisioned a new node, and the old IP address now belongs to a new VM somewhere outside of Tanzu.
However, despite this situation occurring some days previously, the Avi Kubernetes Operator (ako) has not cleaned up the stale pool member, perhaps expecting that the VM might eventually be recovered.
If you’re in a similar situation you will know the name of the environment, and should still be able to determine the IP addresses of your current control plane nodes:
kubectl config use-context [name of your management cluster context]
kubectl get nodes -o wide
In this case we are only interested in the IP addresses belonging to nodes with the control-plane role (the first three in the output below).
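On a larger cluster a label selector and a jsonpath expression save scanning the table by eye. This is a self-contained sketch: the stub function stands in for the real kubectl CLI, and the three addresses it prints are illustrative.

```shell
# Stub standing in for the real kubectl CLI so this sketch runs anywhere;
# delete the function to run the command against a live cluster
# (the three addresses below are illustrative).
kubectl() { printf '172.20.11.42 172.20.11.43 172.20.11.44'; }

# List only the InternalIP of each node carrying the control-plane role.
kubectl get nodes -l node-role.kubernetes.io/control-plane \
  -o jsonpath='{.items[*].status.addresses[?(@.type=="InternalIP")].address}'
```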
There aren’t any more ‘missing’ control plane endpoints shown above, so Kubernetes appears satisfied that it is in a workable state.
As a validation, check that the endpoints listed within the Kubernetes service map onto the current working list of nodes.
List the endpoints for the Kubernetes service (in the default namespace)
kubectl get ep kubernetes -o json
The JSON output above is quite simple to read vertically, and confirms that there are three IP addresses within a subset of endpoints serving the Kubernetes API service on port 6443, behind the Avi load balancer virtual service whose address is defined in your ~/.kube/config file.
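To pull out just the member IPs (the list ako mirrors into the Avi pool), against a live cluster you could run `kubectl get ep kubernetes -o jsonpath='{.subsets[*].addresses[*].ip}'`. The sketch below is self-contained instead: it writes an illustrative sample with the same JSON shape and extracts the addresses with python3.

```shell
# Illustrative sample mimicking the shape of `kubectl get ep kubernetes -o json`
# (the addresses are made up for the example).
cat > /tmp/ep-sample.json <<'EOF'
{
  "subsets": [
    {
      "addresses": [
        {"ip": "172.20.11.42"},
        {"ip": "172.20.11.43"},
        {"ip": "172.20.11.44"}
      ],
      "ports": [
        {"name": "https", "port": 6443, "protocol": "TCP"}
      ]
    }
  ]
}
EOF
# One IP per healthy control plane node serving the API on port 6443.
python3 -c 'import json; d = json.load(open("/tmp/ep-sample.json")); print(" ".join(a["ip"] for s in d["subsets"] for a in s["addresses"]))'
```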
These match the output which the NSX Advanced Load Balancer showed previously.
What puzzled me for a very long time now seems obvious: you cannot edit or remove defunct entries from the Avi IP pool using the UI, because the operator synchronises the list of endpoints for each service. Fix the condition in Kubernetes, and the operator will take care of the content of the pool itself.
This is the way.
Obtain the list of services in the tkg-system namespace
kubectl get svc -n tkg-system
Now use the cluster-specific control plane service name to output the list of endpoints for the control plane
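The lookup itself is a one-liner. In this sketch a stub stands in for kubectl so it is self-contained; 172.20.11.45 is the stale member from this incident, and the other three addresses are illustrative.

```shell
# Stub in place of the real kubectl CLI; remove it to query the live cluster.
# 172.20.11.45 is the stale member; the other addresses are illustrative.
kubectl() { printf '172.20.11.42 172.20.11.43 172.20.11.44 172.20.11.45'; }

# IPs currently listed behind the cluster-specific control plane service.
kubectl get ep tkg-system-tkg-mgmt-projit-control-plane -n tkg-system \
  -o jsonpath='{.subsets[*].addresses[*].ip}'
```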
Aha, there’s the 172.20.11.45 control plane node which no longer exists in the cluster.
Edit the endpoints object and manually remove the missing address from the subset addresses section
kubectl edit ep [tkg-system-tkg-mgmt-projit-control-plane] -n tkg-system
Using the vi editor, remove the two lines declaring the ip and nodeName entries for the missing cluster node
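In the editor, the relevant part of the manifest looks roughly like this (the field names are standard for Kubernetes Endpoints objects; the healthy address, node names, and port name are illustrative):

```yaml
subsets:
- addresses:
  - ip: 172.20.11.42                                # healthy node (illustrative)
    nodeName: tkg-mgmt-projit-control-plane-abc12
  - ip: 172.20.11.45                                # <- remove this line
    nodeName: tkg-mgmt-projit-control-plane-xyz99   # <- and this one
  ports:
  - name: https
    port: 6443
    protocol: TCP
```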
Close the file and save the changes; the endpoint will be updated.
Refresh the Avi load balancer UI and if everything is well the pool will be updated dynamically when the ako operator detects the updated list of endpoints.
Further confirmation of the status update can be found in the ako-0 pod logs, which show that a change has been detected between the cached copy of the virtual server object and the updated relationship computed from the graph database.
kubectl logs ako-0 -n avi-system
It then resynchronises the pool content with Avi.
I’d be very pleased to hear if you run into a similar scenario, as I do not think this element of ako is described anywhere in the official documentation of either Tanzu or AKO. The DHCP lease re-issue will often crop up when an admin did not create a permanent reservation after a node was added. Often this is because Tanzu will discover a broken node and intervene without anyone being aware of the problem, which can end badly if addresses in your subnet are not reserved permanently by default.
Thanks for reading –