When troubleshooting a Master/Control Plane Machine, it is essential to first determine the health of the etcd members. On rare occasions, the Node may go unready on multiple master Machines while one or more of those Machines still has a healthy etcd member. Before selecting a master Machine to delete, you must confirm that deleting it will not compromise etcd quorum.
Additionally, Master/Control Plane Machines are not currently managed by MachineSets, so always take a backup copy of a master Machine object before deleting it so that you can recreate it easily.
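For example, the Machine object can be saved to a file before deletion (the file name here is arbitrary):
oc get machine -n openshift-machine-api <master machine> -o yaml > master-machine-backup.yaml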
For more information on identifying etcd member health, refer to these steps: https://docs.openshift.com/container-platform/4.5/backup_and_restore/replacing-unhealthy-etcd-member.html#restore-identify-unhealthy-etcd-member_replacing-unhealthy-etcd-member
For in-depth information and steps to replace a Master/Control Plane machine, refer to this guide: https://docs.openshift.com/container-platform/4.5/backup_and_restore/replacing-unhealthy-etcd-member.html
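As a quick sketch of the kind of check described in those steps, you can inspect member health from inside one of the running etcd pods (pod names follow the pattern etcd-<master node name> and will differ per cluster):
oc get pods -n openshift-etcd | grep etcd
oc rsh -n openshift-etcd etcd-<master node name>
etcdctl member list -w table
etcdctl endpoint health --cluster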
- Document Purpose
- Important Pod Logs
- I created a Machine (or scaled up a MachineSet) but I didn't get a Node
- I deleted a Machine (or scaled down a MachineSet) but the Machine and/or Node did not go away
- A Machine is listed as 'Failed'
This document outlines the steps to investigate the current status of an individual Machine or MachineSet that does not appear to be creating Machines, and therefore no additional Nodes are joining the cluster. Troubleshooting an unhealthy Node (a Machine that has already joined the cluster as a Node) is outside the scope of this document.
Nearly everything related to the machine-api is viewable in the openshift-machine-api namespace. You will want to familiarize yourself with the output of:
oc get deployments -n openshift-machine-api
oc get pods -n openshift-machine-api
The machine-api-controllers-* pod has several containers running: machineset-controller, machine-controller, nodelink-controller, and the machine-healthcheck-controller.
To check the logs for a particular component, use
oc logs -n openshift-machine-api machine-api-controllers-<random suffix> -c <controller-name>
The random suffix is automatically generated by the machine-api-controllers deployment and is most easily found using the output of
oc get pods -n openshift-machine-api
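For example, if the pod is named machine-api-controllers-6d9f8c7b5d-x2k4q (a made-up suffix for illustration), you would check the machine-controller container's logs with:
oc logs -n openshift-machine-api machine-api-controllers-6d9f8c7b5d-x2k4q -c machine-controller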
Kubelets on instances provisioned by the machine-api will automatically attempt to join the cluster by issuing a CSR (certificate signing request). Under normal circumstances, these CSRs should be approved automatically. On rare occasions, you may encounter a bug where a CSR is stuck in a pending state and the kubelet is unable to join the cluster successfully.
To view the cluster-machine-approver logs, perform the following:
oc get pods -n openshift-cluster-machine-approver
Note the name of the machine-approver-* pod. The suffix will be randomly generated by the pod's deployment controller.
Next, get the logs for the machine-approver-controller container:
oc logs -n openshift-cluster-machine-approver machine-approver-<random suffix> -c machine-approver-controller
Be sure to replace <random suffix> above with the real suffix from the previous step.
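Alternatively, assuming the deployment is named machine-approver (verify with oc get deployments -n openshift-cluster-machine-approver), you can let oc resolve a pod for you instead of looking up the suffix:
oc logs -n openshift-cluster-machine-approver deployment/machine-approver -c machine-approver-controller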
First, if you scaled a MachineSet, check that a new Machine object was created successfully; the sketch below shows how to inspect the MachineSet's status and its Machines. If there is not a new Machine, then check the machineset-controller's logs; refer to the section Important Pod Logs above for exact steps.
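As a starting point, compare the desired and current replica counts on the MachineSet and list its Machines (the MachineSet name is a placeholder):
oc get machinesets -n openshift-machine-api
oc describe machineset <machineset name> -n openshift-machine-api
oc get machines -n openshift-machine-api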
Next, check the Machine object's status. There may be status conditions that explain the problem; be sure to check the phase as well.
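For example, to print just the phase, or the full status including conditions (the Machine name is a placeholder):
oc get machine <machine name> -n openshift-machine-api -o jsonpath='{.status.phase}'
oc describe machine <machine name> -n openshift-machine-api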
If the phase is "Provisioning", it means that the cloud provider has not created the corresponding instance yet for one reason or another. This could be quota, misconfiguration, or some other problem. Check the machine-controller's logs; refer to the section Important Pod Logs above for exact steps.
Next, if the phase is "Provisioned", the instance was created successfully in the cloud provider. Two things need to happen at this point for the Machine to become a Node: first, ignition must run successfully and contact the machine-config-server; second, the kubelet must issue a certificate signing request (CSR), which must be approved by the cluster-machine-approver.
First, check if there are any pending CSRs:
oc get csr
If there are none, it means the kubelet did not start successfully. If there is a pending CSR for the corresponding Machine/Node, then check the logs for the cluster-machine-approver; refer to the section Important Pod Logs above for exact steps.
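To inspect a pending CSR in more detail, or to approve it manually once you have confirmed it belongs to the expected Machine/Node (the CSR name is a placeholder):
oc describe csr <csr name>
oc adm certificate approve <csr name>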
If the kubelet did not start successfully, the problem is related to either an invalid user-data secret (this is referenced from the Machine object) or some other problem with the ignition payload and/or the operating system. In this case, you will need to consult the machine-config-operator documentation.
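To see which user-data secret a Machine references, you can read it out of the provider spec; note that the jsonpath below matches the AWS provider and may differ on other platforms:
oc get machine <machine name> -n openshift-machine-api -o jsonpath='{.spec.providerSpec.value.userDataSecret.name}'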
If the phase is "Failed", see the section A Machine is listed as 'Failed' below.
This can be caused by a variety of reasons, such as invalid cloud credentials or PodDisruptionBudgets preventing the Node from draining. The best place to look for information is the machine-controller's logs; refer to the section Important Pod Logs above for exact steps.
In this case, you'll need to take a look at the Machine's status and determine why the Machine entered a failed state. In many instances, simply deleting the Machine object is sufficient. In some other circumstances, the instance may need to be manually cleaned up directly from the cloud provider. The best place to look for information is the machine-controller's logs; refer to the section Important Pod Logs above for exact steps.
If a Machine's status is failed, something unrecoverable has happened to the Machine. It may be a Machine spec misconfiguration, or the instance may have gone missing from the cloud provider (e.g., terminated by an outside actor).
First, consult the machine-controller's logs; refer to Important Pod Logs above for exact steps.
Next, compare your findings in the machine-controller logs with the cloud provider's configuration.
You may need to restore proper credentials for the cloud provider, you may need to manually remove the instance from the cloud provider (if any), or you may simply need to correct the misconfiguration by adjusting the corresponding MachineSet or creating a new Machine object. Finally, ensure you delete the corresponding Machine object.
IMPORTANT
Ensure you have reviewed the note at the top of this document and understand that Master/Control Plane Machines are not backed by MachineSets.
oc delete machines -n openshift-machine-api <problem machine>