When troubleshooting a Master/Control Plane Machine, it is essential to first determine the health of the etcd members. On rare occasions, the Node may go unready on multiple master Machines while one or more of those Machines still has a healthy etcd member. Before selecting a master Machine to delete, you must confirm that deleting it will not compromise etcd quorum.
Additionally, Master/Control Plane Machines are not currently managed by MachineSets, so always take a backup copy of a master Machine object before deleting it so that you can recreate it easily.
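For example, the Machine object can be saved to a file before deletion (the file name here is arbitrary):
oc get machine -n openshift-machine-api <master machine> -o yaml > master-machine-backup.yaml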
For more information on identifying etcd member health, refer to these steps: https://docs.openshift.com/container-platform/4.5/backup_and_restore/replacing-unhealthy-etcd-member.html#restore-identify-unhealthy-etcd-member_replacing-unhealthy-etcd-member
For in-depth information and steps to replace a Master/Control Plane machine, refer to this guide: https://docs.openshift.com/container-platform/4.5/backup_and_restore/replacing-unhealthy-etcd-member.html
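As a quick sketch of the kind of check described in those steps, you can inspect member health from inside one of the running etcd pods (pod names follow the pattern etcd-<master node name> and will differ per cluster):
oc get pods -n openshift-etcd | grep etcd
oc rsh -n openshift-etcd etcd-<master node name>
etcdctl member list -w table
etcdctl endpoint health --cluster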
- Document Purpose
- Important Pod Logs
- I created a Machine (or scaled up a MachineSet) but I didn't get a Node
- I deleted a Machine (or scaled down a MachineSet) but the Machine and/or Node did not go away
- A Machine is listed as 'Failed'
This document outlines the steps to investigate the current status of an individual Machine or MachineSet that does not appear to be creating Machines, and therefore no additional Nodes are joining the cluster. Troubleshooting an unhealthy Node (a Machine that has already joined the cluster as a Node) is outside the scope of this document.
Nearly everything related to the machine-api is viewable in the openshift-machine-api namespace. You will want to familiarize yourself with the output of:
oc get deployments -n openshift-machine-api
oc get pods -n openshift-machine-api
The machine-api-controllers-* pod has several containers running: machineset-controller, machine-controller, nodelink-controller, and the machine-healthcheck-controller.
To check the logs for a particular component, use
oc logs -n openshift-machine-api machine-api-controllers-<random suffix> -c <controller-name>
The random suffix is automatically generated by the machine-api-controllers deployment and is most easily found using the output of
oc get pods -n openshift-machine-api
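For example, if the pod is named machine-api-controllers-6d9f8c7b5d-x2k4q (a made-up suffix for illustration), you would check the machine-controller container's logs with:
oc logs -n openshift-machine-api machine-api-controllers-6d9f8c7b5d-x2k4q -c machine-controller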
Kubelets on instances provisioned by the machine-api will automatically attempt to join the cluster by issuing a CSR (certificate signing request). Under normal circumstances, these CSRs should be approved automatically. On rare occasions, you may encounter a bug where a CSR is stuck in a pending state and the kubelet is unable to join the cluster successfully.
To view the cluster-machine-approver logs, perform the following:
oc get pods -n openshift-cluster-machine-approver
Note the name of the machine-approver-* pod. The suffix will be randomly generated by the pod's deployment controller.
Next, get the logs for the machine-approver-controller container:
oc logs -n openshift-cluster-machine-approver machine-approver-<random suffix> -c machine-approver-controller
Be sure to replace <random suffix> above with the real suffix from the previous step.
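Alternatively, assuming the deployment is named machine-approver (verify with oc get deployments -n openshift-cluster-machine-approver), you can let oc resolve a pod for you instead of looking up the suffix:
oc logs -n openshift-cluster-machine-approver deployment/machine-approver -c machine-approver-controller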
First, if you scaled a MachineSet, check that a new Machine object was created successfully; the sketch below shows how to inspect the MachineSet's status and its Machines. If there is not a new Machine, then check the machineset-controller's logs; refer to the section Important Pod Logs above for exact steps.
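As a starting point, compare the desired and current replica counts on the MachineSet and list its Machines (the MachineSet name is a placeholder):
oc get machinesets -n openshift-machine-api
oc describe machineset <machineset name> -n openshift-machine-api
oc get machines -n openshift-machine-api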
Next, check the Machine object's status. There may be status conditions that explain the problem; be sure to check the phase as well.
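For example, to print just the phase, or the full status including conditions (the Machine name is a placeholder):
oc get machine <machine name> -n openshift-machine-api -o jsonpath='{.status.phase}'
oc describe machine <machine name> -n openshift-machine-api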
If the phase is "Provisioning", it means that the cloud provider has not created the corresponding instance yet for one reason or another. This could be quota, misconfiguration, or some other problem. Check the machine-controller's logs; refer to the section Important Pod Logs above for exact steps.
Next, if the phase is "Provisioned", the instance was created successfully in the cloud provider. Two things need to happen at this point for the Machine to become a Node: first, ignition must run successfully and contact the machine-config-server; second, the kubelet must issue a certificate signing request (CSR), which must be approved by the cluster-machine-approver.
First, check if there are any pending CSRs:
oc get csr
If there are none, it means the kubelet did not start successfully. If there is a pending CSR for the corresponding Machine/Node, then check the logs for the cluster-machine-approver; refer to the section Important Pod Logs above for exact steps.
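To inspect a pending CSR in more detail, or to approve it manually once you have confirmed it belongs to the expected Machine/Node (the CSR name is a placeholder):
oc describe csr <csr name>
oc adm certificate approve <csr name>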
If the kubelet did not start successfully, the problem is related to either an invalid user-data secret (this is referenced from the Machine object) or some other problem with the ignition payload and/or the operating system. In this case, you will need to consult the machine-config-operator documentation.
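To see which user-data secret a Machine references, you can read it out of the provider spec; note that the jsonpath below matches the AWS provider and may differ on other platforms:
oc get machine <machine name> -n openshift-machine-api -o jsonpath='{.spec.providerSpec.value.userDataSecret.name}'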
If the phase is "Failed", see the section A Machine is listed as 'Failed' below.
This can be caused by a variety of reasons, such as invalid cloud credentials or PodDisruptionBudgets preventing the Node from draining. The best place to look for information is the machine-controller's logs; refer to the section Important Pod Logs above for exact steps.
In this case, you'll need to take a look at the Machine's status and determine why the Machine entered a failed state. In many instances, simply deleting the Machine object is sufficient. In some other circumstances, the instance may need to be manually cleaned up directly from the cloud provider. The best place to look for information is the machine-controller's logs; refer to the section Important Pod Logs above for exact steps.
If a Machine's status is failed, something unrecoverable has happened to the Machine. It may be a Machine spec misconfiguration, or the instance may have gone missing from the cloud provider (e.g., terminated by an outside actor).
First, consult the machine-controller's logs; refer to Important Pod Logs above for exact steps.
Next, compare your findings in the machine-controller logs with the cloud provider's configuration.
You may need to restore proper credentials for the cloud provider, you may need to manually remove the instance from the cloud provider (if any), or you may simply need to correct the misconfiguration by adjusting the corresponding MachineSet or creating a new Machine object. Finally, ensure you delete the corresponding Machine object.
IMPORTANT
Ensure you have reviewed the note at the top of this document and understand that Master/Control Plane Machines are not backed by MachineSets.
oc delete machines -n openshift-machine-api <problem machine>