gke/ERR/2023_003

containerd config.toml is valid.

Product: Google Kubernetes Engine
Rule class: ERR - Something that is very likely to be wrong

Description

containerd container runtime is a crucial component of a GKE cluster that runs on all nodes. If its configuration file was customized and became invalid, containerd can’t be started and its node will stay in NotReady state.

The config file in question is /etc/containerd/config.toml.

Remediation

Usually such a change is introduced via a DaemonSet with a privileged container.

A short-term remidiation is to disable (delete) the DaemonSet and recreate all affected nodes by scaling the affected nodepools to 0 nodes and then back, to the original value.

A long-term solution is to fix the issue at the source (the application in the DaemonSet).

One of the known DaemonSets that could cause this issue is twistlock defender from Prizma Cloud security suite, which could inject configuration entries in old configuration file format to a configuration file created with a newer configuration file format.

You can use the following filter to find matching log lines:

resource.type="k8s_node"
log_id("container-runtime")
jsonPayload.MESSAGE:"containerd: failed to load TOML"

Further information