gke/ERR/2023_011

GKE Metadata Server isn’t reporting errors for pod IP not found

Product: Google Kubernetes Engine
Rule class: ERR - Something that is very likely to be wrong

Description

The gke-metadata-server DaemonSet uses pod IP addresses to match client requests to Kubernetes Service Accounts. Pod IP not found errors may indicate a misconfiguration or a workload that is not compatible with GKE Workload Identity.

You can use the following Cloud Logging filter to find errors in the GKE Metadata Server logs:

resource.type="k8s_container"
log_id("stderr")
resource.labels.container_name="gke-metadata-server"
severity=ERROR

Examples from gke-metadata-server –component-version=0.4.276:

[conn-id:bc54e859ac0e7269] Unable to find pod: generic::not_found: retry budget exhausted (50 attempts): ip “169.254.123.2” not recorded in the snapshot

[conn-id:bc54e859ac0e7269 rpc-id:29b6f8cbdbdaafb5] Caller is not authenticated

Older example from gke-metadata-server –component-version=0.4.275:

[ip:172.17.0.2 pod:/ rpc-id:387a551d4b506f31] Failed to find Workload Identity configuration for pod: while retrieving pod from cache: pod "" not found

Note: 172.17.0.0/16 and 169.254.123.0/24 are the default ranges used by the Docker daemon for container networking.

Remediation

One known cause of these errors is use of the deprecated legacy logging agent on COS GKE Nodes via project metadata google-logging-enabled=true without google-logging-use-fluentbit=true which was introduced in COS Milestone 105. Enabling the fluent-bit agent will automatically update all existing nodes and prevent them from generating pod not found error messages. In COS Milestone 109 the fluent-bit agent will become the default when enabling logging via project metadata.

Another cause could be docker-in-docker pods (often used by CI/CD systems to build containers) running with hostNetwork: true or other docker based VM agents running outside Kubernetes. If you identify a workload that is not compatible with the GKE Metadata Server, you can create a nodepool with --workload-metadata=GCE_METADATA and use taints/tolerations to specify where the workload should run.

Further information

See Workload Identity docs for more restrictions and alternatives.