gke/ERR/2023_011
Product: Google Kubernetes Engine
Rule class: ERR - Something that is very likely to be wrong
Description
The gke-metadata-server DaemonSet uses pod IP addresses to match client requests to Kubernetes Service Accounts. "Pod IP not found" errors may indicate a misconfiguration or a workload that is not compatible with GKE Workload Identity.
You can use the following Cloud Logging filter to find errors in the GKE Metadata Server logs:
resource.type="k8s_container"
log_id("stderr")
resource.labels.container_name="gke-metadata-server"
severity=ERROR
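The filter above can also be run from a script. The following is a minimal sketch, assuming the gcloud CLI is installed and authenticated. The function names, the optional cluster_name scoping, and the limit default are illustrative choices and not part of the rule itself.

```python
import json
import subprocess


def build_filter(cluster_name=None):
    """Return the Cloud Logging filter, optionally scoped to one cluster."""
    clauses = [
        'resource.type="k8s_container"',
        'log_id("stderr")',
        'resource.labels.container_name="gke-metadata-server"',
        "severity=ERROR",
    ]
    if cluster_name:
        # Assumption: narrowing by cluster is useful; the rule does not require it.
        clauses.append(f'resource.labels.cluster_name="{cluster_name}"')
    return " AND ".join(clauses)


def read_errors(project_id, cluster_name=None, limit=50):
    """Run `gcloud logging read` and return the matching entries as a list."""
    cmd = [
        "gcloud", "logging", "read", build_filter(cluster_name),
        f"--project={project_id}",
        f"--limit={limit}",
        "--format=json",
    ]
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return json.loads(result.stdout or "[]")
```

Calling read_errors with a project ID returns a list of log entry dictionaries; the message of each entry is expected in its textPayload field.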
Examples from gke-metadata-server --component-version=0.4.276:
[conn-id:bc54e859ac0e7269] Unable to find pod: generic::not_found: retry budget exhausted (50 attempts): ip "169.254.123.2" not recorded in the snapshot
[conn-id:bc54e859ac0e7269 rpc-id:29b6f8cbdbdaafb5] Caller is not authenticated
Older example from gke-metadata-server --component-version=0.4.275:
[ip:172.17.0.2 pod:/ rpc-id:387a551d4b506f31] Failed to find Workload Identity configuration for pod: while retrieving pod from cache: pod "" not found
Note: 172.17.0.0/16 and 169.254.123.0/24 are the default ranges used by the Docker daemon for container networking.
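To triage many log lines at once, it can help to extract the client IP from each message and check it against the Docker default ranges from the note above. This is a rough sketch: the regular expression is inferred from the two example formats shown here and may not match messages from other component versions, and the function names are illustrative.

```python
import ipaddress
import re

# Default Docker daemon ranges named in the note above.
DOCKER_DEFAULT_RANGES = [
    ipaddress.ip_network("172.17.0.0/16"),
    ipaddress.ip_network("169.254.123.0/24"),
]

# Covers both example formats: `ip "169.254.123.2"` and `[ip:172.17.0.2 ...`.
IP_PATTERN = re.compile(r'\bip[: ]"?(\d{1,3}(?:\.\d{1,3}){3})"?')


def extract_client_ip(message):
    """Return the client IP found in a log message, or None if there is none."""
    match = IP_PATTERN.search(message)
    if match is None:
        return None
    try:
        return ipaddress.ip_address(match.group(1))
    except ValueError:
        # Looked like an IP but was not valid (for example an octet above 255).
        return None


def in_docker_default_range(addr):
    """Return True if addr falls inside one of the Docker default ranges."""
    return addr is not None and any(addr in net for net in DOCKER_DEFAULT_RANGES)
```

An address inside one of these ranges suggests the request came from a container on a Docker bridge network rather than from a pod known to Kubernetes.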
Remediation
One known cause of these errors is use of the deprecated legacy logging agent on COS GKE nodes, enabled via the project metadata google-logging-enabled=true without google-logging-use-fluentbit=true. The google-logging-use-fluentbit option was introduced in COS Milestone 105. Enabling the fluent-bit agent will automatically update all existing nodes and prevent them from generating pod not found error messages. In COS Milestone 109 the fluent-bit agent will become the default when enabling logging via project metadata.
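The metadata condition described above can be checked programmatically. The sketch below assumes the input is the JSON output of `gcloud compute project-info describe --format=json`, where metadata entries are listed under commonInstanceMetadata.items. It looks only at project-level metadata and does not account for the COS milestone running on each node; the function names are illustrative.

```python
def metadata_from_project_info(project_info):
    """Flatten commonInstanceMetadata.items into a {key: value} dict."""
    items = project_info.get("commonInstanceMetadata", {}).get("items", [])
    return {item["key"]: item.get("value", "") for item in items}


def uses_legacy_logging_agent(metadata):
    """Return True if logging is enabled without the fluent-bit agent.

    This mirrors the condition in the text: google-logging-enabled=true
    set without google-logging-use-fluentbit=true.
    """
    logging_enabled = metadata.get("google-logging-enabled", "").lower() == "true"
    fluentbit = metadata.get("google-logging-use-fluentbit", "").lower() == "true"
    return logging_enabled and not fluentbit
```

If the check returns True, setting google-logging-use-fluentbit=true in the project metadata (for example with `gcloud compute project-info add-metadata`) is the remediation described above.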
Another cause could be docker-in-docker pods (often used by CI/CD systems to build containers) running with hostNetwork: true, or other Docker-based VM agents running outside Kubernetes. If you identify a workload that is not compatible with the GKE Metadata Server, you can create a node pool with --workload-metadata=GCE_METADATA and use taints and tolerations to control where the workload runs.
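As an illustration of the dedicated node pool approach, the sketch below assembles a gcloud command and a matching pod scheduling fragment. The taint key and value (workload-metadata=gce) and the pool and cluster names are hypothetical placeholders. The command omits location flags and relies on the gcloud configuration for the region or zone.

```python
import shlex

# Hypothetical taint used to keep other workloads off the dedicated pool.
TAINT_KEY = "workload-metadata"
TAINT_VALUE = "gce"


def node_pool_create_command(pool_name, cluster_name):
    """Build a gcloud command for a node pool that uses the GCE metadata server."""
    return [
        "gcloud", "container", "node-pools", "create", pool_name,
        f"--cluster={cluster_name}",
        "--workload-metadata=GCE_METADATA",
        f"--node-taints={TAINT_KEY}={TAINT_VALUE}:NoSchedule",
    ]


def pod_scheduling_fragment(pool_name):
    """Return pod spec fields that place a workload on the dedicated pool.

    The toleration permits scheduling onto the tainted nodes; the
    nodeSelector, using the GKE node pool label, requires it.
    """
    return {
        "nodeSelector": {"cloud.google.com/gke-nodepool": pool_name},
        "tolerations": [{
            "key": TAINT_KEY,
            "operator": "Equal",
            "value": TAINT_VALUE,
            "effect": "NoSchedule",
        }],
    }


if __name__ == "__main__":
    # Print the command for review instead of running it.
    print(shlex.join(node_pool_create_command("legacy-pool", "my-cluster")))
```

A toleration alone only allows the pod onto the tainted nodes, which is why the fragment also includes a nodeSelector.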
Further information
See the Workload Identity documentation for further restrictions and alternatives.