composer/ERR/2023_004

Cloud Composer DAGs are getting zombie errors

Product: Cloud Composer
Rule class: ERR - Something that is very likely to be wrong

Description

Based on heartbeats, the Airflow scheduler is able to detect abnormally terminated tasks. If a task's heartbeat is missing for an extended period of time, the task is detected as a zombie and the following message is written to the logs:

Detected zombie job: {'full_filepath': '/home/airflow/gcs/dags/xxxx.py', 'processor_subdir': '/home/airflow/gcs/dags', 'msg': "{'DAG Id': 'DAGName', 'Task Id': 'TaskName', 'Run Id': 'scheduled__2023-12-12Txx:xx:xx.xx+00:00', 'Hostname': 'airflow-worker-xxxxx', 'External Executor Id': 'xxxx-xxxx-xxxx'}", 'simple_task_instance': <airflow.models.taskinstance.SimpleTaskInstance object at xxxxx>, 'is_failure_callback': True}

This can be verified in Cloud Logging using the following logging filter:

resource.type="cloud_composer_environment"
severity>=ERROR
log_id("airflow-scheduler")
textPayload:"Detected zombie job"

Remediation

The usual cause of zombie tasks is resource pressure in your environment's cluster. Under pressure, an Airflow worker might not be able to report the status of a task in time, so the scheduler marks the task as a zombie. To avoid zombie tasks, assign more resources to your environment by following the optimization and [scaling](https://cloud.google.com/composer/docs/composer-2/environment-scaling) steps. As a temporary workaround, you may consider increasing the [scheduler]scheduler_zombie_task_threshold Airflow configuration option; however, this only changes how long the scheduler waits before detecting a zombie and does not prevent zombies from occurring.
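
As a minimal sketch of that workaround, the threshold can be raised with an Airflow configuration override via gcloud; ENVIRONMENT_NAME and LOCATION are placeholders, and 600 seconds is an example value (the Airflow default is 300 seconds):

# Raise the zombie detection threshold from the default 300 s to 600 s.
# ENVIRONMENT_NAME and LOCATION are placeholders for your environment.
gcloud composer environments update ENVIRONMENT_NAME \
  --location=LOCATION \
  --update-airflow-configs=scheduler-scheduler_zombie_task_threshold=600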

Further information

Troubleshooting Zombie tasks (Cloud Composer documentation)
Zombie tasks (Apache Airflow documentation)