composer/WARN/2023_009
Cloud Composer Intermittent Task Failure during Scheduling
Product: Cloud Composer
Rule class: WARN - Something that is possibly wrong
Description
This issue appears in the Airflow scheduler for a task instance during task execution. However, the logs do not explain the cause of the task failure, and both the Airflow workers and the Airflow scheduler look relatively healthy.
The error message on the Airflow scheduler may look like the following:
Executor reports task instance <TaskInstance: xx.xxxx scheduled__2022-04-21T06:00:00+00:00 [queued]> finished (failed) although the task says its queued. (Info: None) Was the task killed externally?
Alternatively, the Airflow worker may report an error similar to the following:
Log file is not found: gs://$BUCKET_NAME/logs/$DAG_NAME/$TASK_NAME/2023-01-25T05:01:17.044759+00:00/1.log.
The task might not have been executed or worker executing it might have finished abnormally (e.g. was evicted).
This can be verified in Cloud Logging using the following logging filter:
resource.type="cloud_composer_environment"
severity>=ERROR
log_id("airflow-scheduler")
textPayload:"[queued]> finished (failed) although the task says its queued."
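The filter above can also be assembled programmatically, for example to narrow it to one environment or a recent time window before pasting it into the Logs Explorer. The following is a minimal sketch, not part of the original rule: the function name `build_filter` and its parameters are hypothetical, and the extra `resource.labels.environment_name` and `timestamp` clauses are assumptions about how one might want to scope the query.

```python
from datetime import datetime, timezone

# Substring of the scheduler error, copied from the filter above.
SCHEDULER_ERROR = "[queued]> finished (failed) although the task says its queued."


def build_filter(environment_name=None, since=None):
    """Return a Cloud Logging filter string for the scheduler error.

    environment_name and since are optional, hypothetical narrowing
    parameters; with both omitted the result matches the filter above.
    """
    clauses = [
        'resource.type="cloud_composer_environment"',
        "severity>=ERROR",
        'log_id("airflow-scheduler")',
        f'textPayload:"{SCHEDULER_ERROR}"',
    ]
    if environment_name:
        # Assumption: restrict the query to a single Composer environment.
        clauses.append(f'resource.labels.environment_name="{environment_name}"')
    if since:
        # Assumption: only look at entries at or after this timezone-aware timestamp.
        clauses.append(f'timestamp>="{since.isoformat()}"')
    # Clauses on separate lines are combined with an implicit AND.
    return "\n".join(clauses)


if __name__ == "__main__":
    print(build_filter("my-environment", datetime(2023, 1, 25, tzinfo=timezone.utc)))
```

The printed string can be used wherever a Cloud Logging filter is accepted; `my-environment` is a placeholder name.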
Remediation
There can be multiple reasons for the failure. We strongly recommend reviewing the following methods to mitigate the issue:
- The failure could be due to a longstanding issue in Airflow. Proactively implement appropriate retry strategies at both the task and DAG levels.
- Enable task retries. Starting with Composer version 1.16.13, Airflow 2 performs two retries for a failed task by default.
- Provision enough resources for workers.
- Make sure [celery]worker_concurrency is not too high.
- Optimize top-level code and avoid unnecessary code.
- Reduce DAG complexity.
- Review the Airflow community recommendations for dynamic DAG generation.
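
As an illustration of the retry recommendation in the list above, retries are commonly configured through a DAG's `default_args`. This is a minimal sketch under assumptions: the specific values (two retries, a five-minute delay, a thirty-minute cap) and the DAG id `example_dag` are illustrative only and should be tuned to the workload.

```python
from datetime import timedelta

# Retry settings applied to every task in a DAG via default_args.
# The values are illustrative assumptions, not recommendations from the rule.
default_args = {
    "retries": 2,                              # retry a failed task twice
    "retry_delay": timedelta(minutes=5),       # wait before the first retry
    "retry_exponential_backoff": True,         # lengthen the wait on later retries
    "max_retry_delay": timedelta(minutes=30),  # upper bound on the wait
}

# In a DAG file deployed to the environment, the dictionary would be passed
# to the DAG constructor (hypothetical DAG id shown):
#
#   from airflow import DAG
#   dag = DAG(dag_id="example_dag", default_args=default_args)
```

Individual tasks can still override these values by passing `retries` or `retry_delay` directly to an operator.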