dataflow/WARN/2023_001
Dataflow job does not have a hot key
Product: Dataflow
Rule class: WARN - Something that is possibly wrong
Description
A Dataflow job might have a hot key, which can limit the ability of Dataflow to process elements in parallel and increase execution time.
You can search in the Logs Explorer for such jobs with the logging query:
resource.type="dataflow_step"
log_id("dataflow.googleapis.com/worker") OR log_id("dataflow.googleapis.com/harness")
severity>=WARNING
textPayload=~"A hot key(\s''.*'')? was detected in step" OR "A hot key was detected"
Remediation
To resolve this issue, check that your data is evenly distributed. If a key has disproportionately many values, consider the following courses of action:
- Rekey your data. Apply a ParDo transform to output new key-value pairs.
- For Java jobs, use the Combine.PerKey.withHotKeyFanout transform.
- For Python jobs, use the CombinePerKey.with_hot_key_fanout transform.
- Enable Dataflow Shuffle.
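The rekeying and fanout options above share one idea: spread a hot key over several sub-keys, aggregate each sub-key separately, then merge the partial results. The following is a minimal plain-Python sketch of that idea, not Dataflow or Beam code. The function name `salted_sum`, the `fanout` and `seed` parameters, and the sample `events` list are all hypothetical and exist only for illustration.

```python
import random
from collections import defaultdict


def salted_sum(pairs, fanout=8, seed=0):
    """Sum values per key in two stages, spreading each key over `fanout` sub-keys.

    Hypothetical helper for illustration only; it is not part of any Beam API.
    """
    rng = random.Random(seed)

    # Stage 1: rekey (key, value) -> ((key, salt), value) and pre-aggregate.
    # In a distributed runner, each (key, salt) pair could be processed by a
    # different worker, so no single worker receives all values of a hot key.
    partial = defaultdict(int)
    for key, value in pairs:
        salt = rng.randrange(fanout)
        partial[(key, salt)] += value

    # Stage 2: drop the salt and merge the (at most `fanout`) partial sums per key.
    totals = defaultdict(int)
    for (key, _salt), subtotal in partial.items():
        totals[key] += subtotal
    return dict(totals)


# Illustrative input: one key with far more values than the other.
events = [("hot", 1)] * 1000 + [("cold", 1)] * 3
result = salted_sum(events)
```

This only works when the aggregation can be applied to partial results and then merged (sums, counts, min/max, and similar). In a Beam Python pipeline, the same effect would typically be requested with something like `beam.CombinePerKey(sum).with_hot_key_fanout(8)`, where the fanout value of 8 is an arbitrary choice here.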