dataproc/Spark Job Failures

Provides a comprehensive analysis of common issues that cause Dataproc Spark job failures.

Product: Cloud Dataproc
Kind: Debugging Tree

Description

This runbook investigates a range of potential problems with Dataproc Spark jobs on Google Cloud Platform. By conducting a series of checks, it aims to pinpoint the root cause of a Spark job failure.

The following areas are examined:

  • Cluster version supportability: Evaluates whether the job ran on a supported cluster image version.
  • Permissions: Checks for permission-related issues at the cluster and GCS bucket level.
  • OOM: Checks for Out-Of-Memory issues affecting the Spark job on master or worker nodes (a manual log-search sketch follows this list).
  • Logs: Checks other logs for shuffle failures, broken pipes, YARN runtime exceptions, and import failures.
  • Throttling: Checks whether the job was throttled and provides the exact reason.
  • GCS Connector: Evaluates possible issues with the GCS Connector.
  • BigQuery Connector: Evaluates possible issues with the BigQuery Connector, such as dependency version conflicts.

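As a manual complement to the OOM and Logs checks above, you can search the cluster's logs for common failure signatures yourself. The sketch below is one way to do this with gcloud logging read; the cluster name, project ID, and the two error strings (typical Spark OOM signatures) are assumptions, so adjust them to your job.

# A minimal sketch, assuming hypothetical names: search Dataproc cluster
# logs for common Spark OOM signatures. Replace "example-cluster" and
# "example-project" with your own cluster name and project ID.
gcloud logging read \
  'resource.type="cloud_dataproc_cluster"
   resource.labels.cluster_name="example-cluster"
   ("java.lang.OutOfMemoryError" OR "Container killed by YARN for exceeding memory limits")' \
  --project=example-project \
  --limit=20 \
  --freshness=7d
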
Executing this runbook

gcpdiag runbook dataproc/spark-job-failures \
  -p project_id=value \
  -p job_id=value \
  -p region=value \
  -p zone=value \
  -p service_account=value \
  -p cross_project=value \
  -p stackdriver=value
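
For example, a concrete invocation might look like this; all parameter values below are hypothetical placeholders (only project_id, job_id, and region are required):

gcpdiag runbook dataproc/spark-job-failures \
  -p project_id=example-project \
  -p job_id=job-1234567890 \
  -p region=us-central1 \
  -p zone=us-central1-a \
  -p stackdriver=True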

Parameters

Name            Required Default Type Help
project_id      True     None    str  The Project ID of the resource under investigation
job_id          True     None    str  The Job ID of the resource under investigation
region          True     None    str  Dataproc job/cluster Region
zone            False    None    str  Dataproc cluster Zone
service_account False    None    str  Dataproc cluster Service Account used to create the resource
cross_project   False    None    str  Cross Project ID, where the service account is located if it is not in the same project as the Dataproc cluster
stackdriver     False    False   str  Checks whether Stackdriver logging is enabled for further troubleshooting
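
If the cluster's service account lives in a different project (the cross_project parameter), you can inspect its role bindings there directly. The sketch below uses gcloud projects get-iam-policy; the project ID and service account email are hypothetical placeholders.

# Sketch with hypothetical IDs: list the roles granted to the cluster's
# service account in the cross project.
gcloud projects get-iam-policy example-cross-project \
  --flatten="bindings[].members" \
  --filter="bindings.members:serviceAccount:dataproc-sa@example-cross-project.iam.gserviceaccount.com" \
  --format="table(bindings.role)"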

Get help on available commands

gcpdiag runbook --help

Potential Steps