This repository was archived by the owner on Sep 4, 2024. It is now read-only.

Duplicate Job Creation in Databricks During Airflow DAG Runs #75

Closed
Hang1225 opened this issue Apr 11, 2024 · 0 comments · Fixed by #76
Comments

@Hang1225
Contributor

Issue

Our teams at HealthPartners are encountering a recurring issue where each execution of an Airflow DAG creates a new job in Databricks, even though the job already exists in the workspace.

This issue is most likely caused by the Databricks REST API returning at most 20 jobs per request by default. When a workspace contains more than 20 jobs, additional API requests using the `next_page_token` from the previous response are required to fetch the complete job list.
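To illustrate the pagination behavior described above, here is a minimal sketch of exhaustively listing jobs by following `next_page_token`. The `fetch_page` callable stands in for a real HTTP call to the `/api/2.1/jobs/list` endpoint; the field names (`jobs`, `has_more`, `next_page_token`) follow the Databricks REST API, but this is an illustration, not the plugin's actual code.

```python
def list_all_jobs(fetch_page):
    """Collect every job by following next_page_token until the API is exhausted.

    `fetch_page(page_token)` is assumed to return one decoded JSON page,
    e.g. from GET /api/2.1/jobs/list?page_token=<token>.
    """
    jobs = []
    page_token = None
    while True:
        page = fetch_page(page_token)
        jobs.extend(page.get("jobs", []))
        page_token = page.get("next_page_token")
        # Stop once the API reports no further pages.
        if not page.get("has_more") or not page_token:
            break
    return jobs
```

A caller that only inspects the first page, as `_get_job_by_name` effectively did, would miss any job beyond the default page size.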

Proposed Solution

Under the `_get_job_by_name` function in operators/workflow.py:

  • Directly pass the `job_name` parameter to the `jobs_api.list_jobs()` method to leverage the API's built-in job-name filtering. This is more efficient than fetching an exhaustive job list and filtering for the specific job afterwards.
pankajkoti pushed a commit that referenced this issue Apr 12, 2024
Previously, Airflow DAG executions could inadvertently create duplicate jobs in the Databricks workspace, even when the job already existed. The root cause is that we checked whether a job exists by querying the Databricks REST API via the `list_jobs()` method in `workflow.py/_get_job_by_name`. Because the API is paginated, it returns only a subset of jobs per call, leading to incomplete results. Consequently, if the job name was not found in the first page of results retrieved by the `list_jobs` API, a duplicate job could be created.

To address this issue, this PR leverages the built-in job name filtering feature of the Databricks REST API within the `list_jobs()` method. This ensures that the API returns jobs with the given name, effectively preventing the creation of duplicate jobs in the Databricks workspace.

closes: #75