This repository was archived by the owner on Sep 4, 2024. It is now read-only.

Prevent creation of duplicate jobs in Databricks #76

Merged

Conversation

Hang1225
Contributor

@Hang1225 Hang1225 commented Apr 11, 2024

Previously, Airflow DAG runs could inadvertently create duplicate jobs in the Databricks workspace even when the job already existed. The root cause was in how we checked for an existing job: _get_job_by_name in workflow.py queried the Databricks REST API using the list_jobs() method. That API is paginated, so a single call returns only a limited set of jobs. Consequently, if the job name was not found in the first page of results, the lookup came back empty and a duplicate job could be created.

To address this issue, this PR uses the built-in job name filter of the Databricks REST API within the list_jobs() method. With the filter applied, the API returns the jobs matching the given name, so an existing job is found and no duplicate is created in the Databricks workspace.
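To illustrate the difference between the two lookups, here is a minimal sketch. It does not call Databricks or the provider's hook: `fake_list_jobs`, `FAKE_JOBS`, `PAGE_SIZE`, and both lookup functions are hypothetical stand-ins that only mimic a page-size cap and an optional name filter.

```python
from typing import Dict, List, Optional

# Hypothetical in-memory stand-in for a paginated "list jobs" endpoint.
# This is NOT the Databricks API or the provider's hook; it only mimics
# a page-size cap and an optional exact-name filter.
FAKE_JOBS: List[Dict] = [
    {"job_id": i, "settings": {"name": f"job_{i}"}} for i in range(1, 61)
]
PAGE_SIZE = 25  # assumed page size, chosen for illustration only


def fake_list_jobs(name: Optional[str] = None, offset: int = 0) -> List[Dict]:
    """Return at most PAGE_SIZE jobs, optionally filtered by exact name."""
    jobs = FAKE_JOBS
    if name is not None:
        jobs = [j for j in jobs if j["settings"]["name"] == name]
    return jobs[offset:offset + PAGE_SIZE]


def get_job_first_page_only(job_name: str) -> Optional[Dict]:
    """Problem pattern: scan only the first page for a matching name."""
    for job in fake_list_jobs():
        if job["settings"]["name"] == job_name:
            return job
    return None


def get_job_with_name_filter(job_name: str) -> Optional[Dict]:
    """Fixed pattern: let the listing filter by name, then take the match."""
    matches = fake_list_jobs(name=job_name)
    return matches[0] if matches else None


if __name__ == "__main__":
    print(get_job_first_page_only("job_40"))
    print(get_job_with_name_filter("job_40"))
```

With 60 fake jobs and a page size of 25, the first lookup misses `job_40` and returns `None`, which is the condition under which a duplicate would be created. The filtered lookup finds it.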

closes: #75

Collaborator

@pankajkoti pankajkoti left a comment


LGTM. @Hang1225, I just want to check: were we able to test this fix end to end to verify that it resolves the issue?

We can merge this once you confirm the test results. Thanks for your contribution again.

@Hang1225
Contributor Author

Hi @pankajkoti, thanks for your review. I've thoroughly tested this change in our development environment, testing 5 different jobs both before and after implementing the updates. This included testing 2 existing jobs that previously couldn't return a job id via the original REST API call. I can confirm that the changes successfully resolve the issue. We're ready to proceed with the merge. Thanks again for your support.

@pankajkoti pankajkoti merged commit e9f2d38 into astronomer:main Apr 12, 2024
31 checks passed
@pankajkoti pankajkoti changed the title from "Fix of Duplicate Job Creation in Databricks During Airflow DAG Runs" to "Prevent creation of duplicate jobs in Databricks" Apr 12, 2024
@pankajkoti
Collaborator

Thanks @Hang1225 for the contribution. The PR has been merged and will be included in the next release.

I modified the PR title and description a bit to elaborate it further 🙌🏽

@Hang1225
Contributor Author

Thank you again @pankajkoti! Would you be able to share the timeline for the next release? I'd like to update our teams on when to expect the fix.

@pankajkoti
Collaborator

Hi @Hang1225, thanks. I will work on the release soon. The expected ETA is before EOD tomorrow, 17th April 2024 (IST).

@tatiana tatiana mentioned this pull request Apr 16, 2024
@pankajkoti
Collaborator

Hi @Hang1225, we just released https://pypi.org/project/astro-provider-databricks/0.2.2/, which includes this PR. Please try it out and let us know how it works. Thanks again for contributing this fix!

Development

Successfully merging this pull request may close these issues.

Duplicate Job Creation in Databricks During Airflow DAG Runs
2 participants