Prevent creation of duplicate jobs in Databricks #76
…b functionality is correctly tested
LGTM. @Hang1225 just wish to check once if we were able to test this fix end to end to verify this resolves the issue?
We can merge this once you confirm the test results. Thanks for your contribution again.
Hi @pankajkoti, thanks for your review. I've thoroughly tested this change in our development environment, testing 5 different jobs both before and after implementing the updates. This included testing 2 existing jobs that previously couldn't return a job id via the original REST API call. I can confirm that the changes successfully resolve the issue. We're ready to proceed with the merge. Thanks again for your support.
Thanks @Hang1225 for the contribution. The PR has been merged and will be included in the next release. I modified the PR title and description a bit to elaborate it further 🙌🏽
Thank you again @pankajkoti! Would you be able to share the timeline for the next release? I'd like to update our teams on when to expect the fix.
Hi @Hang1225 thanks, I will work on the release soon. Expected ETA before EOD tomorrow 17th April, 2024 IST
Hi @Hang1225, we just released https://pypi.org/project/astro-provider-databricks/0.2.2/ which includes this PR. Please try it out and let us know how it works. Thanks again for contributing this fix!
Previously, Airflow DAG executions could inadvertently create duplicate jobs in the Databricks workspace, even when the job already existed. The root cause of this issue is that we checked if a job exists by querying the Databricks REST API using the `list_jobs()` method in `workflow.py/_get_job_by_name`. However, the REST API returns a limited set of jobs as a result of the paginated API, leading to incomplete results. Consequently, if the job name was not found in the first page of results retrieved by the `list_jobs` API, a duplicate job could be created.

To address this issue, this PR leverages the built-in job name filtering feature of the Databricks REST API within the `list_jobs()` method. This ensures that the API returns jobs with the given name, effectively preventing the creation of duplicate jobs in the Databricks workspace.

closes: #75
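
The difference between the two lookups can be illustrated with a minimal, self-contained sketch. It does not reproduce the code in `workflow.py`: `FakeJobsApi`, `find_job_unfiltered`, and `find_job_filtered` are hypothetical names, and the fake client only assumes that the listing endpoint returns one page at a time and accepts an optional `name` filter, as described above.

```python
from typing import Any, Dict, List, Optional


class FakeJobsApi:
    """Hypothetical in-memory stand-in for a paginated jobs-list endpoint."""

    def __init__(self, jobs: List[Dict[str, Any]], page_size: int = 2) -> None:
        self._jobs = jobs
        self._page_size = page_size

    def list_jobs(self, name: Optional[str] = None) -> Dict[str, Any]:
        # Apply the (assumed) server-side name filter, then return one page only.
        matches = [
            job for job in self._jobs
            if name is None or job["settings"]["name"] == name
        ]
        page = matches[: self._page_size]
        return {"jobs": page, "has_more": len(matches) > len(page)}


def find_job_unfiltered(api: FakeJobsApi, job_name: str) -> Optional[Dict[str, Any]]:
    # Behaviour described as the root cause: only the first page is scanned.
    for job in api.list_jobs().get("jobs", []):
        if job["settings"]["name"] == job_name:
            return job
    return None


def find_job_filtered(api: FakeJobsApi, job_name: str) -> Optional[Dict[str, Any]]:
    # Behaviour described as the fix: the filter is passed to the listing call.
    for job in api.list_jobs(name=job_name).get("jobs", []):
        if job["settings"]["name"] == job_name:
            return job
    return None


api = FakeJobsApi(
    [
        {"job_id": 1, "settings": {"name": "job-a"}},
        {"job_id": 2, "settings": {"name": "job-b"}},
        {"job_id": 3, "settings": {"name": "job-c"}},
    ],
    page_size=2,
)
missed = find_job_unfiltered(api, "job-c")  # job-c sits beyond the first page
found = find_job_filtered(api, "job-c")     # the name filter brings it onto page one
```

With three jobs and a page size of two, the unfiltered lookup misses `job-c` and a caller would go on to create a duplicate, while the filtered lookup returns the existing job.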