PyDeequ support to Apache Spark 3.4.0 (and ideally 3.5.0) #192
Comments
We treat backward compatibility very seriously, as every AWS API and AWS-owned library does. Dropping support for EOL Spark versions can be an option, but it needs a bit more research. I don't think it's very hard to fix #169, but the change should be made in Deequ Scala land (adding overloaded functions with the old parameters). We currently do not have a date. |
As a workaround, you can set env var |
I see. But is native support for newer Spark versions planned at all? If so, when is it scheduled? |
Is there a date for when this update might be expected? I am currently working on a project that uses PySpark 3.4.1 in Databricks and I would like to use PyDeequ. |
Hello! Just checking in to see if there's any news on when we might expect that new feature to drop? Any rough idea of a release date? I'm using this library in my project and need to upgrade to Spark 3.4 since we're on Databricks Runtime 13.3 LTS, and I would like to keep using this. Thanks! |
Hey guys, |
@chenliu0831 do you have a release date for spark 3.5 upgrades |
I think we are getting very close: #203 (only 2 test failures, down to a dep issue). |
@chenliu0831 how is it looking buddy? |
@hardiktalati the fix for the 2 failures would need a Deequ release, I think. Please be patient and I will post updates. I think it should solve both 3.4 & 3.5 and we may release it together. |
Hello! Any fresh news? I know it's complicated, and we have to be patient. I'm just checking if there is an approximate release date, because my project is blocked and I would like to keep using this. Thanks 😊 |
@chenliu0831 Also from my side, Spark 3.5 support is highly awaited 😃 I've been observing this thread for some time now. No Spark 3.5 support would be a showstopper for using PyDeequ and rather an argument for Great Expectations :) Looking forward to it, and thanks for moving this topic forward. |
@chenliu0831 bro, you mentioned it's nearly done. How far along is it? |
@chenliu0831 any updates?? it is more than a month now.. |
@chenliu0831 Would appreciate a response, we are blocked due to the pending upgrade. |
I'm evaluating PyDeequ vs. Great Expectations, and after reading all this, PyDeequ seems very unreliable. How can you take over a year to add support for Spark 3.4? |
@chenliu0831 at least respond back ... so that we can make a decision. |
We developed a DQ solution based on PyDeequ. |
@hardiktalati @D2Bull @sqlkabouter We apologize for the inconvenience. We are actively working on the upgrade to Spark 3.4 and we aim to finish it as soon as possible. The upgrade to Spark 3.5 will follow right after. |
|
@hardiktalati At the moment, we don't have a date to share. We are trying to root cause the failure of two unit tests. Upgrading PyDeequ to Spark 3.4 and using Deequ's 2.0.7 Spark 3.4 library is resulting in the following error.
Once the RCA is done, if a new release of Deequ is required, then it can take a week until PyDeequ is fixed. If the fix is within PyDeequ itself, the new version with Spark 3.4 can be released within a few days. Once the Spark 3.4 support is added, we will work on Spark 3.5 next. |
We took a different approach from my previous message. It looks like we might need a new Deequ release to upgrade the Breeze dependency for Spark 3.4. In light of that, I created a PR that adds Spark 3.5 support: #210 |
@rdsharma26 Let us know once you have released a release candidate :) Btw, is it worth supporting older Spark versions? I think maintenance is 18 months. I would probably cut releases for older versions at some point, especially if there are breaking changes :) |
Spark 3.5 support has been added in https://pypi.org/project/pydeequ/1.4.0/ 🚀 @datanikkthegreek That's a great point. We did recently drop support for Spark 2.4. Spark 3.4 is still a relatively new version, so we will add support for it soon. |
What am I missing? It seems that the latest announcement is regarding Spark 3.30. 🎉 Announcements 🎉 |
@D2Bull The README has been updated in the |
Hi folks. I'm in the middle of migrating my data quality pipeline from Spark 3.1 to 3.5. Unfortunately, I have no means to change my environment, and I need to run my code on Spark 3.5. Things are broken with pydeequ 1.4.0, mostly because, since Spark 3.4: "... Spark Connect supports most PySpark APIs, including DataFrame, Functions, and Column. However, some APIs such as SparkContext and RDD are not supported" (source). This causes things like this to break. Any thoughts on it? |
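The limitation quoted above can be illustrated with a minimal sketch. It assumes (this is not confirmed anywhere in the thread) that a Spark Connect session raises an error when `sparkContext` is accessed. `has_spark_context`, `ClassicSession`, and `ConnectSession` are hypothetical names used only for illustration; they are not part of PySpark or PyDeequ, and the two classes are stand-ins rather than real sessions.

```python
class ClassicSession:
    """Hypothetical stand-in for a classic session that exposes a SparkContext."""

    @property
    def sparkContext(self):
        return object()  # placeholder for a real SparkContext


class ConnectSession:
    """Hypothetical stand-in for a Spark Connect session.

    Assumption: accessing sparkContext raises instead of returning a context.
    """

    @property
    def sparkContext(self):
        raise NotImplementedError("sparkContext is not supported")


def has_spark_context(session) -> bool:
    """Return True if `session.sparkContext` can be accessed without error."""
    try:
        return session.sparkContext is not None
    except Exception:
        # Any failure on access is treated as "no SparkContext available".
        return False


if __name__ == "__main__":
    for session in (ClassicSession(), ConnectSession()):
        print(type(session).__name__, has_spark_context(session))
```

A guard along these lines might let a pipeline fail early with a clear message when it is handed a session without a `SparkContext`, rather than breaking later inside library code that depends on it.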
Just choose @MrPowers FYI |
Is your feature request related to a problem? Please describe.
I'm currently facing issues with PyDeequ's support for Apache Spark version 3.4.0, which is impacting several projects in my organization that use PyDeequ as a data quality tool. The problem arises because our EMR clusters are required to support the latest version releases, but since the release of emr-6.12.0, support for Apache Spark 3.3.x has been dropped.
Describe the solution you'd like
I would like PyDeequ to be updated to support Apache Spark 3.4.0 and ideally, also the most recent version 3.5.0. I would also like to understand the requirements for this support, such as whether there are any backwards compatibility requirements for PyDeequ, and whether it is necessary for all future PyDeequ versions to continue supporting all of the currently supported Spark and Deequ versions, or if there is scope for dropping support for some versions, as mentioned on #178.
Describe alternatives you've considered
As an alternative, we have considered migrating to Great Expectations due to its active maintenance and large community. However, PyDeequ is still preferred due to its seamless integration with our internal PySpark library. The transition to a new tool would also require significant resources and time. Therefore, having PyDeequ support Apache Spark 3.4.0 and 3.5.0 would be the most beneficial solution for us.
Additional context
It seems that Deequ already supports Apache Spark 3.4.0 (#505) and, most recently, 3.5.0 (#514).