Spark version detection prevents startup of new spark sessions #111
Specifically tagging @lecardozo for visibility. The issue seems to be here: python-deequ/pydeequ/configs.py, line 22, at commit 7ec9f6f.
My suspicion is that the subprocess doesn't fully close out the spark context, or at least it isn't garbage collected quickly enough, and the SparkSession builder somehow isn't able to find it across process boundaries? Not entirely sure, but replacing this with the following snippet works for me locally:
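A rough sketch of what such an in-process replacement could look like, assuming the version is read from a session obtained via SparkSession.builder.getOrCreate(); the helper name is illustrative, not necessarily the actual pydeequ function:

```python
from pyspark.sql import SparkSession


def _get_spark_version() -> str:
    # Instead of spawning a subprocess (which can leave a SparkContext behind),
    # read the version from the in-process session. getOrCreate() reuses an
    # existing session if one is already running, otherwise it creates one
    # that later builder calls will simply recycle.
    spark = SparkSession.builder.getOrCreate()
    return spark.version
```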
Thanks for the heads up, @pajablons! 😃
That might be what's happening. I opted for this "weird" subprocess-based solution because tests were failing with a solution similar to yours: when the tests tried setting up a new context, getOrCreate() actually just recycled the context that was already created inside the version-detection code. I have a feeling that your solution works in this case because you are just recycling the already created session (i.e. just getting, instead of creating). 🤔

Can you confirm whether you could properly run the tests with your suggested approach? We might need to get back to the previous, explicit approach of using the environment variable.

On an additional note: I feel this …
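For illustration only (not pydeequ code), a minimal sketch of the recycling behaviour described above: once any SparkSession exists in the process, later getOrCreate() calls return the same session instead of building a fresh context.

```python
from pyspark.sql import SparkSession

# First builder call creates the session (e.g. inside version detection).
probe = SparkSession.builder.appName("version-probe").getOrCreate()

# A later builder call, e.g. from a test fixture, does not create a new
# context; it recycles the one that already exists.
fixture = SparkSession.builder.appName("test-fixture").getOrCreate()

assert probe is fixture  # the "new" session is just the recycled one
```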
Would you mind trying out the revert from this branch?
Let me know if this is working for you, or if you have any suggestions. 😃
That revert works perfectly for me, with one small caveat: it's not completely future proof (which isn't a new thing or a big deal). The line that matches the Spark version will obviously break if by some miracle Apache decides to drop Spark 3.10.0 on us or something. It's one of those super benign things that is probably never going to be an issue, but I think it's worth noting.
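As a hypothetical illustration of that caveat (not the actual pydeequ code): a lookup keyed on a sliced major.minor prefix works for today's 3.x versions but would misread a future "3.10.0".

```python
# Placeholder mapping for illustration; real artifact coordinates omitted.
SPARK_TO_DEEQU = {
    "3.2": "deequ artifact built for Spark 3.2",
    "3.1": "deequ artifact built for Spark 3.1",
}

for version in ("3.2.2", "3.10.0"):
    key = version[:3]  # "3.2.2" -> "3.2" (correct), "3.10.0" -> "3.1" (wrong)
    print(version, "->", SPARK_TO_DEEQU.get(key, "unsupported"))
```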
Closing after the revert. Thanks everyone for contributing and reporting.
Describe the bug
With the newest commit adding support for Spark 3.1/3.2, pydeequ uses a new method for detecting the (py)spark version: it spawns a new subprocess within the Spark JVM and does not appropriately clean up the SparkContext that gets created, so subsequent calls to SparkSession.builder.getOrCreate() fail when they attempt to create a second SparkContext. This bug was not present historically, when the Spark version was detected through an environment variable.
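For comparison, a sketch of what the historical, environment-variable-based detection mentioned above might look like; the variable name is assumed for illustration:

```python
import os


def spark_version_from_env() -> str:
    # Historical style of detection described above: no subprocess and no
    # SparkContext side effects, just an explicitly provided value.
    # "SPARK_VERSION" is an assumed variable name for illustration.
    version = os.environ.get("SPARK_VERSION")
    if version is None:
        raise RuntimeError("Set SPARK_VERSION, e.g. SPARK_VERSION=3.2.2")
    return version
```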
Error output:
To Reproduce
On a Spark 3.2.2 (likely 3.x) installation:
pip install git+https://github.com/awslabs/python-deequ.git@7ec9f6f72839779f370a4753c0db26d6cf052203
Create $SPARK_HOME/jobs/test.py with contents along the lines of the sketch below, then submit it with:
$SPARK_HOME/bin/spark-submit jobs/test.py
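The original script contents are not shown; a minimal reconstruction that would hit the reported failure might look like this (the builder configuration follows pydeequ's documented setup, but the exact original contents are an assumption):

```python
# jobs/test.py -- reconstruction of the reproduction script, not the original
from pyspark.sql import SparkSession

import pydeequ

# With the affected commit, pydeequ's version detection leaves a SparkContext
# behind, so this getOrCreate() fails while trying to create a second one.
spark = (
    SparkSession.builder
    .appName("pydeequ-repro")
    .config("spark.jars.packages", pydeequ.deequ_maven_coord)
    .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
    .getOrCreate()
)

print(spark.version)
```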
Expected behavior
PySpark won't crash when I try to make a spark session.