PySpark 3.2.0 support #423
I just tried glow with PySpark 3.2.0:
My spark config:

Comments
Hi @Hoeze, Glow v1.1.0 is only supported with Spark 3.1; please use that version!
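For reference, a minimal sketch of a version-matched setup, assuming pip-installed pyspark 3.1.x alongside glow.py 1.1.0 and the artifact coordinates from the Glow docs:

```python
# A minimal sketch of pairing Glow v1.1.0 with Spark 3.1 (coordinates and
# codec config as documented for Glow's Spark 3 / Scala 2.12 builds).
import glow
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("glow-1.1-on-spark-3.1")
    .config("spark.jars.packages", "io.projectglow:glow-spark3_2.12:1.1.0")
    .config("spark.hadoop.io.compression.codecs", "io.projectglow.sql.util.BGZFCodec")
    .getOrCreate()
)
spark = glow.register(spark)
```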
Hi @williambrandler, thanks for the note!
Hey @Hoeze, Databricks recently released a version of the Databricks Runtime for Spark 3.1 that is 'LTS', which means Long Term Support (for 18 months). We released Glow v1.1 at that time; it is compatible with Spark 3.1 (Glow v1.0 is compatible with Spark 3.0).

Spark 3.2 was announced publicly by Databricks this week. We plan to wait to upgrade Glow until there is a Long Term Support version of the Databricks Runtime for Spark 3.2.

Are there specific features in Spark 3.2 you wish to leverage with Glow?
I was interested in trying out Spark 3.2, especially the parquet column index support. Besides that, I was just confused why a point update of Spark completely broke my environment.
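For context, a sketch of the kind of selective scan that Spark 3.2's parquet column indexes (page-level min/max statistics) are meant to accelerate; the path and data are hypothetical, while the column names follow Glow's VCF schema:

```python
# With column indexes, parquet pages whose [min, max] range excludes the
# predicate can be skipped without decoding, so a narrow positional query
# reads far less data than a full row-group scan.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("parquet-column-index-demo").getOrCreate()

variants = spark.read.parquet("/data/variants.parquet")  # hypothetical path

hits = variants.where(
    (F.col("contigName") == "chr1")
    & (F.col("start") >= 1_000_000)
    & (F.col("start") < 1_100_000)
)
hits.show()
```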
What got broken? Please send over the errors. Here are two PRs we could use to bump the version to Spark 3.2:
The problem is that I cannot read parquet files any more (see my first post).
Ah got it. The way indexing works is not quite the same as for single-node tools: the indexing is at the partition level, not the row level (unless you have a single row for each partition). You can get queries on position down to a few seconds, and on genes down to about 10-15s, by leveraging indexing in Delta Lake.

So the performance will not be as good as for single-node tools (for example bedtools), but of course bedtools only takes you so far. Curious how indexing works for parquet in Spark 3.2; we will want to test it and compare to Delta Lake.
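A sketch of the Delta Lake data-skipping layout described above, assuming a Delta-enabled cluster (e.g. Databricks) where `OPTIMIZE ... ZORDER BY` is available; the paths, table name, and column names are hypothetical:

```python
# Partition-level data skipping with Delta Lake (skipping happens per
# file/partition via min/max stats, not per row, as noted above).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-locus-skipping").getOrCreate()

# Write the variant data as a Delta table (hypothetical source path).
(spark.read.parquet("/data/variants.parquet")
    .write.format("delta")
    .mode("overwrite")
    .saveAsTable("variants_delta"))

# Co-locate rows by locus so each file's min/max statistics become selective.
spark.sql("OPTIMIZE variants_delta ZORDER BY (contigName, start)")

# A point query on a locus now prunes most files via the collected stats.
spark.sql("""
    SELECT * FROM variants_delta
    WHERE contigName = 'chr1' AND start BETWEEN 1000000 AND 1000100
""").show()
```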
But we're in the process of releasing Glow v1.1.1, which will still be on Spark 3.1.2, so it will take a little bit of time before we can move on to Spark 3.2. What is your query performance now for these joins?
Hey @Hoeze, we now have everything in place to upgrade Glow to Spark 3.2; we are just waiting on Hail to upgrade as well, since Glow depends on Hail. I created an issue with them.
Thank you for the update @williambrandler, looking forward to trying it!
We hit some more unexpected issues on the release, @Hoeze, but we're getting close. We are also going to press on without waiting for Hail, EMR, and Dataproc to upgrade to Spark 3.2. This means the continuous integration tests will fail at the Hail-on-Spark-3 step, but I have manually tested that the export from Hail to Glow still works.
@Hoeze, Glow on Spark 3.2.1 is now available as a pre-release. We are still doing some testing, but everything seems to work except exporting from Hail to Glow: https://github.com/projectglow/glow/releases/tag/v1.2.1
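A minimal sketch of trying the pre-release against Spark 3.2, assuming the same artifact coordinates as earlier Glow releases (check the linked release notes for the exact ones):

```python
# Trying the Glow pre-release on Spark 3.2; the package coordinate is an
# assumption based on earlier releases, and the VCF path is hypothetical.
import glow
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("glow-1.2.1-on-spark-3.2")
    .config("spark.jars.packages", "io.projectglow:glow-spark3_2.12:1.2.1")
    .config("spark.hadoop.io.compression.codecs", "io.projectglow.sql.util.BGZFCodec")
    .getOrCreate()
)
spark = glow.register(spark)

# Smoke test: read a VCF through Glow's data source.
spark.read.format("vcf").load("/data/sample.vcf").show()
```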