
PySpark 3.2.0 support #423

Closed

Hoeze opened this issue Oct 21, 2021 · 12 comments

Hoeze commented Oct 21, 2021

I just tried Glow with PySpark 3.2.0:

spark.read.parquet(OUTPUT_PATH)
Py4JJavaError: An error occurred while calling o62.parquet.
: java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformAllExpressions(Lscala/PartialFunction;)Lorg/apache/spark/sql/catalyst/plans/logical/LogicalPlan;
	at io.projectglow.sql.optimizer.ReplaceExpressionsRule$.apply(hlsOptimizerRules.scala:33)
	at io.projectglow.sql.optimizer.ReplaceExpressionsRule$.apply(hlsOptimizerRules.scala:31)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:211)
	at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
	at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
	at scala.collection.immutable.List.foldLeft(List.scala:91)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:208)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:200)
	at scala.collection.immutable.List.foreach(List.scala:431)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:200)
	at org.apache.spark.sql.catalyst.analysis.Analyzer.org$apache$spark$sql$catalyst$analysis$Analyzer$$executeSameContext(Analyzer.scala:215)
	at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:209)
	at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:172)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack$1(RuleExecutor.scala:179)
	at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:88)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.executeAndTrack(RuleExecutor.scala:179)
	at org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:193)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:330)
	at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:192)
	at org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:88)
	at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
	at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:196)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
	at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:196)
	at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:88)
	at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:86)
	at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:78)
	at org.apache.spark.sql.Dataset$.$anonfun$ofRows$1(Dataset.scala:90)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
	at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:88)
	at org.apache.spark.sql.SparkSession.baseRelationToDataFrame(SparkSession.scala:440)
	at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:274)
	at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:245)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:245)
	at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:596)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.base/java.lang.Thread.run(Thread.java:834)

My Spark config:

from pyspark.sql import SparkSession
import glow

spark = (
    SparkSession.builder
    .appName('playground')
    .config("spark.jars.packages", ",".join([
        "io.projectglow:glow-spark3_2.12:1.1.0",
    ]))
    .config("spark.sql.execution.arrow.pyspark.enabled", "true")
    .config("spark.hadoop.io.compression.codecs", "io.projectglow.sql.util.BGZFCodec")
    .getOrCreate()
)
glow.register(spark)
@williambrandler (Contributor)

Hi @Hoeze, Glow v1.1.0 is only supported on Spark 3.1; please use that version!
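
To catch this earlier, you can fail fast on a version mismatch before registering Glow; a minimal sketch (the assertion is illustrative, not part of Glow's API):

import pyspark

# Glow 1.1.0 is built against Spark 3.1; a mismatched PySpark version
# otherwise surfaces as a NoSuchMethodError deep inside the Catalyst optimizer.
assert pyspark.__version__.startswith("3.1."), (
    f"Glow 1.1.0 requires PySpark 3.1.x, found {pyspark.__version__}"
)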


Hoeze commented Oct 22, 2021

Hi @williambrandler, thanks for the note!
Are there any plans to upgrade to Spark 3.2?

@williambrandler (Contributor)

Hey @Hoeze

Databricks recently released a version of the Databricks Runtime for Spark 3.1 that is 'LTS', meaning Long Term Support for 18 months. We released Glow v1.1 at that time; it is compatible with Spark 3.1 (Glow v1.0 is compatible with Spark 3.0).

Spark 3.2 was announced publicly by Databricks this week. We plan to wait on upgrading Glow until there is a Long Term Support version of the Databricks Runtime for Spark 3.2.

Are there specific features in Spark 3.2 you wish to leverage with Glow?


Hoeze commented Oct 22, 2021

I was interested in trying out Spark 3.2, especially the parquet column index support:
https://issues.apache.org/jira/browse/SPARK-26345
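
As a rough sketch of the kind of query that could benefit (path and column names are hypothetical): with column indexes, a selective predicate lets the parquet reader skip whole pages whose min/max statistics exclude the value.

# Sketch: a point lookup on parquet data sorted by variantID. With
# SPARK-26345 (parquet column index support), page-level skipping should
# make this much cheaper. Path and column names are hypothetical.
df = spark.read.parquet("/data/variants.parquet")
df.where(df.variantID == "chr1:12345:A:T").show()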

Besides that, I was just confused that a minor version update of Spark completely broke my environment.

@williambrandler (Contributor)

What got broken? Please send over the errors.
For column indexing, would Delta Lake work as the indexing layer over parquet?
What are you using the indexes for? Querying? Please provide a little more detail.

Here are two PRs we could use to bump the version to Spark 3.2:
databricks/spark-xml#564
#396


Hoeze commented Oct 22, 2021

The problem is that I cannot read parquet files any more (see my first post).
My intention was to get very fast responses on joins over "variantID" with parquet/PySpark, something like the sketch below.
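
A minimal sketch of the kind of join I mean (both paths and the annotation table are hypothetical):

# Sketch: an equi-join on "variantID" between two parquet tables. Paths and
# schemas are hypothetical; the hope is that data skipping on a sorted
# variantID column keeps the lookup side cheap.
variants = spark.read.parquet("/data/variants.parquet")
annotations = spark.read.parquet("/data/annotations.parquet")
variants.join(annotations, on="variantID", how="inner").show()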

williambrandler (Contributor) commented Oct 29, 2021

Ah, got it. The way indexing works here is not quite the same as for single-node tools.

The indexing is at the partition level, not the row level (unless you have a single row for each partition). By leveraging indexing in Delta Lake, you can get queries on position down to a few seconds, and on genes down to about 10-15 seconds; see the sketch below.
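
As a rough sketch of that approach (the table path and columns are hypothetical, and OPTIMIZE ... ZORDER BY requires a Delta Lake runtime that supports it, such as Databricks):

# Sketch: Z-ordering a Delta table so that data skipping can prune files on
# range queries. Table path and column names are hypothetical.
spark.sql("""
    OPTIMIZE delta.`/data/variants_delta`
    ZORDER BY (contigName, start)
""")

# Selective queries can then skip files whose min/max stats exclude the range.
df = spark.read.format("delta").load("/data/variants_delta")
df.where("contigName = 'chr1' AND start BETWEEN 100000 AND 101000").show()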

So the performance will not be as good as with single-node tools (for example, bedtools). But of course bedtools only takes you so far.

I'm curious how indexing works for parquet in Spark 3.2; we will want to test it and compare against Delta Lake.

williambrandler (Contributor) commented Oct 29, 2021

But we're in the process of releasing Glow v1.1.1, which will still target Spark 3.1.2, so it will take a little more time before we can move on to Spark 3.2.

What is your query performance now for these joins?

@williambrandler (Contributor)

Hey @Hoeze, we now have everything in place to upgrade Glow to Spark 3.2.

We are just waiting on Hail to upgrade as well, since Glow depends on Hail. I created an issue with them:

hail-is/hail#11707


Hoeze commented Apr 3, 2022

Thank you for the update, @williambrandler; looking forward to trying it!

@williambrandler (Contributor)

We hit some more unexpected issues on the release, @Hoeze, but we're getting close. We are also going to press on without waiting for Hail, EMR, and Dataproc to upgrade to Spark 3.2. This means the continuous integration tests will fail at the Hail-on-Spark-3 step, but I have manually tested that the export from Hail to Glow still works.

@williambrandler (Contributor)

@Hoeze Glow on Spark 3.2.1 is now available as a pre-release. We're still doing some testing, but everything seems to work except exporting from Hail to Glow: https://github.com/projectglow/glow/releases/tag/v1.2.1
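
If you want to try the pre-release, registering it should look the same as before; a minimal sketch, assuming the Maven coordinate follows the same io.projectglow:glow-spark3_2.12 pattern as earlier releases:

from pyspark.sql import SparkSession
import glow

# Sketch: running the Glow v1.2.1 pre-release on Spark 3.2.1. The Maven
# coordinate below is an assumption based on the 1.1.0 naming pattern.
spark = (
    SparkSession.builder
    .config("spark.jars.packages", "io.projectglow:glow-spark3_2.12:1.2.1")
    .config("spark.hadoop.io.compression.codecs", "io.projectglow.sql.util.BGZFCodec")
    .getOrCreate()
)
glow.register(spark)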
