[TEST BUG] CollectLimit falling off of GPU on Databricks 14.3 #12056

Closed
razajafri opened this issue Feb 1, 2025 · 4 comments
Labels: bug (Something isn't working), test (Only impacts tests)

Comments

@razajafri
Collaborator

Describe the bug
The test delta_zorder_test.py::test_delta_dfp_reuse_broadcast_exchange fails, even with deletion vectors turned off.

We see that CollectLimitExec is falling off of the GPU:

E                   py4j.protocol.Py4JJavaError: An error occurred while calling o413.collectToPython.
E                   : org.apache.spark.SparkException: Exception thrown in awaitResult: java.lang.IllegalArgumentException: Part of the plan is not columnar class org.apache.spark.sql.execution.CollectLimitExec
E                   CollectLimit 3000001
E                   +- GpuColumnarToRow false
E                      +- GpuHashAggregate (keys=[ex_key#8048], functions=[], output=[ex_key#8048]) [loreId=50]
E                         +- GpuShuffleCoalesce 104857600
E                            +- ShuffleQueryStage 2, Statistics(sizeInBytes=392.0 B, rowCount=20, ColumnStat: N/A, isRuntime=true)
E                               +- ReusedExchange [ex_key#8048], GpuColumnarExchange gpusinglepartitioning$(), EXECUTOR_BROADCAST, [plan_id=6885], [loreId=22]

Further up in the transformation logs we can see why it is falling off the GPU. That does not explain, however, why the same failure is not observed on versions prior to Databricks 14.3.

!Exec <CollectLimitExec> cannot run on GPU because the Exec CollectLimitExec has been disabled, and is disabled by default because Collect Limit replacement can be slower on the GPU, if huge number of rows in a batch it could help by limiting the number of rows transferred from GPU to CPU. Set spark.rapids.sql.exec.CollectLimitExec to true if you wish to enable it
  @Partitioning <SinglePartition$> could run on GPU

2025-02-01 04:00:18,461 [dynamicpruning-1] WARN  com.nvidia.spark.rapids.GpuOverrides - Transformed query:
Original Plan:
CollectLimit 3000001
+- HashAggregate(keys=[ex_key#8048], functions=[], output=[ex_key#8048])
   +- ShuffleQueryStage 2, Statistics(sizeInBytes=392.0 B, rowCount=20, ColumnStat: N/A, isRuntime=true)
      +- ReusedExchange [ex_key#8048], GpuColumnarExchange gpusinglepartitioning$(), EXECUTOR_BROADCAST, [plan_id=6885], [loreId=22]

Transformed Plan:
CollectLimit 3000001
+- GpuHashAggregate (keys=[ex_key#8048], functions=[], output=[ex_key#8048]) [loreId=50]
   +- ShuffleQueryStage 2, Statistics(sizeInBytes=392.0 B, rowCount=20, ColumnStat: N/A, isRuntime=true)
      +- ReusedExchange [ex_key#8048], GpuColumnarExchange gpusinglepartitioning$(), EXECUTOR_BROADCAST, [plan_id=6885], [loreId=22]

2025-02-01 04:00:18,479 [dynamicpruning-1] ERROR com.nvidia.spark.rapids.GpuOverrideUtil - Encountered an exception applying GPU overrides java.lang.IllegalArgumentException: Part of the plan is not columnar class org.apache.spark.sql.execution.CollectLimitExec
CollectLimit 3000001
+- GpuColumnarToRow false
   +- GpuHashAggregate (keys=[ex_key#8048], functions=[], output=[ex_key#8048]) [loreId=50]
      +- GpuShuffleCoalesce 104857600
         +- ShuffleQueryStage 2, Statistics(sizeInBytes=392.0 B, rowCount=20, ColumnStat: N/A, isRuntime=true)
            +- ReusedExchange [ex_key#8048], GpuColumnarExchange gpusinglepartitioning$(), EXECUTOR_BROADCAST, [plan_id=6885], [loreId=22]

java.lang.IllegalArgumentException: Part of the plan is not columnar class org.apache.spark.sql.execution.CollectLimitExec
CollectLimit 3000001
+- GpuColumnarToRow false
   +- GpuHashAggregate (keys=[ex_key#8048], functions=[], output=[ex_key#8048]) [loreId=50]
      +- GpuShuffleCoalesce 104857600
         +- ShuffleQueryStage 2, Statistics(sizeInBytes=392.0 B, rowCount=20, ColumnStat: N/A, isRuntime=true)
            +- ReusedExchange [ex_key#8048], GpuColumnarExchange gpusinglepartitioning$(), EXECUTOR_BROADCAST, [plan_id=6885], [loreId=22]

Steps/Code to reproduce bug

PYSP_TEST_spark_rapids_sql_debug_logTransformations=true TEST_MODE=DELTA_LAKE_ONLY WITH_DEFAULT_UPSTREAM_SHIM=0 TESTS=delta_zorder_test.py::test_delta_dfp_reuse_broadcast_exchange  TEST_PARALLEL=0 ./jenkins/databricks/test.sh |& tee testoutput.log


@razajafri added the "? - Needs Triage" (Need team to review and classify) and "bug" (Something isn't working) labels on Feb 1, 2025
@revans2
Collaborator

revans2 commented Feb 3, 2025

CollectLimit never runs on the GPU. We do not truly support it because there is some special sauce at the RDD level that lets it run a single task first and then, if not enough rows are returned, run more tasks.

That said, the plan looks a lot like #11764, and we already had to change a few tests to support it (#11750).
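
The RDD-level behavior described in this comment (run a single task first, then run more tasks only if too few rows came back) can be sketched in plain Python. This is a simplified model, not Spark's actual implementation: the function name `incremental_take`, the `scale_up_factor` parameter, and the in-memory lists standing in for partitions are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def incremental_take(
    partitions: Sequence[List[int]],
    limit: int,
    scale_up_factor: int = 4,  # hypothetical growth factor for this sketch
) -> Tuple[List[int], int]:
    """Collect up to `limit` rows, scanning partitions in growing batches.

    Returns the collected rows and the number of partitions scanned.
    Each partition stands in for one task.
    """
    rows: List[int] = []
    scanned = 0
    batch = 1  # the first pass runs a single task
    while len(rows) < limit and scanned < len(partitions):
        end = min(scanned + batch, len(partitions))
        for part in partitions[scanned:end]:
            rows.extend(part)  # "running the task" for this partition
        scanned = end
        batch *= scale_up_factor  # try more partitions on the next pass
    return rows[:limit], scanned


parts = [[1, 2], [3], [4, 5, 6], [7], [8, 9]]
print(incremental_take(parts, 2))  # ([1, 2], 1): one task was enough
print(incremental_take(parts, 5))  # ([1, 2, 3, 4, 5], 5): a second pass was needed
```

With a small limit only the first partition is touched, which is the property that is hard to preserve when the operator is replaced by a plain columnar exec.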

@mattahrens removed the "? - Needs Triage" (Need team to review and classify) label on Feb 4, 2025
@razajafri added the "test" (Only impacts tests) label on Feb 7, 2025
@mythrocks changed the title from "[BUG] CollectLimit falling off of GPU on Databricks 14.3" to "[TEST BUG] CollectLimit falling off of GPU on Databricks 14.3" on Feb 7, 2025
@gerashegalov self-assigned this on Feb 12, 2025
@razajafri
Collaborator Author

> CollectLimit is something that is always not on the GPU. We do not truly support it because there is some special sauce at the RDD level that makes it so that it can run a single task first and then if there are not enough rows returned it will run more tasks.
>
> That said the plan looks a lot like #11764 and changes we already had to make to a few tests to support it. #11750

Based on this, I am closing this issue as working as expected.

@revans2
Collaborator

revans2 commented Feb 18, 2025

@razajafri Does the test pass? If so then we are good. If not, then we need to fix the test

@razajafri
Collaborator Author

> @razajafri Does the test pass? If so then we are good. If not, then we need to fix the test

Yes, the test passes after we add CollectLimitExec to the allowed non-GPU execs.
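
To illustrate what allowing an exec on the CPU means for a test like this one, here is a minimal sketch in plain Python. It does not use the spark-rapids test harness: the helper `find_disallowed_cpu_execs`, the `Gpu` name-prefix check, and the exec name list (loosely based on the failing plan in this issue) are assumptions for illustration only.

```python
from typing import Iterable, List


def find_disallowed_cpu_execs(
    plan_execs: Iterable[str],
    allowed_non_gpu: Iterable[str],
) -> List[str]:
    """Return exec names that are neither GPU execs nor explicitly allowed on the CPU.

    Assumption for this sketch: a GPU exec is recognized by a "Gpu" name prefix.
    """
    allowed = set(allowed_non_gpu)
    return [
        name
        for name in plan_execs
        if not name.startswith("Gpu") and name not in allowed
    ]


# Exec names loosely based on the failing plan above (illustrative only).
plan = [
    "CollectLimitExec",
    "GpuColumnarToRowExec",
    "GpuHashAggregateExec",
    "GpuShuffleCoalesceExec",
]

print(find_disallowed_cpu_execs(plan, []))                    # ['CollectLimitExec']
print(find_disallowed_cpu_execs(plan, ["CollectLimitExec"]))  # []
```

An empty result corresponds to the passing case: every CPU exec left in the plan is one the test has explicitly said it expects.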
