[TEST BUG] CollectLimit falling off of GPU on Databricks 14.3 #12056

razajafri · 2025-02-01T04:07:58Z

Describe the bug
delta_zorder_test.py::test_delta_dfp_reuse_broadcast_exchange fails on Databricks 14.3, even with deletion vectors turned off.

We see that CollectLimitExec is falling off of the GPU:

E                   py4j.protocol.Py4JJavaError: An error occurred while calling o413.collectToPython.
E                   : org.apache.spark.SparkException: Exception thrown in awaitResult: java.lang.IllegalArgumentException: Part of the plan is not columnar class org.apache.spark.sql.execution.CollectLimitExec
E                   CollectLimit 3000001
E                   +- GpuColumnarToRow false
E                      +- GpuHashAggregate (keys=[ex_key#8048], functions=[], output=[ex_key#8048]) [loreId=50]
E                         +- GpuShuffleCoalesce 104857600
E                            +- ShuffleQueryStage 2, Statistics(sizeInBytes=392.0 B, rowCount=20, ColumnStat: N/A, isRuntime=true)
E                               +- ReusedExchange [ex_key#8048], GpuColumnarExchange gpusinglepartitioning$(), EXECUTOR_BROADCAST, [plan_id=6885], [loreId=22]

Further up in the transformation logs we can see why it falls off the GPU. That does not explain, however, why the same failure is not observed on versions prior to Databricks 14.3.

!Exec <CollectLimitExec> cannot run on GPU because the Exec CollectLimitExec has been disabled, and is disabled by default because Collect Limit replacement can be slower on the GPU, if huge number of rows in a batch it could help by limiting the number of rows transferred from GPU to CPU. Set spark.rapids.sql.exec.CollectLimitExec to true if you wish to enable it
  @Partitioning <SinglePartition$> could run on GPU

2025-02-01 04:00:18,461 [dynamicpruning-1] WARN  com.nvidia.spark.rapids.GpuOverrides - Transformed query:
Original Plan:
CollectLimit 3000001
+- HashAggregate(keys=[ex_key#8048], functions=[], output=[ex_key#8048])
   +- ShuffleQueryStage 2, Statistics(sizeInBytes=392.0 B, rowCount=20, ColumnStat: N/A, isRuntime=true)
      +- ReusedExchange [ex_key#8048], GpuColumnarExchange gpusinglepartitioning$(), EXECUTOR_BROADCAST, [plan_id=6885], [loreId=22]

Transformed Plan:
CollectLimit 3000001
+- GpuHashAggregate (keys=[ex_key#8048], functions=[], output=[ex_key#8048]) [loreId=50]
   +- ShuffleQueryStage 2, Statistics(sizeInBytes=392.0 B, rowCount=20, ColumnStat: N/A, isRuntime=true)
      +- ReusedExchange [ex_key#8048], GpuColumnarExchange gpusinglepartitioning$(), EXECUTOR_BROADCAST, [plan_id=6885], [loreId=22]

2025-02-01 04:00:18,479 [dynamicpruning-1] ERROR com.nvidia.spark.rapids.GpuOverrideUtil - Encountered an exception applying GPU overrides java.lang.IllegalArgumentException: Part of the plan is not columnar class org.apache.spark.sql.execution.CollectLimitExec
CollectLimit 3000001
+- GpuColumnarToRow false
   +- GpuHashAggregate (keys=[ex_key#8048], functions=[], output=[ex_key#8048]) [loreId=50]
      +- GpuShuffleCoalesce 104857600
         +- ShuffleQueryStage 2, Statistics(sizeInBytes=392.0 B, rowCount=20, ColumnStat: N/A, isRuntime=true)
            +- ReusedExchange [ex_key#8048], GpuColumnarExchange gpusinglepartitioning$(), EXECUTOR_BROADCAST, [plan_id=6885], [loreId=22]

java.lang.IllegalArgumentException: Part of the plan is not columnar class org.apache.spark.sql.execution.CollectLimitExec
CollectLimit 3000001
+- GpuColumnarToRow false
   +- GpuHashAggregate (keys=[ex_key#8048], functions=[], output=[ex_key#8048]) [loreId=50]
      +- GpuShuffleCoalesce 104857600
         +- ShuffleQueryStage 2, Statistics(sizeInBytes=392.0 B, rowCount=20, ColumnStat: N/A, isRuntime=true)
            +- ReusedExchange [ex_key#8048], GpuColumnarExchange gpusinglepartitioning$(), EXECUTOR_BROADCAST, [plan_id=6885], [loreId=22]

Steps/Code to reproduce bug

PYSP_TEST_spark_rapids_sql_debug_logTransformations=true TEST_MODE=DELTA_LAKE_ONLY WITH_DEFAULT_UPSTREAM_SHIM=0 TESTS=delta_zorder_test.py::test_delta_dfp_reuse_broadcast_exchange  TEST_PARALLEL=0 ./jenkins/databricks/test.sh |& tee testoutput.log



revans2 · 2025-02-03T15:00:44Z

CollectLimit never runs on the GPU. We do not truly support it because there is some special sauce at the RDD level that lets it run a single task first and then, if not enough rows are returned, run more tasks.

That said, the plan looks a lot like #11764 and the changes we already had to make to a few tests to support it (see #11750).
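To make the "run a single task first, then run more" behaviour above concrete, here is a toy model in plain Python. It is only a sketch, not Spark's or the plugin's code: the function name `incremental_take`, the `scale_up` parameter, and the list-of-lists stand-in for partitions are all made up for illustration.

```python
from typing import List, Sequence, Tuple


def incremental_take(
    partitions: Sequence[List[int]], limit: int, scale_up: int = 4
) -> Tuple[List[int], int, int]:
    """Toy model of an incremental limit (hypothetical, not Spark code).

    Scans one partition first and widens the scan by `scale_up` each round
    until `limit` rows are collected or every partition has been read.
    Returns (rows, partitions_scanned, rounds_launched).
    """
    rows: List[int] = []
    scanned = 0   # partitions read so far
    rounds = 0    # number of scan rounds launched
    to_scan = 1   # the first round reads a single partition
    while len(rows) < limit and scanned < len(partitions):
        batch = partitions[scanned:scanned + to_scan]
        for part in batch:
            # Take only as many rows as are still missing.
            rows.extend(part[: limit - len(rows)])
        scanned += len(batch)
        rounds += 1
        to_scan *= scale_up
    return rows, scanned, rounds


if __name__ == "__main__":
    partitions = [[1, 2], [3], [4, 5, 6], [7], [8]]
    # A small limit is satisfied by the first partition alone.
    print(incremental_take(partitions, limit=2))
    # A larger limit forces a second, wider round.
    print(incremental_take(partitions, limit=5))
```

The point of the sketch is that the amount of work depends on how many rows the early rounds return, which is a scheduling decision made outside the operator that processes each batch.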

razajafri · 2025-02-18T19:19:53Z

> CollectLimit never runs on the GPU. We do not truly support it because there is some special sauce at the RDD level that lets it run a single task first and then, if not enough rows are returned, run more tasks.
>
> That said, the plan looks a lot like #11764 and the changes we already had to make to a few tests to support it (see #11750).

Based on this, I am closing this issue as working as expected.

revans2 · 2025-02-18T19:45:58Z

@razajafri Does the test pass? If so, then we are good. If not, then we need to fix the test.

razajafri · 2025-02-27T01:42:19Z

> @razajafri Does the test pass? If so, then we are good. If not, then we need to fix the test.

Yes, the test passes after we add CollectLimitExec to the allowed expressions.
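For readers unfamiliar with the idea of an allow-list for CPU fallbacks, the following is a minimal, self-contained sketch. It assumes the fix takes the form of a list of exec class names that a test tolerates on the CPU; the function `unexpected_fallbacks` and its arguments are hypothetical and are not the integration-test framework's actual API.

```python
from typing import Iterable, List


def unexpected_fallbacks(
    cpu_execs: Iterable[str], allowed_non_gpu: Iterable[str]
) -> List[str]:
    """Return the execs that fell back to the CPU without being allowed.

    Hypothetical helper: `cpu_execs` is the set of exec names found on the
    CPU in a plan, `allowed_non_gpu` is the per-test allow-list.
    """
    return sorted(set(cpu_execs) - set(allowed_non_gpu))


if __name__ == "__main__":
    fallen_back = ["CollectLimitExec"]
    # Without the allow-list entry the fallback is reported.
    print(unexpected_fallbacks(fallen_back, allowed_non_gpu=[]))
    # With CollectLimitExec allowed, nothing is reported.
    print(unexpected_fallbacks(fallen_back, allowed_non_gpu=["CollectLimitExec"]))
```

Under this assumption the test change is a one-line addition to the allow-list rather than a change to the plugin itself.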

razajafri added the "? - Needs Triage" (Need team to review and classify) and "bug" (Something isn't working) labels Feb 1, 2025

mattahrens removed the "? - Needs Triage" (Need team to review and classify) label Feb 4, 2025

razajafri added the "test" (Only impacts tests) label Feb 7, 2025

mythrocks changed the title ~~[BUG] CollectLimit falling off of GPU on Databricks 14.3~~ [TEST BUG] CollectLimit falling off of GPU on Databricks 14.3 Feb 7, 2025

razajafri mentioned this issue Feb 7, 2025 in [FEA] Add support for Databricks 14.3 ML LTS #10661 (Open, 16 tasks)

gerashegalov self-assigned this Feb 12, 2025

razajafri closed this as completed Feb 18, 2025
