[BUG] parquet_test.py::test_buckets failed due to SparkOutOfMemoryError intermittently #10940

Open
yinqingh opened this issue May 29, 2024 · 10 comments
Labels
bug (Something isn't working), documentation (Improvements or additions to documentation)

Comments

@yinqingh
Collaborator

Describe the bug
parquet_test.py::test_buckets failed due to org.apache.spark.memory.SparkOutOfMemoryError: Unable to acquire 65536 bytes of memory, got 0

This failure occurred in the JDK11 nightly build (JDK11-nightly/625).
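
For anyone trying to reproduce this locally, the following is a minimal sketch of re-running just the failing parameterization through pytest's Python API. It assumes the spark-rapids integration-test environment (Spark, the plugin jars, and the test dependencies) is already set up the way the nightly job configures it and that the working directory is the integration tests directory; only the test node ID is taken from the failure report below.

# Hedged repro sketch: re-run only the failing parameterization of test_buckets.
# Everything about the environment and the path prefix here is an assumption;
# the node ID comes from the FAILURES header in the traceback below.
import pytest

if __name__ == "__main__":
    pytest.main([
        "src/main/python/parquet_test.py::test_buckets[parquet-reader_confs1]",
        "-x",    # stop at the first failure
        "-rA",   # print a full summary of the run
    ])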

[2024-05-29T03:04:38.647Z] =================================== FAILURES ===================================
[2024-05-29T03:04:38.647Z] _____________________ test_buckets[parquet-reader_confs1] ______________________
[2024-05-29T03:04:38.647Z] [gw0] linux -- Python 3.9.19 /usr/bin/python
[2024-05-29T03:04:38.647Z] 
[2024-05-29T03:04:38.647Z] spark_tmp_path = '/tmp/pyspark_tests//jdk11-nightly-jdk11-nightly-qcpg3-4xsms-gw0-48139-2003166063/'
[2024-05-29T03:04:38.647Z] v1_enabled_list = 'parquet'
[2024-05-29T03:04:38.647Z] reader_confs = {'spark.rapids.sql.format.parquet.reader.footer.type': 'NATIVE', 'spark.rapids.sql.format.parquet.reader.type': 'MULTI....rapids.sql.reader.multithreaded.combine.sizeBytes': '0', 'spark.rapids.sql.reader.multithreaded.read.keepOrder': True}
[2024-05-29T03:04:38.647Z] spark_tmp_table_factory = <conftest.TmpTableFactory object at 0x7fdc36877940>
[2024-05-29T03:04:38.647Z] 
[2024-05-29T03:04:38.647Z]     @ignore_order
[2024-05-29T03:04:38.647Z]     @allow_non_gpu('DataWritingCommandExec,ExecutedCommandExec,WriteFilesExec')
[2024-05-29T03:04:38.647Z]     @pytest.mark.parametrize('reader_confs', reader_opt_confs)
[2024-05-29T03:04:38.647Z]     @pytest.mark.parametrize('v1_enabled_list', ["", "parquet"])
[2024-05-29T03:04:38.647Z]     # this test would be better if we could ensure exchanges didn't exist - ie used buckets
[2024-05-29T03:04:38.647Z]     def test_buckets(spark_tmp_path, v1_enabled_list, reader_confs, spark_tmp_table_factory):
[2024-05-29T03:04:38.647Z]         all_confs = copy_and_update(reader_confs, {
[2024-05-29T03:04:38.647Z]             'spark.sql.sources.useV1SourceList': v1_enabled_list,
[2024-05-29T03:04:38.647Z]             'spark.sql.autoBroadcastJoinThreshold': '-1'})
[2024-05-29T03:04:38.647Z]         def do_it(spark):
[2024-05-29T03:04:38.647Z]             return createBucketedTableAndJoin(spark, spark_tmp_table_factory.get(),
[2024-05-29T03:04:38.647Z]                     spark_tmp_table_factory.get())
[2024-05-29T03:04:38.647Z] >       assert_gpu_and_cpu_are_equal_collect(do_it, conf=all_confs)
[2024-05-29T03:04:38.647Z] 
[2024-05-29T03:04:38.647Z] ../../src/main/python/parquet_test.py:681: 
[2024-05-29T03:04:38.647Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
[2024-05-29T03:04:38.647Z] ../../src/main/python/asserts.py:595: in assert_gpu_and_cpu_are_equal_collect
[2024-05-29T03:04:38.647Z]     _assert_gpu_and_cpu_are_equal(func, 'COLLECT', conf=conf, is_cpu_first=is_cpu_first, result_canonicalize_func_before_compare=result_canonicalize_func_before_compare)
[2024-05-29T03:04:38.647Z] ../../src/main/python/asserts.py:502: in _assert_gpu_and_cpu_are_equal
[2024-05-29T03:04:38.647Z]     from_cpu = run_on_cpu()
[2024-05-29T03:04:38.647Z] ../../src/main/python/asserts.py:487: in run_on_cpu
[2024-05-29T03:04:38.647Z]     from_cpu = with_cpu_session(bring_back, conf=conf)
[2024-05-29T03:04:38.647Z] ../../src/main/python/spark_session.py:147: in with_cpu_session
[2024-05-29T03:04:38.647Z]     return with_spark_session(func, conf=copy)
[2024-05-29T03:04:38.647Z] /usr/lib/python3.9/contextlib.py:79: in inner
[2024-05-29T03:04:38.647Z]     return func(*args, **kwds)
[2024-05-29T03:04:38.647Z] ../../src/main/python/spark_session.py:131: in with_spark_session
[2024-05-29T03:04:38.647Z]     ret = func(_spark)
[2024-05-29T03:04:38.647Z] ../../src/main/python/asserts.py:205: in <lambda>
[2024-05-29T03:04:38.647Z]     bring_back = lambda spark: limit_func(spark).collect()
[2024-05-29T03:04:38.647Z] /spark-3.2.0-bin-hadoop3.2/python/pyspark/sql/dataframe.py:693: in collect
[2024-05-29T03:04:38.647Z]     sock_info = self._jdf.collectToPython()
[2024-05-29T03:04:38.647Z] /spark-3.2.0-bin-hadoop3.2/python/lib/py4j-0.10.9.2-src.zip/py4j/java_gateway.py:1309: in __call__
[2024-05-29T03:04:38.647Z]     return_value = get_return_value(
[2024-05-29T03:04:38.647Z] /spark-3.2.0-bin-hadoop3.2/python/pyspark/sql/utils.py:111: in deco
[2024-05-29T03:04:38.647Z]     return f(*a, **kw)
[2024-05-29T03:04:38.647Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
[2024-05-29T03:04:38.647Z] 
[2024-05-29T03:04:38.647Z] answer = 'xro655074'
[2024-05-29T03:04:38.647Z] gateway_client = <py4j.clientserver.JavaClient object at 0x7fdcdefe8f70>
[2024-05-29T03:04:38.647Z] target_id = 'o655073', name = 'collectToPython'
[2024-05-29T03:04:38.647Z] 
[2024-05-29T03:04:38.647Z]     def get_return_value(answer, gateway_client, target_id=None, name=None):
[2024-05-29T03:04:38.647Z]         """Converts an answer received from the Java gateway into a Python object.
[2024-05-29T03:04:38.647Z]     
[2024-05-29T03:04:38.647Z]         For example, string representation of integers are converted to Python
[2024-05-29T03:04:38.647Z]         integer, string representation of objects are converted to JavaObject
[2024-05-29T03:04:38.647Z]         instances, etc.
[2024-05-29T03:04:38.647Z]     
[2024-05-29T03:04:38.647Z]         :param answer: the string returned by the Java gateway
[2024-05-29T03:04:38.647Z]         :param gateway_client: the gateway client used to communicate with the Java
[2024-05-29T03:04:38.647Z]             Gateway. Only necessary if the answer is a reference (e.g., object,
[2024-05-29T03:04:38.648Z]             list, map)
[2024-05-29T03:04:38.648Z]         :param target_id: the name of the object from which the answer comes from
[2024-05-29T03:04:38.648Z]             (e.g., *object1* in `object1.hello()`). Optional.
[2024-05-29T03:04:38.648Z]         :param name: the name of the member from which the answer comes from
[2024-05-29T03:04:38.648Z]             (e.g., *hello* in `object1.hello()`). Optional.
[2024-05-29T03:04:38.648Z]         """
[2024-05-29T03:04:38.648Z]         if is_error(answer)[0]:
[2024-05-29T03:04:38.648Z]             if len(answer) > 1:
[2024-05-29T03:04:38.648Z]                 type = answer[1]
[2024-05-29T03:04:38.648Z]                 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
[2024-05-29T03:04:38.648Z]                 if answer[1] == REFERENCE_TYPE:
[2024-05-29T03:04:38.648Z] >                   raise Py4JJavaError(
[2024-05-29T03:04:38.648Z]                         "An error occurred while calling {0}{1}{2}.\n".
[2024-05-29T03:04:38.648Z]                         format(target_id, ".", name), value)
[2024-05-29T03:04:38.648Z] E                   py4j.protocol.Py4JJavaError: An error occurred while calling o655073.collectToPython.
[2024-05-29T03:04:38.648Z] E                   : org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 8946.0 failed 1 times, most recent failure: Lost task 3.0 in stage 8946.0 (TID 32225) (jdk11-nightly-jdk11-nightly-qcpg3-4xsms executor driver): org.apache.spark.memory.SparkOutOfMemoryError: Unable to acquire 65536 bytes of memory, got 0
[2024-05-29T03:04:38.648Z] E                   	at org.apache.spark.memory.MemoryConsumer.throwOom(MemoryConsumer.java:158)
[2024-05-29T03:04:38.648Z] E                   	at org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:97)
[2024-05-29T03:04:38.648Z] E                   	at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:383)
[2024-05-29T03:04:38.648Z] E                   	at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.allocateMemoryForRecordIfNecessary(UnsafeExternalSorter.java:417)
[2024-05-29T03:04:38.648Z] E                   	at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:455)
[2024-05-29T03:04:38.648Z] E                   	at org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:138)
[2024-05-29T03:04:38.648Z] E                   	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.sort_addToSorter_0$(Unknown Source)
[2024-05-29T03:04:38.648Z] E                   	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
[2024-05-29T03:04:38.648Z] E                   	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
[2024-05-29T03:04:38.648Z] E                   	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
[2024-05-29T03:04:38.648Z] E                   	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.smj_findNextJoinRows_0$(Unknown Source)
[2024-05-29T03:04:38.648Z] E                   	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown Source)
[2024-05-29T03:04:38.648Z] E                   	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
[2024-05-29T03:04:38.648Z] E                   	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$2.hasNext(WholeStageCodegenExec.scala:778)
[2024-05-29T03:04:38.648Z] E                   	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
[2024-05-29T03:04:38.648Z] E                   	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
[2024-05-29T03:04:38.648Z] E                   	at org.apache.spark.util.random.SamplingUtils$.reservoirSampleAndCount(SamplingUtils.scala:41)
[2024-05-29T03:04:38.648Z] E                   	at org.apache.spark.RangePartitioner$.$anonfun$sketch$1(Partitioner.scala:306)
[2024-05-29T03:04:38.648Z] E                   	at org.apache.spark.RangePartitioner$.$anonfun$sketch$1$adapted(Partitioner.scala:304)
[2024-05-29T03:04:38.648Z] E                   	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2(RDD.scala:915)
[2024-05-29T03:04:38.648Z] E                   	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2$adapted(RDD.scala:915)
[2024-05-29T03:04:38.648Z] E                   	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
[2024-05-29T03:04:38.648Z] E                   	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
[2024-05-29T03:04:38.648Z] E                   	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
[2024-05-29T03:04:38.648Z] E                   	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
[2024-05-29T03:04:38.648Z] E                   	at org.apache.spark.scheduler.Task.run(Task.scala:131)
[2024-05-29T03:04:38.648Z] E                   	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
[2024-05-29T03:04:38.648Z] E                   	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
[2024-05-29T03:04:38.648Z] E                   	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
[2024-05-29T03:04:38.648Z] E                   	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
[2024-05-29T03:04:38.648Z] E                   	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
[2024-05-29T03:04:38.648Z] E                   	at java.base/java.lang.Thread.run(Thread.java:829)
[2024-05-29T03:04:38.648Z] E                   
[2024-05-29T03:04:38.648Z] E                   Driver stacktrace:
[2024-05-29T03:04:38.648Z] E                   	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2403)
[2024-05-29T03:04:38.648Z] E                   	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2352)
[2024-05-29T03:04:38.648Z] E                   	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2351)
[2024-05-29T03:04:38.648Z] E                   	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
[2024-05-29T03:04:38.648Z] E                   	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
[2024-05-29T03:04:38.648Z] E                   	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
[2024-05-29T03:04:38.648Z] E                   	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2351)
[2024-05-29T03:04:38.648Z] E                   	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1109)
[2024-05-29T03:04:38.648Z] E                   	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1109)
[2024-05-29T03:04:38.648Z] E                   	at scala.Option.foreach(Option.scala:407)
[2024-05-29T03:04:38.648Z] E                   	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1109)
[2024-05-29T03:04:38.648Z] E                   	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2591)
[2024-05-29T03:04:38.648Z] E                   	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2533)
[2024-05-29T03:04:38.648Z] E                   	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2522)
[2024-05-29T03:04:38.648Z] E                   	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
[2024-05-29T03:04:38.648Z] E                   	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:898)
[2024-05-29T03:04:38.648Z] E                   	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2214)
[2024-05-29T03:04:38.648Z] E                   	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2235)
[2024-05-29T03:04:38.649Z] E                   	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2254)
[2024-05-29T03:04:38.649Z] E                   	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2279)
[2024-05-29T03:04:38.649Z] E                   	at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1030)
[2024-05-29T03:04:38.649Z] E                   	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
[2024-05-29T03:04:38.649Z] E                   	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
[2024-05-29T03:04:38.649Z] E                   	at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
[2024-05-29T03:04:38.649Z] E                   	at org.apache.spark.rdd.RDD.collect(RDD.scala:1029)
[2024-05-29T03:04:38.649Z] E                   	at org.apache.spark.RangePartitioner$.sketch(Partitioner.scala:304)
[2024-05-29T03:04:38.649Z] E                   	at org.apache.spark.RangePartitioner.<init>(Partitioner.scala:171)
[2024-05-29T03:04:38.649Z] E                   	at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$.prepareShuffleDependency(ShuffleExchangeExec.scala:293)
[2024-05-29T03:04:38.649Z] E                   	at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.shuffleDependency$lzycompute(ShuffleExchangeExec.scala:173)
[2024-05-29T03:04:38.649Z] E                   	at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.shuffleDependency(ShuffleExchangeExec.scala:167)
[2024-05-29T03:04:38.649Z] E                   	at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.doExecute(ShuffleExchangeExec.scala:189)
[2024-05-29T03:04:38.649Z] E                   	at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:184)
[2024-05-29T03:04:38.649Z] E                   	at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:222)
[2024-05-29T03:04:38.649Z] E                   	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
[2024-05-29T03:04:38.649Z] E                   	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:219)
[2024-05-29T03:04:38.649Z] E                   	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:180)
[2024-05-29T03:04:38.649Z] E                   	at org.apache.spark.sql.execution.InputAdapter.inputRDD(WholeStageCodegenExec.scala:526)
[2024-05-29T03:04:38.649Z] E                   	at org.apache.spark.sql.execution.InputRDDCodegen.inputRDDs(WholeStageCodegenExec.scala:454)
[2024-05-29T03:04:38.649Z] E                   	at org.apache.spark.sql.execution.InputRDDCodegen.inputRDDs$(WholeStageCodegenExec.scala:453)
[2024-05-29T03:04:38.649Z] E                   	at org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:497)
[2024-05-29T03:04:38.649Z] E                   	at org.apache.spark.sql.execution.SortExec.inputRDDs(SortExec.scala:132)
[2024-05-29T03:04:38.649Z] E                   	at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:750)
[2024-05-29T03:04:38.649Z] E                   	at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:184)
[2024-05-29T03:04:38.649Z] E                   	at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:222)
[2024-05-29T03:04:38.649Z] E                   	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
[2024-05-29T03:04:38.649Z] E                   	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:219)
[2024-05-29T03:04:38.649Z] E                   	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:180)
[2024-05-29T03:04:38.649Z] E                   	at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:325)
[2024-05-29T03:04:38.649Z] E                   	at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:391)
[2024-05-29T03:04:38.649Z] E                   	at org.apache.spark.sql.Dataset.$anonfun$collectToPython$1(Dataset.scala:3538)
[2024-05-29T03:04:38.649Z] E                   	at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3706)
[2024-05-29T03:04:38.649Z] E                   	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
[2024-05-29T03:04:38.649Z] E                   	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
[2024-05-29T03:04:38.649Z] E                   	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
[2024-05-29T03:04:38.649Z] E                   	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
[2024-05-29T03:04:38.649Z] E                   	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
[2024-05-29T03:04:38.649Z] E                   	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3704)
[2024-05-29T03:04:38.649Z] E                   	at org.apache.spark.sql.Dataset.collectToPython(Dataset.scala:3535)
[2024-05-29T03:04:38.649Z] E                   	at jdk.internal.reflect.GeneratedMethodAccessor136.invoke(Unknown Source)
[2024-05-29T03:04:38.649Z] E                   	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
[2024-05-29T03:04:38.649Z] E                   	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
[2024-05-29T03:04:38.649Z] E                   	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
[2024-05-29T03:04:38.649Z] E                   	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
[2024-05-29T03:04:38.649Z] E                   	at py4j.Gateway.invoke(Gateway.java:282)
[2024-05-29T03:04:38.649Z] E                   	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
[2024-05-29T03:04:38.649Z] E                   	at py4j.commands.CallCommand.execute(CallCommand.java:79)
[2024-05-29T03:04:38.649Z] E                   	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
[2024-05-29T03:04:38.649Z] E                   	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
[2024-05-29T03:04:38.649Z] E                   	at java.base/java.lang.Thread.run(Thread.java:829)
[2024-05-29T03:04:38.649Z] E                   Caused by: org.apache.spark.memory.SparkOutOfMemoryError: Unable to acquire 65536 bytes of memory, got 0
[2024-05-29T03:04:38.649Z] E                   	at org.apache.spark.memory.MemoryConsumer.throwOom(MemoryConsumer.java:158)
[2024-05-29T03:04:38.649Z] E                   	at org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:97)
[2024-05-29T03:04:38.649Z] E                   	at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:383)
[2024-05-29T03:04:38.649Z] E                   	at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.allocateMemoryForRecordIfNecessary(UnsafeExternalSorter.java:417)
[2024-05-29T03:04:38.649Z] E                   	at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:455)
[2024-05-29T03:04:38.649Z] E                   	at org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:138)
[2024-05-29T03:04:38.649Z] E                   	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.sort_addToSorter_0$(Unknown Source)
[2024-05-29T03:04:38.649Z] E                   	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
[2024-05-29T03:04:38.649Z] E                   	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
[2024-05-29T03:04:38.649Z] E                   	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
[2024-05-29T03:04:38.649Z] E                   	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.smj_findNextJoinRows_0$(Unknown Source)
[2024-05-29T03:04:38.649Z] E                   	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown Source)
[2024-05-29T03:04:38.650Z] E                  	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
[2024-05-29T03:04:38.650Z] E                   	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$2.hasNext(WholeStageCodegenExec.scala:778)
[2024-05-29T03:04:38.650Z] E                   	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
[2024-05-29T03:04:38.650Z] E                   	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
[2024-05-29T03:04:38.650Z] E                   	at org.apache.spark.util.random.SamplingUtils$.reservoirSampleAndCount(SamplingUtils.scala:41)
[2024-05-29T03:04:38.650Z] E                   	at org.apache.spark.RangePartitioner$.$anonfun$sketch$1(Partitioner.scala:306)
[2024-05-29T03:04:38.650Z] E                   	at org.apache.spark.RangePartitioner$.$anonfun$sketch$1$adapted(Partitioner.scala:304)
[2024-05-29T03:04:38.650Z] E                   	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2(RDD.scala:915)
[2024-05-29T03:04:38.650Z] E                   	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2$adapted(RDD.scala:915)
[2024-05-29T03:04:38.650Z] E                   	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
[2024-05-29T03:04:38.650Z] E                   	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
[2024-05-29T03:04:38.650Z] E                   	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
[2024-05-29T03:04:38.650Z] E                   	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
[2024-05-29T03:04:38.650Z] E                   	at org.apache.spark.scheduler.Task.run(Task.scala:131)
[2024-05-29T03:04:38.650Z] E                   	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
[2024-05-29T03:04:38.650Z] E                   	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
[2024-05-29T03:04:38.650Z] E                   	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
[2024-05-29T03:04:38.650Z] E                   	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
[2024-05-29T03:04:38.650Z] E                   	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
[2024-05-29T03:04:38.650Z] E                   	... 1 more
[2024-05-29T03:04:38.650Z] 
[2024-05-29T03:04:38.650Z] /spark-3.2.0-bin-hadoop3.2/python/lib/py4j-0.10.9.2-src.zip/py4j/protocol.py:326: Py4JJavaError
yinqingh added the bug (Something isn't working) and ? - Needs Triage (Need team to review and classify) labels on May 29, 2024
mattahrens removed the ? - Needs Triage (Need team to review and classify) label on Jun 4, 2024
@mattahrens
Collaborator

Did not reproduce on subsequent builds.

mattahrens closed this as not planned (won't fix, can't repro, duplicate, stale) on Jun 4, 2024
@amahussein
Collaborator

It appeared one more time in a nightly dev build: the rapids_it test run on standalone Spark 3.3.0.
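
For context on what the failing test actually drives, here is an illustrative PySpark sketch of a bucketed-table join of the kind createBucketedTableAndJoin sets up. This is not the real helper from the integration tests; the schema, bucket count, row count, and table names are placeholders chosen only to show the shape of the workload.

# Illustrative approximation only; the actual createBucketedTableAndJoin helper in
# the spark-rapids integration tests may use different schemas, bucket counts,
# and data sizes.
from pyspark.sql import SparkSession

def bucketed_join_sketch(spark: SparkSession, left_tbl: str, right_tbl: str):
    df = spark.range(0, 1_000_000).selectExpr("id", "id % 100 AS key")
    # Write both sides bucketed (and sorted) on the join key so the join can
    # read the buckets directly instead of inserting a shuffle exchange.
    df.write.bucketBy(4, "key").sortBy("key").format("parquet").saveAsTable(left_tbl)
    df.write.bucketBy(4, "key").sortBy("key").format("parquet").saveAsTable(right_tbl)
    # Join the two bucketed tables on the bucketing key and return the result.
    return spark.table(left_tbl).join(spark.table(right_tbl), "key")

Note that the first traceback above fails on the CPU run (with_cpu_session, inside Spark's UnsafeExternalSorter), while the traceback below fails on the GPU run (with_gpu_session, inside the multithreaded Parquet reader), so both sides of the CPU/GPU comparison have hit the intermittent OOM.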

16:07:51  =================================== FAILURES ===================================
16:07:51  _____________________ test_buckets[parquet-reader_confs1] ______________________
16:07:51  [gw3] linux -- Python 3.10.12 /usr/bin/python
16:07:51  
16:07:51  spark_tmp_path = '/tmp/pyspark_tests//HOST_NAME-gw3-913827-1920466094/'
16:07:51  v1_enabled_list = 'parquet'
16:07:51  reader_confs = {'spark.rapids.sql.format.parquet.reader.footer.type': 'NATIVE', 'spark.rapids.sql.format.parquet.reader.type': 'MULTI....rapids.sql.reader.multithreaded.combine.sizeBytes': '0', 'spark.rapids.sql.reader.multithreaded.read.keepOrder': True}
16:07:51  spark_tmp_table_factory = <conftest.TmpTableFactory object at 0x7ff4dafb8b50>
16:07:51  
16:07:51      @ignore_order
16:07:51      @allow_non_gpu('DataWritingCommandExec,ExecutedCommandExec,WriteFilesExec')
16:07:51      @pytest.mark.parametrize('reader_confs', reader_opt_confs)
16:07:51      @pytest.mark.parametrize('v1_enabled_list', ["", "parquet"])
16:07:51      # this test would be better if we could ensure exchanges didn't exist - ie used buckets
16:07:51      def test_buckets(spark_tmp_path, v1_enabled_list, reader_confs, spark_tmp_table_factory):
16:07:51          all_confs = copy_and_update(reader_confs, {
16:07:51              'spark.sql.sources.useV1SourceList': v1_enabled_list,
16:07:51              'spark.sql.autoBroadcastJoinThreshold': '-1'})
16:07:51          def do_it(spark):
16:07:51              return createBucketedTableAndJoin(spark, spark_tmp_table_factory.get(),
16:07:51                      spark_tmp_table_factory.get())
16:07:51  >       assert_gpu_and_cpu_are_equal_collect(do_it, conf=all_confs)
16:07:51  
16:07:51  ../../src/main/python/parquet_test.py:681: 
16:07:51  _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
16:07:51  ../../src/main/python/asserts.py:599: in assert_gpu_and_cpu_are_equal_collect
16:07:51      _assert_gpu_and_cpu_are_equal(func, 'COLLECT', conf=conf, is_cpu_first=is_cpu_first, result_canonicalize_func_before_compare=result_canonicalize_func_before_compare)
16:07:51  ../../src/main/python/asserts.py:507: in _assert_gpu_and_cpu_are_equal
16:07:51      from_gpu = run_on_gpu()
16:07:51  ../../src/main/python/asserts.py:500: in run_on_gpu
16:07:51      from_gpu = with_gpu_session(bring_back, conf=conf)
16:07:51  ../../src/main/python/spark_session.py:166: in with_gpu_session
16:07:51      return with_spark_session(func, conf=copy)
16:07:51  /usr/lib/python3.10/contextlib.py:79: in inner
16:07:51      return func(*args, **kwds)
16:07:51  ../../src/main/python/spark_session.py:133: in with_spark_session
16:07:51      ret = func(_spark)
16:07:51  ../../src/main/python/asserts.py:209: in <lambda>
16:07:51      bring_back = lambda spark: limit_func(spark).collect()
16:07:51  /home/workdir/spark-3.3.0-bin-hadoop3.2/python/pyspark/sql/dataframe.py:817: in collect
16:07:51      sock_info = self._jdf.collectToPython()
16:07:51  /home/workdir/spark-3.3.0-bin-hadoop3.2/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py:1321: in __call__
16:07:51      return_value = get_return_value(
16:07:51  /home/workdir/spark-3.3.0-bin-hadoop3.2/python/pyspark/sql/utils.py:190: in deco
16:07:51      return f(*a, **kw)
16:07:51  _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
16:07:51  
16:07:51  answer = 'xro1183123'
16:07:51  gateway_client = <py4j.clientserver.JavaClient object at 0x7ff4e4d2fcd0>
16:07:51  target_id = 'o1183122', name = 'collectToPython'
16:07:51  
16:07:51      def get_return_value(answer, gateway_client, target_id=None, name=None):
16:07:51          """Converts an answer received from the Java gateway into a Python object.
16:07:51      
16:07:51          For example, string representation of integers are converted to Python
16:07:51          integer, string representation of objects are converted to JavaObject
16:07:51          instances, etc.
16:07:51      
16:07:51          :param answer: the string returned by the Java gateway
16:07:51          :param gateway_client: the gateway client used to communicate with the Java
16:07:51              Gateway. Only necessary if the answer is a reference (e.g., object,
16:07:51              list, map)
16:07:51          :param target_id: the name of the object from which the answer comes from
16:07:51              (e.g., *object1* in `object1.hello()`). Optional.
16:07:51          :param name: the name of the member from which the answer comes from
16:07:51              (e.g., *hello* in `object1.hello()`). Optional.
16:07:51          """
16:07:51          if is_error(answer)[0]:
16:07:51              if len(answer) > 1:
16:07:51                  type = answer[1]
16:07:51                  value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
16:07:51                  if answer[1] == REFERENCE_TYPE:
16:07:51  >                   raise Py4JJavaError(
16:07:51                          "An error occurred while calling {0}{1}{2}.\n".
16:07:51                          format(target_id, ".", name), value)
16:07:51  E                   py4j.protocol.Py4JJavaError: An error occurred while calling o1183122.collectToPython.
16:07:51  E                   : org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 29613.0 failed 1 times, most recent failure: Lost task 2.0 in stage 29613.0 (TID 748491) (10.136.6.4 executor 2): java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: Java heap space
16:07:51  E                   	at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
16:07:51  E                   	at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191)
16:07:51  E                   	at com.nvidia.spark.rapids.MultiFileCloudPartitionReaderBase.getNextBuffersAndMetaSingleFile(GpuMultiFileReader.scala:584)
16:07:51  E                   	at com.nvidia.spark.rapids.MultiFileCloudPartitionReaderBase.getNextBuffersAndMeta(GpuMultiFileReader.scala:598)
16:07:51  E                   	at com.nvidia.spark.rapids.MultiFileCloudPartitionReaderBase.$anonfun$next$1(GpuMultiFileReader.scala:655)
16:07:51  E                   	at com.nvidia.spark.rapids.MultiFileCloudPartitionReaderBase.$anonfun$next$1$adapted(GpuMultiFileReader.scala:630)
16:07:51  E                   	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
16:07:51  E                   	at com.nvidia.spark.rapids.MultiFileCloudPartitionReaderBase.next(GpuMultiFileReader.scala:630)
16:07:51  E                   	at com.nvidia.spark.rapids.PartitionIterator.hasNext(dataSourceUtil.scala:29)
16:07:51  E                   	at com.nvidia.spark.rapids.MetricsBatchIterator.hasNext(dataSourceUtil.scala:46)
16:07:51  E                   	at com.nvidia.spark.rapids.shims.GpuDataSourceRDD$$anon$1.$anonfun$hasNext$1(GpuDataSourceRDD.scala:73)
16:07:51  E                   	at com.nvidia.spark.rapids.shims.GpuDataSourceRDD$$anon$1.$anonfun$hasNext$1$adapted(GpuDataSourceRDD.scala:73)
16:07:51  E                   	at scala.Option.exists(Option.scala:376)
16:07:51  E                   	at com.nvidia.spark.rapids.shims.GpuDataSourceRDD$$anon$1.hasNext(GpuDataSourceRDD.scala:73)
16:07:51  E                   	at com.nvidia.spark.rapids.shims.GpuDataSourceRDD$$anon$1.advanceToNextIter(GpuDataSourceRDD.scala:97)
16:07:51  E                   	at com.nvidia.spark.rapids.shims.GpuDataSourceRDD$$anon$1.hasNext(GpuDataSourceRDD.scala:73)
16:07:51  E                   	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
16:07:51  E                   	at org.apache.spark.sql.rapids.GpuFileSourceScanExec$$anon$1.hasNext(GpuFileSourceScanExec.scala:474)
16:07:51  E                   	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
16:07:51  E                   	at com.nvidia.spark.rapids.CollectTimeIterator.$anonfun$hasNext$1(GpuExec.scala:216)
16:07:51  E                   	at com.nvidia.spark.rapids.CollectTimeIterator.$anonfun$hasNext$1$adapted(GpuExec.scala:215)
16:07:51  E                   	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
16:07:51  E                   	at com.nvidia.spark.rapids.CollectTimeIterator.hasNext(GpuExec.scala:215)
16:07:51  E                   	at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.getHasOnDeck(GpuCoalesceBatches.scala:313)
16:07:51  E                   	at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.hasNext(GpuCoalesceBatches.scala:330)
16:07:51  E                   	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
16:07:51  E                   	at com.nvidia.spark.rapids.CollectTimeIterator.$anonfun$hasNext$1(GpuExec.scala:216)
16:07:51  E                   	at com.nvidia.spark.rapids.CollectTimeIterator.$anonfun$hasNext$1$adapted(GpuExec.scala:215)
16:07:51  E                   	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
16:07:51  E                   	at com.nvidia.spark.rapids.CollectTimeIterator.hasNext(GpuExec.scala:215)
16:07:51  E                   	at com.nvidia.spark.rapids.GpuShuffledSymmetricHashJoinExec$SymmetricJoinSizer.$anonfun$getJoinInfo$2(GpuShuffledSymmetricHashJoinExec.scala:184)
16:07:51  E                   	at com.nvidia.spark.rapids.Arm$.closeOnExcept(Arm.scala:141)
16:07:51  E                   	at com.nvidia.spark.rapids.GpuShuffledSymmetricHashJoinExec$SymmetricJoinSizer.$anonfun$getJoinInfo$1(GpuShuffledSymmetricHashJoinExec.scala:178)
16:07:51  E                   	at com.nvidia.spark.rapids.Arm$.closeOnExcept(Arm.scala:141)
16:07:51  E                   	at com.nvidia.spark.rapids.GpuShuffledSymmetricHashJoinExec$SymmetricJoinSizer.getJoinInfo(GpuShuffledSymmetricHashJoinExec.scala:177)
16:07:51  E                   	at com.nvidia.spark.rapids.GpuShuffledSymmetricHashJoinExec$SymmetricJoinSizer.getJoinInfo$(GpuShuffledSymmetricHashJoinExec.scala:160)
16:07:51  E                   	at com.nvidia.spark.rapids.GpuShuffledSymmetricHashJoinExec$SpillableColumnarBatchJoinSizer.getJoinInfo(GpuShuffledSymmetricHashJoinExec.scala:281)
16:07:51  E                   	at com.nvidia.spark.rapids.GpuShuffledSymmetricHashJoinExec.getGpuGpuJoinInfo(GpuShuffledSymmetricHashJoinExec.scala:594)
16:07:51  E                   	at com.nvidia.spark.rapids.GpuShuffledSymmetricHashJoinExec.$anonfun$internalDoExecuteColumnar$1(GpuShuffledSymmetricHashJoinExec.scala:425)
16:07:51  E                   	at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
16:07:51  E                   	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
16:07:51  E                   	at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
16:07:51  E                   	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
16:07:51  E                   	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
16:07:51  E                   	at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
16:07:51  E                   	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
16:07:51  E                   	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
16:07:51  E                   	at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
16:07:51  E                   	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
16:07:51  E                   	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
16:07:51  E                   	at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
16:07:51  E                   	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
16:07:51  E                   	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
16:07:51  E                   	at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
16:07:51  E                   	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
16:07:51  E                   	at org.apache.spark.scheduler.Task.run(Task.scala:136)
16:07:51  E                   	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
16:07:51  E                   	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
16:07:51  E                   	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
16:07:51  E                   	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
16:07:51  E                   	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
16:07:51  E                   	at java.base/java.lang.Thread.run(Thread.java:829)
16:07:51  E                   Caused by: java.lang.OutOfMemoryError: Java heap space
16:07:51  E                   	at com.nvidia.spark.rapids.ParquetPartitionReaderBase.$anonfun$copyRemoteBlocksData$4(GpuParquetScan.scala:1596)
16:07:51  E                   	at com.nvidia.spark.rapids.ParquetPartitionReaderBase.$anonfun$copyRemoteBlocksData$4$adapted(GpuParquetScan.scala:1595)
16:07:51  E                   	at com.nvidia.spark.rapids.ParquetPartitionReaderBase$$Lambda$5097/0x0000000841516840.apply(Unknown Source)
16:07:51  E                   	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
16:07:51  E                   	at com.nvidia.spark.rapids.ParquetPartitionReaderBase.$anonfun$copyRemoteBlocksData$3(GpuParquetScan.scala:1595)
16:07:51  E                   	at com.nvidia.spark.rapids.ParquetPartitionReaderBase$$Lambda$5096/0x0000000841515c40.apply$mcJ$sp(Unknown Source)
16:07:51  E                   	at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23)
16:07:51  E                   	at scala.Option.getOrElse(Option.scala:189)
16:07:51  E                   	at com.nvidia.spark.rapids.ParquetPartitionReaderBase.copyRemoteBlocksData(GpuParquetScan.scala:1595)
16:07:51  E                   	at com.nvidia.spark.rapids.ParquetPartitionReaderBase.copyBlocksData(GpuParquetScan.scala:1573)
16:07:51  E                   	at com.nvidia.spark.rapids.ParquetPartitionReaderBase.copyBlocksData$(GpuParquetScan.scala:1542)
16:07:51  E                   	at com.nvidia.spark.rapids.MultiFileCloudParquetPartitionReader.copyBlocksData(GpuParquetScan.scala:2063)
16:07:51  E                   	at com.nvidia.spark.rapids.ParquetPartitionReaderBase.$anonfun$readPartFile$2(GpuParquetScan.scala:1670)
16:07:51  E                   	at com.nvidia.spark.rapids.ParquetPartitionReaderBase$$Lambda$8449/0x00000008414e2840.apply(Unknown Source)
16:07:51  E                   	at com.nvidia.spark.rapids.Arm$.closeOnExcept(Arm.scala:98)
16:07:51  E                   	at com.nvidia.spark.rapids.ParquetPartitionReaderBase.$anonfun$readPartFile$1(GpuParquetScan.scala:1667)
16:07:51  E                   	at com.nvidia.spark.rapids.ParquetPartitionReaderBase$$Lambda$8440/0x000000084172f840.apply(Unknown Source)
16:07:51  E                   	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
16:07:51  E                   	at com.nvidia.spark.rapids.ParquetPartitionReaderBase.readPartFile(GpuParquetScan.scala:1665)
16:07:51  E                   	at com.nvidia.spark.rapids.ParquetPartitionReaderBase.readPartFile$(GpuParquetScan.scala:1661)
16:07:51  E                   	at com.nvidia.spark.rapids.MultiFileCloudParquetPartitionReader.readPartFile(GpuParquetScan.scala:2063)
16:07:51  E                   	at com.nvidia.spark.rapids.MultiFileCloudParquetPartitionReader$ReadBatchRunner.doRead(GpuParquetScan.scala:2432)
16:07:51  E                   	at com.nvidia.spark.rapids.MultiFileCloudParquetPartitionReader$ReadBatchRunner.call(GpuParquetScan.scala:2374)
16:07:51  E                   	at com.nvidia.spark.rapids.MultiFileCloudParquetPartitionReader$ReadBatchRunner.call(GpuParquetScan.scala:2355)
16:07:51  E                   	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
16:07:51  E                   	... 3 more
16:07:51  E                   
16:07:51  E                   Driver stacktrace:
16:07:51  E                   	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2672)
16:07:51  E                   	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2608)
16:07:51  E                   	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2607)
16:07:51  E                   	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
16:07:51  E                   	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
16:07:51  E                   	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
16:07:51  E                   	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2607)
16:07:51  E                   	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1182)
16:07:51  E                   	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1182)
16:07:51  E                   	at scala.Option.foreach(Option.scala:407)
16:07:51  E                   	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1182)
16:07:51  E                   	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2860)
16:07:51  E                   	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2802)
16:07:51  E                   	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2791)
16:07:51  E                   	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
16:07:51  E                   	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:952)
16:07:51  E                   	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2228)
16:07:51  E                   	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2249)
16:07:51  E                   	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2268)
16:07:51  E                   	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2293)
16:07:51  E                   	at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1021)
16:07:51  E                   	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
16:07:51  E                   	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
16:07:51  E                   	at org.apache.spark.rdd.RDD.withScope(RDD.scala:406)
16:07:51  E                   	at org.apache.spark.rdd.RDD.collect(RDD.scala:1020)
16:07:51  E                   	at com.nvidia.spark.rapids.GpuRangePartitioner$.sketch(GpuRangePartitioner.scala:51)
16:07:51  E                   	at com.nvidia.spark.rapids.GpuRangePartitioner$.createRangeBounds(GpuRangePartitioner.scala:136)
16:07:51  E                   	at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase$.getPartitioner(GpuShuffleExchangeExecBase.scala:406)
16:07:51  E                   	at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase$.prepareBatchShuffleDependency(GpuShuffleExchangeExecBase.scala:315)
16:07:51  E                   	at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase.shuffleDependencyColumnar$lzycompute(GpuShuffleExchangeExecBase.scala:255)
16:07:51  E                   	at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase.shuffleDependencyColumnar(GpuShuffleExchangeExecBase.scala:244)
16:07:51  E                   	at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase.$anonfun$internalDoExecuteColumnar$1(GpuShuffleExchangeExecBase.scala:270)
16:07:51  E                   	at com.nvidia.spark.rapids.shims.Spark320PlusShims.attachTreeIfSupported(Spark320PlusShims.scala:333)
16:07:51  E                   	at com.nvidia.spark.rapids.shims.Spark320PlusShims.attachTreeIfSupported$(Spark320PlusShims.scala:328)
16:07:51  E                   	at com.nvidia.spark.rapids.shims.SparkShimImpl$.attachTreeIfSupported(SparkShims.scala:26)
16:07:51  E                   	at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase.internalDoExecuteColumnar(GpuShuffleExchangeExecBase.scala:267)
16:07:51  E                   	at com.nvidia.spark.rapids.GpuExec.doExecuteColumnar(GpuExec.scala:388)
16:07:51  E                   	at com.nvidia.spark.rapids.GpuExec.doExecuteColumnar$(GpuExec.scala:387)
16:07:51  E                   	at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase.doExecuteColumnar(GpuShuffleExchangeExecBase.scala:167)
16:07:51  E                   	at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:221)
16:07:51  E                   	at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:232)
16:07:51  E                   	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
16:07:51  E                   	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:229)
16:07:51  E                   	at org.apache.spark.sql.execution.SparkPlan.executeColumnar(SparkPlan.scala:217)
16:07:51  E                   	at com.nvidia.spark.rapids.GpuShuffleCoalesceExec.internalDoExecuteColumnar(GpuShuffleCoalesceExec.scala:70)
16:07:51  E                   	at com.nvidia.spark.rapids.GpuExec.doExecuteColumnar(GpuExec.scala:388)
16:07:51  E                   	at com.nvidia.spark.rapids.GpuExec.doExecuteColumnar$(GpuExec.scala:387)
16:07:51  E                   	at com.nvidia.spark.rapids.GpuShuffleCoalesceExec.doExecuteColumnar(GpuShuffleCoalesceExec.scala:43)
16:07:51  E                   	at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:221)
16:07:51  E                   	at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:232)
16:07:51  E                   	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
16:07:51  E                   	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:229)
16:07:51  E                   	at org.apache.spark.sql.execution.SparkPlan.executeColumnar(SparkPlan.scala:217)
16:07:51  E                   	at com.nvidia.spark.rapids.GpuSortExec.internalDoExecuteColumnar(GpuSortExec.scala:137)
16:07:51  E                   	at com.nvidia.spark.rapids.GpuExec.doExecuteColumnar(GpuExec.scala:388)
16:07:51  E                   	at com.nvidia.spark.rapids.GpuExec.doExecuteColumnar$(GpuExec.scala:387)
16:07:51  E                   	at com.nvidia.spark.rapids.GpuSortExec.doExecuteColumnar(GpuSortExec.scala:86)
16:07:51  E                   	at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:221)
16:07:51  E                   	at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:232)
16:07:51  E                   	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
16:07:51  E                   	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:229)
16:07:51  E                   	at org.apache.spark.sql.execution.SparkPlan.executeColumnar(SparkPlan.scala:217)
16:07:51  E                   	at com.nvidia.spark.rapids.GpuColumnarToRowExec.doExecute(GpuColumnarToRowExec.scala:365)
16:07:51  E                   	at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:194)
16:07:51  E                   	at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:232)
16:07:51  E                   	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
16:07:51  E                   	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:229)
16:07:51  E                   	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:190)
16:07:51  E                   	at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:340)
16:07:51  E                   	at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:421)
16:07:51  E                   	at org.apache.spark.sql.Dataset.$anonfun$collectToPython$1(Dataset.scala:3688)
16:07:51  E                   	at org.apache.spark.sql.Dataset.$anonfun$withAction$2(Dataset.scala:3858)
16:07:51  E                   	at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:510)
16:07:51  E                   	at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3856)
16:07:51  E                   	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:109)
16:07:51  E                   	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:169)
16:07:51  E                   	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95)
16:07:51  E                   	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
16:07:51  E                   	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
16:07:51  E                   	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3856)
16:07:51  E                   	at org.apache.spark.sql.Dataset.collectToPython(Dataset.scala:3685)
16:07:51  E                   	at jdk.internal.reflect.GeneratedMethodAccessor52.invoke(Unknown Source)
16:07:51  E                   	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
16:07:51  E                   	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
16:07:51  E                   	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
16:07:51  E                   	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
16:07:51  E                   	at py4j.Gateway.invoke(Gateway.java:282)
16:07:51  E                   	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
16:07:51  E                   	at py4j.commands.CallCommand.execute(CallCommand.java:79)
16:07:51  E                   	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
16:07:51  E                   	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
16:07:51  E                   	at java.base/java.lang.Thread.run(Thread.java:829)
16:07:51  E                   Caused by: java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: Java heap space
16:07:51  E                   	at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
16:07:51  E                   	at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191)
16:07:51  E                   	at com.nvidia.spark.rapids.MultiFileCloudPartitionReaderBase.getNextBuffersAndMetaSingleFile(GpuMultiFileReader.scala:584)
16:07:51  E                   	at com.nvidia.spark.rapids.MultiFileCloudPartitionReaderBase.getNextBuffersAndMeta(GpuMultiFileReader.scala:598)
16:07:51  E                   	at com.nvidia.spark.rapids.MultiFileCloudPartitionReaderBase.$anonfun$next$1(GpuMultiFileReader.scala:655)
16:07:51  E                   	at com.nvidia.spark.rapids.MultiFileCloudPartitionReaderBase.$anonfun$next$1$adapted(GpuMultiFileReader.scala:630)
16:07:51  E                   	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
16:07:51  E                   	at com.nvidia.spark.rapids.MultiFileCloudPartitionReaderBase.next(GpuMultiFileReader.scala:630)
16:07:51  E                   	at com.nvidia.spark.rapids.PartitionIterator.hasNext(dataSourceUtil.scala:29)
16:07:51  E                   	at com.nvidia.spark.rapids.MetricsBatchIterator.hasNext(dataSourceUtil.scala:46)
16:07:51  E                   	at com.nvidia.spark.rapids.shims.GpuDataSourceRDD$$anon$1.$anonfun$hasNext$1(GpuDataSourceRDD.scala:73)
16:07:51  E                   	at com.nvidia.spark.rapids.shims.GpuDataSourceRDD$$anon$1.$anonfun$hasNext$1$adapted(GpuDataSourceRDD.scala:73)
16:07:51  E                   	at scala.Option.exists(Option.scala:376)
16:07:51  E                   	at com.nvidia.spark.rapids.shims.GpuDataSourceRDD$$anon$1.hasNext(GpuDataSourceRDD.scala:73)
16:07:51  E                   	at com.nvidia.spark.rapids.shims.GpuDataSourceRDD$$anon$1.advanceToNextIter(GpuDataSourceRDD.scala:97)
16:07:51  E                   	at com.nvidia.spark.rapids.shims.GpuDataSourceRDD$$anon$1.hasNext(GpuDataSourceRDD.scala:73)
16:07:51  E                   	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
16:07:51  E                   	at org.apache.spark.sql.rapids.GpuFileSourceScanExec$$anon$1.hasNext(GpuFileSourceScanExec.scala:474)
16:07:51  E                   	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
16:07:51  E                   	at com.nvidia.spark.rapids.CollectTimeIterator.$anonfun$hasNext$1(GpuExec.scala:216)
16:07:51  E                   	at com.nvidia.spark.rapids.CollectTimeIterator.$anonfun$hasNext$1$adapted(GpuExec.scala:215)
16:07:51  E                   	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
16:07:51  E                   	at com.nvidia.spark.rapids.CollectTimeIterator.hasNext(GpuExec.scala:215)
16:07:51  E                   	at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.getHasOnDeck(GpuCoalesceBatches.scala:313)
16:07:51  E                   	at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.hasNext(GpuCoalesceBatches.scala:330)
16:07:51  E                   	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
16:07:51  E                   	at com.nvidia.spark.rapids.CollectTimeIterator.$anonfun$hasNext$1(GpuExec.scala:216)
16:07:51  E                   	at com.nvidia.spark.rapids.CollectTimeIterator.$anonfun$hasNext$1$adapted(GpuExec.scala:215)
16:07:51  E                   	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
16:07:51  E                   	at com.nvidia.spark.rapids.CollectTimeIterator.hasNext(GpuExec.scala:215)
16:07:51  E                   	at com.nvidia.spark.rapids.GpuShuffledSymmetricHashJoinExec$SymmetricJoinSizer.$anonfun$getJoinInfo$2(GpuShuffledSymmetricHashJoinExec.scala:184)
16:07:51  E                   	at com.nvidia.spark.rapids.Arm$.closeOnExcept(Arm.scala:141)
16:07:51  E                   	at com.nvidia.spark.rapids.GpuShuffledSymmetricHashJoinExec$SymmetricJoinSizer.$anonfun$getJoinInfo$1(GpuShuffledSymmetricHashJoinExec.scala:178)
16:07:51  E                   	at com.nvidia.spark.rapids.Arm$.closeOnExcept(Arm.scala:141)
16:07:51  E                   	at com.nvidia.spark.rapids.GpuShuffledSymmetricHashJoinExec$SymmetricJoinSizer.getJoinInfo(GpuShuffledSymmetricHashJoinExec.scala:177)
16:07:51  E                   	at com.nvidia.spark.rapids.GpuShuffledSymmetricHashJoinExec$SymmetricJoinSizer.getJoinInfo$(GpuShuffledSymmetricHashJoinExec.scala:160)
16:07:51  E                   	at com.nvidia.spark.rapids.GpuShuffledSymmetricHashJoinExec$SpillableColumnarBatchJoinSizer.getJoinInfo(GpuShuffledSymmetricHashJoinExec.scala:281)
16:07:51  E                   	at com.nvidia.spark.rapids.GpuShuffledSymmetricHashJoinExec.getGpuGpuJoinInfo(GpuShuffledSymmetricHashJoinExec.scala:594)
16:07:51  E                   	at com.nvidia.spark.rapids.GpuShuffledSymmetricHashJoinExec.$anonfun$internalDoExecuteColumnar$1(GpuShuffledSymmetricHashJoinExec.scala:425)
16:07:51  E                   	at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
16:07:51  E                   	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
16:07:51  E                   	at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
16:07:51  E                   	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
16:07:51  E                   	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
16:07:51  E                   	at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
16:07:51  E                   	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
16:07:51  E                   	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
16:07:51  E                   	at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
16:07:51  E                   	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
16:07:51  E                   	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
16:07:51  E                   	at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
16:07:51  E                   	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
16:07:51  E                   	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
16:07:51  E                   	at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
16:07:51  E                   	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
16:07:51  E                   	at org.apache.spark.scheduler.Task.run(Task.scala:136)
16:07:51  E                   	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
16:07:51  E                   	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
16:07:51  E                   	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
16:07:51  E                   	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
16:07:51  E                   	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
16:07:51  E                   	... 1 more
16:07:51  E                   Caused by: java.lang.OutOfMemoryError: Java heap space
16:07:51  E                   	at com.nvidia.spark.rapids.ParquetPartitionReaderBase.$anonfun$copyRemoteBlocksData$4(GpuParquetScan.scala:1596)
16:07:51  E                   	at com.nvidia.spark.rapids.ParquetPartitionReaderBase.$anonfun$copyRemoteBlocksData$4$adapted(GpuParquetScan.scala:1595)
16:07:51  E                   	at com.nvidia.spark.rapids.ParquetPartitionReaderBase$$Lambda$5097/0x0000000841516840.apply(Unknown Source)
16:07:51  E                   	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
16:07:51  E                   	at com.nvidia.spark.rapids.ParquetPartitionReaderBase.$anonfun$copyRemoteBlocksData$3(GpuParquetScan.scala:1595)
16:07:51  E                   	at com.nvidia.spark.rapids.ParquetPartitionReaderBase$$Lambda$5096/0x0000000841515c40.apply$mcJ$sp(Unknown Source)
16:07:51  E                   	at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23)
16:07:51  E                   	at scala.Option.getOrElse(Option.scala:189)
16:07:51  E                   	at com.nvidia.spark.rapids.ParquetPartitionReaderBase.copyRemoteBlocksData(GpuParquetScan.scala:1595)
16:07:51  E                   	at com.nvidia.spark.rapids.ParquetPartitionReaderBase.copyBlocksData(GpuParquetScan.scala:1573)
16:07:51  E                   	at com.nvidia.spark.rapids.ParquetPartitionReaderBase.copyBlocksData$(GpuParquetScan.scala:1542)
16:07:51  E                   	at com.nvidia.spark.rapids.MultiFileCloudParquetPartitionReader.copyBlocksData(GpuParquetScan.scala:2063)
16:07:51  E                   	at com.nvidia.spark.rapids.ParquetPartitionReaderBase.$anonfun$readPartFile$2(GpuParquetScan.scala:1670)
16:07:51  E                   	at com.nvidia.spark.rapids.ParquetPartitionReaderBase$$Lambda$8449/0x00000008414e2840.apply(Unknown Source)
16:07:51  E                   	at com.nvidia.spark.rapids.Arm$.closeOnExcept(Arm.scala:98)
16:07:51  E                   	at com.nvidia.spark.rapids.ParquetPartitionReaderBase.$anonfun$readPartFile$1(GpuParquetScan.scala:1667)
16:07:51  E                   	at com.nvidia.spark.rapids.ParquetPartitionReaderBase$$Lambda$8440/0x000000084172f840.apply(Unknown Source)
16:07:51  E                   	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
16:07:51  E                   	at com.nvidia.spark.rapids.ParquetPartitionReaderBase.readPartFile(GpuParquetScan.scala:1665)
16:07:51  E                   	at com.nvidia.spark.rapids.ParquetPartitionReaderBase.readPartFile$(GpuParquetScan.scala:1661)
16:07:51  E                   	at com.nvidia.spark.rapids.MultiFileCloudParquetPartitionReader.readPartFile(GpuParquetScan.scala:2063)
16:07:51  E                   	at com.nvidia.spark.rapids.MultiFileCloudParquetPartitionReader$ReadBatchRunner.doRead(GpuParquetScan.scala:2432)
16:07:51  E                   	at com.nvidia.spark.rapids.MultiFileCloudParquetPartitionReader$ReadBatchRunner.call(GpuParquetScan.scala:2374)
16:07:51  E                   	at com.nvidia.spark.rapids.MultiFileCloudParquetPartitionReader$ReadBatchRunner.call(GpuParquetScan.scala:2355)
16:07:51  E                   	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
16:07:51  E                   	... 3 more
16:07:51  
16:07:51  /home/workdir/spark-3.3.0-bin-hadoop3.2/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py:326: Py4JJavaError

@amahussein amahussein reopened this Jun 28, 2024
@pxLi
Member

pxLi commented Jul 1, 2024

The failure above was found in a nightly multi-threaded run on an EGX machine.
It appears to be an intermittent error; it would be good to have someone help check whether this case has a memory leak.
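For whoever picks this up: below is a minimal sketch (not something the test harness does today) of how the heap behaviour could be captured the next time the intermittent OOM reproduces, so a leak can be confirmed or ruled out from a heap dump. The dump and log paths are illustrative assumptions; the Spark config keys and the JDK 11 JVM flags themselves are standard.

```python
# Hypothetical diagnostics sketch: extra Spark confs that could be passed when
# the integration-test JVMs are launched (e.g. as --conf arguments to
# spark-submit or the test launch script). They make an OOM leave evidence
# behind instead of only a stack trace.
diag_jvm_opts = " ".join([
    "-XX:+HeapDumpOnOutOfMemoryError",            # write an .hprof when the heap OOMs
    "-XX:HeapDumpPath=/tmp/rapids-it-heapdumps",  # illustrative dump location
    "-Xlog:gc*:file=/tmp/rapids-it-gc.log",       # JDK 11 unified GC logging
])

diagnostic_confs = {
    # These runs execute in local mode (the lost task above shows "executor
    # driver"), so the driver JVM is the one that actually hits this error;
    # the executor options are included only for completeness on a real cluster.
    "spark.driver.extraJavaOptions": diag_jvm_opts,
    "spark.executor.extraJavaOptions": diag_jvm_opts,
}

print(" ".join("--conf {}='{}'".format(k, v) for k, v in diagnostic_confs.items()))
```

Note that `extraJavaOptions` only takes effect when the JVM is started, so these would have to go on the launch command line rather than into a per-test conf dictionary.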

@sameerz sameerz added the ? - Needs Triage Need team to review and classify label Jul 8, 2024
@mattahrens mattahrens removed the ? - Needs Triage Need team to review and classify label Jul 9, 2024
@res-life
Collaborator

Also hit this in pre-merge, seqID: 9729.
The stack trace is:

[2024-07-10T10:34:05.983Z] : org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 18351.0 failed 1 times, most recent failure: Lost task 2.0 in stage 18351.0 (TID 45910) (ci-scala213-jenkins-rapids-premerge-github-9729-kq78v-fbd7z executor driver): org.apache.spark.memory.SparkOutOfMemoryError: Unable to acquire 65536 bytes of memory, got 0
[2024-07-10T10:34:05.983Z] 	at org.apache.spark.memory.MemoryConsumer.throwOom(MemoryConsumer.java:158)
[2024-07-10T10:34:05.983Z] 	at org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:97)
[2024-07-10T10:34:05.983Z] 	at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:413)
[2024-07-10T10:34:05.983Z] 	at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.allocateMemoryForRecordIfNecessary(UnsafeExternalSorter.java:447)
[2024-07-10T10:34:05.983Z] 	at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:485)
[2024-07-10T10:34:05.983Z] 	at org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:138)
[2024-07-10T10:34:05.983Z] 	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.sort_addToSorter_0$(Unknown Source)
[2024-07-10T10:34:05.983Z] 	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
[2024-07-10T10:34:05.983Z] 	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
[2024-07-10T10:34:05.983Z] 	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
[2024-07-10T10:34:05.983Z] 	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.smj_findNextJoinRows_0$(Unknown Source)
[2024-07-10T10:34:05.983Z] 	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown Source)
[2024-07-10T10:34:05.983Z] 	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
[2024-07-10T10:34:05.983Z] 	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$2.hasNext(WholeStageCodegenExec.scala:779)
[2024-07-10T10:34:05.983Z] 	at scala.collection.Iterator$$anon$9.hasNext(Iterator.scala:576)
[2024-07-10T10:34:05.983Z] 	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
[2024-07-10T10:34:05.983Z] 	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
[2024-07-10T10:34:05.983Z] 	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
[2024-07-10T10:34:05.983Z] 	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
[2024-07-10T10:34:05.983Z] 	at org.apache.spark.scheduler.Task.run(Task.scala:136)
[2024-07-10T10:34:05.983Z] 	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
[2024-07-10T10:34:05.983Z] 	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
[2024-07-10T10:34:05.983Z] 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
[2024-07-10T10:34:05.983Z] 	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
[2024-07-10T10:34:05.983Z] 	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
[2024-07-10T10:34:05.983Z] 	at java.base/java.lang.Thread.run(Thread.java:840)
[2024-07-10T10:34:05.983Z] 
[2024-07-10T10:34:05.983Z] Driver stacktrace:
[2024-07-10T10:34:05.983Z] 	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2672)
[2024-07-10T10:34:05.983Z] 	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2608)
[2024-07-10T10:34:05.983Z] 	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2607)
[2024-07-10T10:34:05.983Z] 	at scala.collection.immutable.List.foreach(List.scala:333)
[2024-07-10T10:34:05.983Z] 	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2607)
[2024-07-10T10:34:05.983Z] 	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1182)
[2024-07-10T10:34:05.983Z] 	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1182)
[2024-07-10T10:34:05.984Z] 	at scala.Option.foreach(Option.scala:437)
[2024-07-10T10:34:05.984Z] 	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1182)
[2024-07-10T10:34:05.984Z] 	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2860)
[2024-07-10T10:34:05.984Z] 	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2802)
[2024-07-10T10:34:05.984Z] 	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2791)
[2024-07-10T10:34:05.984Z] 	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
[2024-07-10T10:34:05.984Z] 	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:952)
[2024-07-10T10:34:05.984Z] 	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2228)
[2024-07-10T10:34:05.984Z] 	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2249)
[2024-07-10T10:34:05.984Z] 	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2268)
[2024-07-10T10:34:05.984Z] 	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2293)
[2024-07-10T10:34:05.984Z] 	at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1021)
[2024-07-10T10:34:05.984Z] 	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
[2024-07-10T10:34:05.984Z] 	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
[2024-07-10T10:34:05.984Z] 	at org.apache.spark.rdd.RDD.withScope(RDD.scala:406)
[2024-07-10T10:34:05.984Z] 	at org.apache.spark.rdd.RDD.collect(RDD.scala:1020)
[2024-07-10T10:34:05.984Z] 	at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:424)
[2024-07-10T10:34:05.984Z] 	at org.apache.spark.sql.Dataset.$anonfun$collectToPython$1(Dataset.scala:3688)
[2024-07-10T10:34:05.984Z] 	at org.apache.spark.sql.Dataset.$anonfun$withAction$2(Dataset.scala:3858)
[2024-07-10T10:34:05.984Z] 	at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:510)
[2024-07-10T10:34:05.984Z] 	at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3856)
[2024-07-10T10:34:05.984Z] 	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:109)
[2024-07-10T10:34:05.984Z] 	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:169)
[2024-07-10T10:34:05.984Z] 	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95)
[2024-07-10T10:34:05.984Z] 	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
[2024-07-10T10:34:05.984Z] 	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
[2024-07-10T10:34:05.984Z] 	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3856)
[2024-07-10T10:34:05.984Z] 	at org.apache.spark.sql.Dataset.collectToPython(Dataset.scala:3685)
[2024-07-10T10:34:05.984Z] 	at jdk.internal.reflect.GeneratedMethodAccessor124.invoke(Unknown Source)
[2024-07-10T10:34:05.984Z] 	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
[2024-07-10T10:34:05.984Z] 	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
[2024-07-10T10:34:05.984Z] 	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
[2024-07-10T10:34:05.984Z] 	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
[2024-07-10T10:34:05.984Z] 	at py4j.Gateway.invoke(Gateway.java:282)
[2024-07-10T10:34:05.984Z] 	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
[2024-07-10T10:34:05.984Z] 	at py4j.commands.CallCommand.execute(CallCommand.java:79)
[2024-07-10T10:34:05.984Z] 	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
[2024-07-10T10:34:05.984Z] 	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
[2024-07-10T10:34:05.984Z] 	at java.base/java.lang.Thread.run(Thread.java:840)
[2024-07-10T10:34:05.984Z] Caused by: org.apache.spark.memory.SparkOutOfMemoryError: Unable to acquire 65536 bytes of memory, got 0
[2024-07-10T10:34:05.984Z] 	at org.apache.spark.memory.MemoryConsumer.throwOom(MemoryConsumer.java:158)
[2024-07-10T10:34:05.984Z] 	at org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:97)
[2024-07-10T10:34:05.984Z] 	at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:413)
[2024-07-10T10:34:05.984Z] 	at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.allocateMemoryForRecordIfNecessary(UnsafeExternalSorter.java:447)
[2024-07-10T10:34:05.984Z] 	at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:485)
[2024-07-10T10:34:05.984Z] 	at org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:138)
[2024-07-10T10:34:05.984Z] 	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.sort_addToSorter_0$(Unknown Source)
[2024-07-10T10:34:05.984Z] 	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
[2024-07-10T10:34:05.984Z] 	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
[2024-07-10T10:34:05.984Z] 	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
[2024-07-10T10:34:05.984Z] 	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.smj_findNextJoinRows_0$(Unknown Source)
[2024-07-10T10:34:05.984Z] 	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown Source)
[2024-07-10T10:34:05.984Z] 	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
[2024-07-10T10:34:05.984Z] 	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$2.hasNext(WholeStageCodegenExec.scala:779)
[2024-07-10T10:34:05.984Z] 	at scala.collection.Iterator$$anon$9.hasNext(Iterator.scala:576)
[2024-07-10T10:34:05.984Z] 	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
[2024-07-10T10:34:05.984Z] 	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
[2024-07-10T10:34:05.984Z] 	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
[2024-07-10T10:34:05.984Z] 	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
[2024-07-10T10:34:05.984Z] 	at org.apache.spark.scheduler.Task.run(Task.scala:136)
[2024-07-10T10:34:05.984Z] 	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
[2024-07-10T10:34:05.984Z] 	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
[2024-07-10T10:34:05.984Z] 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
[2024-07-10T10:34:05.984Z] 	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
[2024-07-10T10:34:05.984Z] 	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
[2024-07-10T10:34:05.984Z] 	... 1 more

@pxLi
Member

pxLi commented Jul 25, 2024

Saw another instance of this fail in pre-merge in #11246.

@pxLi
Member

pxLi commented Jul 26, 2024

This is happening more and more frequently.

@sameerz sameerz added the ? - Needs Triage Need team to review and classify label Jul 26, 2024
@pxLi pxLi changed the title [BUG] parquet_test.py::test_buckets failed due to SparkOutOfMemoryError [BUG] parquet_test.py::test_buckets failed due to SparkOutOfMemoryError intermittently Jul 29, 2024
@pxLi
Member

pxLi commented Jul 29, 2024

Trying a mitigation first: #11260.

@mattahrens
Collaborator

Is this resolved now because of #11260?

@pxLi
Member

pxLi commented Aug 7, 2024

Is this resolved now because of #11260?

We have not seen the issue since the mitigation PR (which increased heap memory), but we still need developers to confirm whether the increased memory usage is expected.
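For reference, a mitigation of this kind boils down to giving the test JVM a larger heap. The sketch below is purely illustrative of what that looks like for a local-mode run; the `4g` value and the way the conf is passed are assumptions for illustration, not a description of what #11260 actually changed.

```python
# Illustrative heap-increase sketch for a local-mode integration-test run.
# In local mode every task executes inside the driver JVM, so
# spark.driver.memory bounds the heap that Spark's on-heap execution memory
# (the pool the UnsafeExternalSorter above failed to acquire from) is carved out of.
heap_confs = {
    "spark.driver.memory": "4g",  # assumed value, purely for illustration
}

print(" ".join("--conf {}={}".format(k, v) for k, v in heap_confs.items()))
```

Like `extraJavaOptions`, `spark.driver.memory` has to be supplied when the JVM is launched; it cannot be raised on an already-running session.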

@mattahrens
Collaborator

The scope would be to add an FAQ entry to the docs about the possible memory settings related to the JDK version.

@mattahrens mattahrens added documentation Improvements or additions to documentation and removed ? - Needs Triage Need team to review and classify labels Aug 13, 2024