
[BUG] Part of the plan is not columnar class org.apache.spark.sql.execution.ProjectExec failure #6032

Status: Closed · Fixed by #6041
Opened by pxLi on Jul 20, 2022 · 1 comment
Labels: bug (Something isn't working)

pxLi (Member) commented Jul 20, 2022

Describe the bug
Seen in the following nightly runs:
rapids_it-Dataproc #5837
rapids_it-EGX-Yarn #374

FAILED integration_tests/src/main/python/qa_nightly_select_test.py::test_select[REGEXP_REPLACE(strF, 'Yu', 'Eric')][INCOMPAT, APPROXIMATE_FLOAT]
FAILED integration_tests/src/main/python/conditionals_test.py::test_conditional_with_side_effects_cast[String]
FAILED integration_tests/src/main/python/conditionals_test.py::test_conditional_with_side_effects_case_when[String]
pyspark.sql.utils.IllegalArgumentException: Part of the plan is not columnar class org.apache.spark.sql.execution.ProjectExec
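For context, this exception appears to come from the plan-validation step the RAPIDS Accelerator runs in test mode: after planning, every node in the physical plan is expected to be columnar (a GPU exec), and any remaining row-based exec such as ProjectExec fails the whole query even though the results would match the CPU run. A rough, self-contained Python sketch of that kind of check follows; the class and field names here are illustrative, not the plugin's actual code:

```python
# Hypothetical stand-in for a physical plan node. In the real plugin the
# check walks Spark's SparkPlan tree; "columnar" marks GPU execs.
class PlanNode:
    def __init__(self, name, columnar, children=()):
        self.name = name          # e.g. "ProjectExec" or "GpuProjectExec"
        self.columnar = columnar  # True if the node produces columnar batches
        self.children = children

def assert_is_columnar(node):
    """Walk the plan and fail on the first row-based (CPU) node,
    mimicking the IllegalArgumentException seen in the logs."""
    if not node.columnar:
        raise ValueError(
            "Part of the plan is not columnar class "
            "org.apache.spark.sql.execution." + node.name)
    for child in node.children:
        assert_is_columnar(child)

# A ProjectExec that fell back to the CPU trips the check, mirroring
# the failing plans in the tracebacks below:
plan = PlanNode("GpuColumnarToRowExec", True,
                (PlanNode("ProjectExec", False,
                          (PlanNode("GpuScanExec", True),)),))
```

Under this reading, an expression the plugin cannot translate (here, the regexp handling) forces the containing ProjectExec back onto the CPU, and the test-mode assertion turns that fallback into the hard failure above.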

Pytest log:

[2022-07-19T17:13:28.885Z] =================================== FAILURES ===================================
[2022-07-19T17:13:28.885Z] _______________ test_select[REGEXP_REPLACE(strF, 'Yu', 'Eric')] ________________
[2022-07-19T17:13:28.885Z] 
[2022-07-19T17:13:28.885Z] sql_query_line = ("SELECT REGEXP_REPLACE(strF, 'Yu', 'Eric') FROM test_table", "REGEXP_REPLACE(strF, 'Yu', 'Eric')")
[2022-07-19T17:13:28.885Z] pytestconfig = <_pytest.config.Config object at 0x7f03f46be670>
[2022-07-19T17:13:28.885Z] 
[2022-07-19T17:13:28.885Z]     @approximate_float
[2022-07-19T17:13:28.885Z]     @incompat
[2022-07-19T17:13:28.885Z]     @qarun
[2022-07-19T17:13:28.885Z]     @pytest.mark.parametrize('sql_query_line', SELECT_SQL, ids=idfn)
[2022-07-19T17:13:28.885Z]     def test_select(sql_query_line, pytestconfig):
[2022-07-19T17:13:28.885Z]         sql_query = sql_query_line[0]
[2022-07-19T17:13:28.885Z]         if sql_query:
[2022-07-19T17:13:28.885Z]             print(sql_query)
[2022-07-19T17:13:28.885Z]             with_cpu_session(num_stringDf)
[2022-07-19T17:13:28.885Z] >           assert_gpu_and_cpu_are_equal_collect(lambda spark: spark.sql(sql_query), conf=_qa_conf)
[2022-07-19T17:13:28.885Z] 
[2022-07-19T17:13:28.885Z] integration_tests/src/main/python/qa_nightly_select_test.py:167: 
[2022-07-19T17:13:28.885Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
[2022-07-19T17:13:28.885Z] integration_tests/src/main/python/asserts.py:508: in assert_gpu_and_cpu_are_equal_collect
[2022-07-19T17:13:28.885Z]     _assert_gpu_and_cpu_are_equal(func, 'COLLECT', conf=conf, is_cpu_first=is_cpu_first)
[2022-07-19T17:13:28.885Z] integration_tests/src/main/python/asserts.py:428: in _assert_gpu_and_cpu_are_equal
[2022-07-19T17:13:28.885Z]     run_on_gpu()
[2022-07-19T17:13:28.885Z] integration_tests/src/main/python/asserts.py:422: in run_on_gpu
[2022-07-19T17:13:28.885Z]     from_gpu = with_gpu_session(bring_back, conf=conf)
[2022-07-19T17:13:28.885Z] integration_tests/src/main/python/spark_session.py:131: in with_gpu_session
[2022-07-19T17:13:28.885Z]     return with_spark_session(func, conf=copy)
[2022-07-19T17:13:28.885Z] integration_tests/src/main/python/spark_session.py:98: in with_spark_session
[2022-07-19T17:13:28.885Z]     ret = func(_spark)
[2022-07-19T17:13:28.885Z] integration_tests/src/main/python/asserts.py:201: in <lambda>
[2022-07-19T17:13:28.885Z]     bring_back = lambda spark: limit_func(spark).collect()
[2022-07-19T17:13:28.885Z] /hadoop/yarn/nm-local-dir/usercache/sa_116163337916449219958/appcache/application_1658242515047_0014/container_e02_1658242515047_0014_01_000001/pyspark.zip/pyspark/sql/dataframe.py:677: in collect
[2022-07-19T17:13:28.885Z]     sock_info = self._jdf.collectToPython()
[2022-07-19T17:13:28.885Z] /hadoop/yarn/nm-local-dir/usercache/sa_116163337916449219958/appcache/application_1658242515047_0014/container_e02_1658242515047_0014_01_000001/py4j-0.10.9-src.zip/py4j/java_gateway.py:1304: in __call__
[2022-07-19T17:13:28.885Z]     return_value = get_return_value(
[2022-07-19T17:13:28.885Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
[2022-07-19T17:13:28.885Z] 
[2022-07-19T17:13:28.885Z] a = ('xro53146', <py4j.java_gateway.GatewayClient object at 0x7f03f3b8c1f0>, 'o53145', 'collectToPython')
[2022-07-19T17:13:28.885Z] kw = {}
[2022-07-19T17:13:28.885Z] converted = IllegalArgumentException('Part of the plan is not columnar class org.apache.spark.sql.execution.ProjectExec\nProject [...:79)\n\tat py4j.GatewayConnection.run(GatewayConnection.java:238)\n\tat java.lang.Thread.run(Thread.java:750)\n', None)
[2022-07-19T17:13:28.885Z] 
[2022-07-19T17:13:28.885Z]     def deco(*a, **kw):
[2022-07-19T17:13:28.885Z]         try:
[2022-07-19T17:13:28.885Z]             return f(*a, **kw)
[2022-07-19T17:13:28.885Z]         except py4j.protocol.Py4JJavaError as e:
[2022-07-19T17:13:28.885Z]             converted = convert_exception(e.java_exception)
[2022-07-19T17:13:28.885Z]             if not isinstance(converted, UnknownException):
[2022-07-19T17:13:28.885Z]                 # Hide where the exception came from that shows a non-Pythonic
[2022-07-19T17:13:28.885Z]                 # JVM exception message.
[2022-07-19T17:13:28.885Z] >               raise converted from None
[2022-07-19T17:13:28.885Z] E               pyspark.sql.utils.IllegalArgumentException: Part of the plan is not columnar class org.apache.spark.sql.execution.ProjectExec
[2022-07-19T17:13:28.885Z] E               Project [regexp_replace(strF#6499, Yu, Eric, 1) AS regexp_replace(strF, Yu, Eric, 1)#6534]
[2022-07-19T17:13:28.885Z] E               +- Scan ExistingRDD[strF#6499,byteF#6500,shortF#6501,intF#6502,longF#6503L,floatF#6504,doubleF#6505,decimalF#6506,booleanF#6507,timestampF#6508,dateF#6509]
[2022-07-19T17:13:28.885Z] 
[2022-07-19T17:13:28.885Z] /hadoop/yarn/nm-local-dir/usercache/sa_116163337916449219958/appcache/application_1658242515047_0014/container_e02_1658242515047_0014_01_000001/pyspark.zip/pyspark/sql/utils.py:117: IllegalArgumentException
[2022-07-19T17:13:28.885Z] ----------------------------- Captured stdout call -----------------------------
[2022-07-19T17:13:28.885Z] SELECT REGEXP_REPLACE(strF, 'Yu', 'Eric') FROM test_table
[2022-07-19T17:13:28.885Z] ### CREATE DATAFRAME 1  ####
[2022-07-19T17:13:28.885Z] 1990-01-01
[2022-07-19T17:13:28.885Z] ### CPU RUN ###
[2022-07-19T17:13:28.885Z] ### GPU RUN ###
[2022-07-19T17:05:25.456Z] =================================== FAILURES ===================================
[2022-07-19T17:05:25.456Z] _______________ test_conditional_with_side_effects_cast[String] ________________
[2022-07-19T17:05:25.456Z] 
[2022-07-19T17:05:25.456Z] data_gen = String
[2022-07-19T17:05:25.456Z] 
[2022-07-19T17:05:25.456Z]     @pytest.mark.parametrize('data_gen', [mk_str_gen('[0-9]{1,20}')], ids=idfn)
[2022-07-19T17:05:25.457Z]     def test_conditional_with_side_effects_cast(data_gen):
[2022-07-19T17:05:25.457Z]         test_conf=copy_and_update(
[2022-07-19T17:05:25.457Z]             ansi_enabled_conf, {'spark.rapids.sql.regexp.enabled': True})
[2022-07-19T17:05:25.457Z]         assert_gpu_and_cpu_are_equal_collect(
[2022-07-19T17:05:25.457Z]                 lambda spark : unary_op_df(spark, data_gen).selectExpr(
[2022-07-19T17:05:25.457Z]                     'IF(a RLIKE "^[0-9]{1,5}\\z", CAST(a AS INT), 0)'),
[2022-07-19T17:05:25.457Z] >               conf = test_conf)
[2022-07-19T17:05:25.457Z] 
[2022-07-19T17:05:25.457Z] integration_tests/src/main/python/conditionals_test.py:208: 
[2022-07-19T17:05:25.457Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
[2022-07-19T17:05:25.457Z] integration_tests/src/main/python/asserts.py:508: in assert_gpu_and_cpu_are_equal_collect
[2022-07-19T17:05:25.457Z]     _assert_gpu_and_cpu_are_equal(func, 'COLLECT', conf=conf, is_cpu_first=is_cpu_first)
[2022-07-19T17:05:25.457Z] integration_tests/src/main/python/asserts.py:428: in _assert_gpu_and_cpu_are_equal
[2022-07-19T17:05:25.457Z]     run_on_gpu()
[2022-07-19T17:05:25.457Z] integration_tests/src/main/python/asserts.py:422: in run_on_gpu
[2022-07-19T17:05:25.457Z]     from_gpu = with_gpu_session(bring_back, conf=conf)
[2022-07-19T17:05:25.457Z] integration_tests/src/main/python/spark_session.py:131: in with_gpu_session
[2022-07-19T17:05:25.457Z]     return with_spark_session(func, conf=copy)
[2022-07-19T17:05:25.457Z] integration_tests/src/main/python/spark_session.py:98: in with_spark_session
[2022-07-19T17:05:25.457Z]     ret = func(_spark)
[2022-07-19T17:05:25.457Z] integration_tests/src/main/python/asserts.py:201: in <lambda>
[2022-07-19T17:05:25.457Z]     bring_back = lambda spark: limit_func(spark).collect()
[2022-07-19T17:05:25.457Z] /hadoop_disks/disk0/local/usercache/****/appcache/application_1642661999714_11269/container_e94_1642661999714_11269_01_000001/pyspark.zip/pyspark/sql/dataframe.py:677: in collect
[2022-07-19T17:05:25.457Z]     sock_info = self._jdf.collectToPython()
[2022-07-19T17:05:25.457Z] /hadoop_disks/disk0/local/usercache/****/appcache/application_1642661999714_11269/container_e94_1642661999714_11269_01_000001/py4j-0.10.9-src.zip/py4j/java_gateway.py:1305: in __call__
[2022-07-19T17:05:25.457Z]     answer, self.gateway_client, self.target_id, self.name)
[2022-07-19T17:05:25.457Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
[2022-07-19T17:05:25.457Z] 
[2022-07-19T17:05:25.457Z] a = ('xro84151', <py4j.java_gateway.GatewayClient object at 0x7fbcbdfc56d8>, 'o84150', 'collectToPython')
[2022-07-19T17:05:25.457Z] kw = {}
[2022-07-19T17:05:25.457Z] converted = IllegalArgumentException('Part of the plan is not columnar class org.apache.spark.sql.execution.ProjectExec\nProject [...:79)\n\tat py4j.GatewayConnection.run(GatewayConnection.java:238)\n\tat java.lang.Thread.run(Thread.java:748)\n', None)
[2022-07-19T17:05:25.457Z] 
[2022-07-19T17:05:25.457Z]     def deco(*a, **kw):
[2022-07-19T17:05:25.457Z]         try:
[2022-07-19T17:05:25.457Z]             return f(*a, **kw)
[2022-07-19T17:05:25.457Z]         except py4j.protocol.Py4JJavaError as e:
[2022-07-19T17:05:25.457Z]             converted = convert_exception(e.java_exception)
[2022-07-19T17:05:25.457Z]             if not isinstance(converted, UnknownException):
[2022-07-19T17:05:25.457Z]                 # Hide where the exception came from that shows a non-Pythonic
[2022-07-19T17:05:25.457Z]                 # JVM exception message.
[2022-07-19T17:05:25.457Z] >               raise converted from None
[2022-07-19T17:05:25.457Z] E               pyspark.sql.utils.IllegalArgumentException: Part of the plan is not columnar class org.apache.spark.sql.execution.ProjectExec
[2022-07-19T17:05:25.457Z] E               Project [if (a#16392 RLIKE ^[0-9]{1,5}z) ansi_cast(a#16392 as int) else 0 AS (IF(a RLIKE ^[0-9]{1,5}z, CAST(a AS INT), 0))#16394]
[2022-07-19T17:05:25.457Z] E               +- Scan ExistingRDD[a#16392]
[2022-07-19T17:05:25.457Z] 
[2022-07-19T17:05:25.457Z] /hadoop_disks/disk0/local/usercache/****/appcache/application_1642661999714_11269/container_e94_1642661999714_11269_01_000001/pyspark.zip/pyspark/sql/utils.py:117: IllegalArgumentException
[2022-07-19T17:05:25.457Z] ----------------------------- Captured stdout call -----------------------------
[2022-07-19T17:05:25.457Z] ### CPU RUN ###
[2022-07-19T17:05:25.457Z] ### GPU RUN ###
[2022-07-19T17:05:25.457Z] _____________ test_conditional_with_side_effects_case_when[String] _____________
[2022-07-19T17:05:25.457Z] 
[2022-07-19T17:05:25.457Z] data_gen = String
[2022-07-19T17:05:25.457Z] 
[2022-07-19T17:05:25.457Z]     @pytest.mark.parametrize('data_gen', [mk_str_gen('[0-9]{1,9}')], ids=idfn)
[2022-07-19T17:05:25.457Z]     def test_conditional_with_side_effects_case_when(data_gen):
[2022-07-19T17:05:25.457Z]         test_conf=copy_and_update(
[2022-07-19T17:05:25.457Z]             ansi_enabled_conf, {'spark.rapids.sql.regexp.enabled': True})
[2022-07-19T17:05:25.457Z]         assert_gpu_and_cpu_are_equal_collect(
[2022-07-19T17:05:25.457Z]                 lambda spark : unary_op_df(spark, data_gen).selectExpr(
[2022-07-19T17:05:25.457Z]                     'CASE \
[2022-07-19T17:05:25.457Z]                     WHEN a RLIKE "^[0-9]{1,3}\\z" THEN CAST(a AS INT) \
[2022-07-19T17:05:25.457Z]                     WHEN a RLIKE "^[0-9]{4,6}\\z" THEN CAST(a AS INT) + 123 \
[2022-07-19T17:05:25.457Z]                     ELSE -1 END'),
[2022-07-19T17:05:25.457Z] >                   conf = test_conf)
[2022-07-19T17:05:25.457Z] 
[2022-07-19T17:05:25.457Z] integration_tests/src/main/python/conditionals_test.py:220: 
[2022-07-19T17:05:25.457Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
[2022-07-19T17:05:25.457Z] integration_tests/src/main/python/asserts.py:508: in assert_gpu_and_cpu_are_equal_collect
[2022-07-19T17:05:25.457Z]     _assert_gpu_and_cpu_are_equal(func, 'COLLECT', conf=conf, is_cpu_first=is_cpu_first)
[2022-07-19T17:05:25.457Z] integration_tests/src/main/python/asserts.py:428: in _assert_gpu_and_cpu_are_equal
[2022-07-19T17:05:25.457Z]     run_on_gpu()
[2022-07-19T17:05:25.457Z] integration_tests/src/main/python/asserts.py:422: in run_on_gpu
[2022-07-19T17:05:25.457Z]     from_gpu = with_gpu_session(bring_back, conf=conf)
[2022-07-19T17:05:25.457Z] integration_tests/src/main/python/spark_session.py:131: in with_gpu_session
[2022-07-19T17:05:25.457Z]     return with_spark_session(func, conf=copy)
[2022-07-19T17:05:25.457Z] integration_tests/src/main/python/spark_session.py:98: in with_spark_session
[2022-07-19T17:05:25.457Z]     ret = func(_spark)
[2022-07-19T17:05:25.457Z] integration_tests/src/main/python/asserts.py:201: in <lambda>
[2022-07-19T17:05:25.457Z]     bring_back = lambda spark: limit_func(spark).collect()
[2022-07-19T17:05:25.457Z] /hadoop_disks/disk0/local/usercache/****/appcache/application_1642661999714_11269/container_e94_1642661999714_11269_01_000001/pyspark.zip/pyspark/sql/dataframe.py:677: in collect
[2022-07-19T17:05:25.457Z]     sock_info = self._jdf.collectToPython()
[2022-07-19T17:05:25.457Z] /hadoop_disks/disk0/local/usercache/****/appcache/application_1642661999714_11269/container_e94_1642661999714_11269_01_000001/py4j-0.10.9-src.zip/py4j/java_gateway.py:1305: in __call__
[2022-07-19T17:05:25.457Z]     answer, self.gateway_client, self.target_id, self.name)
[2022-07-19T17:05:25.457Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
[2022-07-19T17:05:25.457Z] 
[2022-07-19T17:05:25.458Z] a = ('xro84412', <py4j.java_gateway.GatewayClient object at 0x7fbcbdfc56d8>, 'o84411', 'collectToPython')
[2022-07-19T17:05:25.458Z] kw = {}
[2022-07-19T17:05:25.458Z] converted = IllegalArgumentException('Part of the plan is not columnar class org.apache.spark.sql.execution.ProjectExec\nProject [...:79)\n\tat py4j.GatewayConnection.run(GatewayConnection.java:238)\n\tat java.lang.Thread.run(Thread.java:748)\n', None)
[2022-07-19T17:05:25.458Z] 
[2022-07-19T17:05:25.458Z]     def deco(*a, **kw):
[2022-07-19T17:05:25.458Z]         try:
[2022-07-19T17:05:25.458Z]             return f(*a, **kw)
[2022-07-19T17:05:25.458Z]         except py4j.protocol.Py4JJavaError as e:
[2022-07-19T17:05:25.458Z]             converted = convert_exception(e.java_exception)
[2022-07-19T17:05:25.458Z]             if not isinstance(converted, UnknownException):
[2022-07-19T17:05:25.458Z]                 # Hide where the exception came from that shows a non-Pythonic
[2022-07-19T17:05:25.458Z]                 # JVM exception message.
[2022-07-19T17:05:25.458Z] >               raise converted from None
[2022-07-19T17:05:25.458Z] E               pyspark.sql.utils.IllegalArgumentException: Part of the plan is not columnar class org.apache.spark.sql.execution.ProjectExec
[2022-07-19T17:05:25.458Z] E               Project [CASE WHEN a#16400 RLIKE ^[0-9]{1,3}z THEN ansi_cast(a#16400 as int) WHEN a#16400 RLIKE ^[0-9]{4,6}z THEN (ansi_cast(a#16400 as int) + 123) ELSE -1 END AS CASE WHEN a RLIKE ^[0-9]{1,3}z THEN CAST(a AS INT) WHEN a RLIKE ^[0-9]{4,6}z THEN (CAST(a AS INT) + 123) ELSE -1 END#16402]
[2022-07-19T17:05:25.458Z] E               +- Scan ExistingRDD[a#16400]
On Jul 20, 2022, pxLi added the labels: bug (Something isn't working) and ? - Needs Triage (Need team to review and classify).
On Jul 20, 2022, pxLi changed the title from "[BUG] qa_nightly_select_test.py::test_select[REGEXP_REPLACE failed in dataproc" to "[BUG] Part of the plan is not columnar class org.apache.spark.sql.execution.ProjectExec failure".
pxLi (Member, Author) commented Jul 20, 2022

This also seems related to the recent regex update #5776.
Related to #6028.
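Both conditionals tests exercise RLIKE patterns ending in the `\z` end-of-input anchor (e.g. `^[0-9]{1,5}\z`), an area the regex rework mentioned above would touch. As a point of reference, Java's `\z` matches only at the absolute end of the input, with no allowance for a trailing newline; Python spells that same anchor `\Z`, so the semantics these tests expect from the CPU run can be sketched as:

```python
import re

# Java regex '\z' anchors at the absolute end of input. Python's
# equivalent is '\Z'; by contrast, '$' also matches just before a
# trailing newline, which is why the anchors are not interchangeable.
pattern = re.compile(r"^[0-9]{1,5}\Z")  # Python analogue of ^[0-9]{1,5}\z

print(bool(pattern.match("12345")))     # 1-5 digits: matches
print(bool(pattern.match("123456")))    # too many digits: no match
print(bool(pattern.match("123\n")))     # '\Z' rejects the trailing newline
```

Any GPU regex implementation that translated `\z` with `$`-like (or dollar-tolerant) semantics would either produce different results or have to fall back to the CPU, which is consistent with the ProjectExec fallback seen in the plans above.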
