[BUG] Support for wider types in read schemas for Parquet Reads #11512
mythrocks added a commit to mythrocks/spark-rapids that referenced this issue on Sep 27, 2024:

Fixes NVIDIA#11015. Contributes to NVIDIA#11004. This commit addresses the tests that fail in parquet_test.py when run on Spark 4.
1. Some of the tests were failing as a result of NVIDIA#5114. Those tests have been disabled, at least until we get around to supporting aggregations with ANSI mode enabled.
2. `test_parquet_check_schema_compatibility` fails on Spark 4 regardless of ANSI mode, because it tests implicit type promotions where the read schema includes wider columns than the write schema. This will require new code. The test is disabled until NVIDIA#11512 is addressed.
3. `test_parquet_int32_downcast` had an erroneous setup phase that fails in ANSI mode. This has been corrected. The test was refactored to run in ANSI and non-ANSI mode.

Signed-off-by: MithunR <mithunr@nvidia.com>
It appears that apache/spark#43368 has found its way into Databricks 14.3. This issue looks to be a problem there as well.
mythrocks added a commit that referenced this issue on Oct 8, 2024:

* Spark 4: Fix parquet_test.py. Fixes #11015 (Spark 4 failure). Also fixes #11531 (Databricks 14.3 failure). Contributes to #11004. This commit addresses the tests that fail in parquet_test.py when run on Spark 4.
1. Some of the tests were failing as a result of #5114. Those tests have been disabled, at least until we get around to supporting aggregations with ANSI mode enabled.
2. `test_parquet_check_schema_compatibility` fails on Spark 4 regardless of ANSI mode, because it tests implicit type promotions where the read schema includes wider columns than the write schema. This will require new code. The test is disabled until #11512 is addressed.
3. `test_parquet_int32_downcast` had an erroneous setup phase that fails in ANSI mode. This has been corrected. The test was refactored to run in ANSI and non-ANSI mode.

Signed-off-by: MithunR <mithunr@nvidia.com>
#11727 supports widening of decimal types. We still need to support widening of the other types mentioned in the original Spark commit referenced in this issue, such as int->long, int->double, and float->double.
TL;DR:
When running the plugin with Spark 4+, if a Parquet file is being read with a read-schema that contains wider types than the Parquet file's schema, the read should not fail.
Details:
This is with reference to apache/spark#44368. Spark 4 has the ability to read Parquet files where the read-schema uses wider types than the write-schema in the file.
For instance, a Parquet file with an `Integer` column `a` should be readable with a read-schema that defines `a` as having the type `Long`. Prior to Spark 4, this would yield a `SchemaColumnConvertNotSupportedException` on both Apache Spark and the plugin. After apache/spark#44368, if the read-schema uses a wider, compatible type, there is an implicit conversion to the wider data type during the read. An incompatible type continues to fail as before.
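As a minimal illustration of the CPU behaviour described above (the file path, column name, and session setup here are arbitrary choices for this sketch, not taken from the issue):

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, LongType

spark = SparkSession.builder.appName("wider-read-schema-demo").getOrCreate()

# Write a Parquet file whose physical schema stores column "a" as INT32.
write_schema = StructType([StructField("a", IntegerType())])
spark.createDataFrame([(1,), (2,), (3,)], schema=write_schema) \
     .write.mode("overwrite").parquet("/tmp/int_col_demo")

# Read the same file back with a read-schema that declares "a" as LONG (INT64).
# On Spark 4 the values are implicitly widened during the read; on earlier
# Spark versions (and on the plugin's GPU Parquet reader today) this raises
# SchemaColumnConvertNotSupportedException instead.
read_schema = StructType([StructField("a", LongType())])
spark.read.schema(read_schema).parquet("/tmp/int_col_demo").show()
```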
spark-rapids's `parquet_test.py::test_parquet_check_schema_compatibility` integration test exercises this scenario, and Spark 4's change in behaviour causes that test to fail.
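The original test body and failure output are not reproduced here. As a rough sketch only, a test of this shape in the spark-rapids integration suite might look like the following; the test name, data path, and schema are illustrative assumptions, and the helpers are used per the suite's usual conventions rather than quoted from `parquet_test.py`:

```python
# Hypothetical sketch; not the literal body of test_parquet_check_schema_compatibility.
from asserts import assert_gpu_and_cpu_are_equal_collect
from spark_session import with_cpu_session
from pyspark.sql.types import StructType, StructField, LongType

def test_parquet_read_int_column_with_long_read_schema(spark_tmp_path):
    data_path = spark_tmp_path + '/PARQUET_WIDER_READ_SCHEMA'
    # Write the file with an INT32 column on the CPU.
    with_cpu_session(
        lambda spark: spark.range(100)
                           .selectExpr('CAST(id AS INT) AS a')
                           .write.parquet(data_path))
    # Read it back with a read-schema that widens column `a` to LONG.
    # Spark 4's CPU reader widens the values implicitly; the GPU reader must
    # do the same for the CPU and GPU results to match.
    read_schema = StructType([StructField('a', LongType())])
    assert_gpu_and_cpu_are_equal_collect(
        lambda spark: spark.read.schema(read_schema).parquet(data_path))
```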