Trino not able to read nested map type written as Parquet Hive table by parquet-mr 1.10.1 and converted to Iceberg table #11343

Closed
puchengy opened this issue Mar 6, 2022 · 1 comment

puchengy commented Mar 6, 2022

I found that Trino 371 is not able to read a nested map type written by parquet-mr 1.10.1.

Here is how to reproduce it: in spark-sql 2.4.4 (which uses parquet-mr 1.10.1), create a Hive table with a nested map type column:

spark-sql> create table my_db.my_tbl (configs MAP<string, MAP <string, string>>);
spark-sql> insert into my_db.my_tbl values (map('key1', map('key1-1', 'val1-1')));

Then, using Spark 3.2, create an Iceberg table on top of the Hive table with the snapshot procedure:

CALL iceberg.system.snapshot('my_db.my_tbl', 'iceberg.my_db.my_tbl_ice');

On Trino 371, we can successfully query the Hive table. However, when querying the Iceberg table with select * from iceberg.my_db.my_tbl_ice, the following error is generated:

io.trino.spi.TrinoException: Error opening Iceberg split s3n://mask-out-parquet-file-location (offset=0, length=740): Metadata is missing for column: [configs, key_value, key] required binary key (STRING) = 2
  at io.trino.plugin.iceberg.IcebergPageSourceProvider.createParquetPageSource(IcebergPageSourceProvider.java:713)
  at io.trino.plugin.iceberg.IcebergPageSourceProvider.createDataPageSource(IcebergPageSourceProvider.java:276)
  at io.trino.plugin.iceberg.IcebergPageSourceProvider.createPageSource(IcebergPageSourceProvider.java:207)
  at io.trino.plugin.base.classloader.ClassLoaderSafeConnectorPageSourceProvider.createPageSource(ClassLoaderSafeConnectorPageSourceProvider.java:49)
  at io.trino.split.PageSourceManager.createPageSource(PageSourceManager.java:68)
  at io.trino.operator.TableScanOperator.getOutput(TableScanOperator.java:308)
  at io.trino.operator.Driver.processInternal(Driver.java:388)
  at io.trino.operator.Driver.lambda$processFor$9(Driver.java:292)
  at io.trino.operator.Driver.tryWithLock(Driver.java:693)
  at io.trino.operator.Driver.processFor(Driver.java:285)
  at io.trino.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:1092)
  at io.trino.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:163)
  at io.trino.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:488)
  at io.trino.$gen.Trino_0a00079____20220306_050111_2.run(Unknown Source)
  at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
  at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
  at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: io.trino.parquet.ParquetCorruptionException: Metadata is missing for column: [configs, key_value, key] required binary key (STRING) = 2
  at io.trino.parquet.reader.ParquetReader.getColumnChunkMetaData(ParquetReader.java:413)
  at io.trino.parquet.reader.ParquetReader.<init>(ParquetReader.java:183)
  at io.trino.parquet.reader.ParquetReader.<init>(ParquetReader.java:134)
  at io.trino.plugin.iceberg.IcebergPageSourceProvider.createParquetPageSource(IcebergPageSourceProvider.java:671)
  ... 16 more

Below is the Parquet file information:

file:        file:xxxxxxx
creator:     parquet-mr version 1.10.1-xxxxx (build ${buildNumber}) -> in house version

file schema: hive_schema
--------------------------------------------------------------------------------
configs:     OPTIONAL F:1
.map:        REPEATED F:2
..key:       REQUIRED BINARY O:UTF8 R:1 D:2
..value:     OPTIONAL F:1
...map:      REPEATED F:2
....key:     REQUIRED BINARY O:UTF8 R:2 D:4
....value:   OPTIONAL BINARY O:UTF8 R:2 D:5

row group 1: RC:1 TS:222 OFFSET:4
--------------------------------------------------------------------------------
configs:
.map:
..key:        BINARY ZSTD DO:0 FPO:4 SZ:75/66/0.88 VC:1 ENC:PLAIN,RLE ST:[min: key1, max: key1, num_nulls: 0]
..value:
...map:
....key:      BINARY ZSTD DO:0 FPO:79 SZ:87/78/0.90 VC:1 ENC:PLAIN,RLE ST:[min: key1-1, max: key1-1, num_nulls: 0]
....value:    BINARY ZSTD DO:0 FPO:166 SZ:87/78/0.90 VC:1 ENC:PLAIN,RLE ST:[min: val1-1, max: val1-1, num_nulls: 0]
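
Note the leaf column paths in this file: configs.map.key, configs.map.value.map.key, and configs.map.value.map.value, while the exception above complains about [configs, key_value, key], i.e. a path that uses the standard key_value naming for a map's repeated group. That path simply does not exist in this file, so the metadata lookup fails. Below is a minimal, illustrative sketch of such a path lookup; it uses plain parquet-mr classes rather than Trino's actual code and assumes parquet-hadoop is on the classpath:

import java.util.Set;

import org.apache.parquet.hadoop.metadata.ColumnPath;

public class PathLookupSketch
{
    public static void main(String[] args)
    {
        // Leaf column paths present in the row group of the parquet-mr 1.10.1 file (see dump above)
        Set<ColumnPath> pathsInFile = Set.of(
                ColumnPath.get("configs", "map", "key"),
                ColumnPath.get("configs", "map", "value", "map", "key"),
                ColumnPath.get("configs", "map", "value", "map", "value"));

        // Path requested by the reader, as shown in the exception message
        ColumnPath requested = ColumnPath.get("configs", "key_value", "key");

        // Prints false: the requested path is not present in the file, hence
        // "Metadata is missing for column: [configs, key_value, key]"
        System.out.println(pathsInFile.contains(requested));
    }
}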

If the Hive table is instead created with Spark 3.2 (which uses parquet-mr 1.12.2), the problem does not occur. The Parquet metadata in that case is:

file:         file:xxxxxxx
creator:      parquet-mr version 1.12.2 (build 77e30c8093386ec52c3cfa6c34b7ef3321322c94)
extra:        org.apache.spark.version = 3.2.0
extra:        org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"configs","type":{"type":"map","keyType":"string","valueType":{"type":"map","keyType":"string","valueType":"string","valueContainsNull":true},"valueContainsNull":true},"nullable":false,"metadata":{}}]}

file schema:  spark_schema
--------------------------------------------------------------------------------
configs:      REQUIRED F:1
.key_value:   REPEATED F:2
..key:        REQUIRED BINARY O:UTF8 R:1 D:1
..value:      OPTIONAL F:1
...key_value: REPEATED F:2
....key:      REQUIRED BINARY O:UTF8 R:2 D:3
....value:    OPTIONAL BINARY O:UTF8 R:2 D:4

row group 1:  RC:1 TS:138 OFFSET:4
--------------------------------------------------------------------------------
configs:
.key_value:
..key:         BINARY ZSTD DO:0 FPO:4 SZ:52/43/0.83 VC:1 ENC:RLE,PLAIN ST:[min: key1, max: key1, num_nulls: 0]
..value:
...key_value:
....key:       BINARY ZSTD DO:0 FPO:56 SZ:56/47/0.84 VC:1 ENC:RLE,PLAIN ST:[min: key1-1, max: key1-1, num_nulls: 0]
....value:     BINARY ZSTD DO:0 FPO:112 SZ:57/48/0.84 VC:1 ENC:RLE,PLAIN ST:[min: val1-1, max: val1-1, num_nulls: 0]
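
The only structural difference between the two files is the name of the repeated group inside each map: the parquet-mr 1.10.1 (Hive) writer calls it map, while parquet-mr 1.12.2 calls it key_value. A small sketch using plain parquet-mr (parquet-column on the classpath), with schemas reconstructed approximately from the two dumps above, prints the leaf column paths of each file and makes the mismatch explicit:

import org.apache.parquet.column.ColumnDescriptor;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class MapGroupNamingDiff
{
    // Approximate schema of the parquet-mr 1.10.1 (Hive) file: repeated group named "map"
    private static final String LEGACY_SCHEMA = String.join("\n",
            "message hive_schema {",
            "  optional group configs (MAP) {",
            "    repeated group map {",
            "      required binary key (UTF8);",
            "      optional group value (MAP) {",
            "        repeated group map {",
            "          required binary key (UTF8);",
            "          optional binary value (UTF8);",
            "        }",
            "      }",
            "    }",
            "  }",
            "}");

    // Approximate schema of the parquet-mr 1.12.2 file: repeated group named "key_value"
    private static final String MODERN_SCHEMA = String.join("\n",
            "message spark_schema {",
            "  required group configs (MAP) {",
            "    repeated group key_value {",
            "      required binary key (UTF8);",
            "      optional group value (MAP) {",
            "        repeated group key_value {",
            "          required binary key (UTF8);",
            "          optional binary value (UTF8);",
            "        }",
            "      }",
            "    }",
            "  }",
            "}");

    public static void main(String[] args)
    {
        for (String schema : new String[] {LEGACY_SCHEMA, MODERN_SCHEMA}) {
            MessageType type = MessageTypeParser.parseMessageType(schema);
            for (ColumnDescriptor column : type.getColumns()) {
                // Prints configs.map.key, configs.map.value.map.key, ... for the legacy schema,
                // and configs.key_value.key, configs.key_value.value.key_value.key, ... for the new one
                System.out.println(String.join(".", column.getPath()));
            }
            System.out.println();
        }
    }
}

Only the second file contains the path configs.key_value.key, which is the path reported as missing in the exception above.
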
puchengy changed the title from "[Iceberg] Trino 371 not able to read nested map type written by parquet-mr 1.10.1" to "[Iceberg Parquet] Trino 371 not able to read nested map type written by parquet-mr 1.10.1" on Mar 6, 2022
findepi changed the title from "[Iceberg Parquet] Trino 371 not able to read nested map type written by parquet-mr 1.10.1" to "Trino not able to read nested map type written as Parquet Hive table by parquet-mr 1.10.1 and converted to Iceberg table" on Mar 7, 2022

puchengy commented Mar 7, 2022

This is no longer an issue after upgrading Trino's Iceberg dependency to 0.13.1 (#11032), so I will close this ticket.

puchengy closed this as completed on Mar 7, 2022