Trino not able to read nested map type written as Parquet Hive table by parquet-mr 1.10.1 and converted to Iceberg table #11343

Closed
puchengy opened this issue Mar 6, 2022 · 1 comment

puchengy commented Mar 6, 2022

I found that Trino 371 is not able to read a nested map type written by parquet-mr 1.10.1.

Here is how to reproduce it: in spark-sql 2.4.4 (which uses parquet-mr 1.10.1), create a Hive table with a nested map type column:

spark-sql> create table my_db.my_tbl (configs MAP<string, MAP <string, string>>);
spark-sql> insert into my_db.my_tbl values (map('key1', map('key1-1', 'val1-1')));

Then, using Spark 3.2, create an Iceberg table on top of the Hive table with the snapshot procedure:

CALL iceberg.system.snapshot('my_db.my_tbl', 'iceberg.my_db.my_tbl_ice');

On Trino 371, we can successfully query the Hive table. However, when querying the Iceberg table with select * from iceberg.my_db.my_tbl_ice, the following error is generated:

io.trino.spi.TrinoException: Error opening Iceberg split s3n://mask-out-parquet-file-location (offset=0, length=740): Metadata is missing for column: [configs, key_value, key] required binary key (STRING) = 2
  at io.trino.plugin.iceberg.IcebergPageSourceProvider.createParquetPageSource(IcebergPageSourceProvider.java:713)
  at io.trino.plugin.iceberg.IcebergPageSourceProvider.createDataPageSource(IcebergPageSourceProvider.java:276)
  at io.trino.plugin.iceberg.IcebergPageSourceProvider.createPageSource(IcebergPageSourceProvider.java:207)
  at io.trino.plugin.base.classloader.ClassLoaderSafeConnectorPageSourceProvider.createPageSource(ClassLoaderSafeConnectorPageSourceProvider.java:49)
  at io.trino.split.PageSourceManager.createPageSource(PageSourceManager.java:68)
  at io.trino.operator.TableScanOperator.getOutput(TableScanOperator.java:308)
  at io.trino.operator.Driver.processInternal(Driver.java:388)
  at io.trino.operator.Driver.lambda$processFor$9(Driver.java:292)
  at io.trino.operator.Driver.tryWithLock(Driver.java:693)
  at io.trino.operator.Driver.processFor(Driver.java:285)
  at io.trino.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:1092)
  at io.trino.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:163)
  at io.trino.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:488)
  at io.trino.$gen.Trino_0a00079____20220306_050111_2.run(Unknown Source)
  at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
  at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
  at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: io.trino.parquet.ParquetCorruptionException: Metadata is missing for column: [configs, key_value, key] required binary key (STRING) = 2
  at io.trino.parquet.reader.ParquetReader.getColumnChunkMetaData(ParquetReader.java:413)
  at io.trino.parquet.reader.ParquetReader.<init>(ParquetReader.java:183)
  at io.trino.parquet.reader.ParquetReader.<init>(ParquetReader.java:134)
  at io.trino.plugin.iceberg.IcebergPageSourceProvider.createParquetPageSource(IcebergPageSourceProvider.java:671)
  ... 16 more

Below is the Parquet file information:

file:        file:xxxxxxx
creator:     parquet-mr version 1.10.1-xxxxx (build ${buildNumber}) -> in house version

file schema: hive_schema
--------------------------------------------------------------------------------
configs:     OPTIONAL F:1
.map:        REPEATED F:2
..key:       REQUIRED BINARY O:UTF8 R:1 D:2
..value:     OPTIONAL F:1
...map:      REPEATED F:2
....key:     REQUIRED BINARY O:UTF8 R:2 D:4
....value:   OPTIONAL BINARY O:UTF8 R:2 D:5

row group 1: RC:1 TS:222 OFFSET:4
--------------------------------------------------------------------------------
configs:
.map:
..key:        BINARY ZSTD DO:0 FPO:4 SZ:75/66/0.88 VC:1 ENC:PLAIN,RLE ST:[min: key1, max: key1, num_nulls: 0]
..value:
...map:
....key:      BINARY ZSTD DO:0 FPO:79 SZ:87/78/0.90 VC:1 ENC:PLAIN,RLE ST:[min: key1-1, max: key1-1, num_nulls: 0]
....value:    BINARY ZSTD DO:0 FPO:166 SZ:87/78/0.90 VC:1 ENC:PLAIN,RLE ST:[min: val1-1, max: val1-1, num_nulls: 0]
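
Note the leaf column paths in this file: configs.map.key, configs.map.value.map.key, and configs.map.value.map.value, while the exception above complains about [configs, key_value, key], i.e. a path that uses the standard key_value naming for a map's repeated group. That path simply does not exist in this file, so the metadata lookup fails. Below is a minimal, illustrative sketch of such a path lookup; it uses plain parquet-mr classes rather than Trino's actual code and assumes parquet-hadoop is on the classpath:

import java.util.Set;

import org.apache.parquet.hadoop.metadata.ColumnPath;

public class PathLookupSketch
{
    public static void main(String[] args)
    {
        // Leaf column paths present in the row group of the parquet-mr 1.10.1 file (see dump above)
        Set<ColumnPath> pathsInFile = Set.of(
                ColumnPath.get("configs", "map", "key"),
                ColumnPath.get("configs", "map", "value", "map", "key"),
                ColumnPath.get("configs", "map", "value", "map", "value"));

        // Path requested by the reader, as shown in the exception message
        ColumnPath requested = ColumnPath.get("configs", "key_value", "key");

        // Prints false: the requested path is not present in the file, hence
        // "Metadata is missing for column: [configs, key_value, key]"
        System.out.println(pathsInFile.contains(requested));
    }
}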

If the Hive table is instead created with Spark 3.2 (which uses parquet-mr 1.12.2), the problem does not occur. The Parquet metadata in that case is:

file:         file:xxxxxxx
creator:      parquet-mr version 1.12.2 (build 77e30c8093386ec52c3cfa6c34b7ef3321322c94)
extra:        org.apache.spark.version = 3.2.0
extra:        org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"configs","type":{"type":"map","keyType":"string","valueType":{"type":"map","keyType":"string","valueType":"string","valueContainsNull":true},"valueContainsNull":true},"nullable":false,"metadata":{}}]}

file schema:  spark_schema
--------------------------------------------------------------------------------
configs:      REQUIRED F:1
.key_value:   REPEATED F:2
..key:        REQUIRED BINARY O:UTF8 R:1 D:1
..value:      OPTIONAL F:1
...key_value: REPEATED F:2
....key:      REQUIRED BINARY O:UTF8 R:2 D:3
....value:    OPTIONAL BINARY O:UTF8 R:2 D:4

row group 1:  RC:1 TS:138 OFFSET:4
--------------------------------------------------------------------------------
configs:
.key_value:
..key:         BINARY ZSTD DO:0 FPO:4 SZ:52/43/0.83 VC:1 ENC:RLE,PLAIN ST:[min: key1, max: key1, num_nulls: 0]
..value:
...key_value:
....key:       BINARY ZSTD DO:0 FPO:56 SZ:56/47/0.84 VC:1 ENC:RLE,PLAIN ST:[min: key1-1, max: key1-1, num_nulls: 0]
....value:     BINARY ZSTD DO:0 FPO:112 SZ:57/48/0.84 VC:1 ENC:RLE,PLAIN ST:[min: val1-1, max: val1-1, num_nulls: 0]
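
The only structural difference between the two files is the name of the repeated group inside each map: the parquet-mr 1.10.1 (Hive) writer calls it map, while parquet-mr 1.12.2 calls it key_value. A small sketch using plain parquet-mr (parquet-column on the classpath), with schemas reconstructed approximately from the two dumps above, prints the leaf column paths of each file and makes the mismatch explicit:

import org.apache.parquet.column.ColumnDescriptor;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class MapGroupNamingDiff
{
    // Approximate schema of the parquet-mr 1.10.1 (Hive) file: repeated group named "map"
    private static final String LEGACY_SCHEMA = String.join("\n",
            "message hive_schema {",
            "  optional group configs (MAP) {",
            "    repeated group map {",
            "      required binary key (UTF8);",
            "      optional group value (MAP) {",
            "        repeated group map {",
            "          required binary key (UTF8);",
            "          optional binary value (UTF8);",
            "        }",
            "      }",
            "    }",
            "  }",
            "}");

    // Approximate schema of the parquet-mr 1.12.2 file: repeated group named "key_value"
    private static final String MODERN_SCHEMA = String.join("\n",
            "message spark_schema {",
            "  required group configs (MAP) {",
            "    repeated group key_value {",
            "      required binary key (UTF8);",
            "      optional group value (MAP) {",
            "        repeated group key_value {",
            "          required binary key (UTF8);",
            "          optional binary value (UTF8);",
            "        }",
            "      }",
            "    }",
            "  }",
            "}");

    public static void main(String[] args)
    {
        for (String schema : new String[] {LEGACY_SCHEMA, MODERN_SCHEMA}) {
            MessageType type = MessageTypeParser.parseMessageType(schema);
            for (ColumnDescriptor column : type.getColumns()) {
                // Prints configs.map.key, configs.map.value.map.key, ... for the legacy schema,
                // and configs.key_value.key, configs.key_value.value.key_value.key, ... for the new one
                System.out.println(String.join(".", column.getPath()));
            }
            System.out.println();
        }
    }
}

Only the second file contains the path configs.key_value.key, which is the path reported as missing in the exception above.
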
puchengy changed the title from "[Iceberg] Trino 371 not able to read nested map type written by parquet-mr 1.10.1" to "[Iceberg Parquet] Trino 371 not able to read nested map type written by parquet-mr 1.10.1" on Mar 6, 2022
findepi changed the title from "[Iceberg Parquet] Trino 371 not able to read nested map type written by parquet-mr 1.10.1" to "Trino not able to read nested map type written as Parquet Hive table by parquet-mr 1.10.1 and converted to Iceberg table" on Mar 7, 2022

puchengy commented Mar 7, 2022

This is no longer an issue after upgrading Trino's Iceberg dependency to 0.13.1 (#11032), so I will close this ticket.

puchengy closed this as completed on Mar 7, 2022