senyosimpson/fusion-repro

fusion-repro

This is a reproduction of an error that occurs when querying Parquet files written with parquet-go.

For the reproduction case, I generated a minimal Parquet file using the parquet-go library. The program lives in the go-parquet-writer directory. To save you the hassle of running it, its output is saved as go-parquet-writer/go-testfile.parquet.

Every tool I've used to inspect the file's metadata reports it as correctly formed, including datafusion-cli's describe output, shown below.

» datafusion-cli --command "describe 'go-parquet-writer/go-testfile.parquet'"
DataFusion CLI v44.0.0
+---------------+-------------------------------------+-------------+
| column_name   | data_type                           | is_nullable |
+---------------+-------------------------------------+-------------+
| city          | Utf8View                            | NO          |
| country       | Utf8View                            | NO          |
| age           | UInt8                               | NO          |
| scale         | Int16                               | NO          |
| status        | UInt32                              | NO          |
| time_captured | Timestamp(Millisecond, Some("UTC")) | NO          |
| checked       | Boolean                             | NO          |
+---------------+-------------------------------------+-------------+
7 row(s) fetched.
Elapsed 0.001 seconds.

When I run a query over the file with no predicate, it works fine:

» datafusion-cli --command "select * from 'go-parquet-writer/go-testfile.parquet'"
DataFusion CLI v44.0.0
+--------+---------+-----+-------+--------+--------------------------+---------+
| city   | country | age | scale | status | time_captured            | checked |
+--------+---------+-----+-------+--------+--------------------------+---------+
| Madrid | Spain   | 10  | -1    | 12     | 2025-01-24T16:34:00.715Z | false   |
| Athens | Greece  | 32  | 1     | 20     | 2025-01-24T17:34:00.715Z | true    |
+--------+---------+-----+-------+--------+--------------------------+---------+
2 row(s) fetched.
Elapsed 0.002 seconds.

When I run the query with a predicate, it fails and reports bad data:

» datafusion-cli --command "select * from 'go-parquet-writer/go-testfile.parquet' where age > 10"
DataFusion CLI v44.0.0
Error: External error: Parquet error: External: bad data

I first ran into this error when using DataFusion inside an application. There the error is more descriptive and points at a type conversion:

Error: ParquetError(External(ProtocolError { kind: InvalidData, message: "cannot convert 2 into TType" }))

PyArrow reads and filters the file without trouble. I wrote a small script that loads the Parquet file into a Pandas DataFrame and filters it. It lives under pyarrow-ex; just run python3 example.py (requires the PyArrow and Pandas dependencies). For simplicity's sake, here's the code:

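import pandas as pd
import pyarrow.parquet as pq

def main():
    # Load the whole file, then filter in Pandas (no predicate reaches the Parquet reader).
    table = pq.read_table('../go-parquet-writer/go-testfile.parquet')
    df = table.to_pandas()
    with pd.option_context('display.max_rows', None, 'display.max_columns', None, 'display.width', None):
        print(df[df['age'] > 10])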

with the resulting output:

» python3 example.py
     city country  age  scale  status                    time_captured  checked
1  Athens  Greece   32      1      20 2025-01-24 17:34:00.715000+00:00     True
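
The script above filters in Pandas after the whole file has been loaded, so the predicate never reaches the Parquet reader. A closer comparison with the failing DataFusion query might hand the predicate to PyArrow through the filters argument of pq.read_table. The sketch below is only an illustration: age_filter and read_filtered are hypothetical helper names that are not part of the repository, and nothing here establishes that PyArrow takes the same metadata path as DataFusion.

```python
def age_filter(min_age: int) -> list[tuple[str, str, int]]:
    """Build a PyArrow-style filter for rows where age > min_age.

    Hypothetical helper, not part of the repository.
    """
    return [("age", ">", min_age)]


def read_filtered(path: str, min_age: int):
    """Read a Parquet file with the predicate passed to PyArrow's reader.

    Hypothetical helper, not part of the repository.
    """
    # Imported lazily so age_filter can be used without PyArrow installed.
    import pyarrow.parquet as pq

    return pq.read_table(path, filters=age_filter(min_age))
```

From the pyarrow-ex directory, read_filtered('../go-parquet-writer/go-testfile.parquet', 10).to_pandas() would be the call to compare against the output above.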

Debugging

From everything I've gathered, this error most likely comes from this conversion function. However, that function only skips the check for 0x02 when a collection is being parsed. The odd part is that my schema contains no map or list. I assume this means the 0x02 is being used to encode something else, but that is beyond my knowledge.

I went spelunking in the parquet-go codebase. The Thrift protocol implementation is split among the compact protocol, the Thrift type definitions, and the encoding logic.
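
To make the 0x02 easier to reason about, here is a minimal sketch of how the Thrift compact protocol packs type IDs into header bytes. It is based on my reading of the Thrift compact protocol specification, not on the parquet-go or DataFusion source. The function names are hypothetical. Note that Parquet's own file metadata is Thrift-encoded and contains lists, so a collection header can show up even when the table schema has no list or map columns. Which header the failing reader actually hit is not established here.

```python
# Type IDs used in compact-protocol *struct field headers*
# (per the Thrift compact protocol specification).
FIELD_TYPES = {
    0x00: "STOP",
    0x01: "BOOLEAN_TRUE",
    0x02: "BOOLEAN_FALSE",
    0x03: "I8",
    0x04: "I16",
    0x05: "I32",
    0x06: "I64",
    0x07: "DOUBLE",
    0x08: "BINARY",
    0x09: "LIST",
    0x0A: "SET",
    0x0B: "MAP",
    0x0C: "STRUCT",
}


def decode_field_header(byte: int) -> tuple[int, str]:
    """Split a short-form field header into (field-id delta, type name).

    The short form stores the field-id delta in the high nibble and the
    type ID in the low nibble. Hypothetical helper for illustration.
    """
    delta = (byte >> 4) & 0x0F
    type_id = byte & 0x0F
    return delta, FIELD_TYPES.get(type_id, f"UNKNOWN(0x{type_id:02x})")


def decode_short_list_header(byte: int) -> tuple[int, int]:
    """Split a short-form list header into (size, raw element-type ID).

    Sizes 0..14 fit in the high nibble; a high nibble of 15 means the real
    size follows as a varint, which this sketch does not decode.
    """
    size = (byte >> 4) & 0x0F
    elem_type = byte & 0x0F
    return size, elem_type
```

For example, decode_field_header(0x12) yields a delta of 1 with type BOOLEAN_FALSE, while decode_short_list_header(0x22) yields a size of 2 with raw element type 2. The same low-nibble value is interpreted by different code paths depending on whether a struct field or a collection is being read, which may be why a reader accepts 0x02 in one context and rejects it in the other.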
