Pandas Segfault when reading Parquet data #2224
Working to narrow this down. It doesn't occur when targeting one of the parquet files rather than the entire directory. It does occur on this dask graph: {('get-partition-0-read-parquet-f55c83a03e199bcdcf32bbc567242dcf',
0): ('read-parquet-f55c83a03e199bcdcf32bbc567242dcf', 0),
('read-parquet-f55c83a03e199bcdcf32bbc567242dcf',
0): (<function dask.dataframe.io.parquet._read_parquet_row_group>,
<dask.bytes.core.OpenFileCreator at 0x7f8270a01160>,
'day-0.parquet/part.0.parquet',
None,
('click',
'numeric_0',
'numeric_1',
'numeric_2',
'numeric_3',
'numeric_4',
'numeric_5',
'numeric_6',
'numeric_7',
'numeric_8',
'numeric_9',
'numeric_10',
'numeric_11',
'numeric_12',
'category_0',
'category_1',
'category_2',
'category_3',
'category_4',
'category_5',
'category_6',
'category_7',
'category_8',
'category_9',
'category_10',
'category_11',
'category_12',
'category_13',
'category_14',
'category_15',
'category_16',
'category_17',
'category_18',
'category_19',
'category_20',
'category_21',
'category_22',
'category_23',
'category_24',
'category_25'),
<class 'parquet_thrift.RowGroup'>
columns: [<class 'parquet_thrift.ColumnChunk'>
file_offset: 2028397
file_path: part.0.parquet
meta_data: <class 'parquet_thrift.ColumnMetaData'>
codec: 0
data_page_offset: 4
dictionary_page_offset: None
encoding_stats: [<class 'parquet_thrift.PageEncodingStats'>
count: 1
encoding: 0
page_type: 0
]
encodings: [3, 4, 0]
index_page_offset: None
key_value_metadata: []
num_values: 253545
path_in_schema: ['click']
statistics: <class 'parquet_thrift.Statistics'>
distinct_count: None
max: �
min:
null_count: 0
total_compressed_size: 2028393
total_uncompressed_size: 2028393
type: 2
, <class 'parquet_thrift.ColumnChunk'>
file_offset: 4056866
file_path: part.0.parquet
meta_data: <class 'parquet_thrift.ColumnMetaData'>
codec: 0
data_page_offset: 2028473
dictionary_page_offset: None
encoding_stats: [<class 'parquet_thrift.PageEncodingStats'>
count: 1
encoding: 0
page_type: 0
]
encodings: [3, 4, 0]
index_page_offset: None
key_value_metadata: []
num_values: 253545
path_in_schema: ['numeric_0']
statistics: <class 'parquet_thrift.Statistics'>
distinct_count: None
max: b'\x00\x00\x00\x00\x00\x00\xf8\x7f'
min: b'\x00\x00\x00\x00\x00\x00\xf8\x7f'
null_count: 0
total_compressed_size: 2028393
total_uncompressed_size: 2028393
type: 5
, <class 'parquet_thrift.ColumnChunk'>
file_offset: 6085342
file_path: part.0.parquet
meta_data: <class 'parquet_thrift.ColumnMetaData'>
codec: 0
data_page_offset: 4056949
dictionary_page_offset: None
encoding_stats: [<class 'parquet_thrift.PageEncodingStats'>
count: 1
encoding: 0
page_type: 0
]
encodings: [3, 4, 0]
index_page_offset: None
key_value_metadata: []
num_values: 253545
path_in_schema: ['numeric_1']
statistics: <class 'parquet_thrift.Statistics'>
distinct_count: None
max: b'\x00\x00\x00\x00\x00\x00\xf8\x7f'
min: b'\x00\x00\x00\x00\x00\x00\xf8\x7f'
null_count: 0
total_compressed_size: 2028393
total_uncompressed_size: 2028393
type: 5
, <class 'parquet_thrift.ColumnChunk'>
file_offset: 8113818
file_path: part.0.parquet
meta_data: <class 'parquet_thrift.ColumnMetaData'>
codec: 0
data_page_offset: 6085425
dictionary_page_offset: None
encoding_stats: [<class 'parquet_thrift.PageEncodingStats'>
count: 1
encoding: 0
page_type: 0
]
encodings: [3, 4, 0]
index_page_offset: None
key_value_metadata: []
num_values: 253545
path_in_schema: ['numeric_2']
statistics: <class 'parquet_thrift.Statistics'>
distinct_count: None
max: b'\x00\x00\x00\x00\x00\x00\xf8\x7f'
min: b'\x00\x00\x00\x00\x00\x00\xf8\x7f'
null_count: 0
total_compressed_size: 2028393
total_uncompressed_size: 2028393
type: 5
, <class 'parquet_thrift.ColumnChunk'>
file_offset: 10142294
file_path: part.0.parquet
meta_data: <class 'parquet_thrift.ColumnMetaData'>
codec: 0
data_page_offset: 8113901
dictionary_page_offset: None
encoding_stats: [<class 'parquet_thrift.PageEncodingStats'>
count: 1
encoding: 0
page_type: 0
]
encodings: [3, 4, 0]
index_page_offset: None
key_value_metadata: []
num_values: 253545
path_in_schema: ['numeric_3']
statistics: <class 'parquet_thrift.Statistics'>
distinct_count: None
max: b'\x00\x00\x00\x00\x00\x00\xf8\x7f'
min: b'\x00\x00\x00\x00\x00\x00\xf8\x7f'
null_count: 0
total_compressed_size: 2028393
total_uncompressed_size: 2028393
type: 5
, <class 'parquet_thrift.ColumnChunk'>
file_offset: 12170770
file_path: part.0.parquet
meta_data: <class 'parquet_thrift.ColumnMetaData'>
codec: 0
data_page_offset: 10142377
dictionary_page_offset: None
encoding_stats: [<class 'parquet_thrift.PageEncodingStats'>
count: 1
encoding: 0
page_type: 0
]
encodings: [3, 4, 0]
index_page_offset: None
key_value_metadata: []
num_values: 253545
path_in_schema: ['numeric_4']
statistics: <class 'parquet_thrift.Statistics'>
distinct_count: None
max: b'\x00\x00\x00\x00\x00\x00\xf8\x7f'
min: b'\x00\x00\x00\x00\x00\x00\xf8\x7f'
null_count: 0
total_compressed_size: 2028393
total_uncompressed_size: 2028393
type: 5
, <class 'parquet_thrift.ColumnChunk'>
file_offset: 14199246
file_path: part.0.parquet
meta_data: <class 'parquet_thrift.ColumnMetaData'>
codec: 0
data_page_offset: 12170853
dictionary_page_offset: None
encoding_stats: [<class 'parquet_thrift.PageEncodingStats'>
count: 1
encoding: 0
page_type: 0
]
encodings: [3, 4, 0]
index_page_offset: None
key_value_metadata: []
num_values: 253545
path_in_schema: ['numeric_5']
statistics: <class 'parquet_thrift.Statistics'>
distinct_count: None
max: b'\x00\x00\x00\x00\x00\x00\xf8\x7f'
min: b'\x00\x00\x00\x00\x00\x00\xf8\x7f'
null_count: 0
total_compressed_size: 2028393
total_uncompressed_size: 2028393
type: 5
, <class 'parquet_thrift.ColumnChunk'>
file_offset: 16227722
file_path: part.0.parquet
meta_data: <class 'parquet_thrift.ColumnMetaData'>
codec: 0
data_page_offset: 14199329
dictionary_page_offset: None
encoding_stats: [<class 'parquet_thrift.PageEncodingStats'>
count: 1
encoding: 0
page_type: 0
]
encodings: [3, 4, 0]
index_page_offset: None
key_value_metadata: []
num_values: 253545
path_in_schema: ['numeric_6']
statistics: <class 'parquet_thrift.Statistics'>
distinct_count: None
max: b'\x00\x00\x00\x00\x00\x00\xf8\x7f'
min: b'\x00\x00\x00\x00\x00\x00\xf8\x7f'
null_count: 0
total_compressed_size: 2028393
total_uncompressed_size: 2028393
type: 5
, <class 'parquet_thrift.ColumnChunk'>
file_offset: 18256198
file_path: part.0.parquet
meta_data: <class 'parquet_thrift.ColumnMetaData'>
codec: 0
data_page_offset: 16227805
dictionary_page_offset: None
encoding_stats: [<class 'parquet_thrift.PageEncodingStats'>
count: 1
encoding: 0
page_type: 0
]
encodings: [3, 4, 0]
index_page_offset: None
key_value_metadata: []
num_values: 253545
path_in_schema: ['numeric_7']
statistics: <class 'parquet_thrift.Statistics'>
distinct_count: None
max: �L
min: b'\xff\xff\xff\xff\xff\xff\xff\xff'
null_count: 0
total_compressed_size: 2028393
total_uncompressed_size: 2028393
type: 2
, <class 'parquet_thrift.ColumnChunk'>
file_offset: 20284674
file_path: part.0.parquet
meta_data: <class 'parquet_thrift.ColumnMetaData'>
codec: 0
data_page_offset: 18256281
dictionary_page_offset: None
encoding_stats: [<class 'parquet_thrift.PageEncodingStats'>
count: 1
encoding: 0
page_type: 0
]
encodings: [3, 4, 0]
index_page_offset: None
key_value_metadata: []
num_values: 253545
path_in_schema: ['numeric_8']
statistics: <class 'parquet_thrift.Statistics'>
distinct_count: None
max: A
min:
null_count: 0
total_compressed_size: 2028393
total_uncompressed_size: 2028393
type: 2
, <class 'parquet_thrift.ColumnChunk'>
file_offset: 22313150
file_path: part.0.parquet
meta_data: <class 'parquet_thrift.ColumnMetaData'>
codec: 0
data_page_offset: 20284757
dictionary_page_offset: None
encoding_stats: [<class 'parquet_thrift.PageEncodingStats'>
count: 1
encoding: 0
page_type: 0
]
encodings: [3, 4, 0]
index_page_offset: None
key_value_metadata: []
num_values: 253545
path_in_schema: ['numeric_9']
statistics: <class 'parquet_thrift.Statistics'>
distinct_count: None
max: b'\x00\x00\x00\x00\x00\x00\xf8\x7f'
min: b'\x00\x00\x00\x00\x00\x00\xf8\x7f'
null_count: 0
total_compressed_size: 2028393
total_uncompressed_size: 2028393
type: 5
, <class 'parquet_thrift.ColumnChunk'>
file_offset: 24341626
file_path: part.0.parquet
meta_data: <class 'parquet_thrift.ColumnMetaData'>
codec: 0
data_page_offset: 22313233
dictionary_page_offset: None
encoding_stats: [<class 'parquet_thrift.PageEncodingStats'>
count: 1
encoding: 0
page_type: 0
]
encodings: [3, 4, 0]
index_page_offset: None
key_value_metadata: []
num_values: 253545
path_in_schema: ['numeric_10']
statistics: <class 'parquet_thrift.Statistics'>
distinct_count: None
max: b'\x00\x00\x00\x00\x00\x00\xf8\x7f'
min: b'\x00\x00\x00\x00\x00\x00\xf8\x7f'
null_count: 0
total_compressed_size: 2028393
total_uncompressed_size: 2028393
type: 5
, <class 'parquet_thrift.ColumnChunk'>
file_offset: 26370103
file_path: part.0.parquet
meta_data: <class 'parquet_thrift.ColumnMetaData'>
codec: 0
data_page_offset: 24341710
dictionary_page_offset: None
encoding_stats: [<class 'parquet_thrift.PageEncodingStats'>
count: 1
encoding: 0
page_type: 0
]
encodings: [3, 4, 0]
index_page_offset: None
key_value_metadata: []
num_values: 253545
path_in_schema: ['numeric_11']
statistics: <class 'parquet_thrift.Statistics'>
distinct_count: None
max: b'\x00\x00\x00\x00\x00\x00\xf8\x7f'
min: b'\x00\x00\x00\x00\x00\x00\xf8\x7f'
null_count: 0
total_compressed_size: 2028393
total_uncompressed_size: 2028393
type: 5
, <class 'parquet_thrift.ColumnChunk'>
file_offset: 28398580
file_path: part.0.parquet
meta_data: <class 'parquet_thrift.ColumnMetaData'>
codec: 0
data_page_offset: 26370187
dictionary_page_offset: None
encoding_stats: [<class 'parquet_thrift.PageEncodingStats'>
count: 1
encoding: 0
page_type: 0
]
encodings: [3, 4, 0]
index_page_offset: None
key_value_metadata: []
num_values: 253545
path_in_schema: ['numeric_12']
statistics: <class 'parquet_thrift.Statistics'>
distinct_count: None
max: b'\x00\x00\x00\x00\x00\x00\xf8\x7f'
min: b'\x00\x00\x00\x00\x00\x00\xf8\x7f'
null_count: 0
total_compressed_size: 2028393
total_uncompressed_size: 2028393
type: 5
, <class 'parquet_thrift.ColumnChunk'>
file_offset: 30071583
file_path: part.0.parquet
meta_data: <class 'parquet_thrift.ColumnMetaData'>
codec: 0
data_page_offset: 29056987
dictionary_page_offset: 28398664
encoding_stats: [<class 'parquet_thrift.PageEncodingStats'>
count: 1
encoding: 0
page_type: 0
, <class 'parquet_thrift.PageEncodingStats'>
count: 1
encoding: 0
page_type: 2
]
encodings: [3, 4, 0]
index_page_offset: None
key_value_metadata: []
num_values: 253545
path_in_schema: ['category_0']
statistics: <class 'parquet_thrift.Statistics'>
distinct_count: None
max: None
min: None
null_count: 7830
total_compressed_size: 1014596
total_uncompressed_size: 1014596
type: 6
, <class 'parquet_thrift.ColumnChunk'>
file_offset: 30675249
file_path: part.0.parquet
meta_data: <class 'parquet_thrift.ColumnMetaData'>
codec: 0
data_page_offset: 30168116
dictionary_page_offset: 30071658
encoding_stats: [<class 'parquet_thrift.PageEncodingStats'>
count: 1
encoding: 0
page_type: 0
, <class 'parquet_thrift.PageEncodingStats'>
count: 1
encoding: 0
page_type: 2
]
encodings: [3, 4, 0]
index_page_offset: None
key_value_metadata: []
num_values: 253545
path_in_schema: ['category_1']
statistics: <class 'parquet_thrift.Statistics'>
distinct_count: None
max: None
min: None
null_count: 0
total_compressed_size: 507133
total_uncompressed_size: 507133
type: 6
, <class 'parquet_thrift.ColumnChunk'>
file_offset: 31294947
file_path: part.0.parquet
meta_data: <class 'parquet_thrift.ColumnMetaData'>
codec: 0
data_page_offset: 30787814
dictionary_page_offset: 30675323
encoding_stats: [<class 'parquet_thrift.PageEncodingStats'>
count: 1
encoding: 0
page_type: 0
, <class 'parquet_thrift.PageEncodingStats'>
count: 1
encoding: 0
page_type: 2
]
encodings: [3, 4, 0]
index_page_offset: None
key_value_metadata: []
num_values: 253545
path_in_schema: ['category_2']
statistics: <class 'parquet_thrift.Statistics'>
distinct_count: None
max: None
min: None
null_count: 0
total_compressed_size: 507133
total_uncompressed_size: 507133
type: 6
, <class 'parquet_thrift.ColumnChunk'>
file_offset: 31836428
file_path: part.0.parquet
meta_data: <class 'parquet_thrift.ColumnMetaData'>
codec: 0
data_page_offset: 31329295
dictionary_page_offset: 31295021
encoding_stats: [<class 'parquet_thrift.PageEncodingStats'>
count: 1
encoding: 0
page_type: 0
, <class 'parquet_thrift.PageEncodingStats'>
count: 1
encoding: 0
page_type: 2
]
encodings: [3, 4, 0]
index_page_offset: None
key_value_metadata: []
num_values: 253545
path_in_schema: ['category_3']
statistics: <class 'parquet_thrift.Statistics'>
distinct_count: None
max: None
min: None
null_count: 0
total_compressed_size: 507133
total_uncompressed_size: 507133
type: 6
, <class 'parquet_thrift.ColumnChunk'>
file_offset: 32427829
file_path: part.0.parquet
meta_data: <class 'parquet_thrift.ColumnMetaData'>
codec: 0
data_page_offset: 31920696
dictionary_page_offset: 31836502
encoding_stats: [<class 'parquet_thrift.PageEncodingStats'>
count: 1
encoding: 0
page_type: 0
, <class 'parquet_thrift.PageEncodingStats'>
count: 1
encoding: 0
page_type: 2
]
encodings: [3, 4, 0]
index_page_offset: None
key_value_metadata: []
num_values: 253545
path_in_schema: ['category_4']
statistics: <class 'parquet_thrift.Statistics'>
distinct_count: None
max: None
min: None
null_count: 0
total_compressed_size: 507133
total_uncompressed_size: 507133
type: 6
, <class 'parquet_thrift.ColumnChunk'>
file_offset: 32681548
file_path: part.0.parquet
meta_data: <class 'parquet_thrift.ColumnMetaData'>
codec: 0
data_page_offset: 32427960
dictionary_page_offset: 32427903
encoding_stats: [<class 'parquet_thrift.PageEncodingStats'>
count: 1
encoding: 0
page_type: 0
, <class 'parquet_thrift.PageEncodingStats'>
count: 1
encoding: 0
page_type: 2
]
encodings: [3, 4, 0]
index_page_offset: None
key_value_metadata: []
num_values: 253545
path_in_schema: ['category_5']
statistics: <class 'parquet_thrift.Statistics'>
distinct_count: None
max: None
min: None
null_count: 0
total_compressed_size: 253588
total_uncompressed_size: 253588
type: 6
, <class 'parquet_thrift.ColumnChunk'>
file_offset: 33250197
file_path: part.0.parquet
meta_data: <class 'parquet_thrift.ColumnMetaData'>
codec: 0
data_page_offset: 32743064
dictionary_page_offset: 32681622
encoding_stats: [<class 'parquet_thrift.PageEncodingStats'>
count: 1
encoding: 0
page_type: 0
, <class 'parquet_thrift.PageEncodingStats'>
count: 1
encoding: 0
page_type: 2
]
encodings: [3, 4, 0]
index_page_offset: None
key_value_metadata: []
num_values: 253545
path_in_schema: ['category_6']
statistics: <class 'parquet_thrift.Statistics'>
distinct_count: None
max: None
min: None
null_count: 0
total_compressed_size: 507133
total_uncompressed_size: 507133
type: 6
, <class 'parquet_thrift.ColumnChunk'>
file_offset: 33769670
file_path: part.0.parquet
meta_data: <class 'parquet_thrift.ColumnMetaData'>
codec: 0
data_page_offset: 33262537
dictionary_page_offset: 33250271
encoding_stats: [<class 'parquet_thrift.PageEncodingStats'>
count: 1
encoding: 0
page_type: 0
, <class 'parquet_thrift.PageEncodingStats'>
count: 1
encoding: 0
page_type: 2
]
encodings: [3, 4, 0]
index_page_offset: None
key_value_metadata: []
num_values: 253545
path_in_schema: ['category_7']
statistics: <class 'parquet_thrift.Statistics'>
distinct_count: None
max: None
min: None
null_count: 0
total_compressed_size: 507133
total_uncompressed_size: 507133
type: 6
, <class 'parquet_thrift.ColumnChunk'>
file_offset: 34023751
file_path: part.0.parquet
meta_data: <class 'parquet_thrift.ColumnMetaData'>
codec: 0
data_page_offset: 33770163
dictionary_page_offset: 33769744
encoding_stats: [<class 'parquet_thrift.PageEncodingStats'>
count: 1
encoding: 0
page_type: 0
, <class 'parquet_thrift.PageEncodingStats'>
count: 1
encoding: 0
page_type: 2
]
encodings: [3, 4, 0]
index_page_offset: None
key_value_metadata: []
num_values: 253545
path_in_schema: ['category_8']
statistics: <class 'parquet_thrift.Statistics'>
distinct_count: None
max: None
min: None
null_count: 0
total_compressed_size: 253588
total_uncompressed_size: 253588
type: 6
, <class 'parquet_thrift.ColumnChunk'>
file_offset: 35573024
file_path: part.0.parquet
meta_data: <class 'parquet_thrift.ColumnMetaData'>
codec: 0
data_page_offset: 34558428
dictionary_page_offset: 34023825
encoding_stats: [<class 'parquet_thrift.PageEncodingStats'>
count: 1
encoding: 0
page_type: 0
, <class 'parquet_thrift.PageEncodingStats'>
count: 1
encoding: 0
page_type: 2
]
encodings: [3, 4, 0]
index_page_offset: None
key_value_metadata: []
num_values: 253545
path_in_schema: ['category_9']
statistics: <class 'parquet_thrift.Statistics'>
distinct_count: None
max: None
min: None
null_count: 7830
total_compressed_size: 1014596
total_uncompressed_size: 1014596
type: 6
, <class 'parquet_thrift.ColumnChunk'>
file_offset: 36278164
file_path: part.0.parquet
meta_data: <class 'parquet_thrift.ColumnMetaData'>
codec: 0
data_page_offset: 35754998
dictionary_page_offset: 35573099
encoding_stats: [<class 'parquet_thrift.PageEncodingStats'>
count: 1
encoding: 0
page_type: 0
, <class 'parquet_thrift.PageEncodingStats'>
count: 1
encoding: 0
page_type: 2
]
encodings: [3, 4, 0]
index_page_offset: None
key_value_metadata: []
num_values: 253545
path_in_schema: ['category_10']
statistics: <class 'parquet_thrift.Statistics'>
distinct_count: None
max: None
min: None
null_count: 7830
total_compressed_size: 523166
total_uncompressed_size: 523166
type: 6
, <class 'parquet_thrift.ColumnChunk'>
file_offset: 37017408
file_path: part.0.parquet
meta_data: <class 'parquet_thrift.ColumnMetaData'>
codec: 0
data_page_offset: 36510275
dictionary_page_offset: 36278240
encoding_stats: [<class 'parquet_thrift.PageEncodingStats'>
count: 1
encoding: 0
page_type: 0
, <class 'parquet_thrift.PageEncodingStats'>
count: 1
encoding: 0
page_type: 2
]
encodings: [3, 4, 0]
index_page_offset: None
key_value_metadata: []
num_values: 253545
path_in_schema: ['category_11']
statistics: <class 'parquet_thrift.Statistics'>
distinct_count: None
max: None
min: None
null_count: 0
total_compressed_size: 507133
total_uncompressed_size: 507133
type: 6
, <class 'parquet_thrift.ColumnChunk'>
file_offset: 37271202
file_path: part.0.parquet
meta_data: <class 'parquet_thrift.ColumnMetaData'>
codec: 0
data_page_offset: 37017614
dictionary_page_offset: 37017483
encoding_stats: [<class 'parquet_thrift.PageEncodingStats'>
count: 1
encoding: 0
page_type: 0
, <class 'parquet_thrift.PageEncodingStats'>
count: 1
encoding: 0
page_type: 2
]
encodings: [3, 4, 0]
index_page_offset: None
key_value_metadata: []
num_values: 253545
path_in_schema: ['category_12']
statistics: <class 'parquet_thrift.Statistics'>
distinct_count: None
max: None
min: None
null_count: 0
total_compressed_size: 253588
total_uncompressed_size: 253588
type: 6
, <class 'parquet_thrift.ColumnChunk'>
file_offset: 37626523
file_path: part.0.parquet
meta_data: <class 'parquet_thrift.ColumnMetaData'>
codec: 0
data_page_offset: 37288883
dictionary_page_offset: 37271277
encoding_stats: [<class 'parquet_thrift.PageEncodingStats'>
count: 1
encoding: 0
page_type: 0
, <class 'parquet_thrift.PageEncodingStats'>
count: 1
encoding: 0
page_type: 2
]
encodings: [3, 4, 0]
index_page_offset: None
key_value_metadata: []
num_values: 253545
path_in_schema: ['category_13']
statistics: <class 'parquet_thrift.Statistics'>
distinct_count: None
max: None
min: None
null_count: 100593
total_compressed_size: 337640
total_uncompressed_size: 337640
type: 6
, <class 'parquet_thrift.ColumnChunk'>
file_offset: 38180427
file_path: part.0.parquet
meta_data: <class 'parquet_thrift.ColumnMetaData'>
codec: 0
data_page_offset: 37673294
dictionary_page_offset: 37626600
encoding_stats: [<class 'parquet_thrift.PageEncodingStats'>
count: 1
encoding: 0
page_type: 0
, <class 'parquet_thrift.PageEncodingStats'>
count: 1
encoding: 0
page_type: 2
]
encodings: [3, 4, 0]
index_page_offset: None
key_value_metadata: []
num_values: 253545
path_in_schema: ['category_14']
statistics: <class 'parquet_thrift.Statistics'>
distinct_count: None
max: None
min: None
null_count: 0
total_compressed_size: 507133
total_uncompressed_size: 507133
type: 6
, <class 'parquet_thrift.ColumnChunk'>
file_offset: 38365789
file_path: part.0.parquet
meta_data: <class 'parquet_thrift.ColumnMetaData'>
codec: 0
data_page_offset: 38181101
dictionary_page_offset: 38180502
encoding_stats: [<class 'parquet_thrift.PageEncodingStats'>
count: 1
encoding: 0
page_type: 0
, <class 'parquet_thrift.PageEncodingStats'>
count: 1
encoding: 0
page_type: 2
]
encodings: [3, 4, 0]
index_page_offset: None
key_value_metadata: []
num_values: 253545
path_in_schema: ['category_15']
statistics: <class 'parquet_thrift.Statistics'>
distinct_count: None
max: None
min: None
null_count: 100593
total_compressed_size: 184688
total_uncompressed_size: 184688
type: 6
, <class 'parquet_thrift.ColumnChunk'>
file_offset: 38550611
file_path: part.0.parquet
meta_data: <class 'parquet_thrift.ColumnMetaData'>
codec: 0
data_page_offset: 38365923
dictionary_page_offset: 38365866
encoding_stats: [<class 'parquet_thrift.PageEncodingStats'>
count: 1
encoding: 0
page_type: 0
, <class 'parquet_thrift.PageEncodingStats'>
count: 1
encoding: 0
page_type: 2
]
encodings: [3, 4, 0]
index_page_offset: None
key_value_metadata: []
num_values: 253545
path_in_schema: ['category_16']
statistics: <class 'parquet_thrift.Statistics'>
distinct_count: None
max: None
min: None
null_count: 100593
total_compressed_size: 184688
total_uncompressed_size: 184688
type: 6
, <class 'parquet_thrift.ColumnChunk'>
file_offset: 39063449
file_path: part.0.parquet
meta_data: <class 'parquet_thrift.ColumnMetaData'>
codec: 0
data_page_offset: 38556316
dictionary_page_offset: 38550688
encoding_stats: [<class 'parquet_thrift.PageEncodingStats'>
count: 1
encoding: 0
page_type: 0
, <class 'parquet_thrift.PageEncodingStats'>
count: 1
encoding: 0
page_type: 2
]
encodings: [3, 4, 0]
index_page_offset: None
key_value_metadata: []
num_values: 253545
path_in_schema: ['category_17']
statistics: <class 'parquet_thrift.Statistics'>
distinct_count: None
max: None
min: None
null_count: 0
total_compressed_size: 507133
total_uncompressed_size: 507133
type: 6
, <class 'parquet_thrift.ColumnChunk'>
file_offset: 39317303
file_path: part.0.parquet
meta_data: <class 'parquet_thrift.ColumnMetaData'>
codec: 0
data_page_offset: 39063715
dictionary_page_offset: 39063524
encoding_stats: [<class 'parquet_thrift.PageEncodingStats'>
count: 1
encoding: 0
page_type: 0
, <class 'parquet_thrift.PageEncodingStats'>
count: 1
encoding: 0
page_type: 2
]
encodings: [3, 4, 0]
index_page_offset: None
key_value_metadata: []
num_values: 253545
path_in_schema: ['category_18']
statistics: <class 'parquet_thrift.Statistics'>
distinct_count: None
max: None
min: None
null_count: 0
total_compressed_size: 253588
total_uncompressed_size: 253588
type: 6
, <class 'parquet_thrift.ColumnChunk'>
file_offset: 41013145
file_path: part.0.parquet
meta_data: <class 'parquet_thrift.ColumnMetaData'>
codec: 0
data_page_offset: 39998549
dictionary_page_offset: 39317378
encoding_stats: [<class 'parquet_thrift.PageEncodingStats'>
count: 1
encoding: 0
page_type: 0
, <class 'parquet_thrift.PageEncodingStats'>
count: 1
encoding: 0
page_type: 2
]
encodings: [3, 4, 0]
index_page_offset: None
key_value_metadata: []
num_values: 253545
path_in_schema: ['category_19']
statistics: <class 'parquet_thrift.Statistics'>
distinct_count: None
max: None
min: None
null_count: 7830
total_compressed_size: 1014596
total_uncompressed_size: 1014596
type: 6
, <class 'parquet_thrift.ColumnChunk'>
file_offset: 41891470
file_path: part.0.parquet
meta_data: <class 'parquet_thrift.ColumnMetaData'>
codec: 0
data_page_offset: 41368304
dictionary_page_offset: 41013221
encoding_stats: [<class 'parquet_thrift.PageEncodingStats'>
count: 1
encoding: 0
page_type: 0
, <class 'parquet_thrift.PageEncodingStats'>
count: 1
encoding: 0
page_type: 2
]
encodings: [3, 4, 0]
index_page_offset: None
key_value_metadata: []
num_values: 253545
path_in_schema: ['category_20']
statistics: <class 'parquet_thrift.Statistics'>
distinct_count: None
max: None
min: None
null_count: 7830
total_compressed_size: 523166
total_uncompressed_size: 523166
type: 6
, <class 'parquet_thrift.ColumnChunk'>
file_offset: 43524529
file_path: part.0.parquet
meta_data: <class 'parquet_thrift.ColumnMetaData'>
codec: 0
data_page_offset: 42509933
dictionary_page_offset: 41891546
encoding_stats: [<class 'parquet_thrift.PageEncodingStats'>
count: 1
encoding: 0
page_type: 0
, <class 'parquet_thrift.PageEncodingStats'>
count: 1
encoding: 0
page_type: 2
]
encodings: [3, 4, 0]
index_page_offset: None
key_value_metadata: []
num_values: 253545
path_in_schema: ['category_21']
statistics: <class 'parquet_thrift.Statistics'>
distinct_count: None
max: None
min: None
null_count: 7830
total_compressed_size: 1014596
total_uncompressed_size: 1014596
type: 6
, <class 'parquet_thrift.ColumnChunk'>
file_offset: 44027812
file_path: part.0.parquet
meta_data: <class 'parquet_thrift.ColumnMetaData'>
codec: 0
data_page_offset: 43690172
dictionary_page_offset: 43524605
encoding_stats: [<class 'parquet_thrift.PageEncodingStats'>
count: 1
encoding: 0
page_type: 0
, <class 'parquet_thrift.PageEncodingStats'>
count: 1
encoding: 0
page_type: 2
]
encodings: [3, 4, 0]
index_page_offset: None
key_value_metadata: []
num_values: 253545
path_in_schema: ['category_22']
statistics: <class 'parquet_thrift.Statistics'>
distinct_count: None
max: None
min: None
null_count: 100593
total_compressed_size: 337640
total_uncompressed_size: 337640
type: 6
, <class 'parquet_thrift.ColumnChunk'>
file_offset: 44611068
file_path: part.0.parquet
meta_data: <class 'parquet_thrift.ColumnMetaData'>
codec: 0
data_page_offset: 44103935
dictionary_page_offset: 44027889
encoding_stats: [<class 'parquet_thrift.PageEncodingStats'>
count: 1
encoding: 0
page_type: 0
, <class 'parquet_thrift.PageEncodingStats'>
count: 1
encoding: 0
page_type: 2
]
encodings: [3, 4, 0]
index_page_offset: None
key_value_metadata: []
num_values: 253545
path_in_schema: ['category_23']
statistics: <class 'parquet_thrift.Statistics'>
distinct_count: None
max: None
min: None
null_count: 0
total_compressed_size: 507133
total_uncompressed_size: 507133
type: 6
, <class 'parquet_thrift.ColumnChunk'>
file_offset: 44865258
file_path: part.0.parquet
meta_data: <class 'parquet_thrift.ColumnMetaData'>
codec: 0
data_page_offset: 44611670
dictionary_page_offset: 44611143
encoding_stats: [<class 'parquet_thrift.PageEncodingStats'>
count: 1
encoding: 0
page_type: 0
, <class 'parquet_thrift.PageEncodingStats'>
count: 1
encoding: 0
page_type: 2
]
encodings: [3, 4, 0]
index_page_offset: None
key_value_metadata: []
num_values: 253545
path_in_schema: ['category_24']
statistics: <class 'parquet_thrift.Statistics'>
distinct_count: None
max: None
min: None
null_count: 0
total_compressed_size: 253588
total_uncompressed_size: 253588
type: 6
, <class 'parquet_thrift.ColumnChunk'>
file_offset: 45119328
file_path: part.0.parquet
meta_data: <class 'parquet_thrift.ColumnMetaData'>
codec: 0
data_page_offset: 44865740
dictionary_page_offset: 44865333
encoding_stats: [<class 'parquet_thrift.PageEncodingStats'>
count: 1
encoding: 0
page_type: 0
, <class 'parquet_thrift.PageEncodingStats'>
count: 1
encoding: 0
page_type: 2
]
encodings: [3, 4, 0]
index_page_offset: None
key_value_metadata: []
num_values: 253545
path_in_schema: ['category_25']
statistics: <class 'parquet_thrift.Statistics'>
distinct_count: None
max: None
min: None
null_count: 0
total_compressed_size: 253588
total_uncompressed_size: 253588
type: 6
]
num_rows: 253545
sorting_columns: None
total_byte_size: 41139732,
False,
{'category_0': 1,
'category_1': 1,
'category_10': 1,
'category_11': 1,
'category_12': 1,
'category_13': 1,
'category_14': 1,
'category_15': 1,
'category_16': 1,
'category_17': 1,
'category_18': 1,
'category_19': 1,
'category_2': 1,
'category_20': 1,
'category_21': 1,
'category_22': 1,
'category_23': 1,
'category_24': 1,
'category_25': 1,
'category_3': 1,
'category_4': 1,
'category_5': 1,
'category_6': 1,
'category_7': 1,
'category_8': 1,
'category_9': 1},
<Parquet Schema with 41 entries>,
{},
{'category_0': 'category',
'category_1': 'category',
'category_10': 'category',
'category_11': 'category',
'category_12': 'category',
'category_13': 'category',
'category_14': 'category',
'category_15': 'category',
'category_16': 'category',
'category_17': 'category',
'category_18': 'category',
'category_19': 'category',
'category_2': 'category',
'category_20': 'category',
'category_21': 'category',
'category_22': 'category',
'category_23': 'category',
'category_24': 'category',
'category_25': 'category',
'category_3': 'category',
'category_4': 'category',
'category_5': 'category',
'category_6': 'category',
'category_7': 'category',
'category_8': 'category',
'category_9': 'category',
'click': dtype('int64'),
'numeric_0': dtype('float64'),
'numeric_1': dtype('float64'),
'numeric_10': dtype('float64'),
'numeric_11': dtype('float64'),
'numeric_12': dtype('float64'),
'numeric_2': dtype('float64'),
'numeric_3': dtype('float64'),
'numeric_4': dtype('float64'),
'numeric_5': dtype('float64'),
'numeric_6': dtype('float64'),
'numeric_7': dtype('int64'),
'numeric_8': dtype('int64'),
'numeric_9': dtype('float64')})} |
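For anyone reading the dump above: the `min`/`max` statistics are raw little-endian bytes. Below is a minimal sketch, using only the standard library, that decodes the two byte strings appearing for the `numeric_*` columns. The variable names are illustrative and not part of fastparquet; the type codes are read as Parquet physical types (5 = DOUBLE, 2 = INT64).

```python
import math
import struct

# Byte strings copied from the statistics in the row-group dump.
nan_bytes = b'\x00\x00\x00\x00\x00\x00\xf8\x7f'   # numeric_0 min/max (type 5, DOUBLE)
neg_bytes = b'\xff\xff\xff\xff\xff\xff\xff\xff'   # numeric_7 min (type 2, INT64)

# '<d' = little-endian float64, '<q' = little-endian signed int64.
as_double = struct.unpack('<d', nan_bytes)[0]
as_int64 = struct.unpack('<q', neg_bytes)[0]

print(math.isnan(as_double))  # True
print(as_int64)               # -1
```

So the float columns report NaN for both min and max, and `numeric_7` has a minimum of -1.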
To be sure: you get the error when you read the file via dask using this graph, which only loads one file, having only one row-group, but you don't get the error when you read 'day-0.parquet/part.0.parquet' directly? Suspicions:
|
Correct
I'm not sure exactly what I'm looking for here, but here is some output in case it's helpful:
In [2]: pf = fastparquet.ParquetFile('day-0.parquet/')
In [3]: pf.categories
Out[3]:
{'category_0': 1,
'category_1': 1,
'category_10': 1,
'category_11': 1,
'category_12': 1,
'category_13': 1,
'category_14': 1,
'category_15': 1,
'category_16': 1,
'category_17': 1,
'category_18': 1,
'category_19': 1,
'category_2': 1,
'category_20': 1,
'category_21': 1,
'category_22': 1,
'category_23': 1,
'category_24': 1,
'category_25': 1,
'category_3': 1,
'category_4': 1,
'category_5': 1,
'category_6': 1,
'category_7': 1,
'category_8': 1,
'category_9': 1}
In [4]: pf = fastparquet.ParquetFile('day-0.parquet/part.0.parquet')
In [5]: pf.categories
Out[5]: {}
In [6]: pf
Out[6]: <Parquet File: {'name': 'day-0.parquet/part.0.parquet', 'columns': ['click', 'numeric_0', 'numeric_1', 'numeric_2', 'numeric_3', 'numeric_4', 'numeric_5', 'numeric_6', 'numeric_7', 'numeric_8', 'numeric_9', 'numeric_10', 'numeric_11', 'numeric_12', 'category_0', 'category_1', 'category_2', 'category_3', 'category_4', 'category_5', 'category_6', 'category_7', 'category_8', 'category_9', 'category_10', 'category_11', 'category_12', 'category_13', 'category_14', 'category_15', 'category_16', 'category_17', 'category_18', 'category_19', 'category_20', 'category_21', 'category_22', 'category_23', 'category_24', 'category_25'], 'partitions': [], 'rows': 253545}> |
Is there more than one value per categorical column? |
There are. I'm writing dask dataframes whose categories are known per partition, but for which we don't know the full set of categories.
import dask.dataframe as dd
from dask.distributed import Client
client = Client()
columns = ['click'] + ['numeric_%d' % i for i in range(13)] + ['category_%d' % i for i in range(26)]
dtypes = {'category_%d' % i: 'category' for i in range(26)}
df = dd.read_csv('day_0', sep='\t', names=columns, header=None, dtype=dtypes)
df.to_parquet('day-0.parquet')
This is likely to be a fairly common case, since finding the full set of categories can be expensive. Is it possible to store categoricals efficiently without knowing the global set? |
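To illustrate what "categories per partition" means in practice, here is a small pandas-only sketch. The two series are hypothetical stand-ins for partitions; this does not show what dask or fastparquet do internally, only why per-partition codes are not comparable until the category sets are unioned.

```python
import pandas as pd
from pandas.api.types import union_categoricals

# Two hypothetical partitions, each with its own locally inferred categories.
part0 = pd.Series(['a', 'b', 'a'], dtype='category')
part1 = pd.Series(['b', 'c'], dtype='category')

# Same dtype name, different category sets, so the integer codes disagree:
# 'b' is code 1 in part0 but code 0 in part1.
print(list(part0.cat.categories), list(part0.cat.codes))
print(list(part1.cat.categories), list(part1.cat.codes))

# union_categoricals concatenates the values and builds the combined category set.
combined = union_categoricals([part0, part1])
print(list(combined.categories))
```

Building the combined set requires seeing every partition's categories, which is the expensive step mentioned above.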
Also, it looks like the individual files don't know that they should be categoricals. Though I suppose that this makes sense if we're depending on the metadata file:
In [1]: import fastparquet
In [2]: pf = fastparquet.ParquetFile('day-0.parquet/part.0.parquet')
In [3]: pf.to_pandas().dtypes
Out[3]:
click int64
numeric_0 float64
numeric_1 float64
numeric_2 float64
numeric_3 float64
numeric_4 float64
numeric_5 float64
numeric_6 float64
numeric_7 int64
numeric_8 int64
numeric_9 float64
numeric_10 float64
numeric_11 float64
numeric_12 float64
category_0 object
category_1 object
category_2 object
category_3 object
category_4 object
category_5 object
category_6 object
category_7 object
category_8 object
category_9 object
category_10 object
category_11 object
category_12 object
category_13 object
category_14 object
category_15 object
category_16 object
category_17 object
category_18 object
category_19 object
category_20 object
category_21 object
category_22 object
category_23 object
category_24 object
category_25 object
dtype: object |
This was written before dask allowed different category labels per partition... |
Note that to test if this is indeed the cause, try passing the |
In [1]: import dask.dataframe as dd
In [2]: df = dd.read_parquet('day-0.parquet/part.0.parquet', index=False, categories=['category_%d' % i for i in range(26)])
In [3]: df.head()
Out[3]: Segmentation fault (core dumped)
In [1]: import dask.dataframe as dd
In [2]: df = dd.read_parquet('day-0.parquet', index=False, categories=['category_%d' % i for i in range(26)])
In [3]: df.head()
Out[3]: Bus error (core dumped) |
Do you mind going through the columns to see if it's one particular one? |
Individually they seem to be ok?
In [1]: import dask.dataframe as dd
In [2]: df = dd.read_parquet('day-0.parquet/part.0.parquet', index=False, categories=['category_%d' % i for i in range(26)])
In [3]: for i in range(26):
...: print(i)
...: df['category_%d' % i].head()
...:
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
In [4]: df = dd.read_parquet('day-0.parquet', index=False, categories=['category_%d' % i for i in range(26)])
In [5]: for i in range(26):
...: print(i)
...: df['category_%d' % i].head()
...:
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25 |
The number of labels in day-1 (just downloaded) is:
category_8 30
Some of these are certainly more than 2**15. Does increasing the label count like the following avoid the error:
So we can fix this to make sure the category prescription is in every row-group (this will be JSON text, so easy to decode). However, having different prescriptions per row-group would, and should, make the dataset unreadable by fastparquet as categorical (reading it would require recoding each category set to a union of the category labels). |
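To make the "recoding to a union of category labels" step concrete, here is a minimal pure-Python sketch. It is not fastparquet's code: the function name `recode_to_union` and the `(labels, codes)` representation of a row-group are assumptions made for illustration, with `-1` standing in for a missing value as pandas does for categorical codes.

```python
def recode_to_union(partitions):
    """Recode dictionary-encoded partitions against one shared label list.

    `partitions` is a list of (labels, codes) pairs, one per row-group
    (a hypothetical layout chosen for this sketch). A code of -1 marks a
    missing value and is passed through unchanged.
    """
    union_labels = []
    position = {}  # label -> index in union_labels
    for labels, _ in partitions:
        for label in labels:
            if label not in position:
                position[label] = len(union_labels)
                union_labels.append(label)

    recoded = []
    for labels, codes in partitions:
        # Map each partition-local code to its index in the union.
        mapping = [position[label] for label in labels]
        recoded.append([mapping[c] if c >= 0 else -1 for c in codes])
    return union_labels, recoded


# Two row-groups whose label sets overlap only partially.
part0 = (["a", "b"], [0, 1, 1, -1])
part1 = (["b", "c"], [0, 1, 0])
union_labels, recoded = recode_to_union([part0, part1])
```

Building the union needs every row-group's label list before any codes can be rewritten, which is why this is expensive when the global set is not known up front.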
Huzzah
In [1]: import dask.dataframe as dd
In [2]: df = dd.read_parquet('day-0.parquet', index=False, categories={'category_%d' % i: 2**31 for i in range(26)})
In [3]: df.head()
Out[3]:
click numeric_0 numeric_1 numeric_2 numeric_3 numeric_4 numeric_5 \
0 1 5.0 110.0 NaN 16.0 NaN 1.0
1 0 32.0 3.0 5.0 NaN 1.0 0.0
2 0 NaN 233.0 1.0 146.0 1.0 0.0
3 0 NaN 24.0 NaN 11.0 24.0 NaN
4 0 60.0 223.0 6.0 15.0 5.0 0.0
numeric_6 numeric_7 numeric_8 ... category_16 category_17 \
0 0.0 14 7 ... d20856aa b8170bba
1 0.0 61 5 ... d20856aa a1eb1511
2 0.0 99 7 ... d20856aa 628f1b8d
3 0.0 56 3 ... 1f7fc70b a1eb1511
4 0.0 1 8 ... d20856aa d9f758ff
category_18 category_19 category_20 category_21 category_22 category_23 \
0 9512c20b c38e2f28 14f65a5d 25b1b089 d7c1fc0b 7caf609c
1 9512c20b febfd863 a3323ca1 c8e1ee56 1752e9e8 75350c8a
2 9512c20b c38e2f28 14f65a5d 25b1b089 d7c1fc0b 34a9b905
3 9512c20b NaN NaN NaN dc209cd3 b8a81fb0
4 9512c20b c709ec07 2b07677e a89a92a5 aa137169 e619743b
category_24 category_25
0 30436bfc ed10571d
1 991321ea b757e957
2 ff654802 ed10571d
3 30436bfc b757e957
4 cdc3217e ed10571d
[5 rows x 40 columns] |
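One plausible reading of why declaring a larger label count helps: if a reader sizes the integer type for category codes from the declared number of labels, then a declared count of 1 gives a type too narrow for columns with more than 2**15 labels. The sketch below only illustrates that sizing rule; `code_dtype_for` is a hypothetical helper, not fastparquet's implementation, and the assumption that the reader works this way is not confirmed in this thread.

```python
def code_dtype_for(n_labels):
    """Return the narrowest signed integer type name able to hold the
    codes 0 .. n_labels - 1 (illustrative rule, assumed for this sketch)."""
    for bits in (8, 16, 32, 64):
        if n_labels <= 2 ** (bits - 1):
            return "int%d" % bits
    raise ValueError("too many labels for a 64-bit code")


declared_in_metadata = code_dtype_for(1)        # what a count of 1 implies
needed_for_large_column = code_dtype_for(2**15 + 1)
with_workaround = code_dtype_for(2**31)         # count passed via categories=
```

Under this rule, the `2**31` passed through `categories=` selects a 32-bit code type, which is wide enough for every column discussed here.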
What is the status of this? |
Including the categories= keyword is the current workaround. |
I commented on pandas-dev/pandas#16010 (comment). No one has taken ownership of writing a specification, so I can do it, this week with any luck. I think that as soon as we have a spec, the implementation is simple. |
I wrote a spec here pandas-dev/pandas#16315. As soon as we finalize this, we can hustle to implement and ship this in the coming weeks. It would also be good to remove the index inference logic in Dask since the metadata makes it unnecessary. |
The index inference is still useful for parquet files from other vendors which happen to have statistics - it's really useful for optimization in many cases. |
OK, but how about changing to only use column statistics for this inference, and opting in to it rather than it being the default ( |
Agreed, if the metadata says what the index is intended to be (because it came from pandas), we should use that. |
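The policy discussed in the last few comments could be sketched roughly as follows. This is an illustration, not dask's code: `choose_index`, its arguments, and the opt-in flag `infer_from_stats` are hypothetical names, and the metadata is assumed to be available as a plain dict whose `"pandas"` entry is a JSON string with an `index_columns` list, as in the pandas metadata spec.

```python
import json


def choose_index(key_value_metadata, column_stats, infer_from_stats=False):
    """Pick index columns for a dataset (hypothetical helper).

    key_value_metadata: dict of file-level key/value metadata.
    column_stats: dict mapping column name to a list of (min, max)
        pairs, one per row-group, in file order.
    """
    # 1. Trust the pandas metadata when the writer recorded an index.
    pandas_meta = key_value_metadata.get("pandas")
    if pandas_meta is not None:
        index_columns = json.loads(pandas_meta).get("index_columns", [])
        names = [c for c in index_columns if isinstance(c, str)]
        if names:
            return names

    # 2. Otherwise infer from statistics only if the caller opted in:
    #    accept the first column whose row-group ranges do not overlap
    #    and are strictly increasing.
    if infer_from_stats:
        for name, ranges in column_stats.items():
            if ranges and all(a[1] < b[0] for a, b in zip(ranges, ranges[1:])):
                return [name]
    return None


meta = {"pandas": json.dumps({"index_columns": ["ts"]})}
stats = {"x": [(0, 9), (10, 19)]}
from_metadata = choose_index(meta, stats)
default_without_metadata = choose_index({}, stats)
opted_in = choose_index({}, stats, infer_from_stats=True)
```

Files from other writers that carry statistics but no pandas metadata still benefit from the second branch, which is the case raised above for keeping the inference available.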
Can this be closed? There have been many changes since this, including standardized pandas metadata and changes to categorical and index support. |
Closing as stale. |
I'm playing with the Criteo data and am getting an odd segfault from within pandas, even when running on a single thread.
cc @jreback @martindurant