Pandas Segfault when reading Parquet data #2224

Closed
mrocklin opened this issue Apr 15, 2017 · 22 comments

@mrocklin
Member

I'm playing with the Criteo data and am getting an odd segfault (reported by gdb as a bus error, SIGBUS) from within pandas, even when running on a single thread:

>>> import dask.dataframe as dd
>>> import dask
>>> dask.set_options(get=dask.async.get_sync)
<dask.context.set_options object at 0x7ffff6474be0>
>>> df = dd.read_parquet('day-0.parquet')
>>> df.head()

Program received signal SIGBUS, Bus error.
0x00007fffec39f443 in __pyx_f_6pandas_5algos_take_1d_object_object_memview (
    __pyx_optional_args=<synthetic pointer>, __pyx_v_values=..., __pyx_v_indexer=..., __pyx_v_out=...)
   from /home/mrocklin/Software/anaconda/lib/python3.6/site-packages/pandas/algos.cpython-36m-x86_64-linux-gnu.so
(gdb) up
#1  __pyx_pf_6pandas_5algos_380take_1d_object_object (__pyx_self=<optimized out>, 
    __pyx_v_fill_value=0x7ffff7f61fa8, __pyx_v_out=..., __pyx_v_indexer=..., __pyx_v_values=<optimized out>)
    at pandas/algos.c:2818
2818	pandas/algos.c: No such file or directory.
(gdb) up
#2  __pyx_pw_6pandas_5algos_381take_1d_object_object (__pyx_self=<optimized out>, __pyx_args=<optimized out>, 
    __pyx_kwds=<optimized out>) at pandas/algos.c:2741
2741	in pandas/algos.c
(gdb) up
#3  0x00007ffff7994902 in _PyCFunction_FastCallDict (func_obj=0x7fffec2d53a8, args=0x1550960, 
    nargs=<optimized out>, kwargs=0x0) at Objects/methodobject.c:231
231	Objects/methodobject.c: No such file or directory.
(gdb) up
#4  0x00007ffff7a19f4c in call_function (pp_stack=0x7fffffffac58, oparg=<optimized out>, kwnames=0x0)
    at Python/ceval.c:4788
4788	Python/ceval.c: No such file or directory.

cc @jreback @martindurant
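
One way to narrow down a crash like this without losing the interactive session is to run each candidate read in a fresh interpreter and check whether the child was killed by a signal. The sketch below assumes a POSIX system, where `subprocess` reports death by signal as a negative return code. The name `run_isolated` is hypothetical.

```python
import signal
import subprocess
import sys


def run_isolated(snippet, timeout=None):
    """Run `snippet` in a fresh interpreter (hypothetical helper).

    Returns the name of the signal that killed the child (e.g. "SIGBUS"),
    or None if the child exited on its own, whether cleanly or with an error.
    """
    proc = subprocess.run([sys.executable, "-c", snippet], timeout=timeout)
    if proc.returncode < 0:
        # On POSIX a return code of -N means the child died from signal N.
        return signal.Signals(-proc.returncode).name
    return None
```

The snippet string could be the failing read from the session above, repeated with different inputs, so that each crash is recorded rather than taking the parent process down.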

@mrocklin
Member Author

mrocklin commented Apr 15, 2017

Working to narrow this down. It doesn't occur when targeting one of the parquet files rather than the entire directory. It does occur on this dask graph:

{('get-partition-0-read-parquet-f55c83a03e199bcdcf32bbc567242dcf',
  0): ('read-parquet-f55c83a03e199bcdcf32bbc567242dcf', 0),
 ('read-parquet-f55c83a03e199bcdcf32bbc567242dcf',
  0): (<function dask.dataframe.io.parquet._read_parquet_row_group>,
  <dask.bytes.core.OpenFileCreator at 0x7f8270a01160>,
  'day-0.parquet/part.0.parquet',
  None,
  ('click',
   'numeric_0',
   'numeric_1',
   'numeric_2',
   'numeric_3',
   'numeric_4',
   'numeric_5',
   'numeric_6',
   'numeric_7',
   'numeric_8',
   'numeric_9',
   'numeric_10',
   'numeric_11',
   'numeric_12',
   'category_0',
   'category_1',
   'category_2',
   'category_3',
   'category_4',
   'category_5',
   'category_6',
   'category_7',
   'category_8',
   'category_9',
   'category_10',
   'category_11',
   'category_12',
   'category_13',
   'category_14',
   'category_15',
   'category_16',
   'category_17',
   'category_18',
   'category_19',
   'category_20',
   'category_21',
   'category_22',
   'category_23',
   'category_24',
   'category_25'),
  <class 'parquet_thrift.RowGroup'>
  columns: [<class 'parquet_thrift.ColumnChunk'>
  file_offset: 2028397
  file_path: part.0.parquet
  meta_data: <class 'parquet_thrift.ColumnMetaData'>
    codec: 0
    data_page_offset: 4
    dictionary_page_offset: None
    encoding_stats: [<class 'parquet_thrift.PageEncodingStats'>
  count: 1
  encoding: 0
  page_type: 0
  ]
    encodings: [3, 4, 0]
    index_page_offset: None
    key_value_metadata: []
    num_values: 253545
    path_in_schema: ['click']
    statistics: <class 'parquet_thrift.Statistics'>
      distinct_count: None
      max: �
      min: 
      null_count: 0
  
    total_compressed_size: 2028393
    total_uncompressed_size: 2028393
    type: 2
  
  , <class 'parquet_thrift.ColumnChunk'>
  file_offset: 4056866
  file_path: part.0.parquet
  meta_data: <class 'parquet_thrift.ColumnMetaData'>
    codec: 0
    data_page_offset: 2028473
    dictionary_page_offset: None
    encoding_stats: [<class 'parquet_thrift.PageEncodingStats'>
  count: 1
  encoding: 0
  page_type: 0
  ]
    encodings: [3, 4, 0]
    index_page_offset: None
    key_value_metadata: []
    num_values: 253545
    path_in_schema: ['numeric_0']
    statistics: <class 'parquet_thrift.Statistics'>
      distinct_count: None
      max: b'\x00\x00\x00\x00\x00\x00\xf8\x7f'
      min: b'\x00\x00\x00\x00\x00\x00\xf8\x7f'
      null_count: 0
  
    total_compressed_size: 2028393
    total_uncompressed_size: 2028393
    type: 5
  
  , <class 'parquet_thrift.ColumnChunk'>
  file_offset: 6085342
  file_path: part.0.parquet
  meta_data: <class 'parquet_thrift.ColumnMetaData'>
    codec: 0
    data_page_offset: 4056949
    dictionary_page_offset: None
    encoding_stats: [<class 'parquet_thrift.PageEncodingStats'>
  count: 1
  encoding: 0
  page_type: 0
  ]
    encodings: [3, 4, 0]
    index_page_offset: None
    key_value_metadata: []
    num_values: 253545
    path_in_schema: ['numeric_1']
    statistics: <class 'parquet_thrift.Statistics'>
      distinct_count: None
      max: b'\x00\x00\x00\x00\x00\x00\xf8\x7f'
      min: b'\x00\x00\x00\x00\x00\x00\xf8\x7f'
      null_count: 0
  
    total_compressed_size: 2028393
    total_uncompressed_size: 2028393
    type: 5
  
  , <class 'parquet_thrift.ColumnChunk'>
  file_offset: 8113818
  file_path: part.0.parquet
  meta_data: <class 'parquet_thrift.ColumnMetaData'>
    codec: 0
    data_page_offset: 6085425
    dictionary_page_offset: None
    encoding_stats: [<class 'parquet_thrift.PageEncodingStats'>
  count: 1
  encoding: 0
  page_type: 0
  ]
    encodings: [3, 4, 0]
    index_page_offset: None
    key_value_metadata: []
    num_values: 253545
    path_in_schema: ['numeric_2']
    statistics: <class 'parquet_thrift.Statistics'>
      distinct_count: None
      max: b'\x00\x00\x00\x00\x00\x00\xf8\x7f'
      min: b'\x00\x00\x00\x00\x00\x00\xf8\x7f'
      null_count: 0
  
    total_compressed_size: 2028393
    total_uncompressed_size: 2028393
    type: 5
  
  , <class 'parquet_thrift.ColumnChunk'>
  file_offset: 10142294
  file_path: part.0.parquet
  meta_data: <class 'parquet_thrift.ColumnMetaData'>
    codec: 0
    data_page_offset: 8113901
    dictionary_page_offset: None
    encoding_stats: [<class 'parquet_thrift.PageEncodingStats'>
  count: 1
  encoding: 0
  page_type: 0
  ]
    encodings: [3, 4, 0]
    index_page_offset: None
    key_value_metadata: []
    num_values: 253545
    path_in_schema: ['numeric_3']
    statistics: <class 'parquet_thrift.Statistics'>
      distinct_count: None
      max: b'\x00\x00\x00\x00\x00\x00\xf8\x7f'
      min: b'\x00\x00\x00\x00\x00\x00\xf8\x7f'
      null_count: 0
  
    total_compressed_size: 2028393
    total_uncompressed_size: 2028393
    type: 5
  
  , <class 'parquet_thrift.ColumnChunk'>
  file_offset: 12170770
  file_path: part.0.parquet
  meta_data: <class 'parquet_thrift.ColumnMetaData'>
    codec: 0
    data_page_offset: 10142377
    dictionary_page_offset: None
    encoding_stats: [<class 'parquet_thrift.PageEncodingStats'>
  count: 1
  encoding: 0
  page_type: 0
  ]
    encodings: [3, 4, 0]
    index_page_offset: None
    key_value_metadata: []
    num_values: 253545
    path_in_schema: ['numeric_4']
    statistics: <class 'parquet_thrift.Statistics'>
      distinct_count: None
      max: b'\x00\x00\x00\x00\x00\x00\xf8\x7f'
      min: b'\x00\x00\x00\x00\x00\x00\xf8\x7f'
      null_count: 0
  
    total_compressed_size: 2028393
    total_uncompressed_size: 2028393
    type: 5
  
  , <class 'parquet_thrift.ColumnChunk'>
  file_offset: 14199246
  file_path: part.0.parquet
  meta_data: <class 'parquet_thrift.ColumnMetaData'>
    codec: 0
    data_page_offset: 12170853
    dictionary_page_offset: None
    encoding_stats: [<class 'parquet_thrift.PageEncodingStats'>
  count: 1
  encoding: 0
  page_type: 0
  ]
    encodings: [3, 4, 0]
    index_page_offset: None
    key_value_metadata: []
    num_values: 253545
    path_in_schema: ['numeric_5']
    statistics: <class 'parquet_thrift.Statistics'>
      distinct_count: None
      max: b'\x00\x00\x00\x00\x00\x00\xf8\x7f'
      min: b'\x00\x00\x00\x00\x00\x00\xf8\x7f'
      null_count: 0
  
    total_compressed_size: 2028393
    total_uncompressed_size: 2028393
    type: 5
  
  , <class 'parquet_thrift.ColumnChunk'>
  file_offset: 16227722
  file_path: part.0.parquet
  meta_data: <class 'parquet_thrift.ColumnMetaData'>
    codec: 0
    data_page_offset: 14199329
    dictionary_page_offset: None
    encoding_stats: [<class 'parquet_thrift.PageEncodingStats'>
  count: 1
  encoding: 0
  page_type: 0
  ]
    encodings: [3, 4, 0]
    index_page_offset: None
    key_value_metadata: []
    num_values: 253545
    path_in_schema: ['numeric_6']
    statistics: <class 'parquet_thrift.Statistics'>
      distinct_count: None
      max: b'\x00\x00\x00\x00\x00\x00\xf8\x7f'
      min: b'\x00\x00\x00\x00\x00\x00\xf8\x7f'
      null_count: 0
  
    total_compressed_size: 2028393
    total_uncompressed_size: 2028393
    type: 5
  
  , <class 'parquet_thrift.ColumnChunk'>
  file_offset: 18256198
  file_path: part.0.parquet
  meta_data: <class 'parquet_thrift.ColumnMetaData'>
    codec: 0
    data_page_offset: 16227805
    dictionary_page_offset: None
    encoding_stats: [<class 'parquet_thrift.PageEncodingStats'>
  count: 1
  encoding: 0
  page_type: 0
  ]
    encodings: [3, 4, 0]
    index_page_offset: None
    key_value_metadata: []
    num_values: 253545
    path_in_schema: ['numeric_7']
    statistics: <class 'parquet_thrift.Statistics'>
      distinct_count: None
      max: �L
      min: b'\xff\xff\xff\xff\xff\xff\xff\xff'
      null_count: 0
  
    total_compressed_size: 2028393
    total_uncompressed_size: 2028393
    type: 2
  
  , <class 'parquet_thrift.ColumnChunk'>
  file_offset: 20284674
  file_path: part.0.parquet
  meta_data: <class 'parquet_thrift.ColumnMetaData'>
    codec: 0
    data_page_offset: 18256281
    dictionary_page_offset: None
    encoding_stats: [<class 'parquet_thrift.PageEncodingStats'>
  count: 1
  encoding: 0
  page_type: 0
  ]
    encodings: [3, 4, 0]
    index_page_offset: None
    key_value_metadata: []
    num_values: 253545
    path_in_schema: ['numeric_8']
    statistics: <class 'parquet_thrift.Statistics'>
      distinct_count: None
      max: A
      min: 
      null_count: 0
  
    total_compressed_size: 2028393
    total_uncompressed_size: 2028393
    type: 2
  
  , <class 'parquet_thrift.ColumnChunk'>
  file_offset: 22313150
  file_path: part.0.parquet
  meta_data: <class 'parquet_thrift.ColumnMetaData'>
    codec: 0
    data_page_offset: 20284757
    dictionary_page_offset: None
    encoding_stats: [<class 'parquet_thrift.PageEncodingStats'>
  count: 1
  encoding: 0
  page_type: 0
  ]
    encodings: [3, 4, 0]
    index_page_offset: None
    key_value_metadata: []
    num_values: 253545
    path_in_schema: ['numeric_9']
    statistics: <class 'parquet_thrift.Statistics'>
      distinct_count: None
      max: b'\x00\x00\x00\x00\x00\x00\xf8\x7f'
      min: b'\x00\x00\x00\x00\x00\x00\xf8\x7f'
      null_count: 0
  
    total_compressed_size: 2028393
    total_uncompressed_size: 2028393
    type: 5
  
  , <class 'parquet_thrift.ColumnChunk'>
  file_offset: 24341626
  file_path: part.0.parquet
  meta_data: <class 'parquet_thrift.ColumnMetaData'>
    codec: 0
    data_page_offset: 22313233
    dictionary_page_offset: None
    encoding_stats: [<class 'parquet_thrift.PageEncodingStats'>
  count: 1
  encoding: 0
  page_type: 0
  ]
    encodings: [3, 4, 0]
    index_page_offset: None
    key_value_metadata: []
    num_values: 253545
    path_in_schema: ['numeric_10']
    statistics: <class 'parquet_thrift.Statistics'>
      distinct_count: None
      max: b'\x00\x00\x00\x00\x00\x00\xf8\x7f'
      min: b'\x00\x00\x00\x00\x00\x00\xf8\x7f'
      null_count: 0
  
    total_compressed_size: 2028393
    total_uncompressed_size: 2028393
    type: 5
  
  , <class 'parquet_thrift.ColumnChunk'>
  file_offset: 26370103
  file_path: part.0.parquet
  meta_data: <class 'parquet_thrift.ColumnMetaData'>
    codec: 0
    data_page_offset: 24341710
    dictionary_page_offset: None
    encoding_stats: [<class 'parquet_thrift.PageEncodingStats'>
  count: 1
  encoding: 0
  page_type: 0
  ]
    encodings: [3, 4, 0]
    index_page_offset: None
    key_value_metadata: []
    num_values: 253545
    path_in_schema: ['numeric_11']
    statistics: <class 'parquet_thrift.Statistics'>
      distinct_count: None
      max: b'\x00\x00\x00\x00\x00\x00\xf8\x7f'
      min: b'\x00\x00\x00\x00\x00\x00\xf8\x7f'
      null_count: 0
  
    total_compressed_size: 2028393
    total_uncompressed_size: 2028393
    type: 5
  
  , <class 'parquet_thrift.ColumnChunk'>
  file_offset: 28398580
  file_path: part.0.parquet
  meta_data: <class 'parquet_thrift.ColumnMetaData'>
    codec: 0
    data_page_offset: 26370187
    dictionary_page_offset: None
    encoding_stats: [<class 'parquet_thrift.PageEncodingStats'>
  count: 1
  encoding: 0
  page_type: 0
  ]
    encodings: [3, 4, 0]
    index_page_offset: None
    key_value_metadata: []
    num_values: 253545
    path_in_schema: ['numeric_12']
    statistics: <class 'parquet_thrift.Statistics'>
      distinct_count: None
      max: b'\x00\x00\x00\x00\x00\x00\xf8\x7f'
      min: b'\x00\x00\x00\x00\x00\x00\xf8\x7f'
      null_count: 0
  
    total_compressed_size: 2028393
    total_uncompressed_size: 2028393
    type: 5
  
  , <class 'parquet_thrift.ColumnChunk'>
  file_offset: 30071583
  file_path: part.0.parquet
  meta_data: <class 'parquet_thrift.ColumnMetaData'>
    codec: 0
    data_page_offset: 29056987
    dictionary_page_offset: 28398664
    encoding_stats: [<class 'parquet_thrift.PageEncodingStats'>
  count: 1
  encoding: 0
  page_type: 0
  , <class 'parquet_thrift.PageEncodingStats'>
  count: 1
  encoding: 0
  page_type: 2
  ]
    encodings: [3, 4, 0]
    index_page_offset: None
    key_value_metadata: []
    num_values: 253545
    path_in_schema: ['category_0']
    statistics: <class 'parquet_thrift.Statistics'>
      distinct_count: None
      max: None
      min: None
      null_count: 7830
  
    total_compressed_size: 1014596
    total_uncompressed_size: 1014596
    type: 6
  
  , <class 'parquet_thrift.ColumnChunk'>
  file_offset: 30675249
  file_path: part.0.parquet
  meta_data: <class 'parquet_thrift.ColumnMetaData'>
    codec: 0
    data_page_offset: 30168116
    dictionary_page_offset: 30071658
    encoding_stats: [<class 'parquet_thrift.PageEncodingStats'>
  count: 1
  encoding: 0
  page_type: 0
  , <class 'parquet_thrift.PageEncodingStats'>
  count: 1
  encoding: 0
  page_type: 2
  ]
    encodings: [3, 4, 0]
    index_page_offset: None
    key_value_metadata: []
    num_values: 253545
    path_in_schema: ['category_1']
    statistics: <class 'parquet_thrift.Statistics'>
      distinct_count: None
      max: None
      min: None
      null_count: 0
  
    total_compressed_size: 507133
    total_uncompressed_size: 507133
    type: 6
  
  , <class 'parquet_thrift.ColumnChunk'>
  file_offset: 31294947
  file_path: part.0.parquet
  meta_data: <class 'parquet_thrift.ColumnMetaData'>
    codec: 0
    data_page_offset: 30787814
    dictionary_page_offset: 30675323
    encoding_stats: [<class 'parquet_thrift.PageEncodingStats'>
  count: 1
  encoding: 0
  page_type: 0
  , <class 'parquet_thrift.PageEncodingStats'>
  count: 1
  encoding: 0
  page_type: 2
  ]
    encodings: [3, 4, 0]
    index_page_offset: None
    key_value_metadata: []
    num_values: 253545
    path_in_schema: ['category_2']
    statistics: <class 'parquet_thrift.Statistics'>
      distinct_count: None
      max: None
      min: None
      null_count: 0
  
    total_compressed_size: 507133
    total_uncompressed_size: 507133
    type: 6
  
  , <class 'parquet_thrift.ColumnChunk'>
  file_offset: 31836428
  file_path: part.0.parquet
  meta_data: <class 'parquet_thrift.ColumnMetaData'>
    codec: 0
    data_page_offset: 31329295
    dictionary_page_offset: 31295021
    encoding_stats: [<class 'parquet_thrift.PageEncodingStats'>
  count: 1
  encoding: 0
  page_type: 0
  , <class 'parquet_thrift.PageEncodingStats'>
  count: 1
  encoding: 0
  page_type: 2
  ]
    encodings: [3, 4, 0]
    index_page_offset: None
    key_value_metadata: []
    num_values: 253545
    path_in_schema: ['category_3']
    statistics: <class 'parquet_thrift.Statistics'>
      distinct_count: None
      max: None
      min: None
      null_count: 0
  
    total_compressed_size: 507133
    total_uncompressed_size: 507133
    type: 6
  
  , <class 'parquet_thrift.ColumnChunk'>
  file_offset: 32427829
  file_path: part.0.parquet
  meta_data: <class 'parquet_thrift.ColumnMetaData'>
    codec: 0
    data_page_offset: 31920696
    dictionary_page_offset: 31836502
    encoding_stats: [<class 'parquet_thrift.PageEncodingStats'>
  count: 1
  encoding: 0
  page_type: 0
  , <class 'parquet_thrift.PageEncodingStats'>
  count: 1
  encoding: 0
  page_type: 2
  ]
    encodings: [3, 4, 0]
    index_page_offset: None
    key_value_metadata: []
    num_values: 253545
    path_in_schema: ['category_4']
    statistics: <class 'parquet_thrift.Statistics'>
      distinct_count: None
      max: None
      min: None
      null_count: 0
  
    total_compressed_size: 507133
    total_uncompressed_size: 507133
    type: 6
  
  , <class 'parquet_thrift.ColumnChunk'>
  file_offset: 32681548
  file_path: part.0.parquet
  meta_data: <class 'parquet_thrift.ColumnMetaData'>
    codec: 0
    data_page_offset: 32427960
    dictionary_page_offset: 32427903
    encoding_stats: [<class 'parquet_thrift.PageEncodingStats'>
  count: 1
  encoding: 0
  page_type: 0
  , <class 'parquet_thrift.PageEncodingStats'>
  count: 1
  encoding: 0
  page_type: 2
  ]
    encodings: [3, 4, 0]
    index_page_offset: None
    key_value_metadata: []
    num_values: 253545
    path_in_schema: ['category_5']
    statistics: <class 'parquet_thrift.Statistics'>
      distinct_count: None
      max: None
      min: None
      null_count: 0
  
    total_compressed_size: 253588
    total_uncompressed_size: 253588
    type: 6
  
  , <class 'parquet_thrift.ColumnChunk'>
  file_offset: 33250197
  file_path: part.0.parquet
  meta_data: <class 'parquet_thrift.ColumnMetaData'>
    codec: 0
    data_page_offset: 32743064
    dictionary_page_offset: 32681622
    encoding_stats: [<class 'parquet_thrift.PageEncodingStats'>
  count: 1
  encoding: 0
  page_type: 0
  , <class 'parquet_thrift.PageEncodingStats'>
  count: 1
  encoding: 0
  page_type: 2
  ]
    encodings: [3, 4, 0]
    index_page_offset: None
    key_value_metadata: []
    num_values: 253545
    path_in_schema: ['category_6']
    statistics: <class 'parquet_thrift.Statistics'>
      distinct_count: None
      max: None
      min: None
      null_count: 0
  
    total_compressed_size: 507133
    total_uncompressed_size: 507133
    type: 6
  
  , <class 'parquet_thrift.ColumnChunk'>
  file_offset: 33769670
  file_path: part.0.parquet
  meta_data: <class 'parquet_thrift.ColumnMetaData'>
    codec: 0
    data_page_offset: 33262537
    dictionary_page_offset: 33250271
    encoding_stats: [<class 'parquet_thrift.PageEncodingStats'>
  count: 1
  encoding: 0
  page_type: 0
  , <class 'parquet_thrift.PageEncodingStats'>
  count: 1
  encoding: 0
  page_type: 2
  ]
    encodings: [3, 4, 0]
    index_page_offset: None
    key_value_metadata: []
    num_values: 253545
    path_in_schema: ['category_7']
    statistics: <class 'parquet_thrift.Statistics'>
      distinct_count: None
      max: None
      min: None
      null_count: 0
  
    total_compressed_size: 507133
    total_uncompressed_size: 507133
    type: 6
  
  , <class 'parquet_thrift.ColumnChunk'>
  file_offset: 34023751
  file_path: part.0.parquet
  meta_data: <class 'parquet_thrift.ColumnMetaData'>
    codec: 0
    data_page_offset: 33770163
    dictionary_page_offset: 33769744
    encoding_stats: [<class 'parquet_thrift.PageEncodingStats'>
  count: 1
  encoding: 0
  page_type: 0
  , <class 'parquet_thrift.PageEncodingStats'>
  count: 1
  encoding: 0
  page_type: 2
  ]
    encodings: [3, 4, 0]
    index_page_offset: None
    key_value_metadata: []
    num_values: 253545
    path_in_schema: ['category_8']
    statistics: <class 'parquet_thrift.Statistics'>
      distinct_count: None
      max: None
      min: None
      null_count: 0
  
    total_compressed_size: 253588
    total_uncompressed_size: 253588
    type: 6
  
  , <class 'parquet_thrift.ColumnChunk'>
  file_offset: 35573024
  file_path: part.0.parquet
  meta_data: <class 'parquet_thrift.ColumnMetaData'>
    codec: 0
    data_page_offset: 34558428
    dictionary_page_offset: 34023825
    encoding_stats: [<class 'parquet_thrift.PageEncodingStats'>
  count: 1
  encoding: 0
  page_type: 0
  , <class 'parquet_thrift.PageEncodingStats'>
  count: 1
  encoding: 0
  page_type: 2
  ]
    encodings: [3, 4, 0]
    index_page_offset: None
    key_value_metadata: []
    num_values: 253545
    path_in_schema: ['category_9']
    statistics: <class 'parquet_thrift.Statistics'>
      distinct_count: None
      max: None
      min: None
      null_count: 7830
  
    total_compressed_size: 1014596
    total_uncompressed_size: 1014596
    type: 6
  
  , <class 'parquet_thrift.ColumnChunk'>
  file_offset: 36278164
  file_path: part.0.parquet
  meta_data: <class 'parquet_thrift.ColumnMetaData'>
    codec: 0
    data_page_offset: 35754998
    dictionary_page_offset: 35573099
    encoding_stats: [<class 'parquet_thrift.PageEncodingStats'>
  count: 1
  encoding: 0
  page_type: 0
  , <class 'parquet_thrift.PageEncodingStats'>
  count: 1
  encoding: 0
  page_type: 2
  ]
    encodings: [3, 4, 0]
    index_page_offset: None
    key_value_metadata: []
    num_values: 253545
    path_in_schema: ['category_10']
    statistics: <class 'parquet_thrift.Statistics'>
      distinct_count: None
      max: None
      min: None
      null_count: 7830
  
    total_compressed_size: 523166
    total_uncompressed_size: 523166
    type: 6
  
  , <class 'parquet_thrift.ColumnChunk'>
  file_offset: 37017408
  file_path: part.0.parquet
  meta_data: <class 'parquet_thrift.ColumnMetaData'>
    codec: 0
    data_page_offset: 36510275
    dictionary_page_offset: 36278240
    encoding_stats: [<class 'parquet_thrift.PageEncodingStats'>
  count: 1
  encoding: 0
  page_type: 0
  , <class 'parquet_thrift.PageEncodingStats'>
  count: 1
  encoding: 0
  page_type: 2
  ]
    encodings: [3, 4, 0]
    index_page_offset: None
    key_value_metadata: []
    num_values: 253545
    path_in_schema: ['category_11']
    statistics: <class 'parquet_thrift.Statistics'>
      distinct_count: None
      max: None
      min: None
      null_count: 0
  
    total_compressed_size: 507133
    total_uncompressed_size: 507133
    type: 6
  
  , <class 'parquet_thrift.ColumnChunk'>
  file_offset: 37271202
  file_path: part.0.parquet
  meta_data: <class 'parquet_thrift.ColumnMetaData'>
    codec: 0
    data_page_offset: 37017614
    dictionary_page_offset: 37017483
    encoding_stats: [<class 'parquet_thrift.PageEncodingStats'>
  count: 1
  encoding: 0
  page_type: 0
  , <class 'parquet_thrift.PageEncodingStats'>
  count: 1
  encoding: 0
  page_type: 2
  ]
    encodings: [3, 4, 0]
    index_page_offset: None
    key_value_metadata: []
    num_values: 253545
    path_in_schema: ['category_12']
    statistics: <class 'parquet_thrift.Statistics'>
      distinct_count: None
      max: None
      min: None
      null_count: 0
  
    total_compressed_size: 253588
    total_uncompressed_size: 253588
    type: 6
  
  , <class 'parquet_thrift.ColumnChunk'>
  file_offset: 37626523
  file_path: part.0.parquet
  meta_data: <class 'parquet_thrift.ColumnMetaData'>
    codec: 0
    data_page_offset: 37288883
    dictionary_page_offset: 37271277
    encoding_stats: [<class 'parquet_thrift.PageEncodingStats'>
  count: 1
  encoding: 0
  page_type: 0
  , <class 'parquet_thrift.PageEncodingStats'>
  count: 1
  encoding: 0
  page_type: 2
  ]
    encodings: [3, 4, 0]
    index_page_offset: None
    key_value_metadata: []
    num_values: 253545
    path_in_schema: ['category_13']
    statistics: <class 'parquet_thrift.Statistics'>
      distinct_count: None
      max: None
      min: None
      null_count: 100593
  
    total_compressed_size: 337640
    total_uncompressed_size: 337640
    type: 6
  
  , <class 'parquet_thrift.ColumnChunk'>
  file_offset: 38180427
  file_path: part.0.parquet
  meta_data: <class 'parquet_thrift.ColumnMetaData'>
    codec: 0
    data_page_offset: 37673294
    dictionary_page_offset: 37626600
    encoding_stats: [<class 'parquet_thrift.PageEncodingStats'>
  count: 1
  encoding: 0
  page_type: 0
  , <class 'parquet_thrift.PageEncodingStats'>
  count: 1
  encoding: 0
  page_type: 2
  ]
    encodings: [3, 4, 0]
    index_page_offset: None
    key_value_metadata: []
    num_values: 253545
    path_in_schema: ['category_14']
    statistics: <class 'parquet_thrift.Statistics'>
      distinct_count: None
      max: None
      min: None
      null_count: 0
  
    total_compressed_size: 507133
    total_uncompressed_size: 507133
    type: 6
  
  , <class 'parquet_thrift.ColumnChunk'>
  file_offset: 38365789
  file_path: part.0.parquet
  meta_data: <class 'parquet_thrift.ColumnMetaData'>
    codec: 0
    data_page_offset: 38181101
    dictionary_page_offset: 38180502
    encoding_stats: [<class 'parquet_thrift.PageEncodingStats'>
  count: 1
  encoding: 0
  page_type: 0
  , <class 'parquet_thrift.PageEncodingStats'>
  count: 1
  encoding: 0
  page_type: 2
  ]
    encodings: [3, 4, 0]
    index_page_offset: None
    key_value_metadata: []
    num_values: 253545
    path_in_schema: ['category_15']
    statistics: <class 'parquet_thrift.Statistics'>
      distinct_count: None
      max: None
      min: None
      null_count: 100593
  
    total_compressed_size: 184688
    total_uncompressed_size: 184688
    type: 6
  
  , <class 'parquet_thrift.ColumnChunk'>
  file_offset: 38550611
  file_path: part.0.parquet
  meta_data: <class 'parquet_thrift.ColumnMetaData'>
    codec: 0
    data_page_offset: 38365923
    dictionary_page_offset: 38365866
    encoding_stats: [<class 'parquet_thrift.PageEncodingStats'>
  count: 1
  encoding: 0
  page_type: 0
  , <class 'parquet_thrift.PageEncodingStats'>
  count: 1
  encoding: 0
  page_type: 2
  ]
    encodings: [3, 4, 0]
    index_page_offset: None
    key_value_metadata: []
    num_values: 253545
    path_in_schema: ['category_16']
    statistics: <class 'parquet_thrift.Statistics'>
      distinct_count: None
      max: None
      min: None
      null_count: 100593
  
    total_compressed_size: 184688
    total_uncompressed_size: 184688
    type: 6
  
  , <class 'parquet_thrift.ColumnChunk'>
  file_offset: 39063449
  file_path: part.0.parquet
  meta_data: <class 'parquet_thrift.ColumnMetaData'>
    codec: 0
    data_page_offset: 38556316
    dictionary_page_offset: 38550688
    encoding_stats: [<class 'parquet_thrift.PageEncodingStats'>
  count: 1
  encoding: 0
  page_type: 0
  , <class 'parquet_thrift.PageEncodingStats'>
  count: 1
  encoding: 0
  page_type: 2
  ]
    encodings: [3, 4, 0]
    index_page_offset: None
    key_value_metadata: []
    num_values: 253545
    path_in_schema: ['category_17']
    statistics: <class 'parquet_thrift.Statistics'>
      distinct_count: None
      max: None
      min: None
      null_count: 0
  
    total_compressed_size: 507133
    total_uncompressed_size: 507133
    type: 6
  
  , <class 'parquet_thrift.ColumnChunk'>
  file_offset: 39317303
  file_path: part.0.parquet
  meta_data: <class 'parquet_thrift.ColumnMetaData'>
    codec: 0
    data_page_offset: 39063715
    dictionary_page_offset: 39063524
    encoding_stats: [<class 'parquet_thrift.PageEncodingStats'>
  count: 1
  encoding: 0
  page_type: 0
  , <class 'parquet_thrift.PageEncodingStats'>
  count: 1
  encoding: 0
  page_type: 2
  ]
    encodings: [3, 4, 0]
    index_page_offset: None
    key_value_metadata: []
    num_values: 253545
    path_in_schema: ['category_18']
    statistics: <class 'parquet_thrift.Statistics'>
      distinct_count: None
      max: None
      min: None
      null_count: 0
  
    total_compressed_size: 253588
    total_uncompressed_size: 253588
    type: 6
  
  , <class 'parquet_thrift.ColumnChunk'>
  file_offset: 41013145
  file_path: part.0.parquet
  meta_data: <class 'parquet_thrift.ColumnMetaData'>
    codec: 0
    data_page_offset: 39998549
    dictionary_page_offset: 39317378
    encoding_stats: [<class 'parquet_thrift.PageEncodingStats'>
  count: 1
  encoding: 0
  page_type: 0
  , <class 'parquet_thrift.PageEncodingStats'>
  count: 1
  encoding: 0
  page_type: 2
  ]
    encodings: [3, 4, 0]
    index_page_offset: None
    key_value_metadata: []
    num_values: 253545
    path_in_schema: ['category_19']
    statistics: <class 'parquet_thrift.Statistics'>
      distinct_count: None
      max: None
      min: None
      null_count: 7830
  
    total_compressed_size: 1014596
    total_uncompressed_size: 1014596
    type: 6
  
  , <class 'parquet_thrift.ColumnChunk'>
  file_offset: 41891470
  file_path: part.0.parquet
  meta_data: <class 'parquet_thrift.ColumnMetaData'>
    codec: 0
    data_page_offset: 41368304
    dictionary_page_offset: 41013221
    encoding_stats: [<class 'parquet_thrift.PageEncodingStats'>
  count: 1
  encoding: 0
  page_type: 0
  , <class 'parquet_thrift.PageEncodingStats'>
  count: 1
  encoding: 0
  page_type: 2
  ]
    encodings: [3, 4, 0]
    index_page_offset: None
    key_value_metadata: []
    num_values: 253545
    path_in_schema: ['category_20']
    statistics: <class 'parquet_thrift.Statistics'>
      distinct_count: None
      max: None
      min: None
      null_count: 7830
  
    total_compressed_size: 523166
    total_uncompressed_size: 523166
    type: 6
  
  , <class 'parquet_thrift.ColumnChunk'>
  file_offset: 43524529
  file_path: part.0.parquet
  meta_data: <class 'parquet_thrift.ColumnMetaData'>
    codec: 0
    data_page_offset: 42509933
    dictionary_page_offset: 41891546
    encoding_stats: [<class 'parquet_thrift.PageEncodingStats'>
  count: 1
  encoding: 0
  page_type: 0
  , <class 'parquet_thrift.PageEncodingStats'>
  count: 1
  encoding: 0
  page_type: 2
  ]
    encodings: [3, 4, 0]
    index_page_offset: None
    key_value_metadata: []
    num_values: 253545
    path_in_schema: ['category_21']
    statistics: <class 'parquet_thrift.Statistics'>
      distinct_count: None
      max: None
      min: None
      null_count: 7830
  
    total_compressed_size: 1014596
    total_uncompressed_size: 1014596
    type: 6
  
  , <class 'parquet_thrift.ColumnChunk'>
  file_offset: 44027812
  file_path: part.0.parquet
  meta_data: <class 'parquet_thrift.ColumnMetaData'>
    codec: 0
    data_page_offset: 43690172
    dictionary_page_offset: 43524605
    encoding_stats: [<class 'parquet_thrift.PageEncodingStats'>
  count: 1
  encoding: 0
  page_type: 0
  , <class 'parquet_thrift.PageEncodingStats'>
  count: 1
  encoding: 0
  page_type: 2
  ]
    encodings: [3, 4, 0]
    index_page_offset: None
    key_value_metadata: []
    num_values: 253545
    path_in_schema: ['category_22']
    statistics: <class 'parquet_thrift.Statistics'>
      distinct_count: None
      max: None
      min: None
      null_count: 100593
  
    total_compressed_size: 337640
    total_uncompressed_size: 337640
    type: 6
  
  , <class 'parquet_thrift.ColumnChunk'>
  file_offset: 44611068
  file_path: part.0.parquet
  meta_data: <class 'parquet_thrift.ColumnMetaData'>
    codec: 0
    data_page_offset: 44103935
    dictionary_page_offset: 44027889
    encoding_stats: [<class 'parquet_thrift.PageEncodingStats'>
  count: 1
  encoding: 0
  page_type: 0
  , <class 'parquet_thrift.PageEncodingStats'>
  count: 1
  encoding: 0
  page_type: 2
  ]
    encodings: [3, 4, 0]
    index_page_offset: None
    key_value_metadata: []
    num_values: 253545
    path_in_schema: ['category_23']
    statistics: <class 'parquet_thrift.Statistics'>
      distinct_count: None
      max: None
      min: None
      null_count: 0
  
    total_compressed_size: 507133
    total_uncompressed_size: 507133
    type: 6
  
  , <class 'parquet_thrift.ColumnChunk'>
  file_offset: 44865258
  file_path: part.0.parquet
  meta_data: <class 'parquet_thrift.ColumnMetaData'>
    codec: 0
    data_page_offset: 44611670
    dictionary_page_offset: 44611143
    encoding_stats: [<class 'parquet_thrift.PageEncodingStats'>
  count: 1
  encoding: 0
  page_type: 0
  , <class 'parquet_thrift.PageEncodingStats'>
  count: 1
  encoding: 0
  page_type: 2
  ]
    encodings: [3, 4, 0]
    index_page_offset: None
    key_value_metadata: []
    num_values: 253545
    path_in_schema: ['category_24']
    statistics: <class 'parquet_thrift.Statistics'>
      distinct_count: None
      max: None
      min: None
      null_count: 0
  
    total_compressed_size: 253588
    total_uncompressed_size: 253588
    type: 6
  
  , <class 'parquet_thrift.ColumnChunk'>
  file_offset: 45119328
  file_path: part.0.parquet
  meta_data: <class 'parquet_thrift.ColumnMetaData'>
    codec: 0
    data_page_offset: 44865740
    dictionary_page_offset: 44865333
    encoding_stats: [<class 'parquet_thrift.PageEncodingStats'>
  count: 1
  encoding: 0
  page_type: 0
  , <class 'parquet_thrift.PageEncodingStats'>
  count: 1
  encoding: 0
  page_type: 2
  ]
    encodings: [3, 4, 0]
    index_page_offset: None
    key_value_metadata: []
    num_values: 253545
    path_in_schema: ['category_25']
    statistics: <class 'parquet_thrift.Statistics'>
      distinct_count: None
      max: None
      min: None
      null_count: 0
  
    total_compressed_size: 253588
    total_uncompressed_size: 253588
    type: 6
  
  ]
  num_rows: 253545
  sorting_columns: None
  total_byte_size: 41139732,
  False,
  {'category_0': 1,
   'category_1': 1,
   'category_10': 1,
   'category_11': 1,
   'category_12': 1,
   'category_13': 1,
   'category_14': 1,
   'category_15': 1,
   'category_16': 1,
   'category_17': 1,
   'category_18': 1,
   'category_19': 1,
   'category_2': 1,
   'category_20': 1,
   'category_21': 1,
   'category_22': 1,
   'category_23': 1,
   'category_24': 1,
   'category_25': 1,
   'category_3': 1,
   'category_4': 1,
   'category_5': 1,
   'category_6': 1,
   'category_7': 1,
   'category_8': 1,
   'category_9': 1},
  <Parquet Schema with 41 entries>,
  {},
  {'category_0': 'category',
   'category_1': 'category',
   'category_10': 'category',
   'category_11': 'category',
   'category_12': 'category',
   'category_13': 'category',
   'category_14': 'category',
   'category_15': 'category',
   'category_16': 'category',
   'category_17': 'category',
   'category_18': 'category',
   'category_19': 'category',
   'category_2': 'category',
   'category_20': 'category',
   'category_21': 'category',
   'category_22': 'category',
   'category_23': 'category',
   'category_24': 'category',
   'category_25': 'category',
   'category_3': 'category',
   'category_4': 'category',
   'category_5': 'category',
   'category_6': 'category',
   'category_7': 'category',
   'category_8': 'category',
   'category_9': 'category',
   'click': dtype('int64'),
   'numeric_0': dtype('float64'),
   'numeric_1': dtype('float64'),
   'numeric_10': dtype('float64'),
   'numeric_11': dtype('float64'),
   'numeric_12': dtype('float64'),
   'numeric_2': dtype('float64'),
   'numeric_3': dtype('float64'),
   'numeric_4': dtype('float64'),
   'numeric_5': dtype('float64'),
   'numeric_6': dtype('float64'),
   'numeric_7': dtype('int64'),
   'numeric_8': dtype('int64'),
   'numeric_9': dtype('float64')})}

@martindurant
Member

To be sure: you get the error when you read the file via dask using this graph, which only loads one file, having only one row-group, but you don't get the error when you read 'day-0.parquet/part.0.parquet' directly?
I wonder how a memoryview operation in pandas could need a file.

Suspicions:

  • Check the number of category labels for each column: ParquetFile.categories should list the assumed counts. The int dtype assigned to store the labels is based on these counts. Even if it were wrong, though, I think you'd only get a segfault when accessing the relevant row.
  • the open files logic changed recently specifically for parquet, but the code loads data into bytes, makes numpy arrays (using memoryview) and then assigns into numpy arrays within the pandas dataframe. I don't see scope for "no such file" in the context of a memoryview within pandas.
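
To make the first suspicion concrete, here is a minimal sketch of how an underestimated label count could corrupt the integer codes. It is not fastparquet's actual code: `code_dtype_for` is a hypothetical helper, and the counts (2**15 assumed, 50000 real) are illustrative.

```python
import numpy as np

def code_dtype_for(n_labels):
    """Hypothetical helper: smallest signed int dtype holding codes 0..n_labels-1."""
    for dt in (np.int8, np.int16, np.int32, np.int64):
        if n_labels - 1 <= np.iinfo(dt).max:
            return np.dtype(dt)
    raise ValueError("too many labels")

assumed = code_dtype_for(2**15)   # dtype chosen from the assumed count
needed = code_dtype_for(50000)    # dtype the real label count would require

# Codes that are valid for 50000 labels...
codes = np.array([5, 40000, 49999], dtype=np.int64)
# ...wrap around silently when squeezed into the too-small dtype.
squeezed = codes.astype(assumed)
print(assumed, needed, squeezed)
```

Wrapped codes no longer point at valid positions in the label array, so a low-level take over them could plausibly read out of bounds, and only for rows that hold the large codes.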

@mrocklin
Member Author

To be sure: you get the error when you read the file via dask using this graph, which only loads one file, having only one row-group, but you don't get the error when you read 'day-0.parquet/part.0.parquet' directly?

Correct

Check the number of category labels for each column: ParquetFile.categories should list the assumed counts. The int dtype assigned to store the labels is based on these counts. Even if it were wrong, though, I think you'd only get a segfault when accessing the relevant row.

I'm not sure exactly what I'm looking for here, but here is some output in case it's helpful:

In [2]: pf = fastparquet.ParquetFile('day-0.parquet/')

In [3]: pf.categories
Out[3]: 
{'category_0': 1,
 'category_1': 1,
 'category_10': 1,
 'category_11': 1,
 'category_12': 1,
 'category_13': 1,
 'category_14': 1,
 'category_15': 1,
 'category_16': 1,
 'category_17': 1,
 'category_18': 1,
 'category_19': 1,
 'category_2': 1,
 'category_20': 1,
 'category_21': 1,
 'category_22': 1,
 'category_23': 1,
 'category_24': 1,
 'category_25': 1,
 'category_3': 1,
 'category_4': 1,
 'category_5': 1,
 'category_6': 1,
 'category_7': 1,
 'category_8': 1,
 'category_9': 1}

In [4]: pf = fastparquet.ParquetFile('day-0.parquet/part.0.parquet')

In [5]: pf.categories
Out[5]: {}

In [6]: pf
Out[6]: <Parquet File: {'name': 'day-0.parquet/part.0.parquet', 'columns': ['click', 'numeric_0', 'numeric_1', 'numeric_2', 'numeric_3', 'numeric_4', 'numeric_5', 'numeric_6', 'numeric_7', 'numeric_8', 'numeric_9', 'numeric_10', 'numeric_11', 'numeric_12', 'category_0', 'category_1', 'category_2', 'category_3', 'category_4', 'category_5', 'category_6', 'category_7', 'category_8', 'category_9', 'category_10', 'category_11', 'category_12', 'category_13', 'category_14', 'category_15', 'category_16', 'category_17', 'category_18', 'category_19', 'category_20', 'category_21', 'category_22', 'category_23', 'category_24', 'category_25'], 'partitions': [], 'rows': 253545}>

@martindurant
Member

Is there more than one value per categorical column?
We see that the categories metadata is not being written to the component files, so this is quite possibly the problem.

@mrocklin
Member Author

There are. I'm writing dask.dataframes with categories-per-partition but for which we don't know the full set of categories.

import dask.dataframe as dd
from dask.distributed import Client
client = Client()

columns = ['click'] + ['numeric_%d' % i for i in range(13)] + ['category_%d' %  i for i in range(26)]
dtypes = {'category_%d' % i: 'category' for i in range(26)}
df = dd.read_csv('day_0', sep='\t', names=columns, header=None, dtype=dtypes)
df.to_parquet('day-0.parquet')

This is likely to be a decently common case, since finding the full set of categories can be expensive. Is it possible to store categoricals efficiently without knowing the global set?

@mrocklin
Member Author

Also it looks like the individual files don't know that they should be categoricals. Though I suppose that this makes sense if we're depending on the metadata file:

In [1]: import fastparquet

In [2]: pf = fastparquet.ParquetFile('day-0.parquet/part.0.parquet')

In [3]: pf.to_pandas().dtypes
Out[3]: 
click            int64
numeric_0      float64
numeric_1      float64
numeric_2      float64
numeric_3      float64
numeric_4      float64
numeric_5      float64
numeric_6      float64
numeric_7        int64
numeric_8        int64
numeric_9      float64
numeric_10     float64
numeric_11     float64
numeric_12     float64
category_0      object
category_1      object
category_2      object
category_3      object
category_4      object
category_5      object
category_6      object
category_7      object
category_8      object
category_9      object
category_10     object
category_11     object
category_12     object
category_13     object
category_14     object
category_15     object
category_16     object
category_17     object
category_18     object
category_19     object
category_20     object
category_21     object
category_22     object
category_23     object
category_24     object
category_25     object
dtype: object

@martindurant
Member

This was written before dask allowed different category labels per partition...
It would be possible to store, for each row-group, the metadata that is currently written only at the global level.
"don't know that they should be categoricals" - yes, this is truly a bug.

@martindurant
Member

Note that to test if this is indeed the cause, try passing the categories= to read_parquet as before, either as a list of column names (which assumes 2**15 labels per category) or as {col: num...} if you have a conservative guess at the number of labels.

@mrocklin
Member Author

In [1]: import dask.dataframe as dd

In [2]: df = dd.read_parquet('day-0.parquet/part.0.parquet', index=False, categories=['category_%d' % i for i in range(26)])

In [3]: df.head()
Out[3]: Segmentation fault (core dumped)

In [1]: import dask.dataframe as dd

In [2]: df = dd.read_parquet('day-0.parquet', index=False, categories=['category_%d' % i for i in range(26)])

In [3]: df.head()
Out[3]: Bus error (core dumped)

@martindurant
Member

Do you mind going through the columns to see if it's one particular one?

@mrocklin
Member Author

Individually they seem to be ok?

In [1]: import dask.dataframe as dd

In [2]: df = dd.read_parquet('day-0.parquet/part.0.parquet', index=False, categories=['category_%d' % i for i in range(26)])

In [3]: for i in range(26):
   ...:     print(i)
   ...:     df['category_%d' % i].head()
   ...:     
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25

In [4]: df = dd.read_parquet('day-0.parquet', index=False, categories=['category_%d' % i for i in range(26)])

In [5]: for i in range(26):
   ...:     print(i)
   ...:     df['category_%d' % i].head()
   ...:     
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25

@martindurant
Member

The number of labels in day-1 (just downloaded) is:

category_8 30
category_9 43791
category_5 3
category_10 15434
category_23 6498
category_21 50357
category_6 5278
category_14 3933
category_22 13816
category_13 1474
category_1 8129
category_4 8722
category_15 51
category_18 14
category_12 9
category_17 528
category_16 3
category_11 19985
category_20 29481
category_7 1019
category_0 53586
category_25 33
category_24 47
category_2 9656
category_3 3549
category_19 55626
dtype: int64

Some of these are certainly more than 2**15. Does increasing the label count like the following avoid the error: categories={'category_%d' % i: 2**31 for i in range(26)} ?
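
For reference, a minimal sketch of how per-column label counts like the ones above can be turned into a {col: num} mapping. The small DataFrame is made-up stand-in data, not the criteo file, and `categories` here is just a plain dict in the same shape as the keyword argument discussed in this thread.

```python
import pandas as pd

# Stand-in data; the real columns are category_0 .. category_25.
df = pd.DataFrame({
    "click": [1, 0, 0, 1],
    "category_0": ["a", "b", "c", "a"],
    "category_1": ["x", None, "y", "x"],
})

cat_cols = [c for c in df.columns if c.startswith("category_")]
label_counts = df[cat_cols].nunique()          # distinct labels, NaN excluded
categories = {col: int(n) for col, n in label_counts.items()}
print(categories)
```

Counts taken from one day's data are only a guess for another day, so a generous upper bound is the safer choice.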

So we can fix this to make sure the category prescription is in every row-group (this will be JSON text, so easy to decode). However, having different prescriptions per row-group would and should make the data-set unreadable by fastparquet as categorical (would require recoding each categories set to a union of cat labels).
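
The recoding to a union of labels mentioned above can be sketched in pandas as follows. This only illustrates the idea on two in-memory partitions; it says nothing about how fastparquet would implement it.

```python
import pandas as pd
from pandas.api.types import union_categoricals

# Two partitions whose categoricals were built independently.
part0 = pd.Categorical(["a", "b", "a"])
part1 = pd.Categorical(["b", "c"])

# Union of labels, with all codes rewritten against the combined label set.
combined = union_categoricals([part0, part1])
all_labels = list(combined.categories)

# Recode each partition so both share the same label set and code meaning.
part0_recoded = part0.set_categories(all_labels)
part1_recoded = part1.set_categories(all_labels)
print(all_labels, part0_recoded.codes, part1_recoded.codes)
```

The cost is that every partition has to be touched once the full label set is known, which is the expense raised earlier in the thread.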

@mrocklin
Member Author

Huzzah

In [1]: import dask.dataframe as dd

In [2]: df = dd.read_parquet('day-0.parquet', index=False, categories={'category_%d' % i: 2**31 for i in range(26)})

In [3]: df.head()
Out[3]: 
   click  numeric_0  numeric_1  numeric_2  numeric_3  numeric_4  numeric_5  \
0      1        5.0      110.0        NaN       16.0        NaN        1.0   
1      0       32.0        3.0        5.0        NaN        1.0        0.0   
2      0        NaN      233.0        1.0      146.0        1.0        0.0   
3      0        NaN       24.0        NaN       11.0       24.0        NaN   
4      0       60.0      223.0        6.0       15.0        5.0        0.0   

   numeric_6  numeric_7  numeric_8     ...      category_16  category_17  \
0        0.0         14          7     ...         d20856aa     b8170bba   
1        0.0         61          5     ...         d20856aa     a1eb1511   
2        0.0         99          7     ...         d20856aa     628f1b8d   
3        0.0         56          3     ...         1f7fc70b     a1eb1511   
4        0.0          1          8     ...         d20856aa     d9f758ff   

   category_18  category_19 category_20 category_21 category_22 category_23  \
0     9512c20b     c38e2f28    14f65a5d    25b1b089    d7c1fc0b    7caf609c   
1     9512c20b     febfd863    a3323ca1    c8e1ee56    1752e9e8    75350c8a   
2     9512c20b     c38e2f28    14f65a5d    25b1b089    d7c1fc0b    34a9b905   
3     9512c20b          NaN         NaN         NaN    dc209cd3    b8a81fb0   
4     9512c20b     c709ec07    2b07677e    a89a92a5    aa137169    e619743b   

  category_24 category_25  
0    30436bfc    ed10571d  
1    991321ea    b757e957  
2    ff654802    ed10571d  
3    30436bfc    b757e957  
4    cdc3217e    ed10571d  

[5 rows x 40 columns]

@jcrist
Member

jcrist commented May 9, 2017

What is the status of this?

@martindurant
Member

Including the categories= keyword is the current workaround.
There will be no definitive solution until the pandas and arrow people have agreed with me on a prescription for metadata to put into the parquet header.

@wesm
Contributor

wesm commented May 9, 2017

I commented on pandas-dev/pandas#16010 (comment). No one has taken ownership of writing a specification, so I can do it, this week with any luck. I think as soon as we have a spec the implementation is simple.

@wesm
Contributor

wesm commented May 10, 2017

I wrote a spec here pandas-dev/pandas#16315. As soon as we finalize this, we can hustle to implement and ship this in the coming weeks. It would also be good to remove the index inference logic in Dask since the metadata makes it unnecessary.

@martindurant
Member

The index inference is still useful for parquet files from other vendors which happen to have statistics - it's really useful for optimization in many cases.

@wesm
Contributor

wesm commented May 10, 2017

OK, but how about changing to only use column statistics for this inference, and opting in to it rather than it being the default (index='infer')? It's not an issue of providing access to the column statistics but more that it seems a bit magical as a default behavior
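
As an illustration of what statistics-only inference could look like, here is a rough sketch. It is not dask's implementation: `divisions_from_stats` is a hypothetical function, and the input is a hand-written list of per-row-group (min, max) pairs for one column.

```python
def divisions_from_stats(stats):
    """Return sorted division boundaries, or None if the stats do not support them.

    stats: list of (min, max) per row-group in file order; an entry, or either
    value in it, may be None when the writer recorded no statistics.
    """
    if not stats:
        return None
    if any(s is None or s[0] is None or s[1] is None for s in stats):
        return None  # missing statistics: nothing can be inferred
    if any(lo > hi for lo, hi in stats):
        return None  # inconsistent statistics
    for (_, prev_hi), (next_lo, _) in zip(stats, stats[1:]):
        if prev_hi > next_lo:
            return None  # row-groups overlap, so the column is not globally sorted
    return [lo for lo, _ in stats] + [stats[-1][1]]

print(divisions_from_stats([(0, 9), (10, 19), (20, 29)]))
print(divisions_from_stats([(0, 15), (10, 19)]))
print(divisions_from_stats([(None, None)]))
```

A column whose statistics are all None, like the ones in the metadata dump earlier in this issue, gives no basis for inference, which is one argument for making it opt-in.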

@martindurant
Member

Agreed, if the metadata says what the index is intended to be (because it came from pandas), we should use that.

@jcrist
Member

jcrist commented Nov 13, 2017

Can this be closed? There have been many changes since this, including standardized pandas metadata and changes to categorical and index support.

@jcrist
Member

jcrist commented Jan 30, 2018

Closing as stale.

@jcrist closed this as completed Jan 30, 2018