
ENH: Specification document for embedding pandas-specific metadata in binary file formats #16010

Closed · wesm opened this issue Apr 15, 2017 · 12 comments · Fixed by #16315

wesm (Member) commented Apr 15, 2017

From the discussion in dask/dask#2222 and elsewhere, there is a need to standardize how custom metadata related to serializing pandas DataFrame objects is stored in binary file formats like Apache Parquet, Arrow, and Feather.

Because there are multiple implementations of these file formats, specifying an "official" pandas metadata representation (e.g. clearly marking which columns, if any, represent the index) will help avoid implementation divergence and incompatibilities.

Things that we should specify:

  • How to determine the Index when reading the object
  • The original data types, where there is possible ambiguity (e.g. multiple types stored in the same physical storage type)

I don't think this needs to be more complicated than a Markdown document in the pandas repo, so we can maintain a single point of truth and reach consensus as a community.
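To make the idea concrete, here is one hypothetical shape such a metadata payload could take (the field names are illustrative, not the eventual spec). Parquet, Arrow, and Feather all allow arbitrary key/value file metadata, so a JSON string travels easily:

```python
import json

# Hypothetical payload; the actual field names and layout would be
# defined by the specification document proposed above.
metadata = {
    "index_columns": ["idx"],  # which stored columns reconstitute the index
    "columns": [
        {"name": "a", "pandas_type": "float64", "numpy_type": "float64"},
        {"name": "b", "pandas_type": "datetimetz",
         "numpy_type": "datetime64[ns]", "metadata": {"timezone": "UTC"}},
    ],
    "pandas_version": "0.20.0",
}

encoded = json.dumps(metadata)  # stored under a well-known metadata key
```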

mrocklin (Contributor) commented:
We might also want to include column order for storage formats that don't maintain column order. There are also complexities around multi-index and hierarchical columns (though I'm not sure how far we want to go here).

mrocklin (Contributor) commented:
I can imagine a function living in Pandas that emits a JSON-encodable data structure that contains all relevant metadata for a Pandas dataframe. I'm not sure if this is in line with what you had in mind though.
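A minimal sketch of such a function (the name and exact fields are hypothetical), recording just the index names, column order, and dtype strings:

```python
import pandas as pd

def dataframe_metadata(df: pd.DataFrame) -> dict:
    """Hypothetical helper: a JSON-encodable description of a frame's layout."""
    return {
        "index": list(df.index.names),
        "columns": [str(c) for c in df.columns],
        "dtypes": {str(c): str(dt) for c, dt in df.dtypes.items()},
    }
```

A reader would invert this mapping when reconstructing the frame: reorder the columns, cast to the recorded dtypes, and set the index.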

jreback (Contributor) commented Apr 15, 2017

mrocklin (Contributor) commented Apr 15, 2017

```python
>>> df.metadata()
{'index': 'b',
 'columns': ['a', 'b', 'c'],
 'dtypes': {'a': 'f8', 'b': 'M8[ns]+10:00', 'c': 'category'}}
```

jreback (Contributor) commented Apr 15, 2017

we could add an option to have a more detailed version if need be (e.g. more exact types like int8)
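For instance, a detailed variant could report the exact NumPy dtype strings needed for a lossless round-trip, rather than coarse categories (illustrative only):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.array([1, 2, 3], dtype="int8"),
                   "b": [1.0, 2.0, 3.0]})

# Coarse table-schema types would call these just "integer" and "number";
# the detailed variant keeps the exact storage types.
detailed = {str(c): str(dt) for c, dt in df.dtypes.items()}
# {'a': 'int8', 'b': 'float64'}
```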

jreback (Contributor) commented Apr 15, 2017

```python
In [7]: from pandas.io.json import build_table_schema

In [5]: df = pd.DataFrame({'A': [1, 2, 3], 'B': 'foo',
   ...:                    'C': pd.date_range('20130101', periods=3),
   ...:                    'D': pd.Series(list('abc')).astype('category')})

In [6]: build_table_schema(df)
Out[6]:
{'fields': [{'name': 'index', 'type': 'integer'},
  {'name': 'A', 'type': 'integer'},
  {'name': 'B', 'type': 'string'},
  {'name': 'C', 'type': 'datetime'},
  {'constraints': {'enum': ['a', 'b', 'c']},
   'name': 'D',
   'ordered': False,
   'type': 'any'}],
 'pandas_version': '0.20.0',
 'primaryKey': ['index']}
```

mrocklin (Contributor) commented Apr 15, 2017 via email

TomAugspurger (Contributor) commented Apr 15, 2017

build_table_schema includes a tz field if the column is timezone-aware.

There will be points where Table Schema's representation is too limiting, or not quite right. For example, build_table_schema uses the primaryKey field to indicate the index, but, per the spec, only when the index is unique (so we shouldn't be afraid to define our own representation).
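For example (output abridged; the pandas_version string varies by release):

```python
import pandas as pd
from pandas.io.json import build_table_schema

df = pd.DataFrame({"ts": pd.date_range("2017-01-01", periods=3, tz="US/Eastern")})
build_table_schema(df)
# {'fields': [{'name': 'index', 'type': 'integer'},
#             {'name': 'ts', 'type': 'datetime', 'tz': 'US/Eastern'}],
#  'primaryKey': ['index'],
#  'pandas_version': '0.20.0'}
```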

martindurant (Contributor) commented:
I think it would be very useful to store not only that a column is categorical, but also the dtype of the codes array, so that we can efficiently assign on load. When parquet is written from dask, the partitions can have different dtypes per row-group for a given categorical column. Where should this be stored, or should we instead just keep the largest dtype that stores everything?
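For reference, the codes dtype is already visible on the pandas side, and reconciling differing per-row-group dtypes could be as simple as promoting to the widest one (a sketch, not fastparquet's actual logic):

```python
import numpy as np
import pandas as pd

s = pd.Series(list("abcab")).astype("category")
s.cat.codes.dtype  # int8 -- pandas picks the smallest dtype that fits

# Hypothetical reconciliation across row-groups: promote to the widest
# codes dtype seen, so every partition's codes load without overflow.
seen = [np.dtype("int8"), np.dtype("int16")]
target = np.result_type(*seen)  # int16
```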

martindurant (Contributor) commented:
I would probably be better off proposing a PR for discussion, but here are a few thoughts first. From the comments here, there are various ideas of what a pandas schema is and why we need it. Let me make the case from the point of view of loading a dataframe from parquet files, which is only one binary format of interest.

The parquet format has its own idea of types, both at the primitive storage level and as richer derived logical types. For example, in parquet (and, I suspect, most binary formats) one can specify that a column is to be interpreted as uint16, say, and the implementation takes care of how to actually store it. For this reason, metadata that merely says a column is 'integer', 'float' or 'string' is not particularly useful. Knowing the specific type matters when pre-allocating memory to fill from the binary data, rather than doing concat operations after reading.
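A sketch of the pre-allocation point, with hypothetical row-group data: knowing the exact dtype up front lets a reader allocate the output once and fill it in place.

```python
import numpy as np

num_rows = 1_000  # total row count, assumed known from file metadata
out = np.empty(num_rows, dtype="uint16")  # exact dtype known up front

# Fill per row-group instead of concatenating intermediate arrays.
row_groups = [np.arange(500, dtype="uint16"), np.arange(500, dtype="uint16")]
offset = 0
for chunk in row_groups:
    out[offset:offset + len(chunk)] = chunk
    offset += len(chunk)
```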

Pandas-specific things that are not normally captured in binary formats are: the columns to use for the index(es), categorical encoding and perhaps timezones.

  • the index information could just be a list of column names; the type rules for indexes are not quite the same as for columns, though, and I'm not sure whether any further information might be required. Parquet has a sorting-column spec, but that is not the same thing. If there is no index column, default to a RangeIndex, as usual.
  • storing the category labels in the metadata seems to me like a very bad idea, as they could be very large and expensive. Currently fastparquet stores which columns are categorical and the number of labels, as {'col1': 5, 'col5': 1093}, where the codes array assigned for col1 would be int8 and for col5 int16 (it could have stored the actual types; see the sketch after this list). Labeling a column as categorical is not a statement of how the data is stored, but of how the data was before it was written. There is added complexity here because the labels present in each partition are not necessarily the same, especially after appends.
  • timezones (and times in general) may not be storable in some formats: parquet assumes UTC, although some implementations already store timezones in the metadata. The stored resolution may also be inadequate for pandas' nanosecond precision.
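To make the codes-dtype bullet concrete, a small hypothetical helper mapping a label count to the minimal codes dtype, matching the {'col1': 5, 'col5': 1093} example above:

```python
import numpy as np

def codes_dtype_for(n_labels: int) -> np.dtype:
    """Smallest signed integer dtype able to index n_labels categories."""
    for dt in (np.int8, np.int16, np.int32, np.int64):
        if n_labels <= np.iinfo(dt).max:
            return np.dtype(dt)
    raise ValueError("too many categories")

codes_dtype_for(5)     # dtype('int8')
codes_dtype_for(1093)  # dtype('int16')
```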

wesm (Member, Author) commented May 9, 2017

If no one beats me to it, I can take a crack at a PR for this (into the pandas documentation) this week and we can discuss there?

wesm (Member, Author) commented May 10, 2017

Let's review in #16315

jreback added this to the 0.21.0 milestone May 16, 2017