
ENH: Specification document for embedding pandas-specific metadata in binary file formats #16010

Closed · wesm opened this issue Apr 15, 2017 · 12 comments · Fixed by #16315

wesm (Member) commented Apr 15, 2017

From the discussion in dask/dask#2222 and elsewhere, there is a need to standardize how custom metadata related to serializing pandas DataFrame objects is stored in binary file formats like Apache Parquet, Arrow, and Feather.

Because there are multiple implementations of these file formats, specifying an "official" pandas metadata representation (e.g. clearly marking which columns, if any, represent the index) will help avoid implementation divergence and incompatibilities.

Things that we should specify:

  • How to determine the Index when reading the object
  • The original data types, where there is possible ambiguity (e.g. multiple types stored in the same physical storage type)

I don't think this needs to be more complicated than a Markdown document in the pandas repo, so we can maintain a single point of truth and reach consensus as a community.
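To make the idea concrete, here is one hypothetical shape such a metadata payload could take (the field names are illustrative, not the eventual spec). Parquet, Arrow, and Feather all allow arbitrary key/value file metadata, so a JSON string travels easily:

```python
import json

# Hypothetical payload; the actual field names and layout would be
# defined by the specification document proposed above.
metadata = {
    "index_columns": ["idx"],  # which stored columns reconstitute the index
    "columns": [
        {"name": "a", "pandas_type": "float64", "numpy_type": "float64"},
        {"name": "b", "pandas_type": "datetimetz",
         "numpy_type": "datetime64[ns]", "metadata": {"timezone": "UTC"}},
    ],
    "pandas_version": "0.20.0",
}

encoded = json.dumps(metadata)  # stored under a well-known metadata key
```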

mrocklin (Contributor) commented:
We might also want to include column order for storage formats that don't maintain column order. There are also complexities around multi-index and hierarchical columns (though I'm not sure how far we want to go here).

mrocklin (Contributor) commented:
I can imagine a function living in Pandas that emits a JSON-encodable data structure that contains all relevant metadata for a Pandas dataframe. I'm not sure if this is in line with what you had in mind though.
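A minimal sketch of such a function (the name and exact fields are hypothetical), recording just the index names, column order, and dtype strings:

```python
import pandas as pd

def dataframe_metadata(df: pd.DataFrame) -> dict:
    """Hypothetical helper: a JSON-encodable description of a frame's layout."""
    return {
        "index": list(df.index.names),
        "columns": [str(c) for c in df.columns],
        "dtypes": {str(c): str(dt) for c, dt in df.dtypes.items()},
    }
```

A reader would invert this mapping when reconstructing the frame: reorder the columns, cast to the recorded dtypes, and set the index.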

jreback (Contributor) commented Apr 15, 2017

mrocklin (Contributor) commented Apr 15, 2017

```python
>>> df.metadata()
{'index': 'b',
 'columns': ['a', 'b', 'c'],
 'dtypes': {'a': 'f8', 'b': 'M8[ns]+10:00', 'c': 'category'}}
```

jreback (Contributor) commented Apr 15, 2017

we could add an option to have a more detailed version if need be (e.g. more exact types like int8)
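For instance, a detailed variant could report the exact NumPy dtype strings needed for a lossless round-trip, rather than coarse categories (illustrative only):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.array([1, 2, 3], dtype="int8"),
                   "b": [1.0, 2.0, 3.0]})

# Coarse table-schema types would call these just "integer" and "number";
# the detailed variant keeps the exact storage types.
detailed = {str(c): str(dt) for c, dt in df.dtypes.items()}
# {'a': 'int8', 'b': 'float64'}
```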

jreback (Contributor) commented Apr 15, 2017

```python
In [7]: from pandas.io.json import build_table_schema

In [5]: df = pd.DataFrame({'A': [1, 2, 3], 'B': 'foo',
   ...:                    'C': pd.date_range('20130101', periods=3),
   ...:                    'D': pd.Series(list('abc')).astype('category')})

In [6]: build_table_schema(df)
Out[6]:
{'fields': [{'name': 'index', 'type': 'integer'},
  {'name': 'A', 'type': 'integer'},
  {'name': 'B', 'type': 'string'},
  {'name': 'C', 'type': 'datetime'},
  {'constraints': {'enum': ['a', 'b', 'c']},
   'name': 'D',
   'ordered': False,
   'type': 'any'}],
 'pandas_version': '0.20.0',
 'primaryKey': ['index']}
```

mrocklin (Contributor) commented Apr 15, 2017 via email

TomAugspurger (Contributor) commented Apr 15, 2017

build_table_schema includes a tz field if the column is timezone-aware.

There will be points where Table Schema's representation is too limiting, or not quite right. For example, build_table_schema uses the primaryKey field to indicate the index, but, per the spec, only when the index is unique (so we shouldn't be afraid to define our own representation).
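For example (output abridged; the pandas_version string varies by release):

```python
import pandas as pd
from pandas.io.json import build_table_schema

df = pd.DataFrame({"ts": pd.date_range("2017-01-01", periods=3, tz="US/Eastern")})
build_table_schema(df)
# {'fields': [{'name': 'index', 'type': 'integer'},
#             {'name': 'ts', 'type': 'datetime', 'tz': 'US/Eastern'}],
#  'primaryKey': ['index'],
#  'pandas_version': '0.20.0'}
```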

martindurant (Contributor) commented:
I think it would be very useful to store not only that a column is categorical, but also the dtype of the codes array, so that we can efficiently assign on load. When parquet is written from dask, the partitions can have different dtypes per row-group for a given categorical column. Where should this be stored, or should we instead just keep the largest dtype that stores everything?
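For reference, the codes dtype is already visible on the pandas side, and reconciling differing per-row-group dtypes could be as simple as promoting to the widest one (a sketch, not fastparquet's actual logic):

```python
import numpy as np
import pandas as pd

s = pd.Series(list("abcab")).astype("category")
s.cat.codes.dtype  # int8 -- pandas picks the smallest dtype that fits

# Hypothetical reconciliation across row-groups: promote to the widest
# codes dtype seen, so every partition's codes load without overflow.
seen = [np.dtype("int8"), np.dtype("int16")]
target = np.result_type(*seen)  # int16
```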

martindurant (Contributor) commented:
I would probably be better off proposing a PR for discussion, but here are a few thoughts first. From the comments here, there are various ideas of what a pandas schema is and why we need it. Let me make the case from the point of view of loading a dataframe from parquet files, which is only one binary format of interest.

The parquet format has its own idea of types, both at the primitive storage level and as richer derived logical types. For example, in parquet (and, I suspect, most binary formats) one can specify that a column is to be interpreted as uint16, say, and the implementation takes care of how to actually store it. For this reason, metadata that merely says a column is 'integer', 'float' or 'string' is not particularly useful. Knowing the specific type matters when pre-allocating memory to fill from the binary data, rather than doing concat operations after reading.
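A sketch of the pre-allocation point, with hypothetical row-group data: knowing the exact dtype up front lets a reader allocate the output once and fill it in place.

```python
import numpy as np

num_rows = 1_000  # total row count, assumed known from file metadata
out = np.empty(num_rows, dtype="uint16")  # exact dtype known up front

# Fill per row-group instead of concatenating intermediate arrays.
row_groups = [np.arange(500, dtype="uint16"), np.arange(500, dtype="uint16")]
offset = 0
for chunk in row_groups:
    out[offset:offset + len(chunk)] = chunk
    offset += len(chunk)
```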

Pandas-specific things that are not normally captured in binary formats are: the columns to use for the index(es), categorical encoding and perhaps timezones.

  • the index information could just be a list of column names; the type rules for indexes are not quite the same as for columns, though, and I'm not sure whether any further information might be required. Parquet has a sorting-column spec, but that is not the same thing. If there is no index column, default to a RangeIndex, as usual.
  • storing the category labels in the metadata seems to me like a very bad idea, as they could be very large and expensive. Currently fastparquet stores which columns are categorical and the number of labels, as {'col1': 5, 'col5': 1093}, where the codes array assigned for col1 would be int8 and for col5 int16 (it could have stored the actual types; see the sketch after this list). Labeling a column as categorical is not a statement of how the data is stored, but of how the data was before it was written. There is added complexity here because the labels present in each partition are not necessarily the same, especially after appends.
  • timezones (and times in general) may not be storable in some formats: parquet assumes UTC, although some implementations already store timezones in the metadata. The stored resolution may also be inadequate for pandas' nanosecond precision.
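To make the codes-dtype bullet concrete, a small hypothetical helper mapping a label count to the minimal codes dtype, matching the {'col1': 5, 'col5': 1093} example above:

```python
import numpy as np

def codes_dtype_for(n_labels: int) -> np.dtype:
    """Smallest signed integer dtype able to index n_labels categories."""
    for dt in (np.int8, np.int16, np.int32, np.int64):
        if n_labels <= np.iinfo(dt).max:
            return np.dtype(dt)
    raise ValueError("too many categories")

codes_dtype_for(5)     # dtype('int8')
codes_dtype_for(1093)  # dtype('int16')
```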

wesm (Member, Author) commented May 9, 2017

If no one beats me to it, I can take a crack at a PR for this (into the pandas documentation) this week and we can discuss there?

wesm (Member, Author) commented May 10, 2017

Let's review in #16315

jreback added this to the 0.21.0 milestone May 16, 2017