ENH: Specification document for embedding pandas-specific metadata in binary file formats #16010
Comments
We might also want to include column order for storage formats that don't maintain column order. There are also complexities around multi-index and hierarchical columns (though I'm not sure how far we want to go here). |
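For illustration, a minimal sketch (plain pandas, nothing format-specific) of what recording column order and detecting hierarchical columns could look like:
import pandas as pd

df = pd.DataFrame({'a': [1.0], 'b': [2], 'c': [3]})[['c', 'a', 'b']]

# Hypothetical metadata fragment: store the column order explicitly so that
# formats which do not preserve ordering can restore it on load.
columns_meta = {
    'columns': [str(c) for c in df.columns],
    'column_index_names': list(df.columns.names),           # [None] for a flat index
    'hierarchical': isinstance(df.columns, pd.MultiIndex),  # tuple labels would need flattening
}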
I can imagine a function living in pandas that emits a JSON-encodable data structure containing all the relevant metadata for a pandas DataFrame. I'm not sure if this is in line with what you had in mind, though. |
http://pandas-docs.github.io/pandas-docs-travis/io.html#io-table-schema is going to be in 0.20.0, FYI.
>>> df.metadata()
{'index': 'b',
'columns': ['a', 'b', 'c'],
'dtypes': {'a': 'f8', 'b': 'M8[ns]+10:00', 'c': 'category'}} |
We could add an option for a more detailed version if need be (e.g. more exact types like int8).
|
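As a rough sketch of what such a helper might emit (hypothetical, not an existing pandas function; the dtype strings come straight from pandas/numpy):
import json
import pandas as pd

def dataframe_metadata(df):
    # Hypothetical helper: emit a JSON-encodable dict describing the frame.
    # Exact dtype strings (e.g. 'int8', 'datetime64[ns, UTC]') are kept so a
    # reader can reconstruct the original types rather than a generic 'integer'.
    return {
        'index': df.index.name,
        'columns': [str(c) for c in df.columns],
        'dtypes': {str(c): str(df[c].dtype) for c in df.columns},
    }

df = pd.DataFrame({'a': pd.Series([1, 2], dtype='int8'),
                   'b': pd.Series(list('xy')).astype('category')})
print(json.dumps(dataframe_metadata(df)))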
For some formats we would want to add in the timezone.
On Sat, Apr 15, 2017, Jeff Reback wrote:
In [7]: from pandas.io.json import build_table_schema
In [5]: df = pd.DataFrame({'A': [1, 2, 3], 'B': 'foo', 'C': pd.date_range('20130101', periods=3), 'D': pd.Series(list('abc')).astype('category')})
In [6]: build_table_schema(df)
Out[6]:
{'fields': [{'name': 'index', 'type': 'integer'},
  {'name': 'A', 'type': 'integer'},
  {'name': 'B', 'type': 'string'},
  {'name': 'C', 'type': 'datetime'},
  {'constraints': {'enum': ['a', 'b', 'c']},
   'name': 'D',
   'ordered': False,
   'type': 'any'}],
 'pandas_version': '0.20.0',
 'primaryKey': ['index']}
|
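On the timezone point: a tz-aware column's dtype string already carries the zone, so a metadata spec could simply record that string (standard pandas, shown for illustration):
import pandas as pd

df = pd.DataFrame({'ts': pd.date_range('2013-01-01', periods=3, tz='Australia/Sydney')})

str(df['ts'].dtype)   # 'datetime64[ns, Australia/Sydney]'
str(df['ts'].dt.tz)   # 'Australia/Sydney'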
There will be points where table-schema's representation is too limiting, or not quite right. For example, |
I think it would be very useful to store not only that a column is categorical, but also the dtype of the codes array, so that we can assign efficiently on load. When parquet is written from dask, the partitions can have different dtypes per row-group for a given categorical column. Where should this be stored, or should we instead just keep the largest dtype that stores everything? |
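For reference, the codes dtype and ordering are already exposed on the categorical accessor, so a per-column entry could record them (a sketch; where exactly the spec should put this is the open question above):
import pandas as pd

s = pd.Series(list('abca')).astype('category')

categorical_meta = {
    'categories': list(s.cat.categories),    # ['a', 'b', 'c']
    'ordered': bool(s.cat.ordered),
    'codes_dtype': str(s.cat.codes.dtype),   # e.g. 'int8' for a small number of categories
}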
I would probably be better off proposing a PR for discussion, but here are a few thoughts first. From the comments here, there are various ideas of what a pandas schema is and why we need it. Let me give the case from the point of view of loading a dataframe from parquet files, which is only one possible binary format of interest. The parquet format has its own idea of types, both at the primitive storage level and as derived, richer final types. For example, in parquet (and I suspect most binary formats) one can specify that a column is to be interpreted as uint16, say, and the implementation takes care of how to actually store that. For this reason, saving metadata that says a column is 'integer', 'float' or 'string' is not particularly useful. Knowing the specific type is important when pre-allocating memory to fill from the binary data, rather than doing concat operations after reading. Pandas-specific things that are not normally captured in binary formats are: the columns to use for the index(es), categorical encoding, and perhaps timezones.
|
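To illustrate the pre-allocation point: if the metadata records exact numpy dtype strings rather than generic type names, a reader can allocate destination arrays up front (a sketch under that assumption; n_rows would come from the file metadata):
import numpy as np
import pandas as pd

df = pd.DataFrame({'u': np.array([1, 2], dtype='uint16'), 'x': [1.5, 2.5]})

# Exact per-column dtype strings, e.g. {'u': 'uint16', 'x': 'float64'}.
dtype_meta = df.dtypes.astype(str).to_dict()

# A reader that knows the final row count can pre-allocate and fill from
# row groups, instead of concatenating chunks after reading.
n_rows = 1000
buffers = {col: np.empty(n_rows, dtype=np.dtype(d)) for col, d in dtype_meta.items()}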
If no one beats me to it, I can take a crack at a PR for this (into the pandas documentation) this week and we can discuss it there? |
Let's review in #16315 |
From the discussion in dask/dask#2222 and elsewhere, there has been a need to standardize how custom metadata related to serializing pandas DataFrame objects is stored in binary file formats like Apache Parquet, Arrow, and Feather.
Because there are multiple implementations of various file formats, specifying the "official" pandas metadata representation (e.g. if one or more columns represent the index, then these are clearly marked) will help avoid implementation divergence and incompatibilities.
Things that we should specify:
I don't think this needs to be more complicated than a Markdown document in the pandas repo, so we can maintain a single point of truth and reach consensus as a community.
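For concreteness, one place such a blob can live today is the key-value metadata on an Arrow schema, which Parquet preserves in its file footer. A sketch assuming pyarrow is installed; the key name and payload shape below are illustrative only, not the proposed spec (recent pyarrow versions already write their own b'pandas' key via Table.from_pandas):
import json
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({'a': [1, 2, 3]}, index=pd.Index([10, 20, 30], name='idx'))

table = pa.Table.from_pandas(df)
payload = {'index_columns': ['idx'], 'dtypes': df.dtypes.astype(str).to_dict()}

# Attach the JSON blob as schema metadata, merged with whatever is already there.
meta = dict(table.schema.metadata or {})
meta[b'pandas_spec_draft'] = json.dumps(payload).encode()   # hypothetical key name
pq.write_table(table.replace_schema_metadata(meta), 'example.parquet')

# Readers get the blob back verbatim from the file footer.
print(pq.read_table('example.parquet').schema.metadata[b'pandas_spec_draft'])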