ENH: Draft metadata specification doc for Apache Parquet #16315
Conversation
This does not provide for non-string column names. I'm open to ideas about how to deal with that
Thank you, @wesm , this looks perfect from a quick scan.
For boolean with null, I expect we would have
We'll have to run type inference on the object columns at some point anyhow to know what Parquet type to write them to. For JSON, we could use
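The inference step mentioned above can be sketched with pandas' public `infer_dtype` helper (a minimal illustration only; the Series is a made-up example of a boolean column containing a null, stored as an object column):

```python
import pandas as pd
from pandas.api.types import infer_dtype

# A boolean column with a null is forced to object dtype in memory,
# so its logical type has to be recovered by inference.
s = pd.Series([True, None, False], dtype=object)

# skipna=True ignores the None when inferring the logical type
print(infer_dtype(s, skipna=True))  # 'boolean'
```

A writer could use the inferred logical type to pick the Parquet physical type (e.g. BOOLEAN with a null bitmap) rather than falling back to a generic byte encoding.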
Codecov Report

    @@            Coverage Diff            @@
    ##           master   #16315     +/-   ##
    =========================================
    - Coverage   90.39%   90.37%    -0.02%
    =========================================
      Files         161      161
      Lines       50863    50863
    =========================================
    - Hits        45978    45968      -10
    - Misses       4885     4895      +10

    Continue to review full report at Codecov.
Codecov Report (updated)

    @@            Coverage Diff            @@
    ##           master   #16315     +/-   ##
    =========================================
    - Coverage   90.38%   90.37%    -0.02%
    =========================================
      Files         161      161
      Lines       50916    50949      +33
    =========================================
    + Hits        46021    46043      +22
    - Misses       4895     4906      +11

    Continue to review full report at Codecov.
doc/source/metadata.rst (outdated)

    This document provides specifications for metadata to assist with reading and
    writing pandas objects to different third-party file formats.

    Apache Parquet
can you link to Apache docs
done
doc/source/metadata.rst (outdated)

    .. code-block:: text

       {'index_columns': ['__index_level_0__', '__index_level_1__', ...],
        'columns': [<c0>, <c1>, ...],
if we have tuples for columns (e.g. a MI in columns), this will not work (not saying that this should be supported, just noting it).
@jreback I don't think we are supporting arbitrary objects yet. This metadata spec doesn't preclude us adding support for that later.
The Parquet format doesn't support this anyway
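As a rough sketch of how a writer might assemble the structure quoted above (`make_pandas_metadata` is a hypothetical helper, not a pandas API, and the type fields are deliberately simplified: numpy dtype names stand in for the spec's logical types):

```python
import json

import pandas as pd


def make_pandas_metadata(df):
    # One sentinel name per index level, as in the spec's example.
    index_columns = ['__index_level_{}__'.format(i)
                     for i in range(df.index.nlevels)]
    # One entry per data column; a real writer would map dtypes to the
    # spec's logical types and fill type-specific metadata.
    columns = [{'name': str(name),
                'numpy_type': str(dtype),
                'metadata': None}
               for name, dtype in df.dtypes.items()]
    return {'index_columns': index_columns, 'columns': columns}


df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})
meta = make_pandas_metadata(df)
print(json.dumps(meta, indent=1))
```

The whole dict would then be JSON-serialized into the Parquet file-level key/value metadata.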
doc/source/metadata.rst (outdated)

    The ``type_metadata`` is ``None`` except for:

    * ``datetimetz``: ``{'timezone': zone}``, e.g. ``{'timezone': 'America/New_York'}``
    * ``categorical``: ``{'num_categories': K}``
are the categories listed? an indication of ordered=True/False would be nice as well.
We wouldn't want to list the categories, as it could blow up the size of this metadata.
added ordered
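The resulting per-column `type_metadata` for a categorical could then look like this (a small sketch; the dict shape follows the draft spec as discussed above, with the `ordered` flag added per the review):

```python
import pandas as pd

s = pd.Series(pd.Categorical(['a', 'b', 'a'], ordered=True))

# Record only the category count and ordered flag, not the category
# values themselves, to keep the metadata small.
type_metadata = {'num_categories': len(s.cat.categories),
                 'ordered': s.cat.ordered}
print(type_metadata)  # {'num_categories': 2, 'ordered': True}
```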
Any more comments?
Looks good from my (Parquet) perspective. An interesting question would be how to deal with object columns.
doc/source/metadata.rst (outdated)

    {'name': 'c2',
     'type': 'categorical',
     'numpy_type': 'int16',
     'metadata': {'num_categories': 1000}},
ordered is missing here in the example (or is it not a required field?)
fixed
A small comment on putting this in the docs: this will now create a top-level entry in the toc of our docs called "Storing pandas Objects in Various File Formats". I think most users will think of something else when they see that section in the toc.
I'm about +0 on @jorisvandenbossche's comment. I can see the point of not having a top-level entry, but this is almost a new concept, downstream library documentation, and not really internals. So I would be +1 on having a 'Downstream Documentation' section (maybe with a better name)?
doc/source/metadata.rst (outdated)

    'numpy_dtype': numpy_type,
    'metadata': type_metadata}

    ``pandas_type`` is the logical type of the column, and is one of:
It looks like there are some naming inconsistencies. Should pandas_type be type, or the other way around?
maybe these are better named: logical_type and storage_type
storage type is ambiguous too: here we mean in memory, not the (final) storage in whichever binary backend.
doc/source/metadata.rst (outdated)

    'metadata': None},
    {'name': 'c1',
     'type': 'bytes',
     'numpy_type': 'object',
should this be numpy_dtype or numpy_type?
left it as numpy_type
Ok. I'll need to update the arrow implementation then once the dust settles here.
doc/source/metadata.rst (outdated)

    * Floats: ``'float16', 'float32', 'float64'``
    * Datetime: ``'datetime', 'datetimetz'``
    * String: ``'unicode', 'bytes'``
    * Categorical: ``'categorical'``
What about using the result of pd.lib.infer_type(series) here? Then we have:

* Floats: `floating`
* Integers: `integer`
* Datetime: `datetime64`
* String: no change
* Categorical: no change

Actually, nevermind. The existing logical types are fine the way you have them.
doc/source/metadata.rst (outdated)

    * Integers: ``'int8', 'int16', 'int32', 'int64', 'uint8', 'uint16', 'uint32', 'uint64'``
    * Floats: ``'float16', 'float32', 'float64'``
    * Datetime: ``'datetime', 'datetimetz'``
    * String: ``'unicode', 'bytes'``
Should there be a type here for general objects, e.g. a column containing Python dictionaries? I realise that not all backends will have a way to store such things.
We should provide a way to embed JSON or general pickled objects in BYTE_ARRAY columns, I will update the spec with something.
That's exactly what I would expect the storage backend (i.e., parquet) to do, and what a user could decide to do themselves, but I'm thinking the spec here should be simple, and state that the column contains generic objects.
Well, if the objects have been encoded, then we should probably indicate how they were encoded. For example, encoders might be: json (unless the JSON logical type is used), msgpack, pickle. Any others we should include? @jreback
Adding an 'object' logical type with encoding metadata
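A hypothetical round-trip for such an 'object' column, with the encoder recorded in the column metadata so readers know how to decode the embedded bytes (a sketch of the idea only; the `col_meta` field names follow the draft spec, and pickle stands in for whichever encoder is chosen):

```python
import pickle

import pandas as pd

# An object column holding arbitrary Python values.
s = pd.Series([{'a': 1}, [1, 2, 3]], dtype=object)

# Each cell is pickled into bytes suitable for a BYTE_ARRAY column.
encoded = s.map(pickle.dumps)

# The encoding is recorded so a reader can reverse it.
col_meta = {'name': 'objects',
            'pandas_type': 'object',
            'numpy_type': 'object',
            'metadata': {'encoding': 'pickle'}}

# A reader consults col_meta['metadata']['encoding'] and decodes.
decoded = encoded.map(pickle.loads)
print(decoded.tolist())  # [{'a': 1}, [1, 2, 3]]
```

Note that pickle ties the file to Python readers, which is one reason the thread also floats json and msgpack as encoders.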
OK, no problem
doc/source/index.rst.template (outdated)

    @@ -146,6 +146,7 @@ See the package overview for more detail about what's in the library.
       comparison_with_r
       comparison_with_sql
       comparison_with_sas
       metadata
nix this
    @@ -16,3 +16,107 @@ Developer
    *********

    This section will focus on downstream applications of pandas.
add a ref-tag here.
done
doc/source/developer.rst (outdated)

    * Integers: ``'int8', 'int16', 'int32', 'int64', 'uint8', 'uint16', 'uint32', 'uint64'``
    * Floats: ``'float16', 'float32', 'float64'``
    * Datetime: ``'datetime', 'datetimetz'``
    * String: ``'unicode', 'bytes'``
assume utf-8? (if it's not, then would it be object?), or is it possible to provide a string encoding?
added optional encoding metadata
doc/source/developer.rst (outdated)

    * Datetime: ``'datetime', 'datetimetz'``
    * String: ``'unicode', 'bytes'``
    * Categorical: ``'categorical'``
    * Other Python objects: ``'object'``
do you store the categorical types as a nested specification (e.g. ints, string, etc.)?
good catch, will do
doc/source/developer.rst (outdated)

    The ``type_metadata`` is ``None`` except for:

    * ``datetimetz``: ``{'timezone': zone}``, e.g. ``{'timezone': 'America/New_York'}``
maybe unit on the datetime for future compat?
    * Categorical: ``'categorical'``
    * Other Python objects: ``'object'``

    The ``numpy_type`` is the physical storage type of the column, which is the
add timedelta type
this one?
added a timedelta type with optional metadata indicating the unit
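A possible column entry for such a timedelta type (a sketch; the `col_meta` field names follow the draft spec, and the `{'unit': 'ns'}` metadata is the optional unit indicator discussed here, since pandas stores timedeltas as nanosecond timedelta64):

```python
import pandas as pd

s = pd.Series(pd.to_timedelta(['1 days', '2 days']))

# pandas' in-memory representation is timedelta64[ns]; recording the
# unit explicitly leaves room for other units in the future.
col_meta = {'name': 'delta',
            'pandas_type': 'timedelta',
            'numpy_type': str(s.dtype),
            'metadata': {'unit': 'ns'}}
print(col_meta['numpy_type'])  # 'timedelta64[ns]'
```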
Updated
doc/source/developer.rst (outdated)

    {'name': column_name,
     'pandas_type': pandas_type,
     'numpy_dtype': numpy_type,
-> numpy_type (I think that's the spelling elsewhere)
lgtm (tiny typo). merge when ready.
If you want to add a note in whatsnew pointing to the new section, that would be ok (up to you).
* upstream/master: (48 commits)
  * BUG: Categorical comparison with unordered (pandas-dev#16339)
  * ENH: Adding 'protocol' parameter to 'to_pickle'.
  * PERF: improve MultiIndex get_loc performance (pandas-dev#16346)
  * TST: remove pandas-datareader xfail as 0.4.0 works (pandas-dev#16374)
  * TST: followup to pandas-dev#16364, catch errstate warnings (pandas-dev#16373)
  * DOC: new oauth token
  * TST: Add test for clip-na (pandas-dev#16369)
  * ENH: Draft metadata specification doc for Apache Parquet (pandas-dev#16315)
  * MAINT: Add .iml to .gitignore (pandas-dev#16368)
  * BUG/API: Categorical constructor scalar categories (pandas-dev#16340)
  * ENH: Provide dict object for to_dict() pandas-dev#16122 (pandas-dev#16220)
  * PERF: improved clip performance (pandas-dev#16364)
  * DOC: try new token for docs
  * DOC: try with new secure token
  * DOC: add developer section to the docs
  * DEPS: Drop Python 3.4 support (pandas-dev#16303)
  * DOC: remove credential helper
  * DOC: force fetch on build docs
  * DOC: redo dev docs access token
  * DOC: add dataframe construction in merge_asof example (pandas-dev#16348)
  * ...
Perhaps a little late but no chance of having
If you'd like to add it, submit a PR?
@wesm, cool. curious about the format definition over here: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#interval. From this I gather that
@buyology pls create a new issue to add the
There is no "INTERVAL" in pandas; we have timedelta. That Parquet has such a thing doesn't really concern us if we can represent everything in pandas without it (TIME_MILLIS will do), and in fact I'd say that type is pretty useless (does anyone use it?).
@martindurant starting in 0.20.1 we do: http://pandas.pydata.org/pandas-docs/stable/whatsnew.html#intervalindex (the scalar type is
@jreback : stop innovating, life will be easier :)
…16315)
* Draft metadata specification doc for Apache Parquet
* Tweaks, add pandas version
* Relax metadata key
* Be explicit that the metadata is file-level
* Don't hard code version
* Code reviews
* Move Parquet metadata to developer.rst, account for code reviews
* Code review comments
* Review comments
* Fix typo
cc @martindurant @mrocklin @jreback @cpcloud
closes #16010