
ENH: Draft metadata specification doc for Apache Parquet #16315

Merged 10 commits into pandas-dev:master on May 16, 2017

Conversation

@wesm (Member Author) commented May 10, 2017

This does not provide for non-string column names. I'm open to ideas about how to deal with that

@martindurant (Contributor):
Thank you, @wesm, this looks perfect from a quick scan.
One question: do we want to make provision for special object columns, such as bool-with-null, int-with-null, and dict/list (the latter being natural for JSON encoding)?

@wesm (Member Author) commented May 10, 2017

For boolean with null, I expect we would have

{'type': 'bool',
 'numpy_type': 'object'}

We'll have to run type inference on the object columns at some point anyhow to know what Parquet type to write them to.

For JSON, we could use 'type': 'json', 'numpy_type': 'object'?
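The two cases in this exchange can be sketched concretely. This is an illustrative sketch, not the final spec: the key names (`'type'`, `'numpy_type'`) follow the wording of this comment, and the column name and the inference helper are invented for illustration.

```python
# Hypothetical metadata entry for a boolean-with-nulls column stored in
# a NumPy object array, as discussed above. Field values illustrative.
bool_with_null_entry = {
    'name': 'flag',          # assumed column name
    'type': 'bool',          # logical type is still boolean
    'numpy_type': 'object',  # physical storage is object due to nulls
    'metadata': None,
}

def infer_object_column(values):
    """Naive sketch of the type inference over an object column
    (ignoring nulls) that a writer would need to run anyway to pick
    a Parquet type, per the comment above."""
    kinds = {type(v).__name__ for v in values if v is not None}
    if kinds == {'bool'}:
        return 'bool'
    if kinds and kinds <= {'dict', 'list'}:
        return 'json'   # dict/list columns map naturally to JSON encoding
    return 'object'

print(infer_object_column([True, None, False]))  # bool
```

A real writer would of course use pandas' own inference rather than this toy scan; the point is only that the logical type can differ from the physical `object` storage.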

codecov bot commented May 10, 2017

Codecov Report

Merging #16315 into master will decrease coverage by 0.01%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #16315      +/-   ##
==========================================
- Coverage   90.39%   90.37%   -0.02%     
==========================================
  Files         161      161              
  Lines       50863    50863              
==========================================
- Hits        45978    45968      -10     
- Misses       4885     4895      +10
Flag Coverage Δ
#multiple 88.16% <ø> (-0.01%) ⬇️
#single 40.33% <ø> (-0.11%) ⬇️
Impacted Files Coverage Δ
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/common.py 90.68% <0%> (-0.35%) ⬇️
pandas/core/frame.py 97.59% <0%> (-0.1%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 81aa70c...2014a68. Read the comment docs.

codecov bot commented May 10, 2017

Codecov Report

Merging #16315 into master will decrease coverage by 0.01%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #16315      +/-   ##
==========================================
- Coverage   90.38%   90.37%   -0.02%     
==========================================
  Files         161      161              
  Lines       50916    50949      +33     
==========================================
+ Hits        46021    46043      +22     
- Misses       4895     4906      +11
Flag Coverage Δ
#multiple 88.14% <ø> (ø) ⬆️
#single 40.2% <ø> (-0.13%) ⬇️
Impacted Files Coverage Δ
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.69% <0%> (-0.1%) ⬇️
pandas/core/series.py 94.71% <0%> (ø) ⬆️
pandas/core/categorical.py 95.87% <0%> (+0.01%) ⬆️
pandas/core/generic.py 91.96% <0%> (+0.01%) ⬆️
pandas/core/common.py 91.05% <0%> (+0.02%) ⬆️

Powered by Codecov. Last update d92f06a...2d00f55. Read the comment docs.

@jreback added the Docs and IO Data labels on May 10, 2017
This document provides specifications for metadata to assist with reading and
writing pandas objects to different third party file formats.

Apache Parquet
Contributor:
can you link to the Apache docs?

Member Author:
done

.. code-block:: text

   {'index_columns': ['__index_level_0__', '__index_level_1__', ...],
    'columns': [<c0>, <c1>, ...],
Contributor:
if we have tuples for columns (e.g. a MultiIndex in the columns), this will not work (not saying that this should be supported, just noting it).

Member:
@jreback I don't think we are supporting arbitrary objects yet. This metadata spec doesn't preclude us adding support for that later.

Member Author:
The Parquet format doesn't support this anyway.
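Putting the pieces of the quoted snippet together, a complete file-level metadata blob for a small DataFrame might look like the sketch below. All concrete values (column names, the version string) are illustrative assumptions, not taken from the spec text under review.

```python
# Illustrative file-level metadata for a DataFrame with one string
# column and a default index serialized as __index_level_0__.
# Shapes follow the draft snippet above; values are made up.
file_metadata = {
    'index_columns': ['__index_level_0__'],
    'columns': [
        {'name': 'c0',
         'type': 'unicode',
         'numpy_type': 'object',
         'metadata': None},
        {'name': '__index_level_0__',
         'type': 'int64',
         'numpy_type': 'int64',
         'metadata': None},
    ],
    # The PR commits mention storing the writing pandas version; the
    # key name and value here are assumptions.
    'pandas_version': '0.20.1',
}

# Sanity check: every index column must also appear in 'columns'.
names = {c['name'] for c in file_metadata['columns']}
assert set(file_metadata['index_columns']) <= names
```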

The ``type_metadata`` is ``None`` except for:

* ``datetimetz``: ``{'timezone': zone}``, e.g. ``{'timezone': 'America/New_York'}``
* ``categorical``: ``{'num_categories': K}``
Contributor:
are the categories listed? An indication of ordered=True/False would be nice as well.

Contributor:
We wouldn't want to list the categories, as it could blow up the size of this metadata.

Member Author:
added ordered

@wesm (Member Author) commented May 12, 2017

Any more comments?

@xhochy (Contributor) commented May 12, 2017

Looks good from my (Parquet) perspective. An interesting question is how to deal with object columns.

{'name': 'c2',
'type': 'categorical',
'numpy_type': 'int16',
'metadata': {'num_categories': 1000}},
Member:
ordered is missing here in the example (or is it not a required field?)

Member Author:
fixed
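For concreteness, here is how the quoted 'c2' entry might read once 'ordered' is included, per this exchange. The values (num_categories, ordered) are illustrative.

```python
# Categorical column descriptor with the 'ordered' flag added,
# following the review comment above. Values illustrative.
c2 = {
    'name': 'c2',
    'type': 'categorical',
    'numpy_type': 'int16',  # physical codes dtype
    'metadata': {'num_categories': 1000, 'ordered': False},
}
```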

@jorisvandenbossche (Member):

A small comment on putting this in the docs: this will now create a top level entry in the toc of our docs called "Storing pandas Objects in Various File Formats". I think most users will think of something else when they see that section in the toc.
Maybe we can put it somewhere under the 'internals' or 'contributing' page? Or give it a slightly more developer-oriented title?

@jreback (Contributor) commented May 12, 2017

I'm about 0+ on @jorisvandenbossche's comment. I can see the point of not having a top-level entry, but this is almost a new concept, downstream library documentation, and not really internals. So I would be +1 on having a 'Downstream Documentation' section (maybe with a better name)?

'numpy_dtype': numpy_type,
'metadata': type_metadata}

``pandas_type`` is the logical type of the column, and is one of:
Member:
It looks like there are some naming inconsistencies. Should pandas_type be type, or the other way around?

Contributor:
maybe these are better named: logical_type and storage_type

Contributor:
storage type is ambiguous too: here we mean in memory, not the (final) storage in whichever binary backend.

'metadata': None},
{'name': 'c1',
'type': 'bytes',
'numpy_type': 'object',
Member:
should this be numpy_dtype or numpy_type?

Member Author:
left it as numpy_type

Member:
Ok. I'll need to update the Arrow implementation then once the dust settles here.

* Floats: ``'float16', 'float32', 'float64'``
* Datetime: ``'datetime', 'datetimetz'``
* String: ``'unicode', 'bytes'``
* Categorical: ``'categorical'``
Member:
What about using the result of pd.lib.infer_type(series) here? Then we have:

Floats: `floating`
Integers: `integer`
Datetime: `datetime64`
String: no change
Categorical: no change

Member:
Actually, never mind. The existing logical types are fine the way you have them.

* Integers: ``'int8', 'int16', 'int32', 'int64', 'uint8', 'uint16', 'uint32', 'uint64'``
* Floats: ``'float16', 'float32', 'float64'``
* Datetime: ``'datetime', 'datetimetz'``
* String: ``'unicode', 'bytes'``
Contributor:
Should there be a type here for general objects, e.g. a column containing Python dictionaries? I realise that not all backends will have a way to store such things.

Member Author:
We should provide a way to embed JSON or general pickled objects in BYTE_ARRAY columns; I will update the spec with something.

Contributor:
That's exactly what I would expect the storage backend (i.e., Parquet) to do, and what a user could decide to do themselves, but I'm thinking the spec here should be simple and state that the column contains generic objects.

Member Author:
Well, if the objects have been encoded, then we should probably indicate how they were encoded. For example, encoders might be: json (unless the JSON logical type is used), msgpack, pickle. Any others we should include? @jreback

Member Author:
Adding an 'object' logical type with encoding metadata

@jreback (Contributor) commented May 16, 2017

@wesm if you'd move this to developer.rst (added here: 46dc536).

Note the dev docs are built, but not being uploaded ATM.

@wesm (Member Author) commented May 16, 2017

OK, no problem

@wesm force-pushed the parquet-metadata branch from bd77ee1 to e0a176e on May 16, 2017 01:49
@@ -146,6 +146,7 @@ See the package overview for more detail about what's in the library.
comparison_with_r
comparison_with_sql
comparison_with_sas
metadata
Contributor:
nix this

@@ -16,3 +16,107 @@ Developer
*********

This section will focus on downstream applications of pandas.

Contributor:
add a ref-tag here.

Member Author:
done

* Integers: ``'int8', 'int16', 'int32', 'int64', 'uint8', 'uint16', 'uint32', 'uint64'``
* Floats: ``'float16', 'float32', 'float64'``
* Datetime: ``'datetime', 'datetimetz'``
* String: ``'unicode', 'bytes'``
Contributor:
assume utf-8? (if it's not, then would it be object?) Or is it possible to provide a string encoding?

Member Author:
added optional encoding metadata

* Datetime: ``'datetime', 'datetimetz'``
* String: ``'unicode', 'bytes'``
* Categorical: ``'categorical'``
* Other Python objects: ``'object'``
Contributor:
do you store the categorical types as a nested specification (e.g. ints, string, etc.)?

Member Author:
good catch, will do


The ``type_metadata`` is ``None`` except for:

* ``datetimetz``: ``{'timezone': zone}``, e.g. ``{'timezone': 'America/New_York'}``
Contributor:
maybe add a unit on the datetime for future compat?

* Categorical: ``'categorical'``
* Other Python objects: ``'object'``

The ``numpy_type`` is the physical storage type of the column, which is the
Contributor:
add timedelta type

Contributor:
this one?

Member Author:
added a timedelta type with optional metadata indicating the unit

@wesm (Member Author) commented May 16, 2017

Updated
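The per-type metadata rules discussed in this review can be collected into a small validator sketch. The rules encoded here (timezone for datetimetz, num_categories/ordered for categorical, encoding for object) come from the comments in this thread; the function and field names are otherwise assumptions, not the published spec.

```python
# Minimal validator for the draft metadata shape discussed in this PR.
def validate(meta):
    """Check file-level keys and the per-type metadata rules from
    the review comments above."""
    assert 'index_columns' in meta and 'columns' in meta
    for col in meta['columns']:
        t, m = col['type'], col['metadata']
        if t == 'datetimetz':
            assert m and 'timezone' in m
        elif t == 'categorical':
            assert m and {'num_categories', 'ordered'} <= m.keys()
        elif t == 'object':
            assert m and 'encoding' in m
    return True

validate({'index_columns': [],
          'columns': [{'name': 'ts', 'type': 'datetimetz',
                       'numpy_type': 'datetime64[ns]',
                       'metadata': {'timezone': 'America/New_York'}}]})
```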


{'name': column_name,
'pandas_type': pandas_type,
'numpy_dtype': numpy_type,
Contributor:
-> numpy_type (I think that's the spelling elsewhere)

@jreback added this to the 0.21.0 milestone on May 16, 2017
@jreback (Contributor) commented May 16, 2017

lgtm. (tiny typo). merge when ready.

@jreback (Contributor) commented May 16, 2017

If you want to add a note in whatsnew pointing to the new section, that would be ok (up to you).

@wesm merged commit 623b3e7 into pandas-dev:master on May 16, 2017
@wesm deleted the parquet-metadata branch on May 16, 2017 22:26
pawroman added a commit to pawroman/pandas that referenced this pull request May 18, 2017
* upstream/master: (48 commits)
  BUG: Categorical comparison with unordered (pandas-dev#16339)
  ENH: Adding 'protocol' parameter to 'to_pickle'.
  PERF: improve MultiIndex get_loc performance (pandas-dev#16346)
  TST: remove pandas-datareader xfail as 0.4.0 works (pandas-dev#16374)
  TST: followup to pandas-dev#16364, catch errstate warnings (pandas-dev#16373)
  DOC: new oauth token
  TST: Add test for clip-na (pandas-dev#16369)
  ENH: Draft metadata specification doc for Apache Parquet (pandas-dev#16315)
  MAINT: Add .iml to .gitignore (pandas-dev#16368)
  BUG/API: Categorical constructor scalar categories (pandas-dev#16340)
  ENH: Provide dict object for to_dict() pandas-dev#16122 (pandas-dev#16220)
  PERF: improved clip performance (pandas-dev#16364)
  DOC: try new token for docs
  DOC: try with new secure token
  DOC: add developer section to the docs
  DEPS: Drop Python 3.4 support (pandas-dev#16303)
  DOC: remove credential helper
  DOC: force fetch on build docs
  DOC: redo dev docs access token
  DOC: add dataframe construction in merge_asof example (pandas-dev#16348)
  ...
@buyology:
Perhaps a little late, but any chance of having Interval types in this?

@wesm (Member Author) commented May 20, 2017

If you'd like to add it, submit a PR?

@buyology:
@wesm, cool. Curious about the format definition here: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#interval. From this I gather that INTERVAL ≈ Timedelta? Maybe this observation is not significant at all; still trying to wrap my head around the path forward for parquet/feather et al.

@jreback (Contributor) commented May 20, 2017

@buyology pls create a new issue to add the Interval type. This is not actually supported by pyarrow ATM. And confusingly, it is NOT the INTERVAL type at all (which is simply a Timedelta). Rather, this is a new type.

@martindurant (Contributor):
There is no "INTERVAL" in pandas; we have timedelta. That Parquet has such a thing doesn't really concern us if we can represent everything in pandas without it (TIME_MILLIS will do), and in fact I'd say that type is pretty useless (does anyone use it?).

@jreback (Contributor) commented May 20, 2017

@martindurant starting in 0.20.1 we do: http://pandas.pydata.org/pandas-docs/stable/whatsnew.html#intervalindex (the scalar type is Interval)

@martindurant (Contributor):
@jreback: stop innovating, life will be easier :)
So these are something completely different; now I see what you mean above.

pcluo pushed a commit to pcluo/pandas that referenced this pull request May 22, 2017
ENH: Draft metadata specification doc for Apache Parquet (pandas-dev#16315)

* Draft metadata specification doc for Apache Parquet

* Tweaks, add pandas version

* Relax metadata key

* Be explicit that the metadata is file-level

* Don't hard code version

* Code reviews

* Move Parquet metadata to developer.rst, account for code reviews

* Code review comments

* Review comments

* Fix typo
stangirala pushed a commit to stangirala/pandas that referenced this pull request Jun 11, 2017
Labels
Docs, IO Data (IO issues that don't fit into a more specific label)
Successfully merging this pull request may close these issues.

ENH: Specification document for embedding pandas-specific metadata in binary file formats
7 participants