
ENH: Draft metadata specification doc for Apache Parquet #16315

Merged 10 commits into pandas-dev:master on May 16, 2017

Conversation

@wesm (Member Author) commented May 10, 2017

This does not provide for non-string column names. I'm open to ideas about how to deal with that

@martindurant (Contributor):
Thank you, @wesm, this looks perfect from a quick scan.
One question: do we want to make provision for special object columns, such as bool-with-null, int-with-null, and dict/list (the latter being natural for JSON encoding)?

@wesm (Member Author) commented May 10, 2017

For boolean with null, I expect we would have

{'type': 'bool',
 'numpy_type': 'object'}

We'll have to run type inference on the object columns at some point anyhow to know what Parquet type to write them to.

For JSON, we could use 'type': 'json', 'numpy_type': 'object'?
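The two cases in this exchange can be sketched concretely. This is an illustrative sketch, not the final spec: the key names (`'type'`, `'numpy_type'`) follow the wording of this comment, and the column name and the inference helper are invented for illustration.

```python
# Hypothetical metadata entry for a boolean-with-nulls column stored in
# a NumPy object array, as discussed above. Field values illustrative.
bool_with_null_entry = {
    'name': 'flag',          # assumed column name
    'type': 'bool',          # logical type is still boolean
    'numpy_type': 'object',  # physical storage is object due to nulls
    'metadata': None,
}

def infer_object_column(values):
    """Naive sketch of the type inference over an object column
    (ignoring nulls) that a writer would need to run anyway to pick
    a Parquet type, per the comment above."""
    kinds = {type(v).__name__ for v in values if v is not None}
    if kinds == {'bool'}:
        return 'bool'
    if kinds and kinds <= {'dict', 'list'}:
        return 'json'   # dict/list columns map naturally to JSON encoding
    return 'object'

print(infer_object_column([True, None, False]))  # bool
```

A real writer would of course use pandas' own inference rather than this toy scan; the point is only that the logical type can differ from the physical `object` storage.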

codecov bot commented May 10, 2017

Codecov Report

Merging #16315 into master will decrease coverage by 0.01%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #16315      +/-   ##
==========================================
- Coverage   90.39%   90.37%   -0.02%     
==========================================
  Files         161      161              
  Lines       50863    50863              
==========================================
- Hits        45978    45968      -10     
- Misses       4885     4895      +10
Flag Coverage Δ
#multiple 88.16% <ø> (-0.01%) ⬇️
#single 40.33% <ø> (-0.11%) ⬇️
Impacted Files Coverage Δ
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/common.py 90.68% <0%> (-0.35%) ⬇️
pandas/core/frame.py 97.59% <0%> (-0.1%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 81aa70c...2014a68. Read the comment docs.

codecov bot commented May 10, 2017

Codecov Report

Merging #16315 into master will decrease coverage by 0.01%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #16315      +/-   ##
==========================================
- Coverage   90.38%   90.37%   -0.02%     
==========================================
  Files         161      161              
  Lines       50916    50949      +33     
==========================================
+ Hits        46021    46043      +22     
- Misses       4895     4906      +11
Flag Coverage Δ
#multiple 88.14% <ø> (ø) ⬆️
#single 40.2% <ø> (-0.13%) ⬇️
Impacted Files Coverage Δ
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.69% <0%> (-0.1%) ⬇️
pandas/core/series.py 94.71% <0%> (ø) ⬆️
pandas/core/categorical.py 95.87% <0%> (+0.01%) ⬆️
pandas/core/generic.py 91.96% <0%> (+0.01%) ⬆️
pandas/core/common.py 91.05% <0%> (+0.02%) ⬆️

Powered by Codecov. Last update d92f06a...2d00f55. Read the comment docs.

@jreback added the Docs and IO Data labels on May 10, 2017
This document provides specifications for metadata to assist with reading and
writing pandas objects to different third party file formats.

Apache Parquet
Contributor:
can you link to the Apache docs?

Member Author:
done

.. code-block:: text

   {'index_columns': ['__index_level_0__', '__index_level_1__', ...],
    'columns': [<c0>, <c1>, ...],
Contributor:
if we have tuples for columns (e.g. a MultiIndex in the columns), this will not work (not saying that this should be supported, just noting it).

Member:
@jreback I don't think we are supporting arbitrary objects yet. This metadata spec doesn't preclude us adding support for that later.

Member Author:
The Parquet format doesn't support this anyway.
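Putting the pieces of the quoted snippet together, a complete file-level metadata blob for a small DataFrame might look like the sketch below. All concrete values (column names, the version string) are illustrative assumptions, not taken from the spec text under review.

```python
# Illustrative file-level metadata for a DataFrame with one string
# column and a default index serialized as __index_level_0__.
# Shapes follow the draft snippet above; values are made up.
file_metadata = {
    'index_columns': ['__index_level_0__'],
    'columns': [
        {'name': 'c0',
         'type': 'unicode',
         'numpy_type': 'object',
         'metadata': None},
        {'name': '__index_level_0__',
         'type': 'int64',
         'numpy_type': 'int64',
         'metadata': None},
    ],
    # The PR commits mention storing the writing pandas version; the
    # key name and value here are assumptions.
    'pandas_version': '0.20.1',
}

# Sanity check: every index column must also appear in 'columns'.
names = {c['name'] for c in file_metadata['columns']}
assert set(file_metadata['index_columns']) <= names
```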

The ``type_metadata`` is ``None`` except for:

* ``datetimetz``: ``{'timezone': zone}``, e.g. ``{'timezone': 'America/New_York'}``
* ``categorical``: ``{'num_categories': K}``
Contributor:
are the categories listed? An indication of ordered=True/False would be nice as well.

Contributor:
We wouldn't want to list the categories, as it could blow up the size of this metadata.

Member Author:
added ordered

@wesm (Member Author) commented May 12, 2017

Any more comments?

@xhochy (Contributor) commented May 12, 2017

Looks good from my (Parquet) perspective. An interesting question is how to deal with object columns.

{'name': 'c2',
'type': 'categorical',
'numpy_type': 'int16',
'metadata': {'num_categories': 1000}},
Member:
ordered is missing here in the example (or is it not a required field?)

Member Author:
fixed
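For concreteness, here is how the quoted 'c2' entry might read once 'ordered' is included, per this exchange. The values (num_categories, ordered) are illustrative.

```python
# Categorical column descriptor with the 'ordered' flag added,
# following the review comment above. Values illustrative.
c2 = {
    'name': 'c2',
    'type': 'categorical',
    'numpy_type': 'int16',  # physical codes dtype
    'metadata': {'num_categories': 1000, 'ordered': False},
}
```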

@jorisvandenbossche (Member):

A small comment on putting this in the docs: this will now create a top level entry in the toc of our docs called "Storing pandas Objects in Various File Formats". I think most users will think of something else when they see that section in the toc.
Maybe we can put it somewhere under the 'internals' or 'contributing' page? Or give it a slightly more developer-oriented title?

@jreback (Contributor) commented May 12, 2017

I'm about 0+ on @jorisvandenbossche's comment. I can see the point of not having a top-level entry, but this is almost a new concept, downstream library documentation, and not really internals. So I would be +1 on having a 'Downstream Documentation' section (maybe with a better name)?

'numpy_dtype': numpy_type,
'metadata': type_metadata}

``pandas_type`` is the logical type of the column, and is one of:
Member:
It looks like there are some naming inconsistencies. Should pandas_type be type, or the other way around?

Contributor:
maybe these are better named: logical_type and storage_type

Contributor:
storage type is ambiguous too: here we mean in memory, not the (final) storage in whichever binary backend.

'metadata': None},
{'name': 'c1',
'type': 'bytes',
'numpy_type': 'object',
Member:
should this be numpy_dtype or numpy_type?

Member Author:
left it as numpy_type

Member:
Ok. I'll need to update the Arrow implementation then once the dust settles here.

* Floats: ``'float16', 'float32', 'float64'``
* Datetime: ``'datetime', 'datetimetz'``
* String: ``'unicode', 'bytes'``
* Categorical: ``'categorical'``
Member:
What about using the result of pd.lib.infer_type(series) here? Then we have:

Floats: `floating`
Integers: `integer`
Datetime: `datetime64`
String: no change
Categorical: no change

Member:
Actually, never mind. The existing logical types are fine the way you have them.

* Integers: ``'int8', 'int16', 'int32', 'int64', 'uint8', 'uint16', 'uint32', 'uint64'``
* Floats: ``'float16', 'float32', 'float64'``
* Datetime: ``'datetime', 'datetimetz'``
* String: ``'unicode', 'bytes'``
Contributor:
Should there be a type here for general objects, e.g. a column containing Python dictionaries? I realise that not all backends will have a way to store such things.

Member Author:
We should provide a way to embed JSON or general pickled objects in BYTE_ARRAY columns; I will update the spec with something.

Contributor:
That's exactly what I would expect the storage backend (i.e., Parquet) to do, and what a user could decide to do themselves, but I'm thinking the spec here should be simple and state that the column contains generic objects.

Member Author:
Well, if the objects have been encoded, then we should probably indicate how they were encoded. For example, encoders might be: json (unless the JSON logical type is used), msgpack, pickle. Any others we should include? @jreback

Member Author:
Adding an 'object' logical type with encoding metadata

@jreback (Contributor) commented May 16, 2017

@wesm if you'd move this to developer.rst (added here: 46dc536).

Note the dev docs are built, but not being uploaded ATM.

@wesm (Member Author) commented May 16, 2017

OK, no problem

@wesm force-pushed the parquet-metadata branch from bd77ee1 to e0a176e on May 16, 2017 01:49
@@ -146,6 +146,7 @@ See the package overview for more detail about what's in the library.
comparison_with_r
comparison_with_sql
comparison_with_sas
metadata
Contributor:
nix this

@@ -16,3 +16,107 @@ Developer
*********

This section will focus on downstream applications of pandas.

Contributor:
add a ref-tag here.

Member Author:
done

* Integers: ``'int8', 'int16', 'int32', 'int64', 'uint8', 'uint16', 'uint32', 'uint64'``
* Floats: ``'float16', 'float32', 'float64'``
* Datetime: ``'datetime', 'datetimetz'``
* String: ``'unicode', 'bytes'``
Contributor:
assume utf-8? (if it's not, then would it be object?) Or is it possible to provide a string encoding?

Member Author:
added optional encoding metadata

* Datetime: ``'datetime', 'datetimetz'``
* String: ``'unicode', 'bytes'``
* Categorical: ``'categorical'``
* Other Python objects: ``'object'``
Contributor:
do you store the categorical types as a nested specification (e.g. ints, string, etc.)?

Member Author:
good catch, will do


The ``type_metadata`` is ``None`` except for:

* ``datetimetz``: ``{'timezone': zone}``, e.g. ``{'timezone': 'America/New_York'}``
Contributor:
maybe add a unit on the datetime for future compat?

* Categorical: ``'categorical'``
* Other Python objects: ``'object'``

The ``numpy_type`` is the physical storage type of the column, which is the
Contributor:
add timedelta type

Contributor:
this one?

Member Author:
added a timedelta type with optional metadata indicating the unit

@wesm (Member Author) commented May 16, 2017

Updated
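The per-type metadata rules discussed in this review can be collected into a small validator sketch. The rules encoded here (timezone for datetimetz, num_categories/ordered for categorical, encoding for object) come from the comments in this thread; the function and field names are otherwise assumptions, not the published spec.

```python
# Minimal validator for the draft metadata shape discussed in this PR.
def validate(meta):
    """Check file-level keys and the per-type metadata rules from
    the review comments above."""
    assert 'index_columns' in meta and 'columns' in meta
    for col in meta['columns']:
        t, m = col['type'], col['metadata']
        if t == 'datetimetz':
            assert m and 'timezone' in m
        elif t == 'categorical':
            assert m and {'num_categories', 'ordered'} <= m.keys()
        elif t == 'object':
            assert m and 'encoding' in m
    return True

validate({'index_columns': [],
          'columns': [{'name': 'ts', 'type': 'datetimetz',
                       'numpy_type': 'datetime64[ns]',
                       'metadata': {'timezone': 'America/New_York'}}]})
```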


{'name': column_name,
'pandas_type': pandas_type,
'numpy_dtype': numpy_type,
Contributor:
-> numpy_type (I think that's the spelling elsewhere)

@jreback added this to the 0.21.0 milestone on May 16, 2017
@jreback (Contributor) commented May 16, 2017

lgtm. (tiny typo). merge when ready.

@jreback (Contributor) commented May 16, 2017

If you want to add a note in whatsnew pointing to the new section, that would be ok (up to you).

@wesm merged commit 623b3e7 into pandas-dev:master on May 16, 2017
@wesm deleted the parquet-metadata branch on May 16, 2017 22:26
pawroman added a commit to pawroman/pandas that referenced this pull request May 18, 2017
* upstream/master: (48 commits)
  BUG: Categorical comparison with unordered (pandas-dev#16339)
  ENH: Adding 'protocol' parameter to 'to_pickle'.
  PERF: improve MultiIndex get_loc performance (pandas-dev#16346)
  TST: remove pandas-datareader xfail as 0.4.0 works (pandas-dev#16374)
  TST: followup to pandas-dev#16364, catch errstate warnings (pandas-dev#16373)
  DOC: new oauth token
  TST: Add test for clip-na (pandas-dev#16369)
  ENH: Draft metadata specification doc for Apache Parquet (pandas-dev#16315)
  MAINT: Add .iml to .gitignore (pandas-dev#16368)
  BUG/API: Categorical constructor scalar categories (pandas-dev#16340)
  ENH: Provide dict object for to_dict() pandas-dev#16122 (pandas-dev#16220)
  PERF: improved clip performance (pandas-dev#16364)
  DOC: try new token for docs
  DOC: try with new secure token
  DOC: add developer section to the docs
  DEPS: Drop Python 3.4 support (pandas-dev#16303)
  DOC: remove credential helper
  DOC: force fetch on build docs
  DOC: redo dev docs access token
  DOC: add dataframe construction in merge_asof example (pandas-dev#16348)
  ...
@buyology:
Perhaps a little late, but any chance of having Interval types in this?

@wesm (Member Author) commented May 20, 2017

If you'd like to add it, submit a PR?

@buyology:
@wesm, cool. Curious about the format definition here: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#interval. From this I gather that INTERVAL ≈ Timedelta? Maybe this observation is not significant at all; still trying to wrap my head around the path forward for parquet/feather et al.

@jreback (Contributor) commented May 20, 2017

@buyology pls create a new issue to add the Interval type. This is not actually supported by pyarrow ATM. And confusingly, it is NOT the INTERVAL type at all (which is simply a Timedelta). Rather, this is a new type.

@martindurant (Contributor):
There is no "INTERVAL" in pandas; we have timedelta. That Parquet has such a thing doesn't really concern us if we can represent everything in pandas without it (TIME_MILLIS will do), and in fact I'd say that type is pretty useless (does anyone use it?).

@jreback (Contributor) commented May 20, 2017

@martindurant starting in 0.20.1 we do: http://pandas.pydata.org/pandas-docs/stable/whatsnew.html#intervalindex (the scalar type is Interval)

@martindurant (Contributor):
@jreback: stop innovating, life will be easier :)
So these are something completely different; now I see what you mean above.

pcluo pushed a commit to pcluo/pandas that referenced this pull request May 22, 2017
ENH: Draft metadata specification doc for Apache Parquet (pandas-dev#16315)

* Draft metadata specification doc for Apache Parquet

* Tweaks, add pandas version

* Relax metadata key

* Be explicit that the metadata is file-level

* Don't hard code version

* Code reviews

* Move Parquet metadata to developer.rst, account for code reviews

* Code review comments

* Review comments

* Fix typo
stangirala pushed a commit to stangirala/pandas that referenced this pull request Jun 11, 2017
Labels
Docs, IO Data (IO issues that don't fit into a more specific label)
Successfully merging this pull request may close these issues.

ENH: Specification document for embedding pandas-specific metadata in binary file formats
7 participants