ENH: Draft metadata specification doc for Apache Parquet (pandas-dev#…
…16315)

* Draft metadata specification doc for Apache Parquet

* Tweaks, add pandas version

* Relax metadata key

* Be explicit that the metadata is file-level

* Don't hard code version

* Code reviews

* Move Parquet metadata to developer.rst, account for code reviews

* Code review comments

* Review comments

* Fix typo
wesm authored and stangirala committed Jun 11, 2017
1 parent f17ec3e commit 9328ab7
Showing 1 changed file with 117 additions and 0 deletions.
117 changes: 117 additions & 0 deletions doc/source/developer.rst
@@ -16,3 +16,120 @@ Developer
*********

This section will focus on downstream applications of pandas.

.. _apache.parquet:

Storing pandas DataFrame objects in Apache Parquet format
---------------------------------------------------------

The `Apache Parquet <https://github.com/apache/parquet-format>`__ format
provides key-value metadata at the file and column level, stored in the footer
of the Parquet file:

.. code-block:: shell

   5: optional list<KeyValue> key_value_metadata

where ``KeyValue`` is

.. code-block:: shell

   struct KeyValue {
     1: required string key
     2: optional string value
   }

So that a ``pandas.DataFrame`` can be faithfully reconstructed, we store a
``pandas`` metadata key in the ``FileMetaData`` with the value stored as:

.. code-block:: text

   {'index_columns': ['__index_level_0__', '__index_level_1__', ...],
    'columns': [<c0>, <c1>, ...],
    'pandas_version': $VERSION}

Here, ``<c0>`` and so forth are dictionaries containing the metadata for each
column. This has JSON form:

.. code-block:: text

   {'name': column_name,
    'pandas_type': pandas_type,
    'numpy_type': numpy_type,
    'metadata': type_metadata}

``pandas_type`` is the logical type of the column, and is one of:

* Boolean: ``'bool'``
* Integers: ``'int8', 'int16', 'int32', 'int64', 'uint8', 'uint16', 'uint32', 'uint64'``
* Floats: ``'float16', 'float32', 'float64'``
* Date and Time Types: ``'datetime', 'datetimetz', 'timedelta'``
* String: ``'unicode', 'bytes'``
* Categorical: ``'categorical'``
* Other Python objects: ``'object'``
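
As a minimal sketch (not part of pandas itself), a reader of this metadata
could check a ``pandas_type`` value against the list above. The names
``PANDAS_TYPES`` and ``is_valid_pandas_type`` are hypothetical and are used
here only for illustration:

```python
# Logical type names copied from the list above, grouped by kind.
PANDAS_TYPES = {
    "boolean": {"bool"},
    "integer": {"int8", "int16", "int32", "int64",
                "uint8", "uint16", "uint32", "uint64"},
    "float": {"float16", "float32", "float64"},
    "datetime": {"datetime", "datetimetz", "timedelta"},
    "string": {"unicode", "bytes"},
    "categorical": {"categorical"},
    "object": {"object"},
}


def is_valid_pandas_type(pandas_type):
    """Return True if ``pandas_type`` is one of the logical types listed above."""
    return any(pandas_type in names for names in PANDAS_TYPES.values())


print(is_valid_pandas_type("datetimetz"))
print(is_valid_pandas_type("complex128"))
```

Grouping the names by kind mirrors the list and keeps the check easy to extend
if the specification gains new logical types.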

The ``numpy_type`` is the physical storage type of the column, which is the
result of ``str(dtype)`` for the underlying NumPy array that holds the data. So
for ``datetimetz`` this is ``datetime64[ns]`` and for categorical, it may be
any of the supported integer categorical types.

The ``type_metadata`` is ``None`` except for:

* ``datetimetz``: ``{'timezone': zone, 'unit': 'ns'}``, e.g.
  ``{'timezone': 'America/New_York', 'unit': 'ns'}``. The ``'unit'`` is
  optional, and if omitted it is assumed to be nanoseconds.
* ``categorical``: ``{'num_categories': K, 'ordered': is_ordered, 'type': $TYPE}``

  * Here ``'type'`` is optional, and can be a nested pandas type specification
    (but not categorical).

* ``unicode``: ``{'encoding': encoding}``

  * The encoding is optional, and if not present is UTF-8.

* ``object``: ``{'encoding': encoding}``. Objects can be serialized and stored
  in ``BYTE_ARRAY`` Parquet columns. The encoding can be one of:

  * ``'pickle'``
  * ``'msgpack'``
  * ``'bson'``
  * ``'json'``

* ``timedelta``: ``{'unit': 'ns'}``. The ``'unit'`` is optional, and if omitted
  it is assumed to be nanoseconds. This metadata is optional altogether.

For types other than these, the ``'metadata'`` key can be
omitted. Implementations can assume ``None`` if the key is not present.
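
The defaulting rules above can be sketched as a small normalization step that a
consumer might run on each column entry. This is only an illustration under
the rules stated in this section: ``normalize_column`` and ``_DEFAULTS`` are
hypothetical names, not pandas APIs.

```python
import copy

# Defaults stated in the text above; other types pass through unchanged.
_DEFAULTS = {
    "datetimetz": {"unit": "ns"},
    "timedelta": {"unit": "ns"},
    "unicode": {"encoding": "UTF-8"},
}


def normalize_column(column):
    """Return a copy of a column entry with optional metadata made explicit."""
    result = copy.deepcopy(column)
    metadata = result.get("metadata")  # a missing key is treated as None
    defaults = _DEFAULTS.get(result["pandas_type"])
    if defaults is not None:
        # Values that are present take precedence over the defaults.
        metadata = {**defaults, **(metadata or {})}
    result["metadata"] = metadata
    return result


print(normalize_column({"name": "c3",
                        "pandas_type": "datetimetz",
                        "numpy_type": "datetime64[ns]",
                        "metadata": {"timezone": "America/New_York"}}))
```

Working on a deep copy leaves the decoded metadata untouched, so the original
values remain available if the file is written back unchanged.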

As an example of fully-formed metadata:

.. code-block:: text

   {'index_columns': ['__index_level_0__'],
    'columns': [
        {'name': 'c0',
         'pandas_type': 'int8',
         'numpy_type': 'int8',
         'metadata': None},
        {'name': 'c1',
         'pandas_type': 'bytes',
         'numpy_type': 'object',
         'metadata': None},
        {'name': 'c2',
         'pandas_type': 'categorical',
         'numpy_type': 'int16',
         'metadata': {'num_categories': 1000, 'ordered': False}},
        {'name': 'c3',
         'pandas_type': 'datetimetz',
         'numpy_type': 'datetime64[ns]',
         'metadata': {'timezone': 'America/Los_Angeles'}},
        {'name': 'c4',
         'pandas_type': 'object',
         'numpy_type': 'object',
         'metadata': {'encoding': 'pickle'}},
        {'name': '__index_level_0__',
         'pandas_type': 'int64',
         'numpy_type': 'int64',
         'metadata': None}
    ],
    'pandas_version': '0.20.0'}
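
To illustrate how such metadata might travel through the file-level key-value
list, the following sketch serializes a shortened version of the example to
JSON under the ``pandas`` key and reads it back. The list of ``(key, value)``
tuples is only a stand-in for ``key_value_metadata``, and
``read_pandas_metadata`` is a hypothetical helper, not a pandas or Parquet API.

```python
import json

metadata = {
    "index_columns": ["__index_level_0__"],
    "columns": [
        {"name": "c0", "pandas_type": "int8",
         "numpy_type": "int8", "metadata": None},
        {"name": "c2", "pandas_type": "categorical", "numpy_type": "int16",
         "metadata": {"num_categories": 1000, "ordered": False}},
        {"name": "__index_level_0__", "pandas_type": "int64",
         "numpy_type": "int64", "metadata": None},
    ],
    "pandas_version": "0.20.0",
}

# Stand-in for the file-level key_value_metadata: (key, value) string pairs.
key_value_metadata = [("pandas", json.dumps(metadata))]


def read_pandas_metadata(pairs):
    """Return the decoded ``pandas`` metadata, or None if the key is absent."""
    for key, value in pairs:
        if key == "pandas" and value is not None:
            return json.loads(value)
    return None


decoded = read_pandas_metadata(key_value_metadata)
index_names = set(decoded["index_columns"])
# Columns that are not listed in index_columns are regular data columns.
data_columns = [c["name"] for c in decoded["columns"]
                if c["name"] not in index_names]
print(data_columns)
```

In the serialized JSON string, Python's ``None`` and ``False`` appear as
``null`` and ``false``; decoding restores the Python values.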
