This repository has been archived by the owner on Apr 10, 2024. It is now read-only.

Add section about logical to physical correspondence
wesm committed Aug 23, 2016
1 parent 7a44f8f commit 94d0281
Showing 4 changed files with 93 additions and 14 deletions.
10 changes: 5 additions & 5 deletions source/conf.py
@@ -48,9 +48,9 @@
master_doc = 'index'

# General information about the project.
project = "Wes's pandas 2.0 Design Docs"
copyright = '2016, Wes McKinney'
author = 'Wes McKinney'
project = "pandas 2.0 Design Docs"
copyright = '2016, pandas Development Team'
author = 'pandas Development Team'

# The version info for the project you're documenting, acts as replacement for
# |version| and |release|, also used in various other places throughout the
@@ -143,7 +143,7 @@
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['_static']
html_static_path = []

# Add any extra paths that contain custom files (such as robots.txt or
# .htaccess) here, relative to this directory. These files are copied
@@ -229,7 +229,7 @@
# author, documentclass [howto, manual, or own class]).
latex_documents = [
(master_doc, 'pandas20DesignDocs.tex', 'pandas 2.0 Design Docs Documentation',
'Wes McKinney', 'manual'),
'pandas Development Team', 'manual'),
]

# The name of an image file (relative to this directory) to place at the top of
4 changes: 2 additions & 2 deletions source/index.rst
@@ -1,5 +1,5 @@
Wes's pandas 2.0 Design Documents
=================================
pandas 2.0 Design Documents
===========================

These are a set of documents, based on discussions started in December 2015, to
assist with discussions around changes to Python pandas's internal design
87 changes: 83 additions & 4 deletions source/internal-architecture.rst
@@ -8,9 +8,9 @@
np.set_printoptions(precision=4, suppress=True)
pd.options.display.max_rows = 100
===============================
Internal Architecture Changes
===============================
===================================
Internals: Data structure changes
===================================

Logical types and Physical Storage Decoupling
=============================================
@@ -203,6 +203,85 @@ we've chosen for pandas, and elsewhere we can invoke pandas-specific code.
A major concern here based on these ideas is **preserving NumPy
interoperability**, so I'll examine this topic in some detail next.

Correspondence between logical and physical types
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

* **Floating point numbers**

- Logical: ``Float16/32/64``
- Physical: ``numpy.float16/32/64``, with ``NaN`` for null (for backwards
compatibility)

* **Signed Integers**

- Logical: ``Int8/16/32/64``
- Physical: ``numpy.int8/16/32/64`` array plus nullness bitmap

* **Unsigned Integers**

- Logical: ``UInt8/16/32/64``
- Physical: ``numpy.uint8/16/32/64`` array plus nullness bitmap

* **Boolean**

- Logical: ``Boolean``
- Physical: ``np.bool_`` array (one byte per value, the same width as
``np.uint8``) plus nullness bitmap. We may also explore bit storage (versus
bytes).

* **Categorical**

- Logical: ``Categorical[T]``, where ``T`` is any other logical type
- Physical: a composition of an ``Int8`` through ``Int64`` array of codes
(depending on the cardinality of the categories) plus the categories
array. These have the same physical representation as

* **String and Binary**

- Logical: ``String`` and ``Binary``
- Physical: Dictionary-encoded representation for UTF-8 and general binary
data as described in the :doc:`string section <strings>`.

* **Timestamp**

- Logical: ``Timestamp[unit]``, where unit is the resolution. Nanoseconds can
continue to be the default unit for now.
- Physical: ``numpy.int64``, with ``INT64_MIN`` as the null value.

* **Timedelta**

- Logical: ``Timedelta[unit]``, where unit is the resolution
- Physical: ``numpy.int64``, with ``INT64_MIN`` as the null value.

* **Period**

- Logical: ``Period[unit]``, where unit is the resolution
- Physical: ``numpy.int64``, with ``INT64_MIN`` as the null value.

* **Interval**

- Logical: ``Interval``
- Physical: two arrays of ``Timestamp[U]`` -- these may need to be forced to
both be the same resolution

* **Python objects** (catch-all for other data types)

- Logical: ``Object``
- Physical: ``numpy.object_`` array, with ``None`` for null values (perhaps with
``np.nan`` also for backwards compatibility)

* **Complex numbers**

- Logical: ``Complex64/128``
- Physical: ``numpy.complex64/128``, with ``NaN`` for null (for backwards
compatibility)
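
The list above uses two different ways of marking nulls: an in-band sentinel
value (``NaN``, ``INT64_MIN``) and an out-of-band nullness bitmap. The following
is a minimal sketch of both using plain NumPy. The helper ``null_aware_sum`` is
hypothetical and not part of any pandas API, and the validity mask is kept as
one byte per value here rather than as a packed bitmap.

.. code-block:: python

    import numpy as np

    # Strategy 1: in-band sentinel.
    # Timestamp / Timedelta / Period: int64 with INT64_MIN meaning "null".
    INT64_MIN = np.iinfo(np.int64).min
    timestamps = np.array([1471910400, INT64_MIN, 1471996800], dtype=np.int64)
    timestamp_is_null = timestamps == INT64_MIN

    # NumPy's own NaT uses the same sentinel in its int64 representation.
    nat_as_int = np.array(['NaT'], dtype='datetime64[ns]').view(np.int64)[0]

    # Floating point: NaN doubles as the null marker.
    floats = np.array([1.5, np.nan, 3.0], dtype=np.float64)
    float_is_null = np.isnan(floats)

    # Strategy 2: out-of-band validity information.
    # Integers have no spare value, so nullness is tracked separately.
    # (Assumption for this sketch: True means "not null".)
    int_values = np.array([10, 0, 30], dtype=np.int64)  # slot 1 is a placeholder
    int_valid = np.array([True, False, True])


    def null_aware_sum(values, valid):
        """Hypothetical helper: sum only the slots marked as valid."""
        return values[valid].sum()


    total = null_aware_sum(int_values, int_valid)

With the out-of-band approach the full range of ``int64`` stays usable for
data, whereas the sentinel approach gives up one value (``INT64_MIN``).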

Some notes on these:

- While a pandas (logical) type may map onto one or more physical
representations, in general NumPy types will map directly onto a pandas
type. Thus, existing code involving ``numpy.dtype``-like objects (such as
``'f8'`` or ``numpy.float64``) will continue to work.
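
The note above depends on NumPy normalizing different spellings of a dtype to
the same ``numpy.dtype`` object. The sketch below shows how such specs could be
resolved to a logical type name. The ``LOGICAL_NAMES`` table and the
``to_logical`` function are hypothetical illustrations, not a proposed API.

.. code-block:: python

    import numpy as np

    # Different spellings of the same NumPy dtype compare (and hash) equal.
    assert np.dtype('f8') == np.dtype(np.float64) == np.dtype('float64')

    # Hypothetical lookup table from NumPy dtypes to logical type names.
    LOGICAL_NAMES = {
        np.dtype(np.float64): 'Float64',
        np.dtype(np.int64): 'Int64',
        np.dtype(np.uint8): 'UInt8',
        np.dtype(np.bool_): 'Boolean',
        np.dtype(np.object_): 'Object',
    }


    def to_logical(spec):
        """Hypothetical helper: map any dtype-like spec to a logical name."""
        return LOGICAL_NAMES[np.dtype(spec)]

Under these assumptions, ``to_logical('f8')`` and ``to_logical(np.float64)``
both resolve to the same logical name, so user code that passes either spelling
would not need to change.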

Preserving NumPy interoperability
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

@@ -318,7 +397,7 @@ bitmap** (which the user never sees). This has numerous benefits:
Notably, this is the way that PostgreSQL handles null values. For example, we
might have:

.. code-block::
.. code-block:: text
[0, 1, 2, NA, NA, 5, 6, NA]
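
As a rough illustration of how the array above could be split into a values
buffer and a packed bitmap, here is a sketch using NumPy. The bit convention
(1 means "not null") and the bit order (value ``i`` maps to bit ``i``) are
assumptions made for this example only, and ``NA`` is just a local placeholder.

.. code-block:: python

    import numpy as np

    NA = None  # placeholder null marker for this sketch only

    data = [0, 1, 2, NA, NA, 5, 6, NA]

    # One flag per value: True where a value is present.
    valid = np.array([x is not NA for x in data])

    # Null slots hold an arbitrary placeholder (0 here); the bitmap decides.
    values = np.array([0 if x is NA else x for x in data], dtype=np.int64)

    # Pack the flags into one bit per value (requires NumPy >= 1.17).
    bitmap = np.packbits(valid, bitorder='little')  # -> array([103], dtype=uint8)

    # Unpack again to recover the per-value flags.
    recovered = np.unpackbits(
        bitmap, count=len(data), bitorder='little'
    ).astype(bool)

Eight values need a single byte of bitmap in this layout, compared with eight
bytes for a byte-per-value boolean mask.
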
6 changes: 3 additions & 3 deletions source/strings.rst
@@ -8,9 +8,9 @@
np.set_printoptions(precision=4, suppress=True)
pd.options.display.max_rows = 100
==================================
Enhanced string / UTF-8 handling
==================================
=============================================
Internals: Enhanced string / UTF-8 handling
=============================================

There are some things we can do to make pandas use less memory and perform
computations significantly faster on string data.