Add section about logical to physical correspondence

wesm · Aug 23, 2016 · 94d0281
1 parent 7a44f8f
commit 94d0281

Showing 4 changed files with 93 additions and 14 deletions.
diff --git a/source/conf.py b/source/conf.py
@@ -48,9 +48,9 @@
 master_doc = 'index'
 
 # General information about the project.
-project = "Wes's pandas 2.0 Design Docs"
-copyright = '2016, Wes McKinney'
-author = 'Wes McKinney'
+project = "pandas 2.0 Design Docs"
+copyright = '2016, pandas Development Team'
+author = 'pandas Development Team'
 
 # The version info for the project you're documenting, acts as replacement for
 # |version| and |release|, also used in various other places throughout the
@@ -143,7 +143,7 @@
 # Add any paths that contain custom static files (such as style sheets) here,
 # relative to this directory. They are copied after the builtin static files,
 # so a file named "default.css" will overwrite the builtin "default.css".
-html_static_path = ['_static']
+html_static_path = []
 
 # Add any extra paths that contain custom files (such as robots.txt or
 # .htaccess) here, relative to this directory. These files are copied
@@ -229,7 +229,7 @@
 #  author, documentclass [howto, manual, or own class]).
 latex_documents = [
     (master_doc, 'pandas20DesignDocs.tex', 'pandas 2.0 Design Docs Documentation',
-     'Wes McKinney', 'manual'),
+     'pandas Development Team', 'manual'),
 ]
 
 # The name of an image file (relative to this directory) to place at the top of

diff --git a/source/index.rst b/source/index.rst
@@ -1,5 +1,5 @@
-Wes's pandas 2.0 Design Documents
-=================================
+pandas 2.0 Design Documents
+===========================
 
 These are a set of documents, based on discussions started in December 2015, to
 assist with discussions around changes to Python pandas's internal design

diff --git a/source/internal-architecture.rst b/source/internal-architecture.rst
@@ -8,9 +8,9 @@
    np.set_printoptions(precision=4, suppress=True)
    pd.options.display.max_rows = 100
 
-===============================
- Internal Architecture Changes
-===============================
+===================================
+ Internals: Data structure changes
+===================================
 
 Logical types and Physical Storage Decoupling
 =============================================
@@ -203,6 +203,85 @@ we've chosen for pandas, and elsewhere we can invoke pandas-specific code.
 A major concern here based on these ideas is **preserving NumPy
 interoperability**, so I'll examine this topic in some detail next.
 
+Correspondence between logical and physical types
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+* **Floating point numbers**
+
+  - Logical: ``Float16/32/64``
+  - Physical: ``numpy.float16/32/64``, with ``NaN`` for null (for backwards
+    compatibility)
+
+* **Signed Integers**
+
+  - Logical: ``Int8/16/32/64``
+  - Physical: ``numpy.int8/16/32/64`` array plus nullness bitmap
+
+* **Unsigned Integers**
+
+  - Logical: ``UInt8/16/32/64``
+  - Physical: ``numpy.uint8/16/32/64`` array plus nullness bitmap
+
+* **Boolean**
+
+  - Logical: ``Boolean``
+  - Physical: ``np.bool_`` (one byte per value, like ``np.uint8``) array plus
+    nullness bitmap. We may also explore bit storage (versus bytes).
+
+* **Categorical**
+
+  - Logical: ``Categorical[T]``, where ``T`` is any other logical type
+  - Physical: this type is a composition of an ``Int8`` through ``Int64``
+    (depending on the cardinality of the categories) plus the categories
+    array. These have the same physical representation as
+
+* **String and Binary**
+
+  - Logical: ``String`` and ``Binary``
+  - Physical: Dictionary-encoded representation for UTF-8 and general binary
+    data as described in the :doc:`string section <strings>`.
+
+* **Timestamp**
+
+  - Logical: ``Timestamp[unit]``, where unit is the resolution. Nanoseconds can
+    continue to be the default unit for now.
+  - Physical: ``numpy.int64``, with ``INT64_MIN`` as the null value.
+
+* **Timedelta**
+
+  - Logical: ``Timedelta[unit]``, where unit is the resolution
+  - Physical: ``numpy.int64``, with ``INT64_MIN`` as the null value.
+
+* **Period**
+
+  - Logical: ``Period[unit]``, where unit is the resolution
+  - Physical: ``numpy.int64``, with ``INT64_MIN`` as the null value.
+
+* **Interval**
+
+  - Logical: ``Interval``
+  - Physical: two arrays of ``Timestamp[U]`` -- these may need to be forced to
+    both be the same resolution
+
+* **Python objects** (catch-all for other data types)
+
+  - Logical: ``Object``
+  - Physical: ``numpy.object_`` array, with ``None`` for null values (perhaps
+    with ``np.NaN`` also for backwards compatibility)
+
+* **Complex numbers**
+
+  - Logical: ``Complex64/128``
+  - Physical: ``numpy.complex64/128``, with ``NaN`` for null (for backwards
+    compatibility)
+
+Some notes on these:
+
+- While a pandas (logical) type may map onto one or more physical
+  representations, in general NumPy types will map directly onto a pandas
+  type. Thus, existing code involving ``numpy.dtype``-like objects (such as
+  ``'f8'`` or ``numpy.float64``) will continue to work.
+
 Preserving NumPy interoperability
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
@@ -318,7 +397,7 @@ bitmap** (which the user never sees). This has numerous benefits:
 Notably, this is the way that PostgreSQL handles null values. For example, we
 might have:
 
-.. code-block::
+.. code-block:: text
 
    [0, 1, 2, NA, NA, 5, 6, NA]
 

diff --git a/source/strings.rst b/source/strings.rst
@@ -8,9 +8,9 @@
    np.set_printoptions(precision=4, suppress=True)
    pd.options.display.max_rows = 100
 
-==================================
- Enhanced string / UTF-8 handling
-==================================
+=============================================
+ Internals: Enhanced string / UTF-8 handling
+=============================================
 
 There are some things we can do to make pandas use less memory and perform
 computations significantly faster on string data.
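
For readers of the new "Correspondence between logical and physical types" section, here is a rough sketch of the two null conventions it lists: an integer array paired with a nullness bitmap, and an ``int64`` array that uses ``INT64_MIN`` as the null sentinel. The helper names ``make_nullable_int`` and ``is_null`` are hypothetical and exist only for illustration; the bit order (least significant bit first) and the "set bit means valid" convention are assumptions, not something the design doc specifies.

```python
import numpy as np


def make_nullable_int(values, dtype=np.int64):
    """Hypothetical helper: build (data, bitmap) from a list with None for nulls.

    Assumption: a set bit means "value present", packed least significant
    bit first. Null slots hold an arbitrary placeholder (0) in the data array.
    """
    valid = np.array([v is not None for v in values], dtype=bool)
    data = np.array([0 if v is None else v for v in values], dtype=dtype)
    bitmap = np.packbits(valid, bitorder="little")
    return data, bitmap


def is_null(bitmap, i):
    """Hypothetical helper: slot i is null when bit i of the bitmap is unset."""
    return not ((bitmap[i // 8] >> (i % 8)) & 1)


# The example array from the doc: [0, 1, 2, NA, NA, 5, 6, NA]
data, bitmap = make_nullable_int([0, 1, 2, None, None, 5, 6, None])

# Timestamp-like data: int64 storage where INT64_MIN marks a null.
# NumPy's own NaT already uses this bit pattern.
INT64_MIN = np.iinfo(np.int64).min
ts = np.array(["2016-08-23", "NaT"], dtype="datetime64[ns]")
physical = ts.view(np.int64)
ts_null_mask = physical == INT64_MIN
```

Under these assumptions the eight-element example needs a single bitmap byte, while the timestamp array needs no side structure at all because the sentinel lives in the data itself.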