
Commit: Incorporate some more feedback from github
wesm committed Aug 24, 2016
1 parent 801259d commit 458ae85
Showing 3 changed files with 53 additions and 9 deletions.
32 changes: 32 additions & 0 deletions source/internal-architecture.rst
@@ -189,6 +189,9 @@ By separating pandas data from the presumption of using a particular physical
data by forming a composite data structure consisting of a NumPy array plus a
bitmap marking the null / not-null values (see the sketch after this list).

- It may end up being a requirement that third-party data structures have a C
or C++ API to be used in pandas.

* We can start to think about improved behavior around data ownership (like
copy-on-write) which may yield many benefits. I will write a dedicated
section about this.
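
A minimal sketch of such a composite structure, assuming only NumPy (the
``MaskedArray`` name and its methods are illustrative, not a proposed API):

.. code-block:: python

   import numpy as np

   class MaskedArray:
       """Illustrative container: values plus a validity mask."""

       def __init__(self, values, valid):
           # A production version would pack ``valid`` into a true bitmap
           # (1 bit per value) rather than one byte per value.
           self.values = np.asarray(values)
           self.valid = np.asarray(valid, dtype=bool)

       def null_count(self):
           return int((~self.valid).sum())

       def sum(self):
           # Null-aware reduction: only the valid entries participate.
           return self.values[self.valid].sum()

   arr = MaskedArray([1, 2, 3, 4], valid=[True, False, True, True])
   arr.null_count()  # 1
   arr.sum()         # 8
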
@@ -502,6 +505,11 @@ to carry out certain operations.
dimensional cases, not just the 2D case, so that even Series has a lean
"SingleBlockManager" internally.

Another motivation for the BlockManager was to be able to create DataFrame
objects with zero copy from two-dimensional NumPy arrays. See Jeff Reback's
`exposition on this
<http://nbviewer.jupyter.org/github/jreback/PandasTalks/blob/master/performance/may_2016/1.%20storage.ipynb>`_.
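
A small illustration of that zero-copy behavior (whether a copy is actually
made depends on the pandas version and copy / copy-on-write settings, so treat
this as a sketch rather than a guarantee):

.. code-block:: python

   import numpy as np
   import pandas as pd

   arr = np.arange(12, dtype='float64').reshape(3, 4)

   # Constructing a DataFrame from a homogeneous 2-D ndarray could historically
   # wrap the array in a single block without copying it.
   df = pd.DataFrame(arr)

   # If no copy was made, the frame's values alias the original array.
   np.shares_memory(arr, df.values)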

Drawbacks of BlockManager
~~~~~~~~~~~~~~~~~~~~~~~~~

@@ -635,6 +643,11 @@ will enable us to:
* Use RAII (exception-safe allocation) and smart pointers (``std::unique_ptr``
and ``std::shared_ptr``) to simplify memory management

* Use the STL (e.g. ``std::unordered_map`` for some hash tables) for standard
data structures, or incorporate other C++ data structures (e.g. from Google
open source libraries) that are more heavily optimized for certain use cases.

* Define performant C++ classes modeling the current internals, with various
mechanisms for code reuse or type-specific dynamic dispatch (i.e. through
template classes, CRTP, or simply virtual functions).
@@ -684,6 +697,25 @@ semantics without much need for manual memory management.
These Array types would be wrapped and exposed to pandas developers (probably
in Cython).

We would also want to provide a public Python API to the ``pandas.Array`` type,
which would be the object returned by ``Series.values``. For example, at
present we have:

.. ipython:: python

   import pandas as pd

   s = pd.Series([1, 2] * 2)
   s
   s.values
   s2 = s.astype('category')
   s2.values
   type(s2.values)

By introducing a consistent base array type, we can eliminate the current
dichotomy between pandas's extension dtypes and built-in NumPy physical dtypes.

We could also define a limited public API for interacting with these data
containers directly.
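
A hypothetical sketch of what a minimal public interface for such a base array
type might look like (the class and method names here are illustrative
assumptions, not an agreed-upon pandas API):

.. code-block:: python

   import numpy as np

   class Array:
       """Illustrative base array: physical data plus an optional validity mask."""

       def __init__(self, data, valid=None):
           self.data = np.asarray(data)
           self.valid = None if valid is None else np.asarray(valid, dtype=bool)

       @property
       def dtype(self):
           # Would return a pandas logical dtype; here we surface the physical one.
           return self.data.dtype

       def isnull(self):
           if self.valid is None:
               return np.zeros(len(self.data), dtype=bool)
           return ~self.valid

       def to_numpy(self):
           # Conversion back to a plain ndarray would be explicit and may copy.
           return np.array(self.data)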

Index types
~~~~~~~~~~~

20 changes: 16 additions & 4 deletions source/removals.rst
@@ -1,8 +1,8 @@
.. _removals:

- ================================
- Code to remove and other ideas
- ================================
+ ===========================
+ Other miscellaneous ideas
+ ===========================

Dropping Python 2 support
=========================
@@ -40,7 +40,7 @@ are very useful to know, such as:
* **Null count**: for data not containing any nulls, the null handling path in
some algorithms can be skipped entirely (see the sketch after this list)


* **Uniqueness**: used in indexes, and can be helpful elsewhere
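
A minimal sketch, for illustration only, of how a cached null count could let
an algorithm skip its null-handling path:

.. code-block:: python

   import numpy as np

   def masked_sum(values, valid, null_count=None):
       # ``null_count`` stands in for metadata cached on the array container.
       if null_count is None:
           null_count = int((~valid).sum())
       if null_count == 0:
           # Fast path: no nulls, so masking can be skipped entirely.
           return values.sum()
       return values[valid].sum()

   values = np.array([1.0, 2.0, 3.0])
   valid = np.array([True, True, True])
   masked_sum(values, valid, null_count=0)  # takes the fast path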

Strided arrays: more trouble than they are worth?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -76,3 +76,15 @@ Some cons:

For me, at least, the cons are not compelling enough to warrant the code
complexity tradeoff.
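
For context, a short NumPy illustration of the tradeoff: a column view of a
C-ordered 2-D array is strided, and forcing contiguous storage requires a copy.

.. code-block:: python

   import numpy as np

   arr = np.zeros((1000, 3))            # C-ordered 2-D array, strides (24, 8)
   col = arr[:, 1]                      # strided view: 24 bytes between elements
   col.flags['C_CONTIGUOUS']            # False

   contig = np.ascontiguousarray(col)   # making it contiguous copies the data
   np.shares_memory(arr, contig)        # False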

Enforcing immutability in GroupBy functions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Side effects from ``groupby`` operations have been a common source of issues or
unintuitive behavior for users.
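
A sketch of the kind of pattern in question (illustrative; the precise behavior
has varied across pandas versions):

.. code-block:: python

   import pandas as pd

   df = pd.DataFrame({'key': ['a', 'a', 'b'], 'value': [1, 2, 3]})

   def demean(group):
       # Mutating the group passed in is a side effect; whether it also
       # affects the parent DataFrame has depended on internal copying
       # details, which is the unintuitive part. Enforcing immutability
       # would push functions like this to return a new object instead.
       group['value'] = group['value'] - group['value'].mean()
       return group

   result = df.groupby('key', group_keys=False).apply(demean)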

Handling of sparse data structures
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It's possible that the sparse types could become first-class logical types,
e.g. ``Sparse[T]``, eliminating the ``Sparse*`` classes.
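
A hypothetical illustration of the idea (the ``Sparse`` class below is not
pandas API, just a sketch of what a parametric logical type could carry):

.. code-block:: python

   class Sparse:
       """Illustrative parametric logical dtype, Sparse[T]."""

       def __init__(self, subtype='float64', fill_value=0.0):
           self.subtype = subtype        # the dense logical type being wrapped
           self.fill_value = fill_value  # the value that is not physically stored

       def __repr__(self):
           return 'Sparse[{}]'.format(self.subtype)

   # Sparseness becomes a property of the dtype rather than requiring separate
   # SparseSeries / SparseArray container classes.
   Sparse('float64')  # Sparse[float64]
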
10 changes: 5 additions & 5 deletions source/strings.rst
@@ -42,9 +42,9 @@ somewhat ad hoc basis.
this is not the default storage mechanism. More on this below.

* Using ``PyString`` objects and ``PyObject*`` NumPy storage adds non-trivial
- overhead (approximately 24 bytes per unique object, see `this exposition
- <http://www.gahcep.com/python-internals-pyobject/>`_ for a deeper drive) to
- each value.
+ overhead (52 bytes in Python 3, slightly less in Python 2, see `this
+ exposition <http://www.gahcep.com/python-internals-pyobject/>`_ for a
+ deeper dive) to each value.
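
This per-object overhead can be observed directly; exact numbers vary by Python
version and build, so the figures below are illustrative:

.. code-block:: python

   import sys

   # Size of the Python object itself, independent of array storage; compare
   # with the 8 bytes a fixed-width primitive value would occupy.
   sys.getsizeof('')        # header and bookkeeping for an empty str
   sys.getsizeof('pandas')  # header plus roughly one byte per ASCII character

   # A ``PyObject*`` NumPy array additionally stores an 8-byte pointer per value.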

Possible solution: new non-NumPy string memory layout
=====================================================
@@ -97,8 +97,8 @@ Here's an example of what the data would look like:
Some benefits of this approach include:

* Much better data locality for low-cardinality categorical data
- * 8.125 bytes (8 bytes plus 1 bit) of memory overhead per value versus 24 bytes
- (the current)
+ * 8.125 bytes (8 bytes plus 1 bit) of memory overhead per value versus 33 to 52
+ bytes (the current).
* The data is already categorical: a cast to ``category`` dtype can be performed
very cheaply and without duplicating the underlying string memory buffer
* Computations like ``groupby`` on dictionary-encoded strings will be as