From 458ae85502c9624290db325a49549dd687174775 Mon Sep 17 00:00:00 2001
From: Wes McKinney
Date: Tue, 23 Aug 2016 17:02:38 -0700
Subject: [PATCH] Incorporate some more feedback from github

---
 source/internal-architecture.rst | 32 ++++++++++++++++++++++++++++++++
 source/removals.rst              | 20 ++++++++++++++++----
 source/strings.rst               | 10 +++++-----
 3 files changed, 53 insertions(+), 9 deletions(-)

diff --git a/source/internal-architecture.rst b/source/internal-architecture.rst
index 5eb07fca5..d3bc5d270 100644
--- a/source/internal-architecture.rst
+++ b/source/internal-architecture.rst
@@ -189,6 +189,9 @@ By separating pandas data from the presumption of using a particular physical
   data by forming a composite data structure consisting of a NumPy array plus
   a bitmap marking the null / not-null values.
 
+  - It may end up being a requirement that third-party data structures have a
+    C or C++ API in order to be used in pandas.
+
 * We can start to think about improved behavior around data ownership (like
   copy-on-write) which may yield many benefits. I will write a dedicated
   section about this.
@@ -502,6 +505,11 @@ to carry out certain operations.
   dimensional cases, not just the 2D case, so that even Series has a lean
   "SingleBlockManager" internally.
 
+Another motivation for the BlockManager was to be able to create DataFrame
+objects with zero copy from two-dimensional NumPy arrays. See Jeff Reback's
+`exposition on this
+`_.
+
 Drawbacks of BlockManager
 ~~~~~~~~~~~~~~~~~~~~~~~~~
 
@@ -635,6 +643,11 @@ will enable us to:
 * Use RAII (exception-safe allocation) and smart pointers (``std::unique_ptr``
   and ``std::shared_ptr``) to simplify memory management
 
+* Use the STL (e.g. ``std::unordered_map`` for some hash tables) for standard
+  data structures, or incorporate other C++ data structures (e.g. from Google
+  open source libraries) that have been more heavily optimized for certain
+  use cases
+
 * Define performant C++ classes modeling the current internals, with various
   mechanisms for code reuse or type-specific dynamic dispatch (i.e. through
   template classes, CRTP, or simply virtual functions).
@@ -684,6 +697,25 @@
 semantics without much need for manual memory management. These Array types
 would be wrapped and exposed to pandas developers (probably in Cython).
 
+We would also want to provide a public Python API to the ``pandas.Array`` type,
+which would be the object returned by ``Series.values``. For example, at
+present we have:
+
+.. ipython:: python
+
+   s = pd.Series([1, 2] * 2)
+   s
+   s.values
+   s2 = s.astype('category')
+   s2.values
+   type(s2.values)
+
+By introducing a consistent base array type, we can eliminate the current
+dichotomy between pandas's extension dtypes and built-in NumPy physical dtypes.
+
+We could also define a limited public API for interacting with these data
+containers directly.
+
 Index types
 ~~~~~~~~~~~
 
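A minimal sketch of the zero-copy construction described in the new BlockManager
paragraph above (an illustration added for this review, not part of the patch);
it assumes a homogeneous dtype and pandas's default, pre-copy-on-write
constructor semantics, which may differ in later pandas versions:

.. code-block:: python

   import numpy as np
   import pandas as pd

   # A homogeneous 2D array can back a DataFrame as a single
   # consolidated block without copying the underlying buffer.
   arr = np.zeros((3, 4), dtype='float64')
   df = pd.DataFrame(arr)

   # Without copy-on-write, the block is a view onto the original
   # array, so no memory is duplicated ...
   print(np.shares_memory(arr, df.values))  # typically True

   # ... and mutations of the source array show through the frame.
   arr[0, 0] = 99.0
   print(df.iloc[0, 0])                     # typically 99.0
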
diff --git a/source/removals.rst b/source/removals.rst
index 5f10485b3..2b1acac55 100644
--- a/source/removals.rst
+++ b/source/removals.rst
@@ -1,8 +1,8 @@
 .. _removals:
 
-================================
- Code to remove and other ideas
-================================
+===========================
+ Other miscellaneous ideas
+===========================
 
 Dropping Python 2 support
 =========================
@@ -40,7 +40,7 @@ are very useful to know, such as:
 
 * **Null count**: for data not containing any nulls, the null handling path in
   some algorithms can be skipped entirely
-
+* **Uniqueness**: used in indexes, and can be helpful elsewhere
 
 Strided arrays: more trouble than they are worth?
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -76,3 +76,15 @@ Some cons:
 
 For me, at least, I don't find the cons compelling enough to warrant the code
 complexity tradeoff.
+
+Enforcing immutability in GroupBy functions
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Side effects from ``groupby`` operations have been a common source of issues
+and unintuitive behavior for users.
+
+Handling of sparse data structures
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+It's possible that the sparse types could become first-class logical types,
+e.g. ``Sparse[T]``, eliminating the ``Sparse*`` classes.
diff --git a/source/strings.rst b/source/strings.rst
index 425c80bbe..8a8a955a2 100644
--- a/source/strings.rst
+++ b/source/strings.rst
@@ -42,9 +42,9 @@ somewhat ad hoc basis.
   this is not the default storage mechanism. More on this below.
 
 * Using ``PyString`` objects and ``PyObject*`` NumPy storage adds non-trivial
-  overhead (approximately 24 bytes per unique object, see `this exposition
-  `_ for a deeper drive) to
-  each value.
+  overhead (52 bytes in Python 3, slightly less in Python 2, see `this
+  exposition `_ for a
+  deeper dive) to each value.
 
 Possible solution: new non-NumPy string memory layout
 =====================================================
@@ -97,8 +97,8 @@ Here's an example of what the data would look like:
 Some benefits of this approach include:
 
 * Much better data locality for low-cardinality categorical data
-* 8.125 bytes (8 bytes plus 1 bit) of memory overhead per value versus 24 bytes
-  (the current)
+* 8.125 bytes (8 bytes plus 1 bit) of memory overhead per value, versus the
+  current 33 to 52 bytes
 * The data is already categorical: cast to ``category`` dtype can be performed
   very cheaply and without duplicating the underlying string memory buffer
 * Computations like ``groupby`` on dictionary-encoded strings will be as
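A rough way to reproduce the per-string overhead figures cited in the
``strings.rst`` hunk above (an illustrative aside, not part of the patch; exact
byte counts vary with the Python version and build):

.. code-block:: python

   import sys

   import numpy as np

   # CPython 3 stores ASCII text in a compact layout: an empty str pays
   # a fixed header cost and each character adds roughly one more byte.
   print(sys.getsizeof(''))     # ~49 bytes on a typical 64-bit build
   print(sys.getsizeof('abc'))  # ~52 bytes

   # Each slot of a PyObject* NumPy array also holds an 8-byte pointer
   # to one of these heap-allocated objects.
   print(np.empty(1, dtype=object).itemsize)  # 8 on 64-bit platforms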