
Commit: Incorporate some more feedback from github
wesm committed Aug 24, 2016
1 parent 801259d commit 458ae85
Showing 3 changed files with 53 additions and 9 deletions.
32 changes: 32 additions & 0 deletions source/internal-architecture.rst
@@ -189,6 +189,9 @@ By separating pandas data from the presumption of using a particular physical
data by forming a composite data structure consisting of a NumPy array plus a
bitmap marking the null / not-null values (see the sketch after this list).

- It may end up being a requirement that third-party data structures have a C
or C++ API to be used in pandas.

* We can start to think about improved behavior around data ownership (like
copy-on-write) which may yield many benefits. I will write a dedicated
section about this.
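
A minimal sketch of such a composite structure, assuming only NumPy (the
``MaskedArray`` name and its methods are illustrative, not a proposed API):

.. code-block:: python

   import numpy as np

   class MaskedArray:
       """Illustrative container: values plus a validity mask."""

       def __init__(self, values, valid):
           # A production version would pack ``valid`` into a true bitmap
           # (1 bit per value) rather than one byte per value.
           self.values = np.asarray(values)
           self.valid = np.asarray(valid, dtype=bool)

       def null_count(self):
           return int((~self.valid).sum())

       def sum(self):
           # Null-aware reduction: only the valid entries participate.
           return self.values[self.valid].sum()

   arr = MaskedArray([1, 2, 3, 4], valid=[True, False, True, True])
   arr.null_count()  # 1
   arr.sum()         # 8
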
@@ -502,6 +505,11 @@ to carry out certain operations.
dimensional cases, not just the 2D case, so that even Series has a lean
"SingleBlockManager" internally.

Another motivation for the BlockManager was to be able to create DataFrame
objects with zero copy from two-dimensional NumPy arrays. See Jeff Reback's
`exposition on this
<http://nbviewer.jupyter.org/github/jreback/PandasTalks/blob/master/performance/may_2016/1.%20storage.ipynb>`_.
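
A small illustration of that zero-copy behavior (whether a copy is actually
made depends on the pandas version and copy / copy-on-write settings, so treat
this as a sketch rather than a guarantee):

.. code-block:: python

   import numpy as np
   import pandas as pd

   arr = np.arange(12, dtype='float64').reshape(3, 4)

   # Constructing a DataFrame from a homogeneous 2-D ndarray could historically
   # wrap the array in a single block without copying it.
   df = pd.DataFrame(arr)

   # If no copy was made, the frame's values alias the original array.
   np.shares_memory(arr, df.values)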

Drawbacks of BlockManager
~~~~~~~~~~~~~~~~~~~~~~~~~

@@ -635,6 +643,11 @@ will enable us to:
* Use RAII (exception-safe allocation) and smart pointers (``std::unique_ptr``
and ``std::shared_ptr``) to simplify memory management

* Use the STL (e.g. ``std::unordered_map`` for some hash tables) for standard
data structures, or incorporate other C++ data structures (e.g. from Google
open source libraries) that are more heavily optimized for certain use cases.

* Define performant C++ classes modeling the current internals, with various
mechanisms for code reuse or type-specific dynamic dispatch (i.e. through
template classes, CRTP, or simply virtual functions).
@@ -684,6 +697,25 @@ semantics without much need for manual memory management.
These Array types would be wrapped and exposed to pandas developers (probably
in Cython).

We would also want to provide a public Python API to the ``pandas.Array`` type,
which would be the object returned by ``Series.values``. For example, at
present we have:

.. ipython:: python

   import pandas as pd

   s = pd.Series([1, 2] * 2)
   s
   s.values
   s2 = s.astype('category')
   s2.values
   type(s2.values)

By introducing a consistent base array type, we can eliminate the current
dichotomy between pandas's extension dtypes and built-in NumPy physical dtypes.

We could also define a limited public API for interacting with these data
containers directly.
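
A hypothetical sketch of what a minimal public interface for such a base array
type might look like (the class and method names here are illustrative
assumptions, not an agreed-upon pandas API):

.. code-block:: python

   import numpy as np

   class Array:
       """Illustrative base array: physical data plus an optional validity mask."""

       def __init__(self, data, valid=None):
           self.data = np.asarray(data)
           self.valid = None if valid is None else np.asarray(valid, dtype=bool)

       @property
       def dtype(self):
           # Would return a pandas logical dtype; here we surface the physical one.
           return self.data.dtype

       def isnull(self):
           if self.valid is None:
               return np.zeros(len(self.data), dtype=bool)
           return ~self.valid

       def to_numpy(self):
           # Conversion back to a plain ndarray would be explicit and may copy.
           return np.array(self.data)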

Index types
~~~~~~~~~~~

20 changes: 16 additions & 4 deletions source/removals.rst
@@ -1,8 +1,8 @@
.. _removals:

- ================================
- Code to remove and other ideas
- ================================
+ ===========================
+ Other miscellaneous ideas
+ ===========================

Dropping Python 2 support
=========================
@@ -40,7 +40,7 @@ are very useful to know, such as:
* **Null count**: for data not containing any nulls, the null handling path in
some algorithms can be skipped entirely (see the sketch after this list)


* **Uniqueness**: used in indexes, and can be helpful elsewhere
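
A minimal sketch, for illustration only, of how a cached null count could let
an algorithm skip its null-handling path:

.. code-block:: python

   import numpy as np

   def masked_sum(values, valid, null_count=None):
       # ``null_count`` stands in for metadata cached on the array container.
       if null_count is None:
           null_count = int((~valid).sum())
       if null_count == 0:
           # Fast path: no nulls, so masking can be skipped entirely.
           return values.sum()
       return values[valid].sum()

   values = np.array([1.0, 2.0, 3.0])
   valid = np.array([True, True, True])
   masked_sum(values, valid, null_count=0)  # takes the fast path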

Strided arrays: more trouble than they are worth?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -76,3 +76,15 @@ Some cons:

For me, at least, the cons are not compelling enough to warrant the code
complexity tradeoff.
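
For context, a short NumPy illustration of the tradeoff: a column view of a
C-ordered 2-D array is strided, and forcing contiguous storage requires a copy.

.. code-block:: python

   import numpy as np

   arr = np.zeros((1000, 3))            # C-ordered 2-D array, strides (24, 8)
   col = arr[:, 1]                      # strided view: 24 bytes between elements
   col.flags['C_CONTIGUOUS']            # False

   contig = np.ascontiguousarray(col)   # making it contiguous copies the data
   np.shares_memory(arr, contig)        # False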

Enforcing immutability in GroupBy functions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Side effects from ``groupby`` operations have been a common source of issues or
unintuitive behavior for users.
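
A sketch of the kind of pattern in question (illustrative; the precise behavior
has varied across pandas versions):

.. code-block:: python

   import pandas as pd

   df = pd.DataFrame({'key': ['a', 'a', 'b'], 'value': [1, 2, 3]})

   def demean(group):
       # Mutating the group passed in is a side effect; whether it also
       # affects the parent DataFrame has depended on internal copying
       # details, which is the unintuitive part. Enforcing immutability
       # would push functions like this to return a new object instead.
       group['value'] = group['value'] - group['value'].mean()
       return group

   result = df.groupby('key', group_keys=False).apply(demean)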

Handling of sparse data structures
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It's possible that the sparse types could become first-class logical types,
e.g. ``Sparse[T]``, eliminating the ``Sparse*`` classes.
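
A hypothetical illustration of the idea (the ``Sparse`` class below is not
pandas API, just a sketch of what a parametric logical type could carry):

.. code-block:: python

   class Sparse:
       """Illustrative parametric logical dtype, Sparse[T]."""

       def __init__(self, subtype='float64', fill_value=0.0):
           self.subtype = subtype        # the dense logical type being wrapped
           self.fill_value = fill_value  # the value that is not physically stored

       def __repr__(self):
           return 'Sparse[{}]'.format(self.subtype)

   # Sparseness becomes a property of the dtype rather than requiring separate
   # SparseSeries / SparseArray container classes.
   Sparse('float64')  # Sparse[float64]
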
10 changes: 5 additions & 5 deletions source/strings.rst
@@ -42,9 +42,9 @@ somewhat ad hoc basis.
this is not the default storage mechanism. More on this below.

* Using ``PyString`` objects and ``PyObject*`` NumPy storage adds non-trivial
- overhead (approximately 24 bytes per unique object, see `this exposition
- <http://www.gahcep.com/python-internals-pyobject/>`_ for a deeper drive) to
- each value.
+ overhead (52 bytes in Python 3, slightly less in Python 2, see `this
+ exposition <http://www.gahcep.com/python-internals-pyobject/>`_ for a
+ deeper dive) to each value.
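
This per-object overhead can be observed directly; exact numbers vary by Python
version and build, so the figures below are illustrative:

.. code-block:: python

   import sys

   # Size of the Python object itself, independent of array storage; compare
   # with the 8 bytes a fixed-width primitive value would occupy.
   sys.getsizeof('')        # header and bookkeeping for an empty str
   sys.getsizeof('pandas')  # header plus roughly one byte per ASCII character

   # A ``PyObject*`` NumPy array additionally stores an 8-byte pointer per value.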

Possible solution: new non-NumPy string memory layout
=====================================================
@@ -97,8 +97,8 @@ Here's an example of what the data would look like:
Some benefits of this approach include:

* Much better data locality for low-cardinality categorical data
- * 8.125 bytes (8 bytes plus 1 bit) of memory overhead per value versus 24 bytes
- (the current)
+ * 8.125 bytes (8 bytes plus 1 bit) of memory overhead per value versus 33 to 52
+ bytes (the current).
* The data is already categorical: a cast to ``category`` dtype can be performed
very cheaply and without duplicating the underlying string memory buffer
* Computations like ``groupby`` on dictionary-encoded strings will be as