This repository has been archived by the owner on Apr 10, 2024. It is now read-only.

Make goals section leaner / more concise per comments
wesm committed Aug 23, 2016
1 parent 94d0281 commit 801259d
Showing 2 changed files with 34 additions and 45 deletions.
49 changes: 18 additions & 31 deletions source/goals.rst
@@ -51,37 +51,24 @@ familiar with some of these internal details, particular around performance and
memory use, and so the degree to which users are impacted will vary quite a
lot.

Key areas of work
=================

Possible changes or improvements to pandas's internals fall into a number of
different buckets to be explored in great detail:

* **Decoupling from NumPy while preserving interoperability**: by eliminating
the presumption that pandas objects internally must contain data stored in
NumPy ``ndarray`` objects, we will be able to bring more consistency to
pandas's semantics and enable the core developers to extend pandas more
cleanly with new data types, data structures, and computational semantics.
* **Exposing a pandas Cython and/or C/C++ API to other Python library
developers**: the internals of Series and DataFrame are only weakly
accessible in other developers' native code. At minimum, we wish to better
enable developers to construct the precise data structures / memory
representation that fill the insides of Series and DataFrame.
* **Improving user control and visibility of memory use**: pandas's memory use,
as a result of its internal implementation, can frequently be opaque to the
user or outright unpredictable.
* **Improving performance and system utilization**: We aim to improve both the
micro (operations that take < 1 ms) and macro (all other operations)
performance of pandas across the board. As part of this, we aim to make it
easier for pandas's core developers to leverage multicore systems to
accelerate computations (without running into any of Python's well-known
concurrency limitations)
* **Removal of deprecated / underutilized functionality**: As the Python data
ecosystem has grown, a number of areas of pandas (e.g. plotting and datasets
with more than 2 dimensions) may be better served by other open source
projects. Also, functionality that has been explicitly deprecated or
discouraged from use (like the ``.ix`` indexing operator) would ideally be
removed.
Goals
=====

Some high-level goals of the pandas 2.0 plan include the following:

* Fixing long-standing limitations or inconsistencies in missing data: null
values in integer and boolean data, and a more consistent notion of null /
NA.
* Improved performance and utilization of multicore systems
* Better user control / visibility of memory usage (which can be opaque and
difficult to control)
* Clearer semantics around non-NumPy data types, and permitting new pandas-only
data types to be added
* Exposing a "libpandas" C/C++ API to other Python library developers: the
internals of Series and DataFrame are only weakly accessible in other
developers' native code. This has been a limitation for scikit-learn and
other projects requiring C or Cython-level access to pandas object data.
* Removal of deprecated functionality
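
The missing-data and memory-visibility goals above can be illustrated with a minimal sketch against current, NumPy-backed pandas. The variable names are illustrative only, and the exact dtypes and byte counts are assumptions that may vary with the pandas version in use:

```python
import pandas as pd

# An integer Series without nulls keeps an integer dtype.
ints = pd.Series([1, 2, 3])

# Introducing a null forces an upcast to floating point, because a
# NumPy integer array has no way to represent a missing value.
ints_with_null = pd.Series([1, 2, None])

# Boolean data with a null falls back to a generic object dtype.
bools_with_null = pd.Series([True, False, None])

print(ints.dtype)
print(ints_with_null.dtype)
print(bools_with_null.dtype)

# Memory use can be opaque: the shallow figure may omit the memory held
# by the Python string objects themselves, while deep=True includes it.
labels = pd.Series(["alpha", "beta", "gamma"] * 1000)
shallow = labels.memory_usage(deep=False)
deep = labels.memory_usage(deep=True)
print(shallow, deep)
```

The silent change of dtype in the second and third Series is the kind of inconsistency the first goal refers to.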

Non-goals / FAQ
===============
30 changes: 16 additions & 14 deletions source/internal-architecture.rst
@@ -288,32 +288,34 @@ Preserving NumPy interoperability
Some of the types of intended interoperability between NumPy and pandas are as
follows:

* Users can obtain a ``numpy.ndarray`` (possibly a view depending on the
internal block structure, more on this soon) in constant time and without
copying the actual data. This has a couple of other implications:
* **Access to internal data**: Users can obtain a ``numpy.ndarray``
(possibly a view depending on the internal block structure, more on this
soon) in constant time and without copying the actual data. This has a couple
of other implications:

* Changes made to this array will be reflected in the source pandas object.
* If you write C extension code (possibly in Cython) and respect pandas's
missing data details, you can invoke certain kinds of fast custom code on
pandas data (but it's somewhat inflexible -- see the latest discussion on
adding a native code API to pandas).

* NumPy ufuncs (like ``np.sqrt`` or ``np.log``) can be invoked on
* **Ufuncs**: NumPy ufuncs (like ``np.sqrt`` or ``np.log``) can be invoked on
pandas objects like Series and DataFrame

* ``numpy.asarray`` will always yield some array, even if it discards metadata
or has to create a new array. For example ``asarray`` invoked on
``pandas.Categorical`` yields a reconstructed array (rather than either the
categories or codes internal arrays)
* **Array protocol**: ``numpy.asarray`` will always yield some array, even if
it discards metadata or has to create a new array. For example ``asarray``
invoked on ``pandas.Categorical`` yields a reconstructed array (rather than
either the categories or codes internal arrays)

* Many NumPy methods designed to work on subclasses (or duck-typed classes) of
``ndarray`` may be used. For example ``numpy.sum`` may be used on a Series
even though it does not invoke NumPy's internal C sum algorithm. This means
that a Series may be used as an interchangeable argument in a large set of
functions that only know about NumPy arrays.
* **Interchangeability**: Many NumPy methods designed to work on subclasses (or
duck-typed classes) of ``ndarray`` may be used. For example ``numpy.sum`` may
be used on a Series even though it does not invoke NumPy's internal C sum
algorithm. This means that a Series may be used as an interchangeable
argument in a large set of functions that only know about NumPy arrays.

By and large, I think much of this can be preserved, but there will be some API
breakage.
breakage. In particular, interchangeability is not something we can or should
guarantee.
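
The four kinds of interoperability listed above can be sketched as follows. This is a minimal illustration against current pandas, not a statement of what pandas 2.0 would guarantee; the variable names are illustrative, and whether ``.values`` returns a view or a copy is an assumption that depends on the dtype and pandas version:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 4.0, 9.0], index=["a", "b", "c"])

# Access to internal data: .values hands back a plain numpy.ndarray.
values = s.values

# Ufuncs: np.sqrt accepts a Series and returns a Series, keeping the index.
roots = np.sqrt(s)

# Array protocol: asarray on a Categorical reconstructs the full array of
# values rather than exposing the categories or the integer codes.
cat = pd.Categorical(["x", "y", "x"])
reconstructed = np.asarray(cat)

# Interchangeability: np.sum accepts a Series even though the reduction
# is carried out by pandas rather than NumPy's own C sum loop.
total = np.sum(s)

print(type(values).__name__)
print(roots)
print(reconstructed)
print(total)
```

Code that relies on the last pattern is the most likely to be affected if interchangeability is no longer guaranteed.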

If we add more composite data structures (Categorical can be thought of as
one existing composite data structure) to pandas or alternate non-NumPy data
