diff --git a/source/goals.rst b/source/goals.rst
index cc2ff6467..a38d380da 100644
--- a/source/goals.rst
+++ b/source/goals.rst
@@ -51,37 +51,24 @@ familiar with some of these internal details, particular around
 performance and memory use, and so the degree to which users are impacted will
 vary quite a lot.
 
-Key areas of work
-=================
-
-Possible changes or improvements to pandas's internals fall into a number of
-different buckets to be explored in great detail:
-
-* **Decoupling from NumPy while preserving interoperability**: by eliminating
-  the presumption that pandas objects internally must contain data stored in
-  NumPy ``ndarray`` objects, we will be able to bring more consistency to
-  pandas's semantics and enable the core developers to extend pandas more
-  cleanly with new data types, data structures, and computational semantics.
-* **Exposing a pandas Cython and/or C/C++ API to other Python library
-  developers**: the internals of Series and DataFrame are only weakly
-  accessible in other developers' native code. At minimum, we wish to better
-  enable developers to construct the precise data structures / memory
-  representation that fill the insides of Series and DataFrame.
-* **Improving user control and visibility of memory use**: pandas's memory use,
-  as a result of its internal implementation, can frequently be opaque to the
-  user or outright unpredictable.
-* **Improving performance and system utilization**: We aim to improve both the
-  micro (operations that take < 1 ms) and macro (all other operations)
-  performance of pandas across the board. As part of this, we aim to make it
-  easier for pandas's core developers to leverage multicore systems to
-  accelerate computations (without running into any of Python's well-known
-  concurrency limitations)
-* **Removal of deprecated / underutilized functionality**: As the Python data
-  ecosystem has grown, a number of areas of pandas (e.g. plotting and datasets
-  with more than 2 dimensions) may be better served by other open source
-  projects. Also, functionality that has been explicitly deprecated or
-  discouraged from use (like the ``.ix`` indexing operator) would ideally be
-  removed.
+Goals
+=====
+
+Some high-level goals of the pandas 2.0 plan include the following:
+
+* Fixing long-standing limitations or inconsistencies in missing data: null
+  values in integer and boolean data, and a more consistent notion of null /
+  NA.
+* Improved performance and utilization of multicore systems
+* Better user control / visibility of memory usage (which can be opaque and
+  difficult to control)
+* Clearer semantics around non-NumPy data types, and permitting new pandas-only
+  data types to be added
+* Exposing a "libpandas" C/C++ API to other Python library developers: the
+  internals of Series and DataFrame are only weakly accessible in other
+  developers' native code. This has been a limitation for scikit-learn and
+  other projects requiring C or Cython-level access to pandas object data.
+* Removal of deprecated functionality
 
 Non-goals / FAQ
 ===============
diff --git a/source/internal-architecture.rst b/source/internal-architecture.rst
index 8e89612dc..5eb07fca5 100644
--- a/source/internal-architecture.rst
+++ b/source/internal-architecture.rst
@@ -288,9 +288,10 @@ Preserving NumPy interoperability
 Some of types of intended interoperability between NumPy and pandas are as
 follows:
 
-* Users can obtain the a ``numpy.ndarray`` (possibly a view depending on the
-  internal block structure, more on this soon) in constant time and without
-  copying the actual data. This has a couple other implications
+* **Access to internal data**: Users can obtain a ``numpy.ndarray``
+  (possibly a view depending on the internal block structure, more on this
+  soon) in constant time and without copying the actual data. This has a couple
+  other implications
 
   * Changes made to this array will be reflected in the source pandas object.
   * If you write C extension code (possibly in Cython) and respect pandas's
@@ -298,22 +299,23 @@ follows:
     pandas data (but it's somewhat inflexible -- see the latest discussion on
     adding a native code API to pandas).
 
-* NumPy ufuncs (like ``np.sqrt`` or ``np.log``) can be invoked on
+* **Ufuncs**: NumPy ufuncs (like ``np.sqrt`` or ``np.log``) can be invoked on
   pandas objects like Series and DataFrame
 
-* ``numpy.asarray`` will always yield some array, even if it discards metadata
-  or has to create a new array. For example ``asarray`` invoked on
-  ``pandas.Categorical`` yields a reconstructed array (rather than either the
-  categories or codes internal arrays)
+* **Array protocol**: ``numpy.asarray`` will always yield some array, even if
+  it discards metadata or has to create a new array. For example ``asarray``
+  invoked on ``pandas.Categorical`` yields a reconstructed array (rather than
+  either the categories or codes internal arrays)
 
-* Many NumPy methods designed to work on subclasses (or duck-typed classes) of
-  ``ndarray`` may be used. For example ``numpy.sum`` may be used on a Series
-  even though it does not invoke NumPy's internal C sum algorithm. This means
-  that a Series may be used as an interchangeable argument in a large set of
-  functions that only know about NumPy arrays.
+* **Interchangeability**: Many NumPy methods designed to work on subclasses (or
+  duck-typed classes) of ``ndarray`` may be used. For example ``numpy.sum`` may
+  be used on a Series even though it does not invoke NumPy's internal C sum
+  algorithm. This means that a Series may be used as an interchangeable
+  argument in a large set of functions that only know about NumPy arrays.
 
 By and large, I think much of this can be preserved, but there will be some API
-breakage.
+breakage. In particular, interchangeability is not something we can or should
+guarantee.
 
 If we add more composite data structures (Categorical can be thought of as
 one existing composite data structure) to pandas or alternate non-NumPy data