storing & exchange of categorical dtypes #41

Open
rgommers opened this issue Apr 8, 2021 · 0 comments

Categorical dtypes

xref gh-26 for some discussion on categorical dtypes.

What it looks like in different libraries

Pandas

The dtype is called category there. See pandas.Categorical docs:

>>> df = pd.DataFrame({"A": [1, 2, 5, 1]})
>>> df["B"] = df["A"].astype("category")

>>> df.dtypes
A       int64
B    category
dtype: object

>>> col = df['B']
>>> col.dtype
CategoricalDtype(categories=[1, 2, 5], ordered=False)

>>> col.values.ordered
False
>>> col.values.codes
array([0, 1, 2, 0], dtype=int8)
>>> col.values.categories
Int64Index([1, 2, 5], dtype='int64')
>>> col.values.categories.values
array([1, 2, 5])
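
The output above shows the two pieces pandas keeps for a categorical column: integer codes and the array of categories. The following is a minimal pure-Python sketch of that encoding, assuming categories are the sorted unique values (which is what the output above suggests for sortable data). The function names `encode_categorical` and `decode_categorical` are hypothetical and are not part of pandas.

```python
def encode_categorical(values):
    """Split values into (codes, categories), categories sorted ascending."""
    categories = sorted(set(values))
    index_of = {cat: i for i, cat in enumerate(categories)}
    codes = [index_of[v] for v in values]
    return codes, categories


def decode_categorical(codes, categories):
    """Rebuild the original values by looking each code up in categories."""
    return [categories[c] for c in codes]


codes, categories = encode_categorical([1, 2, 5, 1])
print(codes)                                  # [0, 1, 2, 0]
print(categories)                             # [1, 2, 5]
print(decode_categorical(codes, categories))  # [1, 2, 5, 1]
```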

Apache Arrow

The dtype is called "dictionary-encoded" in Arrow, so a column with a categorical dtype is called a "dictionary-encoded array" there.
See https://arrow.apache.org/docs/format/CDataInterface.html#structure-definitions for details.

A practical example (from @kkraus14 in gh-38), for a categorical column of
['gold', 'bronze', 'silver', null, 'bronze', 'silver', 'gold'] with categories of
['gold' < 'silver' < 'bronze']:

categorical column: {
    mask_buffer: [119], # 01110111 in binary
    data_buffer: [0, 2, 1, 127, 2, 1, 0], # the 127 value in here is undefined since it's null
    children: [
        string column: {
            mask_buffer: None,
            offsets_buffer: [0, 4, 10, 16],
            data_buffer: [103, 111, 108, 100, 115, 105, 108, 118, 101, 114, 98, 114, 111, 110, 122, 101]
        }
    ]
}

struct ArrowSchema {
  // Array type description
  const char* format;
  const char* name;
  const char* metadata;
  int64_t flags;
  int64_t n_children;
  struct ArrowSchema** children;
  struct ArrowSchema* dictionary;  // the categories
  ...
};

struct ArrowArray {
  // Array data description
  int64_t length;
  int64_t null_count;
  int64_t offset;
  int64_t n_buffers;
  int64_t n_children;
  const void** buffers;
  struct ArrowArray** children;
  struct ArrowArray* dictionary;
  ...
};
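
To make the buffer layout in the practical example above concrete, here is a rough pure-Python sketch that decodes those exact buffers back into values. It assumes Arrow's least-significant-bit-first validity bitmap and UTF-8 string data; the helper names `is_valid` and `decode_strings` are made up for illustration and are not an Arrow API.

```python
def is_valid(mask_buffer, i):
    """Element i is non-null if bit (i % 8) of byte (i // 8) is set."""
    return ((mask_buffer[i // 8] >> (i % 8)) & 1) == 1


def decode_strings(offsets, data):
    """Slice the UTF-8 data buffer using consecutive offset pairs."""
    raw = bytes(data)
    return [
        raw[offsets[i]:offsets[i + 1]].decode("utf-8")
        for i in range(len(offsets) - 1)
    ]


# Buffers copied from the example above.
mask_buffer = [119]  # 0b01110111
codes = [0, 2, 1, 127, 2, 1, 0]
offsets = [0, 4, 10, 16]
data = [103, 111, 108, 100, 115, 105, 108, 118, 101, 114,
        98, 114, 111, 110, 122, 101]

categories = decode_strings(offsets, data)
column = [
    categories[code] if is_valid(mask_buffer, i) else None
    for i, code in enumerate(codes)
]
print(categories)  # ['gold', 'silver', 'bronze']
print(column)      # ['gold', 'bronze', 'silver', None, 'bronze', 'silver', 'gold']
```

Note that the code at index 3 (127) is never looked up, because the mask marks that element as null.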

Also see https://arrow.apache.org/docs/python/data.html#dictionary-arrays for what PyArrow does - it matches the current exchange protocol more closely than the Arrow C Data Interface. E.g., it uses an actual Python dictionary for the mapping of values to categories.

Vaex

EDIT: Vaex's API was designed before its Arrow integration, and will change to match Arrow in the future.

>>> import vaex
>>> df = vaex.from_arrays(year=[2012, 2015, 2019], weekday=[0, 4, 6])
>>> df = df.categorize('year', min_value=2020, max_value=2019)
>>> df = df.categorize('weekday', labels=['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'])
>>>
>>> df.dtypes
year       int64
weekday    int64
dtype: object
>>> df.is_category('year')
True
>>> df.is_category('weekday')
True
>>> df._categories
{'year': {'labels': [], 'N': 0, 'min_value': 2020}, 'weekday': {'labels': ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'], 'N': 7, 'min_value': 0}}

Other libraries

  • Modin follows Pandas
  • Dask follows Pandas
  • Koalas does not support categorical dtypes at all

Exchange protocol

This is the current form in gh-38 for the Pandas implementation of the exchange protocol:

>>> col = df.__dataframe__().get_column_by_name('B')
>>> col
<__main__._PandasColumn object at 0x7f0202973211>
>>> col.dtype  # kind, bitwidth, format-string, endianness
(23, 64, '|O08', '=')

>>> col.describe_categorical  # is_ordered, is_dictionary, mapping
(False, True, {0: 1, 1: 2, 2: 5})

>>> col.describe_null  # kind (2 = sentinel value), value
(2, -1)
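
As a sketch of how a consumer might combine these attributes, the snippet below decodes a list of codes using the `describe_categorical` and `describe_null` tuples shown above. The function `decode_codes` is hypothetical and not part of gh-38, and the trailing `-1` code is added here only to illustrate the sentinel.

```python
SENTINEL_KIND = 2  # "2 = sentinel value", as in the describe_null output above


def decode_codes(codes, describe_categorical, describe_null):
    """Map codes to category values, turning the null sentinel into None."""
    is_ordered, is_dictionary, mapping = describe_categorical
    null_kind, null_value = describe_null
    if not is_dictionary:
        raise NotImplementedError("only dictionary-style categoricals handled")
    if null_kind != SENTINEL_KIND:
        raise NotImplementedError("only sentinel-style nulls handled")
    return [None if code == null_value else mapping[code] for code in codes]


values = decode_codes(
    [0, 1, 2, 0, -1],                  # codes for column B, plus one null
    (False, True, {0: 1, 1: 2, 2: 5}),  # describe_categorical
    (2, -1),                            # describe_null
)
print(values)  # [1, 2, 5, 1, None]
```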

Changes needed & discussion points

What we already determined needs changing:

  1. Add get_children() method, and store the mapping that is now in Column.describe_categorical in a child column instead. Note that child columns are also needed for variable-length strings.
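
A rough sketch of what storing the mapping in a child column could look like. `SketchColumn` is a hypothetical stand-in, not the `_PandasColumn` class from gh-38, and the exact shape of `get_children()` is an assumption here.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class SketchColumn:
    """Hypothetical column: a data buffer plus optional child columns."""
    data: List[Any]
    children: List["SketchColumn"] = field(default_factory=list)

    def get_children(self):
        # Return a copy so callers cannot mutate the column's child list.
        return list(self.children)


# The categories live in a child column instead of a Python dict.
categories = SketchColumn(data=[1, 2, 5])
col_b = SketchColumn(data=[0, 1, 2, 0], children=[categories])

(child,) = col_b.get_children()
decoded = [child.data[code] for code in col_b.data]
print(decoded)  # [1, 2, 5, 1]
```

With this layout the parent's data buffer still has one entry per row, and the child column can use whatever dtype the categories need.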

To discuss:

  1. If dtype is the logical dtype of the column, where should the information on how to interpret the actual data buffer be stored? Right now it is not a static attribute; instead, the storage dtype is returned along with the buffer when the buffer is accessed:
    def get_data_buffer(self) -> Tuple[_PandasBuffer, _Dtype]:
        """
        Return the buffer containing the data.
        """
        _k = _DtypeKind
        if self.dtype[0] in (_k.INT, _k.UINT, _k.FLOAT, _k.BOOL):
            buffer = _PandasBuffer(self._col.to_numpy())
            dtype = self.dtype
        elif self.dtype[0] == _k.CATEGORICAL:
            codes = self._col.values.codes
            buffer = _PandasBuffer(codes)
            dtype = self._dtype_from_pandasdtype(codes.dtype)
        else:
            raise NotImplementedError(f"Data type {self._col.dtype} not handled yet")

        return buffer, dtype
  2. What goes in the data buffer on the column? The category-encoded data makes sense, because the buffer needs to be the same size as the column (number of elements), otherwise it would be inconsistent with other dtypes.

    • What happens when the data is strings?
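
For the strings question, one possible arrangement (a sketch only, not something decided in this issue) is to keep integer codes in the column's data buffer and put the string categories in a child column made of an offsets buffer and a UTF-8 data buffer, as in the Arrow example earlier. The function `encode_string_categorical` is hypothetical, and the `-1` null sentinel is an assumption borrowed from the `describe_null` output above.

```python
def encode_string_categorical(values, categories, null_sentinel=-1):
    """Return (codes, offsets, data) for a string categorical column."""
    index_of = {cat: i for i, cat in enumerate(categories)}
    # One code per row, so the buffer length matches the column length.
    codes = [null_sentinel if v is None else index_of[v] for v in values]

    # Child string column: concatenated UTF-8 bytes plus start/end offsets.
    offsets = [0]
    data = bytearray()
    for cat in categories:
        data.extend(cat.encode("utf-8"))
        offsets.append(len(data))
    return codes, offsets, list(data)


codes, offsets, data = encode_string_categorical(
    ['gold', 'bronze', 'silver', None, 'bronze', 'silver', 'gold'],
    ['gold', 'silver', 'bronze'],
)
print(codes)    # [0, 2, 1, -1, 2, 1, 0]
print(offsets)  # [0, 4, 10, 16]
print(len(data))  # 16
```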