Ensure categorical column order is the same across dask partitions #1239
Fixes #1202.
This fixes datashader's handling of categorical columns in dask partitions. It was broken by a change in dask's reading of categorical columns between releases 2022.7.0 and 2022.7.1 (dask/dask#9264), but the root cause is our slightly inconsistent handling of categorical columns, which we got away with historically but not after that change.

The relevant section of the dask docs is https://docs.dask.org/en/stable/dataframe-categoricals.html. When categorical columns are read from parquet files, dask does not enforce, for performance reasons, that the categories are the same across partitions, so we need to do this ourselves. The docs provide a couple of solutions, both of which incur a performance cost as they involve traversing the whole dataframe.
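To make the problem concrete, here is a minimal sketch using plain pandas DataFrames to stand in for two dask partitions (the partition data and column name are illustrative, not from the PR). Each "partition" only knows the categories it happens to contain, so the category-to-integer-code mapping disagrees between partitions until the categories are aligned to their union, which is what `dask.dataframe.DataFrame.categorize` does across real partitions:

```python
import pandas as pd

# Two pandas DataFrames standing in for dask partitions read from
# separate parquet row groups: each only sees its own categories.
part0 = pd.DataFrame({"cat": pd.Categorical(["a", "b"])})
part1 = pd.DataFrame({"cat": pd.Categorical(["b", "c"])})

# Per-partition category -> code mappings disagree: "b" is code 1 in
# part0 but code 0 in part1.
print(part0["cat"].cat.codes.tolist())  # [0, 1]
print(part1["cat"].cat.codes.tolist())  # [0, 1]

# Aligning both partitions to the union of their categories (the job
# categorize does for a dask DataFrame) makes the codes consistent.
all_cats = part0["cat"].cat.categories.union(part1["cat"].cat.categories)
part0["cat"] = part0["cat"].cat.set_categories(all_cats)
part1["cat"] = part1["cat"].cat.set_categories(all_cats)
print(part0["cat"].cat.codes.tolist())  # [0, 1]
print(part1["cat"].cat.codes.tolist())  # [1, 2]
```

If the inconsistent per-partition codes are used to index into an output axis built from the unified category list, values from one partition land in the wrong bins, which is exactly the symptom reported in #1202.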
Cutting a long story short, the fix is actually a two-liner in datashader. We already correct the categorical columns in `dshape_from_dask()` by calling `categorize`, passing a list of the categorical columns to correct. The corrected `dask.DataFrame` is used in datashader to work out the full list of categories for the categorical dimension of the returned `xarray.DataArray`. But the originally supplied (and therefore uncorrected) `dask.DataFrame` was still used in each individual partition's mapping of category to integer index. So the fix is to use the categorically-corrected `DataFrame` for the remainder of the datashader calculations. As we were already performing the `categorize` call, we incur no performance cost by doing this.

I have added a new explicit test.
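The bug-versus-fix distinction can be sketched in a few lines of pandas (the names here are illustrative, not datashader's internal API): the output axis is built from the category-corrected frames, so per-partition codes must come from those same corrected frames, not the originals.

```python
import pandas as pd

# Illustrative partitions (stand-ins for a dask DataFrame's partitions).
partitions = [
    pd.DataFrame({"cat": pd.Categorical(["a", "b"])}),
    pd.DataFrame({"cat": pd.Categorical(["b", "c"])}),
]

# The "categorize" step: align every partition to the union of categories.
all_cats = pd.api.types.union_categoricals(
    [p["cat"] for p in partitions]
).categories
corrected = [
    p.assign(cat=p["cat"].cat.set_categories(all_cats)) for p in partitions
]

# Before the fix: the output axis uses all_cats, but codes were taken
# from the *original* partitions -- "b" maps to 0 in partition 1, which
# the output axis would misread as "a".
buggy_codes = [p["cat"].cat.codes.tolist() for p in partitions]

# After the fix: codes taken from the corrected partitions agree with
# the output axis built from all_cats.
fixed_codes = [p["cat"].cat.codes.tolist() for p in corrected]
print(buggy_codes)  # [[0, 1], [0, 1]]
print(fixed_codes)  # [[0, 1], [1, 2]]
```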
Using the USA 2010 Census data and the following test code:

before the fix we see, for the `fastparquet` and `pyarrow` readers respectively:

and after the fix: