Where reduction using dataframe row index #1164

Merged 6 commits on Jan 19, 2023

ianthomas23
Member

This is built on top of PR #1155, which ideally should be merged first so that this can be rebased on top of it. I am submitting it early to run it through CI.

This PR supports using the where reduction without the lookup_column argument, in which case the returned agg contains the corresponding row indexes of the pandas/dask DataFrame. The agg is int64, with -1 representing missing values. Implementing the row index for pandas DataFrames is quite simple. For dask DataFrames it is more complicated, because this information is not normally available and the DataFrame's index cannot be relied upon in all scenarios.
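
To illustrate why the dask case needs extra bookkeeping, here is a rough pure-Python sketch of one possible way to turn per-partition row positions into global row indexes and combine per-pixel max results. It is not the code in this PR: the helper names (partition_offsets, to_global, combine_max) and the 3/3/2 partition split are hypothetical, and it assumes the partition lengths are already known.

```python
from itertools import accumulate


def partition_offsets(partition_lengths):
    """Global row index of the first row of each partition (hypothetical helper)."""
    return [0] + list(accumulate(partition_lengths))[:-1]


def to_global(local_index, offset):
    """Shift a per-partition row index to a global one, preserving the -1 sentinel."""
    return -1 if local_index == -1 else local_index + offset


def combine_max(a, b):
    """Combine two per-pixel (max value, global row index) pairs."""
    value_a, index_a = a
    value_b, index_b = b
    if index_a == -1:
        return b
    if index_b == -1 or value_a >= value_b:
        return a
    return b


# Assumed split of 8 rows into partitions holding 3, 3 and 2 rows.
offsets = partition_offsets([3, 3, 2])

# Per-partition results for a single pixel: (max value, local row index).
# The middle partition has no rows in this pixel, hence (None, -1).
partials = [(9, 0), (None, -1), (5, 1)]

result = (None, -1)
for (value, local_index), offset in zip(partials, offsets):
    result = combine_max(result, (value, to_global(local_index, offset)))

print(offsets, result)
```

The point of the sketch is only that a local position such as 1 in the last partition has to become 7 globally before results from different partitions can be compared.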

Demo code:

import datashader as ds
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(
    x     = [ 0,  0,  1,  1,  0,  0,  2,  2],
    y     = [ 0,  0,  0,  0,  1,  1,  1,  1],
    value = [ 9,  8,  7,  6,  2,  3,  4,  5],
    other = [11, 12, 13, 14, 15, 16, 17, 18],
    #index    0   1   2   3   4   5   6   7
))

canvas = ds.Canvas(plot_height=2, plot_width=3)

reductions = [
    ("where first index", ds.where(ds.first("value"))),
    ("where last index", ds.where(ds.last("value"))),
    ("where max index", ds.where(ds.max("value"))),
    ("where max other", ds.where(ds.max("value"), "other")),
    ("where min index", ds.where(ds.min("value"))),
    ("where min other", ds.where(ds.min("value"), "other")),
]

for name, reduction in reductions:
    agg = canvas.points(df, 'x', 'y', agg=reduction)
    print(name, agg.data.dtype)
    print(agg.data)

which outputs

where first index int64
[[ 0  2 -1]
 [ 4 -1  6]]
where last index int64
[[ 1  3 -1]
 [ 5 -1  7]]
where max index int64
[[ 0  2 -1]
 [ 5 -1  7]]
where max other float64
[[11. 13. nan]
 [16. nan 18.]]
where min index int64
[[ 1  3 -1]
 [ 4 -1  6]]
where min other float64
[[12. 14. nan]
 [15. nan 17.]]
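
For anyone checking the "where max index" output by hand, the semantics can be sketched in plain Python. This is not how datashader computes it: the function name where_max_row_index is made up for illustration, it assumes x and y are already integer pixel coordinates (true for this demo data on a 3x2 canvas), and keeping the first row on ties is an assumption.

```python
def where_max_row_index(xs, ys, values, width, height):
    """Return a height x width grid holding the row index of the max value per pixel.

    Pixels that receive no rows keep the sentinel -1.
    """
    index_grid = [[-1] * width for _ in range(height)]
    for row, (x, y, value) in enumerate(zip(xs, ys, values)):
        current = index_grid[y][x]
        # Strict comparison: on ties the earlier row is kept (an assumption).
        if current == -1 or value > values[current]:
            index_grid[y][x] = row
    return index_grid


# Same x, y and value columns as the demo DataFrame.
xs = [0, 0, 1, 1, 0, 0, 2, 2]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
values = [9, 8, 7, 6, 2, 3, 4, 5]

grid = where_max_row_index(xs, ys, values, width=3, height=2)
print(grid)
```

The resulting grid agrees with the "where max index" agg printed above.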

The selector reductions that where supports in this way are first, last, max and min. For dask DataFrames only max and min are supported so far, because first and last do not have a dask implementation.
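
One thing a row-index agg makes possible is looking up any column after the fact. The snippet below is a minimal NumPy sketch, using the "where max index" values and the other column from the demo; the variable names are just for illustration, and the -1 sentinel has to be masked out because NumPy would otherwise treat it as "last element".

```python
import numpy as np

# "other" column from the demo DataFrame.
other = np.array([11, 12, 13, 14, 15, 16, 17, 18])

# Row indexes as returned for "where max index" in the demo output.
row_index = np.array([[0, 2, -1],
                      [5, -1, 7]])

# -1 marks pixels with no data; mask them so they become NaN rather than
# being silently read as other[-1].
valid = row_index >= 0
looked_up = np.where(valid, other[row_index], np.nan)
print(looked_up)
```

The looked-up values agree with the "where max other" output above, including the float64 dtype that comes from using NaN for missing pixels.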

@ianthomas23
Member Author

This is ready for review. Once it is merged, we will be in a position to start working on HoloViews to use this functionality for improved inspection.

Member

@jbednar jbednar left a comment

Looks great! Can you provide a section for the user guide showing how to use this in an example?

@ianthomas23
Member Author

> Looks great! Can you provide a section for the user guide showing how to use this in an example?

Yes, I'll do that in a separate PR.

@ianthomas23 merged commit 73d3deb into holoviz:main Jan 19, 2023
@ianthomas23 deleted the where_row_index branch January 19, 2023 09:46
@ianthomas23 added this to the v0.14.4 milestone Jan 19, 2023