Faster subsetting for cohorts #397

dcherian · 2024-09-18T01:42:05Z

A big improvement, but the time still scales linearly with the number of cohorts, as the line profile shows:

Line #      Hits         Time  Per Hit   % Time  Line Contents
  1736      3650     426000.0    116.7      0.0                  reindexer = (
  1737      3650    1115000.0    305.5      0.0                      partial(reindex_intermediates, agg=agg, unique_groups=cohort_index)
  1738      3650     609000.0    166.8      0.0                      if do_simple_combine
  1739                                                               else identity
  1740                                                           )
  1741      3650  885572000.0 242622.5     20.6                  reindexed = subset_to_blocks(intermediate, blks, block_shape, reindexer, chunks_as_array)
  1742                                                           # now that we have reindexed, we can set reindex=True explicitly
  1743      7300    1385000.0    189.7      0.0                  reduced_.append(
  1744      7300 2782229000.0 381127.3     64.6                      tree_reduce(
  1745      3650     399000.0    109.3      0.0                          reindexed,
  1746      3650    1490000.0    408.2      0.0                          combine=partial(combine, agg=agg, reindex=do_simple_combine),
  1747      7300    1570000.0    215.1      0.0                          aggregate=partial(
  1748      3650     339000.0     92.9      0.0                              aggregate,
  1749      3650     307000.0     84.1      0.0                              expected_groups=cohort_index,
  1750      3650     309000.0     84.7      0.0                              reindex=do_simple_combine,
  1751                                                                   ),
  1752                                                               )
  1753                                                           )
  1754                                                           # This is done because pandas promotes to 64-bit types when an Index is created
  1755                                                           # So we use the index to generate the return value for consistency with "map-reduce"
  1756                                                           # This is important on windows
  1757      3650    3268000.0    895.3      0.1                  groups_.append(cohort_index.values)
  1758                                           
  1759         1  230072000.0    2e+08      5.3              reduced = dask.array.concatenate(reduced_, axis=-1)
  1760         1     368000.0 368000.0      0.0              groups = (np.concatenate(groups_),)
  1761         1     379000.0 379000.0      0.0              group_chunks = (tuple(len(cohort) for cohort in groups_),)
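
To make the linear scaling concrete, the sketch below redoes the arithmetic on the hottest lines of the profile. It is only an illustration: the numbers are copied from the profile, `per_cohort_cost` and `projected_loop_time` are hypothetical helpers that are not part of the library, and it assumes that the loop runs once per cohort (3650 iterations here) and that the per-cohort cost stays constant as the number of cohorts changes. The profiler's timer unit is not stated above, so times are left in raw timer units.

```python
# Figures copied from the profile above: (hits, total time in timer units).
# The timer unit is not given in the source, so no conversion is attempted.
HOT_LINES = {
    "subset_to_blocks": (3650, 885_572_000.0),
    "tree_reduce": (7300, 2_782_229_000.0),
    "dask.array.concatenate": (1, 230_072_000.0),
}

# Assumption: one loop iteration per cohort, taken from the hit count of the
# single-line statements inside the loop body.
N_COHORTS = 3650


def per_cohort_cost(total_time: float, n_cohorts: int = N_COHORTS) -> float:
    """Average time spent on one cohort for a line inside the loop."""
    return total_time / n_cohorts


def projected_loop_time(n_cohorts: int) -> float:
    """Hypothetical linear extrapolation of the two dominant loop lines."""
    per_cohort = sum(
        per_cohort_cost(HOT_LINES[name][1])
        for name in ("subset_to_blocks", "tree_reduce")
    )
    return per_cohort * n_cohorts


if __name__ == "__main__":
    for name in ("subset_to_blocks", "tree_reduce"):
        cost = per_cohort_cost(HOT_LINES[name][1])
        print(f"{name}: {cost:,.1f} timer units per cohort")
    # Under the constant-cost assumption, doubling the cohorts doubles the time.
    print(f"projected, 3650 cohorts: {projected_loop_time(3650):,.0f}")
    print(f"projected, 7300 cohorts: {projected_loop_time(7300):,.0f}")
```

Under these assumptions the two per-cohort calls account for about 85% of the profiled time (20.6% + 64.6%), so any further speedup would have to come from lowering the per-cohort cost or from avoiding the per-cohort loop.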

Closes #396

* main:
  Revert "Support first, last with datetime, timedelta (#402)" (#404)
  Support first, last with datetime, timedelta (#402)
  Bump codecov/codecov-action from 4.5.0 to 4.6.0 (#401)
  Bump mamba-org/setup-micromamba from 1 to 2 (#400)
  Revert "[revert] test with Xarray PR branch" (#393)
  [pre-commit.ci] pre-commit autoupdate (#399)
  Faster subsetting for cohorts (#397)
  Fix default int on windows, numpy<2 (#395)
  Avoid rechunking when preferred_method="blockwise" (#394)
  Preserve dtype better when specified. (#389)
  Drop python 3.9, use ruff (#392)
  silence warning (#390)
  Expand groupby_reduce property tests (#385)
  Fix bug with NaNs in `by` and method='blockwise' (#384)
  Avoid explicit np.nan, np.inf (#383)

dcherian added 2 commits September 17, 2024 19:41

Faster subsetting for cohorts

4a221de

Closes #396

tpying

08d3087

dcherian merged commit 07a15c4 into main Sep 18, 2024
14 of 16 checks passed

dcherian deleted the faster-cohorts-subset branch September 18, 2024 03:13
