Add sampling with replacement to Table.subsample #774

wasade · 2018-04-20T14:25:42Z

This could likely adapt the strategy used in scikit-bio and is important for instances where the sum of a vector is massive.

Cc @mortonjt (couldn't assign for some reason)

mortonjt · 2018-05-07T05:25:07Z

Quick question on reproducibility

Looking at the tests for Table.subsample I don't see any uses of numpy random seeds. And I'm having trouble making consistent unittests when using Table.subsample.

@wasade , any thoughts on setting random seeds?
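For readers hitting the same reproducibility question, here is a minimal sketch of how a fixed seed makes a random subsample repeatable in a unit test. The helper `subsample_counts` is hypothetical and written only for illustration; it is not how `Table.subsample` is implemented. If the code under test draws from NumPy's global generator instead, calling `np.random.seed(...)` at the start of the test should have a similar effect.

```python
import numpy as np


def subsample_counts(counts, n, seed=None):
    """Hypothetical helper: draw n observations without replacement."""
    # A private RandomState keeps the test independent of global state.
    rng = np.random.RandomState(seed)
    # Expand the counts to one entry per observation, e.g. [2, 1] -> [0, 0, 1].
    items = np.repeat(np.arange(len(counts)), counts)
    picked = rng.choice(items, size=n, replace=False)
    # Collapse the picked observations back into a count vector.
    return np.bincount(picked, minlength=len(counts))


counts = np.array([5, 0, 3, 12])
first = subsample_counts(counts, 10, seed=42)
second = subsample_counts(counts, 10, seed=42)
# Same seed, same draw: the two results can be compared exactly in a test.
```

With the seed fixed, a test can assert on exact values; without it, only invariants such as the total depth can be checked.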

wasade · 2018-05-07T14:08:13Z

No strong feelings

stevendbrown · 2018-09-04T18:01:22Z

Do either of you have a pre-PR branch for this that I could try to pick up and carry forward?

wasade · 2018-09-04T18:47:43Z

@stevendbrown, thank you for the inquiry! If you have bandwidth, we'd love a PR. The change should be relatively small, as it should just require a branch and call to np.random.multinomial as done here. I don't think this needs to be done within the Cython code. This would also be a pleasant feature expansion for q2-feature-table and I would commit to making sure your contribution is available in the next release of QIIME 2.
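A rough sketch of the multinomial approach mentioned above, for a single dense count vector. The function name `subsample_with_replacement` and its parameters are assumptions made for this example, not part of biom-format or scikit-bio. Because the draw works from proportions, the count vector never has to be expanded into one entry per observation, which is what helps when the sum of the vector is massive.

```python
import numpy as np


def subsample_with_replacement(counts, n, seed=None):
    """Hypothetical helper: draw n observations with replacement."""
    rng = np.random.RandomState(seed)
    counts = np.asarray(counts, dtype=float)
    # Each category is drawn with probability proportional to its count.
    probs = counts / counts.sum()
    return rng.multinomial(n, probs)


counts = np.array([5, 0, 3, 12])
# With replacement, the requested depth may exceed the original total (20).
resampled = subsample_with_replacement(counts, 50, seed=0)
```

Categories with a zero count have probability zero and therefore stay at zero, while non-zero categories may end up with more observations than they started with.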

stevendbrown · 2018-09-04T20:20:05Z

@wasade OK, I made a basic implementation but I did it in the Cython code since that logic is already worked out, and doesn't require using dense data. I'm checking my work now to make sure I didn't bungle something non-obviously, but on the surface it looks pretty clean. Are there reasons I shouldn't do this and should do it instead using pure Python (e.g. maybe writing tests is harder)?

wasade · 2018-09-04T20:25:48Z

No reason not to! Just as a heads-up, appropriate unit tests will be necessary for merge. One example for subsample can be found here. I recommend adding a new test method or methods to assert correctness.

mortonjt · 2018-09-04T20:28:56Z

Note that the numpy implementation of multinomial is already written in C -- not entirely sure how much faster it will be in a cython implementation ...

stevendbrown · 2018-09-04T20:49:13Z

@mortonjt Agreed. For me it's less about speed and more about piggybacking on the existing Cython code to manipulate sparse data as input to multinomial. It's the "record-keeping" code around the subsampling that I'm trying to take advantage of, rather than recapitulating the existing Cython gymnastics (e.g. looping over idxptr) for handling the sparse matrix data a level up in table.py. I'll get this into a PR and maybe we can discuss options there?
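To illustrate the kind of "record-keeping" being discussed, here is a pure-Python sketch that walks the column pointer array of a CSC-style sparse matrix and resamples each column's stored values with a multinomial draw. It assumes each column is one sample, and the function name and arguments are hypothetical; this is not the Cython code in biom-format.

```python
import numpy as np


def subsample_csc_data(data, indptr, n, seed=None):
    """Hypothetical helper: resample each CSC column with replacement."""
    rng = np.random.RandomState(seed)
    out = np.zeros(len(data), dtype=np.int64)
    # indptr[i]:indptr[i + 1] delimits the stored values of column i.
    for start, end in zip(indptr[:-1], indptr[1:]):
        col = np.asarray(data[start:end], dtype=float)
        total = col.sum()
        if total > 0:  # skip empty columns
            out[start:end] = rng.multinomial(n, col / total)
    return out


# Three columns; the middle one is empty (indptr repeats the value 2).
data = np.array([5, 3, 12, 7])
indptr = np.array([0, 2, 2, 4])
new_data = subsample_csc_data(data, indptr, 10, seed=1)
```

Only the stored (non-zero) values are touched, so the data never has to be made dense. Entries that are drawn zero times would still need to be pruned from the sparse structure afterwards.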

mortonjt · 2018-09-04T20:59:26Z

👍 Look forward to seeing your PR!

stevendbrown · 2018-09-06T15:50:47Z

Presumably this can be closed?

wasade · 2018-09-06T16:07:50Z

Yes, thanks. Used to the issues getting closed automagically :)

mortonjt mentioned this issue May 7, 2018

Request : iloc functionality #775

Closed

stevendbrown mentioned this issue Sep 4, 2018

addition of subsampling with replacement #783

Merged

wasade closed this as completed Sep 6, 2018
