Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add sampling with replacement to Table.subsample #774

Closed
wasade opened this issue Apr 20, 2018 · 11 comments
Closed

Add sampling with replacement to Table.subsample #774

wasade opened this issue Apr 20, 2018 · 11 comments

Comments

@wasade
Copy link
Member

wasade commented Apr 20, 2018

This could likely adapt the strategy used in scikit-bio and is important for instances where the sum of a vector is massive.

Cc @mortonjt (couldn't assign for some reason)

@mortonjt
Copy link
Contributor

mortonjt commented May 7, 2018

Quick question on reproducibility

Looking at the tests for Table.subsample I don't see any uses of numpy random seeds. And I'm having trouble making consistent unittests when using Table.subsample.

@wasade , any thoughts on setting random seeds?

@wasade
Copy link
Member Author

wasade commented May 7, 2018 via email

@stevendbrown
Copy link
Contributor

Do either of you have a pre-PR branch for this that I could try to pick up and carry forward?

@wasade
Copy link
Member Author

wasade commented Sep 4, 2018

@stevendbrown, thank you for the inquiry! If you have bandwidth, we'd love a PR. The change should be relatively small, as it should just require a branch and call to np.random.multinomial as done here. I don't think this needs to be done within the Cython code. This would also be a pleasant feature expansion for q2-feature-table and I would commit to making sure your contribution is available in the next release of QIIME 2.

@stevendbrown
Copy link
Contributor

@wasade OK, I made a basic implementation but I did it in the Cython code since that logic is already worked out, and doesn't require using dense data. I'm checking my work now to make sure I didn't bungle something non-obviously, but on the surface it looks pretty clean. Are there reasons I shouldn't do this and should do it instead using pure Python (e.g. maybe writing tests is harder)?

@wasade
Copy link
Member Author

wasade commented Sep 4, 2018

No reason not too! Just as a heads up, appropriate unit tests will necessary for merge. One example for subsample can be found here. I recommend adding in a new test method or methods to assert correctness.

@mortonjt
Copy link
Contributor

mortonjt commented Sep 4, 2018

Note that the numpy implementation of multinomial is already written in C -- not entirely sure how much faster it will be in a cython implementation ...

@stevendbrown
Copy link
Contributor

@mortonjt Agreed. For me it's less about speed and more about piggyback on the existing Cython code to manipulate sparse data as input to multinomial. It's the "record-keeping" code around the subsampling of which I'm trying to take advantage, rather than recapitulating the existing Cython gymnastics (e.g. looping over idxptr) for handling the sparse matrix data a level up in table.py. I'll get this into a PR and maybe we can discuss options there?

@mortonjt
Copy link
Contributor

mortonjt commented Sep 4, 2018 via email

@stevendbrown
Copy link
Contributor

Presumably this can be closed?

@wasade
Copy link
Member Author

wasade commented Sep 6, 2018

Yes, thanks. Used to the issues getting closed automagically :)

@wasade wasade closed this as completed Sep 6, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants