MRG: add skipmer capacity to sourmash python layer via ffi #3446

Merged
bluegenes merged 32 commits into latest from py-skipmers
Dec 20, 2024

Conversation

bluegenes
Contributor

@bluegenes bluegenes commented Dec 20, 2024

This PR updates the FFI and Python layer to allow the skipmer moltypes (skipm1n3, skipm2n3). We will keep this as an undocumented experimental feature for now. There are no guarantees at the moment that skipmers will work with all sourmash commands, as there are no explicit tests in place. There are tests for using skipmers with branchwater, so all skipmer searching is best done in the plugin for now. This PR does enable some handy utilities, though: sig cat, sig summarize, sig describe, etc.

Documentation and additional tests should be added prior to release (#3449).

bluegenes and others added 25 commits November 12, 2024 16:13
Make skipmers robust, but keep #3395 functional in the meantime.

This PR:
- enables a second skipmer type, so we have m1n3 in addition to m2n3
- switches to a reading frame approach for both translation + skipmers,
which means we first build the reading frame, then kmerize, rather than
building kmers + translating/skipping on the fly
- avoids the "extended length" that was needed for skipping on the fly
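
To illustrate the "build the reading frame first, then kmerize" idea, here is a minimal Python sketch. It is not the Rust code from this PR: the function names (`skipmer_frame`, `kmerize`, `skipmers`) are made up for illustration, and it assumes the usual skipmer definition, where the first `m` bases of every `n`-base cycle are kept (m1n3 keeps 1 of every 3, m2n3 keeps 2 of every 3) and one frame is built per starting offset.

```python
def skipmer_frame(seq: str, start: int, m: int, n: int) -> str:
    """Build one skipmer reading frame (hypothetical helper).

    Starting at offset `start`, keep the first `m` bases of every
    `n`-base cycle and drop the rest.
    """
    if not 0 < m < n:
        raise ValueError("need 0 < m < n")
    return "".join(base for i, base in enumerate(seq[start:]) if i % n < m)


def kmerize(frame: str, k: int) -> list[str]:
    """Plain k-mer extraction over an already-built frame."""
    return [frame[i:i + k] for i in range(len(frame) - k + 1)]


def skipmers(seq: str, k: int, m: int, n: int) -> list[str]:
    """Collect k-mers from all `n` frame offsets (assumed behaviour)."""
    out = []
    for start in range(n):
        out.extend(kmerize(skipmer_frame(seq, start, m, n), k))
    return out


if __name__ == "__main__":
    seq = "ACGTACGTA"
    print(skipmer_frame(seq, 0, 2, 3))  # m2n3 frame at offset 0
    print(skipmer_frame(seq, 0, 1, 3))  # m1n3 frame at offset 0
    print(skipmers(seq, 3, 2, 3))
```

Because the skipping happens once while the frame is built, the k-mer step is an ordinary sliding window over the frame and needs no "extended length" bookkeeping.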

Since this changes the `SeqToHashes` strategy a bit, there's one python
test where we now see a different error.

Future thoughts:
- with the new structure, it would be straightforward to add validation
to exclude protein k-mers with invalid amino acids (`X`). I guess I'm
not entirely sure what happens to those atm...

codecov bot commented Dec 20, 2024

Codecov Report

Attention: Patch coverage is 47.22222% with 19 lines in your changes missing coverage. Please review.

Project coverage is 86.26%. Comparing base (419eb73) to head (017280a).
Report is 1 commits behind head on latest.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/sourmash/minhash.py | 52.94% | 4 Missing and 4 partials ⚠️ |
| src/sourmash/sourmash_args.py | 0.00% | 2 Missing and 3 partials ⚠️ |
| src/core/src/ffi/minhash.rs | 0.00% | 2 Missing ⚠️ |
| src/core/src/ffi/mod.rs | 0.00% | 2 Missing ⚠️ |
| src/core/src/sketch/minhash.rs | 0.00% | 2 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           latest    #3446      +/-   ##
==========================================
- Coverage   86.37%   86.26%   -0.11%     
==========================================
  Files         137      137              
  Lines       16196    16226      +30     
  Branches     2219     2225       +6     
==========================================
+ Hits        13989    13998       +9     
- Misses       1900     1915      +15     
- Partials      307      313       +6     
| Flag | Coverage Δ |
|---|---|
| hypothesis-py | 25.43% <26.66%> (-0.01%) ⬇️ |
| python | 92.32% <56.66%> (-0.08%) ⬇️ |
| rust | 62.34% <0.00%> (-0.21%) ⬇️ |

Flags with carried forward coverage won't be shown. Click here to find out more.


Base automatically changed from try-skipmers to latest December 20, 2024 17:47
@bluegenes bluegenes changed the title WIP: add skipmers to sourmash python layer MRG: add skipmer capacity to sourmash python layer via ffi Dec 20, 2024
@bluegenes bluegenes merged commit 150967b into latest Dec 20, 2024
41 of 44 checks passed
@bluegenes bluegenes deleted the py-skipmers branch December 20, 2024 20:36
@@ -102,7 +102,7 @@ def _set_num_scaled(mh, num, scaled):
# Number of hashes is 0th parameter
mh_params[0] = num
# Scale is 8th parameter
Member


Suggested change
- # Scale is 8th parameter
+ # Scale is 10th parameter

😅

ctb added a commit that referenced this pull request Dec 20, 2024
## [0.18.0] - 2024-12-20

MSRV: 1.66

Changes/additions:

* add skipmer capacity to sourmash python layer via ffi (#3446)
* add skipmers; switch to reading frame approach for translation,
skipmers (#3395)
* Refactor: Use to_writer/from_reader across the codebase (#3443)
* adjust `Signature::name()` to return `Option<String>` instead of
`filename()` and `md5sum()` (#3434)
* propagate zipfile errors (#3431)

Updates:

* Bump proptest from 1.5.0 to 1.6.0 (#3437)
* Bump roaring from 0.10.8 to 0.10.9 (#3438)
* Bump serde from 1.0.215 to 1.0.216 (#3436)
* Bump statrs from 0.17.1 to 0.18.0 (#3426)
* Bump roaring from 0.10.7 to 0.10.8 (#3423)
* Bump needletail from 0.6.0 to 0.6.1 (#3427)
* Bump web-sys from 0.3.72 to 0.3.74 (#3411)
* Bump js-sys from 0.3.72 to 0.3.74 (#3412)
* Bump roaring from 0.10.6 to 0.10.7 (#3413)
* Bump serde_json from 1.0.132 to 1.0.133 (#3402)
* Bump serde from 1.0.214 to 1.0.215 (#3403)
ctb added a commit that referenced this pull request Dec 21, 2024
@ctb ctb mentioned this pull request Jan 11, 2025
ctb added a commit that referenced this pull request Jan 11, 2025
Release issue: #3481

----

NOTE: This release adds basic support for skipmers, but they are not
yet fully supported.

Minor new features:

* add genbank plant db to docs (#3429)
* add skipmer capacity to sourmash python layer via ffi (#3446)
* add skipmers; switch to reading frame approach for translation,
skipmers (#3395)
* additional moltype specification needed for `sig downsample` with
skipmers (#3457)
* update with misc animal genomes (#3422)

Cleanup and documentation updates:

* add comment about semver and column headings (#3433)

Developer updates:

* Deps: update to rocksdb 0.23 (#3456)
* Refactor: Use to_writer/from_reader across the codebase (#3443)
* adjust `Signature::name()` to return `Option<String>` instead of
`filename()` and `md5sum()` (#3434)
* bump version to 4.8.13-dev (#3474)
* fix comment in _set_num_scaled (#3451)
* propagate zipfile errors (#3431)
* update rust CHANGELOG in preparation for r0.18.0 (#3450)
* CI: github actions updates (#3476)

Dependabot updates:

* Bump itertools from 0.13.0 to 0.14.0 (#3471)
* Bump needletail from 0.6.0 to 0.6.1 (#3427)
* Bump proptest from 1.5.0 to 1.6.0 (#3437)
* Bump roaring from 0.10.7 to 0.10.8 (#3423)
* Bump roaring from 0.10.8 to 0.10.9 (#3438)
* Bump serde from 1.0.215 to 1.0.216 (#3436)
* Bump serde from 1.0.216 to 1.0.217 (#3464)
* Bump serde_json from 1.0.133 to 1.0.134 (#3453)
* Bump statrs from 0.17.1 to 0.18.0 (#3426)
* Bump tempfile from 3.14.0 to 3.15.0 (#3472)
* Bump thiserror from 2.0.3 to 2.0.6 (#3425)
* Bump thiserror from 2.0.6 to 2.0.7 (#3435)
* Bump thiserror from 2.0.7 to 2.0.8 (#3448)
* Bump thiserror from 2.0.8 to 2.0.9 (#3452)
* Update maturin requirement from <1.8.0,>=1 to >=1,<1.9.0 (#3465)
* [pre-commit.ci] pre-commit autoupdate (#3428)
* [pre-commit.ci] pre-commit autoupdate (#3439)
* [pre-commit.ci] pre-commit autoupdate (#3454)
* [pre-commit.ci] pre-commit autoupdate (#3473)

3 participants