Feature: implement FFI for manifest, picklist and selection in Rust #2726
Codecov Report
Additional details and impacted files
@@ Coverage Diff @@
## latest #2726 +/- ##
==========================================
- Coverage 86.23% 84.36% -1.88%
==========================================
Files 135 136 +1
Lines 15307 15406 +99
Branches 2622 2631 +9
==========================================
- Hits 13200 12997 -203
- Misses 1808 2106 +298
- Partials 299 303 +4
Flags with carried forward coverage won't be shown.
593775a to d7d455a
d7d455a to 40d8336
I'm looking forward to this! 🎉 Note some hopefully useful comments in #1849, especially:
and
40d8336 to 9efe7fc
9efe7fc to 642bb27
On-disk RevIndex based on RocksDB, initially implemented in https://github.com/luizirber/2022-06-26-rocksdb-eval

This is the index/core data structure backing https://mastiff.sourmash.bio

There are many changes in the Rust code, so bumping the version to `0.12.0`. This is mostly not exposed through the FFI yet.

Tests from the in-memory `RevIndex` (greyhound) from #1238 were kept working, but it is not well supported (it doesn't allow saving/loading from disk, for example), and it should be wholly replaced by `sourmash::index::revindex::disk_revindex` (the on-disk RevIndex) in the future. It is confusing to have these different RevIndex impls in Rust, and I started converging them, but the work is not completely done yet.

#2727 is a better starting point for how the `Index` abc/trait should work across Python/Rust, and I started moving the Rust indices to start from a `LinearIndex` and later specialize into a `RevIndex`, which will make it easier to translate the work from #2727 for future indices across FFI.

A couple of new concepts introduced in this PR:

- a `Collection` is a `Manifest` + `Storage`. So a zip file like the ones for GTDB databases fits this easily (storage = `ZipStorage`, manifest is read from the zipfile), but a list of file paths does too (manifest built from the file paths, storage = `FSStorage`). This goes in a bit of a different direction than #1901, which was extending `Storage` to support more functionality. I think `Storage` should stay pretty bare and mostly deal with loading/saving data, without much knowledge of **what** data is there (this is covered by `Manifest`).
- a `CollectionSet` is a consistent collection of signatures. Consistent here means: same k-size, downsample-compatible for scaled, same moltype. You can create a `CollectionSet` by running `.select()` on a `Collection`. `CollectionSet` is required for building indices (because we shouldn't be building indices mixing k-size/moltype), and greatly simplifies the logic in many places because it is not necessary to check for compatibility.
- `LinearIndex` was rewritten based on `Collection` (and also the `greyhound`/`branchwater` parallelism), and this supports the "parallel search without an index" use case. There is no index construction per se here, pretty much just a thin layer on top of `Collection` implementing functionality expected from indices.
- `Manifest`, `Selection`, and `Picklist` are still incomplete, but the relevant function definitions are in place. They still need to be barraged with tests (potentially by exposing them to Python and reusing the tests there in #2726).

## Feature

- Initial implementation for `Manifest`, `Selection`, and `Picklist` following the Python API.
- `Collection` is a new abstraction for working with a set of signatures. A collection needs a `Storage` for holding the signatures (on-disk, in-memory, or remotely), and a `Manifest` to describe the metadata for each signature.
- Expose CSV parsing and RocksDB errors.
- New module `sourmash::index::revindex::disk_revindex` with the on-disk RevIndex implementation based on RocksDB.
- Add `iter` and `iter_mut` methods for `Signature`.
- Add `load_sig` and `save_sig` methods to `Storage` trait for higher-level data manipulation and caching.
- Add `spec` method to `Storage` to allow constructing a concrete `Storage` from a string description.
- Add `InnerStorage` for synchronizing parallel access to `Storage` implementations.
- Add `MemStorage` for keeping signatures in-memory (mostly for debugging and testing).

## Refactor

- Rename `HashFunctions` variants to follow camel-case, so `Murmur64Protein` instead of `murmur64_protein`.
- `LinearIndex` is now implemented as a thin layer on top of `Collection`.
- Move `GatherResult` to `sourmash::index` module.
- Move `sourmash::index::revindex` to `sourmash::index::mem_revindex` (this is the Greyhound version of revindex, in-memory only). It was also refactored internally to build a version of a `LinearIndex` that will be merged in the future with `sourmash::index::LinearIndex`.
- Move `select` method from `Index` trait into a separate `Select` trait, and implement it for `Signature` based on the new `Selection` API.
- Move `SigStore` into `sourmash::storage` module, and remove the generic. Now it always stores `Signature`. Also implement `Select` for it.

## Build

- Add new `branchwater` feature (enabled by default), which can be disabled by downstream projects to avoid bringing in heavy dependencies like rocksdb.
- Add new `rkyv` feature (disabled by default), making `MinHash` serializable with the `rkyv` crate.
- Add semver checks for CI (so we bump versions accordingly, or avoid breaking changes).
- Reduce feature combinations on Rust checks (takes much less time to run).
- Disable `musllinux` wheels (need to figure out how to build rocksdb for it).

---------

Co-authored-by: Tessa Pierce Ward <bluegenes@users.noreply.github.com>
Co-authored-by: C. Titus Brown <titus@idyll.org>
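
To make the `Collection` = `Manifest` + `Storage` idea and the `.select()` step more concrete, here is a minimal sketch. It is not the sourmash implementation: every type below is a toy stand-in that only borrows the names used in the commit message, the fields and method signatures are assumptions, and the consistency check covers k-size and moltype only (the downsample-compatibility check for scaled is left out).

```rust
use std::collections::HashMap;

// Toy stand-ins that only borrow names from the commit message;
// they are NOT the real sourmash types or signatures.

/// One manifest row: metadata about a signature, not the signature itself.
#[derive(Clone, Debug, PartialEq)]
pub struct Record {
    pub name: String,
    pub ksize: u32,
    pub moltype: String,
    /// Key used to fetch the signature bytes from storage.
    pub location: String,
}

pub struct Manifest {
    pub records: Vec<Record>,
}

/// Storage stays bare: it saves and loads bytes and knows nothing else.
pub trait Storage {
    fn save(&mut self, location: &str, data: Vec<u8>);
    fn load(&self, location: &str) -> Option<Vec<u8>>;
}

#[derive(Default)]
pub struct MemStorage {
    data: HashMap<String, Vec<u8>>,
}

impl Storage for MemStorage {
    fn save(&mut self, location: &str, data: Vec<u8>) {
        self.data.insert(location.to_string(), data);
    }
    fn load(&self, location: &str) -> Option<Vec<u8>> {
        self.data.get(location).cloned()
    }
}

/// Hypothetical selection criteria; `None` means "do not filter on this field".
#[derive(Default)]
pub struct Selection {
    pub ksize: Option<u32>,
    pub moltype: Option<String>,
}

impl Selection {
    fn matches(&self, r: &Record) -> bool {
        self.ksize.map_or(true, |k| k == r.ksize)
            && self.moltype.as_ref().map_or(true, |m| *m == r.moltype)
    }
}

/// Collection = Manifest + Storage.
pub struct Collection<S: Storage> {
    pub manifest: Manifest,
    pub storage: S,
}

/// A Collection whose records are known to share ksize and moltype.
pub struct CollectionSet<S: Storage> {
    collection: Collection<S>,
}

impl<S: Storage> Collection<S> {
    /// Filter the manifest, then refuse to build a set from mixed records.
    pub fn select(self, selection: &Selection) -> Result<CollectionSet<S>, String> {
        let records: Vec<Record> = self
            .manifest
            .records
            .into_iter()
            .filter(|r| selection.matches(r))
            .collect();
        if let Some(first) = records.first() {
            let consistent = records
                .iter()
                .all(|r| r.ksize == first.ksize && r.moltype == first.moltype);
            if !consistent {
                return Err("selection leaves mixed ksize/moltype".to_string());
            }
        }
        Ok(CollectionSet {
            collection: Collection {
                manifest: Manifest { records },
                storage: self.storage,
            },
        })
    }
}

impl<S: Storage> CollectionSet<S> {
    pub fn len(&self) -> usize {
        self.collection.manifest.records.len()
    }

    /// Load the bytes of the i-th selected signature through the storage.
    pub fn load(&self, i: usize) -> Option<Vec<u8>> {
        let record = self.collection.manifest.records.get(i)?;
        self.collection.storage.load(&record.location)
    }
}
```

In this sketch the only way to obtain a `CollectionSet` is through `select`, so code that builds an index from one can skip compatibility checks, which is the simplification the commit message describes.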
642bb27 to 9a93249
9a93249 to 710b48d
Implement `Manifest`, `Picklist` and `Selection` in Rust.

This is needed to better support Rust indices with the same feature set as the Python indices.
(spun off #2230)
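
As a rough illustration of how a picklist and a selection could narrow a manifest, here is a minimal sketch. It assumes a picklist is a set of values matched against one manifest column in either include or exclude mode. All names here (`Row`, `Column`, `PickStyle`, the `select` function and the fields of `Selection`) are hypothetical and are not taken from the sourmash Rust API.

```rust
use std::collections::HashSet;

// Hypothetical types for illustration only; not the sourmash API.

/// Whether rows whose value is in the picklist are kept or dropped.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum PickStyle {
    Include,
    Exclude,
}

/// Manifest columns this toy picklist can match against.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Column {
    Name,
    Md5,
}

/// One manifest row, reduced to the fields used in this sketch.
#[derive(Clone, Debug, PartialEq)]
pub struct Row {
    pub name: String,
    pub md5: String,
    pub ksize: u32,
}

pub struct Picklist {
    pub column: Column,
    pub values: HashSet<String>,
    pub style: PickStyle,
}

impl Picklist {
    /// Decide whether a row survives this picklist.
    pub fn keeps(&self, row: &Row) -> bool {
        let value = match self.column {
            Column::Name => &row.name,
            Column::Md5 => &row.md5,
        };
        let found = self.values.contains(value);
        match self.style {
            PickStyle::Include => found,
            PickStyle::Exclude => !found,
        }
    }
}

/// Selection criteria; `None` means "do not filter on this field".
pub struct Selection {
    pub ksize: Option<u32>,
    pub picklist: Option<Picklist>,
}

/// Return the manifest rows that pass every criterion in the selection.
pub fn select(manifest: &[Row], selection: &Selection) -> Vec<Row> {
    manifest
        .iter()
        .filter(|r| selection.ksize.map_or(true, |k| k == r.ksize))
        .filter(|r| selection.picklist.as_ref().map_or(true, |p| p.keeps(r)))
        .cloned()
        .collect()
}
```

Filtering operates on manifest rows only, so no signature data has to be loaded to decide what is selected.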