Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature: implement FFI for manifest, picklist and selection in Rust #2726

Draft
wants to merge 1 commit into
base: latest
Choose a base branch
from

Conversation

luizirber
Copy link
Member

Implement Manifest, Picklist and Selection in Rust.

This is needed to better support Rust indices with the same feature set from Python indices.

(spun off #2230)

@codecov
Copy link

codecov bot commented Aug 20, 2023

Codecov Report

Attention: 180 lines in your changes are missing coverage. Please review.

Comparison is base (9529c73) 86.23% compared to head (593775a) 84.36%.

❗ Current head 593775a differs from pull request most recent head 710b48d. Consider uploading reports for the commit 710b48d to get more accurate results

Files Patch % Lines
src/core/src/manifest.rs 0.00% 38 Missing ⚠️
src/sourmash/index/__init__.py 53.03% 30 Missing and 1 partial ⚠️
src/core/src/index/mod.rs 0.00% 28 Missing ⚠️
src/core/src/ffi/index/mod.rs 0.00% 26 Missing ⚠️
src/sourmash/manifest.py 11.53% 22 Missing and 1 partial ⚠️
src/core/src/ffi/picklist.rs 0.00% 19 Missing ⚠️
src/sourmash/picklist.py 11.11% 8 Missing ⚠️
src/core/src/ffi/manifest.rs 0.00% 7 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           latest    #2726      +/-   ##
==========================================
- Coverage   86.23%   84.36%   -1.88%     
==========================================
  Files         135      136       +1     
  Lines       15307    15406      +99     
  Branches     2622     2631       +9     
==========================================
- Hits        13200    12997     -203     
- Misses       1808     2106     +298     
- Partials      299      303       +4     
Flag Coverage Δ
hypothesis-py 25.79% <31.68%> (-0.01%) ⬇️
python 92.39% <38.61%> (-0.35%) ⬇️
rust 49.91% <0.00%> (-8.14%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@luizirber luizirber force-pushed the lirber/manifest_selection_picklist branch from 593775a to d7d455a Compare August 20, 2023 22:39
@luizirber luizirber force-pushed the lirber/manifest_selection_picklist branch from d7d455a to 40d8336 Compare August 20, 2023 23:00
@luizirber luizirber changed the title Feature: implement manifest, picklist and selection in Rust Feature: implement FFI for manifest, picklist and selection in Rust Aug 21, 2023
@ctb
Copy link
Contributor

ctb commented Aug 21, 2023

I'm looking forward to this! 🎉

Note some hopefully useful comments in #1849, especially:

first, it turns out that manifests do not contain seed or license (also see #1846 for motivation and discovery). so those should be added.

and

additional information that could be useful in manifests: the type of sketch (FracMinHash, MinHash, etc) - ref #751 also

@luizirber luizirber force-pushed the lirber/manifest_selection_picklist branch from 40d8336 to 9efe7fc Compare October 20, 2023 21:14
@luizirber luizirber changed the base branch from latest to lirber/mastiff October 20, 2023 21:15
@luizirber luizirber force-pushed the lirber/manifest_selection_picklist branch from 9efe7fc to 642bb27 Compare November 26, 2023 20:34
Base automatically changed from lirber/mastiff to latest December 1, 2023 00:56
ctb added a commit that referenced this pull request Dec 1, 2023
On-disk RevIndex based on RocksDB, initially implemented in
https://github.com/luizirber/2022-06-26-rocksdb-eval
This is the index/core data structure backing
https://mastiff.sourmash.bio

There are many changes in the Rust code, so bumping the version to
`0.12.0`.

This is mostly not exposed thru the FFI yet. Tests from the from the
in-memory `RevIndex` (greyhound) from #1238 were kept working, but it is
not well supported (doesn't allow saving/loading from disk, for
example), and should be wholly replaced by
`sourmash::index::revindex::disk_revindex` (the on-disk RevIndex) in the
future.
It is confusing to have these different RevIndex impls in Rust, and I
started converging them, but the work is not completely done yet.

#2727 is a better starting point for how `Index` abc/trait should work
acrosss Python/Rust, and I started moving the Rust indices to start from
a `LinearIndex` and later specialize into a `RevIndex`, which will make
easier to translate the work from #2727 for future indices across FFI.

A couple of new concepts introduced in this PR:

- a `Collection` is a `Manifest` + `Storage`. So a zip file like the
ones for GTDB databases fit this easily (storage = `ZipStorage`,
manifest is read from the zipfile), but a file paths list does too
(manifest built from the file paths, storage = `FSStorage`). This goes
in a bit of different direction than #1901, which was extending
`Storage` to support more functionality. I think `Storage` should stay
pretty bare and mostly deal with loading/saving data, but not much
knowledge of **what** data is there (this is covered with `Manifest`).

- a `CollectionSet` is a consistent collection of signatures. Consistent
here means: same k-size, downsample-compatible for scaled, same moltype.
You can create a `CollectionSet` by running `.select()` on a
`Collection`. `CollectionSet` is required for building indices (because
we shouldn't be building indices mixing k-size/moltype), and greatly
simplifies the logic in many places because it is not necessary to check
for compatibility.

- `LinearIndex` was rewritten based on `Collection` (and also the
`greyhound`/`branchwater` parallelism), and this supports the "parallel
search without an index" use case. There is no index construction per se
here, pretty much just a thin layer on top of `Collection` implementing
functionality expected from indices.

- `Manifest`, `Selection`, and `Picklist` are still incomplete, but the
relevant function definitions are in place, need to barrage it with
tests (and potentially exposing to Python and reusing the ones there in
#2726)

## Feature

- Initial implementation for `Manifest`, `Selection`, and `Picklist`
following
  the Python API.
- `Collection` is a new abstraction for working with a set of
signatures. A
collection needs a `Storage` for holding the signatures (on-disk,
in-memory,
or remotely), and a `Manifest` to describe the metadata for each
signature.
- Expose CSV parsing and RocksDB errors.
- New module `sourmash::index::revindex::disk_revindex` with the on-disk
  RevIndex implementation based on RocksDB.
- Add `iter` and `iter_mut` methods for `Signature`.
- Add `load_sig` and `save_sig` methods to `Storage` trait for
higher-level data
  manipulation and caching.
- Add `spec` method to `Storage` to allow constructing a concrete
`Storage` from
  a string description.
- Add `InnerStorage` for synchronizing parallel access to `Storage`
  implementations.
- Add `MemStorage` for keeping signatures in-memory (mostly for
debugging and
  testing).

## Refactor

- Rename `HashFunctions` variants to follow camel-case, so
`Murmur64Protein` instead of `murmur64_protein`
- `LinearIndex` is now implemented as a thin layer on top of
`Collection`.
- Move `GatherResult` to `sourmash::index` module.
- Move `sourmash::index::revindex` to `sourmash::index::mem_revindex`
(this is
the Greyhound version of revindex, in-memory only). It was also
refactored
internally to build a version of a `LinearIndex` that will be merged in
the
  future with `sourmash::index::LinearIndex`
- Move `select` method from `Index` trait into a separate `Select`
trait,
  and implement it for `Signature` based on the new `Selection` API.
- Move `SigStore` into `sourmash::storage` module, and remove the
generic. Now
  it always stores `Signature`. Also implement `Select` for it.

## Build

- Add new `branchwater` feature (enabled by default), which can be
disabled by downstream projects to limit bringing heavy dependencies
like rocksdb
- Add new `rkyv` feature (disabled by default), making `MinHash`
serializable
  with the `rkyv` crate.
- Add semver checks for CI (so we bump versions accordingly, or avoid
breaking changes)
- Reduce features combinations on Rust checks (takes much less time to
run)
- Disable `musllinux` wheels (need to figure out how to build rocksdb
for it)

---------

Co-authored-by: Tessa Pierce Ward <bluegenes@users.noreply.github.com>
Co-authored-by: C. Titus Brown <titus@idyll.org>
@luizirber luizirber force-pushed the lirber/manifest_selection_picklist branch from 642bb27 to 9a93249 Compare December 2, 2023 16:29
@luizirber luizirber force-pushed the lirber/manifest_selection_picklist branch from 9a93249 to 710b48d Compare December 5, 2023 02:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants