A checksum utility for the multicore age. It's (going to be) so fast it will make you chuckle.
I work in genomics, which means I often transfer small handfuls of files from sequencing cores, where each file can be as large as half a terabyte. Checking the integrity of these files post-transfer can therefore be an arduous, time-consuming task. In my experience, bioinformaticians tackle this problem with shell or Python for loops that run a single-threaded checksum utility, then wait however long it takes for the integrity checks to finish before getting on with their analyses.
`checkle` aims to make this approach obsolete. It will perform checksums on batches of files transferred over the interwebs, using Merkle Trees to accelerate hashing on multicore machines.
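To make the Merkle-Tree idea concrete, here is a minimal sketch of chunked, parallel hashing in Rust. It is not `checkle`'s implementation; the `rayon` and `sha2` crates, the 8 MiB chunk size, and the `merkle_root`/`hash_pair` names are all assumptions made for illustration.

```rust
// Sketch only, not checkle's code. Assumes the `rayon` and `sha2` crates;
// the chunk size and tree shape are illustrative choices.
use rayon::prelude::*;
use sha2::{Digest, Sha256};

const CHUNK_SIZE: usize = 8 * 1024 * 1024; // 8 MiB leaves (assumption)

/// Hash two child digests into a parent node.
fn hash_pair(left: &[u8], right: &[u8]) -> Vec<u8> {
    let mut hasher = Sha256::new();
    hasher.update(left);
    hasher.update(right);
    hasher.finalize().to_vec()
}

/// Compute a Merkle root over a buffer, hashing the leaf chunks in parallel.
fn merkle_root(data: &[u8]) -> Vec<u8> {
    // Leaf level: each chunk is hashed on its own core via rayon.
    let mut level: Vec<Vec<u8>> = data
        .par_chunks(CHUNK_SIZE)
        .map(|chunk| Sha256::digest(chunk).to_vec())
        .collect();

    // Reduce pairwise until a single root digest remains.
    while level.len() > 1 {
        level = level
            .par_chunks(2)
            .map(|pair| match pair {
                [left, right] => hash_pair(left, right),
                [only] => only.clone(), // odd node is promoted unchanged
                _ => unreachable!(),
            })
            .collect();
    }
    level.pop().unwrap_or_else(|| Sha256::digest(b"").to_vec())
}
```

Because the leaf digests are independent, hashing one large file can use every available core instead of a single thread.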
I have the following goals for `checkle`:
- Find all recently transferred files based on a set of file attribute filters.
- Spread hashing across as many (virtual) cores as possible using Merkle Trees (for the heads: `checkle` is a portmanteau of checksum and Merkle); the sketch above illustrates the idea.
- If a manifest of hashes from the source server is provided, spread post-transfer checksums across cores as well (see the first sketch after this list).
- Support md5 for backward compatibility, along with at least one cryptographically secure hashing function.
- Be capable of reaching into `tar` and `zip` archives to checksum files without decompressing the whole archive (see the second sketch after this list).
- Have an easy-to-use command line interface powered by `clap`.
- Be easy to install, either through crates.io or with binaries for your platform of choice distributed in this repo.
- Print a report to `stdout` on which files should be re-transferred.
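To illustrate the manifest and re-transfer report goals, here is a hedged sketch of what the comparison step could look like. The manifest format (`<hex digest>  <path>` per line), the `hex` crate, and the `parse_manifest`/`report_mismatches` names are assumptions; `merkle_root` refers to the sketch above.

```rust
// Sketch only: a hypothetical manifest check, not checkle's CLI.
use rayon::prelude::*;
use std::collections::HashMap;
use std::fs;

/// Parse a manifest of `<hex digest>  <path>` lines into a path -> digest map.
fn parse_manifest(text: &str) -> HashMap<String, String> {
    text.lines()
        .filter_map(|line| {
            let mut parts = line.split_whitespace();
            let digest = parts.next()?.to_string();
            let path = parts.next()?.to_string();
            Some((path, digest))
        })
        .collect()
}

/// Print the paths whose local digest does not match the source manifest.
fn report_mismatches(manifest_path: &str) -> std::io::Result<()> {
    let manifest = parse_manifest(&fs::read_to_string(manifest_path)?);

    // Each file is independent, so rayon spreads the hashing across cores.
    let needs_retransfer: Vec<String> = manifest
        .par_iter()
        .filter_map(|(path, expected)| {
            let ok = fs::read(path)
                .map(|data| hex::encode(merkle_root(&data)) == *expected)
                .unwrap_or(false); // an unreadable file also needs re-transfer
            if ok { None } else { Some(path.clone()) }
        })
        .collect();

    // Only files that need to be fetched again go to stdout.
    for path in needs_retransfer {
        println!("{path}");
    }
    Ok(())
}
```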
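And a rough sketch of the archive goal: the `tar` crate streams entries one at a time, so each member can be hashed as it is read without unpacking the archive to disk. The `tar`, `flate2`, `sha2`, and `hex` crates and the `hash_tar_gz` name are assumptions, and a zip equivalent would follow a similar streaming pattern.

```rust
// Sketch only: streams entries out of a .tar.gz and hashes each one as it is
// read, without unpacking the archive to disk. Crate choices are assumptions.
use std::fs::File;
use std::io::{self, Read};

use flate2::read::GzDecoder;
use sha2::{Digest, Sha256};
use tar::Archive;

/// Hash every entry in a .tar.gz, printing `<hex digest>  <path>` per entry.
fn hash_tar_gz(archive_path: &str) -> io::Result<()> {
    let mut archive = Archive::new(GzDecoder::new(File::open(archive_path)?));
    for entry in archive.entries()? {
        let mut entry = entry?;
        let entry_path = entry.path()?.display().to_string();

        // Stream the entry through the hasher in fixed-size reads, so memory
        // use stays constant regardless of entry size.
        let mut hasher = Sha256::new();
        let mut buf = [0u8; 64 * 1024];
        loop {
            let n = entry.read(&mut buf)?;
            if n == 0 {
                break;
            }
            hasher.update(&buf[..n]);
        }
        println!("{}  {}", hex::encode(hasher.finalize()), entry_path);
    }
    Ok(())
}
```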
`checkle` will be made available on crates.io when it reaches a reasonable level of stability.