Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speedup scan speed for '--patch-from' via rolling hashes #2189

Open
RubenKelevra opened this issue Jun 2, 2020 · 1 comment
Open

Speedup scan speed for '--patch-from' via rolling hashes #2189

RubenKelevra opened this issue Jun 2, 2020 · 1 comment
Assignees

Comments

@RubenKelevra
Copy link
Contributor

IPFS has the ability to dedup blocks between different types of files. This functionality is based on a rolling hash algorithm.

You can either select rabin or buzzhash for this task (in IPFS). Rabin is kind of slow, but buzzhash is quite fast.

The rolling hash would allow to 'prescan' both files, get some cut marks and run some fast cryptographic hash algorithm over the chunks, like blake2b.

I think both operations are much cheaper than pattern matching. This way you can skip all pattern matching attempts which are on both sides (A and B) inside the known equal blocks.

The first layer of patching would just generate a lengths+offset+move triple, which can copy the blocks from the original file into a sparse file as first patching operation.

The pattern matching rules could be used on top of that, completing the gaps of the output file.

Originally posted by @RubenKelevra in #2063 (comment)

@terrelln
Copy link
Contributor

--patch-from does already using rolling hashes!

However, we also use our regular match finders, so end up inserting the whole file anyway.

One idea would be to only insert the "end" of the dictionary into our normal match finders. Basically only the portion that we expect would be reasonably indexed by our normal hash tables (e.g. like 4 * (1 << max(hashLog, chainLog))). Then let our LDM mode handle the rest.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants