Speedup scan speed for '--patch-from' via rolling hashes #2189

RubenKelevra · 2020-06-02T17:11:12Z

IPFS has the ability to dedup blocks between different types of files. This functionality is based on a rolling hash algorithm.

You can either select rabin or buzzhash for this task (in IPFS). Rabin is kind of slow, but buzzhash is quite fast.

A rolling hash would make it possible to 'prescan' both files, get some cut marks, and run a fast cryptographic hash such as blake2b over the resulting chunks.
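To make the prescan idea concrete, here is a minimal sketch of content-defined chunking followed by a blake2b digest per chunk. It uses a simple polynomial rolling hash, not rabin or buzzhash, and it is not how zstd or IPFS implement this. The window size, cut mask, and the names `cut_marks` and `chunk_digests` are all assumptions made for illustration.

```python
import hashlib

WINDOW = 16          # bytes covered by the rolling hash (assumed value)
MASK = (1 << 6) - 1  # cut when the low 6 bits are zero, roughly 64-byte chunks (assumed)
BASE = 257
MOD = (1 << 31) - 1


def cut_marks(data):
    """Return the end offsets of content-defined chunks in `data`."""
    marks = []
    h = 0
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte that leaves the window
    for i, byte in enumerate(data):
        if i >= WINDOW:
            # Drop the oldest byte so the hash depends only on the last WINDOW bytes.
            h = (h - data[i - WINDOW] * top) % MOD
        h = (h * BASE + byte) % MOD
        if i + 1 >= WINDOW and (h & MASK) == 0:
            marks.append(i + 1)
    if data and (not marks or marks[-1] != len(data)):
        marks.append(len(data))  # the final chunk always ends at end of input
    return marks


def chunk_digests(data):
    """Return a list of (offset, length, blake2b digest), one entry per chunk."""
    chunks = []
    start = 0
    for end in cut_marks(data):
        digest = hashlib.blake2b(data[start:end], digest_size=16).digest()
        chunks.append((start, end - start, digest))
        start = end
    return chunks
```

Because a cut depends only on the last `WINDOW` bytes, inserting a few bytes at the front of a file shifts the later cut marks without changing the chunk contents, so their digests still match between A and B.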

I think both operations are much cheaper than pattern matching. This way you can skip every pattern-matching attempt that falls inside blocks already known to be equal on both sides (A and B).

The first layer of patching would just generate length+offset+move triples, which can be used to copy the blocks from the original file into a sparse file as the first patching operation.
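A rough sketch of that first layer, under one reading of the triple: (length, source offset in A, destination offset in B). The helper names `describe`, `plan_copies`, and `apply_copies` are hypothetical. The chunk boundaries are passed in by hand here, and a zero-filled `bytearray` stands in for the sparse output file.

```python
import hashlib


def describe(data, ends):
    """Hash the chunks of `data` that end at the given offsets (hypothetical helper)."""
    chunks, start = [], 0
    for end in ends:
        digest = hashlib.blake2b(data[start:end], digest_size=16).digest()
        chunks.append((start, end - start, digest))
        start = end
    return chunks


def plan_copies(chunks_a, chunks_b):
    """Return (copies, gaps).

    copies: (length, src_offset, dst_offset) triples for chunks of B found in A.
    gaps:   (dst_offset, length) ranges of B that still need pattern matching.
    """
    index = {}
    for offset, length, digest in chunks_a:
        index.setdefault((digest, length), offset)  # keep the first occurrence
    copies, gaps = [], []
    for offset, length, digest in chunks_b:
        src = index.get((digest, length))
        if src is None:
            gaps.append((offset, length))
        else:
            copies.append((length, src, offset))
    return copies, gaps


def apply_copies(a, copies, out_len):
    """Copy the matched blocks from A into a zero-filled output buffer."""
    out = bytearray(out_len)  # stand-in for a sparse file
    for length, src, dst in copies:
        out[dst:dst + length] = a[src:src + length]
    return out


a = b"HEAD" + b"payload-1" + b"payload-2"
b = b"NEWHEAD" + b"payload-2" + b"payload-1"
copies, gaps = plan_copies(describe(a, [4, 13, 22]), describe(b, [7, 16, 25]))
print(copies)  # [(9, 13, 7), (9, 4, 16)]
print(gaps)    # [(0, 7)]
```

Only the ranges listed in `gaps` would then be handed to the regular pattern matching.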

The pattern matching could then be used on top of that to fill the gaps in the output file.

Originally posted by @RubenKelevra in #2063 (comment)

terrelln · 2022-12-22T02:09:43Z

--patch-from does already use rolling hashes!

However, we also use our regular match finders, so we end up inserting the whole file anyway.

One idea would be to only insert the "end" of the dictionary into our normal match finders. Basically only the portion that we expect would be reasonably indexed by our normal hash tables (e.g. like 4 * (1 << max(hashLog, chainLog))). Then let our LDM mode handle the rest.
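As a worked example of that sizing rule, the sketch below evaluates `4 * (1 << max(hashLog, chainLog))` and derives which tail of the dictionary would go to the normal match finders. The function name `tail_to_index`, the sample `hashLog`/`chainLog` values, and the reading of the result as a byte count are assumptions for illustration. This is not zstd's code.

```python
def tail_to_index(hash_log, chain_log, dict_size):
    """Return (start_offset, length) of the dictionary tail to insert.

    Assumes the budget from the formula is measured in bytes.
    """
    budget = 4 * (1 << max(hash_log, chain_log))
    tail = min(budget, dict_size)  # never more than the dictionary itself
    return dict_size - tail, tail


# Illustrative values: hashLog=17, chainLog=16, 100 MiB dictionary.
start, length = tail_to_index(17, 16, 100 * 1024 * 1024)
print(start, length)  # 104333312 524288
```

With these sample values only the last 512 KiB would be indexed by the regular hash tables, and the remaining prefix would be left to LDM.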

bimbashrestha self-assigned this Jun 5, 2020

terrelln unassigned bimbashrestha Dec 22, 2022

terrelln added the optimization label Dec 22, 2022

daniellerozenblit self-assigned this Mar 6, 2023

daniellerozenblit mentioned this issue Mar 10, 2023

patch-from speed optimization #3545

Merged
