lazy-init LZ4-HC hashtable in blocktreewriter [LUCENE-9816] #10855
Comments
Robert Muir (@rmuir) (migrated from JIRA) Attached patch; I don't see the LZ4-HC stuff when profiling tests anymore. I think the logic @jpountz added here is enough to avoid even trying this compression when flushing if it stands no chance of benefiting.
Michael McCandless (@mikemccand) (migrated from JIRA) Whoa, thank you JFR! I love how simple the patch is – just moving to lazy init. +1 And it is disappointing this LZ4 hash table allocates 128 KB up front like that!?
Adrien Grand (@jpountz) (migrated from JIRA) +1 to the patch @mikemccand This is due to how the algorithm looks for duplicates: it stores a large hash table that maps 4-byte sequences to offsets in the input. The high-speed variant uses packed ints to lower memory usage on short inputs, but maybe we could be smarter on the high-compression variant: since it records not only the last offset for every hash, like the high-speed variant, but also the previous ones up to a 64k window, a strong hash function only makes compression faster, not more effective. So maybe we could adapt the number of bits of the hash function to the size of the input in order to reduce memory usage, without hurting compression ratios and hopefully without hurting compression speed.
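As a rough sketch of that idea (the helper names here are hypothetical, not Lucene's API): derive the number of hash bits from the input length, since an input of n bytes contains at most about n distinct 4-byte sequences, so roughly log2(n) bits are enough to cover it:

```java
// Hypothetical sketch of sizing the LZ4-HC hash table from the input
// length instead of a fixed bit count. hash() uses the same
// multiplicative constant as LZ4; hashBitsFor() is illustrative.
final class AdaptiveHashSketch {

  // Fewer bits -> smaller table and more collisions, which may slow
  // compression but should not hurt the ratio, since the HC chain
  // table still records prior offsets within the 64k window.
  static int hash(int fourByteSequence, int hashBits) {
    return (fourByteSequence * -1640531535) >>> (32 - hashBits);
  }

  // Pick just enough bits to roughly cover the input, clamped to a
  // sane range (256 .. 65536 entries).
  static int hashBitsFor(int inputLength) {
    int bits = 32 - Integer.numberOfLeadingZeros(Math.max(inputLength - 1, 1));
    return Math.min(Math.max(bits, 8), 16);
  }

  public static void main(String[] args) {
    // A 4 KB input gets a 4K-entry table (~16 KB of ints) instead of
    // the fixed 64K-entry table backed by large int[] + short[] arrays.
    System.out.println(hashBitsFor(4096)); // prints 12
  }
}
```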
Robert Muir (@rmuir) (migrated from JIRA) The patch takes care of the tests issue (and maybe some similar extreme use-cases). The tests (and maybe some users) just like to index a lot of fields, and to my knowledge this is the only place in the default codec where we are doing LZ4-HC. Due to the way we write (field at a time), maybe it doesn't help to super-optimize the hashtable memory usage? For example, it does not show up in luceneutil benchmarks at all.
ASF subversion and git services (migrated from JIRA) Commit dade99c in lucene-solr's branch refs/heads/master from Robert Muir:

LUCENE-9816: lazy-init LZ4-HC hashtable in BlockTreeTermsWriter

The LZ4-HC hashtable is heavy (128kb int[] + 128kb short[]) and must be reset for each field. Don't initialize this for a field until we see hints that the data might be compressible.
Michael McCandless (@mikemccand) (migrated from JIRA) +1, thanks for the explanation and musings about how we might further optimize it @jpountz!
Adrien Grand (@jpountz) (migrated from JIRA) Closing after the 9.0.0 release.
Based upon the data for a field, blocktree may compress with LZ4-HC (or with simple lowercase compression, or not at all).
But we currently eagerly initialize the HC hashtable (132k) for each field, regardless of whether it will even be "tried". This shows up as a top CPU and heap hotspot when profiling tests, and it creates unnecessary overhead for small flushes.
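For illustration, here is a minimal sketch of the lazy-init pattern the fix describes, assuming Lucene's LZ4 utility class (org.apache.lucene.util.compress.LZ4); the surrounding writer code and the threshold constant are hypothetical simplifications, not the actual patch:

```java
import java.io.IOException;

import org.apache.lucene.store.DataOutput;
import org.apache.lucene.util.compress.LZ4;

// Illustrative only: field and threshold names are made up, not
// the actual BlockTreeTermsWriter code.
class LazyLz4HcSketch {

  // Hypothetical threshold below which LZ4-HC stands no chance to help.
  private static final int MIN_COMPRESSIBLE_LENGTH = 2 * 1024;

  // The heavy hashtable (large int[] plus short[]) stays null until
  // a field actually produces data worth trying to compress.
  private LZ4.HighCompressionHashTable hashTable;

  void writeBlockBytes(byte[] bytes, int len, DataOutput out) throws IOException {
    if (len >= MIN_COMPRESSIBLE_LENGTH) {
      if (hashTable == null) {
        // Lazy init: the allocation cost is paid once, and only on demand.
        hashTable = new LZ4.HighCompressionHashTable();
      }
      LZ4.compress(bytes, 0, len, out, hashTable);
    } else {
      // Small flushes skip LZ4-HC entirely and never allocate the table.
      out.writeBytes(bytes, 0, len);
    }
  }
}
```

A writer that indexes many small fields then never touches the big arrays at all, which is why the hotspot disappears from test profiles.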
Migrated from LUCENE-9816 by Robert Muir (@rmuir), resolved Feb 28 2021
Attachments: LUCENE-9816.patch