lazy-init LZ4-HC hashtable in blocktreewriter [LUCENE-9816] #10855
Comments
Robert Muir (@rmuir) (migrated from JIRA) Attached patch; I don't see the LZ4-HC stuff when profiling tests anymore. I think the logic @jpountz added here is enough to avoid even trying this compression when flushing if it stands no chance of benefiting.
Michael McCandless (@mikemccand) (migrated from JIRA) Whoa, thank you JFR! I love how simple the patch is – just moving to lazy init. +1 And it is disappointing this LZ4 hash table allocates 128 KB up front like that!?
Adrien Grand (@jpountz) (migrated from JIRA) +1 to the patch @mikemccand This is due to how the algorithm looks for duplicates: it stores a large hash table that maps 4-byte sequences to offsets in the input. The high-speed variant uses packed ints to lower memory usage on short inputs, but maybe we could be smarter on the high-compression variant: since it records not only the last offset for every hash, like the high-speed variant, but also the previous ones up to a 64k window, a strong hash function only makes compression faster, not more effective. So maybe we could adapt the number of bits of the hash function to the size of the input in order to reduce memory usage, without hurting compression ratios and hopefully without hurting compression speed.
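As a rough sketch of that idea (the helper names here are hypothetical, not Lucene's API): derive the number of hash bits from the input length, since an input of n bytes contains at most about n distinct 4-byte sequences, so roughly log2(n) bits are enough to cover it:

```java
// Hypothetical sketch of sizing the LZ4-HC hash table from the input
// length instead of a fixed bit count. hash() uses the same
// multiplicative constant as LZ4; hashBitsFor() is illustrative.
final class AdaptiveHashSketch {

  // Fewer bits -> smaller table and more collisions, which may slow
  // compression but should not hurt the ratio, since the HC chain
  // table still records prior offsets within the 64k window.
  static int hash(int fourByteSequence, int hashBits) {
    return (fourByteSequence * -1640531535) >>> (32 - hashBits);
  }

  // Pick just enough bits to roughly cover the input, clamped to a
  // sane range (256 .. 65536 entries).
  static int hashBitsFor(int inputLength) {
    int bits = 32 - Integer.numberOfLeadingZeros(Math.max(inputLength - 1, 1));
    return Math.min(Math.max(bits, 8), 16);
  }

  public static void main(String[] args) {
    // A 4 KB input gets a 4K-entry table (~16 KB of ints) instead of
    // the fixed 64K-entry table backed by large int[] + short[] arrays.
    System.out.println(hashBitsFor(4096)); // prints 12
  }
}
```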
Robert Muir (@rmuir) (migrated from JIRA) The patch takes care of the tests issue (and maybe some similar extreme use-cases). The tests (and maybe some users) just like to index a lot of fields, and to my knowledge this is the only place in the default codec where we are doing LZ4-HC. Due to the way we write (field at a time), maybe it doesn't help to super-optimize the hashtable memory usage? For example, it does not show up in luceneutil benchmarks at all.
ASF subversion and git services (migrated from JIRA) Commit dade99c in lucene-solr's branch refs/heads/master from Robert Muir:

LUCENE-9816: lazy-init LZ4-HC hashtable in BlockTreeTermsWriter

The LZ4-HC hashtable is heavy (128kb int[] + 128kb short[]) and must be reset for each field. Don't initialize this for a field until we see hints that the data might be compressible.
Michael McCandless (@mikemccand) (migrated from JIRA) +1, thanks for the explanation and musings about how we might further optimize it @jpountz!
Adrien Grand (@jpountz) (migrated from JIRA) Closing after the 9.0.0 release.
Based upon the data for a field, blocktree may compress with LZ4-HC (or with simple lowercase compression, or not at all).
But we currently eagerly initialize the HC hashtable (132k) for each field, regardless of whether it will even be "tried". This shows up as a top CPU and heap hotspot when profiling tests, and it creates unnecessary overhead for small flushes.
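For illustration, here is a minimal sketch of the lazy-init pattern the fix describes, assuming Lucene's LZ4 utility class (org.apache.lucene.util.compress.LZ4); the surrounding writer code and the threshold constant are hypothetical simplifications, not the actual patch:

```java
import java.io.IOException;

import org.apache.lucene.store.DataOutput;
import org.apache.lucene.util.compress.LZ4;

// Illustrative only: field and threshold names are made up, not
// the actual BlockTreeTermsWriter code.
class LazyLz4HcSketch {

  // Hypothetical threshold below which LZ4-HC stands no chance to help.
  private static final int MIN_COMPRESSIBLE_LENGTH = 2 * 1024;

  // The heavy hashtable (large int[] plus short[]) stays null until
  // a field actually produces data worth trying to compress.
  private LZ4.HighCompressionHashTable hashTable;

  void writeBlockBytes(byte[] bytes, int len, DataOutput out) throws IOException {
    if (len >= MIN_COMPRESSIBLE_LENGTH) {
      if (hashTable == null) {
        // Lazy init: the allocation cost is paid once, and only on demand.
        hashTable = new LZ4.HighCompressionHashTable();
      }
      LZ4.compress(bytes, 0, len, out, hashTable);
    } else {
      // Small flushes skip LZ4-HC entirely and never allocate the table.
      out.writeBytes(bytes, 0, len);
    }
  }
}
```

A writer that indexes many small fields then never touches the big arrays at all, which is why the hotspot disappears from test profiles.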
Migrated from LUCENE-9816 by Robert Muir (@rmuir), resolved Feb 28 2021
Attachments: LUCENE-9816.patch