
Option for disabling term dictionary compression #12317

Open

jainankitk opened this issue May 20, 2023 · 6 comments

@jainankitk
Contributor

Description

While working on a customer issue, I noticed that the memory allocations for the recently added term dictionary compression are significant. After disabling the compression with a patch, I saw a noticeable reduction in memory allocation.

Generally the cost of storage is significantly lower than that of memory/CPU, so compression is most useful once a segment/index is being archived. During live data ingestion, when segments merge frequently, the cost of compression/decompression is paid more than once.

I am wondering about a couple of things here:

  • Should we expose an option to disable term dictionary compression?
  • Does it make sense to initialize the HighCompressionHashTable lazily in TermsWriter, since some (non-compression) code paths never end up using it? (See the sketch below.)
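
A minimal sketch of what that lazy initialization could look like, assuming a getter on the per-field writer (the field and method names here are illustrative, not the actual Lucene code; only `LZ4.HighCompressionHashTable` is a real Lucene class):

```java
import org.apache.lucene.util.compress.LZ4;

class TermsWriter {
  // ~256 KB (a 128 KB int[] plus a 128 KB short[]) once allocated, so fields
  // whose blocks never go through LZ4 would never pay for it
  private LZ4.HighCompressionHashTable compressionHashTable; // hypothetical field

  private LZ4.HighCompressionHashTable getCompressionHashTable() {
    if (compressionHashTable == null) {
      compressionHashTable = new LZ4.HighCompressionHashTable();
    }
    return compressionHashTable;
  }
}
```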

For context, the customer workload is running on an instance with 32 GB of memory, 16 GB of which is allocated to the heap. Attaching the memory allocation profile below:

[Screenshot: memory allocation profile, 2023-05-19]

@gsmiller
Contributor

gsmiller commented May 21, 2023

Interesting @jainankitk! Thanks for sharing details. I'm no expert in this area of our codec, but I'm curious to understand the issue a bit better. From the flame chart you provided, it looks like you're primarily looking at an indexing-related performance issue and are concerned with memory usage during writing. Is that correct? When you disabled compression with your patch, did you notice query-time performance changes? Compression isn't only useful for saving disk space; it's useful for keeping index pages hot in the OS cache and getting better data locality, which translates to better query-time performance. I bring this up because I would generally feel pretty cautious about introducing configuration options. So I'm pushing on the idea of disabling the compression a little bit (i.e., is it actually an overall "win" for your customer's use-case?).

@jainankitk
Contributor Author

@gsmiller - Thank you for reviewing and providing your comments.

it looks like you're primarily looking at an indexing-related performance issue and concerned with the memory usage during writing. Is that correct?

I am looking at an issue around higher GC in recent versions (8.10+) compared to previous versions (7.x). It's nothing specific to indexing.

When you disabled the patch, did you notice query-time performance changes?

I did not notice any degradation in performance, as the index is small enough to fit in memory with or without compression.

Compression isn't only useful for saving disk space; it's useful for keeping index pages hot in the OS cache and getting better data locality, which translates to better query-time performance.

I'm not sure I understand this completely. Based on my understanding, a file is nothing but an array of bytes, and the Lucene reader works with that directly. If we compress and store those bytes, the indices into that array change and the reader cannot use it directly. So even if we keep the file hot in the OS cache, some intermediate logic has to decode that sequence of bytes (decompression), and the decompressed sequence needs to be stored somewhere, be it a byte buffer on heap or in native memory. Although we only decode the blocks that the Lucene reader needs, we could have read those same blocks directly into memory from an uncompressed file.
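
To illustrate the contrast with a rough sketch (the fixed block layout, offsets, and method names are assumptions for illustration; `IndexInput` and `LZ4.decompress` are real Lucene APIs):

```java
import java.io.IOException;
import org.apache.lucene.store.IndexInput;
import org.apache.lucene.util.compress.LZ4;

class BlockReadSketch {
  static final int BLOCK_SIZE = 4096; // assumed fixed block size

  // Uncompressed: block i lives at a fixed offset, so the reader can address
  // the bytes sitting in the OS page cache directly.
  static void readRaw(IndexInput in, int blockIndex, byte[] dest) throws IOException {
    in.seek((long) blockIndex * BLOCK_SIZE);
    in.readBytes(dest, 0, BLOCK_SIZE);
  }

  // Compressed: the block must first be located via an offset table, then
  // decompressed into a scratch buffer before any byte can be used.
  static void readCompressed(IndexInput in, long[] blockOffsets, int blockIndex, byte[] dest)
      throws IOException {
    in.seek(blockOffsets[blockIndex]);
    LZ4.decompress(in, BLOCK_SIZE, dest, 0);
  }
}
```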

@jpountz Thoughts?

@gsmiller
Contributor

@jainankitk thanks! To clarify my question a little bit: my understanding is that you'd like to explore the idea of making this compression optional, based on memory usage profiling. I guess what I'm wondering is whether that would ever really be an overall benefit in your system (or for our users more generally). A smaller index has a number of benefits, one of which can be improved query-time performance due to data locality, such as more of the index remaining hot in the page cache. I'd personally rather optimize for better query-time performance than for memory consumption while indexing (within reason, of course), but I acknowledge that different users have different needs. I'm just wondering if disabling this compression is something users would actually be interested in, as I question how it might impact query performance.

(Note: I'm only responding to the aspect of making this configurable, not your other points about maybe making it more efficient in some cases.)

@jpountz
Contributor

jpountz commented Jun 6, 2023

Sorry for the lag! I've been out for some time but I am back now. In general, we don't like adding options to file formats and prefer to have full control to keep file formats easy to reason about and to test.

The object that you are referring to (LZ4.HighCompressionHashTable) only gets allocated once per field per segment, which is not much. So I wouldn't generally expect it to be a big contributor to a heap profile unless there are many small segments getting written, which could happen if you do frequent refreshes, have many fields, or have many indices. In that case it's possible that LZ4 compression never gets used on some fields/segments because of the checks on prefix length and average suffix length, so your idea to lazily allocate this compression hash table might help?
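
For illustration, a rough sketch of the situation being described, with a made-up compressibility check standing in for the real prefix-length / average-suffix-length heuristics (only `LZ4.compress` and `LZ4.HighCompressionHashTable` are real Lucene APIs; everything else is assumed):

```java
import java.io.IOException;
import org.apache.lucene.store.DataOutput;
import org.apache.lucene.util.compress.LZ4;

class SuffixWriteSketch {
  private LZ4.HighCompressionHashTable hashTable; // hypothetical lazily allocated field

  void writeSuffixes(byte[] suffixes, int len, int numEntries, DataOutput out)
      throws IOException {
    // Stand-in for the real checks: only attempt LZ4 when the block's suffix
    // bytes look compressible.
    if (len > 2L * numEntries) {
      if (hashTable == null) {
        hashTable = new LZ4.HighCompressionHashTable(); // first use pays the ~256 KB
      }
      LZ4.compress(suffixes, 0, len, out, hashTable);
    } else {
      out.writeBytes(suffixes, 0, len); // raw write; the table is never allocated
    }
  }
}
```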

@jainankitk
Contributor Author

In general, we don't like adding options to file formats and prefer to have full control to keep file formats easy to reason about and to test.
I'm just wondering if disabling this compression is something that users would actually be interested in, as I question how it might impact the query performance.

Since I don't have concrete evidence of performance degradation, it seems reasonable to not add an option, so as to keep the testing overhead limited.

So I wouldn't generally expect it to be a big contributor to a heap profile unless there are many small segments getting written, which could happen if you do frequent refreshes, have many fields, or have many indices. In that case it's possible that LZ4 compression never gets used on some fields/segments because of the checks on prefix length and average suffix length, so your idea to lazily allocate this compression hash table might help?

Per field per segment looks reasonably high to me, given that each of these allocates 256 KB (128 KB for a short[] and 128 KB for an int[]). I have seen index mappings with up to 1,500 fields, although not all of them are text fields. For these very large mappings, we are talking about a couple hundred MBs. And due to the tiered merge policy, every segment might get merged a few times. Hence, it does make sense to lazily allocate this compression hash table.
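
(Back-of-the-envelope, assuming every field allocated a table: 1,500 fields × 256 KB ≈ 384 MB of hash-table allocations per segment write, repeated for every merge the segment participates in.)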

@jpountz
Contributor

jpountz commented Jun 7, 2023

I just noticed that it's already lazily allocated since #10855. To help with many fields, we could try to store this hash table on Lucene90BlockTreeTermsWriter instead of TermsWriter, i.e. we'd allocate the hash table once per segment instead of once per field per segment.
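
A minimal sketch of that idea, under the assumption that the per-field TermsWriter instances run sequentially within a single flush (the class structure and method names are illustrative, not the actual Lucene code):

```java
import org.apache.lucene.util.compress.LZ4;

final class Lucene90BlockTreeTermsWriter {
  // One table per segment, shared by every field's TermsWriter. This is only
  // safe because fields are written one after another on a single thread.
  private LZ4.HighCompressionHashTable segmentHashTable; // hypothetical field

  LZ4.HighCompressionHashTable getSegmentHashTable() {
    if (segmentHashTable == null) {
      segmentHashTable = new LZ4.HighCompressionHashTable();
    }
    return segmentHashTable;
  }

  class TermsWriter {
    // ...per-field writing logic would call getSegmentHashTable() instead of
    // holding its own ~256 KB table...
  }
}
```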
