Option for disabling term dictionary compression #12317
Comments
Interesting @jainankitk! Thanks for sharing details. I'm no expert in this area of our codec, but I'm curious to understand the issue a bit better. From the flame chart you provided, it looks like you're primarily looking at an indexing-related performance issue and are concerned with memory usage during writing. Is that correct? When you disabled the compression with your patch, did you notice query-time performance changes? Compression isn't only useful for saving disk space; it's useful for keeping index pages hot in the OS cache and getting better data locality, which translates to better query-time performance. I bring this up because I would generally feel pretty cautious about introducing configuration options. So I'm pushing back a little on the idea of disabling the compression (i.e., is it actually an overall "win" for your customer's use-case?).
@gsmiller - Thank you for reviewing and providing your comments:
- We are looking at an issue around higher GC in recent versions (8.10+) compared to previous versions (7.x); nothing specific to indexing.
- I did not notice any degradation in query performance, as the index is small enough to fit in memory with or without compression.
- I'm not sure I understand this completely. As I understand it, a file is nothing but an array of bytes, and the Lucene reader works with that directly. If we compress and store those bytes, the indices into that array change and the reader can no longer use them directly. So even if we keep the compressed file hot in the OS cache, some intermediate logic has to decode that sequence of bytes (decompression), and the decompressed sequence needs to be stored somewhere, be it a byte buffer on heap or native memory (see the sketch after this list). Although we decode only the blocks the Lucene reader needs, we could have read those same blocks into native memory directly from an uncompressed file. @jpountz Thoughts?
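Below is a minimal sketch of the intermediate-buffer point, using `java.util.zip.Inflater` and a plain `FileChannel` as stand-ins rather than Lucene's actual LZ4 code path; the class and method names are illustrative only:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

public class BlockReadSketch {

  // Uncompressed file: the reader can address the on-disk bytes in place
  // (e.g. via mmap), so a hot page cache serves the block with no extra copy.
  static ByteBuffer readDirect(Path file, long offset, int length) throws IOException {
    try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
      return ch.map(FileChannel.MapMode.READ_ONLY, offset, length);
    }
  }

  // Compressed block: the on-disk bytes are not the bytes the reader needs,
  // so they must first be decoded into some intermediate buffer.
  static byte[] readCompressedBlock(byte[] compressedBlock, int decompressedLength)
      throws DataFormatException {
    byte[] dest = new byte[decompressedLength]; // the extra allocation under discussion
    Inflater inflater = new Inflater();
    inflater.setInput(compressedBlock);
    inflater.inflate(dest);
    inflater.end();
    return dest;
  }
}
```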
@jainankitk thanks! To clarify my question a little bit, my understanding is that you'd like to explore the idea of making this compression optional based on memory usage profiling. I guess what I'm wondering is whether that would ever really be an overall benefit in your system (or for our users more generally). A smaller index has a number of benefits, one of which is improved query-time performance due to data locality, such as more of the index remaining hot in the page cache. I'd personally rather optimize for better query-time performance than memory consumption while indexing (within reason of course), but I acknowledge that different users have different needs. I'm just wondering if disabling this compression is something that users would actually be interested in, as I question how it might impact query performance. (Note that I'm only responding to the aspect of making this configurable, not your other points about maybe making it more efficient in some cases.)
Sorry for the lag! I've been out for some time but I am back now. In general, we don't like adding options to file formats and prefer to have full control to keep file formats easy to reason about and to test. The object that you are referring to ( |
Since I don't have concrete evidence of performance degradation, it seems reasonable not to add an option, in order to keep the testing overhead limited.
Per field per segment looks reasonably high to me, given that each of these tables allocates 256 KB (128 KB for the short[] and 128 KB for the int[]). I have seen index mappings with up to 1500 fields, although not all of them are text fields. For these very large mappings we are talking about a couple hundred MBs per segment, and due to the tiered merge policy every segment might get merged a few times (a rough estimate is sketched below). Hence, it does make sense to lazily allocate this compression hash table.
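A rough back-of-the-envelope check of those numbers; the field count is the 1500-field upper bound mentioned above, and the merges-per-segment figure is purely illustrative:

```java
public class HashTableAllocationEstimate {
  public static void main(String[] args) {
    // 256 KB per table: 128 KB for the short[] plus 128 KB for the int[]
    long perTableBytes = 128L * 1024 + 128L * 1024;
    int fields = 1500;         // upper bound from the mapping above; not all are text fields
    int mergesPerSegment = 3;  // illustrative: tiered merging rewrites a segment a few times

    long perSegmentBytes = perTableBytes * fields;
    System.out.printf("per segment: ~%d MB%n", perSegmentBytes >> 20);            // ~375 MB
    System.out.printf("across merges: ~%d MB allocated%n",
        (perSegmentBytes * mergesPerSegment) >> 20);                              // ~1125 MB
  }
}
```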
I just noticed that it's already lazily allocated since #10855. To help with many fields, we could try to store this hash table on |
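For illustration, here is a minimal sketch of one possible shape of that idea: allocate the table lazily on first use and share a single instance across fields instead of paying for one per field. `CompressionHashTable` and the surrounding names are hypothetical placeholders, not Lucene classes:

```java
// Hypothetical stand-in for the ~256 KB LZ4 hash table discussed above;
// the class and field names here are illustrative, not Lucene's.
final class CompressionHashTable {
  final int[] hashTable = new int[32 * 1024];      // ~128 KB
  final short[] chainTable = new short[64 * 1024]; // ~128 KB

  void reset() {
    // clear any state left over from the previous field
  }
}

final class PerSegmentDictionaryWriter {
  private CompressionHashTable table; // stays null until some field actually compresses

  // Lazily allocate once, then reuse the same instance for every field in the segment,
  // instead of allocating ~256 KB per field.
  private CompressionHashTable hashTable() {
    if (table == null) {
      table = new CompressionHashTable();
    } else {
      table.reset();
    }
    return table;
  }

  void writeFieldDictionary(byte[] termBytes) {
    CompressionHashTable ht = hashTable();
    // ... hand `termBytes` and `ht` to the LZ4 compressor here ...
  }
}
```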
Description
While working on a customer issue, I noticed that the memory allocations for the recently added term dictionary compression are significant. After disabling the compression with a patch, I observed some reduction in memory allocation.
Generally, the cost of storage is significantly lower than that of memory/CPU, though compression can still be useful once the segment/index is archived. But during live data ingestion, when segments are merged frequently, the cost of compression/decompression is paid more than once.
Wondering about a couple of things here:
For context, the customer workload is running on an instance with 32 GB of memory, of which 16 GB is allocated to the heap. Attaching the memory allocation profile below: