Increase the number of dims for KNN vectors to 2048 [LUCENE-10471] #11507
Comments
Robert Muir (@rmuir) (migrated from JIRA) I don't "strongly object" but I question the approach of just raising the limit to satisfy whatever shitty models people come up with. At some point we should have a limit, and people should do dimensionality reduction.
Julie Tibshirani (@jtibshirani) (migrated from JIRA) I also don't have an objection to increasing it a bit. But along the same lines as Robert's point, it'd be good to think about our decision making process – otherwise we'd be tempted to continuously increase it. I've already heard users requesting 12288 dims (to handle OpenAI DaVinci embeddings). Two possible approaches I could see:
I feel a bit better about approach 2 because I'm not confident I could come up with a statement about a "reasonable max dimension", especially given the fast-moving research.
Robert Muir (@rmuir) (migrated from JIRA) I think the major problem is still that there is no Vector API in the Java APIs. That changes this entire conversation completely when we think about this limit. If OpenJDK would release this low-level vector API, or barring that, maybe some way to MR-JAR for it, or barring that, maybe some intrinsics such as SloppyMath.dotProduct and SloppyMath.matrixMultiply, maybe Java wouldn't become the next COBOL.
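To make the Vector API point concrete, here is a minimal, illustrative sketch (not Lucene code) of the kind of SIMD dot product the incubating Panama Vector API enables. It assumes a JDK that ships the `jdk.incubator.vector` module and a JVM started with `--add-modules jdk.incubator.vector`; the class and method names are made up for the example.

```java
// Run with: --add-modules jdk.incubator.vector (incubator module, JDK 16+)
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public final class PanamaDot {
  // Preferred species picks the widest lane count the hardware supports
  // (e.g. 128-bit on Apple M1, up to 512-bit on AVX-512 machines).
  private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

  public static float dotProduct(float[] a, float[] b) {
    float sum = 0f;
    int i = 0;
    int bound = SPECIES.loopBound(a.length);
    // Vectorized main loop: multiply lanes and reduce each chunk to a scalar.
    for (; i < bound; i += SPECIES.length()) {
      FloatVector va = FloatVector.fromArray(SPECIES, a, i);
      FloatVector vb = FloatVector.fromArray(SPECIES, b, i);
      sum += va.mul(vb).reduceLanes(VectorOperators.ADD);
    }
    // Scalar tail for the remaining elements.
    for (; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }
}
```

Because `SPECIES_PREFERRED` adapts to the platform's register width, the same code benefits more on wider-SIMD hardware, which is what the later benchmarks in this thread exercise.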
Stanislav Stolpovskiy (migrated from JIRA) I don't think there is a trend to increase dimensionality. Only a few models have feature dimensions of more than 2048. Most modern neural networks (ViT and the whole BERT family) have dimensions less than 1k. However, there are still many models, like ms-resnet or EfficientNet, that operate in the range from 1k to 2048. And they are the most common models for image embedding and vector search. The current limit forces dimensionality reduction for pretty standard shapes.
Michael Sokolov (@msokolov) (migrated from JIRA) We should not be imposing an arbitrary limit that prevents people with CNNs (image-processing models) from using this feature. It makes sense to me to increase the limit to the point where we would see actual bugs/failures, or where the large numbers might prevent us from making some future optimization, rather than trying to determine where the performance stops being acceptable – that's a question for users to decide for themselves. Of course we don't know where that place is that we might want to optimize in the future (Rob and I discussed an idea using all-integer math that would suffer from overflow), but still we should not just allow MAX_INT dimensions, I think. To me a limit like 16K makes sense – well beyond any stated use case, but not effectively infinite?
Mayya Sharipova (@mayya-sharipova) (migrated from JIRA)
Robert Muir (@rmuir) (migrated from JIRA) My questions are still unanswered. Please don't merge the PR when there are standing objections!
Mayya Sharipova (@mayya-sharipova) (migrated from JIRA) Sorry, maybe I should have provided more explanation.
Robert Muir (@rmuir) (migrated from JIRA) The problem is that nobody will ever want to reduce the limit in the future. Let's be honest, once we support a limit of N, nobody will ever want to make it smaller because of the potential users who wouldn't be able to use it anymore. So because this is a "one-way" decision, it needs serious justification, benchmarks, etc. Regardless of how the picture looks, it's definitely not something we should be "rushing" into 9.3.
Mayya Sharipova (@mayya-sharipova) (migrated from JIRA) Got it, thanks, I will not rush, and will try to provide benchmarks.
Michael Wechner (@michaelwechner) (migrated from JIRA) Maybe I do not understand the code base of Lucene well enough, but wouldn't it be possible to have a default limit of 1024 or 2048, and allow setting a different limit programmatically on the IndexWriter/Reader/Searcher?
Marcus Eagan (@MarcusSorealheis) (migrated from JIRA) @michaelwechner You are free to increase the dimension limit as it is a static variable, and Lucene is your oyster. However, @erikhatcher has seared into my mind that a long-term fork of Lucene is a bad idea for many reasons. @rmuir I agree with you on "whatever shitty models." They are here, and more are coming. With respect to the vector API, Oracle is doing an interesting bit of work in OpenJDK 17 to improve their vector API. They've added support for Intel's short vector math library, which should help. The folks at OpenJDK exploit the Panama APIs. There are several hardware accelerations they have yet to exploit, and many operations will fall back to scalar code. My argument for increasing the limit of dimensions is not to suggest that there is a better fulcrum in the performance tradeoff, but that more users testing Lucene is good for improving the feature. OpenAI's DaVinci is one such model, but not the only one. I've had customers ask for 4096 based on the performance they observe with question answering. I'm waiting on the model and will share when I know. If customers want to introduce rampant numerical errors in their systems, there is little we can do for them. Don't take my word on any of this yet. I need to bring data and complete evidence. I'm asking my customers why they cannot do dimensional reduction.
Michael Sokolov (@msokolov) (migrated from JIRA) > Maybe I do not understand the code base of Lucene well enough, but wouldn't it be possible to have a default limit of 1024 or 2048, and allow setting a different limit programmatically on the IndexWriter/Reader/Searcher? I think the idea is to protect ourselves from accidental booboos; this could eventually get exposed in some shared configuration file, and then if somebody passes MAX_INT it could lead to allocating huge buffers somewhere and taking down a service shared by many people/groups? Hypothetical, but it's basically following the principle that we should be strict to help stop people shooting themselves and others in the feet. We may also want to preserve our ability to introduce optimizations that rely on some limits to the size, which would become difficult if usage of larger sizes became entrenched. (We can't so easily take it back once it's out there.) Having said that, I still feel a 16K limit, while allowing for models that are beyond reasonable, wouldn't cause any of these sorts of issues, so that's the number I'm advocating.
Julie Tibshirani (@jtibshirani) (migrated from JIRA)
Mike's perspective makes sense to me too. I'd be supportive of increasing the limit to an upper bound. Maybe we could run a test with ~1 million synthetic vectors with the proposed max dimension (~16K) to check there are no failures or unexpected behavior?
Robert Muir (@rmuir) (migrated from JIRA) My main concern is that it can't be undone, as I mentioned. Nobody will be willing to go backwards. This is why I make a big deal about it, because of the "one-way" nature of the backwards compatibility associated with this change. It seems this is still not yet understood or appreciated. Historically, users fight against every limit we have in Lucene, so when people complain about this one, it doesn't bother me (especially when it seems related to one or two bad models/bad decisions unrelated to this project). But these limits are important, especially when features are in their infancy; without them, there is less flexibility and you can find yourself easily "locked in" to a particular implementation.
Robert Muir (@rmuir) (migrated from JIRA) It is also terrible that this issue says 2048 but somehow that already blew up to 16K here. -1 to 16K. It's unnecessarily large and puts the project at risk in the future. We can debate 2048.
A lot has happened since August, like the arrival of ChatGPT and people's increased desire to use OpenAI's state-of-the-art embeddings, which have 1536 dimensions. Can you at least please increase it to 1536 for now, while you discuss upper limits?
Actually it is a one-line change (without any guarantees), see https://github.com/apache/lucene/pull/874/files If you really want to shoot yourself in the foot: download the source code of Lucene in the version you need for your Elasticsearch instance (I assume you are coming from elastic/elasticsearch#92458), patch it with #874, and then run './gradlew distribution'. Copy the JAR files into your ES distribution. Done. But there is no guarantee that this won't blow up, and indexes created that way may no longer be readable with standard Lucene.
Why I made that suggestion: if you are interested, try it out with your dataset and your Elasticsearch server and report back! Maybe you will figure out that performance does not work or memory usage is too high.
I'll preface this by saying I am also skeptical that going beyond 1024 makes sense for most use cases, and scaling is a concern. However, amidst the current excitement to try and use OpenAI embeddings, the first cut at choosing a system to store and use those embeddings was Elasticsearch. Then the 1024 limit was hit, and various folks are looking at other alternatives largely because of this limit. The use cases tend to be Q/A, summarization, and recommendation systems for WordPress and Tumblr. There are multiple proof-of-concept systems people have built (typically on top of various TypeScript, JavaScript, or Python libs) which use the OpenAI embeddings directly (and give quite impressive results). Even though I am pretty certain that reducing the dimensions will be a better idea for many of these, the ability to build and prototype on higher dimensions would be extremely useful.
@uschindler @rmuir FWIW we are interested in using Lucene's kNN with 1536 dimensions in order to use OpenAI's embeddings API. We benchmarked a patched Lucene/Solr. We fully understand (we measured it :-P) that there is an increase in memory consumption and latency. Sure thing. We have applications where dev teams have chosen to work with OpenAI embeddings and where the number of records involved and requests per second make the trade-offs of memory and latency perfectly acceptable. There is a great deal of enthusiasm around OpenAI and releasing a working application ASAP. For many of these, the resource cost of 1536 dimensions is perfectly acceptable against the alternative of delaying a pilot to optimize further. Our work would be a lot easier if Lucene's kNN implementation supported 1536 dimensions without the need for a patch.
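For readers arriving from Elasticsearch/Solr, here is a self-contained sketch of the plain Lucene API this limit applies to. It assumes Lucene 9.6+ (where the float vector classes are `KnnFloatVectorField`/`KnnFloatVectorQuery`); the field name, directory choice, and toy 4-dim vector are made up — a real OpenAI embedding would be 1536 floats, above the 1024 default debated in this issue.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.KnnFloatVectorField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.VectorSimilarityFunction;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.KnnFloatVectorQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class KnnIndexSketch {
  public static void main(String[] args) throws Exception {
    try (Directory dir = new ByteBuffersDirectory();
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
      Document doc = new Document();
      // Toy 4-dim vector for illustration; the per-field dimension limit is
      // enforced when this field is indexed.
      float[] embedding = {0.1f, 0.2f, 0.3f, 0.4f};
      doc.add(new KnnFloatVectorField("embedding", embedding, VectorSimilarityFunction.COSINE));
      writer.addDocument(doc);
      writer.commit();

      try (DirectoryReader reader = DirectoryReader.open(dir)) {
        IndexSearcher searcher = new IndexSearcher(reader);
        // Approximate nearest-neighbor search over the HNSW graph for the field.
        TopDocs hits = searcher.search(new KnnFloatVectorQuery("embedding", embedding, 10), 10);
        System.out.println("hits: " + hits.totalHits);
      }
    }
  }
}
```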
I'm reminded of the great maxBooleanClauses debate. At least that limit is user-configurable (for the system deployer, not the end user doing a query) whereas this new one for kNN is not. I can understand how we got to this point -- limits often start as hard limits. The current limit even seems high based on what has been said. But users have spoken here on a need to configure Lucene for their use case (such as experimentation within a system they are familiar with) and to accept the performance consequences. I would like this to be possible with a System property. I don't think this has been expressly asked yet. Why should Lucene, just a library that doesn't know what's best for the user, prevent a user from being able to do that? This isn't an inquiry about why limits exist; of course systems need limits.
Hi @dsmiley, I updated the dev discussion on the mailing list: And proceeded with a pragmatic new mail thread, where we just collect proposals with a motivation (no discussion there): Feel free to participate!
The rabbit hole that is trying to store OpenAI embeddings in Elasticsearch eventually leads here. I read the entire thread and, unless I am missing something, the obvious move is to make the limit configurable (up to a point) or, at a minimum, increase the limit to 1536 to support the …
Cross posting here because I responded to the PR instead of this issue.
I think this comment actually supports @MarcusSorealheis's argument? E.g., what's the point in indexing 8K dimensions if it isn't much better at recall than 768?
I may be wrong, but it seems like this is where most of the Lucene committers here are settling? Over a decade ago I wanted a high-dimension index for some facial recognition and surveillance applications I was working on. I rejected Lucene at first only because it was written in Java, and I personally felt something like C++ was a better fit for the high-dimension job (no garbage collection to worry about). So I wrote a high-dimension indexer for MongoDB inspired by RTree (for the record, its implementation is based on XTree) and wrote it using C++ 14 preview features (lambda functions were the new hotness on the block and Java didn't even have them yet). Even in C++ back then, SIMD wasn't very well supported by the compiler natively, so I had to add all sorts of compiler tricks to squeeze every ounce of vector parallelization to make it performant. C++ has gotten better since then, but I think Java still lags in this area? Even JEP 426 is a ways off (maybe because OpenJDK is holding these things hostage)? So maybe Java is still not the right fit here?

I wonder, though: does that mean Lucene shouldn't provide dimensionality higher than an arbitrary 1024? Maybe not. I agree dimensionality-reduction techniques like PCA should be considered to reduce the storage volume. The problem with that argument is that dimensionality reduction fails when features are weakly correlated. You can't capture the majority of the signal in the first N components and therefore need higher dimensionality. But does that mean that 1024 is still too low to make Lucene a viable option? Aside from conjecture, does anyone have empirical examples where 1024 is too low, and what specific Lucene capabilities (e.g., scoring?) would make adding support for dimensions higher than 1024 really worth considering over using dimensionality reduction?

If Lucene doesn't do this, does it really risk the project becoming irrelevant? That sounds a bit like sensationalism. Even if higher dimensionality is added to the current vector implementation (I'd actually argue we should explore converting BKD to support higher dimensions instead), are we convinced it will reasonably perform without JEP 426 or better SIMD support that's only available in newer JDKs? Can anyone smart here post their benchmarks to substantiate their claims? I know Pinecone (and others) have blogged about their love for Rust for these kinds of applications. Should Lucene just leave this to the job of alternative search APIs? Maybe something like Tantivy or Rucene? Or is it time we explore a new optional Lucene vector module that supports cutting-edge JDK features through Gradle tooling for optimizing the vector use case? Interested what others think.
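As a concrete illustration of the dimensionality-reduction alternative discussed in the comment above, here is a toy sketch of applying an already-fitted PCA projection (mean vector and component matrix computed offline, e.g. on a sample of the corpus embeddings) before indexing. The class and the fitting step are hypothetical, not part of Lucene.

```java
/**
 * Applies a precomputed PCA projection to shrink a d-dimensional embedding
 * to k dimensions before it is indexed as a vector field.
 */
public final class PcaProjector {
  private final float[] mean;         // length d: per-dimension mean from the fit
  private final float[][] components; // k x d: rows are principal components

  public PcaProjector(float[] mean, float[][] components) {
    this.mean = mean;
    this.components = components;
  }

  /** Projects a d-dimensional vector onto the first k principal components. */
  public float[] project(float[] v) {
    float[] out = new float[components.length];
    for (int k = 0; k < components.length; k++) {
      float sum = 0f;
      for (int d = 0; d < v.length; d++) {
        // Center by the mean, then take the dot product with component k.
        sum += (v[d] - mean[d]) * components[k][d];
      }
      out[k] = sum;
    }
    return out;
  }
}
```

As the comment notes, this only works well when most of the variance is captured by the first k components; for weakly correlated features the projected vectors lose too much signal.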
Copying and pasting here, just for visibility:
I may sound like somebody who contradicts another just for the sake of doing so, but I do genuinely believe these kinds of discoveries support the fact that making it configurable is actually a good idea:
It also shows that this causes a long-tail of issues:
In addition, if we raise the number of dimensions, people will then start asking for higher precision in calculations, completely forgetting that Lucene is a full-text search engine meant to bring results in milliseconds, not 2 hours. Score calculations introduce rounding anyway, and making them exact is (a) not needed for Lucene (we just sort on those values) and (b) would slow down the whole thing so much. So keep the current limit and do NOT make it configurable. I agree with raising the maximum to 2048 (while recommending that people use Java 20 for running Lucene and enable incubator vectors). At the same time, close any issues about calculation precision and, on the other hand, get the JDK people to support half-float calculations.
I think a library should empower a user to discover what works (and doesn't) for them, rather than playing big brother and insisting it knows best that there's no way some high setting could ever work for any user. Right? By making it a system property that does not need to be configured for <= 1024, it should raise a red flag to users that they are venturing into unusual territory, i.e. they've been warned. They'd have to go looking for such a setting and see warnings; it's not something a user would do accidentally either.
LOL. People may ask for whatever they want :-) including using/abusing a system beyond its intended scope. So what? BTW, I've thoroughly enjoyed seeing several use cases of my code in Lucene/Solr that I had never considered yet worked really well for a user :-D. Pure joy. Of course not every request makes sense to us. I'd rather endure such requests than turn away users whom Lucene can support trivially today.
@uschindler, I am not convinced, but it's fine to have different opinions!
We may have different opinions here and that's fine, but my intent as a committer is to build the best solution for the community rather than the best solution according to my ideas. You know, if we wanted sub-ms responses all the time we could set a hard limit of 1024 chars per textual field and allow a very low number of fields, but then would Lucene attract any user at all?
I would like to renew the issue in light of the recent integration of the incubating Panama Vector API, as indexing of vectors with it is much faster. We ran a benchmarking test, and indexing a dataset of vectors of 1536 dims was slightly faster than indexing of 1024 dims. This gives us enough confidence to extend max dims to 2048 (at least when vectorization is enabled). Test environment
Test1:
Details
Test2
Details
I found this very strange at first :) But then I read more closely, and I think what you meant is indexing 1024 dims without Panama (SIMD vector instructions) is slower than indexing 1536 dims with Panama enabled? Which is really quite impressive. Do we know what gains we see at search time going from 1024 -> 1536?
Interestingly it was only an Apple M1. This one only has a 128-bit vector size and only 2 PUs (the 128 bits is in the spec of the CPU, but Robert told me about the number of PUs; I found no info on that in WikiChip). So I would like to also see the difference on a real cool AVX-512 machine with 4 PUs. So unfortunately the Apple M1 is a bit limited, but it is still good enough to outperform the scalar impl. Cool. Now please test on a real Intel server CPU. 😍 In general I am fine with raising vectors to 2048 dims. But apply that limit only to the HNSW codec. So the check should not be in the field type but in the codec.
@mikemccand Indeed, exactly as you said, sorry for being unclear. We have not checked search; we will work on that. @uschindler Thanks, indeed, we need tests on other machines. +1 for raising dims to 2048 in the HNSW codec.
I ran @mayya-sharipova's exact same benchmark/test on my machine. Here are the results. Test environment
Result
So the test run with 1536 dims and Panama enabled at AVX 512 was 503 secs (or ~16%) faster than the run with 1024 dims and No Panama. Test1:
Details
Test2
Details
Full output from the test runs can be seen here https://gist.github.com/ChrisHegarty/ef008da196624c1a3fe46578ee3a0a6c.
Can we run this test with Lucene's defaults (e.g. not a 2GB RAM buffer)?
I am extremely curious: what should we consider good performance for indexing <3M docs?
I've done the test and, surprisingly, indexing time decreased substantially. It is almost 2 times faster to index with Lucene's defaults than with a 2GB RAM buffer, at the expense of ending up with a bigger number of segments.
Details
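For context on what these runs compare, the knob in question is IndexWriter's RAM buffer. A minimal sketch follows; only the 2 GB figure and the 16 MB default come from this thread, everything else (path, analyzer, field contents) is made up.

```java
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class RamBufferSketch {
  public static void main(String[] args) throws Exception {
    IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
    // Benchmark configuration above: flush only after ~2 GB of buffered documents,
    // producing fewer, larger segments at flush time.
    iwc.setRAMBufferSizeMB(2048.0);
    // Lucene's default (IndexWriterConfig.DEFAULT_RAM_BUFFER_SIZE_MB = 16 MB) flushes
    // smaller segments more often and defers more work to background merging:
    // iwc.setRAMBufferSizeMB(IndexWriterConfig.DEFAULT_RAM_BUFFER_SIZE_MB);
    try (IndexWriter writer = new IndexWriter(FSDirectory.open(Paths.get("/tmp/knn-index")), iwc)) {
      // ... add documents with vector fields here ...
    }
  }
}
```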
Leaving a higher number of segments dodges the merge costs, I think.
This benchmark really only measures the flushing cost, as |
This increases the limit of float vector from 1024 to 2048. The previous limit was based on what Lucene provided, but current discussions and current benchmarks indicate that 2048 will also be OK, and the next Lucene version will have 2048 as the default: apache/lucene#11507 (comment) apache/lucene#11507 (comment)
The last comment is already a couple of months old, so could you please clarify the status of this initiative? Is there a chance it's going to be merged? Is there any blocker or action item that prevents it from being merged? The context of my inquiry is that Lucene-based solutions (e.g. OpenSearch) are commonly deployed within enterprises, which makes them good candidates to experiment with vector search and commercial LLM offerings, without deploying and maintaining specialized technologies. Max dimensionality of 1024, however, imposes certain restrictions (similar thoughts are here https://arxiv.org/abs/2308.14963).
Hi, to implement this (at your own risk), create your own KnnVectorsFormat (codec) that raises the limit. You can do this with Lucene 9.8+. OpenSearch, Elasticsearch, and Solr will have custom limits in their code (based on this approach).
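A rough sketch of that approach, assuming Lucene 9.8+ where `KnnVectorsFormat` exposes `getMaxDimensions(String)`. The wrapper class name and the chosen limit are made up, the delegate format class depends on the Lucene version, and reusing the delegate's format name (so existing readers resolve the data) is an assumption worth verifying — registering the wrapper under its own SPI name is the more conservative route.

```java
import java.io.IOException;
import org.apache.lucene.codecs.KnnVectorsFormat;
import org.apache.lucene.codecs.KnnVectorsReader;
import org.apache.lucene.codecs.KnnVectorsWriter;
import org.apache.lucene.codecs.lucene95.Lucene95HnswVectorsFormat;
import org.apache.lucene.index.SegmentReadState;
import org.apache.lucene.index.SegmentWriteState;

/**
 * Hypothetical vector format that keeps the stock HNSW on-disk layout but raises
 * the per-field dimension limit (enforced by the format/codec since Lucene 9.8).
 */
public final class HighDimKnnVectorsFormat extends KnnVectorsFormat {

  private final KnnVectorsFormat delegate = new Lucene95HnswVectorsFormat();
  private final int maxDimensions;

  public HighDimKnnVectorsFormat(int maxDimensions) {
    // Reuse the delegate's name so segments written through this wrapper
    // still resolve to the stock format when read back (assumption to verify).
    super("Lucene95HnswVectorsFormat");
    this.maxDimensions = maxDimensions;
  }

  @Override
  public KnnVectorsWriter fieldsWriter(SegmentWriteState state) throws IOException {
    return delegate.fieldsWriter(state); // writing is delegated unchanged
  }

  @Override
  public KnnVectorsReader fieldsReader(SegmentReadState state) throws IOException {
    return delegate.fieldsReader(state); // reading is delegated unchanged
  }

  @Override
  public int getMaxDimensions(String fieldName) {
    return maxDimensions; // e.g. 2048 instead of the 1024 default
  }
}
```

Wiring it in could then go through the codec's per-field hook, e.g. a `Lucene95Codec` subclass overriding `getKnnVectorsFormatForField(String)` to return this format, set on the `IndexWriterConfig` via `setCodec`.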
@mayya-sharipova: Should we close this issue, or are there any plans to also change the default maximum? I don't think so.
I think we should close it for sure.
Yes, thanks for the reminder. Now that the Codec is responsible for managing dims, we can close it.
The current maximum allowed number of dimensions is equal to 1024. But we see in practice a couple of well-known models that produce vectors with > 1024 dimensions (e.g. mobilenet_v2 uses 1280d vectors, OpenAI / GPT-3 Babbage uses 2048d vectors). Increasing max dims to 2048 will satisfy these use cases. I am wondering if anybody has strong objections against this.
Migrated from LUCENE-10471 by Mayya Sharipova (@mayya-sharipova), 6 votes, updated Aug 15 2022
Pull requests: #874