Prevent VectorSimilarity.DOT_PRODUCT from returning negative scores #12342
I honestly would much prefer us removing the normalization restriction and not doing anything to the output. IMO, plain ol' dot_product is perfectly fine at scoring, as higher values mean more relevant (just like Lucene scoring). I am not sure why this was scaled at all. But since this changes how things are scored (and that would be considered a breaking change, correct?), why not remove the scaling entirely and let the score be the raw dot product?
OK, I did a bunch of reading on MAX Block WAND and MAXSCORE :/. I guess MBW is the reason for us bounding these vector scores somehow. Technically, max inner product is unbounded. Encoding information in the vector magnitude is useful (especially for multilingual search, from what I gather). My questions would be:
@benwtrent Thanks for taking a look! Interesting, I am not too familiar with MBW. I'll take a look. The main reason I wanted to avoid returning the dot product was to avoid negative scores, as referenced by #9044. Also, what do you mean by scaling continuously? The above formula gives negative and positive scores the same number of possible scores, reducing overall precision by a factor of 2 (please correct me if I am wrong).
Your algorithm is piecewise vs. continuous. But I am not sure how we could do a continuous transformation (with everything on the same scale). EDIT: Thinking more, I am not sure we would want to, and your equation is OK. More thought here is required.
Yeah, MBW and MAXSCORE get all mixed up in my brain. But, yes MAXSCORE is why we disallow negative scoring. Forgive my misdirection.
My main concern overall is this: we are changing the scoring methodology for positive scores (which are currently considered "valid"). I think this means that it cannot go into a Lucene 9.x release (correct, @jpountz?). What do you think @msokolov? Maybe we have a new similarity?
hmm, my view is that dot_product is only usable if your vectors are normalized, as documented. I also don't think we can change the scoring formula in a minor release. As for producing a dot product that is scaled for use with arbitrary vectors, I don't see the point, really. If what you want is to handle arbitrarily scaled vectors, EUCLIDEAN is a better choice. It will produce the same rank as DOT_PRODUCT for normalized vectors and has the meaning of an actual metric (satisfies the triangle inequality). What does dot product even mean for random vectors? What if one of the vectors is zero? Then it is equidistant to every other vector? I guess I'd want to see a real use case before working to support this use case that seems weird to me. And honestly there are other distances that seem more useful (L1 norm, for example).
Historically Lucene did not have restrictions on scores, but approaches for dynamic pruning like MAXSCORE and WAND assume that matching another clause would make the score higher, not lower. In hindsight, it made sense that scores should be non-negative, so we updated the contract accordingly.

I'm not too worried about having a vector similarity that produces unbounded scores from the perspective of MAXSCORE/WAND. The way things work today, the vector query is first rewritten into a query that maps a small set of doc IDs to scores, so we can easily get access to the maximum score over a range of doc IDs, which is what WAND/MAXSCORE need. To me the main concern is more that unbounded scores make it hard to combine scores with another query via a disjunction, as it's hard to know ahead of time whether the vector may completely dominate scores. And also backwards compatibility, as MikeS raised.
I say we have that problem now. Vector scores and BM25 are nowhere near on the same scale. Folks need to adjust their boosts accordingly, regardless.
I agree, BWC is a big deal here. And I suggest we create a new similarity that just uses dot product under the hood. Call it `MAXIMUM_INNER_PRODUCT`.
Quoting a Slack conversation with @nreimers:
thanks for the reference @benwtrent, that's an interesting perspective. I wouldn't be opposed to adding a new distance if people find it useful.
The way I read this, @jpountz, is that negative scores here are OK, as the optimization constraints traditionally required do not apply. If that is the case, I would suggest adding a new scoring methodology that is simply the dot product, and calling it maximum inner product. The current scaling for dot_product only makes sense for normalized vectors, and it should only be treated as an optimization.
For byte vectors we already have some guard built in (at least the score can't go below 0); see the byte dot-product scoring. So this also relates to this discussion: should we have some constraints on vectors while they are indexed? In the other PR we added the requirement that all their components be finite.
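For reference, a sketch of how that byte guard works (an approximation from memory of Lucene's `VectorUtil#dotProductScore`; treat the exact constants as illustrative):

```java
// Illustrative only: an approximation of how Lucene bounds byte-vector
// dot-product scores (cf. VectorUtil#dotProductScore).
class ByteDotProductScore {
  // Each component is a signed byte, so |a[i] * b[i]| <= 128 * 128 = 2^14.
  // The raw dot product over dims components therefore lies in
  // [-dims * 2^14, dims * 2^14]; dividing by dims * 2^15 maps it into
  // [-0.5, 0.5], and adding 0.5 yields a score in [0, 1].
  static float dotProductScore(byte[] a, byte[] b) {
    int dot = 0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
    }
    float denom = (float) (a.length * (1 << 15));
    return 0.5f + dot / denom;
  }
}
```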
I meant to convey that it should be OK not to have upper bounds on the produced scores. I still think scores should always be non-negative.
I think for floats it is not as easy as in the byte case.
would it make sense to truncate negative scores to zero? Since we think this is an abuse/misconfiguration, loss of information seems OK, and at least we would be able to guarantee not to violate the "no negative scores" contract. Then if we want to have a separate score that scales in an information-preserving way, we can add it.
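i.e., something along these lines (a sketch of the truncation idea, not existing Lucene code):

```java
// Sketch of the truncation idea: clamp the raw similarity at zero so the
// "no negative scores" contract holds, at the cost of collapsing all
// anti-parallel matches onto the same score.
class TruncatedScore {
  static float score(float rawSimilarity) {
    return Math.max(0f, rawSimilarity);
  }
}
```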
@msokolov The index / vector DB should return the dot product score as is. No scaling, no truncation. Using dot product is tremendously useful for embedding models: they perform much better than cosine similarity or euclidean distance in asymmetric settings, where you want to map a short search query to a longer relevant document (which is the most common case in search). But here the index should return the values as is, and it should then be up to the user to truncate negative scores or to normalize these scores to pre-defined ranges.
The problem is that this is not compatible with Lucene.
I would think as long as more-negative values are scored lower, we will retrieve documents in a sane manner. Scaling negatives to restrict them and then not scaling positive values at all could work. This brings us back to @jmazanec15's suggestion on scaling scores.
Yeah, after consideration, I think we could maybe argue for changing the scaling of negative values given that they were documented as unsupported, even though it would be breaking back-compat in the sense that scores would be changed. But I think we ought to preserve the scaling of non-negative values in case people have scaling factors they use for combining scores with other queries' scores. So we could go with @jmazanec15's suggestion, except leaving in place the scale by 1/2?
@msokolov Ah, so negative values would live between 0 and 0.5?
Yeah. Another thing we could consider is doing this scaling in KnnVectorQuery and/or its Scorer. These have the ultimate responsibility of complying with the Scorer contract. If we did it there, we wouldn't have to change the output of VectorSimilarity. However, it's messy to do it there since this is specific to a particular similarity implementation, so on balance doing it in the similarity makes more sense to me.
I think the scores would have to be preserved not only for positive dot products amongst normalized vectors, but also for negative ones, to avoid breaking bwc. I think the current range of valid dot products is [-1, 1] and scores map to [0, 1]. So I don't think we could map all negative values into [0, 0.5].
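For context (my paraphrase of the current formula, consistent with the "scale by 1/2" mentioned above), the existing scaling is linear, so negative raw values already occupy the lower half of the score range:

```latex
% Current DOT_PRODUCT scaling for normalized vectors:
\[
\mathrm{score}(d) = \frac{1 + d}{2}, \qquad d \in [-1, 1] \;\Longrightarrow\; \mathrm{score} \in [0, 1],
\]
% so raw values d < 0 already map to scores in [0, 0.5); remapping all
% negative raw values into [0, 0.5] would collide with these existing scores.
```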
Wasn't there the possibility to return one score for indexing and another for search? Basically the VectorSimilarity enum could have a separate method called queryScore(v1, v2) that is enforced to be positive. Actually for cosine it's not a problem, as it's normalized, so we can add 1 (and, for safety against rounding errors, clamp at 0).
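A sketch of that idea (the `queryScore` method is hypothetical, not an existing Lucene API): the enum keeps the raw similarity for indexing and exposes a separate, guaranteed-non-negative method for search-time scoring.

```java
// Hypothetical sketch of splitting indexing similarity from search-time
// scoring; queryScore is NOT an existing Lucene method.
enum VectorSimilarity {
  COSINE {
    @Override
    float compare(float[] v1, float[] v2) {
      return cosine(v1, v2); // raw similarity in [-1, 1], used for indexing
    }

    @Override
    float queryScore(float[] v1, float[] v2) {
      // shift into [0, 2]; clamp at 0 against floating-point rounding
      return Math.max(0f, 1f + cosine(v1, v2));
    }
  };

  abstract float compare(float[] v1, float[] v2);

  abstract float queryScore(float[] v1, float[] v2);

  static float cosine(float[] v1, float[] v2) {
    float dot = 0, n1 = 0, n2 = 0;
    for (int i = 0; i < v1.length; i++) {
      dot += v1[i] * v2[i];
      n1 += v1[i] * v1[i];
      n2 += v2[i] * v2[i];
    }
    return dot / (float) Math.sqrt((double) n1 * n2);
  }
}
```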
I am not sure why we care about separating VectorSimilarity and scoring. VectorSimilarity is only ever for KNN search and indexing, and as long as vectors that are less similar score lower, it's fine. If we start thinking about separating out scoring and similarity, we should do it for all the current similarities. This would be significant work and it would be tricky. Think of EUCLIDEAN: we invert its calculation so that a higher score means more similar, so we would still need a transformation there. If folks want the raw vector distances, they should use the underlying distance functions directly.
I think you are correct @jmazanec15, since normalized vectors are on the unit sphere. It's possible to have negative values (and thus fall into the [0, 0.5] range) when they point in opposite directions within the sphere. Your scaling method plus a new similarity seems like the way to go. This may complicate things (which 'dotProduct' should I use?!?!?!), but we should not change the existing DOT_PRODUCT.
@benwtrent I think that makes sense, but it would add a little confusion. How common is it to use vector results with MAX_SCORE/WAND? I am wondering if it would be better to just leave it as is in 9.x and change the javadoc warning to say that non-normalized vectors are supported, but that they should not be used with WAND/MAX_SCORE and can return negative scores. Then switch to the scaled score in 10 as a breaking change. Or is condoning negative scores under any circumstances a non-starter?
If you are utilizing hybrid search, losing WAND/MAX_SCORE will slow things down significantly. We should protect folks from shooting themselves in the foot.
I agree, there will be confusion. What do you think @uschindler & @msokolov? Being able to do non-normalized dot-product is an important aspect of recommendation engines and vector search as a whole. My imagination is too poor to come up with a better solution than adding a new similarity function that uses dot-product under the hood and scales differently.
@jmazanec15 have you done any digging into the dot-product scaling and whether it provides good recall in the MAX-INNER-PRODUCT search use case? https://blog.vespa.ai/announcing-maximum-inner-product-search/ and https://towardsdatascience.com/maximum-inner-product-search-using-nearest-neighbor-search-algorithms-c125d24777ef imply there might be some weirdness with HNSW and raw MIP. I am honestly not 100% sure if Lucene has this issue specifically with HNSW. A key observation in MIP is that a vector is not necessarily its own nearest neighbor, which breaks assumptions that graph-based indexes typically rely on.
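To spell that observation out (a standard MIPS fact; the example is made up): any longer vector pointing in the same direction beats the vector itself.

```latex
% For any q with \lVert q \rVert > 0, take d = 2q. Then
\[
\langle q, 2q \rangle = 2\,\lVert q \rVert^{2} > \lVert q \rVert^{2} = \langle q, q \rangle,
\]
% so q's maximum-inner-product neighbor is not q itself. Inner product is
% not a metric, which is why graph-building heuristics may need extra care.
```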
@benwtrent I have been thinking about this and am still not completely sure of the implications. It seems like the construction of the graphs may rely on some assumption about the underlying space supporting the triangle inequality. Thus, with inner product space where this does not hold, the graph construction might have problems. However, graphs aside, with brute force search, utilizing the scaled negative dot product would preserve the ordering of MIPs search. I will try to think more about this this week.
@jmazanec15 adding the largest magnitude for scoring per segment isn't that bad a change in the codec if it means we can truly support maximum inner product. Plus it would be a change that would help other vector indexing codecs in the future besides HNSW.
@benwtrent Interesting, I'm still not sure if this approach is necessary. I spoke with @searchivarius, who is the maintainer of nmslib, and he mentioned that there was some research suggesting this is not required (https://proceedings.neurips.cc/paper_files/paper/2018/hash/229754d7799160502a143a72f6789927-Abstract.html, https://arxiv.org/pdf/1506.03163.pdf). Let me try re-running the Vespa experiments with Lucene without the reduction and see what numbers we get. I don't think the blog post included any comparison to the negative dot-product approach (please correct me if I am missing something).
thank you @jmazanec15: there's also an unpublished paper (I can share the preprint privately) where we benchmarked HNSW for maximum inner product search on 3 datasets and it was just fine (for this paper I did try the reduction to the cosine similarity and I also got poorer outcomes). In my thesis, I benchmarked SW-graph (which is pretty much HNSW when it comes to the peculiarities of handling inner product search) using an inner-product-like similarity (a fusion of BM25 and MODEL1 scores) and it was fine. See the black asterisk run in Figure 3.2. Moreover, HNSW and SW-graph were tested with non-metric similarities (see again my thesis and references therein, as well as Yury Malkov's HNSW paper). These methods established SOTA results as well.

One may wonder why, right? I think for real datasets the quirky distances don't deviate from the Euclidean distances all that much, so the minimal set of geometric properties required for graph-based retrieval is preserved (and no, I don't think the triangle inequality is required). Specifically, for inner product search the outcomes are pretty close (in many cases) to the outcomes where the inner product search is replaced with cosine similarity (which is equivalent to L2 search). Why? Because with real embeddings the magnitude of vectors doesn't change all that much.

That said, there are of course degenerate cases (I know one, but embedding models don't produce such weirdness) where HNSW won't work with MIPS (or rather recall will be low). However, I am not aware of any realistic one. If you have some interesting examples of real datasets where direct application of HNSW/SW-graph fails, I would love to have a look.

REGARDING THE SCORE sign: dot-product scores need not be normalized, but the sign can be changed when the result is returned to the user.
I fixed my bug (thanks @jmazanec15!); I will never forget that one. Here is my new run.
Cool. Now the non-transformed run is better.
@benwtrent many thanks! This is even more convincing, since now we have two cases where original is much better than transformed and one where it is roughly on par.
OK, here (unless we have done these incorrectly) is the final Cohere test (IMO). This is mixing English and Japanese embeddings (just in case the Cohere model encodes info for language in magnitude). To mix the language embeddings, I appended the embedded rows together, shuffled them, and then ran the previous testing procedure.

EN-JA
Q: just want to make sure I understand what the transformation is that you are testing here. Is it that you are taking non-unit vectors and making them into unit vectors by dividing by the max length in the corpus and then adding another dimension to force the vector to be unit length? And then comparing this to max-inner-product on vectors of varying length? If so, what is the meaning of the length of the vectors - where does it come from? Do the training processes generate non-unit vectors?
@msokolov the transformation also preserves the inner product up to a query-specific constant.
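Spelling out the standard MIPS-to-Euclidean reduction used in these tests (my derivation; M denotes the maximum document norm in the corpus):

```latex
% Augment each document d and query q with one extra dimension:
\[
\hat{d} = \bigl[d;\ \sqrt{M^{2} - \lVert d \rVert^{2}}\bigr], \qquad \hat{q} = [q;\ 0].
\]
% Every \hat{d} then has norm M, the inner product is preserved,
% \langle \hat{q}, \hat{d} \rangle = \langle q, d \rangle, and
\[
\lVert \hat{q} - \hat{d} \rVert^{2} = \lVert q \rVert^{2} + M^{2} - 2\,\langle q, d \rangle,
\]
% so for a fixed query, minimizing Euclidean distance over the transformed
% vectors is equivalent to maximizing the original inner product.
```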
Yes, the training process often uses the inner/dot product and not the cosine similarity, so some information is being encoded by the vector length/magnitude. Typically, however, the variance isn't that huge.
@msokolov In our BEIR paper we talked about this: the issue with cosine similarity is that it just encodes the topic. Consider a query with two candidate documents, A and B, where B is on-topic and contains more information on the topic than A. A cosine similarity model would retrieve A (it is more on-topic). A dot-product model would retrieve B, as it is on-topic as well and contains more information on the topic (represented as a longer vector). I.e. cossim(query, A) > cossim(query, B) but dotprod(query, A) < dotprod(query, B). So dot-product models can encode topic + quantity/quality of information, while cosine similarity models just encode topic match (which isn't perfect for retrieval).
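A tiny numeric illustration of this point (the numbers are mine, not from the paper): let q = (1, 0), A = (1, 0), B = (1.8, 0.6).

```latex
\[
\cos(q, A) = 1 \;>\; \cos(q, B) = \tfrac{1.8}{\sqrt{3.6}} \approx 0.95,
\qquad
\langle q, A \rangle = 1 \;<\; \langle q, B \rangle = 1.8,
\]
% so cosine retrieves the shorter, purely on-topic A, while dot product
% retrieves the longer B, whose magnitude encodes more information.
```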
I found another dataset, Yandex Text-to-image: https://research.yandex.com/blog/benchmarks-for-billion-scale-similarity-search

I tested against the first 500_000 values in the 1M dataset. It utilizes inner product; looking at magnitudes, they are all < 1, ranging from 0.79 to 0.99. So this dataset might not be that useful :/. I am just looking for more realistic inner-product search datasets and found this one.

Yandex Image-to-Text
code for transforming yandex fbin:

import numpy as np
def read_fbin(filename, start_idx=0, chunk_size=None):
""" Read *.fbin file that contains float32 vectors
Args:
:param filename (str): path to *.fbin file
:param start_idx (int): start reading vectors from this index
:param chunk_size (int): number of vectors to read.
If None, read all vectors
Returns:
Array of float32 vectors (numpy.ndarray)
"""
with open(filename, "rb") as f:
nvecs, dim = np.fromfile(f, count=2, dtype=np.int32)
nvecs = (nvecs - start_idx) if chunk_size is None else chunk_size
arr = np.fromfile(f, count=nvecs * dim, dtype=np.float32,
offset=start_idx * 4 * dim)
return arr.reshape(nvecs, dim)
def transform_queries(Q):
n, _ = Q.shape
return np.concatenate([Q, np.zeros((n, 1))], axis=-1, dtype=np.float32)
def transform_docs(D, norms):
    n, d = D.shape
    max_norm = norms.max()  # max document norm in the corpus (use the parameter, not a global)
    flipped_norms = np.copy(norms).reshape(n, 1)
    # the extra component sqrt(max_norm^2 - |d|^2) gives every doc the same norm
    transformed_data = np.concatenate([D, np.sqrt(max_norm**2 - flipped_norms**2)], axis=-1, dtype=np.float32)
    return transformed_data
def validate_array_match_upto_dim(arr1, arr2, dim_eq_upto):
assert np.allclose(arr1[:dim_eq_upto], arr2[:dim_eq_upto]), "data sets are different"
def validate_dataset_match_upto_dim(arr1, arr2, dim_eq_upto):
n1, d1 = arr1.shape
n2, d2 = arr2.shape
assert n1 == n2, f"Shape does not map [{arr1.shape}] vs [{arr2.shape}]"
for i in range(n1):
validate_array_match_upto_dim(arr1[i], arr2[i], dim_eq_upto)
name = "yandex"
np_flat_ds = read_fbin("base.1M.fbin")[0:500_000]
queries = read_fbin("query.public.100k.fbin")[0:10_000]
transformed_queries = transform_queries(queries)
validate_dataset_match_upto_dim(transformed_queries, queries, 200)
with open(f"{name}-transform.test", "w") as out_f:
transformed_queries.tofile(out_f)
with open(f"{name}.test", "w") as out_f:
queries.tofile(out_f)
magnitudes = np.linalg.norm(np_flat_ds, axis=1)
indices = np.argsort(magnitudes)
transformed_np_flat_ds = transform_docs(np_flat_ds, magnitudes)
validate_dataset_match_upto_dim(transformed_np_flat_ds, np_flat_ds, 200)
transformed_np_flat_ds_sorted = transformed_np_flat_ds[indices]
np_flat_ds_sorted = np_flat_ds[indices]
with open(f"{name}.random-transform.train", "w") as out_f:
transformed_np_flat_ds.tofile(out_f)
with open(f"{name}.ordered-transform.train", "w") as out_f:
transformed_np_flat_ds_sorted.tofile(out_f)
with open(f"{name}.reversed-transform.train", "w") as out_f:
np.flip(transformed_np_flat_ds_sorted, axis=0).tofile(out_f)
with open(f"{name}.random.train", "w") as out_f:
np_flat_ds.tofile(out_f)
with open(f"{name}.reversed.train", "w") as out_f:
np.flip(np_flat_ds_sorted, axis=0).tofile(out_f)
with open(f"{name}.ordered.train", "w") as out_f:
np_flat_ds_sorted.tofile(out_f) |
@benwtrent thank you! Even though this dataset is almost normalized, the non-transformed approach is still better for at least two scenarios (ordered and random).
@searchivarius && @msokolov do we think any additional testing should be done? Or are we comfortable with #12479 going through as is with the testing done? FYI, I reran my tests locally. I re-read the requirements and it seemed that in the "transformed" case, we SHOULD be using So, non-transformed == max-inner product and trainsformed == euclidean. In my local reruns, there is almost no difference. The random case looks very similar, the ordered & reverse cases are more equal between the two runs, but the pareto graphs are effectively the same :/ |
Hi @benwtrent first question: do you have your test code publicly available?
Second, great you remembered it, but I think there's no difference between cosine and L2 (i.e., search results are the same) if queries and documents have constant norms. They don't have to be normalized to the unit norm; I think any constant would suffice:

L2 = |a - b|^2 = |a|^2 - 2(a·b) + |b|^2 = const1 - cosine_similarity(a, b) · const2

(with const1 = |a|^2 + |b|^2 and const2 = 2|a||b|, both constant when norms are fixed). What do you think? cc @jmazanec15

Additional experiments are always good. I have no immediate suggestion for realistic embeddings, though. However, one can try to stress test the method by synthetically changing the data: multiply or divide vector elements by a random uniform number in the range [1, M]. For large enough M, the transform might become beneficial.
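A sketch of that stress test (a hypothetical helper, in Java for consistency with the other sketches; it covers the "multiply" half of the suggestion):

```java
import java.util.Random;

// Stress test: scale each vector by a random uniform factor in [1, M) so
// magnitudes vary widely; for large M the MIPS-to-L2 transform may start
// to pay off.
class MagnitudeStress {
  static float[][] scaleMagnitudes(float[][] vectors, float m, long seed) {
    Random rnd = new Random(seed);
    float[][] out = new float[vectors.length][];
    for (int i = 0; i < vectors.length; i++) {
      float factor = 1f + rnd.nextFloat() * (m - 1f); // uniform in [1, M)
      out[i] = new float[vectors[i].length];
      for (int j = 0; j < vectors[i].length; j++) {
        out[i][j] = vectors[i][j] * factor;
      }
    }
    return out;
  }
}
```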
Here are my results for the en-ja mixed dataset: max inner product is the baseline, euclidean is the transformed.

EN-JA rerun
Updated knnPerfTest.py. For my testing runs, I found this to be the quickest way to test:

#!/usr/bin/env python
import subprocess
import benchUtil
import constants
LUCENE_CHECKOUT = 'lucene_candidate'
# test parameters. This script will run KnnGraphTester on every combination of these parameters
VALUES = {
'ndoc': (100000,),
'maxConn': (48, ),
'beamWidthIndex': (200,),
'fanout': (0, 10, 50, 90, 190, 490, 590, 990),
'topK': (10,),
}
def advance(ix, values):
for i in reversed(range(len(ix))):
param = list(values.keys())[i]
#print("advance " + param)
if ix[i] == len(values[param]) - 1:
ix[i] = 0
else:
ix[i] += 1
return True
return False
def run_knn_benchmark(checkout, values, training_file, testing_file, dims, metric):
indexes = [0] * len(values.keys())
indexes[-1] = -1
args = []
print(f"\n\n\nNow running{training_file}\n\n\n")
dim = dims #768
doc_vectors = training_file
query_vectors = testing_file
cp = benchUtil.classPathToString(benchUtil.getClassPath(checkout))
JAVA_EXE = '/Users/benjamintrent/Library/Java/JavaVirtualMachines/jdk-20.0.1.jdk/Contents/Home/bin/java'
cmd = [JAVA_EXE,
'-cp', cp,
'--add-modules', 'jdk.incubator.vector',
'-Dorg.apache.lucene.store.MMapDirectory.enableMemorySegments=false',
'KnnGraphTester']
print("recall\tlatency\tnDoc\tfanout\tmaxConn\tbeamWidth\tvisited\tindex ms")
while advance(indexes, values):
pv = {}
args = []
for (i, p) in enumerate(list(values.keys())):
if p in values:
if values[p]:
value = values[p][indexes[i]]
pv[p] = value
else:
args += ['-' + p]
args += [a for (k, v) in pv.items() for a in ('-' + k, str(v)) if a]
this_cmd = cmd + args + [
'-dim', str(dim),
'-docs', doc_vectors,
#'-reindex',
'-metric', metric,
'-search', query_vectors,
'-forceMerge',
'-quiet',
]
subprocess.run(this_cmd)
tests = [
('%s/util/en_ja.random.train' % constants.BASE_DIR, '%s/util/en_ja.test' % constants.BASE_DIR, 768, "angular"),
('%s/util/en_ja.ordered.train' % constants.BASE_DIR, '%s/util/en_ja.test' % constants.BASE_DIR, 768, "angular"),
('%s/util/en_ja.reversed.train' % constants.BASE_DIR, '%s/util/en_ja.test' % constants.BASE_DIR, 768, "angular"),
('%s/util/en_ja.random-transform.train' % constants.BASE_DIR, '%s/util/en_ja-transform.test' % constants.BASE_DIR, 769, "euclidean"),
('%s/util/en_ja.ordered-transform.train' % constants.BASE_DIR, '%s/util/en_ja-transform.test' % constants.BASE_DIR, 769, "euclidean"),
('%s/util/en_ja.reversed-transform.train' % constants.BASE_DIR, '%s/util/en_ja-transform.test' % constants.BASE_DIR, 769, "euclidean"),
]
for (training_file, testing_file, dims, metric) in tests:
run_knn_benchmark(LUCENE_CHECKOUT, VALUES, training_file, testing_file, dims, metric)
@benwtrent thank you for re-running, much appreciated! The results look the same, but do we agree that after the transform is done, all three similarities / distances should be equivalent for the purpose of k-NN search? This is due to all documents being normalized to the same length (and query length is the same during the search, simply because it's the very same query). The equivalence is more obvious when queries and documents all have unit length. However, I think it is still true assuming consistent normalization of documents.
PS: @benwtrent could you publish the benchmark code? We could possibly extend it, e.g., add new tests.
@searchivarius my latest comment has the updated knnPerfTest.py script. I use a build of Lucene that utilizes the dot-product scaling suggested by @jmazanec15. I take the console output and copy/paste it into the extraction & plotting code given in this comment: #12342 (comment). All very manual.
@searchivarius yes I believe you are correct. Attached is a small proof that ordering is the same between euclidean distance and negative dot product when the norm of all vectors is the same (please double check that I did not make any mistakes).

Proof

Prove: given a query $q$ and a set of points $D$ with $\lVert d \rVert = C$ for all $d \in D$, then for any $d_1, d_2 \in D$: $\lVert q - d_1 \rVert \le \lVert q - d_2 \rVert \iff -\langle q, d_1 \rangle \le -\langle q, d_2 \rangle$.

Starting with: $\lVert q - d_1 \rVert^2 \le \lVert q - d_2 \rVert^2$

By parallelogram law: $2\lVert q \rVert^2 + 2\lVert d_1 \rVert^2 - \lVert q + d_1 \rVert^2 \le 2\lVert q \rVert^2 + 2\lVert d_2 \rVert^2 - \lVert q + d_2 \rVert^2$

Removing equal terms ($\lVert d_1 \rVert = \lVert d_2 \rVert = C$): $-\lVert q + d_1 \rVert^2 \le -\lVert q + d_2 \rVert^2$

Flipping sign: $\lVert q + d_1 \rVert^2 \ge \lVert q + d_2 \rVert^2$

Definition of norm: $\langle q + d_1, q + d_1 \rangle \ge \langle q + d_2, q + d_2 \rangle$

Expanding: $\lVert q \rVert^2 + 2\langle q, d_1 \rangle + \lVert d_1 \rVert^2 \ge \lVert q \rVert^2 + 2\langle q, d_2 \rangle + \lVert d_2 \rVert^2$

Removing equal terms: $2\langle q, d_1 \rangle \ge 2\langle q, d_2 \rangle$

Dividing and flipping sign: $-\langle q, d_1 \rangle \le -\langle q, d_2 \rangle$
Seems to be correct (after checking what exactly the parallelogram law is). However, as I mentioned above, a simpler expansion of the dot product gives us that L2 is a monotonic transformation of the cosine similarity (and of the dot product as well, b/c norms are constant), and monotonicity preserves the order. I just wanted to make sure my simple derivation of:

L2 = |a - b|^2 = const1 - cosine_similarity(a, b) · const2

looked plausible.
@searchivarius I see, yes that's correct.
Here are some results on synthetic datasets. I utilized max-inner-product (scaled to prevent negatives) for the baseline, and euclidean with the typical transformation for the transformed. Here is the code for generating the scaled data: https://gist.github.com/benwtrent/be084f4737f79dab7204eca3e6ab98fe

Normal(loc=1, scale=0.1)

Magnitude stats:
Normal(loc=1,scale=0.2)
Pareto(5)
Uniform
BiModal 0.5

Two normal distributions with
BiModal 0.9

Two normal distributions with
Gamma(shape=1,scale=1)
Gamma(shape=2,scale=2)
Looking great, many thanks! Could you remind me what ordered and reversed are? Is this something related to insertion order?
Yes, insertion order, i.e. which docs are indexed first. Ordered is sorted by magnitude in ascending order, so the smallest magnitudes are indexed first. Reversed is the opposite: descending order by magnitude (largest first).
It makes sense that the transformation works better with the gamma distributions (but only when the indexed docs are ordered by magnitude), as they have by far the highest variance of all the distributions. It is an exceptional case for data to not only have such a high variance but also for the user to purposefully index the documents sorted by their magnitude. All our real datasets had much lower variance. And in the random-order case, baseline is still better than transformed. Even in the ordered & reversed cases, baseline was able to achieve 95% recall with just 1ms more latency. Are we comfortable with moving forward and merging #12479, bringing max-inner-product into Lucene?
@benwtrent I am comfortable with it. For the future, if we ever decide to switch up the edge-selection strategy so that a node not only has to be closer than a threshold distance but closer by a factor of X (I believe Vamana has a strategy like this), then we may need to take extra care that the scaling works for the transformation. But this is not a problem now, and I believe there would be a similar problem with the raw dot product anyway.
…12479) The current dot-product score scaling and similarity implementation assumes normalized vectors. This disregards information that the model may store within the magnitude. See: #12342 (comment) for a good explanation of the need. To prevent breaking current scoring assumptions in Lucene, a new `MAXIMUM_INNER_PRODUCT` similarity function is added. Because the similarity from a `dotProduct` function call could be negative, this similarity scorer will scale negative dotProducts to between 0 and 1, and all positive dotProduct values to between 1 and MAX. One concern with adding this similarity function is that it breaks the triangle inequality, which is assumed to be needed to build graph structures. But there is conflicting research here when it comes to real-world data. See: - For: #12342 (comment) - Against: #12342 (comment), #12342 (comment) To check if any transformation of the input is required to satisfy the triangle inequality, many tests have been run. See: - #12342 (comment) - #12342 (comment) - #12342 (comment) If there are any additional tests, or issues with the provided tests & scripts, please let me know. We want to make sure this works well for our users. closes: #12342
Currently, the VectorSimilarityFunction.DOT_PRODUCT function can return negative scores if the input vectors are not normalized. For reference, this is the method:
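A sketch of that method (paraphrased from memory of Lucene 9.x, so treat the details as approximate):

```java
// Approximate shape of Lucene 9.x's DOT_PRODUCT scoring (paraphrased):
// (1 + dot) / 2 maps [-1, 1] (the range for normalized vectors) onto
// [0, 1], but goes negative once the raw dot product drops below -1.
class DotProductScore {
  static float compare(float[] v1, float[] v2) {
    float dot = 0;
    for (int i = 0; i < v1.length; i++) {
      dot += v1[i] * v2[i];
    }
    return (1 + dot) / 2;
  }
}
```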
While the method javadoc warns to normalize the vectors before use, I am wondering if we can get rid of this by mapping negative scores to between 0 and 1, and positive scores to between 1 and Float.MAX_VALUE, with:
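(a sketch of such a mapping, reconstructed from the description in this thread and the wording of #12479 rather than quoted verbatim from the issue body:)

```java
// Piecewise scaling of a raw dot product into non-negative scores:
//   raw < 0  ->  1 / (1 - raw)   in (0, 1)
//   raw >= 0 ->  raw + 1         in [1, Float.MAX_VALUE]
// More-negative raw values still map to lower scores, so ordering is kept.
class DotProductScaling {
  static float scale(float raw) {
    if (raw < 0) {
      return 1f / (1f - raw); // order-preserving squash for negatives
    }
    return raw + 1f; // positives keep full float precision
  }
}
```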
and let the user worry about normalization.
Related issue: opensearch-project/k-NN#865