
Prevent VectorSimilarity.DOT_PRODUCT from returning negative scores #12342

Closed
jmazanec15 opened this issue Jun 1, 2023 · 84 comments · Fixed by #12479

Comments

@jmazanec15
Contributor

Currently, the VectorSimilarityFunction.DOT_PRODUCT function can return negative scores if the input vectors are not normalized. For reference, this is the method:

public float compare(float[] v1, float[] v2) {
  return (1 + dotProduct(v1, v2)) / 2;
}

While the method's javadoc warns to normalize the vectors before use, I am wondering if we can get rid of this restriction by mapping negative scores between 0 and 1 and positive scores between 1 and Float.MAX_VALUE with:

float dotProd = dotProduct(v1, v2);

if (dotProd < 0) {
  return 1 / (1 + -1 * dotProd);
}
return dotProd + 1;

and let the user worry about normalization
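
Spelled out, the proposed mapping is

$$\mathrm{score}(d) = \begin{cases} \dfrac{1}{1 - d} & d < 0 \\ d + 1 & d \ge 0 \end{cases}$$

which sends $(-\infty, 0)$ into $(0, 1)$ and $[0, \infty)$ into $[1, \infty)$, and stays monotonic across the boundary.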

Related issue: opensearch-project/k-NN#865

@benwtrent
Member

I honestly would much prefer us removing the normalization restriction and not doing anything to the output. IMO, plain ol' dot_product is perfectly fine at scoring, as higher values mean more relevant (just like Lucene scoring). I am not sure why this was scaled at all.

But, since this changes how things are scored (and that would be considered a breaking change, correct?), why not remove the scaling altogether and let the score be the result of dot_product directly?

@benwtrent
Member

Ok, did a bunch of reading on MAX Block WAND and MAXScore :/. I guess MBW is the reason for us bounding these vector scores somehow.

Technically, max inner product is unbounded. Encoding information in the vector magnitude is useful (especially for multi-lingual search from what I gather).

My questions would be:

  • does allowing unbounded scores from vectors completely break MBW?
  • if we need to scale it, can we do it continuously? I suppose that might break the existing scoring methodology.

@jmazanec15
Contributor Author

@benwtrent Thanks for taking a look! Interesting, I am not too familiar with MBW. I'll take a look.

The main reason I wanted to avoid returning the raw dot product was to avoid negative scores, as referenced in #9044.

Also, what do you mean by scaling continuously? The above formula gives negative and positive scores the same number of possible score values, reducing overall precision by a factor of 2 (please correct me if I am wrong).

@benwtrent
Member

benwtrent commented Jun 12, 2023

Also, what do you mean by scaling continuously?

Your algorithm is piecewise vs. continuous. But, I am not sure how we could do a continuous transformation (everything is on the same scale). EDIT: Thinking more, I am not sure we would want to and your equation is OK. More thought here is required.

I am not too familiar with MBW.

Yeah, MBW and MAXSCORE get all mixed up in my brain. But, yes MAXSCORE is why we disallow negative scoring. Forgive my misdirection.

The above formula gives negative and positive scores the same number of possible scores, reducing overall precision by 2

My main concern overall is this: we are changing the scoring methodology, period, for positive scores (which are considered "valid"). I think this means that it cannot go into a Lucene 9.x release (correct @jpountz ?).

What do you think @msokolov ? Maybe we have a new MAX_INNER_PRODUCT scoring that uses @jmazanec15's suggestion?

@msokolov
Contributor

hmm, my view is dot_product is only usable if your vectors are normalized, as documented. I also don't think we can change the scoring formula in a minor release. As for producing a D.P. that is scaled for use with arbitrary vectors I don't see the point really. If what you want is to handle arbitrary scaled vectors, EUCLIDEAN is a better choice. It will produce the same rank as DOT_PRODUCT for normalized vectors and has the meaning of an actual metric (satisfies the triangle inequality). What does D.P. even mean for random vectors? What if one of the vectors is zero? Then it is equidistant to every other vector?

I guess I'd want to see a real use case before working to support this use case that seems weird to me. And honestly there are other distances that seem more useful (L1 norm, for example)

@jpountz
Contributor

jpountz commented Jun 13, 2023

Historically Lucene did not have restrictions on scores, but approaches for dynamic pruning like MAXSCORE and WAND assume that matching another clause would make the score higher, not lower. In hindsight, it made sense that scores should be non-negative, so we updated the contract of Scorer.score() and our built-in similarities to only produce non-negative scores.

I'm not too worried about having a vector similarity that produces unbounded scores from the perspective of MAXSCORE/WAND. The way things work today, the vector query is first rewritten into a query that maps a small set of doc IDs to scores, so we can easily get access to the maximum score over a range of doc IDs, which is what WAND/MAXSCORE need. To me the main concern is more that unbounded scores make it hard to combine scores with another query via a disjunction as it's hard to know ahead of time whether the vector may completely dominate scores. And also bw compat as MikeS raised.

@benwtrent
Member

To me the main concern is more that unbounded scores make it hard to combine scores with another query via a disjunction as it's hard to know ahead of time whether the vector may completely dominate scores.

I say we have that problem now. Vector scores and BM25 are nowhere near on the same scale. Folks need to adjust their boosts accordingly, regardless.

And also bw compat as MikeS raised.

I agree, BWC is a big deal here. And I suggest we create a new similarity that just uses dot product under the hood. Call it maximum-inner-product.

As for producing a D.P. that is scaled for use with arbitrary vectors I don't see the point really. If what you want is to handle arbitrary scaled vectors, EUCLIDEAN is a better choice.

Quoting a Slack conversation with @nreimers:

Wow, what a bad implementation by Elastic.
Models with unnormalized vectors and dot product work better for search than models with normalized vectors / cosine similarity.
Models with cosine similarity have the issue that they often retrieve noise when your dataset gets noisier
...The best match for a query (e.g. What is the capital of the US ) with cosine similarity is the query itself, as cossim(query, query)=1.
So when your corpus gets bigger and is not carefully cleaned, it contains many short documents that look like queries. These are preferably retrieved by the model, so the user asks a question and gets as a response a doc that is a paraphrase of the query (e.g. query="What is the capital of the US", top-1 hit: Capital of the US).
Dot product has the tendency to work better when your corpus gets larger / noisy.

@msokolov
Contributor

thanks for the reference @benwtrent, that's an interesting perspective. I wouldn't be opposed to adding a new distance if people find it useful

@benwtrent
Member

The way I read this @jpountz

I'm not too worried about having a vector similarity that produces unbounded scores from the perspective of MAXSCORE/WAND. The way things work today, the vector query is first rewritten into a query that maps a small set of doc IDs to scores, so we can easily get access to the maximum score over a range of doc IDs, which is what WAND/MAXSCORE need.

Is that negative scores here are OK as the optimization constraints traditionally required do not apply.

If that is the case, I would suggest us adding a new scoring methodology that is simply the dot product and call it maximum inner product.

The current scaling for dot_product only makes sense for normalized vectors and it should only be treated as an optimization.

@uschindler
Contributor

uschindler commented Jun 13, 2023

For byte vectors we already have some guard built in (at least they can't get <0). See VectorUtil#dotProductScore. In the other issue #12281 I have also seen problems with float vectors that produced infinite floats as dotProduct or NaN as cosine (due to Infinity / Infinity => NaN). We wanted to open a new issue already, so this one fits.

So this also relates to this discussion: should we have some constraints on vectors while they are indexed? In the other PR we added the requirement that all their components be finite.
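
For reference, the byte-vector guard works roughly like this (a from-memory sketch of VectorUtil#dotProductScore; check VectorUtil for the authoritative version):

// The largest possible |a[i] * b[i]| for signed bytes is 128 * 128 = 2^14, so
// dividing the raw dot product by a.length * 2^15 keeps it in [-0.5, 0.5];
// adding 0.5 makes the score non-negative.
static float dotProductScore(byte[] a, byte[] b) {
  float denom = (float) (a.length * (1 << 15));
  return 0.5f + dotProduct(a, b) / denom;
}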

@jpountz
Contributor

jpountz commented Jun 13, 2023

Is that negative scores here are OK as the optimization constraints traditionally required do not apply.

I was trying to convey that it should be OK not to have an upper bound on the produced scores. I still think scores should always be non-negative.

@uschindler
Contributor

Is that negative scores here are OK as the optimization constraints traditionally required do not apply.

I was trying to convey that it should be OK not to have an upper bound on the produced scores. I still think scores should always be non-negative.

I think for floats it is not as easy as in the byte case with dotProductScore(). I also referenced this from the other function-query issue, where the question was what the "default score" should be if one of the documents has no vector. 0 works fine for classical scoring if you only have positive scores.

@msokolov
Contributor

would it make sense to truncate negative scores to zero? Since we think this is an abuse/misconfiguration, loss of information seems OK, and at least we would be able to guarantee not to violate the "no negative scores" contract. Then if we want to have a separate score that scales in an information-preserving way, we can add it.

@nreimers

@msokolov The index / vector DB should return the dot product score as is. No scaling, no truncation.

Using dot product is tremendously useful for embedding models: they perform much better than cosine similarity or Euclidean distance in asymmetric settings where you want to map a short search query to a longer relevant document (which is the most common case in search).

But here the index should return the values as is and it should then be up to the user to truncate negative scores or to normalize these scores to pre-defined ranges.

@uschindler
Contributor

@msokolov The index / vector DB should return the dot product score as is. No scaling, no truncation.

Using dot product is tremendously useful for embedding models: they perform much better than cosine similarity or Euclidean distance in asymmetric settings where you want to map a short search query to a longer relevant document (which is the most common case in search).

But here the index should return the values as is and it should then be up to the user to truncate negative scores or to normalize these scores to pre-defined ranges.

The problem is that this is not compatible with Lucene.

@benwtrent
Member

I would think as long as more negative values are scored lower, we will retrieve documents in a sane manner.

Scaling negatives to restrict them and then not scaling positive values at all could work. The _score wouldn't always be the dot-product exactly, but it allows KNN search to find the most relevant information, even if all of the dot-products are negative when comparing with the query vector.

This brings us back to @jmazanec15's suggestion on scaling scores.

@msokolov
Contributor

Yeah, after consideration, I think we could maybe argue for changing the scaling of negative values given that they were documented as unsupported, even though it would be breaking back-compat in the sense that scores would change. But I think we ought to preserve the scaling of non-negative values in case people have scaling factors they use for combining scores with other queries' scores. So we could go with @jmazanec15's suggestion, except leaving in place the scale by 1/2?

@benwtrent
Member

@msokolov Ah, so negative values would live in (0, 0.5) and positive values would still be in [0.5, ...)?
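
A hypothetical sketch of that variant (not from any PR): keep the existing (1 + d) / 2 scale for non-negative dot products and squash negatives below 0.5, e.g.

float compare(float[] v1, float[] v2) {
  float dotProd = dotProduct(v1, v2);
  if (dotProd < 0) {
    return 1 / (2 * (1 - dotProd)); // (-inf, 0) -> (0, 0.5)
  }
  return (1 + dotProd) / 2;         // [0, inf) -> [0.5, inf)
}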

@msokolov
Contributor

Yeah. Another thing we could consider is doing this scaling in KnnVectorQuery and/or its Scorer. These have the ultimate responsibility of complying with the Scorer contract. If we did it there we wouldn't have to change the output of VectorSimilarity. However it's messy to do it there since this is specific to a particular similarity implementation, so on balance doing it in the similarity makes more sense to me.

@jmazanec15
Contributor Author

jmazanec15 commented Jun 15, 2023

I think the scores would have to be preserved not only for positive dot products amongst normalized vectors, but also for negative ones, to avoid breaking bwc. I think the current range of valid dot products is [-1, 1] and scores map to [0, 1]. So I don't think we could map all negative values into [0, 0.5].

@uschindler
Contributor

Yeah. Another thing we could consider is doing this scaling in KnnVectorQuery and/or its Scorer. These have the ultimate responsibility of complying with the Scorer contract. If we did it there we wouldn't have to change the output of VectorSimilarity. However it's messy to do it there since this is specific to a particular similarity implementation, so on balance doing it in the similarity makes more sense to me.

Wasn't there the possibility to return one score for indexing and another for search? Basically the VectorSimilarity enum could have a separate method called queryScore(v1, v2) that is enforced to be positive. Actually for cosine it's not a problem as it's normalized, so we can add 1 (and, for safety against rounding errors, take Math.max(0, result)). The absolute values of scores are not important (unless you want to combine them with other query scores, but for that you have query boosts).

@benwtrent
Member

If we did it there we wouldn't have to change the output of VectorSimilarity. However it's messy to do it there since this is specific to a particular similarity implementation, so on balance doing it in the similarity makes more sense to me.

I am not sure why we care about separating VectorSimilarity and scoring. VectorSimilarity is only ever used for KNN search and indexing, and as long as vectors that are less similar score lower, it's fine.

If we start thinking about separating out scoring and similarity, we should do it for all the current similarities. This would be significant work and it would be tricky. Think of EUCLIDEAN: we invert its calculation so that a higher score means more similar. So, we would still need to use queryScore as the indexing similarity without significant changes to the underlying assumptions of the graph builder, etc.

If folks want to use the raw vector distances, they should use VectorUtil.

I think the current range of dot products that are valid is [-1, 1] and scores map to [0, 1]. So I dont think we could map all negative values between [0, 0.5]

I think you are correct @jmazanec15, since normalized vectors are on the unit sphere. It's possible to have negative values (and thus fall into the [0, 0.5] range) when they point in opposite directions within the sphere. Your scaling method + a new MAX_INNER_PRODUCT similarity (which just uses dotProduct and scales it differently) covers the requirement of disallowing negative scores & supporting non-normalized vectors.

This may complicate things (which 'dotProduct' should I use?!?!?!), but we should not change the existing VectorSimilarityFunction#DOT_PRODUCT. Maybe we can deprecate VectorSimilarityFunction#DOT_PRODUCT usage for new fields in 9x to encourage switching to MAX_INNER_PRODUCT and remove VectorSimilarityFunction#DOT_PRODUCT in 10.

@jmazanec15
Contributor Author

@benwtrent I think that makes sense, but would add a little confusion.

How common is it to use vector results with MAX_SCORE/WAND? I am wondering if it would be better to just leave it as is in 9.x and change the warning in the javadoc to say that non-normalized vectors are supported, but they should not be used with WAND/MAX_SCORE and can return negatives. And then switch the score scaling in 10 as a breaking change. Or is condoning negative scores under any circumstances a non-starter?

@benwtrent
Member

And then switch the score scaling in 10 as a breaking change. Or is condoning negative scores under any circumstances a non-starter?

If you are utilizing hybrid search, defeating WAND/MAX_SCORE will slow things down significantly.

We should protect folks from shooting themselves in the foot.

but would add a little confusion.

I agree, there will be confusion. What do you think @uschindler & @msokolov ?

Being able to do non-normalized dot-product is an important aspect of recommendation engines and vector search as a whole. My imagination is too poor to come up with a better solution than adding a new similarity function that uses dot-product under the hood and scales differently.

@benwtrent
Member

@jmazanec15 have you done any digging into the dot-product scaling and if it provides good recall in the MAX-INNER-PRODUCT search use-case?

https://blog.vespa.ai/announcing-maximum-inner-product-search/ && https://towardsdatascience.com/maximum-inner-product-search-using-nearest-neighbor-search-algorithms-c125d24777ef

These imply there might be some weirdness with HNSW and raw MIP. I am honestly not 100% sure if Lucene has this issue specifically with HNSW.

A key observation in MIP is that a vector is no longer closest to itself, but instead it would be much closer to 2*vector than just vector.
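
For example, with $v = (1, 1)$ we get $\langle v, v \rangle = 2$ but $\langle v, 2v \rangle = 4$, so under inner-product similarity $2v$ looks like a better match for $v$ than $v$ itself.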

@jmazanec15
Contributor Author

@benwtrent I have been thinking about this and am still not completely sure of the implications. It seems like the construction of the graphs may rely on some assumption about the underlying space supporting the triangle inequality. Thus, with inner product space where this does not hold, the graph construction might have problems.

However, graphs aside, with brute force search, utilizing the scaled negative dot product would preserve the ordering of MIPs search.

I will try to think more about this this week.

@benwtrent
Member

@jmazanec15 adding the largest magnitude for scoring per segment isn't that bad a change in the codec if it means we can truly support maximum inner product. Plus it would be a change that would help other vector indexing codecs in the future besides HNSW.

@jmazanec15
Contributor Author

@benwtrent Interesting, I'm still not sure if this approach is necessary. I spoke with @searchivarius, who is the maintainer of nmslib, and he mentioned that there was some research suggesting this is not required (https://proceedings.neurips.cc/paper_files/paper/2018/hash/229754d7799160502a143a72f6789927-Abstract.html, https://arxiv.org/pdf/1506.03163.pdf).

Let me try re-running the Vespa experiments with Lucene without the reduction and see what numbers we get. I don't think the blog post included any comparison to the negative dot product approach (please correct me if I am missing something).
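
For context, the reduction referenced in those posts (and labeled "transformed" in the experiments below) appends one extra dimension so that every document ends up with the same norm; as I understand it:

$$\tilde{v} = \left[\, v,\ \sqrt{M^2 - ||v||^2}\, \right], \qquad \tilde{q} = [\, q,\ 0\, ], \qquad M = \max_{v \in S} ||v||$$

Every $\tilde{v}$ then has norm $M$ and $\langle \tilde{q}, \tilde{v} \rangle = \langle q, v \rangle$, so the nearest neighbors under cosine/Euclidean on the transformed vectors are exactly the top inner-product matches on the originals.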

@searchivarius

searchivarius commented Jul 12, 2023

thank you @jmazanec15 : there's also an unpublished paper (I can share the preprint privately) where we benchmarked HNSW for maximum inner product search on 3 datasets and it was just fine (for this paper I did try the reduction to the cosine similarity and I also got poorer outcomes). In my thesis, I benchmarked SW-graph (which is pretty much HNSW when it comes to peculiarities of handling the inner product search) using an inner-product like similarity (fusion of BM25 and MODEL1 scores) and it was fine. See the black asterisk run in Figure 3.2.

Moreover, HNSW and SW-graph were tested with non-metric similarities (see again my thesis and references therein) as well as in Yury Malkov's HNSW paper. These methods established SOTA results as well.
There is also an extract from the thesis (published separately) that focuses specifically on search with non-metric similarities. Again, things just work.

One may wonder why, right? I think for real datasets the quirky distances don't deviate from the Euclidean distances all that much so the minimal set of geometric properties required for graph based retrieval is preserved (and no I don't think the triangle inequality is required).

Specifically, for inner product search the outcomes are pretty close (in many cases) to the outcomes where the inner product search is replaced with cosine similarity (which is equivalent to L2 search). Why? Because with real embeddings the magnitude of vectors doesn't change all that much.

That said, there are of course degenerate cases (I know one, but embedding models don't produce such weirdness) where HNSW won't work with MIPS (or rather recall will be low). However, I am not aware of any realistic one. If you have some interesting examples of real datasets where direct application of HNSW/SW-graph fails, I would love to have a look.

REGARDING THE SCORE sign: dot-product scores need not be normalized, but the sign can be changed when the result is returned to the user.

@benwtrent
Member

I fixed my bug (thanks @jmazanec15!). I will never forget how numpy#flip works now...

Here is my new run.

| ef_search | transformed_recall | transform_latency | baseline_recall | baseline_latency |
| --- | --- | --- | --- | --- |
| 10 | 0.757 | 0.45 | 0.710 | 0.35 |
| 20 | 0.868 | 0.71 | 0.828 | 0.49 |
| 60 | 0.962 | 1.59 | 0.948 | 1.10 |
| 100 | 0.981 | 2.45 | 0.973 | 1.61 |
| 200 | 0.993 | 4.23 | 0.991 | 2.81 |
| 500 | 0.998 | 8.75 | 0.998 | 5.82 |
| 600 | 0.998 | 10.10 | 0.998 | 6.71 |
| 1000 | 0.999 | 14.80 | 1.00 | 9.71 |

@searchivarius

Cool. Now the non-transformed run is better.

@benwtrent
Member

benwtrent commented Jul 28, 2023

OK, I ran on two more datasets; I only ran over the first 100k documents in my data sets.

EDIT: found a minor bug, the 'baseline' queries were being overwritten by the 'transform' ones (the ones with just a 0 at the 769th index). I reran to verify whether results were affected; they weren't. Mainly because to measure recall, we take the brute-force kNN from the query file and compare to the approx-kNN. The data sets and the index built were not affected by this bug. If any more are found, I will happily re-run. Putting my M1 MacBook to work :).

I think this exhausts our testing for Cohere, we need to find additional data if this is not considered enough.

EN

[Recall vs. latency plots: Reversed, Ordered, Random]

JA

[Recall vs. latency plots: Reversed, Ordered, Random]
updating data creation (after download)
import numpy as np
import pyarrow.parquet as pq

DATA_SETS =[
    {"name": "wiki768", "files": [
        "train-00000-of-00004-1a1932c9ca1c7152.parquet",
        "train-00001-of-00004-f4a4f5540ade14b4.parquet",
        "train-00002-of-00004-ff770df3ab420d14.parquet",
        "train-00003-of-00004-85b3dbbc960e92ec.parquet",
    ]},{
    "name": "wiki768en", "files": [
       "0-en.parquet",
            "1-en.parquet",
            "2-en.parquet",
            "3-en.parquet",
    ]},
    {"name": "wiki768ja", "files": [
        "0-ja.parquet",
        "1-ja.parquet",
        "2-ja.parquet",
        "3-ja.parquet",
    ]}
]


def transform_queries(Q):
    n, _ = Q.shape
    return np.concatenate([Q, np.zeros((n, 1))], axis=-1, dtype=np.float32)


def transform_docs(D, norms):
    n, d = D.shape
    # use the norms passed in (previously this referenced the global `magnitudes`)
    max_norm = norms.max()
    flipped_norms = np.copy(norms).reshape(n, 1)
    transformed_data = np.concatenate([D, np.sqrt(max_norm**2 - flipped_norms**2)], axis=-1, dtype=np.float32)
    return transformed_data


def validate_array_match_upto_dim(arr1, arr2, dim_eq_upto):
    assert np.allclose(arr1[:dim_eq_upto], arr2[:dim_eq_upto]), "data sets are different"


def validate_dataset_match_upto_dim(arr1, arr2, dim_eq_upto):
    n1, d1 = arr1.shape
    n2, d2 = arr2.shape
    assert n1 == n2, f"Shape does not map [{arr1.shape}] vs [{arr2.shape}]"
    for i in range(n1):
        validate_array_match_upto_dim(arr1[i], arr2[i], dim_eq_upto)

for ds in DATA_SETS:
    name = ds["name"]
    tb1 = pq.read_table(ds["files"][0], columns=['emb'])
    tb2 = pq.read_table(ds["files"][1], columns=['emb'])
    tb3 = pq.read_table(ds["files"][2], columns=['emb'])
    tb4 = pq.read_table(ds["files"][3], columns=['emb'])
    np1 = tb1[0].to_numpy()
    np2 = tb2[0].to_numpy()
    np3 = tb3[0].to_numpy()
    np4 = tb4[0].to_numpy()

    np_total = np.concatenate((np1, np2, np3, np4))

    #Have to convert to a list here to get
    #the numpy ndarray's shape correct later
    #There's probably a better way...
    flat_ds = list()
    for vec in np_total:
        flat_ds.append(vec)
    np_flat_ds = np.array(flat_ds)
    row_count = np_flat_ds.shape[0]
    query_count = 10_000
    training_rows = row_count - query_count
    print(f"{name} num rows: {training_rows}")


    transformed_queries = transform_queries(np_flat_ds[training_rows:-1])
    validate_dataset_match_upto_dim(transformed_queries, np_flat_ds[training_rows:-1], 768)
    with open(f"{name}-transform.test", "w") as out_f:
        transformed_queries.tofile(out_f)
    with open(f"{name}.test", "w") as out_f:
        np_flat_ds[training_rows:-1].tofile(out_f)

    magnitudes = np.linalg.norm(np_flat_ds[0:training_rows], axis=1)
    indices = np.argsort(magnitudes)
    transformed_np_flat_ds = transform_docs(np_flat_ds[0:training_rows], magnitudes)
    validate_dataset_match_upto_dim(transformed_np_flat_ds, np_flat_ds[0:training_rows], 768)
    transformed_np_flat_ds_sorted = transformed_np_flat_ds[indices]
    np_flat_ds_sorted = np_flat_ds[indices]
    with open(f"{name}.random-transform.train", "w") as out_f:
        transformed_np_flat_ds.tofile(out_f)
    with open(f"{name}.ordered-transform.train", "w") as out_f:
        transformed_np_flat_ds_sorted.tofile(out_f)
    with open(f"{name}.reversed-transform.train", "w") as out_f:
        np.flip(transformed_np_flat_ds_sorted, axis=0).tofile(out_f)

    with open(f"{name}.random.train", "w") as out_f:
        np.flip(np_flat_ds[0:training_rows], axis=0).tofile(out_f)
    with open(f"{name}.reversed.train", "w") as out_f:
        np.flip(np_flat_ds_sorted, axis=0).tofile(out_f)
    with open(f"{name}.ordered.train", "w") as out_f:
        np_flat_ds_sorted.tofile(out_f)
Useful parsing & plotting functions
import re

import matplotlib.pyplot as plt


def parse_console_output(terminal_output):
    # Regular expression patterns to extract recall and latency values
    recall_pattern = r"(?:\n\d+\.\d+)"
    latency_pattern = r"([\t, ]\d+\.\d+\t\d)"

    recall_values = [float(match.strip()) for match in re.findall(recall_pattern, terminal_output)]
    latency_values = [float(match.split()[0]) for match in re.findall(latency_pattern, terminal_output)]
    return (recall_values, latency_values)


def plot_things(name, baseline_recall, baseline_latency, transformed_recall, transform_latency):
    # Plotting series one: transformed_recall vs transform_latency
    plt.plot(transform_latency, transformed_recall, marker='o', label='transformed')

    # Plotting series two: baseline_recall vs baseline_latency
    plt.plot(baseline_latency, baseline_recall, marker='o', label='original (baseline)')

    # Add labels and title
    plt.xlabel('Latency')
    plt.ylabel('Recall')
    plt.title(f"{name} Transformed vs Baseline recall & latency")
    plt.legend()

    # Show the plot
    plt.grid(True)
    plt.show()

To use them:

transformed_terminal_output = """
WARNING: Gnuplot module not present; will not make charts
recall	latency	nDoc	fanout	maxConn	beamWidth	visited	index ms
WARNING: Using incubator modules: jdk.incubator.vector
Jul 28, 2023 12:01:58 PM org.apache.lucene.internal.vectorization.PanamaVectorizationProvider <init>
INFO: Java vector incubator API enabled; uses preferredBitSize=128
0.863	 0.32	100000	0	48	200	10	0	1.00	post-filter
...
"""

baseline_terminal_output = """
WARNING: Gnuplot module not present; will not make charts
recall	latency	nDoc	fanout	maxConn	beamWidth	visited	index ms
WARNING: Using incubator modules: jdk.incubator.vector
Jul 28, 2023 12:34:59 PM org.apache.lucene.internal.vectorization.PanamaVectorizationProvider <init>
INFO: Java vector incubator API enabled; uses preferredBitSize=128
0.816	 0.23	100000	0	48	200	10	0	1.00	post-filter
...
"""
name = "WikiEN-Reversed"

transformed_recall, transform_latency = parse_console_output(transformed_terminal_output)
baseline_recall, baseline_latency = parse_console_output(baseline_terminal_output)

plot_things(name, baseline_recall, baseline_latency, transformed_recall, transform_latency)
data downloading script
#!/bin/sh


# Japanese
curl -L https://huggingface.co/api/datasets/Cohere/wikipedia-22-12-ja-embeddings/parquet/Cohere--wikipedia-22-12-ja-embeddings/train/0.parquet -o 0-ja.parquet
curl -L https://huggingface.co/api/datasets/Cohere/wikipedia-22-12-ja-embeddings/parquet/Cohere--wikipedia-22-12-ja-embeddings/train/1.parquet -o 1-ja.parquet
curl -L https://huggingface.co/api/datasets/Cohere/wikipedia-22-12-ja-embeddings/parquet/Cohere--wikipedia-22-12-ja-embeddings/train/33.parquet -o 2-ja.parquet
curl -L https://huggingface.co/api/datasets/Cohere/wikipedia-22-12-ja-embeddings/parquet/Cohere--wikipedia-22-12-ja-embeddings/train/34.parquet -o 3-ja.parquet

# English
curl -L https://huggingface.co/api/datasets/Cohere/wikipedia-22-12-en-embeddings/parquet/Cohere--wikipedia-22-12-en-embeddings/train/0.parquet -o 0-en.parquet
curl -L https://huggingface.co/api/datasets/Cohere/wikipedia-22-12-en-embeddings/parquet/Cohere--wikipedia-22-12-en-embeddings/train/1.parquet -o 1-en.parquet
curl -L https://huggingface.co/api/datasets/Cohere/wikipedia-22-12-en-embeddings/parquet/Cohere--wikipedia-22-12-en-embeddings/train/251.parquet -o 2-en.parquet
curl -L https://huggingface.co/api/datasets/Cohere/wikipedia-22-12-en-embeddings/parquet/Cohere--wikipedia-22-12-en-embeddings/train/252.parquet -o 3-en.parquet

@searchivarius

@benwtrent many thanks! This is even more convincing since now we have two cases where the original is much better than the transformed and one where it is roughly on par.

@benwtrent
Member

OK, here (unless we have done these incorrectly), is the final Cohere test (IMO). This is mixing English and Japanese embeddings (just in case the cohere model encodes info for language in magnitude). To mix the language embeddings, I appended the embedded rows together, shuffled them and then ran the previous testing procedure.

EN-JA

[Recall vs. latency plots: Reversed, Ordered, Random]

@msokolov
Contributor

Q: just want to make sure I understand what the transformation is that you are testing here. Is it that you are taking non-unit vectors and making them into unit vectors by dividing by the max length in the corpus and then adding another dimension to force the vector to be unit length? And then comparing this to max-inner-product on vectors of varying length? If so, what is the meaning of the length of the vectors - where does it come from? Do the training processes generate non-unit vectors?

@searchivarius

@msokolov the transformation also preserves the inner product up to a query-specific constant.

what is the meaning of the length of the vectors - where does it come from? Do the training processes generate non-unit vectors

Yes, the training process often uses the inner/dot product and not the cosine similarity, so some information is being encoded in the vector length/magnitude. Typically, however, the variance isn't that huge.

@nreimers

nreimers commented Jul 31, 2023

@msokolov In our BEIR paper we talked about this:
https://arxiv.org/abs/2104.08663

The issue with cosine similarity is that it just encodes the topic. For the query What is PyTorch, you have two docs:
A:
Pytorch is a framework

B:
PyTorch is a machine learning framework based on the Torch library, used for applications such as computer vision and natural language processing, originally developed by Meta AI and now part of the Linux Foundation umbrella. It is free and open-source software released under the modified BSD license. Although the Python interface is more polished and the primary focus of development, PyTorch also has a C++ interface

A cosine similarity model would retrieve A (it is more on-topic). A dot-product model would retrieve B, as it is on-topic as well and contains more information on the topic (represented as a longer vector).

I.e. cossim(query, A) > cossim(query, B) but dotprod(query, A) < dotprod(query, B).

So dotproduct models can encode topic + quantity/quality of information, while cosine similarity models just encode topic match (which isn't perfect for retrieval).

@benwtrent
Member

I found another dataset, Yandex Text-to-image: https://research.yandex.com/blog/benchmarks-for-billion-scale-similarity-search

I tested against the first 500_000 values in the 1M dataset.

It utilizes inner-product; looking at magnitudes, they are all < 1, so this dataset might not be that useful :/. The magnitudes range from 0.79 to 0.99.

I am just looking for more realistic inner-product search data sets and found this one.

Yandex Image-to-Text

[Recall vs. latency plots: Reversed, Ordered, Random]
code for transforming yandex fbin
import numpy as np


def read_fbin(filename, start_idx=0, chunk_size=None):
    """ Read *.fbin file that contains float32 vectors
    Args:
        :param filename (str): path to *.fbin file
        :param start_idx (int): start reading vectors from this index
        :param chunk_size (int): number of vectors to read.
                                 If None, read all vectors
    Returns:
        Array of float32 vectors (numpy.ndarray)
    """
    with open(filename, "rb") as f:
        nvecs, dim = np.fromfile(f, count=2, dtype=np.int32)
        nvecs = (nvecs - start_idx) if chunk_size is None else chunk_size
        arr = np.fromfile(f, count=nvecs * dim, dtype=np.float32,
                          offset=start_idx * 4 * dim)
    return arr.reshape(nvecs, dim)


def transform_queries(Q):
    n, _ = Q.shape
    return np.concatenate([Q, np.zeros((n, 1))], axis=-1, dtype=np.float32)


def transform_docs(D, norms):
    n, d = D.shape
    # use the norms passed in (previously this referenced the global `magnitudes`)
    max_norm = norms.max()
    flipped_norms = np.copy(norms).reshape(n, 1)
    transformed_data = np.concatenate([D, np.sqrt(max_norm**2 - flipped_norms**2)], axis=-1, dtype=np.float32)
    return transformed_data


def validate_array_match_upto_dim(arr1, arr2, dim_eq_upto):
    assert np.allclose(arr1[:dim_eq_upto], arr2[:dim_eq_upto]), "data sets are different"


def validate_dataset_match_upto_dim(arr1, arr2, dim_eq_upto):
    n1, d1 = arr1.shape
    n2, d2 = arr2.shape
    assert n1 == n2, f"Shape does not map [{arr1.shape}] vs [{arr2.shape}]"
    for i in range(n1):
        validate_array_match_upto_dim(arr1[i], arr2[i], dim_eq_upto)

name = "yandex"
np_flat_ds = read_fbin("base.1M.fbin")[0:500_000]
queries = read_fbin("query.public.100k.fbin")[0:10_000]


transformed_queries = transform_queries(queries)
validate_dataset_match_upto_dim(transformed_queries, queries, 200)
with open(f"{name}-transform.test", "w") as out_f:
    transformed_queries.tofile(out_f)
with open(f"{name}.test", "w") as out_f:
    queries.tofile(out_f)

magnitudes = np.linalg.norm(np_flat_ds, axis=1)
indices = np.argsort(magnitudes)
transformed_np_flat_ds = transform_docs(np_flat_ds, magnitudes)
validate_dataset_match_upto_dim(transformed_np_flat_ds, np_flat_ds, 200)
transformed_np_flat_ds_sorted = transformed_np_flat_ds[indices]
np_flat_ds_sorted = np_flat_ds[indices]
with open(f"{name}.random-transform.train", "w") as out_f:
    transformed_np_flat_ds.tofile(out_f)
with open(f"{name}.ordered-transform.train", "w") as out_f:
    transformed_np_flat_ds_sorted.tofile(out_f)
with open(f"{name}.reversed-transform.train", "w") as out_f:
    np.flip(transformed_np_flat_ds_sorted, axis=0).tofile(out_f)

with open(f"{name}.random.train", "w") as out_f:
    np_flat_ds.tofile(out_f)
with open(f"{name}.reversed.train", "w") as out_f:
    np.flip(np_flat_ds_sorted, axis=0).tofile(out_f)
with open(f"{name}.ordered.train", "w") as out_f:
    np_flat_ds_sorted.tofile(out_f)

@searchivarius

@benwtrent thank you! Even though this dataset is almost normalized, the non-transformed approach is still better in at least two scenarios (ordered and random).

@benwtrent
Member

@searchivarius && @msokolov do we think any additional testing should be done? Or are we comfortable with #12479 going through as is with the testing done?

FYI, I reran my tests locally. I re-read the requirements and it seemed that in the "transformed" case, we SHOULD be using euclidean as the similarity comparator. My tests used "dot-product" (angular).

So, non-transformed == max-inner-product and transformed == euclidean.

In my local reruns, there is almost no difference. The random case looks very similar, and the ordered & reversed cases are more equal between the two runs, but the Pareto graphs are effectively the same :/

@searchivarius

Hi @benwtrent first question:

do you have your test code publicly available?

we SHOULD be using euclidean as the similarity comparator. My tests used "dot-product" (angular).

Second, great you remembered it, but I think there's no difference between cosine and L2 (i.e., search results are the same) if queries and documents have constant norms. They don't have to be normalized to the unit norm, I think any constant would suffice:

L2=(a-b)^2 = |a|^2 - ab + |b|^2 = const1 - cosine_similarity(a,b) * const2

What do you think? cc @jmazanec15

Additional experiments are always good. I have no immediate suggestion for realistic embeddings, though.

However, one can try to stress test the method by synthetically changing the data: Multiply or divide vector elements by a random uniform number in the range [1, M]. For large enough M, the transform might become beneficial.

@benwtrent
Member

Here are my results for the en-ja mixed dataset.

max inner product in baseline, euclidean in transformed

EN-Ja rerun

[Recall vs. latency plots: Reversed, Ordered, Random]
Updated knnPerfTest.py

For my testing runs I do the following:

  • Run all files & tests with just fanout:0 but with -reindex passed as an arg. This builds the indices
  • Then I remove the -reindex param and run with all the fanout parameters.

I found this to be the quickest way to test.

#!/usr/bin/env python

import subprocess
import benchUtil
import constants


LUCENE_CHECKOUT = 'lucene_candidate'

# test parameters. This script will run KnnGraphTester on every combination of these parameters
VALUES = {
    'ndoc': (100000,),
    'maxConn': (48, ),
    'beamWidthIndex': (200,),
    'fanout': (0, 10, 50, 90, 190, 490, 590, 990),
    'topK': (10,),
}

def advance(ix, values):
    for i in reversed(range(len(ix))):
        param = list(values.keys())[i]
        #print("advance " + param)
        if ix[i] == len(values[param]) - 1:
            ix[i] = 0
        else:
            ix[i] += 1
            return True
    return False

def run_knn_benchmark(checkout, values, training_file, testing_file, dims, metric):
    indexes = [0] * len(values.keys())
    indexes[-1] = -1
    args = []
    print(f"\n\n\nNow running {training_file}\n\n\n")
    dim = dims #768
    doc_vectors = training_file
    query_vectors = testing_file
    cp = benchUtil.classPathToString(benchUtil.getClassPath(checkout))
    JAVA_EXE = '/Users/benjamintrent/Library/Java/JavaVirtualMachines/jdk-20.0.1.jdk/Contents/Home/bin/java'
    cmd = [JAVA_EXE,
           '-cp', cp,
           '--add-modules', 'jdk.incubator.vector',
           '-Dorg.apache.lucene.store.MMapDirectory.enableMemorySegments=false',
           'KnnGraphTester']
    print("recall\tlatency\tnDoc\tfanout\tmaxConn\tbeamWidth\tvisited\tindex ms")
    while advance(indexes, values):
        pv = {}
        args = []
        for (i, p) in enumerate(list(values.keys())):
            if p in values:
                if values[p]:
                    value = values[p][indexes[i]]
                    pv[p] = value
                else:
                    args += ['-' + p]
        args += [a for (k, v) in pv.items() for a in ('-' + k, str(v)) if a]

        this_cmd = cmd + args + [
            '-dim', str(dim),
            '-docs', doc_vectors,
            #'-reindex',
            '-metric', metric,
            '-search', query_vectors,
            '-forceMerge',
            '-quiet',
            ]
        subprocess.run(this_cmd)
tests = [
    ('%s/util/en_ja.random.train' % constants.BASE_DIR, '%s/util/en_ja.test' % constants.BASE_DIR, 768, "angular"),
    ('%s/util/en_ja.ordered.train' % constants.BASE_DIR, '%s/util/en_ja.test' % constants.BASE_DIR, 768, "angular"),
    ('%s/util/en_ja.reversed.train' % constants.BASE_DIR, '%s/util/en_ja.test' % constants.BASE_DIR, 768, "angular"),
    ('%s/util/en_ja.random-transform.train' % constants.BASE_DIR, '%s/util/en_ja-transform.test' % constants.BASE_DIR, 769, "euclidean"),
    ('%s/util/en_ja.ordered-transform.train' % constants.BASE_DIR, '%s/util/en_ja-transform.test' % constants.BASE_DIR, 769, "euclidean"),
    ('%s/util/en_ja.reversed-transform.train' % constants.BASE_DIR, '%s/util/en_ja-transform.test' % constants.BASE_DIR, 769, "euclidean"),
]
for (training_file, testing_file, dims, metric) in tests:
    run_knn_benchmark(LUCENE_CHECKOUT, VALUES, training_file, testing_file, dims, metric)

@searchivarius

@benwtrent thank you for re-running, much appreciated! The results look the same, but do we agree that after the transform is done, all three similarities / distances should be equivalent for the purpose of k-NN search? This is due to all documents being normalized to the same length (and query length is the same during the search, simply because it's the very same query). The equivalence is more obvious when queries and documents all have unit length. However, I think it is still true assuming consistent normalization of documents.

@searchivarius

PS: @benwtrent could you publish the benchmark code? We could possibly extend it, e.g., add new tests.

@benwtrent
Member

@searchivarius my latest comment has the updated knnPerfTest.py that I have been using via Lucene Util.

I use a build of Lucene that utilizes the dot-product scaling suggested by @jmazanec15 .

I take the console output and copy/paste into the extraction & plotting code given in this comment: #12342 (comment)

All very manual.

@jmazanec15
Contributor Author

Second, great you remembered it, but I think there's no difference between cosine and L2 (i.e., search results are the same) if queries and documents have constant norms. They don't have to be normalized to the unit norm, I think any constant would suffice:

L2=(a-b)^2 = |a|^2 - ab + |b|^2 = const1 - cosine_similarity(a,b) * const2

What do you think? cc @jmazanec15

@searchivarius yes, I believe you are correct. Attached is a small proof that the ordering is the same between Euclidean distance and negative dot product when the norm of all vectors is the same (please double check that I did not make any mistakes).

Proof

Prove: Given a set of points, $S \subset \mathbb{R}^d$, where $\exists x \in \mathbb{R} , \forall v \in S , ||v||^2 = x$. Then, $\forall q \in \mathbb{R}^d$, the ordering produced by $||q - v_1||^2 \le ||q - v_2||^2 \iff v_1 \le v_2$ is the same as the ordering produced by $-\langle q , v_1\rangle \le -\langle q , v_2\rangle \iff v_1 \le v_2$.

Starting with:

$$||q - v_1||^2 \le ||q - v_2||^2$$

By parallelogram law:

$$\Rightarrow 2(||q||^2 + ||v_1||^2) - ||q + v_1||^2 \le 2(||q||^2 + ||v_2||^2) - ||q + v_2||^2$$

Removing equal terms:

$$\Rightarrow - ||q + v_1||^2 \le - ||q + v_2||^2$$

Flipping sign:

$$\Rightarrow ||q + v_1||^2 \ge ||q + v_2||^2$$

Definition of norm:

$$\Rightarrow \langle q + v_1 , q + v_1\rangle \ge \langle q + v_2 , q + v_2\rangle$$

Expanding:

$$\Rightarrow \langle q , q \rangle + \langle v_1 , v_1 \rangle + 2\langle q , v_1 \rangle \ge \langle q , q \rangle + \langle v_2 , v_2 \rangle + 2\langle q , v_2 \rangle$$

Removing equal terms:

$$\Rightarrow 2\langle q , v_1 \rangle \ge 2\langle q , v_2 \rangle$$

Dividing and flipping sign:

$$\Rightarrow -\langle q , v_1 \rangle \le -\langle q , v_2 \rangle$$

@searchivarius

searchivarius commented Aug 8, 2023

Seems to be correct (after checking what exactly the parallelogram law is). However, as I mentioned above, a simpler expansion of the dot product gives us that L2 is a monotonic transformation of the cosine similarity (and of the dot product as well, b/c norms are constant), and the monotonicity preserves the order. I just wanted to make sure my simple derivation of:

L2=(a-b)^2 = |a|^2 - 2*(a,b) + |b|^2 = const1 - cosine_similarity(a,b) * const2

looked plausible.
PS: I forgot the 2 before (a,b), but an extra constant doesn't change the outcome.

@jmazanec15
Contributor Author

@searchivarius I see, yes that's correct.

@benwtrent
Member

Here are some results on synthetic datasets. I utilized max-inner-product (scaled to prevent negatives) for the baseline, and euclidean & the typical transformation for the transform.

Here is the code for generating the scaled data: https://gist.github.com/benwtrent/be084f4737f79dab7204eca3e6ab98fe

Normal(loc=1,scale=0.1)

Magnitude stats:

| mean | median | var | max | min |
| --- | --- | --- | --- | --- |
| 5.768894195556641 | 5.769467830657959 | 0.054713960736989975 | 6.783757209777832 | 4.727940082550049 |

[Recall vs. latency plots: Reversed, Ordered, Random]

Normal(loc=1,scale=0.2)

| mean | median | var | max | min |
| --- | --- | --- | --- | --- |
| 10.196664810180664 | 10.195568084716797 | 0.03208443149924278 | 11.093703269958496 | 9.358368873596191 |

[Recall vs. latency plots: Reversed, Ordered, Random]

Pareto(5)

| mean | median | var | max | min |
| --- | --- | --- | --- | --- |
| 4.014744281768799 | 3.8864400386810303 | 0.5584204792976379 | 33.431556701660156 | 2.2141001224517822 |

[Recall vs. latency plots: Reversed, Ordered, Random]

Uniform

| mean | median | var | max | min |
| --- | --- | --- | --- | --- |
| 5.768894195556641 | 5.769467830657959 | 0.054713960736989975 | 6.783757209777832 | 4.727940082550049 |

[Recall vs. latency plots: Reversed, Ordered, Random]

BiModal 0.5

Two normal distributions with loc=1 and loc=3, each multiplied by 0.5

| mean | median | var | max | min |
| --- | --- | --- | --- | --- |
| 20.04954719543457 | 20.048864364624023 | 0.016332395374774933 | 20.65839195251465 | 19.469064712524414 |

[Recall vs. latency plots: Reversed, Ordered, Random]

BiModal 0.9

Two normal distributions with loc=1 and loc=3, each multiplied by 0.9 & 0.1 respectively

| mean | median | var | max | min |
| --- | --- | --- | --- | --- |
| 12.134689331054688 | 12.133764266967773 | 0.02648037299513817 | 12.946907997131348 | 11.34481430053711 |

[Recall vs. latency plots: Reversed, Ordered, Random]

Gamma(shape=1,scale=1)

| mean | median | var | max | min |
| --- | --- | --- | --- | --- |
| 14.072248458862305 | 13.95256233215332 | 1.9276633262634277 | 30.004880905151367 | 9.380194664001465 |

[Recall vs. latency plots: Reversed, Ordered, Random]

Gamma(shape=2,scale=2)

| mean | median | var | max | min |
| --- | --- | --- | --- | --- |
| 48.86893081665039 | 48.64986038208008 | 11.134326934814453 | 75.66817474365234 | 36.40324401855469 |

[Recall vs. latency plots: Reversed, Ordered, Random]

@searchivarius

Looking great, many thanks! Could you remind me what "ordered" and "reversed" are? Is this something related to insertion order?

@benwtrent
Member

Yes, insertion order. So which docs are indexed first.

Ordered is sorted by magnitude in ascending order. So smallest magnitudes are indexed first.

Reversed is the ordered set reversed: descending order by magnitude (largest magnitudes indexed first).

@benwtrent
Member

Just to confirm the gamma numbers for the reversed case (as it's the only one where transformed does better), I ran it again, but with up to 2000 neighbors explored. The latency & recall graphs stay steady.


@benwtrent
Member

It makes sense that the transformation works better with the gamma distributions (but only when the docs are indexed ordered by magnitude), as they have by far the highest variance of all the distributions.

It is an exceptional case for data to not only have such a high variance, but also for the user to purposefully index the documents sorted by their magnitude.

All our real datasets had a much lower variance. And in the random order case, baseline is still better than transformed. Even in the ordered & reverse cases, baseline was able to achieve 95% recall with just 1ms more latency.

Are we comfortable with moving forward and merging #12479 bringing Max-inner-product into Lucene?

@jmazanec15
Contributor Author

@benwtrent I am comfortable with it.

For the future: if we ever decide to switch up the edge-selection strategy so that a node not only has to be closer than a threshold distance, but closer by a factor of X (I believe Vamana has a strategy like this), then we may need to take extra care that the scaling still works for the transformation. But this is not a problem now, and I believe there would be a similar problem with the raw dot product anyway.

benwtrent added a commit that referenced this issue Aug 16, 2023
…12479)

The current dot-product score scaling and similarity implementation assumes normalized vectors. This disregards information that the model may store within the magnitude. 

See: #12342 (comment) for a good explanation for the need.

To prevent from breaking current scoring assumptions in Lucene, a new `MAXIMUM_INNER_PRODUCT` similarity function is added. 

Because the similarity from a `dotProduct` function call could be negative, this similarity scorer will scale negative dotProducts into (0, 1) and non-negative dotProduct values into [1, MAX).

One concern with adding this similarity function is that it breaks the triangle inequality. It is assumed that this is needed to build graph structures. But, there is conflicting research here when it comes to real-world data.

See:
 - For: #12342 (comment)
 - Against: #12342 (comment), #12342 (comment)

To check whether any transformation of the input is required to satisfy the triangle inequality, many tests have been run.

See:

 - #12342 (comment)
 - #12342 (comment)
 - #12342 (comment)

If there are any additional tests, or issues with the provided tests & scripts, please let me know. We want to make sure this works well for our users.

closes: #12342
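
Concretely, the scaling described above is the piecewise mapping proposed earlier in the thread; a rough sketch (see #12479 for the exact implementation):

// Sketch of the MAXIMUM_INNER_PRODUCT score scaling: negative dot products are
// squashed into (0, 1), non-negative dot products map to [1, Float.MAX_VALUE).
static float scaleMaxInnerProductScore(float dotProduct) {
  if (dotProduct < 0) {
    return 1 / (1 + -1 * dotProduct);
  }
  return dotProduct + 1;
}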