GITHUB#12342 Add new maximum inner product vector similarity method (#12479)

The current dot-product score scaling and similarity implementation assumes normalized vectors. This disregards information that the model may store within the magnitude. 

See #12342 (comment) for a good explanation of the need.

To avoid breaking current scoring assumptions in Lucene, a new `MAXIMUM_INNER_PRODUCT` similarity function is added.

Because the similarity returned by a `dotProduct` call can be negative, this similarity scorer scales negative dot products into the range (0, 1) and shifts non-negative dot products into the range [1, MAX].
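
As a rough illustration of this scaling, the standalone Java sketch below applies the same piecewise mapping to a few raw dot products. The class name `MipScoreSketch` and its helper methods are hypothetical and exist only for this example; the implementation added by this commit is `VectorUtil.scaleMaxInnerProductScore`.

```java
public class MipScoreSketch {

  // Hypothetical helper mirroring the scaling described above:
  // negative dot products map into (0, 1), non-negative ones map into [1, +inf).
  public static float scale(float dot) {
    if (dot < 0) {
      return 1 / (1 - dot);
    }
    return dot + 1;
  }

  // Plain float dot product, written out so the sketch has no Lucene dependency.
  public static float dot(float[] a, float[] b) {
    float sum = 0;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  public static void main(String[] args) {
    // Same query and document vectors as the testScoreMIP test in this commit.
    float[] query = {0, -1};
    float[][] docs = {{0, 1}, {1, 2}, {0, 0}};
    for (int i = 0; i < docs.length; i++) {
      float raw = dot(query, docs[i]);
      System.out.println("id" + i + " raw=" + raw + " score=" + scale(raw));
    }
  }
}
```

Both branches increase monotonically and meet at a score of 1 for a dot product of 0, so ranking by the scaled score preserves the ranking by raw inner product.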

One concern with adding this similarity function is that it breaks the triangle inequality, which is commonly assumed to be necessary for building graph structures. However, the research conflicts on whether this matters for real-world data.

See:
 - For: #12342 (comment)
 - Against: #12342 (comment), #12342 (comment)
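
To make the triangle inequality concern concrete, here is a minimal sketch that treats the negated dot product as a stand-in "distance". This choice of distance is an assumption made purely for illustration, and the class `TriangleInequalitySketch` is hypothetical; neither is part of this commit or of Lucene.

```java
public class TriangleInequalitySketch {

  public static float dot(float[] a, float[] b) {
    float sum = 0;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  // Assumed stand-in distance: a larger inner product means "closer".
  public static float distance(float[] a, float[] b) {
    return -dot(a, b);
  }

  // Checks d(a, c) <= d(a, b) + d(b, c) for one triple of vectors.
  public static boolean satisfiesTriangle(float[] a, float[] b, float[] c) {
    return distance(a, c) <= distance(a, b) + distance(b, c);
  }

  public static void main(String[] args) {
    float[] a = {1, 0};
    float[] b = {2, 2}; // larger magnitude, so it looks "close" to both a and c
    float[] c = {0, 1};
    System.out.println("triangle inequality holds: " + satisfiesTriangle(a, b, c));
  }
}
```

For this triple the check fails: the high-magnitude vector `b` has a larger inner product with both `a` and `c` than they have with each other, which is exactly the magnitude information that normalized dot-product scoring would discard.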

To check whether any transformation of the input is required to satisfy the triangle inequality, many tests have been run.

See:

 - #12342 (comment)
 - #12342 (comment)
 - #12342 (comment)

If there are any additional tests, or issues with the provided tests & scripts, please let me know. We want to make sure this works well for our users.

closes: #12342
benwtrent committed Aug 16, 2023
1 parent e850883 commit 181466d
Showing 6 changed files with 71 additions and 2 deletions.
3 changes: 3 additions & 0 deletions lucene/CHANGES.txt
@@ -26,6 +26,9 @@ New Features
search results can be provided. The first custom collector provides `ToParentBlockJoin[Float|Byte]KnnVectorQuery`
joining child vector documents with their parent documents. (Ben Trent)

* GITHUB#12479: Add new Maximum Inner Product vector similarity function for non-normalized dot-product
vector search. (Jack Mazanec, Ben Trent)

Improvements
---------------------
* GITHUB#12374: Add CachingLeafSlicesSupplier to compute the LeafSlices for concurrent segment search (Sorabh Hamirwasia)
@@ -19,6 +19,7 @@
import static org.apache.lucene.util.VectorUtil.cosine;
import static org.apache.lucene.util.VectorUtil.dotProduct;
import static org.apache.lucene.util.VectorUtil.dotProductScore;
import static org.apache.lucene.util.VectorUtil.scaleMaxInnerProductScore;
import static org.apache.lucene.util.VectorUtil.squareDistance;

/**
@@ -76,6 +77,23 @@ public float compare(float[] v1, float[] v2) {
public float compare(byte[] v1, byte[] v2) {
return (1 + cosine(v1, v2)) / 2;
}
},

/**
* Maximum inner product. This is like {@link VectorSimilarityFunction#DOT_PRODUCT}, but does not
* require normalization of the inputs. Should be used when the embedding vectors store useful
* information within the vector magnitude
*/
MAXIMUM_INNER_PRODUCT {
@Override
public float compare(float[] v1, float[] v2) {
return scaleMaxInnerProductScore(dotProduct(v1, v2));
}

@Override
public float compare(byte[] v1, byte[] v2) {
return scaleMaxInnerProductScore(dotProduct(v1, v2));
}
};

/**
11 changes: 11 additions & 0 deletions lucene/core/src/java/org/apache/lucene/util/VectorUtil.java
@@ -164,6 +164,17 @@ public static float dotProductScore(byte[] a, byte[] b) {
return 0.5f + dotProduct(a, b) / denom;
}

/**
* @param vectorDotProductSimilarity the raw similarity between two vectors
* @return A scaled score preventing negative scores for maximum-inner-product
*/
public static float scaleMaxInnerProductScore(float vectorDotProductSimilarity) {
if (vectorDotProductSimilarity < 0) {
return 1 / (1 + -1 * vectorDotProductSimilarity);
}
return vectorDotProductSimilarity + 1;
}

/**
* Checks if a float vector only has finite components.
*
@@ -380,6 +380,29 @@ public void testScoreCosine() throws IOException {
}
}

public void testScoreMIP() throws IOException {
try (Directory indexStore =
getIndexStore(
"field",
VectorSimilarityFunction.MAXIMUM_INNER_PRODUCT,
new float[] {0, 1},
new float[] {1, 2},
new float[] {0, 0});
IndexReader reader = DirectoryReader.open(indexStore)) {
IndexSearcher searcher = newSearcher(reader);
AbstractKnnVectorQuery kvq = getKnnVectorQuery("field", new float[] {0, -1}, 10);
assertMatches(searcher, kvq, 3);
ScoreDoc[] scoreDocs = searcher.search(kvq, 3).scoreDocs;
assertIdMatches(reader, "id2", scoreDocs[0]);
assertIdMatches(reader, "id0", scoreDocs[1]);
assertIdMatches(reader, "id1", scoreDocs[2]);

assertEquals(1.0, scoreDocs[0].score, 1e-7);
assertEquals(1 / 2f, scoreDocs[1].score, 1e-7);
assertEquals(1 / 3f, scoreDocs[2].score, 1e-7);
}
}

public void testExplain() throws IOException {
try (Directory d = newDirectory()) {
try (IndexWriter w = new IndexWriter(d, new IndexWriterConfig())) {
@@ -773,11 +796,21 @@ public void testBitSetQuery() throws IOException {

/** Creates a new directory and adds documents with the given vectors as kNN vector fields */
Directory getIndexStore(String field, float[]... contents) throws IOException {
return getIndexStore(field, VectorSimilarityFunction.EUCLIDEAN, contents);
}

/**
* Creates a new directory and adds documents with the given vectors with similarity as kNN vector
* fields
*/
Directory getIndexStore(
String field, VectorSimilarityFunction vectorSimilarityFunction, float[]... contents)
throws IOException {
Directory indexStore = newDirectory();
RandomIndexWriter writer = new RandomIndexWriter(random(), indexStore);
for (int i = 0; i < contents.length; ++i) {
Document doc = new Document();
- doc.add(getKnnVectorField(field, contents[i]));
+ doc.add(getKnnVectorField(field, contents[i], vectorSimilarityFunction));
doc.add(new StringField("id", "id" + i, Field.Store.YES));
writer.addDocument(doc);
}
@@ -1246,6 +1246,9 @@ private static String flags(org.apache.lucene.luke.models.documents.DocumentFiel
case EUCLIDEAN:
sb.append("euc");
break;
case MAXIMUM_INNER_PRODUCT:
sb.append("mip");
break;
default:
sb.append("???");
}
@@ -1288,7 +1288,8 @@ public void testSimilarityFunctionIdentifiers() {
assertEquals(0, VectorSimilarityFunction.EUCLIDEAN.ordinal());
assertEquals(1, VectorSimilarityFunction.DOT_PRODUCT.ordinal());
assertEquals(2, VectorSimilarityFunction.COSINE.ordinal());
- assertEquals(3, VectorSimilarityFunction.values().length);
+ assertEquals(3, VectorSimilarityFunction.MAXIMUM_INNER_PRODUCT.ordinal());
+ assertEquals(4, VectorSimilarityFunction.values().length);
}

public void testVectorEncodingOrdinals() {