
Vector search (k nearest neighbour/knn) #6759

Closed · wants to merge 1 commit

Conversation

@astigsen (Contributor) commented Jun 29, 2023

This is an initial implementation of vector (k nearest neighbour/knn) search.

  • Vector embeddings are lists of floats.
  • Works with arbitrary dimensions, but the lists must all be of equal length.
  • Can be composed with regular queries to filter the results further.
  • This is a linear (brute-force) search. No index is needed.
  • Defaults to InnerProduct for distance calculation, but is pluggable for other algorithms.
  • Is exposed as .knn_search(column, vector, k) on Results, so it should be easy to use from SDKs.
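The behaviour the bullets describe can be sketched as a standalone brute-force top-k scan. This is illustrative only: `knn_search` and `ip_distance` below are stand-ins, not the Realm API, and the inner-product distance follows the `1 - dot` convention used by hnswlib (smaller means closer).

```cpp
#include <cstddef>
#include <queue>
#include <utility>
#include <vector>

// Inner-product "distance" in the hnswlib convention: 1 - dot(a, b).
static float ip_distance(const std::vector<float>& a, const std::vector<float>& b)
{
    float dot = 0.0f;
    for (size_t i = 0; i < a.size(); ++i)
        dot += a[i] * b[i];
    return 1.0f - dot;
}

// Return the indices of the k nearest rows, closest first.
static std::vector<size_t> knn_search(const std::vector<std::vector<float>>& rows,
                                      const std::vector<float>& query, size_t k)
{
    // Max-heap on distance: the worst of the current top-k sits on top.
    std::priority_queue<std::pair<float, size_t>> top;
    for (size_t i = 0; i < rows.size(); ++i) {
        float d = ip_distance(rows[i], query);
        if (top.size() < k)
            top.emplace(d, i);
        else if (d < top.top().first) {
            top.pop();
            top.emplace(d, i);
        }
    }
    // Pop farthest-first and fill the result back-to-front.
    std::vector<size_t> result(top.size());
    for (size_t i = top.size(); i-- > 0;) {
        result[i] = top.top().second;
        top.pop();
    }
    return result;
}
```

Since the scan touches every row, composing it with a regular query that narrows the candidate set first is the natural way to keep it cheap.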

@ironage (Contributor) left a comment
Just a few comments, but this is looking great! 👍

Obj o = get_object(row);
Lst<float> lst = o.get_list<float>(column);
if (lst.size() != dim)
    throw IllegalOperation("Knn distance can only be calculated on lists of matching length");

What would you think about ignoring malformed data? We could either use the first dim items in a list, or we could just return infinity for objects with lists of unexpected size? That seems more forgiving than throwing an exception.
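The "return infinity" option could look roughly like this. It is a sketch of the reviewer's suggestion, not the PR's code; `knn_distance_or_inf` is a hypothetical name, and the distance uses the same `1 - dot` inner-product convention.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical sketch: instead of throwing on a length mismatch,
// rank malformed rows last by giving them an infinite distance.
static float knn_distance_or_inf(const std::vector<float>& row,
                                 const std::vector<float>& query)
{
    if (row.size() != query.size())
        return std::numeric_limits<float>::infinity(); // malformed data sorts last
    float dot = 0.0f;
    for (size_t i = 0; i < row.size(); ++i)
        dot += row[i] * query[i];
    return 1.0f - dot; // inner-product distance convention
}
```

An infinite distance keeps malformed rows out of any finite top-k without an exception path, which is what makes it the more forgiving option.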

}
for (size_t t = 0; t < detached_ref_count; ++t)
    key_values.add(null_key);
};

I thought this was the end of the for loop; unindent to make it clear that it is the end of the lambda.


... and use braces with the for loop.

else {
    if (using_indexpairs) {
        // BaseDescriptor::Sorter predicate = base_descr->sorter(*m_table, index_pairs);
        // base_descr->execute(index_pairs, predicate, nullptr);

Remove the commented-out code.

REQUIRE(v2.size() == 2);
REQUIRE(v2.get(0).get<Int>(col_id) == 4);
REQUIRE(v2.get(1).get<Int>(col_id) == 1);
}

It would be great to also have a test for combining filter/distinct/sort, something like results.distinct(aFieldWithDuplicates).knn_search(...).sort(col_id, descending).


Btw, does it actually make sense to allow this? After knn, the results are sorted starting from the nearest. Does it make sense to allow sort/filter/distinct afterwards? Also, I'd assume this query is rather slow compared to other operations; we don't want to encourage running it in the middle of a chain, do we?


std::string SemanticSearchDescriptor::get_description(ConstTableRef) const
{
    return "KNN()";

We should fill this in if you'd like to eventually have support for this feature in the query parser. Is that something you'd like me to add?

#include <x86intrin.h>
#include <cpuid.h>
#include <stdint.h>
/*static void cpuid(int32_t cpuInfo[4], int32_t eax, int32_t ecx) {

Wouldn't it be better to add it as a submodule and exclude these pieces from compilation via some Realm build definition? This library still seems to be actively receiving updates, so that would be better for future maintenance.


src/realm/sort_descriptor.cpp (resolved)
@@ -523,6 +530,14 @@ bool DescriptorOrdering::will_apply_limit() const
});
}

bool DescriptorOrdering::will_apply_knn() const
{
    return std::any_of(m_descriptors.begin(), m_descriptors.end(), [](const std::unique_ptr<BaseDescriptor>& desc) {

This is known if append_knn was called, no?


auto new_order = m_descriptor_ordering;
new_order.append_knn(SemanticSearchDescriptor(query_data, k, column));
if (m_mode == Mode::Collection)

A test for this mode would be great.

@@ -402,6 +403,9 @@ class Table {
std::optional<Mixed> min(ColKey col_key, ObjKey* = nullptr) const;
std::optional<Mixed> max(ColKey col_key, ObjKey* = nullptr) const;
std::optional<Mixed> avg(ColKey col_key, size_t* value_count = nullptr) const;

// Calculate the distance between the two vectors (embeddings)
float dist_knn(const std::vector<float>& query_data, ColKey column, ObjKey row, hnswlib::SpaceInterface<float>& s) const;

This should not be an operation on Table. Could just be a lambda within SemanticSearchDescriptor::execute
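The rough shape of that suggestion: keep the distance computation as a local lambda capturing the query vector, rather than a Table member. Everything below is a stand-in (plain `std::vector` rows instead of Realm objects), just to show the structure.

```cpp
#include <cstddef>
#include <vector>

// Illustrative sketch: a local lambda replaces Table::dist_knn,
// capturing the query and computing the 1 - dot inner-product distance.
static std::vector<float> distances_via_lambda(const std::vector<std::vector<float>>& rows,
                                               const std::vector<float>& query)
{
    auto dist = [&query](const std::vector<float>& row) {
        float dot = 0.0f;
        for (size_t i = 0; i < row.size(); ++i)
            dot += row[i] * query[i];
        return 1.0f - dot;
    };
    std::vector<float> out;
    out.reserve(rows.size());
    for (const auto& row : rows)
        out.push_back(dist(row)); // one call per candidate row
    return out;
}
```

Keeping the helper local avoids widening Table's public surface for a computation only the descriptor needs.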


private:
    std::vector<float> m_query_data;
    size_t m_k;

More descriptive variable names, please.

for (size_t i = 0; i < n; i++) {
    ObjKey r = key_values.get(i);
    float dist = table.dist_knn(m_query_data, m_column, r, m_sp);
    topResults.push(std::pair<float, ObjKey>(dist, r));

use emplace(dist, r)

float lastdist = topResults.empty() ? std::numeric_limits<float>::max() : topResults.top().first;
for (size_t i = m_k; i < key_values.size(); i++) {
    ObjKey r = key_values.get(i);
    if (!table.is_valid(r)) continue;

Inefficient to check for validity first and then get object. Better to use try_get_object in dist_knn and return std::optional<float>
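The pattern the reviewer is describing — one lookup that can fail, instead of a validity check followed by a second fetch — can be sketched like this. The `std::unordered_map` stands in for Realm's table and `try_dist` is a hypothetical name, not the PR's API.

```cpp
#include <cstddef>
#include <optional>
#include <unordered_map>
#include <vector>

// Illustrative sketch: a single find() replaces is_valid() + get_object(),
// returning std::nullopt for rows that no longer exist.
static std::optional<float>
try_dist(const std::unordered_map<int, std::vector<float>>& table, int key,
         const std::vector<float>& query)
{
    auto it = table.find(key); // one lookup instead of two
    if (it == table.end())
        return std::nullopt;   // invalid row: caller simply skips it
    float dot = 0.0f;          // assumes the stored row matches query.size()
    for (size_t i = 0; i < query.size(); ++i)
        dot += it->second[i] * query[i];
    return 1.0f - dot;
}
```

The caller's loop then becomes `if (auto d = try_dist(...)) { ... }` with no separate validity branch.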

    float dist = table.dist_knn(m_query_data, m_column, r, m_sp);
    topResults.push(std::pair<float, ObjKey>(dist, r));
}
float lastdist = topResults.empty() ? std::numeric_limits<float>::max() : topResults.top().first;

How can topResults be empty? I guess only if m_k or key_values size is zero in which case we should just return empty result.

{
    if (m_k >= key_values.size()) return; // all entries already match as closest

    std::priority_queue<std::pair<float, ObjKey>> topResults;

We use snake case for variables.


// set result to the matches, in order of closest match first
key_values.clear();
while (!topResults.empty()) {

More efficient to traverse the underlying container in reverse and just add elements. Create a class like this:

class PrioQueue : public std::priority_queue<std::pair<float, ObjKey>> {
public:
    auto begin() {
        return c.rbegin();
    }
    auto end() {
        return c.rend();
    }
};

Then just

for (auto tr : topResults) {
    key_values.add(tr.second);
}

ColKey m_column;

// We are going to default to measuring distance by Inner Product for now
mutable hnswlib::InnerProductSpace m_sp;

More descriptive variable names, please.


@jedelbo (Contributor) commented Aug 29, 2023

Many formatting problems. You can compare with branch je/vector-search where I have fixed the formatting problems and done some improvements.

@kiburtse (Contributor)

> Many formatting problems. You can compare with branch je/vector-search where I have fixed the formatting problems and done some improvements.

@jedelbo could you create a PR into this branch with your changes, so that possible fixes and reviews aren't spread out?

@waqasakram117

Any update on this merge?

@thrashr888

Just to share why I and others might be interested in this functionality, AI prompting and AI chat significantly benefit in quality through RAG search using vector databases. Today, there are many ways to perform AI completions in macOS and iOS apps, but the options for local vector databases to use for RAG are very limited. If added, Realm Swift would be the best possible option for implementing local RAG search on macOS and iOS. I've tried other options and they are mainly just saving and loading JSON files fully in memory and cosine similarity search on the in-application-memory objects. Realm would be much more efficient (as I understand it) and the DevEx is vastly improved over the other options (thank you!). With Apple's investments in MLX, fully local AI projects are much closer to reality and apps can start practically applying AI in many use cases. Given this PR, it seems like we're very close to enabling good, efficient local RAG for the full ecosystem of Apple apps. I appreciate your time and attention!

@carbonete

I created a discussion in the forum to gather arguments for developing this feature.

https://www.mongodb.com/community/forums/t/arguments-to-vector-search-in-local-realm-please-contribute/268834

@ashishatmdb left a comment

Is there an expectation from our users that the results of this vector search will be somewhat similar to the vector search results they might get from querying the Atlas database? If so, what provides that stability of results between this library and our Lucene implementation?

@jedelbo force-pushed the as/vector-search branch from ee7ddc2 to b39c506 on April 8, 2024 14:00
This is an initial implementation of vector (k nearest neighbour/knn) search.

- Works on lists of floats.
- The lists must all be of equal length.
- Can be composed with regular queries to filter the results further.
- This is a linear (brute-force) search. No index is used.
- Defaults to InnerProduct for distance calculation, but is pluggable for other algorithms.
@jedelbo force-pushed the as/vector-search branch from b39c506 to 5d5bb6c on April 8, 2024 14:06

Pull Request Test Coverage Report for Build alexander.stigsen_3

Details

  • 300 of 300 (100.0%) changed or added relevant lines in 7 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.04%) to 91.995%

Totals Coverage Status
Change from base Build 2204: 0.04%
Covered Lines: 244812
Relevant Lines: 266115

💛 - Coveralls

@daviddaytw

Hi! Thank you all for the effort. I need this thing working for local private RAG in one of my React-Native apps. Is there anything I can do to push things forward?

@jedelbo closed this Aug 14, 2024
@carbonete

@jedelbo has this feature for Realm been abandoned?

@jedelbo (Contributor) commented Aug 26, 2024

The product description for this feature has not been approved yet, so we don't know if this work here is still relevant.

@carbonete

@jedelbo thanks for answering 👍

My arguments for including vector search in local realms: we should observe the industry moving inference priority to the edge — see Intel, Qualcomm, AMD, Meta (Llama mobile), MS (ONNX), Google (Gemini Nano) and many others.

- Transfer to the edge
- Reduced infrastructure
- Scalability
- Enhanced privacy and security: local data
- Offline functionality
- Resilience to network failures
- Low latency, real-time responses
- Real-time insights and decision making

Competition is taking first place:

https://info.couchbase.com/couchbase-lite-vector-search-beta-program

Many use cases can benefit from this.

I know that this feature needs hard work and investment, but for clients like me that use Realm Sync and local Realm in apps it is of great importance.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 25, 2024
9 participants