
Vector search (k nearest neighbour/knn) #6759

Closed · wants to merge 1 commit

Conversation

@astigsen (Contributor) commented Jun 29, 2023

This is an initial implementation of vector (k nearest neighbour/knn) search.

  • Vector embeddings are lists of floats.
  • Works with arbitrary dimensions, but the lists must all be of equal length.
  • Can be composed with regular queries to filter the results further.
  • This is a linear (brute-force) search. No index is needed.
  • Defaults to InnerProduct for distance calculation, but is pluggable for other algorithms.
  • Is exposed as .knn_search(column, vector, k) on Results, so it should be easy to use from SDKs.
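The behaviour the bullets describe can be sketched as a standalone brute-force top-k scan. This is illustrative only: `knn_search` and `ip_distance` below are stand-ins, not the Realm API, and the inner-product distance follows the `1 - dot` convention used by hnswlib (smaller means closer).

```cpp
#include <cstddef>
#include <queue>
#include <utility>
#include <vector>

// Inner-product "distance" in the hnswlib convention: 1 - dot(a, b).
static float ip_distance(const std::vector<float>& a, const std::vector<float>& b)
{
    float dot = 0.0f;
    for (size_t i = 0; i < a.size(); ++i)
        dot += a[i] * b[i];
    return 1.0f - dot;
}

// Return the indices of the k nearest rows, closest first.
static std::vector<size_t> knn_search(const std::vector<std::vector<float>>& rows,
                                      const std::vector<float>& query, size_t k)
{
    // Max-heap on distance: the worst of the current top-k sits on top.
    std::priority_queue<std::pair<float, size_t>> top;
    for (size_t i = 0; i < rows.size(); ++i) {
        float d = ip_distance(rows[i], query);
        if (top.size() < k)
            top.emplace(d, i);
        else if (d < top.top().first) {
            top.pop();
            top.emplace(d, i);
        }
    }
    // Pop farthest-first and fill the result back-to-front.
    std::vector<size_t> result(top.size());
    for (size_t i = top.size(); i-- > 0;) {
        result[i] = top.top().second;
        top.pop();
    }
    return result;
}
```

Since the scan touches every row, composing it with a regular query that narrows the candidate set first is the natural way to keep it cheap.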

@ironage (Contributor) left a comment
Just a few comments, but this is looking great! 👍

Obj o = get_object(row);
Lst<float> lst = o.get_list<float>(column);
if (lst.size() != dim)
    throw IllegalOperation("Knn distance can only be calculated on lists of matching length");

What would you think about ignoring malformed data? We could either use the first dim items in a list, or we could just return infinity for objects with lists of unexpected size? That seems more forgiving than throwing an exception.
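The "return infinity" option could look roughly like this. It is a sketch of the reviewer's suggestion, not the PR's code; `knn_distance_or_inf` is a hypothetical name, and the distance uses the same `1 - dot` inner-product convention.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical sketch: instead of throwing on a length mismatch,
// rank malformed rows last by giving them an infinite distance.
static float knn_distance_or_inf(const std::vector<float>& row,
                                 const std::vector<float>& query)
{
    if (row.size() != query.size())
        return std::numeric_limits<float>::infinity(); // malformed data sorts last
    float dot = 0.0f;
    for (size_t i = 0; i < row.size(); ++i)
        dot += row[i] * query[i];
    return 1.0f - dot; // inner-product distance convention
}
```

An infinite distance keeps malformed rows out of any finite top-k without an exception path, which is what makes it the more forgiving option.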

}
for (size_t t = 0; t < detached_ref_count; ++t)
    key_values.add(null_key);
};

I thought this was the end of the for loop; unindent to make it clear that it is the end of the lambda.


... and use braces with the for loop.

else {
    if (using_indexpairs) {
        // BaseDescriptor::Sorter predicate = base_descr->sorter(*m_table, index_pairs);
        // base_descr->execute(index_pairs, predicate, nullptr);

Remove the commented-out code.

REQUIRE(v2.size() == 2);
REQUIRE(v2.get(0).get<Int>(col_id) == 4);
REQUIRE(v2.get(1).get<Int>(col_id) == 1);
}

It would be great to also have a test for combining filter/distinct/sort, something like results.distinct(aFieldWithDuplicates).knn_search(...).sort(col_id, descending).


Btw, does it actually make sense to allow this? After knn, the results are sorted starting from the nearest. Does it make sense to allow sort/filter/distinct afterwards? Also, I'd assume this query is rather slow compared to other operations; we don't want to encourage running it in the middle of a chain, do we?


std::string SemanticSearchDescriptor::get_description(ConstTableRef) const
{
    return "KNN()";

We should fill this in if you'd like to eventually have support for this feature in the query parser. Is that something you'd like me to add?

#include <x86intrin.h>
#include <cpuid.h>
#include <stdint.h>
/*static void cpuid(int32_t cpuInfo[4], int32_t eax, int32_t ecx) {

Wouldn't it be better to add it as a submodule and exclude these pieces from compilation via some Realm build definition? This library still seems to be actively receiving updates, so that would be better for future maintenance.


src/realm/sort_descriptor.cpp (resolved)
@@ -523,6 +530,14 @@ bool DescriptorOrdering::will_apply_limit() const
});
}

bool DescriptorOrdering::will_apply_knn() const
{
    return std::any_of(m_descriptors.begin(), m_descriptors.end(), [](const std::unique_ptr<BaseDescriptor>& desc) {

This is known if append_knn was called, no?


auto new_order = m_descriptor_ordering;
new_order.append_knn(SemanticSearchDescriptor(query_data, k, column));
if (m_mode == Mode::Collection)

A test for this mode would be great.

@@ -402,6 +403,9 @@ class Table {
std::optional<Mixed> min(ColKey col_key, ObjKey* = nullptr) const;
std::optional<Mixed> max(ColKey col_key, ObjKey* = nullptr) const;
std::optional<Mixed> avg(ColKey col_key, size_t* value_count = nullptr) const;

// Calculate the distance between the two vectors (embeddings)
float dist_knn(const std::vector<float>& query_data, ColKey column, ObjKey row, hnswlib::SpaceInterface<float>& s) const;

This should not be an operation on Table. Could just be a lambda within SemanticSearchDescriptor::execute
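The rough shape of that suggestion: keep the distance computation as a local lambda capturing the query vector, rather than a Table member. Everything below is a stand-in (plain `std::vector` rows instead of Realm objects), just to show the structure.

```cpp
#include <cstddef>
#include <vector>

// Illustrative sketch: a local lambda replaces Table::dist_knn,
// capturing the query and computing the 1 - dot inner-product distance.
static std::vector<float> distances_via_lambda(const std::vector<std::vector<float>>& rows,
                                               const std::vector<float>& query)
{
    auto dist = [&query](const std::vector<float>& row) {
        float dot = 0.0f;
        for (size_t i = 0; i < row.size(); ++i)
            dot += row[i] * query[i];
        return 1.0f - dot;
    };
    std::vector<float> out;
    out.reserve(rows.size());
    for (const auto& row : rows)
        out.push_back(dist(row)); // one call per candidate row
    return out;
}
```

Keeping the helper local avoids widening Table's public surface for a computation only the descriptor needs.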


private:
    std::vector<float> m_query_data;
    size_t m_k;

More descriptive variable names, please.

for (size_t i = 0; i < n; i++) {
    ObjKey r = key_values.get(i);
    float dist = table.dist_knn(m_query_data, m_column, r, m_sp);
    topResults.push(std::pair<float, ObjKey>(dist, r));

use emplace(dist, r)

float lastdist = topResults.empty() ? std::numeric_limits<float>::max() : topResults.top().first;
for (size_t i = m_k; i < key_values.size(); i++) {
    ObjKey r = key_values.get(i);
    if (!table.is_valid(r)) continue;

Inefficient to check for validity first and then get object. Better to use try_get_object in dist_knn and return std::optional<float>
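The pattern the reviewer is describing — one lookup that can fail, instead of a validity check followed by a second fetch — can be sketched like this. The `std::unordered_map` stands in for Realm's table and `try_dist` is a hypothetical name, not the PR's API.

```cpp
#include <cstddef>
#include <optional>
#include <unordered_map>
#include <vector>

// Illustrative sketch: a single find() replaces is_valid() + get_object(),
// returning std::nullopt for rows that no longer exist.
static std::optional<float>
try_dist(const std::unordered_map<int, std::vector<float>>& table, int key,
         const std::vector<float>& query)
{
    auto it = table.find(key); // one lookup instead of two
    if (it == table.end())
        return std::nullopt;   // invalid row: caller simply skips it
    float dot = 0.0f;          // assumes the stored row matches query.size()
    for (size_t i = 0; i < query.size(); ++i)
        dot += it->second[i] * query[i];
    return 1.0f - dot;
}
```

The caller's loop then becomes `if (auto d = try_dist(...)) { ... }` with no separate validity branch.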

    float dist = table.dist_knn(m_query_data, m_column, r, m_sp);
    topResults.push(std::pair<float, ObjKey>(dist, r));
}
float lastdist = topResults.empty() ? std::numeric_limits<float>::max() : topResults.top().first;

How can topResults be empty? I guess only if m_k or key_values size is zero in which case we should just return empty result.

{
    if (m_k >= key_values.size()) return; // all entries already match as closest

    std::priority_queue<std::pair<float, ObjKey>> topResults;

We use snake case for variables.


// set result to the matches, in order of closest match first
key_values.clear();
while (!topResults.empty()) {

More efficient to traverse the underlying container in reverse and just add elements. Create a class like this:

class PrioQueue : public std::priority_queue<std::pair<float, ObjKey>> {
public:
    auto begin() {
        return c.rbegin();
    }
    auto end() {
        return c.rend();
    }
};

Then just

for (auto tr : topResults) {
    key_values.add(tr.second);
}

ColKey m_column;

// We are going to default to measuring distance by Inner Product for now
mutable hnswlib::InnerProductSpace m_sp;

More descriptive variable names, please.


@jedelbo (Contributor) commented Aug 29, 2023

Many formatting problems. You can compare with branch je/vector-search where I have fixed the formatting problems and done some improvements.

@kiburtse (Contributor)

> Many formatting problems. You can compare with branch je/vector-search where I have fixed the formatting problems and done some improvements.

@jedelbo could you create a PR into this branch with your changes, so that possible fixes and reviews aren't spread out?

@waqasakram117

Any update on this merge?

@thrashr888

Just to share why I and others might be interested in this functionality, AI prompting and AI chat significantly benefit in quality through RAG search using vector databases. Today, there are many ways to perform AI completions in macOS and iOS apps, but the options for local vector databases to use for RAG are very limited. If added, Realm Swift would be the best possible option for implementing local RAG search on macOS and iOS. I've tried other options and they are mainly just saving and loading JSON files fully in memory and cosine similarity search on the in-application-memory objects. Realm would be much more efficient (as I understand it) and the DevEx is vastly improved over the other options (thank you!). With Apple's investments in MLX, fully local AI projects are much closer to reality and apps can start practically applying AI in many use cases. Given this PR, it seems like we're very close to enabling good, efficient local RAG for the full ecosystem of Apple apps. I appreciate your time and attention!

@carbonete

I created a discussion in the forum to gather arguments for developing this feature.

https://www.mongodb.com/community/forums/t/arguments-to-vector-search-in-local-realm-please-contribute/268834

@ashishatmdb left a comment

Is there an expectation from our users that the results of this vector search will be somewhat similar to the vector search results they might get from querying the Atlas database? If so, what provides that stability of results between this library and our Lucene implementation?

@jedelbo force-pushed the as/vector-search branch from ee7ddc2 to b39c506 on April 8, 2024 14:00
This is an initial implementation of vector (k nearest neighbour/knn) search.

- Works on lists of floats.
- The lists must all be of equal length.
- Can be composed with regular queries to filter the results further.
- This is a linear (brute-force) search. No index is used.
- Defaults to InnerProduct for distance calculation, but is pluggable for other algorithms.
@jedelbo force-pushed the as/vector-search branch from b39c506 to 5d5bb6c on April 8, 2024 14:06

Pull Request Test Coverage Report for Build alexander.stigsen_3

Details

  • 300 of 300 (100.0%) changed or added relevant lines in 7 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.04%) to 91.995%

Totals Coverage Status
Change from base Build 2204: 0.04%
Covered Lines: 244812
Relevant Lines: 266115

💛 - Coveralls

@daviddaytw

Hi! Thank you all for the effort. I need this thing working for local private RAG in one of my React-Native apps. Is there anything I can do to push things forward?

@jedelbo closed this Aug 14, 2024
@carbonete

@jedelbo has this feature for Realm been abandoned?

@jedelbo (Contributor) commented Aug 26, 2024

The product description for this feature has not been approved yet, so we don't know if this work here is still relevant.

@carbonete

@jedelbo thanks for answering 👍

My arguments for including vector search in local realms: we should observe the industry moving inference priority to the edge — see Intel, Qualcomm, AMD, Meta (Llama mobile), MS (ONNX), Google (Gemini Nano) and many others.

- Transfer to the edge
- Reduced infrastructure
- Scalability
- Enhanced privacy and security: local data
- Offline functionality
- Resilience to network failures
- Low latency, real-time responses
- Real-time insights and decision making

Competition is taking first place:

https://info.couchbase.com/couchbase-lite-vector-search-beta-program

Many use cases can benefit from this.

I know that this feature needs hard work and investment, but for clients like me that use Realm Sync and local Realm in apps it is of great importance.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 25, 2024
9 participants