-
Notifications
You must be signed in to change notification settings - Fork 168
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Vector search (k nearest neighbour/knn) #6759
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few comments, but this is looking great! 👍
src/realm/table.cpp
Outdated
Obj o = get_object(row); | ||
Lst<float> lst = o.get_list<float>(column); | ||
if (lst.size() != dim) | ||
throw IllegalOperation("Knn distance can only be calculated on lists of matching length"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would you think about ignoring malformed data? We could either use the first dim
items in a list, or we could just return infinity for objects with lists of unexpected size? That seems more forgiving than throwing an exception.
src/realm/table_view.cpp
Outdated
} | ||
for (size_t t = 0; t < detached_ref_count; ++t) | ||
key_values.add(null_key); | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought this was the end of the for loop, unindent to make it clear it is the end of the lambda.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... and use braces with the for loop.
src/realm/table_view.cpp
Outdated
else { | ||
if (using_indexpairs) { | ||
//BaseDescriptor::Sorter predicate = base_descr->sorter(*m_table, index_pairs); | ||
//base_descr->execute(index_pairs, predicate, nullptr); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove commented code
REQUIRE(v2.size() == 2); | ||
REQUIRE(v2.get(0).get<Int>(col_id) == 4); | ||
REQUIRE(v2.get(1).get<Int>(col_id) == 1); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it would be great to also have a test for combining filter/distinct/sort. Something like results.distinct(aFieldWithDuplicates).knn_search(...).sort(col_id, descending)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Btw, does it make sense to allow it actually? After knn results are sorted starting from the nearest. Does it make sense to allow sort/filter/distinct afterwards? Also, i'd assume that this query is rather slow compared to other operations, we don't want to encourage doing it in the middle somewhere, no?
|
||
std::string SemanticSearchDescriptor::get_description(ConstTableRef) const | ||
{ | ||
return "KNN()"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should fill this in if you'd like to eventually have support for this feature in the query parser. Is that something you'd like me to add?
#include <x86intrin.h> | ||
#include <cpuid.h> | ||
#include <stdint.h> | ||
/*static void cpuid(int32_t cpuInfo[4], int32_t eax, int32_t ecx) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't it be better to add it as a submodule and exclude these pieces from compilation by some realm build definition? This library seems to still actively receiving updates, would be better for future maintenance.
REQUIRE(v2.size() == 2); | ||
REQUIRE(v2.get(0).get<Int>(col_id) == 4); | ||
REQUIRE(v2.get(1).get<Int>(col_id) == 1); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Btw, does it make sense to allow it actually? After knn results are sorted starting from the nearest. Does it make sense to allow sort/filter/distinct afterwards? Also, i'd assume that this query is rather slow compared to other operations, we don't want to encourage doing it in the middle somewhere, no?
@@ -523,6 +530,14 @@ bool DescriptorOrdering::will_apply_limit() const | |||
}); | |||
} | |||
|
|||
bool DescriptorOrdering::will_apply_knn() const | |||
{ | |||
return std::any_of(m_descriptors.begin(), m_descriptors.end(), [](const std::unique_ptr<BaseDescriptor>& desc) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is known if append_knn was called, no?
|
||
auto new_order = m_descriptor_ordering; | ||
new_order.append_knn(SemanticSearchDescriptor(query_data, k, column)); | ||
if (m_mode == Mode::Collection) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Test for this mode would be great.
src/realm/table.hpp
Outdated
@@ -402,6 +403,9 @@ class Table { | |||
std::optional<Mixed> min(ColKey col_key, ObjKey* = nullptr) const; | |||
std::optional<Mixed> max(ColKey col_key, ObjKey* = nullptr) const; | |||
std::optional<Mixed> avg(ColKey col_key, size_t* value_count = nullptr) const; | |||
|
|||
// Calculate the distance between the two vectors (embeddings) | |||
float dist_knn(const std::vector<float>& query_data, ColKey column, ObjKey row, hnswlib::SpaceInterface<float>& s) const; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should not be an operation on Table. Could just be a lambda within SemanticSearchDescriptor::execute
|
||
private: | ||
std::vector<float> m_query_data; | ||
size_t m_k; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More descriptive variable names, please.
src/realm/sort_descriptor.cpp
Outdated
for (size_t i = 0; i < n; i++) { | ||
ObjKey r = key_values.get(i); | ||
float dist = table.dist_knn(m_query_data, m_column, r, m_sp); | ||
topResults.push(std::pair<float, ObjKey>(dist, r)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use emplace(dist, r)
src/realm/sort_descriptor.cpp
Outdated
float lastdist = topResults.empty() ? std::numeric_limits<float>::max() : topResults.top().first; | ||
for (size_t i = m_k; i < key_values.size(); i++) { | ||
ObjKey r = key_values.get(i); | ||
if (!table.is_valid(r)) continue; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Inefficient to check for validity first and then get object. Better to use try_get_object
in dist_knn
and return std::optional<float>
src/realm/sort_descriptor.cpp
Outdated
float dist = table.dist_knn(m_query_data, m_column, r, m_sp); | ||
topResults.push(std::pair<float, ObjKey>(dist, r)); | ||
} | ||
float lastdist = topResults.empty() ? std::numeric_limits<float>::max() : topResults.top().first; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How can topResults be empty? I guess only if m_k or key_values size is zero in which case we should just return empty result.
src/realm/sort_descriptor.cpp
Outdated
{ | ||
if (m_k >= key_values.size()) return; // all entries already match as closest | ||
|
||
std::priority_queue<std::pair<float, ObjKey>> topResults; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We use snake case for variables.
src/realm/sort_descriptor.cpp
Outdated
|
||
// set result to the matches, in order of closest match first | ||
key_values.clear(); | ||
while(!topResults.empty()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More efficient to traverse the underlying container in reverse and just add elements. Create a class like this:
class PrioQueue : public std::priority_queue<std::pair<float, ObjKey>> {
public:
auto begin() {
return c.rbegin();
}
auto end() {
return c.rend();
}
};
Then just
for (auto tr : topResults) {
key_values.add(tr.second);
}
ColKey m_column; | ||
|
||
// We are going to default to measure distance by Inner Product for now | ||
mutable hnswlib::InnerProductSpace m_sp; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More descriptive variable names, please.
src/realm/table_view.cpp
Outdated
} | ||
for (size_t t = 0; t < detached_ref_count; ++t) | ||
key_values.add(null_key); | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... and use braces with the for loop.
Many formatting problems. You can compare with branch |
@jedelbo could you create pr into this branch with your changes to not spread possible fixes and reviews? |
Any update on this merge? |
Just to share why I and others might be interested in this functionality, AI prompting and AI chat significantly benefit in quality through RAG search using vector databases. Today, there are many ways to perform AI completions in macOS and iOS apps, but the options for local vector databases to use for RAG are very limited. If added, Realm Swift would be the best possible option for implementing local RAG search on macOS and iOS. I've tried other options and they are mainly just saving and loading JSON files fully in memory and cosine similarity search on the in-application-memory objects. Realm would be much more efficient (as I understand it) and the DevEx is vastly improved over the other options (thank you!). With Apple's investments in MLX, fully local AI projects are much closer to reality and apps can start practically applying AI in many use cases. Given this PR, it seems like we're very close to enabling good, efficient local RAG for the full ecosystem of Apple apps. I appreciate your time and attention! |
I create discussion in forum to support argumments to develope this feature. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there an expectation from our users that the results of this vector search will be somewhat similar to the vector search results they might get from querying the atlas database? If so, what provides that stability of response on query between this library and our lucene implementation?
ee7ddc2
to
b39c506
Compare
This is an initial implementation of vector (k nearest neighbour/knn) search. - Works on lists of floats. - The lists all have to all be of equal length. -Can be composed with regular queries to filter the results further. - This is a linear (bruteforce) search. No index is used. - Defaults to InnerProduct for distance calculation, but is pluggable for other algorithms.
b39c506
to
5d5bb6c
Compare
Pull Request Test Coverage Report for Build alexander.stigsen_3Details
💛 - Coveralls |
Hi! Thank you all for the effort. I need this thing working for local private RAG in one of my React-Native apps. Is there anything I can do to push things forward? |
@jedelbo this feature to realm is abandoned ? |
The product description for this feature has not been approved yet, so we don't know if this work here is still relevant. |
@jedelbo thanks for answering 👍 My arguments to include vector search in local realms. We need observe industry moving inference priority to edge ( see Intel, Qualcomm, AMD, Meta ( Llama mobile ) MS ( Onnx ) Google ( Gemini Nano ) and many others. . transfer to edge. Competition is taking first place https://info.couchbase.com/couchbase-lite-vector-search-beta-program 20 Many user cases can benefit from this. I know that this feature need a hard work and investment. But for clients like me that use Realm Sync and local Realm in app it’s great importance. |
This is an initial implementation of vector (k nearest neighbour/knn) search.
.knn_search(column, vector, k)
onResults
, so should be easy to use from SDKs.