-
-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add SearchByMetadata method to SqlLiteVectorCollection #66
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -57,6 +57,16 @@ | |
VectorSearchSettings? settings = default, | ||
CancellationToken cancellationToken = default); | ||
|
||
/// <summary> | ||
/// Searches for records based on a metadata filter from a specific collection. | ||
/// </summary> | ||
/// <param name="filters">The filters to apply to the search request.</param> | ||
/// <param name="cancellationToken">The cancellation token.</param> | ||
/// <returns>A task representing the asynchronous operation. The task result contains the search response.</returns> | ||
Task<List<Vector>> SearchByMetadata( | ||
Check warning on line 66 in src/Abstractions/src/IVectorCollection.cs
|
||
Dictionary<string, object> filters, | ||
CancellationToken cancellationToken = default); | ||
|
||
/// <summary> | ||
/// Checks if the collection is empty. | ||
/// </summary> | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,3 +1,4 @@ | ||
|
||
namespace LangChain.Databases.Postgres; | ||
|
||
/// <summary> | ||
|
@@ -123,4 +124,9 @@ public Task<bool> IsEmptyAsync(CancellationToken cancellationToken = default) | |
{ | ||
throw new NotImplementedException(); | ||
} | ||
|
||
Task<List<Vector>> IVectorCollection.SearchByMetadata(Dictionary<string, object> filters, CancellationToken cancellationToken) | ||
{ | ||
throw new NotImplementedException(); | ||
} | ||
Comment on lines
+128
to
+131
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 💡 Codebase verification Inconsistent Implementation of The Consider standardizing the implementation of 🔗 Analysis chainConsider the impact of unimplemented While adding the Consider the following suggestions to mitigate potential issues:
To ensure consistency across the codebase, let's check if other vector collections have implemented this method: This will help us understand if this is a common pattern across different vector collections or if PostgresVectorCollection is an exception. 🏁 Scripts executedThe following scripts were executed for the analysis: Script: #!/bin/bash
# Search for SearchByMetadata implementations in other vector collections
rg -A 5 "SearchByMetadata\(" --glob "src/**/*VectorCollection.cs"
Length of output: 3572 There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm sure I can do some of these like in memory and postgresql, but don't have access or knowledge of any of the other dbs. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @robalexclark, understood! Implementing |
||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -75,4 +75,9 @@ public async Task<VectorSearchResponse> SearchAsync(VectorSearchRequest request, | |
.ToListAsync(cancellationToken).ConfigureAwait(false); | ||
return new VectorSearchResponse { Items = results.Select(x => new Vector { Text = x.Item1.Metadata.ExternalSourceName }).ToList() }; | ||
} | ||
|
||
Task<List<Vector>> IVectorCollection.SearchByMetadata(Dictionary<string, object> filters, CancellationToken cancellationToken) | ||
{ | ||
throw new NotImplementedException(); | ||
} | ||
Comment on lines
+79
to
+82
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 🛠️ Refactor suggestion Consider enhancing the The explicit interface implementation of
Here's a suggested implementation: Task<List<Vector>> IVectorCollection.SearchByMetadata(Dictionary<string, object> filters, CancellationToken cancellationToken)
{
// TODO: Implement metadata search functionality
// Currently, metadata search is not supported for SemanticKernelMemoryStore
// Consider implementing this in the future if SemanticKernel adds support for metadata filtering
Console.WriteLine("Warning: SearchByMetadata is not implemented for SemanticKernelMemoryStoreCollection");
return Task.FromResult(new List<Vector>());
} This approach provides more context, logs a warning, and returns an empty list, allowing the calling code to continue execution. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Throwing a notimplementedexception is clear that the method is not currently supported, and any future developers can implement the method if they wish to There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @robalexclark, understood. Throwing a (_/) ✏️ Learnings added
|
||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,6 +1,7 @@ | ||
using System.Globalization; | ||
using Microsoft.Data.Sqlite; | ||
using System.Globalization; | ||
using System.Text.Json; | ||
using Microsoft.Data.Sqlite; | ||
using System.Text.RegularExpressions; | ||
|
||
namespace LangChain.Databases.Sqlite; | ||
|
||
|
@@ -37,49 +38,56 @@ private static string SerializeVector(float[] vector) | |
|
||
private async Task InsertDocument(string id, float[] vector, Vector document) | ||
{ | ||
var insertCommand = _connection.CreateCommand(); | ||
string query = $"INSERT INTO {Name} (id, vector, document) VALUES (@id, @vector, @document)"; | ||
insertCommand.CommandText = query; | ||
insertCommand.Parameters.AddWithValue("@id", id); | ||
insertCommand.Parameters.AddWithValue("@vector", SerializeVector(vector)); | ||
insertCommand.Parameters.AddWithValue("@document", SerializeDocument(document)); | ||
await insertCommand.ExecuteNonQueryAsync().ConfigureAwait(false); | ||
|
||
using (var insertCommand = _connection.CreateCommand()) | ||
{ | ||
string query = $"INSERT INTO {Name} (id, vector, document) VALUES (@id, @vector, @document)"; | ||
insertCommand.CommandText = query; | ||
insertCommand.Parameters.AddWithValue("@id", id); | ||
insertCommand.Parameters.AddWithValue("@vector", SerializeVector(vector)); | ||
insertCommand.Parameters.AddWithValue("@document", SerializeDocument(document)); | ||
await insertCommand.ExecuteNonQueryAsync().ConfigureAwait(false); | ||
} | ||
} | ||
|
||
private async Task DeleteDocument(string id) | ||
{ | ||
var deleteCommand = _connection.CreateCommand(); | ||
string query = $"DELETE FROM {Name} WHERE id=@id"; | ||
deleteCommand.CommandText = query; | ||
deleteCommand.Parameters.AddWithValue("@id", id); | ||
await deleteCommand.ExecuteNonQueryAsync().ConfigureAwait(false); | ||
using (var deleteCommand = _connection.CreateCommand()) | ||
{ | ||
string query = $"DELETE FROM {Name} WHERE id=@id"; | ||
deleteCommand.CommandText = query; | ||
deleteCommand.Parameters.AddWithValue("@id", id); | ||
await deleteCommand.ExecuteNonQueryAsync().ConfigureAwait(false); | ||
} | ||
} | ||
|
||
private async Task<List<(Vector, float)>> SearchByVector(float[] vector, int k) | ||
{ | ||
var searchCommand = _connection.CreateCommand(); | ||
string query = $"SELECT id, vector, document, distance(vector, @vector) d FROM {Name} ORDER BY d LIMIT @k"; | ||
searchCommand.CommandText = query; | ||
searchCommand.Parameters.AddWithValue("@vector", SerializeVector(vector)); | ||
searchCommand.Parameters.AddWithValue("@k", k); | ||
var res = new List<(Vector, float)>(); | ||
var reader = await searchCommand.ExecuteReaderAsync().ConfigureAwait(false); | ||
while (await reader.ReadAsync().ConfigureAwait(false)) | ||
using (var searchCommand = _connection.CreateCommand()) | ||
{ | ||
var id = reader.GetString(0); | ||
var vec = await reader.GetFieldValueAsync<string>(1).ConfigureAwait(false); | ||
var doc = await reader.GetFieldValueAsync<string>(2).ConfigureAwait(false); | ||
var docDeserialized = JsonSerializer.Deserialize(doc, SourceGenerationContext.Default.Vector) ?? new Vector | ||
string query = $"SELECT id, vector, document, distance(vector, @vector) d FROM {Name} ORDER BY d LIMIT @k"; | ||
searchCommand.CommandText = query; | ||
searchCommand.Parameters.AddWithValue("@vector", SerializeVector(vector)); | ||
searchCommand.Parameters.AddWithValue("@k", k); | ||
var res = new List<(Vector, float)>(); | ||
|
||
using (var reader = await searchCommand.ExecuteReaderAsync().ConfigureAwait(false)) | ||
{ | ||
Text = string.Empty, | ||
}; | ||
var distance = reader.GetFloat(3); | ||
res.Add((docDeserialized, distance)); | ||
while (await reader.ReadAsync().ConfigureAwait(false)) | ||
{ | ||
var id = reader.GetString(0); | ||
var vec = await reader.GetFieldValueAsync<string>(1).ConfigureAwait(false); | ||
var doc = await reader.GetFieldValueAsync<string>(2).ConfigureAwait(false); | ||
var docDeserialized = JsonSerializer.Deserialize(doc, SourceGenerationContext.Default.Vector) ?? new Vector | ||
{ | ||
Text = string.Empty, | ||
}; | ||
var distance = reader.GetFloat(3); | ||
res.Add((docDeserialized, distance)); | ||
} | ||
|
||
return res; | ||
} | ||
} | ||
|
||
return res; | ||
} | ||
|
||
/// <inheritdoc /> | ||
|
@@ -109,41 +117,48 @@ public async Task<IReadOnlyCollection<string>> AddAsync( | |
/// <inheritdoc /> | ||
public async Task<Vector?> GetAsync(string id, CancellationToken cancellationToken = default) | ||
{ | ||
var command = _connection.CreateCommand(); | ||
var query = $"SELECT vector, document FROM {Name} WHERE id=@id"; | ||
command.CommandText = query; | ||
command.Parameters.AddWithValue("@id", id); | ||
var reader = await command.ExecuteReaderAsync(cancellationToken).ConfigureAwait(false); | ||
if (!await reader.ReadAsync(cancellationToken).ConfigureAwait(false)) | ||
using (var command = _connection.CreateCommand()) | ||
{ | ||
return null; | ||
} | ||
var query = $"SELECT vector, document FROM {Name} WHERE id=@id"; | ||
command.CommandText = query; | ||
command.Parameters.AddWithValue("@id", id); | ||
|
||
var vec = await reader.GetFieldValueAsync<string>(0, cancellationToken).ConfigureAwait(false); | ||
var doc = await reader.GetFieldValueAsync<string>(1, cancellationToken).ConfigureAwait(false); | ||
var docDeserialized = JsonSerializer.Deserialize(doc, SourceGenerationContext.Default.Vector) ?? new Vector | ||
{ | ||
Text = string.Empty, | ||
}; | ||
using (var reader = await command.ExecuteReaderAsync(cancellationToken).ConfigureAwait(false)) | ||
{ | ||
if (!await reader.ReadAsync(cancellationToken).ConfigureAwait(false)) | ||
{ | ||
return null; | ||
} | ||
|
||
return new Vector | ||
{ | ||
Id = id, | ||
Text = docDeserialized.Text, | ||
Metadata = docDeserialized.Metadata, | ||
Embedding = JsonSerializer.Deserialize(vec, SourceGenerationContext.Default.SingleArray), | ||
}; | ||
var vec = await reader.GetFieldValueAsync<string>(0, cancellationToken).ConfigureAwait(false); | ||
var doc = await reader.GetFieldValueAsync<string>(1, cancellationToken).ConfigureAwait(false); | ||
var docDeserialized = JsonSerializer.Deserialize(doc, SourceGenerationContext.Default.Vector) ?? new Vector | ||
{ | ||
Text = string.Empty, | ||
}; | ||
|
||
return new Vector | ||
{ | ||
Id = id, | ||
Text = docDeserialized.Text, | ||
Metadata = docDeserialized.Metadata, | ||
Embedding = JsonSerializer.Deserialize(vec, SourceGenerationContext.Default.SingleArray), | ||
}; | ||
} | ||
} | ||
} | ||
|
||
/// <inheritdoc /> | ||
public async Task<bool> IsEmptyAsync(CancellationToken cancellationToken = default) | ||
{ | ||
var command = _connection.CreateCommand(); | ||
var query = $"SELECT COUNT(*) FROM {Name}"; | ||
command.CommandText = query; | ||
var count = await command.ExecuteScalarAsync(cancellationToken).ConfigureAwait(false); | ||
using (var command = _connection.CreateCommand()) | ||
{ | ||
var query = $"SELECT COUNT(*) FROM {Name}"; | ||
command.CommandText = query; | ||
var count = await command.ExecuteScalarAsync(cancellationToken).ConfigureAwait(false); | ||
|
||
return count == null || Convert.ToInt32(count, CultureInfo.InvariantCulture) == 0; | ||
return count == null || Convert.ToInt32(count, CultureInfo.InvariantCulture) == 0; | ||
} | ||
} | ||
|
||
/// <inheritdoc /> | ||
|
@@ -180,4 +195,67 @@ public async Task<VectorSearchResponse> SearchAsync( | |
}).ToArray(), | ||
}; | ||
} | ||
|
||
/// <inheritdoc /> | ||
public async Task<List<Vector>> SearchByMetadata( | ||
Dictionary<string, object> filters, | ||
CancellationToken cancellationToken = default) | ||
{ | ||
filters = filters ?? throw new ArgumentNullException(nameof(filters)); | ||
|
||
using (var command = _connection.CreateCommand()) | ||
{ | ||
var query = $"SELECT id, vector, document FROM {Name}"; | ||
|
||
var whereClauses = new List<string>(); | ||
int paramIndex = 0; | ||
|
||
foreach (var filter in filters) | ||
{ | ||
// Validate filter.Key to prevent injection in JSON path | ||
if (string.IsNullOrWhiteSpace(filter.Key) || !IsValidJsonKey(filter.Key)) | ||
throw new ArgumentException($"Invalid filter key: {filter.Key}", nameof(filters)); | ||
|
||
var paramName = "@param" + paramIndex++; | ||
whereClauses.Add($"json_extract(document, '$.Metadata.{filter.Key}') = {paramName}"); | ||
command.Parameters.AddWithValue(paramName, filter.Value); | ||
} | ||
query += " WHERE " + string.Join(" AND ", whereClauses); | ||
|
||
command.CommandText = query; | ||
var res = new List<Vector>(); | ||
|
||
using (var reader = await command.ExecuteReaderAsync(cancellationToken).ConfigureAwait(false)) | ||
{ | ||
while (await reader.ReadAsync(cancellationToken).ConfigureAwait(false)) | ||
{ | ||
var id = await reader.GetFieldValueAsync<string>(0, cancellationToken).ConfigureAwait(false); | ||
var vec = await reader.GetFieldValueAsync<string>(1, cancellationToken).ConfigureAwait(false); | ||
var doc = await reader.GetFieldValueAsync<string>(2, cancellationToken).ConfigureAwait(false); | ||
var docDeserialized = JsonSerializer.Deserialize(doc, SourceGenerationContext.Default.Vector) ?? new Vector | ||
{ | ||
Text = string.Empty, | ||
}; | ||
|
||
var vector = new Vector | ||
{ | ||
Id = id, | ||
Text = docDeserialized.Text, | ||
Metadata = docDeserialized.Metadata, | ||
Embedding = JsonSerializer.Deserialize(vec, SourceGenerationContext.Default.SingleArray), | ||
}; | ||
|
||
res.Add(vector); | ||
} | ||
|
||
return res; | ||
} | ||
} | ||
} | ||
|
||
Comment on lines
+200
to
+254
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 🛠️ Refactor suggestion Ensure comprehensive validation for The method Consider the following options:
Ensure that any changes maintain security while providing flexibility for valid metadata keys.
|
||
private static bool IsValidJsonKey(string input) | ||
{ | ||
// Only allow letters, numbers, and underscores | ||
return Regex.IsMatch(input, @"^\w+$"); | ||
} | ||
} | ||
Comment on lines
+256
to
+260
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 🛠️ Refactor suggestion Reconsider the regular expression in The current regular expression Depending on your use case, update the regex to include other permissible characters: -return Regex.IsMatch(input, @"^\w+$");
+return Regex.IsMatch(input, @"^[\w\- ]+$"); // Allows underscores, hyphens, and spaces Alternatively, if your metadata keys may contain any characters, you might need a different approach to safely include them in the JSON path without risking SQL injection.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Codebase verification
Missing TODO comments in
SearchByMetadata
methodsThe
SearchByMetadata
method inIVectorCollection
implementations across multiple vector collections lacks TODO comments. Adding these comments is essential to guide future development and ensure consistent implementation.Affected files:
src/Chroma/src/ChromaVectorCollection.cs
src/SemanticKernel/src/SemanticKernelMemoryStoreCollection.cs
src/Postgres/src/PostgresVectorCollection.cs
src/OpenSearch/src/OpenSearchVectorCollection.cs
src/Mongo/src/MongoVectorCollection.cs
src/InMemory/src/InMemoryVectorCollection.cs
src/Elasticsearch/src/ElasticsearchVectorCollection.cs
Please add TODO comments to each
SearchByMetadata
method to indicate pending implementations.🔗 Analysis chain
LGTM! Consider adding a TODO comment for future implementation.
The addition of the
SearchByMetadata
method aligns with the PR objectives and follows good practices for asynchronous methods. TheNotImplementedException
is expected at this stage.Consider adding a TODO comment to remind about future implementation:
Task<List<Vector>> IVectorCollection.SearchByMetadata(Dictionary<string, object> filters, CancellationToken cancellationToken) { + // TODO: Implement SearchByMetadata for OpenSearch throw new NotImplementedException(); }
Let's verify if this change is consistent across other vector collection implementations:
[/run_scripts]
[/run_scripts]
[/run_scripts]
[/run_scripts]
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
Length of output: 2386
Script:
Length of output: 164
Script:
Length of output: 179
Script:
Length of output: 675
Script:
Length of output: 834
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need todo comments here - the NotImplementedException makes it clear that the work is TODO!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@robalexclark, understood. I will keep that in mind for future reviews.
(>._.)?
✏️ Learnings added