Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change to readonly collection and cleanup #75

Merged
merged 1 commit into from
Oct 25, 2024

Conversation

robalexclark
Copy link
Contributor

@robalexclark robalexclark commented Oct 25, 2024

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced new asynchronous methods in the MongoVectorCollection class for adding, deleting, retrieving, and searching vectors in MongoDB.
  • Improvements

    • Updated the return type of the SearchByMetadata method across multiple collections to IReadOnlyList<Vector>, enhancing immutability and type safety.
    • Added JSON key validation functionality with the new IsValidJsonKey method in the VectorCollection class.
  • Bug Fixes

    • Enhanced error handling in the SearchByMetadata method by validating filter keys in PostgresVectorCollection.

Copy link

coderabbitai bot commented Oct 25, 2024

Walkthrough

The pull request introduces modifications primarily to the SearchByMetadata method across various classes implementing the IVectorCollection interface. The return type of this method has been changed from Task<List<Vector>> to Task<IReadOnlyList<Vector>>, enhancing its immutability. Additionally, new methods have been added to some classes, such as IsValidJsonKey in VectorCollection, and several asynchronous methods in MongoVectorCollection. Overall, the changes focus on refining method signatures and improving type safety without altering existing functionalities significantly.

Changes

File Path Change Summary
src/Abstractions/src/IVectorCollection.cs Updated method signature: SearchByMetadata from Task<List<Vector>> to Task<IReadOnlyList<Vector>>.
src/Abstractions/src/PublicAPI.Unshipped.txt Updated SearchByMetadata return type to IReadOnlyList<Vector>; added IsValidJsonKey method.
src/Abstractions/src/VectorCollection.cs Added protected static bool IsValidJsonKey(string input) method for JSON key validation.
src/Chroma/src/ChromaVectorCollection.cs Updated SearchByMetadata return type to IReadOnlyList<Vector>.
src/Elasticsearch/src/ElasticsearchVectorCollection.cs Updated SearchByMetadata return type to IReadOnlyList<Vector>; added async methods with NotImplementedException.
src/InMemory/src/InMemoryVectorCollection.cs Updated SearchByMetadata return type to IReadOnlyList<Vector>.
src/Mongo/src/MongoVectorCollection.cs Added several async methods: AddAsync, DeleteAsync, GetAsync, IsEmptyAsync, SearchAsync; updated SearchByMetadata return type.
src/OpenSearch/src/OpenSearchVectorCollection.cs Updated SearchByMetadata return type to IReadOnlyList<Vector>.
src/Postgres/src/PostgresVectorCollection.cs Updated SearchByMetadata return type to IReadOnlyList<Vector>; added validation for filter keys.
src/SemanticKernel/src/SemanticKernelMemoryStoreCollection.cs Updated SearchByMetadata return type to IReadOnlyList<Vector>.
src/Sqlite/src/SqLiteVectorCollection.cs Updated SearchByMetadata return type to IReadOnlyList<Vector>; removed IsValidJsonKey method.

Possibly related PRs

  • Add SearchByMetadata method to SqlLiteVectorCollection #66: The changes in this PR involve the addition of the SearchByMetadata method to the IVectorCollection interface, which is directly related to the changes made in the main PR regarding the return type of the SearchByMetadata method.
  • Add Index to SqlLite and fix using statements #69: This PR modifies the SearchByMetadata method in the ChromaVectorCollection class, changing its access modifier and updating the exception message, which is relevant to the overall context of the SearchByMetadata method's implementation.
  • Added metadata search for Postgres #73: This PR updates the SearchByMetadata method in the PostgresVectorCollection class, changing its access modifier and providing a functional implementation, which aligns with the changes made in the main PR regarding the method's return type.

🐇 In the code we hop and play,
With vectors in a new array.
Read-only lists, we cheer with glee,
Valid keys for JSON, oh so free!
Asynchronous methods, a joyful tune,
In our rabbit hole, we’ll code till noon! 🌼


Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@robalexclark robalexclark merged commit e153878 into main Oct 25, 2024
3 checks passed
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

🧹 Outside diff range and nitpick comments (6)
src/Abstractions/src/IVectorCollection.cs (1)

66-68: Add XML documentation for consistency.

All other methods in this interface have XML documentation. Please add documentation for SearchByMetadata following the same pattern.

+    /// <summary>
+    /// Searches for records based on metadata filters.
+    /// </summary>
+    /// <param name="filters">The metadata filters to apply to the search.</param>
+    /// <param name="cancellationToken">The cancellation token.</param>
+    /// <returns>A task representing the asynchronous operation. The task result contains the matching vectors.</returns>
     Task<IReadOnlyList<Vector>> SearchByMetadata(
         Dictionary<string, object> filters,
         CancellationToken cancellationToken = default);
src/SemanticKernel/src/SemanticKernelMemoryStoreCollection.cs (1)

Line range hint 17-19: Consider using JSON serialization for metadata.

The current metadata serialization using string concatenation with "#" and "&" delimiters is fragile and could break if metadata values contain these characters. Additionally, this approach requires custom parsing logic in GetAsync.

Consider using JSON serialization instead:

-            //TODO: review way to map metadata
-            if (item.Metadata != null)
-                metadata = string.Join("#", item.Metadata.Select(kv => kv.Key + "&" + kv.Value));
+            if (item.Metadata != null)
+                metadata = System.Text.Json.JsonSerializer.Serialize(item.Metadata);

And in GetAsync:

-        if (record?.Metadata?.AdditionalMetadata != null)
-            metadata = record.Metadata.AdditionalMetadata
-                .Split('#')
-                .Select(part => part.Split('&'))
-                .ToDictionary(split => split[0], split => (object)split[1]);
+        if (record?.Metadata?.AdditionalMetadata != null)
+            metadata = System.Text.Json.JsonSerializer.Deserialize<Dictionary<string, object>>(
+                record.Metadata.AdditionalMetadata);
src/Elasticsearch/src/ElasticsearchVectorCollection.cs (1)

Line range hint 107-111: Add parameter validation for filters.

While the return type change to IReadOnlyList<Vector> is good, consider adding parameter validation consistent with other methods in the class.

Apply this diff to add parameter validation:

     Task<IReadOnlyList<Vector>> IVectorCollection.SearchByMetadata(Dictionary<string, object> filters, CancellationToken cancellationToken)
     {
+        filters = filters ?? throw new ArgumentNullException(nameof(filters));
         throw new NotImplementedException();
     }
src/InMemory/src/InMemoryVectorCollection.cs (1)

Line range hint 95-119: Optimize performance and simplify implementation.

The current implementation has some inefficiencies and can be simplified:

  1. Task.Run is unnecessary here as the operation is already running on the thread pool and doesn't block any UI thread.
  2. The null check for vector.Metadata can be simplified using the null-conditional operator.

Consider this more efficient implementation:

-    public async Task<IReadOnlyList<Vector>> SearchByMetadata(Dictionary<string, object> filters, CancellationToken cancellationToken = default)
+    public Task<IReadOnlyList<Vector>> SearchByMetadata(Dictionary<string, object> filters, CancellationToken cancellationToken = default)
     {
         filters = filters ?? throw new ArgumentNullException(nameof(filters));
-        var filteredVectors = await Task.Run(() => _vectors.Values.Where(vector =>
+        var filteredVectors = _vectors.Values.Where(vector =>
         {
             // Check if all filters match
             foreach (var filter in filters)
             {
-                object? metadataValue = null;
-                if (vector.Metadata != null && !vector.Metadata.TryGetValue(filter.Key, out metadataValue) || metadataValue == null)
+                if (!vector.Metadata?.TryGetValue(filter.Key, out var metadataValue) == true || metadataValue == null)
                 {
                     return false;
                 }
-                else if (!metadataValue.Equals(filter.Value)) // Convert metadata value to string and compare
+                if (!metadataValue.Equals(filter.Value))
                 {
                     return false;
                 }
             }
             return true;
-        }).ToList(), cancellationToken).ConfigureAwait(false);
+        }).ToList();
-        return filteredVectors;
+        return Task.FromResult<IReadOnlyList<Vector>>(filteredVectors);
     }

This implementation:

  1. Removes unnecessary async/await and Task.Run
  2. Simplifies the null checks
  3. Maintains the immutability contract with IReadOnlyList
src/Sqlite/src/SqLiteVectorCollection.cs (1)

Line range hint 192-241: Consider optimizing the SQL query construction.

While the current implementation is functionally correct, there are a few potential improvements:

  1. Consider using a StringBuilder for query construction to improve performance when dealing with multiple filters
  2. The WHERE clause could be constructed more efficiently by pre-allocating the parameter names

Here's a suggested optimization:

-        var query = $"SELECT id, vector, document FROM {Name}";
-
-        var whereClauses = new List<string>();
-        int paramIndex = 0;
-
-        foreach (var filter in filters)
-        {
-            // Validate filter.Key to prevent injection in JSON path
-            if (string.IsNullOrWhiteSpace(filter.Key) || !IsValidJsonKey(filter.Key))
-                throw new ArgumentException($"Invalid filter key: {filter.Key}", nameof(filters));
-
-            var paramName = "@param" + paramIndex++;
-            whereClauses.Add($"json_extract(document, '$.Metadata.{filter.Key}') = {paramName}");
-            command.Parameters.AddWithValue(paramName, filter.Value);
-        }
-        query += " WHERE " + string.Join(" AND ", whereClauses);
+        var queryBuilder = new StringBuilder($"SELECT id, vector, document FROM {Name} WHERE ");
+        var whereClauses = filters.Select((kvp, i) =>
+        {
+            if (string.IsNullOrWhiteSpace(kvp.Key) || !IsValidJsonKey(kvp.Key))
+                throw new ArgumentException($"Invalid filter key: {kvp.Key}", nameof(filters));
+
+            var paramName = $"@param{i}";
+            command.Parameters.AddWithValue(paramName, kvp.Value);
+            return $"json_extract(document, '$.Metadata.{kvp.Key}') = {paramName}";
+        });
+        
+        queryBuilder.Append(string.Join(" AND ", whereClauses));
+        command.CommandText = queryBuilder.ToString();

This optimization:

  • Uses StringBuilder for better performance with multiple filters
  • Reduces allocations by using string interpolation only when necessary
  • Makes the code more concise and easier to follow
src/Mongo/src/MongoVectorCollection.cs (1)

Line range hint 45-77: Validate request.Embeddings to prevent potential exceptions

Currently, the method accesses request.Embeddings.First() without checking if Embeddings is null or empty. This could lead to an InvalidOperationException. Add a check to ensure Embeddings is not null and contains at least one element.

Apply this change:

 request = request ?? throw new ArgumentNullException(nameof(request));
+ if (request.Embeddings == null || !request.Embeddings.Any())
+     throw new ArgumentException("Embeddings must contain at least one embedding.", nameof(request));

 settings ??= new VectorSearchSettings();
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between e9116ac and 0158148.

📒 Files selected for processing (11)
  • src/Abstractions/src/IVectorCollection.cs (1 hunks)
  • src/Abstractions/src/PublicAPI.Unshipped.txt (2 hunks)
  • src/Abstractions/src/VectorCollection.cs (2 hunks)
  • src/Chroma/src/ChromaVectorCollection.cs (1 hunks)
  • src/Elasticsearch/src/ElasticsearchVectorCollection.cs (1 hunks)
  • src/InMemory/src/InMemoryVectorCollection.cs (1 hunks)
  • src/Mongo/src/MongoVectorCollection.cs (2 hunks)
  • src/OpenSearch/src/OpenSearchVectorCollection.cs (1 hunks)
  • src/Postgres/src/PostgresVectorCollection.cs (1 hunks)
  • src/SemanticKernel/src/SemanticKernelMemoryStoreCollection.cs (1 hunks)
  • src/Sqlite/src/SqLiteVectorCollection.cs (1 hunks)
🧰 Additional context used
🪛 LanguageTool
src/Abstractions/src/PublicAPI.Unshipped.txt

[duplication] ~17-~17: Possible typo: you repeated a word
Context: ...g!, object!>! filters, System.Threading.CancellationToken cancellationToken = default(System.Threading.Cancellation...

(ENGLISH_WORD_REPEAT_RULE)

🔇 Additional comments (13)
src/Abstractions/src/VectorCollection.cs (1)

1-1: LGTM!

The using directive is appropriately placed and necessary for the Regex functionality.

src/Abstractions/src/IVectorCollection.cs (1)

66-68: LGTM! Good improvement to immutability.

The change from Task<List<Vector>> to Task<IReadOnlyList<Vector>> is a positive improvement that:

  • Enforces immutability of the returned collection
  • Aligns with the interface's existing pattern of returning read-only collections
  • Follows best practices for API design
src/SemanticKernel/src/SemanticKernelMemoryStoreCollection.cs (1)

79-82: ⚠️ Potential issue

Implementation required for SearchByMetadata method.

The method is currently throwing NotImplementedException while other vector collections have implemented this functionality. This could lead to runtime errors if the method is called.

Let's examine how other collections have implemented this method:

Would you like me to propose an implementation based on other collections' approaches?

src/Elasticsearch/src/ElasticsearchVectorCollection.cs (1)

Line range hint 13-106: Verify implementation timeline for async methods.

All async methods contain commented-out implementation code but throw NotImplementedException. While this is acceptable during development, it could cause runtime issues if deployed to production.

Would you like me to create a GitHub issue to track the implementation of these methods? This would help ensure these don't get overlooked and provide a clear timeline for completion.

✅ Verification successful

Implementation status is expected - no immediate action required

The unimplemented methods are part of a new Elasticsearch implementation that's currently under development. This is evidenced by:

  • The git history shows this is a very recent addition with only 3 commits
  • The implementation is part of a larger interface refactoring effort (changing to readonly collections)
  • Other vector collection implementations (MongoDB, SQLite, etc.) have working implementations
  • The commented code shows the planned implementation approach

The NotImplementedException throws are appropriate in this development stage as:

  1. The class is not yet being used in production (no direct usage found outside its own module)
  2. The implementation pattern follows a clear, staged development approach
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check if there are any direct usages of these unimplemented methods
# that might cause runtime exceptions

echo "Searching for direct method calls..."
rg -l "GetAsync|DeleteAsync|AddAsync|SearchAsync|IsEmptyAsync" --type cs \
  --glob '!**/ElasticsearchVectorCollection.cs'

Length of output: 792


Script:

#!/bin/bash
# Let's check if this is part of a larger refactoring effort by examining the interface
echo "Checking IVectorCollection interface..."
cat src/Abstractions/src/IVectorCollection.cs

# Let's also check if there are any tests that might be affected
echo -e "\nChecking for related test files..."
rg -l "ElasticsearchVectorCollection" --type cs

# Let's see if this is a new implementation
echo -e "\nChecking git history for this file..."
git log --oneline -- src/Elasticsearch/src/ElasticsearchVectorCollection.cs

Length of output: 3918

src/InMemory/src/InMemoryVectorCollection.cs (1)

Line range hint 95-119: Consider thread-safety improvements for concurrent operations.

While ConcurrentDictionary provides thread-safe operations, the LINQ query might observe partial updates if the collection is modified during enumeration. For stronger consistency guarantees, consider taking a snapshot of the values at the start of the operation.

Consider this approach for better thread safety:

     public Task<IReadOnlyList<Vector>> SearchByMetadata(Dictionary<string, object> filters, CancellationToken cancellationToken = default)
     {
         filters = filters ?? throw new ArgumentNullException(nameof(filters));
+        // Take a snapshot of values for consistent enumeration
+        var snapshot = _vectors.Values.ToArray();
-        var filteredVectors = _vectors.Values.Where(vector =>
+        var filteredVectors = snapshot.Where(vector =>
src/Postgres/src/PostgresVectorCollection.cs (3)

148-153: LGTM! Good optimization with withEmbeddings parameter.

The explicit withEmbeddings: false parameter optimizes the query by not fetching unused embedding data.


134-145: 🛠️ Refactor suggestion

Verify the implementation of IsValidJsonKey method.

The validation logic is good, but we need to ensure that IsValidJsonKey correctly validates JSON property names according to the JSON specification (RFC 8259).

#!/bin/bash
# Search for IsValidJsonKey implementation
ast-grep --pattern 'IsValidJsonKey($_) {
  $$$
}'

Consider adding XML documentation for the allowed characters to help API consumers understand the validation rules:

+ /// <summary>
+ /// Validates if the given key is a valid JSON property name.
+ /// </summary>
+ /// <param name="key">The key to validate.</param>
+ /// <returns>True if the key is valid; otherwise, false.</returns>
+ /// <remarks>
+ /// Valid JSON property names must not contain control characters or whitespace.
+ /// </remarks>
 private bool IsValidJsonKey(string key)

127-128: Verify consumers of this API for breaking changes.

The change from Task<List<Vector>> to Task<IReadOnlyList<Vector>> improves immutability but could break existing code that modifies the returned list.

✅ Verification successful

The previous command had syntax issues. Let's search for method usages and interface implementations to understand the impact better. The -A 2 flag will show 2 lines after the match to see how the result is used.


Return type change is safe to proceed

The change from Task<List<Vector>> to Task<IReadOnlyList<Vector>> is part of a coordinated interface change across the codebase. Evidence shows:

  • The IVectorCollection interface in Abstractions defines SearchByMetadata with IReadOnlyList<Vector> return type
  • All implementations (Postgres, SQLite, MongoDB, InMemory, etc.) consistently use IReadOnlyList<Vector>
  • Integration tests show no direct list modifications, only count verification
  • The change is documented in PublicAPI.Unshipped.txt indicating proper API tracking
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for direct usage of SearchByMetadata where the result is modified
rg -l "SearchByMetadata.*\.(Add|Remove|Clear|Sort)"

Length of output: 224


Script:

#!/bin/bash
# Search for usages of SearchByMetadata method
rg -A 2 "SearchByMetadata\(" 

# Also search for interface implementation references
rg -l "IVectorCollection|PostgresVectorCollection"

Length of output: 5220

src/Sqlite/src/SqLiteVectorCollection.cs (2)

192-193: LGTM! Good change to IReadOnlyList.

The change from Task<List<Vector>> to Task<IReadOnlyList<Vector>> is a good improvement that:

  • Enforces immutability of the returned collection
  • Aligns with interface changes across other vector collection implementations
  • Maintains backward compatibility since List implements IReadOnlyList

Line range hint 192-241: Verify IsValidJsonKey implementation.

The code uses IsValidJsonKey which was moved to the base VectorCollection class. Let's verify its implementation and accessibility.

✅ Verification successful

IsValidJsonKey implementation is properly inherited and used correctly

The verification confirms:

  • IsValidJsonKey is implemented in the base VectorCollection class (src/Abstractions/src/VectorCollection.cs) as a protected static method
  • SqLiteVectorCollection properly inherits from VectorCollection and has access to this method
  • The implementation validates that the key contains only letters, numbers, and underscores using regex ^\w+$
  • The usage in SqLiteVectorCollection is correct, combining it with a null/whitespace check before using the key in SQL queries
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify IsValidJsonKey implementation in VectorCollection base class

# Test: Search for IsValidJsonKey implementation
echo "Searching for IsValidJsonKey implementation..."
ast-grep --pattern 'class VectorCollection {
  $$$
  protected static bool IsValidJsonKey($_) {
    $$$
  }
  $$$
}'

# Test: Verify no local implementation exists
echo "Verifying no local implementation exists..."
rg -p "IsValidJsonKey.*\{" src/Sqlite/src/SqLiteVectorCollection.cs

Length of output: 388


Script:

#!/bin/bash
# Let's try a different approach to find the IsValidJsonKey implementation

# Search for the base class file first
echo "Searching for VectorCollection class file..."
fd "VectorCollection.cs" --type f

# Search for IsValidJsonKey in all .cs files
echo "Searching for IsValidJsonKey implementation across all files..."
rg -p "IsValidJsonKey" --type cs -A 5

# Check inheritance of SQLiteVectorCollection
echo "Checking SQLiteVectorCollection class definition..."
rg -p "class SqLiteVectorCollection" --type cs -A 2

Length of output: 2272

src/Abstractions/src/PublicAPI.Unshipped.txt (2)

17-17: Breaking change: Consider migration guidance for SearchByMetadata.

The return type change from Task<List<Vector>> to Task<IReadOnlyList<Vector>> is a breaking change that improves immutability. While this aligns with the PR objective, consumers might need to update their code.

Let's check for direct List usage in the codebase:

Consider adding migration guidance in the changelog or documentation to help users adapt to this breaking change. Common migration patterns include:

  • Using .ToList() if mutation is required
  • Updating consumer code to work with IReadOnlyList interface
🧰 Tools
🪛 LanguageTool

[duplication] ~17-~17: Possible typo: you repeated a word
Context: ...g!, object!>! filters, System.Threading.CancellationToken cancellationToken = default(System.Threading.Cancellation...

(ENGLISH_WORD_REPEAT_RULE)


93-93: Review the accessibility of IsValidJsonKey method.

The addition of IsValidJsonKey as a public static method on VectorCollection expands the public API surface.

Let's check its current usage pattern:

Consider whether this utility method should be:

  1. Internal/protected if it's only meant for internal validation
  2. Moved to a dedicated validation utility class if it's truly meant for public consumption
  3. Documented with XML comments to clarify its purpose and validation rules
src/Mongo/src/MongoVectorCollection.cs (1)

Line range hint 79-105: Method signature update and implementation are appropriate

Changing the return type to IReadOnlyList<Vector> enhances immutability and aligns with the codebase updates. The method correctly constructs the filter and retrieves the results.

Comment on lines +27 to +31
protected static bool IsValidJsonKey(string input)
{
// Only allow letters, numbers, and underscores
return Regex.IsMatch(input, @"^\w+$");
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Consider enhancing the JSON key validation.

While the implementation is correct, consider these improvements for robustness:

  1. Add null check
  2. Add length limits
  3. Add XML documentation
  4. Make the regex pattern a constant

Here's a suggested implementation:

+    private const string JsonKeyPattern = @"^\w+$";
+
+    /// <summary>
+    /// Validates if the input string is a valid JSON key.
+    /// </summary>
+    /// <param name="input">The string to validate.</param>
+    /// <returns>True if the input consists only of letters, numbers, and underscores, and is not empty.</returns>
+    /// <exception cref="ArgumentNullException">Thrown when input is null.</exception>
     protected static bool IsValidJsonKey(string input)
     {
+        ArgumentNullException.ThrowIfNull(input);
+        
+        if (input.Length == 0 || input.Length > 256) // adjust max length as needed
+            return false;
+
         // Only allow letters, numbers, and underscores
-        return Regex.IsMatch(input, @"^\w+$");
+        return Regex.IsMatch(input, JsonKeyPattern);
     }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
protected static bool IsValidJsonKey(string input)
{
// Only allow letters, numbers, and underscores
return Regex.IsMatch(input, @"^\w+$");
}
private const string JsonKeyPattern = @"^\w+$";
/// <summary>
/// Validates if the input string is a valid JSON key.
/// </summary>
/// <param name="input">The string to validate.</param>
/// <returns>True if the input consists only of letters, numbers, and underscores, and is not empty.</returns>
/// <exception cref="ArgumentNullException">Thrown when input is null.</exception>
protected static bool IsValidJsonKey(string input)
{
ArgumentNullException.ThrowIfNull(input);
if (input.Length == 0 || input.Length > 256) // adjust max length as needed
return false;
// Only allow letters, numbers, and underscores
return Regex.IsMatch(input, JsonKeyPattern);
}

Comment on lines +137 to 140
Task<IReadOnlyList<Vector>> IVectorCollection.SearchByMetadata(Dictionary<string, object> filters, CancellationToken cancellationToken)
{
throw new NotImplementedException();
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Implementation needed for SearchByMetadata.

While the method signature has been updated to match the interface changes, throwing NotImplementedException is not production-ready. This method needs to be implemented to maintain consistency with other IVectorCollection implementations.

Consider implementing the method using OpenSearch's query capabilities. Here's a suggested implementation structure:

-    Task<IReadOnlyList<Vector>> IVectorCollection.SearchByMetadata(Dictionary<string, object> filters, CancellationToken cancellationToken)
-    {
-        throw new NotImplementedException();
-    }
+    async Task<IReadOnlyList<Vector>> IVectorCollection.SearchByMetadata(Dictionary<string, object> filters, CancellationToken cancellationToken)
+    {
+        var response = await client.SearchAsync<VectorRecord>(s => s
+            .Index(Name)
+            .Query(q => BuildMetadataQuery(q, filters)), cancellationToken)
+            .ConfigureAwait(false);
+            
+        if (!response.IsValid)
+        {
+            throw new InvalidOperationException($"Failed to search collection '{Name}' by metadata. DebugInformation: {response.DebugInformation}");
+        }
+
+        return response.Documents
+            .Where(vectorRecord => !string.IsNullOrWhiteSpace(vectorRecord.Text))
+            .Select(vectorRecord => new Vector
+            {
+                Text = vectorRecord.Text ?? string.Empty,
+                Id = vectorRecord.Id,
+                Metadata = vectorRecord.Metadata?.ToDictionary(x => x.Key, x => x.Value),
+                Embedding = vectorRecord.Vector,
+            })
+            .ToList()
+            .AsReadOnly();
+    }
+
+    private static QueryContainer BuildMetadataQuery(QueryContainerDescriptor<VectorRecord> q, Dictionary<string, object> filters)
+    {
+        // Build a bool query combining all metadata filters
+        return q.Bool(b => b.Must(
+            filters.Select(kvp =>
+                q.Term(t => t.Field($"metadata.{kvp.Key}").Value(kvp.Value))
+            )
+        ));
+    }

Committable suggestion was skipped due to low confidence.

Comment on lines +170 to 173
public Task<IReadOnlyList<Vector>> SearchByMetadata(Dictionary<string, object> filters, CancellationToken cancellationToken)
{
throw new NotSupportedException("Chroma doesn't support collection metadata");
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Consider adding documentation and removing async overhead.

The signature change to IReadOnlyList<Vector> looks good. However, since this method immediately throws an exception, consider these improvements:

+    /// <summary>
+    /// Not supported in ChromaDB as it doesn't support collection metadata.
+    /// </summary>
+    /// <exception cref="NotSupportedException">Always thrown as this operation is not supported.</exception>
-    public Task<IReadOnlyList<Vector>> SearchByMetadata(Dictionary<string, object> filters, CancellationToken cancellationToken)
+    public Task<IReadOnlyList<Vector>> SearchByMetadata(Dictionary<string, object> filters, CancellationToken cancellationToken = default)
     {
-        throw new NotSupportedException("Chroma doesn't support collection metadata");
+        return Task.FromException<IReadOnlyList<Vector>>(
+            new NotSupportedException("Chroma doesn't support collection metadata"));
     }

This change:

  1. Adds XML documentation to clearly communicate the limitation
  2. Makes cancellationToken parameter optional for consistency
  3. Uses Task.FromException for better async/await performance
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
public Task<IReadOnlyList<Vector>> SearchByMetadata(Dictionary<string, object> filters, CancellationToken cancellationToken)
{
throw new NotSupportedException("Chroma doesn't support collection metadata");
}
/// <summary>
/// Not supported in ChromaDB as it doesn't support collection metadata.
/// </summary>
/// <exception cref="NotSupportedException">Always thrown as this operation is not supported.</exception>
public Task<IReadOnlyList<Vector>> SearchByMetadata(Dictionary<string, object> filters, CancellationToken cancellationToken = default)
{
return Task.FromException<IReadOnlyList<Vector>>(
new NotSupportedException("Chroma doesn't support collection metadata"));
}

Comment on lines +23 to 29
/// <inheritdoc />
public async Task<bool> DeleteAsync(IEnumerable<string> ids, CancellationToken cancellationToken = default)
{
var filter = Builders<Vector>.Filter.In(i => i.Id, ids);
var result = await _mongoCollection.DeleteManyAsync(filter, cancellationToken).ConfigureAwait(false);
return result.IsAcknowledged;
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Add a null check for ids to prevent potential ArgumentNullException

Currently, the method does not check if ids is null. To prevent a possible ArgumentNullException, consider adding a null check at the beginning of the method.

Apply this change:

+ if (ids == null) throw new ArgumentNullException(nameof(ids));
 var filter = Builders<Vector>.Filter.In(i => i.Id, ids);
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
/// <inheritdoc />
public async Task<bool> DeleteAsync(IEnumerable<string> ids, CancellationToken cancellationToken = default)
{
var filter = Builders<Vector>.Filter.In(i => i.Id, ids);
var result = await _mongoCollection.DeleteManyAsync(filter, cancellationToken).ConfigureAwait(false);
return result.IsAcknowledged;
}
/// <inheritdoc />
public async Task<bool> DeleteAsync(IEnumerable<string> ids, CancellationToken cancellationToken = default)
{
if (ids == null) throw new ArgumentNullException(nameof(ids));
var filter = Builders<Vector>.Filter.In(i => i.Id, ids);
var result = await _mongoCollection.DeleteManyAsync(filter, cancellationToken).ConfigureAwait(false);
return result.IsAcknowledged;
}

Comment on lines +31 to 36
/// <inheritdoc />
public async Task<Vector?> GetAsync(string id, CancellationToken cancellationToken = default)
{
var filter = Builders<Vector>.Filter.Eq(i => i.Id, id);
var result = await _mongoCollection.FindAsync(filter, cancellationToken: cancellationToken).ConfigureAwait(false);
return result.FirstOrDefault(cancellationToken: cancellationToken);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Add a null check for id to ensure robustness

The method does not check if id is null. Adding a null check will prevent potential exceptions and improve the method's reliability.

Apply this change:

+ if (id == null) throw new ArgumentNullException(nameof(id));
 var filter = Builders<Vector>.Filter.Eq(i => i.Id, id);
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
/// <inheritdoc />
public async Task<Vector?> GetAsync(string id, CancellationToken cancellationToken = default)
{
var filter = Builders<Vector>.Filter.Eq(i => i.Id, id);
var result = await _mongoCollection.FindAsync(filter, cancellationToken: cancellationToken).ConfigureAwait(false);
return result.FirstOrDefault(cancellationToken: cancellationToken);
/// <inheritdoc />
public async Task<Vector?> GetAsync(string id, CancellationToken cancellationToken = default)
{
if (id == null) throw new ArgumentNullException(nameof(id));
var filter = Builders<Vector>.Filter.Eq(i => i.Id, id);
var result = await _mongoCollection.FindAsync(filter, cancellationToken: cancellationToken).ConfigureAwait(false);
return result.FirstOrDefault(cancellationToken: cancellationToken);

Comment on lines +39 to 43
/// <inheritdoc />
public async Task<bool> IsEmptyAsync(CancellationToken cancellationToken = default)
{
return await _mongoCollection.EstimatedDocumentCountAsync(cancellationToken: cancellationToken).ConfigureAwait(false) == 0;
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Use CountDocumentsAsync instead of EstimatedDocumentCountAsync for accurate emptiness check

EstimatedDocumentCountAsync may return an approximate count and might not be reliable for checking if the collection is empty. Consider using CountDocumentsAsync for an accurate result.

Apply this change:

- return await _mongoCollection.EstimatedDocumentCountAsync(cancellationToken: cancellationToken).ConfigureAwait(false) == 0;
+ return await _mongoCollection.CountDocumentsAsync(FilterDefinition<Vector>.Empty, cancellationToken: cancellationToken).ConfigureAwait(false) == 0;
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
/// <inheritdoc />
public async Task<bool> IsEmptyAsync(CancellationToken cancellationToken = default)
{
return await _mongoCollection.EstimatedDocumentCountAsync(cancellationToken: cancellationToken).ConfigureAwait(false) == 0;
}
/// <inheritdoc />
public async Task<bool> IsEmptyAsync(CancellationToken cancellationToken = default)
{
return await _mongoCollection.CountDocumentsAsync(FilterDefinition<Vector>.Empty, cancellationToken: cancellationToken).ConfigureAwait(false) == 0;
}

Comment on lines +16 to 22
/// <inheritdoc />
public async Task<IReadOnlyCollection<string>> AddAsync(IReadOnlyCollection<Vector> items, CancellationToken cancellationToken = default)
{
await _mongoCollection.InsertManyAsync(items, cancellationToken: cancellationToken).ConfigureAwait(false);
return items.Select(i => i.Id).ToList();
}

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Consider returning a read-only collection to enforce immutability

Although the method returns IReadOnlyCollection<string>, it currently returns a modifiable List<string>. To ensure that the collection cannot be modified by the caller, consider returning a read-only collection using AsReadOnly().

Apply this change:

 return items.Select(i => i.Id).ToList()
+            .AsReadOnly();
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
/// <inheritdoc />
public async Task<IReadOnlyCollection<string>> AddAsync(IReadOnlyCollection<Vector> items, CancellationToken cancellationToken = default)
{
await _mongoCollection.InsertManyAsync(items, cancellationToken: cancellationToken).ConfigureAwait(false);
return items.Select(i => i.Id).ToList();
}
/// <inheritdoc />
public async Task<IReadOnlyCollection<string>> AddAsync(IReadOnlyCollection<Vector> items, CancellationToken cancellationToken = default)
{
await _mongoCollection.InsertManyAsync(items, cancellationToken: cancellationToken).ConfigureAwait(false);
return items.Select(i => i.Id).ToList()
.AsReadOnly();
}

@robalexclark robalexclark deleted the Metadata-search-for-opensearch branch October 25, 2024 16:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant