
implement batch document optimization for text embedding processor #1217

Open · wants to merge 1 commit into base: optimized-processor

Conversation

@will-hwang (Contributor) commented Mar 7, 2025

Description

This PR includes changes for:

  1. A refactor of InferenceFilter to simplify the logic for copying embeddings.
  2. An implementation of batch document update, overriding the batchExecute method that TextEmbeddingProcessor inherits from its parent class, AbstractBatchingProcessor.
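The batching pattern described in (2) can be sketched as follows. The types and method signatures here (`IngestDocumentWrapper`, `AbstractBatchingProcessor.batchExecute`, `fakeInference`) are simplified stand-ins invented for illustration, not the actual OpenSearch or neural-search APIs:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Minimal stand-in for the real wrapper type; real fields differ.
class IngestDocumentWrapper {
    final String inferenceText;
    float[] embedding;  // target field, populated after inference
    IngestDocumentWrapper(String text) { this.inferenceText = text; }
}

// Hypothetical simplification of the parent class's batching hook.
abstract class AbstractBatchingProcessor {
    abstract void batchExecute(List<IngestDocumentWrapper> docs,
                               Consumer<List<IngestDocumentWrapper>> handler);
}

class TextEmbeddingProcessorSketch extends AbstractBatchingProcessor {
    @Override
    void batchExecute(List<IngestDocumentWrapper> docs,
                      Consumer<List<IngestDocumentWrapper>> handler) {
        // Collect inference texts from every document in the batch so the
        // model is invoked once per batch instead of once per document.
        List<String> inferenceList = new ArrayList<>();
        for (IngestDocumentWrapper doc : docs) {
            inferenceList.add(doc.inferenceText);
        }

        // Stand-in for the ML Commons inference call.
        List<float[]> embeddings = fakeInference(inferenceList);

        // Map each embedding back to its document, in order.
        for (int i = 0; i < docs.size(); i++) {
            docs.get(i).embedding = embeddings.get(i);
        }
        handler.accept(docs);
    }

    // Dummy inference: one-element vector per text, for demonstration only.
    static List<float[]> fakeInference(List<String> texts) {
        List<float[]> out = new ArrayList<>();
        for (String t : texts) out.add(new float[] { t.length() });
        return out;
    }
}
```

The point of the override is that the whole sub-batch reaches the processor at once, so a single inference request can cover every document.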

Proposed State [Batch Document Update]

[Image: proposed-batch-flow diagram]

Steps:

  1. A process map is generated for each IngestDocument based on the defined field map.
  2. If the skip_existing feature is set to true, the process map for each IngestDocument is filtered:
    1. Existing documents are fetched via the OpenSearch client's Multi-Get action so that each existing inference text can be compared against its counterpart in the IngestDocument.
      1. If the documents do not exist, or any exception is thrown, fall back to calling model inference.
    2. Locate the embedding fields in each existing document:
      1. Recursively traverse the process map to find embedding fields, keeping track of the traversal path so the same fields can be looked up in the existing document.
      2. Once embedding fields are found, attempt to copy the embeddings from the existing document to its corresponding ingest document.
    3. If eligible, copy the vector embeddings from the existing document to its corresponding ingest document.
      1. A field is eligible for copying if the inference text in the ingest document matches the text in its corresponding existing document, and embeddings for that text exist in the existing document.
      2. Note: for values in a list, the fields at the same index are compared to determine text equality.
    4. When eligible fields have been copied, remove their entries from the process map.
  3. The inference list is generated from the entries remaining in the filtered process map.
  4. The ML Commons InferenceSentence API is invoked with the filtered inference list.
  5. Embeddings for the filtered inference list are generated in ML Commons.
  6. The generated embeddings are mapped to target fields via the entries defined in the process map.
  7. The embeddings are populated into the target fields of each IngestDocument.
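The skip_existing filtering in step 2 can be sketched as below. The flat map shapes and the method name `filterProcessMap` are illustrative assumptions (the real process map is nested and traversed recursively); the sketch only shows the eligibility rule — copy the stored embedding and drop the entry when the inference text is unchanged, otherwise leave the entry to be sent for inference:

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

class SkipExistingFilterSketch {
    /**
     * For each entry in the process map (field -> inference text), reuse the
     * embedding from the existing document when the text is unchanged and a
     * stored embedding exists, removing the entry so no inference runs for it.
     * The entries that remain form the filtered inference list.
     */
    static Map<String, String> filterProcessMap(
            Map<String, String> processMap,          // field -> inference text (ingest doc)
            Map<String, String> existingTexts,       // field -> inference text (existing doc)
            Map<String, float[]> existingEmbeddings, // field -> stored embedding
            Map<String, float[]> ingestTargets) {    // field -> embedding to populate
        Map<String, String> filtered = new HashMap<>(processMap);
        Iterator<Map.Entry<String, String>> it = filtered.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, String> e = it.next();
            String field = e.getKey();
            float[] stored = existingEmbeddings.get(field);
            // Eligible only if the text matches and a stored embedding exists.
            if (stored != null && e.getValue().equals(existingTexts.get(field))) {
                ingestTargets.put(field, stored);  // copy embedding, skip inference
                it.remove();                       // remove entry from process map
            }
        }
        return filtered;  // remaining entries become the inference list
    }
}
```

Only fields whose text actually changed reach the model, which is the source of the optimization: an unchanged document triggers no inference at all.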

Related Issues

HLD: #1138

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@will-hwang force-pushed the optimized-text-embedding-processor-batch branch from c04d826 to 516c43a on March 8, 2025 04:41
codecov bot commented Mar 8, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 81.56%. Comparing base (1a6e58e) to head (516c43a).

Additional details and impacted files
@@                    Coverage Diff                    @@
##             optimized-processor    #1217      +/-   ##
=========================================================
- Coverage                  81.94%   81.56%   -0.39%     
+ Complexity                  2604     1311    -1293     
=========================================================
  Files                        194       97      -97     
  Lines                       8858     4496    -4362     
  Branches                    1498      762     -736     
=========================================================
- Hits                        7259     3667    -3592     
+ Misses                      1016      534     -482     
+ Partials                     583      295     -288     
