
implement batch document optimization for text embedding processor #1217

Open · wants to merge 1 commit into base: optimized-processor

Conversation

@will-hwang (Contributor) commented Mar 7, 2025

Description

This PR includes changes for:

  1. A refactor of InferenceFilter to simplify the logic for copying embeddings.
  2. An implementation of batch document update, overriding the batchExecute method that TextEmbeddingProcessor inherits from its parent class, AbstractBatchingProcessor.
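The batching pattern described in (2) can be sketched as follows. The types and method signatures here (`IngestDocumentWrapper`, `AbstractBatchingProcessor.batchExecute`, `fakeInference`) are simplified stand-ins invented for illustration, not the actual OpenSearch or neural-search APIs:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Minimal stand-in for the real wrapper type; real fields differ.
class IngestDocumentWrapper {
    final String inferenceText;
    float[] embedding;  // target field, populated after inference
    IngestDocumentWrapper(String text) { this.inferenceText = text; }
}

// Hypothetical simplification of the parent class's batching hook.
abstract class AbstractBatchingProcessor {
    abstract void batchExecute(List<IngestDocumentWrapper> docs,
                               Consumer<List<IngestDocumentWrapper>> handler);
}

class TextEmbeddingProcessorSketch extends AbstractBatchingProcessor {
    @Override
    void batchExecute(List<IngestDocumentWrapper> docs,
                      Consumer<List<IngestDocumentWrapper>> handler) {
        // Collect inference texts from every document in the batch so the
        // model is invoked once per batch instead of once per document.
        List<String> inferenceList = new ArrayList<>();
        for (IngestDocumentWrapper doc : docs) {
            inferenceList.add(doc.inferenceText);
        }

        // Stand-in for the ML Commons inference call.
        List<float[]> embeddings = fakeInference(inferenceList);

        // Map each embedding back to its document, in order.
        for (int i = 0; i < docs.size(); i++) {
            docs.get(i).embedding = embeddings.get(i);
        }
        handler.accept(docs);
    }

    // Dummy inference: one-element vector per text, for demonstration only.
    static List<float[]> fakeInference(List<String> texts) {
        List<float[]> out = new ArrayList<>();
        for (String t : texts) out.add(new float[] { t.length() });
        return out;
    }
}
```

The point of the override is that the whole sub-batch reaches the processor at once, so a single inference request can cover every document.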

Proposed State [Batch Document Update]

[Image: proposed-batch-flow diagram]

Steps:

  1. A process map is generated for each IngestDocument based on the defined field map.
  2. If the skip_existing feature is set to true, the process map for each IngestDocument is filtered:
    1. Existing documents are fetched via the OpenSearch client's Multi-Get action so that each existing inference text can be compared against its counterpart in the IngestDocument.
      1. If the documents do not exist, or any exception is thrown, fall back to calling model inference.
    2. Locate the embedding fields in each existing document:
      1. Recursively traverse the process map to find embedding fields, keeping track of the traversal path so the same fields can be looked up in the existing document.
      2. Once embedding fields are found, attempt to copy the embeddings from the existing document to its corresponding ingest document.
    3. If eligible, copy the vector embeddings from the existing document to its corresponding ingest document.
      1. A field is eligible for copying if the inference text in the ingest document matches the text in its corresponding existing document, and embeddings for that text exist in the existing document.
      2. Note: for values in a list, the fields at the same index are compared to determine text equality.
    4. When eligible fields have been copied, remove their entries from the process map.
  3. The inference list is generated from the entries remaining in the filtered process map.
  4. The ML Commons InferenceSentence API is invoked with the filtered inference list.
  5. Embeddings for the filtered inference list are generated in ML Commons.
  6. The generated embeddings are mapped to target fields via the entries defined in the process map.
  7. The embeddings are populated into the target fields of each IngestDocument.
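The skip_existing filtering in step 2 can be sketched as below. The flat map shapes and the method name `filterProcessMap` are illustrative assumptions (the real process map is nested and traversed recursively); the sketch only shows the eligibility rule — copy the stored embedding and drop the entry when the inference text is unchanged, otherwise leave the entry to be sent for inference:

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

class SkipExistingFilterSketch {
    /**
     * For each entry in the process map (field -> inference text), reuse the
     * embedding from the existing document when the text is unchanged and a
     * stored embedding exists, removing the entry so no inference runs for it.
     * The entries that remain form the filtered inference list.
     */
    static Map<String, String> filterProcessMap(
            Map<String, String> processMap,          // field -> inference text (ingest doc)
            Map<String, String> existingTexts,       // field -> inference text (existing doc)
            Map<String, float[]> existingEmbeddings, // field -> stored embedding
            Map<String, float[]> ingestTargets) {    // field -> embedding to populate
        Map<String, String> filtered = new HashMap<>(processMap);
        Iterator<Map.Entry<String, String>> it = filtered.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, String> e = it.next();
            String field = e.getKey();
            float[] stored = existingEmbeddings.get(field);
            // Eligible only if the text matches and a stored embedding exists.
            if (stored != null && e.getValue().equals(existingTexts.get(field))) {
                ingestTargets.put(field, stored);  // copy embedding, skip inference
                it.remove();                       // remove entry from process map
            }
        }
        return filtered;  // remaining entries become the inference list
    }
}
```

Only fields whose text actually changed reach the model, which is the source of the optimization: an unchanged document triggers no inference at all.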

Related Issues

HLD: #1138

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@will-hwang force-pushed the optimized-text-embedding-processor-batch branch from c04d826 to 516c43a on March 8, 2025 04:41
codecov bot commented Mar 8, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 81.56%. Comparing base (1a6e58e) to head (516c43a).

Additional details and impacted files
@@                    Coverage Diff                    @@
##             optimized-processor    #1217      +/-   ##
=========================================================
- Coverage                  81.94%   81.56%   -0.39%     
+ Complexity                  2604     1311    -1293     
=========================================================
  Files                        194       97      -97     
  Lines                       8858     4496    -4362     
  Branches                    1498      762     -736     
=========================================================
- Hits                        7259     3667    -3592     
+ Misses                      1016      534     -482     
+ Partials                     583      295     -288     
