[RFC] JSON-to-JSON Transformer #12964

Open
jackiehanyang opened this issue Mar 28, 2024 · 19 comments
Labels
enhancement (Enhancement or improvement to existing feature or request), Other, RFC (Issues requesting major changes), Roadmap:Search (Project-wide roadmap label)

Comments

@jackiehanyang

jackiehanyang commented Mar 28, 2024

Is your feature request related to a problem? Please describe

Flow-Framework aims to make OpenSearch the easiest destination for building AI/ML applications on vector databases by differentiating OpenSearch on ease of use and high flexibility, delivering an edge in the emerging and highly competitive vector database landscape. Flow-Framework aims to revamp how we build AI/ML flows so that it can support any AI use case suited for OpenSearch. This requires providing customers with an innovative paradigm that allows users to compose AI-augmented workflows using modular and reusable search and ingest processors that can represent any relevant AI use case. We will give users the flexibility to configure these processors by introducing a JSON-to-JSON Transformer. This tool will allow customers to transform the processor input or output datasets, enabling them to seamlessly chain processors together.

The JSON-to-JSON transformer functions as a standalone utility within the Core package. It enables users to configure transformations from one or multiple JSON documents into another format, such as converting input JSON objects (e.g., search results from a previous flow step) into a different JSON format like a prompt template. It offers three approaches for data transformation: Painless scripts (P0 item), the JSONPath string-manipulation function (P0 item), and automated transformation based on specified inputs and outputs (P1 item). This utility should be standalone and able to be integrated into any processor, either before or after the processor execution flow, as a data transformation step.

[Diagram: j-j-1.drawio]

Describe the solution you'd like

Provide a public utility method in the core package that can be used by any processor. Depending on future requirements, we can expose this utility method as a REST API, or even as a processor.

public static JsonNode JsonDataTransformation(List<JsonNode> dataset,
                                              DataTransformApproach approach,
                                              List<String> source) {
   ...
}
  • List<JsonNode> dataset, the dataset to transform. Usually this is a list of SearchHits objects.
  • DataTransformApproach approach, an enum (PAINLESS or JSONPATH) indicating the approach the customer would like to use to transform the dataset.
  • List<String> source, the Painless script source or the JSONPath field-mapping instructions.
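
For illustration, here is a minimal sketch of how a processor owner might call this utility with the JSONPath approach. The wrapping method and variable names are hypothetical; only the signature above is defined by this RFC.

import com.fasterxml.jackson.databind.JsonNode;
import java.util.List;

// Hypothetical caller inside a processor (names are illustrative, not part of the proposal)
static JsonNode collapseBookNames(List<JsonNode> searchResponses) {
    // JSONPath field-mapping instruction: output key -> path into the input documents
    List<String> source = List.of("{\"book_name\": \"$[*].hits[*]._source.books.name\"}");
    // DataTransformApproach.JSONPATH is assumed from the enum values described above
    return JsonDataTransformation(searchResponses, DataTransformApproach.JSONPATH, source);
}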

Supported Transform Approach 1. Painless Script

Painless is a performant, secure scripting language that provides numerous capabilities. Writing Painless scripts can be challenging for customers, and we aim to eliminate that difficulty. However, we still want to keep this method as the default approach, allowing customers to achieve their objectives when the JSONPath string-manipulation functions are not enough.
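
As a purely illustrative sketch, the source argument for this approach would carry the Painless script text. The params.dataset binding below is an assumption made for illustration; the RFC does not specify how the input JsonNodes are exposed to the script.

// Hypothetical Painless source a processor owner might pass as the source argument.
// The params.dataset binding is assumed for illustration and is not defined by this RFC.
String painlessSource =
      "def names = [];"
    + "for (def doc : params.dataset) {"
    + "  for (def hit : doc.hits) {"
    + "    names.add(hit._source.books.name);"
    + "  }"
    + "}"
    + "return names;";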

Supported Transform Approach 2. String Manipulation (JSONPath)

JSONPath is a query language designed for navigating and extracting parts of a JSON document. With JSONPath, you can specify and navigate to different parts of a JSON structure, making it easier to retrieve specific data elements without needing to process the entire structure manually in code.

AppSec clearance is already in place for using JSONPath in ml-commons since 2.12. We will initiate another AppSec review for this use case.

2.1. N-1 Transform: Merge multiple JSONs into one JSON or another data format

In some cases, the transform has to be applied in a “many-to-one” mode, collapsing multiple objects such as search results into a single JSON output. For instance, a re-ranker model may require the incoming search results (hits.fields) to be collapsed into a single array of strings as input to the re-ranker (e.g., Cohere Rerank).

For example, when a customer has the following input:

[
    {
        "hits": [
            {
                "_index": "media_library",
                "_id": "63MhYY0BFJSF4M0W0eUG",
                "_score": 1,
                "_source": {
                    "books": {
                        "name": "To Kill a Mockingbird",
                        "author": "Harper Lee",
                        "genres": "fiction",
                        "price": 15.99
                    },
                    "songs": {
                        "name": "Pocketful of Sunshine"
                    }
                }
            }
        ]
    },
    {
        "hits": [
            {
                "_index": "books_songs",
                "_id": "5nMhYY0BFJSF4M0W0eUG",
                "_score": 1,
                "_source": {
                    "books": {
                        "name": "Where the Crawdads Sing",
                        "author": "Delia Owens",
                        "genres": "fiction",
                        "cost": 12.99,
                        "year": 2018
                    },
                    "songs": {
                        "name": "If"
                    }
                }
            }
        ]
    }
]

The customer will need to provide the following JSONPath transform instruction:

{
    "book_name": "$[*].hits[*]._source.books.name",
    "song_name": "$[*].hits[*]._source.songs.name"
}

The output would be:

{
    "book_name": ["To Kill a Mockingbird", "Where the Crawdads Sing"],
    "song_name": ["Pocketful of Sunshine", "If"]
}
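
As a rough sketch of how the JSONPath option could be implemented inside the utility, each "output field -> JSONPath" mapping can be evaluated against the raw input and collected into the output object. The use of the Jayway json-path library here is an assumption; the RFC does not commit to a specific JSONPath dependency.

import com.jayway.jsonpath.JsonPath;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch only: apply each "output field -> JSONPath" mapping to the input JSON string
static Map<String, Object> applyJsonPathInstruction(String inputJson, Map<String, String> instruction) {
    Map<String, Object> output = new LinkedHashMap<>();
    for (Map.Entry<String, String> entry : instruction.entrySet()) {
        // For wildcard paths, JsonPath.read returns a list of every matching value,
        // which produces the N-1 "collapse into an array" behavior shown above
        List<Object> values = JsonPath.read(inputJson, entry.getValue());
        output.put(entry.getKey(), values);
    }
    return output;
}

Applied to the search response above with the two paths from the instruction, this sketch would produce the book_name and song_name arrays shown in the output.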

2.2. 1-1 Transform: Map a specific field in one JSON to another JSON
A 1-1 Transform is essentially the same as an N-1 Transform, with the distinction that in a 1-1 Transform, N equals 1. Therefore, we don't need a separate DataTransformApproach enum value to differentiate between 1-1 and N-1 Transforms. However, for a 1-N Transform scenario, customers would need to use a Painless script, as JSONPath may not be sufficient for such transformations.

Related component

Other

Describe alternatives you've considered

No response

Additional context

No response

@jackiehanyang jackiehanyang added the enhancement and untriaged labels Mar 28, 2024
@github-actions github-actions bot added the Other label Mar 28, 2024
@navneet1v
Contributor

navneet1v commented Mar 28, 2024

@jackiehanyang do we know the impact on JVM and latency of using a JSON-to-JSON transform (using Painless and JSONPath) on a search response (taking the search response as the reference since it tends to be quite big in general) containing, let's say, 100 to 1000 results?

It would be good if we can have some micro-benchmarks done on this to understand the impact of this transform.

@jackiehanyang
Author

@jackiehanyang do we know the impact on JVM and latency of using a JSON-to-JSON transform (using Painless and JSONPath) on a search response (taking the search response as the reference since it tends to be quite big in general) containing, let's say, 100 to 1000 results?

It would be good if we can have some micro-benchmarks done on this to understand the impact of this transform.

Will share the results once I have them

@smacrakis

Would it make sense to use a binary format (e.g., Protobuf, Thrift, CBOR?) for communication between processors? It seems perversely inefficient to serialize/deserialize JSON multiple times in a pipeline. Does the ongoing work on Protobuf in OpenSearch help here?

@arjunkumargiri

Thanks for building this functionality; a couple of follow-up questions:

  • Is the expectation that this transformer only performs data manipulation? Painless scripts support multiple scripting capabilities in addition to data manipulation. By adding support for Painless scripts, users could use the transformer to perform operations beyond data manipulation.
  • Why does the output need to include the document ID? Default JSONPath does not include the ID. Also, for non-search-document input this approach would not work.

@jackiehanyang
Author

Would it make sense to use a binary format (e.g., Protobuf, Thrift, CBOR?) for communication between processors? It seems perversely inefficient to serialize/deserialize JSON multiple times in a pipeline. Does the ongoing work on Protobuf in OpenSearch help here?

I do agree that using a binary format for communication between processors is more efficient than serializing/deserializing JSON. I'm communicating with Dylan to see if Protobuf is something we should consider.

@jackiehanyang
Author

Thanks for building this functionality; a couple of follow-up questions:

  • Is the expectation that this transformer only performs data manipulation? Painless scripts support multiple scripting capabilities in addition to data manipulation. By adding support for Painless scripts, users could use the transformer to perform operations beyond data manipulation.
  • Why does the output need to include the document ID? Default JSONPath does not include the ID. Also, for non-search-document input this approach would not work.
  • Yes, this transformer only performs data manipulation. It won't modify any data values.
  • Because when merging multiple documents into one document, we need a way to differentiate JSON key names. If it's a non-search document, we will need to append a GUID to make the key names unique.

@msfroh
Collaborator

msfroh commented Apr 5, 2024

Does this belong in https://github.com/opensearch-project/flow-framework ?

It doesn't seem to be related to OpenSearch core.

@jackiehanyang
Author

jackiehanyang commented Apr 8, 2024

Does this belong in https://github.com/opensearch-project/flow-framework ? It doesn't seem to be related to OpenSearch core.

We're planning to develop this as a standalone utility function within the core repository. This will allow each processor that requires pre/post data transformation to call this function instead of integrating it as a processor or workflow step within the flow-framework. This approach aims to reduce dependency coupling and limitations when transitioning to serverless.

@arjunkumargiri

Did you explore other JSON-to-JSON transformer libraries, such as Jolt (https://github.com/bazaarvoice/jolt)?

@dylan-tong-aws

dylan-tong-aws commented Apr 9, 2024

Would it make sense to use a binary format (e.g., Protobuf, Thrift, CBOR?) for communication between processors? It seems perversely inefficient to serialize/deserialize JSON multiple times in a pipeline. Does the ongoing work on Protobuf in OpenSearch help here?

Performance is definitely important, and there are many possible solutions. I was under the impression that the transformations are performed between Java (JSON) objects, and that deserialization/serialization is not required.

@jackiehanyang is serialization/deserialization required?

@dylan-tong-aws

The JSONPath option will need to be complemented with some helper String manipulation functions for it to be useful for a broader range of use cases.

I recommend reviewing the pre/post processors that were implemented for the AI connectors and identifying the transform logic that can't be translated into JSONPath. I believe there are string manipulation cases like escaping strings.

@ylwu-amzn, @zane-neo, and @Zhangxunmt should be able to help identify this gap.

@dylan-tong-aws

@jackiehanyang, can you provide an example of how the interface for this functionality might look within a processor?

@mingshl, @ylwu-amzn, this is the data transform functionality that can replace the pre/post-processing functionality that currently exists in the AI connectors. It would be good to see a proposal for how this functionality is interfaced through the ML inference processor (search pipelines).

@peternied peternied added the RFC label and removed the untriaged label Apr 10, 2024
@peternied
Member

[Triage - attendees 1 2 3 4 5 6]
@jackiehanyang Thanks for creating this RFC; looking forward to seeing how this lands.

@msfroh
Collaborator

msfroh commented May 3, 2024

I would just like to point out that no (ingest or search) processors in OpenSearch operate on JSON. They operate on (Java) objects. JSON is just a notation to represent objects (originally for JavaScript).

If the goal is to support JSONPath as a language to manipulate objects, one option could be to add JSONPath as a scripting language supported by the OpenSearch scripting engine, then use the script processor with JSONPath as a script.

@mingshl
Contributor

mingshl commented May 3, 2024

@jackiehanyang, can you provide an example of how the interface for this functionality might look within a processor?

@mingshl, @ylwu-amzn, this is the data transform functionality that can replace the pre/post-processing functionality that currently exists in the AI connectors. It would be good to see a proposal for how this functionality is interfaced through the ML inference processor (search pipelines).

The ML inference processor now supports JSONPath. If this JSON-to-JSON transform is to use Painless scripts, then using a script processor chained with the ML inference processor would serve the same purpose.

@b4sjoo

b4sjoo commented May 10, 2024

Overall this looks good to me. The only caution is that customer input data sanitization, if not done properly, could lead to a DoS attack.

@msfroh msfroh added the Roadmap:Search label May 14, 2024
@andrross
Member

The JSON-to-JSON transformer functions as a standalone utility within the Core package. It enables users to configure transformations from one or multiple JSON documents into another format...

Is there any coupling to classes or concepts within this repository or is it truly a completely standalone Java utility that operates on arbitrary JSON? When you say "users" here, do you mean other Java developers that would take a dependency on whatever artifact defines this utility? You give an example of a "customer" providing two different JSON objects and getting a third one as output, but what is the interface? Is it just Java utility functions or is there some feature to be implemented within this repository that will provide that experience?

@jackiehanyang
Author

The JSON-to-JSON transformer functions as a standalone utility within the Core package. It enables users to configure transformations from one or multiple JSON documents into another format...

Is there any coupling to classes or concepts within this repository or is it truly a completely standalone Java utility that operates on arbitrary JSON? When you say "users" here, do you mean other Java developers that would take a dependency on whatever artifact defines this utility? You give an example of a "customer" providing two different JSON objects and getting a third one as output, but what is the interface? Is it just Java utility functions or is there some feature to be implemented within this repository that will provide that experience?

It is just a Java utility function, and the users of this JSON-to-JSON transformer utility method are the processor owners. Every processor that needs to perform data transformation could leverage this utility method. Currently, in the 2.15 release cycle, we want to support the example I provided in this RFC by leveraging JSONPath. We will identify the gaps and limitations of JSONPath and aim to support more complicated data manipulation in future releases.

@andrross
Member

Thanks @jackiehanyang! My first instinct is that I'm hesitant for the OpenSearch repo to become the owner of utility code that is not used within the repo itself, because we certainly have enough code here as it is :) However, I wouldn't stand in the way of adding useful functionality in something like libs/common if it makes sense and is useful for other consumers of that library.

But it sounds like you've got a path forward in the short term and we can revisit this once you get more information on using JSONPath.
