Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support parsing a headerless jarray (expected behavior is similar to csv_parser) #30321

Closed
RoeiDimi opened this issue Jan 7, 2024 · 8 comments
Labels
enhancement New feature or request needs triage New item requiring triage pkg/stanza

Comments

@RoeiDimi
Copy link
Contributor

RoeiDimi commented Jan 7, 2024

Component(s)

pkg/stanza

Is your feature request related to a problem? Please describe.

csv_parser operator supports getting a list of headers and parsing an input csv line into attributes. there is no such support for jarray lines (ie receiving a similar comma separated line that is wrapped in brackets)

Describe the solution you'd like

Adding a flag 'is_jarray' to csv_parser. to demonstrate it, the operator config can then look something like this:

operators:
    - type: csv_parser
      header: TimeGenerated,SourceIP,SourcePort,DestinationIP,DestinationPort,Protocol,SentBytes,ReceivedBytes,ExtID
      parse_from: body
      parse_to: attributes
      is_jarray: true

and then add a suitable generate parse function (like generateJarrayParseFunc in this draft I created for example)

This will allow maximal code reuse as the expected behavior is basically the same as for a csv line besides the parsing (receiving headerless data, getting a pre-determined list of headers in the config and parsing it into attributes)

Describe alternatives you've considered

Implementing jarray_parser with the downside being most of csv_parser's code being duplicated

Additional context

This is a part of a bigger project in which we are using otel-collector as an infrastructure and receive many types of data from a client. The client's sent data is always a form of json and this use case is a subset in which the json is a simple headerless jarray and so we need a way to parse it in this manner

@RoeiDimi RoeiDimi added enhancement New feature or request needs triage New item requiring triage labels Jan 7, 2024
Copy link
Contributor

github-actions bot commented Jan 7, 2024

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@djaglowski
Copy link
Member

djaglowski commented Jan 9, 2024

Thanks for the suggestion @RoeiDimi.

Technically, I believe you could solve this by placing a regex parser in front of the csv parser. However, removing prefixes and suffixes could arguably be a general enough case where we could make the functionality available independently. Generally speaking, our operators are designed with composability in mind because this allows us to solve many more use cases, so either way I don't think it should be part of the csv parser only.

Have you tried removing the brackets with a regex parser yet? If not, can we start there and see how it performs?

@RoeiDimi
Copy link
Contributor Author

Thanks @djaglowski for the quick reply!

After giving it some thought and looking at examples, I think my initial description of the problem was a bit misleading. The difference between a CSV line and a JArray is not actually just the surrounding brackets, its also a bunch of parsing and escaping differences.

a few very simple examples -

jarray - ["Alice", 25, "female", true]
csv - Alice,25,female,true

jarray - ["Bob", 30, null, false]
csv - Bob,30,,false

Theoretically, I'm sure this can be somehow achieved using a regex but I think it's partially implementing a json parsing library.. It may be error-prone and probably hurt the performance, what do you think?

Also, what do you think about the alternative suggestion of implementing a new operator JArray_parser?

@djaglowski
Copy link
Member

I think I understand better why this requires new functionality.

what do you think about the alternative suggestion of implementing a new operator JArray_parser?

The one thing that stands out to me here is that we'd be duplicating a lot of functionality with the csv parser, so I see why you proposed making it part of the same. That said, still don't think it would be a great idea to pack other non-csv formats into the same operator.

Here's another idea - what if we introduce a jarray_csv transformer which converts jarray to csv? This could be a simpler solution because it wouldn't attempt to duplicate or replicate all the logic of the csv parser. Instead, it would just decode the json array and re-encode it as a csv line. This would allow you to convert a single field on each record and then just run the result through the csv parser as it is today. I bet encoding/json.Decoder and encoding/csv.Writer provide everything you'd need here.

@RoeiDimi
Copy link
Contributor Author

Thanks @djaglowski
Yeah I get the idea and logically it sounds good. I'm just a bit concerned about a potential significant performance hit and we really try to maximize on performance.
I will try it out, compare to the other approach and get back with numbers

@RoeiDimi
Copy link
Contributor Author

RoeiDimi commented Jan 14, 2024

Hi @djaglowski
I implemented the transformer.
And after trying to test it end-to-end i ran into a wall and i don't think this approach can work - converting a jarray to csv comes with losing the types information.

in my case the input data looks like this:
"[\"2023-06-21T10:41:57.75801Z\", \"127.0.0.1\", \"54525\", \"192.168.1.0\", 8080, \"HTTP\", 8765, 555666, \"id69\"]"
and then these operators are defined:

operators:
    - type: jarray_to_csv
      field: body
    - type: csv_parser
      header: TimeGenerated,SourceIP,SourcePort,DestinationIP,DestinationPort,Protocol,SentBytes,ReceivedBytes,ExtID
      parse_from: body
      parse_to: attributes

the exporter then uses these attributes, creates a json and sends it to our Microsoft log analytics workspace (which is expecting a json with matching key-value per column in the destination table and has a type definition for each column). the output json in that case is:

"{\"DestinationIP\":\"192.168.1.0\",\"DestinationPort\":\"8080\",\"ExtID\":\"id69\",\"Protocol\":\"HTTP\",\"ReceivedBytes\":\"555666\",\"SentBytes\":\"8765\",\"SourceIP\":\"127.0.0.1\",\"SourcePort\":\"54525\",\"TimeGenerated\":\"2023-06-21T10:41:57.75801Z\"}"

but it should be
"{\"DestinationIP\":\"192.168.1.0\",\"DestinationPort\":8080,\"ExtID\":\"id69\",\"Protocol\":\"HTTP\",\"ReceivedBytes\":555666,\"SentBytes\":8765,\"SourceIP\":\"127.0.0.1\",\"SourcePort\":54525,\"TimeGenerated\":\"2023-06-21T10:41:57.75801Z\"}"

(notice the numbers..)
This is a long way to say that a json holds some typing information but a csv doesnt.
And so if I convert to a csv then i need to dynamically store that information somewhere and then reuse it, again in the csv_parser (or later on when converting back to a json..)

I think the issue here is that csv_parser does 2 jobs:

  1. it parses a csv line into a list of strings
  2. it matches values with headers

If we take out the 2nd job into a different parser (or transformer im not sure) and call it something like "headers_matching_parser" then csv_parser can handle only parsing a csv into a list of strings and a new jarray_parser can be introduced and do only parsing from jarray to a list of strings and they both could use headers_matching_parser after that if needed

What would be the next best way in your eyes?
Thanks for your patience and sorry this one has been kind of a long one :)

@djaglowski
Copy link
Member

@RoeiDimi, I appreciate you going through the effort to prototype the transformer. Losing the type information does appear to be a good enough reason to make this more than a conversion to csv. I think it's probably not worth the complexity to users to change the functionality from the current csv parser, so a standalone parser makes sense to me at this point.

djaglowski added a commit that referenced this issue Jan 23, 2024
**Description:** <Describe what has changed.>
<!--Ex. Fixing a bug - Describe the bug and how this fixes the issue.
Ex. Adding a feature - Explain what this achieves.-->
We are using otel-collector as an infrastructure and receive many types
of data from a client. The client's sent data is always a form of json
and in one use case the json is a simple headerless jarray and so we
need a way to parse it and match headers to each field (something
similar to what csv_parser does but also supports types supported in a
json format and nested objects)

**Link to tracking Issue:** <Issue number if applicable>

#30321

**Testing:** <Describe what testing was performed and which tests were
added.>
* unittests
All the tests found in csv_parser were copied and adjusted adding test
scenarios for different types (numbers, booleans, null) as well as a
test for parsing a nested object as a part of the jarray
* End to end tests
Used generated traffic on a running otel collector thats using the
parser and verified the data is as expected in the end table

**Documentation:** <Describe the documentation added.>
*
[json_array_parser.md](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/beacea489ff4ae61c0bac4f477c04748944c9054/pkg/stanza/docs/operators/json_array_parser.md)
*
[assign_keys.md](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/beacea489ff4ae61c0bac4f477c04748944c9054/pkg/stanza/docs/operators/assign_keys.md)

---------

Co-authored-by: Daniel Jaglowski <jaglows3@gmail.com>
@djaglowski
Copy link
Member

Closed by #30644

djaglowski pushed a commit that referenced this issue Jan 31, 2024
**Description:** <Describe what has changed.>
<!--Ex. Fixing a bug - Describe the bug and how this fixes the issue.
Ex. Adding a feature - Explain what this achieves.-->
Adding a feature following
#30644.
This feature allow json_array_parser parser to accept a comma-delimited
header and for every json array it parses, output a map which contains
the header fileds as keys and the matching values are the ones parsed
from the input json array.

This feature as added mainly for performance reasons as from a
functional POV, this is mostly similar to chaining the 2 operators:
`json_array_parser -> assign_keys `
**Link to tracking Issue:** <Issue number if applicable>

#30321
**Testing:** <Describe what testing was performed and which tests were
added.>

- unittests
- End to end tests
Used generated traffic on a running otel collector thats using the
parser and verified the data is as expected in the end table and
performance looks good

**Documentation:** <Describe the documentation added.>

-
[json_array_parser.md](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/58cc91ca30eabbd35c074d79db8630fc474164d9/pkg/stanza/docs/operators/json_array_parser.md)
cparkins pushed a commit to AmadeusITGroup/opentelemetry-collector-contrib that referenced this issue Feb 1, 2024
…try#30644)

**Description:** <Describe what has changed.>
<!--Ex. Fixing a bug - Describe the bug and how this fixes the issue.
Ex. Adding a feature - Explain what this achieves.-->
We are using otel-collector as an infrastructure and receive many types
of data from a client. The client's sent data is always a form of json
and in one use case the json is a simple headerless jarray and so we
need a way to parse it and match headers to each field (something
similar to what csv_parser does but also supports types supported in a
json format and nested objects)

**Link to tracking Issue:** <Issue number if applicable>

open-telemetry#30321

**Testing:** <Describe what testing was performed and which tests were
added.>
* unittests
All the tests found in csv_parser were copied and adjusted adding test
scenarios for different types (numbers, booleans, null) as well as a
test for parsing a nested object as a part of the jarray
* End to end tests
Used generated traffic on a running otel collector thats using the
parser and verified the data is as expected in the end table

**Documentation:** <Describe the documentation added.>
*
[json_array_parser.md](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/beacea489ff4ae61c0bac4f477c04748944c9054/pkg/stanza/docs/operators/json_array_parser.md)
*
[assign_keys.md](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/beacea489ff4ae61c0bac4f477c04748944c9054/pkg/stanza/docs/operators/assign_keys.md)

---------

Co-authored-by: Daniel Jaglowski <jaglows3@gmail.com>
cparkins pushed a commit to AmadeusITGroup/opentelemetry-collector-contrib that referenced this issue Feb 1, 2024
**Description:** <Describe what has changed.>
<!--Ex. Fixing a bug - Describe the bug and how this fixes the issue.
Ex. Adding a feature - Explain what this achieves.-->
Adding a feature following
open-telemetry#30644.
This feature allow json_array_parser parser to accept a comma-delimited
header and for every json array it parses, output a map which contains
the header fileds as keys and the matching values are the ones parsed
from the input json array.

This feature as added mainly for performance reasons as from a
functional POV, this is mostly similar to chaining the 2 operators:
`json_array_parser -> assign_keys `
**Link to tracking Issue:** <Issue number if applicable>

open-telemetry#30321
**Testing:** <Describe what testing was performed and which tests were
added.>

- unittests
- End to end tests
Used generated traffic on a running otel collector thats using the
parser and verified the data is as expected in the end table and
performance looks good

**Documentation:** <Describe the documentation added.>

-
[json_array_parser.md](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/58cc91ca30eabbd35c074d79db8630fc474164d9/pkg/stanza/docs/operators/json_array_parser.md)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request needs triage New item requiring triage pkg/stanza
Projects
None yet
Development

No branches or pull requests

2 participants