[META][Discuss] add support for input_type in "dual modes" codecs #11885

Closed

colinsurprenant opened this issue May 9, 2020 · 7 comments

@colinsurprenant
Contributor

As seen in logstash-plugins/logstash-codec-csv#8 and logstash-plugins/logstash-codec-multiline#63, some codecs would benefit from supporting two modes of operation, one for each of the two types of data that our input plugins can provide to codecs.

Our input plugins can provide two types of data for decoding:

  • line/document based data, where each data chunk provided by the plugin to the codec is a complete data line or complete document. For example, the file input or the http input.
  • stream based data, where each data chunk provided by the plugin can be a part of a line/document, and the complete line/document can span multiple chunks. For example, the stdin input or the tcp input.

The way we have been dealing with this situation has been to have two versions of the same codec, for example: json and json_lines, or plain and line. Furthermore, to help deal with this confusion, we introduced the fix_streaming_codecs method to automagically swap these codecs depending on the input used:

def fix_streaming_codecs
  require "logstash/codecs/plain"
  require "logstash/codecs/line"
  require "logstash/codecs/json"
  require "logstash/codecs/json_lines"
  case @codec.class.name
  when "LogStash::Codecs::Plain"
    @logger.info("Automatically switching from #{@codec.class.config_name} to line codec", :plugin => self.class.config_name)
    @codec = LogStash::Codecs::Line.new("charset" => @codec.charset)
  when "LogStash::Codecs::JSON"
    @logger.info("Automatically switching from #{@codec.class.config_name} to json_lines codec", :plugin => self.class.config_name)
    @codec = LogStash::Codecs::JSONLines.new("charset" => @codec.charset)
  end
end

Until we figure out a whole new/better input/codec architecture, my proposal is to iteratively improve the current design with:

  1. introduce a new input_type config option in codecs that can support both input types (similar to [WIP] support line delimited data logstash-plugins/logstash-codec-csv#8); a minimal sketch follows this list
  2. get rid of these redundant codecs
  3. figure out a way for inputs to hint codecs about their input style so that a correct default can be set in the codec depending on the input plugin used.
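
For illustration, here is a minimal sketch of what such an option could look like in a codec. The option name, its values, and the streaming? helper are assumptions for discussion, not an agreed API:

class LogStash::Codecs::CSV < LogStash::Codecs::Base
  config_name "csv"

  # hypothetical option: "document" means each call to decode receives one or
  # more complete lines/documents; "streaming" means calls may receive
  # arbitrary fragments of a line/document
  config :input_type, :validate => ["document", "streaming"], :default => "document"

  def register
    # in streaming mode, buffer fragments until a record separator is seen
    @buffer = BufferedTokenizer.new if streaming?
  end

  def streaming?
    @input_type == "streaming"
  end
end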

I am looking for comments/suggestions about this plan; if we agree on the idea, I will detail the steps for each iteration.

@yaauie
Member

yaauie commented May 11, 2020

If we take a bit of a step back, I think this is an issue about whether ownership of the record-separator belongs with the inputs or with the codecs.

We currently have a mixed-ownership model, with some subset of inputs claiming ownership (e.g., using BufferedTokenizer to split on and consume a delimiter), some codecs requiring ownership (e.g., the CSV codec, since the record delimiter can legally appear unescaped in a quoted value, or JSON Lines, which will fail horribly if each record passed to JSONLines#decode does not have a trailing delimiter), and some codecs refusing ownership (e.g., the JSON codec, which must be passed a string representing exactly one document).
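
To make the json_lines case concrete, here is a sketch of that failure mode, assuming the codec buffers internally until it sees its delimiter (verify against the actual implementation):

require "logstash/codecs/json_lines"

codec = LogStash::Codecs::JSONLines.new

# without a trailing delimiter, the payload stays in the codec's internal
# buffer and no event is yielded
codec.decode('{"message":"hello"}') { |event| puts event } # yields nothing

# a later chunk carrying the delimiter flushes the buffered record
codec.decode("\n") { |event| puts event.get("message") } # => "hello"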

You're right. This is complex.

Perhaps instead of adding a config option, which would require users to track the complexity of our plugin-compatibility matrix, the codec plugin class could advertise its expectations, e.g.:

  • exactly-one: requires exactly one document (input owns record separation)
  • one-or-more: accepts one or more complete documents
  • stream: fully owns record separation, accepts fragments of documents.

Inputs could then use this declaration to properly prepare the payload that the codec requires.
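
A rough sketch of what such advertising could look like; payload_expectation and the symbols are placeholders, and @tokenizer/queue stand in for whatever the input already uses:

module LogStash::Codecs
  class JSON < Base
    # hypothetical declaration: the input owns record separation
    def self.payload_expectation
      :exactly_one
    end
  end

  class JSONLines < Base
    # hypothetical declaration: the codec owns record separation
    def self.payload_expectation
      :stream
    end
  end
end

# an input could then decide whether to tokenize before calling decode
if @codec.class.payload_expectation == :exactly_one
  @tokenizer.extract(data).each { |record| @codec.decode(record) { |e| queue << e } }
else
  @codec.decode(data) { |e| queue << e }
end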


I also wonder what the effect would be of making our BufferedTokenizer (optionally?) emit the trailing record separator as a part of the payload it sends with each call to Codec#decode. I think this would allow the codecs that need ownership of the separator to re-group consecutive documents, while codecs that expect exactly one "line" would continue to work without modification (or at most, a String#chomp).
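
BufferedTokenizer has no such flag today; a plain-Ruby stand-in to illustrate emitting the delimiter as part of each entity:

class DelimiterKeepingTokenizer
  def initialize(delimiter = "\n")
    @delimiter = delimiter
    @buffer = +""
  end

  # returns complete entities, each still carrying its trailing delimiter;
  # any incomplete tail stays buffered for the next call
  def extract(data)
    @buffer << data
    entities = []
    while (i = @buffer.index(@delimiter))
      entities << @buffer.slice!(0, i + @delimiter.size)
    end
    entities
  end
end

tok = DelimiterKeepingTokenizer.new
tok.extract("foo\nbar\nba") # => ["foo\n", "bar\n"] ("ba" stays buffered)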

@colinsurprenant
Contributor Author

This is the return of the milling idea and the whole codecs-inconsistencies discussion; some relevant information is in #4858 and #5124.

We have to separate the redesign discussion from the improvement discussion. I am trying to improve on what we have without breaking backward compatibility or having to wait for a major version upgrade to factor in these improvements. This does not exclude reopening the redesign discussion, but I think we should not mix the two too much, because doing so would block possible short-term improvements.

@andsel
Contributor

andsel commented May 12, 2020

I think that the selectable behaviour is a good starting point for rationalizing the codecs. For the long-term discussion, however, I think we should keep the customization of codec behaviour open: I don't know whether a declaration of expectations between the codec and the input is sufficient to automatically spare the user from having to choose what to do.

@yaauie
Member

yaauie commented May 12, 2020

When an input consumes a delimiter, it is lossy: a codec can usually assume the presence of a delimiter between calls to Codec#decode, but it cannot reliably guess which delimiter was used, nor can it tell the difference between a size-limited chunk of a stream and the end of a record.

If we were to push this configurable behaviour to the input (e.g., whether or not to consume the delimiter), then I believe we would be on a better track for a future feature to make it automatic by enabling codecs to advertise their requirements and for inputs to provide what the codecs need.
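
For example (the consume_delimiter and delimiter options here are purely illustrative), an input's read loop could do:

# hypothetical input-side switch between consuming and preserving the delimiter
@tokenizer.extract(data).each do |line|
  payload = @consume_delimiter ? line : line + @delimiter # re-attach if preserving
  @codec.decode(payload) { |event| queue << event }
end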

@colinsurprenant
Contributor Author

colinsurprenant commented May 12, 2020

I surveyed all input plugins to classify how they manipulate the data they hand to the codecs. Note that I did not verify whether any of these are broken, unsupported, etc.; I just looked at all logstash-input-* plugins in the logstash-plugins repo.

Inputs

  • inputs that do not use codecs
jdbc, elasticsearch, twitter, journald, cloudwatch, gelf, jmx, jms, couchdb_changes, salesforce, snmp, snmptrap, github, log4j, eventlog, heartbeat, meetup, sqlite, varnishlog, dead_letter_queue, wmi, graphite, drupal_dblog, ganglia, puppet_facter, zenoss, neo4j
  • inputs that are data agnostic: they don't modify the data handed to the codec
kafka, redis, http_poller, kinesis, http, rabbitmq, s3, sqs, imap, google_pubsub, jms, rss, lumberjack, udp, azure_event_hubs, websocket, exec, heroku, zeromq, stomp, repl, unix, stdin, irc, xmpp, rackspace, perfmon, gemfire, beats, tcp
  • inputs that are lossless line-based (using for example File#each)
s3, google_cloud_storage
  • inputs that are lossy line-based
file, pipe, s3sqs

Observations:

  • Practically speaking, only the file input is a lossy line-based input (the pipe and s3sqs inputs are either not working or not supported).
  • The other line-based inputs, s3 and google_cloud_storage, hand the line breaks over to the codec.
  • The data agnostic inputs are typically (but not exclusively) document-based, in the sense that each codec invocation receives a complete piece of data (not spanning multiple codec invocations), with the exception of the two well-known streaming inputs: tcp and stdin (note that udp is in another class, since ordering is not guaranteed and typical usage assumes it to be document-based). Also note that nothing precludes using these inputs in a streaming context; for example, the kafka or redis inputs could receive only parts of a complete document. This is not typical but could be a valid use-case. Most if not all of these inputs default to a document-based codec like plain or json.
  • Given the streaming nature of the stdin and tcp inputs, both enforce a codec swap from plain to line and from json to json_lines using the Inputs::Base#fix_streaming_codecs method.

Thoughts

  • The file input is a tricky one: it is lossy by stripping the line break, but at the same time it could be considered more like a document-based input where each line of the file is treated as a complete document (a log line). This is essentially how it was designed: it already deals with line breaks and will always offer complete lines to its codec. One problem I can see with this lossy mode is with the csv codec: if line breaks are swallowed by the file input, it becomes impossible to support the CSV line-break-in-quoted-columns use-case. We could look at adding a new option to make the file input lossless (which would not be the default, for BWC); a sketch of such an option follows this list.

  • With the exception of the tcp and stdin inputs, all data agnostic inputs can essentially assume a document-based mode for their codecs. That does not preclude using a codec in a streaming mode if needed, but it would be unusual.
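
A possible shape for the lossless file input option mentioned above; the lossless flag and the handle_line hook are hypothetical:

# hypothetical: the file input gains a lossless flag that preserves line
# breaks when handing lines to its codec (defaults to false for BWC)
config :lossless, :validate => :boolean, :default => false

def handle_line(line, delimiter)
  payload = @lossless ? line + delimiter : line
  @codec.decode(payload) { |event| decorate(event); queue << event }
end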

Recommendations for BWC and minimal changes to the current design

  • I think that offering a document/streaming mode (an input_type option) makes sense for codecs that can be used for decoding data in both contexts, like plain/line, json/json_lines, and others in WIP such as csv and multiline. These should default to document mode; only tcp and stdin require the streaming mode.
  • As a first step for BWC, we could re-unify plain/line and json/json_lines by having a single implementation and making one codec in each pair a superclass of the other, with just a different default document/streaming mode (see the sketch after this list).
  • As a next step, we could investigate how to eliminate the nasty fix_streaming_codecs and find a cleaner way, within the current architecture, to hint the codec about the default document/streaming mode; currently only tcp and stdin would need to do this.
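
As a sketch of the re-unification idea, the class layout only; whether the config DSL allows re-declaring a default in a subclass would need verification:

module LogStash::Codecs
  class Plain < Base
    config_name "plain"
    config :input_type, :validate => ["document", "streaming"], :default => "document"
    # ... shared decode/encode implementation ...
  end

  # line becomes a thin subclass that only flips the default mode
  class Line < Plain
    config_name "line"
    config :input_type, :validate => ["document", "streaming"], :default => "streaming"
  end
end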

@colinsurprenant
Contributor Author

colinsurprenant commented May 13, 2020

I amended my thoughts on the file input and added that we should probably add a lossless mode if we want, for example, to be able to support the CSV line-break-in-quoted-columns use-case with the csv codec. I think adding such a mode should be relatively trivial.

@colinsurprenant
Contributor Author

colinsurprenant commented May 13, 2020

TL;DR Recap

Goal

  • Fix some codec inconsistencies and solve current bugs and limitations in a backward-compatible way, with minimal changes and without an architecture redesign.

First Step

  • Introduce the new input_type option (document/streaming, defaulting to document mode) in the codecs that can decode data in both contexts, starting with the WIP csv and multiline codecs (logstash-plugins/logstash-codec-csv#8, logstash-plugins/logstash-codec-multiline#63).

Second and Optional Step

  • Refactor both codec pairs, plain/line and json/json_lines, to add the above input_type option and make one codec in each pair a superclass of the other, with a different default for the input_type option. This will unify the code bases and remove duplication between mostly identical codecs.
  • Depending on the discussion, see if we want to add a lossless option to the file input.

Third and also Optional Step

  • Remove the fix_streaming_codecs hardcoded hack and figure out a way (note that this has not yet been fully analyzed) for the tcp and stdin input plugins to signal their streaming nature to their codec so that the correct default input_type is set; one possible shape is sketched below.
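
One possible shape, entirely hypothetical pending that analysis (default_input_type! does not exist today):

# the input declares its streaming nature and the codec adopts it as its
# default, unless the user explicitly configured an input_type
class LogStash::Inputs::Tcp < LogStash::Inputs::Base
  def register
    if @codec.respond_to?(:default_input_type!)
      @codec.default_input_type!(:streaming)
    end
    # ...
  end
end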

Fourth and also Optional Step

  • Deprecate/remove the redundant codecs from the plain/line and json/json_lines pairs.
