[META][Discuss] add support for input_type in "dual modes" codecs #11885

Closed

colinsurprenant opened this issue May 9, 2020 · 7 comments

@colinsurprenant
Contributor

As seen in logstash-plugins/logstash-codec-csv#8 and logstash-plugins/logstash-codec-multiline#63, some codecs would benefit from supporting two modes of operation, one for each of the two types of data that our input plugins can provide to codecs.

Our input plugins can provide two types of data for decoding:

  • line/document based data, where each data chunk provided by the plugin to the codec is a complete data line or complete document. For example, the file input or the http input.
  • stream based data, where each data chunk provided by the plugin can be a part of a line/document, and the complete line/document can span multiple chunks. For example, the stdin input or the tcp input.

The way we have been dealing with this situation has been to have two versions of the same codec, for example: json and json_lines, or plain and line. Furthermore, to help deal with this confusion, we introduced the fix_streaming_codecs method to automagically swap these codecs depending on the input used:

def fix_streaming_codecs
  require "logstash/codecs/plain"
  require "logstash/codecs/line"
  require "logstash/codecs/json"
  require "logstash/codecs/json_lines"
  case @codec.class.name
  when "LogStash::Codecs::Plain"
    @logger.info("Automatically switching from #{@codec.class.config_name} to line codec", :plugin => self.class.config_name)
    @codec = LogStash::Codecs::Line.new("charset" => @codec.charset)
  when "LogStash::Codecs::JSON"
    @logger.info("Automatically switching from #{@codec.class.config_name} to json_lines codec", :plugin => self.class.config_name)
    @codec = LogStash::Codecs::JSONLines.new("charset" => @codec.charset)
  end
end

Until we figure out a whole new/better input/codec architecture, my proposal is to iteratively improve the current design with:

  1. introduce a new input_type config option in codecs that can support both input types (similar to [WIP] support line delimited data logstash-plugins/logstash-codec-csv#8); a minimal sketch follows this list
  2. get rid of these redundant codecs
  3. figure out a way for inputs to hint codecs about their input style so that a correct default can be set in the codec depending on the input plugin used.
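
For illustration, here is a minimal sketch of what such an option could look like in a codec. The option name, its values, and the streaming? helper are assumptions for discussion, not an agreed API:

class LogStash::Codecs::CSV < LogStash::Codecs::Base
  config_name "csv"

  # hypothetical option: "document" means each call to decode receives one or
  # more complete lines/documents; "streaming" means calls may receive
  # arbitrary fragments of a line/document
  config :input_type, :validate => ["document", "streaming"], :default => "document"

  def register
    # in streaming mode, buffer fragments until a record separator is seen
    @buffer = BufferedTokenizer.new if streaming?
  end

  def streaming?
    @input_type == "streaming"
  end
end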

I am looking for comments/suggestions about this plan; if we agree on the idea, I will detail the steps for each iteration.

@yaauie
Member

yaauie commented May 11, 2020

If we take a bit of a step back, I think this is an issue about whether ownership of the record-separator belongs with the inputs or with the codecs.

We currently have a mixed-ownership model, with some subset of inputs claiming ownership (e.g., using BufferedTokenizer to split on and consume a delimiter), some codecs requiring ownership (e.g., the CSV codec, since the record delimiter can legally appear unescaped in a quoted value, or JSON Lines, which will fail horribly if each record passed to JSONLines#decode does not have a trailing delimiter), and some codecs refusing ownership (e.g., the JSON codec, which must be passed a string representing exactly one document).
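
To make the json_lines case concrete, here is a sketch of that failure mode, assuming the codec buffers internally until it sees its delimiter (verify against the actual implementation):

require "logstash/codecs/json_lines"

codec = LogStash::Codecs::JSONLines.new

# without a trailing delimiter, the payload stays in the codec's internal
# buffer and no event is yielded
codec.decode('{"message":"hello"}') { |event| puts event } # yields nothing

# a later chunk carrying the delimiter flushes the buffered record
codec.decode("\n") { |event| puts event.get("message") } # => "hello"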

You're right. This is complex.

Perhaps instead of adding a config option, which would require users to track the complexity of our plugin-compatibility matrix, the codec plugin class could advertise its expectations, e.g.:

  • exactly-one: requires exactly one document (input owns record separation)
  • one-or-more: accepts one or more complete documents
  • stream: fully owns record separation, accepts fragments of documents.

Inputs could then use this declaration to properly prepare the payload that the codec requires.
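
A rough sketch of what such advertising could look like; payload_expectation and the symbols are placeholders, and @tokenizer/queue stand in for whatever the input already uses:

module LogStash::Codecs
  class JSON < Base
    # hypothetical declaration: the input owns record separation
    def self.payload_expectation
      :exactly_one
    end
  end

  class JSONLines < Base
    # hypothetical declaration: the codec owns record separation
    def self.payload_expectation
      :stream
    end
  end
end

# an input could then decide whether to tokenize before calling decode
if @codec.class.payload_expectation == :exactly_one
  @tokenizer.extract(data).each { |record| @codec.decode(record) { |e| queue << e } }
else
  @codec.decode(data) { |e| queue << e }
end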


I also wonder what the effect would be of making our BufferedTokenizer (optionally?) emit the trailing record separator as a part of the payload it sends with each call to Codec#decode. I think this would allow the codecs that need ownership of the separator to re-group consecutive documents, while codecs that expect exactly one "line" would continue to work without modification (or at most, a String#chomp).
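
BufferedTokenizer has no such flag today; a plain-Ruby stand-in to illustrate emitting the delimiter as part of each entity:

class DelimiterKeepingTokenizer
  def initialize(delimiter = "\n")
    @delimiter = delimiter
    @buffer = +""
  end

  # returns complete entities, each still carrying its trailing delimiter;
  # any incomplete tail stays buffered for the next call
  def extract(data)
    @buffer << data
    entities = []
    while (i = @buffer.index(@delimiter))
      entities << @buffer.slice!(0, i + @delimiter.size)
    end
    entities
  end
end

tok = DelimiterKeepingTokenizer.new
tok.extract("foo\nbar\nba") # => ["foo\n", "bar\n"] ("ba" stays buffered)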

@colinsurprenant
Contributor Author

This is the return of the milling idea and the whole codecs-inconsistencies discussion; some relevant information is in #4858 and #5124.

We have to separate the redesign discussion from the improvement discussion. I am trying to improve on what we have without breaking backward compatibility or having to wait for a major version upgrade to factor in these improvements. This does not exclude reopening the redesign discussion, but I think we should not mix the two too much, because doing so would block possible short-term improvements.

@andsel
Contributor

andsel commented May 12, 2020

I think that the selectable behaviour is a good starting point for rationalizing the codecs. For the long-term discussion, however, I think we should keep the customization of codec behaviour open: I don't know whether a declaration of expectations between the codec and the input is sufficient to automatically spare the user from having to choose what to do.

@yaauie
Member

yaauie commented May 12, 2020

When an input consumes a delimiter, it is lossy: a codec can usually assume the presence of a delimiter between calls to Codec#decode, but it cannot reliably guess which delimiter was used, nor can it tell the difference between a size-limited chunk of a stream and the end of a record.

If we were to push this configurable behaviour to the input (e.g., whether or not to consume the delimiter), then I believe we would be on a better track for a future feature to make it automatic by enabling codecs to advertise their requirements and for inputs to provide what the codecs need.
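
For example (the consume_delimiter and delimiter options here are purely illustrative), an input's read loop could do:

# hypothetical input-side switch between consuming and preserving the delimiter
@tokenizer.extract(data).each do |line|
  payload = @consume_delimiter ? line : line + @delimiter # re-attach if preserving
  @codec.decode(payload) { |event| queue << event }
end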

@colinsurprenant
Contributor Author

colinsurprenant commented May 12, 2020

I surveyed all input plugins to classify how they manipulate the data they hand to the codecs. Note that I did not verify whether any of these are broken, unsupported, etc.; I just looked at all logstash-input-* plugins in the logstash-plugins repo.

Inputs

  • inputs that do not use codecs
jdbc, elasticsearch, twitter, journald, cloudwatch, gelf, jmx, jms, couchdb_changes, salesforce, snmp, snmptrap, github, log4j, eventlog, heartbeat, meetup, sqlite, varnishlog, dead_letter_queue, wmi, graphite, drupal_dblog, ganglia, puppet_facter, zenoss, neo4j
  • inputs that are data agnostic: they don't modify the data handed to the codec
kafka, redis, http_poller, kinesis, http, rabbitmq, s3, sqs, imap, google_pubsub, jms, rss, lumberjack, udp, azure_event_hubs, websocket, exec, heroku, zeromq, stomp, repl, unix, stdin, irc, xmpp, rackspace, perfmon, gemfire, beats, tcp
  • inputs that are lossless line-based (using for example File#each)
s3, google_cloud_storage
  • inputs that are lossy line-based
file, pipe, s3sqs

Observations:

  • Practically speaking, only the file input is a lossy line-based input (the pipe and s3sqs inputs are either not working or not supported).
  • The other line-based inputs, s3 and google_cloud_storage, hand the line breaks over to the codec.
  • The data agnostic inputs are typically (but not exclusively) document-based, in the sense that each codec invocation receives a complete piece of data (not spanning multiple codec invocations), with the exception of the two well-known streaming inputs: tcp and stdin (note that udp is in another class, since ordering is not guaranteed and typical usage assumes it to be document-based). Also note that nothing precludes using these inputs in a streaming context; for example, the kafka or redis inputs could receive only parts of a complete document. This is not typical but could be a valid use-case. Most if not all of these inputs default to a document-based codec like plain or json.
  • Given the streaming nature of the stdin and tcp inputs, both enforce a codec swap from plain to line and from json to json_lines using the Inputs::Base#fix_streaming_codecs method.

Thoughts

  • The file input is a tricky one: it is lossy by stripping the line break, but at the same time it could be considered more like a document-based input where each line of the file is treated as a complete document (a log line). This is essentially how it was designed: it already deals with line breaks and will always offer complete lines to its codec. One problem I can see with this lossy mode is with the csv codec: if line breaks are swallowed by the file input, it becomes impossible to support the CSV line-break-in-quoted-columns use-case. We could look at adding a new option to make the file input lossless (which would not be the default, for BWC); a sketch of such an option follows this list.

  • With the exception of the tcp and stdin inputs, all data agnostic inputs can essentially assume a document-based mode for their codecs. That does not preclude using a codec in a streaming mode if needed, but it would be unusual.
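
A possible shape for the lossless file input option mentioned above; the lossless flag and the handle_line hook are hypothetical:

# hypothetical: the file input gains a lossless flag that preserves line
# breaks when handing lines to its codec (defaults to false for BWC)
config :lossless, :validate => :boolean, :default => false

def handle_line(line, delimiter)
  payload = @lossless ? line + delimiter : line
  @codec.decode(payload) { |event| decorate(event); queue << event }
end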

Recommendations for BWC and minimal changes to the current design

  • I think that offering a document/streaming mode (an input_type option) makes sense for codecs that can be used for decoding data in both contexts, like plain/line, json/json_lines, and others in WIP such as csv and multiline. These should default to document mode; only tcp and stdin require the streaming mode.
  • As a first step for BWC, we could re-unify plain/line and json/json_lines by having a single implementation and making one codec in each pair a superclass of the other, with just a different default document/streaming mode (see the sketch after this list).
  • As a next step, we could investigate how to eliminate the nasty fix_streaming_codecs and find a cleaner way, within the current architecture, to hint the codec about the default document/streaming mode; currently only tcp and stdin would need to do this.
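
As a sketch of the re-unification idea, the class layout only; whether the config DSL allows re-declaring a default in a subclass would need verification:

module LogStash::Codecs
  class Plain < Base
    config_name "plain"
    config :input_type, :validate => ["document", "streaming"], :default => "document"
    # ... shared decode/encode implementation ...
  end

  # line becomes a thin subclass that only flips the default mode
  class Line < Plain
    config_name "line"
    config :input_type, :validate => ["document", "streaming"], :default => "streaming"
  end
end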

@colinsurprenant
Contributor Author

colinsurprenant commented May 13, 2020

I amended my thoughts on the file input and added that we should probably add a lossless mode if we want, for example, to be able to support the CSV line-break-in-quoted-columns use-case with the csv codec. I think adding such a mode should be relatively trivial.

@colinsurprenant
Contributor Author

colinsurprenant commented May 13, 2020

TL;DR Recap

Goal

  • Fix some codec inconsistencies and solve current bugs and limitations in a backward-compatible way, with minimal changes and without an architecture redesign.

First Step

  • Introduce the new input_type option (document/streaming, defaulting to document mode) in the codecs that can decode data in both contexts, starting with the WIP csv and multiline codecs (logstash-plugins/logstash-codec-csv#8, logstash-plugins/logstash-codec-multiline#63).

Second and Optional Step

  • Refactor both codec pairs, plain/line and json/json_lines, to add the above input_type option and make one codec in each pair a superclass of the other, with a different default for the input_type option. This will unify the code bases and remove duplication between mostly identical codecs.
  • Depending on the discussion, see if we want to add a lossless option to the file input.

Third and also Optional Step

  • Remove the fix_streaming_codecs hardcoded hack and figure out a way (note that this has not yet been fully analyzed) for the tcp and stdin input plugins to signal their streaming nature to their codec so that the correct default input_type is set; one possible shape is sketched below.
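
One possible shape, entirely hypothetical pending that analysis (default_input_type! does not exist today):

# the input declares its streaming nature and the codec adopts it as its
# default, unless the user explicitly configured an input_type
class LogStash::Inputs::Tcp < LogStash::Inputs::Base
  def register
    if @codec.respond_to?(:default_input_type!)
      @codec.default_input_type!(:streaming)
    end
    # ...
  end
end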

Fourth and also Optional Step

  • Deprecate/remove the redundant codecs from the plain/line and json/json_lines pairs.
