Codec Replacement Plan #5124

Open
andrewvc opened this issue Apr 15, 2016 · 2 comments
@andrewvc
Contributor

Logstash currently uses codecs. The codec abstraction externally makes (some) sense
but has a number of problems and limitations.

  1. Codecs often have the dual responsibility of (1) splitting streams into discrete
    events and (2) deserializing data.
  2. Many codecs only decode or encode, not both. The interface doesn't let a plugin
    declare that it only does one or the other.
  3. Now that event mills exist, the tokenization of events happens before the codec
    stage.
  4. Inputs and outputs waste time in the codec stage (they are often single-threaded)
    when this work could be done by the pipeline with greater efficiency.

Moving forward, it makes more sense to split each codec into a filter and a new
Serializer object that encodes data. The JSON codec, for example, would become:

  1. logstash-filter-json
  2. logstash-serializer-json

We could keep the existing syntax but have it work with these new internals. The execution model would move from:

# Current design
input(internal codec) -> queue -> filters -> output(internal codec)
# New design
input -> mill -> queue -> filters -> serializer -> output
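
For reference, the user-facing syntax that would keep working is the ordinary codec
declaration on an input or output; a minimal example (the tcp plugins, host and ports
below are just placeholders):

input {
  tcp {
    port  => 5000
    codec => json
  }
}

output {
  tcp {
    host  => "example.org"
    port  => 5001
    codec => json
  }
}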

To keep compatibility with the current config syntax the 'codec' directive would
do the following:

  1. For inputs, it would tag events coming out of a given input with a directive
    asking that the corresponding filter be applied to those events before the
    normal filter chain.
  2. For outputs, all events going to an output would be run through the correct
    serializer first. The output would need to implement a new interface as follows:
class MyOutput < LogStash::Outputs::Base
  # Receives pairs of [event, serialized_data] that have already been run
  # through the serializer by the pipeline.
  def receive_serialized(events_and_data)
    events_and_data.each do |event, data|
      # do something with the event and its serialized data
    end
  end
end
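
For illustration, a minimal sketch of the serializer side under this proposal. The
Serializers namespace and the encode/encode_batch contract below are assumptions
about the proposed interface, not existing Logstash code:

module LogStash
  module Serializers
    # Hypothetical logstash-serializer-json: pure encoding, no stream splitting.
    class Json
      # Turn one event into its wire representation (a JSON string here).
      def encode(event)
        event.to_json
      end

      # The pipeline could batch-serialize before handing data to the output,
      # producing the [event, data] pairs that receive_serialized expects.
      def encode_batch(events)
        events.map { |event| [event, encode(event)] }
      end
    end
  end
end
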
@guyboertje
Contributor

I believe that the Tahoe decision to move the decode part of codecs into filters that run after the queue needs a rethink.

I would like to keep compatibility with the current config, but all bets are off.

We need to understand how the various functions of the input and codec will decompose. These functions are:

  1. Specific protocol decoding (where the source data needs to be converted before any further processing can be done)
  2. Charset normalisation.
  3. Event boundary detection (we know this as Milling).
  4. General protocol decoding (as directed by the codec config option)
  5. Local decoration (input specific fields and tags)
  6. Global decoration (add_field, add_tag)

Every input composes some subset of these functions in order to work.

Before discussing the pragmatic moving of some or all of these functions to after the PQ (persistent queue), I want to establish a guiding principle:

We seek to minimise the amount of processing applied to the source data before a generated event is persisted to the PQ, and to defer the work the input previously did until after the PQ.

Unfortunately, after looking at the inputs, I feel that the amount of deferrable input work is a spectrum from almost none to quite a lot. Vague, I know, but relevant, because it is difficult to say absolutely that all input work should be deferred until after the PQ.

Faced with this variation, can we design a system that accommodates this spectrum, i.e. one that allows some inputs to defer work and others not to, whichever makes sense for each?

I believe so.

Imagine an event put into the PQ with minimal processing: what would it look like, and what metadata would it need to enable processing after the PQ? Imagine that this minimal event was generated by a TCP input.

{
  "message": "<some json>",
  "@metadata": {
    "global-decorator" : {
      "add_fields": [["env" , "dev"], ["sys" , "undef"]],
      "add_tags": ["conf-v1"]
    },
    "local-decorator": {
      "key": "logstash-input-tcp/decorator",
      "host": "<some ip>",
      "port": "<some port>",
      "ssl_enable": true,
      "ssl_subject": "<some subject>"
    },
    "charset" : "UTF-8",
    "decoder": {
      "key": "logstash-input-tcp/json",
      "source": "message"
    }
  }
}

In any batch, these events may all share the same global-decorator, local-decorator and decoder (deferred input work) entries, but they may equally well have vastly different ones. This means that the filter_func can't know about these variations.

My initial thoughts are these:

  • Inside the worker thread loop, we will need a PQ client object to interact with the PQ.
  • Either this client or another component will execute the deferred work (deferred work executor).
  • An input may have a LocalDecorator class - it should be stateless and threadsafe.
  • An input may have a Decoder class that is a wrapper around the Decode/Filter function that @andrewvc spoke about earlier - it should be stateless and threadsafe.
  • During the register phase, each input registers its deferred work instances with the pipeline in a special lookup cache.
  • The deferred work executor looks at the event metadata for deferred work directives. Using the key field, it fetches the registered instances and hands each one the relevant deferred work section plus the event; the instance then augments the event according to the directives (see the sketch after this list).
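
A minimal sketch of that lookup-and-execute step, assuming a pipeline-level registry
keyed by the metadata key field; every class, method and field name below is
hypothetical, not existing Logstash internals:

# Hypothetical deferred work executor for the scheme described above.
class DeferredWorkExecutor
  def initialize(registry)
    # e.g. { "logstash-input-tcp/json" => decoder_instance, ... }
    @registry = registry
  end

  # Called per batch, inside the worker thread loop, after reading from the PQ.
  def execute(events)
    events.each do |event|
      ["local-decorator", "decoder"].each do |section_name|
        section = event.get("[@metadata][#{section_name}]")
        next unless section && section["key"]

        instance = @registry[section["key"]]
        # Registered instances are stateless and threadsafe, so one instance
        # can safely be shared across worker threads.
        instance.augment(event, section) if instance
      end
    end
  end
end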

Problems:

  • The config can be reloaded, or Logstash restarted with a different config than the one used for some events still remaining in the PQ - so the lookup cache must have a secondary mechanism to find a LocalDecorator instance for any input.
  • As Event Milling has happened before the PQ, we can't support situations where a field appearing in the event as a result of decoding requires further milling and event generation.
  • If we are willing to relax the guiding principle, one could argue that global decoration can easily be done in the input before the PQ.
  • In addition to the above, if the local decoration is not dependent on fields added to the event after the deferred decode stage, then it can be done in the input too.

Advantages:

  • It may not be necessary to refactor all inputs in one go.

@andrewvc
Contributor Author

Raw notes on a new plan:

  • If codecs are defined in inputs, warn the user. This will be removed in the future.
  • Make the compiler compile codec+type into a special if block + filter in the filter section.
  • File a ticket to document suspected edge cases with queue issues during restarts / failures.

Plan for codecs:
If you specify a codec without a type, we will warn you and execute the codec before the queue, as today.
Codecs will actually be filters.

Figure out if we need Raw event types

New flow

input -> mill -> charset processor -> queue -> filters -> outputs

If they are using multiline codecs or other removed features, display a message:
You're using X feature. This is no longer supported. Please read the release notes at ....

# Will produce events converted from cp1252 -> UTF8 using the multiline mill
input {
  file {
    path => "/my/path"
    mill => multiline { ... }
    charset => "cp1252"
  }
}

# Will log a warning about using a codec. Data chunks go through the default 'line' mill, then the default UTF-8 charset handling, then through the JSON filter (declared by the codec) before the queue
input {
  file {
    path => "/my/path"
    codec => json
  }
}

# Will log a warning about using a codec. Will create a 'filter' block that runs all 'funky' things through the JSON filter
input {
  file {
    codec => json
    type => funky
  }
}
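
For illustration, the compiled result of that last example could look roughly like
the filter block below; the exact generated conditional is an assumption, since the
compilation step is only sketched in the notes above:

# Hypothetical compiled equivalent of codec => json with type => funky
filter {
  if [type] == "funky" {
    json {
      source => "message"
    }
  }
}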

will edit and reformat soon
