
Event Mills or how to turn a text stream into an Event stream #4858

Open · guyboertje opened this issue Mar 21, 2016 · 28 comments

@guyboertje (Contributor)

Motivation

After talking about core changes to include a persistent queue, we decided to divide up some functionality that currently lives in the inputs, putting some before the Persistent Queue (PQ) and some after.

We will remove the inconsistency whereby some input sources provide byte oriented data and others provide line oriented data. We will ensure that all inputs that can provide byte oriented data will do so.

Any inputs that naturally provide Event streams will not change.

The concepts of line and multiline as codecs are deprecated because they are boundary detectors and not decoders. Codecs will be split into decoders and encoders, both available in the same LS library. Decoders are specifically for protocol/format handling.

Decoders go after the PQ.

Event boundary detection

In byte oriented data we need to find where each event starts and stops. Most of the time this is at newline (LF) characters, but not always. In some cases the event boundaries span multiple lines. I have some POC state machines that allow for continuous detection of both line and multiline boundaries.
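
For illustration only (not the actual POC code), a minimal Ruby sketch of such a detector, using the LineFSM name that appears later in this issue; the key point is that partial data must be carried across chunk boundaries:

```ruby
# Minimal sketch of a line boundary detector. It consumes arbitrary byte
# chunks and emits complete lines, carrying partial data across chunks.
class LineFSM
  def initialize
    @buffer = ""
  end

  # Feed a chunk; yields each completed line without its trailing LF.
  def feed(chunk)
    @buffer << chunk
    while (idx = @buffer.index("\n"))
      yield @buffer.slice!(0..idx).chomp("\n")
    end
  end
end

fsm = LineFSM.new
fsm.feed("partial li") { |line| p line } # nothing emitted yet
fsm.feed("ne\nnext\n") { |line| p line } # "partial line" then "next"
```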

Identity management

When looking for event boundaries in byte oriented data, chunks from different origins must be kept separate by a property: identity. In the case of the File Input each file is a different origin, and in the case of the TCP Input we could receive byte oriented data from any origin on any connection, so ideally the far end should transmit the identity.

Event Mills

An Event Mill is the component an Input feeds byte oriented data into; Events should come out the other side. Based on the LS Input config it should know whether to include multiline capabilities. The Mill should be called with an identity and some bytes. Internally it should create a new machine per identity. For line and multiline it should look like this:
Input -> (identity, byte oriented data) -> LineFSM -> (line) -> Input [callback] -> (hash) -> Eventifier -> (event) -> PQ
or
Input -> (identity, byte oriented data) -> MultilineFSM -> (lines as one string) -> Input [callback] -> (hash) -> Eventifier -> (event) -> PQ

If possible the Event Mill will be written as a JRuby extension.
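
As a hedged sketch (names illustrative, not an existing Logstash API), the mill interface described above could look like this in Ruby, reusing the LineFSM sketch from the earlier comment:

```ruby
# One boundary detector instance per identity, so interleaved chunks from
# different origins (files, TCP connections) never mix.
class EventMill
  def initialize(fsm_class, &callback)
    @machines = Hash.new { |hash, identity| hash[identity] = fsm_class.new }
    @callback = callback
  end

  # identity: e.g. a file path or TCP peer; data: a chunk of bytes
  def accept(identity, data)
    @machines[identity].feed(data) do |text|
      @callback.call(identity, text) # -> Input callback -> Eventifier -> PQ
    end
  end
end

mill = EventMill.new(LineFSM) { |id, line| puts "#{id}: #{line}" }
mill.accept("/var/log/a.log", "one\ntw")
mill.accept("/var/log/b.log", "alpha\n") # b.log does not disturb a.log
mill.accept("/var/log/a.log", "o\n")     # completes "two" for a.log
```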

Summary:

Data makes a Journey via some Transport mechanism from the Origin to the Mill to the PQ Storage.

@colinsurprenant (Contributor)

One thing I'd like to clarify (maybe this should be in another discussion): in our last discussion about the persistence interface there seemed to be some confusion about the decoupling that the PQ provides between the input+milling and filter+output stages.

@guyboertje (Contributor Author)

Yes, it is a less understood concept.

@guyboertje (Contributor Author)

Thanks to @purbon and his questions about multiline XML docs that have no \n character at the end of the file, we know we need a multiline (pattern) boundary detector that operates on raw byte data, and not only on data that has previously been put through a line (character) boundary detector.

@guyboertje (Contributor Author)

NOTE: I removed the earlier comment that showed POC FSM code; it was too premature.

@guyboertje (Contributor Author)

I will analyse the input plugins to see how Event Mills may be used, and what changes each input requires to yield Events to the PQ with minimum effort.

@guyboertje (Contributor Author)

We need to differentiate between byte oriented data that is plain text, where character or pattern boundary detection can be applied directly, and data that must be decoded first.

We may need a user directive to tell us whether a chunk of byte oriented data is actually a full event that can be decoded after the queue, or whether it is an encoded chunk that needs decoding and milling to find the events within it.

Many inputs use local decoration of the Event. It will be problematic to include enough metadata or directives (i.e. context) in the event before the PQ such that a generic decoration function can be applied to the Event after it is read from the PQ in the filter-output stage.

There needs to be provision for charset conversion before any Event Mill.

For illustration purposes:

  • the terms LocalDecorate and GlobalDecorate refer to the decorate functions that are applied to Events in an Input prior to the PQ.
  • the terms NewlineMill and PatternMill refer to Mills that do \n boundary detection and /regex/ pattern boundary detection (multiline) respectively.
  • the terms JsonDecode and ProtocolDecode refer to the decode function that is applied to either A) the raw bytes, (i) when no event boundary detection is needed or (ii) before any boundary detection; or B) the event message text emitted by an Event Mill, when directed that the event message source is a serialised hash-like structure.

Discussion Point from A(i) above
Is this a special case of B?

Discussion Point from A(ii) above
How much of a requirement is it to unpack the raw bytes into text so that boundary detection can be done on it?

Discussion Point from B above
Whether ProtocolDecode should occur before the Event is generated and persisted, or after the Event is taken from the PQ for further processing, depends on whether LocalDecorate and GlobalDecorate can also be moved to after the PQ. From an input, is it feasible to register its Decorator class (and Decoder class if directed) with a lookup structure in the pipeline, so that a worker thread would be able to find, use and cache these classes by using a field in the metadata of each event?
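
For illustration, a minimal Ruby sketch of such a lookup structure; the class, its methods and the [@metadata][decorator] field are all assumptions here, not an existing pipeline API:

```ruby
# Hypothetical pipeline-level registry: inputs register decorator classes at
# register time; workers resolve and cache them via an event metadata field.
class DecoratorRegistry
  def initialize
    @classes = {}
    @instances = {} # worker-side cache of instantiated decorators
  end

  def register(key, decorator_class)
    @classes[key] = decorator_class
  end

  # Worker thread: find the decorator named in the event's metadata.
  def for_event(event)
    key = event.get("[@metadata][decorator]")
    @instances[key] ||= @classes.fetch(key).new
  end
end
```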

Discussion Point
There is a very small chance that a serialised hash contains the raw bytes in a field and that the raw bytes would need event boundary detection and event generation, i.e. Milling. Is it feasible to Mill and generate secondary Events after the PQ? How would the worker thread be directed to do this? Would we disallow the use of PatternMills (multiline), or only cater for pattern boundary detection within the raw bytes of one Event?
e.g.

{"message": "log line 1::-::log line 2::-::log line 3::-::log line 4::-::"}

Where ::-:: is the boundary pattern
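
A small sketch of what that secondary milling could look like, assuming the event has already been read from the PQ and the message field extracted:

```ruby
require 'json'

# Split one event's message field on the fixed boundary pattern above.
raw = '{"message": "log line 1::-::log line 2::-::log line 3::-::log line 4::-::"}'
message = JSON.parse(raw)["message"]
secondary_events = message.split("::-::")
# => ["log line 1", "log line 2", "log line 3", "log line 4"]
```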

@guyboertje (Contributor Author)

Inputs::Beats

  • underlying data source produces a hash and an identity
  • adds contextual tags
  • change: use milling based on directives with LocalDecorate and GlobalDecorate

@guyboertje (Contributor Author)

Inputs::CouchDBChanges

  • underlying data source produces line oriented json encoded chunks.
  • uses buftok to line orient the data
  • uses Json.load directly
  • adds special fields to hash
  • adds special metadata to hash
  • caches sequence id to disk
  • change: remove buftok and Json
  • change: use NewlineMill, JsonDecode with LocalDecorate and GlobalDecorate

@guyboertje (Contributor Author)

Inputs::Elasticsearch

  • underlying data source produces a hash with the event data hash in a key called _source
  • adds normal decoration
  • adds conditional decoration based on docinfo config option
  • conditional decoration targets @metadata or custom field
  • change: no Mills required
  • change: use LocalDecorate and GlobalDecorate

@guyboertje (Contributor Author)

Inputs::EventLog

  • suggest: retire this plugin in favor of WinLogBeat
  • underlying data source produces a hash
  • uses an orig_key => new_key map to transform hash into an Event
  • change: no Mills required
  • change: use LocalDecorate and GlobalDecorate

@guyboertje (Contributor Author)

Inputs::Exec

  • underlying data source produces bytes.
  • uses any codec to produce event.
  • adds special fields command and host to the hash
  • change: use milling based on directives with LocalDecorate and GlobalDecorate
  • change: may need to use ProtocolDecode if the bytes need unpacking.

@guyboertje (Contributor Author)

Inputs::File

  • underlying data source produces lines with identity (path)
  • uses any codec to produce events
  • adds special fields path and host to the hash
  • change: data source should provide bytes, identity and read position
  • change: data source should accept new position update for sincedb progress tracking
  • design: Mills must provide positional offsets when passing text for Eventifying.
  • change: use milling based on directives with LocalDecorate and GlobalDecorate

@guyboertje (Contributor Author)

Inputs::Ganglia

  • underlying data source produces binary and an ip address struct
  • only some packets received on the socket are eligible for event production - meta and data
  • the receiver holds meta in local state while waiting for a data packet
  • uses custom code to protocol decode the source bytes
  • adds special field host (the source IP address) to the hash
  • design: this is a fine example of a data source that requires a decoder before events can be generated.

@guyboertje (Contributor Author)

Inputs::Gelf

  • underlying data source produces a json encoded string.
  • conditionally renames event keys.
  • conditionally remaps event keys.
  • change: capture the JSON string in the event message and store it in the PQ; direct to use the JSON decoder.
  • change: one of: use the JSON decoder after the PQ, or use Event#from_json to build the event.
  • change: use LocalDecorate and GlobalDecorate.

@guyboertje (Contributor Author)

Inputs::Generator

  • underlying data source produces lines from a configurable set or stdin
  • adds special fields host and sequence to the hash
  • change: persist line directly as an event message, store host and sequence in metadata
  • change: use LocalDecorate and GlobalDecorate

@guyboertje (Contributor Author)

Inputs::Graphite

  • is a subclass of Inputs::Tcp
  • rudimentary decoder on message field
  • conditionally create timestamp (timestamp is created at ingest time not filter time)
  • change: option to create a Graphite decoder for after the PQ.

Discussion Point
For sources that do not supply a timestamp in the data, do we need an 'ingest' timestamp in the metadata (do we put one in regardless)?

@guyboertje (Contributor Author)

Inputs::Http

  • underlying data source produces bytes for a mime-type.
  • mime-type may be used to determine which decoder should be used.
  • adds special fields to hash host and headers.
  • change: create simple Event, message is source bytes.
  • change: add host and headers to Event metadata.
  • change: use LocalDecorate and GlobalDecorate; lookup decoder from mime-type or none

@jordansissel (Contributor)

> Thanks to @purbon with questions about multiline xml docs that have no \n character at the end of the file

Here's an oddity for ya -- I have a little USB stick that talks to my power meter at home to gather power usage. The interface it presents when plugged in to my computer is a serial port that emits XML documents continuously. I wonder, for XML documents in general, if it would make sense to have an XML document mill? I use REXML::Parsers::StreamParser; it's probably slow, but it does let me stream XML documents and emit each one as it is completed. Something to think about.
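
A rough sketch of that idea, assuming each document has a single root element: track element depth with a stream listener and treat depth returning to zero as a document boundary. (A stream of concatenated documents is not one well-formed document, so in practice the parser would need restarting per document; xml_source below is a stand-in for the serial port IO.)

```ruby
require 'rexml/parsers/streamparser'
require 'rexml/streamlistener'

# Fires a callback whenever the root element of a document closes.
class DocBoundaryListener
  include REXML::StreamListener

  def initialize(&on_document)
    @depth = 0
    @on_document = on_document
  end

  def tag_start(name, attrs)
    @depth += 1
  end

  def tag_end(name)
    @depth -= 1
    @on_document.call(name) if @depth.zero? # root element closed
  end
end

listener = DocBoundaryListener.new { |root| puts "document complete: <#{root}>" }
REXML::Parsers::StreamParser.new(xml_source, listener).parse
```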

@andrewvc (Contributor)

@jordansissel since we're jruby only maybe we should use a Java parser? REXML is famously slow / not perfectly conforming. Maybe https://docs.oracle.com/javase/tutorial/jaxp/stax/ ?

The API isn't terrible. I've written a Wikipedia XML parser using it: https://github.com/andrewvc/wikiparse/blob/java/src/main/java/wikielastic/wiki/WikiParser.java

@jordansissel (Contributor)

Yeah, I don't have opinions on the implementation, just wanted to offer another use case (milling XML documents). +1 on avoiding REXML for speed reasons.


@guyboertje (Contributor Author) commented Apr 13, 2016

Inputs::HTTP_Poller

  • underlying data source produces a protocol encoded or plain string on success and a hand crafted Event on failure.
  • conditionally adds special fields/tags/metadata to Event (too numerous to list here)
  • change: drop codec, use Event#from_json to build success event.
  • change: use LocalDecorate before the PQ and GlobalDecorate after; because of the highly conditional way local decoration is done, far too much context would otherwise need persisting.
  • change: OTOH, it can be argued that it is not too much context and the LocalDecorate can occur after the PQ.

@guyboertje (Contributor Author)

@jordansissel, @andrewvc:
For me the biggest unanswered question is whether the user wants to A) decode the XML into an Event, or B) simply put the XML string into a message field of an Event and output it as such.

Case B:

  • we need to detect the start and end positions of the XML doc and extract the text between the two.
  • looking at StAX, it seems easy to get start_document and end_document callbacks, but it's a bit harder to get the location from the line and column query. The docs say the location is approximate because of entity expansion, and it is only valid after the start_document callback returns.
  • put the excerpt in the message field of a new Event.

Case A:
one of...

  1. do a full XML stream -> new Event decode in the input and no Milling is required.
  2. do B and decode into existing Event after the PQ.

For A2, we will tokenise twice.
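
For Case B, a JRuby sketch using the JDK's built-in StAX reader to excerpt one document by character offsets; the helper name is hypothetical and, as noted above, the offsets are only approximate:

```ruby
require 'java'
java_import 'javax.xml.stream.XMLInputFactory'

# buffered: a String already known to contain exactly one XML document.
def xml_excerpt(buffered)
  reader = XMLInputFactory.new_instance.create_xml_stream_reader(
    java.io.StringReader.new(buffered)
  )
  start_offset = reader.location.character_offset # valid after start_document
  reader.next while reader.has_next               # drain to end_document
  end_offset = reader.location.character_offset
  buffered[start_offset...end_offset]             # text for the message field
end
```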

@guyboertje (Contributor Author)

@jordansissel, @andrewvc:

Another interesting twist with pattern boundary detection is whether it is line or byte oriented.
Example:
Pattern: start: "--- begin ---", end: "--- end ---"
File:

--- begin ---\n
line 1\n
line 2\n
--- end ---\n
garbage 1\n
--- begin ---\n
line 3\n
--- end ---\n

if it is byte oriented and exclusive then the Event messages look like this...
\nline 1\nline 2\n and \nline 3\n
if it is byte oriented and inclusive then the Event messages look like this...
--- begin ---\nline 1\nline 2\n--- end ---\n and --- begin ---\nline 3\n--- end ---\n

if it is line oriented and exclusive then the Event messages look like this...
line 1\nline 2 and line 3
if it is line oriented and inclusive then the Event messages look like this...
--- begin ---\nline 1\nline 2\n--- end --- and --- begin ---\nline 3\n--- end ---
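
A Ruby sketch of the byte oriented variants, with an inclusive switch; the method name and options are illustrative only:

```ruby
BEGIN_MARK = "--- begin ---"
END_MARK   = "--- end ---"

# Byte oriented pattern milling: scan the raw buffer for begin/end pairs.
def pattern_mill(data, inclusive: false)
  if inclusive
    data.scan(/#{Regexp.escape(BEGIN_MARK)}.*?#{Regexp.escape(END_MARK)}/m)
  else
    data.scan(/#{Regexp.escape(BEGIN_MARK)}(.*?)#{Regexp.escape(END_MARK)}/m)
        .map(&:first)
  end
end

file = "--- begin ---\nline 1\nline 2\n--- end ---\ngarbage 1\n" \
       "--- begin ---\nline 3\n--- end ---\n"
pattern_mill(file)                  # => ["\nline 1\nline 2\n", "\nline 3\n"]
pattern_mill(file, inclusive: true) # markers included, "garbage 1" dropped
```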

@andrewvc (Contributor)

@guyboertje the HTTP poller is not JSON only. I've used it in the past to deal with CSV. It should just return plain data as a string. Users can use a JSON filter if needed.

@guyboertje (Contributor Author)

@andrewvc - thanks for the update.

@guyboertje (Contributor Author)

As we proposed in the last meeting, a new config option mill is required.

I suggest, after some analysis and convergence talks with the Beats team, that we define a channel inside the mill config, composed of compute elements that let the user specify exactly the transforms required for their source data and source input.

However:

input {
  file {
    codec => line | multiline | json_lines
    path => ...
    ...
    type => "sometype"
  }
}

This will generate an invalid config error.

Generic apache log file - using a < 5.0 config

input {
  file {
    path => ...
    ...
    type => "sometype"
  }
}

This will add a line mill to the input at register time, because file will produce bytes and each event is one line.

Generic apache log file - using a >= 5.0 config

input {
  file {
    path => ...
    ...
    type => "sometype"
  }
  mill => {
    encoding {
      charset => UTF-8
      force => true
    }
    line { end => LF }
  }
}

For a file of pretty printed JSON objects, comma separated:

{
  ...
},
{
  ...
},
{
  ...
}

input {
  file {
    path => ...
    ...
    type => "sometype"
  }
  mill => {
    encoding {
      charset => UTF-8
      force => true
    }
    line { end => LF }
    multiline {
      begin => "\A{\z"
      end => "\A},?\z"
      inclusive => true
    }
  }
}

filter {
  if [type] == "sometype" {
    json {
      source => "message"
    }
  }
}

@zslayton
I'm working on a codec plugin that would be much better implemented with the help of a mill. Is this improvement still planned? I notice that many of the issues related to updating the codec model haven't been updated since mid-2016.

@colinsurprenant (Contributor)

@zslayton we haven't moved forward with the mills concept yet and there is no short-term plan for it either (that does not mean it will not happen at some point).
