Event Mills or how to turn a text stream into an Event stream #4858
One thing I'd like to clarify (maybe this should be in another discussion): in our last discussion about the persistence interface there seemed to be some confusion about the decoupling, using the PQ, between the input+milling stage and the filter+output stage.
Yes, it is a less understood concept.
Thanks to @purbon for questions about multiline XML docs that have no …
NOTE: I removed the comment that showed POC FSM code - it is too premature for them. |
I will analyse the input plugins to see how Event Mills may be used, and what changes the inputs require so that they can yield Events to the PQ with minimum effort.
We need to differentiate when byte oriented data is plain text, where character or pattern boundary detection may be applied, and if/when it must be decoded first. We may need a user directive to tell us whether a chunk of byte oriented data is actually a full event that can be decoded after the queue, or whether it is an encoded chunk that needs decoding and milling to find the events within it. Many inputs use local decoration of the Event. It will be problematic to include enough metadata or directives (i.e. context) in the event before the PQ such that a generic decoration function can be applied to the Event after it is read from the PQ in the filter-output stage. There also needs to be provision for charset conversion before any Event Mill. For illustration purposes:
- Discussion Point from A(i) above
- Discussion Point from A(ii) above
- Discussion Point from B above
- Discussion Point:

{"message": "log line 1::-::log line 2::-::log line 3::-::log line 4::-::"}

Where …
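Reading the example: assuming `::-::` is the event boundary, a minimal plain-Ruby sketch (hypothetical names, not a proposed API) of a pattern mill that would split such a payload per identity before anything reaches the PQ:

```ruby
# A sketch of a pattern mill: one pending buffer per identity, emitting
# each complete message found between occurrences of the delimiter.
class PatternMill
  def initialize(delimiter, &emit)
    @delimiter = delimiter
    @buffers = Hash.new { |h, k| h[k] = "" }  # identity => pending bytes
    @emit = emit
  end

  # Feed a chunk of byte oriented data for one identity; complete messages
  # are emitted and any trailing partial is held for the next chunk.
  def feed(identity, data)
    buffer = @buffers[identity] << data
    while (index = buffer.index(@delimiter))
      @emit.call(identity, buffer.slice!(0, index))
      buffer.slice!(0, @delimiter.length)
    end
  end
end

mill = PatternMill.new("::-::") { |id, msg| puts "#{id}: #{msg}" }
mill.feed("source-1", "log line 1::-::log li")
mill.feed("source-1", "ne 2::-::")  # split chunks are reassembled
# prints: source-1: log line 1
#         source-1: log line 2
```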
- Inputs::Beats
- Inputs::CouchDBChanges
- Inputs::Elasticsearch
- Inputs::EventLog
- Inputs::Exec
- Inputs::File
- Inputs::Ganglia
- Inputs::Gelf
- Inputs::Generator
- Inputs::Graphite (Discussion Point)
- Inputs::Http
Here's an oddity for ya -- I have a little USB stick that talks to my power meter at home to gather power usage. The interface it presents when plugged in to my computer is a serial port that emits XML documents continuously. I wonder, for XML documents in general, if it would make sense to have an XML document mill? I use REXML::Parsers::StreamParser; it's probably slow, but it does let me stream XML documents and emit them as each document is completed. Something to think about.
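For what it's worth, a rough sketch of such a mill on top of REXML::Parsers::StreamParser: it tracks element depth and emits each completed top-level document. The class and its callback wiring are illustrative, not an existing plugin:

```ruby
require "rexml/parsers/streamparser"
require "rexml/streamlistener"

# Rebuilds each top-level XML document from stream events and emits it
# when the root element closes.
class XmlDocMill
  include REXML::StreamListener

  def initialize(&emit)
    @depth = 0
    @buffer = ""
    @emit = emit
  end

  def tag_start(name, attrs)
    @depth += 1
    @buffer << "<#{name}#{attrs.map { |k, v| %( #{k}="#{v}") }.join}>"
  end

  def text(data)
    @buffer << data
  end

  def tag_end(name)
    @buffer << "</#{name}>"
    @depth -= 1
    return unless @depth.zero?  # root element closed: one whole document
    @emit.call(@buffer)
    @buffer = ""
  end
end

listener = XmlDocMill.new { |doc| puts "event: #{doc}" }
REXML::Parsers::StreamParser.new("<reading><watts>512</watts></reading>", listener).parse
# prints: event: <reading><watts>512</watts></reading>
```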
@jordansissel since we're JRuby-only, maybe we should use a Java parser? REXML is famously slow and not perfectly conforming. Maybe https://docs.oracle.com/javase/tutorial/jaxp/stax/ ? The API isn't terrible. I've written a Wikipedia XML parser using it: https://github.com/andrewvc/wikiparse/blob/java/src/main/java/wikielastic/wiki/WikiParser.java
Yeah, I don't have opinions on the implementation, just wanted to offer the use case.
Inputs::HTTP_Poller
@jordansissel, @andrewvc:

Case B: …

Case A: …

For A2 we will tokenise twice.
Another interesting twist with pattern boundary detection is whether it is line or byte oriented. If it is byte oriented and exclusive, then the Event messages look like this... If it is line oriented and exclusive, then the Event messages look like this... (see the sketch below).
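A small illustration of that twist, assuming an exclusive delimiter `--EOE--` (the data and names are made up):

```ruby
data = "event one\n--EOE--\nevent two part a\nevent two part b\n--EOE--\n"

# Byte oriented: the delimiter may appear anywhere in the byte stream, so
# a message is simply the bytes between delimiters, newlines included.
byte_messages = data.split("--EOE--").reject(&:empty?)
# => ["event one\n", "\nevent two part a\nevent two part b\n", "\n"]

# Line oriented: split into lines first; the delimiter must occupy a whole
# line, and the lines between delimiter lines are joined into one message.
line_messages = []
current = []
data.each_line do |line|
  if line.chomp == "--EOE--"
    line_messages << current.join unless current.empty?
    current = []
  else
    current << line
  end
end
line_messages << current.join unless current.empty?
# => ["event one\n", "event two part a\nevent two part b\n"]
```

Note how the byte oriented result keeps the newlines surrounding the delimiter inside the messages (plus a trailing remnant), while the line oriented result yields clean per-event messages.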
@guyboertje the HTTP poller is not JSON only. I've used it in the past to deal with CSV. It should just return plain data as a string. Users can use a JSON filter if needed.
@andrewvc - thanks for the update.
As we proposed in the last meeting, a new config option …

I suggest, after some analysis and convergence talk with the Beats team, that we need to define a channel inside the …

However:

…
Will generate an invalid config error.

Generic apache log file, using a < 5.0 config:
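For illustration (the exact example isn't reproduced here; the path is an assumption), a pre-5.0 file input for an apache log would typically be just:

```
input {
  file {
    path => "/var/log/apache2/access.log"
  }
}
```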
Will add a line mill to the input at register, because file will produce bytes and each event is one line.

Generic apache log file, using a >= 5.0 config:
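A sketch of how the >= 5.0 form might look under this proposal; the `mill` option is hypothetical, sketching the idea rather than a released setting:

```
input {
  file {
    path => "/var/log/apache2/access.log"
    mill => { "type" => "line" }
  }
}
```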
For a file of pretty-printed, comma-separated JSON objects:
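A sketch for that case with the same hypothetical `mill` option, using multiline-style boundary detection so each pretty-printed object becomes one event (keys and values are assumptions):

```
input {
  file {
    path => "/var/log/objects.json"
    mill => {
      "type"     => "multiline"
      "pattern"  => "^\{"
      "boundary" => "start"
    }
  }
}
```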
I'm working on a codec plugin that would be much better implemented with the help of a mill.
@zslayton we haven't moved forward with the mills concept yet, and there is no short-term plan for it either (that does not mean it will not happen at some point).
Motivation
After talking about core changes to include a persistent queue, we decided to divide up some functionality that now lives in the inputs, putting some of it before the Persistent Queue (PQ) and some after.
We will remove the inconsistency whereby some input sources provide byte oriented data and others provide line oriented data. We will ensure that all inputs that can do so provide byte oriented data.
Any inputs that naturally provide Event streams will not change.
The concepts of line and multiline as codecs are deprecated, because they are boundary detectors and not decoders. Codecs will be split into decoders and encoders, both available in the same LS library. Decoders are specifically for protocol/format handling.
Decoders go after the PQ.
Event boundary detection.
In byte oriented data we need to find where each event starts and stops. Most of the time this is at newline (LF) characters, but not always. In some cases the event boundaries span multiple lines. I have some POC state machines that allow for continuous detection of both line and multiline boundaries.
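The POC state machine code isn't included here, but a minimal plain-Ruby sketch of the line case shows the shape of such a machine (illustrative, not the actual POC):

```ruby
# A line state machine: accepts arbitrary chunks, returns the complete
# lines found so far, and keeps the trailing partial line for next time.
class LineFSM
  def initialize
    @partial = ""
  end

  # Returns the array of complete lines contained in this chunk.
  def accept(data)
    lines = (@partial + data).split("\n", -1)
    @partial = lines.pop  # "" after a trailing newline, else the partial
    lines
  end
end

fsm = LineFSM.new
fsm.accept("alpha\nbra")  # => ["alpha"]
fsm.accept("vo\n")        # => ["bravo"]
```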
Identity management.
When looking for event boundaries in byte oriented data, chunks from different origins must be kept separate by a property: identity. In the case of the File Input each file is a different origin; in the case of the TCP Input we could receive byte oriented data from any origin on any connection, so ideally the far end should transmit the identity.
Event Mills
An Event Mill is used by the Input: byte oriented data is fed in, and Events should come out the other side. Based on the LS Input config it should know whether to include multiline capabilities. The Mill should be called with an identity and some bytes. Internally it should create a new machine per identity. For line and multiline it should look like this (a sketch follows the diagrams):
Input -> (identity, byte oriented data) -> LineFSM -> (line) -> Input [callback] -> (hash) -> Eventifier -> (event) -> PQ
or
Input -> (identity, byte oriented data) -> MultilineFSM -> (lines as one string) -> Input [callback] -> (hash) -> Eventifier -> (event) -> PQ
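As a concrete shape for the line flow, a minimal sketch assuming the LineFSM from the earlier comment; the mill and callback names are illustrative, and the callback stands in for the Eventifier and PQ hand-off:

```ruby
# Keeps one LineFSM per identity and hands each finished line, as a hash,
# to a callback standing in for the Eventifier + PQ stage.
class LineMill
  def initialize(&emit)
    @machines = Hash.new { |h, id| h[id] = LineFSM.new }
    @emit = emit
  end

  def feed(identity, data)
    @machines[identity].accept(data).each do |line|
      @emit.call("identity" => identity, "message" => line)
    end
  end
end

pq = Queue.new  # stand-in for the Persistent Queue
mill = LineMill.new { |hash| pq << hash }  # Eventifier would wrap this hash in an Event
mill.feed("/var/log/a.log", "one\ntwo\npart")
pq.size  # => 2 ("part" waits for its newline)
```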
If possible the Event Mill will be written as a JRuby extension.
Summary:
Data makes a Journey via some Transport mechanism from the Origin to the Mill to the PQ Storage.