
16k buffer limit causes \n in message #16

Open
PhaedrusTheGreek opened this issue Nov 17, 2017 · 12 comments
@PhaedrusTheGreek

After 16 kB of input, a line break is inserted into the message due to a buffer limit:

per @jakelandis:

It is the stdin input which uses a 2^14 (16,384-byte) read buffer here.

This problem makes it difficult to use stdin for a "read files, then exit Logstash" use case.

@jordansissel
Contributor

I'm confused. This is not a buffer. We do a 16 kB read and pass it to the codec. This is how the tcp input works as well. Can you share a way to reproduce this?

@PhaedrusTheGreek
Author

Yep, I will pass you a reproduction offline

@jordansissel
Contributor

I tried to reproduce this, but the description doesn't give me enough detail to reproduce the behavior you are talking about.

Here's my attempt:

  • Write a single line, longer than 16 kB, to stdin, and have Logstash write this line to a file:
% rm /tmp/x; ruby -e 'puts "x" * (16384+10) + "\n"' | bin/logstash -e 'input { stdin { codec => line } } output { file { path => "/tmp/x" codec => line { format => "%{message}" } } }'
  • Check the file:
% wc -l /tmp/x
1 /tmp/x
% wc -c /tmp/x
16395 /tmp/x

1 line, 16395 bytes.

% ruby -e 'puts "x" * (16384+10) + "\n"' | wc -c
16395

Identical to the input, and longer than 16 kB.

@jakelandis
Contributor

@jordansissel
Contributor

@jakelandis let me rephrase. We don't buffer two reads together. We do 1 read, then ship that data into the codec immediately. The contents of the buffer are (should be?) lost after that because the codec has done its work.

Buffering, in this context, means this strategy:

data = read(16384)
data << read(16384)
data << read(16384)

^^ data is buffering multiple reads.

In stdin, we do this:

  1. data = read()
  2. codec.decode(data)
  3. goto 1
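The read-then-decode loop above can be sketched in plain Ruby. This is illustrative only, not the actual stdin plugin source; `pump` and the codec interface here are hypothetical stand-ins:

```ruby
require "stringio"

CHUNK_SIZE = 16_384  # 2**14, the read size discussed above

# Read one chunk at a time and hand it straight to the codec; nothing is
# carried over between reads, so no multi-read buffering happens here.
def pump(io, codec, &emit)
  while (data = io.read(CHUNK_SIZE))
    codec.decode(data, &emit)
  end
end
```

With a 20,000-byte input, the codec sees two decode() calls of 16,384 and 3,616 bytes; any line straddling that boundary arrives split across two calls.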

@jordansissel
Contributor

This is not a bug in stdin. It's not stated in this issue, but I believe the context is the use of the multiline codec with stdin.

And yes, stdin+multiline is broken. It's a bug caused by the fact that multiline was primarily written for the file input, and the file input itself is already line-oriented. stdin is not.

Examples of this working with stdin + the line codec:

# Write 10 large lines (>16kb per line)
% rm /tmp/x; ruby -e '10.times { puts "x" * (16384+10) + "\n" }' | bin/logstash -e 'input { stdin { codec => line } } output { file { path => "/tmp/x" codec => line { format => "%{message}" } } }'

# check the result
% wc -l /tmp/x
10 /tmp/x

Now let's test with multiline codec:

% rm /tmp/x; ruby -e '10.times { puts "x" * (16384+10) + "\n" }' | bin/logstash -e 'input { stdin { codec => multiline { pattern => "^\\s+" what => previous   } } } output { file { path => "/tmp/x" codec => line { format => "%{message}" } } }'
...
% wc -l /tmp/x
19 /tmp/x

^^ 19 lines? There are clearly only 10 lines provided on input, and none match the /^\s+/ multiline, so there should be 10 lines output.

Strange!

The tcp input works the same way (read a chunk of bytes, hand to codec):

% rm /tmp/x;bin/logstash -e 'input { tcp { port => 10000 codec => multiline { pattern => "^\\s+" what => previous   } } } output { file { path => "/tmp/x" codec => line { format => "%{message}" } } }'
...
% ruby -e '10.times { puts "x" * (16384+10) + "\n" }' | nc localhost 10000

...
% wc -l /tmp/x
34 /tmp/x

But notice, if I just use the line codec on tcp, it works correctly:

% rm /tmp/x;bin/logstash -e 'input { tcp { port => 10000 codec => line } } output { file { path => "/tmp/x" codec => line { format => "%{message}" } } }'
...
% wc -l /tmp/x
10 /tmp/x

This is a bug in the multiline codec: it was designed around the file input, and it mishandles the way other inputs chunk the data they provide to Logstash (this is an ancient issue, but we can fix it now if desired).

@jordansissel
Contributor

The bug is in the multiline codec, which assumes each decode() call contains a complete event (or a complete sequence of events) -- https://github.com/logstash-plugins/logstash-codec-multiline/blob/master/lib/logstash/codecs/multiline.rb#L198

@jordansissel
Contributor

The multiline codec causes this bug because it makes some false assumptions (this is a really old bug, from what I remember).

https://github.com/logstash-plugins/logstash-codec-multiline/blob/master/lib/logstash/codecs/multiline.rb#L198

  def decode(text, &block)
    text = @converter.convert(text)
    text.split("\n").each do |line|

This causes a bug because a partial read (of a full event) will be sent like this:

If we said "hello world" and only read "hello" so far:

decode("hello")

which calls "hello".split("\n") which returns ["hello"] as the "line".

The multiline codec needs to buffer until it has at least a full line, but it does not do this.

The line codec does this, for what it's worth.

Historically, the multiline codec was primarily focused on the file input which is why I think it has this behavior; the file input only emits one whole line at a time, so it accidentally dodges this bug in the multiline codec.
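A minimal sketch of the line buffering being described, assuming a tokenizer-style interface; `LineBuffer` is a hypothetical illustration, not Logstash's actual BufferedTokenizer:

```ruby
# Accumulate raw chunks and only release complete, newline-terminated
# lines; a trailing partial line waits for the next chunk.
class LineBuffer
  def initialize
    @buffer = +""
  end

  # Append a chunk and return the complete lines it closes off.
  def extract(chunk)
    @buffer << chunk
    lines = @buffer.split("\n", -1)
    @buffer = lines.pop  # the last element is the unterminated remainder
    lines
  end

  # Hand back whatever partial data remains (e.g. at end of input).
  def flush
    remainder, @buffer = @buffer, +""
    remainder
  end
end
```

With this in front of split("\n"), decode("hello") would emit nothing, and the later decode(" world\n") would emit the full "hello world" line.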

@colinsurprenant
Contributor

Yes, to echo what @jordansissel said: this is not a stdin input problem but a multiline codec problem. Here is my interpretation of the original problem that led @PhaedrusTheGreek to open this issue:

  • stdin reads a 16k block and passes it to the multiline codec here
  • multiline codec parses all the events correctly up to the last one, which falls on the 16k block boundary
  • since the codec does a text.split("\n").each do |line| on the buffer, the last incomplete line that fell on the 16k boundary has now been converted to a proper string and added to the multiline buffer here

Which explains the rogue \n and the resulting invalid JSON. Obviously, changing the buffer size is not the solution here.
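The boundary failure in the steps above can be demonstrated in a few lines of Ruby, using the same line sizes as the repro earlier in the thread (this is a standalone illustration, not plugin code):

```ruby
# One 16,395-byte line: 16384+10 "x" characters plus a newline,
# matching the ruby -e repro used earlier in the thread.
line = "x" * 16_394 + "\n"

# What a single 16 KiB (2**14) read returns from a stream of such lines:
first_read = (line * 2)[0, 16_384]

# No newline fits inside the first read, so split("\n") yields one
# element: an incomplete fragment that the codec treats as a whole line.
fragments = first_read.split("\n")
```

The remainder of the first line arrives in the next read, which is where the rogue \n ends up inside the reassembled event.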

Our long-standing milling concept would be the long-term solution to this, I believe. Otherwise, as @jordansissel mentioned, maybe just adding line buffering to the multiline codec using the BufferedTokenizer, just like the line codec does, would work here?

@Ghysbrecht

Ran into the same issue today. I am working on a project that uses the stdin input and multiline codec, and many lines are rejected by my grok filter because of the stray newlines. Any news on this issue?

@edperry

edperry commented Sep 11, 2018

FYI, I just ran into this issue too. I added detailed comments in #37.

@colinsurprenant
Contributor

Until we move forward with proper boundary detection across inputs and codecs, I am proposing this solution for the multiline codec: logstash-plugins/logstash-codec-multiline#63
