
16k buffer limit causes \n in message #16

Open
PhaedrusTheGreek opened this issue Nov 17, 2017 · 12 comments
@PhaedrusTheGreek

After 16 kB of input, a line break is inserted into the message due to a buffer limit:

per @jakelandis:

It is the stdin input which uses a 2^14 (16,384-byte) read buffer here.

This problem makes it difficult to use stdin for a "read files, then exit Logstash" use case.

@jordansissel
Contributor

I'm confused. This is not a buffer. We do a 16 kB read and pass it to the codec. This is how the tcp input works as well. Can you share a way to reproduce this?

@PhaedrusTheGreek
Author

Yep, I will pass you a reproduction offline

@jordansissel
Contributor

I tried to reproduce this, but the description doesn't give me enough detail to reproduce the behavior you are talking about.

Here's my attempt:

  • Write a single line, longer than 16 kB, to stdin, and have Logstash write this line to a file:
% rm /tmp/x; ruby -e 'puts "x" * (16384+10) + "\n"' | bin/logstash -e 'input { stdin { codec => line } } output { file { path => "/tmp/x" codec => line { format => "%{message}" } } }'
  • Check the file:
% wc -l /tmp/x
1 /tmp/x
% wc -c /tmp/x
16395 /tmp/x

1 line, 16395 bytes.

% ruby -e 'puts "x" * (16384+10) + "\n"' | wc -c
16395

Identical to the input, and longer than 16 kB.

@jakelandis
Contributor

@jordansissel
Contributor

@jakelandis let me rephrase. We don't buffer two reads together. We do 1 read, then ship that data into the codec immediately. The contents of the buffer are (should be?) lost after that because the codec has done its work.

Buffering, in this context, means this strategy:

data = read(16384)
data << read(16384)
data << read(16384)

^^ data is buffering multiple reads.

In stdin, we do this:

  1. data = read()
  2. codec.decode(data)
  3. goto 1
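The read-then-decode loop above can be sketched in plain Ruby. This is illustrative only, not the actual stdin plugin source; `pump` and the codec interface here are hypothetical stand-ins:

```ruby
require "stringio"

CHUNK_SIZE = 16_384  # 2**14, the read size discussed above

# Read one chunk at a time and hand it straight to the codec; nothing is
# carried over between reads, so no multi-read buffering happens here.
def pump(io, codec, &emit)
  while (data = io.read(CHUNK_SIZE))
    codec.decode(data, &emit)
  end
end
```

With a 20,000-byte input, the codec sees two decode() calls of 16,384 and 3,616 bytes; any line straddling that boundary arrives split across two calls.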

@jordansissel
Contributor

This is not a bug in stdin. It's not stated in this issue, but I believe the context is the use of the multiline codec with stdin.

And yes, stdin+multiline is broken. It's a bug caused by the fact that multiline was primarily written for the file input, and the file input itself is already line-oriented. stdin is not.

Examples of this working with stdin + the line codec:

# Write 10 large lines (>16kb per line)
% rm /tmp/x; ruby -e '10.times { puts "x" * (16384+10) + "\n" }' | bin/logstash -e 'input { stdin { codec => line } } output { file { path => "/tmp/x" codec => line { format => "%{message}" } } }'

# check the result
% wc -l /tmp/x
10 /tmp/x

Now let's test with multiline codec:

% rm /tmp/x; ruby -e '10.times { puts "x" * (16384+10) + "\n" }' | bin/logstash -e 'input { stdin { codec => multiline { pattern => "^\\s+" what => previous   } } } output { file { path => "/tmp/x" codec => line { format => "%{message}" } } }'
...
% wc -l /tmp/x
19 /tmp/x

^^ 19 lines? There are clearly only 10 lines provided on input, and none match the /^\s+/ multiline, so there should be 10 lines output.

Strange!

The tcp input works the same way (read a chunk of bytes, hand to codec):

% rm /tmp/x;bin/logstash -e 'input { tcp { port => 10000 codec => multiline { pattern => "^\\s+" what => previous   } } } output { file { path => "/tmp/x" codec => line { format => "%{message}" } } }'
...
% ruby -e '10.times { puts "x" * (16384+10) + "\n" }' | nc localhost 10000

...
% wc -l /tmp/x
34 /tmp/x

But notice, if I just use the line codec on tcp, it works correctly:

% rm /tmp/x;bin/logstash -e 'input { tcp { port => 10000 codec => line } } output { file { path => "/tmp/x" codec => line { format => "%{message}" } } }'
...
% wc -l /tmp/x
10 /tmp/x

This is a bug in the multiline codec: it was designed around the file input, and it mishandles the way other inputs chunk the data they provide to Logstash (this is an ancient issue, but we can fix it now if desired).

@jordansissel
Contributor

The bug is in the multiline codec, which assumes each decode() call contains a complete event (or a complete sequence of events) -- https://github.com/logstash-plugins/logstash-codec-multiline/blob/master/lib/logstash/codecs/multiline.rb#L198

@jordansissel
Contributor

The multiline codec causes this bug because it makes some false assumptions (this is a really old bug, from what I remember).

https://github.com/logstash-plugins/logstash-codec-multiline/blob/master/lib/logstash/codecs/multiline.rb#L198

  def decode(text, &block)
    text = @converter.convert(text)
    text.split("\n").each do |line|

This causes a bug because a partial read (of a full event) will be sent like this:

If we said "hello world" and only read "hello" so far:

decode("hello")

which calls "hello".split("\n") which returns ["hello"] as the "line".

The multiline codec needs to buffer until it has at least a full line, but it does not do this.

The line codec does this, for what it's worth.

Historically, the multiline codec was primarily focused on the file input which is why I think it has this behavior; the file input only emits one whole line at a time, so it accidentally dodges this bug in the multiline codec.
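A minimal sketch of the line buffering being described, assuming a tokenizer-style interface; `LineBuffer` is a hypothetical illustration, not Logstash's actual BufferedTokenizer:

```ruby
# Accumulate raw chunks and only release complete, newline-terminated
# lines; a trailing partial line waits for the next chunk.
class LineBuffer
  def initialize
    @buffer = +""
  end

  # Append a chunk and return the complete lines it closes off.
  def extract(chunk)
    @buffer << chunk
    lines = @buffer.split("\n", -1)
    @buffer = lines.pop  # the last element is the unterminated remainder
    lines
  end

  # Hand back whatever partial data remains (e.g. at end of input).
  def flush
    remainder, @buffer = @buffer, +""
    remainder
  end
end
```

With this in front of split("\n"), decode("hello") would emit nothing, and the later decode(" world\n") would emit the full "hello world" line.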

@colinsurprenant
Contributor

Yes, to echo what @jordansissel said: this is not a stdin input problem but a multiline codec problem. Here is my interpretation of the original problem that led @PhaedrusTheGreek to open this issue:

  • stdin reads a 16k block and passes it to the multiline codec here
  • multiline codec parses all the events correctly up to the last one, which falls on the 16k block boundary
  • since the codec does a text.split("\n").each do |line| on the buffer, the last incomplete line that fell on the 16k boundary has now been converted to a proper string and added to the multiline buffer here

Which explains the rogue \n and the resulting invalid JSON. Obviously, changing the buffer size is not the solution here.
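The boundary failure in the steps above can be demonstrated in a few lines of Ruby, using the same line sizes as the repro earlier in the thread (this is a standalone illustration, not plugin code):

```ruby
# One 16,395-byte line: 16384+10 "x" characters plus a newline,
# matching the ruby -e repro used earlier in the thread.
line = "x" * 16_394 + "\n"

# What a single 16 KiB (2**14) read returns from a stream of such lines:
first_read = (line * 2)[0, 16_384]

# No newline fits inside the first read, so split("\n") yields one
# element: an incomplete fragment that the codec treats as a whole line.
fragments = first_read.split("\n")
```

The remainder of the first line arrives in the next read, which is where the rogue \n ends up inside the reassembled event.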

Our long-standing milling concept would be the long-term solution to this, I believe. Otherwise, as @jordansissel mentioned, maybe just adding line buffering to the multiline codec using the BufferedTokenizer, just like the line codec does, would work here?

@Ghysbrecht

Ran into the same issue today. I am working on a project that uses the stdin input and multiline codec, and many lines are rejected by my grok filter because of the stray newlines. Any news on this issue?

@edperry

edperry commented Sep 11, 2018

FYI, I just ran into this issue too. I added detailed comments in #37.

@colinsurprenant
Contributor

Until we move forward with proper boundary detection across inputs and codecs, I am proposing this solution for the multiline codec: logstash-plugins/logstash-codec-multiline#63
