1.8: new network timeout is not working properly #4660

edsiper · 2022-01-22T02:24:17Z

Bug Report

This bug report comes from PR #4659 (1.8 backport of mbedtls + unit test for event injection).
The new network timeout mechanism is not working reliably. This is the test case from the command line:

$ bin/fluent-bit -i dummy -p samples=1 -o tcp -m '*' -p "retry_limit=no_retries" -p host=35.243.247.233 -p port=54321 -p net.connect_timeout=5s -f 2
Fluent Bit v1.8.12
* Copyright (C) 2019-2021 The Fluent Bit Authors
* Copyright (C) 2015-2018 Treasure Data
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2022/01/21 20:17:07] [ info] [engine] started (pid=1235106)
[2022/01/21 20:17:07] [ info] [storage] version=1.1.5, initializing...
[2022/01/21 20:17:07] [ info] [storage] in-memory
[2022/01/21 20:17:07] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
[2022/01/21 20:17:07] [ info] [cmetrics] version=0.2.2
[2022/01/21 20:17:07] [ info] [sp] stream processor started
[2022/01/21 20:17:18] [error] [upstream] connection #25 to 35.243.247.233:54321 timed out after 5 seconds

some comments:

the timeout message above was logged well after the configured 5 seconds (the engine started at 20:17:07 and the error was logged at 20:17:18)
the error coming from the TCP output plugin in this line is not triggered, which means the co-routine has not been resumed.

Looking at the storage content with kill -SIGCONT `pidof fluent-bit`, we can see that the task still exists, which reconfirms that the co-routine has not returned:

[2022/01/21 20:18:08] [engine] caught signal (SIGCONT)
[2022/01/21 20:18:08] Fluent Bit Dump

===== Input =====
dummy.0 (dummy)
│
├─ status
│  └─ overlimit     : no
│     ├─ mem size   : 26b (26 bytes)
│     └─ mem limit  : 0b (0 bytes)
│
├─ tasks
│  ├─ total tasks   : 1
│  ├─ new           : 0
│  ├─ running       : 1
│  └─ size          : 26b (26 bytes)
│
└─ chunks
   └─ total chunks  : 1
      ├─ up chunks  : 1
      ├─ down chunks: 0
      └─ busy chunks: 1
         ├─ size    : 26b (26 bytes)
         └─ size err: 0


===== Storage Layer =====
total chunks     : 1
├─ mem chunks    : 1
└─ fs chunks     : 0
   ├─ up         : 0
   └─ down       : 0


edsiper · 2022-01-22T02:39:59Z

Looks like I found the root cause of the problem.

In GIT master the event injection only happens in the iteration of the busy_queue:

https://github.com/fluent/fluent-bit/blob/master/src/flb_upstream.c#L845-L847

In the 1.8-backport branch patch, the event injection only happens in the iteration of the av_queue (which is used for keepalive timeouts):

fluent-bit/src/flb_upstream.c

Lines 864 to 866 in 54a8e5f

    
           mk_event_inject(u_conn->evl, &u_conn->event, 
        
                           MK_EVENT_READ | MK_EVENT_WRITE, 
        
                           FLB_TRUE);

So GIT master and 1.8-backport are both broken because:

GIT master doesn't do injection on the available queue
1.8-backport doesn't do injection on the busy queue
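
To make the distinction between the two queues concrete, here is a minimal, self-contained sketch of a timeout sweep. It is not the code in flb_upstream.c: struct sketch_conn, enum conn_queue and sketch_timeout_sweep() are hypothetical names invented for this illustration, and the wakeup_injected flag only stands in for the call to mk_event_inject(). The sketch assumes that only busy connections have a suspended co-routine that needs waking; whether the available queue also needs an injection is left open here, and the sketch simply drops idle connections.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical stand-ins; these are NOT the real Fluent Bit structures. */
enum conn_queue { QUEUE_AVAILABLE, QUEUE_BUSY };

struct sketch_conn {
    enum conn_queue queue;   /* which list the connection sits on            */
    time_t deadline;         /* absolute time at which the timeout expires   */
    bool timed_out;          /* set by the sweep once the deadline is passed */
    bool wakeup_injected;    /* models mk_event_inject() having been called  */
    bool closed;             /* models an idle connection being dropped      */
};

/*
 * One pass of a periodic timeout check over all connections.
 * Returns the number of wakeups injected for busy connections.
 */
static int sketch_timeout_sweep(struct sketch_conn *conns, size_t n, time_t now)
{
    int injected = 0;

    assert(conns != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        struct sketch_conn *c = &conns[i];

        if (c->closed || now < c->deadline) {
            continue;  /* already gone, or not expired yet */
        }
        c->timed_out = true;

        if (c->queue == QUEUE_BUSY) {
            /*
             * A co-routine is suspended waiting on this connection. Without
             * a wakeup it never resumes, so the plugin never reports the
             * error and the task stays in the "running" state.
             */
            c->wakeup_injected = true;
            injected++;
        }
        else {
            /* Assumption: nothing is waiting on an idle connection. */
            c->closed = true;
        }
    }
    return injected;
}
```

With one expired busy connection and one expired available connection, the sweep injects exactly one wakeup (for the busy one) and closes the idle one. If the injection is attached to the available branch instead, as described for the 1.8-backport patch, the busy connection is marked as timed out but nothing ever resumes its co-routine, which matches the stuck task in the dump above.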

leonardo-albertovich · 2022-01-22T10:31:29Z

You are right @edsiper, 1.8 is wrong; it seems I misplaced it yesterday. As for the available list, I originally thought (and still think) that we don't need to resume a coroutine there because the connection is not bound, it's just stored. Isn't that the case?

I'm about to send a PR to move the event injection code to the right place now (I tested it of course).

edsiper · 2022-01-22T23:41:28Z

Thanks, this has been fixed in 1.8.

edsiper added the status: waiting-for-triage label Jan 22, 2022

edsiper assigned leonardo-albertovich Jan 22, 2022

edsiper added bug and removed status: waiting-for-triage labels Jan 22, 2022

edsiper closed this as completed Jan 22, 2022

lecaros added this to the Fluent Bit v1.8.12 milestone Jan 24, 2022
