1.8: new network timeout is not working properly #4660

Closed
edsiper opened this issue Jan 22, 2022 · 3 comments

@edsiper
Member

edsiper commented Jan 22, 2022

Bug Report

This bug report comes from PR #4659 (1.8 backport of mbedtls + unit test for event injection).
The new network timeout mechanism is not working reliably. This is the test case from the command line:

$ bin/fluent-bit -i dummy -p samples=1 -o tcp -m '*' -p "retry_limit=no_retries" -p host=35.243.247.233 -p port=54321 -p net.connect_timeout=5s -f 2
Fluent Bit v1.8.12
* Copyright (C) 2019-2021 The Fluent Bit Authors
* Copyright (C) 2015-2018 Treasure Data
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2022/01/21 20:17:07] [ info] [engine] started (pid=1235106)
[2022/01/21 20:17:07] [ info] [storage] version=1.1.5, initializing...
[2022/01/21 20:17:07] [ info] [storage] in-memory
[2022/01/21 20:17:07] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
[2022/01/21 20:17:07] [ info] [cmetrics] version=0.2.2
[2022/01/21 20:17:07] [ info] [sp] stream processor started
[2022/01/21 20:17:18] [error] [upstream] connection #25 to 35.243.247.233:54321 timed out after 5 seconds

some comments:

  • the timeout message above was logged after more than 5 seconds (the engine started at 20:17:07 and the timeout was reported at 20:17:18)
  • the error coming from the TCP output plugin in this line is not triggered, which means the co-routine has not been resumed.
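
The delay can be checked directly from the two log timestamps above. This is a minimal sketch in Python; the variable names are illustrative. The log does not show when the connection attempt itself began, so the value is measured from engine start and is only an upper bound on how long the connect was pending.

```python
from datetime import datetime

# Timestamp format used by the Fluent Bit log lines quoted above.
FMT = "%Y/%m/%d %H:%M:%S"

# Values copied from the log: engine start and the timeout error.
started = datetime.strptime("2022/01/21 20:17:07", FMT)
timed_out = datetime.strptime("2022/01/21 20:17:18", FMT)

# net.connect_timeout=5s from the command line.
connect_timeout = 5

elapsed = (timed_out - started).total_seconds()
print(f"elapsed since engine start: {elapsed:.0f}s, "
      f"configured timeout: {connect_timeout}s")
```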

Looking at the storage content with kill -SIGCONT `pidof fluent-bit`, we can see that the task still exists, which reconfirms that the co-routine has not returned:

[2022/01/21 20:18:08] [engine] caught signal (SIGCONT)
[2022/01/21 20:18:08] Fluent Bit Dump

===== Input =====
dummy.0 (dummy)
│
├─ status
│  └─ overlimit     : no
│     ├─ mem size   : 26b (26 bytes)
│     └─ mem limit  : 0b (0 bytes)
│
├─ tasks
│  ├─ total tasks   : 1
│  ├─ new           : 0
│  ├─ running       : 1
│  └─ size          : 26b (26 bytes)
│
└─ chunks
   └─ total chunks  : 1
      ├─ up chunks  : 1
      ├─ down chunks: 0
      └─ busy chunks: 1
         ├─ size    : 26b (26 bytes)
         └─ size err: 0


===== Storage Layer =====
total chunks     : 1
├─ mem chunks    : 1
└─ fs chunks     : 0
   ├─ up         : 0
   └─ down       : 0
@edsiper
Member Author

edsiper commented Jan 22, 2022

Looks like I found the root cause of the problem.

In GIT master, the event injection only happens in the iteration of the busy_queue.

In the 1.8-backport branch patch, the event injection only happens in the iteration of the av_queue (which is used for keepalive timeouts).

So GIT master and 1.8-backport are both broken because:

  • GIT master doesn't do injection on the available queue
  • 1.8-backport doesn't do injection on the busy queue
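
To make the 1.8-backport symptom concrete, here is a small Python model of a timeout pass over the two queues. It is only a sketch under stated assumptions: `Conn`, `inject_event`, and `check_timeouts` are hypothetical names and not Fluent Bit's C code. It only illustrates why a timeout can be logged for a busy connection while the co-routine waiting on it is never resumed.

```python
from dataclasses import dataclass


@dataclass
class Conn:
    name: str
    deadline: float        # time at which the connection counts as timed out
    resumed: bool = False  # True once the waiting co-routine has been woken


def inject_event(conn):
    """Hypothetical stand-in for event injection: wake the co-routine."""
    conn.resumed = True


def check_timeouts(busy_queue, av_queue, now, inject_busy, inject_av):
    """Report timed-out connections; inject events only where enabled."""
    reported = []
    for conn in busy_queue:   # connections bound to a running co-routine
        if now >= conn.deadline:
            reported.append(conn.name)
            if inject_busy:
                inject_event(conn)
    for conn in av_queue:     # idle connections kept for keepalive
        if now >= conn.deadline:
            reported.append(conn.name)
            if inject_av:
                inject_event(conn)
    return reported


# Placement as described for 1.8-backport: injection only in the av_queue loop.
stuck = Conn("pending-connect", deadline=5.0)
stuck_report = check_timeouts([stuck], [], now=6.0,
                              inject_busy=False, inject_av=True)

# Injection in the busy_queue loop: the waiting co-routine is woken.
woken = Conn("pending-connect", deadline=5.0)
woken_report = check_timeouts([woken], [], now=6.0,
                              inject_busy=True, inject_av=False)

print(stuck_report, stuck.resumed)
print(woken_report, woken.resumed)
```

In the first call the timeout is reported but `resumed` stays false, which matches the log above: the upstream error is printed, yet the task is still listed as running.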

@leonardo-albertovich
Collaborator

leonardo-albertovich commented Jan 22, 2022

You are right @edsiper, 1.8 is wrong; it seems I misplaced it yesterday. As for the available list, I originally thought (and still think) that we don't need to resume a coroutine there because the connection is not bound, it's just stored. Isn't that the case?

I'm about to send a PR to move the event injection code to the right place now (I tested it of course).

@edsiper
Member Author

edsiper commented Jan 22, 2022

Thanks, this has been fixed in 1.8.

@edsiper edsiper closed this as completed Jan 22, 2022
@lecaros lecaros added this to the Fluent Bit v1.8.12 milestone Jan 24, 2022