Hang Summary
Fluent Bit, when configured with multiple instances of the CloudWatch Logs C plugin (cloudwatch_logs), may occasionally hang indefinitely under high throughput when net.keepalive is set to off. This affects aws-for-fluent-bit versions 2.28.4 and prior.
To resolve this issue, please update aws-for-fluent-bit to version 2.29.1 or higher.
Background
As of late September 2022, the FireLens team received a high volume of tickets related to aws-for-fluent-bit 2.28.4 and prior with CloudWatch Logs as a destination: Fluent Bit stopped sending logs after hanging in a network call, eventually ran out of memory (OOM), and was terminated. Customers sending logs at high throughput to CloudWatch Logs noticed that, at a random point in Fluent Bit’s execution, Fluent Bit would hang and stop sending data to CloudWatch while continuing to execute, ingest data, pass health checks, and buffer logs to memory. Over time, the hung Fluent Bit’s memory consumption would increase steadily until an OOM error ended the process and the buffered log data was lost.
Impact
Upon investigation, the FireLens team found that a hang occurs when a user’s Fluent Bit configuration meets the following conditions and logs are sent at high throughput:
aws-for-fluent-bit version 2.28.4 or prior is used
Multiple CloudWatch Log output plugins are configured
The net.keepalive configuration option is set to off
We had suggested net.keepalive off to reduce the frequent network errors customers experienced when Fluent Bit reused failed connections that were kept alive. That net.keepalive on error has been resolved in Fluent Bit 1.9.x, and the recommendation has since been updated.
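For illustration, a configuration that meets the conditions above might look like the following minimal sketch (the Match patterns, region, and log group and stream names are placeholders, not taken from any specific customer report):

```
[OUTPUT]
    Name              cloudwatch_logs
    Match             app*
    region            us-east-1
    log_group_name    /ecs/app
    log_stream_prefix app-
    auto_create_group On
    net.keepalive     Off

[OUTPUT]
    Name              cloudwatch_logs
    Match             sidecar*
    region            us-east-1
    log_group_name    /ecs/sidecar
    log_stream_prefix sidecar-
    auto_create_group On
    net.keepalive     Off
```

With aws-for-fluent-bit 2.29.1 or higher, the same configuration can rely on the default net.keepalive On, since the connection-reuse errors that originally motivated turning it off were fixed in Fluent Bit 1.9.x.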
Investigation & Solution
Investigative Efforts
While we initially tried to resolve the networking hangs found in the synchronous network stack by adding OpenSSL error handling, DNS timeouts, and unidirectional TLS shutdowns, these efforts only moved Fluent Bit from failing once in 5 minutes without the changes to once in 5 hours with the changes, when tested under a high-load failure case. We determined that it would take too much effort to isolate the synchronous network stack issues and decided to invest in switching to the widely used Fluent Bit asynchronous network stack.
Proposed Solution
Our proposed solution is to migrate the CloudWatch Logs output plugin to Fluent Bit’s asynchronous network stack.
Switching to Async Network Stack
CloudWatch API Synchronous Usage Requirements - The CloudWatch Logs plugin relies on synchronous networking to ensure that PutLogEvents requests complete sequentially; at the time, each PutLogEvents call had to carry the sequence token returned by the previous call, so requests to the same log stream could not be interleaved. Normally, when the asynchronous network stack is used, Fluent Bit context-switches the next batch of logs into processing when the previous batch yields on a network call. This defeats the sequential PutLogEvents execution required by CloudWatch.
Existing Core Synchronous Scheduler - To enforce sequential processing of log data when the asynchronous network stack is used, we opt the CloudWatch Logs plugin into a Fluent Bit core synchronous task scheduler, which limits processing to one batch of logs at a time, essentially using the asynchronous network stack in a synchronous manner. However, a bottleneck was discovered in the existing core synchronous scheduler: it limits processing to 1 batch per second (or per flush interval). If used, the existing synchronous scheduler degrades the performance and stability of the CloudWatch output plugin, causing log delay and eventually log loss.
New Performant Core Synchronous Scheduler - A new, performant core scheduler was written by the FireLens team that removes the 1-batch-per-second restriction while keeping the one-batch-at-a-time restriction in place. The CloudWatch plugin opts into the performant synchronous scheduler implementation and uses the asynchronous network stack. The synchronous scheduler PR (https://github.com/fluent/fluent-bit/pull/6339/files) has been merged into Fluent Bit 1.9.10.
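The core idea can be sketched as follows. This is a simplified illustration with hypothetical names, not the actual code from the PR above: an output that opts in keeps at most one flush task in flight, queues batches that arrive in the meantime, and dispatches the next queued batch as soon as the current one completes instead of waiting for the next flush interval.

```c
#include <stdio.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified sketch of the "one batch in flight at a time" idea behind the
 * synchronous task scheduler. Names and structure are illustrative only and
 * do not mirror the actual Fluent Bit code. */

#define MAX_QUEUED 128

struct sync_output {
    bool        busy;                 /* a flush task is currently in flight */
    const char *queue[MAX_QUEUED];    /* batches waiting their turn */
    size_t      head, tail, count;
};

/* Stand-in for an asynchronous flush that would yield on network I/O
 * (e.g. a PutLogEvents call) and later signal completion. */
static void start_flush(const char *batch)
{
    printf("flushing: %s\n", batch);
}

/* The engine hands a new batch to this output. Dispatch it immediately if
 * nothing is in flight, otherwise queue it. */
static void schedule_task(struct sync_output *out, const char *batch)
{
    if (!out->busy) {
        out->busy = true;
        start_flush(batch);
        return;
    }
    if (out->count < MAX_QUEUED) {
        out->queue[out->tail] = batch;
        out->tail = (out->tail + 1) % MAX_QUEUED;
        out->count++;
    }
}

/* Called when the in-flight flush completes. The next queued batch starts
 * right away -- there is no 1-batch-per-flush-interval wait. */
static void task_done(struct sync_output *out)
{
    if (out->count == 0) {
        out->busy = false;
        return;
    }
    const char *next = out->queue[out->head];
    out->head = (out->head + 1) % MAX_QUEUED;
    out->count--;
    start_flush(next);
}

int main(void)
{
    struct sync_output out = {0};

    schedule_task(&out, "batch 1");   /* dispatched immediately */
    schedule_task(&out, "batch 2");   /* queued: batch 1 still in flight */
    schedule_task(&out, "batch 3");   /* queued */

    task_done(&out);                  /* batch 1 done -> batch 2 starts */
    task_done(&out);                  /* batch 2 done -> batch 3 starts */
    task_done(&out);                  /* batch 3 done -> idle */
    return 0;
}
```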
Testing and Results
Unit Testing
A series of 24-hour tests was conducted on Fluent Bit 1.9.9 with the patch, with and without Valgrind. No network hangs were observed on 1.9, and no memory leaks were introduced by the patch.
A 24-hour test was conducted on Fluent Bit 2.0.x with the patch. No network hangs were observed.
Parallel Long Running Durability Tests
To simulate customers’ long-running executions of Fluent Bit, 40-100 ECS FireLens test tasks per test were run in parallel to accumulate cumulative running time and gain confidence in the patch.
The following stability matrix outlines the patch’s impact on Fluent Bit’s durability rating, described as a lower-bounded average hours to failure (HTF).
Fluent Bit 1.9.x (AWS for Fluent Bit official release):

| aws-for-fluent-bit | version 2.29.0+ | version 2.28.4 and prior |
| --- | --- | --- |
| Keepalive On | Very stable (+2 years) | Segfault on some network errors after throttling limits (~80h) |
| Keepalive Off | Somewhat stable (~3000h) | CloudWatch hang (~0.08h) |
Here we can see that the patch increases the stability of Fluent Bit during durability tests from 80 HTF to 3500+ HTF when keepalive is enabled, and from 0.08 HTF to ~2000 HTF when keepalive is disabled.
Hours to failure is calculated as the duration of the testing divided by the percentage of failed tasks, or the duration divided by 1/total if no tasks failed.
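As a hypothetical worked example: a 24-hour run over 50 tasks with 2 failures gives 24 / (2/50) = 600 HTF, while the same run with zero failures is lower-bounded at 24 / (1/50) = 1200 HTF. A minimal sketch of the calculation, assuming only the description above (the numbers and helper name are illustrative, not part of any test harness):

```c
#include <stdio.h>

/* Lower-bound hours to failure (HTF): the test duration divided by the
 * fraction of failed tasks, or by 1/total when no task failed. */
static double hours_to_failure(double duration_hours, int failed_tasks, int total_tasks)
{
    double failure_rate;

    if (failed_tasks == 0) {
        failure_rate = 1.0 / total_tasks;   /* lower bound when nothing failed */
    }
    else {
        failure_rate = (double) failed_tasks / total_tasks;
    }
    return duration_hours / failure_rate;
}

int main(void)
{
    printf("2/50 failed over 24h: %.0f HTF\n", hours_to_failure(24.0, 2, 50));   /* 600 */
    printf("0/50 failed over 24h: %.0f HTF\n", hours_to_failure(24.0, 0, 50));   /* 1200 */
    return 0;
}
```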