add timeout config for AWS SDK Go HTTP calls #179

jpaskhay · 2021-10-27T01:34:08Z

Fixes #178

Description of changes:
This PR adds a new configuration to customize the timeout in the HTTP client being used by the AWS SDK Go PutRecords calls. It is based on guidance specified in the SDK docs. All other default settings in the HTTP client being used will stay the same.

It has been tested and verified via local builds/patches (both specifying a config and leaving it out to default to existing behavior).

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

hossain-rayhan · 2021-10-27T05:33:59Z

README.md

@@ -25,6 +25,7 @@ If you think you’ve found a potential security issue, please do not post it in
 * `aggregation`: Setting `aggregation` to `true` will enable KPL aggregation of records sent to Kinesis.  This feature isn't compatible with the `partition_key` feature.  See the KPL aggregation section below for more details.
 * `compression`: Specify an algorithm for compression of each record. Supported compression algorithms are `zlib` and `gzip`. By default this feature is disabled and records are not compressed.
 * `replace_dots`: Replace dot characters in key names with the value of this option. For example, if you add `replace_dots _` in your config then all occurrences of `.` will be replaced with an underscore. By default, dots will not be replaced.
+* `timeout`: Specify a timeout (in seconds) for the underlying AWS SDK Go HTTP call when sending records to Kinesis.


Let's rename this. Maybe http_request_timeout?

sure, will do. i'll update the associated vars as well, similar to your other comment

hossain-rayhan · 2021-10-27T05:35:29Z

fluent-bit-kinesis.go

@@ -160,7 +162,21 @@ func newKinesisOutput(ctx unsafe.Pointer, pluginID int) (*kinesis.OutputPlugin,
 		return nil, fmt.Errorf("[kinesis %d] Invalid 'compression' value (%s) specified, must be 'zlib', 'gzip', 'none', or undefined", pluginID, compression)
 	}

-	return kinesis.NewOutputPlugin(region, stream, dataKeys, partitionKey, roleARN, kinesisEndpoint, stsEndpoint, timeKey, timeKeyFmt, logKey, replaceDots, concurrencyInt, concurrencyRetriesInt, isAggregate, appendNL, comp, pluginID)
+	var timeoutInt int
+	var timeoutDur time.Duration


Let's rename this variable. Maybe httpTimeoutDurationSeconds?

Also, I prefer to declare the variables in the beginning of the function.

Also, I prefer to declare the variables in the beginning of the function.

Ya, I was mostly trying to keep it consistent with the other configs that had additional parsing/validation. If you want this changed, do you prefer the other config vars (e.g. compressions, concurrency) also get moved to the top?

It may also be possible to avoid some of the var declarations based on your comment around extracting the int parsing into a function 🤔 . I can play around with it a bit when looking at that change.

hossain-rayhan · 2021-10-27T05:37:51Z

fluent-bit-kinesis.go

+		timeoutInt, err = strconv.Atoi(timeout)
+		if err != nil {
+			logrus.Errorf("[kinesis %d] Invalid 'timeout' value %s specified: %v", pluginID, timeout, err)
+			return nil, err
+		}
+		if timeoutInt < 0 {
+			return nil, fmt.Errorf("[kinesis %d] Invalid 'timeout' value (%s) specified, must be a non-negative number", pluginID, timeout)
+		}


We can extract these 7 lines into a function. Because we are doing similar things for other configs.

Agreed, it will definitely clean some things up. I also just noticed this carried over an inconsistency around the error logging that can be cleaned up as part of the extraction (duplicate logging in case of error).

jpaskhay · 2021-10-28T16:26:50Z

Thanks again for your initial review @hossain-rayhan! I pushed an initial set of changes.

Let me know if you want the refactoring taken further -- I was erring on trying to minimize this PR's impact on existing code. It may be possible to replace the var usage with shorthand by moving the empty string check into parseNonNegativeConfig, but there may still be one or two extra if checks as a result (e.g. the defaultConcurrentRetries).

zhonghui12 · 2021-11-23T01:45:09Z

LGTM as long as it is fully tested. Also @jpaskhay, you might need to rebase the code as this branch is out-of-date.

DrewZhang13

Overall LGTM, left small comments.

DrewZhang13 · 2021-12-13T04:57:57Z

fluent-bit-kinesis.go

+	if err != nil {
+		return 0, fmt.Errorf("[kinesis %d] Invalid '%s' value (%s) specified: %v", pluginID, configName, configValue, err)
+	}
+	if configValueInt < 0 {


what is the expected behavior is configValueInt == 0

it depends on the individual configs and their subsequent usage. this was just a refactor of some existing repetitive code based on feedback.

side note: looks like there is an existing bug where setting the experimental_concurrency_retries to 0 when using experimental_concurrency > 0 will swallow records without actually sending to kinesis since this for-loop condition will never be satisfied and the Flush call is inside it:

amazon-kinesis-streams-for-fluent-bit/kinesis/kinesis.go

Line 350 in 301b29e

for tries = 0; tries < outputPlugin.concurrencyRetryLimit; tries++ {

So the fix here is to change the condition to <=? @zhonghui12

So the fix here is to change the condition to <=? @zhonghui12

ya that seems like it would be the simplest fix

DrewZhang13 · 2021-12-13T05:00:09Z

README.md

@@ -25,6 +25,7 @@ If you think you’ve found a potential security issue, please do not post it in
 * `aggregation`: Setting `aggregation` to `true` will enable KPL aggregation of records sent to Kinesis.  This feature isn't compatible with the `partition_key` feature.  See the KPL aggregation section below for more details.
 * `compression`: Specify an algorithm for compression of each record. Supported compression algorithms are `zlib` and `gzip`. By default this feature is disabled and records are not compressed.
 * `replace_dots`: Replace dot characters in key names with the value of this option. For example, if you add `replace_dots _` in your config then all occurrences of `.` will be replaced with an underscore. By default, dots will not be replaced.
+* `http_request_timeout`: Specify a timeout (in seconds) for the underlying AWS SDK Go HTTP call when sending records to Kinesis.


Could we explain what 0 means for timeout config if 0 is also a valid value

sure, can update this. the only caveat is that although a value of 0 is supposed to represent no timeout, there is still some sort of timeout mechanism that occurs within the AWS SDK Go at 5m30s (mentioned in the linked Issue this PR is trying to address). guessing it's related to the DefaultRetryer behavior but i wasn't familiar enough with the SDK to see how it all tied together. the caveat would likely be worth mentioning in some way for accuracy, but i may not be able to speak to its specifics as accurately

zhonghui12 · 2022-01-25T00:52:30Z

@jpaskhay this PR is out-of-date to merge. You might need to rebase it before we can merge it. Also, @PettitWesley, it seems that team has reviewed this PR before, could you please take a look at this one before we merge it?

PettitWesley

@zhonghui12 Can you test this out to be safe? You can run it in a container with network mode none or turn your internet off to make all network requests fail.

zhonghui12 · 2022-01-27T01:00:57Z

@jpaskhay , I am trying to test you change locally and could you tell me which LogLevelType you are using and how do you verify the timeout? Thanks.

Also, you may need to combine your commits into one before we can merge your code.

jpaskhay · 2022-01-27T01:32:47Z

hey @zhonghui12 , we used aws.LogLevel(aws.LogDebugWithRequestRetries | aws.LogDebugWithRequestErrors | aws.LogDebugWithHTTPBody). i think i mentioned it in aws/aws-sdk-go#4149 but not explicitly in #178. just noticed i didn't link in that issue to the SDK issue i raised -- will do that for visibility

sure, i can rebase. sorry i was being lazy and just clicking the Update branch in the UI 😅. i saw previous PRs with different approaches for updating to mainline, so wasn't sure what your preferred process was. figured your team would merge with squash/rebase option if needed, but i can try to clean it up on my end

zhonghui12 · 2022-01-27T03:02:34Z

Thanks for these details @jpaskhay. I will test it tomorrow and let you know. Today I tried aws.LogDebugWithHTTPBody but cannot see a signal for timeout. Will try the other two later.

jpaskhay · 2022-01-27T07:26:10Z

sounds good. been a while since i paired w/ a colleague on it, so was having trouble recalling all of the details. i think this blurb from the SDK issue covers the most important parts of what we observed:

When using the default HTTP Client, we see a read timeout (consistently after 5m30s). When we pass in a HTTP Client w/ a custom timeout, we see a context cancellation timeout. In either case, we observe a 200 OK response is logged at the point of timeout but the Date response header has an earlier time (generally within 1 second of the request), and it appears to have an empty response body. The SDK then retries sending the request and succeeds as normal, resulting in duplicate records.

i'm not sure if you'll be able to replicate the exact scenario we had, but hopefully you should still be able to see the different timeouts (read and context cancellation) in your test setup. the empty response body with 200 OK seemed to be the cause of the SDK hanging and treating it as a timeout for our original issue. reading through the issues i linked in the SDK report again, it's likely just a result of the point in the SDK code flow when network/connection issues occur (i.e. response body read)

zhonghui12 · 2022-01-27T22:06:17Z

@jpaskhay you are right.. so I really cannot reproduce it but I've verified that kinesis is still working with this new feature. So if you've tested that the solution works for your issue then I think we can merge it.

So could you please clear up the code a little bit and resolve the conflict? We have new changes merged so it seems that your code has some conflicts. And it would be great if you could test it again and attach some output logs for us. Thanks.

jpaskhay · 2022-01-28T06:27:44Z

ok, i resolved the conflict. sorry for the extra force push -- didn't notice my IDE had auto-formatted the file 🤦, so had to revert that.

...could you please clear up the code a little bit...

just to clarify, was there anything else to address or were you just referring to the conflict?

sure, let me put together a simpler MVP for retesting this. it should be a little easier than how we were testing before. should be able to do so some time tomorrow. i believe in previous tests the timeout was observed on a colleague's laptop, so hopefully it can still be reproduced reliably.

also this discussion wasn't resolved. were you folks wanting that addressed in this PR or a separate issue/PR? not sure if @PettitWesley wanted to proceed with his suggested solution.

zhonghui12 · 2022-01-28T20:41:32Z

@jpaskhay Thanks.

just to clarify, was there anything else to address or were you just referring to the conflict?

Yeah, I just referred to the conflict.

zhonghui12 · 2022-01-28T21:41:24Z

also this discussion wasn't resolved. were you folks wanting that addressed in this PR or a separate issue/PR? not sure if @PettitWesley wanted to proceed with his suggested solution.

So we are fine with addressing it in this PR as long as it is a separate commit. Very Appreciate it if you would like to fix it for us. Thanks

jpaskhay · 2022-01-28T23:12:59Z

So we are fine with addressing it in this PR as long as it is a separate commit. Very Appreciate it if you would like to fix it for us. Thanks

sure, will do. can probably throw in a test case for that one too. i probably won't get to this and rerunning the tests till over the weekend or first thing monday. thanks again for pushing this forward!

jpaskhay · 2022-02-01T08:17:35Z

Ok, I have a test that will be running overnight. Hopefully it will encounter a timeout, and I can scrape some relevant logs for it to post here. Worst case I may need to create a VM in our problematic network zone to try reproducing there.

In the meantime, I rebased to this morning's changes and have pushed up the retry handling fix along with a corresponding test. While trying to add the test case, I ran into an issue that made me notice the existing test cases w/ the mock client weren't actually verifying the calls (they were all missing the deferred Finish call). You can observe this by changing one of the non-concurrent test methods in mainline code to add .Times(some_num_greater_than_1) to the expect setup and have it succeed when it should fail. Adding the defer ctrl.Finish() properly fails it.

The existing TestAddRecordWithConcurrency was actually a false positive test case as a result (thankfully the feature still worked!). I added synchronization to it (and the new test case) to properly verify the goroutine calls the client before the test method exits. I also verified reverting the retry fix causes the test case to fail -- it panics. Ideally, instead of a panic we'd have a timeout, but I couldn't get the channel handling to work properly due to my Go knowledge being a bit rusty (feel free to improve that). Anyway, this should help harden those tests in case of future changes/features.

jpaskhay · 2022-02-04T19:13:04Z

Hey @zhonghui12 , unfortunately we've been unable to reproduce the issue. At the time this PR was submitted, we were able to reproduce the timeout issue within a few minutes of the start of a test run. We don't recall if it was only in our busier network zone, or if it also occurred on laptops and other network zones (sorry been a few months now since we troubleshooted).

So perhaps our network zone and/or the kinesis service itself have improved since then. I didn't see anything obvious in the recent SDK release notes related to our issue, but maybe there's something indirect that helped there too. We've tried test runs spanning anywhere from 3h-9h on laptops, different network zones, etc. and we haven't observed the issue. Obviously that's a good thing overall, but it would still be nice to have this option available as that seems to be the most common work-around in case it reappears for us or others.

Looks like I need to rebase again. Thanks again for all your help, and let me know if there's anything else we can do for this.

- also refactored repetitive non-negative integer config parsing based on feedback

… messages - added missing Controller.Finish calls to tests using mock so expectations are actually checked - added WaitGroup to test cases that invoke goroutines to ensure they complete before test method exits - added test to protect against bugfix case (panics without bugfix)

PettitWesley · 2022-02-07T23:30:01Z

@zhonghui12 Can we merge this one soon?

zhonghui12 · 2022-02-08T21:16:37Z

@PettitWesley yeah I think so. I am good to merge this PR if team thinks we can make this change even the origin problem cannot be reproduced.

PettitWesley · 2022-02-08T21:46:01Z

@zhonghui12 Yea I am fine to merge it. The change is straightforward. For core Fluent Bit plugins, there are many network settings including timeout, so it makes sense to have this for the go plugin as well. Please merge it.

zhonghui12 · 2022-02-09T00:51:30Z

Hi @jpaskhay,
Thanks for all your contribution! I will merge the change now.

jpaskhay requested a review from a team as a code owner October 27, 2021 01:34

hossain-rayhan reviewed Oct 27, 2021

View reviewed changes

jpaskhay mentioned this pull request Oct 27, 2021

Kinesis PutRecords intermittently hangs with successful but empty response aws/aws-sdk-go#4149

Closed

3 tasks

PettitWesley requested a review from a team November 21, 2021 19:12

DrewZhang13 reviewed Dec 13, 2021

View reviewed changes

PettitWesley approved these changes Jan 25, 2022

View reviewed changes

jpaskhay force-pushed the add_sdk_timeout branch from c2ed5b4 to 287a73f Compare January 27, 2022 06:11

jpaskhay force-pushed the add_sdk_timeout branch 2 times, most recently from f71bd34 to 194c214 Compare January 28, 2022 05:43

jpaskhay force-pushed the add_sdk_timeout branch 2 times, most recently from c9d475f to 198bddd Compare February 1, 2022 02:18

jpaskhay added 2 commits February 4, 2022 12:13

add timeout config for AWS SDK Go HTTP calls

7e1504b

- also refactored repetitive non-negative integer config parsing based on feedback

jpaskhay force-pushed the add_sdk_timeout branch from 198bddd to 1df964f Compare February 4, 2022 19:13

zhonghui12 merged commit 75e2ade into aws:mainline Feb 9, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add timeout config for AWS SDK Go HTTP calls #179

add timeout config for AWS SDK Go HTTP calls #179

jpaskhay commented Oct 27, 2021

hossain-rayhan Oct 27, 2021

jpaskhay Oct 27, 2021

hossain-rayhan Oct 27, 2021

hossain-rayhan Oct 27, 2021

jpaskhay Oct 27, 2021

hossain-rayhan Oct 27, 2021

jpaskhay Oct 27, 2021

jpaskhay commented Oct 28, 2021

zhonghui12 commented Nov 23, 2021

DrewZhang13 left a comment

DrewZhang13 Dec 13, 2021

jpaskhay Dec 14, 2021

PettitWesley Jan 25, 2022

jpaskhay Jan 25, 2022

DrewZhang13 Dec 13, 2021

jpaskhay Dec 14, 2021 •

edited

Loading

zhonghui12 commented Jan 25, 2022

PettitWesley left a comment

zhonghui12 commented Jan 27, 2022

jpaskhay commented Jan 27, 2022

zhonghui12 commented Jan 27, 2022

jpaskhay commented Jan 27, 2022

zhonghui12 commented Jan 27, 2022

jpaskhay commented Jan 28, 2022

zhonghui12 commented Jan 28, 2022

zhonghui12 commented Jan 28, 2022

jpaskhay commented Jan 28, 2022

jpaskhay commented Feb 1, 2022

jpaskhay commented Feb 4, 2022

PettitWesley commented Feb 7, 2022

zhonghui12 commented Feb 8, 2022

PettitWesley commented Feb 8, 2022

zhonghui12 commented Feb 9, 2022

add timeout config for AWS SDK Go HTTP calls #179

add timeout config for AWS SDK Go HTTP calls #179

Conversation

jpaskhay commented Oct 27, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jpaskhay commented Oct 28, 2021

zhonghui12 commented Nov 23, 2021

DrewZhang13 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jpaskhay Dec 14, 2021 • edited Loading

Choose a reason for hiding this comment

zhonghui12 commented Jan 25, 2022

PettitWesley left a comment

Choose a reason for hiding this comment

zhonghui12 commented Jan 27, 2022

jpaskhay commented Jan 27, 2022

zhonghui12 commented Jan 27, 2022

jpaskhay commented Jan 27, 2022

zhonghui12 commented Jan 27, 2022

jpaskhay commented Jan 28, 2022

zhonghui12 commented Jan 28, 2022

zhonghui12 commented Jan 28, 2022

jpaskhay commented Jan 28, 2022

jpaskhay commented Feb 1, 2022

jpaskhay commented Feb 4, 2022

PettitWesley commented Feb 7, 2022

zhonghui12 commented Feb 8, 2022

PettitWesley commented Feb 8, 2022

zhonghui12 commented Feb 9, 2022

jpaskhay Dec 14, 2021 •

edited

Loading