Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix aggregator size estimation #155

Merged
merged 5 commits into from
Aug 23, 2021
Merged

Conversation

jamiees2
Copy link
Contributor

Description of changes:

The aggregator tries to keep track of the size of the serialised message, so that it can know if it should flush before adding a new record, and not overflow Kinesis' 1MB limit. The current implementation keeps track of this by adding seemingly random constants, which break when e.g an integer is passed that does not fit in a 1-byte varint.

This caused issues for us in production, where fluent bit would get stuck on a record that Kinesis refused to accept because the plugin would keep trying to submit a record larger than 1MB that it thought was going to be smaller than 1MB.

This changes the code to map to protobuf's actual size calculation, keeping track of the sizes of varints, buffer lengths, etc. This change fixed the bug in our production environment, and caused the plugin to correctly keep the record data below 1MB.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@jamiees2 jamiees2 requested a review from a team as a code owner August 14, 2021 19:50
@PettitWesley
Copy link
Contributor

@zackwine FYI

Copy link
Contributor

@PettitWesley PettitWesley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks alright to me, and it sounds like you have thoroughly tested it in your production to verify it fixed the issue and didn't add any new regressions?

@jamiees2
Copy link
Contributor Author

Yeah, to be really clear on what I did: I deployed the new plugin to a machine that was stuck on a record like this with debug logging enabled, and saw it get past the bad records,and still compute correct sizes for everything else. I watched it for ~5m before opening the PR and didn't see any size mismatches.

@PettitWesley
Copy link
Contributor

@jamiees2 Ok, we will merge soon and then release it. I want to give @zackwine a chance to review since he originally built this feature and understands the code a lot better than I do. If he doesn't review after a few days, we'll just merge it and release.

@jamiees2
Copy link
Contributor Author

Cool, we have already deployed the fix to our production environments, so that will just let this bake for a bit 🙂

@hossain-rayhan
Copy link
Contributor

@jamiees2 I think it would be nice to have some positive and negative unit tests.

@jamiees2
Copy link
Contributor Author

Added tests, and realized protowire which is included in google.golang.org/protobuf exposes the sizeof functions we need, so I don't need to inline them here. This cleans up the code somewhat :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants