httptransport: check for err before deferring resp.Body.Close() #1173

Merged: 1 commit, Feb 5, 2021

Conversation

@jan-zmeskal (Contributor) commented on Feb 5, 2021

There is quite a big story behind this PR. First, let me offer a TL;DR: if err isn't checked before referring to resp.Body, we might hit a nil pointer dereference. See this SO post.
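
For illustration, here is a minimal, self-contained sketch (not a quote of the Clair code) of the pattern that triggers the panic. The function and URL are made up; only the ordering of the defer and the error check matters:

```go
package main

import (
	"fmt"
	"net/http"
)

// fetchBroken mirrors the problematic ordering: resp.Body.Close() is
// deferred before the error from Do is checked. When Do fails (for
// example, connection refused), resp is nil and the deferred call
// panics with a nil pointer dereference.
func fetchBroken(url string) error {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	defer resp.Body.Close() // BUG: resp may be nil when err != nil
	if err != nil {
		return err
	}
	fmt.Println("status:", resp.Status)
	return nil
}

func main() {
	// Pointing at a closed port makes Do return an error, which then
	// makes the deferred Close panic as the function returns.
	if err := fetchBroken("http://127.0.0.1:1"); err != nil {
		fmt.Println("error:", err)
	}
}
```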

Here's the whole story. It all started with the Red Hat Product Security team substantially extending the data available in OVAL v2 streams. When this change hit production, we started seeing issues with the notifier deployed in our OpenShift cluster.

The first symptom I found was that the notifier simply could not process an update operation created by a certain stream. In the notifier pod, it looked like this:
[screenshot: notifier pod logs, 2021-02-05 14:32]
Basically, all four processors would try to acquire a lock on the same update operation and none of them could actually acquire it. This caused the whole notifier to be stuck forever on that one update operation.

I connected to our Clair DB and found an advisory lock sitting there.
[screenshot: advisory lock row in the Clair DB]
I couldn't find out how long it had been there, but I started monitoring it and could see that it was still there after 30 minutes. No operation should take that long, so the lock had to be stale.

As far as I understand, these advisory locks are released when either the transaction or the session holding them closes. So I came to the conclusion that some process must have died without gracefully tearing down whatever DB operation it was doing.
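
For anyone hitting something similar, a query against pg_locks makes these stale locks visible. The sketch below is only an illustration; the connection string, driver choice, and column selection are assumptions on my side, not part of Clair:

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // assumed Postgres driver; any database/sql driver works
)

func main() {
	// Placeholder connection string for the Clair database.
	db, err := sql.Open("postgres", "postgres://clair:clair@localhost/clair?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Advisory locks show up in pg_locks. A granted row that never
	// disappears while no notifier should be holding it points at a
	// stale, session-level lock left behind by a crashed process.
	rows, err := db.Query(`SELECT pid, objid::bigint, granted
	                       FROM pg_locks WHERE locktype = 'advisory'`)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()
	for rows.Next() {
		var pid, key int64
		var granted bool
		if err := rows.Scan(&pid, &key, &granted); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("pid=%d key=%d granted=%t\n", pid, key, granted)
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}
```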

And indeed, I found out that our notifier had crashed a couple of hours earlier and OpenShift had just spun up new pods:
[screenshot: notifier pod crash and restart events in OpenShift]

So I guess you're getting the picture now:

  1. A notifier pod is spun up but crashes while processing an update operation.
  2. As a result of the crash, a new notifier pod is created.
  3. None of the notifier processors in the new pod can acquire the lock on that update operation.
  4. The notifier is stuck forever trying to acquire the lock.

So the crash is the cause of it all. To the best of my knowledge, it happens here, when we try to refer to resp.Body. However, resp.Body is guaranteed to be non-nil only when err is nil; see here.

Hence my change. I suggest we first check the error returned by Do. If it's non-nil, we return, as we would anyway. If it's nil, we can safely defer closing the body, since it's guaranteed to be non-nil.
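
The corrected ordering looks roughly like this (again just a sketch with made-up names, not the actual patch):

```go
package main

import (
	"fmt"
	"net/http"
)

// fetchFixed uses the ordering this PR suggests: check the error from
// Do first, and only defer Body.Close() once resp is known to be
// non-nil.
func fetchFixed(url string) error {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err // nothing to close when Do fails
	}
	defer resp.Body.Close() // safe: err is nil, so resp.Body is non-nil
	fmt.Println("status:", resp.Status)
	return nil
}

func main() {
	// The same unreachable address now just returns an error instead of
	// panicking.
	if err := fetchFixed("http://127.0.0.1:1"); err != nil {
		fmt.Println("error:", err)
	}
}
```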

Addendum: The crash seems to have appeared after we sent an HTTP request with a 273MB JSON body here. I still don't know whether that's a problem with Clair per se or something related to our deployment. Be that as it may, this PR won't solve that issue. However, it should make sure the notifier doesn't get stuck in an infinite loop.

Signed-off-by: Jan Zmeskal <jzmeskal@redhat.com>
@jan-zmeskal marked this pull request as ready for review February 5, 2021 14:00
@hdonnay (Member) left a comment


LGTM

I think both Louis and I have misunderstood the Body-and-err interaction a few different times.
