httptransport: check for err before deferring resp.Body.Close() #1173
There is quite a big story behind this PR. First, let me offer a TL;DR: if `err` isn't checked before referring to `resp.Body`, we might hit a nil pointer dereference. See this SO post.

Here's the whole story. It all started with the Red Hat Product Security team substantially extending the data available in OVAL v2 streams. When this change hit production, we started seeing issues with the notifier deployed in our OpenShift cluster.
The first symptom I found was that the notifier simply could not process an update operation created by a certain stream. In the notifier pod, it looked like this:
![Screenshot from 2021-02-05 14-32-04](https://user-images.githubusercontent.com/22600243/107040417-7e481600-67bf-11eb-8aa8-45cbba4a31b2.png)
Basically, all four processors would try to acquire a lock on one update operation and none of them could actually acquire it. This caused the whole notifier to be stuck forever on one update operation.
I connected to our Clair DB and found out that there is an advisory lock sitting there.
![image](https://user-images.githubusercontent.com/22600243/107041058-53aa8d00-67c0-11eb-9a8a-9519f9cf6221.png)
I couldn't find out how long it had been there, but I started to monitor it and could see that it was still there after 30 minutes. No operation could take that long, so the lock must have been stale.
As far as I can understand, these advisory locks get freed up when either a transaction or a session closes. So I came to the conclusion that some process must have died without gracefully tearing down whatever DB operation it was doing.
And indeed, I found out that our notifier had crashed a couple of hours earlier and OpenShift had then simply spun up new pods:
![image](https://user-images.githubusercontent.com/22600243/107041631-0aa70880-67c1-11eb-9b42-ed8141a2065c.png)
So I guess you're getting the picture now:
So the crash is the cause of it all. To the best of my knowledge, it happens here when we try to refer to `resp.Body`. However, `resp.Body` is guaranteed to be non-nil only when `err` is nil, see here.

Hence my change. I suggest we first check the error returned by `Do`. If it's non-nil, we return as we'd do anyway. If it's nil, we can safely `defer` closing of the body, as it's guaranteed to be non-nil.

Addendum: The crash seems to have appeared after we sent an HTTP request with a 273MB JSON body here. I still don't know if that's a problem of clair per se or if it is related to our deployment. Be that as it may, this PR won't solve that issue. However, it should make sure that the notifier won't get stuck in an infinite loop.