Add max duration timeout #12322
Conversation
Codecov Report
@@            Coverage Diff             @@
##             main   #12322      +/-   ##
==========================================
+ Coverage   87.43%   87.44%   +0.01%
==========================================
  Files         195      195
  Lines        9658     9673      +15
==========================================
+ Hits         8444     8459      +15
- Misses        930      931       +1
+ Partials      284      283       -1
Continue to review full report at Codecov.
/ok-to-test
Failing test is not related:
pkg/http/handler/timeout.go
Outdated
tw.mu.Lock()
tw.requestStartTime = tw.clock.Now()
tw.mu.Unlock()
This seems unnecessary to lock for. Can't you set the time above, when constructing the timeoutWriter?
I wanted it to be closer to the actual httpServe call.
@markusthoemmes Regarding the lock: requestStartTime is also read by tw.tryMaxDurationTimeoutAndWriteError in the select, so it is needed with the current code:
timedOut := tw.tryMaxDurationTimeoutAndWriteError(cur, h.maxDurationTimeout, h.body)
if timedOut {
	maxDurationTimeoutDrained = true
	return
}
So I guess, to be on the safe side, I could move it earlier (above), remove the lock, and make sure there is an initialized value to compare against.
Alternatively, keep the lock and check whether the value is zero; since the max timeout has already expired by the time tw.tryMaxDurationTimeoutAndWriteError is called, we simply need to fail, so the existing code can be written as:
if tw.requestStartTime.IsZero() || curTime.Sub(tw.requestStartTime) >= maxDurationTimeout {
...
}
WDYT?
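For reference, a minimal sketch of the two options discussed above, assuming a timeoutWriter shaped roughly like the one in this PR; the constructor name, the local Clock interface, and the maxDurationExpired helper are illustrative, not the actual code:

```go
package handler

import (
	"net/http"
	"sync"
	"time"
)

// Clock is a minimal stand-in for the fakeable clock the queue-proxy
// handler injects; only Now() matters for this sketch.
type Clock interface {
	Now() time.Time
}

// timeoutWriter is trimmed to the fields relevant to this thread.
type timeoutWriter struct {
	w     http.ResponseWriter
	clock Clock

	// Option 1: set once in the constructor, read-only afterwards,
	// so no lock is needed for this particular field.
	requestStartTime time.Time

	mu sync.Mutex // still guards the writer's other mutable state
}

func newTimeoutWriter(w http.ResponseWriter, c Clock) *timeoutWriter {
	return &timeoutWriter{
		w:                w,
		clock:            c,
		requestStartTime: c.Now(),
	}
}

// Option 2: keep the locked write where it is and treat a zero start
// time as "already expired", since the max-duration timer has fired
// by the time this check runs.
func (tw *timeoutWriter) maxDurationExpired(curTime time.Time, maxDurationTimeout time.Duration) bool {
	tw.mu.Lock()
	defer tw.mu.Unlock()
	return tw.requestStartTime.IsZero() ||
		curTime.Sub(tw.requestStartTime) >= maxDurationTimeout
}
```

Option 1 makes the field write-once-before-sharing, so the per-request lock around that write can go away; Option 2 keeps the current structure and only adds the IsZero guard.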
// maxDurationTimeoutSeconds is the maximum duration in seconds a request will be allowed
// to stay open.
// +optional
MaxDurationTimeoutSeconds *int64 `json:"maxDurationTimeoutSeconds,omitempty"`
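For context, a small sketch of how an optional *int64 field like this is typically consumed; the maxDurationTimeout helper and the trimmed RevisionSpec are hypothetical, not the actual serving API code:

```go
package main

import (
	"fmt"
	"time"
)

// RevisionSpec is trimmed to the single field under discussion.
type RevisionSpec struct {
	MaxDurationTimeoutSeconds *int64 `json:"maxDurationTimeoutSeconds,omitempty"`
}

// maxDurationTimeout shows the usual pointer-field convention:
// nil means "not set", which here maps to 0, i.e. the timeout is disabled.
func maxDurationTimeout(rev *RevisionSpec) time.Duration {
	if rev.MaxDurationTimeoutSeconds == nil {
		return 0
	}
	return time.Duration(*rev.MaxDurationTimeoutSeconds) * time.Second
}

func main() {
	secs := int64(300)
	fmt.Println(maxDurationTimeout(&RevisionSpec{}))                                 // 0s (unset)
	fmt.Println(maxDurationTimeout(&RevisionSpec{MaxDurationTimeoutSeconds: &secs})) // 5m0s
}
```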
Without an E2E test, I would rather not add the API fields. Maybe cut the API changes out of this and ship them separately with the respective tests?
OK, I agree. Although I was planning to do it ASAP after merging this, I will add the test here; it is not that big of an addition.
Looks like this is resolved now, no? Given that the conformance test was added?
@markusthoemmes @dprotaso @julz gentle ping.
Added an e2e test, simplified timeoutHandler logic.
pkg/http/handler/timeout.go
Outdated
// make sure that when the max duration timeout expires
// curTime - requestTime >= timeout
tw.requestStartTime = tw.clock.Now()
Probably reading the PR wrong, but I can't figure out where this field is used / what it does?
I need to remove it, I guess, now that I don't calculate any time diff; it is a relic of my previous commit.
cmd/queue/main.go
Outdated
QueueServingPort                  string `split_words:"true" required:"true"`
UserPort                          string `split_words:"true" required:"true"`
RevisionTimeoutSeconds            int    `split_words:"true" required:"true"`
RevisionMaxDurationTimeoutSeconds int    `split_words:"true" required:"true"`
Do we have to worry about upgrades not having this env var, e.g. if the defaults ConfigMap with the new QP version updates before the new controller code is deployed? (I guess maybe not since the upgrade tests didn't fail 🤔)
I don't set any defaults at the revision level, to avoid such issues for the next release (I will introduce defaults a release later). If the env var does not exist (controller code is old), then the new QP code will pick zero as the value for the timeout and will ignore it later on (same as the idle timeout).
but this env var is set as required here, so won't envconfig panic on startup if this version of queue proxy rolls out before the corresponding code change that adds the env var?
Correct, my intention was to make it optional; good point, it's a copy-paste leftover. Will update in a sec.
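A minimal sketch of what the optional variant could look like, assuming the kelseyhightower/envconfig package that these struct tags suggest; the config struct is trimmed to the two relevant fields and is not the actual cmd/queue/main.go:

```go
package main

import (
	"log"
	"time"

	"github.com/kelseyhightower/envconfig"
)

// config keeps required:"true" for the existing timeout but drops it for
// the new one, so envconfig does not fail when an older controller has
// not set the env var yet.
type config struct {
	RevisionTimeoutSeconds            int `split_words:"true" required:"true"`
	RevisionMaxDurationTimeoutSeconds int `split_words:"true"` // optional: missing env var -> 0
}

func main() {
	var env config
	if err := envconfig.Process("", &env); err != nil {
		log.Fatal(err)
	}

	// Zero means "not configured": skip wiring the max-duration timeout,
	// mirroring how the idle timeout is handled.
	if env.RevisionMaxDurationTimeoutSeconds > 0 {
		maxDuration := time.Duration(env.RevisionMaxDurationTimeoutSeconds) * time.Second
		_ = maxDuration // hand this to the timeout handler
	}
}
```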
// make sure high enough max duration has no effect in default cases
for _, tc := range testCases {
	tc.maxDurationTimeoutSeconds = neverExpireMaxDurationSeconds
FWIW I'd be tempted to have one single TestRevisionTimeouts test covering all the timeout cases, since they're all related anyway; extracting and modifying the base table to avoid copy-pasting three test cases doesn't seem worth the extra boilerplate we end up with, to me.
I thought about this, but to me it seemed more readable to have the new test cases separated. Merging everything into one (I initially had it like that) can be done, though.
Merged the tests.
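For illustration, one possible shape of the merged table-driven test; the case values and the loop body are assumptions, since the real conformance test uses the project's e2e helpers rather than this stand-in:

```go
package e2e

import (
	"testing"
	"time"
)

// TestRevisionTimeouts sketches the merged table: the timeout behaviours
// live in one table instead of three copied test functions.
func TestRevisionTimeouts(t *testing.T) {
	testCases := []struct {
		name                      string
		timeoutSeconds            int64
		maxDurationTimeoutSeconds int64
		initialSleep              time.Duration
		expectSuccess             bool
	}{
		{name: "within both timeouts", timeoutSeconds: 10, maxDurationTimeoutSeconds: 60, initialSleep: 2 * time.Second, expectSuccess: true},
		{name: "exceeds the first-byte timeout", timeoutSeconds: 10, maxDurationTimeoutSeconds: 60, initialSleep: 15 * time.Second, expectSuccess: false},
		{name: "exceeds the max duration timeout", timeoutSeconds: 10, maxDurationTimeoutSeconds: 5, initialSleep: 8 * time.Second, expectSuccess: false},
	}

	for _, tc := range testCases {
		tc := tc
		t.Run(tc.name, func(t *testing.T) {
			t.Parallel()
			// Create a Revision with tc.timeoutSeconds and
			// tc.maxDurationTimeoutSeconds, send a request that sleeps
			// for tc.initialSleep, and assert the outcome matches
			// tc.expectSuccess (elided: the real test uses e2e helpers).
		})
	}
}
```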
Failures:
/retest
@dprotaso gentle ping. There is a storm of failing tests, but they are not related AFAIK.
It's not - just mentioning it for context. /lgtm Holding in case anyone has any last-minute bikeshedding on the property name; otherwise LGTM.
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: dprotaso, skonto.
lgtm apart from this bit of last minute bike shedding: #12322 (comment)
@dprotaso gentle ping, should we unhold?
Release shipped yesterday, so: /lgtm
Drop MaxDurationSeconds from the RevisionSpec (#12635): We added MaxDurationSeconds (#12322) because the behaviour of RevisionSpec.Timeout changed from total duration to time to first byte. In hindsight, changing the behaviour of Timeout was a mistake, since it goes against the original specification. Thus we're going to create a path for migration, and the first part is to remove MaxDurationSeconds from the RevisionSpec.
* fix conformance test
* Pin to 1.23 S-O branch
* Add 0-kourier.yaml and 1-config-network.yaml to kourier.yaml (#1122)
  * Rename kourier.yaml to 0-kourier.yaml
  * Concat the files
* fix csv logic (#1125)
* Reduce the period and failure threshold for activator readiness (knative#12618)
  The default drain timeout is 45 seconds, which was much shorter than the time it takes the activator to be recognized as not ready (2 minutes). This was resulting in 503s, since the activator was receiving traffic when it was not expecting it.
* Address 503s when the autoscaler is being rolled (knative#12621)
  The activator's readiness depends on the status of the web socket connection to the autoscaler. When the connection is down, the activator reports ready=false. This can occur when the autoscaler deployment is updating. PR knative#12614 made the activator's readiness probe fail aggressively after a single failure. This didn't seem to impact istio, but with contour it started returning 503s since the activator began reporting ready=false immediately. This PR does two things to mitigate 503s:
  - bump the readiness threshold to give the autoscaler more time to roll out/start up; this still remains lower than the drain duration
  - update the autoscaler rollout strategy so we spin up a new instance prior to bringing down the older one, using maxUnavailable=0
* [release-1.2] Drop MaxDurationSeconds from the RevisionSpec (knative#12640)
  * Drop MaxDurationSeconds from the RevisionSpec (knative#12635)
  * fix conformance test
* [release-1.2] fix ytt package name (knative#12657)
  * fix ytt package name
  * use correct path
* Remove an unnecessary start delay when resolving tag to digests (knative#12669)
* Drop collecting performance data in release branch (knative#12673)
* bump ggcr which includes auth config lookup fixes for k8s (knative#12656)
  Includes the fixes: google/go-containerregistry#1299 and google/go-containerregistry#1300
* Fixes an activator panic when the throttle encounters a cache.DeleteFinalStateUnknown (knative#12680)
* upgrade to latest dependencies (knative#12674)
  bumping knative.dev/pkg 77555ea...083dd97:
  > 083dd97 Wait for reconciler/controllers to return prior to exiting the process (# 2438)
  > df430fa dizzy: we must use `flags` instead of `pflags`, since this is not working. It seems like pflag.* adds the var to its own flag set, not the one package flag uses, and it doesn't expose the internal flag.Var externally - hence this fix. (# 2415)
* [release-1.2] fix tag to digest resolution (ggcr bump) (knative#12834)
  * pin k8s dep
  * Fix tag to digest resolution with K8s secrets (the ggcr sub package github.com/google/go-containerregistry/pkg/authn/k8schain was not bumped in the prior release)
* bump ggcr which fixes tag-to-digest resolution for Azure & GitLab (knative#12857)
Fixes #10851
Proposed Changes
Benchmark before and after for the HTTP timeout handler:
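The benchmark numbers themselves aren't reproduced here, but a sketch of how such a before/after benchmark could be driven with the standard library looks roughly like this; http.TimeoutHandler stands in for the PR's handler, so the real constructor (with its first-byte and max-duration timeouts) would be swapped in to compare:

```go
package handler

import (
	"net/http"
	"net/http/httptest"
	"testing"
	"time"
)

func BenchmarkTimeoutHandler(b *testing.B) {
	inner := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	// Stand-in wrapper; replace with the PR's timeout handler constructor
	// to measure before/after overhead.
	h := http.TimeoutHandler(inner, 10*time.Second, "timed out")

	req := httptest.NewRequest(http.MethodGet, "/", nil)
	b.ReportAllocs()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		h.ServeHTTP(httptest.NewRecorder(), req)
	}
}
```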
Release Note