
Add max duration timeout #12322

Merged: 12 commits into main, Dec 15, 2021
Conversation

@skonto skonto commented Nov 19, 2021

Fixes #10851

Proposed Changes

  • Adds a timeout for the max duration of a connection on the queue-proxy side
  • Exposes the timeout at the revision level without setting a default
  • Adds the required unit tests; updates the schema and docs
  • Adds a test to the conformance suite

Benchmark before and after for the HTTP timeout handler:

benchmark                                 old ns/op     new ns/op     delta
BenchmarkTimeoutHandler/sequential-12     2014          2427          +20.51%
BenchmarkTimeoutHandler/parallel-12       557           702           +25.89%

benchmark                                 old allocs     new allocs     delta
BenchmarkTimeoutHandler/sequential-12     6              6              +0.00%
BenchmarkTimeoutHandler/parallel-12       6              6              +0.00%

benchmark                                 old bytes     new bytes     delta
BenchmarkTimeoutHandler/sequential-12     686           711           +3.64%
BenchmarkTimeoutHandler/parallel-12       682           703           +3.08%

Release Note

`template.spec.maxDurationSeconds` can now be set to limit the total duration of a request
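For illustration, a hedged sketch of how the new field might be set on a Service (the service name and image are placeholders; `timeoutSeconds` is shown only to contrast the existing time-to-first-byte timeout):

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: hello                      # placeholder name
spec:
  template:
    spec:
      timeoutSeconds: 60           # existing timeout (time to first byte)
      maxDurationSeconds: 300      # new: caps the total request duration at 5 minutes
      containers:
        - image: gcr.io/example/hello  # placeholder image
```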

@knative-prow-robot knative-prow-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 19, 2021
@google-cla google-cla bot added the cla: yes Indicates the PR's author has signed the CLA. label Nov 19, 2021
@knative-prow-robot knative-prow-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. area/API API objects and controllers area/autoscale area/networking labels Nov 19, 2021
codecov bot commented Nov 19, 2021

Codecov Report

Merging #12322 (25a2f98) into main (580a23d) will increase coverage by 0.01%.
The diff coverage is 75.75%.


@@            Coverage Diff             @@
##             main   #12322      +/-   ##
==========================================
+ Coverage   87.43%   87.44%   +0.01%     
==========================================
  Files         195      195              
  Lines        9658     9673      +15     
==========================================
+ Hits         8444     8459      +15     
- Misses        930      931       +1     
+ Partials      284      283       -1     
Impacted Files Coverage Δ
cmd/queue/main.go 0.53% <0.00%> (-0.01%) ⬇️
pkg/apis/serving/v1/revision_types.go 100.00% <ø> (ø)
pkg/http/handler/timeout.go 89.90% <76.00%> (+3.39%) ⬆️
pkg/reconciler/revision/resources/queue.go 98.24% <100.00%> (+0.04%) ⬆️
pkg/apis/serving/metadata_validation.go 95.91% <0.00%> (-0.24%) ⬇️
pkg/apis/serving/v1/route_validation.go 97.72% <0.00%> (-0.13%) ⬇️
pkg/apis/serving/v1/service_validation.go 100.00% <0.00%> (ø)

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 580a23d...25a2f98. Read the comment docs.

skonto commented Nov 19, 2021

/ok-to-test

@knative-prow-robot knative-prow-robot added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label Nov 19, 2021
skonto commented Nov 22, 2021

The failing test is unrelated:

--- FAIL: TestAutoscaleSustaining (175.94s)
    --- FAIL: TestAutoscaleSustaining/aggregation-weightedExponential (49.47s)
        autoscale_test.go:109: Creating a new Route and Configuration
        service.go:113: Creating a new Service service autoscale-sustaining-aggregation-weighted-hxvxkzui
        crd.go:36:  resource {<nil> <nil> <*>{&TypeMeta{Kind:,APIVersion:,} &ObjectMeta{Name:autoscale-sustaining-aggregation-weighted-hxvxkzui,GenerateName:,Namespace:,SelfLink:,UID:,ResourceVersion:,Generation:0,CreationTimestamp:0001-01-01 00:00:00 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{},Annotations:map[string]string{},OwnerReferences:[]OwnerReference{},Finalizers:[],ClusterName:,ManagedFields:[]ManagedFieldsEntry{},} 

@skonto skonto changed the title [wip] Add max duration timeout Add max duration timeout Nov 22, 2021
@knative-prow-robot knative-prow-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 22, 2021
skonto commented Nov 22, 2021

@dprotaso @julz pls review. I will add a separate integration test once this is merged.

Comment on lines 115 to 117
tw.mu.Lock()
tw.requestStartTime = tw.clock.Now()
tw.mu.Unlock()
Contributor:

This seems unnecessary to do locking for. Can't you add the time above when constructing the timeoutWriter?

Contributor Author:

Wanted to be closer to the actual httpServe call.

skonto (Contributor Author), Nov 22, 2021:

@markusthoemmes Regarding the lock: requestStartTime is also read by tw.tryMaxDurationTimeoutAndWriteError in the select, so it is needed with the current code:

			timedOut := tw.tryMaxDurationTimeoutAndWriteError(cur, h.maxDurationTimeout, h.body)
			if timedOut {
				maxDurationTimeoutDrained = true
				return
			} 

To be on the safe side I could move the assignment earlier, remove the lock, and make sure there is an initialized value to compare with.
Alternatively, keep the lock and check whether the value is zero; since the max timeout has already expired by the time tw.tryMaxDurationTimeoutAndWriteError is called, we simply need to fail, so the check can be written as:


if tw.requestStartTime.IsZero() || curTime.Sub(tw.requestStartTime) >= maxDurationTimeout {
...
}

WDYT?

// maxDurationTimeoutSeconds is the maximum duration in seconds a request will be allowed
// to stay open.
// +optional
MaxDurationTimeoutSeconds *int64 `json:"maxDurationTimeoutSeconds,omitempty"`
Contributor:

Without an E2E test, I would rather not add the API fields. Maybe cut the API changes out of this and ship them separately with the respective tests?

skonto (Contributor Author), Nov 22, 2021:

OK, I agree. Although I was planning to do it ASAP after merging this, I will add the test now; it is not that big of an addition.

Member:

Looks like this is resolved now, no? Given that the conformance test was added.

@skonto skonto changed the title Add max duration timeout [wip]Add max duration timeout Nov 23, 2021
@knative-prow-robot knative-prow-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 23, 2021
@knative-prow-robot knative-prow-robot added the area/test-and-release It flags unit/e2e/conformance/perf test issues for product features label Nov 23, 2021
@skonto skonto changed the title [wip]Add max duration timeout Add max duration timeout Nov 23, 2021
@knative-prow-robot knative-prow-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 23, 2021
@skonto
Copy link
Contributor Author

skonto commented Nov 23, 2021

@markusthoemmes @dprotaso @julz gentle ping.

@skonto
Copy link
Contributor Author

skonto commented Nov 23, 2021

Added an e2e test, simplified timeoutHandler logic.


// make sure that when max duration time out expires
// curTime - requestTime >= timeout
tw.requestStartTime = tw.clock.Now()
Member:

Probably reading the PR wrong, but I can't figure out where this field is used / what it does?

skonto (Contributor Author), Nov 24, 2021:

I need to remove it, I guess, now that I don't calculate any time diff; it is a relic of my previous commit.

QueueServingPort string `split_words:"true" required:"true"`
UserPort string `split_words:"true" required:"true"`
RevisionTimeoutSeconds int `split_words:"true" required:"true"`
RevisionMaxDurationTimeoutSeconds int `split_words:"true" required:"true"`
Member:

Do we have to worry about upgrades not having this env var, e.g. if the defaults ConfigMap with the new QP version updates before the new controller code is deployed? (I guess maybe not since the upgrade tests didn't fail 🤔)

skonto (Contributor Author), Nov 24, 2021:

I don't set any defaults at the revision level to avoid such issues for the next release (I will introduce defaults a release later). If the env var does not exist (the controller code is old), the new QP code will pick zero as the value for the timeout and ignore it later on (same as the idle timeout).

Member:

but this env var is set as required here, so won't envconfig panic on startup if this version of queue proxy rolls out before the corresponding code change that adds the env var?

skonto (Contributor Author), Nov 24, 2021:

Correct, my intention was to make it optional; good point, it's a copy-paste leftover. Will update in a sec.


// make sure high enough max duration has no effect in default cases
for _, tc := range testCases {
tc.maxDurationTimeoutSeconds = neverExpireMaxDurationSeconds
Member:

FWIW I'd be tempted to have one single TestRevisionTimeouts testing all the timeout cases, since they're all related anyway; extracting and modifying the base table to avoid copy-pasting three test cases doesn't seem worth the extra boilerplate to me.

Contributor Author:

I thought about this, but to me it seemed more readable to have the new test cases separated; merging all into one (I initially had it like that) can be done.

Contributor Author:

Merged the tests.

skonto commented Nov 29, 2021

@julz @dprotaso gentle ping :)

skonto commented Dec 2, 2021

Failures:

--- FAIL: TestGlobalResyncOnDefaultCMChange (0.16s)
    logger.go:130: 2021-11-24T12:21:50.828Z	ERROR	revision/controller.go:108	Failed to create resolver transport	{"error": "open /var/run/secrets/kubernetes.io/serviceaccount/ca.crt: no such file or directory"}
    logger.go:130: 2021-11-24T12:21:50.829Z	INFO	revision/controller.go:117	Fetch GitHub commit ID from kodata failed{error 26 0  "KO_DATA_PATH" does not exist or is empty}
2021/11/30 13:50:42 Error during command execution: unknown flag: --gateway-api-version
Step failed: ./test/e2e-tests.sh --gateway-api-version latest

skonto commented Dec 8, 2021

/retest

skonto commented Dec 8, 2021

@dprotaso gentle ping. There is a storm of failing tests but not related afaik.

dprotaso commented Dec 8, 2021

If we set the max duration to 30s the container could effectively be processing 2 requests

Is this different from the existing first-byte timeout though? If the first-byte timeout times out a request but the user container continues processing it the same thing would happen, no?

it's not - just mentioning it for context

/lgtm
/approve
/hold

Holding in case anyone has any last minute bikeshedding on the property name otherwise LGTM

@knative-prow-robot knative-prow-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 8, 2021
@knative-prow-robot knative-prow-robot added the lgtm Indicates that a PR is ready to be merged. label Dec 8, 2021
@knative-prow-robot:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dprotaso, skonto

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@knative-prow-robot knative-prow-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 8, 2021
dprotaso commented Dec 8, 2021

cc @markusthoemmes @julz

@julz julz left a comment:

lgtm apart from this bit of last minute bike shedding: #12322 (comment)

@knative-prow-robot knative-prow-robot removed the lgtm Indicates that a PR is ready to be merged. label Dec 9, 2021
skonto commented Dec 9, 2021

@julz hi, dropped the request prefix. @dprotaso gentle ping.

julz commented Dec 10, 2021

lgtm, thanks @skonto. I'll let @dprotaso decide whether we want to unhold so close to the release, or wait till Tuesday to land (arguably it's a new field, so the risk is probably pretty low).

skonto commented Dec 15, 2021

@dprotaso gentle ping, should we unhold?

julz commented Dec 15, 2021

Release shipped yesterday, so:

/lgtm
/unhold

@knative-prow-robot knative-prow-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 15, 2021
@knative-prow-robot knative-prow-robot added the lgtm Indicates that a PR is ready to be merged. label Dec 15, 2021
@knative-prow-robot knative-prow-robot merged commit ef89ff8 into knative:main Dec 15, 2021
@dprotaso dprotaso added this to the v1.2.0 milestone Jan 26, 2022
dprotaso added a commit to dprotaso/serving that referenced this pull request Feb 16, 2022
We added MaxDurationSeconds (knative#12322) because the behaviour of
RevisionSpec.Timeout changed from total duration to time to first byte.

In hindsight changing the behaviour of Timeout was a mistake since
it goes against the original specification.

Thus we're going to create a path for migration and the first part is
to remove MaxDurationSeconds from the RevisionSpec.
knative-prow-robot pushed a commit that referenced this pull request Feb 16, 2022
We added MaxDurationSeconds (#12322) because the behaviour of
RevisionSpec.Timeout changed from total duration to time to first byte.

In hindsight changing the behaviour of Timeout was a mistake since
it goes against the original specification.

Thus we're going to create a path for migration and the first part is
to remove MaxDurationSeconds from the RevisionSpec.
dprotaso added a commit to dprotaso/serving that referenced this pull request Feb 16, 2022
We added MaxDurationSeconds (knative#12322) because the behaviour of
RevisionSpec.Timeout changed from total duration to time to first byte.

In hindsight changing the behaviour of Timeout was a mistake since
it goes against the original specification.

Thus we're going to create a path for migration and the first part is
to remove MaxDurationSeconds from the RevisionSpec.
knative-prow-robot pushed a commit that referenced this pull request Feb 16, 2022
* Drop MaxDurationSeconds from the RevisionSpec (#12635)

We added MaxDurationSeconds (#12322) because the behaviour of
RevisionSpec.Timeout changed from total duration to time to first byte.

In hindsight changing the behaviour of Timeout was a mistake since
it goes against the original specification.

Thus we're going to create a path for migration and the first part is
to remove MaxDurationSeconds from the RevisionSpec.

* fix conformance test
nak3 pushed a commit to nak3/serving that referenced this pull request May 26, 2022
…12640)

* Drop MaxDurationSeconds from the RevisionSpec (knative#12635)

We added MaxDurationSeconds (knative#12322) because the behaviour of
RevisionSpec.Timeout changed from total duration to time to first byte.

In hindsight changing the behaviour of Timeout was a mistake since
it goes against the original specification.

Thus we're going to create a path for migration and the first part is
to remove MaxDurationSeconds from the RevisionSpec.

* fix conformance test
openshift-merge-robot pushed a commit to openshift/knative-serving that referenced this pull request May 26, 2022
* Pin to 1.23 S-O branch

* Add 0-kourier.yaml and 1-config-network.yaml to kourier.yaml (#1122)

* Rename kourier.yaml with 0-kourier.yaml

* Concat the files

* fix csv logic (#1125)

* Reduce the period and failure threshold for activator readiness (knative#12618)

The default drain timeout is 45 seconds which was much shorter than
the time it takes the activator to be recognized as not ready (2 minutes)

This was resulting in 503s since the activator was receiving traffic when it
was not expecting it

Co-authored-by: dprotaso <dprotaso@gmail.com>

* Address 503s when the autoscaler is being rolled (knative#12621)

The activator's readiness depends on the status of web socket connection
to the autoscaler. When the connection is down the activator will report
ready=false. This can occur when the autoscaler deployment is updating.

PR knative#12614 made the activator's readiness probe fail aggressively after
a single failure. This didn't seem to impact istio but with contour it
started returning 503s since the activator started to report ready=false
immediately.

This PR does two things to mitigate 503s:
- bump the readiness threshold to give the autoscaler more time to
  rollout/startup. This still remains lower than the drain duration
- Update the autoscaler rollout strategy so we spin up a new instance
  prior to bring down the older one. This is done using maxUnavailable=0

Co-authored-by: dprotaso <dprotaso@gmail.com>

* [release-1.2] Drop MaxDurationSeconds from the RevisionSpec  (knative#12640)

* Drop MaxDurationSeconds from the RevisionSpec (knative#12635)

We added MaxDurationSeconds (knative#12322) because the behaviour of
RevisionSpec.Timeout changed from total duration to time to first byte.

In hindsight changing the behaviour of Timeout was a mistake since
it goes against the original specification.

Thus we're going to create a path for migration and the first part is
to remove MaxDurationSeconds from the RevisionSpec.

* fix conformance test

* [release-1.2] fix ytt package name (knative#12657)

* fix ytt package name

* use correct path

Co-authored-by: dprotaso <dprotaso@gmail.com>

* Remove an unnecessary start delay when resolving tag to digests (knative#12669)

Co-authored-by: dprotaso <dprotaso@gmail.com>

* Drop collecting performance data in release branch (knative#12673)

Co-authored-by: dprotaso <dprotaso@gmail.com>

* bump ggcr which includes auth config lookup fixes for k8s (knative#12656)

Includes the fixes:
- google/go-containerregistry#1299
- google/go-containerregistry#1300

* Fixes an activator panic when the throttle encounters a cache.DeleteFinalStateUnknown (knative#12680)

Co-authored-by: dprotaso <dprotaso@gmail.com>

* upgrade to latest dependencies (knative#12674)

bumping knative.dev/pkg 77555ea...083dd97:
  > 083dd97 Wait for reconciler/controllers to return prior to exiting the process (# 2438)
  > df430fa dizzy: we must use `flags` instead of `pflags`, since this is not working. It seems like pflag.* adds the var to its own flag set, not the one package flag uses, and it doesn't expose the internal flag.Var externally - hence this fix. (# 2415)

Signed-off-by: Knative Automation <automation@knative.team>

* [release-1.2] fix tag to digest resolution (ggcr bump) (knative#12834)

* pin k8s dep

* Fix tag to digest resolution with K8s secrets

I forgot to bump ggcr's sub package in the prior release

github.com/google/go-containerregistry/pkg/authn/k8schain

* bump ggcr which fixes tag-to-digest resolution for Azure & GitLab (knative#12857)

Co-authored-by: Stavros Kontopoulos <st.kontopoulos@gmail.com>
Co-authored-by: Knative Prow Robot <knative-prow-robot@google.com>
Co-authored-by: dprotaso <dprotaso@gmail.com>
Co-authored-by: knative-automation <automation@knative.team>
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/API API objects and controllers area/autoscale area/networking area/test-and-release It flags unit/e2e/conformance/perf test issues for product features cla: yes Indicates the PR's author has signed the CLA. lgtm Indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Successfully merging this pull request may close these issues.

Should be possible to set an actual revision timeout (max duration, not first byte)
5 participants