
Add max duration timeout #12322

Merged: 12 commits into main, Dec 15, 2021
Conversation

@skonto skonto commented Nov 19, 2021

Fixes #10851

Proposed Changes

  • Adds a timeout for the max duration of a connection on the queue-proxy side
  • Exposes the timeout at the revision level without setting a default
  • Adds the required unit tests; updates the schema and docs
  • Adds a test to the conformance suite

Benchmark before and after for the HTTP timeout handler:

benchmark                                 old ns/op     new ns/op     delta
BenchmarkTimeoutHandler/sequential-12     2014          2427          +20.51%
BenchmarkTimeoutHandler/parallel-12       557           702           +25.89%

benchmark                                 old allocs     new allocs     delta
BenchmarkTimeoutHandler/sequential-12     6              6              +0.00%
BenchmarkTimeoutHandler/parallel-12       6              6              +0.00%

benchmark                                 old bytes     new bytes     delta
BenchmarkTimeoutHandler/sequential-12     686           711           +3.64%
BenchmarkTimeoutHandler/parallel-12       682           703           +3.08%

Release Note

`template.spec.maxDurationSeconds` can now be set to limit the total duration of a request
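For illustration, a hedged sketch of how the new field might be set on a Service (the service name and image are placeholders; `timeoutSeconds` is shown only to contrast the existing time-to-first-byte timeout):

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: hello                      # placeholder name
spec:
  template:
    spec:
      timeoutSeconds: 60           # existing timeout (time to first byte)
      maxDurationSeconds: 300      # new: caps the total request duration at 5 minutes
      containers:
        - image: gcr.io/example/hello  # placeholder image
```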

@knative-prow-robot knative-prow-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 19, 2021
@google-cla google-cla bot added the cla: yes Indicates the PR's author has signed the CLA. label Nov 19, 2021
@knative-prow-robot knative-prow-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. area/API API objects and controllers area/autoscale area/networking labels Nov 19, 2021
codecov bot commented Nov 19, 2021

Codecov Report

Merging #12322 (25a2f98) into main (580a23d) will increase coverage by 0.01%.
The diff coverage is 75.75%.


@@            Coverage Diff             @@
##             main   #12322      +/-   ##
==========================================
+ Coverage   87.43%   87.44%   +0.01%     
==========================================
  Files         195      195              
  Lines        9658     9673      +15     
==========================================
+ Hits         8444     8459      +15     
- Misses        930      931       +1     
+ Partials      284      283       -1     
Impacted Files Coverage Δ
cmd/queue/main.go 0.53% <0.00%> (-0.01%) ⬇️
pkg/apis/serving/v1/revision_types.go 100.00% <ø> (ø)
pkg/http/handler/timeout.go 89.90% <76.00%> (+3.39%) ⬆️
pkg/reconciler/revision/resources/queue.go 98.24% <100.00%> (+0.04%) ⬆️
pkg/apis/serving/metadata_validation.go 95.91% <0.00%> (-0.24%) ⬇️
pkg/apis/serving/v1/route_validation.go 97.72% <0.00%> (-0.13%) ⬇️
pkg/apis/serving/v1/service_validation.go 100.00% <0.00%> (ø)

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 580a23d...25a2f98. Read the comment docs.

skonto commented Nov 19, 2021

/ok-to-test

@knative-prow-robot knative-prow-robot added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label Nov 19, 2021
skonto commented Nov 22, 2021

The failing test is unrelated:

--- FAIL: TestAutoscaleSustaining (175.94s)
    --- FAIL: TestAutoscaleSustaining/aggregation-weightedExponential (49.47s)
        autoscale_test.go:109: Creating a new Route and Configuration
        service.go:113: Creating a new Service service autoscale-sustaining-aggregation-weighted-hxvxkzui
        crd.go:36:  resource {<nil> <nil> <*>{&TypeMeta{Kind:,APIVersion:,} &ObjectMeta{Name:autoscale-sustaining-aggregation-weighted-hxvxkzui,GenerateName:,Namespace:,SelfLink:,UID:,ResourceVersion:,Generation:0,CreationTimestamp:0001-01-01 00:00:00 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{},Annotations:map[string]string{},OwnerReferences:[]OwnerReference{},Finalizers:[],ClusterName:,ManagedFields:[]ManagedFieldsEntry{},} 

@skonto skonto changed the title [wip] Add max duration timeout Add max duration timeout Nov 22, 2021
@knative-prow-robot knative-prow-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 22, 2021
skonto commented Nov 22, 2021

@dprotaso @julz pls review. I will add a separate integration test once this is merged.

Comment on lines 115 to 117
tw.mu.Lock()
tw.requestStartTime = tw.clock.Now()
tw.mu.Unlock()
Contributor:

This seems unnecessary to do locking for. Can't you add the time above when constructing the timeoutWriter?

Contributor Author:

Wanted to be closer to the actual httpServe call.

skonto (Contributor Author), Nov 22, 2021:

@markusthoemmes Regarding the lock: requestStartTime is also read by tw.tryMaxDurationTimeoutAndWriteError in the select, so it is needed with the current code:

			timedOut := tw.tryMaxDurationTimeoutAndWriteError(cur, h.maxDurationTimeout, h.body)
			if timedOut {
				maxDurationTimeoutDrained = true
				return
			} 

To be on the safe side I could move the assignment earlier, remove the lock, and make sure there is an initialized value to compare with.
Alternatively, keep the lock and check whether the value is zero; since the max timeout has already expired by the time tw.tryMaxDurationTimeoutAndWriteError is called, we simply need to fail, so the check can be written as:


if tw.requestStartTime.IsZero() || curTime.Sub(tw.requestStartTime) >= maxDurationTimeout {
...
}

WDYT?

// maxDurationTimeoutSeconds is the maximum duration in seconds a request will be allowed
// to stay open.
// +optional
MaxDurationTimeoutSeconds *int64 `json:"maxDurationTimeoutSeconds,omitempty"`
Contributor:

Without an E2E test, I would rather not add the API fields. Maybe cut the API changes out of this and ship them separately with the respective tests?

skonto (Contributor Author), Nov 22, 2021:

OK, I agree. Although I was planning to do it ASAP after merging this, I will add the test now; it is not that big of an addition.

Member:

Looks like this is resolved now, no? Given that the conformance test was added.

@skonto skonto changed the title Add max duration timeout [wip]Add max duration timeout Nov 23, 2021
@knative-prow-robot knative-prow-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 23, 2021
@knative-prow-robot knative-prow-robot added the area/test-and-release It flags unit/e2e/conformance/perf test issues for product features label Nov 23, 2021
@skonto skonto changed the title [wip]Add max duration timeout Add max duration timeout Nov 23, 2021
@knative-prow-robot knative-prow-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 23, 2021
@skonto
Copy link
Contributor Author

skonto commented Nov 23, 2021

@markusthoemmes @dprotaso @julz gentle ping.

@skonto
Copy link
Contributor Author

skonto commented Nov 23, 2021

Added an e2e test, simplified timeoutHandler logic.


// make sure that when max duration time out expires
// curTime - requestTime >= timeout
tw.requestStartTime = tw.clock.Now()
Member:

Probably reading the PR wrong, but I can't figure out where this field is used / what it does?

skonto (Contributor Author), Nov 24, 2021:

I need to remove it, I guess, now that I don't calculate any time diff; it is a relic of my previous commit.

QueueServingPort string `split_words:"true" required:"true"`
UserPort string `split_words:"true" required:"true"`
RevisionTimeoutSeconds int `split_words:"true" required:"true"`
RevisionMaxDurationTimeoutSeconds int `split_words:"true" required:"true"`
Member:

Do we have to worry about upgrades not having this env var, e.g. if the defaults ConfigMap with the new QP version updates before the new controller code is deployed? (I guess maybe not since the upgrade tests didn't fail 🤔)

skonto (Contributor Author), Nov 24, 2021:

I don't set any defaults at the revision level to avoid such issues for the next release (I will introduce defaults a release later). If the env var does not exist (the controller code is old), the new QP code will pick zero as the value for the timeout and ignore it later on (same as the idle timeout).

Member:

but this env var is set as required here, so won't envconfig panic on startup if this version of queue proxy rolls out before the corresponding code change that adds the env var?

skonto (Contributor Author), Nov 24, 2021:

Correct, my intention was to make it optional; good point, it's a copy-paste leftover. Will update in a sec.


// make sure high enough max duration has no effect in default cases
for _, tc := range testCases {
tc.maxDurationTimeoutSeconds = neverExpireMaxDurationSeconds
Member:

FWIW I'd be tempted to have one single TestRevisionTimeouts testing all the timeout cases, since they're all related anyway; extracting and modifying the base table to avoid copy-pasting three test cases doesn't seem worth the extra boilerplate to me.

Contributor Author:

I thought about this, but to me it seemed more readable to have the new test cases separated; merging all into one (I initially had it like that) can be done.

Contributor Author:

Merged the tests.

skonto commented Nov 29, 2021

@julz @dprotaso gentle ping :)

skonto commented Dec 2, 2021

Failures:

--- FAIL: TestGlobalResyncOnDefaultCMChange (0.16s)
    logger.go:130: 2021-11-24T12:21:50.828Z	ERROR	revision/controller.go:108	Failed to create resolver transport	{"error": "open /var/run/secrets/kubernetes.io/serviceaccount/ca.crt: no such file or directory"}
    logger.go:130: 2021-11-24T12:21:50.829Z	INFO	revision/controller.go:117	Fetch GitHub commit ID from kodata failed{error 26 0  "KO_DATA_PATH" does not exist or is empty}
2021/11/30 13:50:42 Error during command execution: unknown flag: --gateway-api-version
Step failed: ./test/e2e-tests.sh --gateway-api-version latest

skonto commented Dec 8, 2021

/retest

skonto commented Dec 8, 2021

@dprotaso gentle ping. There is a storm of failing tests but not related afaik.

dprotaso commented Dec 8, 2021

If we set the max duration to 30s the container could effectively be processing 2 requests

Is this different from the existing first-byte timeout though? If the first-byte timeout times out a request but the user container continues processing it the same thing would happen, no?

it's not - just mentioning it for context

/lgtm
/approve
/hold

Holding in case anyone has any last minute bikeshedding on the property name otherwise LGTM

@knative-prow-robot knative-prow-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 8, 2021
@knative-prow-robot knative-prow-robot added the lgtm Indicates that a PR is ready to be merged. label Dec 8, 2021
@knative-prow-robot:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dprotaso, skonto

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@knative-prow-robot knative-prow-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 8, 2021
dprotaso commented Dec 8, 2021

cc @markusthoemmes @julz

@julz julz left a comment:

lgtm apart from this bit of last minute bike shedding: #12322 (comment)

@knative-prow-robot knative-prow-robot removed the lgtm Indicates that a PR is ready to be merged. label Dec 9, 2021
skonto commented Dec 9, 2021

@julz hi, dropped the request prefix. @dprotaso gentle ping.

julz commented Dec 10, 2021

lgtm, thanks @skonto. I'll let @dprotaso decide whether we want to unhold so close to the release, or wait till Tuesday to land (arguably it's a new field, so the risk is probably pretty low).

skonto commented Dec 15, 2021

@dprotaso gentle ping, should we unhold?

julz commented Dec 15, 2021

Release shipped yesterday, so:

/lgtm
/unhold

@knative-prow-robot knative-prow-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 15, 2021
@knative-prow-robot knative-prow-robot added the lgtm Indicates that a PR is ready to be merged. label Dec 15, 2021
@knative-prow-robot knative-prow-robot merged commit ef89ff8 into knative:main Dec 15, 2021
@dprotaso dprotaso added this to the v1.2.0 milestone Jan 26, 2022
dprotaso added a commit to dprotaso/serving that referenced this pull request Feb 16, 2022
We added MaxDurationSeconds (knative#12322) because the behaviour of
RevisionSpec.Timeout changed from total duration to time to first byte.

In hindsight changing the behaviour of Timeout was a mistake since
it goes against the original specification.

Thus we're going to create a path for migration and the first part is
to remove MaxDurationSeconds from the RevisionSpec.
knative-prow-robot pushed a commit that referenced this pull request Feb 16, 2022
We added MaxDurationSeconds (#12322) because the behaviour of
RevisionSpec.Timeout changed from total duration to time to first byte.

In hindsight changing the behaviour of Timeout was a mistake since
it goes against the original specification.

Thus we're going to create a path for migration and the first part is
to remove MaxDurationSeconds from the RevisionSpec.
dprotaso added a commit to dprotaso/serving that referenced this pull request Feb 16, 2022
We added MaxDurationSeconds (knative#12322) because the behaviour of
RevisionSpec.Timeout changed from total duration to time to first byte.

In hindsight changing the behaviour of Timeout was a mistake since
it goes against the original specification.

Thus we're going to create a path for migration and the first part is
to remove MaxDurationSeconds from the RevisionSpec.
knative-prow-robot pushed a commit that referenced this pull request Feb 16, 2022
* Drop MaxDurationSeconds from the RevisionSpec (#12635)

We added MaxDurationSeconds (#12322) because the behaviour of
RevisionSpec.Timeout changed from total duration to time to first byte.

In hindsight changing the behaviour of Timeout was a mistake since
it goes against the original specification.

Thus we're going to create a path for migration and the first part is
to remove MaxDurationSeconds from the RevisionSpec.

* fix conformance test
nak3 pushed a commit to nak3/serving that referenced this pull request May 26, 2022
…12640)

* Drop MaxDurationSeconds from the RevisionSpec (knative#12635)

We added MaxDurationSeconds (knative#12322) because the behaviour of
RevisionSpec.Timeout changed from total duration to time to first byte.

In hindsight changing the behaviour of Timeout was a mistake since
it goes against the original specification.

Thus we're going to create a path for migration and the first part is
to remove MaxDurationSeconds from the RevisionSpec.

* fix conformance test
openshift-merge-robot pushed a commit to openshift/knative-serving that referenced this pull request May 26, 2022
* Pin to 1.23 S-O branch

* Add 0-kourier.yaml and 1-config-network.yaml to kourier.yaml (#1122)

* Rename kourier.yaml with 0-kourier.yaml

* Concat the files

* fix csv logic (#1125)

* Reduce the period and failure threshold for activator readiness (knative#12618)

The default drain timeout is 45 seconds which was much shorter than
the time it takes the activator to be recognized as not ready (2 minutes)

This was resulting in 503s since the activator was receiving traffic when it
was not expecting it

Co-authored-by: dprotaso <dprotaso@gmail.com>

* Address 503s when the autoscaler is being rolled (knative#12621)

The activator's readiness depends on the status of web socket connection
to the autoscaler. When the connection is down the activator will report
ready=false. This can occur when the autoscaler deployment is updating.

PR knative#12614 made the activator's readiness probe fail aggressively after
a single failure. This didn't seem to impact istio but with contour it
started returning 503s since the activator started to report ready=false
immediately.

This PR does two things to mitigate 503s:
- bump the readiness threshold to give the autoscaler more time to
  rollout/startup. This still remains lower than the drain duration
- Update the autoscaler rollout strategy so we spin up a new instance
  prior to bring down the older one. This is done using maxUnavailable=0

Co-authored-by: dprotaso <dprotaso@gmail.com>

* [release-1.2] Drop MaxDurationSeconds from the RevisionSpec  (knative#12640)

* Drop MaxDurationSeconds from the RevisionSpec (knative#12635)

We added MaxDurationSeconds (knative#12322) because the behaviour of
RevisionSpec.Timeout changed from total duration to time to first byte.

In hindsight changing the behaviour of Timeout was a mistake since
it goes against the original specification.

Thus we're going to create a path for migration and the first part is
to remove MaxDurationSeconds from the RevisionSpec.

* fix conformance test

* [release-1.2] fix ytt package name (knative#12657)

* fix ytt package name

* use correct path

Co-authored-by: dprotaso <dprotaso@gmail.com>

* Remove an unnecessary start delay when resolving tag to digests (knative#12669)

Co-authored-by: dprotaso <dprotaso@gmail.com>

* Drop collecting performance data in release branch (knative#12673)

Co-authored-by: dprotaso <dprotaso@gmail.com>

* bump ggcr which includes auth config lookup fixes for k8s (knative#12656)

Includes the fixes:
- google/go-containerregistry#1299
- google/go-containerregistry#1300

* Fixes an activator panic when the throttle encounters a cache.DeleteFinalStateUnknown (knative#12680)

Co-authored-by: dprotaso <dprotaso@gmail.com>

* upgrade to latest dependencies (knative#12674)

bumping knative.dev/pkg 77555ea...083dd97:
  > 083dd97 Wait for reconciler/controllers to return prior to exiting the process (# 2438)
  > df430fa dizzy: we must use `flags` instead of `pflags`, since this is not working. It seems like pflag.* adds the var to its own flag set, not the one package flag uses, and it doesn't expose the internal flag.Var externally - hence this fix. (# 2415)

Signed-off-by: Knative Automation <automation@knative.team>

* [release-1.2] fix tag to digest resolution (ggcr bump) (knative#12834)

* pin k8s dep

* Fix tag to digest resolution with K8s secrets

I forgot to bump ggcr's sub package in the prior release

github.com/google/go-containerregistry/pkg/authn/k8schain

* bump ggcr which fixes tag-to-digest resolution for Azure & GitLab (knative#12857)

Co-authored-by: Stavros Kontopoulos <st.kontopoulos@gmail.com>
Co-authored-by: Knative Prow Robot <knative-prow-robot@google.com>
Co-authored-by: dprotaso <dprotaso@gmail.com>
Co-authored-by: knative-automation <automation@knative.team>
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/API API objects and controllers area/autoscale area/networking area/test-and-release It flags unit/e2e/conformance/perf test issues for product features cla: yes Indicates the PR's author has signed the CLA. lgtm Indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Successfully merging this pull request may close these issues.

Should be possible to set an actual revision timeout (max duration, not first byte)
5 participants