
HTTP 503 responses during load test on GKE during a node auto-scale up #1797

Closed
ahmetb opened this issue Aug 3, 2018 · 8 comments
Labels: area/autoscale, area/networking, kind/bug

Comments

@ahmetb (Contributor) commented Aug 3, 2018

/area autoscale
/area networking
/kind bug

I use Knative on GKE (manual install) with node autoscaling enabled. When I run a load test with n=100000 and concurrency=200 (or 1000), I sometimes see some responses fail with HTTP 503 or 504.

Expected Behavior

All requests succeed.

Actual Behavior

  • A small percentage (~0.2%) of requests fail with HTTP 503 (and sometimes HTTP 504). The test almost never completes without any errors (even when GKE scales up to 10 nodes).
  • The initial load test made the GKE API unavailable: the GKE master is a single instance by default, the load test triggered a node scale-up, which caused the master to resize, which took the GKE API down. This caused more 503s.
  • I've actually had -n=100000 -c=200 succeed with no errors many times.

Steps to Reproduce the Problem

  1. Install Knative on GKE:
gcloud container clusters create $CLUSTER_NAME \
  --zone=$CLUSTER_ZONE \
  --cluster-version=latest \
  --machine-type=n1-standard-4 \
  --enable-autoscaling --min-nodes=1 --max-nodes=10 \
  --enable-autorepair \
  --scopes=service-control,service-management,compute-rw,storage-ro,cloud-platform,logging-write,monitoring-write,pubsub,datastore \
  --num-nodes=3

1b. Make sure the GKE cluster is hovering around 3-4 nodes (not scaled up yet).
2. Deploy the helloworld-go app (a deployment sketch follows these steps).
3. Install hey: go get github.com/rakyll/hey
4. hey -m GET -n 100000 -c 1000 -host helloworld-go.default.example.com http://35.188.214.219/
5. Observe: GKE nodes scale up (check with gcloud compute instances list) within a few seconds.
6. Observe: during this scale-up, kubectl get pods shows a lot of Pending pods (which don't come up fast enough before the load test completes).

Summary:
  Total:	123.2564 secs
  Slowest:	13.2501 secs
  Fastest:	0.0425 secs
  Average:	1.1116 secs
  Requests/sec:	811.3169

  Total data:	1310208 bytes
  Size/request:	13 bytes
...
Status code distribution:
  [200]	99768 responses
  [503]	232 responses

Alternatively, it shows:

Status code distribution:
  [200]	98229 responses
  [503]	1592 responses
  [504]	176 responses

Error distribution:
  [3]	Get http://35.188.214.219/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
  7. Even if GKE is already scaled up to 10 nodes, re-run the load test and you'll still see 503 or 504 errors.
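
For step 2, a minimal sketch of deploying the helloworld-go sample as a Knative Service; the image name and the Service schema below are assumptions, and the exact manifest depends on the Knative Serving version installed:

kubectl apply -f - <<EOF
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-go
  namespace: default
spec:
  template:
    spec:
      containers:
      - image: gcr.io/knative-samples/helloworld-go  # stock sample image (assumed)
        env:
        - name: TARGET
          value: "Go Sample v1"
EOF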
@google-prow-robot added labels area/API, area/autoscale, area/build, area/monitoring, area/networking, area/test-and-release, kind/question, kind/bug, kind/cleanup, kind/doc, kind/feature, kind/good-first-issue, kind/process, kind/spec on Aug 3, 2018
@ahmetb (Contributor, Author) commented Aug 3, 2018

Whoa, the commented-out area caused all labels to be added.

@google-prow-robot removed labels area/API, area/autoscale, area/build, area/monitoring, area/test-and-release, kind/cleanup, kind/doc, kind/feature, kind/good-first-issue, kind/process, kind/question on Aug 3, 2018
@google-prow-robot removed the kind/spec label on Aug 3, 2018
@google-prow-robot commented Aug 3, 2018
@ahmetb: Those labels are not set on the issue: area/api, area/autoscale, area/build, area/monitoring, area/test-and-release, kind/cleanup, kind/doc, kind/feature, kind/good-first-issue, kind/process, kind/question, kind/spec

In response to this:

/remove-area API
/remove-area autoscale
/remove-area build
/remove-area monitoring
/remove-area test-and-release
/remove-kind cleanup
/remove-kind doc
/remove-kind feature
/remove-kind good-first-issue
/remove-kind process
/remove-kind question
/remove-kind spec

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@tcnghia (Contributor) commented Aug 3, 2018

@ahmetb some 503s may be due to our throttling. Can you try changing concurrencyTarget in the autoscaling config map?

If possible, can you also include the error messages in your distribution? There probably won't be too many unique messages.
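
A hedged sketch of this suggestion, for reference; the config map name, namespace, and key below are assumptions and vary by Knative release:

kubectl -n knative-serving edit configmap config-autoscaler
# then raise the default per-pod concurrency target, e.g. (key name assumed):
#   container-concurrency-target-default: "10"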

@ahmetb (Contributor, Author) commented Aug 3, 2018

The tool I'm using isn't making it easy to get the response body.

If I can find time, I'll get back to looking at the components for error/throttling messages and will try to get the response body as well. But if you have time, I encourage you to dig in; it's easy to repro on top of a default cluster/app.

@tcnghia self-assigned this Aug 8, 2018
@tcnghia (Contributor) commented Aug 29, 2018

I am running with the latest Knative using a 1000 QPS test for 300 seconds, without sidecar injection, with concurrencyTarget=2, and I see this breakdown:

  • 300975 responses with HTTP 200
  • 24 responses with 503, all of which have the response body overload, indicating a rejection from our queue-proxy (which only allows a small number of requests to be queued).

I will run more tests with sidecar proxy and add results here.
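
A hedged guess at the corresponding hey invocation (-z and -q are per rakyll/hey; -q is a per-worker rate limit, so 100 workers × 10 QPS each ≈ 1000 QPS total; the exact parameters and target used here are not stated in the thread):

hey -z 300s -c 100 -q 10 -m GET \
  -host helloworld-go.default.example.com http://35.188.214.219/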

@tcnghia (Contributor) commented Aug 29, 2018

When running with sidecar injection and concurrencyTarget=2:

  • 300265 responses with HTTP 200
  • 733 responses with 503 overload

This increase is consistent with the fact that Pods are slower to start up with sidecar injection.

@tcnghia (Contributor) commented Aug 29, 2018

The throttling 503s (503 overload) will be reduced with the new design in #1846. Closing this bug since we are tracking that improvement in #1846.

@tcnghia closed this as completed Aug 29, 2018
@ahmetb (Contributor, Author) commented Aug 29, 2018

Thanks for taking the time to look at this.
