
HTTP 503 responses during load test on GKE during a node auto-scale up #1797

Closed
ahmetb opened this issue Aug 3, 2018 · 8 comments
Labels: area/autoscale, area/networking, kind/bug

Comments

@ahmetb (Contributor) commented Aug 3, 2018

/area autoscale
/area networking
/kind bug

I use Knative on GKE (manual install) with node autoscaling enabled. When I run a load test with n=100000 and concurrency=200 (or 1000), I sometimes see some responses fail with HTTP 503 or 504.

Expected Behavior

All requests succeed.

Actual Behavior

  • A small percentage (~0.2%) of requests fail with HTTP 503 (and sometimes HTTP 504). The test almost never completes without any errors (even when GKE scales up to 10 nodes).
  • The initial load test made the GKE API unavailable: the GKE master is a single instance by default, the load test triggered a node scale-up, which caused the master to resize, which took the GKE API down. This caused more 503s.
  • I've actually had -n=100000 -c=200 succeed with no errors many times.

Steps to Reproduce the Problem

  1. Install Knative on GKE:
gcloud container clusters create $CLUSTER_NAME \
  --zone=$CLUSTER_ZONE \
  --cluster-version=latest \
  --machine-type=n1-standard-4 \
  --enable-autoscaling --min-nodes=1 --max-nodes=10 \
  --enable-autorepair \
  --scopes=service-control,service-management,compute-rw,storage-ro,cloud-platform,logging-write,monitoring-write,pubsub,datastore \
  --num-nodes=3

1b. Make sure the GKE cluster is hovering around 3-4 nodes (not scaled up yet).
2. Deploy the helloworld-go app (a deployment sketch follows these steps).
3. Install hey: go get github.com/rakyll/hey
4. hey -m GET -n 100000 -c 1000 -host helloworld-go.default.example.com http://35.188.214.219/
5. Observe: GKE nodes scale up (check with gcloud compute instances list) within a few seconds.
6. Observe: during this scale-up, kubectl get pods shows a lot of Pending pods (which don't come up fast enough before the load test completes).

Summary:
  Total:	123.2564 secs
  Slowest:	13.2501 secs
  Fastest:	0.0425 secs
  Average:	1.1116 secs
  Requests/sec:	811.3169

  Total data:	1310208 bytes
  Size/request:	13 bytes
...
Status code distribution:
  [200]	99768 responses
  [503]	232 responses

Alternatively, it shows:

Status code distribution:
  [200]	98229 responses
  [503]	1592 responses
  [504]	176 responses

Error distribution:
  [3]	Get http://35.188.214.219/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
  7. Even if GKE is already scaled up to 10 nodes, re-run the load test and you'll still see 503 or 504 errors.
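
For step 2, a minimal sketch of deploying the helloworld-go sample as a Knative Service; the image name and the Service schema below are assumptions, and the exact manifest depends on the Knative Serving version installed:

kubectl apply -f - <<EOF
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-go
  namespace: default
spec:
  template:
    spec:
      containers:
      - image: gcr.io/knative-samples/helloworld-go  # stock sample image (assumed)
        env:
        - name: TARGET
          value: "Go Sample v1"
EOF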
@google-prow-robot added labels area/API, area/autoscale, area/build, area/monitoring, area/networking, area/test-and-release, kind/question, kind/bug, kind/cleanup, kind/doc, kind/feature, kind/good-first-issue, kind/process, kind/spec on Aug 3, 2018
@ahmetb (Contributor, Author) commented Aug 3, 2018

Whoa, the commented-out area caused all labels to be added.

@google-prow-robot removed labels area/API, area/autoscale, area/build, area/monitoring, area/test-and-release, kind/cleanup, kind/doc, kind/feature, kind/good-first-issue, kind/process, kind/question on Aug 3, 2018
@google-prow-robot removed the kind/spec label on Aug 3, 2018
@google-prow-robot commented Aug 3, 2018
@ahmetb: Those labels are not set on the issue: area/api, area/autoscale, area/build, area/monitoring, area/test-and-release, kind/cleanup, kind/doc, kind/feature, kind/good-first-issue, kind/process, kind/question, kind/spec

In response to this:

/remove-area API
/remove-area autoscale
/remove-area build
/remove-area monitoring
/remove-area test-and-release
/remove-kind cleanup
/remove-kind doc
/remove-kind feature
/remove-kind good-first-issue
/remove-kind process
/remove-kind question
/remove-kind spec

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@tcnghia (Contributor) commented Aug 3, 2018

@ahmetb some 503s may be due to our throttling. Can you try changing concurrencyTarget in the autoscaling config map?

If possible, can you also include the error messages in your distribution? There probably won't be too many unique messages.
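
A hedged sketch of this suggestion, for reference; the config map name, namespace, and key below are assumptions and vary by Knative release:

kubectl -n knative-serving edit configmap config-autoscaler
# then raise the default per-pod concurrency target, e.g. (key name assumed):
#   container-concurrency-target-default: "10"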

@ahmetb (Contributor, Author) commented Aug 3, 2018

The tool I'm using isn't making it easy to get the response body.

If I can find time, I'll get back to looking at the components for error/throttling messages and will try to get the response body as well. But if you have time, I encourage you to dig in; it's easy to repro on top of a default cluster/app.

@tcnghia self-assigned this Aug 8, 2018
@tcnghia (Contributor) commented Aug 29, 2018

I am running with the latest Knative using a 1000 QPS test for 300 seconds, without sidecar injection, with concurrencyTarget=2, and I see this breakdown:

  • 300975 responses with HTTP 200
  • 24 responses with 503, all of which have the response body overload, indicating a rejection from our queue-proxy (which only allows a small number of requests to be queued).

I will run more tests with sidecar proxy and add results here.
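
A hedged guess at the corresponding hey invocation (-z and -q are per rakyll/hey; -q is a per-worker rate limit, so 100 workers × 10 QPS each ≈ 1000 QPS total; the exact parameters and target used here are not stated in the thread):

hey -z 300s -c 100 -q 10 -m GET \
  -host helloworld-go.default.example.com http://35.188.214.219/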

@tcnghia (Contributor) commented Aug 29, 2018

When running with sidecar injection and concurrencyTarget=2:

  • 300265 responses with HTTP 200
  • 733 responses with 503 overload

This increase is consistent with the fact that Pods are slower to start up with sidecar injection.

@tcnghia (Contributor) commented Aug 29, 2018

The throttling 503s (503 overload) will be reduced with the new design in #1846. Closing this bug since we are tracking that improvement in #1846.

@tcnghia closed this as completed Aug 29, 2018
@ahmetb (Contributor, Author) commented Aug 29, 2018

Thanks for taking the time to look at this.
