No KeepAlive timeout set in load test clients #3466

Open
szuecs opened this issue Nov 22, 2023 · 13 comments
Assignees
Labels
evaluation needed (proposal needs to be validated or tested before fully implementing it in k6), feature, new-http (issues that would require or benefit from a new HTTP API)

Comments

@szuecs

szuecs commented Nov 22, 2023

Brief summary

We use the k6 nodejs http client to do load tests, and as we can see in CPU graphs it sticks to only one socket to the target.
This of course saturates only a single target even if the target is autoscaling, which is a severe bug for a load-testing tool.
I am not very experienced in nodejs, but you can likely set a keepalive timeout of something like 60s on the http client to fix the issue. See also this Go issue that shows the problem in more detail: golang/go#23427.
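
For illustration, here is a minimal sketch, assuming the plain Go net/http standard library rather than k6's actual internal transport, of the idle/keep-alive timeout knob referred to above (the URL and values are placeholders):

package main

import (
    "fmt"
    "net/http"
    "time"
)

func main() {
    transport := &http.Transport{
        // Close keep-alive connections after 60s of idleness, so later
        // requests dial again (and re-resolve DNS) and can reach newly
        // added targets.
        IdleConnTimeout: 60 * time.Second,
    }
    client := &http.Client{Transport: transport, Timeout: 30 * time.Second}

    resp, err := client.Get("https://example.com/") // placeholder target
    if err != nil {
        fmt.Println("request failed:", err)
        return
    }
    defer resp.Body.Close()
    fmt.Println("status:", resp.Status)
}

Note that an idle timeout only closes connections that actually sit idle; under sustained load a pooled connection may never become idle, which is part of what this issue is about.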

Example load test client:

import http from "k6/http";

function getLimits(order, baseUrl, token) {

    let url = `${baseUrl}/purchase-quantity-restrictions`
    let payload = JSON.stringify({
...
    });

    let params = {
        headers: {
            "Authorization": `Bearer ${token}`,
            "Content-Type": "application/json",
...
        },
    };

    return http.post(url, payload, params);
}

k6 version

latest

OS

Ubuntu 22.04 LTS

Docker version and image (if applicable)

FROM loadimpact/k6:latest AS k6official

Steps to reproduce the problem

run this client and increase load to an autoscaled target.

Expected behaviour

all targets should have the same CPU usage

Actual behaviour

Newly started targets do not have the same CPU usage as the others.

@kullanici0606

Hi,

Could you please tell us how autoscaling works in your case? For example, in AWS, I think autoscaling works by adding new nodes to the load balancer definitions; in that case the DNS is not changed.

In your case, are new nodes' IP addresses added to the DNS when autoscaling launches new nodes? Is load balancing handled by DNS?

In the load balancer case, sticky sessions may cause unbalanced node loads. Also, some load balancers use the source IP address in their load-balancing strategy, which is not ideal for a load-testing scenario if the load is generated from a single IP address.

Could you give more information about your setup, please?

@codebien
Contributor

codebien commented Dec 20, 2023

Hi @szuecs,
I can imagine some use cases, but as @kullanici0606 said, it would be helpful to get a better explanation of your specifics.
It would be nice to know what limits you're hitting with the currently available DNS option: https://grafana.com/docs/k6/latest/using-k6/k6-options/reference/#dns.

Why do you evaluate it as a bug? It doesn't sound like one. I don't think we say anywhere that we support this case. It sounds to me like simply a feature to add. I removed the label for now and we might re-add it later if required.

@codebien codebien added the evaluation needed, awaiting user, new-http and feature labels and removed the bug and triage labels Dec 20, 2023
@codebien codebien removed their assignment Dec 20, 2023
@szuecs
Author

szuecs commented Dec 21, 2023

@codebien @kullanici0606
The problem is not DNS! I referenced this Go issue to make you aware that TCP connections are not closed and recreated but held open forever.

There are 2 problems.
One problem is that if you run, for example, an NLB, then packets get forwarded rather than proxied, which means your load test connection pool won't connect to new nodes that are attached to the NLB TargetGroup.
The second problem is that you can take down an ALB because you do not connect to the scaled-out ALB instances. An ALB will initially have only 3 EC2 instances if you create an ALB spanning 3 AZs, one in each zone. If you hit their limit, they need to scale out, adding more EC2 instances and later also adding more IPs to the ALB. You will never look up these IPs, as the Go issue shows.

In both cases you limit the throughput to whatever was available at the start of the load test. That means it's not a scalable solution, and scaling out upfront is not really a realistic load test scenario.

I hope that clarifies your questions and why it's a bug and not a feature.

Our load test will run in one of our Kubernetes clusters, which run on nodes with public IPs attached, such that all pods running on one node share the same src IP, but that seems to be fine. The ingress setup is best explained in https://opensource.zalando.com/skipper/kubernetes/ingress-controller/#aws-deployment but replace ALB with NLB. The second-layer load balancer is skipper, which is the only one in the TargetGroup of the NLBs and backed by an HPA, as shown in https://opensource.zalando.com/skipper/kubernetes/ingress-controller/#deployment

@szuecs
Author

szuecs commented Apr 2, 2024

@codebien I tried to do my due diligence, but I got no response, and the label "awaiting user" does not seem correct when I have responded and got no comment.

@mohitk05

mohitk05 commented Apr 2, 2024

Hello @codebien @kullanici0606, I work with Sandor and I looked into this issue since we are planning to use k6 to perform large load tests. I wanted to highlight a few points here and propose solutions for us to discuss.

As @szuecs has pointed out, the mentioned use case is indeed an issue, but it might be tricky to place it into the right category (bug/feature) because it only occurs under specific situations. I looked into projects other than k6 and checked public articles to see whether others face this issue, and I found the following:

The above links show that this is a common issue faced when persistent TCP connections are used against a target that is auto-scaling, though the discussions seem to be open and there is no single accepted solution.

I believe k6 should provide ways to solve this, since the use case will be faced more frequently when running load tests. Disabling persistent connections can hurt k6's performance, hence I'm avoiding that as a potential solution.

To solve this issue, I can think of the following solutions:

  1. Provide an option in k6 to set the maximum number of requests per persistent TCP connection
  2. Provide an option in k6 to close TCP connections every X seconds (a TTL for connections, similar to Sandor's suggestion of a goroutine; see the sketch after this list)
  3. Send a Connection: close header from the load balancer at regular intervals (difficult to implement).
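
To make option 2 concrete, here is a minimal, hypothetical Go sketch (plain net/http, not k6's actual API nor any existing extension) of a goroutine that flushes idle keep-alive connections on a fixed interval:

package main

import (
    "net/http"
    "time"
)

// closeIdleEvery periodically closes the transport's idle connections until
// the stop channel is closed, forcing later requests to dial (and re-resolve
// DNS) again.
func closeIdleEvery(t *http.Transport, interval time.Duration, stop <-chan struct{}) {
    ticker := time.NewTicker(interval)
    defer ticker.Stop()
    for {
        select {
        case <-ticker.C:
            // Only connections that are idle at this moment are closed;
            // in-flight connections are unaffected.
            t.CloseIdleConnections()
        case <-stop:
            return
        }
    }
}

func main() {
    transport := &http.Transport{}
    client := &http.Client{Transport: transport}

    stop := make(chan struct{})
    go closeIdleEvery(transport, 30*time.Second, stop) // illustrative 30s TTL
    defer close(stop)

    // ... issue requests with client as usual ...
    _ = client
}

The trade-off of this approach is that idle flushing is predictable and cheap, but it cannot close connections that are actively in use.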

Solving this in the load test client seems like the better solution, given the probability of facing this issue. Other load test tools that solve this:

@codebien
Contributor

Hi @mohitk05 @szuecs,
we met today with the team to discuss this issue.

In summary, our position remains the same as previously communicated: we might add support for this use case as a feature in the next version of the HTTP module, which we're currently developing and where we expect to introduce a more configurable solution to this specific issue.

To give you more context:

Unfortunately, as we explained previously in another issue, any configuration that force-closes connections has to be as predictable as possible; otherwise it might skew metrics and create confusion during investigations, troubleshooting or debugging.

Yet, even if we were keen to introduce a well-documented, configurable change that would bring a solution to this, we would still encounter a significant technical challenge: there isn't currently any way for k6 to control the HTTP connection pool directly. It is mostly the same reason behind @szuecs's original issue on the Go repository.
The only API Go's HTTP package provides for enforcing connection closing is CloseIdleConnections, which, as the name makes clear, doesn't guarantee closing any active connections.

Supporting this use case would effectively involve building custom solutions that do not rely on the standard Go HTTP library, which we aim to rely on as much as possible. Doing the opposite is a difficult decision that has to be properly tested and evaluated, considering all the corner cases HTTP, TCP and networking have in general. Continuing to use the Go standard library for the HTTP module is, for us, the guarantee of high standards for one of the most critical modules of k6.

Of course, as with all engineering decisions, it has trade-offs; unfortunately, this issue is one of them.

At this point, the alternative we can offer, other than the already available workaround of the noVUConnectionReuse option, is for you to use the extensions model, where you can build your own custom implementation. We can offer our support there, helping to find solutions and reviewing code; for us, it might be useful as an exploration phase for httpv2.

Eventually, if you build something generic, it could even be considered as the first version of HTTP v2, and we may consider including it as an experimental module in k6 core, if that is something you are interested in benefitting from. However, any evaluation has to be deferred until your extension is available. Alternatively, if this is already a strong desire for you right now, feel free to start discussing a potential proposal here and how you would like to implement it.

I hope it helps.

@codebien codebien removed the awaiting user and triage labels Apr 19, 2024
@szuecs
Author

szuecs commented Apr 19, 2024

Thanks for coming back to us!

In general I think we can try to use the noVUConnectionReuse option.
@mohitk05 can you test it and provide feedback on whether it helps?
I am happy to review all kinds of data on the skipper side.

For us (I run skipper as an ingress router with 1M rps every day across 100 production clusters), the closeIdleConnections loop has never shown a negative impact on performance. I have reviewed hundreds of latency issues and connection establishment was never really a problem. If you stay inside an AWS region, I have never seen >1ms connection times as measured by the Go httptrace module. Here's the deeplink to our client's httptrace instrumentation: https://github.com/zalando/skipper/blob/master/net/httpclient.go#L404
The Go httptrace module is pure gold!
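
For reference, a minimal standalone sketch of this kind of httptrace instrumentation (not skipper's actual code; the URL is a placeholder), reporting connection setup time and whether a pooled connection was reused:

package main

import (
    "fmt"
    "net/http"
    "net/http/httptrace"
    "time"
)

func main() {
    var connStart time.Time

    trace := &httptrace.ClientTrace{
        ConnectStart: func(network, addr string) { connStart = time.Now() },
        ConnectDone: func(network, addr string, err error) {
            fmt.Printf("connect to %s took %v (err=%v)\n", addr, time.Since(connStart), err)
        },
        GotConn: func(info httptrace.GotConnInfo) {
            fmt.Printf("got connection: reused=%v wasIdle=%v idleTime=%v\n",
                info.Reused, info.WasIdle, info.IdleTime)
        },
    }

    req, err := http.NewRequest("GET", "https://example.com/", nil) // placeholder URL
    if err != nil {
        panic(err)
    }
    req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        fmt.Println("request failed:", err)
        return
    }
    resp.Body.Close()
}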

Let me know if I can help more.

@RafaMalicia

Hi,

Could you please tell us how autoscaling works in your case? For example, in AWS, I think autoscaling works by adding new nodes to the load balancer definitions; in that case the DNS is not changed.

In your case, are new nodes' IP addresses added to the DNS when autoscaling launches new nodes? Is load balancing handled by DNS?

In the load balancer case, sticky sessions may cause unbalanced node loads. Also, some load balancers use the source IP address in their load-balancing strategy, which is not ideal for a load-testing scenario if the load is generated from a single IP address.

Could you give more information about your setup, please?

Hello, I have exactly this issue; it seems that the sticky session option on AAG v2 is causing unbalanced application calls. Is there any workaround I can use to prevent this? I'm using a single VM (one IP address) to run a local test against 2 web servers controlled by an AAG. Thanks in advance!

@codebien
Contributor

Hi @RafaMalicia,

it seems that the sticky session option on AAG v2 is causing unbalanced application calls

I'm not an expert on the Azure platform, but AWS seems to have a dedicated section on how to work around this case. The different suggestions seem platform-agnostic, so you could consider applying the same strategy to Azure as well.

Check the "Manual rebalancing" section on the sticky sessions page of the AWS ALB docs.

@oleiade oleiade removed the triage label Sep 5, 2024
@mohitk05

mohitk05 commented Sep 9, 2024

Hello again 👋 We have been using open-source k6 deployed in Kubernetes internally and have successfully conducted large-scale load tests with it. First of all, I want to say I appreciate the tooling available and the amount of customization and control we get over the tests! Here are a few things we learned about using noConnectionReuse:

  • The performance per VU almost halves (a 30-50% drop) when the option is used, since it essentially disables keep-alive connections. For now that's okay for us, since the overall performance is still much better than our previous tooling, so we have been running tests with this option enabled.
  • RPS per VU, and hence VUs per load generator, reduce by the above factor when this option is enabled. We are okay with deploying more generators per test for now.

I explored k6 extensions as you suggested, and it looks like we might have a solution there. I created this extension and will be testing it out; leaving it here in case anyone faces similar issues: https://github.com/mohitk05/xk6-close-idle-connections/blob/main/close_idle_conn.go. My assumption is that this extension will also have some performance impact, but it should be much better than no keep-alives. I will share results from our testing soon.

@szuecs
Author

szuecs commented Sep 9, 2024

@mohitk05 yes, it reduces performance a bit because every few seconds unused idle connections are closed and the system needs to resolve DNS and connect (TCP/IP handshake) to the endpoint again. However, if we do this every 30s and have 100 RPS, then at least 3,000 requests can go through one connection before we need to create a new one (in practice it's likely a bit different, but for the purpose of a simple calculation let's assume it is like this). So instead of 1 request per handshake we would have 3,000 requests per handshake.

@mohitk05

Sharing an update here:

https://github.com/mohitk05/xk6-close-idle-connections
We tested this extension in a medium-sized load test and it performed well compared to using no connection reuse. We did not hit the limits where the target would autoscale, but we will soon perform another test where this happens. For now, we'll continue using the extension as it solves the issue for us, and we will share more findings later. Thanks for the support!

@codebien
Contributor

codebien commented Jan 10, 2025

Hey @mohitk05 @szuecs,
Sorry for the late reply, but I was on a long PTO.

First, thanks for sharing the extension and reporting back your observations.

Did you try to maximize the reuse of the pool's connections? Or is it in the nature of your test that connections go idle? I remember the main issue was mostly triggered by the autoscaling logic, which I expect to account for a small percentage of events across the entire test lifecycle. I assume you're flushing the connections, mostly using the 5-second interval, which is the default value in the shared extension.
Furthermore, I see a feature has been added to Skipper to control the maximum connection lifetime on the server side. Doesn't that resolve the issue? Why do you need to close connections on the client side too?

I'm pretty sure I'm not getting the full picture here, so can you expand a bit on the test's context, please?
