No KeepAlive timeout set in load test clients #3466
Hi, could you please explain how autoscaling works in your case? For example, in AWS, I think autoscaling works by adding new nodes to the load balancer definitions, in which case DNS is not changed. In your case, are new nodes' IP addresses added to DNS when autoscaling launches new nodes? Is load balancing handled by DNS? In the load balancer case, sticky sessions may cause unbalanced node loads. Also, load balancers sometimes use the source IP address in their load-balancing strategy, which is not ideal for a load-testing scenario where all load is generated from one IP address. Could you give more information about your setup, please?
Hi @szuecs, why do you evaluate this as a bug? It doesn't sound like one. I don't think we say anywhere that we support this case. It sounds to me like simply a feature to add. I removed the label for now; we might re-add it later if required.
@codebien @kullanici0606 There are two problems. In both cases you limit the throughput to whatever was available at the start of the load test. That means it is not a scalable solution, and scaling out upfront is not a realistic load test scenario. I hope I could clarify your questions and why it's a bug and not a feature. Our load test will run in one of our Kubernetes clusters, which runs on nodes with public IPs attached, so all pods running on one node share the same source IP, but that seems to be fine. The ingress setup is best explained in https://opensource.zalando.com/skipper/kubernetes/ingress-controller/#aws-deployment, but replace ALB with NLB. The second-layer load balancer is skipper, which is the only target in the TargetGroup of the NLBs and is backed by an HPA, as shown in https://opensource.zalando.com/skipper/kubernetes/ingress-controller/#deployment
@codebien I tried to do my due diligence, but I got no response, and the label "awaiting user" does not seem correct given that I responded and received no comment.
Hello @codebien @kullanici0606, I work with Sandor, and I looked into this issue since we are planning to use k6 to perform large load tests. I wanted to highlight a few points and propose solutions for us to discuss. As @szuecs has pointed out, the mentioned use case is indeed an issue, but it might be tricky to place in the right category (bug/feature) because it only occurs in specific situations. I looked into projects other than k6 and checked public articles to see whether others face this issue, and I found the following:
The above links show that this is a common issue when persistent TCP connections are used against a target that is auto-scaling, though the discussions seem to remain open and there is no single accepted solution. I believe k6 should provide ways to solve this, since the use case is frequently encountered when running load tests. Disabling persistent connections can significantly degrade k6's performance, hence I'm avoiding that as a potential solution. To solve this issue, I can think of the following solutions:
Solving this in the load test client seems like the better option, given the probability of facing this issue. Other load test tools that solve this:
Hi @mohitk05 @szuecs, in summary, our position remains as previously communicated: we might add support for this use case as a feature in the next version of the HTTP module, which we are currently developing and where we expect to introduce a more configurable solution to this specific issue. To give you more context: unfortunately, as we explained previously in another issue, any configuration that forces connections to close has to be as predictable as possible; otherwise, it might skew metrics and create confusion during investigations, troubleshooting, or debugging. Yet even if we were keen to introduce a well-documented, configurable change that would solve this, we would still face a significant technical challenge: there is currently no way for k6 to control the HTTP connection pool directly. This is essentially the same reason given for @szuecs's original issue on the Go repository. Supporting this use case would effectively involve building custom solutions that do not rely on the standard Go HTTP library, whereas relying on it is a practice we aim to follow as much as possible. Departing from it is a difficult decision that would have to be properly tested and evaluated, considering all the corner cases HTTP, TCP, and networking have in general. Continuing to use the Go standard library for the HTTP module is, for us, the guarantee of high standards for one of the most critical modules of k6. Of course, as with all engineering decisions, it has trade-offs; unfortunately, this issue is one of them. At this point, the alternative we can offer, other than the already available workaround of using the noVUConnectionReuse option, is for you to take advantage of the extensions model, where you can build your own implementation. We can offer our support there, helping to find solutions and reviewing code; for us, it could serve as an exploration phase for httpv2.
Eventually, if you build something generic, it could even be considered as the first version of HTTP v2, and we may consider including it as an experimental module in k6 core, if that is something you are interested in. However, any evaluation has to be deferred until your extension is available. Alternatively, if this is already a high priority for you, feel free to start discussing a potential proposal here and how you would like to implement it. I hope this helps.
Thanks for coming back to us! In general, I think we can try the noVUConnectionReuse option. For us (I run skipper as an ingress router serving 1M rps every day across 100 production clusters), the closeIdleConnection loop never showed a negative impact on performance. I have reviewed hundreds of latency issues, and connection establishment was never really the problem. If you stay inside an AWS region, I have never seen >1ms connection times as measured by Go's httptrace package. Here's a deep link to our client's httptrace instrumentation: https://github.com/zalando/skipper/blob/master/net/httpclient.go#L404 Let me know if I can help more.
Hello, I have exactly this issue: the sticky-session option on AAG v2 seems to be causing unbalanced application calls. Is there any workaround I can use to prevent this? I'm using a single VM (one IP address) to run a local test against two web servers behind an AAG. Thanks in advance!
Hi @RafaMalicia,
I'm not an expert on the Azure platform, but AWS seems to have a dedicated section on how to work around this case. The suggestions seem platform-agnostic, so you could consider applying the same strategy to Azure as well. Check the Manual rebalancing section of the sticky sessions page in the AWS ALB docs.
Hello again 👋 We have been using open-source k6 deployed in Kubernetes internally and have successfully conducted large-scale load tests with it. First of all, I appreciate the tooling available and the amount of customization and control we get over the tests! Here are a few learnings regarding the use of noConnectionReuse:
I explored k6 extensions as you suggested, and it looks like we might have a solution there. I created this extension and will be testing it out; leaving it here in case anyone faces similar issues: https://github.com/mohitk05/xk6-close-idle-connections/blob/main/close_idle_conn.go. My assumption is that this extension will also have some performance impact, but it should be much better than disabling keep-alives entirely. I will share results from our testing soon.
@mohitk05 Yes, it reduces performance a bit, because every few seconds unused idle connections are closed and the system needs to resolve DNS and connect (TCP/IP handshake) to the endpoint again. However, if we do this every 30s and we have 100 RPS, then at least 3k requests can go through one connection before we need to create a new one (in practice it's likely a bit different, but let's assume this for the sake of a simple calculation). So instead of 1 request per handshake we would have 3k requests per handshake.
Sharing an update here: https://github.com/mohitk05/xk6-close-idle-connections |
Hey @mohitk05 @szuecs, first, thanks for sharing the extension and reporting back your observations. Did you try to maximize the reuse of the pool's connections? Or is it in the nature of your test that connections go idle? I remember the main issue was mostly triggered by the autoscaling logic, which I expect to account for a small percentage of events over the entire test lifecycle. I assume you're flushing the connections, mostly using the 5-second interval, which is the default value in the shared extension. I'm pretty sure I'm not getting the full picture here, so can you expand a bit on the test's context, please?
Brief summary
We use the k6 Node.js HTTP client to run load tests, and as we can see in CPU graphs, it sticks to only a single socket to the target.
This of course saturates only a single target even when the target is autoscaling, which is a severe bug for a load-testing tool.
I am not very experienced with Node.js, but you can likely set the keep-alive timeout to something like 60s on the HTTP client to fix the issue. See also this Go issue, which shows the problem in more detail: golang/go#23427
Example load test client:
k6 version
latest
OS
ubuntu 22.04 lts
Docker version and image (if applicable)
FROM loadimpact/k6:latest AS k6official
Steps to reproduce the problem
Run this client and increase load against an autoscaled target.
Expected behaviour
All targets should have the same CPU usage.
Actual behaviour
Newly started targets do not reach the same CPU usage as the others.