Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove timeouts from db connection #17424

Open
wants to merge 1 commit into
base: develop2
Choose a base branch
from

Conversation

vient
Copy link
Contributor

@vient vient commented Dec 5, 2024

Changelog: Fix: Remove DB connection timeouts

As observed in #13924 (comment) and also by me, Conan failed to acquire database lock may be triggered on machines with lots of cores and high load - in CI, in other words. All cases I observed were because of high load. I don't think it is meaningful to fail because of this, as it's treated as a flap, build gets restarted which does not help with cpu load. I don't know any other possible reasons for this timeout to trigger, so in my opinion it should not exist at all.

  • Refer to the issue that supports this Pull Request.
  • If the issue has missing info, explain the purpose/use case/pain/need that covers this Pull Request.
  • I've read the Contributing guide.
  • I've followed the PEP8 style guides for Python code.
  • I've opened another PR in the Conan docs repo to the develop branch, documenting this one.

@memsharded memsharded self-assigned this Dec 6, 2024
@memsharded
Copy link
Member

Hi @vient

I think this is quite risky. If some operation on the DB that in normal conditions takes milliseconds, is not able to work in 10 seconds, then waiting for ever will probably not only not solve the problem, but will aggravate the whole situation, getting all concurrent jobs acting on the DB to stall or interlock each other. The fact that one of the concurrent processes fails because of this timeout is a safeguard to let the other processes succeed.

In our experience this happens more for excessive concurrency over the DB than because of high CPU usage on the machine. Are you using separate caches for separate jobs, and still seeing this issue? Or are you sharing the same cache with a high concurrency (see https://docs.conan.io/2/knowledge/guidelines.html notes about concurrent caches)

@vient
Copy link
Contributor Author

vient commented Dec 7, 2024

Our case looks like this: Jenkins starts pods which use local conan folders. To speed up conan start up we pass a shared download cache and set parallel=4, that's it. Server has 256 logical cores, we see spontaneous failed to acquire database lock errors when server load approaches its capacity. I'll also note that our servers seem to be weak on disk side - I infer from that that an INSERT into db may take a long time if server is loaded because there is a big disk queue.

Also I wonder if there may be a thread starvation occurring somehow - we use around 100 conan packages, if one thread is waiting for lock while another thread continuously succeeds in acquiring it and then performs a long db operation because of disk latency, first thread may see a 10s timeout even if all db operations take <100ms.

If some operation on the DB that in normal conditions takes milliseconds, is not able to work in 10 seconds, then waiting for ever will probably not only not solve the problem, but will aggravate the whole situation, getting all concurrent jobs acting on the DB to stall or interlock each other.

Yes, but this principle works if you use retries. Conan does not retry db operations though, exiting with error immediately instead. This approach aggravates the situation even more, at least in my case when this conan process is the only one using this particular conan folder anyway.


My current understanding is that these timeouts only help point out that there is something wrong with db operation latency. There are cases like mine though where nothing can be done about it, and it is expected. If you think that timeouts are indeed helpful, may we discuss increasing them or making an option for timeout value instead?
There is also a WAL mode but I don't know much about it.

@memsharded
Copy link
Member

Thanks for the detailed feedback.

To speed up conan start up we pass a shared download cache and set parallel=4

I am a bit surprised that you are not seeing other kind of race conditions, because as commented above, the cache is not yet designed for concurrency. It is true that if you are very careful and make sure to install/build different packages in the different pods, and avoid the race conditions by the right orchestration and order of builds, the race conditions might be avoided.

If there is system exhaustion at the machine level, it might be possible that the DB is sometimes not the actual blocker or bottleneck, but maybe if the DB is getting timeouts even after 10 seconds, then those timeouts are not the bigger problem anyway?

If it helps, I think we might be able to increase a bit more those timeouts, like to 15 or even 20 seconds, but I'd still like to keep some limit, waiting forever without any limit seems to me to have more risks than benefits.

@memsharded
Copy link
Member

Hi @vient

Any further feedback about my last comment?
As suggested there, removing the timeout seems too risky, we could increase a bit the value, would you like to do that? Thanks!

@vient
Copy link
Contributor Author

vient commented Jan 7, 2025

Hi, thanks for reminder

the cache is not yet designed for concurrency

Just in case, I'm speaking about download cache (core.cache:download_cache), not the conan folder. Download cache looks pretty simple and robust, we did not see any problems with it. Conan working folder (cache in your terms) is pod-local and empty on pod start.

maybe if the DB is getting timeouts even after 10 seconds, then those timeouts are not the bigger problem anyway?

In a way, yes. Unfortunately, it may not be easy (or even possible?) to prevent latency spikes when you have a big server with lots of noisy neighbors on it - if some of them coincidentally start to do something disk-intensive simultaneously, everyone is affected. Disk I/O QoS seems to reduce performance, unlike CPU and RAM QoS - in CI we care more about throughput than latency, so decreasing disk performance when it's already strained is not optimal.

We are pretty happy with our CI performance overall, latency spikes rarely cause any problems, one of this problems being this timeout. We've never seen this timeout trigger in any situation other than the high server load, in which case we would prefer it to not trigger because nothing bad actually happens.

we could increase a bit the value

Other than creating a config option for timeout duration, this is the only other way. Can we increase it to, say, 30 seconds?

@memsharded
Copy link
Member

Just in case, I'm speaking about download cache (core.cache:download_cache), not the conan folder. Download cache looks pretty simple and robust, we did not see any problems with it. Conan working folder (cache in your terms) is pod-local and empty on pod start.

The download cache doesn't use DB or sqlite at all, it is purely filesystem (and file locks) based.

Then, if pods are fully local and not shared for concurrent jobs, then there will be no concurrency over the package storage, and it would be safe, no concurrency over the same Conan cache. And it would seem then mostly a machine resource depletion issue.

Other than creating a config option for timeout duration, this is the only other way. Can we increase it to, say, 30 seconds?

Yes, just in case, make sure to update to latest Conan versions, we have done several optimizations that would reduce the number of sqlite reads to the DB, in some cases reducing a lot of calls to the DB

I think with the latest optimizations, doubling the time should be quite a huge improvements, I'd suggest trying timeout=20 for the next release and see how it goes. I don't think it deserves a special configuration, as the goal is not to have to configure this at all, it seems more like a workaround than something that users really want to configure to different values.

@vient
Copy link
Contributor Author

vient commented Jan 7, 2025

make sure to update to latest Conan versions

We use 2.9 at the moment, will try 2.11 then

@memsharded
Copy link
Member

I see some of the larger improvements in this were in 2.9 in #17083, but there are still some other smaller later ones like #17311 that went to 2.10, so might be worth updating in any case, just in case, yes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants