[windows][CI/CD] ADOT collector delayed start #1788
Based on the comment on the linked issue it appears that more investigation/testing is required on whether this proposed fix is the best path forward.
FWIW, I have seen the same issue running the OTel collector contrib build on Windows (also packaged up and configured on the target system with WiX). We already run this with the |
@DevOpsFu thanks for the insight! Do you know of any other mitigations that lessen the effect? We are still exploring solutions to this problem and any feedback would help. |
I've not yet found anything that would mitigate the issue. At least nothing that isn't a horrible hack that I wouldn't consider using, e.g.:
Even the Windows service recovery options to automatically restart the service on failure do not work - presumably because the service needs to at least start up successfully once and then die for the recovery measures to kick in, and this isn't happening here. Here are some useful links though:
I've not yet delved deeply into the Otel collector code (I'm not a seasoned Go developer either), but it seems to make sense to start by looking for any areas around the Windows-specific code that might be loading DLLs outside of a function declaration. I had a very quick look earlier today and couldn't find anything using GitHub's search functions. Maybe my searches were too specific though - or perhaps this isn't the issue at all. From my limited understanding of the root cause though, the problem seems to be that something is making startup take a long time on first invocation of the executable. Normally this isn't a problem and probably is never visible on a fast system, but if you have the perfect storm of a system that has:
Then it creates the perfect set of conditions for the startup to take longer than 30s, and then the Windows SCM considers the service to be dead. |
Having done some more digging into this tonight, it seems to me that it might not be possible to solve this one within the OTel collector itself without changing the service startup code. This issue describes the problem really nicely. A viable workaround which would not require any hacks is to remove the responsibility of communicating with the Windows SCM from the OTel collector entirely - i.e. run it in interactive mode via a service wrapper such as NSSM or WinSW. I prefer WinSW myself, and it lends itself well to being packaged up and installed by something like WiX. Rather than using the built in An example WinSW XML file might look like this:
Hope this helps! |
Sorry for spamming this issue (plus sorry for not making these comments against the issue rather than this PR, I only just realised! 🤦 ) I tested the OTel collector on a Windows system this morning under heavy CPU load. Using the binary directly as a service yielded the issue with the service failing to start up quickly enough for the Windows SCM. Testing the OTel collector running under WinSW was successful; the service responded to the Windows SCM quickly and the service status was shown as running. The OTel collector itself took about 7 minutes to fully start up in the background (this was on a system with the CPU load at a constant 100%). Right now, in the absence of any solution in the initialization code in the OTel collector, I'd say that this is the best mitigation for this issue. |
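The behaviour observed above (the service manager gets a prompt answer while the real startup continues in the background) can be illustrated with a toy model. This is only a sketch of the general pattern, not WinSW's or the collector's actual code: `start_decoupled`, `report_status`, and `fake_slow_init` are hypothetical names, and the "service manager" is simulated with a plain list.

```python
import threading
import time


def start_decoupled(report_status, slow_init):
    """Report RUNNING to the (simulated) service manager first, then run
    the slow initialization on a background thread."""
    report_status("RUNNING")  # answer the manager before any heavy work
    worker = threading.Thread(target=slow_init, daemon=True)
    worker.start()
    return worker


events = []  # stands in for what the service manager and logs would see


def fake_slow_init():
    # Hypothetical stand-in for the collector's long startup under load.
    time.sleep(0.2)
    events.append("collector ready")


worker = start_decoupled(events.append, fake_slow_init)
status_seen_first = list(events)  # snapshot taken while init is still running
worker.join()
```

In this model the manager's startup timeout only has to cover the first status report, so a slow initialization no longer counts against it.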
@DevOpsFu Thank you very, very much for sharing what you learned with us. We appreciate this and will definitely be working off your initial research as we look for the correct long-term solution. Other team members have shared the initialization fix with me and it is something that we are looking at. We do have control over that in the ADOT distribution, but the effort required to implement it is not clear to me. In the Prometheus example the implementation was pretty straightforward, so I think we could take a stab at that. I'm hoping that with a significant amount of testing we can reduce this issue. The good thing about the initialization fix proposal is that it would be something we could contribute back upstream if it works out well for the ADOT distribution. |
This PR is stale because it has been open 30 days with no activity. |
Description:
Sets the ADOT collector agent as an Automatic (delayed start) service to mitigate a known Go issue on Windows with 1.9.2: golang/go#23479
With the Automatic start type, services would not restart across reboots: they would time out before coming up and the service control manager would give up spawning them.
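For readers who want to try the same start type on an already installed service, here is a minimal sketch that builds the corresponding `sc.exe config` command. It is not how this PR applies the setting (that mechanism is not shown here). The service name `ExampleCollectorService` is a hypothetical placeholder, and the `--apply` flag is an assumption of this sketch.

```python
import subprocess
import sys


def delayed_start_command(service_name):
    """Build the sc.exe invocation that switches a service to
    Automatic (delayed start)."""
    # sc.exe takes the option name and its value as separate tokens:
    # "start=" followed by "delayed-auto".
    return ["sc.exe", "config", service_name, "start=", "delayed-auto"]


# Hypothetical service name; substitute the name the installer registered.
SERVICE_NAME = "ExampleCollectorService"

cmd = delayed_start_command(SERVICE_NAME)
if sys.platform == "win32" and "--apply" in sys.argv:
    # Needs an elevated prompt; only runs when explicitly requested.
    subprocess.run(cmd, check=True)
else:
    print(" ".join(cmd))
```

Without `--apply`, or on a non-Windows machine, the script only prints the command so it can be reviewed first.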
Link to tracking Issue: #1767
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.