watchdog anomalies after upgrade to 0.35.1 #1030
Comments
Definitely suspicious, but no, we did not adjust sampling. 0.35.0 did shuffle around how trace components are configured, though, and we introduced payload chunking (which uses Net::HTTP), so maybe there's something related. What's the name of the metric that spiked?
Thanks @delner
Thank you @gingerlime! I compared all the changes between 0.34.2 and 0.35.1 and nothing clearly explains this. Let us know if you are using Sinatra. In this scenario, trace splitting seems more plausible, but the only way I can imagine it causing an increase in traces is if, for some reason, the code is sending the same traces multiple times. Not sure if this is feasible, but are you able to tell whether the increase is caused by duplicate spans, or whether they are mostly unique? If we want to continue investigating the chunking logic, I suggest you enable "diagnostic health metrics", which will tell us exactly how many times chunking happened. If we see greater-than-zero numbers for the chunking metric, that would point us in the right direction.
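For context, a minimal sketch of what enabling these diagnostic health metrics might look like in a Rails initializer; the `c.diagnostics.health_metrics.enabled` setting reflects the 0.x-era configuration API and may differ slightly between versions, and the file path is just an assumption:

```ruby
# config/initializers/datadog.rb (sketch; setting names assume a 0.x ddtrace release)
require 'ddtrace'

Datadog.configure do |c|
  c.use :rails

  # Emit internal health metrics (including transport/chunking counters)
  # via DogStatsD to the agent. Requires the dogstatsd-ruby gem.
  c.diagnostics.health_metrics.enabled = true
end
```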
Thanks @marcotc! Not using Sinatra; it's a Rails app. I've had a bit of a history of false alarms (OK, one so far), so I'm getting a bit nervous :) but I checked the other commits in the same deploy and couldn't spot anything that could cause an increase in HTTP requests of this magnitude.
I'm not sure I follow you completely. Any pointers on what to look at more specifically? What I did try, however, is looking at traces. Given that the hit count increased and the latency decreased, I was looking for fast traces, under 1ms, and I can see plenty of these.
Do these ring a bell with you guys? If I look before the deploy/spike, I can't see any of these URLs in the traces.
Is it safe to turn on diagnostics in production? I'm a bit hesitant to add extra load to the system. Besides the increased hit count, everything seems to run fine, and I'm not sure we could easily spot a difference in hits on our staging environment, which is generally much quieter anyway. I'd be happy to try it out if it's safe, though.
If you're able to reproduce this in a test environment or a canary, I'd suggest starting there when using health metrics. Health metrics are not the same thing as "debug" mode: they emit StatsD metrics over UDP to the agent (assuming your agent is running StatsD), but they should not produce any additional log messages. In that sense they should be production safe, but it's always a good idea to try this out in the least sensitive setting you can manage.
It's very weird that you're seeing lots of these, and that they're under 1ms (unless the payload is small and the agent is co-located on the same host/container). We'll have to look deeper into this... I have some possible ideas.
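Since the health metrics ride on the same DogStatsD channel as other custom metrics, one way to confirm the agent is actually listening on UDP before enabling them broadly is to send a throwaway test metric. This is a hedged sketch using the dogstatsd-ruby gem; the host, port, metric name, and tag are assumptions (defaults and placeholders), not values from this thread:

```ruby
require 'datadog/statsd'

# Send a throwaway counter to the agent's default DogStatsD address.
# If this shows up in your metrics, health metrics will reach the agent too.
statsd = Datadog::Statsd.new('localhost', 8125)
statsd.increment('ddtrace.sanity_check', tags: ['env:staging'])
statsd.close if statsd.respond_to?(:close) # older client versions have no #close
```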
Okay, one of my tests picked up a problem with the HTTP instrumentation that I think is causing this. While I work on confirming the cause, I would recommend either disabling the HTTP instrumentation or staying on 0.34.2 for now. Will keep you posted with any updates.
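For reference, a sketch of the mitigation under a 0.x ddtrace configuration: the simplest way to disable the Net::HTTP integration is to not enable it at all, i.e. remove or comment out any `c.use :http` line; the initializer path and the `c.use :rails` line are assumptions about a typical Rails setup, not taken from this thread:

```ruby
# config/initializers/datadog.rb
require 'ddtrace'

Datadog.configure do |c|
  c.use :rails            # keep the rest of your instrumentation as-is

  # Temporary mitigation: leave the Net::HTTP integration disabled by
  # removing/commenting out any `c.use :http` line you currently have.
  # c.use :http
end
```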
Okay, I think I may have found the cause: the HTTP circuit breaker wasn't short-circuiting the HTTP instrumentation for the transport's own requests, so it was generating traces for them. @gingerlime, can you give #1033 a try? You can also try the pre-release gem if that's more suitable for you:
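The exact pre-release gem reference wasn't preserved above, so purely as a hedged illustration: pointing a Gemfile at a branch of dd-trace-rb typically looks like the following, where the branch name is a made-up placeholder, not the real fix branch:

```ruby
# Gemfile — sketch only; 'fix/http-circuit-breaker' is a hypothetical branch name.
# Substitute the branch (or pre-release version) referenced in PR #1033.
gem 'ddtrace', git: 'https://github.com/DataDog/dd-trace-rb.git',
               branch: 'fix/http-circuit-breaker'
```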
Thanks again @delner! I deployed the branch on our staging environment, and it definitely looks like the number of hits dropped after the deploy.
Okay great, glad to see this is effective @gingerlime. We're going to try to deploy this as a bugfix today; I'll keep you posted.
Alright, we merged the PR to fix this. We'll deploy it shortly as 0.35.2. Thanks for the report @gingerlime, please always feel free to report anything suspicious, and don't worry too much about any false alarms :) I'm glad we were able to find and fix this.
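Once 0.35.2 is out, picking up the fix is just a version bump in the Gemfile; a minimal sketch, assuming you track the gem with a pessimistic constraint:

```ruby
# Gemfile — require at least the release that includes the circuit-breaker fix.
gem 'ddtrace', '~> 0.35', '>= 0.35.2'
```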
@gingerlime thank you again for this issue report!
Thank you both for the quick turnaround time and for keeping me posted. I really appreciate it! |
(Original issue description:) This correlates with a deploy which included the ddtrace upgrade from 0.34.2 to 0.35.1. Was there a change to sampling rates or something?