santa-api /_ah/warmup 500 errors #32
Comments
Is there a stacktrace?
…On Fri, Nov 2, 2018 at 7:59 PM Alexander Mohr ***@***.***> wrote:
We're getting a LOT of 500s from the santa-api with errors like these in
the logs:
The request failed because the instance could not start successfully
Threads started by this request continued executing past the hard deadline.
any ideas?
|
This happens when the instance does not become live within the allotted time (I think 2s?). I think we saw this once before when there was work being done in appengine_config.py that didn't finish in time. If I remember correctly, it was a pain to debug at the time because the normal tracing mechanisms didn't work. I'd try removing any work being done either directly in the config or by things imported from the config. It could also be a service degradation on App Engine's part but, regardless, keep us posted on any findings/progress. Thanks! |
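One way to take work out of the import path, as suggested above, is to defer it until first use. The following is only a minimal sketch of that idea, not Upvote's actual code: `LazyValue`, `load_expensive_settings`, and `SETTINGS` are hypothetical names, and the factory stands in for whatever slow work the config currently triggers.

```python
import threading


class LazyValue(object):
    """Runs an expensive factory on first use instead of at import time."""

    def __init__(self, factory):
        self._factory = factory
        self._lock = threading.Lock()
        self._ready = False
        self._value = None

    def get(self):
        # Double-checked locking: only the first caller pays the cost,
        # and concurrent callers never run the factory twice.
        if not self._ready:
            with self._lock:
                if not self._ready:
                    self._value = self._factory()
                    self._ready = True
        return self._value


calls = []


def load_expensive_settings():
    # Hypothetical stand-in for slow work (datastore reads, network calls).
    calls.append(1)
    return {'loaded': True}


# Importing this module stays cheap: the factory has not run yet.
SETTINGS = LazyValue(load_expensive_settings)
```

Code that needs the value calls `SETTINGS.get()`, so the cost moves from instance startup to the first request that actually uses it.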
Some stats, in the last hour we've had 7,435 of Server errors with URI
we're using the default
Can I just comment out that line? That sounds like something that should instead be done when the tables are first set up. |
I suspect that may be the cause (and I think that code was recently touched). The reason that check is there is because, if those certs aren't whitelisted, a lockdown fleet is DoS'd and this seems like a tolerable failsafe. But yeah. Do try commenting that out. |
I think it's separate from this bug but I'll address it here briefly. What you're describing is called request hedging and I'm not sure it's applicable here. It works well for smoothing the effects of high tail latency curves. Notably, though, as you can see from the trace summary on the right, it looks like all the request operations are taking up around 200ms (of 60s) so they're not the cause of the timeout. Upvote provides a lever, SANTA_EVENT_BATCH_SIZE, to alleviate issues related to individual syncs taking too much time. All this said, it looks like you may be encountering some issues with your App Engine instances. I'd be happy to help figure out where this latency is coming from. |
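For readers unfamiliar with the term, request hedging can be sketched roughly as follows. This is an illustration under stated assumptions, not something Upvote or Santa implements: `hedged_call` and `flaky_backend` are hypothetical names, and the technique is only safe for idempotent requests because the same call may run twice.

```python
import concurrent.futures
import time


def hedged_call(fn, hedge_delay, pool):
    """Runs fn; if it has not finished after hedge_delay seconds, starts a
    second copy and returns the result of whichever attempt finishes first."""
    first = pool.submit(fn)
    done, _ = concurrent.futures.wait([first], timeout=hedge_delay)
    if done:
        return first.result()
    # The primary attempt is slow: issue the hedge and take the earliest one.
    second = pool.submit(fn)
    done, _ = concurrent.futures.wait(
        [first, second], return_when=concurrent.futures.FIRST_COMPLETED)
    return next(iter(done)).result()


attempts = []


def flaky_backend():
    # Hypothetical backend: the first attempt hits a slow path,
    # later attempts answer immediately.
    attempts.append(1)
    if len(attempts) == 1:
        time.sleep(0.5)
        return 'slow'
    return 'fast'


pool = concurrent.futures.ThreadPoolExecutor(max_workers=2)
result = hedged_call(flaky_backend, 0.05, pool)
pool.shutdown(wait=True)
```

Hedging pays off when a few requests are much slower than the rest. With every operation finishing in around 200ms, as in the trace discussed above, there is no slow tail for it to cut.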
great, thanks! how can I provide more information? |
Actually, @russellhancox just realized that it could be related to your instance size being too small to churn through the initial warmup. What instance type are you using? |
I'm using the default from the upvote repo, whatever that creates. Sorry, slowly acclimating to GAE from AWS. |
I just noticed that this repo is using 1.9.65 of GAE SDK, perhaps it would be better to upgrade? |
Oh also a good idea. So yeah:
|
ok, ran into bazelbuild/rules_appengine#90, resolved, now ran into:
never ends :) |
ok, got things working, but now I have:
|
ok, figured it out, changes required for others that want to do this in the future: farmersbusinessnetwork@05abd2d (trick was adding |
I'm not super familiar with the rules_appengine codebase now that it's refactored but we'll need to update our use of it to accommodate those (backward incompatible) changes. Alright I guess it's bisecting time 😢. Do previous versions (i.e. before major merges) exhibit the same behavior? If not, which is the latest merge you're seeing this behavior in? |
update: I removed my last posts because the status code returned was 200. Let me find if there are any 500s |
Hmmm ok so that's not a bad sign. Can you trigger the behavior by flipping between F2 and F4? |
(quick note: in the future can you avoid removing posts wholesale and instead reformat them to strikethrough the obsoleted text? makes things a bit less ambiguous for our collective future selves 😄) |
ahhhhhh interesting. You might try to have it flush in-process 1 and trigger the flush manually on the upvote common base handler(s). I think you could add something like:
def dispatch(self):
  datadog_stats = datadog.ThreadStats()
  try:
    super(UpvoteRequestHandler, self).dispatch()
  finally:
    datadog_stats.flush() |
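The reason a flush in `finally` helps can be shown without the datadog library. The sketch below uses a stand-in metrics client (`FakeStats` is not the datadog API) and hypothetical handler classes. It assumes the failure mode is buffered metrics being sent by a background thread that outlives the request, which the thread suggests but does not confirm.

```python
class FakeStats(object):
    """Stand-in for a buffered metrics client; NOT the datadog API."""

    def __init__(self):
        self.buffered = []
        self.sent = []

    def increment(self, name):
        self.buffered.append(name)

    def flush(self):
        # Send everything buffered so far, synchronously, in this thread.
        self.sent.extend(self.buffered)
        self.buffered = []


class BaseHandler(object):
    """Hypothetical base handler that flushes metrics after each request."""

    # Shared client, so metrics recorded anywhere in the request get flushed.
    stats = FakeStats()

    def handle(self):
        raise NotImplementedError

    def dispatch(self):
        try:
            self.handle()
        finally:
            # Runs even when handle() raises, before the request returns,
            # so no thread is left working past the request deadline.
            self.stats.flush()


class FailingHandler(BaseHandler):
    def handle(self):
        self.stats.increment('santa_api.request')
        raise ValueError('boom')


try:
    FailingHandler().dispatch()
except ValueError:
    pass
```

Unlike the snippet above, this sketch flushes one shared client rather than creating a new one per request. A freshly created client would have nothing buffered to send.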
switching to sync API. long term probably need to run a datadog agent process. btw, I really appreciate all the help. Has really helped getting upvote running well. Next thing I'm adding in our branch is compiler whitelist GUI support. I ended up having to write a blockable uploader since santa doesn't upload blockables for "compilers" (ex: codesign) which are signed by an Apple cert. |
Absolutely! Happy you got to the bottom of this! As for the compiler feature, it was really only designed to handle a few binaries (most stuff ends up passing through I'd be curious to hear any experiences you have with the compiler feature and would be happy to help out and pass those on to the other engineers involved. And thanks for kicking the tires on Upvote 😄 |
Here's what I ended up writing btw: https://github.com/farmersbusinessnetwork/upvote/blob/master/fbn/sync_file.py with this we're thinking of adding a "UpVote compiler" button or something like that. |
Yeah that'll certainly do. I'd try to avoid doing too much extra if you can get Santa's plumbing to populate and upload an event via the standard process. Less code to maintain and less chance that a bug or incompatibility causes issues. |
ya, may log a bug against santa to do it, or if I feel adventurous fix it myself :) |
ok, cool, that (farmersbusinessnetwork@da3b321) also fixed the fact that my DD stats reporting was broken :) Going to close this, thanks for the help guys! Feel free to use the fixes I came up with to upgrade the bazel appengine version. However I'm still at a loss as to why the SDK instance version didn't get bumped. If you have any thoughts let me know! |