santa-api /_ah/warmup 500 errors #32

thehesiod · 2018-11-03T00:59:56Z

We're getting a LOT of 500s from the santa-api with errors like these in the logs:

The request failed because the instance could not start successfully
Threads started by this request continued executing past the hard deadline.

any ideas?

The text was updated successfully, but these errors were encountered:

chief8192 · 2018-11-03T01:05:57Z

Is there a stacktrace?

…

On Fri, Nov 2, 2018 at 7:59 PM Alexander Mohr ***@***.***> wrote: We're getting a LOT of 500s from the santa-api with errors like these in the logs: The request failed because the instance could not start successfully Threads started by this request continued executing past the hard deadline. any ideas? — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#32>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ABCsssxCxnMbkNGjtEWbWQzKgpaGa76Yks5urOqMgaJpZM4YMj7d> .

msuozzo · 2018-11-03T03:12:26Z

This happens when the instance does not become live within the alotted time (I think 2s?). I think we saw this once before when there was work being done in the appengine_config.py that didn't finish in time. If I remember correctly, it was a pain to debug at the time because the normal tracing mechanisms didn't work. I'd try removing any work either being done directly from the config or by things imported from the config.

It also could be a service degradation on App Engines part but, regardless, keep us posted on any findings/progress.

Thanks!

thehesiod · 2018-11-05T21:04:51Z

Some stats, in the last hour we've had 7,435 of Server errors with URI /_ah/warmup per the dashboard. That's 99.65% of the requests :( If I click on one I get the lines:

0.1.0.3 - - [05/Nov/2018:13:01:26 -0800] "GET /_ah/warmup HTTP/1.1" 500 - - "-" "auto.santa-api.santaupvote.appspot.com" ms=1 cpu_ms=0 cpm_usd=0 loading_request=0 instance=- app_engine_release=1.9.65 trace_id=77b15853f52c3f58725a786fdf8dd95f
2018-11-05 13:01:26.943 PST Threads started by this request continued executing past the hard deadline.
2018-11-05 13:01:26.943 PST The request failed because the instance could not start successfully

we're using the default appengine_config.py, it appears it's doing:

rule.EnsureCriticalRules(settings.CRITICAL_MAC_OS_CERT_HASHES)

Can I just comment out that line? That sounds like something that should instead be done when the tables are first setup.

msuozzo · 2018-11-05T21:29:48Z

I suspect that may be the cause (and I think that code was recently touched). The reason that check is there is because, if those certs aren't whitelisted, a lockdown fleet is DoS'd and this seems like a tolerable failsafe.

But yeah. Do try commenting that out.

thehesiod · 2018-11-05T21:30:17Z

aha, just found out about Stackdriver-Trace. Check this out

seems like it may be a google backend issue?

thehesiod · 2018-11-05T21:33:28Z

holy cow, look at this:

this does not seem like it's production ready. What we do our on AWS side is have short timeouts with several retries. In many cases do several idempotent requests in parallel, pick the winner and cancel the rest.

msuozzo · 2018-11-05T21:46:35Z

I think it's separate from this bug but I'll address it here briefly.

What you're describing is called request hedging and I'm not sure it's applicable here. It works well for smoothing the effects of high tail latency curves. Notably, though, as you can see from the trace summary on the right, it looks like all the request operations are taking up around 200ms (of 60s) so they're not the cause of the timeout.

Upvote provides a lever, SANTA_EVENT_BATCH_SIZE, to alleviate issues related to individual syncs taking too much time.

All this said, it looks like you may be encountering some issues with your App Engine instances. I'd be happy to help figure out where this latency is coming from.

thehesiod · 2018-11-05T21:55:04Z

great, thanks! how can I provide more information?

msuozzo · 2018-11-05T22:00:13Z

Actually, @russellhancox just realized that it could be related to your instance size being too small to churn through the initial warmup. What instance type are you using?

thehesiod · 2018-11-05T22:00:55Z

I'm using the default from the upvote repo, whatever that creates. Sorry, slowly acclimating to GAE from AWS.

thehesiod · 2018-11-05T22:01:27Z

I just noticed that this repo is using 1.9.65 of GAE SDK, perhaps it would be better to upgrade?

msuozzo · 2018-11-05T22:03:49Z

Oh also a good idea. So yeah:

Try upgrading the santa_api to use an F4 instance (more info)
Try upgrading the SDK
Try reducing the BATCH_SIZE

thehesiod · 2018-11-05T22:08:34Z

thehesiod · 2018-11-06T00:06:26Z

ok, ran into bazelbuild/rules_appengine#90, resolved, now ran into:

INFO: Build completed successfully, 1 total action
INFO: Build completed successfully, 1 total action
ERROR: (gcloud.app.deploy) [/private/var/folders/9z/50thrw1j52g7v2d7yrl87l940000gn/T/war.3V8ZEdG9/__main__/santaupvote] does not exist.

never ends :)

thehesiod · 2018-11-06T03:10:16Z

ok, got things working, but now I have:

Traceback (most recent call last):
  File "/base/alloc/tmpfs/dynamic_runtimes/python27g/3b44e98ed7fbb86b/python27/python27_lib/versions/1/google/appengine/runtime/wsgi.py", line 240, in Handle
    handler = _config_handle.add_wsgi_middleware(self._LoadHandler())
  File "/base/alloc/tmpfs/dynamic_runtimes/python27g/3b44e98ed7fbb86b/python27/python27_lib/versions/1/google/appengine/api/lib_config.py", line 358, in __getattr__
    self._update_configs()
  File "/base/alloc/tmpfs/dynamic_runtimes/python27g/3b44e98ed7fbb86b/python27/python27_lib/versions/1/google/appengine/api/lib_config.py", line 294, in _update_configs
    self._registry.initialize()
  File "/base/alloc/tmpfs/dynamic_runtimes/python27g/3b44e98ed7fbb86b/python27/python27_lib/versions/1/google/appengine/api/lib_config.py", line 165, in initialize
    import_func(self._modname)
  File "/base/alloc/tmpfs/dynamic_runtimes/python27g/3b44e98ed7fbb86b/python27/python27_dist/lib/python2.7/importlib/__init__.py", line 37, in import_module
    __import__(name)
  File "/base/data/home/apps/m~santaupvote/santa-api:auto.413785751680884145/appengine_config.py", line 17, in <module>
    from upvote.gae.datastore.models import rule
  File "/base/data/home/apps/m~santaupvote/santa-api:auto.413785751680884145/upvote/gae/datastore/models/rule.py", line 17, in <module>
    from upvote.gae.datastore.models import santa as santa_models
  File "/base/data/home/apps/m~santaupvote/santa-api:auto.413785751680884145/upvote/gae/datastore/models/santa.py", line 21, in <module>
    from upvote.gae.bigquery import tables
  File "/base/data/home/apps/m~santaupvote/santa-api:auto.413785751680884145/upvote/gae/bigquery/tables.py", line 23, in <module>
    import upvote.gae.shared.common.google_cloud_lib_fixer  # pylint: disable=unused-import
  File "/base/data/home/apps/m~santaupvote/santa-api:auto.413785751680884145/upvote/gae/shared/common/google_cloud_lib_fixer.py", line 32, in <module>
    from requests_toolbelt.adapters import appengine
ImportError: No module named requests_toolbelt.adapters

thehesiod · 2018-11-06T04:32:50Z

ok, figured it out, changes required for others that want to do this in the future: farmersbusinessnetwork@05abd2d (trick was adding overwrite_appengine_config = False) however strangely the instance SDK was not upgraded with these changes, any ideas why not?

msuozzo · 2018-11-06T06:22:02Z

I'm not super familiar with the rule_appengine codebase now that it's refactored but we'll need to update our use of it to accommodate those (backward incompatible) changes.

Alright I guess it's bisecting time 😢. Do previous versions (i.e. before major merges) exhibit the same behavior? If not, which is the latest merge you're seeing this behavior in?

thehesiod · 2018-11-06T06:48:01Z

update: I removed my last posts because the status code returned was 200. Let me find if there are any 500s

msuozzo · 2018-11-06T06:54:53Z

Hmmm ok so that's not a bad sign. Can you trigger the behavior by flipping between F2 and F4?

msuozzo · 2018-11-06T06:58:25Z

(quick note: in the future can you avoid removing posts wholesale and instead reformat them to strikethrough the obsoleted text? makes things a bit less ambiguous for our collective future selves 😄)

thehesiod · 2018-11-06T07:00:35Z

hmm, so when a request times out after 60s it looks like:

where the logservice.Flush goes on for > 200 times. What's interesting is that my datadog service outputs right before this. My current guess is that the threads it creates are preventing the request from "finishing"

msuozzo · 2018-11-06T07:11:57Z

ahhhhhh interesting. You might try to have it flush in-process 1 and trigger the flush manually on the upvote common base handler(s).

I think you could add something like:

def dispatch(self):
  datadog_stats = datadog.ThreadStats()
  try:
    super(UpvoteRequestHandler, self).dispatch()
  finally:
    datadog_stats.flush()

thehesiod · 2018-11-06T07:17:39Z

switching to sync API. long term probably need to run a datadog agent process. btw, I really appreciate all the help. Has really helped getting upvote running well. Next thing I'm adding in our branch is compiler whitelist GUI support. I ended up having to writing a blockable uploader since santa doesn't upload blockables for "compilers" (ex: codesign) which are signed by an apple cert.

msuozzo · 2018-11-06T07:23:03Z

Absolutely! Happy you got to the bottom of this!

As for the compiler feature, it was really only designed to handle a few binaries (most stuff ends up passing through ld anyway) and for those binaries the most expedient way of getting event data ingested was just to create a blacklist rule for that hash and trigger a block manually. It's kludgy but it worked on the small scale necessary.

I'd be curious to hear any experiences you have with the compiler feature and would be happy to help out and pass those on to the other engineers involved.

And thanks for kicking the tires on Upvote 😄

thehesiod · 2018-11-06T07:26:10Z

Here's what I ended up writing btw: https://github.com/farmersbusinessnetwork/upvote/blob/master/fbn/sync_file.py with this we're thinking of adding a "UpVote compiler" button or something like that.

msuozzo · 2018-11-06T07:29:16Z

Yeah that'll certainly do. I'd try to avoid doing too much extra if you can get Santa's plumbing to populate and upload an event via the standard process. Less code to maintain and less chance that a bug or incompatibility causes issues.

thehesiod · 2018-11-06T07:38:38Z

ya, may log a bug against santa to do it, or if I feel adventurous fix it myself :)

thehesiod · 2018-11-06T08:11:39Z

ok, cool, that (farmersbusinessnetwork@da3b321) also fixed the fact that my DD stats reporting was broken :) Going to close this, thanks for the help guys! Feel free to use the fixes I came up with to upgrade the bazel appengine version. However I'm still at a loss as to why the SDK instance version didn't get bumped. If you have any thoughts let me know!

thehesiod closed this as completed Nov 6, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

santa-api /_ah/warmup 500 errors #32

santa-api /_ah/warmup 500 errors #32

thehesiod commented Nov 3, 2018

chief8192 commented Nov 3, 2018 via email

msuozzo commented Nov 3, 2018 •

edited

Loading

thehesiod commented Nov 5, 2018 •

edited

Loading

msuozzo commented Nov 5, 2018

thehesiod commented Nov 5, 2018

thehesiod commented Nov 5, 2018

msuozzo commented Nov 5, 2018 •

edited

Loading

thehesiod commented Nov 5, 2018

msuozzo commented Nov 5, 2018

thehesiod commented Nov 5, 2018

thehesiod commented Nov 5, 2018

msuozzo commented Nov 5, 2018

thehesiod commented Nov 5, 2018

thehesiod commented Nov 6, 2018 •

edited

Loading

thehesiod commented Nov 6, 2018

thehesiod commented Nov 6, 2018 •

edited

Loading

msuozzo commented Nov 6, 2018

thehesiod commented Nov 6, 2018

msuozzo commented Nov 6, 2018

msuozzo commented Nov 6, 2018

thehesiod commented Nov 6, 2018

msuozzo commented Nov 6, 2018 •

edited

Loading

thehesiod commented Nov 6, 2018

msuozzo commented Nov 6, 2018 •

edited

Loading

thehesiod commented Nov 6, 2018

msuozzo commented Nov 6, 2018

thehesiod commented Nov 6, 2018

thehesiod commented Nov 6, 2018

santa-api /_ah/warmup 500 errors #32

santa-api /_ah/warmup 500 errors #32

Comments

thehesiod commented Nov 3, 2018

chief8192 commented Nov 3, 2018 via email

msuozzo commented Nov 3, 2018 • edited Loading

thehesiod commented Nov 5, 2018 • edited Loading

msuozzo commented Nov 5, 2018

thehesiod commented Nov 5, 2018

thehesiod commented Nov 5, 2018

msuozzo commented Nov 5, 2018 • edited Loading

thehesiod commented Nov 5, 2018

msuozzo commented Nov 5, 2018

thehesiod commented Nov 5, 2018

thehesiod commented Nov 5, 2018

msuozzo commented Nov 5, 2018

thehesiod commented Nov 5, 2018

thehesiod commented Nov 6, 2018 • edited Loading

thehesiod commented Nov 6, 2018

thehesiod commented Nov 6, 2018 • edited Loading

msuozzo commented Nov 6, 2018

thehesiod commented Nov 6, 2018

msuozzo commented Nov 6, 2018

msuozzo commented Nov 6, 2018

thehesiod commented Nov 6, 2018

msuozzo commented Nov 6, 2018 • edited Loading

thehesiod commented Nov 6, 2018

msuozzo commented Nov 6, 2018 • edited Loading

thehesiod commented Nov 6, 2018

msuozzo commented Nov 6, 2018

thehesiod commented Nov 6, 2018

thehesiod commented Nov 6, 2018

msuozzo commented Nov 3, 2018 •

edited

Loading

thehesiod commented Nov 5, 2018 •

edited

Loading

msuozzo commented Nov 5, 2018 •

edited

Loading

thehesiod commented Nov 6, 2018 •

edited

Loading

thehesiod commented Nov 6, 2018 •

edited

Loading

msuozzo commented Nov 6, 2018 •

edited

Loading

msuozzo commented Nov 6, 2018 •

edited

Loading