
Scheduler instance boot failure #85

Closed · ababaian opened this issue May 13, 2020 · 6 comments

Labels: AWS (Amazon Web Services Tasks), bug (Something isn't working)
@ababaian (Owner)

When running Serratus via terraform apply, all of the initial instances come online, but about 20% of the time the scheduler hits what appears to be an AWS credentials problem (the IAM role is attached correctly) and the cluster has to be restarted. This only happens at initiation, so it's easy to fix but kind of annoying.

Cloudwatch logs

2020-05-13T18:19:35.453Z [2020-05-13 18:19:35 +0000] [6] [INFO] Starting gunicorn 20.0.4
2020-05-13T18:19:35.454Z [2020-05-13 18:19:35 +0000] [6] [INFO] Listening at: http://0.0.0.0:8000 (6)
2020-05-13T18:19:35.454Z [2020-05-13 18:19:35 +0000] [6] [INFO] Using worker: sync
2020-05-13T18:19:35.455Z [2020-05-13 18:19:35 +0000] [8] [INFO] Booting worker with pid: 8
2020-05-13T18:19:35.657Z Creating new process
2020-05-13T18:19:36.681Z Exception in thread Thread-2:
2020-05-13T18:19:36.681Z Traceback (most recent call last):
2020-05-13T18:19:36.681Z   File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
2020-05-13T18:19:36.682Z     self.run()
2020-05-13T18:19:36.682Z   File "/usr/lib/python3.8/threading.py", line 870, in run
2020-05-13T18:19:36.682Z     self._target(*self._args, **self._kwargs)
2020-05-13T18:19:36.682Z   File "/opt/flask_app/cron.py", line 60, in adjust_autoscaling_loop
2020-05-13T18:19:36.682Z     autoscaling = boto3.client('autoscaling', region_name=app.config["AWS_REGION"])
2020-05-13T18:19:36.682Z   File "/usr/local/lib/python3.8/dist-packages/boto3/__init__.py", line 91, in client
2020-05-13T18:19:36.683Z     return _get_default_session().client(*args, **kwargs)
2020-05-13T18:19:36.683Z   File "/usr/local/lib/python3.8/dist-packages/boto3/session.py", line 258, in client
2020-05-13T18:19:36.683Z     return self._session.create_client(
2020-05-13T18:19:36.683Z   File "/usr/local/lib/python3.8/dist-packages/botocore/session.py", line 824, in create_client
2020-05-13T18:19:36.683Z     endpoint_resolver = self._get_internal_component('endpoint_resolver')
2020-05-13T18:19:36.684Z   File "/usr/local/lib/python3.8/dist-packages/botocore/session.py", line 697, in _get_internal_component
2020-05-13T18:19:36.684Z     return self._internal_components.get_component(name)
2020-05-13T18:19:36.684Z   File "/usr/local/lib/python3.8/dist-packages/botocore/session.py", line 923, in get_component
2020-05-13T18:19:36.684Z     del self._deferred[name]
2020-05-13T18:19:36.684Z KeyError: 'endpoint_resolver'

@ababaian added the AWS (Amazon Web Services Tasks) and bug (Something isn't working) labels on May 13, 2020
@ababaian (Owner, Author)

The other version of this error message is here:

[2020-05-09 23:34:13 +0000] [6] [INFO] Starting gunicorn 20.0.4
[2020-05-09 23:34:13 +0000] [6] [INFO] Listening at: http://0.0.0.0:8000 (6)
[2020-05-09 23:34:13 +0000] [6] [INFO] Using worker: sync
[2020-05-09 23:34:13 +0000] [8] [INFO] Booting worker with pid: 8
Creating new process
Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/flask_app/cron.py", line 60, in adjust_autoscaling_loop
    autoscaling = boto3.client('autoscaling', region_name=app.config["AWS_REGION"])
  File "/usr/local/lib/python3.8/dist-packages/boto3/__init__.py", line 91, in client
    return _get_default_session().client(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/boto3/session.py", line 258, in client
    return self._session.create_client(
  File "/usr/local/lib/python3.8/dist-packages/botocore/session.py", line 823, in create_client
    credentials = self.get_credentials()
  File "/usr/local/lib/python3.8/dist-packages/botocore/session.py", line 427, in get_credentials
    self._credentials = self._components.get_component(
  File "/usr/local/lib/python3.8/dist-packages/botocore/session.py", line 923, in get_component
    del self._deferred[name]
KeyError: 'credential_provider'
clear_terminated_jobs() finished. Running again in 10 seconds

@brietaylor (Collaborator)

Might be related to boto/boto3#1592, as I think we're using session objects across threads.
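One commonly suggested workaround for shared-session crashes like this is to give each thread its own `boto3.Session` instead of calling `boto3.client(...)`, which shares the module-level default session. A minimal sketch of the per-thread-session pattern; `get_client` and `_local` are illustrative names, and `_FakeSession` stands in for `boto3.Session` so the sketch runs without AWS:

```python
import threading

class _FakeSession:
    """Stand-in for boto3.Session so the sketch runs without AWS."""
    def client(self, name, region_name=None):
        # Return something that identifies which session built the client.
        return (id(self), name, region_name)

_local = threading.local()

def get_client(name, region):
    # Lazily create one session per thread and cache it, so no session
    # object is ever shared across threads.
    if not hasattr(_local, "session"):
        _local.session = _FakeSession()  # boto3.Session() in real code
    return _local.session.client(name, region_name=region)
```

With this pattern, `adjust_autoscaling_loop` in `cron.py` would call the per-thread helper instead of `boto3.client('autoscaling', ...)` directly.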

@ababaian (Owner, Author)

Message today


2020-05-17T22:05:04.619Z [2020-05-17 22:05:04 +0000] [6] [INFO] Starting gunicorn 20.0.4
2020-05-17T22:05:04.620Z [2020-05-17 22:05:04 +0000] [6] [INFO] Listening at: http://0.0.0.0:8000 (6)
2020-05-17T22:05:04.620Z [2020-05-17 22:05:04 +0000] [6] [INFO] Using worker: sync
2020-05-17T22:05:04.622Z [2020-05-17 22:05:04 +0000] [8] [INFO] Booting worker with pid: 8
2020-05-17T22:05:04.846Z Creating new process
2020-05-17T22:05:05.872Z Exception in thread Thread-2:
2020-05-17T22:05:05.872Z Traceback (most recent call last):
2020-05-17T22:05:05.872Z   File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
2020-05-17T22:05:05.872Z     self.run()
2020-05-17T22:05:05.872Z   File "/usr/lib/python3.8/threading.py", line 870, in run
2020-05-17T22:05:05.873Z     self._target(*self._args, **self._kwargs)
2020-05-17T22:05:05.873Z   File "/opt/flask_app/cron.py", line 60, in adjust_autoscaling_loop
2020-05-17T22:05:05.873Z     autoscaling = boto3.client('autoscaling', region_name=app.config["AWS_REGION"])
2020-05-17T22:05:05.873Z   File "/usr/local/lib/python3.8/dist-packages/boto3/__init__.py", line 91, in client
2020-05-17T22:05:05.873Z     return _get_default_session().client(*args, **kwargs)
2020-05-17T22:05:05.873Z   File "/usr/local/lib/python3.8/dist-packages/boto3/session.py", line 258, in client
2020-05-17T22:05:05.873Z     return self._session.create_client(
2020-05-17T22:05:05.873Z   File "/usr/local/lib/python3.8/dist-packages/botocore/session.py", line 824, in create_client
2020-05-17T22:05:05.873Z     endpoint_resolver = self._get_internal_component('endpoint_resolver')
2020-05-17T22:05:05.873Z   File "/usr/local/lib/python3.8/dist-packages/botocore/session.py", line 697, in _get_internal_component
2020-05-17T22:05:05.873Z     return self._internal_components.get_component(name)
2020-05-17T22:05:05.873Z   File "/usr/local/lib/python3.8/dist-packages/botocore/session.py", line 923, in get_component
2020-05-17T22:05:05.878Z     del self._deferred[name]
2020-05-17T22:05:05.878Z KeyError: 'endpoint_resolver'
2020-05-17T22:05:06.219Z clear_terminated_jobs() finished. Running again in 600 seconds

@ababaian (Owner, Author)

Perhaps related, but when trying to load ~96K accessions into the scheduler I get the following error (it happened on 3 attempts). Reducing input to 20K per batch now.


2020-05-26T22:16:57.119Z Creating new process
2020-05-26T22:16:58.403Z clear_terminated_jobs() finished. Running again in 600 seconds
2020-05-26T22:16:58.945Z ajust_autoscaling() finished. Running again in 300 seconds
2020-05-26T22:21:59.300Z ajust_autoscaling() finished. Running again in 300 seconds
2020-05-26T22:22:44.275Z [2020-05-26 22:22:44 +0000] [6] [CRITICAL] WORKER TIMEOUT (pid:8)
2020-05-26T22:22:44.276Z [2020-05-26 22:22:44 +0000] [8] [INFO] Worker exiting (pid: 8)
2020-05-26T22:22:44.340Z [2020-05-26 22:22:44 +0000] [11] [INFO] Booting worker with pid: 11
2020-05-26T22:22:44.552Z Creating new process
2020-05-26T22:22:44.563Z Exception in thread Thread-2:
2020-05-26T22:22:44.563Z Traceback (most recent call last):
2020-05-26T22:22:44.563Z   File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
2020-05-26T22:22:44.563Z     self.run()
2020-05-26T22:22:44.563Z   File "/usr/lib/python3.8/threading.py", line 870, in run
2020-05-26T22:22:44.564Z     self._target(*self._args, **self._kwargs)
2020-05-26T22:22:44.564Z   File "/opt/flask_app/cron.py", line 60, in adjust_autoscaling_loop
2020-05-26T22:22:44.565Z     autoscaling = boto3.client('autoscaling', region_name=app.config["AWS_REGION"])
2020-05-26T22:22:44.565Z   File "/usr/local/lib/python3.8/dist-packages/boto3/__init__.py", line 91, in client
2020-05-26T22:22:44.566Z     return _get_default_session().client(*args, **kwargs)
2020-05-26T22:22:44.566Z   File "/usr/local/lib/python3.8/dist-packages/boto3/session.py", line 258, in client
2020-05-26T22:22:44.567Z     return self._session.create_client(
2020-05-26T22:22:44.567Z   File "/usr/local/lib/python3.8/dist-packages/botocore/session.py", line 823, in create_client
2020-05-26T22:22:44.568Z     credentials = self.get_credentials()
2020-05-26T22:22:44.568Z   File "/usr/local/lib/python3.8/dist-packages/botocore/session.py", line 427, in get_credentials
2020-05-26T22:22:44.568Z     self._credentials = self._components.get_component(
2020-05-26T22:22:44.568Z   File "/usr/local/lib/python3.8/dist-packages/botocore/session.py", line 923, in get_component
2020-05-26T22:22:44.569Z     del self._deferred[name]
2020-05-26T22:22:44.569Z KeyError: 'credential_provider'
2020-05-26T22:22:45.837Z clear_terminated_jobs() finished. Running again in 600 seconds

@ababaian (Owner, Author)

So while I was booting today, it appears that the credentials error arises when you run the create_tunnel script too quickly, while the instance is still booting up. There appears to be a race condition of some sort; if you just give everything time, it boots up normally.

The issue with adding 90K accessions at once is semi-resolved: I just add accessions in 20K batches. So far I have 60K loaded into the scheduler and nothing has blown up.
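If the race is the instance metadata service not yet serving the IAM role credentials at boot, a small retry loop before the scheduler threads start could paper over it. A hedged sketch; `wait_until_ready` and its parameters are illustrative, not part of Serratus:

```python
import time

def wait_until_ready(check, attempts=10, delay=5):
    """Call `check` until it succeeds or attempts run out, sleeping
    `delay` seconds between tries. On the scheduler, `check` might wrap
    boto3.Session().get_credentials() to wait out the window where the
    instance metadata service has no role credentials yet."""
    for i in range(attempts):
        try:
            return check()
        except Exception:
            if i == attempts - 1:
                raise
            time.sleep(delay)
```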

@ababaian (Owner, Author)

I don't remember which commit closed this, but it is no longer an issue.
