
Scheduler instance boot failure #85

Closed · ababaian opened this issue May 13, 2020 · 6 comments

Labels: AWS (Amazon Web Services Tasks), bug (Something isn't working)
@ababaian (Owner)

When running Serratus via terraform apply, all of the initial instances come online, but about 20% of the time the scheduler hits what appears to be an AWS credentials problem (the IAM role is attached correctly) and the cluster has to be restarted. This only happens at initiation, so it's easy to fix but kind of annoying.

Cloudwatch logs

2020-05-13T18:19:35.453Z [2020-05-13 18:19:35 +0000] [6] [INFO] Starting gunicorn 20.0.4
2020-05-13T18:19:35.454Z [2020-05-13 18:19:35 +0000] [6] [INFO] Listening at: http://0.0.0.0:8000 (6)
2020-05-13T18:19:35.454Z [2020-05-13 18:19:35 +0000] [6] [INFO] Using worker: sync
2020-05-13T18:19:35.455Z [2020-05-13 18:19:35 +0000] [8] [INFO] Booting worker with pid: 8
2020-05-13T18:19:35.657Z Creating new process
2020-05-13T18:19:36.681Z Exception in thread Thread-2:
2020-05-13T18:19:36.681Z Traceback (most recent call last):
2020-05-13T18:19:36.681Z   File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
2020-05-13T18:19:36.682Z     self.run()
2020-05-13T18:19:36.682Z   File "/usr/lib/python3.8/threading.py", line 870, in run
2020-05-13T18:19:36.682Z     self._target(*self._args, **self._kwargs)
2020-05-13T18:19:36.682Z   File "/opt/flask_app/cron.py", line 60, in adjust_autoscaling_loop
2020-05-13T18:19:36.682Z     autoscaling = boto3.client('autoscaling', region_name=app.config["AWS_REGION"])
2020-05-13T18:19:36.682Z   File "/usr/local/lib/python3.8/dist-packages/boto3/__init__.py", line 91, in client
2020-05-13T18:19:36.683Z     return _get_default_session().client(*args, **kwargs)
2020-05-13T18:19:36.683Z   File "/usr/local/lib/python3.8/dist-packages/boto3/session.py", line 258, in client
2020-05-13T18:19:36.683Z     return self._session.create_client(
2020-05-13T18:19:36.683Z   File "/usr/local/lib/python3.8/dist-packages/botocore/session.py", line 824, in create_client
2020-05-13T18:19:36.683Z     endpoint_resolver = self._get_internal_component('endpoint_resolver')
2020-05-13T18:19:36.684Z   File "/usr/local/lib/python3.8/dist-packages/botocore/session.py", line 697, in _get_internal_component
2020-05-13T18:19:36.684Z     return self._internal_components.get_component(name)
2020-05-13T18:19:36.684Z   File "/usr/local/lib/python3.8/dist-packages/botocore/session.py", line 923, in get_component
2020-05-13T18:19:36.684Z     del self._deferred[name]
2020-05-13T18:19:36.684Z KeyError: 'endpoint_resolver'

@ababaian added the AWS (Amazon Web Services Tasks) and bug (Something isn't working) labels on May 13, 2020
@ababaian (Owner, Author)

The other version of this error message is here:

[2020-05-09 23:34:13 +0000] [6] [INFO] Starting gunicorn 20.0.4
[2020-05-09 23:34:13 +0000] [6] [INFO] Listening at: http://0.0.0.0:8000 (6)
[2020-05-09 23:34:13 +0000] [6] [INFO] Using worker: sync
[2020-05-09 23:34:13 +0000] [8] [INFO] Booting worker with pid: 8
Creating new process
Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/flask_app/cron.py", line 60, in adjust_autoscaling_loop
    autoscaling = boto3.client('autoscaling', region_name=app.config["AWS_REGION"])
  File "/usr/local/lib/python3.8/dist-packages/boto3/__init__.py", line 91, in client
    return _get_default_session().client(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/boto3/session.py", line 258, in client
    return self._session.create_client(
  File "/usr/local/lib/python3.8/dist-packages/botocore/session.py", line 823, in create_client
    credentials = self.get_credentials()
  File "/usr/local/lib/python3.8/dist-packages/botocore/session.py", line 427, in get_credentials
    self._credentials = self._components.get_component(
  File "/usr/local/lib/python3.8/dist-packages/botocore/session.py", line 923, in get_component
    del self._deferred[name]
KeyError: 'credential_provider'
clear_terminated_jobs() finished. Running again in 10 seconds

@brietaylor (Collaborator)

Might be related to boto/boto3#1592, as I think we're using session objects across threads.
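One commonly suggested workaround for shared-session crashes like this is to give each thread its own `boto3.Session` instead of calling `boto3.client(...)`, which shares the module-level default session. A minimal sketch of the per-thread-session pattern; `get_client` and `_local` are illustrative names, and `_FakeSession` stands in for `boto3.Session` so the sketch runs without AWS:

```python
import threading

class _FakeSession:
    """Stand-in for boto3.Session so the sketch runs without AWS."""
    def client(self, name, region_name=None):
        # Return something that identifies which session built the client.
        return (id(self), name, region_name)

_local = threading.local()

def get_client(name, region):
    # Lazily create one session per thread and cache it, so no session
    # object is ever shared across threads.
    if not hasattr(_local, "session"):
        _local.session = _FakeSession()  # boto3.Session() in real code
    return _local.session.client(name, region_name=region)
```

With this pattern, `adjust_autoscaling_loop` in `cron.py` would call the per-thread helper instead of `boto3.client('autoscaling', ...)` directly.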

@ababaian (Owner, Author)

Message today


2020-05-17T22:05:04.619Z [2020-05-17 22:05:04 +0000] [6] [INFO] Starting gunicorn 20.0.4
2020-05-17T22:05:04.620Z [2020-05-17 22:05:04 +0000] [6] [INFO] Listening at: http://0.0.0.0:8000 (6)
2020-05-17T22:05:04.620Z [2020-05-17 22:05:04 +0000] [6] [INFO] Using worker: sync
2020-05-17T22:05:04.622Z [2020-05-17 22:05:04 +0000] [8] [INFO] Booting worker with pid: 8
2020-05-17T22:05:04.846Z Creating new process
2020-05-17T22:05:05.872Z Exception in thread Thread-2:
2020-05-17T22:05:05.872Z Traceback (most recent call last):
2020-05-17T22:05:05.872Z   File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
2020-05-17T22:05:05.872Z     self.run()
2020-05-17T22:05:05.872Z   File "/usr/lib/python3.8/threading.py", line 870, in run
2020-05-17T22:05:05.873Z     self._target(*self._args, **self._kwargs)
2020-05-17T22:05:05.873Z   File "/opt/flask_app/cron.py", line 60, in adjust_autoscaling_loop
2020-05-17T22:05:05.873Z     autoscaling = boto3.client('autoscaling', region_name=app.config["AWS_REGION"])
2020-05-17T22:05:05.873Z   File "/usr/local/lib/python3.8/dist-packages/boto3/__init__.py", line 91, in client
2020-05-17T22:05:05.873Z     return _get_default_session().client(*args, **kwargs)
2020-05-17T22:05:05.873Z   File "/usr/local/lib/python3.8/dist-packages/boto3/session.py", line 258, in client
2020-05-17T22:05:05.873Z     return self._session.create_client(
2020-05-17T22:05:05.873Z   File "/usr/local/lib/python3.8/dist-packages/botocore/session.py", line 824, in create_client
2020-05-17T22:05:05.873Z     endpoint_resolver = self._get_internal_component('endpoint_resolver')
2020-05-17T22:05:05.873Z   File "/usr/local/lib/python3.8/dist-packages/botocore/session.py", line 697, in _get_internal_component
2020-05-17T22:05:05.873Z     return self._internal_components.get_component(name)
2020-05-17T22:05:05.873Z   File "/usr/local/lib/python3.8/dist-packages/botocore/session.py", line 923, in get_component
2020-05-17T22:05:05.878Z     del self._deferred[name]
2020-05-17T22:05:05.878Z KeyError: 'endpoint_resolver'
2020-05-17T22:05:06.219Z clear_terminated_jobs() finished. Running again in 600 seconds

@ababaian (Owner, Author)

Perhaps related, but when trying to load ~96K accessions into the scheduler I get the following error (it happened on 3 attempts). Reducing input to 20K per batch now.


2020-05-26T22:16:57.119Z Creating new process
2020-05-26T22:16:58.403Z clear_terminated_jobs() finished. Running again in 600 seconds
2020-05-26T22:16:58.945Z ajust_autoscaling() finished. Running again in 300 seconds
2020-05-26T22:21:59.300Z ajust_autoscaling() finished. Running again in 300 seconds
2020-05-26T22:22:44.275Z [2020-05-26 22:22:44 +0000] [6] [CRITICAL] WORKER TIMEOUT (pid:8)
2020-05-26T22:22:44.276Z [2020-05-26 22:22:44 +0000] [8] [INFO] Worker exiting (pid: 8)
2020-05-26T22:22:44.340Z [2020-05-26 22:22:44 +0000] [11] [INFO] Booting worker with pid: 11
2020-05-26T22:22:44.552Z Creating new process
2020-05-26T22:22:44.563Z Exception in thread Thread-2:
2020-05-26T22:22:44.563Z Traceback (most recent call last):
2020-05-26T22:22:44.563Z   File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
2020-05-26T22:22:44.563Z     self.run()
2020-05-26T22:22:44.563Z   File "/usr/lib/python3.8/threading.py", line 870, in run
2020-05-26T22:22:44.564Z     self._target(*self._args, **self._kwargs)
2020-05-26T22:22:44.564Z   File "/opt/flask_app/cron.py", line 60, in adjust_autoscaling_loop
2020-05-26T22:22:44.565Z     autoscaling = boto3.client('autoscaling', region_name=app.config["AWS_REGION"])
2020-05-26T22:22:44.565Z   File "/usr/local/lib/python3.8/dist-packages/boto3/__init__.py", line 91, in client
2020-05-26T22:22:44.566Z     return _get_default_session().client(*args, **kwargs)
2020-05-26T22:22:44.566Z   File "/usr/local/lib/python3.8/dist-packages/boto3/session.py", line 258, in client
2020-05-26T22:22:44.567Z     return self._session.create_client(
2020-05-26T22:22:44.567Z   File "/usr/local/lib/python3.8/dist-packages/botocore/session.py", line 823, in create_client
2020-05-26T22:22:44.568Z     credentials = self.get_credentials()
2020-05-26T22:22:44.568Z   File "/usr/local/lib/python3.8/dist-packages/botocore/session.py", line 427, in get_credentials
2020-05-26T22:22:44.568Z     self._credentials = self._components.get_component(
2020-05-26T22:22:44.568Z   File "/usr/local/lib/python3.8/dist-packages/botocore/session.py", line 923, in get_component
2020-05-26T22:22:44.569Z     del self._deferred[name]
2020-05-26T22:22:44.569Z KeyError: 'credential_provider'
2020-05-26T22:22:45.837Z clear_terminated_jobs() finished. Running again in 600 seconds

@ababaian (Owner, Author)

So while I was booting today, it appears that the credentials error arises when you run the create_tunnel script too quickly, while the instance is still booting up. There appears to be a race condition of some sort; if you just give everything time, it boots up normally.

The issue with adding 90K accessions at once is semi-resolved: I just add accessions in 20K batches. So far I have 60K loaded into the scheduler and nothing has blown up.
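If the race is the instance metadata service not yet serving the IAM role credentials at boot, a small retry loop before the scheduler threads start could paper over it. A hedged sketch; `wait_until_ready` and its parameters are illustrative, not part of Serratus:

```python
import time

def wait_until_ready(check, attempts=10, delay=5):
    """Call `check` until it succeeds or attempts run out, sleeping
    `delay` seconds between tries. On the scheduler, `check` might wrap
    boto3.Session().get_credentials() to wait out the window where the
    instance metadata service has no role credentials yet."""
    for i in range(attempts):
        try:
            return check()
        except Exception:
            if i == attempts - 1:
                raise
            time.sleep(delay)
```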

@ababaian (Owner, Author)

I don't remember which commit closed this, but it is no longer an issue.
