
_Py_CheckRecursiveCall during _dequeue_batch in Celery after using S3fs #186

Closed

JonnyWaffles opened this issue Mar 9, 2023 · 6 comments

@JonnyWaffles
Hi friends, I am trying to debug a very difficult problem that I cannot recreate locally. It only occurs when the code is executed inside an Airflow Celery worker running as a service in an Amazon Elastic Container Service task. Even stranger, it only occurs after I run a test case that leverages fsspec's s3fs. If I remove that (successful) test, I don't see the issue below when logging the result to the task's CloudWatch stream. I still haven't isolated why, but I will update when I know more. The application logs via Watchtower, but the underlying EC2 system logs are available. You can see from the EC2 log output below that the test cases finish successfully, but then my Celery worker is killed while trying to log the results to the app's CloudWatch.

[2023-03-09 22:54:34,778] {logging_mixin.py:137} INFO - ================= 76 passed, 19 skipped, 20 warnings in 50.31s =================
Fatal Python error: _Py_CheckRecursiveCall: Cannot recover from stack overflow.
Python runtime state: initialized
Thread 0x00007f7e20632700 (most recent call first):
  File "/usr/lib/python3.9/threading.py", line 316 in wait
  File "/usr/lib/python3.9/queue.py", line 180 in get
  File "/app/.local/lib/python3.9/site-packages/watchtower/__init__.py", line 398 in _dequeue_batch
  File "/usr/lib/python3.9/threading.py", line 892 in run
  File "/usr/lib/python3.9/threading.py", line 954 in _bootstrap_inner
  File "/usr/lib/python3.9/threading.py", line 912 in _bootstrap
Thread 0x00007f7e14dff700 (most recent call first):
  File "/usr/lib/python3.9/threading.py", line 316 in wait
  File "/usr/lib/python3.9/queue.py", line 180 in get
  File "/app/.local/lib/python3.9/site-packages/watchtower/__init__.py", line 398 in _dequeue_batch
  File "/usr/lib/python3.9/threading.py", line 892 in run
  File "/usr/lib/python3.9/threading.py", line 954 in _bootstrap_inner
  File "/usr/lib/python3.9/threading.py", line 912 in _bootstrap
Current thread 0x00007f7e390a7740 (most recent call first):
  File "/app/.local/lib/python3.9/site-packages/watchtower/__init__.py", line 281 in _get_stream_name
  File "/app/.local/lib/python3.9/site-packages/watchtower/__init__.py", line 353 in emit
  File "/app/.local/lib/python3.9/site-packages/airflow/utils/log/file_task_handler.py", line 80 in emit
  File "/usr/lib/python3.9/logging/__init__.py", line 952 in handle
  File "/usr/lib/python3.9/logging/__init__.py", line 1661 in callHandlers
  File "/usr/lib/python3.9/logging/__init__.py", line 1599 in handle
  File "/usr/lib/python3.9/logging/__init__.py", line 1589 in _log
  File "/usr/lib/python3.9/logging/__init__.py", line 1512 in log
  File "/app/.local/lib/python3.9/site-packages/airflow/utils/log/logging_mixin.py", line 137 in _propagate_log
  File "/app/.local/lib/python3.9/site-packages/airflow/utils/log/logging_mixin.py", line 156 in flush
  File "/app/.local/lib/python3.9/site-packages/airflow/utils/log/logging_mixin.py", line 149 in write
  File "/usr/lib/python3.9/logging/__init__.py", line 1004 in handleError
  File "/usr/lib/python3.9/logging/__init__.py", line 1091 in emit
  File "/usr/lib/python3.9/logging/__init__.py", line 952 in handle
  File "/usr/lib/python3.9/logging/__init__.py", line 1661 in callHandlers
  File "/usr/lib/python3.9/logging/__init__.py", line 1599 in handle
  File "/usr/lib/python3.9/logging/__init__.py", line 1589 in _log
  File "/usr/lib/python3.9/logging/__init__.py", line 1512 in log
  File "/app/.local/lib/python3.9/site-packages/airflow/utils/log/logging_mixin.py", line 137 in _propagate_log
  File "/app/.local/lib/python3.9/site-packages/airflow/utils/log/logging_mixin.py", line 156 in flush
  File "/app/.local/lib/python3.9/site-packages/airflow/utils/log/logging_mixin.py", line 149 in write
  File "/usr/lib/python3.9/logging/__init__.py", line 1004 in handleError
  File "/usr/lib/python3.9/logging/__init__.py", line 1091 in emit
  File "/usr/lib/python3.9/logging/__init__.py", line 952 in handle
  File "/usr/lib/python3.9/logging/__init__.py", line 1661 in callHandlers
  File "/usr/lib/python3.9/logging/__init__.py", line 1599 in handle
  File "/usr/lib/python3.9/logging/__init__.py", line 1589 in _log
  File "/usr/lib/python3.9/logging/__init__.py", line 1512 in log
  File "/app/.local/lib/python3.9/site-packages/airflow/utils/log/logging_mixin.py", line 137 in _propagate_log
  File "/app/.local/lib/python3.9/site-packages/airflow/utils/log/logging_mixin.py", line 156 in flush
  File "/app/.local/lib/python3.9/site-packages/airflow/utils/log/logging_mixin.py", line 137 in _propagate_log
--
...
[2023-03-09 22:54:35,530: INFO/ForkPoolWorker-14] Task airflow.executors.celery_executor.execute_command[932a9892-ca3b-47c3-baf1-e1e82dc4cf6f] succeeded in 53.58347679200233s: None

If I check the target log stream (the one my Airflow task logger is trying to flush to in the above system log), I can only see:

Dependencies all met for <TaskInstance: platform.integration_test.run_pytest jenkins__2023-03-09T22:31:07.704454+00:00 [queued]>
--
Dependencies all met for <TaskInstance: platform.integration_test.run_pytest jenkins__2023-03-09T22:31:07.704454+00:00 [queued]>
--------------------------------------------------------------------------------
Starting attempt 1 of 1
--------------------------------------------------------------------------------
Executing <Task(_PythonDecoratedOperator): run_pytest> on 2023-03-09 22:31:07.704454+00:00
Started process 3800 to run task
Task exited with return code Negsignal.SIGABRT
0 downstream tasks scheduled from follow-on schedule check

It's very strange. The reason we rely on both the service logs and the task logs is that the service logs detail all the tasks a particular worker instance is executing, while the task logs are saved under a URL key like [dag_id=platform.integration_test/run_id=jenkins__2023-03-09T22_31_07.704454+00_00/task_id=run_pytest/attempt=1.log], so they show up nicely isolated in the UI.

I'll continue to investigate, but do you have any idea why a test using s3fs would cause the logging to crash my worker well after the test itself completed?

@JonnyWaffles JonnyWaffles changed the title _Py_CheckRecursiveCall during _dequeue_batch _Py_CheckRecursiveCall during _dequeue_batch in Celery after using S3fs Mar 9, 2023
@JonnyWaffles
Author

I should add that I also found this note about multiprocessing in the s3fs docs; does this help?

When using Python’s multiprocessing, the start method must be set to either spawn or forkserver. fork is not safe to use because of the open sockets and async thread used by s3fs, and may lead to hard-to-find bugs and occasional deadlocks. Read more about the available start methods.
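For reference, this is roughly what that guidance looks like in code; the worker function and bucket name below are illustrative, not taken from our project:

```python
import multiprocessing as mp

def worker():
    # Create the filesystem inside the freshly spawned child, not in the parent.
    import s3fs
    fs = s3fs.S3FileSystem()
    print(fs.ls("my-example-bucket"))  # hypothetical bucket name

if __name__ == "__main__":
    # "fork" is unsafe with s3fs because of its open sockets and async
    # event-loop thread; the docs recommend "spawn" or "forkserver".
    mp.set_start_method("spawn")
    p = mp.Process(target=worker)
    p.start()
    p.join()
```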

@JonnyWaffles
Author

I realized that running the app locally via Docker Compose wasn't identical to the remote setup because I had never configured the local logger to log to CloudWatch. I configured my Compose setup to use the same remote CloudWatch log settings as the deployed app, but I still cannot recreate this issue locally. It only occurs when my image is deployed! Locally, the image works, I can see my logs in CloudWatch, and everything succeeds. This really is a strange pickle.

@kislyuk
Owner

kislyuk commented Mar 10, 2023

@JonnyWaffles from the traceback, it looks like you're dealing with a situation that produces more log messages every time the log handler's flush function is called. We have a special case in the handler that prevents botocore and boto3 log messages from causing this, but you may have other loggers causing it (possibly s3fs). You could try raising the log level, or disconnecting the watchtower handler altogether before shutting down.
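For anyone hitting the same thing, a rough sketch of those two mitigations (assuming the watchtower handler is attached to the root logger; in an Airflow deployment the handler wiring may differ, so treat this as illustrative):

```python
import logging
import watchtower

root = logging.getLogger()

# Mitigation 1: raise the level of the logger suspected of emitting
# during flush (s3fs is just a guess here).
logging.getLogger("s3fs").setLevel(logging.WARNING)

# Mitigation 2: detach and close the watchtower handler before shutdown,
# so that a flush can no longer trigger further emits.
for handler in list(root.handlers):
    if isinstance(handler, watchtower.CloudWatchLogHandler):
        root.removeHandler(handler)
        handler.close()
```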

@JonnyWaffles
Author

Thanks for the feedback @kislyuk. From what I can tell, our older boto3-compatible version of s3fs does not log by default (https://github.com/fsspec/s3fs/blob/a396dc4b6f56f754de2ac55043d85fb8f9006b6e/s3fs/core.py#L24-L30), so I'm not sure s3fs is the culprit yet. Still investigating, but the information above points me in the right direction. Thanks!

@JonnyWaffles
Author

Hi @kislyuk we figured out the problem!

It only occurs when you execute dag.test() on a DAG with dynamically mapped tasks inside a Celery worker and use CloudWatch logging.

In order to integrate our test kit with our Jenkins CI/CD, we run pytest inside a DAG and send the result over XCom back to Jenkins (a rough sketch of this setup is at the end of this comment). This means all of our tests run as a single Python task. After the test kit completes and watchtower tries to flush, it hits infinite recursion and dies with Fatal Python error: _Py_CheckRecursiveCall: Cannot recover from stack overflow. Looking at the stack trace, each call to dequeue

  File "/app/.local/lib/python3.9/site-packages/watchtower/__init__.py", line 398 in _dequeue_batch

ends up with a recursive call to

  File "/app/.local/lib/python3.9/site-packages/airflow/utils/log/logging_mixin.py", line 137 in _propagate_log

For now, we can test the dynamically mapped DAGs locally without CloudWatch, but long-term we'll need to find a workaround. I'll post the same in Airflow, but I suspect our weird use case of running pytest as a DAG may just not be recommended, and I wouldn't blame them!
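For context, a stripped-down sketch of the shape that triggers this for us (DAG ID, task names, and values are illustrative, not our real code; assumes a recent Airflow 2.x for dag.test() and dynamic task mapping):

```python
from pendulum import datetime
from airflow.decorators import dag, task

@dag(schedule=None, start_date=datetime(2023, 1, 1), catchup=False)
def mapped_example():
    @task
    def double(x: int) -> int:
        return x * 2

    double.expand(x=[1, 2, 3])  # dynamically mapped task

example_dag = mapped_example()

def test_mapped_dag():
    # This pytest test itself runs inside an Airflow task on a Celery
    # worker; with the CloudWatch task handler attached, the flush after
    # the run is where we see the recursive _propagate_log calls.
    example_dag.test()
```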

@JonnyWaffles
Author

We can close this out. It occurs even without watchtower. I opened an issue with Airflow. Cheers.
