
_Py_CheckRecursiveCall during _dequeue_batch in Celery after using S3fs #186

Closed

JonnyWaffles opened this issue Mar 9, 2023 · 6 comments

@JonnyWaffles
Hi friends, I am trying to debug a very difficult problem that I cannot recreate locally. It only occurs when the code is executed inside an Airflow Celery worker running as a service in an Amazon Elastic Container Service task. Even stranger, it only occurs after I run a test case that leverages fsspec's s3fs. If I remove that (successful) test, I don't see the issue below when logging the result to the task's CloudWatch stream. I still haven't isolated why, but I will update when I know more. The application logs via Watchtower, but the underlying EC2 system logs are available. You can see from the EC2 log output below that the test cases finish successfully, but then my Celery worker is killed while trying to log the results to the app's CloudWatch.

[2023-03-09 22:54:34,778] {logging_mixin.py:137} INFO - ================= 76 passed, 19 skipped, 20 warnings in 50.31s =================
Fatal Python error: _Py_CheckRecursiveCall: Cannot recover from stack overflow.
Python runtime state: initialized
Thread 0x00007f7e20632700 (most recent call first):
  File "/usr/lib/python3.9/threading.py", line 316 in wait
  File "/usr/lib/python3.9/queue.py", line 180 in get
  File "/app/.local/lib/python3.9/site-packages/watchtower/__init__.py", line 398 in _dequeue_batch
  File "/usr/lib/python3.9/threading.py", line 892 in run
  File "/usr/lib/python3.9/threading.py", line 954 in _bootstrap_inner
  File "/usr/lib/python3.9/threading.py", line 912 in _bootstrap
Thread 0x00007f7e14dff700 (most recent call first):
  File "/usr/lib/python3.9/threading.py", line 316 in wait
  File "/usr/lib/python3.9/queue.py", line 180 in get
  File "/app/.local/lib/python3.9/site-packages/watchtower/__init__.py", line 398 in _dequeue_batch
  File "/usr/lib/python3.9/threading.py", line 892 in run
  File "/usr/lib/python3.9/threading.py", line 954 in _bootstrap_inner
  File "/usr/lib/python3.9/threading.py", line 912 in _bootstrap
Current thread 0x00007f7e390a7740 (most recent call first):
  File "/app/.local/lib/python3.9/site-packages/watchtower/__init__.py", line 281 in _get_stream_name
  File "/app/.local/lib/python3.9/site-packages/watchtower/__init__.py", line 353 in emit
  File "/app/.local/lib/python3.9/site-packages/airflow/utils/log/file_task_handler.py", line 80 in emit
  File "/usr/lib/python3.9/logging/__init__.py", line 952 in handle
  File "/usr/lib/python3.9/logging/__init__.py", line 1661 in callHandlers
  File "/usr/lib/python3.9/logging/__init__.py", line 1599 in handle
  File "/usr/lib/python3.9/logging/__init__.py", line 1589 in _log
  File "/usr/lib/python3.9/logging/__init__.py", line 1512 in log
  File "/app/.local/lib/python3.9/site-packages/airflow/utils/log/logging_mixin.py", line 137 in _propagate_log
  File "/app/.local/lib/python3.9/site-packages/airflow/utils/log/logging_mixin.py", line 156 in flush
  File "/app/.local/lib/python3.9/site-packages/airflow/utils/log/logging_mixin.py", line 149 in write
  File "/usr/lib/python3.9/logging/__init__.py", line 1004 in handleError
  File "/usr/lib/python3.9/logging/__init__.py", line 1091 in emit
  File "/usr/lib/python3.9/logging/__init__.py", line 952 in handle
  File "/usr/lib/python3.9/logging/__init__.py", line 1661 in callHandlers
  File "/usr/lib/python3.9/logging/__init__.py", line 1599 in handle
  File "/usr/lib/python3.9/logging/__init__.py", line 1589 in _log
  File "/usr/lib/python3.9/logging/__init__.py", line 1512 in log
  File "/app/.local/lib/python3.9/site-packages/airflow/utils/log/logging_mixin.py", line 137 in _propagate_log
  File "/app/.local/lib/python3.9/site-packages/airflow/utils/log/logging_mixin.py", line 156 in flush
  File "/app/.local/lib/python3.9/site-packages/airflow/utils/log/logging_mixin.py", line 149 in write
  File "/usr/lib/python3.9/logging/__init__.py", line 1004 in handleError
  File "/usr/lib/python3.9/logging/__init__.py", line 1091 in emit
  File "/usr/lib/python3.9/logging/__init__.py", line 952 in handle
  File "/usr/lib/python3.9/logging/__init__.py", line 1661 in callHandlers
  File "/usr/lib/python3.9/logging/__init__.py", line 1599 in handle
  File "/usr/lib/python3.9/logging/__init__.py", line 1589 in _log
  File "/usr/lib/python3.9/logging/__init__.py", line 1512 in log
  File "/app/.local/lib/python3.9/site-packages/airflow/utils/log/logging_mixin.py", line 137 in _propagate_log
  File "/app/.local/lib/python3.9/site-packages/airflow/utils/log/logging_mixin.py", line 156 in flush
  File "/app/.local/lib/python3.9/site-packages/airflow/utils/log/logging_mixin.py", line 137 in _propagate_log
--
...
[2023-03-09 22:54:35,530: INFO/ForkPoolWorker-14] Task airflow.executors.celery_executor.execute_command[932a9892-ca3b-47c3-baf1-e1e82dc4cf6f] succeeded in 53.58347679200233s: None

If I check the target log stream (the one my Airflow task logger is trying to flush to in the above system log), I can only see:

Dependencies all met for <TaskInstance: platform.integration_test.run_pytest jenkins__2023-03-09T22:31:07.704454+00:00 [queued]>
--
Dependencies all met for <TaskInstance: platform.integration_test.run_pytest jenkins__2023-03-09T22:31:07.704454+00:00 [queued]>
--------------------------------------------------------------------------------
Starting attempt 1 of 1
--------------------------------------------------------------------------------
Executing <Task(_PythonDecoratedOperator): run_pytest> on 2023-03-09 22:31:07.704454+00:00
Started process 3800 to run task
Task exited with return code Negsignal.SIGABRT
0 downstream tasks scheduled from follow-on schedule check

It's very strange. The reason we rely on both the service logs and the task logs is that the service logs detail all the tasks a particular worker instance is executing, while the task logs are saved under a URL key like [dag_id=platform.integration_test/run_id=jenkins__2023-03-09T22_31_07.704454+00_00/task_id=run_pytest/attempt=1.log], so they show up nicely isolated in the UI.

I'll continue to investigate, but do you have any idea why a test using s3fs would cause the logging to crash my worker well after the test itself completed?

@JonnyWaffles JonnyWaffles changed the title _Py_CheckRecursiveCall during _dequeue_batch _Py_CheckRecursiveCall during _dequeue_batch in Celery after using S3fs Mar 9, 2023
@JonnyWaffles
Author

I should add that I also found this note about multiprocessing in the s3fs docs; does this help?

When using Python’s multiprocessing, the start method must be set to either spawn or forkserver. fork is not safe to use because of the open sockets and async thread used by s3fs, and may lead to hard-to-find bugs and occasional deadlocks. Read more about the available start methods.
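For reference, this is roughly what that guidance looks like in code; the worker function and bucket name below are illustrative, not taken from our project:

```python
import multiprocessing as mp

def worker():
    # Create the filesystem inside the freshly spawned child, not in the parent.
    import s3fs
    fs = s3fs.S3FileSystem()
    print(fs.ls("my-example-bucket"))  # hypothetical bucket name

if __name__ == "__main__":
    # "fork" is unsafe with s3fs because of its open sockets and async
    # event-loop thread; the docs recommend "spawn" or "forkserver".
    mp.set_start_method("spawn")
    p = mp.Process(target=worker)
    p.start()
    p.join()
```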

@JonnyWaffles
Author

I realized that running the app locally via Docker Compose wasn't identical to the remote setup because I had never configured the local logger to log to CloudWatch. I configured my Compose setup to use the same remote CloudWatch log settings as the deployed app, but I still cannot recreate this issue locally. It only occurs when my image is deployed! Locally, the image works, I can see my logs in CloudWatch, and everything succeeds. This really is a strange pickle.

@kislyuk
Owner

kislyuk commented Mar 10, 2023

@JonnyWaffles from the traceback, it looks like you're dealing with a situation that produces more log messages every time the log handler's flush function is called. We have a special case in the handler that prevents botocore and boto3 log messages from causing this, but you may have other loggers causing it (possibly s3fs). You could try raising the log level, or disconnecting the watchtower handler altogether before shutting down.
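For anyone hitting the same thing, a rough sketch of those two mitigations (assuming the watchtower handler is attached to the root logger; in an Airflow deployment the handler wiring may differ, so treat this as illustrative):

```python
import logging
import watchtower

root = logging.getLogger()

# Mitigation 1: raise the level of the logger suspected of emitting
# during flush (s3fs is just a guess here).
logging.getLogger("s3fs").setLevel(logging.WARNING)

# Mitigation 2: detach and close the watchtower handler before shutdown,
# so that a flush can no longer trigger further emits.
for handler in list(root.handlers):
    if isinstance(handler, watchtower.CloudWatchLogHandler):
        root.removeHandler(handler)
        handler.close()
```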

@JonnyWaffles
Author

Thanks for the feedback @kislyuk. From what I can tell, our older boto3-compatible version of s3fs does not log by default (https://github.com/fsspec/s3fs/blob/a396dc4b6f56f754de2ac55043d85fb8f9006b6e/s3fs/core.py#L24-L30), so I'm not sure s3fs is the culprit yet. Still investigating, but the information above points me in the right direction. Thanks!

@JonnyWaffles
Author

Hi @kislyuk we figured out the problem!

It only occurs when you execute dag.test() on a DAG with dynamically mapped tasks inside a Celery worker and use CloudWatch logging.

In order to integrate our test kit with our Jenkins CI/CD, we run pytest inside a DAG and send the result over XCom back to Jenkins (a rough sketch of this setup is at the end of this comment). This means all of our tests run as a single Python task. After the test kit completes and watchtower tries to flush, it hits infinite recursion and dies with Fatal Python error: _Py_CheckRecursiveCall: Cannot recover from stack overflow. Looking at the stack trace, each call to dequeue

  File "/app/.local/lib/python3.9/site-packages/watchtower/__init__.py", line 398 in _dequeue_batch

ends up with a recursive call to

  File "/app/.local/lib/python3.9/site-packages/airflow/utils/log/logging_mixin.py", line 137 in _propagate_log

For now, we can test the dynamically mapped DAGs locally without CloudWatch, but long-term we'll need to find a workaround. I'll post the same in Airflow, but I suspect our weird use case of running pytest as a DAG may just not be recommended, and I wouldn't blame them!
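For context, a stripped-down sketch of the shape that triggers this for us (DAG ID, task names, and values are illustrative, not our real code; assumes a recent Airflow 2.x for dag.test() and dynamic task mapping):

```python
from pendulum import datetime
from airflow.decorators import dag, task

@dag(schedule=None, start_date=datetime(2023, 1, 1), catchup=False)
def mapped_example():
    @task
    def double(x: int) -> int:
        return x * 2

    double.expand(x=[1, 2, 3])  # dynamically mapped task

example_dag = mapped_example()

def test_mapped_dag():
    # This pytest test itself runs inside an Airflow task on a Celery
    # worker; with the CloudWatch task handler attached, the flush after
    # the run is where we see the recursive _propagate_log calls.
    example_dag.test()
```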

@JonnyWaffles
Author

We can close this out. It occurs even without watchtower. I opened an issue with Airflow. Cheers.
