_Py_CheckRecursiveCall during _dequeue_batch in Celery after using S3fs #186
Comments
I should add that I also found this note about multiprocessing in the s3fs docs; does this help?
I realized that running the app locally via Docker Compose wasn't identical to the remote environment because I never configured the local logger to log to CloudWatch. I configured my Compose setup to use the same remote CloudWatch log settings as the deployed app, but even then I cannot recreate this issue locally. It only occurs when my image is deployed! Locally, the image works, I can see my logs in CloudWatch, and everything succeeds. This really is a strange pickle.
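As a quick way to confirm that the local Compose environment really picks up the same remote-logging settings as the deployed image, something like the following can be run inside the container. This is a hedged sketch assuming Airflow's standard [logging] configuration options; no project-specific names are used.

```python
# Print the effective Airflow remote-logging configuration inside the
# container, to compare the Compose environment against the deployed image.
from airflow.configuration import conf

print(conf.getboolean("logging", "remote_logging"))
print(conf.get("logging", "remote_base_log_folder"))  # e.g. a cloudwatch:// log group ARN
print(conf.get("logging", "remote_log_conn_id"))
```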
@JonnyWaffles From the traceback, it looks like you're dealing with a situation that produces more logs any time the log handler's flush function is called. We have a special case in the handler that prevents botocore and boto3 log messages from causing this, but you may have other loggers causing it (possibly s3fs). You could try raising the log level of the offending logger, or disconnecting the watchtower handler altogether before shutting down.
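A minimal sketch of both suggestions, assuming a standard stdlib logging setup; the logger name "some.chatty.logger" is a placeholder for whichever logger emits records during flush (s3fs/fsspec were the suspects here):

```python
import logging

import watchtower

# Option 1: raise the level of whatever logger emits records during flush.
logging.getLogger("some.chatty.logger").setLevel(logging.WARNING)

# Option 2: detach and close the watchtower handler before shutdown, so the
# final flush can't be re-entered by new log records.
root = logging.getLogger()
for handler in list(root.handlers):
    if isinstance(handler, watchtower.CloudWatchLogHandler):
        root.removeHandler(handler)
        handler.close()  # flushes pending messages and shuts the handler down
```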
Thanks for the feedback, @kislyuk. From what I can tell, our older boto3-compatible version of s3fs is configured not to log by default (https://github.com/fsspec/s3fs/blob/a396dc4b6f56f754de2ac55043d85fb8f9006b6e/s3fs/core.py#L24-L30), so I'm not sure s3fs is the culprit yet. Still investigating, but the information above points me in the right direction. Thanks!
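To double-check that locally, a small probe like the one below can show which level the "s3fs" logger is actually running at and silence it while debugging. This is a generic stdlib-logging sketch; only the logger name "s3fs" is taken from the linked core.py.

```python
import logging

s3fs_logger = logging.getLogger("s3fs")

# Which level is actually in effect (inherited from ancestor loggers if NOTSET)?
print(logging.getLevelName(s3fs_logger.getEffectiveLevel()))

# To rule s3fs out entirely while debugging, silence it and stop propagation
# to the root logger (and therefore to any handler attached there).
s3fs_logger.setLevel(logging.ERROR)
s3fs_logger.propagate = False
```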
Hi @kislyuk, we figured out the problem! It only occurs when you execute dag.test() on a DAG with dynamically mapped tasks on a Celery worker and use CloudWatch logging. To integrate our test kit with our Jenkins CI/CD, we run pytest inside a DAG and send the result over XCom back to Jenkins, which means all of our tests run as a single Python task. After the test kit completes and watchtower tries to flush, it encounters an infinite recursion and ends up with a recursive call into the handler's flush (the _Py_CheckRecursiveCall during _dequeue_batch from the title).

For now, we can test the dynamically mapped DAGs locally without CloudWatch, but long term we'll need to find a workaround. I'll post the same in Airflow, but I suspect our weird use case of running pytest as a DAG may simply not be recommended, and I wouldn't blame them!
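A rough sketch of the setup described above, with illustrative names rather than the actual project code, assuming the Airflow 2.4+ TaskFlow API (dag.test() requires 2.5+): pytest runs as a single task whose exit code is returned, which Airflow pushes to XCom, and a dynamically mapped task is included because the recursion only appeared with dynamic task mapping.

```python
import pytest
from airflow.decorators import dag, task
from pendulum import datetime


@dag(schedule=None, start_date=datetime(2023, 1, 1), catchup=False)
def integration_test_sketch():
    @task
    def run_pytest() -> int:
        # The return value becomes an XCom that Jenkins can read back.
        return int(pytest.main(["tests/", "-q"]))

    @task
    def mapped_step(n: int) -> int:
        return n * 2

    mapped_step.expand(n=[1, 2, 3])  # dynamic task mapping
    run_pytest()


dag_object = integration_test_sketch()

if __name__ == "__main__":
    # dag.test() runs the whole DAG in-process; this is the path where the
    # recursive flush showed up when CloudWatch logging was enabled.
    dag_object.test()
```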
We can close this out. It occurs even without watchtower. I opened an issue with Airflow. Cheers!
Hi friends, I am trying to debug a very difficult problem because I cannot recreate the issue locally. It only occurs when executed inside an Airflow Celery worker running as a service in an Amazon Elastic Container Service (ECS) task. Even stranger, it only occurs after I run a test case that uses fsspec s3fs: if I remove that (successful) test, I don't see the issue below when logging the result to the task's CloudWatch stream. I still haven't isolated why this is, but I will update when I know more. The application logs via watchtower, but the underlying EC2 system logs are available. You can see from the EC2 log output below that the test cases finish successfully, but then my Celery worker is killed while trying to log the results to the app's CloudWatch.
If I check the target log stream (the one my Airflow task logger is trying to flush to in the system log above), I can only see
It's very strange. The reason we rely on both the service logs and the task logs is that the service logs detail all the tasks a particular worker instance is executing, while the task logs are saved under a URL key like
[dag_id=platform.integration_test/run_id=jenkins__2023-03-09T22_31_07.704454+00_00/task_id=run_pytest/attempt=1.log]
so they show up nicely isolated in the UI. I'll continue to investigate, but do you have any idea why a test using s3fs would cause the logging to crash my worker well after the test itself completed?
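For context, a minimal reproduction-style sketch of the logging setup being described: attach a watchtower handler and trigger the kind of flush that happens at worker shutdown. The log group and stream names are placeholders, and the keyword argument names follow recent watchtower releases (older versions use log_group/stream_name); this is not the actual deployment configuration.

```python
import logging

import watchtower

logger = logging.getLogger("airflow.task")
handler = watchtower.CloudWatchLogHandler(
    log_group_name="my-app-log-group",                # placeholder
    log_stream_name="dag_id=.../task_id=.../attempt=1.log",  # placeholder key, like the one above
)
logger.addHandler(handler)

logger.info("test case finished")
handler.flush()  # in the failing run, this flush is where _dequeue_batch recursed
handler.close()
```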