Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug - Deadlock when using multithreaded & multiprocessing Environment #231

Closed
shachakz opened this issue Mar 30, 2020 · 13 comments
Closed
Labels
bug Something isn't working

Comments

@shachakz
Copy link

shachakz commented Mar 30, 2020

Hi,

Python version: 3.6.8
loguru version: 0.4.1
OS: Linux
Dev environment: Terminal

Bug Description: Creating a new process during logging from another thread will cause deadlock inside the new process whenever it calls the logger.
I assume that it happens because handler's _lock is already taken whenever the process is created, resulting that _lock will be taken forever (from the new process perspective) [see 1st comment in here].

Reproduce code:

import multiprocessing as mp
import sys
import threading
from random import uniform
from time import sleep
from loguru import logger

logger.remove()

logger.add(
    sys.stdout,
    colorize=False,
    enqueue=True,  # This bug will reproduce also with enqueue=False
    serialize=False,
    level="DEBUG",
)


def sub_worker():
    while True:
        logger.trace(f"This is the inner thread")
        sleep(0.01)

def worker():
    sleep(uniform(0, 0.1))
    logger.debug(f"Printing some logs")

if __name__ == "__main__":
    threading.Thread(target=sub_worker).start()

    while True:
        w = mp.Process(target=worker, )
        w.start()
        w.join()

        logger.debug(f"Loop ok!")
@Delgan
Copy link
Owner

Delgan commented Mar 30, 2020

Yeah, it's a problem that's been hanging in my backlog for a while. This is the same problem as with the standard logging library. I didn't have the courage to correct this because I found the workaround inelegant. But I guess there's not much choice. See relevant commits:

Anyway, thanks a lot for the report and the fully workable reproducible example! I will implement a fix for the next version.

@Delgan
Copy link
Owner

Delgan commented May 2, 2020

What a journey! There was not one, not two, but three bugs here! 😄

1. Deadlock while mixing threading with multiprocessing (1122e82)

This was the easy one. It happens if a process is created while the internal lock of one handler (used for thread-safety) is already acquired. This is only possible if the sink is used by another thread while the child process is created. The child process will inherits the locked lock (due to fork() copying the entire process state), and hence deadlock as soon as a message tries to be logged. This problem is extensively discussed in the issues related to the commits I linked above, it's very instructive.

It can happen even if enqueue=False and is easily reproducible by using a "slow" sink:

import multiprocessing
import threading
import time
import sys
from loguru import logger

def slow_sink(msg):
    time.sleep(1)
    sys.stderr.write(msg)

def worker():
    logger.info("Working...")

if __name__ == "__main__":
    logger.remove()
    logger.add(slow_sink, enqueue=False)

    thread = threading.Thread(target=worker)
    thread.start()

    process = multiprocessing.Process(target=worker)
    process.start()

    thread.join()
    process.join()

The fix for this relies on the os.register_at_fork() function. We simply need to acquire all locks in the main process before a fork occurs, and release them in both main and child processes once the fork is finished.

Contrary to the logging library, loguru does not use re-entrant locks and does not expose API to acquire them. This eases the process of protecting handlers during fork and avoid others possibles kind of deadlocks that the standard library faced.

2. Deadlock while using enqueue=True with multiprocessing (8c42511)

So, after I fixed the first mentioned issue, I thought I can call it a day and move on, but there were still more troubles to come. 🙂

While using enqueue=True, a thread is internally started to continuously consume log messages put in the queue. Unfortunately, threading and multiprocessing don't interoperate very well. There was another possible deadlock, due to the fact that sys.stderr (and sys.stdout) itself internally uses a lock which is not protected in case of a fork.

Consequently, if a fork occurs while the internal handler' thread is printing a logging message, the child process will run into a deadlock if sys.stderr is manipulated. This happens beyond the scope of loguru, as child processes do not use the sink (they simply add log messages to the queue).

import multiprocessing
import sys
from loguru import logger

if __name__ == "__main__":
    logger.remove()
    logger.add(sys.stderr, enqueue=True)

    while True:
        logger.info("No deadlock yet...")
        process = multiprocessing.Process(target=lambda: None)
        process.start()
        process.join()

The fix is quite straightforward using the new locking mechanism previously introduced: we have to protect the sink with a lock even if enqueue=True, so it's acquired and released during a fork.

3. Deadlock while joining child process with enqueue=True (be597bd)

About this one, I was not able to make a reproducible example, but the multiprocessing documentation is quite explicit about it:

Warning: As mentioned above, if a child process has put items on a queue (and it has not used JoinableQueue.cancel_join_thread), then that process will not terminate until all buffered items have been flushed to the pipe.

This means that if you try joining that process you may get a deadlock unless you are sure that all items which have been put on the queue have been consumed. Similarly, if the child process is non-daemonic then the parent process may hang on exit when it tries to join all its non-daemonic children.

So, it seems possible for a child process to deadlock if it's exited while items added to the queue have not been fully handled. I'm not sure it apply to SimpleQueue used by loguru, but to be sure, I slightly modified the logger.complete() function. This method can now be called outside the asynchronous functions, it will block until all log messages added to the queue (at the time of the call) have been processed.

@Delgan Delgan closed this as completed May 2, 2020
@kapsh
Copy link

kapsh commented May 14, 2020

@Delgan hello! I am hitting this deadlock while using loguru 0.4.1. Could you please release a new version with these fixes included?

It seems to me that still reproduces on 981edf0 if Process.run of custom worker is wrapped in loguru.catch.

@Delgan
Copy link
Owner

Delgan commented May 14, 2020

@kapsh I will try to release a new version this week-end.

However, I don't encounter any deadlock wrapping Process.run(). Could you please open a new ticket with the snippet that triggered the bug?

@kapsh
Copy link

kapsh commented May 14, 2020 via email

@Delgan
Copy link
Owner

Delgan commented May 17, 2020

@kapsh You can now update loguru to 0.5.0, deadlocks should be fixed. 👍

If you encounter any other problem, please open a new issue so I can investigate it. 😉

@kapsh
Copy link

kapsh commented May 17, 2020

@Delgan meanwhile I couldn't catch deadlock in my script (pretty annoying to reproduce) yet I suspect it is caused by my own interaction with queues.
Thanks for taking care!

@ahuan2616
Copy link

ahuan2616 commented Jun 4, 2021

I can still reproduce this issue "2. Deadlock while using enqueue=True with multiprocessing" with version 0.5.3, am I missing something? I used these code to test:

import multiprocessing
import sys
from loguru import logger

if __name__ == "__main__":
    logger.remove()
    logger.add(sys.stderr, enqueue=True)

    while True:
        logger.info("No deadlock yet...")
        process = multiprocessing.Process(target=lambda: None)
        process.start()
        process.join()

@kapsh
Copy link

kapsh commented Jun 4, 2021

@ahuan2616 which python version are you running when this happens? In my case I concluded that some 3.6 release is responsible, because it stopped reproducing under 3.9.

@Delgan
Copy link
Owner

Delgan commented Jun 4, 2021

@ahuan2616 Sorry, it seems I forgot to mention that the fix requires at least Python 3.7 as @kapsh suggested.
The solution implemented relies on the os.register_at_fork() function which has been introduced with Python 3.7.

@ahuan2616
Copy link

@kapsh @Delgan Thanks a lot for your help! My Python version was 3.6 before, it has been resolved when I update to Python 3.9. Thanks for your quickly reply!

@Helicopt
Copy link

It seems in Python 3.8.0 and loguru 0.6.0, it still goes into a deadlock.

Now I don't have spare time to make a detailed analysis, but I can post some info for you.

Environment:

Linux xxxxxx 5.15.0-50-generic #56-Ubuntu SMP Tue Sep 20 13:23:26 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Python 3.8.0 (default, Nov 6 2019, 21:49:08)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.

import loguru
loguru.version
'0.6.0'

In Process 0: It runs normally and reaches a sync command (after printing try resize)

2022-10-22 06:50:52.848 | DEBUG | yolox.core.trainer:after_iter:280 - print info msg
2022-10-22 06:50:52.848 | INFO | yolox.core.trainer:after_iter:281 - epoch: 26/100, iter: 520/534, mem: 21206Mb, iter_time: 0.651s, data_time: 0.002s, total_loss: 5.378, iou_loss: 2.539, l1_loss: 0.000, conf_loss: 2.225, cls_loss: 0.614, lr: 1.915e-04, size: 1120, ETA: 7:22:11
2022-10-22 06:50:52.848 | DEBUG | yolox.core.trainer:after_iter:291 - clear meters
2022-10-22 06:50:52.848 | DEBUG | yolox.core.trainer:after_iter:293 - clear meters done
2022-10-22 06:50:52.849 | DEBUG | yolox.core.trainer:after_iter:297 - try resize

In Process 1: It gets stuck in logger.info.

2022-10-22 06:50:52.835 | DEBUG | yolox.core.trainer:after_iter:278 - get gpu number
2022-10-22 06:50:52.836 | DEBUG | yolox.core.trainer:after_iter:280 - print info msg

In Process 2: It gets stuck in logger.info before I kill process 1, and then it jumps out of deadlock (because I kill process1), and continues to run and reaches a sync command (after printing try resize). It gets stuck for 1 hour, you can see the timestamp.

2022-10-22 06:50:52.837 | DEBUG | yolox.core.trainer:after_iter:278 - get gpu number
2022-10-22 06:50:52.837 | DEBUG | yolox.core.trainer:after_iter:280 - print info msg
2022-10-22 06:50:52.838 | INFO | yolox.core.trainer:after_iter:281 - epoch: 26/100, iter: 520/534, mem: 21470Mb, iter_time: 0.649s, data_time: 0.002s, total_loss: 5.432, iou_loss: 2.601, l1_loss: 0.000, conf_loss: 2.211, cls_loss: 0.621, lr: 1.915e-04, size: 1120, ETA: 7:23:12
2022-10-22 07:53:44.065 | DEBUG | yolox.core.trainer:after_iter:291 - clear meters
2022-10-22 07:53:44.066 | DEBUG | yolox.core.trainer:after_iter:293 - clear meters done
2022-10-22 07:53:44.067 | DEBUG | yolox.core.trainer:after_iter:297 - try resize

Short discussion:

So the message string is created normally because the timestamp is normal, and it gets stuck when outputting the message.

Killing a process can release the deadlock.

My temporary solution is not to log in more than 1 subprocess.

I will update more info when I get time.

@Delgan
Copy link
Owner

Delgan commented Oct 22, 2022

@Helicopt There have been some improvements made recently about deadlocks in #712, you can use the master branch of loguru to perhaps get more debugging information.

pip install git+https://github.com/Delgan/loguru.git@master

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

5 participants