Out of Memory when using Streaming Dataloader #652
Comments
Hey y’all, thanks for bringing this issue to our attention. We’re looking into this and will get back to you soon.
Having skimmed through the blog and the PyTorch issue: is this an issue particular to Streaming, or is it on the PyTorch side? StreamingDataLoader is a simple (stateful) subclass of PyTorch’s DataLoader. Does this also happen with other Datasets? @VikaasVarma @miguelalba96
So Streaming is designed for fast random sample access from shards that live on disk. Samples, outside of dataloader prefetching, are never kept in memory. We do use RAM for other things though, like sample partitioning and shuffling, but that happens at the start of training. So I'm inclined to think that this is a PyTorch DataLoader issue, given the links you sent as well. To track memory usage, maybe you could call
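One way to track memory usage across the main process and the dataloader workers is psutil; the snippet below is a minimal sketch using that library (my choice of tool, not necessarily the call being suggested above).

```python
# Minimal sketch: log memory for the training process and its dataloader
# workers using psutil (assumed installed; not part of Streaming itself).
import psutil


def log_memory(tag: str) -> None:
    main = psutil.Process()
    procs = [main] + main.children(recursive=True)  # dataloader workers are child processes
    rss_mb = sum(p.memory_info().rss for p in procs) / 1e6
    uss_mb = sum(p.memory_full_info().uss for p in procs) / 1e6
    print(f"[{tag}] rss={rss_mb:.0f} MB, uss={uss_mb:.0f} MB, {len(procs)} processes")
```

Calling this every few hundred batches makes it easy to see whether the growth comes from the main process or from the workers.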
This does not happen outside of Mosaic or with other datasets. Using PyTorch's DataLoader instead of the StreamingDataLoader also leads to a memory leak. When pulling the data with PyTorch's torchdata to construct the dataset, there is no significant memory overhead. I don't think the problem lies within the StreamingDataLoader. The links seem to point towards large lists of Python objects generally causing this issue. There are a few cases of this in the StreamingDataset (the stored shards, spanners, stream filepaths, etc.).
To echo @VikaasVarma's point here: the copy-on-read issue with the torch DataLoader comes back to the dataset object storing a large number of plain Python objects that can't use shared memory. I noticed that most of the dataset metadata is in fact in shared memory, except for the Python-object structures mentioned above. With smaller datasets, we never ran into this issue (or it never came up during the training lifecycle). We're only running into it now with a dataset that is a few orders of magnitude larger (more rows and larger rows, and thus more shards). If this is a copy-on-read issue, the memory wouldn't grow with row size, only with the number of shards, which I think is the case.
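For reference, the workaround usually discussed for this copy-on-read pattern is to keep per-sample metadata in packed arrays rather than in lists of Python objects, so that forked workers never touch refcounts on millions of objects. The sketch below is my own illustration of that idea, not Streaming's internals:

```python
# Sketch: store shard/sample metadata as numpy arrays instead of Python lists.
# Reading from a numpy array in a forked worker does not touch Python object
# refcounts, so the copy-on-write pages stay shared with the parent process.
import numpy as np
from torch.utils.data import Dataset


class PackedMetadataDataset(Dataset):
    def __init__(self, filepaths: list[str], sizes: list[int]):
        # Packed arrays instead of list[str] / list[int]
        self.filepaths = np.array(filepaths, dtype=np.bytes_)
        self.sizes = np.asarray(sizes, dtype=np.int64)

    def __len__(self) -> int:
        return len(self.sizes)

    def __getitem__(self, idx: int):
        path = self.filepaths[idx].decode()
        size = int(self.sizes[idx])
        return path, size  # load the actual sample from `path` here
```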
@VikaasVarma Is this a typo in your repro script?
To clarify, you expected "cond" to be garbage-collected by StreamingDataset?
@VikaasVarma can you clarify your plot a bit? For example, what do pss, uss, and shared mean, and what are the x and y axes?
Encountered a similar issue: the CPU memory usage keeps increasing until OOM in about two hours.
Check this issue. I found a memory leak problem; it is caused by the boto3 library.
Encountered a similar CPU memory leak issue when training on H800.
Hey @huxuan @wanghao14 @miguelalba96, |
@snarayan21 There is no boto3 in my code. |
@snarayan21 I am using 1.5 TB of images stored in shards locally on 4 nodes, each with an entire copy of the data, so technically I am not streaming.
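For a setup like that, where every node already holds a full local copy, the dataset can be pointed at the local shard directory only. A small sketch with a placeholder path (no remote is configured, so nothing is downloaded):

```python
# Sketch: local-only StreamingDataset; the path is a placeholder.
from streaming import StreamingDataset

dataset = StreamingDataset(
    local="/data/shards",  # per-node copy of the MDS shards
    batch_size=32,
    shuffle=True,
)
```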
Environment
To reproduce
Steps to reproduce the behavior:
When using the `StreamingDataLoader` (or the vanilla PyTorch `DataLoader`) with `num_workers>0`, the processes slowly take more and more memory until the CPU RAM is filled.
Expected behavior
The dataloader should be able to provide samples indefinitely without using a significant portion of available RAM.
Additional context
Below is the dataset and dataloader implementation. Each sample is roughly 10 MB. With 16 workers, a prefetch factor of 4, and a batch size of 32, the total memory usage should be, at max, 20 GB. The dataset is made up of around 1.3 million shards.
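The referenced implementation isn't shown here; the snippet below is a rough sketch of a comparable setup using the numbers quoted above, with placeholder paths rather than the author's actual code:

```python
# Rough sketch: StreamingDataset + StreamingDataLoader with 16 workers,
# prefetch_factor=4, and batch_size=32; remote/local paths are placeholders.
from streaming import StreamingDataLoader, StreamingDataset

dataset = StreamingDataset(
    remote="s3://my-bucket/mds-shards",  # placeholder remote shard location
    local="/tmp/mds-cache",              # placeholder local cache directory
    shuffle=True,
    batch_size=32,
)

loader = StreamingDataLoader(
    dataset,
    batch_size=32,
    num_workers=16,
    prefetch_factor=4,
    pin_memory=True,
)

for batch in loader:
    ...  # training step goes here
```

With those settings, in-flight samples alone come to roughly 16 workers × 4 prefetched batches × 32 samples × 10 MB ≈ 20 GB, matching the estimate above.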
A similar problem seems to be documented in an issue and a blog post. I have recreated the graphs found in the blog post below.