memory leak when using tensorflow #2102
Comments
Hi, can you post some tests and profiling of what you mean here? Some compilation time numbers and memory usage, for example. Anything that we can reproduce might work. We could use that information later to write a PR. |
Here is sample code, and the results:

from keras.models import Sequential
from keras.layers.core import Dense, Activation
import os
import psutil
import timeit
import gc


def get_mem_usage():
    process = psutil.Process(os.getpid())
    return process.memory_info()


def build():
    model = Sequential()
    model.add(Dense(output_dim=4096, input_dim=4096, init="glorot_uniform"))
    model.add(Activation("relu"))
    model.compile(loss='categorical_crossentropy', optimizer='sgd')
    return model


if __name__ == '__main__':
    for i in xrange(10):
        gc.collect()
        t = timeit.timeit('build()', number=1, setup="from __main__ import build")
        mem = get_mem_usage()
        print('build time: {}, mem: {}'.format(t, mem))

results:
Notice compilation time and memory usage going up. After clearing the default graph between iterations, these are the results:
|
We'll consider adding a clear_session() method.
|
A different solution is to wrap everything (from the user's point of view) inside a with tf.Graph().as_default(): block. However, this does not play nicely with the way Keras initializes a tf session, holding it as a global from process init. A clear_session() method is needed anyway. |
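For reference, a minimal sketch of the graph-scoping approach described in the comment above, assuming the TF 1.x-era API; the layer sizes and loop count are placeholders:

```python
import tensorflow as tf
from keras import backend as K
from keras.models import Sequential
from keras.layers import Dense

for i in range(10):
    graph = tf.Graph()                      # a private graph for this iteration's model
    with graph.as_default():
        session = tf.Session(graph=graph)
        K.set_session(session)              # point Keras at the fresh session
        model = Sequential()
        model.add(Dense(64, input_dim=32, activation='relu'))
        model.compile(loss='mse', optimizer='sgd')
        # ... train / evaluate here ...
    session.close()                         # release the graph's resources
```

Because nothing is added to the global default graph, each iteration starts from a clean slate; the drawback, as noted above, is the bookkeeping around Keras's global session.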
This might be relevant #2535. |
We hit the same problem in a loop for an sklearn k-fold experiment. No problem after switching to Theano. |
I ran into OOM exceptions while using KerasClassifier to sweep large hyperparameter grids with the TF backend. No problems with Theano. |
I'm seeing this too. For me, it happens when I'm using kfolds. |
You can now use K.clear_session(). |
You should update Keras. |
Hi,
You should update Keras. clear_session was added a few months ago. |
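For anyone landing here later, a minimal sketch of the clear_session() pattern being suggested, with a hypothetical model and loop (the data and layer sizes are placeholders):

```python
from keras import backend as K
from keras.models import Sequential
from keras.layers import Dense

for fold in range(10):
    K.clear_session()                       # drop the accumulated graph before building a new model
    model = Sequential()
    model.add(Dense(64, input_dim=32, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='sgd')
    # model.fit(x_fold, y_fold)             # placeholder: per-fold training data
```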
Hi guys, after googling the tensorflow/keras memory leak for quite a long time, most answers say to add K.clear_session() at the end. So I used it on every iteration of a model-fitting loop and checked the number of graph operations (the length of the operations list stayed fixed). However, memory was still increasing and finally reached almost 100%. Any ideas on this issue? My code is like this:
|
Hi, Try from keras import backend as be |
Same here. I want to use keras.wrappers.scikit_learn.KerasClassifier and GridSearchCV. With ~640 different combinations: 1 hour to OOM. The server is large enough (really!) and contains two Tesla K80 GPUs. I reduced the dataset as well, but no luck. If I reduce the parameter grid any further, GridSearch makes no sense anymore. And I do not see how to run clear_session with GridSearchCV without rewriting it. Edit: if I run clear_session manually, the memory still remains like this: |
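One way that might get clear_session() into a GridSearchCV run without rewriting it is to call it at the top of the build_fn that KerasClassifier invokes for every candidate. A sketch under that assumption (single-process search, i.e. the default n_jobs=1; the model, parameter grid, and data are placeholders):

```python
from keras import backend as K
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import GridSearchCV

def build_model(units=64):
    K.clear_session()                       # wipe the old graph before every new candidate
    model = Sequential()
    model.add(Dense(units, input_dim=20, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

clf = KerasClassifier(build_fn=build_model, epochs=5, batch_size=32, verbose=0)
grid = GridSearchCV(clf, param_grid={'units': [32, 64, 128]}, cv=3)
# grid.fit(X, y)                            # X, y are placeholders for the real dataset
```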
Now we are using Keras 2.1.5; the problem still exists and does not get resolved by K.clear_session(). |
With TF 1.8 and Keras 2.2.0, K.clear_session() gives me an error. |
I have keras 2.2 and TF 1.8 but I don't see that error. try to install it by |
@VanitarNordic |
Can confirm that with TF 1.8 and Keras 2.2.0, K.clear_session() leads to a crash. The same code on TF 1.8 and Keras 2.1.6 works correctly. |
We have to use it; it's the only way we can get consistent results when the code is inside a loop. I hit the crash too, but I did not know it was because of that, since it did not generate any error. |
@VanitarNordic I know. I'm using it for the same reason (GridSearchCV). It's crashing for me without any message too (once I got a message that the program tried to do something with memory address 0). |
Exactly, it happens in the third iteration! Funny. I had to downgrade to 2.1.6 as well. |
We also have memory leaks when using keras + tensorflow. There are multiple places where it consumes RAM and doesn't free it afterwards. We create models in a loop; after some time it consumes all free memory, for example on a server it takes all 132 GB.

ENVs:

Here is a demo script with one of the leak cases (requires objgraph and psutil):

from __future__ import print_function
import os, sys, gc
import objgraph, psutil
from keras.layers import Input, Dense
from keras.models import Model
from keras.regularizers import l2
from keras import backend as K

data = []
ps = psutil.Process(os.getpid())
getrss = lambda: ps.memory_info().rss / 1024 / 1024


def simple():
    data.append(['sdsds'] * 1000000)


def model():
    coef = l2(0.0005)
    input_data = Input(shape=(33,))
    enc_layer = Dense(40, activation='relu', kernel_regularizer=coef)
    dec_layer = Dense(33, activation='linear', kernel_regularizer=coef)
    enc = enc_layer(input_data)
    dec = dec_layer(enc)
    dae = Model(inputs=input_data, outputs=dec)
    # K.clear_session()


def print_obj(title, limit=None):
    print('\n' + title)
    objgraph.show_growth(limit=limit)
    print('')


def main(func, show_obj, iterations=10):
    print('ITERATIONS:', iterations)
    start = getrss()
    print('MEM BEFORE RUN:', start)
    if show_obj: print_obj('OBJECTS BEFORE RUN:', 3)
    # Do something ...
    for _ in range(iterations):
        func()
    print('MEM AFTER RUN:', getrss())
    global data
    del data[:]
    print('GC COUNT: ', gc.collect())
    end = getrss()
    if show_obj: print_obj('OBJECTS AFTER RUN:')
    delta = end - start
    print('MEM AFTER GC: {} (leak: {})'.format(end, delta))


# USAGE: KERAS_BACKEND=tensorflow python memtest.py [num_iterations] [simple] [showobj]
if __name__ == '__main__':
    func = simple if 'simple' in sys.argv else model
    show_obj = 'showobj' in sys.argv
    iterations = next((int(x) for x in sys.argv if x.isdigit()), 10)
    main(func, show_obj, iterations)

Output:
Similar issue: tensorflow/tensorflow#10408. Is there a way to fix that? |
First, I suggest you dial down your tone a bit. As for a fix, if the clear_session() way does not work for you, I would suggest reusing the models. If you are generating a small number of different models, you can do something like this:

def generate_models():
    models = {
        'model1': gen_model_1(),
        'model2': gen_model_2(),
    }
    for k, model in models.items():
        model.save_weights(k)
    return models


def get_blank_model(k, models):
    model = models[k]
    model.load_weights(k)
    return model

As long as you do not need several models of the same type in parallel, you are all good. Otherwise, please be more specific about your use case. |
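A possible usage of the snippet above, where gen_model_1/gen_model_2 and the fold data are placeholders: build and checkpoint the blank models once, then reset their weights per fold instead of rebuilding the graph each time.

```python
models = generate_models()                  # build + save initial weights once
for fold_x, fold_y in folds:                # 'folds' stands in for your CV splits
    model = get_blank_model('model1', models)
    model.fit(fold_x, fold_y, epochs=5, verbose=0)
```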
Keras 2.2.2, TF 1.9.0: OOM during CV validation within an inner loop. Same result whether the model is reused or recreated after 12 iterations. By the way... I can confirm that downgrading to Keras 2.1.6 fixes the issue. |
Just came across this issue. I'm using tf 1.9.0 and its keras version 2.1.6-tf. |
Is it possible to reopen this issue? |
This should not be necessary, but it appears that Keras/TensorFlow leaks memory and the GPU eventually runs out and crashes. Hopefully this will fix the crash (yet to be tested). Commit for back-up purposes. Note: keras-team/keras#2102.
downgrade tf to 1.8 |
Here is a pattern I adopted when fighting OOM that in retrospect may have caused OOM on its own:
I suspect that is why I was hitting OOM after my first del/clear_session(): deleting the model may deprive TF of info it needs to clear the session properly. Now I am not reloading the model anyway, and the original OOM seems to be gone, maybe due to newer versions of everything. I'm not testing whether 'del model' before clear_session() caused the latest memory leak, because it takes a while, but I recommend that anyone using that sort of pattern try deleting things after the clear_session():
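A minimal sketch of the two orderings being discussed, with a throwaway model standing in for the real one:

```python
from keras import backend as K
from keras.models import Sequential
from keras.layers import Dense

model = Sequential([Dense(1, input_dim=4)])
model.compile(loss='mse', optimizer='sgd')

# ordering that seemed to cause trouble: delete the model, then clear the session
#   del model
#   K.clear_session()

# recommended ordering: clear the session while the objects still exist, then delete them
K.clear_session()
del model
```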
Beware of adoption becoming maladaptation. :-) |
Is it possible to do this from C++? I have the exact same problem but with C++ code and being unable to release memory without fully killing the program or using cudaDeviceReset() which works but does not allow further use of tensorflow within the calling process. |
Worst case, maybe you could fork the calling process, and the child would be able to start TF. Though if you have a lot in memory it could be an awkward copy.
|
I can also confirm that downgrading to Keras 2.1.6 fixes the issue |
Will this |
Same problem. Config:
Context: the model is overwritten and fitted several times in a for loop (I store a few key indicators at the end of the loop; I'm not interested in the model per se).
==> Without K.clear_session() -> memory leaks.
Updated both (TF 1.12.0 / Keras 2.2.4) -> problem gone. |
it may work. |
I'm still seeing this issue with:
I've tried
I should note that I haven't been able to pinpoint why this happens. But I know the problem occurs when I call
Has anyone encountered this issue while directly fitting the Keras model to TensorFlow data generators? I'm trying not to downgrade too far because of the TensorFlow generator support in the more recent releases. I'm working on an [mcve], but my code is still a bit lengthy to post. |
I solved this problem by switching to Theano.
|
I am having exactly the problems you described. As soon as model.fit is called, memory for tuples increases. |
@tzachar I want to know how to add the function you mentioned to my code:
My code:
|
Not exactly sure why this issue has been closed. What can be done to mitigate the growing loading time when calling load_model repeatedly? E.g. having ten different models that all need to be loaded in memory, which means that using clear_session() between loads is not an option:

import keras
from keras.models import load_model

keras.backend.clear_session()
files = ['model1.h5', 'model2.h5', 'model3.h5', 'model4.h5', '...']
models = [load_model(f) for f in files]
# each model takes 30 seconds more than the previous one to load
# in particular, models 9 or 10 really take ages to load
do_something_with(models)
|
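If all the models really do have to stay resident at once, one workaround sometimes used with the TF 1.x backend (a sketch, not something confirmed in this thread; file names are placeholders) is to load each model into its own graph and session, so that no single default graph keeps growing:

```python
import tensorflow as tf
from keras import backend as K
from keras.models import load_model

files = ['model1.h5', 'model2.h5', 'model3.h5']     # placeholder paths
loaded = []
for f in files:
    graph = tf.Graph()
    with graph.as_default():
        session = tf.Session(graph=graph)
        K.set_session(session)
        loaded.append((graph, session, load_model(f)))

# predictions later require re-entering the matching graph/session
graph, session, model = loaded[0]
with graph.as_default():
    K.set_session(session)
    # preds = model.predict(x)                      # x is a placeholder input
```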
It's been 5 years and this bug is still here. |
6 years? |
7 years? |
8 years |
No, this is STILL an issue. Using keras.backend.clear_session() does not effectively address the problem of memory build-up during iterative model training or loading, which eventually leads to slower performance. I train thousands of small models and this is such a thorn in my side; it slows down my research. I've thought of circumventing the issue by encapsulating the training in a subprocess, but that is a janky solution. |
That's what I ended up doing as well, about 3-4 years ago. My models were a bit complex: six networks, each receiving one image of the same object, and all networks sharing one classifier. Then I ran that in 10-fold cross-validation and created ensembles of those folds. So the models got really heavy. But as long as you create some form of main function that runs inside the subprocess and import all the necessary classes and functions inside that main function, it does work, although I had to set up environment variables as well. But like I said, it was 3-4 years ago and I can't remember all the details any more. Can't believe this is still an issue though. |
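For what it's worth, a minimal sketch of that subprocess approach, assuming the 'spawn' start method so every worker starts with fresh TensorFlow state (the model, data, and returned score are placeholders):

```python
import multiprocessing as mp

def train_one(config, queue):
    # import Keras/TF inside the worker so all backend state dies with the process
    from keras.models import Sequential
    from keras.layers import Dense

    model = Sequential()
    model.add(Dense(config['units'], input_dim=8, activation='relu'))
    model.add(Dense(1))
    model.compile(loss='mse', optimizer='adam')
    # model.fit(x, y, ...)                  # placeholder training call
    queue.put({'config': config, 'score': 0.0})     # return only plain-Python results

if __name__ == '__main__':
    ctx = mp.get_context('spawn')
    results = []
    for units in [16, 32, 64]:
        q = ctx.Queue()
        p = ctx.Process(target=train_one, args=({'units': units}, q))
        p.start()
        results.append(q.get())             # fetch before join to avoid blocking on a full queue
        p.join()
    print(results)
```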
I save my model, call K.clear_session(), reload, then resume training. Code is open-sourced now, in case it helps: https://github.com/phobrain/Phobrain/blob/main/pr/bin/train_brain.py For multiple models, I'd try doing parallel saves and loads, though I'm not sure how good Python is at that. |
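A sketch of that save / clear_session / reload cycle, assuming a single model and a writable checkpoint path (the model, data, and path here are placeholders, not the linked script):

```python
from keras import backend as K
from keras.models import Sequential, load_model
from keras.layers import Dense

CKPT = 'checkpoint.h5'                      # placeholder path

model = Sequential([Dense(1, input_dim=4)])
model.compile(loss='mse', optimizer='adam')

for block in range(20):
    # model.fit(x_train, y_train, epochs=5, verbose=0)   # x_train/y_train are placeholders
    model.save(CKPT)                        # persists weights, architecture and optimizer state
    K.clear_session()                       # drop the accumulated graph
    model = load_model(CKPT)                # rebuild in a fresh graph and resume training
```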
Hello.
When using tensorflow, all ops are entered into the global tf graph. This results in memory leaks and very long compilation times when building several models, one after the other, in the same Python process (think IPython, cross-validation, etc.).
For now, I solve this on my end by doing the following:
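A sketch of the kind of reset described above, assuming the TF 1.x API (drop the default graph and hand Keras a fresh session, which is essentially what clear_session() later came to do):

```python
import tensorflow as tf
from keras import backend as K

def reset_keras():
    tf.reset_default_graph()                # forget every op added to the default graph so far
    K.set_session(tf.Session())             # give Keras a fresh session on the new graph
```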
Maybe we should incorporate this into a keras.reset() function?