memory leak when using tensorflow #2102

Closed
tzachar opened this issue Mar 28, 2016 · 56 comments

Comments
@tzachar
Contributor

tzachar commented Mar 28, 2016

Hello.

When using TensorFlow, all ops are added to the global tf graph. This results in memory leaks and increasingly long compilation times when building several models one after the other in the same Python process (think IPython, cross-validation, etc.).

For now, I solve this on my end by doing the following:

import keras.backend.tensorflow_backend
if keras.backend.tensorflow_backend._SESSION:
    import tensorflow as tf
    tf.reset_default_graph()
    keras.backend.tensorflow_backend._SESSION.close()
    keras.backend.tensorflow_backend._SESSION = None

Maybe we should incorporate this into a keras.reset() function?
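A minimal sketch of what such a helper could look like, simply wrapping the workaround above (it relies on the same private Keras 1.x tensorflow_backend internals, so it is an illustration rather than an official API):

# Hypothetical helper; _SESSION is a private Keras 1.x internal.
def reset_keras():
    import keras.backend.tensorflow_backend as ktf
    if ktf._SESSION is not None:
        import tensorflow as tf
        tf.reset_default_graph()  # drop all ops from the default graph
        ktf._SESSION.close()      # release the session's memory
        ktf._SESSION = None       # force Keras to create a fresh session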

@EderSantana
Contributor

Hi, can you post some tests and profiling of what you mean? Compilation-time numbers and memory usage, for example. Anything we can reproduce would help. We could use that information later to write a PR.

@tzachar
Contributor Author

tzachar commented Mar 30, 2016

Here is sample code, and the results:

from keras.models import Sequential
from keras.layers.core import Dense, Activation
import os
import psutil
import timeit
import gc


def get_mem_usage():
    process = psutil.Process(os.getpid())
    return process.memory_info()


def build():
    model = Sequential()
    model.add(Dense(output_dim=4096, input_dim=4096, init="glorot_uniform"))
    model.add(Activation("relu"))
    model.compile(loss='categorical_crossentropy', optimizer='sgd')
    return model


if __name__ == '__main__':
    for i in xrange(10):
        gc.collect()
        t = timeit.timeit('build()', number=1, setup="from __main__ import build")
        mem = get_mem_usage()
        print('build time: {}, mem: {}'.format(t, mem))

results:

Using TensorFlow backend.
build time: 1.02965593338, mem: pmem(rss=599789568, vms=1527300096)
build time: 1.0096321106, mem: pmem(rss=1141383168, vms=2068729856)
build time: 1.03104996681, mem: pmem(rss=1682370560, vms=2610061312)
build time: 1.0659198761, mem: pmem(rss=2223833088, vms=3151384576)
build time: 1.08011817932, mem: pmem(rss=2765127680, vms=3692707840)
build time: 1.10519003868, mem: pmem(rss=3306053632, vms=4233703424)
build time: 1.13465809822, mem: pmem(rss=3847581696, vms=4775194624)
build time: 1.14798998833, mem: pmem(rss=4387577856, vms=5314605056)
build time: 1.17501521111, mem: pmem(rss=4929052672, vms=5856210944)
build time: 1.25362706184, mem: pmem(rss=5469794304, vms=6396817408)

Notice the compilation time and memory usage going up. After cleaning the default graph between iterations, these are the results:

Using TensorFlow backend.
build time: 0.988173961639, mem: pmem(rss=598212608, vms=1527754752)
build time: 0.976176023483, mem: pmem(rss=598134784, vms=1527767040)
build time: 0.973516941071, mem: pmem(rss=598507520, vms=1528115200)
build time: 0.975924968719, mem: pmem(rss=598638592, vms=1528377344)
build time: 0.975230932236, mem: pmem(rss=599068672, vms=1528639488)
build time: 0.976888895035, mem: pmem(rss=599187456, vms=1528623104)
build time: 0.978793144226, mem: pmem(rss=599056384, vms=1528639488)
build time: 0.975780010223, mem: pmem(rss=598925312, vms=1528647680)
build time: 0.977483987808, mem: pmem(rss=598794240, vms=1528639488)
build time: 0.974485874176, mem: pmem(rss=599236608, vms=1528623104)

@fchollet
Collaborator

We'll consider a clear_session backend method for TensorFlow.


@tzachar
Contributor Author

tzachar commented Mar 30, 2016

A different solution is to wrap everything (from the user's point of view) inside a:

with tf.Graph().as_default():

However, this does not play nicely with the way Keras initializes a tf session and holds it as a global from process init. A clear_session() method is needed anyway.
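For illustration, a minimal sketch of that graph-scoping pattern under the TF 1.x API (build() is the helper from the benchmark above; the explicit Session handling is an assumption, since Keras normally holds its own global session):

import tensorflow as tf

for i in range(10):
    # Each iteration gets its own graph, so ops never accumulate in the
    # default graph; memory is released when the graph goes out of scope.
    with tf.Graph().as_default():
        with tf.Session() as sess:
            model = build()
            # ... train / evaluate here ...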

@qdrk

qdrk commented Apr 27, 2016

This might be relevant #2535.

@bhack
Contributor

bhack commented May 8, 2016

We hit the same problem in a loop for a scikit-learn k-fold experiment. No problem after switching to Theano.

@leonweber

leonweber commented May 24, 2016

I ran into OOM exceptions while using KerasClassifier to sweep large hyperparameter grids with the TF backend. No problems with Theano.

@ckleban

ckleban commented Jun 26, 2016

I'm seeing this too. For me, it happens when I'm using k-folds.

@fchollet
Collaborator

You can now use K.clear_session() when using TensorFlow, which will clean up everything. This is recommended if you ever create models inside a loop.
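For reference, a minimal usage sketch, reusing the build() helper from the benchmark above:

from keras import backend as K

for i in range(10):
    model = build()      # create and compile a fresh model
    # ... fit / evaluate ...
    K.clear_session()    # destroy the graph and session before the next iteration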

@fchollet
Collaborator

fchollet commented Sep 4, 2016

You should update Keras. clear_session was added a few months ago.

@jhmeijer

jhmeijer commented Sep 8, 2016

Hi,
Yes I realized that an hour later. I have updated Keras and it works now.
Thanks for the great software!
Jeroen Meijer


@mingliking

mingliking commented Nov 2, 2017

Hi guys, after googling the tensorflow/keras memory leak for quite a long time, the most common answer is to add K.clear_session() at the end. I therefore called it on every iteration of a model-fitting loop and checked the number of graph operations (the count stays fixed). However, memory still kept increasing and finally reached almost 100%. Any ideas on this issue?

My code is like this:

import numpy as np
import tensorflow as tf
from keras import backend as K
from keras.models import Sequential
from keras.layers import LSTM, Dense

for date in date_list:
    #### data cleaning
    df = df_lstm.loc[df_lstm.index <= date]
    df_y = df['ret'] - df['ret'].mean()
    trainY = df_y[timesteps - 1:-1]
    trainX = x_transformed[:-1]
    # trainX = np.reshape(trainX, (trainX.shape[0], 1, trainX.shape[1]))
    testX = x_transformed[-timesteps:]
    testXX = np.reshape(testX, (1, testX.shape[0], testX.shape[1]))
    data_dim = trainX.shape[1]
    trainYY = np.array([[0, 1] if x <= 0 else [1, 0] for x in trainY])
    trainXX = np.array([trainX[i:i + timesteps, :] for i in range(trainX.shape[0] - timesteps + 1)])

    #### start to build models
    config = tf.ConfigProto()
    config.gpu_options.per_process_gpu_memory_fraction = 0.3
    config.gpu_options.allow_growth = True
    K.set_session(tf.Session(graph=tf.get_default_graph(), config=config))

    model = Sequential()
    model.add(LSTM(dimension_of_lstm, input_shape=(timesteps, data_dim), dropout_W=0.25, dropout_U=0.25))
    model.add(Dense(16, activation='relu'))
    model.add(Dense(8, activation='relu'))
    model.add(Dense(2, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.fit(trainXX, trainYY, batch_size=batchsize, nb_epoch=epoch_num)
    y_pred_enet = model.predict(testXX)

    del model
    # g = tf.get_default_graph()
    # print(len(g.get_operations()))
    # tried all the answers I could find at the end
    K.clear_session()
    tf.reset_default_graph()
    tf.contrib.keras.backend.clear_session()

@jhmeijer

jhmeijer commented Nov 2, 2017

Hi,

Try

from keras import backend as be
(...)
be.clear_session()

@xentity

xentity commented Jan 23, 2018

Same here. I want to use keras.wrappers.scikit_learn.KerasClassifier with sklearn.model_selection.GridSearchCV for my thesis. I have to keep reducing the number (not the values) of possible hyperparameter values.

With ~640 different combinations: 1 hour to OOM
With ~450 different combinations: 3 hours to OOM
With ~290 different combinations: 5 hours to OOM

The server is large enough (really!) and contains two Tesla K80 GPUs.

I also reduced the dataset, but no luck. If I reduce the parameters any further, grid search makes no sense anymore. And I don't see how to run clear_session with GridSearchCV without rewriting it.

Edit: if I run clear_session manually, the memory still remains like this:
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
| 0 14943 C /usr/bin/python3 324MiB |
| 0 53052 C /usr/bin/python3 10588MiB | <--- my process
| 1 14943 C /usr/bin/python3 368MiB |
| 1 53052 C /usr/bin/python3 10506MiB | <--- my process
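One workaround sketch, under the assumption that a manual loop is acceptable: replace GridSearchCV with a loop over sklearn's ParameterGrid so clear_session() can run between fits (build_fn, the data variables, and the grid below are hypothetical):

from sklearn.model_selection import ParameterGrid
from keras import backend as K

param_grid = {'units': [32, 64], 'lr': [1e-3, 1e-4]}

best_loss, best_params = None, None
for params in ParameterGrid(param_grid):
    model = build_fn(**params)                      # hypothetical model builder
    model.fit(X_train, y_train, epochs=5, verbose=0)
    loss = model.evaluate(X_val, y_val, verbose=0)  # scalar loss when no extra metrics
    if best_loss is None or loss < best_loss:
        best_loss, best_params = loss, params
    K.clear_session()                               # free the graph before the next fit

print(best_params, best_loss)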

@MyVanitar

We are now using Keras 2.1.5 and the problem still exists; it is not resolved by K.clear_session().

@talpay
Contributor

talpay commented Jun 9, 2018

With TF 1.8 and Keras 2.2.0, K.clear_session() leads to Process finished with exit code 139 (interrupted by signal 11: SIGSEGV) when used in a context such as #4417.

@MyVanitar

@talpay

I have Keras 2.2 and TF 1.8 but I don't see that error. Try installing it with conda install -c hesi_m keras, which installs both Keras 2.2 and TF 1.8, and do not mix it with pip. It might solve the case.

@talpay
Contributor

talpay commented Jun 9, 2018

@VanitarNordic
It's definitely not a package-management issue, and I've recreated it with some of the Keras example code. Have you tested it with a TensorBoard callback that has histogram_freq=1? It only happens when training multiple models in a loop with the TensorBoard callback and then calling K.clear_session() (which is necessary, as pointed out in the issue above).

@BluerBlack

I can confirm that with TF 1.8 and Keras 2.2.0, K.clear_session() leads to a crash. The same code on TF 1.8 and Keras 2.1.6 works correctly.

@MyVanitar

@BluerBlack

We have to use it; it's the only way to get consistent results when the code is inside a loop. I faced the crash too, but I did not know it was caused by that, because it was not generating any error.

@BluerBlack

BluerBlack commented Jun 10, 2018

@VanitarNordic

I know. I'm using it for the same reason (GridSearchCV). It's crashing for me without any message too (once I got a message that the program tried to do something with memory address 0).
K.clear_session() consistently crashes on the 3rd call for me, and I'm also using the TensorBoard callback, but with histogram_freq=0.

@MyVanitar

@BluerBlack

Exactly, it happens on the third iteration! Funny. I had to downgrade to 2.1.6 as well.

@skozlovf

We also have memory leaks when using Keras + TensorFlow. There are multiple places where it consumes RAM and doesn't free it afterwards. We create models in a loop; after some time it consumes all free memory, for example, on a server it takes all 132 GB. clear_session() doesn't help.

ENVs:
Ubuntu 16.04.4, python 2.7.15 (Anaconda)
Linux Mint 18.2, python 2.7.9
tensorflow 1.8.0
Keras 2.2.0

Here is a demo script with one of the leak cases (requires objgraph and psutil):

from __future__ import print_function
import os, sys, gc
import objgraph, psutil
from keras.layers import Input, Dense
from keras.models import Model
from keras.regularizers import l2
from keras import backend as K

data = []
ps = psutil.Process(os.getpid())
getrss = lambda: ps.memory_info().rss / 1024 / 1024


def simple():
    data.append(['sdsds'] * 1000000)


def model():
    coef = l2(0.0005)
    input_data = Input(shape=(33,))
    enc_layer = Dense(40, activation='relu', kernel_regularizer=coef)
    dec_layer = Dense(33, activation='linear', kernel_regularizer=coef)
    enc = enc_layer(input_data)
    dec = dec_layer(enc)
    dae = Model(inputs=input_data, outputs=dec)
    # K.clear_session()


def print_obj(title, limit=None):
    print('\n' + title)
    objgraph.show_growth(limit=limit)
    print('')


def main(func, show_obj, iterations=10):
    print('ITERATIONS:', iterations)
    start = getrss()
    print('MEM BEFORE RUN:', start)

    if show_obj: print_obj('OBJECTS BEFORE RUN:', 3)

    # Do something ...
    for _ in range(iterations):
        func()

    print('MEM AFTER RUN:', getrss())

    global data
    del data[:]
    print('GC COUNT: ', gc.collect())

    end = getrss()

    if show_obj: print_obj('OBJECTS AFTER RUN:')

    delta = end - start
    print('MEM AFTER GC: {} (leak: {})'.format(end, delta))


# USAGE: KERAS_BACKEND=tensorflow python memtest.py [num_iterations] [simple] [showobj]
if __name__ == '__main__':
    func = simple if 'simple' in sys.argv else model
    show_obj = 'showobj' in sys.argv
    iterations = next((int(x) for x in sys.argv if x.isdigit()), 10)
    main(func, show_obj, iterations)

Output:

$ KERAS_BACKEND=tensorflow python memtest.py
Using TensorFlow backend.
ITERATIONS: 10
MEM BEFORE RUN: 158
MEM AFTER RUN: 166
GC COUNT:  49
MEM AFTER GC: 166 (leak: 8)

Similar issue: tensorflow/tensorflow#10408

Is there a way to fix that?

@tzachar
Contributor Author

tzachar commented Jun 26, 2018

First, I suggest you dial down your tone a bit. This is not the place to troll.

As for a fix, if the clear_session() way does not work for you, I would suggest reusing the models. If you are generating a small number of different models, you can do something like this:

def generate_models():
    models = {
        'model1': gen_model_1(),
        'model2': gen_model_2(),
    }
    for k, model in models.items():
        model.save_weights(k)
    return models

def get_blank_model(k, models):
    model = models[k]     # fetch the cached model for this key
    model.load_weights(k) # restore its saved blank weights
    return model

As long as you do not need several models of the same type in parallel, you're all good. Otherwise, please be more specific about your use case.
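A short usage sketch of that pattern (gen_model_1/gen_model_2 are hypothetical builders; the blank weights are saved to files named after the dict keys):

models = generate_models()                 # build once, snapshot blank weights
model = get_blank_model('model1', models)  # later: restore the blank weights
# ... fit / evaluate, then fetch a blank model again for the next run ...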

@thundo

thundo commented Sep 5, 2018

Keras 2.2.2, TF 1.9.0

OOM during CV validation within an inner loop, after 12 iterations. Same result whether the model is reused or recreated.

By the way... I can confirm that downgrading to Keras 2.1.6 fixes the issue.

@igorcadelima

Just came across this issue. I'm using tf 1.9.0 and its keras version 2.1.6-tf.

@thundo

thundo commented Sep 17, 2018

Is it possible to reopen this issue?

Vettejeep added a commit to Vettejeep/MSDS_686 that referenced this issue Sep 20, 2018
This should not be necessary, but it appears that Keras/TensorFlow leaks memory and the GPU eventually runs out and crashes.  Hopefully this will fix the crash (yet to be tested).  Commit for back-up purposes. Note: keras-team/keras#2102.
@kkpriyankacoding

kkpriyankacoding commented Sep 26, 2018

Is it possible to reopen this issue?

Downgrade TF to 1.8, @igorcadelima.

@phobrain
Copy link

phobrain commented Oct 3, 2018

Here is a pattern I adopted when fighting OOM that in retrospect may have caused OOM on its own:

model = load_model(...)
# predictions
del model   
K.clear_session()
model = load_model(...)
# predictions

I suspect that is why I was hitting OOM after my first del/clear_session(): deleting the model may deprive TF of info it needs to clear the session properly.

Now I am not reloading the model anyway, and the original OOM seems to be gone, maybe due to newer versions of everything. I haven't tested whether 'del model' before clear_session() caused the latest memory leak, because it takes a while, but I recommend anyone using that sort of pattern try deleting things after the clear_session():

K.clear_session()
del model
model = load_model(...)

Beware of adoption becoming maladaptation. :-)

@acidtonic

Is it possible to do this from C++?

I have the exact same problem, but with C++ code, and I am unable to release memory without fully killing the program or using cudaDeviceReset(), which works but does not allow further use of TensorFlow within the calling process.

@magnusmagnusson000

I can also confirm that downgrading to Keras 2.1.6 fixes the issue.

@zgbkdlm

zgbkdlm commented Nov 14, 2018

You can now use K.clear_session() when using TensorFlow, which will clean up everything. This is recommended if you ever create models inside a loop.

Will K.clear_session() also reset the seed set by tf.set_random_seed()?

@adimajo

adimajo commented Jan 17, 2019

Same problem.

Config:

  • Mac OS X
  • Anaconda
  • TF 1.8.0
  • Keras 2.2.0

Context: the model is overwritten and fitted several times in a for loop (I store a few key indicators at the end of the loop; I'm not interested in the model per se).

==> Without K.clear_session() -> memory leaks
==> With K.clear_session() and from Jupyter Notebook (I was told it's not the best option in conjunction with Keras / TF) -> Kernel died

Updated both (TF 1.12.0 / Keras 2.2.4) -> Problem gone.

@Bjoux2

Bjoux2 commented Mar 4, 2019

from keras import backend as K
import gc

model = ...  # build and train the model here
del model
K.clear_session()
gc.collect()

It may work.

@campellcl

I'm still seeing this issue with:
TensorFlow Version: 1.13.1
TensorFlow.keras Version: 2.2.4-tf
OS: Windows 10
TensorFlow-GPU running on: NVIDIA GTX 1080 ti

I've tried tf.keras.backend.clear_session() with no luck, still hitting RAM OOM errors eventually. I've also tried manually invoking garbage collection with no luck.

I should note that tf.keras.backend.clear_session() does result in a visible drop in RAM, but the next call to Model.fit(...) in the loop consumes more memory than was freed by the call to tf.keras.backend.clear_session(). I should also note that I am using TensorFlow datasets with one-shot iterators during training.

I haven't been able to pinpoint why this happens, but I know the problem occurs when I call Model.fit(...) on my Keras model with the two one-shot iterators in a repeated loop. If I just initialize the one-shot iterators and don't fit the Keras model (only compile it), memory usage is uniform. As soon as Model.fit(...) is called with train_ds.make_one_shot_iterator() and val_ds.make_one_shot_iterator(), I slowly leak RAM despite calling tf.keras.backend.clear_session() at the beginning of the loop.

Has anyone encountered this issue while directly fitting the Keras model to TensorFlow data generators? I'm trying not to downgrade too far due to the TensorFlow generator support in the more recent releases.

I'm working on an [mcve], but my code is still a bit lengthy to post.

@eneszv

eneszv commented Apr 28, 2019

I solved this problem by switching to Theano:

import os
os.environ['KERAS_BACKEND'] = 'theano'
from keras.models import Sequential
....

@HackerTon

(quoting @campellcl's comment above in full)

I am having exactly the problem you described. As soon as model.fit is called, memory for tuples increases.

@tianke0711

tianke0711 commented Jul 18, 2019

@tzachar I'd like to know how to add the following snippet you mentioned to my code:

import keras.backend.tensorflow_backend
if keras.backend.tensorflow_backend._SESSION:
    import tensorflow as tf
    tf.reset_default_graph()
    keras.backend.tensorflow_backend._SESSION.close()
    keras.backend.tensorflow_backend._SESSION = None

My code:

@app.before_first_request
# @app.route('/loading')
def load_resnet_model():
    print('begin to get model')
    global graph
    graph = tf.get_default_graph()
    global model_image
    img_dim = (299, 299, 3)
    num_label = 2
    input_tensor = Input(shape=img_dim)
    base_model = InceptionResNetV2(include_top=False, input_shape=img_dim, weights='imagenet')
    x = input_tensor
    x = Lambda(preprocess_input, name='preprocessing')(x)
    x = base_model(x)
    x = GlobalAveragePooling2D()(x)
    x = Dropout(0.5)(x)
    x = Dense(num_label, activation='softmax', name='softmax')(x)
    model_image = Model(input_tensor, x)

    print('finish loading model')


@app.route("/api/", methods=["POST"])
def predict_tag():
    print('beginning to predict')

    data = request.get_json()

    len_test = validation_batch.shape[0]

    for t_image in lst_main_image:
        n_fold = 5
        preds_test = np.zeros((len_test, 2), dtype=np.float)
        print('t_image:', t_image)
        tag_i_time = time.time()
        for i in range(1, 6):
            model_image.load_weights('../model/{}/main_image/{}_aug_inception.fold_{}{}.hdf5'.format(industry, industry, i, t_image))
            model_image.compile(optimizer=Adam(lr=1e-4), loss='binary_crossentropy', metrics=['accuracy'])
            test_prob = model_image.predict(validation_batch)
            preds_test += test_prob
        tag_i_e = time.time()
        print('each tag took:', t_image, tag_i_e - tag_i_time)
        preds_test /= n_fold
        y_pred = preds_test.argmax(axis=-1)
        lst_result_image.append(list(y_pred))
        print('finished predicting the tag:', t_image)

    lst_all_result = {}

    return jsonify(lst_all_result)


if __name__ == '__main__':
    app.run(debug=True)

@JivanRoquet

Not exactly sure why this issue has been closed.

What can be done to mitigate the growing loading time when calling load_model sequentially?

E.g. having ten different models that need to be loaded in memory, which means that using clear_session() is not an option here.

import keras
from keras.models import load_model
keras.backend.clear_session()

files = ['model1.h5', 'model2.h5', 'model3.h5', 'model4.h5', '...']

models = [load_model(f) for f in files]
# each model takes 30 seconds more than the previous one to load
# in particular, models 9 or 10 really take ages to load

do_something_with(models)
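Under TF 1.x, one mitigation sketch (an assumption, not a verified fix) is to load each model into its own Graph and Session, so the default graph never grows:

import tensorflow as tf
from keras.models import load_model

files = ['model1.h5', 'model2.h5', 'model3.h5']
loaded = []
for f in files:
    graph = tf.Graph()
    with graph.as_default():
        session = tf.Session()
        with session.as_default():
            loaded.append((graph, session, load_model(f)))

# Predict later by re-entering the matching graph/session:
graph, session, model = loaded[0]
with graph.as_default(), session.as_default():
    preds = model.predict(x)  # x: your input batch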

@jeremyevith

It's been 5 years and this bug is still here.

@jstiegerstanford

6 years?

@Corne173

7 years?

@oleksandr-cynamics

8 years

@Corne173

No, this is STILL an issue. Using keras.backend.clear_session() does not effectively address the memory build-up during iterative model training or loading, which eventually leads to slower performance. I train thousands of small models, and this is such a thorn in my side; it slows down my research.

I've thought of circumventing the issue by encapsulating the training in a subprocess, but that's a janky solution.

@ISipi

ISipi commented Jun 24, 2024

(quoting @Corne173's comment above in full)

That's what I ended up doing as well, about 3-4 years ago. My models were a bit complex: six networks, each receiving one image of the same object, with all networks sharing one classifier. I then ran that in 10-fold cross-validation and created ensembles of those folds, so the models got really heavy. But as long as you create some form of main function that runs inside the subprocess, and import all the necessary classes and functions inside that main function, it does work, although I had to set up environment variables as well. Like I said, though, it was 3-4 years ago and I can't remember all the details any more.

Can't believe this is still an issue though.
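A minimal sketch of that subprocess workaround (train_one_model and its config are hypothetical; the 'spawn' start method gives each worker a fresh interpreter whose memory is fully released on exit):

import multiprocessing as mp

def train_one_model(config, queue):
    # Import Keras/TF inside the worker so all framework state
    # lives and dies with the child process.
    from keras.models import Sequential
    from keras.layers import Dense

    model = Sequential()
    model.add(Dense(64, activation='relu', input_dim=config['input_dim']))
    model.add(Dense(1))
    model.compile(loss='mse', optimizer='adam')
    # ... model.fit(...) ...
    queue.put({'config': config})  # send back metrics, never the model object

if __name__ == '__main__':
    ctx = mp.get_context('spawn')
    for config in [{'input_dim': 32}, {'input_dim': 64}]:
        queue = ctx.Queue()
        p = ctx.Process(target=train_one_model, args=(config, queue))
        p.start()
        result = queue.get()  # read before join to avoid queue deadlocks
        p.join()              # all RAM/GPU memory is reclaimed on process exit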

@phobrain

phobrain commented Jun 24, 2024

I save my model, K.clear_session(), reload, then resume training. Code is open-sourced now, in case it helps.

https://github.com/phobrain/Phobrain/blob/main/pr/bin/train_brain.py

For multiple models, I'd try doing parallel saves and loads, though I'm not sure how good Python is at that.
