memory leak when using tensorflow #2102

Closed
tzachar opened this issue Mar 28, 2016 · 56 comments

Comments
@tzachar
Contributor

tzachar commented Mar 28, 2016

Hello.

When using TensorFlow, all ops are added to the global tf graph. This results in memory leaks and increasingly long compilation times when building several models one after the other in the same Python process (think IPython, cross-validation, etc.).

For now, I solve this on my end by doing the following:

import keras.backend.tensorflow_backend
if keras.backend.tensorflow_backend._SESSION:
    import tensorflow as tf
    tf.reset_default_graph()
    keras.backend.tensorflow_backend._SESSION.close()
    keras.backend.tensorflow_backend._SESSION = None

Maybe we should incorporate this into a keras.reset() function?
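A minimal sketch of what such a helper could look like, simply wrapping the workaround above (it relies on the same private Keras 1.x tensorflow_backend internals, so it is an illustration rather than an official API):

# Hypothetical helper; _SESSION is a private Keras 1.x internal.
def reset_keras():
    import keras.backend.tensorflow_backend as ktf
    if ktf._SESSION is not None:
        import tensorflow as tf
        tf.reset_default_graph()  # drop all ops from the default graph
        ktf._SESSION.close()      # release the session's memory
        ktf._SESSION = None       # force Keras to create a fresh session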

@EderSantana
Contributor

Hi, can you post some tests and profiling of what you mean? Compilation-time numbers and memory usage, for example. Anything we can reproduce would help. We could use that information later to write a PR.

@tzachar
Contributor Author

tzachar commented Mar 30, 2016

Here is sample code, and the results:

from keras.models import Sequential
from keras.layers.core import Dense, Activation
import os
import psutil
import timeit
import gc


def get_mem_usage():
    process = psutil.Process(os.getpid())
    return process.memory_info()


def build():
    model = Sequential()
    model.add(Dense(output_dim=4096, input_dim=4096, init="glorot_uniform"))
    model.add(Activation("relu"))
    model.compile(loss='categorical_crossentropy', optimizer='sgd')
    return model


if __name__ == '__main__':
    for i in xrange(10):
        gc.collect()
        t = timeit.timeit('build()', number=1, setup="from __main__ import build")
        mem = get_mem_usage()
        print('build time: {}, mem: {}'.format(t, mem))

results:

Using TensorFlow backend.
build time: 1.02965593338, mem: pmem(rss=599789568, vms=1527300096)
build time: 1.0096321106, mem: pmem(rss=1141383168, vms=2068729856)
build time: 1.03104996681, mem: pmem(rss=1682370560, vms=2610061312)
build time: 1.0659198761, mem: pmem(rss=2223833088, vms=3151384576)
build time: 1.08011817932, mem: pmem(rss=2765127680, vms=3692707840)
build time: 1.10519003868, mem: pmem(rss=3306053632, vms=4233703424)
build time: 1.13465809822, mem: pmem(rss=3847581696, vms=4775194624)
build time: 1.14798998833, mem: pmem(rss=4387577856, vms=5314605056)
build time: 1.17501521111, mem: pmem(rss=4929052672, vms=5856210944)
build time: 1.25362706184, mem: pmem(rss=5469794304, vms=6396817408)

Notice the compilation time and memory usage going up. After cleaning the default graph between iterations, these are the results:

Using TensorFlow backend.
build time: 0.988173961639, mem: pmem(rss=598212608, vms=1527754752)
build time: 0.976176023483, mem: pmem(rss=598134784, vms=1527767040)
build time: 0.973516941071, mem: pmem(rss=598507520, vms=1528115200)
build time: 0.975924968719, mem: pmem(rss=598638592, vms=1528377344)
build time: 0.975230932236, mem: pmem(rss=599068672, vms=1528639488)
build time: 0.976888895035, mem: pmem(rss=599187456, vms=1528623104)
build time: 0.978793144226, mem: pmem(rss=599056384, vms=1528639488)
build time: 0.975780010223, mem: pmem(rss=598925312, vms=1528647680)
build time: 0.977483987808, mem: pmem(rss=598794240, vms=1528639488)
build time: 0.974485874176, mem: pmem(rss=599236608, vms=1528623104)

@fchollet
Collaborator

We'll consider a clear_session backend method for TensorFlow.


@tzachar
Contributor Author

tzachar commented Mar 30, 2016

A different solution is to wrap everything (from the user's point of view) inside a:

with tf.Graph().as_default():

However, this does not play nicely with the way Keras initializes a tf session and holds it as a global from process init. A clear_session() method is needed anyway.
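For illustration, a minimal sketch of that graph-scoping pattern under the TF 1.x API (build() is the helper from the benchmark above; the explicit Session handling is an assumption, since Keras normally holds its own global session):

import tensorflow as tf

for i in range(10):
    # Each iteration gets its own graph, so ops never accumulate in the
    # default graph; memory is released when the graph goes out of scope.
    with tf.Graph().as_default():
        with tf.Session() as sess:
            model = build()
            # ... train / evaluate here ...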

@qdrk

qdrk commented Apr 27, 2016

This might be relevant #2535.

@bhack
Contributor

bhack commented May 8, 2016

We hit the same problem in a loop for a scikit-learn k-fold experiment. No problem after switching to Theano.

@leonweber

leonweber commented May 24, 2016

I ran into OOM exceptions while using KerasClassifier to sweep large hyperparameter grids with the TF backend. No problems with Theano.

@ckleban

ckleban commented Jun 26, 2016

I'm seeing this too. For me, it happens when I'm using k-folds.

@fchollet
Collaborator

You can now use K.clear_session() when using TensorFlow, which will clean up everything. This is recommended if you ever create models inside a loop.
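For reference, a minimal usage sketch, reusing the build() helper from the benchmark above:

from keras import backend as K

for i in range(10):
    model = build()      # create and compile a fresh model
    # ... fit / evaluate ...
    K.clear_session()    # destroy the graph and session before the next iteration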

@fchollet
Collaborator

fchollet commented Sep 4, 2016

You should update Keras. clear_session was added a few months ago.

@jhmeijer

jhmeijer commented Sep 8, 2016

Hi,
Yes I realized that an hour later. I have updated Keras and it works now.
Thanks for the great software!
Jeroen Meijer


@mingliking

mingliking commented Nov 2, 2017

Hi guys, after googling the tensorflow/keras memory leak for quite a long time, the most common answer is to add K.clear_session() at the end. I therefore called it on every iteration of a model-fitting loop and checked the number of graph operations (the count stays fixed). However, memory still kept increasing and finally reached almost 100%. Any ideas on this issue?

My code is like this:

import numpy as np
import tensorflow as tf
from keras import backend as K
from keras.models import Sequential
from keras.layers import LSTM, Dense

for date in date_list:
    #### data cleaning
    df = df_lstm.loc[df_lstm.index <= date]
    df_y = df['ret'] - df['ret'].mean()
    trainY = df_y[timesteps - 1:-1]
    trainX = x_transformed[:-1]
    # trainX = np.reshape(trainX, (trainX.shape[0], 1, trainX.shape[1]))
    testX = x_transformed[-timesteps:]
    testXX = np.reshape(testX, (1, testX.shape[0], testX.shape[1]))
    data_dim = trainX.shape[1]
    trainYY = np.array([[0, 1] if x <= 0 else [1, 0] for x in trainY])
    trainXX = np.array([trainX[i:i + timesteps, :] for i in range(trainX.shape[0] - timesteps + 1)])

    #### start to build models
    config = tf.ConfigProto()
    config.gpu_options.per_process_gpu_memory_fraction = 0.3
    config.gpu_options.allow_growth = True
    K.set_session(tf.Session(graph=tf.get_default_graph(), config=config))

    model = Sequential()
    model.add(LSTM(dimension_of_lstm, input_shape=(timesteps, data_dim), dropout_W=0.25, dropout_U=0.25))
    model.add(Dense(16, activation='relu'))
    model.add(Dense(8, activation='relu'))
    model.add(Dense(2, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.fit(trainXX, trainYY, batch_size=batchsize, nb_epoch=epoch_num)
    y_pred_enet = model.predict(testXX)

    del model
    # g = tf.get_default_graph()
    # print(len(g.get_operations()))
    # tried all the answers I could find at the end
    K.clear_session()
    tf.reset_default_graph()
    tf.contrib.keras.backend.clear_session()

@jhmeijer

jhmeijer commented Nov 2, 2017

Hi,

Try

from keras import backend as be
(...)
be.clear_session()

@xentity

xentity commented Jan 23, 2018

Same here. I want to use keras.wrappers.scikit_learn.KerasClassifier with sklearn.model_selection.GridSearchCV for my thesis. I have to keep reducing the number (not the values) of possible hyperparameter values.

With ~640 different combinations: 1 hour to OOM
With ~450 different combinations: 3 hours to OOM
With ~290 different combinations: 5 hours to OOM

The server is large enough (really!) and contains two Tesla K80 GPUs.

I also reduced the dataset, but no luck. If I reduce the parameters any further, grid search makes no sense anymore. And I don't see how to run clear_session with GridSearchCV without rewriting it.

Edit: if I run clear_session manually, the memory still remains like this:
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
| 0 14943 C /usr/bin/python3 324MiB |
| 0 53052 C /usr/bin/python3 10588MiB | <--- my process
| 1 14943 C /usr/bin/python3 368MiB |
| 1 53052 C /usr/bin/python3 10506MiB | <--- my process
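One workaround sketch, under the assumption that a manual loop is acceptable: replace GridSearchCV with a loop over sklearn's ParameterGrid so clear_session() can run between fits (build_fn, the data variables, and the grid below are hypothetical):

from sklearn.model_selection import ParameterGrid
from keras import backend as K

param_grid = {'units': [32, 64], 'lr': [1e-3, 1e-4]}

best_loss, best_params = None, None
for params in ParameterGrid(param_grid):
    model = build_fn(**params)                      # hypothetical model builder
    model.fit(X_train, y_train, epochs=5, verbose=0)
    loss = model.evaluate(X_val, y_val, verbose=0)  # scalar loss when no extra metrics
    if best_loss is None or loss < best_loss:
        best_loss, best_params = loss, params
    K.clear_session()                               # free the graph before the next fit

print(best_params, best_loss)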

@MyVanitar

We are now using Keras 2.1.5 and the problem still exists; it is not resolved by K.clear_session().

@talpay
Contributor

talpay commented Jun 9, 2018

With TF 1.8 and Keras 2.2.0, K.clear_session() leads to Process finished with exit code 139 (interrupted by signal 11: SIGSEGV) when used in a context such as #4417.

@MyVanitar

@talpay

I have Keras 2.2 and TF 1.8 but I don't see that error. Try installing it with conda install -c hesi_m keras, which installs both Keras 2.2 and TF 1.8, and do not mix it with pip. It might solve the case.

@talpay
Contributor

talpay commented Jun 9, 2018

@VanitarNordic
It's definitely not a package-management issue, and I've recreated it with some of the Keras example code. Have you tested it with a TensorBoard callback that has histogram_freq=1? It only happens when training multiple models in a loop with the TensorBoard callback and then calling K.clear_session() (which is necessary, as pointed out in the issue above).

@BluerBlack

I can confirm that with TF 1.8 and Keras 2.2.0, K.clear_session() leads to a crash. The same code on TF 1.8 and Keras 2.1.6 works correctly.

@MyVanitar

@BluerBlack

We have to use it; it's the only way to get consistent results when the code is inside a loop. I faced the crash too, but I did not know it was caused by that, because it was not generating any error.

@BluerBlack

BluerBlack commented Jun 10, 2018

@VanitarNordic

I know. I'm using it for the same reason (GridSearchCV). It's crashing for me without any message too (once I got a message that the program tried to do something with memory address 0).
K.clear_session() consistently crashes on the 3rd call for me, and I'm also using the TensorBoard callback, but with histogram_freq=0.

@MyVanitar

@BluerBlack

Exactly, it happens on the third iteration! Funny. I had to downgrade to 2.1.6 as well.

@skozlovf

We also have memory leaks when using Keras + TensorFlow. There are multiple places where it consumes RAM and doesn't free it afterwards. We create models in a loop; after some time it consumes all free memory, for example, on a server it takes all 132 GB. clear_session() doesn't help.

ENVs:
Ubuntu 16.04.4, python 2.7.15 (Anaconda)
Linux Mint 18.2, python 2.7.9
tensorflow 1.8.0
Keras 2.2.0

Here is a demo script with one of the leak cases (requires objgraph and psutil):

from __future__ import print_function
import os, sys, gc
import objgraph, psutil
from keras.layers import Input, Dense
from keras.models import Model
from keras.regularizers import l2
from keras import backend as K

data = []
ps = psutil.Process(os.getpid())
getrss = lambda: ps.memory_info().rss / 1024 / 1024


def simple():
    data.append(['sdsds'] * 1000000)


def model():
    coef = l2(0.0005)
    input_data = Input(shape=(33,))
    enc_layer = Dense(40, activation='relu', kernel_regularizer=coef)
    dec_layer = Dense(33, activation='linear', kernel_regularizer=coef)
    enc = enc_layer(input_data)
    dec = dec_layer(enc)
    dae = Model(inputs=input_data, outputs=dec)
    # K.clear_session()


def print_obj(title, limit=None):
    print('\n' + title)
    objgraph.show_growth(limit=limit)
    print('')


def main(func, show_obj, iterations=10):
    print('ITERATIONS:', iterations)
    start = getrss()
    print('MEM BEFORE RUN:', start)

    if show_obj: print_obj('OBJECTS BEFORE RUN:', 3)

    # Do something ...
    for _ in range(iterations):
        func()

    print('MEM AFTER RUN:', getrss())

    global data
    del data[:]
    print('GC COUNT: ', gc.collect())

    end = getrss()

    if show_obj: print_obj('OBJECTS AFTER RUN:')

    delta = end - start
    print('MEM AFTER GC: {} (leak: {})'.format(end, delta))


# USAGE: KERAS_BACKEND=tensorflow python memtest.py [num_iterations] [simple] [showobj]
if __name__ == '__main__':
    func = simple if 'simple' in sys.argv else model
    show_obj = 'showobj' in sys.argv
    iterations = next((int(x) for x in sys.argv if x.isdigit()), 10)
    main(func, show_obj, iterations)

Output:

$ KERAS_BACKEND=tensorflow python memtest.py
Using TensorFlow backend.
ITERATIONS: 10
MEM BEFORE RUN: 158
MEM AFTER RUN: 166
GC COUNT:  49
MEM AFTER GC: 166 (leak: 8)

Similar issue: tensorflow/tensorflow#10408

Is there a way to fix that?

@tzachar
Contributor Author

tzachar commented Jun 26, 2018

First, I suggest you dial down your tone a bit. This is not the place to troll.

As for a fix, if the clear_session() way does not work for you, I would suggest reusing the models. If you are generating a small number of different models, you can do something like this:

def generate_models():
    models = {
        'model1': gen_model_1(),
        'model2': gen_model_2(),
    }
    for k, model in models.items():
        model.save_weights(k)
    return models

def get_blank_model(k, models):
    model = models[k]     # fetch the cached model for this key
    model.load_weights(k) # restore its saved blank weights
    return model

As long as you do not need several models of the same type in parallel, you're all good. Otherwise, please be more specific about your use case.
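A short usage sketch of that pattern (gen_model_1/gen_model_2 are hypothetical builders; the blank weights are saved to files named after the dict keys):

models = generate_models()                 # build once, snapshot blank weights
model = get_blank_model('model1', models)  # later: restore the blank weights
# ... fit / evaluate, then fetch a blank model again for the next run ...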

@thundo

thundo commented Sep 5, 2018

Keras 2.2.2, TF 1.9.0

OOM during CV validation within an inner loop, after 12 iterations. Same result whether the model is reused or recreated.

By the way... I can confirm that downgrading to Keras 2.1.6 fixes the issue.

@igorcadelima

Just came across this issue. I'm using tf 1.9.0 and its keras version 2.1.6-tf.

@thundo

thundo commented Sep 17, 2018

Is it possible to reopen this issue?

Vettejeep added a commit to Vettejeep/MSDS_686 that referenced this issue Sep 20, 2018
This should not be necessary, but it appears that Keras/TensorFlow leaks memory and the GPU eventually runs out and crashes.  Hopefully this will fix the crash (yet to be tested).  Commit for back-up purposes. Note: keras-team/keras#2102.
@kkpriyankacoding

kkpriyankacoding commented Sep 26, 2018

Is it possible to reopen this issue?

Downgrade TF to 1.8, @igorcadelima.

@phobrain
Copy link

phobrain commented Oct 3, 2018

Here is a pattern I adopted when fighting OOM that in retrospect may have caused OOM on its own:

model = load_model(...)
# predictions
del model   
K.clear_session()
model = load_model(...)
# predictions

I suspect that is why I was hitting OOM after my first del/clear_session(): deleting the model may deprive TF of info it needs to clear the session properly.

Now I am not reloading the model anyway, and the original OOM seems to be gone, maybe due to newer versions of everything. I haven't tested whether 'del model' before clear_session() caused the latest memory leak, because it takes a while, but I recommend anyone using that sort of pattern try deleting things after the clear_session():

K.clear_session()
del model
model = load_model(...)

Beware of adoption becoming maladaptation. :-)

@acidtonic

Is it possible to do this from C++?

I have the exact same problem, but with C++ code, and I am unable to release memory without fully killing the program or using cudaDeviceReset(), which works but does not allow further use of TensorFlow within the calling process.

@magnusmagnusson000

I can also confirm that downgrading to Keras 2.1.6 fixes the issue.

@zgbkdlm

zgbkdlm commented Nov 14, 2018

You can now use K.clear_session() when using TensorFlow, which will clean up everything. This is recommended if you ever create models inside a loop.

Will K.clear_session() also reset the seed set by tf.set_random_seed()?

@adimajo

adimajo commented Jan 17, 2019

Same problem.

Config:

  • Mac OS X
  • Anaconda
  • TF 1.8.0
  • Keras 2.2.0

Context: the model is overwritten and fitted several times in a for loop (I store a few key indicators at the end of the loop; I'm not interested in the model per se).

==> Without K.clear_session() -> memory leaks
==> With K.clear_session() and from Jupyter Notebook (I was told it's not the best option in conjunction with Keras / TF) -> Kernel died

Updated both (TF 1.12.0 / Keras 2.2.4) -> Problem gone.

@Bjoux2

Bjoux2 commented Mar 4, 2019

from keras import backend as K
import gc

model = ...  # build and train the model here
del model
K.clear_session()
gc.collect()

It may work.

@campellcl

I'm still seeing this issue with:
TensorFlow Version: 1.13.1
TensorFlow.keras Version: 2.2.4-tf
OS: Windows 10
TensorFlow-GPU running on: NVIDIA GTX 1080 ti

I've tried tf.keras.backend.clear_session() with no luck, still hitting RAM OOM errors eventually. I've also tried manually invoking garbage collection with no luck.

I should note that tf.keras.backend.clear_session() does result in a visible drop in RAM, but the next call to Model.fit(...) in the loop consumes more memory than was freed by the call to tf.keras.backend.clear_session(). I should also note that I am using TensorFlow datasets with one-shot iterators during training.

I haven't been able to pinpoint why this happens, but I know the problem occurs when I call Model.fit(...) on my Keras model with the two one-shot iterators in a repeated loop. If I just initialize the one-shot iterators and don't fit the Keras model (only compile it), memory usage is uniform. As soon as Model.fit(...) is called with train_ds.make_one_shot_iterator() and val_ds.make_one_shot_iterator(), I slowly leak RAM despite calling tf.keras.backend.clear_session() at the beginning of the loop.

Has anyone encountered this issue while directly fitting the Keras model to TensorFlow data generators? I'm trying not to downgrade too far due to the TensorFlow generator support in the more recent releases.

I'm working on an [mcve], but my code is still a bit lengthy to post.

@eneszv

eneszv commented Apr 28, 2019

I solved this problem by switching to Theano:

import os
os.environ['KERAS_BACKEND'] = 'theano'
from keras.models import Sequential
....

@HackerTon

(quoting @campellcl's comment above in full)

I am having exactly the problem you described. As soon as model.fit is called, memory for tuples increases.

@tianke0711

tianke0711 commented Jul 18, 2019

@tzachar I'd like to know how to add the following snippet you mentioned to my code:

import keras.backend.tensorflow_backend
if keras.backend.tensorflow_backend._SESSION:
    import tensorflow as tf
    tf.reset_default_graph()
    keras.backend.tensorflow_backend._SESSION.close()
    keras.backend.tensorflow_backend._SESSION = None

My code:

@app.before_first_request
# @app.route('/loading')
def load_resnet_model():
    print('begin to get model')
    global graph
    graph = tf.get_default_graph()
    global model_image
    img_dim = (299, 299, 3)
    num_label = 2
    input_tensor = Input(shape=img_dim)
    base_model = InceptionResNetV2(include_top=False, input_shape=img_dim, weights='imagenet')
    x = input_tensor
    x = Lambda(preprocess_input, name='preprocessing')(x)
    x = base_model(x)
    x = GlobalAveragePooling2D()(x)
    x = Dropout(0.5)(x)
    x = Dense(num_label, activation='softmax', name='softmax')(x)
    model_image = Model(input_tensor, x)

    print('finish loading model')


@app.route("/api/", methods=["POST"])
def predict_tag():
    print('beginning to predict')

    data = request.get_json()

    len_test = validation_batch.shape[0]

    for t_image in lst_main_image:
        n_fold = 5
        preds_test = np.zeros((len_test, 2), dtype=np.float)
        print('t_image:', t_image)
        tag_i_time = time.time()
        for i in range(1, 6):
            model_image.load_weights('../model/{}/main_image/{}_aug_inception.fold_{}{}.hdf5'.format(industry, industry, i, t_image))
            model_image.compile(optimizer=Adam(lr=1e-4), loss='binary_crossentropy', metrics=['accuracy'])
            test_prob = model_image.predict(validation_batch)
            preds_test += test_prob
        tag_i_e = time.time()
        print('each tag took:', t_image, tag_i_e - tag_i_time)
        preds_test /= n_fold
        y_pred = preds_test.argmax(axis=-1)
        lst_result_image.append(list(y_pred))
        print('finished predicting the tag:', t_image)

    lst_all_result = {}

    return jsonify(lst_all_result)


if __name__ == '__main__':
    app.run(debug=True)

@JivanRoquet

Not exactly sure why this issue has been closed.

What can be done to mitigate the growing loading time when calling load_model sequentially?

E.g. having ten different models that need to be loaded in memory, which means that using clear_session() is not an option here.

import keras
from keras.models import load_model
keras.backend.clear_session()

files = ['model1.h5', 'model2.h5', 'model3.h5', 'model4.h5', '...']

models = [load_model(f) for f in files]
# each model takes 30 seconds more than the previous one to load
# in particular, models 9 or 10 really take ages to load

do_something_with(models)
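Under TF 1.x, one mitigation sketch (an assumption, not a verified fix) is to load each model into its own Graph and Session, so the default graph never grows:

import tensorflow as tf
from keras.models import load_model

files = ['model1.h5', 'model2.h5', 'model3.h5']
loaded = []
for f in files:
    graph = tf.Graph()
    with graph.as_default():
        session = tf.Session()
        with session.as_default():
            loaded.append((graph, session, load_model(f)))

# Predict later by re-entering the matching graph/session:
graph, session, model = loaded[0]
with graph.as_default(), session.as_default():
    preds = model.predict(x)  # x: your input batch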

@jeremyevith

It's been 5 years and this bug is still here.

@jstiegerstanford

6 years?

@Corne173

7 years?

@oleksandr-cynamics

8 years

@Corne173

No, this is STILL an issue. Using keras.backend.clear_session() does not effectively address the memory build-up during iterative model training or loading, which eventually leads to slower performance. I train thousands of small models, and this is such a thorn in my side; it slows down my research.

I've thought of circumventing the issue by encapsulating the training in a subprocess, but that's a janky solution.

@ISipi

ISipi commented Jun 24, 2024

(quoting @Corne173's comment above in full)

That's what I ended up doing as well, about 3-4 years ago. My models were a bit complex: six networks, each receiving one image of the same object, with all networks sharing one classifier. I then ran that in 10-fold cross-validation and created ensembles of those folds, so the models got really heavy. But as long as you create some form of main function that runs inside the subprocess, and import all the necessary classes and functions inside that main function, it does work, although I had to set up environment variables as well. Like I said, though, it was 3-4 years ago and I can't remember all the details any more.

Can't believe this is still an issue though.
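A minimal sketch of that subprocess workaround (train_one_model and its config are hypothetical; the 'spawn' start method gives each worker a fresh interpreter whose memory is fully released on exit):

import multiprocessing as mp

def train_one_model(config, queue):
    # Import Keras/TF inside the worker so all framework state
    # lives and dies with the child process.
    from keras.models import Sequential
    from keras.layers import Dense

    model = Sequential()
    model.add(Dense(64, activation='relu', input_dim=config['input_dim']))
    model.add(Dense(1))
    model.compile(loss='mse', optimizer='adam')
    # ... model.fit(...) ...
    queue.put({'config': config})  # send back metrics, never the model object

if __name__ == '__main__':
    ctx = mp.get_context('spawn')
    for config in [{'input_dim': 32}, {'input_dim': 64}]:
        queue = ctx.Queue()
        p = ctx.Process(target=train_one_model, args=(config, queue))
        p.start()
        result = queue.get()  # read before join to avoid queue deadlocks
        p.join()              # all RAM/GPU memory is reclaimed on process exit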

@phobrain

phobrain commented Jun 24, 2024

I save my model, K.clear_session(), reload, then resume training. Code is open-sourced now, in case it helps.

https://github.com/phobrain/Phobrain/blob/main/pr/bin/train_brain.py

For multiple models, I'd try doing parallel saves and loads, though I'm not sure how good Python is at that.
