OutOfMemoryError while training the cross-encoder #22

Closed
kunalr97 opened this issue Nov 27, 2023 · 6 comments

@kunalr97

kunalr97 commented Nov 27, 2023

train_args = CrossEncoderTrainingArgs(num_train_epochs = 5)

rr = CrossEncoderReranker()
output_dir = f'../outputs/{label2dict[label]}_index/cross_encoder_training/'

rr.fit(
    train_dataset = train,
    val_dataset = val,
    output_dir= output_dir,
    training_args = train_args,
    show_progress_bar = False
)

When I try to train the cross-encoder on the BRONCO dataset to predict ICD codes for the diagnosis entities, I get this error:

OutOfMemoryError: CUDA out of memory. Tried to allocate 768.00 MiB (GPU 0; 15.77 GiB total capacity; 14.34 GiB already allocated; 379.12 MiB free; 15.03 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I tried running the following lines, but they did not help. No other processes are running on the GPU.

import torch
torch.cuda.empty_cache()

Thanks in advance for your help.
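
The figures in the error message quoted above can be checked with a few lines of arithmetic. The sketch below is not part of PyTorch or xMEN: `parse_oom_message` is a hypothetical helper, and it assumes the exact message format shown in this thread. It extracts the reported sizes and compares reserved memory against allocated memory, which is the condition the message itself names for fragmentation.

```python
import re

# Hypothetical helper (not part of PyTorch or xMEN): extract the sizes
# reported in a CUDA OOM message of the format quoted in this thread.
UNIT_TO_MIB = {"MiB": 1.0, "GiB": 1024.0}


def parse_oom_message(message: str) -> dict:
    """Return the sizes named in the message, converted to MiB."""
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) (MiB|GiB)",
        "total": r"([\d.]+) (MiB|GiB) total capacity",
        "allocated": r"([\d.]+) (MiB|GiB) already allocated",
        "free": r"([\d.]+) (MiB|GiB) free",
        "reserved": r"([\d.]+) (MiB|GiB) reserved",
    }
    sizes = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, message)
        if match is None:
            raise ValueError(f"could not find '{name}' in message")
        value, unit = match.groups()
        sizes[name] = float(value) * UNIT_TO_MIB[unit]
    return sizes


message = (
    "OutOfMemoryError: CUDA out of memory. Tried to allocate 768.00 MiB "
    "(GPU 0; 15.77 GiB total capacity; 14.34 GiB already allocated; "
    "379.12 MiB free; 15.03 GiB reserved in total by PyTorch)"
)

sizes = parse_oom_message(message)
cached_unused = sizes["reserved"] - sizes["allocated"]  # cached but not in use
ratio = sizes["reserved"] / sizes["allocated"]

print(f"requested: {sizes['requested']:.0f} MiB, free: {sizes['free']:.0f} MiB")
print(f"reserved but unallocated: {cached_unused:.0f} MiB")
print(f"reserved / allocated: {ratio:.2f}")
```

Reserved memory is only about 5% above allocated memory here, which suggests the card is mostly occupied by live tensors rather than fragmented. Under that reading, shrinking the workload is likely to help more than tuning max_split_size_mb.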

@phlobo
Member

phlobo commented Nov 27, 2023

Hello!

The cross-encoder is indeed quite memory intensive (I tested everything with 48GB GPU memory). Two things that might work:

  1. I'm not sure whether empty_cache() clears all memory allocated by SapBERT. You might instead want to save the candidate dataset to disk and restart the process / notebook to make sure CUDA memory is entirely freed.

  2. You can reduce the memory footprint of the cross-encoder by reducing the number of candidates subject to re-ranking (which equals the batch size) to something like 16 instead of 64.
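
The first suggestion amounts to persisting the candidates once and then training from a fresh process. A minimal sketch of that workflow follows. It is a stand-in, not xMEN code: the record layout, the file name, and the helpers `save_candidates` / `load_candidates` are hypothetical, and plain JSON is used only so the example runs without extra dependencies. If the candidates are held in a Hugging Face `datasets` object, `save_to_disk` and `datasets.load_from_disk` would be the usual equivalents.

```python
import json
import tempfile
from pathlib import Path


def save_candidates(records: list, path: Path) -> None:
    """Write candidate records to disk, one JSON object per line."""
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def load_candidates(path: Path) -> list:
    """Read the records back, e.g. in a freshly started process."""
    with path.open("r", encoding="utf-8") as f:
        return [json.loads(line) for line in f]


# Hypothetical candidate records: a mention with ranked placeholder IDs.
candidates = [
    {"mention": "mention A", "candidate_ids": ["c1", "c2", "c3"]},
    {"mention": "mention B", "candidate_ids": ["c4", "c5"]},
]

# Step 1 (process that ran candidate generation): persist the result.
out_dir = Path(tempfile.mkdtemp())
path = out_dir / "candidates.jsonl"
save_candidates(candidates, path)

# Step 2 (after restarting the process / notebook): reload the records
# and train the cross-encoder without the candidate generator in memory.
restored = load_candidates(path)
print(len(restored), "records restored")
```

Restarting between the two steps is the point: a new process starts with no CUDA allocations, so nothing left over from candidate generation competes with the cross-encoder.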

@phlobo
Member

phlobo commented Nov 27, 2023

Another thing that might work (though I have not tested the performance) would be to use a smaller BERT model, e.g.,

train_args = CrossEncoderTrainingArgs(model_name="distilbert-base-multilingual-cased")

@kunalr97
Author

Hi,
Thanks for your quick response. I will try this and hope that it works. Where exactly do I need to do this?

2. You can reduce the memory footprint of the cross-encoder by reducing the number of candidates subject to re-ranking (which equals the batch size) to something like 16 instead of 64.

Thanks in advance

@phlobo
Member

phlobo commented Nov 27, 2023

There are multiple steps at which you can reduce the number of candidates. However, if you follow this notebook (https://github.com/hpi-dhc/xmen/blob/main/examples/02_BRONCO.ipynb), then setting K_RERANKING = 16 just before calling CrossEncoderReranker.prepare_data should do the trick.

Note: I assume that this will cost you a bit of recall@1, but it might actually increase precision. To get precision, recall and F1 scores at the end, use evaluate instead of evaluate_at_k.
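
To make the trade-off in the note concrete: truncating each candidate list to the top k drops any gold concept ranked below k, which lowers the ceiling on recall while shrinking the batch the cross-encoder has to score. The sketch below uses made-up ranked lists and hypothetical helpers (`truncate_candidates`, `recall_ceiling`). It does not call xMEN, and `K_RERANKING` is borrowed from the notebook mentioned above only as a variable name.

```python
K_RERANKING = 16  # value suggested in this thread (instead of 64)


def truncate_candidates(ranked_ids, k):
    """Keep only the k highest-ranked candidates for one mention."""
    return ranked_ids[:k]


def recall_ceiling(candidate_lists, gold_ids):
    """Fraction of mentions whose gold concept is still among the candidates."""
    hits = sum(gold in cands for cands, gold in zip(candidate_lists, gold_ids))
    return hits / len(gold_ids)


# Toy data: 64 ranked candidates per mention; all IDs are placeholders.
ranked = [[f"c{i}" for i in range(64)] for _ in range(4)]
gold = ["c0", "c3", "c15", "c40"]  # gold concept at ranks 1, 4, 16 and 41

full = recall_ceiling(ranked, gold)
truncated_lists = [truncate_candidates(r, K_RERANKING) for r in ranked]
truncated = recall_ceiling(truncated_lists, gold)

print(f"recall ceiling with 64 candidates: {full:.2f}")
print(f"recall ceiling with {K_RERANKING} candidates: {truncated:.2f}")
print(f"candidates scored per mention shrink by {64 // K_RERANKING}x")
```

This only illustrates the upper bound on recall. How recall@1 and precision actually change depends on the trained re-ranker and on where the gold concepts fall in the real candidate lists.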

@kunalr97
Author

Thanks a lot! I don't get that error now.

@phlobo
Member

phlobo commented Nov 27, 2023

Thank you for pointing this issue out. I have linked this thread in the README.

@phlobo phlobo added documentation Improvements or additions to documentation question Further information is requested labels Nov 28, 2023
@phlobo phlobo self-assigned this Dec 4, 2023