Resuming boosting gives GPU Invalid Device Error #3870
Comments
Could you please pin `gpu_platform_id` and `gpu_device_id` in your params?
Thanks for the suggestion! I've actually never encountered these params before. I'm running on a machine with 3 NVIDIA V100 GPUs, so I just set both of these params to 0 and added them to my params dict.
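A minimal sketch of what pinning those two parameters looks like in the params dict passed to `lgb.train`; the synthetic data and the `0`/`0` ids are assumptions here, and the right ids depend on how OpenCL enumerates the machine's platforms and devices:

```python
import numpy as np
import lightgbm as lgb

# Stand-in data; the real script's x_train / y_train would go here.
rng = np.random.default_rng(0)
x_train = rng.standard_normal((1_000, 20))
y_train = rng.standard_normal(1_000)

params = {
    "objective": "regression",
    "device_type": "gpu",
    "gpu_platform_id": 0,  # pin the OpenCL platform
    "gpu_device_id": 0,    # pin the device within that platform
}

ds_train = lgb.Dataset(x_train, y_train)
gbm = lgb.train(params, ds_train, num_boost_round=100)
```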
Yeah, you did everything right. It is a pity that those params didn't help. My last guess is to try re-creating the training Dataset one more time before executing the second round of training.
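In code, that suggestion amounts to building a fresh `lgb.Dataset` immediately before the second `lgb.train` call, roughly as sketched below (the data and parameter values are placeholders, not the script's real ones):

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
x_train = rng.standard_normal((1_000, 20))
y_train = rng.standard_normal(1_000)
params = {"objective": "regression", "device_type": "gpu",
          "gpu_platform_id": 0, "gpu_device_id": 0}

# First round of boosting.
ds_train = lgb.Dataset(x_train, y_train)
gbm = lgb.train(params, ds_train, num_boost_round=100)

# The suggestion: re-create the training Dataset from the raw arrays right
# before continuing, instead of reusing the Dataset from the first call.
ds_train = lgb.Dataset(x_train, y_train)
gbm = lgb.train(params, ds_train, num_boost_round=10, init_model=gbm)
```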
Thanks for your help on this! Re-creating the Dataset before the second call still produces the same error, unfortunately. I refactored the code to do the training and save the model in one function (so that the original booster goes out of scope), and then in the second bout of training I load the model from disk and train with that. This does end up training successfully. So it seems that the booster object is somehow locking the GPU and preventing any other object from getting a handle to the device while the first booster still exists. The code below does get around the problem, but do you think there is a more permanent fix for this? Ideally, code that works on the CPU would "just work" on the GPU with the change of a command line flag, and wouldn't need to be refactored and have these extra (de)serialisation steps introduced. Also open to other ideas to get this example running on the GPU.

```python
import time

import joblib
import lightgbm as lgb


def do_initial_training(train_start, args, params, x_train, y_train):
    # Train the first booster, then persist it so the live booster object
    # can go out of scope before any further training happens.
    ds_train = lgb.Dataset(x_train, y_train.ravel())
    start = time.perf_counter()
    gbm = lgb.train(
        params,
        ds_train,
        num_boost_round=args.n_boosting_rounds,
        keep_training_booster=args.keep_training_booster,
    )
    end = time.perf_counter()
    print(f"Initial training finished: {end - start}s", flush=True)
    # persist_model_to_disk is a helper defined elsewhere in the script.
    persist_model_to_disk(gbm, args.model_format.lower(), train_start)


def do_more_training(train_start, args, params, model_file, x_train, y_train):
    # model_file is the path written by persist_model_to_disk above.
    print("Doing more training...", flush=True)
    ds_train = lgb.Dataset(x_train, y_train.ravel())
    with open(model_file, "rb") as fin:
        gbm = joblib.load(fin)
    start = time.perf_counter()
    gbm = lgb.train(
        params,
        ds_train,
        num_boost_round=10,
        init_model=gbm,
    )
    end = time.perf_counter()
    print(f"Extra training finished: {end - start}s", flush=True)
    persist_model_to_disk(gbm, args.model_format.lower(), train_start)
```
@CHDev93 Thank you very much for the detailed explanation and possible workaround! ping @huanzhang12 for further investigation
How are you using LightGBM?
LightGBM component: Python package
Environment info
Operating System: Windows 10
CPU/GPU model: GPU
C++ compiler version: NA
CMake version: NA
Java version: NA
Python version: 3.6.6
R version: NA
Other: NA
LightGBM version or commit hash: 3.1.0
Error message and / or logs
When training a booster on the GPU and then attempting to continue training with it, I get an error indicating the GPU is unavailable. I often see the same error if I try to launch two LightGBM jobs using the same GPU from different shells, so it seems that one booster somehow locks the entire GPU and doesn't release it until the program ends.
Reproducible example(s)
Below is a simple example producing the error. Switching the device type to CPU, this program runs to completion without issue.
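A minimal sketch along those lines (the synthetic data and the `0`/`0` GPU ids are assumptions): training once on the GPU and then calling `lgb.train` again with `init_model` while the first booster is still in memory.

```python
import numpy as np
import lightgbm as lgb

# Synthetic regression data stands in for the real training set.
rng = np.random.default_rng(0)
x_train = rng.standard_normal((10_000, 50))
y_train = rng.standard_normal(10_000)

params = {
    "objective": "regression",
    "device_type": "gpu",   # switching this to "cpu" runs to completion
    "gpu_platform_id": 0,
    "gpu_device_id": 0,
}

# First round of boosting on the GPU.
ds_train = lgb.Dataset(x_train, y_train)
gbm = lgb.train(params, ds_train, num_boost_round=100)

# Continue boosting from the in-memory booster: on the GPU this second call
# fails with the invalid device error while the first booster is still alive.
ds_train_2 = lgb.Dataset(x_train, y_train)
gbm = lgb.train(params, ds_train_2, num_boost_round=10, init_model=gbm)
```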