
Resuming boosting gives GPU Invalid Device Error #3870

Open
Tracked by #5153
CHDev93 opened this issue Jan 27, 2021 · 5 comments

CHDev93 commented Jan 27, 2021

How are you using LightGBM?

LightGBM component: Python package

Environment info

Operating System: Windows 10

CPU/GPU model: GPU

C++ compiler version: NA

CMake version: NA

Java version: NA

Python version: 3.6.6

R version: NA

Other: NA

LightGBM version or commit hash: 3.1.0

Error message and / or logs

When training a booster on the GPU and then attempting to continue training with it, I get an error indicating the GPU is unavailable. I often see the same error if I try to launch two LightGBM jobs on the same GPU from different shells, so it seems that one booster somehow locks the entire GPU and doesn't release it until the program ends.

[LightGBM] [Debug] Trained a tree with leaves = 50 and max_depth = 8
CASEY: Finished final boosting iteration
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 63750
[LightGBM] [Info] Number of data points in the train set: 300000, number of used features: 250
Traceback (most recent call last):
  File ".\lgbm_large_boosting_repro.py", line 138, in <module>
    main(args)
  File ".\lgbm_large_boosting_repro.py", line 75, in main
    init_model=gbm,
  File "D:\chdev\.virtualenv\tf_1_12_gpu\lib\site-packages\lightgbm\engine.py", line 231, in train
    booster = Booster(params=params, train_set=train_set)
  File "D:\chdev\.virtualenv\tf_1_12_gpu\lib\site-packages\lightgbm\basic.py", line 2061, in __init__
    ctypes.byref(self.handle)))
  File "D:\chdev\.virtualenv\tf_1_12_gpu\lib\site-packages\lightgbm\basic.py", line 55, in _safe_call
    raise LightGBMError(decode_string(_LIB.LGBM_GetLastError()))
lightgbm.basic.LightGBMError: Invalid Device

Reproducible example(s)

Below is a simple example producing the error. Switching the device type to CPU, this program runs to completion without issue.

import time

import lightgbm as lgb
import numpy as np

print(f"LGBM version: {lgb.__version__}", flush=True)
n = int(3e6)
m = 250
max_bin = 255
max_leaves = 50
n_boosting_rounds = 50

x_train = np.random.randn(n, m).astype(np.float32)
A = np.random.randint(-5, 5, size=(m, 1))
y_train = (x_train @ A).astype(np.float32)

params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': ['rmse'],
    'device': 'gpu',
    'num_leaves': max_leaves,
    'bagging_fraction': 0.5,
    'feature_fraction': 0.5,
    'learning_rate': 0.001,
#     'num_threads': 20,
    'verbose': 2,
    'max_bin': max_bin,
}

ds_train = lgb.Dataset(x_train, y_train.ravel(), free_raw_data=False)  # seems we need this for continued training
start = time.perf_counter()
gbm = lgb.train(
    params,
    ds_train,
    num_boost_round=n_boosting_rounds,
    keep_training_booster=True,
)

################## TRAIN SOME MORE ###################
gbm = lgb.train(
    params,
    ds_train,
    num_boost_round=10,
    init_model=gbm,
)
#######################################################
StrikerRUS (Collaborator) commented:

Could you please pin gpu_platform_id and gpu_device_id params to some non-default values?
https://lightgbm.readthedocs.io/en/latest/Parameters.html#gpu-parameters
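For example, something like this in the params dict (just a minimal sketch; the right ids depend on your OpenCL platform and device layout, so treat 0/0 as a starting point):

params = {
    'objective': 'regression',
    'device': 'gpu',
    'gpu_platform_id': 0,  # OpenCL platform id; default -1 means the system-wide default platform
    'gpu_device_id': 0,    # device id within that platform; default -1 means the default device
}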

CHDev93 (Author) commented Jan 27, 2021

Thanks for the suggestion! I've actually never encountered these params before.

I'm running on a machine with 3 NVIDIA V100 GPUs, so I just set both of these params to 0. CUDA_VISIBLE_DEVICES is also set to "0" before lightgbm is imported, so it should only be seeing one of the GPUs anyway. My understanding of the docs you linked makes it seem that if all the GPUs are homogeneous, this should be fine.

I added these to my params dict in the MWE above and am still seeing the same error. Is my understanding of what to set gpu_platform_id and gpu_device_id to correct, given the setup described?
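
For reference, this is roughly how the device is being pinned at the top of the script (sketch; the rest of the imports are omitted):

import os

# Must be set before lightgbm is imported so the process only sees GPU 0
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import lightgbm as lgb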

StrikerRUS (Collaborator) commented:

Yeah, you did everything right. It is a pity that those params didn't help.

My last guess is to try re-creating the training Dataset one more time before calling the train function, because some of your params are used during Dataset initialization and cannot be changed after that.

################## TRAIN SOME MORE ###################
ds_train = lgb.Dataset(x_train, y_train.ravel(), free_raw_data=False)  # add this
gbm = lgb.train(
    params,
    ds_train,
    num_boost_round=10,
    init_model=gbm,
)
#######################################################

CHDev93 (Author) commented Jan 27, 2021

Thanks for your help on this! Adding the line above still produces the same error unfortunately.

I refactored the code to do the training and save the model in one function (so that the original booster goes out of scope), and then in the second round of training I load the model from disk and train with that. This does end up training successfully. So it seems that the booster object is somehow locking the GPU and preventing any other object from getting a handle to the device while the first booster still exists.

The code below does get around the problem, but do you think there is a more permanent fix for this? Ideally, code that works on the CPU would "just work" on the GPU with the change of a command-line flag, without needing to be refactored to introduce these extra (de)serialisation steps.

I'm also open to other ideas for getting this example running on the GPU.

def do_initial_training(train_start, args, params, x_train, y_train):
    ds_train = lgb.Dataset(x_train, y_train.ravel())
    start = time.perf_counter()
    gbm = lgb.train(
        params,
        ds_train,
        num_boost_round=args.n_boosting_rounds,
        keep_training_booster=args.keep_training_booster,
    )
    end = time.perf_counter()
    print(f"Initial training finished: {end - start}s", flush=True)
    persist_model_to_disk(gbm, args.model_format.lower(), train_start)


def do_more_training(train_start, args, params, x_train, y_train, model_file):
    print("Doing more training...", flush=True)
    ds_train = lgb.Dataset(x_train, y_train.ravel())
    with open(model_file, 'rb') as fin:
        gbm = joblib.load(fin)
    start = time.perf_counter()
    gbm = lgb.train(
        params,
        ds_train,
        num_boost_round=10,
        init_model=gbm,
    )
    end = time.perf_counter()
    print(f"Extra training finished: {end - start}s", flush=True)
    persist_model_to_disk(gbm, args.model_format.lower(), train_start)
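
persist_model_to_disk is omitted above for brevity; a simplified sketch of it is below (the default model_file name is a placeholder, and presumably the real helper uses model_format and train_start to build the file name, which is elided here):

import joblib

def persist_model_to_disk(gbm, model_format, train_start, model_file="model.joblib"):
    # Placeholder sketch: serialise the Booster so do_more_training() can reload it
    # with joblib.load(). model_format and train_start are accepted to match the
    # call sites above but are unused in this sketch.
    with open(model_file, "wb") as fout:
        joblib.dump(gbm, fout)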

StrikerRUS (Collaborator) commented:

@CHDev93 Thank you very much for the detailed explanation and the possible workaround!

ping @huanzhang12 for further investigation
