
Resuming boosting gives GPU Invalid Device Error #3870

Open
Tracked by #5153
CHDev93 opened this issue Jan 27, 2021 · 5 comments

CHDev93 commented Jan 27, 2021

How are you using LightGBM?

LightGBM component: Python package

Environment info

Operating System: Windows 10

CPU/GPU model: GPU

C++ compiler version: NA

CMake version: NA

Java version: NA

Python version: 3.6.6

R version: NA

Other: NA

LightGBM version or commit hash: 3.1.0

Error message and / or logs

When training a booster on the GPU and then attempting to continue training with it, I get an error indicating the GPU is unavailable. I often see the same error if I try to launch two LightGBM jobs on the same GPU from different shells, so it seems that one booster somehow locks the entire GPU and doesn't release it until the program ends.

[LightGBM] [Debug] Trained a tree with leaves = 50 and max_depth = 8
CASEY: Finished final boosting iteration
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 63750
[LightGBM] [Info] Number of data points in the train set: 300000, number of used features: 250
Traceback (most recent call last):
  File ".\lgbm_large_boosting_repro.py", line 138, in <module>
    main(args)
  File ".\lgbm_large_boosting_repro.py", line 75, in main
    init_model=gbm,
  File "D:\chdev\.virtualenv\tf_1_12_gpu\lib\site-packages\lightgbm\engine.py", line 231, in train
    booster = Booster(params=params, train_set=train_set)
  File "D:\chdev\.virtualenv\tf_1_12_gpu\lib\site-packages\lightgbm\basic.py", line 2061, in __init__
    ctypes.byref(self.handle)))
  File "D:\chdev\.virtualenv\tf_1_12_gpu\lib\site-packages\lightgbm\basic.py", line 55, in _safe_call
    raise LightGBMError(decode_string(_LIB.LGBM_GetLastError()))
lightgbm.basic.LightGBMError: Invalid Device

Reproducible example(s)

Below is a simple example producing the error. Switching the device type to CPU, this program runs to completion without issue.

import time

import lightgbm as lgb
import numpy as np

print(f"LGBM version: {lgb.__version__}", flush=True)
n = int(3e6)
m = 250
max_bin = 255
max_leaves = 50
n_boosting_rounds = 50

x_train = np.random.randn(n, m).astype(np.float32)
A = np.random.randint(-5, 5, size=(m, 1))
y_train = (x_train @ A).astype(np.float32)

params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': ['rmse'],
    'device': 'gpu',
    'num_leaves': max_leaves,
    'bagging_fraction': 0.5,
    'feature_fraction': 0.5,
    'learning_rate': 0.001,
#     'num_threads': 20,
    'verbose': 2,
    'max_bin': max_bin,
}

ds_train = lgb.Dataset(x_train, y_train.ravel(), free_raw_data=False)  # seems we need this for continued training
start = time.perf_counter()
gbm = lgb.train(
    params,
    ds_train,
    num_boost_round=n_boosting_rounds,
    keep_training_booster=True,
)

################## TRAIN SOME MORE ###################
gbm = lgb.train(
    params,
    ds_train,
    num_boost_round=10,
    init_model=gbm,
)
#######################################################
StrikerRUS (Collaborator) commented:

Could you please pin gpu_platform_id and gpu_device_id params to some non-default values?
https://lightgbm.readthedocs.io/en/latest/Parameters.html#gpu-parameters
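For example, something like this in the params dict (just a minimal sketch; the right ids depend on your OpenCL platform and device layout, so treat 0/0 as a starting point):

params = {
    'objective': 'regression',
    'device': 'gpu',
    'gpu_platform_id': 0,  # OpenCL platform id; default -1 means the system-wide default platform
    'gpu_device_id': 0,    # device id within that platform; default -1 means the default device
}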

CHDev93 (Author) commented Jan 27, 2021

Thanks for the suggestion! I've actually never encountered these params before.

I'm running on a machine with 3 NVIDIA V100 GPUs, so I just set both of these params to 0. CUDA_VISIBLE_DEVICES is also set to "0" before lightgbm is imported, so it should only be seeing one of the GPUs anyway. My understanding of the docs you linked makes it seem that if all the GPUs are homogeneous, this should be fine.

I added these to my params dict in the MWE above and am still seeing the same error. Is my understanding of what to set gpu_platform_id and gpu_device_id to correct, given the setup described?
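
For reference, this is roughly how the device is being pinned at the top of the script (sketch; the rest of the imports are omitted):

import os

# Must be set before lightgbm is imported so the process only sees GPU 0
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import lightgbm as lgb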

StrikerRUS (Collaborator) commented:

Yeah, you did everything right. It is a pity that those params didn't help.

My last guess is to try re-creating the training Dataset one more time before calling the train function, because some of your params are used during Dataset initialization and cannot be changed after that.

################## TRAIN SOME MORE ###################
ds_train = lgb.Dataset(x_train, y_train.ravel(), free_raw_data=False)  # add this
gbm = lgb.train(
    params,
    ds_train,
    num_boost_round=10,
    init_model=gbm,
)
#######################################################

CHDev93 (Author) commented Jan 27, 2021

Thanks for your help on this! Adding the line above still produces the same error unfortunately.

I refactored the code to do the training and save the model in one function (so that the original booster goes out of scope), and then in the second round of training I load the model from disk and train with that. This does end up training successfully. So it seems that the booster object is somehow locking the GPU and preventing any other object from getting a handle to the device while the first booster still exists.

The code below does get around the problem, but do you think there is a more permanent fix for this? Ideally, code that works on the CPU would "just work" on the GPU with the change of a command-line flag, without needing to be refactored to introduce these extra (de)serialisation steps.

I'm also open to other ideas for getting this example running on the GPU.

def do_initial_training(train_start, args, params, x_train, y_train):
    ds_train = lgb.Dataset(x_train, y_train.ravel())
    start = time.perf_counter()
    gbm = lgb.train(
        params,
        ds_train,
        num_boost_round=args.n_boosting_rounds,
        keep_training_booster=args.keep_training_booster,
    )
    end = time.perf_counter()
    print(f"Initial training finished: {end - start}s", flush=True)
    persist_model_to_disk(gbm, args.model_format.lower(), train_start)


def do_more_training(train_start, args, params, x_train, y_train, model_file):
    print("Doing more training...", flush=True)
    ds_train = lgb.Dataset(x_train, y_train.ravel())
    with open(model_file, 'rb') as fin:
        gbm = joblib.load(fin)
    start = time.perf_counter()
    gbm = lgb.train(
        params,
        ds_train,
        num_boost_round=10,
        init_model=gbm,
    )
    end = time.perf_counter()
    print(f"Extra training finished: {end - start}s", flush=True)
    persist_model_to_disk(gbm, args.model_format.lower(), train_start)
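
persist_model_to_disk is omitted above for brevity; a simplified sketch of it is below (the default model_file name is a placeholder, and presumably the real helper uses model_format and train_start to build the file name, which is elided here):

import joblib

def persist_model_to_disk(gbm, model_format, train_start, model_file="model.joblib"):
    # Placeholder sketch: serialise the Booster so do_more_training() can reload it
    # with joblib.load(). model_format and train_start are accepted to match the
    # call sites above but are unused in this sketch.
    with open(model_file, "wb") as fout:
        joblib.dump(gbm, fout)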

StrikerRUS (Collaborator) commented:

@CHDev93 Thank you very much for the detailed explanation and the possible workaround!

ping @huanzhang12 for further investigation
