[python-package] Create Dataset from multiple data files (#4089)
* [python-package] create Dataset from sampled data.
* [python-package] create Dataset from List[Sequence]:
  1. Use random access for data sampling
  2. Support reading data from multiple input files
  3. Read data in batches so there is no need to hold all data in memory
* [python-package] example: create Dataset from multiple HDF5 files.
* fix: revert is_class implementation for seq
* fix: unwanted memory view reference for seq
* fix: seq is_class accepts sklearn matrices
* fix: requirements for example
* fix: pycode
* feat: print static code linting stage
* fix: linting: avoid shell str regex conversion
* code style: doc style
* code style: isort
* fix ci dependency: h5py on windows
* [py] remove rm files in test seq #4089 (comment)
* docs(python): init_from_sample summary #4089 (comment)
* remove dataset dump sample data debugging code.
* remove typo fix; create separate PR for this.
* fix typo in src/c_api.cpp (Co-authored-by: James Lamb <jaylamb20@gmail.com>)
* style(linting): py3 type hint for seq
* test(basic): os.path style path handling
* Revert "feat: print static code linting stage" (reverts commit 10bd79f)
* feat(python): sequence on validation set
* minor(python): comment
* minor(python): test option hint
* style(python): fix code linting
* style(python): add pydoc for ref_dataset
* doc(python): sequence (Co-authored-by: shiyu1994 <shiyu_k1994@qq.com>)
* revert(python): sequence class abc
* chore(python): remove rm_files
* Remove useless static_assert.
* refactor: test_basic test for sequence.
* fix lint complaint.
* remove dataset._dump_text in sequence test.
* Fix reverting typo fix.
* Apply suggestions from code review (Co-authored-by: James Lamb <jaylamb20@gmail.com>)
* Fix type hint, code and doc style.
* fix failing test_basic.
* Remove TODO about keeping constant in sync with cpp.
* Install h5py only when running python-examples.
* Fix lint complaint.
* Apply suggestions from code review (Co-authored-by: James Lamb <jaylamb20@gmail.com>)
* Doc fixes, remove unused params_str in __init_from_seqs.
* Apply suggestions from code review (Co-authored-by: Nikita Titov <nekit94-08@mail.ru>)
* Remove unnecessary conda install in windows ci script.
* Keep param as example in dataset_from_multi_hdf5.py
* Add _get_sample_count function to remove code duplication.
* Use batch_size parameter in generate_hdf.
* Apply suggestions from code review (Co-authored-by: Nikita Titov <nekit94-08@mail.ru>)
* Fix after applying suggestions.
* Fix test: check idx is an instance of numbers.Integral.
* Update python-package/lightgbm/basic.py (Co-authored-by: Nikita Titov <nekit94-08@mail.ru>)
* Expose Sequence class in Python-API doc.
* Handle Sequence object not having batch_size.
* Fix isort lint complaint.
* Apply suggestions from code review (Co-authored-by: Nikita Titov <nekit94-08@mail.ru>)
* Update docstring to mention Sequence as data input.
* Remove get_one_line in test_basic.py
* Make Sequence an abstract class.
* Reduce number of tests for test_sequence.
* Add c_api: LGBM_SampleCount; fix potential bug in LGBMSampleIndices.
* empty commit to trigger ci
* Apply suggestions from code review (Co-authored-by: Nikita Titov <nekit94-08@mail.ru>)
* Rename to LGBM_GetSampleCount; change LGBM_SampleIndices out_len to int32_t. Also rename total_nrow to num_total_row in c_api.h for consistency.
* Doc about Sequence in docs/Python-Intro.rst.
* Fix: basic.py change LGBM_SampleIndices out_len to int32.
* Add create_valid test case with Dataset from Sequence.
* Apply suggestions from code review (Co-authored-by: Nikita Titov <nekit94-08@mail.ru>)
* Apply suggestions from code review (Co-authored-by: shiyu1994 <shiyu_k1994@qq.com>)
* Remove no longer used DEFAULT_BIN_CONSTRUCT_SAMPLE_CNT.
* Update python-package/lightgbm/basic.py (Co-authored-by: Nikita Titov <nekit94-08@mail.ru>)

Co-authored-by: Willian Zhang <willian@willian.email>
Co-authored-by: Willian Z <Willian@Willian-Zhang.com>
Co-authored-by: James Lamb <jaylamb20@gmail.com>
Co-authored-by: shiyu1994 <shiyu_k1994@qq.com>
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
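To make the new interface concrete, here is a minimal sketch of the pattern this commit enables: an object implementing the `lgb.Sequence` interface (`__getitem__` supporting both integer and slice access, `__len__`, and a `batch_size` attribute) can be passed to the `Dataset` constructor, alone or in a list. The `NumpySequence` class and the random data below are illustrative, not part of the commit:

import numpy as np

import lightgbm as lgb


class NumpySequence(lgb.Sequence):
    def __init__(self, data, batch_size):
        self.data = data
        self.batch_size = batch_size  # rows returned per read while constructing the Dataset

    def __getitem__(self, idx):
        # idx is an integer (random access while sampling rows for binning)
        # or a slice (batched reads of batch_size rows each).
        return self.data[idx]

    def __len__(self):
        return len(self.data)


# Two sequences stand in for two separate data files.
parts = [NumpySequence(np.random.rand(500, 10), batch_size=64) for _ in range(2)]
y = np.random.rand(1000)
dataset = lgb.Dataset(parts, label=y)
dataset.construct()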
1 parent f37b0d4 · commit c359896 · 11 changed files with 625 additions and 10 deletions.
docs/Python-API.rst
@@ -12,6 +12,7 @@ Data Structure API
     Dataset
     Booster
     CVBooster
+    Sequence

 Training API
 ------------
dataset_from_multi_hdf5.py (new file)
@@ -0,0 +1,106 @@
import h5py
import numpy as np
import pandas as pd

import lightgbm as lgb


class HDFSequence(lgb.Sequence):
    def __init__(self, hdf_dataset, batch_size):
        """
        Construct a sequence object from HDF5 with required interface.

        Parameters
        ----------
        hdf_dataset : h5py.Dataset
            Dataset in HDF5 file.
        batch_size : int
            Size of a batch. When reading data to construct the lightgbm Dataset, each read fetches batch_size rows.
        """
        # We could also open the HDF5 file once here and look the dataset up by name,
        # instead of receiving an already-opened h5py.Dataset.
        self.data = hdf_dataset
        self.batch_size = batch_size

    def __getitem__(self, idx):
        return self.data[idx]

    def __len__(self):
        return len(self.data)


def create_dataset_from_multiple_hdf(input_flist, batch_size):
    data = []
    ylist = []
    for f in input_flist:
        # Keep the files open: HDFSequence reads from them lazily.
        f = h5py.File(f, 'r')
        data.append(HDFSequence(f['X'], batch_size))
        ylist.append(f['Y'][:])

    params = {
        'bin_construct_sample_cnt': 200000,
        'max_bin': 255,
    }
    y = np.concatenate(ylist)
    dataset = lgb.Dataset(data, label=y, params=params)
    # With the binary dataset created, we can use either the Python API or the cmdline version to train.
    #
    # Note: in order to create exactly the same dataset as the one created in simple_example.py, we need
    # to modify simple_example.py to pass a numpy array instead of a pandas DataFrame to the Dataset
    # constructor. The reason is that DataFrame column names are used in the Dataset. For a DataFrame with
    # Int64Index as columns, the Dataset uses column names like ["0", "1", "2", ...], while for a numpy
    # array, column names are the defaults assigned in the C++ code (dataset_loader.cpp),
    # like ["Column_0", "Column_1", ...].
    dataset.save_binary('regression.train.from_hdf.bin')


def save2hdf(input_data, fname, batch_size):
    """Store numpy array to HDF5 file.

    Please note the chunk size settings in the implementation, chosen for I/O performance.
    """
    with h5py.File(fname, 'w') as f:
        for name, data in input_data.items():
            nrow, ncol = data.shape
            if ncol == 1:
                # Y has a single column and is read in a single shot, so store it as a 1-d array.
                chunk = (nrow,)
                data = data.values.flatten()
            else:
                # We use random access for data sampling when creating a LightGBM Dataset from Sequence.
                # When any element in an HDF5 chunk is accessed, the whole chunk is read.
                # To save I/O during sampling, keep the total number of chunks much larger than the sample count.
                # Here we simply use a chunk size that matches batch_size.
                #
                # Also note that the data is stored in row-major order to avoid an extra copy when passing it
                # to the lightgbm Dataset.
                chunk = (batch_size, ncol)
            f.create_dataset(name, data=data, chunks=chunk, compression='lzf')


def generate_hdf(input_fname, output_basename, batch_size):
    # Save to 2 HDF5 files for demonstration.
    df = pd.read_csv(input_fname, header=None, sep='\t')

    mid = len(df) // 2
    df1 = df.iloc[:mid]
    df2 = df.iloc[mid:]

    # We can store multiple datasets inside a single HDF5 file.
    # X and Y are stored separately so each can use the chunk size best suited to its loading pattern.
    fname1 = f'{output_basename}1.h5'
    fname2 = f'{output_basename}2.h5'
    save2hdf({'Y': df1.iloc[:, :1], 'X': df1.iloc[:, 1:]}, fname1, batch_size)
    save2hdf({'Y': df2.iloc[:, :1], 'X': df2.iloc[:, 1:]}, fname2, batch_size)

    return [fname1, fname2]


def main():
    batch_size = 64
    output_basename = 'regression'
    hdf_files = generate_hdf('../regression/regression.train', output_basename, batch_size)

    create_dataset_from_multiple_hdf(hdf_files, batch_size=batch_size)


if __name__ == '__main__':
    main()
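As the comment in create_dataset_from_multiple_hdf notes, once the binary dataset is saved, training can proceed through either the Python API or the command-line version. A minimal sketch of the Python route follows; the training parameters are illustrative, not taken from the example:

import lightgbm as lgb

# Load the binary dataset written by create_dataset_from_multiple_hdf().
train_data = lgb.Dataset('regression.train.from_hdf.bin')

# Illustrative parameters; any regression objective works here.
params = {'objective': 'regression', 'metric': 'l2'}
booster = lgb.train(params, train_data, num_boost_round=10)
booster.save_model('model.txt')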