Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support multiple train data on single machine #3900

Closed
wants to merge 3 commits into from

Conversation

cyfdecyf
Copy link
Contributor

@cyfdecyf cyfdecyf commented Feb 3, 2021

This is a quick and dirty hack which I implemented a month ago.

As there are requests #2031 #2638 for this feature, I'd like to send a PR and see if you are interested in adding this feature.

I've tested correctness by running on higgs dataset with similary config from boosting_tree_benchmarks. With deterministic=true and compared the output model file and prediction result.

@guolinke
Copy link
Collaborator

guolinke commented Feb 8, 2021

@shiyu1994 can you help to review this PR?

@shiyu1994 shiyu1994 self-assigned this Feb 8, 2021
auto interval_time = std::chrono::duration<double, std::milli>(now - prev_time) * 1e-3;
Log::Info("Read %.1f GBs in %.1f seconds (from %s), last interval speed %.1f MB/s",
1.0 * (*bytes_read) / kGbs, total_time,
1.0 * read_progress_interval_bytes_ / kGbs * 1024.0 / interval_time.count(), filename);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exchange the last two arguments in Log::Info?


void SetHeader(const char* filename);
void SetHeader(const char* filenames);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change filenames back to filename? Since only the first file is used for SetHeader.

std::vector<std::string> SampleTextDataFromFile(const char* filename, const Metadata& metadata, int rank, int num_machines, int* num_global_data, std::vector<data_size_t>* used_data_indices);
std::vector<std::string> SampleTextDataFromFile(const std::vector<const char*>& filenames, const Metadata& metadata, int rank, int num_machines, int* num_global_data, std::vector<data_size_t>* used_data_indices);

std::vector<std::string> SampleTextDataFromFile(const char* filename, const Metadata& metadata, int rank, int num_machines, int* num_global_data, std::vector<data_size_t>* used_data_indices) {
Copy link
Collaborator

@shiyu1994 shiyu1994 Feb 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that this function (SampleTextDataFromFile with single filename) is now unused. Maybe we can remove this.

@cyfdecyf
Copy link
Contributor Author

@shiyu1994 Thanks for taking time to review this PR.

As I said this is a quick and dirty hack. As I ignored many use cases, I guess this PR is far from being acceptable to be merged. If that's the case, I'd rather close this PR instead of continue working on this feature.

I'm now adding new features for Dataset related python bindings as this would be a more general approach for my problems.

The core idea is to support two_round in Python:

  1. Load input data in Python, so I'm not limited to use text data
  2. Do data sampling in Python, feed sampled data to C++ to construct bin mapper
  3. Fead data in many batches to construct the final Dataset
    • So we can read data from multiple file
    • Hope this can also avoid triple memory usage I've seen when using Python interface

@github-actions
Copy link

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants