Add weight_column option to CSV parser #399

Merged
hcho3 merged 7 commits into dmlc:master from the csv_weight_column branch
Jun 30, 2018

hcho3
Contributor

@hcho3 hcho3 commented May 10, 2018

It is often useful to specify weights for training examples. Currently, a separate *.weights file cannot be used with partitioned CSV files. To work around this, this PR adds a `weight_column` option to the CSV parser: it designates one column as the weight column, from which example weights are drawn.
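The idea can be illustrated with a minimal Python sketch. This is not the dmlc-core implementation (which is C++); the function name `split_weight_column` and the use of a negative value to mean "no weight column" are assumptions made for illustration only.

```python
import csv
import io


def split_weight_column(text, weight_column=-1):
    """Split CSV text into feature rows and per-example weights.

    Hypothetical helper, not part of dmlc-core. A negative
    `weight_column` (an assumed sentinel) means no column is designated.
    """
    rows = []
    weights = []
    for record in csv.reader(io.StringIO(text)):
        if not record:
            continue  # skip blank lines
        features = []
        for col, cell in enumerate(record):
            value = float(cell)
            if col == weight_column:
                # The designated column supplies the example weight
                # and is not treated as a feature.
                weights.append(value)
            else:
                features.append(value)
        rows.append(features)
    return rows, weights


data = "1,0.5,2\n3,2.0,4\n"
rows, weights = split_weight_column(data, weight_column=1)
rows_plain, weights_plain = split_weight_column(data)
```

When no weight column is designated, the sketch returns an empty weight list, which mirrors the behavior that the second test in this PR checks for.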

@hcho3 hcho3 force-pushed the csv_weight_column branch 2 times, most recently from 39846a3 to cb58415 June 30, 2018 19:52
hcho3 added 7 commits June 30, 2018 19:16
Currently the LibSVM and LibFM parsers assume 0-based indexing. However, many
LibSVM files in the wild use 1-based indexing. (In fact, the [original LIBSVM
implementation](https://github.com/cjlin1/libsvm/blob/master/README) uses
1-based indexing.) This PR adds a new option `indexing_mode` to both parsers
to enable 1-based indexing.

Description:
* `indexing_mode>0`: use 1-based indexing.
* `indexing_mode<0`: use 0-based indexing.
* `indexing_mode=0`: use a heuristic to decide between the two modes of indexing.
  When the smallest feature index is 1 or greater, 1-based indexing is chosen;
  otherwise, 0-based indexing is chosen.
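
The three modes above can be sketched in Python as follows. This is an illustrative sketch, not the parser's actual C++ code: the function names are hypothetical, and treating an empty index list as 0-based is an assumption the description does not cover.

```python
def uses_one_based(indexing_mode, feature_indices):
    """Decide whether feature indices should be read as 1-based.

    Hypothetical helper mirroring the description:
    >0 forces 1-based, <0 forces 0-based, 0 applies the heuristic.
    """
    if indexing_mode > 0:
        return True
    if indexing_mode < 0:
        return False
    # Heuristic: 1-based if the smallest feature index is 1 or greater.
    # Assumption: with no indices at all, fall back to 0-based.
    return len(feature_indices) > 0 and min(feature_indices) >= 1


def to_zero_based(indexing_mode, feature_indices):
    """Return the indices shifted to 0-based form where needed."""
    offset = 1 if uses_one_based(indexing_mode, feature_indices) else 0
    return [index - offset for index in feature_indices]


forced_one_based = to_zero_based(1, [1, 3])
forced_zero_based = to_zero_based(-1, [1, 3])
heuristic_one_based = to_zero_based(0, [1, 3])
heuristic_zero_based = to_zero_based(0, [0, 3])
```

Note that the heuristic can misread a genuinely 0-based file in which feature 0 never appears, which is one reason to set the mode explicitly when the indexing convention is known.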

For examples, see the unit tests in `test/unittest/unittest_parser.cc`.

Note: backward compatibility is preserved. When `indexing_mode` is
not specified, 0-based indexing is assumed.
@hcho3 hcho3 force-pushed the csv_weight_column branch from cb58415 to 474fd99 June 30, 2018 23:26
@hcho3 hcho3 merged commit fc28775 into dmlc:master Jun 30, 2018
@hcho3 hcho3 deleted the csv_weight_column branch June 30, 2018 23:34
ruslo pushed a commit to hunter-packages/dmlc-core that referenced this pull request Mar 23, 2019
* Add weight_column option to CSV parser

* Add a second test to ensure that an unspecified weight_column results in an empty weight vector

* Also check the value field

* Fix index type used in RowBlockContainer