
[jvm-packages] support distributed synchronization of customized evaluation metrics #4280

Closed

Conversation

CodingCat
Member

@CodingCat CodingCat commented Mar 20, 2019

  • basic JNI framework (C++ <-> Java)
  • support binary classification
  • support regression
  • support multi-class classification
  • support ranking
  • unit tests checking numerical correctness of synchronized metrics
  • documentation

this PR implements the functionality to synchronize custom metrics in the distributed training setting. Before this PR, only built-in metrics were synchronized, which meant users could not leverage functionality like early stopping with custom metrics (since different workers might reach different decisions about whether to stop).

This PR is to

  • implement distributed synchronization of custom metrics
  • spare users who implement custom metrics from having to explicitly call Rabit operations

closes #3595 #3813

@thvasilo
Contributor

Hello @CodingCat, could you write a couple of things on the purpose of this PR so I can potentially help out with reviewing and testing?

@CodingCat
Member Author

@thvasilo thanks for offering to help test the feature; I will post a more comprehensive description of the PR once I finalize the code.


@CodingCat CodingCat changed the title [jvm-packages][WIP] support distributed synchronization of customized evaluation metrics [WIP] support distributed synchronization of customized evaluation metrics Mar 26, 2019
@CodingCat CodingCat changed the title [WIP] support distributed synchronization of customized evaluation metrics [jvm-packages] support distributed synchronization of customized evaluation metrics Mar 28, 2019
@CodingCat CodingCat requested review from trivialfis, hcho3 and RAMitchell and removed request for hcho3 March 28, 2019 02:48
@CodingCat
Member Author

@trivialfis @RAMitchell @hcho3

please help review this; I moved several classes from the original .cu files into headers so they can be shared by other files

@trivialfis
Member

Had a brief look. This is gonna take some time.

@CodingCat CodingCat force-pushed the distributed_custom_eval_metrics branch from 1e005fa to 067f96c on March 30, 2019 00:05
Member

@trivialfis trivialfis left a comment


@CodingCat Let's not merge this for now.

  • Is the feature JVM-only, or is it universal across other bindings (I have never done this before)? If it's JVM-only, then we can discuss how to make it more generic before moving so much code.
  • Is it necessary to move the whole implementation into headers? We could create a suitable interface instead.
  • Other points are in the inline comments.

Review comments on include/xgboost/c_api.h, include/xgboost/learner.h, and include/xgboost/metric/metric.h (resolved).
Member

@RAMitchell RAMitchell left a comment


My concern with this PR is that it fuses the internal C++ implementation of metrics to the external Java API. If we merge this we effectively lose the ability to refactor or change the way metrics are implemented.

While this provides some generality for users it still doesn't give us the ability to implement metrics like AUC via Java/Scala.

Given how complicated it seems to expose this feature and the fact that it is not a complete solution, I wonder if users should be implementing these functions in C++?

@CodingCat
Member Author

My concern with this PR is that it fuses the internal C++ implementation of metrics to the external Java API. If we merge this we effectively lose the ability to refactor or change the way metrics are implemented.

While this provides some generality for users it still doesn't give us the ability to implement metrics like AUC via Java/Scala.

Given how complicated it seems to expose this feature and the fact that it is not a complete solution, I wonder if users should be implementing these functions in C++?

@RAMitchell , thanks for the review

actually we have already imposed a convention on the method signatures in metric implementations, e.g.

bst_float residue = d_policy.EvalRow(s_label[idx], s_preds[idx]);

as long as we want to expose distributed metric syncing to users without requiring them to call Rabit directly (which is not an elegant solution, as we would be exposing an internal XGBoost dependency to the user), they need to follow that signature no matter which language they use

regarding implementing something like AUC, the users can do that with existing interface like https://github.com/dmlc/xgboost/blob/master/jvm-packages/xgboost4j/src/main/scala/ml/dmlc/xgboost4j/scala/EvalTrait.scala#L38 (they just cannot use it in a distributed setting for purposes like early stopping)

the question is only whether we want to expose distributed metric syncing without requiring users to call Rabit directly.

@CodingCat
Member Author

@CodingCat Let's not merge this for now.

  • Is the feature JVM-only, or is it universal across other bindings (I have never done this before)? If it's JVM-only, then we can discuss how to make it more generic before moving so much code.
  • Is it necessary to move the whole implementation into headers? We could create a suitable interface instead.
  • Other points are in the inline comments.

for now, only the JVM packages have this functionality, as the strategy is to make cross-language calls via language-specific APIs....

IIRC, other language bindings didn't provide any base class for users to extend when implementing custom metrics; instead, they require the user to implement the whole function, like

def eval_set(self, evals, iteration=0, feval=None):

Then the user needs to take care of everything from the metric logic to the distributed sync.

if we want to extend this to other languages, I think the pattern here is a good reference, though it cannot be applied directly in implementation (because every language has its own cross-language function call APIs)

@CodingCat
Member Author

can we give this PR further consideration?

@CodingCat CodingCat force-pushed the distributed_custom_eval_metrics branch from 28feba0 to 3802324 on April 18, 2019 20:21
@CodingCat
Member Author

@trivialfis @hcho3 @RAMitchell I got a compilation failure in Jenkins.

It says:


[2019-04-19T03:53:37.599Z] In file included from /workspace/src/metric/multiclass_metric.cc:8:0:

[2019-04-19T03:53:37.599Z] /workspace/include/xgboost/metric/multiclass_metric.h:17:59: fatal error: thrust/execution_policy.h: No such file or directory

[2019-04-19T03:53:37.599Z] compilation terminated.

do you have any insight into what is happening?

@trivialfis
Member

@CodingCat Let me try fixing it in another PR.

@CodingCat
Member Author

nvm, I think I messed up the switch for conditional compilation

@trivialfis
Member

@CodingCat The thrust header paths are managed by nvcc, which will pass the right arguments and device code to gcc. For a cu file this all happens automatically. But for a cc file, you need guards like __CUDACC__ or XGBOOST_USE_CUDA to avoid letting gcc process CUDA related code directly.

@CodingCat
Member Author

@CodingCat The thrust header paths are managed by nvcc, which will pass the right arguments and device code to gcc. For a cu file this all happens automatically. But for a cc file, you need guards like __CUDACC__ or XGBOOST_USE_CUDA to avoid letting gcc process CUDA related code directly.

got it! thanks!

@CodingCat
Member Author

ping for the further review @trivialfis @RAMitchell @hcho3

@RAMitchell
Member

Can we explore the alternative a little more? Could we expose a very basic AllReduce method via the xgboost API that connects on the back end to the rabit library linked to the xgboost shared library?

@CodingCat
Member Author

CodingCat commented Apr 22, 2019

we actually have exposed something like https://github.com/dmlc/xgboost/blob/master/jvm-packages/xgboost4j/src/native/xgboost4j.h#L327-L328

do you mean we are going to document how to implement a customized metric for distributed training with this API (instead of masking the call to the Rabit API with the implementation here)?

@RAMitchell
Member

do you mean we are going to document how to implement a customized metric for distributed training with this API (instead of masking the call to the Rabit API with the implementation here)?

I guess I would like to know what it would look like and how complicated it would be for users to call Rabit themselves in their JVM implementation of the metric. If it's not overly difficult, and distributed metrics can be achieved this way, it may be preferable to the current implementation for me.

@CodingCat
Member Author

CodingCat commented Apr 25, 2019

@RAMitchell I was trying to write an example to concretely show how complicated it is for a user to write a distributed metric on their own, but I discarded it precisely because of that complexity.

given the current interface https://github.com/dmlc/xgboost/blob/master/jvm-packages/xgboost4j/src/main/java/ml/dmlc/xgboost4j/java/IEvaluation.java#L40

  1. in the first step, users need to get predictions, labels and weights

  2. then they need to implement something like https://github.com/dmlc/xgboost/blob/master/src/metric/elementwise_metric.cu#L36-L57

  3. after that, they need to call the Rabit API to synchronize residuals and weights, like https://github.com/dmlc/xgboost/blob/master/src/metric/elementwise_metric.cu#L36-L57

  4. another headache is that we need to either provide a specific API tailored for syncing a float array, or the user has to use our current Rabit API and transform their local evaluation results from steps 1–2 into a java.nio direct buffer to pass to our API...

I would still prefer to hide all these complexities from users. Since we have implemented everything in C++, it's not worth repeating it in Java... and I am actually unclear about the downside of the approach in this PR, except that it's a big diff, most of which comes from moving files.
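To make the comparison concrete, here is a rough, single-worker sketch (not code from this PR) of the steps 1–3 described above, using squared error as the per-row metric. The allreduce is stubbed out with an identity function precisely because the real Rabit binding and the direct-buffer marshalling in step 4 are the painful parts a user would have to handle:

```java
// Hypothetical sketch of what a user would hand-roll without this PR.
class ManualDistributedMetric {
    // Step 2: per-row error, here squared error.
    static float evalRow(float label, float pred) {
        float d = label - pred;
        return d * d;
    }

    // Step 3 stand-in: in real code this would be a Rabit allreduce over
    // {errorSum, weightSum}; here it is an identity stub for one worker.
    static float[] allReduceSum(float[] local) {
        return local;
    }

    static float eval(float[] labels, float[] preds) {
        float errorSum = 0f, weightSum = 0f;
        for (int i = 0; i < labels.length; i++) {  // steps 1 and 2
            errorSum += evalRow(labels[i], preds[i]);
            weightSum += 1f;                       // unweighted samples
        }
        float[] global = allReduceSum(new float[]{errorSum, weightSum});
        return (float) Math.sqrt(global[0] / global[1]);  // RMSE
    }
}
```

Even in this toy form the user must carry the accumulation, reduction, and final transform themselves; the PR's approach keeps only the first and last pieces in user code.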

@RAMitchell
Member

IMO the solution would be to make it easy for users to get predictions, labels and weights on the JVM side, plus a function simplifying the Rabit reduction. The reason I am taking this angle is that it requires no changes to the C API and leaves us much more flexible in how we internally implement metrics in C++ code. I don't think this solution is more complicated than what you have.

I realise you have done a lot of work already, but I want to make sure this is correct, as I think it will have long-term implications for my work and others'. Let's look for other opinions; in particular, I'm curious what @tqchen thinks.

@CodingCat
Member Author

One thing I would add is that we should think about how we treat the metrics system in XGBoost: is it a helper that is free to change at any time, or are we going to treat it as a submodule shared by other modules?

If we want to make it a formal module, we need to sign a contract regulating changes.

The contract here is: no matter how you change it, you should still allow others to tell

  1. What's the output given the prediction and label of a single row

  2. What's the output given the sum of errors and weights

Re: more opinions

@hcho3 hcho3 mentioned this pull request May 6, 2019
@CodingCat
Member Author

CodingCat commented May 12, 2019

I will get back on this after 0.9 release

and so far, I still have no idea about

(1) why we require every user to write the same chunk of code, like https://github.com/dmlc/xgboost/blob/master/src/metric/elementwise_metric.cu#L36-L57, if they want to use custom evaluation metrics in a distributed setting

(2) why we require XGBoost users to know what Rabit is (do we also require our users to know what NCCL is and how to use it?)

(3) why we are so afraid of stabilizing any API even as we are seriously considering a 1.0 release

(4) what the concrete plan for refactoring the metrics system is (as concrete as the solution that has been sitting here for two months) that could be broken by exposing any API

I don't think we are on the right track to move things forward... such disasters have happened in other communities as well, essentially hurting users and creating many private forks of the project. One of the most significant examples might be the nested column pruning feature in SparkSQL: a feature proposed 2 years ago, but delayed time and time again for reasons like "hmmmm, we need to refactor/re-design/whatever A, B, C, there might be some conflict" (which turned out to be unfounded in most cases), even though it was a feature awaited by many users....

and eventually they only merged a (very, very) partial version of that feature after 2 years, and frustrated users had created private forks of Spark containing that feature and many other innovations, for example at FB/Uber/Apple

@hcho3
Collaborator

hcho3 commented May 12, 2019

@CodingCat One thing I like about this pull request is IEvalMultiClassesDistributed. I've seen quite a few users trip over writing customized objectives for multi-class classification, e.g. #4288. It turns out that writing a customized objective for multi-class classification is hard to do correctly, even in a non-distributed setting. I'd love to see first-class support for this use case.

@hcho3
Collaborator

hcho3 commented May 12, 2019

@CodingCat One little useful detail: AUC / AUCPR are actually list-wise ranking metrics, so IEvalRankListDistributed should be able to accommodate AUC / AUCPR. (In the case of binary classification, set the number of query groups to 1.)
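As a sketch of that observation, binary AUC can be computed through the rank-list signature by treating the whole dataset as one query group. The interface name comes from this PR, but it is redeclared here so the snippet stands alone; this is an illustrative O(n^2) pairwise formulation, not the production implementation:

```java
// Sketch: binary AUC expressed through the rank-list style interface.
interface IEvalRankListDistributed {
    float evalMetric(float[] preds, int[] labels);
}

class AucAsRankMetric implements IEvalRankListDistributed {
    // AUC = probability that a random positive is scored above a random
    // negative; count concordant pairs, crediting ties half a point.
    @Override
    public float evalMetric(float[] preds, int[] labels) {
        long concordant = 0, ties = 0, pairs = 0;
        for (int i = 0; i < preds.length; i++) {
            for (int j = 0; j < preds.length; j++) {
                if (labels[i] == 1 && labels[j] == 0) {
                    pairs++;
                    if (preds[i] > preds[j]) concordant++;
                    else if (preds[i] == preds[j]) ties++;
                }
            }
        }
        return pairs == 0 ? 0.5f : (concordant + 0.5f * ties) / (float) pairs;
    }
}
```

With one query group covering all rows, this reduces exactly to the usual binary-classification AUC.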

Collaborator

@hcho3 hcho3 left a comment


See my comments on the metric interface design. I plan to review the literature on survival analysis and learning-to-rank and design the metric interface for these applications.

----------------
With XGBoost4J (including XGBoost4J-Spark), users can implement their own custom evaluation metrics and have the values synchronized in the distributed training setting. To implement a custom evaluation metric, users should implement one of the interfaces ``ml.dmlc.xgboost4j.java.IEvalElementWiseDistributed`` (for binary classification and regression), ``ml.dmlc.xgboost4j.java.IEvalMultiClassesDistributed`` (for multi-class classification), or ``ml.dmlc.xgboost4j.java.IEvalRankListDistributed`` (for ranking).

* ``ml.dmlc.xgboost4j.java.IEvalElementWiseDistributed``: users implement ``float evalRow(float label, float pred);``, which calculates the metric for a single sample given the prediction and label, as well as ``float getFinal(float errorSum, float weightSum);``, which performs the final transformation over the sums of errors and weights across samples.
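A minimal sketch of this element-wise contract, with the interface redeclared locally for self-containment (the real one lives in xgboost4j), using RMSE as the metric:

```java
// Sketch of the element-wise contract: per-row squared error plus a
// final transform over the accumulated sums.
interface IEvalElementWiseDistributed {
    float evalRow(float label, float pred);
    float getFinal(float errorSum, float weightSum);
}

class DistributedRmse implements IEvalElementWiseDistributed {
    @Override
    public float evalRow(float label, float pred) {
        float diff = label - pred;  // squared error for one sample
        return diff * diff;
    }

    @Override
    public float getFinal(float errorSum, float weightSum) {
        // By the time getFinal runs, errorSum and weightSum have been
        // allreduce'd across workers, so every worker computes the same value.
        return (float) Math.sqrt(errorSum / weightSum);
    }
}
```

The user writes only these two pure functions; the cross-worker summation happens between them, outside user code.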

For completeness, we need to have weights included in element-wise metric:

float evalRow(float label, float pred, float weight)

We may also want to define an interface to let the user define a custom reduce op.
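A user-definable reduce op along those lines might look like this hypothetical sketch (all names are illustrative, not part of the actual API; the distributed reduction is simulated locally by folding over per-worker partial values):

```java
// Hypothetical sketch of a user-definable reduce op for metric syncing.
interface IReduceOp {
    float reduce(float a, float b);
}

class ReduceOps {
    // Instead of hard-coding summation, the binding could accept any
    // associative operator and apply it during the cross-worker reduction.
    static final IReduceOp SUM = (a, b) -> a + b;
    static final IReduceOp MAX = Math::max;

    // Local stand-in for the distributed reduction: fold the op over
    // per-worker partials (in real training each worker contributes one).
    static float allReduce(float[] partials, IReduceOp op) {
        float acc = partials[0];
        for (int i = 1; i < partials.length; i++) {
            acc = op.reduce(acc, partials[i]);
        }
        return acc;
    }
}
```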



* ``ml.dmlc.xgboost4j.java.IEvalMultiClassesDistributed``: the methods to be implemented are similar to ``ml.dmlc.xgboost4j.java.IEvalElementWiseDistributed``, except that the single-row metric method is ``float evalRow(int label, float pred, int numClasses);``

Should be ``float evalRow(int label, float pred, float weight, int numClasses);``



* ``ml.dmlc.xgboost4j.java.IEvalRankListDistributed``: users implement ``float evalMetric(float[] preds, int[] labels);``, which receives the predictions and labels for instances in the same group.

Should be ``float evalMetric(float[] preds, int[] labels, float group_weight);``

@trivialfis
Member

@CodingCat I will see if it's possible to "unmask" the sum op.

@lock lock bot locked as resolved and limited conversation to collaborators Aug 28, 2019
Successfully merging this pull request may close these issues.

[jvm-packages] error using a custom evaluation function (setCustomEval) on spark