[SPARK-12664][ML] Expose probability in mlp model #17373
Conversation
Test build #74965 has finished for PR 17373 at commit
@WeichenXu123 are you still working on this on your own, or do you want me to take it over? I'm also interested in adding this feature. Thanks!
@nicodri Hi, I am modifying this PR and will commit this week! Thanks!
Force-pushed 1f0da4e to 4cf8cee
cc @yanboliang thanks!
Test build #75835 has finished for PR 17373 at commit
In which version is this going to be available in PySpark?
Test build #79374 has finished for PR 17373 at commit
I'd like to express my need for this feature. It is also important for building ensemble classifiers.
@LeoIV sorry for the delay! I will update the code soon!
The probability should always be between 0 and 1.
Send me your test code and test data to help me find out what is wrong.
In my own test the result is OK.
On 12 Jul 2017, at 11:57 PM, Leonard Hövelmann wrote:
I checked out version 2.1.2-SNAPSHOT and applied your changes there locally. It works; however, the probabilities are not in range [-1,1]. Is this intended?
@WeichenXu123 I deleted my last comment because I wasn't quite sure whether I had made mistakes elsewhere. As I described above, I applied your changes in version 2.1. For small datasets, I get raw predictions that are not in [0, 1]. You should be able to check it using the small test case quoted in full below. Does this work in your code?
RawPrediction is not probability.
Its range is from -inf to inf.
Softmax(rawPrediction) gives the probabilities.
Their range is from 0 to 1.
Thanks!
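For illustration, here is a minimal softmax sketch in plain Scala (the idea only, not the actual Spark internals):

// Softmax maps raw scores in (-inf, inf) to probabilities in [0, 1] that sum to 1.
// Subtracting the max first is a standard numerical-stability trick.
def softmax(raw: Array[Double]): Array[Double] = {
  val max = raw.max
  val exps = raw.map(x => math.exp(x - max))
  val sum = exps.sum
  exps.map(_ / sum)
}

// Example: softmax(Array(52.5, 22.9, 6.4)) is roughly Array(1.0, 1.4e-13, 9.5e-21),
// which is why large raw predictions coexist with probabilities in [0, 1].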
On 14 Jul 2017, at 6:38 AM, Leonard Hövelmann wrote:
@WeichenXu123 I deleted my last comment because I wasn't quite sure whether I had made mistakes elsewhere. As I described above, I applied your changes in version 2.1. For small datasets, I get raw predictions that are not in [0, 1]. You should be able to check it using this small test case:
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
import org.apache.spark.sql.types.{IntegerType, StructType}
import org.apache.spark.sql.{Row, SparkSession}
/**
 * Created by Leonard Hövelmann (leonard.hoevelmann@adesso.de) on 14.07.2017.
 */
object TestProb {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    val rowSchema = new StructType().add("class", IntegerType).add("features", org.apache.spark.ml.linalg.SQLDataTypes.VectorType)
    val testData: RDD[Row] = spark.sparkContext.parallelize(Seq(
      new GenericRowWithSchema(Array(0, Vectors.dense(Array(0.1, 0.2, 0.3, 0.4, 0.5))), rowSchema).asInstanceOf[Row],
      new GenericRowWithSchema(Array(0, Vectors.dense(Array(0.1, 0.2, 0.3, 0.4, 0.5))), rowSchema).asInstanceOf[Row],
      new GenericRowWithSchema(Array(1, Vectors.dense(Array(0.1, 0.5, 0.3, 0.4, 0.5))), rowSchema).asInstanceOf[Row],
      new GenericRowWithSchema(Array(1, Vectors.dense(Array(0.1, 0.5, 0.3, 0.4, 0.5))), rowSchema).asInstanceOf[Row],
      new GenericRowWithSchema(Array(2, Vectors.dense(Array(0.1, 0.2, 0.8, 0.4, 0.5))), rowSchema).asInstanceOf[Row],
      new GenericRowWithSchema(Array(2, Vectors.dense(Array(0.1, 0.2, 0.8, 0.4, 0.5))), rowSchema).asInstanceOf[Row]
    ))
    val testDataDf = spark.sqlContext.createDataFrame(testData, rowSchema)
    val mlp = new MultilayerPerceptronClassifier().setFeaturesCol("features").setLabelCol("class").setLayers(Array(5, 4, 3))
    val mlpModel = mlp.fit(testDataDf)
    mlpModel.transform(testDataDf).show(6)
  }
}
Using this, I get the following results:
+-----+--------------------+--------------------+--------------------+----------+
|class| features| rawPrediction| probability|prediction|
+-----+--------------------+--------------------+--------------------+----------+
| 0|[0.1,0.2,0.3,0.4,...|[52.5097295377110...|[1.0,1.1880726027...| 0.0|
| 0|[0.1,0.2,0.3,0.4,...|[52.5097295377110...|[1.0,1.1880726027...| 0.0|
| 1|[0.1,0.5,0.3,0.4,...|[22.9478511752010...|[4.03649486150668...| 1.0|
| 1|[0.1,0.5,0.3,0.4,...|[22.9478511752010...|[4.03649486150668...| 1.0|
| 2|[0.1,0.2,0.8,0.4,...|[6.36424366031029...|[4.39122384367774...| 2.0|
| 2|[0.1,0.2,0.8,0.4,...|[6.36424366031029...|[4.39122384367774...| 2.0|
+-----+--------------------+--------------------+--------------------+----------+
Does this work in your code?
Alright, thanks. But have a look at the probabilities. They aren’t in [0,1] either.
They are; I think some values are something like 4.7532244532E-10 and the display truncates them.
Thanks!
Right 🙈 That explains why it works perfectly fine with my classifier :-)
Force-pushed 4cf8cee to fb83553
Test build #79806 has finished for PR 17373 at commit
Force-pushed fb83553 to 14c4c6c
Test build #79950 has finished for PR 17373 at commit
Adding some comments as I review this again.
@@ -463,7 +479,7 @@ private[ml] class FeedForwardModel private(
   private var outputs: Array[BDM[Double]] = null
   private var deltas: Array[BDM[Double]] = null

-  override def forward(data: BDM[Double]): Array[BDM[Double]] = {
+  override def forward(data: BDM[Double], containsLastLayer: Boolean): Array[BDM[Double]] = {
     // Initialize output arrays for all layers. Special treatment for InPlace
The last layer is always softmax. I added the containsLastLayer parameter: when true, the forward computation includes the last layer; otherwise it does not. The parameter is used when we need rawPrediction, for which the last softmax layer should be discarded.
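Conceptually, the flag works like this simplified sketch (the real FeedForwardModel operates on Breeze matrices and stateful layer models, so this only shows the shape of the idea):

// Treat the network as a sequence of layer functions and optionally stop
// before the final softmax layer so the caller receives raw scores.
def forwardSketch(
    input: Array[Double],
    layers: Seq[Array[Double] => Array[Double]],
    includeLastLayer: Boolean): Array[Array[Double]] = {
  val active = if (includeLastLayer) layers else layers.dropRight(1)
  // scanLeft keeps every intermediate output, mirroring forward's per-layer result.
  active.scanLeft(input)((out, layer) => layer(out)).toArray
}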
Could you add the above comment in the code? It could be useful for folks reading/editing this in the future.
Also, it seems like the last layer could also be a SigmoidLayerWithSquaredError or a SigmoidFunction; do we need to handle those cases any differently?
@MrBago In MultilayerPerceptronClassifier.train there is a line:
val topology = FeedForwardTopology.multiLayerPerceptron(myLayers, softmaxOnTop = true)
So MultilayerPerceptronClassifier always uses softmax as the last layer.
  Vectors.dense(result.last.toArray)
}

override def predictRaw(data: Vector): Vector = {
  val size = data.size
Add the predictRaw method, which computes the forward pass without the last (softmax) layer.
}

override def raw2ProbabilityInPlace(data: Vector): Vector = {
  val dataMatrix = new BDM[Double](data.size, 1, data.toArray)
Add raw2ProbabilityInPlace, which computes softmax(rawPrediction) ==> probability vector by directly calling the last layer's function.
@WeichenXu123 I left some comments on the PR. This looks good overall; the main thing I'd like to do is add a stronger test.
@@ -463,7 +479,7 @@ private[ml] class FeedForwardModel private(
   private var outputs: Array[BDM[Double]] = null
   private var deltas: Array[BDM[Double]] = null

-  override def forward(data: BDM[Double]): Array[BDM[Double]] = {
+  override def forward(data: BDM[Double], containsLastLayer: Boolean): Array[BDM[Double]] = {
Could we use a variable name like includeLastLayer here? containsLastLayer sounds like a property of the model instead of an instruction to the method.
@@ -363,7 +363,7 @@ private[ann] trait TopologyModel extends Serializable {
   * @param data input data
   * @return array of outputs for each of the layers
   */
-  def forward(data: BDM[Double]): Array[BDM[Double]]
+  def forward(data: BDM[Double], containsLastLayer: Boolean): Array[BDM[Double]]
Can you update the docstring for this method to add the argument?
assert(result.select(cmpVec(features2prob(col("features")), col("probability")))
  .rdd.map(_.getBoolean(0)).reduce(_ && _))
}
I think we should include a stronger test for this. I did a quick search and couldn't find a strong test for mlpModel.predict; it might be good to add one. Also, I believe this XOR dataset only produces probability predictions approximately equal to 0 or 1.
@MrBago
Which kind of stronger test should be done? Add a test that checks the probability vectors against given expected vectors?
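One possible shape for such a test, sketched with made-up numbers (the dataset, layer sizes, and tolerance are illustrative assumptions, and spark is an existing SparkSession): repeat identical feature vectors with labels in a known ratio, so a calibrated model should recover that ratio as probabilities.

import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.ml.linalg.{Vector, Vectors}

// Every row has the same features; 75% are labeled 0 and 25% are labeled 1,
// so the predicted probability vector should be close to (0.75, 0.25).
val rows = Seq.tabulate(40) { i =>
  (if (i % 4 == 3) 1.0 else 0.0, Vectors.dense(1.0, 0.0))
}
val df = spark.createDataFrame(rows).toDF("label", "features")

val model = new MultilayerPerceptronClassifier()
  .setLayers(Array(2, 3, 2))
  .setMaxIter(200)
  .fit(df)

val prob = model.transform(df).select("probability").head.getAs[Vector](0)
assert(math.abs(prob(0) - 0.75) < 0.05)  // the tolerance is an assumption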
Force-pushed 14c4c6c to 645fdc4
Test build #80371 has finished for PR 17373 at commit
I think this will change the output of
@WeichenXu123 Can you please add "[ML]" to the PR title?
Thanks for the PR @WeichenXu123! I just finished a review pass.
One challenge with the ProbabilisticClassifier abstraction is that it introduces different code paths for predictions depending on which output columns are turned on or off: probability, rawPrediction, prediction. We ran into a bug in MLOR because of this. It'd be good to follow up after this PR with another PR that adds a generic test for this; I created https://issues.apache.org/jira/browse/SPARK-21729 to track the issue.
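Such a generic test might look roughly like the following sketch (it relies on the Spark ML convention that setting an output column name to the empty string disables that column; this is only the idea behind the follow-up, not its final form):

import org.apache.spark.ml.classification.MultilayerPerceptronClassificationModel
import org.apache.spark.sql.DataFrame

// With all output columns enabled, record the predictions; then disable
// probability and rawPrediction and check that the prediction-only path agrees.
def checkPredictionConsistency(
    model: MultilayerPerceptronClassificationModel,
    df: DataFrame): Unit = {
  val full = model.transform(df).select("prediction").collect().map(_.getDouble(0))
  val predictionOnly = model
    .setRawPredictionCol("")  // empty string turns the column off
    .setProbabilityCol("")
    .transform(df)
    .select("prediction").collect().map(_.getDouble(0))
  assert(full.sameElements(predictionOnly))
}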
val model = trainer.fit(strongDataset)
val result = model.transform(strongDataset)
model.setProbabilityCol("probability")
MLTestingUtils.checkCopyAndUids(trainer, model)
checkCopyAndUids is a generic test which should only be run in a single test; it does not need to be run in each test. Please remove it from here.
val result = model.transform(strongDataset)
model.setProbabilityCol("probability")
MLTestingUtils.checkCopyAndUids(trainer, model)
// result.select("probability").show(false)
Remove the old comment.
@@ -82,6 +83,49 @@ class MultilayerPerceptronClassifierSuite
  }
}

test("strong dataset test") {
Make the test title more descriptive so it is clear what it is testing. E.g. "Predicted class probabilities: calibration on toy dataset"
@@ -82,6 +83,49 @@ class MultilayerPerceptronClassifierSuite
  }
}

test("strong dataset test") {
  val layers = Array[Int](4, 5, 5, 2)
Can you make this test faster by using a simpler network, e.g., by removing one of the middle layers?
@@ -374,6 +380,22 @@ private[ann] trait TopologyModel extends Serializable {
  def predict(data: Vector): Vector

  /**
   * Raw prediction of the model
This documentation does not add any information. Can you please link to ProbabilisticClassifier instead?
@@ -361,9 +361,15 @@ private[ann] trait TopologyModel extends Serializable {
  * Forward propagation
  *
  * @param data input data
  * @param includeLastLayer include last layer when computing. In MultilayerPerceptronClassifier,
This text is unclear. This phrasing is better: "Include the last layer in the output. In MultilayerPerceptronClassifier, the last layer is always softmax; the last layer of outputs is needed for class predictions, but not for rawPrediction."
  Vectors.dense(result.last.toArray)
}

override def predictRaw(data: Vector): Vector = {
  val size = data.size
This temp val of "size" is only used once, so I recommend removing it to make the code clearer.
}

override def raw2ProbabilityInPlace(data: Vector): Vector = {
  val dataMatrix = new BDM[Double](data.size, 1, data.toArray)
  layerModels.last.eval(dataMatrix, dataMatrix)
This assumes that the eval method can operate in place. That is fine for the last layer of MLP (SoftmaxLayerModelWithCrossEntropyLoss), but it is not OK in general. More generally, these methods for classifiers should not go in the very general TopologyModel abstraction; that abstraction may be used in the future for regression as well. I'd be fine with putting this classification-specific logic in MLP itself; we do not need to generalize the logic until we add other Classifiers, which might take a long time.
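A non-aliasing variant could evaluate into a separate buffer, roughly like this sketch (it keeps the PR's method name and assumes the surrounding FeedForwardModel context; note it returns a fresh vector rather than truly operating in place):

override def raw2ProbabilityInPlace(data: Vector): Vector = {
  val raw = new BDM[Double](data.size, 1, data.toArray)
  val out = BDM.zeros[Double](data.size, 1)  // separate output buffer, no aliasing
  layerModels.last.eval(raw, out)
  Vectors.dense(out.toArray)
}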
Ping: If this proposal sounds good, then can you please update accordingly?
  .setMaxIter(100)
  .setSolver("l-bfgs")
val model = trainer.fit(dataset)
model.setProbabilityCol("probability")
That's the default already, right?
Ping: this should not be necessary.
val result = model.transform(dataset)
val features2prob = udf { features: Vector => model.mlpModel.predict(features) }
val cmpVec = udf { (v1: Vector, v2: Vector) => v1 ~== v2 relTol 1e-3 }
assert(result.select(cmpVec(features2prob(col("features")), col("probability")))
If this test fails, it will not give much info. How about collecting the data and comparing on the driver?
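For example, the driver-side comparison could look like this sketch (reusing the names from the test above and the ~== syntax from Spark's TestingUtils):

// Collect features and probabilities to the driver so a failure can report
// the offending values rather than a single false boolean.
result.select("features", "probability").collect().foreach { row =>
  val features = row.getAs[Vector](0)
  val prob = row.getAs[Vector](1)
  val expected = model.mlpModel.predict(features)
  assert(expected ~== prob relTol 1e-3, s"expected $expected but got $prob")
}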
cc @jkbradley Code updated, thanks!
Test build #80684 has finished for PR 17373 at commit
Jenkins, test this please.
Test build #80689 has finished for PR 17373 at commit
Thanks for the updates! Just a few items remain. (See responses to other comments above as well.)
Thinking more about the proposal to separate the classification-specific logic out of the generic Topology: it's something we should definitely do at some point, but I'm OK with leaving it as is for now. Adding new, unused classes is probably not worth the trouble right now. Can you please document very clearly, though, that predictRaw and raw2ProbabilityInPlace are only for classification? Thanks!
Test build #80762 has finished for PR 17373 at commit
LGTM, but I want to check about the comment above:
This won't show up automatically in R because only a selected subset is exposed in the summary() method.
It would be great if probability is exposed there too:
https://github.com/apache/spark/blob/master/R/pkg/R/mllib_classification.R#L488
@felixcheung So, since it does not cause bugs in SparkR, can we leave it for a separate PR?
We can open a JIRA to track it.
Oh OK, makes sense. @WeichenXu123 could you please open a JIRA (linked from this task's JIRA) and CC @felixcheung on it? Thanks! I'll rerun tests to be safe and merge this afterwards.
Test build #3894 has finished for PR 17373 at commit
Jenkins, test this please.
Test build #80946 has finished for PR 17373 at commit
Thanks! Will merge after rerunning tests.
Test build #3895 has finished for PR 17373 at commit
Merging with master.
What changes were proposed in this pull request?
Modify the MLP model to inherit ProbabilisticClassificationModel so that it can expose the probability column when transforming data.
How was this patch tested?
Test added.
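Illustrative usage after this change (dataset names are placeholders; train and test need "label" and "features" columns):

import org.apache.spark.ml.classification.MultilayerPerceptronClassifier

val mlp = new MultilayerPerceptronClassifier()
  .setLayers(Array(4, 5, 2))  // input size, hidden size, number of classes
  .setMaxIter(100)

val model = mlp.fit(train)
model.transform(test)
  .select("features", "rawPrediction", "probability", "prediction")
  .show()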