
[SPARK-12664][ML] Expose probability in mlp model #17373

Closed

Conversation

WeichenXu123
Contributor

What changes were proposed in this pull request?

Modify the MLP model to inherit from ProbabilisticClassificationModel so that it can expose the probability column when transforming data.
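
For illustration, a minimal usage sketch (the DataFrames, layer sizes, and helper names below are hypothetical, not part of this patch):

import org.apache.spark.ml.classification.MultilayerPerceptronClassifier

// After this change, transform() also populates the probability column
// (via ProbabilisticClassificationModel), alongside rawPrediction and
// prediction.
val mlp = new MultilayerPerceptronClassifier()
  .setLayers(Array(4, 5, 2))   // input size, hidden size, number of classes
val model = mlp.fit(trainDf)   // trainDf: assumed DataFrame with "features"/"label"
model.transform(testDf)        // testDf: assumed DataFrame with "features"
  .select("prediction", "rawPrediction", "probability")
  .show()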

How was this patch tested?

Test added.

@SparkQA

SparkQA commented Mar 21, 2017

Test build #74965 has finished for PR 17373 at commit 1f0da4e.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@nicodri

nicodri commented Apr 14, 2017

@WeichenXu123 are you still working on this, or do you want me to take it over? I'm also interested in adding this feature. Thanks!

@WeichenXu123
Contributor Author

@nicodri Hi, I am modifying this PR and will commit this week! Thanks!

@WeichenXu123 WeichenXu123 force-pushed the expose_probability_in_mlp_model branch from 1f0da4e to 4cf8cee on April 16, 2017 15:37
@WeichenXu123
Contributor Author

cc @yanboliang thanks!

@SparkQA

SparkQA commented Apr 16, 2017

Test build #75835 has finished for PR 17373 at commit 4cf8cee.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@alwaysprep

In which version will this be available in PySpark?

@SparkQA

SparkQA commented Jul 8, 2017

Test build #79374 has finished for PR 17373 at commit 4cf8cee.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@LeoIV

LeoIV commented Jul 12, 2017

I'd like to express my interest in this feature. It is also important for building ensemble classifiers.

@WeichenXu123
Contributor Author

@LeoIV Sorry for the delay! I will update the code soon!

@WeichenXu123
Contributor Author

WeichenXu123 commented Jul 13, 2017 via email

@LeoIV

LeoIV commented Jul 14, 2017

@WeichenXu123 I deleted my last comment because I wasn't quite sure whether I had made mistakes elsewhere. As I described above, I applied your changes to version 2.1. For small datasets, I get raw predictions that are not in [0, 1]. You should be able to check it using this small test case:

import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
import org.apache.spark.sql.types.{IntegerType, StructType}
import org.apache.spark.sql.{Row, SparkSession}

object TestProb {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()

    // Integer label column plus an ML vector column for the features.
    val rowSchema = new StructType()
      .add("class", IntegerType)
      .add("features", org.apache.spark.ml.linalg.SQLDataTypes.VectorType)

    // Tiny three-class dataset: two identical rows per class.
    val testData: RDD[Row] = spark.sparkContext.parallelize(Seq(
      new GenericRowWithSchema(Array(0, Vectors.dense(Array(0.1, 0.2, 0.3, 0.4, 0.5))), rowSchema).asInstanceOf[Row],
      new GenericRowWithSchema(Array(0, Vectors.dense(Array(0.1, 0.2, 0.3, 0.4, 0.5))), rowSchema).asInstanceOf[Row],
      new GenericRowWithSchema(Array(1, Vectors.dense(Array(0.1, 0.5, 0.3, 0.4, 0.5))), rowSchema).asInstanceOf[Row],
      new GenericRowWithSchema(Array(1, Vectors.dense(Array(0.1, 0.5, 0.3, 0.4, 0.5))), rowSchema).asInstanceOf[Row],
      new GenericRowWithSchema(Array(2, Vectors.dense(Array(0.1, 0.2, 0.8, 0.4, 0.5))), rowSchema).asInstanceOf[Row],
      new GenericRowWithSchema(Array(2, Vectors.dense(Array(0.1, 0.2, 0.8, 0.4, 0.5))), rowSchema).asInstanceOf[Row]
    ))

    val testDataDf = spark.createDataFrame(testData, rowSchema)

    // 5 input features, one hidden layer of 4 units, 3 output classes.
    val mlp = new MultilayerPerceptronClassifier()
      .setFeaturesCol("features")
      .setLabelCol("class")
      .setLayers(Array(5, 4, 3))

    val mlpModel = mlp.fit(testDataDf)

    mlpModel.transform(testDataDf).show(6)
  }
}

Using this, I get the following results:

+-----+--------------------+--------------------+--------------------+----------+
|class|            features|       rawPrediction|         probability|prediction|
+-----+--------------------+--------------------+--------------------+----------+
|    0|[0.1,0.2,0.3,0.4,...|[52.5097295377110...|[1.0,1.1880726027...|       0.0|
|    0|[0.1,0.2,0.3,0.4,...|[52.5097295377110...|[1.0,1.1880726027...|       0.0|
|    1|[0.1,0.5,0.3,0.4,...|[22.9478511752010...|[4.03649486150668...|       1.0|
|    1|[0.1,0.5,0.3,0.4,...|[22.9478511752010...|[4.03649486150668...|       1.0|
|    2|[0.1,0.2,0.8,0.4,...|[6.36424366031029...|[4.39122384367774...|       2.0|
|    2|[0.1,0.2,0.8,0.4,...|[6.36424366031029...|[4.39122384367774...|       2.0|
+-----+--------------------+--------------------+--------------------+----------+

Does this work in your code?

@WeichenXu123
Contributor Author

WeichenXu123 commented Jul 14, 2017 via email

@LeoIV

LeoIV commented Jul 15, 2017

Alright, thanks. But have a look at the probabilities. They aren’t in [0,1] either.

@WeichenXu123
Contributor Author

WeichenXu123 commented Jul 15, 2017 via email

@LeoIV

LeoIV commented Jul 15, 2017

Right 🙈 That explains why it works perfectly fine with my classifier :-)

@WeichenXu123 WeichenXu123 force-pushed the expose_probability_in_mlp_model branch from 4cf8cee to fb83553 on July 20, 2017 21:31
@WeichenXu123
Contributor Author

cc @jkbradley @yanboliang

@SparkQA

SparkQA commented Jul 20, 2017

Test build #79806 has finished for PR 17373 at commit fb83553.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@WeichenXu123 WeichenXu123 force-pushed the expose_probability_in_mlp_model branch from fb83553 to 14c4c6c on July 26, 2017 01:06
@SparkQA

SparkQA commented Jul 26, 2017

Test build #79950 has finished for PR 17373 at commit 14c4c6c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class MultilayerPerceptronClassificationModel(JavaModel, JavaClassificationModel, JavaMLWritable,

@WeichenXu123 WeichenXu123 (Contributor Author) left a comment

Adding some comments while reviewing it again.

@@ -463,7 +479,7 @@ private[ml] class FeedForwardModel private(
   private var outputs: Array[BDM[Double]] = null
   private var deltas: Array[BDM[Double]] = null

-  override def forward(data: BDM[Double]): Array[BDM[Double]] = {
+  override def forward(data: BDM[Double], containsLastLayer: Boolean): Array[BDM[Double]] = {
     // Initialize output arrays for all layers. Special treatment for InPlace
Contributor Author

The last layer is always softmax. I added the containsLastLayer parameter: when true, the forward computation includes the last layer; otherwise it does not. The parameter is used when we need rawPrediction, for which the last softmax layer should be discarded.
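
A minimal sketch of the two call modes (model and batch are assumed stand-ins for a FeedForwardModel and an input matrix with one example per column):

// Hypothetical illustration of the new parameter, not code from this patch.
val withSoftmax = model.forward(batch, containsLastLayer = true)  // last output: class probabilities
val rawScores   = model.forward(batch, containsLastLayer = false) // last output: pre-softmax raw scores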

Contributor

Could you add the above comment in the code? It could be useful for folks reading/editing this in the future.

Also, it seems like the last layer could also be a SigmoidLayerWithSquaredError or a SigmoidFunction; do we need to handle those cases any differently?

Contributor Author

@MrBago In MultiLayerPerceptronClassifier.train there is a line:

val topology = FeedForwardTopology.multiLayerPerceptron(myLayers, softmaxOnTop = true)

So MultiLayerPerceptronClassifier always uses softmax as the last layer.

Vectors.dense(result.last.toArray)
}

override def predictRaw(data: Vector): Vector = {
val size = data.size
Contributor Author

Added the predictRaw method, which computes the forward pass without the last (softmax) layer.
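
Roughly, the idea is the following (a sketch only; the surrounding class and the exact merged body are elided):

// Run the network without the final softmax layer and expose the
// activations of the layer below it as the raw prediction vector.
override def predictRaw(data: Vector): Vector = {
  val result = forward(new BDM[Double](data.size, 1, data.toArray), containsLastLayer = false)
  Vectors.dense(result.last.toArray)
}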

}

override def raw2ProbabilityInPlace(data: Vector): Vector = {
val dataMatrix = new BDM[Double](data.size, 1, data.toArray)
Contributor Author

Added raw2ProbabilityInPlace. What it computes is:

softmax(rawPredictionVector) ==> probabilityVector

It directly calls the last layer's function to compute this.
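
Conceptually this is a numerically stable softmax over the raw scores. A self-contained sketch of the math (the patch itself delegates to the last layer's eval rather than reimplementing it):

import breeze.linalg.{max, sum, DenseVector => BDV}
import breeze.numerics.exp

// softmax(x)_i = exp(x_i) / sum_j exp(x_j); subtracting max(x) first
// avoids overflow without changing the result.
def softmax(raw: Array[Double]): Array[Double] = {
  val x = BDV(raw)
  val e = exp(x - max(x))
  (e / sum(e)).toArray
}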

@WeichenXu123
Contributor Author

cc @yanboliang @jkbradley

@MrBago MrBago (Contributor) left a comment

@WeichenXu123 I left some comments on the PR. This looks good overall; the main thing I'd like to do is add a stronger test.

@@ -463,7 +479,7 @@ private[ml] class FeedForwardModel private(
   private var outputs: Array[BDM[Double]] = null
   private var deltas: Array[BDM[Double]] = null

-  override def forward(data: BDM[Double]): Array[BDM[Double]] = {
+  override def forward(data: BDM[Double], containsLastLayer: Boolean): Array[BDM[Double]] = {
Contributor

Could we use a variable name like includeLastLayer here? containsLastLayer sounds like a property of the model instead of an instruction to the method.


@@ -363,7 +363,7 @@ private[ann] trait TopologyModel extends Serializable {
    * @param data input data
    * @return array of outputs for each of the layers
    */
-  def forward(data: BDM[Double]): Array[BDM[Double]]
+  def forward(data: BDM[Double], containsLastLayer: Boolean): Array[BDM[Double]]
Contributor

Can you update the docstring for this method to add the argument?

assert(result.select(cmpVec(features2prob(col("features")), col("probability")))
.rdd.map(_.getBoolean(0)).reduce(_ && _))
}

Contributor

I think we should include a stronger test for this. I did a quick search and couldn't find a strong test for mlpModel.predict; it might be good to add one. Also, I believe this XOR dataset only produces probability predictions approximately equal to 0 or 1.

Contributor Author

@MrBago
What form should the stronger test take? Add a test that checks the probability vector against given expected vectors?
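
For instance, a calibration-style check on a toy dataset where the target probabilities are known by construction; a sketch only, where toyDataset and expectedProb are assumed helpers and ~== comes from MLlib's TestingUtils:

// Train on a toy set whose class-conditional probabilities are known,
// then compare predicted probabilities against them on the driver.
val trainer = new MultilayerPerceptronClassifier()
  .setLayers(Array(1, 3, 2)).setSeed(123L).setMaxIter(100).setSolver("l-bfgs")
val model = trainer.fit(toyDataset)
model.transform(toyDataset).select("features", "probability").collect().foreach {
  case Row(f: Vector, p: Vector) =>
    assert(p(1) ~== expectedProb(f) relTol 0.05) // tolerance is an assumption
}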

@WeichenXu123 WeichenXu123 force-pushed the expose_probability_in_mlp_model branch from 14c4c6c to 645fdc4 on August 7, 2017 23:58
@SparkQA

SparkQA commented Aug 8, 2017

Test build #80371 has finished for PR 17373 at commit 645fdc4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Member

I think this will change the output of summary on spark.mlp in R, right?

@jkbradley
Member

@WeichenXu123 Can you please add "[ML]" to the PR title?

@WeichenXu123 WeichenXu123 changed the title from "[SPARK-12664] Expose probability in mlp model" to "[SPARK-12664][ML] Expose probability in mlp model" on Aug 14, 2017
@jkbradley jkbradley (Member) left a comment

Thanks for the PR @WeichenXu123 ! I just finished a review pass.

One challenge with the ProbabilisticClassifier abstraction is that it introduces different code paths for predictions depending on which output columns are turned on or off: probability, rawPrediction, prediction. We ran into a bug in MLOR with this. It'd be good to follow up after this PR with another PR to add a generic test for this; I created https://issues.apache.org/jira/browse/SPARK-21729 to track this issue.

val model = trainer.fit(strongDataset)
val result = model.transform(strongDataset)
model.setProbabilityCol("probability")
MLTestingUtils.checkCopyAndUids(trainer, model)
Member

checkCopyAndUids is a generic test which should only be run in a single test; it does not need to be run in each test. Please remove it from here.

val result = model.transform(strongDataset)
model.setProbabilityCol("probability")
MLTestingUtils.checkCopyAndUids(trainer, model)
// result.select("probability").show(false)
Member

remove old comment

@@ -82,6 +83,49 @@ class MultilayerPerceptronClassifierSuite
}
}

test("strong dataset test") {
Member

Make the test title more descriptive so it is clear what it is testing. E.g. "Predicted class probabilities: calibration on toy dataset"

@@ -82,6 +83,49 @@ class MultilayerPerceptronClassifierSuite
}
}

test("strong dataset test") {
val layers = Array[Int](4, 5, 5, 2)
Member

Can you make this test faster by using a simpler network, e.g., by removing one of the middle layers?

@@ -374,6 +380,22 @@ private[ann] trait TopologyModel extends Serializable {
def predict(data: Vector): Vector

/**
* Raw prediction of the model
Member

This documentation does not add any information. Can you please link to ProbabilisticClassifier instead?

@@ -361,9 +361,15 @@ private[ann] trait TopologyModel extends Serializable {
* Forward propagation
*
* @param data input data
* @param includeLastLayer include last layer when computing. In MultilayerPerceptronClassifier,
Member

This text is unclear. This phrasing is better: "Include the last layer in the output. In MultilayerPerceptronClassifier, the last layer is always softmax; the last layer of outputs is needed for class predictions, but not for rawPrediction."

Vectors.dense(result.last.toArray)
}

override def predictRaw(data: Vector): Vector = {
val size = data.size
Member

This temp val of "size" is only used once, so I recommend removing it to make the code clearer.


override def raw2ProbabilityInPlace(data: Vector): Vector = {
val dataMatrix = new BDM[Double](data.size, 1, data.toArray)
layerModels.last.eval(dataMatrix, dataMatrix)
Member

This assumes that the eval method can operate in-place. That is fine for the last layer for MLP (SoftmaxLayerModelWithCrossEntropyLoss), but not OK in general. More generally, these methods for classifiers should not go in the very general TopologyModel abstraction; that abstraction may be used in the future for regression as well. I'd be fine with putting this classification-specific logic in MLP itself; we do not need to generalize the logic until we add other Classifiers, which might take a long time.

Member

Ping: If this proposal sounds good, then can you please update accordingly?

.setMaxIter(100)
.setSolver("l-bfgs")
val model = trainer.fit(dataset)
model.setProbabilityCol("probability")
Member

That's the default already, right?

Member

Ping --- this should not be necessary

val result = model.transform(dataset)
val features2prob = udf { features: Vector => model.mlpModel.predict(features) }
val cmpVec = udf { (v1: Vector, v2: Vector) => v1 ~== v2 relTol 1e-3 }
assert(result.select(cmpVec(features2prob(col("features")), col("probability")))
Member

If this test fails, it will not give much info. How about collecting the data and comparing on the driver?
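
i.e., something along these lines (a sketch; assumes the suite's existing imports such as Row, Vector, and TestingUtils):

// Collect to the driver so a failure can report the offending row
// instead of a single aggregated boolean.
result.select("features", "probability").collect().foreach {
  case Row(f: Vector, p: Vector) =>
    val expected = model.mlpModel.predict(f)
    assert(p ~== expected relTol 1e-3, s"features $f: got $p, expected $expected")
}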

@WeichenXu123
Contributor Author

cc @jkbradley Code updated, thanks!

@SparkQA

SparkQA commented Aug 15, 2017

Test build #80684 has finished for PR 17373 at commit eedc647.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@WeichenXu123
Contributor Author

Jenkins, test this please.

@SparkQA

SparkQA commented Aug 15, 2017

Test build #80689 has finished for PR 17373 at commit eedc647.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley jkbradley (Member) left a comment

Thanks for the updates! Just a few items remain. (See responses to other comments above as well.)



@jkbradley
Member

Thinking more about the proposal to separate the classification-specific logic out of the generic Topology: it's something we should definitely do at some point, but I'm OK with leaving it as is for now. Adding new, unused classes is probably not worth the trouble right now. Can you please document very clearly, though, that predictRaw and raw2ProbabilityInPlace are only for classification? Thanks!
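
e.g., scaladoc along these lines (a suggested wording, not necessarily what was merged):

/**
 * Raw prediction of each possible label (classification topologies only).
 * This method and raw2ProbabilityInPlace assume the last layer maps raw
 * scores to class probabilities (softmax for MLP); they are not meaningful
 * for regression topologies.
 */
def predictRaw(data: Vector): Vector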

@SparkQA

SparkQA commented Aug 17, 2017

Test build #80762 has finished for PR 17373 at commit 5369b08.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Member

jkbradley commented Aug 17, 2017

LGTM, but I want to follow up on the comment above:
@felixcheung How will this affect spark.mlp in R?

@felixcheung felixcheung (Member) left a comment

This won't show up automatically in R because only a selected subset is exposed in the summary() method.

It would be great if probability were exposed here as well:
https://github.com/apache/spark/blob/master/R/pkg/R/mllib_classification.R#L488

@WeichenXu123
Contributor Author

@felixcheung So it does not cause bugs in SparkR; can we leave it to a separate PR?

@felixcheung
Member

We can open a JIRA to track it.

@jkbradley
Member

Oh OK makes sense. @WeichenXu123 could you please open a JIRA (linked from this task's JIRA) and CC @felixcheung on it? Thanks!

I'll rerun tests to be safe and merge this afterwards.

@SparkQA

SparkQA commented Aug 21, 2017

Test build #3894 has finished for PR 17373 at commit 5369b08.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@WeichenXu123
Contributor Author

Jenkins, test this please.

@SparkQA

SparkQA commented Aug 22, 2017

Test build #80946 has finished for PR 17373 at commit 5369b08.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Member

Thanks! Will merge after rerunning tests

@SparkQA

SparkQA commented Aug 23, 2017

Test build #3895 has finished for PR 17373 at commit 5369b08.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Member

Merging with master
Thanks @WeichenXu123 !

@asfgit asfgit closed this in d6b30ed Aug 23, 2017
@WeichenXu123 WeichenXu123 deleted the expose_probability_in_mlp_model branch August 23, 2017 05:47