[SPARK-10931][ML][PYSPARK] PySpark Models Copy Param Values from Estimator #17849

BryanCutler · 2017-05-03T23:02:16Z

What changes were proposed in this pull request?

Added a call to copy Param values from the Estimator to the Model after fit in PySpark ML. This copies values for any params that are also defined in the Model. Because most Models currently do not define the same params as their Estimator, this also adds a method that creates new Params by inspecting the Java object when they do not exist in the Python object. This is a temporary fix that can be removed once the PySpark models properly define the params themselves.

How was this patch tested?

Refactored the check_params test to optionally check if the model params for Python and Java match and added this check to an existing fitted model that shares params between Estimator and Model.

BryanCutler · 2017-05-03T23:03:10Z

python/pyspark/ml/classification.py

@@ -1325,7 +1325,7 @@ def __init__(self, featuresCol="features", labelCol="label", predictionCol="pred
        super(MultilayerPerceptronClassifier, self).__init__()
        self._java_obj = self._new_java_obj(
            "org.apache.spark.ml.classification.MultilayerPerceptronClassifier", self.uid)
-        self._setDefault(maxIter=100, tol=1E-4, blockSize=128, stepSize=0.03, solver="l-bfgs")
+        self._setDefault(maxIter=100, tol=1E-6, blockSize=128, stepSize=0.03, solver="l-bfgs")


This is a difference in default values between Python and Java that wasn't being caught because check_params was returning prematurely.

Looks like 1e-6 is the correct default value.

Yes, the check_params test was meant to catch that but was broken.

BryanCutler · 2017-05-03T23:04:30Z

python/pyspark/ml/tests.py

-                             % (p.name, str(py_stage)))
-            if py_has_default:
-                if p.name == "seed":
-                    return  # Random seeds between Spark and PySpark are different


This should not return; I changed it to continue above.
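
To illustrate why the early return hid the MLP default mismatch, here is a minimal sketch. Plain dictionaries stand in for the Python and Java default maps, and the function names and sample values are hypothetical, not the actual test code:

```python
def check_defaults_buggy(py_defaults, java_defaults):
    """Collect mismatched defaults, but stop entirely at the first 'seed' param."""
    mismatches = []
    for name in sorted(py_defaults):
        if name == "seed":
            # Bug: this leaves the function, so params after 'seed' are never checked.
            return mismatches
        if py_defaults[name] != java_defaults.get(name):
            mismatches.append(name)
    return mismatches


def check_defaults_fixed(py_defaults, java_defaults):
    """Collect mismatched defaults, skipping only the 'seed' param."""
    mismatches = []
    for name in sorted(py_defaults):
        if name == "seed":
            # Random seeds legitimately differ, so skip this one param only.
            continue
        if py_defaults[name] != java_defaults.get(name):
            mismatches.append(name)
    return mismatches


# Hypothetical defaults: 'tol' differs, and sorts after 'seed'.
py_defaults = {"maxIter": 100, "seed": 1, "tol": 1e-4}
java_defaults = {"maxIter": 100, "seed": 2, "tol": 1e-6}
```

With these inputs the buggy version reports no mismatches because it exits at seed, while the fixed version reports tol.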

BryanCutler · 2017-05-03T23:18:43Z

python/pyspark/ml/tests.py

@@ -1355,7 +1370,7 @@ def test_java_params(self):
            for name, cls in inspect.getmembers(module, inspect.isclass):
                if not name.endswith('Model') and issubclass(cls, JavaParams)\
                        and not inspect.isabstract(cls):
-                    self.check_params(cls())
+                    ParamTests.check_params(self, cls(), check_params_exist=False)


Setting check_params_exist to True will uncover any params that exist in Java but not in Python

This might make sense to include as a comment in the code for whoever is coming to update this.

sure, will do

SparkQA · 2017-05-03T23:25:05Z

Test build #76429 has finished for PR 17849 at commit 765eb5f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

BryanCutler · 2017-05-03T23:38:12Z

@jkbradley @holdenk the heart of this change is just adding the call to _copyValues to copy param values from Estimator to Model. That doesn't really do much though, since most of the Python models do not define any params and there is nothing to copy to. So I added a temporary little hack to look at the Java Model params after fitting and create any params that don't already exist, then any set values can be copied. Also needed to do the same after loading a Python model or this will fail persistence tests.

I know having this temporary 'fix' isn't ideal, but it would allow us to incrementally add missing Params or restructure the class hierarchy to match the Scala versions, and it will continue to copy these values to the Models. Until that is done, there won't be explicit methods to get each param, such as getMaxDepth(), but the param value can still be accessed with model.getOrDefault("maxDepth"), which gives users a workaround for all of those types of JIRAs that have come up. What do you guys think?

BryanCutler · 2017-05-04T17:59:45Z

python/pyspark/ml/wrapper.py

+        # SPARK-10931: This is a temporary fix to allow models to own params
+        # from estimators. Eventually, these params should be in models through
+        # using common base classes between estimators and models.
+        model._create_params_from_java()


This might be better to move to JavaModel.__init__() for the case of creating a model without fitting, e.g. CountVectorizerModel from vocabulary.

So right now this would apply to all of the models, would it make sense to make it so that we can selectively move the params forward one at a time?

I don't think there is really any downside to just creating all the Params from Java; see my comment below.

holdenk · 2017-05-06T07:45:30Z

python/pyspark/ml/tests.py

@@ -404,6 +404,53 @@ def test_copy_param_extras(self):
        self.assertEqual(tp._paramMap, copied_no_extra)
        self.assertEqual(tp._defaultParamMap, tp_copy._defaultParamMap)

+    @staticmethod
+    def check_params(test_self, py_stage, check_params_exist=True):


Thank you so much for putting in the time on this. :D :D

no problem!

holdenk

Thanks a lot for working on this, I've done a first read through with some questions :)

holdenk · 2017-05-06T21:55:32Z

python/pyspark/ml/tests.py

@@ -1355,7 +1370,7 @@ def test_java_params(self):
            for name, cls in inspect.getmembers(module, inspect.isclass):
                if not name.endswith('Model') and issubclass(cls, JavaParams)\
                        and not inspect.isabstract(cls):
-                    self.check_params(cls())
+                    ParamTests.check_params(self, cls(), check_params_exist=False)


This might make sense to include as a comment in the code for whoever is coming to update this.

holdenk · 2017-05-06T21:58:54Z

python/pyspark/ml/wrapper.py

+        # SPARK-10931: This is a temporary fix to allow models to own params
+        # from estimators. Eventually, these params should be in models through
+        # using common base classes between estimators and models.
+        model._create_params_from_java()


So right now this would apply to all of the models, would it make sense to make it so that we can selectively move the params forward one at a time?

BryanCutler · 2017-05-08T18:33:26Z

Thanks @holdenk for the review! I think I wrote the description a little too rushed, so let me clarify a bit...

The temporary "fix" will just create empty params in the model if they exist in the Java model but not the Python one. There should be no risk in having these added to the Python model since they are empty when created and not yet defined with a value. These params will be set in 2 ways: 1) after the model is fit, in the call to _copyValues, where the value is copied from the estimator for any matching params; 2) when the model is loaded, there is a call to _transfer_params_from_java that will copy the value if the Java param has been explicitly set (I think I need to add something here for the case where the Java model has a default value but the Python model doesn't).

I think the best way forward to get parity with the Scala API is to then organize a JIRA with subtasks to update the Python ML class hierarchies to match the Scala ones, so that the Params will be defined that way with proper "get" and "set" methods too. It might be good to also have a Python test that checks for matching params in Java for both the estimators and models. It could be ignored by default and then enabled during the QA period. The temporary fix here would continue to work and not interfere while the params are being added. It could be removed once we feel that most of the params have been properly added and close to matching the Scala API.
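
The "empty params" step described above can be sketched roughly as follows. This uses stand-in classes rather than the real PySpark wrapper; FakeJavaModel, FakePyModel and their methods are hypothetical names for illustration only:

```python
class FakeJavaModel:
    """Stand-in for the JVM model; it only exposes its param names."""

    def __init__(self, param_names):
        self._param_names = list(param_names)

    def params(self):
        return list(self._param_names)


class FakePyModel:
    """Stand-in for a Python model wrapping a Java model."""

    def __init__(self, java_model):
        self._java_obj = java_model
        self._params = {}   # declared params: name -> doc string
        self._values = {}   # explicitly set values: name -> value

    def create_params_from_java(self):
        # Declare any param the Java side has and Python lacks.
        # No value is assigned here, so the new param stays undefined.
        for name in self._java_obj.params():
            if name not in self._params:
                self._params[name] = "created from Java"

    def is_defined(self, name):
        return name in self._values

    def set(self, name, value):
        if name not in self._params:
            raise KeyError("unknown param: %s" % name)
        self._values[name] = value


model = FakePyModel(FakeJavaModel(["maxDepth", "seed"]))
model.create_params_from_java()
```

After this call the model declares maxDepth and seed, but neither has a value until something (a copy from the estimator, or a transfer from Java on load) sets one.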

SparkQA · 2017-05-08T22:29:27Z

Test build #76596 has finished for PR 17849 at commit 4a66e90.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

BryanCutler · 2017-05-30T22:00:36Z

ping @jkbradley @holdenk , please have a look when you can, thanks!

holdenk · 2017-07-30T00:34:28Z

This looks pretty reasonable, sorry for the delay. If you have a chance, updating this to master would be good.

BryanCutler · 2017-07-31T17:34:34Z

Thanks @holdenk! Sure, I'll update to master

SparkQA · 2017-07-31T19:14:36Z

Test build #80089 has finished for PR 17849 at commit 4affa01.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

BryanCutler · 2017-07-31T19:18:22Z

ping @holdenk - think this is good to go?

WeichenXu123 · 2017-08-03T05:51:31Z

Thanks for your work on this, but I am curious: what is the benefit of doing this? In PySpark there is currently no param in the Model itself. What problem or bugs does adding params to the PySpark model resolve?

BryanCutler · 2017-08-03T17:06:50Z

If params are defined in the PySpark model, then when the estimator is fit, a Scala model is created and the PySpark model is wrapped around it. The param values from the Scala version are never transferred to the PySpark model, so the defined params will only have default values.

BryanCutler · 2017-08-09T23:07:00Z

ping @holdenk , also @HyukjinKwon if you are able to take a look

BryanCutler · 2017-08-09T23:17:05Z

python/pyspark/ml/wrapper.py

@@ -263,7 +284,8 @@ def _fit_java(self, dataset):

    def _fit(self, dataset):
        java_model = self._fit_java(dataset)
-        return self._create_model(java_model)
+        model = self._create_model(java_model)
+        return self._copyValues(model)


This is the crucial line being added in this PR. Without this, if a Python model defines a param (matching one from Scala), then when the model is fit in Scala that param value will never be sent back to Python.

Here I think it is going to copy values from the estimator to the created model. So I think we assume that the params in estimator and model are the same?

Yes, that is the assumption and it's the same on the Scala side too. The estimators and models should both have a shared mixin that defines the common params used. That's how it's done on the Scala side and Python should follow (once that's done, the temporary fix from here can be removed).
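
As a rough sketch of that assumption, the toy classes below share a mixin that declares the common param, and fit copies only the values whose params the model also declares. All class and method names here are hypothetical and much simpler than the real Params API:

```python
class ToyParams:
    """Minimal param holder: a set of declared names plus a map of set values."""

    param_names = frozenset()

    def __init__(self):
        self._param_map = {}

    def set(self, name, value):
        if name not in self.param_names:
            raise ValueError("unknown param: %s" % name)
        self._param_map[name] = value
        return self

    def copy_values(self, to):
        # Copy only the values whose params the target also declares.
        for name, value in self._param_map.items():
            if name in to.param_names:
                to._param_map[name] = value
        return to


class HasMaxDepth(ToyParams):
    """Shared mixin declaring the param common to estimator and model."""

    param_names = ToyParams.param_names | {"maxDepth"}


class ToyModel(HasMaxDepth):
    pass


class ToyEstimator(HasMaxDepth):
    # The estimator also has a training-only param the model does not declare.
    param_names = HasMaxDepth.param_names | {"checkpointInterval"}

    def fit(self):
        model = ToyModel()
        return self.copy_values(model)


estimator = ToyEstimator().set("maxDepth", 7).set("checkpointInterval", 10)
fitted = estimator.fit()
```

Here the fitted model ends up with maxDepth only; checkpointInterval is not copied because the model does not declare it.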

HyukjinKwon · 2017-08-10T02:12:37Z

I am rather a backend developer and work together with data scientists, so my ML knowledge is limited (am studying hard :)). I will leave a few comments if there are some nits, once someone starts to review, so that they can be addressed together.

cc @viirya, who I believe knows ML a bit, and @zero323, who I believe should be able to review this (but is inactive now). Are you maybe able to make a pass on this one?

HyukjinKwon · 2017-08-10T02:22:40Z

Will try to give a pass anyway.

HyukjinKwon · 2017-08-10T03:09:33Z

python/pyspark/ml/wrapper.py

+            java_param_name = java_param.name()
+            if not hasattr(self, java_param_name):
+                param = Param(self, java_param_name, java_param.doc())
+                setattr(param, "created_from_java_param", True)


BTW, would you mind if I ask where created_from_java_param is used?

Since this is part of a temporary fix to add Params that are defined in Java but not in Python, this just adds a tag to the Param so that, if something goes wrong, we will know the param was created here.

viirya · 2017-08-10T03:39:28Z

python/pyspark/ml/wrapper.py

+        from pyspark.ml.param import Param
+        for java_param in java_params:
+            java_param_name = java_param.name()
+            if not hasattr(self, java_param_name):


If self contains an attribute with the same name that is not a Param, should we handle it, e.g. by throwing an exception?

Good point, it's possible that there could be an attribute with that name that is not a param. If that's the case, then it is probably best to just ignore it silently since this is not critical to the model.
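
A small sketch of the "ignore silently" behaviour, using a toy Param class and a hypothetical helper rather than the real wrapper code:

```python
class Param:
    """Toy stand-in for pyspark.ml.param.Param."""

    def __init__(self, name, doc=""):
        self.name = name
        self.doc = doc


class ToyModel:
    def __init__(self):
        # A plain attribute that happens to share a name with a Java param.
        self.vocabulary = ["a", "b"]


def add_missing_params(obj, java_param_names):
    """Attach a Param for each Java name that obj has no attribute for."""
    created = []
    for name in java_param_names:
        if hasattr(obj, name):
            # Either already a Param or an unrelated attribute: leave it untouched.
            continue
        setattr(obj, name, Param(name, "created from Java"))
        created.append(name)
    return created


model = ToyModel()
created = add_missing_params(model, ["vocabulary", "minTF"])
```

In this sketch vocabulary keeps its original list value and only minTF is created as a Param.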

holdenk · 2017-08-10T03:44:39Z

Sorry, let me try and take a look tomorrow.

viirya · 2017-08-10T04:01:10Z

python/pyspark/ml/wrapper.py

-                if self._java_obj.isSet(java_param):
+                if self._java_obj.isSet(java_param) or (
+                        # SPARK-10931: Temporary fix for params that have a default in Java
+                        self._java_obj.hasDefault(java_param) and not self.isDefined(param)):


This change will turn a default value for a param on the Java side into a user-provided param value on the Python side. I think we should use _setDefault for default values instead of _set.

True. I was thinking that since this is part of the temporary fix it doesn't matter, but it won't be much extra to use _setDefault, and it will probably be clearer.

OK, fixed to use _setDefault.
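
The distinction between a set value and a default can be sketched with a toy params holder. The class and function below are hypothetical simplifications, not the actual wrapper implementation:

```python
class ToyParams:
    """Keeps user-set values and defaults in separate maps."""

    def __init__(self):
        self._param_map = {}
        self._default_param_map = {}

    def set(self, name, value):
        self._param_map[name] = value

    def set_default(self, name, value):
        self._default_param_map[name] = value

    def is_set(self, name):
        return name in self._param_map

    def has_default(self, name):
        return name in self._default_param_map

    def is_defined(self, name):
        return self.is_set(name) or self.has_default(name)

    def get_or_default(self, name):
        if name in self._param_map:
            return self._param_map[name]
        return self._default_param_map[name]


def transfer_params_from_java(py_obj, java_obj, names):
    """Copy set values as set values, and Java-only defaults as defaults."""
    for name in names:
        if java_obj.is_set(name):
            py_obj.set(name, java_obj.get_or_default(name))
        elif java_obj.has_default(name) and not py_obj.is_defined(name):
            # Keep it a default so it does not look user-provided in Python.
            py_obj.set_default(name, java_obj.get_or_default(name))


java_side = ToyParams()
java_side.set("maxDepth", 3)
java_side.set_default("tol", 1e-6)

py_side = ToyParams()
transfer_params_from_java(py_side, java_side, ["maxDepth", "tol"])
```

After the transfer, maxDepth is reported as set on the Python side, while tol is only a default, which mirrors the Java state.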

viirya · 2017-08-10T04:09:15Z

python/pyspark/ml/tests.py

+            test_self.assertEqual(
+                param_names, sorted(java_param_names),
+                "Param list in Python does not match Java for %s:\nJava = %s\nPython = %s"
+                % (py_stage_str, java_param_names, param_names))


Are lines 436-443 the only change to check_params?

I also changed the return to continue on line 454. This loop checks all params, so it was meant to skip over random seed params, not break out of the loop entirely (this is why the default value for MLP was missed). I also cleaned up the NaN checks: before, it only checked Imputer params, but it should be the same for any params with NaN as the default value. This is lines 460-462.
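
The NaN case needs special handling because NaN never compares equal to itself. A minimal sketch of such a comparison, with a hypothetical helper name, could look like this:

```python
import math


def defaults_match(py_default, java_default):
    """Compare two default values, treating NaN on both sides as a match."""
    both_nan = (
        isinstance(py_default, float)
        and isinstance(java_default, float)
        and math.isnan(py_default)
        and math.isnan(java_default)
    )
    return both_nan or py_default == java_default
```

This treats any pair of NaN defaults as equal, regardless of which stage the param belongs to, and falls back to ordinary equality otherwise.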

viirya · 2017-08-10T04:23:02Z

python/pyspark/ml/tests.py

@@ -1572,7 +1588,8 @@ def test_java_params(self):
            for name, cls in inspect.getmembers(module, inspect.isclass):
                if not name.endswith('Model') and issubclass(cls, JavaParams)\
                        and not inspect.isabstract(cls):
-                    self.check_params(cls())
+                    # NOTE: disable check_params_exist until there is parity with Scala API
+                    ParamTests.check_params(self, cls(), check_params_exist=False)


This skips the param test for Models. Should we do a similar check for all models?

Yes, ideally, but most of the models need to be trained first, which is why they are skipped here. Some basic framework would need to be added to allow this, and I'm looking into that as a follow-up.

BryanCutler · 2017-08-10T18:31:44Z

Thanks for reviewing @viirya and @HyukjinKwon !
Btw, the temporary fix I talk about here is an optional addition to this PR to allow users to access model param values this way decision_tree_model.getOrDefault("maxDepth") as a workaround until proper accessors (like getMaxDepth()) can be added, since I've seen a lot of JIRAs with people asking for this.

SparkQA · 2017-08-10T19:08:25Z

Test build #80499 has finished for PR 17849 at commit f4a657e.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-08-10T22:05:32Z

Test build #80506 has finished for PR 17849 at commit 07f6e85.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

holdenk · 2017-08-18T23:47:10Z

LGTM. It's certainly an intermediate fix, but making the params accessible without users having to go through py4j manually is worthwhile.

I'll leave this over the weekend in case anyone has issues.

BryanCutler · 2017-08-22T04:16:25Z

@holdenk , do you think this is good to go now?

WeichenXu123 · 2017-08-22T04:44:19Z

What do you think about this ? @jkbradley

holdenk · 2017-08-22T19:19:27Z

I think it's good to go for master pending Jenkins (it's been a while since the last run). So let's just make sure everything is still ok: Jenkins retest this please.

SparkQA · 2017-08-22T19:39:59Z

Test build #81004 has finished for PR 17849 at commit 07f6e85.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

holdenk · 2017-08-23T00:50:40Z

Merged to master, thanks everyone :) (There is also a follow up JIRA https://issues.apache.org/jira/browse/SPARK-21812 for explicitly defining all of the params in Python).

BryanCutler · 2017-08-23T17:55:04Z

Thanks @holdenk!

BryanCutler added 9 commits March 13, 2017 16:43

added regression test case for PySpark models not owning params

a4ede3f

fixed default PySpark param value that was being overlooked by return instead of continue

3b921a4

added copy of param values to python model when estimator fit is called

dff7863

Added temporary fix to add Params when fitting and persisting models

398ef27

Merge remote-tracking branch 'upstream/master' into pyspark-models-own-params-SPARK-10931

1f3de13

added check for NaN default param values

d621c89

need to create params from java when model is fit and unpersisted in order to match

acdb4b9

removed blank line

9b7b886

cleaned old comment block in test

765eb5f

moved call to create params to JavaModel constructor for case where make a model without fitting

ca52db4

BryanCutler added 2 commits May 8, 2017 15:01

need to copy param value if java has default but not defined in python when loading model

a22a2cc

added some comments for test additions

4a66e90

sethah mentioned this pull request May 30, 2017

[SPARK-20498][PYSPARK][ML] Expose getMaxDepth for ensemble tree model in PySpark #18120

Closed

2 tasks

Merge remote-tracking branch 'upstream/master' into pyspark-models-own-params-SPARK-10931

4affa01

Changed wrapper to use setDefault for undef default params from Java, made NaN check better message when fail

f4a657e

style fix

07f6e85

asfgit closed this in 41bb1dd Aug 23, 2017

BryanCutler mentioned this pull request Sep 6, 2017

[SPARK-21915][ML][PySpark]Model 1 and Model 2 ParamMaps Missing #19126

Closed

kschelonka mentioned this pull request Jun 7, 2019

Params from parent java estimators aren't copied to python mmlspark models microsoft/SynapseML#582

Open
