
Params from parent java estimators aren't copied to python mmlspark models #582

Open
kschelonka opened this issue Jun 7, 2019 · 4 comments
kschelonka commented Jun 7, 2019

The Java params for mmlspark estimators like LightGBMClassifier aren't copied over to the Python wrapper instances.

This is related to this Jira ticket: PySpark ML Models should contain Param values

A temporary fix was added so that the params can be accessed using the getOrDefault method. This makes it possible to pull the params from mmlspark models like LightGBMClassifier:
[Screenshot from 2019-06-07 showing params read via getOrDefault]
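The getOrDefault workaround can be illustrated with a minimal pure-Python sketch of the lookup semantics (the `Param`/`Model` classes below are illustrative stand-ins, not actual pyspark or mmlspark code):

```python
# Minimal sketch of the getOrDefault lookup pattern described above.
# In real pyspark, Params objects keep an explicitly-set param map and
# a default param map; getOrDefault checks the former, then the latter.

class Param:
    def __init__(self, name, doc=""):
        self.name = name
        self.doc = doc

class Model:
    def __init__(self):
        # In the workaround, values copied from the Java side land here.
        self._defaultParamMap = {}
        self._paramMap = {}

    def getOrDefault(self, param):
        # Explicitly set values win; otherwise fall back to the default.
        if param in self._paramMap:
            return self._paramMap[param]
        return self._defaultParamMap[param]

numLeaves = Param("numLeaves", "max number of leaves in one tree")
model = Model()
model._defaultParamMap[numLeaves] = 31   # default from the Java estimator
print(model.getOrDefault(numLeaves))     # -> 31
model._paramMap[numLeaves] = 64          # user-set value overrides the default
print(model.getOrDefault(numLeaves))     # -> 64
```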

Spark developers are planning to incrementally update the PySpark API to use the appropriate getter and setter methods, and to have the PySpark models define their params within themselves (see SPARK-21812). For example, CountVectorizer was updated in this fashion.

Since it's very useful to be able to access model parameters, I propose updating the mmlspark models in a similar fashion. I'm happy to contribute to this effort.
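The proposed CountVectorizer-style direction can be sketched as follows. This is a hypothetical pure-Python illustration: the class name, param name, and constructor are assumptions, not real mmlspark or pyspark APIs.

```python
# Sketch of the SPARK-21812 direction: the Python model declares its
# params itself, with explicit getters, instead of delegating every
# lookup to the wrapped Java object.

class Param:
    def __init__(self, name, doc):
        self.name = name
        self.doc = doc

class LightGBMModelSketch:
    # Declared on the Python class, so introspection and help() work
    # without a round trip to the JVM.
    numIterations = Param("numIterations", "number of boosting iterations")

    def __init__(self, java_params):
        # In practice, values would be copied once from the fitted
        # Java model into the Python-side param map.
        self._paramMap = dict(java_params)

    def getNumIterations(self):
        return self._paramMap["numIterations"]

model = LightGBMModelSketch({"numIterations": 100})
print(model.getNumIterations())  # -> 100
```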

imatiach-msft (Contributor) commented

@kschelonka thanks for bringing up this issue!
"having the pyspark models define the params within themselves"
This sounds reasonable, although I think we should only allow parameters to be settable on the model if they can actually change the output of the model (I recall seeing cases in Spark ML where this was not true, which seemed strange to me).
"I propose updating mmlspark models in a similar fashion. Happy to contribute to this effort."
That would be great, if you have the cycles to do so. Note that all of our Python wrappers are autogenerated, so this should hopefully require less coding and work on all Python models at once.
Also, I should mention that @mhamilton723 is doing a large refactor of the codebase right now. Currently we have a special ./runme script that sets up mmlspark on Linux, but it relies on bash and may do too much on behalf of the developer. @mhamilton723 is refactoring the codebase to rely only on sbt, which should make it possible to develop on Windows and macOS as well. Setting up the code for development is a bit difficult right now. He might suggest that you use the refactor branch; I'm not sure how far along he is in the refactor.


kschelonka commented Jun 7, 2019

Thanks for the response!
"we should only allow the parameters to be settable on the model if they can actually change the output of the model"
That makes sense to me as well, but I would suggest that the settable params include anything that has no effect on the model fit. For example, featuresCol should be settable even though it doesn't change the model's output.
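The distinction can be sketched like this: a transform-time param such as featuresCol gets a setter on the fitted model, while a fit-time param is exposed read-only. All names here are illustrative stand-ins, not real mmlspark APIs.

```python
# Sketch of the settable-params rule discussed above.

class FittedModelSketch:
    def __init__(self, numLeaves, featuresCol="features"):
        self._numLeaves = numLeaves      # fixed at fit time; getter only
        self._featuresCol = featuresCol  # only affects transform() input

    def getNumLeaves(self):
        return self._numLeaves

    def getFeaturesCol(self):
        return self._featuresCol

    def setFeaturesCol(self, value):
        # Safe to change after fitting: it controls which input column
        # transform() reads, not the learned model itself.
        self._featuresCol = value
        return self  # return self for chaining, as pyspark setters do

m = FittedModelSketch(numLeaves=31)
m.setFeaturesCol("scaled_features")
print(m.getFeaturesCol())  # -> scaled_features
print(m.getNumLeaves())    # -> 31
```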

The build refactor sounds like it would make development easier, so I'll wait for updates on that. Thanks!

mhamilton723 (Collaborator) commented

@kschelonka, @imatiach-msft I think this would be mitigated by making the LightGBM models proper SparkML estimators with getters and setters instead of being constructor-defined. This would also improve code reuse. Ilya, could you consider this when doing your LGBM cleanup PR?

imatiach-msft (Contributor) commented

FYI, this is now fixed for LightGBM (because of ComplexParamsWritable) but not for some of the other estimators, like TrainClassifier.
