[SPARK-3530][MLLIB] pipeline and parameters with examples #3099
Conversation
Test build #22907 has started for PR 3099 at commit
Test build #22907 has finished for PR 3099 at commit
Test FAILed.
@@ -0,0 +1,6 @@
package org.apache.spark.ml
(Missing copyright header)
Hey @srowen, please ignore these. I'm not trying to let it pass Jenkins now.
Oh OK, is it worth me reviewing this yet? I'm liking it already.
@srowen Glad to hear that you like it :) Your feedback will be greatly appreciated, but I don't want to waste your time on minor details. Let's run the discussion in the main thread, so we can still see them when I update the code. One thing I have been struggling with is the way we handle parameters here. Please check the last section of the PR description. Basically, I think
looks better than
But if we use setters/getters, it is a little weird that
Does it look better than the current version? Or we can just ask users to remember
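To make the comparison concrete, here is a minimal sketch of the two configuration styles under discussion (the exact snippets from the original comment were not preserved in this page; parameter names follow the examples in the PR description, and the pair-style `fit` call assumes param pairs are accepted the same way `model.transform(dataset, model.threshold -> 0.8)` accepts them):

~~~
// Sketch only: contrasting the two styles being debated in this thread.

// Style 1: configure the estimator with chained setters, then fit.
val lr = new LogisticRegression()
  .setMaxIter(50)
  .setRegParam(0.01)
val model1 = lr.fit(dataset)

// Style 2: leave defaults on the estimator and supply parameters at fit time,
// assuming fit accepts param pairs analogously to transform in this PR.
val model2 = lr.fit(dataset, lr.maxIter -> 50, lr.regParam -> 0.01)
~~~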
@mengxr Yes I agree with your conclusion. With a generic get/set by key method, I suppose you don't strictly need all the particular getters and setters. I suppose that's simpler, and if there's just one way to do it, there's no confusion about key vs getter/setter. I don't know if you want to remove the getters/setters entirely, but it wouldn't be crazy.
If maxIter is a constant, would it be clearer to use MAX_ITER?
@sryza
Both the reference and the class internals are immutable, no? Typical Java conventions would put such a variable in all caps, though maybe in Scala it's different? I suppose the fact that Param includes a reference to its parent makes this a grayer area. What's the thinking behind linking a Param to a specific instance? My (uninformed) gut is that they seem like general attributes / capabilities of a class.
/**
 * Filter this param map for the given parent.
 */
def filter(parent: Identifiable): ParamMap = {
Curious - what would this filtering be used for?
This is useful when we try to store training parameters in trained models.
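To illustrate that use, here is a hedged sketch (the `ParamMap(...)` constructor from param pairs, the `scaler.inputCol` param, and the exact parent type are assumptions based on the surrounding diff and the PR description):

~~~
// A ParamMap passed to a pipeline's fit() may hold entries for several stages.
// filter(parent) keeps only the entries whose Param belongs to the given
// component, e.g. so a trained model can store just its estimator's settings.
val paramMap = ParamMap(lr.maxIter -> 50, scaler.inputCol -> "features")
val lrParams = paramMap.filter(lr)  // retains only lr.maxIter -> 50
~~~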
I vote for the lr.setRegParam(0.1) setup. I also vote for setting parameters beforehand, and not allowing parameters in fit().
Test build #22962 has started for PR 3099 at commit
Test build #22962 has finished for PR 3099 at commit
Test FAILed.
@mengxr Thanks for putting this together! I had some high-level comments from looking at the code
will become one-liners like
Also, we will still have an empty constructor that just sets the default values that exist for Params right now. Also, FWIW, I am not very sure about having setters/getters via traits like HasRegParam etc. (see the sketch after this comment). This creates a lot of traits as we add more classes, each having 2-3 params?
For example, in LogisticRegression.scala, we can have
can become
Also, if you don't mind, I'll try to port some of the image processing pipelines we have to this class structure by forking your branch. I feel that'll help me figure out which features are easy to use, etc.
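For reference, a rough sketch of the trait-per-parameter pattern being discussed (names and signatures are illustrative and assume this PR's `Param`/`Params` classes with a `Param(parent, name, doc, defaultValue)` constructor and `get`/`set` helpers; the actual code may differ in visibility and defaults):

~~~
// One small trait per shared parameter; components mix in what they need.
trait HasRegParam extends Params {
  val regParam = new Param[Double](this, "regParam", "regularization constant", Some(0.1))
  def getRegParam: Double = get(regParam)
}

trait HasMaxIter extends Params {
  val maxIter = new Param[Int](this, "maxIter", "max number of iterations", Some(100))
  def getMaxIter: Int = get(maxIter)
}

// Each concrete component mixes in the traits and adds its own chainable setters.
class LogisticRegression extends HasRegParam with HasMaxIter {
  def setRegParam(value: Double): this.type = { set(regParam, value); this }
  def setMaxIter(value: Int): this.type = { set(maxIter, value); this }
}
~~~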
I have a few comments based upon the API:
@shivaram Thanks for your feedback!
It would be hard to maintain binary compatibility in the long run. After we remove the @experimental tag, each time we add a new parameter, we need to create a new auxiliary constructor. It is also easier to deprecate parameters. It will also make Scala code different from Java's, which we try to avoid. Using getters and setters, the Java code
I'm okay with either approach. I felt this is a nice feature of Scala. Some parameters are very common, e.g.,
The model parameter like
I agree. We need to know the schema in order to transform individual instances. The row object doesn't have a reference to its schema. We can add
We can call
That's great! I felt that making the code compile is really helpful to see the trade-offs. But please understand that this is a WIP and I may update the code.
The model serving work would really benefit from being able to evaluate models without requiring a Spark context, especially since we are shooting for latencies in the tens of milliseconds. More generally, though, we should think about how one might want to use the artifact of the pipeline. I suspect there are uses that may exist outside of Spark, and the extent to which the models themselves are "portable functions" could enable greater adoption.
@manishamde Thanks for your feedback!
Me too, and this is implemented in the current version.
If we only allow setters, the code will be
Another reason I want to have parameters specified in
Besides the binary compatibility issue and Java issue I mentioned, it doesn't save you many characters:
I think we should keep methods that operate on normal RDDs and individual instances.
We need to deal with the serialization of objects and parameters. @davies is the expert. I expect the Python API to be very similar to the Scala/Java API.
This is beyond the scope of this PR. SPARK-3702, which @jkbradley is working on, is relevant to your question.
Test build #22986 has started for PR 3099 at commit
Test build #22986 has finished for PR 3099 at commit
Test FAILed.
Why can't we just remove the
After a user fits a pipeline, without
Is this operation useful? I would say yes. If we both agree, the question becomes whether the current form is okay and what other options we have. I understand your concern, but we need to expose something that can make a "working" pipeline in order to ask users to try it. We can test this internally for months, but without users' feedback the use cases we know of would be quite limited. Users should be aware that this is alpha, and we should not be afraid of API changes that make things better.
How about just taking a

While I see your point about getting feedback, I'm just trying to make sure we stick to the goal of exposing a user API and not exposing a developer API [which, as far as I can see, there is consensus on?]
@shivaram I found that this is not the only place we need

Let me try to hide APIs as much as I can and see how it goes.
Test build #23260 has started for PR 3099 at commit
Test build #23260 has finished for PR 3099 at commit
Test FAILed.
Test FAILed.
I tried to hide APIs as much as I can while keeping the code at a level where users can actually try creating, configuring, and tuning a pipeline. All major classes are marked as "AlphaComponent". The schema transformation layer is hidden.
test this please
Test build #23261 has started for PR 3099 at commit
Test build #23262 has started for PR 3099 at commit
Test build #23261 has finished for PR 3099 at commit
Test PASSed.
Test build #23262 has finished for PR 3099 at commit
Test PASSed.
New changes look good to me.
Thanks @mengxr -- I agree that having

The latest set of changes do help, though, and look good to me.
Thanks everyone for reviewing the code! I've merged this into master and branch-1.2 as an alpha component. User feedback would be greatly appreciated, as it will definitely help us improve both the user and developer experience.
This PR adds package "org.apache.spark.ml" with pipeline and parameters, as discussed on the JIRA. This is a joint work of jkbradley, etrain, shivaram, and many others who helped on the design, also with help from marmbrus and liancheng on the Spark SQL side. The design doc can be found at: https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o/edit?usp=sharing

**org.apache.spark.ml**

This is a new package with a new set of ML APIs that address practical machine learning pipelines. (Sorry for taking so long!) It will be an alpha component, so this is definitely not something set in stone. The new set of APIs, inspired by the MLI project from AMPLab and scikit-learn, leverages Spark SQL's schema support and execution plan optimization. It introduces the following components that help build a practical pipeline:

1. Transformer, which transforms a dataset into another
2. Estimator, which fits models to data, where models are transformers
3. Evaluator, which evaluates model output and returns a scalar metric
4. Pipeline, a simple pipeline that consists of transformers and estimators

Parameters can be supplied at fit/transform time or embedded with components:

1. Param: a strongly typed parameter key with self-contained doc
2. ParamMap: a param -> value map
3. Params: trait for components with parameters

For any component that implements `Params`, users can easily check the doc by calling `explainParams`:

~~~
> val lr = new LogisticRegression
> lr.explainParams
maxIter: max number of iterations (default: 100)
regParam: regularization constant (default: 0.1)
labelCol: label column name (default: label)
featuresCol: features column name (default: features)
~~~

or check an individual param:

~~~
> lr.maxIter
maxIter: max number of iterations (default: 100)
~~~

**Please start with the example code in test suites and under `org.apache.spark.examples.ml`, where I put several examples:**

1. Run a simple logistic regression job:

~~~
val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(1.0)
val model = lr.fit(dataset)
model.transform(dataset, model.threshold -> 0.8) // overwrite threshold
  .select('label, 'score, 'prediction).collect()
  .foreach(println)
~~~

2. Run logistic regression with cross-validation and grid search, using areaUnderROC (default) as the metric:

~~~
val lr = new LogisticRegression
val lrParamMaps = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.1, 100.0))
  .addGrid(lr.maxIter, Array(0, 5))
  .build()
val eval = new BinaryClassificationEvaluator
val cv = new CrossValidator()
  .setEstimator(lr)
  .setEstimatorParamMaps(lrParamMaps)
  .setEvaluator(eval)
  .setNumFolds(3)
val bestModel = cv.fit(dataset)
~~~

3. Run a pipeline that consists of a standard scaler and a logistic regression component:

~~~
val scaler = new StandardScaler()
  .setInputCol("features")
  .setOutputCol("scaledFeatures")
val lr = new LogisticRegression()
  .setFeaturesCol(scaler.getOutputCol)
val pipeline = new Pipeline()
  .setStages(Array(scaler, lr))
val model = pipeline.fit(dataset)
val predictions = model.transform(dataset)
  .select('label, 'score, 'prediction)
  .collect()
  .foreach(println)
~~~

4. A simple text classification pipeline, which recognizes "spark":

~~~
val training = sparkContext.parallelize(Seq(
  LabeledDocument(0L, "a b c d e spark", 1.0),
  LabeledDocument(1L, "b d", 0.0),
  LabeledDocument(2L, "spark f g h", 1.0),
  LabeledDocument(3L, "hadoop mapreduce", 0.0)))
val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")
val hashingTF = new HashingTF()
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")
val lr = new LogisticRegression()
  .setMaxIter(10)
val pipeline = new Pipeline()
  .setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(training)
val test = sparkContext.parallelize(Seq(
  Document(4L, "spark i j k"),
  Document(5L, "l m"),
  Document(6L, "mapreduce spark"),
  Document(7L, "apache hadoop")))
model.transform(test)
  .select('id, 'text, 'prediction, 'score)
  .collect()
  .foreach(println)
~~~

Java examples are very similar. I put example code that creates a simple text classification pipeline in Scala and Java, where a simple tokenizer is defined as a transformer outside `org.apache.spark.ml`.

**What is missing now and will be added soon:**

1. ~~Runtime check of schemas, so before we touch the data, we go through the schema and make sure column names and types match the input parameters.~~
2. ~~Java examples.~~
3. ~~Store training parameters in trained models.~~
4. (later) Serialization and Python API.

Author: Xiangrui Meng <meng@databricks.com>

Closes #3099 from mengxr/SPARK-3530 and squashes the following commits:

2cc93fd [Xiangrui Meng] hide APIs as much as I can
34319ba [Xiangrui Meng] use local instead local[2] for unit tests
2524251 [Xiangrui Meng] rename PipelineStage.transform to transformSchema
c9daab4 [Xiangrui Meng] remove mockito version
1397ab5 [Xiangrui Meng] use sqlContext from LocalSparkContext instead of TestSQLContext
6ffc389 [Xiangrui Meng] try to fix unit test
a59d8b7 [Xiangrui Meng] doc updates
977fd9d [Xiangrui Meng] add scala ml package object
6d97fe6 [Xiangrui Meng] add AlphaComponent annotation
731f0e4 [Xiangrui Meng] update package doc
0435076 [Xiangrui Meng] remove ;this from setters
fa21d9b [Xiangrui Meng] update extends indentation
f1091b3 [Xiangrui Meng] typo
228a9f4 [Xiangrui Meng] do not persist before calling binary classification metrics
f51cd27 [Xiangrui Meng] rename default to defaultValue
b3be094 [Xiangrui Meng] refactor schema transform in lr
8791e8e [Xiangrui Meng] rename copyValues to inheritValues and make it do the right thing
51f1c06 [Xiangrui Meng] remove leftover code in Transformer
494b632 [Xiangrui Meng] compure score once
ad678e9 [Xiangrui Meng] more doc for Transformer
4306ed4 [Xiangrui Meng] org imports in text pipeline
6e7c1c7 [Xiangrui Meng] update pipeline
4f9e34f [Xiangrui Meng] more doc for pipeline
aa5dbd4 [Xiangrui Meng] fix typo
11be383 [Xiangrui Meng] fix unit tests
3df7952 [Xiangrui Meng] clean up
986593e [Xiangrui Meng] re-org java test suites
2b11211 [Xiangrui Meng] remove external data deps
9fd4933 [Xiangrui Meng] add unit test for pipeline
2a0df46 [Xiangrui Meng] update tests
2d52e4d [Xiangrui Meng] add @AlphaComponent to package-info
27582a4 [Xiangrui Meng] doc changes
73a000b [Xiangrui Meng] add schema transformation layer
6736e87 [Xiangrui Meng] more doc / remove HasMetricName trait
80a8b5e [Xiangrui Meng] rename SimpleTransformer to UnaryTransformer
62ca2bb [Xiangrui Meng] check param parent in set/get
1622349 [Xiangrui Meng] add getModel to PipelineModel
a0e0054 [Xiangrui Meng] update StandardScaler to use SimpleTransformer
d0faa04 [Xiangrui Meng] remove implicit mapping from ParamMap
c7f6921 [Xiangrui Meng] move ParamGridBuilder test to ParamGridBuilderSuite
e246f29 [Xiangrui Meng] re-org:
7772430 [Xiangrui Meng] remove modelParams add a simple text classification pipeline
b95c408 [Xiangrui Meng] remove implicits add unit tests to params
bab3e5b [Xiangrui Meng] update params
fe0ee92 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-3530
6e86d98 [Xiangrui Meng] some code clean-up
2d040b3 [Xiangrui Meng] implement setters inside each class, add Params.copyValues [ci skip]
fd751fc [Xiangrui Meng] add java-friendly versions of fit and tranform
3f810cd [Xiangrui Meng] use multi-model training api in cv
5b8f413 [Xiangrui Meng] rename model to modelParams
9d2d35d [Xiangrui Meng] test varargs and chain model params
f46e927 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-3530
1ef26e0 [Xiangrui Meng] specialize methods/types for Java
df293ed [Xiangrui Meng] switch to setter/getter
376db0a [Xiangrui Meng] pipeline and parameters

(cherry picked from commit 4b736db)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
It looks like this may have broken the master Maven build: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/981/ Most of the failures appear to be due to "Task not serializable" exceptions in the MLlib test suites.
Just a question. I'm comparing your example code to the scikit-learn examples in the design document. It looks like it's difficult for me to turn a configuration (data) into a pipeline without writing too much code. Here I need to call a bunch of setXXX methods to specify the parameters, while in scikit-learn I just need a Map. One important scenario is loading the pipeline and parameters from a configuration file without the need to recompile my code. Can I do that? Jianshi
@huangjs Serialization is on the TODO list. Each parameter is associated with a unique name and each object contains a unique id. So loading from or saving to a configuration file is definitely feasible. We didn't expose
@mengxr, thanks for the explanation. Statically type-checked pipelines are nice, but I don't think we need that for parameters, as the parameters are usually read either from user input or from some database. And we'll need to check their ranges anyway (e.g., positive non-zero values, or [0, 1), etc.). Input validation would be good enough. We could just use a Map[String, Any] for inputs, and we could have a Map for pipeline -> parameters too.
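As a rough sketch of what that could look like on top of the current API (the `configure` helper below is hypothetical and not part of this PR; it only relies on the `setMaxIter`/`setRegParam` setters shown in the PR description):

~~~
// Hypothetical glue code: apply a loosely typed config map, e.g. loaded from a
// file, to a LogisticRegression via its typed setters, validating keys/types.
def configure(lr: LogisticRegression, conf: Map[String, Any]): LogisticRegression = {
  conf.foreach {
    case ("maxIter", v: Int)     => lr.setMaxIter(v)
    case ("regParam", v: Double) => lr.setRegParam(v)
    case (k, v) =>
      throw new IllegalArgumentException(s"Unknown or mistyped parameter: $k = $v")
  }
  lr
}

val lr = configure(new LogisticRegression(), Map("maxIter" -> 50, "regParam" -> 0.01))
~~~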