Fix so not preparing data twice when calling model selector fit method #251

leahmcguire · 2019-03-26T21:18:31Z

Related issues
The prepare method on the splitters had an unclear purpose: it was used both to measure the properties of the data and to resample the data for the cross validation / training split. As a result, the fit method of the model selectors had a bug: it rebalanced the data before cross validation as well as during cross validation, leading to potential label leakage. This was not generally a problem, because within a workflow we call the findBestEstimator method rather than fit, but it would lead to bad results for anyone using the model selectors as a standalone Spark stage.
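One way this kind of leakage can arise is sketched below, under simplifying assumptions: plain Scala collections stand in for Spark datasets, rebalancing is modeled as naive duplication of minority rows, and `Row`, `upsample`, `kFold` and `leakyFolds` are hypothetical helpers, not TransmogrifAI APIs. When rebalancing happens before the fold split, copies of the same row can land in both the training and validation parts of a fold.

```scala
// Spark-free illustration only; every name here is hypothetical.
object LeakageSketch {
  final case class Row(id: Int, label: Int)

  // Naive rebalancing: repeat every minority-class (label == 1) row `times` times.
  def upsample(rows: Seq[Row], times: Int): Seq[Row] =
    rows.flatMap(r => if (r.label == 1) Seq.fill(times)(r) else Seq(r))

  // Deterministic k-fold split by position: returns (train, validation) pairs.
  def kFold(rows: Seq[Row], k: Int): Seq[(Seq[Row], Seq[Row])] = {
    val indexed = rows.zipWithIndex
    (0 until k).map { fold =>
      val (valid, train) = indexed.partition { case (_, i) => i % k == fold }
      (train.map(_._1), valid.map(_._1))
    }
  }

  // Number of folds in which some row id appears in both train and validation.
  def leakyFolds(folds: Seq[(Seq[Row], Seq[Row])]): Int =
    folds.count { case (train, valid) =>
      train.map(_.id).toSet.intersect(valid.map(_.id).toSet).nonEmpty
    }

  // 20 rows, two of which (ids 0 and 10) belong to the minority class.
  val data: Seq[Row] = (0 until 20).map(i => Row(i, if (i % 10 == 0) 1 else 0))

  // Problematic order: rebalance first, then split into folds.
  val leakyBefore: Int = leakyFolds(kFold(upsample(data, 4), 5))

  // Safer order: split first, then rebalance only the training part of each fold.
  val leakyAfter: Int = leakyFolds(
    kFold(data, 5).map { case (train, valid) => (upsample(train, 4), valid) }
  )
}
```

In this toy setup the rebalance-then-split order puts duplicated minority rows on both sides of a fold, while the split-then-rebalance order keeps the validation part untouched.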

Describe the proposed solution
Made the base splitter class have two methods that are called independently: one to measure the data and decide how to clean or rebalance it, and one to actually do the cleaning or rebalancing.
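A minimal sketch of this two-method shape, assuming a Spark-free stand-in where a `Seq[Int]` of labels replaces the `Dataset[Row]`. The method names follow the ones settled on in the review (`preValidationPrepare`, `validationPrepare`); `Summary`, `BalancerSketch`, the down-sampling rule and the error message wording are hypothetical and only illustrate the split between measuring and applying.

```scala
// Spark-free sketch; not the actual DataBalancer implementation.
object TwoPhaseSplitterSketch {
  // What the measuring step learned about the data (hypothetical fields).
  final case class Summary(positiveFraction: Double, keepNegatives: Double)

  class BalancerSketch(targetPositiveFraction: Double) {
    private var summary: Option[Summary] = None

    // Phase 1: measure the data and decide how to rebalance. No resampling here.
    def preValidationPrepare(labels: Seq[Int]): Option[Summary] = {
      val pos = labels.count(_ == 1).toDouble
      val neg = labels.size - pos
      val frac = if (labels.isEmpty) 0.0 else pos / labels.size
      // Fraction of negatives to keep so positives reach the target share.
      val keepNeg =
        if (pos == 0.0 || neg == 0.0 || frac >= targetPositiveFraction) 1.0
        else math.min(1.0, pos * (1 - targetPositiveFraction) / (targetPositiveFraction * neg))
      summary = Some(Summary(frac, keepNeg))
      summary
    }

    // Phase 2: apply the stored decision; fails if phase 1 never ran.
    def validationPrepare(labels: Seq[Int]): Seq[Int] = {
      val s = summary.getOrElse(throw new RuntimeException(
        "Cannot call validationPrepare until preValidationPrepare has been called"))
      val (pos, neg) = labels.partition(_ == 1)
      // Deterministic stand-in for sampling: keep the leading share of negatives.
      pos ++ neg.take(math.round(neg.size * s.keepNegatives).toInt)
    }
  }

  val labels: Seq[Int] = Seq.fill(10)(1) ++ Seq.fill(90)(0)
  val balancer = new BalancerSketch(targetPositiveFraction = 0.5)
  val measured: Option[Summary] = balancer.preValidationPrepare(labels)
  val balanced: Seq[Int] = balancer.validationPrepare(labels)
}
```

With this separation, a caller can measure once on the full data and then apply the rebalancing only where it is wanted, for example on the training part of each cross-validation fold.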

Describe alternatives you've considered
Make prepare only estimate once. However, this could lead to unexpected results.


codecov · 2019-03-26T21:46:35Z

Codecov Report

Merging #251 into master will decrease coverage by 3.94%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master     #251      +/-   ##
==========================================
- Coverage   86.53%   82.59%   -3.95%     
==========================================
  Files         314      314              
  Lines       10297    10301       +4     
  Branches      331      537     +206     
==========================================
- Hits         8911     8508     -403     
- Misses       1386     1793     +407

| Impacted Files | Coverage Δ | |
|---|---|---|
| ...om/salesforce/op/stages/impl/tuning/Splitter.scala | `96.87% <ø> (ø)` | ⬆️ |
| ...salesforce/op/stages/impl/tuning/OpValidator.scala | `94.11% <ø> (ø)` | ⬆️ |
| ...alesforce/op/stages/impl/tuning/DataBalancer.scala | `96.22% <100%> (+0.03%)` | ⬆️ |
| ...sforce/op/stages/impl/selector/ModelSelector.scala | `98.14% <100%> (+0.07%)` | ⬆️ |
| ...alesforce/op/stages/impl/tuning/DataSplitter.scala | `60% <100%> (ø)` | ⬆️ |
| .../salesforce/op/stages/impl/tuning/DataCutter.scala | `95.74% <100%> (+0.09%)` | ⬆️ |
| ...alesforce/op/cli/gen/templates/SimpleProject.scala | `0% <0%> (-100%)` | ⬇️ |
| .../scala/com/salesforce/op/cli/gen/ProblemKind.scala | `0% <0%> (-100%)` | ⬇️ |
| ...cala/com/salesforce/op/cli/gen/FileInProject.scala | `0% <0%> (-100%)` | ⬇️ |
| ...in/scala/com/salesforce/op/cli/CommandParser.scala | `0% <0%> (-98.12%)` | ⬇️ |

... and 8 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2566a08...1cedbbe. Read the comment docs.

codecov · 2019-03-26T21:46:35Z

Codecov Report

Merging #251 into master will decrease coverage by 0.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master     #251      +/-   ##
==========================================
- Coverage   86.56%   86.55%   -0.02%     
==========================================
  Files         314      314              
  Lines       10294    10298       +4     
  Branches      339      342       +3     
==========================================
+ Hits         8911     8913       +2     
- Misses       1383     1385       +2

| Impacted Files | Coverage Δ | |
|---|---|---|
| ...om/salesforce/op/stages/impl/tuning/Splitter.scala | `96.87% <ø> (ø)` | ⬆️ |
| ...salesforce/op/stages/impl/tuning/OpValidator.scala | `94.11% <ø> (ø)` | ⬆️ |
| ...alesforce/op/stages/impl/tuning/DataBalancer.scala | `96.22% <100%> (+0.03%)` | ⬆️ |
| ...sforce/op/stages/impl/selector/ModelSelector.scala | `98.14% <100%> (+0.07%)` | ⬆️ |
| ...alesforce/op/stages/impl/tuning/DataSplitter.scala | `60% <100%> (ø)` | ⬆️ |
| .../salesforce/op/stages/impl/tuning/DataCutter.scala | `95.74% <100%> (+0.09%)` | ⬆️ |
| ...es/src/main/scala/com/salesforce/op/OpParams.scala | `85.71% <0%> (-4.09%)` | ⬇️ |

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3e89a43...bc0ea7f. Read the comment docs.

gerashegalov · 2019-03-26T21:33:23Z

core/src/main/scala/com/salesforce/op/stages/impl/tuning/Splitter.scala

+   * @param data
+   * @return Parameters set in examining data
+   */
+  def examine(data: Dataset[Row]): Option[SplitterSummary]


This function is designed to have side effects for prepare, whereas examine sounds read-only. Can we call it something like setupPrepare or prePrepare?

What about assessForPrepare?

Same read-only connotation. I think your current name, preValidationPrepare, is good.

gerashegalov · 2019-03-26T21:35:01Z

core/src/main/scala/com/salesforce/op/stages/impl/selector/ModelSelector.scala

@@ -113,6 +113,7 @@ E <: Estimator[_] with OpPipelineStage2[RealNN, OPVector, Prediction]]
  protected[op] def findBestEstimator(data: Dataset[_], dag: StagesDAG, persistEveryKStages: Int = 0)
    (implicit spark: SparkSession): Unit = {

+    splitter.map(_.examine(data.select(labelColName).toDF()))


Since we don't use the result of the transformation, let's use splitter.foreach(_.examine(...)) instead.
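The difference between the two calls can be sketched as follows. `CountingSplitter` is a hypothetical stand-in whose `examine` mutates internal state and takes a plain `Seq[Int]` instead of a `Dataset[Row]`; only `Option.map` and `Option.foreach` are real library calls here.

```scala
object ForeachSketch {
  // Hypothetical splitter whose examine step has a side effect (a call counter).
  class CountingSplitter {
    var examined: Int = 0
    def examine(labels: Seq[Int]): Option[Double] = {
      examined += 1
      if (labels.isEmpty) None else Some(labels.sum.toDouble / labels.size)
    }
  }

  val splitter: Option[CountingSplitter] = Some(new CountingSplitter)
  val labels: Seq[Int] = Seq(0, 1, 1, 0)

  // map also runs the side effect, but builds a nested Option that nobody reads.
  val discarded: Option[Option[Double]] = splitter.map(_.examine(labels))

  // foreach returns Unit, which makes the side-effect-only intent explicit.
  splitter.foreach(_.examine(labels))
}
```

Both forms invoke `examine` when a splitter is present; `foreach` simply avoids allocating and discarding a result, and signals to the reader that only the side effect matters.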

tovbinm · 2019-03-27T04:35:07Z

@leahmcguire please update PR title


gerashegalov

LGTM, modulo the rename follow-up work.

gerashegalov · 2019-03-27T22:41:57Z

core/src/main/scala/com/salesforce/op/stages/impl/tuning/DataBalancer.scala

  /**
-   * Split into a training set and a test set and balance the training set
+   * Function to use examine the data set to set parameters for preparation


Search/replace examine, sorry for the post-rename work.

michaelweilsalesforce · 2019-03-27T23:19:55Z

I didn't realize there was a bug, but it's been a while. I remember something like the first call of Splitter would "examine", but that doesn't seem to be the case.

leahmcguire · 2019-03-27T23:24:50Z

@mweilsalesforce yes, that is what I remember as well, but it is not the case now. It is hard to unravel the history, since this got moved to the public repo about the time this code went in.

michaelweilsalesforce

LGTM

tovbinm

lgtm, one comment

tovbinm · 2019-03-28T20:35:00Z

core/src/main/scala/com/salesforce/op/stages/impl/tuning/DataBalancer.scala

+   */
+  def validationPrepare(data: Dataset[Row]): Dataset[Row] = {
+
+    if (summary.isEmpty) throw new RuntimeException("Cannot call prepare until examine has been called")


examine was renamed to preValidationPrepare, right? If so, please update the error message.

salesforce-cla · 2020-12-01T02:56:25Z

Thanks for the contribution! Unfortunately we can't verify the commit author(s): leahmcguire <l***@s***.com>. One possible solution is to add that email to your GitHub account. Alternatively you can change your commits to another email and force push the change. After getting your commits associated with your GitHub account, refresh the status of this Pull Request.

leahmcguire added 2 commits March 26, 2019 12:46

updating base class to have 2 methods so dont rebalance data before c…

3662b44

…ross validation

fixed test

1cedbbe

leahmcguire requested a review from tovbinm as a code owner March 26, 2019 21:18

leahmcguire requested review from gerashegalov, Jauntbox, mweilsalesforce and kinfaikan March 26, 2019 21:18

gerashegalov suggested changes Mar 26, 2019


leahmcguire changed the title ~~Lm/leakfix~~ Fix so not preparing data twice when calling model selector fit method Mar 27, 2019

leahmcguire and others added 4 commits March 27, 2019 11:18

name change

326ecbc

Merge branch 'master' into lm/leakfix

b76ae67

fixed names

6efbbde

Merge branch 'lm/leakfix' of github.com:salesforce/TransmogrifAI into…

761be95

… lm/leakfix

leahmcguire requested a review from gerashegalov March 27, 2019 18:40

leahmcguire added the ready for review label Mar 27, 2019

gerashegalov approved these changes Mar 27, 2019


michaelweilsalesforce approved these changes Mar 27, 2019


updated comments to reflect new name

bc0ea7f

leahmcguire merged commit 3aa144a into master Mar 28, 2019

leahmcguire deleted the lm/leakfix branch March 28, 2019 18:27

tovbinm reviewed Mar 28, 2019


gerashegalov mentioned this pull request Apr 3, 2019

DataCutter-related fixes for multiclass #263

Merged

tovbinm mentioned this pull request Apr 10, 2019

Release 0.5.2 #277

Merged

salesforce-cla bot added the cla:signed label Mar 6, 2020

salesforce-cla bot removed the cla:signed label Dec 1, 2020

salesforce-cla bot added the cla:missing label Dec 1, 2020
