Regression training limit #413
Conversation
… that inherit from splitter can use them
Thanks for the contribution! Before we can merge this, we need @AdamChit to sign the Salesforce.com Contributor License Agreement.
@TuanNguyen27 Could you review?
 *
 * @group param
 */
private[op] final val downSampleFraction = new DoubleParam(this, "downSampleFraction",
these can be protected instead of private to OP
also maybe set a default value of 1
Added the default value here
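For concreteness, here is a minimal sketch of that param with the two suggestions above applied (protected visibility and a default of 1.0); the enclosing trait, doc string, and validator are assumptions, not the actual TransmogrifAI code.

import org.apache.spark.ml.param.{DoubleParam, ParamValidators, Params}

// Hypothetical trait holding the down-sample fraction param
trait DownSampleFractionParam extends Params {
  // protected rather than private[op], per the review comment above
  protected final val downSampleFraction = new DoubleParam(this, "downSampleFraction",
    "fraction used to down-sample the training data, in (0, 1]",
    ParamValidators.inRange(0.0, 1.0, lowerInclusive = false, upperInclusive = true))

  setDefault(downSampleFraction, 1.0) // 1.0 means no down-sampling
}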
): DataSplitter = {
  new DataSplitter()
    .setSeed(seed)
    .setReserveTestFraction(reserveTestFraction)
    .setMaxTrainingSample(maxTrainingSample)
is this also exposed for the datacutter class?
Yes, I added it to SplitterParams, which DataCutter has access to (e4b8a92), so that I can use the same set/get functions across DataBalancer, DataCutter, and DataSplitter.
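As a rough sketch of that shared placement, the param and its set/get functions could sit in a SplitterParams-style trait that all three splitters mix in; the trait name, Long type, and 1M default below are assumptions based on this thread, not the actual code.

import org.apache.spark.ml.param.{LongParam, Params}

// Hypothetical stand-in for the shared SplitterParams trait mentioned above
trait MaxTrainingSampleParam extends Params {
  final val maxTrainingSample = new LongParam(this, "maxTrainingSample",
    "maximum number of training rows allowed before down-sampling kicks in")

  setDefault(maxTrainingSample, 1000000L)

  def setMaxTrainingSample(value: Long): this.type = set(maxTrainingSample, value)
  def getMaxTrainingSample: Long = $(maxTrainingSample)
}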
  modelSelector.splitter.get.getMaxTrainingSample shouldBe 1000
}

it should "set maxTrainingSample and down-sample" in {
Did you also mean to check maxTrainingSample was set in this test as well?
I moved testing of the maxTrainingSample set/get functions here and forgot to rename this test. Great catch!
core/src/test/scala/com/salesforce/op/stages/impl/regression/RegressionModelSelectorTest.scala
summary = Option(DataSplitterSummary())
val dataSetSize = data.count().toDouble
val sampleF = getMaxTrainingSample / dataSetSize
val DownSampleFraction = if (getMaxTrainingSample < dataSetSize) sampleF else 1
Could be a min here too; also: lowerCamelCase.

Suggested change:
- val DownSampleFraction = if (getMaxTrainingSample < dataSetSize) sampleF else 1
+ val downSampleFraction = math.min(sampleF, 1.0)
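Put together, the down-sampling step this diff implements could look roughly like the standalone sketch below; it is simplified and hypothetical, and the real DataSplitter also records the pre-split count and the fraction in its DataSplitterSummary.

import org.apache.spark.sql.{Dataset, Row}

object DownSampleSketch {
  // Simplified, hypothetical version of the logic above
  def downSampleIfNeeded(data: Dataset[Row], maxTrainingSample: Long, seed: Long): Dataset[Row] = {
    val dataSetSize = data.count().toDouble
    val downSampleFraction = math.min(maxTrainingSample / dataSetSize, 1.0)
    // Dataset.sample is approximate, so the result can land slightly above or below the limit
    if (downSampleFraction < 1.0) data.sample(withReplacement = false, downSampleFraction, seed)
    else data
  }
}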
core/src/test/scala/com/salesforce/op/stages/impl/tuning/DataSplitterTest.scala
@@ -43,6 +43,7 @@ class DataSplitterTest extends FlatSpec with TestSparkContext with SplitterSumma

  val seed = 1234L
  val dataCount = 1000
  val MaxTrainingSampleDefault = 1E6.toLong
I would give this a different name so we don't confuse it with the actual default in SplitterParams
  dataSplitter.getMaxTrainingSample shouldBe maxRows
}
It's probably worth checking the downSampleFraction param was set here as well, for completeness, so that you have checked everything in DataSplitterParams.
Made changes here: 433d483
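For reference, the extra check could look something like the sketch below, written in the same FlatSpec style as DataSplitterTest; the downSampleFraction setter/getter names are assumed from the shared set/get pattern discussed above and may not match the actual SplitterParams.

// Hypothetical test fragment, assuming the seed val and matchers already present in DataSplitterTest
it should "set maxTrainingSample and downSampleFraction" in {
  val maxRows = 1000
  val dataSplitter = DataSplitter(seed = seed).setMaxTrainingSample(maxRows)
  dataSplitter.getMaxTrainingSample shouldBe maxRows
  dataSplitter.setDownSampleFraction(0.5)
  dataSplitter.getDownSampleFraction shouldBe 0.5
}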
…AdamChit/TransmogrifAI into achit/regression-training-limit
…plitterTest.scala Co-Authored-By: Christopher Rupley <crupley@gmail.com>
…egressionModelSelectorTest.scala Co-Authored-By: Christopher Rupley <crupley@gmail.com>
Codecov Report
@@            Coverage Diff             @@
##           master     #413      +/-   ##
==========================================
+ Coverage   86.97%   86.99%   +0.02%
==========================================
  Files         337      337
  Lines       11060    11078      +18
  Branches      357      597     +240
==========================================
+ Hits         9619     9637      +18
  Misses       1441     1441

Continue to review full report at Codecov.
…AdamChit/TransmogrifAI into achit/regression-training-limit
case s if s == classOf[DataSplitterSummary].getName => DataSplitterSummary(
  preSplitterDataCount = metadata.getLong(ModelSelectorNames.PreSplitterDataCount),
  downSamplingFraction = metadata.getDouble(ModelSelectorNames.DownSample)
)
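As a reading aid, the write side of this round trip could look like the sketch below; the field names come from the diff, while the metadata key strings and the case class shape are assumptions.

import org.apache.spark.sql.types.{Metadata, MetadataBuilder}

// Hypothetical stand-in for DataSplitterSummary and its metadata serialization
case class DataSplitterSummarySketch(preSplitterDataCount: Long, downSamplingFraction: Double) {
  def toMetadata(): Metadata = new MetadataBuilder()
    .putLong("PreSplitterDataCount", preSplitterDataCount) // key strings assumed
    .putDouble("DownSample", downSamplingFraction)
    .build()
}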
please add the downsample fraction to the datacutter params as well...
I added the downsample fraction to the DataCutter params as part of the multi-class classification training limit changes. I'll create the PR for it today.
An unrelated test fails during the Travis CI check (https://travis-ci.com/salesforce/TransmogrifAI/jobs/243091644). I'll create a ticket to increase the tolerance on the test here.
Related issues
DataBalancer for binary classification has a parameter that controls the max data passed into modeling - Regression should allow similar limits
Describe the proposed solution
Steps:
Investigate where the check should occur (somewhere in DataSplitter)
Add logic to downsample when the maximum number of records is exceeded
Add downsampling information to the summary object and log
Add tests to DataSplitterTest and RegressionModelSelectorTest to cover the downsampling logic
Describe alternatives you've considered
Not having a limit for any of the model types. This was not optimal because some Spark models may have very long runtimes or bad behavior with too much data, so the default will be to downsample once we pass 1M records, and give the user the option to set their own maxTrainingSample if they are OK with working with a larger dataset.
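Based on the builder shown earlier in this PR, usage might look roughly like this; the parameter names come from the setters in the diff, while the exact apply signature, defaults, and types are assumptions.

// Keep the assumed 1M-row default, or raise the cap explicitly for a larger training set
val defaultSplitter = DataSplitter(seed = 42L)
val bigDataSplitter = DataSplitter(
  seed = 42L,
  reserveTestFraction = 0.1,
  maxTrainingSample = 5000000 // rows kept for training before down-sampling
)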