Regression training limit #413
Conversation
… that inherit from splitter can use them
Thanks for the contribution! Before we can merge this, we need @AdamChit to sign the Salesforce.com Contributor License Agreement.
@TuanNguyen27 Could you review?
 *
 * @group param
 */
private[op] final val downSampleFraction = new DoubleParam(this, "downSampleFraction",
these can be protected instead of private to OP
also maybe set a default value of 1
Added the default value here
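For concreteness, here is a minimal sketch of that param with the two suggestions above applied (protected visibility and a default of 1.0); the enclosing trait, doc string, and validator are assumptions, not the actual TransmogrifAI code.

import org.apache.spark.ml.param.{DoubleParam, ParamValidators, Params}

// Hypothetical trait holding the down-sample fraction param
trait DownSampleFractionParam extends Params {
  // protected rather than private[op], per the review comment above
  protected final val downSampleFraction = new DoubleParam(this, "downSampleFraction",
    "fraction used to down-sample the training data, in (0, 1]",
    ParamValidators.inRange(0.0, 1.0, lowerInclusive = false, upperInclusive = true))

  setDefault(downSampleFraction, 1.0) // 1.0 means no down-sampling
}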
): DataSplitter = {
  new DataSplitter()
    .setSeed(seed)
    .setReserveTestFraction(reserveTestFraction)
    .setMaxTrainingSample(maxTrainingSample)
is this also exposed for the datacutter class?
Yes, I added it to SplitterParams, which DataCutter has access to (e4b8a92), so that I can use the same set/get functions across DataBalancer, DataCutter, and DataSplitter.
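As a rough sketch of that shared placement, the param and its set/get functions could sit in a SplitterParams-style trait that all three splitters mix in; the trait name, Long type, and 1M default below are assumptions based on this thread, not the actual code.

import org.apache.spark.ml.param.{LongParam, Params}

// Hypothetical stand-in for the shared SplitterParams trait mentioned above
trait MaxTrainingSampleParam extends Params {
  final val maxTrainingSample = new LongParam(this, "maxTrainingSample",
    "maximum number of training rows allowed before down-sampling kicks in")

  setDefault(maxTrainingSample, 1000000L)

  def setMaxTrainingSample(value: Long): this.type = set(maxTrainingSample, value)
  def getMaxTrainingSample: Long = $(maxTrainingSample)
}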
  modelSelector.splitter.get.getMaxTrainingSample shouldBe 1000
}

it should "set maxTrainingSample and down-sample" in {
Did you also mean to check maxTrainingSample was set in this test as well?
I moved testing of the maxTrainingSample set/get functions here and forgot to rename this test. Great catch!
core/src/test/scala/com/salesforce/op/stages/impl/regression/RegressionModelSelectorTest.scala
summary = Option(DataSplitterSummary())
val dataSetSize = data.count().toDouble
val sampleF = getMaxTrainingSample / dataSetSize
val DownSampleFraction = if (getMaxTrainingSample < dataSetSize) sampleF else 1
Could be a min here too; also: lowerCamelCase.

Suggested change:
- val DownSampleFraction = if (getMaxTrainingSample < dataSetSize) sampleF else 1
+ val downSampleFraction = math.min(sampleF, 1.0)
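Put together, the down-sampling step this diff implements could look roughly like the standalone sketch below; it is simplified and hypothetical, and the real DataSplitter also records the pre-split count and the fraction in its DataSplitterSummary.

import org.apache.spark.sql.{Dataset, Row}

object DownSampleSketch {
  // Simplified, hypothetical version of the logic above
  def downSampleIfNeeded(data: Dataset[Row], maxTrainingSample: Long, seed: Long): Dataset[Row] = {
    val dataSetSize = data.count().toDouble
    val downSampleFraction = math.min(maxTrainingSample / dataSetSize, 1.0)
    // Dataset.sample is approximate, so the result can land slightly above or below the limit
    if (downSampleFraction < 1.0) data.sample(withReplacement = false, downSampleFraction, seed)
    else data
  }
}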
core/src/test/scala/com/salesforce/op/stages/impl/tuning/DataSplitterTest.scala
@@ -43,6 +43,7 @@ class DataSplitterTest extends FlatSpec with TestSparkContext with SplitterSumma

  val seed = 1234L
  val dataCount = 1000
  val MaxTrainingSampleDefault = 1E6.toLong
I would give this a different name so we don't confuse it with the actual default in SplitterParams
  dataSplitter.getMaxTrainingSample shouldBe maxRows
}
It's probably worth checking the downSampleFraction param was set here as well, for completeness, so that you have checked everything in DataSplitterParams.
Made changes here: 433d483
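For reference, the extra check could look something like the sketch below, written in the same FlatSpec style as DataSplitterTest; the downSampleFraction setter/getter names are assumed from the shared set/get pattern discussed above and may not match the actual SplitterParams.

// Hypothetical test fragment, assuming the seed val and matchers already present in DataSplitterTest
it should "set maxTrainingSample and downSampleFraction" in {
  val maxRows = 1000
  val dataSplitter = DataSplitter(seed = seed).setMaxTrainingSample(maxRows)
  dataSplitter.getMaxTrainingSample shouldBe maxRows
  dataSplitter.setDownSampleFraction(0.5)
  dataSplitter.getDownSampleFraction shouldBe 0.5
}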
…AdamChit/TransmogrifAI into achit/regression-training-limit
…plitterTest.scala Co-Authored-By: Christopher Rupley <crupley@gmail.com>
…egressionModelSelectorTest.scala Co-Authored-By: Christopher Rupley <crupley@gmail.com>
Codecov Report
@@            Coverage Diff             @@
##           master     #413      +/-   ##
==========================================
+ Coverage   86.97%   86.99%   +0.02%
==========================================
  Files         337      337
  Lines       11060    11078      +18
  Branches      357      597     +240
==========================================
+ Hits         9619     9637      +18
  Misses       1441     1441

Continue to review full report at Codecov.
…AdamChit/TransmogrifAI into achit/regression-training-limit
case s if s == classOf[DataSplitterSummary].getName => DataSplitterSummary(
  preSplitterDataCount = metadata.getLong(ModelSelectorNames.PreSplitterDataCount),
  downSamplingFraction = metadata.getDouble(ModelSelectorNames.DownSample)
)
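As a reading aid, the write side of this round trip could look like the sketch below; the field names come from the diff, while the metadata key strings and the case class shape are assumptions.

import org.apache.spark.sql.types.{Metadata, MetadataBuilder}

// Hypothetical stand-in for DataSplitterSummary and its metadata serialization
case class DataSplitterSummarySketch(preSplitterDataCount: Long, downSamplingFraction: Double) {
  def toMetadata(): Metadata = new MetadataBuilder()
    .putLong("PreSplitterDataCount", preSplitterDataCount) // key strings assumed
    .putDouble("DownSample", downSamplingFraction)
    .build()
}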
please add the downsample fraction to the datacutter params as well...
I added the downsample fraction to the DataCutter params as part of the multi-class classification training limit changes. I'll create the PR for it today.
An unrelated test fails during the Travis CI check (https://travis-ci.com/salesforce/TransmogrifAI/jobs/243091644). I'll create a ticket to increase the tolerance on the test here.
Related issues
DataBalancer for binary classification has a parameter that controls the max data passed into modeling - Regression should allow similar limits
Describe the proposed solution
Steps:
Investigate where the check should occur (somewhere in DataSplitter)
Add logic to downsample when the maximum number of records is exceeded
Add downsampling information to the summary object and log
Add tests to DataSplitterTest and RegressionModelSelectorTest to cover the downsampling logic
Describe alternatives you've considered
Not having a limit for any of the model types. This was not optimal because some Spark models may have very long runtimes or bad behavior with too much data, so the default will be to downsample once we pass 1M records, and give the user the option to set their own maxTrainingSample if they are OK with working with a larger dataset.
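Based on the builder shown earlier in this PR, usage might look roughly like this; the parameter names come from the setters in the diff, while the exact apply signature, defaults, and types are assumptions.

// Keep the assumed 1M-row default, or raise the cap explicitly for a larger training set
val defaultSplitter = DataSplitter(seed = 42L)
val bigDataSplitter = DataSplitter(
  seed = 42L,
  reserveTestFraction = 0.1,
  maxTrainingSample = 5000000 // rows kept for training before down-sampling
)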