-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-30642][ML][PYSPARK] LinearSVC blockify input vectors #27360
Conversation
testCode: import org.apache.spark.ml.classification._
import org.apache.spark.storage.StorageLevel
var df = spark.read.format("libsvm").load("/data1/Datasets/a9a/a9a").withColumn("label", (col("label")+1)/2)
df.persist(StorageLevel.MEMORY_AND_DISK)
df.count
(0 until 8).foreach{ _ => df = df.union(df) }
df.count
new LinearSVC().setMaxIter(10).fit(df) // warm up
val svc = new LinearSVC().setMaxIter(100)
val start = System.currentTimeMillis; val model = svc.fit(df); val end = System.currentTimeMillis; end - start
val svc = new LinearSVC().setMaxIter(100).setFitIntercept(false)
val start = System.currentTimeMillis; val model = svc.fit(df); val end = System.currentTimeMillis; end - start result: MASTER: Native-BLAS is NOT used in above tests, maybe future performance gain can be obtained by setting appropriate Native BLAS. |
@@ -220,8 +227,22 @@ class LinearSVC @Since("2.2.0") ( | |||
None | |||
} | |||
|
|||
val getAggregatorFunc = new HingeAggregator(bcFeaturesStd, $(fitIntercept))(_) | |||
val costFun = new RDDLossFunction(instances, getAggregatorFunc, regularization, | |||
val standardized = instances.map { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move the standardization outside of HingeAggregator
, so that no longer need to standardize input in each iter.
Test build #117398 has finished for PR 27360 at commit
|
Test build #117400 has finished for PR 27360 at commit
|
OK, I would generally support it if it's not a huge change, optimizes performance, and doesn't change behavior / correctness. |
I added |
Test build #117404 has finished for PR 27360 at commit
|
Test build #117407 has finished for PR 27360 at commit
|
retest this please |
Test build #117410 has finished for PR 27360 at commit
|
Test build #117416 has finished for PR 27360 at commit
|
Test build #117418 has finished for PR 27360 at commit
|
retest this please |
Test build #117420 has finished for PR 27360 at commit
|
2f84a3d
to
cb91306
Compare
Test build #117444 has finished for PR 27360 at commit
|
Test build #117474 has finished for PR 27360 at commit
|
Merged to master. |
### What changes were proposed in this pull request? Revert #27360 #27396 #27374 #27389 ### Why are the changes needed? BLAS need more performace tests, specially on sparse datasets. Perfermance test of LogisticRegression (#27374) on sparse dataset shows that blockify vectors to matrices and use BLAS will cause performance regression. LinearSVC and LinearRegression were also updated in the same way as LogisticRegression, so we need to revert them to make sure no regression. ### Does this PR introduce any user-facing change? remove newly added param blockSize ### How was this patch tested? reverted testsuites Closes #27487 from zhengruifeng/revert_blockify_ii. Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request? Revert #27360 #27396 #27374 #27389 ### Why are the changes needed? BLAS need more performace tests, specially on sparse datasets. Perfermance test of LogisticRegression (#27374) on sparse dataset shows that blockify vectors to matrices and use BLAS will cause performance regression. LinearSVC and LinearRegression were also updated in the same way as LogisticRegression, so we need to revert them to make sure no regression. ### Does this PR introduce any user-facing change? remove newly added param blockSize ### How was this patch tested? reverted testsuites Closes #27487 from zhengruifeng/revert_blockify_ii. Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request? Revert apache#27360 apache#27396 apache#27374 apache#27389 ### Why are the changes needed? BLAS need more performace tests, specially on sparse datasets. Perfermance test of LogisticRegression (apache#27374) on sparse dataset shows that blockify vectors to matrices and use BLAS will cause performance regression. LinearSVC and LinearRegression were also updated in the same way as LogisticRegression, so we need to revert them to make sure no regression. ### Does this PR introduce any user-facing change? remove newly added param blockSize ### How was this patch tested? reverted testsuites Closes apache#27487 from zhengruifeng/revert_blockify_ii. Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
What changes were proposed in this pull request?
1, stack input vectors to blocks (like ALS/MLP);
2, add new param
blockSize
;3, add a new class
InstanceBlock
4, standardize the input outside of optimization procedure;
Why are the changes needed?
1, reduce RAM to persist traing dataset; (save ~40% in test)
2, use Level-2 BLAS routines; (12% ~ 41% faster, without native BLAS)
Does this PR introduce any user-facing change?
a new param
blockSize
How was this patch tested?
existing and updated testsuites