Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-30642][ML][PYSPARK] LinearSVC blockify input vectors #27360

Closed
wants to merge 10 commits into from

Conversation

zhengruifeng
Copy link
Contributor

@zhengruifeng zhengruifeng commented Jan 25, 2020

What changes were proposed in this pull request?

1, stack input vectors to blocks (like ALS/MLP);
2, add new param blockSize;
3, add a new class InstanceBlock
4, standardize the input outside of optimization procedure;

Why are the changes needed?

1, reduce RAM to persist traing dataset; (save ~40% in test)
2, use Level-2 BLAS routines; (12% ~ 41% faster, without native BLAS)

Does this PR introduce any user-facing change?

a new param blockSize

How was this patch tested?

existing and updated testsuites

@zhengruifeng
Copy link
Contributor Author

zhengruifeng commented Jan 25, 2020

testCode:

import org.apache.spark.ml.classification._
import org.apache.spark.storage.StorageLevel


var df = spark.read.format("libsvm").load("/data1/Datasets/a9a/a9a").withColumn("label", (col("label")+1)/2)
df.persist(StorageLevel.MEMORY_AND_DISK)
df.count

(0 until 8).foreach{ _ => df = df.union(df) }
df.count

new LinearSVC().setMaxIter(10).fit(df) // warm up

val svc = new LinearSVC().setMaxIter(100)
val start = System.currentTimeMillis; val model = svc.fit(df); val end = System.currentTimeMillis; end - start


val svc = new LinearSVC().setMaxIter(100).setFitIntercept(false)
val start = System.currentTimeMillis; val model = svc.fit(df); val end = System.currentTimeMillis; end - start

result:
this PR:
RAM:1418.9M
Duration: 396524(fitIntercept=true), 324944(fitIntercept=false)

MASTER:
RAM: 2.3G
Duration: 446354(fitIntercept=true), 441961(fitIntercept=false)

Native-BLAS is NOT used in above tests, maybe future performance gain can be obtained by setting appropriate Native BLAS.

@@ -220,8 +227,22 @@ class LinearSVC @Since("2.2.0") (
None
}

val getAggregatorFunc = new HingeAggregator(bcFeaturesStd, $(fitIntercept))(_)
val costFun = new RDDLossFunction(instances, getAggregatorFunc, regularization,
val standardized = instances.map {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move the standardization outside of HingeAggregator, so that no longer need to standardize input in each iter.

@SparkQA
Copy link

SparkQA commented Jan 25, 2020

Test build #117398 has finished for PR 27360 at commit 6dcb9eb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait HasBlockSize extends Params
  • class HasBlockSize(Params):

@SparkQA
Copy link

SparkQA commented Jan 25, 2020

Test build #117400 has finished for PR 27360 at commit c51c37b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Copy link
Member

srowen commented Jan 25, 2020

OK, I would generally support it if it's not a huge change, optimizes performance, and doesn't change behavior / correctness.

@zhengruifeng
Copy link
Contributor Author

I added InstanceBlock as a common class, since I think it will also works in LoR, LiR, and other algs.
I add some testsuite to check correctness.
Thanks for reviewing! @srowen

@SparkQA
Copy link

SparkQA commented Jan 26, 2020

Test build #117404 has finished for PR 27360 at commit ea83dc1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 26, 2020

Test build #117407 has finished for PR 27360 at commit 7676b8e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng
Copy link
Contributor Author

retest this please

@SparkQA
Copy link

SparkQA commented Jan 26, 2020

Test build #117410 has finished for PR 27360 at commit 7676b8e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 26, 2020

Test build #117416 has finished for PR 27360 at commit 487ee0f.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 26, 2020

Test build #117418 has finished for PR 27360 at commit 2f84a3d.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng
Copy link
Contributor Author

retest this please

@SparkQA
Copy link

SparkQA commented Jan 26, 2020

Test build #117420 has finished for PR 27360 at commit 2f84a3d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 27, 2020

Test build #117444 has finished for PR 27360 at commit cb91306.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 28, 2020

Test build #117474 has finished for PR 27360 at commit 819227e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng
Copy link
Contributor Author

Merged to master.

@zhengruifeng zhengruifeng deleted the blockify_svc branch January 28, 2020 12:56
@zhengruifeng zhengruifeng restored the blockify_svc branch February 6, 2020 08:03
zhengruifeng added a commit that referenced this pull request Feb 8, 2020
### What changes were proposed in this pull request?
Revert
#27360
#27396
#27374
#27389

### Why are the changes needed?
BLAS need more performace tests, specially on sparse datasets.
Perfermance test of LogisticRegression (#27374) on sparse dataset shows that blockify vectors to matrices and use BLAS will cause performance regression.
LinearSVC and LinearRegression were also updated in the same way as LogisticRegression, so we need to revert them to make sure no regression.

### Does this PR introduce any user-facing change?
remove newly added param blockSize

### How was this patch tested?
reverted testsuites

Closes #27487 from zhengruifeng/revert_blockify_ii.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
zhengruifeng added a commit that referenced this pull request Feb 25, 2020
### What changes were proposed in this pull request?
Revert
#27360
#27396
#27374
#27389

### Why are the changes needed?
BLAS need more performace tests, specially on sparse datasets.
Perfermance test of LogisticRegression (#27374) on sparse dataset shows that blockify vectors to matrices and use BLAS will cause performance regression.
LinearSVC and LinearRegression were also updated in the same way as LogisticRegression, so we need to revert them to make sure no regression.

### Does this PR introduce any user-facing change?
remove newly added param blockSize

### How was this patch tested?
reverted testsuites

Closes #27487 from zhengruifeng/revert_blockify_ii.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
### What changes were proposed in this pull request?
Revert
apache#27360
apache#27396
apache#27374
apache#27389

### Why are the changes needed?
BLAS need more performace tests, specially on sparse datasets.
Perfermance test of LogisticRegression (apache#27374) on sparse dataset shows that blockify vectors to matrices and use BLAS will cause performance regression.
LinearSVC and LinearRegression were also updated in the same way as LogisticRegression, so we need to revert them to make sure no regression.

### Does this PR introduce any user-facing change?
remove newly added param blockSize

### How was this patch tested?
reverted testsuites

Closes apache#27487 from zhengruifeng/revert_blockify_ii.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
@zhengruifeng zhengruifeng deleted the blockify_svc branch February 25, 2021 10:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants