mllib test code - RAM / AUC improvements needed #5

mengxr · 2015-05-05T06:39:20Z

@szilard For MLlib, you should repartition the data to match the number of cores. For example, try train.repartition(32).cache(). Otherwise, you may not use all the cores. Also, if the data is sparse, you should create sparse vectors instead of dense vectors.

The text was updated successfully, but these errors were encountered:

szilard · 2015-05-05T18:46:07Z

It was using all 32 cores (I think it made 41 partitions for n=1M). How about we do some work together to make sure I'm not missing obvious things?

Here is a script to generate the n=1M dataset and also the 1-hot encoding and the categoricalFeaturesInfo style-encoding: https://gist.github.com/szilard/9d75ad09918d338624ad

Let's try to train a random forest with 10 trees first on r3.8xlarge (32 cores, 250G RAM), then maybe on a cluster. I'm starting with this: https://gist.github.com/szilard/b9e1e7b5dec502b9f8c6

mengxr · 2015-05-05T18:54:43Z

41 partitions on 32 cores is not the best setting because you are measuring the wall-clock time. For sparse input, you need to create an RDD of sparse vectors (by removing all the zeros) instead of dense ones to take advantage of sparsity.

We have the official performance tests implemented at https://github.com/databricks/spark-perf. It would be great if you can add a test there with generated data (so we can change the scaling factors easily).

szilard · 2015-05-07T06:36:50Z

Thanks. Using sparse vectors and 32 partitions made things a bit faster.

However, memory footprint is still pretty large and the AUC is lower than with any of the other methods (see writeup in the main README).

At this stage I think it's easier to experiment interactively on the command line and figure out what's going on rather than spending time writing a rigorous performance test in Scala (not my strength anyway). Any chance you can take a look?

mengxr · 2015-06-28T23:05:55Z

@szilard I had been busy with the 1.4 release. I understand that it is easy to compare implementations with their default settings. However, the implementations were made with different assumptions. You should understand how to tune each implementation to publish benchmark results.

For example, I tried your dataset using logistic regression on a 8x 4-core EC2 cluster, which may match your setting. Here is my code:

import org.apache.spark.sql.DataFrame
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

def load(filename: String): DataFrame = {
  sc.textFile(s"s3n://benchm-ml--spark/$filename")
  .map { line =>
    val vv = line.split(',').map(_.toDouble)
    val label = vv(0)
    val features = Vectors.dense(vv.slice(1, vv.length)).toSparse
    (label, features)
  }.toDF("label", "features")
}

val train = load("spark-train-1m.csv").repartition(32).cache()
val test = load("spark-test-1m.csv").repartition(32).cache()
train.count()
test.count()

val lr = new LogisticRegression()
  .setTol(1e-3)
val model = lr.fit(train)

val evaluator = new BinaryClassificationEvaluator()
  .setMetricName("areaUnderROC")
evaluator.evaluate(model.transform(train))
evaluator.evaluate(model.transform(test))

train uses 216.6MB memory, while test uses 21.6MB memory (see attached screenshot). lr.fit took less than 3 seconds to finish. ROC on train is 0.7139922364513273 and on test is 0.7088271083682873.

Please check your test code and update the result, especially on the memory requirement.

szilard · 2015-06-30T02:19:10Z

Thanks @mengxr. This github issue was related mostly to random forests, while your help is for logistic regression, so I'm splitting the issue into two: random forest issue continues here #16 while logistic regression continued here #17.

szilard · 2015-07-03T01:13:13Z

closing this after split above into 2 new issues

szilard changed the title ~~mllib test code~~ mllib test code - RAM / AUC improvements needed May 17, 2015

szilard added wontfix enhancement help wanted and removed wontfix enhancement labels May 17, 2015

This was referenced Jun 30, 2015

Spark random forest low AUC etc #16

Closed

Spark logistic regression issues #17

Closed

szilard closed this as completed Jul 3, 2015

szilard removed enhancement help wanted labels Aug 14, 2015

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

mllib test code - RAM / AUC improvements needed #5

mllib test code - RAM / AUC improvements needed #5

mengxr commented May 5, 2015

szilard commented May 5, 2015

mengxr commented May 5, 2015

szilard commented May 7, 2015

mengxr commented Jun 28, 2015

szilard commented Jun 30, 2015

szilard commented Jul 3, 2015

mllib test code - RAM / AUC improvements needed #5

mllib test code - RAM / AUC improvements needed #5

Comments

mengxr commented May 5, 2015

szilard commented May 5, 2015

mengxr commented May 5, 2015

szilard commented May 7, 2015

mengxr commented Jun 28, 2015

szilard commented Jun 30, 2015

szilard commented Jul 3, 2015