[SPARK-29823][MLLIB] Improper persist strategy in mllib.clustering.KMeans.run() #26483

amanomer · 2019-11-12T14:52:04Z

What changes were proposed in this pull request?

Adjust which RDD is persisted in `KMeans.run()`: persist `zippedData` rather than `norms`.

Why are the changes needed?

To fix the improper persist strategy: the actions in `runAlgorithm()` run on `zippedData`, not on `norms` directly.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Manually

amanomer · 2019-11-12T14:56:24Z

mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala

@@ -223,12 +223,12 @@ class KMeans private (

    // Compute squared norms and cache them.
    val norms = data.map(Vectors.norm(_, 2.0))


`norms` has only one child, `zippedData`, so every action that relies on `norms` also relies on `zippedData`. In `runAlgorithm()`, multiple actions are applied to `zippedData`.
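The reasoning in the comment above can be illustrated with a minimal sketch. This is not the actual patch: `PointWithNorm` and `zipAndPersist` are hypothetical names standing in for Spark's internal `VectorWithNorm` and the relevant part of `run()`, and the use of the default `persist()` storage level is an assumption.

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// Hypothetical stand-in for Spark's internal VectorWithNorm.
case class PointWithNorm(vector: Vector, norm: Double)

object PersistSketch {
  // Builds the zipped RDD and persists it, because it is the RDD that the
  // iterative algorithm runs its actions on. `norms` is left unpersisted:
  // it is only an intermediate parent of `zipped`.
  def zipAndPersist(data: RDD[Vector]): RDD[PointWithNorm] = {
    val norms = data.map(Vectors.norm(_, 2.0))
    val zipped = data.zip(norms).map { case (v, n) => PointWithNorm(v, n) }
    // The caller is expected to call unpersist() on the result when done.
    zipped.persist()
  }
}
```

Persisting the RDD that the actions actually touch means the zip and the wrapping step are not recomputed on every iteration.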

amanomer · 2019-11-12T22:50:38Z

cc @srowen

srowen

OK, looks good as a point fix, pending tests. I checked and VectorWithNorm is registered by default with Kryo, so serializing it to memory should be OK.

SparkQA · 2019-11-13T00:43:51Z

Test build #4930 has finished for PR 26483 at commit 99f165d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2019-11-13T14:16:31Z

Merged to master

zhengruifeng · 2019-12-29T06:51:28Z

@srowen @amanomer Do we need this change?
`data: RDD[Vector]` and `norms: RDD[Double]` are both cached; `zippedData` only adds the extra cost of converting vectors to `VectorWithNorm`s.

Does this help improve performance? I do not see any performance results.
On the contrary, this will cause a double-caching problem, since the vectors are already cached in .ml, or maybe outside of the run method, but that cached copy is never used.

zhengruifeng · 2019-12-29T07:05:48Z

I mean, `input: RDD[Vector]` is likely to be cached outside of this method:
1. It is cached in ml.KMeans.
2. End users are likely to cache it outside of train/run, since the related docs suggest it.

So if we cache `zippedData`, we really cache the input twice.

srowen · 2019-12-29T14:15:26Z

It's better than the previous behavior, but yes, this could instead persist only if the input wasn't already persisted (and remove the warning about persisting the input).

zhengruifeng · 2019-12-30T05:58:07Z

@srowen I agree that there should be some performance improvement if RAM is big enough, at the cost of double caching.
However, if the RAM cannot fit the two copies, I think this will hurt performance.

End users are told to cache the input RDD/DF for ML algorithms, and I think double caching matters more than this performance improvement.
So I am against this change.

srowen · 2019-12-30T15:01:02Z

Yes I'm not suggesting two caches is helpful here; it was doing that before too though. That's not what this change does. I do think it's fine to follow this up and make it conditionally cache. @amanomer are you interested?

amanomer · 2019-12-30T15:20:18Z

Yes. I will raise a follow-up PR.

### What changes were proposed in this pull request?

Check before caching zippedData (as suggested in #26483 (comment)).

### Why are the changes needed?

If the `data` is already cached before calling `run` method of `KMeans` then `zippedData.persist()` will hurt the performance. Hence, persisting it conditionally.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Manually.

Closes #27052 from amanomer/29823followup.

Authored-by: Aman Omer <amanomer1996@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
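A minimal sketch of the conditional persistence described in the follow-up, under stated assumptions: `ConditionalPersist` and `withConditionalPersist` are hypothetical names, and `MEMORY_AND_DISK` is an assumed storage level, not necessarily what the merged commit uses. Only `getStorageLevel`, `persist`, `unpersist`, and `StorageLevel` are existing Spark APIs.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object ConditionalPersist {
  // Runs `body` on `derived`, persisting `derived` only when the caller has
  // not already persisted `input`. This avoids holding two cached copies of
  // essentially the same data.
  def withConditionalPersist[A, B, R](input: RDD[A], derived: RDD[B])(
      body: RDD[B] => R): R = {
    // StorageLevel.NONE means no persistence was requested on the input.
    val handlePersistence = input.getStorageLevel == StorageLevel.NONE
    // Assumed storage level, chosen for illustration only.
    if (handlePersistence) derived.persist(StorageLevel.MEMORY_AND_DISK)
    try body(derived)
    finally if (handlePersistence) derived.unpersist()
  }
}
```

In this sketch `input` would play the role of `data` and `derived` the role of `zippedData`. The `try`/`finally` releases the cached blocks even if the algorithm throws.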

Initial commit

99f165d


srowen changed the title ~~[SPARK-29823][MLIB] Improper persist strategy in mllib.clustering.KMeans.run()~~ [SPARK-29823][MLLIB] Improper persist strategy in mllib.clustering.KMeans.run() Nov 12, 2019

srowen approved these changes Nov 12, 2019


srowen closed this in 8c2bf64 Nov 13, 2019

amanomer mentioned this pull request Dec 30, 2019

[SPARK-30390][MLLIB] Avoid double caching in mllib.KMeans#runWithWeights. #27052

Closed
