Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-29823][MLLIB] Improper persist strategy in mllib.clustering.KMeans.run() #26483

Closed
wants to merge 1 commit into from

Conversation

amanomer
Copy link
Contributor

What changes were proposed in this pull request?

Adjust RDD to persist.

Why are the changes needed?

To handle the improper persist strategy.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Manually

@@ -223,12 +223,12 @@ class KMeans private (

// Compute squared norms and cache them.
val norms = data.map(Vectors.norm(_, 2.0))
Copy link
Contributor Author

@amanomer amanomer Nov 12, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

norms have only one child zippedData so all actions that rely on norms also rely on zippedData and in runAlgorithm(), multiple actions have been applied on zippedData.

@amanomer
Copy link
Contributor Author

cc @srowen

@srowen srowen changed the title [SPARK-29823][MLIB] Improper persist strategy in mllib.clustering.KMeans.run() [SPARK-29823][MLLIB] Improper persist strategy in mllib.clustering.KMeans.run() Nov 12, 2019
Copy link
Member

@srowen srowen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK looks good as a point fix, pending tests. I checked and VectorWithNorm is registered by default wit Kryo, so serializing it to memory should be OK

@SparkQA
Copy link

SparkQA commented Nov 13, 2019

Test build #4930 has finished for PR 26483 at commit 99f165d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen srowen closed this in 8c2bf64 Nov 13, 2019
@srowen
Copy link
Member

srowen commented Nov 13, 2019

Merged to master

@zhengruifeng
Copy link
Contributor

@srowen @amanomer Do we need this change?
data: RDD[Vector] and norms: RDD[Double] are both cached, zippedData just involved extra cost of convert vectors to VectorWithNorms.

Do this help improving performace? I donot see any performance result.
on the contrary, this will cause double caching problem:
Since vectors are already cached in the .ml or maybe outside of the run method, but never used.

@zhengruifeng
Copy link
Contributor

I mean, input: RDD[Vector] is likely to be cached outside of this method:
1, it is cached in ml.KMeans
2, end uers are likely to cache it outside of train/run, since it is suggested in related docs

So if we cache zippedData, we really cache input twice.

@srowen
Copy link
Member

srowen commented Dec 29, 2019

It's better than the previous behavior, but yes this could also instead only bother persisting if the input wasn't (and remove the warning about persisting the input)

@zhengruifeng
Copy link
Contributor

@srowen I agree that there should be somewhat perfermance improvement if RAM is big enough, at the cost of double caching.
However if the RAM can not fit the two copies, I think this will hurt the perfermance.

End uers are telled to cache the input RDD/DF for ML algorithms, and I think double caching matters more than this perfermance improvement.
So I am against to this change.

@srowen
Copy link
Member

srowen commented Dec 30, 2019

Yes I'm not suggesting two caches is helpful here; it was doing that before too though. That's not what this change does. I do think it's fine to follow this up and make it conditionally cache. @amanomer are you interested?

@amanomer
Copy link
Contributor Author

Yes. I will raise a follow up PR.

srowen pushed a commit that referenced this pull request Jan 4, 2020
### What changes were proposed in this pull request?
Check before caching zippedData (as suggested in #26483 (comment)).

### Why are the changes needed?
If the `data` is already cached before calling `run` method of `KMeans` then `zippedData.persist()` will hurt the performance. Hence, persisting it conditionally.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Manually.

Closes #27052 from amanomer/29823followup.

Authored-by: Aman Omer <amanomer1996@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants