[SPARK-29823][MLLIB] Improper persist strategy in mllib.clustering.KMeans.run() #26483
@@ -223,12 +223,12 @@ class KMeans private (

// Compute squared norms and cache them.
val norms = data.map(Vectors.norm(_, 2.0))
`norms` has only one child, `zippedData`, so all actions that rely on `norms` also rely on `zippedData`, and in `runAlgorithm()` multiple actions are applied to `zippedData`.
cc @srowen
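The reasoning above can be illustrated with a minimal sketch. It assumes a local Spark session and uses a hypothetical `PointWithNorm` case class as a stand-in for MLlib's package-private `VectorWithNorm`; the object and method names are illustrative and are not the code in this PR.

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Hypothetical stand-in for MLlib's package-private VectorWithNorm.
case class PointWithNorm(vector: Vector, norm: Double)

object PersistZipped {
  // Builds the zipped RDD; `norms` is consumed only here.
  def zipWithNorms(data: RDD[Vector]): RDD[PointWithNorm] = {
    val norms = data.map(Vectors.norm(_, 2.0))
    data.zip(norms).map { case (v, n) => PointWithNorm(v, n) }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("persist-sketch")
      .getOrCreate()
    val data = spark.sparkContext.parallelize(
      Seq(Vectors.dense(3.0, 4.0), Vectors.dense(0.0, 1.0)))

    // Persist the RDD that the actions actually run on, not its parent.
    val zipped = zipWithNorms(data).persist()

    // Two actions on the same RDD: the second can reuse cached partitions.
    val count = zipped.count()
    val maxNorm = zipped.map(_.norm).max()
    println(s"count=$count maxNorm=$maxNorm")

    zipped.unpersist()
    spark.stop()
  }
}
```

Persisting only `norms` would still leave the zip and map steps to be recomputed for every action, which is the issue the comment describes.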
OK, looks good as a point fix, pending tests. I checked and VectorWithNorm is registered by default with Kryo, so serializing it to memory should be OK.
Test build #4930 has finished for PR 26483 at commit
Merged to master |
@srowen @amanomer Do we need this change? Does this help improve performance? I do not see any performance results.
I mean, So if we cache |
It's better than the previous behavior, but yes this could also instead only bother persisting if the input wasn't (and remove the warning about persisting the input) |
@srowen I agree that there should be some performance improvement if RAM is big enough, at the cost of double caching. End users are told to cache the input RDD/DF for ML algorithms, and I think double caching matters more than this performance improvement.
Yes I'm not suggesting two caches is helpful here; it was doing that before too though. That's not what this change does. I do think it's fine to follow this up and make it conditionally cache. @amanomer are you interested? |
Yes. I will raise a follow up PR. |
### What changes were proposed in this pull request?
Check before caching zippedData (as suggested in #26483 (comment)).

### Why are the changes needed?
If the `data` is already cached before calling `run` method of `KMeans` then `zippedData.persist()` will hurt the performance. Hence, persisting it conditionally.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Manually.

Closes #27052 from amanomer/29823followup.

Authored-by: Aman Omer <amanomer1996@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
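The conditional caching described in this commit message might look roughly like the sketch below. `ConditionalPersist` and `withConditionalPersist` are hypothetical names introduced for illustration and are not part of MLlib; the only Spark calls used are `RDD.getStorageLevel`, `persist()`, `unpersist()`, and `StorageLevel.NONE`.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Hypothetical helper, not part of MLlib.
object ConditionalPersist {
  // Persists `derived` only when the caller has not already cached `input`,
  // runs `body` against it, and releases the cache afterwards.
  def withConditionalPersist[T, R](input: RDD[_], derived: RDD[T])(body: RDD[T] => R): R = {
    val handlePersistence = input.getStorageLevel == StorageLevel.NONE
    if (handlePersistence) derived.persist()
    try body(derived)
    finally if (handlePersistence) derived.unpersist()
  }
}
```

With a helper like this, a caller who has already cached the input avoids a second cached copy, while a caller who has not still gets the benefit of caching the derived RDD across multiple actions.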
What changes were proposed in this pull request?
Adjust RDD to persist.
Why are the changes needed?
To handle the improper persist strategy.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Manually