Changes to support KMeans with large feature space #10739
Hi @levin-royl, you need to remove the two log files and create a JIRA. See https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark |
Thank you, I removed the log files and added the following JIRA request: |
You may have created your JIRA in the wrong place. It should be under Spark, not Kylin: https://issues.apache.org/jira/browse/SPARK |
Sorry, I am a little new to this. For some reason when choosing "create new" I only had the options: Kylin, Atlas or Apache Infrastructure. Now through the link you sent I created the following JIRA request in Spark: |
Hi, I just wanted to know if there are any outstanding items on my end with regard to this change. Thanks. |
Read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark first. You may want to search for duplicate JIRAs too. There are several on this topic and k-means. |
Hi, I did find some similar issues, e.g.: https://issues.apache.org/jira/browse/SPARK-4039 The difference is that in the problem I describe, reducing the dimensionality of the problem (i.e., the feature space) so that dense vectors can be used is not suitable. Also, the solution I implemented supports this case while leaving full flexibility to the user, i.e., using the default dense vector implementation or selecting an alternative (only when the default is not desired). I will update the JIRA issue on this as well. Please advise if there are any additional steps I need to take at this point. Thanks in advance. |
I wanted to know if you took a look at the code and the proposed solution in general. Are there any comments? Thanks. |
@levin-royl it looks like @hhbyyh has a branch with some code that tackles the same issue (see the JIRA discussion for more information), so you may want to coordinate there. Also, I suggest you take another look at the pull request section of the guidelines: https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark |
I didn't send a PR because there's some ongoing effort on transforming the implementation of KMeans to Matrix multiplication. |
Can one of the admins verify this patch? |
This PR proposes to close stale PRs. Currently, we have 400+ open PRs, and some of them are stale: either their JIRA tickets have already been closed, or their JIRA tickets do not exist (and they do not appear to be minor issues).
// Open PRs whose JIRA tickets have already been closed
Closes apache#11785 Closes apache#13027 Closes apache#13614 Closes apache#13761 Closes apache#15197 Closes apache#14006 Closes apache#12576 Closes apache#15447 Closes apache#13259 Closes apache#15616 Closes apache#14473 Closes apache#16638 Closes apache#16146 Closes apache#17269 Closes apache#17313 Closes apache#17418 Closes apache#17485 Closes apache#17551 Closes apache#17463 Closes apache#17625
// Open PRs whose JIRA tickets do not exist and which are not minor issues
Closes apache#10739 Closes apache#15193 Closes apache#15344 Closes apache#14804 Closes apache#16993 Closes apache#17040 Closes apache#15180 Closes apache#17238
N/A
Author: Takeshi Yamamuro <yamamuro@apache.org>
Closes apache#17734 from maropu/resolved_pr.
Change-Id: Id2e590aa7283fe5ac01424d30a40df06da6098b5
The problem:
In Spark's KMeans code, the center vectors are always represented as dense vectors. As a result, when the feature space is large, the algorithm quickly runs out of memory. In my example I have a feature space of around 50000 dimensions and k ~= 500. This adds up to around 200MB of RAM for the center vectors alone (50000 x 500 x 8 bytes per double), while in fact the center vectors are very sparse and require a lot less RAM.
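The arithmetic behind that figure can be sketched roughly as follows, in Python for brevity. This is only a back-of-the-envelope estimate: it ignores JVM object overhead, and the sparse case assumes a hypothetical density of 1% (500 non-zero entries per center), which is an illustrative number and not a measurement from the data set described here.

```python
# Rough memory estimate for k cluster centers, ignoring object overhead.
BYTES_PER_DOUBLE = 8  # one stored value
BYTES_PER_INT = 4     # one stored index (sparse layout only)


def dense_bytes(k, dim):
    """Every center stores one double per dimension."""
    return k * dim * BYTES_PER_DOUBLE


def sparse_bytes(k, nnz_per_center):
    """Every center stores an (index, value) pair per non-zero entry."""
    return k * nnz_per_center * (BYTES_PER_INT + BYTES_PER_DOUBLE)


k, dim = 500, 50_000
print(dense_bytes(k, dim) / 1e6, "MB dense")

# ASSUMPTION: 1% of the entries in each center are non-zero.
assumed_nnz = dim // 100
print(sparse_bytes(k, assumed_nnz) / 1e6, "MB sparse")
```

Under that assumed density the sparse layout needs a small fraction of the dense footprint; the real saving depends on how sparse the centers actually stay across iterations.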
Since I am running on a system with relatively low resources I keep getting OutOfMemory errors. In my setting it is OK to trade off runtime for using less RAM. This is what I set out to do in my solution while allowing users the flexibility to choose.
My solution:
Allow the KMeans algorithm to accept a VectorFactory, which decides when the vectors used inside the algorithm should be sparse and when they should be dense. For backward compatibility, the default behavior is to always make them dense (as is the case today). The user can now optionally provide a SmartVectorFactory (or a custom VectorFactory), which can decide to make vectors sparse.
For this I made the following changes:
(1) Added a method called reassign to SparseVector that allows changing its indices and values
(2) Allowed axpy to accept SparseVectors
(3) Created a trait called VectorFactory, with two implementations of it that are used within the KMeans code
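The three changes above could fit together roughly as in the sketch below. It is written in plain Python rather than Scala, does not use any Spark API, and is not the code from this patch. Apart from the names VectorFactory, SmartVectorFactory, reassign and axpy, which come from the description, everything here is a hypothetical stand-in: the DenseVec and SparseVec classes, the DenseVectorFactory name, the from_dense method and the max_density threshold are all assumptions made for illustration.

```python
from abc import ABC, abstractmethod


class DenseVec:
    """Hypothetical stand-in for a dense vector."""
    def __init__(self, values):
        self.values = list(values)
        self.size = len(self.values)


class SparseVec:
    """Hypothetical stand-in for a sparse vector (sorted indices)."""
    def __init__(self, size, indices, values):
        self.size = size
        self.reassign(indices, values)

    def reassign(self, indices, values):
        # Swap in new indices and values without creating a new object.
        self.indices = list(indices)
        self.values = list(values)


class VectorFactory(ABC):
    """Decides which representation a center vector should use."""
    @abstractmethod
    def from_dense(self, values):
        ...


class DenseVectorFactory(VectorFactory):
    """Default behavior: always dense (name is an assumption)."""
    def from_dense(self, values):
        return DenseVec(values)


class SmartVectorFactory(VectorFactory):
    """Goes sparse when few entries are non-zero (threshold is assumed)."""
    def __init__(self, max_density=0.1):
        self.max_density = max_density

    def from_dense(self, values):
        nz = [(i, v) for i, v in enumerate(values) if v != 0.0]
        if len(nz) <= self.max_density * len(values):
            return SparseVec(len(values),
                             [i for i, _ in nz], [v for _, v in nz])
        return DenseVec(values)


def axpy(a, x, y):
    """y += a * x for two SparseVec of equal size, updating y in place."""
    if x.size != y.size:
        raise ValueError("vector sizes differ")
    acc = dict(zip(y.indices, y.values))
    for i, v in zip(x.indices, x.values):
        acc[i] = acc.get(i, 0.0) + a * v
    merged = sorted(acc)
    # The sparsity pattern of y may grow, hence the need for reassign.
    y.reassign(merged, [acc[i] for i in merged])


center = SmartVectorFactory(max_density=0.5).from_dense([0.0, 0.0, 0.0, 1.0])
point = SparseVec(4, [0, 3], [2.0, 1.0])
axpy(0.5, point, center)
print(type(center).__name__, center.indices, center.values)
```

The sketch also suggests why a reassign-style method is useful: adding a sparse point to a sparse center can introduce new non-zero positions, so the center's index and value arrays have to be replaced rather than updated element by element.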