[SPARK-31007][ML] KMeans optimization based on triangle-inequality #27758
Conversation
In the current impl, the following lemma is used in KMeans:

Lemma 0: let x be a point, let c be a center and o be the origin; then d(x, c) >= |d(x, o) - d(c, o)| = |norm(x) - norm(c)|.

```scala
val center = centers(i)
// Since `\|a - b\| \geq |\|a\| - \|b\||`, we can use this lower bound to avoid unnecessary
// distance computation.
var lowerBoundOfSqDist = center.norm - point.norm
lowerBoundOfSqDist = lowerBoundOfSqDist * lowerBoundOfSqDist
if (lowerBoundOfSqDist < bestDistance) {
  ...
}
```

This can only be applied in EuclideanDistance. According to Using the Triangle Inequality to Accelerate K-Means, we can go further: there are another two lemmas that can be used.

Lemma 1: let x be a point, and let b and c be centers; if d(b, c) >= 2 * d(x, b), then d(x, c) >= d(x, b).

This PR is mainly for Lemma 1. Its idea is quite simple: if point x is close enough to center b (closer than a pre-computed radius of b), then we can say point x belongs to center b without computing the distances between x and the other centers. It can be used in both training and prediction. It can be directly applied in EuclideanDistance, but not in CosineDistance; for CosineDistance we can get a variant in the space of radian/angle (see the discussion below).

Lemma 2: let x be a point, and let b and c be centers; then d(x, c) >= max{0, d(x, b) - d(b, c)}. The application of Lemma 2 is a little more complex.
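To make the Lemma 1 short-circuit concrete, here is a minimal self-contained sketch (hypothetical names and shapes, Euclidean distance, at least two centers assumed; not the PR's code):

```scala
object Lemma1Sketch {
  private def dist(a: Array[Double], b: Array[Double]): Double =
    math.sqrt(a.indices.map(i => (a(i) - b(i)) * (a(i) - b(i))).sum)

  // radius(i) = half the distance from center i to its nearest other center.
  // By Lemma 1, d(x, c_i) < radius(i) implies c_i is the closest center to x.
  def computeRadii(centers: Array[Array[Double]]): Array[Double] =
    centers.indices.map { i =>
      centers.indices.filter(_ != i).map(j => dist(centers(i), centers(j))).min / 2
    }.toArray

  // Assignment with the short-circuit: stop scanning as soon as the point
  // falls inside some center's radius.
  def findClosest(
      point: Array[Double],
      centers: Array[Array[Double]],
      radii: Array[Double]): (Int, Double) = {
    var bestIndex = 0
    var bestDistance = Double.PositiveInfinity
    var i = 0
    while (i < centers.length) {
      val d = dist(point, centers(i))
      if (d < radii(i)) return (i, d) // Lemma 1: no other center can be closer
      if (d < bestDistance) { bestDistance = d; bestIndex = i }
      i += 1
    }
    (bestIndex, bestDistance)
  }
}
```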
env: bin/spark-shell --driver-memory=64G

testCode:

```scala
import org.apache.spark.ml.linalg._
import org.apache.spark.ml.clustering._
val df = spark.read.format("libsvm").load("/data1/Datasets/webspam/webspam_wc_normalized_trigram.svm.10k")
df.persist()
val km = new KMeans().setK(16).setMaxIter(5)
val kmm = km.fit(df)
val results = Seq(2,4,8,16).map { k =>
val km = new KMeans().setK(k).setMaxIter(20);
val start = System.currentTimeMillis;
val kmm = km.fit(df);
val end = System.currentTimeMillis;
val train = end - start;
kmm.transform(df).count; // trigger the lazy computation of radii
val start2 = System.currentTimeMillis;
kmm.transform(df).count;
val end2 = System.currentTimeMillis;
val predict = end2 - start2;
val result = (k, train, kmm.summary.numIter, kmm.summary.trainingCost, predict);
println(result);
result
}
val df2 = spark.read.format("libsvm").load("/data1/Datasets/a9a/a9a")
df2.persist()
val km = new KMeans().setK(16).setMaxIter(10)
val kmm = km.fit(df2)
val results2 = Seq(2,4,8,16,32,64,128,256).map { k =>
val km = new KMeans().setK(k).setMaxIter(20);
val start = System.currentTimeMillis;
val kmm = km.fit(df2);
val end = System.currentTimeMillis;
val train = end - start;
kmm.transform(df2).count; // trigger the lazy computation of radii
val start2 = System.currentTimeMillis;
kmm.transform(df2).count;
val end2 = System.currentTimeMillis;
val predict = end2 - start2;
val result = (k, train, kmm.summary.numIter, kmm.summary.trainingCost, predict);
println(result);
result
}
```

1, sparse dataset: webspam, numInstances: 10,000, numFeatures: 8,289,919
2, dense dataset: a9a, numInstances: 32,561, numFeatures: 123
We can see that: (benchmark results table omitted)
Test build #119166 has finished for PR 27758 at commit
Yeah, does this actually speed things up? The benchmarks don't show much win, but I agree it could in theory. It would have to be a pretty large k.
```scala
/**
 * @return Radii of centers. If distance between point x and center c is less than
 *         the radius of center c, then center c is the closest center to point x.
 */
def computeRadii(centers: Array[VectorWithNorm]): Array[Double] = {
```
Is there any value in this default impl? It's overridden in both subclasses. Or is it to support future impls that may not be able to use this optimization?
You are right, I should remove this. Not all distances can be expected to obey such a triangle inequality.
```scala
/**
 * @return the index of the closest center to the given point, as well as the cost.
 */
def findClosest(
    centers: Array[VectorWithNorm],  // was: TraversableOnce[VectorWithNorm]
    radii: Array[Double],
```
Do you need the other implementation, if you have this one? This short-circuits the case where you find a point within a cluster radius. Is the other one kept to support cosine distance?
```scala
if (d < r) {
  bestDistance = d
  bestIndex = i
  found = true
```
Just return here rather than carry a flag around?
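Applied to the snippet above, that suggestion might look roughly like this (a hypothetical standalone shape, not the PR's code):

```scala
// Sketch of the suggested refactor: return directly from the scan instead of
// carrying a `found` flag. `ds` are point-to-center distances, `rs` the radii.
def firstWithinRadius(ds: Array[Double], rs: Array[Double]): Option[Int] = {
  var i = 0
  while (i < ds.length) {
    if (ds(i) < rs(i)) return Some(i) // early return replaces `found = true`
    i += 1
  }
  None
}
```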
```scala
val center = centers(i)
// Since `\|a - b\| \geq |\|a\| - \|b\||`, we can use this lower bound to avoid unnecessary
// distance computation.
var lowerBoundOfSqDist = center.norm - point.norm
```
Nit: I'd just make these two vals for clarity
```scala
if (d < r * r) {
  bestDistance = d
  bestIndex = i
  found = true
```
Same, just return?
```scala
// d = 1 - cos(x)
// r = 1 - cos(x/2) = 1 - sqrt((cos(x) + 1) / 2) = 1 - sqrt(1 - d/2)
distances.map(d => 1 - math.sqrt(1 - d / 2))
```
Cosine distance doesn't obey the triangle inequality; is this meant to be angular distance? For my benefit (and possibly comments) how is r the angular distance here?
Yes, cosine distance doesn't obey the triangle inequality, but the following lemma can still be applied:

Given a point x and centers b and c: if angle(x,b) < angle(b,c)/2, then angle(x,b) < angle(x,c), and hence cos_distance(x,b) = 1 - cos(x,b) < cos_distance(x,c) = 1 - cos(x,c).

That is because of PRINCIPLES FROM GEOMETRY, point 3: "Each side of a spherical triangle is less than the sum of the other two." In spherical geometry, the shortest distance between two points is an arc of a great circle, and the triangle inequality holds provided the distance between two points on a sphere is the length of a minor spherical line segment (that is, one with central angle in [0, π]) with those endpoints.

So:

angle(x,b) + angle(x,c) > angle(b,c), and angle(x,b) < angle(b,c)/2
=> angle(x,c) > angle(b,c)/2 > angle(x,b)
=> cos_distance(x,c) > cos_distance(x,b)

And by the half-angle identity:

angle(x,b) < angle(b,c)/2
<=> cos(x,b) > sqrt{ (cos(b,c) + 1)/2 }
<=> cos_distance(x,b) < 1 - sqrt{ (cos(b,c) + 1)/2 } = 1 - sqrt{ 1 - cos_distance(b,c)/2 }

=> Given two centers b and c, if point x has cos_distance(x,b) < 1 - sqrt{ 1 - cos_distance(b,c)/2 }, then point x belongs to center b.
In short, cos_distance does not obey the triangle inequality, so we can NOT say: if cos_distance(b,x) < cos_distance(b,c)/2, then cos_distance(b,x) < cos_distance(c,x).
However, the arc distance (or angle) does obey "each side of a spherical triangle is less than the sum of the other two", so we can get an angular bound, and then a cos_distance bound: if cos_distance(b,x) < 1 - sqrt{ 1 - cos_distance(b,c)/2 }, then cos_distance(b,x) < cos_distance(c,x).
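A standalone numeric sanity check of this bound (an illustrative sketch only, not PR code; `unitVector` and `cosDistance` are ad-hoc helpers):

```scala
// Check: if cos_distance(x, b) < 1 - sqrt(1 - cos_distance(b, c) / 2),
// then x is closer (in cosine distance) to b than to c.
val rng = new scala.util.Random(0)
def unitVector(n: Int): Array[Double] = {
  val v = Array.fill(n)(rng.nextGaussian())
  val norm = math.sqrt(v.map(e => e * e).sum)
  v.map(_ / norm)
}
def cosDistance(a: Array[Double], b: Array[Double]): Double =
  1.0 - a.indices.map(i => a(i) * b(i)).sum
(1 to 100000).foreach { _ =>
  val b = unitVector(5); val c = unitVector(5); val x = unitVector(5)
  val radius = 1 - math.sqrt(1 - cosDistance(b, c) / 2)
  if (cosDistance(x, b) < radius) {
    assert(cosDistance(x, b) < cosDistance(x, c)) // never fires
  }
}
```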
OK, the last expression comes from the cos(2x) identity.
```scala
val iterationStartTime = System.nanoTime()
...
instr.foreach(_.logNumFeatures(centers.head.vector.size))
...
// Execute iterations of Lloyd's algorithm until converged
while (iteration < maxIterations && !converged) {
  val radiiStartTime = System.nanoTime()
```
You might remove this after testing; it's not super important.
Force-pushed from c423b74 to dd2aff7.
I updated the code, then retested the above test suite; the prediction is a little faster, but not significantly. I also tested with cosine distance; the results are: (results table omitted)
We can see that KMeans impls with cosine distance have (almost) the same convergence.
Test build #119208 has finished for PR 27758 at commit
If I'm reading that right, it's slower to train in many cases in the first instance, and slightly faster at scale in the second. I'm surprised because I'd have thought it's more easily a win, even with the overhead of calculating stats. Is the purpose more about prediction speed? I just wonder how much one is worth vs the other.
@srowen Sorry to reply late. I missed the emails from GitHub.
It also helps save training time, if the dataset is large enough: the cost of computing the stats is about O(k^2 * m), while the cost of computing distances in one iteration is O(k * n * m), where m is the number of features and n is the number of instances. I guess I can compute the stats distributedly in some cases (when k is large). I just marked this PR WIP for two reasons: (reasons omitted)
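As a back-of-the-envelope illustration of that ratio (using the dataset sizes from the benchmark above; the O(·) constants are ignored):

```scala
// Per-iteration cost of the stats (O(k^2 * m)) relative to the distance
// computation (O(k * n * m)) reduces to k / n, independent of m.
val k = 16
val nWebspam = 10000 // webspam numInstances
val nA9a = 32561     // a9a numInstances
println(f"webspam: stats ≈ ${100.0 * k / nWebspam}%.2f%% of one iteration's distance cost")
println(f"a9a:     stats ≈ ${100.0 * k / nA9a}%.2f%% of one iteration's distance cost")
```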
PS it might be worth benchmarking again anyway, as I just merged the change that uses native BLAS for level 1 operations of vectors >= 256 elements.
Force-pushed from dd2aff7 to b4fabb1.
Test build #121275 has finished for PR 27758 at commit
If it's no worse for small k, I think this is fine. I'm slightly concerned about the extra complexity, but it is probably worth the speedup.
```scala
/**
 * Statistics used in triangle inequality to obtain useful bounds to find closest centers.
 *
 * @return A symmetric matrix containing statistics, matrix(i)(j) represents:
```
It might be too hard to implement, but if it's symmetric you only need half this many elements to represent. The indexing becomes more complex though
Good idea, it is symmetric. I will study how other impls like GMM store only the upper triangular part of the matrix.
Maybe it would be helpful to support a symmetric dense matrix in .linalg, since it is used in many places?
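For reference, a minimal sketch of packed upper-triangular storage (row-major, diagonal included; hypothetical code, not what GMM or this PR ends up using):

```scala
// A symmetric k x k matrix stored as its packed upper triangle:
// k * (k + 1) / 2 entries instead of k * k.
val k = 4
val packed = new Array[Double](k * (k + 1) / 2)

// Map (i, j) to a packed offset; indices are mirrored so that
// packedIndex(i, j) == packedIndex(j, i), as for a symmetric stats matrix.
def packedIndex(i: Int, j: Int): Int = {
  val (lo, hi) = if (i <= j) (i, j) else (j, i)
  lo * k - lo * (lo - 1) / 2 + (hi - lo) // start of row `lo` plus column offset
}
```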
```scala
while (j < k) {
  val d = distance(centers(i), centers(j))
  val s = computeStatistics(d)
  stats(i)(j) = s
```
If you're micro-optimizing, I suppose you can lift stats(i) out of the loop, but it may not do anything.
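Applied to the loop above, the micro-optimization would look roughly like this (a hypothetical standalone shape; `computeStatistics` stands in for the PR's per-pair statistic):

```scala
// Hoist the outer-row lookup out of the inner loop: one array dereference
// instead of k of them per row.
def fillRow(
    stats: Array[Array[Double]],
    dists: Array[Array[Double]],
    computeStatistics: Double => Double,
    i: Int,
    k: Int): Unit = {
  val row = stats(i) // lifted out of the loop
  var j = 0
  while (j < k) {
    row(j) = computeStatistics(dists(i)(j))
    j += 1
  }
}
```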
```scala
/**
 * @return the index of the closest center to the given point, as well as the cost.
 */
def findClosest(
```
Is there any clean way to avoid duplicating most of this code? Maybe not. It looks almost identical to the above, though.
Test build #121400 has finished for PR 27758 at commit
Test build #121401 has finished for PR 27758 at commit
compute stats distributedly
nit
package upper tri
need collect
Force-pushed from 0afc845 to b050973.
Test build #121517 has finished for PR 27758 at commit
Test build #121518 has finished for PR 27758 at commit
retest this please
Test build #121524 has finished for PR 27758 at commit
Looking pretty fine if the speedup still holds and it doesn't change behavior
When this condition holds, it can skip computing the distances to the other centers. If not, does it fall back to the normal K-means path? Does it hurt by introducing more overhead from the pre-calculation when k is large, since you need to pre-calculate whenever the centers are updated in each iteration, and the condition always needs to hold to get the benefit?
@xwu99 Thanks for reviewing!
Yes, this PR will use two short-circuits: (list omitted)
I have optimized it by making the pre-calculation distributed (when k is large). I think the triangle-inequality based optimization is complementary with the BLAS optimization: when we do not enable Level-3 BLAS, the triangle-inequality approach will still work.
Test build #121571 has finished for PR 27758 at commit
If there are no more comments, LGTM
Merged to master
Thanks for reviewing!
What changes were proposed in this pull request?
Apply Lemma 1 in Using the Triangle Inequality to Accelerate K-Means: let x be a point, and let b and c be centers; if d(b, c) >= 2 * d(x, b), then d(x, c) >= d(x, b).
It can be directly applied in EuclideanDistance, but not in CosineDistance.
However, for CosineDistance we can luckily get a variant in the space of radian/angle.
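Concretely, per the derivation in the review thread, the CosineDistance radius of a center b would be (assuming, as in the Euclidean case, a minimum over the other centers c):

r(b) = min over c != b of ( 1 - sqrt{ 1 - cos_distance(b, c) / 2 } )

and any point x with cos_distance(x, b) < r(b) is guaranteed to have b as its closest center.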
Why are the changes needed?
It helps improve the performance of prediction and training (mostly).
Does this PR introduce any user-facing change?
No
How was this patch tested?
Existing testsuites.