
[Spark] Implement Hilbert clustering #2314

Closed
weiluo-db wants to merge 5 commits

Conversation

weiluo-db (Contributor)

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Description

This PR is part of #1874.

This PR implements a new data clustering algorithm based on the Hilbert curve. No code uses this new implementation yet; incremental clustering using ZCubes will be implemented in follow-up PRs.

Design can be found at: https://docs.google.com/document/d/1FWR3odjOw4v4-hjFy_hVaNdxHVs4WuK1asfB6M6XEMw/edit#heading=h.uubbjjd24plb.
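As a language-agnostic illustration of what a Hilbert mapping computes (a sketch in Python, not code from this PR; `xy2d` and its parameter names are my own), the classic two-dimensional point-to-index function looks like this:

```python
def xy2d(side: int, x: int, y: int) -> int:
    """Map a point (x, y) on a side*side grid (side a power of two)
    to its distance along the Hilbert curve."""
    d = 0
    s = side // 2
    while s > 0:
        rx = 1 if (x & s) > 0 else 0
        ry = 1 if (y & s) > 0 else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/reflect the quadrant so the sub-curve lines up.
        if ry == 0:
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        s //= 2
    return d
```

Consecutive curve positions always land on adjacent cells; that locality property is what makes Hilbert ordering attractive for clustering files by multiple columns.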

How was this patch tested?

Unit tests.

Does this PR introduce any user-facing changes?

No.

@imback82 (Contributor) left a comment:

few nits, but overall lgtm

@weiluo-db weiluo-db requested a review from imback82 November 21, 2023 22:25
@imback82 (Contributor) left a comment:

LGTM

@imback82 (Contributor)

cc @bart-samwel FYI

/**
* Transforms the provided integer columns into their corresponding position in the hilbert
* curve for the given dimension.
* @see http://www.dcs.bbk.ac.uk/~jkl/thesis.pdf
@weiluo-db (Contributor, Author) replied:
Thanks! Fixed.


/**
* Represents a hilbert index built from the provided columns.
* The columns are expected to all be Ints and to have at most numBits.

at most numBits individually, or collectively?

@weiluo-db (Contributor, Author) replied on Dec 11, 2023:

Individually. The Scaladoc from the caller side mentions this as well:

* @param numBits The number of bits to consider in each column.


/**
* The following code is based on this paper:
* http://www.dcs.bbk.ac.uk/~jkl/thesis.pdf


/**
* This will construct an x2-gray-codes sequence of order n as described in
* http://www.dcs.bbk.ac.uk/~jkl/thesis.pdf

link rot

    return HilbertIndex9.STATE_LIST;
  default:
    throw new SparkException(
      "Cannot perform hilbert clustering on more than 9 dimensions");

This will catch <2 as well, maybe update this error, or add another specific one?

@weiluo-db (Contributor, Author) replied:

Thanks! Updated the error. FWIW, the only caller, HilbertClustering, always ensures that n is greater than 1:

assert(cols.size > 1, "Cannot do Hilbert clustering by zero or one column!")

* The columns are expected to all be Ints and to have at most numBits.
* The points along the hilbert curve are represented by Longs.
*/
private[sql] case class HilbertLongIndex(numBits: Int, children: Seq[Expression])

Is there a maximum practical value of numBits and/or a maximum practical number of children given we're targeting a 64-bit space of output values?

@weiluo-db (Contributor, Author) replied:

Number of children is the number of clustering columns, and it's usually a good idea to keep it <=4 in practice, as you would for zorder-by columns. numBits is a tradeoff between better granularity and computation cost, as explained by the config's doc:

val MDC_NUM_RANGE_IDS =
  SQLConf.buildConf("spark.databricks.io.skipping.mdc.rangeId.max")
    .internal()
    .doc("This controls the domain of rangeId values to be interleaved. The bigger, the better " +
      "granularity, but at the expense of performance (more data gets sampled).")
    .intConf
    .checkValue(_ > 1, "'spark.databricks.io.skipping.mdc.rangeId.max' must be greater than 1")
    .createWithDefault(1000)
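To put rough numbers on that tradeoff (a back-of-the-envelope sketch of my own, not code from the PR): the config bounds the domain of each column's rangeId, so with the default of 1000 a rangeId fits in ceil(log2(1000)) = 10 bits, and four clustering columns would consume 40 of the 64 bits available to a Long-valued index:

```python
import math

def index_bits(num_columns: int, range_id_max: int = 1000) -> int:
    """Bits consumed by interleaving `num_columns` rangeIds,
    each drawn from [0, range_id_max). Illustrative helper, not PR code."""
    bits_per_column = math.ceil(math.log2(range_id_max))
    return num_columns * bits_per_column

# Default rangeId.max of 1000 -> 10 bits per column,
# so 4 columns use 40 bits and even 6 columns fit in a 64-bit Long.
```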

val partCount = new File(tempDir, "source").listFiles(new FilenameFilter {
  override def accept(dir: File, name: String): Boolean = {
    name.startsWith("part-0000")

Seq("zorder", "hilbert").foreach{ curve =>

nit: space before { seems like the house style here :)

@Orpheuz commented Nov 30, 2023:

Hey @weiluo-db, I don't know if this is the right place to ask, but I was curious whether there was any particular reason to prefer the state-tables approach over John Skilling's iterative approach, described in https://pubs.aip.org/aip/acp/article-abstract/707/1/381/719611/Programming-the-Hilbert-curve?redirectedFrom=PDF

@weiluo-db (Contributor, Author) replied:

@Orpheuz Thanks a lot for the pointer! It does appear to be a promising alternative. That said, there wasn't any particular reason why we implemented the Hilbert curve using state tables per se; the iterative approach you pointed out simply wasn't on our radar. Evaluating and possibly adopting it could certainly be a good follow-up.
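For readers curious about the comparison: Skilling's method computes the index iteratively with O(bits × dims) bit operations and no precomputed tables. Below is a Python transcription of the AxesToTranspose transform from the linked paper, plus a bit-interleaving helper; the function names and the `transpose_to_index` helper are my own, so treat this as an illustrative sketch rather than anything from this PR:

```python
def axes_to_transpose(point, bits):
    """Skilling's transform: point coordinates -> Hilbert index in
    'transposed' form (one bits-wide word per dimension)."""
    x = list(point)
    n = len(x)
    m = 1 << (bits - 1)
    # Inverse undo excess work
    q = m
    while q > 1:
        p = q - 1
        for i in range(n):
            if x[i] & q:
                x[0] ^= p          # invert low bits of x[0]
            else:
                t = (x[0] ^ x[i]) & p
                x[0] ^= t          # exchange low bits of x[0] and x[i]
                x[i] ^= t
        q >>= 1
    # Gray encode
    for i in range(1, n):
        x[i] ^= x[i - 1]
    t = 0
    q = m
    while q > 1:
        if x[n - 1] & q:
            t ^= q - 1
        q >>= 1
    for i in range(n):
        x[i] ^= t
    return x

def transpose_to_index(x, bits):
    """Interleave the transposed words, most significant bit first."""
    h = 0
    for j in range(bits - 1, -1, -1):
        for i in range(len(x)):
            h = (h << 1) | ((x[i] >> j) & 1)
    return h

def hilbert_index(point, bits):
    return transpose_to_index(axes_to_transpose(point, bits), bits)
```

Unlike the state-table approach, nothing here is precomputed per dimension count, which is part of why it looks attractive as a follow-up.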

result
}

private[this] case class Row(y: Int, x1: Int, x2: (Int, Int), dy: Int, m: HilbertMatrix)

This comment was marked as resolved.

@weiluo-db (Contributor, Author) replied on Dec 12, 2023:

Row is used only by getStateGenerator(n: Int):

def getStateGenerator(n: Int): GeneratorTable = {
  val x2s = getX2GrayCodes(n)
  val len = 1 << n
  val rows = (0 until len).map { i =>
    val x21 = x2s(i << 1)
    val x22 = x2s((i << 1) + 1)
    val dy = x21 ^ x22
    Row(
      y = i,
      x1 = i ^ (i >>> 1),
      x2 = (x21, x22),
      dy = dy,
      m = HilbertMatrix(n, x21, getSetColumn(n, dy))
    )
  }
  new GeneratorTable(n, rows)
}

Because n can only be between 2 and 9, there can be fewer than 2^10 such Rows in memory in total, which should be negligible in practice.

OTOH, it just came to me that x2 and dy are not really needed for later calculation, so I just decided to remove them from Row altogether.
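The `x1 = i ^ (i >>> 1)` in the snippet above is the standard binary-reflected Gray code. A quick sketch of the property the generator relies on, in Python for illustration (not code from this PR): consecutive codes differ in exactly one bit, and the sequence wraps around cyclically.

```python
def gray(i: int) -> int:
    # Binary-reflected Gray code; the Scala equivalent is `i ^ (i >>> 1)`.
    return i ^ (i >> 1)

# Order-4 sequence: a permutation of 0..15 in which neighboring
# codes (including last -> first) differ in exactly one bit.
codes = [gray(i) for i in range(16)]
```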

@weiluo-db weiluo-db requested a review from acruise December 12, 2023 16:47
@@ -0,0 +1,405 @@
/*
* Copyright (2021) The Delta Lake Project Authors.

nit: copyright years may need updating

@weiluo-db (Contributor, Author) replied:

It looks like every file under spark/ always uses 2021.

@weiluo-db weiluo-db requested a review from acruise December 13, 2023 03:50
@imback82 (Contributor) left a comment:

LGTM!

@acruise left a comment:

Thanks for indulging my questions! :)

andreaschat-db pushed a commit to andreaschat-db/delta that referenced this pull request Jan 5, 2024
Closes delta-io#2314

GitOrigin-RevId: abafaa717ba8f7d8809114858c0fd2a25861fcb8