HilbertClustering

japila-books · Feb 4, 2024 · 3956e1d
1 parent 67f1edc
commit 3956e1d

Showing 9 changed files with 150 additions and 24 deletions.
diff --git a/docs/commands/optimize/HilbertByteArrayIndex.md b/docs/commands/optimize/HilbertByteArrayIndex.md
@@ -0,0 +1,3 @@
+# HilbertByteArrayIndex
+
+`HilbertByteArrayIndex` is...FIXME
diff --git a/docs/commands/optimize/HilbertClustering.md b/docs/commands/optimize/HilbertClustering.md
@@ -0,0 +1,57 @@
+# HilbertClustering
+
+`HilbertClustering` is a [SpaceFillingCurveClustering](SpaceFillingCurveClustering.md) for [multi-dimensional clustering](MultiDimClustering.md#cluster-utility) with [hilbert](OptimizeExecutor.md#hilbert) curve.
+
+`HilbertClustering` requires between 2 and [9 columns](MultiDimClusteringFunctions.md#hilbert_index) to cluster by.
+
+??? note "Singleton Object"
+    `HilbertClustering` is a Scala **object** which is a class that has exactly one instance. It is created lazily when it is referenced, like a `lazy val`.
+
+    Learn more in [Tour of Scala](https://docs.scala-lang.org/tour/singleton-objects.html).
+
+## Clustering Expression { #getClusteringExpression }
+
+??? note "SpaceFillingCurveClustering"
+
+    ```scala
+    getClusteringExpression(
+      cols: Seq[Column],
+      numRanges: Int): Column
+    ```
+
+    `getClusteringExpression` is part of the [SpaceFillingCurveClustering](SpaceFillingCurveClustering.md#getClusteringExpression) abstraction.
+
+`getClusteringExpression` creates a `rangeIdCols` as [range_partition_id](MultiDimClusteringFunctions.md#range_partition_id) for the given `cols` columns and the `numRanges` number of partitions (_buckets_).
+
+In the end, `getClusteringExpression` uses [hilbert_index](MultiDimClusteringFunctions.md#hilbert_index) with the following:
+
+* The number of bits, computed as `Integer.numberOfTrailingZeros(Integer.highestOneBit(numRanges)) + 1`, i.e. one more than the zero-based position of the highest-order ("leftmost") one-bit in the `numRanges` value
+
+    ??? note "Number of Bits Example"
+        Given `numRanges` is `5`, the zero-based position (counted from the right) of the highest-order ("leftmost") one-bit is `2`.
+
+        ```scala
+        val numRanges = 5
+        scala> println(s"$numRanges in the two's complement binary representation is ${Integer.toBinaryString(numRanges)}")
+        5 in the two's complement binary representation is 101
+        ```
+
+        Counting positions from right to left, starting from `0`, gives `2` as the position of the highest-order ("leftmost") one-bit.
+
+        ```scala
+        scala> print(s"For ${numRanges}, the int value with at most a single one-bit is ${Integer.highestOneBit(numRanges)}")
+        For 5, the int value with at most a single one-bit is 4
+        ```
+
+        The int value with a single one-bit at position `2` (the position of the highest-order one-bit of `5`) is `4` (`2^2`).
+
+        The number of zero bits following the lowest-order ("rightmost") one-bit in the two's complement binary representation of the int value (`4`) is `2`.
+
+        ```scala
+        scala> println(s"For ${Integer.highestOneBit(numRanges)}, the number of zero bits is ${Integer.numberOfTrailingZeros(Integer.highestOneBit(numRanges))}")
+        For 4, the number of zero bits is 2
+        ```
+
+        In the end, `getClusteringExpression` uses `3` (`2 + 1`) as the number of bits.
+
+* The [range_partition_id](MultiDimClusteringFunctions.md#range_partition_id) columns
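
The number-of-bits computation described above can be tried out for a few inputs with a minimal sketch. The `NumBitsSketch` object and the `numBits` helper are hypothetical names used for illustration only (they are not part of Delta Lake), and the sketch assumes a positive `numRanges`.

```scala
object NumBitsSketch {
  // Hypothetical helper (not a Delta Lake API): one more than the number of
  // trailing zeros of the highest one-bit of numRanges.
  // Assumes numRanges > 0.
  def numBits(numRanges: Int): Int =
    Integer.numberOfTrailingZeros(Integer.highestOneBit(numRanges)) + 1

  def main(args: Array[String]): Unit = {
    // Print the number of bits for a few sample values of numRanges
    Seq(1, 2, 5, 8, 1000).foreach { n =>
      println(s"numRanges=$n (binary ${Integer.toBinaryString(n)}) -> numBits=${numBits(n)}")
    }
  }
}
```

For `numRanges` of `5` this gives `3`, matching the example in the note above.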
diff --git a/docs/commands/optimize/HilbertLongIndex.md b/docs/commands/optimize/HilbertLongIndex.md
@@ -0,0 +1,3 @@
+# HilbertLongIndex
+
+`HilbertLongIndex` is...FIXME
diff --git a/docs/commands/optimize/MultiDimClustering.md b/docs/commands/optimize/MultiDimClustering.md
@@ -41,18 +41,25 @@ cluster(
   curve: String): DataFrame
 ```
 
-`cluster` asserts that the given `colNames` contains at least one column name. Otherwise, `cluster` reports an `AssertionError`:
+??? note "`curve` Argument and Supported Values: `zorder` or `hilbert`"
+    `curve` is based on [OptimizeExecutor](OptimizeExecutor.md#curve) (and can only be one of two values, `zorder` or `hilbert`).
 
-```text
-assertion failed : Cannot cluster by zero columns!
-```
+`cluster` asserts that the given `colNames` contains at least one column name.
+
+??? note "AssertionError"
+
+    `cluster` reports an `AssertionError` when the given `colNames` is empty.
+
+    ```text
+    assertion failed : Cannot cluster by zero columns!
+    ```
 
 `cluster` selects the multi-dimensional clustering algorithm based on the given `curve` name.
 
 Curve Type | Clustering Algorithm
 -----------|---------------------
- `hilbert` | `HilbertClustering`
- `zorder`  | `ZOrderClustering`
+ `hilbert` | [HilbertClustering](HilbertClustering.md)
+ `zorder`  | [ZOrderClustering](ZOrderClustering.md)
 
 ??? note "SparkException"
     `cluster` accepts these two algorithms only or throws a `SparkException`:

diff --git a/docs/commands/optimize/MultiDimClusteringFunctions.md b/docs/commands/optimize/MultiDimClusteringFunctions.md
@@ -2,7 +2,7 @@
 
 `MultiDimClusteringFunctions` utility offers Spark SQL functions for multi-dimensional clustering.
 
-## <span id="range_partition_id"> range_partition_id
+## range_partition_id { #range_partition_id }
 
 ```scala
 range_partition_id(
@@ -12,11 +12,13 @@ range_partition_id(
 
 `range_partition_id` creates a `Column` ([Spark SQL]({{ book.spark_sql }}/Column)) with [RangePartitionId](RangePartitionId.md) unary expression (for the given arguments).
 
+---
+
 `range_partition_id` is used when:
 
 * `ZOrderClustering` utility is used for the [clustering expression](ZOrderClustering.md#getClusteringExpression)
 
-## <span id="interleave_bits"> interleave_bits
+## interleave_bits { #interleave_bits }
 
 ```scala
 interleave_bits(
@@ -25,6 +27,36 @@ interleave_bits(
 
 `interleave_bits` creates a `Column` ([Spark SQL]({{ book.spark_sql }}/Column)) with [InterleaveBits](InterleaveBits.md) expression (for the expressions of the given columns).
 
+---
+
 `interleave_bits` is used when:
 
 * `ZOrderClustering` utility is used for the [clustering expression](ZOrderClustering.md#getClusteringExpression)
+
+## hilbert_index { #hilbert_index }
+
+```scala
+hilbert_index(
+  numBits: Int,
+  cols: Column*): Column
+```
+
+`hilbert_index` creates a `Column` ([Spark SQL]({{ book.spark_sql }}/Column)) to execute one of the following `Expression`s ([Spark SQL]({{ book.spark_sql }}/expressions/Expression)) based on the _hilbertBits_:
+
+* [HilbertLongIndex](HilbertLongIndex.md) for up to 64 hilbert bits
+* [HilbertByteArrayIndex](HilbertByteArrayIndex.md), otherwise
+
+The _hilbertBits_ is the number of columns (`cols`) multiplied by the number of bits (`numBits`).
+
+??? note "SparkException: Hilbert indexing can only be used on 9 or fewer columns"
+    `hilbert_index` throws a `SparkException` for 10 or more columns (`cols`).
+
+    ```text
+    Hilbert indexing can only be used on 9 or fewer columns.
+    ```
+
+---
+
+`hilbert_index` is used when:
+
+* `HilbertClustering` is requested for the [clustering expression](HilbertClustering.md#getClusteringExpression)
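
The choice between the two expressions described for `hilbert_index` above can be modelled with a small, Spark-free sketch. Everything in it is an assumption made for illustration: `HilbertIndexChoiceSketch`, `IndexKind`, `LongIndex`, `ByteArrayIndex` and `choose` are hypothetical names, and `require` (which throws an `IllegalArgumentException`) stands in for the `SparkException` that the real function is documented to throw.

```scala
object HilbertIndexChoiceSketch {
  // Hypothetical labels standing in for the two Catalyst expressions
  sealed trait IndexKind
  case object LongIndex extends IndexKind      // stands in for HilbertLongIndex
  case object ByteArrayIndex extends IndexKind // stands in for HilbertByteArrayIndex

  // Hypothetical helper modelling the documented decision only
  def choose(numBits: Int, numCols: Int): IndexKind = {
    // Stand-in for the SparkException described above
    require(numCols <= 9, "Hilbert indexing can only be used on 9 or fewer columns.")
    // hilbertBits = number of columns * number of bits
    val hilbertBits = numCols * numBits
    if (hilbertBits <= 64) LongIndex else ByteArrayIndex
  }

  def main(args: Array[String]): Unit = {
    println(choose(numBits = 10, numCols = 3)) // 30 hilbert bits
    println(choose(numBits = 10, numCols = 9)) // 90 hilbert bits
  }
}
```

With 10 bits per column, 3 columns stay within the 64-bit limit while 9 columns do not.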
diff --git a/docs/commands/optimize/OptimizeExecutor.md b/docs/commands/optimize/OptimizeExecutor.md
@@ -11,13 +11,30 @@
 * <span id="sparkSession"> `SparkSession` ([Spark SQL]({{ book.spark_sql }}/SparkSession))
 * <span id="deltaLog"> [DeltaLog](../../DeltaLog.md) (of the Delta table to be optimized)
 * <span id="partitionPredicate"> Partition predicate expressions ([Spark SQL]({{ book.spark_sql }}/expressions/Expression))
-* <span id="zOrderByColumns"> Z-OrderBy Columns (Names)
+* <span id="zOrderByColumns"> Z-OrderBy Column Names
 
 `OptimizeExecutor` is created when:
 
 * `OptimizeTableCommand` is requested to [run](OptimizeTableCommand.md#run)
 
-## optimize
+## <span id="hilbert"><span id="zorder"> Curve { #curve }
+
+```scala
+curve: String
+```
+
+`curve` can be one of the two supported values:
+
+* `zorder` for one or more [zOrderByColumns](#zOrderByColumns)
+* `hilbert` when there are no [zOrderByColumns](#zOrderByColumns) and the [clustered tables](#isClusteredTable) feature is enabled
+
+---
+
+`curve` is used when:
+
+* `OptimizeExecutor` is requested to [runOptimizeBinJob](#runOptimizeBinJob)
+
+## Performing Optimization { #optimize }
 
 ```scala
 optimize(): Seq[Row]

diff --git a/docs/commands/optimize/SpaceFillingCurveClustering.md b/docs/commands/optimize/SpaceFillingCurveClustering.md
@@ -4,20 +4,26 @@
 
 ## Contract
 
-### getClusteringExpression { #getClusteringExpression }
+### Clustering Expression { #getClusteringExpression }
 
 ```scala
 getClusteringExpression(
   cols: Seq[Column],
   numRanges: Int): Column
 ```
 
+See:
+
+* [HilbertClustering](HilbertClustering.md#getClusteringExpression)
+* [ZOrderClustering](ZOrderClustering.md#getClusteringExpression)
+
 Used when:
 
 * `SpaceFillingCurveClustering` is requested to execute [multi-dimensional clustering](#cluster)
 
 ## Implementations
 
+* [HilbertClustering](HilbertClustering.md)
 * [ZOrderClustering](ZOrderClustering.md)
 
 ## Multi-Dimensional Clustering { #cluster }

diff --git a/docs/commands/optimize/ZOrderClustering.md b/docs/commands/optimize/ZOrderClustering.md
@@ -1,24 +1,24 @@
 # ZOrderClustering
 
-`ZOrderClustering` is a [SpaceFillingCurveClustering](SpaceFillingCurveClustering.md) for [MultiDimClustering.cluster](MultiDimClustering.md#cluster-utility) utility.
+`ZOrderClustering` is a [SpaceFillingCurveClustering](SpaceFillingCurveClustering.md) for [multi-dimensional clustering](MultiDimClustering.md#cluster-utility) with [zorder](OptimizeExecutor.md#zorder) curve.
 
-## <span id="getClusteringExpression"> getClusteringExpression
+## Clustering Expression { #getClusteringExpression }
 
-```scala
-getClusteringExpression(
-  cols: Seq[Column],
-  numRanges: Int): Column
-```
+??? note "SpaceFillingCurveClustering"
 
-`getClusteringExpression` is part of the [SpaceFillingCurveClustering](SpaceFillingCurveClustering.md#getClusteringExpression) abstraction.
+    ```scala
+    getClusteringExpression(
+      cols: Seq[Column],
+      numRanges: Int): Column
+    ```
 
----
+    `getClusteringExpression` is part of the [SpaceFillingCurveClustering](SpaceFillingCurveClustering.md#getClusteringExpression) abstraction.
 
 `getClusteringExpression` creates a [range_partition_id](MultiDimClusteringFunctions.md#range_partition_id) function (with the given `numRanges` for the number of partitions) for every `Column` (in the given `cols`).
 
 In the end, `getClusteringExpression` uses [interleave_bits](MultiDimClusteringFunctions.md#interleave_bits) with the `range_partition_id` columns and casts the (evaluation) result to `StringType`.
 
-### <span id="getClusteringExpression-demo"> Demo
+### Demo { #getClusteringExpression-demo }
 
 For some reason, [getClusteringExpression](#getClusteringExpression) is `protected[skipping]` so let's hop over the fence with the following hack.
 

diff --git a/docs/liquid-clustering/index.md b/docs/liquid-clustering/index.md
@@ -6,19 +6,19 @@ subtitle: Clustered Tables
 
 # Liquid Clustering
 
-**Liquid Clustering** is an optimization technique in Delta Lake that...FIXME
+**Liquid Clustering** is an optimization technique in Delta Lake that uses [OPTIMIZE](../commands/optimize/index.md) with [Hilbert clustering](../commands/optimize/HilbertClustering.md).
 
 !!! info "Not Recommended for Production Use"
     1. A clustered table is currently in preview and is disabled by default.
     1. A clustered table is not recommended for production use (e.g., unsupported incremental clustering).
 
-Liquid Clustering can be enabled using [spark.databricks.delta.clusteredTable.enableClusteringTablePreview](../configuration-properties/index.md#spark.databricks.delta.clusteredTable.enableClusteringTablePreview) configuration property.
+Liquid Clustering can be enabled system-wide using [spark.databricks.delta.clusteredTable.enableClusteringTablePreview](../configuration-properties/index.md#spark.databricks.delta.clusteredTable.enableClusteringTablePreview) configuration property.
 
 ```sql
 SET spark.databricks.delta.clusteredTable.enableClusteringTablePreview=true
 ```
 
-Liquid Clustering can be applied to delta tables that were created with `CLUSTER BY` clause.
+Liquid Clustering can only be applied to delta tables created with `CLUSTER BY` clause.
 
 ```sql
 CREATE TABLE IF NOT EXISTS delta_table
@@ -54,3 +54,4 @@ DESC EXTENDED delta_table
 
 1. Liquid Clustering cannot be used with partitioning (`PARTITIONED BY`)
 1. Liquid Clustering cannot be used with bucketing (`CLUSTERED BY INTO BUCKETS`)
+1. Liquid Clustering can be used with between 2 and [9 columns](../commands/optimize/MultiDimClusteringFunctions.md#hilbert_index) to `CLUSTER BY`.