diff --git a/docs/commands/optimize/HilbertByteArrayIndex.md b/docs/commands/optimize/HilbertByteArrayIndex.md new file mode 100644 index 000000000..648fe663c --- /dev/null +++ b/docs/commands/optimize/HilbertByteArrayIndex.md @@ -0,0 +1,3 @@ +# HilbertByteArrayIndex + +`HilbertByteArrayIndex` is...FIXME diff --git a/docs/commands/optimize/HilbertClustering.md b/docs/commands/optimize/HilbertClustering.md new file mode 100644 index 000000000..b57f3c0cf --- /dev/null +++ b/docs/commands/optimize/HilbertClustering.md @@ -0,0 +1,57 @@ +# HilbertClustering + +`HilbertClustering` is a [SpaceFillingCurveClustering](SpaceFillingCurveClustering.md) for [multi-dimensional clustering](MultiDimClustering.md#cluster-utility) with the [hilbert](OptimizeExecutor.md#hilbert) curve. + +`HilbertClustering` requires [between 2 and 9 columns](MultiDimClusteringFunctions.md#hilbert_index) to cluster by. + +??? note "Singleton Object" + `HilbertClustering` is a Scala **object**, which is a class that has exactly one instance. It is created lazily when it is referenced, like a `lazy val`. + + Learn more in [Tour of Scala](https://docs.scala-lang.org/tour/singleton-objects.html). + +## Clustering Expression { #getClusteringExpression } + +??? note "SpaceFillingCurveClustering" + + ```scala + getClusteringExpression( + cols: Seq[Column], + numRanges: Int): Column + ``` + + `getClusteringExpression` is part of the [SpaceFillingCurveClustering](SpaceFillingCurveClustering.md#getClusteringExpression) abstraction. + +`getClusteringExpression` creates `rangeIdCols` by applying [range_partition_id](MultiDimClusteringFunctions.md#range_partition_id) to every one of the given `cols` columns, with `numRanges` as the number of partitions (_buckets_).
+ +In the end, `getClusteringExpression` uses [hilbert_index](MultiDimClusteringFunctions.md#hilbert_index) with the following: + +* The number of bits, which is one more than the position of the highest-order ("leftmost") one-bit in the `numRanges` value (i.e., `Integer.numberOfTrailingZeros(Integer.highestOneBit(numRanges)) + 1`) + + ??? note "Number of Bits Example" + Given `numRanges` is `5`, the position of the highest-order ("leftmost") one-bit is `2`. + + ```scala + val numRanges = 5 + scala> println(s"$numRanges in the two's complement binary representation is ${Integer.toBinaryString(numRanges)}") + 5 in the two's complement binary representation is 101 + ``` + + Counting positions from right to left, starting from `0`, gives `2` as the position of the highest-order ("leftmost") one-bit. + + ```scala + scala> print(s"For ${numRanges}, the int value with at most a single one-bit is ${Integer.highestOneBit(numRanges)}") + For 5, the int value with at most a single one-bit is 4 + ``` + + With the highest-order ("leftmost") one-bit at position `2`, the int value with at most a single one-bit is `4` (`2^2`). + + The number of zero bits following the lowest-order ("rightmost") one-bit in the two's complement binary representation of this int value (`4`) is `2`. + + ```scala + scala> println(s"For ${Integer.highestOneBit(numRanges)}, the number of zero bits is ${Integer.numberOfTrailingZeros(Integer.highestOneBit(numRanges))}") + For 4, the number of zero bits is 2 + ``` + + In the end, `getClusteringExpression` uses `3` (`2 + 1`) as the number of bits.
+ +* The [range_partition_id](MultiDimClusteringFunctions.md#range_partition_id) columns diff --git a/docs/commands/optimize/HilbertLongIndex.md b/docs/commands/optimize/HilbertLongIndex.md new file mode 100644 index 000000000..93c778ee6 --- /dev/null +++ b/docs/commands/optimize/HilbertLongIndex.md @@ -0,0 +1,3 @@ +# HilbertLongIndex + +`HilbertLongIndex` is...FIXME diff --git a/docs/commands/optimize/MultiDimClustering.md b/docs/commands/optimize/MultiDimClustering.md index 42b13898a..8035c5b61 100644 --- a/docs/commands/optimize/MultiDimClustering.md +++ b/docs/commands/optimize/MultiDimClustering.md @@ -41,18 +41,25 @@ cluster( curve: String): DataFrame ``` -`cluster` asserts that the given `colNames` contains at least one column name. Otherwise, `cluster` reports an `AssertionError`: +??? note "`curve` Argument and Supported Values: `zorder` or `hilbert`" + `curve` is based on [OptimizeExecutor](OptimizeExecutor.md#curve) (and can only be one of two values, `zorder` or `hilbert`). -```text -assertion failed : Cannot cluster by zero columns! -``` +`cluster` asserts that the given `colNames` contains at least one column name. + +??? note "AssertionError" + + `cluster` reports an `AssertionError` when the given `colNames` is empty. + + ```text + assertion failed : Cannot cluster by zero columns! + ``` `cluster` selects the multi-dimensional clustering algorithm based on the given `curve` name. Curve Type | Clustering Algorithm -----------|--------------------- - `hilbert` | `HilbertClustering` - `zorder` | `ZOrderClustering` + `hilbert` | [HilbertClustering](HilbertClustering.md) + `zorder` | [ZOrderClustering](ZOrderClustering.md) ???
note "SparkException" `cluster` accepts these two algorithms only or throws a `SparkException`: diff --git a/docs/commands/optimize/MultiDimClusteringFunctions.md b/docs/commands/optimize/MultiDimClusteringFunctions.md index 544c7d832..581ea2836 100644 --- a/docs/commands/optimize/MultiDimClusteringFunctions.md +++ b/docs/commands/optimize/MultiDimClusteringFunctions.md @@ -2,7 +2,7 @@ `MultiDimClusteringFunctions` utility offers Spark SQL functions for multi-dimensional clustering. -## range_partition_id +## range_partition_id { #range_partition_id } ```scala range_partition_id( @@ -12,11 +12,13 @@ range_partition_id( `range_partition_id` creates a `Column` ([Spark SQL]({{ book.spark_sql }}/Column)) with [RangePartitionId](RangePartitionId.md) unary expression (for the given arguments). +--- + `range_partition_id` is used when: * `ZOrderClustering` utility is used for the [clustering expression](ZOrderClustering.md#getClusteringExpression) -## interleave_bits +## interleave_bits { #interleave_bits } ```scala interleave_bits( @@ -25,6 +27,36 @@ interleave_bits( `interleave_bits` creates a `Column` ([Spark SQL]({{ book.spark_sql }}/Column)) with [InterleaveBits](InterleaveBits.md) expression (for the expressions of the given columns). +--- + `interleave_bits` is used when: * `ZOrderClustering` utility is used for the [clustering expression](ZOrderClustering.md#getClusteringExpression) + +## hilbert_index { #hilbert_index } + +```scala +hilbert_index( + numBits: Int, + cols: Column*): Column +``` + +`hilbert_index` creates a `Column` ([Spark SQL]({{ book.spark_sql }}/Column)) to execute one of the following `Expression`s ([Spark SQL]({{ book.spark_sql }}/expressions/Expression)) based on the _hilbertBits_: + +* [HilbertLongIndex](HilbertLongIndex.md) for up to 64 hilbert bits +* [HilbertByteArrayIndex](HilbertByteArrayIndex.md), otherwise + +The _hilbertBits_ is the number of columns (`cols`) multiplied by the number of bits (`numBits`). + +??? 
note "SparkException: Hilbert indexing can only be used on 9 or fewer columns" + `hilbert_index` throws a `SparkException` for 10 or more columns (`cols`). + + ```text + Hilbert indexing can only be used on 9 or fewer columns. + ``` + +--- + +`hilbert_index` is used when: + +* `HilbertClustering` is requested for the [clustering expression](HilbertClustering.md#getClusteringExpression) diff --git a/docs/commands/optimize/OptimizeExecutor.md b/docs/commands/optimize/OptimizeExecutor.md index 9a4f2707d..b803db78c 100644 --- a/docs/commands/optimize/OptimizeExecutor.md +++ b/docs/commands/optimize/OptimizeExecutor.md @@ -11,13 +11,30 @@ * `SparkSession` ([Spark SQL]({{ book.spark_sql }}/SparkSession)) * [DeltaLog](../../DeltaLog.md) (of the Delta table to be optimized) * Partition predicate expressions ([Spark SQL]({{ book.spark_sql }}/expressions/Expression)) -* Z-OrderBy Columns (Names) +* Z-OrderBy Column Names `OptimizeExecutor` is created when: * `OptimizeTableCommand` is requested to [run](OptimizeTableCommand.md#run) -## optimize +## Curve { #curve } + +```scala +curve: String +``` + +`curve` can be one of the two supported values: + +* `zorder` for one or more [zOrderByColumns](#zOrderByColumns) +* `hilbert` for no [zOrderByColumns](#zOrderByColumns) and [clustered tables](#isClusteredTable) feature enabled + +--- + +`curve` is used when: + +* `OptimizeExecutor` is requested to [runOptimizeBinJob](#runOptimizeBinJob) + +## Performing Optimization { #optimize } ```scala optimize(): Seq[Row] diff --git a/docs/commands/optimize/SpaceFillingCurveClustering.md b/docs/commands/optimize/SpaceFillingCurveClustering.md index c52d36eed..2b4771d11 100644 --- a/docs/commands/optimize/SpaceFillingCurveClustering.md +++ b/docs/commands/optimize/SpaceFillingCurveClustering.md @@ -4,7 +4,7 @@ ## Contract -### getClusteringExpression { #getClusteringExpression } +### ClusteringExpression { #getClusteringExpression } ```scala getClusteringExpression( @@ -12,12 +12,18 @@ 
getClusteringExpression( numRanges: Int): Column ``` +See: + +* [HilbertClustering](HilbertClustering.md#getClusteringExpression) +* [ZOrderClustering](ZOrderClustering.md#getClusteringExpression) + Used when: * `SpaceFillingCurveClustering` is requested to execute [multi-dimensional clustering](#cluster) ## Implementations +* [HilbertClustering](HilbertClustering.md) * [ZOrderClustering](ZOrderClustering.md) ## Multi-Dimensional Clustering { #cluster } diff --git a/docs/commands/optimize/ZOrderClustering.md b/docs/commands/optimize/ZOrderClustering.md index db58a4f96..2ec69a7ca 100644 --- a/docs/commands/optimize/ZOrderClustering.md +++ b/docs/commands/optimize/ZOrderClustering.md @@ -1,24 +1,24 @@ # ZOrderClustering -`ZOrderClustering` is a [SpaceFillingCurveClustering](SpaceFillingCurveClustering.md) for [MultiDimClustering.cluster](MultiDimClustering.md#cluster-utility) utility. +`ZOrderClustering` is a [SpaceFillingCurveClustering](SpaceFillingCurveClustering.md) for [multi-dimensional clustering](MultiDimClustering.md#cluster-utility) with [zorder](OptimizeExecutor.md#zorder) curve. -## getClusteringExpression +## Clustering Expression { #getClusteringExpression } -```scala -getClusteringExpression( - cols: Seq[Column], - numRanges: Int): Column -``` +??? note "SpaceFillingCurveClustering" -`getClusteringExpression` is part of the [SpaceFillingCurveClustering](SpaceFillingCurveClustering.md#getClusteringExpression) abstraction. + ```scala + getClusteringExpression( + cols: Seq[Column], + numRanges: Int): Column + ``` ---- + `getClusteringExpression` is part of the [SpaceFillingCurveClustering](SpaceFillingCurveClustering.md#getClusteringExpression) abstraction. `getClusteringExpression` creates a [range_partition_id](MultiDimClusteringFunctions.md#range_partition_id) function (with the given `numRanges` for the number of partitions) for every `Column` (in the given `cols`). 
In the end, `getClusteringExpression` [interleave_bits](MultiDimClusteringFunctions.md#interleave_bits) with the `range_partition_id` columns and casts the (evaluation) result to `StringType`. -### Demo +### Demo { #getClusteringExpression-demo } For some reason, [getClusteringExpression](#getClusteringExpression) is `protected[skipping]` so let's hop over the fence with the following hack. diff --git a/docs/liquid-clustering/index.md b/docs/liquid-clustering/index.md index d99b7acad..4c2b9278f 100644 --- a/docs/liquid-clustering/index.md +++ b/docs/liquid-clustering/index.md @@ -6,19 +6,19 @@ subtitle: Clustered Tables # Liquid Clustering -**Liquid Clustering** is an optimization technique in Delta Lake that...FIXME +**Liquid Clustering** is an optimization technique in Delta Lake that uses [OPTIMIZE](../commands/optimize/index.md) with [Hilbert clustering](../commands/optimize/HilbertClustering.md). !!! info "Not Recommended for Production Use" 1. A clustered table is currently in preview and is disabled by default. 1. A clustered table is not recommended for production use (e.g., unsupported incremental clustering). -Liquid Clustering can be enabled using [spark.databricks.delta.clusteredTable.enableClusteringTablePreview](../configuration-properties/index.md#spark.databricks.delta.clusteredTable.enableClusteringTablePreview) configuration property. +Liquid Clustering can be enabled system-wide using [spark.databricks.delta.clusteredTable.enableClusteringTablePreview](../configuration-properties/index.md#spark.databricks.delta.clusteredTable.enableClusteringTablePreview) configuration property. ```sql SET spark.databricks.delta.clusteredTable.enableClusteringTablePreview=true ``` -Liquid Clustering can be applied to delta tables that were created with `CLUSTER BY` clause. +Liquid Clustering can only be applied to delta tables created with `CLUSTER BY` clause. ```sql CREATE TABLE IF NOT EXISTS delta_table @@ -54,3 +54,4 @@ DESC EXTENDED delta_table 1. 
Liquid Clustering cannot be used with partitioning (`PARTITIONED BY`) 1. Liquid Clustering cannot be used with bucketing (`CLUSTERED BY INTO BUCKETS`) +1. Liquid Clustering can be used with [between 2 and 9 columns](../commands/optimize/MultiDimClusteringFunctions.md#hilbert_index) to `CLUSTER BY`.
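
Reviewer note (not part of the patch): the arithmetic documented in this change can be sketched in plain Scala, without Spark or Delta Lake on the classpath. The sketch covers the number of bits per column, the total _hilbertBits_, and the 64-bit threshold that picks between `HilbertLongIndex` and `HilbertByteArrayIndex`. The object `HilbertBitsSketch` and its methods are hypothetical names used for illustration only, and `require` merely stands in for the `SparkException` mentioned in the docs. It also assumes `numRanges` is positive. It is not Delta Lake's actual implementation.

```scala
// A minimal sketch, assuming the rules stated in the docs above.
// All names here are hypothetical and are not part of Delta Lake's API.
object HilbertBitsSketch {

  // One more than the position of the highest-order one-bit in numRanges
  // (assumes numRanges > 0).
  def numBits(numRanges: Int): Int =
    Integer.numberOfTrailingZeros(Integer.highestOneBit(numRanges)) + 1

  // hilbertBits = number of columns multiplied by the number of bits per column.
  def hilbertBits(numCols: Int, numRanges: Int): Int =
    numCols * numBits(numRanges)

  // Names the expression the docs say is used for the given inputs.
  // `require` is a stand-in for the SparkException described in the docs.
  def indexKind(numCols: Int, numRanges: Int): String = {
    require(numCols <= 9, "Hilbert indexing can only be used on 9 or fewer columns.")
    if (hilbertBits(numCols, numRanges) <= 64) "HilbertLongIndex"
    else "HilbertByteArrayIndex"
  }
}

println(HilbertBitsSketch.numBits(5))            // 3
println(HilbertBitsSketch.indexKind(2, 5))       // HilbertLongIndex (2 * 3 = 6 bits)
println(HilbertBitsSketch.indexKind(9, 1000))    // HilbertByteArrayIndex (9 * 10 = 90 bits)
```

Returning the expression name as a `String` keeps the sketch independent of Spark; the real `hilbert_index` returns a `Column` instead.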