HilbertClustering
jaceklaskowski committed Feb 4, 2024
1 parent 67f1edc commit 3956e1d
Showing 9 changed files with 150 additions and 24 deletions.
3 changes: 3 additions & 0 deletions docs/commands/optimize/HilbertByteArrayIndex.md
@@ -0,0 +1,3 @@
# HilbertByteArrayIndex

`HilbertByteArrayIndex` is...FIXME
57 changes: 57 additions & 0 deletions docs/commands/optimize/HilbertClustering.md
@@ -0,0 +1,57 @@
# HilbertClustering

`HilbertClustering` is a [SpaceFillingCurveClustering](SpaceFillingCurveClustering.md) for [multi-dimensional clustering](MultiDimClustering.md#cluster-utility) with [hilbert](OptimizeExecutor.md#hilbert) curve.

`HilbertClustering` requires between 2 and [9 columns](MultiDimClusteringFunctions.md#hilbert_index) to cluster by.

??? note "Singleton Object"
`HilbertClustering` is a Scala **object**, that is, a class with exactly one instance. It is created lazily when first referenced, like a `lazy val`.

Learn more in [Tour of Scala](https://docs.scala-lang.org/tour/singleton-objects.html).

## Clustering Expression { #getClusteringExpression }

??? note "SpaceFillingCurveClustering"

```scala
getClusteringExpression(
cols: Seq[Column],
numRanges: Int): Column
```

`getClusteringExpression` is part of the [SpaceFillingCurveClustering](SpaceFillingCurveClustering.md#getClusteringExpression) abstraction.

`getClusteringExpression` creates `rangeIdCols` as a [range_partition_id](MultiDimClusteringFunctions.md#range_partition_id) column for every column in the given `cols`, with `numRanges` as the number of partitions (_buckets_).

In the end, `getClusteringExpression` creates a [hilbert_index](MultiDimClusteringFunctions.md#hilbert_index) with the following:

* The number of bits, which is one more than the number of trailing zeros of the int value that has a single one-bit in the position of the highest-order ("leftmost") one-bit of the `numRanges` value (i.e., `Integer.numberOfTrailingZeros(Integer.highestOneBit(numRanges)) + 1`)

??? note "Number of Bits Example"
Given `numRanges` is `5`, the position of the highest-order ("leftmost") one-bit is `2`.

```scala
val numRanges = 5
scala> println(s"$numRanges in the two's complement binary representation is ${Integer.toBinaryString(numRanges)}")
5 in the two's complement binary representation is 101
```

Counting positions from right to left, starting from `0`, gives `2` as the position of the highest-order ("leftmost") one-bit.

```scala
scala> print(s"For ${numRanges}, the int value with at most a single one-bit is ${Integer.highestOneBit(numRanges)}")
For 5, the int value with at most a single one-bit is 4
```

The int value with a single one-bit at position `2` (the position of the highest-order one-bit of `numRanges`) is `4` (`2^2`).

The number of zero bits following the lowest-order ("rightmost") one-bit in the two's complement binary representation of the int value (`4`) is `2`.

```scala
scala> println(s"For ${Integer.highestOneBit(numRanges)}, the number of zero bits is ${Integer.numberOfTrailingZeros(Integer.highestOneBit(numRanges))}")
For 4, the number of zero bits is 2
```

In the end, `getClusteringExpression` uses `3` (`2 + 1`) as the number of bits.

* The [range_partition_id](MultiDimClusteringFunctions.md#range_partition_id) columns
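
The number-of-bits rule above can be written as a single expression. The following is a minimal sketch only: the `HilbertNumBits` object and its `numBits` method are hypothetical names introduced for illustration and are not part of Delta Lake.

```scala
object HilbertNumBits {
  // Hypothetical helper (not a Delta Lake API): one more than the number of
  // trailing zeros of the highest one-bit of numRanges.
  def numBits(numRanges: Int): Int =
    Integer.numberOfTrailingZeros(Integer.highestOneBit(numRanges)) + 1

  def main(args: Array[String]): Unit = {
    // 5 is 101 in binary: the highest one-bit is at position 2, so numBits is 3
    Seq(1, 5, 8, 1000).foreach { n =>
      println(s"numRanges=$n (binary ${Integer.toBinaryString(n)}) => numBits=${numBits(n)}")
    }
  }
}
```

For a positive `numRanges`, the result equals the length of its binary representation (e.g., `3` for `5`, which is `101`).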
3 changes: 3 additions & 0 deletions docs/commands/optimize/HilbertLongIndex.md
@@ -0,0 +1,3 @@
# HilbertLongIndex

`HilbertLongIndex` is...FIXME
19 changes: 13 additions & 6 deletions docs/commands/optimize/MultiDimClustering.md
@@ -41,18 +41,25 @@ cluster(
curve: String): DataFrame
```

`cluster` asserts that the given `colNames` contains at least one column name. Otherwise, `cluster` reports an `AssertionError`:
??? note "`curve` Argument and Supported Values: `zorder` or `hilbert`"
`curve` is based on [OptimizeExecutor](OptimizeExecutor.md#curve) (and can only be one of two values, `zorder` or `hilbert`).

```text
assertion failed : Cannot cluster by zero columns!
```
`cluster` asserts that the given `colNames` contains at least one column name.

??? note "AssertionError"

`cluster` reports an `AssertionError` when the given `colNames` is empty.

```text
assertion failed : Cannot cluster by zero columns!
```

`cluster` selects the multi-dimensional clustering algorithm based on the given `curve` name.

Curve Type | Clustering Algorithm
-----------|---------------------
`hilbert` | `HilbertClustering`
`zorder` | `ZOrderClustering`
`hilbert` | [HilbertClustering](HilbertClustering.md)
`zorder` | [ZOrderClustering](ZOrderClustering.md)
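
The column check and the curve-based selection described above could be sketched as follows. This is an illustration under stated assumptions, not the Delta Lake implementation: `ClusterCurveSketch` and `selectClustering` are hypothetical names, the algorithms are represented by their names only, and `IllegalArgumentException` (with a made-up message) stands in for the `SparkException` so that the example runs without Spark.

```scala
object ClusterCurveSketch {
  // Hypothetical helper: returns the name of the clustering algorithm for a curve.
  def selectClustering(colNames: Seq[String], curve: String): String = {
    // Fails with an AssertionError when no column names are given
    assert(colNames.nonEmpty, "Cannot cluster by zero columns!")
    curve match {
      case "hilbert" => "HilbertClustering"
      case "zorder"  => "ZOrderClustering"
      // Stand-in for the SparkException; this message is an assumption
      case other     => throw new IllegalArgumentException(s"Unknown curve: $other")
    }
  }

  def main(args: Array[String]): Unit = {
    println(selectClustering(Seq("id", "name"), "hilbert"))
    println(selectClustering(Seq("id"), "zorder"))
  }
}
```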

??? note "SparkException"
`cluster` accepts these two algorithms only or throws a `SparkException`:
36 changes: 34 additions & 2 deletions docs/commands/optimize/MultiDimClusteringFunctions.md
@@ -2,7 +2,7 @@

`MultiDimClusteringFunctions` utility offers Spark SQL functions for multi-dimensional clustering.

## <span id="range_partition_id"> range_partition_id
## range_partition_id { #range_partition_id }

```scala
range_partition_id(
@@ -12,11 +12,13 @@ range_partition_id(

`range_partition_id` creates a `Column` ([Spark SQL]({{ book.spark_sql }}/Column)) with [RangePartitionId](RangePartitionId.md) unary expression (for the given arguments).

---

`range_partition_id` is used when:

* `ZOrderClustering` utility is used for the [clustering expression](ZOrderClustering.md#getClusteringExpression)

## <span id="interleave_bits"> interleave_bits
## interleave_bits { #interleave_bits }

```scala
interleave_bits(
@@ -25,6 +27,36 @@ interleave_bits(

`interleave_bits` creates a `Column` ([Spark SQL]({{ book.spark_sql }}/Column)) with [InterleaveBits](InterleaveBits.md) expression (for the expressions of the given columns).

---

`interleave_bits` is used when:

* `ZOrderClustering` utility is used for the [clustering expression](ZOrderClustering.md#getClusteringExpression)

## hilbert_index { #hilbert_index }

```scala
hilbert_index(
numBits: Int,
cols: Column*): Column
```

`hilbert_index` creates a `Column` ([Spark SQL]({{ book.spark_sql }}/Column)) to execute one of the following `Expression`s ([Spark SQL]({{ book.spark_sql }}/expressions/Expression)) based on the _hilbertBits_:

* [HilbertLongIndex](HilbertLongIndex.md) for up to 64 hilbert bits
* [HilbertByteArrayIndex](HilbertByteArrayIndex.md), otherwise

The _hilbertBits_ is the number of columns (`cols`) multiplied by the number of bits (`numBits`).

??? note "SparkException: Hilbert indexing can only be used on 9 or fewer columns"
`hilbert_index` throws a `SparkException` for 10 or more columns (`cols`).

```text
Hilbert indexing can only be used on 9 or fewer columns.
```
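
The choice between the two expressions can be sketched as below. This is a hedged illustration, not the actual `hilbert_index` code: `HilbertIndexChoice` and `chooseExpression` are hypothetical names, expressions are represented by their names only, `IllegalArgumentException` stands in for the `SparkException`, and reading "up to 64" as an inclusive bound is an assumption this page does not confirm.

```scala
object HilbertIndexChoice {
  // Hypothetical helper: returns the name of the expression that would be used.
  def chooseExpression(numBits: Int, numCols: Int): String = {
    if (numCols > 9) {
      // Stand-in for the SparkException described above
      throw new IllegalArgumentException(
        "Hilbert indexing can only be used on 9 or fewer columns.")
    }
    // hilbertBits = number of columns * number of bits
    val hilbertBits = numCols * numBits
    // Assumption: the 64-bit boundary is treated as inclusive here
    if (hilbertBits <= 64) "HilbertLongIndex" else "HilbertByteArrayIndex"
  }

  def main(args: Array[String]): Unit = {
    println(chooseExpression(numBits = 3, numCols = 2))  // 6 hilbert bits
    println(chooseExpression(numBits = 10, numCols = 9)) // 90 hilbert bits
  }
}
```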

---

`hilbert_index` is used when:

* `HilbertClustering` is requested for the [clustering expression](HilbertClustering.md#getClusteringExpression)
21 changes: 19 additions & 2 deletions docs/commands/optimize/OptimizeExecutor.md
@@ -11,13 +11,30 @@
* <span id="sparkSession"> `SparkSession` ([Spark SQL]({{ book.spark_sql }}/SparkSession))
* <span id="deltaLog"> [DeltaLog](../../DeltaLog.md) (of the Delta table to be optimized)
* <span id="partitionPredicate"> Partition predicate expressions ([Spark SQL]({{ book.spark_sql }}/expressions/Expression))
* <span id="zOrderByColumns"> Z-OrderBy Columns (Names)
* <span id="zOrderByColumns"> Z-OrderBy Column Names

`OptimizeExecutor` is created when:

* `OptimizeTableCommand` is requested to [run](OptimizeTableCommand.md#run)

## optimize
## <span id="hilbert"><span id="zorder"> Curve { #curve }

```scala
curve: String
```

`curve` can be one of the two supported values:

* `zorder` for one or more [zOrderByColumns](#zOrderByColumns)
* `hilbert` when there are no [zOrderByColumns](#zOrderByColumns) and the [clustered tables](#isClusteredTable) feature is enabled

---

`curve` is used when:

* `OptimizeExecutor` is requested to [runOptimizeBinJob](#runOptimizeBinJob)

## Performing Optimization { #optimize }

```scala
optimize(): Seq[Row]
8 changes: 7 additions & 1 deletion docs/commands/optimize/SpaceFillingCurveClustering.md
@@ -4,20 +4,26 @@

## Contract

### getClusteringExpression { #getClusteringExpression }
### ClusteringExpression { #getClusteringExpression }

```scala
getClusteringExpression(
cols: Seq[Column],
numRanges: Int): Column
```

See:

* [HilbertClustering](HilbertClustering.md#getClusteringExpression)
* [ZOrderClustering](ZOrderClustering.md#getClusteringExpression)

Used when:

* `SpaceFillingCurveClustering` is requested to execute [multi-dimensional clustering](#cluster)

## Implementations

* [HilbertClustering](HilbertClustering.md)
* [ZOrderClustering](ZOrderClustering.md)

## Multi-Dimensional Clustering { #cluster }
20 changes: 10 additions & 10 deletions docs/commands/optimize/ZOrderClustering.md
@@ -1,24 +1,24 @@
# ZOrderClustering

`ZOrderClustering` is a [SpaceFillingCurveClustering](SpaceFillingCurveClustering.md) for [MultiDimClustering.cluster](MultiDimClustering.md#cluster-utility) utility.
`ZOrderClustering` is a [SpaceFillingCurveClustering](SpaceFillingCurveClustering.md) for [multi-dimensional clustering](MultiDimClustering.md#cluster-utility) with [zorder](OptimizeExecutor.md#zorder) curve.

## <span id="getClusteringExpression"> getClusteringExpression
## Clustering Expression { #getClusteringExpression }

```scala
getClusteringExpression(
cols: Seq[Column],
numRanges: Int): Column
```
??? note "SpaceFillingCurveClustering"

`getClusteringExpression` is part of the [SpaceFillingCurveClustering](SpaceFillingCurveClustering.md#getClusteringExpression) abstraction.
```scala
getClusteringExpression(
cols: Seq[Column],
numRanges: Int): Column
```

---
`getClusteringExpression` is part of the [SpaceFillingCurveClustering](SpaceFillingCurveClustering.md#getClusteringExpression) abstraction.

`getClusteringExpression` creates a [range_partition_id](MultiDimClusteringFunctions.md#range_partition_id) function (with the given `numRanges` for the number of partitions) for every `Column` (in the given `cols`).

In the end, `getClusteringExpression` [interleave_bits](MultiDimClusteringFunctions.md#interleave_bits) with the `range_partition_id` columns and casts the (evaluation) result to `StringType`.
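
The idea behind bit interleaving (the basis of a Z-order curve) can be shown with plain integers. This is a sketch only and not the `InterleaveBits` expression itself: `InterleaveBitsSketch` and `interleave` are hypothetical names, and the most-significant-bit-first order and the `Long` result are assumptions made to keep the example small.

```scala
object InterleaveBitsSketch {
  // Hypothetical helper: interleaves the lowest `numBits` bits of each value,
  // taking one bit from every value in turn, most significant bit first.
  def interleave(values: Seq[Int], numBits: Int): Long = {
    require(values.nonEmpty && values.length * numBits <= 63,
      "the interleaved result must fit in a non-negative Long")
    var result = 0L
    for (bit <- (numBits - 1) to 0 by -1; v <- values) {
      // Append the current bit of v to the result
      result = (result << 1) | ((v >>> bit) & 1)
    }
    result
  }

  def main(args: Array[String]): Unit = {
    // 5 is 101 and 2 is 010; interleaving gives 100110, which is 38
    println(interleave(Seq(5, 2), numBits = 3))
  }
}
```

Values that are close in every dimension share their leading interleaved bits, which is why sorting by the interleaved value tends to keep such rows together.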

### <span id="getClusteringExpression-demo"> Demo
### Demo { #getClusteringExpression-demo }

For some reason, [getClusteringExpression](#getClusteringExpression) is `protected[skipping]` so let's hop over the fence with the following hack.

7 changes: 4 additions & 3 deletions docs/liquid-clustering/index.md
@@ -6,19 +6,19 @@ subtitle: Clustered Tables

# Liquid Clustering

**Liquid Clustering** is an optimization technique in Delta Lake that...FIXME
**Liquid Clustering** is an optimization technique in Delta Lake that uses [OPTIMIZE](../commands/optimize/index.md) with [Hilbert clustering](../commands/optimize/HilbertClustering.md).

!!! info "Not Recommended for Production Use"
1. A clustered table is currently in preview and is disabled by default.
1. A clustered table is not recommended for production use (e.g., unsupported incremental clustering).

Liquid Clustering can be enabled using [spark.databricks.delta.clusteredTable.enableClusteringTablePreview](../configuration-properties/index.md#spark.databricks.delta.clusteredTable.enableClusteringTablePreview) configuration property.
Liquid Clustering can be enabled system-wide using [spark.databricks.delta.clusteredTable.enableClusteringTablePreview](../configuration-properties/index.md#spark.databricks.delta.clusteredTable.enableClusteringTablePreview) configuration property.

```sql
SET spark.databricks.delta.clusteredTable.enableClusteringTablePreview=true
```

Liquid Clustering can be applied to delta tables that were created with `CLUSTER BY` clause.
Liquid Clustering can only be applied to delta tables created with `CLUSTER BY` clause.

```sql
CREATE TABLE IF NOT EXISTS delta_table
@@ -54,3 +54,4 @@ DESC EXTENDED delta_table

1. Liquid Clustering cannot be used with partitioning (`PARTITIONED BY`)
1. Liquid Clustering cannot be used with bucketing (`CLUSTERED BY INTO BUCKETS`)
1. Liquid Clustering can be used with between 2 and [9 columns](../commands/optimize/MultiDimClusteringFunctions.md#hilbert_index) to `CLUSTER BY`.
