Identity Columns cntd.
jaceklaskowski committed Jan 11, 2025
1 parent 3bace4f commit f6d2bed
Showing 6 changed files with 138 additions and 32 deletions.
39 changes: 24 additions & 15 deletions docs/ColumnWithDefaultExprUtils.md
Expand Up @@ -7,7 +7,7 @@
IDENTITY column is not supported
```

## <span id="IDENTITY_MIN_WRITER_VERSION"> IDENTITY_MIN_WRITER_VERSION
## IDENTITY_MIN_WRITER_VERSION { #IDENTITY_MIN_WRITER_VERSION }

`ColumnWithDefaultExprUtils` uses `6` as the [minimum version of a writer](Protocol.md#minWriterVersion) for writing to `IDENTITY` columns.

Expand All @@ -16,7 +16,7 @@
* `ColumnWithDefaultExprUtils` is used to [satisfyProtocol](#satisfyProtocol)
* `Protocol` utility is used to [determine the required minimum protocol](Protocol.md#requiredMinimumProtocol)

## <span id="columnHasDefaultExpr"> columnHasDefaultExpr
## columnHasDefaultExpr { #columnHasDefaultExpr }

```scala
columnHasDefaultExpr(
Expand All @@ -30,7 +30,7 @@ columnHasDefaultExpr(

* `DeltaAnalysis` logical resolution rule is requested to `resolveQueryColumnsByName`

## <span id="hasIdentityColumn"> hasIdentityColumn
## hasIdentityColumn { #hasIdentityColumn }

```scala
hasIdentityColumn(
Expand All @@ -43,41 +43,50 @@ hasIdentityColumn(

* `Protocol` utility is used for the [required minimum protocol](Protocol.md#requiredMinimumProtocol)

## <span id="isIdentityColumn"> isIdentityColumn
## isIdentityColumn { #isIdentityColumn }

```scala
isIdentityColumn(
field: StructField): Boolean
```

`isIdentityColumn` uses the `Metadata` (of the given `StructField`) to check the existence of [delta.identity.start](spark-connector/DeltaSourceUtils.md#IDENTITY_INFO_START), [delta.identity.step](spark-connector/DeltaSourceUtils.md#IDENTITY_INFO_STEP) and [delta.identity.allowExplicitInsert](spark-connector/DeltaSourceUtils.md#IDENTITY_INFO_ALLOW_EXPLICIT_INSERT) metadata keys.
`isIdentityColumn` is used to find out whether a `StructField` is an [identity column](identity-columns/index.md) or not.

!!! note "IDENTITY column"
**IDENTITY column** is a column with [delta.identity.start](spark-connector/DeltaSourceUtils.md#IDENTITY_INFO_START), [delta.identity.step](spark-connector/DeltaSourceUtils.md#IDENTITY_INFO_STEP) and [delta.identity.allowExplicitInsert](spark-connector/DeltaSourceUtils.md#IDENTITY_INFO_ALLOW_EXPLICIT_INSERT) metadata.
`isIdentityColumn` uses the `Metadata` (of the given `StructField`) to check the existence of the following metadata keys:

* [delta.identity.start](spark-connector/DeltaSourceUtils.md#IDENTITY_INFO_START)
* [delta.identity.step](spark-connector/DeltaSourceUtils.md#IDENTITY_INFO_STEP)
* [delta.identity.allowExplicitInsert](spark-connector/DeltaSourceUtils.md#IDENTITY_INFO_ALLOW_EXPLICIT_INSERT)
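
As a rough illustration of the check described above, the following sketch tests a `StructField`'s metadata for the three keys using Spark SQL's `Metadata.contains`. The names `IdentityCheckSketch` and `looksLikeIdentityColumn` are hypothetical, and the value types used with `putLong` and `putBoolean` are assumptions. How the real `isIdentityColumn` treats a field that carries only some of the keys is not covered here.

```scala
import org.apache.spark.sql.types.{LongType, Metadata, MetadataBuilder, StructField}

object IdentityCheckSketch {
  // Metadata keys named in this document
  val Start = "delta.identity.start"
  val Step = "delta.identity.step"
  val AllowExplicitInsert = "delta.identity.allowExplicitInsert"

  // Hypothetical helper (not Delta Lake code): true only when all three keys are present
  def looksLikeIdentityColumn(field: StructField): Boolean = {
    val md = field.metadata
    Seq(Start, Step, AllowExplicitInsert).forall(key => md.contains(key))
  }

  // Field metadata carrying all three keys (the value types are an assumption)
  val identityMetadata: Metadata = new MetadataBuilder()
    .putLong(Start, 1L)
    .putLong(Step, 1L)
    .putBoolean(AllowExplicitInsert, false)
    .build()

  val idField: StructField =
    StructField("id", LongType, nullable = false, metadata = identityMetadata)

  // A regular field with empty metadata
  val plainField: StructField = StructField("name", LongType)
}
```

With these definitions, `IdentityCheckSketch.looksLikeIdentityColumn(IdentityCheckSketch.idField)` holds, while the same call on `plainField` does not.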

---

`isIdentityColumn` is used when:

* `ColumnWithDefaultExprUtils` is used to [hasIdentityColumn](#hasIdentityColumn) and [removeDefaultExpressions](#removeDefaultExpressions)
* `ColumnWithDefaultExprUtils` is used to [addDefaultExprsOrReturnConstraints](#addDefaultExprsOrReturnConstraints), [columnHasDefaultExpr](#columnHasDefaultExpr), [hasIdentityColumn](#hasIdentityColumn) and [removeDefaultExpressions](#removeDefaultExpressions)
* `IdentityColumn` is requested to [blockExplicitIdentityColumnInsert](identity-columns/IdentityColumn.md#blockExplicitIdentityColumnInsert), [getIdentityColumns](identity-columns/IdentityColumn.md#getIdentityColumns), [syncIdentity](identity-columns/IdentityColumn.md#syncIdentity), [updateSchema](identity-columns/IdentityColumn.md#updateSchema), [updateToValidHighWaterMark](identity-columns/IdentityColumn.md#updateToValidHighWaterMark)
* `DeltaCatalog` is requested to [alterTable](DeltaCatalog.md#alterTable) and [createDeltaTable](DeltaCatalog.md#createDeltaTable)
* `MergeIntoCommandBase` is requested to [checkIdentityColumnHighWaterMarks](commands/merge/MergeIntoCommandBase.md#checkIdentityColumnHighWaterMarks)
* `WriteIntoDelta` is requested to [writeAndReturnCommitData](commands/WriteIntoDelta.md#writeAndReturnCommitData)

## <span id="removeDefaultExpressions"> Removing Default Expressions
## Remove Default Expressions from Table Schema { #removeDefaultExpressions }

```scala
removeDefaultExpressions(
schema: StructType,
keepGeneratedColumns: Boolean = false): StructType
keepGeneratedColumns: Boolean = false,
keepIdentityColumns: Boolean = false): StructType
```

`removeDefaultExpressions`...FIXME

---

`removeDefaultExpressions` is used when:

* `DeltaLog` is requested to [create a BaseRelation](DeltaLog.md#createRelation) and [createDataFrame](DeltaLog.md#createDataFrame)
* `DeltaTableUtils` is requested to [removeInternalWriterMetadata](DeltaTableUtils.md#removeInternalWriterMetadata)
* `OptimisticTransactionImpl` is requested to [updateMetadataInternal](OptimisticTransactionImpl.md#updateMetadataInternal)
* `DeltaTableV2` is requested for the [tableSchema](DeltaTableV2.md#tableSchema)
* `DeltaDataSource` is requested for the [sourceSchema](spark-connector/DeltaDataSource.md#sourceSchema)
* `DeltaSourceBase` is requested for the [schema](spark-connector/DeltaSource.md#schema)

## <span id="tableHasDefaultExpr"> tableHasDefaultExpr
## tableHasDefaultExpr { #tableHasDefaultExpr }

```scala
tableHasDefaultExpr(
Expand Down
61 changes: 54 additions & 7 deletions docs/DeltaColumnBuilder.md
Expand Up @@ -28,22 +28,22 @@ import io.delta.tables.DeltaColumnBuilder

## Operators

### <span id="build"> build
### Build StructField { #build }

```scala
build(): StructField
```

Creates a `StructField` ([Spark SQL]({{ book.spark_sql }}/types/StructField))
Creates a `StructField` ([Spark SQL]({{ book.spark_sql }}/types/StructField)) (possibly with some field metadata)

### <span id="comment"> comment
### comment { #comment }

```scala
comment(
comment: String): DeltaColumnBuilder
```

### <span id="dataType"> dataType
### dataType { #dataType }

```scala
dataType(
Expand All @@ -52,7 +52,7 @@ dataType(
dataType: String): DeltaColumnBuilder
```

### <span id="generatedAlwaysAs"> generatedAlwaysAs
### generatedAlwaysAs { #generatedAlwaysAs }

```scala
generatedAlwaysAs(
Expand All @@ -61,14 +61,46 @@ generatedAlwaysAs(

Registers the [Generation Expression](#generationExpr) of this field

### <span id="nullable"> nullable
### generatedAlwaysAsIdentity { #generatedAlwaysAsIdentity }

```scala
generatedAlwaysAsIdentity(
start: Long,
step: Long): DeltaColumnBuilder
```

Sets the following:

Property | Value
-|-
[identityStart](#identityStart) | `start`
[identityStep](#identityStep) | `step`
[identityAllowExplicitInsert](#identityAllowExplicitInsert) | `false`

### generatedByDefaultAsIdentity { #generatedByDefaultAsIdentity }

```scala
generatedByDefaultAsIdentity(
start: Long,
step: Long): DeltaColumnBuilder
```

Sets the following:

Property | Value
-|-
[identityStart](#identityStart) | `start`
[identityStep](#identityStep) | `step`
[identityAllowExplicitInsert](#identityAllowExplicitInsert) | `true`
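
A possible usage of the two operators, assuming a `SparkSession` is available and the Delta Lake 3.3.0 library is on the classpath. The object name `IdentityBuilderSketch`, the column names and the `start`/`step` values are made up for illustration. The sketch only builds `StructField`s and does not create a table.

```scala
import io.delta.tables.DeltaTable
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{LongType, StructField}

object IdentityBuilderSketch {
  // Builds two identity fields that differ only in whether explicit inserts are allowed
  def identityFields(spark: SparkSession): (StructField, StructField) = {
    // identityAllowExplicitInsert = false
    val always = DeltaTable.columnBuilder(spark, "id_always")
      .dataType(LongType)
      .generatedAlwaysAsIdentity(1L, 1L)
      .build()

    // identityAllowExplicitInsert = true
    val byDefault = DeltaTable.columnBuilder(spark, "id_by_default")
      .dataType(LongType)
      .generatedByDefaultAsIdentity(100L, 10L)
      .build()

    (always, byDefault)
  }
}
```

Both fields are expected to carry the `delta.identity.*` keys in their metadata, which can be inspected with `field.metadata.contains(...)`.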

### nullable { #nullable }

```scala
nullable(
nullable: Boolean): DeltaColumnBuilder
```

## <span id="generationExpr"> Generation Expression
## Generation Expression { #generationExpr }

```scala
generationExpr: Option[String] = None
Expand All @@ -77,3 +109,18 @@ generationExpr: Option[String] = None
`DeltaColumnBuilder` uses `generationExpr` internal registry for the [generatedAlwaysAs](#generatedAlwaysAs) expression.

When requested to [build a StructField](#build), `DeltaColumnBuilder` registers `generationExpr` under [delta.generationExpression](spark-connector/DeltaSourceUtils.md#GENERATION_EXPRESSION_METADATA_KEY) key in the metadata (of this field).

## identityAllowExplicitInsert { #identityAllowExplicitInsert }

```scala
identityAllowExplicitInsert: Option[Boolean] = None
```

`identityAllowExplicitInsert` flag records which of the following methods was called:

Method | Value
-|-
[generatedAlwaysAsIdentity](#generatedAlwaysAsIdentity) | `false`
[generatedByDefaultAsIdentity](#generatedByDefaultAsIdentity) | `true`

`identityAllowExplicitInsert` is used to [build a StructField](#build).
9 changes: 9 additions & 0 deletions docs/commands/merge/MergeIntoCommandBase.md
Expand Up @@ -108,6 +108,15 @@ Used when:

* `MergeIntoCommandBase` is requested to [run](#run)

### checkIdentityColumnHighWaterMarks { #checkIdentityColumnHighWaterMarks }

```scala
checkIdentityColumnHighWaterMarks(
deltaTxn: OptimisticTransaction): Unit
```

`checkIdentityColumnHighWaterMarks`...FIXME

## Implementations

* [MergeIntoCommand](MergeIntoCommand.md)
Expand Down
17 changes: 17 additions & 0 deletions docs/identity-columns/IdentityColumn.md
@@ -0,0 +1,17 @@
# IdentityColumn

## getIdentityInfo { #getIdentityInfo }

```scala
getIdentityInfo(
field: StructField): IdentityInfo
```

`getIdentityInfo`...FIXME

---

`getIdentityInfo` is used when:

* `IdentityColumn` is requested to [copySchemaWithMergedHighWaterMarks](#copySchemaWithMergedHighWaterMarks), [createIdentityColumnGenerationExpr](#createIdentityColumnGenerationExpr), [syncIdentity](#syncIdentity), [updateSchema](#updateSchema), [updateToValidHighWaterMark](#updateToValidHighWaterMark)
* `MergeIntoCommandBase` is requested to [checkIdentityColumnHighWaterMarks](../commands/merge/MergeIntoCommandBase.md#checkIdentityColumnHighWaterMarks)
27 changes: 26 additions & 1 deletion docs/identity-columns/index.md
@@ -1,3 +1,28 @@
# Identity Columns

**Identity Columns** is a new feature in Delta Lake 3.3.0 that allows assigning unique values for each record inserted into a table.
**Identity Columns** is a new feature in Delta Lake 3.3.0 that allows assigning unique values for each record written out into a table (unless users provide values for them explicitly).

Identity Columns feature is supported by delta tables that meet one of the following requirements:

* The table must be on Writer Version 6
* The table must be on Writer Version 7, and a feature named `identityColumns` must exist in the table protocol's `writerFeatures`.
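
The two requirements above can be read as a single predicate. The following dependency-free sketch uses a simplified, hypothetical `ProtocolSketch` case class as a stand-in (it is not Delta Lake's `Protocol` class) and only mirrors the two bullet points.

```scala
object IdentityProtocolSketch {
  // Simplified stand-in for a table protocol (an assumption for illustration)
  final case class ProtocolSketch(
      minWriterVersion: Int,
      writerFeatures: Set[String] = Set.empty)

  // Writer Version 6, or Writer Version 7 with the identityColumns writer feature
  def supportsIdentityColumns(p: ProtocolSketch): Boolean =
    p.minWriterVersion == 6 ||
      (p.minWriterVersion == 7 && p.writerFeatures.contains("identityColumns"))
}
```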

Identity Columns cannot be specified with a generated column expression (or a `DeltaAnalysisException` is reported).

Identity Columns can only be of `LongType`.

IDENTITY column step cannot be 0 (or a `DeltaAnalysisException` is reported).
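
The three restrictions above could be sketched as a small validation routine. This is a plain-Scala model under stated assumptions: `IdentitySpec`, `ColumnType` and `validate` are hypothetical names, and the sketch returns an `Either` with a message where Delta Lake reports a `DeltaAnalysisException`.

```scala
object IdentityRulesSketch {
  // Hypothetical, simplified column types
  sealed trait ColumnType
  case object LongColumn extends ColumnType
  case object IntColumn extends ColumnType

  // Hypothetical description of an identity column definition
  final case class IdentitySpec(
      dataType: ColumnType,
      start: Long,
      step: Long,
      generationExpr: Option[String] = None)

  // Left(reason) when a rule is violated, Right(spec) otherwise
  def validate(spec: IdentitySpec): Either[String, IdentitySpec] =
    if (spec.generationExpr.isDefined)
      Left("Identity column cannot be specified with a generated column expression")
    else if (spec.dataType != LongColumn)
      Left("Identity column can only be of LongType")
    else if (spec.step == 0L)
      Left("IDENTITY column step cannot be 0")
    else
      Right(spec)
}
```

A negative `step` passes this sketch, since the only stated restriction on the step is that it is not 0.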

Internally, identity columns are columns (fields) with the following `Metadata`:

Key | Value
-|-
[delta.identity.allowExplicitInsert](../spark-connector/DeltaSourceUtils.md#IDENTITY_INFO_ALLOW_EXPLICIT_INSERT) | [identityAllowExplicitInsert](../DeltaColumnBuilder.md#identityAllowExplicitInsert)
[delta.identity.start](../spark-connector/DeltaSourceUtils.md#IDENTITY_INFO_START) | [identityStart](../DeltaColumnBuilder.md#identityStart)
[delta.identity.step](../spark-connector/DeltaSourceUtils.md#IDENTITY_INFO_STEP) | [identityStep](../DeltaColumnBuilder.md#identityStep)

[IdentityColumn](IdentityColumn.md) and [ColumnWithDefaultExprUtils](../ColumnWithDefaultExprUtils.md#isIdentityColumn) utilities are used to work with identity columns.

## Learn More

* [Identity Columns]({{ delta.github }}/PROTOCOL.md#identity-columns) in Delta Lake's table protocol specification
17 changes: 8 additions & 9 deletions docs/spark-connector/DeltaSourceUtils.md
Expand Up @@ -4,7 +4,7 @@ title: DeltaSourceUtils

# DeltaSourceUtils

## <span id="GENERATION_EXPRESSION_METADATA_KEY"><span id="delta.generationExpression"> delta.generationExpression
## <span id="GENERATION_EXPRESSION_METADATA_KEY"> delta.generationExpression { #delta.generationExpression }

`DeltaSourceUtils` defines `delta.generationExpression` metadata key for the generation expression of a [generated column](../DeltaColumnBuilder.md#generatedAlwaysAs) of a delta table.

Expand All @@ -17,31 +17,30 @@ Used when:
* [GeneratedColumn](../generated-columns/GeneratedColumn.md) utility is used to [isGeneratedColumn](../generated-columns/GeneratedColumn.md#isGeneratedColumn) and [getGenerationExpressionStr](../generated-columns/GeneratedColumn.md#getGenerationExpressionStr)
* `SchemaUtils` utility is used to [reportDifferences](../SchemaUtils.md#reportDifferences)

## <span id="IDENTITY_INFO_ALLOW_EXPLICIT_INSERT"><span id="delta.identity.allowExplicitInsert"> delta.identity.allowExplicitInsert
## <span id="IDENTITY_INFO_ALLOW_EXPLICIT_INSERT"> delta.identity.allowExplicitInsert { #delta.identity.allowExplicitInsert }

`DeltaSourceUtils` defines `delta.identity.allowExplicitInsert` metadata key for...FIXME

Used when:

* `ColumnWithDefaultExprUtils` utility is used to [isIdentityColumn](../ColumnWithDefaultExprUtils.md#isIdentityColumn) and [removeDefaultExpressions](../ColumnWithDefaultExprUtils.md#removeDefaultExpressions)

## <span id="IDENTITY_INFO_START"><span id="delta.identity.start"> delta.identity.start
## <span id="IDENTITY_INFO_START"> delta.identity.start { #delta.identity.start }

`DeltaSourceUtils` defines `delta.identity.start` metadata key for...FIXME
`delta.identity.start` table metadata key is used when:

Used when:

* `ColumnWithDefaultExprUtils` utility is used to [isIdentityColumn](../ColumnWithDefaultExprUtils.md#isIdentityColumn) and [removeDefaultExpressions](../ColumnWithDefaultExprUtils.md#removeDefaultExpressions)
* `DeltaColumnBuilder` is requested to [build a StructField](../DeltaColumnBuilder.md#build) (with [identityAllowExplicitInsert](../DeltaColumnBuilder.md#identityAllowExplicitInsert) defined)
* `ColumnWithDefaultExprUtils` is used to [isIdentityColumn](../ColumnWithDefaultExprUtils.md#isIdentityColumn) and [removeDefaultExpressions](../ColumnWithDefaultExprUtils.md#removeDefaultExpressions)

## <span id="IDENTITY_INFO_STEP"><span id="delta.identity.step"> delta.identity.step
## <span id="IDENTITY_INFO_STEP"> delta.identity.step { #delta.identity.step }

`DeltaSourceUtils` defines `delta.identity.step` metadata key for...FIXME

Used when:

* `ColumnWithDefaultExprUtils` utility is used to [isIdentityColumn](../ColumnWithDefaultExprUtils.md#isIdentityColumn) and [removeDefaultExpressions](../ColumnWithDefaultExprUtils.md#removeDefaultExpressions)

## <span id="isDeltaDataSourceName"> isDeltaDataSourceName
## isDeltaDataSourceName { #isDeltaDataSourceName }

```scala
isDeltaDataSourceName(
Expand Down
