DeltaHistoryManager and Maximum Number of Keys
jaceklaskowski committed Dec 9, 2023
1 parent d682256 commit f3db923
Showing 6 changed files with 59 additions and 10 deletions.
32 changes: 26 additions & 6 deletions docs/DeltaHistoryManager.md
@@ -7,14 +7,23 @@
`DeltaHistoryManager` takes the following to be created:

* <span id="deltaLog"> [DeltaLog](DeltaLog.md)
* <span id="maxKeysPerList"> Maximum number of keys (default: `1000`)
* [Maximum Number of Keys](#maxKeysPerList)

`DeltaHistoryManager` is created when:

* `DeltaLog` is requested for [one](DeltaLog.md#history)
* `DeltaTableOperations` is requested to [execute history command](DeltaTableOperations.md#executeHistory)
* `DeltaLog` is requested for the [DeltaHistoryManager](DeltaLog.md#history)

## <span id="getHistory"> Version and Commit History
### Maximum Number of Keys { #maxKeysPerList }

`DeltaHistoryManager` can be given `maxKeysPerList` when [created](#creating-instance).

Unless given, `maxKeysPerList` is `1000`.

The value of `maxKeysPerList` can be configured using [spark.databricks.delta.history.maxKeysPerList](configuration-properties/index.md#spark.databricks.delta.history.maxKeysPerList) configuration property.

`maxKeysPerList` is used to [look up the active commit at a given time](#getActiveCommitAtTime) (which may use [parallelSearch](#parallelSearch)).

## Version and Commit History { #getHistory }

```scala
getHistory(
@@ -31,7 +40,7 @@ getHistory(
* `DeltaTableOperations` is requested to [executeHistory](DeltaTableOperations.md#executeHistory) (for [DeltaTable.history](DeltaTable.md#history) operator)
* [DescribeDeltaHistoryCommand](commands/describe-history/DescribeDeltaHistoryCommand.md) is executed (for [DESCRIBE HISTORY](sql/index.md#describe-history) SQL command)

### <span id="getCommitInfo"> getCommitInfo Utility
### getCommitInfo { #getCommitInfo }

```scala
getCommitInfo(
@@ -42,7 +51,7 @@ getCommitInfo(

`getCommitInfo`...FIXME

## <span id="getActiveCommitAtTime"> getActiveCommitAtTime
## getActiveCommitAtTime { #getActiveCommitAtTime }

```scala
getActiveCommitAtTime(
@@ -52,8 +61,19 @@ getActiveCommitAtTime(
canReturnEarliestCommit: Boolean = false): Commit
```

`getActiveCommitAtTime` determines the earliest commit to find based on the given `mustBeRecreatable` flag (default: `true`):

* When enabled (default), `getActiveCommitAtTime` uses [getEarliestRecreatableCommit](#getEarliestRecreatableCommit)
* When disabled, `getActiveCommitAtTime` uses [getEarliestDeltaFile](#getEarliestDeltaFile)

`getActiveCommitAtTime` requests the [DeltaLog](#deltaLog) to [update](SnapshotManagement.md#update), which gives the latest [Snapshot](Snapshot.md). `getActiveCommitAtTime` then requests the snapshot for its [version](Snapshot.md#version) (the latest version of the table).

`getActiveCommitAtTime` finds the commit. Depending on the number of commits to search (relative to [maxKeysPerList](#maxKeysPerList)), `getActiveCommitAtTime` uses either [parallelSearch](#parallelSearch) or [getCommits](#getCommits).

`getActiveCommitAtTime`...FIXME
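
The choice between the two lookup strategies can be sketched in plain Scala. This is a minimal, self-contained illustration, not Delta Lake's code: `ActiveCommitSketch`, `Commit`, `chooseStrategy` and `lastCommitAtOrBefore` are hypothetical names, and the `2 * maxKeysPerList` threshold is an assumption made only for this example.

```scala
object ActiveCommitSketch {
  // Hypothetical stand-in for a commit in the transaction log
  final case class Commit(version: Long, timestamp: Long)

  sealed trait Strategy
  case object ParallelSearch extends Strategy
  case object LinearListing extends Strategy

  // Assumption: switch to a parallel search once the version range is
  // larger than twice maxKeysPerList; otherwise list the commits directly.
  def chooseStrategy(earliest: Long, latest: Long, maxKeysPerList: Int = 1000): Strategy =
    if (latest - earliest > 2L * maxKeysPerList) ParallelSearch else LinearListing

  // The active commit at `time` is the last commit whose timestamp is not after `time`.
  // Assumes `commits` is sorted by version with non-decreasing timestamps.
  def lastCommitAtOrBefore(commits: Seq[Commit], time: Long): Option[Commit] =
    commits.takeWhile(_.timestamp <= time).lastOption

  val commits: Seq[Commit] = Seq(Commit(0L, 100L), Commit(1L, 200L), Commit(2L, 300L))
  val strategy: Strategy = chooseStrategy(earliest = 0L, latest = 2L)
  val active: Option[Commit] = lastCommitAtOrBefore(commits, time = 250L)
}
```

With only three versions the sketch picks the linear listing, and the active commit at time `250` is version `1`. A range of several thousand versions would cross the assumed threshold and select the parallel search instead.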

---

`getActiveCommitAtTime` is used when:

* `DeltaTableUtils` utility is used to [resolveTimeTravelVersion](DeltaTableUtils.md#resolveTimeTravelVersion)
15 changes: 15 additions & 0 deletions docs/DeltaLog.md
@@ -15,6 +15,21 @@

* [DeltaLog.forTable](#forTable) utility is used

## DeltaHistoryManager { #history }

```scala
history: DeltaHistoryManager
```

`DeltaLog` creates a [DeltaHistoryManager](DeltaHistoryManager.md) (when requested for one the very first time).

??? note "Lazy Value"
    `history` is a Scala **lazy value** to guarantee that the code to initialize it is executed once only (when accessed for the first time) and the computed value never changes afterwards.

    Learn more in the [Scala Language Specification]({{ scala.spec }}/05-classes-and-objects.html#lazy).

`DeltaLog` uses [spark.databricks.delta.history.maxKeysPerList](configuration-properties/index.md#spark.databricks.delta.history.maxKeysPerList) property for the [maxKeysPerList](DeltaHistoryManager.md#maxKeysPerList).
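
The once-only initialization of a lazy value can be seen in a small, self-contained sketch. `LazyValSketch`, `Log` and `HistoryManager` are hypothetical stand-ins for `DeltaLog` and `DeltaHistoryManager`, used only to show the language feature.

```scala
object LazyValSketch {
  // Counts how many times the lazy initializer runs
  var initializations: Int = 0

  // Hypothetical stand-in for DeltaHistoryManager
  final class HistoryManager(val maxKeysPerList: Int)

  // Hypothetical stand-in for DeltaLog
  final class Log(maxKeysPerList: Int) {
    // Not evaluated until `history` is accessed for the first time
    lazy val history: HistoryManager = {
      initializations += 1
      new HistoryManager(maxKeysPerList)
    }
  }

  val log = new Log(maxKeysPerList = 1000)
  val before: Int = initializations      // the initializer has not run yet
  val first: HistoryManager = log.history  // triggers the initializer
  val second: HistoryManager = log.history // reuses the computed value
}
```

Both accesses return the same `HistoryManager` instance, and the counter is incremented exactly once.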

## <span id="_delta_log"> _delta_log Metadata Directory

`DeltaLog` uses **_delta_log** metadata directory for the transaction log of a Delta table.
4 changes: 3 additions & 1 deletion docs/commands/describe-history/index.md
@@ -1,10 +1,12 @@
# DESCRIBE HISTORY Command

Delta Lake supports displaying versions (_history_) of delta tables using the following high-level operators:
Delta Lake can display the versions (_history_) of delta tables using the following high-level operators:

* [DESCRIBE HISTORY](../../sql/index.md#describe-history) SQL command
* [DeltaTable.history](../../DeltaTable.md#history)

`DESCRIBE HISTORY` (regardless of the variant: SQL or `DeltaTable` API) is a mere wrapper around [DeltaHistoryManager](../../DeltaHistoryManager.md) to access the history of a delta table.
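
Both variants can be tried side by side. The following is a sketch that assumes a `SparkSession` already configured for Delta Lake and an existing delta table at `path`; `showHistory` is a hypothetical helper name, not part of Delta Lake.

```scala
import io.delta.tables.DeltaTable
import org.apache.spark.sql.{DataFrame, SparkSession}

// Assumption: `spark` has Delta Lake enabled and `path` points to a delta table
def showHistory(spark: SparkSession, path: String): Unit = {
  // DeltaTable API variant
  val viaApi: DataFrame = DeltaTable.forPath(spark, path).history()
  viaApi.select("version", "timestamp", "operation").show(truncate = false)

  // SQL variant
  val viaSql: DataFrame = spark.sql(s"DESCRIBE HISTORY delta.`$path`")
  viaSql.select("version", "timestamp", "operation").show(truncate = false)
}
```

Both calls should list the same versions, since either way the history comes from `DeltaHistoryManager`.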

## Metrics Reporting

Write metrics can be collected at [transactional write](../../TransactionalWrite.md#writeFiles) based on [history.metricsEnabled](../../configuration-properties/index.md#history.metricsEnabled) configuration property.
4 changes: 4 additions & 0 deletions docs/configuration-properties/DeltaSQLConf.md
@@ -10,6 +10,10 @@

[spark.databricks.delta.delete.deletionVectors.persistent](index.md#delete.deletionVectors.persistent)

## history.maxKeysPerList { #DELTA_HISTORY_PAR_SEARCH_THRESHOLD }

[spark.databricks.delta.history.maxKeysPerList](index.md#history.maxKeysPerList)

## merge.materializeSource { #DELTA_COLLECT_STATS_USING_TABLE_SCHEMA }

[spark.databricks.delta.merge.materializeSource](index.md#merge.materializeSource)
12 changes: 9 additions & 3 deletions docs/configuration-properties/index.md
@@ -102,14 +102,20 @@ Default: `3`

Default: `.s3-optimization-`

### <span id="history.maxKeysPerList"><span id="DELTA_HISTORY_PAR_SEARCH_THRESHOLD"> history.maxKeysPerList
### <span id="spark.databricks.delta.history.maxKeysPerList"><span id="DELTA_HISTORY_PAR_SEARCH_THRESHOLD"> history.maxKeysPerList { #history.maxKeysPerList }

**spark.databricks.delta.history.maxKeysPerList** (internal) controls how many commits to list when performing a parallel search.
**spark.databricks.delta.history.maxKeysPerList**

The default is the maximum keys returned by S3 per list call. Azure can return 5000, therefore we choose 1000.
(internal) How many commits to list when performing a parallel search

Default: `1000`

The default is the maximum number of keys returned by S3 per [ListObjectsV2]({{ s3.api }}/API_ListObjectsV2.html) call. Azure can return up to `5000` keys per call, so the smaller value (`1000`) is used.

Used when:

* `DeltaLog` is requested for the [DeltaHistoryManager](../DeltaLog.md#history)
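
A session-level override could look like the sketch below. The value `500` is arbitrary and `useSmallerListing` is a hypothetical helper name. Because `DeltaLog` creates its `DeltaHistoryManager` lazily, the property presumably has to be set before the history of a table is requested for the first time.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: `spark` is an existing session with Delta Lake enabled
def useSmallerListing(spark: SparkSession): Unit = {
  // 500 is an arbitrary example value; the default is 1000
  spark.conf.set("spark.databricks.delta.history.maxKeysPerList", "500")
}
```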

### <span id="DELTA_HISTORY_METRICS_ENABLED"> history.metricsEnabled { #history.metricsEnabled }

**spark.databricks.delta.history.metricsEnabled**
2 changes: 2 additions & 0 deletions mkdocs.yml
@@ -116,6 +116,8 @@ extra:
    scala: https://github.com/FasterXML/jackson-module-scala
  java:
    api: https://docs.oracle.com/en/java/javase/17/docs/api/java.base
  s3:
    api: https://docs.aws.amazon.com/AmazonS3/latest/API
  scala:
    api: https://www.scala-lang.org/api/2.13.8
    docs: https://docs.scala-lang.org/
