Add UCX history schema and table for storing UCX's artifact #2744

JCZuurmond · 2024-09-25T10:13:45Z

Changes

Add UCX history schema and table for storing UCX's artifact. The artifacts are, amongst other things, going to be used for migration progress tracking.

Linked issues

Resolves #2572

Functionality

modified existing command: databricks labs ucx create-ucx-catalog

Tests

manually tested
added unit tests
added integration tests

JCZuurmond

@nfx and @asnare : Please provide input. Especially on the dataclass attributes and the naming

JCZuurmond · 2024-09-25T10:15:32Z

Missing feature on lsql to reuse SchemaDeployer here: databrickslabs/lsql#294

asnare · 2024-09-25T11:18:19Z

@nfx and @asnare : Please provide input. Especially on the dataclass attributes and the naming

This is the current version that I'm using (as an intermediate version) on #2743:

@dataclass(frozen=True, kw_only=True)
class HistoricalRecord:
    workspace_id: int
    """The identifier of the workspace where this record was generated."""

    run_id: str
    """An identifier of the workflow run that generated this record."""

    snapshot_id: str
    """An identifier that is unique to the records produced for a given snapshot."""

    run_start_time: dt.datetime
    """When this record was generated."""

    object_type: str
    """The inventory table for which this record was generated."""

    object_type_version: int
    """Versioning of inventory table, for forward compatibility."""

    object_id: list[str]
    """The type-specific identifier for this inventory record."""

    object_data: str
    """Type-specific JSON-encoded data of the inventory record."""

    failures: list[str]
    """The list of problems associated with the object that this inventory record covers."""

    owner: str
    """The identity of the account that created this inventory record."""

asnare · 2024-09-25T11:23:25Z

I've also added some notes on the feature issue: #2572 (comment) &
#2572 (comment)

JCZuurmond · 2024-09-25T11:48:54Z

Both `_id

Thanks! I have updated the dataclass: e80556c

Note:

The run_id and snapshot_id are int. Do you know if they should be str instead?
On owner see: [FEATURE] Create a ucx.history table #2572 (comment)

asnare · 2024-09-25T12:07:49Z

The run_id and snapshot_id are int. Do you know if they should be str instead?

run_id: Can definitely be an integer.

snapshot_id: I was populating this with a UUID, but I'm also fine with it being an int.

nfx · 2024-09-25T12:26:42Z

src/databricks/labs/ucx/progress/install.py

+    object_type: str
+    """The inventory table for which this record was generated."""
+
+    object_type_version: int


can it be just the ucx version str? if not str, then it should be list[int]

Elsewhere I wrote:

Typically it's easier to version serialisation formats independently of the software version they're used in. (Corner cases with software upgrades and downgrades are also handled better.)

Technically we don't need a version for the first version (i.e., what we're doing now) and can introduce the version field when it's needed. However, schema changes are painful, so having the field from the start is worth the cost even though it's not needed yet.

After a version change for a type, during deserialisation we only need to have logic for "this is how we decode version 0, this is how we decode version 1, etc." and it's all tightly contained/testable/etc. With the exception of some hash-based versioning schemes, this is typically how all serde handle this problem.

If we use the UCX version then this suddenly becomes based on semantic range-checks. (And downgrades lead to failed deserialisation because the 'current' version is never specified, but isn't necessarily the 'latest', and that range has an open upper bound leading to a version by a future release not being detected as unsupported.)
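The per-type integer versioning described above can be sketched as follows. This is a minimal illustration, not the UCX implementation: the decoder functions, the `DECODERS` table, the `UnsupportedVersion` exception and the example fields are all hypothetical names introduced here.

```python
import json
from typing import Any, Callable


def _decode_v0(raw: dict[str, Any]) -> dict[str, Any]:
    # Hypothetical version 0 of a record type: no "location" field yet.
    return {"name": raw["name"], "location": None}


def _decode_v1(raw: dict[str, Any]) -> dict[str, Any]:
    # Hypothetical version 1: "location" was added to the serialised form.
    return {"name": raw["name"], "location": raw["location"]}


# One decoder per known format version (hypothetical table).
DECODERS: dict[int, Callable[[dict[str, Any]], dict[str, Any]]] = {
    0: _decode_v0,
    1: _decode_v1,
}


class UnsupportedVersion(ValueError):
    """Raised when a record was written with a format version unknown to this reader."""


def decode(object_type_version: int, object_data: str) -> dict[str, Any]:
    decoder = DECODERS.get(object_type_version)
    if decoder is None:
        # A version written by a future release is detected explicitly
        # instead of being silently misread.
        raise UnsupportedVersion(f"Cannot decode version {object_type_version}")
    return decoder(json.loads(object_data))
```

Because the version is a plain integer owned by the record type, an unknown value is an exact miss in the table rather than the result of a range check.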

we can

version_map: dict[tuple[int, int, int], int] = {(0, 37, 0): 1, (0, 39, 0): 2, (0, 52, 0): 3}

Does this handle future versions, including later versions installed in other workspaces attached to the same metastore?

If we (a reader) are on version 0,54,0 then we'll assume version 3 (which is fine) but break if 0,55,0 introduced version 4, because that's indistinguishable to us: we will treat it as 3 (and fail) because only 0,55,0 or later knows version 4 exists.

Just encoding the integer avoids this because we can detect a future record (and that it cannot be decoded by the current version).
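The ambiguity can be reproduced with a small sketch. It assumes the map is resolved by taking the highest threshold at or below the release that wrote the record; the function name `format_version_for` and the `(0, 55, 0): 4` entry are hypothetical and only mirror the scenario described above.

```python
from bisect import bisect_right

Version = tuple[int, int, int]

# The map as known to a reader running release 0.54.0.
READER_MAP: dict[Version, int] = {(0, 37, 0): 1, (0, 39, 0): 2, (0, 52, 0): 3}

# The map as known to a hypothetical release 0.55.0 that introduced format 4.
WRITER_MAP: dict[Version, int] = {**READER_MAP, (0, 55, 0): 4}


def format_version_for(ucx_version: Version, version_map: dict[Version, int]) -> int:
    """Resolve a release to the format it writes: the highest threshold <= release."""
    thresholds = sorted(version_map)
    index = bisect_right(thresholds, ucx_version)
    if index == 0:
        raise ValueError(f"No format known for release {ucx_version}")
    return version_map[thresholds[index - 1]]


# The writer on 0.55.0 produced format 4, but the older reader resolves the
# same release to format 3 and would try to decode the record incorrectly.
writer_view = format_version_for((0, 55, 0), WRITER_MAP)
reader_view = format_version_for((0, 55, 0), READER_MAP)
```

The open upper bound on the last range is what hides the new format from the older reader.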

actually, better option would be:

migrate_after_versions: dict[tuple[int, int, int], Callable[[dict[str, str]], dict[str, str]]] = {(0, 37, 0): lambda x: x | {'foo': "bar"}, (0, 39, 0): remove_x_column, (0, 52, 0): convert_some_datatype}

we sort by version tuple, get the diff between the record's version and the current one (from databricks.labs.ucx.__about__ import __version__), and apply those migrations. if we get a version that we don't understand, we warn and skip.
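A minimal sketch of that migration chain follows. The interpretation is an assumption: a migration keyed by version V is applied to records written by a release older than V when the reader runs V or later. The `upgrade` function and the body of `remove_x_column` are hypothetical, and the versions are passed in as tuples rather than read from `__about__`.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger(__name__)

Version = tuple[int, int, int]
Record = dict[str, str]


def remove_x_column(record: Record) -> Record:
    # Hypothetical migration: drop the "x" key from the record.
    return {key: value for key, value in record.items() if key != "x"}


MIGRATE_AFTER_VERSIONS: dict[Version, Callable[[Record], Record]] = {
    (0, 37, 0): lambda x: x | {"foo": "bar"},
    (0, 39, 0): remove_x_column,
}


def upgrade(record: Record, written_by: Version, current: Version) -> Optional[Record]:
    """Apply, in version order, every migration between the writer and the reader."""
    if written_by > current:
        # Written by a later release than the one running: warn and skip.
        logger.warning("Skipping record written by later release %s", written_by)
        return None
    for version in sorted(MIGRATE_AFTER_VERSIONS):
        if written_by < version <= current:
            record = MIGRATE_AFTER_VERSIONS[version](record)
    return record
```

For example, a record written by 0.36.0 and read by 0.40.0 passes through both migrations, while one written by 0.38.0 only passes through the second.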

Just to aid my own understanding, and try to work through ambiguities, as far as I can tell there are two ways of interpreting a version tied to the UCX release:

1. The records contain the first version that understands the serialised form. That is, version 0.37.0, 0.38.0, etc. write 0.37.0 in the record, until version 0.39.0 at which point version 0.39.0, 0.40, etc. all start writing 0.39.0. This continues until 0.52.0, etc.

   Under this protocol we can detect future/unknown versions because they're higher than __about__.__version__. We can also read past versions. It does however have a fairly large caveat: for all data classes we need to know the earliest version at which the 'current' version was introduced. This needs to be known in advance of the software release while we're writing and testing it and hard-coded somewhere. (We also need something special during development to ensure that we can read our own records.)

2. The records contain the actual version that wrote the record.

   Under this protocol we can detect past versions, but not handle future/unknown versions because there's no way to know if a future release introduced a new version in some way. For safety we could just always ignore (and warn on) versions later than __about__.__version__. The main caveat here relates to when a downgrade happens. Mixed installs with a shared metastore could end up quite noisy (if not filtering out records from other workspaces).

@nfx: Is this also your understanding?

The records contain the actual version that wrote the record.

this is the intended way. let's store the version as ucx_version: list[int] and source it from __about__.__version__ and use databricks.sdk.mixins.compute.SemVer.parse to parse it.
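As a rough sketch of storing the version as `list[int]`: the thread names `databricks.sdk.mixins.compute.SemVer.parse` for parsing, but to keep this example runnable without the SDK it uses a standard-library regular expression as a stand-in. The helper names `version_as_list` and `written_by_later_release` are hypothetical.

```python
import re

# Stand-in for the SDK parser: only the leading major.minor.patch is extracted.
_SEMVER = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)")


def version_as_list(version: str) -> list[int]:
    """Convert a version string such as "0.41.0" into [0, 41, 0]."""
    match = _SEMVER.match(version)
    if match is None:
        raise ValueError(f"Not a semantic version: {version!r}")
    return [int(part) for part in match.groups()]


def written_by_later_release(record_version: list[int], current_version: str) -> bool:
    """True when a record comes from a release later than the one running now."""
    # Lists of integers compare element by element, so this orders versions correctly.
    return record_version > version_as_list(current_version)
```

A reader could use the second helper to warn on and skip records from later releases, as suggested earlier in the thread.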

That's clear, thanks.

nfx

lgtm

github-actions · 2024-10-09T17:50:53Z

✅ 29/29 passed, 1 flaky, 58m9s total

Flaky tests:

🤪 test_running_real_remove_backup_groups_job (8m40.994s)

Running from acceptance #6577

* Added UCX history schema and table for storing UCX's artifact ([#2744](#2744)). In this release, we have introduced a new dataclass `Historical` to store UCX artifacts for migration progress tracking, including attributes such as workspace identifier, job run identifier, object type, object identifier, data, failures, owner, and UCX version. The `ProgressTrackingInstallation` class has been updated to include a new method for deploying a table for historical records using the `Historical` dataclass. Additionally, we have modified the `databricks labs ucx create-ucx-catalog` command, and updated the integration test file `test_install.py` to include a parametrized test function for checking if the `workflow_runs` and `historical` tables are created by the UCX installation. We have also renamed the function `test_progress_tracking_installation_run_creates_workflow_runs_table` to `test_progress_tracking_installation_run_creates_tables` to reflect the addition of the new table. These changes add necessary functionality for tracking UCX migration progress and provide associated tests to ensure correctness, thereby improving UCX's progress tracking functionality and resolving issue [#2572](#2572).
* Added `hjson` to known list ([#2899](#2899)). In this release, we are excited to announce the addition of support for the Hjson library, addressing partial resolution for issue [#1931](#1931) related to configuration. This change integrates the following Hjson modules: hjson, hjson.compat, hjson.decoder, hjson.encoder, hjson.encoderH, hjson.ordered_dict, hjson.scanner, and hjson.tool. Hjson is a powerful library that enhances JSON functionality by providing comments and multi-line strings. By incorporating Hjson into our library's known list, users can now leverage its advanced features in a more streamlined and cohesive manner, resulting in a more versatile and efficient development experience.
* Bump databrickslabs/sandbox from acceptance/v0.3.0 to 0.3.1 ([#2894](#2894)). In this version bump from acceptance/v0.3.0 to 0.3.1 of the databrickslabs/sandbox library, several enhancements and bug fixes have been implemented. These changes include updates to the README file with instructions on how to use the library with the databricks labs sandbox command, fixes for the `unsupported protocol scheme` error, and the addition of more git-related libraries. Additionally, dependency updates for golang.org/x/crypto from version 0.16.0 to 0.17.0 have been made in the /go-libs and /runtime-packages directories. This version also introduces new commits that allow larger logs from acceptance tests and implement experimental OIDC refresh token rotation. The tests using this library have been updated to utilize the new version to ensure compatibility and functionality.
* Fixed `AttributeError: `UsedTable` has no attribute 'table'` by adding more type checks ([#2895](#2895)). In this release, we have made significant improvements to the library's type safety and robustness in handling `UsedTable` objects. We fixed an AttributeError related to the `UsedTable` class not having a `table` attribute by adding more type checks in the `collect_tables` method of the `TablePyCollector` and `CollectTablesVisit` classes. We also introduced `AstroidSyntaxError` exception handling and logging. Additionally, we renamed the `table_infos` variable to `used_tables` and changed its type to 'list[JobProblem]' in the `collect_tables_from_tree` and '_SparkSqlAnalyzer.collect_tables' functions. We added conditional statements to check for the presence of required attributes before yielding a new 'TableInfoNode'. A new unit test file, 'test_context.py', has been added to exercise the `tables_collector` method, which extracts table references from a given code snippet, improving the linter's table reference extraction capabilities.
* Fixed `TokenError` in assessment workflow ([#2896](#2896)). In this update, we've implemented a bug fix to improve the robustness of the assessment workflow in our open-source library. Previously, the code only caught parse errors during the execution of the workflow, but parse errors were not the only cause of failures. This commit changes the exception being caught from `ParseError` to the more general `SqlglotError`, which is the common ancestor of both `ParseError` and `TokenError`. By catching the more general `SqlglotError`, the code is now able to handle both parse errors and tokenization errors, providing a more robust solution. The `walk_expressions` method has been updated to catch `SqlglotError` instead of `ParseError`. This change allows the assessment workflow to handle a wider range of issues that may arise during the execution of SQL code, making it more versatile and reliable. The `SqlglotError` class has been imported from the `sqlglot.errors` module. This update enhances the assessment workflow's ability to handle more complex SQL queries, ensuring smoother execution.
* Fixed `assessment` workflow failure for jobs running tasks on existing interactive clusters ([#2889](#2889)). In this release, we have implemented changes to address a failure in the `assessment` workflow when jobs are run on existing interactive clusters (issue [#2886](#2886)). The fix includes modifying the `jobs.py` file by adding a try-except block when loading libraries for an existing cluster, utilizing a new exception type `ResourceDoesNotExist` to handle cases where the cluster does not exist. Furthermore, the `_register_cluster_info` function has been enhanced to manage situations where the existing cluster is not found, raising a `DependencyProblem` with the message 'cluster-not-found'. This ensures the workflow can continue running jobs on other clusters or with other configurations. Overall, these enhancements improve the system's robustness by gracefully handling edge cases and preventing workflow failure due to non-existent clusters.
* Ignore UCX inventory database in HMS while scanning tables ([#2897](#2897)). In this release, changes have been implemented in the 'tables.py' file of the 'databricks/labs/ucx/hive_metastore' directory to address the issue of mistakenly scanning the UCX inventory database during table scanning. The `_all_databases` method has been updated to exclude the UCX inventory database by checking if the database name matches the schema name and skipping it if so. This change affects the `_crawl` and `_get_table_names` methods, which no longer process the UCX inventory schema when scanning for tables. A TODO comment has been added to the `_get_table_names` method, suggesting potential removal of the UCX inventory schema check in future releases. This change ensures accurate and efficient table scanning, avoiding the `hallucination` of mistaking the UCX inventory schema as a database to be scanned.
* Tech debt: fix situations where `next()` isn't being used properly ([#2885](#2885)). In this commit, technical debt related to the proper usage of Python's built-in `next()` function has been addressed in several areas of the codebase. Previously, there was an assumption that `None` would be returned if there is no next value, which is incorrect. This commit updates and fixes the implementation to correctly handle cases where `next()` is used. Specifically, the `get_dbutils_notebook_run_path_arg`, `of_language` class method in the `CellLanguage` class, and certain methods in the `test_table_migrate.py` file have been updated to correctly handle situations where there is no next value. The `has_path()` method has been removed, and the `prepend_path()` method has been updated to insert the given path at the beginning of the list of system paths. Additionally, a test case for checking table in mount mapping with table owner has been included. These changes improve the robustness and reliability of the code by ensuring that it handles edge cases related to the `next()` function and paths correctly.
* [chore] apply `make fmt` ([#2883](#2883)). In this release, the `make_random` parameter has been removed from the `save_locations` method in the `conftest.py` file for the integration tests. This method is used to save a list of `ExternalLocation` objects to the `external_locations` table in the inventory database, and it no longer requires the `make_random` parameter. In the updated implementation, the `save_locations` method creates a single `ExternalLocation` object with a specific string and priority based on the workspace environment (Azure or AWS), and then uses the SQL backend to save the list of `ExternalLocation` objects to the database. This change simplifies the `save_locations` method and makes it more reusable throughout the test suite.

Dependency updates:

* Bump databrickslabs/sandbox from acceptance/v0.3.0 to 0.3.1 ([#2894](#2894)).

## Changes
Ran into a couple improvements when manually testing #2744:
- We request the catalog location also when the catalog already exists. Solved by checking if a catalog exists before requesting the storage location
- Multiple loops over the storage locations are not supported as the iterator is empty after first loop. Solved by emptying the external locations in a list.
- More consistent:
  - Logging
  - Matching storage locations

### Linked issues
Resolves #2879

### Functionality
- [x] modified existing command: `databricks labs ucx create-ucx-catalog/create-catalogs-schemas`

### Tests
- [x] manually tested
- [x] added unit tests

JCZuurmond added feat/cli CLI commands feat/migration-progress Issues related to the migration progress workflow labels Sep 25, 2024

JCZuurmond self-assigned this Sep 25, 2024

JCZuurmond force-pushed the feat/add-ucx-history-table-2 branch from a1a0296 to acd928d Compare September 25, 2024 12:36

nfx requested changes Sep 25, 2024

JCZuurmond mentioned this pull request Sep 26, 2024

[FEATURE] Create a ucx.workflow_runs table #2600

Closed

asnare mentioned this pull request Sep 26, 2024

[FEATURE]: Identify owners for inventory types that we track the history of #2761

Closed

2 tasks

JCZuurmond force-pushed the feat/add-ucx-history-table-2 branch from acd928d to 6a25849 Compare October 9, 2024 09:39

nfx requested changes Oct 9, 2024

JCZuurmond mentioned this pull request Oct 9, 2024

Improve creating UC catalogs #2898

Merged

3 tasks

JCZuurmond marked this pull request as ready for review October 9, 2024 17:29

JCZuurmond requested a review from a team as a code owner October 9, 2024 17:29

JCZuurmond had a problem deploying to account-admin October 9, 2024 17:29 — with GitHub Actions Error

JCZuurmond requested a review from nfx October 9, 2024 17:29

JCZuurmond added 10 commits October 9, 2024 19:30

Add history record

4b05cf0

Add history install

f9cbc17

Fix type hint

f507b12

Update history record dataclass

3b9b279

Add run_as attribute to HistoricalRecord

285429d

Add default to HistoricalRecord.object_type_version

2ce2b2d

Sort HistoricalRecord attributes

4622570

Add ucx_version to HistoryRecord

642a700

Rename object_owner to owner

61aec4d

Ignore too many attributes on dataclass

96d8117

JCZuurmond added 13 commits October 9, 2024 19:30

Use consistent attribute docs

7c9b670

Check if the historical records are populated

f773878

Set UCX version by default

bf011d9

Rephrase to job_run_id

23125ff

Remove fields part of WorkflowRun

bf7ae1f

Make snapshot id None by default

e5705ac

Do not check for history records yet

4b2cd0b

Test for historical records table to be created

d2556a5

Change object_data to data

0d5f34a

Move failures down

e2610f6

Make snapshot id an string

01d1a71

Rename HistoricalRecord to Historical

70702f0

Remove snapshot_id and object_version

1abf804

JCZuurmond force-pushed the feat/add-ucx-history-table-2 branch from 7269a09 to 1abf804 Compare October 9, 2024 17:30

JCZuurmond temporarily deployed to account-admin October 9, 2024 17:31 — with GitHub Actions Inactive

nfx approved these changes Oct 9, 2024

nfx merged commit 62e8da3 into main Oct 9, 2024
7 checks passed

nfx deleted the feat/add-ucx-history-table-2 branch October 9, 2024 18:00

nfx mentioned this pull request Oct 9, 2024

Release v0.41.0 #2900

Merged
