Minor improvements for `.save_table(mode="overwrite")` #298

asnare · 2024-09-26T15:36:25Z

This PR includes some updates for .save_table() in overwrite mode:

If no rows are supplied, instead of a no-op the table is truncated. The mock backend already assumed this but diverged from the concrete implementations which treated this as a no-op (as when appending). Unit tests now cover this situation for all backends, and the existing integration test for the SQL-statement backend has been updated to cover this.
The SQL-based backends have a slight optimisation: instead of first truncating before inserting the truncate is now performed as part of the insert for the first batch.
Type hints on the abstract method now match the concrete implementations.

Instead of truncating separately before inserting, the first batch being inserted performs the truncate.

…e mode. Previously this was treated as a no-op by the implementations, although the mock backend assumed they would truncate.

… no-rows situation.

github-actions · 2024-09-26T15:46:44Z

✅ 35/35 passed, 3 flaky, 4 skipped, 21m57s total

Flaky tests:

🤪 test_dashboards_creates_dashboard_with_order_overwrite_in_dashboard_yaml (5.925s)
🤪 test_dashboards_creates_dashboard_from_query_with_cte (5.283s)
🤪 test_dashboard_deploys_dashboard_with_table (6.179s)

_{Running from acceptance #429}

nfx · 2024-09-26T18:12:47Z

src/databricks/labs/lsql/backends.py

-        if len(rows) == 0:
+        if not rows:
+            if mode == "overwrite":
+                self.execute(f"TRUNCATE TABLE {full_name}")


Suggested change

self.execute(f"TRUNCATE TABLE {full_name}")

self.execute(f"DELETE FROM {full_name}")

otherwise we remove delta log and can't recover data. isn't INSERT OVERWRITE ... VALUES () doing the same?

Using TRUNCATE does not remove the Delta history:

create table test_truncate as select * from values 1, 2, 3; insert into test_truncate values 4, 5, 6; truncate table test_truncate; select version, timestamp, operation, operationMetrics, engineInfo from (describe history test_truncate);

version timestamp operation operationMetrics engineInfo

2 2024-09-30 09:20:35.000000 TRUNCATE {"numRemovedFiles":"2","numDeletedRows":"6","executionTimeMs":"15","numRemovedBytes":"906"} Databricks-Runtime/15.3.x-photon-scala2.12

1 2024-09-30 09:20:34.000000 WRITE {"numFiles":"1","numOutputRows":"3","numOutputBytes":"453"} Databricks-Runtime/15.3.x-photon-scala2.12

0 2024-09-30 09:20:33.000000 CREATE TABLE AS SELECT {"numFiles":"1","numOutputRows":"3","numOutputBytes":"453"} Databricks-Runtime/15.3.x-photon-scala2.12

The INSERT OVERWRITE operation would be semantically equivalent, but this is part of the block that handles .save_table() when no rows are provided: there was an (existing) optimisation in place to turn this into a no-op by returning early. This was correct for append-mode, but not really what would be expected in overwrite mode. It still returns early.

JCZuurmond

generally lgtm, some nits

tests/unit/test_backends.py

JCZuurmond

Some final nits

JCZuurmond · 2024-10-01T08:00:51Z

src/databricks/labs/lsql/backends.py

+        rows: Sequence[DataclassInstance],
+        klass: Dataclass,
+        mode: Literal["append", "overwrite"] = "append",
+    ):


Suggested change

):

) -> None:

JCZuurmond · 2024-10-01T08:02:09Z

src/databricks/labs/lsql/backends.py

            self.execute(sql)
+            # Only the first batch can truncate; subsequent batches append.


More close the wording in this code block:

Suggested change

# Only the first batch can truncate; subsequent batches append.

# Only the first batch should overwrite; subsequent batches append.

src/databricks/labs/lsql/backends.py

nfx

lgtm

* Added `escape_name` function to escape individual SQL names and `escape_full_name` function to escape dot-separated full names ([#316](#316)). Two new functions, `escape_name` and `escape_full_name`, have been added to the `databricks.labs.lsql.escapes` module for escaping SQL names. The `escape_name` function takes a single name as an input and returns it enclosed in backticks, while `escape_full_name` handles dot-separated full names by escaping each individual component. These functions have been ported from the `databrickslabs/ucx` repository and are designed to provide a consistent way to escape names and full names in SQL statements, improving the robustness of the system by preventing issues caused by unescaped special characters in SQL names. The test suite includes various cases, including single names, full names with different combinations of escaped and unescaped components, and special characters, with a specific focus on the scenario where the column name contains a period. * Bump actions/checkout from 4.2.0 to 4.2.1 ([#304](#304)). In this pull request, the `actions/checkout` dependency is updated from version 4.2.0 to 4.2.1 in the `.github/workflows/release.yml` file. This update includes a new feature where `refs/*` are checked out by commit if provided, falling back to the ref specified by the `@orhantoy` user. This change improves the flexibility of the action, allowing users to specify a commit or branch for checkout. The pull request also introduces a new contributor, `@Jcambass`, who added a workflow file for publishing releases to an immutable action package. The commits for this release include changes to prepare for the 4.2.1 release, add a workflow file for publishing releases, and check out other `refs/*` by commit if provided, falling back to ref. This pull request has been reviewed and approved by Dependabot. * Bump actions/checkout from 4.2.1 to 4.2.2 ([#310](#310)). This is a pull request to update the `actions/checkout` dependency from version 4.2.1 to 4.2.2, which includes improvements to the `url-helper.ts` file that now utilize well-known environment variables and expanded unit test coverage for the `isGhes` function. The `actions/checkout` action is commonly used in GitHub Actions workflows for checking out a repository at a specific commit or branch. The changes in this update are internal to the `actions/checkout` action and should not affect the functionality of the project utilizing this action. The pull request also includes details on the commits and compatibility score for the upgrade, and reviewers can manage and merge the request using Dependabot commands once the changes have been verified. * Bump databrickslabs/sandbox from acceptance/v0.3.0 to 0.3.1 ([#307](#307)). In this release, the `databrickslabs/sandbox` dependency has been updated from version `acceptance/v0.3.0` to `0.3.1`. This update includes previously tagged commits, bug fixes for git-related libraries, and resolution of the `unsupported protocol scheme` error. The README has been updated with more information on using the `databricks labs sandbox` command, and installation instructions have been improved. Additionally, there have been dependency updates for `go-git` libraries and `golang.org/x/crypto` in the `/go-libs` and `/runtime-packages` directories. New commits in this release allow larger logs from acceptance tests and implement experimental OIDC refresh functionality. Ignore conditions have been applied to prevent conflicts with previous versions of the dependency. This update is recommended for users who want to take advantage of the latest bug fixes and improvements. * Bump databrickslabs/sandbox from acceptance/v0.3.1 to 0.4.2 ([#315](#315)). In this release, the `databrickslabs/sandbox` dependency has been updated from version `acceptance/v0.3.1` to `0.4.2`. This update includes bug fixes, dependency updates, and additional go-git libraries. Specifically, the `Run integration tests` job in the GitHub Actions workflow has been updated to use the new version of the `databrickslabs/sandbox/acceptance` Docker image. The updated version also includes install instructions, usage instructions in the README, and a modification to provide more git-related libraries. Additionally, there were several updates to dependencies, including `golang.org/x/crypto` version `0.16.0` to `0.17.0`. Dependabot, a tool that manages dependencies in GitHub projects, is responsible for the update and provides instructions for resolving any conflicts or merging the changes into the project. This update is intended to improve the functionality and reliability of the `databrickslabs/sandbox` dependency. * Deprecate `Row.as_dict()` ([#309](#309)). In this release, we are introducing a deprecation warning for the `as_dict()` method in the `Row` class, which will be removed in favor of the `asDict()` method. This change aims to maintain consistency with Spark's `Row` behavior and prevent subtle bugs when switching between different backends. The deprecation warning will be implemented using Python's warnings mechanism, including the new annotation in Python 3.13 for static code analysis. The existing functionality of fetching values from the database through `StatementExecutionExt` remains unchanged. We recommend that clients update their code to use `.asDict()` instead of `.as_dict()` to avoid any disruptions. A new test case `test_row_as_dict_deprecated()` has been added to verify the deprecation warning for `Row.as_dict()`. * Minor improvements for `.save_table(mode="overwrite")` ([#298](#298)). In this release, the `.save_table()` method has been improved, particularly when using the `overwrite` mode. If no rows are supplied, the table will now be truncated, ensuring consistency with the mock backend behavior. This change has been optimized for SQL-based backends, which now perform truncation as part of the insert for the first batch. Type hints on the abstract method have been updated to match the concrete implementations. Unit tests and integration tests have been updated to cover the new functionality, and new methods have been added to test the truncation behavior in overwrite mode. These improvements enhance the consistency and efficiency of the `.save_table()` method when using `overwrite` mode across different backends. * Updated databrickslabs/sandbox requirement to acceptance/v0.3.0 ([#305](#305)). In this release, we have updated the requirement for the `databrickslabs/sandbox` package to version `acceptance/v0.3.0` in the `downstreams.yml` file. This update is necessary to use the latest version of the package, which includes several bug fixes and dependency updates. The `databrickslabs/sandbox` package is used in the acceptance tests, which are run as part of the CI/CD pipeline. It provides a set of tools and utilities for developing and testing code in a sandbox environment. The changelog for this version includes the addition of install instructions, more git-related libraries, and the modification of the README to include information about how to use it with the `databricks labs sandbox` command. Specifically, the version of the `databrickslabs/sandbox` package used in the `acceptance` job has been updated from `acceptance/v0.1.4` to `acceptance/v0.3.0`, allowing the integration tests to be run using the latest version of the package. The ignore conditions for this PR ensure that Dependabot will resolve any conflicts that may arise and can be manually triggered with the `@dependabot rebase` command. Dependency updates: * Bump actions/checkout from 4.2.0 to 4.2.1 ([#304](#304)). * Updated databrickslabs/sandbox requirement to acceptance/v0.3.0 ([#305](#305)). * Bump databrickslabs/sandbox from acceptance/v0.3.0 to 0.3.1 ([#307](#307)). * Bump actions/checkout from 4.2.1 to 4.2.2 ([#310](#310)). * Bump databrickslabs/sandbox from acceptance/v0.3.1 to 0.4.2 ([#315](#315)).

asnare added 6 commits September 26, 2024 17:12

Update save_table() type hints for consistency with subclasses.

5702d98

Tighten some assertions.

05a2c23

Update unit test to cover both save_table() modes.

3b015e5

Minor optimization for SQL-based save tables in overwrite mode.

ac20a91

Instead of truncating separately before inserting, the first batch being inserted performs the truncate.

Ensure that save_table() with no rows truncates the table in overwrit…

6bbbf44

…e mode. Previously this was treated as a no-op by the implementations, although the mock backend assumed they would truncate.

Update integration test for save_table(mode="overwrite") to cover the…

357d8d0

… no-rows situation.

asnare requested review from nfx and JCZuurmond September 26, 2024 15:36

asnare self-assigned this Sep 26, 2024

asnare temporarily deployed to runtime September 26, 2024 15:36 — with GitHub Actions Inactive

nfx requested changes Sep 26, 2024

View reviewed changes

JCZuurmond reviewed Sep 30, 2024

View reviewed changes

tests/unit/test_backends.py Outdated Show resolved Hide resolved

asnare added 2 commits September 30, 2024 11:49

Whitespace.

d223179

Type hints for tests modified by the PR.

bb9b475

asnare temporarily deployed to runtime September 30, 2024 09:50 — with GitHub Actions Inactive

asnare requested review from nfx and JCZuurmond September 30, 2024 09:51

JCZuurmond approved these changes Oct 1, 2024

View reviewed changes

nfx approved these changes Oct 2, 2024

View reviewed changes

nfx merged commit 128a0fa into main Oct 2, 2024
8 checks passed

nfx deleted the save-table-consistency branch October 2, 2024 17:58

nfx mentioned this pull request Nov 8, 2024

Release v0.13.0 #317

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Minor improvements for `.save_table(mode="overwrite")` #298

Minor improvements for `.save_table(mode="overwrite")` #298

asnare commented Sep 26, 2024

github-actions bot commented Sep 26, 2024 •

edited

Loading

nfx Sep 26, 2024

asnare Sep 30, 2024

JCZuurmond left a comment

JCZuurmond left a comment

JCZuurmond Oct 1, 2024

JCZuurmond Oct 1, 2024

nfx left a comment

	self.execute(f"TRUNCATE TABLE {full_name}")
	self.execute(f"DELETE FROM {full_name}")

`version`	`timestamp`	`operation`	`operationMetrics`	`engineInfo`
2	2024-09-30 09:20:35.000000	TRUNCATE	`{"numRemovedFiles":"2","numDeletedRows":"6","executionTimeMs":"15","numRemovedBytes":"906"}`	Databricks-Runtime/15.3.x-photon-scala2.12
1	2024-09-30 09:20:34.000000	WRITE	`{"numFiles":"1","numOutputRows":"3","numOutputBytes":"453"}`	Databricks-Runtime/15.3.x-photon-scala2.12
0	2024-09-30 09:20:33.000000	CREATE TABLE AS SELECT	`{"numFiles":"1","numOutputRows":"3","numOutputBytes":"453"}`	Databricks-Runtime/15.3.x-photon-scala2.12

		self.execute(sql)
		# Only the first batch can truncate; subsequent batches append.

	# Only the first batch can truncate; subsequent batches append.
	# Only the first batch should overwrite; subsequent batches append.

Minor improvements for .save_table(mode="overwrite") #298

Minor improvements for .save_table(mode="overwrite") #298

Conversation

asnare commented Sep 26, 2024

github-actions bot commented Sep 26, 2024 • edited Loading

nfx Sep 26, 2024

Choose a reason for hiding this comment

asnare Sep 30, 2024

Choose a reason for hiding this comment

JCZuurmond left a comment

Choose a reason for hiding this comment

JCZuurmond left a comment

Choose a reason for hiding this comment

JCZuurmond Oct 1, 2024

Choose a reason for hiding this comment

JCZuurmond Oct 1, 2024

Choose a reason for hiding this comment

nfx left a comment

Choose a reason for hiding this comment

Minor improvements for `.save_table(mode="overwrite")` #298

Minor improvements for `.save_table(mode="overwrite")` #298

github-actions bot commented Sep 26, 2024 •

edited

Loading