Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Minor improvements for .save_table(mode="overwrite") #298

Merged
merged 8 commits into from
Oct 2, 2024

Conversation

asnare
Copy link
Contributor

@asnare asnare commented Sep 26, 2024

This PR includes some updates for .save_table() in overwrite mode:

  • If no rows are supplied, instead of a no-op the table is truncated. The mock backend already assumed this but diverged from the concrete implementations which treated this as a no-op (as when appending). Unit tests now cover this situation for all backends, and the existing integration test for the SQL-statement backend has been updated to cover this.
  • The SQL-based backends have a slight optimisation: instead of first truncating before inserting the truncate is now performed as part of the insert for the first batch.
  • Type hints on the abstract method now match the concrete implementations.

Instead of truncating separately before inserting, the first batch being inserted performs the truncate.
…e mode.

Previously this was treated as a no-op by the implementations, although the mock backend assumed they would truncate.
@asnare asnare requested review from nfx and JCZuurmond September 26, 2024 15:36
@asnare asnare self-assigned this Sep 26, 2024
Copy link

github-actions bot commented Sep 26, 2024

✅ 35/35 passed, 3 flaky, 4 skipped, 21m57s total

Flaky tests:

  • 🤪 test_dashboards_creates_dashboard_with_order_overwrite_in_dashboard_yaml (5.925s)
  • 🤪 test_dashboards_creates_dashboard_from_query_with_cte (5.283s)
  • 🤪 test_dashboard_deploys_dashboard_with_table (6.179s)

Running from acceptance #429

if len(rows) == 0:
if not rows:
if mode == "overwrite":
self.execute(f"TRUNCATE TABLE {full_name}")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
self.execute(f"TRUNCATE TABLE {full_name}")
self.execute(f"DELETE FROM {full_name}")

otherwise we remove delta log and can't recover data. isn't INSERT OVERWRITE ... VALUES () doing the same?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using TRUNCATE does not remove the Delta history:

create table test_truncate as select * from values 1, 2, 3;
insert into test_truncate values 4, 5, 6;

truncate table test_truncate;

select version, timestamp, operation, operationMetrics, engineInfo
from (describe history test_truncate);
version timestamp operation operationMetrics engineInfo
2 2024-09-30 09:20:35.000000 TRUNCATE {"numRemovedFiles":"2","numDeletedRows":"6","executionTimeMs":"15","numRemovedBytes":"906"} Databricks-Runtime/15.3.x-photon-scala2.12
1 2024-09-30 09:20:34.000000 WRITE {"numFiles":"1","numOutputRows":"3","numOutputBytes":"453"} Databricks-Runtime/15.3.x-photon-scala2.12
0 2024-09-30 09:20:33.000000 CREATE TABLE AS SELECT {"numFiles":"1","numOutputRows":"3","numOutputBytes":"453"} Databricks-Runtime/15.3.x-photon-scala2.12

The INSERT OVERWRITE operation would be semantically equivalent, but this is part of the block that handles .save_table() when no rows are provided: there was an (existing) optimisation in place to turn this into a no-op by returning early. This was correct for append-mode, but not really what would be expected in overwrite mode. It still returns early.

Copy link
Member

@JCZuurmond JCZuurmond left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

generally lgtm, some nits

tests/unit/test_backends.py Outdated Show resolved Hide resolved
Copy link
Member

@JCZuurmond JCZuurmond left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some final nits

rows: Sequence[DataclassInstance],
klass: Dataclass,
mode: Literal["append", "overwrite"] = "append",
):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
):
) -> None:

self.execute(sql)
# Only the first batch can truncate; subsequent batches append.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More close the wording in this code block:

Suggested change
# Only the first batch can truncate; subsequent batches append.
# Only the first batch should overwrite; subsequent batches append.

src/databricks/labs/lsql/backends.py Show resolved Hide resolved
Copy link
Collaborator

@nfx nfx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@nfx nfx merged commit 128a0fa into main Oct 2, 2024
8 checks passed
@nfx nfx deleted the save-table-consistency branch October 2, 2024 17:58
nfx added a commit that referenced this pull request Nov 8, 2024
* Added `escape_name` function to escape individual SQL names and `escape_full_name` function to escape dot-separated full names ([#316](#316)). Two new functions, `escape_name` and `escape_full_name`, have been added to the `databricks.labs.lsql.escapes` module for escaping SQL names. The `escape_name` function takes a single name as an input and returns it enclosed in backticks, while `escape_full_name` handles dot-separated full names by escaping each individual component. These functions have been ported from the `databrickslabs/ucx` repository and are designed to provide a consistent way to escape names and full names in SQL statements, improving the robustness of the system by preventing issues caused by unescaped special characters in SQL names. The test suite includes various cases, including single names, full names with different combinations of escaped and unescaped components, and special characters, with a specific focus on the scenario where the column name contains a period.
* Bump actions/checkout from 4.2.0 to 4.2.1 ([#304](#304)). In this pull request, the `actions/checkout` dependency is updated from version 4.2.0 to 4.2.1 in the `.github/workflows/release.yml` file. This update includes a new feature where `refs/*` are checked out by commit if provided, falling back to the ref specified by the `@orhantoy` user. This change improves the flexibility of the action, allowing users to specify a commit or branch for checkout. The pull request also introduces a new contributor, `@Jcambass`, who added a workflow file for publishing releases to an immutable action package. The commits for this release include changes to prepare for the 4.2.1 release, add a workflow file for publishing releases, and check out other `refs/*` by commit if provided, falling back to ref. This pull request has been reviewed and approved by Dependabot.
* Bump actions/checkout from 4.2.1 to 4.2.2 ([#310](#310)). This is a pull request to update the `actions/checkout` dependency from version 4.2.1 to 4.2.2, which includes improvements to the `url-helper.ts` file that now utilize well-known environment variables and expanded unit test coverage for the `isGhes` function. The `actions/checkout` action is commonly used in GitHub Actions workflows for checking out a repository at a specific commit or branch. The changes in this update are internal to the `actions/checkout` action and should not affect the functionality of the project utilizing this action. The pull request also includes details on the commits and compatibility score for the upgrade, and reviewers can manage and merge the request using Dependabot commands once the changes have been verified.
* Bump databrickslabs/sandbox from acceptance/v0.3.0 to 0.3.1 ([#307](#307)). In this release, the `databrickslabs/sandbox` dependency has been updated from version `acceptance/v0.3.0` to `0.3.1`. This update includes previously tagged commits, bug fixes for git-related libraries, and resolution of the `unsupported protocol scheme` error. The README has been updated with more information on using the `databricks labs sandbox` command, and installation instructions have been improved. Additionally, there have been dependency updates for `go-git` libraries and `golang.org/x/crypto` in the `/go-libs` and `/runtime-packages` directories. New commits in this release allow larger logs from acceptance tests and implement experimental OIDC refresh functionality. Ignore conditions have been applied to prevent conflicts with previous versions of the dependency. This update is recommended for users who want to take advantage of the latest bug fixes and improvements.
* Bump databrickslabs/sandbox from acceptance/v0.3.1 to 0.4.2 ([#315](#315)). In this release, the `databrickslabs/sandbox` dependency has been updated from version `acceptance/v0.3.1` to `0.4.2`. This update includes bug fixes, dependency updates, and additional go-git libraries. Specifically, the `Run integration tests` job in the GitHub Actions workflow has been updated to use the new version of the `databrickslabs/sandbox/acceptance` Docker image. The updated version also includes install instructions, usage instructions in the README, and a modification to provide more git-related libraries. Additionally, there were several updates to dependencies, including `golang.org/x/crypto` version `0.16.0` to `0.17.0`. Dependabot, a tool that manages dependencies in GitHub projects, is responsible for the update and provides instructions for resolving any conflicts or merging the changes into the project. This update is intended to improve the functionality and reliability of the `databrickslabs/sandbox` dependency.
* Deprecate `Row.as_dict()` ([#309](#309)). In this release, we are introducing a deprecation warning for the `as_dict()` method in the `Row` class, which will be removed in favor of the `asDict()` method. This change aims to maintain consistency with Spark's `Row` behavior and prevent subtle bugs when switching between different backends. The deprecation warning will be implemented using Python's warnings mechanism, including the new annotation in Python 3.13 for static code analysis. The existing functionality of fetching values from the database through `StatementExecutionExt` remains unchanged. We recommend that clients update their code to use `.asDict()` instead of `.as_dict()` to avoid any disruptions. A new test case `test_row_as_dict_deprecated()` has been added to verify the deprecation warning for `Row.as_dict()`.
* Minor improvements for `.save_table(mode="overwrite")` ([#298](#298)). In this release, the `.save_table()` method has been improved, particularly when using the `overwrite` mode. If no rows are supplied, the table will now be truncated, ensuring consistency with the mock backend behavior. This change has been optimized for SQL-based backends, which now perform truncation as part of the insert for the first batch. Type hints on the abstract method have been updated to match the concrete implementations. Unit tests and integration tests have been updated to cover the new functionality, and new methods have been added to test the truncation behavior in overwrite mode. These improvements enhance the consistency and efficiency of the `.save_table()` method when using `overwrite` mode across different backends.
* Updated databrickslabs/sandbox requirement to acceptance/v0.3.0 ([#305](#305)). In this release, we have updated the requirement for the `databrickslabs/sandbox` package to version `acceptance/v0.3.0` in the `downstreams.yml` file. This update is necessary to use the latest version of the package, which includes several bug fixes and dependency updates. The `databrickslabs/sandbox` package is used in the acceptance tests, which are run as part of the CI/CD pipeline. It provides a set of tools and utilities for developing and testing code in a sandbox environment. The changelog for this version includes the addition of install instructions, more git-related libraries, and the modification of the README to include information about how to use it with the `databricks labs sandbox` command. Specifically, the version of the `databrickslabs/sandbox` package used in the `acceptance` job has been updated from `acceptance/v0.1.4` to `acceptance/v0.3.0`, allowing the integration tests to be run using the latest version of the package. The ignore conditions for this PR ensure that Dependabot will resolve any conflicts that may arise and can be manually triggered with the `@dependabot rebase` command.

Dependency updates:

 * Bump actions/checkout from 4.2.0 to 4.2.1 ([#304](#304)).
 * Updated databrickslabs/sandbox requirement to acceptance/v0.3.0 ([#305](#305)).
 * Bump databrickslabs/sandbox from acceptance/v0.3.0 to 0.3.1 ([#307](#307)).
 * Bump actions/checkout from 4.2.1 to 4.2.2 ([#310](#310)).
 * Bump databrickslabs/sandbox from acceptance/v0.3.1 to 0.4.2 ([#315](#315)).
@nfx nfx mentioned this pull request Nov 8, 2024
nfx added a commit that referenced this pull request Nov 8, 2024
* Added `escape_name` function to escape individual SQL names and
`escape_full_name` function to escape dot-separated full names
([#316](#316)). Two new
functions, `escape_name` and `escape_full_name`, have been added to the
`databricks.labs.lsql.escapes` module for escaping SQL names. The
`escape_name` function takes a single name as an input and returns it
enclosed in backticks, while `escape_full_name` handles dot-separated
full names by escaping each individual component. These functions have
been ported from the `databrickslabs/ucx` repository and are designed to
provide a consistent way to escape names and full names in SQL
statements, improving the robustness of the system by preventing issues
caused by unescaped special characters in SQL names. The test suite
includes various cases, including single names, full names with
different combinations of escaped and unescaped components, and special
characters, with a specific focus on the scenario where the column name
contains a period.
* Bump actions/checkout from 4.2.0 to 4.2.1
([#304](#304)). In this
pull request, the `actions/checkout` dependency is updated from version
4.2.0 to 4.2.1 in the `.github/workflows/release.yml` file. This update
includes a new feature where `refs/*` are checked out by commit if
provided, falling back to the ref specified by the `@orhantoy` user.
This change improves the flexibility of the action, allowing users to
specify a commit or branch for checkout. The pull request also
introduces a new contributor, `@Jcambass`, who added a workflow file for
publishing releases to an immutable action package. The commits for this
release include changes to prepare for the 4.2.1 release, add a workflow
file for publishing releases, and check out other `refs/*` by commit if
provided, falling back to ref. This pull request has been reviewed and
approved by Dependabot.
* Bump actions/checkout from 4.2.1 to 4.2.2
([#310](#310)). This is a
pull request to update the `actions/checkout` dependency from version
4.2.1 to 4.2.2, which includes improvements to the `url-helper.ts` file
that now utilize well-known environment variables and expanded unit test
coverage for the `isGhes` function. The `actions/checkout` action is
commonly used in GitHub Actions workflows for checking out a repository
at a specific commit or branch. The changes in this update are internal
to the `actions/checkout` action and should not affect the functionality
of the project utilizing this action. The pull request also includes
details on the commits and compatibility score for the upgrade, and
reviewers can manage and merge the request using Dependabot commands
once the changes have been verified.
* Bump databrickslabs/sandbox from acceptance/v0.3.0 to 0.3.1
([#307](#307)). In this
release, the `databrickslabs/sandbox` dependency has been updated from
version `acceptance/v0.3.0` to `0.3.1`. This update includes previously
tagged commits, bug fixes for git-related libraries, and resolution of
the `unsupported protocol scheme` error. The README has been updated
with more information on using the `databricks labs sandbox` command,
and installation instructions have been improved. Additionally, there
have been dependency updates for `go-git` libraries and
`golang.org/x/crypto` in the `/go-libs` and `/runtime-packages`
directories. New commits in this release allow larger logs from
acceptance tests and implement experimental OIDC refresh functionality.
Ignore conditions have been applied to prevent conflicts with previous
versions of the dependency. This update is recommended for users who
want to take advantage of the latest bug fixes and improvements.
* Bump databrickslabs/sandbox from acceptance/v0.3.1 to 0.4.2
([#315](#315)). In this
release, the `databrickslabs/sandbox` dependency has been updated from
version `acceptance/v0.3.1` to `0.4.2`. This update includes bug fixes,
dependency updates, and additional go-git libraries. Specifically, the
`Run integration tests` job in the GitHub Actions workflow has been
updated to use the new version of the
`databrickslabs/sandbox/acceptance` Docker image. The updated version
also includes install instructions, usage instructions in the README,
and a modification to provide more git-related libraries. Additionally,
there were several updates to dependencies, including
`golang.org/x/crypto` version `0.16.0` to `0.17.0`. Dependabot, a tool
that manages dependencies in GitHub projects, is responsible for the
update and provides instructions for resolving any conflicts or merging
the changes into the project. This update is intended to improve the
functionality and reliability of the `databrickslabs/sandbox`
dependency.
* Deprecate `Row.as_dict()`
([#309](#309)). In this
release, we are introducing a deprecation warning for the `as_dict()`
method in the `Row` class, which will be removed in favor of the
`asDict()` method. This change aims to maintain consistency with Spark's
`Row` behavior and prevent subtle bugs when switching between different
backends. The deprecation warning will be implemented using Python's
warnings mechanism, including the new annotation in Python 3.13 for
static code analysis. The existing functionality of fetching values from
the database through `StatementExecutionExt` remains unchanged. We
recommend that clients update their code to use `.asDict()` instead of
`.as_dict()` to avoid any disruptions. A new test case
`test_row_as_dict_deprecated()` has been added to verify the deprecation
warning for `Row.as_dict()`.
* Minor improvements for `.save_table(mode="overwrite")`
([#298](#298)). In this
release, the `.save_table()` method has been improved, particularly when
using the `overwrite` mode. If no rows are supplied, the table will now
be truncated, ensuring consistency with the mock backend behavior. This
change has been optimized for SQL-based backends, which now perform
truncation as part of the insert for the first batch. Type hints on the
abstract method have been updated to match the concrete implementations.
Unit tests and integration tests have been updated to cover the new
functionality, and new methods have been added to test the truncation
behavior in overwrite mode. These improvements enhance the consistency
and efficiency of the `.save_table()` method when using `overwrite` mode
across different backends.
* Updated databrickslabs/sandbox requirement to acceptance/v0.3.0
([#305](#305)). In this
release, we have updated the requirement for the
`databrickslabs/sandbox` package to version `acceptance/v0.3.0` in the
`downstreams.yml` file. This update is necessary to use the latest
version of the package, which includes several bug fixes and dependency
updates. The `databrickslabs/sandbox` package is used in the acceptance
tests, which are run as part of the CI/CD pipeline. It provides a set of
tools and utilities for developing and testing code in a sandbox
environment. The changelog for this version includes the addition of
install instructions, more git-related libraries, and the modification
of the README to include information about how to use it with the
`databricks labs sandbox` command. Specifically, the version of the
`databrickslabs/sandbox` package used in the `acceptance` job has been
updated from `acceptance/v0.1.4` to `acceptance/v0.3.0`, allowing the
integration tests to be run using the latest version of the package. The
ignore conditions for this PR ensure that Dependabot will resolve any
conflicts that may arise and can be manually triggered with the
`@dependabot rebase` command.

Dependency updates:

* Bump actions/checkout from 4.2.0 to 4.2.1
([#304](#304)).
* Updated databrickslabs/sandbox requirement to acceptance/v0.3.0
([#305](#305)).
* Bump databrickslabs/sandbox from acceptance/v0.3.0 to 0.3.1
([#307](#307)).
* Bump actions/checkout from 4.2.1 to 4.2.2
([#310](#310)).
* Bump databrickslabs/sandbox from acceptance/v0.3.1 to 0.4.2
([#315](#315)).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants