
Make delta format case insensitive #2861

Merged

nfx merged 8 commits into main from fix/delta-format-case-sensitive on Oct 8, 2024
Conversation

@JCZuurmond (Member) commented Oct 8, 2024

Changes

Make delta format case insensitive

Linked issues

Resolves #2858
Relevant to #2840

Functionality

  • all table format related code

Tests

  • added unit tests
  • verified on staging environment (screenshot attached)

@JCZuurmond added the migrate/external go/uc/upgrade SYNC EXTERNAL TABLES step label Oct 8, 2024
@JCZuurmond self-assigned this Oct 8, 2024
@JCZuurmond requested a review from a team as a code owner October 8, 2024 09:15
@JCZuurmond (Member, Author) left a comment:

See comments

@@ -334,6 +334,10 @@ class TableInMount:
format: str
is_partitioned: bool

def __post_init__(self) -> None:
@JCZuurmond (Member, Author):

@nfx: I added this fail-safe to always keep the format as an uppercase string.
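The diff above truncates the method body. A minimal sketch of what this fail-safe plausibly looks like, mirroring the `Table.__post_init__` change shown further below (an assumption, not the verbatim source):

from dataclasses import dataclass

@dataclass
class TableInMount:
    format: str
    is_partitioned: bool

    def __post_init__(self) -> None:
        # Assumed body: normalize the format so that "delta", "Delta"
        # and "DELTA" all compare equal downstream.
        if isinstance(self.format, str):
            self.format = self.format.upper()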

@@ -3,7 +3,7 @@ WITH table_stats AS (
SELECT
`database`,
object_type,
table_format AS `format`,
UPPER(table_format) AS `format`,
@JCZuurmond (Member, Author):

Required for backwards compatibility, i.e. for users who have already populated the UCX inventory.
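For illustration, a hypothetical sketch of the failure mode the `UPPER()` guards against: rows persisted by an older UCX version may carry a lowercase format, so a query comparing against 'DELTA' without normalization would silently miss them. Field names below are simplified, not the actual inventory schema:

# Hypothetical inventory rows, as persisted before and after this PR.
rows = [
    {"database": "db1", "table_format": "delta"},  # written by an older crawler
    {"database": "db2", "table_format": "DELTA"},  # written after this PR
]

# Without normalization only the second row matches; with UPPER() both do.
assert sum(r["table_format"] == "DELTA" for r in rows) == 1
assert sum(r["table_format"].upper() == "DELTA" for r in rows) == 2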

@@ -195,6 +195,7 @@ def test_tables_returning_error_when_show_tables(caplog):
'table,dbfs_root,what',
[
(Table("a", "b", "c", "MANAGED", "DELTA", location="dbfs:/somelocation/tablename"), True, What.DBFS_ROOT_DELTA),
(Table("a", "b", "c", "MANAGED", "delta", location="dbfs:/somelocation/tablename"), True, What.DBFS_ROOT_DELTA),
@JCZuurmond (Member, Author):

@HariGS-DB: This test fails without the code changes in this PR; it could be responsible for #2840.
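A self-contained sketch of the kind of parametrized check added here, using a simplified stand-in for UCX's `Table` dataclass (names and fields abridged; the real test also asserts the `What` classification shown in the diff above):

from dataclasses import dataclass

import pytest

@dataclass
class Table:  # simplified stand-in for the UCX Table dataclass
    name: str
    table_format: str

    def __post_init__(self) -> None:
        if isinstance(self.table_format, str):
            self.table_format = self.table_format.upper()

@pytest.mark.parametrize("fmt", ["DELTA", "delta", "Delta"])
def test_format_is_normalized(fmt):
    assert Table("tablename", fmt).table_format == "DELTA"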

@nfx (Collaborator) left a comment:

lgtm

@@ -80,12 +80,22 @@ class Table:

UPGRADED_FROM_WS_PARAM: typing.ClassVar[str] = "upgraded_from_workspace_id"

def __post_init__(self) -> None:
if isinstance(self.table_format, str): # Should not happen according to type hint, still safer
self.table_format = self.table_format.upper()
@nfx (Collaborator):

okay...

import dataclasses
from dataclasses import dataclass

@dataclass
class Foo:
    table_format: str = dataclasses.field()
    def __post_init__(self):
        if isinstance(self.table_format, str):
            self.table_format = self.table_format.upper()

def test_x():
    f = Foo("parquet")
    assert f.table_format == "PARQUET"

    f = dataclasses.replace(f, table_format="csv")
    assert f.table_format == "CSV"
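Note that `dataclasses.replace` constructs a new instance, so `__post_init__` runs again; that is why the uppercase normalization survives field replacement in the example above.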

@nfx merged commit 4162830 into main Oct 8, 2024
4 of 6 checks passed
@nfx deleted the fix/delta-format-case-sensitive branch October 8, 2024 09:27
@JCZuurmond mentioned this pull request Oct 8, 2024
nfx pushed a commit that referenced this pull request Oct 8, 2024
## Changes
Follow up on #2861, make delta format case insensitive

### Linked issues
Resolves #2858
Relevant to #2840

### Functionality

- [x] all table format related code

### Tests

- [x] added unit tests
- [ ] verified on staging environment (screenshot attached)
@JCZuurmond changed the title from "Make delta format case sensitive" to "Make delta format case insensitive" Oct 8, 2024
nfx added a commit that referenced this pull request Oct 9, 2024
* Added `google-cloud-core` to known list ([#2826](#2826)). In this release, we have incorporated the `google-cloud-core` library into our project's configuration file, specifying several modules from this library. This change is part of the resolution of issue [#1931](#1931), which pertains to working with Google Cloud services. The `google-cloud-core` library offers core functionalities for Google Cloud client libraries, including helper functions, HTTP-related functionalities, testing utilities, client classes, environment variable handling, exceptions, obsolete features, operation tracking, and version management. By adding these new modules to the known list in the configuration file, we can now utilize them in our project as needed, thereby enhancing our ability to work with Google Cloud services.
* Added `gviz-api` to known list ([#2831](#2831)). In this release, we have added the `gviz-api` library to our known library list, specifically specifying the `gviz_api` package within it. This addition enables the proper handling and recognition of components from the `gviz-api` library in the system, thereby addressing a portion of issue [#1931](#1931). While the specifics of the `gviz-api` library's implementation and usage are not described in the commit message, it is expected to provide functionality related to data visualization. This enhancement will enable us to expand our system's capabilities and provide more comprehensive solutions for our users.
* Added export CLI functionality for assessment results ([#2553](#2553)). A new `export` command-line interface (CLI) function has been added to the open-source library to export assessment results. This feature includes the addition of a new `AssessmentExporter` class in the `export.py` module, which is responsible for exporting assessment results to CSV files inside a ZIP archive. Users can specify the destination path and type of report for the exported results. A notebook utility is also included to run the export from the workspace environment, with default location, unit tests, and integration tests for the notebook utility. The `acl_migrator` method has been optimized for better performance. This new functionality provides more flexibility in exporting assessment results and improves the overall assessment functionality of the library.
* Added functional test related to bug [#2850](#2850) ([#2880](#2880)). A new functional test has been added to address a bug fix related to issue [#2850](#2850), which involves reading data from a CSV file located in a volume using Spark's readStream function. The test specifies various options including file format, schema location, header, and compression. The CSV file is loaded from '/Volumes/playground/test/demo_data/' and the schema location is set to '/Volumes/playground/test/schemas/'. Additionally, a unit test has been added and is referenced in the commit. This functional test will help ensure that the bug fix for issue [#2850](#2850) is working as expected.
* Added handling for `PermissionDenied` when retrieving `WorkspaceClient`s from account ([#2877](#2877)). In this release, the `workspace_clients` method of the `Account` class in `workspaces.py` has been updated to handle `PermissionDenied` exceptions when retrieving `WorkspaceClient`s. This change introduces a try-except block around the command retrieving the workspace client, which catches the `PermissionDenied` exception and logs a warning message if access to a workspace is denied. If no exception is raised, the workspace client is added to the list of clients as before. The commit also includes a new unit test to verify this functionality. This update addresses issue [#2874](#2874) and enhances the robustness of the `databricks labs ucx sync-workspace-info` command by ensuring it gracefully handles permission errors during workspace retrieval.
* Added testing with Python 3.13 ([#2878](#2878)). The project has been updated to include testing with Python 3.13, in addition to the previously supported versions of Python 3.10, 3.11, and 3.12. This update is reflected in the `.github/workflows/push.yml` file, which now includes '3.13' in the `pyVersion` matrix for the jobs. This addition expands the range of Python versions that the project can be tested and run on, providing increased flexibility and compatibility for users, as well as ensuring continued support for the latest versions of the Python programming language.
* Added used tables in assessment dashboard ([#2836](#2836)). In this update, we introduce a new widget to the assessment dashboard for displaying used tables, enhancing visibility into how tables are utilized within the Databricks environment. This change includes the addition of the `UsedTable` class in the `databricks.labs.ucx.source_code.base` module, which tracks table usage details in the inventory database. Two new methods, `collect_dfsas_from_query` and `collect_used_tables_from_query`, have been implemented to collect data source access and used tables information from a query, with lineage information added to the table details. Additionally, a test function, `test_dashboard_with_prepopulated_data`, has been introduced to prepopulate data for use in the dashboard, ensuring proper functionality of the new feature.
* Avoid resource conflicts in integration tests by using a random dir name ([#2865](#2865)). In this release, we have implemented changes to address resource conflicts in integration tests by introducing random directory names. The `save_locations` method in `conftest.py` has been updated to generate random directory names using the `tempfile.mkdtemp` function, based on the value of the new `make_random` parameter. Additionally, in the `test_migrate.py` file located in the `tests/integration/hive_metastore` directory, the hard-coded directory name has been replaced with a random one generated by the `make_random` function, which is used when creating external tables and specifying the external delta location. Lastly, the `test_move_tables_table_properties_mismatch_preserves_original` function in `test_table_move.py` has been updated to include a randomly generated directory name in the table's external delta and storage location, ensuring that tests can run concurrently without conflicting with each other. These changes resolve the issue described in [#2797](#2797) and improve the reliability of integration tests.
* Exclude dfsas from used tables ([#2841](#2841)). In this release, we've made significant improvements to the accuracy of table identification and handling in our system. We've excluded certain direct filesystem access patterns from being treated as tables in the current implementation, correcting a previous error. The `collect_tables` method has been updated to exclude table names matching defined direct filesystem access patterns. Additionally, we've added a new method `TableInfoNode` to wrap used tables and the nodes that use them. We've also introduced changes to handle direct filesystem access patterns more accurately, ensuring that the DataFrame API's `spark.table()` function is identified correctly, while the `spark.read.parquet()` function, representing direct filesystem access, is now ignored. These changes are supported by new unit tests to ensure correctness and reliability, enhancing the overall functionality and behavior of the system.
* Fixed known-matches false positives for libraries starting with the same name as a library in known.json ([#2860](#2860)). This commit addresses false positives in known matches for libraries whose names share a prefix with a library in the known.json file. The `module_compatibility` function in `known.py` was updated to look for exact matches or parent-module matches, rather than just matches at the beginning of the name. This more nuanced approach ensures that libraries with similar names are not incorrectly flagged as having compatibility issues. Additionally, known.json is now sorted when constructing module problems, indicating that the order of the entries in this file may have been relevant to the issue being resolved. To ensure the accuracy of the changes, new unit tests were added: the test suite was expanded to include tests for known and unknown compatibility, and a new load test was added for known.json. These changes improve the reliability of the known-matches feature, which is critical for correctly identifying compatibility issues.
* Make delta format case insensitive ([#2861](#2861)). In this commit, handling of the delta format is made case insensitive to improve the robustness and reliability of the code. The `TableInMount` class has been updated with a `__post_init__` method that converts the `format` attribute to uppercase, and the `Table` class in `tables.py` has likewise gained a `__post_init__` method that converts the `table_format` attribute to uppercase during object creation, making format comparisons case insensitive. New properties, `is_delta` and `is_hive`, have been added to the `Table` class to check whether the table format is Delta or Hive, respectively. These changes affect the `what` method of the `AclMigrationWhat` enum class, which now checks `is_delta` and `is_hive` instead of comparing `table_format` with `"DELTA"` and `"HIVE"` (a simplified sketch follows this changelog). Relevant issues [#2858](#2858) and [#2840](#2840) have been addressed, and unit tests have been included to verify the behavior. However, the changes have not yet been verified on the staging environment.
* Make delta format case insensitive ([#2862](#2862)). This follow-up to [#2861](#2861) extends the case-insensitive handling of the delta format, improving the precision of delta table tracking. The change impacts all table-format-related code and is accompanied by additional tests for robustness. A new `location` column has been incorporated into the `table_estimates` view, facilitating the determination of delta table locations, and a new method has been implemented to extract the `location` column from that view, further refining the project's functionality and accuracy in managing delta tables.
* Verify UCX catalog is accessible at start of `migration-progress-experimental` workflow ([#2851](#2851)). In this release, we have introduced a new `verify_has_ucx_catalog` method in the `Application` class of the `databricks.labs.ucx.contexts` module, which checks for the presence of a UCX catalog in the workspace and returns an instance of the `VerifyHasCatalog` class. This method is used in the `migration-progress-experimental` workflow to verify UCX catalog accessibility, addressing issues [#2577](#2577) and [#2848](#2848) and progressing work on [#2816](#2816). The `verify_has_ucx_catalog` method is decorated with `@cached_property` and takes `workspace_client` and `ucx_catalog` as arguments. Additionally, we have added a new `VerifyHasCatalog` class that checks if a specified Unity Catalog (UC) catalog exists in the workspace and updated the import statement to include a `NotFound` exception. We have also added a timeout parameter to the `validate_step` function in the `workflows.py` file, modified the `migration-progress-experimental` workflow to include a new step `verify_prerequisites` in the `table_migration` job cluster, and added unit tests to ensure the proper functioning of these changes. These updates improve the application's ability to interact with UCX catalogs and ensure their presence and accessibility during workflow execution, while also enhancing the robustness and reliability of the `migration-progress-experimental` workflow.
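Taken together, the [#2861](#2861) entry above describes a normalization plus two convenience properties. A minimal sketch, assuming the field order used in the tests earlier in this PR (an illustration, not the verbatim UCX source):

from dataclasses import dataclass

@dataclass
class Table:
    catalog: str
    database: str
    name: str
    object_type: str
    table_format: str
    location: str | None = None

    def __post_init__(self) -> None:
        # Defensive check: the type hint already says str, but normalize anyway.
        if isinstance(self.table_format, str):
            self.table_format = self.table_format.upper()

    @property
    def is_delta(self) -> bool:
        return self.table_format == "DELTA"

    @property
    def is_hive(self) -> bool:
        return self.table_format == "HIVE"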
@nfx mentioned this pull request Oct 9, 2024
nfx added a commit that referenced this pull request Oct 9, 2024
Labels
migrate/external go/uc/upgrade SYNC EXTERNAL TABLES step
Development

Successfully merging this pull request may close these issues.

[BUG]: Case sensitive dashboard queries creating discrepancies on Assessment Report