[BUG]: Migrate tables errors out when migrating managed table with mount external storage #2840

Closed
1 task done
HariGS-DB opened this issue Oct 4, 2024 · 1 comment · Fixed by #3020
Labels: bug (Something isn't working), migrate/external (go/uc/upgrade SYNC EXTERNAL TABLES step), migrate/managed (go/uc/upgrade Upgrade Managed Tables and Jobs)

Comments

@HariGS-DB
Contributor

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

When running a migrate-tables job for a managed table whose location is a mount path pointing to external storage, the job fails with the error: `Description: [UPGRADE_NOT_SUPPORTED.NOT_EXTERNAL] Table is not eligible for an upgrade from Hive Metastore to Unity Catalog. Reason: Not an external table. SQLSTATE: 0AKUC`

Expected Behavior

Because the table's location is a mount path, it should not go through the SYNC route. Instead, the tables-in-mount logic should re-create the table in UC, pointing to the absolute location behind the mount.

Steps To Reproduce

No response

Cloud

AWS

Operating System

macOS

Version

latest via Databricks CLI

Relevant log output

No response

@HariGS-DB
Contributor Author

related to #2838

nfx pushed a commit that referenced this issue Oct 8, 2024
## Changes
Make delta format case sensitive

### Linked issues
Resolves #2858
Relevant to #2840

### Functionality

- [x] all table format related code

### Tests

- [x] added unit tests
- [ ] verified on staging environment (screenshot attached)
nfx pushed a commit that referenced this issue Oct 8, 2024
## Changes
Follow up on #2861, make delta format case sensitive

### Linked issues
Resolves #2858
Relevant to #2840

### Functionality

- [x] all table format related code

### Tests

- [x] added unit tests
- [ ] verified on staging environment (screenshot attached)
@nfx added the migrate/external (go/uc/upgrade SYNC EXTERNAL TABLES step) and migrate/managed (go/uc/upgrade Upgrade Managed Tables and Jobs) labels and removed the needs-triage label Oct 9, 2024
nfx added a commit that referenced this issue Oct 9, 2024
* Added `google-cloud-core` to known list ([#2826](#2826)). In this release, we have incorporated the `google-cloud-core` library into our project's configuration file, specifying several modules from this library. This change is part of the resolution of issue [#1931](#1931), which pertains to working with Google Cloud services. The `google-cloud-core` library offers core functionalities for Google Cloud client libraries, including helper functions, HTTP-related functionalities, testing utilities, client classes, environment variable handling, exceptions, obsolete features, operation tracking, and version management. By adding these new modules to the known list in the configuration file, we can now utilize them in our project as needed, thereby enhancing our ability to work with Google Cloud services.
* Added `gviz-api` to known list ([#2831](#2831)). In this release, we have added the `gviz-api` library to our known library list, specifically specifying the `gviz_api` package within it. This addition enables the proper handling and recognition of components from the `gviz-api` library in the system, thereby addressing a portion of issue [#1931](#1931). While the specifics of the `gviz-api` library's implementation and usage are not described in the commit message, it is expected to provide functionality related to data visualization. This enhancement will enable us to expand our system's capabilities and provide more comprehensive solutions for our users.
* Added export CLI functionality for assessment results ([#2553](#2553)). A new `export` command-line interface (CLI) function has been added to the open-source library to export assessment results. This feature includes the addition of a new `AssessmentExporter` class in the `export.py` module, which is responsible for exporting assessment results to CSV files inside a ZIP archive. Users can specify the destination path and type of report for the exported results. A notebook utility is also included to run the export from the workspace environment, with default location, unit tests, and integration tests for the notebook utility. The `acl_migrator` method has been optimized for better performance. This new functionality provides more flexibility in exporting assessment results and improves the overall assessment functionality of the library.
* Added functional test related to bug [#2850](#2850) ([#2880](#2880)). A new functional test has been added to address a bug fix related to issue [#2850](#2850), which involves reading data from a CSV file located in a volume using Spark's readStream function. The test specifies various options including file format, schema location, header, and compression. The CSV file is loaded from '/Volumes/playground/test/demo_data/' and the schema location is set to '/Volumes/playground/test/schemas/'. Additionally, a unit test has been added and is referenced in the commit. This functional test will help ensure that the bug fix for issue [#2850](#2850) is working as expected.
* Added handling for `PermissionDenied` when retrieving `WorkspaceClient`s from account ([#2877](#2877)). In this release, the `workspace_clients` method of the `Account` class in `workspaces.py` has been updated to handle `PermissionDenied` exceptions when retrieving `WorkspaceClient`s. This change introduces a try-except block around the command retrieving the workspace client, which catches the `PermissionDenied` exception and logs a warning message if access to a workspace is denied. If no exception is raised, the workspace client is added to the list of clients as before. The commit also includes a new unit test to verify this functionality. This update addresses issue [#2874](#2874) and enhances the robustness of the `databricks labs ucx sync-workspace-info` command by ensuring it gracefully handles permission errors during workspace retrieval.
* Added testing with Python 3.13 ([#2878](#2878)). The project has been updated to include testing with Python 3.13, in addition to the previously supported versions of Python 3.10, 3.11, and 3.12. This update is reflected in the `.github/workflows/push.yml` file, which now includes '3.13' in the `pyVersion` matrix for the jobs. This addition expands the range of Python versions that the project can be tested and run on, providing increased flexibility and compatibility for users, as well as ensuring continued support for the latest versions of the Python programming language.
* Added used tables in assessment dashboard ([#2836](#2836)). In this update, we introduce a new widget to the assessment dashboard for displaying used tables, enhancing visibility into how tables are utilized within the Databricks environment. This change includes the addition of the `UsedTable` class in the `databricks.labs.ucx.source_code.base` module, which tracks table usage details in the inventory database. Two new methods, `collect_dfsas_from_query` and `collect_used_tables_from_query`, have been implemented to collect data source access and used tables information from a query, with lineage information added to the table details. Additionally, a test function, `test_dashboard_with_prepopulated_data`, has been introduced to prepopulate data for use in the dashboard, ensuring proper functionality of the new feature.
* Avoid resource conflicts in integration tests by using a random dir name ([#2865](#2865)). In this release, we have implemented changes to address resource conflicts in integration tests by introducing random directory names. The `save_locations` method in `conftest.py` has been updated to generate random directory names using the `tempfile.mkdtemp` function, based on the value of the new `make_random` parameter. Additionally, in the `test_migrate.py` file located in the `tests/integration/hive_metastore` directory, the hard-coded directory name has been replaced with a random one generated by the `make_random` function, which is used when creating external tables and specifying the external delta location. Lastly, the `test_move_tables_table_properties_mismatch_preserves_original` function in `test_table_move.py` has been updated to include a randomly generated directory name in the table's external delta and storage location, ensuring that tests can run concurrently without conflicting with each other. These changes resolve the issue described in [#2797](#2797) and improve the reliability of integration tests.
* Exclude dfsas from used tables ([#2841](#2841)). In this release, we've made significant improvements to the accuracy of table identification and handling in our system. We've excluded certain direct filesystem access patterns from being treated as tables in the current implementation, correcting a previous error. The `collect_tables` method has been updated to exclude table names matching defined direct filesystem access patterns. Additionally, we've added a new method `TableInfoNode` to wrap used tables and the nodes that use them. We've also introduced changes to handle direct filesystem access patterns more accurately, ensuring that the DataFrame API's `spark.table()` function is identified correctly, while the `spark.read.parquet()` function, representing direct filesystem access, is now ignored. These changes are supported by new unit tests to ensure correctness and reliability, enhancing the overall functionality and behavior of the system.
* Fixed known matches false positives for libraries starting with the same name as a library in the known.json ([#2860](#2860)). This commit addresses an issue of false positives in known matches for libraries whose names start with the name of a library in the known.json file. The `module_compatibility` function in the `known.py` file was updated to look for exact matches or parent module matches, rather than just matches at the beginning of the name. This more nuanced approach ensures that libraries with similar names are not incorrectly flagged as having compatibility issues. Additionally, the `known.json` file is now sorted when constructing module problems, indicating that the order of the entries in this file may have been relevant to the issue being resolved. To ensure the accuracy of the changes, new unit tests were added. The test suite was expanded to include tests for known and unknown compatibility, and a new load test was added for the known.json file. These changes improve the reliability of the known matches feature, which is critical for ensuring the correct identification of compatibility issues.
* Make delta format case sensitive ([#2861](#2861)). In this commit, table format handling is normalized so that format checks no longer depend on letter case. The `TableInMount` class has been updated with a `__post_init__` method that converts the `format` attribute to uppercase. Likewise, the `Table` class in the `tables.py` file now includes a `__post_init__` method that converts the `table_format` attribute to uppercase during object creation, making format comparisons case insensitive. New properties, `is_delta` and `is_hive`, have been added to the `Table` class to check if the table format is delta or hive, respectively. These changes affect the `what` method of the `AclMigrationWhat` enum class, which now checks for `is_delta` and `is_hive` instead of comparing `table_format` with `DELTA` and `HIVE`. The change resolves [#2858](#2858) and is relevant to [#2840](#2840); unit tests have been included to verify the behavior. However, the changes have not been verified on the staging environment yet.
* Make delta format case sensitive ([#2862](#2862)). This update, a follow-up to [#2861](#2861), continues normalizing the case of the delta format across all table format-related code and is accompanied by additional tests for robustness. A new `location` column has been incorporated into the `table_estimates` view, facilitating the determination of delta table location. Furthermore, a new method has been implemented to extract the `location` column from the `table_estimates` view, further refining the project's functionality and accuracy in managing delta tables.
* Verify UCX catalog is accessible at start of `migration-progress-experimental` workflow ([#2851](#2851)). In this release, we have introduced a new `verify_has_ucx_catalog` method in the `Application` class of the `databricks.labs.ucx.contexts` module, which checks for the presence of a UCX catalog in the workspace and returns an instance of the `VerifyHasCatalog` class. This method is used in the `migration-progress-experimental` workflow to verify UCX catalog accessibility, addressing issues [#2577](#2577) and [#2848](#2848) and progressing work on [#2816](#2816). The `verify_has_ucx_catalog` method is decorated with `@cached_property` and takes `workspace_client` and `ucx_catalog` as arguments. Additionally, we have added a new `VerifyHasCatalog` class that checks if a specified Unity Catalog (UC) catalog exists in the workspace and updated the import statement to include a `NotFound` exception. We have also added a timeout parameter to the `validate_step` function in the `workflows.py` file, modified the `migration-progress-experimental` workflow to include a new step `verify_prerequisites` in the `table_migration` job cluster, and added unit tests to ensure the proper functioning of these changes. These updates improve the application's ability to interact with UCX catalogs and ensure their presence and accessibility during workflow execution, while also enhancing the robustness and reliability of the `migration-progress-experimental` workflow.
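
The format normalization described in the [#2861](#2861) entry can be sketched roughly as below. This is a simplified stand-in with only two fields, not the actual `Table` class from `tables.py`; the property names follow the release note, everything else is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Table:
    """Simplified stand-in; the real class carries many more fields."""
    name: str
    table_format: str

    def __post_init__(self) -> None:
        # Normalize once at construction so later comparisons ignore case.
        if isinstance(self.table_format, str):
            self.table_format = self.table_format.upper()

    @property
    def is_delta(self) -> bool:
        return self.table_format == "DELTA"

    @property
    def is_hive(self) -> bool:
        return self.table_format == "HIVE"
```

With this shape, a table crawled with `table_format="delta"` and one crawled with `"DELTA"` take the same branch in any downstream check that uses `is_delta`.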
@nfx nfx mentioned this issue Oct 9, 2024
nfx added a commit that referenced this issue Oct 9, 2024
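
The difference between prefix matching and the exact-or-parent-module matching mentioned in the [#2860](#2860) entry can be illustrated with a small sketch. Both function names are hypothetical and the known list is reduced to a plain set of module names; the real `module_compatibility` function works on the structure of `known.json`, which is not shown in this thread.

```python
def matches_by_prefix(module: str, known: set[str]) -> bool:
    """Naive check: any known name that is a string prefix counts as a match."""
    return any(module.startswith(name) for name in known)


def matches_exact_or_parent(module: str, known: set[str]) -> bool:
    """Match only if the module itself or one of its parent packages is known."""
    parts = module.split(".")
    # Candidates for "a.b.c" are "a", "a.b", and "a.b.c".
    return any(".".join(parts[:i]) in known for i in range(1, len(parts) + 1))
```

For a known set containing `google.cloud`, the prefix check also accepts an unrelated module such as `google.cloudy`, while the exact-or-parent check accepts `google.cloud.storage` and rejects `google.cloudy`.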
@nfx nfx closed this as completed in #3020 Oct 30, 2024
@nfx nfx closed this as completed in bf261ae Oct 30, 2024
@nfx nfx mentioned this issue Oct 30, 2024
nfx added a commit that referenced this issue Oct 30, 2024
* Added `--dry-run` option for ACL migrate ([#3017](#3017)). In this release, we have added a `--dry-run` option to the `migrate-acls` command in the `labs.yml` file, enabling a preview of the migration process without executing it. This feature also introduces the `hms-fed` flag, allowing migration of HMS-FED ACLs while migrating tables. The `ACLMigrator` class in the `application.py` file has been updated to include new parameters, `sql_backend` and `inventory_database`, to perform a dry run migration of Access Control Lists (ACLs). Additionally, a new `retrieve` method has been added to the `ACLMigrator` class to retrieve a list of grants based on the source and destination objects, and a `CrawlerBase` class has been introduced for fetching grants. We have also introduced a new `inferred_grants` table in the deployment schema to store inferred grants during the migration process.
* Added `WorkspacePathOwnership` to determine transitive owners for files and notebooks ([#3047](#3047)). In this release, we introduce a new class `WorkspacePathOwnership` in the `owners.py` module to determine the transitive owners for files and notebooks within a workspace. This class is added as a subclass of `Ownership` and takes `AdministratorLocator` and `WorkspaceClient` as inputs. It has methods to infer the owner from the first `CAN_MANAGE` permission level in the access control list. We also added a new property `workspace_path_ownership` to the existing `HiveMetastoreContext` class, which returns a `WorkspacePathOwnership` object initialized with an `AdministratorLocator` object and a `workspace_client`. This addition enables the determination of owners for files and notebooks within the workspace. The functionality is demonstrated through new tests added to `test_owners.py`. The new tests, `test_notebook_owner` and `test_file_owner`, create a notebook and a workspace file and verify the owner of each using the `owner_of` method. The `AdministratorLocator` is used to locate the administrators group for the workspace and the `PermissionLevel` class is used to specify the permission level for the notebook permissions.
* Added `mosaicml-streaming` to known list ([#3029](#3029)). In this release, we have expanded the range of recognized packages in our system by adding several new libraries to the known list in the JSON file. The additions include `mosaicml-streaming`, `oci`, `pynacl`, `pyopenssl`, `python-snapy`, and `zstd`. Notably, `mosaicml-streaming` has two new entries, `simulation` and `streaming`, while the other packages have a single entry each. This update addresses issue [#1931](#1931) and enhances the system's ability to identify and work with a wider variety of packages.
* Added `msal-extensions` to known list ([#3030](#3030)). In this release, we have added support for two new packages, `msal-extensions` and `portalocker`, to our project. The `msal-extensions` package includes modules for extending the Microsoft Authentication Library (MSAL), including cache lock, libsecret, osx, persistence, token cache, and windows. This addition enhances the library's authentication capabilities and provides greater flexibility when working with MSAL. The `portalocker` package offers functionalities for handling file locking with various backends such as Redis, as well as constants, exceptions, and utilities. This package enables developers to manage file locking more efficiently, preventing conflicts and ensuring data consistency. These new packages extend the range of supported packages and functionalities for handling authentication and file locking in the project, providing more options for software engineers to develop robust and secure applications.
* Added `multimethod` to known list ([#3031](#3031)). In this release, we have added support for the `multimethod` programming concept to the library. This feature has been added to the `known.json` file, which partially resolves issue [#193](#193)
* Added `murmurhash` to known list ([#3032](#3032)). A new hash function, MurmurHash, has been added to the library's supported list, addressing part of issue [#1931](#1931). The MurmurHash function includes two variants, `murmurhash` and `murmurhash.about`, with distinct functionalities. The `murmurhash` variant offers core hashing functionality, while `murmurhash.about` contains metadata or documentation related to the MurmurHash function. This integration enables developers to leverage MurmurHash for data processing tasks, enhancing the library's functionality and versatility. Users familiar with the project can now incorporate MurmurHash into their applications and configurations, taking advantage of its unique features and capabilities.
* Added `ninja` to known list ([#3050](#3050)). In this release, we have added Ninja to the known list in the `known.json` file. Ninja is a fast, lightweight build system that enables better integration and handling within the project's larger context. This change partially resolves issue [#1931](#1931), which may have been caused by challenges in integrating or using Ninja. It is important to note that this change does not modify any existing functionality or introduce new methods. The alteration is limited to including Ninja in the known list, improving the management and identification of various components within the project.
* Added `nvidia-ml-py` to known list ([#3051](#3051)). In this release, we have added support for the `nvidia-ml-py` package to our project. This addition consists of two components: `example` and `pynvml`. `example` is likely a placeholder or sample usage of the package, while `pynvml` is a module that enables interaction with NVIDIA's system management library (NVML) through Python. This enhancement is a significant step towards resolving issue [#1931](#1931), which may require the use of NVIDIA-related tools or libraries, thereby improving the project's functionality and capabilities.
* Added dashboard for tracking migration progress ([#3016](#3016)). This change introduces a new dashboard for tracking migration progress in a project, called "migration-progress", which displays real-time insights into migration progress and facilitates planning and task division. A new method, `_create_dashboard`, has been added to generate the dashboard from SQL queries in a specified folder and replace database and catalog references to match the configuration settings. The changes include updating the install to replace the UCX catalog in queries, adding a new object serializer, and updating integration tests and manual testing on a staging environment. The new functionality covers the migration of tables, views, UDFs, grants, jobs, workflow problems, clusters, pipelines, and policies. Additionally, a new SQL file has been added to track the percentage of various objects migrated and display the results in the new dashboard.
* Added grant progress encoder ([#3079](#3079)). A new `GrantsProgressEncoder` class has been introduced in the `progress/grants.py` file to encode `Grant` objects into `History` objects for the `migration-progress` workflow. This change includes the addition of unit tests to ensure proper functionality and handles cases where `Grant` objects fail to map to the Unity Catalog by adding a list of failures to the `History` object. The commit also modifies the `migration-progress` workflow to incorporate the new `GrantsProgressEncoder` class, enhancing the grant processing capabilities and improving the testing of this functionality. This change addresses issue [#3058](#3058), which was related to grant progress encoding. The `GrantsProgressEncoder` class can encode grant properties, such as the principal, action, database, schema, table, and UDF, into a format that can be written to a backend, ensuring successful migration of grants in the database.
* Added table progress encoder ([#3083](#3083)). In this release, we've added a table progress encoder to the `WorkflowTask` context to enhance the tracking of table-related operations in the `migration-progress` workflow. This new encoder, implemented in the `TableProgressEncoder` class, is connected to the `sql_backend`, `table_ownership`, and `migration_status_refresher` objects. The `GrantsProgressEncoder` class has been refactored to `GrantProgressEncoder`, with additional parameters for improved encoding of grants. We've also introduced the `refresh_table_migration_status` task to scan and record the migration status of tables and views in the inventory, storing results in the `$inventory.migration_status` inventory table. Two new unit tests have been added to ensure proper encoding and migration status handling. This change improves progress tracking and reporting in the table migration process, addressing issues [#3061](#3061) and [#3064](#3064).
* Combine static code analysis results with historical job snapshots ([#3074](#3074)). In this release, we have added a new method, `JobsProgressEncoder`, to the `WorkflowTask` class in the `databricks.labs.ucx.contexts` module. This method is used to track the progress of jobs in the context of a workflow task, replacing the existing `jobs_progress` method which only tracked the progress of grants. The `JobsProgressEncoder` method takes in additional arguments, including `inventory_database`, to provide more detailed progress tracking for jobs and is used in the `grants_progress` method to track the progress of jobs in the context of a workflow task. We have also added a new unit test for the `JobsProgressEncoder` class in the `databricks.labs.ucx` project to ensure that the encoding of job information works as expected with different types of failures and job details. Additionally, this revision introduces the ability to include workflow problem records in the historical job snapshots, providing additional context for debugging and analysis. The `JobsProgressEncoder` class is a subclass of the `ProgressEncoder` class and provides additional functionality for tracking the progress of jobs.
* Connected `WorkspacePathOwnership` with `DirectFsAccessOwnership` ([#3049](#3049)). In this revision, the `DirectFsAccessCrawler` class from the `databricks.labs.ucx.source_code.directfs_access` module is imported as `DirectFsAccessCrawler` and `DirectFsAccessOwnership`, and a new `cached_property` called `directfs_access_ownership` is added to the `TableCrawler` class. This property returns an instance of the `DirectFsAccessOwnership` class, which takes in `administrator_locator`, `workspace_path_ownership`, and `workspace_client` as arguments. Additionally, the `DirectFsAccessOwnership` class has been updated to determine DirectFS access ownership for a given table and connect with `WorkspacePathOwnership`, enhancing the tool's functionality by determining access ownership in DirectFS and improving overall system security and permissions management. The `test_directfs_access.py` file has also been updated to test the ownership of query and path records using the new `DirectFsAccessOwnership` object.
* Crawlers: append snapshots to history journal, if available ([#2743](#2743)). This commit introduces a history table to store snapshots after each crawling operation, addressing issues [#2572](#2572) and [#2573](#2573). The changes include the addition of a `HistoryLog` class, which handles appending inventory snapshots to the history table within a specific catalog, workspace, and run_id. The new methods also include a `TableMigrationStatus` class with a new class variable `__id_attributes__` to specify the attributes used to uniquely identify a table. The `destination()` method has been added to the `TableMigrationStatus` class to return the fully qualified name of the destination table. Additionally, unit and integration tests have been added and updated to ensure the functionality works as expected. The `Table`, `Job`, `Cluster`, and `UDF` classes have been updated with a new `history` attribute to store a string representing a problem associated with the respective class. The `__id_attributes__` class variable has also been added to these classes to specify the attributes used to uniquely identify them.
* Determine ownership of tables based on grants and source code ([#3066](#3066)). In this release, changes have been made to the `application.py` file in the `databricks/labs/ucx/contexts` directory to improve the accuracy of determining table ownership in the inventory. A new class `LegacyQueryOwnership` has been added to the `databricks.labs.ucx.framework.owners` module to determine the owner of a table based on the queries that write to it. The `TableOwnership` class has been updated to accept additional arguments for determining ownership based on grants, queries, and workspace paths. The `DirectFsAccessOwnership` class has also been updated to accept a new `legacy_query_ownership` argument. Additionally, a new method `owner_of_path` has been added to the `Ownership` class, and the `LegacyQueryOwnership` class has been added as a subclass of `Ownership`. A new file `ownership.py` has been introduced, which defines the `TableOwnership` and `TableMigrationOwnership` classes for determining ownership of tables and table migration records in the inventory. These changes provide more accurate and consistent ownership information for tables in the inventory.
* Ensure that pipeline assessment doesn't fail if a pipeline is deleted… ([#3034](#3034)). In this pull request, the pipelines crawler of the DLT assessment feature has been updated to improve its resiliency in the event of a pipeline deletion during crawling. Instead of failing, the crawler now logs a warning and continues to crawl when a pipeline is deleted. A new test method, `test_pipeline_disappears_during_crawl`, has been added to verify that the crawler can handle the deletion of a pipeline after listing the pipelines but before assessing them. The `assessment` and `migration-progress-experimental` workflows have been modified, and new unit tests have been added to ensure the proper functioning of the changes. Additionally, the `test_pipeline_list_with_no_config` test case has been added to check the behavior of the pipelines crawler when there is no configuration present. This pull request aims to enhance the robustness of the assessment feature and ensure its continued operation even in the face of unexpected pipeline deletions.
* Fixed `UnicodeDecodeError` when fetching init scripts ([#3103](#3103)). In this release, we have enhanced the error handling capabilities of the open-source library by fixing a `UnicodeDecodeError` issue that occurred when fetching init scripts in the `_get_init_script_data` method. To address this, we have added `UnicodeDecodeError` and `FileNotFoundError` to the list of exceptions handled in the method. Now, when any of these exceptions occur, the method will return `None` and a warning message will be logged instead of raising an unhandled exception. This change ensures that the function operates smoothly and provides better error handling in the library, without modifying the behavior of the `_check_cluster_init_script` method, which remains unchanged and continues to verify the correct setup of init scripts in the cluster.
* Fixed `UnknownHostException` on the specified KeyVault ([#3102](#3102)). In this release, we have made significant improvements to the Azure Key Vault integration, addressing issues [#3102](#3102) and [#3090](#3090). We have resolved an `UnknownHostException` problem in a specific KeyVault and implemented error handling for invalid Azure Key Vaults, ensuring more robust and reliable system behavior. Additionally, we have expanded `NotFound` exception handling to include the `InvalidState` exception. When the Azure Key Vault is in an invalid state, the corresponding secret will be skipped, and a warning message will be logged. This enhancement provides a more comprehensive solution to handle various exceptions that may arise when dealing with secrets stored in Azure Key Vaults.
* Fixed `Unsupported schema: XXX` error on `assess_workflows` ([#3104](#3104)). The recent change to the open-source library addresses the `Unsupported schema: XXX` error in the `assess_workflows` function. This was achieved by introducing a new exception class, `InvalidPath`, in the `WorkspaceCache` mixin, and substituting `ValueError` with `InvalidPath` in the `jobs.py` file. The `InvalidPath` exception is used to provide a more specific error message for unsupported schema paths. The `WorkspaceCache` mixin now includes an `InvalidPath` exception for caching workspace paths. The error handling in the `jobs.py` file has been modified to raise `InvalidPath` instead of `ValueError` for better error messages. Additionally, the `test_cached_workspace_path.py` file has updates for testing the `WorkspaceCache` object, including the addition of the `InvalidPath` exception for non-absolute paths, and a new test function for this exception. The `WorkspaceCache` class has an ellipsis in the `__init__` method, indicating additional initialization code not shown in this diff.
* Fixed `assert curr.location is not None` ([#3105](#3105)). In this release, we have addressed a potential issue in the `_external_locations` method which failed to check if the location of the current Hive table is `None` before proceeding. This oversight could result in unnecessary exceptions when accessing the location of a Hive table. To rectify this, we have introduced a check for `None` that will bypass the current iteration of the loop if the location is not set, thereby improving the robustness of the code. The method continues to return a list of `ExternalLocation` objects, each representing a Hive table or partition location with the corresponding number of tables or partitions present. The `ExternalLocation` class remains unchanged in this commit. This improvement will ensure that the method functions smoothly and avoids errors when dealing with Hive tables that do not have a location set.
* Fixed dynamic import issue ([#3053](#3053)). In this release, we've addressed an issue related to dynamic import inference in our open-source library. Previously, the code did not infer import names when using `importlib.import_module(some_name)`. This has been resolved by implementing a new method, `_make_sources_for_import_call_node`, which infers the import name from the provided node argument. Additionally, we've introduced new functions, `get_global(self, name: str)`, `_adjust_node_for_import_member(self, name: str, match_node: type, node: NodeNG)`, and updated the `_matches(self, node: NodeNG, depth: int)` method to handle attributes as global names. A new unit test, `test_graph_imports_dynamic_import()`, has been added to ensure the proper functioning of the dynamic import feature. Moreover, a new function `is_from_module` has been introduced to check if a given name is from a specific module. This commit, co-authored by Eric Vergnaud, significantly enhances the code's ability to infer imports in dynamic import scenarios.
* Fixed issue with migrating `MANAGED` hive_metastore table to UC for `CONVERT_TO_EXTERNAL` scenario ([#3020](#3020)). This change updates the process for converting a managed Hive Metastore (HMS) table to external in the `CONVERT_TO_EXTERNAL` scenario. The functionality is split into a separate workflow task, executed from a non-Unity Catalog (UC) cluster, and is tested with unit and integration tests. The migrate table function for external sync ensures the table is migrated as external to UC post-conversion. The changes add a new workflow and modify an existing one, renaming the `migrate_tables` function to `convert_managed_hms_to_external`. The new function handles the conversion of managed HMS tables to external, and updates the `object_type` property of the table in the inventory database to `EXTERNAL` after the conversion is completed. The pull request resolves issue [#2840](#2840) and removes the existing functionality of applying grants during the migration process.
* Fixed issue with table location on storage root ([#3094](#3094)). In this release, we have implemented changes to address an issue related to the incorrect identification of the parent folder as an external location when there is a single table with a prefix that matches a parent folder. Additionally, we have improved the storage and retrieval of table locations in the root directory of a storage service by adding support for additional S3 bucket URL formats in the unit tests for the Hive Metastore. This includes handling S3 bucket URLs that do not include a specific file or path, and those with a path that does not include a file. We have also added new test cases for these URL formats and modified existing ones to include them. These changes ensure correct identification of external locations and improve functionality and flexibility of the Hive Metastore's support for external table locations. The new methods added are not explicitly stated, but they likely involve functions for parsing and processing the new S3 bucket URL formats.
* Fixed snapshot loading for DFSA and used-table crawlers ([#3046](#3046)). This commit resolves issues related to snapshot loading for the DFSA and used-table crawlers when using the spark-based lsql backend. The root cause was the use of `.as_dict()` to convert rows to dictionaries, which is unavailable in the spark-based lsql backend. The fix involves replacing this method with `.asDict()`. Additionally, integration and unit tests were updated to include snapshot loading for these crawlers, and a typo in a test name was corrected. The changes are confined to the test_queries.py file and do not affect other parts of the project. No new methods were added, and existing functionality changes were limited to updating the snapshot loading process.
* Ignore failed inference codes when presenting results to Databricks Runtime ([#3087](#3087)). In this release, the `lsp_plugin.py` file has been updated in the `databricks/labs/ucx/source_code` directory to improve the user experience in the notebook editor. The changes include disabling certain advice codes from being propagated, specifically: 'cannot-autofix-table-reference', 'default-format-changed-in-dbr8', 'dependency-not-found', 'not-supported', 'notebook-run-cannot-compute-value', 'sql-parse-error', 'sys-path-cannot-compute-value', and 'unsupported-magic-line'. A new variable `DEBUG_MESSAGE_CODES` has been introduced to store the list of advice codes to be ignored, and the list comprehension that creates `diagnostics` in the `pylsp_lint` function has been updated to exclude these codes. These updates aim to reduce the number of unnecessary error messages and improve the accuracy of the linter for supported codes.
* Improve scan tables in mounts ([#2767](#2767)). In this release, the `scan-tables-in-mounts` functionality in the hive metastore has been significantly improved, providing a more robust and comprehensive solution. Previously, the implementation skipped most directories, only finding 8 tables, but this issue has been addressed, allowing the updated version to parse many more tables. The commit includes bug fixes and the addition of new unit tests. The reviewer is encouraged to refactor the code in future iterations to use the `os` module instead of `dbutils` for listing directories, enabling parallelization and improving scalability. The commit resolves issue [#2540](#2540) and updates the `scan-tables-in-mounts-experimental` workflow. While manual and unit tests have been added and verified, integration tests are still pending implementation. The co-author of this commit is Dan Zafar.
* Removed `WorkflowLinter` as it is part of the `Assessment` workflow ([#3036](#3036)). In this release, the `WorkflowLinter` has been removed as it is now integrated into the `Assessment` workflow, addressing issue [#3035](#3035). This change simplifies the codebase, removing the need for a separate linter while maintaining essential functionality for ensuring Unity Catalog compatibility. The linter's functionality has been merged with other parts of the assessment workflow, with results persisted in the `$inventory_database.workflow_problems` and `$inventory_database.directfs_in_paths` tables. The `assess_workflows` and `assess_dashboards` methods have been updated accordingly, removing `WorkflowLinter` usage. Additionally, the `ExperimentalWorkflowLinter` class has been removed from the `workflows.py` file, along with its associated methods `lint_all_workflows` and `lint_all_queries`. The `test_running_real_workflow_linter_job` function has also been removed due to the integration of the `WorkflowLinter` into the `Assessment` workflow. Manual testing has been conducted to ensure the correctness of these changes and the continued proper functioning of the assessment workflow.
* Updated permissions crawling so that it doesn't fail if a secret scope disappears during crawling ([#3070](#3070)). This commit enhances the open-source library by updating the permissions crawling process for secret scopes, addressing the issue of task failure when a secret scope disappears before ACL retrieval. The `assessment` workflow has been modified to incorporate these updates, and new unit tests have been added, including one that simulates the disappearance of a secret scope during crawling. The `PermissionsCrawler` class and the `Threads.gather` method have been improved to handle such cases, logging a warning instead of failing the task. The return type of the `get_crawler_tasks` method has been updated to `Iterable[Callable[[], Permissions | None]]`. These changes improve the reliability and robustness of the permissions crawling process for secret scopes, ensuring task completion in the face of unexpected scope disappearances.
* Updated sqlglot requirement from <25.26,>=25.5.0 to >=25.5.0,<25.27 ([#3041](#3041)). In this pull request, we have updated the `sqlglot` library requirement to incorporate the latest version, which includes various bug fixes, refactors, and new features. The latest version now supports the `TO_DOUBLE` and `TRY_TO_TIMESTAMP` functions in Snowflake and the `EDIT_DISTANCE` (Levenshtein) function in BigQuery. Moreover, it addresses an issue with the `ARRAY JOIN` function in ClickHouse and changes the Hive dialect hierarchy. We encourage users to update to this latest version to benefit from these enhancements and fixes.
* Updated sqlglot requirement from <25.27,>=25.5.0 to >=25.5.0,<25.28 ([#3048](#3048)). In this release, we have updated the requirement for the `sqlglot` library to a version greater than or equal to 25.5.0 and less than 25.28. This change was made to allow for the use of the latest features and bug fixes available in `sqlglot`, while excluding versions 25.28 and later. The new version of `sqlglot` offers several improvements, including but not limited to enhanced query optimization, expanded support for various SQL dialects, and better error handling. We recommend that all users upgrade to the latest version of `sqlglot` to take advantage of these new features and improvements.
* Updated sqlglot requirement from <25.28,>=25.5.0 to >=25.5.0,<25.29 ([#3093](#3093)). This release includes an update to the `sqlglot` dependency, changing the version requirement from 25.5.0 up to but excluding 25.28, to a range that includes 25.5.0 up to but excluding 25.29. This change allows for the use of the latest `sqlglot` version and includes all the updates and bug fixes from this library since the previous version. The pull request provides a list of changes made in `sqlglot` since the previous version, as well as a list of relevant commits. Dependabot has been configured to handle any merge conflicts for this pull request and includes commands to trigger various Dependabot actions. This update was made by Dependabot and is indicated by a signed-off-by line.
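
The advice-code filtering described in the [#3087](#3087) entry above can be sketched roughly as follows. This is a minimal illustration, not the actual `lsp_plugin.py` code: the `Advice` dataclass, the `visible_diagnostics` helper, and the `some-supported-code` value are hypothetical stand-ins, and only the `DEBUG_MESSAGE_CODES` name and the eight listed codes come from the release note.

```python
from dataclasses import dataclass

# Advice codes named in the release note as no longer propagated to the editor.
DEBUG_MESSAGE_CODES = {
    "cannot-autofix-table-reference",
    "default-format-changed-in-dbr8",
    "dependency-not-found",
    "not-supported",
    "notebook-run-cannot-compute-value",
    "sql-parse-error",
    "sys-path-cannot-compute-value",
    "unsupported-magic-line",
}


@dataclass(frozen=True)
class Advice:
    """Hypothetical stand-in for a linter result; not the real UCX class."""

    code: str
    message: str


def visible_diagnostics(advices):
    """Keep only advices whose code is not in the ignore set."""
    return [advice for advice in advices if advice.code not in DEBUG_MESSAGE_CODES]


advices = [
    Advice("sql-parse-error", "could not parse query"),
    Advice("some-supported-code", "hypothetical actionable finding"),
]
shown = visible_diagnostics(advices)
print([advice.code for advice in shown])
```

Using a set for the ignored codes keeps the membership check constant-time, which is a design choice of this sketch rather than something the release note states.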

Dependency updates:

 * Updated sqlglot requirement from <25.26,>=25.5.0 to >=25.5.0,<25.27 ([#3041](#3041)).
 * Updated sqlglot requirement from <25.27,>=25.5.0 to >=25.5.0,<25.28 ([#3048](#3048)).
 * Updated sqlglot requirement from <25.28,>=25.5.0 to >=25.5.0,<25.29 ([#3093](#3093)).
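
To make the effect of these specifier bumps concrete, here is a small sketch of how a range such as `>=25.5.0,<25.29` admits or rejects a version. It is a simplified stand-in that only handles plain dotted numeric versions; the helper names `parse_version` and `in_range` are made up for this illustration, and real tooling follows the fuller PEP 440 rules.

```python
def parse_version(text: str, width: int = 3) -> tuple[int, ...]:
    """Turn '25.28' into (25, 28, 0) so that 25.28 and 25.28.0 compare equal."""
    parts = [int(piece) for piece in text.split(".")]
    parts += [0] * (width - len(parts))
    return tuple(parts)


def in_range(version: str, lower: str, upper: str) -> bool:
    """Inclusive lower bound, exclusive upper bound, as in '>=lower,<upper'."""
    return parse_version(lower) <= parse_version(version) < parse_version(upper)


# Under the older requirement (<25.28) a 25.28.x release is rejected...
print(in_range("25.28.1", "25.5.0", "25.28"))
# ...while the requirement from #3093 (<25.29) admits it.
print(in_range("25.28.1", "25.5.0", "25.29"))
```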
nfx added a commit that referenced this issue Oct 30, 2024
* Added `--dry-run` option for ACL migrate
([#3017](#3017)). In this
release, we have added a `--dry-run` option to the `migrate-acls`
command in the `labs.yml` file, enabling a preview of the migration
process without executing it. This feature also introduces the `hms-fed`
flag, allowing migration of HMS-FED ACLs while migrating tables. The
`ACLMigrator` class in the `application.py` file has been updated to
include new parameters, `sql_backend` and `inventory_database`, to
perform a dry run migration of Access Control Lists (ACLs).
Additionally, a new `retrieve` method has been added to the
`ACLMigrator` class to retrieve a list of grants based on the source and
destination objects, and a `CrawlerBase` class has been introduced for
fetching grants. We have also introduced a new `inferred_grants` table
in the deployment schema to store inferred grants during the migration
process.
* Added `WorkspacePathOwnership` to determine transitive owners for
files and notebooks
([#3047](#3047)). In this
release, we introduce a new class `WorkspacePathOwnership` in the
`owners.py` module to determine the transitive owners for files and
notebooks within a workspace. This class is added as a subclass of
`Ownership` and takes `AdministratorLocator` and `WorkspaceClient` as
inputs. It has methods to infer the owner from the first `CAN_MANAGE`
permission level in the access control list. We also added a new
property `workspace_path_ownership` to the existing
`HiveMetastoreContext` class, which returns a `WorkspacePathOwnership`
object initialized with an `AdministratorLocator` object and a
`workspace_client`. This addition enables the determination of owners
for files and notebooks within the workspace. The functionality is
demonstrated through new tests added to `test_owners.py`. The new tests,
`test_notebook_owner` and `test_file_owner`, create a notebook and a
workspace file and verify the owner of each using the `owner_of` method.
The `AdministratorLocator` is used to locate the administrators group
for the workspace and the `PermissionLevel` class is used to specify the
permission level for the notebook permissions.
* Added `mosaicml-streaming` to known list
([#3029](#3029)). In this
release, we have expanded the range of recognized packages in our system
by adding several new libraries to the known list in the JSON file. The
additions include `mosaicml-streaming`, `oci`, `pynacl`, `pyopenssl`,
`python-snapy`, and `zstd`. Notably, `mosaicml-streaming` has two new
entries, `simulation` and `streaming`, while the other packages have a
single entry each. This update addresses issue
[#1931](#1931) and enhances
the system's ability to identify and work with a wider variety of
packages.
* Added `msal-extensions` to known list
([#3030](#3030)). In this
release, we have added support for two new packages, `msal-extensions`
and `portalocker`, to our project. The `msal-extensions` package
includes modules for extending the Microsoft Authentication Library
(MSAL), including cache lock, libsecret, osx, persistence, token cache,
and windows. This addition enhances the library's authentication
capabilities and provides greater flexibility when working with MSAL.
The `portalocker` package offers functionalities for handling file
locking with various backends such as Redis, as well as constants,
exceptions, and utilities. This package enables developers to manage
file locking more efficiently, preventing conflicts and ensuring data
consistency. These new packages extend the range of supported packages
and functionalities for handling authentication and file locking in the
project, providing more options for software engineers to develop robust
and secure applications.
* Added `multimethod` to known list
([#3031](#3031)). In this
release, we have added support for the `multimethod` programming concept
to the library. This feature has been added to the `known.json` file,
which partially resolves issue
[#193](#193)
* Added `murmurhash` to known list
([#3032](#3032)). A new hash
function, MurmurHash, has been added to the library's supported list,
addressing part of issue
[#1931](#1931). The
MurmurHash function includes two variants, `murmurhash` and
`murmurhash.about`, with distinct functionalities. The `murmurhash`
variant offers core hashing functionality, while `murmurhash.about`
contains metadata or documentation related to the MurmurHash function.
This integration enables developers to leverage MurmurHash for data
processing tasks, enhancing the library's functionality and versatility.
Users familiar with the project can now incorporate MurmurHash into
their applications and configurations, taking advantage of its unique
features and capabilities.
* Added `ninja` to known list
([#3050](#3050)). In this
release, we have added Ninja to the known list in the `known.json` file.
Ninja is a fast, lightweight build system that enables better
integration and handling within the project's larger context. This
change partially resolves issue
[#1931](#1931), which may
have been caused by challenges in integrating or using Ninja. It is
important to note that this change does not modify any existing
functionality or introduce new methods. The alteration is limited to
including Ninja in the known list, improving the management and
identification of various components within the project.
* Added `nvidia-ml-py` to known list
([#3051](#3051)). In this
release, we have added support for the `nvidia-ml-py` package to our
project. This addition consists of two components: `example` and
`pynvml`. `example` is likely a placeholder or sample usage of the
package, while `pynvml` is a module that enables interaction with
the NVIDIA Management Library (NVML) through Python. This
enhancement is a significant step towards resolving issue
[#1931](#1931), which may
require the use of NVIDIA-related tools or libraries, thereby improving
the project's functionality and capabilities.
* Added dashboard for tracking migration progress
([#3016](#3016)). This
change introduces a new dashboard for tracking migration progress in a
project, called "migration-progress", which displays real-time insights
into migration progress and facilitates planning and task division. A
new method, `_create_dashboard`, has been added to generate the
dashboard from SQL queries in a specified folder and replace database
and catalog references to match the configuration settings. The changes
include updating the install to replace the UCX catalog in queries,
adding a new object serializer, and updating integration tests and
manual testing on a staging environment. The new functionality covers
the migration of tables, views, UDFs, grants, jobs, workflow problems,
clusters, pipelines, and policies. Additionally, a new SQL file has been
added to track the percentage of various objects migrated and display
the results in the new dashboard.
* Added grant progress encoder
([#3079](#3079)). A new
`GrantsProgressEncoder` class has been introduced in the
`progress/grants.py` file to encode `Grant` objects into `History`
objects for the `migration-progress` workflow. This change includes the
addition of unit tests to ensure proper functionality and handles cases
where `Grant` objects fail to map to the Unity Catalog by adding a list
of failures to the `History` object. The commit also modifies the
`migration-progress` workflow to incorporate the new
`GrantsProgressEncoder` class, enhancing the grant processing
capabilities and improving the testing of this functionality. This
change addresses issue
[#3058](#3058), which was
related to grant progress encoding. The `GrantsProgressEncoder` class
can encode grant properties, such as the principal, action, database,
schema, table, and UDF, into a format that can be written to a backend,
ensuring successful migration of grants in the database.
* Added table progress encoder
([#3083](#3083)). In this
release, we've added a table progress encoder to the `WorkflowTask`
context to enhance the tracking of table-related operations in the
`migration-progress` workflow. This new encoder, implemented in the
`TableProgressEncoder` class, is connected to the `sql_backend`,
`table_ownership`, and `migration_status_refresher` objects. The
`GrantsProgressEncoder` class has been refactored to
`GrantProgressEncoder`, with additional parameters for improved encoding
of grants. We've also introduced the `refresh_table_migration_status`
task to scan and record the migration status of tables and views in the
inventory, storing results in the `$inventory.migration_status`
inventory table. Two new unit
tests have been added to ensure proper encoding and migration status
handling. This change improves progress tracking and reporting in the
table migration process, addressing issues
[#3061](#3061) and
[#3064](#3064).
* Combine static code analysis results with historical job snapshots
([#3074](#3074)). In this
release, we have added a new method, `JobsProgressEncoder`, to the
`WorkflowTask` class in the `databricks.labs.ucx.contexts` module. This
method is used to track the progress of jobs in the context of a
workflow task, replacing the existing `jobs_progress` method which only
tracked the progress of grants. The `JobsProgressEncoder` method takes
in additional arguments, including `inventory_database`, to provide more
detailed progress tracking for jobs and is used in the `grants_progress`
method to track the progress of jobs in the context of a workflow task.
We have also added a new unit test for the `JobsProgressEncoder` class
in the `databricks.labs.ucx` project to ensure that the encoding of job
information works as expected with different types of failures and job
details. Additionally, this revision introduces the ability to include
workflow problem records in the historical job snapshots, providing
additional context for debugging and analysis. The `JobsProgressEncoder`
class is a subclass of the `ProgressEncoder` class and provides
additional functionality for tracking the progress of jobs.
* Connected `WorkspacePathOwnership` with `DirectFsAccessOwnership`
([#3049](#3049)). In this
revision, the `DirectFsAccessCrawler` class from the
`databricks.labs.ucx.source_code.directfs_access` module is imported as
`DirectFsAccessCrawler` and `DirectFsAccessOwnership`, and a new
`cached_property` called `directfs_access_ownership` is added to the
`TableCrawler` class. This property returns an instance of the
`DirectFsAccessOwnership` class, which takes in `administrator_locator`,
`workspace_path_ownership`, and `workspace_client` as arguments.
Additionally, the `DirectFsAccessOwnership` class has been updated to
determine DirectFS access ownership for a given table and connect with
`WorkspacePathOwnership`, enhancing the tool's functionality by
determining access ownership in DirectFS and improving overall system
security and permissions management. The `test_directfs_access.py` file
has also been updated to test the ownership of query and path records
using the new `DirectFsAccessOwnership` object.
* Crawlers: append snapshots to history journal, if available
([#2743](#2743)). This
commit introduces a history table to store snapshots after each crawling
operation, addressing issues
[#2572](#2572) and
[#2573](#2573). The changes
include the addition of a `HistoryLog` class, which handles appending
inventory snapshots to the history table within a specific catalog,
workspace, and run_id. The new methods also include a
`TableMigrationStatus` class with a new class variable
`__id_attributes__` to specify the attributes used to uniquely identify
a table. The `destination()` method has been added to the
`TableMigrationStatus` class to return the fully qualified name of the
destination table. Additionally, unit and integration tests have been
added and updated to ensure the functionality works as expected. The
`Table`, `Job`, `Cluster`, and `UDF` classes have been updated with a
new `history` attribute to store a string representing a problem
associated with the respective class. The `__id_attributes__` class
variable has also been added to these classes to specify the attributes
used to uniquely identify them.
* Determine ownership of tables based on grants and source code
([#3066](#3066)). In this
release, changes have been made to the `application.py` file in the
`databricks/labs/ucx/contexts` directory to improve the accuracy of
determining table ownership in the inventory. A new class
`LegacyQueryOwnership` has been added to the
`databricks.labs.ucx.framework.owners` module to determine the owner of
a table based on the queries that write to it. The `TableOwnership`
class has been updated to accept additional arguments for determining
ownership based on grants, queries, and workspace paths. The
`DirectFsAccessOwnership` class has also been updated to accept a new
`legacy_query_ownership` argument. Additionally, a new method
`owner_of_path` has been added to the `Ownership` class, and the
`LegacyQueryOwnership` class has been added as a subclass of
`Ownership`. A new file `ownership.py` has been introduced, which
defines the `TableOwnership` and `TableMigrationOwnership` classes for
determining ownership of tables and table migration records in the
inventory. These changes provide more accurate and consistent
ownership information for tables in the inventory.
* Ensure that pipeline assessment doesn't fail if a pipeline is deleted…
([#3034](#3034)). In this
pull request, the pipelines crawler of the DLT assessment feature has
been updated to improve its resiliency in the event of a pipeline
deletion during crawling. Instead of failing, the crawler now logs a
warning and continues to crawl when a pipeline is deleted. A new test
method, `test_pipeline_disappears_during_crawl`, has been added to
verify that the crawler can handle the deletion of a pipeline after
listing the pipelines but before assessing them. The `assessment` and
`migration-progress-experimental` workflows have been modified, and new
unit tests have been added to ensure the proper functioning of the
changes. Additionally, the `test_pipeline_list_with_no_config` test case
has been added to check the behavior of the pipelines crawler when there
is no configuration present. This pull request aims to enhance the
robustness of the assessment feature and ensure its continued operation
even in the face of unexpected pipeline deletions.
* Fixed `UnicodeDecodeError` when fetching init scripts
([#3103](#3103)). In this
release, we have enhanced the error handling capabilities of the
open-source library by fixing a `UnicodeDecodeError` issue that occurred
when fetching init scripts in the `_get_init_script_data` method. To
address this, we have added `UnicodeDecodeError` and `FileNotFoundError`
to the list of exceptions handled in the method. Now, when any of these
exceptions occur, the method will return `None` and a warning message
will be logged instead of raising an unhandled exception. This change
ensures that the function operates smoothly and provides better error
handling in the library, without modifying the behavior of the
`_check_cluster_init_script` method, which remains unchanged and
continues to verify the correct setup of init scripts in the cluster.
* Fixed `UnknownHostException` on the specified KeyVault
([#3102](#3102)). In this
release, we have made significant improvements to the Azure Key Vault
integration, addressing issues
[#3102](#3102) and
[#3090](#3090). We have
resolved an `UnknownHostException` problem in a specific KeyVault and
implemented error handling for invalid Azure Key Vaults, ensuring more
robust and reliable system behavior. Additionally, we have expanded
`NotFound` exception handling to include the `InvalidState` exception.
When the Azure Key Vault is in an invalid state, the corresponding
secret will be skipped, and a warning message will be logged. This
enhancement provides a more comprehensive solution to handle various
exceptions that may arise when dealing with secrets stored in Azure Key
Vaults.
* Fixed `Unsupported schema: XXX` error on `assess_workflows`
([#3104](#3104)). The recent
change addresses the `Unsupported schema: XXX` error in the
`assess_workflows` function. It introduces a new exception class,
`InvalidPath`, in the `WorkspaceCache` mixin and raises it instead of
`ValueError` in the `jobs.py` file, which gives a more specific error
message for unsupported schema paths. Additionally, the
`test_cached_workspace_path.py` file has been updated to test the
`WorkspaceCache` object, including a new test function that covers the
`InvalidPath` exception for non-absolute paths. The `WorkspaceCache`
class has an ellipsis in the `__init__` method, indicating additional
initialization code not shown in this diff.
* Fixed `assert curr.location is not None`
([#3105](#3105)). In this
release, we have addressed a potential issue in the
`_external_locations` method which failed to check if the location of
the current Hive table is `None` before proceeding. This oversight could
result in unnecessary exceptions when accessing the location of a Hive
table. To rectify this, we have introduced a check for `None` that will
bypass the current iteration of the loop if the location is not set,
thereby improving the robustness of the code. The method continues to
return a list of `ExternalLocation` objects, each representing a Hive
table or partition location with the corresponding number of tables or
partitions present. The `ExternalLocation` class remains unchanged in
this commit. This improvement will ensure that the method functions
smoothly and avoids errors when dealing with Hive tables that do not
have a location set.
* Fixed dynamic import issue
([#3053](#3053)). In this
release, we've addressed an issue related to dynamic import inference in
our open-source library. Previously, the code did not infer import names
when using `importlib.import_module(some_name)`. This has been resolved
by implementing a new method, `_make_sources_for_import_call_node`,
which infers the import name from the provided node argument.
Additionally, we've introduced new functions, `get_global(self, name:
str)`, `_adjust_node_for_import_member(self, name: str, match_node:
type, node: NodeNG)`, and updated the `_matches(self, node: NodeNG,
depth: int)` method to handle attributes as global names. A new unit
test, `test_graph_imports_dynamic_import()`, has been added to ensure
the proper functioning of the dynamic import feature. Moreover, a new
function `is_from_module` has been introduced to check if a given name
is from a specific module. This commit, co-authored by Eric Vergnaud,
significantly enhances the code's ability to infer imports in dynamic
import scenarios.
* Fixed issue with migrating `MANAGED` hive_metastore table to UC for
`CONVERT_TO_EXTERNAL` scenario
([#3020](#3020)). This
change updates the process for converting a managed Hive Metastore (HMS)
table to external in the `CONVERT_TO_EXTERNAL` scenario. The functionality
is split into a separate workflow task, executed from a non-Unity
Catalog (UC) cluster, and is tested with unit and integration tests. The
migrate table function for external sync ensures the table is migrated
as external to UC post-conversion. The changes add a new workflow and
modify an existing one, renaming its `migrate_tables` function to
`convert_managed_hms_to_external`. The new function handles the
conversion of managed HMS tables to external and updates the
`object_type` property
of the table in the inventory database to `EXTERNAL` after the
conversion is completed. The pull request resolves issue
[#2840](#2840) and removes
the existing functionality of applying grants during the migration
process.
* Fixed issue with table location on storage root
([#3094](#3094)). In this
release, we have implemented changes to address an issue related to the
incorrect identification of the parent folder as an external location
when there is a single table with a prefix that matches a parent folder.
Additionally, we have improved the storage and retrieval of table
locations in the root directory of a storage service by adding support
for additional S3 bucket URL formats in the unit tests for the Hive
Metastore. This includes handling S3 bucket URLs that do not include a
specific file or path, and those with a path that does not include a
file. We have also added new test cases for these URL formats and
modified existing ones to include them. These changes ensure correct
identification of external locations and improve functionality and
flexibility of the Hive Metastore's support for external table
locations. The new methods added are not explicitly stated, but they
likely involve functions for parsing and processing the new S3 bucket
URL formats.
* Fixed snapshot loading for DFSA and used-table crawlers
([#3046](#3046)). This
commit resolves issues related to snapshot loading for the DFSA and
used-table crawlers when using the spark-based lsql backend. The root
cause was the use of `.as_dict()` to convert rows to dictionaries, which
is unavailable in the spark-based lsql backend. The fix involves
replacing this method with `.asDict()`. Additionally, integration and
unit tests were updated to include snapshot loading for these crawlers,
and a typo in a test name was corrected. The changes are confined to the
test_queries.py file and do not affect other parts of the project. No
new methods were added, and existing functionality changes were limited
to updating the snapshot loading process.
* Ignore failed inference codes when presenting results to Databricks
Runtime ([#3087](#3087)). In
this release, the `lsp_plugin.py` file has been updated in the
`databricks/labs/ucx/source_code` directory to improve the user
experience in the notebook editor. The changes include disabling certain
advice codes from being propagated, specifically:
`cannot-autofix-table-reference`, `default-format-changed-in-dbr8`,
`dependency-not-found`, `not-supported`,
`notebook-run-cannot-compute-value`, `sql-parse-error`,
`sys-path-cannot-compute-value`, and `unsupported-magic-line`. A new
variable `DEBUG_MESSAGE_CODES` has been introduced to store the list of
advice codes to be ignored, and the list comprehension that creates
`diagnostics` in the `pylsp_lint` function has been updated to exclude
these codes. These updates aim to reduce the number of unnecessary error
messages and improve the accuracy of the linter for supported codes.
* Improve scan tables in mounts
([#2767](#2767)). In this
release, the `scan-tables-in-mounts` functionality in the hive metastore
has been significantly improved, providing a more robust and
comprehensive solution. Previously, the implementation skipped most
directories, only finding 8 tables, but this issue has been addressed,
allowing the updated version to parse many more tables. The commit
includes bug fixes and the addition of new unit tests. The reviewer is
encouraged to refactor the code in future iterations to use the `os`
module instead of `dbutils` for listing directories, enabling
parallelization and improving scalability. The commit resolves issue
[#2540](#2540) and updates
the `scan-tables-in-mounts-experimental` workflow. While manual and unit
tests have been added and verified, integration tests are still pending
implementation. The co-author of this commit is Dan Zafar.
* Removed `WorkflowLinter` as it is part of the `Assessment` workflow
([#3036](#3036)). In this
release, the `WorkflowLinter` has been removed as it is now integrated
into the `Assessment` workflow, addressing issue
[#3035](#3035). This change
simplifies the codebase, removing the need for a separate linter while
maintaining essential functionality for ensuring Unity Catalog
compatibility. The linter's functionality has been merged with other
parts of the assessment workflow, with results persisted in the
`$inventory_database.workflow_problems` and
`$inventory_database.directfs_in_paths` tables. The `assess_workflows`
and `assess_dashboards` methods have been updated accordingly, removing
`WorkflowLinter` usage. Additionally, the `ExperimentalWorkflowLinter`
class has been removed from the `workflows.py` file, along with its
associated methods `lint_all_workflows` and `lint_all_queries`. The
`test_running_real_workflow_linter_job` function has also been removed
due to the integration of the `WorkflowLinter` into the `Assessment`
workflow. Manual testing has been conducted to ensure the correctness of
these changes and the continued proper functioning of the assessment
workflow.
* Updated permissions crawling so that it doesn't fail if a secret scope
disappears during crawling
([#3070](#3070)). This
commit enhances the open-source library by updating the permissions
crawling process for secret scopes, addressing the issue of task failure
when a secret scope disappears before ACL retrieval. The `assessment`
workflow has been modified to incorporate these updates, and new unit
tests have been added, including one that simulates the disappearance of
a secret scope during crawling. The `PermissionsCrawler` class and the
`Threads.gather` method have been improved to handle such cases, logging
a warning instead of failing the task. The return type of the
`get_crawler_tasks` method has been updated to
`Iterable[Callable[[], Permissions | None]]`. These changes improve the
reliability and
robustness of the permissions crawling process for secret scopes,
ensuring task completion in the face of unexpected scope disappearances.
* Updated sqlglot requirement from <25.26,>=25.5.0 to >=25.5.0,<25.27
([#3041](#3041)). In this
pull request, we have updated the sqlglot library requirement to
incorporate the latest version, which includes various bug fixes,
refactors, and exciting new features. The latest version now supports
the TO_DOUBLE and TRY_TO_TIMESTAMP functions in Snowflake and the
EDIT_DISTANCE (Levenshtein) function in BigQuery. Moreover, we've
addressed an issue with the ARRAY JOIN function in ClickHouse and made
changes to the hive dialect hierarchy. We encourage users to update to
this latest version to benefit from these enhancements and fixes,
ensuring optimal performance and functionality of the library.
* Updated sqlglot requirement from <25.27,>=25.5.0 to >=25.5.0,<25.28
([#3048](#3048)). In this
release, we have updated the requirement for the `sqlglot` library to a
version greater than or equal to 25.5.0 and less than 25.28. This change
was made to allow for the use of the latest features and bug fixes
available in `sqlglot`, including the 25.27.x releases, while still
excluding version 25.28 and later. The new version of `sqlglot` offers several
improvements, including but not limited to enhanced query optimization,
expanded support for various SQL dialects, and better error handling. We
recommend that all users upgrade to the latest version of `sqlglot` to
take advantage of these new features and improvements.
* Updated sqlglot requirement from <25.28,>=25.5.0 to >=25.5.0,<25.29
([#3093](#3093)). This
release includes an update to the `sqlglot` dependency, changing the
version requirement from 25.5.0 up to but excluding 25.28, to a range
that includes 25.5.0 up to but excluding 25.29. This change allows for
the use of the latest `sqlglot` version and includes all the updates and
bug fixes from this library since the previous version. The pull request
provides a list of changes made in `sqlglot` since the previous version,
as well as a list of relevant commits. Dependabot has been configured to
handle any merge conflicts for this pull request and includes commands
to trigger various Dependabot actions. This update was made by
Dependabot and is indicated by a signed-off-by line.
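
To illustrate the `UnicodeDecodeError` fix for init scripts described above ([#3103](#3103)), here is a minimal sketch of the pattern: catch the decode and missing-file errors, log a warning, and return `None` so the caller can skip the script. The function name `read_init_script` and the use of a local `Path` are assumptions made for the example; this is not the actual `_get_init_script_data` implementation.

```python
import logging
from pathlib import Path
from typing import Optional

logger = logging.getLogger(__name__)


def read_init_script(path: Path) -> Optional[str]:
    """Return the script text, or None if it is missing or not valid UTF-8.

    Hypothetical helper for illustration only; not part of UCX.
    """
    try:
        return path.read_text(encoding="utf-8")
    except (FileNotFoundError, UnicodeDecodeError) as e:
        # Log and skip rather than letting one bad script fail the whole run.
        logger.warning("Cannot read init script %s: %s", path, e)
        return None
```

A caller would treat a `None` result as "nothing to check" and move on to the next script.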

Dependency updates:

* Updated sqlglot requirement from <25.26,>=25.5.0 to >=25.5.0,<25.27
([#3041](#3041)).
* Updated sqlglot requirement from <25.27,>=25.5.0 to >=25.5.0,<25.28
([#3048](#3048)).
* Updated sqlglot requirement from <25.28,>=25.5.0 to >=25.5.0,<25.29
([#3093](#3093)).
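
For readers unsure what a range such as `>=25.5.0,<25.28` admits, the sketch below checks plain dotted versions against a lower (inclusive) and upper (exclusive) bound. It is a simplified illustration that assumes purely numeric versions with up to three components; the helper names are hypothetical, and real installers apply the full PEP 440 rules (for example via `packaging.specifiers.SpecifierSet`), including pre-release handling that this sketch ignores.

```python
from typing import Tuple


def parse(version: str) -> Tuple[int, ...]:
    """Turn '25.28' into (25, 28, 0); assumes numeric dotted versions only."""
    parts = tuple(int(p) for p in version.split("."))
    # Pad to three components so '25.28' and '25.28.0' compare as equal.
    return parts + (0,) * (3 - len(parts))


def in_range(version: str, lower: str, upper: str) -> bool:
    """True if lower <= version < upper, i.e. '>=lower,<upper'."""
    return parse(lower) <= parse(version) < parse(upper)


# The range from the second update above: >=25.5.0,<25.28
print(in_range("25.27.0", "25.5.0", "25.28"))  # 25.27.x is admitted
print(in_range("25.28.0", "25.5.0", "25.28"))  # 25.28.0 is excluded
```

Each update in the list only raises the exclusive upper bound by one minor version, so the lower bound of 25.5.0 stays the same throughout.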