
[FEATURE] Skip the migration-progress run when the assessment job did not run yet #2816

Closed
1 of 2 tasks
Tracked by #2074
JCZuurmond opened this issue Oct 3, 2024 · 2 comments · Fixed by #2912
Assignees
Labels
feat/migration-progress: Issues related to the migration progress workflow; feat/workflow: triggered as a Databricks Job managed by UCX

Comments

@JCZuurmond
Member

JCZuurmond commented Oct 3, 2024

Summary

Skip the migration-progress run when the assessment job did not run yet.

Extra

  • Warn that the assessment needs to run during creation of the UCX catalog, if it has not already run.
  • Remove running the "assessment" workflow from the integration test `test_running_real_migration_progress_job` to reduce flakiness

Implementation

At the start of the migration-progress workflow:

  • Check if the assessment job has run
  • If not, stop the workflow
  • If so, continue the workflow
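
A minimal sketch of this check against the Databricks SDK, assuming the assessment job ID is looked up from UCX's install state (the `assessment_job_id` value below is a hypothetical stand-in; the actual fix shipped as the `VerifyProgressTracking` prerequisite check in #2912):

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.jobs import RunResultState


def assessment_has_run(ws: WorkspaceClient, assessment_job_id: int) -> bool:
    """Return True when at least one run of the assessment job finished successfully."""
    for run in ws.jobs.list_runs(job_id=assessment_job_id, completed_only=True):
        if run.state and run.state.result_state == RunResultState.SUCCESS:
            return True
    return False


ws = WorkspaceClient()
if not assessment_has_run(ws, assessment_job_id=123):  # hypothetical job id from install state
    raise SystemExit("Assessment job has not run yet; skipping the migration-progress run.")
```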
@github-project-automation github-project-automation bot moved this to Triage in UCX Oct 3, 2024
@JCZuurmond JCZuurmond added the needs-triage, feat/workflow, and feat/migration-progress labels Oct 3, 2024
@JCZuurmond JCZuurmond moved this from Triage to Active Backlog in UCX Oct 3, 2024
nfx pushed a commit that referenced this issue Oct 8, 2024
Verify UCX catalog is accessible at start of `migration-progress-experimental` workflow (#2851)

## Changes
Verify UCX catalog is accessible at start of
`migration-progress-experimental` workflow

### Linked issues

Resolves #2577
Resolves #2848
Progresses #2816

### Functionality

- [x] modified existing workflow: `migration-progress-experimental`

### Tests

- [x] added unit tests
@nfx nfx removed the needs-triage label Oct 9, 2024
nfx added a commit that referenced this issue Oct 9, 2024
* Added `google-cloud-core` to known list ([#2826](#2826)). In this release, we have incorporated the `google-cloud-core` library into our project's configuration file, specifying several modules from this library. This change is part of the resolution of issue [#1931](#1931), which pertains to working with Google Cloud services. The `google-cloud-core` library offers core functionalities for Google Cloud client libraries, including helper functions, HTTP-related functionalities, testing utilities, client classes, environment variable handling, exceptions, obsolete features, operation tracking, and version management. By adding these new modules to the known list in the configuration file, we can now utilize them in our project as needed, thereby enhancing our ability to work with Google Cloud services.
* Added `gviz-api` to known list ([#2831](#2831)). In this release, we have added the `gviz-api` library to our known library list, specifically specifying the `gviz_api` package within it. This addition enables the proper handling and recognition of components from the `gviz-api` library in the system, thereby addressing a portion of issue [#1931](#1931). While the specifics of the `gviz-api` library's implementation and usage are not described in the commit message, it is expected to provide functionality related to data visualization. This enhancement will enable us to expand our system's capabilities and provide more comprehensive solutions for our users.
* Added export CLI functionality for assessment results ([#2553](#2553)). A new `export` command-line interface (CLI) function has been added to the open-source library to export assessment results. This feature includes the addition of a new `AssessmentExporter` class in the `export.py` module, which is responsible for exporting assessment results to CSV files inside a ZIP archive. Users can specify the destination path and type of report for the exported results. A notebook utility is also included to run the export from the workspace environment, with default location, unit tests, and integration tests for the notebook utility. The `acl_migrator` method has been optimized for better performance. This new functionality provides more flexibility in exporting assessment results and improves the overall assessment functionality of the library.
* Added functional test related to bug [#2850](#2850) ([#2880](#2880)). A new functional test has been added to address a bug fix related to issue [#2850](#2850), which involves reading data from a CSV file located in a volume using Spark's readStream function. The test specifies various options including file format, schema location, header, and compression. The CSV file is loaded from '/Volumes/playground/test/demo_data/' and the schema location is set to '/Volumes/playground/test/schemas/'. Additionally, a unit test has been added and is referenced in the commit. This functional test will help ensure that the bug fix for issue [#2850](#2850) is working as expected.
* Added handling for `PermissionDenied` when retrieving `WorkspaceClient`s from account ([#2877](#2877)). In this release, the `workspace_clients` method of the `Account` class in `workspaces.py` has been updated to handle `PermissionDenied` exceptions when retrieving `WorkspaceClient`s. This change introduces a try-except block around the command retrieving the workspace client, which catches the `PermissionDenied` exception and logs a warning message if access to a workspace is denied. If no exception is raised, the workspace client is added to the list of clients as before. The commit also includes a new unit test to verify this functionality. This update addresses issue [#2874](#2874) and enhances the robustness of the `databricks labs ucx sync-workspace-info` command by ensuring it gracefully handles permission errors during workspace retrieval.
* Added testing with Python 3.13 ([#2878](#2878)). The project has been updated to include testing with Python 3.13, in addition to the previously supported versions of Python 3.10, 3.11, and 3.12. This update is reflected in the `.github/workflows/push.yml` file, which now includes '3.13' in the `pyVersion` matrix for the jobs. This addition expands the range of Python versions that the project can be tested and run on, providing increased flexibility and compatibility for users, as well as ensuring continued support for the latest versions of the Python programming language.
* Added used tables in assessment dashboard ([#2836](#2836)). In this update, we introduce a new widget to the assessment dashboard for displaying used tables, enhancing visibility into how tables are utilized within the Databricks environment. This change includes the addition of the `UsedTable` class in the `databricks.labs.ucx.source_code.base` module, which tracks table usage details in the inventory database. Two new methods, `collect_dfsas_from_query` and `collect_used_tables_from_query`, have been implemented to collect data source access and used tables information from a query, with lineage information added to the table details. Additionally, a test function, `test_dashboard_with_prepopulated_data`, has been introduced to prepopulate data for use in the dashboard, ensuring proper functionality of the new feature.
* Avoid resource conflicts in integration tests by using a random dir name ([#2865](#2865)). In this release, we have implemented changes to address resource conflicts in integration tests by introducing random directory names. The `save_locations` method in `conftest.py` has been updated to generate random directory names using the `tempfile.mkdtemp` function, based on the value of the new `make_random` parameter. Additionally, in the `test_migrate.py` file located in the `tests/integration/hive_metastore` directory, the hard-coded directory name has been replaced with a random one generated by the `make_random` function, which is used when creating external tables and specifying the external delta location. Lastly, the `test_move_tables_table_properties_mismatch_preserves_original` function in `test_table_move.py` has been updated to include a randomly generated directory name in the table's external delta and storage location, ensuring that tests can run concurrently without conflicting with each other. These changes resolve the issue described in [#2797](#2797) and improve the reliability of integration tests.
* Exclude dfsas from used tables ([#2841](#2841)). In this release, we've made significant improvements to the accuracy of table identification and handling in our system. We've excluded certain direct filesystem access patterns from being treated as tables in the current implementation, correcting a previous error. The `collect_tables` method has been updated to exclude table names matching defined direct filesystem access patterns. Additionally, we've added a new method `TableInfoNode` to wrap used tables and the nodes that use them. We've also introduced changes to handle direct filesystem access patterns more accurately, ensuring that the DataFrame API's `spark.table()` function is identified correctly, while the `spark.read.parquet()` function, representing direct filesystem access, is now ignored. These changes are supported by new unit tests to ensure correctness and reliability, enhancing the overall functionality and behavior of the system.
* Fixed known matches false positives for libraries starting with the same name as a library in the known.json ([#2860](#2860)). This commit addresses an issue of false positives in known matches for libraries that have the same name as a library in the known.json file. The `module_compatibility` function in the `known.py` file was updated to look for exact matches or parent module matches, rather than just matches at the beginning of the name. This more nuanced approach ensures that libraries with similar names are not incorrectly flagged as having compatibility issues. Additionally, the `known.json` file is now sorted when constructing module problems, indicating that the order of the entries in this file may have been relevant to the issue being resolved. To ensure the accuracy of the changes, new unit tests were added. The test suite was expanded to include tests for known and unknown compatibility, and a new load test was added for the known.json file. These changes improve the reliability of the known matches feature, which is critical for ensuring the correct identification of compatibility issues. A sketch of the corrected matching rule appears after this list.
* Make delta format case sensitive ([#2861](#2861)). In this commit, delta format handling is normalised to enhance the robustness and reliability of the code. The `TableInMount` class has been updated with a `__post_init__` method to convert the `format` attribute to uppercase. Additionally, the `Table` class in the `tables.py` file has been modified to include a `__post_init__` method that converts the `table_format` attribute to uppercase during object creation, making format comparisons case insensitive. New properties, `is_delta` and `is_hive`, have been added to the `Table` class to check if the table format is delta or hive, respectively. These changes affect the `what` method of the `AclMigrationWhat` enum class, which now checks for `is_delta` and `is_hive` instead of comparing `table_format` with `DELTA` and `HIVE`. Relevant issues [#2858](#2858) and [#2840](#2840) have been addressed, and unit tests have been included to verify the behavior. However, the changes have not been verified on the staging environment yet.
* Make delta format case sensitive ([#2862](#2862)). The recent update, derived from the resolution of issue [#2861](#2861), introduces a case-sensitive delta format to our open-source library, enhancing the precision of delta table tracking. This change impacts all table format-related code and is accompanied by additional tests for robustness. A new `location` column has been incorporated into the `table_estimates` view, facilitating the determination of delta table location. Furthermore, a new method has been implemented to extract the `location` column from the `table_estimates` view, further refining the project's functionality and accuracy in managing delta tables.
* Verify UCX catalog is accessible at start of `migration-progress-experimental` workflow ([#2851](#2851)). In this release, we have introduced a new `verify_has_ucx_catalog` method in the `Application` class of the `databricks.labs.ucx.contexts` module, which checks for the presence of a UCX catalog in the workspace and returns an instance of the `VerifyHasCatalog` class. This method is used in the `migration-progress-experimental` workflow to verify UCX catalog accessibility, addressing issues [#2577](#2577) and [#2848](#2848) and progressing work on [#2816](#2816). The `verify_has_ucx_catalog` method is decorated with `@cached_property` and takes `workspace_client` and `ucx_catalog` as arguments. Additionally, we have added a new `VerifyHasCatalog` class that checks if a specified Unity Catalog (UC) catalog exists in the workspace and updated the import statement to include a `NotFound` exception. We have also added a timeout parameter to the `validate_step` function in the `workflows.py` file, modified the `migration-progress-experimental` workflow to include a new step `verify_prerequisites` in the `table_migration` job cluster, and added unit tests to ensure the proper functioning of these changes. These updates improve the application's ability to interact with UCX catalogs and ensure their presence and accessibility during workflow execution, while also enhancing the robustness and reliability of the `migration-progress-experimental` workflow.
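
The `VerifyHasCatalog` check from [#2851](#2851) (last entry above) essentially boils down to one SDK call; a stripped-down sketch, assuming the catalog name is injected (the real UCX class may differ in its details):

```python
import logging

from databricks.sdk import WorkspaceClient
from databricks.sdk.errors import NotFound

logger = logging.getLogger(__name__)


class VerifyHasCatalog:
    """Illustrative sketch: check that a Unity Catalog catalog exists and is accessible."""

    def __init__(self, ws: WorkspaceClient, catalog_name: str) -> None:
        self._ws = ws
        self._catalog_name = catalog_name

    def verify(self) -> bool:
        try:
            self._ws.catalogs.get(self._catalog_name)  # raises NotFound when absent or inaccessible
        except NotFound:
            logger.warning(f"UCX catalog does not exist or is not accessible: {self._catalog_name}")
            return False
        return True
```

The known-matches fix from [#2860](#2860) replaces prefix matching with exact-or-parent matching on dotted module names; a self-contained sketch of that rule (the function name is illustrative, not UCX's):

```python
def module_match(candidate: str, known_module: str) -> bool:
    """Match an exact module, or a submodule of a known parent package."""
    candidate_parts = candidate.split(".")
    known_parts = known_module.split(".")
    return candidate_parts[: len(known_parts)] == known_parts


# A plain startswith() check would yield a false positive for the second case:
assert module_match("requests.sessions", "requests")
assert not module_match("requests_oauthlib", "requests")
```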
@nfx nfx mentioned this issue Oct 9, 2024
@JCZuurmond JCZuurmond self-assigned this Oct 10, 2024
@JCZuurmond
Member Author

@nfx and @asnare: I am changing the following:

Warn that the assessment needs to run during creation of the migration-progress workflow, if it has not already run.

becomes

Warn that the assessment needs to run during creation of the UCX catalog, if it has not already run.

Motivation:

The workflow is created while installing UCX, at which point the assessment has never run yet.

@JCZuurmond
Member Author

Leaving:

Remove running the "assessment" workflow from the integration test `test_running_real_migration_progress_job` to reduce flakiness

For #2573

@nfx nfx closed this as completed in #2912 Oct 10, 2024
@nfx nfx closed this as completed in d2a50cf Oct 10, 2024
@github-project-automation github-project-automation bot moved this from Active Backlog to Archive in UCX Oct 10, 2024
nfx added a commit that referenced this issue Oct 10, 2024
* Added `google-cloud-storage` to known list ([#2827](#2827)). In this release, we have added the `google-cloud-storage` library, along with its various modules and sub-modules, to our project's known list in a JSON file. Additionally, we have included the `google-crc32c` and `google-resumable-media` libraries. These libraries provide functionalities such as content addressable storage, checksum calculation, and resumable media upload and download. This change is a partial resolution to issue [#1931](#1931), which is likely related to the integration or usage of these libraries in the project. Software engineers should take note of these additions and how they may impact the project's functionality.
* Added `google-crc32c` to known list ([#2828](#2828)). With this commit, we have added the `google-crc32c` library to our system's known list, addressing part of issue [#1931](#1931). This addition enhances the overall functionality of the system by providing efficient and high-speed CRC32C computation when utilized. The `google-crc32c` library is known for its performance and reliability, and by incorporating it into our system, we aim to improve the efficiency and robustness of the CRC32C computation process. This enhancement is part of our ongoing efforts to optimize the system and ensure a more efficient experience for our end-users. With this change, users can expect faster and more reliable CRC32C computations in their applications.
* Added `holidays` to known list ([#2906](#2906)). In this release, we have expanded the known list in our open-source library to include a new `holidays` category, aimed at supporting tracking of holidays for different countries, religions, and financial institutions. This category includes several subcategories, such as calendars, countries, deprecation, financial holidays, groups, helpers, holiday base, mixins, observed holiday base, registry, and utils. Each subcategory contains an empty list, allowing for future data storage related to holidays. This change partially resolves issue [#1931](#1931), and represents a significant step towards supporting a more comprehensive range of holiday tracking needs in our library. Software engineers may utilize this new feature to build applications that require tracking and management of various holidays and related data.
* Added `htmlmin` to known list ([#2907](#2907)). In this update, we have added the `htmlmin` library to the `known.json` configuration file's list of known libraries. This addition enables the use and management of `htmlmin` and its components, including `htmlmin.command`, `htmlmin.decorator`, `htmlmin.escape`, `htmlmin.main`, `htmlmin.middleware`, `htmlmin.parser`, `htmlmin.python3html`, and `htmlmin.python3html.parser`. This change partially addresses issue [#1931](#1931), which may have been caused by the integration or usage of `htmlmin`. Software engineers can now utilize `htmlmin` and its features in their projects, thanks to this enhancement.
* Document preparing external locations when creating catalogs ([#2915](#2915)). Databricks Labs' UCX tool has been updated to incorporate the preparation of external locations when creating catalogs during the upgrade to Unity Catalog (UC). This enhancement involves the addition of new documentation outlining how to physically separate data in storage within UC, adhering to Databricks' best practices. The `create-catalogs-schemas` command has been updated to create UC catalogs and schemas based on a mapping file, allowing users to reuse previously created external locations or establish new ones outside of UCX. For data separation, users can leverage external locations when using subpaths, providing flexibility in data management during the upgrade process.
* Fixed `KeyError` from `assess_workflows` task ([#2919](#2919)). In this release, we have made significant improvements to error handling in our open-source library. We have fixed a `KeyError` in the `assess_workflows` task and modified the `_safe_infer_internal` and `_unsafe_infer_internal` methods to handle both `InferenceError` and `KeyError` during inference. When an error occurs, we now log the error message with the node and yield an `Uninferable` object. Additionally, we have updated the `do_infer_values` method of the `_LocalInferredValue` class to yield an iterator of iterables of `NodeNG` objects. We have added multiple unit tests for inferring values in Python code, including cases for handling externally defined values and their absence. These changes ensure that our library can handle errors more gracefully and provide more informative feedback during inference, making it more robust and easier to use in software engineering projects.
* Fixed `OSError: [Errno 95]` bug in `assess_workflows` task by skipping GIT-sourced workflows from static code analysis ([#2924](#2924)). In this release, we have resolved the `OSError: [Errno 95]` bug in the `assess_workflows` task that occurred while performing static code analysis on GIT-sourced workflows. A new attribute `Source` has been introduced in the `jobs` module of the `databricks.sdk.service` package to identify the source of a notebook task. If the notebook task source is GIT, a new `DependencyProblem` is raised, indicating that notebooks in GIT should be analyzed using the `databricks labs ucx lint-local-code` CLI command. The `_register_notebook` method has been updated to check if the notebook task source is GIT and return an appropriate `DependencyProblem` message. This change enhances the reliability of the `assess_workflows` task by avoiding the aforementioned bug and provides a more informative message when notebooks are sourced from GIT. This change is part of our ongoing effort to improve the project's quality and reliability and benefits software engineers who adopt the project.
* Fixed absolute path normalisation in source code analysis ([#2920](#2920)). In this release, we have addressed an issue with the Workspace API not supporting relative subpaths such as "/a/b/../c", which has been resolved by resolving workspace paths before calling the API. This fix is backward compatible and ensures the correct behavior of the source code analysis. Additionally, we have added integration tests and co-authored this commit with Eric Vergnaud and Serge Smertin. Furthermore, we have added a new test case that supports relative grand-parent paths in the dependency graph construction, utilizing a new `NotebookLoader` class. This loader is responsible for loading the notebook content and metadata given a path, and this new test case exercises the path resolution logic when a notebook depends on another notebook located two levels up in the directory hierarchy. These changes improve the robustness and reliability of the source code analysis in the presence of relative paths.
* Fixed downloading wheel libraries from DBFS on mounted Azure Storage fail with access denied ([#2918](#2918)). In this release, we have introduced enhancements to the library's handling of registering and downloading wheel libraries from DBFS on mounted Azure Storage, addressing an issue that resulted in access denied errors. The changes include improved error handling with the addition of a `try-except` block to handle potential `BadRequest` exceptions and the inclusion of three new methods to register different types of libraries. The `_register_requirements_txt` method reads requirements files and registers each library specified in the file, logging a warning message for any references to other requirements or constraints files. The `_register_whl` method creates a temporary copy of the given wheel file in the local file system and registers it, while the `_register_egg` method checks the runtime version and yields a `DependencyProblem` if the version is greater than (14, 0). These changes simplify the code and enhance error handling while addressing the reported issues related to registering libraries. The changes are implemented in the `jobs.py` file located in the `databricks/labs/ucx/source_code` directory, which also includes the import of the `BadRequest` exception class from `databricks.sdk.errors`.
* Fixed issue with migrating MANAGED hive_metastore table to UC ([#2892](#2892)). In this release, we have implemented changes to address the issue of migrating HMS (Hive Metastore) managed tables to UC (Unity Catalog) as EXTERNAL. Historically, deleting a managed table also removed the underlying data, leading to potential data loss and making the UC table unusable. The new approach provides options to mitigate these issues, including migrating as EXTERNAL or cloning the data to maintain integrity. These changes aim to prevent accidental data deletion, ensure data recoverability, and avoid inconsistencies when new data is added to either HMS or UC. We have introduced new class attributes, methods, and parameters in relevant modules such as `WorkspaceConfig`, `Table`, `migrate_tables`, and `install.py`. These modifications support the new migration strategies and allow for more flexibility in managing how tables are migrated and how data is handled. The upgrade process can be triggered using the `migrate-tables` UCX command or by running the table migration workflows deployed to the workspace. Thorough testing and documentation have been performed to minimize risks of data inconsistencies during migration. It is crucial to understand the implications of these changes and carefully consider the trade-offs before migrating managed tables to UC as EXTERNAL.
* Improve creating UC catalogs ([#2898](#2898)). In this release, the process of creating Unity Catalog (UC) catalogs has been significantly improved with the resolution of several issues discovered during manual testing. The `databricks labs ucx create-ucx-catalog/create-catalogs-schemas` command has been updated to ensure a better user experience and enhance consistency. Changes include requesting the catalog location even if the catalog already exists, eliminating multiple loops over storage locations, and improving logging and matching storage locations. The code now includes new checks to avoid requesting a catalog's storage location if it already exists and updates the behavior of the `_create_catalog_validate` and `_validate_location` methods. Additionally, new unit tests have been added to verify these changes. Under the hood, a new method, `get_catalog`, has been introduced to the `WorkspaceClient` class, and several test functions, such as `test_create_ucx_catalog_skips_when_ucx_catalogs_exists` and `test_create_all_catalogs_schemas_creates_catalogs`, have been implemented to ensure the proper functioning of the updated command. This release addresses issue [#2879](#2879) and enhances the overall process of creating UC catalogs, making it more efficient and reliable.
* Improve logging when skipping a grant in `create-catalogs-schemas` ([#2917](#2917)). In this release, the logging for skipping grants in the `_update_principal_acl` method of the `CatalogSchema` class has been improved. The code now logs a more detailed message when it cannot identify a UC grant for a specific grant object, indicating that the grant is a legacy grant that is not supported in UC, along with the grant's action type and associated object. This change provides more context for debugging and troubleshooting purposes. Additionally, the functionality of using a `DENY` grant instead of a `USAGE` grant for a specific principal and schema in the hive metastore has been introduced. The test case `test_catalog_schema_acl()` in the `test_catalog_schema.py` file has been updated to reflect this new behavior. A new test case `test_create_all_catalogs_schemas_logs_untranslatable_grant(caplog)` has also been added to verify the new logging behavior for skipping legacy grants that are not supported in UC. These changes improve the logging system and enhance the `CatalogSchema` class functionality in the open-source library.
* Verify migration progress prerequisites during UCX catalog creation ([#2912](#2912)). In this update, a new method `verify()` has been added to the `verify_progress_tracking` object in the `workspace_context` object to verify the prerequisites for UCX catalog creation. The prerequisites include the existence of a UC metastore, a UCX catalog, and a successful `assessment` job run. If the assessment job is pending or running, the code will wait up to 1 hour for it to finish before considering the prerequisites unmet. This feature includes modifications to the `create-ucx-catalog` CLI command and adds unit tests. This resolves issue [#2816](#2816) and ensures that the migration progress prerequisites are met before creating the UCX catalog. The `VerifyProgressTracking` class has been added to the `databricks.labs.ucx.progress.install` module and is used in the `Application` class. The changes include a new `timeout` argument to specify the waiting time for pending or running assessment jobs. The commit also includes several new unit tests for the `VerifyProgressTracking` class and modifications to the `test_install.py` file in the `tests/unit/progress` directory. The code has been manually tested and meets the requirements.
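
The prerequisite verification described in the last entry can be approximated as follows; this is an illustrative reconstruction against the Databricks SDK rather than the exact UCX code, and the `assessment_job_id` wiring is an assumption:

```python
import time
from datetime import timedelta

from databricks.sdk import WorkspaceClient
from databricks.sdk.errors import NotFound
from databricks.sdk.service.jobs import RunResultState


class VerifyProgressTracking:
    """Illustrative sketch: verify a UC metastore, the UCX catalog, and a successful assessment run."""

    def __init__(self, ws: WorkspaceClient, ucx_catalog: str, assessment_job_id: int) -> None:
        self._ws = ws
        self._ucx_catalog = ucx_catalog
        self._assessment_job_id = assessment_job_id

    def verify(self, timeout: timedelta = timedelta(hours=1)) -> None:
        try:
            self._ws.metastores.current()  # workspace must be attached to a UC metastore
        except NotFound as e:
            raise RuntimeWarning("Workspace is not attached to a UC metastore") from e
        try:
            self._ws.catalogs.get(self._ucx_catalog)
        except NotFound as e:
            raise RuntimeWarning(f"Missing UCX catalog: {self._ucx_catalog}") from e
        deadline = time.monotonic() + timeout.total_seconds()
        while time.monotonic() < deadline:
            # Poll completed runs; one successful run satisfies the prerequisite.
            # (The real check also distinguishes pending/running assessment runs.)
            for run in self._ws.jobs.list_runs(job_id=self._assessment_job_id, completed_only=True):
                if run.state and run.state.result_state == RunResultState.SUCCESS:
                    return
            time.sleep(60)
        raise RuntimeWarning("Assessment job did not finish successfully within the timeout")
```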
@nfx nfx mentioned this issue Oct 10, 2024