feat(ingest/databricks): ingest hive metastore by default, more docs #9601

22 changes: 22 additions & 0 deletions metadata-ingestion/docs/sources/databricks/unity-catalog_post.md
@@ -1,3 +1,25 @@


#### Advanced

##### Multiple Databricks Workspaces

If you have multiple Databricks workspaces <b>that point to the same UC metastore</b>, we suggest using separate recipes to ingest the Hive Metastore catalog and the Unity Catalog information schema.

To ingest the Hive Metastore information schema:
- Set up one ingestion recipe per workspace
- Use a platform instance equivalent to the workspace name
- Ingest only the hive_metastore catalog in the recipe using config `catalogs: ["hive_metastore"]`

To ingest the Unity Catalog information schema:
- Disable hive metastore catalog ingestion in the recipe using config `include_hive_metastore: False`
- Ideally, ingest from just one workspace
- To ingest from both workspaces (e.g. if each workspace has different permissions and therefore a restricted view of catalogs):
  - Use the same platform instance for all workspaces that use the same UC metastore
  - Ingest usage from only one workspace (you lose usage from the other workspace)
  - Use filters to ingest each catalog only once, although this shouldn't be necessary
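
The recipe split described above could be sketched as plain Python dictionaries. This is a minimal sketch rather than the project's own tooling: the `build_recipes` helper, the workspace names and URLs, and the `workspace_url` and `token` keys are assumptions made for illustration; `catalogs`, `include_hive_metastore`, and the platform instance come from the text above.

```python
from typing import Dict, List


def build_recipes(workspaces: Dict[str, str], uc_workspace: str, uc_instance: str) -> List[dict]:
    """Build one Hive Metastore recipe per workspace plus one Unity Catalog recipe.

    `workspaces` maps a workspace name to its URL (hypothetical values).
    `uc_workspace` is the single workspace used for Unity Catalog ingestion.
    `uc_instance` is the platform instance shared by workspaces on the same UC metastore.
    """
    recipes: List[dict] = []

    # One recipe per workspace, restricted to the legacy hive_metastore catalog.
    for name, url in workspaces.items():
        recipes.append({
            "source": {
                "type": "unity-catalog",
                "config": {
                    "workspace_url": url,            # assumed key name
                    "token": "${DATABRICKS_TOKEN}",  # assumed key name and placeholder
                    "platform_instance": name,       # platform instance = workspace name
                    "catalogs": ["hive_metastore"],
                },
            }
        })

    # A single Unity Catalog recipe with hive metastore ingestion disabled.
    recipes.append({
        "source": {
            "type": "unity-catalog",
            "config": {
                "workspace_url": workspaces[uc_workspace],
                "token": "${DATABRICKS_TOKEN}",
                "platform_instance": uc_instance,
                "include_hive_metastore": False,
            },
        }
    })
    return recipes
```

Each dictionary corresponds to one recipe file; serializing them to YAML would give one recipe per workspace for Hive Metastore and one for Unity Catalog.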


#### Troubleshooting

##### No data lineage captured or missing lineage
@@ -13,6 +13,11 @@
  * Ownership of or `SELECT` privilege on any tables and views you want to ingest
    * [Ownership documentation](https://docs.databricks.com/data-governance/unity-catalog/manage-privileges/ownership.html)
    * [Privileges documentation](https://docs.databricks.com/data-governance/unity-catalog/manage-privileges/privileges.html)
+ To ingest the legacy `hive_metastore` catalog (`include_hive_metastore` - enabled by default), your service principal must have all of the following:
  * `READ_METADATA` and `USAGE` privileges on the `hive_metastore` catalog
  * `READ_METADATA` and `USAGE` privileges on the schemas you want to ingest
  * `READ_METADATA` and `USAGE` privileges on the tables and views you want to ingest
  * [Hive Metastore Privileges documentation](https://docs.databricks.com/en/sql/language-manual/sql-ref-privileges-hms.html)
+ To ingest your workspace's notebooks and respective lineage, your service principal must have `CAN_READ` privileges on the folders containing the notebooks you want to ingest: [guide](https://docs.databricks.com/en/security/auth-authz/access-control/workspace-acl.html#folder-permissions).
+ To `include_usage_statistics` (enabled by default), your service principal must have `CAN_MANAGE` permissions on any SQL Warehouses you want to ingest: [guide](https://docs.databricks.com/security/auth-authz/access-control/sql-endpoint-acl.html).
+ To ingest `profiling` information with `method: ge`, you need `SELECT` privileges on all profiled tables.
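
As a rough illustration of the Hive Metastore privileges listed above, the following sketch generates the corresponding `GRANT` statements. The `hive_metastore_grants` helper, the principal, and the schema and table names are hypothetical, and the statement form is an assumption based on the Hive Metastore privileges documentation linked above; check it against that page before running anything.

```python
from typing import Dict, List


def hive_metastore_grants(principal: str, tables_by_schema: Dict[str, List[str]]) -> List[str]:
    """Return GRANT statements for USAGE and READ_METADATA on the catalog,
    each schema, and each table (hypothetical helper, for illustration only)."""
    privileges = "USAGE, READ_METADATA"
    # Catalog-level grant on the legacy hive_metastore catalog.
    statements = [f"GRANT {privileges} ON CATALOG hive_metastore TO `{principal}`;"]
    for schema, tables in tables_by_schema.items():
        # Schema-level grant.
        statements.append(
            f"GRANT {privileges} ON SCHEMA hive_metastore.{schema} TO `{principal}`;"
        )
        for table in tables:
            # Table-level (or view-level) grant.
            statements.append(
                f"GRANT {privileges} ON TABLE hive_metastore.{schema}.{table} TO `{principal}`;"
            )
    return statements
```

Calling `hive_metastore_grants("my-service-principal", {"sales": ["orders"]})` yields one statement each for the catalog, the `sales` schema, and the `orders` table.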
@@ -126,7 +126,7 @@ class UnityCatalogSourceConfig(
         description="SQL Warehouse id, for running queries. If not set, will use the default warehouse.",
     )
     include_hive_metastore: bool = pydantic.Field(
-        default=False,
+        default=True,
         description="Whether to ingest legacy `hive_metastore` catalog. This requires executing queries on SQL warehouse.",
     )
     workspace_name: Optional[str] = pydantic.Field(
@@ -135,7 +135,7 @@ class UnityCatalogSourceConfig(
     )

     include_metastore: bool = pydantic.Field(
-        default=True,
+        default=False,

Collaborator

let's call out that this is a breaking change?

It might also be good to show examples of what the hierarchy would look like when this is true/false

Collaborator

also we should say we recommend keeping this set to false, and that folks should use platform_instance instead

Collaborator Author

@mayurinehate Jan 11, 2024

I'll add the section in doc about hierarchy + update description of include_metastore config field.

As far as I understand, we'll get rid of this config, eventually. So the section would not be relevant - post that.

Collaborator Author

  1. Added to updating-datahub.md as a breaking change.
  2. Added the examples to the updating-datahub doc. I believe you meant only this, and not adding them to the source docs?
  3. We already have a pydantic validator that warns if this is not set to false, so we are covered there. I changed the error message to reflect the current state.

         description=(
             "Whether to ingest the workspace's metastore as a container and include it in all urns."
             " Changing this will affect the urns of all entities in the workspace."
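
To picture the hierarchy change discussed in the review comments, the sketch below builds the container path of a table with and without the metastore level. It is an illustrative assumption, not code from this PR: the `container_path` and `dataset_name` helpers, the dotted-name layout, and the example names are hypothetical, and the exact URN format is not shown in this diff.

```python
from typing import List, Optional


def container_path(catalog: str, schema: str, table: str,
                   metastore: Optional[str] = None,
                   include_metastore: bool = False) -> List[str]:
    """Return the hierarchy levels for a table (hypothetical helper)."""
    parts = [catalog, schema, table]
    if include_metastore:
        if metastore is None:
            raise ValueError("metastore name is required when include_metastore is True")
        # The metastore becomes the top-level container and part of every name.
        parts.insert(0, metastore)
    return parts


def dataset_name(*args, **kwargs) -> str:
    """Join the hierarchy levels into a dotted name (assumed layout)."""
    return ".".join(container_path(*args, **kwargs))
```

With `include_metastore=False` the example table resolves to `main.sales.orders`; with `include_metastore=True` and a metastore called `acme_metastore` it becomes `acme_metastore.main.sales.orders`, which is why flipping the flag changes the identity of every entity.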