-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(ingestion/tableau): introduce project_path_pattern #10855
feat(ingestion/tableau): introduce project_path_pattern #10855
Conversation
Important Review skippedAuto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the WalkthroughThe Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant TableauConfig
participant Tableau Project
User->>TableauConfig: Configure project_path_pattern
User->>TableauConfig: Call method to filter projects
TableauConfig->>Tableau Project: Retrieve project path
TableauConfig->>TableauConfig: Check if project_path matches pattern
alt Project Path Allowed
TableauConfig->>User: Project is allowed
else Project Path Denied
TableauConfig->>User: Project is denied
end
Poem
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (1)
- metadata-ingestion/src/datahub/ingestion/source/tableau.py (3 hunks)
Additional comments not posted (2)
metadata-ingestion/src/datahub/ingestion/source/tableau.py (2)
296-304
: New fieldproject_path_pattern
added.The new field
project_path_pattern
allows filtering Tableau projects based on their full path using allow and deny patterns. This provides more granular control over project ingestion, addressing the limitations of the existingproject_pattern
.
694-700
: Modified method_is_allowed_project
to includeproject_path_pattern
.The method
_is_allowed_project
now incorporates checks against the newproject_path_pattern
in addition to the existingproject_pattern
. This ensures that both the project name and the full project path are considered when determining if a project should be ingested.
|
||
for project in list_of_skip_projects: | ||
if ( | ||
project.parent_id in self.tableau_project_registry |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I'm missing something but from my understanding, this can never happen.
If we only add projects to list_of_skip_projects
that are not allowed, there will never be a project in this list that is not denied as it is checked below.
When checking if a project is allowed, the following is executed downstream
datahub/metadata-ingestion/src/datahub/configuration/common.py
Lines 261 to 262 in ebe7b2d
if self._denied(string): | |
return False |
What do you think? Am I missing something here?
EDIT: I checked this again because the failing tests indicate that this indeed does something. I think I know how it could work. The method _is_denied_project
only checks the deny
part of the pattern in the config and _is_allowed_project
checks both the allowed
part and the deny
part of the pattern under the hood. So if for a project the path
or the name
is not allowed but not explicitly denied, this part here would still add it if the parent project was added (and extract_project_hierarchy
was enabled). But I don't really understand the use case for this. Could someone shed some light on this? Why do we need extract_project_hierarchy
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might have been implemented based on user requests to include child projects by default unless explicitly denied. Could you please revert these changes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's the thing, I think we always include child projects with or without the property extract_project_hierarchy
. So I don't really understand why we need it. From what I understand, the only use case, where this property actually has an impact is the one I mentioned above:
So if for a project the path or the name is not allowed but not explicitly denied, this part here would still add it if the parent project was added (and extract_project_hierarchy was enabled).
I can revert the changes, but IMO we should have a further look at this part as it does things that are hard to understand, at least for me. Could be that as the code evolved over time, we don't even need it anymore.
Again, maybe I'm missing something and for some reason this makes total sense, but it could be worth checking again. @sid-acryl could you do that together with @hsheth2?
@@ -681,30 +691,16 @@ def form_path(project_id: str) -> List[str]: | |||
|
|||
def _is_allowed_project(self, project: TableauProject) -> bool: | |||
# Either project name or project path should exist in allow | |||
is_allowed: bool = self.config.project_pattern.allowed( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We encountered some issues with the current implementation. The problem is that both project paths and project names can be controlled with the same pattern.
For example, if you deny a nested project by its path e.g. "Project1/Nested1/Test (Catalog)" but allow all projects with "(Catalog)" in the name (by a regex). The denied project "Project1/Nested1/Test (Catalog)" would still be ingested as the project name is allowed. Or if we for example want to deny all projects with "(Sensitive)" in the path this can't be achieved because the names might be allowed.
Ideally, I think it would make sense to completely separate project_path_pattern
and project_name_pattern
to avoid these cases. But this would break backwards compatibility.
What do you think about this?
…ure/add-exclusive-project-path-pattern
|
||
for project in list_of_skip_projects: | ||
if ( | ||
project.parent_id in self.tableau_project_registry |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might have been implemented based on user requests to include child projects by default unless explicitly denied. Could you please revert these changes?
…ure/add-exclusive-project-path-pattern
@@ -320,6 +320,16 @@ class TableauConfig( | |||
"By default, all projects will be ingested.", | |||
) | |||
|
|||
project_path_pattern: AllowDenyPattern = Field( | |||
default=AllowDenyPattern.allow_all(), | |||
description="Filter for specific Tableau projects by checking their full path. For example, use 'My Project/Nested Project' to ingest a nested project with name 'Nested Project'. " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please update the description as per below text:
"Filters Tableau projects by their full path. For instance, 'My Project/Nested Project' targets a specific nested project named 'Nested Project'."
" Unlike project_pattern, this field only checks the project path, not both the path and project name."
" This is useful when you need to exclude all nested projects under a particular project."
" You can allow or deny projects by specifying their path or a regular expression pattern."
" Deny patterns always override allow patterns."
" By default, all projects are ingested."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, changed it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just minor change in description, otherwise looks ok.
): | ||
logger.debug(f"Project {project.name} is added in project registry") | ||
self.tableau_project_registry[project.id] = project | ||
# We rely on automatic browse paths (v2) when creating containers. That's why we need to sort the projects here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reverted the changes to include the extract_project_hierarchy part again, as discussed. However, I slightly refactored it so we do the sorting below at the very end for all the projects to ingest. Let me know if you see any problem with this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fields project_pattern
and project_path_pattern
might cause confusion for users.
Please implement the following changes:
- Mark project_pattern as deprecated and recommend using project_path_pattern going forward. refer
_deprecate_projects = pydantic_field_deprecated("projects")
in tableau config - Add validation to ensure project_pattern and project_path_pattern are mutually exclusive.
I guess they both serve different purposes and that's why I wouldn't just deprecate |
I preferred project-path based ingestion because of following reason:
so better to have a one way of doing the things to avoid confusion later. |
You're right, since we can also use the new |
logger.warning( | ||
"project_pattern is not set but projects is set. projects is deprecated, please use " | ||
"project_pattern instead." | ||
"projects is deprecated, please use " "project_path_pattern instead." | ||
) | ||
logger.info("Initializing project_pattern from projects") | ||
values["project_pattern"] = AllowDenyPattern( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sid-acryl I now deprecated project_pattern
but I wanted to point out that for now this validation rule still initializes project_pattern
when projects
is set. Ideally, we would remove the old projects
completely now but I tried it and it would require quite some refactoring in the tests.
…ect#10855) Co-authored-by: Yanik Häni <Yanik.Haeni1@swisscom.com>
Description
We have the need to exclude certain projects and all their nested projects from the ingestion. Unfortunately, this is not possible with the current
project_pattern
as this currently allows a project to be ingested if the project name OR the project path is allowed by the pattern, see:datahub/metadata-ingestion/src/datahub/ingestion/source/tableau.py
Lines 690 to 692 in 74a543d
For example, if you have a project structure as follows: Project1/Nested1/Test1, Project1/Nested1/Test2, Project1/Nested2/Test1 and you would like to deny "Nested1" and all its nested projects, we couldn't find a way to do this. Using a regex like
^.*Nested1.*$
in the deny section ofproject_pattern
would still ingest the nested projects because the project name is allowed. And a project is allowed if either the name or the project path is allowed.This PR introduces a new config property
project_path_pattern
that exclusively checks the project path of a project when deciding if a project is allowed to be ingested or not. This enables the mentioned use case to be configured.The current implementation of
project_pattern
also seems to have some flaws that I would like to discuss with you (see comments).Let me know what you think.
Many thanks.
Checklist
Summary by CodeRabbit
project_path_pattern
, supporting both allow and deny patterns via Regex.