Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(ingest/abs): Adding azure blob storage ingestion source #10813

Merged
merged 15 commits into from
Jul 17, 2024
Merged

feat(ingest/abs): Adding azure blob storage ingestion source #10813

merged 15 commits into from
Jul 17, 2024

Conversation

joelmataKPN
Copy link
Contributor

@joelmataKPN joelmataKPN commented Jul 1, 2024

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

Summary by CodeRabbit

  • New Features

    • Introduced a connector for ingesting Azure Blob Storage datasets into DataHub.
    • Added support for defining datasets using Path Specs.
    • Implemented utility functions for handling Azure Blob Storage operations.
  • Documentation

    • Updated documentation for ingesting AWS S3 datasets to include references to "Path Specs."
    • Added detailed guides for configuring and using the Azure Blob Storage connector.
  • Chores

    • Updated dependencies to include Azure-specific libraries.

@github-actions github-actions bot added ingestion PR or Issue related to the ingestion of metadata community-contribution PR or Issue raised by member(s) of DataHub Community labels Jul 1, 2024
Copy link
Contributor

coderabbitai bot commented Jul 1, 2024

Walkthrough

The latest updates introduce a new connector for ingesting Azure Blob Storage (ABS) datasets into DataHub. This enhancement includes comprehensive configurations, profiling options, and utilities for managing ABS URIs. Additionally, the documentation has been updated to reflect the new features, along with minor adjustments for AWS S3.

Changes

File/Path Change Summary
metadata-ingestion/docs/sources/abs/README.md Introduces a new connector for ingesting Azure Blob Storage datasets into DataHub and maps files/folders to datasets, supporting various file types.
metadata-ingestion/docs/sources/abs/abs.md Introduces the concept of Path Specs for defining datasets, including structure, inclusion/exclusion of paths, and examples.
metadata-ingestion/docs/sources/s3/README.md Minor modification in the documentation for ingesting AWS S3 datasets into DataHub, referencing "Path Specs."
metadata-ingestion/src/datahub/ingestion/source/abs/config.py Introduces a DataLakeSourceConfig class for defining configurations for a data lake source, including platform information, Azure configuration, ingestion settings, profiling options, Spark configurations, schema inference, and path specification validation logic.
metadata-ingestion/src/datahub/ingestion/source/abs/datalake_profiler_config.py Introduces a DataLakeProfilerConfig model for data lake profiling, including settings for enabling profiling, specifying operation configs, defining various profiling options, and logic to normalize field-level settings based on table-level profiling.
metadata-ingestion/src/datahub/ingestion/source/abs/report.py Introduces a DataLakeSourceReport class to track files scanned and filtered, along with methods to report file scanning and dropped files.
metadata-ingestion/src/datahub/ingestion/source/abs/source.py Introduces the ABSSource class for ingesting data from Azure Blob Storage (ABS), providing capabilities for browsing containers, extracting table schema information, creating metadata change proposals, handling telemetry, and supporting schema inference for various file formats.
metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py Introduces utility functions for handling Azure Blob Storage URIs, extracting metadata properties, managing tags, and listing folders within containers. It includes functions for URI validation, prefix extraction, URN creation, property retrieval, tag addition, and folder listing.
metadata-ingestion/src/datahub/ingestion/source/common/subtypes.py Adds new subtype ABS_CONTAINER to the DatasetContainerSubTypes enum.
metadata-ingestion/src/datahub/ingestion/source/data_lake_common/data_lake_utils.py Introduces support for Azure Blob Storage (ABS) by adding functions to handle ABS URIs, retrieve container names, and determine container subtypes. These changes enable identifying ABS platforms, retrieving ABS-specific information, and creating container hierarchies for ABS datasets.
metadata-ingestion/setup.py Adds dependencies for abs_base set, modifies delta_lake set to include *abs_base, and includes "abs" as a source in the source mapping.

Poem

In the cloud where data flows,
Azure Blob now clearly shows,
With DataHub we sync and weave,
Files and folders, a web we leave.

AWS paths still align,
Together they form a grand design,
Here’s to code that gleams and streams,
In the realm of data dreams!


Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

Share
Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai generate interesting stats about this repository and render them as a table.
    • @coderabbitai show all the console.log statements in this repository.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (invoked as PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Additionally, you can add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 23

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 8edc94d and af3e161.

Files selected for processing (14)
  • metadata-ingestion/docs/sources/abs/README.md (1 hunks)
  • metadata-ingestion/docs/sources/abs/abs.md (1 hunks)
  • metadata-ingestion/docs/sources/abs/abs_recipe.yml (1 hunks)
  • metadata-ingestion/setup.py (3 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/abs/config.py (1 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/abs/datalake_profiler_config.py (1 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/abs/profiling.py (1 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/abs/report.py (1 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/abs/source.py (1 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py (1 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/azure/azure_common.py (1 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/common/subtypes.py (1 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/data_lake_common/data_lake_utils.py (4 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/data_lake_common/path_spec.py (4 hunks)
Files not summarized due to errors (7)
  • metadata-ingestion/src/datahub/ingestion/source/common/subtypes.py: Error: Server error. Please try again later.
  • metadata-ingestion/src/datahub/ingestion/source/abs/report.py: Error: Server error. Please try again later.
  • metadata-ingestion/src/datahub/ingestion/source/abs/source.py: Error: Server error. Please try again later.
  • metadata-ingestion/setup.py: Error: Server error. Please try again later.
  • metadata-ingestion/src/datahub/ingestion/source/data_lake_common/path_spec.py: Error: Server error. Please try again later.
  • metadata-ingestion/src/datahub/ingestion/source/abs/datalake_profiler_config.py: Error: Server error. Please try again later.
  • metadata-ingestion/src/datahub/ingestion/source/azure/azure_common.py: Error: Server error. Please try again later.
Additional context used
LanguageTool
metadata-ingestion/docs/sources/abs/README.md

[style] ~34-~34: ‘on the basis of’ might be wordy. Consider a shorter alternative.
Context: ...etails)) JSON file schemas are inferred on the basis of the entire file (given the difficulty i...

(EN_WORDINESS_PREMIUM_ON_THE_BASIS_OF)

metadata-ingestion/docs/sources/abs/abs.md

[uncategorized] ~4-~4: Possible missing comma found.
Context: ...) is a list of Path Spec (path_spec) objects where each individual path_spec` repre...

(AI_HYDRA_LEO_MISSING_COMMA)


[uncategorized] ~4-~4: Possible missing article found.
Context: ...e path (path_spec.include) represents formatted path to the dataset. This path must end...

(AI_HYDRA_LEO_MISSING_A)


[uncategorized] ~4-~4: Possible missing comma found.
Context: ...o represent leaf level. If *.[ext] is provided then files with only specified extensio...

(AI_HYDRA_LEO_MISSING_COMMA)


[uncategorized] ~4-~4: Possible missing article found.
Context: ...[ext] is provided then files with only specified extension type will be scanned. ".[ext...

(AI_HYDRA_LEO_MISSING_THE)


[uncategorized] ~6-~6: Possible missing article found.
Context: ...ent a folder level and avoid specifying exact folder name. To map folder as a dataset...

(AI_HYDRA_LEO_MISSING_AN)


[uncategorized] ~6-~6: Possible missing article found.
Context: ...id specifying exact folder name. To map folder as a dataset, use {table} placeholder...

(AI_HYDRA_LEO_MISSING_A)


[uncategorized] ~6-~6: Possible missing article found.
Context: ... use {table} placeholder to represent folder level for which dataset is to be create...

(AI_HYDRA_LEO_MISSING_THE)


[uncategorized] ~6-~6: Possible missing article found.
Context: ...der to represent folder level for which dataset is to be created. For a partitioned dat...

(AI_HYDRA_LEO_MISSING_THE)


[grammar] ~17-~17: You’ve repeated a verb. Did you mean to only write one of them?
Context: ...in named variables. ### Path Specs - Examples #### Example 1 - Individual file as Dataset Contain...

(REPEATED_VERBS)

Markdownlint
metadata-ingestion/docs/sources/abs/README.md

40-40: null
Files should end with a single newline character

(MD047, single-trailing-newline)


33-33: null
Link fragments should be valid

(MD051, link-fragments)

metadata-ingestion/docs/sources/abs/abs.md

55-55: Expected: 0 or 2; Actual: 1
Trailing spaces

(MD009, no-trailing-spaces)


202-202: Expected: 0 or 2; Actual: 1
Trailing spaces

(MD009, no-trailing-spaces)


16-16: Expected: 1; Actual: 2
Multiple consecutive blank lines

(MD012, no-multiple-blanks)


107-107: Expected: 1; Actual: 2
Multiple consecutive blank lines

(MD012, no-multiple-blanks)


169-169: Expected: 1; Actual: 2
Multiple consecutive blank lines

(MD012, no-multiple-blanks)


170-170: Expected: 1; Actual: 3
Multiple consecutive blank lines

(MD012, no-multiple-blanks)


17-17: Expected: 1; Actual: 0; Below
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)


18-18: Expected: 1; Actual: 0; Above
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)


30-30: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


34-34: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


40-40: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


49-49: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


59-59: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


74-74: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


84-84: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


100-100: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


111-111: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


131-131: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


13-13: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)


141-141: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)


164-164: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)


165-165: null
Bare URL used

(MD034, no-bare-urls)


167-167: null
Bare URL used

(MD034, no-bare-urls)


12-12: null
Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


147-147: null
Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


163-163: null
Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


22-22: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


30-30: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


40-40: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


49-49: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


59-59: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


74-74: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


84-84: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


100-100: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


111-111: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


131-131: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


204-204: null
Files should end with a single newline character

(MD047, single-trailing-newline)


4-4: null
Link fragments should be valid

(MD051, link-fragments)

Ruff
metadata-ingestion/src/datahub/ingestion/source/abs/config.py

148-150: Use dict.get() without default value

Remove default value

(SIM910)

metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py

2-2: Redefinition of unused logging from line 1

Remove definition: logging

(F811)


155-155: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


280-280: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)

metadata-ingestion/src/datahub/ingestion/source/abs/profiling.py

126-130: Use a single if statement instead of nested if statements

(SIM102)

metadata-ingestion/src/datahub/ingestion/source/abs/source.py

162-163: Use a single if statement instead of nested if statements

(SIM102)


693-693: Loop control variable guid not used within loop body

Rename unused guid to _guid

(B007)

Additional comments not posted (42)
metadata-ingestion/docs/sources/abs/abs_recipe.yml (1)

1-11: Ensure sensitive information is secured.

The configuration contains sensitive information such as account_name and sas_token. Ensure these are secured and not hard-coded in production environments.
[SECURITY]
Consider using environment variables or a secrets management service.

azure_config:
  account_name: "${{ secrets.AZURE_ACCOUNT_NAME }}"
  sas_token: "${{ secrets.AZURE_SAS_TOKEN }}"
metadata-ingestion/src/datahub/ingestion/source/abs/report.py (1)

1-19: LGTM!

The data class DataLakeSourceReport and its methods report_file_scanned and report_file_dropped are well-defined and follow best practices.

metadata-ingestion/src/datahub/ingestion/source/common/subtypes.py (1)

38-38: LGTM!

The new subtype ABS_CONTAINER has been correctly added to the DatasetContainerSubTypes enum.

metadata-ingestion/docs/sources/abs/README.md (3)

1-35: Consider rephrasing for conciseness.

The phrase "on the basis of" in line 34 might be wordy. Consider using "based on" instead.

- JSON file schemas are inferred on the basis of the entire file (given the difficulty in extracting only the first few objects of the file), which may impact performance.
+ JSON file schemas are inferred based on the entire file (given the difficulty in extracting only the first few objects of the file), which may impact performance.
Tools
LanguageTool

[style] ~34-~34: ‘on the basis of’ might be wordy. Consider a shorter alternative.
Context: ...etails)) JSON file schemas are inferred on the basis of the entire file (given the difficulty i...

(EN_WORDINESS_PREMIUM_ON_THE_BASIS_OF)

Markdownlint

33-33: null
Link fragments should be valid

(MD051, link-fragments)


33-33: Ensure link fragments are valid.

The link fragment in line 33 should be valid.

Please ensure the max_rows link fragment is correct and points to the appropriate section.

Tools
Markdownlint

33-33: null
Link fragments should be valid

(MD051, link-fragments)


40-40: Add a newline at the end of the file.

Files should end with a single newline character.

- Profiling is not available in the current release.
+ Profiling is not available in the current release.
+
Tools
Markdownlint

40-40: null
Files should end with a single newline character

(MD047, single-trailing-newline)

metadata-ingestion/src/datahub/ingestion/source/abs/datalake_profiler_config.py (2)

26-28: Consider renaming _allow_deny_patterns.

The private attribute _allow_deny_patterns could be more descriptive, indicating it is used for profiling patterns.

-_allow_deny_patterns: AllowDenyPattern = pydantic.PrivateAttr(
+_profiling_allow_deny_patterns: AllowDenyPattern = pydantic.PrivateAttr(

75-92: Improve the root validator method name.

The root validator method ensure_field_level_settings_are_normalized could be renamed to better reflect its purpose.

-    def ensure_field_level_settings_are_normalized(
+    def normalize_field_level_settings(
metadata-ingestion/src/datahub/ingestion/source/data_lake_common/data_lake_utils.py (1)

38-38: Use consistent naming for platform constants.

The platform constants should use consistent naming conventions.

-PLATFORM_ABS = "abs"
+PLATFORM_AZURE_BLOB_STORAGE = "abs"
metadata-ingestion/src/datahub/ingestion/source/abs/config.py (1)

99-101: Use dict.get() without default value.

Remove the default value in the dict.get() method.

-    _rename_path_spec_to_plural = pydantic_renamed_field(
-        "path_spec", "path_specs", lambda path_spec: [path_spec]
-    )
+    _rename_path_spec_to_plural = pydantic_renamed_field("path_spec", "path_specs", lambda path_spec: [path_spec])

Likely invalid or redundant comment.

metadata-ingestion/src/datahub/ingestion/source/data_lake_common/path_spec.py (1)

218-220: LGTM!

The code changes are approved.

metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py (7)

31-32: LGTM!

The code changes are approved.


35-39: LGTM!

The code changes are approved.


42-50: LGTM!

The code changes are approved.


53-65: LGTM!

The code changes are approved.


68-73: LGTM!

The code changes are approved.


76-81: LGTM!

The code changes are approved.


162-195: LGTM!

The code changes are approved.

metadata-ingestion/src/datahub/ingestion/source/abs/profiling.py (1)

55-57: LGTM!

The code changes are approved.

metadata-ingestion/src/datahub/ingestion/source/abs/source.py (19)

204-204: Class ABSSource Implementation Looks Good

The class is well-structured and adheres to the project conventions.


210-227: Constructor Initialization and Telemetry Data

The constructor correctly initializes the attributes and sends telemetry data.


229-233: Class Method create Looks Good

The method correctly parses the configuration dictionary and creates an instance of ABSSource.


235-299: Method get_fields Looks Good

The method correctly handles both ABS and local file sources and infers the schema accurately. Error handling is also properly implemented.


301-324: Method add_partition_columns_to_schema Looks Good

The method correctly derives partition keys and adds them to the schema fields.


325-334: Method _create_table_operation_aspect Looks Good

The method correctly creates the operation aspect with the appropriate timestamps and operation type.


336-437: Method ingest_table Looks Good

The method correctly handles metadata extraction, schema inference, and aspect creation.


439-444: Method get_prefix Looks Good

The method correctly extracts the prefix from a relative path using regex.


446-449: Method extract_table_name Looks Good

The method correctly extracts the table name from the path specification and named variables.


451-472: Method extract_table_data Looks Good

The method correctly extracts table data including display name, path, timestamp, and size.


474-490: Method resolve_templated_folders Looks Good

The method correctly resolves templated folders recursively.


491-520: Method get_dir_to_process Looks Good

The method correctly retrieves the directory to process, handling nested directories and sorting them.


522-615: Method abs_browser Looks Good

The method correctly handles ABS browsing, including error handling and sampling logic.


616-623: Method create_abs_path Looks Good

The method correctly constructs the ABS path using the account name and container name.


625-648: Method local_browser Looks Good

The method correctly handles local file browsing, including file and directory scanning.


649-695: Method get_workunits_internal Looks Good

The method correctly handles work unit generation, including ABS and local file sources.

Tools
Ruff

693-693: Loop control variable guid not used within loop body

Rename unused guid to _guid

(B007)


696-702: Method get_workunit_processors Looks Good

The method correctly retrieves the work unit processors, including the stale entity removal handler.


704-705: Method is_abs_platform Looks Good

The method correctly checks if the platform is ABS.


707-708: Method get_report Looks Good

The method correctly retrieves the report.

metadata-ingestion/setup.py (4)

261-266: Addition of ABS Base Dependencies Looks Good

The ABS base dependencies are correctly added to the abs_base dictionary.


275-275: Addition of ABS Base to Delta Lake Dependencies Looks Good

The ABS base dependencies are correctly included in the delta_lake dictionary.


416-416: Addition of ABS Plugin Dependencies Looks Good

The ABS plugin dependencies are correctly added to the plugins dictionary.


693-693: Addition of ABS Source Entry Point Looks Good

The ABS source entry point is correctly added to the entry_points dictionary.

Comment on lines +19 to +22
base_path: str = Field(
default="/",
description="Base folder in hierarchical namespaces to start from.",
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider adding a validator for base_path.

The base_path field could benefit from a validator to ensure it starts with a /.

@pydantic.validator("base_path")
def validate_base_path(cls, value):
    if not value.startswith("/"):
        raise ValueError("base_path must start with a '/'")
    return value

Comment on lines +50 to +53
def get_abfss_url(self, folder_path: str = "") -> str:
if not folder_path.startswith("/"):
folder_path = f"/{folder_path}"
return f"abfss://{self.container_name}@{self.account_name}.dfs.core.windows.net{folder_path}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simplify get_abfss_url method.

The get_abfss_url method can be simplified by using str.lstrip.

-        if not folder_path.startswith("/"):
-            folder_path = f"/{folder_path}"
+        folder_path = folder_path.lstrip("/")

Committable suggestion was skipped due to low confidence.

Comment on lines +111 to +112
elif is_abs_uri(path):
return get_container_name(path)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Handle invalid ABS URIs gracefully.

Add error handling for invalid ABS URIs in get_bucket_name.

-        elif is_abs_uri(path):
-            return get_container_name(path)
+        elif is_abs_uri(path):
+            try:
+                return get_container_name(path)
+            except ValueError:
+                raise ValueError(f"Invalid ABS URI: {path}")
Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
elif is_abs_uri(path):
return get_container_name(path)
elif is_abs_uri(path):
try:
return get_container_name(path)
except ValueError:
raise ValueError(f"Invalid ABS URI: {path}")

Comment on lines +95 to +96
elif is_abs_uri(path):
protocol = get_abs_prefix(path)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Handle invalid ABS URIs gracefully.

Add error handling for invalid ABS URIs in get_protocol.

-        elif is_abs_uri(path):
-            protocol = get_abs_prefix(path)
+        elif is_abs_uri(path):
+            try:
+                protocol = get_abs_prefix(path)
+            except ValueError:
+                raise ValueError(f"Invalid ABS URI: {path}")
Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
elif is_abs_uri(path):
protocol = get_abs_prefix(path)
elif is_abs_uri(path):
try:
protocol = get_abs_prefix(path)
except ValueError:
raise ValueError(f"Invalid ABS URI: {path}")

Comment on lines 115 to 122
def get_sub_types(self) -> str:
if self.platform == "s3":
return DatasetContainerSubTypes.S3_BUCKET
elif self.platform == "gcs":
return DatasetContainerSubTypes.GCS_BUCKET
elif self.platform == "abs":
return DatasetContainerSubTypes.ABS_CONTAINER
raise ValueError(f"Unable to sub type for platform: {self.platform}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use constants for platform strings.

Replace hardcoded platform strings with constants.

-        if self.platform == "s3":
+        if self.platform == PLATFORM_S3:
-        elif self.platform == "gcs":
+        elif self.platform == PLATFORM_GCS:
-        elif self.platform == "abs":
+        elif self.platform == PLATFORM_ABS:
Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def get_sub_types(self) -> str:
if self.platform == "s3":
return DatasetContainerSubTypes.S3_BUCKET
elif self.platform == "gcs":
return DatasetContainerSubTypes.GCS_BUCKET
elif self.platform == "abs":
return DatasetContainerSubTypes.ABS_CONTAINER
raise ValueError(f"Unable to sub type for platform: {self.platform}")
def get_sub_types(self) -> str:
if self.platform == PLATFORM_S3:
return DatasetContainerSubTypes.S3_BUCKET
elif self.platform == PLATFORM_GCS:
return DatasetContainerSubTypes.GCS_BUCKET
elif self.platform == PLATFORM_ABS:
return DatasetContainerSubTypes.ABS_CONTAINER
raise ValueError(f"Unable to sub type for platform: {self.platform}")


Path Specs (`path_specs`) is a list of Path Spec (`path_spec`) objects where each individual `path_spec` represents one or more datasets. Include path (`path_spec.include`) represents formatted path to the dataset. This path must end with `*.*` or `*.[ext]` to represent leaf level. If `*.[ext]` is provided then files with only specified extension type will be scanned. "`.[ext]`" can be any of [supported file types](#supported-file-types). Refer [example 1](#example-1---individual-file-as-dataset) below for more details.

All folder levels need to be specified in include path. You can use `/*/` to represent a folder level and avoid specifying exact folder name. To map folder as a dataset, use `{table}` placeholder to represent folder level for which dataset is to be created. For a partitioned dataset, you can use placeholder `{partition_key[i]}` to represent name of `i`th partition and `{partition[i]}` to represent value of `i`th partition. During ingestion, `i` will be used to match partition_key to partition. Refer [example 2 and 3](#example-2---folder-of-files-as-dataset-without-partitions) below for more details.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix grammatical issues.

Add the missing articles for better readability.

- All folder levels need to be specified in include path. You can use `/*/` to represent a folder level and avoid specifying exact folder name. To map folder as a dataset, use `{table}` placeholder to represent folder level for which dataset is to be created. For a partitioned dataset, you can use placeholder `{partition_key[i]}` to represent name of `i`th partition and `{partition[i]}` to represent value of `i`th partition. During ingestion, `i` will be used to match partition_key to partition. Refer [example 2 and 3](#example-2---folder-of-files-as-dataset-without-partitions) below for more details.
+ All folder levels need to be specified in the include path. You can use `/*/` to represent a folder level and avoid specifying the exact folder name. To map a folder as a dataset, use the `{table}` placeholder to represent the folder level for which the dataset is to be created. For a partitioned dataset, you can use the placeholder `{partition_key[i]}` to represent the name of the `i`th partition and `{partition[i]}` to represent the value of the `i`th partition. During ingestion, `i` will be used to match the partition_key to the partition. Refer to [example 2 and 3](#example-2---folder-of-files-as-dataset-without-partitions) below for more details.
Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
All folder levels need to be specified in include path. You can use `/*/` to represent a folder level and avoid specifying exact folder name. To map folder as a dataset, use `{table}` placeholder to represent folder level for which dataset is to be created. For a partitioned dataset, you can use placeholder `{partition_key[i]}` to represent name of `i`th partition and `{partition[i]}` to represent value of `i`th partition. During ingestion, `i` will be used to match partition_key to partition. Refer [example 2 and 3](#example-2---folder-of-files-as-dataset-without-partitions) below for more details.
All folder levels need to be specified in the include path. You can use `/*/` to represent a folder level and avoid specifying the exact folder name. To map a folder as a dataset, use the `{table}` placeholder to represent the folder level for which the dataset is to be created. For a partitioned dataset, you can use the placeholder `{partition_key[i]}` to represent the name of the `i`th partition and `{partition[i]}` to represent the value of the `i`th partition. During ingestion, `i` will be used to match the partition_key to the partition. Refer to [example 2 and 3](#example-2---folder-of-files-as-dataset-without-partitions) below for more details.
Tools
LanguageTool

[uncategorized] ~6-~6: Possible missing article found.
Context: ...ent a folder level and avoid specifying exact folder name. To map folder as a dataset...

(AI_HYDRA_LEO_MISSING_AN)


[uncategorized] ~6-~6: Possible missing article found.
Context: ...id specifying exact folder name. To map folder as a dataset, use {table} placeholder...

(AI_HYDRA_LEO_MISSING_A)


[uncategorized] ~6-~6: Possible missing article found.
Context: ... use {table} placeholder to represent folder level for which dataset is to be create...

(AI_HYDRA_LEO_MISSING_THE)


[uncategorized] ~6-~6: Possible missing article found.
Context: ...der to represent folder level for which dataset is to be created. For a partitioned dat...

(AI_HYDRA_LEO_MISSING_THE)

Comment on lines +173 to +174
is_abs = is_abs_uri(values.get("include") or "")
if not is_s3 and not is_gcs and not is_abs:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tip

Codebase Verification

The turn_off_sampling_for_non_s3 method appears to be unused.

The method turn_off_sampling_for_non_s3 is defined in metadata-ingestion/src/datahub/ingestion/source/data_lake_common/path_spec.py but is not invoked anywhere in the codebase. This suggests that it might be dead code.

  • metadata-ingestion/src/datahub/ingestion/source/data_lake_common/path_spec.py: def turn_off_sampling_for_non_s3(cls, v, values)

Consider removing this method if it is no longer needed.

Analysis chain

Ensure sampling is correctly handled for Azure Blob Storage.

The sampling logic has been updated to handle Azure Blob Storage URIs. Ensure that this change is consistent with the rest of the codebase.

Scripts executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Description: Verify the usage of the `turn_off_sampling_for_non_s3` validator in the codebase.

# Test: Search for the usage of the `turn_off_sampling_for_non_s3` validator. Expect: Consistent handling of Azure Blob Storage URIs.
rg --type python $'turn_off_sampling_for_non_s3'

Length of output: 78



Script:

#!/bin/bash
# Description: Verify the usage of the `turn_off_sampling_for_non_s3` validator in the codebase.

# Test: Search for the usage of the `turn_off_sampling_for_non_s3` validator. Expect: Consistent handling of Azure Blob Storage URIs.
rg -t py 'turn_off_sampling_for_non_s3'

Length of output: 171



Script:

#!/bin/bash
# Description: Verify the usage of the `turn_off_sampling_for_non_s3` validator within the path_spec.py file.

# Test: Search for invocations of the `turn_off_sampling_for_non_s3` method within the path_spec.py file.
rg 'turn_off_sampling_for_non_s3' metadata-ingestion/src/datahub/ingestion/source/data_lake_common/path_spec.py

Length of output: 165



Script:

#!/bin/bash
# Description: Verify the usage of the `turn_off_sampling_for_non_s3` validator across the entire codebase.

# Test: Search for invocations of the `turn_off_sampling_for_non_s3` method across the entire codebase.
rg 'turn_off_sampling_for_non_s3'

Length of output: 165

continue

folder_name = f"{prefix}{folder_name}"
if folder_name in this_dict.keys():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove unnecessary .keys() method.

Use key in dict instead of key in dict.keys() for better performance.

- if folder_name in this_dict.keys():
+ if folder_name in this_dict:
Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if folder_name in this_dict.keys():
if folder_name in this_dict:
Tools
Ruff

280-280: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)

Comment on lines 95 to 96
if azure_config is None:
raise ValueError("No container_client available.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Improve the error message.

The error message for azure_config being None could be more descriptive.

- raise ValueError("No container_client available.")
+ raise ValueError("Azure configuration is not provided. Cannot retrieve container client.")
Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if azure_config is None:
raise ValueError("No container_client available.")
if azure_config is None:
raise ValueError("Azure configuration is not provided. Cannot retrieve container client.")

def add_property(
key: str, value: str, custom_properties: Dict[str, str], resource_name: str
) -> Dict[str, str]:
if key in custom_properties.keys():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove unnecessary .keys() method.

Use key in dict instead of key in dict.keys() for better performance.

- if key in custom_properties.keys():
+ if key in custom_properties:
Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if key in custom_properties.keys():
if key in custom_properties:
Tools
Ruff

155-155: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 18

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 8edc94d and af3e161.

Files selected for processing (14)
  • metadata-ingestion/docs/sources/abs/README.md (1 hunks)
  • metadata-ingestion/docs/sources/abs/abs.md (1 hunks)
  • metadata-ingestion/docs/sources/abs/abs_recipe.yml (1 hunks)
  • metadata-ingestion/setup.py (3 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/abs/config.py (1 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/abs/datalake_profiler_config.py (1 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/abs/profiling.py (1 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/abs/report.py (1 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/abs/source.py (1 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py (1 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/azure/azure_common.py (1 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/common/subtypes.py (1 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/data_lake_common/data_lake_utils.py (4 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/data_lake_common/path_spec.py (4 hunks)
Files not summarized due to errors (4)
  • metadata-ingestion/src/datahub/ingestion/source/abs/profiling.py: Error: Server error. Please try again later.
  • metadata-ingestion/setup.py: Error: Server error. Please try again later.
  • metadata-ingestion/src/datahub/ingestion/source/abs/source.py: Error: Server error. Please try again later.
  • metadata-ingestion/docs/sources/abs/README.md: Error: Server error. Please try again later.
Files skipped from review due to trivial changes (2)
  • metadata-ingestion/docs/sources/abs/abs_recipe.yml
  • metadata-ingestion/src/datahub/ingestion/source/common/subtypes.py
Additional context used
LanguageTool
metadata-ingestion/docs/sources/abs/README.md

[uncategorized] ~3-~3: You might be missing the article “the” here.
Context: ...group of files that form a dataset, use path_specs configuration in ingestion r...

(AI_EN_LECTOR_MISSING_DETERMINER_THE)


[uncategorized] ~3-~3: You might be missing the article “the” here.
Context: ...aset, use path_specs configuration in ingestion recipe. Refer section [Path Specs](http...

(AI_EN_LECTOR_MISSING_DETERMINER_THE)


[style] ~34-~34: ‘on the basis of’ might be wordy. Consider a shorter alternative.
Context: ...etails)) JSON file schemas are inferred on the basis of the entire file (given the difficulty i...

(EN_WORDINESS_PREMIUM_ON_THE_BASIS_OF)

metadata-ingestion/docs/sources/abs/abs.md

[uncategorized] ~4-~4: Possible missing comma found.
Context: ...) is a list of Path Spec (path_spec) objects where each individual path_spec` repre...

(AI_HYDRA_LEO_MISSING_COMMA)


[uncategorized] ~4-~4: Possible missing article found.
Context: ...e path (path_spec.include) represents formatted path to the dataset. This path must end...

(AI_HYDRA_LEO_MISSING_A)


[uncategorized] ~4-~4: Possible missing comma found.
Context: ...o represent leaf level. If *.[ext] is provided then files with only specified extensio...

(AI_HYDRA_LEO_MISSING_COMMA)


[uncategorized] ~4-~4: Possible missing article found.
Context: ...[ext] is provided then files with only specified extension type will be scanned. ".[ext...

(AI_HYDRA_LEO_MISSING_THE)


[uncategorized] ~6-~6: Possible missing article found.
Context: ...id specifying exact folder name. To map folder as a dataset, use {table} placeholder...

(AI_HYDRA_LEO_MISSING_A)


[uncategorized] ~6-~6: Possible missing article found.
Context: ... use {table} placeholder to represent folder level for which dataset is to be create...

(AI_HYDRA_LEO_MISSING_THE)


[uncategorized] ~6-~6: Possible missing article found.
Context: ...der to represent folder level for which dataset is to be created. For a partitioned dat...

(AI_HYDRA_LEO_MISSING_THE)


[grammar] ~17-~17: You’ve repeated a verb. Did you mean to only write one of them?
Context: ...in named variables. ### Path Specs - Examples #### Example 1 - Individual file as Dataset Contain...

(REPEATED_VERBS)


[uncategorized] ~54-~54: Possible missing article found.
Context: ...table}/*.avro ``` {table} represents folder for which dataset will be created. ##...

(AI_HYDRA_LEO_MISSING_A)


[uncategorized] ~54-~54: Possible missing article found.
Context: ... {table} represents folder for which dataset will be created. #### Example 3 - Fol...

(AI_HYDRA_LEO_MISSING_A)

Markdownlint
metadata-ingestion/docs/sources/abs/README.md

40-40: null
Files should end with a single newline character

(MD047, single-trailing-newline)


33-33: null
Link fragments should be valid

(MD051, link-fragments)

metadata-ingestion/docs/sources/abs/abs.md

55-55: Expected: 0 or 2; Actual: 1
Trailing spaces

(MD009, no-trailing-spaces)


202-202: Expected: 0 or 2; Actual: 1
Trailing spaces

(MD009, no-trailing-spaces)


16-16: Expected: 1; Actual: 2
Multiple consecutive blank lines

(MD012, no-multiple-blanks)


107-107: Expected: 1; Actual: 2
Multiple consecutive blank lines

(MD012, no-multiple-blanks)


169-169: Expected: 1; Actual: 2
Multiple consecutive blank lines

(MD012, no-multiple-blanks)


170-170: Expected: 1; Actual: 3
Multiple consecutive blank lines

(MD012, no-multiple-blanks)


17-17: Expected: 1; Actual: 0; Below
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)


18-18: Expected: 1; Actual: 0; Above
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)


30-30: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


34-34: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


40-40: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


49-49: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


59-59: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


74-74: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


84-84: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


100-100: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


111-111: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


131-131: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


13-13: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)


141-141: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)


164-164: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)


165-165: null
Bare URL used

(MD034, no-bare-urls)


167-167: null
Bare URL used

(MD034, no-bare-urls)


12-12: null
Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


147-147: null
Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


163-163: null
Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


22-22: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


30-30: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


40-40: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


49-49: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


59-59: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


74-74: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


84-84: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


100-100: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


111-111: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


131-131: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


204-204: null
Files should end with a single newline character

(MD047, single-trailing-newline)


4-4: null
Link fragments should be valid

(MD051, link-fragments)

Ruff
metadata-ingestion/src/datahub/ingestion/source/abs/config.py

148-150: Use dict.get() without default value

Remove default value

(SIM910)

metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py

2-2: Redefinition of unused logging from line 1

Remove definition: logging

(F811)


155-155: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


280-280: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)

metadata-ingestion/src/datahub/ingestion/source/abs/profiling.py

126-130: Use a single if statement instead of nested if statements

(SIM102)

metadata-ingestion/src/datahub/ingestion/source/abs/source.py

162-163: Use a single if statement instead of nested if statements

(SIM102)


693-693: Loop control variable guid not used within loop body

Rename unused guid to _guid

(B007)

Additional comments not posted (28)
metadata-ingestion/src/datahub/ingestion/source/abs/report.py (1)

10-19: LGTM!

The class DataLakeSourceReport correctly extends StaleEntityRemovalSourceReport and adds functionality to track scanned and filtered files.

metadata-ingestion/docs/sources/abs/README.md (1)

34-34: Simplify the phrase "on the basis of".

The phrase "on the basis of" can be simplified to "based on".

- JSON file schemas are inferred on the basis of the entire file (given the difficulty in extracting only the first few objects of the file), which may impact performance.
+ JSON file schemas are inferred based on the entire file (given the difficulty in extracting only the first few objects of the file), which may impact performance.
Tools
LanguageTool

[style] ~34-~34: ‘on the basis of’ might be wordy. Consider a shorter alternative.
Context: ...etails)) JSON file schemas are inferred on the basis of the entire file (given the difficulty i...

(EN_WORDINESS_PREMIUM_ON_THE_BASIS_OF)

metadata-ingestion/src/datahub/ingestion/source/abs/datalake_profiler_config.py (1)

75-92: LGTM!

The root validator ensure_field_level_settings_are_normalized correctly ensures that field-level settings are normalized based on the profile_table_level_only flag.

metadata-ingestion/src/datahub/ingestion/source/azure/azure_common.py (1)

84-98: LGTM!

The root validator _check_credential_values correctly ensures that the necessary credential values are provided.

metadata-ingestion/src/datahub/ingestion/source/data_lake_common/data_lake_utils.py (7)

19-24: Imports look good!

The import statements for Azure Blob Storage utilities are correctly added.


38-38: Constant declaration looks good!

The PLATFORM_ABS constant is correctly introduced.


95-96: ABS URI handling looks good!

The get_protocol method is correctly updated to handle ABS URIs.


111-113: ABS URI handling looks good!

The get_bucket_name method is correctly updated to handle ABS URIs.


115-122: ABS container subtype handling looks good!

The get_sub_types method is correctly updated to return ABS container subtype.


124-129: ABS URI handling looks good!

The get_base_full_path method is correctly updated to handle ABS URIs.


137-148: ABS platform handling looks good!

The create_container_hierarchy method is correctly updated to handle ABS platform.

metadata-ingestion/src/datahub/ingestion/source/abs/config.py (4)

1-17: Imports look good!

The import statements for Azure Blob Storage configurations and utilities are correctly added.


28-107: Class definition looks good!

The DataLakeSourceConfig class is correctly introduced with necessary fields and methods.


108-145: ABS platform handling looks good!

The check_path_specs_and_infer_platform method is correctly updated to handle ABS platform.


156-163: Profiling pattern handling looks good!

The ensure_profiling_pattern_is_passed_to_profiling method is correctly implemented to handle profiling patterns.

metadata-ingestion/src/datahub/ingestion/source/data_lake_common/path_spec.py (3)

14-14: Imports look good!

The import statement for Azure Blob Storage utility is correctly added.


173-174: ABS URI handling looks good!

The sample_files validator is correctly updated to handle ABS URIs.


218-221: ABS URI check looks good!

The is_abs cached property is correctly implemented to check for ABS URIs.

metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py (6)

31-32: LGTM!

The function correctly checks if a URI matches the Azure Blob Storage pattern.


35-39: LGTM!

The function correctly extracts the prefix from an Azure Blob Storage URI.


42-50: LGTM!

The function correctly removes the Azure Blob Storage prefix from a URI.


53-65: LGTM!

The function correctly creates a URN for an Azure Blob Storage URI.


68-73: LGTM!

The function correctly extracts the container name from an Azure Blob Storage URI.


76-81: LGTM!

The function correctly extracts the key prefix from an Azure Blob Storage URI.

metadata-ingestion/src/datahub/ingestion/source/abs/profiling.py (1)

60-73: LGTM!

The class correctly defines the specification for profiling a single column.

metadata-ingestion/setup.py (3)

261-266: LGTM! Adding Azure Blob Storage dependencies.

The new dependencies for Azure Blob Storage are appropriate for the intended functionality.


275-275: LGTM! Including Azure Blob Storage dependencies in delta_lake plugin.

The addition of abs_base to the delta_lake plugin is logical and aligns with the objective of adding Azure Blob ingestion functionality.


416-416: LGTM! Adding abs plugin for Azure Blob Storage ingestion.

The new abs plugin with abs_base dependencies aligns with the PR objective of adding Azure Blob ingestion functionality.

Comment on lines +148 to +150
inferred_platform = values.get(
"platform", None
) # we may have inferred it above
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simplify the platform inference logic.

Use dict.get() without the default value as suggested by static analysis tools.

-    inferred_platform = values.get("platform", None)
+    inferred_platform = values.get("platform")
Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
inferred_platform = values.get(
"platform", None
) # we may have inferred it above
inferred_platform = values.get("platform")
Tools
Ruff

148-150: Use dict.get() without default value

Remove default value

(SIM910)

Comment on lines 4 to 8
Path Specs (`path_specs`) is a list of Path Spec (`path_spec`) objects where each individual `path_spec` represents one or more datasets. Include path (`path_spec.include`) represents formatted path to the dataset. This path must end with `*.*` or `*.[ext]` to represent leaf level. If `*.[ext]` is provided then files with only specified extension type will be scanned. "`.[ext]`" can be any of [supported file types](#supported-file-types). Refer [example 1](#example-1---individual-file-as-dataset) below for more details.

All folder levels need to be specified in include path. You can use `/*/` to represent a folder level and avoid specifying exact folder name. To map folder as a dataset, use `{table}` placeholder to represent folder level for which dataset is to be created. For a partitioned dataset, you can use placeholder `{partition_key[i]}` to represent name of `i`th partition and `{partition[i]}` to represent value of `i`th partition. During ingestion, `i` will be used to match partition_key to partition. Refer [example 2 and 3](#example-2---folder-of-files-as-dataset-without-partitions) below for more details.

Exclude paths (`path_spec.exclude`) can be used to ignore paths that are not relevant to current `path_spec`. This path cannot have named variables ( `{}` ). Exclude path can have `**` to represent multiple folder levels. Refer [example 4](#example-4---folder-of-files-as-dataset-with-partitions-and-exclude-filter) below for more details.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix grammatical issues and formatting inconsistencies.

There are grammatical issues and formatting inconsistencies that need to be addressed for better readability.

- Path Specs (`path_specs`) is a list of Path Spec (`path_spec`) objects where each individual `path_spec` represents one or more datasets. Include path (`path_spec.include`) represents formatted path to the dataset. This path must end with `*.*` or `*.[ext]` to represent leaf level. If `*.[ext]` is provided then files with only specified extension type will be scanned. "`.[ext]`" can be any of [supported file types](#supported-file-types). Refer [example 1](#example-1---individual-file-as-dataset) below for more details.
+ Path Specs (`path_specs`) is a list of Path Spec (`path_spec`) objects, where each individual `path_spec` represents one or more datasets. The include path (`path_spec.include`) represents a formatted path to the dataset. This path must end with `*.*` or `*.[ext]` to represent the leaf level. If `*.[ext]` is provided, then only files with the specified extension type will be scanned. "`.[ext]`" can be any of the [supported file types](#supported-file-types). Refer to [example 1](#example-1---individual-file-as-dataset) below for more details.

- All folder levels need to be specified in include path. You can use `/*/` to represent a folder level and avoid specifying exact folder name. To map folder as a dataset, use `{table}` placeholder to represent folder level for which dataset is to be created. For a partitioned dataset, you can use placeholder `{partition_key[i]}` to represent name of `i`th partition and `{partition[i]}` to represent value of `i`th partition. During ingestion, `i` will be used to match partition_key to partition. Refer [example 2 and 3](#example-2---folder-of-files-as-dataset-without-partitions) below for more details.
+ All folder levels need to be specified in the include path. You can use `/*/` to represent a folder level and avoid specifying the exact folder name. To map a folder as a dataset, use the `{table}` placeholder to represent the folder level for which the dataset is to be created. For a partitioned dataset, you can use the placeholder `{partition_key[i]}` to represent the name of the `i`th partition and `{partition[i]}` to represent the value of the `i`th partition. During ingestion, `i` will be used to match the partition_key to the partition. Refer to [examples 2 and 3](#example-2---folder-of-files-as-dataset-without-partitions) below for more details.

- Exclude paths (`path_spec.exclude`) can be used to ignore paths that are not relevant to current `path_spec`. This path cannot have named variables ( `{}` ). Exclude path can have `**` to represent multiple folder levels. Refer [example 4](#example-4---folder-of-files-as-dataset-with-partitions-and-exclude-filter) below for more details.
+ Exclude paths (`path_spec.exclude`) can be used to ignore paths that are not relevant to the current `path_spec`. This path cannot have named variables (`{}`). The exclude path can have `**` to represent multiple folder levels. Refer to [example 4](#example-4---folder-of-files-as-dataset-with-partitions-and-exclude-filter) below for more details.

- Refer [example 5](#example-5---advanced---either-individual-file-or-folder-of-files-as-dataset) if your container has more complex dataset representation.
+ Refer to [example 5](#example-5---advanced---either-individual-file-or-folder-of-files-as-dataset) if your container has a more complex dataset representation.
Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
Path Specs (`path_specs`) is a list of Path Spec (`path_spec`) objects where each individual `path_spec` represents one or more datasets. Include path (`path_spec.include`) represents formatted path to the dataset. This path must end with `*.*` or `*.[ext]` to represent leaf level. If `*.[ext]` is provided then files with only specified extension type will be scanned. "`.[ext]`" can be any of [supported file types](#supported-file-types). Refer [example 1](#example-1---individual-file-as-dataset) below for more details.
All folder levels need to be specified in include path. You can use `/*/` to represent a folder level and avoid specifying exact folder name. To map folder as a dataset, use `{table}` placeholder to represent folder level for which dataset is to be created. For a partitioned dataset, you can use placeholder `{partition_key[i]}` to represent name of `i`th partition and `{partition[i]}` to represent value of `i`th partition. During ingestion, `i` will be used to match partition_key to partition. Refer [example 2 and 3](#example-2---folder-of-files-as-dataset-without-partitions) below for more details.
Exclude paths (`path_spec.exclude`) can be used to ignore paths that are not relevant to current `path_spec`. This path cannot have named variables ( `{}` ). Exclude path can have `**` to represent multiple folder levels. Refer [example 4](#example-4---folder-of-files-as-dataset-with-partitions-and-exclude-filter) below for more details.
Path Specs (`path_specs`) is a list of Path Spec (`path_spec`) objects, where each individual `path_spec` represents one or more datasets. The include path (`path_spec.include`) represents a formatted path to the dataset. This path must end with `*.*` or `*.[ext]` to represent the leaf level. If `*.[ext]` is provided, then only files with the specified extension type will be scanned. "`.[ext]`" can be any of the [supported file types](#supported-file-types). Refer to [example 1](#example-1---individual-file-as-dataset) below for more details.
All folder levels need to be specified in the include path. You can use `/*/` to represent a folder level and avoid specifying the exact folder name. To map a folder as a dataset, use the `{table}` placeholder to represent the folder level for which the dataset is to be created. For a partitioned dataset, you can use the placeholder `{partition_key[i]}` to represent the name of the `i`th partition and `{partition[i]}` to represent the value of the `i`th partition. During ingestion, `i` will be used to match the partition_key to the partition. Refer to [examples 2 and 3](#example-2---folder-of-files-as-dataset-without-partitions) below for more details.
Exclude paths (`path_spec.exclude`) can be used to ignore paths that are not relevant to the current `path_spec`. This path cannot have named variables (`{}`). The exclude path can have `**` to represent multiple folder levels. Refer to [example 4](#example-4---folder-of-files-as-dataset-with-partitions-and-exclude-filter) below for more details.
Refer to [example 5](#example-5---advanced---either-individual-file-or-folder-of-files-as-dataset) if your container has a more complex dataset representation.
Tools
LanguageTool

[uncategorized] ~4-~4: Possible missing comma found.
Context: ...) is a list of Path Spec (path_spec) objects where each individual path_spec` repre...

(AI_HYDRA_LEO_MISSING_COMMA)


[uncategorized] ~4-~4: Possible missing article found.
Context: ...e path (path_spec.include) represents formatted path to the dataset. This path must end...

(AI_HYDRA_LEO_MISSING_A)


[uncategorized] ~4-~4: Possible missing comma found.
Context: ...o represent leaf level. If *.[ext] is provided then files with only specified extensio...

(AI_HYDRA_LEO_MISSING_COMMA)


[uncategorized] ~4-~4: Possible missing article found.
Context: ...[ext] is provided then files with only specified extension type will be scanned. ".[ext...

(AI_HYDRA_LEO_MISSING_THE)


[uncategorized] ~6-~6: Possible missing article found.
Context: ...id specifying exact folder name. To map folder as a dataset, use {table} placeholder...

(AI_HYDRA_LEO_MISSING_A)


[uncategorized] ~6-~6: Possible missing article found.
Context: ... use {table} placeholder to represent folder level for which dataset is to be create...

(AI_HYDRA_LEO_MISSING_THE)


[uncategorized] ~6-~6: Possible missing article found.
Context: ...der to represent folder level for which dataset is to be created. For a partitioned dat...

(AI_HYDRA_LEO_MISSING_THE)

Markdownlint

4-4: null
Link fragments should be valid

(MD051, link-fragments)

Comment on lines +147 to +161
**Valid path_specs.include**

```python
https://storageaccountname.blob.core.windows.net/my-container/foo/tests/bar.avro # single file table
https://storageaccountname.blob.core.windows.net/my-container/foo/tests/*.* # mulitple file level tables
https://storageaccountname.blob.core.windows.net/my-container/foo/tests/{table}/*.avro #table without partition
https://storageaccountname.blob.core.windows.net/my-container/foo/tests/{table}/*/*.avro #table where partitions are not specified
https://storageaccountname.blob.core.windows.net/my-container/foo/tests/{table}/*.* # table where no partitions as well as data type specified
https://storageaccountname.blob.core.windows.net/my-container/{dept}/tests/{table}/*.avro # specifying keywords to be used in display name
https://storageaccountname.blob.core.windows.net/my-container/{dept}/tests/{table}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/*.avro # specify partition key and value format
https://storageaccountname.blob.core.windows.net/my-container/{dept}/tests/{table}/{partition[0]}/{partition[1]}/{partition[2]}/*.avro # specify partition value only format
https://storageaccountname.blob.core.windows.net/my-container/{dept}/tests/{table}/{partition[0]}/{partition[1]}/{partition[2]}/*.* # for all extensions
https://storageaccountname.blob.core.windows.net/my-container/*/{table}/{partition[0]}/{partition[1]}/{partition[2]}/*.* # table is present at 2 levels down in container
https://storageaccountname.blob.core.windows.net/my-container/*/*/{table}/{partition[0]}/{partition[1]}/{partition[2]}/*.* # table is present at 3 levels down in container
```
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix grammatical issues and formatting inconsistencies.

There are grammatical issues and formatting inconsistencies that need to be addressed for better readability.

- https://storageaccountname.blob.core.windows.net/my-container/foo/tests/bar.avro # single file table
- https://storageaccountname.blob.core.windows.net/my-container/foo/tests/*.* # mulitple file level tables
+ https://storageaccountname.blob.core.windows.net/my-container/foo/tests/bar.avro # single file table
+ https://storageaccountname.blob.core.windows.net/my-container/foo/tests/*.* # multiple file level tables

Committable suggestion was skipped due to low confidence.

Tools
Markdownlint

147-147: null
Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)

Comment on lines +17 to +55
### Path Specs - Examples
#### Example 1 - Individual file as Dataset

Container structure:

```
test-container
├── employees.csv
├── departments.json
└── food_items.csv
```

Path specs config to ingest `employees.csv` and `food_items.csv` as datasets:
```
path_specs:
- include: https://storageaccountname.blob.core.windows.net/test-container/*.csv

```
This will automatically ignore `departments.json` file. To include it, use `*.*` instead of `*.csv`.

#### Example 2 - Folder of files as Dataset (without Partitions)

Container structure:
```
test-container
└── offers
├── 1.avro
└── 2.avro

```

Path specs config to ingest folder `offers` as dataset:
```
path_specs:
- include: https://storageaccountname.blob.core.windows.net/test-container/{table}/*.avro
```

`{table}` represents folder for which dataset will be created.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix grammatical issues and formatting inconsistencies.

There are grammatical issues and formatting inconsistencies that need to be addressed for better readability.

- `{table}` represents folder for which dataset will be created.
+ `{table}` represents the folder for which the dataset will be created.
Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
### Path Specs - Examples
#### Example 1 - Individual file as Dataset
Container structure:
```
test-container
├── employees.csv
├── departments.json
└── food_items.csv
```
Path specs config to ingest `employees.csv` and `food_items.csv` as datasets:
```
path_specs:
- include: https://storageaccountname.blob.core.windows.net/test-container/*.csv
```
This will automatically ignore `departments.json` file. To include it, use `*.*` instead of `*.csv`.
#### Example 2 - Folder of files as Dataset (without Partitions)
Container structure:
```
test-container
└── offers
├── 1.avro
└── 2.avro
```
Path specs config to ingest folder `offers` as dataset:
```
path_specs:
- include: https://storageaccountname.blob.core.windows.net/test-container/{table}/*.avro
```
`{table}` represents folder for which dataset will be created.
`{table}` represents the folder for which the dataset will be created.
Tools
LanguageTool

[grammar] ~17-~17: You’ve repeated a verb. Did you mean to only write one of them?
Context: ...in named variables. ### Path Specs - Examples #### Example 1 - Individual file as Dataset Contain...

(REPEATED_VERBS)


[uncategorized] ~54-~54: Possible missing article found.
Context: ...table}/*.avro ``` {table} represents folder for which dataset will be created. ##...

(AI_HYDRA_LEO_MISSING_A)


[uncategorized] ~54-~54: Possible missing article found.
Context: ... {table} represents folder for which dataset will be created. #### Example 3 - Fol...

(AI_HYDRA_LEO_MISSING_A)

Markdownlint

55-55: Expected: 0 or 2; Actual: 1
Trailing spaces

(MD009, no-trailing-spaces)


17-17: Expected: 1; Actual: 0; Below
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)


18-18: Expected: 1; Actual: 0; Above
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)


30-30: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


34-34: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


40-40: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


49-49: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


22-22: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


30-30: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


40-40: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


49-49: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)

@@ -0,0 +1,289 @@
import logging
import logging
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove redundant import.

The logging module is imported twice.

-import logging
Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
import logging
Tools
Ruff

2-2: Redefinition of unused logging from line 1

Remove definition: logging

(F811)

@@ -0,0 +1,40 @@
This connector ingests Azure Blob Storage (abbreviated to abs) datasets into DataHub. It allows mapping an individual
file or a folder of files to a dataset in DataHub.
To specify the group of files that form a dataset, use `path_specs` configuration in ingestion recipe. Refer
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add missing article "the".

The sentence is missing the article "the" before "section."

- Refer section [Path Specs](https://datahubproject.io/docs/generated/ingestion/sources/s3/#path-specs) for more details.
+ Refer to the section [Path Specs](https://datahubproject.io/docs/generated/ingestion/sources/s3/#path-specs) for more details.
Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
To specify the group of files that form a dataset, use `path_specs` configuration in ingestion recipe. Refer
To specify the group of files that form a dataset, use `path_specs` configuration in ingestion recipe. Refer to the section [Path Specs](https://datahubproject.io/docs/generated/ingestion/sources/s3/#path-specs) for more details.
Tools
LanguageTool

[uncategorized] ~3-~3: You might be missing the article “the” here.
Context: ...group of files that form a dataset, use path_specs configuration in ingestion r...

(AI_EN_LECTOR_MISSING_DETERMINER_THE)


[uncategorized] ~3-~3: You might be missing the article “the” here.
Context: ...aset, use path_specs configuration in ingestion recipe. Refer section [Path Specs](http...

(AI_EN_LECTOR_MISSING_DETERMINER_THE)

Comment on lines +108 to +146
#### Example 5 - Advanced - Either Individual file OR Folder of files as Dataset

Container structure:
```
test-container
├── customers
│ ├── part1.json
│ ├── part2.json
│ ├── part3.json
│ └── part4.json
├── employees.csv
├── food_items.csv
├── tmp_10101000.csv
└── orders
└── year=2022
└── month=2
├── 1.parquet
├── 2.parquet
└── 3.parquet

```

Path specs config:
```
path_specs:
- include: https://storageaccountname.blob.core.windows.net/test-container/*.csv
exclude:
- **/tmp_10101000.csv
- include: https://storageaccountname.blob.core.windows.net/test-container/{table}/*.json
- include: https://storageaccountname.blob.core.windows.net/test-container/{table}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/*.parquet
```

Above config has 3 path_specs and will ingest following datasets
- `employees.csv` - Single File as Dataset
- `food_items.csv` - Single File as Dataset
- `customers` - Folder as Dataset
- `orders` - Folder as Dataset
and will ignore file `tmp_10101000.csv`

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix grammatical issues and formatting inconsistencies.

There are grammatical issues and formatting inconsistencies that need to be addressed for better readability.

- Above config has 3 path_specs and will ingest following datasets
+ The above config has 3 path_specs and will ingest the following datasets:
Tools
Markdownlint

111-111: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


131-131: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


141-141: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)


111-111: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


131-131: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


### Profiling

Profiling is not available in the current release.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ensure the file ends with a newline.

Markdown files should end with a single newline character.

+ 
Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
Profiling is not available in the current release.
Profiling is not available in the current release.
Tools
Markdownlint

40-40: null
Files should end with a single newline character

(MD047, single-trailing-newline)

Comment on lines +1 to +2
from typing import Any, Dict, Optional

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove unused import.

The Optional import from typing is not used in the file.

- from typing import Any, Dict, Optional
+ from typing import Any, Dict
Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
from typing import Any, Dict, Optional
from typing import Any, Dict

folder_path = f"/{folder_path}"
return f"abfss://{self.container_name}@{self.account_name}.dfs.core.windows.net{folder_path}"

# TODO DEX-1010
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove TODO comment or create an issue.

The TODO comment should either be removed or an issue should be created to track this task.

Do you want me to open a GitHub issue to track this task?

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between af3e161 and c7f59d3.

Files selected for processing (2)
  • metadata-ingestion/src/datahub/ingestion/source/abs/source.py (1 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py (1 hunks)
Additional context used
Ruff
metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py

2-2: Redefinition of unused logging from line 1

Remove definition: logging

(F811)


146-146: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


272-272: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)

metadata-ingestion/src/datahub/ingestion/source/abs/source.py

162-163: Use a single if statement instead of nested if statements

(SIM102)


685-685: Loop control variable guid not used within loop body

Rename unused guid to _guid

(B007)

Additional comments not posted (26)
metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py (9)

22-23: LGTM!

The function correctly checks if a URI matches the Azure Blob Storage pattern.


26-30: LGTM!

The function correctly extracts the prefix from an Azure Blob Storage URI.


33-41: LGTM!

The function correctly removes the Azure Blob Storage prefix from a URI.


44-56: LGTM!

The function correctly creates a URN for an Azure Blob Storage dataset.


59-64: LGTM!

The function correctly extracts the container name from an Azure Blob Storage URI.


67-72: LGTM!

The function correctly extracts the key prefix from an Azure Blob Storage URI.


153-187: LGTM!

The function correctly processes properties and adds them to the custom properties dictionary.


189-237: LGTM!

The function correctly retrieves tags from an Azure Blob Storage object or container.


86-87: Improve the error message.

The error message for azure_config being None could be more descriptive.

- raise ValueError("No container_client available.")
+ raise ValueError("Azure configuration is not provided. Cannot retrieve container client.")

Likely invalid or redundant comment.

metadata-ingestion/src/datahub/ingestion/source/abs/source.py (17)

127-147: LGTM!

The function correctly maps Spark column types to DataHub types.


235-299: LGTM!

The function correctly retrieves fields from a table based on its file extension.


301-324: LGTM!

The function correctly adds partition columns to the schema.


325-334: LGTM!

The function correctly creates an operation aspect for a table.


336-437: LGTM!

The function correctly ingests a table and generates metadata work units.


439-444: LGTM!

The function correctly extracts the prefix from a relative path.


446-449: LGTM!

The function correctly extracts the table name from a path specification and named variables.


451-472: LGTM!

The function correctly extracts table data from a path specification.


474-490: LGTM!

The function correctly resolves templated folders in a container.


491-520: LGTM!

The function correctly retrieves the directory to process based on a path specification.


522-606: LGTM!

The function correctly browses Azure Blob Storage and retrieves metadata for files.


608-615: LGTM!

The function correctly creates an Azure Blob Storage path from a key.


617-639: LGTM!

The function correctly browses the local file system and retrieves metadata for files.


641-687: LGTM!

The function correctly retrieves metadata work units.

Tools
Ruff

685-685: Loop control variable guid not used within loop body

Rename unused guid to _guid

(B007)


688-694: LGTM!

The function correctly retrieves work unit processors.


696-697: LGTM!

The function correctly checks if the platform is Azure Blob Storage.


699-700: LGTM!

The function correctly retrieves the report.

Comment on lines 240 to 277
def list_folders(
container_name: str, prefix: str, azure_config: Optional[AzureConnectionConfig]
) -> Iterable[str]:
if azure_config is None:
raise ValueError("azure_config not set. Cannot browse Azure Blob Storage")

abs_blob_service_client = azure_config.get_blob_service_client()
container_client = abs_blob_service_client.get_container_client(container_name)

current_level = prefix.count("/")
blob_list = container_client.list_blobs(name_starts_with=prefix)

this_dict = {}
for blob in blob_list:
blob_name = blob.name[: blob.name.rfind("/") + 1]
folder_structure_arr = blob_name.split("/")

folder_name = ""
if len(folder_structure_arr) > current_level:
folder_name = f"{folder_name}/{folder_structure_arr[current_level]}"
else:
continue

folder_name = folder_name[1 : len(folder_name)]

if folder_name.endswith("/"):
folder_name = folder_name[:-1]

if folder_name == "":
continue

folder_name = f"{prefix}{folder_name}"
if folder_name in this_dict.keys():
continue
else:
this_dict[folder_name] = folder_name

yield f"{folder_name}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove unnecessary .keys() method.

Use key in dict instead of key in dict.keys() for better performance.

- if folder_name in this_dict.keys():
+ if folder_name in this_dict:
Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def list_folders(
container_name: str, prefix: str, azure_config: Optional[AzureConnectionConfig]
) -> Iterable[str]:
if azure_config is None:
raise ValueError("azure_config not set. Cannot browse Azure Blob Storage")
abs_blob_service_client = azure_config.get_blob_service_client()
container_client = abs_blob_service_client.get_container_client(container_name)
current_level = prefix.count("/")
blob_list = container_client.list_blobs(name_starts_with=prefix)
this_dict = {}
for blob in blob_list:
blob_name = blob.name[: blob.name.rfind("/") + 1]
folder_structure_arr = blob_name.split("/")
folder_name = ""
if len(folder_structure_arr) > current_level:
folder_name = f"{folder_name}/{folder_structure_arr[current_level]}"
else:
continue
folder_name = folder_name[1 : len(folder_name)]
if folder_name.endswith("/"):
folder_name = folder_name[:-1]
if folder_name == "":
continue
folder_name = f"{prefix}{folder_name}"
if folder_name in this_dict.keys():
continue
else:
this_dict[folder_name] = folder_name
yield f"{folder_name}"
if folder_name in this_dict:
Tools
Ruff

272-272: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)

Comment on lines 75 to 87
def get_abs_properties(
container_name: str,
blob_name: Optional[str],
full_path: str,
number_of_files: int,
size_in_bytes: int,
sample_files: bool,
azure_config: Optional[AzureConnectionConfig],
use_abs_container_properties: Optional[bool] = False,
use_abs_blob_properties: Optional[bool] = False,
) -> Dict[str, str]:
if azure_config is None:
raise ValueError("No container_client available.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add error handling for get_blob_properties and get_container_properties.

The function can be improved by adding error handling for the get_blob_properties and get_container_properties calls to handle potential errors gracefully.

-        blob_properties = blob_client.get_blob_properties()
+        try:
+            blob_properties = blob_client.get_blob_properties()
+        except Exception as e:
+            logger.error(f"Failed to get blob properties: {e}")
+            blob_properties = None
-        container_properties = container_client.get_container_properties()
+        try:
+            container_properties = container_client.get_container_properties()
+        except Exception as e:
+            logger.error(f"Failed to get container properties: {e}")
+            container_properties = None
Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def get_abs_properties(
container_name: str,
blob_name: Optional[str],
full_path: str,
number_of_files: int,
size_in_bytes: int,
sample_files: bool,
azure_config: Optional[AzureConnectionConfig],
use_abs_container_properties: Optional[bool] = False,
use_abs_blob_properties: Optional[bool] = False,
) -> Dict[str, str]:
if azure_config is None:
raise ValueError("No container_client available.")
def get_abs_properties(
container_name: str,
blob_name: Optional[str],
full_path: str,
number_of_files: int,
size_in_bytes: int,
sample_files: bool,
azure_config: Optional[AzureConnectionConfig],
use_abs_container_properties: Optional[bool] = False,
use_abs_blob_properties: Optional[bool] = False,
) -> Dict[str, str]:
if azure_config is None:
raise ValueError("No container_client available.")
try:
blob_properties = blob_client.get_blob_properties()
except Exception as e:
logger.error(f"Failed to get blob properties: {e}")
blob_properties = None
try:
container_properties = container_client.get_container_properties()
except Exception as e:
logger.error(f"Failed to get container properties: {e}")
container_properties = None

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Outside diff range and nitpick comments (2)
metadata-ingestion/docs/sources/s3/README.md (2)

Line range hint 41-44: Fix unordered list indentation.

Ensure proper indentation for unordered lists.

- - Row and column counts for each dataset
- - For each column, if profiling is enabled:
-     - null counts and proportions
-     - distinct counts and proportions
-     - minimum, maximum, mean, median, standard deviation, some quantile values
-     - histograms or frequencies of unique values
+ - Row and column counts for each dataset
+ - For each column, if profiling is enabled:
+   - null counts and proportions
+   - distinct counts and proportions
+   - minimum, maximum, mean, median, standard deviation, some quantile values
+   - histograms or frequencies of unique values
Tools
Markdownlint

1-1: Expected: 0 or 2; Actual: 1
Trailing spaces

(MD009, no-trailing-spaces)


Line range hint 21-21: Fix invalid link fragments.

The link fragments should be valid.

- max_rows` recipe parameter (see [below](#config-details))
+ max_rows` recipe parameter (see [below](#profiling))

- see [compatibility](#compatibility) for more details
+ see [compatibility](#compatibility) for more details

Also applies to: 46-46

Tools
Markdownlint

1-1: Expected: 0 or 2; Actual: 1
Trailing spaces

(MD009, no-trailing-spaces)

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between c7f59d3 and f6025a1.

Files selected for processing (5)
  • metadata-ingestion/docs/sources/abs/README.md (1 hunks)
  • metadata-ingestion/docs/sources/abs/abs.md (1 hunks)
  • metadata-ingestion/docs/sources/s3/README.md (1 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py (1 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/data_lake_common/data_lake_utils.py (4 hunks)
Files not reviewed due to errors (1)
  • metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py (no review received)
Files skipped from review as they are similar to previous changes (1)
  • metadata-ingestion/src/datahub/ingestion/source/data_lake_common/data_lake_utils.py
Additional context used
LanguageTool
metadata-ingestion/docs/sources/abs/README.md

[uncategorized] ~3-~3: You might be missing the article “the” here.
Context: ...group of files that form a dataset, use path_specs configuration in ingestion r...

(AI_EN_LECTOR_MISSING_DETERMINER_THE)


[uncategorized] ~3-~3: You might be missing the article “the” here.
Context: ...aset, use path_specs configuration in ingestion recipe. Refer section [Path Specs](http...

(AI_EN_LECTOR_MISSING_DETERMINER_THE)


[style] ~34-~34: ‘on the basis of’ might be wordy. Consider a shorter alternative.
Context: ...etails)) JSON file schemas are inferred on the basis of the entire file (given the difficulty i...

(EN_WORDINESS_PREMIUM_ON_THE_BASIS_OF)

metadata-ingestion/docs/sources/s3/README.md

[style] ~22-~22: ‘on the basis of’ might be wordy. Consider a shorter alternative.
Context: ...etails)) JSON file schemas are inferred on the basis of the entire file (given the difficulty i...

(EN_WORDINESS_PREMIUM_ON_THE_BASIS_OF)


[style] ~46-~46: This phrase is redundant. Consider using “outside”.
Context: ...to interface with AWS (schema inference outside of profiling requires s3:// access). Enabl...

(OUTSIDE_OF)

metadata-ingestion/docs/sources/abs/abs.md

[grammar] ~4-~4: The verb ‘include’ does not usually follow articles like ‘The’. Check that ‘include’ is spelled correctly; using ‘include’ as a noun may be non-standard.
Context: ..._spec represents one or more datasets. The include path (path_spec.include`) represents a...

(A_INFINITIVE)


[grammar] ~17-~17: You’ve repeated a verb. Did you mean to only write one of them?
Context: ...in named variables. ### Path Specs - Examples #### Example 1 - Individual file as Dataset Contain...

(REPEATED_VERBS)


[uncategorized] ~202-~202: This verb does not appear to agree with the subject. Consider using a different form.
Context: ...ames that don't start with a letter and contains other character than letters, number, a...

(AI_EN_LECTOR_REPLACEMENT_VERB_AGREEMENT)


[uncategorized] ~202-~202: The grammatical number of this noun doesn’t look right. Consider replacing it.
Context: ... contains other character than letters, number, and underscore. [https://github.com/ap...

(AI_EN_LECTOR_REPLACEMENT_NOUN_NUMBER)

Markdownlint
metadata-ingestion/docs/sources/abs/README.md

33-33: null
Link fragments should be valid

(MD051, link-fragments)

metadata-ingestion/docs/sources/s3/README.md

41-41: Expected: 2; Actual: 4
Unordered list indentation

(MD007, ul-indent)


42-42: Expected: 2; Actual: 4
Unordered list indentation

(MD007, ul-indent)


43-43: Expected: 2; Actual: 4
Unordered list indentation

(MD007, ul-indent)


44-44: Expected: 2; Actual: 4
Unordered list indentation

(MD007, ul-indent)


1-1: Expected: 0 or 2; Actual: 1
Trailing spaces

(MD009, no-trailing-spaces)


9-9: Expected: 1; Actual: 0; Below
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)


39-39: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)


21-21: null
Link fragments should be valid

(MD051, link-fragments)


46-46: null
Link fragments should be valid

(MD051, link-fragments)

metadata-ingestion/docs/sources/abs/abs.md

55-55: Expected: 0 or 2; Actual: 1
Trailing spaces

(MD009, no-trailing-spaces)


202-202: Expected: 0 or 2; Actual: 1
Trailing spaces

(MD009, no-trailing-spaces)


16-16: Expected: 1; Actual: 2
Multiple consecutive blank lines

(MD012, no-multiple-blanks)


107-107: Expected: 1; Actual: 2
Multiple consecutive blank lines

(MD012, no-multiple-blanks)


169-169: Expected: 1; Actual: 2
Multiple consecutive blank lines

(MD012, no-multiple-blanks)


170-170: Expected: 1; Actual: 3
Multiple consecutive blank lines

(MD012, no-multiple-blanks)


17-17: Expected: 1; Actual: 0; Below
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)


18-18: Expected: 1; Actual: 0; Above
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)


30-30: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


34-34: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


40-40: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


49-49: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


59-59: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


74-74: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


84-84: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


100-100: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


111-111: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


131-131: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


13-13: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)


141-141: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)


164-164: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)


165-165: null
Bare URL used

(MD034, no-bare-urls)


167-167: null
Bare URL used

(MD034, no-bare-urls)


12-12: null
Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


147-147: null
Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


163-163: null
Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


22-22: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


30-30: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


40-40: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


49-49: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


59-59: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


74-74: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


84-84: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


100-100: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


111-111: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


131-131: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


204-204: null
Files should end with a single newline character

(MD047, single-trailing-newline)


4-4: null
Link fragments should be valid

(MD051, link-fragments)

Ruff
metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py

145-145: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


271-271: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)

Additional comments not posted (10)
metadata-ingestion/docs/sources/abs/README.md (2)

3-3: Add missing article "the".

The sentence is missing the article "the" before "section."

- Refer section [Path Specs](https://datahubproject.io/docs/generated/ingestion/sources/s3/#path-specs) for more details.
+ Refer to the section [Path Specs](https://datahubproject.io/docs/generated/ingestion/sources/s3/#path-specs) for more details.
Tools
LanguageTool

[uncategorized] ~3-~3: You might be missing the article “the” here.
Context: ...group of files that form a dataset, use path_specs configuration in ingestion r...

(AI_EN_LECTOR_MISSING_DETERMINER_THE)


[uncategorized] ~3-~3: You might be missing the article “the” here.
Context: ...aset, use path_specs configuration in ingestion recipe. Refer section [Path Specs](http...

(AI_EN_LECTOR_MISSING_DETERMINER_THE)


34-34: Consider a shorter alternative to "on the basis of".

The phrase "on the basis of" might be wordy. Consider using "based on" instead.

- JSON file schemas are inferred on the basis of the entire file (given the difficulty in extracting only the first few objects of the file), which may impact performance.
+ JSON file schemas are inferred based on the entire file (given the difficulty in extracting only the first few objects of the file), which may impact performance.
Tools
LanguageTool

[style] ~34-~34: ‘on the basis of’ might be wordy. Consider a shorter alternative.
Context: ...etails)) JSON file schemas are inferred on the basis of the entire file (given the difficulty i...

(EN_WORDINESS_PREMIUM_ON_THE_BASIS_OF)

metadata-ingestion/docs/sources/s3/README.md (5)

Line range hint 22-22: Consider a shorter alternative to "on the basis of".

The phrase "on the basis of" might be wordy. Consider using "based on" instead.

- JSON file schemas are inferred on the basis of the entire file (given the difficulty in extracting only the first few objects of the file), which may impact performance.
+ JSON file schemas are inferred based on the entire file (given the difficulty in extracting only the first few objects of the file), which may impact performance.
Tools
Markdownlint

1-1: Expected: 0 or 2; Actual: 1
Trailing spaces

(MD009, no-trailing-spaces)


Line range hint 46-46: Consider using "outside" instead of "outside of".

The phrase "outside of" is redundant. Consider using "outside" instead.

- schema inference outside of profiling requires s3:// access
+ schema inference outside profiling requires s3:// access
Tools
Markdownlint

1-1: Expected: 0 or 2; Actual: 1
Trailing spaces

(MD009, no-trailing-spaces)


1-1: Remove trailing spaces.

Trailing spaces should be removed for better formatting.

- This connector ingests AWS S3 datasets into DataHub. It allows mapping an individual file or a folder of files to a dataset in DataHub. 
+ This connector ingests AWS S3 datasets into DataHub. It allows mapping an individual file or a folder of files to a dataset in DataHub.
Tools
Markdownlint

1-1: Expected: 0 or 2; Actual: 1
Trailing spaces

(MD009, no-trailing-spaces)


Line range hint 9-9: Add blank lines around headings.

Headings should be surrounded by blank lines for better readability.

- ### Supported file types
+ 
+ ### Supported file types
Tools
Markdownlint

1-1: Expected: 0 or 2; Actual: 1
Trailing spaces

(MD009, no-trailing-spaces)


Line range hint 39-39: Add blank lines around lists.

Lists should be surrounded by blank lines for better readability.

- - Row and column counts for each dataset
+ 
+ - Row and column counts for each dataset
Tools
Markdownlint

1-1: Expected: 0 or 2; Actual: 1
Trailing spaces

(MD009, no-trailing-spaces)

metadata-ingestion/docs/sources/abs/abs.md (3)

17-18: Add blank lines around headings.

Headings should be surrounded by blank lines for better readability.

- ### Path Specs -  Examples
- #### Example 1 - Individual file as Dataset
+ ### Path Specs - Examples

+ #### Example 1 - Individual file as Dataset
Tools
LanguageTool

[grammar] ~17-~17: You’ve repeated a verb. Did you mean to only write one of them?
Context: ...in named variables. ### Path Specs - Examples #### Example 1 - Individual file as Dataset Contain...

(REPEATED_VERBS)

Markdownlint

17-17: Expected: 1; Actual: 0; Below
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)


18-18: Expected: 1; Actual: 0; Above
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)


147-147: Use headings instead of emphasis for better structure.

The emphasis should be replaced with a heading for better readability and structure.

- **Valid path_specs.include**
+ ### Valid path_specs.include

- **Valid path_specs.exclude**
+ ### Valid path_specs.exclude

Also applies to: 163-163

Tools
Markdownlint

147-147: null
Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


164-168: Add blank lines around lists.

Lists should be surrounded by blank lines for better readability.

- - \**/tests/**
+ 
+ - \**/tests/**
Tools
Markdownlint

164-164: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)


165-165: null
Bare URL used

(MD034, no-bare-urls)


167-167: null
Bare URL used

(MD034, no-bare-urls)

Schemas for Parquet and Avro files are extracted as provided.

Schemas for schemaless formats (CSV, TSV, JSONL, JSON) are inferred. For CSV, TSV and JSONL files, we consider the first
100 rows by default, which can be controlled via the `max_rows` recipe parameter (see [below](#config-details))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix invalid link fragment.

The link fragment #config-details should be valid.

- max_rows` recipe parameter (see [below](#config-details))
+ max_rows` recipe parameter (see [below](#profiling))
Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
100 rows by default, which can be controlled via the `max_rows` recipe parameter (see [below](#config-details))
100 rows by default, which can be controlled via the `max_rows` recipe parameter (see [below](#profiling))
Tools
Markdownlint

33-33: null
Link fragments should be valid

(MD051, link-fragments)


Refer to [example 5](#example-5---advanced---either-individual-file-or-folder-of-files-as-dataset) if your container has a more complex dataset representation.

**Additional points to note**
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use headings instead of emphasis for better structure.

The emphasis should be replaced with a heading for better readability and structure.

- **Additional points to note**
+ ### Additional Points to Note
Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
**Additional points to note**
### Additional Points to Note
Tools
Markdownlint

12-12: null
Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between f6025a1 and 3f0d854.

Files selected for processing (1)
  • metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py (1 hunks)
Additional context used
Ruff
metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py

147-147: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


277-277: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)

Additional comments not posted (8)
metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py (8)

21-22: LGTM!

The function correctly checks if a given URI matches the Azure Blob Storage pattern.


25-29: LGTM!

The function correctly retrieves the prefix of an Azure Blob Storage URI.


32-40: LGTM!

The function correctly removes the Azure Blob Storage prefix from a URI.


43-55: LGTM!

The function correctly generates a URN for an Azure Blob Storage dataset.


58-63: LGTM!

The function correctly retrieves the container name from an Azure Blob Storage URI.


66-71: LGTM!

The function correctly retrieves the key prefix from an Azure Blob Storage URI.


154-188: LGTM!

The function correctly creates properties recursively.


285-286: LGTM!

The function correctly retrieves the relative path of a container.

Comment on lines 243 to 282
def list_folders(
container_name: str, prefix: str, azure_config: Optional[AzureConnectionConfig]
) -> Iterable[str]:
if azure_config is None:
raise ValueError(
"Azure configuration is not provided. Cannot retrieve container client."
)

abs_blob_service_client = azure_config.get_blob_service_client()
container_client = abs_blob_service_client.get_container_client(container_name)

current_level = prefix.count("/")
blob_list = container_client.list_blobs(name_starts_with=prefix)

this_dict = {}
for blob in blob_list:
blob_name = blob.name[: blob.name.rfind("/") + 1]
folder_structure_arr = blob_name.split("/")

folder_name = ""
if len(folder_structure_arr) > current_level:
folder_name = f"{folder_name}/{folder_structure_arr[current_level]}"
else:
continue

folder_name = folder_name[1 : len(folder_name)]

if folder_name.endswith("/"):
folder_name = folder_name[:-1]

if folder_name == "":
continue

folder_name = f"{prefix}{folder_name}"
if folder_name in this_dict.keys():
continue
else:
this_dict[folder_name] = folder_name

yield f"{folder_name}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove unnecessary .keys() method.

Use key in dict instead of key in dict.keys() for better performance.

- if folder_name in this_dict.keys():
+ if folder_name in this_dict:
Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def list_folders(
container_name: str, prefix: str, azure_config: Optional[AzureConnectionConfig]
) -> Iterable[str]:
if azure_config is None:
raise ValueError(
"Azure configuration is not provided. Cannot retrieve container client."
)
abs_blob_service_client = azure_config.get_blob_service_client()
container_client = abs_blob_service_client.get_container_client(container_name)
current_level = prefix.count("/")
blob_list = container_client.list_blobs(name_starts_with=prefix)
this_dict = {}
for blob in blob_list:
blob_name = blob.name[: blob.name.rfind("/") + 1]
folder_structure_arr = blob_name.split("/")
folder_name = ""
if len(folder_structure_arr) > current_level:
folder_name = f"{folder_name}/{folder_structure_arr[current_level]}"
else:
continue
folder_name = folder_name[1 : len(folder_name)]
if folder_name.endswith("/"):
folder_name = folder_name[:-1]
if folder_name == "":
continue
folder_name = f"{prefix}{folder_name}"
if folder_name in this_dict.keys():
continue
else:
this_dict[folder_name] = folder_name
yield f"{folder_name}"
if folder_name in this_dict:
continue
else:
this_dict[folder_name] = folder_name
Tools
Ruff

277-277: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)

Comment on lines +190 to +240
def get_abs_tags(
container_name: str,
key_name: Optional[str],
dataset_urn: str,
azure_config: Optional[AzureConnectionConfig],
ctx: PipelineContext,
use_abs_blob_tags: Optional[bool] = False,
) -> Optional[GlobalTagsClass]:
# Todo add the service_client, when building out this get_abs_tags
if azure_config is None:
raise ValueError(
"Azure configuration is not provided. Cannot retrieve container client."
)

tags_to_add: List[str] = []
blob_service_client = azure_config.get_blob_service_client()
container_client = blob_service_client.get_container_client(container_name)
blob_client = container_client.get_blob_client(blob=key_name)

if use_abs_blob_tags and key_name is not None:
tag_set = blob_client.get_blob_tags()
if tag_set:
tags_to_add.extend(
make_tag_urn(f"""{key}:{value}""") for key, value in tag_set.items()
)
else:
# Unlike container tags, if an object does not have tags, it will just return an empty array
# as opposed to an exception.
logger.info(f"No tags found for container={container_name} key={key_name}")

if len(tags_to_add) == 0:
return None

if ctx.graph is not None:
logger.debug("Connected to DatahubApi, grabbing current tags to maintain.")
current_tags: Optional[GlobalTagsClass] = ctx.graph.get_aspect(
entity_urn=dataset_urn,
aspect_type=GlobalTagsClass,
)
if current_tags:
tags_to_add.extend([current_tag.tag for current_tag in current_tags.tags])
else:
logger.warning("Could not connect to DatahubApi. No current tags to maintain")

# Sort existing tags
tags_to_add = sorted(list(set(tags_to_add)))
# Remove duplicate tags
new_tags = GlobalTagsClass(
tags=[TagAssociationClass(tag_to_add) for tag_to_add in tags_to_add]
)
return new_tags
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add error handling for get_blob_tags.

The function can be improved by adding error handling for the get_blob_tags call to handle potential errors gracefully.

-        tag_set = blob_client.get_blob_tags()
+        try:
+            tag_set = blob_client.get_blob_tags()
+        except Exception as e:
+            logger.error(f"Failed to get blob tags: {e}")
+            tag_set = None
Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def get_abs_tags(
container_name: str,
key_name: Optional[str],
dataset_urn: str,
azure_config: Optional[AzureConnectionConfig],
ctx: PipelineContext,
use_abs_blob_tags: Optional[bool] = False,
) -> Optional[GlobalTagsClass]:
# Todo add the service_client, when building out this get_abs_tags
if azure_config is None:
raise ValueError(
"Azure configuration is not provided. Cannot retrieve container client."
)
tags_to_add: List[str] = []
blob_service_client = azure_config.get_blob_service_client()
container_client = blob_service_client.get_container_client(container_name)
blob_client = container_client.get_blob_client(blob=key_name)
if use_abs_blob_tags and key_name is not None:
tag_set = blob_client.get_blob_tags()
if tag_set:
tags_to_add.extend(
make_tag_urn(f"""{key}:{value}""") for key, value in tag_set.items()
)
else:
# Unlike container tags, if an object does not have tags, it will just return an empty array
# as opposed to an exception.
logger.info(f"No tags found for container={container_name} key={key_name}")
if len(tags_to_add) == 0:
return None
if ctx.graph is not None:
logger.debug("Connected to DatahubApi, grabbing current tags to maintain.")
current_tags: Optional[GlobalTagsClass] = ctx.graph.get_aspect(
entity_urn=dataset_urn,
aspect_type=GlobalTagsClass,
)
if current_tags:
tags_to_add.extend([current_tag.tag for current_tag in current_tags.tags])
else:
logger.warning("Could not connect to DatahubApi. No current tags to maintain")
# Sort existing tags
tags_to_add = sorted(list(set(tags_to_add)))
# Remove duplicate tags
new_tags = GlobalTagsClass(
tags=[TagAssociationClass(tag_to_add) for tag_to_add in tags_to_add]
)
return new_tags
tag_set = None
try:
tag_set = blob_client.get_blob_tags()
except Exception as e:
logger.error(f"Failed to get blob tags: {e}")

Comment on lines 144 to 151
def add_property(
key: str, value: str, custom_properties: Dict[str, str], resource_name: str
) -> Dict[str, str]:
if key in custom_properties.keys():
key = f"{key}_{resource_name}"
if value is not None:
custom_properties[key] = str(value)
return custom_properties
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove unnecessary .keys() method.

Use key in dict instead of key in dict.keys() for better performance.

- if key in custom_properties.keys():
+ if key in custom_properties:
Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def add_property(
key: str, value: str, custom_properties: Dict[str, str], resource_name: str
) -> Dict[str, str]:
if key in custom_properties.keys():
key = f"{key}_{resource_name}"
if value is not None:
custom_properties[key] = str(value)
return custom_properties
def add_property(
key: str, value: str, custom_properties: Dict[str, str], resource_name: str
) -> Dict[str, str]:
if key in custom_properties:
key = f"{key}_{resource_name}"
if value is not None:
custom_properties[key] = str(value)
return custom_properties
Tools
Ruff

147-147: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)

Comment on lines +74 to +141
def get_abs_properties(
container_name: str,
blob_name: Optional[str],
full_path: str,
number_of_files: int,
size_in_bytes: int,
sample_files: bool,
azure_config: Optional[AzureConnectionConfig],
use_abs_container_properties: Optional[bool] = False,
use_abs_blob_properties: Optional[bool] = False,
) -> Dict[str, str]:
if azure_config is None:
raise ValueError(
"Azure configuration is not provided. Cannot retrieve container client."
)

blob_service_client = azure_config.get_blob_service_client()
container_client = blob_service_client.get_container_client(
container=container_name
)

custom_properties = {"schema_inferred_from": full_path}
if not sample_files:
custom_properties.update(
{
"number_of_files": str(number_of_files),
"size_in_bytes": str(size_in_bytes),
}
)

if use_abs_blob_properties and blob_name is not None:
blob_client = container_client.get_blob_client(blob=blob_name)
blob_properties = blob_client.get_blob_properties()
if blob_properties:
create_properties(
data=blob_properties,
prefix="blob",
custom_properties=custom_properties,
resource_name=blob_name,
json_properties=[
"metadata",
"content_settings",
"lease",
"copy",
"immutability_policy",
],
)
else:
logger.warning(
f"No blob properties found for container={container_name}, blob={blob_name}."
)

if use_abs_container_properties:
container_properties = container_client.get_container_properties()
if container_properties:
create_properties(
data=container_properties,
prefix="container",
custom_properties=custom_properties,
resource_name=container_name,
json_properties=["metadata"],
)
else:
logger.warning(
f"No container properties found for container={container_name}."
)

return custom_properties
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add error handling for get_blob_properties and get_container_properties.

The function can be improved by adding error handling for the get_blob_properties and get_container_properties calls to handle potential errors gracefully.

-        blob_properties = blob_client.get_blob_properties()
+        try:
+            blob_properties = blob_client.get_blob_properties()
+        except Exception as e:
+            logger.error(f"Failed to get blob properties: {e}")
+            blob_properties = None
-        container_properties = container_client.get_container_properties()
+        try:
+            container_properties = container_client.get_container_properties()
+        except Exception as e:
+            logger.error(f"Failed to get container properties: {e}")
+            container_properties = None
Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def get_abs_properties(
container_name: str,
blob_name: Optional[str],
full_path: str,
number_of_files: int,
size_in_bytes: int,
sample_files: bool,
azure_config: Optional[AzureConnectionConfig],
use_abs_container_properties: Optional[bool] = False,
use_abs_blob_properties: Optional[bool] = False,
) -> Dict[str, str]:
if azure_config is None:
raise ValueError(
"Azure configuration is not provided. Cannot retrieve container client."
)
blob_service_client = azure_config.get_blob_service_client()
container_client = blob_service_client.get_container_client(
container=container_name
)
custom_properties = {"schema_inferred_from": full_path}
if not sample_files:
custom_properties.update(
{
"number_of_files": str(number_of_files),
"size_in_bytes": str(size_in_bytes),
}
)
if use_abs_blob_properties and blob_name is not None:
blob_client = container_client.get_blob_client(blob=blob_name)
blob_properties = blob_client.get_blob_properties()
if blob_properties:
create_properties(
data=blob_properties,
prefix="blob",
custom_properties=custom_properties,
resource_name=blob_name,
json_properties=[
"metadata",
"content_settings",
"lease",
"copy",
"immutability_policy",
],
)
else:
logger.warning(
f"No blob properties found for container={container_name}, blob={blob_name}."
)
if use_abs_container_properties:
container_properties = container_client.get_container_properties()
if container_properties:
create_properties(
data=container_properties,
prefix="container",
custom_properties=custom_properties,
resource_name=container_name,
json_properties=["metadata"],
)
else:
logger.warning(
f"No container properties found for container={container_name}."
)
return custom_properties
def get_abs_properties(
container_name: str,
blob_name: Optional[str],
full_path: str,
number_of_files: int,
size_in_bytes: int,
sample_files: bool,
azure_config: Optional[AzureConnectionConfig],
use_abs_container_properties: Optional[bool] = False,
use_abs_blob_properties: Optional[bool] = False,
) -> Dict[str, str]:
if azure_config is None:
raise ValueError(
"Azure configuration is not provided. Cannot retrieve container client."
)
blob_service_client = azure_config.get_blob_service_client()
container_client = blob_service_client.get_container_client(
container=container_name
)
custom_properties = {"schema_inferred_from": full_path}
if not sample_files:
custom_properties.update(
{
"number_of_files": str(number_of_files),
"size_in_bytes": str(size_in_bytes),
}
)
if use_abs_blob_properties and blob_name is not None:
blob_client = container_client.get_blob_client(blob=blob_name)
try:
blob_properties = blob_client.get_blob_properties()
except Exception as e:
logger.error(f"Failed to get blob properties: {e}")
blob_properties = None
if blob_properties:
create_properties(
data=blob_properties,
prefix="blob",
custom_properties=custom_properties,
resource_name=blob_name,
json_properties=[
"metadata",
"content_settings",
"lease",
"copy",
"immutability_policy",
],
)
else:
logger.warning(
f"No blob properties found for container={container_name}, blob={blob_name}."
)
if use_abs_container_properties:
try:
container_properties = container_client.get_container_properties()
except Exception as e:
logger.error(f"Failed to get container properties: {e}")
container_properties = None
if container_properties:
create_properties(
data=container_properties,
prefix="container",
custom_properties=custom_properties,
resource_name=container_name,
json_properties=["metadata"],
)
else:
logger.warning(
f"No container properties found for container={container_name}."
)
return custom_properties

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 3f0d854 and d315ecd.

Files selected for processing (1)
  • metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py (1 hunks)
Files skipped from review as they are similar to previous changes (1)
  • metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between d315ecd and b16d626.

Files selected for processing (1)
  • metadata-ingestion/src/datahub/ingestion/source/abs/source.py (1 hunks)
Additional context used
Ruff
metadata-ingestion/src/datahub/ingestion/source/abs/source.py

163-164: Use a single if statement instead of nested if statements

(SIM102)


686-686: Loop control variable guid not used within loop body

Rename unused guid to _guid

(B007)

Additional comments not posted (35)
metadata-ingestion/src/datahub/ingestion/source/abs/source.py (35)

128-148: LGTM!

The function correctly maps Spark types to DataHub types.


159-179: Simplify nested if statements.

Use a single if statement instead of nested if statements.

-        if "=" in folder1 and "=" in folder2:
-            if folder1.rsplit("=", 1)[0] == folder2.rsplit("=", 1)[0]:
+        if "=" in folder1 and "=" in folder2 and folder1.rsplit("=", 1)[0] == folder2.rsplit("=", 1)[0]:
Tools
Ruff

163-164: Use a single if statement instead of nested if statements

(SIM102)


182-193: LGTM!

The TableData dataclass correctly represents table metadata.


195-700: LGTM!

The ABSSource class is well-structured and follows the expected pattern for DataHub sources.

Tools
Ruff

686-686: Loop control variable guid not used within loop body

Rename unused guid to _guid

(B007)


642-688: LGTM!

The function correctly generates work units for the ingestion process.

Tools
Ruff

686-686: Loop control variable guid not used within loop body

Rename unused guid to _guid

(B007)


689-695: LGTM!

The function correctly returns a list of work unit processors.


697-698: LGTM!

The function correctly checks if the platform is Azure Blob Storage.


700-701: LGTM!

The function correctly returns the report object.


236-300: LGTM!

The function correctly handles different file formats and compression types.


302-325: LGTM!

The function correctly adds partition columns to the schema.


326-335: LGTM!

The function correctly creates an operation aspect for the table.


337-438: LGTM!

The function is well-structured and handles various aspects of table ingestion.


523-607: LGTM!

The function is well-structured and handles various scenarios for browsing Azure Blob Storage.


609-616: LGTM!

The function correctly creates an ABS path from a given key.


618-641: LGTM!

The function is well-structured and handles various scenarios for browsing local files.


475-491: LGTM!

The function correctly resolves folders with templates in the prefix.


492-521: LGTM!

The function correctly retrieves the directory to process based on the path specification.


447-450: LGTM!

The function correctly extracts the table name from the path specification.


452-473: LGTM!

The function correctly extracts table data from the path specification and file metadata.


440-445: LGTM!

The function correctly retrieves the prefix from the relative path.


337-438: LGTM!

The function is well-structured and handles various aspects of table ingestion.


523-607: LGTM!

The function is well-structured and handles various scenarios for browsing Azure Blob Storage.


609-616: LGTM!

The function correctly creates an ABS path from a given key.


618-641: LGTM!

The function is well-structured and handles various scenarios for browsing local files.


475-491: LGTM!

The function correctly resolves folders with templates in the prefix.


492-521: LGTM!

The function correctly retrieves the directory to process based on the path specification.


447-450: LGTM!

The function correctly extracts the table name from the path specification.


452-473: LGTM!

The function correctly extracts table data from the path specification and file metadata.


440-445: LGTM!

The function correctly retrieves the prefix from the relative path.


642-688: LGTM!

The function correctly generates work units for the ingestion process.

Tools
Ruff

686-686: Loop control variable guid not used within loop body

Rename unused guid to _guid

(B007)


689-695: LGTM!

The function correctly returns a list of work unit processors.


697-698: LGTM!

The function correctly checks if the platform is Azure Blob Storage.


700-701: LGTM!

The function correctly returns the report object.


236-300: LGTM!

The function correctly handles different file formats and compression types.


302-325: LGTM!

The function correctly adds partition columns to the schema.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between b16d626 and d45710c.

Files selected for processing (1)
  • metadata-ingestion/src/datahub/ingestion/source/abs/source.py (1 hunks)
Additional context used
Ruff
metadata-ingestion/src/datahub/ingestion/source/abs/source.py

162-163: Use a single if statement instead of nested if statements

(SIM102)


685-685: Loop control variable guid not used within loop body

Rename unused guid to _guid

(B007)

Additional comments not posted (24)
metadata-ingestion/src/datahub/ingestion/source/abs/source.py (24)

1-95: Imports look good.

The imports are necessary for the new Azure Blob Storage functionality.


127-147: Function looks good.

The get_column_type function correctly maps Spark types to DataHub types and includes error handling.


158-178: Simplify nested if statements.

Use a single if statement instead of nested if statements.

-        if "=" in folder1 and "=" in folder2:
-            if folder1.rsplit("=", 1)[0] == folder2.rsplit("=", 1)[0]:
+        if "=" in folder1 and "=" in folder2 and folder1.rsplit("=", 1)[0] == folder2.rsplit("=", 1)[0]:
Tools
Ruff

162-163: Use a single if statement instead of nested if statements

(SIM102)


181-191: Dataclass looks good.

The TableData dataclass correctly represents table data and includes necessary fields.


210-227: Constructor looks good.

The __init__ method correctly initializes the ABSSource class and includes telemetry reporting.


229-233: Factory method looks good.

The create method correctly creates an instance of the ABSSource class.


235-299: Method looks good.

The get_fields method correctly extracts fields from a table and includes error handling.


301-324: Method looks good.

The add_partition_columns_to_schema method correctly adds partition columns to the schema and includes necessary logging.


325-334: Method looks good.

The _create_table_operation_aspect method correctly creates a table operation aspect and follows the expected structure.


336-437: Method looks good.

The ingest_table method correctly ingests a table and yields metadata work units. It includes error handling and supports various aspects.


439-444: Method looks good.

The get_prefix method correctly gets the prefix from a relative path and includes necessary regex handling.


446-449: Method looks good.

The extract_table_name method correctly extracts the table name from a path spec and includes necessary validation.


451-472: Method looks good.

The extract_table_data method correctly extracts table data from a path spec and includes necessary logging.


474-490: Method looks good.

The resolve_templated_folders method correctly resolves templated folders and includes necessary recursion.


491-520: Method looks good.

The get_dir_to_process method correctly gets the directory to process and includes necessary sorting and recursion.


522-606: Method looks good.

The abs_browser method correctly browses Azure Blob Storage and includes error handling. It supports templated folders.


608-615: Method looks good.

The create_abs_path method correctly creates an Azure Blob Storage path and includes necessary validation.


617-640: Method looks good.

The local_browser method correctly browses local files and includes necessary sorting and recursion.


641-687: Method looks good.

The get_workunits_internal method correctly generates metadata work units and includes error handling. It supports various path specs.

Tools
Ruff

685-685: Loop control variable guid not used within loop body

Rename unused guid to _guid

(B007)


688-694: Method looks good.

The get_workunit_processors method correctly returns a list of work unit processors and includes the stale entity removal handler.


696-697: Method looks good.

The is_abs_platform method correctly checks if the platform is Azure Blob Storage and includes necessary validation.


699-700: Method looks good.

The get_report method correctly returns the report and includes necessary validation.


162-163: Simplify nested if statements.

Use a single if statement instead of nested if statements.

-        if "=" in folder1 and "=" in folder2:
-            if folder1.rsplit("=", 1)[0] == folder2.rsplit("=", 1)[0]:
+        if "=" in folder1 and "=" in folder2 and folder1.rsplit("=", 1)[0] == folder2.rsplit("=", 1)[0]:
Tools
Ruff

162-163: Use a single if statement instead of nested if statements

(SIM102)


685-685: Rename unused loop variable.

Rename the unused loop variable guid to _guid.

-                for guid, table_data in table_dict.items():
+                for _guid, table_data in table_dict.items():
Tools
Ruff

685-685: Loop control variable guid not used within loop body

Rename unused guid to _guid

(B007)

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between d45710c and b4bf0a4.

Files selected for processing (2)
  • metadata-ingestion/src/datahub/ingestion/source/abs/source.py (1 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py (1 hunks)
Files skipped from review as they are similar to previous changes (1)
  • metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py
Additional context used
Ruff
metadata-ingestion/src/datahub/ingestion/source/abs/source.py

162-163: Use a single if statement instead of nested if statements

(SIM102)


685-685: Loop control variable guid not used within loop body

Rename unused guid to _guid

(B007)

Additional comments not posted (6)
metadata-ingestion/src/datahub/ingestion/source/abs/source.py (6)

235-299: LGTM!

The function is well-structured and handles different file formats appropriately.


301-324: LGTM!

The function is well-structured and performs the intended operations correctly.


325-334: LGTM!

The function is well-structured and performs the intended operations correctly.


336-437: LGTM!

The function is well-structured and performs the intended operations correctly.


522-607: LGTM!

The function is well-structured and performs the intended operations correctly.


617-640: LGTM!

The function is well-structured and performs the intended operations correctly.

Comment on lines +162 to +163
if "=" in folder1 and "=" in folder2:
if folder1.rsplit("=", 1)[0] == folder2.rsplit("=", 1)[0]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simplify nested if statements.

Use a single if statement instead of nested if statements.

-        if "=" in folder1 and "=" in folder2:
-            if folder1.rsplit("=", 1)[0] == folder2.rsplit("=", 1)[0]:
+        if "=" in folder1 and "=" in folder2 and folder1.rsplit("=", 1)[0] == folder2.rsplit("=", 1)[0]:
Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if "=" in folder1 and "=" in folder2:
if folder1.rsplit("=", 1)[0] == folder2.rsplit("=", 1)[0]:
if "=" in folder1 and "=" in folder2 and folder1.rsplit("=", 1)[0] == folder2.rsplit("=", 1)[0]:
Tools
Ruff

162-163: Use a single if statement instead of nested if statements

(SIM102)

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between b4bf0a4 and b2174cc.

Files selected for processing (1)
  • metadata-ingestion/setup.py (3 hunks)
Additional comments not posted (3)
metadata-ingestion/setup.py (3)

261-266: Addition of abs_base set looks good!

The dependencies listed are necessary for Azure Blob Storage functionality. Ensure that the specified versions are compatible with the rest of the project.


275-275: Modification to delta_lake set looks good!

Including abs_base ensures that Azure Blob Storage dependencies are available for Delta Lake functionality.


417-417: Addition of abs source plugin looks good!

The plugin is correctly referenced in the plugins dictionary and entry_points. Ensure that the plugin is properly integrated and tested.

Also applies to: 696-696

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between b2174cc and 8f312da.

Files selected for processing (2)
  • metadata-ingestion/src/datahub/ingestion/source/abs/source.py (1 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py (1 hunks)
Files skipped from review as they are similar to previous changes (1)
  • metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py
Additional context used
Ruff
metadata-ingestion/src/datahub/ingestion/source/abs/source.py

162-163: Use a single if statement instead of nested if statements

(SIM102)


685-685: Loop control variable guid not used within loop body

Rename unused guid to _guid

(B007)

Additional comments not posted (4)
metadata-ingestion/src/datahub/ingestion/source/abs/source.py (4)

158-178: Simplify nested if statements.

The nested if statements can be simplified to improve readability. This change was previously suggested and is still valid.

-        if "=" in folder1 and "=" in folder2:
-            if folder1.rsplit("=", 1)[0] == folder2.rsplit("=", 1)[0]:
+        if "=" in folder1 and "=" in folder2 and folder1.rsplit("=", 1)[0] == folder2.rsplit("=", 1)[0]:
Tools
Ruff

162-163: Use a single if statement instead of nested if statements

(SIM102)


685-685: Rename unused loop variable.

The loop variable guid is not used within the loop body. It should be renamed to _guid to indicate that it is unused.

-                for guid, table_data in table_dict.items():
+                for _guid, table_data in table_dict.items():
Tools
Ruff

685-685: Loop control variable guid not used within loop body

Rename unused guid to _guid

(B007)


235-299: Review file format handling in schema extraction.

This method handles various file formats for schema extraction. Ensure that all expected file formats are supported and that the error handling is robust.


522-607: Review browsing logic and error handling in Azure Blob Storage.

This method handles browsing in Azure Blob Storage. Ensure that the path handling and error management are correctly implemented, especially the handling of specific exceptions like NoSuchBucket.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 8f312da and 6963078.

Files selected for processing (1)
  • metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py (1 hunks)
Files skipped from review as they are similar to previous changes (1)
  • metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 6963078 and 9b2d0cb.

Files selected for processing (1)
  • metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py (1 hunks)
Files skipped from review as they are similar to previous changes (1)
  • metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 9b2d0cb and 7e61de6.

Files selected for processing (1)
  • metadata-ingestion/src/datahub/ingestion/source/abs/source.py (1 hunks)
Additional context used
Ruff
metadata-ingestion/src/datahub/ingestion/source/abs/source.py

162-163: Use a single if statement instead of nested if statements

(SIM102)


685-685: Loop control variable guid not used within loop body

Rename unused guid to _guid

(B007)

Additional comments not posted (2)
metadata-ingestion/src/datahub/ingestion/source/abs/source.py (2)

127-148: Function get_column_type is well-implemented.

The function correctly maps Spark types to DataHub types and handles unmapped types gracefully.


158-178: Skip: Simplification of nested if statements already addressed.

The previous comment about simplifying nested if statements has been addressed.

Tools
Ruff

162-163: Use a single if statement instead of nested if statements

(SIM102)

Comment on lines 204 to 699
prefix=f"{prefix}", results_per_page=PAGE_SIZE
):
abs_path = self.create_abs_path(obj.name)
logger.debug(f"Path: {abs_path}")
# the following line if using the file_system_client
# yield abs_path, obj.last_modified, obj.content_length,
yield abs_path, obj.name, obj.last_modified, obj.size

def create_abs_path(self, key: str) -> str:
if self.source_config.azure_config:
account_name = self.source_config.azure_config.account_name
container_name = self.source_config.azure_config.container_name
return (
f"https://{account_name}.blob.core.windows.net/{container_name}/{key}"
)
return ""

def local_browser(
self, path_spec: PathSpec
) -> Iterable[Tuple[str, str, datetime, int]]:
prefix = self.get_prefix(path_spec.include)
if os.path.isfile(prefix):
logger.debug(f"Scanning single local file: {prefix}")
file_name = prefix
yield prefix, file_name, datetime.utcfromtimestamp(
os.path.getmtime(prefix)
), os.path.getsize(prefix)
else:
logger.debug(f"Scanning files under local folder: {prefix}")
for root, dirs, files in os.walk(prefix):
dirs.sort(key=functools.cmp_to_key(partitioned_folder_comparator))

for file in sorted(files):
# We need to make sure the path is in posix style which is not true on windows
full_path = PurePath(
os.path.normpath(os.path.join(root, file))
).as_posix()
yield full_path, file, datetime.utcfromtimestamp(
os.path.getmtime(full_path)
), os.path.getsize(full_path)

def get_workunits_internal(self) -> Iterable[MetadataWorkUnit]:
self.container_WU_creator = ContainerWUCreator(
self.source_config.platform,
self.source_config.platform_instance,
self.source_config.env,
)
with PerfTimer():
assert self.source_config.path_specs
for path_spec in self.source_config.path_specs:
file_browser = (
self.abs_browser(
path_spec, self.source_config.number_of_files_to_sample
)
if self.is_abs_platform()
else self.local_browser(path_spec)
)
table_dict: Dict[str, TableData] = {}
for file, name, timestamp, size in file_browser:
if not path_spec.allowed(file):
continue
table_data = self.extract_table_data(
path_spec, file, name, timestamp, size
)
if table_data.table_path not in table_dict:
table_dict[table_data.table_path] = table_data
else:
table_dict[table_data.table_path].number_of_files = (
table_dict[table_data.table_path].number_of_files + 1
)
table_dict[table_data.table_path].size_in_bytes = (
table_dict[table_data.table_path].size_in_bytes
+ table_data.size_in_bytes
)
if (
table_dict[table_data.table_path].timestamp
< table_data.timestamp
) and (table_data.size_in_bytes > 0):
table_dict[
table_data.table_path
].full_path = table_data.full_path
table_dict[
table_data.table_path
].timestamp = table_data.timestamp

for guid, table_data in table_dict.items():
yield from self.ingest_table(table_data, path_spec)

def get_workunit_processors(self) -> List[Optional[MetadataWorkUnitProcessor]]:
return [
*super().get_workunit_processors(),
StaleEntityRemovalHandler.create(
self, self.source_config, self.ctx
).workunit_processor,
]

def is_abs_platform(self):
return self.source_config.platform == "abs"

def get_report(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider refactoring the get_fields method for better readability.

The get_fields method is functional but complex. Breaking it into smaller, more focused methods could improve readability and maintainability.

Tools
Ruff

685-685: Loop control variable guid not used within loop body

Rename unused guid to _guid

(B007)

table_data.table_path
].timestamp = table_data.timestamp

for guid, table_data in table_dict.items():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rename unused loop variable for clarity.

The loop variable guid is unused and should be renamed to _guid to indicate it's intentionally unused.

-                for guid, table_data in table_table_dict.items():
+                for _guid, table_data in table_dict.items():

Committable suggestion was skipped due to low confidence.

Tools
Ruff

685-685: Loop control variable guid not used within loop body

Rename unused guid to _guid

(B007)

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

Outside diff range, codebase verification and nitpick comments (2)
metadata-ingestion/src/datahub/ingestion/source/abs/source.py (2)

123-124: Optimization of Compression Registration

The hack to support the .gzip extension with smart_open is clear, but consider adding a comment explaining why this specific registration is necessary if it's not commonly understood.


336-438: Method: ingest_table

This method orchestrates the ingestion process and handles various metadata aspects. Ensure that all operations are atomic and transactional where necessary, especially when dealing with external systems like databases or cloud services.

Consider adding more detailed logging to trace the ingestion process step-by-step.

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 7e61de6 and 7a2c3fa.

Files selected for processing (2)
  • metadata-ingestion/src/datahub/ingestion/source/abs/source.py (1 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py (1 hunks)
Files skipped from review as they are similar to previous changes (1)
  • metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py
Additional context used
Ruff
metadata-ingestion/src/datahub/ingestion/source/abs/source.py

162-163: Use a single if statement instead of nested if statements

(SIM102)


685-685: Loop control variable guid not used within loop body

Rename unused guid to _guid

(B007)

Additional comments not posted (3)
metadata-ingestion/src/datahub/ingestion/source/abs/source.py (3)

1-33: Review of Imports and Global Constants

The imports and global constants are well-organized and relevant to the functionality being implemented. However, ensure that all these libraries and modules are necessary and used within the file to avoid unnecessary dependencies.


181-192: Data Class: TableData

The TableData class is well-defined with appropriate type hints. Ensure that all fields are necessary and used effectively throughout the codebase to prevent data bloat.


194-203: Class Decorators for ABSSource

The use of decorators to define capabilities and configuration requirements is clear and enhances modularity. Ensure that all capabilities like DATA_PROFILING and TAGS are implemented correctly within the class methods.

Comment on lines +127 to +148
def get_column_type(
report: SourceReport, dataset_name: str, column_type: str
) -> SchemaFieldDataType:
"""
Maps known Spark types to datahub types
"""
TypeClass: Any = None

for field_type, type_class in _field_type_mapping.items():
if isinstance(column_type, field_type):
TypeClass = type_class
break

# if still not found, report the warning
if TypeClass is None:
report.report_warning(
dataset_name, f"unable to map type {column_type} to metadata schema"
)
TypeClass = NullTypeClass

return SchemaFieldDataType(type=TypeClass())

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Function: get_column_type

This function maps Spark types to DataHub types. It's crucial to ensure that all possible Spark types are handled, and any unmapped types are reported correctly. The use of a warning when a type cannot be mapped is a good practice.

Consider adding unit tests to cover all branches of this function, especially the default case where a type cannot be mapped.

Would you like me to help create unit tests for this function?

Comment on lines +210 to +228
def __init__(self, config: DataLakeSourceConfig, ctx: PipelineContext):
super().__init__(config, ctx)
self.source_config = config
self.report = DataLakeSourceReport()
self.profiling_times_taken = []
config_report = {
config_option: config.dict().get(config_option)
for config_option in config_options_to_report
}
config_report = {
**config_report,
"profiling_enabled": config.is_profiling_enabled(),
}

telemetry.telemetry_instance.ping(
"data_lake_config",
config_report,
)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Constructor of ABSSource

The constructor initializes several configurations and reports telemetry. It's important to ensure that telemetry data is handled securely and does not contain sensitive information.

Consider reviewing what data is being sent in telemetry to ensure compliance with privacy laws.

Comment on lines +522 to +607
def abs_browser(
self, path_spec: PathSpec, sample_size: int
) -> Iterable[Tuple[str, str, datetime, int]]:
if self.source_config.azure_config is None:
raise ValueError("azure_config not set. Cannot browse Azure Blob Storage")
abs_blob_service_client = (
self.source_config.azure_config.get_blob_service_client()
)
container_client = abs_blob_service_client.get_container_client(
self.source_config.azure_config.container_name
)

container_name = self.source_config.azure_config.container_name
logger.debug(f"Scanning container: {container_name}")

prefix = self.get_prefix(get_container_relative_path(path_spec.include))
logger.debug(f"Scanning objects with prefix:{prefix}")

matches = re.finditer(r"{\s*\w+\s*}", path_spec.include, re.MULTILINE)
matches_list = list(matches)
if matches_list and path_spec.sample_files:
max_start: int = -1
include: str = path_spec.include
max_match: str = ""
for match in matches_list:
pos = include.find(match.group())
if pos > max_start:
if max_match:
include = include.replace(max_match, "*")
max_start = match.start()
max_match = match.group()

table_index = include.find(max_match)

for folder in self.resolve_templated_folders(
container_name,
get_container_relative_path(include[:table_index]),
):
try:
for f in list_folders(
container_name, f"{folder}", self.source_config.azure_config
):
logger.info(f"Processing folder: {f}")
protocol = ContainerWUCreator.get_protocol(path_spec.include)
dir_to_process = self.get_dir_to_process(
container_name=container_name,
folder=f + "/",
path_spec=path_spec,
protocol=protocol,
)
logger.info(f"Getting files from folder: {dir_to_process}")
dir_to_process = dir_to_process.rstrip("\\")
for obj in container_client.list_blobs(
name_starts_with=f"{dir_to_process}",
results_per_page=PAGE_SIZE,
):
abs_path = self.create_abs_path(obj.name)
logger.debug(f"Sampling file: {abs_path}")
yield abs_path, obj.name, obj.last_modified, obj.size,
except Exception as e:
# This odd check if being done because boto does not have a proper exception to catch
# The exception that appears in stacktrace cannot actually be caught without a lot more work
# https://github.com/boto/boto3/issues/1195
if "NoSuchBucket" in repr(e):
logger.debug(
f"Got NoSuchBucket exception for {container_name}", e
)
self.get_report().report_warning(
"Missing bucket", f"No bucket found {container_name}"
)
else:
raise e
else:
logger.debug(
"No template in the pathspec can't do sampling, fallbacking to do full scan"
)
path_spec.sample_files = False
for obj in container_client.list_blobs(
prefix=f"{prefix}", results_per_page=PAGE_SIZE
):
abs_path = self.create_abs_path(obj.name)
logger.debug(f"Path: {abs_path}")
# the following line if using the file_system_client
# yield abs_path, obj.last_modified, obj.content_length,
yield abs_path, obj.name, obj.last_modified, obj.size

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Method: abs_browser

This method handles the browsing of ABS paths and is crucial for the ingestion process. Ensure that error handling is robust, especially for network-related errors or API limitations.

The handling of exceptions related to NoSuchBucket could be improved by using more specific exception types if available.

Consider using more specific exception types for better error handling.

Comment on lines +158 to +178
def partitioned_folder_comparator(folder1: str, folder2: str) -> int:
# Try to convert to number and compare if the folder name is a number
try:
# Stripping = from the folder names as it most probably partition name part like year=2021
if "=" in folder1 and "=" in folder2:
if folder1.rsplit("=", 1)[0] == folder2.rsplit("=", 1)[0]:
folder1 = folder1.rsplit("=", 1)[-1]
folder2 = folder2.rsplit("=", 1)[-1]

num_folder1 = int(folder1)
num_folder2 = int(folder2)
if num_folder1 == num_folder2:
return 0
else:
return 1 if num_folder1 > num_folder2 else -1
except Exception:
# If folder name is not a number then do string comparison
if folder1 == folder2:
return 0
else:
return 1 if folder1 > folder2 else -1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Function: partitioned_folder_comparator Review

The nested if statements have been flagged by both a previous reviewer and static analysis tools. The suggestion to simplify these into a single if statement is valid and should be implemented to improve readability.

-        if "=" in folder1 and "=" in folder2:
-            if folder1.rsplit("=", 1)[0] == folder2.rsplit("=", 1)[0]:
+        if "=" in folder1 and "=" in folder2 and folder1.rsplit("=", 1)[0] == folder2.rsplit("=", 1)[0]:
Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def partitioned_folder_comparator(folder1: str, folder2: str) -> int:
# Try to convert to number and compare if the folder name is a number
try:
# Stripping = from the folder names as it most probably partition name part like year=2021
if "=" in folder1 and "=" in folder2:
if folder1.rsplit("=", 1)[0] == folder2.rsplit("=", 1)[0]:
folder1 = folder1.rsplit("=", 1)[-1]
folder2 = folder2.rsplit("=", 1)[-1]
num_folder1 = int(folder1)
num_folder2 = int(folder2)
if num_folder1 == num_folder2:
return 0
else:
return 1 if num_folder1 > num_folder2 else -1
except Exception:
# If folder name is not a number then do string comparison
if folder1 == folder2:
return 0
else:
return 1 if folder1 > folder2 else -1
def partitioned_folder_comparator(folder1: str, folder2: str) -> int:
# Try to convert to number and compare if the folder name is a number
try:
# Stripping = from the folder names as it most probably partition name part like year=2021
if "=" in folder1 and "=" in folder2 and folder1.rsplit("=", 1)[0] == folder2.rsplit("=", 1)[0]:
folder1 = folder1.rsplit("=", 1)[-1]
folder2 = folder2.rsplit("=", 1)[-1]
num_folder1 = int(folder1)
num_folder2 = int(folder2)
if num_folder1 == num_folder2:
return 0
else:
return 1 if num_folder1 > num_folder2 else -1
except Exception:
# If folder name is not a number then do string comparison
if folder1 == folder2:
return 0
else:
return 1 if folder1 > folder2 else -1
Tools
Ruff

162-163: Use a single if statement instead of nested if statements

(SIM102)

Comment on lines +235 to +298
def get_fields(self, table_data: TableData, path_spec: PathSpec) -> List:
if self.is_abs_platform():
if self.source_config.azure_config is None:
raise ValueError("Azure config is required for ABS file sources")

abs_client = self.source_config.azure_config.get_blob_service_client()
file = smart_open(
f"azure://{self.source_config.azure_config.container_name}/{table_data.rel_path}",
"rb",
transport_params={"client": abs_client},
)
else:
# We still use smart_open here to take advantage of the compression
# capabilities of smart_open.
file = smart_open(table_data.full_path, "rb")

fields = []

extension = pathlib.Path(table_data.full_path).suffix
from datahub.ingestion.source.data_lake_common.path_spec import (
SUPPORTED_COMPRESSIONS,
)

if path_spec.enable_compression and (extension[1:] in SUPPORTED_COMPRESSIONS):
# Removing the compression extension and using the one before that like .json.gz -> .json
extension = pathlib.Path(table_data.full_path).with_suffix("").suffix
if extension == "" and path_spec.default_extension:
extension = f".{path_spec.default_extension}"

try:
if extension == ".parquet":
fields = parquet.ParquetInferrer().infer_schema(file)
elif extension == ".csv":
fields = csv_tsv.CsvInferrer(
max_rows=self.source_config.max_rows
).infer_schema(file)
elif extension == ".tsv":
fields = csv_tsv.TsvInferrer(
max_rows=self.source_config.max_rows
).infer_schema(file)
elif extension == ".json":
fields = json.JsonInferrer().infer_schema(file)
elif extension == ".avro":
fields = avro.AvroInferrer().infer_schema(file)
else:
self.report.report_warning(
table_data.full_path,
f"file {table_data.full_path} has unsupported extension",
)
file.close()
except Exception as e:
self.report.report_warning(
table_data.full_path,
f"could not infer schema for file {table_data.full_path}: {e}",
)
file.close()
logger.debug(f"Extracted fields in schema: {fields}")
fields = sorted(fields, key=lambda f: f.fieldPath)

if self.source_config.add_partition_columns_to_schema:
self.add_partition_columns_to_schema(
fields=fields, path_spec=path_spec, full_path=table_data.full_path
)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Method: get_fields

This method is complex and handles various file types. Refactoring to break this method into smaller, more focused methods could improve readability and maintainability. Additionally, ensure that error handling is robust, especially around file operations.

Consider the following refactor suggestion:

-        try:
-            if extension == ".parquet":
-                fields = parquet.ParquetInferrer().infer_schema(file)
-            elif extension == ".csv":
-                fields = csv_tsv.CsvInferrer(
-                    max_rows=self.source_config.max_rows
-                ).infer_schema(file)
-            elif extension == ".tsv":
-                fields = csv_tsv.TsvInferrer(
-                    max_rows=self.source_config.max_rows
-                ).infer_schema(file)
-            elif extension == ".json":
-                fields = json.JsonInferrer().infer_schema(file)
-            elif extension == ".avro":
-                fields = avro.AvroInferrer().infer_schema(file)
-            else:
-                self.report.report_warning(
-                    table_data.full_path,
-                    f"file {table_data.full_path} has unsupported extension",
-                )
-            file.close()
-        except Exception as e:
-            self.report.report_warning(
-                table_data.full_path,
-                f"could not infer schema for file {table_data.full_path}: {e}",
-            )
-            file.close()
+        fields = self.handle_file_inference(file, extension, table_data)
Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def get_fields(self, table_data: TableData, path_spec: PathSpec) -> List:
if self.is_abs_platform():
if self.source_config.azure_config is None:
raise ValueError("Azure config is required for ABS file sources")
abs_client = self.source_config.azure_config.get_blob_service_client()
file = smart_open(
f"azure://{self.source_config.azure_config.container_name}/{table_data.rel_path}",
"rb",
transport_params={"client": abs_client},
)
else:
# We still use smart_open here to take advantage of the compression
# capabilities of smart_open.
file = smart_open(table_data.full_path, "rb")
fields = []
extension = pathlib.Path(table_data.full_path).suffix
from datahub.ingestion.source.data_lake_common.path_spec import (
SUPPORTED_COMPRESSIONS,
)
if path_spec.enable_compression and (extension[1:] in SUPPORTED_COMPRESSIONS):
# Removing the compression extension and using the one before that like .json.gz -> .json
extension = pathlib.Path(table_data.full_path).with_suffix("").suffix
if extension == "" and path_spec.default_extension:
extension = f".{path_spec.default_extension}"
try:
if extension == ".parquet":
fields = parquet.ParquetInferrer().infer_schema(file)
elif extension == ".csv":
fields = csv_tsv.CsvInferrer(
max_rows=self.source_config.max_rows
).infer_schema(file)
elif extension == ".tsv":
fields = csv_tsv.TsvInferrer(
max_rows=self.source_config.max_rows
).infer_schema(file)
elif extension == ".json":
fields = json.JsonInferrer().infer_schema(file)
elif extension == ".avro":
fields = avro.AvroInferrer().infer_schema(file)
else:
self.report.report_warning(
table_data.full_path,
f"file {table_data.full_path} has unsupported extension",
)
file.close()
except Exception as e:
self.report.report_warning(
table_data.full_path,
f"could not infer schema for file {table_data.full_path}: {e}",
)
file.close()
logger.debug(f"Extracted fields in schema: {fields}")
fields = sorted(fields, key=lambda f: f.fieldPath)
if self.source_config.add_partition_columns_to_schema:
self.add_partition_columns_to_schema(
fields=fields, path_spec=path_spec, full_path=table_data.full_path
)
def get_fields(self, table_data: TableData, path_spec: PathSpec) -> List:
if self.is_abs_platform():
if self.source_config.azure_config is None:
raise ValueError("Azure config is required for ABS file sources")
abs_client = self.source_config.azure_config.get_blob_service_client()
file = smart_open(
f"azure://{self.source_config.azure_config.container_name}/{table_data.rel_path}",
"rb",
transport_params={"client": abs_client},
)
else:
# We still use smart_open here to take advantage of the compression
# capabilities of smart_open.
file = smart_open(table_data.full_path, "rb")
fields = []
extension = pathlib.Path(table_data.full_path).suffix
from datahub.ingestion.source.data_lake_common.path_spec import (
SUPPORTED_COMPRESSIONS,
)
if path_spec.enable_compression and (extension[1:] in SUPPORTED_COMPRESSIONS):
# Removing the compression extension and using the one before that like .json.gz -> .json
extension = pathlib.Path(table_data.full_path).with_suffix("").suffix
if extension == "" and path_spec.default_extension:
extension = f".{path_spec.default_extension}"
fields = self.handle_file_inference(file, extension, table_data)
logger.debug(f"Extracted fields in schema: {fields}")
fields = sorted(fields, key=lambda f: f.fieldPath)
if self.source_config.add_partition_columns_to_schema:
self.add_partition_columns_to_schema(
fields=fields, path_spec=path_spec, full_path=table_data.full_path
)

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 7a2c3fa and 240d116.

Files selected for processing (1)
  • metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py (1 hunks)
Files skipped from review as they are similar to previous changes (1)
  • metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py

Copy link
Contributor

@treff7es treff7es left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, thanks for the contribution

@treff7es treff7es changed the title Adding azure blob ingestion feat(ingest/abs): Adding azure blob storage ingestion source Jul 17, 2024
@treff7es treff7es merged commit 13b6feb into datahub-project:master Jul 17, 2024
56 of 58 checks passed
aviv-julienjehannet pushed a commit to aviv-julienjehannet/datahub that referenced this pull request Jul 25, 2024
arosanda added a commit to infobip/datahub that referenced this pull request Sep 23, 2024
* feat(forms) Handle deleting forms references when hard deleting forms (datahub-project#10820)

* refactor(ui): Misc improvements to the setup ingestion flow (ingest uplift 1/2)  (datahub-project#10764)

Co-authored-by: John Joyce <john@Johns-MBP.lan>
Co-authored-by: John Joyce <john@ip-192-168-1-200.us-west-2.compute.internal>

* fix(ingestion/airflow-plugin): pipeline tasks discoverable in search (datahub-project#10819)

* feat(ingest/transformer): tags to terms transformer (datahub-project#10758)

Co-authored-by: Aseem Bansal <asmbansal2@gmail.com>

* fix(ingestion/unity-catalog): fixed issue with profiling with GE turned on (datahub-project#10752)

Co-authored-by: Aseem Bansal <asmbansal2@gmail.com>

* feat(forms) Add java SDK for form entity PATCH + CRUD examples (datahub-project#10822)

* feat(SDK) Add java SDK for structuredProperty entity PATCH + CRUD examples (datahub-project#10823)

* feat(SDK) Add StructuredPropertyPatchBuilder in python sdk and provide sample CRUD files (datahub-project#10824)

* feat(forms) Add CRUD endpoints to GraphQL for Form entities (datahub-project#10825)

* add flag for includeSoftDeleted in scroll entities API (datahub-project#10831)

* feat(deprecation) Return actor entity with deprecation aspect (datahub-project#10832)

* feat(structuredProperties) Add CRUD graphql APIs for structured property entities (datahub-project#10826)

* add scroll parameters to openapi v3 spec (datahub-project#10833)

* fix(ingest): correct profile_day_of_week implementation (datahub-project#10818)

* feat(ingest/glue): allow ingestion of empty databases from Glue (datahub-project#10666)

Co-authored-by: Harshal Sheth <hsheth2@gmail.com>

* feat(cli): add more details to get cli (datahub-project#10815)

* fix(ingestion/glue): ensure date formatting works on all platforms for aws glue (datahub-project#10836)

* fix(ingestion): fix datajob patcher (datahub-project#10827)

* fix(smoke-test): add suffix in temp file creation (datahub-project#10841)

* feat(ingest/glue): add helper method to permit user or group ownership (datahub-project#10784)

* feat(): Show data platform instances in policy modal if they are set on the policy (datahub-project#10645)

Co-authored-by: Hendrik Richert <hendrik.richert@swisscom.com>

* docs(patch): add patch documentation for how implementation works (datahub-project#10010)

Co-authored-by: John Joyce <john@acryl.io>

* fix(jar): add missing custom-plugin-jar task (datahub-project#10847)

* fix(): also check exceptions/stack trace when filtering log messages (datahub-project#10391)

Co-authored-by: John Joyce <john@acryl.io>

* docs(): Update posts.md (datahub-project#9893)

Co-authored-by: Hyejin Yoon <0327jane@gmail.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* chore(ingest): update acryl-datahub-classify version (datahub-project#10844)

* refactor(ingest): Refactor structured logging to support infos, warnings, and failures structured reporting to UI (datahub-project#10828)

Co-authored-by: John Joyce <john@Johns-MBP.lan>
Co-authored-by: Harshal Sheth <hsheth2@gmail.com>

* fix(restli): log aspect-not-found as a warning rather than as an error (datahub-project#10834)

* fix(ingest/nifi): remove duplicate upstream jobs (datahub-project#10849)

* fix(smoke-test): test access to create/revoke personal access tokens (datahub-project#10848)

* fix(smoke-test): missing test for move domain (datahub-project#10837)

* ci: update usernames to not considered for community (datahub-project#10851)

* env: change defaults for data contract visibility (datahub-project#10854)

* fix(ingest/tableau): quote special characters in external URL (datahub-project#10842)

* fix(smoke-test): fix flakiness of auto complete test

* ci(ingest): pin dask dependency for feast (datahub-project#10865)

* fix(ingestion/lookml): liquid template resolution and view-to-view cll (datahub-project#10542)

* feat(ingest/audit): add client id and version in system metadata props (datahub-project#10829)

* chore(ingest): Mypy 1.10.1 pin (datahub-project#10867)

* docs: use acryl-datahub-actions as expected python package to install (datahub-project#10852)

* docs: add new js snippet (datahub-project#10846)

* refactor(ingestion): remove company domain for security reason (datahub-project#10839)

* fix(ingestion/spark): Platform instance and column level lineage fix (datahub-project#10843)

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* feat(ingestion/tableau): optionally ingest multiple sites and create site containers (datahub-project#10498)

Co-authored-by: Yanik Häni <Yanik.Haeni1@swisscom.com>

* fix(ingestion/looker): Add sqlglot dependency and remove unused sqlparser (datahub-project#10874)

* fix(manage-tokens): fix manage access token policy (datahub-project#10853)

* Batch get entity endpoints (datahub-project#10880)

* feat(system): support conditional write semantics (datahub-project#10868)

* fix(build): upgrade vercel builds to Node 20.x (datahub-project#10890)

* feat(ingest/lookml): shallow clone repos (datahub-project#10888)

* fix(ingest/looker): add missing dependency (datahub-project#10876)

* fix(ingest): only populate audit stamps where accurate (datahub-project#10604)

* fix(ingest/dbt): always encode tag urns (datahub-project#10799)

* fix(ingest/redshift): handle multiline alter table commands (datahub-project#10727)

* fix(ingestion/looker): column name missing in explore (datahub-project#10892)

* fix(lineage) Fix lineage source/dest filtering with explored per hop limit (datahub-project#10879)

* feat(conditional-writes): misc updates and fixes (datahub-project#10901)

* feat(ci): update outdated action (datahub-project#10899)

* feat(rest-emitter): adding async flag to rest emitter (datahub-project#10902)

Co-authored-by: Gabe Lyons <gabe.lyons@acryl.io>

* feat(ingest): add snowflake-queries source (datahub-project#10835)

* fix(ingest): improve `auto_materialize_referenced_tags_terms` error handling (datahub-project#10906)

* docs: add new company to adoption list (datahub-project#10909)

* refactor(redshift): Improve redshift error handling with new structured reporting system (datahub-project#10870)

Co-authored-by: John Joyce <john@Johns-MBP.lan>
Co-authored-by: Harshal Sheth <hsheth2@gmail.com>

* feat(ui) Finalize support for all entity types on forms (datahub-project#10915)

* Index ExecutionRequestResults status field (datahub-project#10811)

* feat(ingest): grafana connector (datahub-project#10891)

Co-authored-by: Shirshanka Das <shirshanka@apache.org>
Co-authored-by: Harshal Sheth <hsheth2@gmail.com>

* fix(gms) Add Form entity type to EntityTypeMapper (datahub-project#10916)

* feat(dataset): add support for external url in Dataset (datahub-project#10877)

* docs(saas-overview) added missing features to observe section (datahub-project#10913)

Co-authored-by: John Joyce <john@acryl.io>

* fix(ingest/spark): Fixing Micrometer warning (datahub-project#10882)

* fix(structured properties): allow application of structured properties without schema file (datahub-project#10918)

* fix(data-contracts-web) handle other schedule types (datahub-project#10919)

* fix(ingestion/tableau): human-readable message for PERMISSIONS_MODE_SWITCHED error (datahub-project#10866)

Co-authored-by: Harshal Sheth <hsheth2@gmail.com>

* Add feature flag for view defintions (datahub-project#10914)

Co-authored-by: Ethan Cartwright <ethan.cartwright@acryl.io>

* feat(ingest/BigQuery): refactor+parallelize dataset metadata extraction (datahub-project#10884)

* fix(airflow): add error handling around render_template() (datahub-project#10907)

* feat(ingestion/sqlglot): add optional `default_dialect` parameter to sqlglot lineage (datahub-project#10830)

* feat(mcp-mutator): new mcp mutator plugin (datahub-project#10904)

* fix(ingest/bigquery): changes helper function to decode unicode scape sequences (datahub-project#10845)

* feat(ingest/postgres): fetch table sizes for profile (datahub-project#10864)

* feat(ingest/abs): Adding azure blob storage ingestion source (datahub-project#10813)

* fix(ingest/redshift): reduce severity of SQL parsing issues (datahub-project#10924)

* fix(build): fix lint fix web react (datahub-project#10896)

* fix(ingest/bigquery): handle quota exceeded for project.list requests (datahub-project#10912)

* feat(ingest): report extractor failures more loudly (datahub-project#10908)

* feat(ingest/snowflake): integrate snowflake-queries into main source (datahub-project#10905)

* fix(ingest): fix docs build (datahub-project#10926)

* fix(ingest/snowflake): fix test connection (datahub-project#10927)

* fix(ingest/lookml): add view load failures to cache (datahub-project#10923)

* docs(slack) overhauled setup instructions and screenshots (datahub-project#10922)

Co-authored-by: John Joyce <john@acryl.io>

* fix(airflow): Add comma parsing of owners to DataJobs (datahub-project#10903)

* fix(entityservice): fix merging sideeffects (datahub-project#10937)

* feat(ingest): Support System Ingestion Sources, Show and hide system ingestion sources with Command-S (datahub-project#10938)

Co-authored-by: John Joyce <john@Johns-MBP.lan>

* chore() Set a default lineage filtering end time on backend when a start time is present (datahub-project#10925)

Co-authored-by: John Joyce <john@ip-192-168-1-200.us-west-2.compute.internal>
Co-authored-by: John Joyce <john@Johns-MBP.lan>

* Added relationships APIs to V3. Added these generic APIs to V3 swagger doc. (datahub-project#10939)

* docs: add learning center to docs (datahub-project#10921)

* doc: Update hubspot form id (datahub-project#10943)

* chore(airflow): add python 3.11 w/ Airflow 2.9 to CI (datahub-project#10941)

* fix(ingest/Glue): column upstream lineage between S3 and Glue (datahub-project#10895)

* fix(ingest/abs): split abs utils into multiple files (datahub-project#10945)

* doc(ingest/looker): fix doc for sql parsing documentation (datahub-project#10883)

Co-authored-by: Harshal Sheth <hsheth2@gmail.com>

* fix(ingest/bigquery): Adding missing BigQuery types (datahub-project#10950)

* fix(ingest/setup): feast and abs source setup (datahub-project#10951)

* fix(connections) Harden adding /gms to connections in backend (datahub-project#10942)

* feat(siblings) Add flag to prevent combining siblings in the UI (datahub-project#10952)

* fix(docs): make graphql doc gen more automated (datahub-project#10953)

* feat(ingest/athena): Add option for Athena partitioned profiling (datahub-project#10723)

* fix(spark-lineage): default timeout for future responses (datahub-project#10947)

* feat(datajob/flow): add environment filter using info aspects (datahub-project#10814)

* fix(ui/ingest): correct privilege used to show tab (datahub-project#10483)

Co-authored-by: Kunal-kankriya <127090035+Kunal-kankriya@users.noreply.github.com>

* feat(ingest/looker): include dashboard urns in browse v2 (datahub-project#10955)

* add a structured type to batchGet in OpenAPI V3 spec (datahub-project#10956)

* fix(ui): scroll on the domain sidebar to show all domains (datahub-project#10966)

* fix(ingest/sagemaker): resolve incorrect variable assignment for SageMaker API call (datahub-project#10965)

* fix(airflow/build): Pinning mypy (datahub-project#10972)

* Fixed a bug where the OpenAPI V3 spec was incorrect. The bug was introduced in datahub-project#10939. (datahub-project#10974)

* fix(ingest/test): Fix for mssql integration tests (datahub-project#10978)

* fix(entity-service) exist check correctly extracts status (datahub-project#10973)

* fix(structuredProps) casing bug in StructuredPropertiesValidator (datahub-project#10982)

* bugfix: use anyOf instead of allOf when creating references in openapi v3 spec (datahub-project#10986)

* fix(ui): Remove ant less imports (datahub-project#10988)

* feat(ingest/graph): Add get_results_by_filter to DataHubGraph (datahub-project#10987)

* feat(ingest/cli): init does not actually support environment variables (datahub-project#10989)

* fix(ingest/graph): Update get_results_by_filter graphql query (datahub-project#10991)

* feat(ingest/spark): Promote beta plugin (datahub-project#10881)

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* feat(ingest): support domains in meta -> "datahub" section (datahub-project#10967)

* feat(ingest): add `check server-config` command (datahub-project#10990)

* feat(cli): Make consistent use of DataHubGraphClientConfig (datahub-project#10466)

Deprecates get_url_and_token() in favor of a more complete option: load_graph_config() that returns a full DatahubClientConfig.
This change was then propagated across previous usages of get_url_and_token so that connections to DataHub server from the client respect the full breadth of configuration specified by DatahubClientConfig.

I.e: You can now specify disable_ssl_verification: true in your ~/.datahubenv file so that all cli functions to the server work when ssl certification is disabled.

Fixes datahub-project#9705

* fix(ingest/s3): Fixing container creation when there is no folder in path (datahub-project#10993)

* fix(ingest/looker): support platform instance for dashboards & charts (datahub-project#10771)

* feat(ingest/bigquery): improve handling of information schema in sql parser (datahub-project#10985)

* feat(ingest): improve `ingest deploy` command (datahub-project#10944)

* fix(backend): allow excluding soft-deleted entities in relationship-queries; exclude soft-deleted members of groups (datahub-project#10920)

- allow excluding soft-deleted entities in relationship-queries
- exclude soft-deleted members of groups

* fix(ingest/looker): downgrade missing chart type log level (datahub-project#10996)

* doc(acryl-cloud): release docs for 0.3.4.x (datahub-project#10984)

Co-authored-by: John Joyce <john@acryl.io>
Co-authored-by: RyanHolstien <RyanHolstien@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: Pedro Silva <pedro@acryl.io>

* fix(protobuf/build): Fix protobuf check jar script (datahub-project#11006)

* fix(ui/ingest): Support invalid cron jobs (datahub-project#10998)

* fix(ingest): fix graph config loading (datahub-project#11002)

Co-authored-by: Pedro Silva <pedro@acryl.io>

* feat(docs): Document __DATAHUB_TO_FILE_ directive (datahub-project#10968)

Co-authored-by: Harshal Sheth <hsheth2@gmail.com>

* fix(graphql/upsertIngestionSource): Validate cron schedule; parse error in CLI (datahub-project#11011)

* feat(ece): support custom ownership type urns in ECE generation (datahub-project#10999)

* feat(assertion-v2): changed Validation tab to Quality and created new Governance tab (datahub-project#10935)

* fix(ingestion/glue): Add support for missing config options for profiling in Glue (datahub-project#10858)

* feat(propagation): Add models for schema field docs, tags, terms (datahub-project#2959) (datahub-project#11016)

Co-authored-by: Chris Collins <chriscollins3456@gmail.com>

* docs: standardize terminology to DataHub Cloud (datahub-project#11003)

* fix(ingestion/transformer): replace the externalUrl container (datahub-project#11013)

* docs(slack) troubleshoot docs (datahub-project#11014)

* feat(propagation): Add graphql API (datahub-project#11030)

Co-authored-by: Chris Collins <chriscollins3456@gmail.com>

* feat(propagation):  Add models for Action feature settings (datahub-project#11029)

* docs(custom properties): Remove duplicate from sidebar (datahub-project#11033)

* feat(models): Introducing Dataset Partitions Aspect (datahub-project#10997)

Co-authored-by: John Joyce <john@Johns-MBP.lan>
Co-authored-by: John Joyce <john@ip-192-168-1-200.us-west-2.compute.internal>

* feat(propagation): Add Documentation Propagation Settings (datahub-project#11038)

* fix(models): chart schema fields mapping, add dataHubAction entity, t… (datahub-project#11040)

* fix(ci): smoke test lint failures (datahub-project#11044)

* docs: fix learning center color scheme & typo (datahub-project#11043)

* feat: add cloud main page (datahub-project#11017)

Co-authored-by: Jay <159848059+jayacryl@users.noreply.github.com>

* feat(restore-indices): add additional step to also clear system metadata service (datahub-project#10662)

Co-authored-by: John Joyce <john@acryl.io>

* docs: fix typo (datahub-project#11046)

* fix(lint): apply spotless (datahub-project#11050)

* docs(airflow): example query to get datajobs for a dataflow (datahub-project#11034)

* feat(cli): Add run-id option to put sub-command (datahub-project#11023)

Adds an option to assign run-id to a given put command execution. 
This is useful when transformers do not exist for a given ingestion payload, we can follow up with custom metadata and assign it to an ingestion pipeline.

* fix(ingest): improve sql error reporting calls (datahub-project#11025)

* fix(airflow): fix CI setup (datahub-project#11031)

* feat(ingest/dbt): add experimental `prefer_sql_parser_lineage` flag (datahub-project#11039)

* fix(ingestion/lookml): enable stack-trace in lookml logs (datahub-project#10971)

* (chore): Linting fix (datahub-project#11015)

* chore(ci): update deprecated github actions (datahub-project#10977)

* Fix ALB configuration example (datahub-project#10981)

* chore(ingestion-base): bump base image packages (datahub-project#11053)

* feat(cli): Trim report of dataHubExecutionRequestResult to max GMS size (datahub-project#11051)

* fix(ingestion/lookml): emit dummy sql condition for lookml custom condition tag (datahub-project#11008)

Co-authored-by: Harshal Sheth <hsheth2@gmail.com>

* fix(ingestion/powerbi): fix issue with broken report lineage (datahub-project#10910)

* feat(ingest/tableau): add retry on timeout (datahub-project#10995)

* change generate kafka connect properties from env (datahub-project#10545)

Co-authored-by: david-leifker <114954101+david-leifker@users.noreply.github.com>

* fix(ingest): fix oracle cronjob ingestion (datahub-project#11001)

Co-authored-by: david-leifker <114954101+david-leifker@users.noreply.github.com>

* chore(ci): revert update deprecated github actions (datahub-project#10977) (datahub-project#11062)

* feat(ingest/dbt-cloud): update metadata_endpoint inference (datahub-project#11041)

* build: Reduce size of datahub-frontend-react image by 50-ish% (datahub-project#10878)

Co-authored-by: david-leifker <114954101+david-leifker@users.noreply.github.com>

* fix(ci): Fix lint issue in datahub_ingestion_run_summary_provider.py (datahub-project#11063)

* docs(ingest): update developing-a-transformer.md (datahub-project#11019)

* feat(search-test): update search tests from datahub-project#10408 (datahub-project#11056)

* feat(cli): add aspects parameter to DataHubGraph.get_entity_semityped (datahub-project#11009)

Co-authored-by: Harshal Sheth <hsheth2@gmail.com>

* docs(airflow): update min version for plugin v2 (datahub-project#11065)

* doc(ingestion/tableau): doc update for derived permission (datahub-project#11054)

Co-authored-by: Pedro Silva <pedro.cls93@gmail.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: Harshal Sheth <hsheth2@gmail.com>

* fix(py): remove dep on types-pkg_resources (datahub-project#11076)

* feat(ingest/mode): add option to exclude restricted (datahub-project#11081)

* fix(ingest): set lastObserved in sdk when unset (datahub-project#11071)

* doc(ingest): Update capabilities (datahub-project#11072)

* chore(vulnerability): Log Injection (datahub-project#11090)

* chore(vulnerability): Information exposure through a stack trace (datahub-project#11091)

* chore(vulnerability): Comparison of narrow type with wide type in loop condition (datahub-project#11089)

* chore(vulnerability): Insertion of sensitive information into log files (datahub-project#11088)

* chore(vulnerability): Risky Cryptographic Algorithm (datahub-project#11059)

* chore(vulnerability): Overly permissive regex range (datahub-project#11061)

Co-authored-by: Harshal Sheth <hsheth2@gmail.com>

* fix: update customer data (datahub-project#11075)

* fix(models): fixing the datasetPartition models (datahub-project#11085)

Co-authored-by: John Joyce <john@ip-192-168-1-200.us-west-2.compute.internal>

* fix(ui): Adding view, forms GraphQL query, remove showing a fallback error message on unhandled GraphQL error (datahub-project#11084)

Co-authored-by: John Joyce <john@ip-192-168-1-200.us-west-2.compute.internal>

* feat(docs-site): hiding learn more from cloud page (datahub-project#11097)

* fix(docs): Add correct usage of orFilters in search API docs (datahub-project#11082)

Co-authored-by: Jay <159848059+jayacryl@users.noreply.github.com>

* fix(ingest/mode): Regexp in mode name matcher didn't allow underscore (datahub-project#11098)

* docs: Refactor customer stories section (datahub-project#10869)

Co-authored-by: Jeff Merrick <jeff@wireform.io>

* fix(release): fix full/slim suffix on tag (datahub-project#11087)

* feat(config): support alternate hashing algorithm for doc id (datahub-project#10423)

Co-authored-by: david-leifker <114954101+david-leifker@users.noreply.github.com>
Co-authored-by: John Joyce <john@acryl.io>

* fix(emitter): fix typo in get method of java kafka emitter (datahub-project#11007)

* fix(ingest): use correct native data type in all SQLAlchemy sources by compiling data type using dialect (datahub-project#10898)

Co-authored-by: Harshal Sheth <hsheth2@gmail.com>

* chore: Update contributors list in PR labeler (datahub-project#11105)

* feat(ingest): tweak stale entity removal messaging (datahub-project#11064)

* fix(ingestion): enforce lastObserved timestamps in SystemMetadata (datahub-project#11104)

* fix(ingest/powerbi): fix broken lineage between chart and dataset (datahub-project#11080)

* feat(ingest/lookml): CLL support for sql set in sql_table_name attribute of lookml view (datahub-project#11069)

* docs: update graphql docs on forms & structured properties (datahub-project#11100)

* test(search): search openAPI v3 test (datahub-project#11049)

* fix(ingest/tableau): prevent empty site content urls (datahub-project#11057)

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* feat(entity-client): implement client batch interface (datahub-project#11106)

* fix(snowflake): avoid reporting warnings/info for sys tables (datahub-project#11114)

* fix(ingest): downgrade column type mapping warning to info (datahub-project#11115)

* feat(api): add AuditStamp to the V3 API entity/aspect response (datahub-project#11118)

* fix(ingest/redshift): replace r'\n' with '\n' to avoid token error redshift serverless… (datahub-project#11111)

* fix(entiy-client): handle null entityUrn case for restli (datahub-project#11122)

* fix(sql-parser): prevent bad urns from alter table lineage (datahub-project#11092)

* fix(ingest/bigquery): use small batch size if use_tables_list_query_v2 is set (datahub-project#11121)

* fix(graphql): add missing entities to EntityTypeMapper and EntityTypeUrnMapper (datahub-project#10366)

* feat(ui): Changes to allow editable dataset name (datahub-project#10608)

Co-authored-by: Jay Kadambi <jayasimhan_venkatadri@optum.com>

* fix: remove saxo (datahub-project#11127)

* feat(mcl-processor): Update mcl processor hooks (datahub-project#11134)

* fix(openapi): fix openapi v2 endpoints & v3 documentation update

* Revert "fix(openapi): fix openapi v2 endpoints & v3 documentation update"

This reverts commit 573c1cb.

* docs(policies): updates to policies documentation (datahub-project#11073)

* fix(openapi): fix openapi v2 and v3 docs update (datahub-project#11139)

* feat(auth): grant type and acr values custom oidc parameters support (datahub-project#11116)

* fix(mutator): mutator hook fixes (datahub-project#11140)

* feat(search): support sorting on multiple fields (datahub-project#10775)

* feat(ingest): various logging improvements (datahub-project#11126)

* fix(ingestion/lookml): fix for sql parsing error (datahub-project#11079)

Co-authored-by: Harshal Sheth <hsheth2@gmail.com>

* feat(docs-site) cloud page spacing and content polishes (datahub-project#11141)

* feat(ui) Enable editing structured props on fields (datahub-project#11042)

* feat(tests): add md5 and last computed to testResult model (datahub-project#11117)

* test(openapi): openapi regression smoke tests (datahub-project#11143)

* fix(airflow): fix tox tests + update docs (datahub-project#11125)

* docs: add chime to adoption stories (datahub-project#11142)

* fix(ingest/databricks): Updating code to work with Databricks sdk 0.30 (datahub-project#11158)

* fix(kafka-setup): add missing script to image (datahub-project#11190)

* fix(config): fix hash algo config (datahub-project#11191)

* test(smoke-test): updates to smoke-tests (datahub-project#11152)

* fix(elasticsearch): refactor idHashAlgo setting (datahub-project#11193)

* chore(kafka): kafka version bump (datahub-project#11211)

* readd UsageStatsWorkUnit

* fix merge problems

* change logo

---------

Co-authored-by: Chris Collins <chriscollins3456@gmail.com>
Co-authored-by: John Joyce <john@acryl.io>
Co-authored-by: John Joyce <john@Johns-MBP.lan>
Co-authored-by: John Joyce <john@ip-192-168-1-200.us-west-2.compute.internal>
Co-authored-by: dushayntAW <158567391+dushayntAW@users.noreply.github.com>
Co-authored-by: sagar-salvi-apptware <159135491+sagar-salvi-apptware@users.noreply.github.com>
Co-authored-by: Aseem Bansal <asmbansal2@gmail.com>
Co-authored-by: Kevin Chun <kevin1chun@gmail.com>
Co-authored-by: jordanjeremy <72943478+jordanjeremy@users.noreply.github.com>
Co-authored-by: skrydal <piotr.skrydalewicz@gmail.com>
Co-authored-by: Harshal Sheth <hsheth2@gmail.com>
Co-authored-by: david-leifker <114954101+david-leifker@users.noreply.github.com>
Co-authored-by: sid-acryl <155424659+sid-acryl@users.noreply.github.com>
Co-authored-by: Julien Jehannet <80408664+aviv-julienjehannet@users.noreply.github.com>
Co-authored-by: Hendrik Richert <github@richert.li>
Co-authored-by: Hendrik Richert <hendrik.richert@swisscom.com>
Co-authored-by: RyanHolstien <RyanHolstien@users.noreply.github.com>
Co-authored-by: Felix Lüdin <13187726+Masterchen09@users.noreply.github.com>
Co-authored-by: Pirry <158024088+chardaway@users.noreply.github.com>
Co-authored-by: Hyejin Yoon <0327jane@gmail.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: cburroughs <chris.burroughs@gmail.com>
Co-authored-by: ksrinath <ksrinath@users.noreply.github.com>
Co-authored-by: Mayuri Nehate <33225191+mayurinehate@users.noreply.github.com>
Co-authored-by: Kunal-kankriya <127090035+Kunal-kankriya@users.noreply.github.com>
Co-authored-by: Shirshanka Das <shirshanka@apache.org>
Co-authored-by: ipolding-cais <155455744+ipolding-cais@users.noreply.github.com>
Co-authored-by: Tamas Nemeth <treff7es@gmail.com>
Co-authored-by: Shubham Jagtap <132359390+shubhamjagtap639@users.noreply.github.com>
Co-authored-by: haeniya <yanik.haeni@gmail.com>
Co-authored-by: Yanik Häni <Yanik.Haeni1@swisscom.com>
Co-authored-by: Gabe Lyons <itsgabelyons@gmail.com>
Co-authored-by: Gabe Lyons <gabe.lyons@acryl.io>
Co-authored-by: 808OVADOZE <52988741+shtephlee@users.noreply.github.com>
Co-authored-by: noggi <anton.kuraev@acryl.io>
Co-authored-by: Nicholas Pena <npena@foursquare.com>
Co-authored-by: Jay <159848059+jayacryl@users.noreply.github.com>
Co-authored-by: ethan-cartwright <ethan.cartwright.m@gmail.com>
Co-authored-by: Ethan Cartwright <ethan.cartwright@acryl.io>
Co-authored-by: Nadav Gross <33874964+nadavgross@users.noreply.github.com>
Co-authored-by: Patrick Franco Braz <patrickfbraz@poli.ufrj.br>
Co-authored-by: pie1nthesky <39328908+pie1nthesky@users.noreply.github.com>
Co-authored-by: Joel Pinto Mata (KPN-DSH-DEX team) <130968841+joelmataKPN@users.noreply.github.com>
Co-authored-by: Ellie O'Neil <110510035+eboneil@users.noreply.github.com>
Co-authored-by: Ajoy Majumdar <ajoymajumdar@hotmail.com>
Co-authored-by: deepgarg-visa <149145061+deepgarg-visa@users.noreply.github.com>
Co-authored-by: Tristan Heisler <tristankheisler@gmail.com>
Co-authored-by: Andrew Sikowitz <andrew.sikowitz@acryl.io>
Co-authored-by: Davi Arnaut <davi.arnaut@acryl.io>
Co-authored-by: Pedro Silva <pedro@acryl.io>
Co-authored-by: amit-apptware <132869468+amit-apptware@users.noreply.github.com>
Co-authored-by: Sam Black <sam.black@acryl.io>
Co-authored-by: Raj Tekal <varadaraj_tekal@optum.com>
Co-authored-by: Steffen Grohsschmiedt <gitbhub@steffeng.eu>
Co-authored-by: jaegwon.seo <162448493+wornjs@users.noreply.github.com>
Co-authored-by: Renan F. Lima <51028757+lima-renan@users.noreply.github.com>
Co-authored-by: Matt Exchange <xkollar@users.noreply.github.com>
Co-authored-by: Jonny Dixon <45681293+acrylJonny@users.noreply.github.com>
Co-authored-by: Pedro Silva <pedro.cls93@gmail.com>
Co-authored-by: Pinaki Bhattacharjee <pinakipb2@gmail.com>
Co-authored-by: Jeff Merrick <jeff@wireform.io>
Co-authored-by: skrydal <piotr.skrydalewicz@acryl.io>
Co-authored-by: AndreasHegerNuritas <163423418+AndreasHegerNuritas@users.noreply.github.com>
Co-authored-by: jayasimhankv <145704974+jayasimhankv@users.noreply.github.com>
Co-authored-by: Jay Kadambi <jayasimhan_venkatadri@optum.com>
Co-authored-by: David Leifker <david.leifker@acryl.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
community-contribution PR or Issue raised by member(s) of DataHub Community ingestion PR or Issue related to the ingestion of metadata
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants