Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add transformations #1555

Merged
merged 4 commits into from
Jan 29, 2025
Merged

feat: add transformations #1555

merged 4 commits into from
Jan 29, 2025

Conversation

gventuri
Copy link
Collaborator

@gventuri gventuri commented Jan 28, 2025

Important

Adds TransformationManager for data transformations in PandaAI, updates documentation, and includes comprehensive unit tests.

  • Behavior:
    • Introduces TransformationManager class in transformation_manager.py to handle data transformations on DataFrames.
    • Supports transformations like anonymize, convert_timezone, to_lowercase, to_uppercase, strip, round_numbers, scale, format_date, to_numeric, to_datetime, fill_na, replace, extract, truncate, pad, clip, bin, normalize, standardize, map_values, encode_categorical, validate_email, validate_date_range, normalize_phone, remove_duplicates, validate_foreign_key, ensure_positive, standardize_categories.
    • Updates DatasetLoader in loader.py to use TransformationManager for applying transformations.
  • Documentation:
    • Adds transformations.mdx to document available transformations and usage examples.
    • Updates mint.json to include transformations documentation in the navigation.
  • Constants:
    • Updates VALID_TRANSFORMATION_TYPES in constants.py to include new transformation types.
  • Tests:
    • Adds unit tests in test_transformation_manager.py to cover all transformation methods.
    • Updates test_loader.py to test integration of transformations in data loading process.

This description was created by Ellipsis for fe22453. It will automatically update as commits are pushed.

Copy link
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Looks good to me! Reviewed everything up to 15e4360 in 2 minutes and 10 seconds

More details
  • Looked at 1868 lines of code in 7 files
  • Skipped 0 files when reviewing.
  • Skipped posting 5 drafted comments based on config settings.
1. pandasai/data_loader/transformation_manager.py:301
  • Draft comment:
    Ensure the column is of string type before applying str.replace to handle non-string columns correctly.
  • Reason this comment was not posted:
    Marked as duplicate.
2. pandasai/data_loader/transformation_manager.py:77
  • Draft comment:
    Ensure the column is converted to datetime before applying timezone conversion to handle non-datetime columns correctly.
  • Reason this comment was not posted:
    Comment did not seem useful.
3. pandasai/data_loader/transformation_manager.py:205
  • Draft comment:
    Ensure the column is converted to datetime before formatting to handle non-datetime columns correctly.
  • Reason this comment was not posted:
    Marked as duplicate.
4. pandasai/data_loader/transformation_manager.py:381
  • Draft comment:
    Ensure the column is of string type before applying str.rjust or str.ljust to handle non-string columns correctly.
  • Reason this comment was not posted:
    Comment did not seem useful.
5. pandasai/data_loader/transformation_manager.py:8
  • Draft comment:
    The TransformationManager class is a significant addition to the library. Ensure that the documentation is updated to include details about this class and its usage.
  • Reason this comment was not posted:
    Comment looked like it was already resolved.

Workflow ID: wflow_T3jluj1GEA86vlVh


You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

Copy link
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Looks good to me! Incremental review on 743008b in 36 seconds

More details
  • Looked at 1107 lines of code in 2 files
  • Skipped 0 files when reviewing.
  • Skipped posting 2 drafted comments based on config settings.
1. tests/unit_tests/data_loader/test_transformation_manager.py:53
  • Draft comment:
    Ensure the input timestamps are timezone-aware before testing convert_timezone. If they are naive, use tz_localize to set a timezone first.
  • Reason this comment was not posted:
    Comment did not seem useful.
2. pandasai/data_loader/transformation_manager.py:287
  • Draft comment:
    Ensure that inputs to str.replace are sanitized if they come from untrusted sources to prevent potential security issues.
  • Reason this comment was not posted:
    Confidence changes required: 50%
    The replace method in transformation_manager.py uses str.replace, which is vulnerable to SQL injection if the input is not sanitized. However, since this is a DataFrame operation and not directly interacting with a database, the risk is minimal. Still, it's good practice to ensure inputs are sanitized if they are coming from an untrusted source.

Workflow ID: wflow_6gggmxP1BVmnffeG


You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

Copy link

codecov bot commented Jan 28, 2025

Codecov Report

Attention: Patch coverage is 95.42857% with 8 lines in your changes missing coverage. Please review.

Project coverage is 81.50%. Comparing base (792e683) to head (fe22453).
Report is 5 commits behind head on main.

Files with missing lines Patch % Lines
pandasai/data_loader/transformation_manager.py 95.80% 7 Missing ⚠️
pandasai/data_loader/loader.py 80.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1555      +/-   ##
==========================================
+ Coverage   80.42%   81.50%   +1.07%     
==========================================
  Files          62       63       +1     
  Lines        2105     2271     +166     
==========================================
+ Hits         1693     1851     +158     
- Misses        412      420       +8     
Flag Coverage Δ
unittests 81.50% <95.42%> (+1.07%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@@ -202,18 +203,11 @@ def execute_query(self, query: str, params: Optional[list] = None) -> pd.DataFra
) from e

def _apply_transformations(self, df: pd.DataFrame) -> pd.DataFrame:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, the view case should be handled.
Under the hood columns in schema "table.column" get transformed in "table_column" the same should apply for transformation. And check should be added as well in SemanticLayerSchema

@gventuri gventuri merged commit 093ee9c into main Jan 29, 2025
13 checks passed
Copy link
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❌ Changes requested. Incremental review on fe22453 in 46 seconds

More details
  • Looked at 39 lines of code in 1 files
  • Skipped 0 files when reviewing.
  • Skipped posting 1 drafted comments based on config settings.
1. pandasai/constants.py:57
  • Draft comment:
    Consider adding a comment or documentation to explain the purpose of each transformation type for better maintainability and understanding.
  • Reason this comment was not posted:
    Confidence changes required: 50%
    The addition of new transformation types is straightforward, but it lacks a description or context in the PR description. However, the change itself is clear and consistent with the existing format.

Workflow ID: wflow_AETPCnswz6rQIjJF


Want Ellipsis to fix these issues? Tag @ellipsis-dev in a comment. You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

@@ -54,4 +54,33 @@

VALID_COLUMN_TYPES = ["string", "integer", "float", "datetime", "boolean"]

VALID_TRANSFORMATION_TYPES = ["anonymize", "convert_timezone"]
VALID_TRANSFORMATION_TYPES = [
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please ensure that there are corresponding unit tests for the new transformation types added. Additionally, if these transformations are part of a library, update the documentation accordingly.

scaliseraoul pushed a commit to scaliseraoul/pandas-ai that referenced this pull request Jan 30, 2025
…das-ai into fix/SIN-SIN-348

* 'fix/SIN-SIN-348' of https://github.com/scaliseraoul/pandas-ai:
  feat: add transformations (sinaptik-ai#1555)
  fix(dependencies): update poetry file and pyproject files (sinaptik-ai#1557)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants