Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix column cleaning and general cleanup #39

Merged
merged 13 commits into from
Sep 7, 2022
Merged

fix column cleaning and general cleanup #39

merged 13 commits into from
Sep 7, 2022

Conversation

shouples
Copy link
Collaborator

@shouples shouples commented Sep 6, 2022

  • enforce datatypes after sampling
  • remove unnecessary column cleaning when comparing dataframes
  • remove lingering callout code
  • more adjustments to debug logging
  • add tests for formatting dataframes with missing values (None, np.nan, pd.NA)

known issue:

  • conversion to original dtypes after resampling/filtering may not be successful on some types (pd.Period/pd.Interval) since pandas can't parse a stringified list, so we might need to keep track of some intermediate normalized set of dtypes instead of the "original" dtypes

@shouples shouples changed the title fix JSON serialization errors and column cleaning fix column cleaning and general cleanup Sep 7, 2022
@shouples shouples marked this pull request as ready for review September 7, 2022 17:41
Copy link
Contributor

@willingc willingc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updated tests. I had a couple of questions but those shouldn't block merge.

@@ -120,15 +96,28 @@ def get_df_variable_name(
"""
Returns the variable name of the DataFrame object.
"""
logger.debug("looking for matching variables for dataframe")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious. What is the use case for this utility?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was previously for the blueprintjs callout to the user after a resample request that returned instructions to pass the filtered subset to a new variable, like:

new_df = my_awesome_df.query( < pandas query string >, engine="python")

But it's going to be passed in the metadata to the frontend to let DEX handle displaying the query syntax dynamically when the filters change, instead of just on (backend) resample requests. I'd like to have this handled in the dataframe caching branch for easier access.

s = geometry.handle_geometry_series(s)

s = datatypes.handle_dict_series(s)
s = datatypes.handle_sequence_series(s)
s = datatypes.handle_unk_type_series(s)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the type unknown for these?

Copy link
Collaborator Author

@shouples shouples Sep 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly as a catch-all for if we run into objects that aren't JSON serializable (custom classes/objects) that we haven't handled in an explicit way earlier, to avoid breaking the pandas build_json_schema() call. Definitely open to suggestions here!

@shouples shouples merged commit 5539563 into main Sep 7, 2022
@shouples shouples deleted the djs/cleanup branch September 7, 2022 17:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants