fix column cleaning and general cleanup #39

shouples · 2022-09-06T23:36:07Z

enforce datatypes after sampling
remove unnecessary column cleaning when comparing dataframes
remove lingering callout code
more adjustments to debug logging
add tests for formatting dataframes with missing values (None, np.nan, pd.NA)

known issue:

conversion back to the original dtypes after resampling/filtering may fail for some types (pd.Period/pd.Interval), since pandas can't parse a stringified list; we might need to keep track of an intermediate, normalized set of dtypes instead of the "original" dtypes
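
A rough sketch of the "intermediate normalized dtypes" idea, not what this PR implements: stringify the Period/Interval columns up front, record the dtypes at that point, and restore to those after sampling. `normalize_for_display` and `restore_dtypes` are hypothetical names, and the `astype(object)` step only simulates a resample that loses dtype information.

```python
import pandas as pd


def normalize_for_display(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: stringify columns whose dtype may not round-trip."""
    out = df.copy()
    for col in out.columns:
        if isinstance(out[col].dtype, (pd.PeriodDtype, pd.IntervalDtype)):
            out[col] = out[col].astype(str)
    return out


def restore_dtypes(df: pd.DataFrame, dtypes: dict) -> pd.DataFrame:
    """Hypothetical helper: cast back to the recorded (normalized) dtypes."""
    return df.astype(dtypes)


df = pd.DataFrame(
    {
        "count": [1, 2, 3],
        "period": pd.period_range("2022-01", periods=3, freq="M"),
        "interval": pd.interval_range(start=0, end=3),
    }
)

normalized = normalize_for_display(df)
# Track the normalized dtypes rather than the "original" ones.
normalized_dtypes = normalized.dtypes.to_dict()

# Simulate a resample/filter step that comes back with generic object columns.
sampled = normalized.sample(n=2, random_state=0).astype(object)
restored = restore_dtypes(sampled, normalized_dtypes)
```

The restore step only ever targets dtypes that the stringified data can be cast to, so it sidesteps parsing Period/Interval values back out of strings.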

willingc

Thanks for the updated tests. I had a couple of questions but those shouldn't block merge.

willingc · 2022-09-07T17:50:04Z

dx/utils/tracking.py

@@ -120,15 +96,28 @@ def get_df_variable_name(
    """
    Returns the variable name of the DataFrame object.
    """
+    logger.debug("looking for matching variables for dataframe")


Curious. What is the use case for this utility?

This was previously used for the blueprintjs callout shown to the user after a resample request, which gave instructions for assigning the filtered subset to a new variable, like:

new_df = my_awesome_df.query( < pandas query string >, engine="python")

But it's going to be passed in the metadata to the frontend to let DEX handle displaying the query syntax dynamically when the filters change, instead of just on (backend) resample requests. I'd like to have this handled in the dataframe caching branch for easier access.
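
For context, a minimal sketch of how a lookup like this could work, assuming the DataFrame is matched by identity against a namespace dict (for example the notebook user namespace). `find_variable_name` is a hypothetical stand-in, not the actual `get_df_variable_name` implementation:

```python
from typing import Optional

import pandas as pd


def find_variable_name(df: pd.DataFrame, namespace: dict) -> Optional[str]:
    """Hypothetical: return the first public name bound to this exact object."""
    for name, value in namespace.items():
        # Skip private/underscore names such as the notebook output cache.
        if name.startswith("_"):
            continue
        if value is df:
            return name
    return None


my_awesome_df = pd.DataFrame({"a": [1, 2, 3]})
namespace = {"_": my_awesome_df, "my_awesome_df": my_awesome_df, "other": 1}
name = find_variable_name(my_awesome_df, namespace)
```

Matching by identity will miss copies of the same data, so a real implementation might need a content-based comparison as well.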

willingc · 2022-09-07T17:50:59Z

dx/utils/formatting.py

    s = geometry.handle_geometry_series(s)

    s = datatypes.handle_dict_series(s)
    s = datatypes.handle_sequence_series(s)
+    s = datatypes.handle_unk_type_series(s)


Why is the type unknown for these?

Mostly as a catch-all for objects that aren't JSON serializable (custom classes/objects) and that we haven't explicitly handled earlier, so they don't break the pandas build_json_schema() call. Definitely open to suggestions here!
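
One possible shape for such a catch-all, as a sketch only: try `json.dumps` on each value in an object column and fall back to `str()` when it fails. `stringify_unserializable` and `Widget` are hypothetical names for illustration, not the actual `handle_unk_type_series` code.

```python
import json

import pandas as pd


class Widget:
    """Stand-in for a custom class that the json module can't serialize."""

    def __str__(self) -> str:
        return "<Widget>"


def stringify_unserializable(s: pd.Series) -> pd.Series:
    """Hypothetical: replace values json.dumps rejects with their str() form."""
    if s.dtype != object:
        # Non-object columns are left to the earlier, explicit handlers.
        return s

    def _safe(value):
        try:
            json.dumps(value)
            return value
        except (TypeError, ValueError):
            return str(value)

    return s.map(_safe)


s = pd.Series([Widget(), "a", 1], dtype=object)
cleaned = stringify_unserializable(s)
```

Checking each value is slow on large columns, so checking only the first non-null value may be a reasonable trade-off.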

shouples added 9 commits September 6, 2022 16:50

remove callout placeholders

47e9cca

remove np.nan fill; remove duplicate cleaning functions

9262dba

pass remaining cleaning function to tests

ea2a284

remove column debug logging

7ddc06b

remove callout prep during sampling

429a2ec

ensure dtypes are consistent before/after sampling

6035cf5

handle fillna in payload generation

a572e5f

add tests to make sure null values don't cause problems internally

c831115

update test docstring

02d1d4e

shouples changed the title ~~fix JSON serialization errors and column cleaning~~ fix column cleaning and general cleanup Sep 7, 2022

shouples added 4 commits September 7, 2022 12:21

debug logging for dtype conversions

8f1f4fb

remove is_equal and use df.equals(other_df)

b8ddda1

skip cleaning for float/int/bool columns/datetime columns

a87b6ed

debug logging for column conversions

75b79df

shouples marked this pull request as ready for review September 7, 2022 17:41

willingc approved these changes Sep 7, 2022

shouples merged commit 5539563 into main Sep 7, 2022

shouples deleted the djs/cleanup branch September 7, 2022 17:53

shouples mentioned this pull request Sep 8, 2022

replace callout HTML with custom media type #20

Closed
