[UPDATED] Changing Data Cleaning & Labeling Dropped Observations #19
…nomaly detection into data cleaning notebook
…xample for final presentation
…zed in data cleaning notebook. Code has been cleaned up, somewhat commented, and run. New cleaned and validated (through isolation forests) csv files are in the all data folder
…ed data cleaning labeling notebook to label alldata.csv rows as valid or invalid and why
…d how many observations were thrown out for being invalid). Also, made filling missing state names more efficient
…ts, reducing false positives in outlier detection. Loaded newly cleaned data and data labeling. Began commenting code in data cleaning notebook
… data cleaning. Also, modified some reference data cleaning code to make it more efficient.
…Also, fixed little bug in data cleaning labeling
…nnotated and cleaned up -2_values.ipynb. Fixed some typos in comments in data cleaning and data cleaning labeling notebooks. Reran data cleaning labeling to fix a small bug
…ngs that have been fixed
… of warning that I fixed
…ason-watch into labeling-data-cleaning. Accidentally committed locally instead of remotely. The local commit just fixed a small typo
Great! It's clean, descriptive, and organized.
Two little comments!
@@ -109639,7 +109542,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.12.2"
+   "version": "3.11.6"
Interesting that this versioned down...
I was working both locally and on the BU SCC. I'm guessing the versions are inconsistent because of the way we set up our Conda environment. It shouldn't cause any problems in these notebooks, though.
"    for phenophase in phenophases:\n",
"        if species_dict[phenophase] == 0: # Phenophase seen in species\n",
"            false_positive_idx = species_df.index[species_df[phenophase] == -2] # Indices of reports that incorrectly assign -2 values (false positive) to phenophases SEEN in the species\n",
"            df.loc[false_positive_idx, phenophase] = np.full(len(false_positive_idx), 100.612) # Turn all false positives into 100.612 (these observations will later be dropped but they cannot be Null/NaN/None values yet, so they will not be identified by .isna() check later)\n",
How do you know that 100.612 will never be seen in any other entry?
All of these phenophase attributes are categorical floats (whole numbers in [-2.0, 2.0], plus NaNs). The rest of the attributes are also integers stored as floats. So the dataset contains no decimal values other than .0, and 100.612 should not appear anywhere in it. I will implement a proof of this in the code.
Proof has been implemented and explained
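For readers following the thread, a check along these lines could look roughly like the sketch below. It is not the code from the notebook: the function name `placeholder_is_safe` and the demo column names are hypothetical, and it assumes the data is loaded into a pandas DataFrame whose numeric columns hold whole-number floats and NaNs, as described above.

```python
import numpy as np
import pandas as pd

PLACEHOLDER = 100.612  # sentinel value used in the labeling snippet above


def placeholder_is_safe(df: pd.DataFrame, placeholder: float = PLACEHOLDER) -> bool:
    """Return True if `placeholder` cannot collide with existing numeric data.

    Hypothetical helper: checks that every non-NaN numeric value is a whole
    number and that the placeholder itself does not already occur.
    """
    # Only numeric columns matter; string columns (e.g. state names) are skipped.
    values = df.select_dtypes(include="number").to_numpy(dtype=float)
    observed = values[~np.isnan(values)]

    all_whole = bool(np.all(observed == np.round(observed)))
    absent = not bool(np.any(observed == placeholder))
    return all_whole and absent


# Toy data with made-up column names, not the real alldata.csv.
demo = pd.DataFrame(
    {
        "leaves": [-2.0, 0.0, 1.0, np.nan],
        "flowers": [2.0, -1.0, np.nan, 0.0],
        "state": ["MA", "NH", "VT", "ME"],
    }
)
safe_before = placeholder_is_safe(demo)

collided = demo.copy()
collided.loc[0, "leaves"] = PLACEHOLDER
safe_after = placeholder_is_safe(collided)
```

Running a check like this before assigning the placeholder makes the assumption explicit: if a fractional value ever enters the dataset, the check fails instead of silently mislabeling rows.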
…e Colette's comment about verifying 100.612 will work as a placeholder value in the data cleaning labeling notebook
Edits
!!!BOLD MEANS NEW!!!
Data Cleaning Notebook
Labeling Dropped Observations
Miscellaneous
Key for "data_cleaning_labeled_alldata.csv":