[UPDATED] Changing Data Cleaning & Labeling Dropped Observations #19

Merged: 21 commits into dev from labeling-data-cleaning, Jun 26, 2024

Conversation

@zacharymeurer zacharymeurer commented Jun 24, 2024

Edits

!!!BOLD MEANS NEW!!!

Data Cleaning Notebook

  • Fixed the method for filling in missing state names based on coordinates.
  • Implemented Jacob's feedback on the last PR: Record how many observations are being dropped by the anomaly detection process.
  • Improved outlier detection to produce fewer false positives by experimenting with and finalizing the contamination parameter of the isolation forests.
  • Thoroughly commented/annotated all code in notebook
  • Cleaned up and optimized reference data cleaning code
  • Created a script version of data cleaning notebook

Labeling Dropped Observations

  • Created a new notebook expanding on the data cleaning notebook to record when citizen observations are dropped by our data cleaning process and why.
  • Data is stored in a CSV file labeling each observation in the raw citizen data with a "validation_label" indicating whether the observation was dropped and for what reason.
  • Thoroughly commented/annotated all code in notebook
  • Fixed a small bug caused by false-positive (incorrectly reported) -2 values being set to None, then reran the code to load the updated CSV
  • Put the notebook and CSV for labeling dropped observations into a folder along with a README_key.md, which gives a brief description and a key for the validation labels

Miscellaneous

  • Thoroughly commented/annotated and cleaned up -2_values notebook
  • Changed india folder title to india_map to make it more intuitive
  • Deleted data cleaning job notebook because it is deprecated and not useful to our project

Key for "data_cleaning_labeled_alldata.csv":

Label   Meaning
0       Kept
1       Dropped because a phenophase was incorrectly reported as being -2
2       Dropped because a phenophase had missing data (Null Values)
3       Dropped because observation was flagged as anomalous
zacharymeurer and others added 10 commits June 20, 2024 15:03
…nomaly detection into data cleaning notebook
…zed in data cleaning notebook. Code has been cleaned up, somewhat commented, and run. New cleaned and validated (through isolation forests) csv files are in the all data folder
…ed data cleaning labeling notebook to label alldata.csv rows as valid or invalid and why
…d how many observations were thrown out for being invalid). Also, made filling missing state names more efficient
…ts, reducing false positives in outlier detection. Loaded newly cleaned data and data labeling. Began commenting code in data cleaning notebook

Zachary Meurer and others added 2 commits June 24, 2024 17:05
… data cleaning. Also, modified some reference data cleaning code to make it more efficient.
…Also, fixed little bug in data cleaning labeling
Zachary Meurer and others added 5 commits June 25, 2024 11:10
…nnotated and cleand up -2_values.ipynb. Fixed some typos in comments in data cleaning and data cleaning labeling notebooks. Reran data cleaning labeling to fix a small bug
…ason-watch into labeling-data-cleaning

Accidentally committed on locally instead of remotely. Local commit just fixed a small typo
@zacharymeurer zacharymeurer changed the title Changing Data Cleaning & Labeling Dropped Observations [UPDATED] Changing Data Cleaning & Labeling Dropped Observations Jun 25, 2024

@sh1v-ansh sh1v-ansh left a comment

Great! It's clean, descriptive, and organized.

@colettebas colettebas left a comment

Two little comments!

@@ -109639,7 +109542,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.12.2"
+"version": "3.11.6"

Interesting that this versioned down...

Contributor Author

I was working both locally and on the BU SCC. I'm guessing the version is inconsistent because of the way we set up our Conda environment. It shouldn't cause any problems in these notebooks, though.

" for phenophase in phenophases:\n",
" if species_dict[phenophase] == 0: # Phenophase seen in species\n",
" false_positive_idx = species_df.index[species_df[phenophase] == -2] # Indices of reports that incorrectly assign -2 values (false positive) to phenophases SEEN in the species\n",
" df.loc[false_positive_idx, phenophase] = np.full(len(false_positive_idx),100.612) # Turn all false positives into 100.612 (these observations will later be dropped but they cannot be Null/NaN/None values yet, so they will not be identified by .isna() check later)\n",

How do you know that 100.612 will never be seen in any other entry?

Contributor Author

All of these phenophase attributes are categorical values stored as floats (whole numbers in the range [-2.0, 2.0], plus NaNs). The rest of the attributes are also whole numbers stored as floats. Thus, the dataset contains no values with a fractional part other than .0, so 100.612 should not appear in it. I will implement a proof of this in the code.

Contributor Author

Proof has been implemented and explained
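
For readers without access to the notebook, the argument in this thread can be sketched as a small check. This is only an illustration under the assumption stated above (every real value is missing or a whole number stored as a float); the function name `sentinel_is_safe` is hypothetical, and the notebook's actual proof may look different.

```python
import math

SENTINEL = 100.612  # placeholder value used in the notebook excerpt above


def sentinel_is_safe(values, sentinel=SENTINEL):
    """Return True if `sentinel` cannot collide with any value in `values`.

    The check passes only when every non-missing value is a whole number
    and the sentinel itself has a fractional part.
    """
    for v in values:
        if v is None or (isinstance(v, float) and math.isnan(v)):
            continue  # missing values are handled by a separate null check
        if not float(v).is_integer():
            return False  # a fractional value exists, so the assumption fails
    # Every observed value is whole, so any fractional sentinel is unique.
    return not float(sentinel).is_integer()
```

Under this sketch, a column such as `[-2.0, 0.0, 1.0, float("nan"), 2.0]` passes, while a column that already contains a fractional value, or a whole-number sentinel such as 100.0, fails.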

…e Colette's comment about verifying 100.612 will work as a placeholder value in the data cleaning labeling notebook
@zacharymeurer zacharymeurer merged commit e5ae59d into dev Jun 26, 2024
@funkyvoong funkyvoong deleted the labeling-data-cleaning branch September 17, 2024 18:15
4 participants