[UPDATED] Changing Data Cleaning & Labeling Dropped Observations #19

zacharymeurer · 2024-06-24T18:09:20Z

Edits

!!!BOLD MEANS NEW!!!

Data Cleaning Notebook

Fixed the method for filling in missing state names based on coordinates.
Implemented Jacob's feedback on the last PR: the notebook now records how many observations the anomaly detection process drops.
Improved outlier detection to produce fewer false positives by experimenting with and finalizing the contamination parameter of the isolation forests.
Thoroughly commented/annotated all code in the notebook.
Cleaned up and optimized the reference data cleaning code.
Created a script version of the data cleaning notebook.
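
For reviewers who have not opened the notebook, the anomaly detection changes above can be pictured with a minimal sketch. It assumes scikit-learn's IsolationForest; the function name drop_anomalies, the demo columns, and the contamination values are hypothetical and are not the ones used in the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest


def drop_anomalies(df, feature_cols, contamination=0.01, seed=0):
    """Return (kept_rows, n_dropped) after one isolation-forest pass.

    Hypothetical helper: the names and defaults are illustrative only.
    """
    model = IsolationForest(contamination=contamination, random_state=seed)
    # fit_predict returns 1 for inliers and -1 for rows flagged as outliers
    flags = model.fit_predict(df[feature_cols])
    kept = df[flags == 1]
    return kept, len(df) - len(kept)


# Synthetic stand-in data; the real notebook works on the citizen observations
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(500, 2)), columns=["x", "y"])

# Comparing a few contamination values shows how the drop count responds
for c in (0.01, 0.05):
    kept, n_dropped = drop_anomalies(demo, ["x", "y"], contamination=c)
    print(f"contamination={c}: dropped {n_dropped} of {len(demo)} observations")
```

A lower contamination value flags fewer rows, which is the lever for trading missed anomalies against false positives.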

Labeling Dropped Observations

Created a new notebook, expanding on the data cleaning notebook, that records when citizen observations are dropped by our data cleaning process and why.
Data is stored in a CSV file that labels each observation in the raw citizen data with a "validation_label" indicating whether the observation was dropped and for what reason.
Thoroughly commented/annotated all code in the notebook.
Fixed a small bug caused by setting false-positive (incorrect) -2 values to None, then reran the code to load the updated CSV.
Put the notebook and CSV for labeling dropped observations into a folder along with a README_key.md, which gives a brief description and a key for the validation labels.

Miscellaneous

Thoroughly commented/annotated and cleaned up the -2_values notebook.
Renamed the india folder to india_map to make its purpose clearer.
Deleted the data cleaning job notebook because it is deprecated and not useful to our project.

Key for "data_cleaning_labeled_alldata.csv":

Label	Meaning
0	Kept
1	Dropped because a phenophase was incorrectly reported as being -2
2	Dropped because a phenophase had missing data (Null Values)
3	Dropped because observation was flagged as anomalous
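
To make the key concrete, here is a minimal sketch of how such labels could be assigned with pandas. The function name, the two boolean masks, and the demo columns are hypothetical. The precedence when a row matches more than one reason (1 over 2 over 3) is an assumption of this sketch, not something stated in this PR.

```python
import numpy as np
import pandas as pd

# Label values taken from the key above
KEPT, BAD_MINUS_TWO, MISSING, ANOMALOUS = 0, 1, 2, 3


def label_observations(raw, phenophases, bad_minus_two, anomalous):
    """Build a validation_label Series aligned with the raw observations.

    bad_minus_two and anomalous are boolean Series on raw's index
    (hypothetical inputs for this sketch). Later assignments overwrite
    earlier ones, so the assumed precedence is 1 over 2 over 3.
    """
    labels = pd.Series(KEPT, index=raw.index, name="validation_label")
    labels.loc[anomalous] = ANOMALOUS
    labels.loc[raw[phenophases].isna().any(axis=1)] = MISSING
    labels.loc[bad_minus_two] = BAD_MINUS_TWO
    return labels


# Four made-up rows, one per label
raw = pd.DataFrame({
    "flower": [1.0, -2.0, np.nan, 0.0],
    "fruit": [0.0, 1.0, 1.0, 2.0],
})
bad = pd.Series([False, True, False, False])
anom = pd.Series([False, False, False, True])

labels = label_observations(raw, ["flower", "fruit"], bad, anom)
print(pd.concat([raw, labels], axis=1))
```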


review-notebook-app · 2024-06-24T18:09:50Z

Check out this pull request on ReviewNB.

See visual diffs & provide feedback on Jupyter Notebooks.


sh1v-ansh

Great! It's clean, descriptive, and organized.

colettebas

Two little comments!

colettebas · 2024-06-25T17:55:17Z

-2_values.ipynb

@@ -109639,7 +109542,7 @@
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
-   "version": "3.12.2"
+   "version": "3.11.6"


Interesting that this versioned down...

zacharymeurer · Jun 26, 2024

I was working both locally and through BU SCC. I'm guessing the version is inconsistent because of the way we set up our Conda environment. It shouldn't cause any problems in these notebooks, though.

colettebas · 2024-06-25T18:02:16Z

citizen_validation_labels/data cleaning labeling.ipynb

+    "    for phenophase in phenophases:\n",
+    "        if species_dict[phenophase] == 0: # Phenophase seen in species\n",
+    "            false_positive_idx = species_df.index[species_df[phenophase] == -2] # Indices of reports that incorrectly assign -2 values (false positive) to phenophases SEEN in the species\n",
+    "            df.loc[false_positive_idx, phenophase] = np.full(len(false_positive_idx),100.612) # Turn all false positives into 100.612 (these observations will later be dropped but they cannot be Null/NaN/None values yet, so they will not be identified by .isna() check later)\n",


How do you know that 100.612 will never be seen in any other entry?

zacharymeurer · Jun 26, 2024

All of these phenophase attributes are categorical values stored as floats (integers in [-2.0, 2.0], plus NaNs). The rest of the attributes are also integers stored as floats. Thus, the dataset contains no values with a nonzero fractional part, so 100.612 should not show up in it. I will implement a proof of this in the code.

zacharymeurer · Jun 26, 2024

Proof has been implemented and explained.
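
The reasoning in this thread can be checked mechanically. Below is a minimal sketch of one such check, assuming pandas and NumPy; the function name sentinel_is_safe and the demo frame are hypothetical, and this is not necessarily the proof that was added to the notebook.

```python
import numpy as np
import pandas as pd

SENTINEL = 100.612  # placeholder value from the reviewed diff


def sentinel_is_safe(df, columns, sentinel=SENTINEL):
    """Check that every non-NaN value is whole-valued and none equals the sentinel."""
    values = df[columns].to_numpy(dtype=float)
    present = values[~np.isnan(values)]
    # Whole-valued floats such as -2.0 or 1.0 equal their own rounding
    all_whole = bool(np.all(present == np.round(present)))
    sentinel_absent = not bool(np.any(present == sentinel))
    return all_whole and sentinel_absent


# Made-up phenophase columns holding integers in [-2.0, 2.0] and NaNs
demo = pd.DataFrame({
    "flower": [-2.0, 0.0, 1.0, np.nan],
    "fruit": [2.0, np.nan, -1.0, 0.0],
})
print(sentinel_is_safe(demo, ["flower", "fruit"]))
```

Running a check like this before the placeholder is written means a later equality test against 100.612 can only match rows the cleaning step marked itself.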


zacharymeurer and others added 10 commits June 20, 2024 15:03

Started cleaning and organizing data cleaning notebook. Implemented a…

0ea8b81

…nomaly detection into data cleaning notebook

tested anomaly detection on subset of data. Saved anomaly detection e…

997eabf

…xample for final presentation

Ran outlier detection on all citizen data and updated state citizen d…

cf3473c

…ata csv files

Made some code simpler and less error prone

d8e888e

Save state. Merging pvt_cleaning and data cleaning notebooks together

da7b922

Data cleaning for both reference and citize and citizen data centrali…

9d19365

…zed in data cleaning notebook. Code has been cleaned up, somewhat commented, and run. New cleaned and validated (through isolation forests) csv files are in the all data folder

Fixed filling missing state names to be significantly faster, and add…

8a26134

…ed data cleaning labeling notebook to label alldata.csv rows as valid or invalid and why

cleaned up code a little bit, and separated valid labels df from clea…

07054f0

…ned citizen df

Updated data cleaning notebook according to Jacob's feedback (reporte…

bee965a

…d how many observations were thrown out for being invalid). Also, made filling missing state names more efficient

Experimented with and set contamination parameter for isolation fores…

13ee2af

…ts, reducing false positives in outlier detection. Loaded newly cleaned data and data labeling. Began commenting code in data cleaning notebook

zacharymeurer requested review from colettebas, sh1v-ansh and AnaSof0 June 24, 2024 18:09

zacharymeurer self-assigned this Jun 24, 2024

colettebas reviewed Jun 24, 2024

data cleaning labeling.ipynb (outdated)

data_cleaning_job.ipynb (outdated)

data_cleaning_job.ipynb (outdated)

Zachary Meurer and others added 2 commits June 24, 2024 17:05

Commented code for all of citizen data cleaning and part of reference…

cfe7d53

… data cleaning. Also, modified some reference data cleaning code to make it more efficient.

Commented/annotated all of data cleaning and data cleaning labeling. …

e607f5c

…Also, fixed little bug in data cleaning labeling

AnaSof0 approved these changes Jun 25, 2024


Zachary Meurer and others added 5 commits June 25, 2024 11:10

Deleted data_cleaning_job.ipynb because it is deprecated. Commented/a…

1ca640c

…nnotated and cleand up -2_values.ipynb. Fixed some typos in comments in data cleaning and data cleaning labeling notebooks. Reran data cleaning labeling to fix a small bug

Cleared output of cells in data cleaning labeling to get rid of warni…

de5f66a

…ngs that have been fixed

Cleared output of cells in data cleaning labeling notebook to get rid…

b013298

… of warning that I fixed

Merge branch 'labeling-data-cleaning' of github.com:BU-Spark/pitne-se…

559b9fc

…ason-watch into labeling-data-cleaning Accidentally committed on locally instead of remotely. Local commit just fixed a small typo

Made folder and readme for citizen vdata validation labels

1ed6a88

zacharymeurer changed the title ~~Changing Data Cleaning & Labeling Dropped Observations~~ [UPDATED] Changing Data Cleaning & Labeling Dropped Observations Jun 25, 2024

zacharymeurer added 3 commits June 25, 2024 11:32

changed india folder title to india_map to make it more clear as to w…

0c5407a

…hat it is

Wrote brief description of the purpose of citizen_validation_labels

ae63585

Created a python script version of the data cleaning notebook

5212eb2

sh1v-ansh approved these changes Jun 25, 2024


colettebas reviewed Jun 25, 2024


Rename data cleaning labeling to have _ instead of spaces, and resolv…

0986075

…e Colette's comment about verifying 100.612 will work as a placeholder value in the data cleaning labeling notebook

zacharymeurer merged commit e5ae59d into dev Jun 26, 2024

funkyvoong deleted the labeling-data-cleaning branch September 17, 2024 18:15
