Merge pull request #28 from BU-Spark/dev

Final PR from Dev to Main
BU-Spark · Jan 17, 2025 · caa06b1
2 parents 9d1b928 + 7ca783d
Showing 33 changed files with 1,617,154 additions and 79 deletions.
diff --git a/README.md b/README.md
@@ -1,79 +1,181 @@
-# PIT-NE SeasonWatch Project Overview
-
-_Further details on project background, process, and results in SeasonWatch_project_report_
-
-SeasonWatch, a citizen science organization based in India, provided our PIT-NE team with a citizen database containing daily tree phenology data collected from citizen scientists in India and a reference database containing weekly tree phenology data collected from credible sources (e.g. textbooks).
-
-## Applications
-
-Our team processed and analyzed these databases to support SeasonWatch's climate research efforts:
-
-### Data Processing
-
-- Cleaned and reformatted citizen and reference databases (Made database formatting consistent, handled data with incorrectly reported features, etc.)
-- Developed a data validation system for citizen database (Used isolation forests for anomaly detection)
-
-### Data Analysis
-
-- Created visualizations of the citizen and reference data over time (Bar and line charts highlighting discrepancies between the citizen and reference observations over time)
-- Developed a process for selecting representative citizen observations over a year to use as up-to-date baselines for any species.
-- Designed a scoring function to identify flowering and fruiting stage transitions throughout a given year.
-
-## Repository Structure
-
-### code (Contains Python notebooks used in the final product)
-  - -2_values (Flags citizen observations with incorrect reports regarding the presence or absence of a phenophase in the reported species)
-  - data_cleaning (Cleans citizen and reference databases, and validates citizen database)
-  - mean_transition_times_generation (Creates visualizations and a dataset of probability distributions of phenophase transition times based on a score function)
-  - selecting_reference_data (Creates visualizations and a dataset of representative citizen observations selected as baselines)
-  - validation_labels (Flags citizen observations dropped during the data cleaning process and gives reasons for dropping them)
-  - visualizations (Creates visualizations of the citizen and reference data)
-  - year_to_year_transition_times_data_generation (Creates the year_to_year_transition_time dataset)
-### data (Contains CSV files of original data and data produced by the Python notebooks in code)
-  - citizen_states_cleaned (Cleaned and reformatted citizen database sorted by states)
-  - india_map (Geographic data used for finding the Indian state given a set of coordinates)
-  - original_citizen_data (Citizen database given by SeasonWatch)
-  - original_reference_data (Reference database given by SeasonWatch)
-  - reference_states_cleaned (Cleaned and reformatted reference database sorted by states)
-  - alldata_labeling_-2_all_species (Citizen database given by SeasonWatch with incorrect reports regarding the presence or absence of a phenophase in the reported species flagged)
-  - average_transition_times (Dataset of probability distributions of phenophase transition times based on a score function)
-  - cleaned_alldata (Cleaned and reformatted citizen database as one dataset)
-  - selected_reference_data (Dataset of representative citizen observations selected as baselines)
-  - species codes (Dataset mapping tree species ids to names)
-  - validation_labels_alldata (Citizen database given by SeasonWatch with citizen observations dropped during the data cleaning process flagged and reasons for dropping them given)
-  - year_to_year_transition_time (Dataset of max and mean transition time and probability of phenophases)
-### dev_code (Contains Python notebooks used in the development process)
-  - jobfiles (Files of jobs submitted to shared cloud computing service)
-  - scc-config (Config for submitting jobs to shared cloud computing service)
-  - kmeans_pca_testing (Experimenting with and visualizing data validation methods)
-  - mean transition times from repeat observations (Experimenting with only using regular citizen observations to find phenophase transition times)
-  - mean_transition_times_dev (Experimenting with different methods for finding phenophase transition times)
-  - plotting (Preliminary, experimental visualizations)
-  - ref_cit_na_comparison (Comparing how much citizen data has associated reference data)
-### plots (Contains PNG files depicting plots produced by the Python notebooks in code)
-
-> _Citizen observations are usually depicted as percentages. This measure indicates the percentage of citizen reports observing a phenophase in the given week._
->
-> _Plots report information weekly (48 weeks per year) over a year._
-
-  - combination_percentage_charts (Compares citizen data and reference data over time; bar charts indicate number of citizen observations that week)
-  - overlaid_percentage_plots (Compares related phenophases within citizen data over time; bar charts indicate number of citizen observations that week)
-  - repeat_combination_percentage_charts (Compares regular observations and all observations within citizen data over time; bar charts indicate number of citizen observations that week)
-  - repeat_observations (Compares differences between regular observations and reference data over time, and between all observations and reference data over time)
-  - selected_ref_vs_cit (Compares citizen data and selected baselines over time)
-  - transition_bar_plots (Depicts number of observations reporting a phenophase appearing over time)
-  - two_values_weighted (Compares percentage presence of a phenophase and the magnitude of the presence of a phenophase within the citizen data over time)
-
-## Usage Guide
-
-### Step 1: Data Cleaning
-
-Data should be cleaned, reformatted, and validated before it is applied to anything. Thus, the data cleaning notebook or script should be run before any visualization or analysis.
-
-> _Edit file paths within the code to any new citizen data or reference data CSV files._
-
-### Step 2: Plotting & Analysis
-
-Any other notebook within the code folder can be run next to update the data and plots. Notebooks have functions for plotting and producing datasets. Modify the function parameters (states, species, year, etc.) as needed (e.g., to get selected reference data on tamarind in Kerala in 2018, set the function parameters to match).
-
-> _Edit plot and CSV file paths within the code as needed._
+# SeasonWatch Project
+
+## Overview
+The SeasonWatch Project analyzes phenological changes in tree species in India, with a focus on Kerala, using citizen-science data collected from 2015 to 2023. It examines the relationship between climate change and tree phenology by identifying trends, seasonal shifts, and geographic variations. The final deliverables include visualizations, statistical analyses, and code that can be reused for similar ecological studies.
+
+## Objectives
+1. **Analyze phenological changes:** Investigate how trees respond to climate change and seasonal transitions.
+2. **Identify key patterns:** Focus on the timing of phenological stages (leaves, flowers, fruits) for the top 30 observed tree species.
+3. **Visualize shifts:** Create interactive and static visualizations to convey changes in onset weeks, geographic clustering, and seasonal variability.
+4. **Provide reusable tools:** Develop and share scripts, cleaned datasets, and workflows for future analysis.
+
+---
+
+## Deliverables
+### 1. Data Cleaning and Preparation
+- **Input Data:**
+  - Citizen-submitted observations from SeasonWatch (2015–2023).
+  - ~177 species with detailed phenological stage observations.
+- **Cleaned Data:**
+  - Filtered dataset with accurate geocoding, standardized state names, and missing values imputed or standardized.
+  - Historical (pre-2020) and comparative (post-2020) datasets for reference.
+
+### 2. Visualizations
+- **Interactive Visualizations:** Created with Flourish Studio, including:
+  - Seasonal shifts in onset weeks.
+  - Geographic clustering of phenological stages.
+- **Static Visuals:** Heatmaps, time-series plots, and summary tables saved to:
+  `/data/VISUALIZATIONS-fall 2024/Kerala Visuals`
+
+### 3. Statistical Analysis
+- Summary statistics, regression analysis, and survival modeling to address the project's core questions:
+  - How are trees changing due to climate change?
+  - What is the onset timing for flowering and fruiting in tropical species?
+  - What is the probability of transitioning between seasonal states?
+
+### 4. Final Report
+- Comprehensive PDF document (`SeasonWatch_Final_Report_fa24.pdf`) with:
+  - Visualizations.
+  - Interpretations of patterns and trends.
+  - Recommendations for conservation efforts.
+
+---
+
+## Getting Started
+### Prerequisites
+1. **Python Environment:**
+   - Python 3.9 or higher.
+   - Required libraries:
+     - `pandas`
+     - `numpy`
+     - `matplotlib`
+     - `geopandas` (v0.9.0)
+     - `shapely` (v2.0.1)
+     - `googlemaps`
+     - `seaborn`
+2. **Geospatial Tools:**
+   - Shapefiles for geographic visualizations are available in the `india_map` folder.
+   - Google Maps API key for geocoding (optional).
+3. **Data Files:**
+   - Cleaned datasets stored in the `/data` directory.
+
+### Installation
+1. Clone the repository:
+   ```bash
+   git clone https://github.com/your-org/seasonwatch-project.git
+   cd seasonwatch-project
+   ```
+2. Set up a virtual environment and install the dependencies:
+   ```bash
+   python -m venv env
+   source env/bin/activate  # On Windows: env\Scripts\activate
+   pip install -r requirements.txt
+   ```
+3. Download the cleaned datasets and place them in the `/data` directory.
+
+### Running the Code
+1. **Data Cleaning:**
+   ```bash
+   python scripts/data_cleaning.py
+   ```
+   Cleans raw observation data, standardizes formats, and generates cleaned datasets.
+
+2. **Geocoding:**
+   If you are using the Google Maps API, set your API key in `config.py`:
+   ```python
+   GOOGLE_API_KEY = 'your-api-key'
+   ```
+   Run the geocoding script:
+   ```bash
+   python scripts/geocoding.py
+   ```
+
+3. **Visualization Generation:**
+   Generate visuals:
+   ```bash
+   python scripts/visualizations.py
+   ```
+   Outputs are saved in `/data/VISUALIZATIONS-fall 2024`.
+
+4. **Statistical Analysis:**
+   Perform analyses and generate summary tables:
+   ```bash
+   python scripts/analysis.py
+   ```
+
+---
+
+## Directory Structure
+```
+seasonwatch-project/
+├── data/
+│   ├── raw_data.csv
+│   ├── cleaned_data.csv
+│   ├── VISUALIZATIONS-fall 2024/
+│   │   ├── Kerala Visuals/
+│   │   └── ...
+├── scripts/
+│   ├── data_cleaning.py
+│   ├── geocoding.py
+│   ├── visualizations.py
+│   └── analysis.py
+├── india_map/
+│   ├── shapefiles/
+│   └── ...
+├── README.md
+├── requirements.txt
+└── config.py
+```
+
+---
+
+## Blockers Faced and Solutions
+### Blockers
+1. **Google API Costs:**
+   - Budget constraints limited the number of geocoding requests.
+   - **Solution:** Implemented a caching mechanism to minimize API calls and explored free alternatives like OpenStreetMap.
+
+2. **Anomalies in Data:**
+   - Missing or incorrect data for several observations.
+   - **Solution:** Used statistical imputation techniques and cross-referenced with the SeasonWatch tree phenology handbook to standardize missing values.
+
+3. **Processing Speed:**
+   - Geocoding and data cleaning scripts were slow due to large datasets.
+   - **Solution:** Added `try`/`except` error handling so that individual failures do not halt a run, and used parallel processing where possible.
+
+4. **Prior Data Loss:**
+   - Previous teams dropped too many rows during data cleaning.
+   - **Solution:** Reviewed the raw data closely and found only 20,000 rows with missing values, which allowed us to preserve as much data as possible.
+
+---
+
+## Next Steps for Future Teams
+1. **Enhance Visualizations:**
+   - Improve interactivity in Flourish Studio and integrate additional features, such as user filtering by state or species.
+   - Explore advanced visualization libraries like Plotly or D3.js for greater customization.
+
+2. **Expand Analysis:**
+   - Incorporate additional years of data beyond 2023 to track long-term trends.
+   - Perform deeper survival and Markov modeling to understand tree state transitions.
+
+3. **Climate Correlation:**
+   - Integrate external climate datasets (e.g., rainfall, temperature) to analyze correlations with phenological changes.
+
+4. **Optimize Geocoding:**
+   - Automate retries for failed geocoding requests and further explore OpenStreetMap as a cost-free option.
+
+5. **Machine Learning Models:**
+   - Apply machine learning techniques to predict phenological stages based on climate and temporal data.
+
+6. **Documentation:**
+   - Update README and inline comments regularly for new tools or methods added to the project.
+
+---
+
+## Contributors
+- **Team Members:** Cecily Wang-Munoz, Aditya Chopra, An Ngo (Sue), Brenda Kim
+
+---
+
+## Acknowledgments
+This project uses data from SeasonWatch and insights from the SeasonWatch tree phenology handbook. Special thanks to BU SPARK for supporting the project.
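
The README above mentions a caching mechanism that reduced Google geocoding calls, but the diff shown here does not include its code. The following is a minimal sketch of how such a cache could work, not the project's actual implementation. The class name `GeocodeCache`, the JSON cache file, and the injected `geocode_fn` callable are all hypothetical; `geocode_fn` stands in for whichever geocoding service is used and is assumed to return a `(lat, lng)` tuple or `None`.

```python
import json
from pathlib import Path
from typing import Callable, Dict, Optional, Tuple

Coords = Optional[Tuple[float, float]]


class GeocodeCache:
    """Disk-backed cache mapping a normalized address to (lat, lng) or None.

    Hypothetical helper: names and file format are illustrative assumptions.
    """

    def __init__(self, path: str, geocode_fn: Callable[[str], Coords]):
        self.path = Path(path)
        # Assumption: any callable taking an address and returning
        # (lat, lng) or None, e.g. a thin wrapper around a geocoding API.
        self.geocode_fn = geocode_fn
        self.api_calls = 0  # counts real lookups, useful for budgeting
        self.cache: Dict[str, Coords] = {}
        if self.path.exists():
            raw = json.loads(self.path.read_text(encoding="utf-8"))
            # JSON stores tuples as lists, so convert them back on load.
            self.cache = {
                key: (tuple(value) if value is not None else None)
                for key, value in raw.items()
            }

    @staticmethod
    def _key(address: str) -> str:
        # Normalize case and whitespace so near-duplicate spellings share an entry.
        return " ".join(address.lower().split())

    def lookup(self, address: str) -> Coords:
        key = self._key(address)
        if key not in self.cache:
            # Only call the (paid) geocoder when the address is unseen.
            self.api_calls += 1
            self.cache[key] = self.geocode_fn(address)
        return self.cache[key]

    def save(self) -> None:
        # Persist the cache so later runs reuse earlier results.
        self.path.write_text(json.dumps(self.cache), encoding="utf-8")
```

Note that this sketch also caches failed lookups (`None`), which saves requests but means a failed address is never retried; a variant that skips caching `None` would fit the retry idea listed under Next Steps.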
diff --git a/SeasonWatch_Final_Report_fa24.pdf b/SeasonWatch_Final_Report_fa24.pdf