Predicting Electricity Usage

Project Dashboard

Project Presentation

By: Ahmed Ayaz, Krish Patel & Dylan Patel
Date: 12/2/2024

Important Note

This project is contained entirely in a single Jupyter notebook: final_project/Electricity_Usage_Prediction.ipynb. All code, analysis, visualizations, and detailed documentation can be found in this notebook. Please refer to this file for the complete project implementation.

1. Introduction

1.1 Project Overview

This project analyzes the impact of weather conditions (temperature, snowfall, and precipitation) on electricity usage across all 50 U.S. states. Using historical weather and electricity data, we investigate correlations between environmental factors and electricity demand, then build models to predict future usage.

1.2 Key Questions

How do weather factors such as temperature, snow, and precipitation influence electricity usage across different states?
What is the relationship between weather conditions and electricity pricing trends over time?
How accurately can future electricity usage be predicted based on historical weather data?

1.3 Brief Process

Data Gathering: Collection of weather data and electricity usage data for all 50 states
Data Preparation: Cleaning, preprocessing, and integration of weather and electricity data
Data Analysis: Exploration of trends and correlations between weather conditions and electricity metrics
Modeling and Prediction: Development of predictive models for electricity usage
Evaluation and Insights: Evaluation of model performance and insights generation

1.4 Project Objective

The primary objective is to build a robust model that accurately predicts electricity usage based on weather patterns. This analysis helps identify key weather factors that drive electricity demand, supporting better resource planning strategies.

1.5 Project Importance

Understanding the link between weather conditions and electricity demand is crucial for utility companies, policymakers, and energy providers. Accurate forecasting can help optimize energy distribution and guide sustainable energy policies.

2. Data Preparation

2.1 Data Sources

Electricity Dataset:
- Source: Kaggle's Electricity Prices Dataset
- Content: Detailed information on electricity prices across all sectors and U.S. states
- Time Period: 2001-2024
Weather Dataset:
- Source: Meteostat Python Library
- Content: Daily average temperature, precipitation, and snowfall data
- Coverage: All major U.S. cities
City Information Dataset:
- Source: SimpleMaps U.S. Cities Data
- Content: Geographic information (latitude, longitude, population)
- Purpose: Location data for weather data retrieval

2.2 Data Cleaning

Electricity Usage Data Cleaning:
- Missing value handling
- Column standardization
- Relevant column filtering
Weather Data Cleaning:
- Temperature conversion (Celsius to Fahrenheit)
- Outlier removal
- Date format standardization
Data Integration Preparation:
- Structure compatibility verification
- Temporal scale alignment
- Granularity matching
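
The weather-cleaning and alignment steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's code: the function names, the record layout (`date`, `tavg_c`), and the choice of monthly averaging to match the electricity data are all assumptions.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def celsius_to_fahrenheit(temp_c):
    """Convert a temperature from Celsius to Fahrenheit."""
    return temp_c * 9 / 5 + 32


def monthly_mean_temp_f(daily_records):
    """Aggregate daily Celsius readings into monthly Fahrenheit means.

    daily_records: iterable of dicts with 'date' (YYYY-MM-DD string) and
    'tavg_c' (float or None). This layout is hypothetical.
    """
    by_month = defaultdict(list)
    for rec in daily_records:
        if rec["tavg_c"] is None:  # skip missing readings
            continue
        day = datetime.strptime(rec["date"], "%Y-%m-%d")  # standardize the date
        by_month[(day.year, day.month)].append(celsius_to_fahrenheit(rec["tavg_c"]))
    # One value per (year, month), so daily weather lines up with coarser data
    return {month: mean(vals) for month, vals in sorted(by_month.items())}


# Toy input, for illustration only
daily = [
    {"date": "2024-01-01", "tavg_c": 0.0},
    {"date": "2024-01-02", "tavg_c": 10.0},
    {"date": "2024-01-03", "tavg_c": None},
    {"date": "2024-02-01", "tavg_c": 20.0},
]
monthly = monthly_mean_temp_f(daily)
```

Keying the result by `(year, month)` makes it straightforward to join against another table that uses the same key.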

2.3 Feature Engineering

Time-Based Features:
- Season categorization
- Quarterly representation
Weather Features:
- Temperature range calculation
- Precipitation intensity categories
- Binary weather indicators:
  - is_high_temp: Temperature > 80°F
  - is_low_temp: Temperature < 32°F
  - has_precipitation: Precipitation > 0
  - is_high_precipitation: Precipitation > 1 inch
  - has_snowfall: Snowfall > 0
  - is_heavy_snowfall: Snowfall > 5 inches
Electricity Usage Metrics:
- Per capita usage calculation
- Price-to-sales ratio

3. Data Analysis

3.1 Exploratory Data Analysis Findings

Customer Distribution:
- Average: ~3 million customers
- High standard deviation indicating substantial variation
- Range influenced by state population sizes
Price Analysis:
- Range: 3.78 to 42.76 cents per kWh
- Average: ~10 cents per kWh
- Significant state-specific variations
Temperature Distribution:
- Mean: 54°F
- Range: -3.7°F to 90.25°F
- Balanced seasonal distribution

3.2 Seasonal and Weather Impact Analysis

Usage Patterns:
- Highest usage in Summer
- Lower usage in Spring
- Winter usage varies by region
Temperature Impact:
- Strong inverse relationship with usage
- Peak usage during extreme temperatures
- Regional variation in temperature sensitivity
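
One way to quantify a temperature relationship like the one described above is a Pearson correlation. The sketch below uses made-up numbers and an assumed comfort point of 65°F (`COMFORT_F`); neither comes from the notebook. It compares correlating usage against raw temperature with correlating it against the distance from the comfort point, which can matter when usage rises at both hot and cold extremes.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Toy data: usage is high at both cold and hot temperatures
temps_f = [10, 30, 50, 65, 80, 100]
usage = [90, 70, 50, 40, 55, 75]

COMFORT_F = 65  # assumed balance point, not taken from the project

r_raw = pearson(temps_f, usage)
r_deviation = pearson([abs(t - COMFORT_F) for t in temps_f], usage)
```

On data shaped like this, the linear correlation with raw temperature understates the relationship, while the correlation with the deviation from the comfort point captures the peaks at both extremes.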

4. Modeling and Prediction

4.1 Model Selection: XGBoost

Selected for its advantages:

Handling non-linear relationships
Built-in regularization that helps limit overfitting
Speed and efficiency
Feature importance insights
Hyperparameter tuning flexibility

4.2 Model Configuration

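from xgboost import XGBRegressor

model = XGBRegressor(
    n_estimators=200,      # number of boosting rounds (trees)
    learning_rate=0.03,    # shrinkage applied to each tree's contribution
    max_depth=5,           # maximum depth of each tree
    min_child_weight=3,    # minimum sum of instance weights needed in a child
    subsample=0.9,         # fraction of rows sampled for each tree
    colsample_bytree=0.9,  # fraction of features sampled for each tree
    reg_alpha=0.1,         # L1 regularization on leaf weights
    reg_lambda=1,          # L2 regularization on leaf weights
    random_state=42,       # seed for reproducibility
    eval_metric='rmse'     # metric reported on evaluation sets
)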
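
For context on what several of these settings control, the general XGBoost formulation can be written as below. This is the standard form from the XGBoost documentation, not something specific to this notebook.

```latex
% Additive update: each new tree f_t is scaled by the learning rate \eta
\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + \eta \, f_t(x_i)

% Regularized objective over K trees
\mathrm{Obj} = \sum_{i} l\left(y_i, \hat{y}_i\right)
  + \sum_{k=1}^{K} \left( \gamma T_k + \tfrac{1}{2} \lambda \lVert w_k \rVert_2^2 + \alpha \lVert w_k \rVert_1 \right)
```

Here η is learning_rate (0.03), K is n_estimators (200), λ is reg_lambda (1), and α is reg_alpha (0.1). T_k is the number of leaves in tree k and w_k is its vector of leaf weights. γ is not set in the configuration above, so the library default applies.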

4.3 Modeling Approaches

4.3.1 Initial Approach

Performance varied by sales volume:

Small amounts (500-1,000): 9.8% error
Medium amounts (3,000-9,000): 3.6% error
Large amounts (>10,000): 23.6% error
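
A per-bucket error breakdown like the one above can be computed along these lines. This is a sketch under assumptions: the error metric is taken to be absolute percentage error, and the bucket edges in `volume_bucket` are hypothetical, since the ranges listed above leave gaps between them.

```python
from statistics import mean


def abs_pct_error(actual, predicted):
    """Absolute percentage error of one prediction."""
    return abs(predicted - actual) / abs(actual) * 100


def volume_bucket(actual):
    """Assign a sales value to a size bucket (edges are assumed)."""
    if actual < 3000:
        return "small"
    if actual <= 10000:
        return "medium"
    return "large"


def error_by_bucket(pairs):
    """Mean absolute percentage error per bucket.

    pairs: iterable of (actual, predicted) tuples.
    """
    buckets = {}
    for actual, predicted in pairs:
        buckets.setdefault(volume_bucket(actual), []).append(
            abs_pct_error(actual, predicted)
        )
    return {name: mean(errors) for name, errors in buckets.items()}


# Toy predictions, for illustration only
pairs = [(800, 880), (1000, 950), (5000, 5100), (20000, 15000)]
summary = error_by_bucket(pairs)
```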

4.3.2 Log Transform Approach

Implemented to handle scale differences:

Reduced scale difference between small and large values
Improved model's handling of large values
Significant RMSE improvement:
- Training: 0.98266 → 0.07672
- Validation: 0.98030 → 0.09716
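
The idea behind a log-transformed target can be illustrated with a short sketch. It assumes `log1p`/`expm1` as the transform pair, which the section above does not specify, and uses made-up numbers in which every prediction is 10% too high.

```python
from math import expm1, log1p, sqrt


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Toy values spanning two orders of magnitude; each prediction is 10% high
sales = [600.0, 5000.0, 40000.0]
predicted = [660.0, 5500.0, 44000.0]

# On the raw scale the largest value dominates the error
rmse_raw = rmse(sales, predicted)

# On the log scale each point contributes roughly its relative error
rmse_log = rmse([log1p(v) for v in sales], [log1p(v) for v in predicted])

# Predictions made on the log scale are mapped back with the inverse transform
recovered = [expm1(log1p(v)) for v in sales]
```

Note that an RMSE computed on the log scale behaves like a relative error, so it is not directly comparable with an RMSE measured in the original units.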

4.3.3 Final Stratified Approach

Population-based strategy results:

Small Population States:
- Average Error: 13.65%
- Median Error: 10.24%
- Error Range: 0.06% - 54.96%
Medium Population States:
- Average Error: 9.05%
- Median Error: 7.51%
- Error Range: 0.06% - 49.22%
Large Population States:
- Average Error: 10.58%
- Median Error: 7.58%
- Error Range: 0.05% - 51.77%
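
Per-group statistics like those above could be produced roughly as follows. The sketch assumes states are split into three groups at the population tertiles; the actual cut points used in the notebook are not stated here, and the state labels and numbers are placeholders.

```python
from statistics import mean, median, quantiles


def population_strata(populations):
    """Label each state small/medium/large by population tertile (assumed split)."""
    low_cut, high_cut = quantiles(populations.values(), n=3)
    strata = {}
    for state, pop in populations.items():
        if pop <= low_cut:
            strata[state] = "small"
        elif pop <= high_cut:
            strata[state] = "medium"
        else:
            strata[state] = "large"
    return strata


def summarize_errors(strata, pct_errors):
    """Mean, median, and range of percentage errors per stratum.

    pct_errors: iterable of (state, percentage_error) tuples.
    """
    grouped = {}
    for state, err in pct_errors:
        grouped.setdefault(strata[state], []).append(err)
    return {
        name: {
            "mean": mean(errs),
            "median": median(errs),
            "min": min(errs),
            "max": max(errs),
        }
        for name, errs in grouped.items()
    }


# Placeholder states and values, for illustration only
populations = {"A": 1, "B": 2, "C": 5, "D": 6, "E": 20, "F": 30}
pct_errors = [("A", 12.0), ("B", 8.0), ("C", 5.0), ("D", 9.0), ("E", 7.0), ("F", 11.0)]

strata = population_strata(populations)
report = summarize_errors(strata, pct_errors)
```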

5. Conclusion and Insights

5.1 Key Findings

Population-Based Strategy:
- Effective handling of varying state sizes
- More targeted predictions for different categories
- Better scale difference handling
State-Specific Insights:
- Small states: Highest average error (13.65%) and widest error range
- Medium states: Lowest average error (9.05%)
- Large states: Intermediate average error (10.58%), with a median error close to that of medium states
Seasonal Patterns:
- Clear summer usage peaks
- Regional winter variations
- Strong temperature impact

5.2 Limitations

Data Structure:
- Fixed population numbers
- Limited weather data granularity
- Simplified seasonal indicators
Geographic Factors:
- State-level aggregation limitations
- Climate zone boundary exclusion
- Missing regional interconnections
External Variables:
- Economic factors not included
- Policy changes not considered
- Industrial development not tracked

5.3 Future Work

Technical Enhancements:
- City-level predictions
- Climate zone stratification
- Region-specific models
Data Expansion:
- Industrial usage patterns
- Demographic trends
- Policy change indicators
Analysis Extensions:
- State clustering analysis
- Cross-state dependencies
- Extreme weather impacts

5.4 Setup and Usage

Clone the repository
Install required packages:

pip install pandas numpy xgboost scikit-learn matplotlib seaborn

Open and run final_project/Electricity_Usage_Prediction.ipynb

6. References

6.1 Data Sources

Electricity Dataset: Kaggle
Weather Data: Meteostat Python Library
City Information: SimpleMaps U.S. Cities Data

6.2 Tools and Libraries

Data Analysis & Machine Learning:
- NumPy: Numerical computing and array operations
- Pandas: Data manipulation and analysis
- Scikit-learn: Machine learning algorithms and evaluation metrics
- XGBoost: Gradient boosting implementation
- Seaborn: Statistical data visualization
- Matplotlib: Data visualization and plotting
- Meteostat: Weather data retrieval and processing
Dashboard & Visualization:
- Panel: Interactive web application framework
- Flask: Web application framework
- Render: Cloud hosting service


Ahmedayaz1210/electricity-usage-prediction
