By: Ahmed Ayaz, Krish Patel & Dylan Patel
Date: 12/2/2024
This project is contained entirely in a single Jupyter notebook: final_project/Electricity_Usage_Prediction.ipynb. All code, analysis, visualizations, and detailed documentation are in that notebook; refer to it for the complete project implementation.
- Introduction
- Data Preparation
- Data Analysis
- Modeling and Prediction
- Conclusion and Insights
- References
This project analyzes the impact of weather conditions—including temperature, snow, and precipitation—on electricity usage across all 50 U.S. states. Using historical weather data and electricity usage information, we investigate correlations between environmental factors and electricity demand to build predictive models for future electricity usage.
- How do weather factors such as temperature, snow, and precipitation influence electricity usage across different states?
- What is the relationship between weather conditions and electricity pricing trends over time?
- How accurately can future electricity usage be predicted based on historical weather data?
- Data Gathering: Collection of weather data and electricity usage data for all 50 states
- Data Preparation: Cleaning, preprocessing, and integration of weather and electricity data
- Data Analysis: Exploration of trends and correlations between weather conditions and electricity metrics
- Modeling and Prediction: Development of predictive models for electricity usage
- Evaluation and Insights: Evaluation of model performance and insights generation
The primary objective is to build a robust model that accurately predicts electricity usage based on weather patterns. This analysis helps identify key weather factors that drive electricity demand, supporting better resource planning strategies.
Understanding the link between weather conditions and electricity demand is crucial for utility companies, policymakers, and energy providers. Accurate forecasting can help optimize energy distribution and guide sustainable energy policies.
- Electricity Dataset:
  - Source: Kaggle's Electricity Prices Dataset
  - Content: Detailed information on electricity prices across all sectors and U.S. states
  - Time Period: 2001-2024
- Weather Dataset:
  - Source: Meteostat Python Library
  - Content: Daily average temperature, precipitation, and snowfall data
  - Coverage: All major U.S. cities
- City Information Dataset:
  - Source: SimpleMaps U.S. Cities Data
  - Content: Geographic information (latitude, longitude, population)
  - Purpose: Location data for weather data retrieval
- Electricity Usage Data Cleaning:
  - Missing value handling
  - Column standardization
  - Relevant column filtering
- Weather Data Cleaning:
  - Temperature conversion (Celsius to Fahrenheit)
  - Outlier removal
  - Date format standardization
- Data Integration Preparation:
  - Structure compatibility verification
  - Temporal scale alignment
  - Granularity matching
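The cleaning and alignment steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's code: the column names (`tavg_c`, `sales`), the sample values, and the assumption that the electricity data is keyed by state, year, and month are all hypothetical.

```python
import pandas as pd

# Hypothetical daily weather readings in Celsius (illustrative values only).
weather = pd.DataFrame({
    "state": ["NJ", "NJ", "NJ", "NJ"],
    "date": ["2023-01-01", "2023-01-02", "2023-02-01", "2023-02-02"],
    "tavg_c": [0.0, 10.0, 5.0, 15.0],
})

# Standardize the date format and convert Celsius to Fahrenheit.
weather["date"] = pd.to_datetime(weather["date"])
weather["tavg_f"] = weather["tavg_c"] * 9 / 5 + 32

# Temporal scale alignment: collapse daily readings to one row per state and month.
weather["year"] = weather["date"].dt.year
weather["month"] = weather["date"].dt.month
monthly_weather = (
    weather.groupby(["state", "year", "month"], as_index=False)["tavg_f"].mean()
)

# Hypothetical monthly electricity records keyed the same way (assumed layout).
electricity = pd.DataFrame({
    "state": ["NJ", "NJ"],
    "year": [2023, 2023],
    "month": [1, 2],
    "sales": [6100.0, 5400.0],
})

# Granularity now matches, so the two tables can be joined on shared keys.
merged = electricity.merge(monthly_weather, on=["state", "year", "month"], how="inner")
print(merged)
```

An inner join drops any state-month that is missing from either table, which makes gaps visible early; a left join would keep every electricity row instead.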
- Time-Based Features:
  - Season categorization
  - Quarterly representation
- Weather Features:
  - Temperature range calculation
  - Precipitation intensity categories
  - Binary weather indicators:
    - is_high_temp: Temperature > 80°F
    - is_low_temp: Temperature < 32°F
    - has_precipitation: Precipitation > 0
    - is_high_precipitation: Precipitation > 1 inch
    - has_snowfall: Snowfall > 0
    - is_heavy_snowfall: Snowfall > 5 inches
- Electricity Usage Metrics:
  - Per capita usage calculation
  - Price-to-sales ratio
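As a rough sketch of how the time-based and binary weather features listed above might be derived with pandas: the indicator names and thresholds come from the list, while the input column names (`temp_f`, `tmax_f`, `tmin_f`, `precip_in`, `snow_in`), the sample rows, the use of average temperature for the thresholds, and the meteorological season mapping are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical rows: temperatures in °F, precipitation and snowfall in inches.
df = pd.DataFrame({
    "month": [1, 4, 7, 10],
    "temp_f": [28.0, 55.0, 85.0, 60.0],
    "tmax_f": [35.0, 66.0, 95.0, 70.0],
    "tmin_f": [20.0, 44.0, 74.0, 49.0],
    "precip_in": [0.0, 1.5, 0.2, 0.0],
    "snow_in": [6.0, 0.0, 0.0, 0.0],
})

# Season categorization (meteorological seasons, an assumed convention).
season_by_month = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Fall", 10: "Fall", 11: "Fall",
}
df["season"] = df["month"].map(season_by_month)
df["quarter"] = (df["month"] - 1) // 3 + 1

# Temperature range between the high and the low.
df["temp_range"] = df["tmax_f"] - df["tmin_f"]

# Binary weather indicators using the thresholds listed above.
df["is_high_temp"] = (df["temp_f"] > 80).astype(int)
df["is_low_temp"] = (df["temp_f"] < 32).astype(int)
df["has_precipitation"] = (df["precip_in"] > 0).astype(int)
df["is_high_precipitation"] = (df["precip_in"] > 1).astype(int)
df["has_snowfall"] = (df["snow_in"] > 0).astype(int)
df["is_heavy_snowfall"] = (df["snow_in"] > 5).astype(int)

print(df[["season", "quarter", "is_high_temp", "is_low_temp", "is_heavy_snowfall"]])
```

If the weather source reports precipitation or snowfall in millimeters, the values would need converting to inches before the 1 inch and 5 inch thresholds apply.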
- Customer Distribution:
  - Average: ~3 million customers
  - High standard deviation, indicating substantial variation
  - Range influenced by state population sizes
- Price Analysis:
  - Range: 3.78 to 42.76 cents per kWh
  - Average: ~10 cents per kWh
  - Significant state-specific variations
- Temperature Distribution:
  - Mean: 54°F
  - Range: -3.7°F to 90.25°F
  - Balanced seasonal distribution
- Usage Patterns:
  - Highest usage in summer
  - Lower usage in spring
  - Winter usage varies by region
- Temperature Impact:
  - Strong inverse relationship with usage
  - Peak usage during extreme temperatures
  - Regional variation in temperature sensitivity
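Seasonal patterns and temperature relationships of the kind summarized above can be checked with a grouped mean and a correlation. The sketch below uses made-up numbers and hypothetical column names (`temp_f`, `sales`); it only illustrates the mechanics and does not reproduce the notebook's results.

```python
import pandas as pd

# Hypothetical state-month observations (illustrative values only).
usage = pd.DataFrame({
    "season": ["Winter", "Winter", "Spring", "Spring",
               "Summer", "Summer", "Fall", "Fall"],
    "temp_f": [30.0, 34.0, 55.0, 58.0, 82.0, 86.0, 60.0, 57.0],
    "sales": [5200.0, 5000.0, 4300.0, 4200.0, 6100.0, 6300.0, 4500.0, 4400.0],
})

# Average usage per season, highest first.
seasonal_mean = usage.groupby("season")["sales"].mean().sort_values(ascending=False)

# Pearson correlation between temperature and usage.
corr = usage["temp_f"].corr(usage["sales"])

print(seasonal_mean)
print(f"temperature/usage correlation: {corr:.2f}")
```

A single Pearson coefficient measures only linear association, so when usage peaks at both temperature extremes it can understate the relationship; computing it separately for cold and hot months is one way to see both effects.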
XGBoost was selected for its advantages:
- Handling non-linear relationships
- Robustness to overfitting
- Speed and efficiency
- Feature importance insights
- Hyperparameter tuning flexibility
XGBRegressor(
    n_estimators=200,
    learning_rate=0.03,
    max_depth=5,
    min_child_weight=3,
    subsample=0.9,
    colsample_bytree=0.9,
    reg_alpha=0.1,
    reg_lambda=1,
    random_state=42,
    eval_metric='rmse'
)
Performance varied by sales volume:
- Small amounts (500-1,000): 9.8% error
- Medium amounts (3,000-9,000): 3.6% error
- Large amounts (>10,000): 23.6% error
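A breakdown like the one above can be produced by binning the actual values and averaging the absolute percentage error per bin. In this sketch the data, the column names, and the bin edges are hypothetical; the ranges in the list leave gaps, so contiguous edges are assumed here.

```python
import numpy as np
import pandas as pd

# Hypothetical actual vs. predicted sales (illustrative values only).
results = pd.DataFrame({
    "actual": [800.0, 900.0, 5000.0, 8000.0, 12000.0, 20000.0],
    "predicted": [880.0, 855.0, 5100.0, 7840.0, 15000.0, 17000.0],
})

# Absolute percentage error for each prediction.
results["pct_error"] = (
    (results["predicted"] - results["actual"]).abs() / results["actual"] * 100
)

# Assumed contiguous bins for small, medium, and large sales volumes.
bins = [0, 1000, 10000, np.inf]
labels = ["small", "medium", "large"]
results["volume"] = pd.cut(results["actual"], bins=bins, labels=labels)

# Mean absolute percentage error within each volume bin.
error_by_volume = results.groupby("volume", observed=True)["pct_error"].mean()
print(error_by_volume)
```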
Implemented to handle scale differences:
- Reduced scale difference between small and large values
- Improved model's handling of large values
- Significant RMSE improvement:
- Training: 0.98266 → 0.07672
- Validation: 0.98030 → 0.09716
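One common way to reduce the scale difference between small and large target values is a logarithmic transform. Whether the notebook uses this exact transform is an assumption, since the section does not name it. The sketch below shows how `log1p` compresses a hypothetical range of sales values and how `expm1` maps values back to the original scale.

```python
import numpy as np

# Hypothetical sales values spanning nearly two orders of magnitude.
sales = np.array([600.0, 950.0, 4200.0, 8800.0, 15000.0, 42000.0])

# Compress the scale with log(1 + x), which is also safe at zero.
log_sales = np.log1p(sales)

# Ratio of largest to smallest value, before and after the transform.
raw_spread = sales.max() / sales.min()
log_spread = log_sales.max() / log_sales.min()

# Predictions made in log space are mapped back with the inverse transform.
recovered = np.expm1(log_sales)

print(f"raw spread: {raw_spread:.1f}, log spread: {log_spread:.2f}")
```

An RMSE computed on log-transformed targets is in log units, so it is not directly comparable with an RMSE computed on the raw values.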
Population-based strategy results:
- Small Population States:
  - Average Error: 13.65%
  - Median Error: 10.24%
  - Error Range: 0.06% - 54.96%
- Medium Population States:
  - Average Error: 9.05%
  - Median Error: 7.51%
  - Error Range: 0.06% - 49.22%
- Large Population States:
  - Average Error: 10.58%
  - Median Error: 7.58%
  - Error Range: 0.05% - 51.77%
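Summary statistics like those above could be computed by assigning each state to a population group and aggregating its errors. The sketch below splits states into terciles with `pd.qcut`. The grouping rule, the column names, and all values (including the population figures) are illustrative assumptions rather than the notebook's actual method or data.

```python
import pandas as pd

# Hypothetical per-state errors; populations in millions are illustrative only.
errors = pd.DataFrame({
    "state": ["WY", "VT", "CO", "MN", "TX", "CA"],
    "population": [0.58, 0.65, 5.8, 5.7, 30.0, 39.0],
    "pct_error": [12.0, 16.0, 8.0, 10.0, 9.0, 11.0],
})

# Split states into three equally sized population groups (assumed rule).
errors["pop_group"] = pd.qcut(
    errors["population"], q=3, labels=["small", "medium", "large"]
)

# Average, median, and range of the percentage error per group.
summary = errors.groupby("pop_group", observed=True)["pct_error"].agg(
    ["mean", "median", "min", "max"]
)
print(summary)
```

The same grouping column could also be used to train one model per population group, if that is how the population-based strategy is implemented.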
- Population-Based Strategy:
  - Effective handling of varying state sizes
  - More targeted predictions for different categories
  - Better scale difference handling
- State-Specific Insights:
  - Small states: Highest average error (13.65%) and widest error range
  - Medium states: Balanced performance
  - Large states: Required complex handling
- Seasonal Patterns:
  - Clear summer usage peaks
  - Regional winter variations
  - Strong temperature impact
- Data Structure:
  - Fixed population numbers
  - Limited weather data granularity
  - Simplified seasonal indicators
- Geographic Factors:
  - State-level aggregation limitations
  - Climate zone boundary exclusion
  - Missing regional interconnections
- External Variables:
  - Economic factors not included
  - Policy changes not considered
  - Industrial development not tracked
- Technical Enhancements:
  - City-level predictions
  - Climate zone stratification
  - Region-specific models
- Data Expansion:
  - Industrial usage patterns
  - Demographic trends
  - Policy change indicators
- Analysis Extensions:
  - State clustering analysis
  - Cross-state dependencies
  - Extreme weather impacts
- Clone the repository
- Install required packages:
  pip install pandas numpy xgboost scikit-learn matplotlib seaborn
- Open and run final_project/Electricity_Usage_Prediction.ipynb
- Electricity Dataset: Kaggle
- Weather Data: Meteostat Python Library
- City Information: SimpleMaps U.S. Cities Data
- Data Analysis & Machine Learning:
  - NumPy: Numerical computing and array operations
  - Pandas: Data manipulation and analysis
  - Scikit-learn: Machine learning algorithms and evaluation metrics
  - XGBoost: Gradient boosting implementation
  - Seaborn: Statistical data visualization
  - Matplotlib: Data visualization and plotting
  - Meteostat: Weather data retrieval and processing
- Dashboard & Visualization:
  - Panel: Interactive web application framework
  - Flask: Web application framework
  - Render: Cloud hosting service