- The project focuses on analyzing the effect of infrastructure modifications from tall building units/skyscrapers on climate change.
- The project includes a study of the energy usage by the building. Counterfactual models are required to measure the energy consumption done depending on four energy types. The consumption is based on historic usage rates and observed weather.
- It can be then predicted how far will it affect and result in climate change in later years. A goal is to develop a predictive model for climage change based on energy consumption from skyscrapers.
-The project has a data of over one thousand building's hourly meter reading for three years.
The data can be found here
We have undergone these steps as a part of CRISP-DM:
- Research Phase
- Data Understanding Phase
- Data Preparation Phase
- Data Exploration
- Modeling Phase
- Evaluation Phase
Buildings and construction account for more than one-third of global energy consumption. They are also responsible for 40% of the world's total CO2 emissions. These numbers continue to rise, as developing countries begin to access previously untapped energy sources. Additionally, as countries start enjoying economic stability, the average size of buildings also increases consequently increasing the overall energy demand.
To counter this disturbing trend, nations and companies are investing in energy and cost-efficient technologies to better manage the energy needs of the building. However, the impact of these measures has not been accurately estimated. Existing models do not consider data such as the type of meter or assume that all buildings are of the same type. Therefore, we propose a solution that includes these features and models the energy usage of building before and after implementing improvements.
The dataset has data from four files train.csv, building_meta.csv, weather_[train/test].csv, and sample_submission.csv
Each file has details regarding the specific topic.
- building_id - Foreign key for the building metadata.
- meter - The meter id code. Read as {0: electricity, 1: chilledwater, 2: steam, 3: hotwater}. Not every building has all meter types.
- timestamp - When the measurement was taken
- meter_reading - The target variable. Energy consumption in kWh (or equivalent). Note that this is real data with measurement error, which we expect will impose a baseline level of - modeling error.
- site_id - Foreign key for the weather files.
- building_id - Foreign key for training.csv
- primary_use - Indicator of the primary category of activities for the building based on EnergyStar property type definitions
- square_feet - Gross floor area of the building
- year_built - Year building was opened
- floor_count - Number of floors of the building
Weather data from a meteorological station as close as possible to the site.
- site_id - Primary Key
- air_temperature - Degrees Celsius
- cloud_coverage - Portion of the sky covered in clouds, in oktas
- dew_temperature - Degrees Celsius
- precip_depth_1_hr - Millimeters
- sea_level_pressure - Millibar/hectopascals
- wind_direction - Compass direction (0-360)
- wind_speed - Meters per second
The submission files use row numbers for ID codes in order to save space on the file uploads. test.csv has no feature data; it exists so you can get your predictions into the correct order.
- row_id - Row id for your submission file
- building_id - Building id code
- meter - The meter id code
- timestamp - Timestamps for the test data period
- Delving into data
- Examining important interrelationships between attributes
- Identifying interesting subsets of the observations
- Develop an initial idea of possible associations amongst the predictors, as well as between the predictors and the target variable.
There are no features related to people or location. Though the site_id is given we can't calculate the location from the given information. From our perspective we do not see any discrimination or bias in our data.
- Due to compute restrictions we had to cut the size of dataset from ~ 20 million to ~ 10 million. We did this by dropping the random selection of 10 million rows so as to avoid the bias.
- We scaled the data to normalize the relations between dependent and independent variables.
- We encoded categorical variables using one-hot encoding and label-encoding.
- Due to the large amount of missing values in the "year_built", "floor_count", "builiding_age" columns in the building metadata table, we dropped the columns.
- Dropped columns like "building_id", "site_id" which are not necessary.
- Instead of having the whole time stamp we just selected the hour value and stored.
Since, our target variable is continuous we had to implement a regression algorithm. In our research we found that decision tree is the best algorithm for this problem. Additionaly, since our data set is large we needed an algorithm that dealt with large amounts of data easily. So, we turned to "LightGBM".
- Light GBM is a gradient boosting framework that uses tree based learning algorithm.
- Light GBM is prefixed as ‘Light’ because of its high speed. It can handle the large size of data and takes lower memory to run. Also, the algorithm focuses on accuracy of results and GPU learning.
Since our problem is a regression problem we decided to use Root Mean Square Error for model Evaluation. Evaluation is done based on RMSE score. We acheived an RMSE score of 74.
Our model can be used to improve the energy efficiency of the buildings which can be used by Government bodies, contractors and Environmental watch dog organizations. Our model needs more improvements, but it is a good starting ponit.
- Create four seperate models to predict each meters readings.
- Experiment with fastai and Neural Networks.
- Tune hyper parameters to imrove RMSE score.
- This project is based on python 3.7.
- This file requires Jupyter Notebooks or Google Colab to run.
- Packages to be installed: Numpy, Pandas, Sklearn, Matplotlib, Seaborn, LightGBM, tqdm.
Team Members:
- Ravina Gaikawad
- Kevin Thomas
- Sahithi Priya Gutta
- Uma Sai Madhuri Jetty
- https://www.kaggle.com/c/ashrae-energy-prediction/data
- https://medium.com/@pushkarmandot/https-medium-com-pushkarmandot-what-is-lightgbm-how-to-implement-it-how-to-fine-tune-the-parameters-60347819b7fc