SVCE Data Processor

Purpose

The purpose of this algorithm is to modify the SVCE input data such that for each row in the data (each row vector), there are at least N - 1 other rows that are identical. This means that if you group the data by all of its columns (except the index column), each resulting group must be of size at least N. This should be achieved with the smallest possible number of perturbations/changes to the original data.
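
For illustration, a minimal sketch of this constraint in pandas (the threshold value and the name of the index column, ID, are hypothetical):

import pandas as pd

N = 15  # minimum group size (hypothetical value)

def satisfies_threshold(df: pd.DataFrame, n: int = N) -> bool:
    # Group by every column except the index column and check that each group has at least n rows.
    value_cols = [c for c in df.columns if c != 'ID']
    group_sizes = df.groupby(value_cols).size()
    return bool((group_sizes >= n).all())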

Use

The following files are required:

in/svce/ITEM16_2018_tax_assessor_addtl_cols_masked_v2.csv
in/svce/use_codes.csv
in/census/ca_xwalk.csv
in/census/cbg_data_ca.csv
in/tiger/tl_2019_06_tract.shp

To apply the algorithm, first open setup.py to define which columns are being used, and how they are binned.

Then run the files in the following order (once the necessary input data is in place):

  1. 01_prepare.py to add external census data (based on census tracts and block groups), remove rows with missing data, and drop unused columns.
  2. 02_compile.py to bin the data and turn categorical columns into dummy variables (see the sketch after this list), then apply the algorithm. The runtime depends on the number of columns and bins, but the code is optimized.
  3. 03_verify.py to verify that the data generated by 02_compile.py meets the N threshold.
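
For reference, a minimal sketch of the binning and dummy-variable step (a hedged illustration, not the actual 02_compile.py code; the column names, values, and bin count are hypothetical):

import pandas as pd

df = pd.DataFrame({
    'BUILDING_FLOORSPACE_SQFT': [800, 1500, 2200, 3100],
    'USE_CODE': ['res_single', 'res_multi', 'commercial', 'res_single'],
})

# Discretize a continuous column into a fixed number of equal-width bins.
df['BUILDING_FLOORSPACE_SQFT'] = pd.cut(df['BUILDING_FLOORSPACE_SQFT'], bins=3, labels=False)

# Turn a nominal column into 0/1 dummy variables.
df = pd.get_dummies(df, columns=['USE_CODE'])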

Algorithm

A detailed description of the algorithm is available in the Applied Energy conference article.

Setup

See setup.py. For each column to be included in the final data, set:

  • 'log' to 1 (take the logarithm of the data) or 0 (do not transform).
  • 'type' to 0 (continuous data), 1 (binary data), or 2 (nominal data).
  • 'w' to the weight of the corresponding column. A higher weight means that less noise is introduced for that particular column.
  • 'filter' to a lambda function that returns a boolean mask marking the rows that should be filtered out (e.g., value too small, no value, etc.). The input of that lambda function is the pandas Series containing the corresponding column.

If a column is not listed in setup, it will be dropped from the data.

Example:

import pandas as pd

setup = {
    # Continuous column: log-transform, weight 3.0; drop rows where the value is < 1 or missing.
    'A': {'log': 1, 'type': 0, 'w': 3.0, 'filter': lambda v: (v < 1) | pd.isnull(v)},
    # Nominal column: no transform, weight 100.0; drop rows with missing values.
    'B': {'log': 0, 'type': 2, 'w': 100.0, 'filter': lambda v: pd.isnull(v)},
}

Output metrics

Running 02_compile.py will generate a set of output metrics that looks as follows:

                                          Unweighted  Weighted
Fraction of affected rows                      99.90     99.90
Fraction of affected values                    40.02     40.02
Average number of changed values per row        6.40      6.40
Average change relative to scale                5.14      0.47
  COLUMN_1                                      5.39      0.75
  COLUMN_2                                      9.69      2.26
  COLUMN_3                                      6.09      0.28
  COLUMN_4                                      6.30      0.29
  ...

'Fraction of affected rows' is the share of rows in which at least one value has changed. 'Fraction of affected values' is the share of all values that have changed. The 'average number of changed values per row' is the average number of values that have changed in each row. Finally, the 'average change relative to scale' is the average percentage change in values relative to the difference between the largest and smallest existing values of the corresponding column. For example, if column 1 ranges from 5 to 25 and the metric is 5, then on average, column 1 values were changed by +/- 1 ((25 - 5) * 0.05).

The 'Weighted' column shows the same metrics, but with each metric multiplied by the corresponding weight. This only affects the 'average change relative to scale' set of metrics.
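
A hedged sketch of how the unweighted metrics can be computed from the original and perturbed data (assuming two aligned, all-numeric DataFrames named original and modified with identical columns; not the actual 02_compile.py code):

import pandas as pd

def summarize_changes(original: pd.DataFrame, modified: pd.DataFrame) -> dict:
    changed = original != modified                     # boolean mask of changed cells
    scale = original.max() - original.min()            # per-column range (largest - smallest value)
    rel_change = (modified - original).abs() / scale   # change relative to the column's scale
    return {
        'fraction_affected_rows': 100 * changed.any(axis=1).mean(),
        'fraction_affected_values': 100 * changed.to_numpy().mean(),
        'avg_changed_values_per_row': changed.sum(axis=1).mean(),
        # Averaged over all values here; the actual script may average over changed values only.
        'avg_change_relative_to_scale': 100 * rel_change.mean().mean(),
    }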

Current columns

The following columns are currently included in the preprocessing. Note that additional columns, including whether a building has an EV/PHEV and whether it has solar PV and/or electricity storage, are added after the fact (not shown here, but available in the final data):

BUILDING_FLOORSPACE_SQFT        Building floor space in sq ft
PARCEL_UNITS                    Number of units in parcel
YEAR_BUILT                      Year the building was built
EFFECTIVE_YEAR                  Effective year (major renovation)
NUMBER_FLOORS                   Number of floors in building
BUILDING_COVERAGE_RATIO         The ratio between the building footprint area and the area of its land plot.
                                A BCR of 1.0 means the building occupies the entire land (no land available around it)
PARCELFN_COMMERCIAL             Whether a building is commercial or not
PARCELFN_RESIDENTIAL_MULTIPLE   Whether a building is a multi-unit residential building or not
PARCELFN_RESIDENTIAL_SINGLE     Whether a building is a single-unit residential building or not
DENSITY_RESIDENTIAL             The residential population density in the census block group of the building
DENSITY_COMMERCIAL              The commercial (job) density in the census block group of the building
VALUE_LAND                      The (tax assessed) land value of the building parcel
VALUE_BLDG                      The (tax assessed) value of the building 
HAS_AC                          Whether the building has AC or not
HAS_HEAT                        Whether the building has a heating system or not

About

Processing SVCE data to comply with modified "15/15 rule"
