The purpose of this algorithm is to modify the SVCE input data so that, for each row (each row vector), at least N - 1 other rows are identical to it. In other words, if you group the data by all of its columns (except the index column), every resulting group must contain at least N rows. This should be achieved with as few perturbations (changes) to the original data as possible.
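The group-size requirement can be checked with a few lines of pandas. The following is a minimal sketch, not the code used in this repository; the function names and the toy data are made up for illustration, and it assumes the index column is the DataFrame index rather than a regular column.

```python
import pandas as pd


def smallest_group_size(df: pd.DataFrame) -> int:
    """Return the size of the smallest group of identical rows."""
    # Group by every column (the DataFrame index is not a column, so it is ignored).
    # dropna=False keeps rows that contain missing values in the grouping.
    sizes = df.groupby(list(df.columns), dropna=False).size()
    return int(sizes.min())


def meets_threshold(df: pd.DataFrame, n: int) -> bool:
    """True if every row has at least n - 1 identical companions."""
    return smallest_group_size(df) >= n


# Toy data: one group of two identical rows and one group of three.
toy = pd.DataFrame({"A": [1, 1, 2, 2, 2], "B": [0, 0, 1, 1, 1]})
print(meets_threshold(toy, 2))
print(meets_threshold(toy, 3))
```

With this toy data the smallest group has two rows, so the check passes for N = 2 and fails for N = 3.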
The following files are required:
in/svce/ITEM16_2018_tax_assessor_addtl_cols_masked_v2.csv
in/svce/use_codes.csv
in/census/ca_xwalk.csv
in/census/cbg_data_ca.csv
in/tiger/tl_2019_06_tract.shp
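Before running anything, it may help to confirm that these files are in place. The helper below is a hypothetical convenience script, not part of the repository; it only checks the five paths listed above, relative to a root directory you pass in.

```python
from pathlib import Path

REQUIRED_FILES = [
    "in/svce/ITEM16_2018_tax_assessor_addtl_cols_masked_v2.csv",
    "in/svce/use_codes.csv",
    "in/census/ca_xwalk.csv",
    "in/census/cbg_data_ca.csv",
    "in/tiger/tl_2019_06_tract.shp",
]


def missing_inputs(root="."):
    """Return the required paths that do not exist under root."""
    base = Path(root)
    return [p for p in REQUIRED_FILES if not (base / p).is_file()]


if __name__ == "__main__":
    missing = missing_inputs()
    if missing:
        print("Missing input files:")
        for path in missing:
            print("  " + path)
    else:
        print("All required input files are present.")
```

Note that a shapefile is normally accompanied by sidecar files (such as .shx and .dbf) next to the .shp file; this sketch does not check for those.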
To apply the algorithm, first open setup.py to define which columns are used and how they are binned.
Then run the files in the following order (once the necessary input data is in place):
- 01_prepare.py to add external census data based on block tracts, remove rows with missing data, and drop unused columns.
- 02_compile.py to bin the data, turn categorical columns into dummy variables, and apply the algorithm. The runtime depends on the number of columns and bins, but the code is optimized.
- 03_verify.py to verify that the data generated by 02_compile.py meets the N threshold.
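To illustrate what binning and dummy encoding mean here, the sketch below uses standard pandas calls on toy data. The bin edges and the USE column are hypothetical, and this is not necessarily how 02_compile.py implements these steps.

```python
import pandas as pd

df = pd.DataFrame({
    "YEAR_BUILT": [1952, 1978, 1999, 2011],
    "USE": ["single", "multi", "single", "commercial"],  # hypothetical nominal column
})

# Hypothetical bin edges, chosen only for this example.
edges = [1900, 1960, 1990, 2020]
# Replace each value with the integer index of the bin it falls into.
df["YEAR_BUILT"] = pd.cut(df["YEAR_BUILT"], bins=edges, labels=False)

# Expand the nominal column into one 0/1 indicator column per category.
df = pd.get_dummies(df, columns=["USE"], prefix="USE", dtype=int)
print(df)
```

Coarser bins make it more likely that rows already coincide, which is presumably why the number of columns and bins drives the runtime.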
A detailed description of the algorithm is available in the Applied Energy conference article.
See setup.py. For each column to be included in the final data, set:
- 'log' to 1 (take logarithm of data) or 0 (do not transform).
- 'type' to 0 (continuous data), 1 (binary data), or 2 (nominal data).
- 'w' to the weight of the corresponding column. A higher weight means that less noise is introduced for that particular column.
- 'filter' to a lambda function that returns True for those rows that should be filtered out (e.g., value too small, no value, etc.). The input of that lambda function is the pandas Series containing the corresponding column.
If a column is not listed in setup, it will be dropped from the data.
Example:
import pandas as pd

setup = {
    'A': {'log': 1, 'type': 0, 'w': 3.0, 'filter': lambda v: (v < 1) | pd.isnull(v)},
    'B': {'log': 0, 'type': 2, 'w': 100.0, 'filter': lambda v: pd.isnull(v)},
}
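As a rough illustration of how such a configuration could be consumed, the sketch below drops unlisted columns, removes the rows flagged by any filter, and log-transforms the columns with 'log' set to 1. The function apply_setup and the toy data are hypothetical, the natural logarithm is an assumption, and the 'type' and 'w' entries are not used here.

```python
import numpy as np
import pandas as pd

setup = {
    "A": {"log": 1, "type": 0, "w": 3.0, "filter": lambda v: (v < 1) | pd.isnull(v)},
    "B": {"log": 0, "type": 2, "w": 100.0, "filter": lambda v: pd.isnull(v)},
}


def apply_setup(df, setup):
    """Drop unlisted columns, remove filtered rows, then log-transform."""
    out = df[list(setup)].copy()  # columns not listed in setup are dropped
    drop = pd.Series(False, index=out.index)
    for name, cfg in setup.items():
        drop |= cfg["filter"](out[name])  # True marks a row to remove
    out = out.loc[~drop].copy()
    for name, cfg in setup.items():
        if cfg["log"] == 1:
            out[name] = np.log(out[name])  # natural log (assumed base)
    return out


raw = pd.DataFrame({
    "A": [0.5, 10.0, 100.0, None],
    "B": ["x", "y", None, "z"],
    "C": [1, 2, 3, 4],  # not listed in setup, so it is dropped
})
print(apply_setup(raw, setup))
```

In this toy input only the second row survives: the first and last rows fail the filter for 'A', and the third has no value for 'B'.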
Running 02_compile.py will generate a set of output metrics that looks as follows:
Metric                                      Unweighted  Weighted
Fraction of affected rows                        99.90     99.90
Fraction of affected values                      40.02     40.02
Average number of changed values per row          6.40      6.40
Average change relative to scale                  5.14      0.47
  COLUMN_1                                        5.39      0.75
  COLUMN_2                                        9.69      2.26
  COLUMN_3                                        6.09      0.28
  COLUMN_4                                        6.30      0.29
  ...
'Fraction of affected rows' is the share of rows in which at least one value has changed. 'Fraction of affected values' is the share of all values that have changed. 'Average number of changed values per row' is the average number of values that have changed in each row. Finally, 'average change relative to scale' is the average change in values, expressed as a percentage of the difference between the largest and smallest existing values of the corresponding column. For example, if column 1 ranges from 5 to 25 and the metric is 5, then column 1 values were changed by +/- 1 on average ((25 - 5) * 0.05).
The 'weighted' column shows the same metrics, but with each metric multiplied by the corresponding weight. This only affects the 'average change relative to scale' set of metrics.
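The unweighted metrics can be reproduced from the original and the modified data along the following lines. This is a sketch under assumptions: the function name is hypothetical, both frames are numeric with matching shape and no missing values, and the average change is taken over all rows (changed or not). The weighted variant is left out because the exact weighting scheme is not spelled out here.

```python
import pandas as pd


def change_metrics(original, modified):
    """Unweighted change metrics, in percent unless noted otherwise."""
    changed = original.ne(modified)  # True where a value differs
    # Per-column range (max - min) of the original values.
    scale = original.max() - original.min()
    # Mean absolute change per column as a percentage of that range.
    rel = (modified - original).abs().mean() / scale * 100
    return {
        "affected_rows": changed.any(axis=1).mean() * 100,
        "affected_values": changed.to_numpy().mean() * 100,
        "changed_per_row": changed.sum(axis=1).mean(),  # a count, not a percentage
        "relative_change": rel,  # one entry per column
    }


# Column X ranges from 5 to 25 and every value moves by 1.
original = pd.DataFrame({"X": [5.0, 10.0, 25.0, 20.0], "Y": [0.0, 1.0, 1.0, 0.0]})
modified = pd.DataFrame({"X": [6.0, 9.0, 24.0, 21.0], "Y": [0.0, 1.0, 1.0, 0.0]})
m = change_metrics(original, modified)
print(m["relative_change"]["X"])
```

The toy column X mirrors the example above: a range of 20 and an average absolute change of 1 give a relative change of about 5 percent.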
The following columns are currently included in the preprocessing. Note that additional columns, including whether a building has an EV/PHEV or not and whether a building has solar PV and/or electricity storage, are added after the fact (not shown here, but available in final data):
- BUILDING_FLOORSPACE_SQFT: Building floor space in sq ft
- PARCEL_UNITS: Number of units in parcel
- YEAR_BUILT: Year the building was built
- EFFECTIVE_YEAR: Effective year (major renovation)
- NUMBER_FLOORS: Number of floors in building
- BUILDING_COVERAGE_RATIO: The ratio between the building footprint area and the area of its land plot. A BCR of 1.0 means the building occupies the entire land (no land available around it)
- PARCELFN_COMMERCIAL: Whether a building is commercial or not
- PARCELFN_RESIDENTIAL_MULTIPLE: Whether a building is a multi-unit residential building or not
- PARCELFN_RESIDENTIAL_SINGLE: Whether a building is a single-unit residential building or not
- DENSITY_RESIDENTIAL: The residential population density in the census block group of the building
- DENSITY_COMMERCIAL: The commercial (job) density in the census block group of the building
- VALUE_LAND: The (tax assessed) land value of the building parcel
- VALUE_BLDG: The (tax assessed) value of the building
- HAS_AC: Whether the building has AC or not
- HAS_HEAT: Whether the building has a heating system or not