Master's thesis on differential privacy for the Global COVID-19 Trends and Impact Survey Microdata and Opendata, with a focus on evaluating different synthetic datasets.
This project generates synthetic datasets from the COVID-19 Trends and Impact Survey data using several synthesizing algorithms, such as linear regression, multinomial logistic regression, and random forest. The goal is to evaluate the utility and practicability of the synthetic data for machine learning with tree-based methods.
The COVID-19 Trends and Impact Survey Data used in this project was collected through an online survey that aimed to understand the trends and impact of the COVID-19 pandemic on individuals and society. The survey data includes information on demographics, mental health, work and financial impact, and COVID-19 knowledge and behavior.
The synthetic datasets are generated using the following algorithms:
- Linear Regression (`method="norm"`)
- Linear Regression which maintains the marginal distribution (`method="normrank"`)
- Decision Tree (`method="cart"`)
- Multinomial Logistic Regression (`method="polyreg"`)
- Random Forest (`method="rf"`)
- Random Forest based Bagging algorithm (`method="bag"`)
These algorithms are used to synthesize the survey data and create new, synthetic datasets that can be used for machine learning and analysis.
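To illustrate the difference between the first two entries in the list, here is a minimal Python sketch, not the code used in the thesis. The method names appear to match the `method` argument of `syn()` in the R package synthpop, but that is an assumption, and the sketch does not reproduce that package's implementation. The function names `fit_line`, `synth_norm`, and `synth_normrank`, as well as the toy data, are hypothetical. The plain variant draws synthetic values from a fitted normal linear model. The rank-matching variant then replaces each draw with the observed value of the same rank, so the synthetic column has exactly the observed marginal distribution.

```python
import random
import statistics


def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b, residual_sd)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    a = my - b * mx
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    sd = (sum(r * r for r in resid) / (len(x) - 2)) ** 0.5
    return a, b, sd


def synth_norm(x, y, rng):
    """Draw one synthetic y per row from the fitted normal linear model."""
    a, b, sd = fit_line(x, y)
    return [a + b * xi + rng.gauss(0.0, sd) for xi in x]


def synth_normrank(x, y, rng):
    """Like synth_norm, but map each draw to the observed value of equal rank.

    Simplified illustration: the synthetic column is a permutation of the
    observed column, so its marginal distribution is preserved exactly.
    """
    draws = synth_norm(x, y, rng)
    order = sorted(range(len(draws)), key=draws.__getitem__)
    observed_sorted = sorted(y)
    out = [0.0] * len(draws)
    for rank, idx in enumerate(order):
        out[idx] = observed_sorted[rank]
    return out


# Hypothetical toy data standing in for one predictor and one survey variable.
rng = random.Random(0)
toy_x = [float(i) for i in range(20)]
toy_y = [2.0 + 0.5 * xi + rng.gauss(0.0, 1.0) for xi in toy_x]

toy_norm = synth_norm(toy_x, toy_y, rng)
toy_normrank = synth_normrank(toy_x, toy_y, rng)
```

In a full sequential synthesis the predictors would themselves be synthetic values from earlier steps. The sketch uses the observed predictor only to keep the example short.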
The utility and practicability of the synthetic datasets are evaluated using tree-based methods: decision trees, random forests, and gradient boosting. The evaluation assesses the quality of the synthetic datasets and their potential usefulness for machine learning and analysis.
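One common way to measure this kind of utility is to train a model on synthetic data, train the same model on real data, and compare both on a real hold-out set. Whether the thesis follows exactly this protocol is not stated here, so the sketch below is an assumption. To stay self-contained it uses a depth-1 decision tree (a stump) on a single numeric feature instead of the decision trees, random forests, and gradient boosting named above. The names `fit_stump`, `accuracy`, and `utility_gap` are hypothetical.

```python
def fit_stump(x, y):
    """Fit a depth-1 decision tree on one numeric feature with 0/1 labels.

    Returns (threshold, left_label, right_label), where rows with
    x <= threshold get left_label and all others get right_label.
    """
    best = None
    values = sorted(set(x))
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2
        left = [yi for xi, yi in zip(x, y) if xi <= thr]
        right = [yi for xi, yi in zip(x, y) if xi > thr]
        # Majority vote on each side of the split (ties go to label 1).
        left_label = int(sum(left) * 2 >= len(left))
        right_label = int(sum(right) * 2 >= len(right))
        errors = sum(yi != left_label for yi in left)
        errors += sum(yi != right_label for yi in right)
        if best is None or errors < best[0]:
            best = (errors, thr, left_label, right_label)
    if best is None:
        raise ValueError("need at least two distinct feature values")
    return best[1:]


def accuracy(stump, x, y):
    """Share of rows in (x, y) that the stump classifies correctly."""
    thr, left_label, right_label = stump
    preds = [left_label if xi <= thr else right_label for xi in x]
    return sum(p == yi for p, yi in zip(preds, y)) / len(y)


def utility_gap(real_train, synth_train, real_test):
    """Accuracy on real test data: model trained on real minus trained on synthetic.

    Each argument is a pair (x, y). A gap near zero suggests the synthetic
    data supports this model about as well as the real data does.
    """
    acc_real = accuracy(fit_stump(*real_train), *real_test)
    acc_synth = accuracy(fit_stump(*synth_train), *real_test)
    return acc_real - acc_synth
```

A call such as `utility_gap((x_real, y_real), (x_syn, y_syn), (x_test, y_test))` would be repeated for each synthetic dataset, which gives one comparable number per synthesizing method.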
To be continued...