Datasheet for dataset "IncomeSCM-1.0.CATE"

Accompanying "IncomeSCM: From tabular data set to time-series simulator and causal estimation benchmark"

Questions from the Datasheets for Datasets paper, v7.

Jump to section:

Motivation
Composition
Collection process
Preprocessing/cleaning/labeling
Uses
Distribution
Maintenance

Motivation

For what purpose was the dataset created?

We wanted to create a benchmark for causal effect estimators that captures the nuance of real-world data yet is highly configurable to different causal estimation tasks and conditions.

Who created the dataset (e.g., which team, research group) and on behalf of which entity (e.g., company, institution, organization)?

This dataset was created by Fredrik D. Johansson at Chalmers University of Technology.

Who funded the creation of the dataset?

The creation of the dataset was supported in part by the Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut & Alice Wallenberg Foundation.

Any other comments?

Composition

What do the instances that comprise the dataset represent (e.g., documents, photos, people, countries)?

The instances comprise a simulated population of individuals representative of the 1994 US census.

How many instances are there in total (of each type, if appropriate)?

In the 1.0 release, there are 50,000 samples in total.

Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set?

The data set is generated by a simulator and the number of instances can be arbitrarily large or small.

What data does each instance consist of?

Each instance consists of (tabular) observations related to a subject's education, income and demographics.

Is there a label or target associated with each instance?

Each instance is associated with three labels: the observational (factual) outcome of an intervention, and two "potential" outcomes. Each label represents the income of an individual 5 years after an intervention on their studies.
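As a rough illustration of how the two potential outcomes can be used, the sketch below computes per-instance effects and their average from a few made-up rows. The field names (`t`, `y`, `y0`, `y1`) and the numbers are assumptions chosen for the example; they are not taken from the released files.

```python
from statistics import mean

# Hypothetical rows: field names and values are illustrative only.
# t  = intervention indicator, y = factual outcome,
# y0 = potential outcome without the intervention,
# y1 = potential outcome with the intervention.
rows = [
    {"t": 1, "y": 52000.0, "y0": 48000.0, "y1": 52000.0},
    {"t": 0, "y": 31000.0, "y0": 31000.0, "y1": 36500.0},
    {"t": 1, "y": 75000.0, "y0": 74000.0, "y1": 75000.0},
]


def individual_effects(rows):
    """Return the difference between the two potential outcomes per instance."""
    return [r["y1"] - r["y0"] for r in rows]


effects = individual_effects(rows)

# Averaging the per-instance differences gives the average effect
# over this (made-up) sample.
average_effect = mean(effects)
```

Because both potential outcomes are available for every simulated instance, an estimator that only sees the factual outcome during training can be scored against differences like these at evaluation time.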

Is any information missing from individual instances?

No.

Are relationships between individual instances made explicit (e.g., users’ movie ratings, social network links)?

Individual instances are independent; there are no relationships between them.

Are there recommended data splits (e.g., training, development/validation, testing)?

Yes. Samples are generated with one seed for training and another seed for testing.
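The idea of seed-separated splits can be sketched as follows. The `simulate` function here is a stand-in written for this example, not the IncomeSCM simulator's API, and the seed values are hypothetical.

```python
import random


def simulate(n_samples, seed):
    """Stand-in simulator: draw n_samples values from a seeded generator."""
    rng = random.Random(seed)  # a separate generator per call keeps splits independent
    return [rng.gauss(0.0, 1.0) for _ in range(n_samples)]


TRAIN_SEED, TEST_SEED = 0, 1  # hypothetical seed values

train = simulate(5, TRAIN_SEED)
test = simulate(5, TEST_SEED)

# Re-running with the same seed reproduces the same split exactly.
train_again = simulate(5, TRAIN_SEED)
```

Using a distinct seed per split makes each split reproducible on its own while keeping the training and test samples independent draws from the same simulator.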

Are there any errors, sources of noise, or redundancies in the dataset?

The authors are not aware of errors or redundancies.

Is the dataset self-contained, or does it link to or otherwise rely on external resources (e.g., websites, tweets, other datasets)?

The data set is self-contained, but the simulator links to the well-known Adult data set.

Does the dataset contain data that might be considered confidential (e.g., data that is protected by legal privilege or by doctor-patient confidentiality, data that includes the content of individuals’ non-public communications)?

No.

Does the dataset contain data that, if viewed directly, might be offensive, insulting, threatening, or might otherwise cause anxiety?

No.

Does the dataset relate to people?

Yes. It relates to simulated individuals. No new data was collected.

Does the dataset identify any subpopulations (e.g., by age, gender)?

Yes.

Is it possible to identify individuals (i.e., one or more natural persons), either directly or indirectly (i.e., in combination with other data) from the dataset?

No.

Does the dataset contain data that might be considered sensitive in any way (e.g., data that reveals racial or ethnic origins, sexual orientations, religious beliefs, political opinions or union memberships, or locations; financial or health data; biometric or genetic data; forms of government identification, such as social security numbers; criminal history)?

No.

Any other comments?

Collection process

How was the data associated with each instance acquired?

The data used to fit the IncomeSCM simulator was acquired from the UCI Machine Learning repository.

What mechanisms or procedures were used to collect the data (e.g., hardware apparatus or sensor, manual human curation, software program, software API)?

The released data set was simulated in Python.

If the dataset is a sample from a larger set, what was the sampling strategy (e.g., deterministic, probabilistic with specific sampling probabilities)?

N/A. (Probabilistic sampling was used to draw samples from the simulator.)

Who was involved in the data collection process (e.g., students, crowdworkers, contractors) and how were they compensated (e.g., how much were crowdworkers paid)?

N/A

Over what timeframe was the data collected?

The base dataset Adult was collected from the 1994 US Census.

Were any ethical review processes conducted (e.g., by an institutional review board)?

N/A

Does the dataset relate to people?

Yes. It relates to simulated individuals. No new data was collected.

Did you collect the data from the individuals in question directly, or obtain it via third parties or other sources (e.g., websites)?

N/A

Were the individuals in question notified about the data collection?

N/A

Did the individuals in question consent to the collection and use of their data?

N/A

If consent was obtained, were the consenting individuals provided with a mechanism to revoke their consent in the future or for certain uses?

N/A

Has an analysis of the potential impact of the dataset and its use on data subjects (e.g., a data protection impact analysis) been conducted?

N/A

Any other comments?

Preprocessing/cleaning/labeling

Was any preprocessing/cleaning/labeling of the data done (e.g., discretization or bucketing, tokenization, part-of-speech tagging, SIFT feature extraction, removal of instances, processing of missing values)?

The Adult data set was preprocessed before being used to fit the IncomeSCM simulator.

Was the “raw” data saved in addition to the preprocessed/cleaned/labeled data (e.g., to support unanticipated future uses)?

The "raw" data, if applicable, is saved in the Adult data set.

Is the software used to preprocess/clean/label the instances available?

Yes.

Any other comments?

Uses

Has the dataset been used for any tasks already?

Yes, it has been used for benchmarking causal effect estimators.

Is there a repository that links to any or all papers or systems that use the dataset?

Yes, there is a repository that links to our preprint and will link to papers that use the dataset in the future.

What (other) tasks could the dataset be used for?

The IncomeSCM simulator can be used for a variety of causal estimation tasks. The IncomeSCM-1.0.CATE data set release is intended for estimating the causal effect of studies on personal income.

Is there anything about the composition of the dataset or the way it was collected and preprocessed/cleaned/labeled that might impact future uses?

The data set is representative of subjects from the 1994 US Census. It is not intended to generalize to other populations, but to serve as a benchmark task for causal estimation.

Are there tasks for which the dataset should not be used?

None that the authors are aware of.

Any other comments?

Distribution

Will the dataset be distributed to third parties outside of the entity (e.g., company, institution, organization) on behalf of which the dataset was created?

The dataset will be freely available for public download.

How will the dataset be distributed (e.g., tarball on website, API, GitHub)?

Our dataset is hosted under the HealthyAI GitHub organization and on healthyai.se. @TODO: Enter URLs to the data set!

When will the dataset be distributed?

An early version of the dataset has been distributed since June 2023. The IncomeSCM-1.0.CATE release has been available since June 7, 2024.

Will the dataset be distributed under a copyright or other intellectual property (IP) license, and/or under applicable terms of use (ToU)?

The dataset will be distributed under the CC-BY-4.0 license.

Have any third parties imposed IP-based or other restrictions on the data associated with the instances?

No.

Do any export controls or other regulatory restrictions apply to the dataset or to individual instances?

No.

Any other comments?

Maintenance

Who is supporting/hosting/maintaining the dataset?

Fredrik D. Johansson and the Healthy AI lab at Chalmers University of Technology.

How can the owner/curator/manager of the dataset be contacted (e.g., email address)?

The authors can be contacted at fredrik.johansson@chalmers.se.

Is there an erratum?

N/A

Will the dataset be updated (e.g., to correct labeling errors, add new instances, delete instances)?

Yes, in the event that errors are found, the dataset will be uploaded as a new version.

If the dataset relates to people, are there applicable limits on the retention of the data associated with the instances (e.g., were individuals in question told that their data would be retained for a fixed period of time and then deleted)?

N/A

Will older versions of the dataset continue to be supported/hosted/maintained?

Yes, they will be hosted as previous versions.

If others want to extend/augment/build on/contribute to the dataset, is there a mechanism for them to do so?

In the future, we may release new versions of the dataset with additional features or estimation tasks.

Any other comments?

Have fun!
