ValueError when sampling PII columns #1445

Closed
npatki opened this issue May 30, 2023 · 0 comments · Fixed by #1452
Labels: bug (Something isn't working), feature:sampling (Related to generating synthetic data after a model is built)


npatki commented May 30, 2023

Environment Details

  • SDV version: 1.1.0

Description

For PII columns only, it should be OK if the input pandas dtype differs from the output pandas dtype. For example, the input data may be all zeroed out (since it's sensitive). But I expect the synthetic data to have strings, based on the sdtype that I have selected in the metadata (and ultimately the Faker that is used).

Steps to reproduce

import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

data = pd.DataFrame(data={
    'id': [0, 1, 2, 3, 4],
    'address': [0, 0, 0, 0, 0],
    'numerical': [0.234234, 0.123213, 0.123123, 0.123123, 0.1345435]
})

metadata = SingleTableMetadata.load_from_dict({
    'primary_key': 'id',
    'columns': {
        'id': { 'sdtype': 'id' },
        'address': { 'sdtype': 'address' },
        'numerical': { 'sdtype': 'numerical' }
    }
})

synth = GaussianCopulaSynthesizer(metadata)
synth.fit(data)
synth.sample(10)

Output:

ValueError: Error: Sampling terminated. Partial results are stored in a temporary file: .sample.csv.temp. This file will be overridden the next time you sample. Please rename the file if you wish to save these results.
invalid literal for int() with base 10: 'PSC 4734, Box 5790\nAPO AA 75326'

Additional Context

For PII and ID columns only, we can wrap the cast back to the original dtype in a try/except. If the cast fails, just return the data without casting. Log an INFO message when this happens.

logger.INFO: The real data in 'column_name' was stored as 'int' but the synthetic data could not be cast back to this type. If this is a problem, please check your input data and metadata settings.
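
The fallback described above could look roughly like the sketch below. It is only an illustration: `cast_back_or_keep` is a hypothetical helper name, not part of SDV, and the sketch does not show where SDV actually performs the cast. It uses only `pandas.Series.astype` and the standard `logging` module.

```python
import logging

import pandas as pd

logger = logging.getLogger(__name__)


def cast_back_or_keep(column, original_dtype, column_name):
    """Hypothetical helper: try to restore the original dtype.

    If the cast fails, return the column unchanged and log an INFO message.
    """
    try:
        return column.astype(original_dtype)
    except (ValueError, TypeError):
        logger.info(
            "The real data in '%s' was stored as '%s' but the synthetic data "
            "could not be cast back to this type. If this is a problem, "
            "please check your input data and metadata settings.",
            column_name,
            original_dtype,
        )
        return column


# Faker-style addresses cannot be cast to int, so they are returned as-is.
synthetic_address = pd.Series(['PSC 4734, Box 5790\nAPO AA 75326', '123 Main St'])
result = cast_back_or_keep(synthetic_address, 'int64', 'address')

# Values that can be cast are still converted to the original dtype.
castable = cast_back_or_keep(pd.Series(['1', '2']), 'int64', 'id')
```

With this shape, only the columns whose sdtype is PII or ID would be routed through the helper; other columns would keep the strict cast so that real dtype mismatches still surface as errors.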
npatki added the bug and feature:sampling labels on May 30, 2023.