Add BaseIndependentSampler Mixin #1395

Closed
amontanez24 opened this issue Apr 27, 2023 · 0 comments · Fixed by #1423
Labels
feature:sampling Related to generating synthetic data after a model is built internal The issue doesn't change the API or functionality
Contributor

amontanez24 commented Apr 27, 2023

Problem Description

There is currently no unified abstraction for multi-table synthesizers. This has led to diverging implementations for HMASynthesizer and HSASynthesizer. It also causes the following problems:

  • It is not easy to add new multi-table models: Over the lifetime of DataCebo, we expect to create many different multi-table synthesizers.
  • It is not easy to fix existing bugs: If a general bug affects modeling and sampling for all multi-table synthesizers, we should be able to fix it in only one place.
  • It is not easy to add features common to all synthesizers: For example, it may be nice to include some verbosity (progress bars, logger info, etc.) during fitting and sampling for all synthesizers.

As a solution, we propose creating mixins that unify multi-table sampling based on two strategies:

  1. Hierarchical
  2. Independent

This issue focuses on the Independent strategy.

Expected behavior

Add a new mixin called BaseIndependentSampler to the sdv.sampling module.

Methods

These methods should all have an implementation in the base mixin itself.

  • _sample(self, scale=1.0)
    • Args
      • scale (float): Factor by which to scale the number of rows sampled for each table.
    • Returns: A dictionary containing as keys the names of the tables and as values the sampled data tables as pandas.DataFrame.
    • pseudo-code
    sampled_data = {}
    for table in tables:
        num_rows = get_num_rows(table, scale)
        synthesizer = self._synthesizers[table]
        self._sample_table(synthesizer, table, num_rows, sampled_data)
    self._connect_tables(sampled_data)
    self._finalize(sampled_data)
    return sampled_data
  • _sample_table(self, synthesizer, table_name, num_rows, sampled_data)
    • Args
      • synthesizer (SingleTableSynthesizer): Synthesizer to sample from
      • table_name (string): Name of the table to sample
      • num_rows (int): Number of rows to sample.
      • sampled_data (dict): Dictionary of data sampled so far.
    • pseudo-code
    data = synthesizer._sample(num_rows)
    data = reverse_transform(data)  # finalize the table here
    sampled_data[table_name] = data
  • _connect_tables(self, sampled_data)
    • Args
      • sampled_data (dict): Dictionary of data sampled so far.
    • pseudo-code
    queue = get_root_tables()
    while queue:
        parent = queue.pop()
        for child in self.metadata.get_children(parent):
            self._add_foreign_key_column(sampled_data[child], sampled_data[parent], child, parent)
            if has_all_foreign_keys(child):
                queue.append(child)
  • _finalize(self, sampled_data)
    • Args
      • sampled_data (dict): Dictionary of data sampled so far.
    • pseudo-code
    for table, data in sampled_data.items():
        remove_added_columns(data)
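
The four methods above can be pictured together in a minimal, dependency-free sketch. This is not the SDV implementation: plain lists of dicts stand in for pandas.DataFrame, and the names IndependentSamplerSketch, ToySynthesizer, _table_sizes, _children, _parents, and the temporary '_tmp' column are all hypothetical, as is the round-robin key assignment in the toy subclass.

```python
class IndependentSamplerSketch:
    """Illustrative mixin: sample every table first, then connect them."""

    def _sample(self, scale=1.0):
        sampled_data = {}
        for table, synthesizer in self._synthesizers.items():
            # Assumption: row counts are the fitted sizes multiplied by `scale`.
            num_rows = max(1, round(self._table_sizes[table] * scale))
            self._sample_table(synthesizer, table, num_rows, sampled_data)
        self._connect_tables(sampled_data)
        self._finalize(sampled_data)
        return sampled_data

    def _sample_table(self, synthesizer, table_name, num_rows, sampled_data):
        # Here a "synthesizer" is just a callable returning `num_rows` rows.
        sampled_data[table_name] = synthesizer(num_rows)

    def _connect_tables(self, sampled_data):
        # Start from the tables that have no parents.
        queue = [t for t in sampled_data if not self._parents.get(t)]
        connected = {t: set() for t in sampled_data}
        while queue:
            parent = queue.pop(0)
            for child in self._children.get(parent, []):
                self._add_foreign_key_column(
                    sampled_data[child], sampled_data[parent], child, parent)
                connected[child].add(parent)
                # Enqueue a child only once all of its parents are linked.
                if connected[child] == set(self._parents[child]):
                    queue.append(child)

    def _finalize(self, sampled_data):
        # Drop the hypothetical helper column added during sampling.
        for rows in sampled_data.values():
            for row in rows:
                row.pop('_tmp', None)

    def _add_foreign_key_column(self, child_rows, parent_rows, child_name, parent_name):
        raise NotImplementedError


class ToySynthesizer(IndependentSamplerSketch):
    def __init__(self):
        self._table_sizes = {'users': 2, 'orders': 4}
        self._children = {'users': ['orders']}
        self._parents = {'orders': ['users']}
        make_rows = lambda n: [{'id': i, '_tmp': 0} for i in range(n)]
        self._synthesizers = {'users': make_rows, 'orders': make_rows}

    def _add_foreign_key_column(self, child_rows, parent_rows, child_name, parent_name):
        # Round-robin assignment, chosen only because it is deterministic.
        for i, row in enumerate(child_rows):
            row[f'{parent_name}_id'] = parent_rows[i % len(parent_rows)]['id']


data = ToySynthesizer()._sample(scale=2.0)
```

With scale=2.0 the toy schema yields twice as many rows per table as were "fitted", and every row in the child table carries a key that exists in the parent table.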

Abstract methods

These methods do not need to be implemented here, but any class that uses this mixin must implement them.

  • _add_foreign_key_column(self, child_table_rows, parent_table_rows, child_name, parent_name): This method should add the foreign key column that connects the child table to the parent table. It can use whatever logic it needs to choose which parent key each child row receives.
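
As one possible illustration of the abstract method, the sketch below assigns each child row a parent key drawn uniformly at random. It is written as a standalone function over lists of dicts rather than pandas.DataFrame objects, and the primary_key and rng parameters and the '<parent>_<key>' column naming are assumptions made for the example, not part of the proposal.

```python
import random


def add_foreign_key_column(child_rows, parent_rows, child_name, parent_name,
                           primary_key='id', rng=None):
    """Assign each child row a parent key drawn uniformly at random."""
    rng = rng or random.Random()
    parent_keys = [row[primary_key] for row in parent_rows]
    if not parent_keys:
        raise ValueError(f'No rows in {parent_name!r} to link {child_name!r} to.')
    foreign_key = f'{parent_name}_{primary_key}'  # hypothetical naming scheme
    for row in child_rows:
        row[foreign_key] = rng.choice(parent_keys)


users = [{'id': 10}, {'id': 11}]
orders = [{'id': i} for i in range(5)]
add_foreign_key_column(orders, users, 'orders', 'users', rng=random.Random(0))
```

A real synthesizer would likely use a more informed rule here, for example one that preserves the distribution of children per parent. Uniform choice is only the simplest option that yields valid references.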

Additional context

  • This mixin changes the sampling strategy that HSA currently uses in the following way:
  1. The current HSA sampling is depth-first: it connects children as it samples them. The new strategy samples all tables first and then connects them.
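
The difference in ordering can be seen in a toy trace. The sketch below only records the order of "sample" and "connect" steps for a hypothetical three-table chain. It does not reproduce how HSA is implemented, and the CHILDREN mapping and both function names are made up for the example.

```python
# Hypothetical schema: users -> orders -> items
CHILDREN = {'users': ['orders'], 'orders': ['items'], 'items': []}


def depth_first_order(root):
    """Sample a table, then immediately sample and connect each child."""
    events = [f'sample {root}']

    def visit(parent):
        for child in CHILDREN[parent]:
            events.append(f'sample {child}')
            events.append(f'connect {child} -> {parent}')
            visit(child)

    visit(root)
    return events


def sample_then_connect_order(root):
    """Sample every table first, then connect children to parents."""
    tables, queue = [], [root]
    while queue:
        table = queue.pop(0)
        tables.append(table)
        queue.extend(CHILDREN[table])
    events = [f'sample {t}' for t in tables]
    events += [f'connect {c} -> {p}' for p in tables for c in CHILDREN[p]]
    return events


depth_first = depth_first_order('users')
independent = sample_then_connect_order('users')
```

In the depth-first trace the sampling and connecting steps interleave, whereas in the second trace all sampling steps come before any connecting step.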
amontanez24 added the internal and feature:sampling labels on Apr 27, 2023
amontanez24 added this to the 1.2.0 milestone on Jun 6, 2023