Add BaseIndependentSampler Mixin #1395

amontanez24 · 2023-04-27T01:57:56Z

Problem Description

There is currently no unified abstraction for multi-table synthesizers. This has led to diverging implementations for HMASynthesizer and HSASynthesizer, and it causes the following issues:

- It is not easy to add new multi-table models: Over the lifetime of DataCebo, we expect to create many different multi-table synthesizers.
- It is not easy to fix existing bugs: If a general bug affects modeling and sampling for all multi-table synthesizers, we should be able to make the fix in only one place.
- It is not easy to add common features across all synthesizers: For example, it may be nice to include some verbosity (progress bars, logger info, etc.) during fitting and sampling for all synthesizers.

As a solution, we propose creating mixins that unify multi-table sampling based on two strategies:

- Hierarchical
- Independent

This issue focuses on the Independent strategy.

Expected behavior

Add a new mixin called BaseIndependentSampler to the sdv.sampling module.

Methods

These methods should all have an implementation in the base mixin itself.

_sample(self, scale=1.0)

- Args
  - scale (float): A float representing how much to scale the data by.
- Returns: A dictionary containing as keys the names of the tables and as values the sampled data tables as pandas.DataFrame.
- pseudo-code

sampled_data = {}
for table in tables:
    num_rows = get_num_rows(table, scale)
    synthesizer = self._synthesizers[table]
    self._sample_table(synthesizer, table, num_rows, sampled_data)
self._connect_tables(sampled_data)
self._finalize(sampled_data)
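
For illustration, the flow above could look roughly like this in plain Python. This is a minimal sketch, not the SDV implementation: `ToySynthesizer`, `SketchIndependentSampler`, `ToyMultiTableSynthesizer`, the `_table_sizes` attribute, and the rule `num_rows = round(size * scale)` are all assumptions made for the example, and lists of dicts stand in for pandas.DataFrame.

```python
class ToySynthesizer:
    """Hypothetical stand-in for a fitted single-table synthesizer."""

    def __init__(self, columns):
        self.columns = columns

    def sample(self, num_rows):
        # Deterministic placeholder rows so the sketch is easy to follow.
        return [{col: f"{col}_{i}" for col in self.columns} for i in range(num_rows)]


class SketchIndependentSampler:
    """Illustrative mixin: sample every table first, then connect and finalize."""

    def _sample(self, scale=1.0):
        sampled_data = {}
        for table_name, synthesizer in self._synthesizers.items():
            # Assumption: row counts scale relative to the sizes seen during fitting.
            num_rows = max(1, round(self._table_sizes[table_name] * scale))
            sampled_data[table_name] = synthesizer.sample(num_rows)
        self._connect_tables(sampled_data)
        self._finalize(sampled_data)
        return sampled_data

    def _connect_tables(self, sampled_data):
        pass  # No-op here; a real sampler would add foreign key columns.

    def _finalize(self, sampled_data):
        pass  # No-op here; a real sampler would drop helper columns.


class ToyMultiTableSynthesizer(SketchIndependentSampler):
    def __init__(self):
        self._synthesizers = {
            "users": ToySynthesizer(["name"]),
            "sessions": ToySynthesizer(["device"]),
        }
        self._table_sizes = {"users": 4, "sessions": 10}


sampled = ToyMultiTableSynthesizer()._sample(scale=0.5)
```

With `scale=0.5`, each toy table is sampled at half of its assumed fitted size, and no table needs to know about any other table until the connect step.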

_sample_table(self, synthesizer, table_name, num_rows, sampled_data=None)
- Args
  - synthesizer (SingleTableSynthesizer): Synthesizer to sample from
  - table_name (string): Name of the table to sample
  - num_rows (int): Number of rows to sample.
  - sampled_data (dict): Dictionary of data sampled so far.
- pseudo-code
```
data = synthesizer._sample(num_rows)
data = reverse_transform(data)  # finalize the table here
sampled_data[table_name] = data
```

_connect_tables(self, sampled_data)

- Args
  - sampled_data (dict): Dictionary of data sampled so far.
- pseudo-code

queue = get_root_tables()
while queue:
    parent = queue.pop()
    for child in self.metadata.get_children(parent):
        self._add_foreign_key_column(sampled_data[child], sampled_data[parent], child, parent)
        if has_all_foreign_keys(child):
            queue.append(child)
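
As a rough illustration of the traversal above, here is a self-contained sketch in plain Python. It is not the SDV code: `connect_tables`, `round_robin_foreign_key`, the `(parent, child)` relationship tuples, and the `<parent_name>_id` column naming are all hypothetical, and lists of dicts stand in for pandas.DataFrame. The "has all foreign keys" check is modeled as a counter of relationships that have not been connected yet.

```python
from collections import deque


def connect_tables(sampled_data, relationships, add_foreign_key_column):
    """Walk the schema from the root tables down, adding one foreign key per relationship."""
    children = {}
    pending_parents = {table: 0 for table in sampled_data}
    for parent, child in relationships:
        children.setdefault(parent, []).append(child)
        pending_parents[child] += 1

    # Root tables have no parents, so they need no foreign keys.
    queue = deque(table for table, count in pending_parents.items() if count == 0)
    visited_order = []
    while queue:
        parent = queue.popleft()
        visited_order.append(parent)
        for child in children.get(parent, []):
            add_foreign_key_column(sampled_data[child], sampled_data[parent], child, parent)
            pending_parents[child] -= 1
            # A child is queued as a parent only once all of its own foreign keys exist.
            if pending_parents[child] == 0:
                queue.append(child)
    return visited_order


def round_robin_foreign_key(child_rows, parent_rows, child_name, parent_name):
    # Hypothetical strategy: cycle through the parent ids in order.
    for index, row in enumerate(child_rows):
        row[f"{parent_name}_id"] = parent_rows[index % len(parent_rows)]["id"]


tables = {
    "users": [{"id": 0}, {"id": 1}],
    "sessions": [{"id": 0}, {"id": 1}, {"id": 2}],
    "transactions": [{"id": 0}, {"id": 1}, {"id": 2}, {"id": 3}],
}
relationships = [
    ("users", "sessions"),
    ("users", "transactions"),
    ("sessions", "transactions"),
]
order = connect_tables(tables, relationships, round_robin_foreign_key)
```

In this toy schema, `transactions` has two parents, so it is only queued after both `users` and `sessions` have been processed.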

_finalize(self, sampled_data)
- Args
  - sampled_data (dict): Dictionary of data sampled so far.
- pseudo-code
```
for table, data in sampled_data.items():
    remove_added_columns(data)
```
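
One possible reading of `remove_added_columns`, sketched in plain Python: keep only the columns that each table is expected to have and drop anything else that was added during sampling. The `finalize` function, the `expected_columns` mapping, and the `__num_children` helper column are hypothetical names used only for this example, and lists of dicts stand in for pandas.DataFrame.

```python
def finalize(sampled_data, expected_columns):
    """Drop, in place, every column that is not listed for its table."""
    for table_name, rows in sampled_data.items():
        keep = set(expected_columns[table_name])
        for row in rows:
            # Copy the keys first so the dict can be modified while looping.
            for column in list(row):
                if column not in keep:
                    del row[column]
    return sampled_data


tables = {"sessions": [{"id": 0, "users_id": 1, "__num_children": 3}]}
finalized = finalize(tables, {"sessions": ["id", "users_id"]})
```

Foreign key columns survive this step as long as they are listed among the expected columns for the table.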

Abstract methods

These methods do not need to be implemented here but any class that uses this mixin must implement them.

_add_foreign_key_column(self, child_table_rows, parent_table_rows, child_name, parent_name): This method should add a column to the child table for the foreign key that connects the child to the parent. It can use whatever logic it needs to decide which parent row each child row references.
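
To make the contract concrete, here is a minimal sketch of how a class using the mixin might supply this method. Everything except the method name and its arguments is an assumption for illustration: `SketchSamplerMixin`, `RandomForeignKeySampler`, the `id` primary key column, the `<parent_name>_id` foreign key naming, and the uniform random assignment are not taken from SDV. Lists of dicts stand in for pandas.DataFrame.

```python
import random


class SketchSamplerMixin:
    """Illustrative base: declares the hook but leaves the strategy to subclasses."""

    def _add_foreign_key_column(self, child_table_rows, parent_table_rows,
                                child_name, parent_name):
        raise NotImplementedError()


class RandomForeignKeySampler(SketchSamplerMixin):
    """Hypothetical subclass that assigns each child row a random parent."""

    def __init__(self, seed=0):
        self._rng = random.Random(seed)

    def _add_foreign_key_column(self, child_table_rows, parent_table_rows,
                                child_name, parent_name):
        # Assumption: the parent primary key is "id" and the new
        # foreign key column is named "<parent_name>_id".
        parent_ids = [row["id"] for row in parent_table_rows]
        for row in child_table_rows:
            row[f"{parent_name}_id"] = self._rng.choice(parent_ids)


users = [{"id": 10}, {"id": 11}]
sessions = [{"id": 0}, {"id": 1}, {"id": 2}]
RandomForeignKeySampler(seed=0)._add_foreign_key_column(sessions, users, "sessions", "users")
```

Raising `NotImplementedError` in the base is one simple way to mark the method as required. A different synthesizer could plug in another strategy here without touching the shared traversal logic.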

Additional context

This mixin changes the current sampling strategy that HSA uses in the following way:

- Current HSA sampling is depth-first (it connects children as it samples them). The new strategy samples all tables first and then connects them.


amontanez24 added the internal and feature:sampling labels Apr 27, 2023

frances-h mentioned this issue May 12, 2023

Add BaseIndependentSampler mixin #1423

Merged

frances-h closed this as completed in #1423 Jun 1, 2023

amontanez24 assigned frances-h Jun 6, 2023

amontanez24 added this to the 1.2.0 milestone Jun 6, 2023
