Cannot use HMASynthesizer.fit_processed_data more than once (KeyError) #1240

Closed
npatki opened this issue Feb 7, 2023 · 0 comments
Labels
bug Something isn't working
npatki commented Feb 7, 2023

Environment Details

  • SDV version: 1.0.0 (in progress)
  • Python version: Any
  • Operating System: Any

Error Description

It should be possible to fit a synthesizer multiple times. Each time, the synthesizer should completely reset and refit on the data that is provided. Currently, calling fit_processed_data a second time on the same synthesizer raises a KeyError.

Interestingly, calling fit multiple times works fine.

Steps to reproduce

import pandas as pd

from sdv.metadata import MultiTableMetadata
from sdv.multi_table import HMASynthesizer

parent_table = pd.DataFrame(data={
    'id': [1, 2, 3, 4, 5],
    'column': [1.2, 2.1, 2.2, 2.1, 1.4]
})

child_table = pd.DataFrame(data={
    'id': [1, 2, 3, 4, 5],
    'parent_id': [1, 1, 3, 2, 1],
    'low_column': [1, 3, 3, 1, 2],
    'high_column': [2, 4, 5, 2, 4]
})

data = {
    'parent_table': parent_table,
    'child_table': child_table
}

metadata = MultiTableMetadata()
metadata.detect_table_from_dataframe(table_name='parent_table', data=parent_table)
metadata.detect_table_from_dataframe(table_name='child_table', data=child_table)

metadata.set_primary_key(table_name='parent_table', column_name='id')
metadata.set_primary_key(table_name='child_table', column_name='id')

metadata.add_relationship(
    parent_table_name='parent_table',
    child_table_name='child_table',
    parent_primary_key='id',
    child_foreign_key='parent_id'
)

synthesizer = HMASynthesizer(metadata)
synthesizer.auto_assign_transformers(data)
processed_data = synthesizer.preprocess(data)
synthesizer.fit_processed_data(processed_data)
synthesizer.fit_processed_data(processed_data)

Output

KeyError: '__child_table__parent_id__num_rows'

Full Stack Trace

--------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/usr/local/lib/python3.8/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3360             try:
-> 3361                 return self._engine.get_loc(casted_key)
   3362             except KeyError as err:

8 frames
/usr/local/lib/python3.8/dist-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

/usr/local/lib/python3.8/dist-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: '__child_table__parent_id__num_rows'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
<ipython-input-1-c92fa585102b> in <module>
     39 processed_data = synthesizer.preprocess(data)
     40 synthesizer.fit_processed_data(processed_data)
---> 41 synthesizer.fit_processed_data(processed_data)

/usr/local/lib/python3.8/dist-packages/sdv/multi_table/base.py in fit_processed_data(self, processed_data)
    309                 Dictionary mapping each table name to a preprocessed ``pandas.DataFrame``.
    310         """
--> 311         self._fit(processed_data)
    312         self._fitted = True
    313         self._fitted_date = datetime.datetime.today().strftime('%Y-%m-%d')

/usr/local/lib/python3.8/dist-packages/sdv/multi_table/hma.py in _fit(self, processed_data)
    207         for table_name in processed_data:
    208             if not parent_map.get(table_name):
--> 209                 self._model_table(table_name, processed_data)
    210 
    211         LOGGER.info('Modeling Complete')

/usr/local/lib/python3.8/dist-packages/sdv/multi_table/hma.py in _model_table(self, table_name, tables)
    180         self._table_sizes[table_name] = len(table)
    181 
--> 182         table = self._extend_table(table, tables, table_name)
    183         keys = self._pop_foreign_keys(table, table_name)
    184         self._clear_nans(table)

/usr/local/lib/python3.8/dist-packages/sdv/multi_table/hma.py in _extend_table(self, table, tables, table_name)
    126                 table = table.merge(extension, how='left', right_index=True, left_index=True)
    127                 num_rows_key = f'__{child_name}__{foreign_key}__num_rows'
--> 128                 table[num_rows_key] = table[num_rows_key].fillna(0)
    129                 self._max_child_rows[num_rows_key] = table[num_rows_key].max()
    130 

/usr/local/lib/python3.8/dist-packages/pandas/core/frame.py in __getitem__(self, key)
   3456             if self.columns.nlevels > 1:
   3457                 return self._getitem_multilevel(key)
-> 3458             indexer = self.columns.get_loc(key)
   3459             if is_integer(indexer):
   3460                 indexer = [indexer]

/usr/local/lib/python3.8/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3361                 return self._engine.get_loc(casted_key)
   3362             except KeyError as err:
-> 3363                 raise KeyError(key) from err
   3364 
   3365         if is_scalar(key) and isna(key) and not self.hasnans:

KeyError: '__child_table__parent_id__num_rows'
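
Possible workaround

The traceback shows the second call failing inside _extend_table when it looks up the '__child_table__parent_id__num_rows' column. One plausible explanation, which this report does not confirm, is that the first fit_processed_data call modifies the tables in processed_data in place, so the second call receives altered tables. That would also be consistent with fit working repeatedly, since it starts from the raw data each time. Under that assumption, passing a fresh deep copy on every call may avoid the error. The helper name fit_on_copy below is hypothetical and not part of SDV; it only relies on the fit_processed_data method used in the reproduction steps.

```python
import copy


def fit_on_copy(synthesizer, processed_data):
    """Fit on a deep copy so the caller's tables are never modified.

    Hypothetical helper (not an SDV API). It assumes the crash comes from
    the first fit changing ``processed_data`` in place, which is unconfirmed.
    """
    # deepcopy duplicates the dict and every table inside it, so any
    # in-place changes made during fitting stay on the throwaway copy.
    synthesizer.fit_processed_data(copy.deepcopy(processed_data))
```

With the objects from the reproduction steps, the last two lines would become fit_on_copy(synthesizer, processed_data), called twice. If the error persists with copies, the cause is more likely state kept on the synthesizer itself between fits.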