Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kedro-datasets: Datasets accept non-primitive parameters in the __init__ #950

Open
ElenaKhaustova opened this issue Nov 27, 2024 · 6 comments
Labels
datasets enhancement New feature or request

Comments

@ElenaKhaustova
Copy link
Contributor

ElenaKhaustova commented Nov 27, 2024

Description

Catalog to config solution is based on extracting dataset __init__ parameters and their values passed when creating a dataset.

The solution does not cover the case when non-primitive parameters are accepted in the dataset constructor:

Possible Implementation

(Preferable solution) Force datasets to only have static, primitive parameters in the __init__: kedro-org/kedro#4329 (comment)

Possible Alternatives

Extend parent AbstractDataset.to_config() at the dataset level to serialize those objects.

@astrojuanlu
Copy link
Member

In [16]: catalog = KedroDataCatalog()

In [17]: catalog.add_feed_dict({"ds": GBQQueryDataset("SELECT 1", "test_table", credentials=Credentials("abc"))})

In [18]: catalog.list()
Out[18]: ['ds']

In [19]: catalog.to_config()
Out[19]: ({}, {}, {}, None)

Should I have seen an error here @ElenaKhaustova ?

@astrojuanlu
Copy link
Member

It does happen with a dataset only instantiated with primitive properties though 🤔

In [20]: catalog = KedroDataCatalog()

In [21]: catalog.add_feed_dict({"ds": GBQQueryDataset("SELECT 1", "test_table")})

In [22]: catalog.list()
Out[22]: ['ds']

In [23]: catalog.to_config()
Out[23]: ({}, {}, {}, None)

@ElenaKhaustova
Copy link
Contributor Author

.add_feed_dict({"ds": GBQQueryDataset("SELECT 1", "test_table", credentials=Credentials("abc"))})

I'm not sure what credentials you passed here cause Credentials is an abstract.

Image

But even if GBQQueryDataset()object is instantiated properly you won't see it in the serialization result because you used add_feed_dict method to add it to the catalog. This method adds everything as a MemoryDataset which is skipped during the serialization.

It works as expected in case you add dataset to the catalog via __setitem__ and dataset only instantiated with primitive properties:

Image

But in case non-primitives are used, no error occurs as we do not dump the result dictionary and objects remains as they are - not serialized:

Image

@ElenaKhaustova
Copy link
Contributor Author

So the error should not occur until dumping from the result dictionary. We do not validate it now but we have a note in catalog.to_config() docstrings:

This method is only applicable to catalogs that contain datasets initialized with static, primitive
parameters. For example, it will work fine if one passes credentials as dictionary to
`GBQQueryDataset` but not as `google.auth.credentials.Credentials` object. See
https://github.com/kedro-org/kedro-plugins/issues/950 for the details.

@astrojuanlu
Copy link
Member

I'm not sure what credentials you passed here cause Credentials is an abstract.

Took from google.oauth2.credentials import Credentials

you won't see it in the serialization result because you used add_feed_dict method

Hm I missed that KedroDataCatalog.add_feed_dict behaves differently than DataCatalog.add_feed_dict 🤔

So the error should not occur until dumping from the result dictionary

Indeed, looks like the problem would, in any case, going from dictionary to file.

Not sure we should do anything here.

@ElenaKhaustova
Copy link
Contributor Author

Not sure we should do anything here.

I agree, but since the docstring refers to this ticket, we can keep it open until people start using the new catalog in case it becomes a common problem.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
datasets enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants