Move default dataset creation responsibilities from runners to the KedroDataCatalog #4475

Open
ElenaKhaustova opened this issue Feb 10, 2025 · 0 comments
Labels
Component: IO (Issue/PR addresses data loading/saving/versioning and validation, the DataCatalog and DataSets)
Component: Runners
Issue: Feature Request (New feature or improvement to existing feature)

Comments

@ElenaKhaustova
Contributor

Description

The suggestion is to move default dataset creation responsibilities from runners to the KedroDataCatalog. This is part of the runners refactoring to decouple runners and catalog.

Context

Currently, we set the default dataset patterns for each runner separately:

default_dataset_pattern = {"{default}": {"type": "MemoryDataset"}}
,
default_dataset_pattern = {"{default}": {"type": "MemoryDataset"}}

Then, in AbstractRunner.run(), we add these patterns to the catalog before the run and remove them after execution. As a result, at execution time all intermediate datasets not explicitly set in the catalog are treated as MemoryDatasets.

catalog = catalog.shallow_copy(
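To make the current flow concrete, here is a minimal, self-contained sketch of the idea. It is not Kedro's actual code: `ToyCatalog`, `ToyRunner`, `with_extra_patterns` and `resolve_type` are hypothetical names used only for illustration, and pattern matching is reduced to a single catch-all `{default}` entry.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCatalog:
    """Hypothetical stand-in for a catalog; NOT Kedro's API."""

    datasets: dict = field(default_factory=dict)
    patterns: dict = field(default_factory=dict)

    def with_extra_patterns(self, extra: dict) -> "ToyCatalog":
        # Return a copy so the caller's catalog is left untouched after the run.
        return ToyCatalog(dict(self.datasets), {**self.patterns, **extra})

    def resolve_type(self, name: str) -> str:
        # Explicit entries win; otherwise fall back to the catch-all pattern.
        if name in self.datasets:
            return self.datasets[name]["type"]
        if "{default}" in self.patterns:
            return self.patterns["{default}"]["type"]
        raise KeyError(name)


class ToyRunner:
    """Hypothetical runner that owns the default pattern, as described above."""

    default_dataset_pattern = {"{default}": {"type": "MemoryDataset"}}

    def run(self, catalog: ToyCatalog, dataset_names: list) -> dict:
        # The default pattern only exists on the run-time copy of the catalog.
        run_catalog = catalog.with_extra_patterns(self.default_dataset_pattern)
        return {name: run_catalog.resolve_type(name) for name in dataset_names}


catalog = ToyCatalog(datasets={"raw": {"type": "CSVDataset"}})
resolved = ToyRunner().run(catalog, ["raw", "intermediate"])
print(resolved)
```

In this sketch the runner, not the catalog, decides what an undeclared dataset becomes, which is the coupling this issue proposes to remove.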

We can't just add this pattern as the catalog default, because there would then be no mechanism to tell which datasets are actually defined in the catalog. Another difficulty is that the catalog is not aware of how it is used by external objects: it doesn't distinguish runtime from other usage. Adding such awareness would still keep the runner-catalog coupling, only moved to the catalog side.
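The first problem can be illustrated with a small toy model. This is a sketch under assumptions, not Kedro's implementation: `ToyCatalog` is hypothetical, and its membership check is assumed to treat a name as present when it is either declared explicitly or matched by a pattern.

```python
class ToyCatalog:
    """Hypothetical catalog; NOT Kedro's API."""

    def __init__(self, datasets: dict, patterns: dict = None):
        self._datasets = dict(datasets)
        self._patterns = dict(patterns or {})

    def __contains__(self, name: str) -> bool:
        # Assumption: a name counts as "in the catalog" if it is declared
        # explicitly or if a catch-all pattern would match it.
        return name in self._datasets or "{default}" in self._patterns


explicit_only = ToyCatalog({"raw": {"type": "CSVDataset"}})
with_default = ToyCatalog(
    {"raw": {"type": "CSVDataset"}},
    {"{default}": {"type": "MemoryDataset"}},
)

# A misspelled name is rejected when only explicit entries exist...
typo_without_default = "rwa" in explicit_only
# ...but is silently accepted once the catch-all pattern is a catalog default.
typo_with_default = "rwa" in with_default
```

With the catch-all always present, every name appears to be in the catalog, so a membership check alone can no longer separate declared datasets from implicit in-memory ones.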

Some exploration is needed first to decide how the above problems can be solved at once.

@ElenaKhaustova ElenaKhaustova added the Issue: Feature Request New feature or improvement to existing feature label Feb 10, 2025
@ElenaKhaustova ElenaKhaustova added this to the Kedro 1.0.0 milestone Feb 10, 2025
@ElenaKhaustova ElenaKhaustova added Component: IO Issue/PR addresses data loading/saving/versioning and validation, the DataCatalog and DataSets Component: Runners labels Feb 10, 2025