Reorganise and improve the data catalog documentation (#2888)
* First drop of newly organised data catalog docs
* linter
* Added to-do notes
* Afternoon's work in rewriting/reorganising content
* More changes
* Further changes
* Another chunk of changes
* Final changes
* Revise ordering of pages
* Add new CLI commands to dataset factory docs (#2935)
  * Add changes from #2930
  * Lint
  * Apply suggestions from code review
  * Make code snippets collapsable
* Bunch of changes from feedback
* A few more tweaks
* Update h1,h2,h3 font sizes
* Add code snippet for using DataCatalog with Kedro config
* Few more tweaks
* Update docs/source/data/data_catalog.md
* Upgrade kedro-datasets for docs
* Improve prose

Signed-off-by: Jo Stichbury <jo_stichbury@mckinsey.com>
Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>
Signed-off-by: Tynan DeBold <thdebold@gmail.com>
Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>
Signed-off-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>
Co-authored-by: Ahdra Merali <90615669+AhdraMeraliQB@users.noreply.github.com>
Co-authored-by: Tynan DeBold <thdebold@gmail.com>
Co-authored-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>
Co-authored-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>
1 parent 16dd1df · commit c45e629 · 24 changed files with 1,187 additions and 1,028 deletions.
# Advanced: Access the Data Catalog in code

You can define a Data Catalog in two ways. Most use cases are covered by a YAML configuration file, as [illustrated previously](./data_catalog.md), but it is also possible to use the Data Catalog programmatically through [`kedro.io.DataCatalog`](/kedro.io.DataCatalog), using an API that allows you to configure data sources in code and use the IO module within notebooks.
## How to configure the Data Catalog

To use the `DataCatalog` API, construct a `DataCatalog` object programmatically in a file such as `catalog.py`.

In the following code, we use several pre-built data loaders documented in the [API reference documentation](/kedro_datasets).
```python
from kedro.io import DataCatalog
from kedro_datasets.pandas import (
    CSVDataSet,
    SQLTableDataSet,
    SQLQueryDataSet,
    ParquetDataSet,
)

io = DataCatalog(
    {
        "bikes": CSVDataSet(filepath="../data/01_raw/bikes.csv"),
        "cars": CSVDataSet(filepath="../data/01_raw/cars.csv", load_args=dict(sep=",")),
        "cars_table": SQLTableDataSet(
            table_name="cars", credentials=dict(con="sqlite:///kedro.db")
        ),
        "scooters_query": SQLQueryDataSet(
            sql="select * from cars where gear=4",
            credentials=dict(con="sqlite:///kedro.db"),
        ),
        "ranked": ParquetDataSet(filepath="ranked.parquet"),
    }
)
```
When using `SQLTableDataSet` or `SQLQueryDataSet` you must provide a `con` key containing a [SQLAlchemy compatible](https://docs.sqlalchemy.org/en/13/core/engines.html#database-urls) database connection string. In the example above, we pass it as part of the `credentials` argument. An alternative to `credentials` is to put `con` into `load_args` and `save_args` (`SQLTableDataSet` only).
## How to view the available data sources

To review the data sources registered in the `DataCatalog`:

```python
io.list()
```
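For the catalog constructed above, here is a quick sketch of what this returns; `list` also accepts an optional regular expression to filter the names, and the pattern below is illustrative:

```python
print(io.list())
# ['bikes', 'cars', 'cars_table', 'scooters_query', 'ranked']

# Narrow the results with a regular expression, for example names starting with "cars"
print(io.list("^cars"))
# ['cars', 'cars_table']
```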
## How to load datasets programmatically

To access each dataset by its name:

```python
cars = io.load("cars")  # data is now loaded as a DataFrame in 'cars'
gear = cars["gear"].values
```
The following steps happened behind the scenes when `load` was called (see the sketch after this list):

- The value `cars` was located in the Data Catalog
- The corresponding `AbstractDataSet` object was retrieved
- The `load` method of this dataset was called
- This `load` method delegated the loading to the underlying pandas `read_csv` function
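For illustration, here is a rough sketch of those steps performed by hand, reusing the `cars` configuration from the catalog above:

```python
from kedro_datasets.pandas import CSVDataSet

# The catalog looks up "cars" and retrieves the corresponding dataset object;
# here we construct the equivalent dataset directly
cars_dataset = CSVDataSet(filepath="../data/01_raw/cars.csv", load_args=dict(sep=","))

# Calling `load` delegates to pandas.read_csv and returns a DataFrame
cars = cars_dataset.load()
```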
## How to save data programmatically

```{warning}
This pattern is not recommended unless you are using platform notebook environments (Sagemaker, Databricks, etc.) or writing unit/integration tests for your Kedro pipeline. Prefer the YAML approach instead.
```
### How to save data to memory

To save data using an API similar to that used to load data:

```python
from kedro.io import MemoryDataSet

memory = MemoryDataSet(data=None)
io.add("cars_cache", memory)
io.save("cars_cache", "Memory can store anything.")
io.load("cars_cache")
```
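If you have several raw values to register at once, `add_feed_dict` offers a shortcut that wraps each value in a `MemoryDataSet` for you. A minimal sketch, with illustrative dataset names:

```python
# Each plain Python object is wrapped in a MemoryDataSet and added to the catalog
io.add_feed_dict(
    {
        "cars_cache": "Memory can store anything.",
        "numbers_cache": [1, 2, 3],
    },
    replace=True,  # overwrite entries of the same name if they already exist
)

io.load("numbers_cache")  # returns [1, 2, 3]
```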
### How to save data to a SQL database for querying

To put the data in a SQLite database:

```python
import os

# This cleans up the database in case it exists at this point
try:
    os.remove("kedro.db")
except FileNotFoundError:
    pass

io.save("cars_table", cars)

# query the scooters and keep the brand and mpg columns
ranked = io.load("scooters_query")[["brand", "mpg"]]
```
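To confirm the write worked before querying, you can ask the catalog whether the underlying data exists. A small sketch reusing the `cars_table` entry defined earlier:

```python
# Check that the table was created in the SQLite database, then read it back
assert io.exists("cars_table")
cars_from_db = io.load("cars_table")
```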
### How to save data in Parquet

To save the processed data in Parquet format:

```python
io.save("ranked", ranked)
```

```{warning}
Saving `None` to a dataset is not allowed!
```
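For example, attempting to save `None` raises a `DataSetError`. A minimal sketch:

```python
from kedro.io import DataSetError

try:
    io.save("ranked", None)
except DataSetError as exc:
    print(exc)  # saving None to a dataset is not allowed
```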
## How to access a dataset with credentials

Before instantiating the `DataCatalog`, Kedro will first attempt to read [the credentials from the project configuration](../configuration/credentials.md). The resulting dictionary is then passed into `DataCatalog.from_config()` as the `credentials` argument.

Let's assume that the project contains the file `conf/local/credentials.yml` with the following contents:

```yaml
dev_s3:
  client_kwargs:
    aws_access_key_id: key
    aws_secret_access_key: secret

scooters_credentials:
  con: sqlite:///kedro.db

my_gcp_credentials:
  id_token: key
```

Your code will look as follows:

```python
CSVDataSet(
    filepath="s3://test_bucket/data/02_intermediate/company/motorbikes.csv",
    load_args=dict(sep=",", skiprows=5, skipfooter=1, na_values=["#NA", "NA"]),
    credentials=dict(key="token", secret="key"),
)
```
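For completeness, here is a minimal sketch of passing such a credentials dictionary to `DataCatalog.from_config()` yourself. The dataset name `motorbikes` and the inline dictionaries are illustrative; in a project, both would be read from the YAML configuration files:

```python
from kedro.io import DataCatalog

catalog_config = {
    "motorbikes": {
        "type": "pandas.CSVDataSet",
        "filepath": "s3://test_bucket/data/02_intermediate/company/motorbikes.csv",
        "credentials": "dev_s3",  # refers to an entry in the credentials dictionary
    }
}
credentials = {
    "dev_s3": {
        "client_kwargs": {
            "aws_access_key_id": "key",
            "aws_secret_access_key": "secret",
        }
    }
}

io = DataCatalog.from_config(catalog_config, credentials)
```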
## How to version a dataset using the Code API

In an earlier section of the documentation, we described how [Kedro enables dataset and ML model versioning](./data_catalog.md#dataset-versioning).

If you require programmatic control over load and save versions of a specific dataset, you can instantiate `Version` and pass it as a parameter to the dataset initialisation:
```python
from kedro.io import DataCatalog, Version
from kedro_datasets.pandas import CSVDataSet
import pandas as pd

data1 = pd.DataFrame({"col1": [1, 2], "col2": [4, 5], "col3": [5, 6]})
data2 = pd.DataFrame({"col1": [7], "col2": [8], "col3": [9]})
version = Version(
    load=None,  # load the latest available version
    save=None,  # generate save version automatically on each save operation
)

test_data_set = CSVDataSet(
    filepath="data/01_raw/test.csv", save_args={"index": False}, version=version
)
io = DataCatalog({"test_data_set": test_data_set})

# save the dataset to data/01_raw/test.csv/<version>/test.csv
io.save("test_data_set", data1)
# save the dataset again to a new file data/01_raw/test.csv/<new version>/test.csv
io.save("test_data_set", data2)

# load the latest version from data/01_raw/test.csv/*/test.csv
reloaded = io.load("test_data_set")
assert data2.equals(reloaded)
```
In the example above, we do not fix any versions. The behaviour of load and save operations becomes slightly different when we set a version:
```python
version = Version(
    load="my_exact_version",  # load exact version
    save="my_exact_version",  # save to exact version
)

test_data_set = CSVDataSet(
    filepath="data/01_raw/test.csv", save_args={"index": False}, version=version
)
io = DataCatalog({"test_data_set": test_data_set})

# save the dataset to data/01_raw/test.csv/my_exact_version/test.csv
io.save("test_data_set", data1)
# load from data/01_raw/test.csv/my_exact_version/test.csv
reloaded = io.load("test_data_set")
assert data1.equals(reloaded)

# raises DataSetError since the path
# data/01_raw/test.csv/my_exact_version/test.csv already exists
io.save("test_data_set", data2)
```
We do not recommend passing exact load or save versions, since it can lead to inconsistencies between operations. For example, if the versions for load and save operations do not match, a save operation would result in a `UserWarning`.
Imagine a simple pipeline with two nodes, where node B takes the output from node A. If you specify the load version of the data for B to be `my_data_2023_08_16.csv`, the data that A produces (`my_data_20230818.csv`) is not used.

```text
Node A -> my_data_20230818.csv
my_data_2023_08_16.csv -> Node B
```
In code:

```python
version = Version(
    load="my_data_2023_08_16.csv",  # load exact version
    save="my_data_20230818.csv",  # save to exact version
)

test_data_set = CSVDataSet(
    filepath="data/01_raw/test.csv", save_args={"index": False}, version=version
)
io = DataCatalog({"test_data_set": test_data_set})

io.save("test_data_set", data1)  # emits a UserWarning due to version inconsistency

# raises DataSetError since the file
# data/01_raw/test.csv/my_data_2023_08_16.csv/test.csv does not exist
reloaded = io.load("test_data_set")
```