Reorganise and improve the data catalog documentation (#2888)
* First drop of newly organised data catalog docs
* linter
* Added to-do notes
* Afternoon's work in rewriting/reorganising content
* More changes
* Further changes
* Another chunk of changes
* Final changes
* Revise ordering of pages
* Add new CLI commands to dataset factory docs (#2935)
  * Add changes from #2930
  * Lint
  * Apply suggestions from code review
  * Make code snippets collapsable
* Bunch of changes from feedback
* A few more tweaks
* Update h1,h2,h3 font sizes
* Add code snippet for using DataCatalog with Kedro config
* Few more tweaks
* Update docs/source/data/data_catalog.md
* Upgrade kedro-datasets for docs
* Improve prose

Signed-off-by: Jo Stichbury <jo_stichbury@mckinsey.com>
Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>
Signed-off-by: Tynan DeBold <thdebold@gmail.com>
Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>
Signed-off-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>
Co-authored-by: Ahdra Merali <90615669+AhdraMeraliQB@users.noreply.github.com>
Co-authored-by: Tynan DeBold <thdebold@gmail.com>
Co-authored-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>
Co-authored-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>
1 parent 16dd1df · commit c45e629 · 24 changed files with 1,187 additions and 1,028 deletions.
# Advanced: Access the Data Catalog in code

You can define a Data Catalog in two ways. Most use cases are covered by a YAML configuration file, as [illustrated previously](./data_catalog.md), but it is also possible to use the Data Catalog programmatically through [`kedro.io.DataCatalog`](/kedro.io.DataCatalog), using an API that allows you to configure data sources in code and use the IO module within notebooks.
## How to configure the Data Catalog

To use the `DataCatalog` API, construct a `DataCatalog` object programmatically in a file such as `catalog.py`.

In the following code, we use several pre-built data loaders documented in the [API reference documentation](/kedro_datasets).
```python
from kedro.io import DataCatalog
from kedro_datasets.pandas import (
    CSVDataSet,
    SQLTableDataSet,
    SQLQueryDataSet,
    ParquetDataSet,
)

io = DataCatalog(
    {
        "bikes": CSVDataSet(filepath="../data/01_raw/bikes.csv"),
        "cars": CSVDataSet(filepath="../data/01_raw/cars.csv", load_args=dict(sep=",")),
        "cars_table": SQLTableDataSet(
            table_name="cars", credentials=dict(con="sqlite:///kedro.db")
        ),
        "scooters_query": SQLQueryDataSet(
            sql="select * from cars where gear=4",
            credentials=dict(con="sqlite:///kedro.db"),
        ),
        "ranked": ParquetDataSet(filepath="ranked.parquet"),
    }
)
```
When using `SQLTableDataSet` or `SQLQueryDataSet` you must provide a `con` key containing a [SQLAlchemy compatible](https://docs.sqlalchemy.org/en/13/core/engines.html#database-urls) database connection string. In the example above, we pass it as part of the `credentials` argument. An alternative to `credentials` is to put `con` into `load_args` and `save_args` (`SQLTableDataSet` only).
## How to view the available data sources

To review the data sources registered in the `DataCatalog`:

```python
io.list()
```
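For the catalog constructed above, here is a quick sketch of what this returns; `list` also accepts an optional regular expression to filter the names, and the pattern below is illustrative:

```python
print(io.list())
# ['bikes', 'cars', 'cars_table', 'scooters_query', 'ranked']

# Narrow the results with a regular expression, for example names starting with "cars"
print(io.list("^cars"))
# ['cars', 'cars_table']
```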
## How to load datasets programmatically

To access each dataset by its name:

```python
cars = io.load("cars")  # data is now loaded as a DataFrame in 'cars'
gear = cars["gear"].values
```
The following steps happened behind the scenes when `load` was called (see the sketch after this list):

- The value `cars` was located in the Data Catalog
- The corresponding `AbstractDataSet` object was retrieved
- The `load` method of this dataset was called
- This `load` method delegated the loading to the underlying pandas `read_csv` function
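For illustration, here is a rough sketch of those steps performed by hand, reusing the `cars` configuration from the catalog above:

```python
from kedro_datasets.pandas import CSVDataSet

# The catalog looks up "cars" and retrieves the corresponding dataset object;
# here we construct the equivalent dataset directly
cars_dataset = CSVDataSet(filepath="../data/01_raw/cars.csv", load_args=dict(sep=","))

# Calling `load` delegates to pandas.read_csv and returns a DataFrame
cars = cars_dataset.load()
```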
## How to save data programmatically

```{warning}
This pattern is not recommended unless you are using platform notebook environments (Sagemaker, Databricks, etc.) or writing unit/integration tests for your Kedro pipeline. Prefer the YAML approach instead.
```
### How to save data to memory

To save data using an API similar to that used to load data:

```python
from kedro.io import MemoryDataSet

memory = MemoryDataSet(data=None)
io.add("cars_cache", memory)
io.save("cars_cache", "Memory can store anything.")
io.load("cars_cache")
```
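If you have several raw values to register at once, `add_feed_dict` offers a shortcut that wraps each value in a `MemoryDataSet` for you. A minimal sketch, with illustrative dataset names:

```python
# Each plain Python object is wrapped in a MemoryDataSet and added to the catalog
io.add_feed_dict(
    {
        "cars_cache": "Memory can store anything.",
        "numbers_cache": [1, 2, 3],
    },
    replace=True,  # overwrite entries of the same name if they already exist
)

io.load("numbers_cache")  # returns [1, 2, 3]
```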
### How to save data to a SQL database for querying

To put the data in a SQLite database:

```python
import os

# This cleans up the database in case it exists at this point
try:
    os.remove("kedro.db")
except FileNotFoundError:
    pass

io.save("cars_table", cars)

# query the scooters and keep the brand and mpg columns
ranked = io.load("scooters_query")[["brand", "mpg"]]
```
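To confirm the write worked before querying, you can ask the catalog whether the underlying data exists. A small sketch reusing the `cars_table` entry defined earlier:

```python
# Check that the table was created in the SQLite database, then read it back
assert io.exists("cars_table")
cars_from_db = io.load("cars_table")
```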
### How to save data in Parquet

To save the processed data in Parquet format:

```python
io.save("ranked", ranked)
```

```{warning}
Saving `None` to a dataset is not allowed!
```
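For example, attempting to save `None` raises a `DataSetError`. A minimal sketch:

```python
from kedro.io import DataSetError

try:
    io.save("ranked", None)
except DataSetError as exc:
    print(exc)  # saving None to a dataset is not allowed
```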
## How to access a dataset with credentials

Before instantiating the `DataCatalog`, Kedro will first attempt to read [the credentials from the project configuration](../configuration/credentials.md). The resulting dictionary is then passed into `DataCatalog.from_config()` as the `credentials` argument.

Let's assume that the project contains the file `conf/local/credentials.yml` with the following contents:

```yaml
dev_s3:
  client_kwargs:
    aws_access_key_id: key
    aws_secret_access_key: secret

scooters_credentials:
  con: sqlite:///kedro.db

my_gcp_credentials:
  id_token: key
```

Your code will look as follows:

```python
CSVDataSet(
    filepath="s3://test_bucket/data/02_intermediate/company/motorbikes.csv",
    load_args=dict(sep=",", skiprows=5, skipfooter=1, na_values=["#NA", "NA"]),
    credentials=dict(key="token", secret="key"),
)
```
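For completeness, here is a minimal sketch of passing such a credentials dictionary to `DataCatalog.from_config()` yourself. The dataset name `motorbikes` and the inline dictionaries are illustrative; in a project, both would be read from the YAML configuration files:

```python
from kedro.io import DataCatalog

catalog_config = {
    "motorbikes": {
        "type": "pandas.CSVDataSet",
        "filepath": "s3://test_bucket/data/02_intermediate/company/motorbikes.csv",
        "credentials": "dev_s3",  # refers to an entry in the credentials dictionary
    }
}
credentials = {
    "dev_s3": {
        "client_kwargs": {
            "aws_access_key_id": "key",
            "aws_secret_access_key": "secret",
        }
    }
}

io = DataCatalog.from_config(catalog_config, credentials)
```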
## How to version a dataset using the Code API

In an earlier section of the documentation, we described how [Kedro enables dataset and ML model versioning](./data_catalog.md#dataset-versioning).

If you require programmatic control over load and save versions of a specific dataset, you can instantiate `Version` and pass it as a parameter to the dataset initialisation:
```python
from kedro.io import DataCatalog, Version
from kedro_datasets.pandas import CSVDataSet
import pandas as pd

data1 = pd.DataFrame({"col1": [1, 2], "col2": [4, 5], "col3": [5, 6]})
data2 = pd.DataFrame({"col1": [7], "col2": [8], "col3": [9]})
version = Version(
    load=None,  # load the latest available version
    save=None,  # generate save version automatically on each save operation
)

test_data_set = CSVDataSet(
    filepath="data/01_raw/test.csv", save_args={"index": False}, version=version
)
io = DataCatalog({"test_data_set": test_data_set})

# save the dataset to data/01_raw/test.csv/<version>/test.csv
io.save("test_data_set", data1)
# save the dataset again to a new file data/01_raw/test.csv/<new version>/test.csv
io.save("test_data_set", data2)

# load the latest version from data/01_raw/test.csv/*/test.csv
reloaded = io.load("test_data_set")
assert data2.equals(reloaded)
```
In the example above, we do not fix any versions. The behaviour of load and save operations becomes slightly different when we set a version:
```python
version = Version(
    load="my_exact_version",  # load exact version
    save="my_exact_version",  # save to exact version
)

test_data_set = CSVDataSet(
    filepath="data/01_raw/test.csv", save_args={"index": False}, version=version
)
io = DataCatalog({"test_data_set": test_data_set})

# save the dataset to data/01_raw/test.csv/my_exact_version/test.csv
io.save("test_data_set", data1)
# load from data/01_raw/test.csv/my_exact_version/test.csv
reloaded = io.load("test_data_set")
assert data1.equals(reloaded)

# raises DataSetError since the path
# data/01_raw/test.csv/my_exact_version/test.csv already exists
io.save("test_data_set", data2)
```
We do not recommend passing exact load or save versions, since it can lead to inconsistencies between operations. For example, if the versions for load and save operations do not match, a save operation would result in a `UserWarning`.
Imagine a simple pipeline with two nodes, where node B takes the output from node A. If you specify the load version of the data for B to be `my_data_2023_08_16.csv`, the data that A produces (`my_data_20230818.csv`) is not used.

```text
Node A -> my_data_20230818.csv
my_data_2023_08_16.csv -> Node B
```
In code:

```python
version = Version(
    load="my_data_2023_08_16.csv",  # load exact version
    save="my_data_20230818.csv",  # save to exact version
)

test_data_set = CSVDataSet(
    filepath="data/01_raw/test.csv", save_args={"index": False}, version=version
)
io = DataCatalog({"test_data_set": test_data_set})

io.save("test_data_set", data1)  # emits a UserWarning due to version inconsistency

# raises DataSetError since the file
# data/01_raw/test.csv/my_data_2023_08_16.csv/test.csv does not exist
reloaded = io.load("test_data_set")
```