Replace "DataSet" with "Dataset" in Markdown files (#2735)
* LambdaDataSet->LambdaDataset in .md files

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* MemoryDataSet->MemoryDataset in .md files

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* PartitionedDataSet->PartitionedDataset in .md files

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* IncrementalDataSet->IncrementalDataset in .md files

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* CachedDataSet->CachedDataset in .md files

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* DataSetError->DatasetError in .md files

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* DataSetNotFoundError->DatasetNotFoundError in .md files

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* Replace "DataSet" with "Dataset" in Markdown files

* Update RELEASE.md

* Fix remaining instance of "*DataSet*"->"*Dataset*"

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* `find . -name '*.md' -print0 | xargs -0 sed -i "" "s/\([^A-Za-z]\)DataSet/\1Dataset/g"`

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* Change non-class instances of Dataset to dataset

* Replace any remaining instances of DataSet in docs

* Fix a broken link to docs for `PartitionedDataset`

---------

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Co-authored-by: Jo Stichbury <jo_stichbury@mckinsey.com>
Co-authored-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>
3 people authored Aug 18, 2023
1 parent 04ffbe4 commit 2633c2c
Showing 23 changed files with 157 additions and 156 deletions.
6 changes: 3 additions & 3 deletions docs/source/configuration/advanced_configuration.md
@@ -176,7 +176,7 @@ From version 0.17.0, `TemplatedConfigLoader` also supports the [Jinja2](https://
```
{% for speed in ['fast', 'slow'] %}
{{ speed }}-trains:
-type: MemoryDataSet
+type: MemoryDataset
{{ speed }}-cars:
type: pandas.CSVDataSet
@@ -197,13 +197,13 @@ The output Python dictionary will look as follows:

```python
{
"fast-trains": {"type": "MemoryDataSet"},
"fast-trains": {"type": "MemoryDataset"},
"fast-cars": {
"type": "pandas.CSVDataSet",
"filepath": "s3://my_s3_bucket/fast-cars.csv",
"save_args": {"index": True},
},
"slow-trains": {"type": "MemoryDataSet"},
"slow-trains": {"type": "MemoryDataset"},
"slow-cars": {
"type": "pandas.CSVDataSet",
"filepath": "s3://my_s3_bucket/slow-cars.csv",
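For context, a project opts in to `TemplatedConfigLoader` through its `settings.py`; a minimal sketch is below, where the `globals_pattern` value is an assumption rather than part of this diff.

```python
# src/<package_name>/settings.py -- illustrative sketch only
from kedro.config import TemplatedConfigLoader

CONFIG_LOADER_CLASS = TemplatedConfigLoader
# Tells the loader which config files hold the values used to fill the templates
CONFIG_LOADER_ARGS = {"globals_pattern": "*globals.yml"}
```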
2 changes: 1 addition & 1 deletion docs/source/configuration/parameters.md
@@ -66,7 +66,7 @@ node(
)
```

-In both cases, under the hood parameters are added to the Data Catalog through the method `add_feed_dict()` in [`DataCatalog`](/kedro.io.DataCatalog), where they live as `MemoryDataSet`s. This method is also what the `KedroContext` class uses when instantiating the catalog.
+In both cases, under the hood parameters are added to the Data Catalog through the method `add_feed_dict()` in [`DataCatalog`](/kedro.io.DataCatalog), where they live as `MemoryDataset`s. This method is also what the `KedroContext` class uses when instantiating the catalog.

```{note}
You can use `add_feed_dict()` to inject any other entries into your `DataCatalog` as per your use case.
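A minimal sketch of what that `add_feed_dict()` call amounts to; the parameter names and values here are illustrative.

```python
from kedro.io import DataCatalog

catalog = DataCatalog()
# Each plain value is wrapped in a MemoryDataset and becomes addressable by name
catalog.add_feed_dict({"parameters": {"learning_rate": 0.01}}, replace=True)

assert catalog.load("parameters")["learning_rate"] == 0.01
```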
10 changes: 5 additions & 5 deletions docs/source/data/advanced_data_catalog_usage.md
@@ -55,7 +55,7 @@ gear = cars["gear"].values
The following steps happened behind the scenes when `load` was called:

- The value `cars` was located in the Data Catalog
-- The corresponding `AbstractDataSet` object was retrieved
+- The corresponding `AbstractDataset` object was retrieved
- The `load` method of this dataset was called
- This `load` method delegated the loading to the underlying pandas `read_csv` function

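Spelled out by hand, those steps amount to roughly the following; the dataset class and filepath are assumptions for illustration, and the import path depends on the installed `kedro-datasets` version.

```python
from kedro_datasets.pandas import CSVDataSet

# The catalog entry "cars" maps to an AbstractDataset object much like this one...
dataset = CSVDataSet(filepath="data/01_raw/company/cars.csv")
# ...whose load() call delegates to pandas.read_csv
cars = dataset.load()
gear = cars["gear"].values
```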
@@ -70,9 +70,9 @@ This pattern is not recommended unless you are using platform notebook environme
To save data using an API similar to that used to load data:

```python
-from kedro.io import MemoryDataSet
+from kedro.io import MemoryDataset

-memory = MemoryDataSet(data=None)
+memory = MemoryDataset(data=None)
io.add("cars_cache", memory)
io.save("cars_cache", "Memory can store anything.")
io.load("cars_cache")
@@ -190,7 +190,7 @@ io.save("test_data_set", data1)
reloaded = io.load("test_data_set")
assert data1.equals(reloaded)

-# raises DataSetError since the path
+# raises DatasetError since the path
# data/01_raw/test.csv/my_exact_version/test.csv already exists
io.save("test_data_set", data2)
```
@@ -219,7 +219,7 @@ io = DataCatalog({"test_data_set": test_data_set})

io.save("test_data_set", data1) # emits a UserWarning due to version inconsistency

-# raises DataSetError since the data/01_raw/test.csv/exact_load_version/test.csv
+# raises DatasetError since the data/01_raw/test.csv/exact_load_version/test.csv
# file does not exist
reloaded = io.load("test_data_set")
```
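A sketch of the setup these snippets assume, with an explicit load version; the dataset class and paths are illustrative.

```python
from kedro.io import DataCatalog, Version
from kedro_datasets.pandas import CSVDataSet

version = Version(
    load="YYYY-MM-DDThh.mm.ss.sssZ",  # exact version to load, or None for the latest
    save=None,                        # None -> a new save version is generated
)
test_data_set = CSVDataSet(filepath="data/01_raw/test.csv", version=version)
io = DataCatalog({"test_data_set": test_data_set})
```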
5 changes: 3 additions & 2 deletions docs/source/data/data_catalog.md
@@ -126,6 +126,7 @@ In the example above, the `catalog.yml` file contains references to credentials

### Dataset versioning


Kedro enables dataset and ML model versioning through the `versioned` definition. For example:

```yaml
@@ -144,9 +145,9 @@ kedro run --load-version=cars:YYYY-MM-DDThh.mm.ss.sssZ
```
where `--load-version` is dataset name and version timestamp separated by `:`.

-A dataset offers versioning support if it extends the [`AbstractVersionedDataSet`](/kedro.io.AbstractVersionedDataset) class to accept a version keyword argument as part of the constructor and adapt the `_save` and `_load` method to use the versioned data path obtained from `_get_save_path` and `_get_load_path` respectively.
+A dataset offers versioning support if it extends the [`AbstractVersionedDataset`](/kedro.io.AbstractVersionedDataset) class to accept a version keyword argument as part of the constructor and adapt the `_save` and `_load` method to use the versioned data path obtained from `_get_save_path` and `_get_load_path` respectively.

-To verify whether a dataset can undergo versioning, you should examine the dataset class code to inspect its inheritance [(you can find contributed datasets within the `kedro-datasets` repository)](https://github.com/kedro-org/kedro-plugins/tree/main/kedro-datasets/kedro_datasets). Check if the dataset class inherits from the `AbstractVersionedDataSet`. For instance, if you encounter a class like `CSVDataSet(AbstractVersionedDataSet[pd.DataFrame, pd.DataFrame])`, this indicates that the dataset is set up to support versioning.
+To verify whether a dataset can undergo versioning, you should examine the dataset class code to inspect its inheritance [(you can find contributed datasets within the `kedro-datasets` repository)](https://github.com/kedro-org/kedro-plugins/tree/main/kedro-datasets/kedro_datasets). Check if the dataset class inherits from the `AbstractVersionedDataset`. For instance, if you encounter a class like `CSVDataSet(AbstractVersionedDataset[pd.DataFrame, pd.DataFrame])`, this indicates that the dataset is set up to support versioning.

```{note}
Note that HTTP(S) is a supported file system in the dataset implementations, but if you use it, you can't also use versioning.
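One quick way to run the inheritance check described above; exact class names and the result depend on the installed `kedro` and `kedro-datasets` versions.

```python
from kedro.io import AbstractVersionedDataset
from kedro_datasets.pandas import CSVDataSet

# True means the dataset class is set up to support versioning
print(issubclass(CSVDataSet, AbstractVersionedDataset))
```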
6 changes: 3 additions & 3 deletions docs/source/data/data_catalog_yaml_examples.md
@@ -397,12 +397,12 @@ for loading, so the first node outputs a `pyspark.sql.DataFrame`, while the seco

You can use the [`kedro catalog create` command to create a Data Catalog YAML configuration](../development/commands_reference.md#create-a-data-catalog-yaml-configuration-file).

-This creates a `<conf_root>/<env>/catalog/<pipeline_name>.yml` configuration file with `MemoryDataSet` datasets for each dataset in a registered pipeline if it is missing from the `DataCatalog`.
+This creates a `<conf_root>/<env>/catalog/<pipeline_name>.yml` configuration file with `MemoryDataset` datasets for each dataset in a registered pipeline if it is missing from the `DataCatalog`.

```yaml
# <conf_root>/<env>/catalog/<pipeline_name>.yml
rockets:
-type: MemoryDataSet
+type: MemoryDataset
scooters:
-type: MemoryDataSet
+type: MemoryDataset
```
16 changes: 8 additions & 8 deletions docs/source/data/how_to_create_a_custom_dataset.md
@@ -2,9 +2,9 @@

[Kedro supports many datasets](/kedro_datasets) out of the box, but you may find that you need to create a custom dataset. For example, you may need to handle a proprietary data format or filesystem in your pipeline, or perhaps you have found a particular use case for a dataset that Kedro does not support. This tutorial explains how to create a custom dataset to read and save image data.

-## AbstractDataSet
+## AbstractDataset

-For contributors, if you would like to submit a new dataset, you must extend the [`AbstractDataSet` interface](/kedro.io.AbstractDataset) or [`AbstractVersionedDataSet` interface](/kedro.io.AbstractVersionedDataset) if you plan to support versioning. It requires subclasses to override the `_load` and `_save` and provides `load` and `save` methods that enrich the corresponding private methods with uniform error handling. It also requires subclasses to override `_describe`, which is used in logging the internal information about the instances of your custom `AbstractDataSet` implementation.
+For contributors, if you would like to submit a new dataset, you must extend the [`AbstractDataset` interface](/kedro.io.AbstractDataset) or [`AbstractVersionedDataset` interface](/kedro.io.AbstractVersionedDataset) if you plan to support versioning. It requires subclasses to override the `_load` and `_save` and provides `load` and `save` methods that enrich the corresponding private methods with uniform error handling. It also requires subclasses to override `_describe`, which is used in logging the internal information about the instances of your custom `AbstractDataset` implementation.

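As a bare-bones sketch of that contract (the class name and CSV format are placeholders, not the tutorial's image dataset):

```python
from pathlib import Path
from typing import Any, Dict

import pandas as pd

from kedro.io import AbstractDataset


class MyCSVDataset(AbstractDataset[pd.DataFrame, pd.DataFrame]):
    def __init__(self, filepath: str):
        self._filepath = Path(filepath)

    def _load(self) -> pd.DataFrame:
        return pd.read_csv(self._filepath)

    def _save(self, data: pd.DataFrame) -> None:
        data.to_csv(self._filepath, index=False)

    def _describe(self) -> Dict[str, Any]:
        # Surfaced by Kedro when logging information about this dataset instance
        return {"filepath": str(self._filepath)}
```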

## Scenario
Expand Down Expand Up @@ -267,19 +267,19 @@ class ImageDataSet(AbstractDataset[np.ndarray, np.ndarray]):
```
</details>

-## Integration with `PartitionedDataSet`
+## Integration with `PartitionedDataset`

Currently, the `ImageDataSet` only works with a single image, but this example needs to load all Pokemon images from the raw data directory for further processing.

-Kedro's [`PartitionedDataSet`](./partitioned_and_incremental_datasets.md) is a convenient way to load multiple separate data files of the same underlying dataset type into a directory.
+Kedro's [`PartitionedDataset`](/kedro.io.PartitionedDataset) is a convenient way to load multiple separate data files of the same underlying dataset type into a directory.

-To use `PartitionedDataSet` with `ImageDataSet` to load all Pokemon PNG images, add this to the data catalog YAML so that `PartitionedDataSet` loads all PNG files from the data directory using `ImageDataSet`:
+To use `PartitionedDataset` with `ImageDataSet` to load all Pokemon PNG images, add this to the data catalog YAML so that `PartitionedDataset` loads all PNG files from the data directory using `ImageDataSet`:

```yaml
# in conf/base/catalog.yml

pokemon:
-type: PartitionedDataSet
+type: PartitionedDataset
dataset: kedro_pokemon.extras.datasets.image_dataset.ImageDataSet
path: data/01_raw/pokemon-images-and-types/images/images
filename_suffix: ".png"
@@ -305,11 +305,11 @@ $ ls -la data/01_raw/pokemon-images-and-types/images/images/*.png | wc -l
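When a node receives the `pokemon` entry above, it is given a dictionary mapping partition ids to load functions; a hedged sketch of consuming it, assuming all images share the same shape:

```python
from typing import Callable, Dict

import numpy as np


def combine_pokemon_images(partitions: Dict[str, Callable[[], np.ndarray]]) -> np.ndarray:
    # Each value is lazy: calling it loads one PNG via the underlying ImageDataSet
    images = [load_partition() for _, load_partition in sorted(partitions.items())]
    return np.stack(images)
```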
### How to implement versioning in your dataset

```{note}
-Versioning doesn't work with `PartitionedDataSet`. You can't use both of them at the same time.
+Versioning doesn't work with `PartitionedDataset`. You can't use both of them at the same time.
```

To add versioning support to the new dataset we need to extend the
-[AbstractVersionedDataSet](/kedro.io.AbstractVersionedDataset) to:
+[AbstractVersionedDataset](/kedro.io.AbstractVersionedDataset) to:

* Accept a `version` keyword argument as part of the constructor
* Adapt the `_save` and `_load` method to use the versioned data path obtained from `_get_save_path` and `_get_load_path` respectively
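A simplified sketch of those two bullet points, using a local-filesystem NumPy dataset instead of the tutorial's image dataset (no fsspec or remote-path handling):

```python
import os
from glob import glob
from pathlib import Path, PurePosixPath
from typing import Any, Dict, Optional

import numpy as np

from kedro.io import AbstractVersionedDataset, Version


class VersionedArrayDataset(AbstractVersionedDataset[np.ndarray, np.ndarray]):
    """Saves and loads a NumPy array; expects ``filepath`` to end in .npy."""

    def __init__(self, filepath: str, version: Optional[Version] = None):
        # Accept `version` in the constructor and pass it to the parent class
        super().__init__(
            filepath=PurePosixPath(filepath),
            version=version,
            exists_function=os.path.exists,
            glob_function=glob,
        )

    def _load(self) -> np.ndarray:
        load_path = Path(self._get_load_path())  # resolved, versioned load path
        return np.load(load_path)

    def _save(self, data: np.ndarray) -> None:
        save_path = Path(self._get_save_path())  # resolved, versioned save path
        save_path.parent.mkdir(parents=True, exist_ok=True)
        np.save(save_path, data)

    def _describe(self) -> Dict[str, Any]:
        return {"filepath": self._filepath, "version": self._version}
```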
2 changes: 1 addition & 1 deletion docs/source/data/kedro_dataset_factories.md
@@ -215,7 +215,7 @@ The matches are ranked according to the following criteria:

## How to override the default dataset creation with dataset factories

-You can use dataset factories to define a catch-all pattern which will overwrite the default [`MemoryDataSet`](/kedro.io.MemoryDataset) creation.
+You can use dataset factories to define a catch-all pattern which will overwrite the default [`MemoryDataset`](/kedro.io.MemoryDataset) creation.

```yaml
"{default_dataset}":