
Releases: kedro-org/kedro

0.16.1

21 May 12:46
d291a21

Bug fixes and other changes

  • Fixed deprecation warnings from kedro.cli and kedro.context when running kedro jupyter notebook.
  • Fixed a bug where catalog and context were not available in Jupyter Lab and Notebook.
  • Fixed a bug where kedro build-reqs would fail if you didn't have your project dependencies installed.

0.16.0

20 May 11:04
c19ca9e

Major features and improvements

CLI

  • Added new CLI commands (only available for projects created with Kedro 0.16.0 or later):
    • kedro catalog list to list datasets in your catalog
    • kedro pipeline list to list pipelines
    • kedro pipeline describe to describe a specific pipeline
    • kedro pipeline create to create a modular pipeline
  • Improved the CLI speed by up to 50%.
  • Improved error handling when you make a typo on the CLI: we now suggest possible commands you meant to type, in Git style.

Framework

  • All modules in kedro.cli and kedro.context have been moved into kedro.framework.cli and kedro.framework.context respectively. kedro.cli and kedro.context will be removed in future releases.
  • Added Hooks, a new mechanism for extending Kedro (see the sketch after this list).
  • Fixed load_context changing the user's current working directory.
  • Allowed the source directory to be configurable in .kedro.yml.
  • Added the ability to specify nested parameter values inside your node inputs, e.g. node(func, "params:a.b", None).
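
A minimal sketch of the Hooks mechanism follows; it assumes the 0.16.0 registration style, where hook implementations are listed on ProjectContext.hooks, and the class, hook arguments and messages shown are illustrative:

from kedro.framework.hooks import hook_impl

class LoggingHooks:
    @hook_impl
    def before_node_run(self, node):
        # Called by the framework before each node executes; an implementation
        # may accept a subset of the arguments defined in the hook spec.
        print(f"About to run node: {node.name}")

# Registration (illustrative), in your project's run.py:
# class ProjectContext(KedroContext):
#     hooks = (LoggingHooks(),)

And a sketch of nested parameter values in node inputs, assuming a parameters.yml that defines a nested key model.learning_rate (all names are illustrative):

from kedro.pipeline import node

def train(learning_rate):
    print(f"Training with learning_rate={learning_rate}")

# "params:model.learning_rate" resolves to the value nested under "model"
train_node = node(train, inputs="params:model.learning_rate", outputs=None)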

DataSets

  • Added the following new datasets:

| Type | Description | Location |
| --- | --- | --- |
| pillow.ImageDataSet | Work with image files using Pillow | kedro.extras.datasets.pillow |
| geopandas.GeoJSONDataSet | Work with geospatial data using GeoPandas | kedro.extras.datasets.geopandas.GeoJSONDataSet |
| api.APIDataSet | Work with data from HTTP(S) API requests | kedro.extras.datasets.api.APIDataSet |
  • Added joblib backend support to pickle.PickleDataSet (see the sketch after this list).
  • Added versioning support to the MatplotlibWriter dataset.
  • Added the ability to install dependencies for a given dataset with more granularity, e.g. pip install "kedro[pandas.ParquetDataSet]".
  • Added the ability to specify extra arguments, e.g. encoding or compression, for fsspec.spec.AbstractFileSystem.open() calls when loading/saving a dataset; see Example 3 in the documentation.
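
A minimal sketch of the joblib backend and the extra fsspec open() arguments; the file paths are illustrative and the keyword names are assumptions based on the notes above:

from kedro.extras.datasets.pandas import CSVDataSet
from kedro.extras.datasets.pickle import PickleDataSet

# Serialise with joblib instead of the default pickle backend
model = PickleDataSet(filepath="data/06_models/model.pkl", backend="joblib")

# Pass extra arguments (here, compression) to fsspec's open() on save
table = CSVDataSet(
    filepath="data/01_raw/data.csv.gz",
    fs_args={"open_args_save": {"compression": "gzip"}},
)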

Other

  • Added a namespace property on Node, which identifies the modular pipeline the node belongs to.
  • Added an option to enable asynchronous loading of inputs and saving of outputs in both the SequentialRunner(is_async=True) and ParallelRunner(is_async=True) classes (see the sketch after this list).
  • Added MemoryProfiler transformer.
  • Removed the requirement to install all of a dataset module's dependencies in order to use only a subset of its datasets.
  • Added support for pandas>=1.0.
  • Enabled Python 3.8 compatibility. Please note that a Spark workflow may be unreliable for this Python version, as pyspark is not fully compatible with 3.8 yet.
  • Renamed the "features" layer to "feature" layer to be consistent with (most) other layers and the relevant FAQ.
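
A minimal, self-contained sketch of asynchronous loading and saving with the new is_async flag; the node and data are illustrative:

from kedro.io import DataCatalog, MemoryDataSet
from kedro.pipeline import Pipeline, node
from kedro.runner import SequentialRunner

def double(x):
    return 2 * x

catalog = DataCatalog({"x": MemoryDataSet(21)})
pipeline = Pipeline([node(double, "x", "y")])

# With is_async=True, inputs are loaded and outputs saved asynchronously
runner = SequentialRunner(is_async=True)
result = runner.run(pipeline, catalog)  # {"y": 42}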

Bug fixes and other changes

  • Fixed a bug where a new version created mid-run by an external system caused inconsistencies in the load versions used in the current run.
  • Documentation improvements
    • Added instructions in the documentation on how to create a custom runner.
    • Updated contribution process in CONTRIBUTING.md - added Developer Workflow.
    • Documented installation of development version of Kedro in the FAQ section.
    • Added missing _exists method to MyOwnDataSet example in 04_user_guide/08_advanced_io.
  • Fixed a bug where PartitionedDataSet and IncrementalDataSet were not working with s3a or s3n protocol.
  • Added the ability to read a partitioned parquet file from a directory in pandas.ParquetDataSet (see the sketch after this list).
  • Replaced functools.lru_cache with cachetools.cachedmethod in PartitionedDataSet and IncrementalDataSet for per-instance cache invalidation.
  • Implemented custom glob function for SparkDataSet when running on Databricks.
  • Fixed a bug in SparkDataSet not allowing for loading data from DBFS on a Windows machine using Databricks-connect.
  • Improved the error message for DataSetNotFoundError to suggest possible dataset names the user meant to type.
  • Added the option for contributors to run Kedro tests locally without a Spark installation with make test-no-spark.
  • Added option to lint the project without applying the formatting changes (kedro lint --check-only).
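
A minimal sketch of reading a partitioned parquet file from a directory with pandas.ParquetDataSet; the path is illustrative and assumes a directory of part files:

from kedro.extras.datasets.pandas import ParquetDataSet

trips = ParquetDataSet(filepath="data/02_intermediate/trips/")  # a directory, not a single file
df = trips.load()  # the parts are read and combined into a single DataFrame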

Breaking changes to the API

Datasets

  • Deleted obsolete datasets from kedro.io.
  • Deleted kedro.contrib and extras folders.
  • Deleted obsolete CSVBlobDataSet and JSONBlobDataSet dataset types.
  • Made invalidate_cache method on datasets private.
  • get_last_load_version and get_last_save_version methods are no longer available on AbstractDataSet.
  • get_last_load_version and get_last_save_version have been renamed to resolve_load_version and resolve_save_version on AbstractVersionedDataSet, the results of which are cached.
  • The release() method on datasets extending AbstractVersionedDataSet clears the cached load and save version. All custom datasets must call super()._release() inside _release().
  • TextDataSet no longer has load_args and save_args. These can instead be specified under open_args_load or open_args_save in fs_args (see the sketch after this list).
  • The invalidate_cache method on PartitionedDataSet and IncrementalDataSet was made private: _invalidate_caches.
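
A minimal sketch of migrating TextDataSet arguments to fs_args; the file path and encoding values are illustrative:

from kedro.extras.datasets.text import TextDataSet

# Before (0.15.x): TextDataSet(filepath=..., load_args=..., save_args=...)
# After (0.16.0): open() arguments move under fs_args
notes = TextDataSet(
    filepath="data/01_raw/notes.txt",
    fs_args={
        "open_args_load": {"mode": "r", "encoding": "utf-8"},
        "open_args_save": {"mode": "w", "encoding": "utf-8"},
    },
)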

Other

  • Removed KEDRO_ENV_VAR from kedro.context to speed up the CLI run time.
  • Pipeline.name has been removed in favour of Pipeline.tag().
  • Dropped Pipeline.transform() in favour of kedro.pipeline.modular_pipeline.pipeline() helper function.
  • Made constant PARAMETER_KEYWORDS private, and moved it from kedro.pipeline.pipeline to kedro.pipeline.modular_pipeline.
  • Layers are no longer part of the dataset object, as they've moved to the DataCatalog.
  • Python 3.5 is no longer supported by the current and all future versions of Kedro.

Migration guide from Kedro 0.15.* to 0.16.0

Migration for datasets

Since all the datasets (from kedro.io and kedro.contrib.io) were moved to kedro/extras/datasets, you must update the type of all datasets in the <project>/conf/base/catalog.yml file.
Here is how it should be changed: type: <SomeDataSet> -> type: <subfolder of kedro/extras/datasets>.<SomeDataSet> (e.g. type: CSVDataSet -> type: pandas.CSVDataSet).

In addition, all the specific datasets like CSVLocalDataSet, CSVS3DataSet, etc. have been deprecated. Instead, you must use generalized datasets like CSVDataSet.
E.g. type: CSVS3DataSet -> type: pandas.CSVDataSet.

Note: no changes are required if you are using a custom dataset.

Migration for Pipeline.transform()

Pipeline.transform() has been dropped in favour of the pipeline() helper function. The following changes apply:

  • Remember to import it: from kedro.pipeline import pipeline
  • The prefix argument has been renamed to namespace
  • The datasets argument has been broken down into more granular arguments:
    • inputs: Independent inputs to the pipeline
    • outputs: Any output created in the pipeline, whether an intermediary dataset or a leaf output
    • parameters: params:... or parameters

As an example, code that used to look like this with the Pipeline.transform() method:

result = my_pipeline.transform(
    datasets={"input": "new_input", "output": "new_output", "params:x": "params:y"},
    prefix="pre"
)

When rewritten with the new pipeline() helper function, this becomes:

from kedro.pipeline import pipeline

result = pipeline(
    my_pipeline,
    inputs={"input": "new_input"},
    outputs={"output": "new_output"},
    parameters={"params:x": "params:y"},
    namespace="pre"
)

Migration for decorators, color logger, transformers etc.

Since some modules were moved to other locations, you need to update import paths appropriately.
You can find the list of moved files in the 0.15.6 release notes under the section titled Files with a new location.

Migration for KEDRO_ENV_VAR, the environment variable

Note: If you haven't made significant changes to your kedro_cli.py, it may be easier to simply copy the updated kedro_cli.py and .ipython/profile_default/startup/00-kedro-init.py from GitHub or a newly generated project into your old project.

  • We've removed KEDRO_ENV_VAR from kedro.context. To get your existing project template working, you'll need to remove all instances of KEDRO_ENV_VAR from your project template:
    • From the imports in kedro_cli.py and .ipython/profile_default/startup/00-kedro-init.py: from kedro.context import KEDRO_ENV_VAR, load_context -> from kedro.framework.context import load_context
    • Remove the envvar=KEDRO_ENV_VAR line from the click options in run, jupyter_notebook and jupyter_lab in kedro_cli.py
    • Replace KEDRO_ENV_VAR with "KEDRO_ENV" in _build_jupyter_env
    • Replace context = load_context(path, env=os.getenv(KEDRO_ENV_VAR)) with context = load_context(path) in .ipython/profile_default/startup/00-kedro-init.py

Migration for kedro build-reqs

We have upgraded pip-tools, which is used by kedro build-reqs, to 5.x. This pip-tools version requires pip>=20.0. To upgrade pip, please refer to their documentation.

Thanks for supporting contributions

@foolsgold, [Mani ...


0.15.9

06 Apr 14:50
b8bd47c

Bug fixes and other changes

  • Pinned fsspec>=0.5.1, <0.7.0 and s3fs>=0.3.0, <0.4.1 to fix incompatibility issues with their latest releases.

0.15.8

05 Mar 10:12
f79ee14

Major features and improvements

  • Added additional libraries to our requirements.txt so that the pandas.CSVDataSet class works out of the box with pip install kedro.
  • Added pandas to our extras_require in setup.py.
  • Improved the error message when dependencies of a DataSet class are missing.

0.15.7

26 Feb 17:13
19eaf88

Major features and improvements

  • Added documentation on how to contribute a custom AbstractDataSet implementation.

Bug fixes and other changes

  • Fixed the link to the Kedro banner image in the documentation.

0.15.6

26 Feb 11:55
eb6bdd8

Major features and improvements

TL;DR We're launching kedro.extras, the new home for our revamped series of datasets, decorators and dataset transformers. The datasets in kedro.extras.datasets use fsspec to access a variety of data stores, including local file systems, network file systems, cloud object stores (including S3 and GCS) and Hadoop; read more about this here. The change will allow #178 to happen in the next major release of Kedro.

An example of this new system can be seen below, loading a CSV file from S3 using SparkDataSet:

weather:
  type: spark.SparkDataSet  # Observe the specified type, this affects all datasets
  filepath: s3a://your_bucket/data/01_raw/weather*  # filepath uses fsspec to indicate the file storage system
  credentials: dev_s3
  file_format: csv

You can also load data incrementally whenever it is dumped into a directory with IncrementalDataSet, an extension to PartitionedDataSet (a feature that allows you to load a directory of files). The IncrementalDataSet stores information about the last processed partition in a checkpoint; read more about this feature here.

New features

  • Added a layer attribute for datasets in kedro.extras.datasets to specify the name of a layer according to the data engineering convention; this feature will be passed to kedro-viz in future releases.
  • Enabled loading a particular version of a dataset in Jupyter Notebooks and IPython, using catalog.load("dataset_name", version="<2019-12-13T15.08.09.255Z>").
  • Added a run_id property on ProjectContext, used for versioning using the Journal. To customise your journal run_id you can override the private method _get_run_id().
  • Added the ability to install all optional kedro dependencies via pip install "kedro[all]".
  • Modified the DataCatalog's load order for datasets; the loading order is now the following:
    • kedro.io
    • kedro.extras.datasets
    • Import path, specified in type
  • Added an optional copy_mode flag to CachedDataSet and MemoryDataSet to specify the copy mode (deepcopy, copy or assign) to use when loading and saving (see the sketch after this list).
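
A minimal sketch of the copy_mode flag; "assign" shares the underlying object instead of copying it:

import pandas as pd

from kedro.io import MemoryDataSet

df = pd.DataFrame({"a": [1, 2, 3]})
dataset = MemoryDataSet(data=df, copy_mode="assign")
assert dataset.load() is df  # no copy or deepcopy is performed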

New Datasets

| Type | Description | Location |
| --- | --- | --- |
| ParquetDataSet | Handles parquet datasets using Dask | kedro.extras.datasets.dask |
| PickleDataSet | Work with Pickle files using fsspec to communicate with the underlying filesystem | kedro.extras.datasets.pickle |
| CSVDataSet | Work with CSV files using fsspec to communicate with the underlying filesystem | kedro.extras.datasets.pandas |
| TextDataSet | Work with text files using fsspec to communicate with the underlying filesystem | kedro.extras.datasets.pandas |
| ExcelDataSet | Work with Excel files using fsspec to communicate with the underlying filesystem | kedro.extras.datasets.pandas |
| HDFDataSet | Work with HDF using fsspec to communicate with the underlying filesystem | kedro.extras.datasets.pandas |
| YAMLDataSet | Work with YAML files using fsspec to communicate with the underlying filesystem | kedro.extras.datasets.yaml |
| MatplotlibWriter | Save Matplotlib images using fsspec to communicate with the underlying filesystem | kedro.extras.datasets.matplotlib |
| NetworkXDataSet | Work with NetworkX files using fsspec to communicate with the underlying filesystem | kedro.extras.datasets.networkx |
| BioSequenceDataSet | Work with bio-sequence objects using fsspec to communicate with the underlying filesystem | kedro.extras.datasets.biosequence |
| GBQTableDataSet | Work with Google BigQuery | kedro.extras.datasets.pandas |
| FeatherDataSet | Work with feather files using fsspec to communicate with the underlying filesystem | kedro.extras.datasets.pandas |
| IncrementalDataSet | Inherits from PartitionedDataSet and remembers the last processed partition | kedro.io |

Files with a new location

| Type | New Location |
| --- | --- |
| JSONDataSet | kedro.extras.datasets.pandas |
| CSVBlobDataSet | kedro.extras.datasets.pandas |
| JSONBlobDataSet | kedro.extras.datasets.pandas |
| SQLTableDataSet | kedro.extras.datasets.pandas |
| SQLQueryDataSet | kedro.extras.datasets.pandas |
| SparkDataSet | kedro.extras.datasets.spark |
| SparkHiveDataSet | kedro.extras.datasets.spark |
| SparkJDBCDataSet | kedro.extras.datasets.spark |
| kedro/contrib/decorators/retry.py | kedro/extras/decorators/retry_node.py |
| kedro/contrib/decorators/memory_profiler.py | kedro/extras/decorators/memory_profiler.py |
| kedro/contrib/io/transformers/transformers.py | kedro/extras/transformers/time_profiler.py |
| kedro/contrib/colors/logging/color_logger.py | kedro/extras/logging/color_logger.py |
| extras/ipython_loader.py | tools/ipython/ipython_loader.py |
| kedro/contrib/io/cached/cached_dataset.py | kedro/io/cached_dataset.py |
| kedro/contrib/io/catalog_with_default/data_catalog_with_default.py | kedro/io/data_catalog_with_default.py |
| kedro/contrib/config/templated_config.py | kedro/config/templated_config.py |

Upcoming deprecations

| Category | Type |
| --- | --- |
| Datasets | BioSequenceLocalDataSet |
| | CSVGCSDataSet |
| | CSVHTTPDataSet |
| | CSVLocalDataSet |
| | CSVS3DataSet |
| | ExcelLocalDataSet |
| | FeatherLocalDataSet |
| | JSONGCSDataSet |
| | `JSONLo... |

0.15.5

12 Dec 13:29
98a6c8f

Major features and improvements

  • New CLI commands and command flags:
    • Load multiple kedro run CLI flags from a configuration file with the --config flag (e.g. kedro run --config run_config.yml)
    • Run parametrised pipeline runs with the --params flag (e.g. kedro run --params param1:value1,param2:value2).
    • Lint your project code using the kedro lint command; your project is linted with black (Python 3.6+), flake8 and isort.
  • Load specific environments in Jupyter notebooks by setting the KEDRO_ENV environment variable, which applies globally to the run, jupyter notebook and jupyter lab commands.
  • Added the following datasets:
    • CSVGCSDataSet dataset in contrib for working with CSV files in Google Cloud Storage.
    • ParquetGCSDataSet dataset in contrib for working with Parquet files in Google Cloud Storage.
    • JSONGCSDataSet dataset in contrib for working with JSON files in Google Cloud Storage.
    • MatplotlibS3Writer dataset in contrib for saving Matplotlib images to S3.
    • PartitionedDataSet for working with datasets split across multiple files.
    • JSONDataSet dataset for working with JSON files that uses fsspec to communicate with the underlying filesystem. It doesn't support the http(s) protocol for now.
  • Added s3fs_args to all S3 datasets.
  • Pipelines can be subtracted with pipeline1 - pipeline2.
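
A minimal sketch of pipeline subtraction; the nodes are illustrative:

from kedro.pipeline import Pipeline, node

def identity(x):
    return x

first = node(identity, "a", "b", name="first")
second = node(identity, "b", "c", name="second")

full = Pipeline([first, second])
remainder = full - Pipeline([second])  # a pipeline containing only the "first" node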

Bug fixes and other changes

  • ParallelRunner now works with SparkDataSet.
  • Allowed the use of nulls in parameters.yml.
  • Fixed an issue where %reload_kedro wasn't reloading all user modules.
  • Fixed pandas_to_spark and spark_to_pandas decorators to work with functions with kwargs.
  • Fixed a bug where kedro jupyter notebook and kedro jupyter lab would run a different Jupyter installation to the one in the local environment.
  • Implemented Databricks-compatible dataset versioning for SparkDataSet.
  • Fixed a bug where kedro package would fail in certain situations where kedro build-reqs was used to generate requirements.txt.
  • Made bucket_name argument optional for the following datasets: CSVS3DataSet, HDFS3DataSet, PickleS3DataSet, contrib.io.parquet.ParquetS3DataSet, contrib.io.gcs.JSONGCSDataSet - bucket name can now be included into the filepath along with the filesystem protocol (e.g. s3://bucket-name/path/to/key.csv).
  • Documentation improvements and fixes.

Breaking changes to the API

  • Renamed entry point for running pip-installed projects to run_package() instead of main() in src/<package>/run.py.
  • bucket_name key has been removed from the string representation of the following datasets: CSVS3DataSet, HDFS3DataSet, PickleS3DataSet, contrib.io.parquet.ParquetS3DataSet, contrib.io.gcs.JSONGCSDataSet.
  • Moved the mem_profiler decorator to contrib and separated the contrib decorators so that dependencies are modular. You may need to update your import paths, for example the pyspark decorators should be imported as from kedro.contrib.decorators.pyspark import <pyspark_decorator> instead of from kedro.contrib.decorators import <pyspark_decorator>.

Thanks for supporting contributions

Sheldon Tsen, @roumail, Karlson Lee, Waylon Walker, Deepyaman Datta, Giovanni, Zain Patel

0.15.4

30 Oct 17:23
b97440a

Major features and improvements

  • kedro jupyter now gives the default kernel a sensible name.
  • Pipeline.name has been deprecated in favour of Pipeline.tags.
  • Reuse pipelines within a Kedro project using Pipeline.transform, which simplifies dataset and node renaming.
  • Added Jupyter Notebook line magic (%run_viz) to run kedro viz in a Notebook cell (requires kedro-viz version 3.0.0 or later).
  • Added the following datasets:
    • NetworkXLocalDataSet in kedro.contrib.io.networkx to load and save local graphs (JSON format) via NetworkX. (by @josephhaaga)
    • SparkHiveDataSet in kedro.contrib.io.pyspark, allowing usage of Spark with insert/upsert on non-transactional Hive tables.
  • kedro.contrib.config.TemplatedConfigLoader now supports name/dict key templating and default values (see the sketch after this list).
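
A minimal TemplatedConfigLoader sketch, assuming the globals_dict keyword argument; the conf paths, keys and the ${bucket_name} placeholder are illustrative:

from kedro.contrib.config import TemplatedConfigLoader  # 0.15.x location

config_loader = TemplatedConfigLoader(
    ["conf/base", "conf/local"],
    globals_dict={"bucket_name": "my-bucket"},
)
# Any ${bucket_name} placeholder in the matched files is substituted on load
catalog_config = config_loader.get("catalog*.yml")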

Bug fixes and other changes

  • The get_last_load_version() method for versioned datasets now returns the exact last load version if the dataset has been loaded at least once, and None otherwise.
  • Fixed a bug in the _exists method for versioned SparkDataSet.
  • Enabled the customisation of the ExcelWriter in ExcelLocalDataSet by specifying options under the writer key in save_args.
  • Fixed a bug in the IPython startup script that attempted to load the context from the incorrect location.
  • Removed the cap on the length of a dataset's string representation.
  • Fixed kedro install command failing on Windows if src/requirements.txt contains a different version of Kedro.
  • Enabled passing a single tag into a node or a pipeline without having to wrap it in a list (i.e. tags="my_tag").

Breaking changes to the API

  • Removed _check_paths_consistency() method from AbstractVersionedDataSet. Version consistency check is now done in AbstractVersionedDataSet.save(). Custom versioned datasets should modify save() method implementation accordingly.

Thanks for supporting contributions

Joseph Haaga, Deepyaman Datta, Joost Duisters, Zain Patel, Tom Vigrass

0.15.3

17 Oct 14:48
f3f977d

Bug fixes and other changes

  • Narrowed the requirements for PyTables so that we maintain support for Python 3.5.

0.15.2

08 Oct 16:19
10bb2f7

Major features and improvements

  • Added --load-version, a kedro run argument that allows you to run the pipeline with a particular load version of a dataset.
  • Support for modular pipelines in src/: break the pipeline into isolated parts with reusability in mind.
  • Support for multiple pipelines: the ability to have multiple entry-point pipelines and choose one with kedro run --pipeline NAME.
  • Added a MatplotlibWriter dataset in contrib for saving Matplotlib images.
  • The ability to template/parameterize configuration files with kedro.contrib.config.TemplatedConfigLoader.
  • Parameters are exposed as a context property for ease of access in IPython / Jupyter Notebooks with context.params (see the sketch after this list).
  • Added max_workers parameter for ParallelRunner.
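
A minimal sketch of accessing parameters through the context; the project path and parameter key are illustrative:

from kedro.context import load_context  # 0.15.x import path

context = load_context("path/to/your/project")
params = context.params  # dict built from parameters.yml
value = params["my_param"]  # illustrative key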

Bug fixes and other changes

  • Users should now override the _get_pipeline abstract method in ProjectContext(KedroContext) in run.py, rather than the pipeline abstract property; the pipeline property is no longer abstract.
  • Improved the error message shown when a versioned local dataset is saved and an unversioned path already exists.
  • Added a catalog global variable to 00-kedro-init.py, allowing you to load datasets with catalog.load().
  • Enabled tuples to be returned from a node.
  • Disallowed the ConfigLoader from loading the same file more than once, and deduplicated the conf_paths passed in.
  • Added a --open flag to kedro build-docs that opens the documentation on build.
  • Updated the Pipeline representation to include the name of the pipeline, also making it readable as a context property.
  • kedro.contrib.io.pyspark.SparkDataSet and kedro.contrib.io.azure.CSVBlobDataSet now support versioning.

Breaking changes to the API

  • KedroContext.run() no longer accepts catalog and pipeline arguments.
  • node.inputs now returns the node's inputs in the order required to bind them properly to the node's function.

Thanks for supporting contributions

Deepyaman Datta, Luciano Issoe, Joost Duisters, Zain Patel, William Ashford, Karlson Lee