Commit

FIX #12 - Refactor MlflowModelDataSet to distinguish between saver and logger
Galileo-Galilei committed Nov 3, 2020
1 parent d89e2fa commit 1dda1e4
Showing 17 changed files with 977 additions and 287 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -8,6 +8,7 @@
- `kedro-mlflow` hooks can now be declared in `.kedro.yml` or `pyproject.toml` by adding `kedro_mlflow.framework.hooks.mlflow_pipeline_hook` and `kedro_mlflow.framework.hooks.mlflow_node_hook` into the hooks entry. _Only for kedro>=0.16.5_ [#96](https://github.com/Galileo-Galilei/kedro-mlflow/issues/96)
- `pipeline_ml_factory` now accepts that `inference` pipeline `inputs` may be in `training` pipeline `inputs` [#71](https://github.com/Galileo-Galilei/kedro-mlflow/issues/71)
- `pipeline_ml_factory` now automatically infers the schema of the input dataset to validate data at inference time. The output schema can be declared manually in the `model_signature` argument [#70](https://github.com/Galileo-Galilei/kedro-mlflow/issues/70)
- Add two Datasets for model logging and saving: `MlflowModelLoggerDataSet` and `MlflowModelSaverDataSet` ([#12](https://github.com/Galileo-Galilei/kedro-mlflow/issues/12))

### Fixed

16 changes: 12 additions & 4 deletions README.md
@@ -23,38 +23,46 @@ The following people actively maintain, enhance and discuss design to make this
- [Adrian Piotr Kruszewski](https://github.com/akruszewski)
- [Takieddine Kadiri](https://github.com/takikadiri)


# Release and roadmap

The [release history](https://github.com/Galileo-Galilei/kedro-mlflow/blob/develop/CHANGELOG.md) centralizes packages improvements across time. The main features coming in next releases are [listed on github milestones](https://github.com/Galileo-Galilei/kedro-mlflow/milestones). Feel free to upvote/downvote and discuss prioritization in associated issues.

# What is kedro-mlflow?

``kedro-mlflow`` is a [kedro-plugin](https://kedro.readthedocs.io/en/stable/04_user_guide/10_developing_plugins.html) for lightweight and portable integration of [mlflow](https://mlflow.org/docs/latest/index.html) capabilities inside [kedro](https://kedro.readthedocs.io/en/stable/index.html) projects. It enforces [``Kedro`` principles](https://kedro.readthedocs.io/en/stable/12_faq/01_faq.html?highlight=principles#what-is-the-philosophy-behind-kedro) to make mlflow usage as production-ready as possible. Its core functionalities are:

- **versioning**: you can effortlessly register your parameters or your datasets with minimal configuration in a kedro run. Later, you will be able to browse your runs in the mlflow UI, and retrieve the runs you want. This is directly linked to [Mlflow Tracking](https://www.mlflow.org/docs/latest/tracking.html).
- **model packaging**: ``kedro-mlflow`` offers a convenient API to register a pipeline as a ``model`` in the mlflow sense. Consequently, you can *API-fy* or serve your kedro pipeline with one line of code, or share a model without worrying about the preprocessing needed for further use. This is directly linked to [Mlflow Models](https://www.mlflow.org/docs/latest/models.html).


# How do I install kedro-mlflow?

**Important: kedro-mlflow is only compatible with ``kedro>=0.16.0``. If you have a project created with an older version of ``Kedro``, see this [migration guide](https://github.com/quantumblacklabs/kedro/blob/master/RELEASE.md#migration-guide-from-kedro-015-to-016).**

``kedro-mlflow`` is available on PyPI, so you can install it with ``pip``:

```console
pip install kedro-mlflow
```

If you want to use the ``develop`` version of the package, which is the most up to date, you can install the package from GitHub:

```console
pip install --upgrade git+https://github.com/Galileo-Galilei/kedro-mlflow.git@develop
```

I strongly recommend using ``conda`` (a package manager) to create an environment and reading the [``kedro`` installation guide](https://kedro.readthedocs.io/en/stable/02_getting_started/01_prerequisites.html).


# Getting started

The documentation contains:

- [A "hello world" example](https://kedro-mlflow.readthedocs.io/en/latest/source/02_hello_world_example/index.html) which demonstrates how to **setup your project**, **version parameters** and **datasets**, and browse your runs in the UI.
- A more [detailed tutorial](https://kedro-mlflow.readthedocs.io/en/latest/source/03_tutorial/index.html) to show more advanced features (mlflow configuration through the plugin, package and serve a kedro ``Pipeline``...)

Some frequently asked questions on more advanced features:

- You want to log additional metrics to the run? -> [Try ``MlflowMetricsDataSet``](https://kedro-mlflow.readthedocs.io/en/latest/source/03_tutorial/07_version_metrics.html) !
- You want to log nice dataviz of your pipeline that you register with ``MatplotlibWriter``? -> [Try ``MlflowArtifactDataSet`` to log any local files (.png, .pkl, .csv...) *automagically*](https://kedro-mlflow.readthedocs.io/en/latest/source/02_hello_world_example/02_first_steps.html#artifacts)!
- You want to easily create an API to share your awesome model with anyone? -> [See if ``pipeline_ml_factory`` can fit your needs](https://github.com/Galileo-Galilei/kedro-mlflow/issues/16)
14 changes: 7 additions & 7 deletions docs/source/01_introduction/02_motivation.md
@@ -33,13 +33,13 @@ Above implementations have the advantage of being very straightforward and *mlfl
``kedro-mlflow`` enforces these best practices while implementing a clear interface for each mlflow action in Kedro template. Below chart maps the mlflow action to perform with the Python API provided by kedro-mlflow and the location in Kedro template where the action should be performed.

|Mlflow action |Template file |Python API |
|:----------------------------|:-----------------------|:-----------------------------------------------------------|
|Set up configuration         |``mlflow.yml``          |``MlflowPipelineHook``                                       |
|Logging parameters           |``mlflow.yml``          |``MlflowNodeHook``                                           |
|Logging artifacts            |``catalog.yml``         |``MlflowArtifactDataSet``                                    |
|Logging models               |``catalog.yml``         |``MlflowModelLoggerDataSet`` and ``MlflowModelSaverDataSet`` |
|Logging metrics              |``catalog.yml``         |``MlflowMetricsDataSet``                                     |
|Logging Pipeline as model    |``hooks.py``            |``KedroPipelineModel`` and ``pipeline_ml_factory``           |

In the current version (``kedro_mlflow==0.3.0``), `kedro-mlflow` does not provide an interface to set tags or log models outside a Kedro ``Pipeline``. These decisions are subject to debate and design choices (for instance, metrics are often updated in a loop during each epoch / training iteration, and it does not always make sense to register the metric between computation steps, e.g. as an I/O operation after a node run).

42 changes: 27 additions & 15 deletions docs/source/03_tutorial/06_version_models.md
@@ -2,40 +2,52 @@

## What is model tracking?

MLflow allows you to serialize and deserialize models to a common format, track those models in MLflow Tracking and manage them using the MLflow Model Registry. Many popular Machine / Deep Learning frameworks have built-in support through what MLflow calls [flavors](https://www.mlflow.org/docs/latest/models.html#built-in-model-flavors). Even if there is no flavor for your framework of choice, it is easy to [create your own flavor](https://www.mlflow.org/docs/latest/models.html#custom-python-models) and integrate it with MLflow.

## How to track models using MLflow in Kedro project?

`kedro-mlflow` introduces two new `DataSet` types that can be used in the `DataCatalog`, called `MlflowModelLoggerDataSet` and `MlflowModelSaverDataSet`. Their APIs are very similar, except that:

- the ``MlflowModelLoggerDataSet`` is used to load from and save to the mlflow artifact store. It takes an optional `run_id` argument to load from and save to a given `run_id`, which must exist in the mlflow server you are logging to.
- the ``MlflowModelSaverDataSet`` is used to load from and save to a given local path. It uses the standard `filepath` argument in the constructor of Kedro DataSets. Note that it **does not log in mlflow**.

*Important: the ``MlflowModelSaverDataSet`` is a dataset for advanced users who want fine-grained control and may want to tweak mlflow model management. You very likely want to __use the ``MlflowModelLoggerDataSet``__ instead.*

Suppose you would like to register a `scikit-learn` model of your `DataCatalog` in mlflow. You can use the following YAML API:

```yaml
my_sklearn_model:
    type: kedro_mlflow.io.models.MlflowModelLoggerDataSet
    flavor: mlflow.sklearn
```
More information on available parameters can be found in the [dedicated section](docs/source/05_python_objects/01_DataSets.md#mlflowmodelloggerdataset).
You are now able to use ``my_sklearn_model`` in your nodes. Since this model is registered in mlflow, you can also leverage the [mlflow model serving abilities](https://www.mlflow.org/docs/latest/cli.html#mlflow-models-serve) or [predicting on batch abilities](https://www.mlflow.org/docs/latest/cli.html#mlflow-models-predict), as well as the [mlflow models registry](https://www.mlflow.org/docs/latest/model-registry.html) to manage the lifecycle of this model.

## Frequently asked questions

### How is it working under the hood?

**For ``MlflowModelLoggerDataSet``**
During save, a model object from a node output is logged to mlflow using the ``log_model`` function of the specified ``flavor``. It is logged in the `run_id` run if one is specified and there is no active run, else in the currently active mlflow run. If a `run_id` is specified while a run is active, the save operation fails. Consequently, it is **never possible to save to a specific mlflow run_id** when launching a pipeline with the `kedro run` command, because the `MlflowPipelineHook` creates a new run before each pipeline run.

During load, the model is retrieved from the ``run_id`` if specified, else it is retrieved from the active mlflow run. If there is no active mlflow run, loading fails. This will never happen if you use the `kedro run` command, because the `MlflowPipelineHook` creates a new run before each pipeline run.
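The save-time run-resolution rules described above can be sketched as a small pure-Python helper (an illustrative reconstruction of the behaviour, not the plugin's actual code; the function name is made up):

```python
def resolve_target_run(run_id=None, active_run_id=None):
    """Illustrative sketch of MlflowModelLoggerDataSet's save-time rules.

    Mirrors the behaviour described above: an explicit run_id is only
    usable when no run is active, otherwise the active run wins.
    """
    if run_id is not None:
        if active_run_id is not None:
            # An explicit run_id plus an active run fails; this is why
            # `kedro run` (whose MlflowPipelineHook opens a new run) can
            # never target a specific run_id.
            raise RuntimeError(
                "cannot save to an explicit run_id while a run is active"
            )
        return run_id
    if active_run_id is not None:
        return active_run_id
    raise RuntimeError("no run_id given and no active mlflow run")
```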

**For ``MlflowModelSaverDataSet``**

During save, a model object from a node output is saved locally under the specified ``filepath``, using the ``save_model`` function of the specified ``flavor``.

When the model is loaded, the latest version stored locally is read using the ``load_model`` function of the specified ``flavor``. You can also load a model from a specific kedro run by passing the `version` argument to the constructor.
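The "latest version stored locally" lookup can be illustrated with a toy helper. This is a simplified sketch under the assumption that versions follow Kedro's timestamp naming (e.g. `2020-11-03T10.30.00.123Z`), where lexicographic and chronological order coincide; the real resolution is handled internally by Kedro's versioned dataset machinery:

```python
def latest_local_version(versions):
    # Kedro-style version identifiers sort lexicographically in
    # chronological order, so the latest one is simply the maximum
    return max(versions)
```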

### How can I track a custom MLflow model flavor?

To track a custom MLflow model flavor, you need to set the `flavor` parameter to the import path of your custom flavor and to specify a [pyfunc workflow](https://mlflow.org/docs/latest/python_api/mlflow.pyfunc.html#pyfunc-create-custom-workflows), which can be set either to `python_model` or `loader_module`. The former is more high-level and user-friendly and is [recommended by mlflow](https://mlflow.org/docs/latest/python_api/mlflow.pyfunc.html#which-workflow-is-right-for-my-use-case), while the latter offers more control. We have not tested the integration of this second workflow in `kedro-mlflow` extensively, and it should be used with caution.

```yaml
my_custom_model:
    type: kedro_mlflow.io.models.MlflowModelLoggerDataSet
    flavor: my_package.custom_mlflow_flavor
    pyfunc_workflow: python_model # or loader_module
```
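As a rough sketch of the `python_model` workflow mentioned above: you wrap your logic in a class exposing `predict(context, model_input)`. In a real project the class would subclass `mlflow.pyfunc.PythonModel`; a plain class with a purely illustrative "doubling" model is shown here so the sketch stays dependency-free:

```python
class MyCustomModel:
    """Illustrative python_model wrapper: 'predicts' by doubling inputs.

    In practice this would subclass mlflow.pyfunc.PythonModel.
    """

    def predict(self, context, model_input):
        # `context` exposes logged artifacts; unused in this sketch
        return [x * 2 for x in model_input]
```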

### Can I use Kedro versioning with `MlflowModelDataSet`?

### Can I load a model from a specific MLflow Run ID?
92 changes: 88 additions & 4 deletions docs/source/05_python_objects/01_DataSets.md
@@ -1,15 +1,20 @@
# New ``DataSet``

## ``MlflowArtifactDataSet``

``MlflowArtifactDataSet`` is a wrapper for any ``AbstractDataSet`` which logs the dataset automatically in mlflow as an artifact when its ``save`` method is called. It can be used both with the YAML API:

```yaml
my_dataset_to_version:
    type: kedro_mlflow.io.MlflowArtifactDataSet
    data_set:
        type: pandas.CSVDataSet # or any valid kedro DataSet
        filepath: /path/to/a/local/destination/file.csv
```
or with additional parameters:
```yaml
my_dataset_to_version:
    type: kedro_mlflow.io.MlflowArtifactDataSet
    data_set:
@@ -23,11 +28,90 @@ my_dataset_to_version:
    run_id: 13245678910111213 # a valid mlflow run to log in. If None, default to active run
    artifact_path: reporting # relative path where the artifact must be stored. If None, saved in root folder.
```
or with the python API:
```python
import pandas as pd

from kedro.extras.datasets.pandas import CSVDataSet
from kedro_mlflow.io import MlflowArtifactDataSet

csv_dataset = MlflowArtifactDataSet(
    data_set={"type": CSVDataSet, "filepath": r"/path/to/a/local/destination/file.csv"}
)
csv_dataset.save(data=pd.DataFrame({"a": [1, 2], "b": [3, 4]}))
```
## Models `DataSets`

### ``MlflowModelLoggerDataSet``

The ``MlflowModelLoggerDataSet`` accepts the following arguments:

- flavor (str): Built-in or custom MLflow model flavor module. Must be Python-importable.
- run_id (Optional[str], optional): MLflow run ID to load the model from or save the model to. It plays the same role as `filepath` for standard Kedro datasets. Defaults to None.
- artifact_path (str, optional): the run-relative path to the model.
- pyfunc_workflow (str, optional): Either `python_model` or `loader_module`. See [mlflow workflows](https://www.mlflow.org/docs/latest/python_api/mlflow.pyfunc.html#workflows).
- load_args (Dict[str, Any], optional): Arguments to the `load_model` function of the specified `flavor`. Defaults to None.
- save_args (Dict[str, Any], optional): Arguments to the `log_model` function of the specified `flavor`. Defaults to None.

You can specify only the flavor:

```python
from kedro_mlflow.io.models import MlflowModelLoggerDataSet
from sklearn.linear_model import LinearRegression

mlflow_model_logger = MlflowModelLoggerDataSet(flavor="mlflow.sklearn")
mlflow_model_logger.save(LinearRegression())
```

Let's assume that this first model has been saved once and you want to retrieve it (for prediction, for instance):

```python
mlflow_model_logger = MlflowModelLoggerDataSet(flavor="mlflow.sklearn", run_id=<the-model-run-id>)
my_linear_regression = mlflow_model_logger.load()
my_linear_regression.predict(<data>)  # will obviously fail if you have not fitted your model object first :)
```

You can also specify some [logging parameters](https://www.mlflow.org/docs/latest/python_api/mlflow.sklearn.html#mlflow.sklearn.log_model):

```python
mlflow_model_logger = MlflowModelLoggerDataSet(
    flavor="mlflow.sklearn",
    run_id=<the-model-run-id>,
    save_args={
        "conda_env": {"python": "3.7.0"},
        "input_example": data.iloc[0:5, :],
    },
)
mlflow_model_logger.save(LinearRegression().fit(data))
```

### ``MlflowModelSaverDataSet``

The ``MlflowModelSaverDataSet`` accepts the following arguments:

- flavor (str): Built-in or custom MLflow model flavor module. Must be Python-importable.
- filepath (str): Path to store the dataset locally.
- pyfunc_workflow (str, optional): Either `python_model` or `loader_module`. See [mlflow workflows](https://www.mlflow.org/docs/latest/python_api/mlflow.pyfunc.html#workflows).
- load_args (Dict[str, Any], optional): Arguments to `load_model` function from specified `flavor`. Defaults to None.
- save_args (Dict[str, Any], optional): Arguments to `save_model` function from specified `flavor`. Defaults to None.
- version (Version, optional): Kedro version to use. Defaults to None.

The use is very similar to ``MlflowModelLoggerDataSet``, except that you specify a `filepath` instead of a `run_id`:

```python
from kedro_mlflow.io.models import MlflowModelSaverDataSet
from sklearn.linear_model import LinearRegression

mlflow_model_saver = MlflowModelSaverDataSet(flavor="mlflow.sklearn", filepath="path/to/where/you/want/model")
mlflow_model_saver.save(LinearRegression().fit(data))
```

The same arguments are available, plus an additional [`version` argument common to every `AbstractVersionedDataSet`](https://kedro.readthedocs.io/en/stable/kedro.io.AbstractVersionedDataSet.html):

```python
mlflow_model_saver = MlflowModelSaverDataSet(
    flavor="mlflow.sklearn",
    filepath="path/to/where/you/want/model",
    version="<valid-kedro-version>",
)
my_model = mlflow_model_saver.load()
```
1 change: 0 additions & 1 deletion kedro_mlflow/io/__init__.py
@@ -1,3 +1,2 @@
from .mlflow_dataset import MlflowArtifactDataSet
from .mlflow_metrics_dataset import MlflowMetricsDataSet
from .mlflow_model_dataset import MlflowModelDataSet