Commit

FIX #12 - Refactor MlflowModelDataSet to distinguish between saver and logger
Galileo-Galilei committed Nov 3, 2020
1 parent d89e2fa commit 1dda1e4
Showing 17 changed files with 977 additions and 287 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -8,6 +8,7 @@
- `kedro-mlflow` hooks can now be declared in `.kedro.yml` or `pyproject.toml` by adding `kedro_mlflow.framework.hooks.mlflow_pipeline_hook` and `kedro_mlflow.framework.hooks.mlflow_node_hook` into the hooks entry. _Only for kedro>=0.16.5_ [#96](https://github.com/Galileo-Galilei/kedro-mlflow/issues/96)
- `pipeline_ml_factory` now accepts that `inference` pipeline `inputs` may be in `training` pipeline `inputs` [#71](https://github.com/Galileo-Galilei/kedro-mlflow/issues/71)
- `pipeline_ml_factory` now automatically infers the schema of the input dataset to validate data at inference time. The output schema can be declared manually in the `model_signature` argument [#70](https://github.com/Galileo-Galilei/kedro-mlflow/issues/70)
- Add two Datasets for model logging and saving: `MlflowModelLoggerDataSet` and `MlflowModelSaverDataSet` ([#12](https://github.com/Galileo-Galilei/kedro-mlflow/issues/12))

### Fixed

16 changes: 12 additions & 4 deletions README.md
@@ -23,38 +23,46 @@ The following people actively maintain, enhance and discuss design to make this
- [Adrian Piotr Kruszewski](https://github.com/akruszewski)
- [Takieddine Kadiri](https://github.com/takikadiri)


# Release and roadmap

The [release history](https://github.com/Galileo-Galilei/kedro-mlflow/blob/develop/CHANGELOG.md) centralizes packages improvements across time. The main features coming in next releases are [listed on github milestones](https://github.com/Galileo-Galilei/kedro-mlflow/milestones). Feel free to upvote/downvote and discuss prioritization in associated issues.

# What is kedro-mlflow?

``kedro-mlflow`` is a [kedro-plugin](https://kedro.readthedocs.io/en/stable/04_user_guide/10_developing_plugins.html) for lightweight and portable integration of [mlflow](https://mlflow.org/docs/latest/index.html) capabilities inside [kedro](https://kedro.readthedocs.io/en/stable/index.html) projects. It enforces [``Kedro`` principles](https://kedro.readthedocs.io/en/stable/12_faq/01_faq.html?highlight=principles#what-is-the-philosophy-behind-kedro) to make mlflow usage as production-ready as possible. Its core functionalities are:

- **versioning**: you can effortlessly register your parameters or your datasets with minimal configuration in a kedro run. Later, you will be able to browse your runs in the mlflow UI, and retrieve the runs you want. This is directly linked to [Mlflow Tracking](https://www.mlflow.org/docs/latest/tracking.html).
- **model packaging**: ``kedro-mlflow`` offers a convenient API to register a pipeline as a ``model`` in the mlflow sense. Consequently, you can *API-fy* or serve your kedro pipeline with one line of code, or share a model without worrying about the preprocessing needed for further use. This is directly linked to [Mlflow Models](https://www.mlflow.org/docs/latest/models.html).


# How do I install kedro-mlflow?

**Important: kedro-mlflow is only compatible with ``kedro>=0.16.0``. If you have a project created with an older version of ``Kedro``, see this [migration guide](https://github.com/quantumblacklabs/kedro/blob/master/RELEASE.md#migration-guide-from-kedro-015-to-016).**

``kedro-mlflow`` is available on PyPI, so you can install it with ``pip``:

```console
pip install kedro-mlflow
```

If you want to use the ``develop`` version of the package, which is the most up to date, you can install the package from GitHub:

```console
pip install --upgrade git+https://github.com/Galileo-Galilei/kedro-mlflow.git@develop
```

I strongly recommend using ``conda`` (a package manager) to create an environment and reading the [``kedro`` installation guide](https://kedro.readthedocs.io/en/stable/02_getting_started/01_prerequisites.html).


# Getting started

The documentation contains:

- [A "hello world" example](https://kedro-mlflow.readthedocs.io/en/latest/source/02_hello_world_example/index.html) which demonstrates how to **setup your project**, **version parameters** and **datasets**, and browse your runs in the UI.
- A more [detailed tutorial](https://kedro-mlflow.readthedocs.io/en/latest/source/03_tutorial/index.html) to show more advanced features (mlflow configuration through the plugin, package and serve a kedro ``Pipeline``...)

Some frequently asked questions on more advanced features:

- You want to log additional metrics to the run? -> [Try ``MlflowMetricsDataSet``](https://kedro-mlflow.readthedocs.io/en/latest/source/03_tutorial/07_version_metrics.html) !
- You want to log nice dataviz of your pipeline that you register with ``MatplotlibWriter``? -> [Try ``MlflowArtifactDataSet`` to log any local files (.png, .pkl, .csv...) *automagically*](https://kedro-mlflow.readthedocs.io/en/latest/source/02_hello_world_example/02_first_steps.html#artifacts)!
- You want to easily create an API to share your awesome model with anyone? -> [See if ``pipeline_ml_factory`` can fit your needs](https://github.com/Galileo-Galilei/kedro-mlflow/issues/16)
14 changes: 7 additions & 7 deletions docs/source/01_introduction/02_motivation.md
@@ -33,13 +33,13 @@ Above implementations have the advantage of being very straightforward and *mlfl
``kedro-mlflow`` enforces these best practices while implementing a clear interface for each mlflow action in Kedro template. Below chart maps the mlflow action to perform with the Python API provided by kedro-mlflow and the location in Kedro template where the action should be performed.

|Mlflow action |Template file |Python API |
|:----------------------------|:-----------------------|:-----------------------------------------------------------|
|Set up configuration         |``mlflow.yml``          |``MlflowPipelineHook``                                       |
|Logging parameters           |``mlflow.yml``          |``MlflowNodeHook``                                           |
|Logging artifacts            |``catalog.yml``         |``MlflowArtifactDataSet``                                    |
|Logging models               |``catalog.yml``         |``MlflowModelLoggerDataSet`` and ``MlflowModelSaverDataSet`` |
|Logging metrics              |``catalog.yml``         |``MlflowMetricsDataSet``                                     |
|Logging Pipeline as model    |``hooks.py``            |``KedroPipelineModel`` and ``pipeline_ml_factory``           |

In the current version (``kedro_mlflow==0.3.0``), `kedro-mlflow` does not provide an interface to set tags or log models outside a Kedro ``Pipeline``. These decisions are subject to debate and design choices (for instance, metrics are often updated in a loop during each epoch / training iteration, and it does not always make sense to register the metric between computation steps, e.g. as an I/O operation after a node run).

42 changes: 27 additions & 15 deletions docs/source/03_tutorial/06_version_models.md
@@ -2,40 +2,52 @@

## What is model tracking?

MLflow allows you to serialize and deserialize models to a common format, track those models in MLflow Tracking and manage them using the MLflow Model Registry. Many popular Machine / Deep Learning frameworks have built-in support through what MLflow calls [flavors](https://www.mlflow.org/docs/latest/models.html#built-in-model-flavors). Even if there is no flavor for your framework of choice, it is easy to [create your own flavor](https://www.mlflow.org/docs/latest/models.html#custom-python-models) and integrate it with MLflow.

## How to track models using MLflow in Kedro project?

`kedro-mlflow` introduces two new `DataSet` types that can be used in the `DataCatalog`, called `MlflowModelLoggerDataSet` and `MlflowModelSaverDataSet`. Their APIs are very similar, except that:

- the ``MlflowModelLoggerDataSet`` is used to load from and save to the mlflow artifact store. It takes an optional `run_id` argument to load from and save to a given `run_id`, which must exist in the mlflow server you are logging to.
- the ``MlflowModelSaverDataSet`` is used to load from and save to a given local path. It uses the standard `filepath` argument in the constructor of Kedro DataSets. Note that it **does not log in mlflow**.

*Important: the ``MlflowModelSaverDataSet`` is a dataset for advanced users who want fine-grained control and may want to tweak mlflow model management. You very likely want to __use the ``MlflowModelLoggerDataSet``__ instead.*

Suppose you would like to register a `scikit-learn` model of your `DataCatalog` in mlflow. You can use the following YAML API:

```yaml
my_sklearn_model:
    type: kedro_mlflow.io.models.MlflowModelLoggerDataSet
    flavor: mlflow.sklearn
```
More information on available parameters can be found in the [dedicated section](docs/source/05_python_objects/01_DataSets.md#mlflowmodelloggerdataset).
You are now able to use ``my_sklearn_model`` in your nodes. Since this model is registered in mlflow, you can also leverage the [mlflow model serving abilities](https://www.mlflow.org/docs/latest/cli.html#mlflow-models-serve) or [predicting on batch abilities](https://www.mlflow.org/docs/latest/cli.html#mlflow-models-predict), as well as the [mlflow models registry](https://www.mlflow.org/docs/latest/model-registry.html) to manage the lifecycle of this model.

## Frequently asked questions

### How is it working under the hood?

**For ``MlflowModelLoggerDataSet``**
During save, a model object from a node output is logged to mlflow using the ``log_model`` function of the specified ``flavor``. It is logged in the `run_id` run if one is specified and there is no active run, else in the currently active mlflow run. If a `run_id` is specified while a run is active, the save operation fails. Consequently, it is **never possible to save to a specific mlflow run_id** when launching a pipeline with the `kedro run` command, because the `MlflowPipelineHook` creates a new run before each pipeline run.

During load, the model is retrieved from the ``run_id`` if specified, else it is retrieved from the active mlflow run. If there is no active mlflow run, loading fails. This will never happen if you use the `kedro run` command, because the `MlflowPipelineHook` creates a new run before each pipeline run.
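The save-time run-resolution rules described above can be sketched as a small pure-Python helper (an illustrative reconstruction of the behaviour, not the plugin's actual code; the function name is made up):

```python
def resolve_target_run(run_id=None, active_run_id=None):
    """Illustrative sketch of MlflowModelLoggerDataSet's save-time rules.

    Mirrors the behaviour described above: an explicit run_id is only
    usable when no run is active, otherwise the active run wins.
    """
    if run_id is not None:
        if active_run_id is not None:
            # An explicit run_id plus an active run fails; this is why
            # `kedro run` (whose MlflowPipelineHook opens a new run) can
            # never target a specific run_id.
            raise RuntimeError(
                "cannot save to an explicit run_id while a run is active"
            )
        return run_id
    if active_run_id is not None:
        return active_run_id
    raise RuntimeError("no run_id given and no active mlflow run")
```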

**For ``MlflowModelSaverDataSet``**

During save, a model object from a node output is saved locally under the specified ``filepath``, using the ``save_model`` function of the specified ``flavor``.

When the model is loaded, the latest version stored locally is read using the ``load_model`` function of the specified ``flavor``. You can also load a model from a specific kedro run by passing the `version` argument to the constructor.
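The "latest version stored locally" lookup can be illustrated with a toy helper. This is a simplified sketch under the assumption that versions follow Kedro's timestamp naming (e.g. `2020-11-03T10.30.00.123Z`), where lexicographic and chronological order coincide; the real resolution is handled internally by Kedro's versioned dataset machinery:

```python
def latest_local_version(versions):
    # Kedro-style version identifiers sort lexicographically in
    # chronological order, so the latest one is simply the maximum
    return max(versions)
```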

### How can I track a custom MLflow model flavor?

To track a custom MLflow model flavor, you need to set the `flavor` parameter to the import path of your custom flavor and to specify a [pyfunc workflow](https://mlflow.org/docs/latest/python_api/mlflow.pyfunc.html#pyfunc-create-custom-workflows), which can be set either to `python_model` or `loader_module`. The former is more high-level and user-friendly and is [recommended by mlflow](https://mlflow.org/docs/latest/python_api/mlflow.pyfunc.html#which-workflow-is-right-for-my-use-case), while the latter offers more control. We have not tested the integration of this second workflow in `kedro-mlflow` extensively, and it should be used with caution.

```yaml
my_custom_model:
    type: kedro_mlflow.io.models.MlflowModelLoggerDataSet
    flavor: my_package.custom_mlflow_flavor
    pyfunc_workflow: python_model # or loader_module
```
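As a rough sketch of the `python_model` workflow mentioned above: you wrap your logic in a class exposing `predict(context, model_input)`. In a real project the class would subclass `mlflow.pyfunc.PythonModel`; a plain class with a purely illustrative "doubling" model is shown here so the sketch stays dependency-free:

```python
class MyCustomModel:
    """Illustrative python_model wrapper: 'predicts' by doubling inputs.

    In practice this would subclass mlflow.pyfunc.PythonModel.
    """

    def predict(self, context, model_input):
        # `context` exposes logged artifacts; unused in this sketch
        return [x * 2 for x in model_input]
```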

### Can I use Kedro versioning with `MlflowModelDataSet`?

### Can I load a model from a specific MLflow Run ID?
92 changes: 88 additions & 4 deletions docs/source/05_python_objects/01_DataSets.md
@@ -1,15 +1,20 @@
# New ``DataSet``

## ``MlflowArtifactDataSet``

``MlflowArtifactDataSet`` is a wrapper for any ``AbstractDataSet`` which logs the dataset automatically in mlflow as an artifact when its ``save`` method is called. It can be used both with the YAML API:

```yaml
my_dataset_to_version:
    type: kedro_mlflow.io.MlflowArtifactDataSet
    data_set:
        type: pandas.CSVDataSet # or any valid kedro DataSet
        filepath: /path/to/a/local/destination/file.csv
```
or with additional parameters:
```yaml
my_dataset_to_version:
    type: kedro_mlflow.io.MlflowArtifactDataSet
    data_set:
@@ -23,11 +28,90 @@ my_dataset_to_version:
    run_id: 13245678910111213 # a valid mlflow run to log in. If None, default to active run
    artifact_path: reporting # relative path where the artifact must be stored. If None, saved in root folder.
```
or with the python API:
```python
import pandas as pd

from kedro.extras.datasets.pandas import CSVDataSet
from kedro_mlflow.io import MlflowArtifactDataSet

csv_dataset = MlflowArtifactDataSet(
    data_set={"type": CSVDataSet, "filepath": r"/path/to/a/local/destination/file.csv"}
)
csv_dataset.save(data=pd.DataFrame({"a": [1, 2], "b": [3, 4]}))
```
## Models `DataSets`

### ``MlflowModelLoggerDataSet``

The ``MlflowModelLoggerDataSet`` accepts the following arguments:

- flavor (str): Built-in or custom MLflow model flavor module. Must be Python-importable.
- run_id (Optional[str], optional): MLflow run ID to load the model from or save the model to. It plays the same role as `filepath` for standard Kedro datasets. Defaults to None.
- artifact_path (str, optional): the run-relative path to the model.
- pyfunc_workflow (str, optional): Either `python_model` or `loader_module`. See [mlflow workflows](https://www.mlflow.org/docs/latest/python_api/mlflow.pyfunc.html#workflows).
- load_args (Dict[str, Any], optional): Arguments to the `load_model` function of the specified `flavor`. Defaults to None.
- save_args (Dict[str, Any], optional): Arguments to the `log_model` function of the specified `flavor`. Defaults to None.

You can specify only the flavor:

```python
from kedro_mlflow.io.models import MlflowModelLoggerDataSet
from sklearn.linear_model import LinearRegression

mlflow_model_logger = MlflowModelLoggerDataSet(flavor="mlflow.sklearn")
mlflow_model_logger.save(LinearRegression())
```

Let's assume that this first model has been saved once and you want to retrieve it (for prediction, for instance):

```python
mlflow_model_logger = MlflowModelLoggerDataSet(flavor="mlflow.sklearn", run_id=<the-model-run-id>)
my_linear_regression = mlflow_model_logger.load()
my_linear_regression.predict(<data>)  # will obviously fail if you have not fitted your model object first :)
```

You can also specify some [logging parameters](https://www.mlflow.org/docs/latest/python_api/mlflow.sklearn.html#mlflow.sklearn.log_model):

```python
mlflow_model_logger = MlflowModelLoggerDataSet(
    flavor="mlflow.sklearn",
    run_id=<the-model-run-id>,
    save_args={
        "conda_env": {"python": "3.7.0"},
        "input_example": data.iloc[0:5, :],
    },
)
mlflow_model_logger.save(LinearRegression().fit(data))
```

### ``MlflowModelSaverDataSet``

The ``MlflowModelSaverDataSet`` accepts the following arguments:

- flavor (str): Built-in or custom MLflow model flavor module. Must be Python-importable.
- filepath (str): Path to store the dataset locally.
- pyfunc_workflow (str, optional): Either `python_model` or `loader_module`. See [mlflow workflows](https://www.mlflow.org/docs/latest/python_api/mlflow.pyfunc.html#workflows).
- load_args (Dict[str, Any], optional): Arguments to `load_model` function from specified `flavor`. Defaults to None.
- save_args (Dict[str, Any], optional): Arguments to `save_model` function from specified `flavor`. Defaults to None.
- version (Version, optional): Kedro version to use. Defaults to None.

The use is very similar to ``MlflowModelLoggerDataSet``, except that you specify a `filepath` instead of a `run_id`:

```python
from kedro_mlflow.io.models import MlflowModelSaverDataSet
from sklearn.linear_model import LinearRegression

mlflow_model_saver = MlflowModelSaverDataSet(flavor="mlflow.sklearn", filepath="path/to/where/you/want/model")
mlflow_model_saver.save(LinearRegression().fit(data))
```

The same arguments are available, plus an additional [`version` argument common to every `AbstractVersionedDataSet`](https://kedro.readthedocs.io/en/stable/kedro.io.AbstractVersionedDataSet.html):

```python
mlflow_model_saver = MlflowModelSaverDataSet(
    flavor="mlflow.sklearn",
    filepath="path/to/where/you/want/model",
    version="<valid-kedro-version>",
)
my_model = mlflow_model_saver.load()
```
1 change: 0 additions & 1 deletion kedro_mlflow/io/__init__.py
@@ -1,3 +1,2 @@
from .mlflow_dataset import MlflowArtifactDataSet
from .mlflow_metrics_dataset import MlflowMetricsDataSet
from .mlflow_model_dataset import MlflowModelDataSet