Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: #12 Create a MlflowModelDataSet #56

Merged
merged 2 commits into from
Nov 3, 2020

Conversation

turn1a
Copy link
Contributor

@turn1a turn1a commented Aug 27, 2020

Closes #12.

Opening PR to continue discussions about this implementation here.

In this implementation, I'm both saving locally and logging to MLflow.

TODO:

  • Agree on _save strategy: save+log vs log only
  • Saving to specified run_id - I wasn't able to find a way to log a model to a particular run.
  • Documentation
  • Changelog update
  • Implement pyfunc support

@turn1a turn1a self-assigned this Aug 27, 2020
@turn1a turn1a marked this pull request as draft September 3, 2020 15:31
@turn1a turn1a mentioned this pull request Sep 8, 2020
@Galileo-Galilei
Copy link
Owner

For the record, I really think that MlflowModelDataSet should not perform logging in mlflow (and associated run_id management you seem in trouble with) and delegate this to MlflowDataSet. It will make this DataSet minimal and still offer all desirable functionalities in an eay to use user interface. (e.g calling (MlflowDataSet(data_set={type:MlflowModelDataSet, filepath:/path/to/local/folder}) seems fine).

@turn1a
Copy link
Contributor Author

turn1a commented Sep 10, 2020

Apart from just logging the model as an artifact in a run, log_model also creates a model version and registers it when registered_model_name is specified in log_model. If we go for save_model and log it as an artifact using MlflowArtifactDataSet, we will lose this functionality and quite possibly others that might be added in the future.

registered_model_name – (Experimental) If given, create a model version under registered_model_name, also creating a registered model if one with the given name does not exist.

@Galileo-Galilei
Copy link
Owner

Yes, you're totally right. I still think that saving/loading locally and remotely are two different operations that should be performed in different datasets. It would be both easier to manage (for us) and to use (for end users).

@turn1a
Copy link
Contributor Author

turn1a commented Sep 23, 2020

Let's discuss this when on the call.

@Galileo-Galilei Galileo-Galilei mentioned this pull request Oct 1, 2020
3 tasks
@Galileo-Galilei
Copy link
Owner

Hello @kaemo, when do you plan to finish on this PR? It looks like it is 95% finished, and it is a frequently required feature. What prevent you from merging it right now? The sooner we deploy it, the more we can get feedback to improve it. Feel free to ask if you need help.

@Galileo-Galilei Galileo-Galilei force-pushed the 12-create-a-mlflowmodeldataset branch from fb68c79 to c2b1d96 Compare November 1, 2020 21:47
@codecov-io
Copy link

codecov-io commented Nov 1, 2020

Codecov Report

Merging #56 into develop will not change coverage.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff            @@
##           develop       #56   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           20        24    +4     
  Lines          634       713   +79     
=========================================
+ Hits           634       713   +79     
Impacted Files Coverage Δ
kedro_mlflow/io/models/__init__.py 100.00% <100.00%> (ø)
..._mlflow/io/models/mlflow_abstract_model_dataset.py 100.00% <100.00%> (ø)
...ro_mlflow/io/models/mlflow_model_logger_dataset.py 100.00% <100.00%> (ø)
...dro_mlflow/io/models/mlflow_model_saver_dataset.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9adb5ef...a166982. Read the comment docs.

@Galileo-Galilei
Copy link
Owner

Hello @kaemo, a few updates:

  • I have rebased to be up to date with the develop branch and I fixed the conflicts. It will likely mess up your local version so update it accordingly.
  • I have splitted your dataset into two datasets
    • one for logging/ loading from a run_id called MlflowModelLoggerDataSet
    • one for saving locally / reading from file , called MlfflowModelSaverDataSet
  • I have put them in a models subfolder and open an issue to refactor the io folder : Refactor kedro_mlflow.io folder #109
  • Update tests and documentation accordingly
  • Fix a few typos:
    • do not use default dict argumetns which are mutable
    • change requiremetns from sklearn to scikit-learn
    • Fix a typo in a if condition (mflow.pyfunc instead of mlflow.pyfunc)

It would be great if we could discuss it this week. Feel free to make any comment if you are not comfortable with these changes

@Galileo-Galilei Galileo-Galilei force-pushed the 12-create-a-mlflowmodeldataset branch from c2b1d96 to a166982 Compare November 3, 2020 19:33
@Galileo-Galilei Galileo-Galilei marked this pull request as ready for review November 3, 2020 19:34
@Galileo-Galilei
Copy link
Owner

Hello @kaemo, I merge it for the next release. Feel free to reopen if you want to discuss / refactor. Thank you very much for the amazing work!

@Galileo-Galilei Galileo-Galilei merged commit 1dda1e4 into develop Nov 3, 2020
@Galileo-Galilei Galileo-Galilei deleted the 12-create-a-mlflowmodeldataset branch November 13, 2020 21:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Create a MlflowModelDataSet
4 participants