Create a MlflowMetricDataSet #9
Comments
@Galileo-Galilei did you do some work on it? I would be happy to take it over and propose a solution as a PR. (Sorry for interrupting your holidays...)
Hello @akruszewski, thank you so much for taking this one. I have some general design decisions about it:

General principles

I wish all …

Consistency with …
@Galileo-Galilei I agree with basically everything, but there's one use case that needs to be solved and which your proposed solution does not address. When there are two pipelines (say training and prediction), the first one produces a dataset/model/metric and the second one is executed as a separate run and depends on one of those artifacts (in a broad sense). There is no way to run such pipelines one after the other, because you would need to specify the run_id of the first run manually. I see two possible solutions:
Let me know how you think this can be solved. Manually specifying a run_id …
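For illustration, here is a minimal sketch of what manually passing a run_id between the two pipelines could look like with the plain mlflow client. The helper name `load_model_from_run`, the artifact path and the pickle format are assumptions for the example, not anything from the plugin:

```python
import pickle

from mlflow.tracking import MlflowClient


def load_model_from_run(run_id: str, artifact_path: str = "model/model.pkl"):
    """Download an artifact logged by a previous (training) run and deserialize it."""
    client = MlflowClient()
    # download_artifacts returns the local path of the downloaded file
    local_path = client.download_artifacts(run_id, artifact_path)
    with open(local_path, "rb") as f:
        return pickle.load(f)


# The run_id of the training run has to be provided by hand (CLI argument,
# config entry, ...), which is exactly the friction point described above.
model = load_model_from_run("<run_id_of_the_training_run>")
```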
I totally agree, and that's part of what I had in mind when I wrote
but I did not write this post fast enough ;)

Thoughts about your workflow

First of all, I think that the workflow you describe (two separate pipelines, one for training, one for prediction) only concerns artifacts (and models, which are special artifacts), but not parameters nor metrics. I don't have any common use case where you may want to retrieve parameters/metrics in another run: params which need to be reused for prediction are always stored in an object which will be persisted as an artifact, and metrics are "terminal" objects: another pipeline will likely use other data and calculate other metrics.

The point you are describing is one of the major disagreements between the data scientists and the data engineers at work (they do not use this plugin but a custom-made version, it does not matter here). Data scientists want to perform the operation you describe (load the latest version on disk without providing a run_id, reuse a model a coworker copy/pasted locally), while data engineers want this operation (providing the run_id) to be manual, with the artifacts downloaded from mlflow as the single source of truth, because when they deploy to production they want an extra check after training the model. Data engineers insist that manually providing the run_id is the responsibility of the "ops guy". They really stand against just using "the last version", to avoid operational risk. The consensus we reached is to force providing the run id when running the pipeline as a global parameter (we use the …). I don't feel this is the right solution for us though, because the plugin would not be self-contained and it would imply messing with the project template, which is a moving part. It would largely hurt the portability and ease of use of the plugin.

Suggestion for the DataSets of the plugin
This implementation would also enable the following entry in the catalog:

```yaml
my_model:
  type: kedro_mlflow.io.MlflowArtifactDataSet
  data_set:
    type: kedro_mlflow.io.MlflowLocalModelDataSet  # or any valid kedro DataSet
    filepath: /path/to/a/LOCAL/destination/folder  # must be a local folder
```

It would make point 2.ii irrelevant, since it would be completely redundant with the above entry. Would it match all your needs if we do it this way?
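For context, here is a rough, simplified sketch of the wrapper pattern such an entry relies on (my own illustration, not the plugin's actual code): the inner dataset persists the data locally, and the wrapper logs the resulting file to the active mlflow run at save time. The class name `ArtifactLoggingWrapper` and the direct access to the inner dataset's `_filepath` are assumptions made for brevity:

```python
import mlflow


class ArtifactLoggingWrapper:
    """Illustrative wrapper: delegate persistence to any kedro dataset,
    then push the locally saved file to the active mlflow run."""

    def __init__(self, data_set, artifact_path=None):
        self._data_set = data_set            # e.g. a kedro PickleDataSet
        self._artifact_path = artifact_path  # optional subfolder in the run's artifacts

    def save(self, data):
        self._data_set.save(data)  # 1. save locally, exactly as kedro would
        # 2. log the local file to mlflow (assumes the inner dataset exposes _filepath)
        mlflow.log_artifact(str(self._data_set._filepath), self._artifact_path)

    def load(self):
        # loading still goes through the local copy, not through mlflow
        return self._data_set.load()
```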
Hi @Galileo-Galilei, thanks for your review! I think that I implemented most of the things, but I also have a few topics to cover in the discussion. If I omitted something, please point it out here. I pushed a branch today with the second implementation of this issue. In this one MlflowMetricsDataSet:
What I didn't implement and why?

Treating MlflowMetricsDataSet as an in-memory dataset

As you mention in the comment on my original PR, we should limit side effects to a bare minimum. As logging to MLflow is our main task (and a side effect in terms of function purity), we probably should avoid putting it in the … The second argument would be that MlflowMetricsDataSet is not an in-memory dataset, but rather a dataset that is persistent. Of course, if you think that it is still better to have the …

Why I left the …
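To make the discussion concrete, here is a stripped-down sketch of what a persistent (not in-memory) metrics dataset could look like. Names such as `SimpleMlflowMetricsDataSet`, the `prefix`/`run_id` arguments and their behaviour are placeholders of mine, not the API of the PR: saving logs the metrics to the active run, and loading reads them back through the tracking client rather than from memory.

```python
import mlflow
from mlflow.tracking import MlflowClient


class SimpleMlflowMetricsDataSet:
    """Illustrative only: a persistent metrics dataset backed by mlflow."""

    def __init__(self, prefix=None, run_id=None):
        self._prefix = prefix  # e.g. the catalog entry name, used to namespace keys
        self._run_id = run_id  # if None, fall back to the currently active run

    def _key(self, name):
        return f"{self._prefix}.{name}" if self._prefix else name

    def save(self, metrics):
        # metrics is expected to be a Dict[str, float]
        for name, value in metrics.items():
            mlflow.log_metric(self._key(name), float(value))

    def load(self):
        run_id = self._run_id or mlflow.active_run().info.run_id
        run = MlflowClient().get_run(run_id)
        # run.data.metrics only exposes the latest value of each metric
        return {
            key: value
            for key, value in run.data.metrics.items()
            if self._prefix is None or key.startswith(f"{self._prefix}.")
        }
```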
@Galileo-Galilei I forgot to mention that in my opinion there is no point in doing the second dataset.
PR: #49 @kaemo @Galileo-Galilei I also have an idea, let me know what you think about it. If you can find the time, I would be happy to have a live session (chat/video chat/another live channel of communication) where we could discuss this topic.
Hello, I agree on almost everything. Some comments:
The more I think about it, the more I agree with you. My first idea was to enable the possibility to load from one run and log in another, because some data scientists do this manually for some artifacts/models (as @kaemo suggested above, they share models locally during the experimentation phase even if it sounds like a bad practice for further productionizing). However:
Conclusion: let's pass run_id to the constructor
Agreed, let's keep mlflow's behaviour even if I don't like it and think, like you, that it should rather fail. It should not have any impact while running a pipeline from the command line (because hooks properly manage run opening and closing), but it will change behaviour in interactive mode.
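For the record, a tiny sketch of the mlflow behaviour referred to here: in interactive use, the fluent API silently starts a run if none is active instead of failing. Making the check explicit shows what happens under the hood:

```python
import mlflow

# With no run active, mlflow.log_metric does not fail: mlflow implicitly
# starts a new run and logs the metric into it.
if mlflow.active_run() is None:
    mlflow.start_run()
mlflow.log_metric("accuracy", 0.95)
mlflow.end_run()
```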
It should also handle only "float" values, shouldn't it?
I wish we could, but unfortunately I don't think we can pass the timestamp key as an argument to log_metric, according to the mlflow documentation.
Agreed. My idea here was to avoid an extra HTTP connection when loading from a remote database, but it is really not a big issue and avoiding side effects is more important to me.
I totally agree that it would be much better to retrieve the name of the DataCatalog. I think we can achieve it the following way:
I think that having automatic consistency with the DataCatalog is a fair compensation for the additional complexity/side effect introduced by such an implementation.
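A hedged sketch of how such automatic naming could be wired in with a kedro hook; the hook class name, the duck-typed `_prefix` attribute and the access to the catalog's private `_data_sets` dict are all assumptions for illustration, not necessarily the approach elided above:

```python
from kedro.framework.hooks import hook_impl


class MlflowMetricsNamingHook:
    """Illustration: after the catalog is created, give each metrics dataset
    a prefix equal to its catalog entry name, so the metric keys logged in
    mlflow stay consistent with the DataCatalog."""

    @hook_impl
    def after_catalog_created(self, catalog):
        for name, data_set in catalog._data_sets.items():
            # Duck-typed check: any dataset exposing an unset `_prefix`
            # attribute gets the catalog key as its prefix.
            if getattr(data_set, "_prefix", "not a metrics dataset") is None:
                data_set._prefix = name
```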
Agreed, it will introduce too much code redundancy for very little additional gain. P.S.: The call is a very good idea. I've sent you a LinkedIn invitation so we can exchange our contact details privately.
I forgot to write it, but using the most recent run for loading is completely out of the possible solutions. Indeed, I've learnt that some teams use a common mlflow for all data scientists (unlike my team, where all data scientists have their own instance they can handle as they want, plus a shared one for sharing models where training is triggered by CI/CD). This leads to conflicting write issues (several runs can be launched by different data scientists at the same time). I feel that it is a very bad decision (and they complain that their mlflow is a total mess), but it is still what they use right now, and we cannot exclude the possibility that even for my team the shared mlflow can have conflicts if several runs are launched concurrently (especially when models take a long time to train, e.g. deep learning models).
Context

As of today, kedro-mlflow offers a clear way to log parameters (through a `Hook`) and artifacts (through the `MlflowArtifactDataSet` class in the `catalog.yml`). However, there is no well-defined way to log metrics automatically in mlflow within the plugin. The user still has to log the metrics directly by calling `log_metric` within their self-defined functions. This is not very convenient nor parametrizable, and makes the code less portable and messier.

Feature description

Provide a unique and well-defined way to log metrics through the plugin.

Possible Implementation

The easiest implementation would be to create a `MlflowMetricDataSet`, very similar to `MlflowArtifactDataSet`, to enable logging the metric directly in the `catalog.yml`. The main problem of this approach is that some metrics evolve over time, and we would like to log the metric on each update. This is not possible with this approach because the updates are made inside the node (when it is running), and not at the end.
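One possible way around this limitation, sketched under my own assumptions (the helper name and the choice to log the whole history at save time are not taken from the issue): the node collects the metric's history as a plain list, and the dataset logs it in one go with mlflow's step argument once the node has finished.

```python
from mlflow.tracking import MlflowClient


def save_metric_history(run_id: str, key: str, values: list) -> None:
    """Log a metric that evolved during the node run, one point per step."""
    client = MlflowClient()
    for step, value in enumerate(values):
        # each call records one (value, step) point of the metric's history
        client.log_metric(run_id, key, float(value), step=step)


# e.g. a training node returns the loss measured at every epoch:
# save_metric_history(run_id, "train_loss", [0.9, 0.5, 0.3, 0.2])
```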