Support summation for `PipelineML` #554

mck-star-yar · 2024-05-14T13:26:18Z

Description

This feature would simplify pipeline ensembling a lot if PipelineML would implement __sum__ and other dunder methods.

Here's a simplified example. We have ds pipeline defined as data_prep + training + report_training, where actual model training happens in the training. All four are exposed in the model registry. So users can trigger training pipelines by calling either ds or training. Given I can't append the result of pipeline_ml_factory to define ds, I should wrap both ds and training with ml wrapper:

Current code:

data_prep = ...
training = ...
inference = ...
training_ml = pipeline_ml_factory(
	training_pipeline=training,
	inference=inference,
	input_name="feature_table",
	log_model_kwargs={
		"artifact_path": "some/path/to/artifact",
		"registered_model_name": "some_model_name",
	}
)
report_training = ...

ds = data_prep + training + report_training
ds_ml = pipeline_ml_factory(
	training_pipeline=ds,
	inference=inference,
	input_name="feature_table",
	log_model_kwargs={
		"artifact_path": "some/path/to/artifact",
		"registered_model_name": "some_model_name",
	}
)
pipeline_registry = {
  'ds': ds_ml,
  'training': training_ml,
  'data_prep': data_prep,
  'report_training': report_training,
}

Desired code:

data_prep = ...
training = ...
inference = ...
training = pipeline_ml_factory(
	training_pipeline=training,
	inference=inference,
	input_name="feature_table",
	log_model_kwargs={
		"artifact_path": "some/path/to/artifact",
		"registered_model_name": "some_model_name",
	}
)
report_training = ...
ds = data_prep + training + report_training

pipeline_registry = {
  'ds': ds,
  'training': training,
  'data_prep': data_prep,
  'report_training': report_training,
}

Of course the alternative here is to extract this piece of code and do wrapping for each of the functions that contain training step; but that'd require manual tracking each time the pipeline is updated opposed to having only training step wrapped into this new class.

I can see that this is the behavior by design. Is there any specific reason why this limitation is imposed?

Possible Implementation

Implement dunder methods of PipelineML

The text was updated successfully, but these errors were encountered:

Galileo-Galilei · 2024-05-15T22:11:32Z

I can see the value of this, but this is a very difficult problem, because :

the sum / filter of a Pipeline and a PipelineML is not a PipelineML in general
even the sum / filter of a PipelineML and a PipelineML is not a PipelineML in general

mostly because summing / filtering pipelines breaks the assumption of a single input, and all inputs of inference are outputs of training. You need to know exactly what you are doing to ensure your pipeline is indeed a PipelineML.

I could eventually let summation happen, and have an error message in case of invalid output, but I am afraid this will raise even more questions...

mck-star-yar · 2024-05-17T15:30:43Z

Yeah, that makes perfect sense. I think we should leave it as it is. And you could mention this discussion in the code itself to verbalize this design decision.

Galileo-Galilei · 2024-05-19T13:27:56Z

I've palyed a little bit with it and it's possible to add one PipelineML with a Pipeline, but it dos not make sense to add two PipelineML altogether (we can add the training, but what about the inference.

As discussed, I'll just document the decision

…ring (#554)

Galileo-Galilei · 2024-08-26T19:57:18Z

Closed by #584

mck-star-yar changed the title ~~Make PipelineML summable~~ Support summation for PipelineML May 14, 2024

Galileo-Galilei added the need-design-decision Several ways of implementation are possible and one must be chosen label May 15, 2024

Galileo-Galilei added documentation Improvements or additions to documentation and removed need-design-decision Several ways of implementation are possible and one must be chosen labels May 19, 2024

Galileo-Galilei mentioned this issue Aug 26, 2024

🔊 Use MlflowHook internal logger and log the active run (#567) #584

Merged

6 tasks

Galileo-Galilei added a commit that referenced this issue Aug 26, 2024

📝 Various documentation improvement especially about PipelineML filte…

0e4fe31

…ring (#554)

Galileo-Galilei added a commit that referenced this issue Aug 26, 2024

📝 Various documentation improvement especially about PipelineML filte…

52aecc7

…ring (#554)

Galileo-Galilei closed this as completed Aug 26, 2024

Galileo-Galilei moved this to ✅ Done in kedro-mlflow roadmap Oct 29, 2024

Galileo-Galilei added this to kedro-mlflow roadmap Oct 29, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support summation for `PipelineML` #554

Support summation for `PipelineML` #554

mck-star-yar commented May 14, 2024

Galileo-Galilei commented May 15, 2024

mck-star-yar commented May 17, 2024

Galileo-Galilei commented May 19, 2024

Galileo-Galilei commented Aug 26, 2024

Support summation for PipelineML #554

Support summation for PipelineML #554

Comments

mck-star-yar commented May 14, 2024

Description

Possible Implementation

Galileo-Galilei commented May 15, 2024

mck-star-yar commented May 17, 2024

Galileo-Galilei commented May 19, 2024

Galileo-Galilei commented Aug 26, 2024

Support summation for `PipelineML` #554

Support summation for `PipelineML` #554