Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support summation for PipelineML #554

Closed
mck-star-yar opened this issue May 14, 2024 · 4 comments
Closed

Support summation for PipelineML #554

mck-star-yar opened this issue May 14, 2024 · 4 comments
Labels
documentation Improvements or additions to documentation

Comments

@mck-star-yar
Copy link

Description

This feature would simplify pipeline ensembling a lot if PipelineML would implement __sum__ and other dunder methods.

Here's a simplified example. We have ds pipeline defined as data_prep + training + report_training, where actual model training happens in the training. All four are exposed in the model registry. So users can trigger training pipelines by calling either ds or training. Given I can't append the result of pipeline_ml_factory to define ds, I should wrap both ds and training with ml wrapper:

Current code:

data_prep = ...
training = ...
inference = ...
training_ml = pipeline_ml_factory(
	training_pipeline=training,
	inference=inference,
	input_name="feature_table",
	log_model_kwargs={
		"artifact_path": "some/path/to/artifact",
		"registered_model_name": "some_model_name",
	}
)
report_training = ...

ds = data_prep + training + report_training
ds_ml = pipeline_ml_factory(
	training_pipeline=ds,
	inference=inference,
	input_name="feature_table",
	log_model_kwargs={
		"artifact_path": "some/path/to/artifact",
		"registered_model_name": "some_model_name",
	}
)
pipeline_registry = {
  'ds': ds_ml,
  'training': training_ml,
  'data_prep': data_prep,
  'report_training': report_training,
}

Desired code:

data_prep = ...
training = ...
inference = ...
training = pipeline_ml_factory(
	training_pipeline=training,
	inference=inference,
	input_name="feature_table",
	log_model_kwargs={
		"artifact_path": "some/path/to/artifact",
		"registered_model_name": "some_model_name",
	}
)
report_training = ...
ds = data_prep + training + report_training

pipeline_registry = {
  'ds': ds,
  'training': training,
  'data_prep': data_prep,
  'report_training': report_training,
}

Of course the alternative here is to extract this piece of code and do wrapping for each of the functions that contain training step; but that'd require manual tracking each time the pipeline is updated opposed to having only training step wrapped into this new class.

I can see that this is the behavior by design. Is there any specific reason why this limitation is imposed?

Possible Implementation

Implement dunder methods of PipelineML

@mck-star-yar mck-star-yar changed the title Make PipelineML summable Support summation for PipelineML May 14, 2024
@Galileo-Galilei Galileo-Galilei added the need-design-decision Several ways of implementation are possible and one must be chosen label May 15, 2024
@Galileo-Galilei
Copy link
Owner

I can see the value of this, but this is a very difficult problem, because :

  • the sum / filter of a Pipeline and a PipelineML is not a PipelineML in general
  • even the sum / filter of a PipelineML and a PipelineML is not a PipelineML in general

mostly because summing / filtering pipelines breaks the assumption of a single input, and all inputs of inference are outputs of training. You need to know exactly what you are doing to ensure your pipeline is indeed a PipelineML.

I could eventually let summation happen, and have an error message in case of invalid output, but I am afraid this will raise even more questions...

@mck-star-yar
Copy link
Author

Yeah, that makes perfect sense. I think we should leave it as it is. And you could mention this discussion in the code itself to verbalize this design decision.

@Galileo-Galilei
Copy link
Owner

I've palyed a little bit with it and it's possible to add one PipelineML with a Pipeline, but it dos not make sense to add two PipelineML altogether (we can add the training, but what about the inference.

As discussed, I'll just document the decision

@Galileo-Galilei Galileo-Galilei added documentation Improvements or additions to documentation and removed need-design-decision Several ways of implementation are possible and one must be chosen labels May 19, 2024
@Galileo-Galilei
Copy link
Owner

Closed by #584

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation
Projects
Status: ✅ Done
Development

No branches or pull requests

2 participants