Accessing the project context inside hooks implementations #66

Closed · takikadiri opened this issue Sep 20, 2020 · 3 comments · Fixed by #91
Labels: bug (Something isn't working) · need-design-decision (Several ways of implementation are possible and one must be chosen)
Milestone: Release 0.4.0

@takikadiri (Collaborator) commented Sep 20, 2020

Hi,

I'm a colleague of @Galileo-Galilei. We discussed the issue of accessing the project template and properties from a plugin, and we raised the point in the kedro project.

What it will solve:

  • No more template assumptions. We will ask the kedro context where the template components reside (project_path, conf_paths, src path, ...). That leads to fewer regressions on kedro updates, because we will no longer maintain a template on our side.

  • No more assumptions about config loading logic. We will use the kedro context's ConfigLoader and get_credentials (which can be overridden by the user).

That solves #64, #54, #30, #31 (users will rely on kedro's credentials mechanism; I'll put details in another issue) and part of #62.
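
To make this concrete, here is a minimal sketch (hypothetical; kedro 0.16-style API) of what the plugin would read from the context instead of assuming a template:

# Hypothetical sketch: everything below comes from the context,
# not from assumptions about the project template layout.
from kedro.framework.context import load_context

context = load_context(".")  # kedro 0.16 API
project_path = context.project_path  # no hard-coded template layout
mlflow_conf = context.config_loader.get("mlflow*")  # user-overridable ConfigLoader
credentials = context.config_loader.get("credentials*")  # kedro credentials mechanism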

How can we implement it? For now, I see three possibilities:

  • Calling load_context inside the before_pipeline_run hook and setting a global project_path variable at the same time, hoping that before_pipeline_run will always be the first hook called in the future. --> Very hacky

  • Declaring the mlflow hooks as a project context property, which means the hooks can easily access the current context. --> No more auto-registration

  • Waiting for the release of the kedro session, which will apparently allow access to the current context (see the sketch after this list). --> We don't know when it will land
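
For reference, a rough sketch of what the third option could look like, assuming the KedroSession API that eventually landed in kedro 0.17 (unreleased at the time of this discussion):

# Sketch of option 3, assuming the kedro 0.17 KedroSession API.
from kedro.framework.session import KedroSession, get_current_session

with KedroSession.create(package_name="my_project") as session:
    context = session.load_context()
    # inside a hook, the same context would be reachable without run_params:
    context = get_current_session().load_context()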

Keep up the good work guys, I'm happy to join you :)

@Galileo-Galilei Galileo-Galilei added bug Something isn't working need-design-decision Several ways of implementation are possible and one must be chosen labels Sep 23, 2020
@Galileo-Galilei Galileo-Galilei added this to the Release 0.4.0 milestone Sep 29, 2020
@Galileo-Galilei (Owner)

I suggest that we go with the first solution (call load_context in the before_pipeline_run hook), which is completely invisible to the user and will solve this painful problem. We will plan to migrate to KedroSession once it is official. @takikadiri Can you take it?

@takikadiri (Collaborator, Author)

I dug into the implementation details of the first solution, and I can tell you it's going to be very hacky.

The MlflowNodeHook also needs to access the context, but it doesn't have the data (project_path, env, extra_params) to create it. That means the MlflowNodeHook and some other hooks in the plugin expect the before_pipeline_run hook to create the context and store it in a global variable (or somewhere else) so that the other hooks can access it.

One of the many side effects of using a global variable as storage is that the order of the hooks list given by the user in their kedro project will matter: the before_pipeline_run hook always has to be first in the list, as the sketch below illustrates.
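
To illustrate, here is a condensed sketch of the global-variable workaround (hypothetical names, kedro 0.16 API):

# Hypothetical sketch of the global-variable workaround and its fragility.
from kedro.framework.context import load_context
from kedro.framework.hooks import hook_impl

_CONTEXT = None  # module-level storage, filled by the pipeline hook


class MlflowPipelineHook:
    @hook_impl
    def before_pipeline_run(self, run_params, pipeline, catalog):
        global _CONTEXT
        _CONTEXT = load_context(run_params["project_path"])


class MlflowNodeHook:
    @hook_impl
    def before_pipeline_run(self, run_params, pipeline, catalog):
        # Breaks if the user registers MlflowNodeHook before MlflowPipelineHook:
        # _CONTEXT would still be None at this point.
        self.config_loader = _CONTEXT.config_loader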

I suggest going for the second solution, while keeping in mind the migration to the third one as soon as kedro 0.17.0 is released. The second solution could look like the sketch below.
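
A minimal sketch, assuming a kedro 0.16-style ProjectContext in run.py; passing the context to the hook constructors is the hypothetical part (the current constructors don't take it):

# run.py -- hypothetical sketch of solution 2
from kedro.framework.context import KedroContext
from kedro_mlflow.framework.hooks import MlflowNodeHook, MlflowPipelineHook


class ProjectContext(KedroContext):
    project_name = "my_project"
    project_version = "0.16.5"

    @property
    def hooks(self):
        # Instantiating the hooks here lets them receive the current context
        # explicitly, instead of relying on auto-registration.
        return (
            MlflowPipelineHook(context=self),
            MlflowNodeHook(context=self),
        )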

Do you see another way to implement the first solution?

I'll create a pull request for this.

@Galileo-Galilei (Owner) commented Oct 1, 2020

I understand that the second solution is much cleaner than the first (mostly because the first creates another context object for each node), but I don't see why you need to create a global variable for the first. I think that creating an attribute on the fly is enough for our use cases (see here):

# imports needed by this snippet (method lives inside MlflowPipelineHook)
from typing import Any, Dict

from kedro.framework.context import load_context
from kedro.io import DataCatalog
from kedro.pipeline import Pipeline

def before_pipeline_run(
    self, run_params: Dict[str, Any], pipeline: Pipeline, catalog: DataCatalog
) -> None:
    """Hook to be invoked before a pipeline runs.

    Args:
        run_params: The params used to run the pipeline.
            Should be identical to the data logged by Journal with the following schema::

               {
                 "run_id": str,
                 "project_path": str,
                 "env": str,
                 "kedro_version": str,
                 "tags": Optional[List[str]],
                 "from_nodes": Optional[List[str]],
                 "to_nodes": Optional[List[str]],
                 "node_names": Optional[List[str]],
                 "from_inputs": Optional[List[str]],
                 "load_versions": Optional[List[str]],
                 "pipeline_name": str,
                 "extra_params": Optional[Dict[str, Any]]
               }

        pipeline: The ``Pipeline`` that will be run.
        catalog: The ``DataCatalog`` to be used during the run.
    """
    self.context = load_context(run_params["project_path"])
    self.env = run_params["env"]  # is it necessary?
    self.extra_params = run_params["extra_params"]  # is it necessary?

The information will be accessible later in the hook. For our use cases, it is even simpler:

  1. Remove __init__ in MlflowNodeHook:

     class MlflowNodeHook:
         def __init__(
             self, flatten_dict_params: bool = False, recursive: bool = True, sep: str = "."
         ):
             config = get_mlflow_config()
             self.flatten = config.node_hook_opts["flatten_dict_params"]
             self.recursive = config.node_hook_opts["recursive"]
             self.sep = config.node_hook_opts["sep"]
  2. Load the conf with the config loader at the beginning here:

     def before_node_run(
         self,
         node: Node,
         catalog: DataCatalog,
         inputs: Dict[str, Any],
         is_async: bool,
         run_id: str,
     ) -> None:
         """Hook to be invoked before a node runs.
         This hook logs all the parameters of the node in mlflow.

         Args:
             node: The ``Node`` to run.
             catalog: A ``DataCatalog`` containing the node's inputs and outputs.
             inputs: The dictionary of input datasets.
             is_async: Whether the node was run in ``async`` mode.
             run_id: The id of the run.
         """
         # only parameters will be logged. Artifacts must be declared manually in the catalog
         params_inputs = {}

something like:

     config = KedroMlflowConfig.from_dict(self.context.config_loader.get("mlflow*"))
...
  3. Do the same change here in MlflowPipelineHook:

     mlflow_conf = get_mlflow_config(
         project_path=run_params["project_path"], env=run_params["env"]
     )
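
Putting it together, MlflowNodeHook under solution 1 could then look like this sketch (the KedroMlflowConfig import path is an assumption):

# Sketch of MlflowNodeHook after the change (solution 1).
from kedro.framework.context import load_context
from kedro.framework.hooks import hook_impl
from kedro_mlflow.framework.context.config import KedroMlflowConfig  # path is an assumption


class MlflowNodeHook:
    @hook_impl
    def before_pipeline_run(self, run_params, pipeline, catalog):
        # create the attribute on the fly, as suggested above
        self.context = load_context(run_params["project_path"], env=run_params["env"])

    @hook_impl
    def before_node_run(self, node, catalog, inputs, is_async, run_id):
        config = KedroMlflowConfig.from_dict(
            self.context.config_loader.get("mlflow*")
        )
        self.flatten = config.node_hook_opts["flatten_dict_params"]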

And it should be OK. The reason why I prefer solution 1 over solution 2 is that the first one is a non-breaking change, while the second modifies the hooks' arguments and is consequently a breaking change. Since neither of them is the target solution, I think we can live with it for a few months. Does it sound good to you?
