Accessing the project context inside hooks implementations #66

Closed · takikadiri opened this issue Sep 20, 2020 · 3 comments · Fixed by #91
Labels: bug (Something isn't working) · need-design-decision (Several ways of implementation are possible and one must be chosen)
Milestone: Release 0.4.0

@takikadiri (Collaborator) commented Sep 20, 2020

Hi,

I'm a colleague of @Galileo-Galilei. We discussed the issue of accessing the project template and properties from a plugin, and we raised the point in the kedro project.

What it will solve:

  • No more template assumptions. We will ask the kedro context where the template components reside (project_path, conf_paths, src path, ...). That leads to fewer regressions on kedro updates, because we will no longer maintain a template on our side.

  • No more assumptions about config loading logic. We will use the kedro context's ConfigLoader and get_credentials (which can be overridden by the user).

That solves #64, #54, #30, #31 (users will rely on kedro's credentials mechanism; I'll put details in another issue) and part of #62.
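
To make this concrete, here is a minimal sketch (hypothetical; kedro 0.16-style API) of what the plugin would read from the context instead of assuming a template:

# Hypothetical sketch: everything below comes from the context,
# not from assumptions about the project template layout.
from kedro.framework.context import load_context

context = load_context(".")  # kedro 0.16 API
project_path = context.project_path  # no hard-coded template layout
mlflow_conf = context.config_loader.get("mlflow*")  # user-overridable ConfigLoader
credentials = context.config_loader.get("credentials*")  # kedro credentials mechanism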

How can we implement it? For now, I see three possibilities:

  • Calling load_context inside the before_pipeline_run hook and setting a global project_path variable at the same time, hoping that before_pipeline_run will always be the first hook called in the future. --> Very hacky

  • Declaring the mlflow hooks as a project context property, which means the hooks can easily access the current context. --> No more auto-registration

  • Waiting for the release of the kedro session, which will apparently allow access to the current context (see the sketch after this list). --> We don't know when it will land
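
For reference, a rough sketch of what the third option could look like, assuming the KedroSession API that eventually landed in kedro 0.17 (unreleased at the time of this discussion):

# Sketch of option 3, assuming the kedro 0.17 KedroSession API.
from kedro.framework.session import KedroSession, get_current_session

with KedroSession.create(package_name="my_project") as session:
    context = session.load_context()
    # inside a hook, the same context would be reachable without run_params:
    context = get_current_session().load_context()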

Keep up the good work guys, I'm happy to join you :)

@Galileo-Galilei Galileo-Galilei added bug Something isn't working need-design-decision Several ways of implementation are possible and one must be chosen labels Sep 23, 2020
@Galileo-Galilei Galileo-Galilei added this to the Release 0.4.0 milestone Sep 29, 2020
@Galileo-Galilei (Owner)

I suggest that we go with the first solution (call load_context in the before_pipeline_run hook), which is completely invisible to the user and will solve this painful problem. We will plan to migrate to KedroSession once it is official. @takikadiri Can you take it?

@takikadiri (Collaborator, Author)

I dug into the implementation details of the first solution, and I can tell you it's going to be very hacky.

The MlflowNodeHook also needs to access the context, but it doesn't have the data (project_path, env, extra_params) to create it. That means the MlflowNodeHook and some other hooks in the plugin expect the before_pipeline_run hook to create the context and store it in a global variable (or somewhere else) so that the other hooks can access it.

One of the many side effects of using a global variable as storage is that the order of the hooks list given by the user in their kedro project will matter: the before_pipeline_run hook always has to be first in the list, as the sketch below illustrates.
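
To illustrate, here is a condensed sketch of the global-variable workaround (hypothetical names, kedro 0.16 API):

# Hypothetical sketch of the global-variable workaround and its fragility.
from kedro.framework.context import load_context
from kedro.framework.hooks import hook_impl

_CONTEXT = None  # module-level storage, filled by the pipeline hook


class MlflowPipelineHook:
    @hook_impl
    def before_pipeline_run(self, run_params, pipeline, catalog):
        global _CONTEXT
        _CONTEXT = load_context(run_params["project_path"])


class MlflowNodeHook:
    @hook_impl
    def before_pipeline_run(self, run_params, pipeline, catalog):
        # Breaks if the user registers MlflowNodeHook before MlflowPipelineHook:
        # _CONTEXT would still be None at this point.
        self.config_loader = _CONTEXT.config_loader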

I suggest going for the second solution, while keeping in mind the migration to the third one as soon as kedro 0.17.0 is released. The second solution could look like the sketch below.
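
A minimal sketch, assuming a kedro 0.16-style ProjectContext in run.py; passing the context to the hook constructors is the hypothetical part (the current constructors don't take it):

# run.py -- hypothetical sketch of solution 2
from kedro.framework.context import KedroContext
from kedro_mlflow.framework.hooks import MlflowNodeHook, MlflowPipelineHook


class ProjectContext(KedroContext):
    project_name = "my_project"
    project_version = "0.16.5"

    @property
    def hooks(self):
        # Instantiating the hooks here lets them receive the current context
        # explicitly, instead of relying on auto-registration.
        return (
            MlflowPipelineHook(context=self),
            MlflowNodeHook(context=self),
        )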

Do you see another way to implement the first solution?

I'll create a pull request for this.

@Galileo-Galilei (Owner) commented Oct 1, 2020

I understand that the second solution is much cleaner than the first (mostly because the first creates another context object for each node), but I don't see why you need to create a global variable for the first. I think that creating an attribute on the fly is enough for our use cases (see here):

# imports needed by this snippet (method lives inside MlflowPipelineHook)
from typing import Any, Dict

from kedro.framework.context import load_context
from kedro.io import DataCatalog
from kedro.pipeline import Pipeline

def before_pipeline_run(
    self, run_params: Dict[str, Any], pipeline: Pipeline, catalog: DataCatalog
) -> None:
    """Hook to be invoked before a pipeline runs.

    Args:
        run_params: The params used to run the pipeline.
            Should be identical to the data logged by Journal with the following schema::

               {
                 "run_id": str,
                 "project_path": str,
                 "env": str,
                 "kedro_version": str,
                 "tags": Optional[List[str]],
                 "from_nodes": Optional[List[str]],
                 "to_nodes": Optional[List[str]],
                 "node_names": Optional[List[str]],
                 "from_inputs": Optional[List[str]],
                 "load_versions": Optional[List[str]],
                 "pipeline_name": str,
                 "extra_params": Optional[Dict[str, Any]]
               }

        pipeline: The ``Pipeline`` that will be run.
        catalog: The ``DataCatalog`` to be used during the run.
    """
    self.context = load_context(run_params["project_path"])
    self.env = run_params["env"]  # is it necessary?
    self.extra_params = run_params["extra_params"]  # is it necessary?

The information will be accessible later in the hook. For our use cases, it is even simpler:

  1. Remove __init__ in MlflowNodeHook:

     class MlflowNodeHook:
         def __init__(
             self, flatten_dict_params: bool = False, recursive: bool = True, sep: str = "."
         ):
             config = get_mlflow_config()
             self.flatten = config.node_hook_opts["flatten_dict_params"]
             self.recursive = config.node_hook_opts["recursive"]
             self.sep = config.node_hook_opts["sep"]
  2. Load the conf with the config loader at the beginning here:

     def before_node_run(
         self,
         node: Node,
         catalog: DataCatalog,
         inputs: Dict[str, Any],
         is_async: bool,
         run_id: str,
     ) -> None:
         """Hook to be invoked before a node runs.
         This hook logs all the parameters of the node in mlflow.

         Args:
             node: The ``Node`` to run.
             catalog: A ``DataCatalog`` containing the node's inputs and outputs.
             inputs: The dictionary of input datasets.
             is_async: Whether the node was run in ``async`` mode.
             run_id: The id of the run.
         """
         # only parameters will be logged. Artifacts must be declared manually in the catalog
         params_inputs = {}

something like:

     config = KedroMlflowConfig.from_dict(self.context.config_loader.get("mlflow*"))
...
  3. Do the same change here in MlflowPipelineHook:

     mlflow_conf = get_mlflow_config(
         project_path=run_params["project_path"], env=run_params["env"]
     )
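
Putting it together, MlflowNodeHook under solution 1 could then look like this sketch (the KedroMlflowConfig import path is an assumption):

# Sketch of MlflowNodeHook after the change (solution 1).
from kedro.framework.context import load_context
from kedro.framework.hooks import hook_impl
from kedro_mlflow.framework.context.config import KedroMlflowConfig  # path is an assumption


class MlflowNodeHook:
    @hook_impl
    def before_pipeline_run(self, run_params, pipeline, catalog):
        # create the attribute on the fly, as suggested above
        self.context = load_context(run_params["project_path"], env=run_params["env"])

    @hook_impl
    def before_node_run(self, node, catalog, inputs, is_async, run_id):
        config = KedroMlflowConfig.from_dict(
            self.context.config_loader.get("mlflow*")
        )
        self.flatten = config.node_hook_opts["flatten_dict_params"]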

And it should be OK. The reason why I prefer solution 1 over solution 2 is that the first one is a non-breaking change, while the second modifies the hooks' arguments and is consequently a breaking change. Since neither of them is the target solution, I think we can live with it for a few months. Does it sound good to you?
