Add configuration to disable hooks tracing #4504

Open
marcelopio opened this issue Feb 19, 2025 · 4 comments
Labels
Component: Framework Issue/PR that addresses core framework functionality Component: Logging Issue: Feature Request New feature or improvement to existing feature

Comments

@marcelopio

I am migrating from kedro 0.18.8 to 0.19.10 and suddenly all my pipelines are slower. In one instance a pipeline that was taking 20min is now taking 1h.

After a lot of investigation I narrowed the problem to these two lines:

manager.trace.root.setwriter(logger.debug)
manager.enable_tracing()

These lines enable tracing for hooks, which in pluggy leads to this formatting code:
https://github.com/pytest-dev/pluggy/blob/4eb41bb532fe1edc4efe756367d145f626b82a95/src/pluggy/_tracing.py#L37-L38

Some of my datasets are very big, and when the 'after_node_run' hook is called, this tries to log the whole dataset even when I don't have DEBUG enabled.
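The effect described above can be reproduced with the standard library alone. The following is a minimal sketch, not pluggy's actual code: `BigDataset` and `eager_trace` are hypothetical names, and the assumption is only that the tracer builds the full message string before handing it to the writer.

```python
import logging


class BigDataset:
    """Hypothetical stand-in for a dataset whose string conversion is expensive."""

    repr_calls = 0

    def __repr__(self) -> str:
        BigDataset.repr_calls += 1
        return "<BigDataset ...>"


def eager_trace(writer, hook_name: str, kwargs: dict) -> None:
    """Simplified tracer: format the whole message first, then call the writer."""
    lines = [f"{hook_name} [hook]\n"]
    for name, value in kwargs.items():
        # The value is converted to a string here, before the logger
        # has had any chance to check its level.
        lines.append(f"    {name}: {value}\n")
    writer("".join(lines))


logger = logging.getLogger("tracing_demo")
logger.setLevel(logging.INFO)  # DEBUG is disabled

eager_trace(logger.debug, "after_node_run", {"outputs": BigDataset()})

# Nothing was logged, yet the dataset was formatted once.
print(BigDataset.repr_calls)
```

With a real dataset the string conversion is where the time goes, so the cost is paid on every hook call regardless of the log level.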

There should be a config option to enable hook tracing only when I need it, and tracing should be disabled by default in production environments.

@merelcht merelcht added the Community Issue/PR opened by the open-source community label Feb 19, 2025
@github-project-automation github-project-automation bot moved this to Wizard inbox in Kedro Wizard 🪄 Feb 19, 2025
@ElenaKhaustova ElenaKhaustova moved this from Wizard inbox to In Progress in Kedro Wizard 🪄 Feb 20, 2025
@ElenaKhaustova
Contributor

Hello @marcelopio, thanks for reporting the issue.

We have some questions that will help us to proceed with the issue:

  1. When you say "this tries to log the whole dataset even when I don't have DEBUG enabled", do you mean that the log is not printed for you but pluggy's message formatting still takes time? Or is the log printed as well?
  2. What is the approximate size of the datasets you're experiencing such a problem with?

@marcelopio
Author

Sure!

1- It tries to format the dataset, in my case by calling pandas to_string. It doesn't print any logs because debug is not enabled, so it calls to_string and takes a long time unnecessarily.

2- I think the instance I was debugging had an input of 8 MB (pickled) of a pandas dataset, and it was breaking it into a dict of Series and doing some formatting.

I can probably share the cProfile logs for both 0.18.8 and 0.19.10, but I need to check that they don't contain any confidential information, so it may take a while.
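For anyone who wants to collect a comparable profile, here is a minimal sketch using only the standard library. `profile_call` and `run_pipeline_stub` are hypothetical names; the stub stands in for whatever actually runs the pipeline. `strip_dirs()` drops directory paths from the report, which reduces (but does not eliminate) the amount of project-specific information in it.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, **kwargs) -> str:
    """Run func under cProfile and return a text report sorted by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        func(*args, **kwargs)
    finally:
        profiler.disable()

    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # strip_dirs() removes leading path information from file names.
    stats.strip_dirs().sort_stats("cumulative").print_stats(15)
    return buffer.getvalue()


def run_pipeline_stub() -> str:
    """Hypothetical placeholder for the real pipeline run."""
    return "".join(str(i) for i in range(10_000))


report = profile_call(run_pipeline_stub)
print(report)
```

Running the same wrapper under both Kedro versions and comparing the top entries should show whether string formatting dominates in the slower run.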

@astrojuanlu astrojuanlu marked this as a duplicate of #4505 Feb 20, 2025
@ElenaKhaustova
Contributor

Thank you, @marcelopio

After debugging, I can confirm the same behaviour in my tests:

"It tries to format the dataset, in my case by calling pandas to_string. It doesn't print any logs because debug is not enabled, so it calls to_string and takes a long time unnecessarily."

The current workaround to mitigate this, suggested by @astrojuanlu, is:

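# settings.py
# Note: this relies on private pluggy and hook manager attributes,
# which may change between releases.

from pluggy._callers import _multicall
from kedro.framework.cli.hooks.manager import get_cli_hook_manager

_cli_hook_manager = get_cli_hook_manager()
# Restore pluggy's default hook executor, bypassing the tracing wrapper
# installed by enable_tracing().
_cli_hook_manager._inner_hookexec = _multicall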

@marcelopio please let us know if it worked for you

@ElenaKhaustova
Contributor

Since we already have issues with tracing (#2630), a possible solution is to add an option to enable/disable tracing, or to enable it only when the debug log level is set.

Setting the writer to None will prevent pluggy from formatting the message, and thus from calling __repr__ on datasets.

manager.trace.root.setwriter(logger.debug)

https://github.com/pytest-dev/pluggy/blob/4eb41bb532fe1edc4efe756367d145f626b82a95/src/pluggy/_tracing.py#L43
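A minimal sketch of the "only when the debug level is set" idea follows. It does not use pluggy or Kedro: `TracerStub` is a hypothetical stand-in that skips formatting when its writer is None, and `configure_tracing` is an assumed helper name, not an existing Kedro function.

```python
import logging
from typing import Callable, Optional


class TracerStub:
    """Hypothetical tracer that only formats messages when a writer is set."""

    def __init__(self) -> None:
        self._writer: Optional[Callable[[str], None]] = None

    def setwriter(self, writer: Optional[Callable[[str], None]]) -> None:
        self._writer = writer

    def trace(self, hook_name: str, kwargs: dict) -> None:
        if self._writer is None:
            return  # no writer: skip formatting entirely
        details = "".join(f"    {k}: {v}\n" for k, v in kwargs.items())
        self._writer(f"{hook_name} [hook]\n{details}")


def configure_tracing(tracer: TracerStub, logger: logging.Logger) -> None:
    """Attach logger.debug as the writer only if DEBUG is enabled."""
    if logger.isEnabledFor(logging.DEBUG):
        tracer.setwriter(logger.debug)
    else:
        tracer.setwriter(None)


class CountingDataset:
    """Hypothetical dataset that counts how often it is formatted."""

    repr_calls = 0

    def __repr__(self) -> str:
        CountingDataset.repr_calls += 1
        return "<CountingDataset>"


logger = logging.getLogger("hook_tracing_demo")
tracer = TracerStub()

logger.setLevel(logging.INFO)
configure_tracing(tracer, logger)
tracer.trace("after_node_run", {"outputs": CountingDataset()})
calls_with_info = CountingDataset.repr_calls  # formatting skipped

logger.setLevel(logging.DEBUG)
configure_tracing(tracer, logger)
tracer.trace("after_node_run", {"outputs": CountingDataset()})
calls_with_debug = CountingDataset.repr_calls  # formatted once
```

One design caveat: the level is checked once at configuration time, so if the log level changes later, the writer has to be reconfigured.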

@ElenaKhaustova ElenaKhaustova added Issue: Feature Request New feature or improvement to existing feature Component: Framework Issue/PR that addresses core framework functionality Component: Logging and removed Community Issue/PR opened by the open-source community labels Feb 20, 2025