Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

init: zephyr: Fix memory leak during secondary core init #9006

Merged

Conversation

tmleman
Copy link
Contributor

@tmleman tmleman commented Apr 5, 2024

This patch refines the initialization process for secondary cores in a multicore environment when using Zephyr as the RTOS. The patch introduces a check_restore function specifically for Zephyr, which checks if basic core structures (IDC, notifier, schedulers) have been previously allocated and are still present in memory, indicating that the system is not undergoing a cold boot.

By adding this check, the system avoids unnecessary re-allocation of these structures during the power-up sequence of secondary cores, effectively preventing the memory leak observed during repeated power cycle tests.

fix #9005

src/init/init.c Outdated
* has not been powered off.
*/
if (!idc || !notifier || !schedulers)
return 0;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

even though this function returns int (can we change it to return bool?) still return idc && notifier && schedulers; should still work. Also, seeing this check raises a question for me - what if some of these pointers are NULL? I think the answer is that this isn't possible, right? If any of these allocations fail initially, the whole initialisation fails, so we never reach this point. That means, that checking just one of these pointers should be enough: if one of them is set, all the others must be set too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we change it to return bool?

Done

what if some of these pointers are NULL? I think the answer is that this isn't possible, right?

True, initialization should fail at the start (first initialization). It's definitely the case with the scheduler.

that checking just one of these pointers should be enough: if one of them is set, all the others must be set too?

I think one would be enough, but more is better, right?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think one would be enough, but more is better, right?

@tmleman why and how? Maybe it is better, or maybe not. If you claim, that it's better, then I'd like an explanation of why and how and a proper handling of it. So, if you detect that one of them is NULL and others aren't, this can mean that the initialisation code got broken. So I would expect an appropriate handling of such a case. Currently a case of some NULL and some non-NULL seems to indicate some inconsistent state and we leave it at it. A better solution could be adding an explicit "initialisation complete" flag and checking it, instead of using some implicit indicators

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can mean that the initialisation code got broken.

I think you have given the best answer to the question why more is better.

Perhaps it would be helpful to add a piece of code that would handle the case where only part of the components has been initialized. As for the idea with the variable (flag) that would indicate that the initialization has been completed, I don't see the value in it at the moment. What would be the difference compared to checking the state of just one component, for example, one that is initialized last?

At this moment, fixing regressions introduced by the previous refactor is a higher priority for me than starting another one.

@tmleman tmleman force-pushed the topic/upstream/issue/9005/core_reinit branch from 42cf341 to 7da2919 Compare April 8, 2024 08:52
@tmleman tmleman requested a review from lyakh April 8, 2024 12:57
@plbossart
Copy link
Member

this seems to introduce a pretty bad regression on MeteorLake, see e.g. https://sof-ci.01.org/sofpr/PR9006/build3770/devicetest/index.html?model=MTLP_RVP_NOCODEC&testcase=check-playback-10sec

@tmleman tmleman force-pushed the topic/upstream/issue/9005/core_reinit branch from 7da2919 to 07cd45e Compare April 8, 2024 18:41
src/init/init.c Show resolved Hide resolved
src/init/init.c Outdated Show resolved Hide resolved
@tmleman tmleman force-pushed the topic/upstream/issue/9005/core_reinit branch from 07cd45e to c6df4a6 Compare April 9, 2024 07:38
return 0;

return 1;
return !!idc && !!task && !!notifier && !!schedulers;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lyakh I could use the same function if I skipped checking the 'task' pointer.

tmleman added 2 commits April 9, 2024 10:48
This patch refactors the `check_restore` function to return a `bool`
instead of an `int`. This change enhances code readability and clarifies
the intent of the function, which is to return a true or false value
based on the presence of core structures in memory.

No functional changes are introduced by this patch; it is purely a code
quality improvement.

Signed-off-by: Tomasz Leman <tomasz.m.leman@intel.com>
This patch refines the initialization process for secondary cores in a
multicore environment when using Zephyr as the RTOS. The patch
introduces a `check_restore` function specifically for Zephyr, which
checks if basic core structures (IDC, notifier, schedulers) have been
previously allocated and are still present in memory, indicating that
the system is not undergoing a cold boot.

By adding this check, the system avoids unnecessary re-allocation of
these structures during the power-up sequence of secondary cores,
effectively preventing the memory leak observed during repeated power
cycle tests.

fix thesofproject#9005

Signed-off-by: Tomasz Leman <tomasz.m.leman@intel.com>
@tmleman tmleman requested a review from lyakh April 9, 2024 09:21
@lgirdwood lgirdwood merged commit a43981e into thesofproject:main Apr 9, 2024
44 of 45 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Memory Leak on Secondary Core Power Cycle
7 participants