Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

intel_adsp/ace: power: pad the hpsram_mask passed to power_down #76046

Conversation

tmleman
Copy link
Collaborator

@tmleman tmleman commented Jul 18, 2024

This PR contains only a cherry-pick of the fix that was proposed and reviewed earlier in this #75285 PR. The reason I am pushing it again is that the change that was merged does not completely fix the problem.

Recently, the same issue that we identified on the MTL could be observed in another PR on SOF in builds for LNL (the problem stopped reproducing after a rebase and so far has not reappeared on LNL). The same issue still reproduces on PTL and without some kind of fix, we are unable to add tests to CI that check if this type of power flow works. The change proposed by @kv2019i is so far the only solution that reliably fixes this problem on all platforms.

I apologize for taking so long with the explanations, I had to focus on other tasks.

@marc-hb
Copy link
Collaborator

marc-hb commented Jul 19, 2024

Link: thesofproject/sof#9268

9268 was supposedly fixed and has been closed for a while. If the previous attempts were not enough then you must link to them and reference the corresponding commits in this commit message with Fixes commit ... "tag". Also, please re-open 9268. Clear bug status information should not be reserved to cache alignment experts...

BTW is this supposed to help with thesofproject/sof#9308 ?

@lyakh
Copy link
Collaborator

lyakh commented Jul 19, 2024

thesofproject/sof#9309 didn't work

@marc-hb
Copy link
Collaborator

marc-hb commented Jul 19, 2024

Ok, thanks to @lyakh now I see that this is just a re-submission of https://github.com/kv2019i/zephyr/commits/202407-fix-powerdown-dcache-try2/ which was a previous (and not merged) attempt to fix thesofproject/sof#9268.

Still, it's very confusing to see a fix for a closed bug... Was this meant as a draft?

@tmleman
Copy link
Collaborator Author

tmleman commented Jul 19, 2024

@marc-hb @lyakh I apologize, gentlemen, but this change was not intended to improve anything for the MTL (it's about the PTL, but I will describe that later). Thank you, however, for your interest and your contribution.

Still, it's very confusing to see a fix for a closed bug... Was this meant as a draft?

Yes, it probably should have been a draft, but I decided to make it an open PR right away because I know that it would not escape your attention anyway. I did not include explanations because I was busy debugging problems on the MTL. Best regards.

@marc-hb
Copy link
Collaborator

marc-hb commented Jul 19, 2024

but I decided to make it an open PR right away because I know that it would not escape your attention anyway.

Drafts achieve (at least) 3 things:

  1. They do NOT notify/spam anyone by default. You can still add any reviewer manually.
  2. They cannot be merged (while running all tests anyway)
  3. For people who notice drafts anyway (by chance, bad filter or whatever), drafts send a very clear signal that they are not ready/experimental/etc. and that no one should spend too much time thinking about them (like I just did).

Drafts are great, please use and abuse them!

If you want 2. and 3. but you still want to notify people then very easy: click "ready to review" and then back to "convert to draft". This notifies everyone and then does NOT unsubscribe anyone.

@marc-hb marc-hb marked this pull request as draft July 19, 2024 16:46
@tmleman
Copy link
Collaborator Author

tmleman commented Jul 23, 2024

@marc-hb @lyakh I have updated the description, so the PR is ready for review. I would maybe wait with the merge until we fix issue thesofproject/sof#9308. When the fix is merged, I will do a rebase and update the PR on the SOF side.

soc/intel/intel_adsp/ace/power.c Outdated Show resolved Hide resolved
/* This assumes a single HPSRAM segment */
static uint32_t hpsram_mask;
const int dcache_words = XCHAL_DCACHE_LINESIZE / 4;
uint32_t hpsram_mask[dcache_words] __aligned(XCHAL_DCACHE_LINESIZE);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this moves it back to the stack, which by itself already moves it away from any potentially neighbouring cache lines... TBH I'm now confused - this more or less reverts #75108 or, in other words, switches to an alternative fix for the same issue by reviving #75285. Would be good to get some more information

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moving the variable to the bss causes an exception on PTL. We still have the same exception (LoadStoreErrorCause(3)), just on a different platform. The problem still occurred on LNL and stopped after a rebase that did not introduce changes affecting that area.

andyross
andyross previously approved these changes Jul 23, 2024
Copy link
Contributor

@andyross andyross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This still feels closer to a root cause than the first, so +1 from me. It still worries me that we don't have a real answer.

Did anyone ever get a chance to whitebox the failure? Like, try to lock 1, 2, 3... cache lines at the same index and see when/how it fails? Then iterate over all 256 (or whatever) indices and see if there's one particular one that fails early (because, maybe, the ROM left something locked?)

@tmleman tmleman force-pushed the topic/upstream/pr/ace/fix/power_down branch from b767597 to be73b43 Compare July 29, 2024 09:49
@tmleman
Copy link
Collaborator Author

tmleman commented Jul 29, 2024

Rebase

@tmleman tmleman marked this pull request as ready for review July 29, 2024 09:50
@tmleman tmleman force-pushed the topic/upstream/pr/ace/fix/power_down branch 2 times, most recently from 519ebc1 to 3d2051b Compare August 5, 2024 10:16
The power_down() function will lock dcache for the hpsram_mask
array. On some platforms, the dcache lock will fail if the array
is on cache line that can be used for window register context
saves.

Work around this by aligning and padding the hpsram_mask to cacheline
size.

Link: thesofproject/sof#9268
Signed-off-by: Kai Vehmanen <kai.vehmanen@linux.intel.com>
@tmleman
Copy link
Collaborator Author

tmleman commented Aug 5, 2024

I've spent some time on this problem recently and unfortunately, I haven't gotten to the bottom of it yet. For now, I know that alignment and padding are the only solutions that have worked on all platforms so far. What I plan to do now is to merge this patch, add D3 tests for PTL to CI, and then rework power_down to remove hpsram_mask and turn off this memory in a way similar to lpsram.

@tmleman tmleman requested review from lyakh and andyross August 5, 2024 15:15
@carlescufi carlescufi merged commit 2fcdbba into zephyrproject-rtos:main Aug 8, 2024
22 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
platform: Intel ADSP Intel Audio platforms
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants