Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Failed "assert(buffer->is_shared)" in comp_buffer_connect() #9343

Open
marc-hb opened this issue Jul 31, 2024 · 20 comments
Open

Failed "assert(buffer->is_shared)" in comp_buffer_connect() #9343

marc-hb opened this issue Jul 31, 2024 · 20 comments
Assignees
Labels
bug Something isn't working as expected

Comments

@marc-hb
Copy link
Collaborator

marc-hb commented Jul 31, 2024

This is failing when fuzzing, see below.

This code came with commit 4a03699, @marcinszkudlinski can you please comment?

https://github.com/thesofproject/sof/actions/runs/10144692547/job/28048711573?pr=9338

==5054==ERROR: AddressSanitizer: SEGV on unknown address 0x00000000 (pc 0x081aecfa bp 0xdbffadb8 sp 0xdbffacd0 T7)
==5054==The signal is caused by a WRITE memory access.
==5054==Hint: address points to the zero page.
    #0 0x81aecfa in k_sys_fatal_error_handler /home/runner/work/sof/sof/workspace/sof/zephyr/wrapper.c:352:19
    #1 0x815e3c9 in assert_post_action /home/runner/work/sof/sof/workspace/zephyr/lib/os/assert.c:43:2
    #2 0x8188b30 in comp_buffer_connect /home/runner/work/sof/sof/workspace/sof/src/ipc/ipc-helper.c:188:3
    #3 0x81819b6 in ipc_buffer_to_comp_connect /home/runner/work/sof/sof/workspace/sof/src/ipc/ipc3/helper.c:594:9
    #4 0x8181404 in ipc_comp_connect /home/runner/work/sof/sof/workspace/sof/src/ipc/ipc3/helper.c:[622](https://github.com/thesofproject/sof/actions/runs/10144692547/job/28048711573?pr=9338#step:7:623):10
    #5 0x8176f3e in ipc_glb_tplg_comp_connect /home/runner/work/sof/sof/workspace/sof/src/ipc/ipc3/handler.c:1375:9
    #6 0x8175bd4 in ipc_cmd /home/runner/work/sof/sof/workspace/sof/src/ipc/ipc3/handler.c:1654:9
    #7 0x818e639 in ipc_platform_do_cmd /home/runner/work/sof/sof/workspace/sof/src/platform/posix/ipc.c:160:2
    #8 0x818750c in ipc_do_cmd /home/runner/work/sof/sof/workspace/sof/src/ipc/ipc-common.c:326:9
    #9 0x81af56b in task_run /home/runner/work/sof/sof/workspace/sof/zephyr/include/rtos/task.h:94:9
    #10 0x81af0de in edf_work_handler /home/runner/work/sof/sof/workspace/sof/zephyr/edf_schedule.c:32:16

Originally posted by @tmleman in #9338 (comment)

EDIT: somehow this was run again and is passing now: https://github.com/thesofproject/sof/actions/runs/10144692547/job/28093942074?pr=9338

Fuzzing is typically non-deterministic.

I have another PR with the same update and there was no failure there. Locally, I initially couldn't hit it either. That's why I did a rerun of fuzzing for this PR.
After some longer runs, I now have the same failure and I must admit that I don't quite understand the point of this assert. The assumption seems to be that when creating components we know they will be on separate cores and that they will be connected. That's why we set the buffer as shared, but I don't know how to verify this at an earlier stage so that here we can be sure that the buffer is shared.
I will also add that the same case reproduces for me on the current main.

@marc-hb
Copy link
Collaborator Author

marc-hb commented Jul 31, 2024

If this happens due to some invalid IPC sequence then comp_buffer_connect() should return an error code, not assert()

asserts can be disabled in production and they are disabled in production right now, example: #9308 (comment)

Asserts are only a debugging aid for "impossible"/buggy conditions.

Invalid IPC sequences are not considered "impossible": IPC fuzzing would be entirely pointless if they were.

@marc-hb marc-hb added the bug Something isn't working as expected label Jul 31, 2024
@marcinszkudlinski
Copy link
Contributor

Question: does IPC3 allow binding components located on separate cores? As I see in the code - no.

If its true (cross core bind is not allowed) - that means the assert is at right place and protection against the illegal situation should be in ipc_comp_to_buffer_connect

if its false (cross core bind is allowed) - assert should be replaced with an error exit code

in both situations - a fix is needed

@lgirdwood
Copy link
Member

Question: does IPC3 allow binding components located on separate cores? As I see in the code - no.

If its true (cross core bind is not allowed) - that means the assert is at right place and protection against the illegal situation should be in ipc_comp_to_buffer_connect

if its false (cross core bind is allowed) - assert should be replaced with an error exit code

in both situations - a fix is needed

Its not allowed and the assert is ok as there are no IPC3 multicore users today. If someone does add multicore for IPC3 in the future they can address.

@marc-hb
Copy link
Collaborator Author

marc-hb commented Aug 5, 2024

that means the assert is at right place and protection against the illegal situation should be in ipc_comp_to_buffer_connect()

Its not allowed and the assert is ok as there are no IPC3 multicore users today.

@marcinszkudlinski could you please add an error code in ipc_comp_to_buffer_connect()? Then probably cherry-pick to stable-v2.2.

@marcinszkudlinski
Copy link
Contributor

well, it is a bit more complicated. As I see the check is already in place:

static int ipc_buffer_to_comp_connect(struct ipc_comp_dev *buffer,
				      struct ipc_comp_dev *comp)
{
	tr_dbg(&ipc_tr, "ipc: comp sink %d, source %d -> connect", comp->id,
	       buffer->id);

#if CONFIG_INCOHERENT
	if (comp->core != buffer->cb->core) {
		tr_err(&ipc_tr, "ipc: shared buffers are not supported for IPC3 incoherent architectures");
		return -ENOTSUP;
	}
#endif
	return comp_buffer_connect(comp->cd, comp->core, buffer->cb,
				   PPL_CONN_DIR_BUFFER_TO_COMP);
}

its checked for incoherent arch only, but the assert is there in every case

@marc-hb @lgirdwood Looks like fuzzing is performed with CONFIG_INCOHERENT = 0 What coherent architectures we support? Is fuzzing also performed with CONFIG_INCOHERENT = 1, especially for IPC4?

The quick fix would be to put the assert in question under conditional compiling, but there may be more places where checks or even "is_shared" flags are not necessary, checking in progress

@marc-hb
Copy link
Collaborator Author

marc-hb commented Aug 6, 2024

The mystery deepens: I tried to reproduce but I couldn't.

But SOF never even reaches ipc_buffer_to_comp_connect()! It returns an -EINVAL from ipc_comp_connect() first because icd_source is NULL:

int ipc_comp_connect(struct ipc *ipc, ipc_pipe_comp_connect *_connect)
{
	struct sof_ipc_pipe_comp_connect *connect = ipc_from_pipe_connect(_connect);
	struct ipc_comp_dev *icd_source;
	struct ipc_comp_dev *icd_sink;

	/* check whether the components already exist */
	icd_source = ipc_get_comp_dev(ipc, COMP_TYPE_ANY, connect->source_id);
	if (!icd_source) {
		tr_err(&ipc_tr, "ipc_comp_connect(): source component does not exist, source_id = %u sink_id = %u",
		       connect->source_id, connect->sink_id);
		return -EINVAL;
	}

What did I miss?

I have a newer Clang version (18), could that make a difference? EDIT: tried on Ubuntu 22.04 just like Github and no difference.

@andyross
Copy link
Contributor

andyross commented Aug 6, 2024

Its not allowed and the assert is ok as there are no IPC3 multicore users today. If someone does add multicore for IPC3 in the future they can address.

Right, but the less proximate problem here is that the cross-core bind is accessible via the IPC3 protocol even if it's "not allowed" as a matter of policy and device configuration, and the case isn't being handled as a protocol error, only as a an assertion failure downward in the stack when it hits an unexpected situation.

As far as determinism: I've had mixed success with the reproducer files generated by the fuzzer too. It seems like too many things can affect the flow. Presumably what's happened is that the previously-fuzzed IPC commands left SOF in a weird state, but that may not be reachable in a reliable way from a developer system. Generally it's OK just to look at a failure and understand how we "could" have reached it and then fix via analysis instead.

If we're really sure that this "shouldn't be able to happen" then I guess we have work to do.

@marc-hb
Copy link
Collaborator Author

marc-hb commented Aug 6, 2024

Presumably what's happened is that the previously-fuzzed IPC commands left SOF in a weird state, but that may not be reachable in a reliable way from a developer system.

Thanks! This means the crash-... file is technically incomplete. This is disappointing - and confusing/time-consuming.

@marc-hb @lgirdwood Looks like fuzzing is performed with CONFIG_INCOHERENT = 0 What coherent architectures we support? Is fuzzing also performed with CONFIG_INCOHERENT = 1, especially for IPC4?

Thanks, I took a quick look at this and I'm afraid we have some serious Kconfig problems here (what's new?). CONFIG_INCOHERENT is guarded by if (CONFIG_)XTENSA in src/platform/Kconfig. But:

  • (CONFIG_)XTENSA is not defined when fuzzing: neither fuzzing IPC3 nor fuzzing IPC4
  • More surprisingly, (CONFIG_)XTENSA is not defined either when compiling IPC3 on the main branch?! I tried ./scripts/docker-run.sh ./scripts/xtensa-build-all.sh rn.

So, in all of these cases CONFIG_INCOHERENT is not even reachable.

It would help if Kconfig+cpp could make a difference between "undefined" and zero but I'm afraid that ship has sailed a long time ago; sorry for the digression.

Long story short I think the very first step should be "top-down": first making sense of and clarifying the CONFIG_INCOHERENT situation. Only then we can start looking at the C code.

@marc-hb
Copy link
Collaborator Author

marc-hb commented Aug 6, 2024

Thanks! This means the crash-... file is technically incomplete. This is disappointing - and confusing/time-consuming.

To be clear: some of these files have worked for me in the past. When they do, they save an ENORMOUS amount of testing time. So don't just cast them aside. Just disappointing to learn they don't always work.

@marcinszkudlinski
Copy link
Contributor

marcinszkudlinski commented Aug 8, 2024

I think the best solution would be:

int comp_buffer_connect(struct comp_dev *comp, uint32_t comp_core,
			struct comp_buffer *buffer, uint32_t dir)
{
	/* check if it's a connection between cores */
	if (buffer->core != comp_core) {
#if CONFIG_INCOHERENT
		/* buffer must be shared */
		assert(buffer->is_shared);
#else
		buffer->is_shared = true;
#endif
		if (!comp->is_shared)
			comp_make_shared(comp);
	}

	return pipeline_connect(comp, buffer, dir);
}

for incoherent archs the buffer must be allocated as shared/not shared, for coherent it may be set as shared at any moment

@lgirdwood ?

@marcinszkudlinski
Copy link
Contributor

#9356 please comment

@marc-hb
Copy link
Collaborator Author

marc-hb commented Aug 9, 2024

Even if 9356 is perfect, it does not change anything to the apparently messy Kconfig situation... @andyross could you help there? Should CONFIG_INCOHERENT really be gated by CONFIG_XTENSA? Can we align build-fuzz/.config with build/.config more? Some differences are unavoidable but they should be as small as possible: fuzzing something different from the product is both dangerous and a bit of a waste.

@andyross
Copy link
Contributor

andyross commented Aug 9, 2024

Should CONFIG_INCOHERENT really be gated by CONFIG_XTENSA?

Likely yes? The latter is a Zephyr kconfig indicating the architecture, which always has an incoherent L1 cache. The former is a SOF tunable I'm less familiar with. From the perspective of software[1], the incoherence is irrelevant when there's only one CPU. So my guess is that SOF allows this to be =n on single CPU builds? The equivalent (but logically converse) Zephyr abstraction is something called CONFIG_KERNEL_COHERENCE, which forces mutable .data/.bss access to be uncached and does some trickery to allow stacks to be cached, while leaving everything else (.rodata and instruction literals in .text) cached.

But regardless, the only architecture in this whole problem space with a truly incoherent cache is Xtensa, and that's very unlikely to change.

[1] And software only! Things like host-shared memory and DMA buffers still care about coherence even with a single CPU, but that's generally handled at the driver layer and not global firmware behavior, AFAIK.

@marc-hb
Copy link
Collaborator Author

marc-hb commented Aug 10, 2024

Should CONFIG_INCOHERENT really be gated by CONFIG_XTENSA?

Likely yes? The latter is a Zephyr kconfig indicating the architecture, which always has an incoherent L1 cache

So does this mean the "INCOHERENT" code cannot be fuzzed? (This is why I asked in the first place)

@marcinszkudlinski
Copy link
Contributor

@marc-hb even if you enable the flag for a cohernt arch, it won't change much (well, except some asserts like the above) - but the main problem, the infamous cache incoherency, won't be tested with fuzzing.

we can think of some backdoors - fuzzing is not any of productions' compilations anyway, but better would be to run fuzzing using an emulator.

@marc-hb
Copy link
Collaborator Author

marc-hb commented Aug 12, 2024

but the main problem, the infamous cache incoherency, won't be tested with fuzzing.

AFAIK fuzzing is single threaded so I was never expecting fuzzing to catch memory coherency issues, that's not the point I was trying to make.

My main point is: we should always minimize #ifdef differences between fuzzed code and production code as much as possible so GitHub issues like this one don't even exist and we save a lot of time. Also, we have a greater chance to catch actual input sanitization issues. It's not about fuzzing finding very complex bugs but about minimizing #ifdef differences to have fewer false positives and fewer false negatives.

(well, except some asserts like the above)

Maybe it's not very "ambitious" but that's more like my main point.

but better would be to run fuzzing using an emulator.

That does not sound mutually exclusive. You can never have too much test coverage. You can only have limited validation time and resources and bad validation priorities.

@marcinszkudlinski
Copy link
Contributor

AFAIK fuzzing is single threaded so I was never expecting fuzzing to catch memory coherency issues, that's not the point I was trying to make.

You're right

My main point is: we should always minimize #ifdef differences between fuzzed code and production code as much as possible

Well, I believe a backdoor than is not a bad idea,

@kv2019i
Copy link
Collaborator

kv2019i commented Aug 20, 2024

@marc-hb @marcinszkudlinski I believe this can be closed now with #9356 merged ?

@marcinszkudlinski
Copy link
Contributor

marcinszkudlinski commented Aug 20, 2024

@kv2019i the assert won't go off anymore.
@marc-hb maybe open another issue for fuzzing compilation params?

@marc-hb
Copy link
Collaborator Author

marc-hb commented Aug 22, 2024

@marc-hb maybe open another issue for fuzzing compilation params?

I just filed:

Fuzzing is just the messenger.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working as expected
Projects
None yet
Development

No branches or pull requests

5 participants