dp bugfix: wait till dp thread stops in thread cancel #9136

marcinszkudlinski · 2024-05-17T08:23:02Z

There's a possible race in DP when stopping a pipeline:

- DP starts processing
- an incoming IPC pauses the pipeline. IPC has higher priority
  than DP, so DP is preempted
- the pipeline is stopping and module "reset" is called. Some resources
  may be freed here
- when the IPC finishes, the DP thread continues processing

Solution: wait for DP to finish processing and terminate the DP thread
before calling the "reset" method in the module

To do this:

1) call "thread cancel" before calling "reset"
2) modify "thread cancel" to mark the thread to terminate and
   execute k_thread_join()
3) a terminated thread cannot be restarted, so thread creation must be
   moved from "init" to "schedule". There's no need to reallocate memory:
   Zephyr guarantees that resources may be re-used once a thread
   has terminated.
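The cancel-then-join ordering described above can be sketched outside of Zephyr. The following is a minimal analogue using POSIX threads, with a mutex and condition variable standing in for the DP semaphore and pthread_join() standing in for k_thread_join(). All names here (struct dp_task, dp_task_schedule(), dp_task_trigger(), dp_task_cancel(), dp_task_reset()) are hypothetical and are not the SOF or Zephyr API; this is an illustration of the idea, not the code in this PR.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a DP task; not the SOF structure. */
struct dp_task {
	pthread_t thread;
	pthread_mutex_t lock;
	pthread_cond_t wake;
	int pending;          /* processing requests not yet handled */
	bool terminate;       /* set by cancel, checked by the worker */
	int cycles;           /* processing cycles completed */
	bool resources_valid; /* stands in for memory that "reset" frees */
};

static void *dp_worker(void *arg)
{
	struct dp_task *t = arg;

	pthread_mutex_lock(&t->lock);
	for (;;) {
		while (!t->pending && !t->terminate)
			pthread_cond_wait(&t->wake, &t->lock);
		if (t->terminate)
			break;              /* self-terminate without processing */
		t->pending--;
		assert(t->resources_valid); /* must never run after "reset" */
		t->cycles++;
	}
	pthread_mutex_unlock(&t->lock);
	return NULL;
}

/* "schedule": the thread is created here rather than in "init", so a
 * cancelled task can simply be scheduled again. */
static int dp_task_schedule(struct dp_task *t)
{
	pthread_mutex_init(&t->lock, NULL);
	pthread_cond_init(&t->wake, NULL);
	t->pending = 0;
	t->terminate = false;
	t->resources_valid = true;
	return pthread_create(&t->thread, NULL, dp_worker, t) ? -1 : 0;
}

/* Ask the worker to run one processing cycle. */
static void dp_task_trigger(struct dp_task *t)
{
	pthread_mutex_lock(&t->lock);
	t->pending++;
	pthread_cond_signal(&t->wake);
	pthread_mutex_unlock(&t->lock);
}

/* "cancel": mark, wake, then block until the worker has really exited. */
static int dp_task_cancel(struct dp_task *t)
{
	pthread_mutex_lock(&t->lock);
	t->terminate = true;
	pthread_cond_signal(&t->wake);
	pthread_mutex_unlock(&t->lock);
	if (pthread_join(t->thread, NULL))
		return -1;
	pthread_cond_destroy(&t->wake);
	pthread_mutex_destroy(&t->lock);
	return 0;
}

/* "reset": only safe once cancel has returned. */
static void dp_task_reset(struct dp_task *t)
{
	t->resources_valid = false;
}
```

The intended call order is dp_task_schedule(), any number of dp_task_trigger() calls, dp_task_cancel(), then dp_task_reset(). Because cancel does not return until the worker has exited, reset can never run concurrently with a processing cycle.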

kv2019i

Looks good, one clarifying question inline (plus few minor comments).

kv2019i · 2024-05-17T12:25:39Z

src/audio/module_adapter/module/generic.c

@@ -336,6 +336,9 @@ int module_reset(struct processing_module *mod)
 	if (md->state < MODULE_IDLE)
 		return 0;
 #endif
+	/* cancel task if DP task*/
+	if (mod->dev->ipc_config.proc_domain == COMP_PROCESSING_DOMAIN_DP && mod->dev->task)


Is this check safe? I mean, when the task is cancelled, where is task set to NULL? It's initially NULL, so this check will cover DP tasks that were never started, but what about a DP task that has already ended?

If the task is cancelled, it's OK: calling cancel again is safe and won't have any effect.
If the task is freed:

static inline void comp_free(struct comp_dev *dev)
{
	assert(dev->drv->ops.free);

	/* free task if shared component or DP task*/
	if ((dev->is_shared || dev->ipc_config.proc_domain == COMP_PROCESSING_DOMAIN_DP) &&
	    dev->task) {
		schedule_task_free(dev->task);
		rfree(dev->task);
		dev->task = NULL;
	}

	dev->drv->ops.free(dev);
}

free and reset are both called from the IPC thread, so it looks safe. Anyway, I can change it to the safer:

volatile task * _task = dev->task;
dev->task = NULL;
schedule_task_free(_task);
rfree(_task);

EDIT:
Is it safer? Can't the compiler reorder the lines during optimization? Is it needed?
Free is called when a module is really being destroyed; calling "reset" during or after the free operation would be fatal anyway.

SOF is relying on IPC serialisation. So if both these functions are only called from within IPCs, they cannot race.

@marcinszkudlinski @lyakh I was actually thinking of the case where "dev->task" is a dangling pointer. I couldn't immediately find where dev->task is zeroed, so not sure if this check can ensure task is a valid task object.

@lyakh you're right about IPC serialization, no need for double protection. @kv2019i, as you can see, the pointer is zeroed when the task is freed. No changes required here.
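The conclusion of this thread, that a NULL check on the task pointer is enough as long as the free path clears the pointer and both paths run serialized, can be illustrated with a small single-threaded sketch. The types and functions below (struct task, struct comp_dev, task_cancel(), dev_reset(), dev_free_task()) are simplified, hypothetical stand-ins and not the SOF definitions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-ins for the SOF structures. */
struct task {
	bool cancelled;
};

struct comp_dev {
	struct task *task;
};

/* Counts cancel calls that reached a live task object. */
static int cancel_calls;

static void task_cancel(struct task *t)
{
	t->cancelled = true; /* idempotent: a second cancel changes nothing */
	cancel_calls++;
}

/* Mirrors the guard added to module_reset(): only touch a non-NULL task. */
static void dev_reset(struct comp_dev *dev)
{
	if (dev->task)
		task_cancel(dev->task);
}

/* Mirrors comp_free(): free the task and clear the pointer together, so
 * no later caller can observe a dangling pointer. */
static void dev_free_task(struct comp_dev *dev)
{
	if (dev->task) {
		free(dev->task);
		dev->task = NULL;
	}
}
```

With this shape, dev_reset() is safe before the task exists, after it has been cancelled, and after it has been freed. The sketch assumes, as the reviewers do, that reset and free are never executed concurrently.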

kv2019i · 2024-05-17T12:26:25Z

src/audio/module_adapter/module/generic.c

@@ -336,6 +336,9 @@ int module_reset(struct processing_module *mod)
 	if (md->state < MODULE_IDLE)


minor: some typos in commit, "s/posiibility/possibility", "s/sollution/solution/".

marc-hb · 2024-05-20T16:35:58Z

As a funny coincidence, this old IPC3 issue was just exposed now:

[BUG] [stable-v2.2] [NOCODEC] ERROR pipeline_comp_reset(): failed to recover (multiple-pipeline-all) #9135

lyakh

I see that errors in thread handling are just moved over from another location, but let's use this opportunity to fix them too

lyakh · 2024-05-21T07:20:44Z

src/audio/module_adapter/module/generic.c

@@ -336,6 +336,9 @@ int module_reset(struct processing_module *mod)
 	if (md->state < MODULE_IDLE)
 		return 0;
 #endif
+	/* cancel task if DP task*/
+	if (mod->dev->ipc_config.proc_domain == COMP_PROCESSING_DOMAIN_DP && mod->dev->task)


SOF is relying on IPC serialisation. So if both these functions are only called from within IPCs, they cannot race.

lyakh · 2024-05-21T07:24:14Z

src/schedule/zephyr_dp_schedule.c

@@ -278,8 +287,13 @@ static int scheduler_dp_task_cancel(void *data, struct task *task)
 	if (list_is_empty(&dp_sch->tasks))
 		schedule_task_cancel(&dp_sch->ll_tick_src);

+	/* if the task is waiting on a semaphore - let it run and self-terminate */
+	k_sem_give(&pdata->sem);


if the task is a lower priority thread, however, this won't switch to running that other task immediately

it won't at this point, it will when k_thread_join() is called

lyakh · 2024-05-21T08:02:53Z

src/schedule/zephyr_dp_schedule.c

+err:
+	/* cleanup - unlock and free all allocated resources */
+	scheduler_dp_unlock(lock_key);
+	if (thread_id)


this check will always return true

yes, but only because the first possible jump to err is after thread is initialized. I would leave this "overprotection" as is - cost is just a few bytes of code.

@marcinszkudlinski but before it's initialised it's undefined, so this check would be meaningless too and the compiler would warn you about that.
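The point about the err: label can be shown with a small sketch of goto-based cleanup. A check such as `if (thread_id)` is only meaningful if the variable is explicitly initialised before the first possible jump. The names below (struct handle, handle_create(), handle_abort(), dp_ctx_setup(), and the fail_at parameter used to force a failure) are hypothetical and exist only for illustration.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical handle type standing in for a thread id. */
struct handle {
	int id;
};

/* Number of handles created and not yet aborted. */
static int live_handles;

static struct handle *handle_create(void)
{
	struct handle *h = malloc(sizeof(*h));

	if (h)
		live_handles++;
	return h;
}

/* Must not be called with NULL, hence the check under err: below. */
static void handle_abort(struct handle *h)
{
	live_handles--;
	free(h);
}

struct dp_ctx {
	void *stack;
	struct handle *thread_id;
};

/* fail_at forces a failure after step 1 or step 2 (0 = no failure). */
static int dp_ctx_setup(struct dp_ctx *ctx, int fail_at)
{
	void *stack = NULL;
	struct handle *thread_id = NULL; /* initialised: the err: check is well defined */

	stack = malloc(64);
	if (!stack || fail_at == 1)
		goto err;                /* thread_id is still NULL here */

	thread_id = handle_create();
	if (!thread_id || fail_at == 2)
		goto err;                /* thread_id may be valid here */

	ctx->stack = stack;
	ctx->thread_id = thread_id;
	return 0;

err:
	/* cleanup - release only what was actually acquired */
	if (thread_id)
		handle_abort(thread_id);
	free(stack);
	return -1;
}
```

Initialising both locals to NULL lets jumps to err: be added earlier in the function later on without the cleanup code reading an indeterminate value.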

kv2019i

Fine to me now, thanks.

marcinszkudlinski · 2024-05-22T12:34:03Z

changing to [DNM], waiting for logs from stress test with DP AEC

lgirdwood · 2024-05-22T15:20:51Z

changing to [DNM], waiting for logs from stress test with DP AEC

Will tag for v2.10 so we can get this fix.

lyakh · 2024-05-23T07:20:41Z

src/audio/module_adapter/module/generic.c

@@ -336,6 +336,9 @@ int module_reset(struct processing_module *mod)
 	if (md->state < MODULE_IDLE)
 		return 0;
 #endif
+	/* cancel task if DP task*/
+	if (mod->dev->ipc_config.proc_domain == COMP_PROCESSING_DOMAIN_DP && mod->dev->task)
+		schedule_task_cancel(mod->dev->task);


do I understand it correctly, that previously DP tasks were terminated from comp_free() by calling schedule_task_free()? Now scheduler_dp_task_cancel() will (potentially) be called twice - from here and from schedule_task_free(). Should be safe for now, at least as long as list_del() also initialises the list item.

Yes, it is called twice; nobody guarantees that cancel has been called before free.

as long as list_del() also initialises the list item

it does:

static inline void list_item_del(struct list_item *item)
{
	item->next->prev = item->prev;
	item->prev->next = item->next;
	list_init(item);
}

@marcinszkudlinski yes, sorry, that's exactly what I meant: as long as list_init() is there, calling it twice is OK, but if we ever decide to remove it, since technically it isn't needed there, then this (and probably a couple of other places) will break
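Why the trailing list_init() makes a second deletion harmless can be checked with a self-contained sketch. list_item_del() is copied from the comment above; list_init(), list_item_append() and list_is_empty() are written here as assumptions about how a circular doubly linked list of this kind usually looks, and may differ in detail from the SOF headers.

```c
#include <stdbool.h>

struct list_item {
	struct list_item *next;
	struct list_item *prev;
};

/* Assumed behaviour: an initialised item points at itself. */
static inline void list_init(struct list_item *list)
{
	list->next = list;
	list->prev = list;
}

/* Assumed helper: insert item just before head, i.e. at the tail. */
static inline void list_item_append(struct list_item *item,
				    struct list_item *head)
{
	item->next = head;
	item->prev = head->prev;
	head->prev->next = item;
	head->prev = item;
}

/* As quoted in the review: unlink, then re-initialise the item. */
static inline void list_item_del(struct list_item *item)
{
	item->next->prev = item->prev;
	item->prev->next = item->next;
	list_init(item);
}

static inline bool list_is_empty(const struct list_item *head)
{
	return head->next == head;
}
```

After the first deletion the item's next and prev both point at the item itself, so a second list_item_del() only rewrites the item's own pointers and leaves the rest of the list untouched. Without the list_init() call, the second deletion would instead write through stale neighbour pointers.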

comments addressed, thanks

marcinszkudlinski · 2024-05-31T13:08:34Z

please go ahead with merging

kv2019i · 2024-05-31T13:10:08Z

One more review needed to be able to merge.

kv2019i · 2024-06-04T08:30:27Z

@wszypelt @marcinszkudlinski Can you take a look at the quickbuild build failure for this PR? This PR fixes an issue that frequently shows up as a quickbuild failure in other PRs.

lgirdwood · 2024-06-10T10:18:26Z

changing to [DNM], waiting for logs from stress test with DP AEC

@marcinszkudlinski any update? Can we merge for v2.10, or is this still WIP?

marcinszkudlinski · 2024-06-12T07:49:16Z

checking CI, please wait

lgirdwood · 2024-06-12T13:35:56Z

@marcinszkudlinski jenkins looks OK, looks like a long queue on internal CI ?

wszypelt · 2024-06-12T13:45:51Z

@lgirdwood results from internal CI should be available within 40 minutes

marcinszkudlinski · 2024-06-13T08:29:23Z

another problem in CI, fix in progress...

There's a possible race in DP when stopping a pipeline:
- DP starts processing
- an incoming IPC pauses the pipeline. IPC has higher priority
  than DP, so DP is preempted
- the pipeline is stopping and module "reset" is called. Some resources
  may be freed here
- when the IPC finishes, the DP thread continues processing

Solution: wait for DP to finish processing and terminate the DP thread
before calling the "reset" method in the module

To do this:
1) call "thread cancel" before calling "reset"
2) modify "thread cancel" to mark the thread to terminate and
   execute k_thread_join()
3) a terminated thread cannot be restarted, so thread creation must be
   moved from "init" to "schedule". There's no need to reallocate memory:
   Zephyr guarantees that resources may be re-used once a thread
   has terminated.

Signed-off-by: Marcin Szkudlinski <marcin.szkudlinski@intel.com>

marcinszkudlinski · 2024-06-18T14:02:03Z

fix - crash if a task was deleted before it was scheduled,
i.e. when a pipeline was created and deleted but never started.
In this case the Zephyr thread wasn't created and a null pointer dereference occurred.

lgirdwood · 2024-06-20T08:10:00Z

fix - crash if a task was deleted before it was scheduled, i.e. when a pipeline was created and deleted but never started. In this case the Zephyr thread wasn't created and a null pointer dereference occurred.

@marcinszkudlinski ok, do you mean now you are happy with stress testing and we are good to go ?

marcinszkudlinski · 2024-06-20T09:24:03Z

@lgirdwood I believe so, pls go ahead with merging

marcinszkudlinski marked this pull request as ready for review May 17, 2024 08:32

marcinszkudlinski requested review from pblaszko, dbaluta, LaurentiuM1234, lgirdwood, plbossart, mmaka1, lbetlej and kv2019i as code owners May 17, 2024 08:32

marcinszkudlinski mentioned this pull request May 17, 2024

[BUG] pipeline with DP-scheduled src_lite upon 0x13000002 IPC #9124

Closed

marcinszkudlinski added bug Something isn't working as expected dp_scheduler labels May 17, 2024

kv2019i reviewed May 17, 2024

lyakh previously requested changes May 21, 2024

marcinszkudlinski force-pushed the dp_stop_by_wait branch from c7a07ab to 5cfcb4e Compare May 22, 2024 11:57

kv2019i approved these changes May 22, 2024

marcinszkudlinski changed the title ~~dp bugfix: wait till dp thread stops in thread cancel~~ [DNM] dp bugfix: wait till dp thread stops in thread cancel May 22, 2024

lgirdwood added this to the v2.10 milestone May 22, 2024

lyakh reviewed May 23, 2024

marcinszkudlinski changed the title ~~[DNM] dp bugfix: wait till dp thread stops in thread cancel~~ dp bugfix: wait till dp thread stops in thread cancel May 31, 2024

lyakh approved these changes Jun 4, 2024

marcinszkudlinski force-pushed the dp_stop_by_wait branch from 5cfcb4e to 750b119 Compare June 12, 2024 10:54

lgirdwood modified the milestones: v2.10, v2.11 Jun 13, 2024

marcinszkudlinski force-pushed the dp_stop_by_wait branch from 750b119 to 92b2e53 Compare June 18, 2024 14:00

kv2019i merged commit 34957e7 into thesofproject:main Jun 20, 2024
44 of 46 checks passed

marcinszkudlinski deleted the dp_stop_by_wait branch June 21, 2024 07:19

ssavati mentioned this pull request Jul 8, 2024

[BUG] [stable-v2.2] [NOCODEC] ERROR pipeline_comp_reset(): failed to recover (multiple-pipeline-all) #9135

Open
