dp bugfix: wait till dp thread stops in thread cancel #9136

Merged: 1 commit, Jun 20, 2024

Contributor

@marcinszkudlinski marcinszkudlinski commented May 17, 2024

There's a possible race in DP when stopping a pipeline:

  • DP starts processing
  • an IPC arrives to pause the pipeline. IPC has a higher priority
    than DP, so DP is preempted
  • the pipeline is stopping and the module's "reset" is called. Some
    resources may be freed here
  • when the IPC finishes, the DP thread continues processing,
    possibly using resources that "reset" has already freed

Solution: wait for DP to finish processing and terminate the DP thread
before calling the module's "reset" method.

To do this:

  1. call "thread cancel" before calling "reset"
  2. modify "thread cancel" to mark the thread for termination and
    call k_thread_join()
  3. a terminated thread cannot be restarted, so thread creation must be
    moved from "init" to "schedule". There's no need to reallocate memory:
    Zephyr guarantees that resources may be reused once a thread
    has terminated.
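
The cancel-then-join ordering described above can be sketched outside Zephyr. The following is only an analogy under stated assumptions, not the SOF implementation: it uses POSIX threads, where `pthread_join()` plays the role of `k_thread_join()` and a counted condition variable stands in for the task's semaphore. Every name in it (`dp_task_sketch`, `dp_wake`, `dp_schedule`, `dp_cancel`, `dp_reset`) is hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a DP task; not the SOF data structure. */
struct dp_task_sketch {
	pthread_t thread;
	pthread_mutex_t lock;
	pthread_cond_t wake;
	int pending;         /* wake-ups not yet consumed (semaphore-like count) */
	bool terminate;      /* set by cancel, read by the worker under lock */
	bool thread_created; /* true between schedule and cancel */
	int *buffer;         /* resource that "reset" frees */
	int cycles;          /* processing cycles completed */
};

static void *dp_worker(void *arg)
{
	struct dp_task_sketch *t = arg;

	for (;;) {
		pthread_mutex_lock(&t->lock);
		while (t->pending == 0)
			pthread_cond_wait(&t->wake, &t->lock);
		t->pending--;
		bool stop = t->terminate;
		pthread_mutex_unlock(&t->lock);

		if (stop)
			return NULL;   /* self-terminate; buffer is not touched again */
		t->buffer[0]++;        /* "processing" uses the resource */
		t->cycles++;
	}
}

/* Wake the worker once; optionally mark it for termination first. */
static void dp_wake(struct dp_task_sketch *t, bool terminate)
{
	pthread_mutex_lock(&t->lock);
	if (terminate)
		t->terminate = true;
	t->pending++;
	pthread_cond_signal(&t->wake);
	pthread_mutex_unlock(&t->lock);
}

static int dp_init(struct dp_task_sketch *t)
{
	pthread_mutex_init(&t->lock, NULL);
	pthread_cond_init(&t->wake, NULL);
	t->pending = 0;
	t->terminate = false;
	t->thread_created = false;
	t->cycles = 0;
	t->buffer = calloc(1, sizeof(*t->buffer));
	return t->buffer ? 0 : -1;
}

/* The thread is created here rather than in init, so that a cancelled
 * task can be scheduled again with a fresh thread.
 */
static int dp_schedule(struct dp_task_sketch *t)
{
	if (t->thread_created)
		return 0;
	t->pending = 0;
	t->terminate = false;
	if (pthread_create(&t->thread, NULL, dp_worker, t) != 0)
		return -1;
	t->thread_created = true;
	return 0;
}

/* Mark the worker for termination, wake it, and wait until it has exited. */
static void dp_cancel(struct dp_task_sketch *t)
{
	if (!t->thread_created)
		return;   /* never scheduled, or already cancelled */
	dp_wake(t, true);
	pthread_join(t->thread, NULL);
	t->thread_created = false;
}

/* Reset cancels first, so no worker can be running when resources go away. */
static void dp_reset(struct dp_task_sketch *t)
{
	dp_cancel(t);
	assert(!t->thread_created);
	free(t->buffer);
	t->buffer = NULL;
}
```

In this sketch the caller of `dp_reset()` blocks in `pthread_join()` until the worker has returned, so the `free()` cannot overlap with a processing cycle. Calling `dp_cancel()` on a task that was never scheduled is a no-op.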

Collaborator

@kv2019i kv2019i left a comment

Looks good, one clarifying question inline (plus a few minor comments).

@@ -336,6 +336,9 @@ int module_reset(struct processing_module *mod)
	if (md->state < MODULE_IDLE)
		return 0;
#endif
	/* cancel task if DP task*/
	if (mod->dev->ipc_config.proc_domain == COMP_PROCESSING_DOMAIN_DP && mod->dev->task)
Collaborator

Is this check safe? I mean, once the task is cancelled, where is task set to NULL? It's initially NULL, so this check covers DP tasks that were never started, but what about a DP task that has already ended?

Contributor Author

@marcinszkudlinski marcinszkudlinski May 17, 2024

If the task is cancelled, that's OK: calling cancel again is safe and has no effect.
If the task is freed:

static inline void comp_free(struct comp_dev *dev)
{
	assert(dev->drv->ops.free);

	/* free task if shared component or DP task*/
	if ((dev->is_shared || dev->ipc_config.proc_domain == COMP_PROCESSING_DOMAIN_DP) &&
	    dev->task) {
		schedule_task_free(dev->task);
		rfree(dev->task);
		dev->task = NULL;
	}

	dev->drv->ops.free(dev);
}

free and reset are both called from the IPC thread, so it looks safe. Anyway, I can change it to a safer version:

		volatile task * _task = dev->task;
		dev->task = NULL;
		schedule_task_free(_task);
		rfree(_task);

EDIT:
Is it actually safer? Can't the compiler reorder these lines during optimization? Is it needed at all?
Free is called when a module is really being destroyed; calling "reset" during or after the free operation would be fatal anyway.

Collaborator

SOF is relying on IPC serialisation. So if both these functions are only called from within IPCs, they cannot race.

Collaborator

@marcinszkudlinski @lyakh I was actually thinking of the case where "dev->task" is a dangling pointer. I couldn't immediately find where dev->task is zeroed, so not sure if this check can ensure task is a valid task object.

Contributor Author

@lyakh you're right about IPC serialization, no need for double protection. @kv2019i - as you can see, the pointer is zeroed when the task is freed. No changes required here.
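
The conclusion of this thread is that a non-NULL check on `dev->task` is a valid guard as long as the pointer is cleared wherever the task is freed and both paths run from the same serialized context. A minimal sketch of that pattern follows. The types and function names (`task_sketch`, `dev_sketch`, `dev_task_create`, `dev_reset`, `dev_free`) are hypothetical miniatures, not the SOF structures, and the sketch is single-threaded on purpose to mirror the IPC serialization assumption.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical miniature types; the real SOF structures carry far more state. */
struct task_sketch {
	int cancel_calls;
};

struct dev_sketch {
	struct task_sketch *task;   /* NULL until created, NULL again after free */
};

static int dev_task_create(struct dev_sketch *dev)
{
	assert(!dev->task);         /* creating twice would leak the first task */
	dev->task = calloc(1, sizeof(*dev->task));
	return dev->task ? 0 : -1;
}

/* Same shape as the guard in the reviewed hunk: act only if the pointer is set. */
static int dev_reset(struct dev_sketch *dev)
{
	if (!dev->task)
		return 0;           /* never created, or already freed */
	dev->task->cancel_calls++;  /* stands in for cancelling the task */
	return 1;
}

/* Free and clear in one place, so no dangling pointer outlives the task. */
static void dev_free(struct dev_sketch *dev)
{
	if (dev->task) {
		free(dev->task);
		dev->task = NULL;
	}
}
```

With this shape, `dev_reset()` before creation or after `dev_free()` does nothing, and repeated resets on a live task only repeat a harmless cancel.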

@@ -336,6 +336,9 @@ int module_reset(struct processing_module *mod)
if (md->state < MODULE_IDLE)
Collaborator

minor: some typos in the commit message: "s/posiibility/possibility/", "s/sollution/solution/"

Collaborator

marc-hb commented May 20, 2024

As a funny coincidence, this old IPC3 issue was just exposed now:

lyakh previously requested changes May 21, 2024
Collaborator

@lyakh lyakh left a comment

I see that errors in thread handling are just moved over from another location, but let's use this opportunity to fix them too

@@ -278,8 +287,13 @@ static int scheduler_dp_task_cancel(void *data, struct task *task)
	if (list_is_empty(&dp_sch->tasks))
		schedule_task_cancel(&dp_sch->ll_tick_src);

	/* if the task is waiting on a semaphore - let it run and self-terminate */
	k_sem_give(&pdata->sem);
Collaborator

if the task is a lower priority thread, however, this won't switch to running that other task immediately

Contributor Author

it won't at this point, it will when k_thread_join() is called

err:
	/* cleanup - unlock and free all allocated resources */
	scheduler_dp_unlock(lock_key);
	if (thread_id)
Collaborator

this check will always return true

Contributor Author

yes, but only because the first possible jump to err is after the thread is initialized. I would leave this "overprotection" as is - the cost is just a few bytes of code.

Collaborator

@marcinszkudlinski but before it's initialised it's undefined, so this check would be meaningless too and the compiler would warn you about that.
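
To illustrate the point about the check at `err:`, here is a small sketch of a goto-based cleanup path in which the handle is initialised to NULL at its declaration, so the check has a defined value on every path and actually distinguishes "created" from "not created". This is only one possible shape and an assumption on my part; the thread does not show how the final code resolved it. `setup_sketch`, `fail_early` and `fail_late` are hypothetical, and a heap allocation stands in for the thread handle.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical setup routine; fail_early/fail_late simulate errors at two stages. */
static int setup_sketch(int fail_early, int fail_late, int **out)
{
	int *thread_id = NULL;   /* defined value on every path to err */
	int ret;

	assert(out);

	if (fail_early) {
		ret = -1;
		goto err;        /* jump taken before the handle exists */
	}

	thread_id = malloc(sizeof(*thread_id));
	if (!thread_id) {
		ret = -2;
		goto err;
	}
	*thread_id = 1;

	if (fail_late) {
		ret = -3;
		goto err;        /* jump taken after the handle exists */
	}

	*out = thread_id;
	return 0;

err:
	/* meaningful only because thread_id was initialised to NULL above */
	if (thread_id)
		free(thread_id);
	*out = NULL;
	return ret;
}
```

If every jump to `err:` happens after the handle is assigned, the check is redundant, as the reviewer says. If a jump could happen earlier, the check is only valid when the variable starts out as NULL.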

Collaborator

@kv2019i kv2019i left a comment

Fine to me now, thanks.

@marcinszkudlinski
Contributor Author

changing to [DNM], waiting for logs from stress test with DP AEC

@marcinszkudlinski marcinszkudlinski changed the title dp bugfix: wait till dp thread stops in thread cancel [DNM] dp bugfix: wait till dp thread stops in thread cancel May 22, 2024
@lgirdwood
Member

> changing to [DNM], waiting for logs from stress test with DP AEC

Will tag for v2.10 so we can get this fix.

@lgirdwood lgirdwood added this to the v2.10 milestone May 22, 2024
@@ -336,6 +336,9 @@ int module_reset(struct processing_module *mod)
	if (md->state < MODULE_IDLE)
		return 0;
#endif
	/* cancel task if DP task*/
	if (mod->dev->ipc_config.proc_domain == COMP_PROCESSING_DOMAIN_DP && mod->dev->task)
		schedule_task_cancel(mod->dev->task);
Collaborator

do I understand it correctly, that previously DP tasks were terminated from comp_free() by calling schedule_task_free()? Now scheduler_dp_task_cancel() will (potentially) be called twice - from here and from schedule_task_free(). Should be safe for now, at least as long as list_del() also initialises the list item.

Contributor Author

Yes, it is called twice; nothing guarantees that cancel has been called before free.

> as long as list_del() also initialises the list item

it does:

static inline void list_item_del(struct list_item *item)
{
	item->next->prev = item->prev;
	item->prev->next = item->next;
	list_init(item);
}

Collaborator

@marcinszkudlinski yes, sorry, that's exactly what I meant: as long as list_init() is there, then calling it twice is ok, but if we once decide to remove it since technically it isn't needed there, then this (and probably a couple of other places) will break
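
The dependency on `list_init()` can be made concrete with a self-contained sketch. `list_item_del()` is copied from the snippet quoted above. The definition of `list_init()` (an item pointing at itself) is an assumption about how such an intrusive list is normally initialised, and `list_item_append()` and `list_is_empty()` are written here only for the sketch.

```c
#include <assert.h>
#include <stdbool.h>

struct list_item {
	struct list_item *next;
	struct list_item *prev;
};

/* Assumed definition: an initialised item points at itself. */
static inline void list_init(struct list_item *list)
{
	list->next = list;
	list->prev = list;
}

/* Helper written for this sketch: link item at the tail of head's list. */
static inline void list_item_append(struct list_item *item, struct list_item *head)
{
	assert(item != head);
	item->next = head;
	item->prev = head->prev;
	head->prev->next = item;
	head->prev = item;
}

/* As quoted in the thread: unlink, then re-initialise the removed item. */
static inline void list_item_del(struct list_item *item)
{
	item->next->prev = item->prev;
	item->prev->next = item->next;
	list_init(item);
}

static inline bool list_is_empty(struct list_item *head)
{
	return head->next == head;
}
```

After the first `list_item_del()`, the item's `next` and `prev` point back at the item, so a second call only rewrites the item's own fields. Without the trailing `list_init()`, the second call would write through the stale neighbour pointers, which is the breakage the reviewer warns about if that line were ever removed.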

@lyakh lyakh dismissed their stale review May 23, 2024 07:44

comments addressed, thanks

@marcinszkudlinski marcinszkudlinski changed the title [DNM] dp bugfix: wait till dp thread stops in thread cancel dp bugfix: wait till dp thread stops in thread cancel May 31, 2024
@marcinszkudlinski
Contributor Author

please go ahead with merging

@kv2019i
Collaborator

kv2019i commented May 31, 2024

One more review needed to be able to merge.

@kv2019i
Collaborator

kv2019i commented Jun 4, 2024

@wszypelt @marcinszkudlinski Can you take a look at the quickbuild build failure for this PR? This PR fixes an issue that frequently shows up as a quickbuild failure in other PRs.

@lgirdwood
Member

> changing to [DNM], waiting for logs from stress test with DP AEC

@marcinszkudlinski any update? Can we merge this for v2.10, or is it still WIP?

@marcinszkudlinski
Contributor Author

checking CI, please wait

@lgirdwood
Member

@marcinszkudlinski Jenkins looks OK; is there a long queue on internal CI?

@wszypelt

@lgirdwood results from internal CI should be available within 40 minutes

@marcinszkudlinski
Contributor Author

another problem in CI, fix in progress...

@lgirdwood lgirdwood modified the milestones: v2.10, v2.11 Jun 13, 2024
There's a possible race in DP when stopping a pipeline:
 - DP starts processing
 - an IPC arrives to pause the pipeline. IPC has a higher priority
   than DP, so DP is preempted
 - the pipeline is stopping and the module's "reset" is called. Some
   resources may be freed here
 - when the IPC finishes, the DP thread continues processing

Solution: wait for DP to finish processing and terminate the DP thread
before calling the module's "reset" method

To do this:
1) call "thread cancel" before calling "reset"
2) modify "thread cancel" to mark the thread for termination and
   call k_thread_join()
3) a terminated thread cannot be restarted, so thread creation must be
   moved from "init" to "schedule". There's no need to reallocate memory:
   Zephyr guarantees that resources may be reused once a thread
   has terminated.

Signed-off-by: Marcin Szkudlinski <marcin.szkudlinski@intel.com>
@marcinszkudlinski
Contributor Author

fix - crash if a task had been deleted before it was scheduled,
i.e. when a pipeline was created and deleted, but never started.
In this case the Zephyr thread wasn't created and a null pointer dereference occurred.

@lgirdwood
Member

> fix - crash if a task had been deleted before it was scheduled, i.e. when a pipeline was created and deleted, but never started. In this case the Zephyr thread wasn't created and a null pointer dereference occurred.

@marcinszkudlinski ok, do you mean now you are happy with stress testing and we are good to go ?

@marcinszkudlinski
Contributor Author

@lgirdwood I believe so, pls go ahead with merging

@kv2019i kv2019i merged commit 34957e7 into thesofproject:main Jun 20, 2024
44 of 46 checks passed
@marcinszkudlinski marcinszkudlinski deleted the dp_stop_by_wait branch June 21, 2024 07:19
Labels
bug, dp_scheduler
6 participants