Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Darwin] Fix Watchable Event Manager Select crash on shutdown #8985

Merged
merged 1 commit into from
Aug 13, 2021

Conversation

sagar-apple
Copy link
Contributor

@sagar-apple sagar-apple commented Aug 13, 2021

Problem

#8800 split out some functionality into their own implementations. In the new code that was added, timers are not properly cancelled during shutdown.
This leads to a crash when shutdown is called while a timer is queued. The timer will run after everything has shutdown and will crash on an assert.

Thread 9 Queue : com.zigbee.chip.framework.controller.workqueue (serial)
#0	0x00000001b985f9bc in __pthread_kill ()
#1	0x00000001f3484398 in pthread_kill ()
#2	0x000000018d85acf0 in abort ()
#3	0x0000000101b21290 in chip::System::Timer::List::Remove(chip::System::Timer*) at connectedhomeip/src/system/SystemTimer.cpp:168
#4	0x0000000101b23520 in chip::System::Timer::MutexedList::Remove(chip::System::Timer*) at connectedhomeip/src/system/SystemTimer.h:104
#5	0x0000000101b2257c in chip::System::WatchableEventManager::HandleTimerComplete(chip::System::Timer*) at connectedhomeip/src/system/WatchableEventManagerSelect.cpp:370
#6	0x0000000101b22548 in invocation function for block in chip::System::WatchableEventManager::StartTimer(unsigned int, void (*)(chip::System::Layer*, void*), void*) at connectedhomeip/src/system/WatchableEventManagerSelect.cpp:139

Change overview

Cancel all pending timers while shutting down The WatchableEventManager.

Testing

How was this tested? (at least one bullet point required)
We need more testing for all this code. I don't see any unit tests here. I'm unsure how to go about adding tests here.

  • If manually tested, what platforms controller and device platforms were manually tested, and how?
    I tested this manually with the iOS controller which restarts the Darwin CHIP stack before pairing. The controller no longer crashes when calling shutdown before subsequent pairings.

@boring-cyborg boring-cyborg bot added the system label Aug 13, 2021
@sagar-apple sagar-apple changed the title Fix Watchable Event Manager Select crash on shutdown [Darwin] Fix Watchable Event Manager Select crash on shutdown Aug 13, 2021
Copy link
Contributor

@bzbarsky-apple bzbarsky-apple left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about WatchableEventManager::CancelTimer (both versions)? Should that not be doing the dispatch source cancellation?

In the old code we did this cancellation in Timer::Cancel. Should it just happen in Timer::Clear now? Should there be documentation for Clear, for that matter?

@kpschoedel

@github-actions
Copy link

Size increase report for "esp32-example-build" from af3fefb

File Section File VM
chip-bridge-app.elf .flash.text 48 48
chip-shell.elf .flash.text -8 -8
Full report output
BLOAT REPORT

Files found only in the build output:
    report.csv

Comparing ./master_artifact/chip-bridge-app.elf and ./pull_artifact/chip-bridge-app.elf:

sections,vmsize,filesize
.flash.text,48,48
[Unmapped],0,-48

Comparing ./master_artifact/chip-lock-app.elf and ./pull_artifact/chip-lock-app.elf:

sections,vmsize,filesize

Comparing ./master_artifact/chip-temperature-measurement-app.elf and ./pull_artifact/chip-temperature-measurement-app.elf:

sections,vmsize,filesize

Comparing ./master_artifact/chip-ipv6only-app.elf and ./pull_artifact/chip-ipv6only-app.elf:

sections,vmsize,filesize

Comparing ./master_artifact/chip-shell.elf and ./pull_artifact/chip-shell.elf:

sections,vmsize,filesize
[Unmapped],0,8
.flash.text,-8,-8

Comparing ./master_artifact/chip-all-clusters-app.elf and ./pull_artifact/chip-all-clusters-app.elf:

sections,vmsize,filesize

Comparing ./master_artifact/chip-pigweed-app.elf and ./pull_artifact/chip-pigweed-app.elf:

sections,vmsize,filesize

Comparing ./master_artifact/chip-persistent-storage.elf and ./pull_artifact/chip-persistent-storage.elf:

sections,vmsize,filesize


@sagar-apple
Copy link
Contributor Author

What about WatchableEventManager::CancelTimer (both versions)? Should that not be doing the dispatch source cancellation?

In the old code we did this cancellation in Timer::Cancel. Should it just happen in Timer::Clear now? Should there be documentation for Clear, for that matter?

@kpschoedel

Yeah maybe Timer::Clear ought to remove its timer source from dispatch. I'll let @kpschoedel lead that discussion.

As for CancelTimer, I think we're okay to do it in both places because of the single queue nature of the whole implementation. If a timer is canceled then we won't see it in the list while shutting down, and if not, then shutdown will remove it before it gets to execute.

@bzbarsky-apple
Copy link
Contributor

I guess my question is: if we don't do it in WatchableEventManager::CancelTimer, does that mean that canceled timers would still run?

@sagar-apple
Copy link
Contributor Author

I guess my question is: if we don't do it in WatchableEventManager::CancelTimer, does that mean that canceled timers would still run?

Yes, they would :) .

@bzbarsky-apple
Copy link
Contributor

Yes, they would :)

Sounds like we should be able to test that certainly in a unit test... Set a timer, cancel it, sleep a bit, spin the event loop, see if the timer fires....

@kpschoedel
Copy link
Contributor

What about WatchableEventManager::CancelTimer (both versions)? Should that not be doing the dispatch source cancellation?

It is (unless I copied that wrong).

In the old code we did this cancellation in Timer::Cancel. Should it just happen in Timer::Clear now? Should there be documentation for Clear, for that matter?

The cancellation belongs in WatchableEventManager; I just overlooked adding the dispatch case in Shutdown.

System::Timer is now an optional Object-pool-based utility for implementing the System::Layer timer API. I know this isn't really clear right now; better documentation is coming, along with making the WatchableEventManagerLibevent example implementation not use Timer as a demonstration of that, but I have a few yaks (e.g #8865) on the way.

In line with issue #7715, the end state is that the System::Layer interfaces become virtual and the various WatchableEventManager versions become its implementations. In this world I assume that Darwin would have a purely-dispatch implementation class (inheriting the socket parts from the Select version for now).

@bzbarsky-apple
Copy link
Contributor

It is (unless I copied that wrong).

Ah, I see. WatchableEventManagerSelect.cpp does it. The Libevent and LwIP versions do not. Should they? Or do we know for a fact that they are never used with CHIP_SYSTEM_CONFIG_USE_DISPATCH? @kpschoedel

@kpschoedel
Copy link
Contributor

Ah, I see. WatchableEventManagerSelect.cpp does it. The Libevent and LwIP versions do not. Should they? Or do we know for a fact that they are never used with CHIP_SYSTEM_CONFIG_USE_DISPATCH? @kpschoedel

The previous code had CHIP_SYSTEM_CONFIG_USE_DISPATCH only within CHIP_SYSTEM_CONFIG_USE_SOCKETS, and src/system/system.gni sets it only for sockets+Darwin.

@woody-apple woody-apple added the hotfix urgent fix needed, can bypass review label Aug 13, 2021
@woody-apple woody-apple merged commit c8533d8 into project-chip:master Aug 13, 2021
@sagar-apple sagar-apple deleted the fix_crash branch August 13, 2021 22:18
nikita-s-wrk pushed a commit to nikita-s-wrk/connectedhomeip that referenced this pull request Sep 23, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hotfix urgent fix needed, can bypass review review - pending system
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants