AP_ESC_Telem: fix RPM timeout race #28987

Closed

Conversation

robertlong13 (Collaborator)

The RPM telemetry data structure can get updated by another thread at any time. This can make the true time difference (now - last_update_us) negative, which in unsigned arithmetic looks like the data is nearly UINT32_MAX microseconds stale, causing perfectly healthy ESCs to be marked as timed out. As a result, Lua scripts get nil values from esc_telem:get_rpm even when the RPM data is fine. I'd imagine this can cause minor problems elsewhere too, but I'm not sure.

There are a few ways we could fix this. Using a signed time difference is the simplest and has the least overhead. We don't need to worry about this signed time difference wrapping around once the RPM data is 36 minutes stale, because the data_valid flag is set to false as soon as the data is 1 s stale, and it can't be set back to true until new data comes in.
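A minimal standalone sketch of the failure mode and of the signed-difference fix (illustrative only, not ArduPilot code; the constant name is just borrowed for readability):

#include <cstdint>
#include <cstdio>

int main() {
    const uint32_t ESC_RPM_DATA_TIMEOUT_US = 1000000; // 1 s

    // now_us is sampled first; another thread then stores a fresher timestamp
    uint32_t now_us = 5000000;
    uint32_t last_update_us = 5000010;

    uint32_t unsigned_age = now_us - last_update_us;          // wraps to 4294967286
    int32_t  signed_age   = int32_t(now_us - last_update_us); // -10

    printf("unsigned age %u -> timed out? %d\n", unsigned_age,
           unsigned_age > ESC_RPM_DATA_TIMEOUT_US);           // prints 1: spurious timeout
    printf("signed age %d -> timed out? %d\n", signed_age,
           signed_age > int32_t(ESC_RPM_DATA_TIMEOUT_US));    // prints 0: data treated as fresh
    return 0;
}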

The easiest way to demonstrate the problem (and the fix) is with the following Lua script running in SITL at 100x speed.

local ever_had_rpm = false
local function update()
    local rpm = esc_telem:get_rpm(2)
    -- Try to get RPM 20 times to increase our odds of hitting the race condition
    for _ = 1, 20 do
        rpm = rpm and esc_telem:get_rpm(2)
    end
    -- Make sure we don't complain until we've gotten a non-nil RPM at least once
    if rpm then
        if not ever_had_rpm then
            print("Got an RPM")
        end
        ever_had_rpm = true
    end
    -- Complain about a nil RPM
    if not rpm and ever_had_rpm then
        print("RPM is nil")
    end
    return update, 0
end

print("loaded rpm_nil_test.lua")

return update, 0

Note: AP_ESC_Telem_Backend::TelemetryData::stale is just as vulnerable to this, but TelemetryData doesn't have the same wraparound protection (though the issue is less severe there since it uses millis), so I can't use this same trick to fix it.
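A rough sketch of that analogous millis-based check, paraphrased from the description above rather than copied from the ArduPilot source:

// Illustrative paraphrase only, not the actual TelemetryData::stale code.
// The same race applies: if another thread refreshes last_update_ms between
// sampling millis() and subtracting, the unsigned difference wraps to a huge
// value and fresh data looks stale. A bare signed cast would not be a complete
// fix here, because there is no data_valid-style latch to keep the entry
// invalid once the signed difference itself wraps (about 24.8 days of
// staleness with millis, versus about 36 minutes with micros).
static bool telemetry_is_stale(uint32_t now_ms, uint32_t last_update_ms, uint32_t timeout_ms)
{
    return (now_ms - last_update_ms) > timeout_ms;
}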

Commit message:

The RPM telemetry data structure can get updated by another thread at
any time. This can cause (now - last_update_us) to be negative, which
looks like the data is nearly UINT32_MAX microseconds stale. To fix
this, we need to use signed integers for the time difference. The
data_valid flag prevents us from accidentally counting extremely stale
data as valid when the signed time difference wraps around, so this
is safe.
andyp1per (Collaborator)

I don't think this is easily fixable using signed values. The real fix is to introduce locking, but the code is in a hot path so we should avoid that. A simple fix might be to set data_valid to false temporarily in update_rpm, or alternatively have some kind of "in_update" flag which causes the check to fail temporarily. A better fix would be to take a copy of the data before getting the current time.
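A hedged sketch of that last alternative, taking a copy of the record before sampling the clock; the struct and field names here are stand-ins, not the actual AP_ESC_Telem definitions:

#include <AP_HAL/AP_HAL.h>  // for AP_HAL::micros()
#include <cstdint>

// Stand-in for the per-ESC RPM record (field names are assumptions).
struct RpmDataSketch {
    float    rpm;
    uint32_t last_update_us;
    bool     data_valid;
};

// Copy the (possibly concurrently written) fields before sampling the clock,
// so now_us can never be older than the timestamp it is compared against and
// the unsigned difference stays non-negative, with no locking required.
// A real patch would also have to consider compiler/CPU reordering of these reads.
static bool rpm_is_fresh(const volatile RpmDataSketch &shared, uint32_t timeout_us)
{
    const uint32_t last_update_us = shared.last_update_us;
    const bool     data_valid     = shared.data_valid;
    const uint32_t now_us         = AP_HAL::micros();
    return data_valid && (now_us - last_update_us) <= timeout_us;
}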

andyp1per (Collaborator) left a review:

Don't think this works, need a thread-safe solution I think.

@@ -811,16 +811,18 @@ void AP_ESC_Telem::update()
     const uint32_t now_us = AP_HAL::micros();
     for (uint8_t i = 0; i < ESC_TELEM_MAX_ESCS; i++) {
         // Invalidate RPM data if not received for too long
-        if ((now_us - _rpm_data[i].last_update_us) > ESC_RPM_DATA_TIMEOUT_US) {
+        // (last_update_us might be fresher than now_us, so we use a signed difference)
+        if (int32_t(now_us - _rpm_data[i].last_update_us) > ESC_RPM_DATA_TIMEOUT_US) {
andyp1per (Collaborator)
Unfortunately I don't think this works, because the same comparison happens at the wrap point, and it needs to register a timeout.

robertlong13 (Collaborator, Author), Jan 2, 2025

Are you talking about when the time difference wraps around? After the difference wraps (after 36 minutes of being stale), this comparison stops setting data_valid to false, but that's okay: it has been pointlessly re-registering the timeout for 36 minutes, and now it simply stops setting the flag to false; it doesn't start setting it to true. Nothing sets it to true until fresh data arrives.

Or are you talking about the time when now_us itself wraps around? That also works fine: integer differences behave correctly across wraparound as long as both operands are the same unsigned type.
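A tiny standalone example of that wraparound property (illustrative only, not ArduPilot code):

#include <cstdint>
#include <cstdio>

int main() {
    // now_us has wrapped past zero; last_update_us has not. Same-width unsigned
    // subtraction still yields the true elapsed time (modulo 2^32).
    uint32_t last_update_us = 4294967000u;  // shortly before the uint32_t wrap
    uint32_t now_us         = 500u;         // micros() has since wrapped to a small value

    printf("elapsed = %u us\n", now_us - last_update_us);         // prints 796
    printf("timed out? %d\n",
           int32_t(now_us - last_update_us) > int32_t(1000000));  // prints 0
    return 0;
}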

robertlong13 (Collaborator, Author)

Quoting andyp1per: "The real fix is to introduce locking, but the code is in a hot path so we should avoid that."

Agreed; I saw your very nice comment paragraph in the code explaining why locking is a bad idea. Like you said, because this is such a hot path, I wanted to do something with almost zero overhead. For me, it is easier to prove that this logic works than to prove that I don't impact any tight timing requirements anywhere. The fact that it's tricky to tell that this works by inspection is a real flaw of my approach though. I'm open to alternatives, but I have hacked together a little demo to show that this logic works at least.

I made an int22_t, which goes through a full cycle every 4.2 seconds (so the time difference wraps after it hits 2.1s), and I made the RPM telemetry switch between updating and not-updating every 5 seconds. Here's a plot of the return value of get_rpm:
Figure 1: return flag of get_rpm

We can see the value displaying valid for 6 seconds (it's valid for 5 seconds, then times out after 1s of inactivity), then invalid for 4 seconds, just like it should. The data_valid flag follows the same pattern:
Figure 2: data_valid flag

Here's the timeout conditional, int22_t(now_us - _rpm_data[i].last_update_us) > ESC_RPM_DATA_TIMEOUT_US:
Figure 3: result of the timeout conditional

The validity drops to 0 right when the timeout conditional first transitions to 1, as we expect, but then a second later, the timeout conditional drops to 0 when the time difference wraps around. The flipping of the conditional doesn't matter though, because the validity flag remains false.

Here's the actual time difference, as cast to int22_t:
Figure 4: time difference cast to int22_t

All of these patterns repeat identically every cycle, even though now_us and last_update_us each wrap around at different points within the cycle. Their wraparounds don't affect the time-difference math. Here's now_us to show that:
Figure 5: now_us wrapping around at different times in the cycle

On the very left-hand side, we see now_us wrap around right after the conditional first goes to 1; the cycle after that we see it wrap just before. Nowhere does it actually impact the pattern displayed by the return value of get_rpm.

(I also have a plot with last_update included, but it just makes everything messier without adding anything.)
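A standalone reconstruction of this experiment, written as a sketch under the assumptions described above rather than the actual test harness used for the plots:

#include <cstdint>
#include <cstdio>

// Emulate the "int22_t" cast: keep 22 bits and sign-extend from bit 21.
static int32_t to_int22(uint32_t v)
{
    v &= 0x3FFFFF;
    return (v & 0x200000) ? int32_t(v) - 0x400000 : int32_t(v);
}

int main()
{
    const int32_t TIMEOUT_US = 1000000;   // 1 s, standing in for ESC_RPM_DATA_TIMEOUT_US
    uint32_t last_update_us = 0;          // stored modulo 2^22, like the narrowed clock
    bool data_valid = false;
    bool prev_valid = false;

    for (uint32_t t_us = 0; t_us < 20000000; t_us += 10000) {   // 20 s in 10 ms steps
        const uint32_t now_us = t_us & 0x3FFFFF;                // 22-bit clock, wraps every ~4.2 s
        const bool esc_active = (t_us / 5000000) % 2 == 0;      // telemetry toggles every 5 s

        if (esc_active) {                 // fresh data arrived: mark the record valid
            last_update_us = now_us;
            data_valid = true;
        }
        // the patched timeout check, with the narrow signed cast standing in for int32_t
        if (to_int22(now_us - last_update_us) > TIMEOUT_US) {
            data_valid = false;
        }
        if (data_valid != prev_valid) {   // print only the transitions
            printf("t = %5.2f s: data_valid -> %d\n", t_us / 1e6, data_valid);
            prev_valid = data_valid;
        }
    }
    return 0;
}

As in the plots above, validity comes up at t = 0 and t = 10 s and drops out one second after the telemetry stops (t = 6 s and t = 16 s), even though the 22-bit time difference wraps to a negative value partway through each idle period.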

andyp1per (Collaborator)

It's just too subtle for my liking, and it goes against the pattern we have pretty much everywhere else in the code. If you want to pursue this, I think you will need to add helpers and unit tests to AP_HAL::timeout_expired() and friends, but I really don't think this is a good idea. We already have a subtle change in this code from @magicrub to deal with a similar problem, and the changes are getting more involved...
