AP_ESC_Telem: fix RPM timeout race #28987

Closed

Conversation

robertlong13 (Collaborator)

The RPM telemetry data structure can get updated by another thread at any time. This can make the true time difference (now - last_update_us) negative, which in unsigned arithmetic looks like the data is nearly UINT32_MAX microseconds stale, causing perfectly healthy ESCs to be marked as timed out. As a result, Lua scripts get nil values from esc_telem:get_rpm even when the RPM data is fine. I'd imagine this can cause minor problems elsewhere too, but I'm not sure.

There are a few ways we could fix this. Using a signed time difference is the simplest and has the least overhead. We don't need to worry about this signed time difference wrapping around once the RPM data is 36 minutes stale, because the data_valid flag is set to false as soon as the data is 1 s stale, and it can't be set back to true until new data comes in.
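A minimal standalone sketch of the failure mode and of the signed-difference fix (illustrative only, not ArduPilot code; the constant name is just borrowed for readability):

#include <cstdint>
#include <cstdio>

int main() {
    const uint32_t ESC_RPM_DATA_TIMEOUT_US = 1000000; // 1 s

    // now_us is sampled first; another thread then stores a fresher timestamp
    uint32_t now_us = 5000000;
    uint32_t last_update_us = 5000010;

    uint32_t unsigned_age = now_us - last_update_us;          // wraps to 4294967286
    int32_t  signed_age   = int32_t(now_us - last_update_us); // -10

    printf("unsigned age %u -> timed out? %d\n", unsigned_age,
           unsigned_age > ESC_RPM_DATA_TIMEOUT_US);           // prints 1: spurious timeout
    printf("signed age %d -> timed out? %d\n", signed_age,
           signed_age > int32_t(ESC_RPM_DATA_TIMEOUT_US));    // prints 0: data treated as fresh
    return 0;
}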

The easiest way to demonstrate the problem (and the fix) is with the following Lua script running in SITL at 100x speed.

local ever_had_rpm = false
local function update()
    local rpm = esc_telem:get_rpm(2)
    -- Try to get RPM 20 times to increase our odds of hitting the race condition
    for _ = 1, 20 do
        rpm = rpm and esc_telem:get_rpm(2)
    end
    -- Make sure we don't complain until we've gotten a non-nil RPM at least once
    if rpm then
        if not ever_had_rpm then
            print("Got an RPM")
        end
        ever_had_rpm = true
    end
    -- Complain about a nil RPM
    if not rpm and ever_had_rpm then
        print("RPM is nil")
    end
    return update, 0
end

print("loaded rpm_nil_test.lua")

return update, 0

Note: AP_ESC_Telem_Backend::TelemetryData::stale is just as vulnerable to this, but TelemetryData doesn't have the same wraparound protection (though the issue is less severe there since it uses millis), so I can't use this same trick to fix it.
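A rough sketch of that analogous millis-based check, paraphrased from the description above rather than copied from the ArduPilot source:

// Illustrative paraphrase only, not the actual TelemetryData::stale code.
// The same race applies: if another thread refreshes last_update_ms between
// sampling millis() and subtracting, the unsigned difference wraps to a huge
// value and fresh data looks stale. A bare signed cast would not be a complete
// fix here, because there is no data_valid-style latch to keep the entry
// invalid once the signed difference itself wraps (about 24.8 days of
// staleness with millis, versus about 36 minutes with micros).
static bool telemetry_is_stale(uint32_t now_ms, uint32_t last_update_ms, uint32_t timeout_ms)
{
    return (now_ms - last_update_ms) > timeout_ms;
}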

Commit message:

The RPM telemetry data structure can get updated by another thread at
any time. This can cause (now - last_update_us) to be negative, which
looks like the data is nearly UINT32_MAX microseconds stale. To fix
this, we need to use signed integers for the time difference. The
data_valid flag prevents us from accidentally counting extremely stale
data as valid when the signed time difference wraps around, so this
is safe.
andyp1per (Collaborator)

I don't think this is easily fixable using signed values. The real fix is to introduce locking, but the code is in a hot path so we should avoid that. A simple fix might be to set data_valid to false temporarily in update_rpm, or alternatively have some kind of "in_update" flag which causes the check to fail temporarily. A better fix would be to take a copy of the data before getting the current time.
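A hedged sketch of that last alternative, taking a copy of the record before sampling the clock; the struct and field names here are stand-ins, not the actual AP_ESC_Telem definitions:

#include <AP_HAL/AP_HAL.h>  // for AP_HAL::micros()
#include <cstdint>

// Stand-in for the per-ESC RPM record (field names are assumptions).
struct RpmDataSketch {
    float    rpm;
    uint32_t last_update_us;
    bool     data_valid;
};

// Copy the (possibly concurrently written) fields before sampling the clock,
// so now_us can never be older than the timestamp it is compared against and
// the unsigned difference stays non-negative, with no locking required.
// A real patch would also have to consider compiler/CPU reordering of these reads.
static bool rpm_is_fresh(const volatile RpmDataSketch &shared, uint32_t timeout_us)
{
    const uint32_t last_update_us = shared.last_update_us;
    const bool     data_valid     = shared.data_valid;
    const uint32_t now_us         = AP_HAL::micros();
    return data_valid && (now_us - last_update_us) <= timeout_us;
}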

andyp1per (Collaborator) left a review:

Don't think this works, need a thread-safe solution I think.

@@ -811,16 +811,18 @@ void AP_ESC_Telem::update()
     const uint32_t now_us = AP_HAL::micros();
     for (uint8_t i = 0; i < ESC_TELEM_MAX_ESCS; i++) {
         // Invalidate RPM data if not received for too long
-        if ((now_us - _rpm_data[i].last_update_us) > ESC_RPM_DATA_TIMEOUT_US) {
+        // (last_update_us might be fresher than now_us, so we use a signed difference)
+        if (int32_t(now_us - _rpm_data[i].last_update_us) > ESC_RPM_DATA_TIMEOUT_US) {
andyp1per (Collaborator)
Unfortunately I don't think this works, because the same comparison happens at the wrap point, and it needs to register a timeout.

robertlong13 (Collaborator, Author), Jan 2, 2025

Are you talking about when the time difference wraps around? After the difference wraps (after 36 minutes of being stale), this comparison stops setting data_valid to false, but that's okay: it has been pointlessly re-registering the timeout for 36 minutes, and now it simply stops setting the flag to false; it doesn't start setting it to true. Nothing sets it to true until fresh data arrives.

Or are you talking about the time when now_us itself wraps around? That also works fine: integer differences behave correctly across wraparound as long as both operands are the same unsigned type.
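A tiny standalone example of that wraparound property (illustrative only, not ArduPilot code):

#include <cstdint>
#include <cstdio>

int main() {
    // now_us has wrapped past zero; last_update_us has not. Same-width unsigned
    // subtraction still yields the true elapsed time (modulo 2^32).
    uint32_t last_update_us = 4294967000u;  // shortly before the uint32_t wrap
    uint32_t now_us         = 500u;         // micros() has since wrapped to a small value

    printf("elapsed = %u us\n", now_us - last_update_us);         // prints 796
    printf("timed out? %d\n",
           int32_t(now_us - last_update_us) > int32_t(1000000));  // prints 0
    return 0;
}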

robertlong13 (Collaborator, Author)

Quoting andyp1per: "The real fix is to introduce locking, but the code is in a hot path so we should avoid that."

Agreed; I saw your very nice comment paragraph in the code explaining why locking is a bad idea. Like you said, because this is such a hot path, I wanted to do something with almost zero overhead. For me, it is easier to prove that this logic works than to prove that I don't impact any tight timing requirements anywhere. The fact that it's tricky to tell that this works by inspection is a real flaw of my approach though. I'm open to alternatives, but I have hacked together a little demo to show that this logic works at least.

I made an int22_t, which goes through a full cycle every 4.2 seconds (so the time difference wraps after it hits 2.1s), and I made the RPM telemetry switch between updating and not-updating every 5 seconds. Here's a plot of the return value of get_rpm:
Figure 1: return flag of get_rpm

We can see the value displaying valid for 6 seconds (it's valid for 5 seconds, then times out after 1s of inactivity), then invalid for 4 seconds, just like it should. The data_valid flag follows the same pattern:
Figure 2: data_valid flag

Here's the timeout conditional, int22_t(now_us - _rpm_data[i].last_update_us) > ESC_RPM_DATA_TIMEOUT_US:
Figure 3: result of the timeout conditional

The validity drops to 0 right when the timeout conditional first transitions to 1, as we expect, but then a second later, the timeout conditional drops to 0 when the time difference wraps around. The flipping of the conditional doesn't matter though, because the validity flag remains false.

Here's the actual time difference, as cast to int22_t:
Figure 4: time difference cast to int22_t

All of these patterns repeat identically every cycle, even though now_us and last_update_us each wrap around at different points within the cycle. Their wraparounds don't affect the time-difference math. Here's now_us to show that:
Figure 5: now_us wrapping around at different times in the cycle

On the very left-hand side, we see now_us wrap around right after the conditional first goes to 1; the cycle after that we see it wrap just before. Nowhere does it actually impact the pattern displayed by the return value of get_rpm.

(I also have a plot with last_update included, but it just makes everything messier without adding anything.)
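A standalone reconstruction of this experiment, written as a sketch under the assumptions described above rather than the actual test harness used for the plots:

#include <cstdint>
#include <cstdio>

// Emulate the "int22_t" cast: keep 22 bits and sign-extend from bit 21.
static int32_t to_int22(uint32_t v)
{
    v &= 0x3FFFFF;
    return (v & 0x200000) ? int32_t(v) - 0x400000 : int32_t(v);
}

int main()
{
    const int32_t TIMEOUT_US = 1000000;   // 1 s, standing in for ESC_RPM_DATA_TIMEOUT_US
    uint32_t last_update_us = 0;          // stored modulo 2^22, like the narrowed clock
    bool data_valid = false;
    bool prev_valid = false;

    for (uint32_t t_us = 0; t_us < 20000000; t_us += 10000) {   // 20 s in 10 ms steps
        const uint32_t now_us = t_us & 0x3FFFFF;                // 22-bit clock, wraps every ~4.2 s
        const bool esc_active = (t_us / 5000000) % 2 == 0;      // telemetry toggles every 5 s

        if (esc_active) {                 // fresh data arrived: mark the record valid
            last_update_us = now_us;
            data_valid = true;
        }
        // the patched timeout check, with the narrow signed cast standing in for int32_t
        if (to_int22(now_us - last_update_us) > TIMEOUT_US) {
            data_valid = false;
        }
        if (data_valid != prev_valid) {   // print only the transitions
            printf("t = %5.2f s: data_valid -> %d\n", t_us / 1e6, data_valid);
            prev_valid = data_valid;
        }
    }
    return 0;
}

As in the plots above, validity comes up at t = 0 and t = 10 s and drops out one second after the telemetry stops (t = 6 s and t = 16 s), even though the 22-bit time difference wraps to a negative value partway through each idle period.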

andyp1per (Collaborator)

It's just too subtle for my liking, and it goes against the pattern we have pretty much everywhere else in the code. If you want to pursue this, I think you will need to add helpers and unit tests to AP_HAL::timeout_expired() and friends, but I really don't think this is a good idea. We already have a subtle change in this code from @magicrub to deal with a similar problem, and the changes are getting more involved...
