Skip to content
This repository has been archived by the owner on Jul 8, 2022. It is now read-only.

Fix race conditions between polling threads and user threads pushing events #641

Conversation

mliszcz
Copy link
Collaborator

@mliszcz mliszcz commented Dec 16, 2019

Changes proposed in comment in another PR: #635 (comment)

@mliszcz mliszcz requested review from t-b, Ingvord and bourtemb December 16, 2019 13:52
@mliszcz mliszcz force-pushed the fix-511-crash-race-condition-event-push-attempt-2 branch from b1151f3 to a8f07bf Compare December 17, 2019 07:56
@mliszcz
Copy link
Collaborator Author

mliszcz commented Jan 17, 2020

Hi @gscalamera and @lorenzopivetta. This is PR which should correct #511. It contains necessary changes extracted from #635 and refactored. Could you please try to apply this patch and see if it solves the crash in your environment?

Copy link
Collaborator

@t-b t-b left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm attaching an abi-compliance report I've generated using the script attached to #613. So strictly speaking this is not ABI/API compliant. Not a big suprise as you remove symbols and propagate LastAttrValue from POD to class.
compat-report.tar.gz

That report also misses the removal of EventSupplier::detect_mutex which g++ 9.2 found.

Now I'm also aware that this PR is a guinea pig for the check so I think it is unfair to just reject the current solution.

The really strict solution would be to leave LastAttrValue as it is, implement its store member function as free function and keep EventSupplier::detect_mutex as well. But one could also argue that nobody outside tango itself should use LastAttrValue or inherit from EventSupplier and therefore the API/ABI breakage is not a problem. Thoughts?

Codewise: Looks good and the commit messages also explain what is done. One minor enhancement could be done though.

Replacing

                               if (except
                                   || quality == Tango::ATTR_INVALID
                                   || ((! except) && prev_change_event.err)
                                   || (quality != Tango::ATTR_INVALID && old_quality == Tango::ATTR_INVALID))

with

                               if (except
                                   || quality == Tango::ATTR_INVALID
                                   || prev_change_event.err
                                   || old_quality == Tango::ATTR_INVALID)

as short-circuting is used anyway.

@mliszcz
Copy link
Collaborator Author

mliszcz commented Jan 21, 2020

Thanks a lot Thomas for a detailed review and analysis!
I propose to leave this PR as it is, as it targets the development branch and if backport to 9.3-compatible version is needed, I'll restore the detect_mutex and remove function from LastAttrValue (it's just refactoring). What do you think?

@t-b
Copy link
Collaborator

t-b commented Jan 21, 2020

@mliszcz Sounds good.

@lorenzopivetta
Copy link

Hi @gscalamera and @lorenzopivetta. This is PR which should correct #511. It contains necessary changes extracted from #635 and refactored. Could you please try to apply this patch and see if it solves the crash in your environment?

Hi @mliszcz . Applied to 9-lts branch and compiled on PPC (requires tricking some configs). Need to wait for @gscalamera for testing.

@bourtemb
Copy link
Member

bourtemb commented Jan 22, 2020

We encountered the same issue #511 in one of our device servers here at the ESRF which was crashing several times a day.
We are currently testing this patch (applied to tango-9-lts branch) since yesterday afternoon. No crash so far.
Very encouraging.

Fixes data race between:
* polling thread (EventSupplier::detect_and_push_xxx_event)
* and user thread pushing events (Attribute::fire_xxx_event)
Remove detect_mutex and protect detect_change method with
event_mutex to synchronize access to the old attribute value.

Fixes data race between:
* polling thread (EventSupplier::detect_and_push_xxx_event)
* and user thread pushing events (EventSupplier::detect_change)
Fixes data race between:
* omni worker thread (DServer::event_subscription)
* and user thread pushing events (Attribute::fire_xxx_event)
@mliszcz mliszcz force-pushed the fix-511-crash-race-condition-event-push-attempt-2 branch from a8f07bf to ee262d4 Compare January 23, 2020 14:56
@t-b t-b added Candidate For Backport Requires a backport to the release branches Ready For Merge and removed Has Conflicts labels Jan 23, 2020
Redundant checks are removed to simplify the condition for enforcing
sending of change event. New condition is equivalent to the original
one. This is non-functional change.
@t-b t-b self-requested a review January 23, 2020 20:13
Copy link
Collaborator

@t-b t-b left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mliszcz Nice!

@mliszcz
Copy link
Collaborator Author

mliszcz commented Jan 24, 2020

Hi @lorenzopivetta and @gscalamera - any news from the test run?

@bourtemb or @Ingvord - I think that at least one of you needs to review this the PR before it can be merged. There is also a backport #665, quite similar to this PR so you can have a look as well.

Copy link
Member

@bourtemb bourtemb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work @mliszcz! Thanks!
Still no crash on our side since Tuesday with this patch.

@gscalam
Copy link

gscalam commented Jan 24, 2020

Hi @lorenzopivetta and @gscalamera - any news from the test run?

@bourtemb or @Ingvord - I think that at least one of you needs to review this the PR before it can be merged. There is also a backport #665, quite similar to this PR so you can have a look as well.

Running since almost 24 hours without any issue. I would say it's OK.

@mliszcz
Copy link
Collaborator Author

mliszcz commented Jan 24, 2020

Two approvals, no complains so I'll merge the PR. Thanks to everyone involved in the review and testing!

@mliszcz mliszcz merged commit ec89823 into tango-controls:tango-9-lts Jan 24, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Candidate For Backport Requires a backport to the release branches Ready For Merge
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants