[19702] Fix Data Race when updating liveliness changed in WLP #3926

Merged: 2 commits into master from fix/ci-github/short-liveliness-tests, Oct 26, 2023


@Mario-DL Mario-DL commented Oct 11, 2023

Description

This PR fixes a data race in WLP::update_liveliness_changed_status() and repairs the ShortLiveliness Windows CI tests (see the manual job launch results here).

This PR does not add a new test, since the existing ShortLiveliness test suite (on Windows) already demonstrates the issue.
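
The diff itself is not reproduced in this conversation, so the following is only a generic sketch of how a data race on a shared status structure is commonly avoided: every read and write goes through the same mutex. All names here (LivelinessChangedStatusSketch, GuardedStatus, run_concurrent_updates) are hypothetical and are not taken from Fast DDS; the actual change in WLP.cpp may look quite different.

```cpp
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for a liveliness-changed status; NOT the Fast DDS type.
struct LivelinessChangedStatusSketch
{
    std::int32_t alive_count = 0;
    std::int32_t not_alive_count = 0;
    std::int32_t alive_count_change = 0;
    std::int32_t not_alive_count_change = 0;
};

// Hypothetical holder that serializes every access to the status.
class GuardedStatus
{
public:

    // Apply a delta under the lock so concurrent updaters cannot interleave.
    void update(std::int32_t alive_change, std::int32_t not_alive_change)
    {
        std::lock_guard<std::mutex> guard(mutex_);
        status_.alive_count += alive_change;
        status_.not_alive_count += not_alive_change;
        status_.alive_count_change += alive_change;
        status_.not_alive_count_change += not_alive_change;
    }

    // Return a snapshot and reset the change counters, also under the lock.
    LivelinessChangedStatusSketch take()
    {
        std::lock_guard<std::mutex> guard(mutex_);
        LivelinessChangedStatusSketch copy = status_;
        status_.alive_count_change = 0;
        status_.not_alive_count_change = 0;
        return copy;
    }

private:

    std::mutex mutex_;
    LivelinessChangedStatusSketch status_;
};

// Drive the holder from several threads and return the final alive_count.
std::int32_t run_concurrent_updates(int threads, int updates_per_thread)
{
    GuardedStatus status;
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t)
    {
        workers.emplace_back([&status, updates_per_thread]()
                {
                    for (int i = 0; i < updates_per_thread; ++i)
                    {
                        status.update(1, 0);
                    }
                });
    }
    for (std::thread& worker : workers)
    {
        worker.join();
    }
    return status.take().alive_count;
}
```

Calling run_concurrent_updates(4, 1000) should always yield 4000 with the lock in place; without it, the increments could be lost and TSan would be expected to report the race. Keeping the critical section this small is one way to limit the deadlock risk that comes with adding locks.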

@Mergifyio backport 2.11.x 2.10.x 2.6.x

Contributor Checklist

  • Commit messages follow the project guidelines.
  • The code follows the style guidelines of this project.
  • N/A Tests that thoroughly check the new feature have been added/Regression tests checking the bug and its fix have been added; the added tests pass locally
  • N/A Any new/modified methods have been properly documented using Doxygen.
  • Changes are ABI compatible.
  • Changes are API compatible.
  • N/A New feature has been added to the versions.md file (if applicable).
  • N/A New feature has been documented/Current behavior is correctly described in the documentation.
  • Applicable backports have been included in the description.

Reviewer Checklist

  • The PR has a milestone assigned.
  • Check contributor checklist is correct.
  • Check CI results: changes do not issue any warning.
  • Check CI results: failing tests are unrelated with the changes.

@Mario-DL Mario-DL added this to the v2.12.1 milestone Oct 11, 2023
@Mario-DL Mario-DL added the needs-review and ci-pending labels Oct 11, 2023

@MiguelCompany MiguelCompany left a comment


Let's see what TSan says about this. I remember this part of the code was subject to deadlocks in the past.

Besides that, see my suggestion below.

Review comment on src/cpp/rtps/builtin/liveliness/WLP.cpp (outdated, resolved)
@Mario-DL Mario-DL removed the needs-review label Oct 16, 2023
@Mario-DL Mario-DL force-pushed the fix/ci-github/short-liveliness-tests branch from 1a60361 to b558087 Compare October 16, 2023 12:22

Mario-DL commented Oct 16, 2023

  • The temporary fix of moving the fastcdr branch to master was not enough to get the TSan job working, as there are conflicts when merging from the forked repo.
  • Compiled locally with the following colcon cmake options.
  • Checked in the compilation logs that cmake used the correct gcc (12.3.0).
  • Exported the following variable in the terminal where the tests are run: export TSAN_OPTIONS="second_deadlock_stack=1 history_size=7 memory_limit_mb=5000"
  • The following tests passed with no warnings or errors:
    BlackboxTests_DDS_PIM.LivelinessQos.ShortLiveliness_ManualByTopic_Reliable.Transport
    BlackboxTests_DDS_PIM.LivelinessQos.ShortLiveliness_ManualByTopic_Reliable.Intraprocess
    BlackboxTests_DDS_PIM.LivelinessQos.ShortLiveliness_ManualByTopic_Reliable.Datasharing
    BlackboxTests_DDS_PIM.LivelinessQos.ShortLiveliness_ManualByTopic_BestEffort.Transport
    BlackboxTests_DDS_PIM.LivelinessQos.ShortLiveliness_ManualByTopic_BestEffort.Intraprocess
    BlackboxTests_DDS_PIM.LivelinessQos.ShortLiveliness_ManualByTopic_BestEffort.Datasharing
    BlackboxTests_DDS_PIM.LivelinessQos.ShortLiveliness_ManualByTopic_Automatic_Reliable.Transport
    BlackboxTests_DDS_PIM.LivelinessQos.ShortLiveliness_ManualByTopic_Automatic_Reliable.Intraprocess
    BlackboxTests_DDS_PIM.LivelinessQos.ShortLiveliness_ManualByTopic_Automatic_Reliable.Datasharing
    BlackboxTests_DDS_PIM.LivelinessQos.ShortLiveliness_ManualByTopic_Automatic_BestEffort.Transport
    BlackboxTests_DDS_PIM.LivelinessQos.ShortLiveliness_ManualByTopic_Automatic_BestEffort.Intraprocess
    BlackboxTests_DDS_PIM.LivelinessQos.ShortLiveliness_ManualByTopic_Automatic_BestEffort.Datasharing
    BlackboxTests_DDS_PIM.LivelinessQos.ShortLiveliness_ManualByTopic_ManualByParticipant_Reliable.Transport
    BlackboxTests_DDS_PIM.LivelinessQos.ShortLiveliness_ManualByTopic_ManualByParticipant_Reliable.Intraprocess
    BlackboxTests_DDS_PIM.LivelinessQos.ShortLiveliness_ManualByTopic_ManualByParticipant_Reliable.Datasharing
    BlackboxTests_DDS_PIM.LivelinessQos.ShortLiveliness_ManualByTopic_ManualByParticipant_BestEffort.Transport
    BlackboxTests_DDS_PIM.LivelinessQos.ShortLiveliness_ManualByTopic_ManualByParticipant_BestEffort.Intraprocess
    BlackboxTests_DDS_PIM.LivelinessQos.ShortLiveliness_ManualByTopic_ManualByParticipant_BestEffort.Datasharing

@MiguelCompany

@richiprosima Please test this


Mario-DL commented Oct 18, 2023

Update & findings:

  • Running the thread sanitizer on Linux, the following tests pass with no errors or warnings:
    BlackboxTests_FastRTPS.Discovery.TwentyParticipantsMulticast
    BlackboxTests_FastRTPS.Discovery.TwentyParticipantsMulticastLocalhostOnly
    BlackboxTests_FastRTPS.Discovery.TwentyParticipantsSeveralEndpointsMulticast
    BlackboxTests_FastRTPS.Discovery.TwentyParticipantsUnicast.Transport
    BlackboxTests_FastRTPS.Discovery.TwentyParticipantsUnicast.Intraprocess
    BlackboxTests_FastRTPS.Discovery.TwentyParticipantsSeveralEndpointsUnicast.Transport
    BlackboxTests_FastRTPS.Discovery.TwentyParticipantsSeveralEndpointsUnicast.Intraprocess
    BlackboxTests_DDS_PIM.Discovery.TwentyParticipantsMulticast
    BlackboxTests_DDS_PIM.Discovery.TwentyParticipantsMulticastLocalhostOnly
    BlackboxTests_DDS_PIM.Discovery.TwentyParticipantsSeveralEndpointsMulticast
    BlackboxTests_DDS_PIM.Discovery.TwentyParticipantsUnicast.Transport
    BlackboxTests_DDS_PIM.Discovery.TwentyParticipantsUnicast.Intraprocess
    BlackboxTests_DDS_PIM.Discovery.TwentyParticipantsUnicast.Datasharing
    BlackboxTests_DDS_PIM.Discovery.TwentyParticipantsSeveralEndpointsUnicast.Transport
    BlackboxTests_DDS_PIM.Discovery.TwentyParticipantsSeveralEndpointsUnicast.Intraprocess
    BlackboxTests_DDS_PIM.Discovery.TwentyParticipantsSeveralEndpointsUnicast.Datasharing
    BlackboxTests_DDS_PIM.Discovery.TwentyParticipantsSeveralEndpointsUnicast.Datasharing was rerun with retest-until-fail 10 and still passed.

  • In a Windows VM (4 cores), BlackboxTests_DDS_PIM.Discovery.TwentyParticipantsSeveralEndpointsUnicast.Datasharing fails reproducibly (about 1 run out of 10) in both the master and fix/ci-github/short-liveliness-tests branches. Interestingly, in both branches it sometimes fails by timeout and sometimes due to the unhandled throw Failed init_port fastrtps_portXXXX: Interprocess mutex timeout when locking. Possible deadlock: owner died without unlocking? -> Function eprosima::fastdds::rtps::SharedMemGlobal::open_port_internal. This exception sometimes appears during test retries, but on some occasions it is not enough to make the test fail (in the code, another attempt at removing and creating the segment is performed before re-throwing the exception).
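
The "one more attempt before the exception escapes" behaviour described above can be sketched roughly as follows. This is not the SharedMemGlobal code: open_port_with_retry and both callbacks are hypothetical names used only to illustrate the pattern, and the port handle is reduced to an int.

```cpp
#include <exception>
#include <functional>
#include <stdexcept>

// Hypothetical helper (not a Fast DDS function): try to open a port; if that
// throws, recreate the underlying segment and try exactly once more. A failure
// of the second attempt propagates to the caller.
int open_port_with_retry(
        const std::function<int()>& open_port,
        const std::function<void()>& recreate_segment)
{
    try
    {
        return open_port();
    }
    catch (const std::exception&)
    {
        // First attempt failed (for example, a lock timeout): rebuild and retry.
        recreate_segment();
        return open_port();
    }
}
```

With this shape, a single transient timeout is absorbed by the retry, which would match the observation that the exception sometimes shows up without failing the test, while two consecutive failures still surface as an error.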

Hint: Increasing the timed_lock() waiting time to BOOST_INTERPROCESS_TIMEOUT_... * 5 seems to reduce the probability of the issue but does not eliminate it.
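
As a rough illustration of the hint, the sketch below scales a base timeout by a multiplier before a timed lock attempt. It uses std::timed_mutex from the standard library as a stand-in for the Boost.Interprocess mutex, and the function names and the base timeout value are assumptions made for the example, not the constants used in Fast DDS.

```cpp
#include <chrono>
#include <mutex>

// Compute the effective waiting time: base timeout times a multiplier.
std::chrono::milliseconds scaled_timeout(
        std::chrono::milliseconds base_timeout,
        int multiplier)
{
    return base_timeout * multiplier;
}

// Try to acquire the mutex within the scaled timeout.
// Returns true if the lock was taken; the caller must then unlock it.
bool lock_with_scaled_timeout(
        std::timed_mutex& mutex,
        std::chrono::milliseconds base_timeout,
        int multiplier)
{
    return mutex.try_lock_for(scaled_timeout(base_timeout, multiplier));
}
```

A longer wait only gives a slow owner more time to release the lock; it cannot help if the owner really died while holding it, which is consistent with the failure becoming rarer but not disappearing.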

Signed-off-by: Mario Dominguez <mariodominguez@eprosima.com>
Signed-off-by: Mario Dominguez <mariodominguez@eprosima.com>
@Mario-DL Mario-DL force-pushed the fix/ci-github/short-liveliness-tests branch from b558087 to 6c02a2b Compare October 24, 2023 07:16

EduPonz commented Oct 24, 2023

@richiprosima please test this


EduPonz commented Oct 25, 2023

@richiprosima please test mac

@Mario-DL

@richiprosima please test mac

@Mario-DL

@richiprosima please test mac

@Mario-DL

Regarding the failed Linux tests: locally, compiled with TSan, those same test failures occur approximately once every 30 attempts. TSan does not output any warnings; the tests fail in the assert (std::find(triggered.begin(), triggered.end(), any_cond)) != (triggered.end()).
I could also reproduce the same test failures on master.
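
For readers unfamiliar with the failing check, the assert quoted above is a plain membership test: it holds only if any_cond appears in the triggered container. A minimal sketch of that check, with a hypothetical ConditionSketch type standing in for the real condition class, could look like this:

```cpp
#include <algorithm>
#include <vector>

// Hypothetical placeholder for a condition object; NOT the Fast DDS type.
struct ConditionSketch
{
};

// True if `cond` is among the conditions reported as triggered.
bool was_triggered(
        const std::vector<ConditionSketch*>& triggered,
        const ConditionSketch* cond)
{
    return std::find(triggered.begin(), triggered.end(), cond) != triggered.end();
}
```

A failure of this check therefore means the expected condition was missing from the triggered list at the time of the check, which is a different symptom from the data race that TSan would report.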

@EduPonz EduPonz removed the ci-pending label Oct 26, 2023
@EduPonz EduPonz merged commit 8126a51 into master Oct 26, 2023
7 of 8 checks passed
@EduPonz EduPonz deleted the fix/ci-github/short-liveliness-tests branch October 26, 2023 08:01
@eProsima eProsima deleted a comment from mergify bot Oct 26, 2023
@Mario-DL

@Mergifyio backport 2.11.x 2.10.x 2.6.x


mergify bot commented Oct 26, 2023

backport 2.11.x 2.10.x 2.6.x

✅ Backports have been created

mergify bot pushed a commit that referenced this pull request Oct 26, 2023
* Refs #19702: Fix Data Race in WLP

Signed-off-by: Mario Dominguez <mariodominguez@eprosima.com>

* Refs #19702: reviewer suggestion

Signed-off-by: Mario Dominguez <mariodominguez@eprosima.com>

---------

Signed-off-by: Mario Dominguez <mariodominguez@eprosima.com>
(cherry picked from commit 8126a51)
mergify bot pushed a commit that referenced this pull request Oct 26, 2023
* Refs #19702: Fix Data Race in WLP

Signed-off-by: Mario Dominguez <mariodominguez@eprosima.com>

* Refs #19702: reviewer suggestion

Signed-off-by: Mario Dominguez <mariodominguez@eprosima.com>

---------

Signed-off-by: Mario Dominguez <mariodominguez@eprosima.com>
(cherry picked from commit 8126a51)
mergify bot pushed a commit that referenced this pull request Oct 26, 2023
* Refs #19702: Fix Data Race in WLP

Signed-off-by: Mario Dominguez <mariodominguez@eprosima.com>

* Refs #19702: reviewer suggestion

Signed-off-by: Mario Dominguez <mariodominguez@eprosima.com>

---------

Signed-off-by: Mario Dominguez <mariodominguez@eprosima.com>
(cherry picked from commit 8126a51)
EduPonz pushed a commit that referenced this pull request Nov 13, 2023
* Refs #19702: Fix Data Race in WLP

Signed-off-by: Mario Dominguez <mariodominguez@eprosima.com>

* Refs #19702: reviewer suggestion

Signed-off-by: Mario Dominguez <mariodominguez@eprosima.com>

---------

Signed-off-by: Mario Dominguez <mariodominguez@eprosima.com>
(cherry picked from commit 8126a51)

Co-authored-by: Mario Domínguez López <116071334+Mario-DL@users.noreply.github.com>
EduPonz pushed a commit that referenced this pull request Nov 13, 2023
* Refs #19702: Fix Data Race in WLP

Signed-off-by: Mario Dominguez <mariodominguez@eprosima.com>

* Refs #19702: reviewer suggestion

Signed-off-by: Mario Dominguez <mariodominguez@eprosima.com>

---------

Signed-off-by: Mario Dominguez <mariodominguez@eprosima.com>
(cherry picked from commit 8126a51)

Co-authored-by: Mario Domínguez López <116071334+Mario-DL@users.noreply.github.com>
EduPonz pushed a commit that referenced this pull request Nov 13, 2023
* Refs #19702: Fix Data Race in WLP

Signed-off-by: Mario Dominguez <mariodominguez@eprosima.com>

* Refs #19702: reviewer suggestion

Signed-off-by: Mario Dominguez <mariodominguez@eprosima.com>

---------

Signed-off-by: Mario Dominguez <mariodominguez@eprosima.com>
(cherry picked from commit 8126a51)

Co-authored-by: Mario Domínguez López <116071334+Mario-DL@users.noreply.github.com>