Local Simulation Consistent SegFault #936

Closed
rsawtell opened this issue May 26, 2021 · 21 comments

@rsawtell

I updated packages, updated the subt repo, rebooted, and completely rebuilt my catkin workspace this morning. I now consistently get this error when the robot enters the tunnel upon the scoring start condition:

Stack trace (most recent call last) in thread 9011:
#10 Object "", at 0xffffffffffffffff, in
#9 Source "/build/glibc-S9d2JN/glibc-2.27/misc/../sysdeps/unix/sysv/linux/x86_64/clone.S", line 95, in [0x7fbdde07b71e]
#8 Source "/build/glibc-S9d2JN/glibc-2.27/nptl/pthread_create.c", line 463, in start_thread [0x7fbdddd426da]
#7 Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6", at 0x7fbddaa696de, in std::error_code::default_error_condition() const
#6 Object "/usr/lib/x86_64-linux-gnu/libignition-gazebo4.so.4", at 0x7fbdd9686209, in GetRecordPluginElem(sdf::v10::Root&)
#5 Object "/usr/lib/x86_64-linux-gnu/libignition-gazebo4.so.4", at 0x7fbdd9690dea, in ignition::gazebo::v4::SimulationRunner::Run(unsigned long)
#4 Object "/usr/lib/x86_64-linux-gnu/libignition-gazebo4.so.4", at 0x7fbdd96906e6, in ignition::gazebo::v4::SimulationRunner::Step(ignition::gazebo::v4::UpdateInfo const&)
#3 Object "/usr/lib/x86_64-linux-gnu/libignition-gazebo4.so.4", at 0x7fbdd968ac51, in ignition::gazebo::v4::SimulationRunner::UpdateSystems()
#2 Source "/home/rwsawtel/Desktop/Projects/SubT/Ignition/catkin_workspace/src/subt/subt_ign/src/GameLogicPlugin.cc", line 1163, in subt::GameLogicPlugin::PreUpdate(ignition::gazebo::v4::UpdateInfo const&, ignition::gazebo::v4::EntityComponentManager&) [0x7fbda83c86ff]
1160: for (auto &ke : this->dataPtr->keInfo)
1161: {
1162: ignition::gazebo::Link link(ke.second.link);
>1163: if (std::nullopt != link.WorldKineticEnergy(_ecm))
1164: {
1165: double currKineticEnergy = *link.WorldKineticEnergy(_ecm);
#1 Object "/usr/lib/x86_64-linux-gnu/libignition-gazebo4.so.4", at 0x7fbdd9638f6d, in ignition::gazebo::v4::Link::WorldKineticEnergy(ignition::gazebo::v4::EntityComponentManager const&) const
#0 Object "/usr/lib/x86_64-linux-gnu/libignition-gazebo4.so.4", at 0x7fbdd95ec953, in ignition::gazebo::v4::EntityComponentManager::ComponentImplementation(unsigned long, unsigned long) const
Segmentation fault (Signal sent by the kernel [(nil)])
Segmentation fault (core dumped)

@angelacmaio
Contributor

What command did you use to launch?

@rsawtell
Author

ign launch -v 4 competition.ign circuit:=cave worldName:=simple_cave_03 durationSec:=3600 robotName1:=X1 robotConfig1:=CORO_ALLIE_SENSOR_CONFIG_2

@angelacmaio
Contributor

I am unable to reproduce this issue. Can you make sure you have updated your local install of the simulator?

@rsawtell
Author

I've done all 3 steps now and it is still not working. Curiously, though, the "sudo apt-get update && sudo apt-get upgrade ignition-dome" command updated packages that a generic system-wide package update did not (though apparently not enough of them, or not the right ones, to make it work). Our IT staff finds this highly unusual and is wondering whether something on the package maintainer side is preventing a normal update.

@peci1
Collaborator

peci1 commented May 28, 2021

we got this segfault too

@peci1
Collaborator

peci1 commented May 28, 2021

in a docker image based on the official image

@peci1
Collaborator

peci1 commented May 30, 2021

Cloudsim can reproduce this issue! All our weekend simulations ended shortly after scoring started!

This is the end of server_console.log:

(2021-05-28T21:27:30.713344371) [Dbg] [VisibilityRfModel.cc:134] Range: 2.03894, Exp: 1.5, Num hops: 0, TX: 20, RX: -24.6411
(2021-05-28T21:27:31.376612868) [Msg] Scoring has Started
(2021-05-28T21:27:31.378380220) [Dbg] [VisibilityRfModel.cc:134] Range: 2, Exp: 1.5, Num hops: 0, TX: 20, RX: -24.5154
(2021-05-28T21:27:31.378545523) [Dbg] [VisibilityRfModel.cc:134] Range: 8.24621, Exp: 1.5, Num hops: 0, TX: 20, RX: -33.7438
(2021-05-28T21:27:31.378652263) [Dbg] [VisibilityRfModel.cc:134] Range: 4.47455, Exp: 1.5, Num hops: 0, TX: 20, RX: -29.7612
(2021-05-28T21:27:31.378768594) [Dbg] [VisibilityRfModel.cc:134] Range: 5.02781, Exp: 1.5, Num hops: 0, TX: 20, RX: -30.5207
(2021-05-28T21:27:31.378882020) [Dbg] [VisibilityRfModel.cc:134] Range: 8.01849, Exp: 1.5, Num hops: 0, TX: 20, RX: -33.5614

@zbynekwinkler

Too bad that #919 is not deployed yet - or is it? If so, where would the actual stdout be?

@nkoenig
Contributor

nkoenig commented Jun 1, 2021

New docker images have been released, see https://github.com/osrf/subt/wiki/release_notes#2021-06-01. Can you give these a try?

The docker images and catkin workspace methods work for me.

@peci1
Collaborator

peci1 commented Jun 1, 2021

Exactly the same segfault. It might be triggered by a specific robot type or world. We have UAV-only cloudsim sims running in cave 05 normally.

I got the segfault with these parameters:

headless:=true robotName1:=X1 robotConfig1:=CTU_CRAS_NORLAB_MARV_SENSOR_CONFIG_4 robotName2:=X2 robotConfig2:=CTU_CRAS_NORLAB_MARMOTTE_SENSOR_CONFIG_2 robotName3:=TEAMBASE robotConfig3:=TEAMBASE robotName4:=X3 robotConfig4:=CSIRO_DATA61_DTR_SENSOR_CONFIG_2 robotName5:=uav3 robotConfig5:=CTU_CRAS_NORLAB_X500_SENSOR_CONFIG_1 robotName6:=uav5 robotConfig6:=CTU_CRAS_NORLAB_X500_SENSOR_CONFIG_1 robotName7:=uav7 robotConfig7:=CTU_CRAS_NORLAB_X500_SENSOR_CONFIG_1 ros:=true durationSec:=3600 circuit:=finals worldName:=finals_practice_02

@peci1
Collaborator

peci1 commented Jun 1, 2021

Ahh, got it with a debug build:

er&) [0x7fb4a8949f3d]
       1266:         // Apply KE factor.
       1267:         deltaKE *= robotPlatformTypes.at(
      >1268:           this->dataPtr->robotFullTypes[ke.second.robotName].first);
       1269:         ke.second.prevKineticEnergy = currKineticEnergy;
       1270:
       1271:         // Crash if past the threshold.
#7    Source "/usr/include/c++/8/bits/stl_map.h", line 548, in std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, double, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, double> > >::at(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const [0x7fb4a89748da]
        545:       {
        546:    const_iterator __i = lower_bound(__k);
        547:    if (__i == end() || key_comp()(__k, (*__i).first))
      > 548:      __throw_out_of_range(__N("map::at"));
        549:    return (*__i).second;
        550:       }
#6    Object "/usr/lib/x86_64-l

Poor Marmotte is missing in the robotPlatformTypes map:

const std::map<std::string, double> robotPlatformTypes = {

So it's probably time to revise the list of robots.
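
As a minimal sketch of how a lookup like this could tolerate an unlisted platform: std::map::at() throws std::out_of_range for a missing key, while find() lets the caller decide what to do. The table contents, the function names, and the fallback factor of 1.0 below are all assumptions made for illustration; they are not taken from the subt code.

```cpp
#include <map>
#include <optional>
#include <string>

// Hypothetical stand-in for the plugin's platform table. The names and
// factors are illustrative only, not the values used in subt.
const std::map<std::string, double> kPlatformFactors = {
  {"X1", 1.0},
  {"HD2", 2.0},
};

// Returns the factor for a platform, or std::nullopt if it is not listed.
// Unlike std::map::at(), this never throws for an unknown key.
std::optional<double> PlatformFactor(const std::string &platform)
{
  const auto it = kPlatformFactors.find(platform);
  if (it == kPlatformFactors.end())
    return std::nullopt;
  return it->second;
}

// Scales a kinetic-energy delta. An unknown platform leaves the delta
// unchanged (assumed fallback factor of 1.0) instead of aborting the run.
double ScaleDeltaKE(double deltaKE, const std::string &platform)
{
  return deltaKE * PlatformFactor(platform).value_or(1.0);
}
```

Whether a silent fallback is appropriate for scoring is a separate question; logging the unknown platform name at the point of the failed lookup would at least make a missing entry visible without a debug build.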

@peci1
Collaborator

peci1 commented Jun 1, 2021

Easiest way to trigger this locally - spawn a Marmotte and call /subt/start.

@malcolmst

malcolmst commented Jun 2, 2021

Just hit this too in a local run using the docker containers, looks like the same stack trace:

Stack trace (most recent call last) in thread 546:
#10   Object "", at 0xffffffffffffffff, in 
#9    Object "/lib/x86_64-linux-gnu/libc.so.6", at 0x7f978c6bf71e, in clone
#8    Object "/lib/x86_64-linux-gnu/libpthread.so.0", at 0x7f978c3866da, in start_thread
#7    Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6", at 0x7f978938b6de, in std::error_code::default_error_condition() const
#6    Object "/usr/lib/x86_64-linux-gnu/libignition-gazebo4.so.4", at 0x7f9787fa8209, in GetRecordPluginElem(sdf::v10::Root&)
#5    Object "/usr/lib/x86_64-linux-gnu/libignition-gazebo4.so.4", at 0x7f9787fb2dea, in ignition::gazebo::v4::SimulationRunner::Run(unsigned long)
#4    Object "/usr/lib/x86_64-linux-gnu/libignition-gazebo4.so.4", at 0x7f9787fb26e6, in ignition::gazebo::v4::SimulationRunner::Step(ignition::gazebo::v4::UpdateInfo const&)
#3    Object "/usr/lib/x86_64-linux-gnu/libignition-gazebo4.so.4", at 0x7f9787facc51, in ignition::gazebo::v4::SimulationRunner::UpdateSystems()
#2    Object "/home/developer/subt_ws/install/lib/libGameLogicPlugin.so", at 0x7f975459e460, in subt::GameLogicPlugin::PreUpdate(ignition::gazebo::v4::UpdateInfo const&, ignition::gazebo::v4::EntityComponentManager&)
#1    Object "/usr/lib/x86_64-linux-gnu/libignition-gazebo4.so.4", at 0x7f9787f5af6d, in ignition::gazebo::v4::Link::WorldKineticEnergy(ignition::gazebo::v4::EntityComponentManager const&) const
#0    Object "/usr/lib/x86_64-linux-gnu/libignition-gazebo4.so.4", at 0x7f9787f0e949, in ignition::gazebo::v4::EntityComponentManager::ComponentImplementation(unsigned long, unsigned long) const
Floating point exception (Integer divide by zero [0x7f9787f0e949])
Floating point exception (core dumped)

My launch command was:

ign launch -v 4 competition.ign worldName:=finals_qual circuit:=finals robotName1:=X2N1 robotConfig1:=MARBLE_HD2_SENSOR_CONFIG_2 headless:=true

Edit: I have not updated to the brand new docker containers, will try those

Edit 2: It doesn't seem to be 100% consistent for me, maybe it depends on timing of the /subt/start? Nvm, it does look pretty consistent here...

@peci1
Collaborator

peci1 commented Jun 2, 2021

This for loop looks very suspicious:

for (const std::pair<std::string, double> &typeKE :
     robotPlatformTypes)
{
  std::string platformNameUpper = filePath->Data();
  std::transform(platformNameUpper.begin(),
                 platformNameUpper.end(),
                 platformNameUpper.begin(), ::toupper);
  if (platformNameUpper.find(typeKE.first) != std::string::npos)
  {
    this->dataPtr->robotTypes.insert(typeKE.first);
    // The full type is in the directory name, which is third
    // from the end (.../TYPE/VERSION/model.sdf).
    std::vector<std::string> pathParts =
      ignition::common::split(platformNameUpper, "/");
    this->dataPtr->robotFullTypes[mName->Data()] =
      {typeKE.first, pathParts[pathParts.size()-3]};
  }
}
// Subscribe to detach topics. We are doing a blanket

platformNameUpper is the absolute file path converted to upper case. The call platformNameUpper.find(typeKE.first) then searches the whole path for short platform names like HD2 and MARMOTTE. But what if a different part of the path contains one of these strings as a substring? I think the order of operations should be switched: first split the path into parts, then do the substring matching only on the relevant part.
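
A rough sketch of that reordering, using only the standard library: split the path first, then match against the TYPE directory alone. SplitPath and PlatformFromPath are hypothetical helpers written for this illustration, not functions from subt or ignition::common, and the layout .../TYPE/VERSION/model.sdf is taken from the comment in the quoted loop.

```cpp
#include <algorithm>
#include <cctype>
#include <optional>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Splits a path on '/' and drops empty components.
std::vector<std::string> SplitPath(const std::string &path)
{
  std::vector<std::string> parts;
  std::stringstream stream(path);
  std::string part;
  while (std::getline(stream, part, '/'))
  {
    if (!part.empty())
      parts.push_back(part);
  }
  return parts;
}

// Looks for a short platform name inside the TYPE directory only, assuming
// the layout .../TYPE/VERSION/model.sdf. Returns std::nullopt when the path
// is too short or no listed platform matches.
std::optional<std::string> PlatformFromPath(
    const std::string &path, const std::set<std::string> &platforms)
{
  const std::vector<std::string> parts = SplitPath(path);
  if (parts.size() < 3)
    return std::nullopt;

  // Upper-case just the TYPE component, third from the end.
  std::string type = parts[parts.size() - 3];
  std::transform(type.begin(), type.end(), type.begin(),
      [](unsigned char c) { return static_cast<char>(std::toupper(c)); });

  for (const std::string &platform : platforms)
  {
    if (type.find(platform) != std::string::npos)
      return platform;
  }
  return std::nullopt;
}
```

With this ordering, a made-up path such as /home/hd2_user/models/CTU_CRAS_NORLAB_MARMOTTE_SENSOR_CONFIG_2/1/model.sdf resolves to MARMOTTE even though an earlier directory contains "hd2". The size check also guards the parts.size() - 3 index, which the quoted loop uses unchecked.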

@peci1
Collaborator

peci1 commented Jun 2, 2021

Ahh, it seems we're mixing two issues here. My issue is probably caused by Marmotte not being on the robot list, as I crash at the map::at() call. The other cases here segfault on if (std::nullopt != link.WorldKineticEnergy(_ecm))... I'll try tomorrow without Marmotte...

@malcolmst

You might not believe this (not sure if I do either), but it appears at least the error I'm hitting is caused by the g++ version.

Recent base images updated gcc from 7.5.0 to 8.4.0. This causes many problems for CUDA 10, so as a quick and dirty workaround I had installed gcc 7.5.0 again and selected that version to be used for g++ and gcc by default in my most recent build. Apparently, however, the older compiler breaks the subt_ws code. I reverted my change to replace the default version and used -DCMAKE_CUDA_HOST_COMPILER=g++-7 instead to fix the CUDA issue, and everything appears to be working again. So far so good anyway.

@peci1
Collaborator

peci1 commented Jun 2, 2021

@malcolmst I assume the changes you describe are in solution containers... how could they affect the simulation container? or do you run the simulator from the same code as your solution container?

@rsawtell
Author

rsawtell commented Jun 2, 2021

Some combination of pulling the latest subt changes and switching to gcc 8.4.0 has fixed this for me.

@malcolmst

@peci1 yeah for doing a quick local test sometimes I just run it in the single solution container

@nkoenig
Contributor

nkoenig commented Jun 10, 2021

Marmotte has been added to the robot list in #952. That should help resolve your issue @peci1.

@nkoenig
Contributor

nkoenig commented Jun 21, 2021

Closing since I believe this issue is now resolved.

@nkoenig nkoenig closed this as completed Jun 21, 2021