Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clock drift when running ArduPilot SITL in a Docker container on a Mac #12060

Closed
ntamas opened this issue Aug 16, 2019 · 12 comments
Closed

Clock drift when running ArduPilot SITL in a Docker container on a Mac #12060

ntamas opened this issue Aug 16, 2019 · 12 comments

Comments

@ntamas
Copy link
Contributor

ntamas commented Aug 16, 2019

Bug report

Issue details

The internal clock of the ArduPilot SITL drifts from wall clock time when it is run in a Docker container on a Mac. This might not be an issue with ArduPilot itself (although it could be), but it would be nice to have it documented somewhere, at least in the issue tracker in case someone else bumps into this.

The issue can be seen by re-instantiating the commented-out debug section in Aircraft::sync_frame_time() and then looking for strange spikes in achieved_rate_hz. In my case, it looks like this (note that I have extended the debug statement to print the value of frame_counter as well as the difference between the current wall clock time and the last wall clock time seen in the previous iteration):

frame_counter=40 diff=29308 achieved_rate_hz=1189.828 rate=1364.82 rate_hz=1200.000 sft=667.1
frame_counter=40 diff=31289 achieved_rate_hz=1190.713 rate=1278.40 rate_hz=1200.000 sft=666.4
frame_counter=40 diff=34516 achieved_rate_hz=1190.395 rate=1158.88 rate_hz=1200.000 sft=665.8
frame_counter=40 diff=32160 achieved_rate_hz=1190.929 rate=1243.78 rate_hz=1200.000 sft=665.1
frame_counter=40 diff=34074 achieved_rate_hz=1190.759 rate=1173.92 rate_hz=1200.000 sft=664.4
frame_counter=40 diff=33163 achieved_rate_hz=1190.913 rate=1206.16 rate_hz=1200.000 sft=663.8
frame_counter=40 diff=32218 achieved_rate_hz=1191.419 rate=1241.54 rate_hz=1200.000 sft=663.1
frame_counter=40 diff=32905 achieved_rate_hz=1191.661 rate=1215.62 rate_hz=1200.000 sft=662.4
frame_counter=41 diff=3 achieved_rate_hz=137846.422 rate=13666667.00 rate_hz=1200.000 sft=663.1
frame_counter=40 diff=30856 achieved_rate_hz=136480.922 rate=1296.34 rate_hz=1200.000 sft=663.8
frame_counter=40 diff=31817 achieved_rate_hz=135128.688 rate=1257.19 rate_hz=1200.000 sft=664.4
frame_counter=40 diff=33234 achieved_rate_hz=133789.438 rate=1203.59 rate_hz=1200.000 sft=665.1
frame_counter=40 diff=32124 achieved_rate_hz=132464.000 rate=1245.17 rate_hz=1200.000 sft=665.8
frame_counter=40 diff=34529 achieved_rate_hz=131150.938 rate=1158.45 rate_hz=1200.000 sft=666.4

You can see that occasionally, the simulator "thinks" that it has executed 41 frames in 3 microseconds, and this estimates the update rate at 13666667 Hz. This erroneous estimate is large enough that it "skews" the smoothed achieved_rate_hz estimate to 137846 Hz, which means that ArduCopter will intentionally slow down its simulation rate to compensate for this and return to the desired rate of 1200 Hz.

Initially, I suspected that this is some kind of a thread-safety issue; one of the threads might be increasing the frame_counter from 40 to 41 while another one tries to reset it to zero, and the first thread "wins". However, sometimes I see much larger differences, like this:

frame_counter=40 diff=32395 achieved_rate_hz=1195.571 rate=1234.76 rate_hz=1200.000 sft=667.8
frame_counter=40 diff=32779 achieved_rate_hz=1195.818 rate=1220.29 rate_hz=1200.000 sft=667.1
frame_counter=40 diff=34398 achieved_rate_hz=1195.489 rate=1162.86 rate_hz=1200.000 sft=666.4
frame_counter=120 diff=26 achieved_rate_hz=47337.379 rate=4615384.50 rate_hz=1200.000 sft=667.1
frame_counter=40 diff=89555 achieved_rate_hz=46868.469 rate=446.65 rate_hz=1200.000 sft=667.8
frame_counter=40 diff=29699 achieved_rate_hz=46413.254 rate=1346.85 rate_hz=1200.000 sft=668.4
frame_counter=40 diff=29870 achieved_rate_hz=45962.512 rate=1339.14 rate_hz=1200.000 sft=669.1
frame_counter=40 diff=29565 achieved_rate_hz=45516.418 rate=1352.95 rate_hz=1200.000 sft=669.8

Some more information: this does not happen if I run the SITL environment natively on my Mac (without Docker), neither does it happen on a Linux machine with Docker, so it's definitely caused by the combination of Docker + macOS.

What we could do on ArduPilot's side is to throw away the rate estimates that are significantly higher than the target rate we are trying to achieve.

Version

Latest master revision at the time of writing (53f8aa2).

Platform
[ ] All
[ ] AntennaTracker
[X] Copter
[ ] Plane
[ ] Rover
[ ] Submarine

I was compiling ArduCopter, but may affect other platforms.

Airframe type

Simulated quadcopter, but it's probably not relevant.

Hardware type

N/A

@ntamas ntamas changed the title Clock drift when running ArudPilot SITL in a Docker container on a Mac Clock drift when running ArduPilot SITL in a Docker container on a Mac Aug 16, 2019
@khancyr
Copy link
Contributor

khancyr commented Aug 16, 2019

Hello, thanks for the report.
It is interesting. Could you share what docker file you use ?

@ntamas
Copy link
Contributor Author

ntamas commented Aug 16, 2019

I have uploaded it to my Dropbox. It's mostly standard stuff with a few extensions that allow me to supply a custom parameter set by mounting a file inside the container.

I have also included the patch in the ZIP file that turns on the debug output that I have shown in my bug report.

In the meanwhile, it seems like ignoring rate estimates that are significantly higher than the desired update rate (say, 50% higher) gets rid of the problem. I strongly suspect that this is somehow related to the fact that Docker runs inside a VM on a Mac; maybe the VM is syncing its clock to the host machine and the wall clock measurements that the SITL takes get confused when the sync is happening?

Edit: forgot to mention - on my machine it usually does not take more than 30 seconds for the simulation rate to get significantly different from the desired 1200 Hz so it's not a rare occasion.

@m-sasha
Copy link
Contributor

m-sasha commented Feb 11, 2020

This happens when running the SITL in vagrant on Mac OS X as described here: https://ardupilot.org/dev/docs/setting-up-sitl-using-vagrant.html
So it's not specific to Docker.

@m-sasha
Copy link
Contributor

m-sasha commented Feb 11, 2020

My issue is as follows:
Running SITL in vagrant on MacOS X.
I added debug prints into send_heartbeat and handleMessage (when the message is MAVLINK_MSG_ID_HEARTBEAT) which print the HAL time and the system time (clock_gettime(CLOCK_REALTIME)). I connect my custom GCS and everything is fine at first - it prints one send and one receive every second, and the clocks increase synchronously. Then, after a random period of time, it prints a lot of sends rapidly where the HAL time increases by 1 second between each, but the system time is only about 30ms between them. After a few (real time) seconds of that, it hangs for a while not printing anything. During this time, it doesn't communicate with the GCS at all. After about 90 seconds of being stuck, it wakes up and prints all the receives it got while it was stuck (with no increase between either HAL time nor system time). After it finishes them, it resumes normally - one send and one receive each second.

The log is attached.

The issue described above coincides with a spike in achieved_rate_hz, as described by the OP.

@rajat2004
Copy link
Contributor

Fro docker, maybe check the version being used - docker/for-mac#2076 (comment)
Another post for similar timing issue with VirtualBox & Vagrant - https://medium.com/carwow-product-engineering/time-sync-problems-with-vagrant-and-virtualbox-383ab77b6231

Hope this helps!

@peterbarker
Copy link
Contributor

peterbarker commented Feb 11, 2020 via email

@m-sasha
Copy link
Contributor

m-sasha commented Feb 17, 2020

and indeed for me too setting use_time_sync = false fixes the problem.

@m-sasha
Copy link
Contributor

m-sasha commented Feb 17, 2020

Don't expect SITL's simulated time to keep up with wallclock time. Consider --speedup. Consider what happens when you enter the debugger.

Not sure I understand what you mean. I don't expect the simulated time to match wall-clock precisely. But what happens in my case is that the whole simulation gets 'stuck' for ~90 seconds, then speeds up like crazy, and then returns to normal.

@m-sasha
Copy link
Contributor

m-sasha commented Feb 18, 2020

The problems appears to be that sync_frame_time uses get_wall_time_us which uses gettimeofday, which is not monotonic. When the clock jumps back, it skips many frames before now > last_wall_time_us becomes true, but the difference between now and last_wall_time_us is tiny at this point, making it think that it ran many frames in a short period of time, causing it to sleep for a long time to compensate.
The solution is to use clock_gettime(CLOCK_MONOTONIC, ...) instead. I'm not familiar enough with the code to know whether get_wall_time_us can be simply changed to use clock_gettime, or sync_frame_time needs to use another function which is guaranteed to be monotonic. According to the header file get_wall_time_us is supposed to return wall clock time in microseconds since 1970, but it doesn't appear that any code is actually relying on this.

@khancyr
Copy link
Contributor

khancyr commented Feb 18, 2020

Changing gettimeofday to clock_gettime should be transparent. Do you want to open a PR doing it ?

@m-sasha
Copy link
Contributor

m-sasha commented Feb 18, 2020

Changing gettimeofday to clock_gettime should be transparent. Do you want to open a PR doing it ?

Yes, it appears that on Windows get_wall_time_us doesn't do what the header says anyway. It uses timeGetTime, which returns "the time elapsed since Windows was started", according to MS docs.

@khancyr
Copy link
Contributor

khancyr commented Aug 10, 2020

it should have been fixed by #14487. Please reopen it the bug is still here

@khancyr khancyr closed this as completed Aug 10, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants