Modern multicore processors are far weirder than most people think possible. They are somewhat weirder than chip makers were willing to admit until fairly recently. They are weird enough that almost all multi-threaded programs, and many lock-free algorithms, had bugs, because the hardware simply does not work the way anyone would reasonably expect. And, on top of that, optimizing compilers did not really know how to avoid breaking your code. cite:boehm2005threads
I’m in the (slow) process of writing some test cases for multi-threaded code. I eventually want to have some confidence that some code is executed once, and only once, in an efficient manner. It’s the ‘efficient’ part that worries me: efficient code tends to be clever, and I’m learning that clever MT code is often subtly broken. cite:bacon2000double So if smarter people than me make mistakes in MT code, I need tests to compensate, and tests that will cover the occasional allowed but unexpected behaviors. Which means the test framework has to be able to detect them.
Fortunately, the RPi is a computer that exhibits some of this odd behavior, since it is an ARM system. X86 has a much stronger memory model, although even x86 is allowed to behave in odd ways.
Starting in 2007, Intel began publishing short snippets of assembly along with the allowed and disallowed results of running them in parallel. cite:IWPAug2007 These snippets have come to be called litmus tests, and they are used both to confirm the behavior of hardware and to validate models of that behavior. A particularly important model for C++ programmers is the x86 Total Store Order model cite:owens2009better which provides a way of mapping the C++11 memory model to x86 hardware. X86 hardware provides a strongly consistent memory model. Power and ARM provide fewer guarantees, and mapping the C++ memory model to those architectures is more challenging. cite:maranget2012tutorial
The tests outlined in the Intel paper are short pieces of assembly run on different processors, together with guarantees about which results will not happen. The first one essentially promises that message passing will work, and is now known as the MP test.
Processor 0 | Processor 1 |
---|---|
mov [ _x], 1 // M1 | mov r1, [ _y] // M3 |
mov [ _y], 1 // M2 | mov r2, [ _x] // M4 |
Initially x = y = 0
r1 = 1 and r2 = 0 is not allowed
That says we can’t observe the writes to x and y out of order: if we’re waiting to see the write to the flag y, we are guaranteed to also see the payload in x. If we change it slightly to spin until the write to _y is visible, we can pass a message from one thread to another in _x.
ARM and Power do not provide that guarantee without additional synchronization instructions.
In C++ that looks something like the following, using the test framework I’m writing.
MP::MP() : x_(0), y_(0) {}
void MP::t1() {
x_.store(1, std::memory_order_relaxed);
y_.store(1, std::memory_order_relaxed);
}
void MP::t2(Result& read) {
while (!y_.load(std::memory_order_relaxed)){}
std::get<0>(read) = x_.load(std::memory_order_relaxed);
}
Here, x_ and y_ are atomic<int>s, and we’re using the weakest atomic ordering C++ provides, relaxed. Relaxed guarantees only that the operation itself happens atomically; there are no synchronization properties with anything else. This usually corresponds to a plain int, unless you’re on a truly unusual processor that lets an int be partially written and observed, so that you might see the top half of the int, or the middle byte. The commercial processors that allowed that have pretty much died out.
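For reference, this is roughly what the class declaration and the driver loop look like. The names, the Result alias, and the driver below are my own sketch rather than the framework’s actual code, but they match the member functions above; combined with those definitions, this is enough to reproduce the counts.

#include <atomic>
#include <cstdio>
#include <map>
#include <thread>
#include <tuple>

namespace litmus {
// Hypothetical declaration matching the definitions above.
struct MP {
    using Result = std::tuple<int>;  // holds the value read from x_
    MP();
    void t1();               // writer: payload first, then flag
    void t2(Result& read);   // reader: spin on the flag, then read the payload
    std::atomic<int> x_;     // payload
    std::atomic<int> y_;     // flag
};
}  // namespace litmus

// Bare-bones driver: run t1 and t2 in parallel and tally each observed
// Result. The real framework synchronizes thread start-up more carefully.
int main() {
    std::map<litmus::MP::Result, long> counts;
    for (long i = 0; i < 2000000; ++i) {
        litmus::MP mp;
        litmus::MP::Result r{};
        std::thread writer([&] { mp.t1(); });
        std::thread reader([&] { mp.t2(r); });
        writer.join();
        reader.join();
        ++counts[r];
    }
    for (auto const& entry : counts)
        std::printf("(%d) : %ld\n", std::get<0>(entry.first), entry.second);
    return 0;
}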
The test spins until it sees the store to y_, then loads the value of x_ into a result tuple. The tuple is used as the key to a map that accumulates how many times each result has been seen. Running the MP test on my x86 laptop:
[ RUN ] ExperimentTest.MPTest1
(1) : 2000000
on my Raspberry Pi 3:
[ RUN ] ExperimentTest.MPTest1
(0) : 483
(1) : 1999517
Using objdump to check the generated assembly
00000088 <litmus::MP::MP()>:
  88: mov r2, #0
  8c: str r2, [r0]
  90: str r2, [r0, #64] ; 0x40
  94: bx lr

00000098 <litmus::MP::t1()>:
  98: mov r3, #1
  9c: str r3, [r0]
  a0: str r3, [r0, #64] ; 0x40
  a4: bx lr

000000a8 <litmus::MP::t2(std::tuple<int>&)>:
  a8: ldr r3, [r0, #64] ; 0x40
  ac: cmp r3, #0
  b0: beq a8 <litmus::MP::t2(std::tuple<int>&)>
  b4: ldr r3, [r0]
  b8: str r3, [r1]
  bc: bx lr
So, out of the 2,000,000 times that I ran the experiment, there were 483 times that reading x_ resulted in 0, even though y_ was 1. ARM has a weaker memory model than x86. This has some advantages in processor implementation. It has distinct disadvantages in how our brains work. X86 tries to preserve the model that there is shared memory that everyone sees and works with. That’s not strictly true, even for X86, but ARM and Power don’t even come close. On the other hand, it’s also why it’s easier to add more cores to Power and ARM chips and systems. I routinely work with Power systems with 512 physical cores.
Store buffering is the odd case that is allowed in the Intel memory model. When two threads each store to a different location and then read the location the other thread wrote, both threads are allowed to read the older state: the stores get buffered. From the Intel White Paper:
Processor 0 | Processor 1 |
---|---|
mov [ _x], 1 // M1 | mov [ _y], 1 // M3 |
mov r1, [ _y] // M2 | mov r2, [ _x] // M4 |
Initially x = y = 0
r1 = 0 and r2 = 0 is allowed
Note, in particular, that there is no interleaving of M1 through M4 that could result in both r1 and r2 being 0, short of interrupting an instruction in the middle. But the instructions themselves are atomic and indivisible. If the two processors were actually operating on a single shared memory, this result would not be possible. However, it does happen.
SB::SB() : x_(0), y_(0) {}
void SB::t1(Result& read) {
y_.store(1, std::memory_order_relaxed);
std::get<0>(read) = x_.load(std::memory_order_relaxed);
}
void SB::t2(Result& read) {
x_.store(1, std::memory_order_relaxed);
std::get<1>(read) = y_.load(std::memory_order_relaxed);
}
That generates the x86 code
00000000000000f0 <litmus::SB::t1(std::__1::tuple<int, int>&)>:
  f0: mov DWORD PTR [rdi+0x40],0x1
  f7: mov eax,DWORD PTR [rdi]
  f9: mov DWORD PTR [rsi],eax
  fb: ret

0000000000000100 <litmus::SB::t2(std::__1::tuple<int, int>&)>:
  100: mov DWORD PTR [rdi],0x1
  106: mov eax,DWORD PTR [rdi+0x40]
  109: mov DWORD PTR [rsi+0x4],eax
  10c: ret
And on my x86 machine:
[ RUN ] ExperimentTest.SBTest1
(0, 0) : 559
(0, 1) : 999858
(1, 0) : 999576
(1, 1) : 7
So 559 times neither core saw the other core’s store.
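For contrast, a sequentially consistent variant of the same test (my own sketch, not part of the test suite) is not allowed to produce (0, 0): seq_cst imposes a single total order on the four operations, and on x86 the compiler typically has to emit an mfence or an xchg for each store to enforce it.

#include <atomic>
#include <tuple>

// Sequentially consistent variant of SB (illustration only). In the single
// total order required by seq_cst, one of the two stores must come first,
// so at least one of the loads must observe a 1: (0, 0) is forbidden.
struct SB_SC {
    using Result = std::tuple<int, int>;
    std::atomic<int> x_{0};
    std::atomic<int> y_{0};
    void t1(Result& read) {
        y_.store(1, std::memory_order_seq_cst);
        std::get<0>(read) = x_.load(std::memory_order_seq_cst);
    }
    void t2(Result& read) {
        x_.store(1, std::memory_order_seq_cst);
        std::get<1>(read) = y_.load(std::memory_order_seq_cst);
    }
};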
Load buffering is the dual of store buffering. Loads into registers might be delayed, or buffered, and actually performed after later instructions. It’s not allowed on Intel architectures.
From the Intel White Paper
Processor 0 | Processor 1 |
---|---|
mov r1, [ _x] // M1 | mov r2, [ _y] // M3 |
mov [ _y], 1 // M2 | mov [ _x], 1 // M4 |
Initially x = y = 0
r1 = 1 and r2 = 1 is not allowed
LB::LB() : x_(0), y_(0) {}
void LB::t1(Result& read) {
std::get<0>(read) = x_.load(std::memory_order_relaxed);
y_.store(1, std::memory_order_relaxed);
}
void LB::t2(Result& read) {
std::get<1>(read) = y_.load(std::memory_order_relaxed);
x_.store(1, std::memory_order_relaxed);
}
This is the x86 asm code
00000000000000c0 <litmus::LB::t1(std::__1::tuple<int, int>&)>:
  c0: mov eax,DWORD PTR [rdi]
  c2: mov DWORD PTR [rsi],eax
  c4: mov DWORD PTR [rdi+0x40],0x1
  cb: ret
  cc: nop DWORD PTR [rax+0x0]

00000000000000d0 <litmus::LB::t2(std::__1::tuple<int, int>&)>:
  d0: mov eax,DWORD PTR [rdi+0x40]
  d3: mov DWORD PTR [rsi+0x4],eax
  d6: mov DWORD PTR [rdi],0x1
  dc: ret
  dd: nop DWORD PTR [rax]
And the ARM code, at -O1
000000d0 <litmus::LB::t1(std::tuple<int, int>&)>:
  d0: ldr r3, [r0]
  d4: str r3, [r1, #4]
  d8: mov r3, #1
  dc: str r3, [r0, #64] ; 0x40
  e0: bx lr

000000e4 <litmus::LB::t2(std::tuple<int, int>&)>:
  e4: ldr r3, [r0, #64] ; 0x40
  e8: str r3, [r1]
  ec: mov r3, #1
  f0: str r3, [r0]
  f4: bx lr
ARM generally allows it, but per cite:maranget2012tutorial it’s very sensitive, and dependencies will keep it from appearing. In my tests I did not observe an instance of load buffering, though that may be because of the store the compiler inserts right after each load in order to get the result into the tuple. That it’s documented as possible at all is still exceedingly strange.
IRIW (independent reads of independent writes) is a generalization of store buffering, where two reader threads each observe a different apparent ordering of the writes made by two distinct writer threads.
| T1 | T2 | T3 | T4 |
|---|---|---|---|
| X = 1 | Y = 1 | R1 = X | R3 = Y |
| | | R2 = Y | R4 = X |
Initially X = Y = 0
R1 = 1, R2 = 0, R3 = 1, R4 = 0 is allowed on ARM, not on x86 cite:maranget2012tutorial,owens2009better
This is not observed on x86 processors, but it is on some ARM and POWER implementations, more often on POWER. X86 hardware maintains a single consistent view of memory, while other hardware can let different threads see memory writes in different orders. On my rPi, I didn’t observe any runs where the two readers disagreed on the order of the writes to X and Y, over 40 million runs.
IRIW::IRIW() : x_(0), y_(0) {}
void IRIW::t1() {
x_.store(1, std::memory_order_relaxed);
}
void IRIW::t2() {
y_.store(1, std::memory_order_relaxed);
}
void IRIW::t3(Result& read) {
std::get<0>(read) = x_.load(std::memory_order_relaxed);
std::get<1>(read) = y_.load(std::memory_order_relaxed);
}
void IRIW::t4(Result& read) {
std::get<2>(read) = y_.load(std::memory_order_relaxed);
std::get<3>(read) = x_.load(std::memory_order_relaxed);
}
The allowed behavior of modern processors is very different from our mental model of a von Neumann architecture computer. Each core can have a different view of memory, and without additional controls, writes and reads can break the illusion of a single unified memory. The C++ memory model provides the controls and guarantees about what happens when different threads read and write memory, and here I’ve deliberately used the weakest version available, relaxed, in order to allow the processors the widest latitude in behavior. A relaxed atomic is, on processors with naturally atomic ints, often just an unconstrained int, which means you can get these odd behaviors if you run shared-state multithreaded code that uses plain native types. It is a particular problem for code that was originally written and tested on x86, because the native model there is fairly strong. This frequently causes problems when porting to mobile platforms, where ARM is a very popular hardware choice.
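As a concrete example of those controls, here is the MP test rewritten with release and acquire. This variant is my own illustration, not one of the tests above, but the guarantee is standard: an acquire load that sees a release store also sees everything sequenced before that store, so the (0) result seen above on the rPi cannot happen.

#include <atomic>
#include <tuple>

// Release/acquire variant of MP (illustration only).
struct MP_RA {
    using Result = std::tuple<int>;
    std::atomic<int> x_{0};   // payload
    std::atomic<int> y_{0};   // flag
    void t1() {
        x_.store(1, std::memory_order_relaxed);
        y_.store(1, std::memory_order_release);   // publish the payload
    }
    void t2(Result& read) {
        while (!y_.load(std::memory_order_acquire)) {}  // wait for the flag
        // The acquire load synchronizes with the release store, so the
        // earlier store to x_ is guaranteed to be visible: this reads 1.
        std::get<0>(read) = x_.load(std::memory_order_relaxed);
    }
};

On ARM this costs an explicit barrier or acquire/release instruction, which is exactly the synchronization the relaxed tests above leave out.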
Exported from an org-mode doc. All of the source is available on github at SpinGate
bibliographystyle:unsrt bibliography:ref.bib