-
Notifications
You must be signed in to change notification settings - Fork 14
/
Copy pathfirestorm.html
177 lines (134 loc) · 10.2 KB
/
firestorm.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
<html><head><meta charset="UTF-8"><title>Firestorm Overview</title></head><body><pre><strong>Apple Microarchitecture Research</strong> by <a href="https://twitter.com/dougallj">Dougall Johnson</a>
M1/A14 P-core (Firestorm): <a href="firestorm.html">Overview</a> | <a href="firestorm-int.html">Base Instructions</a> | <a href="firestorm-simd.html">SIMD and FP Instructions</a>
M1/A14 E-core (Icestorm): <a href="icestorm.html">Overview</a> | <a href="icestorm-int.html">Base Instructions</a> | <a href="icestorm-simd.html">SIMD and FP Instructions</a>
</pre><p>This is an attempt at microarchitecture documentation for the CPU in the Apple M1, inspired by and building on the amazing work of <a href="https://uops.info/">Andreas Abel</a>,
<a href="https://www.anandtech.com/show/16226/apple-silicon-m1-a14-deep-dive/2">Andrei Frumusanu</a>, <a href="https://github.com/Veedrac/microarchitecturometer">@Veedrac</a>,
<a href="https://github.com/travisdowns/robsize">Travis Downs</a>, <a href="http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/">Henry Wong</a>,
<a href="https://agner.org/optimize/">Agner Fog</a> and <a href="https://twitter.com/handleym99/status/1437537535018684417">Maynard Handley</a>.
This documentation is my best effort, but it is based on black-box reverse engineering, and there are definitely mistakes.
No warranty of any kind (and not just as a legal technicality). To make it easier to verify the information and/or identify such errors,
entries in the instruction tables link to the experiments and results (~35k tables of counter values).</p>
<p>Firestorm is the high-performance microarchitecture used by the four P-cores in the M1.</p>
<h2>Firestorm Pipeline Overview</h2>
<img src="firestorm.png" style="width: 100%; min-width: 900px; max-width: 1310px">
<p>
As instructions are executed, they are mapped to operations inside the processor. Typically, an
instruction is decoded to one or more uops, each uop is <em>mapped</em> and <em>renamed</em> and
placed into a dispatch queue. These uops are then <em>dispatched</em> from the queue to a scheduler.
When the uop's input operands are ready, it <em>issues</em> from the scheduler, and is executed.
After execution, it is marked as completed in the reorder buffer. Completed entries in the reorder
buffer are then released in order.
</p>
<p>
This work focuses on measuring each instruction's <strong>uops</strong> and <strong>issues</strong>.
<strong>Uops</strong> count towards the "pipeline width" limit of 8 uops per cycle, and are measured
using the <code>RETIRE_UOP</code> counter. <strong>Issues</strong> count towards the "execution unit"
limits (one issue per unit per cycle), and can be measured using the documented <code>SCHEDULE_UOP</code>
counter. Three undocumented counters measure issues to the integer, load/store, and simd units separately,
so these values are provided in the instruction tables.
</p>
<p>
All instructions have at least one uop, and most instructions have the same number of issues as uops,
but some have fewer (e.g. <code>nop</code>, <code>ldp</code>, and fused instructions). On the other
hand, some instructions have more issues than uops, e.g ALU operations with an optional shift or
extend (which allows Firestorm to sustain 11 issues per cycle in contrived cases).
</p>
<h2>Firestorm Units</h2>
<p>I use the name "units", but these may also be refered to as "ports" or "pipes".</p>
<pre>
Integer units:
1: alu + ubfm/sbfm + flags + branch + adr + msr/mrs nzcv + mrs
2: alu + ubfm/sbfm + flags + branch + adr + msr/mrs nzcv + indirect branch + ptrauth
3: alu + ubfm/sbfm + flags + mov-from-simd/fp?
4: alu + ubfm/sbfm + mov-from-simd/fp?
5: alu + ubfm/sbfm + mul + div
6: alu + ubfm/sbfm + mul + madd + crc + bfm/extr
Load/store units (up to 128-bit loads and stores, including address generation with shifts up to LSL #3):
7: store + amx
8: load/store + amx
9: load
10: load
SIMD units:
11: fp/simd
12: fp/simd
13: fp/simd + fcsel + to-gpr
14: fp/simd + fcsel + to-gpr + fcmp/e + fdiv + frecpe + frsqrte + fjcvtzs + ursqrte + urecpe + sha
</pre>
<h2>Instruction Fusion</h2>
<p>Certain instructions are able to issue as one uop if they appear consecutively in the instruction stream.</p>
<ul>
<li><code>adds</code>/<code>subs</code>/<code>ands</code>/<code>cmp</code>/<code>tst</code> + <code>b.cc</code> (complete fusion when fused instructions read no more than 4 registers per 6 instructions)</li>
<li><code>add</code>/<code>sub</code>/<code>and</code>/<code>orr</code>/<code>eor</code>/<code>bic</code>/etc. + <code>cbnz</code>/<code>cbz</code> (usually fused, if destination matches cbz operand. also works with instruction variants that set flags)</li>
<li><code>aese</code> + <code>aesmc</code> (usually fused if operands match pattern "A, B ; A, A")</li>
<li><code>aesd</code> + <code>aesimc</code> (usually fused if operands match pattern "A, B ; A, A")</li>
<li><code>aese</code>/<code>aesd</code> + <code>eor</code> (usually fused if operands match pattern "A, B ; A, A, C" or "A, B ; A, C, A")</li>
<li><code>pmull</code> + <code>eor</code> (usually fused if operands match pattern "A, B, C ; A, A, D" or "A, B, C ; A, D, A")</li>
<li><code>amx</code> + <code>amx</code> (excluding loads and stores - probably fuses to something like a <code>stp</code>)</li>
</ul>
<p>Branch fusion does not work with implicit shift or extend, nor instructions that read flags (like <code>adc</code>)</p>
<p>Other tested patterns are <strong>not</strong> fused, including <code>adrp</code> + <code>add</code>, <code>mov</code> + <code>movk</code>, <code>mul</code> + <code>umulh</code>, and <code>udiv</code> + <code>msub</code>.</p>
<h2>Elimination</h2>
<p>Certain instructions do not need to issue:</p>
<ul>
<li><code>mov x0, #0</code> (handled by renaming, as is 32-bit version)</li>
<li><code>mov x0, x1</code> (usually handled by renaming, 64-bit version only)</li>
<li><code>mov x0, #123</code> (handled by renaming at a max of 2 per 8 instructions, both 32-bit and 64-bit. Includes all tested immediate "mov" aliases e.g. bitwise/movz/movn)</li>
<li><code>movi v0.16b, #0</code> (handled by renaming, including other types)</li>
<li><code>mov v0.16b, v1.16b</code> (usually handled by renaming. Includes other full-width types e.g. <code>v0.8h</code> but not <code>v0.8b</code>)</li>
<li><code>nop</code> (never issues)</li>
<li><code>b</code> (unconditional branch never issues)</li>
</ul>
<p>Other tested instructions are <strong>not</strong> eliminated, including <code>adr</code>/<code>adrp</code>, <code>mov w0, w1</code> and <code>mov x0, xzr</code>.</p>
<h2>Complex Latencies</h2>
<p>Several instructions have latencies that aren't adequately described in the instruction tables:</p>
<ul>
<li>MADD's output can be passed to its third operand (the addend) with 1c latency, but if it's chained with other instructions it has 3c latency.</li>
<li>Loads may be passed to the base address of other loads with 3c latency (which is nice for linked lists), but chaining with ALU operations gives a latency of 4c. (Although the second destination register in an LDP always has 4c latency.)</li>
<li>Loads may have an extra 1-cycle of latency when the index register is the output of a shift-like instruction. Addition and bitwise operations do not have this penalty. Other latency chains have not yet been tested.</li>
<li>Integer to SIMD/FP roundtrip latency can be as low as 7c (e.g. when chaining flags operations).</li>
</ul>
<h2>Other limits</h2>
<p>Firestorm has a pipeline width of eight instructions per cycle.</p>
<ul>
<li>Uops per cycle: 8</li>
<li>Taken branches per cycle: 1</li>
<li>Coalesced retire queue size: ~330</li>
<li>In-flight renames: ~623</li>
<li>Integer physical register file size: ~380-394?</li>
<li>FP/SIMD physical register file size: ~432?</li>
<li>Flags physical register file size: ~128</li>
<li>In-flight branches: ~144</li>
<li>In-flight loads: ~130</li>
<li>In-flight stores: ~60</li>
</ul>
<p>These numbers mostly come from my <a href="https://gist.github.com/dougallj/5bafb113492047c865c0c8cfbc930155">M1 buffer size measuring tool</a>.
See also my blog post <a href="https://dougallj.wordpress.com/2021/04/08/apple-m1-load-and-store-queue-measurements/">Apple M1: Load and Store Queue Measurements</a>.</p>
<h2>Completion, and the Reorder Buffer (ROB)</h2>
<p>
Firestorm uses an unconventional reorder buffer, which I described as a ~330 entry "coalesced retire queue" and a ~623 "rename retire queue" (equivalent to a "Physical Register Reclaim Table").
</p>
<p>
Firestorm coalesces uops into <strong>retire groups</strong>, which all retire together. A retire group may contain up to seven uops.
Uops which can fail before retiring (such as memory accesses) must appear at the start of a group, and uops that can fail after retiring
(such as conditional branches) must appear at the end of a group. Groups of seven uops are only observed for eliminated instructions,
such as <code>nop</code> and <code>mov</code> with issuing uops limited to roughly four per group. The coalesced retire queue consists of
~330 such groups. This allows an out-of-order window of just over 1000 (contrived) instructions that issue, or over 2200 <code>nop</code>
instructions.
</p>
<p>
Any time an architectural register is written, that write must be retired (regardless of whether the instruction is eliminated).
This is tracked in a separate structure, called the "rename retire queue", which allows for up to ~623 renames. (For some examples:
<code>cbz</code> or <code>str</code> do not require entries, <code>add</code> and <code>mov</code> need one entry,
and <code>adds</code> or <code>ldp</code> need two entries.)
</p>
<p>
I use <em>retirement</em> to refer to an instruction's entry in the reorder buffer being released.
However, as I've <a href="https://dougallj.wordpress.com/2021/04/08/apple-m1-load-and-store-queue-measurements/">described in a blog post</a>,
loads and stores may commit out-of-order if they are non-speculative. This has some interesting implications. In particular, once a
store has "completed", the architectural state must have advanced past that point.
</p>
<p>
Retirement-rate is <a href="https://twitter.com/dougallj/status/1380416269275389952">measured</a> at up to eight
coalesced retire queue entries per cycle, or up to sixteen rename retire queue entries per cycle.
</p>
</body></html>