NativeAOT: Inline tls access on linux/x64 #97413
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
Issue Details
Before:
G_M41298_IG02: ;; offset=0x0001
E800000000 call CORINFO_HELP_READYTORUN_THREADSTATIC_BASE
488D0D00000000 lea rcx, gword ptr [(reloc 0x40000000004210a8)] ; '"a"'
48894858 mov gword ptr [rax+0x58], rcx
;; size=16 bbWeight=1 PerfScore 2.50
After:
G_M41298_IG02: ;; offset=0x0008
66 data16
488D3D00000000 lea rdi, [(reloc 0x4000000000421230)]
66 data16
66 data16
48E800000000 call <unknown method>
488B18 mov rbx, gword ptr [rax]
4885DB test rbx, rbx
7412 je SHORT G_M41298_IG05
;; size=24 bbWeight=1 PerfScore 5.50
G_M41298_IG03: ;; offset=0x0020
488D0500000000 lea rax, gword ptr [(reloc 0x4000000000421278)] ; '"a"'
48894358 mov gword ptr [rbx+0x58], rax
;; size=11 bbWeight=1 PerfScore 1.50
G_M41298_IG04: ;; offset=0x002B
4883C408 add rsp, 8
5B pop rbx
5D pop rbp
C3 ret
;; size=7 bbWeight=1 PerfScore 2.25
G_M41298_IG05: ;; offset=0x0032
488BF8 mov rdi, rax
488D0500000000 lea rax, [(reloc 0x4000000000421238)]
FFD0 call rax
488BD8 mov rbx, rax
EBDD jmp SHORT G_M41298_IG03
;; size=17 bbWeight=0 PerfScore 0.00
; Total bytes of code 67, prolog size 8, PerfScore 12.75, instruction count 23, allocated bytes for code 67 (MethodHash=14305ead) for method Program:Test1() (FullOpts)
The linker later patches the following sequence:
66 data16
488D3D00000000 lea rdi, [(reloc 0x4000000000421230)]
66 data16
66 data16
48E800000000 call <unknown method>
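For context, the padded `data16` / `lea rdi` / `call` sequence above has the same shape as the x86-64 general-dynamic TLS access sequence that C and C++ compilers emit for thread-local variables in position-independent code, which linkers can relax into a direct `%fs`-relative access when the variable lives in the executable. The following is a minimal C++ sketch of that situation, not code from this PR; `t_counter`, `counter_address`, and `counter_address_on_other_thread` are hypothetical names chosen for illustration.

```cpp
#include <cstdint>
#include <thread>

// Hypothetical thread-local variable; every thread gets its own copy.
thread_local std::int64_t t_counter = 0;

// Returns the address of the calling thread's copy of t_counter.
// On linux/x64, position-independent builds typically compute this through
// __tls_get_addr, while executables can use an %fs-relative offset instead.
std::int64_t* counter_address() {
    return &t_counter;
}

// Takes the address on a second thread and returns it as an integer,
// so the caller can compare it with its own copy's address.
std::uintptr_t counter_address_on_other_thread() {
    std::uintptr_t result = 0;
    std::thread worker([&result] {
        result = reinterpret_cast<std::uintptr_t>(counter_address());
    });
    worker.join();
    return result;
}
```

Compiling this with and without `-fPIC` and disassembling `counter_address` is one way to compare the two access forms.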
|
Diff results for #97413
Throughput diffs
Throughput diffs for linux/x64 ran on windows/x64
Overall (+0.04% to +0.07%)
MinOpts (+0.06% to +0.27%)
FullOpts (+0.02% to +0.06%)
Throughput diffs for windows/x64 ran on windows/x64
Overall (+0.02% to +0.05%)
MinOpts (+0.04% to +0.24%)
FullOpts (+0.02% to +0.05%)
Throughput diffs for linux/x64 ran on linux/x64
Overall (-0.00% to +0.01%)
MinOpts (-0.03% to +0.10%)
FullOpts (-0.01% to +0.01%)
Throughput diffs for windows/x86 ran on windows/x86
Overall (+0.02% to +0.04%)
MinOpts (+0.04% to +0.41%)
FullOpts (+0.01% to +0.02%)
|
Diff results for #97413
Throughput diffs
Throughput diffs for linux/x64 ran on windows/x64
Overall (+0.04% to +0.07%)
MinOpts (+0.06% to +0.27%)
FullOpts (+0.02% to +0.06%)
Throughput diffs for windows/x64 ran on windows/x64
Overall (+0.02% to +0.05%)
MinOpts (+0.04% to +0.24%)
FullOpts (+0.02% to +0.05%)
Throughput diffs for windows/x86 ran on windows/x86
Overall (+0.02% to +0.04%)
MinOpts (+0.04% to +0.41%)
FullOpts (+0.01% to +0.02%)
Throughput diffs for linux/x64 ran on linux/x64
Overall (-0.00% to +0.01%)
MinOpts (-0.03% to +0.09%)
FullOpts (-0.01% to +0.01%)
|
@jkotas @VSadov @MichalStrehovsky @BruceForstall PTAL |
/azp run runtime-nativeaot-outerloop |
Azure Pipelines successfully started running 1 pipeline(s). |
@MihaZupan - is this the way to trigger the jit-diff? |
Oh I see, the check I had was case-sensitive for the bot name. Fixed now |
Codegen looks nice. No more calls when acquiring a lock!
System.Collections.Concurrent.Tests`System.Threading.Lock__EnterAndGetCurrentThreadId:
0x5555564cae70 <+0>: pushq %rbp
0x5555564cae71 <+1>: pushq %r15
0x5555564cae73 <+3>: pushq %rbx
0x5555564cae74 <+4>: subq $0x10, %rsp
0x5555564cae78 <+8>: leaq 0x20(%rsp), %rbp
0x5555564cae7d <+13>: movq %rdi, %rbx
// start fetching managed TLS
-> 0x5555564cae80 <+16>: movq %fs:0x0, %rax
0x5555564cae89 <+25>: leaq -0x108(%rax), %rax
0x5555564cae90 <+32>: movq %rax, %rdi
0x5555564cae93 <+35>: movq (%rdi), %r15 ; here r15 points to tlsRoot object
0x5555564cae96 <+38>: testq %r15, %r15 ; check if the root is initialized (or call helper to initialize)
0x5555564cae99 <+41>: je 0x5555564caee0 ; <+112> at Lock.cs:66
0x5555564cae9b <+43>: movl 0x88(%r15), %edx
// here edx has the managed thread ID - in 7 instructions!
// the rest of the Lock code follows . . .
// check if the states of the lock word and the thread ID permit fast path (typically it is the case)
0x5555564caea2 <+50>: testl %edx, %edx
0x5555564caea4 <+52>: je 0x5555564caec9 ; <+89> at Lock.cs:66
0x5555564caea6 <+54>: movl 0x14(%rbx), %eax
0x5555564caea9 <+57>: movl %eax, -0x14(%rbp)
0x5555564caeac <+60>: testb $0x3, %al
0x5555564caeae <+62>: jne 0x5555564caec9 ; <+89> at Lock.cs:66
0x5555564caeb0 <+64>: leal 0x1(%rax), %edi
0x5555564caeb3 <+67>: leaq 0x14(%rbx), %rsi
// acquire the lock
0x5555564caeb7 <+71>: lock
0x5555564caeb8 <+72>: cmpxchgl %edi, (%rsi)
0x5555564caebb <+75>: movl -0x14(%rbp), %edi
0x5555564caebe <+78>: cmpl %edi, %eax
0x5555564caec0 <+80>: jne 0x5555564caec9 ; <+89> at Lock.cs:66
0x5555564caec2 <+82>: movl %edx, 0x10(%rbx)
0x5555564caec5 <+85>: movl %edx, %eax
0x5555564caec7 <+87>: jmp 0x5555564caed6 ; <+102> at Lock.cs:69
0x5555564caec9 <+89>: movq %rbx, %rdi
0x5555564caecc <+92>: movl $0xffffffff, %esi ; imm = 0xFFFFFFFF
0x5555564caed1 <+97>: callq 0x5555564cb0a0 ; System.Threading.Lock__TryEnterSlow_0 at Lock.cs:297
0x5555564caed6 <+102>: nop
0x5555564caed7 <+103>: addq $0x10, %rsp
0x5555564caedb <+107>: popq %rbx
0x5555564caedc <+108>: popq %r15
0x5555564caede <+110>: popq %rbp
0x5555564caedf <+111>: retq
0x5555564caee0 <+112>: leaq 0x79959(%rip), %rax ; Internal.Runtime.ThreadStatics__GetInlinedThreadStaticBaseSlow at ThreadStatics.cs:40
0x5555564caee7 <+119>: callq *%rax
0x5555564caee9 <+121>: movq %rax, %r15
0x5555564caeec <+124>: jmp 0x5555564cae9b ; <+43> at Lock.cs:66 |
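The annotated listing above follows a common pattern: load a per-thread root pointer, fall back to a helper only when it is still null, then try a single compare-and-swap on the lock word. The C++ sketch below illustrates that pattern only. It is not the runtime's implementation, and every name in it (`TlsRoot`, `t_root`, `GetRootSlow`, `GetRoot`, `SketchLock`, `TryEnterFast`) is hypothetical, as are the field layout and the thread ID assignment.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical stand-in for the per-thread root object (r15 in the listing).
struct TlsRoot {
    std::uint32_t managedThreadId;
};

thread_local TlsRoot* t_root = nullptr;        // null until the first access
std::atomic<std::uint32_t> g_nextThreadId{1};  // assumed ID source, never 0

// Slow path, taken once per thread: create the root and publish it.
TlsRoot* GetRootSlow() {
    thread_local TlsRoot storage{g_nextThreadId.fetch_add(1)};
    t_root = &storage;
    return t_root;
}

// Fast path: one thread-local load plus a null check, no call when initialized.
inline TlsRoot* GetRoot() {
    TlsRoot* root = t_root;
    return root != nullptr ? root : GetRootSlow();
}

// Hypothetical lock: the low two bits of `state` being set means the
// fast path cannot be used.
struct SketchLock {
    std::atomic<std::uint32_t> state{0};
    std::uint32_t owner = 0;
};

// Returns the owning thread ID on success, or 0 when the caller would have
// to take a slow path (not modeled here).
std::uint32_t TryEnterFast(SketchLock& lock) {
    std::uint32_t id = GetRoot()->managedThreadId;
    std::uint32_t observed = lock.state.load(std::memory_order_relaxed);
    if (id == 0 || (observed & 0x3) != 0) {
        return 0;
    }
    // Single compare-and-swap, mirroring the lock cmpxchg in the listing.
    if (!lock.state.compare_exchange_strong(observed, observed + 1,
                                            std::memory_order_acquire)) {
        return 0;
    }
    lock.owner = id;
    return id;
}
```

In this sketch a second `TryEnterFast` on a held lock returns 0 rather than blocking; a real lock would continue into a slow path such as the `TryEnterSlow` call visible in the listing.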
I've run a few libraries tests in a loop in various configurations (Checked/Release, ServerGC,...) and did not see any crashes or anomalies. Can't comment on the code change details, but the end result looks as expected! |
LGTM. Thanks!!
|
gtNewIconHandleNode((size_t)threadStaticInfo.tlsGetAddrFtnPtr.handle, GTF_ICON_FTN_ADDR);
tls_get_addr_val->SetContained();

// GenTreeCall* tlsRefCall = gtNewCallNode(CT_ tls_get_addr_val, TYP_I_IMPL);
unneeded?
I will do this in a follow-up PR
// GenTreeCall* tlsRefCall = gtNewCallNode(CT_ tls_get_addr_val, TYP_I_IMPL);
GenTreeCall* tlsRefCall = gtNewIndCallNode(tls_get_addr_val, TYP_I_IMPL);
tlsRefCall->gtFlags |= GTF_TLS_GET_ADDR;
// //
extraneous comments?