Fold casts of constants in the importer #47133
I believe the test failures are due to an existing issue; it could be that your changes have exposed it.
|
That's possible. I have just managed to obtain the dump of the problematic method (not without some difficulties) and will report back the findings. |
I have a standalone C# repro for this. Let me know if you are interested or if it would make debugging easier. |
Yes, that would be great! |
using System;
public class TestClass8
{
public struct S1
{
public struct S1_D1_F1
{
public sbyte sbyte_1;
public bool boolean_2;
public uint uint32_3;
public string string_4;
}
public S1.S1_D1_F1 s1_s1_d1_f1_2;
public S1.S1_D1_F1 s1_s1_d1_f1_3;
public uint uint32_4;
}
public struct S2
{
public sbyte sbyte_1;
public S1 s1_2;
}
public struct S3
{
public S2 s2_1;
public long int64_2;
public struct S3_D1_F3
{
public ulong uint64_1;
}
}
public struct S4
{
public decimal decimal_1;
public S1.S1_D1_F1 s1_s1_d1_f1_2;
public S3 s3_3;
}
public struct S5
{
public S3.S3_D1_F3 s3_s3_d1_f3_1;
}
public long LeafMethod7()
{
unchecked
{
int int32_6 = -1032611019 ;
long int64_7 = 6728172856866929694L ;
S3 s3_18 = new S3() ;
S4 s4_19 = new S4() ;
int loopInvariant = 1;
return 1+2;
}
}
public ulong LeafMethod13()
{
unchecked
{
int int32_6 = 845503196 ;
ulong uint64_13 = 12351733974017606946UL ;
S3.S3_D1_F3 s3_s3_d1_f3_17 = new S3.S3_D1_F3() ;
S5 s5_20 = new S5() ;
int loopInvariant = 8;
return uint64_13 &= 16116317412362549271UL^ uint64_13 <<= 1923055751% 4733734356262465223UL/ 17819534679104801725UL+ 93+ 89| s3_s3_d1_f3_17.uint64_1 <<= int32_6 ^ uint64_13 |= s3_s3_d1_f3_17.uint64_1 += s5_20.s3_s3_d1_f3_1.uint64_1 * s3_s3_d1_f3_17.uint64_1 / 11139180945003832152UL& s3_s3_d1_f3_17.uint64_1 %= uint64_13 /= 8589548964049453567UL+ 90+ 1| s5_20.s3_s3_d1_f3_1.uint64_1 += uint64_13 %= 17562726876373889522UL+ 17938177136826372368UL+ 83+ 68+ uint64_13 = uint64_13 <<= int32_6 &= -1209709259| uint64_13 | uint64_13 &= uint64_13 * 17623384555723108609UL% uint64_13 >>= -421932286+ 19% s3_s3_d1_f3_17.uint64_1 + 36+ s5_20.s3_s3_d1_f3_1.uint64_1 /= s5_20.s3_s3_d1_f3_1.uint64_1 <<= int32_6 &= int32_6 + s3_s3_d1_f3_17.uint64_1 <<= -970431513+ int32_6 ^= 1429190696- 12207894589152215756UL* 4967280248782330651UL+ 62;
}
}
public static void Main(string[] args)
{
TestClass8 objTestClass8 = new TestClass8();
objTestClass8.Method0();
}
public void Method0()
{
unchecked
{
ulong uint64_13 = 12709877184972354889UL ;
S3.S3_D1_F3 s3_s3_d1_f3_17 = new S3.S3_D1_F3() ;
S3 s3_18 = new S3() ;
S4 s4_19 = new S4() ;
S5 s5_20 = new S5() ;
int loopInvariant = 2;
if (s4_19.s3_3.int64_2 == LeafMethod7() & s3_18.int64_2 % LeafMethod7()+ 91)
{
uint64_13 += LeafMethod13() / uint64_13 + 10- uint64_13 | 610171717676389549UL& 4697039689400138510UL/ s3_s3_d1_f3_17.uint64_1 %= s5_20.s3_s3_d1_f3_1.uint64_1 + 97+ 16- LeafMethod13()& LeafMethod13();
}
else
{
{
int __loopvar1 = loopInvariant - 10;
}
{
int __loopvar1 = loopInvariant, __loopSecondaryVar1_0 = loopInvariant - 9;
}
}
return;
}
}
}
/*
Got output diff:
--------- Baseline ---------
Environment:
COMPlus_JITMinOpts=1
COMPlus_TieredCompilation=0
--------- Test ---------
Environment:
COMPlus_JitStress=2
COMPlus_JitStressRegs=0x80
COMPlus_TieredCompilation=0
Assert failure(PID 16648 [0x00004108], Thread: 24468 [0x5f94]): Assertion failed 'OperIsSimple()' in 'TestClass8:LeafMethod13():long:this' during 'Morph - Global' (IL size 356)
File: D:\git\dotnet-runtime\src\coreclr\jit\gtstructs.h Line: 52
Image: D:\git\dotnet-runtime\artifacts\tests\coreclr\windows.x64.Checked\tests\Core_Root\CoreRun.exe
*/ |
After some elimination of dead ends and transcribing of IL, I was able to get to this very minimal repro:
public static ulong Problem()
{
    long a = 42;
    return (ulong)a % 42UL;
}
Will investigate what causes the assert tomorrow. The dump for the original problematic method doesn't reveal much:
Folding long operator with constant nodes into a constant:
[000079] ------------ * UMOD long
[000076] -----+------ +--* CNS_INT long 3
[000078] ------------ \--* CNS_INT long 3
Bashed to long constant:
[000079] ------------ * CNS_INT long 0
Assert failure(PID 13452 [0x0000348c], Thread: 15176 [0x3b48]): Assertion failed 'OperIsSimple()' in 'DynamicClass:CallSite.Target(System.Runtime.CompilerServices.Closure,System.Runtime.CompilerServices.CallSite,System.Object,System.Object):System.Object' during 'Morph - Global' (IL size 142)
File: C:\Users\Accretion\source\dotnet\runtime\src\coreclr\jit\gtstructs.h Line: 52
Image: C:\Users\Accretion\source\dotnet\runtime\artifacts\bin\testhost\net6.0-windows-Debug-x64\dotnet.exe |
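For readers following the dump above, here is a toy sketch of what "folding a long operator with constant nodes" means for the MOD/UMOD case. It is not JIT code: `Oper` and `FoldLongConstants` are hypothetical names made up for illustration, and the only thing taken from the thread is that an unsigned modulus of two long constants (3 and 3 in the dump, 42 and 42 in the minimal repro) folds to the constant 0.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical operator kinds; the names merely echo MOD/UMOD from the dump.
enum class Oper { Mod, UMod };

// Fold "a <op> b" for two 64-bit constants stored as signed values.
// The caller must guarantee b != 0 (and, for Mod, avoid INT64_MIN % -1).
int64_t FoldLongConstants(Oper op, int64_t a, int64_t b)
{
    assert(b != 0);
    switch (op)
    {
        case Oper::Mod:
            // Signed remainder: the result takes the sign of the dividend.
            return a % b;
        case Oper::UMod:
            // Unsigned remainder: reinterpret both operands as uint64_t first.
            return static_cast<int64_t>(static_cast<uint64_t>(a) % static_cast<uint64_t>(b));
    }
    return 0; // not reached for valid Oper values
}
```

The two operators agree for the values in the repro, but they diverge as soon as the sign bit is set, which is why a folded cast to `ulong` has to keep the unsigned flavour of the operation.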
After (not without some difficulty) setting up the native debugger, I was able to trace the assert's origins to these lines: runtime/src/coreclr/jit/morph.cpp Lines 12198 to 12216 in c4421ac
Specifically, it is the last line that asserts: Locally, it fixed both the simplified reproduction and the original test failures. |
Seeing as the CI is green, posting the final diffs: PMI
PMI with "--cctors"
|
Did you get a chance to see why there are regressions? |
Right, the regressions. I haven't studied them in much detail yet, but so far it seems that more inlining is causing more code to be generated. I will update this post as I analyze more cases.
HashCompare:check
G_M62082_IG01:
push r15
push r14
push r13
push r12
push rdi
push rsi
push rbp
push rbx
sub rsp, 40
mov r14d, ecx
mov rbp, rdx
mov rdi, r8
mov rbx, r9
mov r15, qword ptr [rsp+90H]
mov r12, qword ptr [rsp+98H]
mov r13, qword ptr [rsp+A0H]
mov rsi, qword ptr [rsp+A8H]
;; bbWeight=1 PerfScore 13.25
G_M62082_IG02:
cmp rsi, r15
jl SHORT G_M62082_IG05
;; bbWeight=8 PerfScore 10.00
G_M62082_IG03:
mov eax, 1
;; bbWeight=0.50 PerfScore 0.12
G_M62082_IG04:
add rsp, 40
pop rbx
pop rbp
pop rsi
pop rdi
pop r12
pop r13
pop r14
pop r15
ret
;; bbWeight=0.50 PerfScore 2.62
G_M62082_IG05:
lea rcx, [r12+rsi]
cmp dword ptr [rdi], edi
mov edx, ecx
movsxd rax, edx
cmp rcx, rax
jne SHORT G_M62082_IG10
;; bbWeight=4 PerfScore 17.00
G_M62082_IG06:
mov rcx, rdi
call System.Array:GetValue(int):System.Object:this
mov gword ptr [rsp+20H], rax
lea rcx, [rsi+r13]
cmp dword ptr [rbx], ebx
mov edx, ecx
movsxd r8, edx
cmp rcx, r8
jne SHORT G_M62082_IG10
;; bbWeight=4 PerfScore 26.00
G_M62082_IG07:
mov rcx, rbx
call System.Array:GetValue(int):System.Object:this
mov r9, rax
movzx rcx, r14b
mov rdx, rbp
mov r8, gword ptr [rsp+20H]
call HashCompare:GenericEqualityObj(bool,System.Collections.IEqualityComparer,System.Object,System.Object):bool
test eax, eax
je SHORT G_M62082_IG08
inc rsi
jmp SHORT G_M62082_IG02
;; bbWeight=4 PerfScore 30.00
G_M62082_IG08:
xor eax, eax
;; bbWeight=0.50 PerfScore 0.12
G_M62082_IG09:
add rsp, 40
pop rbx
pop rbp
pop rsi
pop rdi
pop r12
pop r13
pop r14
pop r15
ret
;; bbWeight=0.50 PerfScore 2.62
G_M62082_IG10:
mov ecx, 21
mov edx, 50
call System.ThrowHelper:ThrowArgumentOutOfRangeException(int,int)
int3
; Total bytes of code 204, prolog size 16, PerfScore 122.15, instruction count 71, allocated bytes for code 204 (MethodHash=38330d7d) for method HashCompare:check@1369-2(bool,System.Collections.IEqualityComparer,System.Array,System.Array,long,long,long,long):bool
; ============================================================
This is also the case for
Lines 15 to 28 in 5761dd4
All the following regressions are also caused by inlining of this method:
Lines 19 to 32 in 5761dd4
However, it does eliminate the exception throw and presumably avoids prefetching all the cold code.
SyndicationLink constructor
G_M54833_IG01:
push rdi
push rsi
push rbp
push rbx
sub rsp, 40
mov rsi, rcx
mov rdi, r8
mov rbx, r9
mov rbp, qword ptr [rsp+78H]
;; bbWeight=1 PerfScore 6.00
G_M54833_IG02:
test rbp, rbp
jl SHORT G_M54833_IG05
;; bbWeight=1 PerfScore 1.25
G_M54833_IG03:
xor rcx, rcx
mov gword ptr [rsi+8], rcx
lea rcx, bword ptr [rsi+40]
call CORINFO_HELP_ASSIGN_REF
lea rcx, bword ptr [rsi+32]
mov rdx, rbx
call CORINFO_HELP_ASSIGN_REF
lea rcx, bword ptr [rsi+24]
mov rdx, rdi
call CORINFO_HELP_ASSIGN_REF
lea rcx, bword ptr [rsi+16]
mov rdx, gword ptr [rsp+70H]
call CORINFO_HELP_ASSIGN_REF
mov qword ptr [rsi+48], rbp
;; bbWeight=1 PerfScore 9.75
G_M54833_IG04:
add rsp, 40
pop rbx
pop rbp
pop rsi
pop rdi
ret
;; bbWeight=1 PerfScore 3.25
G_M54833_IG05:
mov rcx, 0xD1FFAB1E
call CORINFO_HELP_NEWSFAST
mov rsi, rax
mov ecx, 0xAC7
mov rdx, 0xD1FFAB1E
call CORINFO_HELP_STRCNS
mov rdx, rax
mov rcx, rsi
call System.ArgumentOutOfRangeException:.ctor(System.String):this
mov rcx, rsi
call CORINFO_HELP_THROW
int3
;; bbWeight=0 PerfScore 0.00
; Total bytes of code 151, prolog size 8, PerfScore 35.35, instruction count 43, allocated bytes for code 151 (MethodHash=395929ce) for method System.ServiceModel.Syndication.SyndicationLink:.ctor(System.Uri,System.String,System.String,System.String,long):this
; ============================================================
All the following regressions are also caused by inlining of this constructor:
|
So, the regressions due to the new returns being created are caused by the fact that runtime/src/coreclr/jit/flowgraph.cpp Line 9123 in 035821a
The limit on the number of returns created is constant right now: runtime/src/coreclr/jit/flowgraph.cpp Lines 8801 to 8804 in 035821a
This means that returns for constants are created "eagerly". A sample that demonstrates this behavior: sharplab. While this logic could be improved, my changes just exposed this behavior, and working on an item this large is out of my reach at this point anyway. |
Thank you @SingleAccretion for the analysis. |
In this case, we are able to successfully propagate the constants to the inlined callees very early, which lowers the number of local variables and thus CSE chooses to promote the array's length. In doing so, it replaces two
CSE section in the dump
Aggressive CSE Promotion cutoff is 200.000000
Moderate CSE Promotion cutoff is 100.000000
enregCount is 12
Framesize estimate is 0x0000
We have a small frame
Sorted CSE candidates:
CSE #02, {$d1 , $4 } useCnt=0: [def=150.000000, use=0.000000, cost= 4 ]
:: N004 ( 4, 4) CSE #02 (def)[000101] ---XG------- * IND int <l:$d3, c:$fb>
CSE #01, {$100, $140} useCnt=1: [def=75.000000, use=50.000000, cost= 3 ]
:: N002 ( 3, 3) CSE #01 (def)[000060] ---X-------- * ARR_LENGTH int $c1
Skipped CSE #02 because use count is 0
Considering CSE #01 {$100, $140} [def=75.000000, use=50.000000, cost= 3 ]
CSE Expression :
N002 ( 3, 3) CSE #01 (def)[000060] ---X-------- * ARR_LENGTH int $c1
N001 ( 1, 1) [000059] ------------ \--* LCL_VAR ref V01 arg1 u:1 $81
Aggressive CSE Promotion (200.000000 >= 200.000000)
cseRefCnt=200.000000, aggressiveRefCnt=200.000000, moderateRefCnt=100.000000
defCnt=75.000000, useCnt=50.000000, cost=3, size=3
def_cost=1, use_cost=1, extra_no_cost=4, extra_yes_cost=0
CSE cost savings check (154.000000 >= 125.000000) passes
Promoting CSE:
lvaGrabTemp returning 15 (V15 rat0) (a long lifetime temp) called for CSE - aggressive.
New refCnts for V15: refCnt = 2, refCntWtd = 0.50
New refCnts for V15: refCnt = 3, refCntWtd = 1
New refCnts for V15: refCnt = 4, refCntWtd = 1.50
New refCnts for V15: refCnt = 5, refCntWtd = 2
CSE #01 def at [000060] replaced in BB02 with def of V15
optValnumCSE morphed tree:
N009 ( 14, 12) [000063] -A-X-------- * JTRUE void
N008 ( 12, 10) [000062] JA-X---N---- \--* EQ int $c3
N006 ( 10, 8) [000277] -A-X-------- +--* COMMA int $c1
N004 ( 7, 6) [000275] -A-X----R--- | +--* ASG int $VN.Void
N003 ( 3, 2) [000274] D------N---- | | +--* LCL_VAR int V15 cse0 $c1
N002 ( 3, 3) [000060] ---X-------- | | \--* ARR_LENGTH int $c1
N001 ( 1, 1) [000059] ------------ | | \--* LCL_VAR ref V01 arg1 u:1 $81
N005 ( 3, 2) [000276] ------------ | \--* LCL_VAR int V15 cse0 $c1
N007 ( 1, 1) [000061] ------------ \--* CNS_INT int 0 $40
CSE #01 def at [000215] replaced in BB09 with def of V15
optValnumCSE morphed tree:
N015 ( 24, 24) [000010] -A-XG---R--- * ASG int <l:$c9, c:$c8>
N014 ( 3, 2) [000009] D------N---- +--* LCL_VAR int V03 loc0 d:1 <l:$c7, c:$c6>
N013 ( 20, 21) [000222] -A-XG------- \--* COMMA ushort <l:$c5, c:$340>
N008 ( 15, 16) [000216] -A-X-------- +--* ARR_BOUNDS_CHECK_Rng void $145
N001 ( 1, 1) [000007] ------------ | +--* CNS_INT int 0 $40
N007 ( 10, 8) [000281] -A-X-------- | \--* COMMA int $c1
N005 ( 7, 6) [000279] -A-X----R--- | +--* ASG int $VN.Void
N004 ( 3, 2) [000278] D------N---- | | +--* LCL_VAR int V15 cse0 $c1
N003 ( 3, 3) [000215] ---X-------- | | \--* ARR_LENGTH int $c1
N002 ( 1, 1) [000006] ------------ | | \--* LCL_VAR ref V01 arg1 u:1 $81
N006 ( 3, 2) [000280] ------------ | \--* LCL_VAR int V15 cse0 $c1
N012 ( 5, 5) [000008] a---G------- \--* IND ushort <l:$c4, c:$300>
N011 ( 2, 2) [000221] -------N---- \--* ADD byref $280
N009 ( 1, 1) [000213] ------------ +--* LCL_VAR ref V01 arg1 u:1 $81
N010 ( 1, 1) [000220] ------------ \--* CNS_INT long 12 Fseq[#FirstElem] $1c0
Working on the replacement of the CSE #01 use at [000012] in BB09
optValnumCSE morphed tree:
N004 ( 7, 6) [000015] ------------ * JTRUE void
N003 ( 5, 4) [000014] N------N-U-- \--* NE int $cb
N001 ( 3, 2) [000282] ------------ +--* LCL_VAR int V15 cse0 $100
N002 ( 1, 1) [000013] ------------ \--* CNS_INT int 1 $41
I have reduced this to the following method:
public static int Get(string a)
{
    if (string.IsNullOrEmpty(a))
    {
        return 7;
    }
    else
    {
        if (a.Length is 1 || a.Length is 2)
        {
            return 9;
        }
        else
        {
            return 11;
        }
    }
}
Which has the following blocks before CSE:
Basic blocks
-----------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds weight lp [IL range] [jump] [EH region] [flags]
-----------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000] 1 1 [000..008)-> BB05 ( cond ) i label target
BB02 [0007] 1 BB01 0.25 [000..001)-> BB05 ( cond ) i idxlen
BB03 [0008] 1 BB02 0.50 [000..001)-> BB07 ( cond ) i
BB04 [0011] 1 BB03 0.50 [???..???)-> BB06 (always) internal
BB05 [0009] 2 BB01,BB02 0.50 [000..001)-> BB07 ( cond ) i label target
BB06 [0001] 2 BB04,BB05 0.50 [008..00A) (return) i target
BB07 [0002] 2 BB03,BB05 0.50 [00A..013)-> BB09 ( cond ) i label target idxlen
BB08 [0003] 1 BB07 0.50 [013..01C)-> BB10 ( cond ) i idxlen
BB09 [0004] 2 BB07,BB08 0.50 [01C..01F) (return) i label target
BB10 [0005] 1 BB08 0.50 [01F..022) (return) i label target
-----------------------------------------------------------------------------------------------------------------------------------------
------------ BB01 [000..008) -> BB05 (cond), preds={} succs={BB02,BB05}
***** BB01
STMT00007 (IL 0x000... ???)
N004 ( 5, 5) [000024] ------------ * JTRUE void
N003 ( 3, 3) [000023] J------N---- \--* EQ int $c0
N001 ( 1, 1) [000000] ------------ +--* LCL_VAR ref V00 arg0 u:1 $80
N002 ( 1, 1) [000022] ------------ \--* CNS_INT ref null $VN.Null
------------ BB02 [000..001) -> BB05 (cond), preds={BB01} succs={BB03,BB05}
***** BB02
STMT00009 (IL 0x000... ???)
N005 ( 7, 7) [000034] ---X-------- * JTRUE void
N004 ( 5, 5) [000033] J--X---N---- \--* EQ int $c3
N002 ( 3, 3) [000031] ---X-------- +--* ARR_LENGTH int $c1
N001 ( 1, 1) [000030] ------------ | \--* LCL_VAR ref V00 arg0 u:1 $80
N003 ( 1, 1) [000032] ------------ \--* CNS_INT int 0 $40
------------ BB03 [000..001) -> BB07 (cond), preds={BB02} succs={BB04,BB07}
***** BB03
STMT00010 (IL 0x000... ???)
N003 ( 5, 4) [000038] -A------R--- * ASG bool $40
N002 ( 3, 2) [000037] D------N---- +--* LCL_VAR int V02 tmp1 d:2 $40
N001 ( 1, 1) [000035] ------------ \--* CNS_INT int 0 $40
***** BB03
STMT00011 (IL ???... ???)
N004 ( 8, 7) [000041] ------------ * JTRUE void
N003 ( 6, 5) [000042] J------N---- \--* EQ int $42
N001 ( 4, 3) [000043] ------------ +--* LCL_VAR bool V02 tmp1 u:2 (last use) $40
N002 ( 1, 1) [000044] ------------ \--* CNS_INT int 0 $40
------------ BB04 [???..???) -> BB06 (always), preds={BB03} succs={BB06}
------------ BB05 [000..001) -> BB07 (cond), preds={BB01,BB02} succs={BB06,BB07}
***** BB05
STMT00008 (IL 0x000... ???)
N003 ( 5, 4) [000028] -A------R--- * ASG bool $42
N002 ( 3, 2) [000027] D------N---- +--* LCL_VAR int V02 tmp1 d:1 $42
N001 ( 1, 1) [000025] ------------ \--* CNS_INT int 1 $42
***** BB05
STMT00001 (IL ???... ???)
N004 ( 8, 7) [000005] ------------ * JTRUE void
N003 ( 6, 5) [000004] J------N---- \--* EQ int $40
N001 ( 4, 3) [000039] ------------ +--* LCL_VAR bool V02 tmp1 u:1 (last use) $42
N002 ( 1, 1) [000003] ------------ \--* CNS_INT int 0 $40
------------ BB06 [008..00A) (return), preds={BB04,BB05} succs={}
***** BB06
STMT00006 (IL 0x008...0x009)
N002 ( 2, 2) [000021] ------------ * RETURN int $184
N001 ( 1, 1) [000020] ------------ \--* CNS_INT int 7 $46
------------ BB07 [00A..013) -> BB09 (cond), preds={BB03,BB05} succs={BB08,BB09}
***** BB07
STMT00002 (IL 0x00A...0x011)
N005 ( 7, 7) [000010] ---X-------- * JTRUE void
N004 ( 5, 5) [000009] J--X---N---- \--* EQ int $c5
N002 ( 3, 3) [000007] ---X-------- +--* ARR_LENGTH int $c1
N001 ( 1, 1) [000006] ------------ | \--* LCL_VAR ref V00 arg0 u:1 $80
N003 ( 1, 1) [000008] ------------ \--* CNS_INT int 1 $42
------------ BB08 [013..01C) -> BB10 (cond), preds={BB07} succs={BB09,BB10}
***** BB08
STMT00004 (IL 0x013...0x01A)
N005 ( 7, 7) [000017] ---X-------- * JTRUE void
N004 ( 5, 5) [000016] N--X---N-U-- \--* NE int $c7
N002 ( 3, 3) [000014] ---X-------- +--* ARR_LENGTH int $c1
N001 ( 1, 1) [000013] ------------ | \--* LCL_VAR ref V00 arg0 u:1 (last use) $80
N003 ( 1, 1) [000015] ------------ \--* CNS_INT int 2 $43
------------ BB09 [01C..01F) (return), preds={BB07,BB08} succs={}
***** BB09
STMT00003 (IL 0x01C...0x01E)
N002 ( 2, 2) [000012] ------------ * RETURN int $183
N001 ( 1, 1) [000011] ------------ \--* CNS_INT int 9 $45
------------ BB10 [01F..022) (return), preds={BB08} succs={}
***** BB10
STMT00005 (IL 0x01F...0x021)
N002 ( 2, 2) [000019] ------------ * RETURN int $182
N001 ( 1, 1) [000018] ------------ \--* CNS_INT int 11 $44
What is important here is that there are blocks left (namely,
Notably, for this simplistic reproduction CSE is still valuable as it has a
Overall, it seems to me that things in this case work as they are supposed to; the unfortunate input just causes unfortunate output. |
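The numbers in the CSE section of the dump above can be reproduced with a small sketch. The formulas below are inferred from the printed values (defCnt=75, useCnt=50, cost=3, def_cost=1, use_cost=1, extra_no_cost=4, extra_yes_cost=0), not copied from the JIT sources, and the struct and function names are hypothetical. Under that assumption, the "aggressive" reference count and both sides of the "CSE cost savings check" come out as the dump shows: 200, 154 and 125.

```cpp
#include <cassert>

// Hypothetical containers for the values printed in the dump.
struct CseCandidate
{
    double defCnt; // weighted number of definitions
    double useCnt; // weighted number of uses
    int    cost;   // cost of evaluating the expression once
};

struct CseCosts
{
    int defCost;
    int useCost;
    int extraNoCost;
    int extraYesCost;
};

// Assumed: definitions count twice when comparing against the promotion cutoffs.
double CseRefCnt(const CseCandidate& c)
{
    return c.defCnt * 2 + c.useCnt;
}

// Assumed cost of leaving the expression alone: every use re-evaluates it.
double NoCseCost(const CseCandidate& c, const CseCosts& k)
{
    return c.useCnt * c.cost + k.extraNoCost;
}

// Assumed cost of performing the CSE: one store per def, one load per use.
double YesCseCost(const CseCandidate& c, const CseCosts& k)
{
    return c.defCnt * k.defCost + c.useCnt * k.useCost + k.extraYesCost;
}

// The check passes when not doing the CSE is at least as expensive as doing it.
bool CostSavingsCheckPasses(const CseCandidate& c, const CseCosts& k)
{
    return NoCseCost(c, k) >= YesCseCost(c, k);
}
```

With the dump's values, `CseRefCnt` gives 2 * 75 + 50 = 200, which meets the aggressive cutoff of 200, and the savings check compares 50 * 3 + 4 = 154 against 75 * 1 + 50 * 1 + 0 = 125.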
Looks good.
Thank you for the fix!
See #47123 for more context. I have rerun the crossgen diffs and got some more changes, but nothing major (the regressions are still due to more inlining).
Crossgen diffs
Example of a regression: https://www.diffchecker.com/jtOQlJX4.
Example of an improvement: https://www.diffchecker.com/7G19dkSM.
I haven't rerun the PMI diffs yet as they take a really long time on my machine (1.5 hours for two full runs) and kind of make everything else hard to use (80+ processes), and I want to ensure I have the right logic in the right place first (which is unlikely to be true as of this PR's submission). It is expected that they will be more sizeable.
Edit: looks like some things are being folded that shouldn't be. Investigating...
Edit: looks like there was a bug in the morphing of long modulus.
Edit: investigating unexpected regressions...
Edit: regressions have been analyzed, see below.