gcc-xtensa appears to hardcode data alignment to 4 #2

Closed
pfalcon opened this issue Feb 4, 2015 · 7 comments

Comments

@pfalcon

pfalcon commented Feb 4, 2015

I'm investigating why a project built for xtensa produces an unexpectedly large data section. The project is built with -fdata-sections (which is common for embedded projects; the issue likely manifests itself even without it). In the map file I see:

 .rodata.rule_yield_arg
                0x000000003ffe9cfc        0x6 build/py/parse.o
 *fill*         0x000000003ffe9d02        0x2
 .rodata.rule_yield_expr
                0x000000003ffe9d04        0x6 build/py/parse.o
 *fill*         0x000000003ffe9d0a        0x2

I.e. each 6-byte structure gets aligned on a 4-byte boundary. Looking at the structures, they consist only of shorts, i.e. they have a natural alignment of 2.

The simplest testcase to reproduce the issue is:

struct foo {
    short a, b, c;
};

struct foo s1 = {1};
struct foo s2 = {2};

When built for both arm and x64, this produces the following assembly:

    .global s1
    .data
    .align  2
    .type   s1, %object
    .size   s1, 6
s1:
    .short  1
    .space  4
    .global s2
    .align  2
    .type   s2, %object
    .size   s2, 6
s2:
    .short  2
    .space  4

With xtensa-lx106-elf-gcc the result is:

    .global s1
    .data
    .align  4
    .type   s1, @object
    .size   s1, 6
s1:
    .short  1
    .zero   4
    .global s2
    .align  4
    .type   s2, @object
    .size   s2, 6
s2:
    .short  2
    .zero   4

Note the difference in ".align" directives.

The expected behavior is that a structure's alignment should be its natural alignment (defined as the maximum alignment of any of its fields). Is the current xtensa-lx106-elf-gcc behavior grounded in an Xtensa ABI requirement or something similar? Even if it is, behavior like the above is detrimental for embedded usage, where ABI issues are not relevant but losses from overzealous alignment are noticeable (for example, in the original case there are hundreds of such structures; if structures/variables are just single shorts, there's a 50% loss of space with 4-byte alignment).
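
For a quick sanity check of what "natural alignment" means here, a minimal sketch (C++11 here for brevity; the same checks work in C11 with _Alignof/_Static_assert, and the alignment of short is assumed to be 2 on these targets):

    // Minimal sketch: assert the natural alignment the report expects.
    struct foo {
        short a, b, c;
    };

    // Natural alignment is the maximum alignment of the fields, i.e. that
    // of short, and the struct needs no internal padding.
    static_assert(alignof(foo) == alignof(short), "expect short alignment");
    static_assert(sizeof(foo) == 3 * sizeof(short), "expect no padding");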

@jcmvbkbc
Owner

jcmvbkbc commented Feb 6, 2015

That matches exactly what MIPS does. I've had a look at what others do here and found no consensus. The most common approach is to use natural alignment when optimizing for size. I can implement that; will that work for you?
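
A minimal sketch of that general idea, assuming it is expressed through GCC's DATA_ALIGNMENT target macro (the actual fix may look different):

    /* Sketch only: keep the type's natural alignment under -Os, and only
       round objects up to word alignment when optimizing for speed.
       DATA_ALIGNMENT returns the alignment (in bits) used for an object
       placed in the data section; ALIGN is the type's natural alignment.  */
    #define DATA_ALIGNMENT(TYPE, ALIGN)                                  \
      (optimize_size ? (ALIGN)                                           \
       : ((ALIGN) < BITS_PER_WORD ? BITS_PER_WORD : (ALIGN)))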

@pfalcon
Author

pfalcon commented Feb 6, 2015

Yes, sure, if you think it makes sense, that would be good enough and definitely would help that project, as it's built with -Os. Thanks!

@jcmvbkbc
Owner

jcmvbkbc commented Feb 6, 2015

Pushed a proposed fix to the call0-4.8.2-natural-align branch for preview.
Will test and integrate soon.
Thanks for your report.

@jcmvbkbc
Owner

Fixed and submitted upstream.

@pfalcon
Author

pfalcon commented Feb 22, 2015

Thanks. I have been travelling and haven't had a chance to look into it since, but I hope to get to it in the coming weeks.

@pfalcon
Author

pfalcon commented Jun 12, 2015

It turns out I never tested this properly, nor upgraded esp-open-sdk to the version with this patch. I have done the testing now, using the MicroPython esp8266 build as a subject.

Here are the section address/size diffs between builds with the old and new toolchains:

 .irom0.text     0x0000000040210000    0x3fe2c
 .text           0x0000000040100000     0x7236
 .data           0x000000003ffe8000      0x574
-.rodata         0x000000003ffe8580     0x511c
-.bss            0x000000003ffed6a0     0xd3a0
+.rodata         0x000000003ffe8580     0x4c80
+.bss            0x000000003ffed200     0xd3a0

So, only .rodata is affected, and more than a kilobyte was saved. For MicroPython, that can well amount to saving 10% of available RAM, i.e. really good results. Thanks! (esp-open-sdk has also been upgraded.)

@dpgeorge: FYI

@dpgeorge

@pfalcon thanks for the ping; it's great that such improvements can be made upstream for all to benefit from.

jcmvbkbc pushed a commit that referenced this issue May 30, 2017
	* tree.c (ovl_copy): Adjust assert, copy OVL_LOOKUP.
	(ovl_used): New.
	(lookup_keep): Call it.

	PR c++/80891 (#2)
	* g++.dg/lookup/pr80891-2.C: New.


git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@248570 138bc75d-0d04-0410-961f-82ee72b054a4
jcmvbkbc pushed a commit that referenced this issue Jun 18, 2018
This patch implements GCC support for mitigating vulnerability
CVE-2017-5715 known as Spectre #2 on IBM Z.

In order to disable prediction of indirect branches the implementation
makes use of an IBM Z specific feature - the execute instruction.
Performing an indirect branch via execute prevents the branch from
being subject to dynamic branch prediction.

The implementation tries to stay close to the x86 solution regarding
user interface.

x86 style options supported (without thunk-inline):

-mindirect-branch=(keep|thunk|thunk-extern)
-mfunction-return=(keep|thunk|thunk-extern)

IBM Z specific options:

-mindirect-branch-jump=(keep|thunk|thunk-extern|thunk-inline)
-mindirect-branch-call=(keep|thunk|thunk-extern)
-mfunction-return-reg=(keep|thunk|thunk-extern)
-mfunction-return-mem=(keep|thunk|thunk-extern)

These options allow us to enable/disable the branch conversion at a
finer granularity.

-mindirect-branch sets the value of -mindirect-branch-jump and
 -mindirect-branch-call.

-mfunction-return sets the value of -mfunction-return-reg and
 -mfunction-return-mem.

All these options are supported on GCC command line as well as
function attributes.

'thunk' triggers the generation of out of line thunks (expolines) and
replaces the formerly indirect branch with a direct branch to the
thunk.  Depending on the -march= setting two different types of thunks
are generated.  With -march=z10 or higher exrl (execute relative long)
is being used while targeting older machines makes use of larl/ex
instead.  From a security perspective the exrl variant is preferable.

'thunk-extern' does the branch replacement like 'thunk' but does not
emit the thunks.

'thunk-inline' is only available for indirect jumps.  It should be used
in environments where correct CFI is important - known as user space.

Additionally the patch introduces the -mindirect-branch-table option
which generates tables pointing to the locations which have been
modified.  This is supposed to allow reverting the changes without
re-compilation in situations where it isn't required. The sections are
split up into one section per option.

gcc/ChangeLog:

2018-02-08  Andreas Krebbel  <krebbel@linux.vnet.ibm.com>

	* config/s390/s390-opts.h (enum indirect_branch): Define.
	* config/s390/s390-protos.h (s390_return_addr_from_memory)
	(s390_indirect_branch_via_thunk)
	(s390_indirect_branch_via_inline_thunk): Add function prototypes.
	(enum s390_indirect_branch_type): Define.
	* config/s390/s390.c (struct s390_frame_layout, struct
	machine_function): Remove.
	(indirect_branch_prez10thunk_mask, indirect_branch_z10thunk_mask)
	(indirect_branch_table_label_no, indirect_branch_table_name):
	Define variables.
	(INDIRECT_BRANCH_NUM_OPTIONS): Define macro.
	(enum s390_indirect_branch_option): Define.
	(s390_return_addr_from_memory): New function.
	(s390_handle_string_attribute): New function.
	(s390_attribute_table): Add new attribute handler.
	(s390_execute_label): Handle UNSPEC_EXECUTE_JUMP patterns.
	(s390_indirect_branch_via_thunk): New function.
	(s390_indirect_branch_via_inline_thunk): New function.
	(s390_function_ok_for_sibcall): When jumping via thunk disallow
	sibling call optimization for non z10 compiles.
	(s390_emit_call): Force indirect branch target to be a single
	register.  Add r1 clobber for non-z10 compiles.
	(s390_emit_epilogue): Emit return jump via return_use expander.
	(s390_reorg): Handle JUMP_INSNs as execute targets.
	(s390_option_override_internal): Perform validity checks for the
	new command line options.
	(s390_indirect_branch_attrvalue): New function.
	(s390_indirect_branch_settings): New function.
	(s390_set_current_function): Invoke s390_indirect_branch_settings.
	(s390_output_indirect_thunk_function):  New function.
	(s390_code_end): Implement target hook.
	(s390_case_values_threshold): Implement target hook.
	(TARGET_ASM_CODE_END, TARGET_CASE_VALUES_THRESHOLD): Define target
	macros.
	* config/s390/s390.h (struct s390_frame_layout)
	(struct	machine_function): Move here from s390.c.
	(TARGET_INDIRECT_BRANCH_NOBP_RET)
	(TARGET_INDIRECT_BRANCH_NOBP_JUMP)
	(TARGET_INDIRECT_BRANCH_NOBP_JUMP_THUNK)
	(TARGET_INDIRECT_BRANCH_NOBP_JUMP_INLINE_THUNK)
	(TARGET_INDIRECT_BRANCH_NOBP_CALL)
	(TARGET_DEFAULT_INDIRECT_BRANCH_TABLE)
	(TARGET_INDIRECT_BRANCH_THUNK_NAME_EXRL)
	(TARGET_INDIRECT_BRANCH_THUNK_NAME_EX)
	(TARGET_INDIRECT_BRANCH_TABLE): Define macros.
	* config/s390/s390.md (UNSPEC_EXECUTE_JUMP)
	(INDIRECT_BRANCH_THUNK_REGNUM): Define constants.
	(mnemonic attribute): Add values which aren't recognized
	automatically.
	("*cjump_long", "*icjump_long", "*basr", "*basr_r"): Disable
	pattern for branch conversion.  Fix mnemonic attribute.
	("*c<code>", "*sibcall_br", "*sibcall_value_br", "*return"): Emit
	indirect branch via thunk if requested.
	("indirect_jump", "<code>"): Expand patterns for branch conversion.
	("*indirect_jump"): Disable for branch conversion using out of
	line thunks.
	("indirect_jump_via_thunk<mode>_z10")
	("indirect_jump_via_thunk<mode>")
	("indirect_jump_via_inlinethunk<mode>_z10")
	("indirect_jump_via_inlinethunk<mode>", "*casesi_jump")
	("casesi_jump_via_thunk<mode>_z10", "casesi_jump_via_thunk<mode>")
	("casesi_jump_via_inlinethunk<mode>_z10")
	("casesi_jump_via_inlinethunk<mode>", "*basr_via_thunk<mode>_z10")
	("*basr_via_thunk<mode>", "*basr_r_via_thunk_z10")
	("*basr_r_via_thunk", "return<mode>_prez10"): New pattern.
	("*indirect2_jump"): Disable for branch conversion.
	("casesi_jump"): Turn into expander and expand patterns for branch
	conversion.
	("return_use"): New expander.
	("*return"): Emit return via thunk and rename it to ...
	("*return<mode>"): ... this one.
	* config/s390/s390.opt: Add new options and an enum for the
	option values.

gcc/testsuite/ChangeLog:

2018-02-08  Andreas Krebbel  <krebbel@linux.vnet.ibm.com>

	* gcc.target/s390/nobp-function-pointer-attr.c: New test.
	* gcc.target/s390/nobp-function-pointer-nothunk.c: New test.
	* gcc.target/s390/nobp-function-pointer-z10.c: New test.
	* gcc.target/s390/nobp-function-pointer-z900.c: New test.
	* gcc.target/s390/nobp-indirect-jump-attr.c: New test.
	* gcc.target/s390/nobp-indirect-jump-inline-attr.c: New test.
	* gcc.target/s390/nobp-indirect-jump-inline-z10.c: New test.
	* gcc.target/s390/nobp-indirect-jump-inline-z900.c: New test.
	* gcc.target/s390/nobp-indirect-jump-nothunk.c: New test.
	* gcc.target/s390/nobp-indirect-jump-z10.c: New test.
	* gcc.target/s390/nobp-indirect-jump-z900.c: New test.
	* gcc.target/s390/nobp-return-attr-all.c: New test.
	* gcc.target/s390/nobp-return-attr-neg.c: New test.
	* gcc.target/s390/nobp-return-mem-attr.c: New test.
	* gcc.target/s390/nobp-return-mem-nothunk.c: New test.
	* gcc.target/s390/nobp-return-mem-z10.c: New test.
	* gcc.target/s390/nobp-return-mem-z900.c: New test.
	* gcc.target/s390/nobp-return-reg-attr.c: New test.
	* gcc.target/s390/nobp-return-reg-mixed.c: New test.
	* gcc.target/s390/nobp-return-reg-nothunk.c: New test.
	* gcc.target/s390/nobp-return-reg-z10.c: New test.
	* gcc.target/s390/nobp-return-reg-z900.c: New test.
	* gcc.target/s390/nobp-table-jump-inline-z10.c: New test.
	* gcc.target/s390/nobp-table-jump-inline-z900.c: New test.
	* gcc.target/s390/nobp-table-jump-z10.c: New test.
	* gcc.target/s390/nobp-table-jump-z900.c: New test.



git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@257489 138bc75d-0d04-0410-961f-82ee72b054a4
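
For illustration, a hedged sketch of how the options and attributes described in the commit above might be combined; the attribute spellings are assumed to mirror the command-line option names:

    // Build the whole translation unit with expolines, e.g.
    //   g++ -march=z10 -mindirect-branch=thunk -mfunction-return=thunk unit.cc
    // and opt a single function back out via attributes (sketch only).
    extern void (*handler)(int);

    __attribute__ ((indirect_branch ("keep"), function_return ("keep")))
    void hot_dispatch (int x)
    {
      handler (x);  // with "thunk" in effect this call would go via an expoline
    }
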
jcmvbkbc pushed a commit that referenced this issue Jun 18, 2018
When -fcf-protection -mcet is used, I got

FAIL: g++.dg/eh/sighandle.C

(gdb) bt
 #0  _Unwind_RaiseException (exc=exc@entry=0x416ed0)
    at /export/gnu/import/git/sources/gcc/libgcc/unwind.inc:140
 #1  0x00007ffff7d9936b in __cxxabiv1::__cxa_throw (obj=<optimized out>,
    tinfo=0x403dd0 <typeinfo for int@@CXXABI_1.3>, dest=0x0)
    at /export/gnu/import/git/sources/gcc/libstdc++-v3/libsupc++/eh_throw.cc:90
 #2  0x0000000000401255 in sighandler (signo=11, si=0x7fffffffd6f8,
    uc=0x7fffffffd5c0)
    at /export/gnu/import/git/sources/gcc/gcc/testsuite/g++.dg/eh/sighandle.C:9
 #3  <signal handler called> <<<< Signal frame which isn't on shadow stack
 #4  dosegv ()
    at /export/gnu/import/git/sources/gcc/gcc/testsuite/g++.dg/eh/sighandle.C:14
 #5  0x00000000004012e3 in main ()
    at /export/gnu/import/git/sources/gcc/gcc/testsuite/g++.dg/eh/sighandle.C:30
(gdb) p frames
$6 = 5
(gdb)

frame count should be 4, not 5.  This patch skips signal frames when
unwinding shadow stack.

gcc/testsuite/

	PR libgcc/85334
	* g++.dg/torture/pr85334.C: New test.

libgcc/

	PR libgcc/85334
	* unwind-generic.h (_Unwind_Frames_Increment): New.
	* config/i386/shadow-stack-unwind.h (_Unwind_Frames_Increment):
	Likewise.
	* unwind.inc (_Unwind_RaiseException_Phase2): Increment frame
	count with _Unwind_Frames_Increment.
	(_Unwind_ForcedUnwind_Phase2): Likewise.


git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@259502 138bc75d-0d04-0410-961f-82ee72b054a4
jcmvbkbc pushed a commit that referenced this issue Jun 19, 2018
Make ix86_frame available to i386 code generation.  This is needed to
backport the patch set of -mindirect-branch= to mitigate variant #2 of
the speculative execution vulnerabilities on x86 processors identified
by CVE-2017-5715, aka Spectre.

	Backport from mainline
	* config/i386/i386.c (ix86_frame): Moved to ...
	* config/i386/i386.h (ix86_frame): Here.
	(machine_function): Add frame.
	* config/i386/i386.c (ix86_compute_frame_layout): Replace the
	frame argument with &cfun->machine->frame.
	(ix86_can_use_return_insn_p): Don't pass &frame to
	ix86_compute_frame_layout.  Copy frame from cfun->machine->frame.
	(ix86_can_eliminate): Likewise.
	(ix86_expand_prologue): Likewise.
	(ix86_expand_epilogue): Likewise.
	(ix86_expand_split_stack_prologue): Likewise.


git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/branches/gcc-7-branch@256691 138bc75d-0d04-0410-961f-82ee72b054a4
jcmvbkbc pushed a commit that referenced this issue Jun 19, 2018
This patch fixes an LRA cycling problem on the attached testcase.
The original insn was:

(insn 74 72 76 8 (set (reg:V2DI 287 [ _166 ])
        (subreg:V2DI (reg/v/f:DI 112 [ d ]) 0)) 1060 {*aarch64_simd_movv2di}
     (nil))

which IRA converted to:

(insn 74 72 580 8 (set (reg:V2DI 287 [ _166 ])
        (subreg:V2DI (reg/v/f:DI 517 [orig:112 d ] [112]) 0)) 1060 {*aarch64_simd_movv2di}
     (nil))

after creating loop allocnos.  It happens that the ALLOCNO_WMODEs for
both 112 and 517 were not set to V2DI due to another bug that I'll post
a separate patch for, but we nevertheless got a valid allocation of
register 1.

LRA's first try at constraining the instruction gave:

         Choosing alt 5 in insn 74:  (0) ?w  (1) r {*aarch64_simd_movv2di}

at which point all was good.  But LRA later decided it needed
to spill r517:

    Spill r517 after risky transformations

so the next constraint attempt gave:

         Choosing alt 0 in insn 74:  (0) =w  (1) m {*aarch64_simd_movv2di}

which was still good.  Then during inheritance we had:

      Creating newreg=672 from oldreg=517, assigning class GENERAL_REGS to inheritance r672
    Original reg change 517->672 (bb8):
   74: r287:V2DI=r672:DI#0
    Add inheritance<-original before:
  939: r672:DI=r517:DI

    Inheritance reuse change 517->672 (bb8):
  620: r572:DI=r672:DI
      REG_DEAD r672:DI

    Use smallest class of POINTER_REGS and GENERAL_REGS
      Creating newreg=673 from oldreg=517, assigning class POINTER_REGS to inheritance r673
    Original reg change 517->673 (bb8):
  936: r669:DI=r673:DI
    Add inheritance<-original before:
  940: r673:DI=r517:DI

("Use smallest class of POINTER_REGS and GENERAL_REGS" ought to
give GENERAL_REGS.  That might be a missed optimisation, and probably
due to both classes having the same number of allocatable registers.
I'll look at that as a follow-on.)

Thus LRA created two inheritance registers for r517, one (r673)
that included the unallocatable x31 and another (r672) that didn't.
The r672 references included the paradoxical subreg in insn 74 but the
r673 ones didn't.  LRA then allocated x30 to r673, which was a valid
choice.

Later LRA decided to "undo" the inheritance for insn 620, but because
of the double inheritance, it got confused as to what the original
situation was, and made insn 74 use the other inheritance register
instead of r517:

********** Undoing inheritance #2: **********

Inherit 11 out of 12 (91.67%)
   Insn after restoring regs:
  620: r572:DI=r517:DI
      REG_DEAD r517:DI
    Change reload insn:
   74: r287:V2DI=r673:DI#0       <-------------------
   Insn after restoring regs:
  939: r517:DI=r673:DI
      REG_DEAD r673:DI

This might be a bug in itself: we should probably look through sets
of other inheritance pseudos to find the "real" origin.

Either way, at this point we had a situation in which r673 was used in an
insn whose subreg was larger than the biggest_mode that r673 had when it
was allocated.  While x30 was valid for the original biggest_mode, it
wasn't valid for this subreg use.

The next attempt to constrain insn 74 was:

        Choosing alt 5 in insn 74:  (0) ?w  (1) r {*aarch64_simd_movv2di}
      Creating newreg=684, assigning class GENERAL_REGS to r684
   74: r287:V2DI=r684:V2DI
    Inserting insn reload before:
  951: r684:V2DI=r673:DI#0

where LRA reloaded the SUBREG rather than the SUBREG_REG.  And it
then cycled trying the same thing when reloading the reload (and the
reload of the reload, etc.).

What it should be doing here is reloading the SUBREG_REG instead.
There's already code to cope with this case when the paradoxical
subreg falls outside the class (which isn't true here, since r673
is POINTER_REGS and POINTER_REGS includes x31).  But I think we
should also test whether LRA is entitled to allocate the spanned
registers.  Not doing that seems like a bug regardless of the above
missed optimisation and the mix-up undoing inheritance.

2018-05-30  Richard Sandiford  <richard.sandiford@linaro.org>

gcc/
	* lra-constraints.c (simplify_operand_subreg): In the paradoxical
	case, check whether the outer register overlaps an unallocatable
	register, not just whether it fits the required class.

gcc/testsuite/
	* g++.dg/torture/aarch64-vect-init-1.C: New test.


git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@261531 138bc75d-0d04-0410-961f-82ee72b054a4
jcmvbkbc pushed a commit that referenced this issue Jan 28, 2019
This adds a 4th information level for the -gnatR output, where relevant
compiler-generated types are listed in addition to the information
already output by -gnatR3.

For the following package P:

package P is

  type Arr0 is array (Positive range <>) of Boolean;

    type Rec (D1 : Positive; D2 : Boolean) is record
       C1 : Integer;
       C2 : Arr0 (1 .. D1);

       case D2 is
          when False =>
             C3 : Character;
          when True =>
             C4 : String (1 .. 3);
             C5 : Float;
       end case;
    end record;

    type Arr1 is array (1 .. 8) of Rec (1, True);

end P;

the output generated by -gnatR4 must be:

Representation information for unit P (spec)
--------------------------------------------

for Arr0'Alignment use 1;
for Arr0'Component_Size use 8;

for Rec'Object_Size use 17179869344;
for Rec'Value_Size use (if (#2 != 0) then ((((#1 + 15) & -4) + 8) * 8)
else ((((#1 + 15) & -4) + 1) * 8) end);
for Rec'Alignment use 4;
for Rec use record
   D1 at  0 range  0 .. 31;
   D2 at  4 range  0 ..  7;
   C1 at  8 range  0 .. 31;
   C2 at 12 range  0 .. ((#1 * 8)) - 1;
   C3 at ((#1 + 15) & -4) range  0 ..  7;
   C4 at ((#1 + 15) & -4) range  0 .. 23;
   C5 at (((#1 + 15) & -4) + 4) range  0 .. 31;
end record;

for Arr1'Size use 1536;
for Arr1'Alignment use 4;
for Arr1'Component_Size use 192;

for Tarr1c'Size use 192;
for Tarr1c'Alignment use 4;
for Tarr1c use record
   D1 at  0 range  0 .. 31;
   D2 at  4 range  0 ..  7;
   C1 at  8 range  0 .. 31;
   C2 at 12 range  0 ..  7;
   C4 at 16 range  0 .. 23;
   C5 at 20 range  0 .. 31;
end record;

2018-11-14  Eric Botcazou  <ebotcazou@adacore.com>

gcc/ada/

	* doc/gnat_ugn/building_executable_programs_with_gnat.rst
	(-gnatR): Document new -gnatR4 level.
	* gnat_ugn.texi: Regenerate.
	* opt.ads (List_Representation_Info): Bump upper bound to 4.
	* repinfo.adb: Add with clause for GNAT.HTable.
	(Relevant_Entities_Size): New constant.
	(Entity_Header_Num): New type.
	(Entity_Hash): New function.
	(Relevant_Entities): New set implemented with GNAT.HTable.
	(List_Entities): Also list compiler-generated entities present
	in the Relevant_Entities set. Consider that the Component_Type
	of an array type is relevant.
	(List_Rep_Info): Reset Relevant_Entities for each unit.
	* switch-c.adb (Scan_Front_End_Switches): Add support for -gnatR4.
	* switch-m.adb (Normalize_Compiler_Switches): Likewise
	* usage.adb (Usage): Likewise.

git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@266131 138bc75d-0d04-0410-961f-82ee72b054a4
jcmvbkbc pushed a commit that referenced this issue Jun 7, 2020
This patch implements DR 2237 which says that a simple-template-id is
no longer valid as the declarator-id of a constructor or destructor;
see [diff.cpp17.class]#2.  It is not explicitly stated but out-of-line
destructors with a simple-template-id are also meant to be ill-formed
now.  (Out-of-line constructors like that are invalid since DR1435 I
think.)  This change only applies to C++20; it is not a DR against C++17.

I'm not crazy about the diagnostic in constructors but ISTM that
cp_parser_constructor_declarator_p shouldn't print errors.

	DR 2237
	* parser.c (cp_parser_unqualified_id): Reject simple-template-id as
	the declarator-id of a destructor.
	(cp_parser_constructor_declarator_p): Reject simple-template-id as
	the declarator-id of a constructor.

	* g++.dg/DRs/dr2237.C: New test.
	* g++.dg/parse/constructor2.C: Add dg-error for C++20.
	* g++.dg/parse/dtor12.C: Likewise.
	* g++.dg/parse/dtor4.C: Likewise.
	* g++.dg/template/dtor4.C: Adjust dg-error.
	* g++.dg/template/error34.C: Likewise.
	* g++.old-deja/g++.other/inline15.C: Only run for C++17 and lesser.
	* g++.old-deja/g++.pt/ctor2.C: Add dg-error for C++20.
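
A hedged illustration of the constructs this change rejects under C++20 (simplified; not taken from the new testcase):

    template<class T>
    struct S {
      S<T>();    // ill-formed in C++20: simple-template-id as declarator-id; write S();
      ~S<T>();   // likewise; write ~S();
    };

    // Out-of-line destructor spelled with a simple-template-id: also ill-formed now.
    template<class T>
    S<T>::~S<T>() { }
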
jcmvbkbc pushed a commit that referenced this issue Jun 11, 2022
Here ever since r12-6022-gbb2a7f80a98de3 we stopped deeming the partial
specialization #2 to be more specialized than #1 ultimately because
dependent operator expressions now have a DEPENDENT_OPERATOR_TYPE type
instead of an empty type, and this made unify stop deducing T(2) == 1
for K during partial ordering for #1 and #2.

This minimal patch fixes this by making the relevant logic in unify
treat DEPENDENT_OPERATOR_TYPE like an empty type.

	PR c++/105425

gcc/cp/ChangeLog:

	* pt.cc (unify) <case TEMPLATE_PARM_INDEX>: Treat
	DEPENDENT_OPERATOR_TYPE like an empty type.

gcc/testsuite/ChangeLog:

	* g++.dg/template/partial-specialization13.C: New test.
jcmvbkbc pushed a commit that referenced this issue Jun 11, 2022
Here during cp_parser_single_declaration for #2, we were calling
associate_classtype_constraints for TPL<T> (the primary template type)
before maybe_process_partial_specialization could get a chance to
notice that we're in fact declaring a distinct constrained partial
spec and not redeclaring the primary template.  This caused us to
emit a bogus error about differing constraints b/t the primary template
and #2's constraints.  This patch fixes this by moving the call to
associate_classtype_constraints after the call to shadow_tag (which
calls maybe_process_partial_specialization) and adjusting shadow_tag to
use the return value of m_p_p_s.

Moreover, if we later try to define a constrained partial specialization
that's been declared earlier (as in the third testcase), then
maybe_new_partial_specialization correctly notices it's a redeclaration
and returns NULL_TREE.  But in this case we also need to update TYPE to
point to the redeclared partial spec (it'll otherwise continue pointing
to the primary template type, eventually leading to a bogus error).

	PR c++/96363

gcc/cp/ChangeLog:

	* decl.cc (shadow_tag): Use the return value of
	maybe_process_partial_specialization.
	* parser.cc (cp_parser_single_declaration): Call shadow_tag
	before associate_classtype_constraints.
	* pt.cc (maybe_new_partial_specialization): Change return type
	to bool.  Take 'type' argument by mutable reference.  Set 'type'
	to point to the correct constrained specialization when
	appropriate.
	(maybe_process_partial_specialization): Adjust accordingly.

gcc/testsuite/ChangeLog:

	* g++.dg/cpp2a/concepts-partial-spec12.C: New test.
	* g++.dg/cpp2a/concepts-partial-spec12a.C: New test.
	* g++.dg/cpp2a/concepts-partial-spec13.C: New test.
jcmvbkbc pushed a commit that referenced this issue Jun 11, 2022
As explained in r11-4959-gde6f64f9556ae3, the atom cache assumes two
equivalent expressions (according to cp_tree_equal) must use the same
template parameters (according to find_template_parameters).  This
assumption turned out to not hold for TARGET_EXPR, which was addressed
by that commit.

But this assumption apparently doesn't hold for PARM_DECL either:
find_template_parameters walks its DECL_CONTEXT but cp_tree_equal by
default doesn't consider DECL_CONTEXT unless comparing_specializations
is set.  Thus in the first testcase below, the atomic constraints of #1
and #2 are equivalent according to cp_tree_equal, but according to
find_template_parameters the former uses T and the latter uses both T
and U (surprisingly).

We could fix this assumption violation by setting comparing_specializations
in the atom_hasher, which would make cp_tree_equal return false for the
two atoms, but that seems overly pessimistic here.  Ideally the atoms
should continue being considered equivalent and we instead fix
find_template_parameters to return just T for #2's atom.

To that end this patch makes for_each_template_parm_r stop walking the
DECL_CONTEXT of a PARM_DECL.  This should be safe to do because
tsubst_copy / tsubst_decl only substitutes the TREE_TYPE of a PARM_DECL
and doesn't bother substituting the DECL_CONTEXT, thus the only relevant
template parameters are those used in its type.  any_template_parm_r is
currently responsible for walking its TREE_TYPE, but I suppose it now makes
sense for for_each_template_parm_r to do so instead.

In passing this patch also makes for_each_template_parm_r stop walking
the DECL_CONTEXT of a VAR_/FUNCTION_DECL since doing so after walking
DECL_TI_ARGS is redundant, I think.

I experimented with not walking DECL_CONTEXT for CONST_DECL, but the
second testcase below demonstrates it's necessary to walk it.

	PR c++/105797

gcc/cp/ChangeLog:

	* pt.cc (for_each_template_parm_r) <case FUNCTION_DECL, VAR_DECL>:
	Don't walk DECL_CONTEXT.
	<case PARM_DECL>: Likewise.  Walk TREE_TYPE.
	<case CONST_DECL>: Simplify.
	(any_template_parm_r) <case PARM_DECL>: Don't walk TREE_TYPE.

gcc/testsuite/ChangeLog:

	* g++.dg/cpp2a/concepts-decltype4.C: New test.
	* g++.dg/cpp2a/concepts-memfun3.C: New test.
jcmvbkbc pushed a commit that referenced this issue Jul 18, 2022
This patch implements C++23 P2255R2, which adds two new type traits to
detect reference binding to a temporary.  They can be used to detect code
like

  std::tuple<const std::string&> t("meow");

which is incorrect because it always creates a dangling reference, because
the std::string temporary is created inside the selected constructor of
std::tuple, and not outside it.

There are two new compiler builtins, __reference_constructs_from_temporary
and __reference_converts_from_temporary.  The former is used to simulate
direct- and the latter copy-initialization context.  But I had a hard time
finding a test where there's actually a difference.  Under DR 2267, both
of these are invalid:

  struct A { } a;
  struct B { explicit B(const A&); };
  const B &b1{a};
  const B &b2(a);

so I had to peruse [over.match.ref], and eventually realized that the
difference can be seen here:

  struct G {
    operator int(); // #1
    explicit operator int&&(); // #2
  };

int&& r1(G{}); // use #2 (no temporary)
int&& r2 = G{}; // use #1 (a temporary is created to be bound to int&&)

The implementation itself was rather straightforward because we already
have the conv_binds_ref_to_prvalue function.  The main function here is
ref_xes_from_temporary.
I've changed the return type of ref_conv_binds_directly to tristate, because
previously the function didn't distinguish between an invalid conversion and
one that binds to a prvalue.  Since it no longer returns a bool, I removed
the _p suffix.

The patch also adds the relevant class and variable templates to <type_traits>.

	PR c++/104477

gcc/c-family/ChangeLog:

	* c-common.cc (c_common_reswords): Add
	__reference_constructs_from_temporary and
	__reference_converts_from_temporary.
	* c-common.h (enum rid): Add RID_REF_CONSTRUCTS_FROM_TEMPORARY and
	RID_REF_CONVERTS_FROM_TEMPORARY.

gcc/cp/ChangeLog:

	* call.cc (ref_conv_binds_directly_p): Rename to ...
	(ref_conv_binds_directly): ... this.  Add a new bool parameter.  Change
	the return type to tristate.
	* constraint.cc (diagnose_trait_expr): Handle
	CPTK_REF_CONSTRUCTS_FROM_TEMPORARY and CPTK_REF_CONVERTS_FROM_TEMPORARY.
	* cp-tree.h: Include "tristate.h".
	(enum cp_trait_kind): Add CPTK_REF_CONSTRUCTS_FROM_TEMPORARY
	and CPTK_REF_CONVERTS_FROM_TEMPORARY.
	(ref_conv_binds_directly_p): Rename to ...
	(ref_conv_binds_directly): ... this.
	(ref_xes_from_temporary): Declare.
	* cxx-pretty-print.cc (pp_cxx_trait_expression): Handle
	CPTK_REF_CONSTRUCTS_FROM_TEMPORARY and CPTK_REF_CONVERTS_FROM_TEMPORARY.
	* method.cc (ref_xes_from_temporary): New.
	* parser.cc (cp_parser_primary_expression): Handle
	RID_REF_CONSTRUCTS_FROM_TEMPORARY and RID_REF_CONVERTS_FROM_TEMPORARY.
	(cp_parser_trait_expr): Likewise.
	(warn_for_range_copy): Adjust to call ref_conv_binds_directly.
	* semantics.cc (trait_expr_value): Handle
	CPTK_REF_CONSTRUCTS_FROM_TEMPORARY and CPTK_REF_CONVERTS_FROM_TEMPORARY.
	(finish_trait_expr): Likewise.

libstdc++-v3/ChangeLog:

	* include/std/type_traits (reference_constructs_from_temporary,
	reference_converts_from_temporary): New class templates.
	(reference_constructs_from_temporary_v,
	reference_converts_from_temporary_v): New variable templates.
	(__cpp_lib_reference_from_temporary): Define for C++23.
	* include/std/version (__cpp_lib_reference_from_temporary): Define for
	C++23.
	* testsuite/20_util/variable_templates_for_traits.cc: Test
	reference_constructs_from_temporary_v and
	reference_converts_from_temporary_v.
	* testsuite/20_util/reference_from_temporary/value.cc: New test.
	* testsuite/20_util/reference_from_temporary/value2.cc: New test.
	* testsuite/20_util/reference_from_temporary/version.cc: New test.

gcc/testsuite/ChangeLog:

	* g++.dg/ext/reference_constructs_from_temporary1.C: New test.
	* g++.dg/ext/reference_converts_from_temporary1.C: New test.
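
A hedged usage sketch of the new library traits (names as added above; requires a C++23 library that defines __cpp_lib_reference_from_temporary):

    #include <string>
    #include <type_traits>

    // Binding const std::string& to a const char* materializes a std::string
    // temporary, so the trait is true; binding it to an existing std::string
    // lvalue does not, so the trait is false.
    static_assert(std::reference_constructs_from_temporary_v<const std::string&, const char*>);
    static_assert(!std::reference_constructs_from_temporary_v<const std::string&, std::string&>);

    int main() { }
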
jcmvbkbc pushed a commit that referenced this issue Aug 16, 2022
This patch implements some additional zero-extension and sign-extension
related optimizations in simplify-rtx.cc.  The original motivation comes
from PR rtl-optimization/71775, where in comment #2 Andrew Pinksi sees:

Failed to match this instruction:
(set (reg:DI 88 [ _1 ])
    (sign_extend:DI (subreg:SI (ctz:DI (reg/v:DI 86 [ x ])) 0)))

On many platforms the result of DImode CTZ is constrained to be a
small unsigned integer (between 0 and 64), hence the truncation to
32-bits (using a SUBREG) and the following sign extension back to
64-bits are effectively a no-op, so the above should ideally (often)
be simplified to "(set (reg:DI 88) (ctz:DI (reg/v:DI 86 [ x ]))".

To implement this, and some closely related transformations, we build
upon the existing val_signbit_known_clear_p predicate.  In the first
chunk, nonzero_bits knows that FFS and ABS can't leave the sign bit
set, so the simplification of ABS (ABS (x)) and ABS (FFS (x))
can itself be simplified.  The second transformation is that we can
canonicalize SIGN_EXTEND to ZERO_EXTEND (as in the PR 71775 case above)
when the operand's sign-bit is known to be clear.  The final two chunks
are for SIGN_EXTEND of a truncating SUBREG, and ZERO_EXTEND of a
truncating SUBREG respectively.  The nonzero_bits of a truncating
SUBREG pessimistically thinks that the upper bits may have an
arbitrary value (by taking the SUBREG), so we need look deeper at the
SUBREG's operand to confirm that the high bits are known to be zero.

Unfortunately, for PR rtl-optimization/71775, ctz:DI on x86_64 with
default architecture options is undefined at zero, so we can't be sure
the upper bits of reg:DI 88 will be sign extended (all zeros or all ones).
nonzero_bits knows this, so the above transformations don't trigger,
but the transformations themselves are perfectly valid for other
operations such as FFS, POPCOUNT and PARITY, and on other targets/-march
settings where CTZ is defined at zero.

2022-08-03  Roger Sayle  <roger@nextmovesoftware.com>
	    Segher Boessenkool  <segher@kernel.crashing.org>
	    Richard Sandiford  <richard.sandiford@arm.com>

gcc/ChangeLog
	* simplify-rtx.cc (simplify_unary_operation_1) <ABS>: Add
	optimizations for CLRSB, PARITY, POPCOUNT, SS_ABS and LSHIFTRT
	that are all positive to complement the existing FFS and
	idempotent ABS simplifications.
	<SIGN_EXTEND>: Canonicalize SIGN_EXTEND to ZERO_EXTEND when
	val_signbit_known_clear_p is true of the operand.
	Simplify sign extensions of SUBREG truncations of operands
	that are already suitably (zero) extended.
	<ZERO_EXTEND>: Simplify zero extensions of SUBREG truncations
	of operands that are already suitably zero extended.
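
A hedged source-level illustration of the kind of code behind the PR 71775 RTL quoted above (the actual codegen depends on the target and -march, since CTZ must be defined at zero for the fold to apply):

    // The ctz result is at most 63, so truncating it to 32 bits and then
    // sign-extending back to 64 bits is effectively a no-op that the new
    // simplify-rtx rules can remove.
    long long f (unsigned long long x)
    {
      int t = __builtin_ctzll (x);  // undefined for x == 0 on some targets
      return t;                     // sign-extends int back to 64 bits
    }
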
jcmvbkbc pushed a commit that referenced this issue Aug 17, 2022
In my previous patches I've been extending our std::move warnings,
but this tweak actually dials it down a little bit.  As reported in
bug 89780, it's questionable to warn about expressions in templates
that were type-dependent, but aren't anymore because we're instantiating
the template.  As in,

  template <typename T>
  Dest withMove() {
    T x;
    return std::move(x);
  }

  template Dest withMove<Dest>(); // #1
  template Dest withMove<Source>(); // #2

Saying that the std::move is pessimizing for #1 is not incorrect, but
it's not useful, because removing the std::move would then pessimize #2.
So the user can't really win.  At the same time, disabling the warning
just because we're in a template would be going too far, I still want to
warn for

  template <typename>
  Dest withMove() {
    Dest x;
    return std::move(x);
  }

because the std::move therein will be pessimizing for any instantiation.

So I'm using the suppress_warning machinery to that effect.
Problem: I had to add a new group to nowarn_spec_t, otherwise
suppressing the -Wpessimizing-move warning would disable a whole bunch
of other warnings, which we really don't want.

	PR c++/89780

gcc/cp/ChangeLog:

	* pt.cc (tsubst_copy_and_build) <case CALL_EXPR>: Maybe suppress
	-Wpessimizing-move.
	* typeck.cc (maybe_warn_pessimizing_move): Don't issue warnings
	if they are suppressed.
	(check_return_expr): Disable -Wpessimizing-move when returning
	a dependent expression.

gcc/ChangeLog:

	* diagnostic-spec.cc (nowarn_spec_t::nowarn_spec_t): Handle
	OPT_Wpessimizing_move and OPT_Wredundant_move.
	* diagnostic-spec.h (nowarn_spec_t): Add NW_REDUNDANT enumerator.

gcc/testsuite/ChangeLog:

	* g++.dg/cpp0x/Wpessimizing-move3.C: Remove dg-warning.
	* g++.dg/cpp0x/Wredundant-move2.C: Likewise.
jcmvbkbc pushed a commit that referenced this issue Nov 8, 2022
The eliminate reg-reg move by inverting the condition of
a cmove #2 peephole2 converts the following sequence:

  473: bx:DI=[r14:DI*0x8+r12:DI]
  960: r15:DI=r8:DI
  485: {flags:CCC=cmp(r15:DI+bx:DI,bx:DI);r15:DI=r15:DI+bx:DI;}
  737: r15:DI={(geu(flags:CCC,0))?r15:DI:bx:DI}

to:

 1110: {flags:CCC=cmp(r8:DI+bx:DI,bx:DI);r8:DI=r8:DI+bx:DI;}
 1111: r15:DI=[r14:DI*0x8+r12:DI]
 1112: r15:DI={(geu(flags:CCC,0))?r8:DI:r15:DI}

Please note that (insn 1110) uses register BX, but its
initialization was eliminated.

Avoid the conversion if the eliminated move initialized a register used
in the moved instruction.

2022-11-03  Uroš Bizjak  <ubizjak@gmail.com>

gcc/ChangeLog:

	PR target/107404
	* config/i386/i386.md (eliminate reg-reg move by inverting the
	condition of a cmove #2 peephole2): Check if eliminated move
	initialized a register, used in the moved instruction.

gcc/testsuite/ChangeLog:

	PR target/107404
	* g++.target/i386/pr107404.C: New test.
jcmvbkbc pushed a commit that referenced this issue Jan 14, 2023
While looking at PR 105549, which is about fixing the ABI break
introduced in GCC 9.1 in parameter alignment with bit-fields, we
noticed that the GCC 9.1 warning is not emitted in all the cases where
it should be.  This patch fixes that and the next patch in the series
fixes the GCC 9.1 break.

We split this into two patches since patch #2 introduces a new ABI
break starting with GCC 13.1.  This way, patch #1 can be back-ported
to release branches if needed to fix the GCC 9.1 warning issue.

The main idea is to add a new global boolean that indicates whether
we're expanding the start of a function, so that aarch64_layout_arg
can emit warnings for callees as well as callers.  This removes the
need for aarch64_function_arg_boundary to warn (with its incomplete
information).  However, in the first patch there are still cases where
we emit warnings where we should not; this is fixed in patch #2 where
we can distinguish between GCC 9.1 and GCC 13.1 ABI breaks properly.

The fix in aarch64_function_arg_boundary (replacing & with &&) looks
like an oversight of a previous commit in this area which changed
'abi_break' from a boolean to an integer.

We also take the opportunity to fix the comment above
aarch64_function_arg_alignment since the value of the abi_break
parameter was changed in a previous commit, no longer matching the
description.

2022-11-28  Christophe Lyon  <christophe.lyon@arm.com>
	    Richard Sandiford  <richard.sandiford@arm.com>

gcc/ChangeLog:

	* config/aarch64/aarch64.cc (aarch64_function_arg_alignment): Fix
	comment.
	(aarch64_layout_arg): Factorize warning conditions.
	(aarch64_function_arg_boundary): Fix typo.
	* function.cc (currently_expanding_function_start): New variable.
	(expand_function_start): Handle
	currently_expanding_function_start.
	* function.h (currently_expanding_function_start): Declare.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/bitfield-abi-warning-align16-O2.c: New test.
	* gcc.target/aarch64/bitfield-abi-warning-align16-O2-extra.c: New
	test.
	* gcc.target/aarch64/bitfield-abi-warning-align32-O2.c: New test.
	* gcc.target/aarch64/bitfield-abi-warning-align32-O2-extra.c: New
	test.
	* gcc.target/aarch64/bitfield-abi-warning-align8-O2.c: New test.
	* gcc.target/aarch64/bitfield-abi-warning.h: New test.
	* g++.target/aarch64/bitfield-abi-warning-align16-O2.C: New test.
	* g++.target/aarch64/bitfield-abi-warning-align16-O2-extra.C: New
	test.
	* g++.target/aarch64/bitfield-abi-warning-align32-O2.C: New test.
	* g++.target/aarch64/bitfield-abi-warning-align32-O2-extra.C: New
	test.
	* g++.target/aarch64/bitfield-abi-warning-align8-O2.C: New test.
	* g++.target/aarch64/bitfield-abi-warning.h: New test.
jcmvbkbc pushed a commit that referenced this issue Feb 13, 2023
Here the ahead-of-time overload set pruning in finish_call_expr is
unintentionally returning a CALL_EXPR whose (pruned) callee is wrapped
in an ADDR_EXPR, despite the original callee not being wrapped in an
ADDR_EXPR.  This ends up causing a bogus declaration mismatch error in
the below testcase because the call to min in #1 gets expressed as a
CALL_EXPR of ADDR_EXPR of FUNCTION_DECL, whereas the level-lowered call
to min in #2 gets expressed instead as a CALL_EXPR of FUNCTION_DECL.

This patch fixes this by stripping the spurious ADDR_EXPR appropriately.
Thus the first call to min now also gets expressed as a CALL_EXPR of
FUNCTION_DECL, matching the behavior before r12-6075-g2decd2cabe5a4f.

	PR c++/107461

gcc/cp/ChangeLog:

	* semantics.cc (finish_call_expr): Strip ADDR_EXPR from
	the selected callee during overload set pruning.

gcc/testsuite/ChangeLog:

	* g++.dg/template/call9.C: New test.
jcmvbkbc pushed a commit that referenced this issue Feb 13, 2023
After r13-5684-g59e0376f607805 the (pruned) callee of a non-dependent
CALL_EXPR is a bare FUNCTION_DECL rather than ADDR_EXPR of FUNCTION_DECL.
This innocent change revealed that cp_tree_equal doesn't first check
dependence of a CALL_EXPR before treating a FUNCTION_DECL callee as a
dependent name, which leads to us incorrectly accepting the first two
testcases below and rejecting the third:

 * In the first testcase, cp_tree_equal incorrectly returns true for
   the two non-dependent CALL_EXPRs f(0) and f(0) (whose CALL_EXPR_FN
   are different FUNCTION_DECLs) which causes us to treat #2 as a
   redeclaration of #1.

 * Same issue in the second testcase, for f<int*>() and f<char>().

 * In the third testcase, cp_tree_equal incorrectly returns true for
   f<int>() and f<void(*)(int)>() which causes us to conflate the two
   dependent specializations A<decltype(f<int>()(U()))> and
   A<decltype(f<void(*)(int)>()(U()))>.

This patch fixes this by making called_fns_equal treat two callees as
dependent names only if the overall CALL_EXPRs are dependent, via a new
convenience function call_expr_dependent_name that is like dependent_name
but also checks dependence of the overall CALL_EXPR.

	PR c++/107461

gcc/cp/ChangeLog:

	* cp-tree.h (call_expr_dependent_name): Declare.
	* pt.cc (iterative_hash_template_arg) <case CALL_EXPR>: Use
	call_expr_dependent_name instead of dependent_name.
	* tree.cc (call_expr_dependent_name): Define.
	(called_fns_equal): Adjust to take two CALL_EXPRs instead of
	CALL_EXPR_FNs thereof.  Use call_expr_dependent_name instead
	of dependent_name.
	(cp_tree_equal) <case CALL_EXPR>: Adjust call to called_fns_equal.

gcc/testsuite/ChangeLog:

	* g++.dg/cpp0x/overload5.C: New test.
	* g++.dg/cpp0x/overload5a.C: New test.
	* g++.dg/cpp0x/overload6.C: New test.
jcmvbkbc pushed a commit that referenced this issue May 8, 2023
… in asm in different mode

See gcc.c-torture/execute/20030222-1.c.  Consider the code for 32-bit (e.g. BE) target:
  int i, v; long x; x = v; asm ("" : "=r" (i) : "0" (x));
We generate the following RTL with reload insns:
  1. subreg:si(x:di, 0) = 0;
  2. subreg:si(x:di, 4) = v:si;
  3. t:di = x:di, dead x;
  4. asm ("" : "=r" (subreg:si(t:di,4)) : "0" (t:di))
  5. i:si = subreg:si(t:di,4);
If we assign hard reg of x to t, dead code elimination will remove insn #2
and we will use an uninitialized hard reg.  So exclude the hard reg of x for t.
We could ignore this problem for a non-empty asm using all of x's value, but it is hard to
check that the asm is expanded into insns really using x and setting r.
The old reload pass used the same approach.

gcc/ChangeLog

	* lra-constraints.cc (match_reload): Exclude some hard regs for
	multi-reg inout reload pseudos used in asm in different mode.
jcmvbkbc pushed a commit that referenced this issue May 8, 2023
Currently on xstormy16 SImode shifts by a single bit require two
instructions, and shifts by other non-zero integer immediate constants
require five instructions.  This patch implements the obvious optimization
that shifts by two bits can be done in four instructions, by using two
single-bit sequences.

Hence, ashift_2 was previously generated as:
        mov r7,r2 | shl r2,#2 | shl r3,#2 | shr r7,#14 | or r3,r7
        ret
and with this patch we now generate:
        shl r2,#1 | rlc r3,#1 | shl r2,#1 | rlc r3,#1
        ret

2023-04-23  Roger Sayle  <roger@nextmovesoftware.com>

gcc/ChangeLog
	* config/stormy16/stormy16.cc (xstormy16_output_shift): Implement
	SImode shifts by two by performing a single bit SImode shift twice.

gcc/testsuite/ChangeLog
	* gcc.target/xstormy16/shiftsi.c: New test case.
jcmvbkbc pushed a commit that referenced this issue May 8, 2023
I noticed that for member class templates of a class template we were
unnecessarily substituting both the template and its type.  Avoiding that
duplication speeds compilation of this silly testcase from ~12s to ~9s on my
laptop.  It's unlikely to make a difference on any real code, but the
simplification is also nice.

We still need to clear CLASSTYPE_USE_TEMPLATE on the partial instantiation
of the template class, but it makes more sense to do that in
tsubst_template_decl anyway.

  #define NC(X)					\
    template <class U> struct X##1;		\
    template <class U> struct X##2;		\
    template <class U> struct X##3;		\
    template <class U> struct X##4;		\
    template <class U> struct X##5;		\
    template <class U> struct X##6;
  #define NC2(X) NC(X##a) NC(X##b) NC(X##c) NC(X##d) NC(X##e) NC(X##f)
  #define NC3(X) NC2(X##A) NC2(X##B) NC2(X##C) NC2(X##D) NC2(X##E)
  template <int I> struct A
  {
    NC3(am)
  };
  template <class...Ts> void sink(Ts...);
  template <int...Is> void g()
  {
    sink(A<Is>()...);
  }
  template <int I> void f()
  {
    g<__integer_pack(I)...>();
  }
  int main()
  {
    f<1000>();
  }

gcc/cp/ChangeLog:

	* pt.cc (instantiate_class_template): Skip the RECORD_TYPE
	of a class template.
	(tsubst_template_decl): Clear CLASSTYPE_USE_TEMPLATE.
jcmvbkbc pushed a commit that referenced this issue May 22, 2023
…i parts [2]

[part #2 of PR/109279]

SPEC2017 deepsjeng uses large constants which currently generates less than
ideal code. This fix improves codegen for large constants which have
same low and hi parts: e.g.

	long long f(void) { return 0x0101010101010101ull; }

Before
        li      a5,0x1010000
        addi    a5,a5,0x101
        mv      a0,a5
        slli    a5,a5,32
        add     a0,a5,a0
        ret

With patch
	li	a5,0x1010000
	addi	a5,a5,0x101
	slli	a0,a5,32
	add	a0,a0,a5
	ret

This is testsuite clean.

gcc/ChangeLog:

	* config/riscv/riscv.cc (riscv_split_integer): if loval is equal
	to hival, ASHIFT the corresponding regs.

Signed-off-by: Vineet Gupta <vineetg@rivosinc.com>
jcmvbkbc pushed a commit that referenced this issue May 22, 2023
This patch decreases one machine instruction from "single bit extraction
with shifting" operation, and tries to eliminate the conditional
branch if CST2_POW2 doesn't fit into signed 12 bits with the help
of ifcvt optimization.

    /* example #1 */
    int test0(int x) {
      return (x & 1048576) != 0 ? 1024 : 0;
    }
    extern int foo(void);
    int test1(void) {
      return (foo() & 1048576) != 0 ? 16777216 : 0;
    }

    ;; before
    test0:
	movi	a9, 0x400
	srai	a2, a2, 10
	and	a2, a2, a9
	ret.n
    test1:
	addi	sp, sp, -16
	s32i.n	a0, sp, 12
	call0	foo
	extui	a2, a2, 20, 1
	slli	a2, a2, 20
	beqz.n	a2, .L2
	movi.n	a2, 1
	slli	a2, a2, 24
    .L2:
	l32i.n	a0, sp, 12
	addi	sp, sp, 16
	ret.n

    ;; after
    test0:
	extui	a2, a2, 20, 1
	slli	a2, a2, 10
	ret.n
    test1:
	addi	sp, sp, -16
	s32i.n	a0, sp, 12
	call0	foo
	l32i.n	a0, sp, 12
	extui	a2, a2, 20, 1
	slli	a2, a2, 24
	addi	sp, sp, 16
	ret.n

In addition, if the left shift amount ('exact_log2(CST2_POW2)') is
between 1 and 3 and either an addition or subtraction with another
register follows, emit an ADDX[248] or SUBX[248] machine instruction
instead of separate left shift and add/subtract ones.

    /* example #2 */
    int test2(int x, int y) {
      return ((x & 1048576) != 0 ? 4 : 0) + y;
    }
    int test3(int x, int y) {
      return ((x & 2) != 0 ? 8 : 0) - y;
    }

    ;; before
    test2:
	movi.n	a9, 4
	srai	a2, a2, 18
	and	a2, a2, a9
	add.n	a2, a2, a3
	ret.n
    test3:
	movi.n	a9, 8
	slli	a2, a2, 2
	and	a2, a2, a9
	sub	a2, a2, a3
	ret.n

    ;; after
    test2:
	extui	a2, a2, 20, 1
	addx4	a2, a2, a3
	ret.n
    test3:
	extui	a2, a2, 1, 1
	subx8	a2, a2, a3
	ret.n

gcc/ChangeLog:

	* config/xtensa/predicates.md (addsub_operator): New.
	* config/xtensa/xtensa.md (*extzvsi-1bit_ashlsi3,
	*extzvsi-1bit_addsubx): New insn_and_split patterns.
	* config/xtensa/xtensa.cc (xtensa_rtx_costs):
	Add a special case about ifcvt 'noce_try_cmove()' to handle
	constant loads that do not fit into signed 12 bits in the
	patterns added above.
jcmvbkbc pushed a commit that referenced this issue Jun 5, 2023
…n S[IF]mode

This patch optimizes the boolean evaluation of EQ/NE against zero
by adding two insn_and_split patterns similar to SImode conditional
store:

"eq_zero":
	op0 = (op1 == 0) ? 1 : 0;
	op0 = clz(op1) >> 5;  /* optimized (requires TARGET_NSA) */

"movsicc_ne0_reg_0":
	op0 = (op1 != 0) ? op2 : 0;
	op0 = op2; if (op1 == 0) ? op0 = op1;  /* optimized */

    /* example #1 */
    int bool_eqSI(int x) {
      return x == 0;
    }
    int bool_neSI(int x) {
      return x != 0;
    }

    ;; after (TARGET_NSA)
    bool_eqSI:
	nsau	a2, a2
	srli	a2, a2, 5
	ret.n
    bool_neSI:
	mov.n	a9, a2
	movi.n	a2, 1
	moveqz	a2, a9, a9
	ret.n

These also work in SFmode by ignoring their sign bits, and further-
more, the branch if EQ/NE against zero in SFmode is also done in the
same manner.

The reasons for this optimization in SFmode are:

  - Only zero values (negative or non-negative) contain no bits of 1
    with both the exponent and the mantissa.
  - EQ/NE comparisons involving NaNs produce no signal even if they
    are signaling.
  - Even if the use of IEEE 754 single-precision floating-point co-
    processor is configured (TARGET_HARD_FLOAT is true):
	1. Load zero value to FP register
        2. Possibly, additional FP move if the comparison target is
	   an address register
	3. FP equality check instruction
	4. Read the boolean register containing the result, or condi-
	   tional branch
    As noted above, a considerable number of instructions are still
    generated.

    /* example #2 */
    int bool_eqSF(float x) {
      return x == 0;
    }
    int bool_neSF(float x) {
      return x != 0;
    }
    int bool_ltSF(float x) {
      return x < 0;
    }
    extern void foo(void);
    void cb_eqSF(float x) {
      if(x != 0)
        foo();
    }
    void cb_neSF(float x) {
      if(x == 0)
        foo();
    }
    void cb_geSF(float x) {
      if(x < 0)
        foo();
    }

    ;; after
    ;; (TARGET_NSA, TARGET_BOOLEANS and TARGET_HARD_FLOAT)
    bool_eqSF:
	add.n	a2, a2, a2
	nsau	a2, a2
	srli	a2, a2, 5
	ret.n
    bool_neSF:
	add.n	a9, a2, a2
	movi.n	a2, 1
	moveqz	a2, a9, a9
	ret.n
    bool_ltSF:
	movi.n	a9, 0
	wfr	f0, a2
	wfr	f1, a9
	olt.s	b0, f0, f1
	movi.n	a9, 0
	movi.n	a2, 1
	movf	a2, a9, b0
	ret.n
    cb_eqSF:
	add.n	a2, a2, a2
	beqz.n	a2, .L6
	j.l	foo, a9
    .L6:
	ret.n
    cb_neSF:
	add.n	a2, a2, a2
	bnez.n	a2, .L8
	j.l	foo, a9
    .L8:
	ret.n
    cb_geSF:
	addi	sp, sp, -16
	movi.n	a3, 0
	s32i.n	a12, sp, 8
	s32i.n	a0, sp, 12
	mov.n	a12, a2
	call0	__unordsf2
	bnez.n	a2, .L10
	movi.n	a3, 0
	mov.n	a2, a12
	call0	__gesf2
	bnei	a2, -1, .L10
	l32i.n	a0, sp, 12
	l32i.n	a12, sp, 8
	addi	sp, sp, 16
	j.l	foo, a9
    .L10:
	l32i.n	a0, sp, 12
	l32i.n	a12, sp, 8
	addi	sp, sp, 16
	ret.n

gcc/ChangeLog:

	* config/xtensa/predicates.md (const_float_0_operand):
	Rename from obsolete "const_float_1_operand" and change the
	constant to compare.
	(cstoresf_cbranchsf_operand, cstoresf_cbranchsf_operator):
	New.
	* config/xtensa/xtensa.cc (xtensa_expand_conditional_branch):
	Add code for EQ/NE comparison with constant zero in SFmode.
	(xtensa_expand_scc): Added code to derive boolean evaluation
	of EQ/NE with constant zero for comparison in SFmode.
	(xtensa_rtx_costs): Change cost of CONST_DOUBLE with value
	zero inside "cbranchsf4" to 0.
	* config/xtensa/xtensa.md (cbranchsf4, cstoresf4):
	Change "match_operator" and the third "match_operand" to the
	ones mentioned above.
	(movsicc_ne0_reg_zero, eq_zero): New.
jcmvbkbc pushed a commit that referenced this issue Sep 5, 2023
This patch is inspired by Jakub's work on PR rtl-optimization/110717.
The bitfield example described in comment #2, looks like:

struct S { __int128 a : 69; };
unsigned type bar (struct S *p) {
  return p->a;
}

which on x86_64 with -O2 currently generates:

bar:    movzbl  8(%rdi), %ecx
        movq    (%rdi), %rax
        andl    $31, %ecx
        movq    %rcx, %rdx
        salq    $59, %rdx
        sarq    $59, %rdx
        ret

The ANDL $31 is interesting... we first extract an unsigned 69-bit
bitfield by masking/clearing the top bits of the most significant word,
and then it gets sign-extended by left shifting and arithmetic right
shifting.  Obviously, this bit-wise AND is redundant: for signed
bit-fields we don't require these bits to be cleared if we're about to
set them appropriately.

This patch eliminates this redundancy in the middle-end, during RTL
expansion, by extending the extract_bit_field APIs so that the integer
UNSIGNEDP argument takes a special value: 0 indicates the field should
be sign extended, 1 (any non-zero value) indicates the field should be
zero extended, and -1 indicates a third option, that we don't care how
or whether the field is extended.  By passing and checking this
sentinel value at the appropriate places we avoid the useless bit
masking (on all targets).
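
As a sketch of the arithmetic involved (an illustration with a
hypothetical helper name, not code from the patch): the upper 5 bits
of the 69-bit field live in the most significant 64-bit word, and the
left/arithmetic-right shift pair rewrites bits 5..63 regardless, so a
preceding "& 31" mask cannot affect the result.

    #include <stdint.h>

    /* hypothetical helper: sign-extend the 5 field bits held in the
       low bits of the most significant word (69 - 64 = 5, 64 - 5 = 59);
       relies on GCC's arithmetic right shift of signed values */
    int64_t sign_extend_msw (uint64_t msw)
    {
      return (int64_t) (msw << 59) >> 59;
    }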

For the test case above, with this patch we now generate:

bar:    movzbl  8(%rdi), %ecx
        movq    (%rdi), %rax
        movq    %rcx, %rdx
        salq    $59, %rdx
        sarq    $59, %rdx
        ret

2023-08-04  Roger Sayle  <roger@nextmovesoftware.com>

gcc/ChangeLog
	* expmed.cc (extract_bit_field_1): Document that an UNSIGNEDP
	value of -1 is equivalent to don't care.
	(extract_integral_bit_field): Indicate that we don't require
	the most significant word to be zero extended, if we're about
	to sign extend it.
	(extract_fixed_bit_field_1): Document that an UNSIGNEDP value
	of -1 is equivalent to don't care.  Don't clear the most
	significant bits with AND mask when UNSIGNEDP is -1.

gcc/testsuite/ChangeLog
	* gcc.target/i386/pr110717-2.c: New test case.
jcmvbkbc pushed a commit that referenced this issue Nov 11, 2023
Given the example below for a VLS mode:

void
test (vl_t *u)
{
  vl_t t;
  long long *p = (long long *)&t;

  p[0] = p[1] = 2;

  *u = t;
}
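
For reference, vl_t is presumably a fixed-length (VLS) vector type; a
hypothetical definition consistent with the V2DI mode in the RTL dump
below would be:

    /* hypothetical: a 128-bit fixed-length vector of two 64-bit lanes,
       i.e. V2DI; the actual test uses the type from the RVV VLS tests */
    typedef long long vl_t __attribute__ ((vector_size (16)));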

The vec_set pattern will simplify the insn to vmv.s.x when the index is
0, without a merged operand.  That will result in problems in DCE, as
shown below:

1:  137[DI] = a0
2:  138[V2DI] = 134[V2DI]                              // deleted by DCE
3:  139[DI] = #2                                       // deleted by DCE
4:  140[DI] = #2                                       // deleted by DCE
5:  141[V2DI] = vec_dup:V2DI (139[DI])                 // deleted by DCE
6:  138[V2DI] = vslideup_imm (138[V2DI], 141[V2DI], 1) // deleted by DCE
7:  135[V2DI] = 138[V2DI]                              // deleted by DCE
8:  142[V2DI] = 135[V2DI]                              // deleted by DCE
9:  143[DI] = #2
10: 142[V2DI] = vec_dup:V2DI (143[DI])
11: (137[DI]) = 142[V2DI]

The upper 64 bits of 142[V2DI] are unknown here, which generates
incorrect code when the value is stored back to memory.  This patch
fixes the issue by adding a new SCALAR_MOVE_MERGED_OP for vec_set.

Please note this patch doesn't enable VLS modes for vec_set; the
follow-up patches will support this soon.

gcc/ChangeLog:

	* config/riscv/autovec.md: Bugfix.
	* config/riscv/riscv-protos.h (SCALAR_MOVE_MERGED_OP): New enum.

gcc/testsuite/ChangeLog:

	* gcc.target/riscv/rvv/base/scalar-move-merged-run-1.c: New test.

Signed-off-by: Pan Li <pan2.li@intel.com>
jcmvbkbc pushed a commit that referenced this issue Dec 31, 2023
Improve stack protector patterns and peephole2s even more:

a. Use unrelated register clears with integer mode size <= word
   mode size to clear stack protector scratch register.

b. Use unrelated register initializations in front of stack
   protector sequence to clear stack protector scratch register.

c. Use unrelated register initializations using LEA instructions
   to clear stack protector scratch register.

These stack protector improvements reuse 6914 unrelated register
initializations to substitute for the clear of the stack protector
scratch register in 12034 instances of the stack protector sequence in
a recent Linux defconfig build.
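
As a hedged illustration (not taken from the patch or the kernel
build), a function like the following contains the kind of stack
protector sequence these peephole2 patterns rewrite:

    /* illustration only: with -fstack-protector-strong the local
       buffer forces a canary store/compare sequence; the peephole2
       patterns reuse a nearby unrelated register initialization
       instead of emitting a separate clear of the scratch register
       that wipes the canary copy */
    #include <string.h>

    unsigned char first_byte (const char *s)
    {
      char buf[32];
      strncpy (buf, s, sizeof buf - 1);
      buf[sizeof buf - 1] = '\0';
      return (unsigned char) buf[0];
    }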

gcc/ChangeLog:

	* config/i386/i386.md (@stack_protect_set_1_<PTR:mode>_<W:mode>):
	Use W mode iterator instead of SWI48.  Output MOV instead of XOR
	for TARGET_USE_MOV0.
	(stack_protect_set_1 peephole2): Use integer modes with
	mode size <= word mode size for operand 3.
	(stack_protect_set_1 peephole2 #2): New peephole2 pattern to
	substitute stack protector scratch register clear with unrelated
	register initialization, originally in front of stack
	protector sequence.
	(*stack_protect_set_3_<PTR:mode>_<SWI48:mode>): New insn pattern.
	(stack_protect_set_1 peephole2): New peephole2 pattern to
	substitute stack protector scratch register clear with unrelated
	register initialization involving LEA instruction.
jcmvbkbc pushed a commit that referenced this issue Dec 31, 2023
Use unrelated register initializations using zero/sign-extend instructions
to clear stack protector scratch register.

Handle only SI -> DImode extensions for 64-bit targets, as this is the
only extension that triggers the peephole in a non-negligible number
of cases.

Also use explicit check for word_mode instead of mode iterator in peephole2
patterns to avoid pattern explosion.

gcc/ChangeLog:

	* config/i386/i386.md (stack_protect_set_1 peephole2):
	Explicitly check operand 2 for word_mode.
	(stack_protect_set_1 peephole2 #2): Ditto.
	(stack_protect_set_2 peephole2): Ditto.
	(stack_protect_set_3 peephole2): Ditto.
	(*stack_protect_set_4z_<mode>_di): New insn pattern.
	(*stack_protect_set_4s_<mode>_di): Ditto.
	(stack_protect_set_4 peephole2): New peephole2 pattern to
	substitute stack protector scratch register clear with unrelated
	register initialization involving zero/sign-extend instruction.
jcmvbkbc pushed a commit that referenced this issue Dec 31, 2023
Since the last import from upstream libsanitizer, the output has changed
and now looks more like this:

READ of size 6 at 0x7ff7beb2a144 thread T0
    #0 0x101cf7796 in MemcmpInterceptorCommon(void*, int (*)(void const*, void const*, unsigned long), void const*, void const*, unsigned long) sanitizer_common_interceptors.inc:813
    #1 0x101cf7b99 in memcmp sanitizer_common_interceptors.inc:840
    #2 0x108a0c39f in __stack_chk_guard+0xf (dyld:x86_64+0x8039f)

so let's adjust the pattern accordingly.

gcc/testsuite/ChangeLog:

	* c-c++-common/asan/memcmp-1.c: Adjust pattern on darwin.
jcmvbkbc pushed a commit that referenced this issue Dec 31, 2023
…-int (PR target/112413)

On m68k the compiler assumes that the PC-relative jump-via-jump-table
instruction and the jump table are adjacent with no padding in between.

When -mlong-jump-table-offsets is combined with -malign-int, a 2-byte
nop may be inserted before the jump table, causing the jump to add the
fetched offset to the wrong PC base and thus jump to the wrong address.

Fixed by referencing the jump table via its label. On the test case
in the PR the object code change is (the moveal at 16 is the nop):

    a:  6536            bcss 42 <f+0x42>
    c:  e588            lsll #2,%d0
    e:  203b 0808       movel %pc@(18 <f+0x18>,%d0:l),%d0
-  12:  4efb 0802       jmp %pc@(16 <f+0x16>,%d0:l)
+  12:  4efb 0804       jmp %pc@(18 <f+0x18>,%d0:l)
   16:  284c            moveal %a4,%a4
   18:  0000 0020       orib #32,%d0
   1c:  0000 002c       orib #44,%d0
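
The PR's test case is not reproduced here; a hypothetical function of
the same shape -- a dense switch lowered to the PC-relative
jump-via-jump-table sequence above -- would be:

    /* hypothetical stand-in for the PR test case: a dense switch that
       is normally compiled to the jump table shown above */
    int f (int i)
    {
      switch (i)
        {
        case 0:  return 11;
        case 1:  return 22;
        case 2:  return 33;
        case 3:  return 44;
        case 4:  return 55;
        default: return -1;
        }
    }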

Bootstrapped and tested on m68k-linux-gnu, no regressions.

Note: I don't have commit rights, so I would need assistance applying this.

	PR target/112413
gcc/

	* config/m68k/linux.h (ASM_RETURN_CASE_JUMP): For
	TARGET_LONG_JUMP_TABLE_OFFSETS, reference the jump table
	via its label.
	* config/m68k/m68kelf.h (ASM_RETURN_CASE_JUMP): Likewise.
	* config/m68k/netbsd-elf.h (ASM_RETURN_CASE_JUMP): Likewise.
jcmvbkbc pushed a commit that referenced this issue Dec 31, 2023
During partial ordering, we want to look through dependent alias
template specializations within template arguments and otherwise
treat them as opaque in other contexts (see e.g. r7-7116-g0c942f3edab108
and r11-7011-g6e0a231a4aa240).  To that end template_args_equal was
given a partial_order flag that controls this behavior.  This flag
does the right thing when a dependent alias template specialization
appears as template argument of the partial specialization, e.g. in

  template<class T, class...> using first_t = T;
  template<class T> struct traits;
  template<class T> struct traits<first_t<T, T&>> { }; // #1
  template<class T> struct traits<first_t<const T, T&>> { }; // #2

we correctly consider #2 to be more specialized than #1.  But if the
alias specialization appears as a nested template argument of another
class template specialization, e.g. in

  template<class T> struct traits<A<first_t<T, T&>>> { }; // #1
  template<class T> struct traits<A<first_t<const T, T&>>> { }; // #2

then we incorrectly consider #1 and #2 to be unordered.  This is because

  1. we don't propagate the flag to recursive template_args_equal calls
  2. we don't use structural equality for class template specializations
     written in terms of dependent alias template specializations

This patch fixes the first issue by turning the partial_order flag into
a global.  This patch fixes the second issue by making us propagate
structural equality appropriately when building a class template
specialization.  In passing this patch also improves hashing of
specializations that use structural equality.

	PR c++/90679

gcc/cp/ChangeLog:

	* cp-tree.h (comp_template_args): Remove partial_order parameter.
	(template_args_equal): Likewise.
	* pt.cc (comparing_for_partial_ordering): New global flag.
	(iterative_hash_template_arg) <case tcc_type>: Hash the template
	and arguments for specializations that use structural equality.
	(template_args_equal): Remove partial order parameter and
	use comparing_for_partial_ordering instead.
	(comp_template_args): Likewise.
	(comp_template_args_porder): Set comparing_for_partial_ordering
	instead.  Make static.
	(any_template_arguments_need_structural_equality_p): Return true
	for an argument that's a dependent alias template specialization
	or a class template specialization that itself needs structural
	equality.
	* tree.cc (cp_tree_equal) <case TREE_VEC>: Adjust call to
	comp_template_args.

gcc/testsuite/ChangeLog:

	* g++.dg/cpp0x/alias-decl-75a.C: New test.
	* g++.dg/cpp0x/alias-decl-75b.C: New test.
jcmvbkbc pushed a commit that referenced this issue Feb 7, 2024
This patch adjusts the costs so that we treat REG and SUBREG expressions the
same for costing.

This was motivated by bt_skip_func and bt_find_func in xz and results in nearly
a 5% improvement in the dynamic instruction count for input #2, and smaller but
definitely visible improvements pretty much across the board.  Exceptions would
be perlbench input #1 and exchange2, which showed very small regressions.

In the bt_find_func and bt_skip_func cases we have  something like this:

> (insn 10 7 11 2 (set (reg/v:DI 136 [ x ])
>         (zero_extend:DI (subreg/s/u:SI (reg/v:DI 137 [ a ]) 0))) "zz.c":6:21 387 {*zero_extendsidi2_bitmanip}
>      (nil))
> (insn 11 10 12 2 (set (reg:DI 142 [ _1 ])
>         (plus:DI (reg/v:DI 136 [ x ])
>             (reg/v:DI 139 [ b ]))) "zz.c":7:23 5 {adddi3}
>      (nil))

[ ... ]
> (insn 13 12 14 2 (set (reg:DI 143 [ _2 ])
>         (plus:DI (reg/v:DI 136 [ x ])
>             (reg/v:DI 141 [ c ]))) "zz.c":8:23 5 {adddi3}
>      (nil))

Note the two uses of (reg 136). The best way to handle that in combine might be
a 3->2 split.  But there's a much better approach if we look at fwprop...

(set (reg:DI 142 [ _1 ])
    (plus:DI (zero_extend:DI (subreg/s/u:SI (reg/v:DI 137 [ a ]) 0))
        (reg/v:DI 139 [ b ])))
change not profitable (cost 4 -> cost 8)

So that should be the same cost as a regular DImode addition when the ZBA
extension is enabled.  But it ends up costing more because the clause to cost
this variant isn't prepared to handle a SUBREG.  That results in the RTL above
having too high a cost and fwprop gives up.
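
A C-level shape that produces RTL like the above (a hypothetical
reduction; the real xz routines are larger) is:

    /* hypothetical reduction: one 32-bit value zero-extended to
       64 bits and then used in two 64-bit additions, matching the
       two uses of (reg 136) above */
    unsigned long
    f (unsigned int a, unsigned long b, unsigned long c,
       unsigned long *out)
    {
      unsigned long x = a;  /* zero_extend:DI (subreg:SI ...) */
      *out = x + b;         /* first use of the extended value */
      return x + c;         /* second use */
    }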

One approach would be to replace REG_P with REG_P || SUBREG_P in the
costing code.  I ultimately decided against that and instead check
whether the operand in question passes register_operand.

By far the most important case to handle is the DImode PLUS.  But for the sake
of consistency, I changed the other instances in riscv_rtx_costs as well.  For
those other cases we're talking about improvements in the .000001% range.

While we are in stage4, this only touches cost modeling, which we've generally
agreed is still appropriate (though we were mostly talking about vector).  So
I'm going to extend that general agreement ever so slightly and include scalar
cost modeling :-)

gcc/
	* config/riscv/riscv.cc (riscv_rtx_costs): Handle SUBREG and REG
	similarly.

gcc/testsuite/

	* gcc.target/riscv/reg_subreg_costs.c: New test.

	Co-authored-by: Jivan Hakobyan <jivanhakobyan9@gmail.com>
jcmvbkbc pushed a commit that referenced this issue May 11, 2024
We evaluate constexpr functions on the original, pre-genericization bodies.
That means that the function body we're evaluating will not have gone
through cp_genericize_r's "Map block scope extern declarations to visible
declarations with the same name and type in outer scopes if any".  Here:

  constexpr bool bar() { return true; } // #1
  constexpr bool foo() {
    constexpr bool bar(void); // #2
    return bar();
  }

it means that we:
1) register_constexpr_fundef (#1)
2) cp_genericize (#1)
   nothing interesting happens
3) register_constexpr_fundef (foo)
   does copy_fn, so we have two copies of the BIND_EXPR
4) cp_genericize (foo)
   this remaps #2 to #1, but only on one copy of the BIND_EXPR
5) retrieve_constexpr_fundef (foo)
   we find it, no problem
6) retrieve_constexpr_fundef (#2)
   and here #2 isn't found in constexpr_fundef_table, because
   we're working on the BIND_EXPR copy where #2 wasn't mapped to #1
   so we fail.  We've only registered #1.

It should work to use DECL_LOCAL_DECL_ALIAS (which used to be
extern_decl_map).  We evaluate constexpr functions on pre-cp_fold
bodies to avoid diagnostic problems, but the remapping I'm proposing
should not interfere with diagnostics.

This is not a problem for a global scope redeclaration; there we go
through duplicate_decls which keeps the DECL_UID:
  DECL_UID (olddecl) = olddecl_uid;
and DECL_UID is what constexpr_fundef_hasher::hash uses.

	PR c++/111132

gcc/cp/ChangeLog:

	* constexpr.cc (get_function_named_in_call): Use
	cp_get_fndecl_from_callee.
	* cvt.cc (cp_get_fndecl_from_callee): If there's a
	DECL_LOCAL_DECL_ALIAS, use it.

gcc/testsuite/ChangeLog:

	* g++.dg/cpp0x/constexpr-redeclaration3.C: New test.
	* g++.dg/cpp0x/constexpr-redeclaration4.C: New test.
jcmvbkbc pushed a commit that referenced this issue Jun 17, 2024
Here, during overload resolution, we have two strictly viable ambiguous
candidates #1 and #2, and two non-strictly viable candidates #3 and #4,
which we have held on to ever since r14-6522.  These latter candidates
have an empty second arg conversion, since the first arg conversion was
deemed bad, and this trips up joust, which assumes all arg conversions
are present, when called on #3 and #4.

We can fix this by making joust robust to empty arg conversions, but in
this situation we shouldn't need to compare #3 and #4 at all given that
we have a strictly viable candidate.  To that end, this patch makes
tourney shortcut considering non-strictly viable candidates upon
encountering ambiguity between two strictly viable candidates (taking
advantage of the fact that the candidates list is sorted according to
viability via splice_viable).

	PR c++/115239

gcc/cp/ChangeLog:

	* call.cc (tourney): Don't consider a non-strictly viable
	candidate as the champ if there was ambiguity between two
	strictly viable candidates.

gcc/testsuite/ChangeLog:

	* g++.dg/overload/error7.C: New test.

Reviewed-by: Jason Merrill <jason@redhat.com>
jcmvbkbc pushed a commit that referenced this issue Nov 10, 2024
We currently crash on the following invalid code (notice the "void
void **" parameter):

=== cut here ===
using size_t = decltype(sizeof(int));
void *operator new(size_t, void void **p) noexcept { return p; }
int x;
void f() {
    int y;
    new (&y) int(x);
}
=== cut here ===

The problem is that in this case, we end up with a NULL_TREE parameter
list for the new operator because of the error, and (1) coerce_new_type
wrongly complains about the first parameter type not being size_t,
(2) std_placement_new_fn_p blindly accesses the parameter list, hence a
crash.

This patch does NOT address #1, since we can't easily distinguish a new
operator declaration without parameters from one with erroneous
parameters (and it's not worth the risk of refactoring and breaking
things for an error-recovery issue), hence a dg-bogus in new52.C, but
it does address #2 and the ICE by simply checking the first parameter
against NULL_TREE.

It also adds a new testcase checking that we complain about new
operators with no or invalid first parameters, since we did not have
any.

	PR c++/117101

gcc/cp/ChangeLog:

	* init.cc (std_placement_new_fn_p): Check first_arg against
	NULL_TREE.

gcc/testsuite/ChangeLog:

	* g++.dg/init/new52.C: New test.
	* g++.dg/init/new53.C: New test.
jcmvbkbc pushed a commit that referenced this issue Nov 10, 2024
The second source register of insn "*extzvsi-1bit_addsubx" cannot be the
same as the destination register, because that register will be overwritten
with an intermediate value after insn splitting.

     /* example #1 */
     int test1(int b, int a) {
       return ((a & 1024) ? 4 : 0) + b;
     }

     ;; result #1 (incorrect)
     test1:
     	extui	a2, a3, 10, 1	;; overwrites A2 before used
     	addx4	a2, a2, a2
     	ret.n

This patch fixes that.

     ;; result #1 (correct)
     test1:
     	extui	a3, a3, 10, 1	;; uses A3 and then overwrites
     	addx4	a2, a3, a2
     	ret.n

However, it should be noted that the first source register can be the same
as the destination without any problems.

     /* example #2 */
     int test2(int a, int b) {
       return ((a & 1024) ? 4 : 0) + b;
     }

     ;; result (correct)
     test2:
     	extui	a2, a2, 10, 1	;; uses A2 and then overwrites
     	addx4	a2, a2, a3
     	ret.n

gcc/ChangeLog:

	* config/xtensa/xtensa.md (*extzvsi-1bit_addsubx):
	Add '&' to the destination register constraint to indicate that
	it is 'earlyclobber', append '0' to the first source register
	constraint to indicate that it can be the same as the destination
	register, and change the split condition from 1 to reload_completed
	so that the insn will be split only after RA in order to obtain
	allocated registers that satisfy the above constraints.