gcc-xtensa appears to hardcode data alignment to 4 #2

Closed
pfalcon opened this issue Feb 4, 2015 · 7 comments

Comments

@pfalcon

pfalcon commented Feb 4, 2015

I'm investigating why a project built for xtensa produces an unexpectedly large data section. The project is built with -fdata-sections (which is common for embedded projects; the issue likely manifests itself even without it). In the map file I see:

 .rodata.rule_yield_arg
                0x000000003ffe9cfc        0x6 build/py/parse.o
 *fill*         0x000000003ffe9d02        0x2
 .rodata.rule_yield_expr
                0x000000003ffe9d04        0x6 build/py/parse.o
 *fill*         0x000000003ffe9d0a        0x2

I.e. each 6-byte structure gets aligned on a 4-byte boundary. Looking at the structures, they consist only of shorts, i.e. they have a natural alignment of 2.

The simplest testcase to reproduce the issue is:

struct foo {
    short a, b, c;
};

struct foo s1 = {1};
struct foo s2 = {2};

When built for both arm and x64, this produces the following assembly:

    .global s1
    .data
    .align  2
    .type   s1, %object
    .size   s1, 6
s1:
    .short  1
    .space  4
    .global s2
    .align  2
    .type   s2, %object
    .size   s2, 6
s2:
    .short  2
    .space  4

With xtensa-lx106-elf-gcc the result is:

    .global s1
    .data
    .align  4
    .type   s1, @object
    .size   s1, 6
s1:
    .short  1
    .zero   4
    .global s2
    .align  4
    .type   s2, @object
    .size   s2, 6
s2:
    .short  2
    .zero   4

Note the difference in ".align" directives.

The expected behavior is that a structure's alignment should be its natural alignment (defined as the maximum alignment of any of its fields). Is the current xtensa-lx106-elf-gcc behavior grounded in an Xtensa ABI requirement or something similar? Even if it is, behavior like the above is detrimental for embedded usage, where ABI issues are not relevant but losses from overzealous alignment are noticeable (for example, in the original case there are hundreds of such structures; if structures/variables are just single shorts, there's a 50% loss of space with 4-byte alignment).
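
For a quick sanity check of what "natural alignment" means here, a minimal sketch (C++11 here for brevity; the same checks work in C11 with _Alignof/_Static_assert, and the alignment of short is assumed to be 2 on these targets):

    // Minimal sketch: assert the natural alignment the report expects.
    struct foo {
        short a, b, c;
    };

    // Natural alignment is the maximum alignment of the fields, i.e. that
    // of short, and the struct needs no internal padding.
    static_assert(alignof(foo) == alignof(short), "expect short alignment");
    static_assert(sizeof(foo) == 3 * sizeof(short), "expect no padding");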

@jcmvbkbc
Owner

jcmvbkbc commented Feb 6, 2015

That matches exactly what MIPS does. I've had a look at what others do here and found no consensus. The most common approach is to use natural alignment when optimizing for size. I can implement that; will that work for you?
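
A minimal sketch of that general idea, assuming it is expressed through GCC's DATA_ALIGNMENT target macro (the actual fix may look different):

    /* Sketch only: keep the type's natural alignment under -Os, and only
       round objects up to word alignment when optimizing for speed.
       DATA_ALIGNMENT returns the alignment (in bits) used for an object
       placed in the data section; ALIGN is the type's natural alignment.  */
    #define DATA_ALIGNMENT(TYPE, ALIGN)                                  \
      (optimize_size ? (ALIGN)                                           \
       : ((ALIGN) < BITS_PER_WORD ? BITS_PER_WORD : (ALIGN)))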

@pfalcon
Author

pfalcon commented Feb 6, 2015

Yes, sure, if you think it makes sense, that would be good enough and definitely would help that project, as it's built with -Os. Thanks!

@jcmvbkbc
Owner

jcmvbkbc commented Feb 6, 2015

Pushed a proposed fix to the call0-4.8.2-natural-align branch for preview.
Will test and integrate soon.
Thanks for your report.

@jcmvbkbc
Owner

Fixed and submitted upstream.

@pfalcon
Author

pfalcon commented Feb 22, 2015

Thanks. I have been travelling and haven't had a chance to look into it since, but I hope to get to it in the coming weeks.

@pfalcon
Author

pfalcon commented Jun 12, 2015

It turns out I never tested this properly, nor upgraded esp-open-sdk to the version with this patch. I have done the testing now, using the MicroPython esp8266 build as a subject.

Here are the section address/size diffs between builds with the old and new toolchains:

 .irom0.text     0x0000000040210000    0x3fe2c
 .text           0x0000000040100000     0x7236
 .data           0x000000003ffe8000      0x574
-.rodata         0x000000003ffe8580     0x511c
-.bss            0x000000003ffed6a0     0xd3a0
+.rodata         0x000000003ffe8580     0x4c80
+.bss            0x000000003ffed200     0xd3a0

So, only .rodata is affected, and more than a kilobyte was saved. For MicroPython, that can well amount to saving 10% of available RAM, i.e. really good results. Thanks! (esp-open-sdk has also been upgraded.)

@dpgeorge: FYI

@dpgeorge

@pfalcon thanks for the ping; it's great that such improvements can be made upstream for all to benefit from.

jcmvbkbc pushed a commit that referenced this issue May 30, 2017
	* tree.c (ovl_copy): Adjust assert, copy OVL_LOOKUP.
	(ovl_used): New.
	(lookup_keep): Call it.

	PR c++/80891 (#2)
	* g++.dg/lookup/pr80891-2.C: New.


git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@248570 138bc75d-0d04-0410-961f-82ee72b054a4
jcmvbkbc pushed a commit that referenced this issue Jun 18, 2018
This patch implements GCC support for mitigating vulnerability
CVE-2017-5715 known as Spectre #2 on IBM Z.

In order to disable prediction of indirect branches the implementation
makes use of an IBM Z specific feature - the execute instruction.
Performing an indirect branch via execute prevents the branch from
being subject to dynamic branch prediction.

The implementation tries to stay close to the x86 solution regarding
user interface.

x86 style options supported (without thunk-inline):

-mindirect-branch=(keep|thunk|thunk-extern)
-mfunction-return=(keep|thunk|thunk-extern)

IBM Z specific options:

-mindirect-branch-jump=(keep|thunk|thunk-extern|thunk-inline)
-mindirect-branch-call=(keep|thunk|thunk-extern)
-mfunction-return-reg=(keep|thunk|thunk-extern)
-mfunction-return-mem=(keep|thunk|thunk-extern)

These options allow us to enable/disable the branch conversion at a
finer granularity.

-mindirect-branch sets the value of -mindirect-branch-jump and
 -mindirect-branch-call.

-mfunction-return sets the value of -mfunction-return-reg and
 -mfunction-return-mem.

All these options are supported on GCC command line as well as
function attributes.

'thunk' triggers the generation of out of line thunks (expolines) and
replaces the formerly indirect branch with a direct branch to the
thunk.  Depending on the -march= setting two different types of thunks
are generated.  With -march=z10 or higher exrl (execute relative long)
is being used while targeting older machines makes use of larl/ex
instead.  From a security perspective the exrl variant is preferable.

'thunk-extern' does the branch replacement like 'thunk' but does not
emit the thunks.

'thunk-inline' is only available for indirect jumps.  It should be used
in environments where correct CFI is important - known as user space.

Additionally the patch introduces the -mindirect-branch-table option
which generates tables pointing to the locations which have been
modified.  This is supposed to allow reverting the changes without
re-compilation in situations where it isn't required. The sections are
split up into one section per option.

gcc/ChangeLog:

2018-02-08  Andreas Krebbel  <krebbel@linux.vnet.ibm.com>

	* config/s390/s390-opts.h (enum indirect_branch): Define.
	* config/s390/s390-protos.h (s390_return_addr_from_memory)
	(s390_indirect_branch_via_thunk)
	(s390_indirect_branch_via_inline_thunk): Add function prototypes.
	(enum s390_indirect_branch_type): Define.
	* config/s390/s390.c (struct s390_frame_layout, struct
	machine_function): Remove.
	(indirect_branch_prez10thunk_mask, indirect_branch_z10thunk_mask)
	(indirect_branch_table_label_no, indirect_branch_table_name):
	Define variables.
	(INDIRECT_BRANCH_NUM_OPTIONS): Define macro.
	(enum s390_indirect_branch_option): Define.
	(s390_return_addr_from_memory): New function.
	(s390_handle_string_attribute): New function.
	(s390_attribute_table): Add new attribute handler.
	(s390_execute_label): Handle UNSPEC_EXECUTE_JUMP patterns.
	(s390_indirect_branch_via_thunk): New function.
	(s390_indirect_branch_via_inline_thunk): New function.
	(s390_function_ok_for_sibcall): When jumping via thunk disallow
	sibling call optimization for non z10 compiles.
	(s390_emit_call): Force indirect branch target to be a single
	register.  Add r1 clobber for non-z10 compiles.
	(s390_emit_epilogue): Emit return jump via return_use expander.
	(s390_reorg): Handle JUMP_INSNs as execute targets.
	(s390_option_override_internal): Perform validity checks for the
	new command line options.
	(s390_indirect_branch_attrvalue): New function.
	(s390_indirect_branch_settings): New function.
	(s390_set_current_function): Invoke s390_indirect_branch_settings.
	(s390_output_indirect_thunk_function):  New function.
	(s390_code_end): Implement target hook.
	(s390_case_values_threshold): Implement target hook.
	(TARGET_ASM_CODE_END, TARGET_CASE_VALUES_THRESHOLD): Define target
	macros.
	* config/s390/s390.h (struct s390_frame_layout)
	(struct	machine_function): Move here from s390.c.
	(TARGET_INDIRECT_BRANCH_NOBP_RET)
	(TARGET_INDIRECT_BRANCH_NOBP_JUMP)
	(TARGET_INDIRECT_BRANCH_NOBP_JUMP_THUNK)
	(TARGET_INDIRECT_BRANCH_NOBP_JUMP_INLINE_THUNK)
	(TARGET_INDIRECT_BRANCH_NOBP_CALL)
	(TARGET_DEFAULT_INDIRECT_BRANCH_TABLE)
	(TARGET_INDIRECT_BRANCH_THUNK_NAME_EXRL)
	(TARGET_INDIRECT_BRANCH_THUNK_NAME_EX)
	(TARGET_INDIRECT_BRANCH_TABLE): Define macros.
	* config/s390/s390.md (UNSPEC_EXECUTE_JUMP)
	(INDIRECT_BRANCH_THUNK_REGNUM): Define constants.
	(mnemonic attribute): Add values which aren't recognized
	automatically.
	("*cjump_long", "*icjump_long", "*basr", "*basr_r"): Disable
	pattern for branch conversion.  Fix mnemonic attribute.
	("*c<code>", "*sibcall_br", "*sibcall_value_br", "*return"): Emit
	indirect branch via thunk if requested.
	("indirect_jump", "<code>"): Expand patterns for branch conversion.
	("*indirect_jump"): Disable for branch conversion using out of
	line thunks.
	("indirect_jump_via_thunk<mode>_z10")
	("indirect_jump_via_thunk<mode>")
	("indirect_jump_via_inlinethunk<mode>_z10")
	("indirect_jump_via_inlinethunk<mode>", "*casesi_jump")
	("casesi_jump_via_thunk<mode>_z10", "casesi_jump_via_thunk<mode>")
	("casesi_jump_via_inlinethunk<mode>_z10")
	("casesi_jump_via_inlinethunk<mode>", "*basr_via_thunk<mode>_z10")
	("*basr_via_thunk<mode>", "*basr_r_via_thunk_z10")
	("*basr_r_via_thunk", "return<mode>_prez10"): New pattern.
	("*indirect2_jump"): Disable for branch conversion.
	("casesi_jump"): Turn into expander and expand patterns for branch
	conversion.
	("return_use"): New expander.
	("*return"): Emit return via thunk and rename it to ...
	("*return<mode>"): ... this one.
	* config/s390/s390.opt: Add new options and an enum for the
	option values.

gcc/testsuite/ChangeLog:

2018-02-08  Andreas Krebbel  <krebbel@linux.vnet.ibm.com>

	* gcc.target/s390/nobp-function-pointer-attr.c: New test.
	* gcc.target/s390/nobp-function-pointer-nothunk.c: New test.
	* gcc.target/s390/nobp-function-pointer-z10.c: New test.
	* gcc.target/s390/nobp-function-pointer-z900.c: New test.
	* gcc.target/s390/nobp-indirect-jump-attr.c: New test.
	* gcc.target/s390/nobp-indirect-jump-inline-attr.c: New test.
	* gcc.target/s390/nobp-indirect-jump-inline-z10.c: New test.
	* gcc.target/s390/nobp-indirect-jump-inline-z900.c: New test.
	* gcc.target/s390/nobp-indirect-jump-nothunk.c: New test.
	* gcc.target/s390/nobp-indirect-jump-z10.c: New test.
	* gcc.target/s390/nobp-indirect-jump-z900.c: New test.
	* gcc.target/s390/nobp-return-attr-all.c: New test.
	* gcc.target/s390/nobp-return-attr-neg.c: New test.
	* gcc.target/s390/nobp-return-mem-attr.c: New test.
	* gcc.target/s390/nobp-return-mem-nothunk.c: New test.
	* gcc.target/s390/nobp-return-mem-z10.c: New test.
	* gcc.target/s390/nobp-return-mem-z900.c: New test.
	* gcc.target/s390/nobp-return-reg-attr.c: New test.
	* gcc.target/s390/nobp-return-reg-mixed.c: New test.
	* gcc.target/s390/nobp-return-reg-nothunk.c: New test.
	* gcc.target/s390/nobp-return-reg-z10.c: New test.
	* gcc.target/s390/nobp-return-reg-z900.c: New test.
	* gcc.target/s390/nobp-table-jump-inline-z10.c: New test.
	* gcc.target/s390/nobp-table-jump-inline-z900.c: New test.
	* gcc.target/s390/nobp-table-jump-z10.c: New test.
	* gcc.target/s390/nobp-table-jump-z900.c: New test.



git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@257489 138bc75d-0d04-0410-961f-82ee72b054a4
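
For illustration, a hedged sketch of how the options and attributes described in the commit above might be combined; the attribute spellings are assumed to mirror the command-line option names:

    // Build the whole translation unit with expolines, e.g.
    //   g++ -march=z10 -mindirect-branch=thunk -mfunction-return=thunk unit.cc
    // and opt a single function back out via attributes (sketch only).
    extern void (*handler)(int);

    __attribute__ ((indirect_branch ("keep"), function_return ("keep")))
    void hot_dispatch (int x)
    {
      handler (x);  // with "thunk" in effect this call would go via an expoline
    }
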
jcmvbkbc pushed a commit that referenced this issue Jun 18, 2018
When -fcf-protection -mcet is used, I got

FAIL: g++.dg/eh/sighandle.C

(gdb) bt
 #0  _Unwind_RaiseException (exc=exc@entry=0x416ed0)
    at /export/gnu/import/git/sources/gcc/libgcc/unwind.inc:140
 #1  0x00007ffff7d9936b in __cxxabiv1::__cxa_throw (obj=<optimized out>,
    tinfo=0x403dd0 <typeinfo for int@@CXXABI_1.3>, dest=0x0)
    at /export/gnu/import/git/sources/gcc/libstdc++-v3/libsupc++/eh_throw.cc:90
 #2  0x0000000000401255 in sighandler (signo=11, si=0x7fffffffd6f8,
    uc=0x7fffffffd5c0)
    at /export/gnu/import/git/sources/gcc/gcc/testsuite/g++.dg/eh/sighandle.C:9
 #3  <signal handler called> <<<< Signal frame which isn't on shadow stack
 #4  dosegv ()
    at /export/gnu/import/git/sources/gcc/gcc/testsuite/g++.dg/eh/sighandle.C:14
 #5  0x00000000004012e3 in main ()
    at /export/gnu/import/git/sources/gcc/gcc/testsuite/g++.dg/eh/sighandle.C:30
(gdb) p frames
$6 = 5
(gdb)

frame count should be 4, not 5.  This patch skips signal frames when
unwinding shadow stack.

gcc/testsuite/

	PR libgcc/85334
	* g++.dg/torture/pr85334.C: New test.

libgcc/

	PR libgcc/85334
	* unwind-generic.h (_Unwind_Frames_Increment): New.
	* config/i386/shadow-stack-unwind.h (_Unwind_Frames_Increment):
	Likewise.
	* unwind.inc (_Unwind_RaiseException_Phase2): Increment frame
	count with _Unwind_Frames_Increment.
	(_Unwind_ForcedUnwind_Phase2): Likewise.


git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@259502 138bc75d-0d04-0410-961f-82ee72b054a4
jcmvbkbc pushed a commit that referenced this issue Jun 19, 2018
Make ix86_frame available to i386 code generation.  This is needed to
backport the patch set of -mindirect-branch= to mitigate variant #2 of
the speculative execution vulnerabilities on x86 processors identified
by CVE-2017-5715, aka Spectre.

	Backport from mainline
	* config/i386/i386.c (ix86_frame): Moved to ...
	* config/i386/i386.h (ix86_frame): Here.
	(machine_function): Add frame.
	* config/i386/i386.c (ix86_compute_frame_layout): Replace the
	frame argument with &cfun->machine->frame.
	(ix86_can_use_return_insn_p): Don't pass &frame to
	ix86_compute_frame_layout.  Copy frame from cfun->machine->frame.
	(ix86_can_eliminate): Likewise.
	(ix86_expand_prologue): Likewise.
	(ix86_expand_epilogue): Likewise.
	(ix86_expand_split_stack_prologue): Likewise.


git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/branches/gcc-7-branch@256691 138bc75d-0d04-0410-961f-82ee72b054a4
jcmvbkbc pushed a commit that referenced this issue Jun 19, 2018
This patch fixes an LRA cycling problem on the attached testcase.
The original insn was:

(insn 74 72 76 8 (set (reg:V2DI 287 [ _166 ])
        (subreg:V2DI (reg/v/f:DI 112 [ d ]) 0)) 1060 {*aarch64_simd_movv2di}
     (nil))

which IRA converted to:

(insn 74 72 580 8 (set (reg:V2DI 287 [ _166 ])
        (subreg:V2DI (reg/v/f:DI 517 [orig:112 d ] [112]) 0)) 1060 {*aarch64_simd_movv2di}
     (nil))

after creating loop allocnos.  It happens that the ALLOCNO_WMODEs for
both 112 and 517 were not set to V2DI due to another bug that I'll post
a separate patch for, but we nevertheless got a valid allocation of
register 1.

LRA's first try at constraining the instruction gave:

         Choosing alt 5 in insn 74:  (0) ?w  (1) r {*aarch64_simd_movv2di}

at which point all was good.  But LRA later decided it needed
to spill r517:

    Spill r517 after risky transformations

so the next constraint attempt gave:

         Choosing alt 0 in insn 74:  (0) =w  (1) m {*aarch64_simd_movv2di}

which was still good.  Then during inheritance we had:

      Creating newreg=672 from oldreg=517, assigning class GENERAL_REGS to inheritance r672
    Original reg change 517->672 (bb8):
   74: r287:V2DI=r672:DI#0
    Add inheritance<-original before:
  939: r672:DI=r517:DI

    Inheritance reuse change 517->672 (bb8):
  620: r572:DI=r672:DI
      REG_DEAD r672:DI

    Use smallest class of POINTER_REGS and GENERAL_REGS
      Creating newreg=673 from oldreg=517, assigning class POINTER_REGS to inheritance r673
    Original reg change 517->673 (bb8):
  936: r669:DI=r673:DI
    Add inheritance<-original before:
  940: r673:DI=r517:DI

("Use smallest class of POINTER_REGS and GENERAL_REGS" ought to
give GENERAL_REGS.  That might be a missed optimisation, and probably
due to both classes having the same number of allocatable registers.
I'll look at that as a follow-on.)

Thus LRA created two inheritance registers for r517, one (r673)
that included the unallocatable x31 and another (r672) that didn't.
The r672 references included the paradoxical subreg in insn 74 but the
r673 ones didn't.  LRA then allocated x30 to r673, which was a valid
choice.

Later LRA decided to "undo" the inheritance for insn 620, but because
of the double inheritance, it got confused as to what the original
situation was, and made insn 74 use the other inheritance register
instead of r517:

********** Undoing inheritance #2: **********

Inherit 11 out of 12 (91.67%)
   Insn after restoring regs:
  620: r572:DI=r517:DI
      REG_DEAD r517:DI
    Change reload insn:
   74: r287:V2DI=r673:DI#0       <-------------------
   Insn after restoring regs:
  939: r517:DI=r673:DI
      REG_DEAD r673:DI

This might be a bug in itself: we should probably look through sets
of other inheritance pseudos to find the "real" origin.

Either way, at this point we had a situation in which r673 was used in an
insn whose subreg was larger than the biggest_mode that r673 had when it
was allocated.  While x30 was valid for the original biggest_mode, it
wasn't valid for this subreg use.

The next attempt to constrain insn 74 was:

        Choosing alt 5 in insn 74:  (0) ?w  (1) r {*aarch64_simd_movv2di}
      Creating newreg=684, assigning class GENERAL_REGS to r684
   74: r287:V2DI=r684:V2DI
    Inserting insn reload before:
  951: r684:V2DI=r673:DI#0

where LRA reloaded the SUBREG rather than the SUBREG_REG.  And it
then cycled trying the same thing when reloading the reload (and the
reload of the reload, etc.).

What it should be doing here is reloading the SUBREG_REG instead.
There's already code to cope with this case when the paradoxical
subreg falls outside the class (which isn't true here, since r673
is POINTER_REGS and POINTER_REGS includes x31).  But I think we
should also test whether LRA is entitled to allocate the spanned
registers.  Not doing that seems like a bug regardless of the above
missed optimisation and the mix-up undoing inheritance.

2018-05-30  Richard Sandiford  <richard.sandiford@linaro.org>

gcc/
	* lra-constraints.c (simplify_operand_subreg): In the paradoxical
	case, check whether the outer register overlaps an unallocatable
	register, not just whether it fits the required class.

gcc/testsuite/
	* g++.dg/torture/aarch64-vect-init-1.C: New test.


git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@261531 138bc75d-0d04-0410-961f-82ee72b054a4
jcmvbkbc pushed a commit that referenced this issue Jan 28, 2019
This adds a 4th information level for the -gnatR output, where relevant
compiler-generated types are listed in addition to the information
already output by -gnatR3.

For the following package P:

package P is

  type Arr0 is array (Positive range <>) of Boolean;

    type Rec (D1 : Positive; D2 : Boolean) is record
       C1 : Integer;
       C2 : Arr0 (1 .. D1);

       case D2 is
          when False =>
             C3 : Character;
          when True =>
             C4 : String (1 .. 3);
             C5 : Float;
       end case;
    end record;

    type Arr1 is array (1 .. 8) of Rec (1, True);

end P;

the output generated by -gnatR4 must be:

Representation information for unit P (spec)
--------------------------------------------

for Arr0'Alignment use 1;
for Arr0'Component_Size use 8;

for Rec'Object_Size use 17179869344;
for Rec'Value_Size use (if (#2 != 0) then ((((#1 + 15) & -4) + 8) * 8)
else ((((#1 + 15) & -4) + 1) * 8) end);
for Rec'Alignment use 4;
for Rec use record
   D1 at  0 range  0 .. 31;
   D2 at  4 range  0 ..  7;
   C1 at  8 range  0 .. 31;
   C2 at 12 range  0 .. ((#1 * 8)) - 1;
   C3 at ((#1 + 15) & -4) range  0 ..  7;
   C4 at ((#1 + 15) & -4) range  0 .. 23;
   C5 at (((#1 + 15) & -4) + 4) range  0 .. 31;
end record;

for Arr1'Size use 1536;
for Arr1'Alignment use 4;
for Arr1'Component_Size use 192;

for Tarr1c'Size use 192;
for Tarr1c'Alignment use 4;
for Tarr1c use record
   D1 at  0 range  0 .. 31;
   D2 at  4 range  0 ..  7;
   C1 at  8 range  0 .. 31;
   C2 at 12 range  0 ..  7;
   C4 at 16 range  0 .. 23;
   C5 at 20 range  0 .. 31;
end record;

2018-11-14  Eric Botcazou  <ebotcazou@adacore.com>

gcc/ada/

	* doc/gnat_ugn/building_executable_programs_with_gnat.rst
	(-gnatR): Document new -gnatR4 level.
	* gnat_ugn.texi: Regenerate.
	* opt.ads (List_Representation_Info): Bump upper bound to 4.
	* repinfo.adb: Add with clause for GNAT.HTable.
	(Relevant_Entities_Size): New constant.
	(Entity_Header_Num): New type.
	(Entity_Hash): New function.
	(Relevant_Entities): New set implemented with GNAT.HTable.
	(List_Entities): Also list compiler-generated entities present
	in the Relevant_Entities set. Consider that the Component_Type
	of an array type is relevant.
	(List_Rep_Info): Reset Relevant_Entities for each unit.
	* switch-c.adb (Scan_Front_End_Switches): Add support for -gnatR4.
	* switch-m.adb (Normalize_Compiler_Switches): Likewise
	* usage.adb (Usage): Likewise.

git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@266131 138bc75d-0d04-0410-961f-82ee72b054a4
jcmvbkbc pushed a commit that referenced this issue Jun 7, 2020
This patch implements DR 2237 which says that a simple-template-id is
no longer valid as the declarator-id of a constructor or destructor;
see [diff.cpp17.class]#2.  It is not explicitly stated but out-of-line
destructors with a simple-template-id are also meant to be ill-formed
now.  (Out-of-line constructors like that are invalid since DR1435 I
think.)  This change only applies to C++20; it is not a DR against C++17.

I'm not crazy about the diagnostic in constructors but ISTM that
cp_parser_constructor_declarator_p shouldn't print errors.

	DR 2237
	* parser.c (cp_parser_unqualified_id): Reject simple-template-id as
	the declarator-id of a destructor.
	(cp_parser_constructor_declarator_p): Reject simple-template-id as
	the declarator-id of a constructor.

	* g++.dg/DRs/dr2237.C: New test.
	* g++.dg/parse/constructor2.C: Add dg-error for C++20.
	* g++.dg/parse/dtor12.C: Likewise.
	* g++.dg/parse/dtor4.C: Likewise.
	* g++.dg/template/dtor4.C: Adjust dg-error.
	* g++.dg/template/error34.C: Likewise.
	* g++.old-deja/g++.other/inline15.C: Only run for C++17 and lesser.
	* g++.old-deja/g++.pt/ctor2.C: Add dg-error for C++20.
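
A hedged illustration of the constructs this change rejects under C++20 (simplified; not taken from the new testcase):

    template<class T>
    struct S {
      S<T>();    // ill-formed in C++20: simple-template-id as declarator-id; write S();
      ~S<T>();   // likewise; write ~S();
    };

    // Out-of-line destructor spelled with a simple-template-id: also ill-formed now.
    template<class T>
    S<T>::~S<T>() { }
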
jcmvbkbc pushed a commit that referenced this issue Jun 11, 2022
Here ever since r12-6022-gbb2a7f80a98de3 we stopped deeming the partial
specialization #2 to be more specialized than #1 ultimately because
dependent operator expressions now have a DEPENDENT_OPERATOR_TYPE type
instead of an empty type, and this made unify stop deducing T(2) == 1
for K during partial ordering for #1 and #2.

This minimal patch fixes this by making the relevant logic in unify
treat DEPENDENT_OPERATOR_TYPE like an empty type.

	PR c++/105425

gcc/cp/ChangeLog:

	* pt.cc (unify) <case TEMPLATE_PARM_INDEX>: Treat
	DEPENDENT_OPERATOR_TYPE like an empty type.

gcc/testsuite/ChangeLog:

	* g++.dg/template/partial-specialization13.C: New test.
jcmvbkbc pushed a commit that referenced this issue Jun 11, 2022
Here during cp_parser_single_declaration for #2, we were calling
associate_classtype_constraints for TPL<T> (the primary template type)
before maybe_process_partial_specialization could get a chance to
notice that we're in fact declaring a distinct constrained partial
spec and not redeclaring the primary template.  This caused us to
emit a bogus error about differing constraints b/t the primary template
and #2's constraints.  This patch fixes this by moving the call to
associate_classtype_constraints after the call to shadow_tag (which
calls maybe_process_partial_specialization) and adjusting shadow_tag to
use the return value of m_p_p_s.

Moreover, if we later try to define a constrained partial specialization
that's been declared earlier (as in the third testcase), then
maybe_new_partial_specialization correctly notices it's a redeclaration
and returns NULL_TREE.  But in this case we also need to update TYPE to
point to the redeclared partial spec (it'll otherwise continue pointing
to the primary template type, eventually leading to a bogus error).

	PR c++/96363

gcc/cp/ChangeLog:

	* decl.cc (shadow_tag): Use the return value of
	maybe_process_partial_specialization.
	* parser.cc (cp_parser_single_declaration): Call shadow_tag
	before associate_classtype_constraints.
	* pt.cc (maybe_new_partial_specialization): Change return type
	to bool.  Take 'type' argument by mutable reference.  Set 'type'
	to point to the correct constrained specialization when
	appropriate.
	(maybe_process_partial_specialization): Adjust accordingly.

gcc/testsuite/ChangeLog:

	* g++.dg/cpp2a/concepts-partial-spec12.C: New test.
	* g++.dg/cpp2a/concepts-partial-spec12a.C: New test.
	* g++.dg/cpp2a/concepts-partial-spec13.C: New test.
jcmvbkbc pushed a commit that referenced this issue Jun 11, 2022
As explained in r11-4959-gde6f64f9556ae3, the atom cache assumes two
equivalent expressions (according to cp_tree_equal) must use the same
template parameters (according to find_template_parameters).  This
assumption turned out to not hold for TARGET_EXPR, which was addressed
by that commit.

But this assumption apparently doesn't hold for PARM_DECL either:
find_template_parameters walks its DECL_CONTEXT but cp_tree_equal by
default doesn't consider DECL_CONTEXT unless comparing_specializations
is set.  Thus in the first testcase below, the atomic constraints of #1
and #2 are equivalent according to cp_tree_equal, but according to
find_template_parameters the former uses T and the latter uses both T
and U (surprisingly).

We could fix this assumption violation by setting comparing_specializations
in the atom_hasher, which would make cp_tree_equal return false for the
two atoms, but that seems overly pessimistic here.  Ideally the atoms
should continue being considered equivalent and we instead fix
find_template_parameters to return just T for #2's atom.

To that end this patch makes for_each_template_parm_r stop walking the
DECL_CONTEXT of a PARM_DECL.  This should be safe to do because
tsubst_copy / tsubst_decl only substitutes the TREE_TYPE of a PARM_DECL
and doesn't bother substituting the DECL_CONTEXT, thus the only relevant
template parameters are those used in its type.  any_template_parm_r is
currently responsible for walking its TREE_TYPE, but I suppose it now makes
sense for for_each_template_parm_r to do so instead.

In passing this patch also makes for_each_template_parm_r stop walking
the DECL_CONTEXT of a VAR_/FUNCTION_DECL since doing so after walking
DECL_TI_ARGS is redundant, I think.

I experimented with not walking DECL_CONTEXT for CONST_DECL, but the
second testcase below demonstrates it's necessary to walk it.

	PR c++/105797

gcc/cp/ChangeLog:

	* pt.cc (for_each_template_parm_r) <case FUNCTION_DECL, VAR_DECL>:
	Don't walk DECL_CONTEXT.
	<case PARM_DECL>: Likewise.  Walk TREE_TYPE.
	<case CONST_DECL>: Simplify.
	(any_template_parm_r) <case PARM_DECL>: Don't walk TREE_TYPE.

gcc/testsuite/ChangeLog:

	* g++.dg/cpp2a/concepts-decltype4.C: New test.
	* g++.dg/cpp2a/concepts-memfun3.C: New test.
jcmvbkbc pushed a commit that referenced this issue Jul 18, 2022
This patch implements C++23 P2255R2, which adds two new type traits to
detect reference binding to a temporary.  They can be used to detect code
like

  std::tuple<const std::string&> t("meow");

which is incorrect because it always creates a dangling reference, because
the std::string temporary is created inside the selected constructor of
std::tuple, and not outside it.

There are two new compiler builtins, __reference_constructs_from_temporary
and __reference_converts_from_temporary.  The former is used to simulate
direct- and the latter copy-initialization context.  But I had a hard time
finding a test where there's actually a difference.  Under DR 2267, both
of these are invalid:

  struct A { } a;
  struct B { explicit B(const A&); };
  const B &b1{a};
  const B &b2(a);

so I had to peruse [over.match.ref], and eventually realized that the
difference can be seen here:

  struct G {
    operator int(); // #1
    explicit operator int&&(); // #2
  };

int&& r1(G{}); // use #2 (no temporary)
int&& r2 = G{}; // use #1 (a temporary is created to be bound to int&&)

The implementation itself was rather straightforward because we already
have the conv_binds_ref_to_prvalue function.  The main function here is
ref_xes_from_temporary.
I've changed the return type of ref_conv_binds_directly to tristate, because
previously the function didn't distinguish between an invalid conversion and
one that binds to a prvalue.  Since it no longer returns a bool, I removed
the _p suffix.

The patch also adds the relevant class and variable templates to <type_traits>.

	PR c++/104477

gcc/c-family/ChangeLog:

	* c-common.cc (c_common_reswords): Add
	__reference_constructs_from_temporary and
	__reference_converts_from_temporary.
	* c-common.h (enum rid): Add RID_REF_CONSTRUCTS_FROM_TEMPORARY and
	RID_REF_CONVERTS_FROM_TEMPORARY.

gcc/cp/ChangeLog:

	* call.cc (ref_conv_binds_directly_p): Rename to ...
	(ref_conv_binds_directly): ... this.  Add a new bool parameter.  Change
	the return type to tristate.
	* constraint.cc (diagnose_trait_expr): Handle
	CPTK_REF_CONSTRUCTS_FROM_TEMPORARY and CPTK_REF_CONVERTS_FROM_TEMPORARY.
	* cp-tree.h: Include "tristate.h".
	(enum cp_trait_kind): Add CPTK_REF_CONSTRUCTS_FROM_TEMPORARY
	and CPTK_REF_CONVERTS_FROM_TEMPORARY.
	(ref_conv_binds_directly_p): Rename to ...
	(ref_conv_binds_directly): ... this.
	(ref_xes_from_temporary): Declare.
	* cxx-pretty-print.cc (pp_cxx_trait_expression): Handle
	CPTK_REF_CONSTRUCTS_FROM_TEMPORARY and CPTK_REF_CONVERTS_FROM_TEMPORARY.
	* method.cc (ref_xes_from_temporary): New.
	* parser.cc (cp_parser_primary_expression): Handle
	RID_REF_CONSTRUCTS_FROM_TEMPORARY and RID_REF_CONVERTS_FROM_TEMPORARY.
	(cp_parser_trait_expr): Likewise.
	(warn_for_range_copy): Adjust to call ref_conv_binds_directly.
	* semantics.cc (trait_expr_value): Handle
	CPTK_REF_CONSTRUCTS_FROM_TEMPORARY and CPTK_REF_CONVERTS_FROM_TEMPORARY.
	(finish_trait_expr): Likewise.

libstdc++-v3/ChangeLog:

	* include/std/type_traits (reference_constructs_from_temporary,
	reference_converts_from_temporary): New class templates.
	(reference_constructs_from_temporary_v,
	reference_converts_from_temporary_v): New variable templates.
	(__cpp_lib_reference_from_temporary): Define for C++23.
	* include/std/version (__cpp_lib_reference_from_temporary): Define for
	C++23.
	* testsuite/20_util/variable_templates_for_traits.cc: Test
	reference_constructs_from_temporary_v and
	reference_converts_from_temporary_v.
	* testsuite/20_util/reference_from_temporary/value.cc: New test.
	* testsuite/20_util/reference_from_temporary/value2.cc: New test.
	* testsuite/20_util/reference_from_temporary/version.cc: New test.

gcc/testsuite/ChangeLog:

	* g++.dg/ext/reference_constructs_from_temporary1.C: New test.
	* g++.dg/ext/reference_converts_from_temporary1.C: New test.
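
A hedged usage sketch of the new library traits (names as added above; requires a C++23 library that defines __cpp_lib_reference_from_temporary):

    #include <string>
    #include <type_traits>

    // Binding const std::string& to a const char* materializes a std::string
    // temporary, so the trait is true; binding it to an existing std::string
    // lvalue does not, so the trait is false.
    static_assert(std::reference_constructs_from_temporary_v<const std::string&, const char*>);
    static_assert(!std::reference_constructs_from_temporary_v<const std::string&, std::string&>);

    int main() { }
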
jcmvbkbc pushed a commit that referenced this issue Aug 16, 2022
This patch implements some additional zero-extension and sign-extension
related optimizations in simplify-rtx.cc.  The original motivation comes
from PR rtl-optimization/71775, where in comment #2 Andrew Pinksi sees:

Failed to match this instruction:
(set (reg:DI 88 [ _1 ])
    (sign_extend:DI (subreg:SI (ctz:DI (reg/v:DI 86 [ x ])) 0)))

On many platforms the result of DImode CTZ is constrained to be a
small unsigned integer (between 0 and 64), hence the truncation to
32-bits (using a SUBREG) and the following sign extension back to
64-bits are effectively a no-op, so the above should ideally (often)
be simplified to "(set (reg:DI 88) (ctz:DI (reg/v:DI 86 [ x ]))".

To implement this, and some closely related transformations, we build
upon the existing val_signbit_known_clear_p predicate.  In the first
chunk, nonzero_bits knows that FFS and ABS can't leave the sign bit
set, so the simplification of ABS (ABS (x)) and ABS (FFS (x))
can itself be simplified.  The second transformation is that we can
canonicalize SIGN_EXTEND to ZERO_EXTEND (as in the PR 71775 case above)
when the operand's sign-bit is known to be clear.  The final two chunks
are for SIGN_EXTEND of a truncating SUBREG, and ZERO_EXTEND of a
truncating SUBREG respectively.  The nonzero_bits of a truncating
SUBREG pessimistically thinks that the upper bits may have an
arbitrary value (by taking the SUBREG), so we need look deeper at the
SUBREG's operand to confirm that the high bits are known to be zero.

Unfortunately, for PR rtl-optimization/71775, ctz:DI on x86_64 with
default architecture options is undefined at zero, so we can't be sure
the upper bits of reg:DI 88 will be sign extended (all zeros or all ones).
nonzero_bits knows this, so the above transformations don't trigger,
but the transformations themselves are perfectly valid for other
operations such as FFS, POPCOUNT and PARITY, and on other targets/-march
settings where CTZ is defined at zero.

2022-08-03  Roger Sayle  <roger@nextmovesoftware.com>
	    Segher Boessenkool  <segher@kernel.crashing.org>
	    Richard Sandiford  <richard.sandiford@arm.com>

gcc/ChangeLog
	* simplify-rtx.cc (simplify_unary_operation_1) <ABS>: Add
	optimizations for CLRSB, PARITY, POPCOUNT, SS_ABS and LSHIFTRT
	that are all positive to complement the existing FFS and
	idempotent ABS simplifications.
	<SIGN_EXTEND>: Canonicalize SIGN_EXTEND to ZERO_EXTEND when
	val_signbit_known_clear_p is true of the operand.
	Simplify sign extensions of SUBREG truncations of operands
	that are already suitably (zero) extended.
	<ZERO_EXTEND>: Simplify zero extensions of SUBREG truncations
	of operands that are already suitably zero extended.
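
A hedged source-level illustration of the kind of code behind the PR 71775 RTL quoted above (the actual codegen depends on the target and -march, since CTZ must be defined at zero for the fold to apply):

    // The ctz result is at most 63, so truncating it to 32 bits and then
    // sign-extending back to 64 bits is effectively a no-op that the new
    // simplify-rtx rules can remove.
    long long f (unsigned long long x)
    {
      int t = __builtin_ctzll (x);  // undefined for x == 0 on some targets
      return t;                     // sign-extends int back to 64 bits
    }
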
jcmvbkbc pushed a commit that referenced this issue Aug 17, 2022
In my previous patches I've been extending our std::move warnings,
but this tweak actually dials it down a little bit.  As reported in
bug 89780, it's questionable to warn about expressions in templates
that were type-dependent, but aren't anymore because we're instantiating
the template.  As in,

  template <typename T>
  Dest withMove() {
    T x;
    return std::move(x);
  }

  template Dest withMove<Dest>(); // #1
  template Dest withMove<Source>(); // #2

Saying that the std::move is pessimizing for #1 is not incorrect, but
it's not useful, because removing the std::move would then pessimize #2.
So the user can't really win.  At the same time, disabling the warning
just because we're in a template would be going too far, I still want to
warn for

  template <typename>
  Dest withMove() {
    Dest x;
    return std::move(x);
  }

because the std::move therein will be pessimizing for any instantiation.

So I'm using the suppress_warning machinery to that effect.
Problem: I had to add a new group to nowarn_spec_t, otherwise
suppressing the -Wpessimizing-move warning would disable a whole bunch
of other warnings, which we really don't want.

	PR c++/89780

gcc/cp/ChangeLog:

	* pt.cc (tsubst_copy_and_build) <case CALL_EXPR>: Maybe suppress
	-Wpessimizing-move.
	* typeck.cc (maybe_warn_pessimizing_move): Don't issue warnings
	if they are suppressed.
	(check_return_expr): Disable -Wpessimizing-move when returning
	a dependent expression.

gcc/ChangeLog:

	* diagnostic-spec.cc (nowarn_spec_t::nowarn_spec_t): Handle
	OPT_Wpessimizing_move and OPT_Wredundant_move.
	* diagnostic-spec.h (nowarn_spec_t): Add NW_REDUNDANT enumerator.

gcc/testsuite/ChangeLog:

	* g++.dg/cpp0x/Wpessimizing-move3.C: Remove dg-warning.
	* g++.dg/cpp0x/Wredundant-move2.C: Likewise.
jcmvbkbc pushed a commit that referenced this issue Nov 8, 2022
The eliminate reg-reg move by inverting the condition of
a cmove #2 peephole2 converts the following sequence:

  473: bx:DI=[r14:DI*0x8+r12:DI]
  960: r15:DI=r8:DI
  485: {flags:CCC=cmp(r15:DI+bx:DI,bx:DI);r15:DI=r15:DI+bx:DI;}
  737: r15:DI={(geu(flags:CCC,0))?r15:DI:bx:DI}

to:

 1110: {flags:CCC=cmp(r8:DI+bx:DI,bx:DI);r8:DI=r8:DI+bx:DI;}
 1111: r15:DI=[r14:DI*0x8+r12:DI]
 1112: r15:DI={(geu(flags:CCC,0))?r8:DI:r15:DI}

Please note that (insn 1110) uses register BX, but its
initialization was eliminated.

Avoid the conversion if the eliminated move initialized a register used
in the moved instruction.

2022-11-03  Uroš Bizjak  <ubizjak@gmail.com>

gcc/ChangeLog:

	PR target/107404
	* config/i386/i386.md (eliminate reg-reg move by inverting the
	condition of a cmove #2 peephole2): Check if eliminated move
	initialized a register, used in the moved instruction.

gcc/testsuite/ChangeLog:

	PR target/107404
	* g++.target/i386/pr107404.C: New test.
jcmvbkbc pushed a commit that referenced this issue Jan 14, 2023
While looking at PR 105549, which is about fixing the ABI break
introduced in GCC 9.1 in parameter alignment with bit-fields, we
noticed that the GCC 9.1 warning is not emitted in all the cases where
it should be.  This patch fixes that and the next patch in the series
fixes the GCC 9.1 break.

We split this into two patches since patch #2 introduces a new ABI
break starting with GCC 13.1.  This way, patch #1 can be back-ported
to release branches if needed to fix the GCC 9.1 warning issue.

The main idea is to add a new global boolean that indicates whether
we're expanding the start of a function, so that aarch64_layout_arg
can emit warnings for callees as well as callers.  This removes the
need for aarch64_function_arg_boundary to warn (with its incomplete
information).  However, in the first patch there are still cases where
we emit warnings where we should not; this is fixed in patch #2 where
we can distinguish between GCC 9.1 and GCC 13.1 ABI breaks properly.

The fix in aarch64_function_arg_boundary (replacing & with &&) looks
like an oversight of a previous commit in this area which changed
'abi_break' from a boolean to an integer.

We also take the opportunity to fix the comment above
aarch64_function_arg_alignment since the value of the abi_break
parameter was changed in a previous commit, no longer matching the
description.

2022-11-28  Christophe Lyon  <christophe.lyon@arm.com>
	    Richard Sandiford  <richard.sandiford@arm.com>

gcc/ChangeLog:

	* config/aarch64/aarch64.cc (aarch64_function_arg_alignment): Fix
	comment.
	(aarch64_layout_arg): Factorize warning conditions.
	(aarch64_function_arg_boundary): Fix typo.
	* function.cc (currently_expanding_function_start): New variable.
	(expand_function_start): Handle
	currently_expanding_function_start.
	* function.h (currently_expanding_function_start): Declare.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/bitfield-abi-warning-align16-O2.c: New test.
	* gcc.target/aarch64/bitfield-abi-warning-align16-O2-extra.c: New
	test.
	* gcc.target/aarch64/bitfield-abi-warning-align32-O2.c: New test.
	* gcc.target/aarch64/bitfield-abi-warning-align32-O2-extra.c: New
	test.
	* gcc.target/aarch64/bitfield-abi-warning-align8-O2.c: New test.
	* gcc.target/aarch64/bitfield-abi-warning.h: New test.
	* g++.target/aarch64/bitfield-abi-warning-align16-O2.C: New test.
	* g++.target/aarch64/bitfield-abi-warning-align16-O2-extra.C: New
	test.
	* g++.target/aarch64/bitfield-abi-warning-align32-O2.C: New test.
	* g++.target/aarch64/bitfield-abi-warning-align32-O2-extra.C: New
	test.
	* g++.target/aarch64/bitfield-abi-warning-align8-O2.C: New test.
	* g++.target/aarch64/bitfield-abi-warning.h: New test.
jcmvbkbc pushed a commit that referenced this issue Feb 13, 2023
Here the ahead-of-time overload set pruning in finish_call_expr is
unintentionally returning a CALL_EXPR whose (pruned) callee is wrapped
in an ADDR_EXPR, despite the original callee not being wrapped in an
ADDR_EXPR.  This ends up causing a bogus declaration mismatch error in
the below testcase because the call to min in #1 gets expressed as a
CALL_EXPR of ADDR_EXPR of FUNCTION_DECL, whereas the level-lowered call
to min in #2 gets expressed instead as a CALL_EXPR of FUNCTION_DECL.

This patch fixes this by stripping the spurious ADDR_EXPR appropriately.
Thus the first call to min now also gets expressed as a CALL_EXPR of
FUNCTION_DECL, matching the behavior before r12-6075-g2decd2cabe5a4f.

	PR c++/107461

gcc/cp/ChangeLog:

	* semantics.cc (finish_call_expr): Strip ADDR_EXPR from
	the selected callee during overload set pruning.

gcc/testsuite/ChangeLog:

	* g++.dg/template/call9.C: New test.
jcmvbkbc pushed a commit that referenced this issue Feb 13, 2023
After r13-5684-g59e0376f607805 the (pruned) callee of a non-dependent
CALL_EXPR is a bare FUNCTION_DECL rather than ADDR_EXPR of FUNCTION_DECL.
This innocent change revealed that cp_tree_equal doesn't first check
dependence of a CALL_EXPR before treating a FUNCTION_DECL callee as a
dependent name, which leads to us incorrectly accepting the first two
testcases below and rejecting the third:

 * In the first testcase, cp_tree_equal incorrectly returns true for
   the two non-dependent CALL_EXPRs f(0) and f(0) (whose CALL_EXPR_FN
   are different FUNCTION_DECLs) which causes us to treat #2 as a
   redeclaration of #1.

 * Same issue in the second testcase, for f<int*>() and f<char>().

 * In the third testcase, cp_tree_equal incorrectly returns true for
   f<int>() and f<void(*)(int)>() which causes us to conflate the two
   dependent specializations A<decltype(f<int>()(U()))> and
   A<decltype(f<void(*)(int)>()(U()))>.

This patch fixes this by making called_fns_equal treat two callees as
dependent names only if the overall CALL_EXPRs are dependent, via a new
convenience function call_expr_dependent_name that is like dependent_name
but also checks dependence of the overall CALL_EXPR.

	PR c++/107461

gcc/cp/ChangeLog:

	* cp-tree.h (call_expr_dependent_name): Declare.
	* pt.cc (iterative_hash_template_arg) <case CALL_EXPR>: Use
	call_expr_dependent_name instead of dependent_name.
	* tree.cc (call_expr_dependent_name): Define.
	(called_fns_equal): Adjust to take two CALL_EXPRs instead of
	CALL_EXPR_FNs thereof.  Use call_expr_dependent_name instead
	of dependent_name.
	(cp_tree_equal) <case CALL_EXPR>: Adjust call to called_fns_equal.

gcc/testsuite/ChangeLog:

	* g++.dg/cpp0x/overload5.C: New test.
	* g++.dg/cpp0x/overload5a.C: New test.
	* g++.dg/cpp0x/overload6.C: New test.
jcmvbkbc pushed a commit that referenced this issue May 8, 2023
… in asm in different mode

See gcc.c-torture/execute/20030222-1.c.  Consider the code for 32-bit (e.g. BE) target:
  int i, v; long x; x = v; asm ("" : "=r" (i) : "0" (x));
We generate the following RTL with reload insns:
  1. subreg:si(x:di, 0) = 0;
  2. subreg:si(x:di, 4) = v:si;
  3. t:di = x:di, dead x;
  4. asm ("" : "=r" (subreg:si(t:di,4)) : "0" (t:di))
  5. i:si = subreg:si(t:di,4);
If we assign hard reg of x to t, dead code elimination will remove insn #2
and we will use an uninitialized hard reg.  So exclude the hard reg of x for t.
We could ignore this problem for a non-empty asm using all of x's value, but it is hard to
check that the asm is expanded into insns really using x and setting r.
The old reload pass used the same approach.

gcc/ChangeLog

	* lra-constraints.cc (match_reload): Exclude some hard regs for
	multi-reg inout reload pseudos used in asm in different mode.
jcmvbkbc pushed a commit that referenced this issue May 8, 2023
Currently on xstormy16 SImode shifts by a single bit require two
instructions, and shifts by other non-zero integer immediate constants
require five instructions.  This patch implements the obvious optimization
that shifts by two bits can be done in four instructions, by using two
single-bit sequences.

Hence, ashift_2 was previously generated as:
        mov r7,r2 | shl r2,#2 | shl r3,#2 | shr r7,#14 | or r3,r7
        ret
and with this patch we now generate:
        shl r2,#1 | rlc r3,#1 | shl r2,#1 | rlc r3,#1
        ret

2023-04-23  Roger Sayle  <roger@nextmovesoftware.com>

gcc/ChangeLog
	* config/stormy16/stormy16.cc (xstormy16_output_shift): Implement
	SImode shifts by two by performing a single bit SImode shift twice.

gcc/testsuite/ChangeLog
	* gcc.target/xstormy16/shiftsi.c: New test case.
jcmvbkbc pushed a commit that referenced this issue May 8, 2023
I noticed that for member class templates of a class template we were
unnecessarily substituting both the template and its type.  Avoiding that
duplication speeds compilation of this silly testcase from ~12s to ~9s on my
laptop.  It's unlikely to make a difference on any real code, but the
simplification is also nice.

We still need to clear CLASSTYPE_USE_TEMPLATE on the partial instantiation
of the template class, but it makes more sense to do that in
tsubst_template_decl anyway.

  #define NC(X)					\
    template <class U> struct X##1;		\
    template <class U> struct X##2;		\
    template <class U> struct X##3;		\
    template <class U> struct X##4;		\
    template <class U> struct X##5;		\
    template <class U> struct X##6;
  #define NC2(X) NC(X##a) NC(X##b) NC(X##c) NC(X##d) NC(X##e) NC(X##f)
  #define NC3(X) NC2(X##A) NC2(X##B) NC2(X##C) NC2(X##D) NC2(X##E)
  template <int I> struct A
  {
    NC3(am)
  };
  template <class...Ts> void sink(Ts...);
  template <int...Is> void g()
  {
    sink(A<Is>()...);
  }
  template <int I> void f()
  {
    g<__integer_pack(I)...>();
  }
  int main()
  {
    f<1000>();
  }

gcc/cp/ChangeLog:

	* pt.cc (instantiate_class_template): Skip the RECORD_TYPE
	of a class template.
	(tsubst_template_decl): Clear CLASSTYPE_USE_TEMPLATE.
jcmvbkbc pushed a commit that referenced this issue May 22, 2023
…i parts [2]

[part #2 of PR/109279]

SPEC2017 deepsjeng uses large constants which currently generates less than
ideal code. This fix improves codegen for large constants which have
same low and hi parts: e.g.

	long long f(void) { return 0x0101010101010101ull; }

Before
        li      a5,0x1010000
        addi    a5,a5,0x101
        mv      a0,a5
        slli    a5,a5,32
        add     a0,a5,a0
        ret

With patch
	li	a5,0x1010000
	addi	a5,a5,0x101
	slli	a0,a5,32
	add	a0,a0,a5
	ret

This is testsuite clean.

gcc/ChangeLog:

	* config/riscv/riscv.cc (riscv_split_integer): if loval is equal
	to hival, ASHIFT the corresponding regs.

Signed-off-by: Vineet Gupta <vineetg@rivosinc.com>
jcmvbkbc pushed a commit that referenced this issue May 22, 2023
This patch decreases one machine instruction from "single bit extraction
with shifting" operation, and tries to eliminate the conditional
branch if CST2_POW2 doesn't fit into signed 12 bits with the help
of ifcvt optimization.

    /* example #1 */
    int test0(int x) {
      return (x & 1048576) != 0 ? 1024 : 0;
    }
    extern int foo(void);
    int test1(void) {
      return (foo() & 1048576) != 0 ? 16777216 : 0;
    }

    ;; before
    test0:
	movi	a9, 0x400
	srai	a2, a2, 10
	and	a2, a2, a9
	ret.n
    test1:
	addi	sp, sp, -16
	s32i.n	a0, sp, 12
	call0	foo
	extui	a2, a2, 20, 1
	slli	a2, a2, 20
	beqz.n	a2, .L2
	movi.n	a2, 1
	slli	a2, a2, 24
    .L2:
	l32i.n	a0, sp, 12
	addi	sp, sp, 16
	ret.n

    ;; after
    test0:
	extui	a2, a2, 20, 1
	slli	a2, a2, 10
	ret.n
    test1:
	addi	sp, sp, -16
	s32i.n	a0, sp, 12
	call0	foo
	l32i.n	a0, sp, 12
	extui	a2, a2, 20, 1
	slli	a2, a2, 24
	addi	sp, sp, 16
	ret.n

In addition, if the left shift amount ('exact_log2(CST2_POW2)') is
between 1 and 3 and either an addition or subtraction with another
register follows, emit an ADDX[248] or SUBX[248] machine instruction
instead of separate left shift and add/subtract ones.

    /* example #2 */
    int test2(int x, int y) {
      return ((x & 1048576) != 0 ? 4 : 0) + y;
    }
    int test3(int x, int y) {
      return ((x & 2) != 0 ? 8 : 0) - y;
    }

    ;; before
    test2:
	movi.n	a9, 4
	srai	a2, a2, 18
	and	a2, a2, a9
	add.n	a2, a2, a3
	ret.n
    test3:
	movi.n	a9, 8
	slli	a2, a2, 2
	and	a2, a2, a9
	sub	a2, a2, a3
	ret.n

    ;; after
    test2:
	extui	a2, a2, 20, 1
	addx4	a2, a2, a3
	ret.n
    test3:
	extui	a2, a2, 1, 1
	subx8	a2, a2, a3
	ret.n

gcc/ChangeLog:

	* config/xtensa/predicates.md (addsub_operator): New.
	* config/xtensa/xtensa.md (*extzvsi-1bit_ashlsi3,
	*extzvsi-1bit_addsubx): New insn_and_split patterns.
	* config/xtensa/xtensa.cc (xtensa_rtx_costs):
	Add a special case about ifcvt 'noce_try_cmove()' to handle
	constant loads that do not fit into signed 12 bits in the
	patterns added above.
jcmvbkbc pushed a commit that referenced this issue Jun 5, 2023
…n S[IF]mode

This patch optimizes the boolean evaluation of EQ/NE against zero
by adding two insn_and_split patterns similar to SImode conditional
store:

"eq_zero":
	op0 = (op1 == 0) ? 1 : 0;
	op0 = clz(op1) >> 5;  /* optimized (requires TARGET_NSA) */

"movsicc_ne0_reg_0":
	op0 = (op1 != 0) ? op2 : 0;
	op0 = op2; if (op1 == 0) ? op0 = op1;  /* optimized */

    /* example #1 */
    int bool_eqSI(int x) {
      return x == 0;
    }
    int bool_neSI(int x) {
      return x != 0;
    }

    ;; after (TARGET_NSA)
    bool_eqSI:
	nsau	a2, a2
	srli	a2, a2, 5
	ret.n
    bool_neSI:
	mov.n	a9, a2
	movi.n	a2, 1
	moveqz	a2, a9, a9
	ret.n

These also work in SFmode by ignoring their sign bits, and further-
more, the branch if EQ/NE against zero in SFmode is also done in the
same manner.

The reasons for this optimization in SFmode are:

  - Only zero values (negative or non-negative) contain no bits of 1
    with both the exponent and the mantissa.
  - EQ/NE comparisons involving NaNs produce no signal even if they
    are signaling.
  - Even if the use of IEEE 754 single-precision floating-point co-
    processor is configured (TARGET_HARD_FLOAT is true):
	1. Load zero value to FP register
        2. Possibly, additional FP move if the comparison target is
	   an address register
	3. FP equality check instruction
	4. Read the boolean register containing the result, or condi-
	   tional branch
    As noted above, a considerable number of instructions are still
    generated.

    /* example #2 */
    int bool_eqSF(float x) {
      return x == 0;
    }
    int bool_neSF(float x) {
      return x != 0;
    }
    int bool_ltSF(float x) {
      return x < 0;
    }
    extern void foo(void);
    void cb_eqSF(float x) {
      if(x != 0)
        foo();
    }
    void cb_neSF(float x) {
      if(x == 0)
        foo();
    }
    void cb_geSF(float x) {
      if(x < 0)
        foo();
    }

    ;; after
    ;; (TARGET_NSA, TARGET_BOOLEANS and TARGET_HARD_FLOAT)
    bool_eqSF:
	add.n	a2, a2, a2
	nsau	a2, a2
	srli	a2, a2, 5
	ret.n
    bool_neSF:
	add.n	a9, a2, a2
	movi.n	a2, 1
	moveqz	a2, a9, a9
	ret.n
    bool_ltSF:
	movi.n	a9, 0
	wfr	f0, a2
	wfr	f1, a9
	olt.s	b0, f0, f1
	movi.n	a9, 0
	movi.n	a2, 1
	movf	a2, a9, b0
	ret.n
    cb_eqSF:
	add.n	a2, a2, a2
	beqz.n	a2, .L6
	j.l	foo, a9
    .L6:
	ret.n
    cb_neSF:
	add.n	a2, a2, a2
	bnez.n	a2, .L8
	j.l	foo, a9
    .L8:
	ret.n
    cb_geSF:
	addi	sp, sp, -16
	movi.n	a3, 0
	s32i.n	a12, sp, 8
	s32i.n	a0, sp, 12
	mov.n	a12, a2
	call0	__unordsf2
	bnez.n	a2, .L10
	movi.n	a3, 0
	mov.n	a2, a12
	call0	__gesf2
	bnei	a2, -1, .L10
	l32i.n	a0, sp, 12
	l32i.n	a12, sp, 8
	addi	sp, sp, 16
	j.l	foo, a9
    .L10:
	l32i.n	a0, sp, 12
	l32i.n	a12, sp, 8
	addi	sp, sp, 16
	ret.n

gcc/ChangeLog:

	* config/xtensa/predicates.md (const_float_0_operand):
	Rename from obsolete "const_float_1_operand" and change the
	constant to compare.
	(cstoresf_cbranchsf_operand, cstoresf_cbranchsf_operator):
	New.
	* config/xtensa/xtensa.cc (xtensa_expand_conditional_branch):
	Add code for EQ/NE comparison with constant zero in SFmode.
	(xtensa_expand_scc): Added code to derive boolean evaluation
	of EQ/NE with constant zero for comparison in SFmode.
	(xtensa_rtx_costs): Change cost of CONST_DOUBLE with value
	zero inside "cbranchsf4" to 0.
	* config/xtensa/xtensa.md (cbranchsf4, cstoresf4):
	Change "match_operator" and the third "match_operand" to the
	ones mentioned above.
	(movsicc_ne0_reg_zero, eq_zero): New.
jcmvbkbc pushed a commit that referenced this issue Sep 5, 2023
This patch is inspired by Jakub's work on PR rtl-optimization/110717.
The bitfield example described in comment #2, looks like:

struct S { __int128 a : 69; };
unsigned type bar (struct S *p) {
  return p->a;
}

which on x86_64 with -O2 currently generates:

bar:    movzbl  8(%rdi), %ecx
        movq    (%rdi), %rax
        andl    $31, %ecx
        movq    %rcx, %rdx
        salq    $59, %rdx
        sarq    $59, %rdx
        ret

The ANDL $31 is interesting... we first extract an unsigned 69-bit
bitfield by masking/clearing the top bits of the most significant word,
and then it gets sign-extended by left shifting and arithmetic right
shifting.  Obviously, this bit-wise AND is redundant: for signed
bit-fields we don't require these bits to be cleared if we're about to
set them appropriately.

This patch eliminates this redundancy in the middle-end, during RTL
expansion, by extending the extract_bit_field APIs so that the integer
UNSIGNEDP argument takes a special value: 0 indicates the field should
be sign extended, 1 (any non-zero value) indicates the field should be
zero extended, and -1 indicates a third option, that we don't care how
or whether the field is extended.  By passing and checking this
sentinel value at the appropriate places we avoid the useless bit
masking (on all targets).
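
As a sketch of the arithmetic involved (an illustration with a
hypothetical helper name, not code from the patch): the upper 5 bits
of the 69-bit field live in the most significant 64-bit word, and the
left/arithmetic-right shift pair rewrites bits 5..63 regardless, so a
preceding "& 31" mask cannot affect the result.

    #include <stdint.h>

    /* hypothetical helper: sign-extend the 5 field bits held in the
       low bits of the most significant word (69 - 64 = 5, 64 - 5 = 59);
       relies on GCC's arithmetic right shift of signed values */
    int64_t sign_extend_msw (uint64_t msw)
    {
      return (int64_t) (msw << 59) >> 59;
    }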

For the test case above, with this patch we now generate:

bar:    movzbl  8(%rdi), %ecx
        movq    (%rdi), %rax
        movq    %rcx, %rdx
        salq    $59, %rdx
        sarq    $59, %rdx
        ret

2023-08-04  Roger Sayle  <roger@nextmovesoftware.com>

gcc/ChangeLog
	* expmed.cc (extract_bit_field_1): Document that an UNSIGNEDP
	value of -1 is equivalent to don't care.
	(extract_integral_bit_field): Indicate that we don't require
	the most significant word to be zero extended, if we're about
	to sign extend it.
	(extract_fixed_bit_field_1): Document that an UNSIGNEDP value
	of -1 is equivalent to don't care.  Don't clear the most
	significant bits with AND mask when UNSIGNEDP is -1.

gcc/testsuite/ChangeLog
	* gcc.target/i386/pr110717-2.c: New test case.
jcmvbkbc pushed a commit that referenced this issue Nov 11, 2023
Given the example below for a VLS mode:

void
test (vl_t *u)
{
  vl_t t;
  long long *p = (long long *)&t;

  p[0] = p[1] = 2;

  *u = t;
}
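
For reference, vl_t is presumably a fixed-length (VLS) vector type; a
hypothetical definition consistent with the V2DI mode in the RTL dump
below would be:

    /* hypothetical: a 128-bit fixed-length vector of two 64-bit lanes,
       i.e. V2DI; the actual test uses the type from the RVV VLS tests */
    typedef long long vl_t __attribute__ ((vector_size (16)));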

The vec_set pattern will simplify the insn to vmv.s.x when the index is
0, without a merged operand.  That will result in problems in DCE, as
shown below:

1:  137[DI] = a0
2:  138[V2DI] = 134[V2DI]                              // deleted by DCE
3:  139[DI] = #2                                       // deleted by DCE
4:  140[DI] = #2                                       // deleted by DCE
5:  141[V2DI] = vec_dup:V2DI (139[DI])                 // deleted by DCE
6:  138[V2DI] = vslideup_imm (138[V2DI], 141[V2DI], 1) // deleted by DCE
7:  135[V2DI] = 138[V2DI]                              // deleted by DCE
8:  142[V2DI] = 135[V2DI]                              // deleted by DCE
9:  143[DI] = #2
10: 142[V2DI] = vec_dup:V2DI (143[DI])
11: (137[DI]) = 142[V2DI]

The upper 64 bits of 142[V2DI] are unknown here, which generates
incorrect code when the value is stored back to memory.  This patch
fixes the issue by adding a new SCALAR_MOVE_MERGED_OP for vec_set.

Please note this patch doesn't enable VLS modes for vec_set; the
follow-up patches will support this soon.

gcc/ChangeLog:

	* config/riscv/autovec.md: Bugfix.
	* config/riscv/riscv-protos.h (SCALAR_MOVE_MERGED_OP): New enum.

gcc/testsuite/ChangeLog:

	* gcc.target/riscv/rvv/base/scalar-move-merged-run-1.c: New test.

Signed-off-by: Pan Li <pan2.li@intel.com>
jcmvbkbc pushed a commit that referenced this issue Dec 31, 2023
Improve stack protector patterns and peephole2s even more:

a. Use unrelated register clears with integer mode size <= word
   mode size to clear stack protector scratch register.

b. Use unrelated register initializations in front of stack
   protector sequence to clear stack protector scratch register.

c. Use unrelated register initializations using LEA instructions
   to clear stack protector scratch register.

These stack protector improvements reuse 6914 unrelated register
initializations to substitute for the clear of the stack protector
scratch register in 12034 instances of the stack protector sequence in
a recent Linux defconfig build.
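
As a hedged illustration (not taken from the patch or the kernel
build), a function like the following contains the kind of stack
protector sequence these peephole2 patterns rewrite:

    /* illustration only: with -fstack-protector-strong the local
       buffer forces a canary store/compare sequence; the peephole2
       patterns reuse a nearby unrelated register initialization
       instead of emitting a separate clear of the scratch register
       that wipes the canary copy */
    #include <string.h>

    unsigned char first_byte (const char *s)
    {
      char buf[32];
      strncpy (buf, s, sizeof buf - 1);
      buf[sizeof buf - 1] = '\0';
      return (unsigned char) buf[0];
    }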

gcc/ChangeLog:

	* config/i386/i386.md (@stack_protect_set_1_<PTR:mode>_<W:mode>):
	Use W mode iterator instead of SWI48.  Output MOV instead of XOR
	for TARGET_USE_MOV0.
	(stack_protect_set_1 peephole2): Use integer modes with
	mode size <= word mode size for operand 3.
	(stack_protect_set_1 peephole2 #2): New peephole2 pattern to
	substitute stack protector scratch register clear with unrelated
	register initialization, originally in front of stack
	protector sequence.
	(*stack_protect_set_3_<PTR:mode>_<SWI48:mode>): New insn pattern.
	(stack_protect_set_1 peephole2): New peephole2 pattern to
	substitute stack protector scratch register clear with unrelated
	register initialization involving LEA instruction.
jcmvbkbc pushed a commit that referenced this issue Dec 31, 2023
Use unrelated register initializations using zero/sign-extend instructions
to clear stack protector scratch register.

Handle only SI -> DImode extensions for 64-bit targets, as this is the
only extension that triggers the peephole in a non-negligible number
of cases.

Also use explicit check for word_mode instead of mode iterator in peephole2
patterns to avoid pattern explosion.

gcc/ChangeLog:

	* config/i386/i386.md (stack_protect_set_1 peephole2):
	Explicitly check operand 2 for word_mode.
	(stack_protect_set_1 peephole2 #2): Ditto.
	(stack_protect_set_2 peephole2): Ditto.
	(stack_protect_set_3 peephole2): Ditto.
	(*stack_protect_set_4z_<mode>_di): New insn pattern.
	(*stack_protect_set_4s_<mode>_di): Ditto.
	(stack_protect_set_4 peephole2): New peephole2 pattern to
	substitute stack protector scratch register clear with unrelated
	register initialization involving zero/sign-extend instruction.
jcmvbkbc pushed a commit that referenced this issue Dec 31, 2023
Since the last import from upstream libsanitizer, the output has changed
and now looks more like this:

READ of size 6 at 0x7ff7beb2a144 thread T0
    #0 0x101cf7796 in MemcmpInterceptorCommon(void*, int (*)(void const*, void const*, unsigned long), void const*, void const*, unsigned long) sanitizer_common_interceptors.inc:813
    #1 0x101cf7b99 in memcmp sanitizer_common_interceptors.inc:840
    #2 0x108a0c39f in __stack_chk_guard+0xf (dyld:x86_64+0x8039f)

so let's adjust the pattern accordingly.

gcc/testsuite/ChangeLog:

	* c-c++-common/asan/memcmp-1.c: Adjust pattern on darwin.
jcmvbkbc pushed a commit that referenced this issue Dec 31, 2023
…-int (PR target/112413)

On m68k the compiler assumes that the PC-relative jump-via-jump-table
instruction and the jump table are adjacent with no padding in between.

When -mlong-jump-table-offsets is combined with -malign-int, a 2-byte
nop may be inserted before the jump table, causing the jump to add the
fetched offset to the wrong PC base and thus jump to the wrong address.

Fixed by referencing the jump table via its label. On the test case
in the PR the object code change is (the moveal at 16 is the nop):

    a:  6536            bcss 42 <f+0x42>
    c:  e588            lsll #2,%d0
    e:  203b 0808       movel %pc@(18 <f+0x18>,%d0:l),%d0
-  12:  4efb 0802       jmp %pc@(16 <f+0x16>,%d0:l)
+  12:  4efb 0804       jmp %pc@(18 <f+0x18>,%d0:l)
   16:  284c            moveal %a4,%a4
   18:  0000 0020       orib #32,%d0
   1c:  0000 002c       orib #44,%d0
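
The PR's test case is not reproduced here; a hypothetical function of
the same shape -- a dense switch lowered to the PC-relative
jump-via-jump-table sequence above -- would be:

    /* hypothetical stand-in for the PR test case: a dense switch that
       is normally compiled to the jump table shown above */
    int f (int i)
    {
      switch (i)
        {
        case 0:  return 11;
        case 1:  return 22;
        case 2:  return 33;
        case 3:  return 44;
        case 4:  return 55;
        default: return -1;
        }
    }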

Bootstrapped and tested on m68k-linux-gnu, no regressions.

Note: I don't have commit rights, so I would need assistance applying this.

	PR target/112413
gcc/

	* config/m68k/linux.h (ASM_RETURN_CASE_JUMP): For
	TARGET_LONG_JUMP_TABLE_OFFSETS, reference the jump table
	via its label.
	* config/m68k/m68kelf.h (ASM_RETURN_CASE_JUMP): Likewise.
	* config/m68k/netbsd-elf.h (ASM_RETURN_CASE_JUMP): Likewise.
jcmvbkbc pushed a commit that referenced this issue Dec 31, 2023
During partial ordering, we want to look through dependent alias
template specializations within template arguments and otherwise
treat them as opaque in other contexts (see e.g. r7-7116-g0c942f3edab108
and r11-7011-g6e0a231a4aa240).  To that end template_args_equal was
given a partial_order flag that controls this behavior.  This flag
does the right thing when a dependent alias template specialization
appears as template argument of the partial specialization, e.g. in

  template<class T, class...> using first_t = T;
  template<class T> struct traits;
  template<class T> struct traits<first_t<T, T&>> { }; // #1
  template<class T> struct traits<first_t<const T, T&>> { }; // #2

we correctly consider #2 to be more specialized than #1.  But if the
alias specialization appears as a nested template argument of another
class template specialization, e.g. in

  template<class T> struct traits<A<first_t<T, T&>>> { }; // #1
  template<class T> struct traits<A<first_t<const T, T&>>> { }; // #2

then we incorrectly consider #1 and #2 to be unordered.  This is because

  1. we don't propagate the flag to recursive template_args_equal calls
  2. we don't use structural equality for class template specializations
     written in terms of dependent alias template specializations

This patch fixes the first issue by turning the partial_order flag into
a global.  This patch fixes the second issue by making us propagate
structural equality appropriately when building a class template
specialization.  In passing this patch also improves hashing of
specializations that use structural equality.

	PR c++/90679

gcc/cp/ChangeLog:

	* cp-tree.h (comp_template_args): Remove partial_order parameter.
	(template_args_equal): Likewise.
	* pt.cc (comparing_for_partial_ordering): New global flag.
	(iterative_hash_template_arg) <case tcc_type>: Hash the template
	and arguments for specializations that use structural equality.
	(template_args_equal): Remove partial order parameter and
	use comparing_for_partial_ordering instead.
	(comp_template_args): Likewise.
	(comp_template_args_porder): Set comparing_for_partial_ordering
	instead.  Make static.
	(any_template_arguments_need_structural_equality_p): Return true
	for an argument that's a dependent alias template specialization
	or a class template specialization that itself needs structural
	equality.
	* tree.cc (cp_tree_equal) <case TREE_VEC>: Adjust call to
	comp_template_args.

gcc/testsuite/ChangeLog:

	* g++.dg/cpp0x/alias-decl-75a.C: New test.
	* g++.dg/cpp0x/alias-decl-75b.C: New test.
jcmvbkbc pushed a commit that referenced this issue Feb 7, 2024
This patch adjusts the costs so that we treat REG and SUBREG expressions the
same for costing.

This was motivated by bt_skip_func and bt_find_func in xz and results in nearly
a 5% improvement in the dynamic instruction count for input #2, and smaller but
definitely visible improvements pretty much across the board.  Exceptions would
be perlbench input #1 and exchange2, which showed very small regressions.

In the bt_find_func and bt_skip_func cases we have  something like this:

> (insn 10 7 11 2 (set (reg/v:DI 136 [ x ])
>         (zero_extend:DI (subreg/s/u:SI (reg/v:DI 137 [ a ]) 0))) "zz.c":6:21 387 {*zero_extendsidi2_bitmanip}
>      (nil))
> (insn 11 10 12 2 (set (reg:DI 142 [ _1 ])
>         (plus:DI (reg/v:DI 136 [ x ])
>             (reg/v:DI 139 [ b ]))) "zz.c":7:23 5 {adddi3}
>      (nil))

[ ... ]
> (insn 13 12 14 2 (set (reg:DI 143 [ _2 ])
>         (plus:DI (reg/v:DI 136 [ x ])
>             (reg/v:DI 141 [ c ]))) "zz.c":8:23 5 {adddi3}
>      (nil))

Note the two uses of (reg 136). The best way to handle that in combine might be
a 3->2 split.  But there's a much better approach if we look at fwprop...

(set (reg:DI 142 [ _1 ])
    (plus:DI (zero_extend:DI (subreg/s/u:SI (reg/v:DI 137 [ a ]) 0))
        (reg/v:DI 139 [ b ])))
change not profitable (cost 4 -> cost 8)

So that should be the same cost as a regular DImode addition when the ZBA
extension is enabled.  But it ends up costing more because the clause to cost
this variant isn't prepared to handle a SUBREG.  That results in the RTL above
having too high a cost and fwprop gives up.
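
A C-level shape that produces RTL like the above (a hypothetical
reduction; the real xz routines are larger) is:

    /* hypothetical reduction: one 32-bit value zero-extended to
       64 bits and then used in two 64-bit additions, matching the
       two uses of (reg 136) above */
    unsigned long
    f (unsigned int a, unsigned long b, unsigned long c,
       unsigned long *out)
    {
      unsigned long x = a;  /* zero_extend:DI (subreg:SI ...) */
      *out = x + b;         /* first use of the extended value */
      return x + c;         /* second use */
    }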

One approach would be to replace REG_P with REG_P || SUBREG_P in the
costing code.  I ultimately decided against that and instead check
whether the operand in question passes register_operand.

By far the most important case to handle is the DImode PLUS.  But for the sake
of consistency, I changed the other instances in riscv_rtx_costs as well.  For
those other cases we're talking about improvements in the .000001% range.

While we are in stage4, this only touches cost modeling, which we've generally
agreed is still appropriate (though we were mostly talking about vector).  So
I'm going to extend that general agreement ever so slightly and include scalar
cost modeling :-)

gcc/
	* config/riscv/riscv.cc (riscv_rtx_costs): Handle SUBREG and REG
	similarly.

gcc/testsuite/

	* gcc.target/riscv/reg_subreg_costs.c: New test.

	Co-authored-by: Jivan Hakobyan <jivanhakobyan9@gmail.com>
jcmvbkbc pushed a commit that referenced this issue May 11, 2024
We evaluate constexpr functions on the original, pre-genericization bodies.
That means that the function body we're evaluating will not have gone
through cp_genericize_r's "Map block scope extern declarations to visible
declarations with the same name and type in outer scopes if any".  Here:

  constexpr bool bar() { return true; } // #1
  constexpr bool foo() {
    constexpr bool bar(void); // #2
    return bar();
  }

it means that we:
1) register_constexpr_fundef (#1)
2) cp_genericize (#1)
   nothing interesting happens
3) register_constexpr_fundef (foo)
   does copy_fn, so we have two copies of the BIND_EXPR
4) cp_genericize (foo)
   this remaps #2 to #1, but only on one copy of the BIND_EXPR
5) retrieve_constexpr_fundef (foo)
   we find it, no problem
6) retrieve_constexpr_fundef (#2)
   and here #2 isn't found in constexpr_fundef_table, because
   we're working on the BIND_EXPR copy where #2 wasn't mapped to #1
   so we fail.  We've only registered #1.

It should work to use DECL_LOCAL_DECL_ALIAS (which used to be
extern_decl_map).  We evaluate constexpr functions on pre-cp_fold
bodies to avoid diagnostic problems, but the remapping I'm proposing
should not interfere with diagnostics.

This is not a problem for a global scope redeclaration; there we go
through duplicate_decls which keeps the DECL_UID:
  DECL_UID (olddecl) = olddecl_uid;
and DECL_UID is what constexpr_fundef_hasher::hash uses.

	PR c++/111132

gcc/cp/ChangeLog:

	* constexpr.cc (get_function_named_in_call): Use
	cp_get_fndecl_from_callee.
	* cvt.cc (cp_get_fndecl_from_callee): If there's a
	DECL_LOCAL_DECL_ALIAS, use it.

gcc/testsuite/ChangeLog:

	* g++.dg/cpp0x/constexpr-redeclaration3.C: New test.
	* g++.dg/cpp0x/constexpr-redeclaration4.C: New test.
jcmvbkbc pushed a commit that referenced this issue Jun 17, 2024
Here, during overload resolution, we have two strictly viable ambiguous
candidates #1 and #2, and two non-strictly viable candidates #3 and #4,
which we have held on to ever since r14-6522.  These latter candidates
have an empty second arg conversion, since the first arg conversion was
deemed bad, and this trips up joust, which assumes all arg conversions
are present, when called on #3 and #4.

We can fix this by making joust robust to empty arg conversions, but in
this situation we shouldn't need to compare #3 and #4 at all given that
we have a strictly viable candidate.  To that end, this patch makes
tourney shortcut considering non-strictly viable candidates upon
encountering ambiguity between two strictly viable candidates (taking
advantage of the fact that the candidates list is sorted according to
viability via splice_viable).

	PR c++/115239

gcc/cp/ChangeLog:

	* call.cc (tourney): Don't consider a non-strictly viable
	candidate as the champ if there was ambiguity between two
	strictly viable candidates.

gcc/testsuite/ChangeLog:

	* g++.dg/overload/error7.C: New test.

Reviewed-by: Jason Merrill <jason@redhat.com>
jcmvbkbc pushed a commit that referenced this issue Nov 10, 2024
We currently crash on the following invalid code (notice the "void
void **" parameter):

=== cut here ===
using size_t = decltype(sizeof(int));
void *operator new(size_t, void void **p) noexcept { return p; }
int x;
void f() {
    int y;
    new (&y) int(x);
}
=== cut here ===

The problem is that in this case, we end up with a NULL_TREE parameter
list for the new operator because of the error, and (1) coerce_new_type
wrongly complains about the first parameter type not being size_t,
(2) std_placement_new_fn_p blindly accesses the parameter list, hence a
crash.

This patch does NOT address #1, since we can't easily distinguish a new
operator declaration without parameters from one with erroneous
parameters (and it's not worth the risk of refactoring and breaking
things for an error-recovery issue), hence a dg-bogus in new52.C, but
it does address #2 and the ICE by simply checking the first parameter
against NULL_TREE.

It also adds a new testcase checking that we complain about new
operators with no or invalid first parameters, since we did not have
any.

	PR c++/117101

gcc/cp/ChangeLog:

	* init.cc (std_placement_new_fn_p): Check first_arg against
	NULL_TREE.

gcc/testsuite/ChangeLog:

	* g++.dg/init/new52.C: New test.
	* g++.dg/init/new53.C: New test.
jcmvbkbc pushed a commit that referenced this issue Nov 10, 2024
The second source register of insn "*extzvsi-1bit_addsubx" cannot be the
same as the destination register, because that register will be overwritten
with an intermediate value after insn splitting.

     /* example #1 */
     int test1(int b, int a) {
       return ((a & 1024) ? 4 : 0) + b;
     }

     ;; result #1 (incorrect)
     test1:
     	extui	a2, a3, 10, 1	;; overwrites A2 before used
     	addx4	a2, a2, a2
     	ret.n

This patch fixes that.

     ;; result #1 (correct)
     test1:
     	extui	a3, a3, 10, 1	;; uses A3 and then overwrites
     	addx4	a2, a3, a2
     	ret.n

However, it should be noted that the first source register can be the same
as the destination without any problems.

     /* example #2 */
     int test2(int a, int b) {
       return ((a & 1024) ? 4 : 0) + b;
     }

     ;; result (correct)
     test2:
     	extui	a2, a2, 10, 1	;; uses A2 and then overwrites
     	addx4	a2, a2, a3
     	ret.n

gcc/ChangeLog:

	* config/xtensa/xtensa.md (*extzvsi-1bit_addsubx):
	Add '&' to the destination register constraint to indicate that
	it is 'earlyclobber', append '0' to the first source register
	constraint to indicate that it can be the same as the destination
	register, and change the split condition from 1 to reload_completed
	so that the insn will be split only after RA in order to obtain
	allocated registers that satisfy the above constraints.