gergo-/missed-optimizations

Here’s a list of some missed optimizations in C compilers. To my knowledge,
these haven’t been reported by others (but it’s rather difficult to search for
such things, and people may not have bothered to write these up before). I
have reported some of these in bug trackers and/or developed rough
patches for some; see links below.

These are not correctness bugs, only cases where a compiler misses an
opportunity to generate simpler or otherwise (seemingly) more efficient
code. None of these are likely to affect the actual, measurable performance
of real applications. Moreover, I haven’t measured every one of these: The seemingly
inefficient code might somehow turn out to be faster on real hardware.

I have tested GCC, Clang, and CompCert, mostly targeting ARM at -O3.
Unless noted otherwise, the examples below can be reproduced with the
following versions and configurations:

Compiler  Version               Revision                     Date        Configuration
GCC       8.0.0 20170906        r251801                      2017-09-06  --target=armv7a-eabihf --with-arch=armv7-a --with-fpu=vfpv3-d16 --with-float-abi=hard --with-float=hard
Clang     6.0.0 (trunk 312637)  LLVM r312635, Clang r312634  2017-09-06  --target=armv7a-eabihf -march=armv7-a -mfpu=vfpv3-d16 -mfloat-abi=hard
CompCert  3.0.1                 f3c60be                      2017-08-28  armv7a-linux

For each example, at least one of these three compilers (usually GCC or
Clang) produces the “good” code.

This list was put together by Gergö Barany (gergo.barany@inria.fr). I’m
interested in feedback.

GCC

Useless initialization of struct passed by value

struct S0 {
  int f0;
  int f1;
  int f2;
  int f3;
};

int f1(struct S0 p) {
    return p.f0;
}

The struct is passed in registers, and the function’s result is already in
r0, which is also the return register. The function could return
immediately, but GCC first stores all the struct fields to the stack and
reloads the first field:

    sub sp, sp, #16
    add ip, sp, #16
    stmdb   ip, {r0, r1, r2, r3}
    ldr r0, [sp]
    add sp, sp, #16
    bx  lr

Reported at https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80905.

float to char type conversion goes through memory

char fn1(float p1) {
  return (char) p1;
}

results in:

    vcvt.u32.f32    s15, s0
    sub sp, sp, #8
    vstr.32 s15, [sp, #4]   @ int
    ldrb    r0, [sp, #4]    @ zero_extendqisi2

i.e., the result of the conversion in s15 is stored to the stack and then
reloaded into r0 instead of just copying it between the registers (and
presumably truncating to char’s range). The same happens for double instead of
float, but not for short instead of char.

Reported at https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80861.

Spill instead of register copy

int N;
int fn2(int p1, int p2) {
  int a = p2;
  if (N)
    a *= 10.5;
  return a;
}

GCC spills a even though enough integer registers would be available:

    str r1, [sp, #4]    // initialize a to p2
    cmp r3, #0
    beq .L13
    ... compute multiplication, put into d7 ...
    vcvt.s32.f64    s15, d7
    vstr.32 s15, [sp, #4]   @ int   // update a in branch
.L13:
    ldr r0, [sp, #4]    // reload a

Reported at https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81012 (together
with another issue further below).

Missed simplification of multiplication by integer-valued floating-point constant

Variant of the above code with the constant changed slightly:

int N;
int fn5(int p1, int p2) {
  int a = p2;
  if (N)
    a *= 10.0;
  return a;
}

GCC converts a to double and back as above, but the result must be the
same as simply multiplying by the integer 10. Clang realizes this and
generates an integer multiply, eliminating all floating-point operations.
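
As a rough C-level sketch of the equivalence Clang exploits (the function name
here is made up for illustration; N refers to the global from the example above):

extern int N;

int fn5_as_integer(int p1, int p2) {
  int a = p2;
  if (N)
    a *= 10;   /* same result as a *= 10.0 when a is an int */
  return a;
}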

Useless store for int to double conversion

double fn3(int p1, short p2) {
  double a = p1;
  if (1881 * p2 + 5)
    a = 2;
  return a;
}

a is initialized to the value of p1, and GCC spills this value. The
spill store is dead; the value is never reloaded:

    ...
    str r0, [sp, #4]
    cmn r1, #5
    vmovne.f64  d0, #2.0e+0
    vmoveq  s15, r0 @ int
    vcvteq.f64.s32  d0, s15
    add sp, sp, #8
    @ sp needed
    bx  lr

Reported at https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81012 (together
with another issue above).

Missed strength reduction of division to comparison

unsigned int fn1(unsigned int a, unsigned int p2) {
  int b;
  b = 3;
  if (p2 / a)
    b = 7;
  return b;
}

The condition p2 / a is true (nonzero) iff p2 >= a. Clang compares, GCC
generates a division (on ARM, x86, and PowerPC).
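
A C-level sketch of the comparison-based rewrite (illustrative only, not
Clang’s literal output):

unsigned int fn1_reduced(unsigned int a, unsigned int p2) {
  /* for unsigned operands, p2 / a is nonzero exactly when p2 >= a */
  int b = 3;
  if (p2 >= a)
    b = 7;
  return b;
}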

Missed simplification of floating-point addition

long fn1(int p1) {
  double a;
  long b = a = p1;
  if (0 + a + 0)
    b = 9;
  return b;
}

It’s not legal in general to replace a + 0 by a in floating-point
arithmetic due to NaNs and negative zeros and whatnot, but here a is
initialized from an int’s value, so none of these cases can occur. GCC
generates all the floating-point operations, while Clang just compares p1
to zero.
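
Since a holds p1’s exact value, the whole condition reduces to a test of p1
against zero; a sketch of the simplified function (hypothetical name):

long fn1_simplified(int p1) {
  /* 0 + a + 0 equals (double)p1, which is nonzero iff p1 is nonzero */
  return p1 ? 9 : 0;
}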

Missed reassociation of multiplications

double fn1(double p1) {
  double v;
  v = p1 * p1 * (-p1 * p1);
  return v;
}

int fn2(int p1) {
  int v;
  v = p1 * p1 * (-p1 * p1);
  return v;
}

For each function, GCC generates three multiplications:

    vnmul.f64   d7, d0, d0
    vmul.f64    d0, d0, d0
    vmul.f64    d0, d7, d0

and

    rsb r2, r0, #0
    mul r3, r0, r0
    mul r0, r0, r2
    mul r0, r3, r0

Clang can do the same with two multiplications each:

    vmul.f64    d0, d0, d0
    vnmul.f64   d0, d0, d0

and

    mul r0, r0, r0
    rsb r1, r0, #0
    mul r0, r0, r1

Heroic loop optimization instead of removing the loop

int N;
double fn1(double *p1) {
  int i = 0;
  double v = 0;
  while (i < N) {
    v = p1[i];
    i++;
  }
  return v;
}

This function returns 0 if N is 0 or negative; otherwise, it returns
p1[N-1]. GCC compiles it as such on ARM (without a loop), but on x86-64, it
generates a complex unrolled loop. The code is too long and boring to show
here, but it’s similar enough to https://godbolt.org/g/RYwgq4.
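
A loop-free C equivalent, as a sketch of what the function could be reduced to
(not GCC’s actual output on either target):

extern int N;

double fn1_noloop(double *p1) {
  /* the loop only keeps the last element read, so one bounds check
     and one load suffice */
  return (N > 0) ? p1[N - 1] : 0.0;
}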

Unnecessary spilling due to badly scheduled move-immediates

double fn1(double p1, float p2, double p3, float p4, double *p5, double p6)
{
  double a, c, d, e;
  float b;
  a = p3 / p1 / 38 / 2.56380695236e38f * 5;
  c = 946293016.021f / a + 9;
  b = p1 - 6023397303.74f / p3 * 909 * *p5;
  d = c / 683596028.301 + p3 - b + 701.296936339 - p4 * p6;
  e = p2 + d;
  return e;
}

Sorry about the mess. GCC generates a bunch of code for this, including:

    vstr.64 d10, [sp]
    vmov.f64    d11, #5.0e+0
    vmov.f64    d13, #9.0e+0
    ... computations involving d10 but not d11 or d13 ...
    vldr.64 d10, [sp]
    vcvt.f64.f32    d6, s0
    vmul.f64    d11, d7, d11
    vdiv.f64    d5, d12, d11
    vadd.f64    d13, d5, d13
    vdiv.f64    d0, d13, d14

The register d10 must be spilled to make room for the constants 5 and 9
being loaded into d11 and d13, respectively. But these loads are much
too early: Their values are only needed after d10 is restored. These move
instructions (at least one of them) should be closer to their uses, freeing
up registers and avoiding the spill.

Missed constant propagation (or so it seemed)

int fn1(int p1) {
  int a = 6;
  int b = (p1 / 12 == a);
  return b;
}
int fn2(int p1) {
  int b = (p1 / 12 == 6);
  return b;
}

These functions should be compiled identically. The condition is equivalent
to 6 * 12 <= p1 < 7 * 12, and the two signed comparisons can be replaced
by a subtraction and a single unsigned comparison:

    sub r0, r0, #72
    cmp r0, #11
    movhi   r0, #0
    movls   r0, #1
    bx  lr
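
In C terms, the rewrite shown above corresponds roughly to the following
sketch (using unsigned arithmetic to sidestep signed overflow in p1 - 72):

int fn2_reduced(int p1) {
  /* p1 / 12 == 6 holds exactly for 72 <= p1 <= 83 */
  return ((unsigned int)p1 - 72u) <= 11u;
}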

GCC used to do this for the first function, but not the second one. It
looked like constant propagation did not replace a by 6.

Reported at https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81346 and fixed in
revision r250342; it turns out that this pattern used to be implemented
(only) in a very early folding pass that ran before constant propagation.

Missed tail-call optimization for compiler-generated call to intrinsic

int fn1(int a, int b) {
  return a / b;
}

On ARM, integer division is turned into a function call:

    push    {r4, lr}
    bl  __aeabi_idiv
    pop {r4, pc}

In the code above, GCC allocates a stack frame, not realizing that this call
in tail position could simply be

    b   __aeabi_idiv

(When the call, even to __aeabi_idiv, appears in the source code, GCC is
perfectly capable of compiling it as a tail call.)
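
For comparison, a sketch of the explicit-call variant the parenthetical above
refers to (the __aeabi_idiv prototype here is an assumption):

extern int __aeabi_idiv(int numerator, int denominator);

int fn1_explicit(int a, int b) {
  /* with the call visible in the source, GCC compiles it as a tail call */
  return __aeabi_idiv(a, b);
}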

Clang

Incomplete range analysis for small types

short fn1(short p) {
  short c = p / 32769;
  return c;
}

int fn2(signed char p) {
  int c = p / 129;
  return c;
}

In both cases, the divisor is outside the possible range of values for p,
so the function’s result is always 0. Clang doesn’t realize this and
generates code to compute the result. (It does optimize corresponding cases
for unsigned short and char.)
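
Constant-folded equivalents, as a sketch of what range analysis could establish
(assuming 16-bit short and 8-bit signed char, as on these targets):

short fn1_folded(short p) {
  /* p is in [-32768, 32767], so p / 32769 is always 0 */
  (void)p;
  return 0;
}

int fn2_folded(signed char p) {
  /* p is in [-128, 127], so p / 129 is always 0 */
  (void)p;
  return 0;
}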

More incomplete range analysis

int fn1(short p4, int p5) {
  int v = 0;
  if ((p4 - 32779) - !p5)
    v = 7;
  return v;
}

The condition is always true, and the function always returns 7: p4 is a
short, so p4 - 32779 is at most 32767 - 32779 = -12, and subtracting !p5
(which is 0 or 1) cannot bring it to zero. Clang nevertheless generates code
to evaluate the condition. Similar to the case above, but it seems to be more
complicated: if the - !p5 is removed, Clang also realizes that the condition
is always true.

Even more incomplete range analysis

int fn1(int p, int q) {
  int a = q + (p % 6) / 9;
  return a;
}

int fn2(int p) {
  return (1 / p) / 100;
}
int fn3(int p5, int p6) {
  int a, c, v;
  v = -!p5;
  a = v + 7;
  c = (4 % a);
  return c;
}
int fn4(int p4, int p5) {
  int a = 5;
  if ((5 % p5) / 9)
    a = p4;
  return a;
}

The first function always returns q, the others always return 0, 4, and 5,
respectively. Clang evaluates at least one division or modulo operation in
each case.
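
The folded equivalents, with the range reasoning as comments (a sketch of the
argument, not Clang’s output):

int fn1_folded(int p, int q) {
  /* p % 6 is in [-5, 5], so (p % 6) / 9 == 0 */
  return q;
}
int fn2_folded(int p) {
  /* 1 / p is in [-1, 1], so (1 / p) / 100 == 0 */
  return 0;
}
int fn3_folded(int p5, int p6) {
  /* -!p5 + 7 is 6 or 7, and 4 % 6 == 4 % 7 == 4 */
  return 4;
}
int fn4_folded(int p4, int p5) {
  /* 5 % p5 is in [0, 5], so (5 % p5) / 9 == 0 */
  return 5;
}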

(I have a lot more examples like this, but this list is boring enough as
it is.)

Redundant code generated for double to int conversion

int fn1(double c, int *p, int *q, int *r, int *s) {
  int i = (int) c;
  *p = i;
  *q = i;
  *r = i;
  *s = i;
  return i;
}

There is a single type conversion at the source level. Clang duplicates the
conversion operation (vcvt) for each use of i:

    vcvt.s32.f64    s2, d0
    vstr    s2, [r0]
    vcvt.s32.f64    s2, d0
    vstr    s2, [r1]
    vcvt.s32.f64    s2, d0
    vstr    s2, [r2]
    vcvt.s32.f64    s2, d0
    vcvt.s32.f64    s0, d0
    vmov    r0, s0
    vstr    s2, [r3]

Several times, the result of the conversion is written into the register
that already holds that very same value (s2).

Reported at https://bugs.llvm.org/show_bug.cgi?id=33199.

Missing instruction selection pattern for vnmla

double fn1(double p1, double p2, double p3) {
  double a = -p1 - p2 * p3;
  return a;
}

Clang generates

    vneg.f64    d0, d0
    vmls.f64    d0, d1, d2

but this could just be

    vnmla.f64   d0, d1, d2

The instruction selector has a pattern for selecting vnmla for the
expression -(a * b) - dst, but not for the symmetric -dst - (a * b).

Patch submitted: https://reviews.llvm.org/D35911

CompCert

No dead code elimination for useless branches

void fn1(int r0) {
  if (r0 + 42)
    0;
}

The conditional branch due to the if statement is eliminated, but the
computation for its condition remains in the generated code:

    add r0, r0, #42

This seems to be because dead code elimination runs before useless
branches are identified and eliminated.

Missing constant folding patterns for modulo

int fn4(int p1) {
  int a = 0 % p1;
  return a;
}

int fn5(int a) {
  int b = a % a;
  return b; 
} 

In both cases, CompCert generates a modulo operation (on ARM, it generates a
library call, a multiply, and a subtraction). In both cases, this operation
could be constant folded to zero.

My patch at
https://github.com/gergo-/CompCert/commit/78555eb7afacac184d94db43c9d4438d20502be8
fixes the first one, but not the second. It also fails to handle the
equivalent case of

  int zero = 0;
  int a = zero % p1;

because, by the time the constant propagator runs, the modulo operation has
been mangled into a complex call/multiply/subtract sequence that is
inconvenient to recognize.

Failure to generate movw instruction

int fn1(int p1) {
  int a = 506 * p1;
  return a;
}

results in:

    mov r1, #250
    orr r1, r1, #256
    mul r0, r0, r1

Two instructions are spent on loading the constant. However, ARM’s movw
instruction can take an immediate between 0 and 65535, so this could be
simplified to:

    movw r1, #506

I have a patch at
https://github.com/gergo-/CompCert/commit/8aa7b1ce3b0228232e027773752727f61b42a847.

Failure to constant fold mla

Call this code mla.c:

int fn1(int p1) {
  int a, b, c, d, e, v, f;
  a = 0;
  b = c = 0;
  d = e = p1;
  v = 4;
  f = e * d | a * p1 + b;
  return f;
}

The expression a * p1 + b evaluates to 0 * p1 + 0, i.e., 0. CompCert is
able to fold both 0 * x and x + 0 in isolation, but on ARM it generates
an mla (multiply-add) instruction, for which it has no constant folding
rules:

    mov r4, #0
    mov r12, #0
    ...
    mla r2, r4, r0, r12

I have a patch at
https://github.com/gergo-/CompCert/commit/37e7fd10fdb737dc4b5c07ee2f14aeffa51a6d75.

Unnecessary spilling due to dead code

On mla.c from above without the mla constant folding patch, CompCert
spills the r4 register, which it then uses to hold a temporary:

    str r4, [sp, #8]
    mov r4, #0
    mov r12, #0
    mov r1, r0
    mov r2, r1
    mul r3, r2, r1
    mla r2, r4, r0, r12
    orr r0, r3, r2
    ldr r4, [sp, #8]

However, the spill goes away if the line v = 4; is removed from the
program:

    mov r3, #0
    mov r12, #0
    mov r1, r0
    mov r2, r1
    mul r1, r1, r2
    mla r12, r3, r0, r12
    orr r0, r1, r12

The value of v is never used. This assignment is dead code, and it is
compiled away. It should not affect register allocation.

Missed register coalescing

In the previous example, the copies for the arguments to the mul operation
(r0 to r1, then on to r2) are redundant. They could be eliminated, and
the multiplication written as just mul r1, r0, r0.

Failure to propagate folded constants

Continuing the mla.c example further, but this time after applying the
mla constant folding patch, CompCert generates:

    mov r1, r0
    mul r2, r1, r0
    mov r1, #0
    orr r0, r2, r1

The x | 0 operation is redundant. CompCert is able to fold this
operation if it appears in the source code, but in this case the 0 comes
from the previously folded mla operation. Such constants are not
propagated by the “constant propagation” (really, only local folding)
pass. Values are only propagated by the value analysis that runs before, but
for technical reasons the value analysis cannot currently take advantage of
the algebraic properties of operators’ neutral and zero elements.

Missed reassociation and folding of constants

int fn1(int p1) {
  int a, v, b;
  v = 1;
  a = 3 * p1 * 2 * 3;
  b = v + a * 83;
  return b;
}

All of the multiplications could be folded into a single multiplication of p1
by 1494, but CompCert generates a series of adds and shifts before a
multiplication by 83:

    add r3, r0, r0, lsl #1
    mov r0, r3, lsl #1
    add r1, r0, r0, lsl #1
    mov r2, #83
    mla r0, r2, r1, r12
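
For reference, the fully folded form is a single multiply-add (a sketch:
3 * 2 * 3 == 18 and 18 * 83 == 1494):

int fn1_folded(int p1) {
  return 1494 * p1 + 1;   /* v + (3 * p1 * 2 * 3) * 83 */
}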
