Here’s a checklist of some missed optimizations in C compilers. To my knowledge, these haven’t been reported by others (but it’s somewhat difficult to search for such things, and others may simply not have written them up before). I have reported several of these in bug trackers and/or developed rough patches for some; see links below.

These are *not* correctness bugs, only cases where a compiler misses an opportunity to generate simpler or otherwise (seemingly) more efficient code. None of these is likely to affect the actual, measurable performance of real applications. Moreover, I haven’t measured each of these: the seemingly inefficient code might somehow turn out to be faster on real hardware.

I tested GCC, Clang, and CompCert, mostly targeting ARM at `-O3`.

Unless noted otherwise, the examples below can be reproduced with the following versions and configurations:

Compiler | Version | Revision | Date | Configuration |
---|---|---|---|---|
GCC | 8.0.0 20170906 | r251801 | 2017-09-06 | `--target=armv7a-eabihf --with-arch=armv7-a --with-fpu=vfpv3-d16 --with-float-abi=hard --with-float=hard` |
Clang | 6.0.0 (trunk 312637) | LLVM r312635, Clang r312634 | 2017-09-06 | `--target=armv7a-eabihf -march=armv7-a -mfpu=vfpv3-d16 -mfloat-abi=hard` |
CompCert | 3.0.1 | f3c60be | 2017-08-28 | `armv7a-linux` |

For each example, at least one of these three compilers (usually GCC or Clang) produces the “good” code.

This list was put together by Gergö Barany (gergo.barany@inria.fr). I’m interested in feedback.

## GCC

### Useless initialization of struct passed by value

```
struct S0 {
    int f0;
    int f1;
    int f2;
    int f3;
};
int f1(struct S0 p) {
    return p.f0;
}
```

The struct is passed in registers, and the function’s result is already in `r0`, which is also the return register. The function could return immediately, but GCC first stores all the struct’s fields to the stack and reloads the first field:

```
sub sp, sp, #16
add ip, sp, #16
stmdb ip, {r0, r1, r2, r3}
ldr r0, [sp]
add sp, sp, #16
bx lr
```

Reported at https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80905.

### `float` to `char` type conversion goes through memory

```
char fn1(float p1) {
    return (char) p1;
}
```

results in:

```
vcvt.u32.f32 s15, s0
sub sp, sp, #8
vstr.32 s15, [sp, #4] @ int
ldrb r0, [sp, #4] @ zero_extendqisi2
```

i.e., the result of the conversion in `s15` is stored to the stack and then reloaded into `r0` instead of just copying it between the registers (and possibly truncating to `char`’s range). The same happens for `double` instead of `float`, but *not* for `short` instead of `char`.

Reported at https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80861.

### Spill instead of register copy

```
int N;
int fn2(int p1, int p2) {
    int a = p2;
    if (N)
        a *= 10.5;
    return a;
}
```

GCC spills `a` even though enough integer registers would be available:

```
str r1, [sp, #4] // initialize a to p2
cmp r3, #0
beq .L13
... compute multiplication, put result into d7 ...
vcvt.s32.f64 s15, d7
vstr.32 s15, [sp, #4] @ int // update a in branch
.L13:
ldr r0, [sp, #4] // reload a
```

Reported at https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81012 (together with another issue further below).

### Missed simplification of multiplication by integer-valued floating-point constant

Variant of the above code with the constant changed slightly:

```
int N;
int fn5(int p1, int p2) {
    int a = p2;
    if (N)
        a *= 10.0;
    return a;
}
```

GCC converts `a` to double and back as above, but the result must be the same as simply multiplying by the integer 10. Clang realizes this and generates an integer multiply, eliminating all floating-point operations.

### Dead store in `int` to `double` conversion

```
double fn3(int p1, short p2) {
    double a = p1;
    if (1881 * p2 + 5)
        a = 2;
    return a;
}
```

`a` is initialized to the value of `p1`, and GCC spills this value. The spill store is dead; the value is never reloaded:

```
...
str r0, [sp, #4]
cmn r1, #5
vmovne.f64 d0, #2.0e+0
vmoveq s15, r0 @ int
vcvteq.f64.s32 d0, s15
add sp, sp, #8 @ sp needed
bx lr
```

Reported at https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81012 (together with another issue above).

### Missed strength reduction of division to comparison

```
unsigned int fn1(unsigned int a, unsigned int p2) {
    int b;
    b = 3;
    if (p2 / a)
        b = 7;
    return b;
}
```

The condition `p2 / a` is true (nonzero) iff `p2 >= a`. Clang compares; GCC generates a division (on ARM, x86, and PowerPC).

### Missed simplification of floating-point addition

```
long fn1(int p1) {
    double a;
    long b = a = p1;
    if (0 + a + 0)
        b = 9;
    return b;
}
```

It’s not legal in general to replace `a + 0` by `a` in floating-point arithmetic due to NaNs and negative zeros and whatnot, but here `a` is initialized from an `int`’s value, so none of those cases can occur. GCC generates all the floating-point operations, while Clang just compares `p1` to zero.

### Missed reassociation of multiplications

```
double fn1(double p1) {
    double v;
    v = p1 * p1 * (-p1 * p1);
    return v;
}
int fn2(int p1) {
    int v;
    v = p1 * p1 * (-p1 * p1);
    return v;
}
```

For each function, GCC generates three multiplications:

```
vnmul.f64 d7, d0, d0
vmul.f64 d0, d0, d0
vmul.f64 d0, d7, d0
```

and

```
rsb r2, r0, #0
mul r3, r0, r0
mul r0, r0, r2
mul r0, r3, r0
```

Clang can do the same with two multiplications each:

```
vmul.f64 d0, d0, d0
vnmul.f64 d0, d0, d0
```

and

```
mul r0, r0, r0
rsb r1, r0, #0
mul r0, r0, r1
```

### Overeager loop optimization instead of removing the loop

```
int N;
double fn1(double *p1) {
    int i = 0;
    double v = 0;
    while (i < N) {
        v = p1[i];
        i++;
    }
    return v;
}
```

This function returns 0 if `N` is 0 or negative; otherwise, it returns `p1[N-1]`. GCC compiles it as such on ARM (without a loop), but on x86-64, it generates a complex unrolled loop. The code is too long and boring to show here, but it is similar to https://godbolt.org/g/RYwgq4.

### Pointless spilling due to badly scheduled move-immediates

```
double fn1(double p1, float p2, double p3, float p4, double *p5, double p6)
{
    double a, c, d, e;
    float b;
    a = p3 / p1 / 38 / 2.56380695236e38f * 5;
    c = 946293016.021f / a + 9;
    b = p1 - 6023397303.74f / p3 * 909 * *p5;
    d = c / 683596028.301 + p3 - b + 701.296936339 - p4 * p6;
    e = p2 + d;
    return e;
}
```

Sorry about the mess. GCC generates a bunch of code for this, including:

```
vstr.64 d10, [sp]
vmov.f64 d11, #5.0e+0
vmov.f64 d13, #9.0e+0
... computations involving d10 but not d11 or d13 ...
vldr.64 d10, [sp]
vcvt.f64.f32 d6, s0
vmul.f64 d11, d7, d11
vdiv.f64 d5, d12, d11
vadd.f64 d13, d5, d13
vdiv.f64 d0, d13, d14
```

The register `d10` must be spilled to make room for the constants 5 and 9 being loaded into `d11` and `d13`, respectively. But these loads come much too early: their values are only needed after `d10` is restored. These move instructions (at least one of them) should be placed closer to their uses, freeing up registers and avoiding the spill.

### Missed constant propagation (or so it seemed)

```
int fn1(int p1) {
    int a = 6;
    int b = (p1 / 12 == a);
    return b;
}
int fn2(int p1) {
    int b = (p1 / 12 == 6);
    return b;
}
```

These functions should be compiled identically. The condition is equivalent to `6 * 12 <= p1 < 7 * 12`, and the two signed comparisons can be replaced by a subtraction and a single unsigned comparison:

```
sub r0, r0, #72
cmp r0, #11
movhi r0, #0
movls r0, #1
bx lr
```

GCC used to do this for the first function, but not the second one. It looked like constant propagation failed to replace `a` by 6.

Reported at https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81346 and fixed in revision r250342; it turned out that this pattern used to be implemented (only) in a very early folding pass that ran before constant propagation.

### Missed tail-call optimization for compiler-generated call to intrinsic

```
int fn1(int a, int b) {
    return a / b;
}
```

On ARM, integer division is turned into a function call:

```
push {r4, lr}
bl __aeabi_idiv
pop {r4, pc}
```

In the code above, GCC allocates a stack frame, not realizing that this call in tail position could simply be

```
b __aeabi_idiv
```

(When the call, even to `__aeabi_idiv`, appears in the source code, GCC is perfectly capable of compiling it as a tail call.)

## Clang

### Incomplete range analysis for specific types

```
short fn1(short p) {
    short c = p / 32769;
    return c;
}
int fn2(signed char p) {
    int c = p / 129;
    return c;
}
```

In both cases, the divisor is outside the possible range of values for `p`, so the function’s result is always 0. Clang doesn’t notice this and generates code to compute the result. (It does optimize the corresponding cases for unsigned `short` and `char`.)

### More incomplete range analysis

```
int fn1(short p4, int p5) {
    int v = 0;
    if ((p4 - 32779) - !p5)
        v = 7;
    return v;
}
```

The condition is always true, and the function always returns 7. Clang generates code to evaluate the condition. Similar to the case above, but apparently more complicated: if the `- !p5` is removed, Clang does realize that the condition is always true.

### Even more incomplete range analysis

```
int fn1(int p, int q) {
    int a = q + (p % 6) / 9;
    return a;
}
int fn2(int p) {
    return (1 / p) / 100;
}
int fn3(int p5, int p6) {
    int a, c, v;
    v = -!p5;
    a = v + 7;
    c = (4 % a);
    return c;
}
int fn4(int p4, int p5) {
    int a = 5;
    if ((5 % p5) / 9)
        a = p4;
    return a;
}
```

The first function always returns `q`; the others always return 0, 4, and 5, respectively. Clang evaluates at least one division or modulo operation in each case.

(I have a *lot* more examples like this, but this list is boring enough as it is.)

### Redundant code for `double` to `int` conversion

```
int fn1(double c, int *p, int *q, int *r, int *s) {
    int i = (int) c;
    *p = i;
    *q = i;
    *r = i;
    *s = i;
    return i;
}
```

There is a single type conversion at the source level. Clang duplicates the conversion operation (`vcvt`) for every use of `i`:

```
vcvt.s32.f64 s2, d0
vstr s2, [r0]
vcvt.s32.f64 s2, d0
vstr s2, [r1]
vcvt.s32.f64 s2, d0
vstr s2, [r2]
vcvt.s32.f64 s2, d0
vcvt.s32.f64 s0, d0
vmov r0, s0
vstr s2, [r3]
```

Several times, the result of the conversion is written into the register that already holds that very same value (`s2`).

Reported at https://bugs.llvm.org/show_bug.cgi?id=33199.

### Missing instruction selection pattern for `vnmla`

```
double fn1(double p1, double p2, double p3) {
    double a = -p1 - p2 * p3;
    return a;
}
```

Clang generates

```
vneg.f64 d0, d0
vmls.f64 d0, d1, d2
```

but this could just be

```
vnmla.f64 d0, d1, d2
```

The instruction selector has a pattern for selecting `vnmla` for the expression `-(a * b) - dst`, but not for the symmetric `-dst - (a * b)`.

Patch submitted: https://reviews.llvm.org/D35911

## CompCert

### No dead code elimination for useless branches

```
void fn1(int r0) {
    if (r0 + 42)
        0;
}
```

The conditional branch due to the `if` statement is eliminated, but the computation of its condition remains in the generated code:

```
add r0, r0, #42
```

This appears to be because dead code elimination runs before useless branches are identified and eliminated.

### Missing constant folding patterns for modulo

```
int fn4(int p1) {
    int a = 0 % p1;
    return a;
}
int fn5(int a) {
    int b = a % a;
    return b;
}
```

In both cases, CompCert generates a modulo operation (on ARM, it generates a library call, a multiply, and a subtraction). In both cases, this operation could be constant folded to zero.

My patch at
https://github.com/gergo-/CompCert/commit/78555eb7afacac184d94db43c9d4438d20502be8
fixes the first one, but not the second. It also fails to handle the equivalent case of

```
int zero = 0;
int a = zero % p1;
```

because, by the time the constant propagator runs, the modulo operation has been mangled into a complex call/multiply/subtract sequence that is inconvenient to recognize.

### Failure to generate `movw` instruction

```
int fn1(int p1) {
    int a = 506 * p1;
    return a;
}
```

results in:

```
mov r1, #250
orr r1, r1, #256
mul r0, r0, r1
```

Two instructions are spent on loading the constant. However, ARM’s `movw` instruction can take an immediate between 0 and 65535, so this could be simplified to:

```
movw r1, #506
```

I have a patch at
https://github.com/gergo-/CompCert/commit/8aa7b1ce3b0228232e027773752727f61b42a847.

### Failure to constant fold `mla`

Call this code `mla.c`:

```
int fn1(int p1) {
    int a, b, c, d, e, v, f;
    a = 0;
    b = c = 0;
    d = e = p1;
    v = 4;
    f = e * d | a * p1 + b;
    return f;
}
```

The expression `a * p1 + b` evaluates to `0 * p1 + 0`, i.e., 0. CompCert is able to fold both `0 * x` and `x + 0` in isolation, but on ARM it generates an `mla` (multiply-add) instruction, for which it has no constant folding rules:

```
mov r4, #0
mov r12, #0
...
mla r2, r4, r0, r12
```

I have a patch at
https://github.com/gergo-/CompCert/commit/37e7fd10fdb737dc4b5c07ee2f14aeffa51a6d75.

### Pointless spilling due to dead code

On `mla.c` from above, without the `mla` constant folding patch, CompCert spills the `r4` register, which it then uses to hold a temporary:

```
str r4, [sp, #8]
mov r4, #0
mov r12, #0
mov r1, r0
mov r2, r1
mul r3, r2, r1
mla r2, r4, r0, r12
orr r0, r3, r2
ldr r4, [sp, #8]
```

However, the spill goes away if the line `v = 4;` is removed from the program:

```
mov r3, #0
mov r12, #0
mov r1, r0
mov r2, r1
mul r1, r1, r2
mla r12, r3, r0, r12
orr r0, r1, r12
```

The value of `v` is never used. This assignment is dead code, and it is compiled away. It should still not affect register allocation.

### Missed register coalescing

In the previous example, the copies for the arguments to the `mul` operation (`r0` to `r1`, then on to `r2`) are redundant. They could be eliminated, and the multiplication written as just `mul r1, r0, r0`.

### Failure to propagate folded constants

Continuing the `mla.c` example further, but this time after applying the `mla` constant folding patch, CompCert generates:

```
mov r1, r0
mul r2, r1, r0
mov r1, #0
orr r0, r2, r1
```

The `x | 0` operation is redundant. CompCert is able to fold this operation if it appears in the source code, but in this case the 0 comes from the previously folded `mla` operation. Such constants are not propagated by the “constant propagation” (really, only local folding) pass. Values are only propagated by the value analysis that runs before it, but for technical reasons the value analysis cannot currently take advantage of the algebraic properties of operators’ neutral and zero elements.

### Missed reassociation and folding of constants

```
int fn1(int p1) {
    int a, v, b;
    v = 1;
    a = 3 * p1 * 2 * 3;
    b = v + a * 83;
    return b;
}
```

All the multiplications could be folded into a single multiplication of `p1` by 1494, but CompCert generates a series of adds and shifts followed by a multiplication by 83:

```
add r3, r0, r0, lsl #1
mov r0, r3, lsl #1
add r1, r0, r0, lsl #1
mov r2, #83
mla r0, r2, r1, r12
```