If you’ve ever looked at the assembly output after compiling a C program with GCC, you may have noticed that sometimes simple arithmetic like
unsigned int a = ...
unsigned int b = a % 3;
becomes something much more involved using shifts, multiplications with wierd constants and subtractions:
movl %eax, %edx
movl $2863311531, %ecx
imulq %rcx, %rdx
shrq $33, %rdx
leal (%rdx,%rdx,2), %edx
movl %eax, %esi
subl %edx, %esi
(In the following registers are simply referred to by their 16-bit
names such as ax
, bx
, cx
,
di
etc. Some instructions use the lower 32 bits of the
register and some the full 64 bits, which is just technical clutter and
ignored for brevity.)
Let’s first explore these simpler optimizations where the right-hand operand is a power of 2. Modulo by a power of 10 in base 10 is simply ignoring the higher digits. For example or and so on. The same idea is used here
unsigned int b = a % 4;
becomes a bitwise and with a bitmask of the digits according to the divisor:
movl %eax, %esi
andl $3, %esi
The decimal 3 is a literal whose binary representation is
0000 0000 0000 0011
. When this is bitwise anded, the last
two bits of the other operand is in the output, which is exactly what
the modulo operation does. In general, modulo with
becomes bitwise and with the binary representation of the literal
.
Floor division with a power of 2 is also simple, but slightly different:
unsigned int b = a / 4;
becomes simply a right shift:
movl %eax, %esi
shrl $2, %esi
This approach is of course the same as division by a power of 10 in decimal. One might expect the same to be the case for multiplication by a power of 2. For instance
unsigned int b = a * 16;
becomes a left shift instead
movl %eax, %esi
sall $4, %esi
For some other powers of two, something different happens.
If one modifies the example above to
unsigned int b = a * 4;
The multiplications is actually not performed by a left shift, but
using the lea
, load effective address, instruction.
leal 0(,%rax,4), %esi
The lea
instruction computes the address of the first
operand and stores it in the second operand.
offset(base register, index register, scale)
is an
effective address calculation that does the following:
base register + offset + index register * scale
. This
address calculation can be used to array accesses and plays quite well
with pointer arithmetic. However, it is not used for computing memory
addresses here, but ordinary arithmetic. The above then becomes
0 + ax * 4
which is simply the multiplication by 4 that we
want. Even though the scale
has to be a power of 2, this
can be exploited to multiply numbers not a power of 2:
Multiplication with 3:
leal (%rax,%rax,2), %esi
ax + ax * 2
Multiplication by 7 is multiplication by 8 followed by a subtraction:
leal 0(,%rax,8), %esi
subl %eax, %esi
Multiplication of 11:
leal (%rax,%rax,4), %edx
leal (%rax,%rdx,2), %esi
The first line multiplies by 5, and the second line multiplies by 2 and adds 1.
What happens when the divisor is not a power of 2? Lets consider an example of
unsigned int a = ...
unsigned int b = a / 3;
this becomes
movl %eax, %esi
movl $2863311531, %eax
imulq %rax, %rsi
shrq $33, %rsi
Let’s first examine what the assembly code actually does. To begin
with, a
is in the ax
register.
ax
( where a
is stored) is
copied to dx
.cx
.cx
is multiplied with dx
. The result is
stored in dx
.dx
is shifted 33
places to the right. This is the same as dividing by
as we saw earlier.In mathematical terms But why is this necessarily and where does this magic constant come from? We wish to find the magic constant:
Another way of seeing this is, by treating the division as a multiplication: (let’s ignore the integer division truncation for the next bit)
We can turn any division to a corresponding multiplication: It would be much easier to do this computation if the denominator was a power of two (as shown in a previous section). We can express as a fraction with a denominator of a power of two by multiplying the enumerator with some appropriate constant such that the denominator is a power of two:
We will need to be integer, unfortunately can’t be. However, we can make an integer approximation. For sufficiently large , the approximation gets very close to , as is the case with ; Since we’re doing integer division in a limited range of integers, a with sufficiently large actually gives correct values for all possible values of , as is the case with 2863311531 for division with 3. But why couldn’t we just have used or ? Ie., what makes large enough? I still can’t quite figure this out.
Anyway, now that the compiler found an appropriate constant approximation, , the division can be carried out by first multiplying, then shifting. We can do this because of the associativity of the operators; . The latter is used, thus the expensive operation of division can be performed using a multiplication and a shift.
Let’s consider modulo in the case where the right-hand operand is not necessarily a power of two.
unsigned int a = ...
unsigned int b = a % 3;
which becomes
movl %eax, %edx
movl $2863311531, %ecx
imulq %rcx, %rdx
shrq $33, %rdx
leal (%rdx,%rdx,2), %edx
movl %eax, %esi
subl %edx, %esi
1.-4. Computes as explained in the previous section.
dx
by (in this case) 3, as explained earlier.a
is copied to
si
.si
is subtracted by dx
. The result is
stored in si
.The result b
is in si
.
In mathematical terms; Where 1.-4. corresponds to the floor division, 5. corresponds to the multiplication by 3 and finally 6.-7. corresponds to the subtraction.
From The Division Algorithm we have that , and . In our example , and still . Having computed , we can isolate : In our example Which is exactly what happens in the assembly code.