Technium Adeptus: ARM Assembly: Loops

After if statements, loops are the second most common flow control structures. Loops can allow us to perform operations across an array of data. They can let us give the user control of when an application exits. They can wait for user input or for other I/O to complete. They can even just consume some time, to make sure a program is not running too fast (though, there are typically much better ways of doing this). They can count for us. They are incredibly useful, and like if statements, they are essential to modern computing.

Loops are created with conditional branches. In assembly, there is no distinction between a for loop and a while loop. In fact, there aren't even really loops. We construct loops out of conditional branches.

We will start with an unoptimal but easy to understand example.

.text
.balign 4
.global _start
_start:
    mov r1, #5

    cmp r1, #0
    beq break
loop:
    add r0, #1
    subs r1, r1, #1
    bne loop
break:

    mov r7, #1
    svc #0

This will count up to 5, using r1 as the counter variable. Notice that when I check the condition within the loop, I use subs, instead of cmp. In this case, I want to decrement the counter, and then I want to check if it has gotten to 0. I need to save the counter though, so I know where I am in the loop. I could use sub and then cmp, but using subs lets me combine these instructions.

This is not a terribly efficient way to do this. Because ARM has the s suffix, we don't actually reduce the size of the program by doing this. In most other assembly languages, this would save us an instruction or two. This will improve the performance of the program slightly, because we replaced a conditional branch with an unconditional one.

.text
.balign 4
.global _start
_start:
    mov r1, #5

    b test
loop:
    add r0, #1
    sub r1, r1, #1
test:
    cmp r1, #0
    bne loop
break:

    mov r7, #1
    svc #0

This will work exactly like the previous version, except slightly faster. These examples are basic for loops. For loops should not run at all, if the break out condition is met. This means that we need to test the counter before we start the loop. The old way of doing this is an extra cmp before the loop, and a conditional branch to skip the loop if the condition is met. That is exactly what we did in the first example. The C compiler has been updated, improved, and optimized over a rather long time, and this includes finding more efficient ways of doing common tasks. At some point someone realized that instead of testing the condition before starting the loop, they could just jump straight to the end part of the loop, where it is testing its conditions. This eliminates a conditional branch, avoiding the problem of failed branch prediction. In assembly languages that don't have an s suffix, it also reduces the number of instructions. In ARM, we actually had to separate the sub and the cmd to do this, which eliminated the benefit of reducing the number of instructions.

It turns out this optimization is a trade off. Notice that the original loop has 3 instructions inside the loop. The second one has 4 instructions. If we counted the number of instructions actually executed in this program, the second example would execute 5 more instructions than the first. This means that the first loop is actually faster. This is not exactly a big difference, but what if we were doing 1,000 iterations? That would 1,000 more instructions executed. If this was a loop in a performance critical part of a program, and it was expected to typically go through a lot of iterations, the cost of failed branch prediction would be more than made up for by using the first example, with fewer instructions. For a loop that is only expected to do a few iterations, eliminating the cost of failed branch prediction might make up for the extra instruction in the loop.

Of course, on an Intel system, where there is no equivalent to subs, the original program would be doing both the subtract and the compare in the first place, which makes the second example always optimal.

Technium Adeptus

Saturday, May 6, 2017

ARM Assembly: Loops

No comments:

Post a Comment