Friday, May 5, 2017

ARM Assembly: The Stack

If you have been following the tutorials in order, you have likely hit a point where you are wondering what to do if you don't have enough registers to hold everything to need.  Of course, with so many registers, maybe you are not there yet.  Either way, now is a good time to learn about local memory.

The stack is where a process stores local variables. The stack is also used to hold onto anything it needs to store right before or after making a function call.  The stack is generally fairly small, often having a limit of only 8MB.  This means it is not good for storing large data structures.  What it is good for is storing information that is being used in a function, but which will no longer matter when the function is returned.  In other words, the stack is good for storing local variables that go out of scope when the function returns.

Before we start trying to use it, we need to know how it works.  There are several different ways the stack can work, depending on system.  The stack can grow up or down, and the stack pointer can either point at the last value stored on the stack, or it can point to the first unused address on the stack.  This presents four options, and ARM assembly is equipped to handle all of them.  Thankfully, we only need to know one.

Linux on ARM grows the stack down, and the stack pointer (sp) points to the last value on the stack.  This means, if you want to allocate more space on the stack, first you decrease the sp by the amount of memory you want, then you write to that memory.  When you are done using that memory, you increase the sp to its previous location.  This paragraph contains some of the most important information for using the stack, so reread it if you need to.

The second most important thing you need to know about the stack is that when you return from a function you must set the sp to exactly where it was when the function was called.  If you don't, then the calling function won't know where to find its local variables, which will almost certainly lead to unexpected behavior and eventually a segfault.  We will talk more about this when we learn how to make functions.

Now that we know the basics of how the stack works, let's try it out.  First we need to learn how to store data in memory and how to load it back into registers.  There is a set of instructions for this.  The easiest ones are LDR and STR, for load [to] register, and store [from] register.  The ARM quick reference card shows these instructions:
<op>{size}{T} Rd, [Rn {, #<offset>}]{!}
<op>{size}{T} Rd, [Rn],  #<offset>
<op>{size} Rd, [Rn, +/-Rm {, <opsh>}]{!}
<op>{size}{T} Rd, [Rn], +/-Rm {, <opsh>}
<op>{size} Rd, <label>
If you think this is confusing, you are not alone.  I had a hard time with this initially too.  The first thing you need to know is that <op> is either LDR or STR.  For LDR, Rd is the destination, where the value is being loaded to.  For STR, Rd is the register that contains the value to store.  The {size} portion is a code, telling the processor how much data we are loading or storing.  The options are B, SB, H, or SH, where B is byte or 8 bits, SB is signed byte, H is halfword or 16 bits, and SH is signed halfword.  Note that only LDR cares about whether a value is signed or not, so SB and SH won't work with STR.  If the size is omitted, then the instructions will load and store 32 bits, which is the full size of a register.  Here signed and unsigned make no difference for LDR or STR.  The {T} is an optional character that can be used to specify privilege level, which only matters to the kernel, so we won't worry about it.

Now let's talk about addressing modes.  ARM provides several ways to specify memory addresses, so that we don't have to waste precious instructions calculating memory addresses manually.  The first four options above start with a register that contains an address.  The Rn register will contain a base address.  In the first two, an offset is added to that base address.  The top one will add the offset, then it will load or store using the calculated address.  If the optional ! is used, the calculated address will be stored in Rn.  This can be used for easily traversing arrays.  The second option will access (load from or store to) the memory location in Rn, then it will add the offset to Rn and store the result in Rn.  The difference here is whether the offset is added before or after accessing memory.  The third and fourth allow an offset that is stored in a register, which may be added or subtracted from the base address and which can optionally be shifted.  <opsh> is defined as Rm, ??? #<shift>, where ??? is a shift code (LSL for logical left shift, etc...).  This allows Rm to be an offset that will be shifted before applying it, which is useful for traversing some types of data structures.  As with the first two, the third will update the base address in Rn if the ! is included, while the fourth will access the memory at Rn first, and then it will update Rn with the offset.  The last one uses a label.  When the program is compiled, this label is converted into an offset relative to the current position of the pc, essentially hard coding the memory location.

For accessing the stack, we only care about the first two of these.  The first one is for accessing memory without updating the address and for accessing memory where the offset is applied before accessing the memory, then the memory address is updated.  The second one is for accessing memory before applying the offset, then applying the offset and updating the memory address.  We don't need optionally shifted offsets, nor do we need offsets that are stored in registers, so we are not going to worry about the third and fourth options.  We are also not using labels, so the fifth option is not useful right now either.

Now we come to the point where how the stack works is important.  There are two ways we can deal with the stack.  The first is using SUB and ADD to decrement and increment the stack pointer directly.  For example, if we need to allocate 8 bytes on the stack, we might do this:
sub sp, sp, #8
This will reduce the sp by 8, giving us access to 8 bytes of memory on the stack.  When we are done, before we return, we would do this, to put it back where it was:
add sp, sp, #8
And now we have deallocated 8 bytes.  There is a cheaper way to do this though.  While it is sometimes better to just allocate as much as you will need at once, if you only need to put one or two things on the stack, it may be better to just do this:
str r0, [sp, #-4]!
This will decrement the stack by 4 bytes (the size of a register), storing the calculated address in sp, then it will store r0 to that location in memory.  This takes care of decrementing the stack and storing the data all in a single instruction.  Once we are done, there are two ways we can finish.  If we don't need the value, we can just add 4 to the sp and walk away.  If we do need the value though, we can do this:
ldr r0, [sp], #4
Notice that we used the first option above for storing and the second for loading.  This is important.  When storing something on the stack, we need to decrement before we store the value.  When loading, we need to load and then increment.  So, this time, we are loading the value from the stack (sp is pointing at the last value stored, which is what we want to load), then once our value is loaded, it will increment the stack pointer by 4, deallocating the stack memory we were using.

You should familiarize yourself with the different memory addressing modes.  For this particular system, pre-indexing (the first option) stores and post-indexing loads is the right way to deal with the stack, but if you ever program in ARM assembly for another OS, you may find this is not the case.  Further, when traversing arrays and other data structures, you may also find that this is not the best strategy.  Understanding how each addressing mode works is valuable for a lot of applications.

Now, you should also be aware that there is an easier way to do this.  ARM defines two pseudo instructions to help you work with the stack more easily.  These instructions are PUSH and POP.  The compiler turns PUSH and POP into the above STR and LDR instructions for the Raspberry Pi, but they are defined more loosely than that.  Whatever system you are compiling for, they will work with the stack as is appropriate for that system.  So, if you take a program written and tested on the Pi using PUSH and POP, assuming everything else is the same, it will compile correctly for other ARM systems that use the stack differently.  This is not all these instructions can do though.  Before we go any further, let's write some example programs.

We are going to write a program that uses the stack with STR and LDR, then alter it to use PUSH and POP.  There is one thing that is frequently stored on the stack at the beginning of a function.  This is the link register, lr.  We will get into function calls more later, but the two instructions for making function calls overwrite the lr register with a pointer to the instruction directly after the function call.  This is the return point, and it is how the processor helps our programs to keep track of where function calls need to return to.  There is a problem here though.  What happens if we call a function within another function?  The answer is the lr is overwritten with the new return point.  If we don't store the return pointer somewhere else, then it will be lost, and our program won't know where to return to.  In practice, we will probably end up with an infinite loop between the return point of the nested function and the end of the current function.  The universal solution to this is to store lr on the stack in any function that will be calling another function.  When we need to return, we will load it from the stack and use it.

There is another important use of the stack when using functions.  We only have about 12 registers we can use.  It is not uncommon to have function calls nesting quite deep, and it is not reasonable for us to allocate registers to specific functions.  Instead, we have a set of standards that functions ought to follow, which, when followed, will guarantee that they don't step on each other's toes.  We won't get deep into this at this point, but there is one thing that is relevant here.  r0 through r3 are used for passing arguments (and r0 is used for the return value).  Further, even if you don't pass arguments in them, the caller may not assume that the values in them will be the same when a function returns.  Essentially, functions are allowed to do whatever they want with r0 through r3, and calling functions must accept that.  In practice, this means that if you have values in these registers that you want to keep, you must store them somewhere else before calling a function.  The second thing is that values in r4 through r11 are expected to be maintained through a function call.  This means that any function that wishes to use any of these registers must store the existing values and then restore them before returning.  Both of these things can be handled using the stack.  We will focus on the second, as the first is more often handled either by doing work primarily in r4 through r11 to avoid the need to store before calling a function or by copying values in r0 through r3 into the r4 to r11 range.

We will continue using GCC as our compiler here, since main() is a function, and thus we should follow these function call procedures in it.  We are going to do some work in r4, r5, and r6, and leave the results in r0, but since main() is a function, we must restore whatever is in these registers and restore it before returning.
.text
.global main
main:
    str lr, [sp, #-4]!
    str r4, [sp, #-4]!
    str r5, [sp, #-4]!
    str r6, [sp, #-4]!

    mov r4, #13
    mov r5, #27
    add r6, r4, r5
    sub r0, r6, #7

    ldr r6, [sp], #4
    ldr r5, [sp], #4
    ldr r4, [sp], #4
    ldr lr, [sp], #4
    bx lr
Our program starts by putting any registers from r4 to r11 that we will be using onto the stack.  Then we use those registers to do some stuff.  In this case, we could have used r1, r2, and r3 just as easily, but if we had wanted to print out the value using printf(), then we would have had to move what we wanted to keep out of those registers to the higher ones.

Once we are done, we restore the values from the stack to the registers.  Note that we do this in reverse order.  The stack works exactly like a stack data structure when used this way.  The last thing we put in will be the first thing that comes out.  This means we must pull things out in reverse order so they will all be restored to the same places.

Something else important to note is that we are pre-decrementing when putting things on the stack and post-incrementing when we pull things off.  This is because the stack grows downward, and the stack pointer always points at the last item on the stack.  And recall that when we return, the stack pointer must point at the same place it was at when the function was called.  If we add up the decrements and increments, we get -16 at the beginning of the function, and +16 at the end, which breaks even.

Lastly, once we compile and run the program, we can use echo $? to print the return value, which should be 13 + 27 - 7 or 33.

This function is rather inefficient.  We are doing a ton of memory access.  We have a total of 8, and because they are in succession, each will have to wait for the previous one to complete.  Because ARM is very limited in instructions that access memory, it provides some instructions optimize memory accesses.  These are LDM and STM.  The M on the end of each means "multiple".  We can use these instructions to transfer the data in many registers to and from memory in a single instruction.  I can't say I understand how the wiring works for this, but I assume that the wiring between registers and memory is fairly robust to allow this.

We are not going to use these instructions directly right now, because they have to care about the order things are being transferred in.  We would not want to store r0 through r3, and then reload the data and have it end up with the first value in r3 and the last in r0.  So these have a sort of addressing mode system giving each another four versions.  The PUSH and POP instructions don't just apply to writing single registers to the stack.  If you give them a list of registers, instead of using LDR and STR, the assembler will use LDM and STM, to optimize performance.

Let's rewrite the above program using PUSH and POP to transfer multiple values at one!

.text
.global main
main:
    push {r4, r5, r6, lr}

    mov r4, #13
    mov r5, #27
    add r6, r4, r5
    sub r0, r6, #7

    pop {r4, r5, r6, lr}
    bx lr
This program will work identically to the first one, except cleaner and faster.  In fact, it should run a lot faster, because we cut memory accesses by 75%, and none of the other code takes significant time in comparison.

Note that PUSH and POP always produce code using the stack pointer.  If you want to load or store data anywhere else, you will have to use STR, LDR, STM, or LDM directly.

As usual, running this program should return the value 33 as the error code.

No comments:

Post a Comment