Saturday, May 6, 2017

ARM Assembly: Arrays and Structs

If you are doing these in order, you just learned about loops and indexing modes.  You now have the foundation for working with arrays and structs.

Structs are heterogeneous data structures.  This means that elements of a struct can and typically do have different data types or at least different meanings.  When we are mixing data types in a struct, there are some rules we need to follow.  The most important rule is that each variable should be aligned to its own size.  So, a short should be 2 byte aligned, an integer should be 4 byte aligned, and so on.  This may leave voids in our struct.  The rule with these voids is that they should never be used.

Technically, we have already made a struct, when we learned about global memory.  That struct contained 4 integers that represented an IP address.  Let's make a more complicated struct.
.data
.balign 4
struct:
    .word 32
    .byte 12
    .skip 1
    .short 64
    .word 169

.text
.global _start
_start:
    ldr r4, =struct
    ldr r0, [r4]
    ldrb r1, [r4, #4]
    ldrh r2, [r4, #6]
    ldr r3, [r4, #8]

    add r0, r0, r1
    add r0, r0, r2
    add r0, r0, r3

    mov r7, #1
    svc #0
We start by 4 byte aligning the beginning of the struct.  Typically we want to align the struct by the largest data type it contains.  So, if we were holding a 64 bit integer, we would want to 8 byte align it.  The first value in the struct is an integer, which is 4 bytes.  The second is a 1 byte value.  The next value is a 2 byte short, but the byte messed up our alignment.  So, we will skip 1 byte, which will 2 byte align the short.  The next value is another 4 byte integer.  If you add up the bytes from the byte, the void, and the short, you will find that we are already 4 byte aligned, so we don't need to skip anything here.

The code starts by getting the address of the struct.  We can use the offset indexing mode to pull out specific elements of the struct.  When calculating the offsets, we have to take into account the voids as well as the data elements.

You may notice that I used ldrb and ldrh in there.  If we just do an ldr, we will end up telling our program to read 4 bytes every time.  For the byte, this means we will read the byte, the void, and the short, all into one register.  For the short, we would end up reading the short and then the lower two bytes of the integer right after it.  If we want our loads to behave correctly with smaller data types, we have to say so explicitly, with ldrb for a byte and ldrh for a halfword (short), otherwise ldr will assume we always want to read a full integer.

Hopefully it is now obvious why the indexing mode we used is so valuable.

Arrays are typically homogeneous data.  They usually contain data that all has the same type and all has the same significance.  In assembly it is possible to mix data types in an array-like data structure, but if you do, you have to keep track of the types, so you can treat them properly.  We won't be doing that here.

An array is just a chunk of memory that has a size that is a multiple of the data type we are storing in it.  The defining factor of an array is that the data elements stored all have the same significance.  When we learned about global variables, we created a struct containing an IP address.  The data type was homogeneous, but the significance of each element was different.  Each element was a different value in an IP address.  A trick for determining whether a data structure is an array or a struct is to consider what would happen if the order of elements was changed.  If changing the order would have no effect or only affect things like sorting, then it is probably an array.  If changing the order would completely change the meaning of the data, then it is probably a struct (with the IP address, changing order would make it an entirely different IP address).  That said, as with everything else related to data in assembly, what the data means depends entirely on how you treat it.  There are not really arrays or structs.  They are collections of data in memory that we use as arrays and structs.

Let's make an array.
.bss
.balign 4
array:
    .space 64

.text
.global _start
_start:
    ldr r4, =array

    b init_loop_test
init_loop:
    str  r0, [r4, r0, LSL #2]
    add r0, r0, #1
init_loop_test:
    cmp r0, #16
    bne init_loop

    mov r7, #1
    svc #0
We are going to create an integer array with 16 elements (16 * 4 = 64).  Because I don't want to type in .word and a value 16 times, I stuck it in the bss section, and I initialized it in a for loop.  The loop iterates through the elements using an index stored in r0.  I am also using the index as the value to store in each element.  When the loop finishes, each element of the array will contain its own index.  I am using the register offset indexing mode with a shift, to convert the indices into the right offset for each element.

When this program terminates, the number 16 will be returned as the error code.  We will eventually learn to use the debugger.  Once we do that, it might be good to come back to this lesson and examine the contents of the array with the debugger.

In this case, we retained the original array address.  Sometimes this is what you want to do.  If we did not care about this, we could have used the [Rn], #<offset> mode, with an offset of 4.  This would have updated the pointer to point to the next element after writing to each one.  We could have eliminated the add instruction if we had done this.  The reason I did not is that I wanted to store the index of each element in it, so I had to keep track of the current index.

Also note that if I wanted to get a specific element out of the array, without traversing the entire thing, I could use the indexing mode we used above, and just set r0 to the index of the element.  This is something you generally cannot do with the immediate offset indexing mode.

ARM Assembly: Indexing Modes

We briefly discussed indexing modes when we learned about the stack and global memory.  It was probably somewhat confusing back then, because indexing modes are really designed to deal with arrays.  We will hold off on arrays just a bit longer though, because understanding indexing modes first is very important for dealing with arrays.

Indexing modes give us a great deal of flexibility in dealing with data in memory.  This is important, since ARM only has a few instructions that can access memory.  (On Intel and some other processors, math instructions can use memory directly, instead of having to load to a register first.  This makes these instructions much slower, but it is necessary on a CPU that has very few registers.)

Indexing modes are relevant to LDR, STR, LDM, and STM instructions.  The last two are far simpler but also far less powerful.  The first two provide a great deal of flexibility.

The LDR and STR instructions take a register and something designating the address of the memory to read.  The simplest uses a label, and it has no special capabilities.  The rest use a register containing a memory address (a pointer) and some optional extra stuff.  There are four of these indexing modes, and each has some extra optional stuff.

The first indexing mode is written in the documentation as [Rn {, #<offset>}]{!}.  The parts in braces are optional.  This means the simplest way this indexing mode could be used is ldr r0, [r1].  This loads the value pointed to by the address in r1.  If we had several integers in memory, and we wanted the second one, we could add the offset: ldr r0, [r1, #4].  This will get the data pointed to by r1 + 4.  The immediate value can also be negative.  Lastly, if we wanted to load several ints in sequence, we might use ldr r0, [r1, #4]!.  This will literally add 4 to r1, then it will use the new value in r1 as the pointer.  The exclamation point makes the instruction save the address we are accessing into r1.  If we wanted to get a series of values, we could use several of these instructions in succession, and each one will set up r1 to load the next value.  Of course, this is not exactly good for arrays, as this will always skip the first element.

The second indexing mode is [Rn], #<offset>.  This is a post indexing mode.  This means that the instruction will access the memory at Rn first, then it will add the offset to Rn.  If we used str r0, [r1], #4, the memory pointed to by r1 would first be written to.  Then 4 will be added to r1.  This won't skip the first element of an array.  If we used several of this instruction in series, we would first write the memory pointed to by r1, then r1 + 4, then r1 + 8, and so on, in 4 byte increments.  We will see an example of this when we learn how to work with arrays.

The third indexing mode uses an offset stored in a register, [Rn, +/-Rm {, <opsh>}]{!}.  If this looks familiar, it is because it is very similar to the first one.  There are two differences.  The first is the offset comes from a register, and the second is that we can put in an optional shift value.  If we use the shift value, it will be applied to the index in Rm, before that index is applied to the address in Rn.  This allows us to do things like indexing arrays by element, instead of by memory offset.  For example, we could get the second value in an integer array with ldr r0, [r1, r2, LSL #2], if we had the value 1 in r2.  The exclamation point has the same behavior as in the first indexing mode, it stores the address accessed in r1.  Note that with this one we can start on the first element of an array, by starting with a 0 in r2.

The fourth indexing mode also uses an offset stored in a register, but it is similar to the second one.  The documentation shows it as [Rn], +/-Rm {, <opsh>}.  This indexing mode will add the shifted value from Rm to Rn after the memory is accessed.

These indexing modes can allow you to easily traverse arrays and take specific elements from structs.  They are really handy for fast and efficient memory access that does not require a lot of extra instructions to manage indices.  Learning to use these will help you to be a much better ARM assembly programmer.

I mentioned that LDM and STM also have a sort of indexing mode (it's not really, as there is no actual index).  This is much simpler than the others.  The LDM instruction is shown in the documentation with the argument list Rn{!}, <reglist-PC>.  There are a few more forms, but the only part we care about right now is the optional exclamation point.  The register used for Rn contains the base memory address being accessed.  As with the previous indexing modes, the optional exclamation point causes Rn to be updated to point just past the last memory location accessed by the instruction.  This works the same way with the STM instruction.

ARM Assembly: Loops

After if statements, loops are the second most common flow control structures.  Loops can allow us to perform operations across an array of data.  They can let us give the user control of when an application exits.  They can wait for user input or for other I/O to complete.  They can even just consume some time, to make sure a program is not running too fast (though, there are typically much better ways of doing this).  They can count for us.  They are incredibly useful, and like if statements, they are essential to modern computing.

Loops are created with conditional branches.  In assembly, there is no distinction between a for loop and a while loop.  In fact, there aren't even really loops.  We construct loops out of conditional branches.

We will start with a suboptimal but easy to understand example.
.text
.balign 4
.global _start
_start:
    mov r1, #5

    cmp r1, #0
    beq break
loop:
    add r0, #1
    subs r1, r1, #1
    bne loop
break:

    mov r7, #1
    svc #0
This will add 1 to r0 five times, using r1 as the counter variable and counting it down from 5 to 0.  Notice that when I check the condition within the loop, I use subs, instead of cmp.  In this case, I want to decrement the counter, and then I want to check if it has gotten to 0.  I need to save the counter though, so I know where I am in the loop.  I could use sub and then cmp, but using subs lets me combine these instructions.

This is not a terribly efficient way to set up the loop.  The next version replaces the conditional branch before the loop with an unconditional branch to the loop's test, which improves performance slightly, because an unconditional branch cannot suffer a failed branch prediction.  Because ARM has the s suffix, we don't actually reduce the size of the program by doing this; in most other assembly languages, this would save us an instruction or two.
.text
.balign 4
.global _start
_start:
    mov r1, #5

    b test
loop:
    add r0, #1
    sub r1, r1, #1
test:
    cmp r1, #0

    bne loop
break:

    mov r7, #1
    svc #0
This will work exactly like the previous version, except slightly faster.  These examples are basic for loops.  A for loop should not run at all if the break out condition is already met.  This means that we need to test the counter before we start the loop.  The old way of doing this is an extra cmp before the loop, and a conditional branch to skip the loop if the condition is met.  That is exactly what we did in the first example.  The C compiler has been updated, improved, and optimized over a rather long time, and this includes finding more efficient ways of doing common tasks.  At some point someone realized that instead of testing the condition before starting the loop, they could just jump straight to the end part of the loop, where the condition is tested.  This eliminates a conditional branch, avoiding the problem of failed branch prediction.  In assembly languages that don't have an s suffix, it also reduces the number of instructions.  In ARM, we actually had to separate the sub and the cmp to do this, which eliminated the benefit of reducing the number of instructions.

It turns out this optimization is a trade off.  Notice that the original loop has 3 instructions inside the loop.  The second one has 4 instructions.  If we counted the number of instructions actually executed in this program, the second example would execute 5 more instructions than the first.  This means that the first loop is actually faster.  This is not exactly a big difference, but what if we were doing 1,000 iterations?  That would be 1,000 more instructions executed.  If this was a loop in a performance critical part of a program, and it was expected to typically go through a lot of iterations, the cost of failed branch prediction would be more than made up for by using the first example, with fewer instructions.  For a loop that is only expected to do a few iterations, eliminating the cost of failed branch prediction might make up for the extra instruction in the loop.

Of course, on an Intel system, the sub instruction always sets the flags (there is no non-flag-setting form, so nothing like ARM's optional s suffix is needed), which means the subtract-and-test pattern of the first example is simply the normal way to write the loop there.

ARM Assembly: Conditional Branching

One of the most common uses of branching in programs is if statements.  If statements require conditional branches.  Conditional branches only branch if certain conditions are met, otherwise they do nothing.

ARM has a list of 17 condition codes.  One, AL, we don't care about, because it is implied.  This code means "always".  This condition code makes an instruction run unconditionally, but since this is the default, we never actually use it.  There are also two condition codes that are aliases for other condition codes.  This tutorial is not going to cover all of the remaining 16 condition codes.  We will look at the most common two, and we will look at a few others.  If you want a complete list, ARM has a quick reference card that lists them all and can be found with a simple Google search of "ARM Thumb2 QRC".  (I won't guarantee this will never move, but as of this writing, it can be downloaded from ARM's website.  The condition codes are on the last page.)

The processor has some behind-the-scenes mechanics going on to make all of this work.  We will look at it later.  Right now, we will just discuss the conditions that satisfy some of the condition codes.

The two most common condition codes are EQ and NE.  EQ means equal, and NE means not equal.  An if statement that is testing equality will use one of these.  Condition codes generally apply to some previous instruction that does some operation.  If that operation satisfied the conditions of the code, then the instruction the condition code applies to is executed, otherwise it is not.

ARM allows condition codes on a lot of instructions, but right now we are only going to look at conditional branches, as most assembly languages have very few conditional instructions that are not branches.

Let's write a program.
.text
.balign 4
.global _start
_start:
    mov r0, #12

    cmp r0, #10
    bne then
    mov r0, #1
    b endif
then:
    mov r0, #0
endif:

    mov r7, #1
    svc #0
This program starts by putting 12 in r0.  Then we use the cmp instruction.  This instruction will subtract the second value from the first value, but it won't store the result.  So, in this case, the instruction is doing 12 - 10 = 2.  The cmp instruction will tell the processor some information about the result of the operation.  The NE condition code checks to see if the cmp instruction resulted in a 0.  If it did not, then the bne instruction will execute, branching to the then label.  This is equivalent to a C or C++ if (12 == 10).  It may seem counter intuitive to use the not equal condition, but if we want the "if" code to be positioned before the "then" code, we have to do it this way.  Of course, we could reverse the if and then code, and then we would be able to use beq, instead of bne.  This is often a case of personal preference, but if you know one block will be run significantly more often, position it second.  This is because the processor tries to predict whether the branch will happen or not, and most processors just assume it will.  This means that when it does not, the processor has to back up and redo some stuff, wasting some time.  If the more frequently run block is where the conditional branch goes to (the "then" block, in this case), your program will have better performance.

EQ and NE are the most frequently used condition codes, because testing for equality is the most common condition.  What if you need to test for inequality, for example, less than or greater than?  There are condition codes for this as well.  In fact, most of the condition codes are for some kind of inequality.  The MI (minus) condition code tests for the result to be negative.  HI (higher) is for unsigned greater than.  GT (greater than) is for signed greater than.  There are also codes for both signed and unsigned less than, and signed and unsigned greater than or equal and less than or equal.  There are also a couple for checking for overflow, so that we can catch it when an integer rolls over.

Let's try the signed greater than (GT).  I also want to introduce another mechanic here.  The cmp instruction does subtraction, and it does not store the result.  This is often what we want, but it is not always.  Sometimes we do want to save the result.  Sometimes we don't want to do subtraction.  Many of the math instructions have an optional S that can be put at the end, that will tell the processor to keep track of the same things it does for the cmp instruction.  To demonstrate these things, we will also use EQ, but we will use an add instruction.
.text
.balign 4
.global _start
_start:
    mov r1, #17
    mov r2, #23
    subs r2, r2, r1
    bgt if_0
    b endif_0
if_0:
    add r0, r0, #1
endif_0:

    adds r1, r1, r2
    beq if_1
    b endif_1
if_1:
    add r0, #2
endif_1:

    mov r7, #1
    svc #0
We start by putting some values in r1 and r2.  I am saving r0 for keeping track of what succeeded and what did not (because we are not using gcc, the initial value in r0 is 0).  Notice that the sub instruction has an extra s at the end of it this time.  This tells it to update the processor state with some information about the results of the operation.  Since we subtracted 17 from 23, we ended up with a positive value, which indicates that r2 was indeed greater than r1.  My if statements are a bit ugly.  In real life, we would optimize this by just skipping the body of the if statement with a less-than-or-equal-to condition, instead of using the GT condition.  If the result is greater than, we add 1 to r0.  Next, we do an add, with the s, and then check if the result is equal.  Note that EQ is only true when the result is 0, so using add may produce unintuitive results.  If I added 5 to 5, it would not satisfy an EQ condition.  If I added 5 and -5, then it would satisfy that condition.  We typically use sub or cmp for conditionals, but once you understand the different condition codes, you can get away with using other math instructions, so long as you understand the consequences.

The other branch instructions can also be conditional.  Just add the appropriate condition code to the end of the instruction.  So, bl might be bleq, if you only want to call a function if two values are equal.

Conditional branching is something you will want to get comfortable with, because it is the basis of if statements and loops.  You will likely find that the most efficient way to use these is not the most intuitive.  This is one place that is very commonly optimized, because the most optimal solutions are often very unintuitive.

ARM Assembly: Branches

Branches, also called jumps in some assembly languages, are how we get about in our code.  Any time we don't want to run the next instruction and instead want to execute an instruction somewhere else, we use a branch.  Branches change where in the code the program is currently executing.

Branches are a big part of what makes modern computing possible.  They allow us to call functions, create loops, and conditionally run code (if statements).  Without them, our programs would be linear, only taking initial input and always performing the same operations in the same order.  We would not be able to make programs that react to user input provided while the program is running.  Computers would still be good for math and certain kinds of simulation, but things like word processors, internet browsers, and more importantly video games would just not exist.  They would not be able to.  Interactive anything would not be able to exist.  Almost all programs use branches.

Let's start with a simple branch instruction.
.text
.balign 4
.global _start
_start:
    mov r0, #27
    b skip_to_this

    mov r0, #13
skip_to_this:

    mov r7, #1
    svc #0
This program will start by setting the return value to 27.  Then it will branch to the skip_to_this label.  Notice between the branch and the label, there is an instruction that will change the return value to 13.  This instruction will never run, thus the return value will be 27.  (In real life, we don't put code in our program that never runs like this, but this is just a demonstration.  If we really needed the code to stay in the file and not run, we would comment it out.  Otherwise, we would just delete that line, instead of skipping it like this.)

The b instruction always takes a literal memory address (in the form of a label here).  It does not do anything extra.  It just changes the pc register to run the target instruction next.

As we saw in the Function Pointers tutorial, the bx instruction can be used to branch to an address stored in a register (a pointer).  We won't go over that again here.  Please review the last few paragraphs of the Function Pointers tutorial if you need to brush up.

The next branch instruction is bl.  This instruction is used to make function calls.  Typically when we call a function, we want to eventually return to the location the function was called from.  The bl instruction will help us with this, by storing the return address in lr when we call it.  If we hold onto this address, when the function is done, we can return using bx lr.

.text
.balign 4
.global _start
_start:
    bl fun

    mov r7, #1
    svc #0

fun:
    mov r0, #101
    bx lr
When the program starts running, r0 will have the value 0 in it.  In fact, only pc and sp will have non-zero values in them.  The rest of the registers, including lr, will have 0s (_start is not a function).  When we call fun() using bl, the memory address of the following mov instruction will be stored in lr, and the pc will be set to the memory address of the starting mov instruction in the function.  Once the mov instruction is executed, the bx instruction will set the pc to the address in lr, returning control to _start, at the instruction right after the function call.  Note that the bl instruction will overwrite whatever is in lr, regardless of what it is.  So when nesting functions, each must make sure that its return address is stored somewhere it will not be overwritten.

In the Function Pointers tutorial we already used the blx instruction.  It is just the bl and bx instructions combined.  It will store the return address to lr, and it takes a function pointer instead of a label.

ARM Assembly: Function Pointers

Now that we have been introduced to pointers, it is time to look at function pointers.  An important fact to keep in mind here is that the processor does not distinguish between instructions and data.  If you tell the processor to run code at a particular location, it will interpret whatever is there as code, even if it is not intended to be code.  Types only really exist for the programmer.  It might help if you think of instructions as just another type.  Integers are 4 byte data values that represent whole numbers.  Floats are 4 byte data values that represent numbers which can have fractional parts.  Instructions are 4 byte values that are intended to be executed.  All of these data types are encoded exactly the same way, as 32 bit series of bits.  This means that interpretation depends on how you use them.  If you set the pc to some data, it will interpret that data as instructions.  If you perform a floating point operation on an integer, it will happily interpret it as whatever floating point value has that bit pattern.  The processor does not know or care what a given item in memory represents.  It will treat it however you tell it to.

This means that function pointers are just pointers.  When used correctly, they will point to values that you intend to be interpreted as instructions.  The processor does not know this though.  It just knows that you have a value in a register, and when you use it to call a function, it will interpret it as a pointer to executable code.

We will typically get function pointers in the same way that we get any pointer.  We will load the address into a register using the label of the function and a memory read.
.text
.balign 4
.global _start
_start:
    ldr r4, =fun
    blx r4

    mov r7, #1
    svc #0

fun:
    mov r0, #13
    bx lr
We could just call the function with bl fun.  This is significantly more efficient, because the function address is hard coded into the instruction, so we don't need an extra memory read.  What if our program is really big though, and the function is too far away in memory to encode the relative address in the bl instruction?  (Memory addresses are 32 bits.  Instructions are also 32 bits.  This means that an instruction cannot contain a whole memory address.  The ARM processor deals with this by using relative addresses.  Instead of writing the new address to the pc, it will add or subtract some value.  This value has a limited size though, and this means that addresses that are too distant in the program must use registers to store their entire address.)

The program starts by loading the memory address of our function into r4.  This is our function pointer.  We have to use a blx instruction to use a function pointer to call a function.  The function returns 13, which should be the error code for the program.

You can do this with any function in your file or declared as global in another file.  If we were using GCC, we could get a function pointer for printf.  Just like any other value in memory, we can get the address of a function using its label.

Long jumps are not the only application for function pointers.  Function pointers can be used to define behavior without constant conditionals.  For example, perhaps I am writing a program that needs to be able to use several different kinds of networks, but it will only use one at a time.  I could have if statements every time there is a network access.  This is terribly inefficient for a number of reasons.  What if, instead, I set some function pointers based on the kind of network I want?  This is one conditional, and then the network access can just use whatever function pointers I set.  If I need to change the type of network, I can replace the function pointers for the first type with the ones for the new type.

In assembly, you can also make anonymous functions, just by not specifying a label.  These must be called using function pointers, because there is no label to reference.  (This is a pretty advanced topic that I plan to cover much later in this series.)

There is one more kind of pointer to instructions that is not technically a function pointer, because it does not point to the start of a function.  In the above program, the last line is bx lr.  The lr register contains a memory address to an instruction somewhere within a function.  This is the address that we want the function to return to.  Since it is not a true function, we use bx, instead of blx.  We can store memory addresses that are not the beginning of a function in registers, creating pointers to instructions.  Aside from returning from functions, this is not terribly common to use.  If we want to navigate within a function, we generally just use labels, as this is more efficient.  When we want to go to another function, we almost always want to go to the beginning or to a return point.  There may be a few other esoteric applications for this, but for the most part your program will be better structured and far easier to read if you avoid this.

Friday, May 5, 2017

ARM Assembly: Pointers and Global Memory Access

I deliberately decided to write this tutorial after the one about the stack.  This is because the stack is a harder subject, but if you don't learn it first, you may fall into the bad practice of trying to put everything into global memory.  This is especially bad if you start off by storing lr in global memory.  The stack should be your primary go-to place for memory, unless you need to create something in a function that will still be needed after the function returns, or you need to allocate quantities of memory too large to put on the stack.  Even then, using global variables is not always the right solution.

When we learned about the stack, we learned that the stack pointer (sp) points to some location in memory, representing the top of the stack.  The stack pointer is just a pointer in the sense of C pointers.  It has a dedicated register, because it is essential to structured programming.  In reality, we can try to access any address in memory space.  Trying to access 0 will throw a null pointer exception, and trying to access anything past 0xc0000000 (3GB) will give you a permission error.  That memory is reserved for the kernel to keep track of information about the program.  Most of the rest of the space is initially unallocated space that will cause a segfault if you try to access it.

There are a few ways to allocate memory.  The stack is allocated by default, and the operating system will typically make sure there is enough, up to a certain limit (8MB, for example).  There is also a data area, a bss area, and a read only area (possibly more, depending on the OS and architecture) that you can also define, which will be allocated for the program when it starts.

These sections require special directives to create in your program.  Each section has a different purpose.  The data section is typically used for preinitialized variables (they start with data in them).  The bss section is used for uninitialized variables.  The reason you might choose the bss section over a data section padded with 0s is that the data section is written into your executable, taking up more space.  If you need the variables to be preloaded with values, this is usually a secondary concern.  The bss section is allocated when the program starts, which means it does not need space in the executable, aside from recording the amount of memory desired.  The last section, the read only or rodata section, is typically used for constant strings or other run time constants.  Writing to it through the read system call appears to silently fail, not writing anything and not throwing an exception, while trying to write to the rodata area directly causes a segfault.  Put stuff in this section only if you are certain you don't want to overwrite it.  GCC typically puts string constants there.

Perhaps the most valuable area is the data area.  If you find you need to initialize a lot of mutable (changeable) variables at start up, this is probably a good place to put them.  Just make sure the stack would not be a more appropriate location.  For example, perhaps you need some memory to store an IP address.  You want the default to be 127.0.0.1.  Your program is going to load the IP address from a config file, but perhaps this file is empty the first time a user runs the program.  You could allocate 4 bytes for the IP address in the bss section, but now you have to check whether the IP address was in the config file, and if it was, store it to the memory, but if it was not, you will have to put the default value in.  Further, if you want to change this default, that could require finding where the default is put into memory in your config reading code.  If you put this in the data section and there is no address in the config file, you don't have to do anything, because it is already there.  If you want to change it, you can generally find one entry in the data section more easily than a bunch of moves and an LDM in the middle of a configuration section.

So, let's try it.
.data
.balign 4
ip_address:
    .word 127
    .word   0
    .word   0
    .word   1


.text
.global _start
_start:
    ldr r0, =ip_address
    ldr r0, [r0]

    mov r7, #1
    svc #0
This program is not going to read a config file, but it will set the default IP address.   Technically this is backwards, as the OS stores and accepts IP addresses starting with the lowest section and going up to the highest.

Give this a file name and compile this with as filename.s -o filename.o && ld filename.o -o filename.  From there, you can run it and output the error code, which will be 127.

The first thing that happens when you run this program is that the OS loads it into memory and reads some header information to see if it is valid.  By this time, your data area is already in memory.  Then the text section gets the memory address of the global variable and reads the first value.  The first LDR instruction is getting a pointer to the desired data.  This is exactly like a C pointer; it is merely a memory address.  The second LDR line dereferences the pointer, getting the data that it points to.  (You could change the second LDR line to ldr r0, [r0, #4] to get the second value in the IP address.)  Technically, we just created a simple struct that holds an IP address.  We can increase that 4 in multiples of 4 to get the next two elements of our struct.  Alternatively, we could use LDM to load the entire IP address into registers r0 through r3.  (And we could even flip the direction of the read or the write back, by using the right memory access code.)
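For instance, LDM can pull the whole IP address out in one instruction.  A quick sketch using the ip_address label from the program above:

```
    ldr r4, =ip_address   @ r4 is a pointer to the start of the struct
    ldm r4, {r0-r3}       @ r0 = 127, r1 = 0, r2 = 0, r3 = 1
```

This loads all four words in order, starting at the address in r4, into r0 through r3.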

The bss section is initially empty.  You use directives to tell the operating system how much preallocated memory you want, and it will allocate it and initialize it all to 0s for you (I am not certain you can rely on this last behavior on all platforms; some may leave garbage from something else there).
.bss
.balign 4 
some_memory:
    .space 16

.text
.global _start
_start:
    ldr r0, =some_memory
    ldr r0, [r0]

    mov r7, #1
    svc #0
This will allocate 16 bytes of memory (the same as the last program) when the program starts.  If you change the amount of space in the bss section to 1,600 bytes, you will find it increases the size of the executable by only 2 bytes.

The program starts by asking for 16 bytes of space in the bss section, and then it reads the first byte and returns it as an error code.  Again, the first LDR instruction is getting a pointer and the second is dereferencing it, getting the data.  You can add immediate multiples of 4 to the second LDR instruction to see the 0s in the other word sized sections of the 16 byte memory space.
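To see that the bss memory really is writable, you could store a value and read it back.  A sketch using the some_memory label from the program above:

```
    ldr r1, =some_memory
    mov r0, #42
    str r0, [r1, #8]      @ write 42 into the third word
    mov r0, #0            @ clear r0 so we know the load worked
    ldr r0, [r1, #8]      @ read it back; the exit code would be 42
```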

The read only data section requires slightly different syntax on this particular device.  The assembler won't recognize the directive .rodata unless it is preceded by a .section directive.  Essentially this means that the assembler or linker does not support this section by default, so you have to define one.  The linker script used by the linker does recognize it though, and it makes sure to put it in the right place in the program binary, as well as designating it read only.
.section .rodata
.balign 4
write_this:
    .word 27

.text
.balign 4
.global _start
_start:
    mov r0, #13
    ldr r1, =write_this
    str r0, [r1]

    mov r0, #0
    ldr r0, [r1]

    mov r7, #1
    svc #0
This will put the value 27 into the read only data section.  From there we put the number 13 in r0, and then we try to write r0 to the variable we created in the read only section.  The last part before returning first overwrites r0 with 0, to make sure we don't have an old value in there, and then we read the value from the rodata area.  If the write succeeded, we should get a 13 as the error code, but if it did not, we will probably get a segfault.  Using the debugger, I identified the location of the segfault this program throws as the STR instruction that attempts to overwrite the read only data.

As before, the LDR at the beginning is getting a pointer.  The STR instruction dereferences the pointer and attempts to write to it this time (changing the stored value to a new one).  The last LDR does not run (the program crashes before this happens), but it would dereference the pointer stored in r1 and read the value into a register.

These different sections can give you a lot of control over your program.  If you want to make sure that a value is never changed, the read only section is very useful.  If you need your values initialized, the data section is probably the best.  If you cannot preinitialize your data, then maybe the bss section is what you need.

The things to keep in mind are that putting stuff in the data section will save the program time it would otherwise use to initialize, but putting it in the bss section will reduce the size of the executable.  The read only section is more of an error detection and security mechanism that can help prevent unskilled programmers and hackers from breaking things or sneaking malicious code or data in where it should not be.

ARM Assembly: The Stack

If you have been following the tutorials in order, you have likely hit a point where you are wondering what to do if you don't have enough registers to hold everything you need.  Of course, with so many registers, maybe you are not there yet.  Either way, now is a good time to learn about local memory.

The stack is where a process stores local variables.  The stack is also used to hold onto anything it needs to store right before or after making a function call.  The stack is generally fairly small, often having a limit of only 8MB.  This means it is not good for storing large data structures.  What it is good for is storing information that is being used in a function but will no longer matter once the function returns.  In other words, the stack is good for storing local variables that go out of scope when the function returns.

Before we start trying to use it, we need to know how it works.  There are several different ways the stack can work, depending on the system.  The stack can grow up or down, and the stack pointer can either point at the last value stored on the stack, or it can point to the first unused address on the stack.  This presents four options, and ARM assembly is equipped to handle all of them.  Thankfully, we only need to know one.

Linux on ARM grows the stack down, and the stack pointer (sp) points to the last value on the stack.  This means, if you want to allocate more space on the stack, first you decrease the sp by the amount of memory you want, then you write to that memory.  When you are done using that memory, you increase the sp to its previous location.  This paragraph contains some of the most important information for using the stack, so reread it if you need to.

The second most important thing you need to know about the stack is that when you return from a function you must set the sp to exactly where it was when the function was called.  If you don't, then the calling function won't know where to find its local variables, which will almost certainly lead to unexpected behavior and eventually a segfault.  We will talk more about this when we learn how to make functions.

Now that we know the basics of how the stack works, let's try it out.  First we need to learn how to store data in memory and how to load it back into registers.  There is a set of instructions for this.  The easiest ones are LDR and STR, for load [to] register, and store [from] register.  The ARM quick reference card shows these instructions:
<op>{size}{T} Rd, [Rn {, #<offset>}]{!}
<op>{size}{T} Rd, [Rn],  #<offset>
<op>{size} Rd, [Rn, +/-Rm {, <opsh>}]{!}
<op>{size}{T} Rd, [Rn], +/-Rm {, <opsh>}
<op>{size} Rd, <label>
If you think this is confusing, you are not alone.  I had a hard time with this initially too.  The first thing you need to know is that <op> is either LDR or STR.  For LDR, Rd is the destination, where the value is being loaded to.  For STR, Rd is the register that contains the value to store.  The {size} portion is a code, telling the processor how much data we are loading or storing.  The options are B, SB, H, or SH, where B is byte or 8 bits, SB is signed byte, H is halfword or 16 bits, and SH is signed halfword.  Note that only LDR cares about whether a value is signed or not, so SB and SH won't work with STR.  If the size is omitted, then the instructions will load and store 32 bits, which is the full size of a register.  Here signed and unsigned make no difference for LDR or STR.  The {T} is an optional character that can be used to specify privilege level, which only matters to the kernel, so we won't worry about it.
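As a sketch (assuming r1 already holds a valid, writable address), the size suffixes look like this in practice:

```
    ldrb  r0, [r1]      @ load 1 byte, zero extended to 32 bits
    ldrsb r0, [r1]      @ load 1 byte, sign extended
    ldrh  r0, [r1]      @ load a 16 bit halfword, zero extended
    ldrsh r0, [r1]      @ load a 16 bit halfword, sign extended
    strb  r0, [r1]      @ store the low 8 bits of r0
    strh  r0, [r1]      @ store the low 16 bits of r0
```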

Now let's talk about addressing modes.  ARM provides several ways to specify memory addresses, so that we don't have to waste precious instructions calculating them manually.  The first four options above start with a register that contains an address.  The Rn register contains a base address.  In the first two, an immediate offset is added to that base address.  The first will add the offset, then load or store using the calculated address.  If the optional ! is used, the calculated address is written back into Rn.  This can be used for easily traversing arrays.  The second option will access (load from or store to) the memory location in Rn, then it will add the offset to Rn and store the result in Rn.  The difference here is whether the offset is applied before or after accessing memory.  The third and fourth allow an offset that is stored in a register, which may be added to or subtracted from the base address and which can optionally be shifted.  <opsh> is a shift specification of the form LSL #<shift>, where LSL (logical shift left) can be replaced with any of the other shift codes.  This allows Rm to be an offset that is shifted before being applied, which is useful for traversing some types of data structures.  As with the first two, the third will update the base address in Rn if the ! is included, while the fourth will access the memory at Rn first and then update Rn with the offset.  The last option uses a label.  When the program is assembled, this label is converted into an offset relative to the current position of the pc, essentially hard coding the memory location.
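For example, the shifted register form lets us index a word array without a separate multiply.  A sketch, assuming r0 holds the base address of a word array and r1 holds an index:

```
    ldr r2, [r0, r1, lsl #2]   @ r2 = array[r1]; the offset is r1 * 4
    ldr r2, [r0, #4]!          @ pre-index with writeback: add 4 to r0, then load
    ldr r2, [r0], #4           @ post-index: load from r0, then add 4 to r0
```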

For accessing the stack, we only care about the first two of these.  The first one is for accessing memory without updating the address and for accessing memory where the offset is applied before accessing the memory, then the memory address is updated.  The second one is for accessing memory before applying the offset, then applying the offset and updating the memory address.  We don't need optionally shifted offsets, nor do we need offsets that are stored in registers, so we are not going to worry about the third and fourth options.  We are also not using labels, so the fifth option is not useful right now either.

Now we come to the point where how the stack works is important.  There are two ways we can deal with the stack.  The first is using SUB and ADD to decrement and increment the stack pointer directly.  For example, if we need to allocate 8 bytes on the stack, we might do this:
sub sp, sp, #8
This will reduce the sp by 8, giving us access to 8 bytes of memory on the stack.  When we are done, before we return, we would do this, to put it back where it was:
add sp, sp, #8
And now we have deallocated 8 bytes.  There is a cheaper way to do this though.  While it is sometimes better to just allocate as much as you will need at once, if you only need to put one or two things on the stack, it may be better to just do this:
str r0, [sp, #-4]!
This will decrement the stack by 4 bytes (the size of a register), storing the calculated address in sp, then it will store r0 to that location in memory.  This takes care of decrementing the stack and storing the data all in a single instruction.  Once we are done, there are two ways we can finish.  If we don't need the value, we can just add 4 to the sp and walk away.  If we do need the value though, we can do this:
ldr r0, [sp], #4
Notice that we used the first option above for storing and the second for loading.  This is important.  When storing something on the stack, we need to decrement before we store the value.  When loading, we need to load and then increment.  So, this time, we are loading the value from the stack (sp is pointing at the last value stored, which is what we want to load), then once our value is loaded, it will increment the stack pointer by 4, deallocating the stack memory we were using.

You should familiarize yourself with the different memory addressing modes.  For this particular system, pre-indexing (the first option) stores and post-indexing loads is the right way to deal with the stack, but if you ever program in ARM assembly for another OS, you may find this is not the case.  Further, when traversing arrays and other data structures, you may also find that this is not the best strategy.  Understanding how each addressing mode works is valuable for a lot of applications.

Now, you should also be aware that there is an easier way to do this.  ARM defines two pseudo instructions to help you work with the stack more easily.  These instructions are PUSH and POP.  The assembler turns PUSH and POP into the above STR and LDR instructions on the Raspberry Pi, but they are defined more loosely than that.  Whatever system you are compiling for, they will work with the stack as is appropriate for that system.  So, if you take a program written and tested on the Pi using PUSH and POP, assuming everything else is the same, it will compile correctly for other ARM systems that use the stack differently.  This is not all these instructions can do though.  Before we go any further, let's write some example programs.

We are going to write a program that uses the stack with STR and LDR, then alter it to use PUSH and POP.  There is one thing that is frequently stored on the stack at the beginning of a function.  This is the link register, lr.  We will get into function calls more later, but the two instructions for making function calls overwrite the lr register with a pointer to the instruction directly after the function call.  This is the return point, and it is how the processor helps our programs to keep track of where function calls need to return to.  There is a problem here though.  What happens if we call a function within another function?  The answer is the lr is overwritten with the new return point.  If we don't store the return pointer somewhere else, then it will be lost, and our program won't know where to return to.  In practice, we will probably end up with an infinite loop between the return point of the nested function and the end of the current function.  The universal solution to this is to store lr on the stack in any function that will be calling another function.  When we need to return, we will load it from the stack and use it.
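A sketch of that pattern, using a hypothetical helper function named do_work:

```
outer:
    str lr, [sp, #-4]!   @ save our return point before bl overwrites lr
    bl do_work           @ bl stores its own return point in lr
    ldr lr, [sp], #4     @ restore our original return point
    bx lr                @ return to whoever called outer
```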

There is another important use of the stack when using functions.  We only have about 12 registers we can use.  It is not uncommon to have function calls nesting quite deep, and it is not reasonable for us to allocate registers to specific functions.  Instead, we have a set of standards that functions ought to follow, which, when followed, will guarantee that they don't step on each other's toes.  We won't get deep into this at this point, but there are two things relevant here.  The first is that r0 through r3 are used for passing arguments (and r0 is used for the return value).  Further, even if you don't pass arguments in them, the caller may not assume that the values in them will be the same when a function returns.  Essentially, functions are allowed to do whatever they want with r0 through r3, and calling functions must accept that.  In practice, this means that if you have values in these registers that you want to keep, you must store them somewhere else before calling a function.  The second is that values in r4 through r11 are expected to be maintained through a function call.  This means that any function that wishes to use any of these registers must store the existing values and then restore them before returning.  Both of these things can be handled using the stack.  We will focus on the second, as the first is more often handled either by doing work primarily in r4 through r11 to avoid the need to store before calling a function or by copying values in r0 through r3 into the r4 to r11 range.

We will continue using GCC as our compiler here, since main() is a function, and thus we should follow these function call procedures in it.  We are going to do some work in r4, r5, and r6, and leave the result in r0, but since main() is a function, we must save whatever is in those registers and restore it before returning.
.text
.global main
main:
    str lr, [sp, #-4]!
    str r4, [sp, #-4]!
    str r5, [sp, #-4]!
    str r6, [sp, #-4]!

    mov r4, #13
    mov r5, #27
    add r6, r4, r5
    sub r0, r6, #7

    ldr r6, [sp], #4
    ldr r5, [sp], #4
    ldr r4, [sp], #4
    ldr lr, [sp], #4
    bx lr
Our program starts by putting any registers from r4 to r11 that we will be using onto the stack.  Then we use those registers to do some stuff.  In this case, we could have used r1, r2, and r3 just as easily, but if we had wanted to print out the value using printf(), then we would have had to move what we wanted to keep out of those registers to the higher ones.

Once we are done, we restore the values from the stack to the registers.  Note that we do this in reverse order.  The stack works exactly like a stack data structure when used this way.  The last thing we put in will be the first thing that comes out.  This means we must pull things out in reverse order so they will all be restored to the same places.

Something else important to note is that we are pre-decrementing when putting things on the stack and post-incrementing when we pull things off.  This is because the stack grows downward, and the stack pointer always points at the last item on the stack.  And recall that when we return, the stack pointer must point at the same place it was at when the function was called.  If we add up the decrements and increments, we get -16 at the beginning of the function, and +16 at the end, which breaks even.

Lastly, once we compile and run the program, we can use echo $? to print the return value, which should be 13 + 27 - 7 or 33.

This function is rather inefficient.  We are doing a lot of memory access: a total of 8 accesses, and because they happen in succession, each has to wait for the previous one to complete.  Because ARM is very limited in instructions that access memory, it provides some instructions to optimize memory accesses.  These are LDM and STM.  The M on the end of each means "multiple".  We can use these instructions to transfer the data in many registers to and from memory in a single instruction.  I can't say I understand how the wiring works for this, but I assume that the wiring between registers and memory is fairly robust to allow it.

We are not going to use these instructions directly right now, because they have to care about the order things are transferred in.  We would not want to store r0 through r3, and then reload the data and have it end up with the first value in r3 and the last in r0.  So these have a sort of addressing mode system, giving each of them four versions.  The PUSH and POP instructions don't just apply to writing single registers to the stack.  If you give them a list of registers, instead of using LDR and STR, the assembler will use LDM and STM, to optimize performance.
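As a sketch, on a full descending stack like ours, PUSH and POP correspond to STM and LDM with the DB (decrement before) and IA (increment after) codes:

```
    stmdb sp!, {r4-r6, lr}   @ equivalent to push {r4, r5, r6, lr}
    ldmia sp!, {r4-r6, lr}   @ equivalent to pop {r4, r5, r6, lr}
```

The ! writes the updated address back into sp, just as it does with STR and LDR.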

Let's rewrite the above program using PUSH and POP to transfer multiple values at once!

.text
.global main
main:
    push {r4, r5, r6, lr}

    mov r4, #13
    mov r5, #27
    add r6, r4, r5
    sub r0, r6, #7

    pop {r4, r5, r6, lr}
    bx lr
This program will work identically to the first one, except cleaner and faster.  In fact, it should run a lot faster, because we cut the number of memory access instructions by 75% (from 8 down to 2), and none of the other code takes significant time in comparison.

Note that PUSH and POP always produce code using the stack pointer.  If you want to load or store data anywhere else, you will have to use STR, LDR, STM, or LDM directly.

As usual, running this program should return the value 33 as the error code.