Monday, October 16, 2017

Ideological Programming

Today I was following what is actually a very good tutorial on SDL2, in C (despite it being written for C++).  I came across an excessively complex section of code.  The tutorial has me writing an initialization function.  This is an excellent idea, as it groups a bunch of related actions together.  Unfortunately, it is both hard to read and hard to write, and as far as I can tell, it is purely because the writer subscribes to a particular programming ideology.

This function starts by initializing a return variable, used to indicate failure or success.  Then it initializes the video subsystem.  If that fails, it sets the return variable to false; otherwise it enters a nested conditional that creates a window and tests for success, then another nested conditional that starts up the image loading subsystem, and only then does it get a surface to draw to.  This is nested three levels deep.  Each level tests whether some initialization worked, and if not, it prints an error message and sets the return variable to false.  The nested conditionals make the code hard to read and hard to write.  There is a reason the writer of the tutorial did this though.  It is an ideological reason.
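Here is a sketch of the shape of that code (not the tutorial's exact listing, and the window title and size are placeholders, but the SDL2 calls are the real ones):

#include <stdio.h>
#include <stdbool.h>
#include <SDL2/SDL.h>
#include <SDL2/SDL_image.h>

SDL_Window *gWindow = NULL;
SDL_Surface *gScreenSurface = NULL;

bool init(void)
{
    bool success = true;  /* the single return variable */

    if (SDL_Init(SDL_INIT_VIDEO) < 0) {
        printf("SDL could not initialize! SDL Error: %s\n", SDL_GetError());
        success = false;
    } else {
        gWindow = SDL_CreateWindow("Demo", SDL_WINDOWPOS_UNDEFINED,
                                   SDL_WINDOWPOS_UNDEFINED, 640, 480,
                                   SDL_WINDOW_SHOWN);
        if (gWindow == NULL) {
            printf("Window could not be created! SDL Error: %s\n", SDL_GetError());
            success = false;
        } else {
            if (!(IMG_Init(IMG_INIT_PNG) & IMG_INIT_PNG)) {
                printf("SDL_image could not initialize! IMG Error: %s\n", IMG_GetError());
                success = false;
            } else {
                gScreenSurface = SDL_GetWindowSurface(gWindow);
            }
        }
    }

    return success;  /* the single return point */
}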

There is a programming ideology that says a function should only have a single return point.  In other words, it should not return anywhere except at the end.  Programmers who subscribe to this tend to use return variables and nested conditionals to avoid returning from inside a conditional before the end.  What is the benefit of this ideology?  There is none.  It is based on the idea of mathematical functions.  Mathematical functions can only have a single return point, and some programmers believe that this should be applied to functions in programming.  The fact, however, is that the word "function" means very different things in mathematics and in programming.  Programming functions are loosely based on mathematical functions, but they are not, by any stretch, the same.  There is one exception: in functional programming languages, functions are analogous to mathematical functions.  As such, pure functional languages don't allow multiple return points in functions (though this may not always be obvious).  C is not a functional language.

There is no benefit to adding this artificial programming constraint.  In fact, in the worst case, it will result in inferior machine code.  A well designed compiler might be able to optimize away most of the extra work the nested conditionals do to reach the single return point when a failure occurs, but the result will never be better than a set of flat conditionals where a failure just returns immediately, and in many cases it will be worse.  Intermediate return points avoid unnecessary work, especially in a case like this.  More importantly, the flattened version of the code is both more readable and easier to comment.  The ideology might sound good, but it applies pure math principles to a more complex and flexible discipline, and in doing so it produces worse outcomes.
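For comparison, here is the same init function flattened with intermediate returns (same includes and globals as the sketch above).  Each failure case reads straight through, and each step is easy to comment:

bool init(void)
{
    /* Start the video subsystem. */
    if (SDL_Init(SDL_INIT_VIDEO) < 0) {
        printf("SDL could not initialize! SDL Error: %s\n", SDL_GetError());
        return false;
    }

    /* Create the window. */
    gWindow = SDL_CreateWindow("Demo", SDL_WINDOWPOS_UNDEFINED,
                               SDL_WINDOWPOS_UNDEFINED, 640, 480,
                               SDL_WINDOW_SHOWN);
    if (gWindow == NULL) {
        printf("Window could not be created! SDL Error: %s\n", SDL_GetError());
        return false;
    }

    /* Start the image loading subsystem. */
    if (!(IMG_Init(IMG_INIT_PNG) & IMG_INIT_PNG)) {
        printf("SDL_image could not initialize! IMG Error: %s\n", IMG_GetError());
        return false;
    }

    /* Only now do we get a surface to draw to. */
    gScreenSurface = SDL_GetWindowSurface(gWindow);
    return true;
}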

This is not the only programming ideology that is expensive and wasteful without producing any benefit.  Object orientation as a programming paradigm is founded on the ideology that everything should be an object.  Following this ideology results in hard to read code that uses significantly more memory than necessary and performs poorly on modern computing systems.  Programming to a particular ideology might have some academic value, but in real life situations, it generally just causes problems.

When writing a program, these should be the primary goals, starting with the most important:
  1. The project is completed within acceptable margins for cost and schedule.
  2. The program produces the correct output for every possible input.
  3. The program is cheap and easy to maintain.
You can add other goals after these, but most will be dependent on the particular project.  Aesthetics might be important for one client, while specific features may be important to another.  Notice, however, that none of these three says anything about following a particular programming paradigm or ideology.  Unless it is a critical requirement of the project (which is almost unheard of outside of academia), maintainability always trumps adherence to a particular paradigm.  Following a constraint because something in a related field happens to have a similar name is not an excuse for writing code that is hard to read, hard to comment, and hard to maintain.  Sometimes performance or correctness is more important than maintainability, but few of these ideological programming constraints improve performance or correctness, and many of them actually hinder one or both.

Ideological programming can be fun as a hobby or a purely academic pursuit, but in the real world, following abstract rules that have no practical value is a waste of everyone's time and money.  Effective programming is about using your tools efficiently and effectively.  When you eliminate a tool from your toolbox for a purely ideological reason, you artificially limit your skill and the quality of your work.  Just say no to ideological programming.

Monday, June 19, 2017

Why Are Pipes Hard?

I learned something amazing today.  I have done a few small multi-processing tasks in the past, some using Python and one using Java.  The most appropriate form of interprocess communication for these was pipes.  The experience was always very painful.  I had to figure out how to cobble together the right objects to produce a usable communication system.  I had to learn some special uses of objects to allow the parent process to continue executing, instead of waiting for the child to finish.  Working out the bugs was very difficult, because it was hard to tell if the problem was my communication protocol, if I had connected my objects wrong, or if maybe I had given an advanced function or method some incorrect parameter.  Of course, I expect this sort of difficulty.  I had heard from others just how hard interprocess communication is, especially with pipes.  That is just the way it is, I guess.

Today, a friend sent me a link to this.  That document discusses several types of interprocess communication.  There is one chapter on pipes, and the author starts the chapter, "There is no form of IPC that is simpler than pipes."  I found this rather amusing and slightly confusing, as it was not my experience.  Shortly after that, he adds that the document will not spend much time on pipes, because they are so easy.  By the end of the section, I knew two important things.  First, the author is right; pipes are incredibly easy.  Second, everyone else is doing pipes wrong.

In Python, you have to create and pass pipes into the function call that starts a new process.  I forget the exact details, but I seem to recall having to use a different function, or at least pass some obscure parameter into the regular one, to get it to return before the child process was completed.  Then I had to use some method of the pipe object to interact with the child, and this was rather difficult to get right.  In one case I did have some issues with my communication protocol, but I thought it was yet another issue with how I was trying to use the pipe.  It took half an hour to figure out that the problem was some trivial thing with how I was formatting the data I was passing through the pipe.  In short, the system was completely opaque and very difficult to use.  Java was not much better (in fact, I think I gave up on communication while the child process was still running, and just made the program call multiple times).  The friend who sent me the link told me he had the same experience when he read that section, except with the Go programming language.

Now, that document describes how to use pipes in C.  First, you create a pipe.  This requires merely making a single function call, passing in an array of two integers.  The array is populated with a pair of file descriptors, for the reading and writing ends of the pipe respectively.  Then, you treat those file descriptors like you would any regular file!  The main difference is that they are more like standard in and standard out than a file on disk, but any programmer worth his or her salt should know how to deal with that!  The document goes on to use read and write calls to send and receive data, but you could easily use any function that reads or writes files for this.  If there is anything complicated about pipes, it is only that they have limited buffer space, but the default (64KB on modern Linux) is a pretty big limit.  Once you have a pipe, you just fork the process, which will provide the child process with copies of the file descriptors for the pipe.  Now the two processes can communicate with read and write calls (or other file I/O calls) on the pipe.  Of course, if you need two way communication, you will need two pipes, but that is trivial.
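In case that sounds too simple to be true, here is roughly the whole thing in C (a minimal sketch; real code would also check the result of fork and close the pipe ends each process does not use):

#include <stdio.h>
#include <string.h>
#include <unistd.h>     /* pipe, fork, read, write */

int main(void)
{
    int fds[2];         /* fds[0] is the read end, fds[1] is the write end */
    char buffer[32];

    if (pipe(fds) == -1) {
        perror("pipe");
        return 1;
    }

    if (fork() == 0) {
        /* Child: it inherited copies of both file descriptors. */
        const char *msg = "hello from the child";
        write(fds[1], msg, strlen(msg) + 1);
        return 0;
    }

    /* Parent: read whatever the child sent, just like reading a file. */
    read(fds[0], buffer, sizeof(buffer));
    printf("parent got: %s\n", buffer);
    return 0;
}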

The fact is, pipes are incredibly easy!  For some reason though, only low level languages manage to get them right.  I don't often have something bad to say about Python, but this is one place where Python got it totally wrong!  I have plenty of bad things to say about Java, so it does not really surprise me that Java missed the mark on this one.  So why are pipes so hard in languages that are higher level than C or C++?  In fact, I am almost certain that pipes are easier in assembly language than in higher level languages (and I know what I am talking about here).  Clearly the operating system has it right.  Why can't high level languages get it right?  They are supposed to make programming easier, if I am not mistaken, but when I start considering writing a C module for Python, because that would be so much easier than just using the built in functionality, you know something is wrong.

The problem is not terribly difficult for me to identify, but much of the programming world is so in love with Object Oriented Programming that they don't want to hear anything bad about it, let alone admit that it might have a serious problem.  The problem here is that the OS gets it right in the first place, so it does not need complicated wrappers to make it easier.  At the OS level, pipes are about as simple as it gets.  C is wise to just leave it alone and let it do its thing.  Python, Java, and other languages, however, feel obligated to wrap everything in objects, because hey, object oriented.  The fact is, pipes don't need objects.  They are perfectly fine just as they are.  Wrapping them in objects makes them absurdly more complicated and difficult to use, without adding anything of value.  I was actually stunned when I saw the C code in that document.  I had a hard time believing that it was correct, because my experience did not support that conclusion.  Using objects where they are unnecessary is bad design, and we really need to teach programmers to stop doing it.  There are good reasons to use objects, but there are also good reasons not to.

I have talked about this before, but the problem here is OOP, and that is just as absurd as Integer Oriented Programming (or Bit Oriented Programming, as a friend suggested).  When we decide to orient all of our programming around a single data structure (or metadata structure, in this case), we limit ourselves.  We destroy our ability to make good decisions about what tools will be the best for each problem.  Objects are a great tool, but we don't use a chain saw to screw in a screw or hammer in a nail.  Honestly, I feel like multi-processing in Python or Java is just like following some really complicated instructions for how to push a peg into a hole with a table saw without ever touching the peg or injuring yourself.  It would be really nice if those languages would just let me pick up the peg and push it into the hole with my finger, instead of trying to build a huge complex machine around such a simple process.

Thursday, June 1, 2017

Reusable Software Design

I just wrote about reusable software, and I figured reusable software design deserves its own article.  Perhaps the biggest problem I identified with reusable code is that if it has problems, the time required to learn the code just to fix it largely negates any value in using it in the first place.  If we could easily understand reusable code well enough to consistently adapt it to our own use cases, it would be perfect, but that is not how code works.  Code is inherently hard to understand.  Often, we even have a hard time understanding our own code after a few weeks of separation.  Isn't there some way we can consistently benefit from existing work, without the risk of having to spend so much time learning how it works that the cost is greater than the benefits?  It turns out that there is, and in some ways it may even be able to mitigate the costs of choosing not to use reusable code.  The solution is reusable design!

Design patterns have formally existed since object oriented programming became a thing.  Most design patterns are built around OOP.  They don't have to be though.  Things like loops and coherent if statements are primitive design patterns.  Functions are a kind of design pattern as well.  We often don't recognize this, because these have become integral elements of programming, but it is true.  These are all examples of design patterns that don't rely on object orientation.  In fact, very few design patterns require OOP.  Around a year ago, I wrote an event handler that uses the Event Queue design pattern, in C, and I have used the Observer pattern in C as well.  Any programming style or language can use design patterns (even assembly language).  Design patterns are not limited to OOP languages.
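To show that nothing about this requires objects, here is a minimal sketch of the Observer pattern in plain C (the names and the fixed-size subscriber table are my choices, not from any library):

#include <stdio.h>

typedef void (*observer_fn)(int event);

#define MAX_OBSERVERS 8

struct subject {
    observer_fn observers[MAX_OBSERVERS];
    int count;
};

/* Register an observer with the subject. */
void subscribe(struct subject *s, observer_fn fn)
{
    if (s->count < MAX_OBSERVERS)
        s->observers[s->count++] = fn;
}

/* Notify every subscribed observer of an event. */
void notify(struct subject *s, int event)
{
    for (int i = 0; i < s->count; i++)
        s->observers[i](event);
}

void log_event(int event) { printf("event %d\n", event); }

int main(void)
{
    struct subject s = { .count = 0 };
    subscribe(&s, log_event);
    notify(&s, 42);     /* every subscribed observer gets the event */
    return 0;
}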

Design patterns are superior to concrete code for reusability, because they are more abstract and flexible.  Design patterns don't specify implementation.  They merely provide a template.  For example, the "for loop" design pattern does not specify the exact number of iterations, the data type of the index variable, or the iteration step.  We can use it to do simple counting, to step through a string one (or two, or five...) character at a time, or to traverse an array.  These are flexible implementation details.  The if statement may or may not have an else or else-if clause.  The expression used to generate the boolean value used as the condition is not defined for us.  We can use if statements for all kinds of things, because implementation is not defined for us.  Design patterns are flexible enough that it is trivial to adapt them to a particular application.
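For instance, all three of these are the same "for loop" pattern, with different implementation details filled in:

#include <stdio.h>

int main(void)
{
    char str[] = "abcdef";
    int array[] = { 3, 1, 4, 1, 5 };
    int sum = 0;

    for (int i = 0; i < 3; i++)                 /* simple counting */
        printf("count: %d\n", i);

    for (char *p = str; *p != '\0'; p += 2)     /* every other character */
        putchar(*p);
    putchar('\n');

    for (int i = 0; i < 5; i++)                 /* traversing an array */
        sum += array[i];
    printf("sum: %d\n", sum);

    return 0;
}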

Good design can save a lot of coding time.  A friend of mine likes to say, "Hours of coding can save minutes of design."  Reusable code is generally endorsed for how much time it can save developers, but it comes with potentially serious costs.  Good design may take longer than crummy design, but it always pays for itself many times over.  Good design can save more time than using reusable code can.  Reusable design (design patterns) can save significant amounts of design time.  This is a double win, because judicious use of reusable design comes with no additional costs, and time spent (or saved) on design is generally more valuable than time spent on implementation.  When we save design time, we also save significantly more coding time.

Good design can reduce the need for reusable code, by finding ways to make coding from scratch faster than learning and using an external library or framework.  When reusable code is the best option, good design can optimize its use to reduce the costs.

Reusable design is, in my opinion, of far greater value than reusable code.  Design patterns are generic enough to solve a large number of problems, and they are easy enough to understand that customizing them to a particular application is straightforward.  Because design time is more valuable than coding time, they also have the potential to save far more time than reusable code.  Reusable design is probably something we should spend significantly more time and effort on.

Reusable Software

Ever since the invention of modular programming, reusable software has been a huge deal.  Nearly every programming language comes with built-in reusable parts or some kind of standard library.  A great number of articles have been written extolling the many virtues of reusable software components.  Enormous repositories of reusable parts have been created.  Most problem spaces have complex ecosystems of frameworks, libraries, and sometimes even just short snippets of code or individual reusable functions.  Anything this good has to come at a cost though, and the costs of reusable software are almost universally ignored.

First I want to say that reusable software is awesome.  I am not trying to bash it here.  Reusable software has many great benefits.  It saves a ton of programming time.  Popular reusable components tend to be fairly high quality, because they have already been tried, tested, fixed, and improved.  Reusable components are typically fairly easy to learn and use, because no one wants to use something that is more expensive to learn and use than writing it fresh would be.  Most of the time, reusable software can be used to make reasonably high quality software far faster than writing it from scratch.

Reusable software is a double edged sword, and most programmers cannot see the edge that is facing them.  Reusable software is so good that it is often difficult to see the costs, and even when the costs are known, they are often ignored.  Most of the time, the costs of reusable software are acceptable, but occasionally they can cause serious problems.  For example, recently a large number of web sites stopped working because a trivial piece of reusable code (the left-pad package) was removed from the npm repository.  The time savings from using the code was probably between 1 and 5 minutes for a single developer.  The total cost in web site down time was probably in the hundreds of thousands or millions of dollars.  This is not a common scenario, but it is one that could have easily been avoided by spending a few minutes to write a trivial piece of code, instead of relying on an external source to remain available forever.

Perhaps the most obvious cost of reusable software is fitness for a particular purpose.  To be useful, reusable code must be usable for a variety of applications.  This means it must be generic.  One cost of making something generic is that it becomes less suitable for nearly every application, especially very specialized ones.  For flexible applications that are not critical, the application can generally be adapted to the reusable code.  This will almost certainly affect design, but it rarely affects the utility and usability of the application.  Sometimes, however, it does.  Some specialized applications have strict design requirements that a generic component would violate.  In these cases, reusable software just cannot be used.  Occasionally, reusable software will seriously limit the design choices of even more mundane software in ways that are unacceptable.  In these cases, reusable software should not be used.  If using a particular framework or library feels like trying to fit a square peg in a round hole, it may be time to consider alternatives, including writing the code from scratch.

Another fairly obvious but deliberately ignored cost of reusable software is performance.  Making something generic means making it suitable for a wide variety of use cases.  This means a lot of features and a lot of "just in case" elements.  These use memory and processing power, whether you actually need the features or not.  In many applications, this does not make a huge difference.  In some though, it can make an enormous difference.  In applications that are performance critical, reusable code in bottlenecks can be a major problem.  Even in applications where performance is not critical, bloated or slow reusable code can be a problem.  Modern computers never run just one application at a time.  The typical computer is running 10 to 50 processes at any given moment, and all of those processes have to share resources.  While it might seem fine to waste a few megabytes here and there, when there are 50 processes wasting a few megabytes each, memory use can become a serious problem.  Similarly, processor time must also be shared between processes, and a process that is wasteful can affect the performance of the entire machine.  For many kinds of applications this is rarely a problem, but for some (web pages, for example, where a user may have anywhere from 5 to 100 tabs open at a time), a little resource hogging can go a long way.  In general, it is a good idea to keep in mind that reusable code is consuming resources for all of its features, even those you are not using.  A good rule of thumb is that if you are not using more than one or two features of a framework, it might be time to consider looking for something lighter or writing the code from scratch.

Reusable software makes code less maintainable.  This is very counter-intuitive.  There is a general assumption that using a framework or library takes some of the maintenance burden off of the programmer, and this is true.  Reusable code can definitely reduce the burden of maintenance, but this is not about maintenance time spent or saved.  This is about being able to maintain the code in the first place.  Learning exactly how a piece of reusable code works takes enough time that it defeats the purpose of using it, so few programmers bother.  If the reusable code itself has a bug, however, the only options are to ditch the reusable code and home brew a replacement, to pay someone to spend hours, days, or weeks learning how the code works and fixing the bug, or to report the bug to the project and wait for it to get fixed upstream.  The second option is so expensive that it is hardly ever an option.  The typical solution is to sacrifice quality, utility, and usability by working around the bug.

In addition, because the programmers using reusable code are not intimately familiar with its internals, it is significantly harder to change, and valuable design changes to a program may be impossible.  As with bugs, if a valuable design change is needed that does not work with the reusable software, the options are to ditch it and home brew a replacement, or to pay someone for a lot of labor to figure out how to adapt the reusable code.  Of course, once the reusable code has been altered, updates from the original source become invalid, and the reusable code must be maintained entirely in-house, at additional expense.  Again, this is not necessarily a problem that will come up a lot.  Major design changes after delivery are not exactly common, and reusable code does tend to be well tested, so bugs in it will be rarer.  When they do come up though, they can be very expensive.

Overall, reusable software is very valuable, and nearly every program uses it in some way.  It also comes with costs though, and understanding those costs is important to get the most out of it.  Sometimes it is better to write the code from scratch.  Code written for the specific application will always be superior to reusable code, and reusable code can ultimately be more expensive than coding from scratch.  Cases where reusable code costs more than writing code from scratch are generally uncommon, but determining this should be part of the cost analysis of any project.  In addition, there are different levels of reusable software that should be considered.  Using something like the C standard library or other built-in functions and features of a language is generally a given.  For any popular language, these have been tested and optimized more than anything else.  Libraries for hardware interaction are often necessary as well, though there may be multiple options.  These tend to have more testing behind them as well.  Convenience and aesthetics libraries (jQuery, for example) are typically far less critical, so they tend to have less rigorous testing, and the benefits they provide are less critical.  Less essential libraries that have been around for less time deserve the most scrutiny.  Very small pieces of reusable code also deserve scrutiny, because they may end up taking more time and resources in the long run than just writing the code from scratch.  Despite its high value, reusable code is not a silver bullet.  It comes with its own costs, and sometimes those costs can be much greater than the benefits.  Our industry needs to spend a bit more time considering the options and potential consequences than it currently does, and if it did, things would work a lot more smoothly for everyone.

Saturday, May 6, 2017

ARM Assembly: Arrays and Structs

If you are doing these in order, you just learned about loops and indexing modes.  You now have the foundation for working with arrays and structs.

Structs are heterogeneous data structures.  This means that elements of a struct can and typically do have different data types or at least different meanings.  When we are mixing data types in a struct, there are some rules we need to follow.  The most important rule is that each variable should be aligned to its own size.  So, a short should be 2 byte aligned, an integer should be 4 byte aligned, and so on.  This may leave voids in our struct.  The rule with these voids is that they should never be used.

Technically, we have already made a struct, when we learned about global memory.  That struct contained 4 integers that represented an IP address.  Let's make a more complicated struct.
.data
.balign 4
struct:
    .word 32
    .byte 12
    .skip 1
    .short 64
    .word 169

.text
.global _start
_start:
    ldr r4, =struct
    ldr r0, [r4]
    ldrb r1, [r4, #4]
    ldrh r2, [r4, #6]
    ldr r3, [r4, #8]

    add r0, r0, r1
    add r0, r0, r2
    add r0, r0, r3

    mov r7, #1
    svc #0
We start by 4 byte aligning the beginning of the struct.  Typically we want to align the struct by the largest data type it contains.  So, if we were holding a 64 bit integer, we would want to 8 byte align it.  The first value in the struct is an integer, which is 4 bytes.  The second is a 1 byte value.  The next value is a 2 byte short, but the byte messed up our alignment.  So, we will skip 1 byte, which will 2 byte align the short.  The next value is another 4 byte integer.  If you add up the bytes from the byte, the void, and the short, you will find that we are already 4 byte aligned, so we don't need to skip anything here.
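For comparison, a C compiler follows the same rules.  Here is the same struct in C (the field names are mine); on a typical ABI, offsetof reports the same offsets we just calculated by hand:

#include <stdio.h>
#include <stddef.h>

struct demo {
    int   a;    /* offset 0: 4 byte aligned        */
    char  b;    /* offset 4: any alignment         */
                /* offset 5: 1 byte void (padding) */
    short c;    /* offset 6: 2 byte aligned        */
    int   d;    /* offset 8: 4 byte aligned        */
};

int main(void)
{
    /* Prints "0 4 6 8" on typical ABIs, matching the .skip 1 above. */
    printf("%zu %zu %zu %zu\n",
           offsetof(struct demo, a), offsetof(struct demo, b),
           offsetof(struct demo, c), offsetof(struct demo, d));
    return 0;
}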

The code starts by getting the address of the struct.  We can use the offset indexing mode to pull out specific elements of the struct.  When calculating the offsets, we have to take into account the voids as well as the data elements.

You may notice that I used ldrb and ldrh in there.  If we just use ldr, we will end up telling our program to read 4 bytes every time.  For the byte, this means we would read the byte, the void, and the short, all into one register.  For the short, we would end up reading the short and then the lower two bytes of the integer right after it.  The b and h suffixes tell the processor to load a single byte or a 2 byte halfword.  If we want our loads to behave correctly with smaller data types, we have to tell the processor when we need these types, otherwise it will assume we always want to read a full integer.

Hopefully it is now obvious why the indexing mode we used is so valuable.

Arrays are typically homogeneous data.  They usually contain data that all has the same type and all has the same significance.  In assembly it is possible to mix data types in an array-like data structure, but if you do, you have to keep track of the types, so you can treat them properly.  We won't be doing that here.

An array is just a chunk of memory that has a size that is a multiple of the size of the data type we are storing in it.  The defining factor of an array is that the data elements stored all have the same significance.  When we learned about global variables, we created a struct containing an IP address.  The data type was homogeneous, but the significance of each element was different.  Each element was a different value in an IP address.  A trick for determining whether a data structure is an array or a struct is to consider what would happen if the order of elements was changed.  If changing the order would have no effect, or would only affect things like sorting, then it is probably an array.  If changing the order would completely change the meaning of the data, then it is probably a struct (with the IP address, changing the order would make it an entirely different IP address).  That said, as with everything else related to data in assembly, what the data means depends entirely on how you treat it.  There are not really arrays or structs.  There are collections of data in memory that we use as arrays and structs.

Let's make an array.
.bss
.balign 4
array:
    .space 64

.text
.global _start
_start:
    ldr r4, =array
    mov r0, #0          @ make the starting index explicit

    b init_loop_test
init_loop:
    str r0, [r4, r0, LSL #2]
    add r0, r0, #1
init_loop_test:
    cmp r0, #16
    bne init_loop

    mov r7, #1
    svc #0
We are going to create an integer array with 16 elements (16 * 4 = 64).  Because I don't want to type in .word and a value 16 times, I stuck it in the bss section, and I initialized it in a for loop.  The loop iterates through the elements using an index stored in r0.  I am also using the index as the value to store in each element.  When the loop finishes, each element of the array will contain its own index.  I am using the register offset indexing mode with a shift, to convert the indices into the right offset for each element.

When this program terminates, the number 16 will be returned as the error code.  We will eventually learn to use the debugger.  Once we do that, it might be good to come back to this lesson and examine the contents of the array with the debugger.

In this case, we retained the original array address.  Sometimes this is what you want to do.  If we did not care about this, we could have used the [Rn], #<offset> mode, with an offset of 4.  This would have updated the pointer to point to the next element after writing to each one, and we could have eliminated the add instruction.  The reason I did not is that I wanted to store the index of each element in it, so I had to keep track of the current index anyway.

Also note that if I wanted to get a specific element out of the array, without traversing the entire thing, I could use the indexing mode we used above, and just set r0 to the index of the element.  This is something you generally cannot do with the immediate offset indexing mode.

ARM Assembly: Indexing Modes

We briefly discussed indexing modes when we learned about the stack and global memory.  It was probably somewhat confusing back then, because indexing modes are really designed to deal with arrays.  We won't learn about arrays quite yet though, as understanding indexing modes is very important in dealing with arrays.

Indexing modes give us a great deal of flexibility in dealing with data in memory.  This is important, since ARM only has a few instructions that can access memory.  (On Intel and some other processors, math instructions can use memory directly, instead of having to load to a register first.  This makes these instructions much slower, but it is necessary on a CPU that has very few registers.)

Indexing modes are relevant to LDR, STR, LDM, and STM instructions.  The last two are far simpler but also far less powerful.  The first two provide a great deal of flexibility.

The LDR and STR instructions take a register and something designating the address of the memory to access.  The simplest form uses a label, and it has no special capabilities.  The rest use a register containing a memory address (a pointer) plus some optional extras.  There are four of these indexing modes.

The first indexing mode is written in the documentation as [Rn {, #<offset>}]{!}.  The parts in braces are optional.  This means the simplest way this indexing mode could be used is ldr r0, [r1].  This loads the value pointed to by the address in r1.  If we had several integers in memory, and we wanted the second one, we could add the offset: ldr r0, [r1, #4].  This will get the data pointed to by r1 + 4.  The immediate value can also be negative.  Lastly, if we wanted to load several ints in sequence, we might use ldr r0, [r1, #4]!.  This will literally add 4 to r1, then it will use the new value in r1 as the pointer.  The exclamation point makes the instruction save the address we are accessing back into r1.  If we wanted to get a series of values, we could use several of these instructions in succession, and each one would set up r1 to load the next value.  Of course, this is not exactly good for arrays, as it will always skip the first element.

The second indexing mode is [Rn], #<offset>.  This is a post indexing mode.  This means that the instruction will access the memory at Rn first, then it will add the offset to Rn.  If we used str r0, [r1], #4, the memory pointed to by r1 would first be written, then 4 would be added to r1.  This won't skip the first element of an array.  If we used several of these instructions in series, we would first write the memory pointed to by r1, then r1 + 4, then r1 + 8, and so on, in 4 byte increments.  We will see an example of this when we learn how to work with arrays.

The third indexing mode uses an offset stored in a register, [Rn, +/-Rm {, <opsh>}]{!}.  If this looks familiar, it is because it is very similar to the first one.  There are two differences.  The first is that the offset comes from a register, and the second is that we can add an optional shift.  If we use the shift, it is applied to the index in Rm before that index is added to the address in Rn.  This allows us to do things like indexing arrays by element, instead of by memory offset.  For example, we could get the second value in an integer array with ldr r0, [r1, r2, LSL #2], if we had the value 1 in r2.  The exclamation point has the same behavior as in the first indexing mode; it stores the address accessed back into Rn.  Note that with this one we can start on the first element of an array, by starting with a 0 in r2.

The fourth indexing mode also uses an offset stored in a register, but it is similar to the second one.  The documentation shows it as [Rn], +/-Rm {, <opsh>}.  This indexing mode will add the (optionally shifted) value in Rm to Rn after the memory is accessed.
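If it helps to think in C, here are loose analogies for the four modes (loose because C pointer arithmetic counts in elements, where the assembly offsets count in bytes):

#include <stdio.h>

int main(void)
{
    int array[4] = { 10, 20, 30, 40 };
    int *p = array;
    int i = 1;
    int x;

    x = *(p + 1);   /* offset mode, [r1, #4]: p itself is unchanged */
    printf("%d\n", x);              /* 20 */

    x = *(p += 1);  /* pre-indexed with write back, [r1, #4]!: p moves first */
    printf("%d\n", x);              /* 20, and p now points at array[1] */

    x = *p++;       /* post-indexed, [r1], #4: read first, then p moves */
    printf("%d\n", x);              /* 20, and p now points at array[2] */

    x = p[i];       /* register offset with shift, [r1, r2, LSL #2] */
    printf("%d\n", x);              /* p[1] is array[3], which is 40 */

    return 0;
}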

These indexing modes can allow you to easily traverse arrays and take specific elements from structs.  They are really handy for fast and efficient memory access that does not require a lot of extra instructions to manage indices.  Learning to use these will help you to be a much better ARM assembly programmer.

I mentioned that LDM and STM also have a sort of indexing mode (it's not really one, as there is no actual index).  This is much simpler than the others.  The LDM instruction is shown in the documentation with the argument list Rn{!}, <reglist-PC>.  There is a bit more to it, but the only part we care about right now is the optional exclamation point.  The register used for Rn contains the memory address being accessed.  As with the previous indexing modes, the optional exclamation point causes Rn to be updated, so that it points just past the last memory location accessed by the instruction.  This works the same way with the STM instruction.

ARM Assembly: Loops

After if statements, loops are the second most common flow control structures.  Loops can allow us to perform operations across an array of data.  They can let us give the user control of when an application exits.  They can wait for user input or for other I/O to complete.  They can even just consume some time, to make sure a program is not running too fast (though, there are typically much better ways of doing this).  They can count for us.  They are incredibly useful, and like if statements, they are essential to modern computing.

Loops are created with conditional branches.  In assembly, there is no distinction between a for loop and a while loop.  In fact, there aren't even really loops.  We construct loops out of conditional branches.

We will start with a suboptimal but easy to understand example.
.text
.balign 4
.global _start
_start:
    mov r1, #5

    cmp r1, #0
    beq break
loop:
    add r0, #1
    subs r1, r1, #1
    bne loop
break:

    mov r7, #1
    svc #0
This will count up to 5, using r1 as the counter variable.  Notice that when I check the condition within the loop, I use subs, instead of cmp.  In this case, I want to decrement the counter, and then I want to check if it has gotten to 0.  I need to save the counter though, so I know where I am in the loop.  I could use sub and then cmp, but using subs lets me combine these instructions.

This is not a terribly efficient way to do this.  The next version replaces the conditional branch before the loop with an unconditional one, which improves the performance of the program slightly.  (Because ARM has the s suffix, we don't actually reduce the size of the program by doing this.  In most other assembly languages, this would save us an instruction or two.)
.text
.balign 4
.global _start
_start:
    mov r1, #5

    b test
loop:
    add r0, #1
    sub r1, r1, #1
test:
    cmp r1, #0

    bne loop
break:

    mov r7, #1
    svc #0
This will work exactly like the previous version, except slightly faster.  These examples are basic for loops.  A for loop should not run at all if the break out condition is already met, which means we need to test the counter before we start the loop.  The old way of doing this is an extra cmp before the loop, and a conditional branch to skip the loop if the condition is met.  That is exactly what we did in the first example.  The C compiler has been updated, improved, and optimized over a rather long time, and this includes finding more efficient ways of doing common tasks.  At some point someone realized that instead of testing the condition before starting the loop, they could just jump straight to the end of the loop, where it tests its condition.  This eliminates a conditional branch, avoiding the problem of failed branch prediction.  In assembly languages that don't have an s suffix, it also reduces the number of instructions.  In ARM, we actually had to separate the sub and the cmp to do this, which eliminated the benefit of reducing the number of instructions.

It turns out this optimization is a trade off.  Notice that the original loop has 3 instructions inside the loop.  The second one has 4.  If we counted the number of instructions actually executed in this program, the second example would execute about 5 more instructions than the first, one extra per iteration.  This means that the first loop is actually faster.  This is not exactly a big difference, but what if we were doing 1,000 iterations?  That would be 1,000 more instructions executed.  If this was a loop in a performance critical part of a program, and it was expected to typically go through a lot of iterations, the cost of failed branch prediction would be more than made up for by using the first example, with fewer instructions.  For a loop that is only expected to do a few iterations, eliminating the cost of failed branch prediction might make up for the extra instruction in the loop.

Of course, the details differ by instruction set.  On an Intel system, sub always sets the flags, just as subs does here, so the first example's pattern works there too without a separate cmp.

ARM Assembly: Conditional Branching

One of the most common uses of branching in programs is if statements.  If statements require conditional branches.  Conditional branches only branch if certain conditions are met, otherwise they do nothing.

ARM has a list of 17 condition codes.  One, AL, we don't care about, because it is implied.  This code means "always".  This condition code makes an instruction run unconditionally, but since this is the default, we never actually use it.  There are also two condition codes that are aliases for other condition codes.  This tutorial is not going to cover all of the remaining 16 condition codes.  We will look at the most common two, and we will look at a few others.  If you want a complete list, ARM has a quick reference card that lists them all and can be found with a simple Google search of "ARM Thumb2 QRC".  (I won't guarantee this link will never break, but as of this writing, it can be downloaded from ARM's website here.  The condition codes are on the last page.)

The processor has some behind-the-scenes mechanics going on to make all of this work.  We will look at it later.  Right now, we will just discuss the conditions that satisfy some of the condition codes.

The two most common condition codes are EQ and NE.  EQ means equal, and NE means not equal.  An if statement that is testing equality will use one of these.  Condition codes generally apply to some previous instruction that does some operation.  If that operation satisfied the conditions of the code, then the instruction the condition code applies to is executed, otherwise it is not.

ARM allows condition codes on a lot of instructions, but right now we are only going to look at conditional branches, as most assembly languages have very few conditional instructions that are not branches.

Let's write a program.
.text
.balign 4
.global _start
_start:
    mov r0, #12

    cmp r0, #10
    bne then
    mov r0, #1
    b endif
then:
    mov r0, #0
endif:

    mov r7, #1
    svc #0
This program starts by putting 12 in r0.  Then we use the cmp instruction.  This instruction will subtract the second value from the first value, but it won't store the result.  So, in this case, the instruction is doing 12 - 10 = 2.  The cmp instruction will tell the processor some information about the result of the operation.  The NE condition code checks to see if the cmp instruction resulted in a 0.  If it did not, then the bne instruction will take its branch, going to the then label.  This is equivalent to a C or C++ if (12 == 10).  It may seem counter intuitive to use the not equal condition, but if we want the "if" code to be positioned before the "then" code, we have to do it this way.  Of course, we could reverse the if and then code, and then we would be able to use beq instead of bne.  This is often a case of personal preference, but if you know one block will be run significantly more often, position it second.  This is because the processor tries to predict whether the branch will happen or not, and most processors just assume it will.  This means that when it does not, the processor has to back up and redo some stuff, wasting some time.  If the more frequently run block is where the conditional branch goes to (the "then" block, in this case), your program will have better performance.
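In C terms, the structure the assembly implements looks like this, with the blocks laid out in the same order as the labels:

int main(void)
{
    int r0 = 12;

    if (r0 == 10) {     /* cmp r0, #10 / bne then */
        r0 = 1;         /* the fall-through block, before the then label */
    } else {
        r0 = 0;         /* the block at the then label */
    }

    return r0;          /* mov r7, #1 / svc #0 */
}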

EQ and NE are the most frequently used condition codes, because testing for equality is the most common condition.  What if you need to test for inequality, for example, less than or greater than?  There are condition codes for this as well.  In fact, most of the condition codes are for some kind of inequality.  The MI (minus) condition code tests for the result to be negative.  HI (higher) is for unsigned greater than.  GT (greater than) is for signed greater than.  There are also codes for both signed and unsigned less than, and signed and unsigned greater than or equal and less than or equal.  There are also a couple for checking for overflow, so that we can catch it when an integer rolls over.

Let's try the signed greater than (GT).  I also want to introduce another mechanic here.  The cmp instruction does subtraction, and it does not store the result.  This is often what we want, but it is not always.  Sometimes we do want to save the result.  Sometimes we don't want to do subtraction.  Many of the math instructions have an optional S that can be put at the end, that will tell the processor to keep track of the same things it does for the cmp instruction.  To demonstrate these things, we will also use EQ, but we will use an add instruction.
.text
.balign 4
.global _start
_start:
    mov r1, #17
    mov r2, #23
    subs r2, r2, r1
    bgt if_0
    b endif_0
if_0:
    add r0, r0, #1
endif_0:

    adds r1, r1, r2
    beq if_1
    b endif_1
if_1:
    add r0, #2
endif_1:

    mov r7, #1
    svc #0
We start by putting some values in r1 and r2.  I am saving r0 for keeping track of what succeeded and what did not (because we are not using gcc, the initial value in r0 is 0).  Notice that the sub instruction has an extra s at the end of it this time.  This tells it to update the processor state with some information about the results of the operation.  Since we subtracted 17 from 23, we ended up with a positive value, which indicates that r2 was indeed greater than r1.  My if statements are a bit ugly.  In real life, we would optimize this by just skipping the body of the if statement with a less-than-or-equal-to condition, instead of using the GT condition.  If the result is greater than, we add 1 to r0.  Next, we do an add, with the s, and then check if the result satisfies EQ.  Note that EQ is only true when the result is 0, so using add can produce counterintuitive results.  I could add 5 to 5, and it would not satisfy an EQ condition.  If I added 5 and -5, then it would satisfy that condition.  We typically use sub or cmp for conditionals, but once you understand the different condition codes, you can get away with using other math instructions, so long as you understand the consequences.

The other branch instructions can also be conditional.  Just add the appropriate condition code to the end of the instruction.  So, bl might be bleq, if you only want to call a function if two values are equal.

Conditional branching is something you will want to get comfortable with, because it is the basis of if statements and loops.  You will likely find that the most efficient way to use these is not the most intuitive.  This is one place that is very commonly optimized, because the optimal solutions are often unintuitive.

ARM Assembly: Branches

Branches, also called jumps in some assembly languages, are how we get about in our code.  Any time we don't want to run the next instruction and instead want to execute an instruction somewhere else, we use a branch.  Branches change where in the code the program is currently executing.

Branches are a big part of what makes modern computing possible.  They allow us to call functions, create loops, and conditionally run code (if statements).  Without them, our programs would be linear, only taking initial input and always performing the same operations in the same order.  We would not be able to make programs that react to user input provided while the program is running.  Computers would still be good for math and certain kinds of simulation, but things like word processors, internet browsers, and, more importantly, video games would just not exist.  They would not be able to.  Interactive anything would not be able to exist.  Almost all programs use branches.

Let's start with a simple branch instruction.
.text
.balign 4
.global _start
_start:
    mov r0, #27
    b skip_to_this

    mov r0, #13
skip_to_this:

    mov r7, #1
    svc #0
This program will start by setting the return value to 27.  Then it will branch to the skip_to_this label.  Notice between the branch and the label, there is an instruction that will change the return value to 13.  This instruction will never run, thus the return value will be 27.  (In real life, we don't put code in our program that never runs like this, but this is just a demonstration.  If we really needed the code to stay in the file and not run, we would comment it out.  Otherwise, we would just delete that line, instead of skipping it like this.)

The b instruction always takes a literal memory address (in the form of a label here).  It does not do anything extra.  It just changes the pc register to run the target instruction next.

As we saw in the Function Pointers tutorial, the bx instruction can be used to branch to an address stored in a register (a pointer).  We won't go over that again here.  Please review the last few paragraphs of the Function Pointers tutorial if you need to brush up.

The next branch instruction is bl.  This instruction is used to make function calls.  Typically when we call a function, we want to eventually return to the location the function was called from.  The bl instruction will help us with this, by storing the return address in lr when we call it.  If we hold onto this address, when the function is done, we can return using bx lr.

.text
.balign 4
.global _start
_start:
    bl fun

    mov r7, #1
    svc #0

fun:
    mov r0, #101
    bx lr
When the program starts running, r0 will have the value 0 in it.  In fact, only pc and sp will have non-zero values in them.  The rest of the registers, including lr, will have 0s (_start is not a function).  When we call fun() using bl, the memory address of the following mov instruction will be stored in lr, and the pc will be set to the memory address of the starting mov instruction in the function.  Once the mov instruction is executed, the bx instruction will set the pc to the address in lr, returning control to _start, at the instruction right after the function call.  Note that the bl instruction will overwrite whatever is in lr, regardless of what it is.  So when nesting functions, each must make sure that its return address is stored somewhere it will not be overwritten.

In the Function Pointers tutorial we already used the blx instruction.  It is just the bl and bx instructions combined.  It will store the return address to lr, and it takes a function pointer instead of a label.

ARM Assembly: Function Pointers

Now that we have been introduced to pointers, it is time to look at function pointers.  An important fact to keep in mind here is that the processor does not distinguish between instructions and data.  If you tell the processor to run code at a particular location, it will interpret whatever is there as code, even if it is not intended to be code.  Types only really exist for the programmer.  It might help if you think of instructions as just another type.  Integers are 4 byte data values that represent whole numbers.  Floats are 4 byte data values that represent numbers which can have fractional parts.  Instructions are 4 byte values that are intended to be executed.  All of these data types are encoded exactly the same way, as 32 bit patterns.  This means that interpretation depends on how you use them.  If you set the pc to the address of some data, the processor will interpret that data as instructions.  If you perform a floating point operation on an integer, it will happily interpret it as whatever floating point value has that bit pattern.  The processor does not know or care what a given item in memory represents.  It will treat it however you tell it to.

This means that function pointers are just pointers.  When used correctly, they will point to values that you intend to be interpreted as instructions.  The processor does not know this though.  It just knows that you have a value in a register, and when you use it to call a function, it will interpret it as a pointer to executable code.

We will typically get function pointers in the same way that we get any pointer.  We will load the address into a register using the label of the function and a memory read.
.text
.balign 4
.global _start
_start:
    ldr r4, =fun
    blx r4

    mov r7, #1
    svc #0

fun:
    mov r0, #13
    bx lr
We could just call the function with bl fun.  That would be significantly more efficient, because the function address is hard coded into the instruction, so we don't need an extra memory read.  What if our program is really big though, and the function is too far away in memory to encode the relative address in the bl instruction?  (Memory addresses are 32 bits.  Instructions are also 32 bits.  This means that an instruction cannot contain a whole memory address.  The ARM processor deals with this by using relative addresses.  Instead of writing a new address to the pc, it will add or subtract some value.  This value has a limited size though, and this means that addresses that are too distant in the program must use registers to store their entire address.)

The program starts by loading the memory address of our function into r4.  This is our function pointer.  We have to use a blx instruction to use a function pointer to call a function.  The function returns 13, which should be the error code for the program.

You can do this with any function in your file or declared as global in another file.  If we were using GCC, we could get a function pointer for printf.  Just like any other value in memory, we can get the address of a function using its label.

Long jumps are not the only application for function pointers.  Function pointers can be used to define behavior without constant conditionals.  For example, perhaps I am writing a program that needs to be able to use several different kinds of networks, but it will only use one at a time.  I could have if statements every time there is a network access.  This is terribly inefficient for a number of reasons.  What if, instead, I set some function pointers for the kind of network I want?  That is one conditional, and then every network access can just use whatever function pointers I set.  If I need to change the type of network, I can replace the function pointers for the first type with the ones for the new type.
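Here is that idea as a sketch in C (the network types and function names are made up for illustration):

#include <stdio.h>

/* One send routine per network type; hypothetical stand-ins. */
static void wifi_send(const char *data)     { printf("wifi: %s\n", data); }
static void ethernet_send(const char *data) { printf("eth: %s\n", data); }

/* The current behavior, chosen once instead of testing a flag on
   every network access. */
static void (*net_send)(const char *) = wifi_send;

int main(void)
{
    net_send("hello");          /* no conditional here */
    net_send = ethernet_send;   /* switching networks is one assignment */
    net_send("world");
    return 0;
}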

In assembly, you can also make anonymous functions, just by not specifying a label.  These must be called using function pointers, because there is no label to reference.  (This is a pretty advanced topic that I plan to cover much later in this series.)

There is one more kind of pointer to instructions that is not technically a function pointer, because it does not point to the start of a function.  In the above program, the last line is bx lr.  The lr register contains the memory address of an instruction somewhere within a function.  This is the address that we want the function to return to.  Since this is not a true function call, we use bx instead of blx.  We can store memory addresses that are not the beginning of a function in registers, creating pointers to arbitrary instructions.  Aside from returning from functions, this is not terribly common.  If we want to navigate within a function, we generally just use labels, as this is more efficient.  When we want to go to another function, we almost always want to go to the beginning or to a return point.  There may be a few other esoteric applications for this, but for the most part your program will be better structured and far easier to read if you avoid it.

Friday, May 5, 2017

ARM Assembly: Pointers and Global Memory Access

I deliberately decided to write this tutorial after the one about the stack.  This is because the stack is a harder subject, but if you don't learn it first, you may fall into the bad practice of trying to put everything into global memory.  This is especially bad if you start off by storing lr in global memory.  The stack should be your primary go-to place for memory, unless you need to create something in a function that will still be needed after the function returns, or you need to allocate quantities of memory too large to put on the stack.  Even then, using global variables is not always the right solution.

When we learned about the stack, we learned that the stack pointer (sp) points to some location in memory, representing the top of the stack.  The stack pointer is just a pointer, in the sense of C pointers.  It has a dedicated register because it is essential to structured programming.  In reality, we can try to access any address in the memory space.  Trying to access address 0 will cause a null pointer error, and trying to access anything past 0xc0000000 (3GB) will give you a permission error.  That memory is reserved for the kernel to keep track of information about the program.  Most of the rest of the space is initially unallocated, and accessing it will cause a segfault.

There are a few ways to allocate memory.  The stack is allocated by default, and the operating system will typically make sure there is enough, up to a certain limit (8MB, for example).  There is also a data area, a bss area, and a read only area (possibly more, depending on the OS and architecture) that you can also define, which will be allocated for the program when it starts.

These sections require special directives to create in your program.  Each section has a different purpose.  The data section is typically used for preinitialized variables (they start with data in them).  The bss section is used for uninitialized variables.  The reason you might choose the bss section over a data section padded with 0s is that the data section is written into your executable, taking up more space.  If you need the variables to be preloaded with values, this is usually a secondary concern.  The bss section is allocated when the program starts, which means it does not need space in the executable, aside from a record of how much memory is desired.  The last section, the read only or rodata section, is typically used for constant strings and other run time constants.  Writing to it with the read system call appears to silently fail, not writing anything and not throwing an exception, while trying to write to the rodata area directly causes a segfault.  Put stuff in this section only if you are certain you don't want to overwrite it.  GCC typically puts string constants there.

Perhaps the most valuable area is the data area.  If you find you need to initialize a lot of mutable (changeable) variables at start up, this is probably a good place to put them.  Just make sure the stack would not be a more appropriate location.  For example, perhaps you need some memory to store an IP address.  You want the default to be 127.0.0.1.  Your program is going to load the IP address from a config file, but perhaps this file is empty the first time a user runs the program.  You could allocate 4 bytes for the IP address in the bss section, but now you have to check whether the IP address was in the config file, and if it was, store it to the memory, but if it was not, you will have to put the default value in.  Further, if you want to change this default, that could require finding where the default is put into memory in your config reading code.  If you put this in the data section and there is no address in the config file, you don't have to do anything, because it is already there.  If you want to change it, you can generally find one entry in the data section more easily than a bunch of moves and an STM in the middle of a configuration section.

So, let's try it.
.data
.balign 4                  @ align to a 4 byte boundary
ip_address:                @ a simple struct of four words
    .word 127
    .word 0
    .word 0
    .word 1


.text
.global _start
_start:
    ldr r0, =ip_address    @ get a pointer to the struct
    ldr r0, [r0]           @ dereference it, loading the first word (127)

    mov r7, #1             @ syscall number for exit
    svc #0                 @ exit, with r0 as the status
This program is not going to read a config file, but it will set the default IP address.  Technically this layout is backwards, as the OS stores and accepts IP addresses starting with the lowest octet and going up to the highest.

Give this a file name and compile it with as filename.s -o filename.o && ld filename.o -o filename.  From there, you can run it and print the error code with echo $?, which will be 127.

The first thing that happens when you run this program is that the OS loads it into memory and reads some header information to see if it is valid.  By this time, your data area is already in memory.  Then the code in the text section gets the memory address of the global variable and reads the first value.  The first LDR instruction is getting a pointer to the desired data.  This is exactly like C pointers.  It is merely a memory address.  The second LDR line dereferences the pointer, getting the data that it points to.  (You could change the second LDR line to ldr r0, [r0, #4] to get the second value in the IP address.)  Technically, we just created a simple struct that holds an IP address.  We can increase that 4 in multiples of 4 to get the next two elements of our struct.  Alternatively, we could use LDM to load the entire IP address into registers r0 through r3.  (And we could even flip the direction of the read, or use write back, by using the right memory access code.)
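For example, here is a minimal sketch of that LDM alternative, replacing the two LDR lines in the program above:

    ldr r4, =ip_address    @ get a pointer to the start of the struct
    ldm r4, {r0-r3}        @ r0=127, r1=0, r2=0, r3=1, in one instruction

The error code would still be 127, since the exit system call only looks at r0.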

The bss section is initially empty.  You use directives to tell the assembler how much memory you want preallocated, and the operating system will allocate it and initialize it all to 0s for you when the program starts (I am not certain you can rely on this last behavior on all platforms; some may leave garbage from something else there).
.bss
.balign 4
some_memory:
    .space 16              @ reserve 16 bytes, initialized to 0

.text
.global _start
_start:
    ldr r0, =some_memory   @ get a pointer to the reserved memory
    ldr r0, [r0]           @ dereference it, loading the first word (0)

    mov r7, #1             @ syscall number for exit
    svc #0                 @ exit, with r0 as the status
This will allocate 16 bytes of memory (the same as the last program) when the program starts.  If you change the amount of space in the bss section to 1,600 bytes, you will find it increases the size of the executable by only 2 bytes.

The program starts by asking for 16 bytes of space in the bss section, and then it reads the first word and returns it as an error code.  Again, the first LDR instruction is getting a pointer and the second is dereferencing it, getting the data.  You can add immediate multiples of 4 to the second LDR instruction to see the 0s in the other word sized sections of the 16 byte memory space.
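For example, this version of the second LDR reads the third word of the block, which should also be 0:

    ldr r0, [r0, #8]       @ skip two 4 byte words to reach the third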

The read only data section requires slightly different syntax on this particular device.  The assembler won't recognize the directive .rodata unless it is preceded by a .section directive.  Essentially this means that the assembler or linker does not support this section by default, so you have to define one.  The linker script used by the linker does recognize it though, and it makes sure to put it in the right place in the program binary, as well as designating it read only.
.section .rodata
.balign 4
write_this:
    .word 27               @ a run time constant

.text
.balign 4
.global _start
_start:
    mov r0, #13
    ldr r1, =write_this    @ get a pointer to the constant
    str r0, [r1]           @ try to overwrite it (this will segfault)

    mov r0, #0             @ clear r0 so we don't read a stale value
    ldr r0, [r1]           @ read the constant back

    mov r7, #1             @ syscall number for exit
    svc #0                 @ exit, with r0 as the status
This will put the value 27 into the read only data section.  From there we put the number 13 in r0, and then we try to write r0 to the variable we created in the read only section.  The last part before returning first overwrites r0 with 0, to make sure we don't have an old value in there, and then reads the value from the rodata area.  If the write succeeded, we should get 13 as the error code, but if it did not, we will probably get a segfault.  Using the debugger, I identified the location of the segfault this program throws as the STR instruction that attempts to overwrite the read only data.

As before, the LDR at the beginning is getting a pointer.  The STR instruction is dereferencing the pointer, this time attempting to write through it (changing the stored value to a new one).  The last LDR does not run (the program crashes before this happens), but it would dereference the pointer stored in r1 and read the value into a register.

These different sections can give you a lot of control over your program.  If you want to make sure that a value is never changed, the read only section is very useful.  If you need your values initialized, the data section is probably the best.  If you cannot preinitialize your data, then maybe the bss section is what you need.

The things to keep in mind are that putting stuff in the data section will save the program time it would otherwise use to initialize, while putting it in the bss section will reduce the size of the executable.  The read only section is more of an error detection and security mechanism that can prevent unskilled programmers and hackers from breaking things or sneaking malicious code or data in where it should not be.

ARM Assembly: The Stack

If you have been following the tutorials in order, you have likely hit a point where you are wondering what to do if you don't have enough registers to hold everything you need.  Of course, with so many registers, maybe you are not there yet.  Either way, now is a good time to learn about local memory.

The stack is where a process stores local variables.  It is also used to hold onto anything that needs to be kept right before or after making a function call.  The stack is generally fairly small, often having a limit of only 8MB.  This means it is not good for storing large data structures.  What it is good for is storing information that is being used in a function but which will no longer matter once the function returns.  In other words, the stack is good for storing local variables that go out of scope when the function returns.

Before we start trying to use it, we need to know how it works.  There are several different ways the stack can work, depending on system.  The stack can grow up or down, and the stack pointer can either point at the last value stored on the stack, or it can point to the first unused address on the stack.  This presents four options, and ARM assembly is equipped to handle all of them.  Thankfully, we only need to know one.

Linux on ARM grows the stack down, and the stack pointer (sp) points to the last value on the stack.  This means, if you want to allocate more space on the stack, first you decrease the sp by the amount of memory you want, then you write to that memory.  When you are done using that memory, you increase the sp to its previous location.  This paragraph contains some of the most important information for using the stack, so reread it if you need to.

The second most important thing you need to know about the stack is that when you return from a function you must set the sp to exactly where it was when the function was called.  If you don't, then the calling function won't know where to find its local variables, which will almost certainly lead to unexpected behavior and eventually a segfault.  We will talk more about this when we learn how to make functions.

Now that we know the basics of how the stack works, let's try it out.  First we need to learn how to store data in memory and how to load it back into registers.  There is a set of instructions for this.  The easiest ones are LDR and STR, for load [to] register, and store [from] register.  The ARM quick reference card shows these instructions:
<op>{size}{T} Rd, [Rn {, #<offset>}]{!}
<op>{size}{T} Rd, [Rn], #<offset>
<op>{size} Rd, [Rn, +/-Rm {, <opsh>}]{!}
<op>{size}{T} Rd, [Rn], +/-Rm {, <opsh>}
<op>{size} Rd, <label>
If you think this is confusing, you are not alone.  I had a hard time with this initially too.  The first thing you need to know is that <op> is either LDR or STR.  For LDR, Rd is the destination, where the value is being loaded to.  For STR, Rd is the register that contains the value to store.  The {size} portion is a code telling the processor how much data we are loading or storing.  The options are B, SB, H, or SH, where B is byte or 8 bits, SB is signed byte, H is halfword or 16 bits, and SH is signed halfword.  Note that only LDR cares about whether a value is signed or not, so SB and SH won't work with STR.  If the size is omitted, the instructions will load and store 32 bits, the full size of a register, where signed and unsigned make no difference.  The {T} is an optional character that can be used to specify privilege level, which only matters to the kernel, so we won't worry about it.
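Here is what the size codes look like in practice (just a sketch of individual instructions, assuming r1 holds a valid address):

    ldrb  r0, [r1]         @ load one byte, zero extended to 32 bits
    ldrsb r0, [r1]         @ load one byte, sign extended to 32 bits
    ldrh  r0, [r1]         @ load a halfword, zero extended
    ldrsh r0, [r1]         @ load a halfword, sign extended
    strb  r0, [r1]         @ store only the low byte of r0
    str   r0, [r1]         @ no size code: the full 32 bit register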

Now let's talk about addressing modes.  ARM provides several ways to specify memory addresses, so that we don't have to waste precious instructions calculating memory addresses manually.  The first four options above start with a register that contains an address.  The Rn register will contain a base address.  In the first two, an offset is added to that base address.  The first will add the offset, then it will load or store using the calculated address.  If the optional ! is used, the calculated address will be stored in Rn.  This can be used for easily traversing arrays.  The second option will access (load from or store to) the memory location in Rn, then it will add the offset to Rn and store the result in Rn.  The difference here is whether the offset is added before or after accessing memory.  The third and fourth allow an offset that is stored in a register, which may be added to or subtracted from the base address, and which can optionally be shifted.  <opsh> is a shift specification like LSL #<shift> (LSL for logical shift left, etc...), applied to Rm.  This allows Rm to be an offset that is shifted before being applied, which is useful for traversing some types of data structures.  As with the first two, the third will update the base address in Rn if the ! is included, while the fourth will access the memory at Rn first, and then it will update Rn with the offset.  The last one uses a label.  When the program is compiled, this label is converted into an offset relative to the current position of the pc, essentially hard coding the memory location.
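Here is one example of each form, assuming r1 holds a valid base address (a sketch, not a complete program):

    ldr r0, [r1, #4]         @ first form: load from r1+4, r1 unchanged
    ldr r0, [r1, #4]!        @ first form with !: load from r1+4, then r1 = r1+4
    ldr r0, [r1], #4         @ second form: load from r1, then r1 = r1+4
    ldr r0, [r1, r2, lsl #2] @ third form: load from r1+(r2*4)
    ldr r0, [r1], r2         @ fourth form: load from r1, then r1 = r1+r2
    ldr r0, some_label       @ fifth form: pc relative access to a nearby label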

For accessing the stack, we only care about the first two of these.  The first one is for accessing memory without updating the address and for accessing memory where the offset is applied before accessing the memory, then the memory address is updated.  The second one is for accessing memory before applying the offset, then applying the offset and updating the memory address.  We don't need optionally shifted offsets, nor do we need offsets that are stored in registers, so we are not going to worry about the third and fourth options.  We are also not using labels, so the fifth option is not useful right now either.

Now we come to the point where how the stack works is important.  There are two ways we can deal with the stack.  The first is using SUB and ADD to decrement and increment the stack pointer directly.  For example, if we need to allocate 8 bytes on the stack, we might do this:
sub sp, sp, #8
This will reduce the sp by 8, giving us access to 8 bytes of memory on the stack.  When we are done, before we return, we would do this, to put it back where it was:
add sp, sp, #8
And now we have deallocated 8 bytes.  There is a cheaper way to do this though.  While it is sometimes better to just allocate as much as you will need at once, if you only need to put one or two things on the stack, it may be better to just do this:
str r0, [sp, #-4]!
This will decrement the stack pointer by 4 bytes (the size of a register), storing the calculated address in sp, then it will store r0 to that location in memory.  This takes care of decrementing the stack and storing the data all in a single instruction.  Once we are done, there are two ways we can finish.  If we don't need the value, we can just add 4 to the sp and walk away.  If we do need the value though, we can do this:
ldr r0, [sp], #4
Notice that we used the first option above for storing and the second for loading.  This is important.  When storing something on the stack, we need to decrement before we store the value.  When loading, we need to load and then increment.  So, this time, we are loading the value from the stack (sp is pointing at the last value stored, which is what we want to load), then once our value is loaded, it will increment the stack pointer by 4, deallocating the stack memory we were using.

You should familiarize yourself with the different memory addressing modes.  For this particular system, pre-indexed stores and post-indexed loads are the right way to deal with the stack, but if you ever program in ARM assembly for another OS, you may find this is not the case.  Further, when traversing arrays and other data structures, you may also find that this is not the best strategy.  Understanding how each addressing mode works is valuable for a lot of applications.

Now, you should also be aware that there is an easier way to do this.  ARM defines two pseudo instructions to help you work with the stack more easily.  These instructions are PUSH and POP.  The assembler turns PUSH and POP into the above STR and LDR instructions on the Raspberry Pi, but they are defined more loosely than that.  Whatever system you are compiling for, they will work with the stack as is appropriate for that system.  So, if you take a program written and tested on the Pi using PUSH and POP, assuming everything else is the same, it will compile correctly for other ARM systems that use the stack differently.  This is not all these instructions can do though.  Before we go any further, let's write some example programs.

We are going to write a program that uses the stack with STR and LDR, then alter it to use PUSH and POP.  There is one thing that is frequently stored on the stack at the beginning of a function: the link register, lr.  We will get into function calls more later, but the two instructions for making function calls overwrite the lr register with a pointer to the instruction directly after the function call.  This is the return point, and it is how the processor helps our programs keep track of where function calls need to return to.  There is a problem here though.  What happens if we call a function within another function?  The answer is that lr is overwritten with the new return point.  If we don't store the return pointer somewhere else, it will be lost, and our program won't know where to return to.  In practice, we will probably end up with an infinite loop between the return point of the nested function and the end of the current function.  The universal solution is to store lr on the stack in any function that will be calling another function.  When we need to return, we load it from the stack and use it.
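Here is a minimal sketch of that pattern, assuming some function named inner exists:

outer:
    str lr, [sp, #-4]!     @ save our return address on the stack
    bl inner               @ bl overwrites lr with a new return point
    ldr lr, [sp], #4       @ restore our own return address
    bx lr                  @ now we can return to our caller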

There is another important use of the stack when using functions.  We only have about 12 registers we can use.  It is not uncommon for function calls to nest quite deep, and it is not reasonable to allocate registers to specific functions.  Instead, we have a set of standards that functions ought to follow, which, when followed, guarantees that they don't step on each other's toes.  We won't get deep into this at this point, but two things are relevant here.  The first is that r0 through r3 are used for passing arguments (and r0 is used for the return value).  Further, even if you don't pass arguments in them, the caller may not assume that the values in them will be the same when a function returns.  Essentially, functions are allowed to do whatever they want with r0 through r3, and calling functions must accept that.  In practice, this means that if you have values in these registers that you want to keep, you must store them somewhere else before calling a function.  The second is that values in r4 through r11 are expected to be maintained through a function call.  This means that any function that wishes to use any of these registers must store the existing values and then restore them before returning.  Both of these things can be handled using the stack.  We will focus on the second, as the first is more often handled either by doing work primarily in r4 through r11, to avoid the need to store before calling a function, or by copying values from r0 through r3 into the r4 to r11 range.

We will continue using GCC as our compiler here, since main() is a function, and thus we should follow these function call procedures in it.  We are going to do some work in r4, r5, and r6, and leave the result in r0, but since main() is a function, we must save whatever is in those registers and restore it before returning.
.text
.global main
main:
    str lr, [sp, #-4]!     @ save the return address
    str r4, [sp, #-4]!     @ save the caller's r4
    str r5, [sp, #-4]!     @ save the caller's r5
    str r6, [sp, #-4]!     @ save the caller's r6

    mov r4, #13
    mov r5, #27
    add r6, r4, r5         @ r6 = 13 + 27
    sub r0, r6, #7         @ r0 = 40 - 7, the return value

    ldr r6, [sp], #4       @ restore in reverse order
    ldr r5, [sp], #4
    ldr r4, [sp], #4
    ldr lr, [sp], #4
    bx lr
Our program starts by putting any registers from r4 to r11 that we will be using onto the stack.  Then we use those registers to do some stuff.  In this case, we could have used r1, r2, and r3 just as easily, but if we had wanted to print out the value using printf(), then we would have had to move what we wanted to keep out of those registers to the higher ones.

Once we are done, we restore the values from the stack to the registers.  Note that we do this in reverse order.  The stack works exactly like a stack data structure when used this way.  The last thing we put in will be the first thing that comes out.  This means we must pull things out in reverse order so they will all be restored to the same places.

Something else important to note is that we are pre-decrementing when putting things on the stack and post-incrementing when we pull things off.  This is because the stack grows downward, and the stack pointer always points at the last item on the stack.  And recall that when we return, the stack pointer must point at the same place it was at when the function was called.  If we add up the decrements and increments, we get -16 at the beginning of the function, and +16 at the end, which breaks even.

Lastly, once we compile and run the program, we can use echo $? to print the return value, which should be 13 + 27 - 7 or 33.

This function is rather inefficient.  We are doing a ton of memory accesses, 8 in total, and because they are in succession, each has to wait for the previous one to complete.  Because ARM is very limited in instructions that access memory, it provides some instructions to optimize memory accesses.  These are LDM and STM.  The M on the end of each means "multiple".  We can use these instructions to transfer the data in many registers to and from memory in a single instruction.  I can't say I understand how the wiring works for this, but I assume the wiring between registers and memory is robust enough to allow it.

We are not going to use these instructions directly right now, because they have to care about the order things are transferred in.  We would not want to store r0 through r3, then reload the data and have it end up with the first value in r3 and the last in r0.  So these have a sort of addressing mode system, giving each of them four versions (increment or decrement the address, before or after each transfer).  The PUSH and POP instructions don't just apply to writing single registers to the stack.  If you give them a list of registers, the assembler will use LDM and STM instead of LDR and STR, to optimize performance.
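To make the connection concrete, on this system the two pseudo instructions below assemble to the LDM and STM forms shown in the comments:

    push {r4, r5, lr}      @ becomes: stmdb sp!, {r4, r5, lr}
    pop  {r4, r5, lr}      @ becomes: ldmia sp!, {r4, r5, lr}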

Let's rewrite the above program using PUSH and POP to transfer multiple values at once!

.text
.global main
main:
    push {r4, r5, r6, lr}  @ save all four registers in one instruction

    mov r4, #13
    mov r5, #27
    add r6, r4, r5
    sub r0, r6, #7

    pop {r4, r5, r6, lr}   @ restore all four in one instruction
    bx lr
This program will work identically to the first one, except cleaner and faster.  In fact, it should run a lot faster, because we cut the eight memory instructions down to two, and none of the other code takes significant time in comparison.

Note that PUSH and POP always produce code using the stack pointer.  If you want to load or store data anywhere else, you will have to use STR, LDR, STM, or LDM directly (there is a short sketch of this at the end of this post).

As usual, running this program should return the value 33 as the error code.
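If you ever do need LDM and STM with a base other than the stack pointer, here is a minimal sketch, assuming a 16 byte buffer reserved in the bss section:

    ldr r0, =buffer        @ hypothetical buffer, .space 16 in .bss
    stm r0, {r1-r4}        @ store four registers to the buffer, r0 unchanged
    ldm r0, {r5-r8}        @ load the same four words back into r5 through r8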