Saturday, February 25, 2017

ARM Assembly: Type Casting

This topic was kind of skimmed over in the last article.  It is time to take it head on.  This is about integer type casting, not casting strings to integers, integers to strings, or integers to floats.  Those will be covered later.  The biggest value in type casting is generally taking small data types and extending them to larger ones.  ARM has a series of instructions specifically for this purpose.  It does not have instructions for down casting though.  This is easy enough, if you know how.

In the documentation, type casting instructions are categorized as "Signed extend" and "Unsigned extend".  Literally, we are taking values stored as a number of bits, and extending the number of bits the values can occupy.  The simplest operation here is unsigned extend, sometimes also called "zero extend", since all it does is tack an appropriate number of zeroes onto the left side of the value.  Signed extend notes the value of the sign bit, and it tacks the appropriate number of copies of that bit onto the left side of the value.  Both of these are trivial but very useful operations.

There are three standard signed extend instructions: SXTH, SXTB16, and SXTB.  The first extends a halfword to a word, or a 16 bit value to a 32 bit value, by filling the top 16 bits with whatever the top bit (or sign bit) of the 16 bit value is.  This ensures that even though the representation is changing, the actual value represented stays the same.  The second extends two bytes (8 bit values) to two halfwords (16 bit values).  This is similar to the parallel math instructions.  A signed 8 bit value is placed in bits [7:0] of the register, and a second is placed in bits [23:16].  The instruction sign extends them, and the result is two 16 bit values, one in [31:16], and the other in [15:0].  This is an especially good arrangement, if the next operation is going to be parallel addition or subtraction.  The last instruction above casts a signed byte to a signed word (8 bits to 32 bits).
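
To make the bit filling concrete, here is a rough model of the three signed extend instructions, written in C so it can be run on any machine.  The function names are just labels I picked for this sketch, each function only models the result left in the destination register, and the values in the check function are my own examples.

```c
#include <assert.h>
#include <stdint.h>

/* SXTB model: copy bit 7 into bits [31:8]. */
uint32_t sxtb(uint32_t rm) {
    uint32_t low = rm & 0xFFu;
    return (low & 0x80u) ? (low | 0xFFFFFF00u) : low;
}

/* SXTH model: copy bit 15 into bits [31:16]. */
uint32_t sxth(uint32_t rm) {
    uint32_t low = rm & 0xFFFFu;
    return (low & 0x8000u) ? (low | 0xFFFF0000u) : low;
}

/* SXTB16 model: bits [7:0] become [15:0], bits [23:16] become [31:16]. */
uint32_t sxtb16(uint32_t rm) {
    uint32_t lo = sxtb(rm) & 0xFFFFu;        /* low byte to low halfword    */
    uint32_t hi = sxtb(rm >> 16) & 0xFFFFu;  /* third byte to high halfword */
    return (hi << 16) | lo;
}

/* A few expected values, written as assertions. */
void check_signed_extend(void) {
    assert(sxtb(0x00000096u) == 0xFFFFFF96u);    /* 8 bit -106 becomes 32 bit -106 */
    assert(sxth(0x00007FFFu) == 0x00007FFFu);    /* positive values are unchanged  */
    assert(sxtb16(0x00960005u) == 0xFF960005u);  /* -106 and 5 as two halfwords    */
}
```

Calling check_signed_extend() from a small main() is enough to confirm the three cases.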

There are also three standard unsigned extend instructions: UXTH, UXTB16, and UXTB.  These follow the same pattern.  The first casts a halfword to a word, the second casts two bytes to halfwords in parallel, and the third casts a byte to a word.  As with the parallel arithmetic instructions, the parallel extend instructions can be used to cast just one byte to one halfword, by just ignoring the upper half of the register.
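
The unsigned versions are simple enough that each one can be modeled as a single mask.  This is only a sketch in C, with function names borrowed from the instructions for readability and example values of my own choosing.

```c
#include <assert.h>
#include <stdint.h>

/* UXTB model: keep bits [7:0], zero everything above. */
uint32_t uxtb(uint32_t rm) { return rm & 0x000000FFu; }

/* UXTH model: keep bits [15:0], zero everything above. */
uint32_t uxth(uint32_t rm) { return rm & 0x0000FFFFu; }

/* UXTB16 model: keep bits [7:0] and [23:16], each now a 16 bit value. */
uint32_t uxtb16(uint32_t rm) { return rm & 0x00FF00FFu; }

/* A few expected values, written as assertions. */
void check_unsigned_extend(void) {
    assert(uxtb(0x12345696u) == 0x00000096u);    /* 150 stays 150               */
    assert(uxth(0x1234FFFEu) == 0x0000FFFEu);    /* 65,534 stays 65,534         */
    assert(uxtb16(0x12965A05u) == 0x00960005u);  /* two bytes to two halfwords  */
    /* Casting one byte to one halfword: ignore the upper half of the result. */
    assert((uxtb16(0x000000C8u) & 0xFFFFu) == 200u);
}
```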

There are also some type casting instructions that do additional operations at the same time.  These operations are all addition.  We won't discuss those here, but it is good to be aware that ARM has a number of instructions that combine multiple operations into one.

So far, we have worked exclusively with unsigned values in the examples, and for most of this series we will stick to unsigned values for convenience.  This time though, we will work with some signed values.

printf() assumes all of its integer arguments are 32 bits.  This means that it is necessary to cast any 8 or 16 bit integers into 32 bit integers, before sending them to printf().  With unsigned values, where a register contains a single value, the 8, 16, and 32 bit interpretations are the same, but with signed values, they are not.  Let's write a small program that uses type casting to illustrate the fact that signed values can mean different things depending on data type size.
.data
string:
    .asciz "8 bit uncast value: %d, 8 bit value cast to 32 bits: %d\n"

.text
.global main
main:
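    push {r12, lr}      @ save lr; r12 keeps the stack 8 byte aligned

    mov r1, #150        @ bit pattern 10010110 in the bottom 8 bits
    sxtb r2, r1         @ sign extend those 8 bits to 32 bits
    ldr r0, =string     @ first argument: address of the format string
    bl printf

    pop {r12, lr}
    bx lr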
First let's discuss representation.  We are putting the value 150 into r1.  At least, that is what it looks like.  In reality, we are putting an 8 bit -106 into r1, but if we wrote mov r1, #-106, the assembler would put it into the register as a 32 bit -106, not an 8 bit -106.  What is the difference?  In binary, an 8 bit -106 looks like this: 10010110.  In binary a 32 bit -106 has 24 additional 1s on the left of that.  In a 32 bit register, an 8 bit -106 is 24 0s followed by 10010110, while a 32 bit -106 is 24 1s followed by 10010110.  The SXTB instruction casts to 32 bits by observing the left-most bit of 10010110 (which is 1), and then filling the left 24 bits of the register with that.  Now, the 32 bit interpretation of 24 0s followed by 10010110 is 150, which is why printf() displays 150 initially.

So, that 150 I put in r1 initially is the 8 bit representation of -106, but printf() won't interpret it that way, because it assumes all input is 32 bits.  If this sounds complicated, don't worry.  It kind of is.  Outside of computers, we don't really do anything this way.  It can take a bit to grasp.  The point with this is that remembering to type cast is important.  Let's say we want to do some math operation on our 8 bit -106, with a 32 bit value.  If we don't sign extend our 8 bit value to 32 bits first, it will act exactly like it was just a 32 bit 150.

The fact is, to the processor, values in registers are just strings of 1s and 0s or offs and ons.  It has no concept of type.  It neither knows nor cares about type.  Type is a function of how we choose to use the values stored.  To the processor an 8 bit -106 and a 32 bit 150 are identical.  It will treat the value however we tell it to.  So, if we tell it to do 32 bit math with an 8 bit -106, then that 8 bit -106 will be treated as a 32 bit 150, because the processor does not know our intent.
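
If it helps to see this spelled out, the following C sketch reads the same register contents in two ways.  The helper names are made up for this illustration, and the functions avoid C's own casts so that the arithmetic is visible.

```c
#include <assert.h>
#include <stdint.h>

/* Read the low 8 bits of a register as a signed 8 bit number. */
int32_t as_signed8(uint32_t reg) {
    int32_t low = (int32_t)(reg & 0xFFu);
    return (low >= 128) ? low - 256 : low;
}

/* Read all 32 bits as a signed number, the way printf("%d") does. */
int32_t as_signed32(uint32_t reg) {
    return (reg <= 0x7FFFFFFFu) ? (int32_t)reg : -(int32_t)(~reg) - 1;
}

void check_interpretations(void) {
    uint32_t r1 = 150u;                    /* what mov r1, #150 leaves behind  */
    uint32_t r2 = 0xFFFFFF96u;             /* what sxtb r2, r1 produces        */
    assert(as_signed8(r1) == -106);        /* 8 bit reading                    */
    assert(as_signed32(r1) == 150);        /* 32 bit reading of the same bits  */
    assert(as_signed32(r2) == -106);       /* after the cast, readings agree   */
    assert(as_signed32(r1 + 10u) == 160);  /* math without the cast uses 150   */
    assert(as_signed32(r2 + 10u) == -96);  /* math after the cast uses -106    */
}
```

The last two assertions are the practical consequence: adding 10 gives 160 without the sign extend and -96 with it.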

So now we can do up casting.  What about down casting?  ARM does not provide us with any down casting instructions.  In part this is because down casting is actually not that common.  In part this is because down casting is not horribly difficult.  We won't do an example, but I will explain how it works.

Unsigned down casting is trivial.  Merely zero out all bits above the data type size, and you have successfully down cast.  This can easily be done with an AND instruction that masks off the bits beyond the data type.  So, to cast a 32 or 16 bit unsigned integer in r1 to an 8 bit integer, you might use the instruction and r1, r1, #255.  The number 255 is 11111111 in binary.  In a 32 bit register, the left 24 bits of the mask are all 0s, so the AND will zero out all of those bits in the input value.  (The UXTB and UXTH instructions from earlier have the same effect, since they keep only the bottom 8 or 16 bits and zero the rest.)

To cast 32 bits to 16 bits with AND, it is a bit more complicated, because the <#imm8m> option for <Operand2> has a serious limitation.  The 8 in this option means that the immediate value must be an 8 bit value, which can then be rotated by an even number of bits to reach other positions in the 32 bit word.  To mask 16 bits, we need a 16 bit value.  If you try to do this with an immediate value in an AND instruction, you will be greeted with this error message: Error: invalid constant (ffff) after fixup.  You will have to construct this value, or you will have to store it in memory and load it.  Since memory accesses are slow, constructing the value with two instructions is probably the best option.  Start with mov r4, #0b11111111.  This will give you the bottom 8 bits.  Right after this do orr r4, r4, #0b1111111100000000.  This will add the other 8 bits, giving you the final value.  Then, use r4 (or whatever register you have chosen to use) in the AND instruction as your mask.  Alternatively, ARM's documentation has a "Bit field" section, containing a BFC or Bit Field Clear instruction that can be used to clear a section of a register, given an index and a width.  It appears this could zero the top half or three quarters of a register in a single instruction.
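
Here is a small C model of the masking described above, including the two step construction of the 16 bit mask.  The function names are hypothetical, and clear_field() is only my reading of what a bit field clear does given an index and a width, not a claim about the exact instruction encoding.

```c
#include <assert.h>
#include <stdint.h>

/* and r1, r1, #255 : keep only the bottom 8 bits. */
uint32_t down_cast_u8(uint32_t r1) { return r1 & 0xFFu; }

/* Build the 16 bit mask from two 8 bit pieces, as MOV followed by ORR would. */
uint32_t build_mask16(void) {
    uint32_t r4 = 0xFFu;  /* bottom 8 bits */
    r4 |= 0xFF00u;        /* next 8 bits   */
    return r4;
}

/* AND with the constructed mask: keep only the bottom 16 bits. */
uint32_t down_cast_u16(uint32_t r1) { return r1 & build_mask16(); }

/* Clear `width` bits starting at bit `lsb` (assumes lsb + width <= 32). */
uint32_t clear_field(uint32_t r1, unsigned lsb, unsigned width) {
    uint32_t field = (width >= 32u) ? 0xFFFFFFFFu : ((1u << width) - 1u);
    return r1 & ~(field << lsb);
}

void check_unsigned_down_cast(void) {
    assert(down_cast_u8(0x12345678u) == 0x78u);
    assert(down_cast_u16(0x12345678u) == 0x5678u);
    assert(clear_field(0x12345678u, 16, 16) == 0x5678u);  /* zero the top half           */
    assert(clear_field(0x12345678u, 8, 24) == 0x78u);     /* zero the top three quarters */
}
```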

Signed down casting is far less trivial.  I won't go too deep into the details here, because you will need to learn about conditionals before we can go there, but I can still describe the process.  The first thing you need to do is check the sign bit.  The easiest way to do that is to check whether or not the value is less than 0.  If it is less than 0, then the sign bit is 1.  If it is not less than 0, the sign bit is 0.  Then, mask it to the size you want.  Last, you replace the top bit (bit 7 for an 8 bit down cast, and bit 15 for a 16 bit down cast) with the value of the sign bit you checked.  Writing the new sign bit can be done most easily with an ORR (to set it to 1) or BIC (to set it to 0) instruction.
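
Until we get to conditionals in assembly, the three steps can be sketched in C.  This follows the process described above for a 32 bit to 8 bit cast; the function name is made up, and the result is returned as an 8 bit pattern sitting in the bottom of a 32 bit word.

```c
#include <assert.h>
#include <stdint.h>

/* Signed 32 bit to 8 bit down cast, following the steps in the text. */
uint32_t down_cast_s8(uint32_t r1) {
    int negative = (r1 & 0x80000000u) != 0;  /* step 1: less than 0 means bit 31 is set */
    uint32_t r = r1 & 0xFFu;                 /* step 2: mask to 8 bits                  */
    if (negative)
        r |= 0x80u;                          /* step 3: ORR sets bit 7                  */
    else
        r &= ~0x80u;                         /* step 3: BIC clears bit 7                */
    return r;
}

void check_signed_down_cast(void) {
    assert(down_cast_s8(0xFFFFFF96u) == 0x96u);  /* 32 bit -106 becomes 8 bit -106   */
    assert(down_cast_s8(100u) == 100u);          /* small positive values are kept   */
    assert(down_cast_s8(150u) == 22u);           /* 150 does not fit; data is lost   */
}
```

For values that already fit in 8 bits (-128 through 127), step 3 never changes anything, because bit 7 already matches the sign.  It only matters for values that are out of range, which is exactly where information gets lost.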

Note that down casting often loses information.  The C compiler will generally issue a warning when it detects down casting.  Sometimes, however, it is necessary or the loss of information is even desirable.  If you feel like you need to down cast though, it is a good idea to think carefully, to make sure you are not losing anything important.

ARM Assembly: Math with Small Data Types

If you are doing these in order, we just learned to do basic math in ARM assembly.  The math instructions we used work on whole registers at a time, and since ARMv7 registers are 32 bit, that means we only know how to do math with 32 bit integers.  In real world programming, sometimes we need to work with smaller integers of 16 or even 8 bits.  There are several possible reasons for this.  Sometimes, we don't need to store larger values, but we do need to conserve memory.  In some cases, we need to take advantage of certain properties of numbers, like when a value will roll over to negative or zero.  We may just be working with data, for example audio data or video data, that is stored in smaller types than 32 bit integers.  Whatever the case, sometimes we just need to work with smaller data types.

ARM provides us with a very convenient set of instructions for some operations on 16 and 8 bit values, but it does not cover everything we will need.  We will start with these.  There are four instructions we care about right now, but they are not easy to find.  Instead of being listed with the regular math operations, they are listed under "Parallel arithmetic", because they are parallel math instructions.  Video and audio encoding are two common uses of smaller data types, and both of these benefit from speed optimizations, so ARM includes some instructions that will do math on multiple values at the same time.  The instructions we care about are ADD16, SUB16, ADD8, and SUB8.  Let's start by looking at the math notation for ADD16.
ADD16 - Rd[31:16] := Rn[31:16] + Rm[31:16],
        Rd[15: 0] := Rn[15: 0] + Rm[15: 0]
If you recall from the last article, the notation [x:y] refers to the bits in the register that are used or affected.  You may notice that there are two equations for this instruction.  The first is adding bits 16-31 of the two operands and storing the result in the same bits of the destination register.  The second is doing the same for bits 0-15.  This instruction is performing two 16 bit additions at the same time, where the operands for one are in the top half of the 32 bit registers and the operands for the other are in the bottom half.  We won't delve any deeper into this right now, but part of the reason this is so convenient is that the default for the load and store instructions is to work with 32 bit values in memory, which means that we can easily load two registers with two 16 bit values each, and then we can perform math on them, without any extra work.  You will find the SUB16 instruction is very similar.  The 8 bit instructions are even better, as they can perform four operations in parallel, since a 32 bit register divides by 8 into 4 sections.
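
To see what "two additions at the same time" means for the bits, here is a C model of the unsigned 16 bit and 8 bit parallel adds.  It is a sketch of the result in Rd only (it ignores any flags the real instructions may update), and the function names simply mirror the instruction names.

```c
#include <assert.h>
#include <stdint.h>

/* UADD16 model: two independent 16 bit additions; carries never cross halves. */
uint32_t uadd16(uint32_t rn, uint32_t rm) {
    uint32_t lo = ((rn & 0xFFFFu) + (rm & 0xFFFFu)) & 0xFFFFu;
    uint32_t hi = ((rn >> 16) + (rm >> 16)) & 0xFFFFu;
    return (hi << 16) | lo;
}

/* UADD8 model: four independent 8 bit additions. */
uint32_t uadd8(uint32_t rn, uint32_t rm) {
    uint32_t rd = 0;
    for (unsigned shift = 0; shift < 32; shift += 8) {
        uint32_t lane = (((rn >> shift) & 0xFFu) + ((rm >> shift) & 0xFFu)) & 0xFFu;
        rd |= lane << shift;
    }
    return rd;
}

void check_parallel_add(void) {
    assert(uadd16(1000u, 500u) == 1500u);
    /* The bottom half overflows to 0, but the top half is not disturbed. */
    assert(uadd16(0x0001FFFFu, 0x00010001u) == 0x00020000u);
    /* A regular 32 bit ADD would have carried into the top half instead. */
    assert(0x0001FFFFu + 0x00010001u == 0x00030000u);
    assert(uadd8(0x01FF02FEu, 0x01010102u) == 0x02000300u);
}
```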

We can easily work with individual values with these instructions; we don't have to do operations in parallel.  Consider that if we put a 16 bit value in the bottom half or an 8 bit value in the bottom quarter of a register, and the other "slots" are all 0, adding or subtracting will leave the parts we don't care about as 0s.  Even if they are values other than 0, we don't have to ever use them.

One last thing we need to know about these instructions is the prefix we want.  The ADD16 instruction syntax is listed in the documentation as <prefix>ADD16 Rd, Rn, Rm.  We only care about two options right now, S for signed, and U for unsigned.  We are going to stick with unsigned in the example.
.data
addition:
    .asciz "%u + %u = %u\n"
subtraction:
    .asciz "%u - %u = %u\n"

.text
.global main
main:
    push {r12, lr}

    mov r1, #1000
    mov r2, #500
    uadd16 r3, r1, r2
    ldr r0, =addition
    bl printf

    mov r1, #100
    mov r2, #50
    usub8 r3, r1, r2
    ldr r0, =subtraction
    bl printf

    pop {r12, lr}
    bx lr

I named this smallmath.s.  When we move the value 1,000 into r1, it is being treated like a 32 bit integer, but because it is within the normal range for a short, it fits into the bottom 16 bits of the register and the top 16 bits are 0s, because the top 16 bits of the 32 bit integer are 0s.  So we have essentially put 1,000 into the bottom 16 bits and 0 into the top 16 bits.  Putting the number 500 into r2 works the same way.  printf() won't distinguish between smaller integer sizes.  It will always assume 32 bits, so the top bits of the register must be 0, or printf() will display the 32 bit interpretation of the register.

This program will ultimately work identically to one using regular ADD and SUB instructions.  If you want proof that ADD16 and SUB8 are working with smaller data types, change the values so that the result overflows or underflows.  For example, if you subtracted 100 from 50 in the 8 bit example, you would expect to get 206 if 8 bit math is being used, 65,486 if 16 bit math is used, and something very much larger if 32 bit math is being used.
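
Those three numbers are easy to check with a few lines of C.  This sketch only models the bottom slot of the register, and the function names are my own.

```c
#include <assert.h>
#include <stdint.h>

/* Subtract and keep only as many bits as the data type has. */
uint32_t sub_wrap8(uint32_t a, uint32_t b)  { return (a - b) & 0xFFu; }
uint32_t sub_wrap16(uint32_t a, uint32_t b) { return (a - b) & 0xFFFFu; }
uint32_t sub_wrap32(uint32_t a, uint32_t b) { return a - b; }

void check_underflow(void) {
    assert(sub_wrap8(50u, 100u) == 206u);          /* 256 - 50        */
    assert(sub_wrap16(50u, 100u) == 65486u);       /* 65,536 - 50     */
    assert(sub_wrap32(50u, 100u) == 4294967246u);  /* 2 to the 32nd - 50 */
}
```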

From here, you should be able to figure out 16 bit subtraction and 8 bit addition on your own.  Let's do one more example though, doing calculations in parallel.  We will do 16 bit subtraction this time.
.arch armv7-a

.data

subtraction:    .asciz "%u - %u = %u\n"

.text
.global main
main:
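    push {r4, r5, r6, lr}   @ r4-r6 are callee-saved; four registers keep the stack 8 byte aligned

    mov r4, #1000           @ bottom half of r4
    movt r4, #1200          @ top half of r4

    mov r5, #500            @ bottom half of r5
    movt r5, #600           @ top half of r5

    usub16 r6, r4, r5       @ two 16 bit subtractions at once

    uxth r1, r4             @ bottom halves, zero extended to 32 bits
    uxth r2, r5
    uxth r3, r6
    ldr r0, =subtraction
    bl printf

    mov r1, r4, LSR #16     @ top halves, shifted down to the bottom
    mov r2, r5, LSR #16
    mov r3, r6, LSR #16
    ldr r0, =subtraction
    bl printf

    pop {r4, r5, r6, lr}
    bx lr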
There is a lot going on here, and most of it is happening to extract the 16 bit values from the 32 bit registers.  First, we need to use an instruction that exists in armv7 but not armv6.  Our compiler defaults to armv6, since that is what the original Pi uses.  The top line tells the compiler about our processor, so that we can use the movt instruction.  This instruction puts a value into the top 16 bits of the register.  So, we stick 1,000 into r4, and it only occupies the bottom half.  Then we use movt to put 1,200 into the top half of r4.  Note that order is important here.  If we reversed these, the mov instruction would overwrite the 1,200 with 0s.  We do something similar with r5.  Now that we have our four values loaded into two registers, we can use the USUB16 instruction to do the math.  The easy part is over.
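
The packing step is easier to follow with the actual register contents in front of you.  This C sketch models mov, movt, and usub16 as plain functions (the names are borrowed from the instructions, and only the destination register is modeled).

```c
#include <assert.h>
#include <stdint.h>

/* mov rd, #value : the whole register becomes the value, top bits 0. */
uint32_t mov_imm(uint32_t value) { return value; }

/* movt rd, #high : replace bits [31:16], keep bits [15:0]. */
uint32_t movt(uint32_t rd, uint32_t high) {
    return (rd & 0x0000FFFFu) | ((high & 0xFFFFu) << 16);
}

/* USUB16 model: two independent 16 bit subtractions. */
uint32_t usub16(uint32_t rn, uint32_t rm) {
    uint32_t lo = ((rn & 0xFFFFu) - (rm & 0xFFFFu)) & 0xFFFFu;
    uint32_t hi = ((rn >> 16) - (rm >> 16)) & 0xFFFFu;
    return (hi << 16) | lo;
}

void check_packing(void) {
    uint32_t r4 = movt(mov_imm(1000u), 1200u);
    uint32_t r5 = movt(mov_imm(500u), 600u);
    assert(r4 == ((1200u << 16) | 1000u));           /* 0x04B003E8 */
    assert(r5 == ((600u << 16) | 500u));             /* 0x025801F4 */
    assert(usub16(r4, r5) == ((600u << 16) | 500u)); /* 1200-600 and 1000-500 */
    /* Wrong order: a mov after the movt wipes out the top half. */
    assert(mov_imm(1000u) == 0x000003E8u);
}
```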

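The hard part is getting the data out for printf() to display.  We cannot tell printf() to use just half of the register.  Notice we used r4, r5, and r6 this time.  This is because we need to keep the values between function calls.  The ARM C calling conventions don't promise r0 through r3 will not be overwritten by a function, but they do promise r4 through r11 will be preserved.  That promise applies to main as well, since it is called by the C startup code, so a function that changes any of r4 through r11 is expected to save them on entry and restore them before returning.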

First we want to display the values in the bottom halves of the registers.  We can use the UXTH instruction to cast an unsigned 16 bit value to an unsigned 32 bit value, and it will use the bottom 16 bits and ignore whatever is in the top 16 bits.  This gets us half of our values, and we display them.  Next, we need the top 16 bits, and we need them to be in the bottom 16 bits for printf() to interpret them correctly.  One of the ways ARM maintains good performance is by adding some optional operations to instructions that those operations are often used with.  In this case, we are using the bit shift operation.  If we shift the values in our registers 16 bits to the right, the top 16 bits will become the bottom 16 bits, and the bottom 16 bits will be dropped.  So, we use simple MOV instructions, but in transit, we are using a logical right shift to isolate the data we want and prepare it to be displayed.  We will discuss bit shifting more later.  The important part to understand here is that we are using it to isolate the data we want.
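
In C terms, the two extraction steps come down to a mask and a shift.  This is a minimal sketch with made-up function names, using the packed value 1,200 and 1,000 from the example as test data.

```c
#include <assert.h>
#include <stdint.h>

/* uxth rd, rm : keep the bottom 16 bits, ignore the top 16. */
uint32_t low_half(uint32_t rm) { return rm & 0xFFFFu; }

/* mov rd, rm, LSR #16 : the top 16 bits slide down, the bottom 16 are dropped. */
uint32_t high_half(uint32_t rm) { return rm >> 16; }

void check_halves(void) {
    uint32_t r4 = (1200u << 16) | 1000u;  /* two 16 bit values packed together */
    assert(low_half(r4) == 1000u);
    assert(high_half(r4) == 1200u);
}
```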

So, what about 8 bit values, where 4 are packed into a single register?  These are more difficult to extract.  We can easily extract the bottom 8 bits using the UXTB instruction to cast an unsigned byte to a 32 bit integer.  To get the second 8 bits, we would shift right 8 bits, to drop the first value, and then cast to 32 bits to isolate the value we want.  The third could be isolated by shifting 16 bits and casting.  The fourth could be isolated by shifting 24 bits.
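
The same shift and cast pattern, sketched in C for all four slots.  The function name and the index parameter are my own invention for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Extract unsigned 8 bit slot `index` (0 is the bottom slot, 3 is the top). */
uint32_t byte_slot(uint32_t reg, unsigned index) {
    uint32_t shifted = reg >> (8u * index);  /* drop the slots below   */
    return shifted & 0xFFu;                  /* then cast, as UXTB does */
}

void check_byte_slots(void) {
    uint32_t packed = 0x12345678u;
    assert(byte_slot(packed, 0) == 0x78u);  /* no shift      */
    assert(byte_slot(packed, 1) == 0x56u);  /* shift 8 bits  */
    assert(byte_slot(packed, 2) == 0x34u);  /* shift 16 bits */
    assert(byte_slot(packed, 3) == 0x12u);  /* shift 24 bits; nothing is left above it */
}
```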

For signed values things have to work slightly differently.  First, we have to use the signed extend to cast to signed 32 bit values (the SXTH and SXTB instructions), and second, we would have to cast all of the values, instead of all except the last one.  This is true for 16 and 8 bit values.  Of course, in real applications we rarely ever need to display the values created.  Instead we store them in memory and eventually write them to files or send them to devices like the sound card or video card.  In these cases, we never have to worry about this, because writing them to memory is no different from writing regular 32 bit values to memory.

The last thing to discuss is multiplication and division.  ARM does not provide parallel instructions for these operations, nor does it provide instructions explicitly for 16 or 8 bit multiplication or division.  For unsigned multiplication, we can just treat them like 32 bit values, and then just take the lower 16 or 8 bits when we are done.  This will just work out.  Any part of the result over the size of your type can be ignored, as an instruction specifically for that size would just discard the overflow.  For unsigned integer division, the quotient can never be larger than the dividend, so overflow can never happen.  For signed values, things are more complicated.  The easiest way to deal with this is to sign extend (cast) the values to 32 bit numbers, do the math, and then cast back.  To cast signed values down, first the sign bit must be checked, then the bit that would be the top of the desired data type must be set to the value of the sign bit.  From there, everything above the size of the desired type can be discarded.  If you forget to move the sign bit down, your values will not overflow correctly, and you could end up with some really strange behaviors.
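
For the unsigned case, the "do it in 32 bits and keep the low bits" approach looks like this in C.  The function names are hypothetical, and the divide assumes the divisor is not zero.

```c
#include <assert.h>
#include <stdint.h>

/* Unsigned 8 bit multiply: multiply as 32 bit values, keep the low 8 bits. */
uint32_t umul8(uint32_t a, uint32_t b) {
    return ((a & 0xFFu) * (b & 0xFFu)) & 0xFFu;
}

/* Unsigned 16 bit multiply: same idea, keep the low 16 bits. */
uint32_t umul16(uint32_t a, uint32_t b) {
    return ((a & 0xFFFFu) * (b & 0xFFFFu)) & 0xFFFFu;
}

/* Unsigned 8 bit divide: the quotient always fits (divisor assumed nonzero). */
uint32_t udiv8(uint32_t a, uint32_t b) {
    return (a & 0xFFu) / (b & 0xFFu);
}

void check_small_mul_div(void) {
    assert(umul8(200u, 3u) == 88u);         /* 600 - 512          */
    assert(umul16(1000u, 100u) == 34464u);  /* 100,000 - 65,536   */
    assert(udiv8(200u, 3u) == 66u);         /* no masking needed  */
}
```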

Because small data types can be difficult to work with, some programmers advocate never using data types smaller than the word size of the platform unless absolutely necessary.  Depending on the application, it is possible that using smaller data types will reduce CPU efficiency of a program significantly.  On the other side though, often memory is a bigger bottleneck than CPU efficiency, and in these cases, using smaller data types may be necessary for acceptable performance.  And of course, in applications where small data type operations can be parallelized, CPU efficiency can be dramatically improved by using them.  This is ultimately a judgement call that must be made when designing a program and writing the code.

My personal strategy is to use the smallest data type necessary for the particular use case, and if I find there are performance problems, then I will deal with it when it comes up.  In languages like C, increasing the data type of a variable to a larger size is trivial, but decreasing it can have unexpected consequences.  Of course, in assembly, changing the data type of a variable in either direction may require rewriting significant sections of code to use the appropriate instructions, so this may not be a feasible strategy with assembly.  The most important thing is knowing how to work with small data types when it is necessary though.

ARM Assembly: Basic Math

Programming is all about math.  Every program uses copious amounts of math.  Even a simple program that prints out a single line of text is using math somewhere, to figure out how long the text is and when it is done printing all of the characters.  While traditional programming languages may take care of some of this math for you, assembly language leaves most of the math to the programmer.  This means we need to learn how to do simple math in assembly, before we can really do much else.

The ARMv7 architecture has a large collection of math instructions.  It has several addition and subtraction instructions, and it has many different multiply instructions.  The ARMv7-R architecture also includes two division instructions.  Our Pi2 uses ARMv7-A, which does not include division instructions.  However, the specific model, Cortex-A7, does include something called the Virtualization Extensions, which add those division instructions.  This means we can add, subtract, multiply, and divide with fairly simple instructions.

We will use GCC again for this one, because we don't know how to print stuff to the screen without it yet.  We really need to learn to do math before we can stop relying on the C libraries for things.

There are five math instructions we care about right now.  There are several times that total, but many of them are for specific cases that are less common.  We will look at more of them later.  For now, we care about the add, sub, mul, udiv, and sdiv instructions.  These will allow us to add, subtract, multiply, and divide.  Notice there are two divide instructions, one for unsigned math and the other for signed math.  The other three operations only have one instruction each, because 2's complement math just works out for them, without the need to distinguish between signed and unsigned values.
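
If you want to convince yourself of this, here is a C sketch that runs the same bit pattern through each operation.  The helper names are my own, 64 bit math is used only so the signed reading can be computed without tricks, and the divisor is assumed to be nonzero.

```c
#include <assert.h>
#include <stdint.h>

/* Read a 32 bit pattern as a signed 2's complement number. */
int64_t as_signed(uint32_t bits) {
    return (bits & 0x80000000u) ? (int64_t)bits - 4294967296LL : (int64_t)bits;
}

/* Store a result back into a 32 bit register (keeps the low 32 bits). */
uint32_t to_bits(int64_t value) { return (uint32_t)value; }

/* UDIV and SDIV models. */
uint32_t udiv(uint32_t rn, uint32_t rm) { return rn / rm; }
uint32_t sdiv(uint32_t rn, uint32_t rm) { return to_bits(as_signed(rn) / as_signed(rm)); }

void check_signedness(void) {
    uint32_t a = 0xFFFFFFF4u;  /* -12 if signed, 4,294,967,284 if unsigned */
    uint32_t b = 3u;
    /* Add, subtract, and multiply give the same bits either way. */
    assert(a + b == to_bits(as_signed(a) + as_signed(b)));
    assert(a - b == to_bits(as_signed(a) - as_signed(b)));
    assert(a * b == to_bits(as_signed(a) * as_signed(b)));
    /* Division does not, which is why there are two instructions. */
    assert(sdiv(a, b) == 0xFFFFFFFCu);  /* -12 / 3 = -4 */
    assert(udiv(a, b) == 0x55555551u);  /* 4,294,967,284 / 3 = 1,431,655,761 */
}
```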

The syntax for add and sub is the same.  According to ARM's documentation the syntax looks like this:
ADD{S} Rd, Rn, <Operand2>
SUB{S} Rd, Rn, <Operand2>
It would help to understand the documentation.  The part before the {S} is the actual instruction.  The {S} is an optional character that means the instruction should update the condition flags, which we will discuss later.  Next is a list of what are essentially arguments.  Rd always represents the destination register.  Rn is just a register containing an input value.  <Operand2> can be a number of options.  The two most important are another register containing an input value and an immediate value of type #<imm8m> (a number that fits certain requirements).  We will look at other things <Operand2> can be later.

The syntax for mul, udiv, and sdiv is different:
MUL{S} Rd, Rm, Rs
UDIV Rd, Rn, Rm
SDIV Rd, Rn, Rm
You will notice that mul has the {S} option, but the divide instructions do not.  As with the other instructions, Rd is the destination register.  Rn and Rm are registers containing values.  Rs is also just a register containing a value.  Unlike <Operand2>, these operands must all be registers; immediate values are not allowed.

So now we know the syntax for the basic math instructions.  How do we actually use them?  Well, let's look at one more thing from the documentation, and then we will try it out.

The documentation describes the action of these instructions using mathematical notation, and understanding this can really help you understand what an instruction is doing.  The five instructions above have their actions defined as the following:
ADD - Rd := Rn + <Operand2>
SUB - Rd := Rn - <Operand2>
MUL - Rd := (Rm * Rs)[31:0]
DIV - Rd := Rn / Rm
These pretty much all mean, the second and third operands are operated on, and the result is stored in Rd.  For subtraction and division, keep in mind that order matters (take a peek at the math notation for the RSB instruction, and compare it to SUB).  In the multiplication one, notice the [31:0].  This specifies that only the lowest 32 bits of the operation are stored in Rd (which makes sense, given that Rd is a 32 bit register).  This is also true of addition, when the result is bigger than 32 bits.
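
The [31:0] part can be modeled in C by computing the full result in 64 bits and then throwing away everything above bit 31.  This is just an illustration with function names I made up.

```c
#include <assert.h>
#include <stdint.h>

/* MUL model: Rd := (Rm * Rs)[31:0] */
uint32_t mul_low32(uint32_t rm, uint32_t rs) {
    uint64_t full = (uint64_t)rm * (uint64_t)rs;  /* can be up to 64 bits wide */
    return (uint32_t)(full & 0xFFFFFFFFu);
}

/* ADD model: a 33rd bit in the sum is not stored in Rd. */
uint32_t add_low32(uint32_t rn, uint32_t operand2) {
    uint64_t full = (uint64_t)rn + (uint64_t)operand2;
    return (uint32_t)(full & 0xFFFFFFFFu);
}

void check_low32(void) {
    assert(mul_low32(12u, 14u) == 168u);                 /* small results are untouched */
    assert(mul_low32(0x10000u, 0x10000u) == 0u);         /* 2 to the 32nd wraps to 0    */
    assert(mul_low32(100000u, 100000u) == 1410065408u);  /* 10,000,000,000 mod 2 to the 32nd */
    assert(add_low32(0xFFFFFFFFu, 2u) == 1u);
}
```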

Actually using these is pretty simple.  Let's say we want to add 12 and 14.  We would start by putting one of the numbers in a register.  Then we could use an add instruction with an immediate value for the other one.  Alternatively, we could just put both values in registers and add them.
mov r1, #12
add r0, r1, #14
This code will put the number 12 into register r1.  Then it will add the value (12) in r1 to 14 and store the result (26) in r0.  Note that r1 will still have 12 in it when this is done.  We could have written add r1, r1, #14, and it would have overwritten r1 with the result of 12 + 14 (26).  If you don't care about intermediate values, you can just reuse their registers like this, but if you need to keep an intermediate value, make sure you save the result somewhere else.

We can write a simple program that does this addition with both input values in registers like this:
.text
.global main
main:
    mov r0, #12
    mov r1, #14
    add r0, r0, r1
    bx lr
Since we don't need to store anything in memory, we will start with the text section.  Because we are using GCC to compile, we need to start with a main function.  Then, we put 12 in r0 and 14 in r1.  Next we add r0 and r1 and store the result in r0.  The last line, bx lr, returns from the function.  You may recall reading in a previous article, that lr is the link register, where the return address is stored when we call a function.  main is just a function called by the C startup code, so the return address is stored in lr.  When we are done, we return with bx, which takes a register containing an address to go to.  So bx lr returns from main.  Once you are done, save your program as add.s.

Now we can compile the program.  Run gcc -o add add.s to compile.  Now run ./add to run the program.  The anticlimactic result is...nothing.  The program does not print anything to the screen, because we did not tell it to.  It did return something though.  Linux programs return an error code, and they do this by leaving the error code in r0 when they exit.  Notice our add instruction stored its result in r0, so the error code should be 26.  We can view this error code by running echo $?.  (Note that every program returns an error code, so this will only work if you run it directly after add.)  This should display the number 26.

From here the rest of the math instructions are fairly simple.  Multiplication and division require all operands to be in registers.  Subtraction works like addition.  Now let's write a bigger program that will print the values to the screen, instead of just returning one value as an error code.

Because we are compiling with GCC, we have free access to built-in C functions.  This means we can use printf() to output text to the console.  According to the documentation, the signature for this function is int printf(const char *format, ...);.   To call this function, we need to know a little bit about C calling conventions, for ARM.  Right now, there are two important things we need to know.  The first is that the return value is always placed in r0.  This is less important than the second, which is that arguments are passed in r0 to r3.  If there are more than four arguments, additional ones are passed on the stack.  Since we have not learned this yet, we will stick to four arguments or fewer.  One more important thing to know is that arguments are ordered in the registers the same as they would be ordered in the function call in C.

Knowing this, here is what we can get from the function signature of printf():
  • When it returns, the return value will be in r0.
  • The first argument must be placed in r0 before calling printf().
  • The first argument is a pointer to a cstring.  Note that cstrings are null terminated.
  • The second argument must be placed in r1, the third in r2, and the fourth in r3.
  • We are going to avoid more than four arguments, because we have not learned to pass arguments on the stack yet.
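
For comparison, here is what a four argument call looks like from the C side, with comments noting where each argument ends up according to the list above.  The function print_sum() is a hypothetical wrapper that exists only for this sketch.

```c
#include <assert.h>
#include <stdio.h>

/* The printf() call below has exactly four arguments, so under the
 * convention described above they are passed as:
 *   r0 = address of the format string (a null terminated cstring)
 *   r1 = a
 *   r2 = b
 *   r3 = a + b
 * When printf() returns, its return value (the number of characters
 * printed) is in r0. */
int print_sum(int a, int b) {
    return printf("%d + %d = %d\n", a, b, a + b);
}
```

Calling print_sum(12, 14) prints "12 + 14 = 26" followed by a newline, and returns 13, the number of characters printed.
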
Let's examine a simple program that prints out a single string using printf().
.data
string:
    .asciz "Print me!\n"

.text
.global main
main:
    push {r12, lr}
    ldr r0, =string
    bl printf
    pop {r12, lr}
    bx lr

Save this as print.s, then compile it with gcc -o print print.s.  Now you can run ./print, and it will display the text "Print me!" followed by a newline.

What, exactly, is going on here?  We start by defining a data section.  Programs are composed of several sections.  The text section is where the executable code goes.  The data section is where global variables and constants go.  There is also a section for uninitialized data, and there are special sections that can be used for other things.  In fact, we could have put our string in a special section for read-only data, but we won't worry about that right now.  In the data section we create a label, which is nothing more than a symbol for referencing a specific location in memory.  Then we put some data in the data section.  The .asciz directive tells the assembler that we want to create a null (or zero) terminated string.  Next, we define the string.  The assembler will transparently add a null character to the end when it assembles the program.  Later we will need to create strings that are not null terminated, and we will use the .ascii directive for that.

Once our data is in the program, we will write our code in the text section, in the main function.  Now, I mentioned earlier that when a function is called, it overwrites the link register with its own return point.  Since main() is a function, its own return location is currently stored in lr.  When we call printf(), this will be overwritten, and if we don't save it somewhere else, main() won't know where to return to.  The first line of our main() function does a few things for us.  First, it stores the contents of r12 and lr on the stack.  Then, it updates the stack pointer, decreasing it by 8.  We will look at exactly why it does this later.  Next, we load the address that string points to into r0.  This is the cstring pointer that printf() expects for its first argument.  We don't have any more arguments for printf(), so next we call it.  Because we are using GCC, we don't have to do anything special; it knows where to find printf() for us.  printf() stores its return value in r0 before it returns, but we don't really care about that.  (Note, however, that since we don't change r0 after this, whatever it left there will be the error code for our program.)  Now, we need to get the value from lr back off of the stack, so we use the pop instruction.  Now we can return, using bx on the link register.

There are a few things you may have noticed here.  First, we are pushing and popping two registers.  Why are we storing r12, when it really never gets used anywhere?  The answer is that printf() and many other C functions and external interfaces (like the OS) expect the stack to be 8 byte aligned.  We will talk about alignment later, but the important part here is that if we push just one register, the stack will be 4 byte aligned, so we are pushing an extra one to keep it 8 byte aligned.

Now that we know how to print stuff to the screen, let's do some subtraction!
.data
string:
    .asciz "%d - %d = %d\n"

.text
.global main
main:
    push {r12, lr}
    ldr r0, =string
    mov r1, #12
    mov r2, #7
    sub r3, r1, r2
    bl printf
    pop {r12, lr}
    bx lr

Save this as sub.s, then compile with gcc -o sub sub.s.  Now run ./sub.  Verify that the output is correct.

This is very similar to the previous program, except that we are having printf() display the values used in our math.  Now, we could have done the math in any registers, but I specifically chose the ones I did, to avoid having to rearrange things later for printf().  As before, we load the address of string into r0.  Then we put our operands into r1 and r2 in the order we want printf() to display them.  Lastly, we do our subtraction, placing the result in r3, as the fourth argument of printf().  Everything after this is identical to our previous program.  Being able to print out results like this is a major step in being able to debug larger programs, though once we let go of the C library, we will have to find other ways.

Multiplication will be left as an exercise for the reader.  Really, this should be trivial, given what you have already done.  Division will be discussed in a separate post, as there are some additional requirements to use the division instructions.