Saturday, February 25, 2017

ARM Assembly: Math with Small Data Types

If you are doing these in order, we just learned to do basic math in ARM assembly.  The math instructions we used work on whole registers at a time, and since ARMv7 registers are 32 bit, that means we only know how to do math with 32 bit integers.  In real world programming, sometimes we need to work with smaller integers of 16 or even 8 bits.  There are several possible reasons for this.  Sometimes, we don't need to store larger values, but we do need to conserve memory.  In some cases, we need to take advantage of certain properties of numbers, like when a value will roll over to negative or zero.  We may just be working with data, for example audio data or video data, that is stored in smaller types than 32 bit integers.  Whatever the case, sometimes we just need to work with smaller data types.

ARM provides us with a very convenient set of instructions for some operations on 16 and 8 bit values, but it does not cover everything we will need.  We will start with these.  There are four instructions we care about right now, but they are not easy to find.  Instead of being listed with the regular math operations, they are listed under "Parallel arithmetic", because they are parallel math instructions.  Video and audio encoding are two common uses of smaller data types, and both of these benefit from speed optimizations, so ARM includes some instructions that will do math on multiple values at the same time.  The instructions we care about are ADD16, SUB16, ADD8, and SUB8.  Let's start by looking at the math notation for ADD16.
ADD16 - Rd[31:16] := Rn[31:16] + Rm[31:16],
        Rd[15: 0] := Rn[15: 0] + Rm[15: 0]
If you recall from the last article, the notation [x:y] refers to the bits of the register that are used or affected.  You may notice that there are two equations for this instruction.  The first adds bits 16-31 of the two operands and stores the result in the same bits of the destination register.  The second does the same for bits 0-15.  This instruction performs two 16 bit additions at the same time: the operands for one addition are in the top halves of the 32 bit registers, and the operands for the other are in the bottom halves.  We won't delve any deeper into this right now, but part of the reason this is so convenient is that the load and store instructions default to working with 32 bit values in memory.  That means we can easily load two registers with two 16 bit values each and then do math on them, without any extra work.  You will find the SUB16 instruction is very similar.  The 8 bit instructions are even better, as they can perform four operations in parallel, since a 32 bit register divides into four 8 bit sections.

We can easily work with individual values with these instructions; we don't have to do operations in parallel.  Consider, if we put a 16 bit value in the bottom half or an 8 bit value in the bottom quarter of a register, and the other "slots" are all 0, adding or subtracting will leave the parts we don't care about as 0s.  Even if they are values other than 0, we don't have to ever use them.

One last thing we need to know about these instructions is the prefix we want.  The ADD16 instruction syntax is listed in the documentation as <prefix>ADD16 Rd, Rn, Rm.  We only care about two options right now, S for signed, and U for unsigned.  We are going to stick with unsigned in the example.
.data
addition:
    .asciz "%u + %u = %u\n"
subtraction:
    .asciz "%u - %u = %u\n"

.text
.global main
main:
    push {r12, lr}

    mov r1, #1000
    mov r2, #500
    uadd16 r3, r1, r2
    ldr r0, =addition
    bl printf

    mov r1, #100
    mov r2, #50
    usub8 r3, r1, r2
    ldr r0, =subtraction
    bl printf

    pop {r12, lr}
    bx lr

I named this smallmath.s.  When we move the value 1,000 into r1, it is being treated like a 32 bit integer, but because it is within the normal range for a short, it fits into the bottom 16 bits of the register and the top 16 bits are 0s, because the top 16 bits of the 32 bit integer are 0s.  So we have essentially put 1,000 into the bottom 16 bits and 0 into the top 16 bits.  Putting the number 500 into r2 works the same way.  printf() won't distinguish between smaller integer sizes.  It will always assume 32 bits, so the top bits of the register must be 0, or printf() will display the 32 bit interpretation of the register.

This program will ultimately behave identically to one using the regular ADD and SUB instructions.  If you want proof that UADD16 and USUB8 are working with smaller data types, change the values so that the result overflows or underflows.  For example, if you subtracted 100 from 50 in the 8 bit example, you would expect to get 206 if 8 bit math is being used, 65,486 if 16 bit math is being used, and something very much larger if 32 bit math is being used.

From here, you should be able to figure out 16 bit subtraction and 8 bit addition on your own.  Let's do one more example though, doing calculations in parallel.  We will do 16 bit subtraction this time.
.arch armv7-a

.data

subtraction:    .asciz "%u - %u = %u\n"

.text
.global main
main:
    push {r12, lr}

    mov r4, #1000
    movt r4, #1200

    mov r5, #500
    movt r5, #600

    usub16 r6, r4, r5

    uxth r1, r4
    uxth r2, r5
    uxth r3, r6
    ldr r0, =subtraction
    bl printf

    mov r1, r4, LSR #16
    mov r2, r5, LSR #16
    mov r3, r6, LSR #16
    ldr r0, =subtraction
    bl printf
    pop {r12, lr}
    bx lr
There is a lot going on here, and most of it is there to extract the 16 bit values from the 32 bit registers.  First, we need an instruction that exists in armv7 but not armv6.  Our assembler defaults to armv6, since that is what the original Pi uses, so the top line tells the assembler about our processor and lets us use the movt instruction.  This instruction puts a value into the top 16 bits of a register.  So, we stick 1,000 into r4, and it only occupies the bottom half.  Then we use movt to put 1,200 into the top half of r4.  Note that order is important here.  If we reversed these, the mov instruction would overwrite the 1,200 with 0s.  We do something similar with r5.  Now that we have our four values loaded into two registers, we can use the USUB16 instruction to do the math.  The easy part is over.

The hard part is getting the data out for printf() to display.  We cannot tell printf() to use just half of the register.  Notice we used r4, r5, and r6 this time.  This is because we need to keep the values between function calls.  The ARM C calling conventions don't promise r0 through r3 will not be overwritten by a function, but they do promise r4 through r11 will be preserved.

First we want to display the values in the bottom halves of the registers.  We can use the UXTH instruction to cast an unsigned 16 bit value to an unsigned 32 bit value, and it will use the bottom 16 bits and ignore whatever is in the top 16 bits.  This gets us half of our values, and we display them.  Next, we need the top 16 bits, and we need them to be in the bottom 16 bits for printf() to interpret them correctly.  One of the ways ARM maintains good performance is by adding some optional operations to instructions that those operations are often used with.  In this case, we are using the bit shift operation.  If we shift the values in our registers 16 bits to the right, the top 16 bits will become the bottom 16 bits, and the bottom 16 bits will be dropped.  So, we use simple MOV instructions, but in transit, we are using a logical right shift to isolate the data we want and prepare it to be displayed.  We will discuss bit shifting more later.  The important part to understand here is that we are using it to isolate the data we want.

So, what about 8 bit values, where 4 are packed into a single register?  These are more difficult to extract.  We can easily extract the bottom 8 bits using the UXTB instruction to cast an unsigned byte to a 32 bit integer.  To get the second 8 bits, we would shift right 8 bits, to drop the first value, and then cast to 32 bits to isolate the value we want.  The third could be isolated by shifting 16 bits and casting.  The fourth could be isolated by shifting 24 bits.

For signed values things work slightly differently.  First, we have to use sign extension to cast to signed 32 bit values (the SXTH and SXTB instructions), and second, we have to cast every value, not just all but the last.  The topmost value no longer falls out of a plain right shift, because LSR fills the vacated bits with zeros rather than copies of the sign bit (an arithmetic shift, ASR, would also handle that topmost value).  This is true for both 16 and 8 bit values.  Of course, in real applications we rarely need to display the values we compute.  Instead we store them in memory and eventually write them to files or send them to devices like the sound card or video card.  In those cases, none of this is a concern, because writing packed values to memory is no different from writing regular 32 bit values to memory.

The last thing to discuss is multiplication and division.  ARM does not provide parallel instructions for these operations, and aside from a few signed 16 bit multiply instructions like SMULBB, it does not provide instructions explicitly for 16 or 8 bit multiplication or division either.  For unsigned multiplication, we can just treat the values like 32 bit values and take the lower 16 or 8 bits of the result when we are done.  Any part of the result above the size of your type can be ignored; an instruction built specifically for that size would discard the overflow in exactly the same way.  For unsigned integer division, the quotient can never be larger than the dividend, so overflow can never happen.  For signed values, things are more involved, but the recipe is similar: sign extend (cast) the values to 32 bit numbers with SXTH or SXTB, do the math, and then cast back down.  Casting down just means keeping the low 16 or 8 bits; in two's complement, those bits already encode the correctly wrapped result.  The thing to remember is that if you want to keep using the narrow result in 32 bit math, you must sign extend it again first, or your values will not wrap correctly and you can end up with some really strange behaviors.

Because small data types can be difficult to work with, some programmers advocate never using data types smaller than the word size of the platform unless absolutely necessary.  Depending on the application, it is possible that using smaller data types will reduce CPU efficiency of a program significantly.  On the other side though, often memory is a bigger bottleneck than CPU efficiency, and in these cases, using smaller data types may be necessary for acceptable performance.  And of course, in applications where small data type operations can be parallelized, CPU efficiency can be dramatically improved by using them.  This is ultimately a judgement call that must be made when designing a program and writing the code.

My personal strategy is to use the smallest data type suited to the particular use case, and if I find there are performance problems, I deal with them when they come up.  In languages like C, increasing a variable to a larger data type is trivial, but decreasing it can have unexpected consequences.  In assembly, of course, changing the data type of a variable in either direction may require rewriting significant sections of code to use the appropriate instructions, so this may not be a feasible strategy there.  The most important thing, though, is knowing how to work with small data types when it is necessary.
