Saturday, February 25, 2017

ARM Assembly: Math with Small Data Types

If you are doing these in order, we just learned to do basic math in ARM assembly.  The math instructions we used work on whole registers at a time, and since ARMv7 registers are 32 bit, that means we only know how to do math with 32 bit integers.  In real world programming, sometimes we need to work with smaller integers of 16 or even 8 bits.  There are several possible reasons for this.  Sometimes, we don't need to store larger values, but we do need to conserve memory.  In some cases, we need to take advantage of certain properties of numbers, like when a value will roll over to negative or zero.  We may just be working with data, for example audio data or video data, that is stored in smaller types than 32 bit integers.  Whatever the case, sometimes we just need to work with smaller data types.

ARM provides us with a very convenient set of instructions for some operations on 16 and 8 bit values, but it does not cover everything we will need.  We will start with these.  There are four instructions we care about right now, but they are not easy to find.  Instead of being listed with the regular math operations, they are listed under "Parallel arithmetic", because they are parallel math instructions.  Video and audio encoding are two common uses of smaller data types, and both of these benefit from speed optimizations, so ARM includes some instructions that will do math on multiple values at the same time.  The instructions we care about are ADD16, SUB16, ADD8, and SUB8.  Let's start by looking at the math notation for ADD16.
ADD16 - Rd[31:16] := Rn[31:16] + Rm[31:16],
        Rd[15: 0] := Rn[15: 0] + Rm[15: 0]
If you recall from the last article, the notation [x:y] refers to the bits in the register that are used or affected.  You may notice that there are two equations for this instruction.  The first is adding bits 16-31 of the two operands and storing the result in the same bits of the destination register.  The second is doing the same for bits 0-15.  This instruction is performing two 16 bit additions at the same time, where the operands for one are in the top half of the 32 bit registers and the operands for the other are in the bottom half.  We won't delve any deeper into this right now, but part of the reason this is so convenient is that the default for the load and store instructions is to work with 32 bit values in memory, which means that we can easily load two registers with two 16 bit values each, and then we can perform math on them, without any extra work.  You will find the SUB16 instruction is very similar.  The 8 bit instructions are even better, as they can perform four operations in parallel, since a 32 bit register divides by 8 into 4 sections.

We can easily work with individual values with these instructions; we don't have to do operations in parallel.  Consider, if we put a 16 bit value in the bottom half or an 8 bit value in the bottom quarter of a register, and the other "slots" are all 0, adding or subtracting will leave the parts we don't care about as 0s.  Even if they are values other than 0, we don't have to ever use them.

One last thing we need to know about these instructions is the prefix we want.  The ADD16 instruction syntax is listed in the documentation as <prefix>ADD16 Rd, Rn, Rm.  We only care about two options right now, S for signed, and U for unsigned.  We are going to stick with unsigned in the example.
.data
addition:
    .asciz "%u + %u = %u\n"
subtraction:
    .asciz "%u - %u = %u\n"

.text
.global main
main:
    push {r12, lr}

    mov r1, #1000
    mov r2, #500
    uadd16 r3, r1, r2
    ldr r0, =addition
    bl printf

    mov r1, #100
    mov r2, #50
    usub8 r3, r1, r2
    ldr r0, =subtraction
    bl printf

    pop {r12, lr}
    bx lr

I named this smallmath.s.  When we move the value 1,000 into r1, it is being treated like a 32 bit integer, but because it is within the normal range for a short, it fits into the bottom 16 bits of the register and the top 16 bits are 0s, because the top 16 bits of the 32 bit integer are 0s.  So we have essentially put 1,000 into the bottom 16 bits and 0 into the top 16 bits.  Putting the number 500 into r2 works the same way.  printf() won't distinguish between smaller integer sizes.  It will always assume 32 bits, so the top bits of the register must be 0, or printf() will display the 32 bit interpretation of the register.

This program will ultimately work identically to one using regular ADD and SUB instructions.  If you want proof that ADD16 and SUB8 are working with smaller data types, change the values so that the result overflows or underflows.  For example, if you subtracted 100 from 50 in the 8 bit example, you would expect to get 206 if 8 bit math is being used, 65,486 if 16 bit math is being used, and something very much larger if 32 bit math is being used.

From here, you should be able to figure out 16 bit subtraction and 8 bit addition on your own.  Let's do one more example though, doing calculations in parallel.  We will do 16 bit subtraction this time.
.arch armv7-a

.data

subtraction:    .asciz "%u - %u = %u\n"

.text
.global main
main:
    push {r12, lr}

    mov r4, #1000
    movt r4, #1200

    mov r5, #500
    movt r5, #600

    usub16 r6, r4, r5

    uxth r1, r4
    uxth r2, r5
    uxth r3, r6
    ldr r0, =subtraction
    bl printf

    mov r1, r4, LSR #16
    mov r2, r5, LSR #16
    mov r3, r6, LSR #16
    ldr r0, =subtraction
    bl printf
    pop {r12, lr}
    bx lr
There is a lot going on here, and most of it is happening to extract the 16 bit values from the 32 bit registers.  First, we need to use an instruction that exists in armv7 but not armv6.  Our assembler defaults to armv6, since that is what the original Pi uses.  The top line tells the assembler about our processor, so that we can use the movt instruction.  This instruction puts a value into the top 16 bits of a register.  So, we stick 1,000 into r4, and it only occupies the bottom half.  Then we use movt to put 1,200 into the top half of r4.  Note that order is important here.  If we reversed these, the mov instruction would overwrite the 1,200 with 0s.  We do something similar with r5.  Now that we have our four values loaded into two registers, we can use the USUB16 instruction to do the math.  The easy part is over.

The hard part is getting the data out for printf() to display.  We cannot tell printf() to use just half of the register.  Notice we used r4, r5, and r6 this time.  This is because we need to keep the values between function calls.  The ARM C calling conventions don't promise r0 through r3 will not be overwritten by a function, but they do promise r4 through r11 will be preserved.

First we want to display the values in the bottom halves of the registers.  We can use the UXTH instruction to cast an unsigned 16 bit value to an unsigned 32 bit value, and it will use the bottom 16 bits and ignore whatever is in the top 16 bits.  This gets us half of our values, and we display them.  Next, we need the top 16 bits, and we need them to be in the bottom 16 bits for printf() to interpret them correctly.  One of the ways ARM maintains good performance is by adding some optional operations to instructions that those operations are often used with.  In this case, we are using the bit shift operation.  If we shift the values in our registers 16 bits to the right, the top 16 bits will become the bottom 16 bits, and the bottom 16 bits will be dropped.  So, we use simple MOV instructions, but in transit, we are using a logical right shift to isolate the data we want and prepare it to be displayed.  We will discuss bit shifting more later.  The important part to understand here is that we are using it to isolate the data we want.

So, what about 8 bit values, where 4 are packed into a single register?  These are more difficult to extract.  We can easily extract the bottom 8 bits using the UXTB instruction to cast an unsigned byte to a 32 bit integer.  To get the second 8 bits, we would shift right 8 bits, to drop the first value, and then cast to 32 bits to isolate the value we want.  The third could be isolated by shifting 16 bits and casting.  The fourth could be isolated by shifting 24 bits.

For signed values things have to work slightly differently.  First, we have to use the signed extend to cast to signed 32 bit values (the SXTH and SXTB instructions), and second, we would have to cast all of the values, instead of all except the last one.  This is true for 16 and 8 bit values.  Of course, in real applications we rarely ever need to display the values created.  Instead we store them in memory and eventually write them to files or send them to devices like the sound card or video card.  In these cases, we never have to worry about this, because writing them to memory is no different from writing regular 32 bit values to memory.

The last thing to discuss is multiplication and division.  ARM does not provide parallel instructions for these operations, nor does it provide instructions explicitly for 16 or 8 bit multiplication or division.  For unsigned multiplication, we can just treat the values like 32 bit integers and take the lower 16 or 8 bits of the result when we are done.  This works out, because any part of the result above the size of your type can be ignored; an instruction specifically for that size would just discard the overflow anyway.  For integer division, it is impossible to get a result larger than either operand, so overflow can never happen.  For signed values, things are more complicated.  The easiest way to deal with this is to sign extend (cast) the values to 32 bit numbers, do the math, and then cast back.  To cast signed values down, first the sign bit must be checked, then the bit that would be the top of the desired data type must be set to the value of the sign bit.  From there, everything above the size of the desired type can be discarded.  If you forget to move the sign bit down, your values will not overflow correctly, and you could end up with some really strange behaviors.

Because small data types can be difficult to work with, some programmers advocate never using data types smaller than the word size of the platform unless absolutely necessary.  Depending on the application, it is possible that using smaller data types will reduce CPU efficiency of a program significantly.  On the other side though, often memory is a bigger bottleneck than CPU efficiency, and in these cases, using smaller data types may be necessary for acceptable performance.  And of course, in applications where small data type operations can be parallelized, CPU efficiency can be dramatically improved by using them.  This is ultimately a judgement call that must be made when designing a program and writing the code.

My personal strategy is to use the smallest data type necessary for the particular use case, and if I find there are performance problems, then I will deal with it when it comes up.  In languages like C, increasing the data type of a variable to a larger size is trivial, but decreasing it can have unexpected consequences.  Of course, in assembly, changing the data type of a variable in either direction may require rewriting significant sections of code to use the appropriate instructions, so this may not be a feasible strategy with assembly.  The most important thing is knowing how to work with small data types when it is necessary though.

ARM Assembly: Basic Math

Programming is all about math.  Every program uses copious amounts of math.  Even a simple program that prints out a single line of text is using math somewhere, to figure out how long the text is and when it is done printing all of the characters.  While traditional programming languages may take care of some of this math for you, assembly language leaves most of the math to the programmer.  This means we need to learn how to do simple math in assembly, before we can really do much else.

The ARMv7 architecture has a large collection of math instructions.  It has several addition and subtraction instructions, and it has many different multiply instructions.  The ARMv7-R architecture also includes two division instructions.  Our Pi 2 uses ARMv7-A, which does not include division instructions; however, the specific model, the Cortex-A7, includes something called the Virtualization Extensions, which add those division instructions.  This means we can add, subtract, multiply, and divide with fairly simple instructions.

We will use GCC again for this one, because we don't know how to print stuff to the screen without it yet.  We really need to learn to do math before we can stop relying on the C libraries for things.

There are five math instructions we care about right now.  ARM has several times that many in total, but most of the others are for specific cases that are less common.  We will look at more of them later.  For now, we care about the add, sub, mul, udiv, and sdiv instructions.  These will allow us to add, subtract, multiply, and divide.  Notice there are two divide instructions, one for unsigned math and the other for signed math.  The other three operations only have one instruction each, because two's complement math just works out for them, without the need to distinguish between signed and unsigned values.

The syntax for add and sub is the same.  According to ARM's documentation the syntax looks like this:
ADD{S} Rd, Rn, <Operand2>
SUB{S} Rd, Rn, <Operand2>
It helps to know how to read the documentation.  The part before the {S} is the actual instruction.  The {S} is an optional suffix that means the instruction should update the condition flags, which we will discuss later.  Next is a list of what are essentially arguments.  Rd always represents the destination register.  Rn is just a register containing an input value.  <Operand2> can be a number of things.  The two most important are another register containing an input value and an immediate value of type #<imm8m> (a number that fits certain encoding requirements).  We will look at other things <Operand2> can be later.

The syntax for mul, udiv, and sdiv is different:
MUL{S} Rd, Rm, Rs
UDIV Rd, Rn, Rm
SDIV Rd, Rn, Rm
You will notice that mul has the {S} option, but the divide instructions do not.  As with the other instructions, Rd is the destination register.  Rn, Rm, and Rs are all just registers containing input values.

So now we know the syntax for the basic math instructions.  How do we actually use them?  Well, let's look at one more thing from the documentation, and then we will try it out.

The documentation describes the action of these instructions using mathematical notation, and understanding this can really help understand what an instruction is doing.  The five instructions above have their actions defined as the following:
ADD - Rd := Rn + <Operand2>
SUB - Rd := Rn - <Operand2>
MUL - Rd := (Rm * Rs)[31:0]
DIV - Rd := Rn / Rm
These pretty much all mean: the second and third operands are operated on, and the result is stored in Rd.  For subtraction and division, keep in mind that order matters (take a peek at the math notation for the RSB instruction, and compare it to SUB).  In the multiplication one, notice the [31:0].  This specifies that only the lowest 32 bits of the result are stored in Rd (which makes sense, given that Rd is a 32 bit register).  This is also true of addition, when the result is bigger than 32 bits.

Actually using these is pretty simple.  Let's say we want to add 12 and 14.  We would start by putting one of the numbers in a register.  Then we could use an add instruction with an immediate value for the other one.  Alternatively, we could just put both values in registers and add them.
mov r1, #12
add r0, r1, #14
This code will put the number 12 into register r1.  Then it will add the value (12) in r1 to 14 and store the result (26) in r0.  Note that r1 will still have 12 in it when this is done.  We could have written add r1, r1, #14, and it would have overwritten r1 with the result of 12 + 14 (26).  If you don't care about intermediate values, you can just reuse their registers like this, but if you need to keep an intermediate value, make sure you save the result somewhere else.

We can write a simple program that does this addition with both input values in registers like this:
.text
.global main
main:
    mov r0, #12
    mov r1, #14
    add r0, r0, r1
    bx lr
Since we don't need to store anything in memory, we will start with the text section.  Because we are using GCC to compile, we need to start with a main function.  Then, we put 12 in r0 and 14 in r1.  Next we add r0 and r1 and store the result in r0.  The last line, bx lr, returns from the function.  You may recall reading in a previous article that lr is the link register, where the return address is stored when we call a function.  main is just a function called by the C startup code, so the return address is stored in lr.  When we are done, we return with bx, which takes a register containing an address to go to.  So bx lr returns from main.  Once you are done, save your program as add.s.

Now we can compile the program.  Run gcc -o add add.s to compile.  Now run ./add to run the program.  The anticlimactic result is...nothing.  The program does not print anything to the screen, because we did not tell it to.  It did return something though.  Linux programs return an error code, and they do this by leaving the error code in r0 when they exit.  Notice our add instruction stored its result in r0, so the error code should be 26.  We can view this error code by running echo $?.  (Note that every program returns an error code, so this will only work if you run it directly after add.)  This should display the number 26.

From here the rest of the math instructions are fairly simple.  Multiplication and division require all operands to be in registers.  Subtraction works like addition.  Now let's write a bigger program that will print the values to the screen, instead of just returning one value as an error code.

Because we are compiling with GCC, we have free access to built-in C functions.  This means we can use printf() to output text to the console.  According to the documentation, the signature for this function is int printf(const char *format, ...);.   To call this function, we need to know a little bit about C calling conventions for ARM.  Right now, there are two important things we need to know.  The first is that the return value is always placed in r0.  This is less important than the second, which is that arguments are passed in r0 to r3.  If there are more than four arguments, additional ones are passed on the stack.  Since we have not learned this yet, we will stick to four arguments or less.  One more important thing to know is that arguments are ordered in the registers the same as they would be ordered in the function call in C.

Knowing this, here is what we can get from the function signature of printf():
  • When it returns, the return value will be in r0.
  • The first argument must be placed in r0 before calling printf().
  • The first argument is a pointer to a cstring.  Note that cstrings are null terminated.
  • The second argument must be placed in r1, the third in r2, and the fourth in r3.
  • We are going to avoid more than four arguments, because we have not learned to pass arguments on the stack yet.
Let's examine a simple program that prints out a single string using printf().
.data
string:
    .asciz "Print me!\n"

.text
.global main
main:
    push {r12, lr}
    ldr r0, =string
    bl printf
    pop {r12, lr}
    bx lr

Save this as print.s, then compile it with gcc -o print print.s.  Now you can run ./print, and it will display the text "Print me!" followed by a newline.

What, exactly, is going on here?  We start by defining a data section.  Programs are composed of several sections.  The text section is where the executable code goes.  The data section is where global variables and constants go.  There is also a section for uninitialized data, and there are special sections that can be used for other things.  In fact, we could have put our string in a special section for read-only data, but we won't worry about that right now.  In the data section we create a label, which is nothing more than a symbol for referencing a specific location in memory.  Then we put some data in the data section.  The .asciz directive tells the assembler that we want to create a null (or zero) terminated string.  Next, we define the string.  The assembler will transparently add a null character to the end when it assembles the program.  Later we will need to create strings that are not null terminated, and we will use the .ascii directive for that.

Once our data is in the program, we will write our code in the text section, in the main function.  Now, I mentioned earlier that when a function is called, it overwrites the link register with its own return point.  Since main() is a function, its own return location is currently stored in lr.  When we call printf(), this will be overwritten, and if we don't save it somewhere else, main() won't know where to return to.  The first line of our main() function does a few things for us.  First, it stores the contents of r12 and lr on the stack.  Then, it updates the stack pointer, decreasing it by 8.  We will look at exactly why it does this later.  Next, we load the address that string points to into r0.  This is the cstring pointer that printf() expects for its first argument.  We don't have any more arguments for printf(), so next we call it.  Because we are using GCC, we don't have to do anything special; it knows where to find printf() for us.  printf() stores its return value in r0 before it returns, but we don't really care about that.  (Note, however, that since we don't change r0 after this, whatever it left there will be the error code for our program.)  Now, we need to get the value from lr back off of the stack, so we use the pop instruction.  Now we can return, using bx on the link register.

There are a few things you may have noticed here.  First, we are pushing and popping two registers.  Why are we storing r12, when it really never gets used anywhere?  The answer is that printf() and many other C functions and external interfaces (like the OS) expect the stack to be 8 byte aligned.  We will talk about alignment later, but the important part here is that if we push just one register, the stack will be 4 byte aligned, so we are pushing an extra one to keep it 8 byte aligned.

Now that we know how to print stuff to the screen, let's do some subtraction!
.data
string:
    .asciz "%d - %d = %d\n"

.text
.global main
main:
    push {r12, lr}
    ldr r0, =string
    mov r1, #12
    mov r2, #7
    sub r3, r1, r2
    bl printf
    pop {r12, lr}
    bx lr

Save this as sub.s, then compile with gcc -o sub sub.s.  Now run ./sub.  Verify that the output is correct.

This is very similar to the previous program, except that we are having printf() display the values used in our math.  Now, we could have done the math in any registers, but I specifically chose the ones I did, to avoid having to rearrange things later for printf().  As before, we load the address of string into r0.  Then we put our operands into r1 and r2 in the order we want printf() to display them.  Lastly, we do our subtraction, placing the result in r3, as the fourth argument of printf().  Everything after this is identical to our previous program.  Being able to print out results like this is a major step in being able to debug larger programs, though once we let go of the C library, we will have to find other ways.

Multiplication will be left as an exercise for the reader.  Really, this should be trivial, given what you have already done.  Division will be discussed in a separate post, as there are some additional requirements to use the division instructions.

Monday, December 12, 2016

ARM Assembly: The ARM Architecture

The history of the ARM architecture is quite interesting, though we will only be looking at it very briefly.  ARM started with a company named Acorn Computers.  It initially produced simple computers, during the time when all computers were pretty simple.  Acorn Computers produced several machines that were successful in Europe, including the BBC Micro series.  The primary focus of the company was on processors though, and due to the unsuitability of other processors at the time, it decided to create its own processors, in the Acorn RISC Machine project, to go with the BBC Micro for more business oriented applications.  Early ARM processors compared quite favorably to early Intel processors, with the ARM2 outperforming the Intel 80286 at the same time as using less power.  In the end, the ARM processor line was turned into its own company, Advanced RISC Machines.  The unique thing about ARM as a company is that it does not manufacture processors.  It designs and licenses processor architectures for other companies to manufacture.  The consequence of this is that ARM processors are ubiquitous in the mobile device industry, because it is far cheaper to license designs and manufacture them than it is for companies to design custom processors themselves.  This is especially valuable in an industry that is constantly trying to push technological boundaries to keep up with competition.  In this business model, one company does all of the design work, often working with larger clients to include valuable new features, and the entire mobile device industry licenses and uses the designs to produce all sorts of different mobile devices.  In addition to mobile devices, there are some companies that are attempting to use the ARM architecture to produce server processors.  They believe that the lower power consumption and higher performance that may be possible with RISC processors could be very popular in a market that values processors that run cheaper and cooler but need plenty of processing power.

Instruction Sets

The specific part of the ARM architecture that is important to us is how the ARMv7 architecture works.  ARMv7 is broken up into 3 divisions.  There is an A series, an M series, and an R series.  The A series is general purpose.  The M series is mostly low power consumption processors designed for use in embedded applications and lower powered mobile devices.  The R series is designed for real-time embedded applications, where responsiveness and performance are extremely important.  We are mostly going to focus on the A series, with a little bit on the M series.  The main difference between A and M with respect to this series is that the M series only supports ARM's 16 bit Thumb/Thumb2 instruction set (possibly with support for the few 32 bit Thumb2 instructions), while the A series supports both the 32 bit ARM instruction set and the mostly 16 bit Thumb2 instruction set.

Later in this series we will also learn about another instruction set the processor in the Pi 2 supports.  This is the Neon/VFP instruction set that provides instructions for floating point math as well as parallel integer and floating point math.  We will discuss this more later.

Registers

Pretty much all processors need some sort of memory to operate on.  Modern processors use several levels of memory.  The lowest level is registers that each can store one value.  The size of the value is generally the word size of the processor.  The word size of the ARMv7 architecture is 32 bits, meaning that each register can hold a 32 bit value.

The ARMv7 architecture claims to have 16 general purpose registers, but this is not strictly true.  No fewer than three of these registers have dedicated purposes, and attempting to use them for anything else is likely to cause problems.  Some programs may use a fourth register for a dedicated purpose as well.

ARM generally names its general purpose registers as rx, where x is the register number.  Thus, ARMv7 processors have registers r0 through r15.  We never use any registers above r12 for general use though, nor do we use them by their designations r13, r14, or r15.  Instead, we use sp for r13, lr for r14, and pc for r15.

The pc register is the program counter or instruction pointer.  It contains the memory address of the next instruction to be run.  With a few exceptions, whenever an instruction is executed, the instruction pointer is updated to point at the instruction after it.  Modifying the instruction pointer will change where the program is executing.

The sp register is the stack pointer.  It contains the memory address of the last entry on the stack.  It is often used as a reference point for accessing data stored on the stack.  Modification of the stack pointer should be done carefully and predictably, because otherwise the top of the stack can be lost, which can be disastrous.

The lr register is used to store the return point for a function.  When a function is called (there are some instructions for this), the address of the instruction directly after the function call is stored in lr, so that the function being called knows where to return to when it is done.  Technically, if the address stored in lr is saved somewhere else and restored later, lr can be used as a general purpose register, though I have not found this to be terribly common.  The most important thing to remember about lr is that if you are going to call a function from inside of another function, you need to store the value in lr somewhere else first, because it is going to get overwritten.  Don't worry too much about this right now though, as we will discuss it again when we start calling functions.

In partially compiled C programs, you may also see an fp register.  This is the frame pointer register, and when used, it is r11.  The frame pointer can be really handy for certain kinds of debugging, and it can be used as an alternative reference point to sp; however, it is not strictly necessary.  In fact, you will generally only find it in unoptimized, partially compiled C code, as any level of optimization removes it.  We won't be using it, but it is good to be aware of, as you may see it in assembly code generated by GCC.

Memory

One fairly unusual thing about ARM is that it does not have any instructions that directly operate on data in memory.  In fact, the only memory access instructions ARM has are a few for loading data from memory into registers and a few for storing data from registers into memory.  Most modern processor architectures allow at least a few instructions besides load and store instructions to access data in memory directly.  ARM does not, and this means that if you want to operate on a value stored in memory, you must first load it into a register, and if you want to put the result of an operation in memory, you have to store it there after the operation is complete.  This keeps the instruction set fairly simple, and despite the fact that it means you have to use more instructions when accessing data in memory, ARM processors still have excellent performance.

The general workflow in 32 bit Intel processors involves a lot of memory accesses, because they have a very limited number of registers (8 general purpose registers, except that some have special purposes as well, which often limits their usefulness).  Consequently, instructions that can access memory directly have been seen as essential to good performance.  ARM, however, has plenty of registers, which means that intermediate values rarely ever have to be stored in memory.  Thus the value of instructions that access memory directly is very limited.  Historically, this has been such a significant strong point of ARM that it has been able to outperform Intel and other processors with significantly lower power consumption, despite the fact that RISC processors are defined by the fact that they maintain smaller instruction sets of simpler instructions.  In fact, this is such a potent strength that modern Intel 64 bit architectures have doubled the number of registers from 8 to 16 (comparatively though, 64 bit ARM processors now have 31 general purpose registers, and the pc and sp registers are dedicated registers, meaning that it more than doubled the number of usable registers).  The general workflow with ARM programs should take advantage of the copious amount of general purpose registers to avoid unnecessary memory accesses.

Peripherals

Peripherals won't matter much here with respect to processor architecture, because we will be accessing them through the operating system, not directly.  That said, it is still worth spending a paragraph on.

Peripherals for ARM processors are memory mapped, which is the most common way of handling peripherals.  Memory mapped peripherals are assigned to unused memory addresses (the Pi 2 processor can address up to 1TB of memory but only has 1GB, which leaves a huge number of addresses free).  In many systems, peripherals are hard coded to specific memory addresses, but in some (desktops and laptops, for example), they are mapped by the BIOS during boot based on a number of factors.  The Pi has its memory mappings hard coded, since its hardware is not changeable.  A peripheral is controlled by reading and writing the memory addresses mapped to it.  The catch here is that this is all managed by Linux on the Pi, and we don't have direct access to the device's physical memory.  We access memory through a memory manager controlled by the operating system.  We will look at this more later, but the important thing is that no matter what memory we attempt to modify in our programs, we cannot directly touch memory that is mapped to peripherals.  We must go through the operating system.  This is partly for security reasons and partly because an operating system running multiple processes simultaneously has to implement some kind of access control to keep programs from stepping on each other's toes.


Now that we have a basic understanding of the processor architecture in our Pi 2, we can get back to writing assembly code!

Wednesday, November 23, 2016

ARM Assembly

Preface

I recently designed a course on the subject of ARM assembly programming, which I am now teaching.  During the design of the course, I spent a great deal of time struggling to find information on certain topics within the subject.  I found a distinct lack of good resources on many things.  The biggest problem was finding information about the interface between assembly level code and the operating system.  On one occasion I even taught my students something totally wrong, because I had been forced to learn it myself through reverse engineering, and it turns out that the operating system and the debugger don't behave the same way.  Of course, I corrected myself two days later, and I used the incident as a teaching opportunity.  What I learned from all of this is that assembly programming, especially for the ARM architecture, is very poorly documented at the interface between OS and user space programs.  I am writing this series with an aim to fix this problem.


Introduction

This series will teach ARM assembly programming, using ARMv7 assembly running on the Raspberry Pi 2.  The Pi 3 should also work fine for the vast majority of this; some parts, however, will not work on the original Pis.  I will do my best to point these out where I am aware of them.

This series assumes that if you have a Raspberry Pi, you either already have everything you need to use it, or you can figure out what you need and obtain it.  The essential things you will need are internet access for your Pi, at least briefly for the first chapter, and some way to input text and view text output.  This can be a USB keyboard and a monitor that will work with the Pi, it can be an SSH connection to the Pi over a network, or it can be a serial to USB adapter that gives you direct terminal access through the GPIO pins (Adafruit sells these).  There are plenty of resources online for each of these methods.

The operating system used for this series is Raspbian.  At least one of my students is using Arch Linux successfully, so that may also be a viable option.  Raspbian can either be installed using the NOOBS installer that comes on most SD cards sold specifically for the Pi, or you can download a Raspbian image and install it directly.  The second option is recommended, as it will give you the most recent version of Raspbian.  It may also help you avoid certain issues some of my students have run into with NOOBS installs when attempting another assembly project that is currently outside the scope of this series.

Additionally, this series expects you to already have a basic understanding of the concepts of processor registers and assembly instructions.  I am sure many people will be able to figure these things out as they go along, but if you start this without that understanding and find yourself struggling to understand even the basic concepts, perhaps you should take a break and spend a little bit of time learning about modern processor architectures and how they work.  (I will probably write a short article on this in the future, but right now this series takes higher priority.)


Table of Contents

This will be populated with links to each chapter as they are published, in order of publication.
  1. Setup
  2. The ARM Architecture
  3. Basic Math
  4. Math with Small Data Types
  5. Type Casting
  6. Advanced Integer Math
  7. Memory Architecture
  8. Integer Division
  9. Ditching GCC
  10. The Stack
  11. Pointers and Global Memory Access
  12. Function Pointers
  13. Branches
  14. Conditional Branching
  15. Loops
  16. Indexing Modes
  17. Arrays and Structs

Index

This will be populated with links to the chapters sorted by topic.

ARM Assembly: Setup

Setup

Before we can start writing assembly programs, we need to do some basic setup.  If your Pi is not already set up and running, stop here and do that.  Once it is running, log in to the terminal.  You won't need the graphical interface for any of this, and it is honestly easier to just work completely in the terminal for most of the chapters in this series.  If you are an experienced user and you feel you cannot do without the GUI (both of these in one seems unlikely), feel free to use it, but understand that I won't be, and this series will assume that you are working purely from a terminal as well.

Once you have logged into a terminal, you will need to check which version of GCC is installed.  This is important, because the assembler that comes with versions of GCC older than 4.8 does not support some ARMv7 instructions that we will be using on our Pi 2.  (The original Pi processor is ARMv6, so it does not have these instructions.)  Check your version of GCC by running gcc -v.  This will spit out a lot of information, with the version information at the bottom.  My Pi 2 says its GCC version is 4.6.3.  This is what it came with.  If your version is lower than 4.8, you will need to run sudo apt-get install gcc-4.8.  Note that this will not change the default version.  If you run gcc -v again, it will still report the old version.  For compatibility reasons, installing GCC 4.8 won't replace the old version.  Now though, you should be able to run gcc-4.8 -v, and it will give you the version information for the new GCC you just installed.  If you want to compile with GCC 4.8, you will have to run gcc-4.8 instead of gcc.  Installing GCC 4.8 does replace the assembler though, which is all we really care about here.

I should add a note here, if the version of GCC says 4.8 or higher (recent versions of Raspbian now come with 4.9.2), then you can just use gcc, instead of specifying the version.

Writing a Simple Program

This part is not so much about learning assembly as it is about learning to use a terminal editor.  You have two options by default: Nano and Vim.  In my opinion, Vim is the better of the two, but it has such a steep learning curve that I am not even going to try to teach you how to use it (besides that, I am not even that proficient with it yet).  If you prefer Emacs, it is not installed by default, but you probably already know how to install it.  I am going to use Nano for this series.  You can run nano in the terminal to start Nano and create a new file, or run nano filename to open an existing file.  When you are in Nano, you will see a list of controls at the bottom.  The important ones are Ctrl+R to open a file, Ctrl+O to save a file, and Ctrl+X to exit (I have never actually used Ctrl+R...).  You can also use Ctrl+W to search, Ctrl+K to cut a line of text, and Ctrl+U to paste the last text that was cut.  If you expect to use these last few, I would suggest experimenting with them a bit to get familiar with how they work.

Let's start by writing a simple program that compiles with C.  You will probably want to create a directory to do all of this in.  I'll leave that up to you (in Linux, use the command mkdir directory_name to create a new directory in the current one, and use cd directory_name to move into that directory; if you did not know this already, you may want to find a Linux command line tutorial before continuing).  Once you are in your new directory, start Nano.  I ran nano num.s to create a new file named num.s.  (The file extension for assembly files is ".s".)  You can instead specify the file name when you save the file for the first time (Ctrl+O), if you prefer.

The first thing we need in the file is a line containing the text .text.  The .text section of your program holds the executable code, and it is generally called the text section (less commonly, the "code section").  Since we are compiling with GCC, we need a "main" function.  Assembly does not have functions in the same sense as other languages, though we generally treat it like it does.  Instead, assembly has labels, which are references to memory addresses.  These can be the addresses of data or of instructions.  To create a "main" function, we need to do two things.  First, we need to create a label named "main" that references the instruction in our code where the program should start.  This is done by typing main: on its own line.  The second thing we need to do is make our "main" label visible outside of our assembly file (so that the C library can see it; more on why this is necessary later).  Directly above the "main" label, add the line .global main.  In the future, we will type the global directive before we create the actual label.

At this point, your program should look like this:

.text
.global main
main:

Now, the C compiler knows where our program starts (the instruction directly following the "main" label).  Next we need to write the actual code.  To keep things simple, our program is going to return an error code.  The convention is for functions to put their return values in the register r0.  When we return from main, the value in r0 is going to be used as the error code returned by our program.  Normally you don't get to see what error codes are returned by programs, but I'll show you a way to do it when we get there.

To put a value in r0, we will use the mov instruction.  This instruction can copy a value from one register to another, or it can take an immediate value (a number encoded in the instruction) and put it in a register.  There are some limitations to this, but we don't need to worry about them yet.  Underneath the "main" label, press tab once to indent, and then type mov r0, #27.  This will put the value 27 in the register r0.  Immediate values must generally be preceded by a # symbol.
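For the curious, the main limitation comes from how classic ARM data-processing instructions encode immediates: the constant must be expressible as an 8 bit value rotated right by an even amount.  Here is a sketch of that rule in C (the function name is mine, not part of any toolchain; ARMv7 also has a separate movw instruction that allows any 16 bit constant):

```c
#include <stdint.h>

/* Returns 1 if v fits the classic ARM data-processing immediate
 * encoding: an 8 bit value rotated right by an even amount
 * (0, 2, ..., 30).  Checking is easiest in reverse: rotate v LEFT
 * by each even amount and see if the result fits in 8 bits. */
int is_arm_immediate(uint32_t v) {
    for (int rot = 0; rot < 32; rot += 2) {
        uint32_t rotated = rot == 0 ? v : (v << rot) | (v >> (32 - rot));
        if (rotated <= 0xFFu)
            return 1;   /* encodable as imm8 rotated right by 32-rot */
    }
    return 0;
}
```

For example, 27 and 0xFF000000 pass this test, while 8888 does not, because its set bits span 11 positions and so cannot fit in any rotated 8 bit window.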

Now that we have put our return value in r0, we can return from the main function.  This is done with a special branch instruction that takes a memory address to branch to.  By default, the program will run one instruction after another sequentially.  Branches are instructions that tell the processor to go to some other instruction and continue from there.  This is how we do things like calling functions and skipping instructions in false if statements.  We also use a branch instruction to return from a function.  In this case we will use the bx instruction, which takes a register that contains the memory address of the instruction we want to go to.  We will discuss ARM registers more in a moment, but for this instruction all we need to know about is the lr register, which contains the address for the return point when a function is called.  To return from the main function, we will use the instruction bx lr.  This will go back to whatever code C used to call the main function, allowing C to shut down anything it did before the program exits.

Our program should now look like this:
.text
.global main
main:
    mov r0, #27
    bx lr

All this program does is return the exit code 27.  Now we need to save our program (Ctrl+O) and exit Nano (Ctrl+X).  At the command line again, run gcc num.s.  This will compile our program and create an executable with the default name a.out.  We can run our program with ./a.out.  Disappointingly, it does nothing.  Like I said before, we don't normally get to see the error codes.  We can display the error code of the last program that ran by running echo $?.  This should print out the number 27.  (Note that if you run echo $? again, it will print out 0, because that is the exit code of the previous echo command, which succeeded.)  You can edit your program and change the number put into r0, but there are two things to keep in mind.  First, negative numbers won't work.  The error code is an unsigned byte, so a negative number will print out as its unsigned representation.  Second, an unsigned byte can only hold values from 0 to 255 inclusive.  If you put in a number outside of this range, only the unsigned value of the lower 8 bits will be printed.  Additionally, due to limitations we will discuss later, some numbers outside of this range won't even assemble.  If the compiler gives you an error like "invalid constant (8888) after fixup", it means that the number you are trying to put in the register cannot be encoded in the instruction you are using, so you will have to find another way.

The main point of this exercise was to become familiar with the editor and make sure everything works the way we expected.  An important thing to note is that we did not use the GCC 4.8 that we installed earlier, but we did use the assembler installed with it.  From here on out, most of our programs will not be compiled with GCC.  GCC adds some overhead to pretty much everything it compiles.  It adds startup and shutdown code that we really only care about if we are using C functions.  So, for the most part, the only time we will use GCC to compile our programs is when we are using C functions.

Thursday, August 25, 2016

Video Game Development: Multi-threading

Multi-threading is a complex enough subject to warrant an entire course on its own, so we are not going to go into extensive detail here.  Multi-threading does have some very valuable benefits in the realm of game development though.

Multi-threading allows different parts of a program to run at the same time.  In modern multi-core computer systems, this can mean managing the game physics on one core, rendering graphics on another core, and handling audio on a third.  It could also mean updating one game unit on one core and another on a different core, though doing this is significantly more complex.

There are two things that are essential to understand when discussing any kind of parallel programming.  The first is concurrency.  Concurrency is when the program is doing different tasks at the same time.  The second is parallelism.  Parallelism is when the program is doing the same thing to different sets of data at the same time.  An example of concurrency with respect to video games is rendering graphics in one thread while processing physics in another.  The two tasks are different, but they are happening at the same time.  This can often be beneficial, even on systems that do not have multiple processor cores, because graphics rendering spends a lot of time waiting for the video card to be ready for more data.  During that wait time, other processing, like game physics, can take place without slowing down the rendering.  Parallelism requires hardware designed for multi-threading, like multi-core processors.  With parallelism, a set of data is broken up into smaller sets, and these sets are worked on at the same time, using the same algorithm.  For example, if my game has 100 game entities that need to be updated, instead of updating each one, one at a time, I could break the list into two lists of 50 each.  Then, I could process 50 entities in one thread and 50 in another, at the same time.  This could reduce the time required to update all of the entities by up to 50% (in practice, the benefit is a bit smaller than this, due to overhead of multi-threading).

In games, we generally stick to concurrency, due to certain issues with multi-threading.  When multi-threading we have to share memory between threads to allow communication.  In a game, we might be sharing animation frame data between the rendering thread and the physics thread.  When an entity updates, the physics engine might also update that entity's animation frame.  For example, a character in the game might be walking, and when the character steps forward, the physics engine sets the animation frame to show the step being taken.  The rendering thread also needs to use this data when it actually renders the animation frame on the screen.  So, what happens if the rendering thread starts to render the animation frame, but then halfway through, the physics thread changes it?  The result is that we have the top half of the frame as one image and the bottom as another, and we can probably see a horizontal seam in the middle where the change happened (this is called tearing, though it does not generally happen in this way).  You can also run into problems where two threads load something from memory at about the same time, both change it, and then both write it back.  This results in what is called a "race condition," because whoever writes second wins.  An example of this is a variable that is incremented each game loop, by two different threads.  If the variable is an 8, and then both threads load it (both getting 8s), and then both increment it (now both are 9s), and then both write it (still 9s), the result will be a 9, even though it should be a 10, because it was incremented twice.  These bugs can be incredibly hard to find, and there is no easy way to prevent them, except not sharing anything between threads, which defeats the point.

There are a lot of complications with multi-threading, and they all come down to making sure that memory accesses are carefully managed to avoid unexpected problems.  This is usually done by "locking" memory accesses.  A function that needs to modify some memory will lock that memory, read it, modify it, and write it, and then unlock it, and since it is locked while in use, nothing else can use it.  These locks take time to set and remove though, which reduces the efficiency of the program and can even eliminate the benefits of multi-threading if too many have to be used.  The best strategy is to avoid the need for locks in the first place, and minimize their use when they are absolutely necessary.  Even with locks though, you can run into issues.

This is a very complicated topic.  More information can be found on Wikipedia and other resources on the internet.  The takeaway here is that multi-threading can be extremely valuable in optimizing video game performance, but it requires a great deal of care and complete understanding to avoid running into very difficult problems.

Video Game Design: Player Generated Content

Player generated content is anything creative within a game that was created by one or more players.  It generally requires a high level of flexibility within the game, even if that flexibility is only within a limited scope.  For example, allowing players to design their own flags in a game would be player generated content.  Of course, it can also be far broader than this.  Games like Minecraft and Terraria allow players to build custom structures and even complex logic constructs from basic building blocks.  Most Blizzard real-time strategy games (including the Starcraft and Warcraft series) include map editors that allow players some degree of ability to create custom maps and other custom in-game content.  This allows for large scale player generated content far beyond anything the developers could imagine (and they can imagine a lot).

Player generated content can also go outside the game.  For example, there is an entire YouTube series built around the Halo games.  They use video captures from the games, and then they add voices.  There are also millions of "Let's Play" videos for all sorts of games, with Terraria and Minecraft being some of the most popular, which also qualify as player generated content.  Even written fan fiction can be a form of player generated content.

Player generated content serves several purposes.  The first is to give a game increased replay value.  Replay value is the value a game has in continuing to play it even after you have beaten it.  Some games add replay value by adding a multiplayer mode so players can play with their friends.  Others do it by randomizing maps and adding such huge amounts of content that no one can experience it all the first time through the game.  Player generated content can add almost unlimited replay value to a game though.  When players are able to add their own content to a game, they can create replay value themselves.  A few flexible options can turn into practically infinite combinations, and that helps a game to avoid ever really getting old.

Player generated content also helps the game creators more directly.  You could create a game with huge amounts of content, or you could give your players the ability to make content for you.  Spend some time looking for epic things built in Minecraft, and you will find an enormous amount of free advertising for Mojang.  When people make mods for Minecraft, they are creating their own content that players can optionally add to their game.  If you start to get bored with Minecraft, just add a few mods.  Even the 5 most popular mods offer more content than a single person could ever completely experience, and Mojang, the creator of the original game, did not have to make any of it.  Some mods add more content than the base game itself contains, and some people assemble packages of many mods put together.  All of that keeps people playing and brings new people into the game, at no cost to the developers.  In short, player generated content is essentially free labor for the game creator, and even though the creator does not legally own the player generated content, it does not matter, because it benefits them just as much as if they did.

Allowing player generated content in a game can be tedious and difficult, but it is generally well worth the time and effort.