Monday, March 6, 2017

ARM Assembly: Memory Architecture

One of the biggest strengths of the ARM architecture is that it has a lot of registers.  This allows it to avoid wasting a lot of time waiting for memory transactions.  Any serious program is going to have to use memory though.  Before we learn how to use memory, we need to learn about the memory architecture of the Raspberry Pi (which is very similar to most modern systems).

The first thing we need to learn about is virtual memory.  Nearly all modern systems use memory virtualization for a number of reasons.  The first is security.  Early computers allowed programs to access memory directly, but this led to major security problems once multiple programs could run at the same time.  A program could easily look at and change the memory used by another program.  A background program could watch memory being used by another program to capture a username and password, and then report that information somewhere another user had access to, allowing a nefarious user to easily hack other users' accounts (in modern applications, this could allow a program to see credit card information entered into a web page).  The second problem is a matter of memory addressing.  When multiple programs have direct access to the same memory space, it can be difficult for each to know what memory is being used by the others.  This is a bad situation when 20 or more programs could be running at the same time (as is common in modern computers).

Protected memory was created to deal with these two problems.  Protected memory is a processor level technology that allows the operating system to run at a different access level than other processes.  When the processor starts up, it is in unrestricted mode.  The operating system starts up with direct memory access, as well as direct access to all other hardware.  Once the OS has started, it starts up other programs in a restricted mode.  When the OS starts a program, it sets up the program's memory address space in the memory management unit (MMU) within the processor.  When the program attempts to access a memory address, the CPU consults the MMU to determine where that address points to in physical memory.  This is often called virtual memory.  On a 32 bit system, the program sees 4GB of memory space that it can access.  This space holds the program's code and data, stack space, global memory, heap space, linked libraries, and some space reserved for the kernel to keep track of metadata about the program.  While the program sees all of this memory space, it cannot actually use all of it.  On the Raspberry Pi, this should be obvious, as it has only 1GB of memory.  When a program starts running, it can access (read and write) its allocated stack space and global memory, and it can read (and write in some systems but not others) the text area, where the executable code is.  If it tries to read or write any other area of memory, including the reserved kernel space, it will fail with a segfault.  If the program wants more memory, it must ask the operating system for it.
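To see that each process really gets its own view of memory, here is a minimal sketch in C (used here instead of assembly because it is easy to check on any Linux machine).  The names `private_value` and `isolation_demo` are made up for this illustration.  The child process created by `fork` sees the global at the same virtual address as the parent, but writing to it does not change what the parent sees.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* A global; parent and child both see it at the same virtual address. */
static int private_value = 1;

/*
 * Forks a child that overwrites the global, then reports what the parent
 * still sees.  Returns the parent's copy of the value, or -1 on error.
 * If child_value is not NULL, it receives the value the child stored.
 */
int isolation_demo(int *child_value)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        private_value = 99;        /* changes only the child's copy */
        _exit(private_value);      /* report it through the exit status */
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    if (child_value != NULL)
        *child_value = WEXITSTATUS(status);

    return private_value;          /* unchanged in the parent */
}
```

Calling `isolation_demo` from a small `main` and printing both values should show 1 for the parent and 99 for the child, even though both used the same address.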

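In Linux, memory is mapped to programs as follows, from the highest address down to the lowest: kernel space, stack, memory mapping segment, heap, BSS, data, and text, with gaps between several of them.  The kernel space is the space reserved by the kernel for metadata.  On 32 bit Linux it is typically 1GB; in 32 bit Windows it is 2GB by default.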

The stack is used for storing local variables.  It grows downwards.  The stack has a maximum size limit, often 8MB.  This means that it is not suitable for storing large amounts of data.  It is mostly pre-allocated though, which means that it is readily available.
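The direction of stack growth can be observed from a program.  The following is a rough sketch in C, assuming GCC or Clang (the `noinline` attribute is a compiler extension) and an unoptimized or lightly optimized build.  The function names are hypothetical.  It compares the address of a local variable in a called function with the address of a local in its caller.

```c
#include <stdint.h>

/*
 * Returns 1 if this function's local variable sits at a lower address
 * than the caller's local whose address is passed in.
 */
static __attribute__((noinline)) int callee_is_lower(uintptr_t caller_addr)
{
    volatile int local = 0;
    return (uintptr_t)&local < caller_addr;
}

/* Returns 1 if the stack grows downward on this system, 0 otherwise. */
int stack_grows_down(void)
{
    volatile int marker = 0;
    int lower = callee_is_lower((uintptr_t)&marker);
    marker = lower;        /* keeps marker alive across the call */
    return marker;
}
```

On Linux, on both ARM and x86, `stack_grows_down()` is expected to return 1, because each new function call places its frame below the previous one.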

The memory mapping area is used for two things.  One is memory mapped files.  This includes dynamically linked libraries as well as files that are mapped to memory explicitly by the program, so it can access the file as if it were just memory addresses.  The benefit of mapping dynamically linked libraries this way is that multiple programs can have their virtual memory mapped to the same physical memory, allowing the operating system to load a single copy of each library for multiple programs to use.  The second thing this area is used for is anonymous mappings, which is essentially like mapping a file to memory without actually having a file.  We will discuss this more later, but this is one method of getting access to more memory from the OS.  This section grows downward.
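As a small preview of anonymous mappings, here is a hedged sketch in C using the standard `mmap` and `munmap` system call wrappers.  The wrapper names `map_anonymous` and `unmap` are invented for this example; only `mmap`, `munmap`, and the flags are real.

```c
#define _DEFAULT_SOURCE   /* exposes MAP_ANONYMOUS on glibc */
#include <stddef.h>
#include <sys/mman.h>

/*
 * Asks the kernel for `length` bytes of zero-filled memory that is not
 * backed by any file.  Returns NULL if the request is refused.
 */
void *map_anonymous(size_t length)
{
    void *p = mmap(NULL, length, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Gives the mapping back to the kernel.  Returns 0 on success. */
int unmap(void *addr, size_t length)
{
    return munmap(addr, length);
}
```

The memory returned by `map_anonymous(4096)` lands in the memory mapping area, starts out filled with zeroes, and can be read and written like any other memory until it is unmapped.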

The heap is the other method of getting access to more memory.  When the program starts, no heap memory is allocated.  The program must ask the OS to increase the size of the heap, which grows upward.
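The upward growth of the heap can be seen with `sbrk`, the traditional Unix call for moving the end of the heap (the "program break").  This is only a sketch in C: `grow_heap` is a hypothetical helper, and real programs normally let `malloc` do this for them.

```c
#define _DEFAULT_SOURCE   /* exposes sbrk on glibc */
#include <stdint.h>
#include <unistd.h>

/*
 * Grows the heap by `bytes` and returns how far the program break moved
 * upward, or -1 if the OS refused the request.
 */
long grow_heap(long bytes)
{
    void *old_break = sbrk((intptr_t)bytes);   /* returns the previous break */
    if (old_break == (void *)-1)
        return -1;

    void *new_break = sbrk(0);                 /* queries the current break */
    return (long)((uintptr_t)new_break - (uintptr_t)old_break);
}
```

In a single threaded program, `grow_heap(4096)` should return 4096, showing that the new end of the heap is at a higher address than the old one.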

The BSS and data areas are global memory allocated when the program starts up.  The difference is that the data area contains variables that have been initialized.  In other words, they have values in them when the program starts.  The BSS area has uninitialized global variables, which are generally filled with zeroes when the program starts.
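The difference between the two areas can be illustrated with a few globals.  This is a minimal sketch in C with made-up variable names; the placement described in the comments is what a typical Linux toolchain does, not something the language itself guarantees.

```c
#include <stddef.h>

int initialized_counter = 42;     /* has a starting value: goes in data */
int zeroed_counter;               /* no initializer: goes in BSS */
static char zeroed_buffer[4096];  /* also BSS: takes no room in the file */

/* Returns 1 if the BSS variables above currently hold only zeroes. */
int bss_is_zeroed(void)
{
    for (size_t i = 0; i < sizeof zeroed_buffer; i++) {
        if (zeroed_buffer[i] != 0)
            return 0;
    }
    return zeroed_counter == 0;
}
```

Because the BSS only needs a size rather than stored contents, the 4096 byte buffer adds almost nothing to the executable file, while `initialized_counter` has its value 42 stored in the file itself.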

The text segment is one of the most important parts of the program.  It contains the executable code.


Now let's look at the spaces.  Most of these spaces represent random offsets.  Bugs in programs sometimes allow malicious code to be injected.  When this happens, it is very convenient to the malicious code when memory addresses of the various sections all start at the same place.  Modern operating systems make it more difficult for malicious code to affect buggy programs by randomly picking starting addresses for the stack, the memory mapping segment, and the heap.  There are two other spaces, one between the heap and the memory mapping segment, and one at the beginning of the program.  The one between the heap and memory mapping segment is generally the largest, and it leaves room for those sections to grow.  On the Raspberry Pi, it is impossible for them to meet, because the Pi will run out of memory before that ever happens.  On a system with more memory though, if those two sections meet, the OS will refuse the request to allocate more memory.  A robust program will deal with this gracefully.  The gap at the beginning of the program exists for another reason.  Most of the reason is due to certain performance gains in an ancient system the starting address was borrowed from (on 32 bit x86 Linux this gap is roughly 128MB; on ARM Linux the text segment typically starts at a much lower address).  There is, however, a very important reason the text segment cannot start at 0.  This reason is that 0 is used to indicate a null pointer, and if the text area started at address 0, then the null pointer would point to a valid memory location.
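One way to watch the random offsets is to print an address from each region and run the program several times.  The sketch below is in C, and `print_layout` is a hypothetical helper.  With address randomization enabled, the stack and heap addresses should change from run to run; the text, data, and BSS addresses change only if the program was built as a position independent executable.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

static int data_var = 1;   /* lives in the data area */
static int bss_var;        /* lives in the BSS area */

/*
 * Prints one address from each region.  Returns the number of regions
 * printed, or -1 if the heap allocation failed.
 */
int print_layout(void)
{
    int stack_var = 0;
    void *heap_ptr = malloc(16);
    if (heap_ptr == NULL)
        return -1;

    printf("text  %p\n", (void *)(uintptr_t)&print_layout);
    printf("data  %p\n", (void *)&data_var);
    printf("bss   %p\n", (void *)&bss_var);
    printf("heap  %p\n", heap_ptr);
    printf("stack %p\n", (void *)&stack_var);

    free(heap_ptr);
    return 5;
}
```

Calling `print_layout()` from `main` and comparing the output of two runs gives a quick picture of which regions were moved.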

Aside from this, there are a few other segment types that can be used, though none of them are very important for this system.


Like most modern systems, the Raspberry Pi uses memory mapped hardware.  Like many 32 bit processors since the mid-90s, the processor on the Pi can address more than 4GB of physical memory through an address extension (PAE on x86, LPAE on ARM); with LPAE, that is up to 1TB (yes, the 4GB memory limit commonly associated with 32 bit systems came largely from desktop versions of Windows, not from limitations of the processors; case in point, I have a 32 bit Xeon server that was running Windows Server 2003 Enterprise (essentially XP with a few extra features, like access to all of your memory) with 6GB of memory and full access to all of it; it is now running Linux, with full access to all 6GB).  With roughly 1,000 times the address space needed for the amount of memory it has, it just makes sense to use the memory system for accessing peripherals.  So, access to the GPIO pins, USB, Ethernet, audio, video, and everything else happens by accessing physical memory addresses.  Unfortunately (or, perhaps, fortunately) programs are limited to their virtual memory space, which is not mapped to this hardware.  The kernel, however, has direct access to this memory space, so programs must interact with hardware through the kernel.


Some of this information is academic, unless you end up programming embedded systems, operating systems, or drivers.  The important things to remember are that the stack grows down and the heap grows up, and the stack, the heap, and the memory mapping segment start at random addresses.  Also keep in mind that when a program accesses a memory address, it is not accessing that address on the actual memory of the Raspberry Pi.  It is essentially using a lookup table to find the data in physical memory.  And, of course, while the Pi's hardware is memory mapped, you won't be able to access it directly.  Instead, you will have to do that through the operating system.

ARM Assembly: Advanced Integer Math

The ARM architecture provides some additional capabilities in its math instructions that allow multiple operations to be done in a single instruction.  This also allows for using faster instructions for certain math operations when one of the operands is a constant.  Knowing how to use these instructions to their fullest will allow you to produce optimal programs.

In a previous article, we saw that some of the math instructions have an operand listed in the documentation as <Operand2>.  This is something of a multi-purpose operand.  As we saw, it can be a register that holds a value, and it can be an immediate value that fits some constraints.  It can also be a register, where the value in the register is shifted left or right by a certain number of bits.

You probably already know that arithmetic shifts can be equated with multiplication and division.  Since a logical left shift is the same as an arithmetic left shift, ARM just uses the term "logical shift left" and the syntax LSL to refer to it.  Keep in mind that logical right shifts should be used for unsigned math, while arithmetic right shifts should be used for signed math.  Shifting right is equivalent to division by a power of 2 (where the number of bits shifted is the exponent), and left shifting is equivalent to multiplication by a power of 2.  We can use add and subtract instructions with shifts in the last operand to multiply by some values that are close to powers of 2.  Why would we not just use a MUL instruction?  While timing of instructions is processor specific, and this particular processor does not seem to have good published data on that timing, it is generally safe to assume that multiplication and division instructions are significantly slower than addition, subtraction, and bit shifts.  The MUL instruction is best used when the second operand is not constant or when the equation cannot be reduced to a shifted ADD or SUB instruction.

The first example is multiplication by 4.  We won't even need to use ADD or SUB for this one.  The MOV instruction has a flexible operand as well.  (Note that ARM does not have any dedicated shift instructions.  It does have pseudo instructions for shifts, but the machine code generated just uses MOV instructions with shifts.)  To multiply a value in r0 by 4, storing the result in r1, use the following:
mov r1, r0, LSL #2
This will multiply the value in r0 by 4, and then store it in r1.  With this, we can multiply by any power of 2 up to 2^31 (LSL accepts shift amounts up to 31), keeping in mind that any bits shifted past the top of the 32 bit register are lost.  If we wanted to multiply by 5, we would use an ADD instruction:
add r1, r0, r0, LSL #2
This adds the value in r0 to the value in r0 multiplied by 4.  In math, it might look like this: r1 = r0 + (r0 * 2^2), which is the same as r1 = r0 + (r0 * 4) or r1 = r0 * 5.  Using the ADD instruction, we can multiply by one greater than any power of 2.  So, what can we do with SUB?  We cannot use SUB directly for this, because the operand that can be shifted is the one that is being subtracted.  This could allow us to negate a value and then multiply it by one less than a power of 2.  We rarely need to negate a value, and then multiply that by one less than a power of 2 though.  More likely, we will need to multiply the value by one less than a power of 2 without first negating it.  There is a special subtraction instruction that allows us to do just that.

With addition, order does not matter, but it does matter with subtraction, which is why ARM has provided the RSB instruction.  This does reverse subtraction.  If the math for the SUB instruction looks like r0 = r1 - r2, the math for RSB looks like r0 = -r1 + r2.  If this does not make sense, consider that in normal subtraction, we subtract the second operand from the first.  In assembly, the second operand is the flexible one.  This means we can take a value in a register and subtract an immediate or shifted value from it.  We cannot start with an immediate or shifted value, and subtract a register from it, without some extra instruction.  The RSB instruction allows us to do this without any extra instructions.  So, let's multiply a value by 7, using RSB.
rsb r0, r1, r1, LSL #3
In math, we are doing this: r0 = (r1 * 2^3) - r1 = (r1 * 8) - r1 = r1 * 7.  When we want to multiply by a constant that is 1 less than a power of 2, we will typically use the RSB instruction to do that.
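To double check the arithmetic behind these shifted instructions, each one can be modeled in C.  This is only a sketch: the function names are made up, and each body mirrors the instruction named in its comment using unsigned 32 bit math, which wraps around the same way the registers do.

```c
#include <stdint.h>

/* mov r1, r0, LSL #2       ->  r1 = r0 * 4 */
uint32_t times4(uint32_t x)
{
    return x << 2;
}

/* add r1, r0, r0, LSL #2   ->  r1 = r0 + r0 * 4 = r0 * 5 */
uint32_t times5(uint32_t x)
{
    return x + (x << 2);
}

/* sub r1, r0, r0, LSL #2   ->  r1 = r0 - r0 * 4 = r0 * -3 */
uint32_t times_minus3(uint32_t x)
{
    return x - (x << 2);
}

/* rsb r0, r1, r1, LSL #3   ->  r0 = r1 * 8 - r1 = r1 * 7 */
uint32_t times7(uint32_t x)
{
    return (x << 3) - x;
}
```

For an input of 6, these return 24, 30, -18 (when the result is read as a signed value), and 42.  The `times_minus3` case shows why plain SUB gives the negated product and why RSB is needed for the positive one.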

Division is not quite as flexible, as just adding or subtracting another instance of the original value will not increase or decrease the value we are dividing by.  Using right shifts though, we can still divide by powers of 2.  For example, if we want to divide by 16, we would use this:
mov r0, r1, LSR #4
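This works only for unsigned values.  Instead of LSR, you can use ASR to divide signed values by powers of 2.  Be aware that ASR rounds toward negative infinity, while SDIV rounds toward zero, so the two disagree for negative values that are not exact multiples of the divisor: -7 shifted right by 1 is -4, while -7 divided by 2 with SDIV is -3.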
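The rounding behavior of ASR on negative values is easy to get wrong, so here is a small C model of it.  This sketch assumes a compiler where right shifting a negative signed value is an arithmetic shift (true for GCC and Clang, but implementation defined in the C standard).  The function names are hypothetical, and `n` is assumed to be between 0 and 30.

```c
#include <stdint.h>

/* Models `mov r0, r1, ASR #n`: rounds toward negative infinity. */
int32_t asr_divide(int32_t x, unsigned n)
{
    return x >> n;
}

/*
 * Rounds toward zero, like SDIV: adds 2^n - 1 to negative values
 * before shifting, so the shift no longer rounds them downward.
 */
int32_t asr_divide_toward_zero(int32_t x, unsigned n)
{
    int32_t bias = (x < 0) ? (int32_t)((1u << n) - 1u) : 0;
    return (x + bias) >> n;
}
```

For example, `asr_divide(-7, 1)` gives -4, while `asr_divide_toward_zero(-7, 1)` gives -3.  For positive values the two functions agree.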

There is a special way of doing division by constants.  It essentially uses multiplication with a "magic number" that sort of represents the reciprocal of the number you want to divide by.  It is rarely written by hand on processors that have integer division instructions, though compilers commonly generate it anyway, because multiplication is usually faster than division.  We won't go any further into this, as we have access to integer division instructions, but you can learn more by Googling, "magic number division -sports".
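For the curious, here is what one well known case looks like, sketched in C: unsigned division by 10.  The constant 0xCCCCCCCD is 2^35 / 10 rounded up, so multiplying by it and shifting the 64 bit product right by 35 bits gives the same result as dividing by 10 for every 32 bit unsigned input.  The function name is made up, and other divisors need their own constant and shift amount.

```c
#include <stdint.h>

/* Divides an unsigned 32 bit value by 10 without a divide instruction. */
uint32_t divide_by_10(uint32_t x)
{
    /* 0xCCCCCCCD = ceil(2^35 / 10); the product needs 64 bits. */
    return (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 35);
}
```

On ARM, the 64 bit product would typically come from a long multiply instruction, followed by a right shift of the upper half of the result.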

Note that the above strategies only work when one of the values is constant.  If neither value will be known until runtime, then you will have to use MUL, SDIV, or UDIV.  If you don't have access to integer division instructions, you will have to resort to either long division or looped subtraction.
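Both fallback methods can be sketched in C before attempting them in assembly.  The function names below are hypothetical, both functions work on unsigned values, and both assume the divisor is not zero (the caller has to check that).  Looped subtraction is simple but slow for large quotients; binary long division always takes 32 steps.

```c
#include <stddef.h>
#include <stdint.h>

/* Looped subtraction: counts how many times d fits into n. */
uint32_t udiv_subtract(uint32_t n, uint32_t d)
{
    uint32_t q = 0;
    while (n >= d) {
        n -= d;
        q++;
    }
    return q;
}

/*
 * Binary long division: brings down one bit of n at a time, from the
 * top, and subtracts d whenever the running remainder is large enough.
 * Stores the remainder in *rem if rem is not NULL.
 */
uint32_t udiv_long(uint32_t n, uint32_t d, uint32_t *rem)
{
    uint32_t q = 0;
    uint32_t r = 0;
    for (int i = 31; i >= 0; i--) {
        r = (r << 1) | ((n >> i) & 1u);   /* bring down the next bit */
        if (r >= d) {
            r -= d;
            q |= 1u << i;                 /* this quotient bit is 1 */
        }
    }
    if (rem != NULL)
        *rem = r;
    return q;
}
```

Dividing 100 by 7 with either function gives 14, and `udiv_long` also reports the remainder of 2.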

You should spend some time looking over ARM's quick reference card for ARMv7, to see what other math instructions are available.  It includes some more overt combined operation instructions, like multiply and subtract, dual signed multiply and add (multiplies 2 sets of 16 bit values, then adds them), as well as a few other instructions for 16 and 32 bit multiplication combined with other operations.  Most of these exist because they are useful for certain common applications, so there is certainly some value in knowing what is available to you.
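To give a feel for what two of these combined instructions compute, here are rough C models of multiply and subtract (MLS) and dual signed multiply and add (SMLAD).  These are sketches based on my reading of the instruction descriptions, not a substitute for the reference card, and the function names (including the `pack16` helper) are invented for this example.

```c
#include <stdint.h>

/* Models MLS Rd, Rn, Rm, Ra:  Rd = Ra - Rn * Rm (low 32 bits). */
uint32_t mls_model(uint32_t rn, uint32_t rm, uint32_t ra)
{
    return ra - rn * rm;
}

/*
 * Models SMLAD Rd, Rn, Rm, Ra: multiplies the signed low halves and the
 * signed high halves of Rn and Rm, then adds both products to Ra.
 */
uint32_t smlad_model(uint32_t rn, uint32_t rm, uint32_t ra)
{
    int32_t lo = (int32_t)(int16_t)(rn & 0xFFFFu) * (int16_t)(rm & 0xFFFFu);
    int32_t hi = (int32_t)(int16_t)(rn >> 16) * (int16_t)(rm >> 16);
    return (uint32_t)lo + (uint32_t)hi + ra;
}

/* Packs two signed 16 bit values into one 32 bit register image. */
uint32_t pack16(int16_t hi, int16_t lo)
{
    return ((uint32_t)(uint16_t)hi << 16) | (uint16_t)lo;
}
```

One common use of multiply and subtract is getting a remainder after a division: with a quotient of 14 from 100 / 7, `mls_model(14, 7, 100)` gives 100 - 14 * 7 = 2.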

If you have been doing these tutorials in order, you should know enough by now to write a program or two that will demonstrate your understanding of the things we just covered.  I encourage you to do so, as doing helps to reinforce learning.