Monday, December 12, 2016

ARM Assembly: The ARM Architecture

The history of the ARM architecture is quite interesting, though we will only look at it briefly.  ARM started with a company named Acorn Computers.  It initially produced simple computers, during a time when all computers were pretty simple.  Acorn produced several machines that were successful in Europe, including the BBC Micro series.  The company's primary focus was on processors though, and because no existing processor at the time was suitable, it decided to create its own, in the Acorn RISC Machine project, to go with the BBC Micro for more business-oriented applications.  Early ARM processors compared quite favorably to early Intel processors, with the ARM2 outperforming the Intel 80286 while using less power.  Eventually, the ARM processor line was spun off into its own company, Advanced RISC Machines.  The unique thing about ARM as a company is that it does not manufacture processors.  It designs and licenses processor architectures for other companies to manufacture.  The consequence of this is that ARM processors are ubiquitous in the mobile device industry, because it is far cheaper for companies to license designs and manufacture them than to design custom processors themselves.  This is especially valuable in an industry that is constantly pushing technological boundaries to keep up with the competition.  In this business model, one company does all of the design work, often working with larger clients to include valuable new features, and the entire mobile device industry licenses the designs to produce all sorts of different mobile devices.  In addition to mobile devices, some companies are attempting to use the ARM architecture to produce server processors.  They believe that the lower power consumption and higher performance that may be possible with RISC processors could be very popular in a market that values processors that run cheaper and cooler but still need plenty of processing power.

Instruction Sets

The specific part of the ARM architecture that is important to us is how the ARMv7 architecture works.  ARMv7 is broken up into three series.  There is an A series, an M series, and an R series.  The A series is general purpose.  The M series consists mostly of low-power processors designed for use in embedded applications and lower-powered mobile devices.  The R series is designed for real-time embedded applications, where responsiveness and performance are extremely important.  We are mostly going to focus on the A series, with a little bit on the M series.  The main difference between A and M with respect to this series is that the M series only supports ARM's 16-bit Thumb/Thumb2 instruction set (possibly with support for the few 32-bit Thumb2 instructions), while the A series supports both the 32-bit ARM instruction set and the mostly 16-bit Thumb2 instruction set.

Later in this series we will also learn about another instruction set that the processor in the Pi 2 supports: the Neon/VFP instruction set, which provides instructions for floating point math as well as parallel integer and floating point math.

Registers

Pretty much all processors need some sort of memory to operate on.  Modern processors use several levels of memory.  The lowest level consists of registers, each of which can store one value.  The size of that value is generally the word size of the processor.  The word size of the ARMv7 architecture is 32 bits, meaning that each register can hold a 32-bit value.

The ARMv7 architecture claims to have 16 general purpose registers, but this is not strictly true.  No fewer than three of these registers have dedicated purposes, and attempting to use them for anything else is likely to cause problems.  Some programs may use a fourth register for a dedicated purpose as well.

ARM generally names its general purpose registers rx, where x is the register number.  Thus, ARMv7 processors have registers r0 through r15.  We never use the registers above r12 for general purposes though, nor do we refer to them by their designations r13, r14, and r15.  Instead, we use sp for r13, lr for r14, and pc for r15.

The pc register is the program counter or instruction pointer.  It contains the memory address of the next instruction to be run.  With a few exceptions, whenever an instruction is executed, the instruction pointer is updated to point at the instruction after it.  Modifying the instruction pointer will change where the program is executing.

The sp register is the stack pointer.  It contains the memory address of the last entry on the stack.  It is often used as a reference point for accessing data stored on the stack.  Modification of the stack pointer should be done carefully and predictably, because otherwise the top of the stack can be lost, which can be disastrous.

The lr register is used to store the return point for a function.  When a function is called (there are some instructions for this), the address of the instruction directly after the function call is stored in lr, so that the function being called knows where to return to when it is done.  Technically, if the address stored in lr is saved somewhere else and restored later, lr can be used as a general purpose register, though I have not found this to be terribly common.  The most important thing to remember about lr is that if you are going to call a function from inside of another function, you need to store the value in lr somewhere else first, because it is going to get overwritten.  Don't worry too much about this right now though, as we will discuss it again when we start calling functions.

In partially compiled C programs, you may also see an fp register.  This is the frame pointer register, and when used, it is r11.  The frame pointer can be really handy for certain kinds of debugging, and it can be used as an alternative reference point to sp; however, it is not strictly necessary.  In fact, you will never find it in partially compiled C programs with any level of optimization.  It can only be found in unoptimized, partially compiled C code.  We won't be using it, but it is good to be aware of, as you may see it in assembly code generated by GCC.

Memory

One fairly unusual thing about ARM is that it does not have any instructions that directly operate on data in memory.  In fact, the only memory access instructions ARM has are a few for loading data from memory into registers and storing data from registers back into memory.  Most modern processor architectures allow at least a few instructions besides load and store instructions to access data in memory directly.  ARM does not, and this means that if you want to operate on a value stored in memory, you must first load it into a register, and if you want the result of an operation in memory, you have to store it there after the operation is complete.  This keeps the instruction set fairly simple, and despite the fact that it means you have to use more instructions when accessing data in memory, ARM processors still have excellent performance.

The general workflow in 32-bit Intel processors involves a lot of memory accesses, because they have a very limited number of registers (8 general purpose registers, some of which have special purposes that often limit their usefulness).  Consequently, instructions that can access memory directly have been seen as essential to good performance.  ARM, however, has plenty of registers, which means that intermediate values rarely have to be stored in memory.  Thus the value of instructions that access memory directly is very limited.  Historically, this has been such a significant strong point of ARM that it has been able to outperform Intel and other processors with significantly lower power consumption, despite the fact that RISC processors are defined by their smaller instruction sets of simpler instructions.  In fact, this is such a potent strength that modern 64-bit Intel architectures have doubled the number of registers from 8 to 16 (comparatively, 64-bit ARM processors now have 31 general purpose registers, with pc and sp as dedicated registers, more than doubling the number of usable registers).  The general workflow in ARM programs should take advantage of the copious general purpose registers to avoid unnecessary memory accesses.

Peripherals

Peripherals won't matter much here with respect to processor architecture, because we will be accessing them through the operating system, not directly.  That said, they are still worth spending a paragraph on.

Peripherals for ARM processors are memory mapped.  This is the most common way of handling peripherals.  Memory mapped peripherals are assigned to unused memory addresses by the processor (the Pi 2 processor can address up to 1TB of memory, but it only has 1GB, which leaves a huge number of memory addresses free).  In many systems, peripherals are hard coded to specific memory addresses, but in some (desktops and laptops, for example), they are mapped by the BIOS during boot based on a number of factors.  The Pi has its memory mappings hard coded, since its hardware is not changeable.  A peripheral can be accessed by reading and writing the memory addresses associated with it.  The catch here is that this is handled by Linux on the Pi, and we don't have direct access to the device's memory.  We access memory through a memory manager that is controlled by the operating system.  We will look at this more later, but the important thing is, no matter what memory we attempt to modify in our programs, we cannot directly access memory that is mapped to peripherals.  We must go through the operating system.  This is for security reasons, and because an operating system running multiple processes simultaneously has to implement some kind of access control to prevent programs from stepping on each other's toes.


Now that we have a basic understanding of the processor architecture in our Pi 2, we can get back to writing assembly code!

Wednesday, November 23, 2016

ARM Assembly

Preface

I recently designed a course on the subject of ARM assembly programming, which I am now teaching.  During the design of the course, I spent a great deal of time struggling to find information on certain topics within the subject.  I found a distinct lack of good resources on many things.  The biggest problem was finding information about the interface between assembly level code and the operating system.  On one occasion I even taught my students something totally wrong, because I had been forced to learn it myself through reverse engineering, and it turns out that the operating system and the debugger don't behave the same way.  Of course, I corrected myself two days later, and I used the incident as a teaching opportunity.  What I learned from all of this is that assembly programming, especially for the ARM architecture, is very poorly documented at the interface between OS and user space programs.  I am writing this series with an aim to fix this problem.


Introduction

This series will teach ARM assembly programming, using ARMv7 assembly running on the Raspberry Pi 2.  The Pi 3 should also work fine for the vast majority of this, however some parts will not work on the original Pis.  I will do my best to identify these where I am aware of them.

This series assumes that if you have a Raspberry Pi, you either already have everything you need to use it, or you can figure out what you need and obtain it.  The essential things you will need are internet access for your Pi, at least briefly for the first chapter, and some way to input text and view text output.  This can be a USB keyboard and a monitor that will work with the Pi, it can be an SSH connection to the Pi over a network, or it can be a serial to USB adapter that gives you direct terminal access through the GPIO pins (Adafruit sells these).  There are plenty of resources online for each of these methods.

The operating system used for this series is Raspbian.  At least one of my students is using Arch Linux successfully, so this may also be a viable option.  Raspbian can either be installed using the NOOBS installer that comes on most SD cards sold specifically for the Pi, or you can download a Raspbian image and install it directly.  The second option is recommended, as it will give you the most recent version of Raspbian.  In addition, you may be able to avoid certain issues some of my students have run into with Raspbian installed through NOOBS for another assembly project you might be interested in that is currently outside the scope of this series.

Additionally, this series expects you to already have a basic understanding of the concepts of processor registers and assembly instructions.  I am sure many people will be able to figure these things out as they go along, but if you start this without that understanding and find yourself struggling to understand even the basic concepts, perhaps you should take a break and spend a little bit of time learning about modern processor architectures and how they work.  (I will probably write a short article on this in the future, but right now this series takes higher priority.)


Table of Contents

This will be populated with links to each chapter as they are published, in order of publication.
  1. Setup
  2. The ARM Architecture
  3. Basic Math
  4. Math with Small Data Types
  5. Type Casting
  6. Advanced Integer Math
  7. Memory Architecture
  8. Integer Division
  9. Ditching GCC
  10. The Stack
  11. Pointers and Global Memory Access
  12. Function Pointers
  13. Branches
  14. Conditional Branching
  15. Loops
  16. Indexing Modes
  17. Arrays and Structs

Index

This will be populated with links to the chapters sorted by topic.

ARM Assembly: Setup

Setup

Before we can start writing assembly programs, we need to do some basic setup.  If your Pi is not already set up and running, stop here and do that.  Once it is running, log in to the terminal.  You won't need the graphical interface for any of this, and it is honestly easier to just work completely in the terminal for most of the chapters in this series.  If you are an experienced user and you feel you cannot do without the GUI (both of these in one seems unlikely), feel free to use the GUI, but understand that I won't be using it, and this series will assume that you are working purely from a terminal as well.

Once you have logged into a terminal, you will need to check what version of GCC is installed.  This is important, because the assembler that comes with versions of GCC older than 4.8 does not support some ARMv7 instructions that we will be using on our Pi 2.  (The original Pi processor is ARMv6, and thus it does not have these instructions.)  Check your version of GCC by running gcc -v.  This will spit out a lot of information.  At the bottom you will find the version information.  My Pi 2 says its GCC version is 4.6.3.  This is what it came with.  If your version is lower than 4.8, you will need to run sudo apt-get install gcc-4.8.  Note that this will not change the default version.  If you run gcc -v again, it will still report the old version.  For compatibility reasons, installing GCC 4.8 won't replace the old version.  Now though, you should be able to run gcc-4.8 -v, and it will give you the version information for the new GCC you just installed.  If you want to compile with GCC 4.8, you will have to run gcc-4.8 instead of gcc.  Installing GCC 4.8 will replace the assembler though, which is all we really care about here.

I should add a note here: if the version of GCC is 4.8 or higher (recent versions of Raspbian now come with 4.9.2), then you can just use gcc instead of specifying the version.

Writing a Simple Program

This part is not so much about learning assembly as it is about learning to use a terminal editor.  You have two options by default.  They are Nano and Vim.  In my opinion, Vim is the better of the two, however it has such a steep learning curve that I am not even going to try to teach you how to use it (besides that, I am not even that proficient with it yet).  If you prefer Emacs, it is not installed by default, but you probably already know how to install it.  I am going to use Nano for this series.  You can run nano in the terminal to start Nano and create a new file.  You can run nano filename to open an existing file.  When you are in Nano, you will see a list of controls at the bottom.  The important ones are Ctrl+R to open a file, Ctrl+O to save a file, and Ctrl+X to exit (I have never actually used Ctrl+R...).  You can also use Ctrl+W to search, Ctrl+K to cut a line of text, and Ctrl+U to paste the last text that was cut.  If you expect to use these last few, I would suggest experimenting with them a bit to get familiar with how they work.

Let's start by writing a simple program that compiles with GCC.  You will probably want to create a directory to do all of this in.  I'll leave that up to you (in Linux, use the command mkdir directory_name to create a new directory in the current one, and use cd directory_name to move into that directory; if you did not know this already, you may want to find a Linux command line tutorial before continuing).  Once you are in your new directory, start Nano.  I ran nano num.s to create a new file named num.s.  (The file extension for assembly files is ".s".)  You can instead specify the file name when you save the file for the first time with Ctrl+O, if you prefer.

The first thing we need in the file is a line with the text .text.  This directive marks the section of your program with executable code, which is generally called the text section (less commonly the "code section").  Since we are compiling with GCC, we need a "main" function.  Assembly does not have functions in the same sense as other languages, though we generally treat it like it does.  Instead, assembly has labels that are references to memory addresses.  These can be the addresses of data or instructions.  To create a "main" function, we need to do two things.  First, we need to create a label named "main" that references the instruction in our code where the program should start.  This is done by typing main:, on its own line.  The second thing we need to do is make our "main" label visible outside of our assembly file (so that the C library can see it; more on why this is necessary later).  Directly above the "main" label, add a line, .global main.  In the future, we will type the global directive before we create the actual label.

At this point, your program should look like this:

.text
.global main
main:

Now, the C compiler knows where our program starts (the instruction directly following the "main" label).  Next we need to write the actual code.  To keep things simple, our program is going to return an error code.  The convention is for functions to put their return values in the register r0.  When we return from main, the value in r0 is going to be used as the error code returned by our program.  Normally you don't get to see what error codes are returned by programs, but I'll show you a way to do it when we get there.

To put a value in r0, we will use the mov instruction.  This instruction can copy a value from one register to another, or it can take an immediate value (a number encoded in the instruction) and put it in a register.  There are some limitations to this, but we don't need to worry about them yet.  Underneath the "main" label, press tab once to indent, and then type mov r0, #27.  This will put the value 27 in the register r0.  Immediate values must generally be preceded by a # symbol.

Now that we have put our return value in r0, we can return from the main function.  This is done with a special branch instruction that takes a memory address to branch to.  By default, the program will run one instruction after another sequentially.  Branches are instructions that tell the processor to go to some other instruction and continue from there.  This is how we do things like calling functions and skipping instructions in false if statements.  We also use a branch instruction to return from a function.  In this case we will use the bx instruction, which takes a register that contains the memory address of the instruction we want to go to.  We will discuss ARM registers more in a moment, but for this instruction all we need to know about is the lr register, which contains the address for the return point when a function is called.  To return from the main function, we will use the instruction bx lr.  This will go back to whatever code C used to call the main function, allowing C to shut down anything it did before the program exits.

Our program should now look like this:
.text
.global main
main:
    mov r0, #27
    bx lr

All this program does is return the exit code 27.  Here we need to save our program (Ctrl+O).  Now, exit Nano (Ctrl+X).  At the command line again, run gcc num.s.  This will compile our program and create an executable with the default name a.out.  We can run our program with ./a.out.  Disappointingly, it does nothing.  Like I said before, normally, we don't actually get to see the error codes.  We can display the error code of the last program that ran by running echo $?.  This should print out the number 27.  (Note that if you run echo $? again, it will print out 0, because that is the error code returned by the first echo command.)  You can edit your program and change the number put into r0, but there are two things to keep in mind.  First, negative numbers won't work.  The error code is an unsigned byte, so a negative number will print out as its unsigned representation.  Second, an unsigned byte can only fit in the range of values from 0 to 255 inclusive.  If you put in a number outside of this range, it will only print out whatever the unsigned representation of the lower 8 bits is.  Additionally, due to limitations we will discuss later, some numbers outside of this range won't even allow your code to compile.  If the compiler gives you an error like "invalid constant (8888) after fixup", it means that the number you are trying to put in the register cannot be encoded in the instruction you are trying to put it in, so you will have to find another way.

The main point of this exercise was to become familiar with the editor and make sure everything works the way we expected.  An important thing to note is that we did not use the GCC 4.8 that we installed earlier, but we did use the assembler installed with it.  From here on out, most of our programs will not be compiled with GCC.  GCC adds some overhead to pretty much everything it compiles.  It adds startup and shutdown code that we really only care about if we are using C functions.  So, for the most part, the only time we will use GCC to compile our programs is when we are using C functions.

Thursday, August 25, 2016

Video Game Development: Multi-threading

Multi-threading is a complex enough subject to warrant an entire course on its own, so we are not going to go into extensive detail here.  Multi-threading does have some very valuable benefits in the realm of game development though.

Multi-threading allows different parts of a program to run at the same time.  In modern multi-core computer systems, this can mean managing the game physics on one core, rendering graphics on another core, and handling audio on a third.  It could also mean updating one game unit on one core and another on a different core, though doing this is significantly more complex.

There are two things that are essential to understand when discussing any kind of parallel programming.  The first is concurrency.  Concurrency is when the program is doing different tasks at the same time.  The second is parallelism.  Parallelism is when the program is doing the same thing to different sets of data at the same time.  An example of concurrency with respect to video games is rendering graphics in one thread while processing physics in another.  The two tasks are different, but they are happening at the same time.  This can often be beneficial, even on systems that do not have multiple processor cores, because graphics rendering spends a lot of time waiting for the video card to be ready for more data.  During that wait time, other processing, like game physics, can take place without slowing down the rendering.  Parallelism requires hardware designed for multi-threading, like multi-core processors.  With parallelism, a set of data is broken up into smaller sets, and these sets are worked on at the same time, using the same algorithm.  For example, if my game has 100 game entities that need to be updated, instead of updating each one, one at a time, I could break the list into two lists of 50 each.  Then, I could process 50 entities in one thread and 50 in another, at the same time.  This could reduce the time required to update all of the entities by up to 50% (in practice, the benefit is a bit smaller than this, due to overhead of multi-threading).

In games, we generally stick to concurrency, due to certain issues with multi-threading.  When multi-threading we have to share memory between threads to allow communication.  In a game, we might be sharing animation frame data between the rendering thread and the physics thread.  When an entity updates, the physics engine might also update that entity's animation frame.  For example, a character in the game might be walking, and when the character steps forward, the physics engine sets the animation frame to show the step being taken.  The rendering thread also needs to use this data when it actually renders the animation frame on the screen.  So, what happens if the rendering thread starts to render the animation frame, but then halfway through, the physics thread changes it?  The result is that we have the top half of the frame as one image and the bottom as another, and we can probably see a horizontal seam in the middle where the change happened (this is called tearing, though it does not generally happen in this way).  You can also run into problems where two threads load something from memory at about the same time, both change it, and then both write it back.  This results in what is called a "race condition," because whoever writes second wins.  An example of this is a variable that is incremented each game loop, by two different threads.  If the variable is an 8, and then both threads load it (both getting 8s), and then both increment it (now both are 9s), and then both write it (still 9s), the result will be a 9, even though it should be a 10, because it was incremented twice.  These bugs can be incredibly hard to find, and there is no easy way to prevent them, except not sharing anything between threads, which defeats the point.

There are a lot of complications with multi-threading, and they all come down to making sure that memory accesses are carefully managed to avoid unexpected problems.  This is usually done by "locking" memory accesses.  A function that needs to modify some memory will lock that memory, read it, modify it, and write it, and then unlock it, and since it is locked while in use, nothing else can use it.  These locks take time to set and remove though, which reduces the efficiency of the program and can even eliminate the benefits of multi-threading if too many have to be used.  The best strategy is to avoid the need for locks in the first place, and minimize their use when they are absolutely necessary.  Even with locks though, you can run into issues.

This is a very complicated topic.  More information can be found on Wikipedia and other resources on the internet.  The takeaway here is that multi-threading can be extremely valuable in optimizing video game performance, but it requires a great deal of care and complete understanding to avoid running into very difficult problems.

Video Game Design: Player Generated Content

Player generated content is anything creative within a game that was created by one or more players.  It generally requires a high level of flexibility within the game, even if that flexibility is only within a limited scope.  For example, allowing players to design their own flags in a game would be player generated content.  Of course, it can also be far broader than this.  Games like Minecraft and Terraria allow players to build custom structures and even complex logic constructs from basic building blocks.  Most Blizzard real-time strategy games (including the Starcraft and Warcraft series) include map editors that give players some ability to create custom maps and other custom in-game content.  This allows for large scale player generated content far beyond anything the developers could imagine (and they can imagine a lot).

Player generated content can also go outside the game.  For example, there is an entire YouTube series built around the Halo games.  They use video captures from the games, and then they add voices.  There are also millions of "Let's Play" videos for all sorts of games, with Terraria and Minecraft being some of the most popular, which also qualify as player generated content.  Even written fan fiction can be a form of player generated content.

Player generated content serves several purposes.  The first is to give a game increased replay value.  Replay value is the value a game has in continuing to play it even after you have beaten it.  Some games add replay value by adding a multiplayer mode so players can play with their friends.  Others do it by randomizing maps and adding such huge amounts of content that no one can experience it all the first time through the game.  Player generated content can add almost unlimited replay value to a game though.  When players are able to add their own content to a game, they can create replay value themselves.  A few flexible options can turn into practically infinite combinations, and that helps a game to avoid ever really getting old.

Player generated content also helps the game creators more directly.  You could create a game with huge amounts of content, or you could give your players the ability to make content for you.  Spend some time looking for epic things built in Minecraft, and you will find an enormous amount of free advertising for Mojang.  When people make mods for Minecraft, they are creating their own content that players can optionally add to their game.  If you start to get bored with Minecraft, just add a few mods.  Even the 5 most popular mods have more content than a single person can ever experience completely, and Mojang, the creators of the original game, did not have to make any of that.  Some mods add more content to the game than it has in the first place, and some people put together packages with many mods combined.  All of that keeps people playing and brings new people into the game, at no cost to the developers.  In short, player generated content is essentially free labor for the game creator, and even though the game creator does not legally own the player generated content, it does not matter, because it still benefits them just as much as if they did.

Allowing player generated content in a game can be tedious and difficult, but it is generally well worth the time and effort.

Friday, June 24, 2016

Video Game Design: Educational Game Psychology in a Nutshell

Game psychology, in a nutshell, is about reward and punishment.  The Paranoia XP rulebook says it best.
If you reward players for doing something you like, they will do more of that.

If you punish players for doing something you don't like, they will do less of that.

Applied to video games, this generally means that if you want your players to keep playing your game, you should reward them for it.  Applied to educational games though, the meaning is much less clear.

In educational video games, the above psychological truth is applied to only one facet of the game.  "something you like" is construed to mean learning and solving problems related to the educational goal of the game.  "something you don't like" is applied only to failure to correctly solve the problems presented.  The game generally rewards success with points, a good grade, some visual recognition of the student's success, or even actual game play in a non-educational element of the game.  The reason educational games are, for the most part, an epic failure is that they miss at least two other facets.

In non-educational games, the reward/punishment psychology is applied to the game itself, to encourage players to keep playing the game.  You reward the player for playing the game and progressing in the game, because you want the player to keep playing your game.  Most educational games miss this motivational facet entirely, putting great effort into rewarding players for success in the educational element of the game, but ultimately not motivating them to actually want to play it in the first place.  An educational game that cannot keep players engaged is worthless, no matter how good the educational aspect of the game is.

The educational aspect of an educational game is always viewed by the game designers as the "something" in the above quote.  From the point of view of educators and parents, education is the "something" that they want to encourage students to do more of, but educators and parents are not the target audience of educational games.  Students are the target audience, and designing a game without focusing on the target audience almost always ends in failure.  If you don't believe this, look at educational games.  The fact is, the educational aspect of the game is not just the something we want players to do.  It is also a reward or punishment, and when we apply this to motivation to play the game, it becomes crystal clear why there are almost no decent educational games.

The educational aspect of most educational games is the "work" portion of the game, without any fun.  We hope to motivate students to work through the educational portion of the game, so that they can learn what is being taught and gain some skill in the subject.  We do this by rewarding them for their successes.  Unfortunately, this is not what the players see.  The players don't see the work as the thing that they have to do to earn the reward.  They see it as a punishment for playing the game.  When we punish people for playing our games, they don't want to play those games.  If we cannot see beyond our idealized version of how the game ought to be played and look through the eyes of the players, our target audience, we will miss the fact that our games are awful.  We are using learning, which should be fun, to punish the players for playing the game, and in the end we will find that we have alienated our target audience and taught them to hate learning, which is the opposite of our goal.

When making a game for any purpose besides entertainment, it is essential to understand both our target audience and the psychological ecosystem created by our game.  If we lose sight of the primary goal of any successful game, to engage players and motivate them to continue playing, then no matter how good or effective our game is at its stated goal, it cannot be successful, because a game that no one wants to play cannot have a positive impact on anyone.

There are three essential facets to the above quote when designing educational games:
  1. Reward the players for playing the game.  This should be the first goal for every game.  If no one wants to play your game, it cannot benefit anyone (except, perhaps, as an example of what not to do).
  2. Do not let the educational element of the game become a punishment for players.  If the educational element becomes a punishment, it won't matter if people play your game either, because it will just make them hate learning and hate the subject you are trying to teach.  It does not matter if your players come away knowing math or reading or anything, if they hate it so badly they never want to use it again.
  3. Reward players for trying to solve your educational problems, and reward them more for success.  Don't punish them for failure.  The time lost on a failed attempt is already punishment enough, and one of the strongest motivators for playing games is that the cost of failure is low.  Don't make it worse, especially in a game that may already be less motivating than you might want it to be.
These are in order of importance.  A game that does not satisfy #1 has no value, because no one wants to play it in the first place.  An educational game that does not satisfy #2 harms the education of the student, instead of improving it.  An educational game that does not satisfy #3 probably does not satisfy #2 or #1, and even if it does, it is not going to be as effective as it could be, because it does not encourage the player's brain to consider the value of the information to be high enough to store.

This is educational game psychology in a nutshell.  Making effective educational games is not horribly difficult.  In fact, it only requires the same mentality that making successful non-educational games does, with a few extra priorities below the most important one.

Monday, June 20, 2016

Video Game Development: Scheduling

Scheduling is an essential part of game development.  It is how the game decides what subsystems and tasks to run, when, and in which order.  The three most common game subsystems are physics, graphics rendering, and input handling.  These three subsystems are what defines a video game, but they don't have to be everything.  Many games also have an audio subsystem and a networking subsystem.  Some of these, like the physics engine, must run at least once per game loop.  Some have specific performance requirements to avoid problems, like the input handling and networking systems.  Some are less important but still need to be managed well to avoid annoyances, like the audio system.  Managing all of these is called scheduling.

Subsystems

The most important subsystem of a game is the physics engine.  This handles things like movement of game entities, state changes, and pretty much everything that happens in the game world.  It is sometimes split into multiple pieces, to make management easier, but generally all parts of the physics engine should run at the same speed, and there is usually some order of operations requirement for a good game play experience.  The physics engine does not have to run at a perfectly consistent rate, but it should never be allowed to get more than a few milliseconds behind or ahead; otherwise the game will run too fast or lag.

The second most important subsystem is input handling.  This is often grouped as event handling, which deals with input as well as intersystem communication within the game.  The important thing with input handling is that it happens often enough to keep the game highly responsive.  When a player clicks the mouse, presses a key, or presses a button, the effect on the physics engine should appear to be immediate.  Players can perceive very small amounts of input delay, and laggy input can easily make a great game unplayable.  The input handling system never needs to run more frequently than the physics engine, but it should generally run at least as fast as the renderer.

The third most important subsystem is rendering, often called the graphics engine.  The graphics engine shows the player what is happening within the game world.  The only reason it is less important than physics and input is that the graphics engine's performance requirements are much more flexible.  A game that renders slowly can still be playable if the physics engine and the input handling are running at an acceptable rate.  If the physics engine is too slow though, the game lags, and if the input handling is slow, the player loses the ability to respond quickly.  With a technique called frame skipping, it is acceptable to skip rendering to allow the game engine and input handling to catch up if they get behind.  There are two good rules of thumb for choosing a default frame rate.  The first is to avoid going over around 60 to 75 frames per second.  Above roughly 50 frames per second, most people have a hard time perceiving any difference, and most monitors have a refresh rate of 60Hz or 75Hz, which means that rendering faster than this literally will not make any difference.  The other rule of thumb is to avoid frame rates lower than around 30 frames per second; if the game drops much below 20 frames per second, the user is going to assume that his or her computer just cannot handle the game.

Other subsystems depend on a lot of factors.  Networking generally needs to be very responsive, to keep things synchronized.  If the audio subsystem is not running frequently enough, it can end up with the sound not synchronized with the game, but it can always skip playing less important sounds if necessary.  If you are doing something like dynamic terrain generation in the background, you will want it to go fast enough that nothing important gets into an ungenerated area, but not so fast that the physics engine starts to get laggy.  In general, other subsystems are not as important as the top three, but when any of the top three rely on them, it may be necessary to give them higher priority.

Game Loop

The game loop is where most of the scheduling occurs.  Unless you are multithreading to put graphics or networking in another thread, the game loop decides when to call all of the various subsystems.  The order subsystems are called in does matter, though it typically matters the most when you have more than the main three.  The input handling should happen directly before the physics, so that the physics is working with the most recent input.  This is important for minimizing input delay.  The graphics engine should generally render directly after the physics engine runs, to display the new game state to the user as soon as possible.  If the game is networked, that may need to happen either right before or right after the input handling or the physics (depending on the details of the protocol).

The game loop is responsible for timing.  This may mean that the game loop is responsible for making sure that the physics engine is kept above a minimum rate, or it might mean that it needs to keep it within a certain range.  It should make sure that input is handled regularly and that rendering happens without getting in the way of anything more important.  It also needs to deal with running other subsystems regularly.  The most important thing is managing priorities.  In some ways, a game loop is like the scheduling system in an operating system, though with rather less complexity.

Scheduling Algorithms

The scheduling algorithm is the strategy used by the game loop to keep everything running smoothly.  There are two popular scheduling algorithms that can be adjusted to fit most situations.

The first popular scheduling algorithm is time scaling.  Long ago, game loops did not do any scheduling.  They relied on the speed of the processor to maintain appropriate game speed.  This eventually led to a problem.  When Intel's 80386 processor came out, a lot of people noticed that their games would run way too fast.  The games were designed on much slower 80286 or older processors, and these were slow enough that limiting speed had never had any value.  With the 80386 though, these games ran so fast that they were unplayable.  The initial solution was that most computer manufacturers added a "turbo" button to their 80386 and 80486 computers that would reduce the clock rate of the chips to make older games playable.  Of course, new games would also have this problem if the developers did not do something about it, and relying on the continued existence of the turbo button, as even faster computers appeared, was clearly not a wise option.  One of the solutions to this problem was to scale the speed of the physics engine to the time passing.

The time scaling scheduling algorithm is simple.  Any time any state change occurs in the game world, the magnitude of the change is scaled by how much time has passed since the previous game loop.  If a game entity was set to move at 10 pixels per second, and one tenth of a second had passed, the entity would move 1 pixel.  This provides a major advantage in two ways.  If the game is running extremely fast, the game world still runs at the expected rate, just with finer precision.  If the game is running very slow though, it still seems to run at the correct rate, and the slowness is only apparent in rendering and input delay.  In other words, for any given computer, the game physics appear to run at the same rate, and only the visual representation and responsiveness suffer.  There are also several costs to this scheduling strategy.  The first is that it requires many physics variables to be floating point variables, to deal with the very fine granularity of the game running on a very fast system.  This can degrade performance on slower systems.  The second is that floating point math is inherently inaccurate.  While these inaccuracies generally only occur on an extremely small scale, which will not be noticed by a player, they add up enough to make it quite difficult to keep networked games synchronized.  The problem is that floating point error adds up faster on fast machines than slow machines, which means that two networked machines doing the same math will probably not end up with the same cumulative results.  This problem can be mitigated, but another scheduling strategy exists that makes it unnecessary.
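The core of time scaling can be sketched in a few lines of Python.  The entity fields and the update function here are hypothetical stand-ins, not from any particular engine:

```python
import time

def update(entities, dt):
    # Scale every state change by the real time elapsed (dt, in seconds),
    # so the simulation runs at the same apparent rate on any machine
    for e in entities:
        e["x"] += e["vx"] * dt

# One iteration of a time-scaled loop
entities = [{"x": 0.0, "vx": 10.0}]  # moving at 10 pixels per second
last = time.time()
now = time.time()
update(entities, now - last)  # however much real time passed, scale by it
last = now
```

With dt equal to one tenth of a second, the entity above moves exactly 1 pixel, matching the example in the text.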

The fixed rate scheduling algorithm runs the physics engine at a fixed rate.  The physics engine does not need to do any time math, because it can assume that each time it is called, the same amount of time has passed.  The game loop typically handles this by keeping track of how much time a given loop has taken, and then sleeping for the remaining time at the end.  One problem with fixed rate scheduling is that if the operations in the loop take longer than the allotted time, the game will get behind and start to lag.  This makes it not very resilient to slower computers.  In addition, this can make collision detection more difficult, as game entities generally move in larger fixed steps than they do under time scaling.  The problem arises when a game entity is moving fast enough that it traverses the distance across another entity in one frame, effectively moving right through it.  The benefits of fixed rate scheduling include the fact that it does not waste CPU cycles unnecessarily, and as long as the ends of a networked game are on the same physics frame, their state should be identical.
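One common mitigation for this tunneling problem is to break a large per-frame move into substeps.  This is only a sketch of the idea, in one dimension, with a hypothetical blocked() collision test standing in for real collision detection:

```python
def move_with_substeps(x, dx, blocked, max_step=1.0):
    # Split a large per-frame move dx into substeps no bigger than
    # max_step, so a fast entity cannot pass through a thin obstacle
    remaining = dx
    while abs(remaining) > 1e-12:
        step = max(-max_step, min(max_step, remaining))
        if blocked(x + step):
            break  # collision: stop before entering the obstacle
        x += step
        remaining -= step
    return x
```

A wall at x = 5 stops an entity moving 10 pixels per frame at x = 4, instead of letting it jump straight to x = 10 and through the wall.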

In modern games, fixed rate scheduling is generally preferred, because so many games are networked, and it is the best algorithm for networked games.  Its downsides are pretty serious though.  It does not scale down well, which can limit the game to only faster systems.  The length of intervals cannot be guaranteed, because no sleep function provides a 100% guarantee that it will sleep for exactly as long as requested.  These are some pretty serious problems, as they mean that the game can get behind or ahead at the whim of the operating system, and slow systems will just have to deal with chronically laggy games (which does not work at all for networked games).  Fixed rate scheduling comes with more serious problems than time scaled scheduling, but it turns out they are also easier to deal with.

The typical fixed rate game loop does more than just running all of the subsystems and then sleeping for the remaining time.  It also prioritizes subsystems, ensuring that the most important ones always get as much time as they need, at the cost of the less important ones.  In addition to this, it keeps track of how much time passes during sleeps, so that it can mitigate cases where a sleep takes longer than expected.  A fixed rate game loop in Python might look like this:

import time

# input_handler, update, and render are the game's subsystem entry points
run = True
t = time.time()
pending_time = 0.0

while run:
    new_t = time.time()
    pending_time = pending_time + new_t - t  # Add time elapsed
    t = new_t

    input_handler()

    while pending_time > 0.020:  # 20ms is 50 times per second
        update()  # A common function name for physics
        pending_time = pending_time - 0.020

    render()  # A common function name for graphics
    # Never pass a negative duration to sleep
    time.sleep(max(0.0, 0.020 - (time.time() - t) - pending_time))

Here are some things to note:

We only get the current time to store once per loop.  There is no reason to do it more often, and in fact, if we do, we may end up frequently losing small amounts of time.  An important note is that time.time() is not a high precision timer, and thus it should not be used for real games.  Pygame has a Clock object that is much more suited to this purpose.  Likewise, sleep() is also too low resolution to be used for games, but Pygame's Clock object can manage this functionality as well.

Instead of keeping track of only the time elapsed since the last loop, we are keeping track of all time elapsed, and then we are subtracting the expected time per loop from that for each time we run the physics engine.  This ensures that if the OS sleeps for too long, we eventually get that time back when it adds up to enough to matter.

The physics engine runs in a while loop.  This allows us to catch up if we get behind.  If enough sleeps taking too long adds up to a full game loop of time, we will just run the physics engine an extra time during the next loop, which will catch us right up.

The render and input handling are outside of the while loop.  This essentially skips frames if the game is behind, and since rendering tends to take a long time, skipping a frame now and then can allow the game to run smoothly even on slower systems.  If the game has to skip a lot of frames, this may result in significant input delay though, so it may be better to put the input handler inside the loop as well, if you expect a lot of customers to be running your game on computers that can only barely handle it.

The while loop subtracts 20ms from pending_time each time it runs the physics engine.  This is the second part of what allows us to keep track of smaller amounts of leftover time.

The sleep function only sleeps for 20ms minus the time elapsed since we began the last game loop and the pending time.  The first part is there, because we only want to sleep for the remaining time.  It accounts for the part of the 20ms that has already been used running the various subsystems.  The second part is there to keep the time intervals as consistent as possible.  Ideally, this means that if there is any time pending, we only sleep just long enough that we have accumulated exactly 20ms.  In practice, this is unlikely to happen often, but doing this will help to avoid skipped frames as much as possible, by reducing the chances that pending_time will ever accumulate enough time to run the physics more than once a loop.


This example is fairly simple, and it does not include any extra subsystems.  Most subsystems will need to be included at the same priority level as the physics engine or renderer, but lower priority subsystems may need a new priority level.  This might be done by checking how much of the 20ms (or whatever period you give your game loop) is remaining after the first and second priorities have run, and skipping the lower priority subsystems when there is not enough time.  Even more complex priority systems can be created as well, ensuring that the top priority subsystems run every loop and then rotating through the lower priority subsystems, running or skipping a few each loop, depending on timing.  It is important to keep in mind, though, that the game loop is the top level bottleneck for the game, and it is possible for a complex scheduling system to end up taking more time than it ultimately saves.  You may need a complex scheduling system if you have a lot of different subsystems and each needs a guaranteed level of reliability, but if you don't actually need this, don't do it.  The scheduling system's job is to make sure the game runs smoothly and responsively, and if the scheduling system itself becomes too heavy, it will get in its own way.
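Rotating through low-priority subsystems can be sketched like this, assuming a 20ms loop period.  The task functions and the 80% budget cutoff are illustrative choices, not fixed rules:

```python
import time

FRAME = 0.020  # 20ms loop period

def run_low_priority(tasks, frame_start, budget=0.8 * FRAME):
    # Run queued low-priority tasks only while loop budget remains,
    # rotating each task that ran to the back of the queue so the
    # ones that were skipped this loop go first next loop
    ran = 0
    while tasks and (time.time() - frame_start) < budget:
        task = tasks.pop(0)
        task()
        tasks.append(task)  # rotate to the back
        ran += 1
        if ran == len(tasks):
            break  # every task has had a turn this loop
    return ran
```

The game loop would call this once per iteration, after the physics and rendering have run, passing the time the loop started.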

Good scheduling is an essential element in the development of a good game.  The scheduler budgets time, ideally making sure that everything gets as much time as it needs, but rationing time effectively when there is not enough to go around.  Choosing or developing a good scheduling algorithm for your game can be just as important as any other part of development, because it really does not matter how awesome your game is if it is unplayable.  In addition, a good scheduler can allow your game to run smoothly on a much wider range of systems, dramatically increasing the size of your market.

Video Game Development: Efficiency Part 2

Memory Profiling

We won't go deep into this subject, but it is good to be familiar with the idea.  Memory profiling is a method of tracking and viewing the memory usage statistics of your program.  It can be used to track down memory leaks and to find parts of the program that are using more memory than expected.  In general, it is good to use some kind of process viewer to see how much memory your game is using, but if you find it using more memory than you expect, you will need to do some memory profiling to figure out where and why it is using so much memory.

There is a Python library called memory_profiler that can be installed with the command pip install -U memory_profiler.  (Python generally comes with pip, but under certain circumstances, you may find you need to install it yourself.)  You should also run pip install -U psutil, or memory_profiler may take an excessively long time calculating the results (it also may not be able to calculate some of the results).  How to use this module is beyond the scope of this article, but a quick Google search should offer more than enough documentation.

Data Structures

Perhaps the two most common data structures used in software development are the array and the linked list.  Understanding how these work can provide you with tools for customizing performance to your target platform.  An array has a static size, and it is fast for random access.  Because the size is static, if you need the size to be changed, you have to create a new, larger array and then copy all of the data, which is quite expensive.  Linked lists are slow for random access, and they use more memory than arrays, but they are dynamically sizable, and resizing is cheap.  Neither data structure is perfect for every situation, but arrays tend to be better when your list of data will not be increasing or decreasing often, and you need to be able to quickly access any element.  Linked lists tend to be better when the size of your list will be changing frequently and most of the access is sequential (like, in a for loop).  These are not the only data structures with this kind of tradeoff, though they are the most commonly used data structures in games.
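In Python, the built-in list behaves like a dynamic array, while collections.deque is implemented as a linked list of blocks, so the two show the same tradeoff in miniature:

```python
from collections import deque

# Array-like: O(1) random access, but inserting at the front
# shifts every element (O(n))
arr = [0, 1, 2, 3, 4]
arr.insert(0, -1)  # every existing element moves over one slot

# Linked-list-like: O(1) insertion at either end, but indexing
# into the middle has to walk the links (O(n))
lst = deque([0, 1, 2, 3, 4])
lst.appendleft(-1)  # cheap, no elements move
```

Neither wins in general; which one is faster depends on whether your access pattern is mostly indexed or mostly sequential with frequent resizing.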

Objects

Games seem like the perfect place to use objects, but there are some important things to realize about objects and efficiency.  To start with, objects are a meta-data type.  In other words, objects define new data types.  As with any data type, objects should be used when appropriate and avoided when not.  Common wisdom teaches that the best programming style for object oriented programs is to encase everything in an object.  This makes about as much sense as saying that every piece of data in a program should be stored in the form of a string, a char, or an integer, especially with reference to game programming.

Objects tend to cause caching issues.  Caching prefers data that is used sequentially, for example, an array of integers that is being processed in a for loop.  An array of integers where only every 5th element is being processed in the loop is far less efficient for caching than an array where every element is being accessed, because we are causing data to be loaded into the cache, but we are not using it.  Objects are the epitome of this problem.  A game entity object might contain the coordinates of the object, the facing of the object, the current action of the object, the destination of the object, and a pointer to the sprite that will represent the object on the screen.  A collision detection algorithm might only care about the coordinates and destination of the object, but if it loads an array of objects, the facing, current action, and sprite pointer will also all be loaded into cache.  That leaves maybe half of the data in cache actually being used.  The renderer will only care about location and sprite, so again, we have cache locality issues slowing down our program.  If, instead, we stored these entities in 5 arrays, one for each element of the object, only the appropriate data would be loaded into cache, which could significantly improve performance (at the cost of making the program a bit more difficult to understand).
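The two layouts can be sketched side by side.  The entity fields follow the example above; note that in CPython the cache benefit is muted, because lists store pointers rather than the values themselves, but the layout idea carries over directly to C or C++:

```python
# Array-of-objects: any pass over this list drags all five fields
# of every entity through the cache, whether it needs them or not
entities = [
    {"x": 1.0, "y": 2.0, "facing": 0, "action": "idle", "sprite": None},
    {"x": 3.0, "y": 4.0, "facing": 1, "action": "walk", "sprite": None},
]

# Parallel arrays: each pass touches only the fields it needs
xs = [1.0, 3.0]
ys = [2.0, 4.0]
facings = [0, 1]
actions = ["idle", "walk"]
sprites = [None, None]

def collision_positions(xs, ys):
    # Collision detection streams through only the coordinate arrays;
    # facings, actions, and sprites never enter the cache
    return list(zip(xs, ys))
```

The tradeoff is exactly the one the text mentions: the parallel-array version is harder to read, because one entity's state is scattered across five containers.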

Objects also use additional memory.  A simple struct holding the same data will only have the data overhead.  Depending on the language, an object may be storing type information, class meta-data, and some kind of function table (a VTABLE, in C++).  In some languages, objects have a selection of inherently built in data and functions that can make objects quite expensive.  In addition, looking up functions from a VTABLE requires some pointer lookups, which make using objects more expensive for the CPU as well.  Overall, objects are just an expensive way of making programming resemble the real world a bit more, but the cost is that it makes it harder for the computer to understand, which results in inefficiencies.

In some cases, objects can actually be good for performance.  If most accesses to an object use all of the data contained in the object, objects can actually improve cache locality.  There may still be higher function call overhead, but that is generally less expensive than cache locality problems.

Objects can be bad for performance, and this is a more serious problem for games.  Excessive use of objects, or using objects where it is not appropriate can come at a big cost.

In general, and especially in games, you should only use objects where they make sense.  Objects should not be used as containers for a bunch of random information.  They should encapsulate closely related data and functions used to manipulate that data.  They should also avoid being bottlenecks in your game.  Some inefficiency can be tolerated, if it is not somewhere that is used so frequently that it becomes a problem.  When dealing with parts of the game that are used frequently though, it is important to consider things like cache locality and how the use of objects will affect memory usage and performance, and avoid using objects entirely where they don't make sense.

Compiler Optimization and Programmer Intent

One of the most valuable functions of a modern compiler is optimizing the machine code it produces.  A well designed compiler can do things like taking a short loop with a predictable size and unrolling it into a sequence of instructions without any branches.  Another optimization is inlining, which puts the code for small functions directly into the code where they are used.  Because branching is expensive, these reductions in branching improve the performance of the code without making it any less reliable.  When a compiler does this, it is determining the intent of the programmer from the code that it is given, and it is producing different but equivalent code that is more efficient.  Compiler optimization is an essential tool in modern application development, but it still has its limitations.

Compiler optimization is all about programmer intent.  If the compiler can figure out what the programmer wants, then how to do it becomes very flexible.  Unfortunately, imperative programming languages are all about the how, and determining what the programmer wants based on how the programmer says to do it is very difficult.  Declarative programming languages, including functional languages, focus far more on the what than the how, which allows for far more optimization, however programming large games requires a familiarity with the programming paradigm that very few programmers ever get with non-imperative languages.  So, unless you want to learn to write games in languages like Haskell (which is not only possible, but honestly not that hard once you gain a bit of experience), it is important to understand what assumptions a compiler can and cannot make about the intent of your code.

Unless you are using a functional language, it is rarely profitable to put a lot of effort into giving your compiler information about your intent.  C99 introduced the "restrict" keyword into the C language to tell the compiler that certain kinds of pointer ambiguity do not exist, which allows it to optimize more.  The "volatile" keyword in C and C++ exists to allow the compiler to optimize certain pointer operations by default, while still allowing the programmer to tell the compiler when not to optimize.  Beyond this though, the built-in compiler optimization is generally enough, and when it is not, you have to optimize in assembly, find another solution to the problem, or just deal with it.  There are some valuable things to keep in mind that can help avoid making it difficult for the compiler to optimize.

One of the biggest hurdles to optimization is the unnecessary use of objects.  By encapsulating unlike but related data, you force the compiler to group that data together in memory, and you tell the compiler that you intend to use that information as a group.  When your actual usage patterns do not fit this assumption and do not work well with grouping dissimilar data, you put the compiler in a position where it will misinterpret your intent, resulting in significantly worse performance.  Even when encapsulation seems very appropriate, it can be the source of a lot of performance problems, because it sends the compiler mixed messages, and the compiler essentially has to guess.

Sometimes built in language features can help you provide information about intent, but it is also important to understand that built in features are often so generic that they are not optimal for any specific application.  The cout stream in C++ is a great example of this.  It can do anything you need, but because you never need everything it does, all of that extra functionality is wasted cost.  On the other hand, Python has a feature called a list comprehension that reduces what would normally be a for loop appending items to a list into a single statement, and in doing so it tells the compiler that your only intent is to create a list.  This allows for much more optimization, and the result is that list comprehensions are almost always faster than regular for loops for generating lists.
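The difference is easy to see side by side:

```python
# Explicit loop: the interpreter must look up and call append on every
# iteration, and cannot know in advance that the only goal is a list
squares_loop = []
for i in range(10):
    squares_loop.append(i * i)

# Comprehension: the single statement declares the intent directly,
# which CPython compiles to a tighter, specialized bytecode loop
squares_comp = [i * i for i in range(10)]
```

Both produce the same list; the comprehension simply gives the interpreter more to work with.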

Because imperative language compilers optimize by trying to determine the programmer's intent, it is good practice to include as much intent information as you can.  Unfortunately, the amount of intent you can include is very limited and can cost far more time than it is worth.  Sometimes the best way to optimize is to write up some simple code for different ways of solving the problem and then performance testing them.

Bottlenecks

Bottlenecks have been mentioned several times so far, but what are they?  Every program has places where it spends a lot of time.  In a video game, this might be the main update function, the graphics rendering function, and the event handler.  These are bottlenecks, because if one of these things is slow, it slows down the whole program.  Specific operations can also be bottlenecks, if they happen a lot.  In general, anything that needs to happen every single game loop is a bottleneck, and if a bottleneck is slow, it does not matter how fast the rest of your program runs, your program will be slow.  This means that bottlenecks are the most important places for optimization.

The first major bottleneck is memory access.  Things like allocating memory, following pointers, and even just reading or writing memory take a long time.  Avoiding unnecessary memory access can be hard, but with modern processors often having to wait up to 200 cycles for a memory access, this is definitely a good place to consider optimization.

In a major loop that happens a lot, poorly optimized code can be a significant bottleneck.  If your program spends 50% of its time in a single function and the code in that function takes twice as long as it should, your program is wasting 25% of its time on poorly optimized code.

I/O is a major bottleneck as well.  Graphics I/O not only requires a lot of memory access, it also requires even slower communication with the graphics card.  Since rendering needs to happen at a very high rate, the cost of this bottleneck can be very high.

Some bottlenecks can be optimized away.  For things like I/O that cannot, concurrency is a possible solution.  Concurrency means running two different tasks at the same time, so that one task can keep running while another task is waiting for a slow operation.  Concurrency comes with its own costs, and it can sometimes be its own bottleneck, but a well designed game can put the rendering in a separate thread to eliminate it as a bottleneck for the rest of the game engine.
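A minimal sketch of that idea is below.  The function names and structure are hypothetical (a real engine's threading is far more involved), and a sleep stands in for the slow graphics I/O, but it shows the core point: the update loop keeps running while the renderer waits.

```cpp
#include <atomic>
#include <chrono>
#include <thread>

std::atomic<bool> running{true};
std::atomic<int>  frames{0};

// Hypothetical render thread: in a real engine this would block on the
// GPU and driver; here a short sleep simulates that slow I/O.
void render_loop() {
    while (running.load()) {
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
        frames.fetch_add(1);
    }
}

// The game loop is free to update state while render_loop waits on "I/O".
int run_game(int ticks) {
    std::thread renderer(render_loop);
    int updates = 0;
    for (int t = 0; t < ticks; ++t)
        ++updates;                 // game state updates proceed concurrently
    running.store(false);
    renderer.join();               // shut the render thread down cleanly
    return updates;
}
```

The cost mentioned above shows up here as the atomics and the join: any data shared between the two threads now needs synchronization, which is itself overhead and a potential bottleneck.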

When optimizing, bottlenecks should always be the biggest focus.  Optimizing a function that takes only 1% of the execution time has almost no effect on performance, even if you optimize it by 90%.  Optimizing a function that takes 50% of the execution time can make a difference, even if you only optimize it by 10%.  It can be tempting to try to optimize everything, but the best strategy is to only optimize where it makes the greatest difference, and this is where the bottlenecks are.

Multitasking

Perhaps one of the biggest excuses in software development for lazy optimization is the enormous amount of resources available in modern computers.  Try telling a programmer he or she should be using a short instead of an int, and you will get a lecture about how modern computers have so much memory that it will never even matter.  In some cases this may be true, but it is good practice to avoid flagrant waste.  Likewise, inefficient algorithms are quite common, and they are often written by people who believe that modern processors can handle anything, so it does not matter.  If the sort takes 0.002 seconds instead of 0.001, no one is ever going to notice, right?  Unfortunately, this idea that modern computers have unlimited resources is just not true.

It is fairly common now to find computers with 8GB of RAM.  Most of the offerings in stores have at least that much memory, and even most online offerings have plenty of memory.  So, if you are making a new game, and you are fine with limiting your market to people with 8GB of memory, you can use most of that, right?  After the OS, which uses maybe 1GB or less if it is Linux or between 1GB and 2GB if it is Windows, your game should be fine using 6GB to 6.5GB.  Admittedly, most games are not up to this point yet, but it does not matter.  For an immersive, full screen game, you can probably expect to have around 3GB to 4GB of that 8GB.  Why?  What else is your customer using on their computer at the same time?  Almost no one, not even most gamers, has absolutely nothing running in the background while playing.  For games like World of Warcraft, Minecraft, and Terraria, which are all fairly immersive, full screen games (though not always played literally full screen), the typical player also has a browser open in the background, and it is pretty routine for modern browsers to use about half a GB of memory just for the browser and an empty tab.  If you have one or two pages open, with information about the game you are playing, you can expect closer to 1GB, and don't forget, the OS is still probably taking between 1GB and 2GB.  Few gamers have only the tabs they need open though.  They might have a tab for email, a tab for Facebook, and one or two tabs for other things, which leaves you with even less memory.  Even if your game style does not lend itself to this kind of play, you should at least expect anti-virus software to be using some resources, and maybe things like a video driver control tool, a game launcher program (Steam or the Battle.net launcher), a control tool for a touch pad, and any number of other things.  More casual gamers might even leave very heavyweight programs like MS Office open in the background, so they don't have to wait for them to start when they go back to what they were doing before.  In other words, your customer's computer is not your game's playground.  It is a place where your game is sharing resources with a lot of other things.

Processor power is harder to measure, but it is equally important to play nicely.  If your game never sleeps, you will likely end up consuming all of the processor power of at least one core.  If your game is multithreaded, you could consume one core for each thread.  Even a simple program that just loops repeatedly, waiting for an event can slow down your customers' computers.  Most programming languages have some kind of sleep function, and pretty much all game libraries have their own sleep function that is better.  These sleep functions are not merely loops that keep checking if enough time has passed.  They actually give up control to the OS, while requesting the OS to awaken them once the specified time has passed.  This allows other programs (including OS processes) to use the processor while the game is waiting (for example, for the next frame).  Just as with memory, your customers' computers almost certainly have other programs running in the background that need CPU time, and often your game is relying on some of these (which means that if you don't give up unneeded CPU time, it may, ironically, cause your game to lag and perform poorly).  Don't forget that your game will always have to play nice with other processes, unless it is the only thing running on an embedded system.
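A sketch of such a frame loop using the standard C++ sleep facility is below (the function name run_frames is made up; a real game would do its update and render work where the comment sits):

```cpp
#include <chrono>
#include <thread>

// A frame loop that yields its spare time to the OS instead of spinning.
// A frame_time of ~16ms gives roughly 60 updates per second.
void run_frames(int n, std::chrono::milliseconds frame_time) {
    using clock = std::chrono::steady_clock;
    for (int frame = 0; frame < n; ++frame) {
        auto start = clock::now();
        // ... update game state and render here ...
        auto elapsed = clock::now() - start;
        if (elapsed < frame_time)
            // sleep_for hands control to the OS scheduler until the rest
            // of the frame time has passed, instead of busy-waiting and
            // burning an entire core checking the clock in a loop.
            std::this_thread::sleep_for(frame_time - elapsed);
    }
}
```

Game libraries typically wrap exactly this pattern in their own timing functions, often with better resolution than the generic standard library sleep.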

Another concern with the attitude that every modern computer has plenty of resources is that it is a lie.  The vast majority of modern computers have less than 4GB of memory, and most of them have very weak processors compared to what you might have in your laptop or desktop.  I am talking about mobile devices, like smart phones and tablets.  Not only do these devices typically have the resources you might expect to find on a Pentium 2 or Pentium 3 era machine, they also tend to have a lot more things running in the background.  Most mobile device users do not realize that they can close programs, and even those that do often don't realize that it is a good idea to do so.  There is no way to get around the problem where a user has so many open applications that your game cannot run, but you can put effort into making sure your game is not squandering resources unnecessarily.

Writing games with the expectation of having exclusive access to all system resources will result in failure.  If you write a game that can take up to 8GB of memory to run, you should not expect it to run well for your customers on less than 12GB or 16GB of memory.  If your game performs fine on an average processor with nothing else running, don't expect your users to get good performance without a fairly high end processor.  If you really want to do it right though, instead of doing tests on a clean system running only the OS, try running it with all of the applications you normally have open when working.  A high end IDE might be comparable to having a few MS Office documents open.  Leave your browser open with all of the tabs you are currently using (if you only have a few, open 3 or 4 more).  Make sure your anti-virus program is doing whatever it does by default.  If your game runs perfectly fine under these conditions, you can probably expect anyone with an equally powerful computer to also run it fine.

The important part to remember is that modern computer users multitask, and if it is even possible that the user could be running something else at the same time as the game, you should assume that your game will be just one program in the ecosystem that has to get along and share with many others.


Efficiency has largely been thrown aside in software development and even much of computer science education, but it is no less important than it has been in the past.  As computers have become more powerful, programs have also become more hungry and users have increased their multitasking.  In the past, software developers and especially game designers were lucky, because even though resources were far more limited, at least they could rely on the fact that all available resources were fair game.  Modern computers are more like a habitat that many programs inhabit at the same time.  If your programs don't play nice with everything else there, then you will find it hard to get and keep customers.  Instead of looking at 16GB of RAM and thinking that nothing could ever use it all, consider how many other programs will be running, each taking a chunk of that.  Playing nice may not make your game successful, but no one wants to play a game that requires them to shut everything else down just so it will run smoothly.

Thursday, June 16, 2016

Video Game Development: Efficiency Part 1

What kind of computers do game developers program on?  I cannot tell you the answer to this question, and I am sure it varies widely, but there are some important things we know about programming, and especially about writing large programs, that can give us some insight.

Computer programs take time to compile.  Compiling very small programs typically takes mere seconds.  Compiling medium sized programs can take significantly longer.  For example, the Linux kernel can take from 1 to 8 hours to compile, depending on your machine.  It may be possible to get it down to 30 minutes, but this is still quite a long time for someone sitting in an office, repeatedly compiling a game to test the most recently changed or added code.  Many games are smaller than the Linux kernel, and thus should take significantly less time to compile.  Most mainstream games, however, are much bigger than the Linux kernel, and thus should take significantly longer to compile.  There are two ways to improve the speed at which your program compiles.  Adding more memory to your machine can reduce how many disk reads and writes are necessary, and large programs can even start swapping memory to disk if you don't have enough.  The other way to improve compile speeds is to get a better processor, and if your compiler can multi-thread compilation, having more cores can make a huge difference.  The point here is, if every compile took many hours, video game companies would have a hard time being profitable.  If, however, the developers all have the fastest machines, with as many cores as possible, and with plenty of memory, then compilation time can be very dramatically reduced, which will allow the company to be far more profitable.

What kind of computer does the average person have?  The average person does not have a computer with a very fast 4, 6, or 8 core processor.  The average person does not have enormous amounts of memory.  In short, the average person has a cheap Walmart computer that is great for surfing the web (without too many tabs open at a time), editing Word documents, and watching Netflix.  The average person does not need a computer that can compile a large video game in less than an hour.

What kind of computer does the typical member of your target audience have?  This is the important question.  There are some gamers who will buy a new computer, if their current one cannot run a game they want.  Don't expect anyone to do this for a game developer that is not very well known though.  It is essential to realize that if your game does not run well on the computers owned by your target audience, your target audience will not be part of the market for your game.  In other words, you will likely find that very few people will buy it.

Platforms

There are many possible game platforms, and some of them are extremely convenient to develop games for.  Knowing your platform well is very important, if you want your target audience to be able to run your game.

Aside from politics, expensive development packs, and certifications, perhaps the easiest platform to develop for is a console.  Consoles provide a static set of requirements.  If you are developing a game for the Wii, the Xbox One, or the latest PlayStation, you don't have to worry about what processor your customers will have or how much memory your game will need to work with.  You can buy the console you are developing for, and then when you test your game on it, it will tell you exactly how your game will perform for every other person that has the same console.  The biggest downside is that new console versions come out fairly regularly, which often makes console games become obsolete much faster than PC games.

PCs (including Linux, Mac, and Windows PCs) are a very open platform.  You don't need to buy a dev kit from anyone to program for PC, and it is legal to sell your games without expensive certifications.  You also don't have to deal with some of the politics surrounding console game development that tend to push smaller game developers out of the market.  Unfortunately, the ease ends there.  You will need to make a choice up front: Which OS do you want to create your game for, or do you want to try to make it cross platform?  If you have to pick a single OS, Windows is almost your only option for a really profitable game.  If you decide to go cross platform, you are either stuck using a less efficient interpreted or bytecode compiled language, or you are stuck writing OS specific code for each OS you want to support, and then dealing with the build configuration needed to make sure the right code gets compiled for each platform.  In any case, a cross platform game means that you will need to do testing on all platforms you expect it to run on.  In general, PC is the most accessible platform, but it does come at a cost.

Mobile devices are currently an enormous part of the video game market, and this market is only growing.  Mobile games also seem to have a much higher chance of going viral than PC or console games.  As a market that still has a lot of room for innovation and improvement, this might be the market with the highest potential profit.  Unfortunately, it suffers from many of the problems of PC, but it does not have most of the benefits.  Officially, mobile is a very open platform; however, it can be quite difficult to get into, because it has a steep learning curve.  In addition, you will find yourself dealing not only with multiple operating systems, but also with an enormous ecosystem of devices with different processors, memory, screen sizes, screen orientations, and even input interfaces (buttons, touchscreens...).  If you want to write games for iOS, you will have to deal with all of the politics, dev kit, and certification that come with developing for consoles, but you don't get the dedicated gamer base that successful consoles always have.  Writing games for Android does not come with as much drama, but you still need a free dev kit, and you are largely constrained in your choice of language, unless you are really in for some pain and suffering.  One of the worst parts of mobile game development (which is generally also a problem with consoles) is that you cannot generally develop on the device.  You have to develop and compile on another computer, and then load the game into a virtual machine or onto your device for testing, which increases the development time.

There is one more reasonably common platform for games: Embedded systems.  They used to be much more popular, and technically, some handheld consoles qualify as more complex embedded systems, but I am talking about the handheld game devices that come with only one or a few games.  New games cannot be added to these, and the kind of games you find on them are generally fairly simple, but they can also be quite fun.  Programming for embedded systems can be very challenging.  For an embedded system game, the game itself takes the place of the OS, which means that you will also probably have to write some of your own device drivers and utility functions.  Learning a new embedded system is a lot of work, and if you don't use the knowledge constantly, it is easy to forget, but embedded systems can be extremely cheap.  You will find this kind of handheld game for between $5 and $20 in most cases.  They are cheap enough to be impulse purchases.  Generally more experienced gamers will save their money for something better, but younger children tend to really enjoy them, and they often pressure their parents to buy them.  They also make good stocking stuffers for Christmas, or small gifts for other holidays.  In general, embedded system games are not so good for small game companies, because LCD screens can be fairly expensive if you cannot buy in bulk, and an embedded system handheld game is very unlikely to sell well at $40 to $50 each.

Resource Management

One way to ensure that a game fails is to include so many special effects that your target audience cannot run the game on their computers.  Even if most of your target audience does have powerful machines, you will almost always lose market share if your game has very high system requirements.

Any time you add something to your game that will require more resources, it is a good idea to ask yourself if that thing is really necessary.  For some types of games, 3D graphics with realistic shadows may be necessary to appeal to your target audience.  In rare cases, these shadows may even add significantly to the game.  Most of the time though, you can reduce resource requirements by using something less realistic but also less resource hungry, and no one will notice.

Why is it that in the age of high end 3D video games, some of the most popular games use low resolution pixel graphics?  Terraria and Minecraft both use graphic styles that are way behind the times.  Terraria, admittedly, has absolutely beautiful pixel graphics, but so did Commander Keen, and almost no one is playing that anymore.  Minecraft does have 3D graphics, but not really in any style ever used historically.  Minecraft takes 3D graphics and puts low resolution pixel textures on them, giving it a unique sort of lazy grunge look, and it is one of the most popular games of all time.  These two games, and a host of others, make it clear that you don't need GPU intensive graphics to impress and be profitable.  A good theme (sandbox, for Minecraft and Terraria) or a good story (plenty of indie games found on Steam) is often far more valuable to gamers than epic graphics without either of those things.

The point here is, don't add expensive bells and whistles to your game when they are not needed.  Sometimes high end graphics are justified, but if they come at the expense of the game actually being good, that is a problem.  Adding high end graphics to try to appeal to a wider audience may work quite well, but if the additional resource usage puts your game out of reach of most of your audience, you spent extra time and money for nothing.

Another factor in resource management is how much time and money a game costs to make.  The longer the game takes, the more it costs, and the more impressive things like graphics are, the more it costs.  It is important to consider how an impressive feature will affect your schedule and your cost.  If the game comes out too late, something similar but better might upstage it, and if the cost is too high, you may find yourself having to charge more than your target audience can afford, just to break even.

Optimization

Sometimes your game will end up requiring too many resources, but there is nothing you can reasonably remove to improve the situation.  You can try to take care of this problem early on, but beware of premature optimization.  In many cases, optimization can increase your market by making your game run on older or less expensive machines.

There are several ways to optimize CPU performance.  The first is precalculation.  It is fairly common to use a calculation in a game in multiple places.  When doing movement math, you may end up with a factor that needs to be multiplied by both your x and y coordinate.  Instead of doing the math twice, in two long calculations, store the factor in a variable, so you only have to calculate it once.  So long as all of this happens within a function, the memory will be released as soon as it returns, making the performance gain free.
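A minimal sketch of this precalculation follows.  The movement math (normalizing a direction and scaling by speed) is just an illustrative example, not taken from any particular engine:

```cpp
#include <cmath>

// Naive version: the shared factor is computed twice, once per coordinate.
void move_naive(float& x, float& y, float dx, float dy, float speed) {
    x += dx * (speed / std::sqrt(dx * dx + dy * dy));
    y += dy * (speed / std::sqrt(dx * dx + dy * dy));
}

// Precalculated version: the factor is computed once and stored in a
// local variable, which is released as soon as the function returns,
// so the performance gain is essentially free.
void move(float& x, float& y, float dx, float dy, float speed) {
    float factor = speed / std::sqrt(dx * dx + dy * dy);
    x += dx * factor;
    y += dy * factor;
}
```

Both functions produce identical results; the second simply avoids paying for the square root twice.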

There are also some micro level CPU optimizations, however, these should not even be considered until the product is almost finished, and even then, they should generally only be attempted after everything else.  One valuable CPU level optimization is cache coherency.  Because memory access is very slow, the CPU will copy chunks of memory into cache, which is much faster.  A CPU only has a limited amount of cache though, and if you consistently access items in memory that are very far from each other, the CPU will have to constantly reload the cache, which will cost a lot of time for the memory access.  Arranging your data so that things like loops will stick to accessing only data that is close together will minimize these "cache misses," which can have a very dramatic effect on performance.  The catch with cache coherency optimization is that there is a heavy dependence on what CPU you are using.  While most modern CPUs use similar cache systems, failure to account for differences can make your game run really well on a system with one processor, but rather poorly on a system with a similar processor that is a different model or made by a different manufacturer.  If you are programming for a console, you don't have to worry about this, but if you are programming for PC or mobile, there are practical limitations to how much you can get out of this, if you still want your game to run well on any system that meets the system requirements.
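The classic way to see this effect, sketched here under the assumption of a row-major 2D grid (the function names are hypothetical), is to traverse the same data in two different orders:

```cpp
#include <vector>

// A 2D grid stored row-major in one contiguous block of memory.
constexpr int W = 1024, H = 1024;

// Cache-friendly: consecutive iterations touch adjacent addresses, so
// each cache line loaded from memory is fully used before eviction.
long long sum_row_major(const std::vector<int>& grid) {
    long long total = 0;
    for (int y = 0; y < H; ++y)
        for (int x = 0; x < W; ++x)
            total += grid[y * W + x];
    return total;
}

// Cache-hostile: successive accesses are W * sizeof(int) bytes apart,
// so on a large grid nearly every access can miss the cache.
long long sum_column_major(const std::vector<int>& grid) {
    long long total = 0;
    for (int x = 0; x < W; ++x)
        for (int y = 0; y < H; ++y)
            total += grid[y * W + x];
    return total;
}
```

Both functions compute the same sum, but on grids too large to fit in cache, the second can be several times slower purely because of its access pattern; exactly how much slower depends on the specific CPU's cache sizes.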

The last CPU optimization we will discuss is assembly level optimization.  Even highly optimizing compilers are not perfect, and because the compiler does not always know your intent, sometimes it has to make assumptions that may not be true.  If you are familiar with assembly language, you can fix these optimization issues, which can yield very dramatic performance improvements.  Generally though, this should be avoided.  Assembly languages are processor specific.  If you are trying to make a cross platform game, you will have to write the assembly optimization for every processor type you expect the game to run on.  In addition, there is only one case where assembly level optimization will make a noticeable difference: where the thing you are optimizing is a bottleneck.  Improving the performance of something that is already blazing fast won't do much for the overall performance of your game.  If, however, you come across a performance problem caused by something that uses a significant amount of the execution time, assembly level optimization might be a good solution if nothing else works.

In many cases, often including mobile devices, memory usage is an important efficiency factor.  While most modern PCs have more than 4GB of memory, your typical cell phone or tablet will have half that or less.  It can be fairly easy to use all of that up quite quickly, if you are not paying attention.  A good example of a memory optimization problem is map tile data.  If we are making a game that has a tiled map, what data do we need to store for each tile?  Examples might include x and y coordinates, tile type, whether or not units can walk over the tile, and the sprite to display.  If the game is a strategy game, we might want to also store information like resources available on the tile and who controls it.  It turns out though, that for a large map, tracking all of this data can be extremely expensive.  It is not always obvious, but because a flat map is a 2 dimensional object, its memory usage grows with the square of its side length, much faster than linearly.  A better option might be to store the map in a 2D array, where the array indices represent the location of the tile (so the coordinates never need to be stored at all), associate each tile type with its sprite, so we only have to store the type, and perhaps use bit packing to store things like whether a tile is traversable and who owns it in a single 8 or 16 bit variable.  The takeaway here is, when storing state data, consider whether you really need each element of data.  Some data can be left out and calculated from other data (though, on a slower system with a lot of memory, this might not be optimal).  Some data can be associated with other data, allowing it to just be looked up, instead of storing a copy with every object.  Some data is just not really important.  Cool features are nice, but a game that performs well is better.
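The bit packing idea described above can be sketched with C++ bit-fields.  The field widths here are arbitrary illustrative choices, and strictly speaking bit-field packing is implementation-defined, though mainstream compilers pack this struct into one byte:

```cpp
#include <cstdint>

// One packed tile: the tile's map position is implicit in its array
// index, the sprite is looked up from the type, and the remaining
// per-tile state fits in a single byte.
struct Tile {
    std::uint8_t type     : 5; // up to 32 distinct tile types
    std::uint8_t walkable : 1; // can units traverse this tile?
    std::uint8_t owner    : 2; // up to 4 owning players
};

static_assert(sizeof(Tile) == 1, "expected one byte per tile");
```

A 1024x1024 map of these tiles costs 1MB, where storing the same fields as separate ints, plus explicit coordinates, would cost over 20 times that.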

Another facet of memory optimization is data types.  This is a problem I see even from professionals.  It is trivial to consider what data type you actually need, and this does not qualify as premature optimization.  If there is no way your variable will ever exceed 255 or go below 0, use a char instead of an int!  If you end up creating many instances of an object or struct, those 3 unnecessary bytes per instance can waste an enormous amount of memory.  In general, if you are programming in a language that allows you to choose your data types, spend a second or two thinking about what will work best, instead of just picking a common default.  Far too many programs, games and otherwise, waste tons of memory using ints for every integer value, even when a short or char would do perfectly fine!  Also, consider whether an unsigned type might be better.  There are plenty of cases where negative numbers are not needed, and the extended positive range of a smaller unsigned type would be sufficient.  It is almost always worth considering whether a smaller type would be sufficient.  That said, always consider carefully.  You might use an unsigned char to represent your tile types, because you never expect to use more than 256 different tiles on a map, but then one day, you find that this is limiting your creativity on a certain map you want to make.  Doubling the size of the type might not always be feasible, but it does not take much math to get a good estimate of how much memory a certain size of map will take up, given the data and types the tiles contain, and you may find that for your game, the effect is not as dramatic as you expected.
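To make the cost concrete, here is a small sketch.  The type names are hypothetical, and the exact numbers assume a typical platform where int is 4 bytes:

```cpp
#include <cstdint>

// The same tile state stored with lazy "int everywhere" types versus
// explicitly sized types.
struct FatTile  { int type; int walkable; int owner; };   // 12 bytes
struct LeanTile { std::uint8_t type, walkable, owner; };  //  3 bytes

// For a 1024x1024 tile map, FatTile costs about 12MB where LeanTile
// costs about 3MB -- a 4x difference for identical information.
```

Doing this estimate takes seconds, and it is exactly the kind of math that tells you whether a smaller type matters for your game or not.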

The takeaway for optimization is, there are some good practices (choosing variable types wisely) that you should probably already be doing, which will avoid unnecessary performance issues.  Aside from those things though, optimization should occur only when necessary.  Premature optimization will cost a lot of time and money, and it rarely ends up targeting the biggest problems.  Don't try to solve problems that don't yet exist.

Graphics

There are some valuable things to know about graphics that can help with optimization.  The first is that graphics operations are slow I/O operations.  Communicating with the video card is generally even slower than communicating with memory.  Minimizing graphics operations can make a big difference for performance.  OpenGL and some other graphics card interfaces allow the program to store graphics data on memory that is on the actual graphics card.  This is an excellent way of improving performance.

The really important things to understand are related to human biology.  The human body, and especially the eyes, have some inherent limitations, and being aware of these limitations can help you avoid over optimizing your game.  The first thing to know is that few humans can see any difference past about a 50Hz frame rate.  The second thing to know is that more than 60Hz is beyond the perception of almost all humans, and if you are using a 60Hz monitor, it won't make any difference anyhow, because the monitor cannot display frames faster than that.  Since most monitors have a refresh rate of 60Hz or 75Hz, the hardware limitations make it wasteful and pointless to try to push your frame rate past 75Hz, and biological limitations make it pointless to go past 50Hz or 60Hz.  For some perspective, subliminal messages, which are generally single frame ads spliced into movies, usually go unnoticed at a mere 24Hz.

One other way of optimizing graphics is to put your rendering in a separate thread from the main game loop.  This will allow your game loop to keep running while the renderer is waiting for I/O to finish.  For the most part though, the best way to improve graphics performance is to simplify the graphics, so the GPU has less work to do.


This is an extensive topic, so the rest will be covered in a separate post.