I thought I'd start a thread where we share our optimization strategies. To start off with, I'll list the ones I know. I start optimizing by using the strategies at the top and work my way down the list (although some of them can be moved up and down the list). Along with the strategies, I've listed some examples. If you know of other optimization strategies, feel free to list them. If you think some of these should be in a different order, feel free to mention this, but bear in mind that some of these points can be moved up or down depending on the type of project. The optimizations listed below are mostly for speed. But first, some general tips.
[EDIT: I have now started an article on the Allegro wiki on optimization. The original text is left below so that the discussion in this thread still makes sense]
General:
Does it really need to be optimized?
Consider high-level optimizations before low-level optimizations.
Are you using the right tool for the right job? This concept could apply to a broad set of aspects of programming. It could be the right algorithm, the right API, or even the right programming language.
Look at what you are trying to do (algorithmically, mathematically, the way it is written, how it utilizes resources) and see if it can be done in a more efficient manner.
With enough practice, lower-level optimizations are automatically taken into account by the programmer to the point that they don't have to think about them anymore.
Optimization strategies:
</li>
eg: Swapping two large blocks of data in an array: Consider using a linked list instead where the same effect can be achieved with four pointer-changes for a singly-linked-list.
eg: If doing collisions, instead of coming up with the fastest collision-detection algorithm, you should try and figure out if the collisions need to be done in the first place. If you divide your world into sectors and you work out which objects are in which sectors, if two objects are not in the same sector, it means a collision-check will always fail so it can be skipped. The simplest method is just to divide your world into an evenly spaced grid and use the modulus-operator (see further down for advice on choosing a moduland) to work out which sectors the object occupies. More complex methods include Quadtrees/Octrees and even BSP trees.
eg: With collisions, use a bounding-box or bounding-sphere (or some easier-to-check-for-intersection geometry) to determine if they are close enough to do a more detailed collision-check.
eg: Imagine objects spinning round the X axis in 3D space in an evenly-spaced circle on the Y-Z plane at the same speed. Which order are they drawn in? We could sort by Z position. However, if we look at the problem in reverse, we see that instead of finding out which position an object occupies, we find out which object should occupy a given position. We work out which object belongs at each position in the circle starting from the front-most and working my way to the back on both sides. We can very easily work out the draw-order this way.
eg. Polygon sorting vs. Z-buffering.
eg. When working out the position of a moving object with forces applied to it, instead of using Newton's equations of motion, use Euler's integrator.
eg. When using collision bitmasks, you can cheat by not using a 1-1 mapping of sprite-pixels to collision-mask pixels (eg. 1 collision-mask pixel represents 2x2 sprite-pixels). Likewise in 3D, mesh-mesh collisions can be optimized by using collision-meshes (low-poly meshes that look similar to the high-poly models).
If calculating certain mathematical functions by hand (eg. trigonometric) using taylor-series (when using pixel shaders, this is often how they are done), find out how many terms you can get away with leaving out until the results deteriorate too much.
eg. Same problem as above with objects spinning: If we ignore solution above and just decide to go with sorting, if we know our sorting algorithms, we can pick one. QSort is good general-purpose one for randomly sorted data. However, if spinning round in a circle, lets assume for the time being that we're updating the position once per tick rather than using delta-time. If the objects are not moving so fast that they overtake the next object's position in the next tick, we can chose a sorting algorithm that works well when little or no change in the order of Z positions has taken place. Likewise, if we use delta-time, there's no upper-bound on the amount the objects can move between ticks, so the objects could have moved any distance (even more than a full revolution), so we'd have to chose a sorting algorithm that copes well with more or less random data.
General platform-based optimizations:
</li>
eg. if the API supports a 2D tilemap, to get the tile to the left, GetTileAt(x-1,y) may actually be faster than *((¤ttile)-1).
[/list]
Some of the optimizations below can be done with a good compiler:
Mathematical optimization (finding out if equations can be made simpler if the range of inputs is known (this is especially true for non-continuous functions)). In a similar vein, equations can be re-written to use fewer intermediate values, but on the other hand, using fewer values may be more likely to cause pipeline-stalls as we're more likely to rely on the result of recently-computed values.
eg: RGB<->HSV code can be simplified if we're certain the saturation is always going to be 1.
eg: The HSV->RGB code can be written in an easy-to-understand form, but by mathematically re-arranging it, it uses fewer values but is not so easy to figure out from looking at the code.
eg. Collision bitmasks instead of pixel-masks. You can check for collisions of as many pixels as there are bits in a machine-word jusing just a single 'AND' operator.
Make use of keywords such as const, restrict, etc. (register is almost obsolete because compilers are good at register-juggling).
eg. For unsigned integers, x/8 can be optimized to x>>3, but x/7 cannot be optimized this way.
eg. If the target-CPU supports SIMD instructions (instructions that work on several values at once), write code that's easier for the compiler to make use of SIMD instructions.
eg. If the CPU uses deep execution-pipelines, try and reduce the amount of branching (if's) in your code.
If iterating using an index, and the index starts at zero, count down instead of count up. This is because on many machines, there are specialised check-for-zero instructions that do not require loading a constant to check against. However, CPU-caches are more optimized for counting up than counting down (forwards iteration).
Do have faith in the compiler. For example, it should automatically chose the faster of n+n, n*2, n<<1 if it sees something similar in the code and even picks the appropriate choice for the sub-variant of CPU you wish to optimize for. However, if you're programming in machine-code, you have to make this choice yourself.
Be aware of which instructions take up many CPU cycles, and which instructions can be done in the background (on x86 cpus with an FPU, floating-point operations can be carried out concurrently with the non-floating point operations of a CPU).
Loop-unrolling.
Use shifts instead of integer multiplies/divides by powers of two, and use bitwise AND for modulus by a power of two.
Make sure as much code/data as possible can fit into the cache. Some processors have seperate caches for data and code. However, if there are more processes running than available processors, the cache gets invalidated often. Also, bear in mind that sometimes, data can be padded to make access more efficient (some CPUs are better at accessing data aligned on a machine-word sized boundry), and that some CPUs are better at accessing data using the machine-native word-size (compilers usually do a good job of choosing the best balance of padding vs. fitting into the cache).
Are floating-point real-numbers faster than fixed-point real-numbers? For most modern CPUs, floats are faster, but on machines without a specialised FPU unit (eg. mobile devices and old hardware), it's fixed-point numbers that are faster.
If the CPU can execute parts of code in parallel, interleave the instructions in a way that the result is not immediately used by the next instruction. This prevents pipeline-stalls.
If your target architecture uses a non-symmetric set of machine code instructions (that is, different registers can be operated on in different ways), getting the most out of the processor is trickier. This means there may be several tricks the compiler may have missed.
Self-modifying code can be useful if both speed and compactness is desired, or to save on registers by hardcoding constants, but avoid it if the CPU uses separate code and data caches (unless the modification occurs only rarely).
[/list]
AE.
The "General" section use a rather abstract wording, I'm not sure that the people who need this advice will understand what they mean, and that they are the FUNDAMENTAL rules for not wasting one's time and efforts:
- Understand what takes time in one's game (profile, learn the "cost" of a blit() vs. any basic maths)
- Understand why it's better to improve the speed of display_screen() by 5% than the speed of logic_update() by 1000%
- The power of laziness: Understand that most optimized games are efficient because they do less work, not because they do it faster.
---
Now as a response to every code-level optimization:
I prefer to optimize for the shortest time of debugging/troubleshooting.
Spending hours trying to understand why it doesn't do what you want is the worst part of game programming, it's very demotivating. When I started C, I was a crazy "optimizer", now I'm wiser(?) and more reluctant to use any coding shortcuts that would make checking for errors harder.
(Granted, I'm mostly thinking of "gameplay" code, the one which is actually fun to write and tweak, and normally takes 1% of the whole game's load in the first place.)
Ummm...
Do have faith in the compiler. For example, it should automatically chose the faster of n+n, n*2, n<<1
...
Use shifts instead of integer multiplies/divides by powers of two
To add to Andrei's advice, specifically thinking about x86:
Be aware of which instructions take up many CPU cycles, and which instructions can be done in the background (on x86 cpus with an FPU, floating-point operations can be carried out concurrently with the non-floating point operations of a CPU).
The Pentium was limited to one floating-point operation concurrently with one integer operation, everything since then can do a much wider and more arbitrary array of combinations.
Loop-unrolling.
Never, ever loop-unroll. It decreases the amount of your game's functionality that can fit in the CPU cache and usually doesn't tangibly increase speed anyway — branch prediction means that the CPU will automatically act as though the loop is unrolled without the cache cost, only losing some time for its efforts in the very final iteration of your loop.
and use bitwise AND for modulus by a power of two.
You should probably still do this, but most of the time the compiler will freely interchange. Note that dividing by two and shifting right by 1 isn't guaranteed to produce the same result on all CPU architectures and all compilers. In particular, shifts are not necessarily signed, and the rounding that applies during division of a negative integer was ambiguous at least as recently as C94. I'm not sure about C99.
Are floating-point real-numbers faster than fixed-point real-numbers? For most modern CPUs, floats are faster, but on machines without a specialised FPU unit (eg. mobile devices and old hardware), it's fixed-point numbers that are faster.
I'd change "old hardware" to "extremely old hardware".
1) Never make system calls
2) Exceptions are slow.
3) Linear search is slow.
4) Optimize inner loop for cache
5) Optimize inner loop for instructions
6) Optimize other for memory.
When talking about optimization you need to be more specific: are we talking about speed optimization? size optimization?
To be honest I am more interested in flexibility optimization, I mean code that can be easily modified in just minutes. I am rewriting parts of my projects to re-design my classes trying to keep only few classes heavily used, instead of a lot of classes rarely used. In my personal experience this is the way to achieve optimization in terms of flexibility.
Paul, that's great, but if you have something that is very easy to modify within minutes but has a Big-O of O(N^2) (think bubble sort for millions of entries), wouldn't less-readable and a "little more" memory intensive be in favor if it brought the speed of your sort to O(n log n) instead?
I'd change "old hardware" to "extremely old hardware".
I'd change it to "small hardware". anything with an ARM chip, for instance.
I haven't tried to write a game in years, but my early attempts failed because of the inability to make it run fast enough. (Trying to do too much blending, etc.)
The only strategy (besides better algorithms) that really works anymore is reduce the amount of data to be processed. This usually involves some sort of compromise between quality/performance.
Packing multiplies is a good example, a larger example is something like the pairwise-clustering color quantizer in Rendera which is hopelessly slow, and requires too much memory. Preprocessing the colors (to reduce the maximum number fed into the quantizer) was required.
Reducing those datasets while keeping the resulting compromise practical is the hardest part I think.
1) Don't use software graphics
Anything else is mostly for nothing.
Speaking of SIMD, even modern compilers (gcc, icc, ...) can't really do that much without help. The best way to aid the compiler in writing good, fast SSE-code is either ASM (yuck!) or better, by using SSE intrinsics. I admit that the cases where it can be used are rare but when it's applicable, it's a serious performance boost. Four calculations for the price of one... I'd take it any day!
Another thing about GCC is that you need to make sure your functions are actually being inlined with -Winline. The no-brainer way is to declare them "static inline" at the beginning of the C file, but if you want global inline functions they have to be done a certain way, which I don't remember at the moment.
Owell, //hatever
Holy Sh*t!
That is the longest lead-off post I've ever seen....







The "General" section use a rather abstract wording, I'm not sure that the people who need this advice will understand what they mean, and that they are the FUNDAMENTAL rules for not wasting one's time and efforts:
I put these in a separate 'General' section because these are more abstract and can be applied to several of the layers of optimization.
- Understand what takes time in one's game (profile, learn the "cost" of a blit() vs. any basic maths)
- Understand why it's better to improve the speed of display_screen() by 5% than the speed of logic_update() by 1000%
This is why I suggested using a profiler to identify bottlenecks. However, bear in mind that different machines will have their bottlenecks in different places. So for example, if you're used to software rendering on a desktop-PC and then you move onto hardware rendering on a mobile device, the bottleneck can very easily shift towards the logic-code instead of blit().
- The power of laziness: Understand that most optimized games are efficient because they do less work, not because they do it faster.
This is just another way of saying "Instead of doing it as fast as possible, find out if it needs to be done at all" (which is basically what I said in my second point). Likewise, the power of laziness can be applied to the first point (don't optimize non-bottlenecks or anything that's fast enough) - in this case, the power of laziness is applied to the programmer instead of the machine.
Now as a response to every code-level optimization:
I prefer to optimize for the shortest time of debugging/troubleshooting.
Spending hours trying to understand why it doesn't do what you want is the worst part of game programming, it's very demotivating. When I started C, I was a crazy "optimizer", now I'm wiser(?) and more reluctant to use any coding shortcuts that would make checking for errors harder.
(Granted, I'm mostly thinking of "gameplay" code, the one which is actually fun to write and tweak, and normally takes 1% of the whole game's load in the first place.)
Debugging is outside the scope of the topic of optimization which is what is being discussed. I do agree that it's important to fix the bugs (or be 100% sure what's causing the bugs) before optimization is done. The bottlenecks tend to be in relatively small proportions of the code, so for just about everything else, I'd focus on making it easy to read (which also means easy to debug and maintain - especially by another programmer). Debugging can be helped if you write code where the compiler can spot you doing things you're not meant to do, or if you use ASSERT()s in appropriate places. However if you've tracked down your bug to the implementation of a particular algorithm and you're considering replacing it with a different algorithm, you can automatically get rid of the bug by replacing the algorithm.
Ummm...
Do have faith in the compiler. For example, it should automatically chose the faster of n+n, n*2, n<<1
...
Use shifts instead of integer multiplies/divides by powers of two
Using shifts instead of divides/multiplies involves picking the right values to divide/multiply by in order that they can be optimized to shifts, whereas the choice of "n+n, n*2, n<<1" is the same regardless of the value chosen for n (or positive n in the latter case). Also, some advice applies only if writing in machine-code.
and use bitwise AND for modulus by a power of two.
You should probably still do this, but most of the time the compiler will freely interchange.
The trick is not to do the replacing yourself, but to recognise when there's an oppotunity for such replacement to occur. For unsigned integers, the compiler will change %8 into &7, but for %9, it cannot do this optimization. So if picking arbitary values, it's best to pick powers of two. Picking a good arbitary value involves little effort. A tile-map implementation involves lots of multiplies, divides and modulos. This is why you should chose tiles with sides equal to powers of two. If you chose a non power of two, the compiler can't optimize so easily.
Note that dividing by two and shifting right by 1 isn't guaranteed to produce the same result on all CPU architectures and all compilers. In particular, shifts are not necessarily signed, and the rounding that applies during division of a negative integer was ambiguous at least as recently as C94. I'm not sure about C99.
For unsigned integers, a shift will always give the same result as a multiplication/division of a power of two. I've seen some compilers that compile signed divide-by-power-of-two's into an if() statement that checks to see if the number is >0 and if it is, does a shift.
Loop-unrolling
Never, ever loop-unroll. It decreases the amount of your game's functionality that can fit in the CPU cache and usually doesn't tangibly increase speed anyway branch prediction means that the CPU will automatically act as though the loop is unrolled without the cache cost, only losing some time for its efforts in the very final iteration of your loop.
I should have elaborated. It's usually bad for the cache (if the machine has a cache) and should not be done in a high-level language. Compilers will do this for you if they think it's faster. Sometimes, a complicated side-effects-less non-state-based calculation can be done using the loop-index (and perhaps a few other paramaters). If the loop start and loop-end are both constants and the difference is small, the compiler may determine that the savings that can be made by unrolling the loop and pre-calcualting everything where the index is a variable (if variables other than the index are used, mathematical optimization is used on the equation). In some cases, the compiler may not spot this, but if the programmer knows what they are doing, they might be able to squeeze a bit of extra performance by manually unrolling the loop (although this particular optimization does make the code less readable and maintainable so should only be used as a last resort and only if it really needs to be "as fast as possible"). Also, as a compromise between cache usage and execution speed, loops can sometimes be partially unrolled.
When talking about optimization you need to be more specific: are we talking about speed optimization? size optimization?
I was mainly focusing on speed optimization but some of the stratergies can be applied to size optimizaion as well. Often, there is a balance between speed and size. The programmer just needs to chose a good compromise.
flexibility optimization
This is why it's best not to optimize if you don't need to optimize.
Are floating-point real-numbers faster than fixed-point real-numbers? For most modern CPUs, floats are faster, but on machines without a specialised FPU unit (eg. mobile devices and old hardware), it's fixed-point numbers that are faster.
I'd change "old hardware" to "extremely old hardware".
I'd agree from an x86 point of view. But even today, some mobile devices still don't have an FPU.
1) Don't use software graphics
This does not work on harware-graphics-less platforms.
That is the longest lead-off post I've ever seen....
Maybe I should have posted this to the Wiki instead...
And finally...
Do have faith in the compiler.
This is really meant to be more of a workload optimization than anything. Nowardays, compilers are good at optimizing, so they know many of the optimizing tricks, so attempting to out-optimize the compiler is futile unless you're a very good machine-code programmer. However, there's still a few tricks they don't know - especially when it comes to new hardware, or when optimization becomes more of an art than a science. Optimizing for SIMD instructions can still cause problems for compilers and this is best done in machine code, but the resulting machine code is usually a very small portion of code that just calls the SIMD instruction on the data (although there will still be some higher level code to make sure the data is in the optimum layout to be SIMD'd). Even so, if you know machine code, you may want to check what your compiler is producing from time to time, but I'd only recommend this these days if you've got plenty of time on your development cycle, as if you discover something that's best written in machine code, the chances are that finding this out and writing the resulting code will take time.
AE.
Andrei, are you gonna put this on the wiki yourself?
Following your replay, my main follow-on comment is that you probably need to split your advice into several articles; all of my replies were concerned with high level languages and most of my misunderstandings seem to revolve around where you've advised something that should only be taken literally if you're using assembly.
Other than that:
Are floating-point real-numbers faster than fixed-point real-numbers? For most modern CPUs, floats are faster, but on machines without a specialised FPU unit (eg. mobile devices and old hardware), it's fixed-point numbers that are faster.
I'd change "old hardware" to "extremely old hardware".
I'd agree from an x86 point of view. But even today, some mobile devices still don't have an FPU.
Right, but you've explicitly mentioned mobile devices as a separate category. I'd also imagine that most mobile devices don't have an FPU.
Andrei, are you gonna put this on the wiki yourself?
I'll put it there when I've decided how to split up the article.
my main follow-on comment is that you probably need to split your advice into several articles; all of my replies were concerned with high level languages and most of my misunderstandings seem to revolve around where you've advised something that should only be taken literally if you're using assembly.
I agree. Several of the optimizations at the bottom of the list are only relevant if writing in machine code. I admit that "Have faith in the compiler" caused a bit of confusion because some of the statements below seem to contradict that statement.
Do you have any suggestions for how to split the article? I could split it into:
General (methods that can apply to just about any level of optimization).
Specific optimization stratergies.
Low level optimizations for non low-level languages (this will include the statement "Have faith in the compiler").
Machine-code optimizations. For when you no longer have faith in the compiler.
However, sometimes, the boundry between the last two points blurs. This depends on the compiler, and as compilers improve, the boundry moves up the list. Also, it's hard to guess where the compiler will miss some oppotunities to pre-calcualte. But even so, the third point could give hints for how to help the compiler optimize code.
I'd also imagine that most mobile devices don't have an FPU.
Some do these days, and some still don't. In fact, the Mophun virtual machine (or at least version 1.x) does not even have multiply, divide or modulo instructions in it's virtual machine-code instruction-set. However, blits are implemented natively.
AE.
Use an appropriate iterator. When iterating through an array, increase a pointer instead of using an index into an array.
false. Pointer aliasing prevents compiler optimizations. A pointer is not any faster than an simple array index, the compiler will produce the same code. However with a index parallelism can happen.
Loop-unrolling. --> Never, ever loop-unroll
I concur, loop unrolling is very old skool; newer CPUs can predict better.
3) Linear search is slow.
if searching all of something is required, linear is not slow, modern CPUs are streaming machines, they predict what you want next is going to be whatever follows in memory, so the L1/L2 cache will be working for you whilst you are working on the previous memory addresses. and search is not CPU intensive, its mem/cache intensive, so having the cache correctly predict is vital.
I've created an article on optimization on the Allegro wiki which contains a modified version of my original post with some of the suggestions incorporated. I haven't yet split the article up into sub-articles, but I've split it up into sections that could spawn articles. I've also created a new category optimization that will contain the split articles.
At the moment, the article contains very few links to other articles and external links (eg. links to relevant Wikipedia articles). Likewise, none of the other articles currently link to the article.
AE.