Optimization strategies

Andrei Ellman

I thought I'd start a thread where we share our optimization strategies. To start off with, I'll list the ones I know. I start optimizing by using the strategies at the top and work my way down the list (although some of them can be moved up and down the list). Along with the strategies, I've listed some examples. If you know of other optimization strategies, feel free to list them. If you think some of these should be in a different order, feel free to mention this, but bear in mind that some of these points can be moved up or down depending on the type of project. The optimizations listed below are mostly for speed. But first, some general tips.

[EDIT: I have now started an article on the Allegro wiki on optimization. The original text is left below so that the discussion in this thread still makes sense]

General:

Does it really need to be optimized?
Consider high-level optimizations before low-level optimizations.
Are you using the right tool for the right job? This concept could apply to a broad set of aspects of programming. It could be the right algorithm, the right API, or even the right programming language.
Look at what you are trying to do (algorithmically, mathematically, the way it is written, how it utilizes resources) and see if it can be done in a more efficient manner.
With enough practice, lower-level optimizations are automatically taken into account by the programmer to the point that they don't have to think about them anymore.

Optimization strategies:

</li>

Try and make sure the bottleneck is used as little as possible, and for that matter, does the bottleneck really need to be executed (Zen optimization)?

eg: Swapping two large blocks of data in an array: Consider using a linked list instead where the same effect can be achieved with four pointer-changes for a singly-linked-list.
eg: If doing collisions, instead of coming up with the fastest collision-detection algorithm, you should try and figure out if the collisions need to be done in the first place. If you divide your world into sectors and you work out which objects are in which sectors, if two objects are not in the same sector, it means a collision-check will always fail so it can be skipped. The simplest method is just to divide your world into an evenly spaced grid and use the modulus-operator (see further down for advice on choosing a moduland) to work out which sectors the object occupies. More complex methods include Quadtrees/Octrees and even BSP trees.
eg: With collisions, use a bounding-box or bounding-sphere (or some easier-to-check-for-intersection geometry) to determine if they are close enough to do a more detailed collision-check.

Think of different approach (thinking outside the box). Once you have a solution, think about how the problem and solution relate to eachother to gain a deeper understanding of the problem. You may then be more likely to think of an alternative approach - possibly by using a different method of abstracting the problem. The optimal approach to chose depends on the input-data size and should be experimented with until satisfaction. This is especially true if the approaches have a different rate of growth.

eg: Imagine objects spinning round the X axis in 3D space in an evenly-spaced circle on the Y-Z plane at the same speed. Which order are they drawn in? We could sort by Z position. However, if we look at the problem in reverse, we see that instead of finding out which position an object occupies, we find out which object should occupy a given position. We work out which object belongs at each position in the circle starting from the front-most and working my way to the back on both sides. We can very easily work out the draw-order this way.
eg. Polygon sorting vs. Z-buffering.

Consider solutions that don't give the correct answer but give a close-enough answer if it's quicker to compute.

eg. When working out the position of a moving object with forces applied to it, instead of using Newton's equations of motion, use Euler's integrator.
eg. When using collision bitmasks, you can cheat by not using a 1-1 mapping of sprite-pixels to collision-mask pixels (eg. 1 collision-mask pixel represents 2x2 sprite-pixels). Likewise in 3D, mesh-mesh collisions can be optimized by using collision-meshes (low-poly meshes that look similar to the high-poly models).
If calculating certain mathematical functions by hand (eg. trigonometric) using taylor-series (when using pixel shaders, this is often how they are done), find out how many terms you can get away with leaving out until the results deteriorate too much.

Think of same approach using different algorithm (appropriate complexity for input-size). Like with choosing the best approach, the optimal algorithm to chose depends on the input-data size and should be experimented with until satisfaction. This is especially true if the algorithms have a different rate of growth.

eg. Same problem as above with objects spinning: If we ignore solution above and just decide to go with sorting, if we know our sorting algorithms, we can pick one. QSort is good general-purpose one for randomly sorted data. However, if spinning round in a circle, lets assume for the time being that we're updating the position once per tick rather than using delta-time. If the objects are not moving so fast that they overtake the next object's position in the next tick, we can chose a sorting algorithm that works well when little or no change in the order of Z positions has taken place. Likewise, if we use delta-time, there's no upper-bound on the amount the objects can move between ticks, so the objects could have moved any distance (even more than a full revolution), so we'd have to chose a sorting algorithm that copes well with more or less random data.

If the bottleneck makes use of an API, find out what the API is doing, and if you're not satisfied with it's performance, write the low-level code yourself. If your own implementation is more efficient and can be generalized enough to replace the existing function, consider proposing the changes to the API authors.

[/list]

General platform-based optimizations:

</li>

On virtual machines, API calls to the platform-API can be a lot quicker than doing things yourself. This is because the platform-API may be implemented by the native machine rather than the virtual machine. Often, these optimizations can seem counter-intuitive to someone used to programing for physical machines.

eg. if the API supports a 2D tilemap, to get the tile to the left, GetTileAt(x-1,y) may actually be faster than *((&currenttile)-1).

[/list]

Some of the optimizations below can be done with a good compiler:

Mathematical optimization (finding out if equations can be made simpler if the range of inputs is known (this is especially true for non-continuous functions)). In a similar vein, equations can be re-written to use fewer intermediate values, but on the other hand, using fewer values may be more likely to cause pipeline-stalls as we're more likely to rely on the result of recently-computed values.
eg: RGB<->HSV code can be simplified if we're certain the saturation is always going to be 1.
eg: The HSV->RGB code can be written in an easy-to-understand form, but by mathematically re-arranging it, it uses fewer values but is not so easy to figure out from looking at the code.

Once you've decided on the approach and algorithm, use code-based optimizing tricks

eg. Collision bitmasks instead of pixel-masks. You can check for collisions of as many pixels as there are bits in a machine-word jusing just a single 'AND' operator.
eg. Bob's useful code-snippets.

Use iteration instead of recursion if possible. It reduces efficiency to keep pushing and popping return addresses, and you also risk overflowing the stack.<ul>[/list]

Pre-calculate as much as possible. For simple things, the compiler often does this for you, but for more complicated things, it's better to pre-calculate. This is especially true of things that remain constant throughout iteration or recursion.

[list]

Use interpolation. When iterating, it's quicker to add a pre-calculated constant than divide by a loop-counter.

</ul>

With lots of branches (eg. switch() or a series of if()s), make sure conditions that are most likely to happen are tested for before unlikely conditions are checked. Sometimes, this is intuitive, but other times, you may have to count the number of times each condition is reached.<ul>[/list]

Use an appropriate iterator. When iterating through an array, increase a pointer instead of using an index into an array.[list]</ul>

Give the compiler a helping-hand.

Make use of keywords such as const, restrict, etc. (register is almost obsolete because compilers are good at register-juggling).

Get to know the target architecture. This is another way of giving the compiler a helping hand. Write code in such a way or choose values for arbitary values that the compiler is more likely to optimize.

eg. For unsigned integers, x/8 can be optimized to x>>3, but x/7 cannot be optimized this way.
eg. If the target-CPU supports SIMD instructions (instructions that work on several values at once), write code that's easier for the compiler to make use of SIMD instructions.
eg. If the CPU uses deep execution-pipelines, try and reduce the amount of branching (if's) in your code.
If iterating using an index, and the index starts at zero, count down instead of count up. This is because on many machines, there are specialised check-for-zero instructions that do not require loading a constant to check against. However, CPU-caches are more optimized for counting up than counting down (forwards iteration).

Micro-optimizations: Any compiler worth it's salt will do these for you, but if you are aware of these, you can make it easier for the compiler to optimize by giving it a helping hand. Many of these are machine-specific (but can be specific to many machines). At best, these types of optimization offer a linear speed-increase.

Do have faith in the compiler. For example, it should automatically chose the faster of n+n, n*2, n<<1 if it sees something similar in the code and even picks the appropriate choice for the sub-variant of CPU you wish to optimize for. However, if you're programming in machine-code, you have to make this choice yourself.
Be aware of which instructions take up many CPU cycles, and which instructions can be done in the background (on x86 cpus with an FPU, floating-point operations can be carried out concurrently with the non-floating point operations of a CPU).
Loop-unrolling.
Use shifts instead of integer multiplies/divides by powers of two, and use bitwise AND for modulus by a power of two.
Make sure as much code/data as possible can fit into the cache. Some processors have seperate caches for data and code. However, if there are more processes running than available processors, the cache gets invalidated often. Also, bear in mind that sometimes, data can be padded to make access more efficient (some CPUs are better at accessing data aligned on a machine-word sized boundry), and that some CPUs are better at accessing data using the machine-native word-size (compilers usually do a good job of choosing the best balance of padding vs. fitting into the cache).
Are floating-point real-numbers faster than fixed-point real-numbers? For most modern CPUs, floats are faster, but on machines without a specialised FPU unit (eg. mobile devices and old hardware), it's fixed-point numbers that are faster.
If the CPU can execute parts of code in parallel, interleave the instructions in a way that the result is not immediately used by the next instruction. This prevents pipeline-stalls.

Machine code hAx0r'ing. Extremely platform-specific. Requires in-depth knowledge of target architecture. Would even need to write differently for sub-classes of architecture. At best, performance optimization is linear. Only do this as a last resort! If a better algorithm or a bug is discovered, a lot of work goes to waste. Not worth it because all modern compilers are competent enough to handle this if they support the CPU in mind and know the subtleties of the CPU (preventing stalls, register management, etc). Also, can be a nightmare to maintain - especially if using self-modifying code. However, there are times when low-level optimization is an art rather than a science, and compilers may be scientific but are useless at art.

If your target architecture uses a non-symmetric set of machine code instructions (that is, different registers can be operated on in different ways), getting the most out of the processor is trickier. This means there may be several tricks the compiler may have missed.
Self-modifying code can be useful if both speed and compactness is desired, or to save on registers by hardcoding constants, but avoid it if the CPU uses separate code and data caches (unless the modification occurs only rarely).

[/list]

AE.

Audric

The "General" section use a rather abstract wording, I'm not sure that the people who need this advice will understand what they mean, and that they are the FUNDAMENTAL rules for not wasting one's time and efforts:
- Understand what takes time in one's game (profile, learn the "cost" of a blit() vs. any basic maths)
- Understand why it's better to improve the speed of display_screen() by 5% than the speed of logic_update() by 1000%
- The power of laziness: Understand that most optimized games are efficient because they do less work, not because they do it faster.

---
Now as a response to every code-level optimization:
I prefer to optimize for the shortest time of debugging/troubleshooting.

Spending hours trying to understand why it doesn't do what you want is the worst part of game programming, it's very demotivating. When I started C, I was a crazy "optimizer", now I'm wiser(?) and more reluctant to use any coding shortcuts that would make checking for errors harder.
(Granted, I'm mostly thinking of "gameplay" code, the one which is actually fun to write and tweak, and normally takes 1% of the whole game's load in the first place.)

Thomas Harte

Ummm...

Quote:

Do have faith in the compiler. For example, it should automatically chose the faster of n+n, n*2, n<<1
...
Use shifts instead of integer multiplies/divides by powers of two

To add to Andrei's advice, specifically thinking about x86:

Quote:

Be aware of which instructions take up many CPU cycles, and which instructions can be done in the background (on x86 cpus with an FPU, floating-point operations can be carried out concurrently with the non-floating point operations of a CPU).

The Pentium was limited to one floating-point operation concurrently with one integer operation, everything since then can do a much wider and more arbitrary array of combinations.

Quote:

Loop-unrolling.

Never, ever loop-unroll. It decreases the amount of your game's functionality that can fit in the CPU cache and usually doesn't tangibly increase speed anyway — branch prediction means that the CPU will automatically act as though the loop is unrolled without the cache cost, only losing some time for its efforts in the very final iteration of your loop.

Quote:

and use bitwise AND for modulus by a power of two.

You should probably still do this, but most of the time the compiler will freely interchange. Note that dividing by two and shifting right by 1 isn't guaranteed to produce the same result on all CPU architectures and all compilers. In particular, shifts are not necessarily signed, and the rounding that applies during division of a negative integer was ambiguous at least as recently as C94. I'm not sure about C99.

Quote:

Are floating-point real-numbers faster than fixed-point real-numbers? For most modern CPUs, floats are faster, but on machines without a specialised FPU unit (eg. mobile devices and old hardware), it's fixed-point numbers that are faster.

I'd change "old hardware" to "extremely old hardware".

Goalie Ca

1) Never make system calls
2) Exceptions are slow.
3) Linear search is slow.
4) Optimize inner loop for cache
5) Optimize inner loop for instructions
6) Optimize other for memory.

Paul whoknows

When talking about optimization you need to be more specific: are we talking about speed optimization? size optimization?

To be honest I am more interested in flexibility optimization, I mean code that can be easily modified in just minutes. I am rewriting parts of my projects to re-design my classes trying to keep only few classes heavily used, instead of a lot of classes rarely used. In my personal experience this is the way to achieve optimization in terms of flexibility.

OnlineCop

Paul, that's great, but if you have something that is very easy to modify within minutes but has a Big-O of O(N^2) (think bubble sort for millions of entries), wouldn't less-readable and a "little more" memory intensive be in favor if it brought the speed of your sort to O(n log n) instead?

Matt Smith

I'd change "old hardware" to "extremely old hardware".

I'd change it to "small hardware". anything with an ARM chip, for instance.

lamer88

I haven't tried to write a game in years, but my early attempts failed because of the inability to make it run fast enough. (Trying to do too much blending, etc.)

The only strategy (besides better algorithms) that really works anymore is reduce the amount of data to be processed. This usually involves some sort of compromise between quality/performance.

Packing multiplies is a good example, a larger example is something like the pairwise-clustering color quantizer in Rendera which is hopelessly slow, and requires too much memory. Preprocessing the colors (to reduce the maximum number fed into the quantizer) was required.

Reducing those datasets while keeping the resulting compromise practical is the hardest part I think.

Fladimir da Gorf

1) Don't use software graphics

Anything else is mostly for nothing.

imaxcs

Speaking of SIMD, even modern compilers (gcc, icc, ...) can't really do that much without help. The best way to aid the compiler in writing good, fast SSE-code is either ASM (yuck!) or better, by using SSE intrinsics. I admit that the cases where it can be used are rare but when it's applicable, it's a serious performance boost. Four calculations for the price of one... I'd take it any day!

lamer88

Another thing about GCC is that you need to make sure your functions are actually being inlined with -Winline. The no-brainer way is to declare them "static inline" at the beginning of the C file, but if you want global inline functions they have to be done a certain way, which I don't remember at the moment.

Owell, //hatever

nonnus29

Holy Sh*t!

That is the longest lead-off post I've ever seen....

Andrei Ellman

Audric said:

I put these in a separate 'General' section because these are more abstract and can be applied to several of the layers of optimization.

Audric said:

- Understand what takes time in one's game (profile, learn the "cost" of a blit() vs. any basic maths)
- Understand why it's better to improve the speed of display_screen() by 5% than the speed of logic_update() by 1000%

This is why I suggested using a profiler to identify bottlenecks. However, bear in mind that different machines will have their bottlenecks in different places. So for example, if you're used to software rendering on a desktop-PC and then you move onto hardware rendering on a mobile device, the bottleneck can very easily shift towards the logic-code instead of blit().

Audric said:

- The power of laziness: Understand that most optimized games are efficient because they do less work, not because they do it faster.

This is just another way of saying "Instead of doing it as fast as possible, find out if it needs to be done at all" (which is basically what I said in my second point). Likewise, the power of laziness can be applied to the first point (don't optimize non-bottlenecks or anything that's fast enough) - in this case, the power of laziness is applied to the programmer instead of the machine.

Audric said:

Now as a response to every code-level optimization:
I prefer to optimize for the shortest time of debugging/troubleshooting.

Spending hours trying to understand why it doesn't do what you want is the worst part of game programming, it's very demotivating. When I started C, I was a crazy "optimizer", now I'm wiser(?) and more reluctant to use any coding shortcuts that would make checking for errors harder.
(Granted, I'm mostly thinking of "gameplay" code, the one which is actually fun to write and tweak, and normally takes 1% of the whole game's load in the first place.)

Debugging is outside the scope of the topic of optimization which is what is being discussed. I do agree that it's important to fix the bugs (or be 100% sure what's causing the bugs) before optimization is done. The bottlenecks tend to be in relatively small proportions of the code, so for just about everything else, I'd focus on making it easy to read (which also means easy to debug and maintain - especially by another programmer). Debugging can be helped if you write code where the compiler can spot you doing things you're not meant to do, or if you use ASSERT()s in appropriate places. However if you've tracked down your bug to the implementation of a particular algorithm and you're considering replacing it with a different algorithm, you can automatically get rid of the bug by replacing the algorithm.

Thomas Harte said:

Ummm...

Quote:

Do have faith in the compiler. For example, it should automatically chose the faster of n+n, n*2, n<<1
...
Use shifts instead of integer multiplies/divides by powers of two

Using shifts instead of divides/multiplies involves picking the right values to divide/multiply by in order that they can be optimized to shifts, whereas the choice of "n+n, n*2, n<<1" is the same regardless of the value chosen for n (or positive n in the latter case). Also, some advice applies only if writing in machine-code.

Thomas Harte said:

Quote:

and use bitwise AND for modulus by a power of two.

You should probably still do this, but most of the time the compiler will freely interchange.

The trick is not to do the replacing yourself, but to recognise when there's an oppotunity for such replacement to occur. For unsigned integers, the compiler will change %8 into &7, but for %9, it cannot do this optimization. So if picking arbitary values, it's best to pick powers of two. Picking a good arbitary value involves little effort. A tile-map implementation involves lots of multiplies, divides and modulos. This is why you should chose tiles with sides equal to powers of two. If you chose a non power of two, the compiler can't optimize so easily.

Thomas Harte said:

Note that dividing by two and shifting right by 1 isn't guaranteed to produce the same result on all CPU architectures and all compilers. In particular, shifts are not necessarily signed, and the rounding that applies during division of a negative integer was ambiguous at least as recently as C94. I'm not sure about C99.

For unsigned integers, a shift will always give the same result as a multiplication/division of a power of two. I've seen some compilers that compile signed divide-by-power-of-two's into an if() statement that checks to see if the number is >0 and if it is, does a shift.

Thomas Harte said:

Quote:

Loop-unrolling

Never, ever loop-unroll. It decreases the amount of your game's functionality that can fit in the CPU cache and usually doesn't tangibly increase speed anyway branch prediction means that the CPU will automatically act as though the loop is unrolled without the cache cost, only losing some time for its efforts in the very final iteration of your loop.

I should have elaborated. It's usually bad for the cache (if the machine has a cache) and should not be done in a high-level language. Compilers will do this for you if they think it's faster. Sometimes, a complicated side-effects-less non-state-based calculation can be done using the loop-index (and perhaps a few other paramaters). If the loop start and loop-end are both constants and the difference is small, the compiler may determine that the savings that can be made by unrolling the loop and pre-calcualting everything where the index is a variable (if variables other than the index are used, mathematical optimization is used on the equation). In some cases, the compiler may not spot this, but if the programmer knows what they are doing, they might be able to squeeze a bit of extra performance by manually unrolling the loop (although this particular optimization does make the code less readable and maintainable so should only be used as a last resort and only if it really needs to be "as fast as possible"). Also, as a compromise between cache usage and execution speed, loops can sometimes be partially unrolled.

Paul whoknows said:

When talking about optimization you need to be more specific: are we talking about speed optimization? size optimization?

I was mainly focusing on speed optimization but some of the stratergies can be applied to size optimizaion as well. Often, there is a balance between speed and size. The programmer just needs to chose a good compromise.

Paul whoknows said:

flexibility optimization

This is why it's best not to optimize if you don't need to optimize.

Thomas Harte said:

Quote:

I'd change "old hardware" to "extremely old hardware".

I'd agree from an x86 point of view. But even today, some mobile devices still don't have an FPU.

Fladimir da Gorf said:

1) Don't use software graphics

This does not work on harware-graphics-less platforms.

nonnus29 said:

That is the longest lead-off post I've ever seen....

Maybe I should have posted this to the Wiki instead...

And finally...

I said:

Do have faith in the compiler.

This is really meant to be more of a workload optimization than anything. Nowardays, compilers are good at optimizing, so they know many of the optimizing tricks, so attempting to out-optimize the compiler is futile unless you're a very good machine-code programmer. However, there's still a few tricks they don't know - especially when it comes to new hardware, or when optimization becomes more of an art than a science. Optimizing for SIMD instructions can still cause problems for compilers and this is best done in machine code, but the resulting machine code is usually a very small portion of code that just calls the SIMD instruction on the data (although there will still be some higher level code to make sure the data is in the optimum layout to be SIMD'd). Even so, if you know machine code, you may want to check what your compiler is producing from time to time, but I'd only recommend this these days if you've got plenty of time on your development cycle, as if you discover something that's best written in machine code, the chances are that finding this out and writing the resulting code will take time.

AE.

Matt Smith

Andrei, are you gonna put this on the wiki yourself?

Thomas Harte

Following your replay, my main follow-on comment is that you probably need to split your advice into several articles; all of my replies were concerned with high level languages and most of my misunderstandings seem to revolve around where you've advised something that should only be taken literally if you're using assembly.

Other than that:

Quote:

I'd change "old hardware" to "extremely old hardware".

I'd agree from an x86 point of view. But even today, some mobile devices still don't have an FPU.

Right, but you've explicitly mentioned mobile devices as a separate category. I'd also imagine that most mobile devices don't have an FPU.

Andrei Ellman

Matt Smith said:

Andrei, are you gonna put this on the wiki yourself?

I'll put it there when I've decided how to split up the article.

Thomas Harte said:

my main follow-on comment is that you probably need to split your advice into several articles; all of my replies were concerned with high level languages and most of my misunderstandings seem to revolve around where you've advised something that should only be taken literally if you're using assembly.

I agree. Several of the optimizations at the bottom of the list are only relevant if writing in machine code. I admit that "Have faith in the compiler" caused a bit of confusion because some of the statements below seem to contradict that statement.

Do you have any suggestions for how to split the article? I could split it into:

General (methods that can apply to just about any level of optimization).
Specific optimization stratergies.
Low level optimizations for non low-level languages (this will include the statement "Have faith in the compiler").
Machine-code optimizations. For when you no longer have faith in the compiler.

However, sometimes, the boundry between the last two points blurs. This depends on the compiler, and as compilers improve, the boundry moves up the list. Also, it's hard to guess where the compiler will miss some oppotunities to pre-calcualte. But even so, the third point could give hints for how to help the compiler optimize code.

Thomas Harte said:

I'd also imagine that most mobile devices don't have an FPU.

Some do these days, and some still don't. In fact, the Mophun virtual machine (or at least version 1.x) does not even have multiply, divide or modulo instructions in it's virtual machine-code instruction-set. However, blits are implemented natively.

AE.

aj5555

Quote:

Use an appropriate iterator. When iterating through an array, increase a pointer instead of using an index into an array.

false. Pointer aliasing prevents compiler optimizations. A pointer is not any faster than an simple array index, the compiler will produce the same code. However with a index parallelism can happen.

Quote:

Loop-unrolling. --> Never, ever loop-unroll

I concur, loop unrolling is very old skool; newer CPUs can predict better.

Quote:

3) Linear search is slow.

if searching all of something is required, linear is not slow, modern CPUs are streaming machines, they predict what you want next is going to be whatever follows in memory, so the L1/L2 cache will be working for you whilst you are working on the previous memory addresses. and search is not CPU intensive, its mem/cache intensive, so having the cache correctly predict is vital.

Andrei Ellman

I've created an article on optimization on the Allegro wiki which contains a modified version of my original post with some of the suggestions incorporated. I haven't yet split the article up into sub-articles, but I've split it up into sections that could spawn articles. I've also created a new category optimization that will contain the split articles.

At the moment, the article contains very few links to other articles and external links (eg. links to relevant Wikipedia articles). Likewise, none of the other articles currently link to the article.

AE.

Thread #595748. Printed from Allegro.cc