|
|
| Optimization strategies |
|
Andrei Ellman
Member #3,434
April 2003
|
I thought I'd start a thread where we share our optimization strategies. To start off with, I'll list the ones I know. I start optimizing by using the strategies at the top and work my way down the list (although some of them can be moved up and down the list). Along with the strategies, I've listed some examples. If you know of other optimization strategies, feel free to list them. If you think some of these should be in a different order, feel free to mention this, but bear in mind that some of these points can be moved up or down depending on the type of project. The optimizations listed below are mostly for speed. But first, some general tips. [EDIT: I have now started an article on the Allegro wiki on optimization. The original text is left below so that the discussion in this thread still makes sense] General:
Optimization strategies: </li>
[/list] General platform-based optimizations: </li>
[/list] Some of the optimizations below can be done with a good compiler:
[list] </ul>
[/list] AE. -- |
|
Audric
Member #907
January 2001
|
The "General" section use a rather abstract wording, I'm not sure that the people who need this advice will understand what they mean, and that they are the FUNDAMENTAL rules for not wasting one's time and efforts: --- Spending hours trying to understand why it doesn't do what you want is the worst part of game programming, it's very demotivating. When I started C, I was a crazy "optimizer", now I'm wiser(?) and more reluctant to use any coding shortcuts that would make checking for errors harder. |
|
Thomas Harte
Member #33
April 2000
|
Ummm... Quote:
Do have faith in the compiler. For example, it should automatically chose the faster of n+n, n*2, n<<1 To add to Andrei's advice, specifically thinking about x86: Quote: Be aware of which instructions take up many CPU cycles, and which instructions can be done in the background (on x86 cpus with an FPU, floating-point operations can be carried out concurrently with the non-floating point operations of a CPU). The Pentium was limited to one floating-point operation concurrently with one integer operation, everything since then can do a much wider and more arbitrary array of combinations. Quote: Loop-unrolling. Never, ever loop-unroll. It decreases the amount of your game's functionality that can fit in the CPU cache and usually doesn't tangibly increase speed anyway — branch prediction means that the CPU will automatically act as though the loop is unrolled without the cache cost, only losing some time for its efforts in the very final iteration of your loop. Quote: and use bitwise AND for modulus by a power of two. You should probably still do this, but most of the time the compiler will freely interchange. Note that dividing by two and shifting right by 1 isn't guaranteed to produce the same result on all CPU architectures and all compilers. In particular, shifts are not necessarily signed, and the rounding that applies during division of a negative integer was ambiguous at least as recently as C94. I'm not sure about C99. Quote: Are floating-point real-numbers faster than fixed-point real-numbers? For most modern CPUs, floats are faster, but on machines without a specialised FPU unit (eg. mobile devices and old hardware), it's fixed-point numbers that are faster. I'd change "old hardware" to "extremely old hardware". [My site] [Tetrominoes] |
|
Goalie Ca
Member #2,579
July 2002
|
1) Never make system calls ------------- |
|
Paul whoknows
Member #5,081
September 2004
|
When talking about optimization you need to be more specific: are we talking about speed optimization? size optimization? To be honest I am more interested in flexibility optimization, I mean code that can be easily modified in just minutes. I am rewriting parts of my projects to re-design my classes trying to keep only few classes heavily used, instead of a lot of classes rarely used. In my personal experience this is the way to achieve optimization in terms of flexibility. ____ "The unlimited potential has been replaced by the concrete reality of what I programmed today." - Jordan Mechner. |
|
OnlineCop
Member #7,919
October 2006
|
Paul, that's great, but if you have something that is very easy to modify within minutes but has a Big-O of O(N^2) (think bubble sort for millions of entries), wouldn't less-readable and a "little more" memory intensive be in favor if it brought the speed of your sort to O(n log n) instead?
|
|
Matt Smith
Member #783
November 2000
|
I'd change "old hardware" to "extremely old hardware". I'd change it to "small hardware". anything with an ARM chip, for instance. |
|
lamer88
Member #9,611
March 2008
|
I haven't tried to write a game in years, but my early attempts failed because of the inability to make it run fast enough. (Trying to do too much blending, etc.) The only strategy (besides better algorithms) that really works anymore is reduce the amount of data to be processed. This usually involves some sort of compromise between quality/performance. Packing multiplies is a good example, a larger example is something like the pairwise-clustering color quantizer in Rendera which is hopelessly slow, and requires too much memory. Preprocessing the colors (to reduce the maximum number fed into the quantizer) was required. Reducing those datasets while keeping the resulting compromise practical is the hardest part I think. |
|
Fladimir da Gorf
Member #1,565
October 2001
|
1) Don't use software graphics Anything else is mostly for nothing. OpenLayer has reached a random SVN version number ;) | Online manual | Installation video!| MSVC projects now possible with cmake | Now alvailable as a Dev-C++ Devpack! (Thanks to Kotori) |
|
imaxcs
Member #4,036
November 2003
|
Speaking of SIMD, even modern compilers (gcc, icc, ...) can't really do that much without help. The best way to aid the compiler in writing good, fast SSE-code is either ASM (yuck!) or better, by using SSE intrinsics. I admit that the cases where it can be used are rare but when it's applicable, it's a serious performance boost. Four calculations for the price of one... I'd take it any day!
|
|
lamer88
Member #9,611
March 2008
|
Another thing about GCC is that you need to make sure your functions are actually being inlined with -Winline. The no-brainer way is to declare them "static inline" at the beginning of the C file, but if you want global inline functions they have to be done a certain way, which I don't remember at the moment. Owell, //hatever |
|
nonnus29
Member #2,606
August 2002
|
Holy Sh*t! That is the longest lead-off post I've ever seen....
|
|
Andrei Ellman
Member #3,434
April 2003
|
Audric said: The "General" section use a rather abstract wording, I'm not sure that the people who need this advice will understand what they mean, and that they are the FUNDAMENTAL rules for not wasting one's time and efforts: I put these in a separate 'General' section because these are more abstract and can be applied to several of the layers of optimization. Audric said:
- Understand what takes time in one's game (profile, learn the "cost" of a blit() vs. any basic maths) This is why I suggested using a profiler to identify bottlenecks. However, bear in mind that different machines will have their bottlenecks in different places. So for example, if you're used to software rendering on a desktop-PC and then you move onto hardware rendering on a mobile device, the bottleneck can very easily shift towards the logic-code instead of blit(). Audric said: - The power of laziness: Understand that most optimized games are efficient because they do less work, not because they do it faster. This is just another way of saying "Instead of doing it as fast as possible, find out if it needs to be done at all" (which is basically what I said in my second point). Likewise, the power of laziness can be applied to the first point (don't optimize non-bottlenecks or anything that's fast enough) - in this case, the power of laziness is applied to the programmer instead of the machine. Audric said:
Now as a response to every code-level optimization: Spending hours trying to understand why it doesn't do what you want is the worst part of game programming, it's very demotivating. When I started C, I was a crazy "optimizer", now I'm wiser(?) and more reluctant to use any coding shortcuts that would make checking for errors harder. Debugging is outside the scope of the topic of optimization which is what is being discussed. I do agree that it's important to fix the bugs (or be 100% sure what's causing the bugs) before optimization is done. The bottlenecks tend to be in relatively small proportions of the code, so for just about everything else, I'd focus on making it easy to read (which also means easy to debug and maintain - especially by another programmer). Debugging can be helped if you write code where the compiler can spot you doing things you're not meant to do, or if you use ASSERT()s in appropriate places. However if you've tracked down your bug to the implementation of a particular algorithm and you're considering replacing it with a different algorithm, you can automatically get rid of the bug by replacing the algorithm. Thomas Harte said: Ummm... Quote:
Do have faith in the compiler. For example, it should automatically chose the faster of n+n, n*2, n<<1
Using shifts instead of divides/multiplies involves picking the right values to divide/multiply by in order that they can be optimized to shifts, whereas the choice of "n+n, n*2, n<<1" is the same regardless of the value chosen for n (or positive n in the latter case). Also, some advice applies only if writing in machine-code. Thomas Harte said:
Quote: and use bitwise AND for modulus by a power of two. You should probably still do this, but most of the time the compiler will freely interchange. The trick is not to do the replacing yourself, but to recognise when there's an oppotunity for such replacement to occur. For unsigned integers, the compiler will change %8 into &7, but for %9, it cannot do this optimization. So if picking arbitary values, it's best to pick powers of two. Picking a good arbitary value involves little effort. A tile-map implementation involves lots of multiplies, divides and modulos. This is why you should chose tiles with sides equal to powers of two. If you chose a non power of two, the compiler can't optimize so easily. Thomas Harte said: Note that dividing by two and shifting right by 1 isn't guaranteed to produce the same result on all CPU architectures and all compilers. In particular, shifts are not necessarily signed, and the rounding that applies during division of a negative integer was ambiguous at least as recently as C94. I'm not sure about C99. For unsigned integers, a shift will always give the same result as a multiplication/division of a power of two. I've seen some compilers that compile signed divide-by-power-of-two's into an if() statement that checks to see if the number is >0 and if it is, does a shift. Thomas Harte said:
Quote: Loop-unrolling Never, ever loop-unroll. It decreases the amount of your game's functionality that can fit in the CPU cache and usually doesn't tangibly increase speed anyway branch prediction means that the CPU will automatically act as though the loop is unrolled without the cache cost, only losing some time for its efforts in the very final iteration of your loop. I should have elaborated. It's usually bad for the cache (if the machine has a cache) and should not be done in a high-level language. Compilers will do this for you if they think it's faster. Sometimes, a complicated side-effects-less non-state-based calculation can be done using the loop-index (and perhaps a few other paramaters). If the loop start and loop-end are both constants and the difference is small, the compiler may determine that the savings that can be made by unrolling the loop and pre-calcualting everything where the index is a variable (if variables other than the index are used, mathematical optimization is used on the equation). In some cases, the compiler may not spot this, but if the programmer knows what they are doing, they might be able to squeeze a bit of extra performance by manually unrolling the loop (although this particular optimization does make the code less readable and maintainable so should only be used as a last resort and only if it really needs to be "as fast as possible"). Also, as a compromise between cache usage and execution speed, loops can sometimes be partially unrolled. Paul whoknows said: When talking about optimization you need to be more specific: are we talking about speed optimization? size optimization? I was mainly focusing on speed optimization but some of the stratergies can be applied to size optimizaion as well. Often, there is a balance between speed and size. The programmer just needs to chose a good compromise. Paul whoknows said: flexibility optimization This is why it's best not to optimize if you don't need to optimize. Thomas Harte said:
Quote: Are floating-point real-numbers faster than fixed-point real-numbers? For most modern CPUs, floats are faster, but on machines without a specialised FPU unit (eg. mobile devices and old hardware), it's fixed-point numbers that are faster. I'd change "old hardware" to "extremely old hardware". I'd agree from an x86 point of view. But even today, some mobile devices still don't have an FPU. Fladimir da Gorf said: 1) Don't use software graphics This does not work on harware-graphics-less platforms. nonnus29 said: That is the longest lead-off post I've ever seen.... Maybe I should have posted this to the Wiki instead... And finally... I said: Do have faith in the compiler. This is really meant to be more of a workload optimization than anything. Nowardays, compilers are good at optimizing, so they know many of the optimizing tricks, so attempting to out-optimize the compiler is futile unless you're a very good machine-code programmer. However, there's still a few tricks they don't know - especially when it comes to new hardware, or when optimization becomes more of an art than a science. Optimizing for SIMD instructions can still cause problems for compilers and this is best done in machine code, but the resulting machine code is usually a very small portion of code that just calls the SIMD instruction on the data (although there will still be some higher level code to make sure the data is in the optimum layout to be SIMD'd). Even so, if you know machine code, you may want to check what your compiler is producing from time to time, but I'd only recommend this these days if you've got plenty of time on your development cycle, as if you discover something that's best written in machine code, the chances are that finding this out and writing the resulting code will take time. AE. -- |
|
Matt Smith
Member #783
November 2000
|
Andrei, are you gonna put this on the wiki yourself? |
|
Thomas Harte
Member #33
April 2000
|
Following your replay, my main follow-on comment is that you probably need to split your advice into several articles; all of my replies were concerned with high level languages and most of my misunderstandings seem to revolve around where you've advised something that should only be taken literally if you're using assembly. Other than that: Quote:
Quote:
Quote: Are floating-point real-numbers faster than fixed-point real-numbers? For most modern CPUs, floats are faster, but on machines without a specialised FPU unit (eg. mobile devices and old hardware), it's fixed-point numbers that are faster. I'd change "old hardware" to "extremely old hardware". I'd agree from an x86 point of view. But even today, some mobile devices still don't have an FPU. Right, but you've explicitly mentioned mobile devices as a separate category. I'd also imagine that most mobile devices don't have an FPU. [My site] [Tetrominoes] |
|
Andrei Ellman
Member #3,434
April 2003
|
Matt Smith said: Andrei, are you gonna put this on the wiki yourself? I'll put it there when I've decided how to split up the article. Thomas Harte said: my main follow-on comment is that you probably need to split your advice into several articles; all of my replies were concerned with high level languages and most of my misunderstandings seem to revolve around where you've advised something that should only be taken literally if you're using assembly. I agree. Several of the optimizations at the bottom of the list are only relevant if writing in machine code. I admit that "Have faith in the compiler" caused a bit of confusion because some of the statements below seem to contradict that statement. Do you have any suggestions for how to split the article? I could split it into:
However, sometimes, the boundry between the last two points blurs. This depends on the compiler, and as compilers improve, the boundry moves up the list. Also, it's hard to guess where the compiler will miss some oppotunities to pre-calcualte. But even so, the third point could give hints for how to help the compiler optimize code. Thomas Harte said: I'd also imagine that most mobile devices don't have an FPU. Some do these days, and some still don't. In fact, the Mophun virtual machine (or at least version 1.x) does not even have multiply, divide or modulo instructions in it's virtual machine-code instruction-set. However, blits are implemented natively. AE. -- |
|
aj5555
Member #9,033
September 2007
|
Quote: Use an appropriate iterator. When iterating through an array, increase a pointer instead of using an index into an array. false. Pointer aliasing prevents compiler optimizations. A pointer is not any faster than an simple array index, the compiler will produce the same code. However with a index parallelism can happen. Quote: Loop-unrolling. --> Never, ever loop-unroll I concur, loop unrolling is very old skool; newer CPUs can predict better. Quote: 3) Linear search is slow. if searching all of something is required, linear is not slow, modern CPUs are streaming machines, they predict what you want next is going to be whatever follows in memory, so the L1/L2 cache will be working for you whilst you are working on the previous memory addresses. and search is not CPU intensive, its mem/cache intensive, so having the cache correctly predict is vital. |
|
Andrei Ellman
Member #3,434
April 2003
|
I've created an article on optimization on the Allegro wiki which contains a modified version of my original post with some of the suggestions incorporated. I haven't yet split the article up into sub-articles, but I've split it up into sections that could spawn articles. I've also created a new category optimization that will contain the split articles. At the moment, the article contains very few links to other articles and external links (eg. links to relevant Wikipedia articles). Likewise, none of the other articles currently link to the article. AE. -- |
|
|