Performance Tuning

Performance Tuning

Onewing

Member #6,152

August 2005

I was reading a book (in the appendix) that lists several methods of performance tuning. Most of them only increase framerate by a little bit (but more so on older computers). It was somewhat interesting, so I thought I'd query the infinite knowledge of a.cc to see if there are more tricks that you use. ComputerScience++;

Here are some of the methods mentioned (I'll probably have to go back to the book later to make sure I list them correctly) in the book (summarized):

Plus, minus operations are faster than multiplication which is faster than divide. Multiplication can be significantly faster than divide on some machines.
Pointer Indirection. Every indirection causes a "cache miss" which is ultimately some degree of slow down. Unfortunately, there's usually nothing you can do about this, other make your code less OOP and/or readable to comply.
Data Structures. Try to use the smallest data type that will work for your variables.
(Obvious) Use malloc/free and/or new/delete sparingly. Keep it out of the main game loop(s).

There were others but I can't remember right now. :-X

------------
Solo-Games.org | My Tech Blog: The Digital Helm

gnolam

Member #2,030

March 2002

Use the right algorithms for the task. It's going to matter a hell of a lot more than the micro-optimizations above.

--
Move to the Democratic People's Republic of Vivendi Universal (formerly known as Sweden) - officially democracy- and privacy-free since 2008-06-18!

Tobias Dammers

Member #2,604

August 2002

Quote:

Pointer Indirection. Every indirection causes a "cache miss" which is ultimately some degree of slow down. Unfortunately, there's usually nothing you can do about this, other make your code less OOP and/or readable to comply.

Yes, but readability makes for more maintainable code. Avoid pointer indirections only if you have found the particular piece of code to be a bottleneck; otherwise I'd prefer readability over a marginal or unnoticeable speed improvement.

Quote:

Data Structures. Try to use the smallest data type that will work for your variables.

This doesn't always hold true. Data sizes that are not multiples of the target architecture's word size (32 or 64 bits on anything >= 386) can cause misaligned data and make things slower instead of faster. Why else do you think most modern graphics cards use 32 bpp instead of 24?

Quote:

(Obvious) Use malloc/free and/or new/delete sparingly. Keep it out of the main game loop(s).

Use memory pools if you have to.

Here's some more:

If you need to pass anything larger than a pointer, pass by reference if you can.
Play nice with your optimizer: Write short functions, use the "const" keyword wherever appropriate, avoid complicated statements, cast explicitly, use the proper constants for initializing variables (float f = 5.0f; is better than float f = 5;), use the register keyword in very tight spots, use the inline keyword where appropriate
Optimize algorithms, not code. For example, it is far more efficient to reduce the number of collision tests than to tune the test itself.
Avoid float <-> int typecasts as much as possible. These are expensive; on modern machines, float math is almost as fast as int math, so you should only cast when you absolutely have to (e.g. passing the result of a float calculation to a function that expects ints)
Use a profiler. Both to see which parts are worth optimizing, and to see if your optimization is really an improvement (you'd be surprised).
You cannot optimize ALL of your code; focus on the 10% that is responsible for 90% of the user experience.
Prefer math over if / switch statements. Math is fast, but if statements need branch prediction to perform well.

---
Me make music: Triofobie
---
"We need Tobias and his awesome trombone, too." - Johan Halmén

Joel Pettersson

Member #4,187

January 2004

Quote:

Plus, minus operations are faster than multiplication which is faster than divide. Multiplication can be significantly faster than divide on some machines.

Divide is indeed The expensive math operation, often more expensive by an order of magnitude than the others.

Quote:

Data Structures. Try to use the smallest data type that will work for your variables.

Try doing so for large chunks of data, as it will allow you to squeeze more into the cache. Otherwise, it can be counter-productive, as noted by Tobias.

Quote:

Play nice with your optimizer: Write short functions, use the "const" keyword wherever appropriate, avoid complicated statements, cast explicitly, use the proper constants for initializing variables (float f = 5.0f; is better than float f = 5;), use the register keyword in very tight spots, use the inline keyword where appropriate

If you imply splitting functions just to keep them small, there is no reason to, unless you can use the resulting split-off functions several times and the code size is significantly reduced. Larger chunks of continuous code can be optimized better. (when inlining the calls, it often doesn't matter, though. unless you access parameters by pointer or reference; this can make inline functions noticeably worse than an equivalent macro in some cases)

As for casting explicitly, doing so can sometimes result in the addition of additional, unneccessary instructions with GCC. It has happened to me when writing double calculation results to a float buffer.

Quote:

Avoid float <-> int typecasts as much as possible. These are expensive; on modern machines, float math is almost as fast as int math, so you should only cast when you absolutely have to (e.g. passing the result of a float calculation to a function that expects ints)

Use of inline assembly or proper library functions (not always avaliable; may require certain defines. often compiler-specific) can reduce conversions to a few cycles, if you don't mind rounding occuring, or are fine with setting the mode to truncation.

Quote:

Prefer math over if / switch statements. Math is fast, but if statements need branch prediction to perform well.

Branch predition is sophisticated, so if the branch can be predicted well, this can be counter-productive. (for the cases where the conditional statement is repeatedly encountered) Make sure to test it when performance is critical.

HoHo

Member #4,534

April 2004

Avoid linked lists if possible and your overall design allows

Most other things have already been said. Start by thinking correct design that uses proper algorithms. Avoiding to do stuff is better than doing the stuff faster. Before starting to make things faster profile to see if you are working on the right piece of code.

Callgrind can be quite nice tool when combined with KCachegrind UI. Unfortunately I think it only works under Linux and possibly BSD and OSX. Here is a nice screenshot of one of my old program profiling output:
http://hoho.bafsoft.net/callgraph1.png

If you really need to get close to metal you might want to learn using special libraries that allow reading performance monitoring counters. [url
http://icl.cs.utk.edu/papi/]PAPI[/url] is one such but it doesn't work that well under Windows and using it in general is quite painful. Once you get it working you can monitor all sorts of interesting stuff. Things like branch mispredictions, cache misses, clock cycles spent in FPU and integer units, amount of SIMD instructions, cycles spent waiting for recourses and lots of other things. My Core2 has over 500 events you can monitor.

__________
In theory, there is no difference between theory and practice. But, in practice, there is - Jan L.A. van de Snepscheut
MMORPG's...Many Men Online Role Playing Girls - Radagar
"Is Java REALLY slower? Does STL really bloat your exes? Find out with your friendly host, HoHo, and his benchmarking machine!" - Jakub Wasilewski

Onewing

Member #6,152

August 2005

I've been using some of these tricks to work on the performance of my Toggles game. The main menu ran at about 28-30 fps (on my computer) and I've gotten it up to 40+. The main game runs at 57-60fps, but if I take out the "water" portion of the game, it runs at 90-100fps. Obviously I need to work on that algorithm or try a new approach all together. I'm thinking of a method with palettes might work.

And without adding a library like fblend, some of the transparency/translucency methods are bogging it down too. I'd prefer not to add another library if all possible...

------------
Solo-Games.org | My Tech Blog: The Digital Helm

Vasco Freitas

Member #6,904

February 2006

So... "x *= 0.5;" is faster than "x /= 2.0;"?

HoHo

Member #4,534

April 2004

Quote:

So... "x *= 0.5;" is faster than "x /= 2.0;"?

Yes, it will be faster, though most compilers will probably make that optimization for you.

Arthur Kalliokoski

Second in Command

February 2005

[EDIT] Deleted stupid half baked thought.

They all watch too much MSNBC... they get ideas.

ImLeftFooted

Member #3,935

October 2003

The OP are good tips for when you write your code the first time. Except the avoiding new and delete, that one doesn't make any sense.

When you go back, the best tips are the obvious ones.

Quote:

Yes, it will be faster, though most compilers will probably make that optimization for you.

Unless 2.0 is in a variable, which it usually is.

How is my posting?

zenofeller_

Member #8,364

February 2007

i think what was meant was more along the lines of, avoid putting new and delete in for loops. which does make sense.

and yes, forget / as a math symbol. / is a mistyped slash and that's all. you got \ for strings and * for numbers and that's all you'll ever need.

I, however, am strongly suggesting that you START CAPITALIZING, FUCK IT! (gnolam)

Onewing

Member #6,152

August 2005

Quote:

Except the avoiding new and delete, that one doesn't make any sense.

What I meant is keeping new/delete to a minimum and doing them at the right/logical time. You wouldn't want to create the monsters every time you went into a random battle on an RPG. You'd want to create them into a data structure at the beginning and then just give the battle engine pointers to access their data, no load time needed (at least, that's what I'm doing in my CH title).

------------
Solo-Games.org | My Tech Blog: The Digital Helm

ImLeftFooted

Member #3,935

October 2003

But new and delete are tools, great tools. Saying 'use them less' makes no sense.

What would make more sense is saying "use a smart allocation scheme". But use new and delete all you want, no sense holding out. And they don't turn evil once their inside a loop, that just makes no sense at all.

How is my posting?

Onewing

Member #6,152

August 2005

Quote:

What would make more sense is saying "use a smart allocation scheme".

I'm having a hard time putting my words together effectively today. I wonder if I've had a stroke. Checks memory

Anyway, that's what I mean. It's the same principle as to why you create a trig. lookup table before the main processing. It makes things faster to lookup the value in a table rather than performing the trig function as needed.

------------
Solo-Games.org | My Tech Blog: The Digital Helm

ImLeftFooted

Member #3,935

October 2003

Quote:

I'm having a hard time putting my words together effectively today. I wonder if I've had a stroke. Checks memory

Anyway, that's what I mean.

So then we agree:)

Quote:

It's the same principle as to why you create a trig. lookup table before the main processing. It makes things faster to lookup the value in a table rather than performing the trig function as needed.

Depends on the processor, most modern processors implement sin and cos as single instructions. But I think I understand your point.

How is my posting?

anonymous

Member #8025

November 2006

Quote:

Plus, minus operations are faster than multiplication which is faster than divide. Multiplication can be significantly faster than divide on some machines.

Alas, you don't have the choice very often, do you?

Quote:

Pointer Indirection. Every indirection causes a "cache miss" which is ultimately some degree of slow down. Unfortunately, there's usually nothing you can do about this, other make your code less OOP and/or readable to comply.

I did some testing with a class function (it took a random int and returned the n+n) and called this an immense number of times in three ways: a normal stack object (2633), dynamically allocated object (2644), another version of the class using virtual function (3044). (Sample run, times were rather consistent.)

While the difference between the first two is marginal, virtual function calls are slower.
There are a few considerations, though:
1) benchmarking some random small test program is pretty meaningless practically;
2) if you really need polymorphic behaviour (e.g don't know which function to call until runtime), OOP approach might still be faster than the equivalent code with state flags and switches etc. The sad thing is, the code would be too different to write both versions to profile them both...