Performance: Division & Multiplication
type568

I heard, that division (/) is a very slow operator, and that * is a lot faster. So, my question is, are these rumors true, for floating point variables(float)? I have some performance sensitive code, where objects have their masses, and their accelerations vary according to them. I have to divide, the force affecting an object, by the object's mass, to get the acceleration. Should I store 1/mass, and then multiply by the mass, to gain performance?

Thanks.

Thomas Harte

It's not really that simple any more. Most processors have multiple arithmetic units, and they schedule your instructions to try and use as many of those as possible at once. Nowadays most arithmetic units that can do both multiply and divide do them in the same amount of time, but some CPUs have units that can't do both. As an example, the G5 which Apple used prior to switching to Intel had two arithmetic units, one could do divide and multiply, the other only multiply. The processor before, the G4, had four arithmetic units, of which three could only do addition and subtraction and the fourth could do divide and multiply.

To be honest, I don't think storing 1/mass is going to make any tangible difference to your processing. But you might consider storing 1/mass anyway, because it means you can conceptually store objects with 'infinite' mass, i.e. objects that cannot be moved. Which are quite common in games.

type568

Thank you, and the idea of 1/mass, is great, I planned to hold another integer variable, "is_static" for this purpose.

But, after looking at the code, and thinking about the division, I found out that, I also have a variable similar to fps (frames per second..) by which I divide a lot of things every frame, to guarantee the picture motion not to be fps dependent. So, should I replace it with 1/(it)?

Thanks.

Goalie Ca

if you are going to be dividing by 1/mass in a lot of places it is a really good idea to cache 1/mass and multiply instead.

type568
Quote:

if you are going to be dividing by 1/mass in a lot of places it is a really good idea to cache 1/mass and multiply instead.

I'm going to get different results then, I suppose..

aj5555

1. premature optimzation. your sensitive code probably amounts to .001% of your programs runtime. dont get your panties in a bunch over code that takes .001% of runtime.

2. division is slower than multiply. sometimes (and its worth timing this to be sure).. you can multiply by a factor, then >> to achieve the same division, but at higher speed. it really depends on how many items you have to divide.

3. trust no one. Benchmark your results, half of everything you hear is rumour or based on limitations of old hardware.

type568

I'm doing a "model", of some physical phenomenon, so actually the performance limits the amount of objects can be processed at a time, and the complexity is n*m.

Thanks.

Audric

There are compiler options to compile in profile mode. It will add extra instructions in the executable that count how many time each function is used, how much time is spent inside, etc. After a good run, you can analyze the report to see which functions are worth your time.

Evert
Quote:

So, my question is, are these rumors true, for floating point variables(float)?

Oh, yes.

Quote:

Should I store 1/mass,

Yes, definately. Avoid divisions as much as you can.

Quote:

and then multiply by the mass

Why do you need to multiply by the mass again? If you're dividing it out first and then multiply by it again, the best thing to do is not to do either. ;)

Albin Engström

Um, isn't multiplication faster because you don't always deal with floats, i mean, would this 100*0.5 be faster than 100/2? ???

Audric

AFAIK, converting int <-> float is slower than any mathematical operator.

GullRaDriel

I remember using bit shift for power of two multiplication/division on integers.

int a = 200 , b = 50 , result = 0 ;

result = ( a >> 1 ) + ( b << 1 ) ;
/* result = ( a / 2)  + ( b * 2 ) */

printf( "%d\n" , result ); /* this would output 200 */

Evert

Guys, the original question was about floats specifically.

For integers, I think the difference between multiplication and division (in terms of speed) is less severe, if I recall correctly. Yes, bitshifts are faster for integer power-of-two divisions (or multiplications for that matter), but the compiler should catch those anyway and do the optimisation for you.

Oh, minor detail: don't mix float and integer arithmetic if you can avoid it. It'll be slower than plain floats becaused you get the extra cost of promoting the integer to a float.

Audric

By the way, in the C sources, I'm not sure the compiler optimizes implicit conversions on litterals, so be sure to write float litterals : 1.0 / mass, not 1 / mass.

type568
Quote:

Why do you need to multiply by the mass again? If you're dividing it out first and then multiply by it again, the best thing to do is not to do either. ;)

Well, I mean multiply by the coefficient that I get..

Quote:

Audric: By the way, in the C sources, I'm not sure the compiler optimizes implicit conversions on litterals, so be sure to write float litterals : 1.0 / mass, not 1 / mass.

Thanks, but I'm quite sure that if mass, is a float- in both cases I'm going to get float as a result.

And, yes I'm using only floats..

And, by the way, your posts have awakened another question:

A: 100/2
B: 100*0.5

What is faster?

Quote:

Audric: There are compiler options to compile in profile mode. It will add extra instructions in the executable that count how many time each function is used, how much time is spent inside, etc. After a good run, you can analyze the report to see which functions are worth your time.

I have no idea how to enable it in VS8 though :( What is this option's name?

Thanks, to everybody.

Audric
Quote:

if mass is a float, in both cases I'm going to get float as a result.

You don't seem to believe my advice that unnecessary casting ints to floats will make your "optimization" slower than the original.

Quote:

A: 100/2
B: 100*0.5
What is faster?

For a C compiler, A produces an integer, while B takes 200% time and produces a float.

Profiling: No idea, the Allegro manual only gives precise indications for gcc compilers.

Vanneto
Quote:

You don't seem to believe my advice that unnecessary casting ints to floats will make your "optimization" slower than the original.

Your advice is valid and true. But wouldn't the compiler cast the int to a float at compile-time?

type568
Quote:

You don't seem to believe my advice that unnecessary casting ints to floats will make your "optimization" slower than the original.

Oh, ok.. But as aj5555 said, only what worth optimization should be optimized, and mass / mass_coefficient, are only initialized once.

HoHo
Quote:

I have no idea how to enable it in VS8 though :(

There is no such mode in the free version. AFAIK it is only availiable in the enterprise version. Get a real compiler, like GCC ;)

Audric

If you put integer litterals in float computations, you'll grow carefree and a seemingly innocent "new_size = 75/100 * old_size" will devour you.

aj5555

project > properties > c++ > code generation > float point model

type568
Quote:

There is no such mode in the free version. AFAIK it is only availiable in the enterprise version. Get a real compiler, like GCC ;)

I'm using Microsoft Visual Studio 2005 Professional Edition, not like I'm a professional, but anyways.

Quote:

Audric: If you put integer litterals in float computations, you'll grow carefree and a seemingly innocent "new_size = 75/100 * old_size" will devour you.

I understand.. but in such cases if I remember, I do it 75.0/100, or cast the 100 in to float.

Quote:

aj5555: project > properties > c++ > code generation > float point model

Hmm, how does it work though, if I choose let's say fast. And, does it affect double? Thanks for the tip though.

So, in visual studio, no such option, to see how many time does the program spend in each of the functions.. ?

HoHo
Quote:

I'm using Microsoft Visual Studio 2005 Professional Edition

I'm not 100% sure but I think Professional edition doesn't have profiler either. Then again I doubt you need to worry about performance that soon.

Quote:

So, in visual studio, no such option, to see how many time does the program spend in each of the functions.. ?

You could manually measure the time by using some high-precision counters (allegro timers are not one of those). Perhaps googling for RDTSC can help you. Though under Windows you also have some other functions whose name I can't remember exactly. It was something like GetXXXFrequency and GetXXXCounter or something like that :)

aj5555

as far as GCC being a real compiler... MSVC produces faster code (on average)... GCC's strength is in its warning/error, and standards compliance. MSVC is more forgiving (which leads noobs to beleive they are doing things correctly). however as far as code generation goes, you'd have a tough time saying which compiler is better.

type568, float point model is a hint to the compiler about how to handle rounding situations and how to cut-corners when int-float-int conversions are done, if your not uber critical of the results, use fast. you can always select this option on a file by file basis.

do NOT use RDTSC on multi-core machines, your results will not be consistant.
on win32 use QueryPerformanceCounter()

HoHo
Quote:

as far as GCC being a real compiler... MSVC produces faster code (on average)

Perhaps when your code consists of awfully lot of templates. If you don't have too much of them GCC is one of the best optimizing compilers out there being only a little behind ICC, at least for heavy number crunching. When we compare 64bit compiling then GCC can even beat ICC leaving MSVC far behind.

Quote:

do NOT use RDTSC on multi-core machines, your results will not be consistant.

Are you sure? I somehow remember using it on my older P4 dualcore and it worked fine. Perhaps it is some kind of Linux thing that makes it work though.

Quote:

on win32 use QueryPerformanceCounter()

Ah, that was what it was called :)

aj5555

there is a discussion about it on MSDN somewhere. laptops under SpeedStep conditions will give varying RTDSC results.. speedstep can occur quite often.

RTDSC is a CPU instruction, from which CPU will it obtain the result?

HoHo

Yes, speedstep will screw up things. Still RDTSC is the counter with the least amount of overhead compared to other methods (IIRC windows *PerformanceCounter functions were around 100x slower).

Quote:

RTDSC is a CPU instruction, from which CPU will it obtain the result?

unless your cores run at different frequency they should have equal counter value. IIRC it started at zero on bootup. I know AMD offered a patch to Windows that fixes it (nothing for Linux, guess it works there). Though I'm not sure if it also helps with speedstep.

Thomas Harte

Per wiki, and I'm inclined to agree:

wiki said:

There is no longer any promise that the timestamp counters of multiple CPUs on a single motherboard will be synchronized. So, you can no longer get reliable timestamp values unless you lock your program to using a single CPU. Even then, the CPU speed may change due to power-saving measures taken by the OS or BIOS, or the system may be hibernated and later resumed (resetting the time stamp counter)

In addition, RDTSC is a ring 0 (kernel mode) instruction so shouldn't be available to your code anyway. Assuming your OS is well written, using RDTSC should cause your program to crash.

type568

So, no benchmarking routines available for newbie programmers?

aj5555

it doesn't matter if QueryPerformanceCounter() is 100x slower... as long as its consistant.. real-clock cycles is irrelevant, your only ever comparing the results against another set of results taken by the same clock. the two result sets are compared against each other. whats important is the resolution is high enough to actually see when there is a difference.. hence why allegro's timers are in-adequete. (spelled that one wrong!!)

[edit] you should use QueryPerformanceCounter() but if your running on a molti-core machine, lock your process to 1 CPU, using SetProcessMask()

HoHo

With multisocket CPUs I agree that it isn't reliable, at least not without binding your program to specific CPU.
Also I'm quite sure that RDTSC is not ring 0 instruction (why should it be?) as I've been able to use it without problems in XP and Linux without having administrator privileges.

I'm not saying that RDTSC is the one best way to measure performance. I just say that it is one of the easiest to use*, it is supported on pretty much every x86 CPU and it doesn't have too much overhead. Of course to be able to use it you must do all sorts of other tricks like disable power saving functions and lock your program to run on one CPU core.

In later CPUs there exist a new instruction that is similar to RDTSC but it increases at constant rate and shouldn't get messed up with multisocket/core machines.

[edit]

Quote:

it doesn't matter if QueryPerformanceCounter() is 100x slower... as long as its consistant

For as long as QPC itself doesn't take comparable amount of time to execute than the piece of code you are measuring it can be used quite well.

aj5555

its not a Ring0 anything, its an actaul CPU instruction, just like all the other CPU instructions your code runs, it does require ASM code, because C has no equivalent.

"For as long as QPC itself doesn't take comparable amount of time to execute than the piece of code you are measuring"

huh? the time QPC takes compared with the code is irrelevant too. if QPC takes 1second to execute, and your code takes 1ns, the result will be 1.000000001 sec
if your 2nd test code takes 2ns, the result will be 1.00000002 sec.

from that you can calcuate that the two peices of code differ by 1ns.

result.

Vanneto
Quote:

Its not a ring 0 anything, its an actual CPU instruction, just like all the other CPU instructions your code runs. It does require ASM code, because C has no equivalent.

Quote:

For as long as QPC itself doesn't take comparable amount of time to execute than the piece of code you are measuring

Huh? the time QPC takes compared with the code is irrelevant too. if QPC takes 1 second to execute, and your code takes 1ns, the result will be 1.000000001 sec
if your 2nd test code takes 2ns, the result will be 1.00000002 sec.

From that you can calculate that the two pieces of code differ by 1ns.

Fixed!

HoHo
Quote:

huh? the time QPC takes compared with the code is irrelevant too.

Not exactly. During execution of that instruction all sorts of "fun" things can happen like context switch. E.g one time it might execute in a hundred ns, next time a few thousand.

Thomas Harte
Quote:

its not a Ring0 anything, its an actaul CPU instruction,

You obviously don't know what you're talking about, so I don't intend to engage in a full discussion with you. This link might help, but I doubt you'll bother reading or comprehending it.

HoHo

If I may ask where did you read that RDTSC is a ring 0 instruction? This is the first time I hear about it.

Thomas Harte

Oh, I could just be a million years out of date. RDTSC was definitely a ring 0 instruction on the original Pentium (source, first one I could find), nothing immediately suggests it still is. And I can't really find time for a thorough search right now. What do Intel say?

Arthur Kalliokoski

The RDTSC is read-only in Winduhs, you can read or write to it in DOS (ring 0), I've read it in Linux but didn't try writing to it. Probably Linux & Windows set some permissions or something, I doubt it's trapping it with an exception.

And division will always be slower than multiplication, the trial divisor stuff (or equivalent brute force) and dependency on what the previous subtraction left makes it hard. Theoretically a multiplication could be split up among multiple CPUs.

HoHo
Quote:

The RDTSC is read-only

Actually RDTSC is a machine instruction that copies a number from deep within CPU to regular registers but you probably already know that :)
Are you talking about some other instruction that writes to the place where RDTSC holds its counter value?

Arthur Kalliokoski

Actually there's a WDTSC or something to write to the time stamp counter, that's what I meant I used in DOS. The earliest Plentiums had to use MOV's with eax, ecx and some weird mmr register or something. It was ten years ago, I'm a little hazy on it.

aj5555

Thomas, dont know what I'm talking about huh?
yes I did read the article, didn't see any mention of RTDSC.

Evert
Quote:

dont know what I'm talking about huh?

Quote:

its not a Ring0 anything, its an actaul CPU instruction, just like all the other CPU instructions your code runs

No, I'd say you don't know what you're talking about - or at least you don't know what ``Ring 0'' means.
That's not a bad thing per se, we all need to learn things at some point.

GullRaDriel
Thomas Harte

Right. I've grabbed the official intel x86 documentation, and found this on page 644:

Quote:

The time stamp disable (TSD) flag in register CR4 restricts the use of the RDTSC instruction. When the TSD flag is clear, the RDTSC instruction can be executed at any privilege level; when the flag is set, the instruction can only be executed at privilege level 0.

So RDTSC can be ring 0 if the OS wants it to be. The availability of RDTSC therefore technically depends on your CPU — though I think everything P2 equivalent and upwards has it (?) so that's probably not much of a concern — and your operating system. So it may not be smart to advise it as a general solution, and it definitely isn't a good idea to use it in shipping code unless you can find a guarantee as to its availability from your OS manufacturer.

Vanneto

How does this work? Once you set the TSD flag it cannot be unset? So basically the program that first runs ( the OS ) can set it and have Ring 0 privileges. Is this how it works?

HoHo

Without reading the documentation I assume that it is set by the CPU itself during bootup (before any BIOS stuff gets shown)

GullRaDriel

Hoho, a pentium 4 does not have 2 core, it has hyperthreading which is quite different, which can explain why you had consistent result with RTDSC

EDIT: correction, some have both hyperthreading and dual core.

HoHo
Quote:

some have both hyperthreading and dual core.

... and I've had both :)

Thread #593639. Printed from Allegro.cc