![]() |
|
'Fixed point' |
Thomas Harte
Member #33
April 2000
![]() |
Having a poke around my code in OS X I had a glance into the Allegro 'fixed point' routines. I observed this is the active fixmul for my platform: AL_INLINE(fixed, fixmul, (fixed x, fixed y), This is horrendously bad! Float/int conversions are even slower on PowerPC than intel because the two units of the processor are conceptually separated and a transfer through memory must occur. Would the following not be a smarter fallback for machines without an assembly implementation: unsigned int sign = (r.value^value)&0x80000000; Sint32 Multiplicand = value; if(Multiplicand < 0) Multiplicand = -Multiplicand; if(r.value < 0) r.value = -r.value; Fixed Newval; Newval.value = (Multiplicand >> 8)*(r.value >> 8) + (((Multiplicand >> 8)*(r.value&0xff)) >> 8) + (((r.value >> 8)*(Multiplicand&0xff)) >> 8); if(sign) Newval.value = -Newval.value; return Newval; Or some "smarter about signs and shifts" variation? And while I'm here posting, could there not be some compile time way of determining whether real fixed point or nasty triple cast fixed point implementations of things like fixmul are available? [My site] [Tetrominoes] |
Peter Wang
Member #23
April 2000
|
If your implementation benchmarks and tests well then of course we would accept it. I don't know if anyone will have time before the next beta (on Friday). Quote: And while I'm here posting, could there not be some compile time way of determining whether real fixed point or nasty triple cast fixed point implementations of things like fixmul are available? I don't think so. EDIT: Ok, I decided to do some quick benchmarks before going to sleep. The patch I'm using is attached. The test machine is a Pentium-4 2.4GHz, running Linux 2.6.10, gcc 3.3.4. I'm using the "Misc | Time some stuff" function from `tests/test'. Here are the rather surprising results:
I didn't really check that the patched version gives the same results, but the 3d and sprite rotate examples seemed to run ok. EDIT: 2005-05-17 added results for Evert's patch that uses LONG_LONG where possible. There seems to be a big discrepancy between the two set of asm results, so I don't know what happened there. The C results seem to match up though. The two "patched C" rows use the same implementation of fixdiv (making use of LONG_LONG).
|
Thomas Harte
Member #33
April 2000
![]() |
Quote: I didn't really check that the patched version gives the same results, but the 3d and sprite rotate examples seemed to run ok. It's unlikely to give exactly the same results as the three cast method, but then so is the ia86 32x32->64 integer multiply. I'd be surprised if the average divergance in results was substantially different though. Quote: Here are the rather surprising results: So in standard mode the patch is fastest, in debug the asm is fastest and the three cast method mysteriously overtakes the asm in standard mode? That is a little strange. Perhaps someone smarter than me can eliminate the sign stuff in my method. The problem with signed numbers is the &0xff's throw away the sign. If that could be avoided then all the conditionals could be removed and things should be even better. Possibly there is some similar method for fixdiv too... [My site] [Tetrominoes] |
nonnus29
Member #2,606
August 2002
![]() |
I remember reading that back in the 68000 days Macs had a div and mult opcode; powerpc doesn't? Mips risc does... I'm not clear on why shifts and &'s are needed; if you multiply two 32 bit values the result will be 64 bits; this result requires an arithmetic left shift of 16 (for 16.16 format) and truncation to 32 bits; the sign should be preserved. For division again only arithmetic shifts are required preserving the sign. And it doesn't have to be 16.16 either.... edit; heres some code;
edit2; hmm, looks like you'll lose the sign of a in the div() function... |
Marcello
Member #1,860
January 2002
![]() |
remember that long is only 32bits in c Marcello |
BAF
Member #2,981
December 2002
![]() |
will this work on 64bits? BTW, I have a 64bit gentoo compile, I can test stuff for you guys. |
Evert
Member #794
November 2000
![]() |
Marcello said: remember that long is only 32bits in c
Depends on your platform and compiler. It's 64 bit on my AMD machine, and possibly on 64 bit Macs as well. |
gnolam
Member #2,030
March 2002
![]() |
Marcello said: remember that long is only 32bits in c Actually, it's ≥ 32 bits, ≥ int... -- |
Marcello
Member #1,860
January 2002
![]() |
right, point being, don't use it if you want 64bits. |
Thomas Harte
Member #33
April 2000
![]() |
Quote: right, point being, don't use it if you want 64bits. Don't use what? Thinking about it a little harder, if the pure C fixmul is casting to floats, then it is loosing the low 8bits of each fixed number before multiplying anyway. Therefore the following should be just as good: (x >> 8)*(y >> 8) Conversely, if it is casting to doubles then it is no wonder that speed isn't great! Quote: I remember reading that back in the 68000 days Macs had a div and mult opcode; powerpc doesn't? Mips risc does... Why do you suppose PowerPC has no multiply or divide operations? I can see no reason you would conclude that from the thread. In fact the PowerPC has a 32x32->64 multiply capacity even in 32bit mode, through the issuing of two consecutive instructions. Otherwise you only get a more normal 32x32->32 multiply, what with RISC machines being at least partly designed for RISC usage. I'll see if I can figure out the correct ASM and so on for PowerPC but it is at least slightly beside the point if the pure-C can still be improved. [My site] [Tetrominoes] |
nonnus29
Member #2,606
August 2002
![]() |
Oops; 'long long' is 64 bits or is it 'long int'? edit; I assumed that if multiply and divide opcodes were available then they would've been used in allegro; guess not... |
Thomas Harte
Member #33
April 2000
![]() |
Depends on your compiler and processor. You may not even have a 64bit integral type. EDIT: Quote: edit; I assumed that if multiply and divide opcodes were available then they would've been used in allegro; guess not... Oh, no, the Mac OS X port just isn't complete yet in terms of things it could be doing to avoid the Allegro C fallbacks but isn't. [My site] [Tetrominoes] |
Evert
Member #794
November 2000
![]() |
Quote: In fact the PowerPC has a 32x32->64 multiply capacity even in 32bit mode So do the i386+ processors, if I recall correctly (they set the lower word in eax and the high word in edx, I think). Quote: Oops; 'long long' is 64 bits or is it 'long int'? long long may be 64 bit if it is defined at all (prior to C99, it was a GNU extension). Allegro defines a (u)int64_t on non-C99 compilers, but it's just a guess and probably unreliable on non-C99 compilers. Quote: I assumed that if multiply and divide opcodes were available then they would've been used in allegro Allegro only uses assembler code on the ix86, which is actually suboptimal on recent processors. There's some minor assembler code of AMD64, but it's trivial and mostly unimportant (it retrieves the CPUID flags). |
nonnus29
Member #2,606
August 2002
![]() |
These work with mingw 3.2;
|
Thomas Harte
Member #33
April 2000
![]() |
Quote: So do the i386+ processors, if I recall correctly (they set the lower word in eax and the high word in edx, I think). Yes, all I meant was 'even the RISC PowerPC' rather than 'unlike the yucky intel'. I'm not surprised at all that the CISC chip does. [My site] [Tetrominoes] |
Peter Wang
Member #23
April 2000
|
I added some results for Evert's latest patch with uses LONG_LONG where possible. See the table above.
|
nonnus29
Member #2,606
August 2002
![]() |
n/m I'm smoking crack or something. It's a little faster it seems. |
Evert
Member #794
November 2000
![]() |
Interesting...
On the Xeon, I got the following results: Xeon results said:
sizeof(long long) = 8, sizeof(fixed) = 4
I guess this is consistent with what Peter obtained. For my 64 bit machine, the result is AMD64 said: Timing 100000000 calls to fixmulf...CPU time = 4.133372 secs. As expected, the long long multiplication is the fastest here. I'm puzzled by the relative and absolute poor performance of the long long division code though: the 32 bit machine outperforms the 64 bit machine when dividing 64 bit integers... I'll have to check out what's wrong there. EDIT: Xeon said:
sizeof(long long) = 8, sizeof(fixed) = 4
AMD64 said:
sizeof(long long) = 8, sizeof(fixed) = 4
And I never want to hear anyone say that Xeon (or Intel) are no good and AMD is better. EDIT2: Did a more detailed run and check, now with a larger number of loop iterations and three machines: the 2.8GHz Xeon-HT (actually dual Xeon-HT), the AMD64 3200 and a 950MHz Celeron. Times are reproducible with a deviation of about 1%. results said:
Xeon: Celeron: AMD64:
The fixmull is the fastest on all machines (although it doesn't matter much at all on the AMD64). The fixdivl is the faster than fixdivf on the AMD64, but slower on the Xeon and an order of magnitude slower on the Xeon. |
Thomas Harte
Member #33
April 2000
![]() |
So conclusion is, on your high end machines: fixdivl makes some difference for you, nothing else really seems to be significant. Results of your program on my 667Mhz G4, OS X v10.4.1 (with iTunes and firefox eating 15% CPU before anything else even gets a look-in): sizeof(long long) = 8, sizeof(fixed) = 4 Digging into the assembly (compiled with O2): fixmull said: .align 2 Looks pretty good to me. fixmuli said: .align 2 Again, not awful. fixmulf said: .align 2 Note especially all the stws (store word) and lfds (load floating point double). I would guess that all of the artifical examples are getting unrealistic figures for the multiplications involving floats because they are able to overlap usage of the floating point and integer units where a real program might be able to do the same with their own calculations unrelated to the mul. This alone should make fixmuli/l preferable to fixmulf wherever their timing is not substantially worse in i/l respects. Quote: So based on this I would propose to apply the fixdivl patch regardless and only use the fixdivl patch on native 64-bit machines (does that include Macs?). The G5s (generally hanging around since 2001 I think) include support for the 64bit mode (which had been in the CPU specs since PowerPC came around in about 1993 but unimplemented in any consumer design until a few years ago) and under OS X v10.4+ it is possible to build fat binaries that include 64bit operation paths for G5s. Under 10.3 and before, 64bit is only used to provide a larger address space (unless the user asms some in themselves, of course). My G4 has no native 64bit processing except in the vector unit and that probably isn't advisable for non-vectors for all the obvious reasons. In any case, the fixdivl is substantially faster even on 32bit CPUs for PowerPC! EDIT: a true console build and run produces different results (the other having been dumping results from a Cocoa application to a log window due to that being the quickest way to get an XCode project going): Timing 100000000 calls to fixmulf...CPU time = 0.305241 secs. Notably fixmuli has leapt upwards. I have no idea why this is, but the results still show very clearly that fixmull and fixdivl are the way forward so I'm not really that interested. [My site] [Tetrominoes] |
Peter Wang
Member #23
April 2000
|
Sorry, but I think Evert's test is deeply flawed. On my machine, gcc -O2: Quote:
sizeof(long long) = 8, sizeof(fixed) = 4 Now I comment out the call to fixdivl: Quote:
sizeof(long long) = 8, sizeof(fixed) = 4 I basically don't know assembly, but as far as I can tell we're just measuring empty loops, and for some reason the fixmull and fixdivl were compiled in such a way that they executed faster. If I accumulate the results of the fixmul/fixdiv calls, e.g. like this: (full changed file attached) the results are then: Quote:
sizeof(long long) = 8, sizeof(fixed) = 4
|
Thomas Harte
Member #33
April 2000
![]() |
EDIT to earlier post, moved down due to Peter Wang comment in mean time: I found a reason that your code is a very poor example when rifling through the assembly: the result of z is never used so gcc doesn't calculate it in some of the loops! Hence your fixmul loops aren't actually doing the multiplication and it is no wonder that you suddenly get the same results for everything! I changed the z calculation to a += and added it as output to the printfs to make sure that it is actually calculated every loop. Meaningful results (O3) are: sizeof(long long) = 8, sizeof(fixed) = 4 I shall look into that fixdivf vs fixdivl. EDIT: so as not to confuse things, I ran Peter Wang's adaption. Results are: O2: No optimisations: [My site] [Tetrominoes] |
Evert
Member #794
November 2000
![]() |
Ah, you're quite right. I should have realized the compiler would optimize out an unused variable! Quote: Xeon: Celeron: AMD64:
Basically, fixmull is a good idea on the AMD64, but a bad idea on the two Intel machines. For fixdivl, it is actually the other way around (!). I'll look at the detailed assembler output later on. |
Thomas Harte
Member #33
April 2000
![]() |
I've got "gcc version 4.0.0 20041026 (Apple Computer, Inc. build 4061)". Having been code digging, it appears that the 64bit divide calls some other function for a reason I don't fully understand yet given that the PowerPC has the divd 64 bit divide instruction. I'm going to sniff around this topic and see if I can figure anything out. EDIT: from a Mac vs PC viewpoint, is the fact that my 667Mhz G4 apparently more than twice as fast as the 950Mhz Celeron for multiplies (but less than half the speed for divides) surprising? In the sense that it may suggest problems with the test. [My site] [Tetrominoes] |
Evert
Member #794
November 2000
![]() |
Quote: from a Mac vs PC viewpoint, is the fact that my 667Mhz G4 apparently more than twice as fast as the 950Mhz Celeron for multiplies (but less than half the speed for divides) surprising? I don't think so. I used 300MHz Sun workstations that were faster for some calculations. That said, it's quite possible I bodged up the test somewhere else too - I'm not a computer professional, afterall, so it's possible that there is some benchmarking subtlety I'm unaware of. It looks like it should be doing the right thing now though. |
orz
Member #565
August 2000
|
This data is for MSVC 7.1 / Athlon XP 2500+ A few notes: All of the different methods produce different results. Most handle rounding differently from each other, and fixmulf handles overflow differently. I suggest modifying the source to initialize x and y to some known value before each loop. Enabling SSE1 (which was disabled by default -this is MSVC 7.1) improves performance of fixmulf by 20%. A simple by-hand translation of fixmull to inline asm improved performance by 10% despite the inefficiencies of inline asm on MSVC (you can't tell the compiler where you want stuff and what gets clobbbered and all that like you can in gcc). Performance scaled linearly on my computer for loop counts all the way down to 10^3, for all functions except fixmulf, which only scaled down to loop count = 10^5 (at 10^4 it took almost twice as long per loop, at 10^3 it took over 3 times as long per loop). LOOP_COUNT = 200000000 total seconds 2.100045 fixmulf 2.185031 fixmuli 1.431395 fixmull 1.278765 fixmulasm 7.206338 fixdivf 12.066815 fixdivl
|
|