While having a poke around my code on OS X I had a glance into the Allegro 'fixed point' routines, and I observed that this is the active fixmul for my platform:
AL_INLINE(fixed, fixmul, (fixed x, fixed y),
{
   return ftofix(fixtof(x) * fixtof(y));
})
This is horrendously bad! Float/int conversions are even slower on PowerPC than on Intel, because the floating-point and integer units of the processor are architecturally separate and the value has to make a round trip through memory. Would the following not be a smarter fallback for machines without an assembly implementation:
unsigned int sign = (r.value ^ value) & 0x80000000;   /* remember the sign of the result */

Sint32 Multiplicand = value;
if (Multiplicand < 0)
   Multiplicand = -Multiplicand;
if (r.value < 0)
   r.value = -r.value;

/* 16.16 multiply built from three partial products of the unsigned values
   (the low x low term is dropped) */
Fixed Newval;
Newval.value = (Multiplicand >> 8) * (r.value >> 8) +
               (((Multiplicand >> 8) * (r.value & 0xff)) >> 8) +
               (((r.value >> 8) * (Multiplicand & 0xff)) >> 8);

if (sign)
   Newval.value = -Newval.value;
return Newval;
Or some "smarter about signs and shifts" variation? And while I'm here posting, could there not be some compile time way of determining whether real fixed point or nasty triple cast fixed point implementations of things like fixmul are available?
If your implementation benchmarks and tests well then of course we would accept it. I don't know if anyone will have time before the next beta (on Friday).
And while I'm here posting, could there not be some compile time way of determining whether real fixed point or nasty triple cast fixed point implementations of things like fixmul are available?
I don't think so.
EDIT: Ok, I decided to do some quick benchmarks before going to sleep. The patch I'm using is attached. The test machine is a Pentium-4 2.4GHz, running Linux 2.6.10, gcc 3.3.4. I'm using the "Misc | Time some stuff" function from `tests/test'. Here are the rather surprising results:
Standard mode (fixmuls per second)

asm:       25886122, 26106294
current C: 29301026, 29226952
patched C: 31416471, 31518915

----------------------------------------------------------------------

Debug mode

asm:       22182461, 22209498
current C:  9916082,  9649226
patched C: 21665953, 21693887

======================================================================

2005-05-17

Standard mode               fixmul/sec              fixdiv/sec

asm                         29845165, 29802881      24061238, 23802441
unpatched C                 29462936, 29477438      15233914, 15289430
patched C (32-bit fixmul)   31846917, 31742140      18414104, 18262372
patched C (64-bit fixmul)   27817735, 27833801      18468870, 18479899
I didn't really check that the patched version gives the same results, but the 3d and sprite rotate examples seemed to run ok.
EDIT: 2005-05-17 added results for Evert's patch that uses LONG_LONG where possible. There seems to be a big discrepancy between the two sets of asm results, so I don't know what happened there. The C results seem to match up though. The two "patched C" rows use the same implementation of fixdiv (making use of LONG_LONG).
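For reference, a minimal sketch of what a LONG_LONG-based C fixmul can look like (an illustration only, not the attached patch; it assumes Allegro's LONG_LONG macro names a 64-bit integer type, and it omits the overflow clamping the real routines perform):

AL_INLINE(fixed, fixmul, (fixed x, fixed y),
{
   /* promote to 64 bits, multiply, then shift back down to 16.16;
      no overflow clamping here, unlike the real fixmul */
   return (fixed)(((LONG_LONG)x * (LONG_LONG)y) >> 16);
})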
I didn't really check that the patched version gives the same results, but the 3d and sprite rotate examples seemed to run ok.
It's unlikely to give exactly the same results as the three cast method, but then so is the x86 32x32->64 integer multiply. I'd be surprised if the average divergence in results was substantially different though.
Here are the rather surprising results:
So in standard mode the patch is fastest, in debug mode the asm is fastest, and the three-cast method mysteriously overtakes the asm in standard mode? That is a little strange.
Perhaps someone smarter than me can eliminate the sign stuff in my method. The problem with signed numbers is that the &0xff's throw away the sign. If that could be avoided then all the conditionals could be removed and things should be even better.
Possibly there is some similar method for fixdiv too...
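For what it's worth, the usual branchless way to strip a sign is an arithmetic-shift mask; a minimal sketch (a standard bit trick, not code from Allegro, and it assumes >> on a negative int is an arithmetic shift):

static inline int branchless_abs(int x)
{
   int mask = x >> 31;          /* 0 if x >= 0, -1 (all bits set) if x < 0 */
   return (x ^ mask) - mask;    /* |x| with no conditional */
}

Applying the same (result ^ mask) - mask step with the combined sign mask at the end negates the product when needed, which removes the final if(sign) as well.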
I remember reading that back in the 68000 days Macs had a div and mult opcode; PowerPC doesn't? MIPS RISC does...
I'm not clear on why shifts and &'s are needed; if you multiply two 32 bit values the result will be 64 bits; this result requires an arithmetic right shift of 16 (for 16.16 format) and truncation to 32 bits; the sign should be preserved. For division again only arithmetic shifts are required, preserving the sign. And it doesn't have to be 16.16 either...
edit: here's some code:
int mul(int a, int b) {
   long t;
   t = a * b;
   return (int)t>>16;
}

int div(int a, int b) {
   long t;
   t = a;
   t = t<<16;
   t = t/b;
   t = t>>16;
   return (int)t;
}
edit2: hmm, looks like you'll lose the sign of a in the div() function...
remember that long is only 32 bits in C
Marcello
will this work on 64bits?
BTW, I have a 64bit gentoo compile, I can test stuff for you guys.
remember that long is only 32 bits in C
Depends on your platform and compiler. It's 64 bit on my AMD machine, and possibly on 64 bit Macs as well.
Not sure if that helps though...
remember that long is only 32 bits in C
Actually, it's ≥ 32 bits, ≥ int...
right, point being, don't use it if you want 64bits.
right, point being, don't use it if you want 64bits.
Don't use what?
Thinking about it a little harder, if the pure C fixmul is casting to floats, then it is losing the low 8 bits of each fixed number before multiplying anyway. Therefore the following should be just as good:
(x >> 8)*(y >> 8)
Conversely, if it is casting to doubles then it is no wonder that speed isn't great!
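As a sketch, that shortcut written out as a function (purely illustrative; it truncates rather than rounds, so its results will differ slightly from the float path, particularly for negative inputs):

static inline int fixmul_lowprec(int x, int y)
{
   /* each 16.16 operand loses its low 8 bits, which is roughly the
      precision a 24-bit float mantissa keeps for a 32-bit fixed value */
   return (x >> 8) * (y >> 8);
}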
EDIT:
I remember reading that back in the 68000 days Macs had a div and mult opcode; PowerPC doesn't? MIPS RISC does...
Why do you suppose PowerPC has no multiply or divide operations? I can see no reason you would conclude that from the thread.
In fact the PowerPC has a 32x32->64 multiply capacity even in 32bit mode, through the issuing of two consecutive instructions. Otherwise you only get a more normal 32x32->32 multiply, in keeping with the RISC design philosophy.
I'll see if I can figure out the correct ASM and so on for PowerPC but it is at least slightly beside the point if the pure-C can still be improved.
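As a starting point, here is a rough and untested sketch of what a GCC inline-asm fixmul could look like on PowerPC using the mullw/mulhw pair (an illustration, not a tested Allegro patch; a plain long long version may well compile to the same two instructions anyway):

static inline int fixmul_ppc(int x, int y)
{
   int hi, lo;
   /* mullw gives the low 32 bits of the signed 64-bit product,
      mulhw the high 32; bits 16..47 of the product are the 16.16 result */
   __asm__ ("mullw %0,%2,%3\n\t"
            "mulhw %1,%2,%3"
            : "=&r" (lo), "=r" (hi)
            : "r" (x), "r" (y));
   return (int)(((unsigned int)hi << 16) | ((unsigned int)lo >> 16));
}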
Oops; 'long long' is 64 bits or is it 'long int'?
edit; I assumed that if multiply and divide opcodes were available then they would've been used in allegro; guess not...
Depends on your compiler and processor. You may not even have a 64bit integral type.
EDIT:
edit; I assumed that if multiply and divide opcodes were available then they would've been used in allegro; guess not...
Oh, no, the Mac OS X port just isn't complete yet; there are things it could be doing to avoid the Allegro C fallbacks that it isn't doing.
In fact the PowerPC has a 32x32->64 multiply capacity even in 32bit mode
So do the i386+ processors, if I recall correctly (they set the lower word in eax and the high word in edx, I think).
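As an illustration of that edx:eax pattern (a sketch in GCC inline asm, not Allegro's actual x86 routine; note it does no overflow checking at all):

static inline int fixmul_i386(int x, int y)
{
   int result;
   /* one-operand imul: edx:eax = eax * y; shrd then extracts bits 16..47
      of the 64-bit product, i.e. the 16.16 result */
   __asm__ ("imull %2\n\t"
            "shrdl $16, %%edx, %%eax"
            : "=a" (result)
            : "0" (x), "rm" (y)
            : "%edx", "cc");
   return result;
}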
Oops; 'long long' is 64 bits or is it 'long int'?
long long may be 64 bits if it is defined at all (prior to C99, it was a GNU extension). Allegro defines a (u)int64_t itself on non-C99 compilers, but it's just a guess and probably unreliable.
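As a rough illustration of the kind of compile-time guesswork involved (hypothetical macro and type names, not Allegro's actual configuration logic):

#if (defined __STDC_VERSION__ && __STDC_VERSION__ >= 199901L) || defined __GNUC__
   typedef long long my_int64;            /* C99, or the GNU extension */
   typedef unsigned long long my_uint64;
#else
   typedef long my_int64;                 /* fallback guess; may well not be 64 bits */
   typedef unsigned long my_uint64;
#endif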
I assumed that if multiply and divide opcodes were available then they would've been used in allegro
Allegro only uses assembler code on the ix86, which is actually suboptimal on recent processors. There's some minor assembler code for AMD64, but it's trivial and mostly unimportant (it retrieves the CPUID flags).
These work with MinGW 3.2:
#include <stdio.h>

int mul16_16(int a, int b) {
   long long int t = ((long long int)a * (long long int)b) >> 16;
   return (int)t;
}

int div16_16(int a, int b) {
   long long int t = (long long int)a << 32;
   return (int)((t/b)>>16);
}

void print16_16(int a) {
   int t = a >> 16;
   printf("\n%i.",t);
   t = a & 0x0000FFFF;
   printf("%i",t);
}
So do the i386+ processors, if I recall correctly (they set the lower word in eax and the high word in edx, I think).
Yes, all I meant was 'even the RISC PowerPC' rather than 'unlike the yucky intel'. I'm not surprised at all that the CISC chip does.
I added some results for Evert's latest patch, which uses LONG_LONG where possible. See the table above.
n/m I'm smoking crack or something.
It's a little faster it seems.
Interesting...
I used the following programme to time the speed of the different methods (except the asm version, which I can't check easily and am not really interested in anyway) on two machines: my 32-bit Xeon workstation (running RedHat Enterprise) and my AMD64 computer at home (running Gentoo Linux).
#include <stdio.h>
#include <time.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <sys/types.h>

#define CPUTIME (getrusage(RUSAGE_SELF,&ruse),\
   ruse.ru_utime.tv_sec + ruse.ru_stime.tv_sec + \
   1e-6 * (ruse.ru_utime.tv_usec + ruse.ru_stime.tv_usec))

struct rusage ruse;

extern int getrusage();

typedef int fixed;

#define LOOP_COUNT 100000000

/* ftofix and fixtof are used in generic C versions of fixmul and fixdiv */
inline fixed ftofix(double x)
{
   if (x > 32767.0) {
      return 0x7FFFFFFF;
   }

   if (x < -32767.0) {
      return -0x7FFFFFFF;
   }

   return (int)(x * 65536.0 + (x < 0 ? -0.5 : 0.5));
}


inline double fixtof(fixed x)
{
   return (double)x / 65536.0;
}

inline fixed fixmulf(fixed x, fixed y)
{
   return ftofix(fixtof(x) * fixtof(y));
}

inline fixed fixmull(fixed x, fixed y)
{
   long long lx = x;
   long long ly = y;
   long long lres = (lx*ly)>>16;
   int res = lres;
   return res;
}

inline fixed fixmuli(fixed x, fixed y)
{
   fixed sign = (x^y) & 0x80000000;
   int mask_x = x >> 31;
   int mask_y = y >> 31;
   int mask_result = sign >> 31;
   fixed result;

   x = (x^mask_x) - mask_x;
   y = (y^mask_y) - mask_y;

   result = ((y >> 8)*(x >> 8) +
             (((y >> 8)*(x&0xff)) >> 8) +
             (((x >> 8)*(y&0xff)) >> 8));

   return (result^mask_result) - mask_result;
}

inline fixed fixdivf(fixed x, fixed y)
{
   if (y == 0) {
      return (x < 0) ? -0x7FFFFFFF : 0x7FFFFFFF;
   }
   else
      return ftofix(fixtof(x) / fixtof(y));
}

inline fixed fixdivl(fixed x, fixed y)
{
   if (y == 0) {
      return (x < 0) ? -0x7FFFFFFF : 0x7FFFFFFF;
   }
   else {
      long long lx = x;
      long long ly = y;
      long long lres = (lx << 16) / ly;
      int res = lres;
      return res;
   }
}

int main(void)
{
   double t0, t1;
   time_t u1,u2;
   fixed x, y, z;
   int c;

   printf("sizeof(long long) = %d, sizeof(fixed) = %d\n", sizeof(long long), sizeof(fixed));

   printf("Timing %d calls to fixmulf...", LOOP_COUNT);
   fflush(stdout);
   t0 = CPUTIME;
   /* place code to be timed here */
   for (c=0; c<LOOP_COUNT; c++) {
      z = fixmulf(x,y);
      x += 1317;
      y += 7143;
   }
   t1 = CPUTIME;
   printf("CPU time = %f secs.\n",t1-t0);

   printf("Timing %d calls to fixmuli...", LOOP_COUNT);
   fflush(stdout);
   t0 = CPUTIME;
   /* place code to be timed here */
   for (c=0; c<LOOP_COUNT; c++) {
      z = fixmuli(x,y);
      x += 1317;
      y += 7143;
   }
   t1 = CPUTIME;
   printf("CPU time = %f secs.\n",t1-t0);

   printf("Timing %d calls to fixmull...", LOOP_COUNT);
   fflush(stdout);
   t0 = CPUTIME;
   /* place code to be timed here */
   for (c=0; c<LOOP_COUNT; c++) {
      z = fixmull(x,y);
      x += 1317;
      y += 7143;
   }
   t1 = CPUTIME;
   printf("CPU time = %f secs.\n",t1-t0);

   printf("Timing %d calls to fixdivf...", LOOP_COUNT);
   fflush(stdout);
   t0 = CPUTIME;
   /* place code to be timed here */
   for (c=0; c<LOOP_COUNT; c++) {
      z = fixdivf(x,y);
      x += 1317;
      y += 7143;
   }
   t1 = CPUTIME;
   printf("CPU time = %f secs.\n",t1-t0);

   printf("Timing %d calls to fixdivl...", LOOP_COUNT);
   fflush(stdout);
   t0 = CPUTIME;
   /* place code to be timed here */
   for (c=0; c<LOOP_COUNT; c++) {
      z = fixdivl(x,y);
      x += 1317;
      y += 7143;
   }
   t1 = CPUTIME;
   printf("CPU time = %f secs.\n",t1-t0);

   return 0;
}
On the Xeon, I got the following results:
sizeof(long long) = 8, sizeof(fixed) = 4
Timing 100000000 calls to fixmulf...CPU time = 4.568305 secs.
Timing 100000000 calls to fixmuli...CPU time = 2.769579 secs.
Timing 100000000 calls to fixmull...CPU time = 3.394484 secs.
Timing 100000000 calls to fixdivf...CPU time = 8.174757 secs.
Timing 100000000 calls to fixdivl...CPU time = 4.776274 secs.
I guess this is consistent with what Peter obtained. For my 64 bit machine, the result is
sizeof(long long) = 8, sizeof(fixed) = 4
Timing 100000000 calls to fixmulf...CPU time = 4.133372 secs.
Timing 100000000 calls to fixmuli...CPU time = 2.085683 secs.
Timing 100000000 calls to fixmull...CPU time = 1.460777 secs.
Timing 100000000 calls to fixdivf...CPU time = 6.344036 secs.
Timing 100000000 calls to fixdivl...CPU time = 5.359185 secs.
As expected, the long long multiplication is the fastest here. I'm puzzled by the poor relative and absolute performance of the long long division code though: the 32 bit machine outperforms the 64 bit machine when dividing 64 bit integers... I'll have to check out what's wrong there.
EDIT:
Meh, I forgot to pass -O2 for optimizations... raw results with optimizations are posted below, will comment later.
sizeof(long long) = 8, sizeof(fixed) = 4
Timing 100000000 calls to fixmulf...CPU time = 0.054991 secs.
Timing 100000000 calls to fixmuli...CPU time = 0.054992 secs.
Timing 100000000 calls to fixmull...CPU time = 0.054991 secs.
Timing 100000000 calls to fixdivf...CPU time = 0.067990 secs.
Timing 100000000 calls to fixdivl...CPU time = 0.053992 secs.
sizeof(long long) = 8, sizeof(fixed) = 4
Timing 100000000 calls to fixmulf...CPU time = 0.100984 secs.
Timing 100000000 calls to fixmuli...CPU time = 0.100985 secs.
Timing 100000000 calls to fixmull...CPU time = 0.103984 secs.
Timing 100000000 calls to fixdivf...CPU time = 0.103984 secs.
Timing 100000000 calls to fixdivl...CPU time = 0.101984 secs.
And I never want to hear anyone say that Xeon (or Intel) are no good and AMD is better.
EDIT2: Did a more detailed run and check, now with a larger number of loop iterations and three machines: the 2.8GHz Xeon-HT (actually dual Xeon-HT), the AMD64 3200 and a 950MHz Celeron. Times are reproducible with a deviation of about 1%.
Xeon:
sizeof(long long) = 8, sizeof(fixed) = 4
Timing 2000000000 calls to fixmulf...CPU time = 1.092833 secs.
Timing 2000000000 calls to fixmuli...CPU time = 1.151825 secs.
Timing 2000000000 calls to fixmull...CPU time = 0.085987 secs.
Timing 2000000000 calls to fixdivf...CPU time = 0.056992 secs.
Timing 2000000000 calls to fixdivl...CPU time = 0.089986 secs.
Celeron:
sizeof(long long) = 8, sizeof(fixed) = 4
Timing 2000000000 calls to fixmulf...CPU time = 4.260000 secs.
Timing 2000000000 calls to fixmuli...CPU time = 4.230000 secs.
Timing 2000000000 calls to fixmull...CPU time = 2.120000 secs.
Timing 2000000000 calls to fixdivf...CPU time = 0.850000 secs.
Timing 2000000000 calls to fixdivl...CPU time = 2.140000 secs.
AMD64:
sizeof(long long) = 8, sizeof(fixed) = 4
Timing 2000000000 calls to fixmulf...CPU time = 2.024693 secs.
Timing 2000000000 calls to fixmuli...CPU time = 2.022692 secs.
Timing 2000000000 calls to fixmull...CPU time = 2.022693 secs.
Timing 2000000000 calls to fixdivf...CPU time = 0.201969 secs.
Timing 2000000000 calls to fixdivl...CPU time = 0.101984 secs.
The fixmull is the fastest on all machines (although it doesn't matter much at all on the AMD64). The fixdivl is faster than fixdivf on the AMD64, but slower on the Xeon and more so on the Celeron.
So based on this I would propose to apply the fixmull patch regardless and only use the fixdivl patch on native 64-bit machines (does that include Macs?).
I am somewhat concerned that the running time doesn't scale uniformly with the number of iterations in the loop though...
So the conclusion is that on your high-end machines, fixdivl makes some difference for you; nothing else really seems to be significant.
Results of your program on my 667MHz G4, OS X v10.4.1 (with iTunes and Firefox eating 15% CPU before anything else even gets a look-in):
sizeof(long long) = 8, sizeof(fixed) = 4
Timing 100000000 calls to fixmulf...CPU time = 0.459279 secs.
Timing 100000000 calls to fixmuli...CPU time = 0.458020 secs.
Timing 100000000 calls to fixmull...CPU time = 0.458326 secs.
Timing 100000000 calls to fixdivf...CPU time = 0.457835 secs.
Timing 100000000 calls to fixdivl...CPU time = 0.305127 secs.
Digging into the assembly (compiled with -O2):
.align 2
.globl _fixmull
_fixmull:
mulhw r9,r3,r4
mullw r10,r3,r4
srwi r12,r10,16
insrwi r12,r9,16,0
srawi r11,r9,16
mr r3,r12
blr
Looks pretty good to me.
.align 2
.globl _fixmuli
_fixmuli:
srawi r11,r3,31
srawi r10,r4,31
xor r2,r3,r11
xor r0,r4,r10
subf r0,r10,r0
subf r2,r11,r2
mr r9,r3
rlwinm r11,r2,0,24,31
srawi r3,r0,8
srawi r2,r2,8
mullw r11,r3,r11
rlwinm r0,r0,0,24,31
xor r9,r9,r4
srawi r9,r9,31
mullw r0,r2,r0
srawi r11,r11,8
mullw r3,r3,r2
srawi r0,r0,8
add r3,r3,r11
add r3,r3,r0
xor r3,r9,r3
subf r3,r9,r3
blr
Again, not awful.
.align 2
.globl _fixmulf
_fixmulf:
mflr r0
bcl 20,31,"L00000000003$pb"
"L00000000003$pb":
xoris r3,r3,0x8000
xoris r4,r4,0x8000
mflr r10
stw r3,-36(r1)
stw r4,-28(r1)
mtlr r0
lis r0,0x4330
addis r2,r10,ha16(LC9-"L00000000003$pb")
stw r0,-32(r1)
stw r0,-40(r1)
lfd f11,lo16(LC9-"L00000000003$pb")(r2)
addis r2,r10,ha16(LC10-"L00000000003$pb")
lfd f13,-32(r1)
lfd f0,-40(r1)
fsub f13,f13,f11
lfd f12,lo16(LC10-"L00000000003$pb")(r2)
fsub f0,f0,f11
addis r2,r10,ha16(LC7-"L00000000003$pb")
fmul f13,f13,f12
fmul f0,f0,f12
fmul f10,f0,f13
lfd f0,lo16(LC7-"L00000000003$pb")(r2)
fcmpu cr7,f10,f0
bng- cr7,L28
lis r3,0x7fff
ori r3,r3,65535
blr
L28:
addis r2,r10,ha16(LC8-"L00000000003$pb")
lfd f0,lo16(LC8-"L00000000003$pb")(r2)
fcmpu cr7,f10,f0
bnl+ cr7,L32
lis r3,0x8000
ori r3,r3,1
blr
L32:
addis r2,r10,ha16(LC11-"L00000000003$pb")
fneg f12,f10
lfd f11,lo16(LC11-"L00000000003$pb")(r2)
addis r2,r10,ha16(LC12-"L00000000003$pb")
lfd f13,lo16(LC12-"L00000000003$pb")(r2)
addis r2,r10,ha16(LC13-"L00000000003$pb")
lfd f0,lo16(LC13-"L00000000003$pb")(r2)
fsel f13,f10,f11,f13
fsel f12,f12,f13,f11
fmadd f0,f10,f0,f12
fctiwz f13,f0
stfd f13,-24(r1)
lwz r3,-20(r1)
blr
.literal8
.align 3
LC14:
.long 1088421824
.long 0
.align 3
LC15:
.long -1059061824
.long 0
.align 3
LC16:
.long 1127219200
.long -2147483648
.align 3
LC17:
.long 1055916032
.long 0
.align 3
LC18:
.long 1071644672
.long 0
.align 3
LC19:
.long -1075838976
.long 0
.align 3
LC20:
.long 1089470464
.long 0
Note especially all the stws (store word) and lfds (load floating point double).
I would guess that all of the artificial examples are getting unrealistically good figures for the multiplications involving floats, because the benchmark loop is free to overlap use of the floating-point and integer units, whereas a real program would want that overlap for its own calculations unrelated to the mul. This alone should make fixmuli/fixmull preferable to fixmulf wherever their timings are not substantially worse.
So based on this I would propose to apply the fixmull patch regardless and only use the fixdivl patch on native 64-bit machines (does that include Macs?).
The G5s (around since 2003) include support for the 64-bit mode (which has been in the PowerPC architecture specs since the early 1990s but went unimplemented in any consumer design until the G5), and under OS X v10.4+ it is possible to build fat binaries that include 64-bit operation paths for G5s. Under 10.3 and before, 64-bit is only used to provide a larger address space (unless the user writes some assembly themselves, of course). My G4 has no native 64-bit processing except in the vector unit, and that probably isn't advisable for non-vector work for all the obvious reasons.
In any case, fixdivl is substantially faster even on 32-bit PowerPC CPUs!
EDIT: a true console build and run produces different results (the earlier run had been dumping results from a Cocoa application to a log window, that being the quickest way to get an Xcode project going):
Timing 100000000 calls to fixmulf...CPU time = 0.305241 secs.
Timing 100000000 calls to fixmuli...CPU time = 0.455976 secs.
Timing 100000000 calls to fixmull...CPU time = 0.303116 secs.
Timing 100000000 calls to fixdivf...CPU time = 0.454786 secs.
Timing 100000000 calls to fixdivl...CPU time = 0.303217 secs.
Notably fixmuli has leapt upwards. I have no idea why this is, but the results still show very clearly that fixmull and fixdivl are the way forward so I'm not really that interested.
Sorry, but I think Evert's test is deeply flawed. On my machine, gcc -O2:
sizeof(long long) = 8, sizeof(fixed) = 4
Timing 200000000 calls to fixmulf...CPU time = 0.113982 secs.
Timing 200000000 calls to fixmuli...CPU time = 0.113983 secs.
Timing 200000000 calls to fixmull...CPU time = 0.018997 secs.
Timing 200000000 calls to fixdivf...CPU time = 0.114982 secs.
Timing 200000000 calls to fixdivl...CPU time = 0.019997 secs.
Now I comment out the call to fixdivl:
sizeof(long long) = 8, sizeof(fixed) = 4
Timing 200000000 calls to fixmulf...CPU time = 0.113982 secs.
Timing 200000000 calls to fixmuli...CPU time = 0.113983 secs.
Timing 200000000 calls to fixmull...CPU time = 0.018997 secs.
Timing 200000000 calls to fixdivf...CPU time = 0.114982 secs.
Timing 200000000 calls to fixdivl...CPU time = 0.112983 secs.
I basically don't know assembly, but as far as I can tell we're just measuring empty loops, and for some reason the fixmull and fixdivl loops were compiled in such a way that they executed faster.
If I accumulate the results of the fixmul/fixdiv calls, e.g. like this: (full changed file attached)
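(The attached file isn't reproduced here; purely as an illustration of the kind of change meant, using the zacc accumulator that shows up in the output below:)

/* with a `fixed zacc = 0;` declared alongside x, y and z */
t0 = CPUTIME;
for (c = 0; c < LOOP_COUNT; c++) {
   zacc += fixmull(x, y);   /* accumulate so the compiler can't discard the call */
   x += 1317;
   y += 7143;
}
t1 = CPUTIME;
printf("CPU time = %f secs.\n", t1 - t0);
printf("zacc = %d\n", zacc);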
the results are then:
sizeof(long long) = 8, sizeof(fixed) = 4
Timing 200000000 calls to fixmulf...CPU time = 1.651749 secs.
zacc = -1382998619
Timing 200000000 calls to fixmuli...CPU time = 3.685439 secs.
zacc = -2117056946
Timing 200000000 calls to fixmull...CPU time = 3.353491 secs.
zacc = 516138025
Timing 200000000 calls to fixdivf...CPU time = 7.178909 secs.
zacc = -1716950756
Timing 200000000 calls to fixdivl...CPU time = 8.306736 secs.
zacc = -301320445
EDIT to earlier post, moved down due to Peter Wang's comment in the meantime:
Rifling through the assembly, I found the reason your code is a very poor benchmark: the result z is never used, so gcc doesn't calculate it in some of the loops! Hence your fixmul loops aren't actually doing the multiplication, and it is no wonder that you suddenly get the same timings for everything!
I changed the z calculation to a += and added it as output to the printfs to make sure that it is actually calculated every loop.
Meaningful results (-O3) are:
sizeof(long long) = 8, sizeof(fixed) = 4
Timing 100000000 calls to fixmulf...CPU time = 8.065064 secs.
Timing 100000000 calls to fixmuli...CPU time = 2.120938 secs.
Timing 100000000 calls to fixmull...CPU time = 1.212411 secs.
Timing 100000000 calls to fixdivf...CPU time = 16.796364 secs.
Timing 100000000 calls to fixdivl...CPU time = 19.028576 secs.
I shall look into that fixdivf vs fixdivl.
EDIT: so as not to confuse things, I ran Peter Wang's adaptation. Results are:
-O2:
sizeof(long long) = 8, sizeof(fixed) = 4
Timing 200000000 calls to fixmulf...CPU time = 16.082407 secs.
Timing 200000000 calls to fixmuli...CPU time = 3.940291 secs.
Timing 200000000 calls to fixmull...CPU time = 2.122170 secs.
Timing 200000000 calls to fixdivf...CPU time = 33.572767 secs.
Timing 200000000 calls to fixdivl...CPU time = 38.042508 secs.
No optimisations:
sizeof(long long) = 8, sizeof(fixed) = 4
Timing 200000000 calls to fixmulf...CPU time = 89.162385 secs.
Timing 200000000 calls to fixmuli...CPU time = 30.188200 secs.
Timing 200000000 calls to fixmull...CPU time = 23.056884 secs.
Timing 200000000 calls to fixdivf...CPU time = 115.471288 secs.
Timing 200000000 calls to fixdivl...CPU time = 63.658895 secs.
Ah, you're quite right. I should have realized the compiler would optimize out an unused variable!
With the zacc output lines removed (but the accumulation kept), my results now look like this:
Xeon:
sizeof(long long) = 8, sizeof(fixed) = 4
Timing 100000000 calls to fixmulf...CPU time = 0.891864 secs.
Timing 100000000 calls to fixmuli...CPU time = 2.023693 secs.
Timing 100000000 calls to fixmull...CPU time = 2.681592 secs.
Timing 100000000 calls to fixdivf...CPU time = 3.426479 secs.
Timing 100000000 calls to fixdivl...CPU time = 3.107528 secs.
Celeron:
sizeof(long long) = 8, sizeof(fixed) = 4
Timing 100000000 calls to fixmulf...CPU time = 2.480000 secs.
Timing 100000000 calls to fixmuli...CPU time = 2.620000 secs.
Timing 100000000 calls to fixmull...CPU time = 2.880000 secs.
Timing 100000000 calls to fixdivf...CPU time = 13.010000 secs.
Timing 100000000 calls to fixdivl...CPU time = 8.570000 secs.
AMD64:
sizeof(long long) = 8, sizeof(fixed) = 4
Timing 100000000 calls to fixmulf...CPU time = 0.445932 secs.
Timing 100000000 calls to fixmuli...CPU time = 0.717891 secs.
Timing 100000000 calls to fixmull...CPU time = 0.227965 secs.
Timing 100000000 calls to fixdivf...CPU time = 1.306801 secs.
Timing 100000000 calls to fixdivl...CPU time = 4.045385 secs.
Basically, fixmull is a good idea on the AMD64, but a bad idea on the two Intel machines. For fixdivl, it is actually the other way around (!). I'll look at the detailed assembler output later on.
For now, another relevant statistic may be the gcc version involved. The celeron runs gcc 3.3.1, the AMD64 runs gcc 3.4.1 and the Xeon runs gcc 3.4.3.
I've got "gcc version 4.0.0 20041026 (Apple Computer, Inc. build 4061)". Having been code digging, it appears that the 64bit divide calls some other function for a reason I don't fully understand yet given that the PowerPC has the divd 64 bit divide instruction. I'm going to sniff around this topic and see if I can figure anything out.
EDIT: from a Mac vs PC viewpoint, is the fact that my 667MHz G4 is apparently more than twice as fast as the 950MHz Celeron for multiplies (but less than half the speed for divides) surprising? In the sense that it may suggest problems with the test.
from a Mac vs PC viewpoint, is the fact that my 667MHz G4 is apparently more than twice as fast as the 950MHz Celeron for multiplies (but less than half the speed for divides) surprising?
I don't think so. I used 300MHz Sun workstations that were faster for some calculations.
That said, it's quite possible I bodged up the test somewhere else too - I'm not a computer professional, after all, so it's possible that there is some benchmarking subtlety I'm unaware of. It looks like it should be doing the right thing now though.
This data is for MSVC 7.1 / Athlon XP 2500+
A few notes:
All of the different methods produce different results. Most handle rounding differently from each other, and fixmulf handles overflow differently. I suggest modifying the source to initialize x and y to some known value before each loop (see the sketch at the end of this post).
Enabling SSE1 (which was disabled by default - this is MSVC 7.1) improves performance of fixmulf by 20%. A simple by-hand translation of fixmull to inline asm improved performance by 10%, despite the inefficiencies of inline asm in MSVC (you can't tell the compiler where you want stuff and what gets clobbered and all that like you can in gcc).
Performance scaled linearly on my computer for loop counts all the way down to 10^3, for all functions except fixmulf, which only scaled down to loop count = 10^5 (at 10^4 it took almost twice as long per loop, at 10^3 it took over 3 times as long per loop).
LOOP_COUNT = 200000000, total seconds:

fixmulf     2.100045
fixmuli     2.185031
fixmull     1.431395
fixmulasm   1.278765
fixdivf     7.206338
fixdivl    12.066815
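Regarding the suggestion above about initializing x and y, the kind of change meant is roughly this (illustrative values only):

/* reset the operands to the same known values before every timing loop,
   so all the variants see an identical stream of inputs */
x = 0x10000;   /* 1.0 in 16.16 */
y = 0x08000;   /* 0.5 in 16.16 */
zacc = 0;
for (c = 0; c < LOOP_COUNT; c++) {
   zacc += fixmulf(x, y);
   x += 1317;
   y += 7143;
}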