'Fixed point'
Thomas Harte
Member #33
April 2000

Having a poke around my code in OS X I had a glance into the Allegro 'fixed point' routines. I observed this is the active fixmul for my platform:

AL_INLINE(fixed, fixmul, (fixed x, fixed y),
{
   return ftofix(fixtof(x) * fixtof(y));
})

This is horrendously bad! Float/int conversions are even slower on PowerPC than on Intel, because the integer and floating-point units are conceptually separate and the transfer has to go through memory. Would the following not be a smarter fallback for machines without an assembly implementation:

/* fragment from my own 'Fixed' wrapper class: 'value' is this object's raw
   16.16 value and 'r' is the right-hand operand */
unsigned int sign = (r.value^value)&0x80000000;
Sint32 Multiplicand = value;

if(Multiplicand < 0) Multiplicand = -Multiplicand;
if(r.value < 0) r.value = -r.value;

Fixed Newval;
Newval.value =
  (Multiplicand >> 8)*(r.value >> 8) +
  (((Multiplicand >> 8)*(r.value&0xff)) >> 8) +
  (((r.value >> 8)*(Multiplicand&0xff)) >> 8);

if(sign) Newval.value = -Newval.value;

return Newval;

Or some "smarter about signs and shifts" variation? And while I'm here posting, could there not be some compile time way of determining whether real fixed point or nasty triple cast fixed point implementations of things like fixmul are available?
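For reference, the three partial products above fall straight out of splitting each 16.16 operand (with the sign already stripped) into its high 24 bits and low 8 bits; ignoring a little rounding lost in the low bits,

(x * y) >> 16  ==  (x >> 8) * (y >> 8)
                 + (((x >> 8) * (y & 0xff)) >> 8)
                 + (((y >> 8) * (x & 0xff)) >> 8)
                 + (((x & 0xff) * (y & 0xff)) >> 16)

and the last term is at most (255 * 255) >> 16 == 0, so it can be dropped.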

Peter Wang
Member #23
April 2000

If your implementation benchmarks and tests well then of course we would accept it. I don't know if anyone will have time before the next beta (on Friday).

Quote:

And while I'm here posting, could there not be some compile time way of determining whether real fixed point or nasty triple cast fixed point implementations of things like fixmul are available?

I don't think so.

EDIT: Ok, I decided to do some quick benchmarks before going to sleep. The patch I'm using is attached. The test machine is a Pentium-4 2.4GHz, running Linux 2.6.10, gcc 3.3.4. I'm using the "Misc | Time some stuff" function from `tests/test'. Here are the rather surprising results:

Standard mode (fixmuls per second)

asm:       25886122, 26106294
current C: 29301026, 29226952
patched C: 31416471, 31518915

----------------------------------------------------------------------

Debug mode

asm:       22182461, 22209498
current C:  9916082,  9649226
patched C: 21665953, 21693887

======================================================================

2005-05-17

Standard mode               fixmul/sec              fixdiv/sec

asm                         29845165, 29802881      24061238, 23802441
unpatched C                 29462936, 29477438      15233914, 15289430
patched C (32-bit fixmul)   31846917, 31742140      18414104, 18262372
patched C (64-bit fixmul)   27817735, 27833801      18468870, 18479899

I didn't really check that the patched version gives the same results, but the 3d and sprite rotate examples seemed to run ok.

EDIT: 2005-05-17 added results for Evert's patch that uses LONG_LONG where possible. There seems to be a big discrepancy between the two sets of asm results, so I don't know what happened there. The C results seem to match up though. The two "patched C" rows use the same implementation of fixdiv (making use of LONG_LONG).

Thomas Harte
Member #33
April 2000

Quote:

I didn't really check that the patched version gives the same results, but the 3d and sprite rotate examples seemed to run ok.

It's unlikely to give exactly the same results as the three-cast method, but then neither is the x86 32x32->64 integer multiply. I'd be surprised if the average divergence in results was substantially different though.

Quote:

Here are the rather surprising results:

So in standard mode the patch is fastest, in debug the asm is fastest and the three cast method mysteriously overtakes the asm in standard mode? That is a little strange.

Perhaps someone smarter than me can eliminate the sign stuff in my method. The problem with signed numbers is the &0xff's throw away the sign. If that could be avoided then all the conditionals could be removed and things should be even better.
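Failing that, the conditionals could at least be made branch-free with the usual shift-and-mask trick; perhaps something like this (an untested sketch against Allegro's 32bit fixed type, with a made-up name, assuming signed right shifts are arithmetic, which gcc does give us on the platforms in question):

static fixed fixmul_nosign(fixed x, fixed y)
{
   int mask_x = x >> 31;              /* all 1s if x < 0, else 0 */
   int mask_y = y >> 31;
   int mask_r = (x ^ y) >> 31;        /* sign of the result */
   fixed result;

   x = (x ^ mask_x) - mask_x;         /* |x| */
   y = (y ^ mask_y) - mask_y;         /* |y| */

   result = (x >> 8) * (y >> 8) +
            (((x >> 8) * (y & 0xff)) >> 8) +
            (((y >> 8) * (x & 0xff)) >> 8);

   return (result ^ mask_r) - mask_r; /* reapply the sign */
}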

Possibly there is some similar method for fixdiv too...

nonnus29
Member #2,606
August 2002

I remember reading that back in the 68000 days Macs had a div and mult opcode; powerpc doesn't? Mips risc does...

I'm not clear on why shifts and &'s are needed; if you multiply two 32 bit values the result will be 64 bits; this result requires an arithmetic right shift of 16 (for 16.16 format) and truncation to 32 bits; the sign should be preserved. For division again only arithmetic shifts are required, preserving the sign. And it doesn't have to be 16.16 either....

edit; here's some code:

int mul(int a, int b) {
   long t;
   t = a * b;
   return (int)t>>16;
}


int div(int a, int b) {
   long t;
   t = a;
   t = t<<16;
   t = t/b;
   t = t>>16;
   return (int)t;
}

edit2; hmm, looks like you'll lose the sign of a in the div() function...

Marcello
Member #1,860
January 2002

remember that long is only 32bits in c

Marcello

BAF
Member #2,981
December 2002

will this work on 64bits?

BTW, I have a 64bit gentoo compile, I can test stuff for you guys.

Evert
Member #794
November 2000

Marcello said:

remember that long is only 32bits in c

Depends on your platform and compiler. It's 64 bit on my AMD machine, and possibly on 64 bit Macs as well.
Not sure if that helps though...

gnolam
Member #2,030
March 2002

Marcello said:

remember that long is only 32bits in c

Actually, it's ≥ 32 bits, ≥ int...

--
Move to the Democratic People's Republic of Vivendi Universal (formerly known as Sweden) - officially democracy- and privacy-free since 2008-06-18!

Marcello
Member #1,860
January 2002

right, point being, don't use it if you want 64bits.

Thomas Harte
Member #33
April 2000

Quote:

right, point being, don't use it if you want 64bits.

Don't use what?

Thinking about it a little harder, if the pure C fixmul is casting to floats, then it is losing the low 8bits of each fixed number before multiplying anyway. Therefore the following should be just as good:

(x >> 8)*(y >> 8)

Conversely, if it is casting to doubles then it is no wonder that speed isn't great!
EDIT:

Quote:

I remember reading that back in the 68000 days Macs had a div and mult opcode; powerpc doesn't? Mips risc does...

Why do you suppose PowerPC has no multiply or divide operations? I can see no reason you would conclude that from the thread.

In fact the PowerPC has a 32x32->64 multiply capacity even in 32bit mode, through the issuing of two consecutive instructions. Otherwise you only get a more normal 32x32->32 multiply, what with RISC machines being at least partly designed for RISC usage.

I'll see if I can figure out the correct ASM and so on for PowerPC but it is at least slightly beside the point if the pure-C can still be improved.

nonnus29
Member #2,606
August 2002

Oops; 'long long' is 64 bits or is it 'long int'?

edit; I assumed that if multiply and divide opcodes were available then they would've been used in allegro; guess not...

Thomas Harte
Member #33
April 2000

Depends on your compiler and processor. You may not even have a 64bit integral type.

EDIT:

Quote:

edit; I assumed that if multiply and divide opcodes were available then they would've been used in allegro; guess not...

Oh, no, the Mac OS X port just isn't complete yet in terms of things it could be doing to avoid the Allegro C fallbacks but isn't.

Evert
Member #794
November 2000

Quote:

In fact the PowerPC has a 32x32->64 multiply capacity even in 32bit mode

So do the i386+ processors, if I recall correctly (they set the lower word in eax and the high word in edx, I think).

Quote:

Oops; 'long long' is 64 bits or is it 'long int'?

long long may be 64 bit if it is defined at all (prior to C99, it was a GNU extension). Allegro defines a (u)int64_t on non-C99 compilers, but that definition is just a guess and probably unreliable.
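With a guaranteed 64 bit type the multiply itself becomes trivial; roughly this (only a sketch, written against C99's int64_t rather than whatever name Allegro's headers end up using):

#include <stdint.h>

typedef int32_t fixed;

static fixed fixmul64(fixed x, fixed y)
{
   /* widen, multiply, then drop the extra 16 fraction bits */
   return (fixed)(((int64_t)x * (int64_t)y) >> 16);
}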

Quote:

I assumed that if multiply and divide opcodes were available then they would've been used in allegro

Allegro only uses assembler code on the ix86, which is actually suboptimal on recent processors. There's some minor assembler code for AMD64, but it's trivial and mostly unimportant (it retrieves the CPUID flags).

nonnus29
Member #2,606
August 2002

These work with mingw 3.2;

int mul16_16(int a, int b) {
   long long int t = ((long long int)a * (long long int)b) >> 16;
   return (int)t;
}

int div16_16(int a, int b) {
   long long int t = (long long int)a << 32;
   return (int)((t/b)>>16);
}

void print16_16(int a) {
   int t = a >> 16;
   printf("\n%i.",t);
   t = a & 0x0000FFFF;
   printf("%i",t);
}

Thomas Harte
Member #33
April 2000

Quote:

So do the i386+ processors, if I recall correctly (they set the lower word in eax and the high word in edx, I think).

Yes, all I meant was 'even the RISC PowerPC' rather than 'unlike the yucky intel'. I'm not surprised at all that the CISC chip does.

Peter Wang
Member #23
April 2000

I added some results for Evert's latest patch which uses LONG_LONG where possible. See the table above.

nonnus29
Member #2,606
August 2002

n/m I'm smoking crack or something.

It's a little faster it seems.

Evert
Member #794
November 2000

Interesting...
I used the following programme to time the speed of the different methods (except the asm version, which I can't check easily and am not really interested in anyway ;)) on two machines, my 32-bit Xeon workstation (running RedHat Enterprise) and my AMD64 computer at home (running Gentoo linux).

#include <stdio.h>
#include <time.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <sys/types.h>

#define CPUTIME (getrusage(RUSAGE_SELF,&ruse),\
   ruse.ru_utime.tv_sec + ruse.ru_stime.tv_sec + \
   1e-6 * (ruse.ru_utime.tv_usec + ruse.ru_stime.tv_usec))

struct rusage ruse;

extern int getrusage();

typedef int fixed;

#define LOOP_COUNT 100000000

/* ftofix and fixtof are used in generic C versions of fixmul and fixdiv */
inline fixed ftofix(double x)
{
   if (x > 32767.0) {
      return 0x7FFFFFFF;
   }

   if (x < -32767.0) {
      return -0x7FFFFFFF;
   }

   return (int)(x * 65536.0 + (x < 0 ? -0.5 : 0.5));
}


inline double fixtof(fixed x)
{
   return (double)x / 65536.0;
}

inline fixed fixmulf(fixed x, fixed y)
{
   return ftofix(fixtof(x) * fixtof(y));
}

inline fixed fixmull(fixed x, fixed y)
{
   long long lx = x;
   long long ly = y;
   long long lres = (lx*ly)>>16;
   int res = lres;
   return res;
}

inline fixed fixmuli(fixed x, fixed y)
{
   fixed sign = (x^y) & 0x80000000;
   int mask_x = x >> 31;
   int mask_y = y >> 31;
   int mask_result = sign >> 31;
   fixed result;

   x = (x^mask_x) - mask_x;
   y = (y^mask_y) - mask_y;

   result = ((y >> 8)*(x >> 8) +
             (((y >> 8)*(x&0xff)) >> 8) +
             (((x >> 8)*(y&0xff)) >> 8));

   return (result^mask_result) - mask_result;
}

inline fixed fixdivf(fixed x, fixed y)
{
   if (y == 0) {
      return (x < 0) ? -0x7FFFFFFF : 0x7FFFFFFF;
   }
   else
      return ftofix(fixtof(x) / fixtof(y));
}

inline fixed fixdivl(fixed x, fixed y)
{
   if (y == 0) {
      return (x < 0) ? -0x7FFFFFFF : 0x7FFFFFFF;
   }
   else {
      long long lx = x;
      long long ly = y;
      long long lres = (lx << 16) / ly;
      int res = lres;
      return res;
   }
}

int main(void)
{
   double t0, t1;
   time_t u1,u2;
   fixed x, y, z;
   int c;

   printf("sizeof(long long) = %d, sizeof(fixed) = %d\n", sizeof(long long), sizeof(fixed));

   printf("Timing %d calls to fixmulf...", LOOP_COUNT);
   fflush(stdout);
   t0 = CPUTIME;
   /* place code to be timed here */
   for (c=0; c<LOOP_COUNT; c++) {
      z = fixmulf(x,y);
      x += 1317;
      y += 7143;
   }
   t1 = CPUTIME;
   printf("CPU time = %f secs.\n",t1-t0);

   printf("Timing %d calls to fixmuli...", LOOP_COUNT);
   fflush(stdout);
   t0 = CPUTIME;
   /* place code to be timed here */
   for (c=0; c<LOOP_COUNT; c++) {
      z = fixmuli(x,y);
      x += 1317;
      y += 7143;
   }
   t1 = CPUTIME;
   printf("CPU time = %f secs.\n",t1-t0);

   printf("Timing %d calls to fixmull...", LOOP_COUNT);
   fflush(stdout);
   t0 = CPUTIME;
   /* place code to be timed here */
   for (c=0; c<LOOP_COUNT; c++) {
      z = fixmull(x,y);
      x += 1317;
      y += 7143;
   }
   t1 = CPUTIME;
   printf("CPU time = %f secs.\n",t1-t0);

   printf("Timing %d calls to fixdivf...", LOOP_COUNT);
   fflush(stdout);
   t0 = CPUTIME;
   /* place code to be timed here */
   for (c=0; c<LOOP_COUNT; c++) {
      z = fixdivf(x,y);
      x += 1317;
      y += 7143;
   }
   t1 = CPUTIME;
   printf("CPU time = %f secs.\n",t1-t0);

   printf("Timing %d calls to fixdivl...", LOOP_COUNT);
   fflush(stdout);
   t0 = CPUTIME;
   /* place code to be timed here */
   for (c=0; c<LOOP_COUNT; c++) {
      z = fixdivl(x,y);
      x += 1317;
      y += 7143;
   }
   t1 = CPUTIME;
   printf("CPU time = %f secs.\n",t1-t0);

   return 0;
}

On the Xeon, I got the following results:

Xeon results said:

sizeof(long long) = 8, sizeof(fixed) = 4
Timing 100000000 calls to fixmulf...CPU time = 4.568305 secs.
Timing 100000000 calls to fixmuli...CPU time = 2.769579 secs.
Timing 100000000 calls to fixmull...CPU time = 3.394484 secs.
Timing 100000000 calls to fixdivf...CPU time = 8.174757 secs.
Timing 100000000 calls to fixdivl...CPU time = 4.776274 secs.

I guess this is consistent with what Peter obtained. For my 64-bit machine, the result is:

AMD64 said:

sizeof(long long) = 8, sizeof(fixed) = 4
Timing 100000000 calls to fixmulf...CPU time = 4.133372 secs.
Timing 100000000 calls to fixmuli...CPU time = 2.085683 secs.
Timing 100000000 calls to fixmull...CPU time = 1.460777 secs.
Timing 100000000 calls to fixdivf...CPU time = 6.344036 secs.
Timing 100000000 calls to fixdivl...CPU time = 5.359185 secs.

As expected, the long long multiplication is the fastest here. I'm puzzled by the poor relative and absolute performance of the long long division code though: the 32-bit machine outperforms the 64-bit machine when dividing 64-bit integers... I'll have to check out what's wrong there.

EDIT:
Meh, I forgot to pass -O2 for optimizations... raw results with optimizations are posted below, will comment later.

Xeon said:

sizeof(long long) = 8, sizeof(fixed) = 4
Timing 100000000 calls to fixmulf...CPU time = 0.054991 secs.
Timing 100000000 calls to fixmuli...CPU time = 0.054992 secs.
Timing 100000000 calls to fixmull...CPU time = 0.054991 secs.
Timing 100000000 calls to fixdivf...CPU time = 0.067990 secs.
Timing 100000000 calls to fixdivl...CPU time = 0.053992 secs.

AMD64 said:

sizeof(long long) = 8, sizeof(fixed) = 4
Timing 100000000 calls to fixmulf...CPU time = 0.100984 secs.
Timing 100000000 calls to fixmuli...CPU time = 0.100985 secs.
Timing 100000000 calls to fixmull...CPU time = 0.103984 secs.
Timing 100000000 calls to fixdivf...CPU time = 0.103984 secs.
Timing 100000000 calls to fixdivl...CPU time = 0.101984 secs.

And I never want to hear anyone say that Xeon (or Intel) are no good and AMD is better. :P

EDIT2: Did a more detailed run and check, now with a larger number of loop iterations and three machines: the 2.8GHz Xeon-HT (actually dual Xeon-HT), the AMD64 3200 and a 950MHz Celeron. Times are reproducible with a deviation of about 1%.

results said:

Xeon:
sizeof(long long) = 8, sizeof(fixed) = 4
Timing 2000000000 calls to fixmulf...CPU time = 1.092833 secs.
Timing 2000000000 calls to fixmuli...CPU time = 1.151825 secs.
Timing 2000000000 calls to fixmull...CPU time = 0.085987 secs.
Timing 2000000000 calls to fixdivf...CPU time = 0.056992 secs.
Timing 2000000000 calls to fixdivl...CPU time = 0.089986 secs.

Celeron:
sizeof(long long) = 8, sizeof(fixed) = 4
Timing 2000000000 calls to fixmulf...CPU time = 4.260000 secs.
Timing 2000000000 calls to fixmuli...CPU time = 4.230000 secs.
Timing 2000000000 calls to fixmull...CPU time = 2.120000 secs.
Timing 2000000000 calls to fixdivf...CPU time = 0.850000 secs.
Timing 2000000000 calls to fixdivl...CPU time = 2.140000 secs.

AMD64:
sizeof(long long) = 8, sizeof(fixed) = 4
Timing 2000000000 calls to fixmulf...CPU time = 2.024693 secs.
Timing 2000000000 calls to fixmuli...CPU time = 2.022692 secs.
Timing 2000000000 calls to fixmull...CPU time = 2.022693 secs.
Timing 2000000000 calls to fixdivf...CPU time = 0.201969 secs.
Timing 2000000000 calls to fixdivl...CPU time = 0.101984 secs.

The fixmull is the fastest on all machines (although it doesn't matter much at all on the AMD64). The fixdivl is faster than fixdivf on the AMD64, but slower on the Xeon and slower still on the Celeron.
So based on this I would propose to apply the fixmull patch regardless and only use the fixdivl patch on native 64-bit machines (does that include Macs?).
I am somewhat concerned that the running time doesn't scale uniformly with the number of iterations in the loop though...

Thomas Harte
Member #33
April 2000

So the conclusion on your high-end machines is: fixdivl makes some difference for you; nothing else really seems to be significant.

Results of your program on my 667MHz G4, OS X v10.4.1 (with iTunes and Firefox eating 15% CPU before anything else even gets a look-in):

sizeof(long long) = 8, sizeof(fixed) = 4
Timing 100000000 calls to fixmulf...CPU time = 0.459279 secs.
Timing 100000000 calls to fixmuli...CPU time = 0.458020 secs.
Timing 100000000 calls to fixmull...CPU time = 0.458326 secs.
Timing 100000000 calls to fixdivf...CPU time = 0.457835 secs.
Timing 100000000 calls to fixdivl...CPU time = 0.305127 secs.

Digging into the assembly (compiled with -O2):

fixmull said:

.align 2
.globl _fixmull
_fixmull:
mulhw r9,r3,r4
mullw r10,r3,r4
srwi r12,r10,16
insrwi r12,r9,16,0
srawi r11,r9,16
mr r3,r12
blr

Looks pretty good to me.

fixmuli said:

.align 2
.globl _fixmuli
_fixmuli:
srawi r11,r3,31
srawi r10,r4,31
xor r2,r3,r11
xor r0,r4,r10
subf r0,r10,r0
subf r2,r11,r2
mr r9,r3
rlwinm r11,r2,0,24,31
srawi r3,r0,8
srawi r2,r2,8
mullw r11,r3,r11
rlwinm r0,r0,0,24,31
xor r9,r9,r4
srawi r9,r9,31
mullw r0,r2,r0
srawi r11,r11,8
mullw r3,r3,r2
srawi r0,r0,8
add r3,r3,r11
add r3,r3,r0
xor r3,r9,r3
subf r3,r9,r3
blr

Again, not awful.

fixmulf said:

.align 2
.globl _fixmulf
_fixmulf:
mflr r0
bcl 20,31,"L00000000003$pb"
"L00000000003$pb":
xoris r3,r3,0x8000
xoris r4,r4,0x8000
mflr r10
stw r3,-36(r1)
stw r4,-28(r1)
mtlr r0
lis r0,0x4330
addis r2,r10,ha16(LC9-"L00000000003$pb")
stw r0,-32(r1)
stw r0,-40(r1)
lfd f11,lo16(LC9-"L00000000003$pb")(r2)
addis r2,r10,ha16(LC10-"L00000000003$pb")
lfd f13,-32(r1)
lfd f0,-40(r1)
fsub f13,f13,f11
lfd f12,lo16(LC10-"L00000000003$pb")(r2)
fsub f0,f0,f11
addis r2,r10,ha16(LC7-"L00000000003$pb")
fmul f13,f13,f12
fmul f0,f0,f12
fmul f10,f0,f13
lfd f0,lo16(LC7-"L00000000003$pb")(r2)
fcmpu cr7,f10,f0
bng- cr7,L28
lis r3,0x7fff
ori r3,r3,65535
blr
L28:
addis r2,r10,ha16(LC8-"L00000000003$pb")
lfd f0,lo16(LC8-"L00000000003$pb")(r2)
fcmpu cr7,f10,f0
bnl+ cr7,L32
lis r3,0x8000
ori r3,r3,1
blr
L32:
addis r2,r10,ha16(LC11-"L00000000003$pb")
fneg f12,f10
lfd f11,lo16(LC11-"L00000000003$pb")(r2)
addis r2,r10,ha16(LC12-"L00000000003$pb")
lfd f13,lo16(LC12-"L00000000003$pb")(r2)
addis r2,r10,ha16(LC13-"L00000000003$pb")
lfd f0,lo16(LC13-"L00000000003$pb")(r2)
fsel f13,f10,f11,f13
fsel f12,f12,f13,f11
fmadd f0,f10,f0,f12
fctiwz f13,f0
stfd f13,-24(r1)
lwz r3,-20(r1)
blr
.literal8
.align 3
LC14:
.long 1088421824
.long 0
.align 3
LC15:
.long -1059061824
.long 0
.align 3
LC16:
.long 1127219200
.long -2147483648
.align 3
LC17:
.long 1055916032
.long 0
.align 3
LC18:
.long 1071644672
.long 0
.align 3
LC19:
.long -1075838976
.long 0
.align 3
LC20:
.long 1089470464
.long 0

Note especially all the stws (store word) and lfds (load floating point double).

I would guess that all of these artificial examples get unrealistically good figures for the multiplications involving floats, because they can overlap use of the floating-point and integer units, whereas a real program might want to do that overlapping with its own calculations unrelated to the mul. This alone should make fixmuli/fixmull preferable to fixmulf wherever their timings are not substantially worse.

Quote:

So based on this I would propose to apply the fixmull patch regardless and only use the fixdivl patch on native 64-bit machines (does that include Macs?).

The G5s (around since 2003) include support for 64bit mode (which had been in the PowerPC specs since the architecture came around in about 1993, but went unimplemented in any consumer design until a few years ago), and under OS X v10.4+ it is possible to build fat binaries that include 64bit code paths for G5s. Under 10.3 and before, 64bit mode is only used to provide a larger address space (unless the user writes some 64bit assembly themselves, of course). My G4 has no native 64bit processing except in the vector unit, and that probably isn't advisable for non-vector work for all the obvious reasons.

In any case, the fixdivl is substantially faster even on 32bit CPUs for PowerPC!

EDIT: a true console build and run produces different results (the earlier figures came from a Cocoa application dumping its results to a log window, that being the quickest way to get an Xcode project going):

Timing 100000000 calls to fixmulf...CPU time = 0.305241 secs.
Timing 100000000 calls to fixmuli...CPU time = 0.455976 secs.
Timing 100000000 calls to fixmull...CPU time = 0.303116 secs.
Timing 100000000 calls to fixdivf...CPU time = 0.454786 secs.
Timing 100000000 calls to fixdivl...CPU time = 0.303217 secs.

Notably fixmuli has leapt upwards. I have no idea why this is, but the results still show very clearly that fixmull and fixdivl are the way forward so I'm not really that interested.

Peter Wang
Member #23
April 2000

Sorry, but I think Evert's test is deeply flawed. On my machine, gcc -O2:

Quote:

sizeof(long long) = 8, sizeof(fixed) = 4
Timing 200000000 calls to fixmulf...CPU time = 0.113982 secs.
Timing 200000000 calls to fixmuli...CPU time = 0.113983 secs.
Timing 200000000 calls to fixmull...CPU time = 0.018997 secs.
Timing 200000000 calls to fixdivf...CPU time = 0.114982 secs.
Timing 200000000 calls to fixdivl...CPU time = 0.019997 secs.

Now I comment out the call to fixdivl:

Quote:

sizeof(long long) = 8, sizeof(fixed) = 4
Timing 200000000 calls to fixmulf...CPU time = 0.113982 secs.
Timing 200000000 calls to fixmuli...CPU time = 0.113983 secs.
Timing 200000000 calls to fixmull...CPU time = 0.018997 secs.
Timing 200000000 calls to fixdivf...CPU time = 0.114982 secs.
Timing 200000000 calls to fixdivl...CPU time = 0.112983 secs.

I basically don't know assembly, but as far as I can tell we're just measuring empty loops, and for some reason the fixmull and fixdivl were compiled in such a way that they executed faster.

If I accumulate the results of the fixmul/fixdiv calls, e.g. like this: (full changed file attached)

   printf("Timing %d calls to fixdivl...", LOOP_COUNT);
   fflush(stdout);
   zacc = 0;
   t0 = CPUTIME;
   /* place code to be timed here */
   for (c=0; c<LOOP_COUNT; c++) {
      z = fixdivl(x,y);
      x += 1317;
      y += 7143;
      zacc += z;
   }
   t1 = CPUTIME;
   printf("CPU time = %f secs.\n",t1-t0);
   printf("zacc = %d\n",zacc);

the results are then:

Quote:

sizeof(long long) = 8, sizeof(fixed) = 4
Timing 200000000 calls to fixmulf...CPU time = 1.651749 secs.
zacc = -1382998619
Timing 200000000 calls to fixmuli...CPU time = 3.685439 secs.
zacc = -2117056946
Timing 200000000 calls to fixmull...CPU time = 3.353491 secs.
zacc = 516138025
Timing 200000000 calls to fixdivf...CPU time = 7.178909 secs.
zacc = -1716950756
Timing 200000000 calls to fixdivl...CPU time = 8.306736 secs.
zacc = -301320445

Thomas Harte
Member #33
April 2000

EDIT to earlier post, moved down due to Peter Wang's comment in the meantime:

Rifling through the assembly, I found the reason your code is a very poor benchmark: the result z is never used, so gcc doesn't calculate it in some of the loops! Hence your fixmul loops aren't actually doing the multiplication, and it is no wonder that you suddenly get the same results for everything!

I changed the z calculation to a += and added it as output to the printfs to make sure that it is actually calculated every loop.
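(Assigning the result into a volatile each time round the loop would do the job too, for anyone who would rather not touch the printfs; a quick sketch against the program above:)

static volatile fixed sink;   /* stores to a volatile cannot be thrown away */

   for (c=0; c<LOOP_COUNT; c++) {
      sink = fixmull(x,y);
      x += 1317;
      y += 7143;
   }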

Meaningful results (-O3) are:

sizeof(long long) = 8, sizeof(fixed) = 4
Timing 100000000 calls to fixmulf...CPU time = 8.065064 secs.
Timing 100000000 calls to fixmuli...CPU time = 2.120938 secs.
Timing 100000000 calls to fixmull...CPU time = 1.212411 secs.
Timing 100000000 calls to fixdivf...CPU time = 16.796364 secs.
Timing 100000000 calls to fixdivl...CPU time = 19.028576 secs.

I shall look into that fixdivf vs fixdivl.

EDIT: so as not to confuse things, I ran Peter Wang's adaption. Results are:

-O2:
sizeof(long long) = 8, sizeof(fixed) = 4
Timing 200000000 calls to fixmulf...CPU time = 16.082407 secs.
Timing 200000000 calls to fixmuli...CPU time = 3.940291 secs.
Timing 200000000 calls to fixmull...CPU time = 2.122170 secs.
Timing 200000000 calls to fixdivf...CPU time = 33.572767 secs.
Timing 200000000 calls to fixdivl...CPU time = 38.042508 secs.

No optimisations:
sizeof(long long) = 8, sizeof(fixed) = 4
Timing 200000000 calls to fixmulf...CPU time = 89.162385 secs.
Timing 200000000 calls to fixmuli...CPU time = 30.188200 secs.
Timing 200000000 calls to fixmull...CPU time = 23.056884 secs.
Timing 200000000 calls to fixdivf...CPU time = 115.471288 secs.
Timing 200000000 calls to fixdivl...CPU time = 63.658895 secs.

Evert
Member #794
November 2000

Ah, you're quite right. I should have realized the compiler would optimize out an unused variable!
With the zacc accumulation and output added, my results now look like this:

Quote:

Xeon:
sizeof(long long) = 8, sizeof(fixed) = 4
Timing 100000000 calls to fixmulf...CPU time = 0.891864 secs.
Timing 100000000 calls to fixmuli...CPU time = 2.023693 secs.
Timing 100000000 calls to fixmull...CPU time = 2.681592 secs.
Timing 100000000 calls to fixdivf...CPU time = 3.426479 secs.
Timing 100000000 calls to fixdivl...CPU time = 3.107528 secs.

Celeron:
sizeof(long long) = 8, sizeof(fixed) = 4
Timing 100000000 calls to fixmulf...CPU time = 2.480000 secs.
Timing 100000000 calls to fixmuli...CPU time = 2.620000 secs.
Timing 100000000 calls to fixmull...CPU time = 2.880000 secs.
Timing 100000000 calls to fixdivf...CPU time = 13.010000 secs.
Timing 100000000 calls to fixdivl...CPU time = 8.570000 secs.

AMD64:
sizeof(long long) = 8, sizeof(fixed) = 4
Timing 100000000 calls to fixmulf...CPU time = 0.445932 secs.
Timing 100000000 calls to fixmuli...CPU time = 0.717891 secs.
Timing 100000000 calls to fixmull...CPU time = 0.227965 secs.
Timing 100000000 calls to fixdivf...CPU time = 1.306801 secs.
Timing 100000000 calls to fixdivl...CPU time = 4.045385 secs.

Basically, fixmull is a good idea on the AMD64, but a bad idea on the two Intel machines. For fixdivl, it is actually the other way around (!). I'll look at the detailed assembler output later on.
For now, another relevant statistic may be the gcc version involved. The celeron runs gcc 3.3.1, the AMD64 runs gcc 3.4.1 and the Xeon runs gcc 3.4.3.

Thomas Harte
Member #33
April 2000

I've got "gcc version 4.0.0 20041026 (Apple Computer, Inc. build 4061)". Having been code digging, it appears that the 64bit divide calls some other function for a reason I don't fully understand yet given that the PowerPC has the divd 64 bit divide instruction. I'm going to sniff around this topic and see if I can figure anything out.

EDIT: from a Mac vs PC viewpoint, is the fact that my 667MHz G4 is apparently more than twice as fast as the 950MHz Celeron for multiplies (but less than half the speed for divides) surprising? In the sense that it may suggest problems with the test.

Evert
Member #794
November 2000

Quote:

from a Mac vs PC viewpoint, is the fact that my 667MHz G4 is apparently more than twice as fast as the 950MHz Celeron for multiplies (but less than half the speed for divides) surprising?

I don't think so. I used 300MHz Sun workstations that were faster for some calculations.

That said, it's quite possible I bodged up the test somewhere else too - I'm not a computer professional, after all, so it's possible that there is some benchmarking subtlety I'm unaware of. It looks like it should be doing the right thing now though.

orz
Member #565
August 2000

This data is for MSVC 7.1 / Athlon XP 2500+

A few notes:

All of the different methods produce different results. Most handle rounding differently from each other, and fixmulf handles overflow differently. I suggest modifying the source to initialize x and y to some known value before each loop.
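For example, something like this at the top of each timed loop (the values are arbitrary; the point is only that every variant sees identical inputs):

   x = 3 << 16;      /* 3.0 in 16.16 */
   y = -(5 << 14);   /* -1.25 in 16.16 */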

Enabling SSE1 (which was disabled by default - this is MSVC 7.1) improves performance of fixmulf by 20%. A simple by-hand translation of fixmull to inline asm improved performance by 10% despite the inefficiencies of inline asm on MSVC (you can't tell the compiler where you want stuff and what gets clobbered and all that like you can in gcc).
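The inline asm was just the obvious translation of fixmull; from memory it went something along these lines (treat it as a sketch rather than the exact code):

fixed fixmulasm(fixed x, fixed y)
{
   __asm {
      mov  eax, x
      imul y              ; edx:eax = full 64-bit signed product
      shrd eax, edx, 16   ; keep bits 16..47 as the 16.16 result
   }
   /* the result is left in eax, which MSVC treats as the return value */
}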

Performance scaled linearly on my computer for loop counts all the way down to 10^3, for all functions except fixmulf, which only scaled down to loop count = 10^5 (at 10^4 it took almost twice as long per loop, at 10^3 it took over 3 times as long per loop).

LOOP_COUNT = 200000000
total seconds
  2.100045    fixmulf
  2.185031    fixmuli
  1.431395    fixmull
  1.278765    fixmulasm
  7.206338    fixdivf
 12.066815    fixdivl
