Allegro.cc - Online Community


CPU family compilation of allegro?
Raf256
Member #3,501
May 2003

When compiling the Allegro game library, can I somehow tell it that I will run my Allegro-using programs only on, for example, an i486 or a Pentium, and never on an i386?

If I understand correctly, some functions would then run quite a bit faster.
Other functions are precompiled in several variants (with MMX, without MMX, and so on), and at run time the correct version is chosen based on the cpu_info data, right?

But only some functions work that way (yes?).
If so, then perhaps it would be cool to have a script that builds several versions of Allegro, like
alleg.i386.lib
alleg.i486.lib
alleg.i586.lib
and so on (about 6 possibilities?). Then one can either use them through something like `allegro-configure --libs --march ik7', or, even better for bigger/serious projects, have an automatic builder that makes several versions of the game:

1) compiled as -i386 using the default Allegro i386 (linked statically)
2) as -i586 using Allegro i586 (statically)
and so on

Then have a small launcher that checks cpu_info and runs the correct (fastest) version of the code. There might be a noticeable speed difference if one is doing lots of lower-level stuff like pixel manipulation for a particle system (as I do).
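The launcher idea could be sketched roughly as below. This is only an illustration: the variant set, the executable names, and the pick_variant/variant_name helpers are all made up, and the family and MMX inputs are assumed to come from whatever CPU detection the launcher uses (the cpu_info data mentioned in the post).

```c
#include <assert.h>

/* Build variants a launcher might ship; the set is hypothetical. */
typedef enum {
    VARIANT_I386,
    VARIANT_I486,
    VARIANT_I586,
    VARIANT_I586_MMX,
    VARIANT_I686
} build_variant;

/* Executable names for each variant (made up for this sketch),
   in the same order as the enum above. */
static const char *const variant_exe[] = {
    "game.i386", "game.i486", "game.i586", "game.i586mmx", "game.i686"
};

/* Pick the most specific build the detected CPU can run.
   family: 3 = i386, 4 = i486, 5 = Pentium, 6 or more = Pentium Pro and later.
   has_mmx: nonzero if the CPU reports MMX support. */
build_variant pick_variant(int family, int has_mmx)
{
    assert(family >= 3);
    if (family >= 6) return VARIANT_I686;
    if (family == 5) return has_mmx ? VARIANT_I586_MMX : VARIANT_I586;
    if (family == 4) return VARIANT_I486;
    return VARIANT_I386;
}

/* Name of the executable the launcher would start for a variant. */
const char *variant_name(build_variant v)
{
    return variant_exe[v];
}
```

A real launcher would then start variant_name(pick_variant(family, has_mmx)) with whatever process-spawning call the platform offers; that part is left out here.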

HoHo
Member #4,534
April 2004

In the real world, the speed difference between different architecture targets is minimal.

Allegro uses ASM internally, and ATM its most time-critical functions are in MMX and are way faster than any compiler could optimize it. In the future Allegro might have some functions using SSE/SSE2 too. Compiler will never get as good as hand-optimized code.

__________
In theory, there is no difference between theory and practice. But, in practice, there is - Jan L.A. van de Snepscheut
MMORPG's...Many Men Online Role Playing Girls - Radagar
"Is Java REALLY slower? Does STL really bloat your exes? Find out with your friendly host, HoHo, and his benchmarking machine!" - Jakub Wasilewski

Kitty Cat
Member #2,815
October 2002

If you're using GCC, there's a way to make it compile optimized for a specific CPU family and not retain backwards compatibility. I believe it's when using make, set TARGET_EXCLOPTS="xxx" where xxx is i386, i486, i586 (or pentium), i686 (or pentiumpro), pentium2, pentium3, pentium4, athlon, athlon-tbird, athlon-xp... there may be more. These won't implicitly enable MMX or anything, although you can add -mmmx. So, for example, to compile exclusively for a Pentium w/ MMX or above (this excludes the Pentium Pro, which doesn't have MMX), do:
make TARGET_EXCLOPTS="pentium -mmmx"

If you're using MSVC, then I don't know..
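For the GCC case, one way to check what a given set of flags actually switched on is to look at the compiler's predefined macros. A minimal sketch, assuming GCC on x86 (which defines __MMX__, __SSE__ and __SSE2__ when the matching extension is enabled); the flag names and the built_with helper are hypothetical:

```c
#include <assert.h>

/* Bit flags for this sketch only; the names are hypothetical. */
enum { BUILT_WITH_MMX = 1, BUILT_WITH_SSE = 2, BUILT_WITH_SSE2 = 4 };

/* Report which instruction-set extensions the compiler was allowed to use
   for this translation unit, based on GCC's predefined macros. */
unsigned built_with(void)
{
    unsigned mask = 0;
#ifdef __MMX__
    mask |= BUILT_WITH_MMX;
#endif
#ifdef __SSE__
    mask |= BUILT_WITH_SSE;
#endif
#ifdef __SSE2__
    mask |= BUILT_WITH_SSE2;
#endif
    /* GCC enables SSE whenever SSE2 is enabled, so SSE2 alone is unexpected. */
    assert(!(mask & BUILT_WITH_SSE2) || (mask & BUILT_WITH_SSE));
    return mask;
}
```

Building this once with only the architecture option and once with -mmmx added, and printing the result, would show whether the architecture option alone turned MMX on.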

--
"Do not meddle in the affairs of cats, for they are subtle and will pee on your computer." -- Bruce Graham

A J
Member #3,025
December 2002

Quote:

and are way faster than any compiler could optimize it

false.

Quote:

Compiler will never get as good as hand-optimized code

false.

if you're using msvc, i have added (in the past few days) options to specify the chipset features.
it's currently in CVS, so expect it in the next WIP version.

___________________________
The more you talk, the more AJ is right. - ML

HoHo
Member #4,534
April 2004

Quote:

and are way faster than any compiler could optimize it
false.

Then why are pure-c versions slower than ASM versions of bitmap blitting, blending and the like?

Quote:

Compiler will never get as good as hand-optimized code
false.

A human can always take the compiler's ASM output and optimize it even more. Of course, if the code is very simple then the compiler can create the best thing possible.


A J
Member #3,025
December 2002

Quote:

Then why are pure-c versions slower than ASM versions of bitmap blitting, blending and the like?

they have not been written for a specific instruction set.
the reason allegro's ASM is quicker than allegro's C code is that someone took the time to make it so. The ASM is not portable either, whereas the C code is portable, therefore you cannot compare them.

Quote:

Human can always take compiler's ASM output and optimize it even more.

humans can always take the toaster's toast and burn it even more than the toaster ever could. ::)

Quote:

Of course if the code is very simple then compiler can create the best thing possible

so you go on to contradict yourself. :P
you seem to consider humans' code-optimizing abilities superior to a computer's, yet computers make far fewer mistakes than humans.


HoHo
Member #4,534
April 2004

It leaves me with no other choice than to compile C and asm versions of Allegro targeted at different platforms. It might take a while though, I'm busy for the next couple of days.


A J
Member #3,025
December 2002

try writing some SSE/SSE2 instructions, they are the next thing allegro will need for improved speed.


HoHo
Member #4,534
April 2004

Can't the compiler do it? ;D I don't think any compiler (besides Intel's) can use SSE/SSE2 well enough to speed stuff up.


Raf256
Member #3,501
May 2003

AFAIK good compilers, like gcc, will try quite hard to optimize the instructions and to use the additional registers and instructions... when compiling with the proper -march flag, of course.

Tobi Vollebregt
Member #1,031
March 2001

Still, they don't generate MMX or SSE instructions. Just CMOV if you specify -march=pentiumpro (i686) or higher, IIRC.

________________________________________
website || Zipfile reader @ Allegro Wiki || Download zipfile reader

Kitty Cat
Member #2,815
October 2002

Tobi, that's what -mmmx -msse -m3dnow etc. are for. ;) Just note that those switches don't produce backwards-compatible code (using -msse will require an SSE-capable CPU, which even my Athlon-Tbird 1.1GHz isn't).


Evert
Member #794
November 2000

You can tell the compiler to generate SSEx or MMX instructions - at least I know you can with the Intel Fortran compiler.
Fortran compilers are typically better at optimizing than C compilers though...

As for whether or not a human can do a better job at optimizing assembler code than a computer can: yes, they can indeed. A competent human generates better code than any compiler could - but it takes much more time, and high-level optimizations should be the first thing to do anyway.

Tobi Vollebregt
Member #1,031
March 2001

Quote:

-mmmx
-mno-mmx

-msse
-mno-sse

-msse2
-mno-sse2

-msse3
-mno-sse3

-m3dnow
-mno-3dnow
These switches enable or disable the use of built-in functions that allow direct access to the MMX, SSE, SSE2, SSE3 and 3Dnow extensions of the instruction set.

See X86 Built-in Functions, for details of the functions enabled and disabled by these switches.

To have SSE/SSE2 instructions generated automatically from floating-point code, see -mfpmath=sse.

source: http://www.dis.com/gnu/gcc/i386-and-x86-64-Options.html

See also: http://www.dis.com/gnu/gcc/X86-Built-in-Functions.html#X86%20Built-in%20Functions

So IIRC only SSE can be generated automatically by using -mfpmath=sse, and not with -msse.

I don't know about other compilers though.
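To illustrate what "built-in functions" means in the quoted text: with the switch on, the instructions become available as functions you call yourself, rather than something the compiler inserts on its own. A minimal sketch, assuming GCC on x86 with SSE2 enabled (-msse2) and the intrinsics from <emmintrin.h>; the function name add_saturate_u8 is made up, and the code falls back to plain C when SSE2 is not enabled:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#ifdef __SSE2__
#include <emmintrin.h>   /* SSE2 intrinsics, usable when built with -msse2 */
#endif

/* Saturating add of two byte buffers: dst[i] = min(255, a[i] + b[i]).
   This is the kind of per-channel operation an additive blend needs. */
void add_saturate_u8(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
    size_t i = 0;
    assert(dst != NULL && a != NULL && b != NULL);
#ifdef __SSE2__
    /* 16 bytes per iteration, using unaligned loads and stores. */
    for (; i + 16 <= n; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_adds_epu8(va, vb));
    }
#endif
    /* Scalar tail (and the whole buffer when SSE2 is not enabled). */
    for (; i < n; i++) {
        unsigned sum = (unsigned)a[i] + b[i];
        dst[i] = (uint8_t)(sum > 255 ? 255 : sum);
    }
}
```

As noted elsewhere in the thread, code built this way will not run on a CPU without SSE2 unless the caller checks for it first.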


A J
Member #3,025
December 2002

for msvc7

/arch:SSE /arch:SSE2 -G7


HoHo
Member #4,534
April 2004

OK, I took the time and did some initial tests. I downloaded the latest weekly CVS snapshot and compiled four different versions:

static, asm, p4 (./configure --enable-static=yes --enable-staticprog=yes --enable-opts=pentium4 --enable-exclopts=pentium4)
static, c, p4 (./configure --enable-static=yes --enable-staticprog=yes --enable-asm=no --enable-opts=pentium4 --enable-exclopts=pentium4)
static, asm (./configure --enable-static=yes --enable-staticprog=yes)
static, c (./configure --enable-static=yes --enable-staticprog=yes --enable-asm=no )

I used static linking to be sure Allegro doesn't use some dynamic library I have somewhere.

I ran the test program and benchmarked memory bitmaps. I had to cut out some of the results (DRAW_MODE_XOR, DRAW_MODE_COPY_PATTERN, DRAW_MODE_SOLID_PATTERN and DRAW_MODE_MASKED_PATTERN) to fit my post into the 64kb limit (originally it was ~125k). ML should add a warning when previewing your post when it's too big.

The results were very interesting.

//static, asm, p4
Allegro 4.1.19 (20050326), Unix profile results

Memory bitmap size: 800x600
Color depth: 16 bpp

DRAW_MODE_SOLID results:

  putpixel() - 3863980
  hline() - 2187607
  vline() - 1439061
  line() - 303152
  rectfill() - 219069
  circle() - 198101
  circlefill() - 177439
  ellipse() - 123467
  ellipsefill() - 120098
  arc() - 174638
  triangle() - 140105

DRAW_MODE_TRANS results:

  putpixel() - 2351360
  hline() - 534863
  vline() - 301393
  line() - 160418
  rectfill() - 15983
  circle() - 92129
  circlefill() - 8911
  ellipse() - 70381
  ellipsefill() - 9932
  arc() - 153131
  triangle() - 11200

Other functions:

  textout() - 205332
  vram->vram blit() - N/A
  aligned vram->vram blit() - N/A
  blit() from memory - 352902
  aligned blit() from memory - 540591
  vram->vram masked_blit() - N/A
  masked_blit() from memory - 188838
  draw_sprite() - 309636
  draw_rle_sprite() - 482440
  draw_compiled_sprite() - 1465194
  draw_trans_sprite() - 124051
  draw_trans_rle_sprite() - 85396
  draw_lit_sprite() - 112701
  draw_lit_rle_sprite() - 188568

//static, c, p4
Allegro 4.1.19 (20050326), Unix profile results

Memory bitmap size: 800x600
Color depth: 16 bpp

DRAW_MODE_SOLID results:

  putpixel() - 4054055
  hline() - 2254552
  vline() - 1388449
  line() - 451235
  rectfill() - 207896
  circle() - 284830
  circlefill() - 188062
  ellipse() - 155530
  ellipsefill() - 125694
  arc() - 253459
  triangle() - 160083

DRAW_MODE_TRANS results:

  putpixel() - 2456494
  hline() - 743660
  vline() - 528555
  line() - 260635
  rectfill() - 25764
  circle() - 156132
  circlefill() - 27282
  ellipse() - 106101
  ellipsefill() - 27136
  arc() - 187759
  triangle() - 32878

Other functions:

  textout() - 193137
  vram->vram blit() - N/A
  aligned vram->vram blit() - N/A
  blit() from memory - 447901
  aligned blit() from memory - 426043
  vram->vram masked_blit() - N/A
  masked_blit() from memory - 264081
  draw_sprite() - 421456
  draw_rle_sprite() - 649046
  draw_compiled_sprite() - 655934
  draw_trans_sprite() - 159685
  draw_trans_rle_sprite() - 194210
  draw_lit_sprite() - 186170
  draw_lit_rle_sprite() - 210518

//static, asm
Allegro 4.1.19 (20050326), Unix profile results

Memory bitmap size: 800x600
Color depth: 16 bpp

DRAW_MODE_SOLID results:

  putpixel() - 4022308
  hline() - 2233415
  vline() - 1496657
  line() - 226901
  rectfill() - 221278
  circle() - 202662
  circlefill() - 177981
  ellipse() - 100883
  ellipsefill() - 121407
  arc() - 178891
  triangle() - 143064

DRAW_MODE_TRANS results:

  putpixel() - 2410795
  hline() - 458892
  vline() - 419612
  line() - 100949
  rectfill() - 13648
  circle() - 127500
  circlefill() - 14698
  ellipse() - 47123
  ellipsefill() - 15687
  arc() - 123835
  triangle() - 18670

Other functions:

  textout() - 168178
  vram->vram blit() - N/A
  aligned vram->vram blit() - N/A
  blit() from memory - 365947
  aligned blit() from memory - 558622
  vram->vram masked_blit() - N/A
  masked_blit() from memory - 196613
  draw_sprite() - 321303
  draw_rle_sprite() - 474702
  draw_compiled_sprite() - 1461730
  draw_trans_sprite() - 124654
  draw_trans_rle_sprite() - 170162
  draw_lit_sprite() - 116946
  draw_lit_rle_sprite() - 183653

//static, c
Allegro 4.1.19 (20050326), Unix profile results

Memory bitmap size: 800x600
Color depth: 16 bpp

DRAW_MODE_SOLID results:

  putpixel() - 4109923
  hline() - 2165808
  vline() - 1320638
  line() - 420870
  rectfill() - 180829
  circle() - 273598
  circlefill() - 182367
  ellipse() - 158451
  ellipsefill() - 122566
  arc() - 233229
  triangle() - 156225

DRAW_MODE_TRANS results:

  putpixel() - 2491428
  hline() - 619655
  vline() - 497386
  line() - 246436
  rectfill() - 19470
  circle() - 147090
  circlefill() - 21321
  ellipse() - 99359
  ellipsefill() - 22116
  arc() - 161832
  triangle() - 25889

Other functions:

  textout() - 165641
  vram->vram blit() - N/A
  aligned vram->vram blit() - N/A
  blit() from memory - 363354
  aligned blit() from memory - 384501
  vram->vram masked_blit() - N/A
  masked_blit() from memory - 208842
  draw_sprite() - 357001
  draw_rle_sprite() - 622629
  draw_compiled_sprite() - 602795
  draw_trans_sprite() - 120399
  draw_trans_rle_sprite() - 157713
  draw_lit_sprite() - 134902
  draw_lit_rle_sprite() - 186415
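To put rough numbers on the four runs: a minimal sketch, assuming higher counts mean more calls completed in the test program's timing window (which is how the replies read them). percent_diff is a hypothetical helper, not part of Allegro.

```c
#include <assert.h>

/* Relative difference of `other` against `baseline`, in percent.
   Positive means `other` completed more calls than `baseline`. */
double percent_diff(long baseline, long other)
{
    assert(baseline > 0);
    return 100.0 * (double)(other - baseline) / (double)baseline;
}
```

With the P4 figures, percent_diff(303152, 451235) for DRAW_MODE_SOLID line() (asm vs. C) comes out at roughly +49%, while percent_diff(1465194, 655934) for draw_compiled_sprite() is roughly -55%, so the picture differs from function to function.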


A J
Member #3,025
December 2002

so P4 'C' code beats the P4 asm code...

i like it!!!


Kitty Cat
Member #2,815
October 2002

Even the generic 'C' code gives the generic asm code a good fight.


HoHo
Member #4,534
April 2004

Quote:

i like it!!!

me too, but what would happen if someone good with asm took the compiler's output and tweaked it a bit more? Actually the asm is not P4 but Pentium MMX, I think. IIRC gcc doesn't optimize asm files, so they get no benefit from compiling for a specific target architecture.

One thing that's also interesting is that C with the default platform target (pentium) is practically as fast as the asm version. I guess the people that made it weren't excellent asm coders but only good ones (I wish I could be even a moderate one :-[)


Kitty Cat
Member #2,815
October 2002

Quote:

Actually the asm is not p4 but pentium mmx one I think.

The hand-written ASM in Allegro, IIRC, is i386. I can't think of any place in the code that checks anyway, and without checking, running non-i386 code on a 386 will crash it (and, afaik, Allegro is i386 compatible by default).


Evert
Member #794
November 2000

What's the difference between p4 asm and normal asm?
If there is none, then the noise level in that data is quite high and it's difficult to say which code is better.
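The noise argument can be made concrete by comparing the two asm builds with each other. A minimal sketch, assuming a routine really runs the same hand-written code in both builds, so that any gap between them is measurement noise; run_spread_percent is a hypothetical helper:

```c
#include <assert.h>
#include <stdlib.h>  /* labs */

/* Spread between two runs of the same code, as a percentage of their mean.
   The larger the spread, the less a single run can be trusted. */
double run_spread_percent(long run_a, long run_b)
{
    assert(run_a > 0 && run_b > 0);
    return 100.0 * (double)labs(run_a - run_b)
                 / (0.5 * ((double)run_a + (double)run_b));
}
```

Using the posted results for "static, asm, p4" against "static, asm", run_spread_percent(3863980, 4022308) for DRAW_MODE_SOLID putpixel() is about 4%, but run_spread_percent(85396, 170162) for draw_trans_rle_sprite() is about 66%. Differences between the C and asm builds that are smaller than that are hard to tell apart from noise.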

Note that compilers have become better at optimizing code in recent years, and that things that were a good idea to do in assembler five years ago (which is about where Allegro's assembler source originates) are not necessarily a good idea on modern hardware anymore.

Kitty Cat
Member #2,815
October 2002

I wouldn't doubt the P4 has extra op-codes that can be taken advantage of. Plus, perhaps extra registers and the like. As well, it could also be the way the asm is written to take advantage of the CPU's specific op-timings.


HoHo
Member #4,534
April 2004

In P4 (SSE/SSE2) asm one could use data prefetch, extra SSE registers and instructions, together with lots of other things that don't exist on Pentium 1s and that might speed stuff up.


X-G
Member #856
December 2000

Interestingly, whereas CPUs were previously constructed in such a way that it would be easy for humans to write good ASM, these days CPUs are designed with compilers in mind instead...

--
Since 2008-Jun-18, democracy in Sweden is dead. | 悪霊退散!悪霊退散!怨霊、物の怪、困った時は ドーマン!セーマン!ドーマン!セーマン! 直ぐに呼びましょう陰陽師レッツゴー!

HoHo
Member #4,534
April 2004

Cell should change it quite a bit. No CPU data prefetch, no cache misses* or instruction reordering. Writing optimized code should be a bit simpler. Also it should be easier for a compiler to generate good code.

*) There is no cache in the extra eight SPUs, but they do have an extremely fast local memory that the programmer can use directly to store data.



