Allegro.cc - Online Community

This thread is locked; no one can reply to it.
CPU family compilation of allegro?
A J
Member #3,025
December 2002

Since the asm code may now be what makes things slow, it might be time to review its usefulness.
How about an option to turn it off?
Is there such an option?
It would also reduce the MSVC build's dependency on gcc.

___________________________
The more you talk, the more AJ is right. - ML

HoHo
Member #4,534
April 2004

./configure --enable-asm=no

[edit]

Someone with Windows access should compile several versions of Allegro (statically linked) and test them. Then several people should run these on different machines to see how big the speed gain or loss is for C-only vs. asm.

__________
In theory, there is no difference between theory and practice. But, in practice, there is - Jan L.A. van de Snepscheut
MMORPG's...Many Men Online Role Playing Girls - Radagar
"Is Java REALLY slower? Does STL really bloat your exes? Find out with your friendly host, HoHo, and his benchmarking machine!" - Jakub Wasilewski

gillius
Member #119
April 2000

I'd like to see HoHo's test run on MSVC 7.1 with SSE or SSE2 enabled. I have personally seen it generate SSE asm code for floating-point operations (are they even used in Allegro?). In my Direct3D game, IIRC, SSE2 increased performance by 10-20%, but the game was heavy on vector math, which is what SSE was meant for.

Gillius
Gillius's Programming -- https://gillius.org/

HoHo
Member #4,534
April 2004

Quote:

I'd like to see HoHo's test run on MSVC 7.1 with SSE or SSE2 enabled.

I don't quite get what you mean by this. Do you want me to run the Allegro test compiled with MSVC 7.1? If so, I guess I could install XP on my computer's spare partition to test it.

[edit]
I could also test-run it on my work PC (also a P4). It has both Linux and XP.

The only thing is that I haven't set up my development environment in XP, so it would take quite some time to get things compiling in it.

gillius
Member #119
April 2000

I didn't mean you specifically, or even anyone, should do it. I just said simply that I was curious to see the results of MSVC 7.1's optimizations for P4 over GCC's optimizations. If MSVC can leverage SSE better than GCC, then perhaps we will see the C code far outperform the ASM code on MSVC while on GCC it really is still quite a toss-up (although enough of a toss-up that I'd say that had I only the C code now I wouldn't waste my time writing an ASM version).

Oscar Giner
Member #2,207
April 2002

Quote:

ALLEGRO_USE_C=1 (GCC-based platforms only)

So how can I do the c only version of allegro with MSVC?

TARGET_ARCH_EXCL is also for gcc only, so I just modified the makefile directly: I added /G7 /arch:SSE2 to CFLAGS.

A J
Member #3,025
December 2002

Oscar, you might want to use /LTCG as well (together with /GL when compiling).
That is whole-program optimization, which does some inlining across compilation units.

Raf256
Member #3,501
May 2003

I think it might be a good idea to change some asm code to C now, because "write in asm for speed" is becoming a bit of an urban legend today. I'm no expert on the subject, but I would really suggest trying:
- using normal C, not asm
- then compiling it for the right architecture (-march and so on)
- (EDITED) oh, one thing should in fact be done by hand, AFAIR: the MMX/SSE usage. But it can be done in GCC (no need for asm) using neat macros, which by the way expand either to SSE opcodes or to i386 opcodes and so on, depending on the selected architecture. That makes the build process (see below) and maintenance much easier \o/ I can't find the link to the page describing them right now, though (anyone? They were macros like VS8_ADD(..) or something?)

- running a profiler and then using its output to optimize more (can this be run when building Allegro? It would require writing a test function that calls all available functions a few times to give the profiler a chance to analyze the code, right?)

- adding a builder to make several versions of Allegro, like i386, 486, 586 and so on. Then the correct version of the library would be loaded (the game should be distributed with alleg.i386.dll, alleg.i586.dll and so on, and perhaps the same for *.so on Linux), or
the author of a game using Allegro could also make a multi-builder (to make game.i386.exe, .i486.exe and so on) and then statically link the matching version of Allegro (link the i486 version of Allegro into his code while building it for i486, and so on).

I think it might give a noticeable speed boost, which is the best thing in Allegro right after its win32/linux/... portability :)
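
The "neat macros" in the list above may have been something like GCC's generic vector extensions; that is an assumption, since the original link is not given. A minimal sketch of the idea, with hypothetical names `v4si` and `add4`: the same C source can be compiled to SSE2 instructions when the target allows it (for example -march=pentium4 or -msse2), and GCC lowers it to ordinary scalar code otherwise.

```c
#include <assert.h>
#include <string.h>

/* GCC/Clang extension: a vector of four 32-bit ints, 16 bytes wide. */
typedef int v4si __attribute__((vector_size(16)));

/* Adds four pairs of ints in one vector operation.
   dst, a and b must each point to at least four ints. */
static void add4(int *dst, const int *a, const int *b)
{
    v4si va, vb, vr;

    assert(dst && a && b);
    memcpy(&va, a, sizeof va);   /* load without alignment requirements */
    memcpy(&vb, b, sizeof vb);
    vr = va + vb;                /* element-wise add; no inline asm needed */
    memcpy(dst, &vr, sizeof vr);
}
```

Calling `add4(r, a, b)` with `a = {1, 2, 3, 4}` and `b = {10, 20, 30, 40}` leaves `{11, 22, 33, 44}` in `r`. The point of the sketch is only that the architecture choice moves from the source code into the compiler flags.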

Evert
Member #794
November 2000

Quote:

So how can I do the c only version of allegro with MSVC?

Ideally you could, but I think the GDI driver (and maybe the DirectX one too) uses inline assembler anyway, thus making it impossible to build the C only version of the library with MSVC alone.

Quote:

because "write in asm for speed" is becoming a bit of an urban legend today

Not quite. The output of a compiler for a modern processor is going to beat hand-optimized assembly targeting a 386 though.

Quote:

But it can be done in GCC (no need for asm) using neat macros, which by the way expand either to SSE opcodes or to i386 opcodes and so on, depending on the selected architecture. That makes the build process (see below) and maintenance much easier \o/ I can't find the link to the page describing them right now, though (anyone? They were macros like VS8_ADD(..) or something?)

Fine, but someone has to do the work (and make sure the C only version remains processor neutral in the process).

Quote:

running a profiler and then using its output to optimize more (can this be run when building Allegro? It would require writing a test function that calls all available functions a few times to give the profiler a chance to analyze the code, right?)

This is always done before applying a patch designed to increase code speed. Also check the timing functions in the test programme.

Quote:

adding a builder to make several versions of Allegro, like i386, 486, 586 and so on. Then the correct version of the library would be loaded (the game should be distributed with alleg.i386.dll, alleg.i586.dll and so on, and perhaps the same for *.so on Linux), or
the author of a game using Allegro could also make a multi-builder (to make game.i386.exe, .i486.exe and so on) and then statically link the matching version of Allegro (link the i486 version of Allegro into his code while building it for i486, and so on).

This complicates the build process considerably and makes everything harder to maintain - especially since Intel processors are not the only target for Allegro (AMD 64 and Macintosh systems being the main alternatives). It also makes the build process pretty slow, since everything has to be compiled multiple times. I also wouldn't take kindly to a game I download coming with binaries for ten different types of processor.
Distributing shared object files in Linux is a big no-no, by the way.

Bob
Free Market Evangelist
September 2000

Quote:

Writing optimized code should be a bit simpler.

You wish. My guess is that programming for the Cell is going to be like programming for the Emotion Engine, except 10x more difficult.

Things like "caches" and "virtual memory" don't make optimizing harder, they make optimizing a whole lot easier! You don't have to worry (much) about where your data is and in what order you access it; the CPU figures it out for you.

--
- Bob
[ -- All my signature links are 404 -- ]

HoHo
Member #4,534
April 2004

Actually I meant really optimized code. Something like in Pixomatic

if it's not so easy then all we can do is hope that they have a hell of a good compiler for it (or at least a huge and useful manual)

Bob
Free Market Evangelist
September 2000

Quote:

if it's not so easy then all we can do is hope that they have a hell of a good compiler for it

Sure, in 10-15 years, just like every other architecture. There is nothing in CELL that makes compiler writers' life easier.

Quote:

(or at least a huge and useful manual)

Likely. Not sure how useful that would be.

Quote:

Actually I meant really optimized code. Something like in Pixomatic

I'll wait till I see it run on the Cell.

HoHo
Member #4,534
April 2004

What might help the compiler is the fact that it has an in-order core. If the compiler knows exactly how the underlying CPU works, optimizing gets a bit easier.

With Pixomatic I meant that the guys who programmed it used a lot of extreme optimization tricks to speed it up as much as possible, but quite often the program still behaved a bit differently than they expected because the CPU reordered some instructions and changed the latencies.

There have been three long articles in Dr. Dobb's Journal about how they developed the Pixomatic engine.

Raf256
Member #3,501
May 2003

IMHO my idea of building several versions is good: just make it an option and it's a win-win situation.

By default liballegro will build quickly in 386 mode.

With ./configure --multi-arch the build will take longer, but all (or all selected) libs will be built.
Then the user can make several versions of his game. If it is on a CD then I do not care; I can throw in even 10 .exe files and 10 Linux ELF executables.
Don't want this feature? Just do not use it, easy :)

HoHo
Member #4,534
April 2004

You can always statically link your executable and provide ~20 different versions of it. If you want to be smart, you create one program that checks the CPU's capabilities and then launches the fastest program file.
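
A rough sketch of such a launcher's decision step, under a few assumptions: the file names are hypothetical, and the CPU check uses the `__builtin_cpu_supports` builtin found in current GCC and Clang on x86 (newer than the compilers discussed in this thread). Anything else falls back to the most conservative build.

```c
#include <assert.h>
#include <stddef.h>

/* Maps CPU features to a build of the game. The file names are made up;
   a real launcher would use whatever builds were actually shipped. */
static const char *pick_binary(int has_sse2, int has_sse, int has_mmx)
{
    if (has_sse2) return "game.pentium4.exe";
    if (has_sse)  return "game.pentium3.exe";
    if (has_mmx)  return "game.pentium-mmx.exe";
    return "game.i386.exe";
}

/* Queries the CPU the launcher is running on. */
static const char *pick_binary_for_this_cpu(void)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __builtin_cpu_init();
    return pick_binary(__builtin_cpu_supports("sse2"),
                       __builtin_cpu_supports("sse"),
                       __builtin_cpu_supports("mmx"));
#else
    return pick_binary(0, 0, 0);   /* unknown CPU: take the safe build */
#endif
}
```

The launcher would then start the returned file with whatever the platform offers for that (an exec-style call on Unix, for instance). The launcher itself has to be built for the lowest common denominator, or it will not run on the machines it is meant to help.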

A J
Member #3,025
December 2002

Quote:

my idea of building several versions is good

And who is going to write these versions?

HoHo
Member #4,534
April 2004

He meant building a separate version for every CPU architecture:
i386, i486, pentium3, athlon-xp, pentium4 and all the others you can select with -mtune

Kitty Cat
Member #2,815
October 2002

-march would have a better effect than -mtune. -mtune still produces backwards-compatible code, so using, say, -mtune=pentium2 would still produce i386-compatible code, but tweak it to run better on P2s (which could be just as good for a P3 or Athlon-XP, given the must-be-i386-compatible restrictions). And why build arch-dependent versions if you're only likely to use one? :P
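
One way to see this difference from inside a program: GCC and Clang predefine macros such as `__SSE2__` only when the instruction set is actually enabled (for example by -march=pentium4 or -msse2), while -mtune on its own leaves them undefined. A small sketch; the function name is hypothetical.

```c
#include <assert.h>

/* Reports the widest x86 SIMD level the compiler was allowed to emit
   for this translation unit, based on compiler-predefined macros. */
static const char *compiled_simd_level(void)
{
#if defined(__SSE2__)
    return "sse2";
#elif defined(__SSE__)
    return "sse";
#elif defined(__MMX__)
    return "mmx";
#else
    return "none";   /* plain i386-compatible code, or not an x86 target */
#endif
}
```

On a 32-bit x86 target, building with -mtune=pentium4 alone should still report "none", whereas -march=pentium4 reports "sse2". On x86-64, SSE2 is part of the baseline, so the result there is "sse2" either way.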

--
"Do not meddle in the affairs of cats, for they are subtle and will pee on your computer." -- Bruce Graham

Michael Jensen
Member #2,870
October 2002

You could simply make Allegro build several dynamically linked libraries, one each for 386, 486, P5, etc., and implement the entire API with function pointers, except allegro_init(). That function could set up the API when the program starts and calls it, based on what kind of beast you actually have. Only the API would be optimized and not your user code, though, and the DLLs/SOs/whatevers would be huge. Oh, and it would be a waste of time since Allegro is already fast enough...
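
The function-pointer scheme could look roughly like this. All names are hypothetical stand-ins, not Allegro's API: `api_init` plays the role of allegro_init(), and the two `sum_*` functions stand for the same routine built in two differently optimised libraries.

```c
#include <assert.h>
#include <stddef.h>

/* Baseline implementation, safe on any CPU. */
static int sum_generic(const int *v, size_t n)
{
    int total = 0;
    size_t i;
    for (i = 0; i < n; i++)
        total += v[i];
    return total;
}

/* Stand-in for a build tuned for a newer CPU (here just unrolled by four). */
static int sum_unrolled(const int *v, size_t n)
{
    int total = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        total += v[i] + v[i + 1] + v[i + 2] + v[i + 3];
    for (; i < n; i++)
        total += v[i];
    return total;
}

/* The public entry point is a pointer that the init routine fills in. */
static int (*api_sum)(const int *v, size_t n) = NULL;

/* Chooses the implementation once, at startup. */
static void api_init(int cpu_has_fast_path)
{
    api_sum = cpu_has_fast_path ? sum_unrolled : sum_generic;
}
```

User code only ever calls `api_sum(...)`, so it does not need to know which variant was picked. The cost is one indirect call per API function, and as noted above, only the library gets the benefit.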

HoHo
Member #4,534
April 2004

One reason there is probably no point in creating several versions of the library is that it mostly benefits newer machines, which are mostly fast enough already.

One thing we could think about is replacing some of the current asm-based functions with their C counterparts where those are faster (draw_rle_sprite seems to benefit most).

Michael Jensen
Member #2,870
October 2002

Quote:

(draw_rle_sprite seems to benefit most).

but on slower computers, also?

I thought the whole point of the current state of the draw_rle_sprite was for the slower computers, it was optimized for them, as for newer computers, they don't really need RLE sprites as far as speed goes.

HoHo
Member #4,534
April 2004

From my previous test:
statically linked allegro with asm routines, no special compilation flags on my home PC:
draw_rle_sprite() - 474702
Same but c-only:
draw_rle_sprite() - 622629

Difference is way bigger than A J reported in another thread about blit performance

I haven't got a slow computer ATM but during the weekend I could test it on a p200.

There is a slight possibility that I've done something terribly wrong with the tests, so I would appreciate it if anyone could confirm the results.

Michael Jensen
Member #2,870
October 2002

Quote:

From my previous test:
statically linked allegro with asm routines, no special compilation flags on my home PC:
draw_rle_sprite() - 474702
Same but c-only:
draw_rle_sprite() - 622629

but that wasn't on a 486 or anything, and I'm guessing we'd really need a legit 386 to see some results; Anyway, on anything faster than that, that's not a big difference, 200k more on a function that nobody uses except people who write games for 486s... Now, maybe my argument isn't valid. But I feel that draw_rle_sprite is only practical for people forced to use it... -- How many RLE sprites do you really need to draw in a second, anyway?

HoHo
Member #4,534
April 2004

You have a 486? If you do please test it. A normal pentium, pentium2, older amd or whatever else older than my pc should be good too.

In my X-Com engine I needed to draw about 2500 sprites per frame (for the map alone), with sizes ranging from 20x30 to 32x64. On a P200 I had to use compiled sprites because they were the only ones fast enough for the game to run at ~30 fps. Because of that I had to use all kinds of hacks to get it working (a bigger back buffer was the most painful one). With RLE sprites it would have been much easier and would have saved a lot of memory.
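
For readers wondering why RLE sprites are cheap in both memory and time, here is a sketch of drawing one run-length-encoded scanline. The encoding is a simplified, made-up layout for 8-bit pixels and not necessarily what Allegro stores internally: a count byte c > 0 means c solid pixels follow, c < 0 means skip -c transparent pixels, and 0 ends the line.

```c
#include <assert.h>
#include <string.h>

/* Draws one encoded scanline into dst and returns a pointer just past
   the end-of-line marker, i.e. to the start of the next scanline. */
static const signed char *draw_rle_line(unsigned char *dst,
                                        const signed char *src)
{
    assert(dst && src);
    for (;;) {
        signed char c = *src++;
        if (c == 0)
            return src;                     /* end of this scanline */
        if (c > 0) {
            memcpy(dst, src, (size_t)c);    /* copy a solid run in one go */
            dst += c;
            src += c;
        } else {
            dst += -c;                      /* skip a transparent run */
        }
    }
}
```

A transparent run costs a single byte and no per-pixel mask test, which is where the savings come from on mostly transparent map tiles. A compiled sprite, by contrast, is stored as machine code that writes the pixels, so it is larger in memory.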

Unfortunately I couldn't finish the project. I had map and soldier rendering and animation working. If I had added day and night cycle, pathfinding and AI I couldn't have spent the whole time rendering and then FPS would have dropped to half of the 30fps-limit.

[edit]

Too bad I can't bump this thread.

I ran some more tests. This time there are no CPU-specific optimizations; it's only C vs. asm. I created a little table showing my results.

P4 3.0@3.82 1M cache
512M ram single channel
2.6.11-gentoo-r4
gcc version 3.4.3 20050110

Resolution 800x600 and driver X in all runs. Diff % is the speed difference in %; >0 means C is better.

Bitdepth                   | 32                      | 24                      | 16                      | 8
Function                   |       C     Asm  Diff % |       C     Asm  Diff % |       C     Asm  Diff % |       C     Asm  Diff %
textout()                  |  171212  213753   -19.9 |  155685  263032  -40.81 |  191343  208104   -8.05 |  194637  196086   -0.74
blit() from memory         |  453115  300027   51.02 |  185657  264728  -29.87 |  466279  388827   19.92 |  363734  539238  -32.55
masked_blit() from memory  |  229041  163688   39.93 |  129498   86325   50.01 |  232047  231639    0.18 |  198995  269500  -26.16
draw_sprite()              |  416321  297268   40.05 |  245417  249798   -1.75 |  408864  344216   18.78 |  320361  495931   -35.4
draw_rle_sprite()          |  717764  421222    70.4 |  640240  387080    65.4 |  708827  517834   36.88 |  674117  508662   32.53
draw_compiled_sprite()     |  717493 1661973  -56.83 |  642139 1245370  -48.44 |  710376 1710608  -58.47 |  663957 2035349  -67.38
draw_trans_sprite()        |  176353   75453  133.73 |  124666   44480  180.27 |  164053   46482  252.94 |  353072  464093  -23.92
draw_trans_rle_sprite()    |  192268   88702  116.76 |  171235   47175  262.98 |  199796  209313   -4.55 |  432254  344263   25.56
draw_lit_sprite()          |  182740  153953    18.7 |  133052  140253   -5.13 |  178109  148202   20.18 |  325448  268232   21.33
draw_lit_rle_sprite()      |  204653   91639  123.33 |  191327   43380  341.05 |  222652   46045  383.55 |  622470  548296   13.53

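As can be seen, C is faster than asm in 24 of the 40 cases, often by a wide margin (RLE, translucent and lit sprites especially). Asm still wins for textout() and draw_compiled_sprite() at every depth, and for blit(), masked_blit() and draw_sprite() at 8 bpp.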
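
For anyone adding their own numbers: the percentages posted here are consistent with (C / Asm - 1) * 100, assuming the raw figures are counts where higher is better. A tiny helper along those lines (the function name is hypothetical):

```c
#include <assert.h>

/* Speed difference in percent; a result above 0 means the C build
   achieved the higher count. */
static double speed_diff_percent(double c_count, double asm_count)
{
    assert(asm_count > 0.0);
    return (c_count / asm_count - 1.0) * 100.0;
}
```

For the 32-bit draw_rle_sprite() figures, `speed_diff_percent(717764, 421222)` comes out at about 70.4, matching the posted value.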

Could someone with older-generation computers/compilers please run the tests too and post the results here?

I'm seriously thinking of modifying the Allegro test program so that the user just executes it and it runs most of the tests (or only selected ones, if I get to it), so that testing different settings wouldn't be such a pain.
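
A bare-bones sketch of what such an unattended runner might look like. This is not the existing test program's code: every name is hypothetical, `dummy_work` stands in for a real drawing call, and timing uses the standard `clock()` function.

```c
#include <assert.h>
#include <time.h>

/* One benchmark entry: a label and the routine to hammer on. */
struct bench {
    const char *name;
    void (*fn)(void);
};

/* Calls fn repeatedly for roughly `seconds` and returns the call count. */
static unsigned long count_calls(void (*fn)(void), double seconds)
{
    unsigned long calls = 0;
    clock_t start = clock();
    clock_t budget = (clock_t)(seconds * CLOCKS_PER_SEC);

    assert(fn != 0);
    do {
        fn();
        calls++;
    } while (clock() - start < budget);
    return calls;
}

/* Runs every test whose bit is set in mask; skipped tests report 0. */
static void run_selected(const struct bench *tests, int n, unsigned mask,
                         double seconds, unsigned long *out)
{
    int i;
    for (i = 0; i < n; i++)
        out[i] = (mask & (1u << i)) ? count_calls(tests[i].fn, seconds) : 0;
}

/* Placeholder workload so the sketch is self-contained. */
static volatile unsigned long sink;
static void dummy_work(void) { sink++; }
```

With a table of entries like `{"draw_rle_sprite()", some_wrapper}` and a mask taken from the command line, one run could produce a whole column of results without anyone clicking through menus.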

[edit2]

Added 8-bit results.

A J
Member #3,025
December 2002

bump ;>
