CPU family compilation of allegro?

A J

Member #3,025

December 2002

as the asm code is now possibly the cause of slow code, it might be time to review its usefulness.
how about an option to turn it off?
is there such an option ?
it would also make the msvc build less dependent on gcc.

___________________________
The more you talk, the more AJ is right. - ML

HoHo

Member #4,534

April 2004

./configure --enable-asm=no

[edit]

Someone with Windows access should compile several versions of allegro (statically linked) and test them. Then several people should run these on different machines to see how big the speed gain/loss is from using C-only vs asm.

__________
In theory, there is no difference between theory and practice. But, in practice, there is - Jan L.A. van de Snepscheut
MMORPG's...Many Men Online Role Playing Girls - Radagar
"Is Java REALLY slower? Does STL really bloat your exes? Find out with your friendly host, HoHo, and his benchmarking machine!" - Jakub Wasilewski

gillius

Member #119

April 2000

I'd like to see HoHo's test run on MSVC 7.1 with SSE or SSE2 enabled. I have personally seen it generate SSE asm code for floating point operations (are they used in Allegro, even?). In my Direct3D game, IIRC, SSE2 increased the performance by 10-20%, but the game was heavy on vector math, which is what SSE was meant for.

Gillius
Gillius's Programming -- https://gillius.org/

HoHo

Member #4,534

April 2004

Quote:

I'd like to see HoHo's test run on MSVC 7.1 with SSE or SSE2 enabled.

I don't quite get what you mean by this. Do you want me to run the allegro test compiled with msvc7.1? If so, I guess I could install XP on my comp's spare partition to test it.

[edit]
I could also test-run it on my work PC (also a P4). It has both Linux and XP.

The only thing is that I haven't set up my development environment in XP, so it would take quite some time to get things compiling in it.

gillius

Member #119

April 2000

I didn't mean you specifically, or even anyone, should do it. I just said simply that I was curious to see the results of MSVC 7.1's optimizations for P4 over GCC's optimizations. If MSVC can leverage SSE better than GCC, then perhaps we will see the C code far outperform the ASM code on MSVC while on GCC it really is still quite a toss-up (although enough of a toss-up that I'd say that had I only the C code now I wouldn't waste my time writing an ASM version).

Gillius
Gillius's Programming -- https://gillius.org/

Oscar Giner

Member #2,207

April 2002

Quote:

ALLEGRO_USE_C=1 (GCC-based platforms only)

So how can I do the c only version of allegro with MSVC?

TARGET_ARCH_EXCL is also for gcc only, but I just modified the makefile directly. I added /G7 /arch:SSE2 to CFLAGS.

--
[Website | e-mail]
[Tetris Unlimited] [AllegAVI | AlText]

A J

Member #3,025

December 2002

oscar, you might want to use -LTCG also.
whole program optimization, which does some inlining across compilation units.

___________________________
The more you talk, the more AJ is right. - ML

Raf256

Member #3,501

May 2003

I think it might be a good idea to change some ASM code to C now, because the "write in asm for speed" idea is becoming a bit of an urban legend today. I'm not an expert on the subject, but I would really suggest trying:
- using normal C, not ASM
- then compiling it for the right architecture (-march and so on)
- (EDITED) oh, one thing should in fact be done by hand, AFAIR: the MMX/SSE usage. But it can be done in GCC (no need for asm) using neat macros, which by the way will expand either to SSE opcodes or to i386 opcodes and so on, depending on the selected architecture, which makes the build process (see below) and maintenance much easier \o/ I can't find the link to the page describing how to use them right now, though (anyone? those were macros like VS8_ADD(..) or something?)

- running a profiler and then using its output to optimize more (can this be run when building allegro? It would require writing a test function that uses all available functions a few times to give the profiler a chance to analyze the code, right?)

- add a builder to make several versions of allegro, like i386, 486, 586 and so on. Then the correct version of the library will be loaded (the game should be distributed with alleg.i386.dll, alleg.i586.dll and so on, and perhaps the same for *.so on linux), or
the author of a game using allegro can also make a multi-builder (to make game.i386.exe, game.i486.exe and so on) and then statically link the matching version of allegro (link the i486 version of allegro into his code while building it for i486 and so on).

I think it might give a noticeable speed boost, which is the best thing in allegro right after its win32/linux/... portability.
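
The macro page mentioned in the list above could not be identified, so the following is only a stand-in for the same idea: GCC's vector extensions let plain C express an 8-lane add, and the compiler chooses SSE2 instructions or ordinary integer code depending on the selected architecture. This is a minimal sketch, assuming GCC or Clang; the type name `v8hi` and both function names are made up for the illustration and are not the macros the post refers to.

```c
#include <assert.h>
#include <string.h>

/* GCC/Clang vector extension: eight 16-bit lanes in one 128-bit value.
   The type name v8hi is our own choice for this sketch. */
typedef short v8hi __attribute__((vector_size(16)));

/* Adds eight pairs of 16-bit values. No intrinsics and no asm: the
   compiler emits SIMD code when the target allows it and falls back
   to scalar code otherwise. */
v8hi add8(v8hi a, v8hi b)
{
   return a + b;
}

/* Hypothetical convenience wrapper for ordinary arrays of eight shorts. */
void add8_arrays(const short *a, const short *b, short *out)
{
   v8hi va, vb, vr;

   assert(a != NULL && b != NULL && out != NULL);

   /* memcpy avoids any alignment assumptions about the caller's arrays */
   memcpy(&va, a, sizeof va);
   memcpy(&vb, b, sizeof vb);
   vr = add8(va, vb);
   memcpy(out, &vr, sizeof vr);
}
```

Calling `add8_arrays(a, b, out)` on two arrays of eight shorts stores the lane-wise sums in `out`. The same source builds for i386 or for an SSE2 target; only the -march setting changes.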

Evert

Member #794

November 2000

Quote:

So how can I do the c only version of allegro with MSVC?

Ideally you could, but I think the GDI driver (and maybe the DirectX one too) uses inline assembler anyway, thus making it impossible to build the C only version of the library with MSVC alone.

Quote:

because the "write in asm for speed" idea is becoming a bit of an urban legend today

Not quite. The output of a compiler for a modern processor is going to beat hand-optimized assembly targeting a 386 though.

Quote:

But it can be done in GCC (no need for asm) using neat macros, which by the way will expand either to SSE opcodes or to i386 opcodes and so on, depending on the selected architecture, which makes the build process (see below) and maintenance much easier \o/ I can't find the link to the page describing how to use them right now, though (anyone? those were macros like VS8_ADD(..) or something?)

Fine, but someone has to do the work (and make sure the C only version remains processor neutral in the process).

Quote:

running a profiler and then using its output to optimize more (can this be run when building allegro? It would require writing a test function that uses all available functions a few times to give the profiler a chance to analyze the code, right?)

This is always done before applying a patch designed to increase code speed. Also check the timing functions in the test programme.

Quote:

add a builder to make several versions of allegro, like i386, 486, 586 and so on. Then the correct version of the library will be loaded (the game should be distributed with alleg.i386.dll, alleg.i586.dll and so on, and perhaps the same for *.so on linux), or
the author of a game using allegro can also make a multi-builder (to make game.i386.exe, game.i486.exe and so on) and then statically link the matching version of allegro (link the i486 version of allegro into his code while building it for i486 and so on).

This complicates the build process considerably and makes everything harder to maintain - especially since Intel processors are not the only target for Allegro (AMD 64 and Macintosh systems being the main alternatives). It also makes the build process pretty slow, since everything has to be compiled multiple times. I also wouldn't take kindly to a game I download coming with binaries for ten different types of processor.
Distributing shared object files in Linux is a big no-no, by the way.

Bob

Free Market Evangelist

September 2000

Quote:

Writing optimized code should be a bit simpler.

You wish. My guess is that programming for the Cell is going to be like programming for the Emotion Engine, except 10x more difficult.

Things like "caches" and "virtual memory" don't make optimizing harder, they make optimizing a whole lot easier! You don't have to worry (much) about where your data is and in what order you access it; the CPU figures it out for you.

--
- Bob
[ -- All my signature links are 404 -- ]

HoHo

Member #4,534

April 2004

Actually I meant really optimized code. Something like in Pixomatic

if it's not so easy then all we can do is hope that they have a hell of a good compiler for it (or at least a huge and useful manual)

Bob

Free Market Evangelist

September 2000

Quote:

if it's not so easy then all we can do is hope that they have a hell of a good compiler for it

Sure, in 10-15 years, just like every other architecture. There is nothing in CELL that makes compiler writers' life easier.

Quote:

(or at least a huge and useful manual)

Likely. Not sure how useful that would be.

Quote:

Actually I meant really optimized code. Something like in Pixomatic

I'll wait till I see it run on the Cell.

--
- Bob
[ -- All my signature links are 404 -- ]

HoHo

Member #4,534

April 2004

What might help the compiler is the fact that it has an in-order core. If the compiler knows exactly how the underlying CPU works, it makes optimizing a bit easier.

With Pixomatic I meant that the guys who programmed it used a lot of extreme optimization tricks to speed it up as much as possible, but quite often the program still behaved a bit differently than they expected because the CPU reordered some instructions and altered delays.

There have been three long articles in Dr. Dobb's Journal about how they developed the Pixomatic engine.

Raf256

Member #3,501

May 2003

Imho my idea of building several versions is good - just make it an option and it's a win-win situation.

By default liballegro will build quickly in 386 mode.

By using ./configure --multi-arch the build will take longer, but all (or all selected) libs will be built.
Then the user might make several versions of his game - if it is on a CD then I do not care, I can throw in even 10 .exe files and 10 linux ELF executables.
Don't want this feature? Just don't use it, easy.

HoHo

Member #4,534

April 2004

You can always statically link your executable and provide ~20 different versions of it. If you want to be smart, you create one program that checks the CPU capabilities and then launches the fastest program file.
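
A launcher like that could be sketched roughly as follows. This is a minimal sketch, assuming GCC or Clang on x86 for the detection part (`__builtin_cpu_supports`); the enum, the function names and the executable file names are all made up for the illustration. Actually starting the chosen file (for example with `execv` or `CreateProcess`) is left out.

```c
#include <assert.h>

/* Feature levels the launcher distinguishes (hypothetical set). */
enum cpu_level { CPU_I386 = 0, CPU_MMX, CPU_SSE, CPU_SSE2, CPU_LEVEL_COUNT };

/* Picks the best level for a given set of feature flags. */
enum cpu_level pick_level(int has_mmx, int has_sse, int has_sse2)
{
   if (has_sse2) return CPU_SSE2;
   if (has_sse)  return CPU_SSE;
   if (has_mmx)  return CPU_MMX;
   return CPU_I386;
}

/* Asks the running CPU what it supports. On other compilers or
   non-x86 targets this falls back to the plain i386 build. */
enum cpu_level detect_level(void)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
   __builtin_cpu_init();
   return pick_level(__builtin_cpu_supports("mmx"),
                     __builtin_cpu_supports("sse"),
                     __builtin_cpu_supports("sse2"));
#else
   return CPU_I386;
#endif
}

/* Maps a level to the executable built for it (file names are invented). */
const char *binary_name(enum cpu_level level)
{
   static const char *const names[CPU_LEVEL_COUNT] = {
      "game.i386.exe", "game.mmx.exe", "game.sse.exe", "game.sse2.exe"
   };

   assert((int)level >= 0 && level < CPU_LEVEL_COUNT);
   return names[level];
}
```

The launcher's `main()` would then call `binary_name(detect_level())` and start that file. Keeping `pick_level()` separate from the detection makes the selection logic easy to check without needing the actual hardware.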

A J

Member #3,025

December 2002

Quote:

my idea of building several versions is good

and who is going to write these versions ?

___________________________
The more you talk, the more AJ is right. - ML

HoHo

Member #4,534

April 2004

He meant building a separate version for every CPU architecture:
i386, i486, pentium3, athlon-xp, pentium4 and all the others you can define with -mtune

Kitty Cat

Member #2,815

October 2002

-march would have a better effect than -mtune. -mtune still produces backwards-compatible code, so using, say, -mtune=pentium2 would still produce i386 compatible code, but tweak it to run better on P2's (which could be just as good for a P3 or Athlon-XP, given the must-be-i386-compatible restrictions). And why build arch-dependent versions if you're only likely to use one?

--
"Do not meddle in the affairs of cats, for they are subtle and will pee on your computer." -- Bruce Graham

Michael Jensen

Member #2,870

October 2002

You could simply make allegro build several dynamically linked libraries, one for 386, 486, p5, etc... and just implement the entire api with function pointers, except allegro_init(), which could set up the api when the program starts and calls it, based on what kind of beast you actually have... only the api would be optimized and not your user code tho, and the DLLs/SOs/whatevers would be huge. -- oh, and it would be a waste of time since allegro is already fast enough...
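
The function-pointer idea could look roughly like this. It is a minimal sketch, not Allegro code: `my_init()`, `my_fill`, `my_clear()` and the two fill routines are hypothetical names, and the CPU family is passed in as a plain number instead of being detected.

```c
#include <assert.h>
#include <stddef.h>

/* Signature shared by every implementation of this one "API" entry. */
typedef void (*fill_fn)(unsigned char *dst, unsigned char value, size_t n);

/* Baseline version: one byte per iteration. */
void fill_i386(unsigned char *dst, unsigned char value, size_t n)
{
   size_t i;
   for (i = 0; i < n; i++)
      dst[i] = value;
}

/* Stand-in for a build tuned for a newer CPU: unrolled by four. */
void fill_p5(unsigned char *dst, unsigned char value, size_t n)
{
   size_t i = 0;
   for (; i + 4 <= n; i += 4) {
      dst[i] = value;
      dst[i + 1] = value;
      dst[i + 2] = value;
      dst[i + 3] = value;
   }
   for (; i < n; i++)   /* leftover bytes */
      dst[i] = value;
}

/* The public entry point is just a pointer, set once at startup. */
fill_fn my_fill = NULL;

/* Plays the role of allegro_init() in the post: choose implementations. */
void my_init(int cpu_family)
{
   my_fill = (cpu_family >= 5) ? fill_p5 : fill_i386;
}

/* Library code calls through the pointer like a normal function. */
void my_clear(unsigned char *buf, size_t n)
{
   assert(my_fill != NULL);   /* calling before my_init() is a bug */
   my_fill(buf, 0, n);
}
```

After `my_init(5)`, every call to `my_fill` or `my_clear` goes to the unrolled version. The cost is one indirect call per API function, and, as the post says, only the library is specialised, not the user's own code.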

HoHo

Member #4,534

April 2004

One reason there is probably no point in creating several versions of the library is that it mostly only benefits newer machines, which are mostly fast enough already.

One thing we could think of is replacing some of the currently asm-based functions with their C counterparts where those are faster (draw_rle_sprite seems to benefit most).

Michael Jensen

Member #2,870

October 2002

Quote:

(draw_rle_sprite seems to benefit most).

but on slower computers, also?

I thought the whole point of the current state of the draw_rle_sprite was for the slower computers, it was optimized for them, as for newer computers, they don't really need RLE sprites as far as speed goes.

HoHo

Member #4,534

April 2004

From my previous test:
statically linked allegro with asm routines, no special compilation flags on my home PC:
draw_rle_sprite() - 474702
Same but c-only:
draw_rle_sprite() - 622629

Difference is way bigger than A J reported in another thread about blit performance

I haven't got a slow computer ATM but during the weekend I could test it on a p200.

There is a slight possibility that I've done something terribly wrong with the tests, so if anyone could confirm the results I would appreciate it.

Michael Jensen

Member #2,870

October 2002

Quote:

From my previous test:
statically linked allegro with asm routines, no special compilation flags on my home PC:
draw_rle_sprite() - 474702
Same but c-only:
draw_rle_sprite() - 622629

but that wasn't on a 486 or anything, and I'm guessing we'd really need a legit 386 to see some results; Anyway, on anything faster than that, that's not a big difference, 200k more on a function that nobody uses except people who write games for 486s... Now, maybe my argument isn't valid. But I feel that draw_rle_sprite is only practical for people forced to use it... -- How many RLE sprites do you really need to draw in a second, anyway?

HoHo

Member #4,534

April 2004

You have a 486? If you do please test it. A normal pentium, pentium2, older amd or whatever else older than my pc should be good too.

In my X-Com engine I needed to draw about 2500 sprites per frame (only for the map), with sizes ranging from 20x30 to 32x64. On a p200 I had to use compiled sprites because they were the only ones fast enough for the game to run at ~30fps. Because of that I had to use all kinds of hacks to get it working (a bigger back buffer was the most painful one). Using RLEs it would have been much easier and it would have saved a lot of memory.

Unfortunately I couldn't finish the project. I had map and soldier rendering and animation working. If I had added day and night cycle, pathfinding and AI I couldn't have spent the whole time rendering and then FPS would have dropped to half of the 30fps-limit.

[edit]

Too bad I can't bump this thread.

I ran some more tests. This time there are no CPU-specific optimizations, it's only C vs asm. I created a little table showing my results:

P4 3.0@3.82 1M cache
512M ram single channel
2.6.11-gentoo-r4
gcc version 3.4.3 20050110

Resolution: 800x600, driver: X
Speed difference in %, >0 means C is better

                           |          32-bit           |          24-bit           |          16-bit           |          8-bit
Function                   |       C      Asm   Diff % |       C      Asm   Diff % |       C      Asm   Diff % |       C      Asm   Diff %
textout()                  |  171212   213753    -19.9 |  155685   263032   -40.81 |  191343   208104    -8.05 |  194637   196086    -0.74
blit() from memory         |  453115   300027    51.02 |  185657   264728   -29.87 |  466279   388827    19.92 |  363734   539238   -32.55
masked_blit() from memory  |  229041   163688    39.93 |  129498    86325    50.01 |  232047   231639     0.18 |  198995   269500   -26.16
draw_sprite()              |  416321   297268    40.05 |  245417   249798    -1.75 |  408864   344216    18.78 |  320361   495931    -35.4
draw_rle_sprite()          |  717764   421222     70.4 |  640240   387080     65.4 |  708827   517834    36.88 |  674117   508662    32.53
draw_compiled_sprite()     |  717493  1661973   -56.83 |  642139  1245370   -48.44 |  710376  1710608   -58.47 |  663957  2035349   -67.38
draw_trans_sprite()        |  176353    75453   133.73 |  124666    44480   180.27 |  164053    46482   252.94 |  353072   464093   -23.92
draw_trans_rle_sprite()    |  192268    88702   116.76 |  171235    47175   262.98 |  199796   209313    -4.55 |  432254   344263    25.56
draw_lit_sprite()          |  182740   153953     18.7 |  133052   140253    -5.13 |  178109   148202    20.18 |  325448   268232    21.33
draw_lit_rle_sprite()      |  204653    91639   123.33 |  191327    43380   341.05 |  222652    46045   383.55 |  622470   548296    13.53

As can be seen c is mostly quite a lot faster than asm.
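
The "speed difference" columns are consistent with the ratio of the two counts minus one, expressed in percent, with a higher count being the faster result. The helper below is a small sketch of that arithmetic; the function name is made up, and reading the counts as "operations completed in a fixed time" is an assumption based on the ">0 means C is better" note.

```c
#include <assert.h>

/* Percentage by which the C count differs from the asm count.
   Positive: the C build got more operations done (C is faster).
   Negative: the asm build got more done. */
double speed_diff_percent(long c_count, long asm_count)
{
   assert(asm_count > 0);   /* a zero count would mean the test did not run */
   return ((double)c_count / (double)asm_count - 1.0) * 100.0;
}
```

For example, the 32-bit blit() row gives `speed_diff_percent(453115, 300027)`, about 51.02, and the 32-bit textout() row gives `speed_diff_percent(171212, 213753)`, about -19.9, matching the table.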

Please someone with older generation computers/compilers run the tests too and post your results here.

I'm seriously thinking of modifying the allegro test so that the user only has to execute it and it runs most of the tests (or only selected ones, if I get to it), so that it wouldn't be such a pain to test different settings.

[edit2]

Added 8bit result

A J

Member #3,025

December 2002

bump ;>

___________________________
The more you talk, the more AJ is right. - ML