rectfill() vs AJ's MMX/SSE hax'ing

A J

Member #3,025

December 2002

Using MMX/SSE intrinsics in my C++ functions, I have written some rectfill replacements.

Here are the latest results:

allegro's rectfill = 157,448 ticks
x86 32byte unrolling = 202,790 ticks
mmx (ptr arithmetic, no unrolling) = 171,896 ticks
mmx (ptr arithmetic, unrolling*4) = 140,567 ticks
mmx (array indexing, unrolling*4) = 150,706 ticks
sse2 (array indexing, unrolling*2) = 139,592 ticks
sse2 (array indexing, unrolling*4) = 139,617 ticks
sse2 (array indexing, unrolling*2, no-pollute-cache instruction) = 55,951 ticks
sse2 (ptr arithmetic, no unrolling, no-pollute-cache instruction) = 55,506 ticks
allegro's rectfill = 157,230 ticks (again, to confirm the others don't have a cache advantage)

Test machine: AMD64 3000+, WinXP SP1, MSVC 7.1.
The pixel colour is random & 0xffffff (32bit _RGB).

MMX/SSE intrinsics aren't rocket science, kids.. don't be scared.. give it a try!

I'll post some code soon, after I clean it up a little.

For all the "can't use SSE2" statements that will no doubt follow: can I at least get a "yeah, if it's not going to affect current performance, let's allow _aligned_malloc() to replace malloc(), perhaps through a define or something"?
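
A minimal sketch of what such a define could look like. The `USE_ALIGNED_ALLOC` switch and the `BMP_ALLOC`/`BMP_FREE` macro names are hypothetical and do not exist in Allegro. `_aligned_malloc()` is the Windows C runtime call; `posix_memalign()` is assumed here as the stand-in on other platforms.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <stdlib.h>   // posix_memalign on POSIX systems

// Hypothetical build switch: set to 0 to fall back to plain malloc().
#ifndef USE_ALIGNED_ALLOC
#define USE_ALIGNED_ALLOC 1
#endif

#if USE_ALIGNED_ALLOC && defined(_WIN32)
  #include <malloc.h>
  #define BMP_ALLOC(size) _aligned_malloc((size), 16)
  #define BMP_FREE(ptr)   _aligned_free(ptr)
#elif USE_ALIGNED_ALLOC
  // Returns a 16-byte aligned block, or nullptr on failure.
  inline void* bmp_alloc_posix(std::size_t size) {
      void* p = nullptr;
      return posix_memalign(&p, 16, size) == 0 ? p : nullptr;
  }
  #define BMP_ALLOC(size) bmp_alloc_posix(size)
  #define BMP_FREE(ptr)   std::free(ptr)
#else
  #define BMP_ALLOC(size) std::malloc(size)
  #define BMP_FREE(ptr)   std::free(ptr)
#endif

// True when p sits on a 16-byte boundary, as aligned SSE2 stores require.
inline bool is_aligned16(const void* p) {
    return (reinterpret_cast<std::uintptr_t>(p) & 15u) == 0;
}
```

Memory obtained from `_aligned_malloc()` has to be released with `_aligned_free()`, which is why the free side needs its own macro as well.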

___________________________
The more you talk, the more AJ is right. - ML

HoHo

Member #4,534

April 2004

Looks interesting. If you post the code I'll try it on P4/Gentoo if I have enough time. IIRC, the P4 SSE unit should be quite effective, and it will probably benefit from this more than AMD.

Have you tried to code some draw_(rle)_sprite functions too?

__________
In theory, there is no difference between theory and practice. But, in practice, there is - Jan L.A. van de Snepscheut
MMORPG's...Many Men Online Role Playing Girls - Radagar
"Is Java REALLY slower? Does STL really bloat your exes? Find out with your friendly host, HoHo, and his benchmarking machine!" - Jakub Wasilewski

Thomas Harte

Member #33

April 2000

Don't forget that rectfill is accelerated under DirectX if the target is a video surface. If you want to optimise the C only blit that'd be a super service too, as it seems to always do individual bitmaps even in places where non-vector C could do better - e.g. blitting DWORD aligned 8bpp surfaces. Are you using the GCC extensions for arbitrary vector units, or something else?

[My site] [Tetrominoes]

HoHo

Member #4,534

April 2004

He is using intrinsics that should be portable between at least GCC, ICC and MSVC, meaning they should compile and work on all of these compilers.

Of course, if the compiler's target platform doesn't support those instructions* then it gives a compile-time error, but that can be avoided with the preprocessor.

*) That means for SSE2, minimum target platform is P4 or A64. It won't compile targeting anything lower.

It would be really nice if it could be done so that when Allegro starts up it detects what functionality the CPU supports and modifies the vtables accordingly. Any volunteers?
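
The startup detection described above could be sketched roughly as follows. `FillVtable`, `fill_plain`, `fill_sse2` and `install_best_fill` are hypothetical names for illustration, not Allegro's real vtable or API. CPU detection here assumes GCC or Clang on x86, where `__builtin_cpu_supports` is available; other builds simply stay on the plain routine.

```cpp
#include <cstddef>
#include <cstdint>

// Only compile the SSE2 routine when the build target allows it.
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
  #include <emmintrin.h>   // SSE2 intrinsics
  #define BUILD_HAS_SSE2 1
#else
  #define BUILD_HAS_SSE2 0
#endif

// Hypothetical stand-in for one vtable slot.
typedef void (*fill_fn)(std::uint32_t* dst, std::size_t count, std::uint32_t colour);
struct FillVtable { fill_fn fill; };

void fill_plain(std::uint32_t* dst, std::size_t count, std::uint32_t colour) {
    for (std::size_t i = 0; i < count; ++i) dst[i] = colour;
}

#if BUILD_HAS_SSE2
void fill_sse2(std::uint32_t* dst, std::size_t count, std::uint32_t colour) {
    const __m128i k = _mm_set1_epi32(static_cast<int>(colour));
    std::size_t i = 0;
    for (; i + 4 <= count; i += 4)   // unaligned stores keep this sketch simple
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + i), k);
    for (; i < count; ++i) dst[i] = colour;   // leftover pixels
}
#endif

bool cpu_has_sse2() {
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("sse2") != 0;
#elif defined(_M_X64)
    return true;    // SSE2 is part of the x86-64 baseline
#else
    return false;   // unknown CPU: stay on the plain routine
#endif
}

// Called once at startup: patch the table with the best routine available.
void install_best_fill(FillVtable& vt) {
    vt.fill = fill_plain;
#if BUILD_HAS_SSE2
    if (cpu_has_sse2()) vt.fill = fill_sse2;
#endif
}
```

Callers only ever go through `vt.fill`, so the choice of routine costs one check at startup rather than one per call.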

A J

Member #3,025

December 2002

intrinsics are for mem->mem stuff only... possibly for mem->sys.

Yes, intrinsics work on GCC, ICC, MSVC.
SSE2 is only available on A64+P4+PM.

Yes, at some point allegro should have code to autodetect your CPU's features and use the best routines; however this will be a while away yet.

If you look at the timing table (top post) you can see that even MMX (with unrolling*4) had a slight advantage over the current rectfill, so a fallback to MMX for non-SSE2 CPUs is possible. Anyone still using a CPU without MMX needs to have a good think about why their navel hasn't unravelled yet.

___________________________
The more you talk, the more AJ is right. - ML

Thomas Harte

Member #33

April 2000

I would have just used the GCC Vector Extensions for portability outside of the Intel domain combined with compatibility with the most popular Allegro target - as even for Windows GCC is a required part of the build chain. I guess I'll just see what AJ produces then attempt a GCC conversion.

[My site] [Tetrominoes]

Bob

Free Market Evangelist

September 2000

Does your function do the same thing as Allegro's? Allegro has a big overhead per-line to set up line pointers, banks, page locks, etc.

--
- Bob
[ -- All my signature links are 404 -- ]

HoHo

Member #4,534

April 2004

Quote:

I would have just used the GCC Vector Extensions

Tough call.
Multiplatform compatibility within a single compiler and easier programmability with GCC vector extensions, vs multicompiler support and slightly more difficult programmability (no automatic fallback) with intrinsics.
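
For comparison, a minimal sketch of the GCC vector extension route. `v4u32` and `fill_vec` are hypothetical names. The `vector_size` attribute is a GCC extension (Clang accepts it too); on other compilers this sketch just runs the scalar loop, which is an assumption made here to keep it buildable everywhere.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

#if defined(__GNUC__)
// Four 32bit pixels in one 16-byte vector; the compiler picks the
// instructions (SSE2, AltiVec, plain scalar code) for the target CPU.
typedef std::uint32_t v4u32 __attribute__((vector_size(16)));
#endif

void fill_vec(std::uint32_t* dst, std::size_t count, std::uint32_t colour) {
    std::size_t i = 0;
#if defined(__GNUC__)
    const v4u32 k = { colour, colour, colour, colour };
    for (; i + 4 <= count; i += 4)
        std::memcpy(dst + i, &k, sizeof k);   // safe for unaligned dst
#endif
    for (; i < count; ++i) dst[i] = colour;   // leftovers, or everything on non-GCC builds
}
```

The source stays the same for every CPU the compiler targets, but nothing here gives access to a cache-bypassing store the way the intrinsics do.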

A J

Member #3,025

December 2002

I haven't cleaned up the code yet to show you, but basically it's:

__m128i k = [a bunch of intrinsics to pack 4 pixels into a __m128i];
__m128i* p = (__m128i*)bitmap->dat;   // must be 16byte aligned
for ( int i=(w*h)/4 ; i ; --i )       // each store writes 4 pixels
  _mm_stream_si128( p++, k );         // the no-pollute-cache store
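
The loop above can be fleshed out into a self-contained sketch. It assumes the "no-pollute-cache instruction" is the SSE2 non-temporal store `_mm_stream_si128`, and that the pixel packing is a plain `_mm_set1_epi32` broadcast; `fill_stream` is a hypothetical name, and the scalar branch exists only so the sketch builds on targets without SSE2.

```cpp
#include <cstddef>
#include <cstdint>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
  #include <emmintrin.h>   // SSE2 intrinsics
  #define FILL_HAVE_SSE2 1
#else
  #define FILL_HAVE_SSE2 0
#endif

// Fills `count` 32bit pixels with `colour`.
// Preconditions (assumed, not checked): dst is 16-byte aligned and
// count is a multiple of 4.
void fill_stream(std::uint32_t* dst, std::size_t count, std::uint32_t colour) {
#if FILL_HAVE_SSE2
    const __m128i k = _mm_set1_epi32(static_cast<int>(colour));  // 4 copies of the pixel
    __m128i* p = reinterpret_cast<__m128i*>(dst);
    for (std::size_t i = count / 4; i; --i)
        _mm_stream_si128(p++, k);   // non-temporal store, bypasses the cache
    _mm_sfence();                   // make the streamed stores visible before returning
#else
    for (std::size_t i = 0; i < count; ++i) dst[i] = colour;
#endif
}
```

Because each store covers four pixels, the loop runs `count / 4` times rather than once per pixel.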

Bob, I developed each one somewhat separately (they use different loop counts due to unrolling, 64 and 128bit registers, etc.), so I was also concerned that I was only doing half the work or something like that. I added a test to each one to dump the pointer address of the last iteration, and as you can see below, each one returned the same address (confirming the correct number of iterations).

&p=[33554464] mmx (ptr arithmetic, no unrolling) = 172,524 ticks
&p=[33554464] mmx (ptr arithmetic, unrolling*4) = 139,907 ticks
&p=[33554464] mmx (array indexing, unrolling*4) = 138,849 ticks
&p=[33554464] sse2 (array indexing, unrolling*2) = 138,912 ticks
&p=[33554464] sse2 (array indexing, unrolling*4) = 140,599 ticks
&p=[33554464] sse2 (array indexing, unrolling*2, no-pollute-cache instruction) = 55,890 ticks
&p=[33554464] sse2 (ptr arithmetic, no unrolling, no-pollute-cache instruction) = 55,202 ticks

___________________________
The more you talk, the more AJ is right. - ML

Thomas Harte

Member #33

April 2000

Quote:

Tough call.

Not if you have a PowerPC G4 and a realistic sense of exactly how many other people have the same thing!

[My site] [Tetrominoes]

HoHo

Member #4,534

April 2004

Quote:

Not if you have a PowerPC G4

But on the other hand this is a discontinued HW platform unless Cell and its kind start replacing PC's.

Thomas Harte

Member #33

April 2000

Quote:

But on the other hand this is a discontinued HW platform unless Cell and its kind start replacing PC's.

This isn't true in the strict sense as iBooks and Mac Minis are still being manufactured and in any case isn't relevant to the debate. Even once the G4s cease - rumoured to be sometime before June - the question for Allegro is principally "has anyone submitted the code?" and secondarily "is the code going to be useful for a meaningful number of users?" which incurs an evaluation of the likely users of products that incorporate Allegro. In terms of people likely to download Allegro made games, it's almost certain that the number of G4/G5 owners equals or outnumbers the number of Linux users so PowerPC support code should be welcome for the next few years even if the second test is applied. Following your strict wording we should phase out support for anything without SSE3 by year end.

In any case, the point I made about the decision being easy was clearly "I have a PowerPC G4, not many people do, I should develop in a way that works not just on my target but on most other Allegro supported systems and that can be achieved with GCC Vector Extensions". If I thought "I have a PowerPC G4 and I'd like to improve things for PowerPC G4 and G5 users with the modern Mac OS" then there are OS X specific vector libraries I could use with even more functionality.

[My site] [Tetrominoes]

Peter Hull

Member #1,136

March 2001

Good work! AJ, how much faster is it in real terms? I see it's 3x fewer ticks, but is one tick 1/3000000000 second on a 3GHz machine or what?

Also does it depend on the size of rect that you're filling?

Pete

Thomas Harte

Member #33

April 2000

Out of interest, how would one supply a replacement rectfill in this manner? It isn't platform dependent and it also isn't an assembly file, so...

[My site] [Tetrominoes]

Peter Hull

Member #1,136

March 2001

If rectfill is part of the vtable (it is, isn't it?) then you'd need to modify the vtable at runtime (AllegroGL does this).
Otherwise, if you define rectfill in your program, the linker won't look for it in the Allegro library.

Pete

A J

Member #3,025

December 2002

Quote:

Good work! AJ, how much faster is it in real terms?

Which fake terms did i use ?

Quote:

Also does it depend on the size of rect that you're filling?

Yes. Every change in any code will affect performance. All the components work at different speeds. This has been the case since valves. PCs are async.

My personal opinion, without having tested it: I would guess that the performance curve would be quite different to an x86 implementation. The limitations of SSE2 are 16byte aligned data and the use of 128bit registers (4 x 32bit pixels at a time, SIMD), which means only rects whose width is a multiple of 4 pixels go through the SSE2 path cleanly, which is most of them anyway! Rects of other sizes will cost a lot more, as the leftover pixels need to be done with x86 code, so odd sized rects may end up costing more time using the SSE2 code. I'd also guess that small rects would not benefit much, due to the large overhead of setting up the conditions for the code to run; however, as Bob said, the current rectfill also has these overheads.
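
To make the alignment and leftover cost concrete, here is a small sketch that splits one row into the part that needs scalar code and the part that can use 128bit stores. `RowSplit` and `split_row` are hypothetical names, and the sketch assumes 32bit pixels whose addresses are at least 4-byte aligned.

```cpp
#include <cstddef>
#include <cstdint>

// Pixel counts for one row of a rect.
struct RowSplit {
    std::size_t head;   // scalar pixels before the first 16-byte boundary
    std::size_t body;   // pixels covered by aligned 128bit stores (multiple of 4)
    std::size_t tail;   // scalar pixels left over at the end
};

RowSplit split_row(const std::uint32_t* row, std::size_t width) {
    const std::size_t misalign =
        static_cast<std::size_t>(reinterpret_cast<std::uintptr_t>(row) & 15u);
    std::size_t head = misalign ? (16 - misalign) / 4 : 0;
    if (head > width) head = width;           // very narrow rect: all scalar
    const std::size_t rest = width - head;
    return { head, rest & ~static_cast<std::size_t>(3), rest & static_cast<std::size_t>(3) };
}
```

A 10 pixel row starting on a 16-byte boundary splits into 0 + 8 + 2, while the same row starting one pixel later splits into 3 + 4 + 3, so more than half of it falls back to scalar code. That is the effect guessed at above for small and odd sized rects.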

___________________________
The more you talk, the more AJ is right. - ML

BAF

Member #2,981

December 2002

Quote:

Which fake terms did i use ?

Ticks.

BAF.zone | SantaHack!

A J

Member #3,025

December 2002

It's a measurement of time... as every computer is different, quoting the exact millisecond (or microsecond) values for my specific hardware doesn't tell people too much, and can be somewhat misleading. Best to just look at the numbers and compare the different methods' results.

I did some more tests just now.. on a PentiumM-1.6 GHz
and the results are quite different..

allegro's rectfill = 191,421 ticks
x86 32byte unrolling = 251,775 ticks
mmx (ptr arithmetic, no unrolling) = 263,745 ticks
mmx (ptr arithmetic, unrolling*4) = 255,407 ticks
mmx (array indexing, unrolling*4) = 257,207 ticks
sse2 (array indexing, unrolling*2) = 256,300 ticks
sse2 (array indexing, unrolling*4) = 252,030 ticks
sse2 (array indexing, unrolling*2, no-pollute-cache instruction) = 67,090 ticks
sse2 (ptr arithmetic, no unrolling, no-pollute-cache instruction) = 65,969 ticks
allegro's rectfill = 195,308 ticks (again, to confirm the others don't have a cache advantage)

Don't compare the tick values to the other chart; they are from different hardware.

___________________________
The more you talk, the more AJ is right. - ML

BAF

Member #2,981

December 2002

Yes, but if a tick is 1/100,000,000th of a second then it's not worth it for that "little" of a speed up.

BAF.zone | SantaHack!

A J

Member #3,025

December 2002

67,000 ticks vs 191,000... that's almost 3x faster!

If your app spends 30% of its time in rectfill
then you have just saved 20% of your CPU's time.. that's huge!

for one function getting some SSE2 code, you suddenly have another 20% of your CPU available for other things.
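
The 30% and 20% figures above follow from the usual speedup arithmetic: if a fraction f of the run time is spent in a function that becomes s times faster, the time saved is f * (1 - 1/s). A small sketch, with hypothetical helper names:

```cpp
// Fraction of total run time saved when a function that takes
// `fraction_in_function` of the time becomes `speedup` times faster.
double time_saved_fraction(double fraction_in_function, double speedup) {
    return fraction_in_function * (1.0 - 1.0 / speedup);
}

// Resulting speedup of the whole program (Amdahl's law).
double overall_speedup(double fraction_in_function, double speedup) {
    return 1.0 / ((1.0 - fraction_in_function) + fraction_in_function / speedup);
}
```

With f = 0.30 and s = 3 this gives 0.30 * (1 - 1/3) = 0.20, i.e. 20% of the CPU time freed and the whole program running 1.25x faster. With the measured PentiumM ratio of 191,421 / 67,090 (about 2.85x) the saving comes out at roughly 19.5%.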

I don't understand how you can't see that?

Give me an example of how you think it could be a "little" speed up?

___________________________
The more you talk, the more AJ is right. - ML

BAF

Member #2,981

December 2002

I see it, but all the work spent optimizing the function isn't worth it if you use rectfill a couple times and only save 200msec of cpu time.

BAF.zone | SantaHack!

A J

Member #3,025

December 2002

I don't think the average use of rectfill is just a couple of times.

For my app, I use it a lot.
I would guess there are plenty of other people that also use it more than a couple of times.

You seem to have some idea about what constitutes wasted time; do you think I have wasted my time? What should I have done with it?

Also, this is just the tip. Having done this rectfill SSE coding (which amounts to about 20 lines of code, by the way), I have also found opportunities to accelerate blit and masked blit, and I think I can get 2x from masked blit, possibly more.

Let's presume that by next week (the rectfill code took 3 days to research and write) I have rectfill, blit and masked blit working at over 2x; can you then tell me it's not worth the effort?

___________________________
The more you talk, the more AJ is right. - ML

BAF

Member #2,981

December 2002

I'm not saying it was not, I was just saying that depending on how it is used, it may or may not be worth it. I personally don't use rectfill much, so a blit speedup would be way more useful to me.

BAF.zone | SantaHack!

A J

Member #3,025

December 2002

I've been working on blit.
Some results:

allegro's blit = 51,569 ticks
SSE2, (no unrolling, no-pollute-cache instruction) = 29,377 ticks
SSE2, (unrolling*4, no-pollute-cache instruction) = 28,492 ticks

and as i may have already mentioned, there is hope for masked blit also.

___________________________
The more you talk, the more AJ is right. - ML