rectfill() vs AJ's MMX/SSE hax'ing
A J
Member #3,025
December 2002
Using MMX/SSE intrinsics in my C++ functions, I have written some rectfill replacements. Here are the latest results:

allegro's rectfill = 157,448 ticks

Test machine: AMD64 3000+, WinXP SP1, MSVC 7.1.

MMX/SSE intrinsics aren't rocket science, kids. Don't be scared, give it a try! I'll post some code soon, after I clean it up a little.

For all the "can't use SSE2" statements that will no doubt follow: can I at least get a "yeah, if it's not going to affect current performance, let's allow _aligned_malloc() to replace malloc(), perhaps through a define or something"?
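The request above, swapping malloc() for an aligned allocator behind a define, could look roughly like the following. This is only a sketch: `DEMO_USE_ALIGNED_ALLOC`, `demo_bitmap_alloc` and `demo_bitmap_free` are hypothetical names, not existing Allegro options or functions, and the non-MSVC branch assumes a POSIX system that provides `posix_memalign`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdlib.h>   // malloc, free, posix_memalign (POSIX)
#if defined(_MSC_VER)
#include <malloc.h>   // _aligned_malloc, _aligned_free
#endif

// Hypothetical build switch: set to 0 to fall back to plain malloc().
#ifndef DEMO_USE_ALIGNED_ALLOC
#define DEMO_USE_ALIGNED_ALLOC 1
#endif

// True if p sits on a 16-byte boundary, as aligned SSE2 loads/stores require.
inline bool demo_is_aligned16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Allocate pixel storage; 16-byte aligned when the switch is on.
inline void* demo_bitmap_alloc(std::size_t bytes) {
#if DEMO_USE_ALIGNED_ALLOC && defined(_MSC_VER)
    void* p = _aligned_malloc(bytes, 16);
#elif DEMO_USE_ALIGNED_ALLOC
    void* p = nullptr;
    if (posix_memalign(&p, 16, bytes) != 0) p = nullptr;
#else
    void* p = malloc(bytes);
#endif
#if DEMO_USE_ALIGNED_ALLOC
    assert(p == nullptr || demo_is_aligned16(p));
#endif
    return p;
}

// Memory from _aligned_malloc must be released with _aligned_free.
inline void demo_bitmap_free(void* p) {
#if DEMO_USE_ALIGNED_ALLOC && defined(_MSC_VER)
    _aligned_free(p);
#else
    free(p);
#endif
}
```

The free routine has to be switched together with the allocator, which is the main cost of such a define: every place that releases bitmap memory must go through the matching wrapper.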
HoHo
Member #4,534
April 2004
Looks interesting. If you post the code I'll try it on P4/Gentoo if I have enough time. IIRC, the P4's SSE unit should be quite effective, and it will probably benefit from it more than AMD does. Have you tried to code some draw_(rle)_sprite functions too?
Thomas Harte
Member #33
April 2000
Don't forget that rectfill is accelerated under DirectX if the target is a video surface. If you want to optimise the C-only blit, that'd be a super service too, as it seems to always do individual bitmaps even in places where non-vector C could do better - e.g. blitting DWORD-aligned 8bpp surfaces. Are you using the GCC extensions for arbitrary vector units, or something else?
HoHo
Member #4,534
April 2004
He is using intrinsics that should be portable between at least GCC, ICC and MSVC; that is, they should compile and work on all of these compilers. Of course, if the compiler's target platform doesn't support those instructions*, it gives a compile-time error, but that can be avoided with the preprocessor.

*) For SSE2, that means the minimum target platform is P4 or A64. It won't compile when targeting anything lower.

It would be really nice if Allegro detected at startup what functionality the CPU supports and modified the vtables accordingly. Any volunteers?
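A minimal sketch of the preprocessor guard mentioned above, assuming the usual predefined macros (`__SSE2__` on GCC/Clang/ICC, `_M_X64` or `_M_IX86_FP >= 2` on MSVC). `DEMO_HAVE_SSE2` and `demo_splat4` are names made up for this example. When the target lacks SSE2, the intrinsic header is never included and a plain C++ loop is compiled instead, so the build does not fail.

```cpp
#include <cassert>
#include <cstdint>

// Only pull in the SSE2 intrinsics when the compiler is targeting SSE2.
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define DEMO_HAVE_SSE2 1
#else
#define DEMO_HAVE_SSE2 0
#endif

// Write the same 32-bit pixel into four consecutive slots.
void demo_splat4(std::uint32_t color, std::uint32_t out[4]) {
    assert(out != nullptr);
#if DEMO_HAVE_SSE2
    // One unaligned 128-bit store covers all four pixels.
    const __m128i k = _mm_set1_epi32(static_cast<int>(color));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), k);
#else
    // Portable fallback for targets without SSE2.
    for (int i = 0; i < 4; ++i) out[i] = color;
#endif
}
```

This only decides at compile time; a binary built with the SSE2 path still needs a run-time check before it is used on an older CPU.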
A J
Member #3,025
December 2002
Intrinsics are for mem->mem stuff only... possibly for mem->sys. Yes, intrinsics work on GCC, ICC and MSVC. Yes, at some point Allegro should have code to autodetect your CPU's features and use the best routines; however, this will be a while away yet. If you look at the timing table (top post) you can see that even MMX had a slight advantage over the current rectfill, so a fallback to MMX for non-SSE2 CPUs is possible. Anyone still using a CPU without MMX needs to have a good think about why their navel hasn't unravelled yet.
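The autodetection with an SSE2 -> MMX -> plain C fallback described above might be sketched as follows. It assumes GCC or Clang on x86 (using `__builtin_cpu_supports`) or MSVC on x86 (using `__cpuid`, where CPUID leaf 1 reports MMX in EDX bit 23 and SSE2 in EDX bit 26); on any other platform it simply reports the plain C path. `FillBackend`, `detect_fill_backend` and `backend_name` are hypothetical names, not Allegro API.

```cpp
#include <cassert>
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#include <intrin.h>   // __cpuid
#endif

// Which fill routine family the running CPU can execute.
enum class FillBackend { PlainC, MMX, SSE2 };

inline FillBackend detect_fill_backend() {
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__i386__) || defined(__x86_64__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("sse2")) return FillBackend::SSE2;
    if (__builtin_cpu_supports("mmx"))  return FillBackend::MMX;
    return FillBackend::PlainC;
#elif defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
    int regs[4] = {0, 0, 0, 0};           // EAX, EBX, ECX, EDX
    __cpuid(regs, 1);                      // leaf 1: feature flags
    if (regs[3] & (1 << 26)) return FillBackend::SSE2;
    if (regs[3] & (1 << 23)) return FillBackend::MMX;
    return FillBackend::PlainC;
#else
    return FillBackend::PlainC;            // unknown platform: stay safe
#endif
}

// Human-readable label, e.g. for a startup log line.
inline const char* backend_name(FillBackend b) {
    switch (b) {
        case FillBackend::SSE2:   return "sse2";
        case FillBackend::MMX:    return "mmx";
        case FillBackend::PlainC: return "c";
    }
    assert(false && "unknown backend");
    return "c";
}
```

The result would be queried once at startup and used to choose which routines get installed.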
Thomas Harte
Member #33
April 2000
I would have just used the GCC Vector Extensions, for portability outside of the Intel domain combined with compatibility with the most popular Allegro target - as even for Windows, GCC is a required part of the build chain. I guess I'll just see what AJ produces, then attempt a GCC conversion.
Bob
Free Market Evangelist
September 2000
Does your function do the same thing as Allegro's? Allegro has a big overhead per-line to set up line pointers, banks, page locks, etc.
HoHo
Member #4,534
April 2004
Quote: I would have just used the GCC Vector Extensions

Tough call.
A J
Member #3,025
December 2002
I haven't cleaned up the code yet to show you, but basically it's:

    __m128i k = [a bunch of intrinsics to pack 4 pixels into a __m128i];
    __m128i* p = (a cast)bitmap->dat;
    for ( int i=w*h ; i ; --i ) _mm_streaming_set_instruction( p++, k );

Bob, I developed each one somewhat separately (they use different loop amounts due to unrolling, 64 and 128bit registers etc), so I was also concerned that I was only doing 1/2 or something like that. So I added a test to each one to also dump the pointer address of the last iteration, and as you can see below, each one returned the same address (confirming the correct number of iterations).

&p=[33554464] mmx (ptr arithmetic, no unrolling) = 172,524 ticks
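The pseudocode above can be filled in with real SSE2 intrinsics along these lines. This is a sketch under stated assumptions, not AJ's actual code: it assumes a contiguous 32bpp buffer with no per-line padding, a 16-byte aligned start address, and a pixel count divisible by 4. `_mm_set1_epi32` stands in for the "pack 4 pixels" step and `_mm_stream_si128` for the streaming store. Note that each 128-bit store covers four pixels, so the loop runs `w*h/4` times here. `demo_fill32_stream` and `DEMO_HAVE_SSE2` are made-up names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define DEMO_HAVE_SSE2 1
#else
#define DEMO_HAVE_SSE2 0
#endif

// Fill a whole w*h 32bpp buffer with one colour.
void demo_fill32_stream(std::uint32_t* pixels, std::size_t w, std::size_t h,
                        std::uint32_t color) {
    const std::size_t count = w * h;
    // Preconditions of this sketch: aligned start, whole groups of 4 pixels.
    assert(reinterpret_cast<std::uintptr_t>(pixels) % 16 == 0);
    assert(count % 4 == 0);
#if DEMO_HAVE_SSE2
    const __m128i k = _mm_set1_epi32(static_cast<int>(color)); // 4 copies of color
    __m128i* p = reinterpret_cast<__m128i*>(pixels);
    for (std::size_t i = count / 4; i != 0; --i) {
        _mm_stream_si128(p++, k);   // non-temporal store, bypasses the cache
    }
    _mm_sfence();                   // make the streamed stores globally visible
#else
    // Fallback so the sketch still builds on non-SSE2 targets.
    for (std::size_t i = 0; i < count; ++i) pixels[i] = color;
#endif
}
```

Streaming stores avoid pulling the destination into the cache, which tends to suit large fills that will not be read back immediately.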
Thomas Harte
Member #33
April 2000
Quote: Tough call.

Not if you have a PowerPC G4 and a realistic sense of exactly how many other people have the same thing!
HoHo
Member #4,534
April 2004
Quote: Not if you have a PowerPC G4

But on the other hand, this is a discontinued HW platform unless Cell and its kind start replacing PCs.
Thomas Harte
Member #33
April 2000
Quote: But on the other hand this is a discontinued HW platform unless Cell and its kind start replacing PC's.

This isn't true in the strict sense, as iBooks and Mac Minis are still being manufactured, and in any case it isn't relevant to the debate. Even once the G4s cease - rumoured to be sometime before June - the question for Allegro is principally "has anyone submitted the code?" and secondarily "is the code going to be useful for a meaningful number of users?", which incurs an evaluation of the likely users of products that incorporate Allegro. In terms of people likely to download Allegro-made games, it's almost certain that the number of G4/G5 owners equals or outnumbers the number of Linux users, so PowerPC support code should be welcome for the next few years even if the second test is applied. Following your strict wording, we should phase out support for anything without SSE3 by year end.

In any case, the point I made about the decision being easy was clearly "I have a PowerPC G4, not many people do, I should develop in a way that works not just on my target but on most other Allegro supported systems, and that can be achieved with GCC Vector Extensions". If I thought "I have a PowerPC G4 and I'd like to improve things for PowerPC G4 and G5 users with the modern Mac OS" then there are OS X specific vector libraries I could use with even more functionality.
Peter Hull
Member #1,136
March 2001
Good work! AJ, how much faster is it in real terms? I see it's 3x fewer ticks, but is one tick 1/3000000000 second on a 3GHz machine or what? Also, does it depend on the size of rect that you're filling?

Pete
Thomas Harte
Member #33
April 2000
Out of interest, how would one supply a replacement rectfill in this manner? It isn't platform dependent and it also isn't an assembly file, so...
Peter Hull
Member #1,136
March 2001
If rectfill is part of the vtable (it is, isn't it?) then you'd need to modify the vtable at runtime (AllegroGL does this).

Pete
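Patching a vtable at run time, as suggested above, could be sketched like this. Everything here is hypothetical: `DemoBitmap`, `DemoVtable` and the function names are stand-ins invented for illustration and do not reproduce Allegro's real vtable layout. The "fast" routine simply forwards to the C one so that the example stays portable; in practice it would be the SIMD version.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical bitmap: w*h 32-bit pixels stored row by row.
struct DemoBitmap {
    int w;
    int h;
    std::vector<std::uint32_t> pixels;
};

using RectfillFn = void (*)(DemoBitmap*, int x1, int y1, int x2, int y2,
                            std::uint32_t color);

// Hypothetical table of drawing routines that callers dispatch through.
struct DemoVtable {
    RectfillFn rectfill;
};

// Reference implementation; corner coordinates are inclusive.
void demo_rectfill_c(DemoBitmap* bmp, int x1, int y1, int x2, int y2,
                     std::uint32_t color) {
    assert(bmp != nullptr);
    for (int y = y1; y <= y2; ++y)
        for (int x = x1; x <= x2; ++x)
            bmp->pixels[static_cast<std::size_t>(y) * bmp->w + x] = color;
}

// Stand-in for an accelerated routine with the same signature.
void demo_rectfill_fast(DemoBitmap* bmp, int x1, int y1, int x2, int y2,
                        std::uint32_t color) {
    demo_rectfill_c(bmp, x1, y1, x2, y2, color);
}

// Swap the table entry only when the CPU check says the fast path is usable.
void demo_install_fast_rectfill(DemoVtable* vt, bool cpu_has_simd) {
    assert(vt != nullptr);
    if (cpu_has_simd) vt->rectfill = demo_rectfill_fast;
}
```

Because callers only ever go through the table entry, no calling code has to change when the entry is replaced.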
A J
Member #3,025
December 2002
Quote: Good work! AJ, how much faster is it in real terms?

Which fake terms did I use?

Quote: Also does it depend on the size of rect that you're filling?

Yes. Every change in any code will affect performance. All the components work at different speeds. This has been the case since valves. PCs are async.

As to my personal opinion about all this, without having done any tests on it, I would guess that the performance curve would be quite different to an x86 implementation. The limitations of SSE2 are 16-byte aligned data and the use of 128-bit registers (4 x 32-bit pixels at a time (SIMD)), which means only rects divisible by 4 will work, which is most of them anyway! Rects of different sizes will cost a lot more, as the left-over bit will need to be done with x86 code. Which means odd-sized rects may end up costing more time using the SSE2 code.

I'd also guess that small rects would not benefit much, due to the large overhead of setting up the conditions for the code to run; however, as Bob said, the current rectfill also has these overheads.
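The "left-over bit" handling described above is commonly done by splitting each row into a scalar head (until the pointer reaches a 16-byte boundary), an aligned SSE2 body, and a scalar tail. A sketch, assuming 32bpp pixels; `demo_fill_row32` is a made-up name and this is not taken from AJ's code:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define DEMO_HAVE_SSE2 1
#else
#define DEMO_HAVE_SSE2 0
#endif

// Fill n 32-bit pixels starting at dst, for any n and any 4-byte aligned dst.
void demo_fill_row32(std::uint32_t* dst, std::size_t n, std::uint32_t color) {
    assert(reinterpret_cast<std::uintptr_t>(dst) % 4 == 0);
    std::size_t i = 0;

    // Head: plain stores until dst+i sits on a 16-byte boundary.
    while (i < n && reinterpret_cast<std::uintptr_t>(dst + i) % 16 != 0) {
        dst[i++] = color;
    }

#if DEMO_HAVE_SSE2
    // Body: aligned 128-bit stores, four pixels per iteration.
    const __m128i k = _mm_set1_epi32(static_cast<int>(color));
    for (; i + 4 <= n; i += 4) {
        _mm_store_si128(reinterpret_cast<__m128i*>(dst + i), k);
    }
#endif

    // Tail: the 0-3 pixels left over (or everything, without SSE2).
    for (; i < n; ++i) dst[i] = color;
}
```

For narrow rects the head and tail can cover most of the row, which matches the guess that small or odd-sized rects gain little from the SSE2 path.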
BAF
Member #2,981
December 2002
Quote: Which fake terms did i use ?

Ticks.
A J
Member #3,025
December 2002
It's a measurement of time... As every computer is different, quoting the exact millisecond (or microsecond) values for my specific hardware doesn't tell people too much, and can be somewhat misleading. Best to just look at the numbers and compare the different methods' results.

I did some more tests just now, on a Pentium M 1.6 GHz:

allegro's rectfill = 191,421 ticks

Don't compare the tick values to the other chart; they are from different hardware.
BAF
Member #2,981
December 2002
Yes, but if a tick is 1/100,000,000th of a second then it's not worth it for that "little" of a speed-up.
A J
Member #3,025
December 2002
67,000 ticks vs 191,000... that's almost 3x faster! If your app spends 30% of its time in rectfill, then just from that one function getting some SSE2 code you suddenly have another 20% of your CPU available for other things. I don't understand how you can't see that. Give me an example of how you think it could be a "little" speed-up?
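The 20% figure above follows from simple arithmetic: the share of run time spent in the function, multiplied by the fraction of that time the faster routine removes. A small sketch (the function name is made up) that reproduces it from the quoted tick counts:

```cpp
#include <cassert>

// Fraction of total CPU time freed when a function that used to take
// `share` of the run time drops from old_ticks to new_ticks.
double demo_cpu_fraction_freed(double share, double old_ticks, double new_ticks) {
    assert(old_ticks > 0.0);
    return share * (1.0 - new_ticks / old_ticks);
}
```

With `share = 0.30`, `old_ticks = 191000` and `new_ticks = 67000` this gives roughly 0.195, which is the "another 20%" quoted above. The same formula shows the other side of the argument: if rectfill only accounts for 1% of the run time, the saving is well under 1%.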
BAF
Member #2,981
December 2002
I see it, but all the work spent optimizing the function isn't worth it if you use rectfill a couple of times and only save 200 msec of CPU time.
A J
Member #3,025
December 2002
I don't think the average use of rectfill is just a couple of times. For my app, I use it a lot. You seem to have some idea about what constitutes wasted time; do you think I have wasted my time? What should I have done with it?

Also, this is just the tip. Having done this rectfill SSE coding (which amounts to about 20 lines of code, by the way), I have also found opportunities to accelerate blit and masked blit, and I think I can get 2x from masked blit, possibly more.

Let's presume that by next week (as the rectfill code has taken 3 days to research and write) I have rectfill, blit and masked blit working at over 2x; can you then tell me it's not worth the effort?
BAF
Member #2,981
December 2002
I'm not saying it was not; I was just saying that depending on how it is used, it may or may not be worth it. I personally don't use rectfill much, so a blit speed-up would be way more useful to me.
A J
Member #3,025
December 2002
I've been working on blit:

allegro's blit = 51,569 ticks

And as I may have already mentioned, there is hope for masked blit also.