![]() |
|
Greyscale-Filter is too slow. How to improve performance? |
ImLeftFooted
Member #3,935
October 2003
![]() |
Quote: you should not use the line[] beyond a single line. Why not? |
Richard Phipps
Member #1,632
November 2001
![]() |
Because there is no guarantee that memory of line 2 will follow the memory of line 1. It's very likely to happen, but not guaranteed. |
Kitty Cat
Member #2,815
October 2002
![]() |
For memory bitmaps, the lines will be continuguous (go across line 1, you'll eventually reach line 2). However, the end of the right side is not necessarilly the beginning of the left. There may be extra padding, and the bitmap struct doesn't have a public pitch member, so you need to get it yourself. int w = (bmp->line[1]-bmp->line[0]) / ((get_bitmap_color_depth(bmp)+7)/8); for(i = w*bmp->h - 1;i >= 0;--i) ...
-- |
A J
Member #3,025
December 2002
![]() |
and doing your memory accesses backwards will be slower than forwards.. memory controllers and prediciton in CPU's prefer/presume forwards access. ___________________________ |
Evert
Member #794
November 2000
![]() |
Forgot to mention, if you do the two loops, copy the bmp->line[y] to a local pointer and dereference that in your inner loop. |
Tobias Dammers
Member #2,604
August 2002
![]() |
Here are 2 more options: Both ways make use of the built-in palette conversions allegro offers. --- |
A J
Member #3,025
December 2002
![]() |
and using a long* will be wrong on 64bit systems ___________________________ |
HoHo
Member #4,534
April 2004
![]() |
[quote]HoHo, im intersetsed in your theory behind why 2 lines at a time would work quicker, care to elaborate[/quote]Basically it would be loop unrolling. Working with two different things gives CPU better opportunity to pipeline instructions. Of course I haven't made any tests with this particular problem so if those instructions do not pipeline well there might not be so big speedup.
[edit] I just made a little test program to see if what I said holds true. It seems like not too well. Sure, unrolling helps a bit but not by as much as I hoped. I don't know why, perhaps there are too many bit logic instructions and P4 can't execute a lot of them in parallel. Another possibility is that it just ran out of registers and had to swap between L1 cache a lot. One thing I found out was that the shortcut that Kitty Cat proposed wasn't the fastest I could come up by modifying the original program. Difference wasn't big but it definitely was there. The biggest speedup came from reordering the loop to go from xy to yx. It gave ~3x speedup at smaller bitmaps and with bigger* 4x and more. That probably comes from the fact that CPU can prefetch the lines. Extracting the line pointer outside of the inner loop didn't give anything for some reason. *) At 512x256 it was ~2.8x slower, 512x512 12x slower, 1024x512 ~25x slower. The fastest was unrolling by two but difference with the reordered loop was only about 10%. So here is the program. It requires gettimeofday function so it won't compile on Windows. I you replace the Timer class it should work. [code]#include <allegro.h> #include <iostream> #include <sys/time.h> using namespace std; typedef void ( *testProc ) ( BITMAP* b ); class Timer { timeval begin, end; double time; public: virtual ~Timer() {} void start() { gettimeofday( &begin, NULL ); } void stop() { gettimeofday( &end, NULL ); time = end.tv_sec - begin.tv_sec + ( end.tv_usec - begin.tv_usec ) / 1.e6f; } double getTime() const { return time; } double getTimePassed() const { timeval tmp; gettimeofday( &tmp, NULL ); return tmp.tv_sec - begin.tv_sec + ( tmp.tv_usec - begin.tv_usec ) / 1.e6f; } }; class AutoTimer : public Timer { public: AutoTimer() { start(); } ~AutoTimer() { stop(); cout << getTime() << endl; } }; void filterMakeMonochrom( BITMAP* bmp ) { acquire_bitmap( bmp ); const int w = bmp->w, h = bmp->h; for ( int x = 0; x < w; x++ ) { for ( int y = 0; y < h; y++ ) { int grey; const int c = ( ( int * ) bmp->line[ y ] ) [ x ]; grey = ( c >> 16 ) & 0xFF; grey += ( c >> 8 ) & 0xFF; grey += c & 0xFF; grey /= 3; ( ( int * ) bmp->line[ y ] ) [ x ] = grey * ( ( 1 << 16 ) | ( 1 << 8 ) | 1 ); }; }; release_bitmap( bmp ); }; void filterMakeMonochromReordered( BITMAP* bmp ) { acquire_bitmap( bmp ); const int w = bmp->w, h = bmp->h; for ( int y = 0; y < h; y++ ) { for ( int x = 0; x < w; x++ ) { int grey; const int c = ( ( int * ) bmp->line[ y ] ) [ x ]; grey = ( c >> 16 ) & 0xFF; grey += ( c >> 8 ) & 0xFF; grey += c & 0xFF; grey /= 3; ( ( int * ) bmp->line[ y ] ) [ x ] = grey * ( ( 1 << 16 ) | ( 1 << 8 ) | 1 ); }; }; release_bitmap( bmp ); }; void filterMakeMonochromReorderedExtractedPointer( BITMAP* bmp ) { acquire_bitmap( bmp ); const int w = bmp->w, h = bmp->h; for ( int y = 0; y < h; y++ ) { int* line = ( int* ) bmp->line[ y ]; for ( int x = 0; x < w; x++ ) { int grey; const int c = line[ x ]; grey = ( c >> 16 ) & 0xFF; grey += ( c >> 8 ) & 0xFF; grey += c & 0xFF; grey /= 3; ( ( int * ) bmp->line[ y ] ) [ x ] = grey * ( ( 1 << 16 ) | ( 1 << 8 ) | 1 ); }; }; release_bitmap( bmp ); }; void filterMakeMonochromUnrolled2( BITMAP* bmp ) { acquire_bitmap( bmp ); const int w = bmp->w, h = bmp->h; for ( int y = 0; y < h; y += 2 ) { int * line1 = ( int* ) bmp->line[ y ]; int* line2 = ( int* ) bmp->line[ y + 1 ]; for ( int x = 0; x < w; x++ ) { int grey1, grey2; const int c1 = line1[ x ]; const int c2 = line2[ x ]; grey1 = ( c1 >> 16 ) & 0xFF; grey1 += ( c1 >> 8 ) & 0xFF; grey1 += c1 & 0xFF; grey1 /= 3; grey2 = ( c2 >> 16 ) & 0xFF; grey2 += ( c2 >> 8 ) & 0xFF; grey2 += c2 & 0xFF; grey2 /= 3; line1[ x ] = grey1 * ( ( 1 << 16 ) | ( 1 << 8 ) | 1 ); line2[ x ] = grey2 * ( ( 1 << 16 ) | ( 1 << 8 ) | 1 ); }; }; release_bitmap( bmp ); }; void filterMakeMonochromUnrolled4( BITMAP* bmp ) { acquire_bitmap( bmp ); int grey1, grey2, grey3, grey4; const int w = bmp->w, h = bmp->h; for ( int y = 0; y < h; y += 4 ) { int * line1 = ( int* ) bmp->line[ y ]; int* line2 = ( int* ) bmp->line[ y + 1 ]; int* line3 = ( int* ) bmp->line[ y + 2 ]; int* line4 = ( int* ) bmp->line[ y + 3 ]; for ( int x = 0; x < w; x++ ) { const int c1 = line1[ x ]; const int c2 = line2[ x ]; const int c3 = line3[ x ]; const int c4 = line4[ x ]; grey1 = ( c1 >> 16 ) & 0xFF; grey1 += ( c1 >> 8 ) & 0xFF; grey1 += c1 & 0xFF; grey1 /= 3; grey2 = ( c2 >> 16 ) & 0xFF; grey2 += ( c2 >> 8 ) & 0xFF; grey2 += c2 & 0xFF; grey2 /= 3; grey3 = ( c3 >> 16 ) & 0xFF; grey3 += ( c3 >> 8 ) & 0xFF; grey3 += c3 & 0xFF; grey3 /= 3; grey4 = ( c4 >> 16 ) & 0xFF; grey4 += ( c4 >> 8 ) & 0xFF; grey4 += c4 & 0xFF; grey4 /= 3; line1[ x ] = grey1 * ( ( 1 << 16 ) | ( 1 << 8 ) | 1 ); line2[ x ] = grey2 * ( ( 1 << 16 ) | ( 1 << 8 ) | 1 ); line3[ x ] = grey3 * ( ( 1 << 16 ) | ( 1 << 8 ) | 1 ); line4[ x ] = grey4 * ( ( 1 << 16 ) | ( 1 << 8 ) | 1 ); }; }; release_bitmap( bmp ); }; void filterMakeMonochromApprox( BITMAP* bmp ) { acquire_bitmap( bmp ); const int w = bmp->w, h = bmp->h; for ( int y = 0; y < h; y++ ) { int* line = ( int* ) bmp->line[ y ]; for ( int x = 0; x < w; x++ ) { const int color = line[ x ]; int grey = ( ( color & 0xFC0000 ) >> 18 ) + ( ( color & 0x00FE00 ) >> 9 ) + ( ( color & 0x00F800 ) >> 11 ) + ( ( color & 0x0000F8 ) >> 3 ); line[ x ] = grey * ( ( 1 << 16 ) | ( 1 << 8 ) | 1 ); }; }; release_bitmap( bmp ); }; void filterMakeMonochromApprox2( BITMAP* bmp ) { acquire_bitmap( bmp ); const int w = bmp->w, h = bmp->h; for ( int y = 0; y < h; y++ ) { int* line = ( int* ) bmp->line[ y ]; for ( int x = 0; x < w; x++ ) { const int color = line[ x ]; int gray = ( ( color & 0xFC0000 ) >> 18 ) + ( ( color & 0x00FE00 ) >> 9 ) + ( ( color & 0x0000FC ) >> 2 ); gray |= ( gray << 8 ) | ( gray << 16 ); line[ x ] = gray * ( ( 1 << 16 ) | ( 1 << 8 ) | 1 ); }; }; release_bitmap( bmp ); }; void filterMakeMonochromApproxUnrolled( BITMAP* bmp ) { acquire_bitmap( bmp ); const int w = bmp->w, h = bmp->h; for ( int y = 0; y < h; y += 2 ) { int * line1 = ( int* ) bmp->line[ y ]; int* line2 = ( int* ) bmp->line[ y + 1 ]; for ( int x = 0; x < w; x++ ) { const int color1 = line1[ x ]; int grey1 = ( ( color1 & 0xFC0000 ) >> 18 ) + ( ( color1 & 0x00FE00 ) >> 9 ) + ( ( color1 & 0x00F800 ) >> 11 ) + ( ( color1 & 0x0000F8 ) >> 3 ); line1[ x ] = grey1 * ( ( 1 << 16 ) | ( 1 << 8 ) | 1 ); const int color2 = line2[ x ]; int grey2 = ( ( color2 & 0xFC0000 ) >> 18 ) + ( ( color2 & 0x00FE00 ) >> 9 ) + ( ( color2 & 0x00F800 ) >> 11 ) + ( ( color2 & 0x0000F8 ) >> 3 ); line2[ x ] = grey2 * ( ( 1 << 16 ) | ( 1 << 8 ) | 1 ); }; }; release_bitmap( bmp ); }; void initWindow() { if ( allegro_init() != 0 ) exit( 1 ); set_color_depth( 32 ); if ( set_gfx_mode( GFX_AUTODETECT_WINDOWED, 800, 600, 0, 0 ) != 0 ) { if ( set_gfx_mode( GFX_SAFE, 320, 200, 0, 0 ) != 0 ) { set_gfx_mode( GFX_TEXT, 0, 0, 0, 0 ); allegro_message( "Unable to set any graphic mode\n%s\n", allegro_error ); exit( 1 ); } } /* set the color palette */ set_palette( desktop_palette ); clear_to_color( screen, makecol( 255, 255, 255 ) ); } void testLoop( string name, testProc proc, BITMAP* bmp ) { //cout << name << endl; AutoTimer at1; for ( int i = 0;i < 5;i++ ) { //AutoTimer at2; for ( int j = 0;j < 1000;j++ ) { proc( bmp ); } } cout << "Total "; } void test( int w, int h ) { cout << "Testing at " << w << "x" << h << endl; BITMAP* bmp = create_bitmap( w, h ); testLoop( "plain", filterMakeMonochrom, bmp ); testLoop( "reordered loop", filterMakeMonochromReordered, bmp ); testLoop( "reordered loop, extracted pointer", filterMakeMonochromReorderedExtractedPointer, bmp ); testLoop( "unrolled by two", filterMakeMonochromUnrolled2, bmp ); testLoop( "unrolled by four", filterMakeMonochromUnrolled4, bmp ); testLoop( "approx1", filterMakeMonochromApprox, bmp ); testLoop( "approx1 unrolled by 2", filterMakeMonochromApproxUnrolled, bmp ); testLoop( "approx2", filterMakeMonochromApprox2, bmp ); destroy_bitmap( bmp ); cout << endl; } int main() { initWindow(); test( 128, 128 ); test( 256, 128 ); test( 256, 256 ); test( 512, 256 ); test( 512, 512 ); test( 1024, 512 ); test( 1024, 1024 ); } END_OF_MAIN() [/code]And here is the output from 32bit P4 @ 3.3GHz with 1M L2. Order of the output is Original code reordered loop reordered loop, extracted pointer unrolled by two unrolled by four approx1 approx1 unrolled by 2 approx2 [code]hoho@WW004029 ~/katsepolygon/quicktests/grayscale $ g++ -O3 grayscale.cpp `allegro-config --libs`&& ./a.out Allegro up, screen: 800x600 32 Testing at 128x128 Total 0.644086 Total 0.347362 Total 0.355598 Total 0.300813 Total 0.288017 Total 0.37723 Total 0.325882 Total 0.402296 Testing at 256x128 Total 1.24467 Total 0.698047 Total 0.698583 Total 0.591255 Total 0.571967 Total 0.749285 Total 0.619433 Total 0.80224 Testing at 256x256 Total 2.73151 Total 1.38458 Total 1.39629 Total 1.17581 Total 1.3294 Total 1.51228 Total 1.24345 Total 1.6106 Testing at 512x256 Total 8.02422 Total 2.80307 Total 3.01607 Total 2.74377 Total 2.58823 Total 3.21759 Total 2.68319 Total 3.36077 Testing at 512x512 Total 77.4398 Total 6.35915 Total 6.20438 Total 5.79059 Total 5.8432 Total 7.01474 Total 6.09081 Total 6.83415 Testing at 1024x512 Total 315.324 Total 12.642 Total 12.5239 Total 11.0038 Total 12.1963 Total 15.1906 Total 12.9711 Total 13.9239 Testing at 1024x1024 Total 600.725 Total 25.0655 Total 24.1274 Total 24.9667 Total 23.0532 Total 25.5888 Total 23.5083 Total 26.5958 [/code] I'll run the same test tonight with my own P4. That one is running in 64bit. If indeed the problem is in too few registers it should show right away. Only problem is that it also has twice the L2 and that could alter the results. Another interesting thing would be to use SIMD instructions. It should be relatively trivial to do and should give quite big speedup compared to what I got here. Currently the biggest bandwidth is around 870MiB/s with 1024x1024 in one direction. There is quite a bit more bandwidth available so it should be possible to get almost full speed out of SIMD stuff without hitting the bandwidth limits. __________ |
A J
Member #3,025
December 2002
![]() |
all that bit shifting and masking, why not simply use a uint8_t* and avoid all that. ___________________________ |
Richard Phipps
Member #1,632
November 2001
![]() |
Carrus85
Member #2,633
August 2002
![]() |
unsignedinteger8bit_type
|
Richard Phipps
Member #1,632
November 2001
![]() |
Ok. How will that stop the need for bitshifting and masking when using Bitmap data? |
HoHo
Member #4,534
April 2004
![]() |
Quote: all that bit shifting and masking, why not simply use a uint8_t* and avoid all that. Hehe, good point. That's what happens when you just take someones code and don't really think what is going on. Quote: How will that stop the need for bitshifting and masking when using Bitmap data? Instead of calculating RGB from the 32bit integer you simply get the RGB values directly. I'll make some tests after I have finished configuring my system. That shouldn't take more than couple of hours. __________ |
Richard Phipps
Member #1,632
November 2001
![]() |
Well, in that case I imagine the compiled code still has to do masking and bitshifting? Otherwise it would be reading one byte at a time instead of at least 4. |
A J
Member #3,025
December 2002
![]() |
why does the compiled code still need to do bit shifting and masking ? ___________________________ |
Richard Phipps
Member #1,632
November 2001
![]() |
I imagine the compiler wants to read in 4 bytes into a 32 bit register (for most PC's). It then can process 4 bytes worth of information. The alternative is to use a register for 1 byte, process it, and then read new data into the register. |
A J
Member #3,025
December 2002
![]() |
you like to 'imagine' dont you? why would you want to read in 4 bytes, when all you wanted was one ? how much asm programming have you done ? 'the' alternative, there is only 1 alternative ? what about the 64bit registers, and the 128 bit registers ? ___________________________ |
Richard Phipps
Member #1,632
November 2001
![]() |
First of all: Get rid of the negative attacking attitude. ok? Secondly: I have done limited 8-bit cpu programming in the past. And in this case it's better to read in 2 or more bytes than 1. Quote: why would you want to read in 4 bytes, when all you wanted was one ? Perhaps because it's inefficient? It was pretty well known in the past to read in more than one byte if you could. |
A J
Member #3,025
December 2002
![]() |
is there any value in discussing technical stuff when you pre-empt it with 'imagine' ? and 'perhaps' you made a bunch of assumptions, and all i asked was how you got to that conclusion ? you also claim there is only one alternative, how did you come to that conclusion ? so you dont have much experince with modern CPUs, or asm programming, yet you make statements that dont make a whole lot of sense. ___________________________ |
Richard Phipps
Member #1,632
November 2001
![]() |
I used those words because while I have experience of 8 and 16 bit processors I have no more recent experiences. My 'assumptions' were based on my experience. Nothing you've said indicates that you have experience of cpu's either. Why don't you back up your assumptions and criticisms first eh? |
HoHo
Member #4,534
April 2004
![]() |
Old CPU's have quite little in common with modern ones when you are comparing different optimization techniques __________ |
Richard Phipps
Member #1,632
November 2001
![]() |
That may be the case, which is why I used the cautious language I wrote, rather than a 'this is the best way' type of statement. I'd still like to see AJ show some knowledge of this himself instead of just negative comments. |
A J
Member #3,025
December 2002
![]() |
ive made no negative comments, ive simply asked how you got to your conclusions ? which statements of mine require proof of validity ? 'cautious' a statement pre-empted with 'imagine' or 'perhaps' is hardly cautiuos... perhaps you have sex with fluffy animals, is a little cautios, but some might even say "i imagine" is the same as "i believe" you've almost admitted you know very little on this subject, yet persist.. you can throw as much mud as you like, when the results arrive (Thanks HoHo) that will be all the knowledge you will require. ___________________________ |
Richard Phipps
Member #1,632
November 2001
![]() |
Quote: ive made no negative comments Really? Quote: you like to 'imagine' dont you?
Quote: you do love those fence-sitting qualifiers
Quote: you can throw as much mud as you like
You must be blind! I'm explained quite a few times now how I've come to my conclusions. You have failed to do that each time I've asked. So I conclude you know nothing about this. |
HoHo
Member #4,534
April 2004
![]() |
Having followed some discussions in AD and some topics here in a.cc I conclude that AJ knows about optimizations way more than average programmer I haven't yet got to the testing but I'm almost there. I just have to move some files around to make room for installing some necessary programs. It shouldn't take more than two hours for first results. __________ |
|
|