Greyscale-Filter is too slow. How to improve performance?

Greyscale-Filter is too slow. How to improve performance?

ImLeftFooted

Member #3,935

October 2003

Quote:

you should not use the line[] beyond a single line.

Why not?

How is my posting?

Richard Phipps

Member #1,632

November 2001

Because there is no guarantee that memory of line 2 will follow the memory of line 1.

It's very likely to happen, but not guaranteed.

Chaos Groove Development Blog
Free Logging System Code & Blog

Kitty Cat

Member #2,815

October 2002

For memory bitmaps, the lines will be continuguous (go across line 1, you'll eventually reach line 2). However, the end of the right side is not necessarilly the beginning of the left. There may be extra padding, and the bitmap struct doesn't have a public pitch member, so you need to get it yourself.

int w = (bmp->line[1]-bmp->line[0]) / ((get_bitmap_color_depth(bmp)+7)/8);
for(i = w*bmp->h - 1;i >= 0;--i)
...

--
"Do not meddle in the affairs of cats, for they are subtle and will pee on your computer." -- Bruce Graham

A J

Member #3,025

December 2002

and doing your memory accesses backwards will be slower than forwards.. memory controllers and prediciton in CPU's prefer/presume forwards access.

___________________________
The more you talk, the more AJ is right. - ML

Evert

Member #794

November 2000

Forgot to mention, if you do the two loops, copy the bmp->line[y] to a local pointer and dereference that in your inner loop.

Tobias Dammers

Member #2,604

August 2002

Here are 2 more options:
1. Use 8bpp mode. Create a greyscale palette and set it. Create an RGB table and select it. Load all sprites with COLORCONV_TOTAL | COLORCONV_TRANS. There you go.
2. Use >8bpp mode. Create a greyscale palette and select it. Create RGB table. Load all sprites normally, set_color_conversion(COLORCONV_TOTAL | COLORCONV_TRANS), then blit each sprite to an 8bpp sprite of same size and use that instead.

Both ways make use of the built-in palette conversions allegro offers.

---
Me make music: Triofobie
---
"We need Tobias and his awesome trombone, too." - Johan Halmén

A J

Member #3,025

December 2002

and using a long* will be wrong on 64bit systems

___________________________
The more you talk, the more AJ is right. - ML

HoHo

Member #4,534

April 2004

[quote]HoHo, im intersetsed in your theory behind why 2 lines at a time would work quicker, care to elaborate[/quote]Basically it would be loop unrolling. Working with two different things gives CPU better opportunity to pipeline instructions. Of course I haven't made any tests with this particular problem so if those instructions do not pipeline well there might not be so big speedup.

[edit]

I just made a little test program to see if what I said holds true. It seems like not too well. Sure, unrolling helps a bit but not by as much as I hoped. I don't know why, perhaps there are too many bit logic instructions and P4 can't execute a lot of them in parallel. Another possibility is that it just ran out of registers and had to swap between L1 cache a lot.

One thing I found out was that the shortcut that Kitty Cat proposed wasn't the fastest I could come up by modifying the original program. Difference wasn't big but it definitely was there.

The biggest speedup came from reordering the loop to go from xy to yx. It gave ~3x speedup at smaller bitmaps and with bigger* 4x and more. That probably comes from the fact that CPU can prefetch the lines. Extracting the line pointer outside of the inner loop didn't give anything for some reason.
*) At 512x256 it was ~2.8x slower, 512x512 12x slower, 1024x512 ~25x slower.

The fastest was unrolling by two but difference with the reordered loop was only about 10%.

So here is the program. It requires gettimeofday function so it won't compile on Windows. I you replace the Timer class it should work.
[code]#include <allegro.h>
#include <iostream>
#include <sys/time.h>
using namespace std;
typedef void ( *testProc ) ( BITMAP* b );

class Timer {
timeval begin, end;
double time;
public:
virtual ~Timer() {}
void start() {
gettimeofday( &begin, NULL );
}
void stop() {
gettimeofday( &end, NULL );

time = end.tv_sec - begin.tv_sec + ( end.tv_usec - begin.tv_usec ) / 1.e6f;
}
double getTime() const {
return time;
}
double getTimePassed() const {
timeval tmp;
gettimeofday( &tmp, NULL );
return tmp.tv_sec - begin.tv_sec + ( tmp.tv_usec - begin.tv_usec ) / 1.e6f;
}
};

class AutoTimer : public Timer {
public:
AutoTimer() {
start();
}
~AutoTimer() {
stop();
cout << getTime() << endl;
}
};

void filterMakeMonochrom( BITMAP* bmp ) {
acquire_bitmap( bmp );
const int w = bmp->w, h = bmp->h;
for ( int x = 0; x < w; x++ ) {
for ( int y = 0; y < h; y++ ) {
int grey;
const int c = ( ( int * ) bmp->line[ y ] ) [ x ];
grey = ( c >> 16 ) & 0xFF;
grey += ( c >> 8 ) & 0xFF;
grey += c & 0xFF;
grey /= 3;
( ( int * ) bmp->line[ y ] ) [ x ] = grey * ( ( 1 << 16 ) | ( 1 << 8 ) | 1 );
};
};
release_bitmap( bmp );
};

void filterMakeMonochromReordered( BITMAP* bmp ) {
acquire_bitmap( bmp );
const int w = bmp->w, h = bmp->h;
for ( int y = 0; y < h; y++ ) {
for ( int x = 0; x < w; x++ ) {
int grey;
const int c = ( ( int * ) bmp->line[ y ] ) [ x ];
grey = ( c >> 16 ) & 0xFF;
grey += ( c >> 8 ) & 0xFF;
grey += c & 0xFF;
grey /= 3;
( ( int * ) bmp->line[ y ] ) [ x ] = grey * ( ( 1 << 16 ) | ( 1 << 8 ) | 1 );
};
};
release_bitmap( bmp );
};
void filterMakeMonochromReorderedExtractedPointer( BITMAP* bmp ) {
acquire_bitmap( bmp );
const int w = bmp->w, h = bmp->h;
for ( int y = 0; y < h; y++ ) {
int* line = ( int* ) bmp->line[ y ];
for ( int x = 0; x < w; x++ ) {
int grey;
const int c = line[ x ];
grey = ( c >> 16 ) & 0xFF;
grey += ( c >> 8 ) & 0xFF;
grey += c & 0xFF;
grey /= 3;
( ( int * ) bmp->line[ y ] ) [ x ] = grey * ( ( 1 << 16 ) | ( 1 << 8 ) | 1 );
};
};
release_bitmap( bmp );
};
void filterMakeMonochromUnrolled2( BITMAP* bmp ) {
acquire_bitmap( bmp );
const int w = bmp->w, h = bmp->h;
for ( int y = 0; y < h; y += 2 ) {
int * line1 = ( int* ) bmp->line[ y ];
int* line2 = ( int* ) bmp->line[ y + 1 ];
for ( int x = 0; x < w; x++ ) {
int grey1, grey2;
const int c1 = line1[ x ];
const int c2 = line2[ x ];
grey1 = ( c1 >> 16 ) & 0xFF;
grey1 += ( c1 >> 8 ) & 0xFF;
grey1 += c1 & 0xFF;
grey1 /= 3;
grey2 = ( c2 >> 16 ) & 0xFF;
grey2 += ( c2 >> 8 ) & 0xFF;
grey2 += c2 & 0xFF;
grey2 /= 3;
line1[ x ] = grey1 * ( ( 1 << 16 ) | ( 1 << 8 ) | 1 );
line2[ x ] = grey2 * ( ( 1 << 16 ) | ( 1 << 8 ) | 1 );
};
};
release_bitmap( bmp );
};
void filterMakeMonochromUnrolled4( BITMAP* bmp ) {
acquire_bitmap( bmp );
int grey1, grey2, grey3, grey4;
const int w = bmp->w, h = bmp->h;
for ( int y = 0; y < h; y += 4 ) {
int * line1 = ( int* ) bmp->line[ y ];
int* line2 = ( int* ) bmp->line[ y + 1 ];
int* line3 = ( int* ) bmp->line[ y + 2 ];
int* line4 = ( int* ) bmp->line[ y + 3 ];
for ( int x = 0; x < w; x++ ) {
const int c1 = line1[ x ];
const int c2 = line2[ x ];
const int c3 = line3[ x ];
const int c4 = line4[ x ];
grey1 = ( c1 >> 16 ) & 0xFF;
grey1 += ( c1 >> 8 ) & 0xFF;
grey1 += c1 & 0xFF;
grey1 /= 3;
grey2 = ( c2 >> 16 ) & 0xFF;
grey2 += ( c2 >> 8 ) & 0xFF;
grey2 += c2 & 0xFF;
grey2 /= 3;
grey3 = ( c3 >> 16 ) & 0xFF;
grey3 += ( c3 >> 8 ) & 0xFF;
grey3 += c3 & 0xFF;
grey3 /= 3;
grey4 = ( c4 >> 16 ) & 0xFF;
grey4 += ( c4 >> 8 ) & 0xFF;
grey4 += c4 & 0xFF;
grey4 /= 3;
line1[ x ] = grey1 * ( ( 1 << 16 ) | ( 1 << 8 ) | 1 );
line2[ x ] = grey2 * ( ( 1 << 16 ) | ( 1 << 8 ) | 1 );
line3[ x ] = grey3 * ( ( 1 << 16 ) | ( 1 << 8 ) | 1 );
line4[ x ] = grey4 * ( ( 1 << 16 ) | ( 1 << 8 ) | 1 );
};
};
release_bitmap( bmp );
};

void filterMakeMonochromApprox( BITMAP* bmp ) {
acquire_bitmap( bmp );
const int w = bmp->w, h = bmp->h;
for ( int y = 0; y < h; y++ ) {
int* line = ( int* ) bmp->line[ y ];
for ( int x = 0; x < w; x++ ) {
const int color = line[ x ];
int grey =
( ( color & 0xFC0000 ) >> 18 ) +
( ( color & 0x00FE00 ) >> 9 ) +
( ( color & 0x00F800 ) >> 11 ) +
( ( color & 0x0000F8 ) >> 3 );

line[ x ] = grey * ( ( 1 << 16 ) | ( 1 << 8 ) | 1 );
};
};
release_bitmap( bmp );

};

void filterMakeMonochromApprox2( BITMAP* bmp ) {
acquire_bitmap( bmp );
const int w = bmp->w, h = bmp->h;
for ( int y = 0; y < h; y++ ) {
int* line = ( int* ) bmp->line[ y ];
for ( int x = 0; x < w; x++ ) {
const int color = line[ x ];
int gray =
( ( color & 0xFC0000 ) >> 18 ) +
( ( color & 0x00FE00 ) >> 9 ) +
( ( color & 0x0000FC ) >> 2 );
gray |= ( gray << 8 ) | ( gray << 16 );

line[ x ] = gray * ( ( 1 << 16 ) | ( 1 << 8 ) | 1 );
};
};
release_bitmap( bmp );

};

void filterMakeMonochromApproxUnrolled( BITMAP* bmp ) {
acquire_bitmap( bmp );
const int w = bmp->w, h = bmp->h;
for ( int y = 0; y < h; y += 2 ) {
int * line1 = ( int* ) bmp->line[ y ];
int* line2 = ( int* ) bmp->line[ y + 1 ];
for ( int x = 0; x < w; x++ ) {
const int color1 = line1[ x ];
int grey1 =
( ( color1 & 0xFC0000 ) >> 18 ) +
( ( color1 & 0x00FE00 ) >> 9 ) +
( ( color1 & 0x00F800 ) >> 11 ) +
( ( color1 & 0x0000F8 ) >> 3 );

line1[ x ] = grey1 * ( ( 1 << 16 ) | ( 1 << 8 ) | 1 );
const int color2 = line2[ x ];
int grey2 =
( ( color2 & 0xFC0000 ) >> 18 ) +
( ( color2 & 0x00FE00 ) >> 9 ) +
( ( color2 & 0x00F800 ) >> 11 ) +
( ( color2 & 0x0000F8 ) >> 3 );

line2[ x ] = grey2 * ( ( 1 << 16 ) | ( 1 << 8 ) | 1 );
};
};
release_bitmap( bmp );

};

void initWindow() {
if ( allegro_init() != 0 )
exit( 1 );
set_color_depth( 32 );
if ( set_gfx_mode( GFX_AUTODETECT_WINDOWED, 800, 600, 0, 0 ) != 0 ) {
if ( set_gfx_mode( GFX_SAFE, 320, 200, 0, 0 ) != 0 ) {
set_gfx_mode( GFX_TEXT, 0, 0, 0, 0 );
allegro_message( "Unable to set any graphic mode\n%s\n", allegro_error );
exit( 1 );
}
}
/* set the color palette */
set_palette( desktop_palette );
clear_to_color( screen, makecol( 255, 255, 255 ) );
}

void testLoop( string name, testProc proc, BITMAP* bmp ) {
//cout << name << endl;
AutoTimer at1;
for ( int i = 0;i < 5;i++ ) {
//AutoTimer at2;
for ( int j = 0;j < 1000;j++ ) {
proc( bmp );
}
}
cout << "Total ";
}
void test( int w, int h ) {
cout << "Testing at " << w << "x" << h << endl;
BITMAP* bmp = create_bitmap( w, h );
testLoop( "plain", filterMakeMonochrom, bmp );
testLoop( "reordered loop", filterMakeMonochromReordered, bmp );
testLoop( "reordered loop, extracted pointer", filterMakeMonochromReorderedExtractedPointer, bmp );
testLoop( "unrolled by two", filterMakeMonochromUnrolled2, bmp );
testLoop( "unrolled by four", filterMakeMonochromUnrolled4, bmp );
testLoop( "approx1", filterMakeMonochromApprox, bmp );
testLoop( "approx1 unrolled by 2", filterMakeMonochromApproxUnrolled, bmp );
testLoop( "approx2", filterMakeMonochromApprox2, bmp );
destroy_bitmap( bmp );
cout << endl;
}
int main() {
initWindow();
test( 128, 128 );
test( 256, 128 );
test( 256, 256 );
test( 512, 256 );
test( 512, 512 );
test( 1024, 512 );
test( 1024, 1024 );
}
END_OF_MAIN()

[/code]And here is the output from 32bit P4 @ 3.3GHz with 1M L2.

Order of the output is
Original code
reordered loop
reordered loop, extracted pointer
unrolled by two
unrolled by four
approx1
approx1 unrolled by 2
approx2

[code]hoho@WW004029 ~/katsepolygon/quicktests/grayscale $ g++ -O3 grayscale.cpp `allegro-config --libs`&& ./a.out
Allegro up, screen: 800x600 32
Testing at 128x128
Total 0.644086
Total 0.347362
Total 0.355598
Total 0.300813
Total 0.288017
Total 0.37723
Total 0.325882
Total 0.402296

Testing at 256x128
Total 1.24467
Total 0.698047
Total 0.698583
Total 0.591255
Total 0.571967
Total 0.749285
Total 0.619433
Total 0.80224

Testing at 256x256
Total 2.73151
Total 1.38458
Total 1.39629
Total 1.17581
Total 1.3294
Total 1.51228
Total 1.24345
Total 1.6106

Testing at 512x256
Total 8.02422
Total 2.80307
Total 3.01607
Total 2.74377
Total 2.58823
Total 3.21759
Total 2.68319
Total 3.36077

Testing at 512x512
Total 77.4398
Total 6.35915
Total 6.20438
Total 5.79059
Total 5.8432
Total 7.01474
Total 6.09081
Total 6.83415

Testing at 1024x512
Total 315.324
Total 12.642
Total 12.5239
Total 11.0038
Total 12.1963
Total 15.1906
Total 12.9711
Total 13.9239

Testing at 1024x1024
Total 600.725
Total 25.0655
Total 24.1274
Total 24.9667
Total 23.0532
Total 25.5888
Total 23.5083
Total 26.5958
[/code]

I'll run the same test tonight with my own P4. That one is running in 64bit. If indeed the problem is in too few registers it should show right away. Only problem is that it also has twice the L2 and that could alter the results.

Another interesting thing would be to use SIMD instructions. It should be relatively trivial to do and should give quite big speedup compared to what I got here. Currently the biggest bandwidth is around 870MiB/s with 1024x1024 in one direction. There is quite a bit more bandwidth available so it should be possible to get almost full speed out of SIMD stuff without hitting the bandwidth limits.

__________
In theory, there is no difference between theory and practice. But, in practice, there is - Jan L.A. van de Snepscheut
MMORPG's...Many Men Online Role Playing Girls - Radagar
"Is Java REALLY slower? Does STL really bloat your exes? Find out with your friendly host, HoHo, and his benchmarking machine!" - Jakub Wasilewski

A J

Member #3,025

December 2002

all that bit shifting and masking, why not simply use a uint8_t* and avoid all that.

___________________________
The more you talk, the more AJ is right. - ML

Richard Phipps

Member #1,632

November 2001

A what?

Chaos Groove Development Blog
Free Logging System Code & Blog

Carrus85

Member #2,633

August 2002

unsignedinteger8bit_type

Richard Phipps

Member #1,632

November 2001

Ok.

How will that stop the need for bitshifting and masking when using Bitmap data?

Chaos Groove Development Blog
Free Logging System Code & Blog

HoHo

Member #4,534

April 2004

Quote:

all that bit shifting and masking, why not simply use a uint8_t* and avoid all that.

Hehe, good point. That's what happens when you just take someones code and don't really think what is going on.

Quote:

How will that stop the need for bitshifting and masking when using Bitmap data?

Instead of calculating RGB from the 32bit integer you simply get the RGB values directly.

I'll make some tests after I have finished configuring my system. That shouldn't take more than couple of hours.

Richard Phipps

Member #1,632

November 2001

Well, in that case I imagine the compiled code still has to do masking and bitshifting? Otherwise it would be reading one byte at a time instead of at least 4.

Chaos Groove Development Blog
Free Logging System Code & Blog

A J

Member #3,025

December 2002

why does the compiled code still need to do bit shifting and masking ?

___________________________
The more you talk, the more AJ is right. - ML

Richard Phipps

Member #1,632

November 2001

I imagine the compiler wants to read in 4 bytes into a 32 bit register (for most PC's). It then can process 4 bytes worth of information.

The alternative is to use a register for 1 byte, process it, and then read new data into the register.

Chaos Groove Development Blog
Free Logging System Code & Blog

A J

Member #3,025

December 2002

you like to 'imagine' dont you?

why would you want to read in 4 bytes, when all you wanted was one ?

how much asm programming have you done ?

'the' alternative, there is only 1 alternative ?

what about the 64bit registers, and the 128 bit registers ?

___________________________
The more you talk, the more AJ is right. - ML

Richard Phipps

Member #1,632

November 2001

First of all: Get rid of the negative attacking attitude. ok?

Secondly: I have done limited 8-bit cpu programming in the past. And in this case it's better to read in 2 or more bytes than 1.

Quote:

why would you want to read in 4 bytes, when all you wanted was one ?

Perhaps because it's inefficient? It was pretty well known in the past to read in more than one byte if you could.

Chaos Groove Development Blog
Free Logging System Code & Blog

A J

Member #3,025

December 2002

is there any value in discussing technical stuff when you pre-empt it with 'imagine' ? and 'perhaps'

you made a bunch of assumptions, and all i asked was how you got to that conclusion ?

you also claim there is only one alternative, how did you come to that conclusion ?

so you dont have much experince with modern CPUs, or asm programming, yet you make statements that dont make a whole lot of sense.

___________________________
The more you talk, the more AJ is right. - ML

Richard Phipps

Member #1,632

November 2001

I used those words because while I have experience of 8 and 16 bit processors I have no more recent experiences.

My 'assumptions' were based on my experience. Nothing you've said indicates that you have experience of cpu's either.

Why don't you back up your assumptions and criticisms first eh?

Chaos Groove Development Blog
Free Logging System Code & Blog

HoHo

Member #4,534

April 2004

Old CPU's have quite little in common with modern ones when you are comparing different optimization techniques

Richard Phipps

Member #1,632

November 2001

That may be the case, which is why I used the cautious language I wrote, rather than a 'this is the best way' type of statement.

I'd still like to see AJ show some knowledge of this himself instead of just negative comments.

Chaos Groove Development Blog
Free Logging System Code & Blog

A J

Member #3,025

December 2002

ive made no negative comments, ive simply asked how you got to your conclusions ?

which statements of mine require proof of validity ?

'cautious' a statement pre-empted with 'imagine' or 'perhaps' is hardly cautiuos...

perhaps you have sex with fluffy animals, is a little cautios, but
i imagine you have sex with fluffy animals, isn't particauly cautious.

some might even say "i imagine" is the same as "i believe"

you've almost admitted you know very little on this subject, yet persist..
or maybe (you do love those fence-sitting qualifiers) you want to keep going, your ego seems to be up for it.

you can throw as much mud as you like, when the results arrive (Thanks HoHo) that will be all the knowledge you will require.

___________________________
The more you talk, the more AJ is right. - ML

Richard Phipps

Member #1,632

November 2001

Quote:

ive made no negative comments

Really?

Quote:

you like to 'imagine' dont you?

Quote:

you do love those fence-sitting qualifiers

Quote:

you can throw as much mud as you like

You must be blind!

I'm explained quite a few times now how I've come to my conclusions. You have failed to do that each time I've asked.

So I conclude you know nothing about this.

Chaos Groove Development Blog
Free Logging System Code & Blog

HoHo

Member #4,534

April 2004

Having followed some discussions in AD and some topics here in a.cc I conclude that AJ knows about optimizations way more than average programmer

I haven't yet got to the testing but I'm almost there. I just have to move some files around to make room for installing some necessary programs. It shouldn't take more than two hours for first results.