Greyscale-Filter is too slow. How to improve performance?
DieBacke

hi,
basically this is my filter:

1static void FilterMakeMonochrom(BITMAP* bmp)
2{
3 //make the bitmap of the child monochrom
4 acquire_bitmap(bmp);
5 int grey;
6 int color;
7 for (int x = 0; x <= SCREEN_W; x++)
8 {
9 for (int y = 0; y <= SCREEN_H; y++)
10 {
11 color = getpixel(bmp, x, y);
12 grey = ((getr(color) + getg(color) + getb(color)) / 3);
13 putpixel(bmp, x, y, makecol(grey, grey, grey));
14 };
15 };
16 release_bitmap(bmp);
17};

it is very slow... when I try it with lower resolutions, it is much smoother (1024*768 compared to 640*480 :)). So the part in the loop is not so well done by me... There must be a critical mistake, otherwise such a simple filter wouldn't lag so bad...

If I don't call acquire_bitmap() the performance doesn't get worse... shouldn't that lower fps to almost zero?

thx, ~~Backe

miran

Don't use generic getpixel, putpixel, makecol and getX functions! Use the colour depth specific ones instead. Actually you should access bitmap memory directly. That'll make your code several times faster.

EDIT: But at 1024x768 that will still be slow. You can probaly use OpenLayer to get the same effect. I've no idea how though...

Kitty Cat

What type of bitmap is being passed? A memory bitmap? Video? If you know the depth, you can use the functions directly:

_getpixel32 // note the underscore!
_putpixel32 // note the underscore!
getr32
getg32
getb32
makecol32

Replace 32 with the depth you're using.

EDIT:
You can also use

bmp_select(bmp);
addr = bmp_read_line(bmp, y);
for(x)
   buf[x] = bmp_read32(addr+(x*4));

do_stuff_to_line(buf);

addr = bmp_write_line(bmp, y);
for(x)
   bmp_write32(addr+(x*4), buf[x]);

Check the exflame example for methods of writing directly to bitmaps.

FrankyR

For some info on how to access BITMAP data directly check out: http://www.allegro.cc/manual/api/direct-access-to-video-memory/

Evert

In addition to what was said above, swapping the order of your loops may also help due to the way the bitmaps are layed out in memory.
If you can use an 8 bit image for this, you can get the effect very quickly by using a grey-scale palette.

gnolam

It should also be noted that the code you provided may not produce the result you want. In that case, you'll have to weight your color components: luminance

DieBacke

the colour depth specific getpixel and putpixel functions and the ((long *)bmp->line[y])[x] crash my program ??? in my IDE the program quits, and keeps the resolution, though. If I run the exe manually, the window just becomes black.
The bitmap passed is a memory bitmap. Nothing special, just a simple doublebuffer.
color depth is 32. The getX functions work better when color depth specific.

Richard Phipps

If it crashes then you are trying to read from a point outside the bitmap.

Elverion

I also find that, instead of using nested for loops, using a single for loop can benefit speed more than you would think. That is, instead of:

// maxa represents the max "width"
// maxb represents the max "height"
for(int b = 0; b < maxb; b++)
  for(int a = 0; a < maxa; a++)
    do_something();

use:

// maxa represents the max "width"
// maxb represents the max "height"
for(int i = 0; i < (maxb * maxa); i++)
{
  int b = i / maxa; // b represents vertical movement
  int a = i % maxa; // a represents horizontal movement
  
  do_something();
}

Of course, this doesn't always work(mostly due to optimizations where possible), and you probably knew this anyways.

I was thinking that, maybe, you could somehow use draw_lit_sprite to fake the whole grayscaling processes. I haven't used this function before, and don't know it's limitations.

DieBacke
Quote:

If it crashes then you are trying to read from a point outside the bitmap.

Yes, your right.
Ok, this already does a pretty good job:

1static void FilterMakeMonochrom(BITMAP* bmp)
2{
3 //make the bitmap of the child monochrom
4 acquire_bitmap(bmp);
5 int grey;
6 int color;
7 for(int i = 0; i < (bmp->w * bmp->h); i++)
8 {
9 int x = i % bmp->w;
10 int y = i / bmp->w;
11
12 color = _getpixel32(bmp, x, y);
13 grey = ((getr32(color) + getg32(color) + getb32(color)) / 3);
14 ((long *)bmp->line[y])[x] = makecol32(grey, grey, grey);
15 }
16 release_bitmap(bmp);
17};

It still lags, but it's better.
How could one single loop improve performance? I tried it, it is not noticeable without counting the frames (I don't count the frames in my program)

Evert
Quote:

for(int i = 0; i < (bmp->w * bmp->h); i++)
{
int x = i % bmp->w;
int y = i / bmp->w;

Store bmp->w*bmp->h, and copy bmp->w and bmp->h to local variables (right now, you're doing a pointer dereference in your inner loop, which is slower than accessing a local variable).

The single loop might buy you some time because it safes one jump instruction at the end of the loop (the nested loops have two jumps, one for each loop), on the other hand, division and modulus are slow operations (but they can probably be pipelined and executed together, someone else will know).

Richard Phipps

Try this:

1 for (y = 0 ; y < bmp->h ; y++)
2 {
3 for (x = 0 ; x < bmp->w ; x++)
4 {
5 // Fast access to 4 bytes of memory bitmap data, i.e. one 32 bit pixel.
6 c = ((long *)bmp->line[y])[x];
7
8 // Split into colour components.
9 r = (c >> 16) & 0xFF;
10 g = (c >> 8) & 0xFF;
11 b = c & 0xFF;
12
13 // Quick but inaccurate greyscale conversion.
14 grey = ((r + g + b) / 3);
15
16 // Write the new value.
17 ((long *)bmp->line[y])[x] = (grey << 16 | grey << 8 | grey);
18 }
19 }

It can be optimised further, but I wanted you to see what was happening first.

Neil Walker

.3 * R + .59 * G + .11 * B

Isn't that much slower in a computer that can do 5 trillion calculations a second.

Richard Phipps

True, but unless it needs to be accurate, the quick divide by 3 will look fine and be a little bit faster.

Kitty Cat
   c = ((long *)bmp->line[y])[x]; 

   // Split into colour components.
   // Quick but inaccurate greyscale conversion. 
   grey  = (c >> 16) & 0xFF;
   grey += (c >> 8) & 0xFF;
   grey += c & 0xFF;
   grey /= 3;

   // Write the new value.  
   ((long *)bmp->line[y])[x] = grey * ((1<<16) | (1<<8) | 1);

Richard Phipps

Yep, that's the optimised version. Although you could put the grey bit in the pixel write and get rid of the grey variable altogether.

Jonatan Hedborg

A faster way would be to make a whole new set of bitmaps, which are greyscale. They could be automaticly generated on load and stored "together" (in the same class/struct) with their "parent" bitmap. Then you'd have some function which switch their pointers when you want to draw in greyscale.

Krzysztof Kluczek
Quote:

.3 * R + .59 * G + .11 * B

.25*R + .625*G + .125*B shouldn't be too much off. :)

gray =
    ((color&0xFC0000)>>18) +
    ((color&0x00FE00)>> 9) +
    ((color&0x00F800)>>11) +
    ((color&0x0000F8)>> 3);

gray|=(gray<<8)|(gray<<16);

And if .25*R + .5*G + .25*B is fine for you:

gray =
    ((color&0xFC0000)>>18) +
    ((color&0x00FE00)>> 9) +
    ((color&0x0000FC)>> 2);

gray|=(gray<<8)|(gray<<16);

Elverion
Quote:

.3 * R + .59 * G + .11 * B
.25*R + .625*G + .125*B

What's with the seemingly random decimal points? I could understand multiplying them all by 0.3, or even using a slightly different number for green, but where do .59, .11, .625, and .125 come from?

Marcello

They come from empirical testing. r * 0.299f + g * 0.587f + b * 0.114f is what Photoshop uses, from what I've read.

miran
Quote:

They come from empirical testing. r * 0.299f + g * 0.587f + b * 0.114f is what Photoshop uses, from what I've read.

That's the standard set by some standard setting organization whose name is a French acronym I can't remember right now. .625, .5, .25, .125, etc. come from the fact that multiplying a number with them can be done with bitshifts.

HoHo

For optimizing the loops a bit more, a good thing would be to do the operations on several bitmap lines in one cycle, probably 2 or 4 should be good options. That way compiler will probably mix those functions and CPU can execute several of those in parallel.
If done correctly I wouldn't be surprised if you would get two or more times faster execution with relatively modern CPU with good OoO execution :)

Neil Walker

For a lot of colour effects I find converting to HSL much easier. One thing I've used in a game is convert to grey-scale but keep the orangey/red colours.

ImLeftFooted

I would just save the double loop / modu operator speed headaches all together and just do one loop.

for(int y = 0; y < bmp->h; y++)
 for(int x = 0; x < bmp->w; x++)
  c = ((long *)bmp->line[y])[x];

becomes

for(int i = bmp->w * bmp->h - 1; i >=0; i--)
 c = ((long *)bmp->line[0])<i>;

Evert said:

(but they can probably be pipelined and executed together, someone else will know).

Can someone confirm or deny this?

A J

dustin, your algo is wrong, and potentially dangerous.
you should not use the line[] beyond a single line.
not to mention that you also start out-of-bounds

HoHo, im intersetsed in your theory behind why 2 lines at a time would work quicker, care to elaborate

ImLeftFooted
Quote:

you should not use the line[] beyond a single line.

Why not?

Richard Phipps

Because there is no guarantee that memory of line 2 will follow the memory of line 1.

It's very likely to happen, but not guaranteed.

Kitty Cat

For memory bitmaps, the lines will be continuguous (go across line 1, you'll eventually reach line 2). However, the end of the right side is not necessarilly the beginning of the left. There may be extra padding, and the bitmap struct doesn't have a public pitch member, so you need to get it yourself.

int w = (bmp->line[1]-bmp->line[0]) / ((get_bitmap_color_depth(bmp)+7)/8);
for(i = w*bmp->h - 1;i >= 0;--i)
...

A J

and doing your memory accesses backwards will be slower than forwards.. memory controllers and prediciton in CPU's prefer/presume forwards access.

Evert

Forgot to mention, if you do the two loops, copy the bmp->line[y] to a local pointer and dereference that in your inner loop.

Tobias Dammers

Here are 2 more options:
1. Use 8bpp mode. Create a greyscale palette and set it. Create an RGB table and select it. Load all sprites with COLORCONV_TOTAL | COLORCONV_TRANS. There you go.
2. Use >8bpp mode. Create a greyscale palette and select it. Create RGB table. Load all sprites normally, set_color_conversion(COLORCONV_TOTAL | COLORCONV_TRANS), then blit each sprite to an 8bpp sprite of same size and use that instead.

Both ways make use of the built-in palette conversions allegro offers.

A J

and using a long* will be wrong on 64bit systems

HoHo
[quote]HoHo, im intersetsed in your theory behind why 2 lines at a time would work quicker, care to elaborate[/quote]Basically it would be loop unrolling. Working with two different things gives CPU better opportunity to pipeline instructions. Of course I haven't made any tests with this particular problem so if those instructions do not pipeline well there might not be so big speedup.

[edit]

I just made a little test program to see if what I said holds true. It seems like not too well. Sure, unrolling helps a bit but not by as much as I hoped. I don't know why, perhaps there are too many bit logic instructions and P4 can't execute a lot of them in parallel. Another possibility is that it just ran out of registers and had to swap between L1 cache a lot.

One thing I found out was that the shortcut that Kitty Cat proposed wasn't the fastest I could come up by modifying the original program. Difference wasn't big but it definitely was there.

The biggest speedup came from reordering the loop to go from xy to yx. It gave ~3x speedup at smaller bitmaps and with bigger* 4x and more. That probably comes from the fact that CPU can prefetch the lines. Extracting the line pointer outside of the inner loop didn't give anything for some reason.
*) At 512x256 it was ~2.8x slower, 512x512 12x slower, 1024x512 ~25x slower.

The fastest was unrolling by two but difference with the reordered loop was only about 10%.

So here is the program. It requires gettimeofday function so it won't compile on Windows. I you replace the Timer class it should work.
[code]#include <allegro.h>
#include <iostream>
#include <sys/time.h>
using namespace std;
typedef void ( *testProc ) ( BITMAP* b );

class Timer {
timeval begin, end;
double time;
public:
virtual ~Timer() {}
void start() {
gettimeofday( &begin, NULL );
}
void stop() {
gettimeofday( &end, NULL );

time = end.tv_sec - begin.tv_sec + ( end.tv_usec - begin.tv_usec ) / 1.e6f;
}
double getTime() const {
return time;
}
double getTimePassed() const {
timeval tmp;
gettimeofday( &tmp, NULL );
return tmp.tv_sec - begin.tv_sec + ( tmp.tv_usec - begin.tv_usec ) / 1.e6f;
}
};

class AutoTimer : public Timer {
public:
AutoTimer() {
start();
}
~AutoTimer() {
stop();
cout << getTime() << endl;
}
};


void filterMakeMonochrom( BITMAP* bmp ) {
acquire_bitmap( bmp );
const int w = bmp->w, h = bmp->h;
for ( int x = 0; x < w; x++ ) {
for ( int y = 0; y < h; y++ ) {
int grey;
const int c = ( ( int * ) bmp->line[ y ] ) [ x ];
grey = ( c >> 16 ) & 0xFF;
grey += ( c >> 8 ) & 0xFF;
grey += c & 0xFF;
grey /= 3;
( ( int * ) bmp->line[ y ] ) [ x ] = grey * ( ( 1 << 16 ) | ( 1 << 8 ) | 1 );
};
};
release_bitmap( bmp );
};

void filterMakeMonochromReordered( BITMAP* bmp ) {
acquire_bitmap( bmp );
const int w = bmp->w, h = bmp->h;
for ( int y = 0; y < h; y++ ) {
for ( int x = 0; x < w; x++ ) {
int grey;
const int c = ( ( int * ) bmp->line[ y ] ) [ x ];
grey = ( c >> 16 ) & 0xFF;
grey += ( c >> 8 ) & 0xFF;
grey += c & 0xFF;
grey /= 3;
( ( int * ) bmp->line[ y ] ) [ x ] = grey * ( ( 1 << 16 ) | ( 1 << 8 ) | 1 );
};
};
release_bitmap( bmp );
};
void filterMakeMonochromReorderedExtractedPointer( BITMAP* bmp ) {
acquire_bitmap( bmp );
const int w = bmp->w, h = bmp->h;
for ( int y = 0; y < h; y++ ) {
int* line = ( int* ) bmp->line[ y ];
for ( int x = 0; x < w; x++ ) {
int grey;
const int c = line[ x ];
grey = ( c >> 16 ) & 0xFF;
grey += ( c >> 8 ) & 0xFF;
grey += c & 0xFF;
grey /= 3;
( ( int * ) bmp->line[ y ] ) [ x ] = grey * ( ( 1 << 16 ) | ( 1 << 8 ) | 1 );
};
};
release_bitmap( bmp );
};
void filterMakeMonochromUnrolled2( BITMAP* bmp ) {
acquire_bitmap( bmp );
const int w = bmp->w, h = bmp->h;
for ( int y = 0; y < h; y += 2 ) {
int * line1 = ( int* ) bmp->line[ y ];
int* line2 = ( int* ) bmp->line[ y + 1 ];
for ( int x = 0; x < w; x++ ) {
int grey1, grey2;
const int c1 = line1[ x ];
const int c2 = line2[ x ];
grey1 = ( c1 >> 16 ) & 0xFF;
grey1 += ( c1 >> 8 ) & 0xFF;
grey1 += c1 & 0xFF;
grey1 /= 3;
grey2 = ( c2 >> 16 ) & 0xFF;
grey2 += ( c2 >> 8 ) & 0xFF;
grey2 += c2 & 0xFF;
grey2 /= 3;
line1[ x ] = grey1 * ( ( 1 << 16 ) | ( 1 << 8 ) | 1 );
line2[ x ] = grey2 * ( ( 1 << 16 ) | ( 1 << 8 ) | 1 );
};
};
release_bitmap( bmp );
};
void filterMakeMonochromUnrolled4( BITMAP* bmp ) {
acquire_bitmap( bmp );
int grey1, grey2, grey3, grey4;
const int w = bmp->w, h = bmp->h;
for ( int y = 0; y < h; y += 4 ) {
int * line1 = ( int* ) bmp->line[ y ];
int* line2 = ( int* ) bmp->line[ y + 1 ];
int* line3 = ( int* ) bmp->line[ y + 2 ];
int* line4 = ( int* ) bmp->line[ y + 3 ];
for ( int x = 0; x < w; x++ ) {
const int c1 = line1[ x ];
const int c2 = line2[ x ];
const int c3 = line3[ x ];
const int c4 = line4[ x ];
grey1 = ( c1 >> 16 ) & 0xFF;
grey1 += ( c1 >> 8 ) & 0xFF;
grey1 += c1 & 0xFF;
grey1 /= 3;
grey2 = ( c2 >> 16 ) & 0xFF;
grey2 += ( c2 >> 8 ) & 0xFF;
grey2 += c2 & 0xFF;
grey2 /= 3;
grey3 = ( c3 >> 16 ) & 0xFF;
grey3 += ( c3 >> 8 ) & 0xFF;
grey3 += c3 & 0xFF;
grey3 /= 3;
grey4 = ( c4 >> 16 ) & 0xFF;
grey4 += ( c4 >> 8 ) & 0xFF;
grey4 += c4 & 0xFF;
grey4 /= 3;
line1[ x ] = grey1 * ( ( 1 << 16 ) | ( 1 << 8 ) | 1 );
line2[ x ] = grey2 * ( ( 1 << 16 ) | ( 1 << 8 ) | 1 );
line3[ x ] = grey3 * ( ( 1 << 16 ) | ( 1 << 8 ) | 1 );
line4[ x ] = grey4 * ( ( 1 << 16 ) | ( 1 << 8 ) | 1 );
};
};
release_bitmap( bmp );
};

void filterMakeMonochromApprox( BITMAP* bmp ) {
acquire_bitmap( bmp );
const int w = bmp->w, h = bmp->h;
for ( int y = 0; y < h; y++ ) {
int* line = ( int* ) bmp->line[ y ];
for ( int x = 0; x < w; x++ ) {
const int color = line[ x ];
int grey =
( ( color & 0xFC0000 ) >> 18 ) +
( ( color & 0x00FE00 ) >> 9 ) +
( ( color & 0x00F800 ) >> 11 ) +
( ( color & 0x0000F8 ) >> 3 );

line[ x ] = grey * ( ( 1 << 16 ) | ( 1 << 8 ) | 1 );
};
};
release_bitmap( bmp );

};

void filterMakeMonochromApprox2( BITMAP* bmp ) {
acquire_bitmap( bmp );
const int w = bmp->w, h = bmp->h;
for ( int y = 0; y < h; y++ ) {
int* line = ( int* ) bmp->line[ y ];
for ( int x = 0; x < w; x++ ) {
const int color = line[ x ];
int gray =
( ( color & 0xFC0000 ) >> 18 ) +
( ( color & 0x00FE00 ) >> 9 ) +
( ( color & 0x0000FC ) >> 2 );
gray |= ( gray << 8 ) | ( gray << 16 );

line[ x ] = gray * ( ( 1 << 16 ) | ( 1 << 8 ) | 1 );
};
};
release_bitmap( bmp );

};

void filterMakeMonochromApproxUnrolled( BITMAP* bmp ) {
acquire_bitmap( bmp );
const int w = bmp->w, h = bmp->h;
for ( int y = 0; y < h; y += 2 ) {
int * line1 = ( int* ) bmp->line[ y ];
int* line2 = ( int* ) bmp->line[ y + 1 ];
for ( int x = 0; x < w; x++ ) {
const int color1 = line1[ x ];
int grey1 =
( ( color1 & 0xFC0000 ) >> 18 ) +
( ( color1 & 0x00FE00 ) >> 9 ) +
( ( color1 & 0x00F800 ) >> 11 ) +
( ( color1 & 0x0000F8 ) >> 3 );

line1[ x ] = grey1 * ( ( 1 << 16 ) | ( 1 << 8 ) | 1 );
const int color2 = line2[ x ];
int grey2 =
( ( color2 & 0xFC0000 ) >> 18 ) +
( ( color2 & 0x00FE00 ) >> 9 ) +
( ( color2 & 0x00F800 ) >> 11 ) +
( ( color2 & 0x0000F8 ) >> 3 );

line2[ x ] = grey2 * ( ( 1 << 16 ) | ( 1 << 8 ) | 1 );
};
};
release_bitmap( bmp );

};

void initWindow() {
if ( allegro_init() != 0 )
exit( 1 );
set_color_depth( 32 );
if ( set_gfx_mode( GFX_AUTODETECT_WINDOWED, 800, 600, 0, 0 ) != 0 ) {
if ( set_gfx_mode( GFX_SAFE, 320, 200, 0, 0 ) != 0 ) {
set_gfx_mode( GFX_TEXT, 0, 0, 0, 0 );
allegro_message( "Unable to set any graphic mode\n%s\n", allegro_error );
exit( 1 );
}
}
/* set the color palette */
set_palette( desktop_palette );
clear_to_color( screen, makecol( 255, 255, 255 ) );
}


void testLoop( string name, testProc proc, BITMAP* bmp ) {
//cout << name << endl;
AutoTimer at1;
for ( int i = 0;i < 5;i++ ) {
//AutoTimer at2;
for ( int j = 0;j < 1000;j++ ) {
proc( bmp );
}
}
cout << "Total ";
}
void test( int w, int h ) {
cout << "Testing at " << w << "x" << h << endl;
BITMAP* bmp = create_bitmap( w, h );
testLoop( "plain", filterMakeMonochrom, bmp );
testLoop( "reordered loop", filterMakeMonochromReordered, bmp );
testLoop( "reordered loop, extracted pointer", filterMakeMonochromReorderedExtractedPointer, bmp );
testLoop( "unrolled by two", filterMakeMonochromUnrolled2, bmp );
testLoop( "unrolled by four", filterMakeMonochromUnrolled4, bmp );
testLoop( "approx1", filterMakeMonochromApprox, bmp );
testLoop( "approx1 unrolled by 2", filterMakeMonochromApproxUnrolled, bmp );
testLoop( "approx2", filterMakeMonochromApprox2, bmp );
destroy_bitmap( bmp );
cout << endl;
}
int main() {
initWindow();
test( 128, 128 );
test( 256, 128 );
test( 256, 256 );
test( 512, 256 );
test( 512, 512 );
test( 1024, 512 );
test( 1024, 1024 );
}
END_OF_MAIN()


[/code]And here is the output from 32bit P4 @ 3.3GHz with 1M L2.

Order of the output is
Original code
reordered loop
reordered loop, extracted pointer
unrolled by two
unrolled by four
approx1
approx1 unrolled by 2
approx2

[code]hoho@WW004029 ~/katsepolygon/quicktests/grayscale $ g++ -O3 grayscale.cpp `allegro-config --libs`&& ./a.out
Allegro up, screen: 800x600 32
Testing at 128x128
Total 0.644086
Total 0.347362
Total 0.355598
Total 0.300813
Total 0.288017
Total 0.37723
Total 0.325882
Total 0.402296

Testing at 256x128
Total 1.24467
Total 0.698047
Total 0.698583
Total 0.591255
Total 0.571967
Total 0.749285
Total 0.619433
Total 0.80224

Testing at 256x256
Total 2.73151
Total 1.38458
Total 1.39629
Total 1.17581
Total 1.3294
Total 1.51228
Total 1.24345
Total 1.6106

Testing at 512x256
Total 8.02422
Total 2.80307
Total 3.01607
Total 2.74377
Total 2.58823
Total 3.21759
Total 2.68319
Total 3.36077

Testing at 512x512
Total 77.4398
Total 6.35915
Total 6.20438
Total 5.79059
Total 5.8432
Total 7.01474
Total 6.09081
Total 6.83415

Testing at 1024x512
Total 315.324
Total 12.642
Total 12.5239
Total 11.0038
Total 12.1963
Total 15.1906
Total 12.9711
Total 13.9239

Testing at 1024x1024
Total 600.725
Total 25.0655
Total 24.1274
Total 24.9667
Total 23.0532
Total 25.5888
Total 23.5083
Total 26.5958
[/code]

I'll run the same test tonight with my own P4. That one is running in 64bit. If indeed the problem is in too few registers it should show right away. Only problem is that it also has twice the L2 and that could alter the results.

Another interesting thing would be to use SIMD instructions. It should be relatively trivial to do and should give quite big speedup compared to what I got here. Currently the biggest bandwidth is around 870MiB/s with 1024x1024 in one direction. There is quite a bit more bandwidth available so it should be possible to get almost full speed out of SIMD stuff without hitting the bandwidth limits.
A J

all that bit shifting and masking, why not simply use a uint8_t* and avoid all that.

Richard Phipps

A what?

Carrus85

unsignedinteger8bit_type

Richard Phipps

Ok.

How will that stop the need for bitshifting and masking when using Bitmap data?

HoHo
Quote:

all that bit shifting and masking, why not simply use a uint8_t* and avoid all that.

Hehe, good point. That's what happens when you just take someones code and don't really think what is going on.

Quote:

How will that stop the need for bitshifting and masking when using Bitmap data?

Instead of calculating RGB from the 32bit integer you simply get the RGB values directly.

I'll make some tests after I have finished configuring my system. That shouldn't take more than couple of hours.

Richard Phipps

Well, in that case I imagine the compiled code still has to do masking and bitshifting? Otherwise it would be reading one byte at a time instead of at least 4.

A J

why does the compiled code still need to do bit shifting and masking ?

Richard Phipps

I imagine the compiler wants to read in 4 bytes into a 32 bit register (for most PC's). It then can process 4 bytes worth of information.

The alternative is to use a register for 1 byte, process it, and then read new data into the register.

A J

you like to 'imagine' dont you?

why would you want to read in 4 bytes, when all you wanted was one ?

how much asm programming have you done ?

'the' alternative, there is only 1 alternative ?

what about the 64bit registers, and the 128 bit registers ?

Richard Phipps

First of all: Get rid of the negative attacking attitude. ok? :)

Secondly: I have done limited 8-bit cpu programming in the past. And in this case it's better to read in 2 or more bytes than 1.

Quote:

why would you want to read in 4 bytes, when all you wanted was one ?

Perhaps because it's inefficient? It was pretty well known in the past to read in more than one byte if you could.

A J

is there any value in discussing technical stuff when you pre-empt it with 'imagine' ? and 'perhaps'

you made a bunch of assumptions, and all i asked was how you got to that conclusion ?

you also claim there is only one alternative, how did you come to that conclusion ?

so you dont have much experince with modern CPUs, or asm programming, yet you make statements that dont make a whole lot of sense.

Richard Phipps

I used those words because while I have experience of 8 and 16 bit processors I have no more recent experiences.

My 'assumptions' were based on my experience. Nothing you've said indicates that you have experience of cpu's either.

Why don't you back up your assumptions and criticisms first eh?

HoHo

Old CPU's have quite little in common with modern ones when you are comparing different optimization techniques :P

Richard Phipps

That may be the case, which is why I used the cautious language I wrote, rather than a 'this is the best way' type of statement.

I'd still like to see AJ show some knowledge of this himself instead of just negative comments.

A J

ive made no negative comments, ive simply asked how you got to your conclusions ?

which statements of mine require proof of validity ?

'cautious' a statement pre-empted with 'imagine' or 'perhaps' is hardly cautiuos...

perhaps you have sex with fluffy animals, is a little cautios, but
i imagine you have sex with fluffy animals, isn't particauly cautious.

some might even say "i imagine" is the same as "i believe"

you've almost admitted you know very little on this subject, yet persist..
or maybe (you do love those fence-sitting qualifiers) you want to keep going, your ego seems to be up for it.

you can throw as much mud as you like, when the results arrive (Thanks HoHo) that will be all the knowledge you will require.

Richard Phipps

Quote:

ive made no negative comments

Really?

Quote:

you like to 'imagine' dont you?

Quote:

you do love those fence-sitting qualifiers

Quote:

you can throw as much mud as you like

You must be blind! :D

I'm explained quite a few times now how I've come to my conclusions. You have failed to do that each time I've asked.

So I conclude you know nothing about this. :)

HoHo

Having followed some discussions in AD and some topics here in a.cc I conclude that AJ knows about optimizations way more than average programmer :)

I haven't yet got to the testing but I'm almost there. I just have to move some files around to make room for installing some necessary programs. It shouldn't take more than two hours for first results.

dudaskank

You guys are funny hehehe...

Jonatan Hedborg said:

A faster way would be to make a whole new set of bitmaps, which are greyscale.

The fastest way... it will use more memory too.

Neil Walker said:

For a lot of colour effects I find converting to HSL much easier

Hmm... I don't know nothing about hsl, but converting rgb -> hsl to do the effect and then hsl -> rgb to blit in screen doesn't seen very good if you want real time.

For the others solutions, Kitty Cat and Krzysztof Kluczek can be mixed to get the job... and avoid *, / and % if you can...

In the test Hoho did before:

gray |= ( gray << 8 ) | ( gray << 16 );
line[ x ] = gray * ( ( 1 << 16 ) | ( 1 << 8 ) | 1 );

should be

gray |= ( gray << 8 ) | ( gray << 16 );
line[ x ] = gray;

right?

Richard Phipps said:

...it's better to read in 2 or more bytes than 1.

FBlend uses this optimization too no?

ps: sorry for the English... I'm Brazilian :p

Jonatan Hedborg
Quote:

Jonatan Hedborg said:
A faster way would be to make a whole new set of bitmaps, which are greyscale.

Quote:

The fastest way... it will use more memory too.

Natually. Just about twice as much (assuming the majority of your memory usage is from bitmaps) :)
I still say it's better than the alternative... With this method you could have 133 500 by 500 bitmaps (24bpp) for 200 megs of ram. That is quite a lot of bitmap-space, and not very much in regards to the average RAM of a modern computer...
and at NO performance penalty (afaik).

dudaskank
Jonatan Hedborg said:

Natually. Just about twice as much...

For better memory usage, you can use 8 bpp bitmaps too, with a grayscale palette, so you can simple use set_palette() and then blit()... it will add 25% memory usage for each bitmap in 32bpp. You then will use the method that Tobias Dammers pointed above

Quote:

and not very much in regards to the average RAM of a modern computer

Wow, I want one of those modern computers hehehe... I only have 256 MB :P

HoHo

Finally after some trouble with moving entire /usr directory to other partition I got time to do some more benching :)

First I ran the exact same program I posted before on my home PC, a P4 @3.6GHz running 64bit Gentoo.
Main differences with the previous one are 64 bittiness, ~10% faster CPU clock, 1M vs 2M L2 cache and DDR1@220 vs DDR2@320 theoretically giving ~45% more memory throughput.

Results are here, order is as before:

1Original code
2reordered loop
3reordered loop, extracted pointer
4unrolled by two
5unrolled by four
6approx1
7approx1 unrolled by 2
8approx2
9 
10 
11Testing at 128x128
12Total 0.529107
13Total 0.307977
14Total 0.309927
15Total 0.238763
16Total 0.188685
17Total 0.31115
18Total 0.243885
19Total 0.335091
20 
21Testing at 256x128
22Total 1.0531
23Total 0.663522
24Total 0.621442
25Total 0.493756
26Total 0.376971
27Total 0.618919
28Total 0.485038
29Total 0.663787
30 
31Testing at 256x256
32Total 2.28487
33Total 1.2021
34Total 1.20635
35Total 0.931723
36Total 0.747755
37Total 1.21512
38Total 0.968342
39Total 1.31391
40 
41Testing at 512x256
42Total 6.78625
43Total 2.40071
44Total 2.40732
45Total 1.84377
46Total 1.50971
47Total 2.42776
48Total 1.912
49Total 2.6216
50 
51Testing at 512x512
52Total 13.7214
53Total 4.94516
54Total 4.92838
55Total 3.78014
56Total 3.08194
57Total 5.06025
58Total 3.84202
59Total 5.2358
60 
61Testing at 1024x512
62Total 107.284
63Total 9.99396
64Total 9.96013
65Total 7.78733
66Total 6.32854
67Total 10.0777
68Total 9.01577
69Total 10.9042
70 
71Testing at 1024x1024
72Total 337.148
73Total 20.0585
74Total 20.0961
75Total 15.6973
76Total 13.3376
77Total 20.3229
78Total 18.3092
79Total 21.7947

More registers certainly seem to help because now unrolling by four is clear winner on average being about 20-30% faster than unrolling by two and ~50-70% faster than no unrolling. Extra memory bandwidth also seems to help, especially with the original code.

L2 cache doesn't seem to help a whole lot. With each pixel doubling the time doubles. 1024x512@32 bitmap is a bit bigger than 2M and twice as big and small bitmap processing scales almost linearly in 2x increments. If there would have been a big jump in time that would have indicated that L2 cache was actually useful.

Before I started to do some more benches I decided to literally take a look how those grayscale images look like and saved them to disk. To my suprise almost all of them were blue instead of grayscale, even the original one. Only the approx2 gave grayscale image. It might be because I assume RGB order, also perhaps 64bittiness has something to do with it. I'll take a closer look at it and try to fix the functions.

I can't promise when will I get some more results. I did try converting byte-at-a-time and it seemed to give almost 2-3x speedup compared to the 4x unrolled one but the output is not correct so the real speedup might change. It's 1:30 a.m here so I'm afraid I can't get anything up before tomorrow night. By then I try to make a bit more sophisticated test program.

Perhaps if I get time and I find my PAPI wrapper I can get some more interesting results, like L1/2 cache misses/hits, retired instruction count, cycles stalled waiting for recourses etc :)

[edit]

whoops, I think I found the mistake.
It seems the modifications dudaskank proposed are to blame. I made them shortly after getting the results and before saving out bitmaps. It seems like when I put back line[ x ] = gray * ( ( 1 << 16 ) | ( 1 << 8 ) | 1 ); it works again.

I think I would be much more productive if it wouldn't be so late :P

A J
Quote:

you like to 'imagine' dont you?

you do love those fence-sitting qualifiers

Richard, just facts, you used 'imagine' twice, and most (majority) of your statements in those 3 posts contained maybe,imagine,perhaps etc..

Either you say something because its a fact or you strongly believe it to be a fact, or you dont say it.
If you dont know something, or are uncertain, you ask directly, not this "Perhaps <statement> ?" nonsense.

Btw, im not 'attacking' RichardPhipps.intellect or RichardPhipps.logic or RichardPhipps.emotion, im attacking RichardPhipps.ego which seems to be unwilling to accept what RichardPhipps.intellect is telling it, i strongly apologize if the former 3 elements are offended, if being asked a few direct questions is too much for your ego to handle, i think its time it got put in its place.

GullRaDriel

War are a waste of time. ( and are off-topic ;-p )

Evert
Quote:

Either you say something because its a fact or you strongly believe it to be a fact, or you dont say it.

Rubbish. There's no reason not to say something you're not sure about if you specify that (which one does by adding `I expect...' or `perhaps...').

Quote:

you used 'imagine' twice,

Wasn't that `I imagine that to do X, Y does Z' in the meaning `I don't know for sure, but if I try to think how to do X, Y does Z'?
Which is a fairly normal way of using the phrase `I imagine...', as far as I know.

A J

I have done some test, (smaller value is faster, best to worst)

3830 2x unrolling, array method, better ordering
3964 2x unrolling, array method, bad ordering
4014 1x array method
4340 2x unrolling, ptr method
4840 1x ptr method

So it looks like unrolling a little helps, so does using array notation.

The unrolling amount will be CPU specific, but 2x is a fairly safe on most systems.
Array notation helps the optimz compiler avoid aliasing issues.
And a little re-ordering helps too.

Richard Phipps

AJ: Regardless of what I have said (which Evert understood why I said it) you have not explained why what I said about registers is wrong. I explained my knowledge was limited, but you have not shown me any knowledge of cpu registers despite my repeated requests that you do so. Until you do that why should I believe your views over mine?

Tobias Dammers
Quote:

Perhaps because it's inefficient? It was pretty well known in the past to read in more than one byte if you could.

Because RAM-to-cache is the major bottleneck in modern systems. Reading from cache is cheap, and whether you read one byte at a time or in groups of 4 doesn't make much of a difference. Multiple sequential cache reads are cheaper than reading 32 bits and then manually slicing them into bytes.

Richard Phipps

At last some information! Thanks Tobias.

So, is not the compiled code slicing the cached bytes from 32/64 bit values into 8 bit values itself as each one is read?

Tobias Dammers

Depends on cpu architecture I guess. It will hardly be slower though.

Richard Phipps

All I said was that if the cpu uses registers that are 32 bits or more than I would imagine it's reading in more than one pixel and then splitting it to be more efficient. :)

Tobias Dammers

Yeah. It just so happened that you imagined wrong :-*
Copying data into registers isn't so much of a bottleneck anymore due to cache.

A J

You want to explain asm programming to you ?
It would take about 20 pages..

A register is where work is done, not where data is stored.
You dont read in any more than you can actually work on.
If you read into a register 4 bytes, but want to process a byte, you then need to read into another register more data to operate on it, until you are left with 1 byte. This is 'Asm programming 101: Know your hardware'

Your confusing registers with memory, registers with cache, cache with main memory, just about all of them with each other. There are some fairly fundamental points you've missed that make me wonder just how much if any ams programming you have done.

We are talking about real things, they all have names, there is no need to use any abstraction, so you can refrain from using 'its' which just leaves room for confusion.

Richard Phipps

Ok. Obviously CPU's have moved on since my 8 bit and 16 bit days. Thank you AJ for explaining why after about 5 requests to do so. Next time, it would be much easier if you did it straight away.. ok?

Tobias Dammers
Quote:

A register is where work is done, not where data is stored.

...and in order to work on data, you need to store it somewhere. In other words, data is stored in registers and cache and main memory, all of which are different sorts of memory. Though registers don't necessarily contain what is typically referred to as "data" (when used to distinguish "code" from "data"), technically there isn't too much of a difference - pointers, numbers, cpu instructions - all of these are ultimately represented as bits & bytes.
You load data from main memory into registers to manipulate it, you write data back to main memory when you're done. Only that several levels of cache sit between registers and main memory, and data transfer through the cache is one of the crucial factors to consider when optimizing for a modern cpu.

A J

no.

Richard Phipps

Tobias: Ah I see. I never dealt with cache's on the 8-bit and 16-bit machines. :)
AJ: Kiss my Ass then. :)

Tobias Dammers
Quote:

no.

No what?

I said:

...and in order to work on data, you need to store it somewhere.

Can't see anything wrong with this.

Quote:

In other words, data is stored in registers and cache and main memory, all of which are different sorts of memory. Though registers don't necessarily contain what is typically referred to as "data" (when used to distinguish "code" from "data"), technically there isn't too much of a difference - pointers, numbers, cpu instructions - all of these are ultimately represented as bits & bytes.

Maybe you don't agree, but if they aren't memory, what are they? They all fit the criteria: you can write a certain number of bits to them, and you can read them back. What's special about registers is that the cpu can perform a lot of operations directly on them.

Quote:

You load data from main memory into registers to manipulate it, you write data back to main memory when you're done.

That is the basic (simplified) procedure. I left out the cache part, the fact that at times the values in registers are generated rather than read from somewhere else, and I didn't go into i/o.

Quote:

Only that several levels of cache sit between registers and main memory, and data transfer through the cache is one of the crucial factors to consider when optimizing for a modern cpu.

I didn't say the crucial factor, but one of the. Stuff like thread switching, instruction pairing, branch optimization etc. are equally important. Also, to clarify, by "data transfer through the cache" I am referring to the fact that avoiding cache misses is usually an easy and effective optimization technique.

Anyway, I just tried to explain the general concepts from the C++ point of view; you're the ASM god, so if you care to explain properly, please do so. No sarcasm or offense intended.

HoHo
Quote:

Copying data into registers isn't so much of a bottleneck anymore due to cache.

unless your memory access is very random, like in original code :)

With this particular case caches are not much of use when images are bigger than what fits into cache because by the time that image has gone through processing once the first lines of it have already been flushed back to ram and are needed to fetch into caches again.

When CPU reads some data from ram it automagically copies a chunk into cache. The size of that chunk is the length of a cache line. Usually it is either 32, 64 or 128 bytes. IIRC my P4 should have 64byte cache lines, Opterons and probably regular AMD64's should have 128 byte lines. PM and some older CPU's should have 32byte cache lines.

Now if you access memory the same way as in original code, it gets fun. Every time you need four bytes but CPU prefetches 64, 16x more than needed. So, when you thought for 1024x1024x32bit bitmap you would need 4MiB bandwidth you really use around 64MiB.

My work PC could do ~8 1024x1024x32bit grayscale images per second* or ~400 cycles per pixel. That is about 32MiB in total. Not at all too much. Because of what I said earlier that translates to 0.5GiB/s. Still not too much because theoretical bandwidth should be about 10x as big. So what is the problem with the code?

*) With just reordering the loop it could do 200 images per second or ~25x increase in speed. That is about 16 cycles per pixel.

My guess is that cache misses are to blame*. Cycle count for single pixels seem to confirm that. Reading from L1 takes just a few cycles, usually 2-4. If data is not in L1 it takes about 15-25 cycles to read it from L2. If it is not there too it takes about 200-400 cycles to get it from memory. Comparing that 200 cycles to the time it takes to grayscale a single pixel (~16 cycles) it should be obvious that for the most of the time CPU just sits idle while waiting for new data to arrive.

*) I haven't yet found my code for counting cache misses but when I do I'll test it. I give no promises but I'll try to do it as soon as I can so I can learn something from it and share it with others :)

Hopefully some of you now understand why data structures and memory access patterns are important. They can make your code about an order of magnitude faster if done correctly :)

All of that gets even more interesting once you start dealing with SIMD instructions and memory aligning, software prefetch, read/write around cache, streaming from several sources come into play.

Bob

Guys, quit it. You're not being helpful or even moderately funny.

A J

HoHo, have you ever had software prefetch make much of a difference ?
most of my tests haven't shown much improvement, althou i've only done linear forwards access which seems to get prefetched already.

HoHo
Quote:

HoHo, have you ever had software prefetch make much of a difference ?

I personally have never used it in my code. Most what I have said here is just theory and educated guesses. I try to test and prove the points I make gradually and that is also the reason why I want to perform some thorough test with the code.

Someone programmed a realtime ray tracer and used prefetching when traversing KD-tree*. Usually there is little linear access but it is really simple to prefetch a few steps ahead. IIRC he got ~15-30% speedup.

That ray tracer was once IOTD at flipcode. Basically it was supposed to be the next article in Jacco Bikker's ray tracing article series. Unfortunately it was never published. I have the code for that tracer stored on disk somewhere. If you want to I can send it to you once I find it.

Fortunately he and some other guys have created a new forum for discussing RT and he has a newer version of the tracer somewhere there.
http://ompf.org/forum

Tobias Dammers

So, to wrap it up, 4 options:
a) Use palettes in some way, so you can (ab)use allegro's built-in (and optimized) color conversion routines to achieve grey-scaling.
b) Use one of the functions discussed by HoHo e.a.; whichever works best. Whatever you do, always use y for the outer loop and x for the inner one.
c) Use a different library that supports hardware-accelerated color effects. OpenLayer might actually get you quite far.
d) Pre-calculate everything.

Marcello

Wouldn't a hardware-accelerated grayscale filter require programmable pixel shader?

Marcello

Krzysztof Kluczek
Quote:

Wouldn't a hardware-accelerated grayscale filter require programmable pixel shader?

No. Texture combiners with dot product support would be enough. :)

A J
Quote:

Someone programmed a realtime ray tracer and used prefetching when traversing KD-tree*

i know not what a KD-Tree is, but i presume it uses ptrs to store data, so data will not be linear in memory, therefore prefetch would make sense, and achieve some speed up.

dont bother with that article, i'll do some more research...
i've heard all the theory of prefetch, but have yet to see it work (for me) in practice.

HoHo
Quote:

i presume it uses ptrs to store data, so data will not be linear in memory

You persume wrong. The KDTree is static and calculated in preprocesing step and gets stored in single array. That makes things a bit better than linked list. The data structure is laid out so that close nodes should be quite close in memory too so when one node is prefetched some of it's neibourghs get prefetched too. Each node is 8 bytes so usually quite a lot of them get fetched.

dudaskank

Hoho, that piece of code was on the filterMakeMonochromApprox2() on your post... so when you said it didn't work, I've made some test...

the code:

1public class TesteGray {
2 
3 public static void main(String[] args) {
4 int i, x = 100000000;
5 long t1, t2;
6 int gray;
7 gray = 0x80;
8 System.out.printf("gray: %x\n", gray);
9 System.out.printf("gray rgb1: %x\n", gray * ( ( 1 << 16 ) | ( 1 << 8 ) | 1 ));
10 System.out.printf("gray rgb2: %x\n", gray | ( gray << 8 ) | ( gray << 16));
11 System.out.printf("a simple benchmark...\n");
12 // method rgb1
13 t1 = System.currentTimeMillis();
14 for(i = 0; i < x; i++) {
15 gray = 0x80;
16 gray *= ( (1 << 16) | (1 << 8) | 1);
17 }
18 t2 = System.currentTimeMillis();
19 System.out.printf("rgb1 method %d times: %d ms\n", x, t2 - t1);
20 // method rgb2
21 t1 = System.currentTimeMillis();
22 for(i = 0; i < x; i++) {
23 gray = 0x80;
24 gray |= (gray << 8) | (gray << 16);
25 }
26 t2 = System.currentTimeMillis();
27 System.out.printf("rgb2 method %d times: %d ms\n", x, t2 - t1);
28 }
29}

well it's Java but you get the idea, hehehe... the output in my machine is:

gray: 80
gray rgb1: 808080
gray rgb2: 808080
a simple benchmark...
rgb1 method 100000000 times: 971 ms
rgb2 method 100000000 times: 951 ms

so you can see the method #2 works too... but don't have that super speed improvement that I think, hehehe

A J

that just makes the colour grey{Solid} we (as the topic heading indicates) are trying to do grey{scale}

HoHo

Also there is no memory access.

For a bit better performance numbers you should put that whole function content to a new function an call it around five to ten times in a row. I wouldn't be suprised if speed increases several times.

Thread #580811. Printed from Allegro.cc