|
Greyscale-Filter is too slow. How to improve performance? |
dudaskank
Member #561
July 2000
|
You guys are funny hehehe...

Jonatan Hedborg said: A faster way would be to make a whole new set of bitmaps, which are greyscale.

The fastest way... but it will use more memory too.

Neil Walker said: For a lot of colour effects I find converting to HSL much easier

Hmm... I don't know anything about HSL, but converting RGB -> HSL to apply the effect and then HSL -> RGB to blit to the screen doesn't seem very good if you want real time.

As for the other solutions, Kitty Cat's and Krzysztof Kluczek's can be combined to get the job done... and avoid *, / and % if you can. In the test HoHo did before,

gray |= ( gray << 8 ) | ( gray << 16 );
line[ x ] = gray * ( ( 1 << 16 ) | ( 1 << 8 ) | 1 );

should be

gray |= ( gray << 8 ) | ( gray << 16 );
line[ x ] = gray;

right?

Richard Phipps said: ...it's better to read in 2 or more bytes than 1.

FBlend uses this optimization too, no?

ps: sorry for the English... I'm Brazilian :p

Toque a balada do amor inabalável, eterna love song de nós dois |
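The corrected pair of lines above boils down to a tiny helper. A minimal sketch in C, assuming 8-bit grey values and a 0xRRGGBB pixel layout:

```c
#include <stdint.h>

/* Spread an 8-bit grey value into all three channels of a 32-bit
   0xRRGGBB pixel using only shifts and ORs. Once the shifts have
   been applied, the extra multiply by 0x00010101 is redundant,
   which is exactly the fix suggested above. */
static uint32_t grey_to_rgb(uint32_t grey)
{
    return grey | (grey << 8) | (grey << 16);
}
```

The function name and pixel layout are assumptions; the original benchmark code wrote the result straight into `line[x]`.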
Jonatan Hedborg
Member #4,886
July 2004
|
Quote:
dudaskank said: The fastest way... it will use more memory too.

Naturally. Just about twice as much (assuming the majority of your memory usage is from bitmaps)
|
dudaskank
Member #561
July 2000
|
Jonatan Hedborg said: Naturally. Just about twice as much...

For better memory usage you could use 8 bpp bitmaps with a greyscale palette, so you can simply use set_palette() and then blit()... it would add only 25% memory usage on top of each 32 bpp bitmap. You would then use the method Tobias Dammers pointed out above.

Quote: and not very much in regards to the average RAM of a modern computer

Wow, I want one of those modern computers hehehe... I only have 256 MB

Toque a balada do amor inabalável, eterna love song de nós dois |
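A sketch of the 8 bpp idea: one shared greyscale palette where entry i holds the colour i|i<<8|i<<16, and each pixel is stored as a single byte indexing into it. With Allegro you would install the palette via set_palette() and blit() the 8 bpp bitmap normally; the function names below and the (77R+151G+28B)/256 luminance approximation are my assumptions, not code from this thread.

```c
#include <stdint.h>

/* Hypothetical helper: fill a 256-entry greyscale palette,
   entry i being the 32-bit colour with all channels equal to i. */
static void build_grey_palette(uint32_t pal[256])
{
    for (int i = 0; i < 256; ++i)
        pal[i] = (uint32_t)i | ((uint32_t)i << 8) | ((uint32_t)i << 16);
}

/* Hypothetical helper: map a 32-bit 0xRRGGBB pixel to its
   greyscale palette index via an integer luminance approximation. */
static uint8_t pixel_to_index(uint32_t c)
{
    uint32_t r = (c >> 16) & 0xFF, g = (c >> 8) & 0xFF, b = c & 0xFF;
    return (uint8_t)((77 * r + 151 * g + 28 * b) >> 8);
}
```

Each pixel then costs 1 byte instead of 4, which is where the 25%-of-a-32bpp-bitmap figure comes from.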
HoHo
Member #4,534
April 2004
|
Finally, after some trouble with moving the entire /usr directory to another partition, I got time to do some more benchmarking. First I ran the exact same program I posted before on my home PC, a P4 @ 3.6GHz running 64-bit Gentoo. Results are here, order is as before:
More registers certainly seem to help, because now unrolling by four is the clear winner, on average being about 20-30% faster than unrolling by two and ~50-70% faster than no unrolling. Extra memory bandwidth also seems to help, especially with the original code. L2 cache doesn't seem to help a whole lot: with each doubling of the pixel count the time doubles. A 1024x512@32 bitmap is a bit bigger than 2M, and processing of both larger and smaller bitmaps scales almost linearly in 2x increments. If there had been a big jump in time somewhere, that would have indicated that the L2 cache was actually useful.

Before doing more benchmarks I decided to literally take a look at what those grayscale images look like, and saved them to disk. To my surprise almost all of them were blue instead of grayscale, even the original one; only approx2 gave a grayscale image. It might be because I assume RGB order, or perhaps 64-bitness has something to do with it. I'll take a closer look and try to fix the functions; I can't promise when I'll get more results. I did try converting byte-at-a-time and it seemed to give an almost 2-3x speedup compared to the 4x unrolled version, but since the output is not correct the real speedup might change.

It's 1:30 a.m. here, so I'm afraid I can't get anything up before tomorrow night. By then I'll try to make a somewhat more sophisticated test program. Perhaps, if I get time and find my PAPI wrapper, I can get some more interesting results, like L1/L2 cache misses/hits, retired instruction count, cycles stalled waiting for resources, etc.

[edit] Whoops, I think I found the mistake. I think I would be much more productive if it weren't so late __________ |
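For readers wondering what "unrolling by four" means here, a minimal sketch, assuming 0xRRGGBB pixels and the common (77R+151G+28B)/256 luminance approximation (the benchmarked code may differ in details):

```c
#include <stdint.h>
#include <stddef.h>

/* Greyscale one 0xRRGGBB pixel. */
static uint32_t grey_px(uint32_t c)
{
    uint32_t r = (c >> 16) & 0xFF, g = (c >> 8) & 0xFF, b = c & 0xFF;
    uint32_t grey = (77 * r + 151 * g + 28 * b) >> 8;
    return grey | (grey << 8) | (grey << 16);
}

/* Process four pixels per iteration. The four independent loads and
   computations give the compiler more freedom to keep intermediate
   values in registers and overlap the work. */
static void greyscale_unroll4(uint32_t *px, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        uint32_t a = grey_px(px[i]);
        uint32_t b = grey_px(px[i + 1]);
        uint32_t c = grey_px(px[i + 2]);
        uint32_t d = grey_px(px[i + 3]);
        px[i] = a; px[i + 1] = b; px[i + 2] = c; px[i + 3] = d;
    }
    for (; i < n; ++i)          /* leftover tail pixels */
        px[i] = grey_px(px[i]);
}
```

With fewer spare registers (e.g. 32-bit x86), the four temporaries may spill to the stack, which is one plausible reason unrolling pays off more on 64-bit.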
A J
Member #3,025
December 2002
|
Quote: you like to 'imagine' dont you?

You do love those fence-sitting qualifiers, Richard. Just facts: you used 'imagine' twice, and most (a majority) of your statements in those 3 posts contained maybe, imagine, perhaps, etc. Either you say something because it's a fact, or you strongly believe it to be a fact, or you don't say it. Btw, I'm not 'attacking' RichardPhipps.intellect or RichardPhipps.logic or RichardPhipps.emotion, I'm attacking RichardPhipps.ego, which seems to be unwilling to accept what RichardPhipps.intellect is telling it. I strongly apologize if the former 3 elements are offended; if being asked a few direct questions is too much for your ego to handle, I think it's time it got put in its place. ___________________________ |
GullRaDriel
Member #3,861
September 2003
|
Wars are a waste of time. ( and are off-topic ;-p ) "Code is like shit - it only smells if it is not yours" |
Evert
Member #794
November 2000
|
Quote: Either you say something because its a fact or you strongly believe it to be a fact, or you dont say it. Rubbish. There's no reason not to say something you're not sure about if you specify that (which one does by adding `I expect...' or `perhaps...'). Quote: you used 'imagine' twice,
Wasn't that `I imagine that to do X, Y does Z' in the meaning `I don't know for sure, but if I try to think how to do X, Y does Z'? |
A J
Member #3,025
December 2002
|
I have done some tests (smaller value is faster, best to worst):

3830 - 2x unrolling, array method, better ordering

So it looks like unrolling a little helps, and so does using array notation. The unrolling amount will be CPU-specific, but 2x is fairly safe on most systems. ___________________________ |
Richard Phipps
Member #1,632
November 2001
|
AJ: Regardless of what I said (and Evert understood why I said it), you have not explained why what I said about registers is wrong. I explained that my knowledge was limited, but you have not shown me any knowledge of CPU registers despite my repeated requests that you do so. Until you do, why should I believe your views over mine? |
Tobias Dammers
Member #2,604
August 2002
|
Quote: Perhaps because it's inefficient? It was pretty well known in the past to read in more than one byte if you could.

Because RAM-to-cache is the major bottleneck in modern systems. Reading from cache is cheap, and whether you read one byte at a time or in groups of 4 doesn't make much of a difference. Multiple sequential cache reads are cheaper than reading 32 bits and then manually slicing them into bytes. --- |
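The two access styles being compared are functionally identical; only their cost differs. A sketch of both (the performance claim itself is left to measurement):

```c
#include <stdint.h>

/* Sum the four bytes of a pixel via four one-byte reads. After the
   first read, the whole cache line is resident, so the remaining
   three reads are served from cache. */
static unsigned sum_bytes(const uint8_t *p)
{
    return p[0] + p[1] + p[2] + p[3];
}

/* The same sum via one 32-bit read, sliced manually with shifts
   and masks. Byte order doesn't matter for a sum. */
static unsigned sum_word(uint32_t c)
{
    return (c & 0xFF) + ((c >> 8) & 0xFF)
         + ((c >> 16) & 0xFF) + (c >> 24);
}
```

Which variant compiles to fewer instructions depends on the CPU and compiler, which is Tobias's point about the architecture dependence below.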
Richard Phipps
Member #1,632
November 2001
|
At last some information! Thanks Tobias. So isn't the compiled code itself slicing the cached 32/64-bit values into 8-bit values as each one is read? |
Tobias Dammers
Member #2,604
August 2002
|
Depends on cpu architecture I guess. It will hardly be slower though. --- |
Richard Phipps
Member #1,632
November 2001
|
All I said was that if the CPU uses registers that are 32 bits or more, then I would imagine it's reading in more than one pixel and then splitting them, to be more efficient. |
Tobias Dammers
Member #2,604
August 2002
|
Yeah. It just so happened that you imagined wrong. --- |
A J
Member #3,025
December 2002
|
You want me to explain asm programming to you? A register is where work is done, not where data is stored. You're confusing registers with memory, registers with cache, cache with main memory - just about all of them with each other. There are some fairly fundamental points you've missed that make me wonder just how much, if any, asm programming you have done. We are talking about real things; they all have names. There is no need to use any abstraction, so you can refrain from using 'it', which just leaves room for confusion. ___________________________ |
Richard Phipps
Member #1,632
November 2001
|
Ok. Obviously CPUs have moved on since my 8-bit and 16-bit days. Thank you, AJ, for explaining why, after about 5 requests to do so. Next time it would be much easier if you did it straight away... ok? |
Tobias Dammers
Member #2,604
August 2002
|
Quote: A register is where work is done, not where data is stored.
...and in order to work on data, you need to store it somewhere. In other words, data is stored in registers and cache and main memory, all of which are different sorts of memory. Though registers don't necessarily contain what is typically referred to as "data" (when used to distinguish "code" from "data"), technically there isn't too much of a difference - pointers, numbers, cpu instructions - all of these are ultimately represented as bits & bytes. --- |
A J
Member #3,025
December 2002
|
no. ___________________________ |
Richard Phipps
Member #1,632
November 2001
|
Tobias: Ah, I see. I never dealt with caches on the 8-bit and 16-bit machines. |
Tobias Dammers
Member #2,604
August 2002
|
Quote: no.

No what? I said: "...and in order to work on data, you need to store it somewhere." I can't see anything wrong with that.

Quote: In other words, data is stored in registers and cache and main memory, all of which are different sorts of memory. Though registers don't necessarily contain what is typically referred to as "data" (when used to distinguish "code" from "data"), technically there isn't too much of a difference - pointers, numbers, cpu instructions - all of these are ultimately represented as bits & bytes.

Maybe you don't agree, but if they aren't memory, what are they? They all fit the criteria: you can write a certain number of bits to them, and you can read them back. What's special about registers is that the cpu can perform a lot of operations directly on them.

Quote: You load data from main memory into registers to manipulate it, you write data back to main memory when you're done.

That is the basic (simplified) procedure. I left out the cache part, the fact that at times the values in registers are generated rather than read from somewhere else, and I didn't go into I/O.

Quote: Only that several levels of cache sit between registers and main memory, and data transfer through the cache is one of the crucial factors to consider when optimizing for a modern cpu.

I didn't say the crucial factor, but one of the crucial factors. Stuff like thread switching, instruction pairing, branch optimization etc. is equally important. Also, to clarify, by "data transfer through the cache" I am referring to the fact that avoiding cache misses is usually an easy and effective optimization technique.

Anyway, I just tried to explain the general concepts from the C++ point of view; you're the ASM god, so if you care to explain properly, please do so. No sarcasm or offense intended. --- |
HoHo
Member #4,534
April 2004
|
Quote: Copying data into registers isn't so much of a bottleneck anymore due to cache.

Unless your memory access is very random, like in the original code. In this particular case caches are not much use when images are bigger than what fits into cache, because by the time the image has gone through processing once, its first lines have already been flushed back to RAM and need to be fetched into the caches again.

When the CPU reads some data from RAM it automagically copies a chunk into cache. The size of that chunk is the length of a cache line, usually either 32, 64 or 128 bytes. IIRC my P4 should have 64-byte cache lines, Opterons and probably regular AMD64s should have 128-byte lines, and the PM and some older CPUs should have 32-byte cache lines.

Now, if you access memory the same way as in the original code, it gets fun. Every time you need four bytes, the CPU fetches 64 - 16x more than needed. So where you thought a 1024x1024x32bit bitmap would need 4MiB of bandwidth, you really use around 64MiB. My work PC could do ~8 1024x1024x32bit grayscale images per second*, or ~400 cycles per pixel. That is about 32MiB in total - not too much at all. Because of what I said earlier, that translates to 0.5GiB/s. Still not too much, because the theoretical bandwidth should be about 10x as big. So what is the problem with the code?

My guess is that cache misses are to blame**. Cycle counts for single pixels seem to confirm that. Reading from L1 takes just a few cycles, usually 2-4. If the data is not in L1, it takes about 15-25 cycles to read it from L2. If it is not there either, it takes about 200-400 cycles to get it from memory. Comparing those 200 cycles to the time it takes to grayscale a single pixel (~16 cycles), it should be obvious that most of the time the CPU just sits idle waiting for new data to arrive.

*) With just reordering the loop it could do 200 images per second, or a ~25x increase in speed. That is about 16 cycles per pixel.

**) I haven't yet found my code for counting cache misses, but when I do I'll test it. I give no promises, but I'll try to do it as soon as I can, so I can learn something from it and share it with others.

Hopefully some of you now understand why data structures and memory access patterns are important. They can make your code about an order of magnitude faster if done correctly. All of that gets even more interesting once SIMD instructions, memory aligning, software prefetch, reading/writing around the cache, and streaming from several sources come into play. __________ |
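To make the loop-reordering point concrete, a sketch of the two traversal orders over a w*h pixel buffer. Both compute the same result, but the first walks memory sequentially and uses every byte of each fetched cache line, while the second strides by a whole row per step and wastes most of every line it pulls in:

```c
#include <stdint.h>
#include <stddef.h>

/* Sequential (row-major) traversal: adjacent iterations touch
   adjacent addresses, so each cache line is fully consumed. */
static uint64_t sum_row_major(const uint32_t *img, size_t w, size_t h)
{
    uint64_t s = 0;
    for (size_t y = 0; y < h; ++y)
        for (size_t x = 0; x < w; ++x)
            s += img[y * w + x];
    return s;
}

/* Strided (column-major) traversal: each step jumps w pixels,
   so for a wide image every access lands on a different line. */
static uint64_t sum_col_major(const uint32_t *img, size_t w, size_t h)
{
    uint64_t s = 0;
    for (size_t x = 0; x < w; ++x)
        for (size_t y = 0; y < h; ++y)
            s += img[y * w + x];
    return s;
}
```

This is the kind of reordering that gave the ~25x speedup mentioned in the footnote; the exact factor depends on image width, cache size and line length.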
Bob
Free Market Evangelist
September 2000
|
Guys, quit it. You're not being helpful or even moderately funny. -- |
A J
Member #3,025
December 2002
|
HoHo, have you ever had software prefetch make much of a difference? ___________________________ |
HoHo
Member #4,534
April 2004
|
Quote: HoHo, have you ever had software prefetch make much of a difference?

I personally have never used it in my code. Most of what I have said here is just theory and educated guesses. I try to test and prove the points I make gradually, and that is also the reason why I want to perform some thorough tests with the code.

Someone programmed a realtime ray tracer and used prefetching when traversing the KD-tree*. Usually there is little linear access there, but it is really simple to prefetch a few steps ahead; IIRC he got a ~15-30% speedup. That ray tracer was once IOTD at flipcode. Basically it was supposed to be the next article in Jacco Bikker's ray tracing article series, but unfortunately it was never published. I have the code for that tracer stored on disk somewhere; if you want, I can send it to you once I find it. Fortunately he and some other guys have created a new forum for discussing RT, and he has a newer version of the tracer somewhere there. __________ |
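For reference, `__builtin_prefetch` is the GCC/Clang intrinsic for software prefetch. A sketch of prefetching a fixed distance ahead in a linear pass; the lookahead of 16 pixels (one 64-byte line) is a tuning guess, not a measured value:

```c
#include <stdint.h>
#include <stddef.h>

/* Sum a pixel buffer while hinting the line ~16 pixels ahead into
   cache. __builtin_prefetch is a GCC/Clang extension and is only a
   hint: it never changes the result, only (maybe) the timing. */
static uint64_t sum_with_prefetch(const uint32_t *px, size_t n)
{
    uint64_t s = 0;
    for (size_t i = 0; i < n; ++i) {
        if (i + 16 < n)                    /* stay inside the buffer */
            __builtin_prefetch(&px[i + 16]);
        s += px[i];
    }
    return s;
}
```

On a purely linear walk like this the hardware prefetcher usually does the job already; software prefetch tends to pay off in pointer-chasing structures like the KD-tree traversal mentioned above, where the next address is known a few steps early but isn't a simple stride.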
Tobias Dammers
Member #2,604
August 2002
|
So, to wrap it up, 4 options: --- |
|
|