|
Greyscale-Filter is too slow. How to improve performance? |
dudaskank
Member #561
July 2000
|
You guys are funny hehehe...

Jonatan Hedborg said: A faster way would be to make a whole new set of bitmaps, which are greyscale.

The fastest way... but it will use more memory too.

Neil Walker said: For a lot of colour effects I find converting to HSL much easier

Hmm... I don't know anything about HSL, but converting RGB -> HSL to apply the effect and then HSL -> RGB to blit to the screen doesn't seem very good if you want real time.

As for the other solutions, Kitty Cat's and Krzysztof Kluczek's can be combined to get the job done... and avoid *, / and % if you can. In the test HoHo did before,

gray |= ( gray << 8 ) | ( gray << 16 );
line[ x ] = gray * ( ( 1 << 16 ) | ( 1 << 8 ) | 1 );

should be

gray |= ( gray << 8 ) | ( gray << 16 );
line[ x ] = gray;

right?

Richard Phipps said: ...it's better to read in 2 or more bytes than 1.

FBlend uses this optimization too, no?

ps: sorry for the English... I'm Brazilian :p

Toque a balada do amor inabalável, eterna love song de nós dois |
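The corrected pair of lines above boils down to a tiny helper. A minimal sketch in C, assuming 8-bit grey values and a 0xRRGGBB pixel layout:

```c
#include <stdint.h>

/* Spread an 8-bit grey value into all three channels of a 32-bit
   0xRRGGBB pixel using only shifts and ORs. Once the shifts have
   been applied, the extra multiply by 0x00010101 is redundant,
   which is exactly the fix suggested above. */
static uint32_t grey_to_rgb(uint32_t grey)
{
    return grey | (grey << 8) | (grey << 16);
}
```

The function name and pixel layout are assumptions; the original benchmark code wrote the result straight into `line[x]`.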
Jonatan Hedborg
Member #4,886
July 2004
|
Quote:
dudaskank said: The fastest way... it will use more memory too.

Naturally. Just about twice as much (assuming the majority of your memory usage is from bitmaps)
|
dudaskank
Member #561
July 2000
|
Jonatan Hedborg said: Naturally. Just about twice as much...

For better memory usage you could use 8 bpp bitmaps with a greyscale palette, so you can simply use set_palette() and then blit()... it would add only 25% memory usage on top of each 32 bpp bitmap. You would then use the method Tobias Dammers pointed out above.

Quote: and not very much in regards to the average RAM of a modern computer

Wow, I want one of those modern computers hehehe... I only have 256 MB

Toque a balada do amor inabalável, eterna love song de nós dois |
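A sketch of the 8 bpp idea: one shared greyscale palette where entry i holds the colour i|i<<8|i<<16, and each pixel is stored as a single byte indexing into it. With Allegro you would install the palette via set_palette() and blit() the 8 bpp bitmap normally; the function names below and the (77R+151G+28B)/256 luminance approximation are my assumptions, not code from this thread.

```c
#include <stdint.h>

/* Hypothetical helper: fill a 256-entry greyscale palette,
   entry i being the 32-bit colour with all channels equal to i. */
static void build_grey_palette(uint32_t pal[256])
{
    for (int i = 0; i < 256; ++i)
        pal[i] = (uint32_t)i | ((uint32_t)i << 8) | ((uint32_t)i << 16);
}

/* Hypothetical helper: map a 32-bit 0xRRGGBB pixel to its
   greyscale palette index via an integer luminance approximation. */
static uint8_t pixel_to_index(uint32_t c)
{
    uint32_t r = (c >> 16) & 0xFF, g = (c >> 8) & 0xFF, b = c & 0xFF;
    return (uint8_t)((77 * r + 151 * g + 28 * b) >> 8);
}
```

Each pixel then costs 1 byte instead of 4, which is where the 25%-of-a-32bpp-bitmap figure comes from.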
HoHo
Member #4,534
April 2004
|
Finally, after some trouble with moving the entire /usr directory to another partition, I got time to do some more benchmarking. First I ran the exact same program I posted before on my home PC, a P4 @ 3.6GHz running 64-bit Gentoo. Results are here, order is as before:
More registers certainly seem to help, because now unrolling by four is the clear winner, on average being about 20-30% faster than unrolling by two and ~50-70% faster than no unrolling. Extra memory bandwidth also seems to help, especially with the original code. L2 cache doesn't seem to help a whole lot: with each doubling of the pixel count the time doubles. A 1024x512@32 bitmap is a bit bigger than 2M, and processing of both larger and smaller bitmaps scales almost linearly in 2x increments. If there had been a big jump in time somewhere, that would have indicated that the L2 cache was actually useful.

Before doing more benchmarks I decided to literally take a look at what those grayscale images look like, and saved them to disk. To my surprise almost all of them were blue instead of grayscale, even the original one; only approx2 gave a grayscale image. It might be because I assume RGB order, or perhaps 64-bitness has something to do with it. I'll take a closer look and try to fix the functions; I can't promise when I'll get more results. I did try converting byte-at-a-time and it seemed to give an almost 2-3x speedup compared to the 4x unrolled version, but since the output is not correct the real speedup might change.

It's 1:30 a.m. here, so I'm afraid I can't get anything up before tomorrow night. By then I'll try to make a somewhat more sophisticated test program. Perhaps, if I get time and find my PAPI wrapper, I can get some more interesting results, like L1/L2 cache misses/hits, retired instruction count, cycles stalled waiting for resources, etc.

[edit] Whoops, I think I found the mistake. I think I would be much more productive if it weren't so late __________ |
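For readers wondering what "unrolling by four" means here, a minimal sketch, assuming 0xRRGGBB pixels and the common (77R+151G+28B)/256 luminance approximation (the benchmarked code may differ in details):

```c
#include <stdint.h>
#include <stddef.h>

/* Greyscale one 0xRRGGBB pixel. */
static uint32_t grey_px(uint32_t c)
{
    uint32_t r = (c >> 16) & 0xFF, g = (c >> 8) & 0xFF, b = c & 0xFF;
    uint32_t grey = (77 * r + 151 * g + 28 * b) >> 8;
    return grey | (grey << 8) | (grey << 16);
}

/* Process four pixels per iteration. The four independent loads and
   computations give the compiler more freedom to keep intermediate
   values in registers and overlap the work. */
static void greyscale_unroll4(uint32_t *px, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        uint32_t a = grey_px(px[i]);
        uint32_t b = grey_px(px[i + 1]);
        uint32_t c = grey_px(px[i + 2]);
        uint32_t d = grey_px(px[i + 3]);
        px[i] = a; px[i + 1] = b; px[i + 2] = c; px[i + 3] = d;
    }
    for (; i < n; ++i)          /* leftover tail pixels */
        px[i] = grey_px(px[i]);
}
```

With fewer spare registers (e.g. 32-bit x86), the four temporaries may spill to the stack, which is one plausible reason unrolling pays off more on 64-bit.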
A J
Member #3,025
December 2002
|
Quote: you like to 'imagine' dont you?

You do love those fence-sitting qualifiers, Richard. Just facts: you used 'imagine' twice, and most (a majority) of your statements in those 3 posts contained maybe, imagine, perhaps, etc. Either you say something because it's a fact, or you strongly believe it to be a fact, or you don't say it. Btw, I'm not 'attacking' RichardPhipps.intellect or RichardPhipps.logic or RichardPhipps.emotion, I'm attacking RichardPhipps.ego, which seems to be unwilling to accept what RichardPhipps.intellect is telling it. I strongly apologize if the former 3 elements are offended; if being asked a few direct questions is too much for your ego to handle, I think it's time it got put in its place. ___________________________ |
GullRaDriel
Member #3,861
September 2003
|
Wars are a waste of time. ( and are off-topic ;-p ) "Code is like shit - it only smells if it is not yours" |
Evert
Member #794
November 2000
|
Quote: Either you say something because its a fact or you strongly believe it to be a fact, or you dont say it. Rubbish. There's no reason not to say something you're not sure about if you specify that (which one does by adding `I expect...' or `perhaps...'). Quote: you used 'imagine' twice,
Wasn't that `I imagine that to do X, Y does Z' in the meaning `I don't know for sure, but if I try to think how to do X, Y does Z'? |
A J
Member #3,025
December 2002
|
I have done some tests (smaller value is faster, best to worst):

3830 - 2x unrolling, array method, better ordering

So it looks like unrolling a little helps, and so does using array notation. The unrolling amount will be CPU-specific, but 2x is fairly safe on most systems. ___________________________ |
Richard Phipps
Member #1,632
November 2001
|
AJ: Regardless of what I said (and Evert understood why I said it), you have not explained why what I said about registers is wrong. I explained that my knowledge was limited, but you have not shown me any knowledge of CPU registers despite my repeated requests that you do so. Until you do, why should I believe your views over mine? |
Tobias Dammers
Member #2,604
August 2002
|
Quote: Perhaps because it's inefficient? It was pretty well known in the past to read in more than one byte if you could.

Because RAM-to-cache is the major bottleneck in modern systems. Reading from cache is cheap, and whether you read one byte at a time or in groups of 4 doesn't make much of a difference. Multiple sequential cache reads are cheaper than reading 32 bits and then manually slicing them into bytes. --- |
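The two access styles being compared are functionally identical; only their cost differs. A sketch of both (the performance claim itself is left to measurement):

```c
#include <stdint.h>

/* Sum the four bytes of a pixel via four one-byte reads. After the
   first read, the whole cache line is resident, so the remaining
   three reads are served from cache. */
static unsigned sum_bytes(const uint8_t *p)
{
    return p[0] + p[1] + p[2] + p[3];
}

/* The same sum via one 32-bit read, sliced manually with shifts
   and masks. Byte order doesn't matter for a sum. */
static unsigned sum_word(uint32_t c)
{
    return (c & 0xFF) + ((c >> 8) & 0xFF)
         + ((c >> 16) & 0xFF) + (c >> 24);
}
```

Which variant compiles to fewer instructions depends on the CPU and compiler, which is Tobias's point about the architecture dependence below.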
Richard Phipps
Member #1,632
November 2001
|
At last some information! Thanks Tobias. So isn't the compiled code itself slicing the cached 32/64-bit values into 8-bit values as each one is read? |
Tobias Dammers
Member #2,604
August 2002
|
Depends on cpu architecture I guess. It will hardly be slower though. --- |
Richard Phipps
Member #1,632
November 2001
|
All I said was that if the CPU uses registers that are 32 bits or more, then I would imagine it's reading in more than one pixel and then splitting them, to be more efficient. |
Tobias Dammers
Member #2,604
August 2002
|
Yeah. It just so happened that you imagined wrong. --- |
A J
Member #3,025
December 2002
|
You want me to explain asm programming to you? A register is where work is done, not where data is stored. You're confusing registers with memory, registers with cache, cache with main memory - just about all of them with each other. There are some fairly fundamental points you've missed that make me wonder just how much, if any, asm programming you have done. We are talking about real things; they all have names. There is no need to use any abstraction, so you can refrain from using 'it', which just leaves room for confusion. ___________________________ |
Richard Phipps
Member #1,632
November 2001
|
Ok. Obviously CPUs have moved on since my 8-bit and 16-bit days. Thank you, AJ, for explaining why, after about 5 requests to do so. Next time it would be much easier if you did it straight away... ok? |
Tobias Dammers
Member #2,604
August 2002
|
Quote: A register is where work is done, not where data is stored.
...and in order to work on data, you need to store it somewhere. In other words, data is stored in registers and cache and main memory, all of which are different sorts of memory. Though registers don't necessarily contain what is typically referred to as "data" (when used to distinguish "code" from "data"), technically there isn't too much of a difference - pointers, numbers, cpu instructions - all of these are ultimately represented as bits & bytes. --- |
A J
Member #3,025
December 2002
|
no. ___________________________ |
Richard Phipps
Member #1,632
November 2001
|
Tobias: Ah, I see. I never dealt with caches on the 8-bit and 16-bit machines. |
Tobias Dammers
Member #2,604
August 2002
|
Quote: no.

No what? I said: "...and in order to work on data, you need to store it somewhere." I can't see anything wrong with that.

Quote: In other words, data is stored in registers and cache and main memory, all of which are different sorts of memory. Though registers don't necessarily contain what is typically referred to as "data" (when used to distinguish "code" from "data"), technically there isn't too much of a difference - pointers, numbers, cpu instructions - all of these are ultimately represented as bits & bytes.

Maybe you don't agree, but if they aren't memory, what are they? They all fit the criteria: you can write a certain number of bits to them, and you can read them back. What's special about registers is that the cpu can perform a lot of operations directly on them.

Quote: You load data from main memory into registers to manipulate it, you write data back to main memory when you're done.

That is the basic (simplified) procedure. I left out the cache part, the fact that at times the values in registers are generated rather than read from somewhere else, and I didn't go into I/O.

Quote: Only that several levels of cache sit between registers and main memory, and data transfer through the cache is one of the crucial factors to consider when optimizing for a modern cpu.

I didn't say the crucial factor, but one of the crucial factors. Stuff like thread switching, instruction pairing, branch optimization etc. is equally important. Also, to clarify, by "data transfer through the cache" I am referring to the fact that avoiding cache misses is usually an easy and effective optimization technique.

Anyway, I just tried to explain the general concepts from the C++ point of view; you're the ASM god, so if you care to explain properly, please do so. No sarcasm or offense intended. --- |
HoHo
Member #4,534
April 2004
|
Quote: Copying data into registers isn't so much of a bottleneck anymore due to cache.

Unless your memory access is very random, like in the original code. In this particular case caches are not much use when images are bigger than what fits into cache, because by the time the image has gone through processing once, its first lines have already been flushed back to RAM and need to be fetched into the caches again.

When the CPU reads some data from RAM it automagically copies a chunk into cache. The size of that chunk is the length of a cache line, usually either 32, 64 or 128 bytes. IIRC my P4 should have 64-byte cache lines, Opterons and probably regular AMD64s should have 128-byte lines, and the PM and some older CPUs should have 32-byte cache lines.

Now, if you access memory the same way as in the original code, it gets fun. Every time you need four bytes, the CPU fetches 64 - 16x more than needed. So where you thought a 1024x1024x32bit bitmap would need 4MiB of bandwidth, you really use around 64MiB. My work PC could do ~8 1024x1024x32bit grayscale images per second*, or ~400 cycles per pixel. That is about 32MiB in total - not too much at all. Because of what I said earlier, that translates to 0.5GiB/s. Still not too much, because the theoretical bandwidth should be about 10x as big. So what is the problem with the code?

My guess is that cache misses are to blame**. Cycle counts for single pixels seem to confirm that. Reading from L1 takes just a few cycles, usually 2-4. If the data is not in L1, it takes about 15-25 cycles to read it from L2. If it is not there either, it takes about 200-400 cycles to get it from memory. Comparing those 200 cycles to the time it takes to grayscale a single pixel (~16 cycles), it should be obvious that most of the time the CPU just sits idle waiting for new data to arrive.

*) With just reordering the loop it could do 200 images per second, or a ~25x increase in speed. That is about 16 cycles per pixel.

**) I haven't yet found my code for counting cache misses, but when I do I'll test it. I give no promises, but I'll try to do it as soon as I can, so I can learn something from it and share it with others.

Hopefully some of you now understand why data structures and memory access patterns are important. They can make your code about an order of magnitude faster if done correctly. All of that gets even more interesting once SIMD instructions, memory aligning, software prefetch, reading/writing around the cache, and streaming from several sources come into play. __________ |
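To make the loop-reordering point concrete, a sketch of the two traversal orders over a w*h pixel buffer. Both compute the same result, but the first walks memory sequentially and uses every byte of each fetched cache line, while the second strides by a whole row per step and wastes most of every line it pulls in:

```c
#include <stdint.h>
#include <stddef.h>

/* Sequential (row-major) traversal: adjacent iterations touch
   adjacent addresses, so each cache line is fully consumed. */
static uint64_t sum_row_major(const uint32_t *img, size_t w, size_t h)
{
    uint64_t s = 0;
    for (size_t y = 0; y < h; ++y)
        for (size_t x = 0; x < w; ++x)
            s += img[y * w + x];
    return s;
}

/* Strided (column-major) traversal: each step jumps w pixels,
   so for a wide image every access lands on a different line. */
static uint64_t sum_col_major(const uint32_t *img, size_t w, size_t h)
{
    uint64_t s = 0;
    for (size_t x = 0; x < w; ++x)
        for (size_t y = 0; y < h; ++y)
            s += img[y * w + x];
    return s;
}
```

This is the kind of reordering that gave the ~25x speedup mentioned in the footnote; the exact factor depends on image width, cache size and line length.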
Bob
Free Market Evangelist
September 2000
|
Guys, quit it. You're not being helpful or even moderately funny. -- |
A J
Member #3,025
December 2002
|
HoHo, have you ever had software prefetch make much of a difference? ___________________________ |
HoHo
Member #4,534
April 2004
|
Quote: HoHo, have you ever had software prefetch make much of a difference?

I personally have never used it in my code. Most of what I have said here is just theory and educated guesses. I try to test and prove the points I make gradually, and that is also the reason why I want to perform some thorough tests with the code.

Someone programmed a realtime ray tracer and used prefetching when traversing the KD-tree*. Usually there is little linear access there, but it is really simple to prefetch a few steps ahead; IIRC he got a ~15-30% speedup. That ray tracer was once IOTD at flipcode. Basically it was supposed to be the next article in Jacco Bikker's ray tracing article series, but unfortunately it was never published. I have the code for that tracer stored on disk somewhere; if you want, I can send it to you once I find it. Fortunately he and some other guys have created a new forum for discussing RT, and he has a newer version of the tracer somewhere there. __________ |
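For reference, `__builtin_prefetch` is the GCC/Clang intrinsic for software prefetch. A sketch of prefetching a fixed distance ahead in a linear pass; the lookahead of 16 pixels (one 64-byte line) is a tuning guess, not a measured value:

```c
#include <stdint.h>
#include <stddef.h>

/* Sum a pixel buffer while hinting the line ~16 pixels ahead into
   cache. __builtin_prefetch is a GCC/Clang extension and is only a
   hint: it never changes the result, only (maybe) the timing. */
static uint64_t sum_with_prefetch(const uint32_t *px, size_t n)
{
    uint64_t s = 0;
    for (size_t i = 0; i < n; ++i) {
        if (i + 16 < n)                    /* stay inside the buffer */
            __builtin_prefetch(&px[i + 16]);
        s += px[i];
    }
    return s;
}
```

On a purely linear walk like this the hardware prefetcher usually does the job already; software prefetch tends to pay off in pointer-chasing structures like the KD-tree traversal mentioned above, where the next address is known a few steps early but isn't a simple stride.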
Tobias Dammers
Member #2,604
August 2002
|
So, to wrap it up, 4 options: --- |
|
|