rgb->yuv (yv12) optimization... |
Thomas Fjellstrom
Member #476
June 2000
|
I've been playing with some code to convert a BITMAP to YUV data (specifically YV12)... Now, it works, but it is limited: it assumes 32bpp, and is very slow. I have very little experience with the kinds of optimizations this code will need, i.e. asm, MMX, SSE... If someone would be kind enough to help me optimize the crap out of this code, whether it be ideas, links to exhaustive asm+MMX+SSE docs, or some replacement code, I'd be much obliged.
Just some info... YV12 data is planar, with the Y plane stored first and the two chroma planes following (strictly, YV12 puts V before U; the U-then-V order is I420). For each 2x2 block of 4 pixels you get 4 Y components, and 1 each of U and V. -- |
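To make the planar layout concrete, here is a minimal sketch of where each plane starts in a YV12 buffer. It assumes even dimensions and no row padding (stride equal to width); `yv12_layout` and `yv12_describe` are hypothetical names, not part of Allegro or Xv. Note that YV12 as commonly defined stores the V plane before the U plane.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: byte offsets of the three planes in a YV12 buffer.
 * Assumes even width/height and no padding between rows. */
typedef struct {
    size_t y_offset;    /* full-resolution Y plane comes first        */
    size_t v_offset;    /* quarter-size V plane follows Y in YV12     */
    size_t u_offset;    /* quarter-size U plane comes last            */
    size_t total_size;  /* bytes needed for the whole frame           */
} yv12_layout;

static yv12_layout yv12_describe(size_t width, size_t height)
{
    yv12_layout l;
    size_t y_size = width * height;              /* one Y per pixel       */
    size_t c_size = (width / 2) * (height / 2);  /* one U, V per 2x2 block */

    assert(width % 2 == 0 && height % 2 == 0);

    l.y_offset = 0;
    l.v_offset = y_size;
    l.u_offset = y_size + c_size;
    l.total_size = y_size + 2 * c_size;          /* 1.5 bytes per pixel   */
    return l;
}
```

For a 640x480 frame this gives 307200 bytes of Y followed by two 76800-byte chroma planes.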
Jakub Wasilewski
Member #3,653
June 2003
|
I'm really not sure I am the best person to answer questions like that, but still... I don't know anything about MMX or SSE though, just plain C++.

First, minor things:

    rgba1 += 8; rgba2 = rgba1 + 4; // replace with "rgba2 += 8"
    rgba3 += 8; rgba4 = rgba3 + 4; // replace with "rgba4 += 8"
    yuv1 += 2; yuv2 = yuv1 + 1; // replace with "yuv2 += 2"
    yuv3 += 2; yuv4 = yuv3 + 1; // like above

Then more:

    unsigned char getY(unsigned char R, unsigned char G, unsigned char B) {
        return (R * 299 + G * 587 + B * 114) * 0.001;
    }

    unsigned char getU(unsigned char R, unsigned char G, unsigned char B) {
        // you don't really need 6 digit precision, 3 will be well enough.
        return (R * (-168) + G * (-331) + B * 500) * 0.001 + 128;
    }

    unsigned char getV(unsigned char R, unsigned char G, unsigned char B) {
        return (R * 500 + G * (-419) + B * (-81)) * 0.001 + 128;
    }

This will speed things up because "int + int" and "int * int" are faster than "float + float" and "int * float". But it produces one more "*", so the final speed outcome is not certain... but I think it's worth trying.

Just some suggestions... hope you like them. |
Fladimir da Gorf
Member #1,565
October 2001
|
I think float multiplication shouldn't be too slow nowadays... But if you think that helps then OK.

Quote: rgba1 += 8;

Are you sure that makes any difference speed-wise? I'm quite positive the compiler output will be the same.

The main speed issue seems to be the high number of multiplications. One way could of course be some kind of multiplication look-up table that returns the input already multiplied by its coefficient (R * 299, say, if you're using krajzega's version), but that could even be slower due to cache problems.

However, about ASM and MMX, that might be a good idea, as you need to do the same calculations for every pixel group. But the actual implementation isn't that simple... maybe I should try to get something coded. Of course there are also some MMX tutorials around, like Intel's; it doesn't take long to learn the little there actually is to learn. Though you need to switch your brain to asm programming.

OpenLayer has reached a random SVN version number ;) | Online manual | Installation video! | MSVC projects now possible with cmake | Now available as a Dev-C++ Devpack! (Thanks to Kotori) |
Jakub Wasilewski
Member #3,653
June 2003
|
Well, I said I'm not the best to answer questions like that. Just wanted to help, because nobody had answered this post earlier. As for the "+=8" and "+4"... well, the output probably won't be the same, but if it is... to hell with auto-optimization. |
Fladimir da Gorf
Member #1,565
October 2001
|
No, not at all, you gave me a great idea. But first, I think the best bet without ASM is to avoid the slow int->float and float->int conversions altogether, even if divisions are rather slow, too, by doing:

    unsigned char getY(unsigned char R, unsigned char G, unsigned char B) {
        return (R * 299 + G * 587 + B * 114) / 1000;
    }

    unsigned char getU(unsigned char R, unsigned char G, unsigned char B) {
        return (R * (-168) + G * (-331) + B * 500) / 1000 + 128;
    }

    unsigned char getV(unsigned char R, unsigned char G, unsigned char B) {
        return (R * 500 + G * (-419) + B * (-81)) / 1000 + 128;
    }

The biggest problem with MMX currently seems to be that there are no division or floating point instructions... maybe I can find a hacky way around that. One way would be to use bitshifts to divide by 1024 instead of 1000. The difference isn't that big, but can you afford any precision loss? |
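To put a number on the precision question, here is a small sketch comparing the division by 1000 with a plain shift by 10 using the same, unrescaled coefficients. The function names are made up for illustration. Dividing by 1024 instead of 1000 makes every result about 2.3% darker, so pure white comes out as 249 rather than 255.

```c
/* Luma with the x1000 integer coefficients and an exact division. */
static unsigned char y_div1000(unsigned char r, unsigned char g, unsigned char b)
{
    return (unsigned char)((r * 299 + g * 587 + b * 114) / 1000);
}

/* Same coefficients, but shifting by 10 (dividing by 1024) without
 * rescaling them: cheaper, but the result is slightly too dark. */
static unsigned char y_shift_unscaled(unsigned char r, unsigned char g, unsigned char b)
{
    return (unsigned char)((r * 299 + g * 587 + b * 114) >> 10);
}
```

The error disappears if the coefficients themselves are rescaled to sum to 1024 before shifting.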
Thomas Fjellstrom
Member #476
June 2000
|
I gave that one a try... I don't think the precision loss is all that important; I didn't notice anything horrible with the display, besides a nasty driver problem (my card's scaler seems to be whacked: it seems to be trying to "smooth" the image, and failing miserably).

That change bumped the frame rate from 76 to 115... not too bad. (My little demo uses XFree86's Xv extension to display the final image using my card's overlay, which is capable of colorspace conversion and scaling...)

Quote: As for the "+=8" and "+4"

Have you looked closely at the names of the variables? Making your suggested changes should do exactly what it already does, AFAICS. -- |
Jakub Wasilewski
Member #3,653
June 2003
|
Quote: As for the "+=8" and "+4"

Never mind. Even if it did make any difference, it would be something like 0.5 FPS. Forget it.

    unsigned char getY(unsigned char R, unsigned char G, unsigned char B) {
        // rescaled by 1.024, to allow a bitshift instead of division by 1000.
        return (R * 306 + G * 601 + B * 117) >> 10;
    }

Of course, you should also rescale getU and getV.

Fladimir said: I think float multiplication shouldn't be too slow nowadays...

Well... it seems it is. I win. |
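A sketch of what the rescaled U and V could look like, alongside Y. The coefficients are assumptions derived from the standard 6-digit values (0.168736, 0.331264, 0.418688, 0.081312) multiplied by 1024 and rounded, rather than from the 3-digit versions above. The +128 offset is added before the shift so the value being shifted is never negative, since right-shifting a negative int is implementation-defined in C.

```c
/* Fixed-point RGB -> YUV with coefficients scaled by 1024.
 * Each coefficient row sums to 1024 (Y) or 0 (U, V), so results
 * always land in 0..255 without clamping. */
static unsigned char get_y(unsigned char r, unsigned char g, unsigned char b)
{
    return (unsigned char)((306 * r + 601 * g + 117 * b) >> 10);
}

static unsigned char get_u(unsigned char r, unsigned char g, unsigned char b)
{
    /* (128 << 10) is the +128 offset in the same fixed-point scale. */
    return (unsigned char)((-173 * r - 339 * g + 512 * b + (128 << 10)) >> 10);
}

static unsigned char get_v(unsigned char r, unsigned char g, unsigned char b)
{
    return (unsigned char)((512 * r - 429 * g - 83 * b + (128 << 10)) >> 10);
}
```

This truncates rather than rounds; adding a rounding term would need a clamp, because the largest chroma value would then reach 256.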
George Foot
Member #669
September 2000
|
Even without using MMX, you can change the code to divide by 1024 (the compiler will optimise this to a nice shift), and change the R/G/B coefficients to match. There won't be any precision loss (this is an increase in precision) -- certainly not any significant change in the results, especially if you work out the new coefficients based on the more precise source values you had earlier rather than basing them on these already-rounded versions.

However, if you really can cope with much lower precision, you could use 256, and then you can convert two pixels at once by putting them in the same 32-bit int. This is a bit old-school, but you can write it in plain C, so if you don't know any MMX, SSE, etc, then it may be good for you.

If you want to optimise your inner loop much further, I'd recommend compiling to assembly language, and checking the output for any obvious things that should be optimisable but which the compiler wasn't allowed to do. For example, on every function call that's not inlined, the compiler has to push loads of registers to the stack and must assume that any pointed-to local variables may have had their values changed. In your case, I don't think this would be a problem (the functions should get inlined, and you're not taking the addresses of any local variables, nor are you reading the same data twice from any arrays). |
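One way the "two pixels in one 32-bit int" idea might look in plain C. This is a sketch, not George's code: `luma_pair` is a hypothetical name and the coefficients 77/150/29 are the usual values rounded to sum to 256. Each 16-bit half of the word holds one pixel's channel, and because a half never exceeds 255 * 256 = 65280, nothing carries from the low half into the high one.

```c
#include <stdint.h>

/* Computes luma for two pixels with one set of multiplications.
 * Returns Y of pixel 0 in bits 0-7 and Y of pixel 1 in bits 8-15. */
static uint32_t luma_pair(uint8_t r0, uint8_t g0, uint8_t b0,
                          uint8_t r1, uint8_t g1, uint8_t b1)
{
    /* Pack the same channel of both pixels into one word:
     * pixel 0 in the low 16 bits, pixel 1 in the high 16 bits. */
    uint32_t r2 = (uint32_t)r0 | ((uint32_t)r1 << 16);
    uint32_t g2 = (uint32_t)g0 | ((uint32_t)g1 << 16);
    uint32_t b2 = (uint32_t)b0 | ((uint32_t)b1 << 16);

    /* Coefficients scaled by 256; each half stays below 65536. */
    uint32_t sum = r2 * 77u + g2 * 150u + b2 * 29u;

    uint32_t y0 = (sum >> 8) & 0xFFu;    /* low half divided by 256  */
    uint32_t y1 = (sum >> 24) & 0xFFu;   /* high half divided by 256 */
    return y0 | (y1 << 8);
}
```

The trick only works for Y as written: the U and V coefficients are signed, so their halves would borrow from each other unless an offset is added first.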
A J
Member #3,025
December 2002
|
Some ideas: what about using lookup tables for something? Pre-compute the most expensive calculations in a massive lookup. |
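A lookup-table version of the luma calculation could be sketched as below, with one 256-entry table per channel replacing the three multiplications. The table and function names are hypothetical, and whether this beats plain integer multiplies depends on the CPU and cache behaviour, so it would need measuring.

```c
#include <stdint.h>

/* One table per channel, holding the channel value already
 * multiplied by its x1024 coefficient (3 KB in total). */
static int32_t y_from_r[256], y_from_g[256], y_from_b[256];

/* Must be called once before get_y_lut(). */
static void init_luma_tables(void)
{
    int i;
    for (i = 0; i < 256; i++) {
        y_from_r[i] = i * 306;
        y_from_g[i] = i * 601;
        y_from_b[i] = i * 117;
    }
}

/* Three table reads and two adds instead of three multiplies. */
static unsigned char get_y_lut(unsigned char r, unsigned char g, unsigned char b)
{
    return (unsigned char)((y_from_r[r] + y_from_g[g] + y_from_b[b]) >> 10);
}
```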
Fladimir da Gorf
Member #1,565
October 2001
|
Quote: I don't think the precision loss is all that important, I didn't notice anything horrible with the display...

So you tried with (... >> 10) instead of (... / 1000)? Was the fps 115 when using division or not? And where do you actually need anything like this?

Anyways, here's the MMX blender to do the 3 multiplications at the cost of one:

Note that you have to compile with full optimizations on, or otherwise you'll get some odd compile errors.

EDIT: Whoa... 3 replies before I finished the code...

Quote: what about using lookup tables for something ?

Like I didn't already propose that?

Quote: If you don't need that helluva precision (and you probably don't, since you use unsigned chars anyways), you can gain some more FPS by using:

Well, I already proposed that too... but never mind.

Quote: Well.. it seems it is . I win .

Hehehe... in this case it is, as the numbers need to be converted from int to float. |
Thomas Fjellstrom
Member #476
June 2000
|
Quote: So you tried with (... >> 10) instead of (... / 1000)? Was the fps 115 when using division or not?

The / 1000. It actually got to a steady 120 fps after a reboot.

Quote: And where do you actually need anything like this?

I was thinking of making a nice media-center-like app using Allegro... well, that wouldn't be too hard, but then I was thinking: make it all render to the overlay, not just the video. Not sure how it'll work. If anything, I've learned quite a bit.

Quote: Anyways, here's the MMX blender to do the 3 multiplications at the cost of one

Nice. I'll give that a go in a bit... By "full optimizations" you mean something like "-O3 -ffast-math -funroll-loops -fomit-frame-pointer -march=athlon-xp"? -- |
Fladimir da Gorf
Member #1,565
October 2001
|
For optimizations... eh... those that you get in Dev-C++ by choosing "Full Optimizations".

Oh, and I totally forgot to tell you how to use the function. You need to pass it an array of 4 chars, containing the following:

EDIT:

Quote: I was thinking of making a nice media center like app using allegro...

And you need a TV-like signal or something? |
Thomas Fjellstrom
Member #476
June 2000
|
OK, I swapped that in for the current getY. It wavers around 125 fps... and since I figured that the first component was ignored, I just passed rgbaN-1, though that gives me a grey scale image. And I can't pass just rgbaN, as this is what happens:

    /home/moose/tmp/cc8nIZ3X.s: Assembler messages:
    /home/moose/tmp/cc8nIZ3X.s:85: Error: missing ')'
    /home/moose/tmp/cc8nIZ3X.s:85: Error: junk `(%esp))' after expression

Oh, and that fourth colour component might be alpha...

Edit:

Quote: And you need a TV -like signal or something?

My card's overlays don't support RGB. They only do a few forms of YUV. -- |
Fladimir da Gorf
Member #1,565
October 2001
|
Quote: it wavers around 125 fps...

Well, I guessed the speed difference wouldn't be huge, but it seems that the compiler did something to optimize the original code, or otherwise the speed gain would be higher. However, I see the problem now is that you can't pass the values directly from memory, but need to construct the array, right?

Quote: though that gives me a grey scale image.

That's kind of odd, since the function should deal only with the Y value, which is the lightness, and the color values shouldn't depend on it.

Quote: and I can't pass just rgbaN, as this is what happens:

That's exactly what will happen if the compiler optimizations fail, or you don't have any optimizations on at all. The problem is mainly that the function is inlined, and thus has a limited range of registers to use. One thing to try would be to make sure that the function doesn't get inlined, but that would obviously make it slower. And what about making a volatile pointer to hold rgbaN-1, and then passing pointer+1 to the function? You can also try changing the destination variable to static and check if it makes any difference speed- or compile-wise, but I'm not too optimistic.

EDIT: OK, it seems to cause a strange crash, possibly because the compiler somehow doesn't see that the value of the variable changes in the asm statement. You can try this version instead and see if it's any faster:

Quote: Oh and that fourth colour component might be alpha...

I thought the same, but I figured no one's going to store the sprites in YUV anyway, but just convert the buffer, which obviously doesn't have an alpha channel. |
Bob
Free Market Evangelist
September 2000
|
Quote: "por %%mm0, %%mm0\n" Are you sure? -- |
Thomas Fjellstrom
Member #476
June 2000
|
bob, you're so smart (and I mean that in a not smart ass way) why don't you help a little -- |
Fladimir da Gorf
Member #1,565
October 2001
|
Quote: Are you sure?

No... it should be pxor. What I'm trying to do is set the %%mm0 register to all zeros quickly (por of a register with itself leaves it unchanged, while pxor clears it). If you know a better way, then let me know. Or could I assume that all MMX registers are set to all zeros when the MMX sequence begins? In that case, the whole line can be stripped out.

It also seems that most of the processor time is actually being spent somewhere else in the bitmap conversion function. |
Matt Smith
Member #783
November 2000
|
The advantage of Fladimir's MMX code will be best seen if you do multiple pixels and/or calculate the U & V at the same time. This would mean you could keep the MUL() array in a register.

Quote: Or could I assume that all MMX registers are set to all zeros when the MMX sequence begins?

No, the registers will only be cleared if you do it explicitly. |
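The multiple-pixels idea can be sketched with compiler intrinsics instead of inline asm. This is not Fladimir's MMX routine: it swaps in SSE2 (`<emmintrin.h>`), which uses the same multiply-add instruction (pmaddwd) on 128-bit registers, so it handles four pixels per load and needs no emms afterwards. It assumes x86 with SSE2 and pixels stored as B, G, R, X bytes; `luma4_sse2` is a hypothetical name.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Computes Y for four 32-bit pixels (assumed byte order B, G, R, X). */
static void luma4_sse2(const uint8_t *bgrx, uint8_t *y_out)
{
    /* Explicitly zeroed register (typically a single pxor). */
    const __m128i zero = _mm_setzero_si128();
    /* x1024 coefficients as 16-bit words, lowest word first: B, G, R, X.
     * Loaded once and reused for every pixel. */
    const __m128i coef = _mm_set_epi16(0, 306, 601, 117, 0, 306, 601, 117);

    __m128i px = _mm_loadu_si128((const __m128i *)bgrx);
    __m128i lo = _mm_unpacklo_epi8(px, zero);   /* pixels 0,1 as words */
    __m128i hi = _mm_unpackhi_epi8(px, zero);   /* pixels 2,3 as words */

    /* pmaddwd: per pixel gives [B*117 + G*601, R*306 + X*0]. */
    __m128i mlo = _mm_madd_epi16(lo, coef);
    __m128i mhi = _mm_madd_epi16(hi, coef);

    /* Add the two partial sums of each pixel, then divide by 1024. */
    mlo = _mm_srli_epi32(_mm_add_epi32(mlo, _mm_srli_epi64(mlo, 32)), 10);
    mhi = _mm_srli_epi32(_mm_add_epi32(mhi, _mm_srli_epi64(mhi, 32)), 10);

    /* Results sit in 32-bit lanes 0 and 2 of each register. */
    y_out[0] = (uint8_t)_mm_cvtsi128_si32(mlo);
    y_out[1] = (uint8_t)_mm_cvtsi128_si32(_mm_srli_si128(mlo, 8));
    y_out[2] = (uint8_t)_mm_cvtsi128_si32(mhi);
    y_out[3] = (uint8_t)_mm_cvtsi128_si32(_mm_srli_si128(mhi, 8));
}
```

In a real loop the four results would be packed and stored with one write rather than extracted one by one; the scalar extraction here just keeps the sketch short.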
Bob
Free Market Evangelist
September 2000
|
It would also mean that you average out the pmadd latency, which is a large bottleneck here. -- |
A J
Member #3,025
December 2002
|
Bob, I have just looked into MMX and SSE instruction support for MSVC7 (.NET) and it looks like something I could try. Do you have any good tutorials, URLs, etc.? |
Fladimir da Gorf
Member #1,565
October 2001
|
Quote: the advantage of Fladimir's mmx code will be best seen if you do multiple pixels and/or calculate the U & V at the same time.

That's right. Actually, since there's no floating point math going on in the whole rendering function, could I safely assume that the MMX registers will stay in the state I previously left them in? In that case, I could move the setup code into its own function. That would even save one register in the getY() function, which makes it less prone to compiler oddities.

Quote: do you have any good tutorials, URLs etc..

I already posted a link to Intel's tutorials. That's where I learned ASM from. |
Matt Smith
Member #783
November 2000
|
Here is a first draft of a routine to do 4 pixels at a time. This is wrong because 2 source registers should be used because the 4 pixels will come from 2 lines of the source bitmap, and I haven't put in the quotes and \n needed for embedded asm.
|
Marcello
Member #1,860
January 2002
|
Can I ask what yuv is? Marcello |
Matt Smith
Member #783
November 2000
|
It's more correctly called YCbCr, apparently, according to The ColorSpace FAQ. It's a video signal encoded as luminance (Y) and two colour-difference signals (U & V). YUV is used in broadcast television, MPEG files and most overlay video cards. |
Bob
Free Market Evangelist
September 2000
|
You may want to try something closer to:
This computes the Y channel only. Set (%1) to the source data, (%2) to the source line below it, (%3) to the destination, and (%4) to the destination line just under that. It works on 2x2 blocks, so a full run computes 2 lines at a time.

Edit: Neglected the shift by 10. Added emms. -- |
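For reference, the 2x2 block traversal described here can be written as a plain C loop, which is handy for checking any asm version against. This is a sketch under assumptions, not Bob's routine: it assumes B, G, R, X byte order, even dimensions, unpadded output planes, and YV12 plane order (Y, then V, then U). Unlike the Y-only routine, it also fills in U and V from the average of each block. `bgrx_to_yv12` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Converts a 32-bit B,G,R,X image to YV12, one 2x2 block at a time.
 * src_pitch is the length of one source row in bytes. */
static void bgrx_to_yv12(const uint8_t *src, size_t src_pitch,
                         size_t width, size_t height, uint8_t *dst)
{
    uint8_t *y_plane = dst;
    uint8_t *v_plane = dst + width * height;
    uint8_t *u_plane = v_plane + (width / 2) * (height / 2);
    size_t x, y, i;

    assert(width % 2 == 0 && height % 2 == 0);

    for (y = 0; y < height; y += 2) {
        const uint8_t *row0 = src + y * src_pitch;   /* source line      */
        const uint8_t *row1 = row0 + src_pitch;      /* line under it    */
        uint8_t *y0 = y_plane + y * width;           /* destination line */
        uint8_t *y1 = y0 + width;                    /* line under that  */

        for (x = 0; x < width; x += 2) {
            const uint8_t *p[4] = { row0 + 4 * x, row0 + 4 * x + 4,
                                    row1 + 4 * x, row1 + 4 * x + 4 };
            uint8_t *out[4] = { y0 + x, y0 + x + 1, y1 + x, y1 + x + 1 };
            int r = 0, g = 0, b = 0;

            for (i = 0; i < 4; i++) {
                /* One Y per pixel, x1024 coefficients, shift by 10. */
                *out[i] = (uint8_t)((306 * p[i][2] + 601 * p[i][1]
                                     + 117 * p[i][0]) >> 10);
                b += p[i][0];
                g += p[i][1];
                r += p[i][2];
            }

            /* One U and one V per block, from the rounded average colour. */
            r = (r + 2) / 4;
            g = (g + 2) / 4;
            b = (b + 2) / 4;
            *u_plane++ = (uint8_t)((-173 * r - 339 * g + 512 * b + (128 << 10)) >> 10);
            *v_plane++ = (uint8_t)((512 * r - 429 * g - 83 * b + (128 << 10)) >> 10);
        }
    }
}
```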