rgb->yuv (yv12) optimization...
Thomas Fjellstrom
Member #476
June 2000

I've been playing with some code to convert a BITMAP to YUV data (specifically YV12)...

Now, it works, but it's limited: it assumes 32bpp, and it's very slow. :(

I have very little experience with the kinds of optimizations this code will need, i.e. asm, MMX, SSE...

If someone would be kind enough to help me optimize the crap out of this code, whether it be ideas, links to exhaustive asm+MMX+SSE docs, or some replacement code ;) I'd be much obliged. :D

unsigned char* convertToYUV( unsigned char* rgbbuffer, unsigned char* yuv_buffer, int width, int height );

int bitmap2xvimage(BITMAP *src, XvImage *image)
{
    /* Assumes a 32bpp BITMAP with contiguous rows; returns 1 on success. */
    return convertToYUV(src->dat, (unsigned char *) image->data, src->w, src->h) ? 1 : 0;
}

double getY( unsigned char R, unsigned char G, unsigned char B ) {
    return( R * 0.299 + G * 0.587 + B * 0.114 );
}

double getU( unsigned char R, unsigned char G, unsigned char B ) {
    return( R * -0.168736 + G * -0.331264 + B * 0.500 + 128 );
}

double getV( unsigned char R, unsigned char G, unsigned char B ) {
    return( R * 0.500 + G * -0.418688 + B * -0.081312 + 128 );
}

unsigned char* convertToYUV( unsigned char* rgbbuffer, unsigned char* yuv_buffer, int width, int height ) {
    unsigned int framesize;
    unsigned char *rgba1, *yuv1;
    unsigned char *rgba2, *yuv2;
    unsigned char *rgba3, *yuv3;
    unsigned char *rgba4, *yuv4;
    unsigned char r1, r2, r3, r4, g1, g2, g3, g4, b1, b2, b3, b4;
    unsigned char* y_plane;
    unsigned char* u_plane;
    unsigned char* v_plane;
    int row = 0, col = 0;
    int plane_size;

    /* 1 y-plane + 1/4 u-plane + 1/4 v-plane (computed for reference, not used below) */
    framesize = (unsigned int) ( width * height * 1.5 );

    plane_size = width * height;

    /* Note: U is written before V here, which is I420 order;
       strict YV12 puts the V plane first. */
    y_plane = yuv_buffer;
    u_plane = yuv_buffer + plane_size;
    v_plane = u_plane + plane_size/4;

    /* Each iteration handles two source rows, i.e. one row of 2x2 blocks. */
    for (row=0; row < height/2; row++) {
        rgba1 = rgbbuffer+(row*width << 3); // rgbbuffer+(2*row*width*4);
        rgba2 = rgba1 + 4;
        rgba3 = rgbbuffer+((2*row+1)* (width<<2)); // rgbbuffer+((2*row+1)*width*4);
        rgba4 = rgba3 + 4;
        yuv1 = y_plane + 2*row * width;
        yuv2 = yuv1 + 1;
        yuv3 = y_plane + (2*row + 1) * width;
        yuv4 = yuv3 + 1;

        for (col=0; col < width/2; col++) {
            /* One Y sample per pixel of the 2x2 block */
            r1 = rgba1[0];
            g1 = rgba1[1];
            b1 = rgba1[2];
            *yuv1 = (unsigned char) getY( r1, g1, b1 );

            r2 = rgba2[0];
            g2 = rgba2[1];
            b2 = rgba2[2];
            *yuv2 = (unsigned char) getY( r2, g2, b2 );

            r3 = rgba3[0];
            g3 = rgba3[1];
            b3 = rgba3[2];
            *yuv3 = (unsigned char) getY( r3, g3, b3 );

            r4 = rgba4[0];
            g4 = rgba4[1];
            b4 = rgba4[2];
            *yuv4 = (unsigned char) getY( r4, g4, b4 );

            /* One U and one V sample from the averaged colour of the block */
            r1 = (r1+r2+r3+r4)>>2;
            g1 = (g1+g2+g3+g4)>>2;
            b1 = (b1+b2+b3+b4)>>2;
            *u_plane = (unsigned char) getU( r1, g1, b1 );
            *v_plane = (unsigned char) getV( r1, g1, b1 );
            u_plane++; v_plane++;

            rgba1 += 8;
            rgba2 = rgba1 + 4;
            rgba3 += 8;
            rgba4 = rgba3 + 4;
            yuv1 += 2;
            yuv2 = yuv1 + 1;
            yuv3 += 2;
            yuv4 = yuv3 + 1;
        }
    }

    return( yuv_buffer );
}

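Just some info... YV12 data is planar: the full-resolution Y plane is stored first, followed by the two quarter-size chroma planes. For each 2x2 block of 4 pixels you get 4 Y components and 1 each of U and V. (Strictly speaking, YV12 stores the V plane before the U plane; the Y, U, V order is I420.)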
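
To make the planar layout concrete, here is a minimal sketch of how the plane offsets work out for a tightly packed 4:2:0 buffer. The struct and function names are made up for illustration, and it assumes even width and height with no row padding; a real XvImage reports its own pitches and offsets, which should be preferred.

```c
#include <stddef.h>

/* Hypothetical helper: byte offsets of the three planes in a tightly
   packed 4:2:0 planar buffer (no row padding, even width and height). */
typedef struct {
    size_t y_offset;       /* full-resolution luma plane */
    size_t chroma1_offset; /* first quarter-size chroma plane */
    size_t chroma2_offset; /* second quarter-size chroma plane */
    size_t total_size;     /* width * height * 3 / 2 */
} PlanarLayout;

PlanarLayout planar420_layout(int width, int height)
{
    PlanarLayout l;
    size_t y_size = (size_t)width * (size_t)height;
    size_t c_size = y_size / 4; /* one chroma sample per 2x2 block */

    l.y_offset = 0;
    l.chroma1_offset = y_size;
    l.chroma2_offset = y_size + c_size;
    l.total_size = y_size + 2 * c_size;
    return l;
}
```

Which chroma plane is U and which is V depends on the format: V first for YV12, U first for I420.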

--
Thomas Fjellstrom - [website] - [email] - [Allegro Wiki] - [Allegro TODO]
"If you can't think of a better solution, don't try to make a better solution." -- weapon_S
"The less evidence we have for what we believe is certain, the more violently we defend beliefs against those who don't agree" -- https://twitter.com/neiltyson/status/592870205409353730

Jakub Wasilewski
Member #3,653
June 2003

I'm really not sure I'm the best person to answer questions like this, but still... I don't know anything about MMX or SSE though, just plain C++ :P.

First, minor things:

   rgba1 += 8;
   rgba2 = rgba1 + 4; // replace with "rgba2 += 8"
   rgba3 += 8;
   rgba4 = rgba3 + 4; // replace with "rgba4 += 8"
   yuv1  += 2;
   yuv2  = yuv1 + 1; // replace with "yuv2 += 2"
   yuv3  += 2;
   yuv4  = yuv3 + 1; // like above

Then more:

unsigned char getY(unsigned char R, unsigned char G, unsigned char B)
{
  return (R * 299 + G * 587 + B * 114) * 0.001;
}

unsigned char getU(unsigned char R, unsigned char G, unsigned char B)
{
  return (R * (-168) + G * (-331) + B * 500) * 0.001 + 128; // you don't really need 6 digit precision, 3 will be well enough.
} 

unsigned char getV(unsigned char R, unsigned char G, unsigned char B)
{
  return (R * 500 + G * (-419) + B * (-81)) * 0.001 + 128;
}

This will speed things up because "int + int" and "int * int" are faster than "float + float" and "int * float". But it produces one more "*", so the final speed outcome is not certain... but I think it's worth trying.

Just some suggestions... hope you like them.

---------------------------
[ ChristmasHack! | My games ] :::: One CSS to style them all, One Javascript to script them, / One HTML to bring them all and in the browser bind them / In the Land of Fantasy where Standards mean something.

Fladimir da Gorf
Member #1,565
October 2001

I think float multiplication shouldn't be too slow nowadays... But if you think that helps then OK.

Quote:

rgba1 += 8;
rgba2 = rgba1 + 4; // replace with "rgba2 += 8"

Are you sure that makes any difference speed wise? I'm quite positive the compiler output will be the same.

The main speed issue seems to be the high number of multiplications. One way could of course be some kind of a multiplication look-up table that returns the same number multiplied by R * 299 (if you're using the krajzega's version), but that could even be slower due to cache problems.

However, about ASM and MMX, that might be a good idea, as you need to do the same calculations for every pixel group. But the actual implementation isn't that simple... maybe I should try to get something coded.

Of course there are also some MMX tutorials around, like Intel's. It doesn't take long to learn what little there actually is to learn, though you need to switch your brain to asm programming.

OpenLayer has reached a random SVN version number ;) | Online manual | Installation video!| MSVC projects now possible with cmake | Now alvailable as a Dev-C++ Devpack! (Thanks to Kotori)

Jakub Wasilewski
Member #3,653
June 2003

Well, I said I'm not the best person to answer questions like that. I just wanted to help, because nobody had answered this post earlier ;).

As for the "+=8" and "+4"... well, the output probably won't be the same, but if it will... to hell with auto-optimization :P.

Fladimir da Gorf
Member #1,565
October 2001

No, not at all, you gave me a great idea. But first, I think the best bet without ASM is to avoid the slow int->float and float->int conversions altogether, even though divisions are rather slow too, by doing:

unsigned char getY(unsigned char R, unsigned char G, unsigned char B)
{
  return (R * 299 + G * 587 + B * 114) / 1000;
}

unsigned char getU(unsigned char R, unsigned char G, unsigned char B)
{
  return (R * (-168) + G * (-331) + B * 500) / 1000 + 128;
} 

unsigned char getV(unsigned char R, unsigned char G, unsigned char B)
{
  return (R * 500 + G * (-419) + B * (-81)) / 1000 + 128;
}

The biggest problem with MMX currently seems to be that there are no division or floating point instructions... maybe I can find a hacky way around that. One way would be to use bitshifts to divide by 1024 instead of 1000. The difference isn't that big, but can you afford any precision loss?

Thomas Fjellstrom
Member #476
June 2000

I gave that one a try... I don't think the precision loss is all that important, I didn't notice anything horrible with the display... besides a nasty driver problem ;) (my card's scaler seems to be whacked... it seems to be trying to "smooth" the image, and failing miserably)

That change bumped the frame rate from 76 to 115... not too bad. (My little demo uses XFree86's Xv extension to display the final image using my card's overlay, which is capable of colorspace conversion and scaling...)

Quote:

As for the "+=8" and "+4"

Have you looked closely at the names of the variables? Making your suggested changes should do exactly what it already does, AFAICS.

Jakub Wasilewski
Member #3,653
June 2003

Quote:

As for the "+=8" and "+4"

Never mind. Even if it did make any difference, it would be something like 0.5 FPS. Forget it.
I'm glad the first one works ;). If you don't need that helluva precision (and you probably don't, since you use unsigned chars anyways), you can gain some more FPS by using:

unsigned char getY(unsigned char R, unsigned char G, unsigned char B)
{
  return (R * 306 + G * 601 + B * 117) >> 10; // rescaled by 1.024, to allow bitshift instead of division by 1000.
}

Of course, you should also rescale getU and getV.

Fladimir said:

I think float multiplication shouldn't be too slow nowadays...

Well.. it seems it is :). I win :P.

George Foot
Member #669
September 2000

Even without using MMX, you can change the code to divide by 1024 (the compiler will optimise this to a nice shift), and change the R/G/B coefficients to match. There won't be any precision loss (this is an increase in precision) -- certainly not any significant change in the results, especially if you work out the new coefficients based on the more precise source values you had earlier rather than basing them on these already-rounded versions.
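
As a sketch of that suggestion, the functions below use weights obtained by multiplying the six-digit coefficients from the first post by 1024 and rounding. The function names are invented for this example. For U and V the +128 offset is folded in before the shift, so the value being shifted is never negative (right-shifting a negative int is implementation-defined in C).

```c
/* Fixed-point RGB -> YCbCr with weights scaled by 1024.
   The Y weights sum to 1024; the U and V weights each sum to 0. */
unsigned char fixedY(unsigned char r, unsigned char g, unsigned char b)
{
    return (unsigned char)((306 * r + 601 * g + 117 * b) >> 10);
}

unsigned char fixedU(unsigned char r, unsigned char g, unsigned char b)
{
    /* (128 << 10) is the +128 offset in the same fixed-point scale. */
    return (unsigned char)((-173 * r - 339 * g + 512 * b + (128 << 10)) >> 10);
}

unsigned char fixedV(unsigned char r, unsigned char g, unsigned char b)
{
    return (unsigned char)((512 * r - 429 * g - 83 * b + (128 << 10)) >> 10);
}
```

With these weights the results stay within 0..255 for every 8-bit input, so no clamping is needed.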

However, if you really can cope with much lower precision, you could use 256, and then you can convert two pixels at once by putting them in the same 32-bit int. This is a bit old-school, but you can write it in plain C, so if you don't know any MMX, SSE, etc, then it may be good for you.
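
One possible reading of the two-pixels-in-one-int idea is sketched below; it is an illustration, not necessarily what George had in mind, and the function name is hypothetical. Each colour channel of the two pixels goes into the low and high 16-bit halves of a 32-bit word. With weights that sum to 256, the largest value in either half is 255 * 256 = 65280, which still fits in 16 bits, so the halves never spill into each other.

```c
#include <stdint.h>

/* Compute luma for two pixels with one set of multiplications,
   using 8-bit weights (77 + 150 + 29 = 256). */
void luma_pair(const unsigned char rgb0[3], const unsigned char rgb1[3],
               unsigned char *y0, unsigned char *y1)
{
    /* Pixel 0 lives in bits 0..15, pixel 1 in bits 16..31. */
    uint32_t r = (uint32_t)rgb0[0] | ((uint32_t)rgb1[0] << 16);
    uint32_t g = (uint32_t)rgb0[1] | ((uint32_t)rgb1[1] << 16);
    uint32_t b = (uint32_t)rgb0[2] | ((uint32_t)rgb1[2] << 16);

    /* 77/256, 150/256 and 29/256 approximate 0.299, 0.587 and 0.114. */
    uint32_t acc = 77u * r + 150u * g + 29u * b;

    *y0 = (unsigned char)((acc >> 8) & 0xFF);  /* low half divided by 256 */
    *y1 = (unsigned char)((acc >> 24) & 0xFF); /* high half divided by 256 */
}
```

Whether this beats the one-pixel version depends on how cheaply the channels can be packed from the source layout, so it is worth measuring.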

If you want to optimise your inner loop much further, I'd recommend compiling to assembly language, and checking the output for any obvious things that should be optimisable but which the compiler wasn't allowed to do. For example, on every function call that's not inlined, the compiler has to push loads of registers to the stack and must assume that any pointed-to local variables may have had their values changed.

In your case, I don't think this would be a problem (the functions should get inlined, and you're not taking the addresses of any local variables, nor are you reading the same data twice from any arrays).

A J
Member #3,025
December 2002

some ideas:

what about using lookup tables for something ?

pre-compute the most expensive calculations in a massive lookup.
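
A lookup-table version could look roughly like this. It is only a sketch with made-up names: three 256-entry tables hold each channel's contribution to Y, already scaled by 1024, so the per-pixel work is three loads, two adds and a shift. Three small tables (about 3 KB with 4-byte ints) are used rather than one huge table, to keep the cache pressure down.

```c
/* Per-channel contributions to Y, scaled by 1024. */
static int y_from_r[256], y_from_g[256], y_from_b[256];

/* Must be called once before lookupY() is used. */
void init_luma_tables(void)
{
    int i;
    for (i = 0; i < 256; i++) {
        y_from_r[i] = 306 * i;
        y_from_g[i] = 601 * i;
        y_from_b[i] = 117 * i;
    }
}

unsigned char lookupY(unsigned char r, unsigned char g, unsigned char b)
{
    return (unsigned char)((y_from_r[r] + y_from_g[g] + y_from_b[b]) >> 10);
}
```

On a CPU with a fast integer multiplier this may not be any quicker than multiplying directly, so it needs benchmarking.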

___________________________
The more you talk, the more AJ is right. - ML

Fladimir da Gorf
Member #1,565
October 2001

Quote:

I don't think the precision loss is all that important, I didn't notice anything horrible with the display...

So you tried with (... >> 10) instead of (... / 1000)? Was the fps 115 when using division or not?

And where do you actually need anything like this?

Anyways, here's the MMX blender to do the 3 multiplications at the cost of one:

unsigned char getY( unsigned char *SourceRGB ) {
    static unsigned short MUL[4] = { 0, 299, 587, 114 };
    unsigned long destination = 33; // Or whatever, the number should be overwritten anyways, I just used 33 to detect mmx issues //
    asm (
        "por %%mm0, %%mm0\n"
        "movd (%1), %%mm1\n"
        "punpcklbw %%mm0, %%mm1\n"
        "pmaddwd (%2), %%mm1\n"
        "movq %%mm1, %%mm3\n"
        "psrlq $32, %%mm3\n"
        "paddd %%mm3, %%mm1\n"
        "movd %%mm1, %0\n"
        "emms\n"
        : "=&a" (destination)
        : "rm" (SourceRGB), "rm" (MUL)
        : "memory"
    );
    return destination >> 10;
}

Note that you have to compile with full optimizations on, or otherwise you'll get some odd compile errors.

EDIT: Whoah... 3 replies before I finished the code...

Quote:

what about using lookup tables for something ?

Like I didn't already propose that?

Quote:

If you don't need that helluva precision (and you probably don't, since you use unsigned chars anyways), you can gain some more FPS by using:

Well, I already proposed that too... but nevermind.

Quote:

Well.. it seems it is :). I win :P.

Hehehe... in this case, it is, as the numbers need to be converted to float from int.

Thomas Fjellstrom
Member #476
June 2000

Quote:

So you tried with (... >> 10) instead of (... / 1000)? Was the fps 115 when using division or not?

With the / 1000. It actually got to a steady 120fps after a reboot.

Quote:

And where do you actually need anything like this?

I was thinking of making a nice media center like app using allegro... well, that wouldn't be too hard, but then I was thinking: make it all render to the overlay, not just the video. Not sure how it'll work. If anything, I've learned quite a bit ;)

Quote:

Anyways, here's the MMX blender to do the 3 multiplications at the cost of one

:o nice. I'll give that a go in a bit.. by "full optimizations" you mean something like "-O3 -ffast-math -funroll-loops -fomit-frame-pointer -march=athlon-xp"?

Fladimir da Gorf
Member #1,565
October 2001

For optimizations... eh... those that you get in Dev-C++ by choosing "Full Optimizations" ;)
I think just plain -O3 should do. If it doesn't, try adding some extra. ;D

Oh, and I totally forgot to tell how to use the function. You need to pass it an array of 4 chars, containing the following:
{ 0, R, G, B }
As you can see, the first one is unused. If mankind ever discovers a fourth color component, you'll be the first one to be able to use it... ;) The array should look similar to the MUL array in the function.
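
For anyone who doesn't read MMX, here is a plain-C model of what the instruction sequence computes on a { 0, R, G, B } input. The function name is invented for this sketch. One deliberate difference: the posted weights sum to 1000, so combined with the >> 10 a white pixel would come out as 249 rather than 255. The model below uses weights rescaled to sum to 1024, which is an assumption of this example and not the posted values.

```c
#include <stdint.h>

/* Plain-C model of the MMX sequence: unpack four bytes to 16-bit words,
   multiply pairwise by the weights and add adjacent products (which is
   what pmaddwd does), then add the two partial sums and shift. */
unsigned char getY_model(const unsigned char src[4])
{
    /* Weights scaled by 1024 to match the >> 10 below (assumption). */
    static const int16_t mul[4] = { 0, 306, 601, 117 };
    int16_t words[4];
    int32_t lo, hi;
    int i;

    for (i = 0; i < 4; i++)      /* punpcklbw against a zeroed register */
        words[i] = src[i];

    lo = words[0] * mul[0] + words[1] * mul[1]; /* pmaddwd, low dword  */
    hi = words[2] * mul[2] + words[3] * mul[3]; /* pmaddwd, high dword */

    return (unsigned char)((lo + hi) >> 10);    /* add halves, then scale */
}
```

The first byte is multiplied by a zero weight, which is why its value does not matter.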

EDIT:

Quote:

I was thinking of making a nice media center like app using allegro...

And you need a TV -like signal or something?

Thomas Fjellstrom
Member #476
June 2000

ok, I swapped that for the current getY. it wavers around 125 fps... and since I figured that the first component was ignored, I just passed rgbaN-1. though that gives me a grey scale image. and I can't pass just rgbaN, as this is what happens:

/home/moose/tmp/cc8nIZ3X.s: Assembler messages:
/home/moose/tmp/cc8nIZ3X.s:85: Error: missing ')'
/home/moose/tmp/cc8nIZ3X.s:85: Error: junk `(%esp))' after expression

Oh and that fourth colour component might be alpha...

edit:

Quote:

And you need a TV -like signal or something?

My card's overlays don't support RGB. They only do a few forms of YUV.

Fladimir da Gorf
Member #1,565
October 2001

Quote:

it wavers around 125 fps...

Well, I guessed the speed difference won't be huge, but it seems that the compiler did something to optimize the original code, or otherwise the speed gain would be higher.

However, I see the problem now is that you can't pass the values directly from the memory, but you need to construct the array, right?

Quote:

though that gives me a grey scale image.

That's kind of odd, since the function should deal only with the Y -value, which is the lightness, and the color value shouldn't depend on it.

Quote:

and I can't pass just rgbaN, as this is what happens:

That's exactly what will happen if the compiler optimizations fail, or you don't have any optimizations on at all. The problem is mainly that the function is inlined, and thus has a limited range of registers to use. One thing to try would be to be sure that the function doesn't get inlined, but that'd make it obviously slower. And what about making a volatile pointer to hold rgbaN-1, and then pass pointer+1 to the function.

You can also try changing the destination variable to static and check if it makes any difference speed/compile wise, but I'm not too positive.

EDIT: OK, it seems to cause a strange crash, possibly because the compiler somehow doesn't see the value of the variable changes in the asm statement. You can try this version instead and see if it's any faster:
(Code edited out, see below)
EDIT2: I noticed a big bad memory leak in the code I included in this post earlier. To prevent it from crashing, I had to drop the static keyword from the allocated destination array, which caused the memory to be allocated during every function call but never freed. The following is a rather hacky workaround:

unsigned int getY( unsigned char *SourceRGB ) {
    static unsigned short MUL[4] = { 0, 299, 587, 114 };
    static unsigned long *destinationStatic = new unsigned long[2];
    unsigned long *volatile destination = destinationStatic;
    asm (
        "por %%mm0, %%mm0\n"
        "movd (%1), %%mm1\n"
        "punpcklbw %%mm0, %%mm1\n"
        "pmaddwd (%2), %%mm1\n"
        "movq %%mm1, (%0)\n"
        "emms\n"
        : "=&a" (destination)
        : "rm" (SourceRGB), "rm" (MUL)
        : "memory"
    );
    return (destination[0] + destination[1]) >> 10;
}

Quote:

Oh and that fourth colour component might be alpha...

I thought the same, but I figured no one's going to store the sprites in YUV anyway, but just convert the buffer, which obviously doesn't have an alpha channel.

Bob
Free Market Evangelist
September 2000

Quote:

"por %%mm0, %%mm0\n"

Are you sure? ;)

--
- Bob
[ -- All my signature links are 404 -- ]

Thomas Fjellstrom
Member #476
June 2000

bob, you're so smart (and I mean that in a not smart ass way) why don't you help a little ;D

Fladimir da Gorf
Member #1,565
October 2001

Quote:

Are you sure?

No... it should be pxor. What I'm trying to do is set the %%mm0 register to all zeros quickly. If you know a better way, let me know.

Or could I assume that all MMX registers are set to all zeros when the MMX sequence begins? In that case, the whole line can be stripped out.

It also seems that most of the CPU time is actually spent elsewhere in the bitmap conversion function.

Matt Smith
Member #783
November 2000

the advantage of Fladimir's mmx code will be best seen if you do multiple pixels and/or calculate the U & V at the same time. This would mean you would keep the MUL() array in a register.

Quote:

Or could I assume that all MMX registers are set to all zeros when the MMX sequence begins?

No, the registers will only be cleared if you do it explicitly.

Bob
Free Market Evangelist
September 2000

It would also mean that you average out the pmadd latency, which is a large bottleneck here.

A J
Member #3,025
December 2002

bob, i have just looked into MMX and SSE instruction support for MSVC7(.net) and it looks like something i could try.

do you have any good tutorials, URLs etc..
seems my image processing could really do with some of these SIMD instructions.

Fladimir da Gorf
Member #1,565
October 2001

Quote:

the advantage of Fladimir's mmx code will be best seen if you do multiple pixels and/or calculate the U & V at the same time.

That's right. Actually, since there's no floating point math going on in the whole rendering function, could I safely assume that the MMX registers will stay in the state I previously left them in? In that case, I could move the setup code into its own function. That'd even save one register in the getY() function, which makes it less prone to compiler oddities.

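void SetupGetY() {
    /* Weights rescaled to sum to 1024 so that they match the >> 10 in getY(). */
    static unsigned short MUL[4] = { 0, 306, 601, 117 };
    asm volatile (
        "pxor %%mm0, %%mm0\n"       /* mm0 = 0, used when unpacking bytes */
        "movq (%0), %%mm2\n"        /* mm2 = weights; MUL is operand 0 here */
        :
        : "r" (MUL)                 /* "r" so that (%0) is a valid address */
        : "memory"
    );
}


unsigned int getY( unsigned char *SourceRGB ) {
    /* Scratch space for the two 32-bit partial sums. */
    static unsigned int destination[2];
    asm volatile (
        "movd (%1), %%mm1\n"        /* load { 0, R, G, B } */
        "punpcklbw %%mm0, %%mm1\n"  /* widen bytes to 16-bit words */
        "pmaddwd %%mm2, %%mm1\n"    /* multiply by weights, add pairs */
        "movq %%mm1, (%0)\n"        /* store both partial sums */
        :
        : "r" (destination), "r" (SourceRGB) /* both are inputs: the asm only reads the pointers */
        : "memory"
    );
    return (destination[0] + destination[1]) >> 10;
}


void EndGetY() {
    asm volatile (
        "emms\n"                    /* leave MMX state so the FPU is usable again */
    );
}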

Quote:

do you have any good tutorials, URLs etc..

I already posted a link to the Intel's tutorials. That's where I've learned ASM from.

Matt Smith
Member #783
November 2000

Here is a first draft of a routine to do 4 pixels at a time. This is wrong because 2 source registers should be used because the 4 pixels will come from 2 lines of the source bitmap, and I haven't put in the quotes and \n needed for embedded asm.

    movq (y_quotients),%mm4     /* 4 short words scaled to 256 */
    movq (u_quotients),%mm5     /* 4 short words scaled to 64 */
    movq (v_quotients),%mm6     /* 4 short words scaled to 64 */

    mov (width/2),%ecx          /* size of row */

_loop:
    /* 1st pixel */

    movd (%esi),%mm0
    add $4,%esi                 /* increment source pointer */
    pxor %mm1,%mm1              /* mm1 = 0 */
    punpcklbw %mm1,%mm0         /* mm0 = 000R0G0B */

    movq %mm0,%mm2              /* save unpacked RGB */

    pmaddwd %mm4,%mm0           /* multiply by y quotients, G' & B' are added */
    movq %mm0,(%2)              /* store in temp */
    mov (%2),%eax               /* eax = G' + B' */
    add 4(%2),%eax              /* + R' */
    movb %ah,(%edi)             /* write eax(15-8) to y plane 0 */
    add (plane_size),%edi

    /* 2nd pixel */
    movd (%esi),%mm0
    add $4,%esi                 /* increment source pointer */

    punpcklbw %mm1,%mm0         /* mm0 = 000R0G0B */

    paddw %mm0,%mm2             /* accumulate RGB */

    pmaddwd %mm4,%mm0           /* multiply by y quotients, G' & B' are added */
    movq %mm0,(%2)              /* store in temp */
    mov (%2),%eax               /* eax = G' + B' */
    add 4(%2),%eax              /* + R' */
    movb %ah,(%edi)             /* write eax(15-8) to y plane 1 */
    add (plane_size),%edi

    /* 3rd pixel */
    movd (%esi),%mm0
    add $4,%esi                 /* increment source pointer */

    punpcklbw %mm1,%mm0         /* mm0 = 000R0G0B */

    paddw %mm0,%mm2             /* accumulate RGB */

    pmaddwd %mm4,%mm0           /* multiply by y quotients, G' & B' are added */
    movq %mm0,(%2)              /* store in temp */
    mov (%2),%eax               /* eax = G' + B' */
    add 4(%2),%eax              /* + R' */
    movb %ah,(%edi)             /* write eax(15-8) to y plane 2 */
    add (plane_size),%edi

    /* 4th pixel */
    movd (%esi),%mm0
    add $4,%esi                 /* increment source pointer */

    punpcklbw %mm1,%mm0         /* mm0 = 000R0G0B */

    paddw %mm0,%mm2             /* accumulate RGB */

    pmaddwd %mm4,%mm0           /* multiply by y quotients, G' & B' are added */
    movq %mm0,(%2)              /* store in temp */
    mov (%2),%eax               /* eax = G' + B' */
    add 4(%2),%eax              /* + R' */
    movb %ah,(%edi)             /* write eax(15-8) to y plane 3 */
    add (plane_size),%edi

    /* U */
    movq %mm2,%mm0
    pmaddwd %mm5,%mm0           /* multiply accumulated RGB by u quotients */
    movq %mm0,(%2)              /* store in temp */
    mov (%2),%eax               /* eax = G' + B' */
    add 4(%2),%eax              /* + R' */
    movb %ah,(%edi)             /* write eax(15-8) to u plane */
    add (plane_size),%edi

    /* V */
    pmaddwd %mm6,%mm2           /* v quotients live in mm6, not mm5 */
    movq %mm2,(%2)              /* store in temp */
    mov (%2),%eax               /* eax = G' + B' */
    add 4(%2),%eax              /* + R' */
    movb %ah,(%edi)             /* write eax(15-8) to v plane */
    add (plane_size * -5 + 1),%edi  /* set dest pointer to next byte in y plane 0 */

    loop _loop                  /* decrement %ecx and loop if not zero */

Marcello
Member #1,860
January 2002

Can I ask what yuv is?

Marcello

Matt Smith
Member #783
November 2000

It's more correctly called YCbCr apparently, according to The ColorSpace FAQ

It's a video signal encoded as Luminance (Y) and 2 color difference signals (U & V)

[edit 2] Link fixed

YUV is used in broadcast television, MPEG files and most overlay video cards.

Bob
Free Market Evangelist
September 2000

You may want to try something closer to:

conv_to_y:
    movq MULT_TABLE, %mm6;

    movl WIDTH, %ecx;
    shrl $2, %ecx;

    /* Insert other setup code here */

loop:
    /* Read 4 pixels in a 2x2 arrangement */
    movq (%1), %mm0;        /* p0 */
    movq (%2), %mm1;        /* p0 */
    pxor %mm7, %mm7;        /* p1 */
    /* stall - 2 cycles */
    movq %mm0, %mm2;        /* p0 */
    movq %mm1, %mm3;        /* p1 */

    punpcklbw %mm7, %mm0;   /* p1 */
    pmaddwd %mm6, %mm0;     /* p0 */

    punpcklbw %mm7, %mm1;   /* p1 */
    pmaddwd %mm6, %mm1;     /* p0 */

    punpckhbw %mm7, %mm2;   /* p1 */
    pmaddwd %mm6, %mm2;     /* p0 */

    punpckhbw %mm7, %mm3;   /* p1 */
    pmaddwd %mm6, %mm3;     /* p0 */

    movq %mm0, %mm5;        /* p1 */
    psrlq $32, %mm0;        /* p1 */

    movq %mm1, %mm7;        /* p0 */
    psrlq $32, %mm1;        /* p1 */

    paddd %mm5, %mm0;       /* p0 */
    paddd %mm7, %mm1;       /* p1 */

    movq %mm2, %mm5;        /* p0 */
    psrlq $32, %mm2;        /* p1 */

    movq %mm3, %mm7;        /* p0 */
    psrlq $32, %mm3;        /* p1 */

    paddd %mm5, %mm2;       /* p0 */
    paddd %mm7, %mm3;       /* p1 */

    punpckldq %mm1, %mm0;   /* p1 */
    punpckldq %mm3, %mm2;   /* p1 */

    /* psrld rather than psrlq: each register now holds two separate
       32-bit sums, and a 64-bit shift would leak bits between them */
    psrld $10, %mm0;        /* p1 */
    psrld $10, %mm2;        /* p1 */

    movq %mm0, (%3);        /* p0 */
    movq %mm2, (%4);        /* p0 */

    decl %ecx;              /* p1 */
    jnz loop;               /* p0 */

    /* Total: 22 cycles for 4 pixels */

    emms;

Computes the Y channel only.
Completely untested. Timings are for Pentium 3 with 100% cache hitrate assumption. Setup code not shown.

Set (%1) to the source data, (%2) to one line under, and (%3) to the destination and (%4) to the line just under that.

Works on 2x2 blocks, so a full run will compute 2 lines.

Edit: Neglected the shift by 10. Added emms.
