Fastest blitting method

lucaz

What is the fastest method? 1) iterative loops, 2) memcpy, 3) other.

Steve Terry

Direct memory line access? It really depends on the color depth you are running at: you can use block memory access in 8 bpp, or copy 32-bit words at a time in 16 bpp, etc.

lucaz

Direct line access is the first option, right?
By iterative loops I mean:

for(int x=0; x<bmp1->w ;x++)
 for(int y=0; y<bmp1->h ;y++)
  bmp1->line[x][y] = bmp2->line[x][y];

Steve Terry

Right...

Chris Katko

3) MMX.

But then again, you didn't specify whether the source and destination are video, memory, or system bitmaps, or whether any blending is involved.

lucaz

Is this the fastest method? ¬_¬ ....

Steve Terry

Depends on how much time you want to spend on your custom blitter. MMX is great, but it is kinda complex.

Krzysztof Kluczek

Quote:

for(int x=0; x<bmp1->w ;x++)
for(int y=0; y<bmp1->h ;y++)
bmp1->line[x][y] = bmp2->line[x][y];

The bitmap line array works as line[y][x], but remember that the number of bits per pixel differs between formats. Also, you should blit in rows, not in columns, to make better use of the cache. Using memory pointers can help a bit too.

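int y, *s, *d, *e;
for (y = 0; y < bmp1->h; y++)
{
  s = (int *)bmp1->line[y];         // source row (line[] holds byte pointers, hence the cast)
  e = s + (bmp1->w * bpp + 3) / 4;  // bpp = bytes per pixel; row length rounded up to whole 32-bit words,
                                    // so up to 3 bytes past the end of the row may be touched
  d = (int *)bmp2->line[y];         // destination row
  while (s < e)
    *d++ = *s++;                    // copy one 32-bit word at a time
}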

And MMX is completely useless in blitting unless you are doing blending in 24/32bpp, since it just gives additional math operations and none are required here.
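
The row-by-row idea above can also be written with one memcpy per row. This is only a minimal sketch: SimpleBitmap and copy_rows are hypothetical names invented for illustration (this is not Allegro's BITMAP), and both bitmaps are assumed to be plain memory bitmaps of the same size and depth.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a memory bitmap; not Allegro's BITMAP. */
typedef struct {
    int w, h;                /* size in pixels */
    int bytes_per_pixel;     /* 1, 2, 3 or 4 */
    unsigned char **line;    /* line[y] points at the first byte of row y */
} SimpleBitmap;

/* Copy src into dst one row at a time (row-major order is cache friendly). */
static void copy_rows(SimpleBitmap *dst, const SimpleBitmap *src)
{
    size_t row_bytes;
    int y;

    /* Assumption of this sketch: same dimensions and same depth. */
    assert(dst->w == src->w && dst->h == src->h);
    assert(dst->bytes_per_pixel == src->bytes_per_pixel);

    row_bytes = (size_t)src->w * (size_t)src->bytes_per_pixel;
    for (y = 0; y < src->h; y++)
        memcpy(dst->line[y], src->line[y], row_bytes);
}
```

Because the row length is computed in bytes, the same loop works for any depth, and nothing is read or written past the end of a row.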

lucaz

Are you sure this is the fastest method? memcpy seems more useful: it doesn't iterate, it just takes a block of memory and copies it to another.

Steve Terry

I'm not saying memcpy wouldn't work. However, that will only allow you to copy a bitmap, not manipulate it or make a custom blitter. Why not just use blit instead?

ReyBrujo

You can move 8 bytes at a time using the MMX movq instruction, instead of one at a time as with memcpy. Problems: the target machine must support MMX, the addresses should be aligned, and the array size should be a multiple of 4. I have the code at home, will check it out later.

Gnatinator

Just blitting straight to the screen is extremely fast (I can pull thousands of frames per second). It's putting the data into memory first and then onto the screen that's slow (i.e. double buffering).

ReyBrujo

The fastest way will always be to use DRS (a dirty rectangle system) instead of updating the whole screen.
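
As a rough illustration of the idea of copying only what changed, here is a minimal sketch. DirtyRect and blit_dirty_rect are hypothetical names, the buffers are assumed to be 8 bpp with 'pitch' bytes per row, and the rectangle is assumed to be already clipped to the buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical rectangle type, in pixels. */
typedef struct { int x, y, w, h; } DirtyRect;

/* Copy only the dirty rectangle from src to dst.
   Assumptions: both buffers are 8 bpp, have 'pitch' bytes per row,
   and r lies completely inside them. */
static void blit_dirty_rect(unsigned char *dst, const unsigned char *src,
                            int pitch, DirtyRect r)
{
    int row;

    assert(r.x >= 0 && r.y >= 0 && r.w >= 0 && r.h >= 0);
    assert(r.x + r.w <= pitch);

    for (row = 0; row < r.h; row++) {
        size_t off = (size_t)(r.y + row) * (size_t)pitch + (size_t)r.x;
        memcpy(dst + off, src + off, (size_t)r.w);
    }
}
```

The saving comes from touching r.w * r.h bytes per changed area instead of the whole frame, however the individual rows are copied.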

Kris Allen

Of course memcpy iterates; how else would it move more than one piece of data?

lucaz

I don't know its code, but at least it does just one for(), not two.

Steve Terry

What exactly are you trying to accomplish? We may be able to find the best method for what you want to implement.

ReyBrujo

Kris, there is a difference between iterating manually (using a loop and a jump) and letting the processor iterate for you (for example, using rep movsb):

;  this is looping (assumes string_length > 0)
xor ecx, ecx
mov esi, source_string
mov edi, target_string
begin_loop:
    mov al, [esi+ecx]    ; fetch a byte from the source string
    mov [edi+ecx], al    ; put the byte in the target string
    inc ecx
    cmp ecx, string_length
    jnz begin_loop

;  this must be the memcpy way
cld                      ; copy forwards
mov ecx, string_length
mov esi, source_string
mov edi, target_string
rep movsb                ; keep repeating movsb (take a byte from
                         ; [esi], put it in [edi], advance both,
                         ; decrease ecx) until ecx is 0.
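
In C terms, the two listings correspond roughly to a hand-written byte loop and a call to memcpy. This is only a sketch with hypothetical function names; whether a given memcpy really ends up as a rep-prefixed string instruction depends on the compiler and C library, so treat that part as an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hand-written loop: one load, one store, one compare and one jump per byte. */
static void copy_bytes_loop(unsigned char *dst, const unsigned char *src,
                            size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Library call: the implementation is free to use string instructions,
   wider moves, or anything else that gives the same result. */
static void copy_bytes_memcpy(unsigned char *dst, const unsigned char *src,
                              size_t n)
{
    assert(dst != NULL && src != NULL);
    memcpy(dst, src, n);
}
```

Both functions produce the same bytes in dst; the only difference is who does the iterating.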

Kris Allen

Ah cool, I didn't know about that. No wonder it's so fast :B

Krzysztof Kluczek

I'm not sure, but I think my method, and any other that is supposed to be faster, will be limited by the memory bus anyway.

And of course the fastest blitting method is to use HW acceleration.

decsonic

I wouldn't bother writing your own blitting methods unless blending is involved, though, and maybe not even then.

Richard Phipps

I really don't think it's worth spending the time trying to optimise things like this. Doing the rest of the program is more important.

lucaz

who likes to optimise blit?

ReyBrujo

Well, first optimize your game fully. You don't try optimizing printf; you try optimizing your program so that it doesn't call it so many times in the first place.

Richard Phipps

Quote:

who likes to optimise blit?

Er.. your thread title is 'fastest blitting method'

(And I didn't say blit, I said things like this..)

decsonic

read my sig

Steve Terry

Can someone explain to me what the hell is going on? Optimize blit or optimize your code? If you need to optimize your code, then it has nothing to do with blit: make your algorithms faster, not blit.

Billybob

Fastest blitting method is, as KK said, your video card's video ram->video ram accelerated blit.

End of story!
And as far as actual time is concerned, the fastest blitting method is:

blit(something);

because you don't start a thread like this, waste time debating things, and blit is only 4 letters!

lucaz

1 said:

What is the fastest method? 1) iterative loops, 2) memcpy, 3) other.

2 said:

Direct line access is the first option, right?
By iterative loops I mean:
for(int x=0; x<bmp1->w ;x++)for(int y=0; y<bmp1->h ;y++) bmp1->line[x][y] = bmp2->line[x][y];

3 said:

Is this the fastest method? ¬_¬ ....

4 said:

Are you sure this is the fastest method? memcpy seems more useful: it doesn't iterate, it just takes a block of memory and copies it to another.

5 said:

I don't know its code, but at least it does just one for(), not two.

6 said:

who likes to optimise blit?

I never said that I'm trying to optimise.
I would just like to know, as my topic says, what is fastest.

ReyBrujo explained to me why memcpy() is better than using loops; that was what I wanted to know.
Thanks!

ReyBrujo

Here is the code I used to optimize. Note that, if the processor does not have MMX support, it just copies with memcpy. Of course, the bitmaps must be aligned, and you cannot use this with the screen bitmap. I tried it a couple of times with that old project (DRS system), and it worked quite well. I can tell you it won't crash as long as you meet the requirements.

//                                                                           //
//  The MMX optimization code is here. Hmm... I don't really know if this    //
//  should be public for everyone (I mean, as a header, and not as another   //
//  source file), but anyway, it is easier this way.                         //
//                                                                           //
//  -----------------------------------------------------------------------  //
//                                                                           //
//      This file is a part of DRS (alpha) package.                          //
//      Copyright (C) 2002  Roberto Alfonso (aka ReyBrujo)                   //
//                          reybrujo@hotmail.com                             //
//                                                                           //
//      This package is free software; you can redistribute it and/or        //
//      modify it under the terms of the GNU General Public License as       //
//      published by the Free Software Foundation; either version 2,         //
//      or (at your option) any later version.                               //
//                                                                           //
//      This package is distributed in the hope that it will be useful,      //
//      but WITHOUT ANY WARRANTY; without even the implied warranty of       //
//      MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the         //
//      GNU General Public License for more details.                         //
//                                                                           //
//      You should have received a copy of the GNU General Public            //
//      License along with this package (see the file COPYING). If not,      //
//      write to the Free Software Foundation, Inc., 59 Temple Place,        //
//      Suite 330, Boston, MA  02111-1307  USA                               //
//                                                                           //
//  -----------------------------------------------------------------------  //
//                                                                           //
#ifndef _MMXCODE_H_INCLUDED
#define _MMXCODE_H_INCLUDED 0xDEAD

#ifdef __cplusplus
extern "C" {
#endif

//                                                                           //
//  Whenever I needed to update the internal bitmap I just used blit(). But  //
//  I noticed it was very slow, so I tried just (since the background and    //
//  the internal bitmap have the same size) memcpy() the 'line' pointers of  //
//  the bitmap struct. But that gave some boost, but not enough.             //
//                                                                           //
//  Now, and since it was still slow, I decided to try MMX. I defined a new  //
//  macro, MMX_MEMCPY, which checks if your hardware support MMX set (by     //
//  checking Allegro cpu_capabilities global variable). If so, it uses the   //
//  movq instruction to copy 32 bytes each cycle.                            //
//                                                                           //
//  WARNING!                                                                 //
//  Though it increases the frame rate (with 25 objects and seed 55555555    //
//  increases +30FPS here), be careful! This is still a hack, and I did not  //
//  even care about aligning. The code expects to find a '_size' multiple    //
//  of 32 (like 640x480, 320x200, etc, etc, etc). But with odd sizes (maybe  //
//  you have set a bitmap of 319x111, which is not multiple of 32), there    //
//  will be some bytes that are not going to be copied.                      //
//                                                                           //
//  Why I just copy 32 each cycle and not 8x8 = 64 bytes each cycle? Well,   //
//  the redirection from memory (first you copy from [esi], then [esi+8],    //
//  then [esi+16], etc) takes some time, and will drain all advantage we     //
//  get by copying several bytes at once. According to my tests, 32 bytes    //
//  each cycle gives a good speed.                                           //
//                                                                           //
//  Since the object creates the internal and background bitmaps taking the  //
//  screen width and size, and since I haven't seen any odd resolution, the  //
//  object itself shouldn't have problems at all. But if you take this code  //
//  to implement your own fast_copy_bitmap() function, be warned: you need   //
//  to align the data and manually copy the bytes that are not copied using  //
//  the cycle.                                                               //
//                                                                           //

//                                                                           //
//  The user should not include this header directly. It is for his safety,  //
//  I don't really care if he deletes the #error directive and hack this by  //
//  him/herself.                                                             //
//                                                                           //
#ifndef _DRS_H_INCLUDED
    #error You should not include this file directly!
#endif

#ifdef USE_MMX
#ifdef __GNUC__
    //
    //  MMX code for DJGPP, MingW32 and probably Linux. Sorry, but cannot test
    //  Linux version for now until getting a new harddisk  :(
    //

    #define MMX_MEMCPY(_t, _s, _size)                               \
        do {                                                        \
            if (cpu_capabilities & CPU_MMX) {                       \
                /* the loop changes these registers, so they */     \
                /* are passed as read-write operands         */     \
                unsigned long _n   = (_size) >> 5;                  \
                const void   *_src = (_s)->line[0];                 \
                void         *_dst = (_t)->line[0];                 \
                asm volatile(                                       \
                    "0:                    \n\t"                    \
                    "movq   (%%esi), %%mm0 \n\t"                    \
                    "movq  8(%%esi), %%mm1 \n\t"                    \
                    "movq 16(%%esi), %%mm2 \n\t"                    \
                    "movq 24(%%esi), %%mm3 \n\t"                    \
                    "movq %%mm0,   (%%edi) \n\t"                    \
                    "movq %%mm1,  8(%%edi) \n\t"                    \
                    "movq %%mm2, 16(%%edi) \n\t"                    \
                    "movq %%mm3, 24(%%edi) \n\t"                    \
                    "addl $32, %%esi       \n\t"                    \
                    "addl $32, %%edi       \n\t"                    \
                    "decl %%ecx            \n\t"                    \
                    "jnz  0b               \n\t"                    \
                    "emms                  \n\t"                    \
                    : "+c" (_n), "+S" (_src), "+D" (_dst)           \
                    :                                               \
                    : "memory"                                      \
                );                                                  \
            }                                                       \
            else                                                    \
                memcpy((_t)->line[0], (_s)->line[0], (_size));      \
        } while (0)
#else // !__GNUC__

    //
    // MMX code for MSVC and, probably, BCC. MSVC doesn't understand the code
    // as a macro, so I set it as an inline function.
    //
    // 'amount' is a byte count: the MMX path copies amount / 32 blocks of
    // 32 bytes, the fallback copies all 'amount' bytes with memcpy.
    //
    inline void mmx_memcpy(unsigned char *target,
                           unsigned char *source, long amount) {
        if (cpu_capabilities & CPU_MMX) {
            __asm {
                mov  ecx, amount
                shr  ecx, 5
                mov  esi, source
                mov  edi, target
                again:
                movq mm0, [esi   ]
                movq mm1, [esi+ 8]
                movq mm2, [esi+16]
                movq mm3, [esi+24]
                movq [edi   ], mm0
                movq [edi+ 8], mm1
                movq [edi+16], mm2
                movq [edi+24], mm3
                add  esi, 32
                add  edi, 32
                dec  ecx
                jnz  again
                emms
            }
        }
        else
            memcpy(target, source, amount);
    }

    #define MMX_MEMCPY(_t, _s, _sz)   \
        mmx_memcpy(_t->line[0], _s->line[0], (_sz))
#endif
#else // ! USE_MMX
    #define MMX_MEMCPY(_t, _s, _sz)   \
        memcpy(_t->line[0], _s->line[0], (_sz))
#endif

#ifdef __cplusplus
}
#endif

#endif // _MMXCODE_H_INCLUDED
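
The header's warning says that sizes which are not a multiple of 32 need the leftover bytes copied by hand. A portable sketch of that idea follows. The names copy_blocks and copy_with_tail are hypothetical, and the block loop here is a plain-C stand-in for the unrolled movq loop, not the MMX code itself.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BLOCK_BYTES 32   /* same block size as the MMX loop in the header */

/* Stand-in for the unrolled movq loop: copies 'blocks' whole 32-byte blocks. */
static void copy_blocks(unsigned char *dst, const unsigned char *src,
                        size_t blocks)
{
    while (blocks-- > 0) {
        memcpy(dst, src, BLOCK_BYTES);
        dst += BLOCK_BYTES;
        src += BLOCK_BYTES;
    }
}

/* Copy any number of bytes: whole blocks first, then the leftover tail. */
static void copy_with_tail(unsigned char *dst, const unsigned char *src,
                           size_t size)
{
    size_t blocks = size / BLOCK_BYTES;
    size_t done   = blocks * BLOCK_BYTES;

    assert(dst != NULL && src != NULL);
    copy_blocks(dst, src, blocks);                /* no-op when blocks == 0 */
    memcpy(dst + done, src + done, size - done);  /* 0..31 remaining bytes  */
}
```

Checking for a block count of zero before entering the fast loop also matters for the assembly versions, since a loop built on dec/jnz that starts with a count of zero does not stop after zero iterations.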

lucaz

Thanks once more, ReyBrujo!
I'm not trying to optimise; my idea is to try to make my own blit, so I can use it on a machine without Allegro.

Paul whoknows

I agree with Richard: make your game first, make it run faster later.
But of course, all of us want to own the fastest blitter ever made. 8-)

lucaz

People, you are crazy! I'm not trying to optimise!!!!!!!! ahhhhhhhhhhhhhhhhhhhhhhhhh

Steve Terry

Then say that you are writing your own blitter; it makes more sense now. Since you are not using Allegro, I'm not sure what you are using. If it's good ol' mode 13h, then memcpy would work best, since you probably have your bitmap stored linearly anyway, as well as the screen. Just beware that the screen "wraps" around.
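
A minimal sketch of such a memcpy-based blit onto a linear 320x200, 8 bpp buffer, with clipping so that a row never runs past the right edge and wraps onto the next line. The name blit_linear is hypothetical, and 'screen' here is an ordinary memory buffer standing in for the real video memory, which this sketch does not touch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SCREEN_W 320
#define SCREEN_H 200

/* Blit an 8 bpp sprite of w x h pixels to a linear 320x200 buffer at (x, y).
   The visible part is clipped to the screen edges, so no row wraps around. */
static void blit_linear(unsigned char *screen, const unsigned char *sprite,
                        int w, int h, int x, int y)
{
    int x0 = x < 0 ? 0 : x;                        /* clipped left edge   */
    int y0 = y < 0 ? 0 : y;                        /* clipped top edge    */
    int x1 = x + w > SCREEN_W ? SCREEN_W : x + w;  /* clipped right edge  */
    int y1 = y + h > SCREEN_H ? SCREEN_H : y + h;  /* clipped bottom edge */
    int row;

    assert(screen != NULL && sprite != NULL && w >= 0 && h >= 0);
    if (x0 >= x1 || y0 >= y1)
        return;                                    /* completely off screen */

    for (row = y0; row < y1; row++)
        memcpy(screen + (size_t)row * SCREEN_W + x0,
               sprite + (size_t)(row - y) * w + (x0 - x),
               (size_t)(x1 - x0));
}
```

Each visible row is still a single memcpy, so the clipping costs only a few comparisons per blit.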

Thread #431144. Printed from Allegro.cc
