fast memcpy
gering
Member #4,101
December 2003
Hi, I have here a faster version of memcpy:

What do you think about this code? Can you see any bugs, or anything I could change to make it faster?
miran
Member #2,407
June 2002
Uhm, if you're going to post something like that, the least you could do is make a speed comparison chart.
Tobi Vollebregt
Member #1,031
March 2001
Does it compile? EDIT: it does. I've never seen a weird construct like that, with different parts of a loop in different case labels.
gering
Member #4,101
December 2003
Quote: Uhm, if you're going to post something like that, the least you could do is make a speed comparison chart.

I know; I made a benchmark but forgot to bring it home from work. The results were about 15%-30% faster on Linux, Mac OS X and Windows XP, compiled with gcc. Also tested on 32-bit and 64-bit (AMD) CPUs. I have to try MSVC, too. EDIT:
kazzmir
Member #1,786
December 2001
Quote: I've never seen a weird construct like that, with different parts of a loop in different case labels.

It's called Duff's device.
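For readers who haven't seen it, here is a minimal sketch of Duff's device applied to a byte copy (the function name is mine; this is the general technique, not the code from the original post). The switch jumps into the middle of an eight-way unrolled loop to handle the `count % 8` leftover bytes on the first pass:

```c
#include <stddef.h>

/* Duff's device: unroll the copy loop eight times and use the switch
 * to jump into the middle of the loop body, so the count % 8 odd
 * bytes are handled on the first (partial) pass. */
static void duff_copy(char *to, const char *from, size_t count)
{
    if (count == 0)
        return;
    size_t n = (count + 7) / 8;      /* number of loop passes */
    switch (count % 8) {
    case 0: do { *to++ = *from++;
    case 7:      *to++ = *from++;
    case 6:      *to++ = *from++;
    case 5:      *to++ = *from++;
    case 4:      *to++ = *from++;
    case 3:      *to++ = *from++;
    case 2:      *to++ = *from++;
    case 1:      *to++ = *from++;
            } while (--n > 0);
    }
}
```

On modern compilers this is mostly of historical interest; the optimizer will usually unroll a plain loop just as well.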
ReyBrujo
Moderator
January 2001
That is useful for masked blitting. As for your memcpy, try compiling with -O2 -ffast-math -fomit-frame-pointer -funroll-loops for maximum generic performance. I guess you could get it faster with assembly instructions, especially with MMX registers.
HoHo
Member #4,534
April 2004
Copying around the cache would probably give faster results; the only problem is that it needs SSE2, IIRC. Memory alignment and other tricks can get memcpy speed up to somewhere around 80% of the FSB's theoretical maximum throughput. [edit] Also try copying between 32/64-bit ints instead of chars; speed should go up considerably. Using SIMD intrinsic variables (128-bit) would probably be even faster.
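A minimal sketch of the word-at-a-time idea described above (the function name is mine; it assumes 8-byte-aligned, non-overlapping buffers, and glosses over the strict-aliasing issues a real implementation must handle):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy n bytes by moving 64-bit words, then mop up the tail bytes.
 * Assumes dest and src do not overlap and are 8-byte aligned.
 * Note: reinterpreting arbitrary buffers as uint64_t technically
 * violates strict aliasing; libc implementations use compiler-
 * specific means to do this safely. */
static void copy_words(void *dest, const void *src, size_t n)
{
    uint64_t *d = dest;
    const uint64_t *s = src;
    size_t words = n / sizeof(uint64_t);
    for (size_t i = 0; i < words; i++)
        d[i] = s[i];
    /* remaining 0..7 tail bytes */
    memcpy((char *)dest + words * 8, (const char *)src + words * 8, n % 8);
}
```

Moving 8 bytes per iteration cuts the loop overhead by a factor of eight compared with a byte loop, which is where the "considerable" speedup comes from.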
ReyBrujo
Moderator
January 2001
Here is my old blit code. The MMX version is very fast, but it assumes a lot (alignment, size, pointers, etc.).
Evert
Member #794
November 2000
Fascinating. I wouldn't have expected it to be faster than libc's memcpy.

Result: Quote:
CPU time = 0.444931 secs.

That's a factor of two (more or less). Maybe my test is flawed somehow? Or are computers really fast enough to copy around 512MB of RAM in a fraction of a second?
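Evert's test program itself wasn't preserved in this archive, but a clock()-based harness along these lines produces the kind of "CPU time = ..." figures quoted in this thread (the function name, buffer size, and repeat count are my own choices, not the original test's):

```c
#include <stdlib.h>
#include <string.h>
#include <time.h>

typedef void *(*copy_fn)(void *, const void *, size_t);

/* Return CPU seconds spent copying n bytes `reps` times with f,
 * or -1.0 if the buffers could not be allocated. */
static double time_copy(copy_fn f, size_t n, int reps)
{
    char *src = malloc(n), *dst = malloc(n);
    if (!src || !dst) { free(src); free(dst); return -1.0; }
    memset(src, 0xAB, n);            /* touch pages before timing */
    clock_t t0 = clock();
    for (int i = 0; i < reps; i++)
        f(dst, src, n);
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
    free(src);
    free(dst);
    return secs;
}
```

One caveat Peter Wang raises later in the thread applies here: whichever function runs second benefits from warm caches and page tables, so candidates should be timed in both orders.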
kazzmir
Member #1,786
December 2001
Evert, I don't see such improvements.

[edit]
ReyBrujo
Moderator
January 2001
reybrujo@ansalon:~$ ./test
CPU time = 0.646902 secs.
CPU time = 0.415936 secs.

With -O3 there is a slight optimization, going down to 0.39xxxx. For kazzmir's test, I get:

reybrujo@ansalon:~$ ./test2
CPU time = 4.733280 secs.
CPU time = 4.358337 secs.
HoHo
Member #4,534
April 2004
Quote: Or are computers really fast enough to copy around 512MB of RAM in a fraction of a second?

If the FSB can theoretically pass through ~6.4GiB of data per second, I see no problem with copying 0.5G in a fraction of a second. Although I think 0.2s is a bit slow; but that's to be expected from non-optimized code. [edit] [edit2]
kazzmir
Member #1,786
December 2001
This seems to be slightly faster than the parent's 'fastcopy', although admittedly less portable to 64-bit machines. It is very readable, though.

When run with 32 megs + 3 bytes, 100 times each, I get the following.
Peter Wang
Member #23
April 2000
I only tried Evert's test quickly, but I found that the results depended on the order in which the functions were called: whichever of fastcpy() or memcpy() was called second was faster. If I then repeated the calls, they came out pretty much the same. I haven't investigated, but I assume it's due to cache effects.
HoHo
Member #4,534
April 2004
Quote: I haven't investigated, but I assume it's due to cache effects.

When moving 256M of data around, 0.5-2M of cache usually doesn't make much of a difference. [edit] It might be about cache, but probably not the data cache. Perhaps the code gets cached and some jumps get predicted better on subsequent runs. Also walking through the arrays first might help for some reason.
Thomas Fjellstrom
Member #476
June 2000
#define copy8 \
    *(unsigned long *)dest = *(unsigned long *)source;\
    dest += 8;\
    source += 8;

That doesn't actually copy 8 bytes, you know. At least not on ia32. You need uint64_t or unsigned long long. edit: fixed typo.
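A corrected sketch of the macro, for reference (the parameterized do/while form is my own; the original took no arguments and relied on `dest`/`source` being in scope). uint64_t is guaranteed to be exactly 8 bytes, unlike unsigned long on ia32, and the 8-byte alignment assumption remains:

```c
#include <stdint.h>

/* Copy exactly 8 bytes and advance both char pointers. Assumes the
 * pointed-to storage is 8-byte aligned; the do/while(0) wrapper makes
 * the macro behave as a single statement after an if, etc. */
#define copy8(dest, source) do {                            \
        *(uint64_t *)(dest) = *(const uint64_t *)(source);  \
        (dest) += 8;                                        \
        (source) += 8;                                      \
    } while (0)
```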
HoHo
Member #4,534
April 2004
Quote: You need uint32_t or unsigned long long.

No, you would need either uint64_t or unsigned long long. Too bad I don't have the DDJ journal with me right now. They had code in there that could achieve a performance of about 5.5GiB/s when nullifying 10MiB of memory and ~12GiB/s when nulling 128kiB.
Thomas Fjellstrom
Member #476
June 2000
Quote:
No, you would need either uint64_t or unsigned long long.

Oops, typo.
Karadoc ~~
Member #2,749
September 2002
I haven't tried it. But if this behaves the same as memcpy, except faster, then you should send it to the gcc people.
A J
Member #3,025
December 2002
Has anyone written patches for blit using any of these techniques?
gering
Member #4,101
December 2003
@HoHo Quote:
[edit] [edit2]

I know this URL, too. I also ran tests with this version of memcpy... but it was always the same speed as fastcopy on 32-bit machines. On 64-bit I got a segmentation fault with the code from the URL.

@Karadoc Quote: I haven't tried it. But if this behaves the same as memcpy except faster, then you should send it to the gcc people.

@A J Quote: has anyone written patches for blit using any of these techniques?

I have written a <vector>, <list> and <string>... getting a lot faster than the STL. Here is the memmove:

[edit]
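The posted memmove did not survive in this archive. For reference, a minimal overlap-safe byte version looks like this (a generic sketch, not gering's code; the key point is copying backwards when the destination starts inside the source region):

```c
#include <stddef.h>

/* Overlap-safe copy: when dest lands inside [src, src+n) we must copy
 * from the end backwards, otherwise the forward copy would overwrite
 * source bytes before they are read. */
static void *my_memmove(void *dest, const void *src, size_t n)
{
    char *d = dest;
    const char *s = src;
    if (d < s) {
        while (n--) *d++ = *s++;        /* forward copy is safe */
    } else if (d > s) {
        d += n;
        s += n;
        while (n--) *--d = *--s;        /* backward copy */
    }
    return dest;
}
```

A production version would combine this overlap check with the word-at-a-time copying discussed earlier in the thread.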
Arthur Kalliokoski
Second in Command
February 2005
I've tried using the 64-bit MMX registers in asm to blit; it's about 8% faster (on my machine) than a slightly unrolled loop using the 32-bit general registers. The Pentium 3 (or was it 4?) can use SSE registers with 128 bits at a time. You can't use FPU registers because some bit combinations will cause exceptions, slowing things down. I don't have SSE hardware, so I can't test that. The 64-bit AMDs I have no idea about. The AMD K6-2 does a string set (rep stosd) about as fast as an unrolled loop with 32-bit regs; Intel is faster with the 32-bit regs. Copying with regs is always faster than the string instructions on all my computers. So I always use MMX when I'm trying for max FPS or whatever.
A J
Member #3,025
December 2002
How about using the 64-bit general-purpose registers of the AMD64? That's got to be quicker than the 64-bit MMX registers. Or is the limit still the FSB?
gering
Member #4,101
December 2003
This has the same speed ... and should work correctly ...
A J
Member #3,025
December 2002
I have read in some of the P4/AMD64/ICC optimization guides that array indexing is preferred over pointers, as this helps the compiler with aliasing. With pointers, the compiler cannot guarantee that src and dst are not the same, or do not overlap, and therefore cannot make many optimizations; with array indexing, the compiler can figure out that they don't overlap and can perform pipeline/SSE/SIMD/cache optimizations.
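In C99 the same no-overlap guarantee can also be stated explicitly with the restrict qualifier; a small sketch (the function name is mine):

```c
#include <stddef.h>

/* restrict promises the compiler that dst and src never alias, which
 * is exactly the guarantee described above; the indexed loop can then
 * be pipelined or vectorized without runtime overlap checks. */
static void copy_indexed(float *restrict dst,
                         const float *restrict src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}
```

This is also why the standard declares memcpy itself with restrict-qualified parameters, while memmove, which must tolerate overlap, is not.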