|
This thread is locked; no one can reply to it. |
1
2
|
[C++] Unexpected performance of benchmark (Stack, Heap, Memory Pool) |
Chris Katko
Member #1,881
January 2002
|
1#include <printf.h>
2#include <sys/time.h>
3#include <stdio.h>
4#include <unistd.h>
5
6void benchmark(void (*fn)() )
7 {
8 struct timeval start, end;
9
10 long mtime, seconds, useconds;
11
12 gettimeofday(&start, NULL);
13
14 //Run test
15 fn();
16
17 gettimeofday(&end, NULL);
18
19 seconds = end.tv_sec - start.tv_sec;
20 useconds = end.tv_usec - start.tv_usec;
21
22 mtime = ((seconds) * 1000 + useconds/1000.0) + 0.5;
23
24 printf("Elapsed time: %ld milliseconds\n", mtime);
25 }
26
27void print_name(const char *test_name)
28 {
29 printf("Benchmarking: %s\n", test_name);
30 }
31
32const int NUMBER_OF_ALLOCATIONS = 50000000;
33typedef double allocation_type; //for easily changing the type
34allocation_type static_pool[NUMBER_OF_ALLOCATIONS];
35
36void test3()
37 {
38 print_name("Test3 - Static Pool");
39 for(int i = 0; i < NUMBER_OF_ALLOCATIONS; i++)
40 {
41 static_pool[i] = 0;
42 }
43 }
44
45void test2()
46 {
47 print_name("Test2 - Stack allocation");
48 for(int i = 0; i < NUMBER_OF_ALLOCATIONS; i++)
49 {
50 allocation_type p;
51 p = 0;
52 }
53 }
54
55void test1()
56 {
57 allocation_type *p;
58 print_name("Test1 - Heap allocation");
59 for(int i = 0; i < NUMBER_OF_ALLOCATIONS; i++)
60 {
61 p = new allocation_type;
62 p = 0;
63 }
64 }
65
66
67int main()
68{
69 benchmark(test1);
70 benchmark(test2);
71 benchmark(test3);
72
73 return 0;
74}
Results (compiled with optimization off, gcc -O0): Benchmarking: Test1 - Heap allocation Elapsed time: 2563 milliseconds Benchmarking: Test2 - Stack allocation Elapsed time: 158 milliseconds Benchmarking: Test3 - Static Pool Elapsed time: 375 milliseconds Am I missing something? It seems that the static pool should be closer to, or even faster than the stack. It's not even allocating anything. Is the "memory pool" slower because of the [] array offset? -----sig: |
GullRaDriel
Member #3,861
September 2003
|
It's because test 2 isn't creating a new p variable at each iteration. The test3 have to go from bytes to bytes setting value to 0 which makes it really use NUMBER_OF_ALLOCATIONS variables. Edited
"Code is like shit - it only smells if it is not yours" |
Chris Katko
Member #1,881
January 2002
|
GullRaDriel said:
It's because test 2 isn't creating a new p variable at each iteration. Is there a way to force that? I was hoping turning off optimizations would do that. Quote: The test3 have to go from bytes to bytes setting value to 0 which makes it really use NUMBER_OF_ALLOCATIONS variables. I'm not sure what you mean? [edit] Btw, removing i from static_pool[i] (setting to a fixed address) makes it 99% as fast as stack. -----sig: |
GullRaDriel
Member #3,861
September 2003
|
You're asking the compiler to do something you're not writting him to do. Edit for your edit: i'ts because it's not even an optimisation: the compiler see you want to use only array[given_position] and replace it by the address. I think a decompilation would show that. "Code is like shit - it only smells if it is not yours" |
beoran
Member #12,636
March 2011
|
Well, test1 leaks memory, and test2 just uses a single variable on the stack. What did you expect? But yes, the "stack" is the fastest type of memory, static is second and dynamic is third on common cpu architectures. |
Chris Katko
Member #1,881
January 2002
|
I'm okay with memory leaks, since it'll all get freed after the program terminates. beoran said: test2 just uses a single variable on the stack. I can't think of a way to constantly allocate new items on the stack in a loop. -----sig: |
beoran
Member #12,636
March 2011
|
Use alloca() _alloca() or _malloca() to allocate memory on the stack. |
Chris Katko
Member #1,881
January 2002
|
Those functions look insanely useful for stack-based dynamic allocation. As soon as the function returns, they disappear. Like some neat form of garbage collection. It's a pity they're non-portable. -----sig: |
beoran
Member #12,636
March 2011
|
In C99 and C11 you can use a VLA (variable length array), which is supposed to be portable, but isn't because MSVC doesn't support even C99 completely. And in C++ they don't allow it either. |
Thomas Fjellstrom
Member #476
June 2000
|
GCC has supported VLAs via extensions since forever as well. -- |
Edgar Reynaldo
Major Reynaldo
May 2007
|
Test 3 is not testing static memory allocation, it is merely testing N writes to an array that was constructed globally before main even ran. This should do more what you want, but only on the first run, because it's initialization is static and only happens once when it is actually encountered or used first. void test3() { static allocation_type[NUMBER_OF_ALLOCATIONS]; print_name("Test3 - Static Pool"); for(int i = 0; i < NUMBER_OF_ALLOCATIONS; i++) { static_pool[i] = 0; } }
My Website! | EAGLE GUI Library Demos | My Deviant Art Gallery | Spiraloid Preview | A4 FontMaker | Skyline! (Missile Defense) Eagle and Allegro 5 binaries | Older Allegro 4 and 5 binaries | Allegro 5 compile guide |
bamccaig
Member #7,536
July 2006
|
I suspect the technique used to measure the results isn't valid either... This is sort of a waste of time regardless since you can't make generalizations about such things, but if you really wanted to benchmark things you should be using an existing profiling technology to measure the results external to your actual program(s)... -- acc.js | al4anim - Allegro 4 Animation library | Allegro 5 VS/NuGet Guide | Allegro.cc Mockup | Allegro.cc <code> Tag | Allegro 4 Timer Example (w/ Semaphores) | Allegro 5 "Winpkg" (MSVC readme) | Bambot | Blog | C++ STL Container Flowchart | Castopulence Software | Check Return Values | Derail? | Is This A Discussion? Flow Chart | Filesystem Hierarchy Standard | Clean Code Talks - Global State and Singletons | How To Use Header Files | GNU/Linux (Debian, Fedora, Gentoo) | rot (rot13, rot47, rotN) | Streaming |
Chris Katko
Member #1,881
January 2002
|
Edgar Reynaldo said: Test 3 is not testing static memory allocation, it is merely testing N writes to an array that was constructed globally before main even ran. I'm not testing the time to create the static pool, only adding something to it. The confusion in this part is that I'm just assuming that whole flat memory is ready for values and lines up perfectly with no additional logic. I didn't write the proper static pool routines yet, as I was curious how fast it would be before I added the logic to process allocations to a real static pool. Ala it would be slower than that, but never faster. bamccaig said: I suspect the technique used to measure the results isn't valid either... How so? And obviously it doesn't replace a proper profiler of an actual use-case, but this is just a relative comparison to confirm what I've read: Everything reasonable should be on the stack, and static pools are significantly faster in some applications than dynamic memory. Pro-tip: If you're gonna say someone's work is completely invalid, at least bother to explain why. -----sig: |
bamccaig
Member #7,536
July 2006
|
Chris Katko said: How so? And obviously it doesn't replace a proper profiler of an actual use-case, but this is just a relative comparison to confirm what I've read: Everything reasonable should be on the stack, and static pools are significantly faster in some applications than dynamic memory. Pro-tip: If you're gonna say someone's work is completely invalid, at least bother to explain why. The "time" is a global value acquired by the system at its leisure. It would be subject to complex scheduling from the kernel, etc. You aren't strictly measuring the code that you intend to measure. The fact that your test cases are all in the same program also suffers from the fact that the overhead of cases isn't properly separated. As was pointed out, in some case, the setup is done outside of the "benchmark" process. You'd probably be better off writing 3 separate programs independently and comparing the results of those. Of course, most important is knowing what to compare and why. I'm not an expert on the subject so I'm not trying to assert any kind of failure in your technique. I'm just pointing out that in my experience it's a flawed approach to the problem. I was hoping others with more experience testing the validity of the approach would jump in. Machines are complicated and the specifics of how a piece of code performs are subject to use case specifics. If you really want to understand what happens then you should be testing multiple toolchains and probably examining the assembly or machine code output. Of course, that would probably be a huge waste of your time if your aim isn't to master optimization of low-level machine code so I suppose the emphasis on my post is to ask what you're trying to achieve. The question you're asking just doesn't seem all that valid. You're welcome to ask it and test it, but I think that you should research your method more to make sure that your results are reliable before placing too much weight on them... While we're getting nitpicky, your code looks like a bastardization of C and C++. Which would be more or less OK if you were writing C++ and using C within it, but your header includes are all C specific... Aside from the new allocation_type the program is essentially C, and allocation_type is just an alias in this case for double, which has no special constructor semantics, etc. You could have just as easily used malloc and done it in C. Calling it C++ seems like you haven't quite figured out that they're distinct languages and it generally matters which style you use in C++ (i.e., if you do things C style in C++ there's a style to it, and there's a reason for it; and if you really want C then just use C). A fun story that I learned of recently... Many years ago a collection of Perl developers came up with a concept that they thought would speed up certain kinds high-level of operations. It took them 10 years to work out all of the edge cases and blah blah blah. Initial benchmarks showed a minor improvement of about 15% in what they aimed to improve upon. Small, but it could add up. It took a while for somebody clever to compare with benchmarks of before the work began. It turns out that they slowed a core feature down by 15%, and the "speed up" managed to bring this one edge case of use back to par with what they started with. In other words, the other 85% of uses for this feature were slower and they gained nothing. They weren't focused on testing what mattered so their initial benchmarks ended up being misleading and useless and even harmful and until somebody had an honest thought the delusion fooled them all. -- acc.js | al4anim - Allegro 4 Animation library | Allegro 5 VS/NuGet Guide | Allegro.cc Mockup | Allegro.cc <code> Tag | Allegro 4 Timer Example (w/ Semaphores) | Allegro 5 "Winpkg" (MSVC readme) | Bambot | Blog | C++ STL Container Flowchart | Castopulence Software | Check Return Values | Derail? | Is This A Discussion? Flow Chart | Filesystem Hierarchy Standard | Clean Code Talks - Global State and Singletons | How To Use Header Files | GNU/Linux (Debian, Fedora, Gentoo) | rot (rot13, rot47, rotN) | Streaming |
Edgar Reynaldo
Major Reynaldo
May 2007
|
Chris Katko said: I'm not testing the time to create the static pool, only adding something to it.
But you're not adding anything to it. The whole array gets allocated once And you should definitely be comparing the assembly code output by each function. My Website! | EAGLE GUI Library Demos | My Deviant Art Gallery | Spiraloid Preview | A4 FontMaker | Skyline! (Missile Defense) Eagle and Allegro 5 binaries | Older Allegro 4 and 5 binaries | Allegro 5 compile guide |
beoran
Member #12,636
March 2011
|
Well, yes, as bamccaig suggests indirectly, gettimeofday isn't a really good high precision timer, and benchmarking and profiling are not so easy as you'd expect. For gcc, gprof may be useful to you. That being said, I think this point of performance of different storage types in C is normally nothing to worry about. Rather use the memory allocation strategy that fits your design. A static buffer is best for long term storage of a fixed amount of data, malloc'ed space is best for long term storage of dynamic data that can grow and shrink, and automatic storage (the "stack") is best for temporary, short lived data. |
Edgar Reynaldo
Major Reynaldo
May 2007
|
Last time I tried to use gprof on Windows it was utterly broken. My Website! | EAGLE GUI Library Demos | My Deviant Art Gallery | Spiraloid Preview | A4 FontMaker | Skyline! (Missile Defense) Eagle and Allegro 5 binaries | Older Allegro 4 and 5 binaries | Allegro 5 compile guide |
Thomas Fjellstrom
Member #476
June 2000
|
profiling really isn't a great benchmark, in fact its a really bad one. -- |
l j
Member #10,584
January 2009
|
Chris Katko said: As soon as the function returns, they disappear. Like some neat form of garbage collection. void test() { std::unique_ptr<T>(new T); } Not stack allocated, but still automatic. void benchmark(std::function<void()> fn) { using std::chrono::high_resolution_clock; using std::chrono::milliseconds; const auto start = high_resolution_clock::now(); fn(); const auto end = high_resolution_clock::now(); const auto duration = std::chrono::duration_cast<milliseconds>(end - start); std::cout << "Elapsed time: " << duration.count() << "ms\n"; } C++ version of the benchmark function.
|
Chris Katko
Member #1,881
January 2002
|
Quote:
void benchmark(std::function<void()> fn) const auto start = high_resolution_clock::now(); std::cout << "Elapsed time: " << duration.count() << "ms\n"; Thanks! That one I got was straight from a stack overflow. Hence to C-ness. Quote:
void test() True! I just think it's neat to be able to explicitly do a stack-based method as a kind of brain teaser. What are the implications? What are the benefits. And so on. Bam: Actually, I really don't care that my tiny program is full of C and C++. That's not the point. This isn't a live project. It's a tiny test to confirm what I've read about relative allocation speed. beoran said: Well, yes, as bamccaig suggests indirectly, gettimeofday isn't a really good high precision timer That's why I'm using longer tests and running the program multiple times to average the results. Quote: A static buffer is best for long term storage of a fixed amount of data, Static pools are great for using the "shells" of destroyed objects over and over, as well as when all objects are tied to a common "death." It's very easy to allocate a static pool (ala Boost for example), run a game round, and have everything die without memory leaks regardless of whether or not destructors are called. You could easily tie the concept to garbage collection that runs per "scene". Edgar Reynaldo said: But you're not adding anything to it. The whole array gets allocated once outside of main on first use. That's it. All you're doing is testing write speed. That's kind of the point. If you had zero overhead for finding the position of your next object to allocate (I haven't written allocation code yet), then all that would be left would be writing to that location. My code is actually typedef'd to allow me to use large structures as well as simple primitive types, so that write speed can be a significant amount of the time. Keep in mind, this test isn't done, I just started it and wanted some feedback. But it's more-or-less confirmed what all the theory I've read has stated, it just quantified it a little. I was surprised about the fake static pool being so long, however, since I was expecting it to be faster than stack memory. Since a stack still requires modifying the stack, yet this literally is using no management at all and is still slower. -----sig: |
beoran
Member #12,636
March 2011
|
Chris Katko
Member #1,881
January 2002
|
Yeah! That's one of the many SO threads I was reading that inspired me to make a quantitative test to see how much faster they really were. Also, http://wyw.dcweb.cn/static_mem_pool.htm and http://www.boost.org/doc/libs/1_57_0/libs/pool/doc/html/index.html Once I had the proper allocations setup (hence this thread), I was going to move toward some more realistic use-cases such as random access, allocations, and deletes. That idea of "always hot in the cache" is pretty interesting. I don't really know if there's any benefit to having a design pattern that takes exploits that (optimization before profiling is obviously bad!). But it's interesting to try and force yourself to think in away that all accesses are done on a stack level. Which might be identical or related to RAII--I'm not clear enough on the idea to know yet. Thanks for your insight so far, everyone! -----sig: |
l j
Member #10,584
January 2009
|
Chris Katko said: That idea of "always hot in the cache" is pretty interesting. I don't really know if there's any benefit to having a design pattern that takes exploits that (optimization before profiling is obviously bad!). There's not really a single design pattern for this, but there is data-oriented design which consists mostly of making structures of arrays instead of arrays of structures. Some AAA games, especially console exclusives tend to make extensive use of it. Jonathon Blow is actually trying to create a language tailored to this. There's also this old but still very (perhaps even more so than ever) relevant article.
|
Chris Katko
Member #1,881
January 2002
|
Quote:
CppCon 2014: Mike Acton "Data-Oriented Design and C++" Wow, I keep hearing over and over companies (Google!*) that ban use of C++ exceptions. Makes me feel better about not liking them. Whenever I suggest on Reddit I don't like them, everyone calls me a moron. But I have a feeling Reddit is more populated by CS undergrads than actual professionals. *Though, many people refute that guide. [edit] I know there's a reply below, but man, I really have to restate how awesome that video is with Mike Action. It starts a little slow but really gets good. -----sig: |
Edgar Reynaldo
Major Reynaldo
May 2007
|
Exceptions are mostly useless if you can't get a backtrace from them. And to do that you have to force a crash with a divide by zero or dereferencing a null pointer when the exception is thrown. I do this in my library, and it helps a lot. Exceptions are definitely not useless though. Look at Java! All they use is exceptions, but there's always a back trace through the JVM. My Website! | EAGLE GUI Library Demos | My Deviant Art Gallery | Spiraloid Preview | A4 FontMaker | Skyline! (Missile Defense) Eagle and Allegro 5 binaries | Older Allegro 4 and 5 binaries | Allegro 5 compile guide |
|
1
2
|