[C++] Unexpected performance of benchmark (Stack, Heap, Memory Pool)

[C++] Unexpected performance of benchmark (Stack, Heap, Memory Pool)

Chris Katko

Member #1,881

January 2002

#SelectExpand
  1#include <printf.h>
  2#include <sys/time.h>
  3#include <stdio.h>
  4#include <unistd.h>
  5
  6void benchmark(void (*fn)() )
  7  {
  8  struct timeval start, end;
  9
 10    long mtime, seconds, useconds;    
 11
 12    gettimeofday(&start, NULL);
 13    
 14    //Run test
 15    fn();
 16    
 17    gettimeofday(&end, NULL);
 18
 19    seconds  = end.tv_sec  - start.tv_sec;
 20    useconds = end.tv_usec - start.tv_usec;
 21
 22    mtime = ((seconds) * 1000 + useconds/1000.0) + 0.5;
 23
 24    printf("Elapsed time: %ld milliseconds\n", mtime);
 25  }
 26
 27void print_name(const char *test_name)
 28  {
 29  printf("Benchmarking: %s\n", test_name);
 30  }
 31
 32const int NUMBER_OF_ALLOCATIONS = 50000000;
 33typedef double allocation_type; //for easily changing the type
 34allocation_type static_pool[NUMBER_OF_ALLOCATIONS];
 35
 36void test3()
 37  {
 38  print_name("Test3 - Static Pool");
 39  for(int i = 0; i < NUMBER_OF_ALLOCATIONS; i++)
 40    {
 41    static_pool[i] = 0;
 42    }
 43  }
 44
 45void test2()
 46  {
 47  print_name("Test2 - Stack allocation");
 48  for(int i = 0; i < NUMBER_OF_ALLOCATIONS; i++)
 49    {
 50    allocation_type p;
 51    p = 0;
 52    }
 53  }
 54
 55void test1()
 56  {
 57  allocation_type *p;
 58  print_name("Test1 - Heap allocation");
 59  for(int i = 0; i < NUMBER_OF_ALLOCATIONS; i++)
 60    {
 61    p = new allocation_type;
 62    p = 0;
 63    }
 64  }
 65
 66
 67int main()
 68{
 69  benchmark(test1);
 70  benchmark(test2);
 71  benchmark(test3);
 72
 73    return 0;
 74}

Results (compiled with optimization off, gcc -O0):

Benchmarking: Test1 - Heap allocation
Elapsed time: 2563 milliseconds
Benchmarking: Test2 - Stack allocation
Elapsed time: 158 milliseconds
Benchmarking: Test3 - Static Pool
Elapsed time: 375 milliseconds

Am I missing something? It seems that the static pool should be closer to, or even faster than the stack. It's not even allocating anything.

Is the "memory pool" slower because of the [] array offset?

-----sig:
“Programs should be written for people to read, and only incidentally for machines to execute.” - Structure and Interpretation of Computer Programs
"Political Correctness is fascism disguised as manners" --George Carlin

GullRaDriel

Member #3,861

September 2003

It's because test 2 isn't creating a new p variable at each iteration.
IMO only one (the same address) variable is used loop after loop.

The test3 have to go from bytes to bytes setting value to 0 which makes it really use NUMBER_OF_ALLOCATIONS variables.

Edited

"Code is like shit - it only smells if it is not yours"
Allegro Wiki, full of examples and articles !!

Chris Katko

Member #1,881

January 2002

GullRaDriel said:

It's because test 2 isn't creating a new p variable at each iteration.
IMO only one (the same address) variable is used loop after loop.

Is there a way to force that? I was hoping turning off optimizations would do that.

Quote:

The test3 have to go from bytes to bytes setting value to 0 which makes it really use NUMBER_OF_ALLOCATIONS variables.

I'm not sure what you mean?

[edit] Btw, removing i from static_pool[i] (setting to a fixed address) makes it 99% as fast as stack.

GullRaDriel

Member #3,861

September 2003

You're asking the compiler to do something you're not writting him to do.
Test1 or test3 are the way to go if you want new variables.

Edit for your edit: i'ts because it's not even an optimisation: the compiler see you want to use only array[given_position] and replace it by the address.

I think a decompilation would show that.

"Code is like shit - it only smells if it is not yours"
Allegro Wiki, full of examples and articles !!

beoran

Member #12,636

March 2011

Well, test1 leaks memory, and test2 just uses a single variable on the stack. What did you expect? But yes, the "stack" is the fastest type of memory, static is second and dynamic is third on common cpu architectures.

Chris Katko

Member #1,881

January 2002

I'm okay with memory leaks, since it'll all get freed after the program terminates.

beoran said:

test2 just uses a single variable on the stack.

I can't think of a way to constantly allocate new items on the stack in a loop.

beoran

Member #12,636

March 2011

Use alloca() _alloca() or _malloca() to allocate memory on the stack.

Chris Katko

Member #1,881

January 2002

Those functions look insanely useful for stack-based dynamic allocation. As soon as the function returns, they disappear. Like some neat form of garbage collection.

It's a pity they're non-portable.

beoran

Member #12,636

March 2011

In C99 and C11 you can use a VLA (variable length array), which is supposed to be portable, but isn't because MSVC doesn't support even C99 completely. And in C++ they don't allow it either.

Thomas Fjellstrom

Member #476

June 2000

GCC has supported VLAs via extensions since forever as well.

--
Thomas Fjellstrom - [website] - [email] - [Allegro Wiki] - [Allegro TODO]
"If you can't think of a better solution, don't try to make a better solution." -- weapon_S
"The less evidence we have for what we believe is certain, the more violently we defend beliefs against those who don't agree" -- https://twitter.com/neiltyson/status/592870205409353730

Edgar Reynaldo

Major Reynaldo

May 2007

Test 3 is not testing static memory allocation, it is merely testing N writes to an array that was constructed globally before main even ran.

This should do more what you want, but only on the first run, because it's initialization is static and only happens once when it is actually encountered or used first.

void test3() {

  static allocation_type[NUMBER_OF_ALLOCATIONS];

  print_name("Test3 - Static Pool");

  for(int i = 0; i < NUMBER_OF_ALLOCATIONS; i++) {
    static_pool[i] = 0;
  }

}

Eagle and Allegro 5 binaries | Older Allegro 4 and 5 binaries | Allegro 5 compile guide
King Piccolo will make it so Then I could just flip it with some sweet, sweet matrix math. In either case, I’ll ride this wagon until the wheels fall off. Thanks to those that keep it running

bamccaig

Member #7,536

July 2006

I suspect the technique used to measure the results isn't valid either... :-/ This is sort of a waste of time regardless since you can't make generalizations about such things, but if you really wanted to benchmark things you should be using an existing profiling technology to measure the results external to your actual program(s)... :-X

--
I mean the best with what I say. It doesn't always sound that way.

Chris Katko

Member #1,881

January 2002

Edgar Reynaldo said:

Test 3 is not testing static memory allocation, it is merely testing N writes to an array that was constructed globally before main even ran.

I'm not testing the time to create the static pool, only adding something to it.

The confusion in this part is that I'm just assuming that whole flat memory is ready for values and lines up perfectly with no additional logic. I didn't write the proper static pool routines yet, as I was curious how fast it would be before I added the logic to process allocations to a real static pool. Ala it would be slower than that, but never faster.

bamccaig said:

I suspect the technique used to measure the results isn't valid either... :-/

How so? And obviously it doesn't replace a proper profiler of an actual use-case, but this is just a relative comparison to confirm what I've read: Everything reasonable should be on the stack, and static pools are significantly faster in some applications than dynamic memory.

Pro-tip: If you're gonna say someone's work is completely invalid, at least bother to explain why.

bamccaig

Member #7,536

July 2006

Chris Katko said:

Pro-tip: If you're gonna say someone's work is completely invalid, at least bother to explain why.

The "time" is a global value acquired by the system at its leisure. It would be subject to complex scheduling from the kernel, etc. You aren't strictly measuring the code that you intend to measure.

The fact that your test cases are all in the same program also suffers from the fact that the overhead of cases isn't properly separated. As was pointed out, in some case, the setup is done outside of the "benchmark" process. You'd probably be better off writing 3 separate programs independently and comparing the results of those.

Of course, most important is knowing what to compare and why. I'm not an expert on the subject so I'm not trying to assert any kind of failure in your technique. I'm just pointing out that in my experience it's a flawed approach to the problem. I was hoping others with more experience testing the validity of the approach would jump in. Machines are complicated and the specifics of how a piece of code performs are subject to use case specifics. If you really want to understand what happens then you should be testing multiple toolchains and probably examining the assembly or machine code output. Of course, that would probably be a huge waste of your time if your aim isn't to master optimization of low-level machine code so I suppose the emphasis on my post is to ask what you're trying to achieve. The question you're asking just doesn't seem all that valid. You're welcome to ask it and test it, but I think that you should research your method more to make sure that your results are reliable before placing too much weight on them...

While we're getting nitpicky, your code looks like a bastardization of C and C++. Which would be more or less OK if you were writing C++ and using C within it, but your header includes are all C specific... Aside from the new allocation_type the program is essentially C, and allocation_type is just an alias in this case for double, which has no special constructor semantics, etc. You could have just as easily used malloc and done it in C. Calling it C++ seems like you haven't quite figured out that they're distinct languages and it generally matters which style you use in C++ (i.e., if you do things C style in C++ there's a style to it, and there's a reason for it; and if you really want C then just use C).

A fun story that I learned of recently... Many years ago a collection of Perl developers came up with a concept that they thought would speed up certain kinds high-level of operations. It took them 10 years to work out all of the edge cases and blah blah blah. Initial benchmarks showed a minor improvement of about 15% in what they aimed to improve upon. Small, but it could add up. It took a while for somebody clever to compare with benchmarks of before the work began. It turns out that they slowed a core feature down by 15%, and the "speed up" managed to bring this one edge case of use back to par with what they started with. In other words, the other 85% of uses for this feature were slower and they gained nothing. They weren't focused on testing what mattered so their initial benchmarks ended up being misleading and useless and even harmful and until somebody had an honest thought the delusion fooled them all.

--
I mean the best with what I say. It doesn't always sound that way.

Edgar Reynaldo

Major Reynaldo

May 2007

Chris Katko said:

I'm not testing the time to create the static pool, only adding something to it.

But you're not adding anything to it. The whole array gets allocated once ~~outside of main~~ on first use. That's it. All you're doing is testing write speed.

And you should definitely be comparing the assembly code output by each function.

beoran

Member #12,636

March 2011

Well, yes, as bamccaig suggests indirectly, gettimeofday isn't a really good high precision timer, and benchmarking and profiling are not so easy as you'd expect. For gcc, gprof may be useful to you.

That being said, I think this point of performance of different storage types in C is normally nothing to worry about. Rather use the memory allocation strategy that fits your design. A static buffer is best for long term storage of a fixed amount of data, malloc'ed space is best for long term storage of dynamic data that can grow and shrink, and automatic storage (the "stack") is best for temporary, short lived data.

Edgar Reynaldo

Major Reynaldo

May 2007

Last time I tried to use gprof on Windows it was utterly broken.

Thomas Fjellstrom

Member #476

June 2000

profiling really isn't a great benchmark, in fact its a really bad one.

l j

Member #10,584

January 2009

Chris Katko said:

As soon as the function returns, they disappear. Like some neat form of garbage collection.

void test()
{
   std::unique_ptr<T>(new T);
}

Not stack allocated, but still automatic.

void benchmark(std::function<void()> fn)
{
  using std::chrono::high_resolution_clock;
  using std::chrono::milliseconds;

  const auto start = high_resolution_clock::now();
  fn();
  const auto end = high_resolution_clock::now();
  const auto duration = std::chrono::duration_cast<milliseconds>(end - start);

  std::cout << "Elapsed time: " << duration.count() << "ms\n";
}

C++ version of the benchmark function.

Chris Katko

Member #1,881

January 2002

Quote:

void benchmark(std::function<void()> fn)
{
using std::chrono::high_resolution_clock;
using std::chrono::milliseconds;

const auto start = high_resolution_clock::now();
fn();
const auto end = high_resolution_clock::now();
const auto duration = std::chrono::duration_cast<milliseconds>(end - start);

std::cout << "Elapsed time: " << duration.count() << "ms\n";
}

Thanks! That one I got was straight from a stack overflow. Hence to C-ness.

Quote:

void test()
{
std::unique_ptr<T>(new T);
}

True! I just think it's neat to be able to explicitly do a stack-based method as a kind of brain teaser. What are the implications? What are the benefits. And so on.

Bam: Actually, I really don't care that my tiny program is full of C and C++. That's not the point. This isn't a live project. It's a tiny test to confirm what I've read about relative allocation speed.

beoran said:

Well, yes, as bamccaig suggests indirectly, gettimeofday isn't a really good high precision timer

That's why I'm using longer tests and running the program multiple times to average the results.

Quote:

A static buffer is best for long term storage of a fixed amount of data,

Static pools are great for using the "shells" of destroyed objects over and over, as well as when all objects are tied to a common "death." It's very easy to allocate a static pool (ala Boost for example), run a game round, and have everything die without memory leaks regardless of whether or not destructors are called. You could easily tie the concept to garbage collection that runs per "scene".

Edgar Reynaldo said:

But you're not adding anything to it. The whole array gets allocated once outside of main on first use. That's it. All you're doing is testing write speed.

That's kind of the point. If you had zero overhead for finding the position of your next object to allocate (I haven't written allocation code yet), then all that would be left would be writing to that location. My code is actually typedef'd to allow me to use large structures as well as simple primitive types, so that write speed can be a significant amount of the time.

Keep in mind, this test isn't done, I just started it and wanted some feedback. But it's more-or-less confirmed what all the theory I've read has stated, it just quantified it a little. I was surprised about the fake static pool being so long, however, since I was expecting it to be faster than stack memory. Since a stack still requires modifying the stack, yet this literally is using no management at all and is still slower.

beoran

Member #12,636

March 2011

This may be relevant:

http://stackoverflow.com/questions/161053/c-which-is-faster-stack-allocation-or-heap-allocation

Chris Katko

Member #1,881

January 2002

Yeah! That's one of the many SO threads I was reading that inspired me to make a quantitative test to see how much faster they really were.

Also,

http://wyw.dcweb.cn/static_mem_pool.htm

and

http://www.boost.org/doc/libs/1_57_0/libs/pool/doc/html/index.html
http://www.boost.org/doc/libs/1_57_0/libs/pool/doc/html/boost_pool/pool/pooling.html

Once I had the proper allocations setup (hence this thread), I was going to move toward some more realistic use-cases such as random access, allocations, and deletes.

That idea of "always hot in the cache" is pretty interesting. I don't really know if there's any benefit to having a design pattern that takes exploits that (optimization before profiling is obviously bad!). But it's interesting to try and force yourself to think in away that all accesses are done on a stack level. Which might be identical or related to RAII--I'm not clear enough on the idea to know yet.

Thanks for your insight so far, everyone!

l j

Member #10,584

January 2009

Chris Katko said:

There's not really a single design pattern for this, but there is data-oriented design which consists mostly of making structures of arrays instead of arrays of structures. Some AAA games, especially console exclusives tend to make extensive use of it.
https://www.youtube.com/watch?v=rX0ItVEVjHc

Jonathon Blow is actually trying to create a language tailored to this.
https://www.youtube.com/user/jblow888/videos

There's also this old but still very (perhaps even more so than ever) relevant article.
http://lwn.net/Articles/250967/

Chris Katko

Member #1,881

January 2002

Quote:

CppCon 2014: Mike Acton "Data-Oriented Design and C++"
https://www.youtube.com/watch?v=rX0ItVEVjHc

Wow, I keep hearing over and over companies (Google!*) that ban use of C++ exceptions. Makes me feel better about not liking them.

Whenever I suggest on Reddit I don't like them, everyone calls me a moron. But I have a feeling Reddit is more populated by CS undergrads than actual professionals.

*Though, many people refute that guide.

[edit] I know there's a reply below, but man, I really have to restate how awesome that video is with Mike Action. It starts a little slow but really gets good.

Edgar Reynaldo

Major Reynaldo

May 2007

Exceptions are mostly useless if you can't get a backtrace from them. And to do that you have to force a crash with a divide by zero or dereferencing a null pointer when the exception is thrown. I do this in my library, and it helps a lot. Exceptions are definitely not useless though. Look at Java! All they use is exceptions, but there's always a back trace through the JVM.