Which one is faster

no-reply@allegro.cc (Shravan) — Sat, 12 Dec 2009 21:31:30 +0000

for(i=0;i<100;i++)
{
  for(j=0;j<10;j++)
  {
    //Some piece of code
  }
}

for(i=0;i<10;i++)
{
  for(j=0;j<100;j++)
  {
    //Same code as above
  }
}

no-reply@allegro.cc (SiegeLord) — Sat, 12 Dec 2009 21:34:15 +0000

Any modern compiler will rewrite one as the other as needed.

Don't worry about it.

no-reply@allegro.cc (OICW) — Sat, 12 Dec 2009 21:47:36 +0000

Both run in O(M*N), so it doesn't matter. As for optimisations, it doesn't really matter that much either: the compiler will do what it deems necessary, and most of the time optimising manually is counterproductive.

no-reply@allegro.cc (anonymous) — Sat, 12 Dec 2009 22:18:11 +0000

If "some piece of code" is accessing elements in a 2d array, then it might be wiser to loop over it so as to access elements sequentially. Otherwise it might be clearer just to loop 1000 times.
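To illustrate the "just loop 1000 times" suggestion, here is a minimal sketch. The names `body`, `run_nested` and `run_flat` are made up for this example, and `body` is only a stand-in for "some piece of code" that counts how often it is called. If the body does need `i` and `j`, they can be recovered from the flat counter, though the division and modulo are not free.

```c
/* Hypothetical stand-in for "some piece of code": it only counts its calls. */
static int calls = 0;

void body(int i, int j)
{
    (void)i;
    (void)j;
    ++calls;
}

/* Nested form from the original question: 100 outer x 10 inner iterations. */
int run_nested(void)
{
    calls = 0;
    for (int i = 0; i < 100; i++)
        for (int j = 0; j < 10; j++)
            body(i, j);
    return calls;
}

/* Flat form: a single loop of 1000 iterations; i and j are derived from k. */
int run_flat(void)
{
    calls = 0;
    for (int k = 0; k < 1000; k++)
        body(k / 10, k % 10);
    return calls;
}
```

Both functions call `body` the same number of times, so which one to use is mostly a readability question unless the body indexes an array.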

no-reply@allegro.cc (Goalie Ca) — Sun, 13 Dec 2009 08:30:33 +0000

What matters is how you are touching the memory. Performance nowadays pretty much boils down to cache usage. If you can't make good use of the L2 cache then you're toast.

no-reply@allegro.cc (Oscar Giner) — Sun, 13 Dec 2009 11:51:57 +0000

Assuming that the order in which you do the iterations doesn't matter, it's faster to do it in reverse:

for(i=99;i>=0;--i)
{
  for(j=9;j>=0;--j)
  {
    //Some piece of code
  }
}

CPUs only have comparison instructions against 0. To compare against a non-zero value you need to add an extra subtraction instruction and compare against 0 (i<100 is transformed to i-100<0). Of course the performance gain will be negligible in 99.99% of cases and you lose readability in your code.

no-reply@allegro.cc (type568) — Sun, 13 Dec 2009 14:53:40 +0000

Oscar Giner said:

Point me at a modern CPU failing to complete a comparison of two non zero integers in a single cycle.

no-reply@allegro.cc (Goalie Ca) — Sun, 13 Dec 2009 15:30:36 +0000

type568 said:

Point me at a modern CPU failing to complete a comparison of two non zero integers in a single cycle.

L1 cache (1 or 2 cycles)
L2 cache (10s of cycles)
RAM (100s of cycles)
Branch misprediction (10s of cycles)

IIRC, modern chips use memory locality and work best at prefetching when traversing from front to back.

no-reply@allegro.cc (type568) — Sun, 13 Dec 2009 15:36:31 +0000

Goalie Ca said:

L1 cache (1 or 2 cycles)
L2 cache (10s of cycles)
RAM (100s of cycles)
Branch misprediction (10s of cycles)

Fetching data isn't comparison. Also, my statement was obviously about the difference between comparing against zero and comparing against some other value. In both cases the operands have to be fetched (the CPU can't know in advance that one of them is zero, so I bet the zero has to be fetched too).

And once all the data is finally in registers, it'll be a single cycle for both of them.
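One way to check this on a given compiler rather than argue about it is to write both loop directions and compare the generated assembly (for example with `gcc -O2 -S`). The sketch below is only an illustration; the function names `sum_up` and `sum_down` are made up. On x86, for instance, `cmp` accepts an immediate operand, so a test against a non-zero constant does not need a separate subtraction.

```c
/* Sum the first n elements, counting upward and testing i < n. */
long sum_up(const int *a, int n)
{
    long s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Same sum, counting downward and testing i >= 0, as in the reversed variant. */
long sum_down(const int *a, int n)
{
    long s = 0;
    for (int i = n - 1; i >= 0; --i)
        s += a[i];
    return s;
}
```

Both functions return the same result; whether either is measurably faster depends on the compiler and the CPU, so it has to be measured rather than assumed.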

no-reply@allegro.cc (Tobias Dammers) — Tue, 15 Dec 2009 15:53:16 +0000

The most important rule about code optimization: Don't do it.

Anyway: As Goalie pointed out, the only thing that matters in this context is cache performance, and this means that you should try to iterate over arrays as sequentially as possible. If you have a 2D array to iterate over, then the inner loop should iterate over the last dimension:

for (int y = 0; y < h; ++y)
  for (int x = 0; x < w; ++x)
    do_something(array[y][x]);

This way, the array is accessed sequentially, and the cache is used optimally. If you access the array the other way around, the code jumps back and forth in memory, and depending on the size of the array, each jump may result in a cache miss.
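For comparison, here is a rough sketch of both traversal orders over the same array. The dimensions `H` and `W` and the function names are assumptions picked for illustration, not taken from anyone's actual code. Both functions compute the same sum; only the order in which memory is touched differs.

```c
enum { H = 512, W = 512 };   /* hypothetical dimensions, for illustration only */

/* Inner loop walks the last index: consecutive addresses in memory. */
long sum_row_major(int a[H][W])
{
    long s = 0;
    for (int y = 0; y < H; ++y)
        for (int x = 0; x < W; ++x)
            s += a[y][x];
    return s;
}

/* Inner loop walks the first index: each step jumps W * sizeof(int) bytes. */
long sum_col_major(int a[H][W])
{
    long s = 0;
    for (int x = 0; x < W; ++x)
        for (int y = 0; y < H; ++y)
            s += a[y][x];
    return s;
}
```

Timing the two calls (for instance with `clock()` from `<time.h>`) on a large enough array is the easiest way to see whether the difference matters on a particular machine; for arrays that fit entirely in cache the gap may be small.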

no-reply@allegro.cc (Audric) — Tue, 15 Dec 2009 16:57:28 +0000

An honest question: according to some googling, I can expect a page size of 4096 bytes on Windows. If all the data I access happens to be in the same page (say, in the 2 KB in the middle of a page), does it still matter?

no-reply@allegro.cc (type568) — Tue, 15 Dec 2009 17:04:19 +0000

Tobias Dammers said:

I think it sounds more evil than it is, but yeah. That's the only thing that came to my mind too when thinking about performance. Though it has nothing to do with the original question.

Audric said:

If you have 2 KB, you shouldn't really care about it, unless it's many multiples of 2 KB. Or I might simply be wrong. If you could give a more specific example, I think you could get a more detailed answer.

no-reply@allegro.cc (Audric) — Tue, 15 Dec 2009 17:09:38 +0000

A 2-level nested loop is very often used to scan a 2D array, so I think that it really matters (what's inside the nested loop, and how your data is organized)

no-reply@allegro.cc (type568) — Tue, 15 Dec 2009 17:15:56 +0000

Audric said:

A 2-level nested loop is very often used to scan a 2D array, so I think that it really matters (what's inside the nested loop, and how your data is organized)

Yeah, it's what Tobias said..

Audric said:

Perhaps I misunderstood the question.
I believe there may be a case where it does matter; however, the performance difference will obviously be much smaller, and it will be insignificant enough to be ignored. But I think it's easier to remember the loop order than to rack your brain over the size of the accessed data.

no-reply@allegro.cc (Evert) — Tue, 15 Dec 2009 19:42:53 +0000

type568 said:

I think it sounds more evil than it is

Cache misses are an absolute killer. You can easily lose a factor of 2 or more in speed.
By the way, this is the processor cache, which is a lot smaller than your main memory (larger than 4 KB though). Main memory page boundaries or Windows have nothing to do with it.

no-reply@allegro.cc (Audric) — Tue, 15 Dec 2009 20:21:04 +0000

Ok, so I guess my 16 MB lookup array for RGB->8bit color reduction was a bad idea.

no-reply@allegro.cc (Tobias Dammers) — Tue, 15 Dec 2009 20:23:59 +0000

Audric said:

Ok, so I guess my 16 MB lookup array for RGB->8bit color reduction was a bad idea.

Yes, because it doesn't take the alpha channel into account. And we're not even talking about blending - a lookup table for truecolor alpha blending, now that would be lightning fast!

no-reply@allegro.cc (type568) — Tue, 15 Dec 2009 21:05:54 +0000

Tobias Dammers said:

Yes, because it doesn't take the alpha channel into account. And we're not even talking about blending - a lookup table for truecolor alpha blending, now that would be lightning fast!

Guess I've lost my age to learn Chinese..

no-reply@allegro.cc (Audric) — Tue, 15 Dec 2009 21:59:45 +0000

type568: There are 256x256x256 possible colors in RGB, so I use an array of that size (2²⁴ bytes) to look up a 0-255 color. And I access this array for each pixel of a photo, so many times...
Tobias suggested adding the alpha channel to the mix (2³² bytes), or a "2D" array which stores the result of mixing 2 colors, with one color in x and one color in y: 2⁶⁴ bytes.

no-reply@allegro.cc (Tobias Dammers) — Tue, 15 Dec 2009 22:12:58 +0000

Audric said:

2⁶⁴ bytes

Which happens to be the entire address space on a 64-bit system. Even if you could have that much memory, you wouldn't be able to fit in another 8 bytes for the actual pixels you're trying to blend.
In other news, I was being sarcastic.

no-reply@allegro.cc (Audric) — Tue, 15 Dec 2009 22:14:41 +0000

Well, I started with my array (which is actual running code)

no-reply@allegro.cc (type568) — Wed, 16 Dec 2009 13:38:16 +0000

Audric said:

There are 256x256x256 colors possible in RGB, I use an array of this size (2²⁴ bytes) to look up a 0-255 color.

Not very cheap, but affordable: 16 MiB. The question is whether it's any faster. A few extra CPU cycles spent to avoid a RAM access might not be as expensive.
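To make the trade-off concrete, here is a minimal sketch of the two approaches side by side. It is not Audric's code: the thread never says how the reduction is computed, so a plain 3-3-2 bit truncation is assumed as a placeholder, and all function names are made up. A real palette match would be more expensive to compute directly, which shifts the balance towards the table.

```c
#include <stdint.h>
#include <stdlib.h>

/* ASSUMPTION: 3-3-2 bit truncation stands in for the real color reduction. */
uint8_t reduce_direct(uint8_t r, uint8_t g, uint8_t b)
{
    return (uint8_t)((r & 0xE0) | ((g & 0xE0) >> 3) | (b >> 6));
}

/* Pack r, g, b into a 24-bit index in the range [0, 2^24). */
uint32_t rgb_index(uint8_t r, uint8_t g, uint8_t b)
{
    return ((uint32_t)r << 16) | ((uint32_t)g << 8) | (uint32_t)b;
}

/* Build the 2^24-byte (16 MiB) table. Returns NULL if allocation fails;
   the caller is responsible for calling free() on the result. */
uint8_t *build_table(void)
{
    uint8_t *t = malloc((size_t)1 << 24);
    if (t == NULL)
        return NULL;
    for (uint32_t i = 0; i < ((uint32_t)1 << 24); ++i)
        t[i] = reduce_direct((uint8_t)(i >> 16), (uint8_t)(i >> 8), (uint8_t)i);
    return t;
}

/* One table read per pixel; likely a cache miss when colors vary a lot. */
uint8_t reduce_lookup(const uint8_t *t, uint8_t r, uint8_t g, uint8_t b)
{
    return t[rgb_index(r, g, b)];
}
```

Running both `reduce_direct` and `reduce_lookup` over the same photo and timing them would answer the question for a given machine. With a table far larger than the processor cache, the lookup only wins if the direct computation costs more than a typical memory access.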