<?xml version="1.0"?>
<rss version="2.0">
	<channel>
		<title>Which one is faster</title>
		<link>http://www.allegro.cc/forums/view/602498</link>
		<description>Allegro.cc Forum Thread</description>
		<webMaster>matthew@allegro.cc (Matthew Leverton)</webMaster>
		<lastBuildDate>Wed, 16 Dec 2009 13:38:16 +0000</lastBuildDate>
	</channel>
	<item>
		<description><![CDATA[<div class="mockup v2"><div class="source-code snippet"><div class="inner"><pre><span class="k1">for</span><span class="k2">(</span>i<span class="k3">=</span><span class="n">0</span><span class="k2">;</span>i<span class="k3">&lt;</span><span class="n">100</span><span class="k2">;</span>i<span class="k3">+</span><span class="k3">+</span><span class="k2">)</span>
<span class="k2">{</span>
   <span class="k1">for</span><span class="k2">(</span>j<span class="k3">=</span><span class="n">0</span><span class="k2">;</span>j<span class="k3">&lt;</span><span class="n">10</span><span class="k2">;</span>j<span class="k3">+</span><span class="k3">+</span><span class="k2">)</span>
  <span class="k2">{</span>
    <span class="c">//Some piece of code</span>
  <span class="k2">}</span>
<span class="k2">}</span>
</pre></div></div><p>
or
</p><div class="source-code snippet"><div class="inner"><pre><span class="k1">for</span><span class="k2">(</span>i<span class="k3">=</span><span class="n">0</span><span class="k2">;</span>i<span class="k3">&lt;</span><span class="n">10</span><span class="k2">;</span>i<span class="k3">+</span><span class="k3">+</span><span class="k2">)</span>
<span class="k2">{</span>
   <span class="k1">for</span><span class="k2">(</span>j<span class="k3">=</span><span class="n">0</span><span class="k2">;</span>j<span class="k3">&lt;</span><span class="n">100</span><span class="k2">;</span>j<span class="k3">+</span><span class="k3">+</span><span class="k2">)</span>
  <span class="k2">{</span>
    <span class="c">//Same code as above</span>
  <span class="k2">}</span>
<span class="k2">}</span>
</pre></div></div><p>
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Shravan)</author>
		<pubDate>Sat, 12 Dec 2009 21:31:30 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><p>Any modern compiler will rewrite one as the other as needed.</p><p>Don&#39;t worry about it.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (SiegeLord)</author>
		<pubDate>Sat, 12 Dec 2009 21:34:15 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><p>Both run at <i>O</i>(M*N) so it doesn&#39;t matter. And in case of some optimisations, it doesn&#39;t really matter that much, compiler will do what he deems necessary and most of the time doing some optimisations manually is contraproductive.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (OICW)</author>
		<pubDate>Sat, 12 Dec 2009 21:47:36 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><p>If &quot;some piece of code&quot; is accessing elements in a 2d array, then it might be wiser to loop over it so as to access elements sequentially. Otherwise it might be clearer just to loop 1000 times.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (anonymous)</author>
		<pubDate>Sat, 12 Dec 2009 22:18:11 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><p>What matter is how you are touching the memory. Performance nowadays pretty much boils down to cache usage. If you can&#39;t make good use of the l2 cache then you&#39;re toast.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Goalie Ca)</author>
		<pubDate>Sun, 13 Dec 2009 08:30:33 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><p>Assuming that the order at wich you do the iterations doesn&#39;t matter, it&#39;s faster to do it reverse:</p><div class="source-code snippet"><div class="inner"><pre><span class="k1">for</span><span class="k2">(</span>i<span class="k3">=</span><span class="n">99</span><span class="k2">;</span>i&gt;<span class="k3">=</span><span class="n">0</span><span class="k2">;</span><span class="k3">-</span><span class="k3">-</span>i<span class="k2">)</span>
<span class="k2">{</span>
   <span class="k1">for</span><span class="k2">(</span>j<span class="k3">=</span><span class="n">9</span><span class="k2">;</span>j&gt;<span class="k3">=</span><span class="n">0</span><span class="k2">;</span><span class="k3">-</span><span class="k3">-</span>j<span class="k2">)</span>
  <span class="k2">{</span>
    <span class="c">//Some piece of code</span>
  <span class="k2">}</span>
<span class="k2">}</span>
</pre></div></div><p>

CPU&#39;s only have comparison instructions against 0. To compare against a non zero value you need to add an extra substraction instruction and compare against 0 (i&lt;100 is transformed to i-100&lt;0). Of course the performance gain will be negligible in 99.99% of cases and you lose readability in your code <img src="http://www.allegro.cc/forums/smileys/tongue.gif" alt=":P" />
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Oscar Giner)</author>
		<pubDate>Sun, 13 Dec 2009 11:51:57 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><div class="quote_container"><div class="title"><a href="http://www.allegro.cc/forums/thread/602498/842549#target">Oscar Giner</a> said:</div><div class="quote"><p>
CPU&#39;s only have comparison instructions against 0. To compare against a non zero value you need to add an extra substraction instruction and compare against 0 (i&lt;100 is transformed to i-100&lt;0). Of course the performance gain will be negligible in 99.99% of cases and you lose readability in your code 
</p></div></div><p>

Point me at a modern CPU failing to complete a comparison of two non zero integers in a single cycle.</p><p>Append: <img src="http://www.allegro.cc/forums/smileys/cheesy.gif" alt=":D" />
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (type568)</author>
		<pubDate>Sun, 13 Dec 2009 14:53:40 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><div class="quote_container"><div class="title"><a href="http://www.allegro.cc/forums/thread/602498/842552#target">type568</a> said:</div><div class="quote"><p>Point me at a modern CPU failing to complete a comparison of two non zero integers in a single cycle.</p></div></div><p>

L1 cache (1 or 2 cycles)<br />L2 cache (10s cycles)<br />Ram (100s cycles)<br />Branch misprediction (10s of cycles)</p><p>IIRC, modern chips use memory locality and work best at prefetching when traversing from front to back.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Goalie Ca)</author>
		<pubDate>Sun, 13 Dec 2009 15:30:36 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><div class="quote_container"><div class="title"><a href="http://www.allegro.cc/forums/thread/602498/842555#target">Goalie Ca</a> said:</div><div class="quote"><p>
L1 cache (1 or 2 cycles)<br />L2 cache (10s cycles)<br />Ram (100s cycles)<br />Branch misprediction (10s of cycles)
</p></div></div><p>

Data retrieving isn&#39;t comparison. Also it&#39;s obvious, that my statement was regarding the difference of comparison vs zero or vs other value. In both cases, they gotta be retrieved(since CPU can&#39;t know there&#39;s the zero, I bet it also has to be retrieved..).</p><p>And when all the data is finally in registers, it&#39;ll be single cycle for both of&#39;em.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (type568)</author>
		<pubDate>Sun, 13 Dec 2009 15:36:31 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><p>The most important rule about code optimization: Don&#39;t do it.</p><p>Anyway: As Goalie pointed out, the only thing that matters in this context is cache performance, and this means that you should try to iterate over arrays as sequentially as possible. If you have a 2D array to iterate over, then the inner loop should iterate over the last dimension:
</p><div class="source-code snippet"><div class="inner"><pre><span class="k1">for</span> <span class="k2">(</span><span class="k1">int</span> y <span class="k3">=</span> <span class="n">0</span><span class="k2">;</span> y <span class="k3">&lt;</span> h<span class="k2">;</span> <span class="k3">+</span><span class="k3">+</span>y<span class="k2">)</span>
  <span class="k1">for</span> <span class="k2">(</span><span class="k1">int</span> x <span class="k3">=</span> <span class="n">0</span><span class="k2">;</span> x <span class="k3">&lt;</span> w<span class="k2">;</span> <span class="k3">+</span><span class="k3">+</span>x<span class="k2">)</span>
    do_something<span class="k2">(</span>array<span class="k2">[</span>y<span class="k2">]</span><span class="k2">[</span>x<span class="k2">]</span><span class="k2">)</span><span class="k2">;</span>
</pre></div></div><p>
This way, the array is accessed sequentially, and the cache is used optimally. If you access the array the other way around, the code jumps back and forth in memory, and depending on the size of the array, each jump may result in a cache miss.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Tobias Dammers)</author>
		<pubDate>Tue, 15 Dec 2009 15:53:16 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><p>A honest question: According to some googling, I can expect page size of 4096 on Windows. If all the data I access happens to be in the same page (say, in the 2kb in the middle of a page), does it still matter ?
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Audric)</author>
		<pubDate>Tue, 15 Dec 2009 16:57:28 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><div class="quote_container"><div class="title"><a href="http://www.allegro.cc/forums/thread/602498/842792#target">Tobias Dammers</a> said:</div><div class="quote"><p>
This way, the array is accessed sequentially, and the cache is used optimally. If you access the array the other way around, the code jumps back and forth in memory, and depending on the size of the array, each jump may result in a cache miss.
</p></div></div><p>

I think it sounds more evil than it is, but yeah. That&#39;s the only thing that came in to my mind too, when thinking of performance. Though it has nothing to do with the original question.</p><div class="quote_container"><div class="title"><a href="http://www.allegro.cc/forums/thread/602498/842805#target">Audric</a> said:</div><div class="quote"><p>
A honest question: According to some googling, I can expect page size of 4096 on Windows. If all the data I access happens to be in the same page (say, in the 2kb in the middle of a page), does it still matter ?
</p></div></div><p>

If you have 2kb, you shouldn&#39;t really care about it, unless it&#39;s many many times of 2kb. Or I might be literally wrong. If you could give more specific example, I think you could get a more detailed answer.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (type568)</author>
		<pubDate>Tue, 15 Dec 2009 17:04:19 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><p>A 2-level nested loop is very often used to scan a 2D array, so I think that it really matters (what&#39;s inside the nested loop, and how your data is organized)
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Audric)</author>
		<pubDate>Tue, 15 Dec 2009 17:09:38 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><div class="quote_container"><div class="title"><a href="http://www.allegro.cc/forums/thread/602498/842812#target">Audric</a> said:</div><div class="quote"><p>
A 2-level nested loop is very often used to scan a 2D array, so I think that it really matters (what&#39;s inside the nested loop, and how your data is organized)
</p></div></div><p>

Yeah, it&#39;s what Tobias said..</p><div class="quote_container"><div class="title"><a href="http://www.allegro.cc/forums/thread/602498/842805#target">Audric</a> said:</div><div class="quote"><p>
A honest question: According to some googling, I can expect page size of 4096 on Windows. If all the data I access happens to be in the same page (say, in the 2kb in the middle of a page), does it still matter ?
</p></div></div><p>

Perhaps I misunderstood the question.<br />I believe, there maybe a case when it does, however obviously the importance(the performance difference) will be greatly decreased, and it will be uncignificant enough to be ignored.. But I think it&#39;s easier to remember about the loops order rather than break your head about the size of the accessed data.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (type568)</author>
		<pubDate>Tue, 15 Dec 2009 17:15:56 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><div class="quote_container"><div class="title"><a href="http://www.allegro.cc/forums/thread/602498/842808#target">type568</a> said:</div><div class="quote"><p>I think it sounds more evil than it is</p></div></div><p>
Cache misses are an absolute killer. You can easily loose a factor 2 or more in speed.<br />By the way, this is the processor cache, which is a lot smaller than your main memory (larger than 4k though). Main memory page boundaries or Windows have nothing to do with it.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Evert)</author>
		<pubDate>Tue, 15 Dec 2009 19:42:53 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><p>Ok so I guess my 16Mb lookup array for RGB-&gt;8bit color reduction was a bad idea <img src="http://www.allegro.cc/forums/smileys/grin.gif" alt=";D" />
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Audric)</author>
		<pubDate>Tue, 15 Dec 2009 20:21:04 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><div class="quote_container"><div class="title"><a href="http://www.allegro.cc/forums/thread/602498/842842#target">Audric</a> said:</div><div class="quote"><p>
Ok so I guess my 16Mb lookup array for RGB-&gt;8bit color reduction was a bad idea
</p></div></div><p>
Yes, because it doesn&#39;t take the alpha channel into account. And we&#39;re not even talking about blending - a lookup table for truecolor alpha blending, now that would be lightning fast!
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Tobias Dammers)</author>
		<pubDate>Tue, 15 Dec 2009 20:23:59 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><div class="quote_container"><div class="title"><a href="http://www.allegro.cc/forums/thread/602498/842844#target">Tobias Dammers</a> said:</div><div class="quote"><p>
Yes, because it doesn&#39;t take the alpha channel into account. And we&#39;re not even talking about blending - a lookup table for truecolor alpha blending, now that would be lightning fast!
</p></div></div><p>

Guess I&#39;ve lost my age to learn Chinese.. <img src="http://www.allegro.cc/forums/smileys/sad.gif" alt=":(" />
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (type568)</author>
		<pubDate>Tue, 15 Dec 2009 21:05:54 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><p>type568: There are 256x256x256 colors possible in RGB, I use an array of this size (2<sup>24</sup> bytes) to look up a 0-255 color. And I access this array for each pixel of a photo, so <b>many</b> times...<br />Tobias suggested adding Alpha channel to the mix (2<sup>32</sup> bytes), or a &quot;2d&quot; array which stores the result of mixing 2 colors, with one color in X and one color in y. 2<sup>64</sup> bytes
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Audric)</author>
		<pubDate>Tue, 15 Dec 2009 21:59:45 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><div class="quote_container"><div class="title"><a href="http://www.allegro.cc/forums/thread/602498/842857#target">Audric</a> said:</div><div class="quote"><p>
2<sup>64</sup> bytes
</p></div></div><p>
Which happens to be the entire address space on a 64 bit system. Even if you could have that much memory, you wouldn&#39;t be able to fit in another 8 bytes for the actual pixels you&#39;re trying to blend.<br />In other news, I was being sarcastic.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Tobias Dammers)</author>
		<pubDate>Tue, 15 Dec 2009 22:12:58 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><p>Well, I started with my array (which is actual running code)
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Audric)</author>
		<pubDate>Tue, 15 Dec 2009 22:14:41 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><div class="quote_container"><div class="title"><a href="http://www.allegro.cc/forums/thread/602498/842857#target">Audric</a> said:</div><div class="quote"><p>
There are 256x256x256 colors possible in RGB, I use an array of this size (224 bytes) to look up a 0-255 color.
</p></div></div><p>

Not very cheap, but affordable.. 16MiB. The question is if it&#39;s any faster.. Some extra CPU cycles to save RAM access might not be as expensive.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (type568)</author>
		<pubDate>Wed, 16 Dec 2009 13:38:16 +0000</pubDate>
	</item>
</rss>
