<?xml version="1.0"?>
<rss version="2.0">
	<channel>
		<title>Fastest Sine / Cosine Table Possible</title>
		<link>http://www.allegro.cc/forums/view/588470</link>
		<description>Allegro.cc Forum Thread</description>
		<webMaster>matthew@allegro.cc (Matthew Leverton)</webMaster>
		<lastBuildDate>Fri, 10 Nov 2006 20:37:05 +0000</lastBuildDate>
	</channel>
	<item>
		<description><![CDATA[<div class="mockup v2"><p><i>NOTE: Accidentally put this here instead of under &quot;Programming Questions&quot;... might want to move it...</i></p><p>After some experimenting I&#39;ve deduced that the following is the fastest way to implement sine/cosine tables through a function that is interchangeable with the built-in sin() and cos() functions included in MATH.H.</p><div class="source-code snippet"><div class="inner"><pre><span class="c">// NOTE: Code to build sine table not included in this excerpt.</span>

<span class="p">#define SIN_TABLE_SIZE    4096</span>
<span class="p">#define SIN_TABLE_BITMASK  4095</span>

<span class="k1">double</span> _sin_t<span class="k2">[</span>SIN_TABLE_SIZE<span class="k2">]</span><span class="k2">;</span>

<span class="k1">double</span> Get_SinT <span class="k2">(</span><span class="k1">double</span> d<span class="k2">)</span>
<span class="k2">{</span>
  <span class="k1">return</span> _sin_t<span class="k2">[</span><span class="k2">(</span><span class="k1">int</span><span class="k2">)</span><span class="k2">(</span>d <span class="k3">*</span> <span class="n">651</span>.<span class="n">8986469f</span><span class="k2">)</span> <span class="k3">&amp;</span> SIN_TABLE_BITMASK<span class="k2">]</span><span class="k2">;</span>
<span class="k2">}</span>
</pre></div></div><p>

My tests showed this routine to be almost twice as fast as the normal sin() function with no compiler optimizations enabled.</p><p>Does anyone have any ideas how to make this even faster, or is that about as fast as it&#39;s going to get while still being compatible with radians?</p><p>--- Kris Asick (Gemini)<br />--- <a href="http://www.pixelships.com">http://www.pixelships.com</a>
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Kris Asick)</author>
		<pubDate>Thu, 09 Nov 2006 10:37:47 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><p>You cast a floating point value to an integer.  C requires that use round-towards-zero rules.  Most x86 hardware, last I checked, required a slow switch of rounding modes to enter that mode, then a float-&gt;int conversion, then a slow switch back to normal mode (round nearest I think).  Possibly that&#39;s been improved in one of the more recent instruction set extensions, I don&#39;t know.  Some compilers support a rint() or something for float-&gt;int conversion with different rounding rules, but I can&#39;t find it atm.  My inline asm code for float-&gt;int conversion:
</p><div class="source-code"><div class="toolbar"></div><div class="inner"><table width="100%"><tbody><tr><td class="number">1</td><td><span class="c">//MSVC version (x86 only)</span></td></tr><tr><td class="number">2</td><td>  <span class="k1">int</span> iround<span class="k2">(</span><span class="k1">double</span> a<span class="k2">)</span> <span class="k2">{</span></td></tr><tr><td class="number">3</td><td>    <span class="k1">int</span> r<span class="k2">;</span> </td></tr><tr><td class="number">4</td><td>    __asm fld a   <span class="k2">;</span></td></tr><tr><td class="number">5</td><td>    __asm fistp r <span class="k2">;</span></td></tr><tr><td class="number">6</td><td>    <span class="k1">return</span> r<span class="k2">;</span></td></tr><tr><td class="number">7</td><td>  <span class="k2">}</span></td></tr><tr><td class="number">8</td><td>  <span class="k1">int</span> iround_down<span class="k2">(</span><span class="k1">double</span> a<span class="k2">)</span> <span class="k2">{</span><span class="k1">return</span> iround<span class="k2">(</span>std::floor<span class="k2">(</span>a<span class="k2">)</span><span class="k2">)</span><span class="k2">;</span><span class="k2">}</span></td></tr><tr><td class="number">9</td><td>  <span class="k1">int</span> iround_up  <span class="k2">(</span><span class="k1">double</span> a<span class="k2">)</span> <span class="k2">{</span><span class="k1">return</span> iround<span class="k2">(</span>std::ceil<span class="k2">(</span>a<span class="k2">)</span><span class="k2">)</span><span class="k2">;</span><span class="k2">}</span>  </td></tr><tr><td class="number">10</td><td><span class="p">#elif (defined USE_ASM_FLOAT_CONVERSION) &amp;&amp; (defined __GNUC__) &amp;&amp; (defined __i386__)</span></td></tr><tr><td class="number">11</td><td><span class="c">//gcc version (x86 only)</span></td></tr><tr><td class="number">12</td><td>  <span class="k1">int</span> iround<span class="k2">(</span> <span class="k1">double</span> a <span class="k2">)</span> <span class="k2">{</span></td></tr><tr><td class="number">13</td><td>    <span class="k1">int</span> r<span class="k2">;</span></td></tr><tr><td class="number">14</td><td>    __asm__ __volatile__<span class="k2">(</span> </td></tr><tr><td class="number">15</td><td>      <span class="s">"fldl %1    \n\t"</span></td></tr><tr><td class="number">16</td><td>      <span class="s">"fistpl %0  \n\t"</span></td></tr><tr><td class="number">17</td><td>      <span class="k2">:</span> <span class="s">"=m"</span> <span class="k2">(</span>r<span class="k2">)</span></td></tr><tr><td class="number">18</td><td>      <span class="k2">:</span> <span class="s">"m"</span> <span class="k2">(</span>a<span class="k2">)</span></td></tr><tr><td class="number">19</td><td>    <span class="k2">)</span><span class="k2">;</span></td></tr><tr><td class="number">20</td><td>    <span class="k1">return</span> r<span class="k2">;</span></td></tr><tr><td class="number">21</td><td>  <span class="k2">}</span></td></tr><tr><td class="number">22</td><td>  <span class="k1">int</span> iround_down<span class="k2">(</span><span class="k1">double</span> a<span class="k2">)</span> <span class="k2">{</span><span class="k1">return</span> iround<span class="k2">(</span>std::floor<span class="k2">(</span>a<span class="k2">)</span><span class="k2">)</span><span class="k2">;</span><span class="k2">}</span></td></tr><tr><td class="number">23</td><td>  <span class="k1">int</span> iround_up  <span class="k2">(</span><span class="k1">double</span> a<span class="k2">)</span> <span class="k2">{</span><span class="k1">return</span> iround<span class="k2">(</span>std::ceil<span class="k2">(</span>a<span class="k2">)</span><span class="k2">)</span><span class="k2">;</span><span class="k2">}</span></td></tr></tbody></table></div></div><p>

Benchmark that against a regular C float-&gt;int cast, see if it helps.  </p><p>edit: note, for supporting sin and cos, you can either add a constant before the masking or you can use a table 25% larger and add a constant after the masking, depending upon whether you want to reduce instruction count to the minimum or save a little extra space (memory, cache, and TLB usage).  Also, for your table size, there&#39;s no real advantage to using doubles instead of floats for the table.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (orz)</author>
		<pubDate>Thu, 09 Nov 2006 12:18:04 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><p>Didn&#39;t most newer CPU&#39;s have special instruction that returned sin/cos value from HW table? I think I saw something about it in the Intel architecture manuals.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (HoHo)</author>
		<pubDate>Thu, 09 Nov 2006 13:54:43 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><p>You might want to use functions from FFTW: </p><p><a href="http://www.fftw.org/">http://www.fftw.org/</a></p><p>Which has superfast implementations of sin and cos, but also much more... </p><p>Edit: Beware, it&#39;s GPL!
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (tobing)</author>
		<pubDate>Thu, 09 Nov 2006 14:03:27 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><p><b>orz:</b> I tired implementing your code in place of the integer type-casting. There was a calculable <i>drop</i> in speed. My benchmark (100,000,000 sin-table operations with no compiler optimizations) went from about 4.5 seconds to complete its calculations to 6 seconds. Though that&#39;s still better than the 7.8 seconds the regular sin() and cos() routines take.</p><p>In case it makes a difference (since your code was assembler after all) I am using MSVC 6.0 under Windows 98SE. My processor is an AMD Sempron 2.0 GHz.</p><p>And I&#39;m only using doubles because the sin() and cos() routines do. I plan to switch to regular floats when I add this to my game engine.</p><p><b>tobing:</b> I always prefer to do things myself. A testament to that is the random number generator I made which I use in my games.</p><p>--- Kris Asick (Gemini)<br />--- <a href="http://www.pixelships.com">http://www.pixelships.com</a>
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Kris Asick)</author>
		<pubDate>Thu, 09 Nov 2006 16:26:27 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><div class="quote_container"><div class="title">Quote:</div><div class="quote"><p>
... with no compiler optimizations
</p></div></div><p>Why don&#39;t you allow compiler optimizations? It&#39;s kind of stupid to not let compilers do their work. After all un-optimized code is not what you will use in release, is it?
</p><div class="quote_container"><div class="title">Quote:</div><div class="quote"><p>
I am using MSVC 6.0
</p></div></div><p>That might explain some things. This compiler is one of the slowest there is. I suggest getting GCC 4.1 (MinGW) and do the benchmarks again.
</p><div class="quote_container"><div class="title">Quote:</div><div class="quote"><p>
And I&#39;m only using doubles because the sin() and cos() routines do
</p></div></div><p>For floats there is sinf and cosf that should be quite a bit faster than their double counterparts. I for one haven&#39;t yet seen a place where floats don&#39;t give enough percision, especially as you only use x87. I would understand that the 32bit SSE floats can cause some problems when you are not careful but even those are usable if you are careful.</p><p>Also if you decide to use gcc you can try out some of its <a href="http://gcc.gnu.org/onlinedocs/gcc-4.1.1/gcc/Other-Builtins.html#Other-Builtins">builtin</a> functions that should map directly to machine instructions possibly providing significant boost to the standard library implementation.</p><p>Also I would like to see your benchmark program. It would be interesting to see how it behaves on different CPU&#39;s.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (HoHo)</author>
		<pubDate>Thu, 09 Nov 2006 17:10:17 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><div class="quote_container"><div class="title">Quote:</div><div class="quote"><p>
I always prefer to do things myself.
</p></div></div><p>
Uh-oh. Good luck with that. I would go nowhere, if I had to do all for myself, without relying on powerful libraries, allegro being just one of many. I would also not be keen to write my own compiler, os etc...
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (tobing)</author>
		<pubDate>Thu, 09 Nov 2006 17:33:52 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><div class="quote_container"><div class="title">Quote:</div><div class="quote"><p>
: I tired implementing your code in place of the integer type-casting. There was a calculable drop in speed. My benchmark (100,000,000 sin-table operations with no compiler optimizations) went from about 4.5 seconds to complete its calculations to 6 seconds. Though that&#39;s still better than the 7.8 seconds the regular sin() and cos() routines take.
</p></div></div><p>
Benchmarking should be done with full optimization enabled.  The limiting factors in debug mode can be wildly different than the limiting factors in release mode in some cases.  Of course, you also need to take care that your code does something that can&#39;t be optimized away to nothing, otherwise release mode will be even further from what you want that debug mode.  <br />When I benchmark, I get 26.5 nanoseconds per sin() with your code, 18.0 nanoseconds with my code, and 72.9 nanoseconds with the regular libc sin().  That works out to 2.65 seconds, 1.80 seconds, and 7.29 seconds for each respectively, for the 100 million operations you used.  When I add interpolation, I end up with 33.9 nanoseconds per operation.  </p><div class="quote_container"><div class="title">Quote:</div><div class="quote"><p>
In case it makes a difference (since your code was assembler after all) I am using MSVC 6.0 under Windows 98SE. My processor is an AMD Sempron 2.0 GHz.
</p></div></div><p>
I used MSVC 7.1 under win2k on an Athlon XP 2600+.  </p><div class="quote_container"><div class="title">Quote:</div><div class="quote"><p>
And I&#39;m only using doubles because the sin() and cos() routines do. I plan to switch to regular floats when I add this to my game engine.
</p></div></div><p>
The table itself can be floats without effecting the type of the return value of your lookup function.  </p><p>I should further note that I prefer to not use lookup table sin values in my own code, except when I&#39;m doing everything in fixed point math.  I avoid them because the imprecision, while minor most of the time, sometimes becomes quite relevant, and such times may not always be immediately obvious ; I consider it not worth the effort involved in figuring out when sin tables won&#39;t hurt, so I never use them in regular code.  A more accurate version can be done using table lookups and interpolating between adjacent table entries, but that is loses much of the speed advantage and still, on very rare occaisons, has precision issues.  Instead, to optimize my sin() and cos() calls, I use the fsincos opcode, which calculates both values with a single opcode, and is faster than calling regular sin() and cos(), though not nearly as fast as simple table lookups (almost as fast as interpolated lookups though):
</p><div class="source-code"><div class="toolbar"></div><div class="inner"><table width="100%"><tbody><tr><td class="number">1</td><td>Vector2 unit_vector <span class="k2">(</span> Angle angle <span class="k2">)</span> <span class="k2">{</span></td></tr><tr><td class="number">2</td><td>  Vector2 r<span class="k2">;</span></td></tr><tr><td class="number">3</td><td>&#160;</td></tr><tr><td class="number">4</td><td><span class="c">//now the normal processor detection</span></td></tr><tr><td class="number">5</td><td><span class="c">//and various platform specific vesions</span></td></tr><tr><td class="number">6</td><td>&#160;</td></tr><tr><td class="number">7</td><td><span class="p">#  if defined (__i386__) &amp;&amp; !defined (NO_ASM)</span></td></tr><tr><td class="number">8</td><td><span class="p">#    if defined __GNUC__</span></td></tr><tr><td class="number">9</td><td><span class="p">#      define ASM_SINCOS</span></td></tr><tr><td class="number">10</td><td>      <span class="k1">asm</span> <span class="k2">(</span><span class="s">"fsincos"</span> <span class="k2">:</span> <span class="s">"=t"</span> <span class="k2">(</span>r.x<span class="k2">)</span>, <span class="s">"=u"</span> <span class="k2">(</span>r.y<span class="k2">)</span> <span class="k2">:</span> <span class="s">"0"</span> <span class="k2">(</span>angle.radians<span class="k2">(</span><span class="k2">)</span><span class="k2">)</span><span class="k2">)</span><span class="k2">;</span></td></tr><tr><td class="number">11</td><td><span class="p">#    elif defined _MSC_VER</span></td></tr><tr><td class="number">12</td><td><span class="p">#      define ASM_SINCOS</span></td></tr><tr><td class="number">13</td><td>      <span class="k1">double</span> a <span class="k3">=</span> angle.radians<span class="k2">(</span><span class="k2">)</span><span class="k2">;</span></td></tr><tr><td class="number">14</td><td>      __asm fld a</td></tr><tr><td class="number">15</td><td>      __asm fsincos</td></tr><tr><td class="number">16</td><td>      __asm fstp r.x</td></tr><tr><td class="number">17</td><td>      __asm fstp r.y</td></tr><tr><td class="number">18</td><td><span class="p">#    endif</span></td></tr><tr><td class="number">19</td><td><span class="p">#  endif</span></td></tr><tr><td class="number">20</td><td>&#160;</td></tr><tr><td class="number">21</td><td><span class="c">//and the fall-back version in C</span></td></tr><tr><td class="number">22</td><td>&#160;</td></tr><tr><td class="number">23</td><td><span class="p">#  ifndef ASM_SINCOS</span></td></tr><tr><td class="number">24</td><td>    <span class="k1">double</span> a <span class="k3">=</span> angle.radians<span class="k2">(</span><span class="k2">)</span><span class="k2">;</span></td></tr><tr><td class="number">25</td><td>    r.x <span class="k3">=</span> <a href="http://www.delorie.com/djgpp/doc/libc/libc_113.html" target="_blank">cos</a><span class="k2">(</span>a<span class="k2">)</span><span class="k2">;</span></td></tr><tr><td class="number">26</td><td>    r.y <span class="k3">=</span> <a href="http://www.delorie.com/djgpp/doc/libc/libc_728.html" target="_blank">sin</a><span class="k2">(</span>a<span class="k2">)</span><span class="k2">;</span></td></tr><tr><td class="number">27</td><td><span class="p">#  endif</span></td></tr><tr><td class="number">28</td><td>  <span class="k1">return</span> r<span class="k2">;</span></td></tr><tr><td class="number">29</td><td><span class="k2">}</span></td></tr></tbody></table></div></div><p>
note that MSVC does not by default define the <u>_i386</u>_ symbol ; I don&#39;t know how you&#39;re supposed to recognize target instruction set on that compiler, so I just define it myself.  </p><p>edit:
</p><div class="quote_container"><div class="title">Quote:</div><div class="quote"><p>
For floats there is sinf and cosf that should be quite a bit faster than their double counterparts. I for one haven&#39;t yet seen a place where floats don&#39;t give enough percision, especially as you only use x87. I would understand that the 32bit SSE floats can cause some problems when you are not careful but even those are usable if you are careful.
</p></div></div><p>
I have never heard of sinf or cosf before.  These are standard, portable?  My benchmarking finds them to be identical in speed to regular sin() and cos() on my compiler.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (orz)</author>
		<pubDate>Thu, 09 Nov 2006 18:26:20 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><p><b>HoHo:</b></p><p>1. MSVC vs. GCC is one of those arguments I am not going to be dragged into. It&#39;s like comparing MS Word to Wordperfect.</p><p>2. sinf() benchmarks at about 11 seconds using my program... worse than sin().</p><p>3. All my benchmark program is is a for-loop and the command who&#39;s speed I wish to test, being assigned to a variable. My program doesn&#39;t actually do the timing part, I time it myself using a stopwatch a few times then figure out the average. It was faster to program that way and at the time, I wanted it programmed in a few minutes using the console instead of a half hour using Allegro&#39;s timer functions.</p><p><b>tobing:</b></p><p>Well, obviously I use Allegro, but the only thing stopping me from learning to write my own library of functions is the fact that I don&#39;t have access to enough variations of computers to test such an implementation, whereas Allegro has been through its paces on so many computers I&#39;ve never heard of my games crashing horribly on anyone&#39;s system due to Allegro&#39;s fault. (Let alone my own actually.)</p><p><b>orz:</b></p><p>MSVC is optimizing the area of my benchmark that is just a fixed command over and over to only process once, thus no matter how many times I loop for, it&#39;s only taking a split second to execute with either the built-in commands or the table commands, thus it&#39;s making it impossible for me to figure out how my functions will work when optimized.</p><p>--- Kris Asick (Gemini)<br />--- <a href="http://www.pixelships.com">http://www.pixelships.com</a>
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Kris Asick)</author>
		<pubDate>Thu, 09 Nov 2006 18:34:12 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><div class="quote_container"><div class="title">HoHo said:</div><div class="quote"><p>
I suggest getting GCC 4.1 (MinGW) and do the benchmarks again.
</p></div></div><p>
Err... there is no GCC 4.1 for MinGW.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (gnolam)</author>
		<pubDate>Thu, 09 Nov 2006 18:45:24 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><div class="quote_container"><div class="title">Quote:</div><div class="quote"><p>
MSVC is optimizing the area of my benchmark that is just a fixed command over and over to only process once, thus no matter how many times I loop for, it&#39;s only taking a split second to execute with either the built-in commands or the table commands, thus it&#39;s making it impossible for me to figure out how my functions will work when optimized.
</p></div></div><p>
The problem is that your code is written in such way that discarding your calls is legal for the compiler, because your calls don&#39;t actually do anything except set a variable which is never looked at.  The trick is to do something such that the code cannot behave correctly without calling your function the correct number of times.  Trying adding up all return values and printing the sum.  This is a sample of the code I benchmarked with:
</p><div class="source-code snippet"><div class="inner"><pre>  sum <span class="k3">=</span> <span class="n">3</span>.<span class="n">1</span><span class="k2">;</span>
  <span class="k1">for</span> <span class="k2">(</span><span class="k1">int</span> i <span class="k3">=</span> <span class="n">0</span><span class="k2">;</span> i <span class="k3">&lt;</span> n<span class="k2">;</span> i<span class="k3">+</span><span class="k3">+</span><span class="k2">)</span> <span class="k2">{</span>
    <span class="k1">for</span> <span class="k2">(</span><span class="k1">int</span> j <span class="k3">=</span> <span class="n">0</span><span class="k2">;</span> j <span class="k3">&lt;</span> <span class="n">1000</span><span class="k2">;</span> j<span class="k3">+</span><span class="k3">+</span><span class="k2">)</span> <span class="k2">{</span>
      sum <span class="k3">+</span><span class="k3">=</span> sinf<span class="k2">(</span>sum<span class="k2">)</span><span class="k2">;</span>
      sum <span class="k3">+</span><span class="k3">=</span> sinf<span class="k2">(</span>sum<span class="k2">)</span><span class="k2">;</span>
      sum <span class="k3">+</span><span class="k3">=</span> sinf<span class="k2">(</span>sum<span class="k2">)</span><span class="k2">;</span>
      sum <span class="k3">+</span><span class="k3">=</span> sinf<span class="k2">(</span>sum<span class="k2">)</span><span class="k2">;</span>
    <span class="k2">}</span>
  <span class="k2">}</span>
  <span class="k1">double</span> time4 <span class="k3">=</span> get_time2<span class="k2">(</span><span class="k2">)</span><span class="k2">;</span>
  <span class="k1">double</span> dt3 <span class="k3">=</span> <span class="k2">(</span>time4 <span class="k3">-</span> time3<span class="k2">)</span> <span class="k3">/</span> <span class="k2">(</span>n <span class="k3">*</span> <span class="n">4000</span>.<span class="n">0</span><span class="k2">)</span> <span class="k3">*</span> <span class="n">1000000</span><span class="k2">;</span>
  <a href="http://www.delorie.com/djgpp/doc/libc/libc_624.html" target="_blank">printf</a><span class="k2">(</span><span class="s">"algorithm3: %.3f ns / sin; (check:%.9f)\n"</span>, dt3, sum<span class="k2">)</span><span class="k2">;</span>
</pre></div></div><p>

</p><div class="quote_container"><div class="title">Quote:</div><div class="quote"><p>
1. MSVC vs. GCC is one of those arguments I am not going to be dragged into. It&#39;s like comparing MS Word to Wordperfect.
</p></div></div><p>
It&#39;s not just MSVC vs GCC.  MSVC 6.0 is old, and extremely buggy if you use C++ templates beyond STL.  If you don&#39;t use C++ templates, it&#39;s probably adequate.  MSVC 7.1 is much less buggy in that regard, as are most recent and semi-recent versions of GCC.  edit: more recent MSVCs and GCCs are also faster than MSVC 6.0</p><p>edit: just for fun, this is the output of my current code:
</p><div class="source-code snippet"><div class="inner"><pre>                                Get_SinT <span class="n">27</span>.<span class="n">525</span> ns <span class="k3">/</span> call<span class="k2">;</span> <span class="k2">(</span>check:2.510948387<span class="k2">)</span>
                      get_sin_via_table2 <span class="n">18</span>.<span class="n">895</span> ns <span class="k3">/</span> call<span class="k2">;</span> <span class="k2">(</span>check:1.646245552<span class="k2">)</span>
   get_sin_via_table2_with_interpolation <span class="n">34</span>.<span class="n">012</span> ns <span class="k3">/</span> call<span class="k2">;</span> <span class="k2">(</span>check:2.000266046<span class="k2">)</span>
                                   _sinf <span class="n">74</span>.<span class="n">413</span> ns <span class="k3">/</span> call<span class="k2">;</span> <span class="k2">(</span>check:2.000186454<span class="k2">)</span>
                                     <a href="http://www.delorie.com/djgpp/doc/libc/libc_728.html" target="_blank">sin</a> <span class="n">84</span>.<span class="n">446</span> ns <span class="k3">/</span> call<span class="k2">;</span> <span class="k2">(</span>check:2.000186454<span class="k2">)</span>
                                _fsincos <span class="n">95</span>.<span class="n">225</span> ns <span class="k3">/</span> call<span class="k2">;</span> <span class="k2">(</span>check:2.000186454<span class="k2">)</span>
</pre></div></div><p>
I&#39;ve modified it to make imprecisions in the results more obvious in the &quot;check&quot; value, and to simplify the code and make the output more readable.  Notice how the top two (your original code and a modification of it to use inline assembly float-&gt;int conversion) produce large deviations, while the interpolating version (which also uses asm float-&gt;int) produces small deviations, and the remaining three produce identical results?  Also notice that _fsincos is only slightly slower than sin despite calculating both sin and cos.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (orz)</author>
		<pubDate>Thu, 09 Nov 2006 18:49:36 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><p>Well, at the moment, MSVC 6.0 is what I use. I&#39;ve never had any problems with it and to use any higher version I would have to upgrade my OS as well.</p><p>And in the past, GCC hated me. Partly why I stuck with MSVC.</p><p>Making the sin/cos table larger would probably help reduce the deviation values without having to add that extra little overhead of interpolation. After all, a 4096 entry table only takes up 16 KB of memory using floats and 32 KB using doubles. Making the table 16,384 entries big would only take 64 KB or 128 KB, which is still not that much when you consider MBs worth of graphics, sounds and music.</p><p>And I get what you mean by what the compiler&#39;s doing with optimizations. Actually, I never really paid much attention to optimizations. I just normally turn &#39;em to full-speed and let them do their thing. <img src="http://www.allegro.cc/forums/smileys/wink.gif" alt=";)" /></p><p>--- Kris Asick (Gemini)<br />--- <a href="http://www.pixelships.com">http://www.pixelships.com</a>
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Kris Asick)</author>
		<pubDate>Thu, 09 Nov 2006 20:37:51 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><p>Not trying to derail your thread Kris, but...</p><div class="quote_container"><div class="title">HoHo said:</div><div class="quote"><p>
That might explain some things. This compiler is one of the slowest there is. I suggest getting GCC 4.1 (MinGW) and do the benchmarks again.
</p></div></div><p>

I recently attempted to switch from MSVC6 to the latest mingw32 package. My current game which uses a lot of sin/cos and software polygon rendering experienced a noticeable drop in framerate, in the neighbourhood of 10FPS, with mingw32. </p><p>Otherwise, the switch went well, using MinGW Visual Studio which is a lot like MVC6.</p><p>Unless I can see some metrics on the supposed superiority of mingw32 over MSVC6, and an explanation on how to access this superior performance, I&#39;ll be sticking with MSVC6. <img src="http://www.allegro.cc/forums/smileys/smiley.gif" alt=":)" /></p><p>STL and C++ I avoid like the plague anyway, so MSVC6 works perfectly fine.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Paul Pridham)</author>
		<pubDate>Thu, 09 Nov 2006 20:46:59 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><p>So, what happens if you supply a value outside the range [0..2pi]? What precision do you <b>really</b> get with a table lookup?
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Bob)</author>
		<pubDate>Thu, 09 Nov 2006 21:29:34 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><p><b>Kris Asick</b>:<br />With tables size 16384 instead of 4096:
</p><div class="source-code snippet"><div class="inner"><pre>                                Get_SinT <span class="n">26</span>.<span class="n">865</span> ns <span class="k3">/</span> call<span class="k2">;</span> <span class="k2">(</span>check:2.141740928<span class="k2">)</span>
                      get_sin_via_table2 <span class="n">18</span>.<span class="n">751</span> ns <span class="k3">/</span> call<span class="k2">;</span> <span class="k2">(</span>check:1.982911766<span class="k2">)</span>
   get_sin_via_table2_with_interpolation <span class="n">33</span>.<span class="n">851</span> ns <span class="k3">/</span> call<span class="k2">;</span> <span class="k2">(</span>check:2.000186454<span class="k2">)</span>
                                   _sinf <span class="n">72</span>.<span class="n">959</span> ns <span class="k3">/</span> call<span class="k2">;</span> <span class="k2">(</span>check:2.000186454<span class="k2">)</span>
                                     <a href="http://www.delorie.com/djgpp/doc/libc/libc_728.html" target="_blank">sin</a> <span class="n">85</span>.<span class="n">588</span> ns <span class="k3">/</span> call<span class="k2">;</span> <span class="k2">(</span>check:2.000186454<span class="k2">)</span>
                                _fsincos <span class="n">96</span>.<span class="n">583</span> ns <span class="k3">/</span> call<span class="k2">;</span> <span class="k2">(</span>check:2.000186454<span class="k2">)</span>
</pre></div></div><p>
The deviations by this measure got noticably smaller.  In fact, for the interpolating version, the deviation disappeared.  But, I am not a believe in large table sizes, at least for general purpose things; they may do fine on benchmarks until they start overflowing the cache, but in real world circumstances there&#39;s other stuff that wants a share of the cache too.  </p><p> <br /><b>ppridham</b>: edit: why does this showing up on the wrong line if I don&#39;t add a spurious space?<br />I only see bugs in MSVC 6 when using C++; for C it seems to produce correct code.  I do find GCC 3.3.x faster than MSVC 6 in most, though not all, cases.  I haven&#39;t tried GCC 4.* yet.  In a complete program the comparison may be more difficult than in a simpler benchmark due to libraries complicating the picture.  </p><p><b>Bob</b>:</p><div class="quote_container"><div class="title">Quote:</div><div class="quote"><p>
So, what happens if you supply a value outside the range [0..2pi]? What precision do you <b>really</b> get with a table lookup?
</p></div></div><p>
edit: corrected these results, and the description; a bug initially had the table based stuff performing better<br />It&#39;s comparable to the normal sin()/cos() in that regard.  Except the interpolating version I wrote, which barfs after a bit; but perhaps that&#39;s fixable.  A version modified to show check values with various biases applied:<br />zero bias: 
</p><div class="source-code snippet"><div class="inner"><pre>                                Get_SinT   <span class="n">29</span>.<span class="n">411</span> ns <span class="k3">/</span> call<span class="k2">;</span> check:1.015412770
                      get_sin_via_table2   <span class="n">22</span>.<span class="n">424</span> ns <span class="k3">/</span> call<span class="k2">;</span> check:0.738985208
   get_sin_via_table2_with_interpolation   <span class="n">36</span>.<span class="n">800</span> ns <span class="k3">/</span> call<span class="k2">;</span> check:0.778233193
                                     <a href="http://www.delorie.com/djgpp/doc/libc/libc_728.html" target="_blank">sin</a>   <span class="n">89</span>.<span class="n">889</span> ns <span class="k3">/</span> call<span class="k2">;</span> check:0.778233193
</pre></div></div><p>
bias: PI * 256
</p><div class="source-code snippet"><div class="inner"><pre>                                Get_SinT   <span class="n">31</span>.<span class="n">876</span> ns <span class="k3">/</span> call<span class="k2">;</span> check:1.015412770
                      get_sin_via_table2   <span class="n">23</span>.<span class="n">182</span> ns <span class="k3">/</span> call<span class="k2">;</span> check:0.738985208
   get_sin_via_table2_with_interpolation   <span class="n">39</span>.<span class="n">232</span> ns <span class="k3">/</span> call<span class="k2">;</span> check:0.778233193
                                     <a href="http://www.delorie.com/djgpp/doc/libc/libc_728.html" target="_blank">sin</a>   <span class="n">92</span>.<span class="n">577</span> ns <span class="k3">/</span> call<span class="k2">;</span> check:0.778233193
</pre></div></div><p>bias: PI * 65536
</p><div class="source-code snippet"><div class="inner"><pre>                                Get_SinT   <span class="n">29</span>.<span class="n">303</span> ns <span class="k3">/</span> call<span class="k2">;</span> check:1.015412770
                      get_sin_via_table2   <span class="n">22</span>.<span class="n">898</span> ns <span class="k3">/</span> call<span class="k2">;</span> check:0.738985208
   get_sin_via_table2_with_interpolation   <span class="n">36</span>.<span class="n">886</span> ns <span class="k3">/</span> call<span class="k2">;</span> check:0.778233179
                                     <a href="http://www.delorie.com/djgpp/doc/libc/libc_728.html" target="_blank">sin</a>   <span class="n">93</span>.<span class="n">601</span> ns <span class="k3">/</span> call<span class="k2">;</span> check:0.778233353
</pre></div></div><p>bias: PI * 4294967296
</p><div class="source-code snippet"><div class="inner"><pre>                                Get_SinT   <span class="n">29</span>.<span class="n">197</span> ns <span class="k3">/</span> call<span class="k2">;</span> check:0.962381428
                      get_sin_via_table2   <span class="n">22</span>.<span class="n">253</span> ns <span class="k3">/</span> call<span class="k2">;</span> check:4.964428596
   get_sin_via_table2_with_interpolation   <span class="n">36</span>.<span class="n">993</span> ns <span class="k3">/</span> call<span class="k2">;</span> check:-1.#IND00000
                                     <a href="http://www.delorie.com/djgpp/doc/libc/libc_728.html" target="_blank">sin</a>   <span class="n">98</span>.<span class="n">563</span> ns <span class="k3">/</span> call<span class="k2">;</span> check:0.789689926
</pre></div></div><p>bias: PI * 1099511627776
</p><div class="source-code snippet"><div class="inner"><pre>                                Get_SinT   <span class="n">29</span>.<span class="n">090</span> ns <span class="k3">/</span> call<span class="k2">;</span> check:0.956744473
                      get_sin_via_table2   <span class="n">22</span>.<span class="n">055</span> ns <span class="k3">/</span> call<span class="k2">;</span> check:4.964428596
   get_sin_via_table2_with_interpolation   <span class="n">37</span>.<span class="n">330</span> ns <span class="k3">/</span> call<span class="k2">;</span> check:-1.#IND00000
                                     <a href="http://www.delorie.com/djgpp/doc/libc/libc_728.html" target="_blank">sin</a>  <span class="n">100</span>.<span class="n">243</span> ns <span class="k3">/</span> call<span class="k2">;</span> check:3.776719549
</pre></div></div><p>bias: PI * 281474976710656
</p><div class="source-code snippet"><div class="inner"><pre>                                Get_SinT   <span class="n">29</span>.<span class="n">282</span> ns <span class="k3">/</span> call<span class="k2">;</span> check:2.899386226
                      get_sin_via_table2   <span class="n">22</span>.<span class="n">070</span> ns <span class="k3">/</span> call<span class="k2">;</span> check:4.964428596
   get_sin_via_table2_with_interpolation   <span class="n">37</span>.<span class="n">951</span> ns <span class="k3">/</span> call<span class="k2">;</span> check:-1.#IND00000
                                     <a href="http://www.delorie.com/djgpp/doc/libc/libc_728.html" target="_blank">sin</a>  <span class="n">103</span>.<span class="n">442</span> ns <span class="k3">/</span> call<span class="k2">;</span> check:-4.451114322
</pre></div></div><p>
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (orz)</author>
		<pubDate>Thu, 09 Nov 2006 22:21:27 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><p>Just one remark:<br />lookup table based things work well only if they sit in L1 or at least L2 cache. When they go from L1 to L2 speed drops around 3-4x, if they go to main memory speed drops about 100x compared to L1 assuming that memory prefetch doesn&#39;t work. With AMD64 things are a bit better but not that much.</p><p>orz, could you attach your code? It would be interesting to test on other architectures and compilers. I can provide 32bit Core2Duo with GCC 4.1. I&#39;m too lazy to write it myself <img src="http://www.allegro.cc/forums/smileys/tongue.gif" alt=":P" /></p><p>If you want to know how badly MSVC sucks then go <a href="http://ompf.org/forum/index.php">here</a> and search for posts containing &quot;msvc&quot;. Also keep in mind that there they talk about MSVC8. According to MS it is a giant leap over MSVC6 <img src="http://www.allegro.cc/forums/smileys/wink.gif" alt=";)" />
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (HoHo)</author>
		<pubDate>Thu, 09 Nov 2006 22:47:42 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><div class="quote_container"><div class="title">Quote:</div><div class="quote"><p>
slow switch of rounding modes to enter that mode (cast float to int)
</p></div></div><p>

I don&#39;t have the code here on this computer, but you can multiply the number by some magic value and grab the int off the mantissa and mask off the fp exponent &amp; sign bits.  If this doesn&#39;t allow enough range, you can cast to a double to get more accuracy. (store to double is still faster than fistp)</p><div class="quote_container"><div class="title">Quote:</div><div class="quote"><p>
My current game which uses a lot of sin/cos and software polygon rendering experienced a noticeable drop in framerate, in the neighbourhood of 10FPS, with mingw32.
</p></div></div><p>

Using a particular compiler can be like using a particular language, you have to know your way around to get best performance.  Did you try -ffast-math, -fomit-frame-pointer, -O3, -march=pentium4, -fpumath=sse,387, and countless others?  I haven&#39;t been able to outrun (within 2%) Mingw 3.42 with the free VC++ download thing if I spent a half hour on a particular program.</p><div class="quote_container"><div class="title">Quote:</div><div class="quote"><p>
value outside the range [0..2pi]
</p></div></div><p>
Even the fpu sin/cos functions have limits, huge to be sure, but there nonetheless.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Arthur Kalliokoski)</author>
		<pubDate>Fri, 10 Nov 2006 00:21:15 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><p>If you still need help, you can use this:
</p><div class="source-code snippet"><div class="inner"><pre><span class="k1">for</span> <span class="k2">(</span><span class="k1">int</span> a <span class="k3">=</span> <span class="n">0</span><span class="k2">;</span> a <span class="k3">&lt;</span> <span class="n">360</span><span class="k2">;</span> a<span class="k3">+</span><span class="k3">+</span><span class="k2">)</span>
<span class="k2">{</span>
    sine<span class="k2">[</span>a<span class="k2">]</span> <span class="k3">=</span> <a href="http://www.delorie.com/djgpp/doc/libc/libc_728.html" target="_blank">sin</a><span class="k2">(</span>a<span class="k3">/</span>RAC<span class="k2">)</span><span class="k2">;</span>
    cosine<span class="k2">[</span>a<span class="k2">]</span> <span class="k3">=</span> <a href="http://www.delorie.com/djgpp/doc/libc/libc_113.html" target="_blank">cos</a><span class="k2">(</span>a<span class="k3">/</span>RAC<span class="k2">)</span><span class="k2">;</span>
<span class="k2">}</span>
</pre></div></div><p>
just declare <span class="source-code"><span class="k1">float</span> sine<span class="k2">[</span><span class="n">360</span><span class="k2">]</span>, cosine<span class="k2">[</span><span class="n">360</span><span class="k2">]</span><span class="k2">;</span></span> and <span class="source-code"><span class="p">#define RAC 57.32484076</span></span></p><p>It works for anything exept fractions of degrees, and is quite fast. <img src="http://www.allegro.cc/forums/smileys/smiley.gif" alt=":)" />
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Kikaru)</author>
		<pubDate>Fri, 10 Nov 2006 01:57:30 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><div class="quote_container"><div class="title">Quote:</div><div class="quote"><p>
lookup table based things work well only if they sit in L1 or at least L2 cache. When they go from L1 to L2 speed drops around 3-4x, if they go to main memory speed drops about 100x compared to L1 assuming that memory prefetch doesn&#39;t work. With AMD64 things are a bit better but not that much.
</p></div></div><p>
Indeed.  And I forgot to make sure my memory accesses were unpredictable.  My performance numbers were based upon the assumption that memory access mostly went to the L1 (though, on an Athlon-based system, L1 is big enough to fit decent tables; on an intel system IIRC L1 is generally smaller but the speed hit for dropping to L2 is a lot better).  Unfortunately, it&#39;s hard to make sure it can&#39;t prefetch my memory without sticking in frequent random number generator calls, which slows things down a bit independantly of the trig speed.  </p><div class="quote_container"><div class="title">Quote:</div><div class="quote"><p>
I don&#39;t have the code here on this computer, but you can multiply the number by some magic value and grab the int off the mantissa and mask off the fp exponent &amp; sign bits. If this doesn&#39;t allow enough range, you can cast to a double to get more accuracy. (store to double is still faster than fistp)
</p></div></div><p>
I don&#39;t really like that method, as the behavior is kinda stupid when the input goes outside of the intended range.  But, it does seem to improve speed a little more.  </p><div class="source-code snippet"><div class="inner"><pre><span class="p">#elif defined USE_BINARY_FLOAT_CONVERSION</span>
<span class="p">#  define BIGDOUBLE 6755399441055744.0</span>
  <span class="k1">int</span> iround<span class="k2">(</span><span class="k1">double</span> a<span class="k2">)</span> <span class="k2">{</span>
    <span class="k1">int</span> i<span class="k2">;</span>
    a <span class="k3">=</span> a <span class="k3">+</span> <span class="k2">(</span>BIGDOUBLE <span class="k3">+</span> <span class="n">0</span><span class="k2">)</span><span class="k2">;</span>
    i <span class="k3">=</span> <span class="k3">*</span><span class="k2">(</span><span class="k2">(</span><span class="k1">int</span><span class="k3">*</span><span class="k2">)</span><span class="k3">&amp;</span>a<span class="k2">)</span><span class="k2">;</span>
    <span class="k1">return</span> i<span class="k2">;</span>
  <span class="k2">}</span>
</pre></div></div><p>

</p><div class="quote_container"><div class="title">Quote:</div><div class="quote"><p>
orz, could you attach your code? It would be interesting to test on other architectures and compilers. I can provide 32bit Core2Duo with GCC 4.1. I&#39;m too lazy to write it myself
</p></div></div><p>
Done.  Be warned though, this is kinda a mess that was quickly assembled from pieces of several different projects.  There&#39;s a ton of code you&#39;ll just want to skip over to get to the relevant stuff.  </p><p>edit: The output on my computer from the code as posted:</p><div class="source-code snippet"><div class="inner"><pre>                                   dummy    <span class="n">5</span>.<span class="n">672</span> ns <span class="k3">/</span> call<span class="k2">;</span> check:2.1907804172
                                Get_SinT   <span class="n">33</span>.<span class="n">177</span> ns <span class="k3">/</span> call<span class="k2">;</span> check:1.7299634784
                      get_sin_via_table2   <span class="n">23</span>.<span class="n">802</span> ns <span class="k3">/</span> call<span class="k2">;</span> check:1.7279857059
        get_sin_via_table2_binary_iround   <span class="n">18</span>.<span class="n">555</span> ns <span class="k3">/</span> call<span class="k2">;</span> check:1.7279857059
   get_sin_via_table2_with_interpolation   <span class="n">36</span>.<span class="n">615</span> ns <span class="k3">/</span> call<span class="k2">;</span> check:1.7273836151
                                   _sinf   <span class="n">61</span>.<span class="n">737</span> ns <span class="k3">/</span> call<span class="k2">;</span> check:1.7273835937
                                     <a href="http://www.delorie.com/djgpp/doc/libc/libc_728.html" target="_blank">sin</a>   <span class="n">71</span>.<span class="n">317</span> ns <span class="k3">/</span> call<span class="k2">;</span> check:1.7273835937
                                _fsincos   <span class="n">82</span>.<span class="n">553</span> ns <span class="k3">/</span> call<span class="k2">;</span> check:1.7273835937
</pre></div></div><p>
Dummy does nothing except the overhead - it calls the random number generator, adds things up, but no sin/cos stuff.  <br />The table sizes are set to 65536.  <br />Get_SinT is the original code (except for the table size).<br />get_sin_via_table2 is that modified to use asm float-&gt;int conversion<br />get_sin_via_table2_binary_iround is that modified to use binary float-&gt;int conversion (reinterpret the floating point memory as integers w/ silly tricks)<br />get_sin_via_table2_with_interpolation uses the asm float-&gt;int, and interpolates between two entries in the table for high accuracy<br />_sinf calls sinf<br />sin is just plain sin() from the standard library<br />_fsincos uses inline asm to call the fsincos opcode
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (orz)</author>
		<pubDate>Fri, 10 Nov 2006 02:13:45 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><p>Does any of these benchmarks take in account that such lookup tables rely on the CPU cache...
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Fladimir da Gorf)</author>
		<pubDate>Fri, 10 Nov 2006 16:10:05 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><div class="quote_container"><div class="title">Quote:</div><div class="quote"><p>
Does any of these benchmarks take in account that such lookup tables rely on the CPU cache...
</p></div></div><p>
The timing numbers in the last set include plenty of L1 misses.  It doesn&#39;t include many L2 misses though, as even the 256 kilobyte table fits into the L2 cache footprint.  The L1 miss rate is worse than real world in most cases, since it&#39;s largely based upon randomly generated angles.  The L2 miss rate is better than real world in most cases, since no major other program data is used simultaneously.  If you want a graph of log(table size) vs performance ala linpack you could simply adjust the table size and run again a few times.
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (orz)</author>
		<pubDate>Fri, 10 Nov 2006 17:53:30 +0000</pubDate>
	</item>
	<item>
		<description><![CDATA[<div class="mockup v2"><p>Not quite. Most real-world applications do a lot of other stuff between calls to sin() or cos(), and the more that happens, the more cache misses you get.<br />To make matters worse, a lookup table that sits in L1 or L2 occupies cache space needed by other things, and may even slow down the performance of other calculations that rely on memory access speed.<br />Try benchmarking something real-world, like FFT (lends itself due to excessive usage of sin calls).
</p></div>]]>
		</description>
		<author>no-reply@allegro.cc (Tobias Dammers)</author>
		<pubDate>Fri, 10 Nov 2006 20:37:05 +0000</pubDate>
	</item>
</rss>
