Fastest Sine / Cosine Table Possible
Kris Asick

NOTE: Accidentally put this here instead of under "Programming Questions"... might want to move it...

After some experimenting I've deduced that the following is the fastest way to implement sine/cosine tables through a function that is interchangeable with the built-in sin() and cos() functions included in MATH.H.

// NOTE: Code to build sine table not included in this excerpt.

#define SIN_TABLE_SIZE    4096
#define SIN_TABLE_BITMASK (SIN_TABLE_SIZE - 1)  // 4095; requires a power-of-two table size

double _sin_t[SIN_TABLE_SIZE];

double Get_SinT (double d)
{
    // 651.8986469 = SIN_TABLE_SIZE / (2 * PI), i.e. radians to table index
    return _sin_t[(int)(d * 651.8986469f) & SIN_TABLE_BITMASK];
}
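For completeness, here is a minimal sketch of the table-build step the excerpt omits (my reconstruction, assuming entry i covers the angle i * 2*PI / SIN_TABLE_SIZE, which matches the 651.8986469 ≈ SIN_TABLE_SIZE / 2π scale factor in the lookup):

```cpp
#include <cmath>

#define SIN_TABLE_SIZE    4096
#define SIN_TABLE_BITMASK (SIN_TABLE_SIZE - 1)

static double _sin_t[SIN_TABLE_SIZE];

// Hypothetical build step: entry i holds the sine of the i-th slice of the circle.
void Build_SinT (void)
{
    for (int i = 0; i < SIN_TABLE_SIZE; i++)
        _sin_t[i] = sin(i * 6.283185307179586 / SIN_TABLE_SIZE);
}

double Get_SinT (double d)
{
    // 651.8986469 ~= SIN_TABLE_SIZE / (2 * PI)
    return _sin_t[(int)(d * 651.8986469f) & SIN_TABLE_BITMASK];
}
```

With 4096 entries the table step is about 0.0015 radians, so the raw truncating lookup is good to roughly three decimal places.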

My tests showed this routine to be almost twice as fast as the normal sin() function with no compiler optimizations enabled.

Does anyone have any ideas how to make this even faster, or is that about as fast as it's going to get while still being compatible with radians?

--- Kris Asick (Gemini)
--- http://www.pixelships.com

orz

You cast a floating point value to an integer. C requires that cast to use round-towards-zero rules. Most x86 hardware, last I checked, required a slow switch of rounding modes to enter that mode, then a float->int conversion, then a slow switch back to normal mode (round-to-nearest, I think). Possibly that's been improved in one of the more recent instruction set extensions, I don't know. Some compilers support a rint() or something for float->int conversion with different rounding rules, but I can't find it atm. My inline asm code for float->int conversion:

//MSVC version (x86 only)
int iround(double a) {
    int r;
    __asm fld a;
    __asm fistp r;
    return r;
}
int iround_down(double a) {return iround(std::floor(a));}
int iround_up  (double a) {return iround(std::ceil(a));}
#elif (defined USE_ASM_FLOAT_CONVERSION) && (defined __GNUC__) && (defined __i386__)
//gcc version (x86 only)
int iround( double a ) {
    int r;
    __asm__ __volatile__(
        "fldl %1 \n\t"
        "fistpl %0 \n\t"
        : "=m" (r)
        : "m" (a)
    );
    return r;
}
int iround_down(double a) {return iround(std::floor(a));}
int iround_up  (double a) {return iround(std::ceil(a));}

Benchmark that against a regular C float->int cast, see if it helps.

edit: note, for supporting sin and cos, you can either add a constant before the masking or you can use a table 25% larger and add a constant after the masking, depending upon whether you want to reduce instruction count to the minimum or save a little extra space (memory, cache, and TLB usage). Also, for your table size, there's no real advantage to using doubles instead of floats for the table.
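The first variant described above (add a constant before the masking) can be sketched like this; this is my illustration, not orz's code. Since cos(x) = sin(x + PI/2), and PI/2 is a quarter of the table, adding SIN_TABLE_SIZE / 4 to the index before the mask gives cosine from the same table:

```cpp
#include <cmath>

#define SIN_TABLE_SIZE    4096
#define SIN_TABLE_BITMASK (SIN_TABLE_SIZE - 1)

static double _sin_t[SIN_TABLE_SIZE];

void Build_SinT (void)
{
    for (int i = 0; i < SIN_TABLE_SIZE; i++)
        _sin_t[i] = sin(i * 6.283185307179586 / SIN_TABLE_SIZE);
}

// cos(x) = sin(x + PI/2): a quarter-table offset added before the mask
// wraps the index back into range, so no larger table is needed.
double Get_CosT (double d)
{
    return _sin_t[((int)(d * 651.8986469f) + SIN_TABLE_SIZE / 4) & SIN_TABLE_BITMASK];
}
```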

HoHo

Don't most newer CPUs have a special instruction that returns sin/cos values from a HW table? I think I saw something about it in the Intel architecture manuals.

tobing

You might want to use functions from FFTW:

http://www.fftw.org/

It has superfast implementations of sin and cos, but also much more...

Edit: Beware, it's GPL!

Kris Asick

orz: I tried implementing your code in place of the integer type-casting. There was a measurable drop in speed. My benchmark (100,000,000 sin-table operations with no compiler optimizations) went from about 4.5 seconds to complete its calculations to 6 seconds. Though that's still better than the 7.8 seconds the regular sin() and cos() routines take.

In case it makes a difference (since your code was assembler after all) I am using MSVC 6.0 under Windows 98SE. My processor is an AMD Sempron 2.0 GHz.

And I'm only using doubles because the sin() and cos() routines do. I plan to switch to regular floats when I add this to my game engine.

tobing: I always prefer to do things myself. A testament to that is the random number generator I made which I use in my games.

--- Kris Asick (Gemini)
--- http://www.pixelships.com

HoHo
Quote:

... with no compiler optimizations

Why don't you allow compiler optimizations? It's kind of stupid to not let compilers do their work. After all, un-optimized code is not what you will use in release, is it?

Quote:

I am using MSVC 6.0

That might explain some things. This compiler is one of the slowest there is. I suggest getting GCC 4.1 (MinGW) and do the benchmarks again.

Quote:

And I'm only using doubles because the sin() and cos() routines do

For floats there are sinf and cosf, which should be quite a bit faster than their double counterparts. I for one haven't yet seen a place where floats don't give enough precision, especially as you only use x87. I would understand that the 32-bit SSE floats can cause some problems when you are not careful, but even those are usable if you are careful.

Also, if you decide to use GCC, you can try out some of its builtin functions that map directly to machine instructions, possibly providing a significant boost over the standard library implementation.

Also, I would like to see your benchmark program. It would be interesting to see how it behaves on different CPUs.

tobing
Quote:

I always prefer to do things myself.

Uh-oh. Good luck with that. I would get nowhere if I had to do everything myself, without relying on powerful libraries, Allegro being just one of many. I would also not be keen to write my own compiler, OS, etc...

orz
Quote:

I tried implementing your code in place of the integer type-casting. There was a measurable drop in speed. My benchmark (100,000,000 sin-table operations with no compiler optimizations) went from about 4.5 seconds to complete its calculations to 6 seconds. Though that's still better than the 7.8 seconds the regular sin() and cos() routines take.

Benchmarking should be done with full optimization enabled. The limiting factors in debug mode can be wildly different from the limiting factors in release mode in some cases. Of course, you also need to take care that your code does something that can't be optimized away to nothing, otherwise release mode will be even further from what you want than debug mode.
When I benchmark, I get 26.5 nanoseconds per sin() with your code, 18.0 nanoseconds with my code, and 72.9 nanoseconds with the regular libc sin(). That works out to 2.65 seconds, 1.80 seconds, and 7.29 seconds for each respectively, for the 100 million operations you used. When I add interpolation, I end up with 33.9 nanoseconds per operation.

Quote:

In case it makes a difference (since your code was assembler after all) I am using MSVC 6.0 under Windows 98SE. My processor is an AMD Sempron 2.0 GHz.

I used MSVC 7.1 under win2k on an Athlon XP 2600+.

Quote:

And I'm only using doubles because the sin() and cos() routines do. I plan to switch to regular floats when I add this to my game engine.

The table itself can be floats without affecting the type of the return value of your lookup function.

I should further note that I prefer not to use sin lookup tables in my own code, except when I'm doing everything in fixed point math. I avoid them because the imprecision, while minor most of the time, sometimes becomes quite relevant, and such times may not always be immediately obvious; I consider it not worth the effort involved in figuring out when sin tables won't hurt, so I never use them in regular code. A more accurate version can be done using table lookups and interpolating between adjacent table entries, but that loses much of the speed advantage and still, on very rare occasions, has precision issues. Instead, to optimize my sin() and cos() calls, I use the fsincos opcode, which calculates both values with a single opcode, and is faster than calling regular sin() and cos(), though not nearly as fast as simple table lookups (almost as fast as interpolated lookups though):

Vector2 unit_vector ( Angle angle ) {
    Vector2 r;

    //now the normal processor detection
    //and various platform specific versions

#   if defined (__i386__) && !defined (NO_ASM)
#       if defined __GNUC__
#           define ASM_SINCOS
            asm ("fsincos" : "=t" (r.x), "=u" (r.y) : "0" (angle.radians()));
#       elif defined _MSC_VER
#           define ASM_SINCOS
            double a = angle.radians();
            __asm fld a
            __asm fsincos
            __asm fstp r.x
            __asm fstp r.y
#       endif
#   endif

    //and the fall-back version in C

#   ifndef ASM_SINCOS
    double a = angle.radians();
    r.x = cos(a);
    r.y = sin(a);
#   endif
    return r;
}

Note that MSVC does not by default define the __i386__ symbol; I don't know how you're supposed to recognize the target instruction set on that compiler, so I just define it myself.
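The interpolating lookup mentioned above could look something like this sketch (my reconstruction of the idea, not orz's actual code): compute a fractional table index, fetch the two neighbouring entries, and blend by the fractional part:

```cpp
#include <cmath>

#define SIN_TABLE_SIZE    4096
#define SIN_TABLE_BITMASK (SIN_TABLE_SIZE - 1)

static double _sin_t[SIN_TABLE_SIZE];

void Build_SinT (void)
{
    for (int i = 0; i < SIN_TABLE_SIZE; i++)
        _sin_t[i] = sin(i * 6.283185307179586 / SIN_TABLE_SIZE);
}

// Linear interpolation between the two table entries bracketing the angle.
double Get_SinT_Lerp (double d)
{
    double idx = d * (SIN_TABLE_SIZE / 6.283185307179586);
    int    i0  = (int)floor(idx);
    double f   = idx - i0;  // fractional part in [0, 1)
    // Two's-complement wrap of the mask also handles negative angles.
    double a   = _sin_t[i0 & SIN_TABLE_BITMASK];
    double b   = _sin_t[(i0 + 1) & SIN_TABLE_BITMASK];
    return a + (b - a) * f;
}
```

With a 4096-entry table the worst-case lerp error is on the order of 3e-7, versus roughly 1.5e-3 for the raw truncating lookup, which matches the observation that interpolation trades some speed for much better accuracy.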

edit:

Quote:

For floats there are sinf and cosf, which should be quite a bit faster than their double counterparts. I for one haven't yet seen a place where floats don't give enough precision, especially as you only use x87. I would understand that the 32-bit SSE floats can cause some problems when you are not careful, but even those are usable if you are careful.

I have never heard of sinf or cosf before. Are these standard and portable? My benchmarking finds them to be identical in speed to regular sin() and cos() on my compiler.

Kris Asick

HoHo:

1. MSVC vs. GCC is one of those arguments I am not going to be dragged into. It's like comparing MS Word to WordPerfect.

2. sinf() benchmarks at about 11 seconds using my program... worse than sin().

3. My benchmark program is just a for-loop around the command whose speed I wish to test, assigning its result to a variable. My program doesn't actually do the timing part; I time it myself with a stopwatch a few times, then figure out the average. It was faster to program that way, and at the time I wanted it programmed in a few minutes using the console instead of a half hour using Allegro's timer functions.

tobing:

Well, obviously I use Allegro, but the only thing stopping me from learning to write my own library of functions is that I don't have access to enough variations of computers to test such an implementation, whereas Allegro has been put through its paces on so many computers that I've never heard of my games crashing horribly on anyone's system due to Allegro's fault. (Let alone my own, actually.)

orz:

MSVC is optimizing the part of my benchmark that just repeats a fixed command, reducing it to a single execution. No matter how many times I loop, it only takes a split second to run with either the built-in commands or the table commands, which makes it impossible for me to figure out how my functions will perform when optimized.

--- Kris Asick (Gemini)
--- http://www.pixelships.com

gnolam
HoHo said:

I suggest getting GCC 4.1 (MinGW) and do the benchmarks again.

Err... there is no GCC 4.1 for MinGW.

orz
Quote:

MSVC is optimizing the part of my benchmark that just repeats a fixed command, reducing it to a single execution. No matter how many times I loop, it only takes a split second to run with either the built-in commands or the table commands, which makes it impossible for me to figure out how my functions will perform when optimized.

The problem is that your code is written in such a way that discarding your calls is legal for the compiler, because your calls don't actually do anything except set a variable which is never looked at. The trick is to do something such that the code cannot behave correctly without calling your function the correct number of times. Try adding up all the return values and printing the sum. This is a sample of the code I benchmarked with:

sum = 3.1;
for (int i = 0; i < n; i++) {
    for (int j = 0; j < 1000; j++) {
        sum += sinf(sum);
        sum += sinf(sum);
        sum += sinf(sum);
        sum += sinf(sum);
    }
}
double time4 = get_time2();
double dt3 = (time4 - time3) / (n * 4000.0) * 1000000;
printf("algorithm3: %.3f ns / sin; (check:%.9f)\n", dt3, sum);

Quote:

1. MSVC vs. GCC is one of those arguments I am not going to be dragged into. It's like comparing MS Word to Wordperfect.

It's not just MSVC vs GCC. MSVC 6.0 is old, and extremely buggy if you use C++ templates beyond STL. If you don't use C++ templates, it's probably adequate. MSVC 7.1 is much less buggy in that regard, as are most recent and semi-recent versions of GCC. edit: more recent MSVCs and GCCs are also faster than MSVC 6.0

edit: just for fun, this is the output of my current code:

Get_SinT 27.525 ns / call; (check:2.510948387)
get_sin_via_table2 18.895 ns / call; (check:1.646245552)
get_sin_via_table2_with_interpolation 34.012 ns / call; (check:2.000266046)
_sinf 74.413 ns / call; (check:2.000186454)
sin 84.446 ns / call; (check:2.000186454)
_fsincos 95.225 ns / call; (check:2.000186454)

I've modified it to make imprecisions in the results more obvious in the "check" value, and to simplify the code and make the output more readable. Notice how the top two (your original code and a modification of it to use inline assembly float->int conversion) produce large deviations, while the interpolating version (which also uses asm float->int) produces small deviations, and the remaining three produce identical results? Also notice that _fsincos is only slightly slower than sin despite calculating both sin and cos.

Kris Asick

Well, at the moment, MSVC 6.0 is what I use. I've never had any problems with it and to use any higher version I would have to upgrade my OS as well.

And in the past, GCC hated me. Partly why I stuck with MSVC.

Making the sin/cos table larger would probably help reduce the deviation values without having to add that extra little overhead of interpolation. After all, a 4096 entry table only takes up 16 KB of memory using floats and 32 KB using doubles. Making the table 16,384 entries big would only take 64 KB or 128 KB, which is still not that much when you consider MBs worth of graphics, sounds and music.

And I get what you mean by what the compiler's doing with optimizations. Actually, I never really paid much attention to optimizations. I just normally turn 'em to full-speed and let them do their thing.

--- Kris Asick (Gemini)
--- http://www.pixelships.com

Paul Pridham

HoHo said:

That might explain some things. This compiler is one of the slowest there is. I suggest getting GCC 4.1 (MinGW) and do the benchmarks again.

I recently attempted to switch from MSVC6 to the latest mingw32 package. My current game which uses a lot of sin/cos and software polygon rendering experienced a noticeable drop in framerate, in the neighbourhood of 10FPS, with mingw32.

Otherwise, the switch went well, using MinGW Visual Studio, which is a lot like MSVC6.

Unless I can see some metrics on the supposed superiority of mingw32 over MSVC6, and an explanation on how to access this superior performance, I'll be sticking with MSVC6.

STL and C++ I avoid like the plague anyway, so MSVC6 works perfectly fine.

Bob

So, what happens if you supply a value outside the range [0..2pi]? What precision do you really get with a table lookup?

orz

Kris Asick:
With a table size of 16384 instead of 4096:

Get_SinT 26.865 ns / call; (check:2.141740928)
get_sin_via_table2 18.751 ns / call; (check:1.982911766)
get_sin_via_table2_with_interpolation 33.851 ns / call; (check:2.000186454)
_sinf 72.959 ns / call; (check:2.000186454)
sin 85.588 ns / call; (check:2.000186454)
_fsincos 96.583 ns / call; (check:2.000186454)

The deviations by this measure got noticeably smaller. In fact, for the interpolating version, the deviation disappeared. But I am not a believer in large table sizes, at least for general purpose things; they may do fine on benchmarks until they start overflowing the cache, but in real-world circumstances there's other stuff that wants a share of the cache too.

ppridham: edit: why does this show up on the wrong line if I don't add a spurious space?
I only see bugs in MSVC 6 when using C++; for C it seems to produce correct code. I do find GCC 3.3.x faster than MSVC 6 in most, though not all, cases. I haven't tried GCC 4.* yet. In a complete program the comparison may be more difficult than in a simpler benchmark due to libraries complicating the picture.

Bob:

Quote:

So, what happens if you supply a value outside the range [0..2pi]? What precision do you really get with a table lookup?

edit: corrected these results, and the description; a bug initially had the table based stuff performing better
It's comparable to the normal sin()/cos() in that regard. Except the interpolating version I wrote, which barfs after a bit; but perhaps that's fixable. A version modified to show check values with various biases applied:
zero bias:

Get_SinT   29.411 ns / call; check:1.015412770
get_sin_via_table2   22.424 ns / call; check:0.738985208
get_sin_via_table2_with_interpolation   36.800 ns / call; check:0.778233193
sin   89.889 ns / call; check:0.778233193

bias: PI * 256

Get_SinT   31.876 ns / call; check:1.015412770
get_sin_via_table2   23.182 ns / call; check:0.738985208
get_sin_via_table2_with_interpolation   39.232 ns / call; check:0.778233193
sin   92.577 ns / call; check:0.778233193

bias: PI * 65536

Get_SinT   29.303 ns / call; check:1.015412770
get_sin_via_table2   22.898 ns / call; check:0.738985208
get_sin_via_table2_with_interpolation   36.886 ns / call; check:0.778233179
sin   93.601 ns / call; check:0.778233353

bias: PI * 4294967296

Get_SinT   29.197 ns / call; check:0.962381428
get_sin_via_table2   22.253 ns / call; check:4.964428596
get_sin_via_table2_with_interpolation   36.993 ns / call; check:-1.#IND00000
sin   98.563 ns / call; check:0.789689926

bias: PI * 1099511627776

Get_SinT   29.090 ns / call; check:0.956744473
get_sin_via_table2   22.055 ns / call; check:4.964428596
get_sin_via_table2_with_interpolation   37.330 ns / call; check:-1.#IND00000
sin  100.243 ns / call; check:3.776719549

bias: PI * 281474976710656

Get_SinT   29.282 ns / call; check:2.899386226
get_sin_via_table2   22.070 ns / call; check:4.964428596
get_sin_via_table2_with_interpolation   37.951 ns / call; check:-1.#IND00000
sin  103.442 ns / call; check:-4.451114322

HoHo

Just one remark:
Lookup-table-based things work well only if they sit in L1 or at least L2 cache. When they go from L1 to L2, speed drops around 3-4x; if they go to main memory, speed drops about 100x compared to L1, assuming that memory prefetch doesn't work. With AMD64 things are a bit better, but not that much.

orz, could you attach your code? It would be interesting to test it on other architectures and compilers. I can provide a 32-bit Core 2 Duo with GCC 4.1. I'm too lazy to write it myself.

If you want to know how badly MSVC sucks then go here and search for posts containing "msvc". Also keep in mind that there they talk about MSVC8. According to MS it is a giant leap over MSVC6

Arthur Kalliokoski
Quote:

slow switch of rounding modes to enter that mode (cast float to int)

I don't have the code here on this computer, but you can multiply the number by some magic value and grab the int off the mantissa and mask off the fp exponent & sign bits. If this doesn't allow enough range, you can cast to a double to get more accuracy. (store to double is still faster than fistp)

Quote:

My current game which uses a lot of sin/cos and software polygon rendering experienced a noticeable drop in framerate, in the neighbourhood of 10FPS, with mingw32.

Using a particular compiler can be like using a particular language; you have to know your way around to get the best performance. Did you try -ffast-math, -fomit-frame-pointer, -O3, -march=pentium4, -mfpmath=sse,387, and countless others? I haven't been able to outrun MinGW 3.4.2 (within 2%) with the free VC++ download thing when I've spent a half hour on a particular program.

Quote:

value outside the range [0..2pi]

Even the fpu sin/cos functions have limits, huge to be sure, but there nonetheless.

Kikaru

If you still need help, you can use this:

for (int a = 0; a < 360; a++)
{
    sine[a] = sin(a / RAC);
    cosine[a] = cos(a / RAC);
}

just declare float sine[360], cosine[360]; and #define RAC 57.29577951 (degrees per radian)

It works for anything except fractions of degrees, and is quite fast.
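One cheap way to get fractional degrees out of a whole-degree table like this (my sketch, not part of the post above) is to round to the nearest whole-degree entry, which keeps the error below about half a degree's worth of sine (roughly 0.009):

```cpp
#include <cmath>

static float sine[360], cosine[360];

// Build whole-degree tables, as in the post above.
void build_tables (void)
{
    for (int a = 0; a < 360; a++)
    {
        sine[a]   = (float)std::sin(a * 3.14159265358979 / 180.0);
        cosine[a] = (float)std::cos(a * 3.14159265358979 / 180.0);
    }
}

// Hypothetical extension: round a fractional degree angle to the nearest
// whole-degree entry. floor(x + 0.5) so negative angles round correctly,
// then wrap the index into [0, 360).
float sin_deg (float deg)
{
    int a = (int)std::floor(deg + 0.5f) % 360;
    if (a < 0) a += 360;
    return sine[a];
}
```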

orz
Quote:

lookup table based things work well only if they sit in L1 or at least L2 cache. When they go from L1 to L2 speed drops around 3-4x, if they go to main memory speed drops about 100x compared to L1 assuming that memory prefetch doesn't work. With AMD64 things are a bit better but not that much.

Indeed. And I forgot to make sure my memory accesses were unpredictable. My performance numbers were based upon the assumption that memory access mostly went to the L1 (though, on an Athlon-based system, L1 is big enough to fit decent tables; on an Intel system, IIRC, L1 is generally smaller but the speed hit for dropping to L2 is a lot better). Unfortunately, it's hard to make sure it can't prefetch my memory without sticking in frequent random number generator calls, which slows things down a bit independently of the trig speed.

Quote:

I don't have the code here on this computer, but you can multiply the number by some magic value and grab the int off the mantissa and mask off the fp exponent & sign bits. If this doesn't allow enough range, you can cast to a double to get more accuracy. (store to double is still faster than fistp)

I don't really like that method, as the behavior is kinda stupid when the input goes outside of the intended range. But, it does seem to improve speed a little more.

#elif defined USE_BINARY_FLOAT_CONVERSION
// 6755399441055744.0 = 1.5 * 2^52; adding it pushes the integer part of the
// input into the low mantissa bits (assumes IEEE-754 doubles, round-to-nearest
// mode, and a little-endian layout).
#  define BIGDOUBLE 6755399441055744.0
int iround(double a) {
    int i;
    a = a + (BIGDOUBLE + 0);
    i = *((int*)&a);
    return i;
}
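A variant of the snippet above using memcpy instead of the pointer cast (which is a strict-aliasing violation in standard C++); this is my adaptation, with the same assumptions of IEEE-754 doubles, round-to-nearest mode, and little-endian layout:

```cpp
#include <cstring>

// 6755399441055744.0 = 1.5 * 2^52: adding it forces the value into the low
// mantissa bits, rounded in the FPU's current mode (round-to-nearest-even).
int iround (double a)
{
    a += 6755399441055744.0;
    int i;
    std::memcpy(&i, &a, sizeof i);  // read the low 32 bits without aliasing UB
    return i;
}
```

Note the round-to-nearest-even behaviour: iround(2.5) gives 2, not 3, unlike a C cast, which truncates toward zero.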

Quote:

orz, could you attach your code? It would be interesting to test on other architectures and compilers. I can provide 32bit Core2Duo with GCC 4.1. I'm too lazy to write it myself

Done. Be warned though, this is kinda a mess that was quickly assembled from pieces of several different projects. There's a ton of code you'll just want to skip over to get to the relevant stuff.

edit: The output on my computer from the code as posted:

dummy    5.672 ns / call; check:2.1907804172
Get_SinT   33.177 ns / call; check:1.7299634784
get_sin_via_table2   23.802 ns / call; check:1.7279857059
get_sin_via_table2_binary_iround   18.555 ns / call; check:1.7279857059
get_sin_via_table2_with_interpolation   36.615 ns / call; check:1.7273836151
_sinf   61.737 ns / call; check:1.7273835937
sin   71.317 ns / call; check:1.7273835937
_fsincos   82.553 ns / call; check:1.7273835937

Dummy does nothing except the overhead - it calls the random number generator, adds things up, but no sin/cos stuff.
The table sizes are set to 65536.
Get_SinT is the original code (except for the table size).
get_sin_via_table2 is that modified to use asm float->int conversion
get_sin_via_table2_binary_iround is that modified to use binary float->int conversion (reinterpret the floating point memory as integers w/ silly tricks)
get_sin_via_table2_with_interpolation uses the asm float->int, and interpolates between two entries in the table for high accuracy
_sinf calls sinf
sin is just plain sin() from the standard library
_fsincos uses inline asm to call the fsincos opcode

Do any of these benchmarks take into account that such lookup tables rely on the CPU cache...

orz
Quote:

Do any of these benchmarks take into account that such lookup tables rely on the CPU cache...

The timing numbers in the last set include plenty of L1 misses. It doesn't include many L2 misses though, as even the 256 kilobyte table fits into the L2 cache footprint. The L1 miss rate is worse than real world in most cases, since it's largely based upon randomly generated angles. The L2 miss rate is better than real world in most cases, since no major other program data is used simultaneously. If you want a graph of log(table size) vs performance ala linpack you could simply adjust the table size and run again a few times.

Tobias Dammers

Not quite. Most real-world applications do a lot of other stuff between calls to sin() or cos(), and the more that happens, the more cache misses you get.
To make matters worse, a lookup table that sits in L1 or L2 occupies cache space needed by other things, and may even slow down the performance of other calculations that rely on memory access speed.
Try benchmarking something real-world, like an FFT (it lends itself to this due to its heavy use of sin calls).