lookup tables

I was wondering,

are lookup tables for Sin, cos, tan still worth the effort ?

or is it useless on a modern computer ?

I'd say if whatever you're doing isn't terribly memory intensive that the lookup tables would be faster as long as they can stay cached.

I haven't fiddled with how it's done nowadays with 64 bit OS'es that use the SSE registers in lieu of the '87 chip, but the '87 stuff only took about 35 cycles or so to grind out the trancendental stuff, the math library stuff that insisted on the last bit of accuracy was considerably slower. With GCC you could specify to use the cpu instead of the math library with the --ffast-math option.

What are you trying to do anyway? I can't imagine using so many trig functions that it'd be a bottleneck.

Just wondering if it is still worth the efford. I once followed a raycasting tutorial and it used degree based sin, cos and tan using a lookup table and linear interpolation to speed things up.

It was in Turbo Pascal.. so quite old ..

Almost never.

We're currently sitting at 150-200 instructions per memory access.

So unless your data is hot in the cache, it's pointless. How much can you do in 200 instructions?

That being said, I'm sure someone somewhere has already benchmarked whatever use case you're looking at (like sin/cos/etc). So either google it, or simply write your own test case instead of guessing.

[edit] Here's one example:

https://jfdube.wordpress.com/2011/12/06/trigonometric-look-up-tables-revisited/

Quote:

According to this document, x86 instructions fsin and fcos takes between 65-100 cycles to execute, so we just need to write some code to compute the sine function using less than that.

But if you're investing time into something like this... 99% of the time you're probably wasting your energy. I wouldn't bother unless I was working on a tight-loop that was clearly a bottleneck in my game.

Ariesnl said:

raycasting

So you're needing <horizontal resolution> results <refresh rate> times a second? e.g. 1440 x 75 at the high end = 108,000 calculations per second. Don't worry about it.

[EDIT]

Are you referring to ray casting like Wolfenstein 3D or ray tracing like POV-RAY?

(I'm going to assume you first profiled your program, and that calls to sin/cos/tan have been spotted as a problem. Furthermore, I will assume that there are no algorithm changes in your application that would avoid these, etc).

How much precision in the output do you actually require? Or stated another way, how much error can you tolerate in the answer, and for what range of inputs?

That question is by far the most important one for determining which algorithms are appropriate. Is it 10^-4? 10^-16? Only for inputs between -pi and pi? Or for all possible input values?

The compilers' implementations are really fast, given that the output will be exact[1] for all possible inputs. However, if you're willing to tolerate inexact results, there are various algorithms available.

Most implementations work in the following way:

Step 1) Take an educated guess at the answer.

Step 2) Apply some math to refine the answer.

Step 3) Apply more (possibly different) math to further refine the answer.

Steps 4-1000) Same as above. Have as many steps as needed to achieve desired error.

So a good starting point may be to start from an existing implementation (for example, GCC's), then remove as later approximation steps as you're comfortable with.

Usually, the last step is both very expensive, and only serves to give you 1 or 2 more bits of accuracy.

Hope this helps.

[1] 'exact' is defined as calculating the result inside of the library as if there was infinite precision, and then rounding it to the return type (e.g. float).

That sounds a lot like the strategy behind Quake 3's magic inverse sqrt function.

https://en.wikipedia.org/wiki/Fast_inverse_square_root

And I agree 100%. If you really want to "optimize the !@$!" out of something, you need to start looking at it from a numeric analysis standpoint. What are the input and output ranges, and what is the acceptable error deviation from correct?

Books can be had on that topic for really cheap used.

https://en.wikipedia.org/wiki/Numerical_analysis

I bought this one years ago:

https://www.amazon.com/Numerical-Methods-Scientists-Engineers-Mathematics/dp/0486652416

There are a hundred ways to skin a cat. And many combinations of techniques... in different ORDERS (let alone combinations) will lead to a variety of results.

But the point is, instead of a simple hardcoded lookup, a modern implementation will involve a lookup, some approximation function, and maybe some more approximation function. (See the Quake 3 function link.) That way, you're benefiting from the CPU time and not just accessing a value from RAM and waiting for the next one if it's not in cache. You really want to pack things into units of single cachelines whenever possible (64-bytes on almost all systems these days).

[edit]

**Another option** is to take advantage of cachelines in a different way. Let's say you really just need X multiplied by a value from your lookup. Well, instead of multiplying ONE X at a time, why not multiply a whole ROW of X's? That's the principle behind Data-oriented design and practically REQUIRED to do any kind of PS3 programming. You pack like-minded data together, and then operate on many at the same time using the same function. So if you were using a blind lookup table, that lookup table would remain hot in cache the entire time you're using it--instead of doing ONE lookup and then running the next algorithm then the next, and then repeating for the NEXT OBJECT. This is where traditional OOP education becomes "slow as dirt". Treating each object as an independent unit by-definition means the compiler can't do anything to help you.

//to clarify, you go from this: obj1.get_stuff(); //caches get_stuff() data/code obj1.think_about_stuff(); //get_stuff no longer needed. obj1.apply_stuff(); //think_about_stuff no longer needed. obj2.get_stuff(); //apply_stuff no longer needed, start over. obj2.think_about_stuff(); obj2.apply_stuff(); //to this: get_stuff(obj1...objN); //all get_stuff routines are hot in cache after the FIRST object is called think_about_stuff(obj1...objN); //ditto apply_stuff(obj1...objN); //ditto

Or for a full lecture on how DoD rules by a industry great, game programmer, Mike Acton:

p.s. I'm so happy that Bob posted here.

Every time Bob posts I feel like God himself has entered the forums to spread universal wisdom. And then he's gone like a ghost.

Note that the simple act of having a lookup table in memory slows down **other** pieces of

your code.

Two pieces of unrelated data (other variables) stored before and after the table may no longer be on the same memory page when a piece of code requires both. You obtain more page faults.

Should write some code to perform benchmarks. I would be curious to see real results.

I may do this myself as I have a new project I am going to work on, a resurrection of an old one that used look up tables and I am curious if they are still viable. The link Chris shared would seem to strongly indicate they are, if done right. I know in my own project I could tolerate a high error without problem.