double precision for al_transform

Edgar Reynaldo

I'm guessing this is moot because of implementation details, but I was wondering whether there could be a use for a double precision version of al_transform_coordinates. I'm guessing Allegro uses floats in its transformation matrices, so there wouldn't be much point if that is true.

It might matter if there was a version that took double pointers instead of float pointers, because right now I have to declare two floats and perform assignment to get the data back into my double types. It's a data intensive operation in this case, so it might matter at least a little bit.

Mark Oates

What are you working on that needs double instead of float?

Edgar Reynaldo

I'm working on my Spiraloid program again, and I need super high precision angles for the spiral's theta value and theta offset, as well as rotation.

Something I'm doing now is using integer decimals and exponents to keep the values exact and prevent precision loss when adding values, then I convert to doubles when I go to actually use the value. But I need high precision for the transformations that I'm applying. I suppose I have a matrix class lying around here somewhere that I could use, but I really like Allegro's TRANSFORMs.
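The integer-plus-exponent idea can be sketched roughly as below. This is only an illustration, not the Spiraloid code: `FixedAngle`, `units`, `units_per_degree`, and `accumulate_double` are hypothetical names chosen for the example. The point is that increments are added as integers and converted to double only when the value is used.

```cpp
#include <cstdint>

// Hypothetical fixed-point angle: the value is units / units_per_degree degrees.
// Additions happen on integers, so repeated increments add no rounding error.
struct FixedAngle {
    std::int64_t units;             // integer count of the smallest step
    std::int64_t units_per_degree;  // scale, e.g. 10 means steps of 0.1 degree

    void add_steps(std::int64_t n) { units += n; }

    // Convert to double only at the point of use.
    double degrees() const {
        return static_cast<double>(units) / static_cast<double>(units_per_degree);
    }
};

// Naive alternative for comparison: add a double increment repeatedly.
// Each addition may round, so the total can drift from the exact value.
inline double accumulate_double(double step, long count) {
    double sum = 0.0;
    for (long i = 0; i < count; ++i) sum += step;
    return sum;
}
```

Adding one 0.1-degree step a million times through `FixedAngle` gives exactly 100000.0 degrees, while `accumulate_double(0.1, 1000000)` will typically land slightly off that value.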

Edit
I have a list of spiral coordinates that need to be updated anytime the scale, offset, or rotation changes. The rotation changes fairly often, as the spiraloid may be spinning. There may be as many as (sqrt(1920^2 + 1200^2)/radial_delta)*(360/theta_delta) xy data points (for my laptop's specific resolution, but could be higher than that even) that need to be updated as often as once per monitor refresh. So it could be a lot of transformations, and I need to save the cpu as much as I can so it doesn't slow down the animation.

For example, with a radial_delta of 1 and a theta_delta of 1 that is about 815,000 data points. Running at 60 Hz, that is close to 50 million points per second, which gives about 2*2*50 million float to double assignments and 2*50 million transformation calculations per second. That is enough to stress the CPU.
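The estimate above can be checked with a few lines of arithmetic. This is a small sketch using the formula from the post; the function names `spiral_point_count` and `points_per_second` are made up for the example.

```cpp
#include <cmath>

// Estimated number of spiral points for a screen of the given size:
// (screen diagonal / radial_delta) * (360 / theta_delta).
inline double spiral_point_count(double width, double height,
                                 double radial_delta, double theta_delta) {
    const double diagonal = std::sqrt(width * width + height * height);
    return (diagonal / radial_delta) * (360.0 / theta_delta);
}

// Points that must be transformed per second at a given refresh rate.
inline double points_per_second(double points, double refresh_hz) {
    return points * refresh_hz;
}
```

For 1920x1200 with both deltas set to 1, the diagonal is about 2264 pixels, which gives roughly 815,100 points and about 48.9 million point transforms per second at 60 Hz. The post rounds this to 50 million.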

Edit 2

Here are some 11x17 prints on the wall that I made of some of my Spiraloid images today using the color copier print service at Staples. Only about $15 for 10 images, and the lady was nice enough to give me 10 free sheets of glossy photo paper to use.

[Image 610268, 800x450]

Elias

I actually use my own version which uses double - float just doesn't really work for values above about 20,000, or you lose 1-pixel accuracy. And that's when not manipulating coordinates - longer chains of transformations basically don't work with float, period.

Even with double it's easy to hit accuracy problems when you're not careful about the order of operations.

So basically, I'd be for converting all floats in Allegro to double.
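To put the 20,000 figure in context, here is a small sketch that measures the gap between neighbouring representable values at a given magnitude. The helper names `float_spacing` and `double_spacing` are made up for the example. It does not reproduce the one-pixel loss by itself; it only shows how coarse each type is at that size, and every arithmetic step in a chain of transformations can add rounding error on the order of that gap.

```cpp
#include <cmath>
#include <limits>

// Gap between v and the next representable float above it.
inline float float_spacing(float v) {
    return std::nextafter(v, std::numeric_limits<float>::infinity()) - v;
}

// Gap between v and the next representable double above it.
inline double double_spacing(double v) {
    return std::nextafter(v, std::numeric_limits<double>::infinity()) - v;
}
```

At 20000 the float gap is 2^-9 (0.001953125), while the double gap is smaller than 1e-11. A single float value is therefore still far finer than a pixel, so errors of a whole pixel would have to come from rounding that compounds across operations.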

Edgar Reynaldo

Would that impact the FPU performance at all? Are floats significantly faster than doubles?

Chris Katko

Edgar Reynaldo said:

Are floats significantly faster than doubles?

Everyone on the web keeps giving out B.S. answers. I think we need to do an actual benchmark to get an answer to that.

The best I could find is this:

http://brandon.northcutt.net/article/double+VS+float+Speed+Comparison/20150625.html

In a synthetic test, ever-so-slightly slower. In a "real world" test, it was twice as slow.

Of course, "twice as slow" is meaningless to a 4.0 GHz server with 802,351 cores.

[edit] Someone linked this talk on a Reddit post:

I'm gonna watch it when I get back home. Supposedly it covers float/double performance.

https://channel9.msdn.com/Events/Build/2014/4-587

Edgar Reynaldo

Chris Katko said:

Everyone on the web keeps giving out B.S. answers. I think we need to do an actual benchmark to get an answer to that.

The best I could find is this:

http://brandon.northcutt.net/article/double+VS+float+Speed+Comparison/20150625.html

The first thing I saw was -pg and gprof. That did not inspire confidence in me. gprof is hopelessly broken and no longer in development AFAIK (at least for MinGW).

Brandon Northcutt said:

I used a second method to evaluate a more "in the wild" performance and it yielded interesting results. For this method I compiled the program without the CPU profiling switch "-pg" and then made two binaries, one which ran only the float benchmark and one that ran only the double benchmark.
BASH COMMANDS

$ time ./float_bench
real 0m13.677s
user 0m13.665s
sys 0m0.012s

$ time ./double_bench
real 0m30.670s
user 0m21.427s
sys 0m9.243s

These results carry far more weight with me. But do they mean I should sacrifice the precision of doubles for the speed of floats? I don't know.

Something to note is that no optimization flags were passed to the compiler. It might be worth retesting the second method with optimizations enabled. I'm not on Linux so I can't use 'time' to measure it though, and I don't know how to use high performance counters on Windows yet.
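One portable way to time a section without the 'time' command is the standard `<chrono>` header. The sketch below is an assumption about how one might do it, not what was used in this thread; `time_seconds` is a hypothetical helper name, and the actual clock resolution depends on the standard library implementation.

```cpp
#include <chrono>

// Run a callable once and return the elapsed wall-clock time in seconds.
// steady_clock is monotonic, so it is not affected by system clock changes.
template <typename F>
double time_seconds(F&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

Usage would look like `double s = time_seconds([&] { run_benchmark(); });` where `run_benchmark` stands in for whatever is being measured.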

Edit

Chris Katko said:

[edit] Someone linked this talk on a Reddit post:

I'm gonna watch it when I get back home. Supposedly it covers float/double performance.

https://channel9.msdn.com/Events/Build/2014/4-587

I watched the slideshow, and it gave some juicy tidbits about new instruction sets like AVX and AVX2 and about how 'optimizations' on one architecture can be 'stalls' on another.

Erin Maus

Double precision will be anywhere from slow to glacial on the GPU when compared to single precision; on CPUs (x86 at least), not so much.

GLM is a great math library. It's pretty much standalone, very portable, and has support for most everything you'd need for rendering. And it supports single and double precision matrices (and vectors, and so on).

Since Allegro's transforms are geared towards GPUs, or so I think, single precision is probably best.

Edgar Reynaldo

I modified Brandon's benchmarking program (the second method) slightly and fixed a minor bug (he was initializing a float array with 0.0, not 0.0f), then compiled it with different optimization levels and ran the tests with 1000 calls.

Zip file of code and batch scripts :
BenchmarksAndProfiling.zip

Here are the results :

c:\ctwoplus\progcode\BenchmarksAndProfiling>CompileFPUbenchmark.bat

c:\ctwoplus\progcode\BenchmarksAndProfiling>echo on

c:\ctwoplus\progcode\BenchmarksAndProfiling>rem Compiling fpubenchmark.cpp
ECHO is on.

c:\ctwoplus\progcode\BenchmarksAndProfiling>mingw32-g++ -Wall -m32 -O0 -o fpu32-0.exe -Ic:\mingw\LIBS\A5113distro\include -Lc:\mingw\LIBS\A5113distro\lib fpubenchmark.cpp -lallegro_monolith.dll

c:\ctwoplus\progcode\BenchmarksAndProfiling>mingw32-g++ -Wall -m32 -O1 -o fpu32-1.exe -Ic:\mingw\LIBS\A5113distro\include -Lc:\mingw\LIBS\A5113distro\lib fpubenchmark.cpp -lallegro_monolith.dll

c:\ctwoplus\progcode\BenchmarksAndProfiling>mingw32-g++ -Wall -m32 -O2 -o fpu32-2.exe -Ic:\mingw\LIBS\A5113distro\include -Lc:\mingw\LIBS\A5113distro\lib fpubenchmark.cpp -lallegro_monolith.dll

c:\ctwoplus\progcode\BenchmarksAndProfiling>mingw32-g++ -Wall -m32 -O3 -o fpu32-3.exe -Ic:\mingw\LIBS\A5113distro\include -Lc:\mingw\LIBS\A5113distro\lib fpubenchmark.cpp -lallegro_monolith.dll
ECHO is on.

c:\ctwoplus\progcode\BenchmarksAndProfiling>rem -m64 not supported on mingw32

c:\ctwoplus\progcode\BenchmarksAndProfiling>rem mingw32-g++ -Wall -m64 -O0 -o fpu64-0.exe -Ic:\mingw\LIBS\A5113distro\include -Lc:\mingw\LIBS\A5113distro\lib fpubenchmark.cpp -lallegro_monolith.dll

c:\ctwoplus\progcode\BenchmarksAndProfiling>rem mingw32-g++ -Wall -m64 -O1 -o fpu64-1.exe -Ic:\mingw\LIBS\A5113distro\include -Lc:\mingw\LIBS\A5113distro\lib fpubenchmark.cpp -lallegro_monolith.dll

c:\ctwoplus\progcode\BenchmarksAndProfiling>rem mingw32-g++ -Wall -m64 -O2 -o fpu64-2.exe -Ic:\mingw\LIBS\A5113distro\include -Lc:\mingw\LIBS\A5113distro\lib fpubenchmark.cpp -lallegro_monolith.dll

c:\ctwoplus\progcode\BenchmarksAndProfiling>rem mingw32-g++ -Wall -m64 -O3 -o fpu64-3.exe -Ic:\mingw\LIBS\A5113distro\include -Lc:\mingw\LIBS\A5113distro\lib fpubenchmark.cpp -lallegro_monolith.dll
c:\ctwoplus\progcode\BenchmarksAndProfiling>RunFPUbenchmark.bat

c:\ctwoplus\progcode\BenchmarksAndProfiling>echo on
ECHO is on.

c:\ctwoplus\progcode\BenchmarksAndProfiling>rem Running 32 bit fpu benchmarks

c:\ctwoplus\progcode\BenchmarksAndProfiling>fpu32-0.exe
Testing 1000 calls and 6220800 memory allocations :
float result 9679922003968.000000
Float results (   43.69694183 seconds) : Total allocation time    18.57704912 seconds , total math time    23.58766864 seconds , total dealloc time     1.53222407
Float result averages : Allocation average     0.01857705 , math average     0.02358767 , dealloc average     0.00153222
double result 9674583494400.500000
Double results (   54.04342045 seconds) : Total allocation time    19.55067594 seconds , total math time    31.37647679 seconds , total dealloc time     3.11626772
Double result averages : Allocation average     0.01955068 , math average     0.03137648 , dealloc average     0.00311627

c:\ctwoplus\progcode\BenchmarksAndProfiling>fpu32-1.exe
Testing 1000 calls and 6220800 memory allocations :
float result 9679922003968.000000
Float results (   25.86970711 seconds) : Total allocation time     6.87207194 seconds , total math time    17.43963017 seconds , total dealloc time     1.55800499
Float result averages : Allocation average     0.00687207 , math average     0.01743963 , dealloc average     0.00155800
double result 9674583494400.500000
Double results (   33.06386690 seconds) : Total allocation time    12.41526336 seconds , total math time    17.54913144 seconds , total dealloc time     3.09947210
Double result averages : Allocation average     0.01241526 , math average     0.01754913 , dealloc average     0.00309947

c:\ctwoplus\progcode\BenchmarksAndProfiling>fpu32-2.exe
Testing 1000 calls and 6220800 memory allocations :
float result 9679922003968.000000
Float results (   24.68116194 seconds) : Total allocation time     6.41169159 seconds , total math time    16.69645412 seconds , total dealloc time     1.57301624
Float result averages : Allocation average     0.00641169 , math average     0.01669645 , dealloc average     0.00157302
double result 9674583494400.500000
Double results (   32.01803220 seconds) : Total allocation time    12.20064342 seconds , total math time    16.72717455 seconds , total dealloc time     3.09021423
Double result averages : Allocation average     0.01220064 , math average     0.01672717 , dealloc average     0.00309021

c:\ctwoplus\progcode\BenchmarksAndProfiling>fpu32-3.exe
Testing 1000 calls and 6220800 memory allocations :
float result 9679922003968.000000
Float results (   24.45988656 seconds) : Total allocation time     6.31435281 seconds , total math time    16.57656673 seconds , total dealloc time     1.56896702
Float result averages : Allocation average     0.00631435 , math average     0.01657657 , dealloc average     0.00156897
double result 9674583494400.500000
Double results (   32.33114045 seconds) : Total allocation time    12.42040700 seconds , total math time    16.77953203 seconds , total dealloc time     3.13120142
Double result averages : Allocation average     0.01242041 , math average     0.01677953 , dealloc average     0.00313120
ECHO is on.

c:\ctwoplus\progcode\BenchmarksAndProfiling>rem Running 64 bit fpu benchmarks

c:\ctwoplus\progcode\BenchmarksAndProfiling>rem fpu64-0.exe

c:\ctwoplus\progcode\BenchmarksAndProfiling>rem fpu64-1.exe

c:\ctwoplus\progcode\BenchmarksAndProfiling>rem fpu64-2.exe

c:\ctwoplus\progcode\BenchmarksAndProfiling>rem fpu64-3.exe

c:\ctwoplus\progcode\BenchmarksAndProfiling>

Here's the code I used :

//CREATED: 2015-06-25 15:11 -  -BDN
//UPDATED: 2015-06-25 15:11 -  -BDN
//AUTHORS: Brandon D. Northcutt (brandon@northcutt.net)
//
//This is a program intended to illustrate the relative efficiency of double versus single precision floating point numbers.
#include "allegro5/allegro.h"

#include <cstdio>
#include <cstdlib>

// volatile so the compiler cannot optimize the allocation size away
volatile int MEM = 6220800;///2015-06-25 14:27 - An RGB 1920x1080 image. -BDN
const int CALLS = 1000;

double  * ad;
float  * af;

int CALLNUM = 0;

double double_alloc_time[CALLS];
double double_math_time[CALLS];
double double_dealloc_time[CALLS];
double total_double_alloc_time = 0.0;
double total_double_math_time = 0.0;
double total_double_dealloc_time = 0.0;

double float_alloc_time[CALLS];
double float_math_time[CALLS];
double float_dealloc_time[CALLS];
double total_float_alloc_time = 0.0;
double total_float_math_time = 0.0;
double total_float_dealloc_time = 0.0;

void double_memory_allocation()
{
  ad=new double[MEM];
  for(int i=0;i<MEM;i++) ad[i]=0.0;
}

double double_math(void)
{
  double t;
  for(int i=1;i<MEM;i++)
  {
    t=(double)i;
    ad[i]=(t*t - t)/(t + t);
    ad[0]+=ad[i];
  }
  return ad[0];
}

void double_memory_deallocation()
{
  delete[] ad;// array form, to match new double[MEM]
}

double double_benchmark(void)
{
  double r;
  double t1 = al_get_time();
  double_memory_allocation();
  double t2 = al_get_time();
  r=double_math();
  double t3 = al_get_time();
  double_memory_deallocation();
  double t4 = al_get_time();

  total_double_alloc_time += double_alloc_time[CALLNUM] = t2 - t1;
  total_double_math_time += double_math_time[CALLNUM] = t3 - t2;
  total_double_dealloc_time += double_dealloc_time[CALLNUM] = t4 - t3;

  return r;
}

void float_memory_allocation()
{
  af=new float[MEM];
  for(int i=0;i<MEM;i++) af[i]=0.0f;
}

float float_math(void)
{
  float t;
  for(int i=1;i<MEM;i++)
  {
    t=(float)i;
    af[i]=(t*t - t)/(t + t);
    af[0]+=af[i];
  }
  return af[0];
}

void float_memory_deallocation()
{
  delete[] af;// array form, to match new float[MEM]
}

float float_benchmark(void)
{
  float r;
  double t1 = al_get_time();
  float_memory_allocation();
  double t2 = al_get_time();
  r=float_math();
  double t3 = al_get_time();
  float_memory_deallocation();
  double t4 = al_get_time();

  total_float_alloc_time += float_alloc_time[CALLNUM] = t2 - t1;
  total_float_math_time += float_math_time[CALLNUM] = t3 - t2;
  total_float_dealloc_time += float_dealloc_time[CALLNUM] = t4 - t3;

  return r;
}

int main (void)
{
  al_init();// needed before al_get_time can be used

  float tmpf = 0.0f;
  double tmpd = 0.0;

  printf("Testing %d calls and %d memory allocations :\n" , CALLS , MEM);

  for(CALLNUM=0;CALLNUM<CALLS;CALLNUM++) tmpf=float_benchmark();
  printf("float result %f\n",tmpf);

  double total_float_time = total_float_alloc_time + total_float_math_time + total_float_dealloc_time;
  printf("Float results (%14.8lf seconds) : Total allocation time %14.8lf seconds , total math time %14.8lf seconds , total dealloc time %14.8lf\n",
            total_float_time , total_float_alloc_time , total_float_math_time , total_float_dealloc_time);
  printf("Float result averages : Allocation average %14.8lf , math average %14.8lf , dealloc average %14.8lf\n",
            total_float_alloc_time/CALLS , total_float_math_time/CALLS , total_float_dealloc_time/CALLS);

  for(CALLNUM=0;CALLNUM<CALLS;CALLNUM++) tmpd=double_benchmark();
  printf("double result %lf\n",tmpd);

  double total_double_time = total_double_alloc_time + total_double_math_time + total_double_dealloc_time;
  printf("Double results (%14.8lf seconds) : Total allocation time %14.8lf seconds , total math time %14.8lf seconds , total dealloc time %14.8lf\n",
            total_double_time , total_double_alloc_time , total_double_math_time , total_double_dealloc_time);
  printf("Double result averages : Allocation average %14.8lf , math average %14.8lf , dealloc average %14.8lf\n",
            total_double_alloc_time/CALLS , total_double_math_time/CALLS , total_double_dealloc_time/CALLS);

  return 0;
}

As expected, -O0 took the longest. -O1, -O2, and -O3 were all comparable. With optimizations enabled, memory allocation took roughly twice as long for doubles as it did for floats (because they are twice as big), and deallocation took about twice as long at every level. Deallocation times were constant across optimizations. Something important to note is that I used volatile for the memory allocation size so it couldn't be optimized away.

I used al_get_time for measurements. Allocation and deallocation can be quite costly, and should be avoided if possible. The math times are comparable on my laptop (Intel i7-5700HQ @ 2.70 GHz) with any optimization level other than -O0.
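Since allocation and deallocation account for a large share of the totals, one option is to allocate the buffer once and reuse it on every pass. The sketch below is a simplified variation on the benchmark's math loop, not the code that produced the numbers above; `math_pass` is a hypothetical name, and it keeps the running sum in a local variable instead of element 0.

```cpp
#include <cstddef>
#include <vector>

// Fill `buffer` using the benchmark's formula and return the running sum.
// The caller owns the buffer, so repeated calls reuse the same allocation.
inline double math_pass(std::vector<double>& buffer) {
    double sum = 0.0;
    for (std::size_t i = 1; i < buffer.size(); ++i) {
        const double t = static_cast<double>(i);
        buffer[i] = (t * t - t) / (t + t);
        sum += buffer[i];
    }
    return sum;
}
```

A caller would construct `std::vector<double> buffer(count);` once before the loop and pass it to `math_pass` on every iteration, so the per-call cost is only the math.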

I'm running Windows 10 64 bit and I wanted to test with -m64 architecture but mingw32 doesn't support it.

Edit

TL;DR
Here's a table of the results including the allocations :

-O0 float  : 43.70ms per op = 22.88FPS
-O0 double : 54.04ms per op = 18.50FPS

-O1 float  : 25.87ms per op = 38.65FPS
-O1 double : 33.06ms per op = 30.25FPS

-O2 float  : 24.68ms per op = 40.52FPS
-O2 double : 32.02ms per op = 31.23FPS

-O3 float  : 24.46ms per op = 40.88FPS
-O3 double : 32.33ms per op = 30.93FPS

And a table of the results for just the computations :

-O0 float  : 23.59ms per op = 42.39FPS
-O0 double : 31.38ms per op = 31.87FPS

-O1 float  : 17.44ms per op = 57.34FPS
-O1 double : 17.55ms per op = 56.98FPS

-O2 float  : 16.70ms per op = 59.88FPS
-O2 double : 16.73ms per op = 59.77FPS

-O3 float  : 16.58ms per op = 60.31FPS
-O3 double : 16.78ms per op = 59.59FPS

So you can see that if you wanted to process 6220800 (1920x1080x3) floating point elements per frame on my laptop's CPU, it would just barely keep up with a 60 Hz refresh rate with optimizations enabled. But the difference between single precision floating point math and double precision floating point math is almost negligible.
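The conversion behind the tables is simple enough to sketch: total seconds over 1000 calls becomes milliseconds per call, and 1000 divided by that is the rate listed as FPS. The function names here are made up for the example.

```cpp
// Average milliseconds per call, given the total seconds over `calls` calls.
inline double ms_per_call(double total_seconds, int calls) {
    return total_seconds * 1000.0 / calls;
}

// Calls per second sustainable at that cost (the "FPS" column in the tables).
inline double calls_per_second(double ms) {
    return 1000.0 / ms;
}

// Time available per frame at a given refresh rate, in milliseconds.
inline double frame_budget_ms(double refresh_hz) {
    return 1000.0 / refresh_hz;
}
```

For the -O0 float run, 43.69694183 seconds over 1000 calls is about 43.70 ms per call, or about 22.88 calls per second. The budget at 60 Hz is about 16.67 ms per frame, which is why the optimized math times of 16.6 to 16.8 ms only just fit.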

Chris Katko

Aaron Bolyard said:

Since Allegro's transforms are geared towards GPUs, or so I think, single precision is probably best.

OpenGL also supports half-precision floats and integer coordinate systems. I don't see any clear reason why Allegro shouldn't support them.

The Gamecube runs with integer math. Now that OpenGL supports it, the Dolphin emulator was ported to integer math and tons of bugs have gone away.

https://dolphin-emu.org/blog/2014/03/15/pixel-processing-problems/

[edit]

ALSO, I had no idea there was a difference between 0.0 and 0.0f. There's REALLY such a thing as a float vs double literal, and the compiler will silently convert them if you have the wrong one. ... I think?

This is insanity!

Bringing back to another of my threads: Somehow, a std::string implicitly converting to a c_string is terrible, but doubles to floats, and floats to ints are OKAY being implicit?! COME ON C++. COME ON.
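On the literal question: yes, an unsuffixed floating literal such as 0.0 has type double, and the f suffix makes it a float. The sketch below checks that at compile time and shows a case where the two literals differ in value; `tenth_literals_differ` is a hypothetical helper name. For 0.0 itself both types hold the value exactly, so the fix to the benchmark only removes an implicit conversion and does not change any result.

```cpp
#include <type_traits>

// The literal's type is fixed by its suffix, not by what it is assigned to.
static_assert(std::is_same<decltype(0.0), double>::value, "0.0 is a double literal");
static_assert(std::is_same<decltype(0.0f), float>::value, "0.0f is a float literal");

// 0.1 is not exactly representable in binary, and float keeps fewer bits
// than double, so the two literals round to different values.
inline bool tenth_literals_differ() {
    const float f = 0.1f;
    const double d = 0.1;
    return static_cast<double>(f) != d;
}
```

Compilers can warn about the silent narrowing from double to float; with GCC, `-Wconversion` or `-Wfloat-conversion` enables that.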

Edgar Reynaldo

See my last edit for FPS results of ops with and without allocations included.

SiegeLord

ALLEGRO_TRANSFORM indeed has floats inside it, and since its internals are public, we're kind of stuck with it that way. It is that way primarily because that's what is supported across platforms (the culprit in this case is Direct3D).

Erin Maus

Chris Katko said:

OpenGL also supports half-precision floats and integer coordinate systems. I don't see any clear reason why Allegro shouldn't support them.

If I remember correctly, half precision is only useful on mobile platforms. It's a no-op on most desktop GPUs. Similarly, native integer support is slow, like doubles.

But most of all, such features are useless for anyone using Allegro for rendering.

Quote:

The Gamecube runs with integer math.

The classic Xbox had a bizarre programmable GPU unlike otherwise equivalent Nvidia chips before and after. The SNES had a terribly weak CPU, only a minor step up from the NES. The Nintendo 64 was pretty much a SGI workstation. The Wii has a small ARM processor on the same die as the GPU that controls various security and I/O processes.

Consoles used to have strange quirks unlike PCs, and that was nice, but that doesn't have any relevance to modern hardware.

Edgar Reynaldo

It would be possible to create a function called al_transform_coordinates_d that took double pointers though. That would at least save the allocation of two floats. But I guess if they're on the stack it wouldn't matter, even in a heavy loop. Don't mind me. Just thinking out loud.
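A rough sketch of what such a function could look like follows. `al_transform_coordinates_d` does not exist in Allegro, so this uses a hypothetical stand-in struct, `FloatTransform`, with the same 4x4 float matrix that ALLEGRO_TRANSFORM exposes, and a hypothetical function name. It also assumes the translation lives in row 3 of the matrix, which is my understanding of Allegro's layout and should be checked against the library source.

```cpp
#include <cassert>

// Hypothetical stand-in for ALLEGRO_TRANSFORM, which stores a 4x4 float matrix.
struct FloatTransform {
    float m[4][4];
};

// Sketch of a double-pointer variant: the matrix stays float, but the
// coordinates are read and written as double, so the caller needs no
// float temporaries. Matrix entries are promoted to double for the math.
inline void transform_coordinates_d(const FloatTransform* t, double* x, double* y) {
    assert(t && x && y);
    const double in_x = *x;  // copy first: *x is overwritten before *y is computed
    const double in_y = *y;
    *x = in_x * t->m[0][0] + in_y * t->m[1][0] + t->m[3][0];
    *y = in_x * t->m[0][1] + in_y * t->m[1][1] + t->m[3][1];
}
```

Because the matrix entries are still floats, this removes the temporaries but does not add precision to the transform itself.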

My only concern is this part of my code :

GeneratePlotData only gets called if the theta_delta or the radial_delta change, as that affects the number of data points in the spiral. But the transform and the modified coordinates change every time the rotation changes, which is quite often in my program.

void Spiral2D::Refresh() {
   if (spiral_needs_refresh) {
      GeneratePlotData();
   }
   if (transform_needs_refresh) {
      /// Refresh modified data from original using transform
      al_identity_transform(&transform);
      al_rotate_transform(&transform , rotation_degrees*(M_PI/180.0));
      al_scale_transform(&transform , scalex , scaley);
      al_translate_transform(&transform , centerx , centery);
      for (unsigned int i = 0 ; i < Size() ; ++i) {
         Pos2D mod = DataOriginal(i);
         /// TODO : This is a hack
         float x = mod.x;
         float y = mod.y;
         al_transform_coordinates(&transform , &x , &y);
         mod.x = x;
         mod.y = y;
         DataModified(i) = mod;
      }
      transform_needs_refresh = false;
   }
}
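If the float matrix turns out to be the limiting factor, the same rotate, scale, translate sequence can be composed in double outside Allegro. The sketch below is one possible way to do that, not a replacement provided by the library. `PointD`, `AffineD`, `make_affine`, and `apply_affine` are hypothetical names, and the rotation sign follows the usual x' = x*cos - y*sin convention, which should be checked against what al_rotate_transform produces.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct PointD { double x; double y; };

// Hypothetical 2D affine transform held entirely in double:
//   x' = a*x + c*y + tx
//   y' = b*x + d*y + ty
struct AffineD { double a, b, c, d, tx, ty; };

// Compose rotate -> scale -> translate, the same order Refresh() uses.
inline AffineD make_affine(double radians, double sx, double sy,
                           double cx, double cy) {
    const double co = std::cos(radians);
    const double si = std::sin(radians);
    return AffineD{ co * sx, si * sy, -si * sx, co * sy, cx, cy };
}

// Transform every input point into `out`, reusing its storage across calls.
inline void apply_affine(const AffineD& t, const std::vector<PointD>& in,
                         std::vector<PointD>& out) {
    out.resize(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i].x = t.a * in[i].x + t.c * in[i].y + t.tx;
        out[i].y = t.b * in[i].x + t.d * in[i].y + t.ty;
    }
}
```

The six coefficients are computed once per refresh, so the inner loop is four multiplies and four adds per point with no float temporaries.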

Thread #616178. Printed from Allegro.cc