double precision for al_transform_coordinates?
Edgar Reynaldo

I'm guessing this is moot because of implementation details, but I was wondering whether there could be a use for a double precision version of al_transform_coordinates. I assume Allegro uses floats in its transformation matrices, so there wouldn't be much point if that is true.

It might matter if there was a version that took double pointers instead of float pointers, because right now I have to declare two floats and perform assignment to get the data back into my double types. It's a data intensive operation in this case, so it might matter at least a little bit.

Mark Oates

What are you working on that needs double instead of float?

Edgar Reynaldo

I'm working on my Spiraloid program again, and I need super high precision angles for the spiral's theta value and theta offset, as well as rotation.

Something I'm doing now is using integer decimals and exponents to keep the values the same and prevent precision loss when adding values, then I convert to doubles when I go to actually use the value. But I need high precision for the transformations that I'm applying. I suppose I have a matrix class lying around here somewhere that I could use, but I really like Allegro's TRANSFORMs.

Edit
I have a list of spiral coordinates that need to be updated anytime the scale, offset, or rotation changes. The rotation changes fairly often, as the spiraloid may be spinning. There may be as many as (sqrt(1920^2 + 1200^2)/radial_delta)*(360/theta_delta) xy data points (for my laptop's specific resolution, but could be higher than that even) that need to be updated as often as once per monitor refresh. So it could be a lot of transformations, and I need to save the cpu as much as I can so it doesn't slow down the animation.

For example, with a radial_delta of 1 and a theta_delta of 1, that is about 815,000 data points. Running at 60 Hz, that gives about 2*2*50 million float to double assignments and 2*50 million transformation calculations per second, which is enough to stress the cpu.
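The estimate above can be checked with a few lines of arithmetic. This is only a sketch of the formula as written in the post; the function names are made up for illustration.

```cpp
#include <cmath>

// Estimated number of spiral points, following the formula in the post:
// (screen diagonal / radial_delta) * (360 / theta_delta).
inline double spiral_point_count(double width, double height,
                                 double radial_delta, double theta_delta) {
    const double diagonal = std::sqrt(width * width + height * height);
    return (diagonal / radial_delta) * (360.0 / theta_delta);
}

// Point updates per second if every point is refreshed once per frame.
inline double point_updates_per_second(double points, double refresh_hz) {
    return points * refresh_hz;
}
```

For 1920x1200 with both deltas set to 1 this comes out at a little over 815,000 points, or just under 49 million point updates per second at 60 Hz, which is where the rounded "50 million" figure comes from.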

Edit 2

Here are some 11x17 prints on the wall that I made of some of my Spiraloid images today using the color copier print service at Staples. Only about $15 for 10 images, and the lady was nice enough to give me 10 free sheets of glossy photo paper to use.

[Image: Spiraloid prints]

Elias

I actually use my own version which uses double - float just doesn't really work at all for values above about 20,000, or you lose 1-pixel accuracy. And that's when not manipulating coordinates - longer chains of transformations basically don't work with float, period.

Even with double it's easy to hit accuracy problems when you're not careful about the order of operations.
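To put rough numbers on the precision being discussed, this sketch measures the gap between adjacent representable values at a given magnitude. The helper names are invented for the example. Near 20,000 a single float still resolves steps of roughly 0.002, so the pixel-level drift described above presumably comes from rounding error building up across operations rather than from storing one coordinate; the sketch does not model that accumulation.

```cpp
#include <cmath>
#include <limits>

// Gap between x and the next representable float above it.
inline float float_spacing(float x) {
    return std::nextafter(x, std::numeric_limits<float>::infinity()) - x;
}

// Gap between x and the next representable double above it.
inline double double_spacing(double x) {
    return std::nextafter(x, std::numeric_limits<double>::infinity()) - x;
}
```

Calling `float_spacing(20000.0f)` and `double_spacing(20000.0)` side by side shows how many more digits double leaves available before rounding starts to matter.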

So basically, I'd be for converting all floats in Allegro to double.

Edgar Reynaldo

Would that impact the FPU performance at all? Are floats significantly faster than doubles?

Chris Katko

Are floats significantly faster than doubles?

Everyone on the web keeps giving out B.S. answers. I think we need to do an actual benchmark to get an answer to that.

The best I could find is this:

http://brandon.northcutt.net/article/double+VS+float+Speed+Comparison/20150625.html

In synthetic test, ever-so-slightly slower. In a "real world" test, it was twice as slow.

Of course, "twice as slow" is meaningless to a 4.0 GHz server with 802,351 cores.

Someone linked this talk on a Reddit post:

I'm gonna watch it when I get back home. Supposedly it covers float/double performance.

Edgar Reynaldo

Everyone on the web keeps giving out B.S. answers. I think we need to do an actual benchmark to get an answer to that.

The best I could find is this:

The first thing I saw was -pg and gprof. That did not inspire confidence in me. gprof is hopelessly broken and no longer in development AFAIK (at least for MinGW).

Brandon Northcutt said:

I used a second method to evaluate a more "in the wild" performance and it yielded interesting results. For this method I compiled the program without the CPU profiling switch "-pg" and then made two binaries, one which ran only the float benchmark and one that ran only the double benchmark.
BASH COMMANDS

$ time ./float_bench
real 0m13.677s
user 0m13.665s
sys 0m0.012s

$ time ./double_bench
real 0m30.670s
user 0m21.427s
sys 0m9.243s

These results carry far more weight with me. But do they mean I should sacrifice the precision of doubles for the speed of floats? I don't know.

Something to note is that there were not any optimization flags passed to the compiler. It might be worth retesting the second method with optimizations enabled. I'm not on Linux so I can't use 'time' to measure it though, and I don't know how to use high performance counters on Windows yet.
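One portable way to time a block of work without 'time' or platform-specific counters is the standard `<chrono>` header. This is a minimal sketch, not what was used for any results in this thread; `seconds_taken` is a hypothetical helper name.

```cpp
#include <chrono>

// Run 'work' once and return the elapsed wall-clock time in seconds,
// measured with the monotonic steady_clock.
template <typename Work>
double seconds_taken(Work&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

It can wrap any callable, for example `seconds_taken([] { float_benchmark(); })`. The resolution of `steady_clock` depends on the standard library implementation, so very short runs should be repeated and averaged.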

Edit

Chris Katko said:

Someone linked this talk on a Reddit post:

I'm gonna watch it when I get back home. Supposedly it covers float/double performance.

I watched the slideshow, and it gave some juicy tidbits about new instruction sets like AVX and AVX2 and about how 'optimizations' on one architecture can be 'stalls' on another.

Erin Maus

Double precision will be anywhere from slow to glacial on the GPU when compared to single precision; on CPUs (x86 at least), not so much.

GLM is a great math library. It's pretty much standalone, very portable, and has support for most everything you'd need for rendering. And it supports single and double precision matrices (and vectors, and so on).

Since Allegro's transforms are geared towards GPUs, or so I think, single precision is probably best.

Edgar Reynaldo

I modified Brandon's benchmarking program (the second method) slightly and fixed a minor bug (he was initializing a float array with 0.0, not 0.0f), then compiled it with different optimization levels and ran the tests with 1000 calls.

Zip file of code and batch scripts :
BenchmarksAndProfiling.zip

Here are the results :

c:\ctwoplus\progcode\BenchmarksAndProfiling>CompileFPUbenchmark.bat

c:\ctwoplus\progcode\BenchmarksAndProfiling>echo on

c:\ctwoplus\progcode\BenchmarksAndProfiling>rem Compiling fpubenchmark.cpp
ECHO is on.

c:\ctwoplus\progcode\BenchmarksAndProfiling>mingw32-g++ -Wall -m32 -O0 -o fpu32-0.exe -Ic:\mingw\LIBS\A5113distro\include -Lc:\mingw\LIBS\A5113distro\lib fpubenchmark.cpp -lallegro_monolith.dll

c:\ctwoplus\progcode\BenchmarksAndProfiling>mingw32-g++ -Wall -m32 -O1 -o fpu32-1.exe -Ic:\mingw\LIBS\A5113distro\include -Lc:\mingw\LIBS\A5113distro\lib fpubenchmark.cpp -lallegro_monolith.dll

c:\ctwoplus\progcode\BenchmarksAndProfiling>mingw32-g++ -Wall -m32 -O2 -o fpu32-2.exe -Ic:\mingw\LIBS\A5113distro\include -Lc:\mingw\LIBS\A5113distro\lib fpubenchmark.cpp -lallegro_monolith.dll

c:\ctwoplus\progcode\BenchmarksAndProfiling>mingw32-g++ -Wall -m32 -O3 -o fpu32-3.exe -Ic:\mingw\LIBS\A5113distro\include -Lc:\mingw\LIBS\A5113distro\lib fpubenchmark.cpp -lallegro_monolith.dll
ECHO is on.

c:\ctwoplus\progcode\BenchmarksAndProfiling>rem -m64 not supported on mingw32

c:\ctwoplus\progcode\BenchmarksAndProfiling>rem mingw32-g++ -Wall -m64 -O0 -o fpu64-0.exe -Ic:\mingw\LIBS\A5113distro\include -Lc:\mingw\LIBS\A5113distro\lib fpubenchmark.cpp -lallegro_monolith.dll

c:\ctwoplus\progcode\BenchmarksAndProfiling>rem mingw32-g++ -Wall -m64 -O1 -o fpu64-1.exe -Ic:\mingw\LIBS\A5113distro\include -Lc:\mingw\LIBS\A5113distro\lib fpubenchmark.cpp -lallegro_monolith.dll

c:\ctwoplus\progcode\BenchmarksAndProfiling>rem mingw32-g++ -Wall -m64 -O2 -o fpu64-2.exe -Ic:\mingw\LIBS\A5113distro\include -Lc:\mingw\LIBS\A5113distro\lib fpubenchmark.cpp -lallegro_monolith.dll

c:\ctwoplus\progcode\BenchmarksAndProfiling>rem mingw32-g++ -Wall -m64 -O3 -o fpu64-3.exe -Ic:\mingw\LIBS\A5113distro\include -Lc:\mingw\LIBS\A5113distro\lib fpubenchmark.cpp -lallegro_monolith.dll
c:\ctwoplus\progcode\BenchmarksAndProfiling>RunFPUbenchmark.bat

c:\ctwoplus\progcode\BenchmarksAndProfiling>echo on
ECHO is on.

c:\ctwoplus\progcode\BenchmarksAndProfiling>rem Running 32 bit fpu benchmarks

c:\ctwoplus\progcode\BenchmarksAndProfiling>fpu32-0.exe
Testing 1000 calls and 6220800 memory allocations :
float result 9679922003968.000000
Float results ( 43.69694183 seconds) : Total allocation time 18.57704912 seconds , total math time 23.58766864 seconds , total dealloc time 1.53222407
Float result averages : Allocation average 0.01857705 , math average 0.02358767 , dealloc average 0.00153222
double result 9674583494400.500000
Double results ( 54.04342045 seconds) : Total allocation time 19.55067594 seconds , total math time 31.37647679 seconds , total dealloc time 3.11626772
Double result averages : Allocation average 0.01955068 , math average 0.03137648 , dealloc average 0.00311627

c:\ctwoplus\progcode\BenchmarksAndProfiling>fpu32-1.exe
Testing 1000 calls and 6220800 memory allocations :
float result 9679922003968.000000
Float results ( 25.86970711 seconds) : Total allocation time 6.87207194 seconds , total math time 17.43963017 seconds , total dealloc time 1.55800499
Float result averages : Allocation average 0.00687207 , math average 0.01743963 , dealloc average 0.00155800
double result 9674583494400.500000
Double results ( 33.06386690 seconds) : Total allocation time 12.41526336 seconds , total math time 17.54913144 seconds , total dealloc time 3.09947210
Double result averages : Allocation average 0.01241526 , math average 0.01754913 , dealloc average 0.00309947

c:\ctwoplus\progcode\BenchmarksAndProfiling>fpu32-2.exe
Testing 1000 calls and 6220800 memory allocations :
float result 9679922003968.000000
Float results ( 24.68116194 seconds) : Total allocation time 6.41169159 seconds , total math time 16.69645412 seconds , total dealloc time 1.57301624
Float result averages : Allocation average 0.00641169 , math average 0.01669645 , dealloc average 0.00157302
double result 9674583494400.500000
Double results ( 32.01803220 seconds) : Total allocation time 12.20064342 seconds , total math time 16.72717455 seconds , total dealloc time 3.09021423
Double result averages : Allocation average 0.01220064 , math average 0.01672717 , dealloc average 0.00309021

c:\ctwoplus\progcode\BenchmarksAndProfiling>fpu32-3.exe
Testing 1000 calls and 6220800 memory allocations :
float result 9679922003968.000000
Float results ( 24.45988656 seconds) : Total allocation time 6.31435281 seconds , total math time 16.57656673 seconds , total dealloc time 1.56896702
Float result averages : Allocation average 0.00631435 , math average 0.01657657 , dealloc average 0.00156897
double result 9674583494400.500000
Double results ( 32.33114045 seconds) : Total allocation time 12.42040700 seconds , total math time 16.77953203 seconds , total dealloc time 3.13120142
Double result averages : Allocation average 0.01242041 , math average 0.01677953 , dealloc average 0.00313120
ECHO is on.

c:\ctwoplus\progcode\BenchmarksAndProfiling>rem Running 64 bit fpu benchmarks

c:\ctwoplus\progcode\BenchmarksAndProfiling>rem fpu64-0.exe

c:\ctwoplus\progcode\BenchmarksAndProfiling>rem fpu64-1.exe

c:\ctwoplus\progcode\BenchmarksAndProfiling>rem fpu64-2.exe

c:\ctwoplus\progcode\BenchmarksAndProfiling>rem fpu64-3.exe

c:\ctwoplus\progcode\BenchmarksAndProfiling>

Here's the code I used :

//CREATED: 2015-06-25 15:11 - -BDN
//UPDATED: 2015-06-25 15:11 - -BDN
//AUTHORS: Brandon D. Northcutt (brandon@northcutt.net)
//
//This is a program intended to illustrate the relative efficiency of double versus single precision floating point numbers.
#include "allegro5/allegro.h"

#include <cstdio>
#include <cstdlib>

// volatile so the compiler cannot optimize the allocation size away
volatile int MEM = 6220800;///2015-06-25 14:27 - An RGB 1920x1080 image. -BDN
const int CALLS = 1000;

double * ad;
float * af;

int CALLNUM = 0;

// Per-call and accumulated timings, in seconds, for the double benchmark
double double_alloc_time[CALLS];
double double_math_time[CALLS];
double double_dealloc_time[CALLS];
double total_double_alloc_time = 0.0;
double total_double_math_time = 0.0;
double total_double_dealloc_time = 0.0;

// Per-call and accumulated timings, in seconds, for the float benchmark
double float_alloc_time[CALLS];
double float_math_time[CALLS];
double float_dealloc_time[CALLS];
double total_float_alloc_time = 0.0;
double total_float_math_time = 0.0;
double total_float_dealloc_time = 0.0;

void double_memory_allocation()
{
   ad=new double[MEM];
   for(int i=0;i<MEM;i++) ad[i]=0.0;
}

double double_math(void)
{
   double t;
   for(int i=1;i<MEM;i++)
   {
      t=(double)i;
      ad[i]=(t*t - t)/(t + t);
      ad[0]+=ad[i];
   }
   return ad[0];
}

void double_memory_deallocation()
{
   delete[] ad;// array delete to match new[]
}

double double_benchmark(void)
{
   double r;
   double t1 = al_get_time();
   double_memory_allocation();
   double t2 = al_get_time();
   r=double_math();
   double t3 = al_get_time();
   double_memory_deallocation();
   double t4 = al_get_time();

   total_double_alloc_time += double_alloc_time[CALLNUM] = t2 - t1;
   total_double_math_time += double_math_time[CALLNUM] = t3 - t2;
   total_double_dealloc_time += double_dealloc_time[CALLNUM] = t4 - t3;

   return r;
}

void float_memory_allocation()
{
   af=new float[MEM];
   for(int i=0;i<MEM;i++) af[i]=0.0f;
}

float float_math(void)
{
   float t;
   for(int i=1;i<MEM;i++)
   {
      t=(float)i;
      af[i]=(t*t - t)/(t + t);
      af[0]+=af[i];
   }
   return af[0];
}

void float_memory_deallocation()
{
   delete[] af;// array delete to match new[]
}

float float_benchmark(void)
{
   float r;
   double t1 = al_get_time();
   float_memory_allocation();
   double t2 = al_get_time();
   r=float_math();
   double t3 = al_get_time();
   float_memory_deallocation();
   double t4 = al_get_time();

   total_float_alloc_time += float_alloc_time[CALLNUM] = t2 - t1;
   total_float_math_time += float_math_time[CALLNUM] = t3 - t2;
   total_float_dealloc_time += float_dealloc_time[CALLNUM] = t4 - t3;

   return r;
}

int main (void)
{
   al_init();// needed before al_get_time can be used

   float tmpf = 0.0f;
   double tmpd = 0.0;

   printf("Testing %d calls and %d memory allocations :\n" , CALLS , MEM);

   for(CALLNUM=0;CALLNUM<CALLS;CALLNUM++) tmpf=float_benchmark();
   printf("float result %f\n",tmpf);

   double total_float_time = total_float_alloc_time + total_float_math_time + total_float_dealloc_time;
   printf("Float results (%14.8lf seconds) : Total allocation time %14.8lf seconds , total math time %14.8lf seconds , total dealloc time %14.8lf\n",
          total_float_time , total_float_alloc_time , total_float_math_time , total_float_dealloc_time);
   printf("Float result averages : Allocation average %14.8lf , math average %14.8lf , dealloc average %14.8lf\n",
          total_float_alloc_time/CALLS , total_float_math_time/CALLS , total_float_dealloc_time/CALLS);

   for(CALLNUM=0;CALLNUM<CALLS;CALLNUM++) tmpd=double_benchmark();
   printf("double result %lf\n",tmpd);

   double total_double_time = total_double_alloc_time + total_double_math_time + total_double_dealloc_time;
   printf("Double results (%14.8lf seconds) : Total allocation time %14.8lf seconds , total math time %14.8lf seconds , total dealloc time %14.8lf\n",
          total_double_time , total_double_alloc_time , total_double_math_time , total_double_dealloc_time);
   printf("Double result averages : Allocation average %14.8lf , math average %14.8lf , dealloc average %14.8lf\n",
          total_double_alloc_time/CALLS , total_double_math_time/CALLS , total_double_dealloc_time/CALLS);

   return 0;
}

As expected, -O0 took the longest. -O1, -O2, and -O3 were all comparable. Memory allocation and deallocation generally took twice the time for doubles as it did for floats (because they are twice as big). Deallocation times were constant across optimizations. Something important to note is that I used volatile for the memory allocation size so it couldn't be optimized away.

I used al_get_time for measurements. Allocation and deallocation can be quite costly, and should be avoided if possible. The math times are comparable on my laptop with any optimization other than -O0 (Intel i7-5700HQ @ 2.70 GHz).
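Since allocation dominates the difference between the float and double totals, one option is to allocate the buffer once and reuse it for every pass. The sketch below is an assumption about how that could look, not part of the benchmark that produced these numbers; `math_pass` is a hypothetical name, and it performs the same arithmetic as `double_math()` on a caller-owned `std::vector`.

```cpp
#include <cstddef>
#include <vector>

// Same arithmetic as double_math() in the benchmark, but on a buffer the
// caller allocates once and passes in for every call.
inline double math_pass(std::vector<double>& buffer) {
    if (buffer.empty()) return 0.0;
    buffer[0] = 0.0;  // reset the accumulator slot before each pass
    for (std::size_t i = 1; i < buffer.size(); ++i) {
        const double t = static_cast<double>(i);
        buffer[i] = (t * t - t) / (t + t);
        buffer[0] += buffer[i];
    }
    return buffer[0];
}
```

A caller would construct `std::vector<double> buffer(MEM);` once outside the loop and call `math_pass(buffer)` each frame, so only the math time is paid per call.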

I'm running Windows 10 64 bit and I wanted to test with -m64 architecture but mingw32 doesn't support it.

Edit

TL;DR;
Here's a table of the results including the allocations :

```
-O0 float  : 43.70ms per op = 22.88FPS
-O0 double : 54.04ms per op = 18.50FPS

-O1 float  : 25.87ms per op = 38.65FPS
-O1 double : 33.06ms per op = 30.25FPS

-O2 float  : 24.68ms per op = 40.52FPS
-O2 double : 32.02ms per op = 31.23FPS

-O3 float  : 24.46ms per op = 40.88FPS
-O3 double : 32.33ms per op = 30.93FPS
```

And a table of the results for just the computations :

```
-O0 float  : 23.59ms per op = 42.39FPS
-O0 double : 31.38ms per op = 31.87FPS

-O1 float  : 17.44ms per op = 57.34FPS
-O1 double : 17.55ms per op = 56.98FPS

-O2 float  : 16.70ms per op = 59.88FPS
-O2 double : 16.73ms per op = 59.77FPS

-O3 float  : 16.58ms per op = 60.31FPS
-O3 double : 16.78ms per op = 59.59FPS
```

So you can see that if you wanted to process 6220800 (1920x1080x3) floating point elements per frame on my laptop's cpu, it would just barely keep up with a 60 Hz refresh rate with optimizations enabled. But the difference between single precision floating point math and double precision floating point math is almost negligible.
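The per-op and FPS figures in the tables follow directly from the benchmark totals: each total covers 1000 calls, so dividing gives the time per call, and the reciprocal gives calls per second. A small sketch of that arithmetic, with function names invented for the example:

```cpp
// Average time per call in milliseconds, given a total in seconds.
inline double milliseconds_per_call(double total_seconds, int calls) {
    return total_seconds * 1000.0 / calls;
}

// How many such calls fit into one second.
inline double calls_per_second(double total_seconds, int calls) {
    return calls / total_seconds;
}
```

For the -O3 float math total of 16.57656673 seconds over 1000 calls this gives about 16.58 ms per call and roughly 60 calls per second, which is why that case only just keeps up with a 60 Hz refresh.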

Chris Katko

Since Allegro's transforms are geared towards GPUs, or so I think, single precision is probably best.

OpenGL also supports half-precision floats and integer coordinate systems. I don't see any clear reason why Allegro shouldn't support them.

The Gamecube runs with integer math. Now that OpenGL supports it, the Dolphin emulator was ported to integer math and tons of bugs have gone away.

https://dolphin-emu.org/blog/2014/03/15/pixel-processing-problems/

ALSO, I had no idea there was a difference between 0.0 and 0.0f. There's REALLY such a thing as a float vs double literal, and the compiler will silently convert them if you have the wrong one. ... I think?

This is insanity!

Bringing back to another of my threads: Somehow, a std::string implicitly converting to a c_string is terrible, but doubles to floats, and floats to ints are OKAY being implicit?! COME ON C++. COME ON.
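On the literal question: in C++ an unsuffixed literal such as 0.0 has type double and 0.0f has type float, and a double converts to float implicitly. The sketch below checks the literal types at compile time and shows a case where narrowing changes the value; `narrowing_changes_value` is a name made up for the example. For 0.0 itself the conversion is exact, so the choice of literal there affects the type but not the stored value.

```cpp
#include <type_traits>

// Literal types are fixed by the suffix, not by what they are assigned to.
static_assert(std::is_same<decltype(0.0), double>::value,
              "an unsuffixed floating literal is a double");
static_assert(std::is_same<decltype(0.0f), float>::value,
              "the f suffix makes the literal a float");

// 0.1 is not exactly representable in either type, and the float and double
// approximations differ, so narrowing loses information.
inline bool narrowing_changes_value() {
    const double d = 0.1;
    const float f = static_cast<float>(d);  // 'float f = d;' would also compile
    return static_cast<double>(f) != d;
}
```

With GCC, `-Wall` does not report the implicit double-to-float conversion; `-Wconversion` does. Brace initialization such as `float f{d};` with a non-constant double is rejected as a narrowing conversion.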

Edgar Reynaldo

See my last edit for FPS results of ops with and without allocations included.

SiegeLord

ALLEGRO_TRANSFORM indeed has floats inside it, and since its internals are public, we're kind of stuck with it that way. It is that way primarily because that's what is supported across platforms (the culprit in this case is Direct3D).

Erin Maus

OpenGL also supports half-precision floats and integer coordinate systems. I don't see any clear reason why Allegro shouldn't support them.

If I remember correctly, half precision is only useful on mobile platforms. It's a no-op on most desktop GPUs. Similarly, native integer support is slow, like doubles.

But most of all, such features are useless for anyone using Allegro for rendering.

Quote:

The Gamecube runs with integer math.

The classic Xbox had a bizarre programmable GPU unlike otherwise equivalent Nvidia chips before and after. The SNES had a terribly weak CPU, only a minor step up from the NES. The Nintendo 64 was pretty much a SGI workstation. The Wii has a small ARM processor on the same die as the GPU that controls various security and I/O processes.

Consoles used to have strange quirks unlike PCs, and that was nice, but that doesn't have any relevance to modern hardware.

Edgar Reynaldo

It would be possible to create a function called al_transform_coordinates_d that took double pointers though. That would at least save the allocation of two floats. But I guess if they're on the stack it wouldn't matter, even in a heavy loop. Don't mind me. Just thinking out loud.

My only concern is this part of my code :

GeneratePlotData only gets called if the theta_delta or the radial_delta change, as that affects the number of data points in the spiral. But the transform and the modified coordinates change every time the rotation changes, which is quite often in my program.

void Spiral2D::Refresh() {
   if (spiral_needs_refresh) {
      GeneratePlotData();
   }
   if (transform_needs_refresh) {
      /// Refresh modified data from original using transform
      al_identity_transform(&transform);
      al_rotate_transform(&transform , rotation_degrees*(M_PI/180.0));
      al_scale_transform(&transform , scalex , scaley);
      al_translate_transform(&transform , centerx , centery);
      for (unsigned int i = 0 ; i < Size() ; ++i) {
         Pos2D mod = DataOriginal(i);
         /// TODO : This is a hack
         float x = mod.x;
         float y = mod.y;
         al_transform_coordinates(&transform , &x , &y);
         mod.x = x;
         mod.y = y;
         DataModified(i) = mod;
      }
      transform_needs_refresh = false;
   }
}
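If the float round trip ever becomes a real bottleneck or a precision problem, one alternative is a small double precision 2D affine transform kept alongside Allegro. The sketch below is an assumption, not Allegro API: `Affine2D` and its members are hypothetical names, and it only covers the rotate, scale, translate sequence used in Refresh(), with each operation applied after the ones before it.

```cpp
#include <cmath>

// Hypothetical double precision 2D affine transform:
//   x' = a*x + c*y + tx
//   y' = b*x + d*y + ty
struct Affine2D {
    double a = 1.0, b = 0.0, c = 0.0, d = 1.0, tx = 0.0, ty = 0.0;

    // Append a rotation by theta radians after the current transform.
    void rotate(double theta) {
        const double cs = std::cos(theta), sn = std::sin(theta);
        const double na = cs * a - sn * b,    nb = sn * a + cs * b;
        const double nc = cs * c - sn * d,    nd = sn * c + cs * d;
        const double ntx = cs * tx - sn * ty, nty = sn * tx + cs * ty;
        a = na; b = nb; c = nc; d = nd; tx = ntx; ty = nty;
    }

    // Append a scale after the current transform.
    void scale(double sx, double sy) {
        a *= sx; c *= sx; tx *= sx;
        b *= sy; d *= sy; ty *= sy;
    }

    // Append a translation after the current transform.
    void translate(double dx, double dy) { tx += dx; ty += dy; }

    // Transform a point in place, with no float temporaries.
    void apply(double& x, double& y) const {
        const double ox = x;
        x = a * ox + c * y + tx;
        y = b * ox + d * y + ty;
    }
};
```

In the loop above this would replace the float temporaries with a direct `apply(mod.x, mod.y)` call, assuming `Pos2D` stores doubles. Drawing would still go through ALLEGRO_TRANSFORM, which stays single precision.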