
Profiling old code 
ngiacomelli
Member #5,114
October 2004

I was digging around my old code and found my Mode7 engine. It's very slow and I'd really like to give it a speed jolt. The majority of slowdown comes with rendering (obviously):
fast_getpixel and fast_putpixel take up most of the execution time, along with fixtoi. I'm really looking to try and optimize this section of code. I had briefly considered a jump to hardware acceleration. Would this be of much use? Most of my project is already written in C. How much of a headache would it be to introduce something like Openlayer? Would I see a big speed increase? Any help on either optimization or otherwise is appreciated!

TeamTerradactyl
Member #7,733
September 2006

Why, specifically, are you needing individual pixelwrites? Are you using them for a particle system?

ngiacomelli
Member #5,114
October 2004

The code is based on an ancient (and legendary) Allegrospecific tutorial, which featured a Mode7 example that worked by grabbing and plotting individual pixels for the Mode7 projection.

Thomas Harte
Member #33
April 2000

Quote: ast_getpixel and fast_putpixel take up most of the execution time, along with fixtoi. They should take up most of the execution time — they're doing all the work! Your compiler might be missing one obvious optimisation though. You and I can see that screen_y doesn't change inside the innermost loop, but because you're using inline functions the compiler possibly can't do anything about that. Have you considered something more like the following?
Also you might shave a few cycles by dumping fixtoi in favour of some simple >> 16s. fixtoi is bound to do something more like (x + 32768) >> 16 or some other rounding method, whereas truncation isn't going to be noticeably different for your application. I think the speedups are likely to be insubstantial though, at best. Quote: How much of a headache would it be to introduce something like Openlayer? Would I see a big speed increase? I can't answer the headache question, but I can tell you the speed difference would be gigantic. A Mode7 floor that isn't using a tilemap like yours can be done in a single quad, which the GPU will do for you. [My site] [Tetrominoes] 
ngiacomelli
Member #5,114
October 2004

Quote: Also you might shave a few cycles by dumping fixtoi in favour of some simple >> 16s. fixtoi is bound to do something more like (x + 32768) >> 16 or some other rounding method, whereas truncation isn't going to be noticeably different for your application. I've never done any bitshifting (I believe that's the terminology). Could you explain? Also, I'm talking headache in terms of: updating my existing implementation, setting up OpenGL and writing code for it (having never done so).

TeamTerradactyl
Member #7,733
September 2006

Bitshifting: int a = 128; int b; b = a >> 1; // 'b' now equals 128 / 2 b = a >> 2; // 'b' now equals (128 / 2) / 2 b = a >> 3; // 'b' now equals ((128 / 2) / 2) / 2 For each time you bitshift to the right, you divide by two. It's called bitshifting since it's noticable in binary: a = 10000000 b = a >> 1: 01000000 // Shifted to the right by 1, so it now equals 64 b = a >> 2: 00100000 // Shifted to the right by 2, so it now equals 32 b = a >> 3: 00010000 // Shifted to the right by 3 b = a >> 7: 00000001 // Shifted to the right by 7, so it now equals 1 b = a >> 8: 00000000 // Now it equals 0 Now that I've written this, I hope you don't already know what this is and were looking for implementation instead of a "what's this?"

Thomas Harte
Member #33
April 2000

Quote: I've never done any bitshifting (I believe that's the terminology). Could you explain? Bitshifting is the same as moving the decimal point with normal numbers. If you do << 1 then you move all the binary digits in your integer number one place to the left. If you do >> 1 then you move all the binary digits in your integer number one place to the right. Because of the way binary works, if you do << 1 then you do the same thing as multiplying the number by 2. If you do << 2 then you do the same thing as multiplying the number by 4. That's exactly the same as the way that if you took an ordinary decimal number and moved all the digits one place to the left then you'd do the same thing as multiplying the number by 10. If you move all the digits two places to the left then you do the same thing as multiplying by 100. Similarly >> 1 is like a divide by 2, >> 2 is like a divide by 4, >> 3 is like a divide by 8, etc. They're not identical this time because the rounding doesn't work. For example, in decimal, if you had the number 9 and you moved it one place to the right you'd have 0.9. If someone told you to round it to an integer you'd say the answer was 1. With shifting you get truncation, not rounding — i.e. any digits that fall off the end of the memory spot reserved for the integer are just forgotten. Fixed numbers are really just integers that pretend to have a decimal point in the middle. Think of it like the difference in decimal between millimetres and metres. If you can only store integers and stick with one scale then you can only store distances like 1 metre, 2 metres, etc. If however you keep some measurements on a millimetre scale and convert them to metres when you need them then you can store with much greater precision. Fixed numbers are exactly the same, except that in the realm of binary it's easier to divide a whole unit into units of 1/65536 rather than the 1/1000 in the milimetre example. So a fixed number is just an integer that uses a different scale. So the correct way to convert a fixed to an integer is to divide by 65536, just as the correct way to convert from a millimetre to a metre is to divide by 1000. A time saving quick fix is to just shift the binary digits right by 16 places. You don't get exactly the right answer, but you save a tiny little bit of work. In the millimetres example, if you just shifted then you'd see that 0 to 999 millimetres are considered to be 0 metres, 1000 to 1999 millimetres are considered to be 1 metre, etc. So a way to do the conversion correctly with shifting would be "add 500, then shift". That's no more than the rounding you learnt in school — if the digit after the one you are interested in is 5 or more then the one you are interested will go up by 1. If you aren't then it won't. So another way to look at it is that all you do by just shifting rather than shift+adding is move the pixels on your floor texture map half a pixel to the right and half a pixel down. Which is barely any different. [My site] [Tetrominoes] 
ngiacomelli
Member #5,114
October 2004

Fantastic explanations. Thank you both. Thomas Harte, you seem to be something of an authority when it comes to this sort of stuff. There's been a slight speed increase with your suggestions. But I'm really looking to get the most out of this, as I possibly can. Is there a way of improving my rendering method, at all?

GullRaDriel
Member #3,861
September 2003

Avoiding a call to some function can help. So , I think that if you replace your fast_put/get pixel method directly by the only line inside the function, it should help, specially if you have a lots of call. Also, you should profile your whole code and use something as gprof to see where exactly you loose some time. "Code is like shit  it only smells if it is not yours" 
orz
Member #565
August 2000

Quote: For each time you bitshift to the right, you divide by two. It's called bitshifting since it's noticable in binary: I just thought I'd point out a very minor detail here; when you divide by 2 using bitshifting, the rounding is towards negative infinity (ie 1 / 2 is 0, but 1 / 2 is 1), whereas when you divide using regular integer math the result is rounded towards 0 (1/2 = 0, 1/2 = 0). Thomas Harte earlier posted this:
That's probably the biggest speed bump possible without doing something drastic (like switching color depths or using hardware acceleration). He also mentioned replacing the fixtoi calls with simple leftshifts for further minute speedups. I'd just like to add that if you want identical rounding results you can add the 32768 to space_x and space_y prior to the loop. Furthermore, it's possible to optimize away the bitwiseand operators by left shifting space_x, line_dx, space_y, and line_dy prior to the loop in such a manner that integer overflow performed the same operation. However, the resulting speed improvement would probably be negligable, as bitwiseand is extremely fast; in fact, I don't actually recommend implementing this optimization, as the improvement should be almost too small to measure. The optimized version would look something like this:
note: that aproach will fail is the tile has a width or height of 1; use only 2x2 or large tiles with it 
ngiacomelli
Member #5,114
October 2004

EDIT: Moved.

