Optimization
type568
Member #8,381
March 2007

acquire_bitmap(frame);
for (i = 0; i < the_image[size]->w; i++)
    for (j = 0; j < the_image[size]->h; j++)
        _putpixel32(frame,
            tempex + cos(crix = (y + FLOW_CS_X / the_image[size]->w * i))
                   * sin(criy = (y + FLOW_CS_Y / the_image[size]->h * j))
                   * the_image[size]->w,
            tempey + cos(criy) * sin(crix) * the_image[size]->h,
            _getpixel32(the_image[size], i, j));

release_bitmap(frame);

That's the code. It takes a picture & prints it to the screen, placing each pixel in such a way that it looks like the picture is on a ball. But it is very slow when drawn every frame. Obviously, the bottleneck is the cosine & sine functions. Would you please help me out?

Thanks.

TeamTerradactyl
Member #7,733
September 2006

1) If FLOW_CS_X is a constant, and you know that the_image[size]->w will never change, then pre-calculate that math outside the for loop:

int the_x = FLOW_CS_X / the_image[size]->w;
int the_y = FLOW_CS_Y / the_image[size]->h;
for(...)

Then inside the for() loop, you don't have an expensive (20+ cycle) division; just a multiplication and addition:

crix = (y + the_x * i);
criy = (y + the_y * j);

My idea would be something similar to this:

acquire_bitmap(frame);
double result_sin_cos = 0.0;
for (i = 0; i < the_image[size]->w; i++)
{
    for (j = 0; j < the_image[size]->h; j++)
    {
        crix = (y + FLOW_CS_X / the_image[size]->w * i); // Calculate these here, then just put the temp
        criy = (y + FLOW_CS_Y / the_image[size]->h * j); // value inside
        result_sin_cos = cos(crix) * sin(criy);
        _putpixel32(frame,
            tempex + result_sin_cos * the_image[size]->w,
            tempey + result_sin_cos * the_image[size]->h,
            _getpixel32(the_image[size], i, j));
    }
}
release_bitmap(frame);

This way, you only do expensive sin() and cos() operations once instead of twice.

Thomas Harte
Member #33
April 2000

Quote:

Obviously, the bottleneck is the cosine & sine functions. Would you please help me out?

It's more likely to be the float to int conversion you do after the cosine & sine. Plus the fact that you're doing your work back to front. You should be looping through screen pixels and calculating backwards to source pixels rather than going through all source pixels and deciding where to put them on screen.
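In outline, backward mapping looks something like the sketch below. This is only a skeleton: inverse_map_x()/inverse_map_y() and the dest_* bounds are placeholders for whatever inverse of your ball warp you derive, not functions from your code.

acquire_bitmap(frame);
for (int sy = dest_y; sy < dest_y + dest_h; sy++) {
    for (int sx = dest_x; sx < dest_x + dest_w; sx++) {
        /* ask "which source pixel ends up here?" instead of
           "where does this source pixel go?" */
        int u = inverse_map_x(sx, sy);
        int v = inverse_map_y(sx, sy);
        if (u >= 0 && u < the_image[size]->w && v >= 0 && v < the_image[size]->h)
            _putpixel32(frame, sx, sy, _getpixel32(the_image[size], u, v));
    }
}
release_bitmap(frame);

Every destination pixel gets written at most once, and the float-to-int rounding happens once per screen pixel inside the (hypothetical) inverse functions.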

orz
Member #565
August 2000

1. You calculate your sines and cosines twice per inner loop. You can cut that down to once per inner loop easily. Furthermore, one of those depends only upon the outer loop variable, so you can pull it out of the inner loop. The one left in the inner loop can be sped up further by putting it into a table.
2. You have y as your inner loop instead of x. x is usually faster as the inner loop, because horizontally adjacent pixels are adjacent in memory.

double *precalc = malloc(the_image[size]->h * sizeof(double));
for (i = 0; i < the_image[size]->w; i++) {
    crix = (y + FLOW_CS_X / the_image[size]->w * i);
    precalc[i] = cos(crix);
}
acquire_bitmap(frame);
for (j = 0; j < the_image[size]->h; j++) {
    criy = (y + FLOW_CS_Y / the_image[size]->h * j);
    double sin_criy = sin(criy);
    for (i = 0; i < the_image[size]->w; i++) {
        double product = precalc[i] * sin_criy;
        _putpixel32(frame,
            tempex + product * the_image[size]->w,
            tempey + product * the_image[size]->h,
            _getpixel32(the_image[size], i, j));
    }
}
release_bitmap(frame);
free(precalc);

edit:
3. Thomas Harte pointed to float->int conversion as a more likely culprit than sin/cos. I'm slightly skeptical: float->int conversion can be slow, but sin/cos are even slower, and you had more of them. But with the sines & cosines optimized, float->int conversion will probably become your new bottleneck. I don't know how to optimize float->int portably in C, but this works decently:

#if (!defined NO_ASM) && (defined _MSC_VER) && defined(_M_IX86)
    int iround(double a) {
        int r;
        __asm fld a;
        __asm fistp r;
        return r;
    }
#elif (!defined NO_ASM) && (defined __GNUC__) && (defined __i386__)
    int iround(double a) {
        int r;
        __asm__ __volatile__(
            "fldl %1 \n\t"
            "fistpl %0 \n\t"
            : "=m" (r)
            : "m" (a)
        );
        return r;
    }
#elif 0 // disabled because I don't know good preprocessor symbols to check
#   define BIGDOUBLE 6755399441055744.0
    int iround(double a) {
        int i;
        a = a + (BIGDOUBLE + 0);
        i = *((int*)&a);
        return i;
    }
#else
    int iround(double a) { return (int)std::floor(a + 0.5); }
#endif

edit2: removed unused crix calculation in inner loop

type568
Member #8,381
March 2007

Quote:

You should be looping through screen pixels and calculating backwards to source pixels rather than going through all source pixels and deciding where to put them on screen.

Uhm. First of all, I've no idea how to do that. Also, the picture doesn't take up the whole screen; it's much smaller, and it can generally be anywhere. If your method is still OK, would somebody please expand on it?

Thanks, still. :)

Quote:

result_sin_cos

I have forgotten these functions a bit, but I don't think that sin(a)*cos(b) is the same as cos(a)*sin(b). (And that is what I actually have.)

Quote:

int the_x = FLOW_CS_X / the_image[size]->w;
int the_y = FLOW_CS_Y / the_image[size]->h;

Thanks!

Audric
Member #907
January 2001

Quote:

Obviously, the bottleneck is the cosine & sine functions.

The _putpixel32() and _getpixel32 pair seem better candidates to me.

By the way, swapping the two for loops (ie: loop on bmp->h first, then on bmp->w) will make getpixel examine contiguous memory very often, so the cache should speed things up.

orz
Member #565
August 2000

Quote:

I have forgotten these functions a bit, but I don't think that sin(a)*cos(b) is the same as cos(a)*sin(b). (And that is what I actually have.)

Oops. Both TeamTerradactyl and I misread your code there. Still, all the ideas either of us mentioned still work. Here's a corrected version of my code that fixes the stuff we misread:

double *precalc_cos = malloc(the_image[size]->w * sizeof(double));
double *precalc_sin = malloc(the_image[size]->w * sizeof(double));
for (i = 0; i < the_image[size]->w; i++) {
    crix = (y + FLOW_CS_X / the_image[size]->w * i);
    precalc_cos[i] = cos(crix);
    precalc_sin[i] = sin(crix);
}
acquire_bitmap(frame);
for (j = 0; j < the_image[size]->h; j++) {
    criy = (y + FLOW_CS_Y / the_image[size]->h * j);
    double cos_criy = cos(criy);
    double sin_criy = sin(criy);
    for (i = 0; i < the_image[size]->w; i++) {
        _putpixel32(frame,
            tempex + precalc_cos[i] * sin_criy * the_image[size]->w,
            tempey + precalc_sin[i] * cos_criy * the_image[size]->h,
            _getpixel32(the_image[size], i, j));
    }
}
release_bitmap(frame);
free(precalc_cos);
free(precalc_sin);

edit: fixed amount of memory allocated for tables.

TeamTerradactyl
Member #7,733
September 2006

I like orz's optimizations. And correct, I misread the sin(x),cos(y)...sin(y),cos(x) part.

Thanks for the correction, all!

type568
Member #8,381
March 2007

Quote:

Quote:
Obviously, the bottleneck is the cosine & sine functions.
The _putpixel32() and _getpixel32 pair seem better candidates to me.

You are wrong. The operations are done on locked memory, inside the RAM, and are simple data copying, with some addressing difficulties I guess. (The second sentence is totally my opinion; is it correct?)

putpixel(just some int pluses)
 if(getpixel())
  putpixel()

40+ FPS

The code I gave: a stable 9 FPS (same picture, color depth, program)

If you do not believe me, I can post the program. But I did think about it, and I'll later add an option that performs some operations in an 8-bit mode and then blits to the screen.

Quote:

By the way, swapping the two for loops (ie: loop on bmp->h first, then on bmp->w) will make getpixel examine contiguous memory very often, so the cache should speed things up.

I am sorry, I don't understand; would you please expand on your idea? Thanks.

TeamTerradactyl & orz, thanks a lot!
Great idea. I think it will help me a lot in the future. It looks like I'm going to have another question; I'll post the program there and report the FPS increase.

Thank you.

orz
Member #565
August 2000

Quote:

By the way, swapping the two for loops (ie: loop on bmp->h first, then on bmp->w) will make getpixel examine contiguous memory very often, so the cache should speed things up.

Quote:

I am sorry, I don't understand, would you please expand your idea? Thanks.

He's saying the same thing I said about your choice of inner loops: Do a for (y...) for (x...) {} instead of a for(x...) for (y...) {}, because the way bitmaps are stored in memory makes that friendlier to the cache. edit: notice that I switched your loops in my version

Quote:

2. You have y as your inner loop instead of x. x is usually faster as the inner loop, because horizontally adjacent pixels are adjacent in memory.

type568
Member #8,381
March 2007

Thank you, but would you please tell me where I can get some info about the cache and how it works?

orz
Member #565
August 2000

You can try searching for spatial locality or cache optimization or CPU cache tutorial or somesuch. The cache is basically a small fast memory that is not explicitly accessed but holds copies of recently accessed data (and data near recently accessed data in memory) so that it can be used faster.

TeamTerradactyl
Member #7,733
September 2006

The cache is part of the physical computer hardware.

The very smallest are the registers found in the CPU chip itself. On most 32-bit machines, you have about 6-8 of these registers available to use and fiddle with (the compiler does all this for you, as it is EXTREMELY hardware-dependent). It can usually handle up to about 2-3 billion operations a second on slower machines.

The next-smallest is the physical memory found directly next to the registers: the "L1 cache". This is typically kept small so as to speed up all the different math and memory calls necessary to do the calculations (typically, less than 8KB).

Above that is the L2 cache. This is what most CPU vendors will advertise about their chips: most have something like 2MB of L2 cache (or on-board cache), but newer processors are coming with 4 and 8MB. The more you have, the slower the processor, but it depends on how the actual processor has been set up.

Then, after the L2 cache is the actual RAM you have in your computer. Most people have between 512MB and 4GB of RAM. This RAM is slower than the above-mentioned ones, simply because there is more of it and it is designed differently.

So when the others here have suggested that you do row-column order instead of column-row order, it means that the computer will grab a huge chunk of memory and stick it into the L2 cache. So a bitmap that you're _putpixel()-ing onto is going to be loaded into as much of the L2 cache as can fit. If you tell it to calculate the columns first, what essentially happens is:

1) The BITMAP is loaded into the L2 cache (as much as can fit)
2) You access a single pixel, doing the calculations you need
3) You then move "down" in the image (instead of "right", or into the rest of the contiguous memory which has already been loaded), which essentially repeats Step #1 again

This slows down the computations enormously, because you have to "reload" the memory a bunch of times. However, if you do it in row-column mode, this is what happens:

1) The BITMAP is loaded into the L2 cache (as much as can fit)
2) You access a single pixel, doing the calculations you need
3) You move "right" (into the next piece of the BITMAP which is still in memory)
4) You repeat Steps #2-3 until you run out of BITMAP. When this happens, Step #1 repeats with the next "segment" of BITMAP that needs to be loaded into memory

Because you don't need to load/unload the different sections of the BITMAP from memory, it's a LOT faster access-time.
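As a rough, self-contained illustration (bmp here is just any 32-bit memory BITMAP and the two function names are made up; this is not the program from this thread), the two traversal orders look like this:

#include <allegro.h>

/* Row-column: the inner loop walks along one row, which is contiguous in memory,
   so almost every read hits data that is already in the cache. */
long sum_row_column(BITMAP *bmp)
{
    long sum = 0;
    for (int y = 0; y < bmp->h; y++)
        for (int x = 0; x < bmp->w; x++)
            sum += _getpixel32(bmp, x, y);
    return sum;
}

/* Column-row: every step of the inner loop jumps a whole row ahead in memory,
   so the cache keeps getting refilled. Same work, much slower on big bitmaps. */
long sum_column_row(BITMAP *bmp)
{
    long sum = 0;
    for (int x = 0; x < bmp->w; x++)
        for (int y = 0; y < bmp->h; y++)
            sum += _getpixel32(bmp, x, y);
    return sum;
}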

Now, the above numbers (about the L1/L2 cache, etc.) aren't exactly accurate, but are guestimates (I forget the exact numbers from my CS class...) more for clarification than exact specs.

Oh, and if you have a 64-bit machine, the number of "registers" jumps from 6-8 to nearly 14-16, IIRC. So they're a LOT faster than 32-bit machines (if compiled for 64-bit processors, that is).

HoHo
Member #4,534
April 2004

Regarding the numbers: 32-bit x86 has 8 regular registers (7 useful) and a bunch of others you can't use without writing special code (SIMD). On 64-bit the number of registers doubles, but they generally aren't all that much faster. In a perfect situation they could be up to 4x faster, but only when you depend on doing lots of calculations on 64-bit integers. With other kinds of data you'll be lucky to get more than a 10-20% performance increase. Generally you shouldn't worry about them; that is the job of the compiler.

L1 is at least 8 kiB and up to 64 kiB, depending on the CPU. Accessing data there takes 1-3 cycles.

L2 ranges from none (original Celerons and some older chips) to 4 MiB per CPU; 512 kiB is most common. Access takes around 10-16 cycles.

Memory access is around 150-400 cycles. So if your program requests data that is not in the caches, your CPU will do nothing for hundreds of cycles.

Of course any half-decent CPU has some kind of prefetching functionality built in, but if at all possible (and needed) you should try to make memory access linear. That means trying to use arrays/vectors instead of lists and per-element new/malloc, but do it intelligently. Sometimes a linked list can be better than a plain array, and most of the time you won't need to worry about the performance decrease it causes.

As always, if something is slow, profile to find out what it is. Next, see whether it is an algorithmic problem that causes the slowdown or whether the code is simply too inefficient. Until both of these are done, don't start optimizing the calculations. There is no use in getting rid of a couple of multiplications in an inner loop if you could change the complexity of the loop from O(N^2) to O(N*log(N)) by choosing a better algorithm.

__________
In theory, there is no difference between theory and practice. But, in practice, there is - Jan L.A. van de Snepscheut
MMORPG's...Many Men Online Role Playing Girls - Radagar
"Is Java REALLY slower? Does STL really bloat your exes? Find out with your friendly host, HoHo, and his benchmarking machine!" - Jakub Wasilewski

type568
Member #8,381
March 2007

orz, TeamTerradactyl & HoHo, thank you very much. But of course, new questions appeared.

orz:

Quote:

The cache is basically a small fast memory that is not explicitly accessed but holds copies of recently accessed data (and data near recently accessed data in memory) so that it can be used faster.

I don't really understand: what is "near", how is it defined? Also, do I understand correctly that in lots of cases (mine, for example) most of the bus traffic is simply useless? And is this the part where lower CL comes in to help?

TeamTerradactyl:

Quote:

The more you have, the slower the processor, but it depends on how the actual processor has been set up.

(L2 cache)

I'm totally confused and can't understand why. The only thing I can think of is that it's always being used up completely, and transferring data inside takes more time. Is that right?

HoHo:

Quote:

That means try to use arrays/vectors instead of lists and new/malloc but do it intelligently.

I don't understand how malloc can affect performance, or what the difference is between 'malloc'ing space manually and using a standard routine.

Thank you all very much for the shared experience and knowledge. I'm serious :)

HoHo
Member #4,534
April 2004

Quote:

I don't really understand: what is "near", how is it defined?

"near" means it is close to the thing that crunches the numbers. When making things simple, L1 basically touches the computing units. L2 is about twice as far but is still inside the CPU. Regular RAM is miles away (~15 cm in real world :P) sticked into a slot on your motherboard.

Quote:

Also, do I understand correctly that in lots of cases (mine, for example) most of the bus traffic is simply useless? And is this the part where lower CL comes in to help?

Lower CL helps with memory access latency. That means that instead of, say, 150 CPU cycles with CL5, you could access RAM in around 140 cycles with CL4.

Quote:

The only thing I can think of is that it's always being used up completely, and transferring data inside takes more time. Is that right?

I didn't understand that either. I guess he might have meant that accessing a bigger cache takes more time than accessing a smaller one.

Quote:

what the difference is between 'malloc'ing space manually and using a standard routine.

Well, malloc and new are the standard routines for allocating memory. There is no difference between the two.

Quote:

I don't understand how malloc can affect performance

That's the part where memory access linearity comes into play. malloc returns you a block of RAM to use. It can be anywhere in the RAM.

If you have a particle system with 1000 particles and each and every one of them is malloced, you'll have 1000 tiny pieces of memory scattered all around the RAM. If you then start updating these particles in sequence, you'll be reading data from those random places in memory. As CPUs are not very good at prefetching* data into caches when doing such random accesses, you'll have a whole lot of cache misses** and the CPU mostly wastes time waiting for the data to arrive.

If you use an array/vector for the particles you also allocate a piece of memory, but not as 1000 individual parts: all 1000 parts come at once in one big chunk. That piece of RAM is guaranteed to be contiguous in memory. When you start updating particles from one end of the array, the CPU sees you are doing linear memory accesses and starts prefetching data into the cache before it is needed, thus eliminating the cache misses.

*) Prefetching: the CPU automatically copies a piece of memory into the caches. Once the CPU needs to access that piece of memory it finds it in the cache, and access is significantly faster.

**) Cache miss: when the CPU can't find the data in the caches. As it is not there, it has to request it from RAM, and that takes a long time.
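A tiny sketch of the difference (the Particle struct, the demo() function and the count are made up for illustration; they are not from the thread's program):

#include <stdlib.h>

typedef struct { float x, y, vx, vy; } Particle;

#define NUM_PARTICLES 1000

void demo(void)
{
    int i;

    /* Scattered: one allocation per particle; each one can land anywhere in RAM,
       so updating them in sequence jumps around memory and causes cache misses. */
    Particle *scattered[NUM_PARTICLES];
    for (i = 0; i < NUM_PARTICLES; i++)
        scattered[i] = (Particle *)calloc(1, sizeof(Particle));

    /* Contiguous: one allocation for the whole lot; the particles sit side by side,
       so a linear update loop walks memory in order and the prefetcher can keep up. */
    Particle *all = (Particle *)calloc(NUM_PARTICLES, sizeof(Particle));
    for (i = 0; i < NUM_PARTICLES; i++) {
        all[i].x += all[i].vx;
        all[i].y += all[i].vy;
    }

    free(all);
    for (i = 0; i < NUM_PARTICLES; i++)
        free(scattered[i]);
}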

__________
In theory, there is no difference between theory and practice. But, in practice, there is - Jan L.A. van de Snepscheut
MMORPG's...Many Men Online Role Playing Girls - Radagar
"Is Java REALLY slower? Does STL really bloat your exes? Find out with your friendly host, HoHo, and his benchmarking machine!" - Jakub Wasilewski

TeamTerradactyl
Member #7,733
September 2006

type568 said:

I don't really understand: what is "near", how is it defined?

If you request something from memory, the computer will "grab" the whole chunk of memory physically around it. Meaning, that if I have some data like this:

..............XXX......................................

What happens is that the computer will grab ALL that data; the "..." before and after the actual "XXX" that I want. This has been done on purpose, because usually, if you want "XXX", then you may possibly want something found in memory immediately before or after it. An array is a good example. If you have a BITMAP, you first want "XXX" to draw onto screen, and then you want "YYY" which is found immediately following it, and then "ZZZ" after that.

So the computer will grab the entire chunk of memory "physically near" the part you want.

If you try to access a bitmap like this:

X..................
X..................
X..................
X..................
X..................
X..................
X..................
X..................

and you want to get all the X's, the memory for that BITMAP looks like one continuous block of memory like this:

X..................X..................X..................X..................X..................X..................X..................X..................

The problem with this is that often there's not physically enough memory to hold all of that at once. So the computer will grab the first few chunks of it at a time, like this:

X..................X..................X..................
X..................X..................X..................
X..................X..................

The problem with this is that it will first take the first 'X', then grab the second 'X' and then the third. But when it needs the fourth 'X', it has to run out and grab the next "chunk" from memory. If, however, you wanted to access the BITMAP like this:

XXXXXXXX...........
...................
...................
...................
...................
...................
...................
...................

What the computer actually needs would look like this:

XXXXXXXX................................................................................................................................................

And since it doesn't all fit into memory at once, it would be like this:

XXXXXXXX.................................................
.........................................................
......................................

So it doesn't even need the extra two "chunks" of memory; it just grabs the first X, then the second, then third, then fourth... all contiguous.

That's what we meant by "near": it is located physically close-to the memory that you wanted to access. It will stay loaded into memory until the CPU decides that it wants a value from some other location in memory. This is a good thing because if you want to blit() a BITMAP to screen, it asks to grab the first "chunk", then when that chunk is done, it grabs the next, then the next...

type568 said:

(L2 cache)

I'm totally confused and can't understand why. The only thing I can think of is that it's always being used up completely, and transferring data inside takes more time. Is that right?

This is correct. The way that it's being "used up completely" is what I just explained above: it grabs an entire chunk of memory and "fills up" the L2 cache with it. That way, if you need to access the memory immediately before or after the "piece" you requested, it is there already, ready to be used for the next CPU call.

type568 said:

I don't understand how malloc can affect performance, or what the difference is between 'malloc'ing space manually and using a standard routine.

The std::vector arrays have been implemented in a smart way: each time they run out of room while you're adding stuff to the end of them, they double the amount of memory they reserve. This is a speed advantage, because instead of jumping out to memory and requesting another single byte (or more) of memory for EACH object you want to tack on the back of the array (and sometimes you may have a LOT that you want to push onto them, one after another, like a bunch of monsters that you have to allocate memory and BITMAPs for before they're displayed on-screen), it simply tells the computer that you want "a lot of memory" all at once, and then uses that memory until it needs more.

It takes less time to simply add a value or two into an array than it does to actually request the RAM space somewhere, wait for a response (or an out-of-memory error, etc.), and then fill it with your new data. What malloc() does is precisely that: it requests space in memory somewhere. Those arrays/vectors do the same thing, but std::vector usually does it in as optimized a way as possible so you don't have to do malloc()s nearly as often.
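Roughly like this, for example (Monster is just a made-up type here; the exact growth factor is up to the std::vector implementation, but it is geometric):

#include <vector>

struct Monster { int hp; };

int main()
{
    std::vector<Monster> monsters;
    monsters.reserve(1024);            // optional: one big request up front
    for (int i = 0; i < 1000; ++i) {
        Monster m = { 100 };
        monsters.push_back(m);         // no allocation per element; when capacity
    }                                  // runs out, the vector grows geometrically
    return 0;
}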

Sorry this was long-winded again. I hope it helped answer your questions, though!

type568
Member #8,381
March 2007

Thank you.

Prefetching is a great word.

And it looks like this is going to be the last question.

I'm asking for AB, having:

...........AB,,,,,,,,,,,

Will I get ....AB || ..AB,, || AB,,,, cached?

Also, if I'm going to do "linear" requests from:

A-B-C-D-E-F-G-H-J-K-L-M-N-O

this way: A, B, C... will it understand what I want from it, and can data end up being cached "on the fly"? (i.e. while I'm asking for D, it's already caching K)

Are there other ways to predict upcoming requests? (I'm sure there are..)

Or is it some big topic, with lots of AMD & Intel scientists thinking about how to optimize it?

No such thing as last question. :(

Edit:

Quote:

You can try searching for spacial locality or cache optimization or CPU cache tutorial or somesuch.

http://www.mgnet.org/mgnet/Conferences/CopperMtn01/Tutorials/ruede/img17.htm

Edit1: grammar.

orz
Member #565
August 2000

Quote:

And it looks like this is going to be the last question.

I'm asking for AB, having:

...........AB,,,,,,,,,,,

Will I get ....AB || ..AB,, || AB,,,, cached?

When you read address X into memory, the lowest cached address is
(X & ~63), or equivalently (X - (X % 64)). The highest address cached is ((X & ~63) + 63), or equivalently (X - (X%64) + 63). That is, a 64 byte block starting the the lowest multiple of 64 such that the block will include the address you read. That's assuming you're using an Athlon/XP/64/X2 or Core/2 or P4. For some older processors (P3s, P2s, etc), replace all the 64s and 63s with 32s and 31s. This value is called the "cache line size", and on non-x86s is usually between 16 and 64 bytes.

Quote:

Also, if I'm going to do "linear" requests from:

A-B-C-D-E-F-G-H-J-K-L-M-N-O

this way: A, B, C... will it understand what I want from it, and can data end up being cached "on the fly"? (i.e. while I'm asking for D, it's already caching K)

Are there other ways to predict upcoming requests? (I'm sure there are..)

Or is it some big topic, with lots of AMD & Intel scientists thinking about how to optimize it?

No such thing as last question

If you do linear requests, recent CPUs will generally attempt to automatically prefetch the next data for you. There are prefetch hint opcodes too, with which you can explicitly tell the CPU to load some address into cache. I believe that some CPUs will prefetch if you're doing linear accesses backwards (going down in memory), or if you're alternating between two different linear walks, or that kind of thing, but don't expect them to be smart. Prefetching was only introduced into mainstream CPUs in the last 4 years or so.
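As an aside, if you ever want to use those hint opcodes yourself, GCC exposes them as __builtin_prefetch() (MSVC/ICC have the SSE _mm_prefetch() intrinsic). A rough sketch only; the prefetch distance of 16 elements is a guess that would need tuning:

long sum_with_prefetch(const int *data, int n)
{
    long sum = 0;
    int i;
    for (i = 0; i < n; i++) {
#ifdef __GNUC__
        if (i + 16 < n)
            __builtin_prefetch(&data[i + 16]);  /* start pulling data[i+16] into cache */
#endif
        sum += data[i];
    }
    return sum;
}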

type568
Member #8,381
March 2007

OK, thanks.

But what analyzes the data stream and actually performs the prefetching? Is it also the CPU, burning cycles on that?

HoHo
Member #4,534
April 2004

CPUs have special units that do that; no performance is lost while doing it.

__________
In theory, there is no difference between theory and practice. But, in practice, there is - Jan L.A. van de Snepscheut
MMORPG's...Many Men Online Role Playing Girls - Radagar
"Is Java REALLY slower? Does STL really bloat your exes? Find out with your friendly host, HoHo, and his benchmarking machine!" - Jakub Wasilewski

orz
Member #565
August 2000

Quote:

OK, thanks.

But what analyzes the data stream and actually performs the prefetching? Is it also the CPU, burning cycles on that?

It's not software, it's hardware: The CPU doesn't burn cycles on it, it burns transistors on it instead. It doesn't make the CPU slower, it makes the CPU more expensive to design & manufacture instead.

type568
Member #8,381
March 2007

And more expensive for us. Thanks.

What is the structure of a modern CPU?

TeamTerradactyl
Member #7,733
September 2006

type568 said:

What is the structure of a modern CPU?

When you go to the Intel or AMD websites and click on the "advanced specs" sections for each of the processors they have, there is usually some picture or diagram that shows this. It's all very technical and somewhat difficult to understand (it's made more for the benefit of hardware designers and compiler optimizers than "average programmers"), but if you want to take a look, I know they're out there...

Bob
Free Market Evangelist
September 2000

type568 said:

What is the structure of a modern CPU?

Chip-Architect has a great article on the AMD Athlon64.
Happy reading!

--
- Bob
[ -- All my signature links are 404 -- ]
