To support a game map, I'm creating a struct which contains an array of objects, its size, and its capacity.
(This is just the code I'm using, if you need to see it.)
struct s_bound {
    short left, right, top, bottom;
    short btile;
};

struct s_bound_array {
    unsigned int size;        // Number of items actually used
    unsigned int capacity;    // Number of (pre)allocated items
    struct s_bound *array;    // Array of s_bound objects itself
};
To add another s_bound object to the array, I use the following function:
Works well enough; it doubles the capacity each time a new object is added and there isn't enough memory allocated to hold it, much like how C++'s std::vector is typically implemented.
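For reference, a doubling append along the lines described would look roughly like this. This is a minimal sketch rather than the exact original function; the realloc-based growth and the error handling here are assumptions, and it builds on the struct definitions above:

#include <stdlib.h>

/* Doubling append: grows the capacity by 2x whenever the array is full. */
static int add_bound_item(struct s_bound_array *a, struct s_bound item)
{
    if (a->size == a->capacity)
    {
        unsigned int new_capacity = (a->capacity == 0) ? 1 : a->capacity * 2;
        struct s_bound *p = realloc(a->array, new_capacity * sizeof *p);
        if (p == NULL)
            return 0;            /* out of memory; existing data is left untouched */
        a->array = p;
        a->capacity = new_capacity;
    }
    a->array[a->size++] = item;
    return 1;                    /* success */
}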
I'm going to add a new function to preallocate the memory. But my question is, should I allocate just the amount specified by the user (in this case, the map itself indicates the number of objects it needs to read in from a file), or should I bump up that amount to the next-power-of-two for data alignment?
Say that the map has stored 129 objects. Calling the add_bound_item() 129 times will need to reallocate the memory about 8 times, and result in a total capacity of 256. If I preallocate the memory at the beginning, though, I only allocate memory once.
Now, do I allocate 129 or 256 objects (the next-power-of-two)? I have to also consider that the game's script (Lua files) may add/remove any of these bounds as well: if I have capacity set to exactly 129, and add one more to it, the add_bound_item() function above states that the capacity should double, giving 258 (not 256). Not a power-of-two boundary.
Suggestions?
Look at std::vector. I bet they've tackled the same problem and found the best heuristics.
Are you worried about using too much memory, or are you worried about performance side effects (caching, paging, etc.)? I'd say allocate by multiples of 4096 and call it a day.
[EDIT]
That's if you allocated the start of the block on a 4096 byte boundary.
The implementations of vector I know of don't seem to tackle the "problem" at all (reserve basically reserves as much as you ask for).
How often do you expect to run into the worst case scenario? For example, if you reserve 129 and add one later, it means your initial reservation was based on a guess. Why not guess a power of two instead (128)?
Alright, I would've thought most STL implementations would act a bit more intelligently, but I stand corrected. The principle stays the same, though the implementation changes: look at Java's ArrayList.
Whether or not the size of the array is a power of two doesn't have any effect on performance.
Data alignment impacts performance on a per-element basis, so in your case each element of the array should be aligned.
So what's important is that the size of the s_bound struct is a multiple of 4 bytes for a 32 bit app, or 8 bytes for a 64 bit app (something that the compiler may already be doing for you, adding padding to the struct).
What Oscar said.
If you are in a situation where you know exactly how many elements you are going to store, just allocate enough space to accommodate them, and nothing more.
For dynamic allocation, as provided by std::vector, the most efficient strategy is to double the size whenever the capacity is exceeded. You still don't have to use a power-of-two size, and in fact, a std::vector that is initialized to have a capacity of, say, 3 elements, will probably resize to 6, 12, 24, 48, and so on.
To take care of alignment, you (or the compiler) should make sure individual elements have a size that is a multiple of the platform's native register size, i.e. 4 bytes for 32-bit systems and 8 bytes for 64-bit ones. Chances are your compiler does this automatically for you, depending on compiler options and #pragma directives.
For dynamic allocation, as provided by std::vector, the most efficient strategy is to double the size whenever the capacity is exceeded.
If I'm not mistaken, VC++ uses a factor closer to 1.7. I think the rationale was that this way it is easier to reuse memory blocks released by the vector.
Allocating to the next power of two, or doubling the size, does save some time and new/realloc calls. But the actual size doesn't really matter.
I was thinking that, within the add_bound_item() function, I could bump the capacity up to the next power of two.
So if they specify a value of 129 to preallocate, it will give exactly 129. But as soon as they want to add another item, it will bump up to 256 (and not 258). Of course, this savings isn't as obvious if they have 126 and add one more: they only get 128 instead of 252.
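A rough sketch of that rounding step (round_up_pow2 is an assumed helper name, not something from the existing code):

#include <limits.h>

/* Assumed helper: the smallest power of two that is >= n. */
static unsigned int round_up_pow2(unsigned int n)
{
    unsigned int p = 1;
    while (p < n && p <= UINT_MAX / 2)   /* the second test just stops before overflow */
        p *= 2;
    return p;
}

/* In the growth path of add_bound_item(), instead of plain doubling:   */
/*     new_capacity = round_up_pow2(a->capacity + 1);                   */
/* 129 items full -> capacity becomes 256; 126 full -> becomes 128.     */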
In order to align the s_bound struct, do I just do that with a preprocessor macro? The 5 shorts now equal 10 bytes: I would need 2 more bytes (1 more short) to give me a 4-byte alignment for 32-bit systems, and 6 more bytes (3 more shorts) for an 8-byte alignment for 64-bit systems. Is that correct?
It's generally best if you manually sort the largest items to smallest items in a struct. If you put the char stuff[3] first, then there'll be padding before the next member.
Arthur: explain?
struct s_bound { char alignment[3]; short left, right, top, bottom; short btile; };
or
struct s_bound { short left, right, top, bottom; short btile; char alignment[3]; };
Well, I'll be damned, it used to make a difference! I'll try again once I sober up.
padding and alignment can differ on platform and compiler, but in general each short will be aligned to 4 byte boundaries (might be 8 on 64bit native systems), and your char will also start at a 4-8 byte boundary. In the first version you get space between the three chars and the next short; in the second one you don't, the struct just ends.
Arthur: Rearrange that a bit to see:
Result:
Size of s_bound 1 is 20
Size of s_bound 2 is 24
When I look at Wikipedia and do a test run on my Mac (32-bit, I believe), this is what I see:
#include <stdio.h>

struct s_bound {
    short left, right, top, bottom;
    short btile;
} s_bound1;

int main(void)
{
    printf("Size of s_bound 1 is %d\n", (int)sizeof(s_bound1));
    return 0;
}
Output:
Size of s_bound 1 is 10
But will this be the case for 64-bit platforms, or even DIFFERENT 32-bit platforms?
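One way to check on each target, extending the test above, is to also print the member offsets; offsetof is standard C, and on common x86/x86-64 ABIs you'd expect 10 and 8 here, but that's exactly what the test verifies:

#include <stdio.h>
#include <stddef.h>

struct s_bound { short left, right, top, bottom; short btile; };

int main(void)
{
    /* Both the total size and the member offsets are decided by the target ABI,
       so printing them on each platform shows exactly what the compiler did. */
    printf("sizeof(struct s_bound)          = %d\n", (int)sizeof(struct s_bound));
    printf("offsetof(struct s_bound, btile) = %d\n", (int)offsetof(struct s_bound, btile));
    return 0;
}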
EDIT: Did you rearrange that after you posted? I thought your original had identical structures..?
I'm running on linux 64 for the second week now (utter noob) and I don't feel like rebooting to linux 32.
[EDIT]
And I've drunk 9 beers now, so I'll log off till tomorrow.
Did you rearrange that after you posted? I thought your original had identical structures..?
He did; I saw the structures identical and didn't understand what Arthur's point was.
He did; I saw the structures identical and didn't understand what Arthur's point was.
Yeah I did, and saw it, and recompiled with the new structs and it still said 24 on linux 64
Exercise caution in your daily affairs.
pepsi@fractal:~$ align
Size of s_bound 1 is 24
Size of s_bound 2 is 24
pepsi@fractal:~$
Now I'll log off again and lurk
padding and alignment can differ on platform and compiler, but in general each short will be aligned to 4 byte boundaries (might be 8 on 64bit native systems), and your char will also start at a 4-8 byte boundary.
Not really. Shorts are aligned to 2-byte boundaries, and chars to 1-byte boundaries (so not really aligned).
So
struct s_bound { char align[5]; short shorty; long longy; double dubble; } s_bound1;
is going to be aligned like this:
      byte 1     byte 2     byte 3     byte 4
00   align[0] | align[1] | align[2] | align[3]
04   align[4] | padding  | ----- shorty -----
08   ----------------- longy ----------------
12   ----------------- dubble ---------------
16   ----------------- dubble ---------------

(that's 20 bytes)
while
struct s_bound_two { short shorty; long longy; double dubble; char align[5]; } s_bound2;
will be aligned this way:
      byte 1     byte 2     byte 3     byte 4
00   ----- shorty ----- | ----- padding -----
04   ----------------- longy ----------------
08   ----------------- dubble ---------------
12   ----------------- dubble ---------------
16   align[0] | align[1] | align[2] | align[3]
20   align[4] | --------- padding -----------

(that's 24 bytes)
This in a 32 bit system.
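For anyone who wants to verify those layouts locally, a quick test along these lines works. The struct names s_bound_one/s_bound_two are renamed here so both fit in one file, and the printed sizes depend on the compiler and ABI (for example, long is 8 bytes on 64-bit Linux, so the diagrams above assume a 32-bit target):

#include <stdio.h>

struct s_bound_one { char align[5]; short shorty; long longy; double dubble; };
struct s_bound_two { short shorty; long longy; double dubble; char align[5]; };

int main(void)
{
    /* Member order changes where the compiler has to insert padding,
       which is what the byte diagrams above are illustrating. */
    printf("Size of s_bound 1 is %d\n", (int)sizeof(struct s_bound_one));
    printf("Size of s_bound 2 is %d\n", (int)sizeof(struct s_bound_two));
    return 0;
}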
[edit]
Arthur: can you try this one:
struct s_bound { char align[5]; short shorty; double dubble; long longy; } s_bound1;
Doubles may be 8-byte aligned, so in the first case you'd get 24 bytes instead of 20.
Nope, still says 24
[EDIT]
reboot to 32 bit, brb
32 bit gcc says sizeof is all 20
Not really. Shorts are aligned to 2-byte boundaries, and chars to 1-byte boundaries (so not really aligned).
Maybe in an array. In structs, things are generally aligned to the native "word" size. So two single chars would be aligned and padded to separate "words". x86 does work with unaligned accesses, but last I heard, it is slower than aligned accesses, so you generally want things to align, even if that can waste 3 or 7 bytes per char.
The thing with the unaligned accesses is that even if some variable fits into a register cleanly, the memory bus still has to do multiple accesses if the variable in memory spans the native size.
The thing with the unaligned accesses is that even if some variable fits into a register cleanly, the memory bus still has to do multiple accesses if the variable in memory spans the native size.
Not the only problem. Accessing even a char unaligned may take some extra operations.
Please explain what an unaligned char is.
Thomas: what you don't want is having a variable spanning over several "lines" of memory, because that'd mean that for reading 1 variable you'd need 2 memory accesses.
This is what you don't want:
@    00   01   02   03   04   05   06   07
               | ------ looong ------ |
While accessing looong on x86 is possible (other CPUs don't allow it), it requires 2 accesses (one to retrieve the first word, 00-03, another to retrieve the second word, 04-07, and then joining the correct bytes). So there's a big performance hit, which is why we align.
If you have this:
@    00   01   02   03
     | sshort  | sshort  |
is OK, and nice, since you can read both shorts with just one memory access (after loading you'll need to separate them into 2 different registers, but that's a very fast operation compared to making 2 memory accesses if you had aligned the shorts to word boundaries).
So, google a bit about this. You'll find it's as I said.
http://en.wikipedia.org/wiki/Data_structure_alignment#Typical_alignment_of_C_structs_on_x86
[EDIT]
I didn't see Oscar's post before I posted this; what he means is:
The memory subsystem in the computer always accesses a certain fixed size at a time;
assume a 32-bit bus.
Access a char (byte)
Nxxx xxxx
xNxx xxxx   // All fit within a 32 bit access
xxNx xxxx
xxxN xxxx
xxxx Nxxx
...

Access a 16 bit int
NNxx xxxx
xNNx xxxx
xxxN Nxxx   <- crosses 32 bit boundary
xxxx NNxx
...

Same for a 32 bit var
NNNN xxxx
xNNN Nxxx   <- crosses
.
.
.
xxxN NNNx
xxxx NNNN   <- aligned again
Please explain what an unaligned char is.
Any access that doesn't start at an aligned address; just because the amount of data is smaller doesn't make it OK (it's just not as bad).
Thomas: what you don't want is having a variable spanning over several "lines" of memory, because that'd mean that for reading 1 variable you'd need 2 memory accesses.
I'm not disagreeing. Just that a single char read, not at an aligned boundary isn't as fast as one that is aligned. Some CPUs can't even access data that's unaligned without doing a larger read that is aligned, and then throwing away the bits they don't want. Just because the IA32 and x86_64 arches can do unaligned accesses doesn't mean it's as efficient as aligned accesses.
I'm not disagreeing. Just that a single char read, not at an aligned boundary isn't as fast as one that is aligned.
Care to post your source? Because that's not true. Memory must be accessed word by word (that's called a line); it's not possible to read just one byte (well, unless you're working with an 8-bit CPU where a word is 1 byte). It doesn't matter where in the word the byte you want is, since the CPU must retrieve the entire word no matter what.
I think you haven't even bothered to read that Wikipedia article, because it explains how data is aligned pretty clearly :/
Maybe there's a situation where reading a char into AL register is faster if it's on the low byte (little endian)?
Care to post your source?
Just something I've learned reading about how CPUs work.
Memory must be accessed word by word (that's called a line)
Uh, a "line" is generally larger, and depends on the CPU architecture, at least if you're talking about the cache line. My Phenom II 810 has a L3 line size of 64 bytes (not bits, 64 bits is the word size).
I think you haven't even bothered to read that Wikipedia article, because it explains how data is aligned pretty clearly :/
Yeah, I didn't read it; I skimmed it. But I can see how it can help to group and pack individual chars, so it fetches all 4 at the same time, saving some operations. Of course, that's only if you group all your chars in the struct into runs of 4-8 in a row. Space them out a bunch, and it's pretty inefficient (on x86/x86-64).
And as I mentioned, some CPUs CAN'T access unaligned data at all. All you get is an exception, so if you attempt to access one of those packed chars, you have to fetch an entire word manually and throw away the bits you don't want, which is what x86 does for you anyhow.
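As a side note on CPUs that trap: the usual portable way to read a value that might sit at an unaligned address is to copy it out with memcpy and let the compiler pick the best instruction sequence for the target. This is a generic sketch, not something from this thread:

#include <stdint.h>
#include <string.h>

/* Read a 32-bit value from an address that may not be 4-byte aligned.
   memcpy copies byte-wise as far as the language is concerned, so this is
   safe even on CPUs that fault on misaligned loads; on x86 the compiler
   typically turns it into a single load anyway. */
static uint32_t read_u32_unaligned(const void *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}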
Not to rerail this topic or anything, but what should I do for this code?
struct s_bound {
    short left, right, top, bottom, btile;   // 5 shorts, no change
};
...
struct s_bound {
    short left, right, top, bottom, btile;
    short unused;        // 1 short to give 6-short (or 3-int) alignment
};
...
struct s_bound {
    short left, right, top, bottom, btile;
    short unused[3];     // 3 shorts to give 8-short (or 4-int) alignment
};
I don't know whether this code will be run on 32- or 64-bit architecture. Or both. So would it be better to waste space and try to align the data on a 64-bit model, or pad it for a 32-bit model?
Or write a preprocessor macro which fills in the appropriate amount of padding depending on the architecture?
#ifdef ARCHITECTURE_64_BIT
# define PADDING(bytes_used) char unused[8 - ((bytes_used) % 8)]
#else
# define PADDING(bytes_used) char unused[4 - ((bytes_used) % 4)]
#endif

struct s_bound {
    short left, right, top, bottom, btile;
    PADDING(10);   // 5 shorts = 10 bytes used; pads to 12 bytes (3 ints) on 32-bit, 16 bytes (8 shorts) on 64-bit
};
I'd say all of those would be packed tightly as they are, since they're not a hodgepodge of sizes.
Not to rerail this topic or anything, but what should I do for this code?
I wouldn't worry about (manual) padding at all. This makes sense only in very tight environments, and there you know the target architecture exactly. Chances are high you get it right for one setting and make it worse for another, so just let the tool-chain do its job and let it decide that for you.
Thomas: when did I mention anything about cache? I think what you got wrong is the definition of data alignment. For starters, by definition, a byte is always aligned; there's no such thing as a misaligned byte. If you don't want to trust me or that Wikipedia article, I'll just copy here a paragraph of the IA32 Intel Architecture Software Developer Manual:
(note: it uses Intel asm terminology for data types, so a word means 2 bytes, a doubleword 4 bytes, and a quadword 8 bytes)
Words, doublewords, and quadwords do not need to be aligned in memory on natural boundaries. (The natural boundaries for words, double words, and quadwords are even-numbered addresses, addresses evenly divisible by four, and addresses evenly divisible by eight, respectively.) However, to improve performance of programs, data structures (especially stacks) should be aligned on natural boundaries whenever possible.
See how it doesn't mention bytes anywhere, and words (shorts in our discussion) should just be aligned to even boundaries (not necessarily to addresses divisible by 4, as you say).
PS: you should first do your research properly. I hate it when someone argues about something he doesn't really know and doesn't even seem interested in learning the truth. I've studied this at university, and I've worked with a SPARC CPU (yes, one of those that don't allow misaligned memory accesses).
See how it doesn't mention bytes anywhere, and words (shorts in our discussion) should just be aligned to even boundaries (not necessarily to addresses divisible by 4, as you say).
Again, as I mentioned, it works fine. But that doesn't mean it's as efficient as aligning to a native word (i.e. 4 or 8 bytes). Also, as I mentioned, some CPUs just don't support accessing data not aligned to a native word size.
when did I mention anything about cache?
It's the only "line size" I know of in regard to CPUs.
http://en.wikipedia.org/wiki/Conventional_PCI#Conventional_hardware_specifications
(old skool PCI)
says PCI bus width is 32 bits.
http://en.wikipedia.org/wiki/PCI_Express
(since 2004)
has a sidebar saying "width in bits: 1 - 32"
but also says data transmission is serial 
I've read elsewhere that past Pentium I (?) cache line size is always 64 bytes. It doesn't "match" the motherboard bus size, but so what? Neither does HDD sector or cluster size.
OTOH, my mobo is 64 bit, which implies at least 64 bit bus (I hope), but I was under the impression it's not PCI express (not mentioned in manual) and there's an AGP video slot.
Why would it make it less efficient? Maybe you could provide some source? Or are you just guessing? That text clearly states that, for example in the case of shorts, they only need to be aligned to even addresses, and you seem to skip that and still say they should be aligned to a word size (an address divisible by 4). Anyway, I looked at the Optimization Reference Manual and found a listing of best alignments, and this one does explicitly state that bytes can be at any address without performance penalties.
some CPUs just don't support accessing data not aligned to a native word size.
Give me an example of this. I don't know of any CPU that has such alignment restrictions. The restriction some CPUs have is that data must be aligned to its size: shorts must be aligned to even addresses, longs to addresses that are a multiple of 4, doubles or long longs to addresses that are a multiple of 8, and so on.
From Sun Sparc documentation (64 bit CPU, so a word is 64 bits):
All quantities must be aligned on their natural boundaries, using standard C data types:
short integers are aligned on 16-bit boundaries.
int integers are aligned on 32-bit boundaries.
long integers are aligned on 64-bit boundaries for SPARC systems. For information on data models, see Appendix C, Making a Device Driver 64-Bit Ready.
long long integers are aligned on 64-bit boundaries.
As you can see shorts and ints are not required to be aligned to 64 bits, just 16 and 32 respectively.
For example, a structure containing only characters has no alignment restrictions
This last quote is to clearly show that bytes don't have any alignment restriction.
And I'm finished here if you're not able to provide sources for your claims, since this is getting very stupid.
You are right, it is getting pretty stupid. I can't find any clear references; mind you, I'm not trying very hard. But I still believe that even accessing a single byte that isn't at an aligned location will be slower, at least on some CPUs. For a long time Intel, at least, has been implementing some older operations solely in microcode stored in some hidden SRAM on the die. Though I'm not convinced opcodes that take byte arguments would be a very wise candidate for microcode (not that that's stopped Intel and other CPU companies from doing unwise things).
I am wondering whether or not you're taking this a little too personally, though.
It is possible I'm mixing up the full size variable alignment restrictions. I do know that many ARM architectures can not access an unaligned 16 or 32 bit variable. They just throw exceptions. And I may have just extended that to any unaligned data. If they can access unaligned bytes, it wouldn't take much more to extend that to unaligned shorts imo.
I do know that many ARM architectures can not access an unaligned 16 or 32 bit variable. They just throw exceptions.
Even a 486 will do that if you set the correct (protected) bit in the flags register.
This is the best I could find: http://webster.cs.ucr.edu/AoA/Windows/HTML/SystemOrganizationa2.html
Skip to the middle of the page, where it starts talking about 16-bit data buses; later it also explains 32-bit.
Thanks all!