What I'm looking for is a book that explains hashing functions, the formula that takes a value and generates a key, or something like that. If the book explains modern STL and the use of iterators that would be better.
Add up all characters and mod it by the hash size?
Don't know any books, but wikipedia might be an okay place to start :
http://en.wikipedia.org/wiki/Hash_function
Add up all characters and mod it by the hash size?
Its a bit more complex than that. You want to have a hash function that gives an even distribution in all (or at least most) cases.
Heat hash. Crumble into pipe. Put pipe in mouth. Burn hash, whilst steadily sucking on pipe. When all hash is glowing red, stop breathing in and remove pipe from mouth. Hold breath. Breathe out hash smoke. Wait. Now write code.
Heat hash. Crumble into pipe. Put pipe in mouth. Burn hash, whilst steadily sucking on pipe. When all hash is glowing red, stop breathing in and remove pipe from mouth. Hold breath. Breathe out hash smoke. Wait. Now write code. 
No, no, no time for that.
Its a bit more complex than that.
Its very complex. Thats why I asked. I did find some books on the topic.
Don't know any books, but wikipedia might be an okay place to start :
Haven't looked there yet, wikipedia is always too narrow though in context to what they want to sell me.
Oh well, I'll look around more.
I don't think it's so complex that a book would be dedicated to it. It's more about understanding the (very simple) purpose and properties of a good hashing function, then coming up with one based on your data.
[edit]
What I meant to say is:
In computer chess and other computer games, transposition tables are used to speed up the search of the game tree. Transposition tables are primarily useful in perfect information games, meaning the entire state of the game is known to all players at all times.
Game playing programs work by analyzing millions of positions that could arise in the next few moves of the game. Typically, these programs employ strategies resembling depth-first search, which means that they do not keep track of all the positions analyzed so far. In many games, it is possible to reach a given position in more than one way. These are called transpositions.[1] In chess, for example, the sequence of moves 1. d4 Nf6 2. c4 g6 (see algebraic chess notation) has 4 possible transpositions, since either player may swap their move order. In general, after n moves, an upper limit on the possible transpositions is (n!)². Although many of these are illegal move sequences, it is still likely that the program will end up analyzing the same position several times.
To avoid this problem, transposition tables are used. Such a table is a hash table of each of the positions analyzed so far up to a certain depth. On encountering a new position, the program checks the table to see if the position has already been analyzed; this can be done quickly, in expected constant time. If so, the table contains the value that was previously assigned to this position; this value is used directly. If not, the value is computed and the new position is entered into the hash table. This is essentially memoization applied to the search function.
The problem is generating a hash key that is uniform over a given range. You want as few collisions as possible, and you don't want any bunching, or your algorithm gets a lot slower. And many popular hash functions have some rather easy to trigger worst cases.
Watch that for an overview of some guys that tested some known flaws against popular web frameworks/languages.
thanks for the help guys, its not that its complicated like BAF said but implementing the code is a little hard to do. And then there a lots of things to learn about.
@Thomas, interesting video but I can't sit through an hour and watch that.
The main disadvantage of the brute-force method is that, for many real-world problems, the number of natural candidates is prohibitively large. For instance, if we look for the divisors of a number as described above, the number of candidates tested will be the given number n. So if n has sixteen decimal digits, say, the search will require executing at least 1015 computer instructions, which will take several days on a typical PC. If n is a random 64-bit natural number, which has about 19 decimal digits on the average, the search will take about 10 years. This steep growth in the number of candidates, as the size of the data increases, occurs in all sorts of problems. For instance, if we are seeking a particular rearrangement of 10 letters, then we have 10! = 3,628,800 candidates to consider, which a typical PC can generate and test in less than one second. However, adding one more letter — which is only a 10% increase in the data size — will multiply the number of candidates by 11 — a 1000% increase. For 20 letters, the number of candidates is 20!, which is about 2.4×1018 or 2.4 quintillion; and the search will take about 10,000 years. This unwelcome phenomenon is commonly called the combinatorial explosion.
Math doesn't help you when there are known problems with a given hash method. All you have to do is tweak the input for it, and you can stack up as many things as you want in as few buckets as possible.
if n has sixteen decimal digits, say, the search will require executing at least 1015 computer instructions, which will take several days on a typical PC.
Hahahaha! Maybe you meant 10 to the 15th?
From the archives:
http://www.allegro.cc/forums/thread/512313
Hahahaha! Maybe you meant 10 to the 15th?
I think you are missing the larger joke.
@Thomas, interesting video but I can't sit through an hour and watch that.
Neither can I, but if it is what I think it is, here's how the attack goes:
find a target that uses hashmaps to store query string parameters (most web stacks qualify)
find a bunch of keys with identical hashes, and use them as query string parameter names
construct query strings using as many of these identical-hash keys as you can fit in the URL
send a bunch of these requests
The crux is that a hashmap performs well as long as the number of collisions is low; but when all the keys collide, it is reduced to a linked list, and the fast lookup hashmaps are known for is reduced to a devastating O(n). A relatively small number of requests can bring a server to its knees this way, without taking up much bandwidth, which makes this a very efficient DoS attack. Typical defenses (bandwidth limiting, dedicated administrator network connections, etc.) are quite ineffective against this, because unlike most DoS attacks, this one targets the CPU and RAM access rather than choking the network connection.
What they showed in the video relied on POST request, so the limit depends on the stack (PHP defaults to 8MB of data)
The main disadvantage of the brute-force method is that, for many real-world problems, the number of natural candidates is prohibitively large. For instance, if we look for the divisors of a number as described above, the number of candidates tested will be the given number n. So if n has sixteen decimal digits, say, the search will require executing at least 1015 computer instructions, which will take several days on a typical PC. If n is a random 64-bit natural number, which has about 19 decimal digits on the average, the search will take about 10 years. This steep growth in the number of candidates, as the size of the data increases, occurs in all sorts of problems. For instance, if we are seeking a particular rearrangement of 10 letters, then we have 10! = 3,628,800 candidates to consider, which a typical PC can generate and test in less than one second. However, adding one more letter — which is only a 10% increase in the data size — will multiply the number of candidates by 11 — a 1000% increase. For 20 letters, the number of candidates is 20!, which is about 2.4×1018 or 2.4 quintillion; and the search will take about 10,000 years. This unwelcome phenomenon is commonly called the combinatorial explosion.
I guess BAF, did you get this from a book? Sounds like it... come on... spill it!
The crux is that a hashmap performs well as long as the number of collisions is low; but when all the keys collide, it is reduced to a linked list, and the fast lookup hashmaps are known for is reduced to a devastating O(n). A relatively small number of requests can bring a server to its knees this way, without taking up much bandwidth, which makes this a very efficient DoS attack. Typical defenses (bandwidth limiting, dedicated administrator network connections, etc.) are quite ineffective against this, because unlike most DoS attacks, this one targets the CPU and RAM access rather than choking the network connection.
What does this have to do with my thread
You can avoid degenerate hash table cases by simply extending the size of the table dynamically when it gets too crowded.
Linky to my hashtable implementation. It seems to work fine with a few million keys. I haven't tried billions due to lack of memory.
It averages about 2x the performance of STL's hash_map<> on MSVC 7 or GCC 2.x.
The hashing function isn't that great; feel free to replace it with something better.
Thanks Bob, I saved it in a folder called BobsHash.
What they showed in the video relied on POST request, so the limit depends on the stack (PHP defaults to 8MB of data)
Most of the requests they sent were much smaller. And often sent them a bit at a time. Some of the requests could be sent at 17kbps or slower and still cause a machine to use up incredible resources.
Thanks Bob, I saved it in a folder called BobsHash.
Better hope the police don't see that
An attacker with a Gigabit connection can keep about 10.000 i7 cores busy.
That's regarding PHP. ASP.NET was 30K. Java was 100K. Some were even worse than that... The video is boring, so read the PDF.
A workaround that some have implemented is limiting the number of input variables.
Hash tables are nice and all, but don't forget about balanced binary search trees.
As for hash functions, I liked MurmurHash (2, haven't tried 3).
you can use a binary search tree instead of linked lists for the buckets.
That's pretty clever, exploiting the fact that inserting into hash tables tends to be O(n) or O(n log(n)) but is routinely confused as being an O(1) operation.
My hash code posted above is exposed to this bug and will fail in the same way. I can't think of a good way of addressing that problem without significantly hurting its performance.
In terms of preventing remote attacks, all you need to do is have a hash function that doesn't always produce the same results. Using a random seed is sufficient for that (depending on the algorithm, of course).
Or you could terminate all requests that have more than N collisions. (I think that's more friendly than just assuming all requests with more than N inputs are bad, although it is harder to implement when you are sharing hash code across the entire application.)
you can use a binary search tree instead of linked lists for the buckets.
Buckets? Whats that.
That's pretty clever, exploiting the fact that inserting into hash tables tends to be O(n) or O(n log(n)) but is routinely confused as being an O(1) operation.My hash code posted above is exposed to this bug and will fail in the same way. I can't think of a good way of addressing that problem without significantly hurting it's performance.
It maybe O(1) operation but if a hashing function takes 30ms (I'm guessing, no benchmarks here, well actually I did some with boost unordered_map, if thats a hashing function and it was kinda slow) to do a million times on a 2 ghz processor then its still kinda slow, but since everything else seems slower I guess its better but you have to realize that the one operation is still very cumbersome.
Buckets? Whats that.
You have to group all collided elements together in some way. Most people will use a linear list for collisions under the assumption that collisions are rare.
And collisions are rare ... except when somebody purposely designs keys to collide.
Buckets? Whats that.
GOOGLE?????!!!! jk.
A hash table is typically an array with N buckets (positions/slots/entries) in it, often a power of two. Each bucket contains room for one or more items. The hash value happens to be the index into that array. Why is it called a bucket? I'm not sure.
Ok I guess the question should finally be what type of tree structure should I use and what type of hashing function is the fastest for it? Thomas pointed out a well balanced one, but is that something that's a result of a specific type of tree or a b-tree with some special properties would hold as well. Are all trees b-trees btw?
A bucket is typically a watertight, vertical cylinder or truncated cone, with an open top and a flat bottom, usually attached to a semicircular carrying handle called the bail. In common usage, the terms "bucket" and "pail" are often used interchangeably. As a shipping container, a pail is a package with a sealed top or lid.
There are many types of buckets;
a water bucket is used to carry water
Household and garden uses are often for carrying liquids and granular products
Elaborate ceremonial or ritual buckets in bronze, ivory or other materials are fond in several ancient or medieval cultures and are known by the Latin for bucket, situla.
Large scoops or buckets are attached to loader and telehandler for agricultural and earthmoving purposes.
A lunch box is often called a lunch pail
Buckets can be reused as seats, tool caddies, hydroponic gardens, chamberpots, "street" drums, livestock feeders.
Ok I guess the question should finally be what type of tree structure should I use and what type of hashing function is the fastest for it? Thomas pointed out a well balanced one, but is that something that's a result of a specific type of tree or a b-tree with some special properties would hold as well. Are all trees b-trees btw?
Most often a hash table uses linked lists in each "bucket". A tree can be used, but it increases the complexity quite a bit, and is only useful for a hash table you expect to get very large, with a ton of collisions.
And no, not all trees are binary trees. There are many types of trees. Ask Baf, I'm sure he call tell you all about Trees, like he can buckets.
Most often a hash table uses linked lists in each "bucket".
Most of the data structure books I have ( actually all) have a chapter on hashing and not once do they mention buckets, but after searching google I found some examples.
Ask Baf, I'm sure he call tell you all about Trees, like he can buckets.
Yes, his coding style must consist of typing english.
only useful for a hash table you expect to get very large, with a ton of collisions.
If you're expecting a ton of collisions, you're doing it wrong.
Chess programs use hash tables to store search results, with the hash key generated from the board position. Typically, the table is stored as a plain array of 2^N elements and the index is calculated from the lowest N bits of the hash key.
The number of positions searched is very much larger than the number of elements in the table, so you need multiple buckets per entry (in practice, one looks at the next k entries, say 4 (you really would like to stay within the same cache line)) and some replacement strategy to get rid of older entries.
Not sure if that's at all helpful for what you want to do...
Not sure if that's at all helpful for what you want to do...
Anything is helpful although I have no use for chess algorithms. If there are books on this subject please mention those, maybe someone else will run across this topic or maybe I'll need it one day.
If you're expecting a ton of collisions, you're doing it wrong.
I suppose. But I can be lazy and don't want to implement a dynamically resizing hash table, and still want tens/hundreds of millions of entries. The resize operation has a rather large amount of overhead on large tables, you have to iterate over every single item and rehash it into the new table. kinda sucky.
So does anyone know if boost unordered_map is tree based or bucket based?
And is there one example of an stl container that uses buckets?
Boost's unordered_map uses buckets. See here.
std::unordered_map<> (new in C++11) also uses buckets (it's nearly identical to the boost implementation). Before C++11, no other standard STL container uses buckets.
Boost's unordered_map uses buckets.
I'm familiar with iterating through a b-tree but I'm wondering what keyword would I google to find examples of iterating through buckets?
You don't iterate through buckets, you iterate through elements. boost::unordered_map<> supports forward iterations like any other STL container.
No specified support for backwards iterations though (it may or may not work on your implementation).
but implementing the code is a little hard to do
Not really. Java uses a ridiculously simple hash function for hashing strings and gets passable results, considering that almost every Java application will be using it somewhere.
What specifically are you trying to hash that you have having trouble with?
What specifically are you trying to hash that you have having trouble with?
I want to make a clone of a boost unordered_map. I'm just doing it to learn the fundamentals of hash tables but I really liked that one so I picked that to learn from.
I did find chapter 23 in Professional C++ full of useful information.