|
This thread is locked; no one can reply to it. |
|
Any good books on hashing using C/C++, or even Java |
verthex
Member #11,340
September 2009
|
What I'm looking for is a book that explains hash functions: the formulas that take a value and generate a key, that sort of thing. If the book also covers the modern STL and the use of iterators, even better.
|
Steve Terry
Member #1,989
March 2002
|
Add up all the characters and mod the sum by the hash table size? |
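Steve's suggestion, written out as a minimal sketch (illustrative only, not code from any of the posts):

```cpp
#include <cstddef>
#include <string>

// Steve's suggestion as literal code: sum the bytes of the key,
// then reduce modulo the table size. Simple, but any two strings
// with the same multiset of characters (anagrams like "stop" and
// "pots") collide every time.
std::size_t additive_hash(const std::string& key, std::size_t table_size) {
    std::size_t sum = 0;
    for (unsigned char c : key)
        sum += c;
    return sum % table_size;
}
```

The weakness is built in: the sum ignores character order, so similar keys bunch into the same buckets.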
Edgar Reynaldo
Major Reynaldo
May 2007
|
Don't know any books, but Wikipedia might be an okay place to start. |
Thomas Fjellstrom
Member #476
June 2000
|
Steve Terry said: Add up all characters and mod it by the hash size? It's a bit more complex than that. You want a hash function that gives an even distribution in all (or at least most) cases. |
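For contrast with the byte-sum idea, here is a minimal sketch of a hash with real mixing: FNV-1a, whose constants are public and widely documented (illustrative, not code from the thread):

```cpp
#include <cstdint>
#include <string>

// FNV-1a: each byte is XORed into the state, which is then multiplied
// by a large odd prime. The multiply spreads every input bit across
// the whole word, so order matters and anagrams like "stop"/"pots"
// land in different buckets, unlike a plain byte sum.
std::uint64_t fnv1a(const std::string& key) {
    std::uint64_t h = 14695981039346656037ull;  // FNV-1a 64-bit offset basis
    for (unsigned char c : key) {
        h ^= c;
        h *= 1099511628211ull;                  // FNV-1a 64-bit prime
    }
    return h;
}
```

Reducing the result modulo the table size then gives the bucket index.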
Dizzy Egg
Member #10,824
March 2009
|
Heat hash. Crumble into pipe. Put pipe in mouth. Burn hash, whilst steadily sucking on pipe. When all hash is glowing red, stop breathing in and remove pipe from mouth. Hold breath. Breathe out hash smoke. Wait. Now write code.
|
verthex
Member #11,340
September 2009
|
Dizzy Egg said: Heat hash. Crumble into pipe. Put pipe in mouth. Burn hash, whilst steadily sucking on pipe. When all hash is glowing red, stop breathing in and remove pipe from mouth. Hold breath. Breathe out hash smoke. Wait. Now write code. No, no, no time for that. Thomas Fjellstrom said: It's a bit more complex than that. It's very complex. That's why I asked. I did find some books on the topic. Edgar Reynaldo said: Don't know any books, but wikipedia might be an okay place to start. Haven't looked there yet; Wikipedia is always too narrow in context for what I'm after, though. Oh well, I'll look around more.
|
BAF
Member #2,981
December 2002
|
I don't think it's so complex that a book would be dedicated to it. It's more about understanding the (very simple) purpose and properties of a good hashing function, then coming up with one based on your data. [edit] In computer chess and other computer games, transposition tables are used to speed up the search of the game tree. Transposition tables are primarily useful in perfect information games, meaning the entire state of the game is known to all players at all times. Game playing programs work by analyzing millions of positions that could arise in the next few moves of the game. Typically, these programs employ strategies resembling depth-first search, which means that they do not keep track of all the positions analyzed so far. In many games, it is possible to reach a given position in more than one way. These are called transpositions.[1] In chess, for example, the sequence of moves 1. d4 Nf6 2. c4 g6 (see algebraic chess notation) has 4 possible transpositions, since either player may swap their move order. In general, after n moves, an upper limit on the possible transpositions is (n!)². Although many of these are illegal move sequences, it is still likely that the program will end up analyzing the same position several times. To avoid this problem, transposition tables are used. Such a table is a hash table of each of the positions analyzed so far up to a certain depth. On encountering a new position, the program checks the table to see if the position has already been analyzed; this can be done quickly, in expected constant time. If so, the table contains the value that was previously assigned to this position; this value is used directly. If not, the value is computed and the new position is entered into the hash table. This is essentially memoization applied to the search function. |
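The transposition-table idea in the quoted text can be sketched as a toy memoized game-tree search. Single-pile Nim stands in for a real game position here, and the map plays the role of the hash table of analyzed positions (hypothetical example, not from the thread):

```cpp
#include <unordered_map>

// Toy "transposition table": memoize the result of a game-tree search
// (single-pile Nim, where a player removes 1-3 stones and taking the
// last stone wins) so each position is analyzed only once. On a repeat
// visit the stored value is reused instead of re-searching the subtree.
bool first_player_wins(int stones, std::unordered_map<int, bool>& table) {
    if (stones == 0) return false;              // no move left: current player loses
    auto it = table.find(stones);
    if (it != table.end()) return it->second;   // position seen before: reuse value
    bool win = false;
    for (int take = 1; take <= 3 && take <= stones; ++take)
        if (!first_player_wins(stones - take, table))
            win = true;                         // a move leaving a losing position wins
    table[stones] = win;                        // record the new position
    return win;
}
```

This is exactly the memoization pattern the quote describes: check the table first, compute and store on a miss.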
Thomas Fjellstrom
Member #476
June 2000
|
The problem is generating hash keys that are uniform over a given range. You want as few collisions as possible, and no bunching, or your algorithm gets a lot slower. And many popular hash functions have some rather easy-to-trigger worst cases. Watch that video for an overview from some guys who tested known flaws against popular web frameworks/languages. |
verthex
Member #11,340
September 2009
|
Thanks for the help guys. It's not that it's complicated, like BAF said, but implementing the code is a little hard to do. And then there are lots of things to learn about. @Thomas: interesting video, but I can't sit through an hour of that.
|
BAF
Member #2,981
December 2002
|
The main disadvantage of the brute-force method is that, for many real-world problems, the number of natural candidates is prohibitively large. For instance, if we look for the divisors of a number as described above, the number of candidates tested will be the given number n. So if n has sixteen decimal digits, say, the search will require executing at least 1015 computer instructions, which will take several days on a typical PC. If n is a random 64-bit natural number, which has about 19 decimal digits on the average, the search will take about 10 years. This steep growth in the number of candidates, as the size of the data increases, occurs in all sorts of problems. For instance, if we are seeking a particular rearrangement of 10 letters, then we have 10! = 3,628,800 candidates to consider, which a typical PC can generate and test in less than one second. However, adding one more letter — which is only a 10% increase in the data size — will multiply the number of candidates by 11 — a 1000% increase. For 20 letters, the number of candidates is 20!, which is about 2.4×1018 or 2.4 quintillion; and the search will take about 10,000 years. This unwelcome phenomenon is commonly called the combinatorial explosion.
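The quoted arithmetic is easy to check in code; a small sketch (illustrative, not from the post):

```cpp
#include <cstdint>

// The factorial growth from the quoted text, checked: adding one more
// letter to a 10-letter rearrangement problem multiplies the candidate
// count by 11. 20! still fits in 64 bits, but only just.
std::uint64_t factorial(std::uint64_t n) {
    std::uint64_t f = 1;
    for (std::uint64_t i = 2; i <= n; ++i)
        f *= i;
    return f;
}
```
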
|
Thomas Fjellstrom
Member #476
June 2000
|
Math doesn't help you when there are known problems with a given hash method. All you have to do is tweak the input, and you can stack as many things as you want into as few buckets as possible. |
Arthur Kalliokoski
Second in Command
February 2005
|
BAF said: if n has sixteen decimal digits, say, the search will require executing at least 1015 computer instructions, which will take several days on a typical PC. Hahahaha! Maybe you meant 10 to the 15th? |
Matthew Leverton
Supreme Loser
January 1999
|
From the archives: http://www.allegro.cc/forums/thread/512313 Arthur Kalliokoski said: Hahahaha! Maybe you meant 10 to the 15th? I think you are missing the larger joke. |
Tobias Dammers
Member #2,604
August 2002
|
verthex said: @Thomas, interesting video but I can't sit through an hour and watch that. Neither can I, but if it is what I think it is, here's how the attack goes:
The crux is that a hashmap performs well as long as the number of collisions is low; but when all the keys collide, it is reduced to a linked list, and the fast lookup hashmaps are known for is reduced to a devastating O(n). A relatively small number of requests can bring a server to its knees this way, without taking up much bandwidth, which makes this a very efficient DoS attack. Typical defenses (bandwidth limiting, dedicated administrator network connections, etc.) are quite ineffective against this, because unlike most DoS attacks, this one targets the CPU and RAM access rather than choking the network connection. |
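The attack Tobias describes can be sketched against a deliberately weak hash. With a predictable function (a plain byte sum here; the real attacks target the specific hash of each language runtime), an attacker can generate endless keys that all land in one bucket (illustrative example, not from the thread):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Against a known, unseeded hash, an attacker precomputes colliding
// keys offline. This byte-sum hash ignores character order, so every
// permutation of the same characters lands in the same bucket, turning
// that bucket's chain into one long linked list with O(n) lookups.
std::size_t byte_sum_hash(const std::string& key, std::size_t buckets) {
    std::size_t sum = 0;
    for (unsigned char c : key)
        sum += c;
    return sum % buckets;
}

// All permutations of "abc" share one byte sum; longer alphabets give
// factorially many colliders for free.
std::vector<std::string> colliding_keys() {
    return {"abc", "acb", "bac", "bca", "cab", "cba"};
}
```

Sending a few hundred thousand such keys as request parameters forces the server to do quadratic work parsing them.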
MiquelFire
Member #3,110
January 2003
|
What they showed in the video relied on POST requests, so the limit depends on the stack (PHP defaults to 8MB of POST data). |
verthex
Member #11,340
September 2009
|
BAF said: The main disadvantage of the brute-force method is that, for many real-world problems, the number of natural candidates is prohibitively large. [...] This unwelcome phenomenon is commonly called the combinatorial explosion. I guess BAF, did you get this from a book? Sounds like it... come on... spill it! Tobias Dammers said: The crux is that a hashmap performs well as long as the number of collisions is low; but when all the keys collide, it is reduced to a linked list, and the fast lookup hashmaps are known for is reduced to a devastating O(n). [...]
What does this have to do with my thread?
|
Bob
Free Market Evangelist
September 2000
|
You can avoid degenerate hash table cases by simply extending the size of the table dynamically when it gets too crowded. Linky to my hashtable implementation. It seems to work fine with a few million keys; I haven't tried billions due to lack of memory. It averages about 2x the performance of STL's hash_map<> on MSVC 7 or GCC 2.x. The hashing function isn't that great; feel free to replace it with something better. |
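A hypothetical sketch of the grow-when-crowded approach (this is not Bob's actual code, which is behind the link): a chained table that doubles its bucket count and rehashes whenever the load factor would exceed 1, keeping chains short on average as long as the hash itself distributes keys evenly.

```cpp
#include <cstddef>
#include <functional>
#include <list>
#include <string>
#include <utility>
#include <vector>

// Chained hash table that doubles its bucket array and rehashes every
// entry when entries/buckets would pass 1. Growth keeps the average
// chain length bounded, which is what makes lookups O(1) on average.
class GrowingTable {
    std::vector<std::list<std::pair<std::string, int>>> buckets_{8};
    std::size_t size_ = 0;

    std::size_t index(const std::string& key) const {
        return std::hash<std::string>{}(key) % buckets_.size();
    }

    void grow() {  // double the buckets and redistribute every entry
        std::vector<std::list<std::pair<std::string, int>>> bigger(buckets_.size() * 2);
        for (auto& chain : buckets_)
            for (auto& kv : chain)
                bigger[std::hash<std::string>{}(kv.first) % bigger.size()]
                    .push_back(std::move(kv));
        buckets_.swap(bigger);
    }

public:
    void insert(const std::string& key, int value) {
        if (size_ + 1 > buckets_.size()) grow();  // keep load factor <= 1
        buckets_[index(key)].push_back({key, value});
        ++size_;
    }

    const int* find(const std::string& key) const {
        for (const auto& kv : buckets_[index(key)])
            if (kv.first == key) return &kv.second;
        return nullptr;
    }

    std::size_t bucket_count() const { return buckets_.size(); }
};
```

Note that growth alone does not fix the collision attack discussed above: if every key hashes to the same value, they still share one chain no matter how many buckets exist.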
verthex
Member #11,340
September 2009
|
Thanks Bob, I saved it in a folder called BobsHash.
|
Thomas Fjellstrom
Member #476
June 2000
|
MiquelFire said: What they showed in the video relied on POST requests, so the limit depends on the stack (PHP defaults to 8MB of data) Most of the requests they sent were much smaller, and often sent a bit at a time. Some of the requests could be sent at 17kbps or slower and still cause a machine to use up incredible resources. verthex said: Thanks Bob, I saved it in a folder called BobsHash. Better hope the police don't see that |
Matthew Leverton
Supreme Loser
January 1999
|
Quote: An attacker with a Gigabit connection can keep about 10.000 i7 cores busy. That's regarding PHP. ASP.NET was 30K. Java was 100K. Some were even worse than that... The video is boring, so read the PDF. A workaround that some have implemented is limiting the number of input variables. |
Peter Wang
Member #23
April 2000
|
Hash tables are nice and all, but don't forget about balanced binary search trees. As for hash functions, I liked MurmurHash (2, haven't tried 3).
|
Thomas Fjellstrom
Member #476
June 2000
|
You can use a binary search tree instead of linked lists for the buckets. |
Bob
Free Market Evangelist
September 2000
|
That's pretty clever, exploiting the fact that inserting into hash tables tends to be O(n) or O(n log(n)) but is routinely confused as being an O(1) operation. My hash code posted above is exposed to this bug and will fail in the same way. I can't think of a good way of addressing that problem without significantly hurting its performance. |
Matthew Leverton
Supreme Loser
January 1999
|
In terms of preventing remote attacks, all you need to do is have a hash function that doesn't always produce the same results. Using a random seed is sufficient for that (depending on the algorithm, of course). Or you could terminate all requests that have more than N collisions. (I think that's more friendly than just assuming all requests with more than N inputs are bad, although it is harder to implement when you are sharing hash code across the entire application.) |
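Matthew's random-seed defense, sketched on top of an FNV-1a-style loop (illustrative assumption, not code from the thread): fold a per-process random value into the initial state so an attacker can no longer precompute colliding keys offline.

```cpp
#include <cstdint>
#include <random>
#include <string>

// Seeded hash: XOR a secret per-process seed into the FNV-1a offset
// basis before mixing. The same key hashes differently in every
// process run, so precomputed collision sets stop working.
std::uint64_t seeded_hash(const std::string& key, std::uint64_t seed) {
    std::uint64_t h = 14695981039346656037ull ^ seed;
    for (unsigned char c : key) {
        h ^= c;
        h *= 1099511628211ull;
    }
    return h;
}

// Draw the seed once at startup and reuse it for the process lifetime,
// so the table stays internally consistent.
std::uint64_t process_seed() {
    static const std::uint64_t seed = std::random_device{}();
    return seed;
}
```

Several language runtimes later shipped variants of exactly this idea as their fix for the collision-DoS attacks discussed above.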
verthex
Member #11,340
September 2009
|
Thomas Fjellstrom said: You can use a binary search tree instead of linked lists for the buckets. Buckets? What's that? Bob said: That's pretty clever, exploiting the fact that inserting into hash tables tends to be O(n) or O(n log(n)) but is routinely confused as being an O(1) operation. My hash code posted above is exposed to this bug and will fail in the same way. I can't think of a good way of addressing that problem without significantly hurting its performance. It may be an O(1) operation, but if a hash function takes 30ms (I'm guessing, no benchmarks here; well, actually I did some with boost unordered_map, if that's a hashing function, and it was kinda slow) to run a million times on a 2 GHz processor, then it's still kinda slow. Since everything else seems slower I guess it's better, but you have to realize that the one operation is still pretty cumbersome.
|
|
|