
Credits go to BAF, Dizzy Egg, Edgar Reynaldo, Steve Terry, and Thomas Fjellstrom for helping out!
This thread is locked; no one can reply to it.
Any good books on hashing using c/c++, even java
verthex
Member #11,340
September 2009
avatar

What I'm looking for is a book that explains hashing functions: the formula that takes a value and generates a key, or something like that. If the book also explains the modern STL and the use of iterators, that would be even better.

Steve Terry
Member #1,989
March 2002
avatar

Add up all characters and mod it by the hash size?
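
Something like this in C++ (table_size here is just whatever bucket count you're using):

#include <cstddef>
#include <string>

// Naive hash: sum the character values and wrap to the table size.
std::size_t naive_hash(const std::string& key, std::size_t table_size)
{
    std::size_t sum = 0;
    for (unsigned char c : key)
        sum += c;
    return sum % table_size;
}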

___________________________________
[ Facebook ]
Microsoft is not the Borg collective. The Borg collective has got proper networking. - planetspace.de
Bill Gates is in fact Shawn Hargreaves' ßî+çh. - Gideon Weems

Edgar Reynaldo
Major Reynaldo
May 2007
avatar

Don't know any books, but wikipedia might be an okay place to start :

Thomas Fjellstrom
Member #476
June 2000
avatar

Steve Terry said:

Add up all characters and mod it by the hash size?

It's a bit more complex than that. You want a hash function that gives an even distribution in all (or at least most) cases.
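
For example, here's a rough sketch of a common mixing hash (FNV-1a; the constants are its published 64-bit offset basis and prime). It spreads similar keys far more evenly than a plain character sum:

#include <cstdint>
#include <string>

// FNV-1a: XOR each byte into the state, then multiply by a large prime.
// Nearby inputs ("ab" vs "ba") land in very different buckets, unlike a
// simple character sum.
std::uint64_t fnv1a_hash(const std::string& key)
{
    std::uint64_t h = 0xcbf29ce484222325ULL;   // FNV offset basis
    for (unsigned char c : key) {
        h ^= c;
        h *= 0x100000001b3ULL;                 // FNV prime
    }
    return h;
}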

--
Thomas Fjellstrom - [website] - [email] - [Allegro Wiki] - [Allegro TODO]
"If you can't think of a better solution, don't try to make a better solution." -- weapon_S
"The less evidence we have for what we believe is certain, the more violently we defend beliefs against those who don't agree" -- https://twitter.com/neiltyson/status/592870205409353730

Dizzy Egg
Member #10,824
March 2009
avatar

Heat hash. Crumble into pipe. Put pipe in mouth. Burn hash, whilst steadily sucking on pipe. When all hash is glowing red, stop breathing in and remove pipe from mouth. Hold breath. Breathe out hash smoke. Wait. Now write code. ???

----------------------------------------------------
Please check out my songs:
https://soundcloud.com/dont-rob-the-machina

verthex
Member #11,340
September 2009
avatar

Dizzy Egg said:

Heat hash. Crumble into pipe. Put pipe in mouth. Burn hash, whilst steadily sucking on pipe. When all hash is glowing red, stop breathing in and remove pipe from mouth. Hold breath. Breathe out hash smoke. Wait. Now write code. ???

No, no, no time for that.

Thomas Fjellstrom said:

It's a bit more complex than that.

It's very complex. That's why I asked. I did find some books on the topic.

Edgar Reynaldo said:

Don't know any books, but wikipedia might be an okay place to start :

Haven't looked there yet. Wikipedia is always too narrow in context, though, compared to what they want to sell me.

Oh well, I'll look around more.

BAF
Member #2,981
December 2002
avatar

I don't think it's so complex that a book would be dedicated to it. It's more about understanding the (very simple) purpose and properties of a good hashing function, then coming up with one based on your data.

[edit]
What I meant to say is:

In computer chess and other computer games, transposition tables are used to speed up the search of the game tree. Transposition tables are primarily useful in perfect information games, meaning the entire state of the game is known to all players at all times.

Game playing programs work by analyzing millions of positions that could arise in the next few moves of the game. Typically, these programs employ strategies resembling depth-first search, which means that they do not keep track of all the positions analyzed so far. In many games, it is possible to reach a given position in more than one way. These are called transpositions.[1] In chess, for example, the sequence of moves 1. d4 Nf6 2. c4 g6 (see algebraic chess notation) has 4 possible transpositions, since either player may swap their move order. In general, after n moves, an upper limit on the possible transpositions is (n!)². Although many of these are illegal move sequences, it is still likely that the program will end up analyzing the same position several times.

To avoid this problem, transposition tables are used. Such a table is a hash table of each of the positions analyzed so far up to a certain depth. On encountering a new position, the program checks the table to see if the position has already been analyzed; this can be done quickly, in expected constant time. If so, the table contains the value that was previously assigned to this position; this value is used directly. If not, the value is computed and the new position is entered into the hash table. This is essentially memoization applied to the search function.
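
The memoization step in that last paragraph looks roughly like this (Position, hash_of and evaluate are made-up stand-ins so the sketch is self-contained, not a real engine's API):

#include <cstdint>
#include <unordered_map>

// Made-up stand-ins; a real engine would have a full board representation
// and something like a Zobrist hash.
struct Position { std::uint64_t zobrist; };
std::uint64_t hash_of(const Position& p) { return p.zobrist; }
int evaluate(const Position&, int depth) { return depth; }  // placeholder for the real search

std::unordered_map<std::uint64_t, int> transposition_table;

int search(const Position& pos, int depth)
{
    const std::uint64_t key = hash_of(pos);
    auto it = transposition_table.find(key);
    if (it != transposition_table.end())
        return it->second;                   // position already analyzed: reuse its value

    int score = evaluate(pos, depth);        // otherwise do the expensive analysis
    transposition_table.emplace(key, score);
    return score;
}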

Thomas Fjellstrom
Member #476
June 2000
avatar

The problem is generating a hash key that is uniform over a given range. You want as few collisions as possible, and you don't want any bunching, or your algorithm gets a lot slower. And many popular hash functions have some rather easy-to-trigger worst cases.

video

Watch that for an overview from some guys who tested known flaws against popular web frameworks and languages.

--
Thomas Fjellstrom - [website] - [email] - [Allegro Wiki] - [Allegro TODO]
"If you can't think of a better solution, don't try to make a better solution." -- weapon_S
"The less evidence we have for what we believe is certain, the more violently we defend beliefs against those who don't agree" -- https://twitter.com/neiltyson/status/592870205409353730

verthex
Member #11,340
September 2009
avatar

Thanks for the help guys. It's not that it's complicated, like BAF said, but implementing the code is a little hard to do. And then there are lots of things to learn about.

@Thomas, interesting video but I can't sit through an hour and watch that.

BAF
Member #2,981
December 2002
avatar

The main disadvantage of the brute-force method is that, for many real-world problems, the number of natural candidates is prohibitively large. For instance, if we look for the divisors of a number as described above, the number of candidates tested will be the given number n. So if n has sixteen decimal digits, say, the search will require executing at least 1015 computer instructions, which will take several days on a typical PC. If n is a random 64-bit natural number, which has about 19 decimal digits on the average, the search will take about 10 years. This steep growth in the number of candidates, as the size of the data increases, occurs in all sorts of problems. For instance, if we are seeking a particular rearrangement of 10 letters, then we have 10! = 3,628,800 candidates to consider, which a typical PC can generate and test in less than one second. However, adding one more letter — which is only a 10% increase in the data size — will multiply the number of candidates by 11 — a 1000% increase. For 20 letters, the number of candidates is 20!, which is about 2.4×1018 or 2.4 quintillion; and the search will take about 10,000 years. This unwelcome phenomenon is commonly called the combinatorial explosion.

:-/

Thomas Fjellstrom
Member #476
June 2000
avatar

Math doesn't help you when there are known problems with a given hash method. All you have to do is tweak the input, and you can stack up as many entries as you want in as few buckets as possible.

--
Thomas Fjellstrom - [website] - [email] - [Allegro Wiki] - [Allegro TODO]
"If you can't think of a better solution, don't try to make a better solution." -- weapon_S
"The less evidence we have for what we believe is certain, the more violently we defend beliefs against those who don't agree" -- https://twitter.com/neiltyson/status/592870205409353730

Arthur Kalliokoski
Second in Command
February 2005
avatar

BAF said:

if n has sixteen decimal digits, say, the search will require executing at least 1015 computer instructions, which will take several days on a typical PC.

Hahahaha! Maybe you meant 10 to the 15th?

They all watch too much MSNBC... they get ideas.

Matthew Leverton
Supreme Loser
January 1999
avatar

From the archives:

http://www.allegro.cc/forums/thread/512313

Arthur Kalliokoski said:

Hahahaha! Maybe you meant 10 to the 15th?

I think you are missing the larger joke.

Tobias Dammers
Member #2,604
August 2002
avatar

verthex said:

@Thomas, interesting video but I can't sit through an hour and watch that.

Neither can I, but if it is what I think it is, here's how the attack goes:

  • find a target that uses hashmaps to store query string parameters (most web stacks qualify)

  • find a bunch of keys with identical hashes, and use them as query string parameter names

  • construct query strings using as many of these identical-hash keys as you can fit in the URL

  • send a bunch of these requests

The crux is that a hashmap performs well as long as the number of collisions is low; but when all the keys collide, it degenerates into a linked list, and the fast lookup that hashmaps are known for degrades to a devastating O(n). A relatively small number of requests can bring a server to its knees this way, without taking up much bandwidth, which makes this a very efficient DoS attack. Typical defenses (bandwidth limiting, dedicated administrator network connections, etc.) are quite ineffective against this, because unlike most DoS attacks, this one targets the CPU and RAM rather than choking the network connection.
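
To make the collision part concrete: with a weak hash (say, one that just sums characters), colliding keys are trivial to generate offline, because any permutation of the same characters hashes identically. A toy demonstration (the real attacks target the specific hash functions of the web stacks, not this one):

#include <algorithm>
#include <cstddef>
#include <iostream>
#include <string>

// The same kind of weak hash discussed above: any permutation of a key collides.
std::size_t naive_hash(const std::string& key, std::size_t table_size)
{
    std::size_t sum = 0;
    for (unsigned char c : key)
        sum += c;
    return sum % table_size;
}

int main()
{
    std::string key = "abcdefgh";
    std::size_t colliding = 0;
    do {
        // Every permutation lands in the same bucket: 8! = 40320 parameter
        // names an attacker could pack into a single request.
        if (naive_hash(key, 1024) == naive_hash("abcdefgh", 1024))
            ++colliding;
    } while (std::next_permutation(key.begin(), key.end()));
    std::cout << colliding << " colliding keys\n";   // prints 40320
}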

---
Me make music: Triofobie
---
"We need Tobias and his awesome trombone, too." - Johan Halmén

MiquelFire
Member #3,110
January 2003
avatar

What they showed in the video relied on POST requests, so the limit depends on the stack (PHP defaults to 8 MB of POST data)

---
Febreze (and other air fresheners actually) is just below perfumes/colognes, and that's just below dead skunks in terms of smells that offend my nose.
MiquelFire.red
If anyone is of the opinion that there is no systemic racism in America, they're either blind, stupid, or racist too. ~Edgar Reynaldo

verthex
Member #11,340
September 2009
avatar

BAF said:

The main disadvantage of the brute-force method is that, for many real-world problems, the number of natural candidates is prohibitively large. For instance, if we look for the divisors of a number as described above, the number of candidates tested will be the given number n. So if n has sixteen decimal digits, say, the search will require executing at least 1015 computer instructions, which will take several days on a typical PC. If n is a random 64-bit natural number, which has about 19 decimal digits on the average, the search will take about 10 years. This steep growth in the number of candidates, as the size of the data increases, occurs in all sorts of problems. For instance, if we are seeking a particular rearrangement of 10 letters, then we have 10! = 3,628,800 candidates to consider, which a typical PC can generate and test in less than one second. However, adding one more letter — which is only a 10% increase in the data size — will multiply the number of candidates by 11 — a 1000% increase. For 20 letters, the number of candidates is 20!, which is about 2.4×1018 or 2.4 quintillion; and the search will take about 10,000 years. This unwelcome phenomenon is commonly called the combinatorial explosion.

I guess... BAF, did you get this from a book? Sounds like it... come on... spill it!

Tobias Dammers said:

The crux is that a hashmap performs well as long as the number of collisions is low; but when all the keys collide, it degenerates into a linked list, and the fast lookup that hashmaps are known for degrades to a devastating O(n). A relatively small number of requests can bring a server to its knees this way, without taking up much bandwidth, which makes this a very efficient DoS attack. Typical defenses (bandwidth limiting, dedicated administrator network connections, etc.) are quite ineffective against this, because unlike most DoS attacks, this one targets the CPU and RAM rather than choking the network connection.

What does this have to do with my thread >:(

Bob
Free Market Evangelist
September 2000
avatar

You can avoid degenerate hash table cases by simply extending the size of the table dynamically when it gets too crowded.

Linky to my hashtable implementation. It seems to work fine with a few million keys. I haven't tried billions due to lack of memory.

It averages about 2x the performance of STL's hash_map<> on MSVC 7 or GCC 2.x.

The hashing function isn't that great; feel free to replace it with something better.
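
The growth trick is roughly: track the load factor and rehash everything into a larger table once it climbs too high. A toy sketch with separate chaining (insert only, just the shape of the idea rather than the linked implementation):

#include <cstddef>
#include <functional>
#include <list>
#include <string>
#include <utility>
#include <vector>

// Toy chained hash table that doubles its bucket count whenever the
// average chain length (load factor) goes above 1.
class Table {
    std::vector<std::list<std::pair<std::string, int>>> buckets;
    std::size_t count = 0;

    std::size_t index(const std::string& key) const {
        return std::hash<std::string>{}(key) % buckets.size();
    }

    void grow() {
        std::vector<std::list<std::pair<std::string, int>>> bigger(buckets.size() * 2);
        for (auto& chain : buckets)
            for (auto& kv : chain)
                bigger[std::hash<std::string>{}(kv.first) % bigger.size()].push_back(std::move(kv));
        buckets = std::move(bigger);
    }

public:
    Table() : buckets(8) {}

    void insert(const std::string& key, int value) {
        if (++count > buckets.size())   // load factor above 1: rehash into a bigger table
            grow();
        buckets[index(key)].emplace_back(key, value);
    }
};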

--
- Bob
[ -- All my signature links are 404 -- ]

verthex
Member #11,340
September 2009
avatar

Thanks Bob, I saved it in a folder called BobsHash.

Thomas Fjellstrom
Member #476
June 2000
avatar

MiquelFire said:

What they showed in the video relied on POST requests, so the limit depends on the stack (PHP defaults to 8 MB of POST data)

Most of the requests they sent were much smaller, and often sent a bit at a time. Some of the requests could be sent at 17 kbps or slower and still cause a machine to use up incredible resources.

verthex said:

Thanks Bob, I saved it in a folder called BobsHash.

Better hope the police don't see that :-X

--
Thomas Fjellstrom - [website] - [email] - [Allegro Wiki] - [Allegro TODO]
"If you can't think of a better solution, don't try to make a better solution." -- weapon_S
"The less evidence we have for what we believe is certain, the more violently we defend beliefs against those who don't agree" -- https://twitter.com/neiltyson/status/592870205409353730

Matthew Leverton
Supreme Loser
January 1999
avatar

Quote:

An attacker with a Gigabit connection can keep about 10.000 i7 cores busy.

That's regarding PHP. ASP.NET was 30K. Java was 100K. Some were even worse than that... The video is boring, so read the PDF.

A workaround that some have implemented is limiting the number of input variables.

Peter Wang
Member #23
April 2000

Hash tables are nice and all, but don't forget about balanced binary search trees.

As for hash functions, I liked MurmurHash (2, haven't tried 3).

Thomas Fjellstrom
Member #476
June 2000
avatar

you can use a binary search tree instead of linked lists for the buckets. :D
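
Something along these lines, so each bucket's lookup stays logarithmic even if an attacker crams every key into one bucket (just a sketch of the idea):

#include <functional>
#include <map>
#include <string>
#include <vector>

// Each bucket is a balanced tree (std::map), so a bucket holding k colliding
// keys still answers a lookup in O(log k) rather than O(k).
std::vector<std::map<std::string, int>> buckets(1024);

int* lookup(const std::string& key)
{
    auto& bucket = buckets[std::hash<std::string>{}(key) % buckets.size()];
    auto it = bucket.find(key);
    return it == bucket.end() ? nullptr : &it->second;
}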

--
Thomas Fjellstrom - [website] - [email] - [Allegro Wiki] - [Allegro TODO]
"If you can't think of a better solution, don't try to make a better solution." -- weapon_S
"The less evidence we have for what we believe is certain, the more violently we defend beliefs against those who don't agree" -- https://twitter.com/neiltyson/status/592870205409353730

Bob
Free Market Evangelist
September 2000
avatar

That's pretty clever, exploiting the fact that inserting into hash tables tends to be O(n) or O(n log(n)) but is routinely confused as being an O(1) operation.

My hash code posted above is exposed to this bug and will fail in the same way. I can't think of a good way of addressing that problem without significantly hurting its performance.

--
- Bob
[ -- All my signature links are 404 -- ]

Matthew Leverton
Supreme Loser
January 1999
avatar

In terms of preventing remote attacks, all you need to do is have a hash function that doesn't always produce the same results. Using a random seed is sufficient for that (depending on the algorithm, of course).

Or you could terminate all requests that have more than N collisions. (I think that's more friendly than just assuming all requests with more than N inputs are bad, although it is harder to implement when you are sharing hash code across the entire application.)
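
A sketch of the random-seed idea: fold a value chosen at process startup into the hash state, so colliding keys can't be precomputed offline (FNV-1a style again; the seeding detail here is just illustrative, not how any particular runtime does it):

#include <cstdint>
#include <random>
#include <string>

// Seed picked once per process; an attacker can't know it ahead of time,
// so a precomputed set of colliding keys stops working.
static const std::uint64_t hash_seed = std::random_device{}();

std::uint64_t seeded_hash(const std::string& key)
{
    std::uint64_t h = 0xcbf29ce484222325ULL ^ hash_seed;   // mix the seed into the basis
    for (unsigned char c : key) {
        h ^= c;
        h *= 0x100000001b3ULL;
    }
    return h;
}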

verthex
Member #11,340
September 2009
avatar

Thomas Fjellstrom said:

you can use a binary search tree instead of linked lists for the buckets.

Buckets? What's that?

Bob said:

That's pretty clever, exploiting the fact that inserting into hash tables tends to be O(n) or O(n log(n)) but is routinely confused as being an O(1) operation. My hash code posted above is exposed to this bug and will fail in the same way. I can't think of a good way of addressing that problem without significantly hurting its performance.

It may be an O(1) operation, but if hashing takes 30 ms (I'm guessing, no real benchmarks here; well, actually I did some with boost's unordered_map, if that counts as a hashing function, and it was kinda slow) to run a million times on a 2 GHz processor, then it's still kinda slow. Since everything else seems slower I guess it's better, but you have to realize that the one operation is still fairly cumbersome.
