
Credits go to BAF, Dizzy Egg, Edgar Reynaldo, Steve Terry, and Thomas Fjellstrom for helping out!
This thread is locked; no one can reply to it.
Any good books on hashing using c/c++, even java
verthex
Member #11,340
September 2009
avatar

What I'm looking for is a book that explains hashing functions: the formula that takes a value and generates a key, or something like that. If the book also explains the modern STL and the use of iterators, that would be even better.

Steve Terry
Member #1,989
March 2002
avatar

Add up all characters and mod it by the hash size?
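
Something like this in C++ (table_size here is just whatever bucket count you're using):

#include <cstddef>
#include <string>

// Naive hash: sum the character values and wrap to the table size.
std::size_t naive_hash(const std::string& key, std::size_t table_size)
{
    std::size_t sum = 0;
    for (unsigned char c : key)
        sum += c;
    return sum % table_size;
}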

___________________________________
[ Facebook ]
Microsoft is not the Borg collective. The Borg collective has got proper networking. - planetspace.de
Bill Gates is in fact Shawn Hargreaves' ßî+çh. - Gideon Weems

Edgar Reynaldo
Major Reynaldo
May 2007
avatar

Don't know any books, but wikipedia might be an okay place to start :

Thomas Fjellstrom
Member #476
June 2000
avatar

Steve Terry said:

Add up all characters and mod it by the hash size?

It's a bit more complex than that. You want a hash function that gives an even distribution in all (or at least most) cases.
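
For example, here's a rough sketch of a common mixing hash (FNV-1a; the constants are its published 64-bit offset basis and prime). It spreads similar keys far more evenly than a plain character sum:

#include <cstdint>
#include <string>

// FNV-1a: XOR each byte into the state, then multiply by a large prime.
// Nearby inputs ("ab" vs "ba") land in very different buckets, unlike a
// simple character sum.
std::uint64_t fnv1a_hash(const std::string& key)
{
    std::uint64_t h = 0xcbf29ce484222325ULL;   // FNV offset basis
    for (unsigned char c : key) {
        h ^= c;
        h *= 0x100000001b3ULL;                 // FNV prime
    }
    return h;
}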

--
Thomas Fjellstrom - [website] - [email] - [Allegro Wiki] - [Allegro TODO]
"If you can't think of a better solution, don't try to make a better solution." -- weapon_S
"The less evidence we have for what we believe is certain, the more violently we defend beliefs against those who don't agree" -- https://twitter.com/neiltyson/status/592870205409353730

Dizzy Egg
Member #10,824
March 2009
avatar

Heat hash. Crumble into pipe. Put pipe in mouth. Burn hash, whilst steadily sucking on pipe. When all hash is glowing red, stop breathing in and remove pipe from mouth. Hold breath. Breathe out hash smoke. Wait. Now write code. ???

----------------------------------------------------
Please check out my songs:
https://soundcloud.com/dont-rob-the-machina

verthex
Member #11,340
September 2009
avatar

Dizzy Egg said:

Heat hash. Crumble into pipe. Put pipe in mouth. Burn hash, whilst steadily sucking on pipe. When all hash is glowing red, stop breathing in and remove pipe from mouth. Hold breath. Breathe out hash smoke. Wait. Now write code. ???

No, no, no time for that.

Thomas Fjellstrom said:

It's a bit more complex than that.

It's very complex. That's why I asked. I did find some books on the topic.

Edgar Reynaldo said:

Don't know any books, but wikipedia might be an okay place to start :

Haven't looked there yet. Wikipedia is always too narrow in context, though, compared to what they want to sell me.

Oh well, I'll look around more.

BAF
Member #2,981
December 2002
avatar

I don't think it's so complex that a book would be dedicated to it. It's more about understanding the (very simple) purpose and properties of a good hashing function, then coming up with one based on your data.

[edit]
What I meant to say is:

In computer chess and other computer games, transposition tables are used to speed up the search of the game tree. Transposition tables are primarily useful in perfect information games, meaning the entire state of the game is known to all players at all times.

Game playing programs work by analyzing millions of positions that could arise in the next few moves of the game. Typically, these programs employ strategies resembling depth-first search, which means that they do not keep track of all the positions analyzed so far. In many games, it is possible to reach a given position in more than one way. These are called transpositions.[1] In chess, for example, the sequence of moves 1. d4 Nf6 2. c4 g6 (see algebraic chess notation) has 4 possible transpositions, since either player may swap their move order. In general, after n moves, an upper limit on the possible transpositions is (n!)². Although many of these are illegal move sequences, it is still likely that the program will end up analyzing the same position several times.

To avoid this problem, transposition tables are used. Such a table is a hash table of each of the positions analyzed so far up to a certain depth. On encountering a new position, the program checks the table to see if the position has already been analyzed; this can be done quickly, in expected constant time. If so, the table contains the value that was previously assigned to this position; this value is used directly. If not, the value is computed and the new position is entered into the hash table. This is essentially memoization applied to the search function.
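
The memoization step in that last paragraph looks roughly like this (Position, hash_of and evaluate are made-up stand-ins so the sketch is self-contained, not a real engine's API):

#include <cstdint>
#include <unordered_map>

// Made-up stand-ins; a real engine would have a full board representation
// and something like a Zobrist hash.
struct Position { std::uint64_t zobrist; };
std::uint64_t hash_of(const Position& p) { return p.zobrist; }
int evaluate(const Position&, int depth) { return depth; }  // placeholder for the real search

std::unordered_map<std::uint64_t, int> transposition_table;

int search(const Position& pos, int depth)
{
    const std::uint64_t key = hash_of(pos);
    auto it = transposition_table.find(key);
    if (it != transposition_table.end())
        return it->second;                   // position already analyzed: reuse its value

    int score = evaluate(pos, depth);        // otherwise do the expensive analysis
    transposition_table.emplace(key, score);
    return score;
}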

Thomas Fjellstrom
Member #476
June 2000
avatar

The problem is generating a hash key that is uniform over a given range. You want as few collisions as possible, and you don't want any bunching, or your algorithm gets a lot slower. And many popular hash functions have some rather easy-to-trigger worst cases.

video

Watch that for an overview from some guys who tested known flaws against popular web frameworks and languages.

--
Thomas Fjellstrom - [website] - [email] - [Allegro Wiki] - [Allegro TODO]
"If you can't think of a better solution, don't try to make a better solution." -- weapon_S
"The less evidence we have for what we believe is certain, the more violently we defend beliefs against those who don't agree" -- https://twitter.com/neiltyson/status/592870205409353730

verthex
Member #11,340
September 2009
avatar

Thanks for the help guys. It's not that it's complicated, like BAF said, but implementing the code is a little hard to do. And then there are lots of things to learn about.

@Thomas, interesting video but I can't sit through an hour and watch that.

BAF
Member #2,981
December 2002
avatar

The main disadvantage of the brute-force method is that, for many real-world problems, the number of natural candidates is prohibitively large. For instance, if we look for the divisors of a number as described above, the number of candidates tested will be the given number n. So if n has sixteen decimal digits, say, the search will require executing at least 1015 computer instructions, which will take several days on a typical PC. If n is a random 64-bit natural number, which has about 19 decimal digits on the average, the search will take about 10 years. This steep growth in the number of candidates, as the size of the data increases, occurs in all sorts of problems. For instance, if we are seeking a particular rearrangement of 10 letters, then we have 10! = 3,628,800 candidates to consider, which a typical PC can generate and test in less than one second. However, adding one more letter — which is only a 10% increase in the data size — will multiply the number of candidates by 11 — a 1000% increase. For 20 letters, the number of candidates is 20!, which is about 2.4×1018 or 2.4 quintillion; and the search will take about 10,000 years. This unwelcome phenomenon is commonly called the combinatorial explosion.

:-/

Thomas Fjellstrom
Member #476
June 2000
avatar

Math doesn't help you when there are known problems with a given hash method. All you have to do is tweak the input, and you can stack up as many entries as you want in as few buckets as possible.

--
Thomas Fjellstrom - [website] - [email] - [Allegro Wiki] - [Allegro TODO]
"If you can't think of a better solution, don't try to make a better solution." -- weapon_S
"The less evidence we have for what we believe is certain, the more violently we defend beliefs against those who don't agree" -- https://twitter.com/neiltyson/status/592870205409353730

Arthur Kalliokoski
Second in Command
February 2005
avatar

BAF said:

if n has sixteen decimal digits, say, the search will require executing at least 1015 computer instructions, which will take several days on a typical PC.

Hahahaha! Maybe you meant 10 to the 15th?

They all watch too much MSNBC... they get ideas.

Matthew Leverton
Supreme Loser
January 1999
avatar

From the archives:

http://www.allegro.cc/forums/thread/512313

Arthur Kalliokoski said:

Hahahaha! Maybe you meant 10 to the 15th?

I think you are missing the larger joke.

Tobias Dammers
Member #2,604
August 2002
avatar

verthex said:

@Thomas, interesting video but I can't sit through an hour and watch that.

Neither can I, but if it is what I think it is, here's how the attack goes:

  • find a target that uses hashmaps to store query string parameters (most web stacks qualify)

  • find a bunch of keys with identical hashes, and use them as query string parameter names

  • construct query strings using as many of these identical-hash keys as you can fit in the URL

  • send a bunch of these requests

The crux is that a hashmap performs well as long as the number of collisions is low; but when all the keys collide, it degenerates into a linked list, and the fast lookup that hashmaps are known for degrades to a devastating O(n). A relatively small number of requests can bring a server to its knees this way, without taking up much bandwidth, which makes this a very efficient DoS attack. Typical defenses (bandwidth limiting, dedicated administrator network connections, etc.) are quite ineffective against this, because unlike most DoS attacks, this one targets the CPU and RAM rather than choking the network connection.
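
To make the collision part concrete: with a weak hash (say, one that just sums characters), colliding keys are trivial to generate offline, because any permutation of the same characters hashes identically. A toy demonstration (the real attacks target the specific hash functions of the web stacks, not this one):

#include <algorithm>
#include <cstddef>
#include <iostream>
#include <string>

// The same kind of weak hash discussed above: any permutation of a key collides.
std::size_t naive_hash(const std::string& key, std::size_t table_size)
{
    std::size_t sum = 0;
    for (unsigned char c : key)
        sum += c;
    return sum % table_size;
}

int main()
{
    std::string key = "abcdefgh";
    std::size_t colliding = 0;
    do {
        // Every permutation lands in the same bucket: 8! = 40320 parameter
        // names an attacker could pack into a single request.
        if (naive_hash(key, 1024) == naive_hash("abcdefgh", 1024))
            ++colliding;
    } while (std::next_permutation(key.begin(), key.end()));
    std::cout << colliding << " colliding keys\n";   // prints 40320
}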

---
Me make music: Triofobie
---
"We need Tobias and his awesome trombone, too." - Johan Halmén

MiquelFire
Member #3,110
January 2003
avatar

What they showed in the video relied on POST requests, so the limit depends on the stack (PHP defaults to 8 MB of POST data)

---
Febreze (and other air fresheners actually) is just below perfumes/colognes, and that's just below dead skunks in terms of smells that offend my nose.
MiquelFire.red
If anyone is of the opinion that there is no systemic racism in America, they're either blind, stupid, or racist too. ~Edgar Reynaldo

verthex
Member #11,340
September 2009
avatar

BAF said:

The main disadvantage of the brute-force method is that, for many real-world problems, the number of natural candidates is prohibitively large. For instance, if we look for the divisors of a number as described above, the number of candidates tested will be the given number n. So if n has sixteen decimal digits, say, the search will require executing at least 1015 computer instructions, which will take several days on a typical PC. If n is a random 64-bit natural number, which has about 19 decimal digits on the average, the search will take about 10 years. This steep growth in the number of candidates, as the size of the data increases, occurs in all sorts of problems. For instance, if we are seeking a particular rearrangement of 10 letters, then we have 10! = 3,628,800 candidates to consider, which a typical PC can generate and test in less than one second. However, adding one more letter — which is only a 10% increase in the data size — will multiply the number of candidates by 11 — a 1000% increase. For 20 letters, the number of candidates is 20!, which is about 2.4×1018 or 2.4 quintillion; and the search will take about 10,000 years. This unwelcome phenomenon is commonly called the combinatorial explosion.

I guess... BAF, did you get this from a book? Sounds like it... come on... spill it!

Tobias Dammers said:

The crux is that a hashmap performs well as long as the number of collisions is low; but when all the keys collide, it degenerates into a linked list, and the fast lookup that hashmaps are known for degrades to a devastating O(n). A relatively small number of requests can bring a server to its knees this way, without taking up much bandwidth, which makes this a very efficient DoS attack. Typical defenses (bandwidth limiting, dedicated administrator network connections, etc.) are quite ineffective against this, because unlike most DoS attacks, this one targets the CPU and RAM rather than choking the network connection.

What does this have to do with my thread >:(

Bob
Free Market Evangelist
September 2000
avatar

You can avoid degenerate hash table cases by simply extending the size of the table dynamically when it gets too crowded.

Linky to my hashtable implementation. It seems to work fine with a few million keys. I haven't tried billions due to lack of memory.

It averages about 2x the performance of STL's hash_map<> on MSVC 7 or GCC 2.x.

The hashing function isn't that great; feel free to replace it with something better.
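
The growth trick is roughly: track the load factor and rehash everything into a larger table once it climbs too high. A toy sketch with separate chaining (insert only, just the shape of the idea rather than the linked implementation):

#include <cstddef>
#include <functional>
#include <list>
#include <string>
#include <utility>
#include <vector>

// Toy chained hash table that doubles its bucket count whenever the
// average chain length (load factor) goes above 1.
class Table {
    std::vector<std::list<std::pair<std::string, int>>> buckets;
    std::size_t count = 0;

    std::size_t index(const std::string& key) const {
        return std::hash<std::string>{}(key) % buckets.size();
    }

    void grow() {
        std::vector<std::list<std::pair<std::string, int>>> bigger(buckets.size() * 2);
        for (auto& chain : buckets)
            for (auto& kv : chain)
                bigger[std::hash<std::string>{}(kv.first) % bigger.size()].push_back(std::move(kv));
        buckets = std::move(bigger);
    }

public:
    Table() : buckets(8) {}

    void insert(const std::string& key, int value) {
        if (++count > buckets.size())   // load factor above 1: rehash into a bigger table
            grow();
        buckets[index(key)].emplace_back(key, value);
    }
};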

--
- Bob
[ -- All my signature links are 404 -- ]

verthex
Member #11,340
September 2009
avatar

Thanks Bob, I saved it in a folder called BobsHash.

Thomas Fjellstrom
Member #476
June 2000
avatar

MiquelFire said:

What they showed in the video relied on POST requests, so the limit depends on the stack (PHP defaults to 8 MB of POST data)

Most of the requests they sent were much smaller, and often sent a bit at a time. Some of the requests could be sent at 17 kbps or slower and still cause a machine to use up incredible resources.

verthex said:

Thanks Bob, I saved it in a folder called BobsHash.

Better hope the police don't see that :-X

--
Thomas Fjellstrom - [website] - [email] - [Allegro Wiki] - [Allegro TODO]
"If you can't think of a better solution, don't try to make a better solution." -- weapon_S
"The less evidence we have for what we believe is certain, the more violently we defend beliefs against those who don't agree" -- https://twitter.com/neiltyson/status/592870205409353730

Matthew Leverton
Supreme Loser
January 1999
avatar

Quote:

An attacker with a Gigabit connection can keep about 10.000 i7 cores busy.

That's regarding PHP. ASP.NET was 30K. Java was 100K. Some were even worse than that... The video is boring, so read the PDF.

A workaround that some have implemented is limiting the number of input variables.

Peter Wang
Member #23
April 2000

Hash tables are nice and all, but don't forget about balanced binary search trees.

As for hash functions, I liked MurmurHash (2, haven't tried 3).

Thomas Fjellstrom
Member #476
June 2000
avatar

you can use a binary search tree instead of linked lists for the buckets. :D
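
Something along these lines, so each bucket's lookup stays logarithmic even if an attacker crams every key into one bucket (just a sketch of the idea):

#include <functional>
#include <map>
#include <string>
#include <vector>

// Each bucket is a balanced tree (std::map), so a bucket holding k colliding
// keys still answers a lookup in O(log k) rather than O(k).
std::vector<std::map<std::string, int>> buckets(1024);

int* lookup(const std::string& key)
{
    auto& bucket = buckets[std::hash<std::string>{}(key) % buckets.size()];
    auto it = bucket.find(key);
    return it == bucket.end() ? nullptr : &it->second;
}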

--
Thomas Fjellstrom - [website] - [email] - [Allegro Wiki] - [Allegro TODO]
"If you can't think of a better solution, don't try to make a better solution." -- weapon_S
"The less evidence we have for what we believe is certain, the more violently we defend beliefs against those who don't agree" -- https://twitter.com/neiltyson/status/592870205409353730

Bob
Free Market Evangelist
September 2000
avatar

That's pretty clever, exploiting the fact that inserting into hash tables tends to be O(n) or O(n log(n)) but is routinely confused as being an O(1) operation.

My hash code posted above is exposed to this bug and will fail in the same way. I can't think of a good way of addressing that problem without significantly hurting its performance.

--
- Bob
[ -- All my signature links are 404 -- ]

Matthew Leverton
Supreme Loser
January 1999
avatar

In terms of preventing remote attacks, all you need to do is have a hash function that doesn't always produce the same results. Using a random seed is sufficient for that (depending on the algorithm, of course).

Or you could terminate all requests that have more than N collisions. (I think that's more friendly than just assuming all requests with more than N inputs are bad, although it is harder to implement when you are sharing hash code across the entire application.)
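
A sketch of the random-seed idea: fold a value chosen at process startup into the hash state, so colliding keys can't be precomputed offline (FNV-1a style again; the seeding detail here is just illustrative, not how any particular runtime does it):

#include <cstdint>
#include <random>
#include <string>

// Seed picked once per process; an attacker can't know it ahead of time,
// so a precomputed set of colliding keys stops working.
static const std::uint64_t hash_seed = std::random_device{}();

std::uint64_t seeded_hash(const std::string& key)
{
    std::uint64_t h = 0xcbf29ce484222325ULL ^ hash_seed;   // mix the seed into the basis
    for (unsigned char c : key) {
        h ^= c;
        h *= 0x100000001b3ULL;
    }
    return h;
}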

verthex
Member #11,340
September 2009
avatar

Thomas Fjellstrom said:

you can use a binary search tree instead of linked lists for the buckets.

Buckets? What's that?

Bob said:

That's pretty clever, exploiting the fact that inserting into hash tables tends to be O(n) or O(n log(n)) but is routinely confused as being an O(1) operation. My hash code posted above is exposed to this bug and will fail in the same way. I can't think of a good way of addressing that problem without significantly hurting its performance.

It may be an O(1) operation, but if hashing takes 30 ms (I'm guessing, no real benchmarks here; well, actually I did some with boost's unordered_map, if that counts as a hashing function, and it was kinda slow) to run a million times on a 2 GHz processor, then it's still kinda slow. Since everything else seems slower I guess it's better, but you have to realize that the one operation is still fairly cumbersome.
