Use of MurmurHash3 for hashing strings

ScottPJones · March 19, 2018, 4:39pm

Julia uses the 128 bit MurmurHash3 algorithm for the hash function for AbstractString.

There are two serious issues with that:

It is not possible to build up a hash value for a string in chunks:

julia> hash("bar",hash("foo"))
0x6d0f1600f8bd5874

julia> hash("foobar")
0x54fc7dff7f029834

julia> using CRC32c

julia> crc32c("bar",crc32c("foo"))
0x0d5f5c7f

julia> crc32c("foobar")
0x0d5f5c7f

It is about twice as slow as using crc32c (at least, on my machine, but most platforms have crc accelerator instructions, such as used in crc32c).
CRC-32c seems to give very similar results as far as collisions as MurmurHash3 - so it seems like it would be an advantage to switch to using it, instead of MurmurHash3.

nalimilan · March 19, 2018, 8:04pm

Good question, but AFAIK CRC32 there’s some controversy around the question of whether CRC32 is a good function for hash tables.

Have you considered Google’s CityHash? It’s relatively recent, designed for strings, and they claim it’s faster than MurmurHash while still of high quality:

CityHash, a family of hash functions for strings. […]
We are most excited by the performance of CityHash64() and its variants on
short strings, but long strings are interesting as well.
CityHash is intended to be fast, under the constraint that it hash very
well. For CPUs with the CRC32 instruction, CRC is speedy, but CRC wasn’t
designed as a hash function and shouldn’t be used as one. CityHashCrc128()
is not a CRC, but it uses the CRC32 machinery.

stevengj · March 19, 2018, 8:19pm

Apparently CityHash has been superseded by FarmHash at Google.

ScottPJones · March 19, 2018, 8:49pm

I’d been using CRC functions for hashing starting with CRC-16 back in the day, and compared with the alternatives at that time, it (and later using CRC-32) always performed quite well, compared to the alternatives generally in use back then.
Could you point to some of the controversy about using CRC for a hash table? (I’ll try to dig some references up, if you can’t) I’d done a lot of testing with large varieties of strings (many many millions, of text and binary data), and had very few collisions, but I’m interested to learn what the criticisms are.

I’m also interested in SipHash.

I have already ported the 64-bit implementation of the 128-bit MurmurHash3 to Julia (it seems to be identical in speed to the C version), because to improve the performance of hashing non-UTF8 encoded strings (while keeping them returning a UTF-8 compatible hash), I needed a Julia version so that I don’t have to convert the entire string first [hence the concern over being able to do the hashing in chunks]).

It will be in JuliaString/Strs.jl/src/murmurhash3.jl shortly.

nalimilan · March 19, 2018, 10:02pm

Apart from what CityHash authors say above, there are (not really conclusive) discussions on StackOverflow: this one and this one at least.

ScottPJones · March 19, 2018, 10:56pm

Those discussions seemed to have as many people claiming that (for non-cryptographic uses) CRC32C did quite well, as those claiming that it didn’t.
It may be that the only solution will be to try out CRC32C vs. FarmHash64, and see what does best overall.

Keno · March 19, 2018, 11:04pm

If we care about hash flood attacks, CRC32 is not an appropriate choice (though FarmHash and SipHash are designed to defend against it). It seems likely that we do want a hash flood resistant hash function here.

ScottPJones · March 20, 2018, 12:28am

Apparently MurmurHash is also susceptible to attacks, so if that’s going to be one of the criteria, then we should probably investigate both FarmHash & SipHash.

stevengj · March 20, 2018, 6:07pm

I found this discussion to be very informative. One of the authors there argues at least somewhat persuasively that no hash function can provide real protection against flooding attacks, because hash tables use only a few bits of the hash and hence can always be brute-forced (since there are a variety of ways to get the seed). Hence you might as well use CRC32c, which has good distribution properties and is fast. To defend against hash flooding, the argument was that you need to focus instead on the hash-table implementation, and in particular on hiding/randomizing the seed and on collision resolution. (Caveat: I’m no cryptographer. But the back-and-forth on this issue was interesting.)

stevengj · March 20, 2018, 6:27pm

In general, it might be nice to make it a bit easier to plug in your own hash functions so that you can use something optimized for your application. This is already possible (and fast!) by just defining a wrapper type

struct HashWrap{T}
    x::T
end 
hash(h::HashWrap....) = ...

and then defining various convert methods and a specialized keys iterator etcetera, so that Dict{HashWrap{T},V} acts like Dict{T,V}.

But it’s a fair amount of boilerplate to write, and it seems like it would better to have done in one place, e.g. in Base or a package. Especially since a slight bit of cleverness is required to encode an arbitrary hash function in the type, ala HashWrap{T,F<:Function}, and to use this information efficiently.

ScottPJones · March 20, 2018, 6:32pm

I have a very well respected cryptographer buddy (worked with him for years at InterSystems), I’ll see if I can pick his brain on this issue.
He always knows really scary stuff that hackers can do when you’re not extremely careful!

Tamas_Papp · March 20, 2018, 6:56pm

I must have missed some discussion, but why is this a concern for hash tables in Julia?

mauro3 · March 20, 2018, 7:24pm

Julia is a general purpose language, so concerns about flooding attacks concern it as much as any other language.

ScottPJones · March 20, 2018, 7:35pm

That was pretty much the conclusion that I’d come to, and if you want a more attack resistant (not attack proof, I don’t think that’s even possible) hash table, then that’s probably best addressed by a package with a different AbstractDict type.
One technique that I use to ameliorate this sort of problem, is to store the full hash values for each element of the hash table, for hash tables where checking for equality and/or calculating the hash is expensive, such as for strings, where it’s not O(1) (on the number of characters in the string).
Also, each hash table can use a separate value that is mixed in with the calculated hash value, just for that specific hash table, possibly a random value.

stevengj · March 20, 2018, 7:40pm

A lot of languages have adapted their default hash algorithms to address concerns about hash flooding. As I understand it, the argument is basically “better safe than sorry” when it comes to the default (since sometimes a library’s hash table might get used in unexpectedly sensitive places), coupled with the difficulty of using a non built-in hash in many high-level languages. Some relevant discussions from other languages:

Python: Python adopts SipHash [LWN.net]
Ruby: switch SipHash from SipHash24 to SipHash13 variant · ruby/ruby@04c94f9 · GitHub
Rust: Remove (most) cryptographic algorithms from Rust by DaGenix · Pull Request #9744 · rust-lang/rust · GitHub, Change SipHash implementation to an optimized assembly version · Issue #35735 · rust-lang/rust · GitHub, consider using a different hash than SipHash for integer keys in rustc · Issue #10586 · rust-lang/rust · GitHub
Java: Loading...
Perl: The dangerous SipHash myth // perl11 blog
Go: runtime: Extend Go's map crypto hash guarentee to all platforms in 1.5 · Issue #9365 · golang/go · GitHub, runtime: make aeshash more DOS-proof · golang/go@91059de · GitHub

Julia is in a somewhat different position than several of these languages, however, in that you can swap your own hash function into Julia’s Dict type without any performance cost. Whereas in something like CPython the hash function is embedded in the C implementation and it’s not possible to use a different hash without either sacrificing performance, writing a huge pile of C code, or recompiling Python itself.

Even so, there is still a valid argument about coding defensively when writing library/package code.

ScottPJones · March 20, 2018, 7:47pm

Ain’t Julia grand?

Tamas_Papp · March 20, 2018, 7:54pm

I don’t know a lot about hash functions, but my understanding is that there is a trade-off between cryptographically relevant properties and speed. Since Julia is computationally oriented, I wonder if simply going for speed (conditional on other relevant properties for hashing) would be a reasonable choice, too.

Of course, with a modular framework, the user can make this choice on a case-by-case basis.

stevengj · March 20, 2018, 8:19pm

Yes, if you don’t use a secret seed then my understanding is that there is essentially no point to using a secure hash — you are vulnerable to brute-force attacks if the hash function and the seed are known, even if the hash function is cryptographically strong, because of the small number of bits used by the hash table.

The SipHash paper proposes to use a cryptographically secure hash function in part to protect against the case where an attacker can actually see the value of hash(x) mod n but the seed is secret. I’m not sure in what practical circumstance this would be possible without also exposing you to other attacks to get the seed value, though. Indeed, that seems to be one of the criticisms of the proposal that SipHash increases security of hash tables.

The first step towards more security in Julia’s Dict, before anything else, would be to at least have the option of a randomized hash seed (see also Julia is vulnerable to HashDoS · Issue #16172 · JuliaLang/julia · GitHub). (Currently, you can do this with a HashKey wrapper object, but something built-in would be better.)

ScottPJones · March 20, 2018, 8:24pm

One thing that is a problem with Julia’s hashing design (not just for strings), is the way that hashing and isequal are tied together, but sometimes you aren’t trying to have a table where different types are ever considered equal, or you want to calculate a hash for fingerprinting, not for a hash table.

For my Str strings and characters I’m having to go to a lot of performance killing work, in order to make strings that I would expect to be considered equal end up hashing the same (which means having to convert them to UTF-8, just to perform the hash, instead of being free to hash based on the Unicode code points directly, or at least hashing based on UTF-16 code units, which would be much more performant in most all cases).
(This issue affects any type of AbstractString or AbstractChar currently, I believe @bkamins has raised the issue on GitHub).

ScottPJones · March 20, 2018, 8:33pm

Just randomly thinking, maybe we need something like a hash function API, and one of the parameters would be whether you need “isequal” compatible results (default true).

Topic		Replies	Views
Structural hash collisions Internals & Design	5	161	February 6, 2025
Problem with: Issue: MurmurHash3 has undefined behavior Internals & Design	27	2280	March 19, 2018
Alternatives to `Base.hash`? General Usage	14	975	April 11, 2022
About the low quality of Base.hash() General Usage question , hash	8	390	April 4, 2025
Why CRC32c but missing CRC32 General Usage	10	864	January 13, 2023

Use of MurmurHash3 for hashing strings

Related topics