Use of MurmurHash3 for hashing strings


#1

Julia uses the 128 bit MurmurHash3 algorithm for the hash function for AbstractString.

There are two serious issues with that:

  1. It is not possible to build up a hash value for a string in chunks:
julia> hash("bar",hash("foo"))
0x6d0f1600f8bd5874

julia> hash("foobar")
0x54fc7dff7f029834

julia> using CRC32c

julia> crc32c("bar",crc32c("foo"))
0x0d5f5c7f

julia> crc32c("foobar")
0x0d5f5c7f
  1. It is about twice as slow as using crc32c (at least, on my machine, but most platforms have crc accelerator instructions, such as used in crc32c).

  2. CRC-32c seems to give very similar results as far as collisions as MurmurHash3 - so it seems like it would be an advantage to switch to using it, instead of MurmurHash3.


#2

Good question, but AFAIK CRC32 there’s some controversy around the question of whether CRC32 is a good function for hash tables.

Have you considered Google’s CityHash? It’s relatively recent, designed for strings, and they claim it’s faster than MurmurHash while still of high quality:

CityHash, a family of hash functions for strings. […]
We are most excited by the performance of CityHash64() and its variants on
short strings, but long strings are interesting as well.
CityHash is intended to be fast, under the constraint that it hash very
well. For CPUs with the CRC32 instruction, CRC is speedy, but CRC wasn’t
designed as a hash function and shouldn’t be used as one. CityHashCrc128()
is not a CRC, but it uses the CRC32 machinery.


#3

Apparently CityHash has been superseded by FarmHash at Google.


#4

I’d been using CRC functions for hashing starting with CRC-16 back in the day, and compared with the alternatives at that time, it (and later using CRC-32) always performed quite well, compared to the alternatives generally in use back then.
Could you point to some of the controversy about using CRC for a hash table? (I’ll try to dig some references up, if you can’t) I’d done a lot of testing with large varieties of strings (many many millions, of text and binary data), and had very few collisions, but I’m interested to learn what the criticisms are.

I’m also interested in SipHash.

I have already ported the 64-bit implementation of the 128-bit MurmurHash3 to Julia (it seems to be identical in speed to the C version), because to improve the performance of hashing non-UTF8 encoded strings (while keeping them returning a UTF-8 compatible hash), I needed a Julia version so that I don’t have to convert the entire string first [hence the concern over being able to do the hashing in chunks]).

It will be in JuliaString/Strs.jl/src/murmurhash3.jl shortly.


#5

Apart from what CityHash authors say above, there are (not really conclusive) discussions on StackOverflow: this one and this one at least.


#6

Those discussions seemed to have as many people claiming that (for non-cryptographic uses) CRC32C did quite well, as those claiming that it didn’t.
It may be that the only solution will be to try out CRC32C vs. FarmHash64, and see what does best overall.


#7

If we care about hash flood attacks, CRC32 is not an appropriate choice (though FarmHash and SipHash are designed to defend against it). It seems likely that we do want a hash flood resistant hash function here.


#8

Apparently MurmurHash is also susceptible to attacks, so if that’s going to be one of the criteria, then we should probably investigate both FarmHash & SipHash.


#9

I found this discussion to be very informative. One of the authors there argues at least somewhat persuasively that no hash function can provide real protection against flooding attacks, because hash tables use only a few bits of the hash and hence can always be brute-forced (since there are a variety of ways to get the seed). Hence you might as well use CRC32c, which has good distribution properties and is fast. To defend against hash flooding, the argument was that you need to focus instead on the hash-table implementation, and in particular on hiding/randomizing the seed and on collision resolution. (Caveat: I’m no cryptographer. But the back-and-forth on this issue was interesting.)


#10

In general, it might be nice to make it a bit easier to plug in your own hash functions so that you can use something optimized for your application. This is already possible (and fast!) by just defining a wrapper type

struct HashWrap{T}
    x::T
end 
hash(h::HashWrap....) = ...

and then defining various convert methods and a specialized keys iterator etcetera, so that Dict{HashWrap{T},V} acts like Dict{T,V}.

But it’s a fair amount of boilerplate to write, and it seems like it would better to have done in one place, e.g. in Base or a package. Especially since a slight bit of cleverness is required to encode an arbitrary hash function in the type, ala HashWrap{T,F<:Function}, and to use this information efficiently.


#11

I have a very well respected cryptographer buddy (worked with him for years at InterSystems), I’ll see if I can pick his brain on this issue.
He always knows really scary stuff that hackers can do when you’re not extremely careful! :grinning:


#12

I must have missed some discussion, but why is this a concern for hash tables in Julia?


#13

Julia is a general purpose language, so concerns about flooding attacks concern it as much as any other language.


#14

That was pretty much the conclusion that I’d come to, and if you want a more attack resistant (not attack proof, I don’t think that’s even possible) hash table, then that’s probably best addressed by a package with a different AbstractDict type.
One technique that I use to ameliorate this sort of problem, is to store the full hash values for each element of the hash table, for hash tables where checking for equality and/or calculating the hash is expensive, such as for strings, where it’s not O(1) (on the number of characters in the string).
Also, each hash table can use a separate value that is mixed in with the calculated hash value, just for that specific hash table, possibly a random value.


#15

A lot of languages have adapted their default hash algorithms to address concerns about hash flooding. As I understand it, the argument is basically “better safe than sorry” when it comes to the default (since sometimes a library’s hash table might get used in unexpectedly sensitive places), coupled with the difficulty of using a non built-in hash in many high-level languages. Some relevant discussions from other languages:

Julia is in a somewhat different position than several of these languages, however, in that you can swap your own hash function into Julia’s Dict type without any performance cost. Whereas in something like CPython the hash function is embedded in the C implementation and it’s not possible to use a different hash without either sacrificing performance, writing a huge pile of C code, or recompiling Python itself.

Even so, there is still a valid argument about coding defensively when writing library/package code.


#16

Ain’t Julia grand? :grinning:


#17

I don’t know a lot about hash functions, but my understanding is that there is a trade-off between cryptographically relevant properties and speed. Since Julia is computationally oriented, I wonder if simply going for speed (conditional on other relevant properties for hashing) would be a reasonable choice, too.

Of course, with a modular framework, the user can make this choice on a case-by-case basis.


#18

Yes, if you don’t use a secret seed then my understanding is that there is essentially no point to using a secure hash — you are vulnerable to brute-force attacks if the hash function and the seed are known, even if the hash function is cryptographically strong, because of the small number of bits used by the hash table.

The SipHash paper proposes to use a cryptographically secure hash function in part to protect against the case where an attacker can actually see the value of hash(x) mod n but the seed is secret. I’m not sure in what practical circumstance this would be possible without also exposing you to other attacks to get the seed value, though. Indeed, that seems to be one of the criticisms of the proposal that SipHash increases security of hash tables.

The first step towards more security in Julia’s Dict, before anything else, would be to at least have the option of a randomized hash seed (see also https://github.com/JuliaLang/julia/issues/16172). (Currently, you can do this with a HashKey wrapper object, but something built-in would be better.)


#19

One thing that is a problem with Julia’s hashing design (not just for strings), is the way that hashing and isequal are tied together, but sometimes you aren’t trying to have a table where different types are ever considered equal, or you want to calculate a hash for fingerprinting, not for a hash table.

For my Str strings and characters I’m having to go to a lot of performance killing work, in order to make strings that I would expect to be considered equal end up hashing the same (which means having to convert them to UTF-8, just to perform the hash, instead of being free to hash based on the Unicode code points directly, or at least hashing based on UTF-16 code units, which would be much more performant in most all cases).
(This issue affects any type of AbstractString or AbstractChar currently, I believe @bkamins has raised the issue on GitHub).


#20

Just randomly thinking, maybe we need something like a hash function API, and one of the parameters would be whether you need “isequal” compatible results (default true).