CRC functions are linear/affine. This means that they are trivial to construct collisions for and that it’s easy to accidentally write hashing constructs that have dangerous cancellation properties.
You can already do this. If you want to use a custom hash function with a given key type, then just wrap it in a HashKey
type as I suggested above, use whatever hash
function you want, and define isequal
of your HashKey
to any other type to be false
.
I was just thinking of how to expose the built-in hash for people to use, where it doesn’t end up converting strings first to UTF-8
.
I think that there are two separate issues there, one is about security, which, as we’ve discussed here, can’t really be solved just by changing the hashing function used, and the other one, do you have examples of that sort of thing occurring in practice?
I believe that for internal hash tables with string keys, CRC32C would be a very good choice, whereas it would be recommended to use a totally different hash table design and different hash algorithm for general hash tables that might be used for “external” data (and hence be open to attack).
Yes, this sort of thing has happened but I don’t have the examples on hand and it doesn’t warrant more than a few minutes of searching through issues.
We could certainly use a better hashing function than MurmurHash3, but the best choice is almost definitely not CRC32.
I tried something along those lines in WIP/RFC: Adding EgalDict{K,V} (aka ObjectIdDict{K,V}) by mauro3 · Pull Request #24932 · JuliaLang/julia · GitHub. I think an approach like that to dicts in Julia would be good (although that PR was lacking in other departments).
A lot of the discussion on Stack Overflow compared many different hashes, and a number of people did recommend CRC32C.
Until such a time as a better hash algorithm for Julia is decided upon (possibly FarmHash), do you think a PR to replace the current usage of MurmurHash3.c and replace it with the Julia version I have written would be well received? (You can look at the code in JuliaString/Strs.jl/src/murmurhash3.jl).
Besides simply translating the MurmurHash3 algorithm, I wrote code to hash generic AbstractString strings efficiently, by building up in registers a UTF-8 encoding on the fly, and also to more efficiently deal with unaligned strings. (I do have a little more work to do there, to write the unaligned version for 32-bit platforms, and to do some testing for 32-bit).
It’s nicely faster than the C version (the C version wasn’t really all that well written, it was doing things like XOR bytes into 64-bit values where it was known to be 0 already, just to pick up the trailing bytes, in a big switch, using falling through).
I also noticed something that I think might be a flaw in the design of MurmurHash3, but I’m not sure.
In the operation performed on each block, it is calculating h1, h2, h3, h4, but instead of doing it in such a way that the new values only depend on the previous round, h4 is calculated using the value of h1 from the current round, not the previous one. Not being a cryptographer, I’m not sure if that was the desired behavior or not, but it does have a performance effect, since it means that the 4 values cannot all be calculated in parallel (it might even cut in half the potential speed of the algorithm).
There are other “mixing” operations, where it does h1 += h2 ; h2 += h1
, which don’t really seem to be the best way of mixing two 64-bit values, you really just end up with h1, h2 = h1 + h2, h1 + 2*h2
, which seems to imply that h2 is not really that different from h1 after that step.
Although the specifics elude me at the moment, I know MurmurHash3 to be faulty in some way that matters us. I looked at this a few years ago and there are some non-cryptographic hashes that otherwise have good properties overall and in specific without found algorithmic clumsiness or sharper edges. And CRC-32 is not for this purpose.
While Julia may choose to embed a cryptographic protocol (hash + signature + a few others), it must provide a [preferably two, each with best features in nonoverlapping purpose] noncryptographic hash for more dynamic use where the question of crypto-strong is not key. When I hash the entries in a small dictionary – well go-ahead and reverse engineer it – I need fast and nonlikely collisions.
Going forward, we can apply the deeply elegant trait facility to allow data structures and packaged types some few choices in hashing.