Why must the second argument of `hash` come from another hash function?

The documentation for hash contains an example that has the following comment:

# only use the output of another hash function as the second argument

What happens if I just provide any UInt64 value as the second argument?

It is not a problem for me to do it, I just want to understand why.

1 Like

The intention is for the second argument to enable combining hash values. Perhaps a better design would have been for hash to return an opaque value instead of an UInt value:

"""
    HashValue

Constructed by [`hash`](@ref), opaque.
"""
struct HashValue
    v::UInt
end

"""
    hash(::Any, ::HashValue)::HashValue

...
"""
function hash end

Then there would be no possibility for misuse.

That said, sometimes it’s OK to use a constant seed as the second argument for hash, perhaps saving a few operations.

In practice, it can be OK, but best practice is to avoid this. Here’s an example type implementation:

struct OrderedPair{A, B}
    a::A
    b::B
end

function Base.:(==)(l::OrderedPair, r::OrderedPair)
    all(==, (l.a, l.b), (r.a, r.b))
end

const ordered_pair_hash_seed = hash(0x3e44a45c2c9ad95b58036c78cd49f6d6)

function Base.hash(pair::OrderedPair, h::UInt)
    a = hash(pair.a, h)
    b = hash(pair.b, a)
    hash(b, ordered_pair_hash_seed)
end

Here’s a type that might benefit from a different kind of combination of hash values, one that’s commutative:

struct UnorderedPair{A, B}
    a::A
    b::B
end

function Base.:(==)(l::UnorderedPair, r::UnorderedPair)
    all(==, minmax(l.a, l.b), minmax(r.a, r.b))
end

const unordered_pair_hash_seed = hash(0x464ec8ef89d4d738ae8280142b46ee74)

function Base.hash(pair::UnorderedPair, h::UInt)
    a = hash(pair.a, h)
    b = hash(pair.b, h)
    c = xor(a, b)
    hash(c, unordered_pair_hash_seed)
end

This works even for the hypothetical HashValue-based design, as long as we have a method like this, because xor is commutative:

function xor(l::HashValue, r::HashValue)
    xor(l.v, r.v)
end

EDIT: made an issue regarding HashValue: safer, more foolproof `hash`: make `hash` return an opaque value, and take that type as the second argument · Issue #57055 · JuliaLang/julia · GitHub

Thank you for your reply and sorry for not replying sooner. I needed to take some time to understand your reply and be more precise about my situation.

Firstly, I am merely using hash on built in types (Vector{Float64}) and do not need to write a custom hash method.

My situation is that I have a population and want to select the best individuals from it and want to benchmark my selection method by comparing it to selecting individuals at random, so I have made a selection algorithm that selects based on the hash value (I use hash codes rather than random numbers so that I can reuse my existing framework for evaluating the selection method). I don’t have a lot of data, so I need to average several uncorrelated hash functions to get a reliable benchmark. A slightly simplified version of my hash code based selector is:

get_selector(salt::UInt64, fraction to select::Real)
    @assert 0 <= fraction_to_select <= 1
    (population) -> begin
        filter(population) do x
            hash(x, salt) / typemax(UInt64) < fraction_to_select
        end
    end
end

So, according to the comment in the documentation, salt must be generated with a hash function and I am curious to know why and what would happen if I did not respect this and simply did something like:

get_selector.(UInt64.(1:100), 0.1)

or

get_selector.(rand(UInt64, 100), 0.1)

instead of the correct code

get_selector.(hash.(1:100), 0.1)
1 Like

The second option is certainly as good as the third option, that I know. Otherwise, I’m afraid I’m going to call in @Oscar_Smith. In particular, I wonder if functions like x -> hash(x, salts[i]) (for all i) are sufficiently independent even for purely random salts?

One thing to keep in mind is it’s technically correct for hash to return a constant. The way hash is intended to be used, returning a constant would cause performance issues (because of hash collisions in hash tables), but not correctness issues. But you say you only hash Vector{Float64}, so perhaps hash is good enough for your use case, depending on the specific implementation.

Have you considered reaching for something like SHA.jl? There are other registered possibly relevant packages, too.

1 Like

I think I would probably use get_selector.(hash.(1:100), 0.1). The reason for documenting that the 2nd argument (when not 0) should be the result of a hash, is that it means that you don’t need to mix that argument up at all. You can just assume it’s a roughly uniformly distributed UInt. For example, the way an Int gets hashed is

hash(x::Int64,  h::UInt) = hash_uint64(bitcast(UInt64, x)) - 3h

so if you call this with consecutive numbers, you will obviously get highly correlated results.

2 Likes

Thank you, you are right, it is better to use a precisely defined and well documented algorithm like SHA for this. Julias algorithm may change in the future and properties like how random the output is are not guaranteed in the documentation.

1 Like