A question on hashes


#1

For an algorithm of mine, I am looking at hashes in order to reset (srand) the random number generator.

I am applying hash iteratively, i.e. hash(x,hash(y,hash(z,hash(a)))), but the results seem to differ even though the individual hashes are the same.

More precisely, I have two concurrent Julia sessions running and the result of a similar hash expression does not match. Is this expected?

I read somewhere that the address of an object would matter. Is that the case?
Notably in my case x is a DataFrame.

I can get what I want by iterating over all columns of the DataFrame (last lines of my code). But this is more cumbersome.

Why is doesnotmatch not the same in both sessions?

Please see the yellow markings on the screenshot.

[EDIT: I cannot share the data, I could try to find an MWE, if that is needed to answer my question]


julia> hw=0x220926c324d7bb27
0x220926c324d7bb27

julia> hd=0xcbb7fba98e0ff596
0xcbb7fba98e0ff596

julia> hn=0xe1c9b71b1ae3b261
0xe1c9b71b1ae3b261

julia> hf=0x213b20190172ee15
0x213b20190172ee15

julia>

julia> @assert hash(dtmtable.weight)==hw

julia> @assert hash(dtmtable.denominator)==hd

julia> @assert hash(dtmtable.numerator)==hn

julia> @assert hash(dtmtable.features)==hf

julia>

julia> hash(hash(dtmtable.numerator,hash(dtmtable.denominator,hash(dtmtable.weight))))
0x18566e5352609be5

julia> doesnotmatch=hash(dtmtable.features,hash(dtmtable.numerator,hash(dtmtable.denominator,hash(dtmtable.weight))))
0x9e53050d52b56580

julia> typeof(dtmtable.features)
DataFrames.DataFrame

julia> hash(dtmtable.features)
0x213b20190172ee15

julia> versioninfo()
Julia Version 0.6.1
Commit 0d7248e2ff* (2017-10-24 22:15 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, haswell)

julia> Pkg.installed("DataFrames")
v"0.11.2"

julia>

julia> s=hash(dtmtable.numerator,hash(dtmtable.denominator,hash(dtmtable.weight)))
0xc4d8482b839696fe

julia> for x=1:size(dtmtable.features,2)
           s=hash(dtmtable.features[x],s)
       end

julia> s #matches which is fine
0x207ad236952c3bdb


#2

I’m having trouble following your example, but could this be simply because hash() falls back to hashing the object ID, which is different for every instance of an object and will be different across Julia sessions?

   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: https://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.6.1 (2017-10-24 22:15 UTC)
 _/ |\__'_|_|_|\__'_|  |  Official http://julialang.org/ release
|__/                   |  x86_64-pc-linux-gnu

julia> mutable struct Foo
         x::Int
       end

julia> hash(Foo(1))
0x10f59811e4cbd170

julia> hash(Foo(1))
0xc6ad3c50a355bc90

   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: https://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.6.1 (2017-10-24 22:15 UTC)
 _/ |\__'_|_|_|\__'_|  |  Official http://julialang.org/ release
|__/                   |  x86_64-pc-linux-gnu

julia> mutable struct Foo
         x::Int
       end

julia> hash(Foo(1))
0x8a2e20644a548a12

julia> hash(Foo(1))
0xde39b002d1cbcd43
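
If that is the cause, one way to get a reproducible hash is to define the two-argument method in terms of the object's contents. The following is only a minimal sketch: the type Bar is hypothetical and used purely for illustration, and real implementations often also mix in a type-specific constant.

```julia
# Hypothetical type, for illustration only.
mutable struct Bar
    x::Int
end

# Define the two-argument method in terms of the field contents.
# Base's one-argument hash(x) calls the two-argument method with a default seed,
# so defining this one method covers both forms.
Base.hash(b::Bar, h::UInt) = hash(b.x, h)

# Keep == consistent with hash: equal objects must have equal hashes.
Base.:(==)(a::Bar, b::Bar) = a.x == b.x

hash(Bar(1)) == hash(Bar(1))   # true: the hash no longer depends on the instance
```

With this definition the value depends only on the field x, so it should no longer change from one instance to the next.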

#3

Good catch, that’s a bug in DataFrames. Note that for the initial @assert calls, you are calling the one-argument hash method, while for the combined computation below you are using the two-argument version, which can differ. You should call hash(dtmtable.weight, zero(UInt)) and so on to ensure you are calling the same functions.
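
To illustrate how the two-argument form chains: the second argument is a running hash value that is mixed into the result, so nesting the calls is the same as folding over the values one at a time. A minimal sketch, where chained_hash and the sample columns are hypothetical names made up for this example:

```julia
# Hypothetical helper: fold the two-argument hash over several values,
# always passing the running hash as the second argument.
function chained_hash(values, seed::UInt = zero(UInt))
    h = seed
    for v in values
        h = hash(v, h)
    end
    return h
end

# Made-up sample data standing in for the columns of a table.
cols = ([1, 2, 3], [0.5, 1.5], ["a", "b"])

# The loop is equivalent to the nested two-argument calls.
chained_hash(cols) == hash(cols[3], hash(cols[2], hash(cols[1], zero(UInt))))  # true
```

Every step here goes through the two-argument method, so the result can only be reproducible if that method is defined in terms of the contents for each value being hashed.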

In the present case, the bug is that DataFrames only defines the one-argument version of hash. See this pull request, which you can try via Pkg.checkout("DataFrames", "nl/hash").


#4

Apologies. My problem was not well formulated (I did not fully understand the difference between hash with one argument and with two arguments). Actually, I still do not quite know how things are mixed when it is called with two arguments.
Luckily nalimilan understood my problem and the underlying issue, which she (or he) also fixed.

For completeness: below would be the MWE. If you start up Julia several times (or several parallel sessions) you will get a different result each time (which is not the case if the hash is correctly defined; you could try it with PooledArray for instance).

I think in your example the hash varies because it has not been explicitly defined (in what I consider a meaningful way) for the new type.

using DataFrames; hash(DataFrame([1]), UInt(0))