Persistent hash

Hello everybody

I work with continually updated time series and do intensive calculations on them. Say the data is in a vector x. Sometimes part of the old data changes when an instrument is recalibrated, and I need to redo old calculations. My code was in R, where I used sha1() from the digest package to see whether x[t1:t2] had changed; if so, I redid the calculation for x[t1:t2] and stored the new hash to check against later.

Julia has hash(), which does the trick, but it is not reliable, as the docs say: "The hash value may change when a new Julia process is started." I could use SHA, but that requires converting a vector into a string, which is very slow and very inelegant. One could use some summary stats, perhaps along with sha1(), but that is unreliable and inelegant too.

So I have a question for you all. Is there a hash()-type function that uniquely identifies a vector (or a dataframe, or a dictionary) AND gives the same answer when I rerun the code later, perhaps on a different operating system or Julia version? That is, is there a way to make the Julia SHA behave like the R SHA?

best, j

No it doesn’t, you just need the raw bytes:

SHA.sha1(reinterpret(UInt8, x))

In general, most hash algorithms (SHA, MD5, CRC32, etcetera) work on arrays of bytes. Julia forces you to call reinterpret explicitly because you have to be aware that this depends on the binary representation of the data, not just its numerical values.
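Concretely, a minimal sketch of this approach (the vector contents here are made up for illustration):

```julia
using SHA  # ships with Julia as a standard library

x = Float64[pi, sqrt(2), 5/3]

# reinterpret views the same memory as raw bytes without copying, so the
# digest depends on the IEEE-754 byte representation, not printed values
h = bytes2hex(sha1(reinterpret(UInt8, x)))
```

Because the digest is over raw bytes, any change to the numerical values changes the hash, but so would a change of element type (e.g. Float64 vs. Float32).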


Thanks, I had not seen that. Very helpful; it solves the problem here.

But this does not work with dictionaries or dataframes, and unfortunately, that is what the next code base I will port from R to Julia needs.

hash() works with dictionaries and dataframes, so a version of hash() that gives consistent answers would be perfect.

The reason hash works on everything is unfortunately intimately connected to its possibility of not being persistent — it (sometimes) relies upon internals like pointer addresses, memory layouts, and such.

One option would be to hash a more-persistent canonical representation of the thing if you don’t know of a shortcut like the reinterpret for vectors of bitstypes. Often in the context of results like this, I am saving stuff to files anyways — so just hash the files themselves.
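A sketch of the hash-the-file approach, assuming the results are already being written somewhere (the temp file and its contents here are placeholders for a real results file):

```julia
using SHA

# Hypothetical results file; in practice this is whatever file you
# already save your results to.
path = tempname()
write(path, "t,value\n1,3.14\n2,2.72\n")

# Stream the file through SHA-1: open(f, path) opens the file and
# passes the resulting IO to sha1, then closes it.
filehash = bytes2hex(open(sha1, path))
```

This sidesteps in-memory representation questions entirely: the hash is over the on-disk bytes, which are as persistent as the file format you chose.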


Yes, I do that as well.

I can see the issues; it's just that it makes me sad that something that was easy in R is a pain in Julia. There are always workarounds, but only at the expense of complexity and code. Having to convert a dictionary to JSON, or save a subset of a dataframe to Parquet, only so that Julia can run sha() on that file to track changes in the data is, well, not elegant.

Depending on what you’re doing in R, it looks like R is likely calling serialize on your values for you before taking the SHA1 of them. You can do exactly that in Julia as well.

import SHA, Serialization
function slightly_more_persistent_hash(x)
    b = IOBuffer()
    Serialization.serialize(b, x)
    return SHA.sha1(take!(b))
end

Depending on your use case and needs, you could use a different format. Serialize is fast and general and easy, but it’s also version-dependent.
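For instance, the function above works on containers that have no flat byte layout, such as a Dict (a sketch; note the serialized bytes of a Dict can depend on how it was built up, so treat this as persistent for identically constructed values):

```julia
import SHA, Serialization

# The serialization-based hash from the post above.
function slightly_more_persistent_hash(x)
    b = IOBuffer()
    Serialization.serialize(b, x)
    return SHA.sha1(take!(b))
end

# reinterpret(UInt8, d) would fail here; serialize handles it.
d = Dict(:a => 1, :b => 2)
h = bytes2hex(slightly_more_persistent_hash(d))
```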


Wow, that is super cool. I had no idea. That function will have a prime place in my code. Makes me sad that it is obvious in retrospect.

Thanks again @mbauman

Using Sqids:

julia> using Sqids
julia> config = Sqids.configure();
julia> Sqids.encode(config, reinterpret(UInt64, Float64[pi, sqrt(2), 5/3]))
"X0PEyVSDsa8VbqtxqJPOHBQIbHjrm7TB0sVH"
julia> Sqids.encode(config, reinterpret(UInt64, Float64[pi, sqrt(2), 5/3 + 1e-9]))
"l2eZGopRr9CoiqOPqvh5YXcEGQrd9v3h0jd7"

The hash is even invertible if you know the alphabet, but for my application I chop it down to the first few characters, and that of course is not invertible.