Persistent hash

Hello everybody

I work with continually updated time series and do intensive calculations on them. Say the data is in a vector x. Sometimes part of the old data changes when an instrument is recalibrated, and I need to redo old calculations. My code was in R, where I used sha1() from the digest package to see whether x[t1:t2] had changed; if so, I redid the calculation for x[t1:t2] and stored the new hash to check against later.

Julia has hash(), which does the trick, but it is not reliable, as the docs say: "The hash value may change when a new Julia process is started." I could use SHA, but that requires converting a vector into a string, which is very slow and very inelegant. One could use some summary stats, perhaps along with sha1(), but that is unreliable and inelegant too.

So I have a question for you all. Is there a hash()-type function that uniquely identifies a vector (or a dataframe, or a dictionary) AND gives the same answer when I rerun the code later, perhaps on a different operating system or Julia version? That is, is there a way to make the Julia SHA behave like the R SHA?

best, j

No it doesn’t, you just need the raw bytes:

SHA.sha1(reinterpret(UInt8, x))

In general, most hash algorithms (SHA, MD5, CRC32, etcetera) work on arrays of bytes. Julia forces you to call reinterpret explicitly because you have to be aware that this depends on the binary representation of the data, not just its numerical values.
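Concretely, a minimal sketch of this approach (the vector contents here are made up for illustration):

```julia
using SHA  # ships with Julia as a standard library

x = Float64[pi, sqrt(2), 5/3]

# reinterpret views the same memory as raw bytes without copying, so the
# digest depends on the IEEE-754 byte representation, not printed values
h = bytes2hex(sha1(reinterpret(UInt8, x)))
```

Because the digest is over raw bytes, any change to the numerical values changes the hash, but so would a change of element type (e.g. Float64 vs. Float32).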


Thanks, I had not seen that. Very helpful; it solves the problem here.

But this does not work with dictionaries or dataframes, and unfortunately, that is what the next code base I will port from R to Julia needs.

hash() works with dictionaries and dataframes, so a version of hash() that gives consistent answers would be perfect.

The reason hash works on everything is unfortunately intimately connected to its possibility of not being persistent — it (sometimes) relies upon internals like pointer addresses, memory layouts, and such.

One option would be to hash a more-persistent canonical representation of the thing if you don’t know of a shortcut like the reinterpret for vectors of bitstypes. Often in the context of results like this, I am saving stuff to files anyways — so just hash the files themselves.
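A sketch of the hash-the-file approach, assuming the results are already being written somewhere (the temp file and its contents here are placeholders for a real results file):

```julia
using SHA

# Hypothetical results file; in practice this is whatever file you
# already save your results to.
path = tempname()
write(path, "t,value\n1,3.14\n2,2.72\n")

# Stream the file through SHA-1: open(f, path) opens the file and
# passes the resulting IO to sha1, then closes it.
filehash = bytes2hex(open(sha1, path))
```

This sidesteps in-memory representation questions entirely: the hash is over the on-disk bytes, which are as persistent as the file format you chose.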


Yes, I do that as well.

I can see the issues; it's just that it makes me sad that something that was easy in R is a pain in Julia. There are always workarounds, but only at the expense of complexity and code. Having to convert a dictionary to JSON, or save a subset of a dataframe to Parquet, only so that Julia can run sha() on that file to track changes in the data is, well, not elegant.

Depending on what you’re doing in R, it looks like R is likely calling serialize on your values for you before taking the SHA1 of them. You can do exactly that in Julia as well.

import SHA, Serialization
function slightly_more_persistent_hash(x)
    b = IOBuffer()
    Serialization.serialize(b, x)
    return SHA.sha1(take!(b))
end

Depending on your use case and needs, you could use a different format. Serialize is fast and general and easy, but it’s also version-dependent.
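For instance, the function above works on containers that have no flat byte layout, such as a Dict (a sketch; note the serialized bytes of a Dict can depend on how it was built up, so treat this as persistent for identically constructed values):

```julia
import SHA, Serialization

# The serialization-based hash from the post above.
function slightly_more_persistent_hash(x)
    b = IOBuffer()
    Serialization.serialize(b, x)
    return SHA.sha1(take!(b))
end

# reinterpret(UInt8, d) would fail here; serialize handles it.
d = Dict(:a => 1, :b => 2)
h = bytes2hex(slightly_more_persistent_hash(d))
```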


Wow, that is super cool. I had no idea. That function will have a prime place in my code. Makes me sad that it is obvious in retrospect.

Thanks again @mbauman

Using Sqids:

julia> using Sqids
julia> config = Sqids.configure();
julia> Sqids.encode(config, reinterpret(UInt64, Float64[pi, sqrt(2), 5/3]))
"X0PEyVSDsa8VbqtxqJPOHBQIbHjrm7TB0sVH"
julia> Sqids.encode(config, reinterpret(UInt64, Float64[pi, sqrt(2), 5/3 + 1e-9]))
"l2eZGopRr9CoiqOPqvh5YXcEGQrd9v3h0jd7"

The hash is even invertible if you know the alphabet, but for my application I chop it down to the first few characters, and that of course is not invertible.