Efficient file hashing

cjb · March 24, 2017, 1:20pm

Hi all,

I am wondering what a fast way to compute the hash of a file in Julia would be.
I know it’s possible to do hash(readstring(fileName)), but that loads the whole file into memory, whereas the file could be processed sequentially for hashing. Any suggestions are welcome.

Thanks!

mauro3 · March 24, 2017, 1:43pm

I’ve used this
https://github.com/staticfloat/SHA.jl

but I don’t know how efficient it is.
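
For reference, a minimal sketch of hashing a file with SHA.jl without reading it all into memory. It assumes the `sha256(io::IO)` method, which consumes the stream in chunks; `file_sha256` is a hypothetical helper name, not part of the package.

```julia
using SHA  # a standard library in Julia 1.x; a registered package on older versions

# Hypothetical helper: hash a file by streaming it through sha256.
# open(f, path) passes the opened stream to f and closes it afterwards.
function file_sha256(path::AbstractString)
    digest = open(sha256, path)   # 32-byte Vector{UInt8}
    return bytes2hex(digest)      # lowercase hex string
end
```

Calling `file_sha256("somefile.bin")` returns the digest as a hex string. I have not measured how this compares with the CRC-based approaches for speed.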

stevengj · March 24, 2017, 2:34pm

If you just want a checksum, rather than a cryptographically secure hash, Julia 0.6 has a hardware-accelerated CRC-32c checksum function (https://github.com/JuliaLang/julia/pull/18297). Currently it is unexported/undocumented, but that will likely change in the future.

In the meantime, you can do:

Base.crc32c(read(filename))

to read in the whole file and compute the checksum. Alternatively, you can checksum it in chunks by something like:

function checksum(filename, blocksize=16384)
    crc = zero(UInt32)
    open(filename, "r") do f
        while !eof(f)
            crc = Base.crc32c(read(f, blocksize), crc)
        end
    end
    return crc
end

(The answer is independent of blocksize.)

Update: CRC32c checksums were exported in Julia 0.7 and are now available in the CRC32c stdlib. You can checksum a file with using CRC32c; checksum = open(crc32c, filename).
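
With the stdlib, the chunked version above might look like the following sketch. The name `checksum_chunked` is only for illustration; the logic is the same loop, with `crc32c` taken from the CRC32c stdlib instead of `Base`.

```julia
using CRC32c

# Checksum a file in fixed-size chunks, feeding the running CRC
# back in as the second argument on each call.
function checksum_chunked(filename::AbstractString, blocksize::Integer=16384)
    crc = zero(UInt32)
    open(filename, "r") do f
        while !eof(f)
            # read(f, n) returns at most n bytes (fewer at end of file)
            crc = crc32c(read(f, blocksize), crc)
        end
    end
    return crc
end
```

For any positive blocksize this should give the same value as `open(crc32c, filename)`.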

dmbates · March 24, 2017, 5:59pm

I think that memory-mapping the file can make this even faster:

julia> filesize("./ml10m.rda")
11444748

julia> @btime Base.crc32c(Mmap.mmap("./ml10m.rda"))
  1.804 ms (19 allocations: 816 bytes)
0x99373391
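
On Julia 0.7 and later the same idea would presumably be written with the Mmap and CRC32c stdlibs. A minimal sketch (`checksum_mmap` is a hypothetical name, and I have not benchmarked it):

```julia
using Mmap, CRC32c

# Map the file into memory as a Vector{UInt8} and checksum it in one call.
# The OS pages the data in on demand, so no explicit read loop is needed.
checksum_mmap(path::AbstractString) = crc32c(Mmap.mmap(path))
```

Whether this beats a buffered read loop likely depends on the platform and file size, so it is worth timing both on your own data.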

cjb · March 26, 2017, 7:32pm

Thank you all. I think I will stick with SHA.jl, because I can only use v0.5 for this project. Thanks!

Palli · June 7, 2017, 6:56pm

That’s very good to have.

Don’t CPUs now also have instructions for cryptographic hashes? Or only for AES?

I’m not pushing to have anything more implemented, but if so, it could then use those instructions(?).

ScottPJones · June 8, 2017, 5:21am

You might want to try a larger blocksize. We often found that around 1MB was a nice sweet spot for loading on a number of different platforms.
Yesterday I ran some tests: using a 1MB blocksize instead of 16K took 0.8 seconds to CRC a 4GB file, instead of 1.03 seconds.
(Using mmap, which took 1.37 seconds, was also much slower than my version, which allocates a buffer once and uses readbytes! to fill it.)
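
A rough sketch of the buffer-reuse approach, written against the CRC32c stdlib of later Julia versions rather than the exact code I timed. The name `checksum_buffered` and the keyword default are just for illustration.

```julia
using CRC32c

# Allocate one buffer up front and refill it with readbytes!,
# instead of allocating a new array for every chunk.
function checksum_buffered(path::AbstractString; blocksize::Integer=1024 * 1024)
    blocksize > 0 || throw(ArgumentError("blocksize must be positive"))
    buf = Vector{UInt8}(undef, blocksize)
    crc = zero(UInt32)
    open(path, "r") do io
        while !eof(io)
            n = readbytes!(io, buf)             # number of bytes actually read
            crc = crc32c(view(buf, 1:n), crc)   # checksum only the valid bytes
        end
    end
    return crc
end
```

The view matters on the last chunk, where fewer than blocksize bytes are read and the tail of the buffer still holds stale data.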

ScottPJones · June 8, 2017, 5:34am

Note: I just found https://github.com/JuliaLang/julia/pull/22274, and I think there is a bug on line 791:

crc32c(buf::IOBuffer, crc::UInt32=0x00000000) = crc32c(buf.data, crc)

that should be:

crc32c(buf::IOBuffer, crc::UInt32=0x00000000) = unsafe_crc32c(buf.data, buf.size, crc)

so that it doesn’t try to calculate the CRC on garbage data (anything after the valid data in 1:buf.size).

ScottPJones · June 8, 2017, 5:47am

I think there is another potential issue in the code in the PR, at line 806:

    @assert 0 ≤ nb ≤ length(buf)
    return unsafe_crc32c(buf, readbytes!(f, buf, nb), crc)

If eof(f) returns true before nb is <= length(buf) (for example, because the file was truncated after the filesize call), I think the eof(f) test should return the crc directly, instead of falling through to call unsafe_crc32c one more time.
Also, I’ve read that @assert statements won’t necessarily do anything (because assertion checking may be disabled at some point in the future), so @assert should not be used if you want to make sure the check is made.
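
To illustrate both points, here is a sketch of what I mean. This is a hypothetical helper, `crc_chunk`, written against the public `crc32c` from the later CRC32c stdlib, and not the code from the PR.

```julia
using CRC32c

# Hypothetical helper: checksum up to nb bytes from io using buf as scratch space.
function crc_chunk(io::IO, buf::Vector{UInt8}, nb::Integer, crc::UInt32)
    # Explicit check that always runs, unlike @assert which may be disabled.
    0 <= nb <= length(buf) ||
        throw(ArgumentError("nb must be between 0 and length(buf)"))
    # Nothing left to read: return the running CRC unchanged.
    eof(io) && return crc
    n = readbytes!(io, buf, nb)
    return crc32c(view(buf, 1:n), crc)
end
```

The explicit throw costs one comparison per call, which should be negligible next to the read itself.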

ScottPJones · June 8, 2017, 6:31pm

I’m glad to see that the bug I found has been fixed in https://github.com/JuliaLang/julia/pull/22274/commits/f34d08c19fd31605b3b6008530d75d2081cfe869,
but I don’t think that min(buf.size, length(buf.data)) is necessary; buf.size alone should suffice,
since buf.size should never be greater than the size of the buffer.
