Efficient file hashing

binaryio

#1

Hi all,

I am wondering what a fast way to compute the hash of a file in Julia would be.
I know it’s possible to do hash(readstring(fileName)), but that loads the whole file into memory, whereas the file could be processed sequentially for hashing. Any suggestions are welcome.

Thanks!


#2

I’ve used this

but I don’t know how efficient it is.


#3

If you just want a checksum, rather than a cryptographically secure hash, Julia 0.6 has a hardware-accelerated CRC-32c checksum function (https://github.com/JuliaLang/julia/pull/18297). Currently it is unexported/undocumented, but that will likely change in the future.

In the meantime, you can do:

Base.crc32c(read(filename))

to read in the whole file and compute the checksum. Alternatively, you can checksum it in chunks by something like:

function checksum(filename, blocksize=16384)
    crc = zero(UInt32)
    open(filename, "r") do f
        while !eof(f)
            crc = Base.crc32c(read(f, blocksize), crc)
        end
    end
    return crc
end

(The answer is independent of blocksize.)


#4

I think that memory-mapping the file can make this even faster:

julia> filesize("./ml10m.rda")
11444748

julia> @btime Base.crc32c(Mmap.mmap("./ml10m.rda"))
  1.804 ms (19 allocations: 816 bytes)
0x99373391
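
For anyone trying this on a newer Julia, here is a minimal sketch of the same idea. It assumes a Julia version where CRC32c and Mmap are standard libraries rather than part of Base (on 0.6 the calls are Base.crc32c and Mmap.mmap, as above). The name mmap_checksum is made up for illustration.

```julia
using CRC32c, Mmap

# Hypothetical helper: CRC-32c of a file via a memory map,
# so the file is not copied into a freshly allocated array.
function mmap_checksum(filename::AbstractString)
    # Skip the mapping for an empty file; the CRC-32c of no data is 0.
    filesize(filename) == 0 && return 0x00000000
    return crc32c(Mmap.mmap(filename))
end
```

Whether this beats a buffered read seems to depend on the platform and file size, so it is worth benchmarking on your own data.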

#5

Thank you all. I think I will stick with SHA.jl, because I can only use v0.5 for this project. Thanks!
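
For reference, a minimal sketch of hashing a file with SHA without loading it all at once. It assumes a version of SHA (package or standard library) whose sha256 accepts an IO stream; I have not checked this on v0.5. The name file_sha256 is made up for illustration.

```julia
using SHA

# Hypothetical helper: hex-encoded SHA-256 digest of a file.
# sha256(io) reads the stream in chunks, so the whole file is
# never held in memory at once.
file_sha256(filename::AbstractString) = bytes2hex(open(sha256, filename))
```

The other functions in the package (sha1, sha512, and so on) should be usable the same way.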


#6

That’s very good to have.

Don’t CPUs now also have instructions for cryptographic hashes? Or only for AES?

I’m not pushing to have anything more implemented, but if so, it could use those instructions(?).


#7

You might want to try a larger blocksize. We often found that around 1MB was a sweet spot for loading on a number of different platforms.
Yesterday I ran some tests: with a 1MB blocksize instead of 16K, it took 0.8 seconds to CRC a 4GB file instead of 1.03 seconds.
(Using mmap was also much slower than my version, which allocates a buffer and fills it with readbytes!. The mmap version took 1.37 seconds.)
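
A rough sketch of the buffer-reusing approach described above, not the exact code that was timed. It assumes a Julia version with the CRC32c standard library (on 0.6, substitute Base.crc32c); the name checksum_buffered and the 1MB default are illustrative.

```julia
using CRC32c

# Hypothetical helper: CRC-32c of a file, reusing one preallocated buffer.
function checksum_buffered(filename::AbstractString, blocksize::Integer=2^20)
    blocksize > 0 || throw(ArgumentError("blocksize must be positive"))
    buf = Vector{UInt8}(undef, blocksize)
    open(filename, "r") do f
        crc = 0x00000000
        while !eof(f)
            n = readbytes!(f, buf)             # fill buf; n is the number of bytes read
            crc = crc32c(view(buf, 1:n), crc)  # checksum only the valid bytes
        end
        crc
    end
end
```

Compared with read(f, blocksize), this avoids allocating a new array for every block, and the view keeps stale bytes at the end of the buffer out of the checksum when the last block is short.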


#8

Note: I just found https://github.com/JuliaLang/julia/pull/22274, and I think there is a bug on line 791:

crc32c(buf::IOBuffer, crc::UInt32=0x00000000) = crc32c(buf.data, crc)

that should be:

crc32c(buf::IOBuffer, crc::UInt32=0x00000000) = unsafe_crc32c(buf.data, buf.size, crc)

so that it doesn’t try to calculate the CRC on garbage data (after the valid data 1:buf.size).


#9

I think there is another potential issue in the code in the PR, at line 806:

    @assert 0 ≤ nb ≤ length(buf)
    return unsafe_crc32c(buf, readbytes!(f, buf, nb), crc)

If eof(f) returns true while nb is still greater than length(buf) (because the file was truncated after the filesize call, for example), the assertion no longer holds.
I think that the test for eof(f) should return directly with the crc, instead of falling through to call unsafe_crc32c one more time.
Also, I’ve read that @assert statements won’t necessarily do anything (because assertion checking might be disabled at some point in the future), so @assert should not be used if you want to make sure the check is always made.
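
To illustrate both points, here is a sketch of how I would structure such a function. This is not the code from the PR; it assumes a Julia version with the CRC32c standard library, and crc32c_upto and its blocksize keyword are made-up names.

```julia
using CRC32c

# Hypothetical helper: checksum at most nb bytes read from io.
function crc32c_upto(io::IO, nb::Integer, crc::UInt32=0x00000000;
                     blocksize::Integer=24576)
    # Explicit checks instead of @assert, so they cannot be disabled.
    nb ≥ 0 || throw(ArgumentError("number of bytes must be ≥ 0, got $nb"))
    blocksize > 0 || throw(ArgumentError("blocksize must be positive"))
    buf = Vector{UInt8}(undef, min(nb, blocksize))
    # Stop as soon as either nb bytes are consumed or the stream ends,
    # so an early EOF just returns the crc computed so far.
    while nb > 0 && !eof(io)
        n = readbytes!(io, buf, min(nb, length(buf)))
        crc = crc32c(view(buf, 1:n), crc)
        nb -= n
    end
    return crc
end
```

With this shape there is no final call after the loop, so a file that turns out shorter than expected never reaches the checksum routine with an inconsistent length.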


#10

I’m glad to see that the bug I found has been fixed in https://github.com/JuliaLang/julia/pull/22274/commits/f34d08c19fd31605b3b6008530d75d2081cfe869,
but I don’t think that min(buf.size, length(buf.data)) is necessary; simply buf.size should suffice.
buf.size should never be greater than the length of the underlying buffer.