What is a fast way to compute the hash of a file in Julia?
I know it’s possible to do hash(readstring(fileName)), but that loads the whole file into memory, whereas the file could be processed sequentially for hashing. Any suggestions are welcome.
If you just want a checksum, rather than a cryptographically secure hash, Julia 0.6 has a hardware-accelerated CRC-32c checksum function (https://github.com/JuliaLang/julia/pull/18297). Currently it is unexported/undocumented, but that will likely change in the future.
In the meantime, you can do:
Base.crc32c(read(filename))
to read in the whole file and compute the checksum. Alternatively, you can checksum it in chunks by something like:
function checksum(filename, blocksize=16384)
    crc = zero(UInt32)
    open(filename, "r") do f
        # Read up to blocksize bytes at a time, carrying the running CRC forward.
        while !eof(f)
            crc = Base.crc32c(read(f, blocksize), crc)
        end
    end
    return crc
end
(The answer is independent of blocksize.)
Update: CRC32c checksums were exported in Julia 0.7 and are now available in the CRC32c stdlib. You can checksum a file with using CRC32c; checksum = open(crc32c, filename).
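The claim that the result does not depend on blocksize can be checked on an in-memory buffer. This is a minimal sketch, assuming a Julia version where CRC32c is a standard library; chunked_crc32c is a hypothetical helper name used only for illustration:

```julia
using CRC32c

# Hypothetical helper: checksum a byte vector in pieces of `blocksize` bytes,
# passing the running CRC into each call.
function chunked_crc32c(data::Vector{UInt8}, blocksize::Integer)
    crc = zero(UInt32)
    for start in 1:blocksize:length(data)
        stop = min(start + blocksize - 1, length(data))
        crc = crc32c(view(data, start:stop), crc)
    end
    return crc
end

data = Vector{UInt8}("The quick brown fox jumps over the lazy dog")

# One call over the whole buffer versus several calls over 7-byte chunks.
same = chunked_crc32c(data, 7) == crc32c(data)
```

Here same should be true for any positive blocksize, since each call simply continues the checksum from the previous chunk.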
You might want to try a larger blocksize. We have often found that around 1MB is a sweet spot for loading on a number of different platforms.
Yesterday I ran some tests: with a 1MB blocksize, computing the CRC of a 4GB file took 0.8 seconds, compared with 1.03 seconds with a 16K blocksize.
(Using mmap was also much slower, at 1.37 seconds, than my version, which allocates a buffer and uses readbytes! to fill it.)
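The poster's buffered version is not shown in the thread. A rough sketch of the approach described (one preallocated buffer refilled with readbytes!) might look like the following, assuming the CRC32c standard library; checksum_buffered is a hypothetical name, and the timings above were not measured with this code:

```julia
using CRC32c

# Hypothetical sketch: reuse a single buffer instead of allocating a new
# array for every block read from the file.
function checksum_buffered(filename::AbstractString, blocksize::Integer=1024^2)
    buf = Vector{UInt8}(undef, blocksize)
    return open(filename, "r") do f
        crc = zero(UInt32)
        while !eof(f)
            n = readbytes!(f, buf)             # number of bytes actually read
            crc = crc32c(view(buf, 1:n), crc)  # checksum only the filled part
        end
        crc
    end
end
```

Only the first n bytes of the buffer are checksummed on each pass, so a short final block does not pull in stale data from the previous read.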
If eof(f) returns true before nb is <= length(buf) (because the file was truncated after the filesize call, for example), I think the eof(f) test should return the crc directly, instead of falling through to call unsafe_crc32c one more time.
Also, I’ve read that @assert statements won’t necessarily do anything (because assertion checking might be disabled at some point in the future), so @assert should not be used if you want to guarantee that the check is made.
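As a sketch of that last point, a check that must always run can be written as an explicit test that throws, rather than as an @assert. The function name check_blocksize and the choice of ArgumentError are assumptions for illustration, not taken from the code under discussion:

```julia
# Hypothetical validation helper: the condition is always evaluated,
# regardless of whether assertions are enabled.
function check_blocksize(blocksize::Integer)
    blocksize > 0 || throw(ArgumentError("blocksize must be positive, got $blocksize"))
    return blocksize
end
```

Calling check_blocksize(0) throws an ArgumentError, while check_blocksize(16384) returns its argument unchanged.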