I often have to read and process large binary data files consisting of relatively small consecutive records (< 30 kB). If I read each record one after the other, the read performance lags far behind what the storage device can deliver for larger batches, especially on external disk drives.
The following MWE shows this for a Samsung T7 SSD on Windows (Julia 1.8.2):
recordsize = 32000
totalsize = 10000 * recordsize
batch = 100                               # number of records per batched read
buffer1 = zeros(UInt8, recordsize)        # holds a single record
buffer2 = zeros(UInt8, recordsize*batch)  # holds a whole batch of records
function readbuffer(io::IOStream, buffer, n)
    # read n consecutive blocks of length(buffer) bytes each
    for i in 1:n
        readbytes!(io, buffer)
    end
end
io = open("some very large file.dat")
readbuffer(io, buffer1, 1) # force compilation
@time readbuffer(io, buffer1, div(totalsize, recordsize))
@time readbuffer(io, buffer2, div(totalsize, recordsize*batch))
close(io)
(To reproduce, make sure the file is not already in the OS file cache! Subsequent runs will only show your RAM speed…)
The reported times correspond to about 160 MB/s and 430 MB/s, respectively. The ratio gets much worse for rotating media.
My current solution is to allocate a large buffer and fill it from a separate thread. But using it is quite cumbersome because of all the extra synchronization handling, and the code has to be specially adapted.
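For illustration, here is a minimal sketch of that idea using a Channel of pre-read chunks instead of manual locking (prefetch_chunks is just a placeholder name, and this is a simplified version, not my actual code):

function prefetch_chunks(path, chunksize; depth = 4)
    # spawn = true lets the reader task run on another thread (if Julia was started with threads)
    Channel{Vector{UInt8}}(depth; spawn = true) do ch
        open(path) do io
            while !eof(io)
                # read ahead while the consumer is still processing earlier chunks
                put!(ch, read(io, chunksize))
            end
        end
    end
end

for chunk in prefetch_chunks("some very large file.dat", recordsize * batch)
    # split chunk into individual records and process them here
end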
Wouldn’t it be better to make the buffer size of an IOStream configurable? At the moment it is quite small (32 kB, if I remember correctly). Some time ago there was a suggestion on Stack Overflow (can we customize the file open buffer size in Julia - Stack Overflow) that looked not too complicated, but it is beyond my skills as a pure Julia user to judge whether it makes sense…
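In the meantime, a user-level workaround (assuming an extra dependency is acceptable) might be to wrap the stream with BufferedStreams.jl, which lets you choose the buffer size yourself; I haven’t benchmarked whether this actually closes the gap:

using BufferedStreams

io = BufferedInputStream(open("some very large file.dat"), 4 * 1024 * 1024)  # 4 MiB read buffer
record = zeros(UInt8, recordsize)
while !eof(io)
    read!(io, record)   # still reads record-sized pieces, but from the large buffer
    # process record here (assumes the file size is a multiple of recordsize)
end
close(io)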
Maybe this is only a problem on Windows, but I can’t test on Linux, which may do better read-ahead.