I am creating a function to write binary data (a data array) to a file.
At first I simply reversed my reading function (which reads the data more or less one byte at a time) into a sequential write, but noticed that it took a surprisingly long time to finish.
Then I tried bulk writing the whole array in one call.
This made it orders of magnitude faster. However, for larger files (>100 MB) the bulk write also slowed down.
So in the third iteration I experimented with batched writes of different sizes.
The benchmarking results are below:
You can see that above 100 MB, the batch approach seems to be better by almost an order of magnitude.
This pattern repeats on different types of drives, and it doesn't seem to me that it could simply be an effect of drive caching.
I assume the difference between sequential and bulk writes stems from the overhead of the individual write calls (see the write_seq_buffered sketch at the end), but I was wondering whether this batch speed-up is typical for Julia / computers in general, or whether I am doing it in a suboptimal way.
Maybe someone here has ideas or suggestions?
Tests were done on Julia 1.7.2 and Windows 11.
The code for the MWE is below:
using BenchmarkTools
function write_seq(array)
    fid = open("./test.dat", "w")
    for i in array
        write(fid, i)
    end
    close(fid)
end
function write_bulk(array)
    fid = open("./test2.dat", "w")
    write(fid, array)
    close(fid)
end
function write_bulk_batch(array, batch)
    fid = open("./test3.dat", "w")
    iter = length(array) ÷ batch  # assumes length(array) is divisible by batch
    for i = 1:iter
        write(fid, array[1+batch*(i-1):batch*i])  # write the i-th batch
    end
    close(fid)
end
sizes = [
    1_000,
    10_000,
    100_000,
    1_000_000
]
batches = [
    1_000,
    10_000,
    100_000
]
seq = []
bulk = []
bulkb = [[],[],[]]
for asize in sizes
    array = rand(UInt8, asize)
    s = @benchmark write_seq($array) seconds=10
    push!(seq, median(s).time)
    b = @benchmark write_bulk($array) seconds=10
    push!(bulk, median(b).time)
    for (i, batch) in enumerate(batches)
        if batch >= asize
            push!(bulkb[i], 0)
        else
            bb = @benchmark write_bulk_batch($array, $batch) seconds=10
            push!(bulkb[i], median(bb).time)
        end
    end
end
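For reference, the raw medians from BenchmarkTools are in nanoseconds, so something like the following can be used to turn them into throughput numbers; the throughput_mbs helper is only an illustration for the write-up and is not part of the benchmark itself:

# Hypothetical helper, just for presentation: BenchmarkTools reports times in
# nanoseconds, so convert a (bytes, nanoseconds) pair into MB/s.
throughput_mbs(nbytes, t_ns) = nbytes / (t_ns / 1e9) / 1e6

for (i, asize) in enumerate(sizes)
    println("size = $asize bytes: ",
            "seq = $(round(throughput_mbs(asize, seq[i]), digits=2)) MB/s, ",
            "bulk = $(round(throughput_mbs(asize, bulk[i]), digits=2)) MB/s")
end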
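And to make the call-overhead hypothesis above concrete: a variant that does the same byte-by-byte loop, but against an in-memory IOBuffer, and only issues a single write to the file at the end should behave like write_bulk if the hypothesis is right. This is an untested sketch and is not part of the benchmarks above:

# Untested sketch: byte-by-byte writes go to an in-memory buffer (cheap),
# and the file sees one large write at the end. The file name is arbitrary.
function write_seq_buffered(array)
    buf = IOBuffer()
    for i in array
        write(buf, i)
    end
    fid = open("./test4.dat", "w")
    write(fid, take!(buf))
    close(fid)
end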