Performance of writing binary files

I am creating a function to write binary data (a data array) to a file.

First, I simply reversed the reading function (which reads the data essentially one byte at a time) and performed a sequential write, but noticed it took a surprisingly long time to finish.

Then I tried bulk-writing the whole array in one call.
This made it orders of magnitude faster. However, for larger files (>100 MB) it also experienced a slowdown.
So in the third iteration I experimented with batch writes of different sizes.

The results of benchmarking are below:

You can see that above 100 MB, the batch approach seems to be better by almost an order of magnitude.
This pattern repeats on different types of drives, and it doesn't seem to me that it could simply be an effect of drive caching.
I assume the difference between sequential and bulk stems from the overhead of the write function calls, but I was wondering whether this batch speed-up is typical for Julia / computers in general, or whether I am doing it in a suboptimal way.

Maybe someone here has any ideas, suggestions?

Tests were done on Julia 1.7.2 and Windows 11.
The code for MWE is below:

using BenchmarkTools

function write_seq(array)
    fid = open("./test.dat", "w")
    for i in array
        write(fid, i)
    end
    close(fid)
end
function write_bulk(array)
    fid = open("./test2.dat", "w")
    write(fid, array)
    close(fid)
end
function write_bulk_batch(array, batch)
    fid = open("./test3.dat", "w")

    iter = div(length(array), batch)  # assumes batch divides length(array)
    for i = 1:iter
        write(fid, array[1+batch*(i-1):batch*i])
    end
    close(fid)
end
# Benchmark parameters (the original values were cut off; these are illustrative)
sizes = [10^6, 10^7, 10^8, 10^9]
batches = [10^4, 10^5, 10^6]

seq = []
bulk = []
bulkb = [[],[],[]]

for asize in sizes
    array = rand(UInt8, asize)
    s = @benchmark write_seq($array) seconds=10
    push!(seq, median(s).time)
    b = @benchmark write_bulk($array) seconds=10
    push!(bulk, median(b).time)

    for (i, batch) in enumerate(batches)
        if batch >= asize
            push!(bulkb[i], 0)
            continue
        end
        bb = @benchmark write_bulk_batch($array, $batch) seconds=10
        push!(bulkb[i], median(bb).time)
    end
end

You could try using the keyword argument lock=false in open; maybe the overhead of locking on each write is one factor slowing down the sequential calls.
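For anyone trying this, a minimal sketch of the suggestion (the function and file names here are just examples, not from the post above):

```julia
# Sequential byte-by-byte write with stream locking disabled.
# The `lock` keyword of `open` is available since Julia 1.5.
function write_seq_nolock(array)
    open("./test_nolock.dat", "w"; lock = false) do fid
        for i in array
            write(fid, i)
        end
    end
end

write_seq_nolock(rand(UInt8, 1024))
```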


That was definitely one thing slowing down the sequential write!
Turning off the lock gave a stable 5x speed-up, making it a reasonable solution if one has no other choice.

Still, bulk write is much faster for files below 20-50 MB.

When I glanced at the Base IO code, it seems Julia specializes writing byte arrays by dispatching directly to a function called unsafe_write(), as opposed to the generic write() used in other cases. Both of them end up calling a C function, so this is as far as I can go. :slight_smile:
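For illustration, that fast path can be called explicitly with `Base.unsafe_write(io, ptr, nbytes)`; a sketch (file name is just an example):

```julia
# Write a byte array by handing its raw pointer straight to unsafe_write,
# which is roughly what write(fid, data::Vector{UInt8}) dispatches to.
data = rand(UInt8, 10)
open("./test_unsafe.dat", "w") do fid
    # GC.@preserve keeps `data` alive while we hold a raw pointer into it.
    GC.@preserve data unsafe_write(fid, pointer(data), length(data))
end
```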

Interestingly, writing in batches seems most efficient when there is a small number of them (10 worked best for me), as long as a single batch does not exceed the 20-50 MB limit. After that it gets hit by the same penalty as bulk write.
So for bigger files, when you want to write as fast as possible, 10 MB batches were the fastest in my tests.
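A related note: `array[range]` allocates a copy of each batch before writing it; writing a `@view` instead avoids that copy. A sketch of a batched writer along those lines (not benchmarked against the numbers above, and it also handles a trailing partial batch):

```julia
# Batched write using views, so no intermediate copies are allocated.
function write_bulk_batch_view(array, batch)
    open("./test_view.dat", "w") do fid
        for start in 1:batch:length(array)
            stop = min(start + batch - 1, length(array))
            write(fid, @view(array[start:stop]))  # view, not a copied slice
        end
    end
end
```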

Below is a graph comparing sequential (with and without lock), bulk, 10k batches, and batch = 1/10 of the file size.