Performance of Memory Mapped Arrays (vs. JLD2)

I need to generate a very large matrix (too big for memory), so I’m using memory mapped arrays.

I’m trying to benchmark the code and understand some of the variation in run times. Right now, I’m looking at the time it takes to write the files to disk, so I’m generating random numbers to test the write times.

I generate the random numbers, save them as a JLD2 file (timed), and also insert them into a memory mapped array (timed). I know this isn’t an entirely fair comparison, but I need something to compare the memory mapped times to.

To give you an idea of how I’m benchmarking, it looks something like:

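using JLD2

# mmap_array is the 6-D memory mapped array, created beforehand
for i = 1:101
    write = rand(200, 510, 61, 6, 11)   # about 3.3 GB of Float64 per iteration

    @save string("test_",i,".jld2") write   # timed

    mmap_array[:,:,:,:,:,i] = write   # timed
end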
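A self-contained, scaled-down version of this benchmark might look like the sketch below. The dimensions, the number of slices, and the file names are hypothetical. To keep it runnable with only the standard library, a plain `write` of the raw bytes stands in for the JLD2 save, so it is not the same comparison as above. `Mmap.sync!` is called inside the timed region so that flushing to disk is counted.

```julia
using Mmap

# Hypothetical, scaled-down dimensions (the real slices are 200×510×61×6×11).
dims = (20, 51, 6, 6, 11)
nslices = 5

dir = mktempdir()
mmap_path = joinpath(dir, "mmap_array.bin")

io = open(mmap_path, "w+")
# grow=true (the default) extends the new file to the full size of the array.
mmap_array = Mmap.mmap(io, Array{Float64,6}, (dims..., nslices))

plain_times = Float64[]
mmap_times = Float64[]

for i in 1:nslices
    slice = rand(dims...)

    # Stand-in for the JLD2 save: dump the raw bytes to a separate file.
    t_plain = @elapsed write(joinpath(dir, "test_$i.bin"), slice)

    # Copy into the mapped array, then flush so the disk write is timed too.
    t_mmap = @elapsed begin
        mmap_array[:, :, :, :, :, i] = slice
        Mmap.sync!(mmap_array)
    end

    push!(plain_times, t_plain)
    push!(mmap_times, t_mmap)
end

close(io)
println("plain write times (s): ", plain_times)
println("mmap write times (s):  ", mmap_times)
```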

The results of the timing (in seconds) are:
average jld write time: 11.469541902542113
std dev of jld write times: 7.168669045766888

average mmap write time: 23.152635397911073
std dev of mmap write times: 15.751597801816262

(I drop the values from the first iteration due to the added compile times, so the sample size is 100.)
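For reference, the averages and standard deviations can be computed along these lines. The timings in this sketch are made-up placeholder values, not the measurements reported above.

```julia
using Statistics

# Hypothetical timings in seconds; the first entry includes compilation.
times = [30.2, 11.1, 12.4, 10.9, 11.8]

steady = times[2:end]   # drop the first, compile-heavy iteration
println("average write time: ", mean(steady))
println("std dev of write times: ", std(steady))   # sample std dev (n - 1)
```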

I should also note the mmap case had fewer allocations. Here is the @time output from the last iteration.

save JLD2 file
 11.887770 seconds (82 allocations: 7.891 KiB)
insert array into mmap array
 17.974852 seconds (49 allocations: 1.516 KiB)

Surprisingly (at least to me), the JLD2 writing is much faster, taking roughly half the mmap time on average. I could understand it being a little bit faster because there are fewer numbers it has to deal with; however, I still expected the memory mapped array to be faster because the file already exists and is memory mapped.

I’d also note that the JLD2 write time is much less variable.

Can anyone offer any insight on this? My supposition is that because the JLD2 file is being written to disk via a function it is happening faster.

Beyond this (possibly unfair) comparison, is there a way to speed up the write time of the memory mapped array? At a minimum, this comparison demonstrates that my computer could get the values to disk faster. (I say my computer, but the tests were run on a Linux cluster, so this wouldn’t be due to varying usage of CPU power. The cluster also uses a Lustre file system.)
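One option that might be worth timing is to skip the mapping for the write phase and write each slice straight into the big file with `seek` and `write`, then map the file only for reading. Because Julia arrays are column-major, slice `i` along the last dimension is one contiguous byte range. This is only a sketch with hypothetical, scaled-down dimensions and file name, and whether it is actually faster on Lustre is untested.

```julia
using Mmap

# Hypothetical, scaled-down dimensions (the real slices are 200×510×61×6×11).
dims = (20, 51, 6, 6, 11)
nslices = 5
slice_bytes = prod(dims) * sizeof(Float64)

path = joinpath(mktempdir(), "big_array.bin")

open(path, "w+") do io
    for i in 1:nslices
        slice = rand(dims...)
        # Slice i occupies bytes (i-1)*slice_bytes .. i*slice_bytes - 1.
        seek(io, (i - 1) * slice_bytes)
        write(io, slice)
    end
end

# Later, map the finished file read-only to use it as one 6-D array.
A = open(path, "r") do io
    Mmap.mmap(io, Array{Float64,6}, (dims..., nslices))
end
println(size(A))
```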

Please include a self-contained MWE.

I think mmap needs to read the data from disk first, cf

OTOH, Lustre is ZFS based, right? Sparse files / RLE / LZ4 should make reading/storing a compressible file extremely fast, and there is nothing more compressible than a bunch of zeros (which is your initial state). Or do you need to talk to some other device doing the storage that maybe fails to do compression on some path? Or is compression disabled for some reason?

Try this with zeros. If this takes an appreciable amount of time / disk IO / network IO, then complain to your storage people. If this is fast, then remember that it is only fast for write-once, i.e. you need to delete your file and create a new one if you want to regenerate your matrix.
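A minimal sketch of that zeros test, assuming a fresh file on each run. The dimensions and file name are hypothetical and scaled down, so the real test would need the full-size array on the Lustre mount to say anything about the cluster.

```julia
using Mmap

# Hypothetical, scaled-down dimensions (the real slices are 200×510×61×6×11).
dims = (20, 51, 6, 6, 11)
nslices = 5
path = joinpath(mktempdir(), "zeros_test.bin")

# Write-once: always start from a newly created file.
isfile(path) && rm(path)

io = open(path, "w+")
A = Mmap.mmap(io, Array{Float64,6}, (dims..., nslices))

z = zeros(dims...)
t = @elapsed begin
    for i in 1:nslices
        A[:, :, :, :, :, i] = z
    end
    Mmap.sync!(A)   # include the flush to disk in the timing
end
close(io)
println("time to write zeros through the mapping: ", t, " s")
```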