Writing an array too large to store in memory

question
jld

#1

I am transforming a dataset, and the transformation is both slow and memory-hungry. I really only need a single row of the matrix in memory at any given time, so is there some efficient way I can write each row to a file as it is computed?

Ideally, I would like to be able to do so using @parallel.


#2

By “matrix”, do you mean an array of fixed-size elements? If so, then have a look at mmap.

For writing one row at a time, it’s best if the array is stored in row-major order. If you need it to be column-major, then it’s probably more efficient to first store, say, 1024 rows in memory and then write that entire block to disk at once. (Of course mmap might already take care of this.) Rows within a block could be computed in parallel.
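The block-wise approach described above could be sketched roughly as follows. This is only an illustration under some assumptions: `transform_row` is a hypothetical stand-in for the real per-row computation, the elements are assumed to be `Float64`, and the file name and block size are placeholders. It relies on `Mmap.mmap` growing a file opened with `"w+"` to the requested size.

```julia
using Mmap

# Hypothetical stand-in for the expensive per-row transformation.
transform_row(i, ncols) = Float64[i + j / 1000 for j in 1:ncols]

function write_in_blocks(path, nrows, ncols; blocksize = 1024)
    open(path, "w+") do io
        # Map a column-major nrows x ncols matrix backed by the file.
        A = Mmap.mmap(io, Matrix{Float64}, (nrows, ncols))
        block = Matrix{Float64}(undef, blocksize, ncols)
        for start in 1:blocksize:nrows
            stop = min(start + blocksize - 1, nrows)
            len = stop - start + 1
            # Rows within a block are independent of each other,
            # so this inner loop is where parallelism could go.
            for (k, i) in enumerate(start:stop)
                block[k, :] = transform_row(i, ncols)
            end
            # Copy the finished block into the mapped file in one assignment.
            A[start:stop, :] = view(block, 1:len, :)
        end
        Mmap.sync!(A)  # flush the mapped pages to disk
    end
    return path
end
```

Only one block of `blocksize` rows is held in ordinary memory at a time; the operating system decides when the mapped pages are actually written out.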


#3

Yes exactly! Row-vs-column major isn’t a big deal to me, thanks for pointing that out. So going off the documentation, a prototype would be

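using Mmap  # required on Julia 0.7 and later, where Mmap is a standard library

A = rand(100,100)
s = open("tmp.bin", "w+")

# We'll write the dimensions of the array as the first two Ints in the file
write(s, size(A,1))
write(s, size(A,2))

# Now write the data, one column at a time
for i=1:100
    write(s, A[:,i])
end
close(s)

s = open("tmp.bin") # default is read-only
m = read(s, Int)
n = read(s, Int)
# A holds Float64 values, so map the file as Matrix{Float64}, not Matrix{Int}
A2 = Mmap.mmap(s, Matrix{Float64}, (m,n))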

I can’t seem to find a way to get JLD to write this way, even though it seems to have a way of using mmap.
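Setting JLD aside, one way to avoid ever building the full array in memory might be to map the file in write mode and fill it column by column. The following is a minimal sketch rather than a tested solution for the real dataset: `compute_column` is a hypothetical placeholder for the actual transformation, the element type is assumed to be `Float64`, and the header layout (two `Int`s) simply mirrors the prototype above.

```julia
using Mmap

# Hypothetical stand-in for the expensive computation of one column.
compute_column(j, m) = Float64[i * j for i in 1:m]

function write_columns(path, m, n)
    open(path, "w+") do io
        # Header: the two dimensions as Ints, as in the prototype.
        write(io, Int(m))
        write(io, Int(n))
        flush(io)
        # Map the region after the header; the offset defaults to position(io).
        A = Mmap.mmap(io, Matrix{Float64}, (m, n))
        for j in 1:n
            A[:, j] = compute_column(j, m)  # one column is computed at a time
        end
        Mmap.sync!(A)  # flush the mapped pages to disk
    end
    return path
end

function read_mapped(path)
    io = open(path)  # read-only
    m = read(io, Int)
    n = read(io, Int)
    return Mmap.mmap(io, Matrix{Float64}, (m, n))
end
```

Calling `write_columns("tmp.bin", 100, 100)` followed by `read_mapped("tmp.bin")` should give back a 100×100 matrix without the writer ever holding more than one column in ordinary memory.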