# Use of Memory-mapped I/O

I have some Julia code (version 1.2) that performs a lot of operations on a `10000 x 10000` `Array`. Because I get an `OutOfMemoryError()` when I run it, I’m exploring other options, such as CUDA or memory mapping. Regarding `Mmap.mmap`, I’m a bit confused about how to use the `Array` that I map to disk, since the explanation at https://docs.julialang.org/en/v1/stdlib/Mmap/index.html is brief. Here is the beginning of my code:

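``````
using Distances
using LinearAlgebra
using Distributions
using Mmap

data = Float32.(rand(10000, 15))
# Pairwise Euclidean distances between the rows of `data` (10000 x 10000)
Eucldist = pairwise(Euclidean(), data, dims=1)
D = maximum(Eucldist.^2)
# Mean of the scaled squared distances over the strictly lower triangle
sigma2hat = mean(((Eucldist.^2)./D)[tril!(trues(size(Eucldist)), -1)])
L = exp.(-(Eucldist.^2/D)/(2*sigma2hat))
``````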

`L` is the big Array with which I want to work, so I mapped it to my disk with

``````
s = open("mmap.bin", "w+")
write(s, size(L,1))
write(s, size(L,2))
write(s, L)
close(s)
``````

What am I supposed to do after that? The next step is to compute `eigen(L)`. How should I do that: with `eigen(L)` or `eigen(s)`? What is the role of the object `s`, and when does it get involved? Moreover, I don’t understand why and when I have to use `Mmap.sync!`. After each line that follows `eigen(L)`? At the end of the code? How can I be sure that I’m using my disk space instead of RAM? I would appreciate some pointers on memory mapping. Thank you!

You did not map anything; you just wrote `L` to disk and closed the stream `s`. To map, you have to call the `mmap` function as the documentation specifies. Once that is done, i.e. `L = Mmap.mmap(...)`, indexing into `L` loads the corresponding chunk of data into memory. If that data is modified, `sync!` sends the modifications back to disk.
The challenge memory mapping poses is adapting algorithms, here `eigen`, to work on chunks of data, i.e. `L[:,i:j]`, one after another (since `L` does not fit into memory). That is difficult for anything beyond simple indexing.

To add to this explanation, you’re missing the following piece of code:

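``````
s = open("mmap.bin")
# The file starts with the two dimensions written by `write(s, size(L,1))` etc.
m = read(s, Int)
n = read(s, Int)
# The mapping starts at the current position of `s`, i.e. right after the header
L = Mmap.mmap(s, Matrix{Float32}, (m,n))
``````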

In order for the performance of this approach to be decent, you’ll want your algorithm to work on chunks of (ideally contiguous) memory at a time, and minimize the amount of linear scan or random access you do across the entirety of `L`. If you fail to do so, your OS will swap the memory pages of `L` in and out of memory like mad, causing a potentially huge slowdown and likely unacceptable performance.
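
As a rough illustration of working on contiguous chunks, here is a minimal sketch that computes a matrix-vector product over a memory-mapped matrix one block of columns at a time. The helper `blocked_matvec`, the block size, and the small matrix sizes are assumptions made for the example, not part of the original code.

```julia
using Mmap

# Hypothetical helper: y = A * x, touching only `blocksize` columns per step.
function blocked_matvec(A::AbstractMatrix, x::AbstractVector; blocksize::Int=256)
    m, n = size(A)
    y = zeros(promote_type(eltype(A), eltype(x)), m)
    for j0 in 1:blocksize:n
        cols = j0:min(j0 + blocksize - 1, n)
        # Julia arrays are column-major, so a block of columns is contiguous in the file
        y .+= view(A, :, cols) * view(x, cols)
    end
    return y
end

# Small demonstration on a temporary file (sizes are illustrative only)
path = tempname()
A0 = rand(Float32, 200, 300)
open(path, "w+") do io
    write(io, size(A0, 1))   # header: number of rows
    write(io, size(A0, 2))   # header: number of columns
    write(io, A0)
end

y = open(path) do io
    m = read(io, Int)
    n = read(io, Int)
    A = Mmap.mmap(io, Matrix{Float32}, (m, n))   # mapping starts after the header
    blocked_matvec(A, ones(Float32, n); blocksize=64)
end
```

Each pass reads one contiguous slab of the file, so the OS can page it in sequentially instead of jumping around the whole matrix.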

Also, in terms of sync’ing to disk: calling `sync!` will force any memory pages to copy to disk if the disk copy isn’t in sync with the pages in memory. But you technically don’t need to do this unless you need your disk copy of `L` to be completely consistent at a certain point in time, because your OS will do this for you automatically. It’s also a potentially expensive operation that you should use sparingly due to its performance ramifications.
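
A minimal sketch of where `sync!` fits, assuming a small writable mapping on a temporary file (the path and dimensions are made up for the example):

```julia
using Mmap

path = tempname()
# "w+" opens the file for reading and writing and creates it if needed;
# with the default settings, mmap grows the file to fit the requested dimensions.
io = open(path, "w+")
A = Mmap.mmap(io, Matrix{Float32}, (4, 3))

A .= 1.0f0          # modifies pages in memory; the OS writes them back lazily
A[2, 3] = 42.0f0
Mmap.sync!(A)       # force the modified pages to disk at this point
close(io)

# Read the file back through ordinary I/O to confirm what is on disk
B = Matrix{Float32}(undef, 4, 3)
read!(path, B)
```

In a long computation, a single `sync!` at a point where the on-disk copy must be consistent (for example before another process reads the file) is usually enough.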

Something else to consider is checking whether your array has some sort of special structure that allows it to be easily compressed. If it does, Julia makes it very easy to create specialized `AbstractArray` types. For example, something I work on involves a matrix with a very large number of columns, most of which are repeated. To make allocating this array more efficient, I implemented `ColRefMatrix <: AbstractMatrix`, in which the values in the columns are pulled from a much smaller underlying `Matrix`.
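
To give an idea of what such a type might look like, here is a minimal sketch. Only the name `ColRefMatrix` comes from the post; the field names and the methods are assumptions for illustration and are not the actual implementation.

```julia
# Hypothetical reconstruction of a matrix whose columns refer to a smaller matrix.
struct ColRefMatrix{T} <: AbstractMatrix{T}
    base::Matrix{T}        # the distinct columns, stored once
    colref::Vector{Int}    # colref[j] = index of the column of `base` backing column j
end

# The two methods a read-only AbstractMatrix needs
Base.size(A::ColRefMatrix) = (size(A.base, 1), length(A.colref))
Base.getindex(A::ColRefMatrix, i::Int, j::Int) = A.base[i, A.colref[j]]

base = [1.0 2.0; 3.0 4.0]
A = ColRefMatrix(base, [1, 2, 1, 1, 2])   # a 2 x 5 matrix stored as 2 x 2 plus 5 indices
```

Generic code such as `sum(A)` or `A * ones(5)` then works through the `AbstractMatrix` fallbacks, at the cost of going through `getindex` for each element.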

Of course I have no idea what you’re doing here or whether this is applicable to you, but it is not uncommon for me to encounter situations in which I have to deal with very large matrices which are simple to compress.

Thank you to all of you! Will try to overcome my barriers combining all your precious suggestions!

Another avenue worth considering is https://github.com/joshday/OnlineStats.jl and its forks (for specialized stuff): the data can be practically endless and it will still work.

Thank you, I didn’t know that package!

Here is the error I get with my original code:

``````
ERROR: LoadError: OutOfMemoryError()
Stacktrace:
[1] Type at ./boot.jl:396 [inlined]
[2] Type at ./boot.jl:404 [inlined]
[3] Type at ./boot.jl:411 [inlined]
[4] similar at ./abstractarray.jl:618 [inlined]
[5] similar at ./abstractarray.jl:617 [inlined]
[9] top-level scope at /NOBACKUP/vicentes/Julia/sp-simul-Gram2-2/Sparsesp.jl:106 [inlined]
[10] top-level scope at ./none:0
[11] include at ./boot.jl:317 [inlined]
[13] include(::Module, ::String) at ./sysimg.jl:29
[14] exec_options(::Base.JLOptions) at ./client.jl:266
[15] _start() at ./client.jl:425
in expression starting at /NOBACKUP/vicentes/Julia/sp-simul-Gram2-2/Sparsesp.jl:105
``````

Don’t know how to get rid of that error, even with memory mapping. Is the problem occurring at line 105 of my code?

Sorry if I missed it, but I don’t see your code here and without it it is difficult to help.

It is very likely that you want OnlineStats.jl here, as @yakir12 suggested. Alternatively, you could preallocate `mmap`ed arrays for all parts of the calculations, or at least `Eucldist` above.
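
A minimal sketch of the preallocation idea, assuming the distance matrix is filled column by column into a file-backed array instead of being returned by `pairwise`. The helper `fill_distances!`, the temporary path, and the small problem size are made up for the example.

```julia
using Mmap
using LinearAlgebra

# Hypothetical helper: out[i, j] = Euclidean distance between rows i and j of `data`.
function fill_distances!(out::AbstractMatrix, data::AbstractMatrix)
    n = size(data, 1)
    size(out) == (n, n) || throw(DimensionMismatch("out must be n x n"))
    for j in 1:n                      # column by column, so writes are contiguous
        xj = view(data, j, :)
        for i in 1:n
            out[i, j] = norm(view(data, i, :) .- xj)
        end
    end
    return out
end

data = rand(Float32, 100, 15)         # small stand-in for the 10000 x 15 data
n = size(data, 1)

path = tempname()
io = open(path, "w+")
# Preallocate the result on disk; the array is backed by the file, not the Julia heap
Eucldist = Mmap.mmap(io, Matrix{Float32}, (n, n))
fill_distances!(Eucldist, data)
Mmap.sync!(Eucldist)
close(io)
```

Note that expressions such as `Eucldist.^2` still allocate ordinary in-memory arrays, so later steps would need to write into preallocated mapped arrays as well (for example with `.=`) to avoid the same error.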

[quote=“sergevic, post:9, topic:28311”]

Was asking because of the last line of the Stacktrace

``````
in expression starting at /NOBACKUP/vicentes/Julia/sp-simul-Gram2-2/Sparsesp.jl:105
``````