Memory usage with mmap file

KKesgin · April 23, 2021, 2:57pm

Hi All
I’m accessing a very large binary file randomly on Windows. I don’t dare to post all the code, but I managed to replicate the behavior with a small loop. Could someone please explain to me why this causes unbounded memory growth (or at least unexpectedly large growth, despite using mmap) and how I can circumvent it?

using Mmap

io = open("path.bin", "r")
rawDataMap = Mmap.mmap(io, Matrix{Int16}, (385, 88173048); grow=false, shared=false)
close(io)

count = [1]
while true
println(sum(rawDataMap[:,count[1]:count[1]+75]))
count[1] += rand(75:100)
end

I would expect the arrays allocated in the loop to be garbage collected, but apparently they are not? And no, GC.gc() doesn’t help.

StefanKarpinski · April 23, 2021, 4:00pm

Julia doesn’t have any control over how the memory pages backing an mmapped array are managed: that is entirely up to the operating system. The behavior you’re seeing is what I’d expect the OS to do: lazily add actual backing memory pages to the mmapped region as you access it, and then not page them out until memory pressure gets so great that your system is completely out of memory (and maybe not even then, depending on the OS and how it’s configured).

KKesgin · April 23, 2021, 4:26pm

I appreciate the response. However, does that mean there is no way here to work on files larger than memory size + leftover storage space?

I basically have a 400 GB file with events occurring at “random” times.
And I have a list containing said event times.
I want to process (not just view) a small number of data points surrounding each event time on that list by looking at the original 400 GB binary file.
How does one work this out with reasonable computing resources (32 GB RAM, 128 GB free storage space) if mmap isn’t the answer?

mgkuhn · April 23, 2021, 4:58pm

Try

view(rawDataMap, :, count[1]:count[1]+75)

instead of

rawDataMap[:,count[1]:count[1]+75]

to avoid copying data from the memory-mapped file blocks into your process memory.
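For illustration, here is a small sketch of the difference between the two forms. It uses an ordinary in-memory matrix rather than a memory-mapped one, and the names A, c and v are made up for the example. Indexing with a range allocates a new array, while view only wraps the parent array.

```julia
# Small stand-in matrix (an assumption for the demo, not the 400 GB file).
A = reshape(Int16.(1:40), 4, 10)

c = A[:, 3:5]        # range indexing allocates a new 4x3 Matrix{Int16}
v = view(A, :, 3:5)  # SubArray that refers to A's memory, no copy

A[1, 3] = 0          # modify the parent after taking both

println(c[1, 1])     # the copy still holds the old value
println(v[1, 1])     # the view sees the change
println(parent(v) === A)
```

Note that a view only avoids the copy into the Julia heap. Reading through it still touches the underlying pages, so on its own it may not change how much of the mapping the OS keeps resident.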

KKesgin · April 23, 2021, 5:00pm

Thanks for your response, however I need to process said ranges.
Let’s say for the sake of simplicity I want to:

println(sum(rawDataMap[:,count[1]:count[1]+75]))

This still causes memory to explode (with views too).

mgkuhn · April 23, 2021, 5:04pm

If you need a copy anyway, mmap probably has little advantage over seek and read!.

Henrique_Becker · April 23, 2021, 5:07pm

Can’t you work with the file? I mean, do not create the mmapped array but instead use seek over the io object to get the exact ranges you want?

KKesgin · April 23, 2021, 5:09pm

If I’m not mistaken, “read” isn’t really a possibility if I want to read 10 kB of data midway through a 400 GB file. But I appreciate you pointing out that “seek” exists. I will try that and, if it works out, mark the response as the solution as soon as possible.

zgornel · April 23, 2021, 5:09pm

What if you do:

while true
   tmp = rawDataMap[:,count[1]:count[1]+75]
   f(tmp)
   tmp=[]  # explicit release
   GC.gc()
end

This works pretty well (under Linux at least).

KKesgin · April 23, 2021, 5:11pm

Thanks for the response. I’m afraid that doesn’t work on Windows. That was one of the first things I tried.

mgkuhn · April 23, 2021, 5:12pm

Try seek with read!, which takes an array to be filled as an argument.
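A minimal sketch of what that could look like, assuming the file is a raw column-major Int16 matrix as in the original post. The helper name read_columns! and the tiny temporary file are made up for the example; with the real file you would open path.bin and use 385 rows.

```julia
# Hypothetical helper (not from the thread): fill `buf` with consecutive
# columns, starting at `firstcol`, of a column-major Int16 matrix that is
# stored as raw bytes in `io`.
function read_columns!(buf::Matrix{Int16}, io::IO, firstcol::Integer)
    nrows = size(buf, 1)
    # seek takes a zero-based byte offset from the start of the stream
    seek(io, (firstcol - 1) * nrows * sizeof(Int16))
    read!(io, buf)   # fills buf in place, no new allocation
    return buf
end

# Demo on a small temporary file standing in for the large one.
A = reshape(Int16.(1:40), 4, 10)
path, tmpio = mktemp()
write(tmpio, A)
close(tmpio)

buf = Matrix{Int16}(undef, 4, 3)   # allocated once, reused for every event
open(path, "r") do io
    read_columns!(buf, io, 3)      # columns 3, 4 and 5
    println(sum(buf))
end
```

Because buf is reused, the loop over event times should not need to allocate per iteration, and only the bytes actually requested are read from disk.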

mgkuhn · April 23, 2021, 5:18pm

I have no practical experience with how mmap() behaves on Windows, and would be curious if e.g. Linux has the same problem.

You want to find out whether the memory you run out of is

- allocated by Julia’s memory manager (and thus could be garbage collected by Julia),
- allocated to the Julia process but not part of the heap memory that is managed by Julia’s memory manager (and thus gc() would have no effect), or
- outside the Julia process and instead part of the operating system’s block-buffer cache (which all processes share), in which case the problem may be that the OS just doesn’t deal well with that situation.

Is your OS swap space big enough for the file? See Increasing Virtual Memory in Windows 10
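As a rough way to tell these cases apart from inside Julia, one could compare what the garbage collector reports as live with what the OS reports for the process. This is only a sketch: memory_report is a made-up helper, Base.gc_live_bytes is an unexported function, and Sys.maxrss reports the peak resident set size, so the numbers are indicative only.

```julia
# Hypothetical helper: snapshot of a few memory figures, in MiB.
function memory_report()
    return (
        gc_live_mib  = Base.gc_live_bytes() / 2^20,  # heap managed by Julia's GC
        max_rss_mib  = Sys.maxrss() / 2^20,          # peak resident size of the process
        sys_free_mib = Sys.free_memory() / 2^20,     # free RAM as seen by the OS
    )
end

before = memory_report()
data = rand(Int16, 385, 10_000)   # stand-in workload
total = sum(data)
after = memory_report()

println(before)
println(after)
```

If gc_live_mib stays small while the process size keeps growing, the memory is probably not something GC.gc() can reclaim.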

mgkuhn · April 23, 2021, 5:28pm

I hadn’t realized that you can use an mmap()ed file after closing the corresponding file descriptor. If Julia really does not munmap() the mapping when you close the file, when does Julia then undo the memory map?

The documentation doesn’t say anything on how unmapping a memory-mapped file works.

KKesgin · April 23, 2021, 5:30pm

I share the same confusion. Understanding that behavior could have helped identify the problem. I’m afraid the documentation page really isn’t helpful at all.

zgornel · April 23, 2021, 5:48pm

I think a memory-mapped file should be looked at more like virtual memory associated with a process.

Once a file is mapped, it becomes memory that is accessible to the julia process at some address. In Linux, one can check by running cat /proc/<julia pid>/maps, and it will turn out that some memory region is mapped to the file. This stuff is handled transparently by the OS (address translation etc.).

Consequently, the mapped file can be an actual Matrix object in julia whose values happen to be not in RAM but on disk.

mgkuhn · April 23, 2021, 7:09pm

BTW, looking at the source code, mmap() registers a finalizer for the array, such that when the array is garbage collected, the memory mapping will be removed. Makes sense.
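If the mapping is removed by a finalizer, then it should be possible to trigger that explicitly with Base's finalize instead of waiting for the garbage collector. A small sketch, using a temporary file made up for the demo; the array must not be touched after finalize, since its memory is no longer mapped.

```julia
using Mmap

# Small temporary file standing in for the real data.
path, tmpio = mktemp()
write(tmpio, Int16.(1:40))
close(tmpio)

# Map the file; the mapping stays usable after the file is closed.
m = open(path, "r") do io
    Mmap.mmap(io, Matrix{Int16}, (4, 10))
end

total = sum(m)
println(total)

finalize(m)   # run the registered finalizer now, which removes the mapping
m = nothing   # drop the reference so the array cannot be used by accident
```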

StefanKarpinski · April 23, 2021, 7:26pm

I don’t get why seek and read wouldn’t work. Mmap seems like the wrong tool here: you are essentially telling the kernel “I want all of this file in my process memory (but do it lazily in case I don’t actually need to touch all of it).” Using seek and read, on the other hand, is precisely the right way to read only a small part of a large file when you don’t want the rest of it in memory. In particular, if you want to process a large file “in streaming fashion”, you want to use read rather than mmap.

cstjean · September 16, 2021, 3:38am

Is there a package that implements an AbstractArray backed by “seek and read” of a file? I know Arrow does this, but it comes with a lot of baggage, and only supports vectors AFAICT…
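I don't know of such a package, but a bare-bones read-only version is short to sketch. Everything here (the SeekMatrix name, its fields, the temporary file) is made up for illustration. It does one seek and one read per element, so it is slow for bulk access and not thread-safe.

```julia
# Hypothetical read-only matrix whose elements live in a raw column-major file.
struct SeekMatrix{T} <: AbstractMatrix{T}
    io::IOStream
    dims::Tuple{Int,Int}
end

Base.size(S::SeekMatrix) = S.dims

function Base.getindex(S::SeekMatrix{T}, i::Int, j::Int) where {T}
    @boundscheck checkbounds(S, i, j)
    # zero-based byte offset of element (i, j) in column-major order
    seek(S.io, ((j - 1) * S.dims[1] + (i - 1)) * sizeof(T))
    return read(S.io, T)
end

# Demo on a small temporary file.
A = reshape(Int16.(1:40), 4, 10)
path, tmpio = mktemp()
write(tmpio, A)
close(tmpio)

io = open(path, "r")
S = SeekMatrix{Int16}(io, (4, 10))
x = S[2, 3]           # single element, read from disk
window = S[:, 3:5]    # generic fallback builds an ordinary Matrix{Int16}
close(io)
```

A practical version would probably override range indexing to issue one read! per contiguous block instead of going element by element.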
