Memory usage with mmap file

Hi All
I’m accessing a very large binary file randomly on Windows. I won’t paste all the code here, but I managed to replicate the behavior with a small loop. Could someone please explain why this causes seemingly unbounded memory growth (or at least unexpectedly large growth, despite using mmap), and how I can circumvent it?

using Mmap

io = open("path.bin", "r")
rawDataMap = Mmap.mmap(io, Matrix{Int16}, (385, 88173048); grow=false, shared=false)
close(io)

count = [1]
while true
    println(sum(rawDataMap[:, count[1]:count[1]+75]))
    count[1] += rand(75:100)
end

I would expect the slices created in the loop to be garbage collected, but apparently they aren’t? And no, GC.gc() doesn’t help.

Julia doesn’t have any control over how the memory pages backing an mmapped array are managed; that is entirely up to the operating system. The behavior you’re seeing is what I’d expect the OS to do: lazily add actual backing pages to the mmapped region as you access it, and then not page them out until memory pressure gets so great that your system is completely out of memory (and maybe not even then, depending on the OS and how it’s configured).

2 Likes

I appreciate the response; however, does that mean there is no alternative here for working on files larger than memory size plus leftover storage space?

I basically have a 400 GB file with events occurring at “random” times.
And I have a list containing said event times.
I want to process (not just view) a small number of data points surrounding each event time on that list by looking at the original 400 GB binary file.
How does one work this out with reasonable computing resources (32 GB RAM, 128 GB free storage space) if mmap isn’t the answer?

Try

view(rawDataMap, :, count[1]:count[1]+75)

instead of

rawDataMap[:,count[1]:count[1]+75]

to avoid copying data from the memory-mapped file blocks into your process memory.

1 Like

Thanks for your response; however, I need to process those ranges.
Let’s say for the sake of simplicity I want to:

println(sum(rawDataMap[:,count[1]:count[1]+75]))

This still causes memory usage to explode (with views too).

If you need a copy anyway, mmap probably has little advantage over seek and read!.

3 Likes

Can’t you work with the file directly? I mean, don’t create the mmapped array at all, but instead use seek on the io object to get to the exact ranges you want?

2 Likes

If I’m not mistaken, “read” on its own isn’t really an option if I want to read 10 kB of data midway through a 400 GB file. But I appreciate you pointing out that “seek” exists. I will try that and mark the response as the solution if it works out, as soon as possible.

What if you do:

while true
    tmp = rawDataMap[:, count[1]:count[1]+75]
    f(tmp)            # your processing
    tmp = []          # drop the reference explicitly
    GC.gc()           # force a collection so the copy is freed right away
    count[1] += rand(75:100)
end

This works pretty well (under Linux, at least).

1 Like

Thanks for the response. I’m afraid that doesn’t work on Windows. That was one of the first things I tried.

Try seek with read!, which takes an array to be filled as an argument.
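For example, something along these lines (an untested sketch using the dimensions from the original post; firstcol is just an illustrative index):

io = open("path.bin", "r")

nrows = 385                                # Int16 values per column
buf = Matrix{Int16}(undef, nrows, 76)      # preallocated block, reused every time

firstcol = 12345                           # illustrative 1-based column index
seek(io, (firstcol - 1) * nrows * sizeof(Int16))   # byte offset of that column
read!(io, buf)                             # fills buf with 385×76 Int16 values
println(sum(buf))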

2 Likes

I have no practical experience with how mmap() behaves on Windows, and would be curious if e.g. Linux has the same problem.

You want to find out whether the memory you run out of is

  • allocated by Julia’s memory manager (and thus could be garbage collected by Julia),
  • allocated to the Julia process but not part of the heap memory that is managed by Julia’s memory manager (and thus GC.gc() would have no effect), or
  • outside the Julia process and instead part of the operating system’s block-buffer cache (which all processes share), in which case the problem may be that the OS just doesn’t deal well with that situation. (A quick way to check the first two from within Julia is sketched after this list.)
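For the first two, Julia can report the numbers itself; comparing them against what the OS task manager shows for the process and for the system should narrow it down:

# Bytes currently allocated on Julia’s own GC heap (the first case).
println("GC-managed bytes: ", Base.gc_live_bytes())

# Peak resident set size of the whole Julia process, which also includes
# memory that Julia’s GC does not manage (the second case).
println("Process maxrss:   ", Sys.maxrss())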

Is your OS swap space big enough for the file? See Increasing Virtual Memory in Windows 10

1 Like

I hadn’t realized that you can use an mmap()ed file after closing the corresponding file descriptor. If Julia really does not munmap() the mapping when you close the file, when does Julia then undo the memory map?

The documentation doesn’t say anything about how unmapping a memory-mapped file works.

1 Like

I share the same confusion. Understanding that behavior could have helped identify the problem. I’m afraid the documentation page really isn’t helpful here at all.

I think a memory-mapped file should be thought of more like virtual memory associated with a process.

Once a file is mapped, it becomes memory that is accessible to the Julia process at some address. On Linux, one can check by running cat /proc/<julia pid>/maps, which will show that some memory region is mapped to the file; this is handled transparently by the OS (address translation, etc.).

Consequently, the mapped file can be an actual Matrix object in Julia whose values happen to live not in RAM but on disk.
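For example, you can confirm the mapping from within Julia itself on Linux (the path is illustrative):

# Print the lines of the current process’s memory map that refer to the file.
run(`grep path.bin /proc/$(getpid())/maps`)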

1 Like

BTW, looking at the source code, mmap() registers a finalizer for the array, such that when the array is garbage collected, the memory mapping will be removed. Makes sense.
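In practice, that means the mapping can be released explicitly by dropping the last reference to the array and letting the GC run the finalizer; a sketch of that usage (not the actual Mmap source):

using Mmap

io = open("path.bin", "r")
A = Mmap.mmap(io, Matrix{Int16}, (385, 88173048))
close(io)        # fine: the mapping outlives the file descriptor

# ... work with A ...

A = nothing      # drop the last reference to the mapped array
GC.gc()          # the finalizer runs (eventually) and the region is unmapped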

1 Like

I don’t get why seek and read wouldn’t work. Mmap seems like the wrong tool here: you are essentially telling the kernel “I want all of this file in my process memory (but do it lazily in case I don’t actually need to touch all of it).” Using seek and read, on the other hand, is precisely the right way to read only a small part of a large file when you don’t want the rest of it in memory. In particular, if you want to process a large file “in streaming fashion”, you want to use read rather than mmap.
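Concretely, for the event-list use case above, a sketch along these lines keeps everything in one small reusable buffer (event_cols and process are hypothetical placeholders):

io = open("path.bin", "r")
nrows = 385
buf = Matrix{Int16}(undef, nrows, 76)          # one reusable window

for c in event_cols                            # hypothetical list of event column indices
    seek(io, (c - 1) * nrows * sizeof(Int16))  # jump straight to the event’s column
    read!(io, buf)                             # overwrite the same buffer each time
    process(buf)                               # hypothetical per-event processing
end
close(io)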

4 Likes

Is there a package that implements an AbstractArray backed by “seek and read” of a file? I know Arrow does this, but it comes with a lot of baggage, and only supports vectors AFAICT…
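Something like the following minimal sketch is roughly what I have in mind; deliberately naive (one read per element, no caching, no offset handling), and not an existing package:

struct FileBackedMatrix{T} <: AbstractMatrix{T}
    io::IOStream
    dims::Tuple{Int,Int}
end

Base.size(A::FileBackedMatrix) = A.dims

function Base.getindex(A::FileBackedMatrix{T}, i::Int, j::Int) where {T}
    @boundscheck checkbounds(A, i, j)
    # Column-major layout: element (i, j) starts at this byte offset.
    seek(A.io, ((j - 1) * A.dims[1] + (i - 1)) * sizeof(T))
    read(A.io, T)
end

io = open("path.bin", "r")
A = FileBackedMatrix{Int16}(io, (385, 88173048))
sum(@view A[:, 1:76])    # works, but issues one tiny read per element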

1 Like