Read-only memory-mapped files

data

#1

I am developing some code in BEDFiles.jl that may be used with very large binary data files. The data are accessed as a read-only memory-mapped Matrix{UInt8} using column-oriented algorithms whenever possible.

As I understand it, there shouldn’t be a problem with having a very large file if I am only accessing a small set of adjacent columns. Suppose that I have 100,000 rows and 10 million columns but I only access the first 10,000 columns. I believe that the columns beyond 10,000 will never need to appear in memory, and that they are essentially held as a kind of promise by the operating system (which would be Linux; I don’t care if Windows does dumb things with memory-mapped files). Is this correct?


#2

If you are opening a memory-mapped file, then yes, that is correct.

Admittedly I am still rather hazy on some of the details, but you can get a partial description here.

It should go without saying that you still have to be careful about actually copying data out of the memory mapped array.
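To make the copying caveat concrete, here is a minimal sketch. It uses an ordinary in-memory matrix as a stand-in for the memory-mapped one (an assumption made only to keep the example self-contained); the same distinction between range indexing and `view` should apply to a matrix returned by `Mmap.mmap`.

```julia
# Stand-in for the memory-mapped Matrix{UInt8} (illustration only).
A = rand(UInt8, 100, 50)

# Indexing with ranges allocates a fresh 100×10 Matrix and copies the data into RAM.
cols = A[:, 1:10]

# A view refers to the parent's memory; no data is copied.
v = view(A, :, 1:10)

# Reductions over a view read the parent's memory directly.
total = sum(Int, v)
```

For column-oriented algorithms, passing `view(A, :, j)` (or using `@views`) rather than `A[:, j]` avoids materializing each column as a separate array.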


#3

Yes, your expectations are correct. I have mmapped 500GB files on a 16GB machine without any problems; the OS (in my case, Linux) takes care of the memory operations transparently, paging data in on demand.
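A rough sketch of that access pattern, using the `Mmap` standard library. The temporary file, its small dimensions, and the variable names are all made up for illustration; a real file would be far larger and would not be generated like this.

```julia
using Mmap

# Hypothetical small stand-in for a large binary file: m rows by n columns of UInt8.
m, n = 1_000, 200
path = tempname()
write(path, rand(UInt8, m, n))

# Opening the file with "r" yields a read-only mapping.
# Creating the mapping does not read the file contents.
A = open(path, "r") do io
    Mmap.mmap(io, Matrix{UInt8}, (m, n))
end

# Touch only the first 10 columns. Julia arrays are column-major, so these
# columns occupy one contiguous byte range at the start of the file.
colsums = [sum(Int, view(A, :, j)) for j in 1:10]
```

Only the pages backing the columns actually read need to be brought into memory; the remaining columns stay on disk until something touches them.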

For $10^5 \cdot 10^4 = 10^9$ UInt8s, that’s 1GB, so chances are the whole section could just fit in memory, making access really fast.
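The arithmetic can be checked directly with the dimensions from the original question (the variable names are just for illustration):

```julia
rows       = 100_000       # 10^5 rows
cols_total = 10_000_000    # 10^7 columns in the whole file
cols_used  = 10_000        # 10^4 columns actually accessed

# One UInt8 is one byte.
bytes_used  = rows * cols_used  * sizeof(UInt8)   # 10^9 bytes, about 1 GB
bytes_total = rows * cols_total * sizeof(UInt8)   # 10^12 bytes, about 1 TB
```

So the accessed section is roughly a thousandth of the full mapping.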