Struggling to use Mmap with ZipArchives

TimG · June 17, 2025, 8:07am

Thank you for some thoughtful ideas and suggestions!

I think this is probably a question for the package owner but here are some thoughts from me. Firstly, as a contributor, I feel like I should respect the documented functionality of the package I’m contributing to. The XLSX.jl docs say

If enable_cache=false, worksheet cells will always be read from disk. This is useful when you want to read a spreadsheet that doesn’t fit into memory.

So I presume that anything I do needs to maintain the ability to read files larger than memory. Indeed, when I made an earlier PR, it received a comment saying

One thing to look into is memory usage when reading a large spreadsheet.

and I took this to be a reminder of the need to maintain support for larger than memory files.

In terms of usage scenarios, I think an important use of XLSX.jl is to read data from spreadsheets published or shared by third parties. A very large dataset, or several datasets spread over multiple worksheets in a single file, may easily be bigger than the memory of some computers. XLSX.jl offers the ability to extract a subset of this data without having to load the whole file into memory. I have no idea how often such a situation actually occurs, but XLSX.jl has thoughtfully provided functionality to handle it if/when it does and I wouldn’t want to break it without discussion.
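To make that scenario concrete, here is a minimal sketch of pulling a few rows out of a workbook with enable_cache=false, based on the streaming pattern shown in the XLSX.jl docs. The function name first_column_values and its arguments are hypothetical, made up for illustration, and I have not measured how much memory this actually saves.

```julia
using XLSX

# Hypothetical helper (not part of XLSX.jl): collect the first-column value of
# the first `n` rows of a sheet, reading rows from disk rather than caching them.
function first_column_values(path::AbstractString, sheetname::AbstractString, n::Integer)
    out = Any[]
    XLSX.openxlsx(path, enable_cache=false) do xf
        sheet = xf[sheetname]
        for row in XLSX.eachrow(sheet)
            # Stop as soon as we are past the rows we care about.
            XLSX.row_number(row) > n && break
            push!(out, row[1])   # value in column 1 of this row
        end
    end
    return out
end
```

Called as first_column_values("big.xlsx", "Sheet1", 100), this should only ever walk the first 100 rows of the sheet, whatever the size of the rest of the file.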

This is something I can look at but it would likely be a major change. It would seem undesirable, though, for XLSX.jl to depend on two separate zip packages if that could be avoided. The need is less than before because, in my latest PR, there is a GC call only when reading a file with enable_cache=false, whereas originally it was called on every read. For many uses the original issue has now gone away.

As far as XML.jl needing a Vector{UInt8} goes, I’m afraid my understanding is somewhat sketchy. In my mind, the fact that mmap can be used to treat the file as a Vector{UInt8} in ZipArchives.jl means the vector is accessed directly in the file rather than being copied into memory. Then, in XLSX.jl, I use LazyNode to access sheet rows, which I understand to mean that only the elements we actually want to read get materialised in memory. I realise I don’t know how to verify my understanding and that it could easily be wrong.
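For what it’s worth, the mmap part of that picture can be checked in isolation with just the Mmap standard library. This is only a sketch on a throwaway temp file, not what XLSX.jl or ZipArchives.jl actually does internally: it shows that the mapping really is a plain Vector{UInt8} backed by the file. It says nothing about what happens once an entry is decompressed, which I assume has to produce a separate in-memory buffer.

```julia
using Mmap

# Write a small throwaway file so the sketch is self-contained.
path, io = mktemp()
write(io, "hello mmap")
close(io)

# Map the file read-only. The result is an ordinary Vector{UInt8} whose
# contents live in the file and are paged in by the OS when accessed.
bytes = open(path, "r") do f
    Mmap.mmap(f)
end

@show typeof(bytes)
@show length(bytes) == filesize(path)

# Indexing a range copies just those bytes into a new, normal array.
@show String(bytes[1:5])
```

As far as I can tell, a vector like bytes is what gets handed to ZipReader in ZipArchives.jl, so at least the compressed archive itself should not need to be copied into memory up front.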

Interesting. Will be curious to see how XLSX.jl could benefit from this, too.
