Thank you for some thoughtful ideas and suggestions!
- I think this is probably a question for the package owner but here are some thoughts from me. Firstly, as a contributor, I feel like I should respect the documented functionality of the package I’m contributing to. The XLSX.jl docs say
If enable_cache=false, worksheet cells will always be read from disk. This is useful when you want to read a spreadsheet that doesn’t fit into memory.
So I presume that anything I do needs to maintain the ability to read files larger than memory. Indeed, when I made an earlier PR, it received a comment saying
One thing to look into is memory usage when reading a large spreadsheet.
and I took this to be a reminder of the need to maintain support for larger than memory files.
In terms of usage scenarios, I think an important use of XLSX.jl is to read data from spreadsheets published or shared by third parties. A very large dataset, or several datasets spread over several worksheets in a single file, may easily be bigger than the memory of some computers. XLSX.jl offers the ability to extract a subset of this data without having to load the whole file into memory. I have no idea how often such a situation actually occurs, but XLSX.jl has thoughtfully provided functionality to handle it if/when it does, and I wouldn’t want to break it without discussion.
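To make the scenario concrete, here is a minimal sketch of the kind of streaming read I have in mind, based on the documented openxlsx / eachrow usage. The helper name first_values and its arguments are hypothetical, made up purely for illustration; only XLSX.openxlsx, XLSX.eachrow and XLSX.row_number are taken from the package docs.

```julia
using XLSX

# Hypothetical helper (not part of XLSX.jl): collect column A from the first
# `n` rows of one sheet, reading rows from disk rather than from a cache.
function first_values(path::AbstractString, sheetname::AbstractString, n::Integer)
    out = Any[]
    XLSX.openxlsx(path, enable_cache=false) do xf
        sheet = xf[sheetname]
        for row in XLSX.eachrow(sheet)
            # Stop as soon as we are past the rows we care about.
            XLSX.row_number(row) > n && break
            push!(out, row[1])   # value in column 1 of this row
        end
    end
    return out
end
```

The point is only that the caller asks for a small subset and stops early; whether the rest of the file stays out of memory depends on the reader implementation, which is exactly the behaviour I don’t want to break.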
- This is something I can look at, but it would likely be a major change. It would seem undesirable to have XLSX.jl depend on two separate zip packages, though, if it could be avoided. The need is less than before because, in my latest PR, there is a GC call only when reading a file with enable_cache=false, whereas originally it was called on every read. For many uses the original issue has now gone away.
As for XML.jl needing a Vector{UInt8}, I’m afraid my understanding is somewhat sketchy. In my mind, the fact that mmap can be used to treat the file as a Vector{UInt8} in ZipArchives.jl means the vector is accessed directly in the file rather than being copied into memory. Then, in XLSX.jl, I use LazyNode to access sheet rows, which I understand to mean that only the elements we actually want to read get materialised in memory. I realise I don’t know how to verify my understanding and that it could easily be wrong.
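For what it’s worth, the mmap part of that picture can be sketched with the Mmap standard library alone. This is only an illustration using a throwaway temp file as a stand-in for a real archive; it says nothing about what ZipArchives.jl or XML.jl do internally with the bytes.

```julia
using Mmap

# Write a tiny throwaway file (a stand-in for a real .xlsx archive).
path, io = mktemp()
write(io, "PK\x03\x04")   # the four bytes that start a zip local file header
close(io)

# Memory-map the file. With no type argument the result is a Vector{UInt8}
# whose contents are backed by the file rather than read in up front.
bytes = open(path) do f
    Mmap.mmap(f)
end

# A view selects part of the mapped data without copying it.
header = view(bytes, 1:2)
```

So code that only asks for an AbstractVector{UInt8} can be handed a mapped file, and the operating system pages data in as it is touched. Checking how much actually ends up resident would still need measurement on a genuinely large file.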
- Interesting. Will be curious to see how XLSX.jl could benefit from this, too.