Question about what About.jl tells me

I need a bit of advice on interpreting the results of About.jl. I’m doing some work trying to make marginal reductions in the memory demands of XLSX.jl and am currently focusing on the way Excel’s sharedStrings are handled. Currently, XLSX.jl stores these in a struct like this:

mutable struct SharedStringTable
    unformatted_strings::Vector{String}
    formatted_strings::Vector{String}
    index::Dict{String, Int64}
    is_loaded::Bool
end

The formatted_strings are the raw XML representations of the textual cell values and may (rarely) include some rich-text formatting information.
The unformatted_strings are the parsed plain-text values.
The keys in the dict are (duplicates of) the unformatted_strings.

If I read an Excel file containing a large number of string values, I can use About.jl to find some size information:

julia> about(workbook.sst)
XLSX.SharedStringTable (mutable) (<: Any), occupies 32B directly (referencing 14MB in total)

julia> about(workbook.sst.unformatted_strings)
49419-element Vector{String} (mutable) (<: DenseVector{String} <: AbstractVector{String} <: Any), occupies 24B directly (referencing 4.2MB in total, holding 386kB of data)

julia> about(workbook.sst.formatted_strings)
49419-element Vector{String} (mutable) (<: DenseVector{String} <: AbstractVector{String} <: Any), occupies 24B directly (referencing 5.2MB in total, holding 386kB of data)

julia> about(workbook.sst.index)
Dict{String, Int64} with 49419 entries (mutable) (<: AbstractDict{String, Int64} <: Any), occupies 64B directly (referencing 9.0MB in total)

In my dev’d code, I’ve abandoned eager conversion to plain strings in favour of lazy parsing and now have a simpler struct, like:

mutable struct SharedStringTable
    shared_strings::Vector{String}
    index::Dict{UInt64, Vector{Int64}}
    is_loaded::Bool
end

Here, the shared_strings are identical to the formatted_strings above. The dict keys, however, are the simple hashes of the shared_strings, with a vector for the dict values to accommodate any (very few to none) hash collisions (a sketch of the lookup is at the end of this post). I would expect this struct to be much smaller, but:

julia> about(workbook.sst)
XLSX.SharedStringTable (mutable) (<: Any), occupies 24B directly (referencing 12MB in total)

julia> about(workbook.sst.shared_strings)
49419-element Vector{String} (mutable) (<: DenseVector{String} <: AbstractVector{String} <: Any), occupies 24B directly (referencing 5.2MB in total, holding 386kB of data)

julia> about(workbook.sst.index)
Dict{UInt64, Vector{Int64}} with 49419 entries (mutable) (<: AbstractDict{UInt64, Vector{Int64}} <: Any), occupies 64B directly (referencing 6.5MB in total)

Here are my questions:

  • Why is the first sst only 14MB when its parts seem to total 18.4MB?
  • Why is index in the second sst 6.5MB when it uses only 16 bytes for each (k, v) pair (plus vector overhead)?
  • Why do I apparently only save 2MB when, to my naive calculation, I should be saving much more than this?

For comparison, Excel’s internal sharedStrings.xml file is 4,278KB (uncompressed) and does not include any index.
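For reference, the collision-tolerant lookup mentioned above works roughly like this (a sketch with hypothetical helper names, not the actual functions in my branch):

function add_string!(sst::SharedStringTable, s::String)
    push!(sst.shared_strings, s)
    i = length(sst.shared_strings)
    push!(get!(Vector{Int64}, sst.index, hash(s)), i)   # group string indices by hash
    return i
end

function lookup_index(sst::SharedStringTable, s::String)
    for i in get(sst.index, hash(s), Int64[])
        sst.shared_strings[i] == s && return i   # resolve rare hash collisions
    end
    return nothing
end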


The parts are probably referencing the same data. E.g., the same string needn’t be duplicated between the parts. And I’m sure About.jl is being smart and avoids double-counting such things when they’re a part of the same structure. When you ask about them separately, though, they’ll of course be double-counted.
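A tiny illustration of the effect, using nothing from XLSX.jl (just two fields referencing one and the same vector):

s = ["some reasonably long shared string value"]
t = (a = s, b = s)   # both fields reference the same vector

Base.summarysize(t.a) + Base.summarysize(t.b)   # the string data is counted twice
Base.summarysize(t)                             # shared data counted once: smaller than the sum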

Those per-entry vectors seem meaningful when you have 50k of them. Each vector has 40 bytes of overhead itself, plus there are type tags and pointers to them. That’s over 2MB right there.
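Back-of-envelope, assuming no collisions (so one Int64 per vector) and taking the 40-byte figure above:

julia> 49419 * (8 + 8 + 40 + 8)   # UInt64 key + pointer to value + vector overhead + one Int64
3162816

That’s about 3.2MB before the Dict’s own slot, key, and value arrays (which are over-allocated for load factor) are counted, so 6.5MB in total isn’t surprising.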


So much effort, so little return! :sweat_smile:

Perhaps I should switch to using a tuple rather than a vector? The number of hash collisions is likely to be small, so most will only have a single value anyway.
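Or, as a sketch of a related idea (untested, just illustrating the shape): a small-union value that stores a bare Int64 until a collision actually happens:

index::Dict{UInt64, Union{Int64, Vector{Int64}}}   # field type in the struct

function push_index!(index, h::UInt64, i::Int64)
    v = get(index, h, nothing)
    if v === nothing
        index[h] = i                 # common case: no vector allocated at all
    elseif v isa Int64
        index[h] = Int64[v, i]       # first collision: promote to a vector
    else
        push!(v, i)                  # subsequent collisions
    end
end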

Dicts are themselves hash maps. Why not just use the strings as the keys directly and let Dict take care of everything for you?
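Something like this (illustrative; the Dict entry stores only an 8-byte reference to each string, not a copy of its characters):

index = Dict{String, Int64}()
for (i, s) in enumerate(shared_strings)
    get!(index, s, i)   # hashing and collision handling happen inside Dict
end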


I was labouring under the (perhaps false) impression that using strings as keys would take more memory. In my primary use case, strings can be long (many hundreds of characters). I realise my understanding here is weak! The memory footprint of XLSX.jl is very large - at least an order of magnitude greater than Excel itself.

Thank you for doing this work - reading large Excel files is something I often can’t get around in my line of work, so making this more performant is hugely welcome.

Out of interest, I had considered digging into this myself a few years ago, but after googling around I couldn’t find any fast Excel parser in other languages, which meant I didn’t know how much scope for improvement there might actually be. Do you have an idea of what good would look like, either from a parsing library in another language or from bottlenecks you’ve identified in XLSX.jl?


Actually, no, I don’t. I’m constantly looking for small ways to make a difference and hoping to stumble on a breakthrough! :rofl:

Small things that have made a difference include changing CellRef in a Cell struct to use Int32, creating an EmptyFormula type to use in cells that don’t actually have a formula, and not retaining the ZipArchives IO in the XLSXFile struct. Together these have made a small but material difference in the memory footprint.
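The EmptyFormula idea is just the zero-field singleton pattern; roughly (a simplified sketch, not the actual definitions):

abstract type AbstractFormula end

struct Formula <: AbstractFormula
    formula::String              # only cells that really have a formula pay for a String
end

struct EmptyFormula <: AbstractFormula end   # zero fields: a singleton, sizeof(EmptyFormula()) == 0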

Making the changes I’m pursuing in this thread has also made a small but material difference, and has very marginally increased performance into the bargain. Perhaps these aren’t optimised yet, though.

I do have more ideas to follow up, but it’s a constant struggle against my lack of secure understanding!

Thank you for your encouragement!


String in Julia is a bit special: even if the string itself is long, it only takes 8 bytes when stored in an array or other container. The actual string data is stored elsewhere, and the 8 bytes in the container are just a reference, used to access the string’s data and to prevent the GC from reclaiming the string’s memory.
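For example:

v = [repeat("x", 10_000)]   # a vector holding one ~10kB string
sizeof(v)                   # 8: only the reference lives in the vector's storage
Base.summarysize(v)         # ≈10kB: follows the reference to the string data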

Yes, but both About and Base.summarysize accurately find and report that data usage.

That would be because About uses Base.summarysize :slight_smile:
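So, assuming that delegation, the headline totals should be reproducible directly:

Base.summarysize(workbook.sst)   # should match About's "referencing ... in total" figure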
