I need a bit of advice on interpreting the results of About.jl. I’m doing some work trying to make marginal gains in the memory demands of XLSX.jl and am currently focusing on the way Excel’s sharedStrings are handled. XLSX.jl stores these in a struct like this:
```julia
mutable struct SharedStringTable
    unformatted_strings::Vector{String}
    formatted_strings::Vector{String}
    index::Dict{String, Int64}
    is_loaded::Bool
end
```
The `formatted_strings` are the raw XML representations of the textual cell values and may (rarely) include some rich-text formatting information.
The `unformatted_strings` are the parsed plain-text values.
The keys in the dict are (duplicates of) the `unformatted_strings`.
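In other words, my mental model of the current design is roughly the following (a sketch only; the function name and the assumption that the dict values are positions into the two vectors are mine, not the package’s actual API):

```julia
# Sketch of my reading of the current design (not actual XLSX.jl code):
# the dict maps a plain-text value to its position in both string vectors.
function current_lookup(sst::SharedStringTable, plain::String)
    i = sst.index[plain]    # keys duplicate the unformatted strings
    return (index = i,
            raw_xml = sst.formatted_strings[i],       # may carry rich-text runs
            plain_text = sst.unformatted_strings[i])
end
```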
If I read an Excel file containing a large number of string values, I can use About.jl to find some size information:
```julia
julia> about(workbook.sst)
XLSX.SharedStringTable (mutable) (<: Any), occupies 32B directly (referencing 14MB in total)

julia> about(workbook.sst.unformatted_strings)
49419-element Vector{String} (mutable) (<: DenseVector{String} <: AbstractVector{String} <: Any), occupies 24B directly (referencing 4.2MB in total, holding 386kB of data)

julia> about(workbook.sst.formatted_strings)
49419-element Vector{String} (mutable) (<: DenseVector{String} <: AbstractVector{String} <: Any), occupies 24B directly (referencing 5.2MB in total, holding 386kB of data)

julia> about(workbook.sst.index)
Dict{String, Int64} with 49419 entries (mutable) (<: AbstractDict{String, Int64} <: Any), occupies 64B directly (referencing 9.0MB in total)
```
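If it helps interpret these, I assume the “referencing … in total” figures are meant to be comparable to `Base.summarysize`, which counts each unique reachable object once, e.g.:

```julia
# Cross-check against Base.summarysize (counts every unique reachable object once)
Base.summarysize(workbook.sst)
Base.summarysize(workbook.sst.unformatted_strings)
Base.summarysize(workbook.sst.formatted_strings)
Base.summarysize(workbook.sst.index)
```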
In my dev’d code, I’ve abandoned eager conversion to plain strings in favour of lazy parsing and now have a simpler struct, like:
```julia
mutable struct SharedStringTable
    shared_strings::Vector{String}
    index::Dict{UInt64, Vector{Int64}}
    is_loaded::Bool
end
```
Here, the `shared_strings` are identical to the `formatted_strings` above. The dict keys, however, are simply the hashes of the `shared_strings`, with a `Vector{Int64}` as each dict value to accommodate any (very few, if any) hash collisions.
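For reference, the lookup I have in mind with this index is roughly the following (the function name and the exact collision handling are just a sketch, not the final code):

```julia
# Rough sketch (not the final XLSX.jl code): resolve a raw shared string to its
# index by hashing it, then scanning the (almost always length-1) collision vector.
function find_shared_string(sst::SharedStringTable, s::AbstractString)
    for i in get(sst.index, hash(s), Int64[])
        sst.shared_strings[i] == s && return i
    end
    return nothing
end
```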
I would expect this struct to be much smaller, but:

```julia
julia> about(workbook.sst)
XLSX.SharedStringTable (mutable) (<: Any), occupies 24B directly (referencing 12MB in total)

julia> about(workbook.sst.shared_strings)
49419-element Vector{String} (mutable) (<: DenseVector{String} <: AbstractVector{String} <: Any), occupies 24B directly (referencing 5.2MB in total, holding 386kB of data)

julia> about(workbook.sst.index)
Dict{UInt64, Vector{Int64}} with 49419 entries (mutable) (<: AbstractDict{UInt64, Vector{Int64}} <: Any), occupies 64B directly (referencing 6.5MB in total)
```
Here are my questions:
- Why is the first `sst` only 14MB when its parts seem to total 18.4MB?
- Why is `index` in the second `sst` 6.5MB when it uses only 16 bytes for each `(k, v)` pair (plus vector overhead)?
- Why do I apparently only save 2MB when, by my naive calculation, I should be saving much more than this? (Rough arithmetic below.)
For comparison, Excel’s internal sharedStrings.xml file is 4,278KB (uncompressed) and does not include any index.
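To make the naive calculation concrete, this is roughly the arithmetic behind my expectation (the 16 bytes per pair, and ignoring the hash-table load factor and the per-`Vector{Int64}` allocation overhead, are of course exactly the assumptions I may be getting wrong):

```julia
# Naive size of the new index: 49419 entries × (8-byte UInt64 key + 8-byte value slot),
# ignoring load factor and the per-Vector{Int64} allocations.
n = 49_419
naive_index_mb = n * (8 + 8) / 1e6    # ≈ 0.79 MB, vs the reported 6.5 MB

# Naive expected saving: drop unformatted_strings (4.2 MB) and replace the
# 9.0 MB string-keyed dict with the smaller index.
4.2 + (9.0 - naive_index_mb)          # ≈ 12.4 MB expected, vs the ~2 MB I actually see
```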