Can anyone help me understand the serialization behavior for many shared references to large strings? I’ve noticed that multiple shared references to strings end up duplicated on disk when serialized via Serialization.serialize().
For example:
julia> using Humanize, Serialization
julia> struct StringHolder
           a::String
           b::String
       end
julia> s = join(rand('a':'z', 1024*1024)); # 1 MiB string
julia> sh = StringHolder(s, s);
julia> Humanize.datasize(Base.summarysize(sh), style=:bin) # The string is shared in memory.
"1.0 MiB"
julia> serialize("/tmp/sh", sh)
julia> run(`ls -lh /tmp/sh`) # Size on disk is 2 MiB! (duplicated!)
-rw-r--r-- 1 nathan.daly wheel 2.0M Mar 6 00:06 /tmp/sh
Process(`ls -lh /tmp/sh`, ProcessExited(0))
julia> sh_deserialized = deserialize("/tmp/sh"); # It's big now! :'(
julia> Humanize.datasize(Base.summarysize(sh_deserialized), style=:bin)
"2.0 MiB"
Is there any way to ask Julia to intern the strings and just write references to them throughout the data structures instead? If not, is there a workaround people commonly employ?
I’ve noticed that we already do exactly that for Symbols, so they are not duplicated on disk:
julia> struct SymbolHolder
           a::Symbol
           b::Symbol
       end
julia> s = Symbol(join(rand('a':'z', 1024*1024))); # 1 MiB Symbol
julia> sh = SymbolHolder(s, s);
julia> Humanize.datasize(Base.summarysize(sh), style=:bin) # So tiny! I guess because it just holds integer identifiers for the Symbols?
"16.0 B"
julia> serialize("/tmp/sh", sh)
julia> run(`ls -lh /tmp/sh`) # Only 1 MiB, because the Symbol is only written once!
-rw-r--r-- 1 nathan.daly wheel 1.0M Mar 6 00:01 /tmp/sh
Process(`ls -lh /tmp/sh`, ProcessExited(0))
julia> sh_deserialized = deserialize("/tmp/sh");
julia> Humanize.datasize(Base.summarysize(sh_deserialized), style=:bin)
"16.0 B"
Can we do something like that for String serialization as well? Maybe with some threshold based on the length of the string or something?
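For reference, the only manual workaround I can think of is to intern the strings myself before serializing: keep each distinct string once in a vector and store integer indices in the structs. Below is a rough sketch of that idea, not something Serialization provides. The names StringPool, intern! and PooledHolder are hypothetical helpers I made up for illustration.

```julia
using Serialization

# Hypothetical helper types; not part of Base or Serialization.
struct StringPool
    strings::Vector{String}      # each distinct string is stored exactly once
    index::Dict{String,Int}      # string => its position in `strings`
end
StringPool() = StringPool(String[], Dict{String,Int}())

# Return the pool index for `s`, adding it to the pool on first sight.
function intern!(pool::StringPool, s::String)
    get!(pool.index, s) do
        push!(pool.strings, s)
        length(pool.strings)
    end
end

# Holds indices into the pool instead of the strings themselves.
struct PooledHolder
    a::Int
    b::Int
end

s = join(rand('a':'z', 1024 * 1024))   # 1 MiB string
pool = StringPool()
ph = PooledHolder(intern!(pool, s), intern!(pool, s))

path = tempname()
# Write only the string vector (not the lookup Dict, whose keys would
# repeat the strings) together with the index-based holder.
serialize(path, (pool.strings, ph))

strings, ph2 = deserialize(path)
a, b = strings[ph2.a], strings[ph2.b]  # look the strings back up by index
```

With this layout the string is written once no matter how many holders point at it, but every struct has to be rewritten to carry indices, which is why I'd much rather have the serializer handle it.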
Thanks!