Can anyone help me understand how serialization handles many shared references to large strings? I’ve noticed that when several fields reference the same string, the string is written to disk once per reference when serialized via Serialization.serialize().
For example:
julia> using Humanize, Serialization
julia> struct StringHolder
           a::String
           b::String
       end
julia> s = join(rand('a':'z', 1024*1024)); # 1 MiB string
julia> sh = StringHolder(s, s);
julia> Humanize.datasize(Base.summarysize(sh), style=:bin)  # The string is shared in memory.
"1.0 MiB"
julia> serialize("/tmp/sh", sh)
julia> run(`ls -lh /tmp/sh`)  # Size on disk is 2 MiB! (duplicated!)
-rw-r--r--  1 nathan.daly  wheel   2.0M Mar  6 00:06 /tmp/sh
Process(`ls -lh /tmp/sh`, ProcessExited(0))
julia> sh_deserialized = deserialize("/tmp/sh");   # It's big now! :'(
julia> Humanize.datasize(Base.summarysize(sh_deserialized), style=:bin)
"2.0 MiB"
Is there any way to ask Julia to intern the strings and write only references to them throughout the data structures? If not, is there a workaround people commonly employ?
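To make concrete what I mean by interning, here is a minimal sketch of doing it by hand: each distinct string is stored once in a pool, and the fields hold integer indices into that pool. The names `PooledHolder` and `intern!` are hypothetical and made up for this illustration; they are not part of Serialization or any package.

```julia
using Serialization

# Hypothetical pooled layout: each distinct string lives once in `pool`,
# and the fields store integer indices into it instead of the strings.
struct PooledHolder
    pool::Vector{String}
    a::Int
    b::Int
end

# Hypothetical helper: return the pool index of `str`, appending it on first use.
function intern!(pool::Vector{String}, index::Dict{String,Int}, str::String)
    return get!(index, str) do
        push!(pool, str)
        length(pool)
    end
end

s = join(rand('a':'z', 1024 * 1024))   # 1 MiB string, as in the example above
pool = String[]
index = Dict{String,Int}()
ph = PooledHolder(pool, intern!(pool, index, s), intern!(pool, index, s))

path = tempname()
serialize(path, ph)
@show length(pool) filesize(path)      # compare the file size with the 1 MiB payload

ph2 = deserialize(path)
@show ph2.pool[ph2.a] == s             # the string is recovered through its index
```

The obvious downside is that every access has to go through the pool, which is why I would prefer the serializer to handle this itself.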
I’ve noticed that Julia already does exactly that for Symbols, so they are not duplicated on disk:
julia> struct SymbolHolder
           a::Symbol
           b::Symbol
       end
julia> s = Symbol(join(rand('a':'z', 1024*1024))); # 1 MiB Symbol
julia> sh = SymbolHolder(s, s);
julia> Humanize.datasize(Base.summarysize(sh), style=:bin)  # So tiny! I guess because it just holds pointers to the interned Symbols?
"16.0 B"
julia> serialize("/tmp/sh", sh)
julia> run(`ls -lh /tmp/sh`)  # Only 1 MiB, because the Symbol is only written once!
-rw-r--r--  1 nathan.daly  wheel   1.0M Mar  6 00:01 /tmp/sh
Process(`ls -lh /tmp/sh`, ProcessExited(0))
julia> sh_deserialized = deserialize("/tmp/sh");
julia> Humanize.datasize(Base.summarysize(sh_deserialized), style=:bin)
"16.0 B"
Can we do something like that for String serialization as well, perhaps only for strings above some length threshold?
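As a stopgap, here is a minimal sketch of another workaround. It assumes the serializer tracks mutable objects by identity within a single `serialize` call, so a shared mutable wrapper is written once and later occurrences become back-references. `SharedString` and `WrappedHolder` are hypothetical names for this illustration.

```julia
using Serialization

# Hypothetical wrapper. Assumption: mutable objects are tracked by identity
# during one serialize call, so a shared wrapper is written only once.
mutable struct SharedString
    value::String
end

struct WrappedHolder
    a::SharedString
    b::SharedString
end

shared = SharedString(join(rand('a':'z', 1024 * 1024)))  # 1 MiB payload
wh = WrappedHolder(shared, shared)

wrapped_path = tempname()
serialize(wrapped_path, wh)
@show filesize(wrapped_path)       # compare with the 1 MiB payload

wh2 = deserialize(wrapped_path)
@show wh2.a === wh2.b              # is the wrapper still shared after the round trip?
@show wh2.a.value == shared.value  # does the content survive?
```

This costs an extra allocation and a pointer indirection per shared string, so it only seems worthwhile for large strings.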
Thanks!