`Serialization.serialize()` strings are not interned, causing duplication

Can anyone help me understand how serialization behaves when there are many shared references to large strings? I’ve noticed that a string referenced from multiple places ends up duplicated on disk when serialized via `Serialization.serialize()`.

For example:

julia> using Humanize, Serialization

julia> struct StringHolder
           a::String
           b::String
       end

julia> s = join(rand('a':'z', 1024*1024)); # 1 MiB string

julia> sh = StringHolder(s, s);

julia> Humanize.datasize(Base.summarysize(sh), style=:bin)  # The string is shared in memory.
"1.0 MiB"

julia> serialize("/tmp/sh", sh)

julia> run(`ls -lh /tmp/sh`)  # Size on disk is 2 MiB! (duplicated!)
-rw-r--r--  1 nathan.daly  wheel   2.0M Mar  6 00:06 /tmp/sh
Process(`ls -lh /tmp/sh`, ProcessExited(0))

julia> sh_deserialized = deserialize("/tmp/sh");   # It's big now! :'(

julia> Humanize.datasize(Base.summarysize(sh_deserialized), style=:bin)
"2.0 MiB"

Is there any way to ask Julia to intern the strings and write references to them throughout the data structures instead? If not, is there a workaround people commonly employ?

I’ve noticed that Julia already does exactly that for Symbols, so they are not duplicated on disk:

julia> struct SymbolHolder
           a::Symbol
           b::Symbol
       end

julia> s = Symbol(join(rand('a':'z', 1024*1024))); # 1 MiB Symbol

julia> sh = SymbolHolder(s, s);

julia> Humanize.datasize(Base.summarysize(sh), style=:bin)  # So tiny! I guess because it just holds integer identifiers for the Symbols?
"16.0 B"

julia> serialize("/tmp/sh", sh)

julia> run(`ls -lh /tmp/sh`)  # Only 1 MiB, because the Symbol is only written once!
-rw-r--r--  1 nathan.daly  wheel   1.0M Mar  6 00:01 /tmp/sh
Process(`ls -lh /tmp/sh`, ProcessExited(0))

julia> sh_deserialized = deserialize("/tmp/sh");

julia> Humanize.datasize(Base.summarysize(sh_deserialized), style=:bin)
"16.0 B"

Can we do something like that for String serialization as well? Maybe with a threshold based on the length of the string?
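
On the workaround question, one stopgap that might work is to put the string behind a mutable wrapper. As far as I understand, the serializer tracks mutable objects by identity and writes a back-reference the second time it sees one, so a wrapper shared by both fields should only hit the disk once. The sketch below is just that: `SharedString` and `WrappedHolder` are names I made up for illustration, not anything from Base or Serialization, and it serializes to an `IOBuffer` rather than a file to keep it self-contained.

```julia
using Serialization

# Hypothetical mutable wrapper (not part of any library). The assumption is that
# the serializer records mutable objects by identity and emits a back-reference
# when the same object is encountered again.
mutable struct SharedString
    value::String
end

# Hypothetical holder that shares one wrapper between two fields.
struct WrappedHolder
    a::SharedString
    b::SharedString
end

s = join(rand('a':'z', 1024 * 1024))  # 1 MiB string
shared = SharedString(s)
wh = WrappedHolder(shared, shared)

# Serialize into memory and measure how many bytes were written.
io = IOBuffer()
serialize(io, wh)
bytes = take!(io)
nbytes = length(bytes)

# Round-trip and check that both fields still point at the same wrapper.
wh2 = deserialize(IOBuffer(bytes))
same_wrapper = wh2.a === wh2.b
```

If the assumption holds, `nbytes` should come out close to 1 MiB rather than 2 MiB and `same_wrapper` should be `true`. The obvious cost is that every access has to go through `.value`, which is why built-in support in the serializer would still be nicer.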

