`Serialization.serialize()` strings are not interned, causing duplication

NHDaly · March 6, 2020, 5:11am

Can anyone help me understand how serialization behaves when a data structure holds many shared references to the same large string? I’ve noticed that a string referenced from several fields is written to disk once per reference when serialized via `Serialization.serialize()`.

For example:

julia> using Humanize, Serialization

julia> struct StringHolder
           a::String
           b::String
       end

julia> s = join(rand('a':'z', 1024*1024)); # 1 MiB string

julia> sh = StringHolder(s, s);

julia> Humanize.datasize(Base.summarysize(sh), style=:bin)  # The string is shared in memory.
"1.0 MiB"

julia> serialize("/tmp/sh", sh)

julia> run(`ls -lh /tmp/sh`)  # Size on disk is 2 MiB! (duplicated!)
-rw-r--r--  1 nathan.daly  wheel   2.0M Mar  6 00:06 /tmp/sh
Process(`ls -lh /tmp/sh`, ProcessExited(0))

julia> sh_deserialized = deserialize("/tmp/sh");   # It's big now! :'(

julia> Humanize.datasize(Base.summarysize(sh_deserialized), style=:bin)
"2.0 MiB"

Is there any way to ask Julia to intern the strings and write only references to them throughout the data structures? If not, is there a workaround people commonly employ?
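To make the question concrete, here is a minimal sketch of one possible workaround: store each distinct string once in a pool and keep integer indices in the fields. This is only an illustration under assumptions. `PooledStringHolder`, `first_string`, `second_string`, and `serialized_size` are hypothetical names invented for this example, not part of `Serialization` or any package.

```julia
using Serialization

# Hypothetical pooled layout: each distinct string lives once in `pool`,
# and the fields hold indices into it instead of the strings themselves.
struct PooledStringHolder
    pool::Vector{String}
    a::Int
    b::Int
end

# Build the pooled form, reusing a single pool slot when both strings are equal.
function PooledStringHolder(a::String, b::String)
    return a == b ? PooledStringHolder([a], 1, 1) : PooledStringHolder([a, b], 1, 2)
end

# Accessors that resolve the indices back to strings.
first_string(h::PooledStringHolder) = h.pool[h.a]
second_string(h::PooledStringHolder) = h.pool[h.b]

# Serialize to an in-memory buffer and return the bytes that were written.
function serialized_bytes(x)
    io = IOBuffer()
    serialize(io, x)
    return take!(io)
end

s = join(rand('a':'z', 1024 * 1024))   # 1 MiB string
h = PooledStringHolder(s, s)

bytes = serialized_bytes(h)
nbytes = length(bytes)                  # roughly 1 MiB, since the string is stored once

h2 = deserialize(IOBuffer(bytes))       # round trip within the same session
```

The cost of this layout is that every consumer has to go through the accessors, and the pool must be built by hand for each data structure, which is why built-in support would be preferable.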

I’ve noticed that we already do exactly that for Symbols, so they are not duplicated on disk:

julia> struct SymbolHolder
           a::Symbol
           b::Symbol
       end

julia> s = Symbol(join(rand('a':'z', 1024*1024))); # 1 MiB Symbol

julia> sh = SymbolHolder(s, s);

julia> Humanize.datasize(Base.summarysize(sh), style=:bin)  # So tiny! I guess because it just holds integer identifiers for the Symbols?
"16.0 B"

julia> serialize("/tmp/sh", sh)

julia> run(`ls -lh /tmp/sh`)  # Only 1 MiB, because the Symbol is only written once!
-rw-r--r--  1 nathan.daly  wheel   1.0M Mar  6 00:01 /tmp/sh
Process(`ls -lh /tmp/sh`, ProcessExited(0))

julia> sh_deserialized = deserialize("/tmp/sh");

julia> Humanize.datasize(Base.summarysize(sh_deserialized), style=:bin)
"16.0 B"

Can we do something like that for String serialization as well? Maybe with some threshold based on the length of the string or something?
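As a rough illustration of the threshold idea, the sketch below interns strings in memory, for example after deserializing, so that equal strings at or above a size cutoff share one allocation. It does not change what `serialize` writes. The `intern!` helper and its `min_bytes` keyword are hypothetical names chosen for this example, and the 64-byte default is an arbitrary assumption.

```julia
# Hypothetical helper: return a canonical copy of `s` so that equal strings of
# at least `min_bytes` bytes share one allocation. Shorter strings are returned
# unchanged, since pooling them would cost more than it saves.
function intern!(pool::Dict{String,String}, s::String; min_bytes::Int = 64)
    sizeof(s) < min_bytes && return s
    return get!(pool, s, s)   # first occurrence becomes the canonical copy
end

pool = Dict{String,String}()

a = join(rand('a':'z', 1024))
b = String(collect(codeunits(a)))   # equal contents, separate allocation

ia = intern!(pool, a)
ib = intern!(pool, b)               # resolves to the copy stored for `a`

short = intern!(pool, "abc")        # below the threshold, so not pooled
```

After interning, `ia` and `ib` refer to the same underlying string data, so a structure holding both only keeps one copy alive.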

Thanks!
