Arrow.jl encodes Union{Missing, T} concisely, but Union{Nothing, T} takes more space.
using Arrow
for (label, col) in [
    "Union{Missing,String}" => Union{Missing,String}["a", missing, "b"],
    "Union{Nothing,String}" => Union{Nothing,String}["a", nothing, "b"],
]
    Arrow.write("/tmp/t.arrow", (; x = col))
    t = Arrow.Table("/tmp/t.arrow")
    println("$label => eltype=$(eltype(t.x)), $(filesize("/tmp/t.arrow")) bytes")
end
% jl /tmp/arrow_nulls.jl
Union{Missing,String} => eltype=Union{Missing, String}, 474 bytes
Union{Nothing,String} => eltype=Union{Missing, Nothing, String}, 1082 bytes
- At the moment we get a 3-way `Union{Missing, Nothing, String}` even though no `missing` value is present. Is that intended?
- If I want to use `nothing` rather than `missing` in my data, is there a way to get concise encoding other than writing `missing`s and manually converting them to `nothing`s after load?
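For reference, the manual workaround mentioned in the second question would look roughly like the sketch below. The helper names `to_missing` and `to_nothing` are hypothetical (they are not part of Arrow.jl), and the example uses only `Base.replace` on an in-memory vector so it runs without any packages.

```julia
# Hypothetical helpers, not part of Arrow.jl: swap the null sentinel
# on either side of the file round trip.
to_missing(col) = replace(col, nothing => missing)   # apply before writing
to_nothing(col) = replace(col, missing => nothing)   # apply after loading

original = Union{Nothing,String}["a", nothing, "b"]

# The column that would be handed to Arrow.write; its element type no
# longer includes Nothing.
on_disk = to_missing(original)

# collect stands in for materializing a loaded column into a plain Vector
# before converting back.
restored = to_nothing(collect(on_disk))

println(eltype(on_disk))
println(eltype(restored))
```

In the script above this would mean writing `(; x = to_missing(col))` and calling `to_nothing(collect(t.x))` after `Arrow.Table`. It works, but it costs an extra copy of each affected column in both directions, which is why a built-in option would be nicer.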
cc @quinnj