Custom serializer for arrays of Union types

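I’m looking to speed up serialization of arrays that allow for missing values. In the following test, serializing a Union{Int, Missing} array filled with missings is about an order of magnitude slower than serializing a plain Int array of the same length: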

	using Serialization
	
	N = Int(1e8);
	buffer = Vector{UInt8}(undef, 8 * N);
	io = IOBuffer(buffer, write=true)
	
	test_cases = [
		("BitArray of trues", BitArray(true for i in 1:N)),
		("Array of zeros (Int)", zeros(Int, N)),
		("Uninitialized Vector{Int}", Vector{Int}(undef, N)),
		("Array of zeros (Union{Int, Float64})", zeros(Union{Int, Float64}, N)),
		("Array of zeros (Union{Int, Missing})", zeros(Union{Int, Missing}, N)),
		("Array of missings (Missing)", Vector{Missing}(missing, N)),
		("Array of missings (Union{Int, Missing})", Vector{Union{Int, Missing}}(missing, N)),
		("Uninitialized Vector{Union{Int, Float64}}", Vector{Union{Int, Float64}}(undef, N))
	]
	
	for (desc, arr) in test_cases
		println("\n$desc:")
		truncate(io, 0)  # rewind io between runs; empty!(buffer) does not reset io's write position
		@time Serialization.serialize(io, arr)
	end

My results:

BitArray of trues:
  0.003427 seconds (24 allocations: 1.516 KiB)

Array of zeros (Int):
  0.354058 seconds (20 allocations: 351.369 MiB, 18.31% gc time)

Uninitialized Vector{Int}:
  0.459178 seconds (20 allocations: 651.290 MiB, 13.53% gc time)

Array of zeros (Union{Int, Float64}):
  0.535546 seconds (20 allocations: 512.001 MiB, 5.32% gc time)

Array of zeros (Union{Int, Missing}):
  0.516552 seconds (20 allocations: 512.001 MiB, 2.61% gc time)

Array of missings (Missing):
  0.000015 seconds (18 allocations: 1.422 KiB)

Array of missings (Union{Int, Missing}):
  6.071769 seconds (100.00 M allocations: 1.990 GiB, 6.16% gc time)

Uninitialized Vector{Union{Int, Float64}}:
  2.304365 seconds (100.00 M allocations: 1.990 GiB, 15.80% gc time)

I found code here that shows how to override serialization for a type:

	using Serialization
	
	# The target struct
	struct Foo
	    x::Int
	    y::Union{Int, Nothing} # we do not want to serialize this field
	end
	
	# Custom serialization of a Foo instance
	function Serialization.serialize(s::AbstractSerializer, instance::Foo)
	    Serialization.writetag(s.io, Serialization.OBJECT_TAG)
	    Serialization.serialize(s, Foo)
	    Serialization.serialize(s, instance.x)
	end
	
	# Custom deserialization of a Foo instance
	function Serialization.deserialize(s::AbstractSerializer, ::Type{Foo})
	    x = Serialization.deserialize(s)
	    Foo(x, nothing)
	end
	
	foo1 = Foo(1, 2)
	
	# Serialization
	write_iob = IOBuffer()
	serialize(write_iob, foo1)
	seekstart(write_iob)
	content = read(write_iob)
	
	# Deserialization
	read_iob = IOBuffer(content)
	foo2 = deserialize(read_iob)
	
	@show foo1
	@show foo2

But what I would really like to do is check whether an array’s element type is a Union that includes Missing, serialize it in a compact form (using a BitArray to flag the missing entries), and pass it to the default implementation if I can’t handle the structure. I’m not sure how to intercept serialization for AbstractVector while keeping the default implementation available as a fallback. I’m also not sure whether I can easily subtype AbstractSerializer, or how best to organize a custom serializer.
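
The element-type check described above can be done with exported Base functions only, without touching Serialization internals. This is a minimal sketch, assuming that only numeric bits types are worth handling; is_packable is a hypothetical helper name, not part of any library:

```julia
# Hypothetical helper: true when `a` allows `missing` and the rest of its
# element type is a single concrete, plain-bits number type (Int, Float64, ...).
function is_packable(a::AbstractVector)
    E = eltype(a)
    T = nonmissingtype(E)          # Union{Int, Missing} -> Int, Missing -> Union{}
    return Missing <: E &&         # the array can hold `missing`
           isconcretetype(T) &&    # rules out Any, Union{Int, Float64}, Union{}
           T <: Number &&
           isbitstype(T)
end

# Anything that is not a vector is never packable.
is_packable(::Any) = false
```

A caller can branch on this predicate and hand everything else to serialize unchanged, which keeps the default implementation as the fallback without adding methods for AbstractVector to Serialization.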

Thanks for any help you can offer!

Why can’t you use Arrow.jl?

I had no idea about Arrow.jl. I’m not sure how missing values would affect its performance. I don’t really need the language interop aspect of Arrow.

I think I’m just going to add temporary bit vectors to the DataFrames so that I can hide the missing values before transmitting and restore them afterwards.
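
The hide/unhide idea can be sketched for a single column as follows. PackedMissing, pack, unpack, and roundtrip are hypothetical names; the sketch assumes numeric bits element types, stores zero(T) as a placeholder wherever the mask is set, and returns any other input untouched so that the default serializer handles it:

```julia
using Serialization

# Hypothetical container: dense values plus a mask of the missing positions.
struct PackedMissing{T}
    values::Vector{T}   # zero(T) wherever mask is true
    mask::BitVector     # true where the original entry was missing
end

# Hide the missings of a packable vector; return anything else unchanged.
function pack(v::AbstractVector)
    E = eltype(v)
    T = nonmissingtype(E)
    if Missing <: E && T <: Number && isbitstype(T)
        vals = T[coalesce(x, zero(T)) for x in v]
        return PackedMissing{T}(vals, ismissing.(v))
    end
    return v
end
pack(x) = x

# Restore the missings; pass through anything that was not packed.
function unpack(p::PackedMissing{T}) where {T}
    out = Vector{Union{T, Missing}}(undef, length(p.values))
    for i in eachindex(p.values)
        out[i] = p.mask[i] ? missing : p.values[i]
    end
    return out
end
unpack(x) = x

# Serialize with the default machinery, then read back and restore.
function roundtrip(v)
    io = IOBuffer()
    serialize(io, pack(v))
    seekstart(io)
    return unpack(deserialize(io))
end
```

Because the packed form holds only a Vector{T} and a BitVector, both of which were fast in the timings above, the cost of the Union{Int, Missing} case should mostly move into pack and unpack. That is an expectation rather than a measurement.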