Extract an `AbstractVector{T}` from an `AbstractVector{Union{T, Missing}}`

I know that the elements at a certain subset of the indices of an AbstractVector{Union{T, Missing}} are not missing. I want to manipulate that subset efficiently.

How can I extract x::AbstractVector{T} from y::AbstractVector{Union{T, Missing}} such that x[i] === y[i] whenever y[i] !== missing, with minimal runtime overhead both at creation and at use?

I am okay with undefined behavior if I attempt to access a missing element.

You’ll probably have something like

function extract!(avec::AbstractVector{T}, avecm::AbstractVector{Union{T, Missing}}, is) where {T}
    for i in is
        avec[i] = avecm[i]
    end
    avec
end

function extract(avecm::AbstractVector{Union{T, Missing}}, is)::AbstractVector{T} where {T}
    avec = similar(avecm, T, length(avecm))
    extract!(avec, avecm, is)
end

using BenchmarkTools

@btime extract!(vec, vecm, is) setup=(
    vecm = Vector{Union{Int, Missing}}(undef, 10);
    is = [1, 3, 5, 7, 9];
    vecm[is] .= is;
    vec = similar(vecm, Int, length(vecm));
)

@btime extract(vecm, is) setup=(
    vecm = Vector{Union{Int, Missing}}(undef, 10);
    is = [1, 3, 5, 7, 9];
    vecm[is] .= is;
)

as a baseline already, yielding

  8.000 ns (0 allocations: 0 bytes)
  34.340 ns (1 allocation: 144 bytes)

This is nice, but I’m looking for something substantially faster. I would like to avoid allocating and copying data at all, and to have constant runtime with respect to input length. I think this should be possible, at least for Vector{Union{T, Missing}} with a bits type T, because such a vector is stored internally as a block of data plus a type tag per element (one byte each, as far as I know) recording whether that element is missing. I just want to get a handle on the internal block of data.

I’m looking for something like x = reinterpret(T, y).
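Whether that inline layout applies depends on the element type. A quick sanity check, assuming the internal (unexported) helper Base.isbitsunion behaves as it does in current Julia 1.x releases:

```julia
# Base.isbitsunion is internal and unexported; use it for inspection only.
inline_int = Base.isbitsunion(Union{Int, Missing})     # true: values stored inline, plus a type tag
inline_str = Base.isbitsunion(Union{String, Missing})  # false: elements are stored as references

@show inline_int inline_str
```

Only in the first case is there a contiguous block of T values that a pointer-based view could expose.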

For Arrays, I think this might be safe:

# The wrapped array does not own its memory: avecm must stay alive while the result is in use.
without_missing(avecm::Array{Union{T, Missing}}) where T =
    unsafe_wrap(Array, reinterpret(Ptr{T}, pointer(avecm)), size(avecm))

@btime without_missing(v) setup=(
    v = Vector{Union{Int, Missing}}(undef, 100000);
    is = 1:90000;
    v[is] .= is;
);
# 35.259 ns (2 allocations: 64 bytes)
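A minimal usage sketch of the same pointer-based idea, assuming T is a bits type and relying on the inline storage layout rather than on a documented guarantee. The name unsafe_nonmissing is hypothetical; GC.@preserve keeps the original array alive while the wrapper is in use:

```julia
# Hypothetical helper: wraps the data block of an inline-stored Union array.
function unsafe_nonmissing(y::Array{Union{T, Missing}}) where {T}
    isbitstype(T) || throw(ArgumentError("T must be a bits type"))
    return unsafe_wrap(Array, Ptr{T}(pointer(y)), size(y))
end

y = Vector{Union{Int, Missing}}(missing, 6)
y[[1, 3, 5]] .= [10, 30, 50]

# x shares memory with y but does not own it, so y must outlive x.
GC.@preserve y begin
    x = unsafe_nonmissing(y)
    first_value = x[1]   # reads the stored Int at a non-missing index
    x[3] = 33            # writes through to y's data; the type tag is untouched
    third_value = y[3]
end
```

Reading x at an index where y is missing returns whatever bits happen to be in the data block, which matches the "undefined behavior is acceptable" requirement of the question.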

But for AbstractArrays, I still don’t know.
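One possible fallback for general AbstractVectors is a lazy wrapper that stores only a reference to the original and asserts the element type on access. This is a sketch, not an established API: the names AssumeNotMissing and assume_not_missing are hypothetical, and the wrapper assumes a parent with ordinary 1-based indexing. Creation is constant-time and copies nothing; each access pays for a type check and throws a TypeError on a missing element instead of being undefined.

```julia
# Hypothetical wrapper type; not part of Base or any package.
struct AssumeNotMissing{T, A<:AbstractVector{Union{T, Missing}}} <: AbstractVector{T}
    data::A
end

# Constant-time construction: only a reference to the parent is stored.
assume_not_missing(y::AbstractVector{Union{T, Missing}}) where {T} =
    AssumeNotMissing{T, typeof(y)}(y)

Base.size(x::AssumeNotMissing) = size(x.data)

# The type assertion narrows Union{T, Missing} to T, or throws a TypeError.
Base.getindex(x::AssumeNotMissing{T}, i::Int) where {T} = x.data[i]::T

Base.setindex!(x::AssumeNotMissing, v, i::Int) = (x.data[i] = v)

y = Union{Float64, Missing}[1.5, missing, 2.5]
x = assume_not_missing(y)
total = x[1] + x[3]      # only non-missing indices are touched
```

Because the wrapper subtypes AbstractVector{T}, generic code that indexes only the known non-missing positions sees plain T values.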
