Avoiding a copy after removing missings from a vector

In Julia I am trying to write a function that will (without copying) compute something from two vectors. As an example think of Pearson correlation (although it is not that). What I have got is:

function get_valid_entries(x::Vector, y::Vector)
    valid_entries_in_x = .!(ismissing.(x))
    valid_entries_in_y = .!(ismissing.(y))
    return findall(valid_entries_in_x .& valid_entries_in_y)
end

x = [4.0, missing, 6.0, 5.0, 8.0, 10.0]
y = [2.0, 3.0, missing, 5.0, 6.0, 2.0]
valids = get_valid_entries(x, y)

function do_something(x::Vector{R}, y::Vector{R}) where R<:Real
    println("This is do_something(x::Vector{R}, y::Vector{R})")
end
do_something(x[valids], y[valids])

This does not work, as x[valids] and y[valids] are Vector{Union{Missing, Float64}}. I was hoping to avoid copies, so I don’t want to just convert the vectors with Vector{Float64}(x[valids]).

Is there a way I can avoid a copy? I was hoping I could avoid copying the vector slices now that there are no missings and the data in there presumably doesn’t have to be rewritten.

The other question is whether, for a long vector, it is generally faster to do the copy anyway (as the copied vector will be contiguous in memory).
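
For context, the kind of comparison I have in mind is something like the following (xbig, idx and the sizes are made up, and sum is just standing in for the real computation):

using BenchmarkTools

n = 1_000_000
xbig = Vector{Union{Missing, Float64}}(rand(n))
xbig[rand(1:n, n ÷ 100)] .= missing      # sprinkle in some missings
idx = findall(!ismissing, xbig)

@btime sum(view($xbig, $idx))                   # no copy, Union eltype
@btime sum(Vector{Float64}(view($xbig, $idx)))  # one copy, contiguous Float64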

Can you be more specific on what the final function actually needs to do in your case? And why it has a Real type constraint?


The function that I am feeding this into only makes sense for real values. I am using a similar coding pattern a few times (removing invalid entries and then feeding the result into a different function), but two of the things I am doing are Pearson correlation and Spearman correlation (I realise both are in StatsBase, but I am trying out some potential improvements).

I guess I could just define those functions as taking Vector{Union{R, Missing}}, but I would prefer to have stricter typing… it just seems cleaner if it is possible.
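
i.e. something along these lines, which dispatches on the Union element type directly (just a sketch mirroring the do_something example above):

function do_something(x::AbstractVector{Union{R, Missing}},
                      y::AbstractVector{Union{R, Missing}}) where R<:Real
    println("This is the looser Union{R, Missing} method")
end

do_something(x[valids], y[valids])   # dispatches here without any conversion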

It is not possible to avoid the copy from a Vector{Union{T, Missing}} to a Vector{T}. I wonder whether it could ever be possible, but I suspect not.

Another way you could speed it up is to avoid allocating the three intermediate arrays (the two boolean masks and the findall result). Instead, compute a single array of valid indices in one pass over x and y. Alternatively, do two passes: one to count the number of elements in the resulting Vector{T}, and another to copy the elements.
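
A rough sketch of that two-pass variant for the two-vector case (copy_nonmissing is just a made-up name) could be:

function copy_nonmissing(x::AbstractVector{Union{T, Missing}},
                         y::AbstractVector{Union{S, Missing}}) where {T, S}
    # first pass: count how many positions are non-missing in both vectors
    n = count(i -> !ismissing(x[i]) && !ismissing(y[i]), eachindex(x, y))
    xout = Vector{T}(undef, n)
    yout = Vector{S}(undef, n)
    # second pass: fill the preallocated outputs
    j = 0
    for i in eachindex(x, y)
        if !ismissing(x[i]) && !ismissing(y[i])
            j += 1
            xout[j] = x[i]
            yout[j] = y[i]
        end
    end
    xout, yout
end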

Maybe you could use something like this (not super well optimised):

function get_nonmissing_indices(x::AbstractArray{>:Missing}, xs::Vararg{AbstractArray{>:Missing}})
    result = Int[]
    vs = (x, xs...)
    for i in eachindex(vs...)
        if all(v -> !ismissing(@inbounds(v[i])), vs)
            push!(result, i)
        end
    end
    result
end

type_remove_missing(::Type{Union{T, Missing}}) where T = T
type_remove_missing(::Type{Any}) = Any

function remove_any_missings(x::AbstractArray{>:Missing}, xs::Vararg{AbstractArray{>:Missing}})
    vs = (x, xs...)
    idx = get_nonmissing_indices(vs...)
    L = length(idx)
    removed = map(vs) do v
        r = similar(v, type_remove_missing(eltype(v)), (L,))
        @inbounds for (inew, iold) in zip(eachindex(r), idx)
            r[inew] = v[iold]
        end
        r
    end
    (;idx, removed)
end

Lazy skipmissing?


This is one of the reasons I never use missing with float arrays (actually, I never use missings anywhere, but that’s my taste/needs). Use NaNs instead, which are numbers of the same type as the array elements.


Thanks @jakobnissen. I will have a go with your suggested optimisations. I guess I cannot avoid the copy, though. Thanks @Benny, I can’t use skipmissing alone as I am also removing NaNs (I removed that from above for simplicity). I guess I could pair it with a similar thing for NaNs, but it ends up being pretty equivalent to what I have.
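
For the record, the lazy pairing I have in mind would be roughly this (ok is just an illustrative helper):

ok(v) = !ismissing(v) && !isnan(v)
lazy_pairs = ((a, b) for (a, b) in zip(x, y) if ok(a) && ok(b))   # lazy, no copies, but not a Vector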

I completely agree @joa-quim. This is at the outer edge of the codebase I am writing; once it’s all within my code I don’t use missing. It’s hard to avoid missing entirely, given that most packages use it as a default value.

By the way, note that x[valids] already creates a copy, so that Vector{Float64}(x[valids]) copies twice. It would be better to use something like Vector{Float64}(@view x[valids]) or convert(Vector{Float64}, @view x[valids]).
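
Applied to the example above, that single-copy version would be:

do_something(Vector{Float64}(@view x[valids]),
             Vector{Float64}(@view y[valids]))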


There are definitely performance gotchas with missing! Depending on your usecase, you might want to convert them to NaNs right after reading from file, for example.
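
For example, a broadcast over coalesce does that conversion in a single pass (one copy, plain Float64 output):

xnan = coalesce.(x, NaN)    # Vector{Float64}: every missing becomes NaN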

There’s Skipper.jl – like skipmissing, but more general 🙂

Not if you really need a Vector{Float64} output; that’s just a different data structure. I think we could unsafely reuse a contiguous portion of the underlying Memory{Union{Float64, Missing}} and treat it as if it were a Memory{Float64}, but that assumes a lot about Memory that we can’t rely on.

The allocation here doesn’t seem too bad, though, because valids needed allocations to begin with, so I’m not sure what the harm of another couple of allocations is.

There is an analogous SkipNan.jl. You can make an entirely lazy indexable sequence guaranteed to never return missing or NaN (by erroring at those elements), and if you want to adjust the indices to also skip those elements, you can use view with valids beforehand. collect also removes the Missing element type, unlike for a view or slice.

julia> using BenchmarkTools; using SkipNan: skipnan

julia> x = [1,missing,3,NaN,5,NaN,7,missing];

julia> @btime y = skipnan(skipmissing(view($x, $(1:2:8))))
  6.800 ns (0 allocations: 0 bytes)
skipnan(skipmissing(Union{Missing, Float64}[1.0, 3.0, 5.0, 7.0]))

julia> y[1], y[2], y[3], y[4]
(1.0, 3.0, 5.0, 7.0)

julia> collect(y) |> typeof
Vector{Float64} (alias for Array{Float64, 1})

julia> collect(view(x, 1:2:8)) |> typeof
Vector{Union{Missing, Float64}} (alias for Array{Union{Missing, Float64}, 1})

Worth noting that skipmissing and skipnan wrap iterables, so they can’t subtype AbstractVector and don’t support much except eltype, and indexing if the parent object supports it. That’s also why the view went inside rather than outside. I’m not sure how an AbstractVector-only skipmissing could be feasible, because size is much harder to compute, and the size varies with setindex! calls to the parent object, not just push!/insert!/delete!. Still, it could hypothetically be done if we accept that even more mutations of the parent object are dangerous than they are for a view.
