Fast conversion from Matrix{Union{Missing, Float64}} to Sparse

I have a matrix that contains lots of missing values. I am toying with the idea of using sparsity to represent those missing values instead. However, one hurdle to easy conversion from Matrix to SparseMatrixCSC is that the non-missing values of the matrix are allowed to be zero (rarely, but it happens), and I would like to keep those zeros as stored entries. So I wrote the following function instead. I am wondering if anyone has suggestions for improvement.

using SparseArrays, Missings

function sparsenonmissing(A::AbstractMatrix{T}) where {T}
	N,K = size(A)
	numvals = sum(!ismissing,A)
	rows = Int64[]
	cols = Int64[]
	vals = nonmissingtype(T)[]

	sizehint!(rows, numvals)
	sizehint!(cols, numvals)
	sizehint!(vals, numvals)

	for ind in eachindex(IndexCartesian(),A)
		x = A[ind]
		if !ismissing(x)
			push!(rows,ind[1])
			push!(cols,ind[2])
			push!(vals,x)
		end
	end
	return sparse(rows,cols,vals,N,K)
end

A = missings(Float64, 500,10000)
A[1:5:end] .= randn(Float64,length(1:5:length(A)))
B = sparsenonmissing(A)

The degree of sparsity in the MWE is approximately what I encounter (the patterns actually display much more clustering).

I’ve been thinking this one over and, as far as performance goes, I think this is actually a pretty decent version. I can think of a few ways to get it done in fewer lines, but they’re a bit hacky and probably slower.


This is 2x faster on my pc:

function sparsenonmissing2(A::AbstractMatrix{T}) where {T}
    N,K = size(A)
    numvals = sum(!ismissing,A)
    rows = Vector{Int64}(undef, numvals)
    cols = Vector{Int64}(undef, numvals)
    vals = Vector{nonmissingtype(T)}(undef, numvals)
    i = 0
    @inbounds for ind in CartesianIndices(A)
        x = A[ind]
        if !ismissing(x)
            i += 1
            rows[i] = ind[1]
            cols[i] = ind[2]
            vals[i] = x
        end
    end
    return sparse(rows,cols,vals,N,K)
end

(adding @inbounds to your version instead does nothing)


I think @inbounds needs to be on the line where you do the indexing, not on the for loop.

It should be the same because, if I’m not mistaken, @inbounds placed on the loop is propagated to each expression inside the for loop. To confirm this, I tried what you suggest on both versions and nothing changes.
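For what it’s worth, a minimal sketch illustrating the two placements (toy functions of my own, not from this thread); since @inbounds applies to the whole block it annotates, both versions elide the bounds check on M[ind]:

function sum_outer(M)
    s = 0.0
    @inbounds for ind in eachindex(M)  # annotation on the loop block...
        s += M[ind]
    end
    return s
end

function sum_inner(M)
    s = 0.0
    for ind in eachindex(M)
        @inbounds s += M[ind]          # ...or on the indexing expression itself
    end
    return s
end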


What operations do you want to support on the array? (Most of the SparseMatrixCSC functionality is oriented towards linear algebra, which is useless if there are missing values.)

If you care about performance you should construct the CSC format directly, rather than going through IJV format first.
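For reference, a minimal sketch of what building the CSC arrays directly could look like (my own illustration, not benchmarked; it reuses the nonmissingtype trick from the OP and fills colptr, rowval and nzval in a single column-major pass):

using SparseArrays, Missings

function sparsenonmissing_csc(A::AbstractMatrix{T}) where {T}
    N, K = size(A)
    numvals = count(!ismissing, A)
    colptr = Vector{Int}(undef, K + 1)           # start of each column in rowval/nzval
    rowval = Vector{Int}(undef, numvals)         # row index of each stored entry
    nzval  = Vector{nonmissingtype(T)}(undef, numvals)
    colptr[1] = 1
    k = 0
    for j in 1:K
        for i in 1:N
            x = A[i, j]
            if !ismissing(x)
                k += 1
                rowval[k] = i
                nzval[k] = x
            end
        end
        colptr[j + 1] = k + 1
    end
    return SparseMatrixCSC(N, K, colptr, rowval, nzval)
end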


This method seems to get a little more juice:

function sparsenonmissing3(A::AbstractMatrix{T}) where {T}
    B = similar(sparse((!ismissing).(A)), nonmissingtype(eltype(A)))
    cptr = SparseArrays.getcolptr(B)
    c = 1
    cend = cptr[c+1]
    @inbounds for i in 1:nnz(B)
        if i == cend
            c += 1
            cend = cptr[c+1]
        end
        nonzeros(B)[i] = A[rowvals(B)[i],c]
    end
    return B
end

and it doesn’t look so bad either.

BTW, the A in the OP has density 0.2, which is a little high to gain much from a sparse format (considering the overhead of CSC).
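If in doubt, a quick way to check the storage side of that trade-off (just a sketch; Base.summarysize reports the total bytes held by each representation):

Base.summarysize(A)                      # dense Matrix{Union{Missing, Float64}}
Base.summarysize(sparsenonmissing3(A))   # CSC: one row index + one value per stored entry, plus colptr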


This isn’t fast, but it’s fast to write

sparse(coalesce.(A,0))

Using a mutating (and admittedly mutant) coalesce!() goes a little faster


function coalesce!(m::Matrix{Union{Missing, T}},val::T) where T<:Real
	for i in eachindex(m)
		ismissing(m[i]) && (m[i]=val)
	end
	m
end

sparsenonmissing3(A)==sparse(coalesce.(A,0))==sparse(coalesce!(A,0.))
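One caveat with the in-place version (just a usage note): coalesce! overwrites the missings in A itself, so if you still need them afterwards, run it on a copy:

Acopy = copy(A)   # keep the original missings intact
sparsenonmissing3(A) == sparse(coalesce.(A, 0)) == sparse(coalesce!(Acopy, 0.))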

or better


function SMCSC(A)
	nzA=(!ismissing).(A)
	r,c=size(A)
	cptr=[0;cumsum(count.(eachcol(nzA)))].+1
	nzval= A[nzA]
	rvals=[first(ci.I) for ci in findall(nzA)]
	SparseMatrixCSC(r,c,cptr,rvals,nzval)
end

function cptr_(nzA)
	cptr=Vector{Int64}(undef,10001)
	cptr[1]=1
	for (col,c) in enumerate(eachcol(nzA))
		cptr[col+1]=cptr[col]+count(c)
	end
  	cptr
end

function SMCSC(A)
	nzA=(!ismissing).(A)
	r,c=size(A)
	#cptr=[0;cumsum(count.(eachcol(nzA)))].+1
	cptr=cptr_(nzA)
	nzval= A[nzA]
	rvals=[first(ci.I) for ci in findall(nzA)]
	SparseMatrixCSC(r,c,cptr,rvals,nzval)
end

This hardcoded length in cptr_ (Vector{Int64}(undef,10001)) looks a bit suspicious?
Also, the functions need to return a matrix with a non-Union element type.
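For example, one way to narrow the nzval vector to Float64 (a small sketch; nzA is the !ismissing mask used in SMCSC above, and disallowmissing comes from Missings.jl, which is already loaded):

nzval = A[nzA]                   # eltype is still Union{Missing, Float64}
nzval = disallowmissing(A[nzA])  # Vector{Float64}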

Another method to consider:

using MappedArrays

sparsenonmissing4(A::AbstractMatrix) = 
  sparse(mappedarray(x->coalesce(x, zero(nonmissingtype(eltype(A)))), A))

As the saying goes: why not make it a one-liner? :wink:

This is faster than sparsenonmissing3.

Tiny issue: if the original matrix contains both zeroes and missings, then the resulting matrix will make them both structural zeroes in the sparse matrix.
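A tiny illustration of that caveat (hypothetical 2×2 example; sparsenonmissing is the OP’s function, which keeps explicit zeros because sparse(I, J, V, m, n) retains them as stored entries):

M = [0.0 missing; 1.0 2.0]          # one true zero, one missing

nnz(sparse(coalesce.(M, 0.0)))      # 2: the true zero is dropped along with the missing
nnz(sparsenonmissing(M))            # 3: the true zero survives as a stored entry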

I’m not sure I understand your doubt. If you are referring to the fact that the size of the cptr vector is hardcoded, that can generally be calculated as size(A)[2]+1.
I don’t get the other point about the element type of the matrix. I still don’t have a clear understanding of how the type system works in Julia (sooner or later I’ll have to start studying it).

Here’s what I measure on my laptop

julia> using SparseArrays, Missings, Random

julia> A = missings(Float64, 500,10000)
500×10000 Matrix{Union{Missing, Float64}}:
 
 missing  missing  missing  missing     missing  missing  missing
 ⋮                                   ⋱
 missing  missing  missing  missing  …  missing  missing  missing
 
 missing  missing  missing  missing     missing  missing  missing

julia> A[1:5:end] .= randn(Float64,length(1:5:length(A)))
1000000-element view(reshape(::Matrix{Union{Missing, Float64}}, 5000000), 1:5:4999996) with eltype Union{Missing, Float64}:
  0.35955623771445555
  0.7063141021260603
  0.8332490679598374
  1.865779624046154
  ...
  1.342504874329712
 -1.1325925605137113

julia> shuffle!(A)
500×10000 Matrix{Union{Missing, Float64}}:
 -1.28583      missing   0.471741    …    missing     missing
  ...
   missing     missing    missing         missing   -1.2943
 -2.58484      missing    missing       -1.07261      missing
 -0.0542834    missing    missing         missing   -1.05134

julia> using MappedArrays

julia> sparsenonmissing4(A::AbstractMatrix) =
         sparse(mappedarray(x->coalesce(x, zero(nonmissingtype(eltype(A)))), A))
sparsenonmissing4 (generic function with 1 method)

julia> function cptr_(nzA)
        nr,nc=size(nzA)
        cptr=Vector{Int64}(undef,nc+1)
        cptr[1]=1
        for (col,c) in enumerate(eachcol(nzA))
                cptr[col+1]=cptr[col]+count(c)
        end
                cptr
       end
cptr_ (generic function with 1 method)

julia> function SMCSC(A)
        nzA=(!ismissing).(A)
        r,c=size(A)
        #cptr=[0;cumsum(count.(eachcol(nzA)))].+1
        cptr=cptr_(nzA)
        nzval= A[nzA]
        rvals=[first(ci.I) for ci in findall(nzA)]
        SparseMatrixCSC(r,c,cptr,rvals,nzval)
       end
SMCSC (generic function with 1 method)

julia> SMCSC(A) == sparsenonmissing4(A) #true
true

julia> using BenchmarkTools

julia> @btime sparsenonmissing4(A)
  29.725 ms (31 allocations: 19.64 MiB)
500×10000 SparseMatrixCSC{Float64, Int64} with 1000000 stored entries:
⎡⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⎤
⎣⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⎦

julia> @btime SMCSC(A)
  14.931 ms (13 allocations: 32.15 MiB)
500×10000 SparseMatrixCSC{Union{Missing, Float64}, Int64} with 1000000 stored entries:
⎡⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⎤
⎣⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⎦

Okay, maybe now I understand what you were referring to (although having a Union{Missing, Float64} element type didn’t give a different result from yours).
Here is a modified version that allocates less and runs a little faster

julia> function SMCSC(A)
        nzA=findall(!ismissing,A)
        r,c=size(A)
        cptr=[1;findall(i->last(nzA[i].I)!=last(nzA[i+1].I), 1:length(nzA)-1).+1;length(nzA)+1]   
        nzval = [A[a] for a in nzA]
        rvals=[first(ci.I) for ci in nzA]
        SparseMatrixCSC(r,c,cptr,rvals,nzval)
       end
SMCSC (generic function with 1 method)

julia> @btime SMCSC(A)
  11.594 ms (29 allocations: 31.47 MiB)
500×10000 SparseMatrixCSC{Float64, Int64} with 1000000 stored entries:
⎡⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⎤
⎣⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⎦

Regarding the belief that a good (or lucky) use of the functions defined in the library can be more effective than a theoretically better for loop: here it happens that S!MCSC2(A), which would seem in theory to perform the necessary operations in an “optimal” way, performs worse than S!MCSC1(A).
I can’t explain why.

julia> function S!MCSC1(A)
        nzA=findall(!ismissing,A)
        r,c=size(A)
        cptr=[1;findall(i->last(nzA[i].I)!=last(nzA[i+1].I), 1:length(nzA)-1).+1;length(nzA)+1]
        nzval = [A[a] for a in nzA]
        #rvals=[first(ci.I) for ci in nzA]
        rvals=[first(Tuple(ci)) for ci in nzA]
        SparseMatrixCSC(r,c,cptr,rvals,nzval)
       end
S!MCSC1 (generic function with 1 method)

julia> @btime S!MCSC1(A)
  12.054 ms (29 allocations: 31.47 MiB)
500×10000 SparseMatrixCSC{Float64, Int64} with 1000000 stored entries:
⎡⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⎤
⎣⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⎦

julia> function S!MCSC2(A)
        r,c = size(A)
        rvals = Int64[]
        cptr = [1]
        nzvals = Float64[]
        sizehint!(rvals, trunc(Int,r*c*0.2))
        sizehint!(cptr, c+1)
        sizehint!(nzvals, trunc(Int,r*c*0.2))
        for (ci,col) in enumerate(eachcol(A))
                nzA=findall(!ismissing,col)
                push!(cptr,length(nzA)+cptr[end])
                append!(rvals,nzA)
                append!(nzvals,@view A[nzA,ci])
                # nzvals = [A[a] for a in nzA]
                # #rvals=[first(ci.I) for ci in nzA]
                # rvals=[first(Tuple(ci)) for ci in nzA]
        end
        SparseMatrixCSC(r,c,cptr,rvals,nzvals)
       end
S!MCSC2 (generic function with 1 method)



julia> sparsenonmissing4
sparsenonmissing4 (generic function with 1 method)

julia> S!MCSC1(A)==S!MCSC2(A)== sparsenonmissing4(A)
true

julia> @btime S!MCSC2(A)
  17.470 ms (40007 allocations: 66.32 MiB)
500×10000 SparseMatrixCSC{Float64, Int64} with 1000000 stored entries:
⎡⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⎤
⎣⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⎦

Thanks for the replies everyone!

It is linear algebra. Basically, this is an unbalanced panel of data (rows are dates, columns are firm observations). The unbalanced nature led to a lot of joins on dates in my first go at the estimation, when I kept things in a DataFrame. So I am trying out the approach where I store the data in this sparse matrix instead, and I manually handle the joins via clever indexing. Maybe it will not be as good; I’m not sure.
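For what it’s worth, a rough sketch of the kind of date-alignment I have in mind (all names here are hypothetical; B is the sparse panel built above, rows are dates, columns are firms, and a stored entry means the observation exists):

firm1, firm2 = 3, 17                      # hypothetical column indices for two firms
rows1 = rowvals(B)[nzrange(B, firm1)]     # dates (row indices) where firm1 has data
rows2 = rowvals(B)[nzrange(B, firm2)]     # dates where firm2 has data
common = intersect(rows1, rows2)          # the "join" on dates
x1, x2 = B[common, firm1], B[common, firm2]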