I have a matrix that contains lots of missing values. I am toying with the idea of using sparsity to represent those missing values instead. However, one hurdle to easy conversion from Matrix to SparseMatrix is that the non-missing values of the matrix are allowed to be zero (but rarely) and I would like to keep those zeros. Instead, I wrote the following function. I am wondering if anyone else has suggestions for improvement.
using SparseArrays, Missings
function sparsenonmissing(A::AbstractMatrix{T}) where {T}
N,K = size(A)
numvals = sum(!ismissing,A)
rows = Int64[]
cols = Int64[]
vals = nonmissingtype(T)[]
sizehint!(rows, numvals)
sizehint!(cols, numvals)
sizehint!(vals, numvals)
for ind in eachindex(IndexCartesian(),A)
x = A[ind]
if !ismissing(x)
push!(rows,ind[1])
push!(cols,ind[2])
push!(vals,x)
end
end
return sparse(rows,cols,vals,N,K)
end
A = missings(Float64, 500,10000)
A[1:5:end] .= randn(Float64,length(1:5:length(A)))
B = sparsenonmissing(A)
The degree of sparsity in the MWE is approximately what I encounter (the patterns actually display much more clustering).
I’ve been thinking this one over and, as far as performance goes, I think this is actually a pretty decent version. I can think of a few ways to get it done in fewer lines, but they’re a bit hacky and probably slower.
function sparsenonmissing2(A::AbstractMatrix{T}) where {T}
N,K = size(A)
numvals = sum(!ismissing,A)
rows = Vector{Int64}(undef, numvals)
cols = Vector{Int64}(undef, numvals)
vals = Vector{nonmissingtype(T)}(undef, numvals)
i = 0
@inbounds for ind in CartesianIndices(A)
x = A[ind]
if !ismissing(x)
i += 1
rows[i] = ind[1]
cols[i] = ind[2]
vals[i] = x
end
end
return sparse(rows,cols,vals,N,K)
end
(adding @inbounds to your version instead does nothing)
Should be the same because if I’m not mistaken it is propagated to each expression inside the for loop this way. To confirm this, I tried to do what you say on both version and nothing changes
What operations do you want to support on the array? (Most of the SparseMatrixCSC functionality is oriented towards linear algebra, which is useless if there are missing values.)
If you care about performance you should construct the CSC format directly, rather than going through IJV format first.
function sparsenonmissing3(A::AbstractMatrix{T}) where {T}
B = similar(sparse((!ismissing).(A)), nonmissingtype(eltype(A)))
cptr = SparseArrays.getcolptr(B)
c = 1
cend = cptr[c+1]
@inbounds for i in 1:nnz(B)
if i == cend
c += 1
cend = cptr[c+1]
end
nonzeros(B)[i] = A[rowvals(B)[i],c]
end
return B
end
and it doesn’t look so bad either.
BTW, the A in the OP has density 0.2 which is a little high for sparse advantage (considering the overhead of CSC).
Using a mutated and mutant coalesce!() goes a little faster
function coalesce!(m::Matrix{Union{Missing, T}},val::T) where T<:Real
for i in eachindex(m)
ismissing(m[i]) && (m[i]=val)
end
m
end
sparsenonmissing3(A)==sparse(coalesce.(A,0))==sparse(coalesce!(A,0.))
or better
function SMCSC(A)
nzA=(!ismissing).(A)
r,c=size(A)
cptr=[0;cumsum(count.(eachcol(nzA)))].+1
nzval= A[nzA]
rvals=[first(ci.I) for ci in findall(nzA)]
SparseMatrixCSC(r,c,cptr,rvals,nzval)
end
function cptr_(nzA)
cptr=Vector{Int64}(undef,10001)
cptr[1]=1
for (col,c) in enumerate(eachcol(nzA))
cptr[col+1]=cptr[col]+count(c)
end
cptr
end
function SMCSC(A)
nzA=(!ismissing).(A)
r,c=size(A)
#cptr=[0;cumsum(count.(eachcol(nzA)))].+1
cptr=cptr_(nzA)
nzval= A[nzA]
rvals=[first(ci.I) for ci in findall(nzA)]
SparseMatrixCSC(r,c,cptr,rvals,nzval)
end
this looks a bit suspicious?
Also, the functions need to return a non-Union type.
Another method to consider:
using MappedArrays
sparsenonmissing4(A::AbstractMatrix) =
sparse(mappedarray(x->coalesce(x, zero(nonmissingtype(eltype(A)))), A))
As the saying goes: why not make it a one-liner?
This is faster than sparsenonmissing3.
Tiny issue: if the original matrix contains both zeroes and missings, then the resulting matrix will make them both structural zeroes in the sparse matrix.
I’m not sure I understand your doubt. If you are referring to the fact that the size of the cptr vector is hardcoded, I assume this can generally be calculated as size(A)[2]+1.
I don’t get the other aspect about the type of elements of the matrix. I still don’t have a clear understanding of how the type system works in Julia (sooner or later I’ll have to start studying it)
Here’s what I measure on my laptop
Summary
julia> using SparseArrays, Missings, Random
julia> A = missings(Float64, 500,10000)
500×10000 Matrix{Union{Missing, Float64}}:
missing missing missing missing missing missing missing
⋮ ⋱
missing missing missing missing … missing missing missing
missing missing missing missing missing missing missing
julia> A[1:5:end] .= randn(Float64,length(1:5:length(A)))
1000000-element view(reshape(::Matrix{Union{Missing, Float64}}, 5000000), 1:5:4999996) with eltype Union{Missing, Float64}:
0.35955623771445555
0.7063141021260603
0.8332490679598374
1.865779624046154
...
1.342504874329712
-1.1325925605137113
julia> shuffle!(A)
500×10000 Matrix{Union{Missing, Float64}}:
-1.28583 missing 0.471741 … missing missing
...
missing missing missing missing -1.2943
-2.58484 missing missing -1.07261 missing
-0.0542834 missing missing missing -1.05134
julia> using MappedArrays
julia> sparsenonmissing4(A::AbstractMatrix) =
sparse(mappedarray(x->coalesce(x, zero(nonmissingtype(eltype(A)))), A))
sparsenonmissing4 (generic function with 1 method)
julia> function cptr_(nzA)
nr,nc=size(nzA)
cptr=Vector{Int64}(undef,nc+1)
cptr[1]=1
for (col,c) in enumerate(eachcol(nzA))
cptr[col+1]=cptr[col]+count(c)
end
cptr
end
cptr_ (generic function with 1 method)
julia> function SMCSC(A)
nzA=(!ismissing).(A)
r,c=size(A)
#cptr=[0;cumsum(count.(eachcol(nzA)))].+1
cptr=cptr_(nzA)
nzval= A[nzA]
rvals=[first(ci.I) for ci in findall(nzA)]
SparseMatrixCSC(r,c,cptr,rvals,nzval)
end
SMCSC (generic function with 1 method)
julia> SMCSC(A) == sparsenonmissing4(A) #true
true
julia> using BenchmarkTools
julia> @btime sparsenonmissing4(A)
29.725 ms (31 allocations: 19.64 MiB)
500×10000 SparseMatrixCSC{Float64, Int64} with 1000000 stored entries:
⎡⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⎤
⎣⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⎦
julia> @btime SMCSC(A)
14.931 ms (13 allocations: 32.15 MiB)
500×10000 SparseMatrixCSC{Union{Missing, Float64}, Int64} with 1000000 stored entries:
⎡⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⎤
⎣⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⎦
Okay. Maybe now I understand what you were referring to (although having a Union{Missing+Float} type didn’t give a different result than yours)
Here is a modified version that allocates less and runs a little faster
julia> function SMCSC(A)
nzA=findall(!ismissing,A)
r,c=size(A)
cptr=[1;findall(i->last(nzA[i].I)!=last(nzA[i+1].I), 1:length(nzA)-1).+1;length(nzA)+1]
nzval = [A[a] for a in nzA]
rvals=[first(ci.I) for ci in nzA]
SparseMatrixCSC(r,c,cptr,rvals,nzval)
end
SMCSC (generic function with 1 method)
julia> @btime SMCSC(A)
11.594 ms (29 allocations: 31.47 MiB)
500×10000 SparseMatrixCSC{Float64, Int64} with 1000000 stored entries:
⎡⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⎤
⎣⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⎦
regarding the belief that a good (or lucky) use of the functions defined in the library can be more effective than a theoretically better for … loop, here it happens that S!MCSC2(A) which would seem in theory to perform the necessary operations in an “optimal” way is less performing than S!MCSC1(A).
I can’t explain why.
Summary
julia> function S!MCSC1(A)
nzA=findall(!ismissing,A)
r,c=size(A)
cptr=[1;findall(i->last(nzA[i].I)!=last(nzA[i+1].I), 1:length(nzA)-1).+1;length(nzA)+1]
nzval = [A[a] for a in nzA]
#rvals=[first(ci.I) for ci in nzA]
rvals=[first(Tuple(ci)) for ci in nzA]
SparseMatrixCSC(r,c,cptr,rvals,nzval)
end
S!MCSC1 (generic function with 1 method)
julia> @btime S!MCSC1(A)
12.054 ms (29 allocations: 31.47 MiB)
500×10000 SparseMatrixCSC{Float64, Int64} with 1000000 stored entries:
⎡⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⎤
⎣⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⎦
julia> function S!MCSC2(A)
r,c = size(A)
rvals = Int64[]
cptr = [1]
nzvals = Float64[]
sizehint!(rvals, trunc(Int,r*c*0.2))
sizehint!(cptr, c+1)
sizehint!(nzvals, trunc(Int,r*c*0.2))
for (ci,col) in enumerate(eachcol(A))
nzA=findall(!ismissing,col)
push!(cptr,length(nzA)+cptr[end])
append!(rvals,nzA)
append!(nzvals,@view A[nzA,ci])
# nzvals = [A[a] for a in nzA]
# #rvals=[first(ci.I) for ci in nzA]
# rvals=[first(Tuple(ci)) for ci in nzA]
end
SparseMatrixCSC(r,c,cptr,rvals,nzvals)
end
S!MCSC2 (generic function with 1 method)
julia> sparsenonmissing4
sparsenonmissing4 (generic function with 1 method)
julia> S!MCSC1(A)==S!MCSC2(A)== sparsenonmissing4(A)
true
julia> @btime S!MCSC2(A)
17.470 ms (40007 allocations: 66.32 MiB)
500×10000 SparseMatrixCSC{Float64, Int64} with 1000000 stored entries:
⎡⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⎤
⎣⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⎦
It is linear algebra. Basically, this is an unbalanced panel of data (rows are dates, columns are firm observations). The unbalanced nature led to a lot of joins on dates in my first go at the estimation when I kept things in the DataFrame. So, I am trying out the approach where I store the data in this sparse matrix instead, and I also manually handle the joins via clever indexing. Maybe it will not be as good, I’m not sure.