Error in mean value of sparse arrays

nbakas · March 31, 2022, 10:51am

in the following code, I noticed two errors:

the mean value of the positive elements is less than the mean value including zeros
the mean value obtained from the loop is different than the mean value from the function “mean”.

n = 5000
sp_mat = sprand(Float32,n,n,0.99)./20
mean(sp_mat)#0.02474575
mean(sp_mat[sp_mat.>0])#0.024066502
sum_loop=0
for i=1:n
	for j=1:n
		sum_loop+=sp_mat[i,j]
	end
end
sum_loop/n/n#0.023825405

goerch · March 31, 2022, 11:22am

I added some more details showing what is computed:

using SparseArrays
using Statistics

n = 5000
sp_mat = sprand(Float32,n,n,0.99)./20
@show mean(sp_mat)
@show sum(nonzeros(sp_mat)) / (size(sp_mat, 1) * size(sp_mat, 2))
@show mean(sp_mat[sp_mat.>0])
@show size(sp_mat[sp_mat.>0]), sum(sp_mat[sp_mat.>0]) / size(sp_mat[sp_mat.>0], 1)
sum_loop=0
for i=1:n
	for j=1:n
		global sum_loop+=sp_mat[i,j]
	end
end
@show sum_loop/n/n

yielding

mean(sp_mat) = 0.02475363f0
sum(nonzeros(sp_mat)) / (size(sp_mat, 1) * size(sp_mat, 2)) = 0.02475363f0
mean(sp_mat[sp_mat .> 0]) = 0.025003498f0
(size(sp_mat[sp_mat .> 0]), sum(sp_mat[sp_mat .> 0]) / size(sp_mat[sp_mat .> 0], 1)) = ((24750166,), 0.025003498f0)
(sum_loop / n) / n = 0.023831278f0

I believe the reason your loop is showing a different result is due to sum using pairwise summation.

oheil · March 31, 2022, 11:24am

And of course, this one:

mean(sp_mat[sp_mat.>0])

is wrong as the number of elements for mean is smaller than in sp_mat.

oheil · March 31, 2022, 3:41pm

This is the last routine which does the sum:

@noinline function mapreduce_impl(f, op, A::AbstractArrayOrBroadcasted,
                                  ifirst::Integer, ilast::Integer, blksize::Int)
    if ifirst == ilast
        @inbounds a1 = A[ifirst]
        return mapreduce_first(f, op, a1)
    elseif ifirst + blksize > ilast
        # sequential portion
        @inbounds a1 = A[ifirst]
        @inbounds a2 = A[ifirst+1]
        v = op(f(a1), f(a2))
        @simd for i = ifirst + 2 : ilast
            @inbounds ai = A[i]
            v = op(v, f(ai))
        end
        return v
    else
        # pairwise portion
        imid = (ifirst + ilast) >> 1
        v1 = mapreduce_impl(f, op, A, ifirst, imid, blksize)
        v2 = mapreduce_impl(f, op, A, imid+1, ilast, blksize)
        return op(v1, v2)
    end
end

In case of SparseArrays the above function is called with a view over only the non-zero values, which is SparseArrays.nzvalview(sp_mat).

You can see, that there is no difference now, by calling it directly:

julia> using SparseArrays, Statistics

julia> n = 5000
5000

julia> sp_mat = sprand(Float32,n,n,0.99)./20;

julia> sum(sp_mat)
618840.25f0

julia> inds = LinearIndices(SparseArrays.nzvalview(sp_mat));

julia> Base.mapreduce_impl(identity, +, SparseArrays.nzvalview(sp_mat), first(inds), last(inds), Base.pairwise_blocksize(identity, +))
618840.25f0

~~The sum algorithms sums up blockwise, with blocksize of 1024.~~ (better short description below)
Summing up floating point numbers in different orders gives different results because of floating point precision.
You can find out which function is doing the actual calculation by using @which and @less with the next function you see in @less, e.g.:

julia> @less sum(sp_mat)
@inline ($fname)(a::AbstractArray; dims=:, kw...) = ($_fname)(a, dims; kw...)
...

julia> @less Base._sum(sp_mat,:)
($_fname)(a, ::Colon; kw...) = ($_fname)(identity, a, :; kw...)
...
julia> @less Base._sum(identity,sp_mat,:)
($_fname)(f, a, ::Colon; kw...) = mapreduce(f, $op, a; kw...)
...
julia> @which Base._sum(identity,sp_mat,:) 
#because we need to see how $op is defined, which is :add_sum
_sum(f, a, ::Colon; kw...) in Base at reducedim.jl:878

julia> @less mapreduce(identity,Base.add_sum,sp_mat,:)
mapreduce(f, op, itrs...; kw...) = reduce(op, Generator(f, itrs...); kw...)
...

And so on until you reach the worker function which does the actual work.
I like to do this sometimes, despite it can be tedious, because there is always so much to learn.

stevengj · March 31, 2022, 4:31pm

More precisely, the default sum adds pairwise recursively, with a base case of 1024 where it switches to a simple loop.

nbakas · March 31, 2022, 6:20pm

thanks, this is amazingly informative. If I run the following, I get

different values using “sum” and the loop
different values if I run the loop for different permutations.

So yes, I guess it was a combination of a different order of summing up, and of course Float32 precision along with small values to be added.

using Random
nn=5000^2
rand_vec = rand(Float32,nn)./20
sum(rand_vec)#625007.4
sum_loop = Float32(0)
ii = randperm(nn)
for i in ii
   sum_loop += rand_vec[i]
end
sum_loop#601579.75

Topic		Replies	Views
Bug in sparse matrix .+= scalar? Internals & Design	7	804	August 1, 2017
Threaded loop shows error when using sparse matrix, but not with dense matrices New to Julia sparse , threads	3	384	September 15, 2022
Error showing value of type SparseMatrixCSC General Usage question , sparsearrays	2	35	May 9, 2025
Can't access large SparseMatrix using index New to Julia sparse	2	628	May 17, 2020
Roundoff error in variance of array of constant value New to Julia float , precision	7	157	July 11, 2024

Error in mean value of sparse arrays

Related topics