How to reduce a (reshaped) sparse matrix to a max or sum with acceptable speed?

I generated a 3240000×1460 SparseMatrixCSC{UInt8, Int64} with 518339 stored entries, hereafter called activitygrid. It holds a 3D spatial grid (dims: 300×300×36) sampled every 30 seconds over a 12-hour period. It was very fast to generate and is compact in memory thanks to the UInt8 element type; the dense form would hold 4730400000 elements.

I then reshaped this sparse matrix to obtain a 4D grid, from which I want to project maxima and sums across its various dimensions and plot them with Makie, which needs Float32 values:

sgrid4d = reshape(Float32.(activitygrid),(dims[1],dims[2],dims[3],size(activitygrid)[2]))

julia> typeof(sgrid4d)
Base.ReshapedArray{Float32, 4, SparseMatrixCSC{Float32, Int64}, Tuple{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64}}}

julia> sgrid4d
300×300×36×1460 reshape(::SparseMatrixCSC{Float32, Int64}, 300, 300, 36, 1460) with eltype Float32:

This reshape completes in a fraction of a second.
However, taking the maximum or sum across dimensions takes very long:

julia> @time xy_max = dropdims(maximum(sgrid4d,dims=(3,4)),dims=(3,4))
 27.776353 seconds (715.90 k allocations: 47.358 MiB, 0.09% gc time, 1.53% compilation time)
300×300 Matrix{Float32}:

or, if instead I reshaped the matrix to use the time as the first dimension:

julia> @time xy_max = dropdims(maximum(sgrid4d,dims=(1,4)),dims=(1,4))
 29.179732 seconds (23 allocations: 352.672 KiB)
300×300 Matrix{Float32}:

which improves allocations but not the speed. Is there a better way that improves performance without having to allocate many gigabytes? After all, it only needs to work on the nonzero values…
Is it possible to create a true 4D SparseMatrixCSC instead of a ReshapedArray, and would it help the speed?

Could you type up a copy-pastable MWE real quick? It is hard to help without seeing activitygrid, or something that looks like it.

Thanks. This will give a sparse matrix with a similar number of nonzero values:

julia> activitygrid = sprand(UInt8,3240000,1460, 0.00012)
3240000×1460 SparseMatrixCSC{UInt8, Int64} with 566857 stored entries:

It performs even a bit worse:

sgrid4d = reshape(Float32.(activitygrid),(size(activitygrid)[2],dims[1],dims[2],dims[3]))

@time xy_max = dropdims(maximum(sgrid4d,dims=(1,4)),dims=(1,4))
 35.672897 seconds (31 allocations: 353.172 KiB)

It seems the reduction visits every element, including all the implicit zeros, which is what takes the time. However, the original activitygrid can be summed in about 0.0012 s. The question is how to get to a format that can be reduced along the 4D dimensions without paying that cost. A one-time conversion taking half a minute might be acceptable, but afterwards I need to slice time and spatial ranges in an instant.

Clearly, view and reshape of a SparseMatrixCSC do not perform well here. It gets tedious to write, but sometimes you can decompose the reduction into two parts, with the first operating on the raw sparse array. If you are always reducing over the 4th dimension, for example, consider reducing over it first and then handling the others afterwards.

Like this

sz4 = (300, 300, 36, 2 * 60 * 12)
sz2 = (sz4[1] * sz4[2] * sz4[3] , sz4[4])
activitygrid = sprand(Float32, sz2..., 111f-6)

# reduce over 4th dim, then 3rd
max34 = dropdims(maximum(reshape(maximum(activitygrid; dims=2), sz4[1:3]); dims=3); dims=3)
# 1.2s on my machine

If you need fancier reductions, consider different storage options, like (dim1, dim2*dim3*dim4), that might better reflect your uses. With some post-processing you can usually make these things work, although it sometimes reeks of MATLAB-style index gymnastics.
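To make the storage-layout idea concrete, here is a minimal sketch of rebuilding the sparse matrix in a different layout directly from its nonzero triplets, so no dense intermediate ever appears. The sizes and the 0.00012 density follow this thread; the particular swap shown (time-major) is just one example of remapping the indices.

```julia
using SparseArrays

dims = (300, 300, 36)                 # spatial grid, as in this thread
activitygrid = sprand(Float32, prod(dims), 1460, 0.00012)

# Rebuild with time as the row dimension by swapping the index roles;
# only the ~500k stored triplets are touched.
i, j, v = findnz(activitygrid)
timemajor = sparse(j, i, v, size(activitygrid, 2), size(activitygrid, 1))
```

For a plain transpose, `copy(transpose(activitygrid))` gives the same result; the findnz route generalizes to arbitrary remappings such as (dim1, dim2\*dim3\*dim4).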

Thanks, this seems to be a workable solution, at least for one dimension, and also for my need of zooming in with ranges, e.g.:

dropdims(maximum(reshape(maximum(activitygrid[:,1000:1200]; dims=2), dims)[100:200,100:200,5:30], dims=3),dims=3)

On the other hand, I am now trying to figure out whether it could work for time vs. altitude, i.e. maxing over x and y for each z (a 1440×36 result). It seems we can only get a single maximum across all of time for each activitygrid location, but I need one per time step.

You would probably get quite reasonable performance hand-rolling your own sparse array using a Dict, and identifying the entries to sum using modular arithmetic (or using CartesianIndices as keys).
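A minimal sketch of that Dict idea, with sizes and density as in this thread (accumulating a per-(y, z) maximum here, using CartesianIndex keys):

```julia
using SparseArrays

dims = (300, 300, 36)
activitygrid = sprand(Float32, prod(dims), 1460, 0.00012)

acc = Dict{CartesianIndex{2}, Float32}()
ci = CartesianIndices(dims)           # decodes a row index into (x, y, z)
rows, _, vals = findnz(activitygrid)
for (r, v) in zip(rows, vals)
    _, y, z = Tuple(ci[r])
    k = CartesianIndex(y, z)
    acc[k] = max(get(acc, k, 0f0), v) # running maximum per (y, z) key
end
```

Only keys that actually received a nonzero exist in the Dict, which is what makes this a hand-rolled, DOK-style sparse array.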

You could create a new matrix, explicitly iterate over the entries of the sparse matrix, and then, for each stored element, update the sum/max of the corresponding entry of that matrix. Unlike in Python and other high-level languages, iterating in Julia is fast. Since there are fewer than a million entries in the sparse matrix, this should be fast enough.

A more robust solution would be to add a specialized method for reducing reshaped sparse matrices. If you want to contribute that to Julia, it would be welcome.


Indeed, I just started looking into iterating. Thinking of: activitygrid → nz → CartesianIndices → iterate over the unique Cartesian (x, y, z, t) → max or sum.
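That pipeline can be sketched as a single pass over the stored entries, with no findall and no per-cell searching. Sizes follow this thread; xy_max and tz_sum are two example projections (the names are mine):

```julia
using SparseArrays

dims = (300, 300, 36)
nt = 1460
activitygrid = sprand(Float32, prod(dims), nt, 0.00012)

xy_max = zeros(Float32, dims[1], dims[2])
tz_sum = zeros(Float32, nt, dims[3])
ci = CartesianIndices(dims)
rows, cols, vals = findnz(activitygrid)  # one call, not one per cell
for (r, t, v) in zip(rows, cols, vals)
    x, y, z = Tuple(ci[r])               # decode the spatial coordinates
    xy_max[x, y] = max(xy_max[x, y], v)  # project onto (x, y)
    tz_sum[t, z] += v                    # project onto (t, z)
end
```

Range selections (zooming) then become a simple `if` guard on x, y, z, t inside the loop.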

Before that, I tried stacking sparse array slices to build a SparseMatrixCSC. Concatenating sparse arrays works for dimensions 1 and 2, but the result turns dense when concatenating along dimension 3. Also, maximum returns a dense array for each slice, which makes it slow. Functions like findnz do not work on the reshaped format (my sgrid4d).

To squeeze out some more efficiency, the following uses the sparse columns of the SparseMatrixCSC activitygrid as one of the dimensions to maximize (or sum) over. This allows contiguous memory access when doing the maximization. After this step the remaining matrix is dense (or denser), and then the second maximization can be done over the first (fastest) dimension, again to enhance memory locality.

In code:

using SparseArrays
using BenchmarkTools   # for @btime

function dosum(sz4, sz2, activitygrid)
    T = eltype(activitygrid)
    res = zeros(T, sz4[3:4])                    # (x, y) result grid
    ci = CartesianIndices(sz4[2:4])             # decode column index -> (z, x, y)
    for (i, c) in pairs(eachcol(activitygrid))
        I = CartesianIndex(Tuple(ci[i])[2:3])   # keep only the (x, y) part
        res[I] = max(res[I], maximum(nonzeros(c); init=zero(T)))
    end
    res
end

function testsum()
    sz4 = (2*60*12, 36, 300, 300)
    sz2 = (sz4[1], sz4[2] * sz4[3] * sz4[4])
    activitygrid = sprand(Float32, sz2..., 111f-6)
    @btime dosum($sz4, $sz2, $activitygrid)
    nothing
end

and, as a baseline, another method from earlier in this thread:

function dosum_orig(sz4, sz2, activitygrid)
    return dropdims(maximum(reshape(maximum(activitygrid; dims=2), sz4[1:3]); dims=3); dims=3)
end

function testsum_orig()
    sz4 = (300, 300, 36, 2*60*12)
    sz2 = (sz4[1] * sz4[2] * sz4[3], sz4[4])
    activitygrid = sprand(Float32, sz2..., 111f-6)
    @btime dosum_orig($sz4, $sz2, $activitygrid)
    nothing
end

Comparing these:

julia> testsum()
  15.086 ms (2 allocations: 351.61 KiB)

julia> testsum_orig()
  28.001 ms (36 allocations: 37.42 MiB)

There is about a 2× improvement, and the memory allocated is essentially just the output matrix (so it is minimal).

There is a bit more optimization to be had by keeping the maxima in local variables, but perhaps the compiler already does that itself.

To really optimize, one can reach into the internals of SparseMatrixCSC:

# The nonzeros of each group of sz4[2] consecutive columns form one
# contiguous slice of ag.nzval, delimited by ag.colptr:
dosum2(sz4, sz2, ag) = [maximum(
    @view ag.nzval[ag.colptr[t]:ag.colptr[t + sz4[2]] - 1];
    init = zero(eltype(ag)))
    for t in 1:sz4[2]:sz2[2]-1]

with this dosum2 (substituted for dosum inside testsum):

julia> testsum();
  2.553 ms (2 allocations: 351.61 KiB)

A 10× improvement over the baseline in the previous post, with minimal allocations.

This method exploits the memory contiguity of the sparse matrix's columns along the second dimension, chopping the vector of nonzero values into contiguous regions to maximize (or sum).
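A small self-contained illustration of that contiguity, at toy sizes rather than the thread's (sprand values are nonnegative, so init=0f0 is safe):

```julia
using SparseArrays

A = sprand(Float32, 10, 12, 0.3)
g = 4   # reduce over groups of 4 consecutive columns

# The nonzeros of columns c..c+g-1 occupy one contiguous slice of A.nzval,
# delimited by A.colptr, so each group max is a single contiguous maximum.
groupmax = [maximum(@view A.nzval[A.colptr[c]:A.colptr[c + g] - 1];
                    init = 0f0) for c in 1:g:size(A, 2)]
```

Each entry of groupmax equals the maximum over the corresponding block of g columns, but computed without ever indexing into the matrix.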


Thanks Dan, it looks interesting. But when I tried to apply this to my activitygrid, it indexed out of bounds somewhere. I'm afraid that with this solution it does not seem easy to select different directions to reduce over, or to specify submatrices for space/time selections. I will look into your previous one.

I really should try to understand how to work with the columns and Cartesian indices while at the same time pulling a 2D field in the right direction and taking ranges of time, x, y, z. Dan's first method (dosum) did not give the desired result for my layout, as the maximum(nonzeros(c)) applies to an entire 3D grid in each column of my activitygrid.

For laughs, here is a naive “unrolled” version operating on just the nonzero values. This is not how to do it (a findall over 518000 elements on every iteration!), but at least I understand what it is doing…

julia> xy_max,xy_sum,xz_max,xz_sum,yz_max,yz_sum,tz_max,tz_sum = gridviews(activitygrid, dims, 1:1440, 1:300,1:300,1:36)
 13.618621 seconds (58.16 k allocations: 40.398 GiB, 14.22% gc time)
 17.791983 seconds (71.50 k allocations: 49.663 GiB, 13.91% gc time)
101.901340 seconds (394.82 k allocations: 277.617 GiB, 14.33% gc time)
121.682171 seconds (492.48 k allocations: 342.121 GiB, 13.93% gc time)

# the function...

function gridviews(activitygrid,dims, ts,xs,ys,zs)
        T = eltype(activitygrid)
        nz = findnz(activitygrid[:,ts])
        ca = CartesianIndices(dims)[nz[1]]
        cax = getindex.(ca,1)
        cay = getindex.(ca,2)
        caz = getindex.(ca,3)
        cells = (xs[begin] .<= cax .<= xs[end]) .&& (ys[begin] .<= cay .<= ys[end]) .&& (zs[begin] .<= caz .<= zs[end])
        yz_max = zeros(Float32,length(ys),length(zs))
        yz_sum = zeros(Float32,length(ys),length(zs))
    @time begin
        for y in unique(cay[cells]), z in unique(caz[cells]) 
            locs = findall(cay[cells].==y .&& caz[cells].==z)
            yz_max[y-ys[begin]+1, z-zs[begin]+1] = maximum(nz[3][locs]; init=zero(T))
            yz_sum[y-ys[begin]+1, z-zs[begin]+1] = sum(nz[3][locs]; init=zero(T))
        end
    end
        xz_max = zeros(Float32,length(xs),length(zs))
        xz_sum = zeros(Float32,length(xs),length(zs))
    @time begin
        for x in unique(cax[cells]), z in unique(caz[cells]) 
            locs = findall(cax[cells].==x .&& caz[cells].==z)
            xz_max[x-xs[begin]+1, z-zs[begin]+1] = maximum(nz[3][locs]; init=zero(T))
            xz_sum[x-xs[begin]+1, z-zs[begin]+1] = sum(nz[3][locs]; init=zero(T))
        end
    end
        xy_max = zeros(Float32,length(xs),length(ys))
        xy_sum = zeros(Float32,length(xs),length(ys))
    @time begin
        for x in unique(cax[cells]), y in unique(cay[cells]) 
            locs = findall(cax[cells].==x .&& cay[cells].==y)
            xy_max[x-xs[begin]+1,y-ys[begin]+1] = maximum(nz[3][locs]; init=zero(T))
            xy_sum[x-xs[begin]+1,y-ys[begin]+1] = sum(nz[3][locs]; init=zero(T))
        end
    end
        tz_max = zeros(Float32,length(ts),length(zs))
        tz_sum = zeros(Float32,length(ts),length(zs))
    @time begin
        for t in ts, z in unique(caz[cells]) 
            locs = findall(nz[2][cells].==t .&& caz[cells].==z)
            tz_max[t-ts[begin]+1,z-zs[begin]+1] = maximum(nz[3][locs]; init=zero(T))
            tz_sum[t-ts[begin]+1,z-zs[begin]+1] = sum(nz[3][locs]; init=zero(T))
        end
    end
        return xy_max,xy_sum,xz_max,xz_sum,yz_max,yz_sum,tz_max,tz_sum
end

Maybe SparseArrayKit.jl (github.com/Jutho/SparseArrayKit.jl, sparse multidimensional arrays with support for TensorOperations.jl) could be useful?

In my first testing of SparseArrayKit, it could convert activitygrid, or the reshaped sgrid4d, to a SparseArray, though the latter took 35 seconds. When I took a single column of activitygrid (SparseMatrixCSC), and that same column converted to a SparseArray, reshaped each and ran maximum, the original came out faster (19 ms) than the SparseArray (84 ms).

# full 4D array
julia> @time dropdims(maximum(test; dims=(1,2)),dims=(1,2))
178.498085 seconds (84 allocations: 4.837 MiB)
36×1460 SparseArray{Float32, 2} with 20205 stored entries:

Extrapolating those per-column timings to all 1440 columns:

julia> 1440*0.084
120.96000000000001

julia> 1440*0.019
27.36

It did allocate very little.

On the other hand… I have now gone back to plain dense arrays. Earlier I had found those slow as well, but back then I had converted to Float32 (which made it allocate much more) and had used an array of 3D matrices rather than a true 4D array.

The following is actually way faster than working with sparse arrays:

julia> @time grid4d = reshape(collect(activitygrid),(dims[1],dims[2],dims[3],size(activitygrid)[2]))
  0.605269 seconds (220 allocations: 4.406 GiB, 4.63% gc time)
300×300×36×1460 Array{UInt8, 4}:


function matrixviews(grid4d, ts,xs,ys,zs; view="xy", reduce="max") 
    if reduce == "max"
        if view == "xy"
            return @views dropdims(maximum(grid4d[xs,ys,zs,ts],dims=(3,4)),dims=(3,4))
        end
    elseif reduce == "sum"
        if view == "xy"
            return @views dropdims(sum(grid4d[xs,ys,zs,ts],dims=(3,4)),dims=(3,4))
        end
       # etcetera
    end
end

julia> @btime test = matrixviews(grid4d, :, :, :, :; view="xy", reduce="max")
  336.462 ms (22 allocations: 88.77 KiB)
300×300 Matrix{UInt8}:

# a selection:

julia> @btime test = matrixviews(grid4d, 100:1100,80:160,100:200,1:36; view="xy", reduce="max")
  65.355 ms (22 allocations: 9.05 KiB)
81×101 Matrix{UInt8}:

I feel like I took a long, unnecessary detour, but I am happy to be home again. I learned something about sparse matrices along the way.
