Trying to code a parallel for loop with shared sparse arrays

Having asked here how to code a parallel for loop, and getting a nice answer, I now find myself unable to find anything on how to do it with sparse arrays, as I couldn’t work out how to use sparse arrays with SharedArrays.

For example,

using SparseArrays

function timeconsuming(params)
    Z0 = spzeros(100000, 1)
    for k = 1:100
        # add params at 10 random positions (duplicate positions just accumulate)
        Z0[Int64.(ceil.(100000 * rand(1, 10)))] .+= params
    end
    return Z0
end

function test()
    Z = spzeros(100000,1000)
    for p = 1:1000
        params = rand(1,10)
        Z[:,p] = timeconsuming(params)
    end
    return Z
end

@time test();
  5.704991 seconds (1.53 M allocations: 7.626 GiB, 17.34% gc time)

The structure of this example is very close to my actual problem. Any help with parallelising it would be much appreciated.

I don’t think you’re going to be able to do this, because under the hood a sparse array is three vectors, one of which holds pointers into the other two. Because of this, each thread/process would need exclusive access to the sparse array to insert values in the right order, with a sync at every switch, which essentially turns this into a serial problem.
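You can see the three vectors directly on a SparseMatrixCSC, e.g.

using SparseArrays

A = sparse([1, 3, 2], [1, 1, 2], [10.0, 20.0, 30.0], 3, 2)
A.colptr  # [1, 3, 4]: where each column starts in rowval/nzval
A.rowval  # [1, 3, 2]: the row index of each stored entry
A.nzval   # [10.0, 20.0, 30.0]: the stored values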

That is indeed the problem. I am converting my code from MATLAB, where it runs an order of magnitude faster by multithreading automatically (no parfor loop needed). I’m going to write this test code in MATLAB as well to see whether it captures the effect.

Surely there must be some way of parallelising code like this by operating on the three underlying vectors and creating the sparse matrix at the end? Anyway, I’m going to keep working on this. Difficulties like this are certainly a good way to learn a new language.
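For what it’s worth, the sparse(I, J, V, m, n) constructor supports exactly this: it takes plain index and value vectors and builds the matrix in one call, summing any duplicate entries, so the vectors can be filled first and the matrix created once at the end. A toy example:

using SparseArrays

I = [1, 5, 9]        # row indices
J = [1, 2, 2]        # column indices
V = [1.0, 2.0, 3.0]  # values; duplicate (i, j) pairs are summed
Z = sparse(I, J, V, 10, 3)  # 10×3 sparse matrix with 3 stored entries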

MATLAB version.

function Z = test()
    Z = sparse(100000,1000);
    parfor p = 1:1000
        params = rand(10,1);
        Z(:,p)= timeconsuming(params);
    end
end

function Z0 = timeconsuming(params)
    Z0 = sparse(100000,1);
    for k = 1:100
        ind = ceil(100000*rand(1,10));
        Z0(ind) = Z0(ind) + params;
    end
end

Without parfor

tic, Z = test; toc
Elapsed time is 4.521721 seconds.

With parfor

tic, Z = test; toc
Elapsed time is 0.545647 seconds.

So it’s clearly possible, although tbh, this probably isn’t the right example, as it needs a parfor to work. I’d still like to know how to do this in Julia though.

I don’t know MATLAB so can’t comment on the code, except to say that if timeconsuming() isn’t returning a lot of zeros, then maybe you could use a dense array instead, which would work in a parallel for loop.

No I’m only interested in sparse matrices. Thanks for thinking about it though.

Are you sure this code doesn’t introduce a race condition? Assigning columns of a sparse matrix requires making copies of the row index and value vectors, and if you try to do that in parallel the different threads will interfere with each other unless you take special precautions. I am not familiar with parallel computing in Matlab, and it is in principle possible that Matlab will take care of things for you, but I somewhat doubt it.
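To illustrate in Julia terms: assigning one column rewrites buffers that all columns share, which is why two threads doing it at once can corrupt the structure. A toy case:

using SparseArrays

Z = spzeros(10, 3)
nnz(Z)  # 0 stored entries so far
Z[:,2] = sparsevec([1, 4], [1.0, 2.0], 10)
nnz(Z)  # 2: rowval/nzval were reallocated and colptr updated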

You do have to be careful, but MATLAB does take care of it for you.

Sliced variables in MATLAB

It’s actually very picky and gives lots of error messages and warnings if you screw it up. It’s very traumatic switching to a grown up language after being babied by MATLAB for all these years. :blush:

using Base.Threads
using SparseArrays

function timeconsuming(params)
    Z0 = spzeros(100000,1)
    for k = 1:100
        Z0[Int64.(ceil.(100000 * rand(1, 10)))] .+= params
    end
    return Z0
end

function test()
    L = ReentrantLock() # Threads.SpinLock() is an alternative
    Z = spzeros(100000, 1000)
    Threads.@threads for p = 1:1000
        params = rand(1, 10)
        values = timeconsuming(params)
        # serialize the writes: column assignment mutates the shared CSC buffers
        lock(L)
        Z[:, p] = values
        unlock(L)
    end
    return Z
end

This “works”, although with this MWE it doesn’t show much improvement (2.5 seconds, down from 3 seconds on my machine). I think the assignment Z[:,p] = values is taking most of the time, while the timeconsuming() method only takes about half a millisecond.
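As a small aside, the do-block form of lock releases the lock even if the assignment throws:

lock(L) do
    Z[:, p] = values
end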

That runs slower than the original version on my machine. If I increase the size of the inner loop from 100 to 1000, this method gives a 15% improvement. Either way, it’s not much, and I already tried it in my actual code without success. Thanks for the suggestion though. I want Z to be shared so that it can be updated in parallel, so I guess I should try sharing the index and value vectors instead of the sparse structure, which I will do soon.

One thing you might consider is having each thread return its own sparse vector, and then processing the vectors into a sparse matrix at the end. This will probably require more memory, because it needs a dense vector of sparse vectors up front, but it should allow you to thread.

Example:

using SparseArrays, Base.Threads

v = Vector{SparseVector{UInt16}}(undef, 1000);

Threads.@threads for i = 1:1000
    # each task fills its own slot, so no locking is needed
    v[i] = sparse(fill(UInt16(i), i))
end
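A sketch of the same idea at the scale of this thread, where the random payload is just a placeholder for whatever timeconsuming() computes:

using SparseArrays, Base.Threads

cols = Vector{SparseVector{Float64,Int}}(undef, 1000)
Threads.@threads for p = 1:1000
    # each task builds its own column; no shared state is mutated
    cols[p] = sparsevec(rand(1:100000, 10), rand(10), 100000)
end
Z = reduce(hcat, cols)  # 100000×1000 SparseMatrixCSC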

Don’t know if this is applicable, but would it be preferable to run the time-consuming calcs in the background, leaving the foreground available for collection? With @threads, a time-consuming calc running on the primary thread will block collection, forcing the other threads that have finished to wait for it. If all the time-consuming calcs are the same length, this is not a big deal.

If the jobs are not very uniform, though, you might get an improvement with the @qbthreads macro in ThreadPools.jl (https://github.com/tro3/ThreadPools.jl, docs at https://tro3.github.io/ThreadPools.jl). If nothing else, it has thread visualization to help you optimize the work. Hope this helps.
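For reference, it’s a drop-in replacement for Threads.@threads (a minimal sketch based on the package README, untested here):

using ThreadPools

@qbthreads for p = 1:8
    # jobs are queued to background threads, keeping the primary thread free
    println("job $p on thread $(Threads.threadid())")
end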

All my calcs are the same length, but it’s good to know that that’s a possibility if I ever need it, thanks.

This is what I had in mind.

So here are my efforts. Firstly, here’s a basic serial version. I had to change it a bit, because in my actual problem there are no issues with ordering the indices of the matrix, and I also wanted timeconsuming to be more time-consuming, so I added a dummy loop to slow it down. The index and value vectors can be shared and the sparse matrix constructed at the end.

using SparseArrays

function timeconsuming(params)
    index = zeros(Int64, 10000, 1)
    value = zeros(Float64, 10000, 1)
    for k = 1:1000
        ind = 10 * (k - 1) .+ (1:10)
        index[ind] = 1000 * (k - 1) .+ Int64.(ceil.(1000 .* rand(10, 1)))
        value[ind] = params
    end
    # dummy work to make the function genuinely time-consuming
    for k = 1:100
        dum = rand(1000, 1000)
    end
    return index, value
end

function test()
    index = zeros(Int64, 1000000, 1)
    value = zeros(Float64, 1000000, 1)
    J = repeat(transpose(collect(1:100)), 10000,1)
    for p = 1:100
        params = rand(1, 10)
        ind = 10000 * (p - 1) .+ (1:10000)
        index[ind], value[ind] = timeconsuming(params)
    end
    Z = sparse(vec(index), J[:], vec(value), 1000000, 100)
    return Z
end

which gives
@time Z = test();
 40.613715 seconds (225.51 k allocations: 74.619 GiB, 21.99% gc time)

With @threads, this becomes

using SparseArrays, SharedArrays, Distributed

function timeconsuming(params)
    index = zeros(Int64, 10000, 1)
    value = zeros(Float64, 10000, 1)
    for k = 1:1000
        ind = 10 * (k - 1) .+ (1:10)
        index[ind] = 1000 * (k - 1) .+ Int64.(ceil.(1000 .* rand(10, 1)))
        value[ind] = params
    end
    for k = 1:100
        dum = rand(1000, 1000)
    end
    return index, value
end

function test()
    index = SharedArray{Int64,2}(1000000, 1)
    value = SharedArray{Float64,2}(1000000, 1)
    J = repeat(transpose(collect(1:100)), 10000, 1)
    Threads.@threads for p = 1:100
        params = rand(1, 10)
        ind = 10000 * (p - 1) .+ (1:10000)
        index[ind], value[ind] = timeconsuming(params)
    end
    Z = sparse(vec(index), J[:], vec(value), 1000000, 100)
    return Z
end

and

@time Z = test();
 22.801165 seconds (228.17 k allocations: 74.604 GiB, 30.57% gc time)

and finally with @distributed

using SparseArrays, SharedArrays, Distributed

addprocs(12)

@everywhere begin
    function timeconsuming(params)
        index = zeros(Int64, 10000, 1)
        value = zeros(Float64, 10000, 1)
        for k = 1:1000
            ind = 10 * (k - 1) .+ (1:10)
            index[ind] = 1000 * (k - 1) .+ Int64.(ceil.(1000 .* rand(10, 1)))
            value[ind] = params
        end
        for k = 1:100
            dum = rand(1000,1000)
        end
        return index, value
    end
end

function test()
    index = SharedArray{Int64,2}(1000000, 1)
    value = SharedArray{Float64,2}(1000000, 1)
    J = repeat(transpose(collect(1:100)), 10000, 1)
    @sync @distributed for p = 1:100
        params = rand(1, 10)
        ind = 10000 * (p - 1) .+ (1:10000)
        index[ind], value[ind] = timeconsuming(params)
    end
    Z = sparse(vec(index), J[:], vec(value), 1000000, 100)
    return Z
end

and

@time Z = test();
 15.664146 seconds (12.10 k allocations: 53.835 MiB)

This last seems to be the better of the two.

I think I’ve learned a few things, and I’ll now have to see what I can do with my actual code. Thanks to everybody for helping, and any further comments would be gratefully received.

I’m not sure, but it seems to me that it’s a SparseArrays thing. I observed that a dense-matrix version run serially in Julia is 3x faster than MATLAB’s parallel execution. So maybe sparse arrays in MATLAB are especially optimized? I don’t use MATLAB any more, but I remember from a few years ago that sparse matrices weren’t that good; I was even avoiding them for performance reasons.

Please don’t use @time for benchmarking. There are too many ways to get inaccurate results. I believe you can benchmark both parallel and distributed code using BenchmarkTools.
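For example:

using BenchmarkTools

@btime test();  # runs the call repeatedly and reports the minimum time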


OK. I am aware of the potential problems, though. I always make sure the code has run a couple of times first, to check that it has compiled properly. When I’ve tried using @btime instead, I haven’t found a significant difference.

I expect MATLAB’s sparse matrices are very well optimized. I do agree that it’s a SparseArrays thing, and that I should build the index and value vectors first, then create the sparse matrices instead of working with sparse arrays directly. I’ll be working on my actual code next week taking all of these ideas into account and will note what I find here.

By working with the indices and values instead of the sparse array, I’ve got my code working. In fact it runs on a single processor at a speed comparable to multithreaded MATLAB, so I’m pretty pleased, as you can imagine. My only issue now is that I can’t seem to share an array of arrays via SharedArrays, but I’m going to post separately about that.

Postscript: It seems that @threads works well without sharing the arrays, so I’m now beating MATLAB handily and am a happy user.
