Trying to code a parallel for loop with shared sparse arrays

jbpatzer · June 12, 2020, 1:26pm

Having asked here how to code a parallel for loop, and getting a nice answer, I now find myself unable to find anything on how to do it with sparse arrays, as I couldn’t work out how to use sparse arrays with SharedArrays.

For example,

function timeconsuming(params)
    Z0 = spzeros(100000,1)
    for k = 1:100
        Z0[Int64.(ceil.(100000*rand(1,10)))].+= params
    end
    return Z0
end

function test()
    Z = spzeros(100000,1000)
    for p = 1:1000
        params = rand(1,10)
        Z[:,p] = timeconsuming(params)
    end
    return Z
end

@time test(); 5.704991 seconds (1.53 M allocations: 7.626 GiB, 17.34% gc time)

The structure of this example is very close to my actual problem. Any help with parallelising it would be much appreciated.

anon94023334 · June 13, 2020, 8:08am

I don’t think you’re going to be able to do this, because underneath the hood sparsearrays are three vectors, one of which holds pointer data to the other two. Because of this, each thread/process will need to have exclusive access to the sparse array in order to insert values in specific order and there will need to be a sync between switches, which essentially turns this into a serial problem.

jbpatzer · June 13, 2020, 8:19am

That is indeed the problem. I am converting my code from MATLAB, where it runs an order of magnitude faster, and does it by automatically multithreading (no parfor loop). I’m going to write this test code in MATLAB as well to see whether it captures the effect.

Surely there must be some way of parallelizing code like this by operating on the three underlying vectors and creating the sparse matrix at the end? Anyway, I’m going to keep working on this. Difficulties like this are certainly a good way to learn about a new language.

jbpatzer · June 13, 2020, 8:35am

MATLAB version.

function Z = test()
    Z = sparse(100000,1000);
    parfor p = 1:1000
        params = rand(10,1);
        Z(:,p)= timeconsuming(params);
    end
end

function Z0 = timeconsuming(params)
    Z0 = sparse(100000,1);
    for k = 1:100
        ind = ceil(100000*rand(1,10));
        Z0(ind) = Z0(ind) + params;
    end
end

Without parfor

tic,Z = test;toc Elapsed time is 4.521721 seconds.

With parfor

tic,Z = test;toc Elapsed time is 0.545647 seconds.

So it’s clearly possible, although tbh, this probably isn’t the right example, as it needs a parfor to work. I’d still like to know how to do this in Julia though.

anon94023334 · June 13, 2020, 8:44am

I don’t know MATLAB so can’t comment on the code except to say that if timeconsuming() isn’t returning a lot of zeros, then maybe you could use a dense array instead, which you could then use in a parallel for loop.

jbpatzer · June 13, 2020, 8:46am

No I’m only interested in sparse matrices. Thanks for thinking about it though.

ettersi · June 13, 2020, 10:19am

Are you sure this code doesn’t introduce a race condition? Assigning columns of a sparse matrix requires making copies of the row index and value vectors, and if you try to do that in parallel the different threads will interfere with each other unless you take special precautions. I am not familiar with parallel computing in Matlab, and it is in principle possible that Matlab will take care of things for you, but I somewhat doubt it.

jbpatzer · June 13, 2020, 11:00am

You do have to be careful, but MATLAB does take care of it for you.

Sliced variables in MATLAB

It’s actually very picky and gives lots of error messages and warnings if you screw it up. It’s very traumatic switching to a grown up language after being babied by MATLAB for all these years.

pixel27 · June 13, 2020, 3:58pm

using Base.Threads
using SparseArrays

function timeconsuming(params)
    Z0 = spzeros(100000,1)
    for k = 1:100
        Z0[Int64.(ceil.(100000*rand(1,10)))].+= params
    end
    return Z0
end

function test()
    L = ReentrantLock() #Threads.SpinLock()
    Z = spzeros(100000,1000)
    @Threads.threads for p = 1:1000
        params = rand(1,10)
        values = timeconsuming(params)
        lock(L)
        Z[:,p] = values
        unlock(L)
    end
    return Z
end

This “works” although with this MWE it doesn’t show much improvement (2.5 seconds from 3 seconds on my machine). I think the assignment Z[:,p] = value is taking most of the time while the timeconsuming() method only takes half a millisecond.

jbpatzer · June 13, 2020, 4:53pm

That runs slower than the original version on my machine. If I increase the size of the inner loop from 100 to 1000, this method gives a 15% improvement. Either way, it’s not much, and I already tried it in my actual code without success. Thanks for the suggestion though. I want Z to be shared so that it can be updated in parallel, so I guess I should try sharing the index and value vectors instead of the sparse structure, which I will do soon.

anon94023334 · June 13, 2020, 5:08pm

One thing that you might consider is having each thread return its own sparse vector, and then process the vectors into a sparsematrix at the end. This will probably require more memory because it requires a dense vector of sparsevectors at the beginning, but it should allow you to thread.

Example:

v = Vector{SparseVector{UInt16}}(undef, 1000);

@Threads.threads for i = 1:1000
    v[i] = sparse(fill(UInt16(i), i))
end

tro3 · June 13, 2020, 5:14pm

Don’t know if this is applicable, but would it be preferable to run the time-consuming calcs in the background, leaving the foreground available for collection? With @threads, the time-consuming calc running on the primary will block collection, forcing the other threads that have finished to wait for it. If all the time-consuming calcs are the same length, this is not a big deal.

If the jobs are not very uniform, though, you might get improvement with the @qbthreads macro in GitHub - tro3/ThreadPools.jl: Improved thread management for background and nonuniform tasks in Julia. Docs at https://tro3.github.io/ThreadPools.jl. If nothing else, it has thread visualization to help you optimize the work. Hope this helps.

jbpatzer · June 13, 2020, 5:45pm

All my calcs are the same length, but it’s good to know that that’s a possibility if I ever need it thanks.

jbpatzer · June 13, 2020, 5:49pm

This is what I had in mind.

jbpatzer · June 13, 2020, 5:55pm

So here are my efforts. Firstly, here’s a basic code. I had to change it a bit because in my actual problem there are no issues with ordering the indices of the matrix, and I also wanted ‘timeconsuming’ to be more time consuming, so I added a stupid loop into it to slow it down. The index and value vectors can be shared and the sparse matrix constructed at the end.

using SparseArrays

function timeconsuming(params)
    index = zeros(Int64, 10000, 1)
    value = zeros(Float64, 10000, 1)
    for k = 1:1000
        ind = 10 * (k - 1) .+ (1:10)
        index[ind] = 1000 * (k - 1) .+ Int64.(ceil.(1000 .* rand(10, 1)))
        value[ind] = params
    end
    for k = 1:100
        dum = rand(1000,1000)
    end
    return index, value
end

function test()
    index = zeros(Int64, 1000000, 1)
    value = zeros(Float64, 1000000, 1)
    J = repeat(transpose(collect(1:100)), 10000,1)
    for p = 1:100
        params = rand(1, 10)
        ind = 10000 * (p - 1) .+ (1:10000)
        index[ind], value[ind] = timeconsuming(params)
    end
    Z = sparse(vec(index), J[:], vec(value), 1000000, 100)
    return Z
end

which gives
@time Z = test(); 40.613715 seconds (225.51 k allocations: 74.619 GiB, 21.99% gc time)

With @threads, this becomes

using SparseArrays, SharedArrays, Distributed

function timeconsuming(params)
    index = zeros(Int64, 10000, 1)
    value = zeros(Float64, 10000, 1)
    for k = 1:1000
        ind = 10 * (k - 1) .+ (1:10)
        index[ind] = 1000 * (k - 1) .+ Int64.(ceil.(1000 .* rand(10, 1)))
        value[ind] = params
    end
    for k = 1:100
        dum = rand(1000, 1000)
    end
    return index, value
end

function test()
    index = SharedArray{Int64,2}(1000000, 1)
    value = SharedArray{Float64,2}(1000000, 1)
    J = repeat(transpose(collect(1:100)), 10000, 1)
    Threads.@threads for p = 1:100
        params = rand(1, 10)
        ind = 10000 * (p - 1) .+ (1:10000)
        index[ind], value[ind] = timeconsuming(params)
    end
    Z = sparse(vec(index), J[:], vec(value), 1000000, 100)
    return Z
end

and @time Z = test(); 22.801165 seconds (228.17 k allocations: 74.604 GiB, 30.57% gc time)

and finally with @distributed

using SparseArrays, SharedArrays, Distributed

addprocs(12)

@everywhere begin
    function timeconsuming(params)
        index = zeros(Int64, 10000, 1)
        value = zeros(Float64, 10000, 1)
        for k = 1:1000
            ind = 10 * (k - 1) .+ (1:10)
            index[ind] = 1000 * (k - 1) .+ Int64.(ceil.(1000 .* rand(10, 1)))
            value[ind] = params
        end
        for k = 1:100
            dum = rand(1000,1000)
        end
        return index, value
    end
end

function test()
    index = SharedArray{Int64,2}(1000000, 1)
    value = SharedArray{Float64,2}(1000000, 1)
    J = repeat(transpose(collect(1:100)), 10000, 1)
    @sync @distributed for p = 1:100
        params = rand(1, 10)
        ind = 10000 * (p - 1) .+ (1:10000)
        index[ind], value[ind] = timeconsuming(params)
    end
    Z = sparse(vec(index), J[:], vec(value), 1000000, 100)
    return Z
end

and @time Z = test(); 15.664146 seconds (12.10 k allocations: 53.835 MiB)

This last seems to be the better of the two.

I think I’ve learned a few things, and I’ll now have to see what I can do with my actual code. Thanks to everybody for helping, and any further comments would be gratefully received.

Seif_Shebl · June 13, 2020, 10:16pm

I’m not sure, but it seems to me that it is a SparseArrays thing. I observed that a dense matrix version run serially in Julia is 3X faster than MATLAB’s parallel execution. So, may be sparse arrays in MATLAB are especially optimized? I don’t still use MATLAB now, but I remember a few years ago that sparse matrices weren’t that good, I was even avoiding them for performance reasons.

anon94023334 · June 14, 2020, 12:43am

Please don’t use @time for benchmarking. There are too many ways to get inaccurate results. I believe you can benchmark both parallel and distributed code using BenchmarkTools.

jbpatzer · June 14, 2020, 8:21am

OK. I am aware of the potential problems though. I always make sure the code has run a couple of times to check that it has properly compiled. When I’ve tried using @btime instead, I haven’t found a significant difference.

I expect MATLAB’s sparse matrices are very well optimized. I do agree that it’s a SparseArrays thing, and that I should build the index and value vectors first, then create the sparse matrices instead of working with sparse arrays directly. I’ll be working on my actual code next week taking all of these ideas into account and will note what I find here.

jbpatzer · June 15, 2020, 1:50pm

By working with the indices and values instead of the sparse array, I’ve got my code working. In fact it runs at a comparable speed on a single processor, to the speed of multithreaded MATLAB, so I’m pretty pleased as you can imagine. My only issue now is that I can’t seem to Share an array of arrays, but I’m going to post separately about that.

Postscript: Seem that threads works well without sharing the arrays, so I’m beating MATLAB handily and am a happy user.

Topic		Replies	Views
Performance issues with parallel Julia code Julia at Scale performance , parallel , distributed , scientific-computing	2	1132	October 29, 2021
How to code faster parallel for loop New to Julia parallel	8	7262	January 18, 2019
Have a time issue, need advises on my parallel code Performance parallel	10	1154	March 20, 2018
Multithreaded MatVec Numerics multithreading , matrices	10	1963	February 4, 2022
Parallel assembly of a finite element sparse matrix General Usage parallel	34	1733	February 28, 2024

Trying to code a parallel for loop with shared sparse arrays

Related topics