# Trying to code a parallel for loop with shared sparse arrays

Having asked here how to code a parallel for loop, and received a nice answer, I now can't find anything on how to do the same with sparse arrays; in particular, I couldn't work out how to use sparse arrays with SharedArrays.

For example,

```julia
using SparseArrays

function timeconsuming(params)
    Z0 = spzeros(100000, 1)
    for k = 1:100
        Z0[Int64.(ceil.(100000 * rand(1, 10)))] .+= params
    end
    return Z0
end

function test()
    Z = spzeros(100000, 1000)
    for p = 1:1000
        params = rand(1, 10)
        Z[:, p] = timeconsuming(params)
    end
    return Z
end
```

`@time test(); 5.704991 seconds (1.53 M allocations: 7.626 GiB, 17.34% gc time)`

The structure of this example is very close to my actual problem. Any help with parallelising it would be much appreciated.

I don’t think you’re going to be able to do this, because under the hood sparse arrays are three vectors, one of which holds pointers into the other two. Because of this, each thread/process would need exclusive access to the sparse array in order to insert values in a specific order, and there would need to be a sync between switches, which essentially turns this into a serial problem.
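To make that concrete, here is a quick look at the three vectors inside a `SparseMatrixCSC` (field names are from the SparseArrays stdlib):

```julia
using SparseArrays

# entries (1,1)=10, (3,1)=20, (2,2)=30
A = sparse([1, 3, 2], [1, 1, 2], [10.0, 20.0, 30.0], 3, 2)

A.colptr  # [1, 3, 4]: column j's entries sit at positions colptr[j]:colptr[j+1]-1
A.rowval  # [1, 3, 2]: row index of each stored entry
A.nzval   # [10.0, 20.0, 30.0]: the stored values

# Inserting a new entry into column 1 shifts rowval/nzval and bumps colptr
# for every later column, which is why concurrent inserts clash.
```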

That is indeed the problem. I am converting my code from MATLAB, where it runs an order of magnitude faster, and does it by automatically multithreading (no parfor loop). I’m going to write this test code in MATLAB as well to see whether it captures the effect.

Surely there must be some way of parallelizing code like this by operating on the three underlying vectors and creating the sparse matrix at the end? Anyway, I’m going to keep working on this. Difficulties like this are certainly a good way to learn about a new language.
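For what it’s worth, the `sparse(I, J, V, m, n)` constructor does exactly this: it builds the CSC structure once from plain index/value vectors (summing duplicate entries), so the vectors themselves could be filled in parallel and the matrix created at the end. A minimal sketch:

```julia
using SparseArrays

I = [1, 5, 3, 1]          # row indices
J = [1, 1, 2, 1]          # column indices
V = [1.0, 2.0, 3.0, 4.0]  # values; the two (1,1) entries get summed
Z = sparse(I, J, V, 10, 2)
Z[1, 1]  # 5.0
```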

MATLAB version.

```matlab
function Z = test()
    Z = sparse(100000,1000);
    parfor p = 1:1000
        params = rand(10,1);
        Z(:,p) = timeconsuming(params);
    end
end

function Z0 = timeconsuming(params)
    Z0 = sparse(100000,1);
    for k = 1:100
        ind = ceil(100000*rand(1,10));
        Z0(ind) = Z0(ind) + params;
    end
end
```

Without parfor

`tic,Z = test;toc Elapsed time is 4.521721 seconds.`

With parfor

`tic,Z = test;toc Elapsed time is 0.545647 seconds.`

So it’s clearly possible, although tbh, this probably isn’t the right example, as it needs a parfor to work. I’d still like to know how to do this in Julia though.

I don’t know MATLAB so can’t comment on the code except to say that if `timeconsuming()` isn’t returning a lot of zeros, then maybe you could use a dense array instead, which you could then use in a parallel for loop.

No I’m only interested in sparse matrices. Thanks for thinking about it though.

Are you sure this code doesn’t introduce a race condition? Assigning columns of a sparse matrix requires making copies of the row index and value vectors, and if you try to do that in parallel the different threads will interfere with each other unless you take special precautions. I am not familiar with parallel computing in Matlab, and it is in principle possible that Matlab will take care of things for you, but I somewhat doubt it.

You do have to be careful, but MATLAB does take care of it for you.

Sliced variables in MATLAB

It’s actually very picky and gives lots of error messages and warnings if you screw it up. It’s very traumatic switching to a grown up language after being babied by MATLAB for all these years.

```julia
using Base.Threads
using SparseArrays

function timeconsuming(params)
    Z0 = spzeros(100000, 1)
    for k = 1:100
        Z0[Int64.(ceil.(100000 * rand(1, 10)))] .+= params
    end
    return Z0
end

function test()
    Z = spzeros(100000, 1000)
    L = ReentrantLock()
    @threads for p = 1:1000
        params = rand(1, 10)
        values = timeconsuming(params)
        lock(L)           # serialize the sparse column assignment
        Z[:, p] = values
        unlock(L)
    end
    return Z
end
```

This “works”, although with this MWE it doesn’t show much improvement (2.5 seconds, down from 3 seconds, on my machine). I think the assignment `Z[:,p] = values` is taking most of the time, while the `timeconsuming()` call only takes half a millisecond.

That runs slower than the original version on my machine. If I increase the size of the inner loop from 100 to 1000, this method gives a 15% improvement. Either way, it’s not much, and I already tried it in my actual code without success. Thanks for the suggestion though. I want Z to be shared so that it can be updated in parallel, so I guess I should try sharing the index and value vectors instead of the sparse structure, which I will do soon.

One thing that you might consider is having each thread return its own sparse vector, and then processing the vectors into a sparse matrix at the end. This will probably require more memory, because it needs a dense vector of sparse vectors at the beginning, but it should allow you to thread.

Example:

```julia
using SparseArrays, Base.Threads

v = Vector{SparseVector{UInt16}}(undef, 1000)

@threads for i = 1:1000
    v[i] = sparse(fill(UInt16(i), i))   # each thread fills its own slot
end
```
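Applied to the example at the top of the thread, that pattern might look something like the following sketch (`test_threaded` is my name for it; `hcat(cols...)` stitches the columns into a `SparseMatrixCSC` once, serially, at the end):

```julia
using SparseArrays, Base.Threads

function timeconsuming(params)
    Z0 = spzeros(100_000)
    for k = 1:100
        Z0[rand(1:100_000, 10)] .+= params
    end
    return Z0
end

function test_threaded()
    cols = Vector{SparseVector{Float64,Int}}(undef, 1000)
    @threads for p = 1:1000          # each task owns exactly one slot of cols
        cols[p] = timeconsuming(rand(10))
    end
    return hcat(cols...)             # assemble the CSC matrix once, serially
end
```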

Don’t know if this is applicable, but would it be preferable to run the time-consuming calcs in the background, leaving the foreground available for collection? With @threads, the time-consuming calc running on the primary will block collection, forcing the other threads that have finished to wait for it. If all the time-consuming calcs are the same length, this is not a big deal.

If the jobs are not very uniform, though, you might get improvement with the @qbthreads macro in https://github.com/tro3/ThreadPools.jl. If nothing else, it has thread visualization to help you optimize the work. Hope this helps.

All my calcs are the same length, but it’s good to know that that’s a possibility if I ever need it thanks.

This is what I had in mind.

So here are my efforts. Firstly, here’s a basic code. I had to change it a bit because in my actual problem there are no issues with ordering the indices of the matrix, and I also wanted ‘timeconsuming’ to be more time consuming, so I added a stupid loop into it to slow it down. The index and value vectors can be shared and the sparse matrix constructed at the end.

```julia
using SparseArrays

function timeconsuming(params)
    index = zeros(Int64, 10000, 1)
    value = zeros(Float64, 10000, 1)
    for k = 1:1000
        ind = 10 * (k - 1) .+ (1:10)
        index[ind] = 1000 * (k - 1) .+ Int64.(ceil.(1000 .* rand(10, 1)))
        value[ind] = params
    end
    for k = 1:100
        dum = rand(1000, 1000)   # busy-work to slow this function down
    end
    return index, value
end

function test()
    index = zeros(Int64, 1000000, 1)
    value = zeros(Float64, 1000000, 1)
    J = repeat(transpose(collect(1:100)), 10000, 1)
    for p = 1:100
        params = rand(1, 10)
        ind = 10000 * (p - 1) .+ (1:10000)
        index[ind], value[ind] = timeconsuming(params)
    end
    Z = sparse(vec(index), J[:], vec(value), 1000000, 100)
    return Z
end
```

which gives
`@time Z = test(); 40.613715 seconds (225.51 k allocations: 74.619 GiB, 21.99% gc time)`

```julia
using SparseArrays, SharedArrays, Distributed

function timeconsuming(params)
    index = zeros(Int64, 10000, 1)
    value = zeros(Float64, 10000, 1)
    for k = 1:1000
        ind = 10 * (k - 1) .+ (1:10)
        index[ind] = 1000 * (k - 1) .+ Int64.(ceil.(1000 .* rand(10, 1)))
        value[ind] = params
    end
    for k = 1:100
        dum = rand(1000, 1000)   # busy-work to slow this function down
    end
    return index, value
end

function test()
    index = SharedArray{Int64,2}(1000000, 1)
    value = SharedArray{Float64,2}(1000000, 1)
    J = repeat(transpose(collect(1:100)), 10000, 1)
    for p = 1:100
        params = rand(1, 10)
        ind = 10000 * (p - 1) .+ (1:10000)
        index[ind], value[ind] = timeconsuming(params)
    end
    Z = sparse(vec(index), J[:], vec(value), 1000000, 100)
    return Z
end
```

and `@time Z = test(); 22.801165 seconds (228.17 k allocations: 74.604 GiB, 30.57% gc time)`

and finally with @distributed

```julia
using SparseArrays, SharedArrays, Distributed

@everywhere begin
    function timeconsuming(params)
        index = zeros(Int64, 10000, 1)
        value = zeros(Float64, 10000, 1)
        for k = 1:1000
            ind = 10 * (k - 1) .+ (1:10)
            index[ind] = 1000 * (k - 1) .+ Int64.(ceil.(1000 .* rand(10, 1)))
            value[ind] = params
        end
        for k = 1:100
            dum = rand(1000, 1000)   # busy-work to slow this function down
        end
        return index, value
    end
end

function test()
    index = SharedArray{Int64,2}(1000000, 1)
    value = SharedArray{Float64,2}(1000000, 1)
    J = repeat(transpose(collect(1:100)), 10000, 1)
    @sync @distributed for p = 1:100
        params = rand(1, 10)
        ind = 10000 * (p - 1) .+ (1:10000)
        index[ind], value[ind] = timeconsuming(params)
    end
    Z = sparse(vec(index), J[:], vec(value), 1000000, 100)
    return Z
end
```

and `@time Z = test(); 15.664146 seconds (12.10 k allocations: 53.835 MiB)`

This last seems to be the better of the two.

I think I’ve learned a few things, and I’ll now have to see what I can do with my actual code. Thanks to everybody for helping, and any further comments would be gratefully received.

I’m not sure, but it seems to me that it is a `SparseArrays` thing. I observed that a dense-matrix version run serially in Julia is 3x faster than MATLAB’s parallel execution, so maybe sparse arrays in MATLAB are especially optimized? I don’t use MATLAB any more, but I remember that a few years ago its sparse matrices weren’t that good; I even avoided them for performance reasons.

Please don’t use `@time` for benchmarking. There are too many ways to get inaccurate results. I believe you can benchmark both parallel and distributed code using BenchmarkTools.
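For reference, a minimal BenchmarkTools sketch; the `$` interpolation matters when benchmarking expressions that reference global variables:

```julia
using BenchmarkTools

x = rand(1000)
@btime sum($x)       # $x interpolates the global, so its access cost isn't timed
@benchmark sum($x)   # full statistics: min/median/mean times, allocations, GC
```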


OK. I am aware of the potential problems though. I always make sure the code has run a couple of times to check that it has properly compiled. When I’ve tried using `@btime` instead, I haven’t found a significant difference.

I expect MATLAB’s sparse matrices are very well optimized. I do agree that it’s a SparseArrays thing, and that I should build the index and value vectors first, then create the sparse matrices instead of working with sparse arrays directly. I’ll be working on my actual code next week taking all of these ideas into account and will note what I find here.

By working with the indices and values instead of the sparse array, I’ve got my code working. In fact, on a single processor it runs at a speed comparable to multithreaded MATLAB, so I’m pretty pleased, as you can imagine. My only issue now is that I can’t seem to put an array of arrays into a SharedArray, but I’m going to post separately about that.

Postscript: it seems that threads work well without sharing the arrays, so I’m beating MATLAB handily and am a happy user.
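For anyone landing here later, this is roughly what the `@threads` version of the final code might look like (a sketch, with the artificial `rand(1000,1000)` busy-work dropped): each iteration writes a disjoint slice of `index`/`value`, so plain `Array`s are safe and no `SharedArray` or lock is needed.

```julia
using SparseArrays, Base.Threads

function timeconsuming(params)
    index = zeros(Int, 10_000)
    value = zeros(Float64, 10_000)
    for k = 1:1000
        ind = 10 * (k - 1) .+ (1:10)
        index[ind] = 1000 * (k - 1) .+ rand(1:1000, 10)
        value[ind] = params
    end
    return index, value
end

function test_threads()
    index = zeros(Int, 1_000_000)
    value = zeros(Float64, 1_000_000)
    @threads for p = 1:100
        ind = 10_000 * (p - 1) .+ (1:10_000)   # disjoint slice per iteration
        index[ind], value[ind] = timeconsuming(rand(10))
    end
    J = repeat(1:100, inner = 10_000)   # column index for each slot
    return sparse(index, J, value, 1_000_000, 100)   # duplicates are summed
end
```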
