I am calculating the cosine distance between two vector spaces using the Distances.jl package.
Since the vector spaces being compared are large, I wanted to compute the cosine distances with @threads. The problem is that the sequential for loop gives results faster than the threaded version. I know that spawning and synchronizing threads takes time, which might be the reason. But could someone read the threaded version and tell me the proper way of writing parallelized for loops in Julia?

A minimal working example:

using Base.Threads
using Distances
X = rand(Int(200e3), 300);
X_sub = X[1:4000, :];
function sequenceCosine(X4k::Matrix, X::Matrix)
    results = []
    for i in 1:size(X4k, 1)
        dist = Float64[]
        for j in 1:size(X, 1)
            push!(dist, cosine_dist(X4k[i, :], X[j, :]))
        end
        push!(results, dist)
    end
    return results
end
function parallelCosine(X_sub::Matrix, X::Matrix)
    r_threads = [Array{Float64}[] for i in 1:nthreads()]
    T = div(size(X_sub, 1), nthreads()) # got 32 threads
    for k in 0:(T - 1)
        @threads for i in 1:nthreads()
            dist = Float64[]
            for j in 1:size(X, 1)
                push!(dist, cosine_dist(X_sub[i + (nthreads() * k), :], X[j, :]))
            end
            push!(r_threads[threadid()], dist)
        end
    end
    return r_threads
end
@time result = sequenceCosine(X_sub, X);
@time result = parallelCosine(X_sub, X); # very similar time w.r.t. the sequential version

using Base.Threads
using Distances
using BenchmarkTools
function sequenceCosine(X_sub::Matrix, X::Matrix)
    results = Matrix(undef, size(X_sub, 1), size(X, 1))
    for i in 1:size(X_sub, 1)
        for j in 1:size(X, 1)
            results[i, j] = cosine_dist((@view X_sub[i, :]), (@view X[j, :]))
        end
    end
    results
end
function parallelCosine(X_sub::Matrix, X::Matrix)
    results = Matrix(undef, size(X_sub, 1), size(X, 1))
    len = div(size(X_sub, 1), nthreads() - 1)
    @threads for t in 1:nthreads()
        for i in 1 + (t - 1) * len:min(t * len, size(X_sub, 1))
            for j in 1:size(X, 1)
                results[i, j] = cosine_dist((@view X_sub[i, :]), (@view X[j, :]))
            end
        end
    end
    results
end
X = rand(Int(200e2), 300);
X_sub = X[1:400, :];
result1 = sequenceCosine(X_sub, X);
result2 = parallelCosine(X_sub, X);
@assert result1 == result2
@btime result = sequenceCosine($X_sub, $X);
@btime result = parallelCosine($X_sub, $X);

shows

4.429 s (8000002 allocations: 183.11 MiB)
895.453 ms (8000034 allocations: 183.11 MiB)

Thank you very much for reviewing code and making it better.

I have 2 questions about your answer:

From other working examples, I saw that in order to hold the values produced by each thread, one needs to create a dynamic array (results, in my case). When the results are ready, they are pushed into that dynamic array using threadid(). From your code, I understand that there is no need for that. Could you explain the inner workings?

About @view inside the innermost loop: why do we need it?

To point 1: in my understanding, you can concurrently read non-changing parts of memory and write to non-overlapping parts of memory without data races, so you only need to slice the task accordingly. In short: you have to avoid data races.
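A minimal self-contained sketch of that slicing idea (not code from the thread): each @threads iteration owns one column of the output, so the writes never overlap and no threadid()-indexed buffers are needed.

```julia
using Base.Threads

# Each iteration of the threaded loop writes only column j of `out`;
# the columns are disjoint memory regions, so there is no data race.
function fill_columns!(out::Matrix{Float64})
    @threads for j in axes(out, 2)
        for i in axes(out, 1)
            out[i, j] = i + 100.0 * j   # touches column j only
        end
    end
    return out
end

out = fill_columns!(zeros(3, 4))
```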

To point 2: @view constructs a view into an array without copying. We can use it to avoid memory allocations when reading parts of arrays.
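A quick self-contained illustration (not from the thread) of the copy-vs-view distinction:

```julia
A = ones(4, 4)
v = @view A[1, :]   # lazy view: shares memory with A, no row copy
s = A[1, :]         # plain indexing: allocates a fresh vector copy

v[1] = -1.0         # writing through the view mutates A itself
# A[1, 1] is now -1.0, while s[1] is still 1.0: the copy is independent
```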

BTW, you don't need to keep track of the threads yourself:

function pcosine(X_sub, X)
    res = similar(X, size(X_sub, 1), size(X, 1))
    @threads for j in axes(X, 1)
        for i in axes(X_sub, 1)
            res[i, j] = @views cosine_dist(X_sub[i, :], X[j, :])
        end
    end
    return res
end

Definitely! Always avoid things like

results = []

which is a red performance flag, and pre-allocating is better than incrementally push!ing.
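To illustrate why (a self-contained sketch, not code from the thread): [] creates a Vector{Any}, so every push! stores a boxed value and the array repeatedly regrows, whereas a concretely typed, pre-sized array avoids both costs.

```julia
bad = []                          # Vector{Any}: elements are boxed
push!(bad, 1.0)

good = Vector{Float64}(undef, 3)  # concrete eltype, allocated once
good .= (1.0, 2.0, 3.0)           # filled in place, no regrowing
```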

function sequenceCosine(X_sub::Matrix, X::Matrix)
    results = similar(X, size(X_sub, 1), size(X, 1))
    for i in axes(X_sub, 1)
        for j in axes(X, 1)
            results[i, j] = @views cosine_dist(X_sub[i, :], X[j, :])
        end
    end
    results
end
function parallelCosine(X_sub::Matrix, X::Matrix)
    results = similar(X, size(X_sub, 1), size(X, 1))
    @threads for j in axes(X, 1)
        for i in axes(X_sub, 1)
            results[i, j] = @views cosine_dist(X_sub[i, :], X[j, :])
        end
    end
    results
end

with result

4.377 s (2 allocations: 61.04 MiB)
604.920 ms (33 allocations: 61.04 MiB)

There's actually a tiny trap here. similar will fail if the inputs have an integer eltype. So to really make this robust, one probably needs to do some float promotion or something.
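One way to do that promotion, sketched here as an assumption rather than code from the thread: pass the target eltype to similar explicitly via float(eltype(...)).

```julia
Xi = rand(1:10, 5, 3)   # integer-valued input matrix
# similar(Xi, m, n) would keep eltype Int, and storing a fractional
# cosine distance in it would throw an InexactError.  Promoting the
# eltype sidesteps the trap:
out = similar(Xi, float(eltype(Xi)), size(Xi, 1), size(Xi, 1))
```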

In fact, a parallel comprehension is probably the best way, but I'm not sure how to do that.

function sequenceCosine(X_sub::Matrix, X::Matrix)
    [@views cosine_dist(X_sub[i, :], X[j, :])
     for i in axes(X_sub, 1),
         j in axes(X, 1)]
end
function parallelCosine(X_sub::Matrix, X::Matrix)
    results = similar(X, size(X_sub, 1), size(X, 1))
    @threads for j in axes(X, 1)
        @views results[:, j] .=
            [cosine_dist(X_sub[i, :], X[j, :])
             for i in axes(X_sub, 1)]
    end
    results
end

Results:

2.739 s (2 allocations: 61.04 MiB)
570.136 ms (20033 allocations: 124.51 MiB)

If you care about the performance of array-traversal code, it's crucial to consider memory layout. In Julia, arrays are column-major by default, and switching to row-major gives almost a 5-fold speedup for me. This is without parallelization:

X_sub_r = PermutedDimsArray(permutedims(X_sub), (2, 1))
X_r = PermutedDimsArray(permutedims(X), (2, 1))
sequenceCosine(X_sub, X) # 3.6 s
sequenceCosine(X_sub_r, X_r) # 0.72 s
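A toy illustration of what column-major means (self-contained, nothing beyond Base): consecutive elements of a column sit next to each other in memory, while row elements are strided.

```julia
A = collect(reshape(1:12, 3, 4))  # Julia fills arrays column by column
A[:, 1]       # [1, 2, 3]: a column is contiguous in memory
A[1, :]       # [1, 4, 7, 10]: a row jumps by the column length (stride 3)
strides(A)    # (1, 3): step 1 within a column, 3 between columns
```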

Writing this function in terms of map can make it absolutely trivial to parallelize:

using ThreadsX  # provides the threaded map used below

# sequential - has the same performance as sequenceCosine, 0.7 s for row-major arrays
function sequenceCosine3(X_sub, X)
    map(Iterators.product(eachrow(X_sub), eachrow(X))) do (i, j)
        cosine_dist(i, j)
    end
end
# parallel - 0.26 s for me
function parallelCosine3(X_sub, X)
    ThreadsX.map(Iterators.product(eachrow(X_sub), eachrow(X))) do (i, j)
        cosine_dist(i, j)
    end
end

I can't believe I missed that you were working row-major; that's of course an important difference.

This

X_r = PermutedDimsArray(permutedims(X), (2, 1))

is a somewhat obscure way of making X_r into a row-major array. permutedims transposes X, and PermutedDimsArray transposes it back again, except lazily. So X_r is now the same as X except it's row-major, leading to a significant speedup.
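You can check the laziness directly (a small self-contained sketch): X_r compares equal to X element-wise, but its strides reveal the row-major layout.

```julia
X = rand(3, 5)
X_r = PermutedDimsArray(permutedims(X), (2, 1))

X_r == X          # true: same values at the same indices
strides(X)        # (1, 3): column-major
strides(X_r)      # (5, 1): row-major, rows are now contiguous
```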

But for you, @kadir-gunel, it would be better to organize your problem so that the matrices hold the interesting data along their columns in the first place. Then you can rewrite the function to work along columns.
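A hypothetical column-wise rewrite under that assumption (each data point stored as a column; my_cosdist is a hand-rolled stand-in for Distances.cosine_dist so the sketch only needs the standard library):

```julia
using Base.Threads, LinearAlgebra

# Stand-in for Distances.cosine_dist, defined from dot products.
my_cosdist(a, b) = 1 - dot(a, b) / (norm(a) * norm(b))

function parallelCosineCols(X_sub, X)
    res = similar(X, float(eltype(X)), size(X_sub, 2), size(X, 2))
    @threads for j in axes(X, 2)
        for i in axes(X_sub, 2)
            # Column views are contiguous, so every read streams memory.
            res[i, j] = @views my_cosdist(X_sub[:, i], X[:, j])
        end
    end
    return res
end

Xc = rand(300, 1000)     # 300-dimensional points stored as columns
Xc_sub = Xc[:, 1:40]
D = parallelCosineCols(Xc_sub, Xc)
```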