# Y.A.t.Q : Yet Another @threads Question

Hello,

I am calculating the cosine distance between two vector spaces using the Distances.jl package.
Since the vector spaces being compared are large, I wanted to compute the cosine distances with `@threads`. The problem is that the sequential for loop gives results faster than the threaded version. I know that spawning and synchronizing threads takes time, which might be the reason. But could someone read the threaded version and tell me the proper way of writing parallelized for loops in Julia?

A minimal working example:

```julia
using Base.Threads
using Distances

X = rand(Int(200e3), 300);
X_sub = X[1:4000, :];

function sequenceCosine(X4k::Matrix, X::Matrix)
    results = []
    for i in 1:size(X4k, 1)
        dist = Float64[]
        for j in 1:size(X, 1)
            push!(dist, cosine_dist(X4k[i, :], X[j, :]))
        end
        push!(results, dist)
    end
    return results
end

function parallelCosine(X_sub::Matrix, X::Matrix)
    T = div(size(X_sub, 1), nthreads())  # chunks per thread (restored; lost from the post)
    @threads for i in 1:nthreads()       # outer threaded loop (restored; lost from the post)
        for k in 0:(T - 1)
            dist = Float64[]
            for j in 1:size(X, 1)
                push!(dist, cosine_dist(X_sub[i + (nthreads() * k), :], X[j, :]))
            end
        end
    end
end

@time result = sequenceCosine(X_sub, X);
@time result = parallelCosine(X_sub, X);  # very similar time w.r.t. the sequential version
```

B.R.

The slightly reworked MWE

```julia
using Base.Threads
using Distances
using BenchmarkTools

function sequenceCosine(X_sub::Matrix, X::Matrix)
    results = Matrix(undef, size(X_sub, 1), size(X, 1))
    for i in 1:size(X_sub, 1)
        for j in 1:size(X, 1)
            results[i, j] = cosine_dist((@view X_sub[i, :]), (@view X[j, :]))
        end
    end
    results
end

function parallelCosine(X_sub::Matrix, X::Matrix)
    results = Matrix(undef, size(X_sub, 1), size(X, 1))
    len = div(size(X_sub, 1), nthreads() - 1)
    @threads for t in 1:nthreads()  # restored: the outer threaded loop was lost from the post
        for i in 1 + (t - 1) * len:min(t * len, size(X_sub, 1))
            for j in 1:size(X, 1)
                results[i, j] = cosine_dist((@view X_sub[i, :]), (@view X[j, :]))
            end
        end
    end
    results
end

X = rand(Int(200e2), 300);
X_sub = X[1:400, :];

result1 = sequenceCosine(X_sub, X);
result2 = parallelCosine(X_sub, X);
@assert result1 == result2

@btime result = sequenceCosine($X_sub, $X);
@btime result = parallelCosine($X_sub, $X);
```

shows

```
  4.429 s (8000002 allocations: 183.11 MiB)
895.453 ms (8000034 allocations: 183.11 MiB)
```

for 6 threads on Julia 1.6.3.


Thank you very much for reviewing the code and making it better.

1. From other working examples, I saw that in order to hold the values produced by each thread you need to create a dynamic array (`results` in my case), and when the results are ready, they are pushed into that dynamic array using `threadid()`. From your code, I understand that there is no need for that. Could you explain the inner workings?
2. About `@view` inside the innermost loop: why do we need it?

B.R.


Hi,

To point 1: in my understanding, you can concurrently read non-changing parts of memory and write non-overlapping parts of memory without data races. Therefore you only need to slice the task accordingly. In short: you have to avoid data races.

To point 2: `@view` constructs a view into an array without copying. We can use it to avoid memory allocations when reading parts of arrays.
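A tiny sketch of the copy-vs-view difference (variable and function names are mine, only for illustration):

```julia
x = rand(1000, 1000)

row_copy(x) = sum(x[1, :])        # x[1, :] copies the row into a fresh Vector
row_view(x) = sum(@view x[1, :])  # @view reads the row in place, no copy

row_copy(x); row_view(x);         # warm up (compile) before measuring
@show @allocated row_copy(x)      # roughly 8 KB for the 1000-element Float64 copy
@show @allocated row_view(x)      # close to zero
```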

All the best,


`Matrix(undef, ...)` creates a `Matrix{Any}`; that doesn’t seem like a good idea.


Sorry for the carelessness on my part. But I didn’t notice obvious problems using the profiler. To OP:

```julia
Matrix{Float64}(undef, size(X_sub, 1), size(X, 1))
```

will eliminate nearly all allocations. Here are my results with this change:

```
  4.370 s (2 allocations: 61.04 MiB)
1.020 s (33 allocations: 61.04 MiB)
```

@DNF, @goerch I thought that might be negligible, but it seems it is not. Thank you both!

BTW, you don’t need to keep track of the threads yourself:

```julia
function pcosine(X_sub, X)
    res = similar(X, size(X_sub, 1), size(X, 1))
    @threads for j in axes(X, 1)
        for i in axes(X_sub, 1)
            res[i, j] = @views cosine_dist(X_sub[i, :], X[j, :])
        end
    end
    return res
end
```

Definitely not. Always avoid things like `results = []`, which is a red performance flag, and pre-allocating is better than incrementally `push!`ing.
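A minimal sketch of the two styles (function names are mine, for illustration only):

```julia
# Growing an untyped Vector{Any} with push! (slow pattern):
function squares_push(n)
    results = []                      # eltype Any: every element is boxed
    for i in 1:n
        push!(results, Float64(i)^2)  # may also reallocate as the array grows
    end
    results
end

# Pre-allocating a typed array and writing in place (fast pattern):
function squares_prealloc(n)
    results = Vector{Float64}(undef, n)
    for i in 1:n
        results[i] = Float64(i)^2
    end
    results
end
```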


Beautiful fast code:

```julia
function sequenceCosine(X_sub::Matrix, X::Matrix)
    results = similar(X, size(X_sub, 1), size(X, 1))
    for i in axes(X_sub, 1)
        for j in axes(X, 1)
            results[i, j] = @views cosine_dist(X_sub[i, :], X[j, :])
        end
    end
    results
end

function parallelCosine(X_sub::Matrix, X::Matrix)
    results = similar(X, size(X_sub, 1), size(X, 1))
    @threads for j in axes(X, 1)
        for i in axes(X_sub, 1)
            results[i, j] = @views cosine_dist(X_sub[i, :], X[j, :])
        end
    end
    results
end
```

with result

```
  4.377 s (2 allocations: 61.04 MiB)
604.920 ms (33 allocations: 61.04 MiB)
```

Which leads to the following question: how to parallelize the corresponding comprehension

```julia
function sequenceCosine(X_sub::Matrix, X::Matrix)
    [@views cosine_dist(X_sub[i, :], X[j, :])
     for i in axes(X_sub, 1),
         j in axes(X, 1)]
end
```

which BTW runs faster than the looping version:

```
  2.739 s (2 allocations: 61.04 MiB)
```

?

There’s actually a tiny trap here: if the inputs have an integer eltype, `similar` will produce an integer result matrix, and assigning the floating-point distances to it will fail. So to really make this robust, one probably needs to do some float promotion or something.
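One way around that trap, as a sketch (`result_buffer` is my name, not from the thread):

```julia
# Allocate the result with a floating-point eltype derived from the inputs,
# instead of blindly inheriting a possibly-integer eltype via `similar`:
function result_buffer(X_sub::AbstractMatrix, X::AbstractMatrix)
    T = float(promote_type(eltype(X_sub), eltype(X)))
    Matrix{T}(undef, size(X_sub, 1), size(X, 1))
end
```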

In fact, a parallel comprehension is probably the best way, but I’m not sure how to do that.

Seems this is the only way out at the moment:

```julia
function sequenceCosine(X_sub::Matrix, X::Matrix)
    [@views cosine_dist(X_sub[i, :], X[j, :])
     for i in axes(X_sub, 1),
         j in axes(X, 1)]
end

function parallelCosine(X_sub::Matrix, X::Matrix)
    results = similar(X, size(X_sub, 1), size(X, 1))
    @threads for j in axes(X, 1)
        @views results[:, j] .=
            [cosine_dist(X_sub[i, :], X[j, :])
             for i in axes(X_sub, 1)]
    end
    results
end
```

Results:

```
  2.739 s (2 allocations: 61.04 MiB)
570.136 ms (20033 allocations: 124.51 MiB)
```

If you care about the performance of array-traversal code, it’s crucial to consider memory layout. In Julia, arrays are column-major by default, and switching to row-major gives almost a 5-fold speedup for me. This is without parallelization:

```julia
X_sub_r = PermutedDimsArray(permutedims(X_sub), (2, 1))
X_r = PermutedDimsArray(permutedims(X), (2, 1))
sequenceCosine(X_sub, X)      # 3.6 s
sequenceCosine(X_sub_r, X_r)  # 0.72 s
```

Writing this function in terms of `map` can make it absolutely trivial to parallelize:

```julia
# sequential - has the same performance as sequenceCosine, 0.7 s for row-major arrays
function sequenceCosine3(X_sub, X)
    map(Iterators.product(eachrow(X_sub), eachrow(X))) do (i, j)
        cosine_dist(i, j)
    end
end

# parallel - 0.26 s for me
# (body restored: the original lines were lost; this assumes ThreadsX.jl,
#  whose `ThreadsX.map` is the usual drop-in threaded `map`)
using ThreadsX
function parallelCosine3(X_sub, X)
    ThreadsX.map(Iterators.product(eachrow(X_sub), eachrow(X))) do (i, j)
        cosine_dist(i, j)
    end
end
```

Very nice solution! Put into my test bench and got

```
  2.740 s (2 allocations: 61.04 MiB)
590.554 ms (20033 allocations: 124.51 MiB)
```

for

```julia
@btime result = sequenceCosine($X_sub, $X);
@btime result = parallelCosine($X_sub, $X);
```

and

```
  616.712 ms (2 allocations: 61.04 MiB)
186.552 ms (4185 allocations: 320.18 MiB)
```

for

```julia
X_sub_r = PermutedDimsArray(permutedims(X_sub), (2, 1))
X_r = PermutedDimsArray(permutedims(X), (2, 1))

@btime result = sequenceCosine($X_sub_r, $X_r);
@btime result = parallelCosine($X_sub_r, $X_r);
```

Hello @aplavin ,

I was reading your post and noticed that you mentioned that the row-major operation is more performant than the column-major one. Shouldn’t it be vice versa?

The `PermutedDimsArray` with `permutedims(X)` gives the same result as the original matrix.

```julia
using Test
X = rand(10, 20);
X_r = PermutedDimsArray(permutedims(X), (2, 1))

@test X == X_r  # passes
```

I could not catch the point. Could you please explain?

B.R.

I can’t believe I missed that you were working row major, that’s of course an important difference.

This `PermutedDimsArray(permutedims(X), (2, 1))` is a somewhat obscure way of making `X_r` into a row-major array. `permutedims` transposes `X`, and `PermutedDimsArray` transposes it back again, except lazily. So `X_r` is now the same as `X` except it’s row-major, leading to a significant speedup.
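One can check the two layouts with `strides`, while the matrices themselves compare equal:

```julia
X = rand(3, 4)
X_r = PermutedDimsArray(permutedims(X), (2, 1))

X == X_r      # same size, same values
strides(X)    # (1, 3): consecutive elements of a column are adjacent in memory
strides(X_r)  # (4, 1): consecutive elements of a row are adjacent in memory
```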

But for you, @kadir-gunel, it would be better to organize your problem so that the matrices hold the interesting data along their columns in the first place. Then you can rewrite the function to work along columns.
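That suggestion might look like the following sketch (assuming each vector is stored as a column; `parallelCosineCols` is my name, not from the thread):

```julia
using Base.Threads
using Distances

# Column-major variant: each compared vector lives in a column,
# so the inner views are contiguous in memory.
function parallelCosineCols(Xc_sub::AbstractMatrix, Xc::AbstractMatrix)
    results = Matrix{Float64}(undef, size(Xc_sub, 2), size(Xc, 2))
    @threads for j in axes(Xc, 2)
        for i in axes(Xc_sub, 2)
            results[i, j] = @views cosine_dist(Xc_sub[:, i], Xc[:, j])
        end
    end
    results
end
```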


And, in case this isn’t clear: your code works along rows, therefore working on a row-major array is more efficient than on a column-major one.


Yes, my bad. Thank you for clarification.