Y.A.t.Q: Yet Another @threads Question

Hello,

I am calculating the cosine distances between two vector spaces using the Distances.jl package.
Since the vector spaces being compared are large, I wanted to compute the cosine distances with @threads. The problem is that the sequential for loop gives results faster than the threaded version. I know that spawning and tearing down threads takes time, which might be the reason. But could someone read the threaded version and tell me the proper way of writing parallelized for loops in Julia?

A minimal working example:

using Base.Threads
using Distances

X = rand(Int(200e3), 300);   # 200_000 × 300 matrix of random values
X_sub = X[1:4000, :];        # first 4_000 rows, compared against all of X

function sequenceCosine(X4k::Matrix, X::Matrix)
    results = []
    for i in 1:size(X4k, 1)
        dist = Float64[]
        for j in 1:size(X, 1)
            push!(dist, cosine_dist(X4k[i, :], X[j, :]))
        end
        push!(results, dist)
    end
    return results
end



function parallelCosine(X_sub::Matrix, X::Matrix)
    r_threads = [Array{Float64}[] for i in 1:nthreads()];
    T = div(size(X_sub, 1), nthreads())  # chunks per thread; nthreads() is 32 on my machine
    for k in 0:(T - 1)
        @threads for i in 1:nthreads()
            dist = Float64[];
            for j in 1:size(X, 1)
                push!(dist, cosine_dist(X_sub[i + (nthreads() * k), :], X[j, :]));
            end
            push!(r_threads[threadid()], dist)
        end
    end
    return r_threads
end

@time result = sequenceCosine(X_sub, X); 
@time result = parallelCosine(X_sub, X);  # takes nearly the same time as the sequential version

B.R.

The slightly reworked MWE:

using Base.Threads
using Distances
using BenchmarkTools

function sequenceCosine(X_sub::Matrix, X::Matrix)
    results = Matrix(undef, size(X_sub, 1), size(X, 1))
    for i in 1:size(X_sub, 1)
        for j in 1:size(X, 1)
            results[i, j] = cosine_dist((@view X_sub[i, :]), (@view X[j, :]))
        end
    end
    results
end

function parallelCosine(X_sub::Matrix, X::Matrix)
    results = Matrix(undef, size(X_sub, 1), size(X, 1))
    len = div(size(X_sub, 1), nthreads() - 1)
    @threads for t in 1:nthreads()
        for i in 1 + (t - 1) * len:min(t * len, size(X_sub, 1))
            for j in 1:size(X, 1)
                results[i, j] = cosine_dist((@view X_sub[i, :]), (@view X[j, :]))
            end
        end
    end
    results
end

X = rand(Int(200e2), 300);
X_sub = X[1:400, :];

result1 = sequenceCosine(X_sub, X);
result2 = parallelCosine(X_sub, X);
@assert result1 == result2

@btime result = sequenceCosine($X_sub, $X);
@btime result = parallelCosine($X_sub, $X);

shows

  4.429 s (8000002 allocations: 183.11 MiB)
  895.453 ms (8000034 allocations: 183.11 MiB)

for 6 threads on Julia 1.6.3.


Thank you very much for reviewing the code and making it better.

I have 2 questions about your answer:

  1. From other working examples, I saw that in order to hold the values computed by each thread, one needs to create a dynamic array (results in my case), and when the results are ready, they are pushed into that dynamic array using threadid(). From your code, I understand that there is no need for that. Could you explain the inner workings?
  2. About @view inside the innermost loop: why do we need it?

B.R.


Hi,

To point 1: in my understanding, you can concurrently read unchanging memory and write to non-overlapping parts of memory without data races. Therefore you only need to slice the task accordingly. In short: you have to avoid data races.
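A minimal sketch of that idea (the function and names here are illustrative, not from this thread): each task writes only into its own columns of the output, and the input is never mutated, so nothing needs synchronization:

using Base.Threads

# Each value of j is handled by exactly one task, so the writes to
# column j of out never overlap; data is only read, never written.
function fill_scaled_squares(data::Vector{Float64}, n::Int)
    out = Matrix{Float64}(undef, length(data), n)
    @threads for j in 1:n
        for i in eachindex(data)
            out[i, j] = data[i]^2 * j
        end
    end
    return out
end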

To point 2: @view constructs a view into an array without copying. We can use it to avoid memory allocations when reading parts of arrays.
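For illustration (assuming a plain Matrix{Float64}; the timings are indicative only):

using BenchmarkTools

X = rand(1000, 300);

row_copy(A) = sum(A[1, :])        # A[1, :] copies the row into a new vector
row_view(A) = sum(@view A[1, :])  # @view wraps the existing memory, no copy

@btime row_copy($X);   # allocates the 300-element row copy on every call
@btime row_view($X);   # essentially allocation-free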

All the best,


This (Matrix(undef, ...)) creates a Matrix{Any}; that doesn’t seem like a good idea.
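An Any eltype forces boxing and dynamic dispatch on every element access. A quick illustrative check:

eltype(Matrix(undef, 2, 2))           # Any     -> boxed, slow element access
eltype(Matrix{Float64}(undef, 2, 2))  # Float64 -> concrete, unboxed storage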


Sorry for the carelessness on my part, but I didn’t notice obvious problems using the profiler. To the OP:

Matrix{Float64}(undef, size(X_sub, 1), size(X, 1))

will eliminate nearly all allocations. Here are my results with this change:

  4.370 s (2 allocations: 61.04 MiB)
  1.020 s (33 allocations: 61.04 MiB)

@DNF, @goerch I thought that might be negligible, but it seems not 🙂
Thank you both!

BTW, you don’t need to keep track of the threads yourself:

function pcosine(X_sub, X)
    res = similar(X, size(X_sub, 1), size(X, 1))
    @threads for j in axes(X, 1)
        for i in axes(X_sub, 1)
            res[i, j] = @views cosine_dist(X_sub[i, :], X[j, :])
        end
    end
    return res
end

Definitely not 😉 Always avoid things like results = [] followed by repeated push! into an untyped container; that is a red performance flag, and pre-allocating is better than incrementally push!ing.
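For instance (a toy contrast, not code from this thread):

# Red flag: untyped accumulator grown element by element.
function grow(n)
    v = []                        # Vector{Any}: every push! boxes the value
    for i in 1:n
        push!(v, i * 0.5)
    end
    return v
end

# Better: concrete eltype, allocated once, filled in place.
function prealloc(n)
    v = Vector{Float64}(undef, n)
    for i in eachindex(v)
        v[i] = i * 0.5
    end
    return v
end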


Beautiful, fast code:

function sequenceCosine(X_sub::Matrix, X::Matrix)
    results = similar(X, size(X_sub, 1), size(X, 1))
    for i in axes(X_sub, 1)
        for j in axes(X, 1)
            results[i, j] = @views cosine_dist(X_sub[i, :], X[j, :])
        end
    end
    results
end

function parallelCosine(X_sub::Matrix, X::Matrix)
    results = similar(X, size(X_sub, 1), size(X, 1))
    @threads for j in axes(X, 1)
        for i in axes(X_sub, 1)
            results[i, j] = @views cosine_dist(X_sub[i, :], X[j, :])
        end
    end
    results
end

with the results:

  4.377 s (2 allocations: 61.04 MiB)
  604.920 ms (33 allocations: 61.04 MiB)

Which leads to the following question: how would one parallelize the corresponding comprehension below?

function sequenceCosine(X_sub::Matrix, X::Matrix)
    [@views cosine_dist(X_sub[i, :], X[j, :])
        for i in axes(X_sub, 1),
            j in axes(X, 1)]
end

which BTW runs faster than the looping version:

  2.739 s (2 allocations: 61.04 MiB)


There’s actually a tiny trap here: similar propagates the input eltype, so if the inputs have an integer eltype, the output array will too, and assigning the Float64 distances into it will fail with an InexactError. So to really make this robust, one probably needs to do some float promotion or something.
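One possible guard, as a sketch (Base’s float function promotes an integer eltype to the corresponding float type and leaves float eltypes unchanged):

# Assumption: promoting the output eltype with float() is acceptable here;
# assigning cosine_dist's Float64 results then cannot throw an InexactError.
results = similar(X, float(eltype(X)), size(X_sub, 1), size(X, 1))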

In fact, a parallel comprehension is probably the best way, but not sure how to do that.

It seems this is the only way out at the moment:

function sequenceCosine(X_sub::Matrix, X::Matrix)
    [@views cosine_dist(X_sub[i, :], X[j, :])
        for i in axes(X_sub, 1),
            j in axes(X, 1)]
end

function parallelCosine(X_sub::Matrix, X::Matrix)
    results = similar(X, size(X_sub, 1), size(X, 1))
    @threads for j in axes(X, 1)
        @views results[:, j] .=
            [cosine_dist(X_sub[i, :], X[j, :])
                for i in axes(X_sub, 1)]
    end
    results
end

Results:

  2.739 s (2 allocations: 61.04 MiB)
  570.136 ms (20033 allocations: 124.51 MiB)

If you care about the performance of array-traversal code, it’s crucial to consider memory layout. In Julia, arrays are column-major by default, and for this row-wise code, switching to row-major arrays gives almost a 5-fold speedup for me. This is without parallelization:

X_sub_r = PermutedDimsArray(permutedims(X_sub), (2, 1))
X_r = PermutedDimsArray(permutedims(X), (2, 1))
sequenceCosine(X_sub, X)  # 3.6 s
sequenceCosine(X_sub_r, X_r)  # 0.72 s

Writing this function in terms of map can make it absolutely trivial to parallelize:

# sequential - has the same performance as sequenceCosine, 0.7 s for row-major arrays
function sequenceCosine3(X_sub, X)
    map(Iterators.product(eachrow(X_sub), eachrow(X))) do (i, j)
        cosine_dist(i, j)
    end
end

# parallel - 0.26 s for me (requires the ThreadsX package)
using ThreadsX

function parallelCosine3(X_sub, X)
    ThreadsX.map(Iterators.product(eachrow(X_sub), eachrow(X))) do (i, j)
        cosine_dist(i, j)
    end
end

Very nice solution! I put it into my test bench and got

  2.740 s (2 allocations: 61.04 MiB)
  590.554 ms (20033 allocations: 124.51 MiB)

for

@btime result = sequenceCosine($X_sub, $X);
@btime result = parallelCosine($X_sub, $X);

and

  616.712 ms (2 allocations: 61.04 MiB)
  186.552 ms (4185 allocations: 320.18 MiB)

for

X_sub_r = PermutedDimsArray(permutedims(X_sub), (2, 1))
X_r = PermutedDimsArray(permutedims(X), (2, 1))

@btime result = sequenceCosine($X_sub_r, $X_r);
@btime result = parallelCosine($X_sub_r, $X_r);

Hello @aplavin ,

I was reading your post and noticed that you mentioned that the row-major operation is more performant than the column-major one. Shouldn’t it be vice versa?

PermutedDimsArray with permutedims(X) gives the same result as the original matrix:

using Test
X = rand(10, 20);
X_r = PermutedDimsArray(permutedims(X), (2, 1));

@test X == X_r # gives true 

I could not catch the point. Could you please explain?

B.R.

I can’t believe I missed that you were working row-major; that’s of course an important difference.

This (PermutedDimsArray(permutedims(X), (2, 1))) is a somewhat obscure way of making X_r into a row-major array: permutedims transposes X, and PermutedDimsArray transposes it back again, except lazily. So X_r is now the same as X except it’s row-major, leading to a significant speedup.

But for you, @kadir-gunel, it would be better to organize your problem so that the matrices hold the interesting data along their columns in the first place. Then you can rewrite the function to work along columns, as sketched below.
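A sketch of what that could look like (parallelCosineCols is a hypothetical name, assuming the vectors of interest are stored as columns):

using Base.Threads
using Distances

# Hypothetical column-major variant: each vector occupies a contiguous
# column, so the innermost traversal follows Julia's memory layout.
function parallelCosineCols(Xt_sub::Matrix, Xt::Matrix)
    results = similar(Xt, size(Xt_sub, 2), size(Xt, 2))
    @threads for j in axes(Xt, 2)
        for i in axes(Xt_sub, 2)
            results[i, j] = @views cosine_dist(Xt_sub[:, i], Xt[:, j])
        end
    end
    return results
end

# Usage: transpose the original row-major data once, up front:
# parallelCosineCols(permutedims(X_sub), permutedims(X))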


And, in case this isn’t clear: your code works along rows, therefore working on a row-major array is more efficient than on one that is column-major.


Yes, my bad. Thank you for the clarification.