Y.A.t.Q: Yet Another @threads Question

Hello,

I am calculating the cosine distances between two vector spaces using the Distances.jl package.
Since the vector spaces being compared are large, I wanted to compute the cosine distances with @threads. The problem is that the sequential for loop gives results faster than the threaded version. I know that spawning and tearing down threads takes time, which might be the reason. But could someone read the threaded version and tell me the proper way of writing parallelized for loops in Julia?

A minimal working example:

using Base.Threads
using Distances

X = rand(Int(200e3), 300);   # 200_000 × 300 matrix of random values
X_sub = X[1:4000, :];        # first 4_000 rows, compared against all of X

function sequenceCosine(X4k::Matrix, X::Matrix)
    results = []
    for i in 1:size(X4k, 1)
        dist = Float64[]
        for j in 1:size(X, 1)
            push!(dist, cosine_dist(X4k[i, :], X[j, :]))
        end
        push!(results, dist)
    end
    return results
end



function parallelCosine(X_sub::Matrix, X::Matrix)
    r_threads = [Array{Float64}[] for i in 1:nthreads()];
    T = div(size(X_sub, 1), nthreads())  # chunks per thread; nthreads() is 32 on my machine
    for k in 0:(T - 1)
        @threads for i in 1:nthreads()
            dist = Float64[];
            for j in 1:size(X, 1)
                push!(dist, cosine_dist(X_sub[i + (nthreads() * k), :], X[j, :]));
            end
            push!(r_threads[threadid()], dist)
        end
    end
    return r_threads
end

@time result = sequenceCosine(X_sub, X); 
@time result = parallelCosine(X_sub, X);  # takes nearly the same time as the sequential version

B.R.

The slightly reworked MWE:

using Base.Threads
using Distances
using BenchmarkTools

function sequenceCosine(X_sub::Matrix, X::Matrix)
    results = Matrix(undef, size(X_sub, 1), size(X, 1))
    for i in 1:size(X_sub, 1)
        for j in 1:size(X, 1)
            results[i, j] = cosine_dist((@view X_sub[i, :]), (@view X[j, :]))
        end
    end
    results
end

function parallelCosine(X_sub::Matrix, X::Matrix)
    results = Matrix(undef, size(X_sub, 1), size(X, 1))
    len = div(size(X_sub, 1), nthreads() - 1)
    @threads for t in 1:nthreads()
        for i in 1 + (t - 1) * len:min(t * len, size(X_sub, 1))
            for j in 1:size(X, 1)
                results[i, j] = cosine_dist((@view X_sub[i, :]), (@view X[j, :]))
            end
        end
    end
    results
end

X = rand(Int(200e2), 300);
X_sub = X[1:400, :];

result1 = sequenceCosine(X_sub, X);
result2 = parallelCosine(X_sub, X);
@assert result1 == result2

@btime result = sequenceCosine($X_sub, $X);
@btime result = parallelCosine($X_sub, $X);

shows

  4.429 s (8000002 allocations: 183.11 MiB)
  895.453 ms (8000034 allocations: 183.11 MiB)

for 6 threads on Julia 1.6.3.


Thank you very much for reviewing the code and making it better.

I have 2 questions about your answer:

  1. From other working examples, I saw that in order to hold the values computed by each thread, one needs to create a dynamic array (results in my case), and when the results are ready, they are pushed into that dynamic array using threadid(). From your code, I understand that there is no need for that. Could you explain the inner workings?
  2. About @view inside the innermost loop: why do we need it?

B.R.


Hi,

To point 1: in my understanding, you can concurrently read unchanging memory and write to non-overlapping parts of memory without data races. Therefore you only need to slice the task accordingly. In short: you have to avoid data races.
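A minimal sketch of that idea (the function and names here are illustrative, not from this thread): each task writes only into its own columns of the output, and the input is never mutated, so nothing needs synchronization:

using Base.Threads

# Each value of j is handled by exactly one task, so the writes to
# column j of out never overlap; data is only read, never written.
function fill_scaled_squares(data::Vector{Float64}, n::Int)
    out = Matrix{Float64}(undef, length(data), n)
    @threads for j in 1:n
        for i in eachindex(data)
            out[i, j] = data[i]^2 * j
        end
    end
    return out
end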

To point 2: @view constructs a view into an array without copying. We can use it to avoid memory allocations when reading parts of arrays.
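For illustration (assuming a plain Matrix{Float64}; the timings are indicative only):

using BenchmarkTools

X = rand(1000, 300);

row_copy(A) = sum(A[1, :])        # A[1, :] copies the row into a new vector
row_view(A) = sum(@view A[1, :])  # @view wraps the existing memory, no copy

@btime row_copy($X);   # allocates the 300-element row copy on every call
@btime row_view($X);   # essentially allocation-free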

All the best,


This (Matrix(undef, ...)) creates a Matrix{Any}; that doesn’t seem like a good idea.
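An Any eltype forces boxing and dynamic dispatch on every element access. A quick illustrative check:

eltype(Matrix(undef, 2, 2))           # Any     -> boxed, slow element access
eltype(Matrix{Float64}(undef, 2, 2))  # Float64 -> concrete, unboxed storage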


Sorry for the carelessness on my part, but I didn’t notice obvious problems using the profiler. To the OP:

Matrix{Float64}(undef, size(X_sub, 1), size(X, 1))

will eliminate nearly all allocations. Here are my results with this change:

  4.370 s (2 allocations: 61.04 MiB)
  1.020 s (33 allocations: 61.04 MiB)

@DNF, @goerch I thought that might be negligible, but it seems not 🙂
Thank you both!

BTW, you don’t need to keep track of the threads yourself:

function pcosine(X_sub, X)
    res = similar(X, size(X_sub, 1), size(X, 1))
    @threads for j in axes(X, 1)
        for i in axes(X_sub, 1)
            res[i, j] = @views cosine_dist(X_sub[i, :], X[j, :])
        end
    end
    return res
end

Definitely not 😉 Always avoid things like results = [] followed by repeated push! into an untyped container; that is a red performance flag, and pre-allocating is better than incrementally push!ing.
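For instance (a toy contrast, not code from this thread):

# Red flag: untyped accumulator grown element by element.
function grow(n)
    v = []                        # Vector{Any}: every push! boxes the value
    for i in 1:n
        push!(v, i * 0.5)
    end
    return v
end

# Better: concrete eltype, allocated once, filled in place.
function prealloc(n)
    v = Vector{Float64}(undef, n)
    for i in eachindex(v)
        v[i] = i * 0.5
    end
    return v
end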


Beautiful, fast code:

function sequenceCosine(X_sub::Matrix, X::Matrix)
    results = similar(X, size(X_sub, 1), size(X, 1))
    for i in axes(X_sub, 1)
        for j in axes(X, 1)
            results[i, j] = @views cosine_dist(X_sub[i, :], X[j, :])
        end
    end
    results
end

function parallelCosine(X_sub::Matrix, X::Matrix)
    results = similar(X, size(X_sub, 1), size(X, 1))
    @threads for j in axes(X, 1)
        for i in axes(X_sub, 1)
            results[i, j] = @views cosine_dist(X_sub[i, :], X[j, :])
        end
    end
    results
end

with the results:

  4.377 s (2 allocations: 61.04 MiB)
  604.920 ms (33 allocations: 61.04 MiB)

Which leads to the following question: how would one parallelize the corresponding comprehension below?

function sequenceCosine(X_sub::Matrix, X::Matrix)
    [@views cosine_dist(X_sub[i, :], X[j, :])
        for i in axes(X_sub, 1),
            j in axes(X, 1)]
end

which BTW runs faster than the looping version:

  2.739 s (2 allocations: 61.04 MiB)


There’s actually a tiny trap here: similar propagates the input eltype, so if the inputs have an integer eltype, the output array will too, and assigning the Float64 distances into it will fail with an InexactError. So to really make this robust, one probably needs to do some float promotion or something.
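One possible guard, as a sketch (Base’s float function promotes an integer eltype to the corresponding float type and leaves float eltypes unchanged):

# Assumption: promoting the output eltype with float() is acceptable here;
# assigning cosine_dist's Float64 results then cannot throw an InexactError.
results = similar(X, float(eltype(X)), size(X_sub, 1), size(X, 1))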

In fact, a parallel comprehension is probably the best way, but not sure how to do that.

It seems this is the only way out at the moment:

function sequenceCosine(X_sub::Matrix, X::Matrix)
    [@views cosine_dist(X_sub[i, :], X[j, :])
        for i in axes(X_sub, 1),
            j in axes(X, 1)]
end

function parallelCosine(X_sub::Matrix, X::Matrix)
    results = similar(X, size(X_sub, 1), size(X, 1))
    @threads for j in axes(X, 1)
        @views results[:, j] .=
            [cosine_dist(X_sub[i, :], X[j, :])
                for i in axes(X_sub, 1)]
    end
    results
end

Results:

  2.739 s (2 allocations: 61.04 MiB)
  570.136 ms (20033 allocations: 124.51 MiB)

If you care about the performance of array-traversal code, it’s crucial to consider memory layout. In Julia, arrays are column-major by default, and for this row-wise code, switching to row-major arrays gives almost a 5-fold speedup for me. This is without parallelization:

X_sub_r = PermutedDimsArray(permutedims(X_sub), (2, 1))
X_r = PermutedDimsArray(permutedims(X), (2, 1))
sequenceCosine(X_sub, X)  # 3.6 s
sequenceCosine(X_sub_r, X_r)  # 0.72 s

Writing this function in terms of map can make it absolutely trivial to parallelize:

# sequential - has the same performance as sequenceCosine, 0.7 s for row-major arrays
function sequenceCosine3(X_sub, X)
    map(Iterators.product(eachrow(X_sub), eachrow(X))) do (i, j)
        cosine_dist(i, j)
    end
end

# parallel - 0.26 s for me (requires the ThreadsX package)
using ThreadsX

function parallelCosine3(X_sub, X)
    ThreadsX.map(Iterators.product(eachrow(X_sub), eachrow(X))) do (i, j)
        cosine_dist(i, j)
    end
end

Very nice solution! I put it into my test bench and got

  2.740 s (2 allocations: 61.04 MiB)
  590.554 ms (20033 allocations: 124.51 MiB)

for

@btime result = sequenceCosine($X_sub, $X);
@btime result = parallelCosine($X_sub, $X);

and

  616.712 ms (2 allocations: 61.04 MiB)
  186.552 ms (4185 allocations: 320.18 MiB)

for

X_sub_r = PermutedDimsArray(permutedims(X_sub), (2, 1))
X_r = PermutedDimsArray(permutedims(X), (2, 1))

@btime result = sequenceCosine($X_sub_r, $X_r);
@btime result = parallelCosine($X_sub_r, $X_r);

Hello @aplavin ,

I was reading your post and noticed that you mentioned that the row-major operation is more performant than the column-major one. Shouldn’t it be vice versa?

PermutedDimsArray with permutedims(X) gives the same result as the original matrix:

using Test
X = rand(10, 20);
X_r = PermutedDimsArray(permutedims(X), (2, 1));

@test X == X_r # gives true 

I could not catch the point. Could you please explain?

B.R.

I can’t believe I missed that you were working row-major; that’s of course an important difference.

This (PermutedDimsArray(permutedims(X), (2, 1))) is a somewhat obscure way of making X_r into a row-major array: permutedims transposes X, and PermutedDimsArray transposes it back again, except lazily. So X_r is now the same as X except it’s row-major, leading to a significant speedup.

But for you, @kadir-gunel, it would be better to organize your problem so that the matrices hold the interesting data along their columns in the first place. Then you can rewrite the function to work along columns, as sketched below.
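A sketch of what that could look like (parallelCosineCols is a hypothetical name, assuming the vectors of interest are stored as columns):

using Base.Threads
using Distances

# Hypothetical column-major variant: each vector occupies a contiguous
# column, so the innermost traversal follows Julia's memory layout.
function parallelCosineCols(Xt_sub::Matrix, Xt::Matrix)
    results = similar(Xt, size(Xt_sub, 2), size(Xt, 2))
    @threads for j in axes(Xt, 2)
        for i in axes(Xt_sub, 2)
            results[i, j] = @views cosine_dist(Xt_sub[:, i], Xt[:, j])
        end
    end
    return results
end

# Usage: transpose the original row-major data once, up front:
# parallelCosineCols(permutedims(X_sub), permutedims(X))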


And, in case this isn’t clear: your code works along rows, therefore working on a row-major array is more efficient than on one that is column-major.


Yes, my bad. Thank you for the clarification.