Shared-memory parallelization with large matrix

Sam · September 23, 2019, 2:46pm

Hi!

I am currently writing a program that looks something like this

Threads.@threads for i = 1:m 
    a[m] = f(args, matrix[:,:,m])
end

Here f is a function that runs some calculations, matrix is a large 3-D array (e.g. 3000x200xm), and args are some other inputs.

The problem I have is that I don’t obtain a huge improvement in performance when running this small program with many threads (with 20 threads I only get a 2x speed-up).

I think the problem is a shared-memory issue, since the threads running in parallel all have to read from matrix. Any ideas on how I can fix this problem?

My current idea is to reduce the number of allocations in f and use views instead of slice operations. But I am not sure if this is a good idea(?).

I also have a similar program that looks like

Threads.@threads for i = 1:m 
    a[m] = g(args)
end

And for this case, I get a good improvement when using many threads (10x speed-up with 20 threads).

Thus, the problem with the first program seems to be memory issues related to reading from matrix in parallel.

So, I guess my questions are: How can I efficiently read parts of a matrix in parallel without introducing overhead? Can I declare a part of a matrix as local to one particular thread?

Tamas_Papp · September 23, 2019, 2:53pm

Did you want to ask a question about your program, possibly posting more code?

Sam · September 23, 2019, 3:08pm

Unfortunately I managed to post the question before I had finished writing it… But now I have edited my original question.

DNF · September 23, 2019, 3:19pm

Have you tried using views instead of copying the matrix content?
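
A minimal sketch of what the view-based loop could look like. The real f was not posted, so the f below is only a hypothetical stand-in, and run_views and the small array sizes are likewise made up for illustration:

```julia
using Base.Threads: @threads

# Hypothetical stand-in for the real f, which was not posted in the thread.
f(args, slice) = args * sum(slice)

function run_views(args, matrix)
    M = size(matrix, 3)
    a = Vector{Float64}(undef, M)
    @threads for m in 1:M
        # view(...) refers to the existing data; matrix[:, :, m] would copy the slice.
        a[m] = f(args, view(matrix, :, :, m))
    end
    return a
end

matrix = rand(30, 20, 8)   # small sizes, for illustration only
a = run_views(2.0, matrix)
```

Each thread writes to its own a[m] and only reads from matrix, so no locking is needed in this sketch.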

dalejordan · September 23, 2019, 5:25pm

Do you mean
a[i] = f(args, matrix[:,:,i])
instead of
a[m] = f(args, matrix[:,:,m])
?

Sam · September 23, 2019, 6:28pm

Yes! The loop should be

Threads.@threads for m = 1:M 
   a[m] = f(args, matrix[:,:,m])
end

Thanks for finding this typo!

tkluck · September 23, 2019, 7:48pm

What kind of machine are you running this on? You’ll need cpu cores to run those threads.

Sam · September 23, 2019, 8:50pm

@tkluck I am running the program on two Intel Xeon E5-2650 v3 processors (which allow for running up to 20 tasks/threads in parallel).

cshenton · September 24, 2019, 12:22pm

Have you exported JULIA_NUM_THREADS=20 in your environment? Run Threads.nthreads() in your Julia session to check.

Also I would recommend setting the number of threads to the number of physical (not logical) cores for numerical workloads, otherwise you’ll have 2K threads contesting K floating point units.
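
A quick sketch of that check. thread_report is a hypothetical helper name; note that Sys.CPU_THREADS counts logical CPU threads, so the physical core count has to be looked up separately:

```julia
# Hypothetical helper: compare Julia's thread count with the logical CPU count.
function thread_report()
    return (julia_threads = Threads.nthreads(),   # set via JULIA_NUM_THREADS
            logical_cpus  = Sys.CPU_THREADS)      # logical, not physical, cores
end

r = thread_report()
println("Julia threads: ", r.julia_threads, ", logical CPU threads: ", r.logical_cpus)
```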

In addition, as another commenter mentioned, you should pass in a view, or better yet pass in the full array and an index range (since creating views allocates). For example:

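using Base.Threads: @threads

# Apply sin in place to one column of `array`. Indexing element by element
# avoids allocating a copy or a view of the column.
function work(array, column)
    for row in axes(array, 1)
        array[row, column] = sin(array[row, column])
    end
end

function main()
    x = rand(1000, 1000)

    # Each iteration handles one column, so no two threads write to the same element.
    @time @threads for i = 1:1000
        work(x, i)
    end
end

main()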
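
How much a slice or a view allocates depends on the Julia version, so it may be worth measuring on your own setup. A rough sketch using @allocated, where sum_slice and sum_view are hypothetical helpers made up for this comparison:

```julia
# Hypothetical helpers, for illustration only.
sum_slice(A, m) = sum(A[:, :, m])          # indexing with ranges copies the slice
sum_view(A, m)  = sum(view(A, :, :, m))    # a view does not copy the data

A = rand(300, 200, 4)

# Warm up first so that compilation is not included in the measurement.
sum_slice(A, 1)
sum_view(A, 1)

slice_bytes = @allocated sum_slice(A, 1)
view_bytes  = @allocated sum_view(A, 1)
println("slice: ", slice_bytes, " bytes, view: ", view_bytes, " bytes")
```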

Sam · September 24, 2019, 6:33pm

@cshenton Thanks for your suggestions!

Yes, I do use JULIA_NUM_THREADS=20 and the program is running with the correct number of threads.
