Shared-memory parallelization with large matrix

Hi!

I am currently writing a program that looks something like this

Threads.@threads for i = 1:m 
    a[m] = f(args, matrix[:,:,m])
end 

Where f is a function that runs some calculations, matrix is a large (e.g. 3000x200xm) matrix, and args some other inputs.

The problem I have is that I don’t obtain a huge improvement in performance when running this small program with many threads (with 20 threads I only get a 2x speed-up).

I think the problem is some shared-memory issues since the threads in parallel have to read from matrix. Any ideas on how I can fix this problem?

My current idea is to reduce the number of allocations in f and use view’s instead of slice operations. But I am not sure if this is a good idea(?).

I also have a similar program that looks like

Threads.@threads for i = 1:m 
    a[m] = g(args)
end 

And for this case, I get a good improvement when using many threads (10x speed-up with 20 threads).

Thus, the problem with the first program seems to be memory issues related to reading from matrix in parallel.

So, I guess my questions are: How can I efficiently read parts of a matrix in parallel without introducing over-head issue? Can I declare a part of a matrix as local to one particular thread?

did you want to ask a question about your program, possibly posting more code?

Unfortunately managed to post the question before finishing writing it… But now I have edited my original question :slight_smile:

Have you tried using views instead of copying the matrix content?

Do you mean
a[i] = f(args, matrix[:,:,i])
instead of
a[m] = f(args, matrix[:,:,m])
?

Yes! The loop should be

Threads.@threads for m = 1:M 
   a[m] = f(args, matrix[:,:,m])
end 

Thanks for finding this typo!

What kind of machine are you running this on? You’ll need cpu cores to run those threads.

@tkluck I am running the program on a two Intel Xeon E5-2650 v3 processors (which allow for running up to 20 tasks/threads in parallel).

Have you export JULIA_NUM_THREADs=20 in your environment? Run Threads.nthreads() in your interpreter to check.

Also I would recommend setting the number of threads to the number of physical (not logical) cores for numerical workloads, otherwise you’ll have 2K threads contesting K floating point units.

In addition as another commenter mentioned, you should pass in a view, or better yet pass in the full array and an index range (since creating views allocates). For example.

using Base.Threads: @threads

function work(array, column)
    array[column] = sin.(array[column])
end

function main()
    x = rand(1000, 1000)
    
    @time @threads for i=1:1000
        work(x, i)
    end 
end

main()

@cshenton Thanks for your suggestions!

Yes, I do use JULIA_NUM_THREADs=20 and the program is running with the correct number of threads.