I am currently writing a program that looks something like this
Threads.@threads for i = 1:m
    a[i] = f(args, matrix[:,:,i])
end
Here f is a function that runs some calculations, matrix is a large 3-D array (e.g. 3000x200xm), and args are some other inputs.
The problem I have is that I don’t obtain a huge improvement in performance when running this small program with many threads (with 20 threads I only get a 2x speed-up).
I think the problem is a shared-memory issue, since the threads running in parallel all have to read from matrix. Any ideas on how I can fix this?
My current idea is to reduce the number of allocations in f and to use views instead of slice operations, but I am not sure whether this is a good idea.
I also have a similar program that looks like
Threads.@threads for i = 1:m
    a[i] = g(args)
end
And for this case, I get a good improvement when using many threads (10x speed-up with 20 threads).
Thus, the problem with the first program seems to be a memory issue related to reading from matrix in parallel.
So, I guess my questions are: How can I efficiently read parts of a matrix in parallel without introducing overhead? Can I declare a part of a matrix as local to one particular thread?
Have you exported JULIA_NUM_THREADS=20 in your environment? Run Threads.nthreads() in the REPL to check.
Also, I would recommend setting the number of threads to the number of physical (not logical) cores for numerical workloads; otherwise you'll have 2K threads contending for K floating-point units.
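As a quick sanity check, something like the following can be pasted into the REPL. It is only a sketch: Sys.CPU_THREADS reports logical cores, so the "half of that" estimate for physical cores is an assumption that holds only on machines with two hardware threads per core.

```julia
using Base.Threads: nthreads

# Number of threads this Julia session was started with
n_julia = nthreads()

# Logical CPU threads visible to Julia (not physical cores)
n_logical = Sys.CPU_THREADS

println("Julia threads:       ", n_julia)
println("Logical CPU threads: ", n_logical)
# Assumption: 2-way hyper-threading, so physical cores are roughly n_logical / 2
println("Rough physical-core estimate: ", max(1, n_logical ÷ 2))
```

If the first number is 1, the environment variable was not picked up and the loop is effectively running serially.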
In addition, as another commenter mentioned, you should pass in a view, or better yet pass in the full array and an index range (since creating views can allocate). For example:
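using Base.Threads: @threads

# Apply sin in place to one column of the array. Indexing element by
# element avoids allocating a slice or a view.
function work(array, column)
    for row in axes(array, 1)
        array[row, column] = sin(array[row, column])
    end
    return nothing
end

function main()
    x = rand(1000, 1000)
    # Each iteration touches a different column, so no two threads write to the same element.
    @time @threads for i in 1:1000
        work(x, i)
    end
end

main()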
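Applied to the 3-D case from the question, the view variant might look roughly like the sketch below. The body of f here is a made-up placeholder (it just scales the sum of the slice), run_threaded is a hypothetical name, and the array sizes are shrunk so it runs quickly; only the pattern of passing view(matrix, :, :, i) instead of matrix[:,:,i] is the point.

```julia
using Base.Threads: @threads

# Hypothetical stand-in for the real f: scales the sum of a 2-D slice.
f(args, slice) = args * sum(slice)

function run_threaded(args, matrix)
    m = size(matrix, 3)
    a = Vector{Float64}(undef, m)
    @threads for i in 1:m
        # view references the i-th 2-D slice without copying it,
        # whereas matrix[:, :, i] would allocate a new array per iteration
        a[i] = f(args, view(matrix, :, :, i))
    end
    return a
end

matrix = rand(300, 20, 8)   # small stand-in for the 3000x200xm array
a = run_threaded(2.0, matrix)
```

Since Julia arrays are column-major, each matrix[:, :, i] slice is contiguous in memory, so slicing along the last dimension as done here is the cache-friendly choice.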