I am currently writing a program that looks something like this:
Threads.@threads for i = 1:m a[i] = f(args, matrix[:,:,i]) end
f is a function that runs some calculations, matrix is a large (e.g. 3000x200xm) array, and args are some other inputs.
The problem I have is that I don’t obtain a huge improvement in performance when running this small program with many threads (with 20 threads I only get a 2x speed-up).
I think the problem is a shared-memory issue, since the threads all have to read from matrix in parallel. Any ideas on how I can fix this?
My current idea is to reduce the number of allocations in f and to use views instead of slice operations, but I am not sure whether this is a good idea.
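To make the idea concrete, here is a minimal sketch of what I mean by using views. Note that f here is just a placeholder for my real function (the real one does much heavier calculations), and the array sizes are toy values:

```julia
using Base.Threads

# hypothetical stand-in for my real f; here it just scales the sum of
# the slice (the real f does much heavier calculations)
f(args, slice) = args * sum(slice)

function run_with_views!(a, args, matrix)
    m = size(matrix, 3)
    Threads.@threads for i = 1:m
        # @view avoids allocating a copy of each 2-D slice
        a[i] = f(args, @view matrix[:, :, i])
    end
    return a
end

# tiny usage example
mat = ones(4, 3, 5)
a = zeros(5)
run_with_views!(a, 2.0, mat)
```

The only change from my current program is that each iteration passes a view of the slice instead of allocating a fresh copy with matrix[:,:,i].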
I also have a similar program that looks like
Threads.@threads for i = 1:m a[i] = g(args) end
And for this case, I get a good improvement when using many threads (10x speed-up with 20 threads).
Thus, the problem with the first program seems to be memory issues related to reading from matrix in parallel.
So, I guess my questions are: How can I efficiently read parts of a matrix in parallel without introducing overhead? Can I declare a part of a matrix as local to one particular thread?
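By "local to one particular thread" I mean something like the following sketch, where each thread copies the slice it needs into its own pre-allocated buffer before calling f. Again, f is just a placeholder for my real function, and the :static schedule is my assumption to make threadid()-indexed buffers safe:

```julia
using Base.Threads

# hypothetical stand-in for my real f
f(args, slice) = args * sum(slice)

function run_with_local_copies!(a, args, matrix)
    m = size(matrix, 3)
    # one pre-allocated 2-D buffer per thread
    bufs = [Matrix{eltype(matrix)}(undef, size(matrix, 1), size(matrix, 2))
            for _ in 1:Threads.nthreads()]
    # :static pins iterations to threads, so indexing by threadid() is safe
    Threads.@threads :static for i = 1:m
        buf = bufs[Threads.threadid()]
        # each thread reads the slice once into its own local copy
        copyto!(buf, view(matrix, :, :, i))
        a[i] = f(args, buf)
    end
    return a
end

# tiny usage example
mat = ones(4, 3, 5)
a = zeros(5)
run_with_local_copies!(a, 2.0, mat)
```

I don't know whether this per-thread-buffer pattern actually helps with the shared reads from matrix, or whether it just adds copying overhead, so any advice on this would be appreciated.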