# Shared-memory parallelization with large matrix

Hi!

I am currently writing a program that looks something like this

```julia
Threads.@threads for i = 1:m
    a[m] = f(args, matrix[:,:,m])
end
```

Here `f` is a function that runs some calculations, `matrix` is a large three-dimensional array (e.g. 3000x200x`m`), and `args` are some other inputs.

The problem I have is that I don’t obtain a huge improvement in performance when running this small program with many threads (with 20 threads I only get a 2x speed-up).

I think the problem is a shared-memory issue, since all the threads have to read from `matrix` in parallel. Any ideas on how I can fix this problem?

My current idea is to reduce the number of allocations in `f` and to use `view`s instead of slice operations, but I am not sure whether this is a good idea.
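To make the slice-versus-view difference concrete, here is a minimal sketch that measures the bytes allocated by each. The helper names `slice_sum` and `view_sum` are hypothetical stand-ins (the real `f` is not shown in this thread), and the array is much smaller than the one described above so that it runs quickly.

```julia
# Hypothetical stand-ins for `f`: both reduce one 2-D slab to a number.
slice_sum(A, m) = sum(A[:, :, m])          # `A[:, :, m]` copies the slab first
view_sum(A, m)  = sum(view(A, :, :, m))    # `view` refers to the existing memory

A = rand(300, 200, 4)

# Warm up so that compilation is not counted in the measurement.
slice_sum(A, 1)
view_sum(A, 1)

bytes_slice = @allocated slice_sum(A, 1)
bytes_view  = @allocated view_sum(A, 1)

println("slice: ", bytes_slice, " bytes, view: ", bytes_view, " bytes")
```

If the slice version allocates roughly the size of one slab per call, then every iteration of the threaded loop is copying that much memory before `f` even starts, which would be consistent with the poor scaling.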

I also have a similar program that looks like

```julia
Threads.@threads for i = 1:m
    a[m] = g(args)
end
```

And for this case, I get a good improvement when using many threads (10x speed-up with 20 threads).

Thus, the problem with the first program seems to be memory issues related to reading from `matrix` in parallel.

So, I guess my questions are: How can I efficiently read parts of a matrix in parallel without introducing overhead? Can I declare a part of a matrix as local to one particular thread?

Did you want to ask a question about your program, possibly posting more code?

Unfortunately, I managed to post the question before I had finished writing it… I have now edited my original question.

Have you tried using `view`s instead of copying the matrix content?
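As a rough sketch of what that could look like for the loop in the question: the function `f` below is a hypothetical placeholder (the real one has not been posted), and the array sizes are shrunk so the example runs quickly.

```julia
using Base.Threads: @threads

# Hypothetical placeholder for the poster's `f`.
f(args, slab) = args * sum(slab)

function run_with_views(args, matrix)
    M = size(matrix, 3)
    a = Vector{Float64}(undef, M)
    @threads for m in 1:M
        # `view` wraps the existing memory; `matrix[:, :, m]` would copy it.
        a[m] = f(args, view(matrix, :, :, m))
    end
    return a
end

matrix = rand(30, 20, 8)   # small sizes, only for illustration
a = run_with_views(2.0, matrix)
```

Each iteration writes to its own `a[m]` and only reads from `matrix`, so the threads do not need any locking here.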

Do you mean
`a[i] = f(args, matrix[:,:,i])`
instead of
`a[m] = f(args, matrix[:,:,m])`
?

Yes! The loop should be

```julia
Threads.@threads for m = 1:M
    a[m] = f(args, matrix[:,:,m])
end
```

Thanks for finding this typo!

What kind of machine are you running this on? You'll need enough CPU cores to run those threads.

@tkluck I am running the program on two Intel Xeon E5-2650 v3 processors (which allow for running up to 20 tasks/threads in parallel).

Have you set `export JULIA_NUM_THREADS=20` in your environment? Run `Threads.nthreads()` in the REPL to check.

Also, I would recommend setting the number of threads to the number of physical (not logical) cores for numerical workloads; otherwise you'll have 2K threads contending for K floating-point units.
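A quick way to compare the two numbers from inside Julia might look like the sketch below. Note that `Sys.CPU_THREADS` reports logical CPUs, so on a machine with hyper-threading it can be twice the physical core count; the physical count itself has to come from the hardware specification or an OS tool.

```julia
using Base.Threads: nthreads

n_julia   = nthreads()        # threads this Julia session was started with
n_logical = Sys.CPU_THREADS   # logical CPUs visible to Julia (includes hyper-threads)

println("Julia threads: ", n_julia)
println("Logical CPUs:  ", n_logical)

if n_julia > n_logical
    @warn "More Julia threads than logical CPUs; threads will compete for cores."
end
```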

In addition, as another commenter mentioned, you should pass in a view, or better yet pass in the full array and an index (since creating views allocates). For example:

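```julia
using Base.Threads: @threads

# Apply `sin` in place to one column of `array`.
# Indexing element by element avoids creating a slice or a view.
function work!(array, column)
    for row in axes(array, 1)
        array[row, column] = sin(array[row, column])
    end
    return nothing
end

function main()
    x = rand(1000, 1000)

    # Each thread handles a disjoint set of columns, so no locking is needed.
    @time @threads for i in 1:size(x, 2)
        work!(x, i)
    end
end

main()
```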

@cshenton Thanks for your suggestions!

Yes, I do use `JULIA_NUM_THREADS=20`, and the program is running with the correct number of threads.