# Preallocation and data race in parallelization

Hi all,

I wonder what is the best way to deal with temporary arrays in function calls within a parallelized calculation.

The (very reduced) structure of my code is something like

```julia
using StaticArrays

function fun1()
    p = SA[1, 2, 3, 4]
    q = SA[10, 20, 30, 40]
    temp_arr = zeros(4, 4)  # allocated on every call
    for i = 1:3
        for j = (i+1):4
            temp_arr[i, j] = 2*p[i]*p[j] + 2*q[i]*q[j]
        end
    end
    return temp_arr
end

# large for loop:
for i = 1:1000
    fun1()
end
```

In order to remove any unnecessary in-time allocation, I preallocate the temporary array in the following way:

```julia
temp_arr = zeros(4, 4)

function fun2(temp_arr)
    p = SA[1, 2, 3, 4]
    q = SA[10, 20, 30, 40]
    for i = 1:3
        for j = (i+1):4
            temp_arr[i, j] = 2*p[i]*p[j] + 2*q[i]*q[j]
        end
    end
    return temp_arr
end

# large for loop:
for i = 1:1000
    fun2(temp_arr)
end
```

This worked as intended, as I could remove all allocations from my function calls.

Now, in an attempt to speed up the large for loop via parallelization (`@threads for ...`), I started to run into nonsense results (not an "error"). I assume this is because of a data race: the function calls share the same preallocated `temp_arr`, which is overwritten and therefore changed each time.
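For readers wondering what such a race looks like, here is a minimal sketch (hypothetical, not the original code): all iterations update one shared value with a non-atomic read-modify-write, so concurrent tasks can lose each other's updates in the same way concurrent writes to a shared buffer interleave.

```julia
using Base.Threads

# Shared mutable state, analogous to the shared temp_arr.
total = Ref(0)

@threads for i in 1:100_000
    total[] += 1  # read, add, write: another task may interleave in between
end

# With more than one thread, total[] typically ends up below 100_000
# because some increments are lost.
println(total[])
```

On a single thread this prints exactly 100_000; with several threads the result is usually smaller and varies between runs, which is the same "nonsense results without an error" symptom.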

For the above example I managed to come up with a simple solution

```julia
function fun3()
    p = SA[1, 2, 3, 4]
    q = SA[10, 20, 30, 40]
    temp_arr3 = 2*p .* transpose(p) .+ 2*q .* transpose(q)
end
```

which is automatically an `SMatrix`, so (to my understanding) the compiler knows its size and no runtime allocations are needed. However, in my actual code there are several such temporary arrays in use, and I cannot always come up with such a simple expression as above. Also, the other temporary arrays might be larger (i.e., have more dimensions) and therefore violate the usual rule of thumb to use `SArray`s only up to around 50 elements.

In that sense, I would like to learn how to manage such temporary arrays efficiently.


Yes, you have a data race. If you need temporary arrays in your parallel code, you can make them task-local. I recently wrote a little bit about it for OhMyThreads.jl.
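As a point of reference, the task-local idea can also be sketched with Base's `task_local_storage` (no extra package needed); the helper function and key names here are illustrative, not from OhMyThreads.jl:

```julia
using Base.Threads

# Hypothetical stand-in for the real workload that writes into a buffer.
function fill_upper!(temp_arr)
    for j in 2:4, i in 1:j-1
        temp_arr[i, j] = i * j
    end
    return temp_arr
end

@threads for i in 1:1000
    # get! lazily creates one buffer per task in that task's local storage,
    # so no two tasks ever share the same matrix.
    temp_arr = get!(() -> zeros(4, 4), task_local_storage(), :temp_arr)::Matrix{Float64}
    fill_upper!(temp_arr)
end
```

Within one task, repeated lookups return the same buffer, so there is no per-iteration allocation after the first access.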

Alternatively, you could use a `Channel` and a producer-consumer pattern to pass temporary buffers to parallel tasks. (I'm in the process of expanding the discussion linked above to include this but can't say when it will go live.) One possible implementation of the latter could be something like this:

```julia
using Base.Threads: @spawn, nthreads

# provides the temporary arrays in a thread-safe manner
provider = Channel{Matrix{Float64}}(nthreads())
for _ in 1:nthreads()
    put!(provider, zeros(4, 4))  # one buffer per thread is enough here
end

@sync for i in 1:1000
    @spawn begin
        temp_arr = take!(provider)
        fun2(temp_arr)
        put!(provider, temp_arr)  # hand the buffer back for reuse
    end
end
```

You could also think about flipping things around like so:

```julia
using Base.Threads: @spawn, nthreads

provider = Channel() do provider  # provides the data in a thread-safe manner
    for i in 1:1000
        put!(provider, i)
    end
end

@sync for _ in 1:nthreads()
    @spawn begin
        temp_arr = zeros(4, 4)  # each consumer task owns its own buffer
        for x in provider
            fun2(temp_arr)
        end
    end
end
```

These ideas could be paired with chunking to reduce the number of channel queries (you would get a version of this with the former example if you used `@threads` because it internally chunks and creates "only" `nthreads()` tasks).
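A rough sketch of the chunking idea (illustrative only, using `Iterators.partition` from Base rather than a chunking package): split the range into `nthreads()` pieces and give each chunk-task a single buffer, so the buffer is created once per chunk instead of queried once per iteration.

```julia
using Base.Threads

function fill_upper!(temp_arr)  # hypothetical stand-in for fun2
    for j in 2:4, i in 1:j-1
        temp_arr[i, j] = i * j
    end
end

n = nthreads()
@sync for chunk in Iterators.partition(1:1000, cld(1000, n))
    @spawn begin
        temp_arr = zeros(4, 4)  # one buffer per chunk, not per iteration
        for i in chunk
            fill_upper!(temp_arr)
        end
    end
end
```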

Side note: In case you aren't aware of it, Julia is column-major order, so the order of your loops isn't "natural". It doesn't matter much in this specific case because you're not performing many iterations, but this can matter a lot in general.
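To spell out the column-major point with a small sketch: the inner loop should run over the *first* index, so consecutive iterations touch adjacent memory.

```julia
A = zeros(1000, 1000)

# Column-major friendly: j (column) in the outer loop, i (row) in the
# inner loop, so the inner loop walks through contiguous memory.
for j in axes(A, 2)
    for i in axes(A, 1)
        A[i, j] = i + j
    end
end
```

Swapping the two loops computes the same result but strides through memory with a 1000-element gap per inner iteration, which hurts the cache on large arrays.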


So, if I understand correctly, for my example I would use it the following way?

```julia
using OhMyThreads: TaskLocalValue
using Base.Threads: @threads

function fun4(temp_arr)
    p = SA[1, 2, 3, 4]
    q = SA[10, 20, 30, 40]
    for i = 1:3
        for j = (i+1):4
            temp_arr[i, j] = 2*p[i]*p[j] + 2*q[i]*q[j]
        end
    end
end

tls = TaskLocalValue{Matrix{Float64}}(() -> Matrix{Float64}(undef, 4, 4))

# large for loop:
@threads for i = 1:1000
    temp_arr = tls[]
    fun4(temp_arr)
end
```

Or does it not work for the "simple" `@threads`?

I appreciate your other suggestions, but they seem a bit more delicate in managing the threads. I probably need more time to study them in detail (and I have to admit that when it comes to coding I often go with the most straightforward solution I can find…)

Also, on a side note, I realized that maybe there is actually a way to rewrite the other temporary arrays in my function in terms of StaticArrays that are not too large, but I will ask for help in that regard in a separate topic. Then, I should hopefully not need any extra care with task-local arrays.

In principle I am aware of it, but still, thank you for reminding me. Ideally, I should get to the point where I write my loops in column-major order by default. However, I have a question in that regard: You mention that for this case it does not matter due to the low number of iterations (I assume you have the double loop over `i` and `j` in mind). Will it become a problem if this function is called many times, e.g. in the larger outer loop? In other words, will the penalty due to the wrong loop order add up if called many times?

It does, and your usage looks correct (you probably want `zeros(4,4)` instead of `Matrix{Float64}(undef, 4, 4)` though, because your `fun4` doesn't overwrite every element of `temp_arr`).
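The difference matters because `undef` leaves the memory uninitialized: any entry the function never writes (here, the diagonal and lower triangle) keeps arbitrary garbage, while `zeros` defines every entry. A small sketch:

```julia
A = Matrix{Float64}(undef, 4, 4)  # contents are arbitrary until written
B = zeros(4, 4)                   # every entry starts as a well-defined 0.0

# Writing only the strict upper triangle (as fun4 does) leaves the rest
# of A undefined, while the rest of B is still 0.0.
for j in 2:4, i in 1:j-1
    A[i, j] = 1.0
    B[i, j] = 1.0
end
```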

(If your loop iterations have different computational cost (non-uniform workload), you won't get load balancing with `@threads`, though.)

Note that if you only want `TaskLocalValue` you could also just add TaskLocalValue.jl as a dependency rather than the full OhMyThreads.jl.

Generally speaking, yes. Your matrix is very small, though: assuming a cache line size of 64 bytes (8 Float64 numbers), your matrix (4*4 = 16 elements) presumably takes up only two cache lines. The (negative) potential for cache misses is thus limited. But you should generally try to get used to writing loops in the correct order and not rely on "luck" to save you.

Also, perhaps stating the obvious, but you should wrap your threaded loop into a function. In general, try to avoid global scope when you care about performance.

Alright, I will then try to implement it in the actual code that I'm working on. Hopefully I can report back soon.

I don't know how uniform the load balancing is in my case, but I think for now I will assume that it is uniform and see the result. My guess is that there is some variation, but (hopefully) not a massive one. Also, I'm not sure how I would determine the load balancing. Maybe if the overall speedup is far away from a factor of `n = Threads.nthreads()`?

The link to TaskLocalValue.jl is not working for me, I get an error message:

Got a 404 when trying to retrieve package information.

Either this package or version doesn't exist or we're still in the process of generating this page.

Indeed, I need to get to the point where the column-major ordering becomes natural. And yes, the threaded loop is within a function, I just extracted it for providing a MWE.

It should be TaskLocalValues.jl (note the final "s").


No, there can be many reasons for imperfect acceleration, even fundamental ones like Amdahl's law. Moreover, in principle, you can get ideal speedup with load balancing as well (assuming perfect conditions).

Apart from thinking about your computation, you could time different iterations (and maybe create a histogram) or - with OhMyThreads.jl - could switch between `DynamicScheduler(; nchunks=nthreads())` and, for example, `DynamicScheduler(; nchunks=4*nthreads())` to simply test if you observe a speedup.

Alright, I tried the implementation with `TaskLocalValues`, and indeed, there is no more data race. Thank you so far. As this was my original question, I will mark your reply from above as the answer.

As a side note: it was also possible to use the same `TaskLocalValue` for several temporary arrays:

```julia
tls = TaskLocalValue{Matrix{Float64}}(() -> zeros(4, 4))
@sync @threads for i = 1:1000
    temp_arr1 = tls[]
    temp_arr2 = tls[]
    fun(temp_arr1, temp_arr2)
end
```

I have, however, a follow-up question regarding your latest remark, as the current speedup of about 2x is a bit disappointing for 8 threads.

I failed to see how I should apply the `DynamicScheduler(; nchunks=4*nthreads())` in my specific case. Must I provide this information to the `for`-loop?

No, in general, this is unsafe because `temp_arr1` and `temp_arr2` are the same matrix:

```julia
julia> tls = TaskLocalValue{Matrix{Float64}}(() -> zeros(4,4));

julia> a = tls[];

julia> b = tls[];

julia> a === b
true
```

What did you expect to get, and why? Assuming that you can get ideal speedup (8x for 8 threads) is often overly optimistic; see e.g. Amdahl's law.
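For intuition, Amdahl's law bounds the speedup by the serial fraction `s` of the work; even a modest serial part caps the gain on 8 threads. The fractions below are illustrative numbers, not measurements of the code in this thread:

```julia
# Amdahl's law: speedup(n) = 1 / (s + (1 - s)/n) for serial fraction s
# and n parallel workers.
amdahl(s, n) = 1 / (s + (1 - s) / n)

amdahl(0.10, 8)  # ≈ 4.7: 10% serial work already limits 8 threads to <5x
amdahl(0.01, 8)  # ≈ 7.5: nearly all-parallel work gets close to 8x
```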

Well, with `@threads` it's (currently) not possible to get load balancing. My comment is related to OhMyThreads.jl, in which case you would replace the `@threads for ... end` (btw, you don't need `@sync`) with

```julia
tforeach(1:1000; scheduler=DynamicScheduler(; nchunks=4*nthreads())) do i
    # loop body goes here
end
```

Note that the API is still a little bit in flux, because OMT is a very young package.

Interesting, then somehow in my specific case (not the MWE above) this problem does not occur. I will however change my code accordingly to be safe.

Well, I was timing only the bare `for` loop and not my whole code, so yes, I thought I would get at least somewhat close to it, maybe 6x or 7x. In any case, I have tested again and could not reproduce the 2x speedup from before; now it's rather 4x. I didn't document the previous runs properly (it was merely a quick test), so maybe I did something wrong there. Still not perfect, as I was hoping for better scaling, especially for running the code on our local workstation, but 4x is an improvement nonetheless. Also, I found more bottlenecks in my code, so I guess I will leave it as is for now.

Thanks, I will try that when I come back to further optimize!