I am not completely sure if the results are the same, you should take a look at that first (they are certainly not the same, but I don’t know if the differences are important or if the reason if something not related to the benchmark). What I did is to separate the inner loop in its three levels and used LoopVectorization
(@avx
macro) in the inner loop. That seems to accelerate ~7x the execution (I removed the @views
, @inbounds
and @fastmath
flags as well):
With your code (second execution):
julia> @time f1(zeros(Int8,J,60),ones(Int8,J,60),ones(Int8,J,60),params,Mat1,Mat2,vec2,vec1,vec3,Mat3)
0.207526 seconds (10 allocations: 220.578 KiB)
(Float32[0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0; … ; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0], Float32[12179.851 12341.136 … 12276.678 12244.565; 12507.148 12552.595 … 12462.509 12452.235; … ; 12946.373 12772.485 … 12852.6045 12584.991; 12812.645 12605.404 … 12843.521 12823.51], Float32[12179.851 12341.136 … 12276.678 12244.565; 12507.148 12552.595 … 12462.509 12452.235; … ; 12946.373 12772.485 … 12852.6045 12584.991; 12812.645 12605.404 … 12843.521 12823.51])
With this code (second run in all cases):
Code
using Random
J = 250
T = 100
Mat1 = rand(J,J)
vec1 = rand(J)
Mat2= rand(J,J)
vec2 = rand(J)
Mat3 = rand(J,J,T)
vec3 = rand(J,T)
using LoopVectorization
function f1(X,Y,Z,params,Mat1,Mat2,vec2,vec1,vec3,Mat3)
(J,T) = size(X)
(γoᴱ,γgᴱ,ψgᴱ,κgᴱ,γlᴱ,ψlᴱ,κlᴱ,γaᴱ,ψaᴱ,κaᴱ)=params
eⱼX = zeros(Float32,J,T)
eⱼY = zeros(Float32,J,T)
eⱼZ = zeros(Float32,J,T)
for t=1:T
for l=1:J
@avx for j=1:J
ejl = γoᴱ+γgᴱ*(1+ψgᴱ*vec1[j])*exp(Mat1[j,l]*(-κgᴱ))+γlᴱ*(1+ψlᴱ*vec2[j])*exp(Mat2[j,l]*(-κlᴱ))+γaᴱ*(1+ψaᴱ*vec3[j,t])*exp(Mat3[j,l,t]*(-κaᴱ))
elj = γoᴱ+γgᴱ*(1+ψgᴱ*vec1[l])*exp(Mat1[j,l]*(-κgᴱ))+γlᴱ*(1+ψlᴱ*vec2[l])*exp(Mat2[j,l]*(-κlᴱ))+γaᴱ*(1+ψaᴱ*vec3[l,t])*exp(Mat3[j,l,t]*(-κaᴱ))
eⱼX[j,t] =eⱼX[j,t] + X[l,t]*(ejl+elj)
eⱼY[j,t] =eⱼY[j,t] + Y[l,t]*(ejl+elj)
eⱼZ[j,t] =eⱼZ[j,t] + Z[l,t]*(ejl+elj)
end
end
end
return eⱼX, eⱼY, eⱼZ
end
params = [0.0,
22.504883,
0.00010256348,
0.0021472168,
28.549805,
1.4460449,
23.164062,
29.42871,
1.4924316,
79.72656]
@time f1(zeros(Int8,J,60),ones(Int8,J,60),ones(Int8,J,60),params,Mat1,Mat2,vec2,vec1,vec3,Mat3)
@time f1(zeros(Int8,J,60),ones(Int8,J,60),ones(Int8,J,60),params,Mat1,Mat2,vec2,vec1,vec3,Mat3)
julia> @time f1(zeros(Int8,J,60),ones(Int8,J,60),ones(Int8,J,60),params,Mat1,Mat2,vec2,vec1,vec3,Mat3)
0.028289 seconds (10 allocations: 220.578 KiB)
(Float32[0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0; … ; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0], Float32[12634.557 12446.863 … 12334.467 121
62.421; 12453.994 12371.172 … 12615.109 12296.466; … ; 12301.93 12348.444 … 12477.65 12510.166; 12712.252 12737.326 … 12705.662 12569
.258], Float32[12634.557 12446.863 … 12334.467 12162.421; 12453.994 12371.172 … 12615.109 12296.466; … ; 12301.93 12348.444 … 12477.6
5 12510.166; 12712.252 12737.326 … 12705.662 12569.258])
By the way, since you are using Float32
arrays inside, if you initialize everything as Float32
you get an additional ~2x speedup:
0.012065 seconds (10 allocations: 220.578 KiB)
(Float32[0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0; … ; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0], Float32[12742.045 12706.508 … 12620.173 12605.99; 12646.944 12537.022 … 12411.814 12757.828; … ; 12568.096 12390.755 … 12645.672 12821.599; 13024.046 13046.444 … 12973.952 13356.733], Float32[12742.045 12706.508 … 12620.173 12605.99; 12646.944 12537.022 … 12411.814 12757.828; … ; 12568.096 12390.755 … 12645.672 12821.599; 13024.046 13046.444 … 12973.952 13356.733])
Code
using Random
J = 250
T = 100
Mat1 = rand(Float32,J,J)
vec1 = rand(Float32,J)
Mat2= rand(Float32,J,J)
vec2 = rand(Float32,J)
Mat3 = rand(Float32,J,J,T)
vec3 = rand(Float32,J,T)
using LoopVectorization
function f1(X,Y,Z,params,Mat1,Mat2,vec2,vec1,vec3,Mat3)
(J,T) = size(X)
(γoᴱ,γgᴱ,ψgᴱ,κgᴱ,γlᴱ,ψlᴱ,κlᴱ,γaᴱ,ψaᴱ,κaᴱ)=params
eⱼX = zeros(Float32,J,T)
eⱼY = zeros(Float32,J,T)
eⱼZ = zeros(Float32,J,T)
for t=1:T
for l=1:J
@avx for j=1:J
ejl = γoᴱ+γgᴱ*(1+ψgᴱ*vec1[j])*exp(Mat1[j,l]*(-κgᴱ))+γlᴱ*(1+ψlᴱ*vec2[j])*exp(Mat2[j,l]*(-κlᴱ))+γaᴱ*(1+ψaᴱ*vec3[j,t])*exp(Mat3[j,l,t]*(-κaᴱ))
elj = γoᴱ+γgᴱ*(1+ψgᴱ*vec1[l])*exp(Mat1[j,l]*(-κgᴱ))+γlᴱ*(1+ψlᴱ*vec2[l])*exp(Mat2[j,l]*(-κlᴱ))+γaᴱ*(1+ψaᴱ*vec3[l,t])*exp(Mat3[j,l,t]*(-κaᴱ))
eⱼX[j,t] =eⱼX[j,t] + X[l,t]*(ejl+elj)
eⱼY[j,t] =eⱼY[j,t] + Y[l,t]*(ejl+elj)
eⱼZ[j,t] =eⱼZ[j,t] + Z[l,t]*(ejl+elj)
end
end
end
return eⱼX, eⱼY, eⱼZ
end
params = Float32[0.0,
22.504883,
0.00010256348,
0.0021472168,
28.549805,
1.4460449,
23.164062,
29.42871,
1.4924316,
79.72656]
@time f1(zeros(Int8,J,60),ones(Int8,J,60),ones(Int8,J,60),params,Mat1,Mat2,vec2,vec1,vec3,Mat3)
@time f1(zeros(Int8,J,60),ones(Int8,J,60),ones(Int8,J,60),params,Mat1,Mat2,vec2,vec1,vec3,Mat3)