Hello everybody,
I have a performance question which I hope has not been asked to death before. I will provide a minimal working example below.
First, let me give some context. Code of the type shown below gets called several tens of millions of times in the actual computation. The array work[] always has the fixed size n x m = 3 x 11, at least for the current application, and is never resized. In the actual computation, the total runtime (@time on the calling function) is 20 s for the slowest variant discussed below and about 15 s for the fastest one.
I know about StaticArrays, but I would like to stick to the stdlib for the moment, and the behaviour that surprised me occurs with the standard dynamic arrays (I did not test StaticArrays). I am using Julia 0.7.0 beta2.
So, now the example:
using BenchmarkTools
@inline function setup_k_no_if(k::Int, xalt, work, coef, factor=1.0)
    work[:, k] .= 0.0
    for jj = 1:size(coef, 1)
        for ii = 1:size(work, 1)
            @inbounds work[ii, k] = work[ii, k] + coef[jj] * work[ii, jj]
        end
    end
    for ii = 1:size(work, 1)
        @inbounds work[ii, k] = work[ii, k] / factor + xalt[ii]
    end
    nothing
end
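# Note: unlike the other two variants, this one has no final loop that
# divides by factor and adds xalt, so xalt and factor are unused here.
@inline function setup_k_with_if(k::Int, xalt, work, coef, factor=1.0)
    work[:, k] .= 0.0
    if size(work, 1) == 3
        for jj = 1:size(coef, 1)
            for ii = 1:3
                @inbounds work[ii, k] = work[ii, k] + coef[jj] * work[ii, jj]
            end
        end
    else
        for jj = 1:size(coef, 1)
            for ii = 1:size(work, 1)
                @inbounds work[ii, k] = work[ii, k] + coef[jj] * work[ii, jj]
            end
        end
    end
    nothing
end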
@inline function setup_k_fixed_size(k::Int, xalt, work, coef, factor=1.0)
    work[:, k] .= 0.0
    for jj = 1:size(coef, 1)
        for ii = 1:3
            @inbounds work[ii, k] = work[ii, k] + coef[jj] * work[ii, jj]
        end
    end
    for ii = 1:3
        @inbounds work[ii, k] = work[ii, k] / factor + xalt[ii]
    end
    nothing
end
work = zeros(3, 11)
work[:, 1] = [2.0, 1.0, 4.0]
xalt = [1.0, 2.0, 3.0]
s21 = sqrt(21.0)
coef11 = (0.0, 0.0, 0.0, 0.0, (-42.0 + 7.0 * s21) * 20.0, (-18.0 + 28.0 * s21) * 8.0,
          (-273.0 - 53.0 * s21) * 5.0, (301.0 + 53.0 * s21) * 5.0, (28.0 - 28.0 * s21) * 8.0,
          (49.0 - 7.0 * s21) * 20.0)
@btime setup_k_no_if(11, xalt, work, coef11, 17640.0)
@btime setup_k_with_if(11, xalt, work, coef11, 17640.0)
@btime setup_k_fixed_size(11, xalt, work, coef11, 17640.0)
Running this gives:
setup_k_no_if: 50.825 ns (0 allocations: 0 bytes)
setup_k_with_if: 35.021 ns (0 allocations: 0 bytes)
setup_k_fixed_size: 38.538 ns (0 allocations: 0 bytes)
In the real-world code, the setup_k_no_if() variant takes 20 s, setup_k_with_if() takes 15 s, and setup_k_fixed_size() takes perhaps 14 s.
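One caveat when reading the micro-benchmark numbers: xalt, work and coef11 are non-constant globals, and the BenchmarkTools documentation recommends interpolating such arguments with `$` so that the global lookup is not part of the measurement. A minimal sketch of how that could look, repeating only the first variant so the snippet runs on its own (with length(coef) in place of size(coef,1)). No timings are shown because they have not been re-measured:

```julia
using BenchmarkTools

# Same arithmetic as setup_k_no_if, repeated here for self-containment.
function setup_k_no_if(k::Int, xalt, work, coef, factor=1.0)
    work[:, k] .= 0.0
    for jj = 1:length(coef)
        for ii = 1:size(work, 1)
            @inbounds work[ii, k] = work[ii, k] + coef[jj] * work[ii, jj]
        end
    end
    for ii = 1:size(work, 1)
        @inbounds work[ii, k] = work[ii, k] / factor + xalt[ii]
    end
    nothing
end

work = zeros(3, 11)
work[:, 1] = [2.0, 1.0, 4.0]
xalt = [1.0, 2.0, 3.0]
s21 = sqrt(21.0)
coef11 = (0.0, 0.0, 0.0, 0.0, (-42.0 + 7.0 * s21) * 20.0, (-18.0 + 28.0 * s21) * 8.0,
          (-273.0 - 53.0 * s21) * 5.0, (301.0 + 53.0 * s21) * 5.0, (28.0 - 28.0 * s21) * 8.0,
          (49.0 - 7.0 * s21) * 20.0)

# `$` splices the current values of the globals into the benchmark expression,
# so they are treated like local variables during timing.
@btime setup_k_no_if(11, $xalt, $work, $coef11, 17640.0)
```

The function zeroes column k before accumulating, so running it repeatedly inside the benchmark loop leaves the same result each time.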
Now I am wondering:
- The compiler is apparently unable to infer size(work,1) at compile time and to specialize the code so that the size(work,1) calls are no longer needed. That looks like a missed optimization opportunity, but I believe getting this case right is very hard. Is there a way to help the compiler?
- What happens in the setup_k_with_if() version? Apparently the code is somehow specialized to the case size(work,1)=3? EDIT: I should rather ask: the if apparently has negligible performance impact, which implies that size(work,1) is not executed every time the function is called? My previous experience with Fortran would suggest never having an if-else in a loop like this, because it would be a performance killer. So this was a big surprise. Can someone explain?
- IMHO the clean way would of course be to use multiple dispatch, with one version specialized to size(work,1)=3 and one for the general case. But I believe it is not possible to dispatch on the actual size of array dimensions?
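Regarding the last point: the dimensions of a standard Array are not part of its type, so dispatching on them directly is not available, but a size can be lifted into the type domain with the Val type from Base. The following is only a sketch under the assumption that the caller knows the row count as a constant. The name setup_k_val and the coefficient values are hypothetical, chosen for illustration. Whether this is faster than the if variant would have to be measured.

```julia
# Hypothetical variant: the row count N is a type parameter carried by Val{N},
# so the inner loop bounds are compile-time constants for each N.
function setup_k_val(::Val{N}, k::Int, xalt, work, coef, factor=1.0) where {N}
    # Explicit check of the row count, since the loops below use @inbounds.
    size(work, 1) == N || throw(DimensionMismatch("work must have $N rows"))
    for ii = 1:N
        @inbounds work[ii, k] = 0.0
    end
    for jj = 1:length(coef)
        for ii = 1:N
            @inbounds work[ii, k] += coef[jj] * work[ii, jj]
        end
    end
    for ii = 1:N
        @inbounds work[ii, k] = work[ii, k] / factor + xalt[ii]
    end
    nothing
end

work = zeros(3, 11)
work[:, 1] = [2.0, 1.0, 4.0]
xalt = [1.0, 2.0, 3.0]
# Illustrative coefficients only, not the ones from the real computation.
coef = (0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)

# Val(3) is written as a literal, so the method instance is known statically.
setup_k_val(Val(3), 11, xalt, work, coef, 2.0)
```

If Val(size(work,1)) were instead built from a runtime value at each call, the call would likely become a dynamic dispatch, which may cost more than the specialization saves. The idea only pays off when the size is a constant at the call site or the Val is created once outside the hot loop.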
Best Regards
Christof