Nested functions with SubArray argument


#1

Hi,
I am comparing the performance of three functions implementing X .+= 1, where X is a 2D Array of floats:

  • shift2D_1D! splits the 2D loop across two nested functions,

  • shift2DLoop! uses a single function with a 2D loop nest,

  • shift2DNative! uses the broadcast operator.

shift2D_1D! is noticeably slower than the other two. Is there a way to improve this?
In particular, is it OK for the inner function shift1D! to take a SubArray as its argument?

Results (Julia 0.6):

GFlops=10.829836198727493 (shift2D_1D!)
GFlops=18.726591760299627 (shift2DLoop!)
GFlops=17.362785762515674 (shift2DNative!)

Thank you for your help.
Laurent

using BenchmarkTools

# Implementation #1: shift1D! and shift2D_1D! (2D loop split across two functions)
function shift1D!(x::AbstractArray{T,1}) where T<:Real
    one_T=T(1)
    nx=length(x)
    @simd for i=1:nx
        @inbounds x[i]+=one_T
    end
end

function shift2D_1D!(x2D::Array{T,2}) where T<:Real
    nx,ny=size(x2D)
    for j=1:ny
        shift1D!(view(x2D,:,j))
    end
end

# Implementation #2: single function with a nested loop for X2D .+= 1
function shift2DLoop!(x2D::Array{T,2}) where T<:Real
    nx,ny=size(x2D)
    one_T=T(1)
    for j=1:ny  # @simd is only meant for the innermost loop
        @simd for i=1:nx
            @inbounds x2D[i,j]+=one_T
        end
    end
end

# Implementation #3: native Julia broadcast for X2D .+= 1
function shift2DNative!(x2D::Array{T,2}) where T<:Real
    one_T=T(1)
    x2D.+=one_T
end

# Benchmark shiftFunction on an n-by-n array of element type T and report GFlops
# (one addition per element, so n*n floating-point operations per call).
function testShift(shiftFunction, T::Type,n::Int64)
    x=zeros(T,n,n)
    # @belapsed returns the minimum elapsed time in seconds
    t=@belapsed $shiftFunction($x)
    print("GFlops=",n*n/(t*1.e9)," (",string(shiftFunction),")\n")
end

testShift(shift2D_1D!,Float32,200)
testShift(shift2DLoop!,Float32,200)
testShift(shift2DNative!,Float32,200)

#2

Creating a view has some overhead unless all uses of the view are confined to the function where it is created.

You can achieve that by forcing shift1D! to be inlined: add @inline in front of its function definition.
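
As a minimal sketch of this suggestion, here are the two functions from post #1 with @inline added to the inner one. The explicit return values and the small 4x3 test array are additions for illustration only; whether the view overhead actually disappears depends on the Julia version and should be checked by benchmarking.

```julia
# Inner kernel from post #1, now marked @inline so that the view created
# in the caller is only used inside the caller's own body.
@inline function shift1D!(x::AbstractArray{T,1}) where T<:Real
    one_T = T(1)
    @simd for i = 1:length(x)
        @inbounds x[i] += one_T
    end
    return x
end

# Outer function: iterates over columns and passes each one as a view.
function shift2D_1D!(x2D::Array{T,2}) where T<:Real
    for j = 1:size(x2D, 2)
        shift1D!(view(x2D, :, j))
    end
    return x2D
end

# Small illustrative input (not the 200x200 benchmark case).
x = zeros(Float32, 4, 3)
shift2D_1D!(x)
```

The benchmark can then be rerun unchanged with testShift(shift2D_1D!,Float32,200).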


#3

Great! The @inline macro removed the overhead.
Thank you very much Kristoffer.


#4

GFlops=18.65671641791045 (shift2D_1D! with @inline before shift1D!)
GFlops=16.99556226985176 (shift2DLoop!)
GFlops=17.584135202461777 (shift2DNative!)


#5

I only get around 7 GFlops on my computer (2016 MacBook Pro). Out of curiosity, what system are you running on?


#6

A Desktop CPU:
model name : Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
However, I had to rebuild the system image to get access to the AVX instructions (otherwise the performance is halved).