Sum performance for Array{Float64,2} elements

LaurentPlagne · May 13, 2018, 8:04pm

Hi,
I try to implement efficiently the sum of 2D arrays of size (4,N).
I fail to obtain good performance with different implementations:

msum_native corresponds to the std Julia sum method,
msum_2D1D corresponds to a single loop after 2D->1D transformation,
mum_fast4_sa corresponds to an attempt to force SIMD instructions via SVector of size 4,
msum_fast4 corresponds to a manual unroll of the inner loop,
msum_basic corresponds to a straightforward implementation with 2 nested loops.

I think that I should be able to reach around 16Gflops for this operation on my machine (4GHz and 4 additions par cycle for Float64 and AVX units). Best performance does not exceed 4GFlops…

The attached code produces the following curves:
allsum

Having a look at the result of @code_native, it seems (I am not very good at reading assembly code) that
LLVM failed to produce efficient SIMD code and performs an horizontal reduction (vhaddpd) for each SIMD packs (which is not good).

Any clues ?

P.S. If you want to reproduce the curves you have to replace the line
nset=[2^(i+9) for i=1:2]
by
nset=[2^i for i=1:21]
int the following code (but it takes around 7 minutes)

struct TBSM{T,N}
    data::Array{T,2}
end

function TBSM{T,N}(s::Int) where T where N
    TBSM{T,N}(rand(T,N,s))
end

function msum_basic(a::TBSM{T,N}) where T where N
    s=zero(T)
    nx,ny=size(a.data)
    @simd for j=1:ny
        @simd for i=1:N
            @inbounds s+=a.data[i,j]
        end
    end
    s
end

function msum_fast4(a::TBSM{T,N}) where T where N
    s=zero(T)
    nx,ny=size(a.data)
    @simd for j=1:ny
        @inbounds s+=a.data[1,j]+a.data[2,j]+a.data[3,j]+a.data[4,j]
    end
    s
end

using StaticArrays

function msum_fast4_sa(a::TBSM{T,N}) where T where N
    # println("N=",N)
    s=zero(T)
    sa=SVector(0.0,0.0,0.0,0.0)
    nx,ny=size(a.data)
    @simd for j=1:ny
        @inbounds sa+=SVector(a.data[1,j],a.data[2,j],a.data[3,j],a.data[4,j])
    end
    s=sum(sa)
end

function msum_2D1D(a::TBSM{T,N}) where T where N
    nx,ny=size(a.data)
    a1D=reshape(a.data,nx*ny)
    s=zero(T)
    @simd for ai in a1D
        @inbounds s+=ai
    end
    s
end

function msum_native(a::TBSM{T,N}) where T where N
    return sum(a.data)
end

#******************** Benchmark machinery ******************

abstract type BTLAction{T} end
type SumTBSMAction{T,N} <: BTLAction{T} end

function initialize(action::SumTBSMAction{T,N}, n::Int) where T where N
    return (TBSM{T,N}(n),)
end

function checkresult(action::SumTBSMAction{T,N}, my_function, inputs::Tuple) where T where N
    (a,)=inputs
    sref=msum_native(a)
    s=my_function(a)
    # println("diff=",abs(s-sref))
    return isapprox(s,sref,atol=1.e-6)
end



function nflops(action::SumTBSMAction{T,N}, n::Int) where T where N
    return BigInt(n)
end


using BenchmarkTools
using Plots
#plotly()
pyplot()


struct Measurement
    action_name::String
    function_name::String
    float_name::String
    gflops::Vector{Float64}
    sizes::Vector{Int}
    source::String
end

maxgflops(m::Measurement) = maximum(m.gflops)

using Sugar

function benchmark_action(action::BTLAction{T}, ss::Vector{Int}, my_function) where T
    gflops=[]
    sizes=[]
#    println(ss)
    for s in ss
        inputs=initialize(action,s)
        correct=checkresult(action,my_function,inputs)
        assert(correct)
        inputs=initialize(action,s)
        bt=@belapsed $my_function($inputs...)
        gf=Float64(nflops(action,s)/(1.e9*bt))
        println("GFLOPS=",gf,"\t",string(my_function)," ",string(T)," size=",s)
        push!(gflops,gf)
        push!(sizes,s)
    end
    inputs=initialize(action,sizes[1])
    #Build a tuple of arg types from a tuple of args
    mytypes=[]
    for e in inputs
        push!(mytypes,typeof(e))
    end
    typetuples=(mytypes...)
    #pass this tuple of type to Sugar
    m=Sugar.get_method(my_function,typetuples)
    code,str=Sugar.get_source(m) #str contains the source code of my_function
    return Measurement(string(typeof(action)),string(my_function),string(T),gflops,sizes,str)
end

function plotit(measurements::Array{Measurement,1})
    maxi=1.1*maximum([maximum(m.gflops) for m in measurements])
    ms=sort(measurements,by=maxgflops,rev=true)
    p1=Plots.plot(ylims=(0,maxi),ylabel="GFLOPS",xlabel="arraySize",title="Performance")
    for i=1:length(ms)
        m=ms[i]
        plot!(p1,m.sizes,linewidth=4,marker="o",markersize=5,markercolor="white",m.gflops,xscale = :log10,label=string(string(m.function_name)," ",string(m.float_name)))
    end
    p1
end


#******************** data for the bench ******************


using BenchmarkTools

fset=[msum_native,msum_basic,msum_2D1D,msum_fast4,msum_fast4_sa]

#nset=[2^i for i=1:21]
nset=[2^(i+9) for i=1:2]
rset=[Float64]
#rset=[Float64,Float32]
m=Vector{Measurement}(0)
for f in fset
    for r in rset
        ms=benchmark_action(SumTBSMAction{r,4}(),nset,f)
        # println(ms.gflops)
        push!(m,ms)
    end
end

p1=plotit(m)
show(p1)
savefig(p1,"allsum.pdf")

a=TBSM{Float64,4}(1000)
@code_native(msum_2D1D(a))
@code_native(msum_fast4_sa(a))type or paste code here

stevengj · May 13, 2018, 8:23pm

Probably not, because summation speed is usually limited by memory bandwidth.

LaurentPlagne · May 13, 2018, 8:28pm

Well, I guess I should obtain around 16 gflops for small arrays that fit in the cache memory. I obtain 18 Gflops when adding one to an array Nested functions with SubArray argument - #4 by LaurentPlagne

Elrod · May 14, 2018, 12:05am

I added these three functions:

function msum_rb(a::TBSM{T,N}) where T where N
    s = zero(T)
    am = a.data
    for ai in am
        s += ai
    end
    s
end
function msum_rbs(a::TBSM{T,N}) where T where N
    s = zero(T)
    am = a.data
    @simd for ai in am
        s += ai
    end
    s
end

function msum_ei(a::TBSM{T,N}) where T where N
    s = zero(T)
    am = a.data
    @inbounds @simd for i in eachindex(am)
        s += am[i]
    end
    s
end

and got:

|GFLOPS=1.1083705064287588|msum_native Float64 size=1024|
|GFLOPS=0.9956784788245463|msum_native Float64 size=2048|
|GFLOPS=0.14021154965255195|msum_basic Float64 size=1024|
|GFLOPS=0.1403412595079833|msum_basic Float64 size=2048|
|GFLOPS=1.3230268873493574|msum_2D1D Float64 size=1024|
|GFLOPS=1.3632430273580511|msum_2D1D Float64 size=2048|
|GFLOPS=0.40965462061608215|msum_fast4 Float64 size=1024|
|GFLOPS=0.4159823578910716|msum_fast4 Float64 size=2048|
|GFLOPS=0.5448547408747473|msum_fast4_sa Float64 size=1024|
|GFLOPS=0.5536257349462729|msum_fast4_sa Float64 size=2048|
|GFLOPS=0.14023555190358805|msum_rb Float64 size=1024|
|GFLOPS=0.1404182379156668|msum_rb Float64 size=2048|
|GFLOPS=1.4805298141768957|msum_rbs Float64 size=1024|
|GFLOPS=1.4478614351360906|msum_rbs Float64 size=2048|
|GFLOPS=1.5035394788020227|msum_ei Float64 size=1024|
|GFLOPS=1.4649499284692418|msum_ei Float64 size=2048|

higher performance with both SIMD versions than with any of yours.
Any reason for avoiding the simple approach?

versioninfo()
Julia Version 0.6.2
Commit d386e40c17 (2017-12-13 18:08 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Core(TM) i3-4010U CPU @ 1.70GHz
  WORD_SIZE: 64
  BLAS: libmkl_rt
  LAPACK: libmkl_rt
  LIBM: libimf
  LLVM: libLLVM-3.9.1 (ORCJIT, haswell)

FWIW,

julia> BLAS.set_num_threads(1)

julia> peakflops(5000) / 1e9
24.57265894018821

so this still isn’t close to peak performance.
Likely memory bound like stevenjg suggested, but I wonder where that memory bound is?

abulak · May 14, 2018, 7:32am

Did You compile Your julia system image? in my case it turned ~4 GFlops to ~12 for such simple operations

https://docs.julialang.org/en/stable/devdocs/sysimg/

LaurentPlagne · May 14, 2018, 7:46am

Thank you very much !
I did not know that it was possible to iterate so easily other 2D array elements: that is great !

Here are the curves corresponding to your functions:
allsum

I add a new version using dot with function

struct TBSM{T,N}
    data::Array{T,2}
    dataOne::Array{T,2}
end

function TBSM{T,N}(s::Int) where T where N
    TBSM{T,N}(rand(T,N,s),ones(T,N,s))
end
function msum_dot(a::TBSM{T,N}) where T where N
    return BLAS.dot(a.data,a.dataOne)
end

This dot based version is not vey good but I suspect that openblas is not as good as MKL for this operation.
I think it can be better on your machine since you use MKL.

Commit d386e40c17 (2017-12-13 18:08 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, skylake)

julia> BLAS.set_num_threads(1)

julia> peakflops(5000) / 1e9
58.47954933142325t

I made the same experiment with Eigen in C++ (bench-sum-Eigen-Double.dat).
I also tried to achieve the sum via a dot operation against a an array filled with ones for both MKL and Eigen:

My 16 GFlops expectation seems to be out of reach .
Anyway 8 GFlops should be attainable.
@abulak : yes, I did compile my julia system image.

I will try to call the Eigen sum directly from julia.

Thank you again !

GunnarFarneback · May 14, 2018, 8:35am

Another option if you want a bit more manual control of the SIMD operations is to use the SIMD package. E.g.

using SIMD
function msum_simd(a::TBSM{T, N}, ::Type{Val{M}}) where {T, N, M}
    x = vec(a.data)
    n = length(x)
    s = Vec{M, T}(0)
    for k = 1:M:n
        s += vload(Vec{M, T}, x, k)
    end
    return sum(s)
end

Call it with msum_simd(a, Val{4}), msum_simd(a, Val{8}) etc.
Note: this implementation is incomplete in that it ignores tails when the length of the data isn’t a multiple of M.

LaurentPlagne · May 14, 2018, 11:02am

Thanks ! Awesome interface for SIMD programming indeed.
I will try it after work.
Note 1 : the N parameter was here to ensure that the total length is a multiple of N. So I guess that it can replace M if N is 4 or 8.
Note 2 : I suspect a discrepancy/error between my C++ and Julia measurements because I can’t explain the performance differences for large (out of cache) arrays. I have to double check this…

GunnarFarneback · May 14, 2018, 11:39am

Yes, you can merge N and M if it fits your purposes but don’t be shy about testing 16, 32, 64.

LaurentPlagne · May 14, 2018, 11:41am

OK I will check with different sizes… a pity that currying does not apply here

stevengj · May 14, 2018, 1:08pm

Be careful when benchmarking libraries like MKL and Eigen, because they also might be using multi-threading…

LaurentPlagne · May 14, 2018, 1:33pm

I will double check this too (I’ve been using Eigen for a long time now and I don’t think that it does implicit MT). I have check my top and the process did not exceed 100%.

Elrod · May 14, 2018, 7:52pm

I will try to call the Eigen sum directly from julia.

Cxx and CxxWrap are two alternatives:

But honestly, if all I wanted are to call a couple functions, I just added a declaration with extern "C" and then compiled with -fPIC -shared into a shared library so I could just ccall it following:
https://docs.julialang.org/en/stable/manual/calling-c-and-fortran-code/

LaurentPlagne · May 15, 2018, 7:02am

Relieved and ashamed : I made a mistake in the nflops function (BigInt(n) must be replaced by BigInt(n)*N).

As a consequence all the Julia performance figures must my multiplied by 4 and there is no more discrepancy with C++ results. Finally my 16 GFlops expectation was correct.

allsum

Nevertheless this mistake was a lucky one for me since I learned a lot from your advises:

Simple is beautiful is efficient : the most elegant loop in Julia is the top performer (sum_rbs/ei)
BLAS1 (vector) loop in Julia competes with best C/C++ implementations and the automatic vectorization is very good (note that g++ 6.4 failed to vectorize properly this loop and the performance is divided by 4). This kind of quantitative comparison will be very useful for me to introduce Julia to my colleagues with a HPC background.
I learn to use SIMD.jl which I guess can be very neat and useful to vectorize more complex computations. (I used the M=16 version of Gunnar’s function for the plot).
I learn to call C/C++ function from Julia and check that it was really easy.
For information the Eigen called function corresponds to the following implementation

extern "C" {
  double eigensum(double * ptr,int N){
    typedef Eigen::Array<double,Eigen::Dynamic,1> EA;
    Eigen::Map<EA> em(ptr,N);
    return em.sum();
  }
}

and the ccall looks like:

function msum_eigen(a::TBSM{Float64, N}) where N
    ccall((:eigensum, "./libesum"),Cdouble,(Ptr{Cdouble}, Csize_t),a.data,length(a.data))
end

Rq1 on Julia: considering the Eigen wrap where I could have use
Eigen::Map<EA,Aligned64> em(ptr,N);
and the SIMD.jl vloada variant of vload, I think that it could be useful to enforce the Julia allocator for arrays to fulfill memory alignments compatible with used SIMD instruction sets.
Rq2 on discourse: it is a pity that I can’t upload svg or pdf images

Thank you again for your help !
Laurent

Topic		Replies	Views
How to match performance of sum(A, dims=1)? Performance	14	1070	September 21, 2022
How to improve performance of sum() Performance question , fast-math , tullio , loopvectorization , sum	19	4742	July 19, 2021
Why I love Julia Community announcement	34	4760	October 11, 2019
A fast sum. Any downsides? Performance sum	18	894	December 16, 2024
Does the julia intrinsic sum() apply fastmath by default? New to Julia fast-math , sum	15	746	February 21, 2023

Sum performance for Array{Float64,2} elements

Related topics