# Sum performance for Array{Float64,2} elements

#1

Hi,
I try to implement efficiently the sum of 2D arrays of size (4,N).
I fail to obtain good performance with different implementations:

• `msum_native` corresponds to the std Julia `sum` method,
• `msum_2D1D` corresponds to a single loop after 2D->1D transformation,
• `mum_fast4_sa` corresponds to an attempt to force SIMD instructions via SVector of size 4,
• `msum_fast4` corresponds to a manual unroll of the inner loop,
• `msum_basic` corresponds to a straightforward implementation with 2 nested loops.

I think that I should be able to reach around 16Gflops for this operation on my machine (4GHz and 4 additions par cycle for Float64 and AVX units). Best performance does not exceed 4GFlopsâ€¦

The attached code produces the following curves:

Having a look at the result of @code_native, it seems (I am not very good at reading assembly code) that
LLVM failed to produce efficient SIMD code and performs an horizontal reduction (vhaddpd) for each SIMD packs (which is not good).

Any clues ?

P.S. If you want to reproduce the curves you have to replace the line
`nset=[2^(i+9) for i=1:2]`
by
`nset=[2^i for i=1:21]`
int the following code (but it takes around 7 minutes)

``````struct TBSM{T,N}
data::Array{T,2}
end

function TBSM{T,N}(s::Int) where T where N
TBSM{T,N}(rand(T,N,s))
end

function msum_basic(a::TBSM{T,N}) where T where N
s=zero(T)
nx,ny=size(a.data)
@simd for j=1:ny
@simd for i=1:N
@inbounds s+=a.data[i,j]
end
end
s
end

function msum_fast4(a::TBSM{T,N}) where T where N
s=zero(T)
nx,ny=size(a.data)
@simd for j=1:ny
@inbounds s+=a.data[1,j]+a.data[2,j]+a.data[3,j]+a.data[4,j]
end
s
end

using StaticArrays

function msum_fast4_sa(a::TBSM{T,N}) where T where N
# println("N=",N)
s=zero(T)
sa=SVector(0.0,0.0,0.0,0.0)
nx,ny=size(a.data)
@simd for j=1:ny
@inbounds sa+=SVector(a.data[1,j],a.data[2,j],a.data[3,j],a.data[4,j])
end
s=sum(sa)
end

function msum_2D1D(a::TBSM{T,N}) where T where N
nx,ny=size(a.data)
a1D=reshape(a.data,nx*ny)
s=zero(T)
@simd for ai in a1D
@inbounds s+=ai
end
s
end

function msum_native(a::TBSM{T,N}) where T where N
return sum(a.data)
end

#******************** Benchmark machinery ******************

abstract type BTLAction{T} end
type SumTBSMAction{T,N} <: BTLAction{T} end

function initialize(action::SumTBSMAction{T,N}, n::Int) where T where N
return (TBSM{T,N}(n),)
end

function checkresult(action::SumTBSMAction{T,N}, my_function, inputs::Tuple) where T where N
(a,)=inputs
sref=msum_native(a)
s=my_function(a)
# println("diff=",abs(s-sref))
return isapprox(s,sref,atol=1.e-6)
end

function nflops(action::SumTBSMAction{T,N}, n::Int) where T where N
return BigInt(n)
end

using BenchmarkTools
using Plots
#plotly()
pyplot()

struct Measurement
action_name::String
function_name::String
float_name::String
gflops::Vector{Float64}
sizes::Vector{Int}
source::String
end

maxgflops(m::Measurement) = maximum(m.gflops)

using Sugar

function benchmark_action(action::BTLAction{T}, ss::Vector{Int}, my_function) where T
gflops=[]
sizes=[]
#    println(ss)
for s in ss
inputs=initialize(action,s)
correct=checkresult(action,my_function,inputs)
assert(correct)
inputs=initialize(action,s)
bt=@belapsed \$my_function(\$inputs...)
gf=Float64(nflops(action,s)/(1.e9*bt))
println("GFLOPS=",gf,"\t",string(my_function)," ",string(T)," size=",s)
push!(gflops,gf)
push!(sizes,s)
end
inputs=initialize(action,sizes[1])
#Build a tuple of arg types from a tuple of args
mytypes=[]
for e in inputs
push!(mytypes,typeof(e))
end
typetuples=(mytypes...)
#pass this tuple of type to Sugar
m=Sugar.get_method(my_function,typetuples)
code,str=Sugar.get_source(m) #str contains the source code of my_function
return Measurement(string(typeof(action)),string(my_function),string(T),gflops,sizes,str)
end

function plotit(measurements::Array{Measurement,1})
maxi=1.1*maximum([maximum(m.gflops) for m in measurements])
ms=sort(measurements,by=maxgflops,rev=true)
p1=Plots.plot(ylims=(0,maxi),ylabel="GFLOPS",xlabel="arraySize",title="Performance")
for i=1:length(ms)
m=ms[i]
plot!(p1,m.sizes,linewidth=4,marker="o",markersize=5,markercolor="white",m.gflops,xscale = :log10,label=string(string(m.function_name)," ",string(m.float_name)))
end
p1
end

#******************** data for the bench ******************

using BenchmarkTools

fset=[msum_native,msum_basic,msum_2D1D,msum_fast4,msum_fast4_sa]

#nset=[2^i for i=1:21]
nset=[2^(i+9) for i=1:2]
rset=[Float64]
#rset=[Float64,Float32]
m=Vector{Measurement}(0)
for f in fset
for r in rset
ms=benchmark_action(SumTBSMAction{r,4}(),nset,f)
# println(ms.gflops)
push!(m,ms)
end
end

p1=plotit(m)
show(p1)
savefig(p1,"allsum.pdf")

a=TBSM{Float64,4}(1000)
@code_native(msum_2D1D(a))
@code_native(msum_fast4_sa(a))type or paste code here
``````

#2

Probably not, because summation speed is usually limited by memory bandwidth.

#3

Well, I guess I should obtain around 16 gflops for small arrays that fit in the cache memory. I obtain 18 Gflops when adding one to an array Nested functions with SubArray argument

#4

``````function msum_rb(a::TBSM{T,N}) where T where N
s = zero(T)
am = a.data
for ai in am
s += ai
end
s
end
function msum_rbs(a::TBSM{T,N}) where T where N
s = zero(T)
am = a.data
@simd for ai in am
s += ai
end
s
end

function msum_ei(a::TBSM{T,N}) where T where N
s = zero(T)
am = a.data
@inbounds @simd for i in eachindex(am)
s += am[i]
end
s
end
``````

and got:

``````|GFLOPS=1.1083705064287588|msum_native Float64 size=1024|
|GFLOPS=0.9956784788245463|msum_native Float64 size=2048|
|GFLOPS=0.14021154965255195|msum_basic Float64 size=1024|
|GFLOPS=0.1403412595079833|msum_basic Float64 size=2048|
|GFLOPS=1.3230268873493574|msum_2D1D Float64 size=1024|
|GFLOPS=1.3632430273580511|msum_2D1D Float64 size=2048|
|GFLOPS=0.40965462061608215|msum_fast4 Float64 size=1024|
|GFLOPS=0.4159823578910716|msum_fast4 Float64 size=2048|
|GFLOPS=0.5448547408747473|msum_fast4_sa Float64 size=1024|
|GFLOPS=0.5536257349462729|msum_fast4_sa Float64 size=2048|
|GFLOPS=0.14023555190358805|msum_rb Float64 size=1024|
|GFLOPS=0.1404182379156668|msum_rb Float64 size=2048|
|GFLOPS=1.4805298141768957|msum_rbs Float64 size=1024|
|GFLOPS=1.4478614351360906|msum_rbs Float64 size=2048|
|GFLOPS=1.5035394788020227|msum_ei Float64 size=1024|
|GFLOPS=1.4649499284692418|msum_ei Float64 size=2048|
``````

higher performance with both SIMD versions than with any of yours.
Any reason for avoiding the simple approach?

``````versioninfo()
Julia Version 0.6.2
Commit d386e40c17 (2017-12-13 18:08 UTC)
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: Intel(R) Core(TM) i3-4010U CPU @ 1.70GHz
WORD_SIZE: 64
BLAS: libmkl_rt
LAPACK: libmkl_rt
LIBM: libimf
LLVM: libLLVM-3.9.1 (ORCJIT, haswell)
``````

FWIW,

``````julia> BLAS.set_num_threads(1)

julia> peakflops(5000) / 1e9
24.57265894018821
``````

so this still isnâ€™t close to peak performance.
Likely memory bound like stevenjg suggested, but I wonder where that memory bound is?

#5

Did You compile Your julia system image? in my case it turned ~4 GFlops to ~12 for such simple operations

https://docs.julialang.org/en/stable/devdocs/sysimg/

#6

Thank you very much !
I did not know that it was possible to iterate so easily other 2D array elements: that is great !

Here are the curves corresponding to your functions:

I add a new version using dot with function

``````struct TBSM{T,N}
data::Array{T,2}
dataOne::Array{T,2}
end

function TBSM{T,N}(s::Int) where T where N
TBSM{T,N}(rand(T,N,s),ones(T,N,s))
end
function msum_dot(a::TBSM{T,N}) where T where N
return BLAS.dot(a.data,a.dataOne)
end
``````

This dot based version is not vey good but I suspect that openblas is not as good as MKL for this operation.
I think it can be better on your machine since you use MKL.

``````Commit d386e40c17 (2017-12-13 18:08 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
LAPACK: libopenblas64_
LIBM: libopenlibm
LLVM: libLLVM-3.9.1 (ORCJIT, skylake)

julia> peakflops(5000) / 1e9
58.47954933142325t
``````

I made the same experiment with Eigen in C++ (bench-sum-Eigen-Double.dat).
I also tried to achieve the sum via a dot operation against a an array filled with ones for both MKL and Eigen:

My 16 GFlops expectation seems to be out of reach .
Anyway 8 GFlops should be attainable.
@abulak : yes, I did compile my julia system image.

I will try to call the Eigen sum directly from julia.

Thank you again !

#7

Another option if you want a bit more manual control of the SIMD operations is to use the SIMD package. E.g.

``````using SIMD
function msum_simd(a::TBSM{T, N}, ::Type{Val{M}}) where {T, N, M}
x = vec(a.data)
n = length(x)
s = Vec{M, T}(0)
for k = 1:M:n
s += vload(Vec{M, T}, x, k)
end
return sum(s)
end
``````

Call it with `msum_simd(a, Val{4})`, `msum_simd(a, Val{8})` etc.
Note: this implementation is incomplete in that it ignores tails when the length of the data isnâ€™t a multiple of `M`.

#8

Thanks ! Awesome interface for SIMD programming indeed.
I will try it after work.
Note 1 : the N parameter was here to ensure that the total length is a multiple of N. So I guess that it can replace M if N is 4 or 8.
Note 2 : I suspect a discrepancy/error between my C++ and Julia measurements because I canâ€™t explain the performance differences for large (out of cache) arrays. I have to double check thisâ€¦

#9

Yes, you can merge N and M if it fits your purposes but donâ€™t be shy about testing 16, 32, 64.

#10

OK I will check with different sizesâ€¦ a pity that currying does not apply here

#11

Be careful when benchmarking libraries like MKL and Eigen, because they also might be using multi-threadingâ€¦

#12

I will double check this too (Iâ€™ve been using Eigen for a long time now and I donâ€™t think that it does implicit MT). I have check my top and the process did not exceed 100%.

#13

I will try to call the Eigen sum directly from julia.

Cxx and CxxWrap are two alternatives:

But honestly, if all I wanted are to call a couple functions, I just added a declaration with `extern "C"` and then compiled with `-fPIC -shared` into a shared library so I could just `ccall` it following:
https://docs.julialang.org/en/stable/manual/calling-c-and-fortran-code/

#14

Relieved and ashamed : I made a mistake in the `nflops` function (`BigInt(n)` must be replaced by `BigInt(n)*N`).

As a consequence all the Julia performance figures must my multiplied by 4 and there is no more discrepancy with C++ results. Finally my 16 GFlops expectation was correct.

Nevertheless this mistake was a lucky one for me since I learned a lot from your advises:

• Simple is beautiful is efficient : the most elegant loop in Julia is the top performer (sum_rbs/ei)
• BLAS1 (vector) loop in Julia competes with best C/C++ implementations and the automatic vectorization is very good (note that g++ 6.4 failed to vectorize properly this loop and the performance is divided by 4). This kind of quantitative comparison will be very useful for me to introduce Julia to my colleagues with a HPC background.
• I learn to use SIMD.jl which I guess can be very neat and useful to vectorize more complex computations. (I used the M=16 version of Gunnarâ€™s function for the plot).
• I learn to call C/C++ function from Julia and check that it was really easy.
For information the Eigen called function corresponds to the following implementation
``````extern "C" {
double eigensum(double * ptr,int N){
typedef Eigen::Array<double,Eigen::Dynamic,1> EA;
Eigen::Map<EA> em(ptr,N);
return em.sum();
}
}
``````

and the ccall looks like:

``````function msum_eigen(a::TBSM{Float64, N}) where N
ccall((:eigensum, "./libesum"),Cdouble,(Ptr{Cdouble}, Csize_t),a.data,length(a.data))
end
``````
• Rq1 on Julia: considering the Eigen wrap where I could have use
`Eigen::Map<EA,Aligned64> em(ptr,N);`
and the SIMD.jl `vloada` variant of `vload`, I think that it could be useful to enforce the Julia allocator for arrays to fulfill memory alignments compatible with used SIMD instruction sets.
• Rq2 on discourse: it is a pity that I canâ€™t upload svg or pdf images

Thank you again for your help !
Laurent

Compile (and call) a C function from Julia