Sum performance for Array{Float64,2} elements

Relieved and ashamed :smile:: I made a mistake in the nflops function (BigInt(n) must be replaced by BigInt(n)*N).

As a consequence, all the Julia performance figures must be multiplied by 4, and there is no longer any discrepancy with the C++ results. In the end, my 16 GFlops expectation was correct.

[allsum plot: updated benchmark results]

Nevertheless, this mistake was a lucky one for me, since I learned a lot from your advice:

  • Simple is beautiful is efficient: the most elegant loop in Julia is the top performer (sum_rbs/ei); a sketch of this kind of loop is included after this list.
  • The BLAS1 (vector) loop in Julia competes with the best C/C++ implementations, and the automatic vectorization is very good (note that g++ 6.4 failed to vectorize this loop properly, which divides its performance by 4). This kind of quantitative comparison will be very useful for me when introducing Julia to my colleagues with an HPC background.
  • I learned to use SIMD.jl, which I guess can be very neat and useful for vectorizing more complex computations. (I used the M=16 version of Gunnar’s function for the plot; a SIMD.jl sketch also follows this list.)
  • I learned to call C/C++ functions from Julia and found that it was really easy.
    For information, the Eigen-based function being called corresponds to the following implementation:
extern "C" {
  double eigensum(double * ptr,int N){
    typedef Eigen::Array<double,Eigen::Dynamic,1> EA;
    Eigen::Map<EA> em(ptr,N);
    return em.sum();
  }
}

and the ccall looks like:

# a.data is the underlying Float64 buffer of the TBSM wrapper used in this thread
function msum_eigen(a::TBSM{Float64, N}) where N
    ccall((:eigensum, "./libesum"), Cdouble, (Ptr{Cdouble}, Cint), a.data, length(a.data))
end
  • Remark 1 on Julia: considering the Eigen wrapper, where I could have used
    Eigen::Map<EA,Eigen::Aligned64> em(ptr,N);
    and the vloada variant of SIMD.jl's vload, I think it would be useful if the Julia array allocator could be asked to enforce a memory alignment compatible with the SIMD instruction set in use (see the alignment-aware sketch after this list).
  • Remark 2 on Discourse: it is a pity that I can’t upload SVG or PDF images :grinning:
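
For reference, here is a minimal sketch of the kind of plain Julia loop mentioned in the first bullet; the exact sum_rbs/ei definitions are earlier in the thread, so the name sum_simple and its details are only my illustration. With @inbounds and @simd the compiler is free to auto-vectorize the reduction, which is what makes this simple loop competitive with the BLAS1-style C/C++ versions.

function sum_simple(a::AbstractArray{Float64})
    s = 0.0
    @inbounds @simd for i in eachindex(a)   # @simd allows reordering the reduction so it can vectorize
        s += a[i]
    end
    return s
end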
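
In the same spirit, a hedged sketch of an explicit SIMD.jl version, loosely following Gunnar's M=16 idea (his real function is in the thread above; vsum here is my own simplified variant and assumes length(a) is a multiple of the vector width M). It also illustrates the alignment remark: assuming vloada requires the pointer to be aligned to the full vector width, it can only be preferred over vload when the data pointer happens to satisfy that, which cannot currently be requested from the Julia allocator.

using SIMD

# Sketch only (not Gunnar's exact function): one SIMD accumulator of width M.
function vsum(a::Vector{Float64}, ::Type{Vec{M,Float64}}) where {M}
    @assert length(a) % M == 0
    # Alignment is an allocator implementation detail, so check at run time
    # before preferring the aligned load vloada over plain vload.
    aligned = UInt(pointer(a)) % (M * sizeof(Float64)) == 0
    acc = Vec{M,Float64}(0.0)
    @inbounds for i in 1:M:length(a)
        acc += aligned ? vloada(Vec{M,Float64}, a, i) : vload(Vec{M,Float64}, a, i)
    end
    return sum(acc)
end

# e.g. vsum(vec(rand(1024, 1024)), Vec{16,Float64})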

Thank you again for your help!
Laurent
