Julia 1.0, tight-binding benchmark and array slices

jabl · August 23, 2018, 2:34pm

Hi,

Following the recent release of Julia 1.0, I updated a small tight-binding benchmark program that I have implemented in Fortran, C++ with Eigen, C++ with Armadillo, and Python/NumPy. The Fortran and both C++ versions are roughly equivalent in both LOC and performance. The Julia and NumPy versions are about the same length, roughly half the LOC of the Fortran/C++ versions. The NumPy version, however, is very slow, roughly a factor of 50 slower than Fortran (excluding the part which is just a LAPACK call).

Previously, in the Julia 0.4 timeframe, the Julia version was about half as fast as the Fortran/C++ versions. That version used the Devectorize package, which seems to have been unmaintained for several years now. I was unable to make it work with Julia 0.6.x, not to mention 1.0. However, it seems that as of Julia 0.6 there is the “@.” macro, which does roughly the same as the @devec macro from Devectorize(?). With @. for a few critical operations, Julia 1.0 is a factor of 1.7 slower than Fortran. Without @., it is about a factor of 2.1 slower.

However, if I rewrite those expressions as manual loops, Julia is only a factor of 1.1 slower than Fortran, that is, more or less the same! Very impressive!

It is slightly disappointing, though, that I had to resort to writing manual loops for performance. Is there some trick I’m missing? The expressions in question are all of the form

@. v[:] = atoms[bj,:] - atoms[bi,:]

which I rewrite as an explicit loop like:

for z = 1:3
    v[z] = atoms[bj,z] - atoms[bi,z]
end

Does Julia create a copy as part of the slicing operation, or what makes the array syntax slow? The @time macro does report a lot of allocations due to this, whether it’s an actual copy, an array descriptor for the slice, or whatever. Allocations for one particular case:

Unoptimized: 3.83 M
Using @.: 2.55 M
Explicit loops: 4

Is there anything that can be done here?
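
The allocation pattern described above can be reproduced in isolation. The following is a minimal sketch, not the benchmark itself: the function names, the array size, and the `allocated_bytes` helper are all hypothetical, and the exact byte counts depend on the Julia version.

```julia
# Hypothetical stand-ins for the three variants from the post.
# `atoms` is an N×3 matrix (one atom per row), `v` a preallocated 3-vector.

# Plain array syntax: each slice is copied, and the difference is a third temporary.
function diff_slices!(v, atoms, bi, bj)
    v[:] = atoms[bj, :] - atoms[bi, :]
    return nothing
end

# @. fuses the subtraction and the assignment, but each slice is still a copy.
function diff_dot!(v, atoms, bi, bj)
    @. v[:] = atoms[bj, :] - atoms[bi, :]
    return nothing
end

# Explicit loop: scalar indexing only, no temporaries.
function diff_loop!(v, atoms, bi, bj)
    for z in 1:3
        v[z] = atoms[bj, z] - atoms[bi, z]
    end
    return nothing
end

# Call once to compile, then count the bytes allocated by a second call.
function allocated_bytes(f, v, atoms, bi, bj)
    f(v, atoms, bi, bj)
    return @allocated f(v, atoms, bi, bj)
end

atoms = rand(100, 3)
v = zeros(3)

slice_bytes = allocated_bytes(diff_slices!, v, atoms, 1, 2)
dot_bytes   = allocated_bytes(diff_dot!, v, atoms, 1, 2)
loop_bytes  = allocated_bytes(diff_loop!, v, atoms, 1, 2)

println("slices: ", slice_bytes, " bytes, @.: ", dot_bytes, " bytes, loop: ", loop_bytes, " bytes")
```

The two sliced variants should report a nonzero byte count because every `atoms[b, :]` on the right-hand side builds a new 3-element array, while the loop only reads and writes scalars.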

ChrisRackauckas · August 23, 2018, 2:36pm

You want to do

@views @. v = atoms[bj,:] - atoms[bi,:]

But there is a performance regression on Julia v1.0 which currently slows down broadcasting.

github.com/JuliaLang/julia

Broadcasting is much slower than a for loop

opened 08:13PM - 15 Jul 18 UTC

YingboMa

performance regression broadcast simd

Here is a minimal working example.

```julia
julia> using BenchmarkTools

julia> function foo(a::Vector{T}, b::Vector{T}, c::Vector{T}, d::Vector{T}, e::Vector{T}) where T
           @. a = b + 0.1 * (0.2c + 0.3d + 0.4e)
           nothing
       end
foo (generic function with 1 method)

julia> function goo(a::Vector{T}, b::Vector{T}, c::Vector{T}, d::Vector{T}, e::Vector{T}) where T
           @assert length(a) == length(b) == length(c) == length(d) == length(e)
           @inbounds for i in eachindex(a)
               a[i] = b[i] + 0.1 * (0.2c[i] + 0.3d[i] + 0.4e[i])
           end
           nothing
       end
goo (generic function with 1 method)

julia> a,b,c,d,e=(rand(1000) for i in 1:5)
Base.Generator{UnitRange{Int64},getfield(Main, Symbol("##9#10"))}(getfield(Main, Symbol("##9#10"))(), 1:5)

julia> @btime foo($a,$b,$c,$d,$e)
  1.277 μs (0 allocations: 0 bytes)

julia> @btime goo($a,$b,$c,$d,$e)
  345.568 ns (0 allocations: 0 bytes)

julia> versioninfo()
Julia Version 0.7.0-beta2.12
Commit a878341 (2018-07-15 15:57 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, skylake)
Environment:
  JULIA_PKG3_PRECOMPILE = 1
```

(Note: Don’t forget to add @inbounds or @inbounds @simd to that for loop in the benchmarks.)

mbauman · August 23, 2018, 2:37pm

That sounds like a great benchmark.

Yes, that’s precisely what happens: slicing can’t be “devectorized” like other function calls, so each slice on the right-hand side allocates a copy. You can use @views alongside the @. macro to make slicing return a lazy view instead of a copy.
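
As an illustration of the difference between a copying slice and a view, here is a small sketch. The helper name `diff_views!` and the sample data are made up for the example; only `@views`, `@.` and `view` are Base functionality.

```julia
# Hypothetical helper: `atoms` is an N×3 matrix, `v` a preallocated 3-vector.
function diff_views!(v, atoms, bi, bj)
    # @views turns atoms[bj, :] into a view of the row, so no slice is copied;
    # @. then fuses the subtraction and the assignment into one in-place loop.
    @views @. v = atoms[bj, :] - atoms[bi, :]
    return v
end

atoms = [1.0 2.0 3.0;
         4.0 6.0 8.0]
v = zeros(3)
diff_views!(v, atoms, 1, 2)      # v now holds atoms[2, :] - atoms[1, :]

# A slice is an independent copy, while a view aliases the parent array.
row_copy = atoms[1, :]
row_view = view(atoms, 1, :)
atoms[1, 1] = 10.0
println(row_copy[1], " ", row_view[1])   # prints "1.0 10.0"
```

Because the view aliases `atoms`, it is cheap to create but also reflects any later change to the parent array, which is worth keeping in mind when a slice is meant to be a snapshot.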

carstenbauer · August 23, 2018, 2:46pm

Also, try julia -O3 and @simd.
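
For reference, an annotated version of the explicit loop might look like the sketch below. The function name and the sample data are hypothetical, and whether `@simd` helps at all for a loop of length 3 is something to measure rather than assume.

```julia
# Hypothetical helper mirroring the loop from the original post.
function diff_simd!(v, atoms, bi, bj)
    # The caller must guarantee that bi and bj are valid rows and that
    # v has at least 3 elements: @inbounds removes the bounds checks.
    @inbounds @simd for z in 1:3
        v[z] = atoms[bj, z] - atoms[bi, z]
    end
    return v
end

atoms = [0.0 0.0 0.0;
         1.5 2.5 3.5]
v = zeros(3)
diff_simd!(v, atoms, 1, 2)
```

The optimization level is chosen on the command line, for example `julia -O3 script.jl`, where `script.jl` is a placeholder for the benchmark file.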

baggepinnen · August 23, 2018, 3:40pm

You are also using an access pattern that is optimal for row-major storage, whereas Julia uses column-major storage. Flip the array dimensions and you might see increased performance.
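
A small sketch of what flipping the dimensions means in practice, assuming the atoms are stored as an N×3 matrix as in the original post (the names `atoms_t` and `diff_cols!` are made up for the example):

```julia
atoms = rand(5, 3)               # N×3: one atom per row, as in the original post
atoms_t = permutedims(atoms)     # 3×N: one atom per column

# Strides are counted in elements along each dimension.
println(strides(atoms))          # coordinates of one atom are 5 elements apart
println(strides(atoms_t))        # coordinates of one atom are adjacent in memory

# Same difference as before, but reading down a column instead of along a row.
function diff_cols!(v, atoms_t, bi, bj)
    for z in 1:3
        v[z] = atoms_t[z, bj] - atoms_t[z, bi]
    end
    return v
end

v = zeros(3)
diff_cols!(v, atoms_t, 1, 2)
```

With the 3×N layout the three coordinates of an atom sit next to each other, so each atom is read from one contiguous chunk of memory.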

ChrisRackauckas · August 23, 2018, 6:18pm

If the second dimension is small and you always use it together, not only would it be better to transpose it, but using an Array of SVectors (from StaticArrays.jl) would likely help too.
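
The idea of storing one small fixed-size value per atom can be sketched without extra packages by using plain tuples. This is a dependency-free stand-in, not the StaticArrays.jl approach itself: with that package one would use `SVector{3,Float64}` in place of the tuple, which also supports ordinary vector arithmetic. The names `points` and `bond` are hypothetical.

```julia
atoms = rand(5, 3)               # N×3 matrix, one atom per row

# One immutable 3-tuple per atom, stored inline in the vector.
points = [(atoms[i, 1], atoms[i, 2], atoms[i, 3]) for i in 1:size(atoms, 1)]

# Broadcasting over tuples returns a tuple, so no temporary array is created.
bond(points, bi, bj) = points[bj] .- points[bi]

d = bond(points, 1, 2)
println(d)
```

Since the length is part of the element type, the compiler knows there are exactly three components per atom and can unroll the subtraction.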

jabl · August 23, 2018, 6:20pm

Thanks for all the suggestions. A little more experimenting showed that for this particular case:

@views helps a bit
-O3 doesn’t seem to have any effect
transposing the atoms array did help a little bit. Surprisingly little, but my largest atoms array is about 10 kB, so it all fits in L1 cache anyhow. I guess the biggest benefit might be to enable SIMD, but OTOH with only 3 elements the benefits of SIMD are, well, minute.

All in all, with transposed atoms array I got:

Explicit loops + @inbounds @simd: 0.96 x Fortran
Slices with @views @.: 1.34 x Fortran

Pretty nice!

kristoffer.carlsson · August 23, 2018, 7:06pm

Being able to see the code would be nice.

jabl · August 24, 2018, 6:19am

I’d love to, but the code was originally a homework exercise for a course, and AFAIK they are still giving this course, so it’d be bad form to give out the exercise answer. Come to think of it, I should ask whether they are still using this same exercise; if not, I guess there’s nothing to prevent releasing it.

jabl · September 22, 2018, 7:02pm

Well, to wake up this semi-zombie thread, I asked and got permission for releasing the code, wooo! So here it is: Janne Blomqvist / tb · GitLab

Please let me know if you have issues running it, or any other feedback for that matter!
