Julia 1.0, tight-binding benchmark and array slices


following the recent release of Julia 1.0 I updated a small benchmark tight-binding program that I have implemented in Fortran, C++ with Eigen, C++ with Armadillo, and python/numpy. Roughly, the Fortran and both C++ versions are equivalent both in terms of LOC and performance. The Julia and Numpy versions are roughly the same in terms of LOC, about half the LOC of the Fortran/C++ versions. The Numpy version, however, is very slow, roughly a factor of 50 slower than Fortran (excluding the part which is just a lapack call).

Now, previously in the Julia 0.4 timeframe, the Julia version was about half as fast as the Fortran/C++ versions. That version used the Devectorize package, which seems to have been unmaintained now for several years. I was unable to make it work with julia 0.6.x, not to mention 1.0. However, it seems that as of Julia 0.6 there is the “@.” macro which does roughly the same as the @devec macro from Devectorize(?). With @. for a few critical operations, Julia 1.0 is a factor of 1.7 slower than Fortran. Without @., about a factor of 2.1 slower.

However, if I rewrite those expression as manual loops, Julia is only a factor of 1.1 slower than Fortran, that is, more or less the same! Very impressive!

Although slightly disappointing that I had to resort to writing manual loops for performance. Is there some trick I’m missing? The expressions in question are all of the form

@. v[:] = atoms[bj,:] - atoms[bi,:]

which I rewrite as an explicit loop like:

for z = 1:3
v[z] = atoms[bj,z] - atoms[bi,z]

Does Julia create a copy as part of the slicing operation, or what makes the array syntax slow? The @time macro does report a lot of allocations due to this, whether it’s an actual copy, an array descriptor for the slice, or whatever. Allocations for one particular case:

  • Unoptimized: 3.83 M
  • Using @.: 2.55 M
  • Explicit loops: 4

Is there anything that can be done here?

1 Like

You want to do

@views @. v = atoms[bj,:] - atoms[bi,:]

But there is a performance regression on Julia v1.0 which does currently slow down broadcasting.

(Note: Don’t forget to @inbounds or @inbounds @simd that for loop in the benchmarks :slight_smile:)


That sounds like a great benchmark.

Yes, that’s precisely what happens — slicing isn’t able to be “devectorized” like all other function calls. You can use @views alongside the @. macro to instead make slicing return a lazy view instead of a copy.


Also, try julia -O3 and @simd.

1 Like

You are also using an access pattern optimal for row major storage, whereas Julia uses column major storage. Flip the array dimensions and you might see increased performance.


If the second dimension is small and you always use it together, not only would it be better to transpose it, but using an Array of SVectors (from StaticArrays.jl) would likely help too.

1 Like

Thanks for all the suggestions. A little more experimenting showed that for this particular case:

  • @views helps a bit
  • -O3 doesn’t seem to have any effect
  • transposing the atoms array did help a little bit. Surprisingly little, but my largest atoms array is about 10 kB, so it all fits in L1 cache anyhow. I guess the biggest benefit might be to enable SIMD, but OTOH with only 3 elements the benefits of SIMD are, well, minute.

All in all, with transposed atoms array I got:

  • Explicit loops + @inbounds @simd: 0.96 x Fortran
  • Slices with @views @.: 1.34 x Fortran

Pretty nice!


Being able to see the code would be nice.


I’d love to, but the code was originally a homework exercise for a course, and AFAIK they are still giving this course, so it’d be a bit bad style to give out the exercise answer. Come to think of it, I should ask if they are still using this same exercise, if not I guess there’s nothing to prevent releasing it.


Well, to wake up this semi-zombie thread, I asked and got permission for releasing the code, wooo! So here it is: https://gitlab.com/jabl/tb

Please let me know if you have issues running it, or any other feedback for that matter!