@turbo speeds up a routine but slows down everything else

I’ve been playing with LoopVectorization.jl to speed up the bottleneck in my code. I’ve added @turbo to several of the innermost loops in this routine, and nowhere else in the code. It speeds up that routine (from 24 μs to 17 μs, measured with @btime), but the overall execution time increases (from 2.6 s to 2.8 s), even though the bottleneck routine accounts for 50% of the execution time.

I only found this because I’m preparing an MWE for another performance problem, and I don’t have an MWE for this one. But it was so strange that I thought I’d ask whether it triggers any associations for people about benchmarking mistakes I could be making.


What’s the memory usage like? I suspect the cost is that running multiple threads uses more memory and hence requires more time for GC.

There’s quite a bit of memory allocated, so your theory makes sense. The bottleneck routine itself allocates very little (2 KiB), but the full code allocates 928 MiB.
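
A quick way to test the GC theory (just a sketch, not something from the thread; run_full_timing is a placeholder for however the full calculation is invoked): @time prints the percentage of time spent in garbage collection, so that number can be compared with and without @turbo.

GC.gc()                  # start from a freshly collected heap
@time run_full_timing()  # placeholder driver; the printout includes allocations and "% gc time"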

Are you including compilation times in your ‘overall execution time’? LoopVectorization.jl has quite a bit of compile time overhead.

IIRC there was some discussion that the parallel vectorization schemes don’t currently participate in Julia’s composable threading system?

I don’t think I’m including compile times, since I think @btime excludes them.
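
For reference, here is a minimal sketch (a generic toy function, not code from the package) of why @btime excludes compilation: the first call to a freshly defined @turbo function pays the compile cost, while @btime times the already compiled code over many runs.

using LoopVectorization, BenchmarkTools

function doubler_turbo!(y, x)   # hypothetical toy function
    @turbo for i in eachindex(x)
        y[i] = 2 * x[i]
    end
    return y
end

x = rand(100); y = similar(x);

@time doubler_turbo!(y, x)      # first call: includes compilation of the @turbo loop
@time doubler_turbo!(y, x)      # second call: already compiled
@btime doubler_turbo!($y, $x)   # many timed runs of compiled code; compile time excluded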


OK, thanks. Maybe @turbo just isn’t the tool for this particular problem.

If you aren’t using @tturbo or @turbo thread=true, then it shouldn’t be using multiple threads; just @turbo on its own is single-threaded.
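
To illustrate the distinction, here is a minimal sketch with a made-up loop (not code from the package): the plain macro only vectorizes the loop on a single thread, while @tturbo additionally splits the iterations across Julia threads.

using LoopVectorization

function axpy_turbo!(y, a, x)    # SIMD only, single-threaded
    @turbo for i in eachindex(x)
        y[i] = a * x[i] + y[i]
    end
    return y
end

function axpy_tturbo!(y, a, x)   # SIMD plus multithreading
    @tturbo for i in eachindex(x)
        y[i] = a * x[i] + y[i]
    end
    return y
end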

I would be interested in an example of it slowing your code down overall.

Can you profile it with and without @turbo to see what the pattern of slowdown looks like?
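
A sketch of how that comparison could be set up with the standard-library profiler; run_full_timing is again a placeholder for however the full calculation is invoked:

using Profile

Profile.clear()
@profile run_full_timing()                         # placeholder driver for the full calculation
Profile.print(format = :flat, sortedby = :count)   # repeat with and without @turbo and compare the hot lines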


Could you link us to the actual benchmark in question? I’d be interested to run it.


Sure, will post when I get home.

Okay, the routines are in the MolecularIntegrals.jl package, which computes the electron repulsion integrals for quantum chemistry programs.

I’ve been working on a standalone version of the bottleneck routine (called vrr!, for vertical recurrence relations) to ask the forum here for general advice on further optimization. That’s in the Pluto notebook standalone_vrr.jl, which has two versions, vrr! and vrr_turbo!. On my current 2014 Mac Mini, these individual routines clock in at ~27 and ~20 μs, respectively. They don’t have particularly long for loops to vectorize, which might be part of the problem.

The full code is called from timing.jl. I invoke it using

% julia --project=. test/timing.jl

from the main MolecularIntegrals.jl installation. Currently I just add the @turbo commands to the relevant for-loops and re-time, since the vrr! routine is several calls down in the integral code. But if you think a clear comparison is important, I can make _turbo versions of all these routines. Let me know.

These routines are in the HGP.jl file in the repo.

I hope this is clear. I’m still relatively new to Julia modules and such, and I could be doing something dumb or explaining something poorly. I’m grateful for any help you can offer.

Here’s the timing data I referred to earlier. Without @turbo, for the vrr! timing I get

vrr(4,4)        24.241 μs (19 allocations: 2.08 KiB)

and for timings of the full integral code using different sized basis sets, I get:

Ethane/sto3g: nbf=16
HGP        41.762 ms (19464 allocations: 4.72 MiB)
Ethane/6-31G: nbf=30
HGP        216.715 ms (203712 allocations: 45.46 MiB)
Ethane/cc-pvdz: nbf=60
HGP        2.520 s (833353 allocations: 927.58 MiB)

With @turbo, I see:

vrr(4,4)        17.470 μs (19 allocations: 2.08 KiB)

and for the full integral suite:

Ethane/sto3g: nbf=16
HGP        47.446 ms (19464 allocations: 4.72 MiB)
Ethane/6-31G: nbf=30
HGP        231.727 ms (203712 allocations: 45.46 MiB)
Ethane/cc-pvdz: nbf=60
HGP        2.837 s (833353 allocations: 927.58 MiB)

Looking at this part of the vrr! loop in HGP.jl:

if aminus > 0
  a_i = 0.5*ooze*ao2m[a][i]
  for m in 1:lim
    vrrs[m,a,cplus] = QC[i]*vrrs[m,a,c]+WQ[i]*vrrs[m+1,a,c] + 
      a_i*vrrs[m+1,aminus,c] +
      c_i*(vrrs[m,a,cminus]-zeta*ooze*vrrs[m+1,a,cminus])
  end
else
  for m in 1:lim
    vrrs[m,a,cplus] = QC[i]*vrrs[m,a,c]+WQ[i]*vrrs[m,a,c] + 
      c_i*(vrrs[m,a,cminus]-zeta*ooze*vrrs[m+1,a,cminus])
  end
end

Should the aminus <= 0 branch be

for m in 1:lim
  vrrs[m,a,cplus] = QC[i]*vrrs[m,a,c]+WQ[i]*vrrs[m+1,a,c] + 
    c_i*(vrrs[m,a,cminus]-zeta*ooze*vrrs[m+1,a,cminus])
end

Or should it be vrrs[m+1,a,c] on one side and vrrs[m,a,c] on the other side of the aminus branch, as it is on the cminus <= 0 branch, and also earlier before the c loops?


Wow. I think you’re correct. I wonder how this got past my test suite, which I thought was pretty extensive; sometimes molecular symmetry can make these terms zero. Thank you very much!

@Elrod Here’s my theory as to what could be going on. vrr!(amax,cmax) has two arguments, and the length of the innermost loops is amax+cmax+1. The case I’m timing here is the slowest call, with amax=cmax=4, but it’s far more common for these to have amax=cmax=1, which takes less time. If @turbo slightly slows things down when amax=cmax=1 and thus the innermost loop has only 3 iterations, but greatly speeds things up when amax=cmax=4 and the inner loop has 9, this could explain what I’m seeing.

I’ve updated the standalone_vrr.jl Pluto notebook to include these cases as well, and it appears to be consistent with this: timings of 1.979, 2.153, and 1.957 μs for vrr!, vrr_turbo!, and vrr_inbounds, respectively.

Does this make sense to you?


Yes, I was going to ask if the vrr! call you were benchmarking was really representative of the typical call.
For one thing, I noticed that in your actual code it gets called with SVector instead of Vector arguments.

@turbo speeds things up by evaluating multiple loop iterations in parallel using SIMD instructions.
For this to help, we need multiple loop iterations.
Of course, @simd works the same way, but the two macros make different trade-offs in performance as a function of the number of loop iterations.
Here is a very simple example, just using a sum:

julia> using LoopVectorization, BenchmarkTools

julia> function sumsimd(x)
           s = zero(eltype(x))
           @inbounds @simd for i ∈ eachindex(x)
               s += x[i]
           end
           s
       end
sumsimd (generic function with 1 method)

julia> function sumturbo(x)
           s = zero(eltype(x))
           @turbo for i ∈ eachindex(x)
               s += x[i]
           end
           s
       end
sumturbo (generic function with 1 method)

julia> x = rand(1);

julia> @btime sumsimd($x)
  1.764 ns (0 allocations: 0 bytes)
0.028409636745355238

julia> @btime sumturbo($x)
  3.996 ns (0 allocations: 0 bytes)
0.028409636745355238

sumsimd was over twice as fast for a single iteration.

using Cairo
using Fontconfig, Gadfly, LoopVectorization, BenchmarkTools, DataFrames

itimes = Matrix{Float64}(undef, 64, 2);
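# rows correspond to vector lengths 1:64; column 1 will hold the @simd times, column 2 the @turbo times (in seconds)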

@inline function isumsimd(x)
    s = zero(eltype(x))
    @inbounds @simd for i ∈ eachindex(x)
        s += x[i]
    end
    s
end
@inline function isumturbo(x)
    s = zero(eltype(x))
    @turbo for i ∈ eachindex(x)
        s += x[i]
    end
    s
end

for n ∈ axes(itimes,1)
    x = rand(n)
    itimes[n,1] = @belapsed isumsimd($x)
    itimes[n,2] = @belapsed isumturbo($x)
end

elements_per_ns = 1e-9 .* axes(itimes, 1) ./ itimes;
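# i.e. throughput: vector length n divided by elapsed time in nanoseconds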
df = DataFrame(elements_per_ns);
rename!(df, [:simd, :turbo]);
df.Size = axes(elements_per_ns,1);
dfs = stack(df, Not([:Size]));
rename!(dfs, [:Size, :Macro, :elements_per_ns]);
plt = plot(dfs, x = :Size, y = :elements_per_ns, Geom.line, color = :Macro);
plt |> PNG("sumbenchmark.png")

(Plot: elements per nanosecond vs. vector size for @simd and @turbo — sumbenchmark.png produced by the code above.)

I think overall LoopVectorization makes a much better trade-off, but @simd is faster for small numbers of iterations, as well as for small multiples of a power of 2 (32 on this computer, 16 for computers with AVX but not AVX512).
For the sum example, LoopVectorization should eventually start winning there as well, but it’ll take more iterations:

julia> x = rand(512);

julia> @btime isumsimd($x)
  18.055 ns (0 allocations: 0 bytes)
267.1071894162856

julia> @btime isumturbo($x)
  16.915 ns (0 allocations: 0 bytes)
267.1071894162856

Versus the first 8:

julia> first(df,8) # elements per nanosecond
8×3 DataFrame
 Row │ simd      turbo     Size
     │ Float64   Float64   Int64
─────┼───────────────────────────
   1 │ 0.713267  0.338753      1
   2 │ 1.08284   0.67659       2
   3 │ 1.21655   0.974976      3
   4 │ 1.51515   1.35318       4
   5 │ 1.8162    1.69377       5
   6 │ 2.17628   2.03321       6
   7 │ 2.17729   2.30491       7
   8 │ 2.17984   2.70819       8

@simd is over twice as fast for 1-element vectors. It takes LoopVectorization a while to catch up.


Excellent. Thanks for your insights and the time you spent looking at my problem.
