The @avx and @btime macros: strange interdependency

VivMendes · January 1, 2021, 4:45pm

I am a newbie to Julia, and I came up with a script to produce a bifurcation diagram for the logistic map. The logistic map is well known and I will be brief here: x(t+1) = r*x(t)*(1-x(t)), 0 ≤ r ≤ 4. For accuracy, I choose maxiter = 1000 and r = LinRange(0.0,4.0, 6001).

I have been able to run the code and produce such diagram. Performance is quite crucial in this kind of problem, and to check it in my code I tried without any macro and with the following macros: @inbounds, @simd and @avx. There seems to be a kind of interdependency between @avx and @btime in my code, which I do not understand. Below, I explain what I do using three scenarios: (i) applying @inbounds @simd, (ii) applying @avx, (iii) applying @avx and @btime. The results look very different indeed.

1. Applying the macros @inbounds @simd

using Plots, BenchmarkTools, LoopVectorization, SIMD

function bifur!(rs, x, xs, maxiter)
    for k = 1 : length(rs) 
        x[k,1] = xs # initial condition
        @inbounds @simd for j = 1 : maxiter-1
            x[k, j+1] =  rs[k] * x[k, j] * (1 - x[k,j])
        end
    end
    return x
end

rs = LinRange(0.0, 4.0, 6001) #rs = collect(rs)
xs = 0.5
maxiter = 1000 # maximum iterations
x = zeros(length(rs), maxiter)
 
#@btime begin
bifur!(rs, x, xs, maxiter)
#end

#@benchmark bifur!($rs, x, $xs, $maxiter)

@benchmark bifur!($rs, x, $xs, $maxiter) setup=(x = zeros(length($rs), $maxiter)) evals=1

@btime begin
    plot(rs, x[:, end-50:end], #avoid transients
    #plot($rs, @view($x[:, end-50:end]),
    seriestype = :scatter,
    markercolor=:blue,
    markerstrokecolor=:match,
    markersize = 1.1,
    markerstrokewidth = 0,
    legend = false,
    markeralpha = 0.3)
    #xticks! = 0:1:4
    xlims!(2.75,4)
    #savefig("zee_bifurcation.png")
end

The output that the function bifur! delivers is:

6001×1000 Array{Float64,2}:
 0.5  0.0          0.0          …  0.0        0.0         0.0
 0.5  0.000166667  1.11093e-7      0.0        0.0         0.0
 0.5  0.000333333  4.44296e-7      0.0        0.0         0.0
 0.5  0.0005       9.995e-7        0.0        0.0         0.0
 0.5  0.000666667  1.77659e-6      0.0        0.0         0.0
 0.5  0.000833333  2.77546e-6   …  0.0        0.0         0.0
 0.5  0.001        3.996e-6        0.0        0.0         0.0
 0.5  0.00116667   5.43809e-6      0.0        0.0         0.0
 0.5  0.00133333   7.10163e-6      0.0        0.0         0.0
 0.5  0.0015       8.9865e-6       0.0        0.0         0.0
 ⋮                              ⋱                         
 0.5  0.998667     0.00531912      0.226126   0.699039    0.840412
 0.5  0.998833     0.00465578      0.7908     0.660969    0.895311
 0.5  0.999        0.003992        0.915731   0.308363    0.852247
 0.5  0.999167     0.00332778   …  0.23146    0.710952    0.821312
 0.5  0.999333     0.00266311      0.895762   0.37324     0.935104
 0.5  0.9995       0.001998        0.9993     0.00279591  0.0111468
 0.5  0.999667     0.00133244      0.0453006  0.172936    0.571927
 0.5  0.999833     0.000666444     0.272303   0.792485    0.657701
 0.5  1.0          0.0          …  0.0        0.0         0.0

The execution time given by @benchmark bifur!($rs, x, $xs, $maxiter) setup=(x = zeros(length($rs), $maxiter)) evals=1 is as follows (actually, the function without those two macros runs with very similar times):

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     19.345 ms (0.00% GC)
  median time:      19.805 ms (0.00% GC)
  mean time:        20.000 ms (0.00% GC)
  maximum time:     28.163 ms (0.00% GC)
  --------------
  samples:          172
  evals/sample:     1

2. Here, I apply the macro @avx.
I keep the code exactly as it was in point (1), but I have to remove the @inbounds @simd to avoid LoadError (LoadError: Don't know how to handle expression), and call the bifur! function with exactly the same arguments:

function bifur!(rs, x, xs, maxiter)
    @avx for k = 1 : length(rs) # macro to vectorize the loop
        x[k,1] = xs # initial condition
        for j = 1 : maxiter-1
            x[k, j+1] =  rs[k] * x[k, j] * (1 - x[k,j])
        end
    end
    return x
end

rs = LinRange(0.0, 4.0, 6001) #rs = collect(rs)
xs = 0.5
maxiter = 1000 # maximum iterations
x = zeros(length(rs), maxiter)
 
#@btime begin
bifur!(rs, x, xs, maxiter)
#end

The output is now completely different:

6001×1000 Array{Float64,2}:
 0.5  0.0  0.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.5  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.5  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.5  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.5  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.5  0.0  0.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.5  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.5  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.5  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.5  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 ⋮                        ⋮              ⋱            ⋮                   
 0.5  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.5  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.5  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.5  0.0  0.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.5  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.5  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.5  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.5  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.5  0.0  0.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  0.0  0.0

The execution time is:

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.840 ms (0.00% GC)
  median time:      2.949 ms (0.00% GC)
  mean time:        3.408 ms (0.00% GC)
  maximum time:     12.612 ms (0.00% GC)
  --------------
  samples:          391
  evals/sample:     1

3. Applying the macros @avx and @btime

The problem with the approach in point (2) is that it renders the entire exercise useless: the loops do not produce the intended output, which is the 6001×1000 Array{Float64,2} as found in point (1). However, if I use the code of point (2), but call the bifur! as follows:

@btime begin
    bifur!(rs, x, xs, maxiter)
end

instead of

#@btime begin
bifur!(rs, x, xs, maxiter)
#end

the output turns out to be the correct one (the same as in point 1). And the execution time is

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.852 ms (0.00% GC)
  median time:      2.962 ms (0.00% GC)
  mean time:        3.410 ms (0.00% GC)
  maximum time:     12.561 ms (0.00% GC)
  --------------
  samples:          389
  evals/sample:     1

My fundamental question is: what is happening between the @avx and the @btime that leads the code to produce the correct output? Is this just a lousy coding style of myself? Or is there something that I overlooked when reading the docs?

It should be stressed that the @avx macro does indeed speed up the bifur! function (from 20ms to 3.4ms), which is 6 times faster. But it looks awkward that, to obtain the correct output, I have to call that function by using the @btime macro. Suppose I just want to call the function, without executing `@btime: what should I have to do to get the correct output?

Help would be appreciated.

baggepinnen · January 1, 2021, 5:05pm

the loop iterations are not independent, so use of both @simd and @avx is illegal. You violate the third point in the warning section in the LoopVectorization readme

mcabbott · January 1, 2021, 5:22pm

The loop over j needs to be sequential, but the loop over k does not. And should be innermost anyway since it’s first on x. This is about 5x quicker:

function bifur_rev!(rs, x, xs, maxiter)
	x[:,1] .= xs
    for j in 1:maxiter-1
        @inbounds @simd for k in 1:length(rs) 
            x[k, j+1] =  rs[k] * x[k, j] * (1 - x[k,j])
        end
    end
    return x
end

VivMendes · January 1, 2021, 6:08pm

@mcabbott Thank you very much.
It works well and does increase run time quite significantly. Following your suggestion, now the script runs 9x faster than the previous Matlab routine I had for the same issue.

VivMendes · January 1, 2021, 6:13pm

@baggepinnen Thank’s a lot. Now I understand better point N.3 in the Warning section of the docs of LoopVectorization.

Topic		Replies	Views
LoopVectorization: @turbo performs worse than @inbounds on trivial loop New to Julia question , simd , loopvectorization	9	2094	August 28, 2021
[ANN] LoopVectorization Package Announcements	157	23185	May 27, 2020
Poor performance of SIMD vectorization in the latest version of Julia (v1.11.2) Performance performance	19	799	January 8, 2025
LoopVectorization.jl: adding `@avx` makes code slower Performance question , tullio , loopvectorization	8	1157	August 29, 2020
Memory allocation and performance Performance memory-allocation	7	639	June 26, 2020

The @avx and @btime macros: strange interdependency

Related topics