# The @avx and @btime macros: strange interdependency

I am a newbie to Julia, and I came up with a script to produce a bifurcation diagram for the logistic map. The logistic map is well known and I will be brief here: `x(t+1) = r*x(t)*(1-x(t))`, `0 ≤ r ≤ 4`. For accuracy, I choose `maxiter = 1000` and `r = LinRange(0.0,4.0, 6001)`.

I have been able to run the code and produce such diagram. Performance is quite crucial in this kind of problem, and to check it in my code I tried without any macro and with the following macros: `@inbounds`, `@simd` and `@avx`. There seems to be a kind of interdependency between `@avx` and `@btime` in my code, which I do not understand. Below, I explain what I do using three scenarios: (i) applying `@inbounds @simd`, (ii) applying `@avx`, (iii) applying `@avx` and `@btime`. The results look very different indeed.

1. Applying the macros `@inbounds @simd`

``````using Plots, BenchmarkTools, LoopVectorization, SIMD

function bifur!(rs, x, xs, maxiter)
for k = 1 : length(rs)
x[k,1] = xs # initial condition
@inbounds @simd for j = 1 : maxiter-1
x[k, j+1] =  rs[k] * x[k, j] * (1 - x[k,j])
end
end
return x
end

rs = LinRange(0.0, 4.0, 6001) #rs = collect(rs)
xs = 0.5
maxiter = 1000 # maximum iterations
x = zeros(length(rs), maxiter)

#@btime begin
bifur!(rs, x, xs, maxiter)
#end

#@benchmark bifur!(\$rs, x, \$xs, \$maxiter)

@benchmark bifur!(\$rs, x, \$xs, \$maxiter) setup=(x = zeros(length(\$rs), \$maxiter)) evals=1

@btime begin
plot(rs, x[:, end-50:end], #avoid transients
#plot(\$rs, @view(\$x[:, end-50:end]),
seriestype = :scatter,
markercolor=:blue,
markerstrokecolor=:match,
markersize = 1.1,
markerstrokewidth = 0,
legend = false,
markeralpha = 0.3)
#xticks! = 0:1:4
xlims!(2.75,4)
#savefig("zee_bifurcation.png")
end
``````

The output that the function `bifur!` delivers is:

``````6001×1000 Array{Float64,2}:
0.5  0.0          0.0          …  0.0        0.0         0.0
0.5  0.000166667  1.11093e-7      0.0        0.0         0.0
0.5  0.000333333  4.44296e-7      0.0        0.0         0.0
0.5  0.0005       9.995e-7        0.0        0.0         0.0
0.5  0.000666667  1.77659e-6      0.0        0.0         0.0
0.5  0.000833333  2.77546e-6   …  0.0        0.0         0.0
0.5  0.001        3.996e-6        0.0        0.0         0.0
0.5  0.00116667   5.43809e-6      0.0        0.0         0.0
0.5  0.00133333   7.10163e-6      0.0        0.0         0.0
0.5  0.0015       8.9865e-6       0.0        0.0         0.0
⋮                              ⋱
0.5  0.998667     0.00531912      0.226126   0.699039    0.840412
0.5  0.998833     0.00465578      0.7908     0.660969    0.895311
0.5  0.999        0.003992        0.915731   0.308363    0.852247
0.5  0.999167     0.00332778   …  0.23146    0.710952    0.821312
0.5  0.999333     0.00266311      0.895762   0.37324     0.935104
0.5  0.9995       0.001998        0.9993     0.00279591  0.0111468
0.5  0.999667     0.00133244      0.0453006  0.172936    0.571927
0.5  0.999833     0.000666444     0.272303   0.792485    0.657701
0.5  1.0          0.0          …  0.0        0.0         0.0
``````

The execution time given by `@benchmark bifur!(\$rs, x, \$xs, \$maxiter) setup=(x = zeros(length(\$rs), \$maxiter)) evals=1` is as follows (actually, the function without those two macros runs with very similar times):

``````BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     19.345 ms (0.00% GC)
median time:      19.805 ms (0.00% GC)
mean time:        20.000 ms (0.00% GC)
maximum time:     28.163 ms (0.00% GC)
--------------
samples:          172
evals/sample:     1
``````

2. Here, I apply the macro `@avx`.
I keep the code exactly as it was in point (1), but I have to remove the `@inbounds @simd` to avoid LoadError (`LoadError: Don't know how to handle expression`), and call the `bifur!` function with exactly the same arguments:

``````function bifur!(rs, x, xs, maxiter)
@avx for k = 1 : length(rs) # macro to vectorize the loop
x[k,1] = xs # initial condition
for j = 1 : maxiter-1
x[k, j+1] =  rs[k] * x[k, j] * (1 - x[k,j])
end
end
return x
end

rs = LinRange(0.0, 4.0, 6001) #rs = collect(rs)
xs = 0.5
maxiter = 1000 # maximum iterations
x = zeros(length(rs), maxiter)

#@btime begin
bifur!(rs, x, xs, maxiter)
#end
``````

The output is now completely different:

``````6001×1000 Array{Float64,2}:
0.5  0.0  0.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  0.0  0.0
0.5  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
0.5  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
0.5  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
0.5  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
0.5  0.0  0.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  0.0  0.0
0.5  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
0.5  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
0.5  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
0.5  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
⋮                        ⋮              ⋱            ⋮
0.5  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
0.5  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
0.5  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
0.5  0.0  0.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  0.0  0.0
0.5  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
0.5  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
0.5  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
0.5  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
0.5  0.0  0.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  0.0  0.0
``````

The execution time is:

``````BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     2.840 ms (0.00% GC)
median time:      2.949 ms (0.00% GC)
mean time:        3.408 ms (0.00% GC)
maximum time:     12.612 ms (0.00% GC)
--------------
samples:          391
evals/sample:     1
``````

3. Applying the macros `@avx` and `@btime`

The problem with the approach in point (2) is that it renders the entire exercise useless: the loops do not produce the intended output, which is the 6001×1000 Array{Float64,2} as found in point (1). However, if I use the code of point (2), but call the `bifur!` as follows:

``````@btime begin
bifur!(rs, x, xs, maxiter)
end
``````

``````#@btime begin
bifur!(rs, x, xs, maxiter)
#end
``````

the output turns out to be the correct one (the same as in point 1). And the execution time is

``````BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     2.852 ms (0.00% GC)
median time:      2.962 ms (0.00% GC)
mean time:        3.410 ms (0.00% GC)
maximum time:     12.561 ms (0.00% GC)
--------------
samples:          389
evals/sample:     1
``````

My fundamental question is: what is happening between the `@avx` and the `@btime` that leads the code to produce the correct output? Is this just a lousy coding style of myself? Or is there something that I overlooked when reading the docs?

It should be stressed that the `@avx` macro does indeed speed up the `bifur!` function (from 20ms to 3.4ms), which is 6 times faster. But it looks awkward that, to obtain the correct output, I have to call that function by using the `@btime` macro. Suppose I just want to call the function, without executing `@btime: what should I have to do to get the correct output?

Help would be appreciated.

the loop iterations are not independent, so use of both `@simd` and `@avx` is illegal. You violate the third point in the warning section in the LoopVectorization readme
https://github.com/chriselrod/LoopVectorization.jl#warning

The loop over `j` needs to be sequential, but the loop over `k` does not. And should be innermost anyway since it’s first on `x`. This is about 5x quicker:

``````function bifur_rev!(rs, x, xs, maxiter)
x[:,1] .= xs
for j in 1:maxiter-1
@inbounds @simd for k in 1:length(rs)
x[k, j+1] =  rs[k] * x[k, j] * (1 - x[k,j])
end
end
return x
end
``````
1 Like

@mcabbott Thank you very much.
It works well and does increase run time quite significantly. Following your suggestion, now the script runs 9x faster than the previous Matlab routine I had for the same issue.

@baggepinnen Thank’s a lot. Now I understand better point N.3 in the Warning section of the docs of LoopVectorization.