# Making a fast evaluation of a function and its gradient

Just to illustrate the last point of my previous message: in other codes where I use LV, I usually get around conditional statements by replacing `if` statements with a boolean operation. For example, to replace the `if` statement that defines the stopping condition within the `feta` function, I would do something like:

```julia
# Visco-elastic power law rheology
function feta(x::AbstractVector{<:Real}, data) # FIX #3: made it AbstractVector
    eta0, G, n, dt = data
    tol = 1e-9
    Eii = sqrt(0.5*(x[1]^2 + x[2]^2) + x[3]^2)
    C   = 1.0/(2eta0) # FIX #5: instead of (2eta0)^(-1)
    eta = eta0*Eii^(1/n - 1)
    for it = 1:10 # inner Newton iteration
        Tii    = 2eta*Eii
        Eiiv   = C*Tii^n
        r      = Eii - Tii/(2G*dt) - Eiiv
        yes    = (abs(r) > tol) * 1.0 + (abs(r) < tol) * 0.0
        drdeta = -Eii/(G*dt) - Eiiv*n/eta
        eta   -= yes * r/drdeta
    end
    return eta
end
```

where `yes` switches from `1.0` to `0.0` once the tolerance criterion is met, so `eta` receives a Newton correction only while `yes == 1.0`.
However, this gives me an error of the type:

```
ERROR: LoadError: TypeError: non-boolean (VectorizationBase.Mask{4, UInt8}) used in boolean context
```

Would anyone have an idea about how to get around this?

From the stacktrace, you can see that the problem is in `abs`. `ForwardDiff` defines `abs` for dual numbers using branches.
So we need to define one without it:

```julia
@generated function Base.abs(x::ForwardDiff.Dual{T,S,N}) where {T,S<:AbstractSIMD,N}
    quote
        $(Expr(:meta, :inline))
        val = x.value
        p = x.partials
        cmp = val < zero($S)
        absx = $ifelse(cmp, -val, val)
        Base.Cartesian.@nexprs $N n -> p_n = p[n]
        ForwardDiff.Dual{$T}(absx, ForwardDiff.Partials(Base.Cartesian.@ntuple $N n -> $ifelse(cmp, -p_n, p_n)))
    end
end
```

I’ll add that definition somewhere.
LoopVectorization already has a few like this, but that isn’t really the right place.

Also, I’d suggest using `IfElse.ifelse` instead of the `*1` and `*0` thing.

Also, this won’t make your code faster if you’re still going through all the iterations.

If you want to see an example of a while loop, look at `gcd` in VectorizationBase.

Do run benchmarks though. There is a cost to this, so it’s worth double-checking whether/how much it helps over your typical range of values.
It’s probably safer to use terminating conditions rather than a fixed number of iterations though, in terms of guaranteeing a certain accuracy.
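To make that comparison concrete, here is a toy sketch (hypothetical `newton_fixed`/`newton_tol` helpers solving x² = a by Newton, not the `feta` problem): one version runs a fixed number of iterations, the other terminates on a tolerance.

```julia
# Fixed iteration count: branchless-friendly, but may do wasted work
# (or too little work) depending on the input.
function newton_fixed(a)
    x = a
    for _ in 1:10
        x -= (x^2 - a)/(2x)
    end
    return x
end

# Tolerance-based termination: guarantees accuracy, but carries a branch.
function newton_tol(a; tol = 1e-12)
    x = a
    while abs(x^2 - a) > tol
        x -= (x^2 - a)/(2x)
    end
    return x
end

newton_fixed(2.0) # ≈ sqrt(2)
newton_tol(2.0)   # ≈ sqrt(2)
```

Timing each with `BenchmarkTools.@btime` over your typical input range then tells you whether the branchless fixed-count version actually pays off.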


Thanks for pointing this out. Indeed, I could check that replacing `abs(r)` with `sqrt(r*r)` actually works there.

Also thanks for this, I did not know about this way of using conditional statements. So basically one could do something like:

```julia
yes = ifelse(sqrt(r*r) > tol, 1.0, 0.0)
```

which would indeed look better than the `*1` and `*0` thing. So far this returns a similar error to the earlier test (`TypeError: non-boolean (VectorizationBase.Mask{4, UInt8})...`), but I still have to investigate what really causes the issue (it also seems to be related to `gradient`).

Yes, exactly to the point. So, would there be a way to implement the behaviour of a

```julia
if (abs(r) < tol) break; end
```

without having to use a `while` statement?
(A `while` statement also leads to the same type of error as the one met above.)

Use `IfElse.ifelse`, not `Base.ifelse`.

Yes, you’d have to do it like in the example, using `VectorizationBase.vany` or `VectorizationBase.vall` to convert the mask into a boolean.


`IfElse.ifelse` does work, thanks!

And `VectorizationBase.vany` also does the job; I could update the core of `feta` as:

```julia
cond = true
while VectorizationBase.vany(cond)
    Tii    = 2eta*Eii
    Eiiv   = C*Tii^n
    r      = Eii - Tii/(2G*dt) - Eiiv
    cond   = IfElse.ifelse(sqrt(r*r) > tol, true, false)
    drdeta = -Eii/(G*dt) - Eiiv*n/eta
    eta   -= r/drdeta
end
```

Have not dared using the modification to `abs` though…
Double thumbs up, @Elrod!


Great.
Something to keep in mind is that LoopVectorization works by “vectorizing” loops, which means it performs multiple iterations of a loop at a time by replacing scalars with vectors.
The CPU can operate on these vectors more or less the same way it operates on scalars by using “SIMD” instructions (Single Instruction Multiple Data).
That is, a single CPU instruction will work on the entire vector (multiple data). Hence, this is an easy way to make code faster.
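As a minimal sketch of that idea (a hypothetical `scale_add!` function; shown with Base's `@simd` so it runs without extra packages, whereas LoopVectorization's `@turbo` is the more aggressive version of the same transformation):

```julia
# The compiler is given permission to vectorize: several values of i are
# processed by a single SIMD instruction instead of one at a time.
function scale_add!(y, a, x)
    @inbounds @simd for i in eachindex(y, x)
        y[i] = a*x[i] + y[i]
    end
    return y
end

scale_add!(ones(8), 3.0, fill(2.0, 8)) # every element becomes 3*2 + 1 = 7.0
```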

This doesn’t work well with branches, however. If you do a comparison with multiple data points, you get multiple true/false values (this is the `Mask` you saw appearing in the errors); which one do you use? `IfElse.ifelse` can do things like blend different vectors to emulate branching.
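A scalar sketch of the same idea (a hypothetical `relu` example): `ifelse` evaluates both arms and then selects, so there is no branch to diverge across SIMD lanes; on vectors, `IfElse.ifelse` performs this selection lane by lane.

```julia
# Both arms are computed; ifelse then picks one based on the condition.
# With SIMD vectors, IfElse.ifelse does this blend per lane of the Mask.
relu(x) = ifelse(x < zero(x), zero(x), x)

relu(-2.0) # 0.0
relu(3.0)  # 3.0
```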

But to actually do something like breaking out of a loop, you need to make your decision using the values of every boolean in the `Mask`.

As a correction to my earlier comment, you should probably define

```julia
v_all(m) = VectorizationBase.vall(VectorizationBase.collapse_and(m))
v_any(m) = VectorizationBase.vany(VectorizationBase.collapse_or(m))
```

as LoopVectorization will sometimes use `VecUnroll`s of `Mask`s instead of just `Mask`s, requiring the above treatment.

In this case, you don’t want to break until all are done, so `v_all(abs(r) < 1e-9)` makes sense.

But to use `v_any`/`v_all` correctly, you need to keep in mind that it is combining the results of multiple loop iterations. So you should only be using it for checks like this.
You normally don’t want one iteration to have any bearing on another.
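To make the pattern concrete, here is a scalar sketch of an early-exit Newton loop in the shape of `feta` (with `v_all` reduced to a plain `Bool` method so it runs without SIMD types; the parameter values in the call are purely illustrative):

```julia
# Scalar stand-in: on SIMD types, v_all would reduce the Mask across lanes.
v_all(m::Bool) = m

function newton_eta(eta0, Eii, C, n, G, dt; tol = 1e-9, maxit = 50)
    eta = eta0*Eii^(1/n - 1)
    for _ in 1:maxit
        Tii  = 2eta*Eii
        Eiiv = C*Tii^n
        r    = Eii - Tii/(2G*dt) - Eiiv
        v_all(abs(r) < tol) && break  # exit only once every lane has converged
        drdeta = -Eii/(G*dt) - Eiiv*n/eta
        eta -= r/drdeta
    end
    return eta
end

newton_eta(1.0, 1.0, 0.5, 1.0, 1.0, 1.0) # converges to 0.5 in two iterations
```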


You could copy/paste it for now, and it should be fine. LLVM can be pretty smart sometimes, and actually convert `sqrt(x*x)` into `fabs(x)`:

```julia
julia> myabs(x) = @fastmath sqrt(x*x)
myabs (generic function with 1 method)

julia> @code_llvm myabs(1.2)
```

```llvm
define double @julia_myabs_1349(double %0) #0 !dbg !5 {
top:
  %fabs = call fast double @llvm.fabs.f64(double %0), !dbg !7
  ret double %fabs, !dbg !10
}
```

But this does not work for the dual number version.

I may add Requires to SLEEFPirates and then add it there, along with some other functions.

Well, thanks again Elrod! Obviously, this opens a lot of new perspectives in terms of computing and it is very exciting.

If I may, I’d have another question regarding the combination of LV and ForwardDiff in practice. It’s always good to be able to port the notions we’ve discussed to other problems. A basic one would be a 1D Poisson equation. Using the above scripts, one can compute both the residual and the Jacobian coefficients using `DiffResults`, as you suggested. However, I have issues setting BCs when I use `IfElse.ifelse` to identify boundaries in `Poisson1D`. So the question is: how to best deal with the BCs in such a case? Is there another tool within VectorizationBase that could be used to this end?

I am aware that there are other/better strategies to derive Jacobians with Julia, but I think the problem with the MWE below is related to the above issues regarding the combination of conditionals, vectorisation and differentiation. If this is perceived as off-topic (because of another MWE), I'm happy to delete it and/or create a separate thread.

```julia
using IfElse, ForwardDiff, DiffResults, StaticArrays, LoopVectorization, SparseArrays, LinearAlgebra, Plots

function PoissonSparsity1D!(TC, TW, TE)
    TW[2:end-0] .= TC[1:end-1]
    TE[1:end-1] .= TC[2:end-0]
end

function Poisson1D(x::AbstractVector{<:Real}, data, i)
    kx, dx, ncx, TW_BC, TE_BC = data
    TC, TW, TE = x[1], x[2], x[3]
    TW1 = IfElse.ifelse(i==1,   2TW_BC-TC, TW)
    TE1 = IfElse.ifelse(i==ncx, 2TE_BC-TC, TE)
    qxW = -kx*(TC - TW1)/dx
    qxE = -kx*(TE1 - TC)/dx
    F   = (qxE - qxW)/dx
    return F
end

Poisson1D(TC::Real, TW::Real, TE::Real, data, i) = Poisson1D(@SVector([TC,TW,TE]), data, i) # use multiple dispatch to make it work with the desired argument list
Poisson_grad1D(Poisson1D, v, w, x, d, i) = ForwardDiff.gradient(x -> Poisson1D(x, d, i), @SVector([v,w,x])) # gradient only

function Poisson_valgrad1D(Poisson1D, v, w, x, d, i) # value + gradient in one pass
    sv = @SVector([v, w, x])
    dr = ForwardDiff.gradient!(DiffResults.GradientResult(sv), x -> Poisson1D(x, d, i), sv)
    return DiffResults.value(dr), DiffResults.gradient(dr)
end

function main()

    # Initialise
    ncx  = 10
    data = (kx=1.0, dx=0.1, ncx, TW_BC=1.0, TE_BC=2.0)
    F, T       = fill(0.0, ncx), fill(0.0, ncx)                 # residual, solution fields
    TC, TW, TE = fill(0.0, ncx), fill(0.0, ncx), fill(0.0, ncx) # working arrays (Float64)
    IC, IW, IE = fill(1, ncx), fill(1, ncx), fill(1, ncx)       # working arrays (Int)
    I, J, V    = fill(0, 3*ncx), fill(0, 3*ncx), fill(0.0, 3*ncx)

    # Get neighbour values
    TC .= T
    PoissonSparsity1D!(TC, TW, TE)

    # Generate connectivity
    IC .= collect(1:ncx)
    PoissonSparsity1D!(IC, IW, IE)

    # Compute finite difference coefficients
    t1 = @elapsed for i = 1:ncx # @turbo
        F[i], (TC[i], TW[i], TE[i]) = Poisson_valgrad1D(Poisson1D, TC[i], TW[i], TE[i], data, i)
    end

    # Fill triplets and assemble
    I .= [IC[:]; IC[:]; IC[:]]
    J .= [IC[:]; IW[:]; IE[:]]
    V .= [TC[:]; TW[:]; TE[:]]
    @time K = sparse(I, J, V)
    droptol!(K, 1e-9)

    # Solve
    @time T[:] .-= cholesky(K)\F[:]

    # Get neighbour values
    TC .= T
    PoissonSparsity1D!(TC, TW, TE)

    # Compute the residual
    t2 = @elapsed @turbo for i = 1:ncx
        F[i] = Poisson1D(TC[i], TW[i], TE[i], data, i)
    end
    println("r = ", norm(F[:]))
end

for it = 1:2
    @time main()
end
```

EDIT: Incidentally this ‘awkward’ trick in `Poisson1D`:

```julia
TW1   = (i==1)*(2TW_BC-TC) + (i>1)*TW
TE1   = (i==ncx)*(2TE_BC-TC) + (i<ncx)*TE
# TW1 = IfElse.ifelse(i==1,   2TW_BC-TC, TW)
# TE1 = IfElse.ifelse(i==ncx, 2TE_BC-TC, TE)
```

makes it work with both ForwardDiff and LV.
Unfortunately, it does not work in the 2D case (where `j` is the index corresponding to the second dimension):

```julia
TW1   = (i==1)*(-TC) + (i>1)*TW
TE1   = (i==ncx)*(-TC) + (i<ncx)*TE
TS1   = (j==1)*(2TS_BC-TC) + (j>1)*TS
TN1   = (j==ncy)*(2TN_BC-TC) + (j<ncy)*TN
```

That likely indicates that it is not the correct way to proceed…


Actually, after digging in VectorizationBase, I could try this for the 1D case:

```julia
west_BC = VectorizationBase.veq(i, 1)
west    = VectorizationBase.vne(i, 1)
east_BC = VectorizationBase.veq(i, ncx)
east    = VectorizationBase.vne(i, ncx)
TW1     = west_BC*(2TW_BC-TC) + west*TW
TE1     = east_BC*(2TE_BC-TC) + east*TE
```

which correctly identifies boundary nodes and thus works (LV + ForwardDiff).
However, I could not get it to work with the 2D case so far.

I’ve not tried your latest examples, but I’ll consolidate functions to get ForwardDiff and `AbstractSIMD`s to play well together in SIMDDualNumbers.jl.
It includes functions like `abs`, `max`, etc.

This package should be registered on Wednesday (there is a three day waiting period).

`using SIMDDualNumbers` should fix the error from the original version.
If you run into a problem with a function not working on some combination of `AbstractSIMD`s and `ForwardDiff.Dual`s, please file an issue there, preferably with the error message, e.g.

```julia
julia> for it=1:2
           @time main()
       end
ERROR: TypeError: non-boolean (VectorizationBase.Mask{8, UInt8}) used in boolean context
Stacktrace:
  [1] ifelse(::VectorizationBase.Mask{8, UInt8}, ::ForwardDiff.Dual{ForwardDiff.Tag{var"#3#4"{typeof(Poisson1D), NamedTuple{(:kx, :dx, :ncx, :TW_BC, :TE_BC), Tuple{Float64, Float64, Int64, Float64, Float64}}, VectorizationBase.MM{8, 1, Int64}}, VectorizationBase.Vec{8, Float64}}, VectorizationBase.Vec{8, Float64}, 3}, ::ForwardDiff.Dual{ForwardDiff.Tag{var"#3#4"{typeof(Poisson1D), NamedTuple{(:kx, :dx, :ncx, :TW_BC, :TE_BC), Tuple{Float64, Float64, Int64, Float64, Float64}}, VectorizationBase.MM{8, 1, Int64}}, VectorizationBase.Vec{8, Float64}}, VectorizationBase.Vec{8, Float64}, 3})
    @ IfElse ~/.julia/packages/IfElse/u1l5W/src/IfElse.jl:3
  [2] Poisson1D(x::SVector{3, ForwardDiff.Dual{ForwardDiff.Tag{var"#3#4"{typeof(Poisson1D), NamedTuple{(:kx, :dx, :ncx, :TW_BC, :TE_BC), Tuple{Float64, Float64, Int64, Float64, Float64}}, VectorizationBase.MM{8, 1, Int64}}, VectorizationBase.Vec{8, Float64}}, VectorizationBase.Vec{8, Float64}, 3}}, data::NamedTuple{(:kx, :dx, :ncx, :TW_BC, :TE_BC), Tuple{Float64, Float64, Int64, Float64, Float64}}, i::VectorizationBase.MM{8, 1, Int64})
```

So I’ll know what the problem is. E.g., here it tells me that I have to add some `ifelse` methods.


Thanks for the follow up @Elrod, I’ll be very happy to run some tests with `SIMDDualNumbers.jl`.

Just for the sake of it, I’ve added the 2D faulty MWE below, in case it is relevant for anyone. I will try to get it to work with `SIMDDualNumbers.jl` next week and we can continue the discussion on GitHub, if needed. Hence I’m not expecting any follow-up on this. Once again, many thanks to @DNF @Elrod @leandromartinez98 and @jling for all your precious advice and solutions on this topic.

```julia
using VectorizationBase, ForwardDiff, DiffResults, StaticArrays, LoopVectorization, SparseArrays, LinearAlgebra, Plots

function PoissonSparsity!(TC, TW, TE, TS, TN)
    TW[2:end-0,:] .= TC[1:end-1,:]
    TE[1:end-1,:] .= TC[2:end-0,:]
    TS[:,2:end-0] .= TC[:,1:end-1]
    TN[:,1:end-1] .= TC[:,2:end-0]
end

function Poisson(x::AbstractVector{<:Real}, data, i, j)
    kx, ky, dx, dy, ncx, ncy, TS_BC, TN_BC = data
    TC, TW, TE, TS, TN = x[1], x[2], x[3], x[4], x[5]
    west_BC  = VectorizationBase.veq(i, 1)
    west     = VectorizationBase.vne(i, 1)
    east_BC  = VectorizationBase.veq(i, ncx)
    east     = VectorizationBase.vne(i, ncx)
    south_BC = VectorizationBase.veq(j, 1)
    south    = VectorizationBase.vne(j, 1)
    north_BC = VectorizationBase.veq(j, ncy)
    north    = VectorizationBase.vne(j, ncy)
    TW1 = west_BC*(TC) + west*TW
    TE1 = east_BC*(TC) + east*TE
    TS1 = south_BC*(2TS_BC-TC) + south*TS
    TN1 = north_BC*(2TN_BC-TC) + north*TN
    qxW = -kx*(TC - TW1)/dx
    qxE = -kx*(TE1 - TC)/dx
    qyS = -ky*(TC - TS1)/dy
    qyN = -ky*(TN1 - TC)/dy
    F   = (qxE - qxW)/dx + (qyN - qyS)/dy
    return F
end

Poisson(TC::Real, TW::Real, TE::Real, TS::Real, TN::Real, data, i, j) = Poisson(@SVector([TC,TW,TE,TS,TN]), data, i, j) # use multiple dispatch to make it work with the desired argument list
Poisson_grad(Poisson, v, w, x, y, z, d, i, j) = ForwardDiff.gradient(x -> Poisson(x, d, i, j), @SVector([v,w,x,y,z])) # gradient only

function Poisson_valgrad(Poisson, v, w, x, y, z, d, i, j) # value + gradient in one pass
    sv = @SVector([v, w, x, y, z])
    dr = ForwardDiff.gradient!(DiffResults.GradientResult(sv), x -> Poisson(x, d, i, j), sv)
    return DiffResults.value(dr), DiffResults.gradient(dr)
end

function main()

    ncx, ncy = 10, 10
    data = (kx=1.0, ky=1.0, dx=0.1, dy=0.1, ncx, ncy, TS_BC=1.0, TN_BC=2.0)

    F  = fill(0.0, ncx, ncy)
    T  = fill(0.0, ncx, ncy)
    TC = fill(0.0, ncx, ncy)
    TW = fill(0.0, ncx, ncy)
    TE = fill(0.0, ncx, ncy)
    TS = fill(0.0, ncx, ncy)
    TN = fill(0.0, ncx, ncy)
    PoissonSparsity!(TC, TW, TE, TS, TN)

    # Compute finite difference coefficients
    t1 = @elapsed @turbo for j = 1:ncy
        for i = 1:ncx
            F[i,j], (TC[i,j], TW[i,j], TE[i,j], TS[i,j], TN[i,j]) = Poisson_valgrad(Poisson, TC[i,j], TW[i,j], TE[i,j], TS[i,j], TN[i,j], data, i, j)
        end
    end
end

for it = 1:2
    @time main()
end
```

@Elrod: just to confirm that `SIMDDualNumbers.jl` solved the issue with `ifelse` in more-than-one-dimensional function + gradient evaluations with both LV and ForwardDiff.
Thanks again!

```julia
using SIMDDualNumbers, ForwardDiff, DiffResults, StaticArrays, LoopVectorization, SparseArrays, LinearAlgebra, Plots

function PoissonSparsity!(TC, TW, TE, TS, TN)
    TW[2:end-0,:] .= TC[1:end-1,:]
    TE[1:end-1,:] .= TC[2:end-0,:]
    TS[:,2:end-0] .= TC[:,1:end-1]
    TN[:,1:end-1] .= TC[:,2:end-0]
end

function Poisson(x::AbstractVector{<:Real}, data, i, j)
    kx, ky, dx, dy, ncx, ncy, TS_BC, TN_BC = data
    TC, TW, TE, TS, TN = x[1], x[2], x[3], x[4], x[5]
    # Stencil points
    TW1 = SIMDDualNumbers.ifelse( i==1,          TC, TW )
    TE1 = SIMDDualNumbers.ifelse( i==ncx,        TC, TE )
    TS1 = SIMDDualNumbers.ifelse( j==1,   2TS_BC-TC, TS )
    TN1 = SIMDDualNumbers.ifelse( j==ncy, 2TN_BC-TC, TN )
    # Fluxes
    qxW = -kx*(TC - TW1)/dx
    qxE = -kx*(TE1 - TC)/dx
    qyS = -ky*(TC - TS1)/dy
    qyN = -ky*(TN1 - TC)/dy
    # Balance
    F   = (qxE - qxW)/dx + (qyN - qyS)/dy
    return F
end

Poisson(TC::Real, TW::Real, TE::Real, TS::Real, TN::Real, data, i, j) = Poisson(@SVector([TC,TW,TE,TS,TN]), data, i, j) # use multiple dispatch to make it work with the desired argument list
Poisson_grad(Poisson, v, w, x, y, z, d, i, j) = ForwardDiff.gradient(x -> Poisson(x, d, i, j), @SVector([v,w,x,y,z])) # gradient only

function Poisson_valgrad(Poisson, v, w, x, y, z, d, i, j) # value + gradient in one pass
    sv = @SVector([v, w, x, y, z])
    dr = ForwardDiff.gradient!(DiffResults.GradientResult(sv), x -> Poisson(x, d, i, j), sv)
    return DiffResults.value(dr), DiffResults.gradient(dr)
end

function main()

    ncx, ncy = 100, 100
    data = (kx=1.0, ky=1.0, dx=0.1, dy=0.1, ncx, ncy, TS_BC=1.0, TN_BC=2.0)

    # Arrays
    F  = fill(0.0, ncx, ncy)
    T  = fill(0.0, ncx, ncy)
    Q  = fill(0.0, ncx, ncy)
    TC = fill(0.0, ncx, ncy)
    TW = fill(0.0, ncx, ncy)
    TE = fill(0.0, ncx, ncy)
    TS = fill(0.0, ncx, ncy)
    TN = fill(0.0, ncx, ncy)
    IC = reshape(1:ncx*ncy, ncx, ncy) # equation numbers
    IW = fill(1, ncx, ncy)
    IE = fill(1, ncx, ncy)
    IS = fill(1, ncx, ncy)
    IN = fill(1, ncx, ncy)

    # Source term
    Q[1:Int(ncx/2),:] .= 1.0

    # Compute stencil connectivity
    PoissonSparsity!(IC, IW, IE, IS, IN)

    # Compute finite difference coefficients
    PoissonSparsity!(TC .= T, TW, TE, TS, TN)
    t1 = @elapsed @turbo for j = 1:ncy
        for i = 1:ncx
            F[i,j], (TC[i,j], TW[i,j], TE[i,j], TS[i,j], TN[i,j]) = Poisson_valgrad(Poisson, TC[i,j], TW[i,j], TE[i,j], TS[i,j], TN[i,j], data, i, j)
            F[i,j] -= Q[i,j]
        end
    end

    # Residual before solve
    println("r = ", norm(F))

    # Assemble stiffness matrix
    I = fill(0,   5*ncx*ncy)
    J = fill(0,   5*ncx*ncy)
    V = fill(0.0, 5*ncx*ncy)
    I .= [IC[:]; IC[:]; IC[:]; IC[:]; IC[:]]
    J .= [IC[:]; IW[:]; IE[:]; IS[:]; IN[:]]
    V .= [TC[:]; TW[:]; TE[:]; TS[:]; TN[:]]
    @time K = sparse(I, J, V)
    droptol!(K, 1e-6)

    # Solve
    @time T[:] .-= cholesky(K)\F[:]

    # Compute residual after solve
    PoissonSparsity!(TC .= T, TW, TE, TS, TN)
    t2 = @elapsed @turbo for j = 1:ncy
        for i = 1:ncx
            F[i,j] = Poisson(TC[i,j], TW[i,j], TE[i,j], TS[i,j], TN[i,j], data, i, j)
            F[i,j] -= Q[i,j]
        end
    end
    println("r = ", norm(F[:]))

    # Post-processing
    display(plot(heatmap(T', c=:roma)))
    println("Time for res. + stencil eval. --> ", t1, " s")
    println("Time for res. eval.           --> ", t2, " s")
end

for it = 1:2
    @time main()
end
```

Note that LoopVectorization 0.12.83 and newer should automatically `using SIMDDualNumbers` for you if you `using LoopVectorization, ForwardDiff`.
