Making a fast evaluation of a function and its gradient

Gravlax · October 1, 2021, 6:49am

Just to illustrate the last point of my previous message. In other codes where I use LV, I usually ‘get around’ conditional statements by replacing if statements with some boolean operation. For example, if I would replace the if statement that defines the stopping condition within the feta function, I would do something like:

# Visco-elastic power law rheology
function feta(x::AbstractVector{<:Real}, data) # FIX #3: made it AbstractVector
    eta0, G, n, dt = data
    tol = 1e-9
    Eii = sqrt( 0.5*(x[1]^2 + x[2]^2) + x[3]^2)
    C   = 1.0/(2eta0) # FIX #5 instead of (2eta0)^(-1)
    eta = eta0*Eii^(1/n-1)
    for it=1:10 # inner Newton iteration
        Tii    = 2eta*Eii
        Eiiv   = C*Tii^n
        r      = Eii - Tii/(2G*dt) - Eiiv
        yes    =  (abs(r)>tol) * 1.0 + (abs(r)<tol) * 0.0
        drdeta = -Eii/(G*dt) - Eiiv*n/eta
        eta   -= yes * r/drdeta 
    end
    return eta
end

where yes varies between 1.0 to 0.0 if the tolerance criterion is met.
Then, eta receives a Newton correction only if yes = 1.0
However, this gives me an error of the type:

ERROR: LoadError: TypeError: non-boolean (VectorizationBase.Mask{4, UInt8}) used in boolean context

Would anyone have an idea about how to get around this?

Elrod · October 1, 2021, 8:06am

From the stacktrace, you can see that the problem is in abs. ForwardDiff defines abs for dual numbers using branches.
So we need to define one without it:

@generated function Base.abs(x::ForwardDiff.Dual{T,S,N}) where {T,S<:AbstractSIMD,N}
  quote
    $(Expr(:meta,:inline))
    val = x.value
    p = x.partials
    cmp = val < zero($S)
    absx = $ifelse(cmp, -val, val)
    Base.Cartesian.@nexprs $N n -> p_n = p[n]
    ForwardDiff.Dual{$T}(absx, ForwardDiff.Partials(Base.Cartesian.@ntuple $N n -> $ifelse(cmp, -p_n, p_n)))
  end
end

I’ll add that definition somewhere.
LoopVectorization already has a few like this, but that isn’t really the right place.

Also, I’d suggest using IfElse.ifelse instead of the *1 and *0 thing.

Also, this won’t make your code faster if you’re still going through the iterations.

If you want to see an example of a while loop, look at gcd in VectorizationBase.

Do run benchmarks though. There is a cost to this, so it’s worth double checking whether/how much it helps over your typical range of values.
It’s probably safer to use terminating conditions rather than a fixed number of values though, in terms of guaranteeing a certain accuracy.

Gravlax · October 1, 2021, 8:53am

Thanks for pointing this out. Indeed I could check that replacing abs(r) by sqrt(r*r) actually works there.

Also thanks for this, I did not know about this way of using conditional statements. So basically one could do something like:

yes    = ifelse(sqrt(r*r)>tol, 1.0, 0.0)

which would indeed look better than the the *1 and *0 thing. So far this returns a similar error than the earliest test (TypeError: non-boolean (VectorizationBase.Mask{4, UInt8})...) but I still have to investigate to see what really causes the issue (seems also to be related to gradient).

Yes, to the point So, would there be a way to implement the behaviour of a

if (abs(r)<tol) break; end

without having to use a while statement?
(A while statement also leads to the same type of error as met above).

Elrod · October 1, 2021, 8:54am

Use IfElse.ifelse, not Base.ifelse.

Yes, you’d have to do it like in the example, using VectorizationBase.vany or VectorizationBase.vall to convert the mask into a boolean.

Gravlax · October 1, 2021, 9:00am

does works, thanks!

And VectorizationBase.vany also does the job, I could update the core of ‘feta’ as:

cond = true
    while VectorizationBase.vany(cond)
        Tii    = 2eta*Eii
        Eiiv   = C*Tii^n
        r      = Eii - Tii/(2G*dt) - Eiiv
        cond   = IfElse.ifelse(sqrt(r*r)>tol, true, false)
        drdeta = -Eii/(G*dt) - Eiiv*n/eta
        eta   -=  r/drdeta 
    end

Have not dared using the modification to abs though…
Double thumps up Elrod

Elrod · October 1, 2021, 9:18am

Great.
Something to keep in mind is that LoopVectorization works by “vectorizing” loops, which means it performs multiple iterations of a loop at a time by replacing scalars with vectors.
The CPU can operate on these vectors more or less the same way it operates on scalars by using “SIMD” instructions (Single Instruction Multiple Data).
That is, a single CPU instruction will work on the entire vector (multiple data). Hence, this is an easy way to make code faster.

This doesn’t work well with branches, however. If you do a comparison with multiple data points, you get multiple true/false values (this is the Mask you saw appearing in the errors), which do you use? IfElse.ifelse can do things like blend different vectors, to emulate branching.

But to actually do something like breaking out of a loop, you need to consider your decision using the values of every boolean in the Mask.

As a correction to my earlier comment, you should probably define

v_all(m) = VectorizationBase.vall(VectorizationBase.collapse_and(m))
v_any(m) = VectorizationBase.vany(VectorizationBase.collapse_or(m))

as LoopVectorization will sometimes use VecUnrolls of Masks instead of just Masks, requiring the above treatment.

In this case, you don’t want to break until all are done, so v_all(abs(r) < 1e-9) makes sense.

But to use v_any/v_all correctly, you need to keep in mind that it is combining the results of multiple loop iterations. So you should only be using it for checks like this.
You normally don’t want one iteration to have any bearing on another.

Elrod · October 1, 2021, 9:19am

You could copy/paste it for now, and it should be fine. LLVM can be pretty smart sometimes, and actually convert sqrt(x*x) into fabs(x):

julia> myabs(x) = @fastmath sqrt(x*x)
myabs (generic function with 1 method)

julia> @cl myabs(1.2)

define double @julia_myabs_1349(double %0) #0 !dbg !5 {
top:
  %fabs = call fast double @llvm.fabs.f64(double %0), !dbg !7
  ret double %fabs, !dbg !10
}

But this does not work for the dual number version.

I may add Requires to SLEEFPirates and then add it there, along with some other functions.

Gravlax · October 1, 2021, 4:10pm

Well, thanks again Elrod! Obviously, this opens a lot of new perspectives in terms of computing and it is very exciting.

If I may, I’d have another question regarding the combination of LV and ForwardDiff in practice. It’s always good to be able to port the notions we’ve discussed to other problems. A basic one would be a 1D Poisson equation. By using the above scripts, once can compute both the residual and the Jacobian coefficients using DiffResults as you suggested. However, I have issues setting BCs when I use the Else.ifelse to identify boundaries in Poisson1D. So the question is: how to best deal with the BC such a case? Is there another tool within VectorizationBase that could be use to this end?

I am aware that there are other/better strategies to derive Jacobians with Julia but I think the problem with the MWE below is related to the above issues regarding the combination of conditions, vectorisation and differentiation. If this is perceived as off-topic (because of another MWE), I happy to delete and/or create a separate thread.

using IfElse, ForwardDiff, StaticArrays, LoopVectorization, SparseArrays, LinearAlgebra, Plots

function PoissonSparsity1D!(TC,TW,TE)
    TW[2:end-0] .= TC[1:end-1]; 
    TE[1:end-1] .= TC[2:end-0]; 
end

function Poisson1D(x::AbstractVector{<:Real}, data, i) 
    kx, dx, ncx, TW_BC, TE_BC = data
    TC, TW, TE = x[1], x[2], x[3]
    TW1 = IfElse.ifelse(i==1,   2TW_BC-TC, TW)
    TE1 = IfElse.ifelse(i==ncx, 2TE_BC-TC, TE)
    qxW = -kx*(TC - TW1)/dx
    qxE = -kx*(TE1 - TC)/dx 
    F   = (qxE - qxW)/dx
    return F
end

Poisson1D(TC::Real,TW::Real,TE::Real,data,i) = Poisson1D(@SVector([TC,TW,TE]),data,i)  # use multiple dispatch to make it work with the desired argument list
Poisson_grad1D(Poisson1D,v,w,x,d,i) = ForwardDiff.gradient(x -> Poisson1D(x, d, i),@SVector([v,w,x]))
function Poisson_valgrad1D(Poisson1D,v,w,x,d,i)
    sv = @SVector([v,w,x])
    dr = ForwardDiff.gradient!(DiffResults.GradientResult(sv), x -> Poisson1D(x, d, i), sv)
    DiffResults.value(dr), DiffResults.gradient(dr)
end

function main()

    # Initialise
    ncx = 10
    data = (kx=1.0, dx=0.1, ncx, TW_BC=1.0, TE_BC=2.0)
    F, T = fill(0.0, ncx), fill(0.0, ncx)  # residual, solution fields 
    TC,TW,TE = fill(0.0,  ncx), fill(0.0,   ncx), fill(0.0,   ncx) # working arrays double
    IC,IW,IE = fill(  1,  ncx), fill(  1,   ncx), fill(  1,   ncx) # working arrays double
    I, J, V  = fill(  0,3*ncx), fill(  0, 3*ncx), fill(0.0, 3*ncx)

    # Get neigbour values and generate connectivity
    TC .= T
    PoissonSparsity1D!(TC,TW,TE)

    # Generate connectivity
    IC .= collect(1:ncx)
    PoissonSparsity1D!(IC,IW,IE) 
    
    # Compute Finite Difference coefficients
    t1 = @elapsed for i=1:ncx # @turbo 
        F[i], (TC[i], TW[i], TE[i]) = Poisson_valgrad1D(Poisson1D,TC[i], TW[i], TE[i],data,i)
    end

    # Fill triplets and assemble
    I .= [IC[:]; IC[:]; IC[:]]
    J .= [IC[:]; IW[:]; IE[:]]
    V .= [TC[:]; TW[:]; TE[:]]
    @time K  = sparse(I,J,V)
    droptol!(K,1e-9)

    # Solve
    @time T[:] .-= cholesky(K)\F[:]

    # Get neigbour values
    TC .= T
    PoissonSparsity1D!(TC,TW,TE)

    # Compute Finite Difference coefficients
    t2 = @elapsed @turbo for i=1:ncx
        F[i] = Poisson1D(TC[i],TW[i],TE[i],data,i)
    end
    println("r = ", norm(F[:]))
end

for it=1:2
    @time main()
end

EDIT: Incidentally this ‘awkward’ trick in Poisson1D:

TW1   = (i==1)*(2TW_BC-TC) + (i>1)*TW
TE1   = (i==ncx)*(2TE_BC-TC) + (i<ncx)*TE
 # TW1 = IfElse.ifelse(i==1,   2TW_BC-TC, TW)
 # TE1 = IfElse.ifelse(i==ncx, 2TE_BC-TC, TE)

makes it it work with both ForwardDiff and LV.
Unfortunately, it does not work in the 2D case (where j is the index corresponding to the second dimension).

TW1   = (i==1)*(-TC) + (i>1)*TW
TE1   = (i==ncx)*(-TC) + (i<ncx)*TE
TS1   = (j==1)*(2TS_BC-TC) + (j>1)*TS
TN1   = (j==ncy)*(2TN_BC-TC) + (j<ncy)*TN

That likely indicates that it is not the correct way to proceed…

Gravlax · October 2, 2021, 5:54pm

Actually, after digging in VectorizationBase, I could try this for the 1D case:

west_BC = VectorizationBase.veq(i,1)
west    = VectorizationBase.vne(i,1)
east_BC = VectorizationBase.veq(i,ncx)
east    = VectorizationBase.vne(i,ncx)
TW1     = west_BC*(2TW_BC-TC) + west*TW
TE1     = east_BC*(2TE_BC-TC) + east*TE

which can correctly identifies boundary nodes and thus works (LV + ForwardDiff)
However, I could not get it to work with a 2D case so far.

Elrod · October 2, 2021, 11:12pm

I’ve not tried your latest examples, but I’ll consolidate functions to get ForwardDiff and AbstractSIMDs to play well together in SIMDDualNumbers.jl.
It includes functions like abs, max, etc.

This package should be registered on Wednesday (there is a three day waiting period).

using SIMDDualNumbers should fix the error from the original version.
If you run into a problem with a number not working on some combination of AbstractSIMDs and ForwardDiff.Dual, please file an issue there, preferably with the error message, e.g.

julia> for it=1:2
           @time main()
       end
ERROR: TypeError: non-boolean (VectorizationBase.Mask{8, UInt8}) used in boolean context
Stacktrace:
  [1] ifelse(::VectorizationBase.Mask{8, UInt8}, ::ForwardDiff.Dual{ForwardDiff.Tag{var"#3#4"{typeof(Poisson1D), NamedTuple{(:kx, :dx, :ncx, :TW_BC, :TE_BC), Tuple{Float64, Float64, Int64, Float64, Float64}}, VectorizationBase.MM{8, 1, Int64}}, VectorizationBase.Vec{8, Float64}}, VectorizationBase.Vec{8, Float64}, 3}, ::ForwardDiff.Dual{ForwardDiff.Tag{var"#3#4"{typeof(Poisson1D), NamedTuple{(:kx, :dx, :ncx, :TW_BC, :TE_BC), Tuple{Float64, Float64, Int64, Float64, Float64}}, VectorizationBase.MM{8, 1, Int64}}, VectorizationBase.Vec{8, Float64}}, VectorizationBase.Vec{8, Float64}, 3})
    @ IfElse ~/.julia/packages/IfElse/u1l5W/src/IfElse.jl:3
  [2] Poisson1D(x::SVector{3, ForwardDiff.Dual{ForwardDiff.Tag{var"#3#4"{typeof(Poisson1D), NamedTuple{(:kx, :dx, :ncx, :TW_BC, :TE_BC), Tuple{Float64, Float64, Int64, Float64, Float64}}, VectorizationBase.MM{8, 1, Int64}}, VectorizationBase.Vec{8, Float64}}, VectorizationBase.Vec{8, Float64}, 3}}, data::NamedTuple{(:kx, :dx, :ncx, :TW_BC, :TE_BC), Tuple{Float64, Float64, Int64, Float64, Float64}}, i::VectorizationBase.MM{8, 1, Int64})

So I’ll know what the problem is. E.g., here that I have to add some ifelse methods.

Gravlax · October 3, 2021, 8:58am

Thanks for the follow up @Elrod, I’ll be very happy to run some tests with SIMDDualNumbers.jl.

Just for the sake of it, I’ve added the 2D faulty MWE below, in case it can be relevant for anyone. I will try to get it to work with SIMDDualNumbers.jl next week and we can continue the discussion on GitHub, if needed. Hence I’m not expecting any follow up on this. Once again many thanks to @DNF @Elrod @lmiq and @jling for all your precious advices and solutions on this topic.

using VectorizationBase, ForwardDiff, StaticArrays, LoopVectorization, SparseArrays, LinearAlgebra, Plots

function PoissonSparsity!(TC,TW,TE,TS,TN)
    TW[2:end-0,:] .= TC[1:end-1,:]; 
    TE[1:end-1,:] .= TC[2:end-0,:]; 
    TS[:,2:end-0] .= TC[:,1:end-1]; 
    TN[:,1:end-1] .= TC[:,2:end-0];
end

function Poisson(x::AbstractVector{<:Real}, data, i, j) 
    kx, ky, dx, dy, ncx, ncy, TS_BC, TN_BC = data
    TC, TW, TE, TS, TN = x[1], x[2], x[3], x[4], x[5]
    west_BC  = VectorizationBase.veq(i,1)
    west     = VectorizationBase.vne(i,1)
    east_BC  = VectorizationBase.veq(i,ncx)
    east     = VectorizationBase.vne(i,ncx)
    south_BC = VectorizationBase.veq(j,1)
    south    = VectorizationBase.vne(j,1)
    north_BC = VectorizationBase.veq(j,ncy)
    north    = VectorizationBase.vne(j,ncy)
    TW1   = west_BC*(TC) + west*TW
    TE1   = east_BC*(TC) + east*TE
    TS1   = south_BC*(2TS_BC-TC) + south*TS
    TN1   = north_BC*(2TN_BC-TC) + north*TN
    qxW = -kx*(TC - TW1)/dx
    qxE = -kx*(TE1 - TC)/dx 
    qyS = -ky*(TC - TS1)/dy
    qyN = -ky*(TN1 - TC)/dy
    F   = (qxE - qxW)/dx + (qyN - qyS)/dy
    return F
end

Poisson(TC::Real,TW::Real,TE::Real,TS::Real,TN::Real,data,i,j) = Poisson(@SVector([TC,TW,TE,TS,TN]),data,i,j)  # use multiple dispatch to make it work with the desired argument list
Poisson_grad(Poisson,v,w,x,y,z,d,i,j) = ForwardDiff.gradient(x -> Poisson(x, d, i, j),@SVector([v,w,x,y,z])) # construct gradient function for feta()
function Poisson_valgrad(Poisson,v,w,x,y,z,d,i,j)
    sv = @SVector([v,w,x,y,z])
    dr = ForwardDiff.gradient!(DiffResults.GradientResult(sv), x -> Poisson(x, d, i, j), sv)
    DiffResults.value(dr), DiffResults.gradient(dr)
end

function main()

    ncx, ncy = 10, 10
    data = (kx=1.0, ky=1.0, dx=0.1, dy=0.1, ncx, ncy, TS_BC=1.0, TN_BC=2.0)
    
    F  = fill(0.0, ncx, ncy)
    T  = fill(0.0, ncx, ncy) 
    TC = fill(0.0, ncx, ncy)
    TW = fill(0.0, ncx, ncy) 
    TE = fill(0.0, ncx, ncy) 
    TS = fill(0.0, ncx, ncy) 
    TN = fill(0.0, ncx, ncy) 
    PoissonSparsity!(TC,TW,TE,TS,TN)
    
    # Compute Finite Difference coefficients
    t1 = @elapsed @turbo  for j=1:ncy 
        for i=1:ncx
            F[i,j], (TC[i,j], TW[i,j], TE[i,j], TS[i,j], TN[i,j]) = Poisson_valgrad(Poisson,TC[i,j], TW[i,j], TE[i,j], TS[i,j], TN[i,j],data,i,j)
        end
    end
end

for it=1:2
    @time main()
end

Gravlax · October 11, 2021, 1:44pm

@Elrod: just to confirm that SIMDDualNumbers.jl solved the issue with ifelse in ‘more-than-one’ dimensional function + gradient evaluation with both LV and ForwardDiff.
Thanks again!

using SIMDDualNumbers, ForwardDiff, StaticArrays, LoopVectorization, SparseArrays, LinearAlgebra, Plots

function PoissonSparsity!(TC,TW,TE,TS,TN)
    TW[2:end-0,:] .= TC[1:end-1,:]; 
    TE[1:end-1,:] .= TC[2:end-0,:]; 
    TS[:,2:end-0] .= TC[:,1:end-1]; 
    TN[:,1:end-1] .= TC[:,2:end-0];
end

function Poisson(x::AbstractVector{<:Real}, data, i, j) 
    kx, ky, dx, dy, ncx, ncy, TS_BC, TN_BC = data
    TC, TW, TE, TS, TN = x[1], x[2], x[3], x[4], x[5]
    # Stencil points
    TW1 = SIMDDualNumbers.ifelse( i==1,          TC, TW )
    TE1 = SIMDDualNumbers.ifelse( i==ncx,        TC, TE )
    TS1 = SIMDDualNumbers.ifelse( j==1,   2TS_BC-TC, TS )
    TN1 = SIMDDualNumbers.ifelse( j==ncy, 2TN_BC-TC, TN )
    # Fluxes
    qxW = -kx*(TC - TW1)/dx
    qxE = -kx*(TE1 - TC)/dx 
    qyS = -ky*(TC - TS1)/dy
    qyN = -ky*(TN1 - TC)/dy
    # Balance
    F   = (qxE - qxW)/dx + (qyN - qyS)/dy
    return F
end

Poisson(TC::Real,TW::Real,TE::Real,TS::Real,TN::Real,data,i,j) = Poisson(@SVector([TC,TW,TE,TS,TN]),data,i,j)  # use multiple dispatch to make it work with the desired argument list
Poisson_grad(Poisson,v,w,x,y,z,d,i,j) = ForwardDiff.gradient(x -> Poisson(x, d, i, j),@SVector([v,w,x,y,z])) # construct gradient function for feta()
function Poisson_valgrad(Poisson,v,w,x,y,z,d,i,j)
    sv = @SVector([v,w,x,y,z])
    dr = ForwardDiff.gradient!(DiffResults.GradientResult(sv), x -> Poisson(x, d, i, j), sv)
    DiffResults.value(dr), DiffResults.gradient(dr)
end

function main()

    ncx, ncy = 100, 100
    data = (kx=1.0, ky=1.0, dx=0.1, dy=0.1, ncx, ncy, TS_BC=1.0, TN_BC=2.0)

    # Arrays
    F  = fill(0.0, ncx, ncy)
    T  = fill(0.0, ncx, ncy) 
    Q  = fill(0.0, ncx, ncy) 
    TC = fill(0.0, ncx, ncy)
    TW = fill(0.0, ncx, ncy) 
    TE = fill(0.0, ncx, ncy) 
    TS = fill(0.0, ncx, ncy) 
    TN = fill(0.0, ncx, ncy) 
    IC = reshape(1:ncx*ncy,ncx,ncy) # equation numbers
    IW = fill(  1, ncx, ncy) 
    IE = fill(  1, ncx, ncy) 
    IS = fill(  1, ncx, ncy) 
    IN = fill(  1, ncx, ncy) 

    # Source term
    Q[1:Int(ncx/2),:] .= 1.0

    # Compute stencil connectivity
    PoissonSparsity!(IC,IW,IE,IS,IN)
    
    # Compute Finite Difference coefficients
    PoissonSparsity!(TC.=T,TW,TE,TS,TN)
    t1 = @elapsed @turbo for j=1:ncy
        for i=1:ncx
            F[i,j], (TC[i,j], TW[i,j], TE[i,j], TS[i,j], TN[i,j]) = Poisson_valgrad(Poisson,TC[i,j], TW[i,j], TE[i,j], TS[i,j], TN[i,j],data,i,j)
            F[i,j] -= Q[i,j]
        end
    end

    # Residual before solve
    println("r = ", norm(F))

    # Assemble stiffness matrix
    I = fill( 0  , 5*ncx*ncy )
    J = fill( 0  , 5*ncx*ncy )
    V = fill( 0.0, 5*ncx*ncy )
    I .= [IC[:]; IC[:]; IC[:]; IC[:]; IC[:]]
    J .= [IC[:]; IW[:]; IE[:]; IS[:]; IN[:]]
    V .= [TC[:]; TW[:]; TE[:]; TS[:]; TN[:]]
    @time K  = sparse(I,J,V)
    droptol!(K,1e-6)
    
    # Solve
    @time T[:] .-= cholesky(K)\F[:]
    
    # Compute residual after solve
    PoissonSparsity!(TC.=T,TW,TE,TS,TN)
    t2 = @elapsed @turbo for j=1:ncy  
        for i=1:ncx
            F[i,j] = Poisson(TC[i,j],TW[i,j],TE[i,j],TS[i,j],TN[i,j],data,i,j)
            F[i,j] -= Q[i,j]
        end
    end
    println("r = ", norm(F[:]))
   
    # Post-processing
    display(plot(heatmap(T',c=:roma)))
    println("Time for res. + stencil eval. --> ", t1, " s")
    println("Time for res. eval.           --> ", t2, " s")
end

for it=1:2
    @time main()
end

Elrod · October 11, 2021, 5:32pm

Note that LoopVectorization 0.12.83 and newer should automatically using SIMDDualNumbers for you if you using LoopVectorization, ForwardDiff.

Topic		Replies	Views
Optimize a simple code for performance General Usage	3	673	January 20, 2019
Improving performance of a number of functions for large scale gradient based optimization Performance performance , optim , memory-allocation , loops	11	826	November 5, 2021
Fast f(x) ∂f∂x and ∂2f∂x2 Performance autodiff	5	577	July 30, 2021
Accelerate Non-linear function evaluation Performance tullio , loopvectorization	17	1300	April 6, 2021
How to speed up a generic function evaluating multiple array calculations Performance design , optimization	3	758	August 13, 2019

Making a fast evaluation of a function and its gradient

Related topics