Making a fast evaluation of a function and its gradient

Hi everybody,

I have written a script that evaluates a non-linear function feta over a 2D domain. The function returns eta, which depends on other 2D fields (Exx, Eyy, Exy). I try to compute the gradient of eta w.r.t. Exx, Eyy, Exy at each cell of the domain using ForwardDiff (function feta_grad). The function feta itself is not trivial, as it requires an inner non-linear iteration to compute eta.

I have managed to make an MWE which I think can be optimized. It currently allocates quite a lot of memory along the way, as both feta and feta_grad allocate. I have tried to make a non-allocating version of feta, but this is where I struggle: passing 'scalar' arguments (here, for example, eta[i,j]) to a non-allocating version of feta does not result in the modification of the value. The problem is likely related to this topic.
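(For reference, a minimal sketch of the scalar-argument issue, with hypothetical names: a Float64 is passed by value, so rebinding it inside a function is invisible to the caller, while a `Ref` provides a mutable cell that an in-place version could write to.)

```julia
# Sketch (hypothetical names): why a scalar like eta[i,j] cannot be mutated
# through a function argument.
function set_scalar!(x)
    x = 2.0            # rebinds the local variable only
    return nothing
end

function set_ref!(r::Ref{Float64})
    r[] = 2.0          # writes through the mutable cell
    return nothing
end

a = 1.0
set_scalar!(a)
@assert a == 1.0       # caller's value is unchanged

r = Ref(1.0)
set_ref!(r)
@assert r[] == 2.0     # mutation is visible to the caller
```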
I've also tried to accelerate the evaluation by using LoopVectorization, but this gave me trouble. However, this is most likely due to the if statement contained within the function evaluation.

Would anyone have advice on how to optimize / make a non-allocating version of the MWE below?

Thanks and cheers!

using ForwardDiff, LoopVectorization, Statistics
# Global variables
Ebg = -1.0; eta0 = 1.0; n = 3.0; G = 1.0; dt = 1.0
# Visco-elastic power law rheology
function feta(x::Vector{<:Real}) 
    Eii = sqrt( 0.5*(x[1]^2 + x[2]^2) + x[3]^2)
    C   = (2eta0)^(-1)
    eta = eta0*Eii^(1/n-1)
    for it=1:10 # inner Newton iteration
        Tii    = 2eta*Eii
        Eiiv   = C*Tii^n
        r      = Eii - Tii/(2G*dt) - Eiiv
        # if (abs(r)<1e-9) break; end       # If statement does not work in LoopVectorization
        drdeta = -Eii/(G*dt) - Eiiv*n/eta   # scalar ops; broadcasting dots are unnecessary here
        eta   -= r/drdeta 
    end
    return eta
end
feta(Exx::Real,Eyy::Real,Exy::Real) = feta([Exx,Eyy,Exy])  # use multiple dispatch to make it work with the desired argument list
feta_grad(feta,x,y,z) = ForwardDiff.gradient(feta,[x,y,z]) # construct gradient function for feta()
# Setup
nx, ny = 100, 100
eta = 0.0*ones(Float64,nx,ny)                                    
Exx,      Eyy,      Exy      = Ebg.*ones(Float64,nx,ny), -Ebg.*ones(Float64,nx,ny), 0.2*Ebg.*ones(Float64,nx,ny)
detadExx, detadEyy, detadExy =     zeros(Float64,nx,ny),      zeros(Float64,nx,ny),         zeros(Float64,nx,ny)
# Loop through all cells and evaluate
@time for j=1:ny
    for i=1:nx
        eta[i,j]                                     = feta(Exx[i,j],Eyy[i,j],Exy[i,j])
        detadExx[i,j], detadEyy[i,j],  detadExy[i,j] = feta_grad(feta,Exx[i,j],Eyy[i,j],Exy[i,j])
    end
end
# Check numbers
println( "eta      = ", mean(eta) )
println( "detadExx = ", mean(detadExx) )
println( "detadEyy = ", mean(detadEyy) )
println( "detadExy = ", mean(detadExy) )

Before:

julia> # Loop through all cells and evaluate
       @time for j=1:ny
           for i=1:nx
               eta[i,j]                                     = feta(Exx[i,j],Eyy[i,j],Exy[i,j])
               detadExx[i,j], detadEyy[i,j],  detadExy[i,j] = feta_grad(feta,Exx[i,j],Eyy[i,j],Exy[i,j])
           end
       end
  0.411425 seconds (5.33 M allocations: 176.093 MiB, 4.15% gc time)

After declaring the globals as const (julia> const Ebg = -1.0; const eta0 = 1.0; const n = 3.0; const G = 1.0; const dt = 1.0;):

julia> # Loop through all cells and evaluate
       @time for j=1:ny
           for i=1:nx
               eta[i,j]                                     = feta(Exx[i,j],Eyy[i,j],Exy[i,j])
               detadExx[i,j], detadEyy[i,j],  detadExy[i,j] = feta_grad(feta,Exx[i,j],Eyy[i,j],Exy[i,j])
           end
       end
  0.028025 seconds (200.20 k allocations: 8.398 MiB)

Conclusion: don't use non-constant global variables.

In fact, you want to put eta = 0.0*ones(Float64,nx,ny), Exx, etc. all into a function alongside your for-loop.
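A minimal sketch of that pattern (hypothetical names, with a stand-in for feta): allocate the arrays and run the loop inside one function, so no non-constant globals are involved:

```julia
# Sketch: the arrays and the hot loop live inside a function, so the
# compiler sees concrete types everywhere (no non-constant globals).
function run_all(nx, ny, Ebg)
    Exx = fill(Ebg, nx, ny)
    eta = zeros(nx, ny)
    for j in 1:ny, i in 1:nx
        eta[i, j] = 2 * Exx[i, j]   # stand-in for feta(Exx[i,j], ...)
    end
    return eta
end

eta = run_all(100, 100, -1.0)
```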

5 Likes

Oh, thanks a lot! That indeed makes a huge difference. Yes, these were not meant to remain global variables (I'd rather never use globals), but I wouldn't have suspected them to be troublemakers. They should end up in a structure that is explicitly passed to the different functions. However, I've not yet found an easy way to deal with structs in practice since, as far as I've experienced, each modification of the struct definition requires a restart of Julia.

Thanks also for the hints regarding the structure, I will make one function in which all/most memory is allocated at once.

By the way, would anyone know why using LoopVectorization as below actually yields slower execution than without it?

@turbo for j=1:ny  
    for i=1:nx
        eta[i,j]                                     = feta(Exx[i,j],Eyy[i,j],Exy[i,j])
        detadExx[i,j], detadEyy[i,j],  detadExy[i,j] = feta_grad(feta,Exx[i,j],Eyy[i,j],Exy[i,j])
    end
end

Use named tuples until you are more sure of what your structures will contain:

julia> function f(x,data)
         a, b = data
         a*x + b
       end
f (generic function with 1 method)

julia> data = ( a = 1.0, b = 2.0 )
(a = 1.0, b = 2.0)

julia> f(1.0,data)
3.0

julia> function f(x,data)
         a, b, c = data
         a*x + b + c
       end
f (generic function with 1 method)

julia> data = ( a = 2.0, b = 3.0, c = 1.0 )
(a = 2.0, b = 3.0, c = 1.0)

julia> f(1.0,data)
6.0
2 Likes

Thanks, I did not know about that possibility. This sounds like a good temporary solution prior to using structs.

There is also this package to make that transition even simpler: https://github.com/BeastyBlacksmith/ProtoStructs.jl

2 Likes

After compiling, it is about 10x faster for me.

julia> # Loop through all cells and evaluate
       @time @turbo for j=1:ny
           for i=1:nx
               eta[i,j]                                     = feta(Exx[i,j],Eyy[i,j],Exy[i,j])
               detadExx[i,j], detadEyy[i,j],  detadExy[i,j] = feta_grad(feta,Exx[i,j],Eyy[i,j],Exy[i,j])
           end
       end
  0.002653 seconds (16.00 k allocations: 4.095 MiB)

julia> # Loop through all cells and evaluate
       @time for j=1:ny
           for i=1:nx
               eta[i,j]                                     = feta(Exx[i,j],Eyy[i,j],Exy[i,j])
               detadExx[i,j], detadEyy[i,j],  detadExy[i,j] = feta_grad(feta,Exx[i,j],Eyy[i,j],Exy[i,j])
           end
       end
  0.026776 seconds (200.20 k allocations: 8.398 MiB)

If you do have an example where LV is slower, I'd like to see it.

Also, pretty cool that LV works on ForwardDiff.gradient!

StaticArrays shaves off a little more time:

julia> using StaticArrays

julia> feta(Exx::Real,Eyy::Real,Exy::Real) = feta(@SVector([Exx,Eyy,Exy]))  # use multiple dispatch to make it work with the desired argument list
feta (generic function with 2 methods)

julia> feta_grad(feta,x,y,z) = ForwardDiff.gradient(feta,@SVector([x,y,z])) # construct gradient function for feta()
feta_grad (generic function with 1 method)

Bringing LV from 0.0026 to 0.002, and Base from 0.026 to 0.02:


julia> # Loop through all cells and evaluate
       @time @turbo for j=1:ny
           for i=1:nx
               eta[i,j]                                     = feta(Exx[i,j],Eyy[i,j],Exy[i,j])
               detadExx[i,j], detadEyy[i,j],  detadExy[i,j] = feta_grad(feta,Exx[i,j],Eyy[i,j],Exy[i,j])
           end
       end
  0.001924 seconds (9.20 k allocations: 1.412 MiB)

julia> # Loop through all cells and evaluate
       @time for j=1:ny
           for i=1:nx
               eta[i,j]                                     = feta(Exx[i,j],Eyy[i,j],Exy[i,j])
               detadExx[i,j], detadEyy[i,j],  detadExy[i,j] = feta_grad(feta,Exx[i,j],Eyy[i,j],Exy[i,j])
           end
       end
  0.020613 seconds (200.20 k allocations: 4.431 MiB)

You also have to generalize the feta definition to ::AbstractVector.
I also defined the globals as const, and made n an Int.

2 Likes

Many thanks Elrod for all these insights!
I have tried to reproduce your (stunning) results but have not managed yet.
I've probably messed it up… So, just to recap:
FIX #1 - change global variables to const
FIX #2 - made n a const Int by defining it as: const n=3 instead of const n=3.0
FIX #3 - changed definition of feta to function feta(x::AbstractVector{<:Real})
FIX #4 - use StaticArrays (EDIT: added a posteriori)
FIX #5 - use 1/a instead of a^(-1) (EDIT: added a posteriori)

I include these modifications in this edited MWE:

using ForwardDiff, LoopVectorization, StaticArrays # FIX #4 use StaticArrays

# Global variables
const Ebg = -1.0; const eta0 = 1.0; const n = 3; const G = 1.0; const dt = 1.0; # FIX #1: declare global variable as consts and FIX #2 n as int

# Visco-elastic power law rheology
function feta(x::AbstractVector{<:Real}) # FIX #3: made it AbstractVector
    Eii = sqrt( 0.5*(x[1]^2 + x[2]^2) + x[3]^2)
    C   = 1.0/(2eta0) # FIX #5 instead of (2eta0)^(-1)
    eta = eta0*Eii^(1/n-1)
    for it=1:10 # inner Newton iteration
        Tii    = 2eta*Eii
        Eiiv   = C*Tii^n
        r      = Eii - Tii/(2G*dt) - Eiiv
        # if (abs(r)<1e-9) break; end       # If statement does not work in LoopVectorization
        drdeta = -Eii/(G*dt) - Eiiv*n/eta
        eta   -= r/drdeta 
    end
    return eta
end

feta(Exx::Real,Eyy::Real,Exy::Real) = feta(@SVector([Exx,Eyy,Exy]))  # use multiple dispatch to make it work with the desired argument list
feta_grad(feta,x,y,z) = ForwardDiff.gradient(feta,@SVector([x,y,z])) # construct gradient function for feta()

# Setup / allocation of tables
nx, ny = 100, 100
eta = 0.0*ones(Float64,nx,ny)                                    
Exx,      Eyy,      Exy      = Ebg.*ones(Float64,nx,ny), -Ebg.*ones(Float64,nx,ny), 0.2*Ebg.*ones(Float64,nx,ny)
detadExx, detadEyy, detadExy =     zeros(Float64,nx,ny),      zeros(Float64,nx,ny),         zeros(Float64,nx,ny)

# Loop through all cells and evaluate
t1 = @elapsed for j=1:ny  
    for i=1:nx
        eta[i,j]                                     = feta(Exx[i,j],Eyy[i,j],Exy[i,j])
        detadExx[i,j], detadEyy[i,j],  detadExy[i,j] = feta_grad(feta,Exx[i,j],Eyy[i,j],Exy[i,j])
    end
end

t2 = @elapsed @turbo for j=1:ny  
    for i=1:nx
        eta[i,j]                                     = feta(Exx[i,j],Eyy[i,j],Exy[i,j])
        detadExx[i,j], detadEyy[i,j],  detadExy[i,j] = feta_grad(feta,Exx[i,j],Eyy[i,j],Exy[i,j])
    end
end

t3 = @elapsed @tturbo for j=1:ny 
    for i=1:nx
        eta[i,j]                                     = feta(Exx[i,j],Eyy[i,j],Exy[i,j])
        detadExx[i,j], detadEyy[i,j],  detadExy[i,j] = feta_grad(feta,Exx[i,j],Eyy[i,j],Exy[i,j])
    end
end

println("Base    --> ", t1, " s")
println("@turbo  --> ", t2, " s")
println("@tturbo --> ", t3, " s")

which yields (from second execution onwards):

Base    --> 0.089966198 s
@turbo  --> 0.647681009 s
@tturbo --> 0.23096321 s

Have you maybe changed the definition of nx/ny to const?
And sorry if I misinterpreted your suggested modifications.

I hope this could turn out to be a useful example of the combination of LV and ForwardDiff.gradient, which is very promising.

Also, thanks a lot leandromartinez98 for suggesting ProtoStructs. This is going to be very helpful!

EDIT: thanks to DNF for the additional edits that were suggested below.

1 Like

You are missing a crucial point, I believe. Standard arrays are rather heavyweight objects: they must be allocated on the heap, they are mutable, the compiler doesn't know their size, etc. For small, fixed-size collections, you should use tuples or, as @Elrod did, SVectors from the StaticArrays package. This is what you see here:

Changing the function signature to AbstractVector doesn't help performance by itself, but it makes your function accept SVectors, and that helps performance.

Whenever you are working with short, fixed-size 'vector-like' data, use tuples or StaticArrays instead of ordinary arrays.
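For instance, the Eii computation from feta, written over a plain tuple (a sketch), has statically known length and performs no heap allocation:

```julia
# A 3-tuple lives on the stack and its length is known at compile time,
# so this invariant computation is allocation-free.
eii(x::NTuple{3,Float64}) = sqrt(0.5*(x[1]^2 + x[2]^2) + x[3]^2)

v = (1.0, -1.0, 0.2)
@assert eii(v) ≈ sqrt(1.04)
```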

2 Likes

Another minor remark:

In general, it is more efficient to write 1/x^y than x^(-y), and much more efficient to calculate powers with integers than with float exponents. Perhaps it makes no big difference for x^(-1), but if you try

@code_native 1.0^(-1)

and

@code_native 1/1.0

you will see a big difference in the amount of machine code generated.
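A small sketch of the integer-exponent point: a literal Int exponent like x^3 lowers to repeated multiplication, while x^3.0 goes through the generic floating-point pow routine, which is far more machine code:

```julia
# Integer literal exponent: lowered to x*x*x (fast, exact here).
cube_int(x) = x^3
# Float exponent: dispatches to the general pow path.
cube_float(x) = x^3.0

@assert cube_int(2.0) == 8.0
@assert cube_float(2.0) ≈ 8.0
```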

2 Likes

Thanks a lot for your input. I've edited the above post and the MWE to include your suggestions (FIX #4 and FIX #5). And now I'll go read the docs of StaticArrays… :wink:

But anyway, avoid using global variables in every future test. You can benchmark this correctly if you declare everything as const, but that is not what anyone would do in general. Just put the loop inside a function, pass all the data as parameters, and benchmark the function (better yet, using BenchmarkTools).

1 Like

Did you try running it twice?
The first time, I got (note my session was already somewhat warm; otherwise these times, especially the @turbo one, would've been longer):

julia> println("Base    --> ", t1, " s")
Base    --> 0.237203193 s

julia> println("@turbo  --> ", t2, " s")
@turbo  --> 0.288348365 s

julia> println("@tturbo --> ", t3, " s")
@tturbo --> 0.142087903 s

When I run them a second time, I now get:

julia> println("Base    --> ", t1, " s")
Base    --> 0.014496065 s

julia> println("@turbo  --> ", t2, " s")
@turbo  --> 0.001547502 s

julia> println("@tturbo --> ", t3, " s")
@tturbo --> 0.001092988 s

Unfortunately, much like a real turbo, @turbo increases latency; it adds to compile time.
So when you measure compile + runtime, the combination is likely to be larger unless runtime dominates compile time.
Running it a second time (without redefining the functions being called) should ensure everything has compiled, so that we're predominantly measuring runtime.
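A Base-only sketch of that measurement pattern (hypothetical stand-in kernel; @belapsed from BenchmarkTools automates this): time a first call, which includes JIT compilation, then a second call, which measures steady-state runtime:

```julia
# A freshly defined function must be compiled on its first call.
function sumsq(A)
    s = 0.0
    for x in A
        s += x * x
    end
    return s
end

A = rand(100, 100)
t_first = @elapsed sumsq(A)   # includes JIT compilation of sumsq
t_warm  = @elapsed sumsq(A)   # already compiled: runtime only
@assert t_warm < t_first
```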

1 Like

Yes, I did run it many times after compilation (that's what I meant by "from second execution onwards").
I am now trying to put the advice together to make a more suitable MWE (one main function, no global vars) to discuss performance.
So far I'm stuck trying to add more input arguments to feta such that it still works with ForwardDiff.gradient. I will post it as soon as I've managed…
Thanks again everyone for your help!

EDIT: what I mean is, ideally one would define feta such that it takes data as an input parameter. data would then be a named tuple that contains the parameters, as suggested by leandromartinez98 (it would be local to a main function).

function feta(x::AbstractVector{<:Real}, data) # FIX #3: made it AbstractVector and take parameters as input
    eta0, G, n, dt = data
... # the remainder is unchanged

but then one needs to redefine the gradient such that it knows about data (but doesn't differentiate with respect to these parameters)

feta(Exx::Real,Eyy::Real,Exy::Real,data) = feta(@SVector([Exx,Eyy,Exy]),data) 
feta_grad(feta,x,y,z,d) = ForwardDiff.gradient(feta,@SVector([x,y,z]),d) # construct gradient function for feta()

which so far doesn't work…

I just tried on a different computer without AVX512, and got basically the same pattern of results:

julia> println("Base    --> ", t1, " s")
Base    --> 0.09175798 s

julia> println("@turbo  --> ", t2, " s")
@turbo  --> 0.007550789 s

julia> println("@tturbo --> ", t3, " s")
@tturbo --> 0.004292455 s

Mind sharing:

julia> versioninfo()
Julia Version 1.7.0-rc1
Commit 9eade6195e* (2021-09-12 06:45 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Core(TM) i3-4010U CPU @ 1.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, haswell)
Environment:
  JULIA_NUM_THREADS = 4

(nnlib) pkg> st LoopVectorization
      Status `~/Documents/progwork/julia/env/nnlib/Project.toml`
  [bdcacae8] LoopVectorization v0.12.80 `~/.julia/dev/LoopVectorization`

It'd be great if you could (help me) find out the cause of your performance problem, as I cannot reproduce it.

Try:

feta_grad(feta,x,y,z,d) = ForwardDiff.gradient(x -> feta(x, d),@SVector([x,y,z]))
1 Like

Not related to your performance issues, but I should point this out:

  • Instead of 0.0 * ones(Float64, nx, ny), use zeros: zeros(nx, ny) (Float64 is default.)
  • Instead of Ebg .* ones(...) write fill(Ebg, nx, ny).

It's wasteful to first create an array, fill it with ones, and then multiply each element by the correct value, instead of just filling the array with the right values in the first place. It's kind of inelegant too.

You could also use fill(0.0, nx, ny) instead of zeros, if you prefer being consistent.

1 Like

Thanks a lot to both of you. It now runs, passing the parameters via the closure suggested by Elrod and filling the arrays as suggested by DNF. Everything is also now inside a main function. The picture has changed quite a bit; here are 3 consecutive evaluations:

Base    --> 0.008191534 s
@turbo  --> 0.003956482 s
@tturbo --> 0.003793147 s
  0.016332 seconds (69 allocations: 550.047 KiB)
Base    --> 0.008469515 s
@turbo  --> 0.00483582 s
@tturbo --> 0.004676405 s
  0.018387 seconds (68 allocations: 549.719 KiB)
Base    --> 0.009877944 s
@turbo  --> 0.00434214 s
@tturbo --> 0.005410352 s
  0.020165 seconds (69 allocations: 550.047 KiB)

using the updated MWE:

using ForwardDiff, LoopVectorization, StaticArrays, BenchmarkTools, Statistics, Revise # FIX #4 use StaticArrays

# Visco-elastic power law rheology
function feta(x::AbstractVector{<:Real}, data) # FIX #3: made it AbstractVector
    eta0, G, n, dt = data
    Eii = sqrt( 0.5*(x[1]^2 + x[2]^2) + x[3]^2)
    C   = 1.0/(2eta0) # FIX #5 instead of (2eta0)^(-1)
    eta = eta0*Eii^(1/n-1)
    for it=1:10 # inner Newton iteration
        Tii    = 2eta*Eii
        Eiiv   = C*Tii^n
        r      = Eii - Tii/(2G*dt) - Eiiv
        # if (abs(r)<1e-9) break; end       # If statement does not work in LoopVectorization
        drdeta = -Eii/(G*dt) - Eiiv*n/eta
        eta   -= r/drdeta 
    end
    return eta
end
feta(Exx::Real,Eyy::Real,Exy::Real,data) = feta(@SVector([Exx,Eyy,Exy]),data)  # use multiple dispatch to make it work with the desired argument list
feta_grad(feta,x,y,z,d) = ForwardDiff.gradient(x -> feta(x, d),@SVector([x,y,z])) # construct gradient function for feta()

function main()

    Ebg  = -1.0; 
    data = ( eta0 = 1.0, G = 1.0, n = 3,  dt = 1.0 )

    # Setup / allocation of tables
    nx, ny = 100, 100
    eta = fill(0.0, nx, ny)                               
    Exx,      Eyy,      Exy      = fill(Ebg, nx, ny), fill(-Ebg, nx, ny), fill(0.2*Ebg, nx, ny)
    detadExx, detadEyy, detadExy = fill(0.0, nx, ny), fill( 0.0, nx, ny), fill(0.0, nx, ny)

    # Loop through all cells and evaluate
    t1 = @elapsed for j=1:ny  
        for i=1:nx
            eta[i,j]                                     = feta(Exx[i,j],Eyy[i,j],Exy[i,j],data)
            detadExx[i,j], detadEyy[i,j],  detadExy[i,j] = feta_grad(feta,Exx[i,j],Eyy[i,j],Exy[i,j],data )
        end
    end

    t2 = @elapsed @turbo for j=1:ny  
        for i=1:nx
            eta[i,j]                                     = feta(Exx[i,j],Eyy[i,j],Exy[i,j],data)
            detadExx[i,j], detadEyy[i,j],  detadExy[i,j] = feta_grad(feta,Exx[i,j],Eyy[i,j],Exy[i,j],data)
        end
    end

    t3 = @elapsed @tturbo for j=1:ny 
        for i=1:nx
            eta[i,j]                                     = feta(Exx[i,j],Eyy[i,j],Exy[i,j],data)
            detadExx[i,j], detadEyy[i,j],  detadExy[i,j] = feta_grad(feta,Exx[i,j],Eyy[i,j],Exy[i,j],data)
        end
    end

    println("Base    --> ", t1, " s")
    println("@turbo  --> ", t2, " s")
    println("@tturbo --> ", t3, " s")

end

for it=1:3
    @time main()
end

So now @turbo is faster :smiley:
Here are the details of my configuration:

julia> versioninfo()
Julia Version 1.6.1
Commit 6aaedecc44 (2021-04-23 05:59 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.7.0)
  CPU: Intel(R) Core(TM) i5-3230M CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, ivybridge)
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 

Looks like the number of threads is not set, which likely explains why @tturbo is only as fast as @turbo. I'll go fix that and update…

EDIT: with 4 threads :smiley:

Base    --> 0.00824187 s
@turbo  --> 0.004137984 s
@tturbo --> 0.002877103 s
  0.015900 seconds (68 allocations: 549.719 KiB)
Base    --> 0.011947013 s
@turbo  --> 0.005061616 s
@tturbo --> 0.002595061 s
  0.020100 seconds (70 allocations: 550.078 KiB)
Base    --> 0.010976247 s
@turbo  --> 0.004203059 s
@tturbo --> 0.002639843 s
  0.018551 seconds (69 allocations: 549.750 KiB)
2 Likes

Release date: Q1ā€™13

a blast from the past! XD

Currently, LV will only use up to as many threads as you have cores (so 2).
Glad it's helping performance now.

ivybridge does not have FMA (fused multiply add) or AVX2 (basic integer support)!

Finally, if you want both a function evaluation and its gradient, you can use DiffResults. This requires LoopVectorization 0.12.81 (just released a few minutes before making this post, so you'll need to update your packages), because of a bug in multi-assignment handling:

using ForwardDiff, DiffResults, LoopVectorization, StaticArrays, BenchmarkTools, Statistics, Revise # FIX #4 use StaticArrays

# Visco-elastic power law rheology
function feta(x::AbstractVector{<:Real}, data) # FIX #3: made it AbstractVector
    eta0, G, n, dt = data
    Eii = sqrt( 0.5*(x[1]^2 + x[2]^2) + x[3]^2)
    C   = 1.0/(2eta0) # FIX #5 instead of (2eta0)^(-1)
    eta = eta0*Eii^(1/n-1)
    for it=1:10 # inner Newton iteration
        Tii    = 2eta*Eii
        Eiiv   = C*Tii^n
        r      = Eii - Tii/(2G*dt) - Eiiv
        # if (abs(r)<1e-9) break; end       # If statement does not work in LoopVectorization
        drdeta = -Eii/(G*dt) - Eiiv*n/eta
        eta   -= r/drdeta 
    end
    return eta
end
feta(Exx::Real,Eyy::Real,Exy::Real,data) = feta(@SVector([Exx,Eyy,Exy]),data)  # use multiple dispatch to make it work with the desired argument list
feta_grad(feta,x,y,z,d) = ForwardDiff.gradient(x -> feta(x, d),@SVector([x,y,z])) # construct gradient function for feta()
function feta_valgrad(feta,x,y,z,d)
  sv = @SVector([x,y,z])
  dr = ForwardDiff.gradient!(DiffResults.GradientResult(sv), x -> feta(x, d), sv)
  DiffResults.value(dr), DiffResults.gradient(dr)
end

function main()

    Ebg  = -1.0; 
    data = ( eta0 = 1.0, G = 1.0, n = 3,  dt = 1.0 )

    # Setup / allocation of tables
    nx, ny = 100, 100
    eta = fill(0.0, nx, ny)                               
    Exx,      Eyy,      Exy      = fill(Ebg, nx, ny), fill(-Ebg, nx, ny), fill(0.2*Ebg, nx, ny)
    detadExx, detadEyy, detadExy = fill(0.0, nx, ny), fill( 0.0, nx, ny), fill(0.0, nx, ny)

    # Loop through all cells and evaluate
    t1 = @elapsed for j=1:ny  
        for i=1:nx
            eta[i,j], (detadExx[i,j], detadEyy[i,j],  detadExy[i,j]) = feta_valgrad(feta,Exx[i,j],Eyy[i,j],Exy[i,j],data)
        end
    end

    t2 = @elapsed @turbo for j=1:ny  
        for i=1:nx
            eta[i,j], (detadExx[i,j], detadEyy[i,j],  detadExy[i,j]) = feta_valgrad(feta,Exx[i,j],Eyy[i,j],Exy[i,j],data)
        end
    end

    t3 = @elapsed @tturbo for j=1:ny 
        for i=1:nx
            eta[i,j], (detadExx[i,j], detadEyy[i,j],  detadExy[i,j]) = feta_valgrad(feta,Exx[i,j],Eyy[i,j],Exy[i,j],data)
        end
    end

    println("Base    --> ", t1, " s")
    println("@turbo  --> ", t2, " s")
    println("@tturbo --> ", t3, " s")

end

for it=1:3
    @time main()
end

I get:

julia> for it=1:3
           @time main()
       end
Base    --> 0.002588202 s
@turbo  --> 0.000640338 s
@tturbo --> 0.000127322 s
  0.003497 seconds (70 allocations: 550.359 KiB)
Base    --> 0.002590058 s
@turbo  --> 0.000643189 s
@tturbo --> 0.000101351 s
  0.003444 seconds (68 allocations: 549.500 KiB)
Base    --> 0.002593851 s
@turbo  --> 0.000640927 s
@tturbo --> 0.000101391 s
  0.003440 seconds (68 allocations: 549.500 KiB)
2 Likes

Really, really cool!!
I've made the update and ran it on my desktop computer, which could also be seen as a 'blast from the past' :smiley:

julia> versioninfo()
Julia Version 1.6.1
Commit 6aaedecc44 (2021-04-23 05:59 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.7.0)
  CPU: Intel(R) Core(TM) i5-3470S CPU @ 2.90GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, ivybridge)
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 4

Using the MWE updated by Elrod, which now also uses DiffResults:

Base    --> 0.004413731 s
@turbo  --> 0.00217721 s
@tturbo --> 0.000688704 s
  0.007464 seconds (69 allocations: 549.734 KiB)
Base    --> 0.004730431 s
@turbo  --> 0.002347034 s
@tturbo --> 0.001636784 s
  0.008873 seconds (73 allocations: 550.156 KiB)
Base    --> 0.004744306 s
@turbo  --> 0.002316257 s
@tturbo --> 0.000683593 s
  0.007930 seconds (72 allocations: 549.828 KiB)

Many thanks to all of you for your help! I've spent so much time deriving and implementing analytical gradients (for Jacobians) over the last years. Looks like I won't ever need that again… :smiley:
(I somehow expected that AD would be significantly slower than computing analytical derivatives; likely a wrong idea…)

Maybe one last point: would anyone have recommendations on how to deal with stopping conditions in the feta loop? For now, it just does 10 Newton steps without checking convergence. In practice, it would be nice to make it dynamic, as in some cases only one iteration is needed (e.g. in the linear elastic limit of the visco-elastic model feta), while in others ~5-10 iterations are needed (e.g. in the non-linear viscous limit). Is there a way to include a conditional statement such that the function remains compatible with both LV and ForwardDiff?

1 Like