 # Making a fast evaluation of a function and its gradient

Hi everybody,

I have written a script that evaluates a non-linear function `feta` over a 2D domain. The function returns `eta`, which depends on other 2D fields (`Exx, Eyy, Exy`). I try to compute the gradient of `eta` w.r.t. `Exx, Eyy, Exy` at each cell of the domain using ForwardDiff (function `feta_grad`). The function `feta` itself is not trivial, as it requires an inner non-linear iteration to compute `eta`.

I have managed to make an MWE which I think can be optimized. It currently allocates quite some memory along the way, as both `feta` and `feta_grad` allocate. I have tried to make a non-allocating version of `feta`, but this is where I struggle: passing ‘scalar’ arguments (here, for example, `eta[i,j]`) to a non-allocating version of `feta` does not result in the modification of the value. The problem is likely related to this topic.
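
To illustrate the scalar-argument problem (a simplified sketch, not part of the MWE; `feta_scalar!` and `feta_inplace!` are made-up names):

```julia
# Julia passes Float64 values by value: rebinding the argument inside the
# function only changes a local variable, so eta[i,j] is never touched.
function feta_scalar!(eta_ij, Exx, Eyy, Exy)
    eta_ij = Exx + Eyy + Exy  # rebinds a local; the caller sees nothing
    return nothing
end

eta = zeros(2, 2)
feta_scalar!(eta[1,1], 1.0, 2.0, 3.0)
eta[1,1] == 0.0  # true: the array was never modified

# Writing through the array (passing the array and the indices) does work:
function feta_inplace!(eta, i, j, Exx, Eyy, Exy)
    eta[i,j] = Exx + Eyy + Exy
    return nothing
end

feta_inplace!(eta, 1, 1, 1.0, 2.0, 3.0)
eta[1,1] == 6.0  # true
```
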
I’ve also tried to accelerate the evaluation using LoopVectorization, but that gave me trouble, most likely because of the `if` statement contained within the function evaluation.

Would anyone have advice on how to optimize, or make a non-allocating version of, the MWE below?

Thanks and cheers!

``````julia
using ForwardDiff, LoopVectorization, Statistics
# Global variables
Ebg = -1.0; eta0 = 1.0; n = 3.0; G = 1.0; dt = 1.0
# Visco-elastic power law rheology
function feta(x::Vector{<:Real})
Eii = sqrt( 0.5*(x[1]^2 + x[2]^2) + x[3]^2)
C   = (2eta0)^(-1)
eta = eta0*Eii^(1/n-1)
for it=1:10 # inner Newton iteration
Tii    = 2eta*Eii
Eiiv   = C*Tii^n
r      = Eii - Tii/(2G*dt) - Eiiv
# if (abs(r)<1e-9) break; end       # If statement does not work in LoopVectorization
drdeta = -Eii./(G.*dt) - Eiiv.*n./eta
eta   -= r/drdeta
end
return eta
end
feta(Exx::Real,Eyy::Real,Exy::Real) = feta([Exx,Eyy,Exy])  # use multiple dispatch to make it work with the desired argument list
feta_grad(x,y,z) = ForwardDiff.gradient(feta, [x,y,z])     # gradient of eta w.r.t. (Exx, Eyy, Exy) at one cell
# Setup
nx, ny = 100, 100
eta = 0.0*ones(Float64,nx,ny)
Exx,      Eyy,      Exy      = Ebg.*ones(Float64,nx,ny), -Ebg.*ones(Float64,nx,ny), 0.2*Ebg.*ones(Float64,nx,ny)
# Loop through all cells and evaluate
@time for j=1:ny
for i=1:nx
eta[i,j]                                     = feta(Exx[i,j],Eyy[i,j],Exy[i,j])
end
end
# Check numbers
println( "eta      = ", mean(eta) )
``````

Before:

``````
julia> # Loop through all cells and evaluate
@time for j=1:ny
for i=1:nx
eta[i,j]                                     = feta(Exx[i,j],Eyy[i,j],Exy[i,j])
end
end
0.411425 seconds (5.33 M allocations: 176.093 MiB, 4.15% gc time)
``````

After `julia> const Ebg = -1.0; const eta0 = 1.0; const n = 3.0; const G = 1.0; const dt = 1.0;`:

``````
julia> # Loop through all cells and evaluate
@time for j=1:ny
for i=1:nx
eta[i,j]                                     = feta(Exx[i,j],Eyy[i,j],Exy[i,j])
end
end
0.028025 seconds (200.20 k allocations: 8.398 MiB)
``````

Conclusion: don’t use non-constant global variables.

In fact, you want to put `eta = 0.0*ones(Float64,nx,ny)`, `Exx`, etc. all into a function alongside your for-loop.

5 Likes

Oh, thanks a lot! That indeed makes a huge difference. Yes, these were not meant to remain global variables (I’d rather never use global vars), but I wouldn’t have suspected them to be troublemakers. They should end up in a structure that is explicitly passed to the different functions. However, I’ve not yet found an easy way to deal with structs in practice since, as far as I’ve experienced, each modification of a struct definition requires a restart of Julia.

Thanks also for the hints regarding the structure, I will make one function in which all/most memory is allocated at once.

By the way, would anyone know why using LoopVectorization as below actually yields slower execution than without it?

``````julia
@turbo for j=1:ny
for i=1:nx
eta[i,j]                                     = feta(Exx[i,j],Eyy[i,j],Exy[i,j])
end
end
``````

Use named tuples until you are more sure of what your structures will contain:

``````
julia> function f(x,data)
a, b = data
a*x + b
end
f (generic function with 1 method)

julia> data = ( a = 1.0, b = 2.0 )
(a = 1.0, b = 2.0)

julia> f(1.0,data)
3.0

julia> function f(x,data)
a, b, c = data
a*x + b + c
end
f (generic function with 1 method)

julia> data = ( a = 2.0, b = 3.0, c = 1.0 )
(a = 2.0, b = 3.0, c = 1.0)

julia> f(1.0,data)
6.0
``````
2 Likes

Thanks, did not know that possibility. This sounds like a good temporary solution prior to using structs.

There is also this package to make that transition even simpler: GitHub - BeastyBlacksmith/ProtoStructs.jl: Easy prototyping of structs

2 Likes

After compiling, it is about 10x faster for me.

``````
julia> # Loop through all cells and evaluate
@time @turbo for j=1:ny
for i=1:nx
eta[i,j]                                     = feta(Exx[i,j],Eyy[i,j],Exy[i,j])
end
end
0.002653 seconds (16.00 k allocations: 4.095 MiB)

julia> # Loop through all cells and evaluate
@time for j=1:ny
for i=1:nx
eta[i,j]                                     = feta(Exx[i,j],Eyy[i,j],Exy[i,j])
end
end
0.026776 seconds (200.20 k allocations: 8.398 MiB)
``````

If you do have an example where LV is slower, I’d like to see it.

Also, pretty cool that LV works on `ForwardDiff.gradient`!

`StaticArrays` shaves off a little more time:

``````
julia> using StaticArrays

julia> feta(Exx::Real,Eyy::Real,Exy::Real) = feta(@SVector([Exx,Eyy,Exy]))  # use multiple dispatch to make it work with the desired argument list
feta (generic function with 2 methods)

julia> feta_grad(x,y,z) = ForwardDiff.gradient(feta, @SVector([x,y,z]))
feta_grad (generic function with 1 method)
``````

Bringing LV from 0.0026 to 0.002, and Base from 0.026 to 0.02:

``````
julia> # Loop through all cells and evaluate
@time @turbo for j=1:ny
for i=1:nx
eta[i,j]                                     = feta(Exx[i,j],Eyy[i,j],Exy[i,j])
end
end
0.001924 seconds (9.20 k allocations: 1.412 MiB)

julia> # Loop through all cells and evaluate
@time for j=1:ny
for i=1:nx
eta[i,j]                                     = feta(Exx[i,j],Eyy[i,j],Exy[i,j])
end
end
0.020613 seconds (200.20 k allocations: 4.431 MiB)
``````

You also have to generalize the `feta` definition to `::AbstractVector`.
I also defined the globals as `const`, and made `n` an `Int`.

2 Likes

Many thanks Elrod for all these insights!
I have tried to reproduce your (stunning) results but I have not managed yet.
I’ve probably messed it up… So, just to recap:
FIX #1 - change global variables to `const`
FIX #2 - made `n` a `const Int` by defining it as `const n=3` instead of `const n=3.0`
FIX #3 - changed definition of `feta` to `function feta(x::AbstractVector{<:Real})`
FIX #4 - use StaticArrays (EDIT: added a posteriori)
FIX #5 - use 1/a instead of a^(-1) (EDIT: added a posteriori)

I included these modifications in this edited MWE:

``````julia
using ForwardDiff, LoopVectorization, StaticArrays # FIX #4 use StaticArrays

# Global variables
const Ebg = -1.0; const eta0 = 1.0; const n = 3; const G = 1.0; const dt = 1.0; # FIX #1: declare global variable as consts and FIX #2 n as int

# Visco-elastic power law rheology
function feta(x::AbstractVector{<:Real}) # FIX #3: made it AbstractVector
Eii = sqrt( 0.5*(x[1]^2 + x[2]^2) + x[3]^2)
C   = 1.0/(2eta0) # FIX #5 instead of (2eta0)^(-1)
eta = eta0*Eii^(1/n-1)
for it=1:10 # inner Newton iteration
Tii    = 2eta*Eii
Eiiv   = C*Tii^n
r      = Eii - Tii/(2G*dt) - Eiiv
# if (abs(r)<1e-9) break; end       # If statement does not work in LoopVectorization
drdeta = -Eii/(G*dt) - Eiiv*n/eta
eta   -= r/drdeta
end
return eta
end

feta(Exx::Real,Eyy::Real,Exy::Real) = feta(@SVector([Exx,Eyy,Exy]))  # use multiple dispatch to make it work with the desired argument list

# Setup / allocation of tables
nx, ny = 100, 100
eta = 0.0*ones(Float64,nx,ny)
Exx,      Eyy,      Exy      = Ebg.*ones(Float64,nx,ny), -Ebg.*ones(Float64,nx,ny), 0.2*Ebg.*ones(Float64,nx,ny)

# Loop through all cells and evaluate
t1 = @elapsed for j=1:ny
for i=1:nx
eta[i,j]                                     = feta(Exx[i,j],Eyy[i,j],Exy[i,j])
end
end

t2 = @elapsed @turbo for j=1:ny
for i=1:nx
eta[i,j]                                     = feta(Exx[i,j],Eyy[i,j],Exy[i,j])
end
end

t3 = @elapsed @tturbo for j=1:ny
for i=1:nx
eta[i,j]                                     = feta(Exx[i,j],Eyy[i,j],Exy[i,j])
end
end

println("Base    --> ", t1, " s")
println("@turbo  --> ", t2, " s")
println("@tturbo --> ", t3, " s")
``````

which yields (from second execution onwards):

``````
Base    --> 0.089966198 s
@turbo  --> 0.647681009 s
@tturbo --> 0.23096321 s
``````

Have you maybe changed the definition of nx/ny to `const`?
And sorry if I misinterpreted your suggestions of modifications.

I hope this can turn into a useful example of combining LV and `ForwardDiff.gradient!`, which is very promising.

Also, thanks a lot leandromartinez98 for suggesting ProtoStructs. This is going to be very helpful!

EDIT: thanks to DNF for the additional edits that were suggested below.

1 Like

You are missing a crucial point, I believe. Standard arrays are heavyweight objects: they must be allocated on the heap, they are mutable, the compiler doesn’t know their size, etc. For small, fixed-size collections, you should use tuples or, as @Elrod did, `SVector`s from the `StaticArrays` package. This is what you see here:

Changing the function signature to `AbstractVector` doesn’t help performance, but it makes your function accept `SVector`s, and that helps performance.

Whenever you are working with short, fixed-size ‘vector-like’ data, use tuples or StaticArrays instead of ordinary arrays.
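
A minimal illustration of the difference (names are mine, not from the thread):

```julia
using StaticArrays

norm2(v) = v[1]^2 + v[2]^2 + v[3]^2

# An ordinary Vector is heap-allocated every time it is constructed:
g_vec(x, y, z) = norm2([x, y, z])            # allocates on each call

# An SVector carries its length in the type, so it needs no heap allocation:
g_svec(x, y, z) = norm2(@SVector [x, y, z])  # allocation-free

g_vec(1.0, 2.0, 3.0) == g_svec(1.0, 2.0, 3.0)  # same result, very different cost
```
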

2 Likes

Another minor remark:

In general, it is more efficient to write `1/x^y` than `x^(-y)`, and much more efficient to calculate powers with integers than with float exponents. Perhaps it makes no big difference for `x^(-1)`, but if you try

``````julia
@code_native 1.0^(-1)
``````

and

``````julia
@code_native 1/1.0
``````

you will see a big difference in the amount of machine code generated.

2 Likes

Thanks a lot for your inputs. I’ve edited the above post and the MWE to include your suggestions (FIX #4 and FIX #5). And now I’ll go read the docs of StaticArrays…

But anyway, avoid using global variables in every test from now on. You can benchmark this correctly if you declare everything as `const`, but that is not how anyone would do it in general. Just put the loop inside a function, pass all the data as parameters, and benchmark the function (better yet, using `BenchmarkTools`).
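
For example, something along these lines (a sketch; `fill_eta!` and `kernel` are illustrative names, with `kernel` standing in for your `feta`):

```julia
using BenchmarkTools

# Hot loop inside a function; all data passed as arguments, no globals:
function fill_eta!(eta, f, Exx, Eyy, Exy)
    for j in axes(eta, 2), i in axes(eta, 1)
        eta[i,j] = f(Exx[i,j], Eyy[i,j], Exy[i,j])
    end
    return eta
end

kernel(x, y, z) = x + y + z   # stand-in for feta here
eta = zeros(100, 100)
Exx, Eyy, Exy = fill(-1.0, 100, 100), fill(1.0, 100, 100), fill(-0.2, 100, 100)

# $-interpolation makes @btime measure the call itself, not global lookups:
@btime fill_eta!($eta, $kernel, $Exx, $Eyy, $Exy)
```
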

1 Like

Did you try running it twice?
The first time, I got (note my session was already somewhat warm, otherwise these times, especially the `@turbo`, would’ve been longer):

``````
julia> println("Base    --> ", t1, " s")
Base    --> 0.237203193 s

julia> println("@turbo  --> ", t2, " s")
@turbo  --> 0.288348365 s

julia> println("@tturbo --> ", t3, " s")
@tturbo --> 0.142087903 s
``````

When I run them a second time, I now get:

``````
julia> println("Base    --> ", t1, " s")
Base    --> 0.014496065 s

julia> println("@turbo  --> ", t2, " s")
@turbo  --> 0.001547502 s

julia> println("@tturbo --> ", t3, " s")
@tturbo --> 0.001092988 s
``````

Unfortunately, much like a real turbo, `@turbo` increases latency; it adds to compile time.
So when you measure compile + runtime, the combination is likely to be larger unless runtime dominates compile time.
Running it a second time (without redefining the functions being called) should ensure everything has compiled, so that we’re predominantly measuring runtime.

1 Like

Yes, I did run it many times after compilation (that’s what I meant by “from second execution onwards”).
I am now trying to put the advice together to make a more suitable MWE (one main function, no global vars) to discuss performance.
So far I’m stuck trying to add more input arguments to `feta` such that it still works with `ForwardDiff.gradient`. I will post it as soon as I’ve managed…
Thanks again everyone for your help!

EDIT: what I mean is, ideally one would need `feta` defined such that it takes `data` as an input parameter. `data` would then be a named tuple that contains the parameters, as suggested by leandromartinez98 (and would be local to a main function).

``````julia
function feta(x::AbstractVector{<:Real}, data) # FIX #3: made it AbstractVector and take parameters as input
eta0, G, n, dt = data
... # the remainder is unchanged
``````

but then one needs to redefine the gradient such that it knows about `data` (but doesn’t differentiate with respect to these parameters):

``````julia
feta(Exx::Real,Eyy::Real,Exy::Real,data) = feta(@SVector([Exx,Eyy,Exy]),data)
``````

which so far doesn’t work…

I just tried on a different computer without AVX512, and got basically the same pattern of results:

``````
julia> println("Base    --> ", t1, " s")
Base    --> 0.09175798 s

julia> println("@turbo  --> ", t2, " s")
@turbo  --> 0.007550789 s

julia> println("@tturbo --> ", t3, " s")
@tturbo --> 0.004292455 s
``````

Mind sharing:

``````
julia> versioninfo()
Julia Version 1.7.0-rc1
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: Intel(R) Core(TM) i3-4010U CPU @ 1.70GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-12.0.1 (ORCJIT, haswell)
Environment:

(nnlib) pkg> st LoopVectorization
Status `~/Documents/progwork/julia/env/nnlib/Project.toml`
[bdcacae8] LoopVectorization v0.12.80 `~/.julia/dev/LoopVectorization`
``````

It’d be great if you could (help me) find out the cause of your performance problem, as I cannot reproduce it.

Try:

``````julia
feta_grad(feta,x,y,z,d) = ForwardDiff.gradient(x -> feta(x, d),@SVector([x,y,z]))
``````
1 Like

Not related to your performance issues, but I should point this out:

• Instead of `0.0 * ones(Float64, nx, ny)`, use `zeros`: `zeros(nx, ny)` (Float64 is default.)
• Instead of `Ebg .* ones(...)` write `fill(Ebg, nx, ny)`.

It’s wasteful to first create an array, fill it with ones, and then multiply each element with the correct value, instead of just filling the array with the right values in the first place. It’s kind of inelegant too.

You could also use `fill(0.0, nx, ny)` instead of `zeros`, if you prefer being consistent.

1 Like

Thanks a lot to both of you. It now runs, passing the parameters via the closure suggested by Elrod and filling the arrays as suggested by DNF. Everything is also wrapped into a `main` function. The picture has now changed quite a bit; here are 3 consecutive evaluations:

``````
Base    --> 0.008191534 s
@turbo  --> 0.003956482 s
@tturbo --> 0.003793147 s
0.016332 seconds (69 allocations: 550.047 KiB)
Base    --> 0.008469515 s
@turbo  --> 0.00483582 s
@tturbo --> 0.004676405 s
0.018387 seconds (68 allocations: 549.719 KiB)
Base    --> 0.009877944 s
@turbo  --> 0.00434214 s
@tturbo --> 0.005410352 s
0.020165 seconds (69 allocations: 550.047 KiB)
``````

using the updated MWE:

``````julia
using ForwardDiff, LoopVectorization, StaticArrays, BenchmarkTools, Statistics, Revise # FIX #4 use StaticArrays

# Visco-elastic power law rheology
function feta(x::AbstractVector{<:Real}, data) # FIX #3: made it AbstractVector
eta0, G, n, dt = data
Eii = sqrt( 0.5*(x[1]^2 + x[2]^2) + x[3]^2)
C   = 1.0/(2eta0) # FIX #5 instead of (2eta0)^(-1)
eta = eta0*Eii^(1/n-1)
for it=1:10 # inner Newton iteration
Tii    = 2eta*Eii
Eiiv   = C*Tii^n
r      = Eii - Tii/(2G*dt) - Eiiv
# if (abs(r)<1e-9) break; end       # If statement does not work in LoopVectorization
drdeta = -Eii/(G*dt) - Eiiv*n/eta
eta   -= r/drdeta
end
return eta
end
feta(Exx::Real,Eyy::Real,Exy::Real,data) = feta(@SVector([Exx,Eyy,Exy]),data)  # use multiple dispatch to make it work with the desired argument list

function main()

Ebg  = -1.0;
data = ( eta0 = 1.0, G = 1.0, n = 3,  dt = 1.0 )

# Setup / allocation of tables
nx, ny = 100, 100
eta = fill(0.0, nx, ny)
Exx,      Eyy,      Exy      = fill(Ebg, nx, ny), fill(-Ebg, nx, ny), fill(0.2*Ebg, nx, ny)

# Loop through all cells and evaluate
t1 = @elapsed for j=1:ny
for i=1:nx
eta[i,j]                                     = feta(Exx[i,j],Eyy[i,j],Exy[i,j],data)
end
end

t2 = @elapsed @turbo for j=1:ny
for i=1:nx
eta[i,j]                                     = feta(Exx[i,j],Eyy[i,j],Exy[i,j],data)
end
end

t3 = @elapsed @tturbo for j=1:ny
for i=1:nx
eta[i,j]                                     = feta(Exx[i,j],Eyy[i,j],Exy[i,j],data)
end
end

println("Base    --> ", t1, " s")
println("@turbo  --> ", t2, " s")
println("@tturbo --> ", t3, " s")

end

for it=1:3
@time main()
end
``````

So now @turbo is faster. Here are the details of my configuration:

``````
julia> versioninfo()
Julia Version 1.6.1
Commit 6aaedecc44 (2021-04-23 05:59 UTC)
Platform Info:
OS: macOS (x86_64-apple-darwin18.7.0)
CPU: Intel(R) Core(TM) i5-3230M CPU @ 2.60GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-11.0.1 (ORCJIT, ivybridge)
Environment:
JULIA_EDITOR = code
``````

Looks like the number of threads is not set, which likely explains why @tturbo is as fast as @turbo. I’ll go fix that and update…

EDIT: with 4 threads:

``````
Base    --> 0.00824187 s
@turbo  --> 0.004137984 s
@tturbo --> 0.002877103 s
0.015900 seconds (68 allocations: 549.719 KiB)
Base    --> 0.011947013 s
@turbo  --> 0.005061616 s
@tturbo --> 0.002595061 s
0.020100 seconds (70 allocations: 550.078 KiB)
Base    --> 0.010976247 s
@turbo  --> 0.004203059 s
@tturbo --> 0.002639843 s
0.018551 seconds (69 allocations: 549.750 KiB)
``````
2 Likes

Release date: Q1’13

a blast from the past! XD

Currently, LV will only use up to as many threads as you have cores (so 2).

ivybridge does not have FMA (fused multiply add) or AVX2 (basic integer support)!

Finally, if you want both a function evaluation and its gradient, you can use DiffResults. This requires LoopVectorization 0.12.81 (just released a few minutes before making this post, so you’ll need to update your packages), because of a bug in multi-assignment handling:

``````julia
using ForwardDiff, DiffResults, LoopVectorization, StaticArrays, BenchmarkTools, Statistics, Revise # FIX #4 use StaticArrays

# Visco-elastic power law rheology
function feta(x::AbstractVector{<:Real}, data) # FIX #3: made it AbstractVector
eta0, G, n, dt = data
Eii = sqrt( 0.5*(x[1]^2 + x[2]^2) + x[3]^2)
C   = 1.0/(2eta0) # FIX #5 instead of (2eta0)^(-1)
eta = eta0*Eii^(1/n-1)
for it=1:10 # inner Newton iteration
Tii    = 2eta*Eii
Eiiv   = C*Tii^n
r      = Eii - Tii/(2G*dt) - Eiiv
# if (abs(r)<1e-9) break; end       # If statement does not work in LoopVectorization
drdeta = -Eii/(G*dt) - Eiiv*n/eta
eta   -= r/drdeta
end
return eta
end
feta(Exx::Real,Eyy::Real,Exy::Real,data) = feta(@SVector([Exx,Eyy,Exy]),data)  # use multiple dispatch to make it work with the desired argument list
function feta_grad(x,y,z,data) # value and gradient in one pass via DiffResults
sv  = @SVector([x,y,z])
res = ForwardDiff.gradient!(DiffResults.GradientResult(sv), v -> feta(v, data), sv)
return DiffResults.value(res), DiffResults.gradient(res)
end

function main()

Ebg  = -1.0;
data = ( eta0 = 1.0, G = 1.0, n = 3,  dt = 1.0 )

# Setup / allocation of tables
nx, ny = 100, 100
eta = fill(0.0, nx, ny)
Exx,      Eyy,      Exy      = fill(Ebg, nx, ny), fill(-Ebg, nx, ny), fill(0.2*Ebg, nx, ny)
detadExx, detadEyy, detadExy = fill(0.0, nx, ny), fill(0.0, nx, ny), fill(0.0, nx, ny)

# Loop through all cells and evaluate function + gradient
t1 = @elapsed for j=1:ny
for i=1:nx
eta[i,j], (detadExx[i,j], detadEyy[i,j], detadExy[i,j]) = feta_grad(Exx[i,j],Eyy[i,j],Exy[i,j],data)
end
end

t2 = @elapsed @turbo for j=1:ny
for i=1:nx
eta[i,j], (detadExx[i,j], detadEyy[i,j], detadExy[i,j]) = feta_grad(Exx[i,j],Eyy[i,j],Exy[i,j],data)
end
end

t3 = @elapsed @tturbo for j=1:ny
for i=1:nx
eta[i,j], (detadExx[i,j], detadEyy[i,j], detadExy[i,j]) = feta_grad(Exx[i,j],Eyy[i,j],Exy[i,j],data)
end
end

println("Base    --> ", t1, " s")
println("@turbo  --> ", t2, " s")
println("@tturbo --> ", t3, " s")

end

for it=1:3
@time main()
end
``````

I get:

``````
julia> for it=1:3
@time main()
end
Base    --> 0.002588202 s
@turbo  --> 0.000640338 s
@tturbo --> 0.000127322 s
0.003497 seconds (70 allocations: 550.359 KiB)
Base    --> 0.002590058 s
@turbo  --> 0.000643189 s
@tturbo --> 0.000101351 s
0.003444 seconds (68 allocations: 549.500 KiB)
Base    --> 0.002593851 s
@turbo  --> 0.000640927 s
@tturbo --> 0.000101391 s
0.003440 seconds (68 allocations: 549.500 KiB)
``````
2 Likes

Really really cool!!
I’ve made the update and ran it on my desktop computer, which could also be seen as a ‘blast from the past’:

``````
julia> versioninfo()
Julia Version 1.6.1
Commit 6aaedecc44 (2021-04-23 05:59 UTC)
Platform Info:
OS: macOS (x86_64-apple-darwin18.7.0)
CPU: Intel(R) Core(TM) i5-3470S CPU @ 2.90GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-11.0.1 (ORCJIT, ivybridge)
Environment:
JULIA_EDITOR = code
``````

Using the MWE updated by Elrod, which now also uses DiffResults:

``````
Base    --> 0.004413731 s
@turbo  --> 0.00217721 s
@tturbo --> 0.000688704 s
0.007464 seconds (69 allocations: 549.734 KiB)
Base    --> 0.004730431 s
@turbo  --> 0.002347034 s
@tturbo --> 0.001636784 s
0.008873 seconds (73 allocations: 550.156 KiB)
Base    --> 0.004744306 s
@turbo  --> 0.002316257 s
@tturbo --> 0.000683593 s
0.007930 seconds (72 allocations: 549.828 KiB)
``````

Many thanks to all of you for your help! I’ve spent so much time deriving and implementing analytical gradients (for Jacobians) over the last years. It looks like I won’t ever need that again… (I somehow expected that AD would be significantly slower than computing analytical derivatives; apparently a wrong idea.)

Maybe one last point: would anyone have recommendations on how to deal with stopping conditions in the `feta` loop? For now it just does 10 Newton steps without checking convergence. In practice, it would be nice to make this dynamic, since in some cases only one iteration is needed (e.g. in the linear elastic limit of the visco-elastic model `feta`) while in others ~5-10 iterations are needed (e.g. in the non-linear viscous limit). Is there a way to include a conditional statement such that the function remains compatible with both `LV` and `ForwardDiff`?
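
For instance, would something like the following work? This is an untested sketch of what I have in mind: the `break` is replaced by a branch-free `ifelse`, so converged cells simply take zero-sized Newton steps instead of leaving the loop early.

```julia
# Sketch: same rheology as feta, but with the early exit replaced by a
# branch-free guard. Every cell still runs the fixed 10 iterations;
# converged cells just stop moving.
function feta_fixed(x, data)
    eta0, G, n, dt = data
    Eii = sqrt(0.5*(x[1]^2 + x[2]^2) + x[3]^2)
    C   = 1.0/(2eta0)
    eta = eta0*Eii^(1/n - 1)
    for it = 1:10 # inner Newton iteration, fixed count
        Tii    = 2eta*Eii
        Eiiv   = C*Tii^n
        r      = Eii - Tii/(2G*dt) - Eiiv
        drdeta = -Eii/(G*dt) - Eiiv*n/eta
        # ifelse evaluates both arguments but contains no jump,
        # so the loop body stays a straight line of SIMD-able code:
        eta -= ifelse(abs(r) < 1e-9, zero(r), r/drdeta)
    end
    return eta
end

feta_fixed([-1.0, 1.0, -0.2], (eta0 = 1.0, G = 1.0, n = 3, dt = 1.0))
```

Whether `@turbo` accepts this `ifelse` (and whether it plays well with `ForwardDiff` duals) is exactly what I’d like to know.
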

1 Like