Automatic differentiation - Julia implementation advantages

I’m interested in utilizing Julia for some ML tasks; in perusing some discussions from this discourse and the Julia slack in the past, I have a vague memory of having read some claims regarding advantages of Juila implementations of automatic differentiation over other languages. However, more recently I’ve become aware of more approaches (e.g., Apache commons ).

Can someone point me toward a discussion or reference that would help show potential advantages of working with Julia in regard to the automatic differentiation?

I don’t want to rehash what is probably an old topic for many, but I’ve tried to search the past posts to no avail so far.

Thank you.


I would take a look at some of the recent papers on the topic, for example


Thank you - the references look helpful, and I’ll dig into them.


Some more links you may like, if you have not seen them: This blog post and this more recent talk . And these notebooks about different approaches: .


As an extremely nontechnical plug for Julia’s ForwardDiff package, one thing that amazes me compared with other languages is that you can just write a function as though you had never heard of ForwardDiff, and then ask ForwardDiff to take the derivatives…and it will! No need to learn some new syntax. You can use if statements, while loops, etc., in your function (although presumably nothing that makes your function non-differentiable, like random numbers) and ask ForwardDiff to figure it out.

That was a game changer for me.


In fact, here’s some code I wrote awhile ago exploring the use of ForwardDiff. I think some of the syntax is suboptimal: for example, I recall learning there were better ways of providing arguments for which the ForwardDiff will not take the derivative of. But I think it demonstrates how easy it is learn/use.

## EXAMPLES WITH ForwardDiff package
using ForwardDiff
# Distributions is only for simulating data
using Distributions

# ForwardDiff can do autodifferentation with functions that take in
# a single Vector ONLY

# Note that f only takes ONE argument
f(x::Vector) = sum(x.^2)
# Now we can make a function that takes the
# gradient of the function f with respect to x
g = x -> ForwardDiff.gradient(f, x)
# ...or the Hessian!
h = x -> ForwardDiff.hessian(f, x)

# Demonstration:
println("Gradient: ")
println( g([2,3]) ) # Gradient will be 2x
println( "Hessian: ")
println( h([2,3]) ) # Hessian will be 2 * identity matrix

# ForwardDiff expects us to provide a *single* vector, for
# which we will take all the derivatives for
# But how do we include other values that affect the derivative,
# for which we do not want to take the derivative of (i.e., data in stats)?
# My answer so far: leave them laying around in the environment above!
# I don't love this method, but it's all I got right now

# a, b will affect our function, but we do not want to take their derivatives
a = 5.0
b = 3.0

# This function takes a Vector, but accesses a, b,
# which are NOT provided as arguments
function f(x::Vector)
   ans = a * x[1] + b * x[2]

# Taking the gradient with respect to x
g = x -> ForwardDiff.gradient(f, x)

# Checking
println("Gradient = ")
# Should be values of a, b defined in Global Environment
println( g([0,0]) )

# It seems like we can arbitrarily nest functions
# and it'll figure it out for us
f1(x) = x.^2
f2(x) = sum(x)

f3(x) = f2(f1(x))
g = x -> ForwardDiff.gradient(f3, x)

println("Gradient = ")
# Should be 2 .* x
println( g([3, 2, 1]) )


# This function is just the sum of absolute values
# The motivation for this was testing if autodifferentation
# could handle control statements (answer:yes!)
function f(x::Vector)
   k = length(x)
   ans = 0.0
   for i in 1:k
      if x[i] > 0.0
         ans = ans + x[i]
         ans = ans - x[i]

g = x -> ForwardDiff.gradient(f, x)

println("Gradient = ")
# Should just be sign(x)
println( g([-1.0,1.0, 2.0, -2.0]) )

# GLM Example!
# Now we will get a little fancier. We're going to solve
# for logistic regression using Newton Raphson

# For transforming from logit to probabilities
expit(x::Real) = exp(x) / (1.0 + exp(x) )

# For computing log-likelihood from estimated p-hats
function logistic_llk(p_hat, y)
   n = length(y)
   ans = 0.0
   for i in 1:n
      if y[i] == 1
         ans = ans + log(p_hat[i])
         ans = ans + log(1.0 - p_hat[i])

# For computing p-hats
# Assuming that intercept is left out
function compute_p_hat(X, beta)
   k = length(beta)
   # The first value of beta is the intercept
   # so we will individually add it and
   # use matrix multiplication for the rest of the betas
   eta = beta[1] .+ (X * beta[2:k])
   ans = expit.(eta)

# Final function that calls everything
function logReg(X, y, its = 5, verbose = true)
   # Extracting dimensions
   X_dims = size(X)
   nRows = X_dims[1]
   nCols = X_dims[2]

   # Setting initial values
   # We're assuming X has no intercept
   # so length(beta) = nCols + 1
   betas = zeros(nCols + 1)

   # Base function that we will call our
   # 1st and 2nd order derivatives on
   base_fxn = function(betas::Vector)
      # X will be found in the environment above
      p_hats = compute_p_hat(X, betas)
      # As will y
      ans = logistic_llk(p_hats, y)

   # Creating gradient + hessian functions
   # Note: I think you might be able to actually
   # call the gradient from the hessian function (so g is redundant)
   # But let's get code working before we optimize it!
   g = x->ForwardDiff.gradient(base_fxn, x)
   h = x->ForwardDiff.hessian(base_fxn, x)

   current_llk = base_fxn(betas)
   if verbose
      println(string("Current llk:", current_llk) )

   # Runs a fixed number of vanilla Newton Raphson steps
   for i in 1:its
      # Grabbing current gradients
      g_vec = g(betas)
      h_mat = h(betas)
      # -\(M, v) solves M for v (i.e., R::solve(M,v) )
      # So line below gets the proposal step for Newton Raphson
      delta = -\(h_mat, g_vec)
      betas = betas + delta
      # Computes current log-likelihood
      current_llk = base_fxn(betas)
      if verbose
         # Prints out current log-likelihood at this step of the algorithm
         println(string("Current llk:", current_llk) )
   if verbose
      println("Final betas = ", betas)

# Funciton for simulating data
function simData(nRow, nCol)
   X = randn(nRow, nCol)
   trueBetas = randn(nCol + 1)
   etas = X * trueBetas[2:(nCol + 1)] .+ trueBetas[1]
   p = expit.(etas)
   y = rand.(Binomial.(1, p))
   ans = (X = X, y = y, trueBetas = trueBetas)

# Size of data
nCol = 2
nRow = 5000

sim_problem = simData(nRow, nCol)
println(string("True values of beta = ", sim_problem[3]) )

start = time()
logReg(sim_problem[1], sim_problem[2])
finish = time()
println("Time = ", finish - start)

# Now let's try with 10x more rows
# If autodifferentation is smart, should be 10x slower
# If dumb, 100x slower
nRow = 50000
sim_problem = simData(nRow, nCol)
start = time()
logReg(sim_problem[1], sim_problem[2])
finish = time()
println("Time = ", finish - start)
# Looks like it is smart!

Contributing to the conversation, i have a thermodynamic implementation of the IAPWS95-2018 standard of properties of water, where all the properties are combinations of derivatives of a base function (helmholtz energy,with molar volume temperature as variables, and molar fractions) my code for the pressure is the following:

function helmholtz(model::IAPWS95,v, T, x) 
##code for calculation of the base function
## the model variable holds all the necessary parameters 


#the pressure is the negative volume  derivative of the helmholtz function 
function pressure(model::Helmholtz Model, v, T, x) 
f0 = z->helmholtz(model, z, T, x) 
return - ForwardDiff.derivative(f0, v, x) 

thanks for these links!


thanks for posting this!