Types and gradients, including Forward.gradient


#1

This thing with types is not so easy in the beginning. I’m trying to use the gradient() function on an Any array, but it does not work. Could someone help me with the type of the w array? Second, I want to compare the base gradient function with the Forward.gradient function from the ForwardDiff package and would be grateful if someone who knows this package can tell what needs to be changed. Obviously, just changing gradient to Forward.gradient was not enough.

n = 100
p = 10
x = randn(n,p)'
y = sum(x[1:5,:],1) .+ randn(n)'*0.1
w = [0.0001*randn(1,p), 0]

loss(w,x,y) = sumabs2(y - (w[1]*x .+ w[2])) / size(y,2)
lossgradient = gradient(loss)

function train(w, data; lr=.1)
    for (x,y) in data
        dw = lossgradient(w, x, y)
        w[1] -= lr * dw[1]
        w[2] -= lr * dw[2]
    end
    return w
end

for i=1:50; train(w, [x,y]); println(loss(w,x,y)); end

using ForwardDiff
lossgradient = Forward.gradient(loss)

#2

I hope you don’t mind, but I edited your comment to use triple-backtick formatting. It often makes code easier for others to read.


#3

What are you trying to differentiate with respect to? x?


#4

It’s an attempt to implement proximal gradient descent for minimization of the loss (squared error) function. The gradient is with respect to w.


#5

Shouldn’t that be ForwardDiff.gradient? You’re missing the Diff.


#6

Also, you ForwardDiff.gradient requires two arguments: the function f to take the gradient of, and the vector x at which to calculate the gradient. You seem to want the function itself, which you can do via an anonymous function:

lossgradient = x -> Forward.gradient(loss, x)

#7

ForwardDiff’s gradient can only be used to take the gradient at a specific point, with respect to a specific parameter (see the documentation here). ForwardDiff.gradient can then be used to easily define your own functions for taking the gradient generally.

In your example, you wish to get the gradient of loss w.r.t. w. loss takes multiple parameters, but you can easily “select” which parameter you want to differentiate against by creating a closure over the fixed (i.e. not differentiated against) parameters. That looks like this:

# get the gradient of loss w.r.t. to w at the point w0, with fixed x and y
ForwardDiff.gradient(w -> loss(w, x, y), w0)

Thus, if you’d like to define a gradient function which you call repeatedly, you can write:

lossgradient(w0, x, y) = ForwardDiff.gradient(w -> loss(w, x, y), w0)

Note that you can make this more efficient by providing ForwardDiff with all the memory it needs to compute the gradient up front (e.g. an output array and some work buffers):

lossgradient!(out, w0, x, y, cfg) = ForwardDiff.gradient!(out, w -> loss(w, x, y), w0, cfg)

# allocate a bunch of pre-configured, reusable work buffers needed by ForwardDiff.gradient
cfg = ForwardDiff.GradientConfig(w0) 

# allocate an output array for the gradient
out = similar(w0)

# now you can call this over and over again with the same `out` and `cfg` (note that `out` will get overwritten)
lossgradient!(out, w0, x, y, cfg)

#8

Ah, I was assuming in my example that w is a weight matrix, and didn’t see you had an extra bias parameter there.

w here, as written, is a 2-element Vector{Any} where the first element is a 1xp weight matrix and the second is a single scalar.

This probably isn’t what you actually want, since when you index into w to retrieve the objects, Julia will infer the returned objects’ types to be Any (since w is of type Vector{Any}) and won’t be able to generate fast code.

It might be better to make them two separate objects (or a single, concretely-typed array, e.g. Vector{Float64}). That way, Julia can figure out their exact types (instead of being forced to infer their types as Any, which will result in slow code).

Here’s what it’d look like to make them two separate, well-inferred objects:

w = 0.0001*randn(1,p)
b = 0 # writing this like you wrote it, but did you mean for this to be a vector?

Then you can rewrite your code to accept a b argument, and use ForwardDiff to differentiate w.r.t. to w, b, etc.


#9

Many thanks for the explanation. Though I’m not sure I follow fully. The
algorithm should work so that the points to take the gradient over (what
you refer to as w0) should be the old estimates of w. In other words, w0
needs to be indexed in some way so that it relates to the correct w. I
cannot see how w, w0 and out relates to each other. A somewhat sloppy
pseudo code would be:

Merge x and y so that the variables are in the rows (of dim p)
Define loss = (y - w * x) ^ 2 / n
Initialize w of length p and set i = 1
Set stepsize lr = 0.1
For i = 2 to niter
  For j = 1 to p
     w[i, j] = w[i-1, j] - lossgradient(w[i-1, j]) * lr
  End
End

#10

Here’s a more complete example of the kind of thing I’m recommending (note that I’m not going to go the trouble of preallocating any memory as I’ve previously shown, but you can make this more efficient if you do):

loss(w,b,x,y) = sumabs2(y - (w*x .+ b)) / size(y,2)

# get gradient w.r.t to `w` 
loss∇w(w, b, x, y) = ForwardDiff.gradient(w -> loss(w, b, x, y), w)

# get derivative w.r.t to `b` (`ForwardDiff.derivative` is 
# used instead of `ForwardDiff.gradient` because `b` is 
# a scalar instead of an array)
lossdb(w, b, x, y) = ForwardDiff.derivative(b -> loss(w, b, x, y), b)

function train(w, b, data; lr=.1)
    for (x, y) in data
        w -= scale!(lr, loss∇w(w, b, x, y))
        b -= lr * lossdb(w, b, x, y)
    end
    return w, b
end

Like I said, there are many opportunities here to improve this code by reusing memory and reducing allocations, but I’ll leave that as an exercise to the reader :slight_smile:

EDIT: Sorry, I forgot to remove the ! from gradient! when I copy-pasted my other example.


#11

Sorry to bother again, but I cannot get it to work. By the way, how do I get the code in proper format?

using ForwardDiff

n = 100
p = 10
x = randn(n,p)'
y = sum(x[1:5,:],1) .+ randn(n)'*0.1
w = 0.0001*randn(1,p)
b = 0.0

loss(w,b,x,y) = sumabs2(y - (w*x .+ b)) / size(y,2)

# get gradient w.r.t to `w` 
loss∇w(w, b, x, y) = ForwardDiff.gradient!(w -> loss(w, b, x, y), w)

# get derivative w.r.t to `b` (`ForwardDiff.derivative` is 
# used instead of `ForwardDiff.gradient` because `b` is 
# a scalar instead of an array)
lossdb(w, b, x, y) = ForwardDiff.derivative(b -> loss(w, b, x, y), b)

function train(w, b, x, y ; lr=.1)
    for j=1:p
        w -= scale!(lr, loss∇w(w, b, x, y))
        b -= lr * lossdb(w, b, x, y)
    end
    return w, b
end

for i=1:50; train(w, b, x, y); println(loss(w,b,x,y)); end

#12

Remove the exclamation mark from gradient!. Also, please quote your code.


#13

I’m afraid that won’t help. It just inherits the values through the iterations.


#14

for i=1:50; w,b = train(w, b, x, y); println(loss(w,b,x,y)); end


#15

Ok, now I think we ended up with the final code. Many thanks for all your help.

using ForwardDiff

n = 100
p = 10
x = randn(n,p)'
y = sum(x[1:5,:],1) .+ randn(n)'*0.1
w = 0.0001*randn(1,p)
b = 0.0

# squared error loss function
loss(w,b,x,y) = sumabs2(y - (w*x .+ b)) / size(y,2)

# get gradient w.r.t to `w`
loss∇w(w, b, x, y) = ForwardDiff.gradient(w -> loss(w, b, x, y), w)

# get derivative w.r.t to `b` (`ForwardDiff.derivative` is
# used instead of `ForwardDiff.gradient` because `b` is
# a scalar instead of an array)
lossdb(w, b, x, y) = ForwardDiff.derivative(b -> loss(w, b, x, y), b)

# proximal gradient descent function
function train(w, b, x, y ; lr=.1)
        w -= scale!(lr, loss∇w(w, b, x, y))
        b -= lr * lossdb(w, b, x, y)
    return w, b
end

# iterator that collects the weights and bias, and prints the loss
for i=1:50; w, b = train(w, b, x, y); println(loss(w,b,x,y));

#16

By the way, how do I get the code in proper format?

You can quote your code in Markdown by wrapping text in single backticks for inline code, or triple backticks for whole code blocks. Here’s some useful documentation on Markdown formatting (note that those docs are for GitHub-flavored markdown; the actual markdown specification docs are here).


#17

I have done some comparison of ForwardDiff with AutoGrad. It turns out to be rather large performance differences, at least the way it is implemented now. Any ideas why these large timing and memory differences?

n = 500
p = 1000
x = randn(n,p)'
y = sum(x[1:5,:],1) .+ randn(n)'*0.1

using ForwardDiff
w = 0.0001*randn(1,p)
b = 0.0

# squared error loss function
loss(w,b,x,y) = sumabs2(y - (w*x .+ b)) / size(y,2)

# get gradient w.r.t to `w`
loss∇w(w, b, x, y) = ForwardDiff.gradient(w -> loss(w, b, x, y), w)

# get derivative w.r.t to `b` (`ForwardDiff.derivative` is
# used instead of `ForwardDiff.gradient` because `b` is
# a scalar instead of an array)
lossdb(w, b, x, y) = ForwardDiff.derivative(b -> loss(w, b, x, y), b)

# proximal gradient descent with soft threshold, i.e. lasso
function train(w, b, x, y,lambda ; lr=.1)
        w -= scale!(lr, loss∇w(w, b, x, y))
        b -= lr * lossdb(w, b, x, y)
        w[(w.<(lr * lambda))&(w.>-(lr * lambda))] = 0
    return w, b
end

# iterator that collects the weights and bias, and prints the loss
niter = 50
lambda = 1.0
@time for i=1:niter; w, b = train(w, b, x, y, lambda); println(loss(w,b,x,y)); end

20.764755 seconds (84.48 k allocations: 640.544 MB, 0.58% gc time)

using AutoGrad
w = [0.0001*randn(1,p),0.0]

loss(w) = sumabs2(y - (w[1]*x .+ w[2])) / size(y,2)
lossgradient = grad(loss)
function train(w, x, y, lambda; lr=.1)
        g = lossgradient(w)
        w[1] -= lr * g[1]
        w[2] -= lr * g[2]
        w[1][-(lr * lambda) .< w[1] .< (lr * lambda)] = 0
    return w
end

niter = 50
lambda = 1.0
@time for i=1:niter; w = train(w, x, y,lambda); println(loss(w)); end

0.277986 seconds (11.04 k allocations: 3.942 MB, 4.70% gc time)


#18

ForwardDiff.jl, as its name states, uses forward-mode automatic differentiation. Forward-mode AD is efficient for function R -> R^n, but for function R^n -> R it is terribly slow. In fact, runtime of forward-mode linearly depends on n.

Typical loss function like yours is R^n -> R, i.e. it takes multiple input parameters (vector w and scalar b) and returns exactly 1 number - value of the loss at that point. For such functions reverse-mode AD. Reverse-mode AD propagates derivatives from top-most node (the loss itself) down to input parameters. Since there’s only one value at the top, tracking derivatives backward is much more efficient for loss functions. gains

AutoGrad.jl is one of the packages that implements reverse-mode AD and thus calculates gradients much more efficiently than ForwardDiff.jl. But Julia has many others, e.g.:


#19

Remember not to do timing in global scope, and to time only the second or later run of a given function.

You may also want to compare with the new ReverseDiff.jl package.


#20

If anybody is curious about the ReverseDiff vs. AutoGrad performance (just timing the gradients, not the training):

Set up code:

using BenchmarkTools

# these are constants (following your code)
const n = 500
const p = 1000
const x = randn(n, p)'
const y = sum(x[1:5,:],1) .+ randn(n)'*0.1

# these are the parameters
w = 0.0001*randn(1,p)
b = [0.0]
input = (w, b)

Autograd version:

julia> begin
           using AutoGrad
           # tried to do `sum(abs2.(...))` since `sumabs2` is getting 
           # deprecated, but AutoGrad threw an error so I switched back
           loss(input) = sumabs2(y - ((input[1] * x) .+ input[2][1])) / size(y, 2)
           loss∇ = grad(loss)
           @benchmark loss∇($input)
       end
BenchmarkTools.Trial:
  memory estimate:  43.11 kb
  allocs estimate:  200
  --------------
  minimum time:     1.801 ms (0.00% GC)
  median time:      1.836 ms (0.00% GC)
  mean time:        1.861 ms (0.49% GC)
  maximum time:     7.337 ms (69.84% GC)

ReverseDiff version:

julia> begin
           using ReverseDiff
           output = map(zeros, input)
           loss(w, b) = sum(abs2.(y - ((w * x) .+ b[1]))) / size(y, 2)
           loss∇! = ReverseDiff.compile_gradient(loss, input)
           @benchmark loss∇!($output, $input)
       end
BenchmarkTools.Trial:
  memory estimate:  48.00 bytes
  allocs estimate:  2
  --------------
  minimum time:     1.621 ms (0.00% GC)
  median time:      1.637 ms (0.00% GC)
  mean time:        1.648 ms (0.00% GC)
  maximum time:     2.617 ms (0.00% GC)