ANN: XDiff.jl - an expression differentiation package


#1

Although it’s not quite stable yet, people seem to be interested in the topic, so I think it’s time to announce a package I’ve been working on lately.

XDiff.jl is an expression differentiation package, supporting fully symbolic approach to finding tensor derivatives. Unlike automatic differentiation packages, XDiff.jl can output not only ready-to-use derivative functions but also their symbolic expressions suitable for further optimization and code generation. Here’s an example:

function ann(w1, w2, w3, x1)
    _x2 = w1 * x1
    x2 = log(1. + exp(_x2))   # soft RELU unit
    _x3 = w2 * x2
    x3 = log(1. + exp(_x3))   # soft RELU unit
    x4 = sum(w3 * x3)
    return 1.0 ./ (1.0 + exp(-x4))  # sigmoid output
end

# ANN input parameter types
types = (Matrix{Float64}, Matrix{Float64}, Matrix{Float64}, Vector{Float64})

# generate example input
w1, w2, w3, x1 = randn(10,10), randn(10,10), randn(1,10), randn(10)

# create a dict of symbolic derivatives
dexs = rdiff(ann, types)
dexs[:w1]   # ==> quote ... end

# create derivative functions
dw1, dw2, dw3, _ = fdiff(ann, types)
dw1(randn(100,100), randn(100,100), randn(1,100), randn(100))

Another unique feature of XDiff.jl is that it can generate expressions not only for functions R^n -> R, but also functions R^n -> R^m using Einstein indexing notation:

ctx = [:outfmt => :ein]
dexs = rdiff(:(z = W*x + b); ctx=ctx, W=rand(3,4), x=rand(4), b=rand(3))
dexs[:W]   # ==> :(dz_dW[i,m,n] = x[n] * (i == m))

Go through the whole README or install XDiff.jl directly using:

Pkg.clone("https://github.com/dfdx/XDiff.jl")

#2

how does this relate to ReverseDiffSource.jl?


#3

Side comment: as a suggestion, don’t use floating numbers with trailing dot. They could be made illegal in Julia at some point (see https://github.com/JuliaLang/julia/issues/15731), right now they are not allowed as a juxtaposed numeric literal coefficient (1.x), and they can lead to unexpected results, like in 1.+1.


#4

XDiff.jl takes roots in ReverseDiffSource.jl, hence similar names and semantics. However, there are several important differences:

  1. ReverseDiffSource can’t handle nested functions. While XDiff isn’t perfect in this sense too, it can handle all scalar and most element-wise and broadcasting functions. For example, there’s derivative only for log(x) (single argument) defined in the package, but if you call rdiff on log(b, x) (two arguments), it will extract code for this method, infer a new differentiation rule for it and add to cache. In practice, this means that you don’t need to define your cost functions as a single long function but can easily decompose it into a number of smaller ones as you would do in normal code.

  2. XDiff supports differentiation of vector-valued functions. In the context of ML, it’s mostly valuable for analysis only since cost functions are normally scalar-valued. But I’ve already encountered a couple of use cases (e.g. in finance) where both - input and output of a function - are vectors.

  3. The final step of differentiation is code generation. Currently, there are only 2 formats available - vectorized and Einstein notation - but the final goal is to add pluggable code generators. Imagine that with a single argument you can produce vectorized code or code with fused loops, or BLAS-optimized, or code for GPU, etc. Though, this is mostly part of Espresso.jl vision.


#5

Good point, thank you! I’ve updated the example.


#6

thanks for this - are you planning to allow for-loops in the near future?


#7

Maybe some limited subset of them. Do you have in mind some specific example?


#8

there a lot of different ones, but a typical example is

function F(x)
r = 0.0
for n = 1:length(x)
  r += f1(x[n])
end
return f2(r)

in some cases I also need double-loops


#9

I have a couple of ideas how to handle such cases. For example, we could employ the summation rule for derivatives and generate something like:

r = 0.0
dr = 0.0
for n = 1:length(x)
    r += f1(x[n])
    dr += df1(x[n])
end

But I tend to think that such a code may be converted (manually by the a programmer or automatically at parse time) into aggregation function, e.g.:

y = f1.(x)
r = sum(y)
result = f2(r)

Which is much easier to differentiate.

Anyway, feel free to submit a feature request or vote for existing one. Such requests really help to determine the needs of the community and prioritize further tasks.


#10

that’s good to know, thanks for clarifying.


#11

That particular example could be written f2(sum(f1, x)), and the sum(f, x) pattern is presumably something that XDiff would want to handle.


#12

ok, this was maybe too simple an example. I’ll just keep experimenting with both XDiff and ReverseDiffSource to find their respective limitations.


#13

A few updates about this package:

  1. The codebase for 0.1 release is ready. Currently I’m in the process of releasing dependencies, fixing README, etc., so I expect to tag first official version of XDiff in a couple of days.
  2. Generated code is now much faster. In particular, it beats ReverseDiff.jl (which itself is very well done) in all my benchmarks except for the smallest ones (where constant factors outweigh vector operations). See below for details.
  3. The generated code is mostly GPU friendly and should play well with CUDAnative.jl.
  4. I made a POC for convolution operations and it worked smoothly. The only thing that stops me from adding it is a lack of common function implementations for convolutional neural networks (built-in conv2 is too simple for real-life applications)

The most important improvement is a new buffered code generator. This generator replaces all matrix multiplications with Ax_mul_Bx!() family of functions, fuses element-wise operations and removes any unused code. As a result, it now runs ~100x faster than a non-optimized version and 1.3-2x faster than ReverseDiff.jl. Here’s the output of a benchmark for autoencoders:

Compiling derivatives using XDiff
 35.168511 seconds (12.59 M allocations: 7.464 GiB, 6.44% gc time)
Testing XDiff...
BenchmarkTools.Trial:
  memory estimate:  9.95 MiB
  allocs estimate:  153
  --------------
  minimum time:     509.478 ms (0.21% GC)
  median time:      518.636 ms (0.00% GC)
  mean time:        545.330 ms (0.08% GC)
  maximum time:     637.074 ms (0.16% GC)
  --------------
  samples:          10
  evals/sample:     1

Testing ReverseDiff...
BenchmarkTools.Trial:
  memory estimate:  4.58 MiB
  allocs estimate:  16
  --------------
  minimum time:     1.011 s (0.28% GC)
  median time:      1.117 s (0.00% GC)
  mean time:        1.096 s (0.05% GC)
  maximum time:     1.133 s (0.00% GC)
  --------------
  samples:          5
  evals/sample:     1

#14

Cool stuff!

This generator replaces all matrix multiplications with Ax_mul_Bx!() family of functions, fuses element-wise operations and removes any unused code.

ReverseDiff also does these things, so I looked into why it was running so slowly for your benchmark.

It turns out that Julia’s new parser-level broadcast fusion was bypassing ReverseDiff’s primitives, so the broadcast operations were getting unrolled in ReverseDiff’s tape. I’m surprised ReverseDiff was even as quick as it was, given the crazy number of operations it was doing!

Future versions of ReverseDiff won’t have this problem - fused broadcasts will be intercepted automatically as their own primitives. The machinery for this optimization actually already exists, and is used for some of ReverseDiff’s opt-in mixed-mode optimizations. The only thing holding back the more general version of this optimization was a tagging system for ForwardDiff’s Dual numbers (since the optimization could’ve resulted in perturbation confusion). Now that Forward has such a system to prevent perturbation confusion, ReverseDiff’s mixed-mode broadcast optimization will no longer need to be explicitly opt-in; it’ll just happen automatically. I just need to get around to updating ReverseDiff to use the latest version of ForwardDiff.

In the meantime, I altered the code to avoid parser fusion so we could get a better idea of ReverseDiff’s actual performance.

Here are the “large data” XDiff results on my machine for benchmark_autoencoder (using your unmodified code):

Compiling derivatives using XDiff
 75.817737 seconds (17.12 M allocations: 7.705 GiB, 5.87% gc time)
Testing XDiff...
BenchmarkTools.Trial:
  memory estimate:  9.95 MiB
  allocs estimate:  153
  --------------
  minimum time:     567.973 ms (0.00% GC)
  median time:      654.563 ms (0.00% GC)
  mean time:        649.178 ms (0.20% GC)
  maximum time:     739.510 ms (0.64% GC)
  --------------
  samples:          8
  evals/sample:     1

Here are the ReverseDiff results for the same benchmark, but using this gist:

BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     611.937 ms (0.00% GC)
  median time:      622.834 ms (0.00% GC)
  mean time:        627.895 ms (0.00% GC)
  maximum time:     670.710 ms (0.00% GC)
  --------------
  samples:          8
  evals/sample:     1

Fate of ReverseDiffSource
#15

Good to know that there’s still space for improvement!

It turns out that Julia’s new parser-level broadcast fusion was bypassing ReverseDiff’s primitives, so the broadcast operations were getting unrolled in ReverseDiff’s tape.

Do I understand it correctly that Julia’s parser was transforming the code into a form (probably broadcast!(...) kind of things) that ReverseDiff can’t understand and thus can’t use it’s own optimizations for broadcasting?

BTW, not sure how applicable it’s for ReverseDiff, but Espresso.jl (a backbone of XDiff) can do “unfusion” of expressions automatically:

julia> using Espresso

julia>  _, ex = funexpr(autoencoder_cost, map(typeof, vals))
(Symbol[:We1, :We2, :Wd, :b1, :b2, :x], quote 
    firstLayer = logistic.(We1 * x .+ b1)
    encodedInput = logistic.(We2 * firstLayer .+ b2)
    reconstructedInput = logistic.(Wd * encodedInput)
    cost = sum((reconstructedInput .- x) .^ 2.0)
    cost
end)

julia> to_expr(ExGraph(ex))
quote
    tmp1235 = We1 * x
    tmp1236 = tmp1235 .+ b1
    firstLayer = Main.logistic.(tmp1236)
    tmp1238 = We2 * firstLayer
    tmp1239 = tmp1238 .+ b2
    encodedInput = Main.logistic.(tmp1239)
    tmp1241 = Wd * encodedInput
    reconstructedInput = Main.logistic.(tmp1241)
    tmp1243 = reconstructedInput .- x
    tmp1244 = 2.0
    tmp1245 = tmp1243 .^ tmp1244
    cost = sum(tmp1245)
end

#16

Just a curiosity, is it possible to define custom operations and their derivations? For example something that autograd allows?

For example for AutoGrad.jl I have implemented segmented_sum as

function segmented_sum(x,bags)
xx=similar(x,length(bags),size(x,2))
for (i,bag) in enumerate(bags)
xx[i,:]=sum(x[bag,:],1)
end
return(xx)
end

function grad_segmented_sum(x,dy,bags)
l=mapreduce(b->b.stop,max,bags)
xx=zeros(eltype(x),l,size(x,2))
for (i,bag) in enumerate(bags)
for j in bag
xx[j,:]=dy[i,:]
end
end
return(xx)
end

@primitive segmented_sum(x,bags),dy,y grad_segmented_sum(x,dy,bags)

where the last line is the registration to the autograd.jl package.

Thanks for and answer. As I have said, this is really a curiosity.


#17

In general, yes, but since XDiff is quite different from AutoGrad, available features are also different. For scalars everything is pretty straightforward, e.g. the following example from AutoGrad:

@primitive hypot(x1::Number,x2::Number),dy,y  (dy->dy*x1/y)  (dy->dy*x2/y)

may be translated into the following in XDiff:

@diff_rule hypot(x1::Number,x2::Number) 1 (x1 / hypot(x1, x2))
@diff_rule hypot(x1::Number,x2::Number) 2 (x2 / hypot(x1, x2))

(I’m thinking to change syntax to @diff_rule y = hypot(x1::Number,x2::Number) 1 (x1 / y), but even in current version common subexpressions should be eliminated anyway, so the performance should be the same in both cases).

Tensor operations are more tricky. Most (reverse-mode) differentiation packages assume that your cost function is a scalar and you propagate derivatives strictly backward. XDiff, on other hand, doesn’t make any assumptions and allows you to define derivatives of tensor-to-tensor functions (although I’m not sure how stable such derivatives currently are, I haven’t tested them for a while). Here’s an example of tensor differentiation rule:

@tdiff_rule (Z[i,j] = X[i,k] * Y[k,j]) (dZ[i,j]/dX[m,n] = Y[n,j] * (i == m))

which reads as “for matrix product between X and Y derivative of output element Z[i,j] w.r.t. to input element X[m,n] is equal to Y[n,j] when i == m and 0 in all other cases”. This is pretty flexible and covers even things like convolution:

# in Einstein indexing notation, `conv2()` may be expressed as:
# :(Z = conv2(X, W)) => :(Z[i,j] = X[i+m-1, j+n-1] * W[m,n])
@tdiff_rule (Z[i,j] = X[i+m-1, j+n-1] * W[m,n]) (dZ[i,j] / dW[m,n] = X[i+m-1, j+n-1])
@tdiff_rule (Z[i,j] = X[i+m-1, j+n-1] * W[m,n]) (dZ[i,j] / dX[p,q] = W[p-i+1, q-j+1])

You particular case, however, is more tricky than that. If I understand it correctly, the output tensor xx has dynamic size:

l=mapreduce(b->b.stop,max,bags)

If so, we can’t infer symbolic expression for derivatives for element of the output and can’t use current machinery. I have long-term goal to support alternative definition, something like:

@tdiff_rule_new (y = segment_sum(x[:], bags[:])) (dy[:] / dx[:] = grad_segmented_sum(x[:], dy[:], bags[:]))

But even in this case the use of such derivatives would be restricted: dynamically sized variables mean that you can’t reuse buffers, and custom logic inside derivative means that it would be harder to translate to GPU.


#18

Kind of. ReverseDiff is an operator-overloading implementation of AD, so it intercepts function calls, not expressions. It can, in fact, intercept and differentiate any broadcast call as a single primitive by differentiating the broadcast kernel using forward-mode (an example where mixed-mode AD really shines). Fusion here actually is a performance benefit, because it reduces the overall number of primitives in the graph.

Unfortunately, automatic application of this technique can be unsafe due to potential perturbation confusion. Thus, ReverseDiff uses overly-strict dispatch to ensure that this technique only gets applied on “known” kernels (e.g. *, +, ^, etc.) or kernels which the user has annotated with @forward (meaning, “I’ll allow ReverseDiff to use forward-mode AD for this function”).

Since ForwardDiff now has a tagging system (i.e. provides confusion-safe dual numbers), however, we can loosen this restriction, and apply the technique automatically for any broadcast call! In general, this should be more efficient than unfusing the calls and differentiating them separately.


#19

Thanks, I was expecting this to be a trouble. This operation was a reason for me to write a custom library for neural networks. This operation is now supported by TensorFlow and I made it to work with Knet, but the speed of Knet was terrible.


#20

Challenge accepted! Essentially, XDiff is very similar in spirit to TensorFlow, just doesn’t provide that many tweaks yet. In particular, TF makes it possible to:

  1. Explicitly specify output size (SetShapeFn) and explicitly allocate buffers for it (context->allocate_output).
  2. Write separate kernels for CPU and CUDA, as well as their gradients.

XDiff currently does it automatically, but it’s possible to provide user-level handlers to control it. Note, that in principle segmented_sum() shouldn’t be different from things like sum or * in current implementation.

If you point me to the TF implementation of segmented_sum, I’ll open an issue to support it (and user-level primitives in general).