And this is exactly how it works. To be precise, here's the line that performs CSE (common subexpression elimination). You can check the result like this (note that this is not part of the external API, though):
julia> using Yota
julia> foo(x) = exp(x)
foo (generic function with 1 method)
julia> _, g = grad(foo, 0.5)
(1.6487212707001282, GradResult(1))
julia> g.tape
Tape
inp %1::Float64
%2 = exp(%1)::Float64
const %3 = 1.0::Float32
%4 = exp(%1)::Float64
%5 = *(%4, %3)::Float64
julia> Yota.generate_function_expr(g.tape)
:(function ##tape_fn#364()
#= /home/slipslop/work/Yota/src/compile.jl:112 =#
#= prologue:0 =#
%1 = (inp %1::Float64).val
%2 = (%2 = exp(%1)::Float64).val
%3 = (const %3 = 1.0::Float32).val
%4 = (%4 = exp(%1)::Float64).val
%5 = (%5 = *(%4, %3)::Float64).val
#= body:0 =#
#= /home/slipslop/.julia/packages/Espresso/Lewh0/src/exgraph.jl:100 =#
%2 = (exp)(%1)
%3 = 1.0f0
%5 = (*)(%2, %3)
#= epilogue:0 =#
(inp %1::Float64).val = %1
(%2 = exp(%1)::Float64).val = %2
(const %3 = 1.0::Float32).val = %3
(%4 = exp(%1)::Float64).val = %4
(%5 = *(%4, %3)::Float64).val = %5
end)
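For intuition, here's a minimal sketch of what tape-level CSE does, using made-up data structures rather than Yota's own: two operations that apply the same function to the same arguments must produce the same value, so the later one is dropped and its uses are rewired to the earlier one.

struct Op
    id::Int
    fn::Symbol
    args::Vector{Int}          # ids of the operations this one consumes
end

function cse(ops::Vector{Op})
    seen  = Dict{Tuple{Symbol,Vector{Int}},Int}()  # (fn, args) => id of first occurrence
    remap = Dict{Int,Int}()                        # duplicate id => surviving id
    kept  = Op[]
    for op in ops
        args = [get(remap, a, a) for a in op.args] # rewrite args through earlier remappings
        key  = (op.fn, args)
        if haskey(seen, key)
            remap[op.id] = seen[key]               # duplicate: reuse the earlier result
        else
            seen[key] = op.id
            push!(kept, Op(op.id, op.fn, args))
        end
    end
    return kept
end

# On the tape above, %4 = exp(%1) duplicates %2, so it is dropped and
# %5 = *(%4, %3) becomes *(%2, %3), which is exactly what the generated body shows.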
Back in the generated function: except for the prologue and epilogue (which are used for buffer pre-allocation, the single most important optimization for large-scale deep learning models), the only code left is:
%2 = (exp)(%1)
%3 = 1.0f0
%5 = (*)(%2, %3)
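To give a feel for what the prologue and epilogue buy, here's a hypothetical sketch of the buffer-reuse pattern for array-valued operations (simplified, with made-up names like TapeBuffers and run_tape!, not Yota's actual code): each operation writes into a buffer that the tape allocates once and reuses on every call.

using LinearAlgebra: mul!

struct TapeBuffers
    x::Matrix{Float64}   # buffer for the input
    y::Matrix{Float64}   # buffer for the single operation in this toy tape
end

TapeBuffers(n::Int) = TapeBuffers(zeros(n, n), zeros(n, n))

function run_tape!(buf::TapeBuffers, x::AbstractMatrix)
    buf.x .= x                   # prologue: load the input into its buffer
    mul!(buf.y, buf.x, buf.x)    # body: compute x * x in place, no allocation
    return buf.y                 # epilogue: the result stays in the tape's buffer
end

buf = TapeBuffers(128)
run_tape!(buf, rand(128, 128))   # repeated calls reuse the same memory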
As for the body itself, we could simplify it further to just exp(%1), but in practice the redundant multiplication by 1.0 makes little difference and is presumably eliminated by the Julia compiler anyway.
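If you want to verify that, one way (my own suggestion, not something from the Yota docs) is to compare the LLVM code Julia emits for the two variants and see whether the multiplication survives:

using InteractiveUtils   # provides @code_llvm outside the REPL

with_mul(x)    = exp(x) * 1.0
without_mul(x) = exp(x)

# If the compiler folds the multiplication by 1.0, the two listings
# should be essentially identical.
@code_llvm with_mul(0.5)
@code_llvm without_mul(0.5)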
Forbidding mutating operations might look limiting, but for comparison, here's a quote from the PyTorch documentation:
Supporting in-place operations in autograd is a hard matter, and we discourage their use in most cases. Autograd’s aggressive buffer freeing and reuse makes it very efficient and there are very few occasions when in-place operations actually lower memory usage by any significant amount. Unless you’re operating under heavy memory pressure, you might never need to use them.