# Does Zygote differentiate symbolically?

My question is as simple as it says in the title. Does Zygote differentiate symbolically?

No, automatic differentiation (AD) is distinct from symbolic differentiation. Zygote performs AD, primarily in reverse mode, by source rewriting.

If it does, can the results be seen in some way?

Sort of, although it is not very helpful. There is an example in the documentation.

Are there other packages that differentiate symbolically + previous question?

Symbolics.jl can differentiate symbolic expressions.

How (and where) does Zygote overload one single quote?

I didn’t know about this! Can you point out an example of the syntax you mean?

f'(x) denotes the derivative of f(x).

E.g. sin'(0) returns 1.

In my mind, I imagined that differentiation could be done symbolically or numerically, and that automatic differentiation obviated the need to do it oneself by doing one of them. What does Zygote do exactly (not exactly, a broad outline will do fine)?

It sounds like you’re not familiar with the concept of AD. It’s worth starting with the simplified problem of using dual numbers; see the DualNumbers.jl package (JuliaDiff/DualNumbers.jl on GitHub), a Julia package for representing dual numbers and performing dual algebra.
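To make the dual-number idea concrete, here is a minimal sketch in plain Julia. The `Dual` type and the `derivative` helper are names made up for this illustration and are not the DualNumbers.jl API; only `+`, `*` and `sin` are covered.

```julia
# Minimal dual number: a value plus the coefficient of an infinitesimal ε with ε^2 = 0.
# `Dual` and `derivative` are illustrative names, not the DualNumbers.jl API.
struct Dual
    val::Float64   # f(x)
    der::Float64   # f'(x)
end

# Each overloaded primitive propagates its exact derivative rule.
Base.:+(a::Dual, b::Dual) = Dual(a.val + b.val, a.der + b.der)
Base.:*(a::Dual, b::Dual) = Dual(a.val * b.val, a.der * b.val + a.val * b.der)
Base.sin(a::Dual) = Dual(sin(a.val), cos(a.val) * a.der)

# Seed the input with derivative 1 and read the derivative off the result.
derivative(f, x) = f(Dual(x, 1.0)).der

g(x) = x * sin(x) + x
derivative(g, 2.0)   # agrees with sin(2) + 2cos(2) + 1 up to rounding
```

No step size appears anywhere: every primitive contributes its analytic rule, and the rules are combined numerically as the program runs.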

Promoting our lectures for students, you can read about AD here
https://juliateachingctu.github.io/Scientific-Programming-in-Julia/dev/lecture_08/lecture/
it is meant to be very introductory

Oh, that is Base.adjoint, which can also be spelled ' . The easy way to find the new method is to use @edit sin'(0) in a REPL; that will bring up the defining line in your editor.
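As a rough illustration of how a postfix ' can be given a "derivative" meaning, the sketch below adds a `Base.adjoint` method for a hypothetical wrapper type. `WithDerivative` is invented for this example and is not how Zygote implements it; Zygote's own method is the one that `@edit` will show you.

```julia
# `WithDerivative` is a hypothetical wrapper: a function paired with a hand-written derivative.
struct WithDerivative{F,G}
    f::F
    df::G
end

# Make the wrapper callable like the function it wraps.
(w::WithDerivative)(x) = w.f(x)

# The postfix ' operator calls Base.adjoint, so this method gives w' a meaning.
Base.adjoint(w::WithDerivative) = w.df

mysin = WithDerivative(sin, cos)
mysin(0.0)    # sin(0.0)
mysin'(0.0)   # cos(0.0)
```

Defining the method on a type you own avoids changing the behaviour of ' for anyone else's types.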

This is why I prefer the other interpretation of the AD abbreviation: Algorithmic Differentiation. I think it reflects the essence better than Automatic Differentiation. Anyway, AD is distinct from both symbolic and numerical differentiation; see post #5 by zdenek_hurak in the thread “Simple numerical differentiation”.

OK, I’m going to summarise what I have gleaned from links in the answers (and to some extent the answers themselves). This is so that others can identify any misunderstandings I might be labouring under.

Let me start by refining my question a bit (so that we do not have diverging ideas about what a certain concept or term means in this context). I will be giving a talk about Julia at my university department (at some point in the near future). I (and I think my audience) want to know whether Zygote uses finite differences to calculate derivatives.

As far as I understand it, derivatives are taken from rules and associated with the corresponding values. These derivatives are then evaluated and accumulated in numeric form.

To evaluate a derivative we need its arguments, meaning that the intermediate results from evaluating the overall function have to be kept if we want to accumulate the derivatives from the outermost function (in ML typically the loss function) inwards.

A simple explanation of the differences between Algorithmic Differentiation (A), Numeric Differentiation (N) and Symbolic Differentiation (S) could be shown in a figure:

                                                        ┏━━━━━┳━━━━━━━┓
                                                        ┃ sym ┃  num  ┃
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━╋━━━━━━━┫
┃differentiating primitive parts ("leaves in the graph")┃ A,S ┃ (A),N ┃
┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━╋━━━━━━━┫
┃accumulating the derivative ("non-leaves")             ┃  S  ┃  A,N  ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┻━━━━━┻━━━━━━━┛


Here sym and num indicate whether the part in question is calculated symbolically or numerically. I surmise that Algorithmic Differentiation could potentially fall back to numerical evaluation of “unfamiliar” nodes, and that is why I put an A in parentheses in the upper right corner.

Apart perhaps from the minutest semantic details, I would like to know if this is a correct description.

Links to introductory or survey journal articles:

https://www.jstor.org/stable/24103956

Thanks again for all the references.

I have above given a short explanation, based on what I have learned from said references. What I would like to know is:

Is this explanation correct? If not, what is wrong?

No. Zygote and other AD systems never use finite differences.

Essentially, they accumulate vector–Jacobian products, working backwards from outputs to inputs (for reverse-mode AD; or Jacobian–vector products working from input to output for forward mode). These products are computed by an equivalent of symbolic differentiation for individual computational steps, expressed in low-level compiler building blocks rather than in high-level symbolic expressions/code. If it hits a function call that it cannot analyze (e.g. foreign function calls, mutation, …) then it fails with an error, and you need to supply a manual vector–Jacobian product (an rrule or “pullback” via ChainRules.jl for Zygote) for that step.
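A hand-written sketch of that reverse accumulation, in plain Julia, might look like the following. The helper names (`mul_with_pullback`, `sin_with_pullback`, `value_and_gradient`) are invented for illustration; Zygote generates the equivalent code from compiler IR, and real rules would be written as ChainRules.jl `rrule`s rather than like this.

```julia
# Hypothetical helpers: each primitive returns its value and a pullback closure.
# The pullback maps a sensitivity of the output to sensitivities of the inputs,
# i.e. it applies the vector-Jacobian product for that single step.
mul_with_pullback(a, b) = (a * b, ybar -> (ybar * b, ybar * a))
sin_with_pullback(a) = (sin(a), ybar -> ybar * cos(a))

# Differentiate h(x1, x2) = sin(x1 * x2): run forward, then pull back in reverse order.
function value_and_gradient(x1, x2)
    p, back_mul = mul_with_pullback(x1, x2)   # forward pass; closures keep x1, x2
    y, back_sin = sin_with_pullback(p)        # closure keeps p
    pbar = back_sin(1.0)                      # seed the output sensitivity with 1
    x1bar, x2bar = back_mul(pbar)             # accumulate towards the inputs
    return y, (x1bar, x2bar)
end

value_and_gradient(2.0, 3.0)   # value sin(6), gradient (3cos(6), 2cos(6))
```

The closures hold on to the intermediate values from the forward pass, and a single backward sweep yields the derivative with respect to every input.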

So, I conclude that my description was correct. To clarify: I only stated that I could not guarantee that numerical differentiation is never used in some algorithmic differentiation system. Also, I can no longer edit the post in question.

Also, if I get to write my own differentiation rules for unknown functions, what’s to say that I don’t write a rule like $\frac{df}{dx}=\frac{f(x+\delta)-f(x)}{\delta}$?

Is there any non-semantic reason that an algorithmic differentiation system couldn’t use a numeric derivative for a certain group of primitives?

Sure, you could do that. You could also write an incorrect derivative rule if you want. Or your derivative rule could query ChatGPT. Or your function could send an email to your high-school calculus teacher and wait until it receives a response. Code can do lots of things.

There’s nothing physically preventing it. Most people would consider that a bug in an AD system, though.

(A finite difference like that would be noticeable, even without looking inside the code, for having the wrong performance scaling — its cost scales proportional to the number of inputs, whereas reverse-mode AD scales proportional to the number of outputs — as well as unacceptably large numerical errors.)
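Both effects are easy to observe with a small experiment. The sketch below uses a hypothetical `fd_gradient` helper (not a library function) with an arbitrarily chosen step of 1e-6, and counts how often the test function is called.

```julia
# Forward-difference gradient; `fd_gradient` is an illustrative helper, not a library function.
# It needs one extra evaluation of f per input, so n inputs cost n + 1 evaluations.
function fd_gradient(f, x::Vector{Float64}; delta = 1e-6)
    fx = f(x)
    grad = similar(x)
    for i in eachindex(x)
        xp = copy(x)
        xp[i] += delta                   # perturb one input at a time
        grad[i] = (f(xp) - fx) / delta
    end
    return grad
end

# Count evaluations of a test function whose exact gradient is cos.(x).
ncalls = Ref(0)
counted(x) = (ncalls[] += 1; sum(sin, x))

x = collect(range(0.1, 1.0; length = 10))
g = fd_gradient(counted, x)
ncalls[]                       # 11 evaluations for 10 inputs
maximum(abs.(g .- cos.(x)))    # truncation error, many orders above machine epsilon
```

Shrinking `delta` reduces the truncation error but amplifies rounding error, so the accuracy of this approach cannot be pushed down to machine precision.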

Finite differences are clearly good enough in some applications and perhaps even the only alternative. Should it then be understood that people doing that would have nothing to gain from introducing rule-based derivatives in those places where it’s possible?

You can very often, at the very least, use dual numbers, which are more numerically robust and typically give you higher accuracy.

Only if you consider accuracy and scalability to be nothing. Which maybe they are, if finite differences are good enough and you have other concerns to focus on.

So, if and when they do this, they have implemented an algorithmic differentiation system that partially uses finite differences.

They would then have made a differentiation system that mixes AD and finite differences.

Seems like you are now mainly engaging in semantic bickering, in order to score some far-fetched point. If you think a hybrid system would be useful, then that’s fine, but there’s no reason to insist on it being fully AD.

That is a correct analysis (although I don’t exclude the possibility that a hybrid differentiation system would be useful). The point, the importance of which you so rightly belittle, is that I asked for a non-semantic reason.

The bickering bit could possibly be connected to a certain annoyance that nobody could write the one or two sentences needed to convey the information in my table (possibly without any parenthetical letter), that would have been helpful to understanding both what Zygote does and what the distinction between different differentiation strategies is.

Frankly, I just found the table more confusing because it wasn’t clear what each of the columns/rows/cells meant. In the spirit of a one-liner response though, I remember someone (maybe @ChrisRackauckas?) summarizing it as “AD is like symbolic differentiation, just with = [instead of deep nested expression trees]”. I’m probably butchering the quote, but it should clarify that finite differences don’t even enter the picture here.

Yup that was me.

# Automatic differentiation is symbolic differentiation where instead of using substitution you use assignment (=).

It’s not 100% correct, but it gets you close enough that it’s a good rule of thumb. For example, if you have sin(f(x)), then with symbolic differentiation you get sin'(f(x))f'(x) and you evaluate that expression. But with automatic differentiation you generate code that effectively does:

fx = f(x)
dfx  = f'(x)
sinx = sin(fx)
dsinx  = cos(fx)
return dsinx * dfx


If f is what’s known as a primitive, then f'(x) has been defined and you use that. If it hasn’t been defined in the AD system, then you look into its code and do this same process to all steps, etc.

But if you take what AD gives you and, instead of keeping it as separate operations, substitute everything to build a single final expression, then you get sin'(f(x))f'(x), i.e. the symbolic derivative expression.

## Thinking about Differentiation of Languages

One way of describing this then is that symbolic differentiation is limited by the semantics of “standard mathematical expressions”, and AD is simply rewriting it in a language that allows for assignment. AD is symbolic differentiation in the language of SSA IR, i.e. computer code. So in a sense I think it’s fine to say Zygote is doing symbolic differentiation on Julia code.

When we say “symbolic differentiation”, we normally mean differentiating in the language of mathematical expressions: if you take Symbolics.jl and use @variables x; f(x), it will generate a mathematical expression without any computational aspects and then perform the differentiation in the context of the purely mathematical language:

using Symbolics
@variables x
function f(x)
    out = one(x)
    for i in 1:5
        out *= x^i
    end
    out
end
sin(f(x)) # sin(x^15)


Evaluation with symbolic variables completely removes the “non-mathematical” computational expressions, and then we symbolically differentiate in this language:

Symbolics.derivative(sin(f(x)),x) # 15(x^14)*cos(x^15)


Note the expression blow-up: we take an entire computational expression, squash it down to a single mathematical formula and differentiate it, which can cause exponential growth in the size of the expressions you’re building and differentiating. This is the downside of symbolic differentiation.

function f(x)
    out = x
    for i in 1:5
        out *= sin(out)
    end
    out
end
sin(f(x)) # sin(x*sin(x)*sin(x*sin(x))*sin(x*sin(x)*sin(x*sin(x)))*sin(x*sin(x)*sin(x*sin(x))*sin(x*sin(x)*sin(x*sin(x))))*sin(x*sin(x)*sin(x*sin(x))*sin(x*sin(x)*sin(x*sin(x)))*sin(x*sin(x)*sin(x*sin(x))*sin(x*sin(x)*sin(x*sin(x))))))

Symbolics.derivative(sin(f(x)),x) # (sin(x)*sin(x*sin(x))*sin(x*sin(x)*sin(x*sin(x)))*sin(x*sin(x)*sin(x*sin(x))*sin(x*sin(x)*sin(x*sin(x))))*sin(x*sin(x)*sin(x*sin(x))*sin(x*sin(x)*sin(x*sin(x)))*sin(x*sin(x)*sin(x*sin(x))*sin(x*sin(x)*sin(x*sin(x))))) + x*cos(x)*sin(x*sin(x))*sin(x*sin(x)*sin(x*sin(x)))*sin(x*sin(x)*sin(x*sin(x))*sin(x*sin(x)*sin(x*sin(x))))*sin(x*sin(x)*sin(x*sin(x))*sin(x*sin(x)*sin(x*sin(x)))*sin(x*sin(x)*sin(x*sin(x))*sin(x*sin(x)*sin(x*sin(x))))) + x*(sin(x)*sin(x*sin(x))*sin(x*sin(x)*sin(x*sin(x))) + x*cos(x)*sin(x*sin(x))*sin(x*sin(x)*sin(x*sin(x))) + x*(sin(x)*sin(x*sin(x)) + x*cos(x)*sin(x*sin(x)) + x*(x*cos(x) + sin(x))*sin(x)*cos(x*sin(x)))*sin(x)*sin(x*sin(x))*cos(x*sin(x)*sin(x*sin(x))) + x*(x*cos(x) + sin(x))*sin(x)*cos(x*sin(x))*sin(x*sin(x)*sin(x*sin(x))))*sin(x)*sin(x*sin(x))*sin(x*sin(x)*sin(x*sin(x)))*cos(x*sin(x)*sin(x*sin(x))*sin(x*sin(x)*sin(x*sin(x))))*sin(x*sin(x)*sin(x*sin(x))*sin(x*sin(x)*sin(x*sin(x)))*sin(x*sin(x)*sin(x*sin(x))*sin(x*sin(x)*sin(x*sin(x))))) + x*(x*cos(x) + sin(x))*sin(x)*cos(x*sin(x))*sin(x*sin(x)*sin(x*sin(x)))*sin(x*sin(x)*sin(x*sin(x))*sin(x*sin(x)*sin(x*sin(x))))*sin(x*sin(x)*sin(x*sin(x))*sin(x*sin(x)*sin(x*sin(x)))*sin(x*sin(x)*sin(x*sin(x))*sin(x*sin(x)*sin(x*sin(x))))) + x*(sin(x)*sin(x*sin(x)) + x*cos(x)*sin(x*sin(x)) + x*(x*cos(x) + sin(x))*sin(x)*cos(x*sin(x)))*sin(x)*sin(x*sin(x))*cos(x*sin(x)*sin(x*sin(x)))*sin(x*sin(x)*sin(x*sin(x))*sin(x*sin(x)*sin(x*sin(x))))*sin(x*sin(x)*sin(x*sin(x))*sin(x*sin(x)*sin(x*sin(x)))*sin(x*sin(x)*sin(x*sin(x))*sin(x*sin(x)*sin(x*sin(x))))) + x*(sin(x)*sin(x*sin(x))*sin(x*sin(x)*sin(x*sin(x)))*sin(x*sin(x)*sin(x*sin(x))*sin(x*sin(x)*sin(x*sin(x)))) + x*cos(x)*sin(x*sin(x))*sin(x*sin(x)*sin(x*sin(x)))*sin(x*sin(x)*sin(x*sin(x))*sin(x*sin(x)*sin(x*sin(x)))) + x*(sin(x)*sin(x*sin(x)) + x*cos(x)*sin(x*sin(x)) + x*(x*cos(x) + sin(x))*sin(x)*cos(x*sin(x)))*sin(x)*sin(x*sin(x))*cos(x*sin(x)*sin(x*sin(x)))*sin(x*sin(x)*sin(x*sin(x))*sin(x*sin(x)*sin(x*sin(x)))) + 
x*(x*cos(x) + sin(x))*sin(x)*cos(x*sin(x))*sin(x*sin(x)*sin(x*sin(x)))*sin(x*sin(x)*sin(x*sin(x))*sin(x*sin(x)*sin(x*sin(x)))) + x*(sin(x)*sin(x*sin(x))*sin(x*sin(x)*sin(x*sin(x))) + x*cos(x)*sin(x*sin(x))*sin(x*sin(x)*sin(x*sin(x))) + x*(sin(x)*sin(x*sin(x)) + x*cos(x)*sin(x*sin(x)) + x*(x*cos(x) + sin(x))*sin(x)*cos(x*sin(x)))*sin(x)*sin(x*sin(x))*cos(x*sin(x)*sin(x*sin(x))) + x*(x*cos(x) + sin(x))*sin(x)*cos(x*sin(x))*sin(x*sin(x)*sin(x*sin(x))))*sin(x)*sin(x*sin(x))*sin(x*sin(x)*sin(x*sin(x)))*cos(x*sin(x)*sin(x*sin(x))*sin(x*sin(x)*sin(x*sin(x)))))*sin(x)*sin(x*sin(x))*sin(x*sin(x)*sin(x*sin(x)))*sin(x*sin(x)*sin(x*sin(x))*sin(x*sin(x)*sin(x*sin(x))))*cos(x*sin(x)*sin(x*sin(x))*sin(x*sin(x)*sin(x*sin(x)))*sin(x*sin(x)*sin(x*sin(x))*sin(x*sin(x)*sin(x*sin(x))))))*cos(x*sin(x)*sin(x*sin(x))*sin(x*sin(x)*sin(x*sin(x)))*sin(x*sin(x)*sin(x*sin(x))*sin(x*sin(x)*sin(x*sin(x))))*sin(x*sin(x)*sin(x*sin(x))*sin(x*sin(x)*sin(x*sin(x)))*sin(x*sin(x)*sin(x*sin(x))*sin(x*sin(x)*sin(x*sin(x))))))


So then a good way to think about AD is that it’s doing differentiation directly on the language of computer programs. When doing this, you want to build expressions that carry forward the derivative calculation, and generate something that is a computation of the derivative, not a mathematical expression of it.

On that same example, this looks like:

function f(x)
    out = x
    for i in 1:5
        # sin(out) => chain rule sin' = cos
        tmp = (sin(out[1]), out[2] * cos(out[1]))
        # out = out * tmp => product rule
        out = (out[1] * tmp[1], out[1] * tmp[2] + out[2] * tmp[1])
    end
    out
end
function outer(x)
    # sin(x) => chain rule sin' = cos
    out1, out2 = f(x)
    sin(out1), out2 * cos(out1)
end
dsinfx(x) = outer((x,1))[2]

f((1,1)) # (0.01753717849708632, 0.36676042682811677)
dsinfx(1) # 0.3667040292067162


See this vs

julia> substitute(sin(f(x)),x=>1)
0.017536279576682495

julia> substitute(Symbolics.derivative(sin(f(x)),x),x=>1)
0.3667040292067162


You can see the symbolic aspects in there: it uses the analytical derivative of sin being cos, and it uses the product rule in the code it generates. Those are the primitives. But you then use an extra variable to accumulate the derivative, because again you’re working in the language of computer programs with for loops and all, and you are taking the derivative of a computational expression to get another computational expression.

The advantage, of course, is that things like control flow, which have no simple representation in mathematical language, have a concise computational description, and you can avoid ever building the exponentially large mathematical expressions described above.
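To illustrate the control-flow point, here is a hand-transformed sketch in the same (value, derivative) style for a Babylonian square root whose loop exits depending on the data. The function name `sqrt_with_derivative`, the stopping rule and the tolerance are assumptions made for this example; an AD system would generate something comparable rather than this exact code.

```julia
# Hand-written forward-mode version of a Babylonian square root.
# Every value is carried together with its derivative with respect to the input a.
function sqrt_with_derivative(a, da = 1.0; rtol = 1e-12, maxiter = 50)
    x, dx = a, da                        # initial guess x = a, so dx = da
    for _ in 1:maxiter
        # q = a / x  => quotient rule
        q, dq = a / x, (da * x - a * dx) / x^2
        # x = (x + q) / 2  => linearity of the derivative
        x, dx = (x + q) / 2, (dx + dq) / 2
        # data-dependent exit: the number of iterations depends on a
        abs(x * x - a) <= rtol * a && break
    end
    return x, dx
end

sqrt_with_derivative(4.0)   # close to (2.0, 0.25), since d/da sqrt(a) = 1 / (2 sqrt(a))
```

The derivative code simply follows whichever iterations the original program takes, so no closed-form expression for the loop is ever needed.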
