What to return for non-differentiable points

What is the short version?

If the function is not differentiable, choose to return something useful rather than erroring. If a function is not differentiable due to e.g. a branch, like abs, your rule can reasonably claim the derivative at that point is the value from either branch, or any value in between. In particular, for local optima (like in the case of abs) claiming the derivative is 0 is a good idea. Similarly, if the derivative from one side is not defined, or is not finite, return the derivative from the other side. Throwing an error or returning NaN is generally the least useful option.

Calculus says that where the derivative from the left and the derivative from the right disagree, the derivative is not defined. However, contrary to what calculus says, most autodiff systems will return an answer for such functions. For example, for abs_left(x) = (x <= 0) ? -x : x, AD will say the derivative at x=0 is -1. Alternatively, for abs_right(x) = (x < 0) ? -x : x, AD will say the derivative at x=0 is 1. Those two examples are weird, since the functions are equal at all points, yet AD claims different derivatives at x=0. The way to fix autodiff systems being weird is to write custom rules. So what rule should we write for this case?
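To see this behaviour without depending on any particular AD package, here is a minimal sketch of forward-mode AD built on a hand-rolled dual number. The `Dual` type and the `derivative` helper are hypothetical names made up for this illustration, not the API of any library, and only the operations the two functions need are defined.

```julia
# A tiny forward-mode dual number: a value plus its derivative part.
# (Hypothetical type for illustration only.)
struct Dual
    val::Float64
    der::Float64
end

# Just enough arithmetic and comparison for the two abs variants.
Base.:-(a::Dual) = Dual(-a.val, -a.der)
Base.:<(a::Dual, b::Real) = a.val < b
Base.:<=(a::Dual, b::Real) = a.val <= b

# Seed the derivative part with 1 and read it back out.
derivative(f, x) = f(Dual(x, 1.0)).der

abs_left(x) = (x <= 0) ? -x : x
abs_right(x) = (x < 0) ? -x : x

derivative(abs_left, 0.0)   # -1.0: the `-x` branch was taken
derivative(abs_right, 0.0)  # 1.0: the `x` branch was taken
```

The dual number simply differentiates whichever branch the primal computation took, which is why two functions that agree everywhere can disagree about the derivative at x=0.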

The obvious answer would be to write a rule that throws an error when the input is at a point where calculus says the derivative is not defined. Another option is to return some error-signalling value like NaN. You can do either. However, there is nowhere to go with an error: the user still wants a derivative, so this is not useful.

Let us explore what is useful:

Case Studies

Derivative is defined in usual sense

plot(x->x^3)

This is the standard case: one can return the derivative that is defined according to schoolroom calculus. Here we would reasonably say that at x=0 the derivative is 3*0^2=0.

Local Minima / Maxima

plot(abs)

abs is the classic example of a function where the derivative is not defined, as the limit from above is not equal to the limit from below.

\[\operatorname{abs}'(0) = \lim_{h \to 0^-} \dfrac{\operatorname{abs}(0)-\operatorname{abs}(0-h)}{0-h} = -1\]

\[\operatorname{abs}'(0) = \lim_{h \to 0^+} \dfrac{\operatorname{abs}(0)-\operatorname{abs}(0-h)}{0-h} = 1\]

Now, as discussed in the introduction, the AD system would on its own choose either 1 or -1, depending on implementation.

We however have a potentially much nicer answer available to use: 0.

This has a number of advantages.

It follows the rule that derivatives are zero at local minima (and maxima).
If you leave a gradient descent optimizer running, it will eventually converge exactly to the point; whereas with the derivative being 1 or -1 it would never outright converge, it would always flee.

Further:

It is a perfectly nice subderivative, i.e. a member of the subdifferential.
It is the mean of the derivative on each side; which means that it will agree with central finite differencing at the point.
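The points above can be checked with a small sketch. The names `dabs_zero`, `dabs_right`, `central_fdm` and `descend` are hypothetical helpers written for this illustration; the step size and step count are arbitrary choices made so that the arithmetic is exact in floating point.

```julia
# Two candidate derivative rules for abs; they differ only at x == 0.
dabs_zero(x) = sign(x)               # sign(0.0) == 0.0, so this says 0 at the kink
dabs_right(x) = x < 0 ? -1.0 : 1.0   # what AD gives for abs_right: 1 at the kink

# Central finite difference, for comparison.
central_fdm(f, x; h=2.0^-10) = (f(x + h) - f(x - h)) / (2h)

# Plain gradient descent with a fixed step size.
function descend(df, x; stepsize=0.25, steps=9)
    for _ in 1:steps
        x -= stepsize * df(x)
    end
    return x
end

central_fdm(abs, 0.0)      # 0.0, agreeing with dabs_zero(0.0)
descend(dabs_zero, 1.0)    # reaches 0.0 and stays there
descend(dabs_right, 1.0)   # keeps hopping between 0.0 and -0.25
```

With the zero-at-the-kink rule the optimizer lands on the minimum and stops; with the rule that reports 1 it is pushed off the minimum every time it arrives.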

Piecewise slope change

plot(x-> x < 0 ? x : 5x)

Here we have three main options, all of which are good.

We could say the derivative at 0 is:

1: which agrees with backwards finite differencing
5: which agrees with forwards finite differencing
3: which is the mean of [1, 5], and agrees with central finite differencing

All of these options are perfectly nice members of the subderivative. 3 is arguably the nicest, but it is also the most expensive to compute. In general, all are acceptable.
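As a quick check of the three candidate values, the following sketch evaluates backward, forward and central finite differences at 0. The helper names are hypothetical, and the step `h` is chosen as a power of two only so that the floating-point arithmetic is exact.

```julia
f(x) = x < 0 ? x : 5x

# Simple finite-difference helpers (illustrative, not from any package).
backward_fdm(f, x, h) = (f(x) - f(x - h)) / h
forward_fdm(f, x, h) = (f(x + h) - f(x)) / h
central_fdm(f, x, h) = (f(x + h) - f(x - h)) / (2h)

h = 2.0^-10  # a power of two keeps every operation below exact

backward_fdm(f, 0.0, h)  # 1.0, the slope of the left piece
forward_fdm(f, 0.0, h)   # 5.0, the slope of the right piece
central_fdm(f, 0.0, h)   # 3.0, the mean of the two slopes
```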

Derivative zero almost everywhere

plot(ceil)

Here it is most useful to say the derivative is zero everywhere. The limits are zero from both sides.

The other option for x->ceil(x) would be to relax the problem into x->x, and thus say it is 1 everywhere. But that is too weird: if the user wanted a relaxation of the problem, then they would provide one. We cannot impose that relaxation on ceil, as it is not reasonable for everyone.

Not defined on one-side

plot(x->exp(2log(x)))

We do not have to worry about what to return for the side where it is not defined, as we will never be asked for the derivative at e.g. x=-2.5 since the primal function errors. But we do need to worry about the boundary, if that boundary point doesn't error.

Since we will never be asked about the left-hand side (as the primal errors), we can use just the right-hand side derivative, in this case giving 0.0.

Also nice in this case is that it agrees with the symbolic simplification of x->exp(2log(x)) into x->x^2.
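A small sketch of why the boundary needs attention: applying the chain rule mechanically at x=0 multiplies 0 by Inf and yields NaN, whereas a rule that uses the right-hand derivative at the boundary returns 0.0. The names `df_naive` and `df_onesided` are hypothetical, chosen for this illustration.

```julia
f(x) = exp(2log(x))

# The primal is defined at the boundary but errors to the left of it.
f(0.0)                      # 0.0, since log(0.0) == -Inf and exp(-Inf) == 0.0
# f(-2.5) throws a DomainError from log, so no derivative is ever requested there.

# Mechanical chain rule: d/dx exp(2log(x)) = exp(2log(x)) * 2 / x
df_naive(x) = f(x) * 2 / x

# Hypothetical rule: at the boundary, report the right-hand derivative.
df_onesided(x) = x == 0 ? zero(x) : f(x) * 2 / x

df_naive(0.0)      # NaN, from 0.0 / 0.0
df_onesided(0.0)   # 0.0, matching the derivative of x^2 at 0
```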

Derivative nonfinite and same on both sides

plot(cbrt)

Here we have no real choice but to say the derivative at 0 is Inf. We could consider, as an alternative, saying some large but finite value. However, if it is too large it will just overflow rapidly anyway, and if it is too small it will not dominate over finite terms. It is not possible to find a single value that is always large enough. Our alternatives would be to consider the derivative at nextfloat(0.0) or prevfloat(0.0). But this is more or less the same as choosing some large value, in this case an extremely large value that will rapidly overflow.
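To illustrate, here is a sketch using the textbook derivative of the cube root, 1/(3*cbrt(x)^2). The name `dcbrt` is a hypothetical helper for this example, not a function from any library.

```julia
# Derivative of cbrt from calculus: 1 / (3 * cbrt(x)^2)
dcbrt(x) = 1 / (3 * cbrt(x)^2)

dcbrt(0.0)          # Inf

# The "large but finite" alternative: evaluate just next to zero instead.
tiny = nextfloat(0.0)
big = dcbrt(tiny)   # finite, but enormous
isfinite(big)       # true
big^2               # Inf: a single squaring already overflows
```

So evaluating at nextfloat(0.0) only postpones the Inf by one or two operations, which is the sense in which it is no better than returning Inf directly.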

Derivative nonfinite and different on both sides

plot(x-> sign(x) * cbrt(x))

In this example, the primal is defined and finite, so we would like a derivative to be defined. We are back in the case of a local minimum like we were for abs. We can make most of the same arguments as we made there to justify saying the derivative is zero.
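As a sketch, away from zero the derivative of this function is sign(x) / (3 * cbrt(x)^2), which evaluates to NaN at 0 if applied mechanically. A rule following the local-minimum argument special-cases that point. The names `g`, `dg_naive` and `dg` are hypothetical, used only for this illustration.

```julia
g(x) = sign(x) * cbrt(x)

# Valid away from zero, where sign(x) is locally constant.
dg_naive(x) = sign(x) / (3 * cbrt(x)^2)

# Hypothetical rule: report 0 at the local minimum, as for abs.
dg(x) = x == 0 ? zero(x) : dg_naive(x)

dg_naive(0.0)   # NaN, from 0.0 / 0.0
dg(0.0)         # 0.0
dg(-8.0), dg(8.0)   # approximately -1/12 and 1/12
```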

Conclusion

From the case studies a few general rules can be seen for how to choose a value that is useful. These rough rules are:

Say the derivative is 0 at local optima.
If the derivative from one side is defined and the other isn't, say it is the derivative taken from the defined side.
If the derivative from one side is finite and the other isn't, say it is the derivative taken from the finite side.
When the derivatives from each side are not equal, strongly consider reporting the average.
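These rough rules can be summarised as a small decision helper. This is only a sketch: `choose_derivative` is a hypothetical function, `nothing` is used here as an assumed stand-in for "not defined on that side", and whether a point is a local optimum has to be supplied by the rule author.

```julia
# Pick a useful derivative from the one-sided derivatives `left` and `right`.
# Pass `nothing` for a side on which the derivative is not defined.
function choose_derivative(left, right; local_optimum::Bool=false)
    local_optimum && return 0.0                       # rule 1: zero at local optima
    left === nothing && return right                  # rule 2: use the defined side
    right === nothing && return left
    isfinite(left) && !isfinite(right) && return left   # rule 3: use the finite side
    isfinite(right) && !isfinite(left) && return right
    return (left + right) / 2                         # rule 4: otherwise, the average
end

choose_derivative(-1.0, 1.0; local_optimum=true)   # abs at 0: 0.0
choose_derivative(1.0, 5.0)                        # piecewise slope change: 3.0
choose_derivative(nothing, 0.0)                    # exp(2log(x)) at 0: 0.0
choose_derivative(Inf, Inf)                        # cbrt at 0: Inf
choose_derivative(-Inf, Inf; local_optimum=true)   # sign(x) * cbrt(x) at 0: 0.0
```

Note that without the `local_optimum` flag the last case would average -Inf and Inf into NaN, which is why the local-optimum rule is checked first.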

Our goal, as always, is to get a pragmatically useful result for everyone, which must by necessity also avoid a pathological result for anyone.