Convolutional neural networks are not rotation invariant… since when?

machine learning
deep learning
computer vision
Author

Salman Faris

Published

September 12, 2024

“Why is a convolutional neural network not rotation invariant?”

This question popped up in an interview recently, and it turned out to be much deeper than I initially thought. For a start, the statement “a convolutional neural network is not rotation invariant” is not entirely true, but only half-true, maybe a quarter-true. So why is this such a popular question in the machine learning domain? I feel this is mainly due to a poor understanding of the convolutional neural net architecture.

What’s surprising is how little I have been able to find on the internet about this question. The articles that I have found are unoriginal and poor derivatives of what was written in (Goodfellow, Bengio, and Courville 2016); and like any good mathematics textbook, they do not really do a deep dive into the mathematical reasoning behind the statement – which makes the interpretations written around it even worse. In fact, Goodfellow, Bengio, and Courville (2016) mention that a convolutional neural net can be made rotation invariant with the right layers… so what the heck is going on here?

Today, I am putting this question to bed with a somewhat acceptable mathematical reasoning. Let’s get one important thing out of the way first, because this might confuse some people. The statement “a convolutional net is not rotation invariant” is somewhat misleading; and this is a consequence of people not understanding the difference between a convolutional neural net, a convolutional layer and a convolution operator. So what actually is the “convolution” in a convolutional neural net?

Convolutional neural nets and convolution operators

A convolutional neural network is simply a neural network where at least one of its layers utilizes the convolution operator rather than a full-blown matrix multiplication (Goodfellow, Bengio, and Courville 2016). This definition is so general that it renders the statement “why is a convolutional neural network (CNN) not rotation invariant?” false, especially with our breadth of understanding of deep learning today. I will demonstrate precisely why in the rest of this article.

So let’s go one step down the rabbit hole… what is a convolution? The convolution I * K of two complex-valued functions I and K on \mathbb{R}^d is simply the integral transform

J(t) := (I * K)(t) = \int_{\mathbb{R}^d} I(\tau) K(t - \tau) \, d\tau = \int_{\mathbb{R}^d} I(t - \tau) K(\tau) \, d\tau.

In computer vision, we are, at the most basic level, concerned with the discrete convolution of real-valued objects (i.e. images) over a finite two-dimensional integer support. This is not exactly true for modern convolutional neural nets, which actually perform multi-channel convolution, typically modelled with a four-dimensional kernel tensor instead. Let’s focus only on the basics though. In the two-dimensional case, the convolution above reduces (and simplifies) to

J[i, j] = \sum_{m, n} I[i - m, j - n] K[m, n],

where I, K are real-valued functions over \mathbb{Z}^2 and the summation is done over the support of K. Here, I use square brackets instead of parentheses to align with literature on discrete convolution.
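To make the formula concrete, here is a minimal NumPy sketch (the function name conv2d is mine, purely for illustration) that evaluates the sum above directly, treating I as zero wherever it is not defined:

```python
import numpy as np

def conv2d(I, K):
    """Full discrete 2D convolution: J[i, j] = sum_{m, n} I[i - m, j - n] * K[m, n],
    treating I as zero outside its support."""
    hi, wi = I.shape
    hk, wk = K.shape
    J = np.zeros((hi + hk - 1, wi + wk - 1))
    for i in range(J.shape[0]):
        for j in range(J.shape[1]):
            total = 0.0
            for m in range(hk):
                for n in range(wk):
                    a, b = i - m, j - n
                    if 0 <= a < hi and 0 <= b < wi:
                        total += I[a, b] * K[m, n]
            J[i, j] = total
    return J
```

This mirrors the definition literally; a production implementation would of course vectorize the loops (or use something like scipy.signal.convolve2d, which computes the same quantity).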

Now, I claim that the convolution operator is not rotation invariant. But before that, let’s look at something even simpler. I claim that the convolution operator is not even translation invariant!

Convolution is not translation invariant

To prove that convolution is not translation invariant, it suffices to look at a single nonzero shift s in the x-coordinate where we observe that

\begin{align*} J[i, j] &= \sum_{m, n} I[i-m, j-n] K[m, n] \\ &\neq \sum_{m, n} I[i-m+s, j-n] K[m, n] \\ &= J[i + s, j]. \end{align*}

Of course, this is a sum of (a lot of) product terms, so there are particular choices of I and K for which the equality J[i,j] = J[i + s, j] happens to hold; but it is not guaranteed in general, and a single counterexample settles the matter. \blacksquare

Here’s a simple concrete example to see where they are not equal. Take both I and K to be supported on \{ 0, 1, 2\} \times \{ 0, 1, 2 \} \subseteq \mathbb{Z}^2 (and zero elsewhere). Define I to be the constant function [m, n] \mapsto 1 on this support, and let K be defined by

[m, n] \mapsto \begin{cases} 1 & \text{if } m = n \text{ and } m, n \neq 0, \\ 0 & \text{otherwise} \end{cases}

Then evaluating J at (i, j) = (0, 1) gives J[i, j] = 0 but J[i+1, j] = 1, so a shift of s = 1 changes the output.
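If you would rather check this numerically, here is a quick NumPy sketch of exactly this example (the helper J is mine):

```python
import numpy as np

# Both I and K are supported on {0, 1, 2} x {0, 1, 2} and are zero elsewhere.
I = np.ones((3, 3))                      # I[m, n] = 1 on the support
K = np.array([[0., 0., 0.],
              [0., 1., 0.],
              [0., 0., 1.]])             # K[m, n] = 1 iff m = n and m, n != 0

def J(i, j):
    """Evaluate J[i, j] = sum_{m, n} I[i - m, j - n] * K[m, n]."""
    total = 0.0
    for m in range(3):
        for n in range(3):
            a, b = i - m, j - n
            if 0 <= a < 3 and 0 <= b < 3:
                total += I[a, b] * K[m, n]
    return total

print(J(0, 1))   # 0.0
print(J(1, 1))   # 1.0 -- shifting by s = 1 in the first coordinate changes the value
```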

Now, was that surprising? Did you feel lied to all your life that convolutional neural nets are translation invariant? Well, we’ll get to why that statement is commonly made at the end of this post. For now, let me show you that the convolution operator is not rotation invariant either.

Convolution is not rotation invariant

Now that you’ve got a basic idea of how to disprove convolution being translation invariant, it should be straightforward to see that convolution is not rotation invariant either… at least a sketch proof should be forming in your mind. Reasoning it out formally, however, takes a bit more effort as we need to do some setup first. For a start, I am going to make things simple by expanding the domain back to \mathbb{R}^2 because I do not want to talk about rotation matrices in \mathbb{Z}^2, which is not pretty. As a bonus, we are now back in the world of vector spaces… although we could always make the extra effort and start talking about \mathbb{Z}^2 as a module over itself. Let’s now start putting the pieces together.

Let p_{\mathrm{center}} = (0, 0) \in \mathbb{R}^2 be the image center; then define a two-dimensional Cartesian coordinate system centered at p_{\mathrm{center}} and fix the standard basis. This defines the axes that we will play around with to perform rotations. With that set up, for any vector x \in \mathbb{R}^2, we can perform a counterclockwise rotation by \theta degrees around p_{\mathrm{center}} by applying the rotation matrix

R_\theta = \begin{pmatrix} \cos \theta & -\sin \theta \\ \sin \theta & \cos \theta \end{pmatrix}.

Then for an image I_0 : \mathcal{D} \subseteq \mathbb{R}^2 \to \mathbb{R}, the corresponding image rotated by \theta degrees around p_{\mathrm{center}} is the image I_\theta : \mathcal{D}' \subseteq \mathbb{R}^2 \to \mathbb{R} defined by I_\theta(y) = I_0(x), where y = R_\theta x (or equivalently, x = R_\theta^{-1} y = R_{-\theta} y).
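To see this definition in action, here is a minimal sketch of rotating a discrete image by inverse mapping: every output pixel y looks up the original image at x = R_{-\theta} y, with nearest-neighbour sampling around the image centre. (The function name and the nearest-neighbour choice are mine; they are not part of the argument.)

```python
import numpy as np

def rotate_image(I0, theta):
    """Return I_theta where I_theta(y) = I_0(R_theta^{-1} y): each output pixel
    looks up its value at the inversely rotated coordinate about the image
    centre, with nearest-neighbour sampling and zero outside the domain."""
    h, w = I0.shape
    centre = np.array([(h - 1) / 2.0, (w - 1) / 2.0])
    c, s = np.cos(theta), np.sin(theta)
    R_inv = np.array([[c, s],     # R_{-theta}, the inverse of R_theta
                      [-s, c]])
    out = np.zeros_like(I0)
    for idx in np.ndindex(h, w):
        y = np.array(idx, dtype=float) - centre
        x = R_inv @ y + centre    # x = R_theta^{-1} y, back in array coordinates
        a, b = int(round(x[0])), int(round(x[1]))
        if 0 <= a < h and 0 <= b < w:
            out[idx] = I0[a, b]
    return out
```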

Now fix a finite canvas to draw our images on: let \Omega \subseteq \mathbb{R}^2 be a bounded domain such that \mathcal{D}, \mathcal{D}' \subseteq \Omega. Suppose now that the point q \in \mathcal{D} is mapped (bijectively) to q' \in \mathcal{D}' under the rotation R_\theta, so that q = R_{-\theta} q'. Using I_\theta(y) = I_0(R_{-\theta} y), we can now see that the convolution of the rotated image does not agree with the convolution of the original image:

\begin{align*} J_\theta(q') &= \int_{\Omega} I_{\theta}(q'-p)\, K(p) \, dp \\ &= \int_{\Omega} I_0(R_{-\theta} q' - R_{-\theta} p)\, K(p) \, dp \\ &= \int_{\Omega} I_0(q - R_{-\theta} p)\, K(p) \, dp \\ &\neq \int_{\Omega} I_0(q - p)\, K(p) \, dp \\ &= J_0(q). \end{align*}

In other words, convolution is not rotation invariant. As before, equality can hold for special choices of I and K, but it is not guaranteed in general. \blacksquare
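Here is a quick numerical sanity check of the claim, assuming SciPy is available (and using an exact 90-degree rotation via np.rot90 so that no interpolation muddies the water):

```python
import numpy as np
from scipy.signal import convolve2d   # assuming SciPy is available

# A small test image and an asymmetric (horizontal edge-detector) kernel.
I = np.arange(16, dtype=float).reshape(4, 4)
K = np.array([[1., 0., -1.],
              [1., 0., -1.],
              [1., 0., -1.]])

J_original = convolve2d(I, K, mode="valid")
J_rotated  = convolve2d(np.rot90(I), K, mode="valid")   # rotate the image first

print(np.allclose(J_original, J_rotated))   # False: rotating the input changed the output
```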

Seems like convolutional neural nets are not rotation invariant at all…?

So the convolution operator is neither translation nor rotation invariant. Why, then, did I say the statement “a convolutional neural network is not rotation invariant” is only partially true? After those two proofs, it should seem flat-out true now, no? Well, there is a stark difference between a convolution operator and a convolutional neural net. What we have shown is that the former is not translation/rotation invariant; the latter is a completely different beast and much more complex.

A convolutional neural net is typically composed of multiple kernel convolution operators, pooling operators and activation functions (a minimal example of such a composition is sketched below). See the difference now? A convolutional neural net has the convolution operator as just one of its layers. So in its most basic form, where the net only has convolutional layers, it is neither translation nor rotation invariant. But… since we have other layers, we can actually make the net learn these invariances by combining the right layers (and the right data).
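To make the distinction concrete, here is a minimal sketch of such a composition in PyTorch (the post itself is framework-agnostic, and the layer sizes below are arbitrary illustrations): the convolution operator is just one layer among activation and pooling layers.

```python
import torch
import torch.nn as nn

# A minimal convolutional neural net: convolution operators composed with
# activation functions and pooling operators, followed by a classifier head.
cnn = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3),   # convolution operator (its weight is a 4D tensor: 8x1x3x3)
    nn.ReLU(),                        # activation function
    nn.MaxPool2d(2),                  # pooling operator
    nn.Conv2d(8, 16, kernel_size=3),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 10),        # e.g. 10 output classes
)

x = torch.randn(1, 1, 28, 28)         # one 28x28 grayscale image
print(cnn(x).shape)                   # torch.Size([1, 10])
```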

As in MLPs, the activation function is there to introduce non-linearity, so it is not what buys us invariance. Based on the layers I have mentioned, it should be obvious by now: the convolution operator and the activation function alone will not be helpful, so the secret sauce must be the pooling operator… plus a sufficiently broad dataset. I’ll prove why this is true in a later post.
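I will save the proof for that post, but here is a small numerical teaser of the intuition (a sketch only, assuming SciPy is available): global max pooling over a feature map is unchanged when the input pattern is shifted by a pixel, even though the convolution output itself moves.

```python
import numpy as np
from scipy.signal import convolve2d   # assuming SciPy is available

K = np.array([[1., 2., 1.],
              [2., 4., 2.],
              [1., 2., 1.]])              # some fixed kernel

I = np.zeros((9, 9))
I[4, 4] = 1.0                             # a single bright pixel in the middle
I_shifted = np.roll(I, shift=1, axis=1)   # translate the pattern one pixel to the right

J = convolve2d(I, K, mode="same")
J_shifted = convolve2d(I_shifted, K, mode="same")

print(np.allclose(J, J_shifted))          # False: the convolution output itself moved
print(J.max() == J_shifted.max())         # True: global max pooling ignores the small shift
```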

References

Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press.