Convolutional Neural Networks: recipe to find optimal architecture and avoid redundancy
Updated: Dec 6, 2020
The power of Convolutional Neural Networks (CNNs) to automatically create features that increase model accuracy is widely recognized. A CNN can also be seen as a process that creates an intuitive multi-scale representation of an image.
How so? Essentially, a CNN is a sequence of convolutions, and each convolution is the familiar linear-algebra operation of projecting onto a lower-dimensional space. This projection can also be implemented as an embedding into a higher-dimensional space.
Multi-scale approximation analysis is a classical area of mathematical analysis, and I want to use this theory to share rules for avoiding redundant model architectures and creating optimal CNN model architectures.
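To make the projection view concrete, here is a minimal NumPy sketch. The signal and the 2-tap averaging filter are hypothetical values chosen for illustration, not taken from the road signs model. It shows that each output of a convolution is an inner product of the kernel with one window of the input, and that a stride of 2 maps the input to a lower-dimensional version.

```python
import numpy as np

# Hypothetical 1-D "image" and a hypothetical 2-tap averaging filter.
x = np.array([4.0, 6.0, 10.0, 12.0, 8.0, 8.0, 3.0, 1.0])
kernel = np.array([0.5, 0.5])

# Each output sample is the inner product of the kernel with one window of x.
n_windows = len(x) - len(kernel) + 1
inner_products = np.array(
    [x[i:i + len(kernel)] @ kernel for i in range(n_windows)]
)

# np.correlate computes the same sliding inner products; this is the
# operation CNN layers call "convolution" (the kernel is not flipped).
filtered = np.correlate(x, kernel, mode="valid")
assert np.allclose(inner_products, filtered)

# Keeping every second output (stride 2) maps 8 samples to 4 pair averages.
low_res = filtered[::2]
print(low_res)  # the four pair averages: 5, 11, 8, 2
```

The stride-2 output is the simplest possible low-resolution version of the signal: half as many numbers, each one summarizing a pair of neighbours.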
I compare the performance of the following three CNN architectures:
1. The 'Road sign' CNN architecture.
2. The same architecture with the two concatenated sections denoted '2x2 max pool' and '4x4 max pool' removed from the 'flatten' layer.
3. A three-headed model: I add two softmax outputs, one for each of the two concatenated sections of the 'flatten' layer denoted '2x2 max pool' and '4x4 max pool'.
I review multi-scale theory and the decomposition into low-resolution and detail layers, and illustrate how this theory suggests treating the 'flatten' layer in a special way.
Model architecture: Optimal versus Redundant
What is a Redundant model architecture?
First, I claim that the goal of a supervised model, accurate scoring or prediction, is essentially equivalent to efficient data representation. As an illustration, think of a deep neural network for binary classification as a series of transformations whose goal is to find an efficient data representation that separates the two classes. Redundancy is well defined in data representation, which aims at finding the smallest basis sets.
The final dense layer can then be viewed as the representation of the data.
As an example of redundancy in modeling: if one model achieves 99% accuracy with a final dense layer of length 10, a model architecture that needs a longer representation to reach the same accuracy can be viewed as redundant.
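The "smallest basis" idea can be sketched with NumPy. The arrays below are random, hypothetical stand-ins for final-layer activations, not outputs of the models discussed here. A 20-long representation built by concatenating a 10-long one with linear mixtures of itself still spans only 10 dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 10-long representation of 200 samples.
features = rng.normal(size=(200, 10))

# A "longer" 20-long representation: the original 10 columns plus
# 10 columns that are linear mixtures of them (no new information).
mixing = rng.normal(size=(10, 10))
longer = np.hstack([features, features @ mixing])

rank_short = np.linalg.matrix_rank(features)
rank_long = np.linalg.matrix_rank(longer)
print(longer.shape[1], rank_short, rank_long)  # 20 columns, rank 10 and 10
```

The extra ten columns add parameters to whatever layer consumes them, but they do not enlarge the basis.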
Why avoid redundant model architectures?
Efficient models require less data to achieve an equivalent accuracy.
In fact, the amount of data needed to train a model can be viewed as an indicator of model architecture redundancy.
Let's add Intuition
A deep neural network as a sequence of multi-scale transforms
A redundant representation can be illustrated using the following multi-scale decomposition, which projects a high-resolution object into two parts: a low-resolution coarse object and a details object.
In essence, multi-scale decomposition splits an original high-resolution object into a lower-resolution version and its complementary details part (Figure 2).
The original high-resolution object is reconstructed as the sum of the lower-resolution version and the details (sometimes also called wavelet coefficients).
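A one-level decomposition of this kind can be sketched in a few lines. This is a Haar-style pair-averaging example on a hypothetical signal, chosen for simplicity. It is not necessarily the transform behind Figure 2, and `decompose` is a name introduced here for illustration.

```python
import numpy as np

def decompose(signal):
    """Split an even-length 1-D signal into a coarse part and a details part.

    The coarse part replaces each pair of neighbours by their mean;
    the details part is whatever that averaging smoothed out.
    """
    signal = np.asarray(signal, dtype=float)
    pair_means = signal.reshape(-1, 2).mean(axis=1)  # half-length summary
    coarse = np.repeat(pair_means, 2)                # back on the original grid
    details = signal - coarse
    return coarse, details

x = np.array([4.0, 6.0, 10.0, 12.0, 8.0, 8.0, 3.0, 1.0])
coarse, details = decompose(x)

# coarse  is 5, 5, 11, 11, 8, 8, 2, 2
# details is -1, 1, -1, 1, 0, 0, 1, -1
# Reconstruction: the original is the sum of the two parts.
assert np.allclose(coarse + details, x)
```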
Multi-scale transform as an object transformation from high to low resolution
A lower-resolution object has some of the original high-resolution details smoothed out as a result of projection onto a lower-dimensional subspace. The projection is computed using inner products, and convolution is one example. Applied consecutively, it yields a set of multi-scale (multi-resolution) versions of the object together with the remainder levels; see the 1-D signal multi-scale decomposition example in Figure 3. I highlight the following properties, all of which derive from the two operations that define the multi-scale transform, projection and decomposition:
1. Objects tend to look similar at nearby resolutions and differ at resolution levels that are far apart.
2. Redundancy is clearly visible at nearby resolution levels, where the signal versions look almost identical. Concatenating a signal with its next lower-resolution level is nearly equivalent to duplicating the same object.
3. The low-resolution version of the object is its least-squares projection.
4. The low-resolution version can have a lower-dimensional structure or, alternatively, be embedded into a higher-dimensional representation.
5. The remainders ('details') can have a sparse structure.
Multi-scale decomposition and representation: linear algebra intuition
Here is a useful intuition from linear algebra: decompose a vector V into a 'simpler' vector V1 and a projection 'remainder' P1, and apply the decomposition repeatedly (so V1 = V2 + P2 and V2 = V3 + P3): V = V1 + P1 = V2 + P1 + P2 = V3 + P1 + P2 + P3
Notice that the representation of the vector V is built from the 'remainders' P1, P2, and P3.
V1, V2, and V3 exhibit a redundant representation.
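The telescoping identity can be verified directly. This is a small sketch that assumes block averaging as the projection and a random hypothetical vector. `project` is an illustrative helper.

```python
import numpy as np

def project(v, block):
    """Least-squares projection onto vectors constant on blocks of `block`."""
    means = v.reshape(-1, block).mean(axis=1)
    return np.repeat(means, block)

rng = np.random.default_rng(0)
V = rng.normal(size=16)  # hypothetical vector

# Repeated decomposition: each step splits off a simpler vector and a remainder.
V1 = project(V, 2)
P1 = V - V1
V2 = project(V1, 4)
P2 = V1 - V2
V3 = project(V2, 8)
P3 = V2 - V3

assert np.allclose(V, V1 + P1)
assert np.allclose(V, V2 + P1 + P2)
assert np.allclose(V, V3 + P1 + P2 + P3)

# The remainders are mutually orthogonal, so none repeats another's information.
assert abs(P1 @ P2) < 1e-9 and abs(P1 @ P3) < 1e-9 and abs(P2 @ P3) < 1e-9

# V2 and V3 are fully determined by V1, so keeping all three is redundant.
assert np.allclose(project(project(V1, 4), 8), V3)
```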
CNN and Multi-scale: Road signs example
Can you find a redundant CNN architecture?
Figure 1 shows three CNN model architectures (Keras, road signs data set) and their training curves: Red, Green, and Black (the Blue and Green curves show training of the same model architecture).
Which architecture produced the Black curve (the most efficient) and which produced the Red curve (the lowest accuracy)?
We can say that the lowest accuracy corresponds to a redundant CNN architecture.
Experiments with three CNN architectures
The best convergence curve, Black, corresponds to the three-headed architecture. The 'flatten' layer concatenates the outputs of convolutions that carry redundant information. Introducing three independent softmax optimizers lets the convolution outputs be optimized independently of each other.
The Red convergence curve was produced by the least optimal architecture, which applies a single softmax optimizer to the single output of the last convolutional layer.
The Green and Blue convergence curves were produced by training the original single-softmax architecture with the convolution outputs concatenated.
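The difference between a single softmax on the concatenated 'flatten' layer and one softmax head per section can be sketched as a forward pass in NumPy. Everything below is hypothetical: the section sizes, the class count, the random features, and the untrained weights are placeholders, and this is not the Keras model used for the figures.

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum(axis=1, keepdims=True)

def cross_entropy(probs, labels):
    return -np.log(probs[np.arange(len(labels)), labels]).mean()

rng = np.random.default_rng(0)
n_samples, n_classes = 8, 5  # hypothetical sizes

# Hypothetical flattened outputs of the three sections of the 'flatten' layer.
sections = {
    "last_conv": rng.normal(size=(n_samples, 32)),
    "pool_2x2": rng.normal(size=(n_samples, 64)),
    "pool_4x4": rng.normal(size=(n_samples, 16)),
}
labels = rng.integers(0, n_classes, size=n_samples)

# Single head: one softmax over the concatenated sections.
flat = np.concatenate(list(sections.values()), axis=1)
W_single = rng.normal(scale=0.01, size=(flat.shape[1], n_classes))
single_loss = cross_entropy(softmax(flat @ W_single), labels)

# Three heads: each section gets its own softmax and its own loss term,
# so each section is pushed to be predictive on its own.
head_losses = {}
for name, feats in sections.items():
    W_head = rng.normal(scale=0.01, size=(feats.shape[1], n_classes))
    head_losses[name] = cross_entropy(softmax(feats @ W_head), labels)
total_loss = sum(head_losses.values())
```

In Keras, a model of this shape would presumably be built with the functional API by passing a list of three outputs to the model, in which case the per-output losses are summed during training, much like `total_loss` here.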
Optimal model architecture and data size
Figure 4 shows that increasing the training data four times effectively removes the redundancy effect.