- Roman Kazinnik

# Convolutional Neural Networks: insights on multi-scale representation and redundant architecture

Updated: Feb 17, 2019

__https://github.com/romanonly/romankazinnik_blog/tree/master/CVision/CNN__

**Perhaps the greatest about CNN (Convolutional Neural Networks) is that CNN allows intuitive multi-scale representation of data.**

**How so? Each convolution in the network projects (as a sequence of projections) to some lower dimensional non-liner space.**

**Here are a few practical rules from classical multi-resolution analysis aiming at optimizing CNN architecture.**

**Goal** Comparative analysis of three CNN, based on the idea of multi-scale and efficient versus redundant data representation.

**Redundant architecture in CNN **

Figure 1 shows three CNN-s (Keras, road signs data set) together with three training curves. *Can you predict winner (black training curve) and loser (red) CNN? *

**Redundant representation** can be illustrated using the following multi-scale decomposition, which employs projection of high resolution object into its low-resolution version and the details complement.

**Multi-scale **decomposition is in essence when an original high-resolution object is decomposed into lower resolution version, and its complement details part (Figure 2).

Reconstruction of the original high-resolution object is obtained as a sum of the lower resolution and the details version (sometimes also called *Wavelet coefficients*).

**Transformation from high to low resolution** is essentially a projection onto lower dimensions subspace, using a liner product. Examples range from liners algebra and multi-scale analysis to CNN-s with multiple convolutional filter layers.

Applied over and over again, we can obtain resolution levels and details levels, with the following properties (Figure 3):

1. Objects tend to look similar at lower resolutions, exhibiting differences only at the higher level.

2. Redundancy can be clearly seen between the lower resolution levels.

3. Low-resolution objects represent essentially* the least squares projection* of higher-dimensional objects.

4. Low-resolution object will have fewer data points.

5. The detail objects are sparse with fast coefficients decay.

To conclude, **low-resolution objects concatenated together will provide redundant representation, and as a result will not represent an optimal set of features**.

In Figure 1, more efficient CNN architecture will use only the last layer as final image embedding* (dense layer)*. Most efficient CNN architecture would concatenate the lowest resolution layer with corresponding details levels.

I train three CNNs and plot three validation accuracy conversions: black, red and blue. Convergence is stable to randomization, as illustrated by green and blue curves converging to the same accuracy. Black is a clear winner, red is a loser.

Top Network:

Pros: clear multi-scale decomposition.

Cons: doesn't utilize high-resolution in final dense layer.

Mid Network:

Pros: attempts to utilize both high and low cnn layers.

Cons: obtains redundant features by concatenating level 1 and level 2 as the final dense level and single softmax classification.

Bottom Network:

Pros: cnn level 1 is trained independently to minimize classification softmax misfit. Softmax can be thought of as a variation of the least-squares projection in the probability sense. Notice that while cnn level 1 is trained by minimizing the misfit softmax error, it is not directly used in the features vector. Only the last cnn layer is utilized as the image embedding *dense layer, *illustrating the least redundant cnn architecture.

Figure 4 shows that four times larger training data will effectively minimize the redundancy effect.

Please leave any comments below, or email: __roman.kazinnik@gmail.com__