• Roman Kazinnik

Convolutional Neural Networks: insights on multi-scale representation and redundant architecture

Updated: Feb 17, 2019

Perhaps the greatest about CNN (Convolutional Neural Networks) is that CNN allows intuitive multi-scale representation of data.

How so? Each convolution in the network projects (as a sequence of projections) to some lower dimensional non-liner space.

Here are a few practical rules from classical multi-resolution analysis aiming at optimizing CNN architecture.

Figure 1

Goal Comparative analysis of three CNN, based on the idea of multi-scale and efficient versus redundant data representation.

Redundant architecture in CNN

Figure 1 shows three CNN-s (Keras, road signs data set) together with three training curves. Can you predict winner (black training curve) and loser (red) CNN?

Redundant representation can be illustrated using the following multi-scale decomposition, which employs projection of high resolution object into its low-resolution version and the details complement.

Figure 2

Multi-scale decomposition is in essence when an original high-resolution object is decomposed into lower resolution version, and its complement details part (Figure 2).

Reconstruction of the original high-resolution object is obtained as a sum of the lower resolution and the details version (sometimes also called Wavelet coefficients).

Transformation from high to low resolution is essentially a projection onto lower dimensions subspace, using a liner product. Examples range from liners algebra and multi-scale analysis to CNN-s with multiple convolutional filter layers.

Applied over and over again, we can obtain resolution levels and details levels, with the following properties (Figure 3):

Figure 3

1. Objects tend to look similar at lower resolutions, exhibiting differences only at the higher level.

2. Redundancy can be clearly seen between the lower resolution levels.

3. Low-resolution objects represent essentially the least squares projection of higher-dimensional objects.

4. Low-resolution object will have fewer data points.

5. The detail objects are sparse with fast coefficients decay.

To conclude, low-resolution objects concatenated together will provide redundant representation, and as a result will not represent an optimal set of features.

In Figure 1, more efficient CNN architecture will use only the last layer as final image embedding (dense layer). Most efficient CNN architecture would concatenate the lowest resolution layer with corresponding details levels.

I train three CNNs and plot three validation accuracy conversions: black, red and blue. Convergence is stable to randomization, as illustrated by green and blue curves converging to the same accuracy. Black is a clear winner, red is a loser.

Top Network:

Pros: clear multi-scale decomposition.

Cons: doesn't utilize high-resolution in final dense layer. 

Mid Network:

Pros: attempts to utilize both high and low cnn layers.

Cons: obtains redundant features by concatenating level 1 and level 2 as the final dense level and single softmax classification.

Bottom Network:

Pros: cnn level 1 is trained independently to minimize classification softmax misfit. Softmax can be thought of as a variation of the least-squares projection in the probability sense. Notice that while cnn level 1 is trained by minimizing the misfit softmax error, it is not directly used in the features vector. Only the last cnn layer is utilized as the image embedding dense layer, illustrating the least redundant cnn architecture.

Figure 4

Figure 4 shows that four times larger training data will effectively minimize the redundancy effect.

Please leave any comments below, or email:


© 2018 by Challenge. Proudly created with