Convolutional Neural Network (CNN)

CNN or ConvNet is a type of ANN that has been widely used to analyse images and sound signals. MLP is not suited to image and sound signals because the trainable parameters are large.

When sound or image is used for training, the trainable parameters do not increase as much with a CNN. Data can be extracted with it and it does not get affected by the scale of the data; in other words, it is scale-invariant. This is why CNNs can learn from multidimensional data like images.

In neural networks with the Curse of Dimensionality, images have very high dimensionality, so training a standard feed-forward network to recognize images would require hundreds of thousands of input neurons. CNN reduces the dimensionality of an image by utilizing convolutional and pooling layers. It is able to highlight important parts of the image and pass them forward since convolutional layers are trainable but have fewer parameters than a standard hidden layer. The last few layers of CNNs are hidden layers, which process the compressed image information. A basic CNN architecture consists of two parts, a convolution tool that identifies features in an image and extracts them for analysis in a process called Feature Extraction. Based on the features extracted in previous stages, it predicts the class of the image using the output from the convolution process.

Convolution layer

The first layer in the process extracts features from the input images. This layer performs convolution between the input image and a filter of a specific size MxM. By moving the filter over the input image and then selecting the parts of the input image according to the size of the filter, the dot product is computed between the filter and the portions of the image. The result is a Feature map, which gives us information about the image, like corners and edges.

Strides and Padding

Using stride, we can control how far our sliding filter window will move per application of the filter function. We create a new depth column in the output volume every time we apply the filter function to the input column. In the output volume, lower stride values (for example, 1 specifies only a single unit step) will allocate more depth columns.

Filters don't always perfectly match input images. So, we can add padding to the picture with zeros which is called zero padding or we can remove the part of the image where the filter does not fit. Those pixels are called valid padding, because only the valid parts of an image are retained.

Pooling Layer

By pooling layers, you can reduce the number of parameters when the images are large. A spatial pooling, sometimes called subsampling or downsampling, lowers the level of dimensionality in the data while maintaining essential information on the map. Typically, pooling layers are inserted between convolutional layers. By following convolutional layers with pooling layers, the data representation can be progressively reduced in size (width and height). Every depth slice of data is treated independently by the pooling layer. The purpose of this layer is to reduce the size of the convolved feature map to reduce computation costs and help control overfitting. As a result, the connections between layers are reduced and each feature map is operated on independently.