Complete Guide to CNNs (Part-1)

Shachi Kaul
Published in Analytics Vidhya
9 min read · Jan 11, 2020

Hi Learners!
Presenting a detailed, granular blog, compiled after intense research across various sources, on a powerful and widely used technology: the CNN.

RoadMap

1. Introduction
2. How CNN sees an image
3. General CNN Architecture
4. CNN layers concepts in detail
5. Summary of CNN
6. Training CNN model
7. Hyper-parameters in CNN
8. Recommended CNN Architecture Rules
9. Curiosity-prone ‘why’ questions

Part-I covers Introduction through Summary of CNN
Part-II covers Training the CNN model through the final 'why' questions

Introduction

Convolutional Neural Networks (ConvNets or CNNs) are a supervised feed-forward deep learning technique used predominantly for images.
A CNN basically classifies an image, let's suppose as a dog or a cat. It doesn't know whether it is a cat or a dog; it just receives a matrix of pixel values as input and starts learning low-level features such as curves, lines or brightness. Gradually, as it goes through deeper layers, it learns mid-level and high-level features such as ears and eyes. Finally, using an activation function, it computes a probability between 0 and 1 for each class (cat, dog, ...) that best describes the image (e.g. 0.8 for cat, 0.2 for dog).


Applications of CNNs span image, video, text and audio analytics. Object recognition, computer vision and related applications, including face recognition apps, are built on them. CNNs are also at the core of a popular advanced deep learning algorithm known as the GAN.

The role of the ConvNet is to reduce the images into a form which is easier to process, without losing features which are critical for getting a good prediction.

How CNN sees an image

An image is composed of pixels arranged in an n-dimensional array; that is how a computer sees an image. To the computer, an image is an array of values where each value denotes a pixel. Each pixel denotes the intensity of color at that point, typically stored in the range 0–255 (or normalized to 0–1). In a CNN, an image in the form of an n-D array has a shape of (height, width, channels). Here, channels refers to the depth of the image, i.e. whether it is a colored image or not.
A colored image has 3 channels (R, G, B): a stack of three 2-D matrices on top of each other, where each matrix holds values between 0–255. A grayscale image is a single 2-D matrix, i.e. it has 1 channel.
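To make these shapes concrete, here is a minimal NumPy sketch; the arrays are synthetic, generated purely for illustration:

```python
import numpy as np

# A synthetic 4x4 grayscale image: one channel, values 0-255.
gray = np.random.randint(0, 256, size=(4, 4), dtype=np.uint8)

# A synthetic 4x4 color image: three stacked channels (R, G, B).
color = np.random.randint(0, 256, size=(4, 4, 3), dtype=np.uint8)

print(gray.shape)   # (4, 4)    -> (height, width)
print(color.shape)  # (4, 4, 3) -> (height, width, channels)

# Pixel values are often normalized to the 0-1 range before training.
normalized = color.astype(np.float32) / 255.0
```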

Few image related terminologies:
1. Resolution: Refers to the ppi (pixels per inch) / dpi (dots per inch) of an image, expressed via its height and width. Low resolution means fewer pixels per inch, so stretching the image degrades quality further as the large pixels become visible.

2. Pixel: Picture element, whose value refers to the intensity of color at that point in the image. Refer to this blog for more info.
3. Size: Changing the size refers to the changes in the number of pixels in an image i.e. either some pixels need to be removed or added.

Key fact: In image processing, image scaling doesn't change the number of pixels in the image; instead, if the resolution doesn't fit the new size well, pixels are stretched or compressed.

General CNN Architecture

A CNN is a neural network with an architecture similar to a typical Neural Network.

A CNN modifies the NN architecture by adding layers such as convolution, ReLU (a.k.a. activation) and pooling layers. Broadly, the whole network is divided into two parts: learning the features and classifying the classes.

Features are learned with the following CNN layers:

Convolution: The input image is first passed into this layer, where convolutional kernels convolve around the whole image to generate a stack of Feature Maps (usually reducing the spatial dimensions).
ReLU (a.k.a. Activation): The ReLU activation function transforms negatives to 0s and activates the high-weighted features, which get forwarded to the next layer; hence the name Activation layer. This layer brings non-linearity to the model, enabling it to learn complex features.

Pooling (a.k.a. Downsampling): A downsampling layer which reduces the number of training parameters, and thus the computation cost and time. In short: faster calculations.

Classification is performed using following CNN layers:

The reduced matrices are now passed into the classifying layers, where they are first converted into a long vector (flattening).

Fully Connected: The matrices are flattened into a single-dimensional feature vector and passed through one or more fully connected (dense) layers, and finally through an activation classifier (sigmoid/softmax) to classify the image. Classification is done by computing the probability of each class for that input image.

CNN layers concepts in detail

1. Convolution Layer

‘Convolution’ in general terms means an operation that combines two things to produce a third. Here, an element-wise multiplication is performed and the results are summed up to produce a scalar value. Basically, convolution is the operation of convolving (sliding) a kernel over the image with a specific stride. The area of the input image currently covered by the kernel (say, the leftmost 3*3 patch during the first slide) is known as the receptive field.

Let me make it simpler with an idea I came across while researching other blogs. Imagine a flashlight which flashes across an input image with a given stride. The flashlight is the kernel, and the region it shines upon is the receptive field. As we go through more layers, the effective receptive field grows, so we become able to identify complex features (whole objects) at once.

As demonstrated above, this layer has many filters which separately convolve with an input image to produce output matrices coined as Feature Maps. Check out this blog for finer details.

The goal of this layer is to learn the parameters (values) of the kernels via back-propagation.

Let’s know more about elements involved in convolution:

Filter: (a.k.a Kernel)

  • A filter is a feature detector (an array of values) which detects image features such as edges, curves, blurriness or brightness.
  • The values of the kernel are called weights (a.k.a. parameters), where a high weight signals the importance of that pixel for a particular feature. For example, in a simple edge-detecting filter, values of 1, -1 and 0 can pick out bright (white), dark (black) and neutral (grey) regions respectively.
  • Following two operations performed during convolution while striding over an input image:
    - Element-wise multiplication
    - Summation
  • Initially, filters are randomly initialized but then updated by the back-propagation.
  • Kernel size is usually odd (3*3, 5*5, ...), so the kernel has a well-defined center.
  • Depth should always be same as input image’s depth
    ( eg. input dim (32*32*3) where ‘3’ is a depth. Same way kernel’s dim (7*7*3) having same depth.)
  • Impact of convolving with more filters:
    1. Spatial dimensions are preserved across the stack
    2. Greater depth (channels) of the Feature Map stack
    3. More information captured about the input image
  • Kernels initially detect low-level features such as curves or edges. Going deeper into the network, they increase their receptive field to capture information from a larger part of the image. This allows them to detect complex features or specific objects such as ears or eyes.
  • What if the kernel doesn't fit the input image exactly?
    1. Drop the part of the image where the filter doesn't fit
    2. Pad the boundary with 0s so that it fits the kernel
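The element-wise multiplication and summation described above can be sketched in NumPy. `convolve2d` is a hypothetical helper name, and this naive double loop is for illustration only (real frameworks use heavily optimized implementations):

```python
import numpy as np

def convolve2d(image, kernel, stride=1):
    """Naive 'valid' convolution: slide the kernel, multiply elementwise, sum."""
    h, w = image.shape
    f = kernel.shape[0]  # assume a square f x f kernel
    out_h = (h - f) // stride + 1
    out_w = (w - f) // stride + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # The receptive field: the region currently covered by the kernel.
            patch = image[i*stride:i*stride+f, j*stride:j*stride+f]
            feature_map[i, j] = np.sum(patch * kernel)
    return feature_map

image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.ones((3, 3))            # a simple summing/averaging-style filter
print(convolve2d(image, kernel))    # a 2x2 feature map
```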

Feature Map: (a.k.a Activation Map, Convolved Feature)

  • An array of values (a matrix) which is the output of the Convolutional layer after performing the convolution operation with a particular filter.
  • It is called an activation map since it shows the areas activated for a particular feature with respect to the kernel.

Stride:

Stride refers to the number of pixels the kernel shifts at each step. For instance, a stride of 1 slides the filter by 1 pixel at a time.


While striding, certain details along the border get skipped. Thus, we need a concept of padding.

Padding:

  • Padding is a concept of adding values along border of an image.
  • Types of Padding:
    1. Valid: (No Pad) Results in output size < input size
    2. Same: (Zero Pad) Results in output size = input size

In the first example, a 4*4 input convolved with a 3*3 kernel and no additional values along the boundary results in a downsized 2*2 feature map. This is 'valid' padding. It is not ideal for the network, since we would be losing many details very fast. 'Same' padding is the solution: a 5*5 input with a 3*3 kernel and a 1-pixel border of 0s results in a feature map of the same dimension, i.e. a 5*5 output ('same' convolution).

  • Why do we need padding?
    1. Performing convolutions shrinks the data very fast; the additional pixels weaken this effect.
    2. While striding, pixels along the border are 'touched' less often than those in the middle, so we may lose details at those positions.
  • Formula to calculate the output size, with filter size (f), stride (s), padding (p) and input size (n):
    output size = (n + 2p − f) / s + 1
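A small sketch of this formula in code (the function name `conv_output_size` is my own), checked against the 'valid' and 'same' examples above:

```python
def conv_output_size(n, f, p=0, s=1):
    """Output size of a convolution: (n + 2p - f) / s + 1, floored."""
    return (n + 2 * p - f) // s + 1

# 'Valid' padding: 4x4 input, 3x3 kernel -> 2x2 output.
print(conv_output_size(n=4, f=3, p=0, s=1))  # 2

# 'Same' padding: 5x5 input, 3x3 kernel, 1-pixel zero pad -> 5x5 output.
print(conv_output_size(n=5, f=3, p=1, s=1))  # 5
```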

2. ReLU: (a.k.a Activation layer)

  • The activation layer of a CNN, which brings non-linearity into the model so that it can solve complex problems.
  • Without ReLU, the model treats image classification as a linear problem, when it is actually a non-linear one.
  • Largely avoids the vanishing gradient problem (unlike sigmoid/tanh).
  • The negatives in the input are transformed to 0s.
  • The ReLU function outputs 0 when passed a negative value, and returns the value itself for positives. Hence, f(x) = max(0, x)
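The whole layer reduces to one line of NumPy; `relu` here is just an illustrative name:

```python
import numpy as np

def relu(x):
    """f(x) = max(0, x): negatives become 0, positives pass through unchanged."""
    return np.maximum(0, x)

print(relu(np.array([-2.0, -0.5, 0.0, 1.5, 3.0])))  # negatives zeroed out
```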
Source: https://ailephant.com/glossary/relu-function/

3. Pooling: (a.k.a Downsampling)

  • After applying the linear and non-linear transformations, the output is downsampled to keep the salient parts that are enough to describe the whole image, instead of keeping a huge number of parameters.
  • Each feature map (the output of the conv layer) can be reduced while retaining the important information.
  • Pooling summarizes the important features in each patch of the image, using either of the two ways given below.
  • How is it done?
  1. Average: The average of the receptive-field region covered by the filter on each feature map. Represents the average of the features present in that patch.

2. Max: The maximum value of each patch of the receptive-field region covered by the filter on each feature map. Represents the most prominent feature present in that patch of the image.

  • Advantages?

1. Reduces the number of weights to train, and thus directly cuts computation time and cost. Faster calculations, in short.
2. Makes training more robust to transformations such as rotation, scaling or cropping. How? A feature map's pixels denote the precise position of a specific feature, so without pooling the model becomes overly sensitive to exact positions: even a slightly shifted image yields a different feature map. Pooling summarizes each patch, so small shifts no longer change the output much.
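Both pooling variants can be sketched in NumPy; the `pool2d` helper is a hypothetical name used only for this illustration:

```python
import numpy as np

def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Downsample by taking the max (or average) of each size x size patch."""
    h, w = feature_map.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    out = np.zeros((out_h, out_w))
    reduce_fn = np.max if mode == "max" else np.mean
    for i in range(out_h):
        for j in range(out_w):
            patch = feature_map[i*stride:i*stride+size, j*stride:j*stride+size]
            out[i, j] = reduce_fn(patch)
    return out

fm = np.array([[1., 3., 2., 4.],
               [5., 6., 1., 2.],
               [7., 2., 9., 8.],
               [3., 1., 4., 5.]])
print(pool2d(fm, mode="max"))      # keeps the largest value in each 2x2 patch
print(pool2d(fm, mode="average"))  # keeps the mean of each 2x2 patch
```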

4. Flatten:

The pooled matrices are flattened into a single vector so that classification can be performed by the subsequent sequential layers.

5. Fully-Connected: (a.k.a Dense)

This layer, as the name suggests, is just like a layer in a standard neural network, where each unit of the flattened vector is connected to all activations of the previous layer. It is just like a hidden layer; the point is simply that it is fully connected.

6. Sigmoid/Softmax:

The last layer of the CNN, where predictions happen with the help of either the sigmoid or the softmax activation function, for binary or multi-class predictions respectively. This is basically the output layer after the FC layer, where we get the predicted class based on probability.

The predicted output and the actual output are compared and a loss is computed, which is then back-propagated for further training in the next epoch.
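A minimal sketch of softmax probabilities and the cross-entropy loss that would then be back-propagated (the class scores here are made up for illustration):

```python
import numpy as np

def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    exp = np.exp(logits - np.max(logits))  # subtract max for numerical stability
    return exp / exp.sum()

def cross_entropy(probs, true_class):
    """The loss that is back-propagated to update the weights."""
    return -np.log(probs[true_class])

probs = softmax(np.array([2.0, 1.0, 0.1]))  # e.g. scores for cat, dog, bird
print(probs)                                # highest score -> highest probability
print(cross_entropy(probs, true_class=0))   # small loss when the model is right
```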

Summary

  1. Input image passed to the network
  2. Select the stride, filter size and padding parameters, and perform the convolution
  3. Perform ReLU activation
  4. Perform pooling
    Repeat above as your need.
  5. Flatten the matrices into a single vector and pass it into the fully-connected layers
  6. Last layer will classify images based on distinct predicted classes
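The steps above can be sketched end-to-end as a single NumPy forward pass. All names and sizes here are illustrative, and the weights are random and untrained, so this only shows the data flow, not a useful classifier:

```python
import numpy as np

rng = np.random.default_rng(0)

def convolve(image, kernel):
    """Step 2: 'valid' convolution of a 2-D image with a square kernel."""
    f = kernel.shape[0]
    out = image.shape[0] - f + 1
    fmap = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            fmap[i, j] = np.sum(image[i:i+f, j:j+f] * kernel)
    return fmap

def max_pool(fmap, size=2):
    """Step 4: non-overlapping max pooling."""
    out = fmap.shape[0] // size
    return fmap[:out*size, :out*size].reshape(out, size, out, size).max(axis=(1, 3))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

image = rng.standard_normal((6, 6))               # Step 1: input image (grayscale)
fmap = convolve(image, rng.standard_normal((3, 3)))  # Step 2: convolution (6x6 -> 4x4)
fmap = np.maximum(0, fmap)                        # Step 3: ReLU activation
pooled = max_pool(fmap)                           # Step 4: pooling (4x4 -> 2x2)
vector = pooled.flatten()                         # Step 5: flatten into a vector
weights = rng.standard_normal((2, vector.size))   # Step 6: fully-connected layer
probs = softmax(weights @ vector)                 #         + softmax over 2 classes
print(probs)                                      # class probabilities summing to 1
```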

Interesting blog to glance upon.

Happy Reading!

You can get in touch with me via LinkedIn.

Feel free to share your views in the comments section, or to flag any misleading information. :)


Data Scientist by profession and a keen learner. Fascinated by photography, and scribbles other non-tech stuff too @shachi2flyyourthoughts.wordpress.com