Convolutional Neural Networks
Convolution
In order to find the elements most important for classification, convolution applies the same operations to all pixels.
In convolution ($c$), the weights ($w$) move along the sequence ($s$), and the scalar product is calculated at each position on the sequence. The weights vector is never longer than the sequence vector; otherwise, there wouldn't be a single position to which the convolution could be applied.
Let's express a one-dimensional convolution operation with a formula:
$$c_k = \sum_{t=0}^{m-1} s_{k+t} \cdot w_t$$

Here, $t$ is the index for calculating the scalar product, and $k$ is any value from $0$ to $n-m$, where $n$ is the length of the sequence and $m$ is the length of the weights.
The upper bound $n-m$ is chosen so that the weights don't extend beyond the end of the sequence.
```python
import numpy as np

def convolve(sequence, weights):
    # one output value for every valid position of the weights
    convolution = np.zeros(len(sequence) - len(weights) + 1)
    for i in range(convolution.shape[0]):
        convolution[i] = np.sum(weights * sequence[i:i + len(weights)])
    return convolution
```
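As a quick check, here's the function applied to a made-up sequence (the values are arbitrary, chosen only for illustration):

```python
sequence = np.array([1, 2, 3, 4, 5])
weights = np.array([1, 0, -1])

print(convolve(sequence, weights))  # [-2. -2. -2.]
```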
Take a two-dimensional image with a size of $n \times n$ pixels and a weight matrix of $m \times m$ pixels. This matrix is the kernel of the convolution.
The kernel moves inside the image from left to right, top to bottom. At each position, its weights are multiplied by the pixels underneath, and the products are summed up and recorded as the resulting pixel.
```python
s = [[1, 1, 1, 0, 0],
     [0, 1, 1, 1, 0],
     [0, 0, 1, 1, 1],
     [0, 0, 1, 1, 0],
     [0, 1, 1, 0, 0]]

w = [[1, 0, 1],
     [0, 1, 0],
     [1, 0, 1]]
```
Two-dimensional convolution can be expressed with the formula below:

$$c_{i,j} = \sum_{p=0}^{m-1} \sum_{q=0}^{m-1} s_{i+p,\,j+q} \cdot w_{p,q}$$
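To make the formula concrete, here's a minimal two-dimensional version of the `convolve()` function from earlier, applied to the example image `s` and kernel `w` defined above (a straightforward sketch, not an optimized implementation):

```python
import numpy as np

def convolve2d(image, kernel):
    m = kernel.shape[0]
    height = image.shape[0] - m + 1
    width = image.shape[1] - m + 1
    convolution = np.zeros((height, width))
    for i in range(height):
        for j in range(width):
            # elementwise product of the kernel and the window under it
            convolution[i, j] = np.sum(kernel * image[i:i + m, j:j + m])
    return convolution

print(convolve2d(np.array(s), np.array(w)))
# [[4. 3. 4.]
#  [2. 4. 3.]
#  [2. 3. 4.]]
```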
Horizontal contours can be found using convolution with the following kernel:
```python
np.array([[-1, -2, -1],
          [ 0,  0,  0],
          [ 1,  2,  1]])
```
Use the following kernel to find vertical contours:
```python
np.array([[-1, 0, 1],
          [-2, 0, 2],
          [-1, 0, 1]])
```
Convolutional Layers
Convolutional layers apply a convolution operation to input images.
A convolutional layer consists of customizable, trainable filters which are applied to the image. A filter is essentially a square matrix of $k \times k$ pixels.
Depth is added to the filter if the input is a color image. In this case, the filter is no longer a matrix, but a tensor, or a multidimensional array.
A convolutional layer can have several filters, each returning a two-dimensional image; stacking these outputs turns the result back into a three-dimensional image. In the next convolutional layer, the depth of the filters will be equal to the number of filters in the previous layer.
In formulas, the asterisk (*) indicates a convolution operation.
Convolutional layers contain fewer parameters than fully connected layers.
The settings of a convolutional layer:
- Padding: This setting adds zeros to the edges of the matrix (zero padding) so that the outermost pixels participate in the convolution at least as many times as the central pixels.
- Stride: This setting shifts the filter by more than one pixel at a time and produces a smaller output image.
If the initial image has a size of $n \times n$, the filter is $m \times m$, the padding is $p$, and the stride is $s$, then the new image size can be determined this way:

$$n' = \left\lfloor \frac{n + 2p - m}{s} \right\rfloor + 1$$
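As a sanity check, here's the formula as a small helper function; the sizes in the example are made up for illustration:

```python
def conv_output_size(n, m, p, s):
    # output width/height of a convolution, by the formula above
    return (n + 2 * p - m) // s + 1

# e.g., a 32-pixel image, a 3x3 filter, padding 1, and stride 2:
print(conv_output_size(32, 3, 1, 2))  # 16
```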
Convolutional Layers in Keras
```python
keras.layers.Conv2D(
    filters,
    kernel_size,
    strides,
    padding,
    activation
)
```
- Filters: The number of filters, which determines the depth of the output tensor.
- Kernel_size: The spatial size of the filter $k$. The filter is a tensor of size $k \times k \times d$, where $d$ is equal to the depth of the input image.
- Strides: A stride determines how far the filter shifts over the input matrix. It's set to 1 by default.
- Padding: This parameter determines the width of the zero padding. There are two types of padding: `valid` and `same`. The default type is `valid`, which is equal to zero padding. `same` sets the size of the padding automatically so that the width and height of the output tensor are equal to the width and height of the input tensor.
- Activation: This function is applied immediately after the convolution. You can use the activation functions already familiar to you: `'relu'` and `'sigmoid'`. By default, this parameter is `None`.
In order for the results of the convolutional layer to be compatible with a fully connected layer, connect a new layer named `Flatten` between them:
```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv2D, Flatten, Dense

model = Sequential()

# the input tensor has a size of (None, 32, 32, 3)
# the first dimension indexes the different objects
# it's set to None because the size of the batch is unknown
model.add(Conv2D(filters=4, kernel_size=(3, 3), input_shape=(32, 32, 3)))

# this tensor has a size of (None, 30, 30, 4)
model.add(Flatten())

# this tensor has a size of (None, 3600),
# where 3600 = 30 * 30 * 4
model.add(Dense(...))
```
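This example also illustrates the earlier point that convolutional layers contain fewer parameters than fully connected ones. A rough comparison, using the layer sizes from the snippet above:

```python
# the Conv2D layer: a 3x3x3 kernel plus one bias, for each of the 4 filters
conv_params = (3 * 3 * 3 + 1) * 4
print(conv_params)  # 112

# a Dense layer mapping the flattened 32*32*3 input to just 4 outputs
dense_params = (32 * 32 * 3 + 1) * 4
print(dense_params)  # 12292
```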
LeNet Architecture
You can reduce the number of the model's parameters with pooling techniques. The Max Pooling operation works like this (see the sketch after this list):

- The kernel size is determined (for example, 2x2).
- The kernel moves left to right, top to bottom; in each frame of four pixels, the pixel with the maximum value is found.
- That maximum pixel is kept, and its neighbors are discarded.
- The result is a smaller matrix formed only from the pixels with the maximum values.
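Here's a minimal NumPy sketch of Max Pooling with the stride equal to the kernel size; the example matrix is made up:

```python
import numpy as np

def max_pool(image, k=2):
    height, width = image.shape
    pooled = np.zeros((height // k, width // k))
    for i in range(height // k):
        for j in range(width // k):
            # keep only the maximum of each k x k frame
            pooled[i, j] = image[i*k:(i+1)*k, j*k:(j+1)*k].max()
    return pooled

image = np.array([[1, 3, 2, 1],
                  [4, 2, 1, 5],
                  [0, 1, 3, 2],
                  [2, 2, 1, 0]])

print(max_pool(image))
# [[4. 5.]
#  [2. 3.]]
```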
In Keras, you can also use the AveragePooling operation. The main differences between the techniques are:
- MaxPooling returns the maximum pixel value from the pixel group within a channel. If the input image has a size of $n \times n$, then the output image's size is $\frac{n}{k} \times \frac{n}{k}$, where $k$ is the kernel size.
- AveragePooling returns the average value of a group of pixels within a channel.
```python
keras.layers.AveragePooling2D(
    pool_size=(2, 2),
    strides=None,
    padding='valid',
    ...)
```
- `pool_size` — the larger it is, the more neighboring pixels are involved.
- `strides` — determines how far the window shifts over the input matrix. If `None` is specified, the stride is equal to the pooling size.
- `padding` — determines the width of the zero padding. The default type of padding is `valid`, which is equal to zero. `same` sets the size of the padding automatically.
The parameters of `MaxPooling2D` are similar to these.
We now have all the tools to create a popular architecture for classifying images with a size of 20-30 pixels, LeNet.
LeNet is structured as follows:
- The network begins with two or three 5x5 convolutional layers alternating with 2x2 Average Pooling layers. They gradually reduce the spatial resolution and collect all the information in the image into a small matrix of about 5x5 pixels.
- The number of filters increases from layer to layer to prevent the loss of important information.
- There are one or two fully connected layers at the end of the network. They collect all the features and classify them.
```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv2D, AvgPool2D, Flatten, Dense

model = Sequential()

model.add(
    Conv2D(
        6,
        (5, 5),
        padding='same',
        activation='tanh',
        input_shape=(28, 28, 1)
    )
)
model.add(AvgPool2D(pool_size=(2, 2)))

model.add(
    Conv2D(
        16,
        (5, 5),
        padding='valid',
        activation='tanh'
    )
)
model.add(AvgPool2D(pool_size=(2, 2)))

model.add(Flatten())

model.add(Dense(120, activation='tanh'))
model.add(Dense(84, activation='tanh'))
model.add(Dense(10, activation='softmax'))

model.compile(
    loss='sparse_categorical_crossentropy',
    optimizer='sgd',
    metrics=['acc']
)

model.summary()
```
The Adam Algorithm
The Adam algorithm makes the selection of the gradient descent step size automatic. It selects different step sizes for different neurons, which speeds up model training.
Let's write the Adam algorithm in Keras:
```python
model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['acc']
)
```
To configure the hyperparameters, use the algorithm's class:
```python
from tensorflow.keras.optimizers import Adam

optimizer = Adam()

model.compile(
    optimizer=optimizer,
    loss='sparse_categorical_crossentropy',
    metrics=['acc']
)
```
The main configurable hyperparameter of the Adam algorithm is the learning rate, which controls the size of the step the gradient descent starts with. It's written as follows:
```python
optimizer = Adam(lr=0.01)
```
The default learning rate is 0.001. Reducing it can slow down learning, but it often improves the overall quality of the model.
Data Generators
Arrays are stored in RAM, not on the computer's hard drive, so a large image dataset may simply not fit into memory. To deal with such a huge number of images, you need to implement dynamic data loading, which reads images from disk in batches as training proceeds.
```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator
```
The ImageDataGenerator class forms batches with images and class labels based on the photos in the folders. Let's put it to the test:
```python
datagen = ImageDataGenerator()
```
To extract data from a folder, call the `flow_from_directory()` method:
```python
datagen_flow = datagen.flow_from_directory(
    # the folder with the dataset
    '/dataset/',
    # the target image size
    target_size=(150, 150),
    # the batch size
    batch_size=16,
    # class mode
    class_mode='sparse',
    # set a random number generator
    seed=54321)
```
```
Found 1683 images belonging to 12 classes.
```
The data generator found 12 classes (folders) and a total of 1683 images.
Let's go through the arguments:
- `target_size=(150, 150)` — an argument with the target width and height of the image. The folders may contain images of different sizes, but the neural network needs all images to have the same dimensions.
- `batch_size=16` — the number of images in each batch. The more images there are, the more effective the model's training will be, but too many pictures won't fit in the GPU's memory, so 16 is a good starting value.
- `class_mode='sparse'` — an argument that indicates the class label output mode. `sparse` means that the labels will correspond to the number of the folder.

You can find out how the class numbers relate to folder names this way:
```python
# class indices
print(datagen_flow.class_indices)
```
Calling the `datagen.flow_from_directory(...)` method returns an object from which "picture-label" pairs can be obtained by using the `next()` function:
```python
features, target = next(datagen_flow)

print(features.shape)
```
The result is a four-dimensional tensor containing sixteen 150x150 images with three color channels: (16, 150, 150, 3).
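The labels, in turn, come as a one-dimensional tensor, with one class number per image in the batch (since `class_mode='sparse'`):

```python
print(target.shape)  # (16,)
```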
To train the model on this data, pass the `datagen_flow` object to the `fit()` method. To limit the training time, specify the number of dataset batches per epoch in the `steps_per_epoch` parameter:
```python
model.fit(datagen_flow, steps_per_epoch=len(datagen_flow))
```
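Here `len(datagen_flow)` is the number of batches the generator produces per epoch. For the dataset above (1683 images in batches of 16), that works out to:

```python
import math

# 105 full batches plus one partial batch
print(math.ceil(1683 / 16))  # 106
```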
The `fit()` method has to receive both the training and validation sets. To do this, create a separate data flow for each set.
```python
# indicate that the validation set contains
# 25% random objects
datagen = ImageDataGenerator(validation_split=0.25)

train_datagen_flow = datagen.flow_from_directory(
    '/datasets/fruits_small/',
    target_size=(150, 150),
    batch_size=16,
    class_mode='sparse',
    # indicate that this is the data generator for the training set
    subset='training',
    seed=54321)

val_datagen_flow = datagen.flow_from_directory(
    '/datasets/fruits_small/',
    target_size=(150, 150),
    batch_size=16,
    class_mode='sparse',
    # indicate that this is the data generator for the validation set
    subset='validation',
    seed=54321)
```
Training is now initiated like this:
```python
model.fit(train_datagen_flow,
          validation_data=val_datagen_flow,
          steps_per_epoch=len(train_datagen_flow),
          validation_steps=len(val_datagen_flow))
```
Image Data Augmentations
Augmentation is used to artificially expand a dataset by transforming the existing images. The changes are only applied to training sets, while test and validation sets remain the same.
There are several types of augmentation:
- Rotation
- Reflection
- Changing brightness and contrast
- Stretching and compression
- Blurring and sharpening
- Adding noise
You can apply more than one type of augmentation to a single image.
You can avoid problems if you follow these recommendations:
- Do not apply augmentation on test and validation sets so as not to distort metric values.
- Add augmentations gradually, one at a time, and keep an eye on the quality metric in the validation set.
- Always leave some images in the dataset unchanged.
Augmentations in Keras
There are many ways to add image augmentations in `ImageDataGenerator`.
```python
datagen = ImageDataGenerator(validation_split=0.25,
                             rescale=1./255,
                             vertical_flip=True)
```
Different generators have to be created for the training and validation sets:
```python
train_datagen = ImageDataGenerator(
    validation_split=0.25,
    rescale=1./255,
    horizontal_flip=True)

validation_datagen = ImageDataGenerator(
    validation_split=0.25,
    rescale=1./255)

train_datagen_flow = train_datagen.flow_from_directory(
    '/dataset/',
    target_size=(150, 150),
    batch_size=16,
    class_mode='sparse',
    subset='training',
    seed=54321)

val_datagen_flow = validation_datagen.flow_from_directory(
    '/dataset/',
    target_size=(150, 150),
    batch_size=16,
    class_mode='sparse',
    subset='validation',
    seed=54321)
```
Set the `train_datagen_flow` and `val_datagen_flow` objects to the same `seed` value to prevent the training and validation sets from sharing common elements.
ResNet in Keras
Import `ResNet50` from Keras (the 50 indicates the number of layers in the network):
```python
from tensorflow.keras.applications.resnet import ResNet50

model = ResNet50(input_shape=None,
                 classes=1000,
                 include_top=True,
                 weights='imagenet')
```
Let's go through the arguments:
- `input_shape` — the size of the input image. For example: `(640, 480, 3)`.
- `classes=1000` — the number of neurons in the last fully connected layer, where classification takes place.
- `weights='imagenet'` — the initialization of weights. ImageNet is the name of a large image database that was used to train the network to sort pictures into 1000 classes. If you start training the network on ImageNet and then continue with your task, the result will be much better than if you'd just trained it from scratch. To initialize the weights at random, write `weights=None`.
- `include_top=True` — indicates that there are two layers (GlobalAveragePooling2D and Dense) at the end of ResNet. If you set it to `False`, these layers will be missing.
GlobalAveragePooling2D applies a pooling window the size of the entire tensor. Like AveragePooling2D, it returns the average value from a group of pixels inside a channel. GlobalAveragePooling2D averages the information across the whole image in order to get a single pixel with a large number of channels (for example, 2048 for ResNet50).
Dense is the fully connected layer responsible for classification.
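A rough NumPy illustration of what GlobalAveragePooling2D computes (the batch and spatial sizes here are made up):

```python
import numpy as np

# a batch of feature maps: (batch, height, width, channels)
features = np.random.rand(16, 5, 5, 2048)

# average over the spatial axes, leaving one value per channel
pooled = features.mean(axis=(1, 2))
print(pooled.shape)  # (16, 2048)
```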
Let's learn how to use a network that's been pre-trained on ImageNet. To adapt ResNet50 to our task, let's remove the top and rebuild it:
```python
from tensorflow.keras.layers import GlobalAveragePooling2D, Dense
from tensorflow.keras.models import Sequential

backbone = ResNet50(input_shape=(150, 150, 3),
                    weights='imagenet',
                    include_top=False)

model = Sequential()
model.add(backbone)
model.add(GlobalAveragePooling2D())
model.add(Dense(12, activation='softmax'))
```
The `backbone` is what's left of ResNet50.
Say there's a very small dataset that only contains 100 pictures and two classes. If you train ResNet50 on this dataset, it's almost guaranteed to overfit because it has too many parameters (about 23 million)! The network will end up making perfectly accurate predictions on the training set, but near-random ones on the test set.
To avoid this, we'll "freeze" part of the network: some layers will keep their ImageNet weights and won't be updated by gradient descent. We'll train only one or two fully connected layers at the top of the network. This way, the number of trainable parameters is reduced, but the architecture itself is preserved.
Let's freeze the network like this:
```python
backbone = ResNet50(input_shape=(150, 150, 3),
                    weights='imagenet',
                    include_top=False)

# freeze ResNet50 with the top removed
backbone.trainable = False

model = Sequential()
model.add(backbone)
model.add(GlobalAveragePooling2D())
model.add(Dense(12, activation='softmax'))
```
We didn't freeze the fully connected layer above the `backbone`, so the network is still able to learn.
Freezing helps you avoid overfitting and speeds up training: gradient descent doesn't need to compute derivatives for the frozen layers.
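If you want to freeze only part of the backbone rather than all of it, one possible approach is to toggle the `trainable` flag on individual layers. The cut-off of 10 layers below is an arbitrary illustration, not a prescribed value:

```python
from tensorflow.keras.applications.resnet import ResNet50

backbone = ResNet50(input_shape=(150, 150, 3),
                    weights='imagenet',
                    include_top=False)

# freeze everything except the last few layers (the cut-off is arbitrary here)
for layer in backbone.layers[:-10]:
    layer.trainable = False
```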