
Multi-Layer Perceptron with NumPy

Screenshot from 2025-10-23 10-49-36

The project presents the implementation of an MLP neural network with 4 hidden layers that recognizes handwritten digits in images. The dataset is imported from tensorflow.keras.datasets.mnist.load_data() and contains 60000 training samples and 10000 validation samples.

Screenshot from 2025-10-23 10-55-31

The project includes:

  • Data pre-processing
  • Feedforward
  • Backpropagation
  • Gradient Descent
  • Mean Squared Error
  • Accuracy metric
  • Code implementation


Data pre-processing

The data retrieved from tensorflow.keras.datasets.mnist.load_data() has the following structure:

  • The training data consists of 60000 images of size 28x28 pixels with a color depth of 8 bits (grayscale), illustrating digits, and 60000 single-digit labels (digits from 0 to 9).
  • The validation data consists of 10000 images of size 28x28 pixels with the same color characteristics, illustrating digits, and 10000 single-digit labels.

The data pre-processing is performed as follows:

  • Each 28x28 image stored in the arrays of shape 60000x28x28 and 10000x28x28, respectively, is flattened into a vector of 784 elements. Consequently, the two matrices take the shapes 60000x784 and 10000x784 respectively.
  • The data is normalized by dividing the two feature matrices by 255, resulting in values between 0 and 1. This works because the 8-bit grayscale encoding has values between 0 and 255.
  • The labels corresponding to the images have the shapes 60000x1 and 10000x1 respectively. These values are transformed into categorical (one-hot) form. For example, 0 = |1 0 0 0 0 0 0 0 0 0|, 1 = |0 1 0 0 0 0 0 0 0 0|, 2 = |0 0 1 0 0 0 0 0 0 0|, etc. This gives matrices of shape 60000x10 and 10000x10 respectively.
  • For better management of computational resources and better optimization of the model (neural network), the training data is divided into smaller batches. For this project a batch size of 100 was chosen, so the feature and label arrays take the shapes 600x100x784 and 600x100x10 respectively.
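The pre-processing steps above can be sketched in NumPy. This is a minimal sketch: random arrays with the same shapes stand in for the real data so the snippet is self-contained; in the project the arrays would come from tensorflow.keras.datasets.mnist.load_data().

```python
import numpy as np

def preprocess(images, labels, batch_size=100):
    """Flatten, normalize and one-hot encode the data, then split it into batches."""
    n = images.shape[0]
    x = images.reshape(n, 784).astype(np.float64) / 255.0   # 28x28 -> 784, values in [0, 1]
    y = np.zeros((n, 10))
    y[np.arange(n), labels] = 1.0                           # one-hot: 3 -> |0 0 0 1 0 0 0 0 0 0|
    # split into batches: (n/batch, batch, 784) and (n/batch, batch, 10)
    return x.reshape(-1, batch_size, 784), y.reshape(-1, batch_size, 10)

# stand-in data with the same shapes as the MNIST training set
images = np.random.randint(0, 256, (60000, 28, 28), dtype=np.uint8)
labels = np.random.randint(0, 10, 60000)
xb, yb = preprocess(images, labels)
print(xb.shape, yb.shape)  # (600, 100, 784) (600, 100, 10)
```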


Feedforward

Feedforward is the neural network's process of passing data from the input layer to the output layer. Each perceptron of the network computes the weighted sum of its inputs (1.1a) and passes it to the activation function (1.1b). Figure 1 illustrates a perceptron, where x1..p represents the inputs, w1..p the weight for each input, b the bias (balancing value), S the weighted sum (1.1a), and A the activation function.
fig. 1


Screenshot from 2025-10-23 11-14-51


S = w1·x1 + w2·x2 + … + wp·xp + b       (1.1a)


A = f(S)       (1.1b)


S - weighted sum, w - weights, x - input values, f() - activation function, A - activation function result

In accordance with the above equations, the relations and their matrix forms for a neural network with 4 hidden layers are given below. The input, shown in matrix 1.2, is a 100x784 matrix. Each row is a 28x28-pixel image flattened into a vector of 784 values; the number of rows is the number of samples in the batch being processed, 100 in this case.


X - matrix of shape 100x784, one flattened image per row       (1.2)

The weight matrices are initialized as random matrices where the number of rows is the input feature size and the number of columns is the number of perceptrons in that layer, i.e. the number of outputs. For example, matrix 1.3, the weight matrix of the first hidden layer, has a number of rows equal to the size of the vector resulting from the 28x28-pixel image, i.e. 784. The number of columns, as can be seen in matrix 1.3, is the number of outputs toward layer 2.


w_L1 - matrix of shape 784x64, one column per perceptron of the first hidden layer       (1.3)

The bias is initialized as a matrix with a single row, containing one scalar per perceptron in the layer.

Equation 1.4 gives the weighted sum of the input layer, to which the bias vector is added. The shape of the S_L1 matrix is 100x64, resulting from multiplying the matrix X (100x784) by the matrix w_L1 (784x64). For equations 1.6, 1.8 and 1.10 the approach is identical, but the matrix X is replaced by the matrix resulting from the activation function of the previous layer, and the weight matrix w is the one belonging to the respective layer.
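As a quick check of the shapes in equation 1.4 (a sketch: the 64-unit first hidden layer follows the text, the random initialization is an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 784))             # one batch of flattened images
w_L1 = rng.standard_normal((784, 64))  # rows = input features, columns = perceptrons
b_L1 = np.zeros((1, 64))               # one bias per perceptron, broadcast over the batch

S_L1 = X @ w_L1 + b_L1                 # equation 1.4
print(S_L1.shape)  # (100, 64)
```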


S_L1 = X · w_L1 + b_L1       (1.4)


In matrix form, each element of S_L1 in equation 1.4 is the dot product of a row of X with a column of w_L1, plus the corresponding bias.

The activation functions expressed in equations 1.5, 1.7, 1.9 and 1.11 apply the sigmoid function element-wise to the weighted-sum matrix of each layer.


A_L1 = f(S_L1)       (1.5)


S_L2 = A_L1 · w_L2 + b_L2       (1.6)


A_L2 = f(S_L2)       (1.7)


S_L3 = A_L2 · w_L3 + b_L3       (1.8)


A_L3 = f(S_L3)       (1.9)


S_L4 = A_L3 · w_L4 + b_L4       (1.10)


A_L4 = f(S_L4)       (1.11)


S_Ln – weighted sum matrix of layer n; w_Ln – weight matrix of layer n; X – matrix of input data; A_Ln – the matrix resulting from the activation function (sigmoid in this case) for layer n
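Putting equations 1.4–1.11 together, a minimal feedforward pass might look as follows. This is a sketch: the 64-unit first hidden layer and the 10-unit output follow the text, while the remaining hidden sizes and the random initialization scale are illustrative assumptions.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

# layer sizes: 784 inputs -> hidden layers -> 10 outputs (middle sizes are assumptions)
sizes = [784, 64, 64, 32, 10]
rng = np.random.default_rng(0)
weights = [rng.standard_normal((i, o)) * 0.1 for i, o in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros((1, o)) for o in sizes[1:]]

def feedforward(X):
    """Return the weighted sums S and activations A for every layer."""
    A, Ss, As = X, [], []
    for w, b in zip(weights, biases):
        S = A @ w + b          # equations 1.4, 1.6, 1.8, 1.10
        A = sigmoid(S)         # equations 1.5, 1.7, 1.9, 1.11
        Ss.append(S)
        As.append(A)
    return Ss, As

X = rng.random((100, 784))     # one batch of 100 flattened images
Ss, As = feedforward(X)
print(As[-1].shape)  # (100, 10)
```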

Backpropagation

Backpropagation is the method of training the neural network in order to reduce the difference between the predicted and the actual outcome. As the name suggests, it proceeds in the reverse direction to feedforward, producing the gradients with which the weights (w) are adjusted.

The gradients are determined using the chain rule, which gives the derivative of a composite function. For a function y = f(g(x)), where g is a function of x and f is a function of g, the derivative of y with respect to x is (2.1):

dy/dx = dy/dg · dg/dx       (2.1)

According to the principle stated above and considering equation 1.1, we can write the partial derivative of the loss with respect to the weights, ∂E/∂w, for a single-layer perceptron as (2.2).

∂E/∂w = ∂E/∂A · ∂A/∂S · ∂S/∂w       (2.2) E – loss, w – weights, S – weighted sum, X – input values
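The chain rule in 2.2 can be checked numerically for a single sigmoid perceptron by comparing the analytic gradient with a finite-difference estimate. This is a sketch; the loss E = ½(A − y)² is an assumption chosen here so that ∂E/∂A = A − y.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

x, y = np.array([0.5, -0.3, 0.8]), 1.0   # inputs and target (illustrative values)
w = np.array([0.1, 0.2, -0.1])

def loss(w):
    A = sigmoid(w @ x)                   # S = w·x (bias omitted), A = f(S)
    return 0.5 * (A - y) ** 2

# analytic gradient via the chain rule: dE/dw = (dE/dA) * (dA/dS) * (dS/dw)
A = sigmoid(w @ x)
grad = (A - y) * A * (1 - A) * x

# finite-difference estimate of the same gradient
eps = 1e-6
num = np.array([(loss(w + eps * e) - loss(w - eps * e)) / (2 * eps)
                for e in np.eye(3)])
print(np.allclose(grad, num, atol=1e-6))  # True
```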

Using the example of equation 2.2 and the chain rule with multiple consecutive functions, we can determine the partial derivative equations for a model with four hidden layers. Based on Figure 2, which represents a sketch of the model's operation, we can deduce the partial derivatives and their order.


fig. 2


Screenshot from 2025-10-23 12-36-48

This gives equations 2.3 which determine the gradients of the weights on the 4 hidden layers.

Screenshot from 2025-10-23 12-39-40       (2.3)

E –loss; A Ln – activation function for layer n; S Ln – weighted sum on layer n; w Ln – weight of layer n

Similarly to the weight gradients, we determine the gradients for the biases by replacing w_L with b_L. The bias gradient is a vector with one element per perceptron in the layer.

Screenshot from 2025-10-23 12-44-25       (2.4)

b Ln – bias for layer n

Taking into account that the neural network has a repetitive form with respect to the positioning of the perceptrons composing it, we can generalize the relations for the partial derivatives as follows:

  • For the output layer, ∂E/∂A_L represents the change in the loss (E) as a function of the change in the activation function result (A_L). Here we have chosen the simplest loss derivative: the difference between the model output (A_L) and the actual output (Y) used during training.


∂E/∂A_L = A_L − Y       (2.5)

  • The influence of the weighted sum (S_L) on the result of the activation function (A_L = f(S_L)) is equal to the derivative of the activation function, f′(S_L). In this case we have chosen the sigmoid function, whose derivative is f(S)·(1 − f(S)) (a standard result from the literature).

∂A_L/∂S_L = f(S_L) · (1 − f(S_L))       (2.6)
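The sigmoid-derivative identity used in 2.6 can be verified numerically (a quick sketch):

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

s = np.linspace(-5, 5, 11)
eps = 1e-6
numeric = (sigmoid(s + eps) - sigmoid(s - eps)) / (2 * eps)  # f'(s) by central difference
identity = sigmoid(s) * (1 - sigmoid(s))                     # f(s) * (1 - f(s))
print(np.allclose(numeric, identity, atol=1e-9))  # True
```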

  • Equation 2.7 represents the change in the weighted sum of the upper layer, S_L+1, as a function of the inputs it receives from the output of the lower layer, A_L. This is equal to the weights of the upper layer, w_L+1. We take b_L+1 = 0, since we are only interested in how S_L+1 changes as a function of A_L.

∂S_L+1/∂A_L = w_L+1       (2.7)

  • The derivative of the weighted sum S_L with respect to the weights w_L is equal to the activation result of the previous layer, A_L−1, except for the first layer, where A_L−1 becomes the input to the model, i.e. X. We take b_L = 0, since we are only interested in how S_L changes as a function of w_L.

∂S_L/∂w_L = A_L−1 (= X for the first layer)       (2.8)

  • The derivative of the weighted sum S_L with respect to b_L is equal to 1, since the term w_L · A_L−1 does not depend on b_L.

∂S_L/∂b_L = 1       (2.9)

Considering equations 2.5, 2.6, 2.7, 2.8 and 2.9, equations 2.3 and 2.4 can be written as 2.10 below.

The index T denotes the transposed matrix, used to match shapes between layers. Where in forward propagation the matrix w_L maps the input shape of layer L to its output shape, in backpropagation the reverse mapping must be applied, which is achieved by transposing the matrix.


∂E/∂w_L4 = A_L3ᵀ · [(A_L4 − Y) ⊙ f′(S_L4)]


∂E/∂w_L3 = A_L2ᵀ · [((A_L4 − Y) ⊙ f′(S_L4)) · w_L4ᵀ ⊙ f′(S_L3)]


∂E/∂w_L2 = A_L1ᵀ · [(((A_L4 − Y) ⊙ f′(S_L4)) · w_L4ᵀ ⊙ f′(S_L3)) · w_L3ᵀ ⊙ f′(S_L2)]


∂E/∂w_L1 = Xᵀ · [((((A_L4 − Y) ⊙ f′(S_L4)) · w_L4ᵀ ⊙ f′(S_L3)) · w_L3ᵀ ⊙ f′(S_L2)) · w_L2ᵀ ⊙ f′(S_L1)]


∂E/∂b_L4 = (A_L4 − Y) ⊙ f′(S_L4)


∂E/∂b_L3 = ((A_L4 − Y) ⊙ f′(S_L4)) · w_L4ᵀ ⊙ f′(S_L3)


∂E/∂b_L2 = (((A_L4 − Y) ⊙ f′(S_L4)) · w_L4ᵀ ⊙ f′(S_L3)) · w_L3ᵀ ⊙ f′(S_L2)


∂E/∂b_L1 = ((((A_L4 − Y) ⊙ f′(S_L4)) · w_L4ᵀ ⊙ f′(S_L3)) · w_L3ᵀ ⊙ f′(S_L2)) · w_L2ᵀ ⊙ f′(S_L1)       (2.10)


⊙ – element-wise (Hadamard) product

To simplify the implementation, we introduce the following notation:

  • for the last layer: δ_L4 = (A_L4 − Y) ⊙ f′(S_L4)
  • for the rest of the layers: δ_Ln = (δ_Ln+1 · w_Ln+1ᵀ) ⊙ f′(S_Ln)

These are equations 2.10 with the last factor (∂S_L/∂w_L) left out; written out for each layer they give 2.11.


δ_L4 = (A_L4 − Y) ⊙ f′(S_L4)


δ_L3 = (δ_L4 · w_L4ᵀ) ⊙ f′(S_L3)


δ_L2 = (δ_L3 · w_L3ᵀ) ⊙ f′(S_L2)


δ_L1 = (δ_L2 · w_L2ᵀ) ⊙ f′(S_L1)       (2.11)

Considering equations 2.10 and 2.11, the gradients take the following form (2.12 and 2.13). The shape of each gradient matrix is identical to the shape of the corresponding weight matrix.


∂E/∂w_L4 = A_L3ᵀ · δ_L4


∂E/∂w_L3 = A_L2ᵀ · δ_L3


∂E/∂w_L2 = A_L1ᵀ · δ_L2


∂E/∂w_L1 = Xᵀ · δ_L1       (2.12)

As follows from equations 2.10, the gradients of b_L are equal to δ_L; the contributions of the individual input samples (100 in this case) are combined by summing over the rows (the vertical axis).


∂E/∂b_L4 = Σ_rows δ_L4


∂E/∂b_L3 = Σ_rows δ_L3


∂E/∂b_L2 = Σ_rows δ_L2


∂E/∂b_L1 = Σ_rows δ_L1       (2.13)
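Equations 2.11–2.13 translate almost directly into NumPy. The sketch below runs a feedforward pass first so that the activations A of every layer are available; the hidden sizes apart from the 64-unit first layer are illustrative assumptions.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

rng = np.random.default_rng(0)
sizes = [784, 64, 32, 16, 10]              # input -> layers (middle sizes are assumptions)
weights = [rng.standard_normal((i, o)) * 0.1 for i, o in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros((1, o)) for o in sizes[1:]]

X = rng.random((100, 784))                 # one batch of 100 samples
Y = np.eye(10)[rng.integers(0, 10, 100)]   # one-hot targets

# feedforward, keeping the activation of every layer (A[0] is the input X)
A = [X]
for w, b in zip(weights, biases):
    A.append(sigmoid(A[-1] @ w + b))

# backpropagation (equations 2.11-2.13); for the sigmoid, f'(S) = A * (1 - A)
delta = (A[-1] - Y) * A[-1] * (1 - A[-1])               # delta of the last layer
grads_w, grads_b = [], []
for l in range(len(weights) - 1, -1, -1):
    grads_w.insert(0, A[l].T @ delta)                   # dE/dw_L = A_(L-1)^T . delta_L  (2.12)
    grads_b.insert(0, delta.sum(axis=0, keepdims=True)) # sum over the 100 rows          (2.13)
    if l > 0:
        delta = (delta @ weights[l].T) * A[l] * (1 - A[l])  # delta of the layer below   (2.11)
print(grads_w[0].shape, grads_b[0].shape)  # (784, 64) (1, 64)
```

Note how each gradient matrix comes out with exactly the shape of the weight matrix it corrects, as stated in the text.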

Gradient Descent

Gradient descent is a method of optimizing a model by finding a local minimum of a differentiable function. In machine learning it has the role of correcting the weights used in the neural network. The generalized form and the way it works are expressed in equation 3.1.


w_new = w_old − η · ∂E/∂w       (3.1)


η – learning rate

Equation 3.2 presents the weight and bias update for each layer.


w_Ln = w_Ln − η · ∂E/∂w_Ln


b_Ln = b_Ln − η · ∂E/∂b_Ln       (3.2)
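The update in 3.2 is a single line per parameter. A small sketch with an assumed learning rate of 0.1 and made-up weight and gradient values:

```python
import numpy as np

eta = 0.1                         # learning rate (an assumed value)
w = np.array([[0.5, -0.2], [0.3, 0.1]])
grad_w = np.array([[0.1, 0.0], [-0.2, 0.4]])

w_new = w - eta * grad_w          # equation 3.2
print(w_new)  # approximately [[0.49, -0.2], [0.32, 0.06]]
```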

Mean Squared Error

The mean squared error is a method of expressing the errors a model makes. According to equation 4.1, it is the average of the squared differences between the predicted and the actual results.

MSE = (1/n) · Σᵢ (Yᵢ − Aᵢ)²       (4.1)
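Equation 4.1 in NumPy (a sketch with made-up predicted and actual values):

```python
import numpy as np

predicted = np.array([0.9, 0.2, 0.8])
actual = np.array([1.0, 0.0, 1.0])
mse = np.mean((actual - predicted) ** 2)   # equation 4.1
print(mse)  # approximately 0.03
```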



Accuracy metric

For classification tasks, this metric provides quick information about the model's performance in terms of the correctness of the delivered results. The accuracy expresses the ratio of the number of correct results to the total number of results.

Accuracy = number of correct predictions / total number of predictions       (5.1)
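For one-hot outputs, the accuracy in 5.1 can be computed by comparing the argmax of each prediction row with the argmax of the corresponding label row (a sketch with made-up values):

```python
import numpy as np

pred = np.array([[0.1, 0.8, 0.1],
                 [0.7, 0.2, 0.1],
                 [0.2, 0.3, 0.5],
                 [0.6, 0.3, 0.1]])
true = np.array([[0, 1, 0],
                 [1, 0, 0],
                 [0, 0, 1],
                 [0, 1, 0]])
# a prediction counts as correct when its largest output matches the label class
accuracy = np.mean(pred.argmax(axis=1) == true.argmax(axis=1))  # equation 5.1
print(accuracy)  # 0.75 (3 of 4 correct)
```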






Bibliography:

https://medium.com/@14prakash/back-propagation-is-very-simple-who-made-it-complicated-97b794c97e5c

https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/

https://machinelearning.tobiashill.se/2018/12/04/part-2-gradient-descent-and-backpropagation/

https://www.3blue1brown.com/lessons/backpropagation-calculus

http://neuralnetworksanddeeplearning.com/chap2.html

https://hmkcode.com/ai/backpropagation-step-by-step/

https://sefiks.com/2017/01/21/the-math-behind-backpropagation/#google_vignette

https://medium.com/@samuelsena/pengenalan-deep-learning-part-3-backpropagation-algorithm-720be9a5fbb8

https://pabloinsente.github.io/the-multilayer-perceptron

https://towardsdatascience.com/understanding-backpropagation-abcc509ca9d0/

https://www.geeksforgeeks.org/machine-learning/backpropagation-in-neural-network/
