This post describes the steps involved in deep learning, with a practical example, written in Python. The code creates a multi layer neural network with logistic regression. This system will be trained with a set of images of handwritten digits and then be tested with a different set of images to see if it can identify the digits.

This is not a complete guide to Machine Learning or Deep Learning, but just one example, offered without rigorous proof of the equations used. If you’re looking for a course on the subject, Andrew Ng’s Machine Learning course on Coursera would be a very good and popular choice.

The 4 layers of this example neural network are:

1. Input layer of 256 inputs
2. Hidden layer of 30 nodes
3. Hidden layer of 20 nodes
4. Output layer of 10 outputs

“nodes” and “layers” are merely names used to conceptualize the mathematics. In the code, the numbers of layers and nodes simply translate to the sizes of the weights arrays. These arrays store the “learned” information of the system.

There are some great machine learning/deep learning libraries for Python, such as scikit-learn, TensorFlow and Theano. To better illustrate deep learning, we are not going to use these libraries, but instead code the system from scratch.

The only Python libraries needed are numpy for matrix manipulation, matplotlib for plotting the data, random to initiate the weights, and urllib to pull down the data from a URL:

### Sigmoid Function

The sigmoid function outputs a value between zero and one from an input between $$-\infty$$ and $$\infty$$. It is used to take an input and turn it into an activation value for a “neuron”. or node.

$$h_\theta (x) = g(\theta^T x) = g(z) = \dfrac{1}{1 + e^{-z}} = \dfrac{1}{1 + e^{-\theta^T x}}$$

### Partial Derivative of the Sigmoid Function

Later, we use gradient decent to make corrections to the weights. This is how the system learns. “Gradient decent” suggests differentials, and indeed, we will need the partial differential of the sigmoid function:

$$\dfrac{dg(z)}{dz} = g(z) (1 - g(z))$$

The system uses “weights” to determine the influence of each activation value on the output of each neuron. It is these weights which are altered during the learning phases. To start with, these will be initialized to random values. If we initialized the weights to zero, the system would not be able to form suitable differentials between the neurons.

This function creates a 2D array of random weights, $$w$$ where $$-1 \le w_{ij} \le 1$$

“Bias terms” are used to add extra weighing to the inputs. Think of these as the intercept point of the equations.

This function prepends a column of ones to an array for adding bias terms to inputs.

This function prepends a column of zeros to an array. This is to remove the influence of the bias term weights later on for regularization.

Our system is designed to predict boolean outputs of either zero or one. However, the sigmoid function outputs a number somewhere between zero and one. This output is the probability that Y is one, so we simply return one if the sigmoid function output is greater than half, otherwise we return zero.

### Prediction

The “predict” function uses the weighted terms and inputs to predict outputs for each layer in the neural network.

For each layer:

$$h_\theta (x) = g(\theta X)$$

where:

• $$g$$ is the sigmoid function
• $$h_\theta$$ is the hypothesis function (output) for each layer
• $$x$$ are the inputs on each neuron
• $$\theta$$ are the weights

Function outputs:

• list of activations per layer (for every sample)
• weighted outputs per layer (for every sample)
• last layer outputs (per sample)

### Cost Function for Regularized Logistic Regression

The learning algorithm seeks to minimize this cost function.

$$J(\theta) = - \dfrac{1}{m} \sum\limits_{i=1}^{m} \sum\limits_{j=1}^{J} [t_i \ log(h_\theta(x_{ij})) + (1 - t_{ij}) \ log(1-h_\theta(x_{ij}))] + \dfrac{\lambda}{2m} \sum\limits_{k=1}^{K} \theta_{k}^2$$

Where:

• $$\theta$$ are the weight terms
• $$m$$ is the number of training examples
• $$J$$ is the number of output nodes
• $$K$$ is the number of weight terms across all nodes (not including the bias values)
• $$\lambda$$ is the regularization parameter
• $$t$$ is the target output
• $$h(\theta)$$ is the predicted output, or hypothesis function

### Backpropagation

The grads function returns the gradients of the cost function by “backpropagation”. First the forward pass is run (using the “predict” function). This computes all the activation terms, for all nodes in the network.

Then the gradients of the error terms are calculated and summed over all the training samples. The code shows different calculations for output layers (where the target output value is known), and the hidden layers (where the target value has to propagate down from the output layer).

The gradient of the cost function (the return value of this function) is then calculated as:

$$\dfrac {\partial}{\partial \theta} J(\theta) = \dfrac{1}{m} \Delta$$ for the output layer

$$\dfrac {\partial}{\partial \theta} J(\theta) = \dfrac{1}{m} \Delta + \dfrac{\lambda}{m} \theta$$ for the hidden layers

for each neuron, where:

• $$m$$ is the number of data samples
• $$\lambda$$ is the regularization parameter
• $$\Delta$$ are the accumulated gradients (for all test samples)

### The Data

The data source is a selection of 1593 handwritten digits. The data consists of rows of digit data with each row containing a 16x16 pixel image (the first 256 columns), followed by 10 digits of ones and zeros, representing the Y value.

Y values are stored as ten columns of data. “1” in the first column signifies zero, a “1” in the second signifies two and so on.

The data is extracted from the given URL. It is then split into Dev and Test data. The Dev data being used for training, while the Test data is used to check the results of the training.

X data: (1593, 256)
Y data: (1593, 10)


### System Initialization

Initial values of $$\theta$$ (the weights) are set. The input layer is of a known size (the 256 pixels of the images), as is the output layer (10 outputs representing the output number). However the number of hidden layers and the nodes per hidden layer can be of any size we choose. One hidden layer is usually acceptable for such a case, however two hidden layers are used here, by way of an example. The only representation of these sizes in the code is in the size of these weights arrays (theta1, theta2, etc), which can be changed as required.

### The Learning Phase

The code below puts the training (Dev data) dataset through the learning algorithm, updates the weights, then tests its performance using the testing dataset. It loops through this process 500 times, which is enough to bring the performance of the tests to an acceptable level.

The weights are updated by simply subtracting the error gradients. More complex algorithms are available which can improve the efficiency here.

..............................


Plot of the cost against the number of iterations

Plot of the accurancy on the test data against the number of iterations

### Prediction

Now the weights are set, these values can be used to make predictions. Here, 5 samples are selected from the Test data set and are run through the prediction algorithm. Note that the prediction of new data is relatively quick. The hard work has already been done in setting the weights.

Actual:   5
Guess:    5


-------------------------------
Actual:   7
Guess:    7


-------------------------------
Actual:   9
Guess:    9


-------------------------------
Actual:   4
Guess:    4


-------------------------------
Actual:   8
Guess:    8


-------------------------------
Right: 5
Wrong: 0
Accuracy: 100%


## Conclusion

Deep learning code need not be overly long. Here we have shown an example which is small, yet makes reasonable predictions on hand written digits.

The algorithms used are not specifically for handwriting recognition, so this same code could easily be used for other Deep Learning applications.

For real world applications it may be prudent to use existing libraries which;

• are be better optimized for the given hardware
• have been more thoroughly tested
• contain more complex algorithms for faster gradient decent