
How to build your own Handwritten Digits Recognition System

In recent years we have witnessed a digital revolution, with a rapid increase in the number of images and multimedia files generated daily. This has created new challenges for computer scientists, who need systems capable of recognizing patterns and features in images, and it has driven the development of machine learning and deep learning models that excel at this type of task. The “Hello World” of image classification is the seemingly simple problem of classifying handwritten digits. The task is harder than it looks: building a reliable system that recognizes the many ways people write characters is extremely challenging.

This blog post presents deep learning models for classifying digits from zero to nine. The MNIST dataset is commonly used for testing the performance of machine learning and deep learning algorithms.

This project uses the MNIST dataset as an example to evaluate the performance of three different deep learning models and compare their results. The source code for this project can be found here:

TeddyMeg/Handwritten-Digits-Recognizer: The objective of this project is to build an image classifier using neural networks to accurately categorize handwritten digits. It also has a GUI made using tkinter and OpenCV where a user can draw a digit and choose a model to predict what the number is.

The models presented here include a simple neural network, a convolutional neural network and a deep convolutional network namely VGG16.


Recognition is the process of identifying something based on past experience. Digit recognition is therefore the process of identifying handwritten digits from sources such as documents, papers, bank cheques, and pictures. Handwritten character recognition has become a standard research area thanks to advances in technologies such as handwriting capture devices and powerful mobile computers (Elleuch, Maalej, & Kherallah, 2016). Handwritten digit recognition is an essential function in a variety of practical applications, for example in administration and finance (Niu & Suen). The problem has been studied since the 1980s, but variations in the size, shape, thickness, and position of digits made it very difficult to build an application that could account for all these factors. Recent advances in machine learning and deep learning have reduced the human effort needed for perceiving, learning, and recognizing such patterns. Deep learning models can achieve very high accuracy, sometimes better than humans.

Problem Statement

The rapid growth of new documents and multimedia news has created new challenges in pattern recognition and machine learning (Cecotti, 2016). Handwriting analysis is a cumbersome, methodical process that relies on broad knowledge of the way people form digits and letters, and it exploits the unique characteristics of numerals and letters, for example the shapes, sizes, and individual writing styles that people use (Winkler, A combination of statistical and syntactical pattern recognition applied to classification of unconstrained handwritten numerals, 1980). Typically, handwriting experts use sophisticated classification models to analyze printed or handwritten character images. As part of this process, they extract features from the samples, including the slant, orientation, and center alignment of the letters. Offline digit recognition has many practical applications. For instance, a handwritten sample can be analyzed to identify the zip code in an address written or printed on an envelope (Hanmandlu & Murthy, 2007). Because handwriting depends heavily on the writer, building a highly reliable system that recognizes any handwritten character given to an application is challenging. The task is hard for a machine because handwritten digits are imperfect and come in many different styles. A handwritten digit recognition system addresses this problem: it takes an image of a digit and identifies which digit the image contains.

Project Goal

The objectives of this project are:

· Prepare the dataset and split the data into appropriate amounts for training, validation, and testing.

· Implement three types of models on the predefined MNIST dataset and evaluate their performance on handwritten digit recognition.

· Identify whether image preprocessing methods have a significant impact on the error rate of the selected classification models.

· Recommend, based on the evaluated findings, which algorithms can improve the accuracy of handwritten digit recognition to as much as 99%.

Relevance and Significance

Optical character recognition (OCR) refers to the recognition by computer of characters on optically scanned and digital text pages (Winkler, 1980). Despite its difficulties, it contributes widely to improving the interface between humans and machines in many applications (Sarkhel, Das, Das, Kundu, & Nasipuri, 2017). Because of its many potential applications, such as reading postal codes and medical prescriptions, interpreting handwritten addresses, processing bank checks, credit authentication, social welfare, and forensic analysis of crime evidence that includes a handwritten note, handwritten digit recognition is still an active area of research (Winkler, 1980).

Definition of Terms

Deep learning: a subset of machine learning based on neural networks with three or more layers. Modern deep learning is concerned with an unbounded number of layers of bounded size, which permits practical application and optimized implementation while retaining theoretical universality under mild conditions.

Artificial Neural Networks: a kind of machine learning algorithm loosely modeled on how the human brain works. They serve as the building block for most deep learning models.

Layers: A neural network is composed of several layers. The first layer of neurons is called the input layer; its values might be x values between zero and one that represent the intensity of each pixel in the image. A simple neural network might include several hidden layers and, finally, an output layer that predicts which class the sample image belongs to.

Activation functions: These are simple mathematical functions that take the weighted sum computed from the previous layer as input and apply a non-linear transformation to it. Sigmoid outputs a number between 0 and 1, tanh outputs a number between -1 and 1, and ReLU outputs zero for negative inputs and the input itself otherwise. These functions are applied to all the layers except the input layer. The hidden layers might use sigmoid, tanh, or ReLU, and softmax might be used for the output layer of a multiclass classification problem.
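To make the output ranges concrete, here is a minimal NumPy sketch of the activation functions named above. The input values are made up for illustration. Note that only sigmoid and softmax are confined to the range 0 to 1; tanh can be negative and ReLU has no upper bound.

```python
import numpy as np

def sigmoid(z):
    """Squash each value into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    """Keep positive values and replace negative values with 0."""
    return np.maximum(0.0, z)

def softmax(z):
    """Turn a vector of scores into probabilities that sum to 1."""
    shifted = z - np.max(z)  # subtracting the max avoids overflow in exp
    exp = np.exp(shifted)
    return exp / exp.sum()

z = np.array([-2.0, 0.0, 3.0])  # hypothetical weighted sums from a previous layer

print("sigmoid:", sigmoid(z))
print("tanh:   ", np.tanh(z))
print("relu:   ", relu(z))
print("softmax:", softmax(z))
```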

Weights and Biases: These are numbers that are applied to the activations of the previous layer to compute the weighted sum. They are the parameters that the network tries to learn, and they are updated until we get an optimal result. Initially they might be selected randomly, but they are updated as we train the network.
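The weighted sum can be sketched in a few lines of NumPy. The layer sizes below (784 inputs, 16 neurons) and the random initialization scale are assumptions chosen for illustration, not values taken from the project.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 784 inputs (a flattened 28x28 image) feeding 16 neurons.
n_inputs, n_neurons = 784, 16

# Weights start out random and biases at zero; training would update both.
weights = rng.normal(0.0, 0.05, size=(n_inputs, n_neurons))
biases = np.zeros(n_neurons)

# Stand-in for the previous layer's activations (pixel intensities in [0, 1)).
activations_prev = rng.random(n_inputs)

# Weighted sum for every neuron in the layer: z = a . W + b
z = activations_prev @ weights + biases
print(z.shape)  # (16,)
```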

Cost Function: The cost or loss function is a method to evaluate the performance of the network. One of the most used cost functions is the mean squared error (MSE). The goal of training a neural network is to minimize this cost function by updating the weights and biases.
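As a small illustration, the mean squared error can be computed as follows. The target and the two predictions are invented values for a hypothetical 3-class problem; the point is only that a prediction closer to the target gives a lower cost.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

# One-hot target for the third class, plus two hypothetical network outputs.
target = [0.0, 0.0, 1.0]
good_guess = [0.1, 0.1, 0.8]
bad_guess = [0.7, 0.2, 0.1]

print(mean_squared_error(target, good_guess))  # small cost
print(mean_squared_error(target, bad_guess))   # larger cost
```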

Backpropagation and Gradient Descent: The process of moving from the input layer to the output layer is called feedforward. The reverse process, called backpropagation, computes how the cost changes with respect to each weight and bias so that they can be adjusted to reduce the cost function until it reaches a local minimum. The algorithm that uses these gradients to adjust the weights is called gradient descent; it moves the parameters in the direction in which the cost decreases fastest, toward a nearby local minimum.
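The update rule is easiest to see on a toy problem. The sketch below applies gradient descent to a made-up one-parameter cost, (w - 3)^2, rather than to a real network; the learning rate and step count are arbitrary choices for the example.

```python
def cost(w):
    """Toy cost function with its minimum at w = 3."""
    return (w - 3.0) ** 2

def gradient(w):
    """Derivative of the toy cost with respect to w."""
    return 2.0 * (w - 3.0)

def gradient_descent(w_start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient to reduce the cost."""
    w = w_start
    for _ in range(steps):
        w -= learning_rate * gradient(w)
    return w

w_final = gradient_descent(w_start=-5.0)
print(w_final, cost(w_final))  # w ends up very close to 3
```

In a real network the same rule is applied to every weight and bias at once, with backpropagation supplying the gradients.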

Convolutional Neural Networks (CNNs): A convolutional neural network is a deep learning method that is broadly used for image classification, image recognition, object detection, and similar tasks.

Convolution Layer: The convolution layer is the first layer of the CNN model architecture and starts extracting features from the input. It operates on two inputs: the image matrix and the filter, also called the kernel.

Pooling Layer: The pooling layer is another important component of convolutional neural networks. It reduces the number of parameters when the images are large. Pooling is also known as subsampling or downsampling; it reduces the dimensionality of each feature map by discarding minor details while retaining the most important information.

FC or Fully Connected Layer: A CNN classifier typically ends with one or more fully connected layers, where non-linear combinations of the high-level features are learned.
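A minimal sketch of what a convolution layer computes for a single filter, written with plain NumPy loops for readability. Like most deep learning libraries, it slides the kernel without flipping it (strictly speaking a cross-correlation). The 5x5 image and the vertical-edge kernel are invented for the example.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image with stride 1 and no padding."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)  # elementwise product, then sum
    return out

# Hypothetical 5x5 image: bright on the left two columns, dark elsewhere.
image = np.zeros((5, 5))
image[:, :2] = 1.0

# A simple vertical-edge filter.
kernel = np.array([[1.0, 0.0, -1.0],
                   [1.0, 0.0, -1.0],
                   [1.0, 0.0, -1.0]])

feature_map = conv2d_valid(image, kernel)
print(feature_map)  # strong response where the bright region ends
```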
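The downsampling step can be sketched as 2x2 max pooling in NumPy. The 4x4 feature map is a made-up example; each non-overlapping 2x2 window is replaced by its largest value, so the map shrinks to 2x2.

```python
import numpy as np

def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling; rows/columns that do not fill a window are dropped."""
    h, w = feature_map.shape
    h_out, w_out = h // size, w // size
    trimmed = feature_map[:h_out * size, :w_out * size]
    # Split the map into (h_out, size, w_out, size) windows and take each window's max.
    windows = trimmed.reshape(h_out, size, w_out, size)
    return windows.max(axis=(1, 3))

fm = np.array([[1, 3, 2, 0],
               [4, 2, 1, 1],
               [0, 1, 5, 6],
               [2, 2, 7, 8]])

pooled = max_pool2d(fm)
print(pooled)
# [[4 2]
#  [2 8]]
```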

Padding: Padding refers to a number of pixels added around the border of the original input image before the filter matrix strides over it. Padding extends the area on which the CNN works. Without padding, each convolution shrinks the feature map, and pixels at the border are covered by fewer filter positions than pixels in the center. Adding padding to the frame of the image lets the kernel cover the border as well, which preserves the spatial size and tends to give a more accurate analysis of the image.

Very Deep Neural Networks: Some of the most common deep convolutional neural networks include AlexNet, VGGNet, ResNet, and GoogLeNet. These networks are much deeper than conventional neural networks, often with ten or more layers. They are more powerful and can reach accuracy very close to that of humans.
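The effect of padding on the output size follows the standard formula (n + 2p - k) / s + 1, rounded down. The sketch below applies it to a 28x28 input with a 3x3 kernel, which is an assumed kernel size for illustration, and uses `np.pad` to show what one pixel of zero padding does to the image itself.

```python
import numpy as np

def conv_output_size(n, kernel, padding=0, stride=1):
    """Spatial size of a convolution output along one dimension."""
    return (n + 2 * padding - kernel) // stride + 1

image = np.ones((28, 28))

# Add a one-pixel border of zeros on every side.
padded = np.pad(image, pad_width=1, mode="constant", constant_values=0)

print(padded.shape)                                # (30, 30)
print(conv_output_size(28, kernel=3, padding=0))   # 26: the map shrinks
print(conv_output_size(28, kernel=3, padding=1))   # 28: the size is preserved
```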

Literature Review

Handwritten digit recognition is becoming increasingly significant in the modern world because of its practical use in everyday life (Nagu, Shankar, & Apurna, 2011). Recently, various recognition systems have been introduced in applications where high classification efficiency is required. They help us solve increasingly complex problems more easily (LeCun, et al., 1990). Automated processing of bank checks and postal addresses is a common use of handwritten digit recognition (Ashworth, et al., 2017). In this particular paper, the authors trained both an artificial neural network and a convolutional neural network to recognize handwritten digits from 0 to 9. A node in a neural network can be understood as a neuron in the brain. Every node is connected to other nodes through weights (essentially the edges between the nodes), which are adjusted by the algorithm. A value is computed for every node based on the features and values of the previous nodes. This procedure is called forward propagation (Arpit, Zhou, Kota, & Govindaraju, 2016).

The final output of the network is compared with the target output, and the weights are then changed according to the loss function, which indicates whether the network predicted correctly (Patel, Jagtap, & Kale, 2014). This procedure is called backpropagation (Witten, Frank, Hall, & Pal, 2016). To add capacity and accuracy, neural networks have multiple layers: a fully connected network consists of an input layer, hidden layers, and an output layer. Suppose we have features x1, x2, x3, …, xn. The edges from one node to another carry weights, which play the most important role in both forward and backward propagation. In forward propagation, two operations happen in each hidden neuron: the sum of the products of the features and the weights is computed, and then an activation function is applied to it. A very deep neural network therefore has many weight and bias parameters. In backward propagation, we update the weights from the previous epoch so that the loss value decreases. In a fully connected neural network, the nodes in each layer are connected to the nodes in the layers preceding and succeeding them (Arif, Siddique, Khan, & Rahman, 2019).

Figure 1 Artificial Neural Network

Between 1980 and 2000, researchers were not able to train deep artificial neural networks. The reason is the use of the sigmoid function in every neuron (ReLU was not yet in use during that period). This is termed the vanishing gradient problem. The sigmoid activation, applied to the sum of the products of the weights and the features, always outputs a value between 0 and 1, and its derivative ranges between 0 and 0.25. During backpropagation these derivatives are multiplied layer by layer, so the gradient gets smaller as the network becomes deeper. To deal with the vanishing gradient problem, ReLU or another activation function whose derivative does not collapse is used.

When the weights are initialized to large numbers, the derivative of the loss with respect to the old weights becomes very large, which causes a large difference between the old and new weight values during backpropagation. The new weights then jump between large values over the epochs and vary widely without ever converging. Weight initialization is therefore crucial in a neural network; otherwise it can lead to the exploding gradient problem (Chen, Wang, Fan, Sun, & Naoi, 2020).
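A rough numeric illustration of why the gradient vanishes: the sigmoid derivative peaks at 0.25, so a product of one such factor per layer shrinks geometrically with depth. This sketch deliberately ignores the weights and uses the best case (every neuron at its steepest point), so it is an upper bound for illustration rather than a model of a real network.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z):
    """Derivative of the sigmoid: s(z) * (1 - s(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

# The derivative is largest at z = 0, where it equals 0.25.
max_slope = sigmoid_derivative(0.0)

# Best-case product of derivative factors across a stack of sigmoid layers.
for depth in (1, 5, 10, 20):
    print(depth, max_slope ** depth)
```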

Figure 2 Forward and Backward Propagation

Whenever we have a deep neural network, or a network with a huge number of layers, we also have a huge number of weight and bias parameters, which can lead to overfitting the training data. In a network with many layers, underfitting is unlikely because the layers together can fit the training data almost perfectly; high variance is the problem that grows with the number of layers. We can apply regularization (L1 or L2) or a dropout layer to reduce overfitting. The idea is similar to a random forest, in which multiple decision trees are created. Every decision tree grown to its full depth tends to overfit, so each tree is given a subset of the features, which acts as regularization and improves the accuracy of the whole model.
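Dropout can be sketched in NumPy as a random mask over a layer's activations. This version is the "inverted" variant, which also rescales the surviving units so their expected sum is unchanged; that rescaling is how Keras implements its Dropout layer during training, but it is an addition to the description above. The layer size and the 0.5 rate are assumptions for the example.

```python
import numpy as np

def dropout(activations, rate, rng):
    """Inverted dropout: zero a fraction `rate` of units and rescale the rest."""
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob  # True = neuron stays active
    return activations * mask / keep_prob

rng = np.random.default_rng(42)
hidden = np.ones(1000)  # stand-in for the activations of a hidden layer

dropped = dropout(hidden, rate=0.5, rng=rng)
print((dropped == 0).mean())  # roughly half of the units are deactivated
print(dropped.mean())         # the average activation stays roughly 1
```

Dropout is applied only during training; at prediction time all neurons are kept.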

Figure 3 Graph of derivative for vanishing and exploding gradient problem

Figure 4 CNN Model Architecture

In a neural network with dropout, we select a subset of features from the input layer and a subset of hidden neurons. The neurons that are not selected are deactivated (Sudarsan & Joseph, 2020). The number of nodes in the subset is determined by the dropout ratio. Convolutional neural networks (CNNs) play a major role in image classification, object detection, and many other computer vision tasks. In a CNN, the input data is a matrix whose cells hold values from 0 to 255, with either one or three channels depending on whether the image is grayscale or RGB (Nanehkaran, Zhang, & Salimi, 2020).

Figure 5 Operations on an image using CNN model

Filters are applied to the image, and the output of each operation is also a matrix. The image goes through a pipeline of operations: convolution layers with filters, pooling, fully connected layers, and finally a softmax function. The accompanying figure shows the complete architecture of a convolutional neural network that processes an input picture and classifies it.


In this project, we use the MNIST dataset for the handwritten digit classification problem. MNIST is a well-established dataset that is popular with students and researchers. It has 60,000 training images in 10 classes (0–9), plus a separate test set of 10,000 images. Each image is 28 pixels high and 28 pixels wide, which gives a 784-dimensional vector when flattened. The dataset is freely available on the internet. Each image is grayscale, with pixel values from 0 to 255 indicating the brightness of that pixel. MNIST stands for Modified National Institute of Standards and Technology; the dataset was built from NIST's handwriting collections. To estimate the performance of a model, we split the training set into a training and a validation dataset. Performance on the two can then be plotted as learning curves, which give insight into how well a model is learning the problem.

Software and Hardware Requirements

TensorFlow and Keras

TensorFlow is a dataflow-based machine learning library created by the Google Brain team and made open source in 2015. Keras is a deep learning API written in Python, running on top of the machine learning platform TensorFlow. It was developed with a focus on enabling fast experimentation. Keras is the high-level API of TensorFlow 2: an approachable, highly productive interface for solving machine learning problems, with a focus on modern deep learning. For this project, we used version 2.6.0 of TensorFlow and Keras (Keras, 2021).

Python 3.9

Python is a programming language that lets you work quickly and integrate systems more effectively. It is a widely used high-level programming language. It was designed with an emphasis on code readability, and its syntax enables programmers to express ideas in fewer lines of code. This project uses Python 3.9.

Anaconda and PyCharm

PyCharm is a very popular IDE developed by JetBrains for the Python language. Anaconda is a free and open-source distribution of Python and R for scientific computing, such as data science. The scripts for this project can be run using PyCharm with an Anaconda environment.

Hardware Platform

Although the models used for this project can be run on a GPU, we performed the experiment using a conventional Intel Core i5 CPU. The program can be downloaded and run on high-performance CPUs with good processing capabilities, which saves time.

Packages used

All the packages required for this project are listed in the file mnist_env_packages.yml. The user can simply install Anaconda and run the following command in the conda prompt to create a new environment with all the required packages.

conda env create -f mnist_env_packages.yml

Some of the essential packages used in this project include:

· NumPy: used for numerical analysis

· Matplotlib: used for plotting graphs and the confusion matrix

· scikit-learn: used for generating the confusion matrix

· OpenCV: used for creating contours and saving images

· Tkinter: used for creating the GUI

System Architecture

An optical character recognition (OCR) system has several stages, each of which plays an important role. The stages are pipelined one after the other. Here we look at the design of the proposed system.

A. Data Acquisition: The initial step was to get the dataset used for the experiment, which can be done very easily through the Keras API. For the purposes of this experiment, the dataset was split into 3 sets.

1. 48,000 training images

2. 12,000 validation images

3. 10,000 test images
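The 48,000 / 12,000 split corresponds to holding out 20% of the 60,000 training images for validation. The sketch below shows one way to do that with NumPy. The helper name `split_train_val`, the random shuffle, and the seed are assumptions for illustration; the article does not say how the project performs the split. Zero-filled stand-in arrays are used so the example runs without downloading anything; in practice the arrays would come from `keras.datasets.mnist.load_data()`, which also returns the 10,000 test images separately.

```python
import numpy as np

def split_train_val(x, y, val_fraction=0.2, seed=0):
    """Shuffle the samples and hold out a fraction of them for validation."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(x))
    n_val = int(round(len(x) * val_fraction))
    val_idx, train_idx = order[:n_val], order[n_val:]
    return (x[train_idx], y[train_idx]), (x[val_idx], y[val_idx])

# Stand-in arrays with the shape of the MNIST training set (zeros, not real digits).
x_full = np.zeros((60000, 28, 28), dtype=np.uint8)
y_full = np.zeros(60000, dtype=np.uint8)

(x_train, y_train), (x_val, y_val) = split_train_val(x_full, y_full)
print(len(x_train), len(x_val))  # 48000 12000
```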

B. Pre-processing: Pre-processing is a vital operation on the images. The major operations carried out in pre-processing are image cleaning, to reduce noise in the image and remove unwanted artifacts. First, the images were converted into NumPy arrays with dimensions of 28 by 28 by 1, where the last dimension indicates the number of channels. Since the pixel intensity values range from 0 to 255, normalization was carried out to make the values fall between 0 and 1.

C. Building model: At this stage we built three different types of models using the Keras API. The three classifiers used for this experiment are a simple neural network, a CNN model, and a VGG16 model.

D. Training model: After pre-processing the data and building the models, the next stage is to feed the input to the classifiers and train the models.

E. Evaluation and Prediction: Once the models have been trained, we evaluate them to compare their loss and accuracy. The predictions can be used to compute the precision, recall, and F1 score values and are also used to plot the confusion matrix.

F. Save model: Once the models have been trained and evaluated, we saved the structure and weights of each model so that it can be used by the GUI later to classify new images.
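The reshaping and normalization described in this step can be sketched with NumPy as follows. The function name `preprocess` is hypothetical, and the random batch stands in for real MNIST images.

```python
import numpy as np

def preprocess(images):
    """Scale pixel values to [0, 1] and add a trailing channel dimension."""
    images = images.astype("float32") / 255.0
    return images[..., np.newaxis]  # (N, 28, 28) -> (N, 28, 28, 1)

# Stand-in batch of four grayscale images with pixel values 0..255.
raw = np.random.default_rng(0).integers(0, 256, size=(4, 28, 28), dtype=np.uint8)

batch = preprocess(raw)
print(batch.shape, batch.dtype)
print(batch.min(), batch.max())  # both within [0, 1]
```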

Project Structure

The project is created under the name Hand-Written-Digits-Recognizer. The code for getting the data and for building and evaluating the models is written in a modular manner that is easy to read and understand. The subdirectories and Python files used in this project are summarized below:

1. The pre-trained models can be found in /models/ directory

2. All the plots and charts for the training and evaluation can be found in /plots/ folder

3. is the file used to run the code for training and evaluating the three models.

4. contains the code to run the gui for predicting handwritten digits

5. The three subfolders /ANN/, /CNN/, and /VGG16/ contain the Python files with the code for building the three models, respectively.

Figure 6 Project Structure

Algorithms Used

ANN: In our first experiment, we build a simple neural network and evaluate its performance. This helps us understand the basic building blocks of deep neural networks and provides a baseline for comparison. The ANN model has an input layer, two dense layers, and an output layer, as shown in the figure below.
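A network with this shape could be sketched in Keras roughly as follows. This is not the project's actual code: the function name `build_ann`, the hidden layer sizes (128 and 64), the ReLU activations, the Adam optimizer, and the loss function are all assumptions made for illustration. Only the overall layout (input, two dense layers, output), the input dimensions, and the learning rate come from this article.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_ann(input_shape=(28, 28, 1), num_classes=10):
    """Simple fully connected classifier: input, two dense layers, softmax output."""
    model = keras.Sequential([
        keras.Input(shape=input_shape),
        layers.Flatten(),                      # 28 x 28 x 1 -> 784 values
        layers.Dense(128, activation="relu"),  # hidden sizes are assumptions
        layers.Dense(64, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=1e-5),  # optimizer is an assumption
        loss="sparse_categorical_crossentropy",               # expects integer labels 0..9
        metrics=["accuracy"],
    )
    return model

model = build_ann()
model.summary()
```

With integer labels, training would then be a call such as `model.fit(x_train, y_train, validation_data=(x_val, y_val), batch_size=32)`.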

Figure 7 ANN Model Diagram

CNN: A convolutional neural network is a deep learning method that is broadly used for image classification and image recognition. It is composed of convolutional, max pooling, and fully connected layers, which enable it to learn much more sophisticated features than a simple neural network. In the second experiment, we use a CNN model for classifying the digits. The model we used for this experiment is summarized in the figure below.

VGG16: For the final experiment, we selected one very deep CNN for classifying the MNIST dataset and compared its performance with the previous methods. VGG16 is a simple and widely used convolutional neural network (CNN) architecture developed for ImageNet, a large visual database project used in visual object recognition research. Since the original VGG16 model was intended for color images, we modified the model to fit our task of classifying grayscale images.

Results and Discussion

Training, Validation and Test Results

The models discussed in the previous section were all trained with the same hyperparameters, so that there are no discrepancies when comparing their results. The hyperparameters used for this experiment include:

Learning rate=0.00001


Batch Size=32


Input dimensions=(28,28,1)
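As a quick sanity check on these settings, the listed values can be collected in a configuration dictionary and used to work out how many weight updates one epoch involves. The dictionary layout is only an illustration, and the 48,000 figure is the training-set size given in the System Architecture section.

```python
import math

# Hyperparameters listed above, gathered in one place.
hyperparams = {
    "learning_rate": 0.00001,
    "batch_size": 32,
    "input_shape": (28, 28, 1),
}

n_train = 48_000  # training images after the validation split

# Each epoch processes the training set once, one batch per weight update.
steps_per_epoch = math.ceil(n_train / hyperparams["batch_size"])
print(steps_per_epoch)  # 1500
```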

After training the models, the accuracy obtained for training, validation, and testing of the three models can be summarized as follows.

ANN

Training accuracy: 96.5%

Validation accuracy: 95.8%

Test accuracy: 95.4%

CNN

Training accuracy: 99.25%

Validation accuracy: 98.78%

Test accuracy: 98.78%

VGG16

Training accuracy: 98.93%

Validation accuracy: 98.89%

Test accuracy: 99.07%

From the results above, it is easy to see that the accuracy of classifying digits improves as we go from the simple ANN model to the much deeper VGG16 model. The highest test accuracy was achieved by the VGG16 model, which also shows no sign of overfitting: its validation and test accuracies are as high as its training accuracy. The plots of the loss and accuracy of the three models show the progress over the epochs.

Figure 9 ANN Training and Validation Accuracy

Figure 10 CNN Training and Validation Accuracy

Figure 11 VGG Training and Validation Accuracy

Prediction and Evaluation Results

In this experiment we used several methods to demonstrate the prediction results for the three models. The evaluation metrics used include precision, recall, F1 score, and the confusion matrix. While the first three metrics can be reported as a single score, the last one shows the prediction results in a more detailed manner.
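To show how these metrics relate to each other, here is a small NumPy sketch that builds a confusion matrix and derives per-class precision, recall, and F1 from it. The labels are a made-up 3-class example, not results from the project, and the function names are illustrative; scikit-learn's `confusion_matrix` and `classification_report` offer the same calculations ready-made.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def per_class_scores(cm):
    """Precision, recall and F1 for each class, derived from the confusion matrix."""
    tp = np.diag(cm).astype(float)
    predicted = cm.sum(axis=0)  # how often each class was predicted
    actual = cm.sum(axis=1)     # how often each class truly occurred
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, actual, out=np.zeros_like(tp), where=actual > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    return precision, recall, f1

# Hypothetical labels for a 3-class problem.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

cm = confusion_matrix(y_true, y_pred, num_classes=3)
precision, recall, f1 = per_class_scores(cm)
print(cm)
print(precision, recall, f1)
```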

Figure 12 ANN Prediction Score and Confusion Matrix

Figure 13 CNN Prediction Score and Confusion Matrix