
  • How to build your own Handwritten Digits Recognition System

In recent years we have witnessed a digital revolution, with a rapid increase in the amount of visual and multimedia data generated daily. This has created new challenges for computer scientists: building systems capable of recognizing patterns and features in images. That endeavor has led to machine and deep learning models which excel at exactly this type of task. The “Hello World” of image classification is the seemingly simple, yet non-trivial, problem of classifying handwritten digits. Although the task may look easy, building a reliable recognition system that handles the many ways people write characters is genuinely challenging. This blog presents the use of deep learning models for classifying digits from zero to nine. The MNIST dataset is commonly used for benchmarking machine and deep learning algorithms, and this project uses it to evaluate three different deep learning models and compare their results. The source code for this project can be found here: TeddyMeg/Handwritten-Digits-Recognizer (github.com). The repository builds an image classifier using neural networks to categorize handwritten digits, and it includes a GUI, made with Tkinter and OpenCV, where a user can draw a digit and choose a model to predict what the number is. The models presented here are a simple neural network, a convolutional neural network, and a deep convolutional network, namely VGG16.

Background

Recognition is the process of identifying something from past experience. Digit recognition, then, is the process of identifying manually written digits from different kinds of sources such as documents, papers, bank cheques, and pictures. Handwriting character recognition has become a standard research area due to advances in technologies such as handwriting capture devices and powerful mobile computers (Elleuch, Maalej, & Kherallah, 2016). Handwritten digit recognition is an essential function in a variety of practical applications, for example in administration and finance (Niu & Suen). Research on the problem dates back to the 1980s, but variations in the size, shape, thickness, and position of digits made it very hard to build an application that could account for all these factors. Recent advances in machine and deep learning have made it possible to reduce the human effort involved in perceiving, learning, and recognizing such patterns; deep learning models can achieve very high accuracy, sometimes better than humans.
Problem Statement

The rapid growth of new documents and multimedia news has created new challenges in pattern recognition and machine learning (Cecotti, 2016). Handwriting analysis is a cumbersome and organized process that relies on broad knowledge of the way people form digits or letters, and it exploits the unique characteristics of numerals and letters, for example the shapes, sizes, and individual writing styles that people use (Winkler, 1980). Typically, handwriting experts use sophisticated classification models to analyze printed or handwritten character images. As part of this process, they extract features from the samples, including slant, orientation, and the center alignment of the letters. Offline digit recognition has many practical applications; for instance, a handwritten sample can be analyzed and recognized to identify the zip code in an address written or printed on an envelope (Hanmandlu & Murthy, 2007). Since handwriting depends heavily on the writer, building a high-reliability recognition system that recognizes any handwritten character input to an application is challenging: handwritten digits are not perfect and come in many different flavors. Handwritten digit recognition addresses this problem by taking the image of a digit and recognizing the digit present in the image.

Project Goal

The objectives of this project are:
· Prepare the dataset and split the data appropriately for training, validation, and testing.
· Implement three types of models on the predefined MNIST dataset and evaluate their performance on handwritten digit recognition.
· Identify whether image preprocessing methods have a significant impact on the error rate of the selected classification models.
· Recommend, based on the evaluated findings, which algorithms can push the accuracy of handwritten digit recognition up to 99%.

Relevance and Significance

OCR refers to the recognition of characters from optically scanned and digital text pages by computer (Winkler, 1980). Despite its problems, it contributes widely to improving the interface between man and machine in many applications (Sarkhel, Das, Das, Kundu, & Nasipuri, 2017). Due to a variety of potential applications, such as reading postal codes, medical prescriptions, and handwritten addresses, processing bank checks, credit authentication, social welfare, and forensic analysis of crime evidence that includes handwritten notes, handwritten digit recognition remains an active area of research (Winkler, 1980).

Definition of Terms

Deep learning: a subset of machine learning, essentially a neural network with three or more layers. Deep learning is a modern variation concerned with an unbounded number of layers of bounded size, which permits practical application and optimized implementation while retaining theoretical universality under mild conditions.

Artificial Neural Networks: a kind of machine learning algorithm that mimics the way the human brain works and serves as the building block for most deep learning models.

Layers: a neural network is composed of several layers. The first layer of neurons is called the input layer; for images, its values might range between zero and one, depicting the intensity of each pixel. A simple network might include several hidden layers and finally an output layer predicting which class a sample image belongs to.

Activation functions: simple mathematical functions which take the linear combination computed from the previous layer and squash it, typically into a number between 0 and 1. They are applied to all layers except the input layer. Hidden layers might use activation functions like sigmoid, tanh, or ReLU, and softmax might be used for the output layer of a multiclass classification problem.

Weights and Biases: numbers applied to the activations of the previous layer to compute the weighted sum. These are the parameters the network tries to learn; they might initially be selected randomly and are updated during training until we get an optimal result.

Cost Function: the cost or loss function is a method to evaluate the performance of the network. One of the most used cost functions is the mean squared error (MSE). The goal of training a neural network is to minimize this cost function by updating the weights and biases.

Backpropagation and Gradient Descent: the process of moving from the input layer to the output layer is called feedforward. The reverse process, backpropagation, is used to adjust the weights and biases of the network and reduce the cost function until it reaches a local minimum. The algorithm used to adjust the weights is called gradient descent, and it lets the model determine the direction of the nearest local minimum.
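To make these definitions concrete, here is a minimal sketch, in plain NumPy, of one feedforward pass and one gradient-descent update for a single-layer network with a sigmoid activation and an MSE cost. The array shapes and the learning rate are illustrative, not taken from the project.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.random((784, 1))         # one flattened 28x28 image, values in [0, 1]
y = np.zeros((10, 1)); y[3] = 1  # one-hot target for the digit 3

W = rng.normal(0, 0.01, (10, 784))  # weights, initialized randomly
b = np.zeros((10, 1))               # biases

# Feedforward: weighted sum of the previous layer's activations, then sigmoid.
z = W @ x + b
a = sigmoid(z)
cost = np.mean((a - y) ** 2)  # MSE cost

# Backpropagation for this single layer: chain rule through MSE and sigmoid.
dz = (2.0 / y.size) * (a - y) * a * (1 - a)
dW, db = dz @ x.T, dz

# Gradient descent: step the parameters against the gradient.
lr = 0.5
W -= lr * dW
b -= lr * db
```

A real network repeats this update over many samples and layers; libraries like Keras automate exactly this loop.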
Convolutional Neural Networks (CNNs): a deep learning method that is broadly used for image classification, image recognition, object detection, etc.

Convolution Layer: the first layer of a CNN architecture, where feature extraction from the input begins. The convolution layer operates on two inputs: the image matrix and the filter (or kernel).

Pooling Layer: another significant component of convolutional neural networks. Pooling reduces the number of parameters when images are large. Also known as subsampling or downsampling, it reduces the dimensionality of each feature map, leaving out trivial traits while retaining the important information.

FC or Fully Connected Layer: every CNN architecture ends with a fully connected layer, where the network learns non-linear combinations of the high-level features.

Padding: a certain number of pixels added around the original input image while the filter strides over it. Padding extends the area on which the CNN works: as the kernel moves along extracting features, padding around the frame of the image allows the kernel to cover border pixels as well, resulting in more accurate analysis of the image.

Very Deep Neural Networks: some of the most common deep convolutional neural networks include AlexNet, VGGNet, ResNet, and GoogLeNet. These networks contain ten or more layers, are deeper than conventional neural networks, and can reach accuracy very close to human performance.
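The layer types above map directly onto Keras building blocks. The following is a minimal sketch of a toy CNN, illustrative only and not the architecture used later in this project, showing a convolution layer with padding, a pooling layer, and a fully connected head.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Toy CNN illustrating the layer types defined above (not the project's model).
model = keras.Sequential([
    layers.Input(shape=(28, 28, 1)),                  # 28x28 grayscale input
    layers.Conv2D(16, kernel_size=3, padding="same",  # convolution layer with padding
                  activation="relu"),
    layers.MaxPooling2D(pool_size=2),                 # pooling layer: 28x28 -> 14x14
    layers.Flatten(),
    layers.Dense(64, activation="relu"),              # fully connected layer
    layers.Dense(10, activation="softmax"),           # one output per digit class
])
model.summary()
```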
Literature Review

Hand digit recognition has become progressively significant in the modern world because of its practical implementation in our everyday life (Nagu, Shankar, & Apurna, 2011). Various recognition frameworks have recently been presented for applications where high effectiveness is required; they help us take care of increasingly complex problems more easily (LeCun, et al., 1990). Automatic processing of bank cheques and postal addresses is a typical application of handwritten digit recognition (Ashworth, et al., 2017). In that paper, the authors trained both an artificial neural network and a convolutional neural network to recognize handwritten digits from 0 to 9.

A node in a neural network can be understood as a neuron in the brain. Every node is connected to other nodes through weights (essentially the edges between the nodes), which are adjusted by the learning algorithm. A value is computed for every node based on the activations and weights of the previous layer's nodes; this procedure is called forward propagation (Arpit, Zhou, Kota, & Govindaraju, 2016). The final output of the network is compared with the target output, and the weights are then changed according to the loss function, which indicates how well the network has predicted (Patel, Jagtap, & Kale, 2014). This procedure is called backpropagation (Witten, Frank, Hall, & Pal, 2016). To add capacity and accuracy, networks have multiple layers: input, hidden, and output layers. Suppose we have features x1, x2, x3, …, xn. The edges between nodes carry weights that play the most important role in both forward and backward propagation. In forward propagation, two operations happen in each hidden node as the features and weights are passed in: the weighted sum of the inputs is computed, and an activation function is applied to it. Whenever a neural network is very deep, there are many weight and bias parameters. In backward propagation we change the weights from the previous epoch so as to reduce the loss. In a fully connected neural network, nodes in each layer are connected to the nodes in the layers preceding and succeeding them (Arif, Siddique, Khan, & Rahman, 2019).

Figure 1: Artificial Neural Network

Between 1980 and 2000, researchers were unable to train deep artificial neural networks. The reason was the use of the sigmoid function in every neuron (ReLU had not yet been adopted). This is termed the vanishing gradient problem: the sigmoid activation, applied to the weighted sum of features, always outputs a value between 0 and 1, and its derivative lies between 0 and 0.25, so the gradient gets smaller and smaller as the network gets deeper. To deal with the vanishing gradient problem, ReLU or another activation function whose derivative does not collapse is used. Conversely, when the assigned weights are large numbers, the derivative of the loss with respect to the old weights can be very large, producing big jumps between old and new weight values during backpropagation; the weights then vary wildly across epochs and never converge. Weight initialization in a neural network is therefore crucial, as poor initialization can lead to an exploding gradient problem (Chen, Wang, Fan, Sun, & Naoi, 2020).

Figure 2: Forward and Backward Propagation
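A quick numerical illustration of the vanishing-gradient argument above: the sigmoid derivative is at most 0.25, and backpropagation multiplies one such factor per layer, so even in the best case the gradient shrinks geometrically with depth. This standalone snippet just multiplies the maximum per-layer factor.

```python
# The sigmoid derivative s(z) * (1 - s(z)) peaks at 0.25 (at z = 0).
# The chain rule multiplies one such factor per layer, so the gradient
# reaching the early layers decays exponentially with depth.
for depth in (2, 5, 10, 20):
    print(depth, 0.25 ** depth)
# 2  0.0625
# 5  0.0009765625
# 10 ~9.5e-07
# 20 ~9.1e-13
```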
Whenever we have a deep neural network with a huge number of layers, we also have a huge number of weight and bias parameters, which can lead to overfitting the training data. In a multi-layer neural network, underfitting is rarely the issue, because the many layers can fit the training data almost perfectly; the real risk is high variance as the number of layers grows. We can apply regularization (L1 or L2) or a dropout layer to decrease the overfitting problem. The idea is analogous to random forests: every decision tree grown to full depth overfits, and sampling a subset of features for each tree acts as regularization that improves the accuracy of the whole ensemble.

Figure 3: Graph of the derivative for the vanishing and exploding gradient problem
Figure 4: CNN Model Architecture

With dropout in a neural network, we select a subset of the input features and a subset of the hidden neurons; the neurons not selected are deactivated (Sudarsan & Joseph, 2020). The number of nodes kept is controlled by the dropout ratio. In image classification, object detection, and many other vision tasks, the convolutional neural network plays a major role. In a CNN, the input is a matrix whose cells hold values from 0 to 255, with one channel for grayscale images or three channels for RGB images (Nanehkaran, Zhang, & Salimi, 2020).

Figure 5: Operations on an image using a CNN model

Filters are applied to the image, and the output of each operation is again a matrix. The image goes through a pipeline of convolution layers with filters, pooling, fully connected layers, and finally a softmax function. Figure 5 shows the complete architecture of a convolutional neural network processing an input picture and classifying it based on the computed values.

Dataset

In this project, we use the MNIST dataset for the handwritten digit classification problem. MNIST is a standard benchmark dataset for students and researchers: it has 60,000 training images and 10,000 test images spanning 10 classes (0–9). Each image is 28 pixels high and 28 pixels wide, which makes each image a 784-dimensional vector when flattened. Each image is grayscale, with pixel values in the range 0–255 indicating the brightness or darkness of each pixel. The dataset was derived from data collected by the National Institute of Standards and Technology (NIST) and is easily available on the internet. To estimate the performance of a model, we further split the training set into training and validation subsets; performance on both can then be plotted to give insight into how well a model is learning the problem.
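Keras ships MNIST, so the split described above can be reproduced in a few lines. This is a minimal sketch of the acquisition step, using the 48,000/12,000/10,000 split described in the System Architecture section below; scikit-learn's train_test_split is one convenient way to carve out the validation set.

```python
from tensorflow.keras.datasets import mnist
from sklearn.model_selection import train_test_split

# Keras provides MNIST pre-split into 60,000 train and 10,000 test images.
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Carve 20% of the training set off as a validation set: 48,000 / 12,000.
x_train, x_val, y_train, y_val = train_test_split(
    x_train, y_train, test_size=0.2, random_state=42)

print(x_train.shape, x_val.shape, x_test.shape)
# (48000, 28, 28) (12000, 28, 28) (10000, 28, 28)
```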
Software and Hardware Requirements

TensorFlow and Keras

TensorFlow is a powerful open-source machine learning library built around data-flow graphs, created by the Google Brain team and open-sourced in 2015. Keras is a deep learning API written in Python, running on top of the machine learning platform TensorFlow. It was developed with a focus on enabling fast experimentation. Keras is the high-level API of TensorFlow 2: an approachable, highly productive interface for solving machine learning problems, with a focus on modern deep learning. For the purposes of this project, we used version 2.6.0 of TensorFlow and Keras (Keras, 2021).

Python 3.9

Python is a high-level programming language that lets you work quickly and integrate systems effectively, and it is widely used around the world. It emphasizes code readability, and its syntax enables software engineers to express ideas in fewer lines of code. This project uses Python 3.9, the latest version at the time of writing.

Anaconda and PyCharm

PyCharm is a very popular IDE developed by JetBrains for the Python scripting language. Anaconda is a free and open-source distribution of the Python and R programming languages for scientific computing, such as data science. The scripts for this project can be run using PyCharm with an Anaconda environment.

Hardware Platform

Although the models used in this project can run on a GPU, we performed the experiments on a conventional Intel Core i5 CPU. The program can also be downloaded and run on high-performance machines with good processing capabilities, which saves time and computational power.

Packages Used

All the packages required for this project are listed in the file mnist_env_packages.yml. The user can simply install Anaconda and run the following command at the conda prompt to create a new environment with all the required packages:

conda env create -f mnist_env_packages.yml

Some of the essential packages used in this project include:
· NumPy: used for numerical analysis
· Matplotlib: used for plotting graphs and the confusion matrix
· scikit-learn: used for generating the confusion matrix
· OpenCV: used for creating contours and saving images
· Tkinter: used for creating the GUI

System Architecture

Optical character recognition (OCR) is a recognition system with several stages, each of which plays an important role; the stages are pipelined one after the other. Here we look at the design of the proposed system.

A. Data Acquisition: the initial step was to get the dataset used for the experiment, which can be done very easily through the Keras API. For the purposes of this experiment the dataset was split into three parts: 48,000 training samples, 12,000 validation samples, and 10,000 test samples.

B. Pre-processing: preprocessing is a vital operation on the images; the major operations are cleaning the image to reduce noise and removing garbage. First, the images were converted into NumPy arrays of dimension 28 by 28 by 1, where the last dimension is the number of channels. Since pixel intensities range from 0 to 255, normalization was carried out to bring the values into the range 0 to 1.

C. Building the models: at this stage we built three different types of models using the Keras API: a simple neural network, a CNN model, and a VGG16 model.

D. Training the models: after pre-processing the data and building the models, the next stage is to feed the input to the classifiers and train them.

E. Evaluation and Prediction: once the models have been trained, we evaluate them to compare their losses and accuracies. The predictions are used to compute precision, recall, and F1 scores, and to plot the confusion matrix.

F. Saving the models: once the models have been trained and evaluated, we saved the structure and weights of each model so that the GUI can later use them to classify new images. (See the pipeline sketch after this list.)
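Stages B through F can be summarized as a short Keras pipeline. This is a minimal sketch, not the project's exact code: it assumes the x_train/x_val/x_test arrays from the earlier split, uses a small stand-in CNN for stage C, and applies the hyperparameters reported in the Results section (Adam, learning rate 0.00001, 5 epochs, batch size 32).

```python
from tensorflow import keras
from tensorflow.keras import layers

# B. Pre-processing: add the channel dimension and normalize to [0, 1].
x_train = x_train.reshape(-1, 28, 28, 1).astype("float32") / 255.0
x_val   = x_val.reshape(-1, 28, 28, 1).astype("float32") / 255.0
x_test  = x_test.reshape(-1, 28, 28, 1).astype("float32") / 255.0

# C. Building a model (a small stand-in CNN, not the exact project architecture).
model = keras.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(2),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.00001),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])

# D. Training, with the validation set monitored each epoch.
model.fit(x_train, y_train, validation_data=(x_val, y_val),
          epochs=5, batch_size=32)

# E. Evaluation on the held-out test set.
test_loss, test_acc = model.evaluate(x_test, y_test)

# F. Saving structure and weights for the GUI to load later.
model.save("models/cnn.h5")  # illustrative path, not necessarily the repo's
```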
Project Structure

The project is organized under the name Hand-Written-Digits-Recognizer. The code for getting the data and for building and evaluating the models is written in a modular manner that is easy to read and understand. The subdirectories and Python files used in this project are summarized below:
1. The pre-trained models can be found in the /models/ directory.
2. All the plots and charts for training and evaluation can be found in the /plots/ folder.
3. test.py is the file used to run the code for training and evaluating the three models.
4. gui_digit_recognizer.py contains the code to run the GUI for predicting handwritten digits.
5. The three subfolders /ANN/, /CNN/, and /VGG16/ contain the Python files with the code for building the three models respectively.

Figure 6: Project Structure

Algorithms Used

ANN: in our first experiment, we build a simple neural network from scratch and evaluate its performance. This helps us understand the basic building blocks of deep neural networks and gives a useful baseline for comparison. The ANN model has an input layer, two dense layers, and an output layer, as shown in the figure below.

Figure 7: ANN Model Diagram

CNN: the convolutional neural network is a deep learning method broadly used for image classification and recognition. It comprises convolutional, max pooling, and fully connected layers, which enable it to learn much more sophisticated features than a simple neural network. In the second experiment, we use a CNN model for classifying the digits; the model used for this experiment is summarized in the figure below.

VGG16: for the final experiment, we selected one very deep CNN for classifying the MNIST dataset and compared its performance with the previous methods. VGG16 is a simple and widely used convolutional neural network architecture originally developed for ImageNet, a large visual database used in visual object recognition research. Since the original VGG16 model was intended for color images, we modified it to fit our task of classifying grayscale images.

Results and Discussion

4.1 Training, Validation and Test Results

The models discussed in the previous chapter were all trained with the same hyperparameter initialization so that there are no discrepancies when comparing their results. The hyperparameters used for this experiment were:
Learning rate: 0.00001
Epochs: 5
Batch size: 32
Optimizer: Adam
Input dimensions: (28, 28, 1)

After training, the accuracies for training, validation, and testing of the three models can be summarized as follows:
ANN: Training Accuracy 96.5%, Validation Accuracy 95.8%, Test Accuracy 95.4%
CNN: Training Accuracy 99.25%, Validation Accuracy 98.78%, Test Accuracy 98.78%
VGG16: Training Accuracy 98.93%, Validation Accuracy 98.89%, Test Accuracy 99.07%

From these results it is easy to see that accuracy on the digit classification task improves as we go from the simple ANN model to the much deeper VGG16 model. The highest test accuracy was achieved by the VGG16 model, with no sign of overfitting. The plots of the loss and accuracy of the three models show the progress across epochs.

Figure 9: ANN Training and Validation Accuracy
Figure 10: CNN Training and Validation Accuracy
Figure 11: VGG Training and Validation Accuracy

Prediction and Evaluation Results

In this experiment we used several methods to examine the prediction results of the three models. The evaluation metrics include precision, recall, F1 score, and the confusion matrix. While the first three can be reported as single scores, the confusion matrix shows the prediction results in a more detailed manner.
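These metrics come straight out of scikit-learn. A minimal sketch, assuming the trained model and test arrays from the pipeline above: classification_report prints per-class precision, recall, and F1, and confusion_matrix produces the 10x10 matrix that the figures below visualize.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Turn the softmax outputs into hard class predictions.
y_pred = np.argmax(model.predict(x_test), axis=1)

print(classification_report(y_test, y_pred))  # per-digit precision/recall/F1
print(confusion_matrix(y_test, y_pred))       # rows: true digit, cols: predicted
```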
Figure 12: ANN Prediction Score and Confusion Matrix
Figure 13: CNN Prediction Score and Confusion Matrix
Figure 14: VGG16 Prediction Score and Confusion Matrix

Using the GUI to Predict Digits

As a final step we built a GUI, using Tkinter and OpenCV, where the user can draw any digit on a canvas and get a prediction for what the system thinks the digit is. Users choose which of the three models makes the prediction by clicking the button bearing that model's name. Once a model is chosen, the digit drawn on the canvas is saved as an image and the system uses it to make a prediction with the saved model. The last button, 'Clear Canvas', clears the canvas so that the user can enter another digit. The figures below show the GUI and the prediction results for the same digit drawn on the canvas.

Figure 15: Handwritten digits recognizer GUI
Figure 16: Sample Prediction Result for the VGG16 model
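The repository's gui_digit_recognizer.py implements this interface; below is a minimal sketch of the same idea, not the project's actual code. The model path, brush size, and window layout are illustrative assumptions, and only one prediction button and the 'Clear Canvas' button are wired up.

```python
import tkinter as tk
import numpy as np
from PIL import Image, ImageDraw
from tensorflow import keras

model = keras.models.load_model("models/cnn.h5")  # assumed path to a saved model

root = tk.Tk()
canvas = tk.Canvas(root, width=280, height=280, bg="black")
canvas.pack()
# Mirror the strokes into a PIL image so we can hand pixels to the model.
image = Image.new("L", (280, 280), 0)
draw = ImageDraw.Draw(image)

def paint(event):
    x, y = event.x, event.y
    canvas.create_oval(x - 8, y - 8, x + 8, y + 8, fill="white", outline="white")
    draw.ellipse([x - 8, y - 8, x + 8, y + 8], fill=255)

def predict():
    # Downscale to 28x28, normalize, and add batch/channel dimensions.
    small = np.array(image.resize((28, 28))).astype("float32") / 255.0
    probs = model.predict(small.reshape(1, 28, 28, 1), verbose=0)[0]
    root.title(f"Prediction: {probs.argmax()} ({probs.max():.2%})")

def clear():
    canvas.delete("all")
    draw.rectangle([0, 0, 280, 280], fill=0)

canvas.bind("<B1-Motion>", paint)
tk.Button(root, text="Predict", command=predict).pack(side="left")
tk.Button(root, text="Clear Canvas", command=clear).pack(side="left")
root.mainloop()
```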
Conclusion

In this blog, we investigated different deep learning models that can be used for the task of classifying handwritten digits. The MNIST dataset, a repository containing thousands of handwritten digit images, was found suitable for our project. We showed that high accuracy can be obtained on this task using three models. The focus of the project was to identify which classifier works best at identifying the digits. Using Keras and TensorFlow we deployed ANN, CNN, and VGG models and evaluated their performance using different metrics. The ANN model gave a test accuracy of 95.4%, while the deeper CNN and VGG models achieved much higher accuracies of 98.78% and 99.07% respectively. Based on these results, VGG16 was selected as the best of the three models. Finally, we built a GUI where a user can draw any digit and get prediction scores from any of the three models.

Limitations of the Project

The project has the following limitations:
· Input images might not always contain numbers, yet the models may confidently predict a number anyway.
· Handwritten digits are inherently vague: they rarely have perfectly straight lines, and some people's writing styles may be unrecognizable to the models.
· The system might sometimes show inconsistent results for similarly shaped numerals.
· Due to limited time, we were not able to further optimize the models and improve their performance.

  • Deep Learning: Computer Vision

Over the last several decades, there has been an explosion in the amount of visual data created by humans on a daily basis, yet computer scientists have long struggled to create machines that can recognize images and videos the way people can. Humans detect patterns and objects with a high degree of precision without apparent effort; prior to the advent of deep learning, however, classical computers were quite poor at understanding visual input. As a result, computer vision has become one of the biggest research areas in computer science. Decades later, we have made significant progress in developing software that can comprehend and describe the content of visual data, but we have also uncovered how far we still have to go before we can comprehend and reproduce one of the most basic operations of the human brain.

This blog will discuss the purpose of computer vision systems and the various domains in which they are used to solve image recognition tasks. It will then examine the different types of neural network and deep learning architectures used for these tasks, along with techniques for evaluating them, and it will take one image recognition task and analyze how the different algorithms perform on it. The blog will also outline some of the applications and limits of these systems and how deep learning is being used to optimize their efficiency. In sum, this blog provides an insight into the world of computer vision systems and how they have been used across industry to solve very complicated image recognition tasks. It surveys the current literature to give a brief overview and act as a stepping stone for further research in this area.

Since the invention of modern cameras and sensors, a large number of digital devices have been generating images and videos, which increased the demand for systems that can recognize and describe the content of visual data. An image is an array of pixels, with numerical values representing the intensity of each pixel in shades of red, green, and blue. Computers have long been used to process and store images, but only recently have they been able to learn and understand an image's contents. They lagged behind in image recognition tasks for quite some time because they had to determine which pixels belonged to which object and how to extract features. This proved very challenging: programmers initially specified hand-written rules for detecting patterns in images, but the approach was unsuccessful, because static programs cannot adapt to real-world objects that can appear from many different angles, under various lighting conditions, and against a range of different backgrounds. Researchers therefore looked for ways in which machines could learn for themselves to identify and recognize patterns in images without being explicitly programmed. Traditional machine learning techniques like SVMs proved useful initially, but progress stagnated and accuracies stopped improving; only with neural networks did the field start to show real progress. The human vision system, which has evolved over millions of years, is well suited to identifying and recognizing objects, and around the end of the 1970s the Japanese scientist Kunihiko Fukushima proposed a computer vision system, grounded in neuroscience, that mimics the human visual cortex.
Though it failed to perform complex visual tasks, it marked the birth of new developments in the field of computer vision. The rise in the computational power of computers and the development of deep convolutional neural networks have since pushed the boundaries of the field and enabled computers to outperform humans on some image recognition tasks.

What is Computer Vision?

“Computer vision is concerned with the automatic extraction, analysis and understanding of useful information from a single image or a sequence of images. It involves the development of a theoretical and algorithmic basis to achieve automatic visual understanding.” (The British Machine Vision Association and Society for Pattern Recognition, 2017). It differs from computer graphics, which generates images, in that it deals with extracting visual information from the real world and making predictions. Computer vision systems have been deployed to create algorithms that produce accurate results and are capable of very fast predictions.

Applications in Computer Vision

Computer vision systems are used in a variety of applications, including image classification, object detection, image segmentation, face recognition, action and activity recognition, and human pose estimation. Here we will discuss the three main applications.

· Image classification: one of the fundamental applications of computer vision. It is the process of training computer systems to recognize a whole image and assign it to a specific category from a collection of predefined tags. According to (Wang & Su, 2019), “Image classification is a fundamental task that attempts to comprehend an entire image as a whole. The goal is to classify the image by assigning it to a specific label. Typically, image classification refers to images in which only one object appears and is analyzed.”

· Object detection: a technique used to locate object instances in images or videos using bounding boxes. It produces classification results together with regression coordinates for the bounding boxes that locate each object, and it leverages deep learning algorithms to recognize and label objects of interest within an image. “Object detection is the process of detecting instances of semantic objects of a certain class (such as humans, airplanes, or birds) in digital images and video. A common approach for object detection frameworks includes the creation of a large set of candidate windows that are in the sequel classified using CNN features” (Voulodimos, Doulamis, Doulamis, & Protopapadakis, 2018).

· Image segmentation: the process of labeling the pixels of related features or objects in an image. Kaur et al. (2014) define segmentation as “the technique of dividing or partitioning an image into parts, called segments. It is mostly useful for applications like image compression or object recognition, because for these types of applications, it is inefficient to process the whole image. So, image segmentation is used to segment the parts from image for further processing. There exist several image segmentation techniques, which partition the image into several parts based on certain image features like pixel intensity value, color, texture, etc. These all techniques are categorized based on the segmentation method used” (pp. 809–814). Image segmentation can be further divided into semantic and instance segmentation.
Figure 1: The difference between classification, detection, and segmentation (Shanmugamani, 2018)

Datasets for Computer Vision

Machine learning on picture and video files is a time-consuming and data-intensive task, so access to high-quality, noise-free, large-scale datasets is critical for training complicated deep neural network models. High-quality deep learning systems may require millions of carefully selected pictures to train on. Many open-source datasets have been created for image classification, pose estimation, image captioning, autonomous driving, and object segmentation. The computer vision community has benefited from an influx of publicly available, annotated picture datasets, which has resulted in impressive results in object detection/segmentation tasks and innovative modeling architectures (Yuzhen & Sierra, 2020). Some of the most widely used kinds of datasets are described below; a short array-shape sketch follows the list.

· Grayscale images: the computer interprets a grayscale image as a matrix with one entry per image pixel. One of the most widely recognized grayscale image datasets is the MNIST (Modified National Institute of Standards and Technology) dataset, which consists of images of handwritten numbers and their labels. Since its release in 1999, this classic dataset has been used for benchmarking classification algorithms. Each image is 28 pixels high and 28 pixels wide, 784 pixels in total, and each pixel carries a single value specifying its brightness: 0 denotes the darkest shade and 255 the brightest (Sewak, Karim, & Pujari, 2018). Other grayscale datasets include Fashion MNIST, which contains thousands of images of fashion products labeled across 10 classes.

· RGB images: images with a three-dimensional array of numbers representing pixel intensities across the red, green, and blue channels. These are color images with a height, a width, and three channels, so a single pixel is represented as a combination of intensities in the three channels, each ranging from 0 to 255, such as (255, 0, 255). Common public RGB datasets include ImageNet, which is “a dataset of over 15 million labeled high-resolution images belonging to roughly 22,000 categories” (Krizhevsky, Sutskever, & Hinton, 2012), as well as CIFAR, Caltech 101/Caltech 256, and the Caltech Silhouettes, among others.

Figure 2: Example Images of the MNIST dataset (Lim, Young, & Patton, 2016)
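The grayscale-vs-RGB distinction is easy to see in array form. A minimal NumPy sketch, with toy shapes rather than any particular dataset:

```python
import numpy as np

# A grayscale image is a 2-D matrix: one intensity per pixel (0-255).
gray = np.zeros((28, 28), dtype=np.uint8)
gray[10, 14] = 255            # one bright pixel

# An RGB image adds a third axis of length 3: (height, width, channels).
rgb = np.zeros((32, 32, 3), dtype=np.uint8)
rgb[0, 0] = (255, 0, 255)     # magenta: full red, no green, full blue

print(gray.shape, rgb.shape)  # (28, 28) (32, 32, 3)
```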
Computer Vision Models

Over the past few years, numerous algorithms have been developed to build models that can be trained to recognize objects and classify whole images. Different models may predict different concepts for the same inputs depending on their training. Models take images or videos as input, learn important features, and return pre-learned concepts or predictions. Deep learning models for computer vision have gone through many improvements since their inception, ranging from simple neural networks to very deep and sophisticated models.

Artificial Neural Networks (Perceptron)

Neural networks, also referred to as Artificial Neural Networks (ANNs), are a kind of machine learning algorithm that mimics the way the human brain works and serves as the building block for most deep learning models. “Perceptrons were developed in the 1950s and 1960s by the scientist Frank Rosenblatt, inspired by earlier work by Warren McCulloch and Walter Pitts.” (Nielsen, 2015) They are composed of several nodes which make up the input layer, one or more hidden layers, and an output layer. While the nodes in each layer apply nonlinear activation functions, the connections between them carry the weights associated with each node. We can think of individual nodes as simple mathematical functions that take a linear combination of input data, weights, and biases and produce an output. For the sake of clarity, here are some of the most common terms used when building neural network architectures.

Layers: a neural network is composed of several layers. The first layer of neurons is called the input layer; its values might range between zero and one, depicting the intensity of each pixel in the image. A simple neural network might include several hidden layers and finally an output layer predicting which class the sample image belongs to.

Activation functions: simple mathematical functions that take the linear combination computed from the previous layer and squash it into a number, typically between 0 and 1. They are applied to all layers except the input layer. Hidden layers might apply activation functions like sigmoid, tanh, or ReLU, and softmax might be used for the output layer of a multi-class classification problem.

Weights and biases: numbers applied to the activations of the previous layer to compute the weighted sum. These are the parameters the network tries to learn; they might initially be selected randomly and are updated as we train the network until we get an optimal result.

Cost function: the cost or loss function is a method to evaluate the performance of the network. One of the most commonly used cost functions is the mean squared error (MSE). The goal of training a neural network is to minimize this cost function by updating the weights and biases.

Backpropagation and gradient descent: the process of moving from the input layer to the output layer is called feedforward. The reverse process, backpropagation, is used to adjust the weights and biases of the network and reduce the cost function until it reaches a local minimum. The algorithm used to adjust the weights is called gradient descent, and it allows the model to determine the direction of the nearest local minimum.

Convolutional Neural Networks

Convolutional neural networks are deep learning networks designed to recognize patterns and features better than conventional neural networks. Their strength is a special type of layer called the convolutional layer. The success of a deep convolutional architecture called AlexNet in the 2012 ImageNet competition is the main reason CNNs have become so popular for building deeper image recognition networks. “Convolutional Neural Network has had ground breaking results over the past decade in a variety of fields related to pattern recognition; from image processing to voice recognition. The most beneficial aspect of CNNs is reducing the number of parameters in ANN. This achievement has prompted both researchers and developers to approach larger models in order to solve complex tasks, which was not possible with classic ANNs” (Albawi, Mohammed, & Al-Zawi, 2017).
Convolutional neural networks have three types of layers: the convolutional layer, the pooling layer, and the fully connected layer. (A small worked example of the first two follows below.)

Convolutional layer: the main building block, and the place where the majority of the computation takes place. This layer consists of a kernel or filter, usually of small dimensions, that moves across the input image to check whether a certain feature is present. The kernel is a rectangular array of numbers applied to an area of the image to take the dot product with the input pixels. The result of this product is fed into an activation function, producing an output matrix whose size is determined by hyperparameters such as the number of filters, the stride, and the padding.

Pooling layer: a layer that reduces the spatial dimensions of its input, which in turn decreases the number of parameters downstream. The filter used by a pooling layer contains no weights; it simply applies an aggregate function to each region of the input, reducing computational complexity. Pooling layers are usually applied immediately after each convolutional layer. The two most common types are max pooling and average pooling.

Fully connected layers: this layer takes the result produced by the convolutional layers and converts it into a one-dimensional array. It applies activation functions and passes the result to the next fully connected layer or, if there is none, directly to the output layer. In this type of layer each node is connected directly to every node of the next layer, which is not the case for convolutional layers.

Figure 3: Building blocks of a CNN (O’Mahony, et al., 2019)

Very Deep Convolutional Neural Networks

The earliest convolutional neural networks were very simple and did not contain many layers. The first of these was the LeNet-5 convolutional neural network, which had only a handful of layers and was used to recognize grayscale images of handwritten digits and letters of the alphabet. Since then there has been tremendous progress in the field, with more powerful and sophisticated networks consisting of more than 30 layers and millions of parameters. Some of the most common deep convolutional neural networks include AlexNet, VGGNet, ResNet, and GoogLeNet. These networks are more powerful and reach accuracy very close to human performance. Each has its own implementation approach and is used for various types of computer vision tasks. An important thing to note is that a network that is very good at extracting features for one task may perform poorly on another, so we should pay attention to which one we select to fit our business problem.
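To make the convolution and pooling layers concrete, here is a from-scratch sketch of a single 3x3 convolution (valid padding, stride 1) followed by 2x2 max pooling on a toy input; real frameworks implement the same arithmetic far more efficiently.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((6, 6))          # toy 6x6 grayscale input
kernel = np.array([[1, 0, -1],      # a simple vertical-edge filter
                   [1, 0, -1],
                   [1, 0, -1]])

# Convolutional layer: slide the 3x3 kernel and take dot products (stride 1).
out = np.zeros((4, 4))              # 6 - 3 + 1 = 4 with "valid" padding
for i in range(4):
    for j in range(4):
        out[i, j] = np.sum(image[i:i+3, j:j+3] * kernel)
out = np.maximum(out, 0)            # ReLU activation

# Pooling layer: 2x2 max pooling halves each spatial dimension (4x4 -> 2x2).
pooled = out.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(out.shape, pooled.shape)      # (4, 4) (2, 2)
```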
Methods of Evaluation for Deep Learning Models

The techniques used for evaluating deep learning models overlap heavily with those used for machine learning models. Below are some of the methods used to check and evaluate the performance of a deep learning model; a small from-scratch computation follows the list.

Accuracy: one of the most commonly used evaluation methods, calculated as the ratio of the number of correct predictions to the total number of predictions. It is not always a good indicator of actual performance: when there is class imbalance, the accuracy measure becomes biased towards the majority class.

Precision: the ratio of correctly classified positive samples (TruePositive) to the total number of samples classified as positive (TruePositive + FalsePositive). It measures the model's ability to identify samples as positive correctly.

Recall: the ratio of correctly classified positive samples (TruePositive) to the total number of positive samples (TruePositive + FalseNegative).

Specificity: the ratio of correctly identified negative samples (TrueNegative) to the total number of negative instances (TrueNegative + FalsePositive).

F1 score: the harmonic mean of precision and recall. It combines both measures to estimate how well the model is doing.

PR curve: a curve showing the relationship between precision and recall over a range of threshold values.

ROC curve: a graph showing the true positive rate against the false positive rate over a range of threshold values.

Confusion matrix: a matrix-format representation of the true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN).
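These definitions reduce to a few lines of arithmetic. A minimal sketch using made-up counts for a binary classifier, purely to illustrate the formulas above:

```python
# Hypothetical counts for a binary classifier.
tp, fp, tn, fn = 90, 10, 85, 15

accuracy    = (tp + tn) / (tp + fp + tn + fn)   # correct / all predictions
precision   = tp / (tp + fp)                    # of predicted positives, how many are real
recall      = tp / (tp + fn)                    # of real positives, how many were found
specificity = tn / (tn + fp)                    # of real negatives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(f"acc={accuracy:.3f} prec={precision:.3f} "
      f"rec={recall:.3f} spec={specificity:.3f} f1={f1:.3f}")
# acc=0.875 prec=0.900 rec=0.857 spec=0.895 f1=0.878
```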
Comparison of Traditional Computer Vision and Deep Learning

The traditional approach to computer vision involves algorithms concerned with extracting features and context from the image. Curves and edges are detected and used to derive features that give an understanding of the image's contents; edge detection algorithms like the Canny edge detector, Sobel, and fuzzy-logic methods might be used for this purpose, while established feature descriptors for object detection include SIFT, SURF, and BRIEF. The problem with the traditional approach is that “it is necessary to choose which features are important in each given image. As the number of classes to classify increases, feature extraction becomes more and more cumbersome. It is up to the CV engineer’s judgment and a long trial and error process to decide which features best describe different classes of objects” (O’Mahony, et al., 2019). Practitioners of computer vision had to determine which specific properties best defined the item of interest, and this kind of feature engineering does not scale, especially when many items of interest must be described at the same time. Deep learning solves this problem by performing feature extraction and classification in one automated procedure: neural networks uncover the relationship between the underlying features and the class labels. They outperform traditional approaches, providing higher accuracy without the time-consuming hand-crafted feature extraction.

Figure 4: Traditional Computer Vision and Deep Learning Workflow (O’Mahony, et al., 2019)

Comparison of Artificial and Convolutional Neural Networks

Artificial neural networks were originally good enough for most image classification tasks on small datasets, but they could not keep up with growing data volumes: they require a large number of parameters, which costs considerable computational power and resources. Convolutional neural networks offer a solution to this problem by substantially reducing the computation and the number of parameters required to train the network. They use convolution and pooling layers to reduce the dimensions of the image, thus decreasing the parameter count, and they are now deployed in most computer vision tasks, performing far better than traditional artificial neural networks.

Challenges of Deep Learning in Computer Vision Tasks

Although deep learning models perform far better than traditional approaches, this comes at a cost. These networks take a long time to train, and getting the best result requires a good understanding of hyperparameters and machine learning techniques for fine-tuning the model. They also demand substantial computational and processing power to perform the billions of mathematical operations involved in training. These networks can also suffer from overfitting or underfitting, whether from a lack of data or from low-resolution images: they might perform well on the training data yet fail to generalize when they encounter new data. And although these networks do a decent job of classifying images, they have no deeper understanding of, or background knowledge about, the contents of an image. When confronted with a scenario they haven't seen before, humans can rely on their extensive knowledge of the world to fill in the blanks; computer vision algorithms, in contrast, must be carefully trained on the sorts of things they must detect. When exposed to items that differ from their training examples they can behave unpredictably, such as failing to recognize emergency vehicles parked in unusual locations. Handling such scenarios gracefully may have to wait for the emergence of truly intelligent artificial systems.

Implications

The tremendous potential demonstrated by deep learning models in computer vision tasks promises a future in which these systems help solve real-world problems. Researchers and scientists are working hard to improve these systems so that they can deliver higher accuracy. Deep learning models are deployed across sectors to solve field-specific problems: facial recognition systems, content moderation, autonomous vehicles, the military, AR/VR technologies, fraud detection, and the health care industry. Their superior performance is making them more ubiquitous than ever, and big companies are using this potential to develop new features for their services. From the facial recognition systems in our phones to the image captioning and content moderation algorithms on social media sites, they are becoming part of our daily lives.

Conclusion

To sum up, computer vision has come a long way, from traditional feature extraction techniques to artificial neural networks and the more recent deep convolutional neural networks. It is advancing at a rapid pace, improving in accuracy while reducing computational cost. This blog highlighted the different types of computer vision models and compared them in terms of deployment and computational cost. It also covered the applications of these systems and the methods used to evaluate their performance, laid out the advantages and limits of deep learning systems for recognizing objects, and outlined their future prospects for revolutionizing image recognition tasks. Although current deep learning models have achieved very high accuracies, they still have a long way to go in reducing computational costs and achieving true artificial intelligence.
