Over the last several decades, there has been an explosion in the amount of visual data created by humans on a daily basis. However, computer scientists have struggled with the problem of creating machines that can recognize images and videos in the same way that people can. Humans are capable of detecting patterns and objects with a high degree of precision without exerting any effort. However, prior to the advent of deep learning, classical computers were quite poor at understanding visual input. As a result, the field of computer vision has become one of the biggest research areas in computer science.
Decades later, we have made significant progress in developing software that can comprehend and describe the content of visual data. But we’ve also uncovered how far we still have to go before we can comprehend and reproduce one of the most basic operations of the human brain. This blog will discuss the purpose of computer vision systems and the various domains in which they are being utilized to solve image recognition tasks. Then, it will examine the different types of neural network and deep learning architectures used for these tasks and the ways to evaluate them. It will take one image recognition task and analyze how the different algorithms perform on it. The blog will also outline some of the applications and limits of these systems and how deep learning is being utilized to improve their efficiency. To sum up, this blog will provide an insight into the world of computer vision systems and how they have been used across the industry to solve very complicated image recognition tasks. It draws on the current literature to provide a brief overview and acts as a stepping stone for further research in this area.
Since the invention of modern cameras and sensors, a large number of digital devices have been used to generate images and videos, which has increased the demand for systems that can recognize and describe the content of visual data. An image is an array of pixels, each stored as numerical values that represent its intensity in shades of red, green and blue. Computers have long been used to process and store images, but only recently have they been able to learn and understand the contents of an image. They lagged behind in image recognition tasks for quite some time because they had to determine which pixels belonged to which object and how to extract features.
This was a very challenging task. Programmers initially specified rules for detecting patterns in images, but this approach proved unsuccessful because static programs cannot adapt to real-world objects, which can appear from many different angles, under various lighting conditions, and against a range of different backgrounds. Researchers therefore looked for ways in which machines could learn for themselves to identify and recognize patterns in images without being explicitly programmed. Although traditional machine learning techniques like support vector machines (SVMs) proved useful initially, their accuracy eventually stagnated.
It was not until the development of neural networks that the field started to show real progress. The human visual system, which has evolved over hundreds of millions of years, is well suited to the task of identifying and recognizing objects. Around the end of the 1970s, Japanese scientist Kunihiko Fukushima proposed a computer vision system based on neuroscience that mimics the human visual cortex. Though it could not perform complex visual tasks, it marked the birth of new developments in the field of computer vision. The rise in computational power and the development of deep convolutional neural networks have pushed the boundaries of the field and enabled computers to match or outperform humans on some image recognition benchmarks.
What is Computer Vision?
“Computer vision is concerned with the automatic extraction, analysis and understanding of useful information from a single image or a sequence of images. It involves the development of a theoretical and algorithmic basis to achieve automatic visual understanding.” (The British Machine Vision Association and Society for Pattern Recognition, 2017). It differs from computer graphics, which is used to generate images, in that it deals with extracting visual information from the real world and making predictions from it. Computer vision systems aim to produce accurate results and to make predictions very quickly.
Applications in Computer Vision
Computer vision systems can be used in a variety of applications including image classification, object detection, image segmentation, face recognition, action and activity recognition, and human pose estimation. But here we will discuss the three main applications of computer vision.
· Image classification: This is one of the fundamental applications of computer vision. It is the process of training computer systems to recognize the whole image and assign it to a specific category from a collection of predefined tags. According to Wang and Su (2019), “Image classification is a fundamental task that attempts to comprehend an entire image as a whole. The goal is to classify the image by assigning it to a specific label. Typically, image classification refers to images in which only one object appears and is analyzed.”
· Object detection: This is a technique used to locate instances of objects in images or videos using bounding boxes.
It produces a classification result and regression coordinates for the bounding box of each object. It leverages deep learning algorithms to recognize and label objects of interest within an image. “Object detection is the process of detecting instances of semantic objects of a certain class (such as humans, airplanes, or birds) in digital images and video. A common approach for object detection frameworks includes the creation of a large set of candidate windows that are in the sequel classified using CNN features” (Voulodimos, Doulamis, Doulamis, & Protopapadakis, 2018).
· Image Segmentation: It is the process of labeling pixels of related features or objects in an image.
Kaur et al. (2014) define segmentation as “the technique of dividing or partitioning an image into parts, called segments. It is mostly useful for applications like image compression or object recognition, because for these types of applications, it is inefficient to process the whole image. So, image segmentation is used to segment the parts from image for further processing. There exist several image segmentation techniques, which partition the image into several parts based on certain image features like pixel intensity value, color, texture, etc. These all techniques are categorized based on the segmentation method used” (pp. 809–814). Image segmentation can further be divided into semantic and instance segmentation.
Figure 1 The difference between classification, detection, and segmentation (Shanmugamani, 2018)
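To make the difference between the three tasks concrete, here is a minimal sketch of what each one might return for the same image. The toy image, the label set and the dictionary layout are assumptions made up for illustration and do not follow any particular library's format.

```python
import numpy as np

# A hypothetical 4x4 RGB image that contains a single 2x2 white "object".
height, width = 4, 4
image = np.zeros((height, width, 3), dtype=np.uint8)
image[1:3, 1:3] = 255  # the object occupies rows 1-2 and columns 1-2

class_names = ["background", "square"]  # hypothetical label set

# Classification: a single label for the whole image.
classification_output = 1  # index into class_names

# Detection: one bounding box (x_min, y_min, x_max, y_max) and one label per object.
detection_output = [{"box": (1, 1, 3, 3), "label": 1}]

# Segmentation: one label per pixel, with the same height and width as the image.
segmentation_output = (image.sum(axis=2) > 0).astype(np.int64)

print(class_names[classification_output])
print(detection_output)
print(segmentation_output)
```

The outputs grow in detail from one task to the next: a single label, then a list of labeled boxes, then a full-resolution grid of labels.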
Datasets for Computer Vision
Machine learning on image and video files is a time-consuming and data-intensive task. It is critical to have access to high-quality, noise-free, large-scale datasets for training complicated deep neural network models. High-quality deep learning systems may require millions of carefully selected images for training. Many open-source datasets are being created for use in image classification, pose estimation, image captioning, autonomous driving, and object segmentation. The computer vision community has benefited from an influx of publicly available, annotated image datasets, which has led to impressive results in object detection and segmentation tasks and to innovative modeling architectures (Yuzhen & Sierra, 2020). Some of the most widely used computer vision datasets are described below.
· Grayscale Images: The computer interprets a grayscale image as a matrix with one entry for each image pixel. One of the most widely recognized grayscale image datasets is the MNIST (Modified National Institute of Standards and Technology) dataset, which consists of images of handwritten digits and their labels. Since its release in 1999, this classic dataset has been used for benchmarking classification algorithms. Each image has a height and width of 28 pixels, for a total of 784 pixels. A single value is associated with each pixel. This number specifies the brightness or darkness of that pixel; 0 denotes the darkest, and 255 the brightest (Sewak, Karim, & Pujari, 2018). Other datasets include Fashion MNIST, which contains 70,000 images of fashion products, each labeled with one of 10 classes.
· RGB Images: These are images represented as a three-dimensional array of numbers giving the intensities of pixels across the red, green and blue channels. They are color images with a height, a width and three channels. A single pixel can be represented as a combination of intensities in the three channels, each ranging from 0 to 255, such as (255,0,255). Some of the most common public RGB image datasets include ImageNet, which is “a dataset of over 15 million labeled high-resolution images belonging to roughly 22,000 categories” (Krizhevsky, Sutskever, & Hinton, 2012). Other familiar image datasets include CIFAR, Caltech 101/Caltech 256 and Caltech Silhouettes, among others.
Figure 2 Example Images of MNIST dataset (Lim, Young, & Patton, 2016)
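The array representations described above can be sketched with NumPy. The images here are synthetic stand-ins (not actual MNIST or ImageNet samples), and the variable names are only illustrative.

```python
import numpy as np

# Grayscale: an MNIST-sized image is a 28 x 28 matrix of values from 0 to 255.
gray = np.zeros((28, 28), dtype=np.uint8)   # all pixels at their darkest
gray[10:18, 10:18] = 255                    # a bright square in the middle
print(gray.shape, gray.size)                # 28 x 28 = 784 pixels

# RGB: height x width x 3 channels, each channel value from 0 to 255.
rgb = np.zeros((28, 28, 3), dtype=np.uint8)
rgb[0, 0] = (255, 0, 255)                   # full red, no green, full blue
print(rgb.shape, rgb[0, 0])

# Networks are often fed values scaled to the range 0..1, flattened to a vector.
flat = gray.astype(np.float32).reshape(-1) / 255.0
print(flat.shape, flat.min(), flat.max())
```

Scaling to the range 0 to 1 is a common preprocessing choice rather than a requirement; it is shown here because the pixel intensities are used as inputs to the networks discussed later.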
Computer Vision Models
Over the past few years, numerous algorithms have been developed to build models that can be trained to recognize objects and classify images as a whole. Different models may predict different concepts for the same inputs depending on their training. Models take images or videos as input, learn important features from them, and return predictions drawn from the concepts they were trained on. Deep learning models for computer vision have improved considerably since their inception, evolving from simple neural networks to very deep and sophisticated models.
Artificial Neural Networks (Perceptron)
Neural networks, also referred to as Artificial Neural Networks (ANNs), are a kind of machine learning algorithm loosely modeled on how the human brain works, and they serve as the building block for most deep learning models. “Perceptrons were developed in the 1950s and 1960s by the scientist Frank Rosenblatt, inspired by earlier work by Warren McCulloch and Walter Pitts.” (Nielsen, 2015) They are composed of nodes arranged in an input layer, one or more hidden layers and an output layer. While the nodes in each layer apply nonlinear activation functions, the connections between them carry the weights. We can think of an individual node as a simple mathematical function that computes a weighted sum of its inputs plus a bias, much like a linear regression model, and passes the result through an activation function to give an output.
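As a rough illustration of a single node, the sketch below computes a weighted sum of the inputs plus a bias and passes it through a sigmoid activation. The input values, weights and bias are arbitrary numbers chosen for the example, and the function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the range 0..1."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    """One node: weighted sum of inputs plus bias, then a nonlinear activation."""
    z = np.dot(w, x) + b   # the linear part, as in linear regression
    return sigmoid(z)      # the nonlinear activation

x = np.array([0.0, 0.5, 1.0])    # e.g. three pixel intensities scaled to 0..1
w = np.array([0.2, -0.4, 0.6])   # one weight per input (arbitrary values)
b = 0.1                          # bias (arbitrary value)

print(neuron(x, w, b))           # sigmoid(0.5), roughly 0.62
```

A full layer is many such nodes evaluated at once, which is why the weighted sums are usually written as a matrix multiplication.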
For the sake of simplicity we will introduce some of the most common terms used when building neural network architectures.
Layers: A neural network is composed of several layers. The first layer of neurons is called the input layer, and its values might be numbers between zero and one that represent the intensity of each pixel in the image. A simple neural network might include one or more hidden layers and finally an output layer that predicts which of the classes the sample image belongs to.
Activation function: An activation function is a simple mathematical function that takes the weighted sum computed from the previous layer as input and transforms it nonlinearly; the sigmoid function, for example, outputs a number between 0 and 1. These functions are applied in all layers except the input layer. The hidden layers might apply activation functions like sigmoid, tanh or ReLU, and softmax might be used for the output layer of a multi-class classification problem.
Weights and Biases: Weights are numbers that multiply the activations of the previous layer to compute the weighted sum, and biases are numbers added to that sum. These are the parameters that the network tries to learn. Initially they might be selected randomly, but they are updated as we train the network until we get a good result.
Cost Function: The cost or loss function measures how far the network’s predictions are from the expected outputs. One of the most commonly used cost functions is the mean squared error (MSE). The goal of training a neural network is to minimize this cost function by updating the weights and biases.
Back propagation and Gradient Descent: The process of moving from the input layer to the output layer is called feed forward. The reverse process is called back propagation, which computes how much each weight and bias contributed to the cost. The algorithm that uses these gradients to adjust the weights and biases is called gradient descent; it repeatedly moves the parameters in the direction that reduces the cost most steeply until the cost settles at a local minimum.
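The terms above can be tied together in one small training loop. This is a minimal sketch, not a production implementation: the XOR dataset, the layer sizes, the learning rate, the number of steps and the random seed are all assumptions chosen to keep the example tiny, and sigmoid is used in both layers purely for simplicity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy dataset (XOR), chosen only for illustration: 4 samples, 2 inputs, 1 output.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([[0.0], [1.0], [1.0], [0.0]])

# Weights start out random and biases at zero; both are learned during training.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)   # input layer -> hidden layer
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)   # hidden layer -> output layer

learning_rate = 0.5
losses = []
for step in range(2000):
    # Feed forward: input layer -> hidden layer -> output layer.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # Cost function: mean squared error between predictions and targets.
    losses.append(np.mean((out - y) ** 2))

    # Back propagation: apply the chain rule from the output back to the input.
    d_out = 2.0 * (out - y) / y.size
    d_z2 = d_out * out * (1.0 - out)     # derivative of the output sigmoid
    grad_W2, grad_b2 = h.T @ d_z2, d_z2.sum(axis=0)
    d_h = d_z2 @ W2.T
    d_z1 = d_h * h * (1.0 - h)           # derivative of the hidden sigmoid
    grad_W1, grad_b1 = X.T @ d_z1, d_z1.sum(axis=0)

    # Gradient descent: step each parameter against its gradient.
    W1 -= learning_rate * grad_W1
    b1 -= learning_rate * grad_b1
    W2 -= learning_rate * grad_W2
    b2 -= learning_rate * grad_b2

print("first loss:", losses[0], "last loss:", losses[-1])
```

Deep learning frameworks compute the backward pass automatically; it is written out by hand here only to show where each term from the list above appears.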
Convolutional Neural Networks
Convolutional neural networks (CNNs) are deep learning networks designed to recognize patterns and features in images better than conventional fully connected networks. The strength of this type of network is that it contains a special type of layer called a convolutional layer. The success of a deep convolutional architecture called AlexNet in the 2012 ImageNet competition was the main reason that CNNs became so popular for image recognition tasks. “Convolutional Neural Network has had ground breaking results over the past decade in a variety of fields related to pattern recognition; from image processing to voice recognition. The most beneficial aspect of CNNs is reducing the number of parameters in ANN. This achievement has prompted both researchers and developers to approach larger models in order to solve complex tasks, which was not possible with classic ANNs” (Albawi, Mohammed, & Al-Zawi, 2017). Convolutional neural networks have three types of layers: convolutional layers, pooling layers and fully connected layers.
Convolutional layer: This is the main building block, where the majority of the computation takes place. The layer consists of kernels or filters, usually much smaller than the input, which move across the input image to check whether a certain feature is present. A kernel is a rectangular array of numbers that is applied to an area of the image by taking the dot product with the input pixels. The result is fed into an activation function, which produces an output matrix whose size is determined by hyperparameters such as the number of filters, the stride and the padding.
Pooling Layer: This layer reduces the dimensions of its input, which in turn decreases the number of parameters and computations needed in later layers. The filter used for a pooling layer does not contain any weights, as it simply applies an aggregate function to each region of the input. Pooling layers are usually applied after convolutional layers. The two most common types are max pooling and average pooling.
Fully Connected Layers: The output of the convolutional and pooling layers is flattened into a one-dimensional array and passed to the fully connected layers. Each fully connected layer applies an activation function and passes its result to the next fully connected layer or, if there are none, to the output layer. In this type of layer each node is connected to every node in the next layer, which is not the case for convolutional layers.
Figure 3 Building blocks of a CNN (O’Mahony, et al., 2019)
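The three building blocks can be sketched in plain NumPy. This is a simplified, single-channel illustration under several assumptions: the kernel is square, its values are fixed by hand rather than learned, and the helper names `conv2d`, `relu` and `max_pool2d` are hypothetical rather than taken from any library.

```python
import numpy as np

def conv2d(image, kernel, stride=1, padding=0):
    """Slide a square kernel over a 2D image, taking a dot product at each position.
    (As in most deep learning libraries, the kernel is not flipped.)"""
    if padding > 0:
        image = np.pad(image, padding, mode="constant")
    k = kernel.shape[0]
    out_h = (image.shape[0] - k) // stride + 1   # output size depends on
    out_w = (image.shape[1] - k) // stride + 1   # kernel size, stride and padding
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + k, j * stride:j * stride + k]
            out[i, j] = np.sum(patch * kernel)
    return out

def relu(x):
    return np.maximum(0, x)

def max_pool2d(x, size=2):
    """Keep the largest value in each non-overlapping size x size block (no weights)."""
    h, w = x.shape[0] // size, x.shape[1] // size
    return x[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

# Toy 6x6 image: dark left half, bright right half.
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# A hand-written filter that responds to dark-to-bright vertical edges.
kernel = np.array([[-1.0, 0.0, 1.0]] * 3)

feature_map = relu(conv2d(image, kernel))   # convolutional layer + activation
pooled = max_pool2d(feature_map)            # pooling layer
flattened = pooled.reshape(-1)              # input to a fully connected layer

print(feature_map.shape, pooled.shape, flattened.shape)
print(feature_map[0])
```

The feature map is largest where the kernel straddles the edge between the dark and bright halves, and pooling then halves each spatial dimension before the result is flattened.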
Very Deep Convolutional Neural Networks
Some of the earliest convolutional neural networks were very simple and did not contain many layers. One of the earliest was the LeNet-5 convolutional neural network, which had only seven layers (not counting the input) and was used to recognize grayscale images of handwritten digits and letters of the alphabet. Since then there has been tremendous progress in the field, with more powerful and sophisticated networks that consist of more than 30 layers and millions of parameters. Some of the most common deep convolutional neural networks include AlexNet, VGGNet, ResNet and GoogLeNet. These networks are more powerful and reach accuracy very close to that of humans on benchmark datasets. Each network has its own architecture, and they are utilized for various types of computer vision tasks. An important thing to note here is that a network may be better than others at extracting features for one task yet perform poorly on another, so we should pay attention to which one we select to fit our business needs.
Methods of Evaluation for Deep Learning Models
There are a lot of similarities between techniques used for evaluating machine learning and deep learning models. Below are some of the various methods to check and evaluate the performance of a deep learning model.
Accuracy: is one of the most commonly used evaluation methods and is calculated as the ratio of the number of correct predictions to the total number of predictions. It is not always a good indicator of the actual performance of the network: when there is a class imbalance, the accuracy measure becomes biased towards the majority class.
Precision: is defined as the ratio of the number of correctly classified positive samples (TruePositive) to the total number of samples classified as positive (TruePositive+FalsePositive). It measures how reliable the model is when it labels a sample as positive.
Recall: is defined as the ratio of the number of correctly classified positive samples (TruePositive) to the total number of positive samples (TruePositive+FalseNegative).
Specificity: is defined as the ratio of the number of correctly identified negative samples (TrueNegative) to the total number of negative samples (TrueNegative+FalsePositive).
F1 Score: is defined as the harmonic mean of precision and recall. It takes the contribution of both measures to estimate how well the model is doing.
PR curve: is defined as a curve which shows the relationship between precision and recall for various threshold values.
ROC curve: is a graph used to show the true positive rate and false positive rate for various threshold values.
Confusion matrix: is a representation of true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN) in a matrix format.
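The definitions above translate directly into code. The sketch below uses plain Python; the function names, the example counts and the example scores are hypothetical and chosen only to show how class imbalance can make accuracy misleading and how one threshold yields one point on the PR and ROC curves.

```python
def confusion_counts(y_true, scores, threshold):
    """Count TP, FP, TN, FN when scores >= threshold are predicted positive."""
    tp = fp = tn = fn = 0
    for label, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def classification_metrics(tp, fp, tn, fn):
    """Compute the metrics from the list above; empty denominators give 0.0."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,            # also the true positive rate (ROC y-axis)
        "specificity": specificity,  # false positive rate = 1 - specificity
        "f1": f1,
    }

# Hypothetical imbalanced test set: 990 negatives and 10 positives.
# A model that always predicts "negative" looks accurate but finds nothing.
always_negative = classification_metrics(tp=0, fp=0, tn=990, fn=10)
print(always_negative)   # accuracy 0.99, recall 0.0

# Each threshold gives one confusion matrix, hence one point on each curve.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
for threshold in (0.3, 0.5):
    counts = confusion_counts(y_true, scores, threshold)
    print(threshold, counts, classification_metrics(*counts))
```

Sweeping the threshold over many values and plotting precision against recall, or true positive rate against false positive rate, produces the PR and ROC curves respectively.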
Comparison of Traditional Computer Vision and Deep Learning
The traditional approach to computer vision involves algorithms that extract features and context from the image. Curves and edges are detected and used to derive features that give an understanding of the contents of the image. Edge detection algorithms such as the Canny edge detector, the Sobel operator and fuzzy logic methods might be used for this purpose. Some of the most established feature descriptors for object detection include SIFT, SURF and BRIEF. The problem with the traditional approach is that “it is necessary to choose which features are important in each given image. As the number of classes to classify increases, feature extraction becomes more and more cumbersome. It is up to the CV engineer’s judgment and a long trial and error process to decide which features best describe different classes of objects” (O’Mahony, et al., 2019).
Practitioners of computer vision had to determine which specific properties best defined the item of interest. This method of feature engineering and description was not scalable, especially when many items of interest had to be described at the same time. Deep learning solves this problem by performing feature extraction and classification in one automated procedure. Deep learning models use neural networks to uncover the relationship between the underlying features and the class labels. They outperform traditional approaches by providing higher accuracy, and they do not require hand-crafted feature extraction, which takes a lot of time.
Figure 4 Traditional Computer Vision and Deep Learning Workflow (O’Mahony, et al., 2019)
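As an example of a hand-crafted feature from the traditional workflow, the sketch below applies the Sobel operator to a toy image. The Sobel kernels themselves are standard; the helper names `filter2d` and `sobel_edges` and the toy image are assumptions made for illustration, and no padding is applied, so the output is smaller than the input.

```python
import numpy as np

# Standard 3x3 Sobel kernels for horizontal and vertical intensity changes.
SOBEL_X = np.array([[-1.0, 0.0, 1.0],
                    [-2.0, 0.0, 2.0],
                    [-1.0, 0.0, 1.0]])
SOBEL_Y = SOBEL_X.T

def filter2d(image, kernel):
    """Apply a square kernel at every valid position of a 2D image."""
    k = kernel.shape[0]
    out_h, out_w = image.shape[0] - k + 1, image.shape[1] - k + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + k, j:j + k] * kernel)
    return out

def sobel_edges(image):
    """Gradient magnitude: large where intensity changes sharply."""
    gx = filter2d(image, SOBEL_X)
    gy = filter2d(image, SOBEL_Y)
    return np.hypot(gx, gy)

# Toy 5x5 image: dark on the left, bright on the right.
image = np.zeros((5, 5))
image[:, 3:] = 1.0

edges = sobel_edges(image)
print(edges)
```

Here the filter values are fixed in advance by the engineer. In a convolutional network the same sliding-window operation is used, but the kernel values are learned from the training data.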
Comparison of Artificial and Convolutional Neural Networks
Fully connected Artificial Neural Networks were originally good enough for many image classification tasks on small datasets. But they could not keep up with the increasing amount of data, as they require a large number of parameters, which demands a lot of computational power and resources. Convolutional Neural Networks offer a solution to this problem, as they substantially reduce the amount of computation and the number of parameters required to train the network. They use convolutional and pooling layers to reduce the dimensions of the image, thus decreasing the number of parameters. They are deployed in most computer vision tasks to classify images and perform far better than traditional artificial neural networks.
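A quick back-of-the-envelope calculation shows the scale of the difference. The image size (224 x 224 RGB), the 64 output units or filters and the 3 x 3 kernel are assumptions picked for illustration, and the helper names are hypothetical.

```python
def dense_params(n_inputs, n_outputs):
    """Fully connected layer: one weight per input-output pair, plus one bias per output."""
    return n_inputs * n_outputs + n_outputs

def conv_params(kernel_size, in_channels, out_channels):
    """Convolutional layer: each filter has kernel_size^2 * in_channels weights plus one bias."""
    return kernel_size * kernel_size * in_channels * out_channels + out_channels

n_pixels = 224 * 224 * 3              # values in one hypothetical RGB image

dense = dense_params(n_pixels, 64)    # first layer of a fully connected network
conv = conv_params(3, 3, 64)          # first layer of a CNN with 64 3x3 filters

print(n_pixels)   # 150528
print(dense)      # 9633856
print(conv)       # 1792
```

The convolutional layer needs far fewer parameters because the same small filter is reused at every position of the image instead of learning a separate weight for every pixel.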
Challenges of Deep learning in Computer Vision tasks
Although deep learning models perform far better than traditional approaches, they come at a cost. These networks take a long time to train, and they require a good understanding of hyperparameters and machine learning techniques to fine-tune the model and get the best result. They also require a lot of computational power to perform the billions of mathematical operations used to train the model. These networks can also suffer from overfitting or underfitting, whether from a lack of data or from low-resolution images. They might perform well on the training data and fail to generalize when they encounter new data. Although these networks do a decent job of classifying images, they have no deeper understanding or background knowledge about the contents of the image. When confronted with a scenario they haven’t seen before, humans may rely on their extensive knowledge of the world to fill in the blanks. In contrast to people, computer vision algorithms must be carefully trained on the sorts of things they must detect. When they are exposed to items that differ from their training examples, they can behave unpredictably, such as failing to recognize emergency vehicles parked in unusual locations. Handling such scenarios reliably may have to wait for the emergence of truly intelligent systems.
The tremendous potential demonstrated by deep learning models in computer vision tasks offers a promising future for using these systems to solve real-world problems. Researchers and scientists are working hard to improve these systems so that they achieve higher accuracy. Deep learning models are deployed in various sectors to solve problems specific to each field. They are used in facial recognition systems, content moderation, autonomous vehicles, the military, AR/VR technologies, fraud detection and the health care industry. They are becoming more ubiquitous than ever before because of their superior performance. Big companies are utilizing this huge potential to develop new features for their services. From the facial recognition systems in our phones to image captioning and content moderation algorithms on social media sites, they are becoming part of our daily lives.
To sum up, computer vision has come a long way from traditional techniques for extracting features to artificial neural networks and the more recent deep convolutional neural networks. It is advancing at a rapid pace, improving in accuracy while reducing computational cost. The blog highlighted the different types of computer vision models and compared them in terms of their deployment and computational cost. It has also covered the applications of these systems and the methods used to evaluate their performance. It has laid out the advantages and limits of deep learning systems in recognizing objects and outlined the future prospects of these systems in revolutionizing image recognition tasks. The blog concludes that although current deep learning models have achieved very high accuracy, they still have a long way to go in terms of reducing computational costs and achieving true artificial intelligence.