Deep Learning and machine learning: image processing for face recognition

Face recognition is an image processing technology that has evolved rapidly with deep learning. Were you surprised the first time Facebook or Instagram automatically identified your friends and tagged them as soon as you uploaded a photo? How can they map the geometry of a face so accurately? It is because the machines and technologies around us are getting smarter, to the point where they may one day overtake human intelligence.

What is Face Recognition?

It is a technology capable of identifying or verifying the identity of a person using their face.

How does it work?

In human beings, neurons extract certain facial features and store them in the brain, which is what makes face recognition possible. A face recognition system works in a similar way: it takes an image of a face, predicts whether that face matches one of the faces it has learned, and returns the result.

Why is it important?

Apart from unlocking smartphones, this technology has many other uses.

▪ Smart door locks: locks used in companies can open the door with nothing but a smile on the user's face.

▪ School and lecture attendance tracking: this saves time and reduces the chance of recording attendance incorrectly.

▪ ATMs and banks: newer cash machines increase security for card users by using facial data instead of a PIN.

▪ Home security cameras: these can raise an alert when somebody whose face the camera does not recognize attempts to enter the home.

▪ Government use: it can help identify terrorists or other criminals simply by scanning faces.

However, there are some major complications associated with face recognition that limit its effectiveness in practice.

Description of the Data

The original dataset is a collection of 300 colour (RGB) photographs of 25 individuals, with 12 photographs per person. The photographs were captured from different views (front and side), at different times, and under varying illumination and poses, with no restrictions placed on clothes, glasses, beards or hairstyle. The images have different dimensions and are all in JPG format.


▪ Total number of individuals: 25

▪ Number of Images per individual: 12

▪ Total number of original images: 300

▪ Gender: Images of Male and Female subjects

▪ Race: Sinhala

▪ Age Range: Individuals between 22-28 years old

▪ Glasses: Yes or No

▪ Beards: Yes or No

▪ Image format: colour JPG

Some example images are shown below.

Data Preparation

Step 01: Face Detection and Cropping

The first and most important step in a face recognition system is detecting the faces in an image or video stream, because a person cannot be classified until their face has been located. The pre-trained Haar cascade that ships with OpenCV, “haarcascade_frontalface_default.xml”, is used to detect the frontal faces in the original images.

Haar Cascade is an effective object detection method proposed by Paul Viola and Michael Jones in their paper “Rapid Object Detection using a Boosted Cascade of Simple Features” in 2001.

The faces detected by Haar Cascade Classifier are cropped and used for further processing and analysis. Figure 3 shows some sample images from the dataset and the corresponding detected faces.
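
The detect-and-crop step can be sketched roughly as follows. This is a minimal illustration, not the exact code behind the article: the function names, the `scaleFactor` and `minNeighbors` values and the grayscale conversion before detection are assumptions. The cropping helper needs only NumPy, while the detection function needs the `opencv-python` package.

```python
import numpy as np


def crop_faces(image, boxes):
    """Cut one sub-image per (x, y, w, h) bounding box out of `image`."""
    return [image[y:y + h, x:x + w] for (x, y, w, h) in boxes]


def detect_and_crop(image_path):
    """Detect frontal faces with OpenCV's bundled Haar cascade and crop them."""
    import cv2  # imported here so crop_faces stays usable without OpenCV

    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
    )
    image = cv2.imread(image_path)
    if image is None:
        raise FileNotFoundError(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # Assumed starting values; the article does not state its detector settings.
    boxes = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    return crop_faces(image, boxes)
```

Calling `detect_and_crop("photo.jpg")` (a hypothetical file name) would return a list with one cropped array per detected face, which may be empty if no frontal face is found.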

Step 02: Data Augmentation

Data augmentation is the process of artificially increasing the diversity and the amount of training data. Rather than collecting new data, we apply transformations to the data we already have. It is a technique that can improve the performance of a deep learning model by helping it generalize.

Because the dataset needs to be representative of different variations, domain-specific transformations are applied to the training examples, such as different angles, lighting, positions, flips and zooms.

In practical terms, it makes sense to flip a picture of a dog horizontally, as the photo could have been taken from the left or the right. A vertical flip of the same picture does not make sense, since it is very unlikely that we would see a dog upside down.

In this face recognition study, the cropped facial images are augmented using added noise, rotations, different brightness conditions, horizontal flips and so on.

Horizontal Flipping

Horizontal flipping makes sense for facial images, but vertical flipping does not. Below is a sample set of images that are flipped horizontally.
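
As a small illustration, a horizontal flip is just a reversal of the column order of the pixel array. The helper name below is an assumption made for this sketch.

```python
import numpy as np


def horizontal_flip(image):
    """Mirror an image left to right; works for (H, W) and (H, W, C) arrays."""
    return image[:, ::-1]


# A tiny 2 x 3 "image" makes the effect easy to see.
face = np.array([[1, 2, 3],
                 [4, 5, 6]])
flipped = horizontal_flip(face)
print(flipped)
```

Flipping twice gives back the original image, which is a quick sanity check for this kind of augmentation.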

Random Brightness Conditions

The brightness of an image can be varied by randomly darkening images, brightening images, or both. The goal of this technique is to help the model generalize by training it on images with different lighting levels.

Figure 5 shows the brightness levels applied to the original images of 2 subjects.
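
One simple way to vary brightness is to multiply every pixel by a random factor and clip the result to the valid range. The sketch below assumes 8-bit images and a factor range of 0.5 to 1.5; the article does not say which range was actually used.

```python
import numpy as np


def random_brightness(image, low=0.5, high=1.5, rng=None):
    """Scale pixel intensities by a random factor; below 1 darkens, above 1 brightens."""
    rng = np.random.default_rng() if rng is None else rng
    factor = rng.uniform(low, high)  # assumed range, not taken from the article
    scaled = image.astype(np.float32) * factor
    # Keep the values inside the valid 8-bit range before converting back.
    return np.clip(scaled, 0, 255).astype(np.uint8)


face = np.full((4, 4), 100, dtype=np.uint8)
augmented = random_brightness(face, rng=np.random.default_rng(0))
print(augmented[0, 0])
```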

Random Zooming Augmentation

A zoom value of 0.2 is used, so each randomly sampled image is rescaled by a factor drawn from the range [1 - 0.2, 1 + 0.2], that is, between 80% and 120% of its original scale.

After applying the flips, brightness changes and zoom levels, the number of images per subject increases from 12 to 108. The augmented dataset therefore contains a total of 2700 images of the 25 subjects.
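
The numbers above can be checked with a few lines of arithmetic. The function names are made up for this sketch, and the figure of 9 variants per original photo is simply derived from 108 / 12; the article does not break down how those variants are composed.

```python
import random


def zoom_bounds(zoom_value):
    """Return the (lowest, highest) zoom factor for a given zoom value."""
    return 1 - zoom_value, 1 + zoom_value


def sample_zoom_factor(zoom_value, rng=random):
    """Draw one zoom factor uniformly from the allowed range."""
    low, high = zoom_bounds(zoom_value)
    return rng.uniform(low, high)


low, high = zoom_bounds(0.2)

subjects = 25
originals_per_subject = 12
augmented_per_subject = 108

# Each original photo ends up represented by this many images.
variants_per_original = augmented_per_subject // originals_per_subject
total_images = subjects * augmented_per_subject

print(variants_per_original, total_images)  # 9 2700
```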

Step 03: Data Pre-processing

Pre-processing is a basic step in image classification, because the acquired data can be messy and may come from different sources. The images therefore need to be cleaned up and standardized in order to reduce complexity and computational cost.

Resizing of the images

Images captured by cameras or mobile phones can come in different sizes. Therefore, all the images are resized to a common size of 96 x 96 before being fed into the algorithm.
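
To show what resizing does to the pixel array, here is a dependency-light sketch that uses nearest-neighbour sampling. It is an illustration only: the article does not say which resizing routine was used, and in practice a library function such as OpenCV's `cv2.resize` with a smoother interpolation would be the more usual choice.

```python
import numpy as np


def resize_nearest(image, size=(96, 96)):
    """Resize an (H, W) or (H, W, C) array to `size` by nearest-neighbour sampling."""
    new_h, new_w = size
    h, w = image.shape[:2]
    # For every output row/column, pick the source row/column it maps back to.
    rows = np.arange(new_h) * h // new_h
    cols = np.arange(new_w) * w // new_w
    return image[rows[:, None], cols]


photo = np.zeros((240, 320, 3), dtype=np.uint8)  # stand-in for a cropped face
print(resize_nearest(photo).shape)  # (96, 96, 3)
```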

Conversion of colour images into grayscale images

Converting colour images into grayscale images reduces the number of channels (the depth) of the images from three to one, so only a third as many values need to be processed per image. This reduces the computational complexity and shortens the training time.
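
A grayscale conversion can be sketched as a weighted sum of the three colour channels. The weights below are the standard ITU-R BT.601 luma weights, chosen here as an assumption because the article does not state which conversion it used; the sketch also assumes the channels are in R, G, B order.

```python
import numpy as np


def rgb_to_gray(image):
    """Collapse an (H, W, 3) RGB array into an (H, W, 1) grayscale array."""
    weights = np.array([0.299, 0.587, 0.114])  # BT.601 luma weights for R, G, B
    gray = image[..., :3].astype(np.float32) @ weights
    return gray[..., np.newaxis]  # keep an explicit single channel


rgb = np.zeros((96, 96, 3), dtype=np.uint8)
rgb[..., 1] = 255  # a pure green image
gray = rgb_to_gray(rgb)

print(rgb.size, gray.size)  # 27648 9216
```

The two printed sizes show the saving: a 96 x 96 colour image holds 27648 values, while its grayscale version holds 9216.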

Data Normalization

Image data normally consists of pixel values that are integers between 0 and 255. Data normalization here means scaling the pixel intensities to the range [0, 1], which is achieved by dividing every pixel value by the maximum pixel value, 255.
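
In code, this normalization is a single division. The helper name is an assumption made for this sketch.

```python
import numpy as np


def normalize(image):
    """Scale 8-bit pixel values from [0, 255] to [0, 1]."""
    return image.astype(np.float32) / 255.0


pixels = np.array([0, 51, 255], dtype=np.uint8)
scaled = normalize(pixels)
print(scaled.min(), scaled.max())
```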

Label Binarization

Label binarization is an important pre-processing step for classification models in machine learning and deep learning. Since the model needs all of its data in numeric form, the labels must be converted into numbers.

In this scenario, the label is the name of the person shown in the image. Label binarization encodes each name as a set of dummy variables, one per person, that indicate whether a particular label is present or not.
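
A minimal sketch of this encoding, written with NumPy only, is shown below. The function name and the person names are hypothetical; scikit-learn's `LabelBinarizer` provides a ready-made version of the same idea.

```python
import numpy as np


def binarize_labels(names):
    """Turn a list of person names into one-hot rows, one column per person."""
    classes = sorted(set(names))
    index = {name: i for i, name in enumerate(classes)}
    one_hot = np.zeros((len(names), len(classes)), dtype=np.int64)
    for row, name in enumerate(names):
        one_hot[row, index[name]] = 1
    return one_hot, classes


# Hypothetical labels for four images.
names = ["person_b", "person_a", "person_b", "person_c"]
encoded, classes = binarize_labels(names)
print(classes)
print(encoded)
```

Each row contains exactly one 1, in the column of the person shown in that image.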

Face Recognition – Advanced Analysis

Deep learning is a subset of machine learning that uses multiple layers to progressively extract higher-level features from the input data. In image processing, low-level features such as edges and colours are identified by the lower layers, while higher layers may extract whole faces or objects. In an image recognition application, the input is a matrix of pixels: the first layer might abstract the pixels and encode edges, the second layer might compose an arrangement of edges, the third layer might encode eyes and a mouth, and the next layer might recognize that the image contains a face. A deep learning model is therefore able to learn both low-level and high-level features.

A deep learning-based CNN architecture which is a simplified version of the VGGNet model (Smaller VGGNet) is used for the face classification. The proposed CNN architecture is represented in Figure 10.

The grayscale version of the augmented dataset (2700 face images of 25 subjects) is used for the experiments. Each image is 96 x 96 pixels, and the set covers diverse facial expressions, brightness conditions and occlusions. The Smaller VGGNet model used here has 5 convolutional layers.
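
To give a feel for what such a network might look like, here is a hedged Keras sketch of a small VGG-style CNN. Only the 96 x 96 grayscale input, the 25 output classes and the five convolutional layers come from the article. The filter counts, pooling sizes, dropout rates, batch normalization and the 1024-unit dense layer are illustrative guesses, and the actual architecture in Figure 10 may differ. Building the model requires TensorFlow.

```python
def spatial_size_after_pooling(side=96, pools=(3, 2, 2)):
    """Trace how the feature-map side length shrinks through the pooling layers."""
    for pool in pools:
        side //= pool
    return side


def build_smaller_vggnet(height=96, width=96, depth=1, classes=25):
    """Build a small VGG-style CNN with five convolutional layers (assumed layout)."""
    from tensorflow.keras import layers, models  # requires TensorFlow

    return models.Sequential([
        layers.Input(shape=(height, width, depth)),
        # Block 1: one convolution, then 3 x 3 pooling (96 -> 32).
        layers.Conv2D(32, (3, 3), padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling2D(pool_size=(3, 3)),
        layers.Dropout(0.25),
        # Block 2: two convolutions, then 2 x 2 pooling (32 -> 16).
        layers.Conv2D(64, (3, 3), padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.Conv2D(64, (3, 3), padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Dropout(0.25),
        # Block 3: two convolutions, then 2 x 2 pooling (16 -> 8).
        layers.Conv2D(128, (3, 3), padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.Conv2D(128, (3, 3), padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Dropout(0.25),
        # Classifier head: one probability per subject.
        layers.Flatten(),
        layers.Dense(1024, activation="relu"),
        layers.BatchNormalization(),
        layers.Dropout(0.5),
        layers.Dense(classes, activation="softmax"),
    ])


print(spatial_size_after_pooling())  # 8
```

With these assumed pooling sizes, a 96 x 96 input shrinks to 8 x 8 feature maps before the dense layers, which keeps the classifier head reasonably small.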

Training the Network

The dataset is divided into a training set and a validation set: 80% of the face images (2160 images) are used to train the CNN model, while the remaining 20% (540 images) are held out as a validation set. The validation set is used to check for overfitting and to tune the learning rate, the number of epochs and the batch size.
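
The 80/20 split can be sketched as a shuffled partition of the image indices. The function name and the random seed are assumptions, and the article does not say whether the split was stratified by person, so this sketch uses a plain random shuffle.

```python
import numpy as np


def train_validation_split(n_samples, validation_fraction=0.2, seed=42):
    """Shuffle the sample indices and hold out a fraction of them."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_held_out = int(round(n_samples * validation_fraction))
    return order[n_held_out:], order[:n_held_out]


train_idx, val_idx = train_validation_split(2700)
print(len(train_idx), len(val_idx))  # 2160 540
```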

The model is trained for 10 epochs with a learning rate of 0.001 and a batch size of 32, using the ‘categorical_crossentropy’ loss function and the Adam optimizer.
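
To make the loss function less abstract, the sketch below computes categorical cross-entropy by hand with NumPy and collects the stated hyperparameters in one place. The constant and function names are assumptions made for this illustration. In Keras, the same settings would typically be passed as `loss="categorical_crossentropy"` together with an `Adam` optimizer whose `learning_rate` is 0.001.

```python
import math

import numpy as np

# Hyperparameters as stated in the article.
EPOCHS = 10
LEARNING_RATE = 0.001
BATCH_SIZE = 32
TRAIN_IMAGES = 2160


def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -sum(y_true * log(y_pred)) over a batch of one-hot labels."""
    y_true = np.asarray(y_true, dtype=np.float64)
    # Clip so that log(0) never occurs for confident wrong predictions.
    y_pred = np.clip(np.asarray(y_pred, dtype=np.float64), eps, 1.0)
    return float(np.mean(-np.sum(y_true * np.log(y_pred), axis=1)))


# One image whose true class is the second of three, predicted with 80% confidence.
loss = categorical_crossentropy([[0, 1, 0]], [[0.1, 0.8, 0.1]])

# Number of weight updates in one pass over the training set.
steps_per_epoch = math.ceil(TRAIN_IMAGES / BATCH_SIZE)
print(round(loss, 4), steps_per_epoch)  # 0.2231 68
```

The loss is simply the negative log of the probability assigned to the correct person, so it approaches 0 as the model becomes confident and correct.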

Looking at the training accuracy and loss for each epoch, the model is doing a good job by the 9th epoch, where it reaches a training accuracy of 100% and a very low training loss of 0.0031.

Moreover, the validation set reaches an accuracy of 98.15% at the 9th epoch, with a low validation loss of 0.0744. The model seems to be learning well and also performing well on the validation set.

Performance Evaluation

The two plots above show that the validation accuracy plateaus after the 9th epoch, while the loss decreases gradually for both the training and validation sets. This suggests that the network has fitted the training data very well while still generalizing to the validation set.

Model Evaluation on the Test Set

A test set of 100 unseen face images of the 25 subjects was pre-processed and used to make predictions with the trained Smaller VGGNet network. The predicted name for each image was compared with the true name, and the resulting confusion matrix is shown in Figure 13.

According to the confusion matrix, the majority of the predictions lie on the diagonal, meaning that they are correct. Only a few test images are misclassified.
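
The way a confusion matrix is read can be illustrated with a tiny example. The labels below are made up for three subjects and are not the article's test results; the function names are also assumptions.

```python
import numpy as np


def confusion_matrix(true_ids, predicted_ids, n_classes):
    """Count predictions: rows are true classes, columns are predicted classes."""
    matrix = np.zeros((n_classes, n_classes), dtype=np.int64)
    for true_id, predicted_id in zip(true_ids, predicted_ids):
        matrix[true_id, predicted_id] += 1
    return matrix


def accuracy(matrix):
    """Correct predictions sit on the diagonal."""
    return np.trace(matrix) / matrix.sum()


# Made-up labels for six test images of three subjects.
true_ids = [0, 0, 1, 1, 2, 2]
predicted_ids = [0, 0, 1, 2, 2, 2]

matrix = confusion_matrix(true_ids, predicted_ids, n_classes=3)
print(matrix)
print(accuracy(matrix))
```

Here one image of subject 1 is mistaken for subject 2, which shows up as the single off-diagonal entry.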

Furthermore, the performance of the model is measured for different input image dimensions (height, width and depth). The accuracies obtained on the training, validation and test data, together with the training time for each scenario, are presented in Table 1.

Table 1 shows that the larger the dimensions of the input image, the longer the training time. Of all these scenarios, 96 x 96 x 1 was selected as the best input dimension, since it gives high accuracy on all three of the training, validation and test data with a moderate training time.


The overall results were obtained by varying the height, width and depth (RGB or grayscale) of the training and test images. Grayscale images of size 96 x 96 achieved the best results, with a validation accuracy of 98.15%.

Moreover, the computational cost and the training time increase significantly with the number of channels (depth) and with the image height and width. The presented approach therefore keeps computational time low while achieving high accuracy. Although the original dataset is small, data augmentation techniques can be used to obtain enough training samples, and the results demonstrate that a CNN-based face recognition system can make accurate predictions when trained on augmented data.

About Deep Data Insight

Deep Data Insight are Artificial Intelligence experts. We make it our job to produce world-class AI solutions. You can find out more about what we do on our website or on our LinkedIn page.
