Algorithm Design

Authors: Yunbai Zhang, Tianjian Cheng, Wen-Lung Hsu, Chen Liu, Runhan Xu, Alice Lan

Chen Liu · 7 min read · May 27, 2021

Data introduction

We use the PetFinder dataset from Kaggle (https://www.kaggle.com/c/petfinder-adoption-prediction/data) to train our models. The dataset is over 2.3 GB and contains records of 18,967 pets and 72,776 photos of both cats and dogs. Every image is identified by a PetID and a photo ID.
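
A minimal sketch of how we can index the photos by pet (the extraction path and the `<PetID>-<photo id>.jpg` naming convention are assumptions about the downloaded archive):

```python
# Index all photos by PetID; paths and file-naming pattern are assumptions.
import os
from collections import defaultdict

IMAGE_DIR = "petfinder/train_images"  # assumed extraction path of the Kaggle archive

images_by_pet = defaultdict(list)
for fname in os.listdir(IMAGE_DIR):
    if fname.lower().endswith(".jpg"):
        pet_id = fname.split("-")[0]  # assumed "<PetID>-<photo id>.jpg" naming
        images_by_pet[pet_id].append(os.path.join(IMAGE_DIR, fname))

print(f"{len(images_by_pet)} pets, {sum(len(v) for v in images_by_pet.values())} photos")
```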

Model Design Architecture

Big Picture of Workflow

1. Face Extraction

For better performance in face verification, we need to extract the face region from our images to reduce noise.

In traditional object detection, people often use sliding-window detectors, which are much slower than YOLO. We therefore use YOLOv3, for its higher speed and accuracy, to extract the target pets' faces. YOLOv3 has a stronger feature extractor and a better object detector that uses feature-map upsampling and concatenation.

1.1 Evaluation of face extraction

1⃣️ Intersection over Union (IoU) = Area of Overlap / Area of Union. It measures how close the predicted box is to the ground-truth box. We define the detection outcomes as follows:

IoU > 0.5 is a true positive: the predicted box covers the true box.

IoU < 0.5 is a false positive: the predicted box does not sufficiently cover the true box.

False negative: there is a true box, but YOLO does not find any part of it.

Ground Truth: Blue, Predicted: Yellow
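
For reference, here is a minimal IoU sketch (boxes are given as (x1, y1, x2, y2) corners; this is an illustration, not the YOLOv3 evaluation code):

```python
# Illustrative IoU between two axis-aligned boxes given as (x1, y1, x2, y2).
def iou(box_a, box_b):
    # Intersection rectangle
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Example: this prediction would count as a true positive at the 0.5 threshold
print(iou((10, 10, 60, 60), (15, 15, 65, 65)))  # ~0.68 > 0.5
```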

2⃣️ Interpolated Precision:

P_interp(r) is calculated at each recall level r by taking the maximum precision measured at any recall r′ ≥ r. Interpolation smooths the zig-zag of the raw precision curve.

The IoU threshold determines how we label TPs and FPs. Therefore, if average precision is calculated with an IoU threshold of 0.5, it is called Average Precision@0.5. In this project, we use Average Precision@0.5.
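
A small sketch of the computation, assuming the detections have already been labelled TP/FP at the 0.5 IoU threshold and the precision/recall points are sorted by increasing recall:

```python
import numpy as np

def average_precision(recall, precision):
    """AP as the area under the interpolated precision-recall curve."""
    recall = np.asarray(recall, dtype=float)
    precision = np.asarray(precision, dtype=float)
    # p_interp(r): maximum precision at any recall >= r
    p_interp = np.maximum.accumulate(precision[::-1])[::-1]
    # Integrate the interpolated curve over the recall steps
    r = np.concatenate(([0.0], recall))
    return float(np.sum((r[1:] - r[:-1]) * p_interp))

print(average_precision([0.2, 0.4, 0.6, 0.8], [1.0, 0.9, 0.7, 0.75]))  # 0.68
```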

1.2 Conclusion of work

  1. Applied transfer learning to YOLOv3.
  2. Initialized the network with Darknet weights pre-trained on a non-COCO dataset.
  3. Trained the last 3 layers of YOLOv3 for single-class (pet face) detection for 51 epochs on 90 samples, validating on 10 samples.
  4. Unfroze all layers and trained for another 51 epochs (the general freeze/unfreeze pattern is sketched below).
  5. Applied the model to our Kaggle pet dataset, achieving an Average Precision@0.5 of around 87%.
  6. Fed the extracted face dataset into the following steps.
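
The freeze-then-unfreeze schedule in steps 3-4 follows the standard Keras transfer-learning pattern. The sketch below only illustrates that pattern with a stand-in backbone and placeholder loss, optimizer, and datasets; it is not our actual YOLOv3 training script:

```python
# Two-phase fine-tuning pattern in Keras; backbone, loss, and datasets are placeholders.
import tensorflow as tf

model = tf.keras.applications.MobileNetV2(weights="imagenet")  # stand-in for the YOLOv3 network

# Phase 1: freeze everything except the last few layers and train only the head
for layer in model.layers[:-3]:
    layer.trainable = False
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss="mse")
# model.fit(train_ds, validation_data=val_ds, epochs=51)

# Phase 2: unfreeze all layers and fine-tune end to end with a smaller learning rate
for layer in model.layers:
    layer.trainable = True
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4), loss="mse")
# model.fit(train_ds, validation_data=val_ds, epochs=51)
```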

2. Face Verification — Binary Classification Approach

Code Flow of Binary Classification Approach

In the binary classification workflow, we first embed the extracted face images, then sample image pairs and feed them into classification models that decide whether each pair of images comes from the same pet. We then evaluate the face verification model's performance; if any part underperforms, we refit until the performance is good enough.

2.1 Image Embedding

Image embedding is a feature-engineering technique: we take CNN-based transfer-learning models, cut off their final classification layers, and use the resulting output as image features.
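
A minimal sketch, using EfficientNetB0 as an example backbone (the same pattern applies to VGG16, ResNet, NasNet, and DenseNet): dropping the classification head and pooling the last feature map turns the pre-trained network into an embedder.

```python
# Use a pre-trained CNN without its classification head as an image embedder.
import numpy as np
import tensorflow as tf

embedder = tf.keras.applications.EfficientNetB0(
    weights="imagenet", include_top=False, pooling="avg")  # one 1280-d vector per image

def embed(image_paths, size=(224, 224)):
    batch = np.stack([
        tf.keras.applications.efficientnet.preprocess_input(
            tf.keras.utils.img_to_array(tf.keras.utils.load_img(p, target_size=size)))
        for p in image_paths])
    return embedder.predict(batch, verbose=0)  # shape: (n_images, 1280)
```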

The advantages of using transfer learning models include:

  1. They capture information from neighboring pixels, so they learn image features better.
  2. They are trained on large datasets, which gives better accuracy and performance.
  3. With a pre-trained model, we can freeze the parameters instead of retraining, which is friendly to limited computing power.

2.2 Sampling image pairs

The principles of sampling image pairs:

  • Randomly select two embedded images A and B in matrix format, with each column vector ai, bi representing a feature; then concatenate the A and B features and shuffle them (see the sketch below).
  • Train-test split rule: split by PetID. If we split by individual images instead, the model would see pets from the test set during training, causing an information leakage problem.
  • Control the ratio of image pairs from the same pet versus different pets to avoid an imbalanced data problem. We treat the ratio as a hyperparameter; after tuning, 1:1 worked best for the classification model.
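
A minimal sketch of the pair sampling, assuming `embeddings` maps each PetID to the list of embedding vectors of that pet's photos (the function and variable names are illustrative):

```python
# Build balanced same-pet (label 1) / different-pet (label 0) pairs from embeddings.
import random
import numpy as np

def make_pairs(embeddings, n_pairs, seed=0):
    rng = random.Random(seed)
    multi_photo = [p for p in embeddings if len(embeddings[p]) >= 2]
    pet_ids = list(embeddings)
    X, y = [], []
    for _ in range(n_pairs // 2):
        # Positive pair: two different photos of the same pet
        pid = rng.choice(multi_photo)
        a, b = rng.sample(embeddings[pid], 2)
        X.append(np.concatenate([a, b])); y.append(1)
        # Negative pair: one photo each from two different pets
        p1, p2 = rng.sample(pet_ids, 2)
        X.append(np.concatenate([rng.choice(embeddings[p1]), rng.choice(embeddings[p2])]))
        y.append(0)
    return np.array(X), np.array(y)

# The train/test split is done on PetIDs *before* pairing, so no pet leaks across sets.
```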

2.3 Classification

(1) Classification Methods and Results:

  • Classification methods: Logistic Regression, XGBoost, Random Forest, SVM
  • Embedding methods: VGG16, ResNet, NasNet, DenseNet, EfficientNet

After running the combinations of classification and embedding methods above, the best result on both training and test data comes from EfficientNet embeddings with an SVM classifier. There is a small gap between the training (0.95) and test (0.902) scores, which comes from our train-test splitting rule.
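
For completeness, a short sketch of the classification step, assuming `X_train`, `y_train`, `X_test`, and `y_test` come from the pair-sampling step above (the kernel and other hyperparameters are illustrative, not our tuned values):

```python
# Fit an SVM on the concatenated pair embeddings and score it with ROC AUC.
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

clf = SVC(kernel="rbf", probability=True)  # probability=True enables predict_proba
clf.fit(X_train, y_train)                  # X_*: pair features, y_*: same-pet labels
print(roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))
```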

The following tables display the embedding models we used and the corresponding results:

(2) Results Analysis:

After taking a deeper look at the classification results, we see several issues that might reduce confidence in our results:

  • Different facial expressions
  • Variation in background lighting
  • Front face not included
  • Blurred images

(3) Problems and Solutions:

The ROC score of the improved model is 0.90, an improvement of 0.03 over the original model.

3. Face Recognition Model — CNN Approach

3.1 Motivation

To learn good embeddings for many classes, as in face recognition, we use triplet loss. Faces of the same pet should lie close together and form a well-separated cluster. The goal of triplet loss is to minimize the distance between two embeddings with the same label and to maximize the distance between embeddings with different labels in the embedding space.

However, pair-based triplet loss converges slowly because of its high training complexity, and we indeed experienced extremely long training times with it. A newer loss function, Proxy Anchor Loss, was proposed in 2020; it overcomes the slow convergence and is robust to noisy labels and outliers, so we also experimented with it.
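
For reference, a minimal PyTorch sketch of the triplet objective (the margin and embedding size here are placeholders, not our tuned values):

```python
# Triplet loss: pull anchor-positive together, push anchor-negative apart by a margin.
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    d_ap = F.pairwise_distance(anchor, positive)  # same PetID
    d_an = F.pairwise_distance(anchor, negative)  # different PetID
    return F.relu(d_ap - d_an + margin).mean()

# Equivalent built-in: torch.nn.TripletMarginLoss(margin=0.2)
a, p, n = torch.randn(8, 256), torch.randn(8, 256), torch.randn(8, 256)
print(triplet_loss(a, p, n))
```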

3.2 Framework of CNN Model

(1) Three input images:

Anchor images: “the control group” image in the training set.

Positive images: share the same label with the anchor image. (same PetID)

Negative images: have different labels from the anchor image. (different PetIDs)

(2) Feed the anchor, positive, and negative images through the embedding CNN model and compute the loss (triplet loss or proxy anchor loss).

(3) Output the image embeddings.

3.3 Evaluation

To reflect real user experience, we choose Recall@k: the average positive rate over the top k returned photos. As long as at least one of the k returned photos matches the user's pet, we count the query as a positive case.
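
Our exact retrieval protocol may differ slightly, but a common way to compute Recall@k from embeddings is sketched below: a query counts as a hit if any of its k nearest neighbours (excluding itself) shares its PetID.

```python
# Recall@k: fraction of queries whose k nearest neighbours contain a same-pet photo.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def recall_at_k(embeddings, pet_ids, k):
    nn = NearestNeighbors(n_neighbors=k + 1).fit(embeddings)
    _, idx = nn.kneighbors(embeddings)  # idx[:, 0] is the query image itself
    pet_ids = np.asarray(pet_ids)
    hits = [(pet_ids[neigh[1:]] == pet_ids[i]).any() for i, neigh in enumerate(idx)]
    return float(np.mean(hits))
```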

3.4 Results

We use the inception_v2 architecture, and hyperparameter tuning gives an optimal embedding size of 256. We get the following results on the test set:

  • Dummy Embedding: a pre-trained VGG16 network applied directly to the image dataset; we use its output as the embedding and compute distances between image embeddings to find the target pets.
  • Random Guess: randomly choose an image and measure the probability that it matches the target class.

From the table, we can see that Proxy Anchor Loss performs best among all the methods, and as k increases to 100 the recall rate rises to 99%.

3.5 Improvements & Next Steps

  1. Increase the number of images: compared with the training-set results, the large gap between training and test performance shows that our network overfits. If we address the overfitting, there is a potential 20% gain in R@1 on the test set, so the next step is to feed in more data.
  2. Database too shallow and narrow: compared with a human face-recognition database, which usually has thousands more classes and 20 or more photos per class, our database is narrow and shallow. This is another point with potential for improvement.

Demo Showcase

Here are several examples of our app's outputs; the retrieved images are presented in descending order of confidence.

Output result 1
Output result 2

From output results 1 and 2, we can see that even though the face angle and background changed, our app still found the right photo.

Output result 3

From output result 3, we can see that even though the pet's expression changed and its eyes are closed, our app still found the right photo.

Failure Outputs

In the failure example above, the false-positive photo has very similar coat patterns and eye color to the query image; even humans might sometimes have a hard time telling them apart.

Summary

Comparison of two approaches
