*This post was originally published by Muhammad Fathy Rashad at Medium [AI]*

### Siamese Net, Triplet Loss, and Circle Loss explained.

When I was reading a machine learning paper about a new loss function, Circle Loss, I had a hard time understanding it. Searching Medium for an easier explanation gave me no results ☹️, since it's a recent paper from just this month, June 2020. So I decided to write this article to provide an easier explanation for other people and to document my own learning.

- Prerequisites
- Siamese Network
- Triplet Loss
- Circle Loss

- Machine Learning basics
- Convolutional Neural Networks (CNNs)

To understand Circle Loss, prior knowledge of neural networks, CNNs, Siamese Networks, and Triplet Loss is extremely helpful. I will briefly explain Siamese Networks and Triplet Loss in this article, but feel free to read more about neural networks and CNNs in other articles, as I will not cover those topics here.

One common problem you may face when implementing a classification solution is the lack of training data for one of the classes. For example, let’s say you want to create a face recognition system for your company that only has 5 employees. Training this network would require you to have a lot of images for each employee, while an organization usually would only have a single picture for each employee. Additionally, when a new employee is joining, you would need to retrain the model again to add a new class on your network. Hence, you want a solution that can differentiate a person with just a single input (image) of each class (person).

One-Shot Learning aims to solve these problems by requiring only one training example for each class. A Siamese Network is one architecture that can do one-shot learning. Let's see what the architecture looks like.

A Siamese Network consists of 2 identical Convolutional Networks, hence the name Siamese, as in identical (Siamese) twins. The idea is that, instead of giving the network a single image as input and trying to predict its class, you give it a pair of images and process each one through the ConvNets, which gives you the feature map, or embedding, of each image. Then we use a loss function that calculates the difference between these two embeddings to measure their similarity. Contrastive loss is one example of such a loss; it is computed from the distance between the 2 embeddings (the Euclidean distance in its usual formulation). Finally, we use a sigmoid function to transform the similarity score to a value between 0 and 1.

The similarity score will enable you to determine whether two images are similar or not. 0 means it is a different image, while 1 means a similar image. This, in turn, solves the previous issue, as you only need one image initially for each class to be used as the reference image.
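To make the pair-based idea concrete, here is a minimal sketch in plain Python. It assumes the embeddings have already been produced by the shared ConvNet and are given as lists of floats. The function names, the margin, and the threshold are hypothetical choices for illustration, and the sketch compares distances against a threshold instead of applying the sigmoid described above.

```python
import math

def euclidean_distance(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def contrastive_loss(emb1, emb2, same_class, margin=1.0):
    """Contrastive loss for a single pair of embeddings.

    same_class=True:  any distance between the pair is penalized.
    same_class=False: the pair is penalized only if it is closer than `margin`.
    """
    d = euclidean_distance(emb1, emb2)
    if same_class:
        return d ** 2
    return max(margin - d, 0.0) ** 2

def is_same_person(reference_emb, query_emb, threshold=0.5):
    """One-shot check: compare a query against a single stored reference.

    `threshold` is an assumed value; in practice it would be tuned on held-out pairs.
    """
    return euclidean_distance(reference_emb, query_emb) < threshold

# Made-up 2-D embeddings, for illustration only.
reference = [0.0, 0.0]
close_query = [0.1, 0.2]
far_query = [3.0, 4.0]
print(is_same_person(reference, close_query))
print(is_same_person(reference, far_query))
```

Because only a distance between embeddings is needed at inference time, adding a new employee means storing one more reference embedding and does not require retraining.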

To learn more about Siamese Networks, you can read another article by Harshal Lamba here.

In 2015, Google introduced the Triplet Loss function for face recognition in the *FaceNet: A Unified Embedding for Face Recognition and Clustering* paper.

In Triplet Loss, during training time, instead of taking two inputs, it will take three inputs which are the Anchor, the Positive, and the Negative. The Anchor will be the reference input that can be any input, the positive will be an input that has the same class as the anchor, while negative must be input with a different class from the anchor. For example, in the case of face recognition, the inputs can be like the image below with anchor and positive showing the same person and negative showing a different person.

Then, we calculate the embedding of each image by passing all three through the same CNN (same weights). Next, we pass these three embeddings to the Triplet Loss function. Below is a visualization of the embeddings.

The idea behind the Triplet Loss function is that we push or maximize the distance between the anchor and the negative and also pull or minimize the distance between the anchor and the positive embedding.

To do this, we calculate the difference between the anchor and the positive using some distance function *d*. We denote this as *d(a, p)*, which ideally should be low. We also calculate the difference between the anchor and the negative, denoted *d(a, n)*, which ideally should be high. Therefore, to get a correct prediction, we always want the value *d(a, p)* to be less than *d(a, n)*. Mathematically, we can write this as

*d(a, p) − d(a, n) < 0*

However, as we don't want our loss to be a negative value, we make it zero if it is negative. Hence, we can define the loss function to be

*max(d(a, p) − d(a, n), 0)*

The problem with this equation is that when the positive and the negative are at the same distance from the anchor, or when the positive is only slightly closer to the anchor than the negative, the loss is zero and there is no correction, even though the model should still be pulling the positive closer and pushing the negative further from the anchor. To solve this, we add a constant to *d(a, p) − d(a, n)* and call this constant the **margin**. Therefore, we can finally define the triplet loss function as

*L(a, p, n) = max(d(a, p) − d(a, n) + margin, 0)*
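The margin-based triplet loss described above can be sketched in a few lines of plain Python. This is an illustrative sketch and not the FaceNet implementation: it assumes Euclidean distance for *d*, takes precomputed embeddings as lists of floats, and the function names and default margin are hypothetical.

```python
import math

def euclidean_distance(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(d(a, p) - d(a, n) + margin, 0) for a single triplet."""
    d_ap = euclidean_distance(anchor, positive)
    d_an = euclidean_distance(anchor, negative)
    return max(d_ap - d_an + margin, 0.0)

# Made-up embeddings: the positive is only slightly closer than the negative.
anchor = [0.0, 0.0]
positive = [3.0, 4.0]   # d(a, p) = 5.0
negative = [0.0, 5.5]   # d(a, n) = 5.5

print(triplet_loss(anchor, positive, negative, margin=0.0))  # 0.0: no correction without a margin
print(triplet_loss(anchor, positive, negative, margin=1.0))  # 0.5: the margin keeps the loss positive
```

The two calls show the effect of the margin: without it, a positive that is barely closer than the negative produces no training signal.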

Using the formula, we can categorize the triplets into 3 types:

- **Easy triplets**: triplets which have a loss of 0, because *d(a,p) + margin < d(a,n)*
- **Hard triplets**: triplets where the negative is closer to the anchor than the positive, i.e. *d(a,n) < d(a,p)*
- **Semi-hard triplets**: triplets where the negative is not closer to the anchor than the positive, but which still have positive loss: *d(a,p) < d(a,n) < d(a,p) + margin*
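The three categories can be expressed as a small helper. This is a hypothetical function written for illustration; it takes the two distances directly, and it resolves the boundary cases that the definitions above leave open (equal distances are treated as semi-hard, and a loss of exactly zero is treated as easy).

```python
def classify_triplet(d_ap, d_an, margin):
    """Label a triplet as 'easy', 'hard', or 'semi-hard' from its two distances.

    d_ap: distance between anchor and positive
    d_an: distance between anchor and negative
    """
    if d_ap + margin <= d_an:
        # Loss max(d_ap - d_an + margin, 0) is zero.
        return "easy"
    if d_an < d_ap:
        # The negative is closer to the anchor than the positive.
        return "hard"
    # The negative is further than the positive, but still inside the margin.
    return "semi-hard"

print(classify_triplet(d_ap=1.0, d_an=2.0, margin=0.5))
print(classify_triplet(d_ap=1.0, d_an=0.5, margin=0.5))
print(classify_triplet(d_ap=1.0, d_an=1.2, margin=0.5))
```

A helper like this is mainly useful when selecting which triplets to feed into training, since easy triplets produce no update.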

You can see that the margin prevents the loss from being zero when the positive is closer to the anchor than the negative, but only by a small amount. Ideally, you would train the model using hard negatives and avoid easy negatives, as easy ones give zero loss and so contribute nothing to the update.

To understand Triplet Loss more deeply, I recommend reading Olivier Moindrot's explanation here.

But why should we use triplet loss instead of the contrastive loss from the previous Siamese Network (where you just calculate the distance between an anchor and another image)? With contrastive loss, each pair only lets you update the weights in one of two ways: either minimize the similarity of a different-class pair or maximize the similarity of a same-class pair. With Triplet Loss, on the other hand, the model both pulls the positive toward the anchor and pushes the negative away from the anchor at the same time.

After understanding Triplet Loss, we can finally start looking at Circle Loss which was introduced in the paper, Circle Loss: A Unified Perspective of Pair Similarity Optimization.

The paper claims that previous related work has inflexible optimization. Consider the example below, where 2 pairs have the same margin between positive and negative, but one pair is close to the anchor while the other is much further from it. Triplet Loss sees these 2 pairs as the same and optimizes both equally, pulling the positive closer and pushing the negative further away with the same strength.

*Sp is the within-class similarity score, and Sn is the between-class similarity score. Note that similarity moves inversely to distance: a higher similarity score means a smaller distance.*

However, this is not optimal. When the positive is already quite close to the anchor, we want to focus more on pushing the negative away, and when both the positive and the negative are already far away, we want to focus more on pulling the positive closer to the anchor. This is what Circle Loss does.

Circle Loss gives a more flexible optimization by applying a different penalty strength to each similarity score, *Sn* and *Sp*. Hence, we generalize *(Sn − Sp)* to *(αn·Sn − αp·Sp)*, where *αn* and *αp* are independent weighting factors, allowing *Sn* and *Sp* to learn at different paces.

Additionally, Circle Loss gives a more definite convergence point. Triplet Loss views both pairs above as equally optimal. Circle Loss, in contrast, prefers a pair that is not too close to the anchor (as that would mean the negative is close to the anchor as well) and not too far from the anchor (as the positive would then be too far from the anchor).

In this example, both T and T’ have the same margin, so other losses such as triplet loss find this ambiguous. Circle Loss, however, prefers T and creates a definite target for convergence. The decision boundary also changes from *sp − sn = m* to *(αn·sn − αp·sp) = m*, which gives it a circular shape, hence the name Circle Loss.

Moreover, Circle Loss introduces a unified perspective for learning with pair-wise labels and class-level labels. It has a unified formula that degenerates to triplet loss or classification loss with slight modification.

Given class-level labels, it degenerates to a classification loss.

Given pair-wise labels, it degenerates to Triplet Loss.
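The pair-wise case can be checked numerically. The sketch below assumes the unified loss takes the form given in the Circle Loss paper, log[1 + Σi Σj exp(γ(sn_j − sp_i + m))], over all within-class scores sp_i and between-class scores sn_j. As the scale factor γ grows, this loss divided by γ approaches a triplet loss on the hardest pair, max(max(sn) − min(sp) + m, 0). The function names and example scores are hypothetical.

```python
import math

def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def unified_loss(sp, sn, m, gamma):
    """log(1 + sum_i sum_j exp(gamma * (sn_j - sp_i + m))).

    sp: non-empty list of within-class similarity scores
    sn: non-empty list of between-class similarity scores
    """
    logits = [gamma * (s_n - s_p + m) for s_p in sp for s_n in sn]
    top = max(logits)
    # Stable log-sum-exp over all (positive, negative) combinations.
    lse = top + math.log(sum(math.exp(x - top) for x in logits))
    return softplus(lse)

def hardest_triplet_loss(sp, sn, m):
    """Triplet loss on similarity scores, using the hardest positive and negative."""
    return max(max(sn) - min(sp) + m, 0.0)

sp = [0.6, 0.8]   # made-up within-class similarities
sn = [0.5, 0.1]   # made-up between-class similarities
for gamma in (1.0, 10.0, 1000.0):
    print(gamma, unified_loss(sp, sn, m=0.25, gamma=gamma) / gamma)
print("triplet:", hardest_triplet_loss(sp, sn, m=0.25))
```

Note that this version works on similarity scores, so the roles are flipped relative to the distance-based triplet loss earlier: the loss shrinks as *sp* grows and *sn* shrinks.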

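Finally, the proposed Circle Loss, as given in the paper, is

$$\mathcal{L}_{circle} = \log\left[1 + \sum_{j} \exp\left(\gamma\,\alpha_n^j\,(s_n^j - \Delta_n)\right) \sum_{i} \exp\left(-\gamma\,\alpha_p^i\,(s_p^i - \Delta_p)\right)\right]$$

where the weighting factors are

$$\alpha_p^i = [O_p - s_p^i]_+, \qquad \alpha_n^j = [s_n^j - O_n]_+$$

and $[\cdot]_+$ clips negative values to zero.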

Note that we have 5 hyperparameters, Op, On, γ, ∆p, ∆n. However, we reduce the hyperparameters by setting Op = 1+m, On = −m, ∆p = 1−m, and ∆n = m. Hence, we only need to set the scale factor *γ* and margin *m*.
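Putting the pieces together, here is a minimal plain-Python sketch of Circle Loss for a single anchor, using the simplified hyperparameters above. It assumes the form of the loss from the paper: each positive score is weighted by αp = max(Op − sp, 0), each negative score by αn = max(sn − On, 0), and the loss is log[1 + Σj exp(γ·αn_j·(sn_j − ∆n)) · Σi exp(−γ·αp_i·(sp_i − ∆p))]. The function names, the default values of *m* and *γ*, and the example scores are illustrative assumptions and not tuned settings.

```python
import math

def logsumexp(xs):
    """Numerically stable log(sum(exp(x))) for a non-empty list."""
    top = max(xs)
    return top + math.log(sum(math.exp(x - top) for x in xs))

def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def circle_loss(sp, sn, m=0.25, gamma=64.0):
    """Circle Loss for one anchor.

    sp: non-empty list of within-class similarity scores
    sn: non-empty list of between-class similarity scores
    m, gamma: margin and scale factor (illustrative defaults)
    """
    # Simplified hyperparameters: Op = 1 + m, On = -m, delta_p = 1 - m, delta_n = m.
    o_p, o_n = 1.0 + m, -m
    delta_p, delta_n = 1.0 - m, m

    # Independent weighting factors, clipped at zero.
    alpha_p = [max(o_p - s, 0.0) for s in sp]
    alpha_n = [max(s - o_n, 0.0) for s in sn]

    logit_p = [-gamma * a * (s - delta_p) for a, s in zip(alpha_p, sp)]
    logit_n = [gamma * a * (s - delta_n) for a, s in zip(alpha_n, sn)]

    # log(1 + sum_n exp(logit_n) * sum_p exp(logit_p))
    return softplus(logsumexp(logit_n) + logsumexp(logit_p))

print(circle_loss(sp=[0.95], sn=[0.05]))  # well separated: small loss
print(circle_loss(sp=[0.30], sn=[0.70]))  # poorly separated: large loss
print(circle_loss(sp=[0.75], sn=[0.25]))  # a point on the decision boundary
```

For a single pair, substituting these settings into αn(sn − ∆n) − αp(sp − ∆p) = 0 gives sn² + (1 − sp)² = 2m², which is the circle the loss is named after. The last call uses a point on that circle for m = 0.25, where the loss equals log 2.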

For evaluation, Circle Loss was tested under two learning approaches, learning with class-level labels and learning with pair-wise labels, and it performed well under both.

For class-level labels, circle loss was evaluated on face recognition and person re-identification tasks.

For pair-wise labels, the circle loss was evaluated on fine-grained image retrieval tasks.

We can see that Circle Loss performed well on both tasks compared with other state-of-the-art methods.

In short, Circle Loss gives a more flexible optimization by applying a different penalty strength to each similarity score, which lets the scores learn at different paces and creates a more definite convergence point. Additionally, it uses a unified formula for learning with class-level labels and pair-wise labels.
