Abstract: In this paper we present a system, called FaceNet, that directly learns a mapping from face images to a compact Euclidean space where distances directly correspond to a measure of face similarity.


Keywords: face recognition, FaceNet, GoogLeNet


Migrated over from my old WordPress blog. I spent the previous year doing face recognition research, and looking back now I was "too young, too simple". Still, the experience helped me a lot: it made me see my actual level and recognize where I am weak, so it was useful after all.

Our method uses a deep convolutional network trained to directly optimize the embedding itself, rather than an intermediate bottleneck layer as in previous deep learning approaches.


To train, we use triplets of roughly aligned matching / non-matching face patches generated using a novel online triplet mining method.


Only 128 bytes per face.



The network is trained such that the squared L2 distances in the embedding space directly correspond to face similarity: faces of the same person have small distances and faces of distinct people have large distances.

The network outputs an embedding; in the embedding space, distance directly describes face similarity: images of the same person are close, images of different people are far apart.

Previous face recognition approaches based on deep networks use a classification layer [15, 17] trained over a set of known face identities and then take an intermediate bottleneck layer as a representation used to generalize recognition beyond the set of identities used in training.


The downsides of this approach are its indirectness and its inefficiency: one has to hope that the bottleneck representation generalizes well to new faces; and by using a bottleneck layer the representation size per face is usually very large (1000s of dimensions).


Some recent work [15] has reduced this dimensionality using PCA, but this is a linear transformation that can be easily learnt in one layer of the network.


FaceNet directly trains its output to be a compact 128-D embedding using a triplet-based loss function based on LMNN [19].

FaceNet directly trains its output to be a compact 128-D embedding, using a triplet loss function based on LMNN [19].

Our triplets consist of two matching face thumbnails and a non-matching face thumbnail, and the loss aims to separate the positive pair from the negative by a distance margin.


The thumbnails are tight crops of the face area; no 2D or 3D alignment is performed other than scale and translation.


Choosing which triplets to use turns out to be very important for achieving good performance and, inspired by curriculum learning [1], we present a novel online negative exemplar mining strategy which ensures consistently increasing difficulty of triplets as the network trains.


To improve clustering accuracy, we also explore hard-positive mining techniques which encourage spherical clusters for the embeddings of a single person.




Similarly to other recent works which employ deep networks [15, 17], our approach is a purely data driven method which learns its representation directly from the pixels of the face. Rather than using engineered features, we use a large dataset of labelled faces to attain the appropriate invariances to pose, illumination, and other variational conditions.



1. The first architecture is based on the Zeiler&Fergus [22] model, which consists of multiple interleaved layers of convolutions, non-linear activations, local response normalizations, and max pooling layers.

2. The second architecture is based on the Inception model of Szegedy et al., which was recently used as the winning approach for ImageNet 2014 [16].

We have found that these models can reduce the number of parameters by up to 20 times and have the potential to reduce the number of FLOPS required for comparable performance.



The works of [15, 17, 23] all employ a complex system of multiple stages, that combines the output of a deep convolutional network with PCA for dimensionality reduction and an SVM for classification:


Zhenyao et al. [23] employ a deep network to “warp” faces into a canonical frontal view and then learn a CNN that classifies each face as belonging to a known identity. For face verification, PCA on the network output in conjunction with an ensemble of SVMs is used.


Taigman et al. [17] propose a multi-stage approach that aligns faces to a general 3D shape model. A multi-class network is trained to perform the face recognition task on over four thousand identities.


Sun et al. [14, 15] propose a compact, and therefore relatively cheap to compute, network. They use an ensemble of 25 of these networks, each operating on a different face patch.


The verification loss is similar to the triplet loss we employ [12, 19], in that it minimizes the L2-distance between faces of the same identity and enforces a margin between the distance of faces of different identities.

The verification loss resembles the triplet loss we use [12, 19]: it minimizes the L2 distance within the same identity and enforces a margin between different identities.


Two different core architectures:

  1. The Zeiler&Fergus [22] style networks

  2. The recent Inception [16] type networks

Given the model details, and treating it as a black box (see Figure 2), the most important part of our approach lies in the end-to-end learning of the whole system.


To this end we employ the triplet loss that directly reflects what we want to achieve in face verification, recognition and clustering.

To this end, we use the triplet loss, which directly reflects what we want to achieve in face verification, recognition, and clustering.

Namely, we strive for an embedding f(x), from an image x into a feature space R^d, such that the squared distance between all faces of the same identity is small, whereas the squared distance between a pair of face images from different identities is large.

The embedding f(x) maps an image x into the d-dimensional feature space R^d.

The triplet loss, however, tries to enforce a margin between each pair of faces from one person to all other faces. This allows the faces for one identity to live on a manifold, while still enforcing the distance and thus discriminability to other identities.

The triplet loss tries to enforce a margin between positive pairs and negative pairs; this lets the faces of one identity live on a manifold, while still keeping a distance from, and hence discriminability against, other identities.

Triplet Loss

Here we want to ensure that an image of a specific person is closer to all other images of the same person than it is to any image of any other person.




\|x_i^a - x_i^p\|_2^2 + \alpha < \|x_i^a - x_i^n\|_2^2, \quad \forall\,(x_i^a, x_i^p, x_i^n) \in \mathcal{T} \qquad (1)
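The constraint above induces the hinge loss that is actually minimized during training. A minimal NumPy sketch (my own illustration under the paper's notation; the function name and the `(batch, d)` array shapes are assumptions, not the authors' code):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, alpha=0.2):
    """Hinge form of the triplet constraint: penalize triplets where the
    anchor-positive squared distance plus the margin alpha exceeds the
    anchor-negative squared distance. Inputs: (batch, d) embedding arrays."""
    pos_dist = np.sum((anchor - positive) ** 2, axis=1)  # ||x^a - x^p||^2
    neg_dist = np.sum((anchor - negative) ** 2, axis=1)  # ||x^a - x^n||^2
    return np.mean(np.maximum(pos_dist - neg_dist + alpha, 0.0))
```

A triplet that already satisfies Eq. (1) contributes zero loss, which is exactly why such triplets are wasted computation during training.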



Generating all possible triplets would result in many triplets that are easily satisfied (i.e. fulfill the constraint in Eq. (1)). These triplets would not contribute to the training and result in slower convergence, as they would still be passed through the network. It is crucial to select hard triplets, that are active and can therefore contribute to improving the model.


Triplet Selection

In order to ensure fast convergence it is crucial to select triplets that violate the triplet constraint in Eq. (1).





It is infeasible to compute the argmin and argmax across the whole training set.

Additionally, it might lead to poor training, as mislabelled and poorly imaged faces would dominate the hard positives and negatives. There are two obvious choices that avoid this issue:

Additionally, it might lead to poor training: mislabelled or poorly imaged faces would dominate the hard positives and hard negatives. There are two obvious choices to avoid these problems:

1. Generate triplets offline every n steps, using the most recent network checkpoint and computing the argmin and argmax on a subset of the data.


2. Generate triplets online. This can be done by selecting the hard positive/negative exemplars from within a mini-batch.

Obtain triplets online by selecting hard positive/negative exemplars within a mini-batch.

Here, we focus on the online generation and use large mini-batches in the order of a few thousand exemplars and only compute the argmin and argmax within a mini-batch.


To have a meaningful representation of the anchor-positive distances, it needs to be ensured that a minimal number of exemplars of any one identity is present in each mini-batch. In our experiments we sample the training data such that around 40 faces are selected per identity per mini-batch. Additionally, randomly sampled negative faces are added to each mini-batch.


Instead of picking the hardest positive, we use all anchor-positive pairs in a mini-batch while still selecting the hard negatives.
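Enumerating all anchor-positive pairs from a mini-batch's label vector is straightforward. The sketch below is my own illustration (the function name is assumed), not the authors' implementation:

```python
import numpy as np

def anchor_positive_pairs(labels):
    """All ordered (anchor, positive) index pairs within a mini-batch:
    same identity label, distinct indices. `labels` is a 1-D int array."""
    labels = np.asarray(labels)
    return [(a, p)
            for a in range(len(labels))
            for p in range(len(labels))
            if a != p and labels[a] == labels[p]]
```

With roughly 40 faces per identity per mini-batch, as described above, each identity contributes 40×39 such pairs, and a (hard or semi-hard) negative is then mined for each pair.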


We don’t have a side-by-side comparison of hard anchor-positive pairs versus all anchor-positive pairs within a mini-batch, but we found in practice that the all anchor-positive method was more stable and converged slightly faster at the beginning of training.


We also explored the offline generation of triplets in conjunction with the online generation and it may allow the use of smaller batch sizes, but the experiments were inconclusive.


Selecting the hardest negatives can in practice lead to bad local minima early on in training; specifically, it can result in a collapsed model (i.e. f(x) = 0). In order to mitigate this, it helps to select x_i^n such that

\|x_i^a - x_i^p\|_2^2 < \|x_i^a - x_i^n\|_2^2



We call these negative exemplars semi-hard, as they are further away from the anchor than the positive exemplar, but still hard because the squared distance is close to the anchor-positive distance. Those negatives lie inside the margin α.

We call these negatives semi-hard: they are farther from the anchor than the positive, but still hard, because their squared distance is close to the anchor-positive distance. These negatives lie inside the margin α!
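The semi-hard condition can be expressed as a boolean mask over the negatives of a mini-batch. A hypothetical NumPy sketch (the function name, and the choice to return the hardest qualifying negative, are my assumptions):

```python
import numpy as np

def semi_hard_negative(anchor, positive, negatives, alpha=0.2):
    """Pick a semi-hard negative for one (anchor, positive) pair:
    farther from the anchor than the positive (d_ap < d_an), but
    still inside the margin (d_an < d_ap + alpha). Among qualifying
    candidates, return the index with the smallest d_an, else None."""
    d_ap = np.sum((anchor - positive) ** 2)
    d_an = np.sum((negatives - anchor) ** 2, axis=1)
    mask = (d_an > d_ap) & (d_an < d_ap + alpha)
    if not mask.any():
        return None
    candidates = np.where(mask)[0]
    return candidates[np.argmin(d_an[candidates])]
```

Negatives closer than the positive are the "hardest" ones that risk model collapse, and negatives beyond the margin already satisfy the constraint; only the band in between is selected.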

As mentioned before, correct triplet selection is crucial for fast convergence.


On the one hand we would like to use small mini-batches as these tend to improve convergence during Stochastic Gradient Descent (SGD) [20].


On the other hand, implementation details make batches of tens to hundreds of exemplars more efficient.


The main constraint with regards to the batch size, however, is the way we select hard relevant triplets from within the mini-batches.

The main constraint on the batch size is how we select hard triplets from within the mini-batches.

In most experiments we use a batch size of around 1,800 exemplars.

Batch size: about 1,800 exemplars.

Deep Convolutional Networks

In all our experiments we train the CNN using Stochastic Gradient Descent (SGD) with standard backprop [8, 11] and AdaGrad [5]. In most experiments we start with a learning rate of 0.05 which we lower to finalize the model. The models are initialized from random, similar to [16], and trained on a CPU cluster for 1,000 to 2,000 hours. The decrease in the loss (and increase in accuracy) slows down drastically after 500h of training, but additional training can still significantly improve performance. The margin α is set to 0.2.
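The paper names AdaGrad [5] as the optimizer. For reference, a single AdaGrad parameter update (a textbook sketch, not the authors' training code; the learning rate default matches the 0.05 mentioned above):

```python
import numpy as np

def adagrad_step(params, grads, cache, lr=0.05, eps=1e-8):
    """One AdaGrad update. `cache` accumulates squared gradients, so each
    parameter's effective learning rate shrinks as its gradients accumulate."""
    cache += grads ** 2
    params -= lr * grads / (np.sqrt(cache) + eps)
    return params, cache
```

Parameters that receive large or frequent gradients are damped quickly, while rarely-updated parameters keep a comparatively large step size.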


The first category, shown in Table 1, adds 1×1×d convolutional layers, as suggested in [9], between the standard convolutional layers of the Zeiler&Fergus [22] architecture and results in a model 22 layers deep. It has a total of 140 million parameters and requires around 1.6 billion FLOPS per image.
The second category we use is based on GoogLeNet style Inception models [16]. These models have 20× fewer parameters (around 6.6M–7.5M) and up to 5× fewer FLOPS (between 500M and 1.6B).