Visualizing and Understanding CNN

Abstract: Notes on convolutional neural networks, translating the classic parts of papers I have read
Keywords: CNN Visualizing





In this paper we address both issues. We introduce a novel visualization technique that gives insight into the function of intermediate feature layers and the operation of the classifier.
Used in a diagnostic role, these visualizations allow us to find model architectures that outperform Krizhevsky et al. on the ImageNet classification benchmark.
We also perform an ablation study to discover the performance contribution from different model layers.
We show our ImageNet model generalizes well to other datasets: when the softmax classifier is retrained, it convincingly beats the current state-of-the-art results on the Caltech-101 and Caltech-256 datasets.




Several factors are responsible for this renewed interest in convnets:

(i) the availability of much larger training sets, with millions of labeled examples;

(ii) powerful GPU implementations, making the training of very large models practical; and

(iii) better model regularization strategies, such as Dropout (Hinton et al., 2012).

Without clear understanding of how and why they work, the development of better models is reduced to trial-and-error.

In this paper we introduce a visualization technique that reveals the input stimuli that excite individual feature maps at any layer in the model. It also allows us to observe the evolution of features during training and to diagnose potential problems with the model.

The visualization technique we propose uses a multi-layered Deconvolutional Network (deconvnet), as proposed by (Zeiler et al., 2011), to project the feature activations back to the input pixel space.

We also perform a sensitivity analysis of the classifier output by occluding portions of the input image, revealing which parts of the scene are important for classification

Using these tools, we start with the architecture of (Krizhevsky et al., 2012) and explore different architectures, discovering ones that outperform their results on ImageNet.

We then explore the generalization ability of the model to other datasets, just retraining the softmax classifier on top.

As such, this is a form of supervised pre-training, which contrasts with the unsupervised pre-training methods popularized by (Hinton et al., 2006) and others (Bengio et al., 2007; Vincent et al., 2008).


Our approach, by contrast, provides a non-parametric view of invariance, showing which patterns from the training set activate the feature map.



We use standard fully supervised convnet models throughout the paper, as defined by (LeCun et al., 1989) and (Krizhevsky et al., 2012).
(i) convolution of the previous layer output (or, in the case of the 1st layer, the input image) with a set of learned filters;

(ii) passing the responses through a rectified linear function (relu(x) = max(x, 0));

(iii) max pooling over local neighborhoods; and

(iv) a local contrast operation that normalizes the responses across feature maps.
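As a concrete (purely illustrative) reference, steps (i)-(iii) above can be sketched for a single channel and a single filter in NumPy; the function names are my own and the local contrast normalization of step (iv) is omitted:

```python
import numpy as np

def relu(x):
    # (ii) rectified linear non-linearity: relu(x) = max(x, 0)
    return np.maximum(x, 0.0)

def conv2d_valid(image, kernel):
    # (i) 2-D cross-correlation of the input with one learned filter
    # ("valid" mode: no padding)
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(x, size=2):
    # (iii) non-overlapping max pooling over local neighborhoods
    H2, W2 = x.shape[0] // size, x.shape[1] // size
    return x[:H2 * size, :W2 * size].reshape(H2, size, W2, size).max(axis=(1, 3))
```

A real convnet layer applies many filters across many input channels; this sketch only shows the per-channel arithmetic.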

The top few layers of the network are conventional fully-connected networks and the final layer is a softmax classifier.

Visualization with a Deconvnet

We present a novel way to map these activities back to the input pixel space, showing what input pattern originally caused a given activation in the feature maps.
In (Zeiler et al., 2011), deconvnets were proposed as a way of performing unsupervised learning. Here, they are not used in any learning capacity, just as a probe of an already trained convnet.
To examine a given convnet activation, we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layer.
Then we successively (i) unpool, (ii) rectify and (iii) filter to reconstruct the activity in the layer beneath that gave rise to the chosen activation. This is then repeated until input pixel space is reached.


In the convnet, the max pooling operation is non-invertible, however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variables. In the deconvnet, the unpooling operation uses these switches to place the reconstructions from the layer above into appropriate locations, preserving the structure of the stimulus. See Fig. 1(bottom) for an illustration of the procedure.
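The switch mechanism can be sketched as follows (a minimal single-channel version; function names are mine, not the paper's):

```python
import numpy as np

def max_pool_with_switches(x, size=2):
    # Forward pooling: for each pooling region, record the flat index of
    # the maximum ("switch") so the deconvnet can later undo the pooling.
    H2, W2 = x.shape[0] // size, x.shape[1] // size
    pooled = np.empty((H2, W2))
    switches = np.empty((H2, W2), dtype=int)
    for i in range(H2):
        for j in range(W2):
            region = x[i * size:(i + 1) * size, j * size:(j + 1) * size]
            switches[i, j] = region.argmax()
            pooled[i, j] = region.max()
    return pooled, switches

def unpool(pooled, switches, size=2):
    # Approximate inverse: place each value at the recorded max location,
    # preserving the structure of the stimulus; all other entries stay 0.
    H2, W2 = pooled.shape
    out = np.zeros((H2 * size, W2 * size))
    for i in range(H2):
        for j in range(W2):
            r, c = divmod(switches[i, j], size)
            out[i * size + r, j * size + c] = pooled[i, j]
    return out
```

Note the reconstruction is only approximate: every non-maximal position in each pooling region is filled with zero.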


The convnet uses relu non-linearities, which rectify the feature maps, thus ensuring the feature maps are always positive. To obtain valid feature reconstructions at each layer (which also should be positive), we pass the reconstructed signal through a relu non-linearity.


The convnet uses learned filters to convolve the feature maps from the previous layer. To invert this, the deconvnet uses transposed versions of the same filters, but applied to the rectified maps, not the output of the layer beneath.
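A sketch of this transposed filtering for one channel, assuming the forward pass is the valid cross-correlation shown earlier (the function name is illustrative):

```python
import numpy as np

def conv_transpose2d(fmap, kernel):
    # Deconvnet "filtering" step: the adjoint of a valid cross-correlation.
    # Each activation paints a copy of the filter into the reconstruction,
    # which is equivalent to a full convolution with the filter flipped
    # horizontally and vertically -- the paper's "transposed" filters.
    fh, fw = fmap.shape
    kh, kw = kernel.shape
    out = np.zeros((fh + kh - 1, fw + kw - 1))
    for i in range(fh):
        for j in range(fw):
            out[i:i + kh, j:j + kw] += fmap[i, j] * kernel
    return out
```

The output is larger than the input feature map, mapping activations back toward the spatial size of the layer beneath.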


Since the model is trained discriminatively, they implicitly show which parts of the input image are discriminative
Note that these projections are not samples from the model, since there is no generative process involved


Training Details

The architecture, shown in Fig. 3 (the first figure in this paper), is similar to that used by (Krizhevsky et al., 2012) for ImageNet classification.


The model was trained on the ImageNet 2012 training set (1.3 million images, spread over 1000 different classes). Each RGB image was preprocessed by resizing the smallest dimension to 256, cropping the center 256x256 region, subtracting the per-pixel mean (across all images) and then using 10 different sub-crops of size 224x224 (corners + center, with and without horizontal flips). Stochastic gradient descent with a mini-batch size of 128 was used to update the parameters, starting with a learning rate of 10^-2, in conjunction with a momentum term of 0.9. We anneal the learning rate throughout training manually when the validation error plateaus. Dropout (Hinton et al., 2012) is used in the fully connected layers (6 and 7) with a rate of 0.5.
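A single parameter update with those hyper-parameters can be sketched as below; the classic heavy-ball form of the momentum update is an assumption here, since the paper does not spell out its exact rule:

```python
def sgd_momentum_step(param, grad, velocity, lr=1e-2, momentum=0.9):
    # One SGD update with the quoted hyper-parameters:
    # learning rate 10^-2 and momentum term 0.9.
    velocity = momentum * velocity - lr * grad
    return param + velocity, velocity
```

Annealing the learning rate then simply means lowering `lr` when validation error plateaus.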
Visualization of the first layer filters during training reveals that a few of them dominate, as shown in Fig. 6(a). To combat this, we renormalize each filter in the convolutional layers whose RMS value exceeds a fixed radius of 10^-1 to this fixed radius.
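The renormalization step can be sketched as follows (illustrative NumPy, filters stacked along the first axis):

```python
import numpy as np

def renormalize_filters(filters, radius=1e-1):
    # Rescale every filter whose RMS value exceeds the fixed radius of
    # 10^-1 back to exactly that radius; other filters are left unchanged.
    out = filters.astype(float).copy()
    for idx in range(out.shape[0]):
        rms = np.sqrt(np.mean(out[idx] ** 2))
        if rms > radius:
            out[idx] *= radius / rms
    return out
```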


Convnet Visualization


Feature Visualization: Fig. 2 shows feature visualizations from our model once training is complete. However, instead of showing the single strongest activation for a given feature map, we show the top 9 activations.

Alongside these visualizations we show the corresponding image patches. These have greater variation than visualizations as the latter solely focus on the discriminant structure within each patch.

The projections from each layer show the hierarchical nature of the features in the network.

Feature Evolution during Training: Fig. 4 visualizes the progression during training of the strongest activation (across all training examples) within a given feature map projected back to pixel space.

Sudden jumps in appearance result from a change in the image from which the strongest activation originates.
The lower layers of the model can be seen to converge within a few epochs. However, the upper layers only develop after a considerable number of epochs (40-50), demonstrating the need to let the models train until fully converged.
Feature Invariance: Fig. 5 shows 5 sample images being translated, rotated and scaled by varying degrees while looking at the changes in the feature vectors from the top and bottom layers of the model, relative to the untransformed feature.


The network output is stable to translations and scalings.
In general, the output is not invariant to rotation, except for objects with rotational symmetry (e.g. entertainment center).


Architecture Selection

While visualization of a trained model gives insight into its operation, it can also assist with selecting good architectures in the first place.
The first layer filters are a mix of extremely high and low frequency information, with little coverage of the mid frequencies.
Additionally, the 2nd layer visualization shows aliasing artifacts caused by the large stride 4 used in the 1st layer convolutions.
To remedy these problems, we
(i) reduced the 1st layer filter size from 11x11 to 7x7
(ii) made the stride of the convolution 2, rather than 4.

This new architecture retains much more information in the 1st and 2nd layer features, as shown in Fig. 6(c) & (e). More importantly, it also improves the classification performance as shown in Section 5.1.


Occlusion Sensitivity

With image classification approaches, a natural question is if the model is truly identifying the location of the object in the image, or just using the surrounding context.
Fig. 7 attempts to answer this question by systematically occluding different portions of the input image with a grey square, and monitoring the output of the classifier.

When the occluder covers the image region that appears in the visualization, we see a strong drop in activity in the feature map.
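The occlusion sweep can be sketched as below; `classify` stands in for the trained model (here it should return the true-class probability), and all names and parameters are illustrative:

```python
import numpy as np

def occlusion_map(image, classify, patch=2, stride=2, fill=0.5):
    # Slide a grey square (value `fill`) over the image and record the
    # classifier's score at each occluder position; a sharp drop marks
    # image regions the prediction genuinely depends on.
    H, W = image.shape
    heat = []
    for top in range(0, H - patch + 1, stride):
        row = []
        for left in range(0, W - patch + 1, stride):
            occluded = image.copy()
            occluded[top:top + patch, left:left + patch] = fill
            row.append(classify(occluded))
        heat.append(row)
    return np.array(heat)
```

The resulting heat map is what Fig. 7 plots: classifier output as a function of occluder position.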
See Fig. 4 and Fig. 2.


Correspondence Analysis


Deep models differ from many existing recognition approaches in that there is no explicit mechanism for establishing correspondence between specific object parts in different images (e.g. faces have a particular spatial configuration of the eyes and nose).

We then measure the consistency of this difference vector between all related image pairs (i, j): Delta_l = sum over pairs i != j of H(sign(eps_i^l), sign(eps_j^l)), where H is Hamming distance and eps_i^l is the change in the layer-l feature vector of image i caused by the occlusion.
A lower value indicates greater consistency in the change resulting from the masking operation, hence tighter correspondence between the same object parts in different images (i.e. blocking the left eye).
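The consistency measure can be sketched directly from this description (reconstructed, not the paper's code; the function name is mine):

```python
import numpy as np

def sign_consistency(eps):
    # eps[i] is the difference vector for image i: feature vector with the
    # object part visible minus the feature vector with it occluded.
    # Sum the Hamming distance between sign patterns over all ordered
    # pairs (i, j), i != j; a lower total means the same occlusion (e.g.
    # blocking the left eye) perturbs the features more consistently.
    signs = np.sign(np.asarray(eps, dtype=float))
    n = len(signs)
    total = 0
    for i in range(n):
        for j in range(n):
            if i != j:
                total += int(np.count_nonzero(signs[i] != signs[j]))
    return total
```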

Table 1