
CBAM - Plug and Play Attention Module (with code)


Paper: CBAM: Convolutional Block Attention Module

Code: code

Contents

Preface

1. What is CBAM?

(1) Channel attention module (CAM)

(2) Spatial attention module (SAM)

(3) Combining CAM and SAM

2. Ablation experiments

(1) Channel attention

(2) Spatial attention

(3) Channel attention + spatial attention

3. Image classification

4. Object detection

Visualization

Code implementation

Summary


Preface

CBAM (Convolutional Block Attention Module) is a lightweight attention module proposed in 2018 that applies attention along both the spatial and the channel dimension. The paper adds the CBAM module to ResNet and MobileNet for comparison, runs experiments on applying the two attention sub-modules in sequence, and uses Grad-CAM visualization to show that with attention the network focuses more on the target object.

1. What is CBAM?

CBAM (Convolutional Block Attention Module) is a lightweight convolutional attention module that combines channel and spatial attention mechanism modules.

As can be seen in the figure above, CBAM contains two sub-modules, CAM (Channel Attention Module) and SAM (Spatial Attention Module), which apply attention along the channel and spatial dimensions, respectively. This keeps the parameter and computation overhead small and allows CBAM to be integrated as a plug-and-play module into existing network architectures.

As shown in the figure above, the pipeline consists of the input, the channel attention module, the spatial attention module, and the output. Given an input feature map F \in \mathbb{R}^{C \times H \times W}, the channel attention module first produces a one-dimensional attention map M_c \in \mathbb{R}^{C \times 1 \times 1}, which is multiplied element-wise with the input; the result is then fed into the spatial attention module, which produces a two-dimensional attention map M_s \in \mathbb{R}^{1 \times H \times W} that is again multiplied with its input to give the final refined feature.
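
Written out as formulas, this is the two-step refinement described in the paper, where \otimes denotes element-wise multiplication with the attention maps broadcast along the compressed dimensions:

F' = M_c(F) \otimes F

F'' = M_s(F') \otimes F'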

(1) Channel attention module (CAM)

The channel attention module keeps the channel dimension unchanged and compresses the spatial dimension. It focuses on the meaningful, "what" information in the input image (in a classification task, on what distinguishes the different categories).

Illustration: the input feature map is passed through two parallel branches, MaxPool and AvgPool, which reduce it from C*H*W to C*1*1. Each pooled result then goes through a shared MLP, which first compresses the channel number to 1/r of the original (r is the reduction ratio), applies a ReLU activation, and then expands it back to the original channel number. The two outputs are summed element-wise and passed through a sigmoid activation to obtain the channel attention map, and this map is multiplied with the original feature map to restore the C*H*W size.

The channel attention formula:
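
With \sigma denoting the sigmoid function and W_0, W_1 the weights of the shared MLP, the paper writes this as:

M_c(F) = \sigma\big( MLP(AvgPool(F)) + MLP(MaxPool(F)) \big) = \sigma\big( W_1(W_0(F^c_{avg})) + W_1(W_0(F^c_{max})) \big)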

The difference between CAM and SENet is the added parallel max-pooling branch, which extracts richer and more comprehensive high-level features. The paper also explains why this change was made.

AvgPool & MaxPool Comparison Experiment

In channel attention, Table 1 compares the different pooling choices and finds that the parallel avg & max pooling works best. A possible interpretation is that pooling discards information, and the parallel avg & max combination loses less information than a single pooling operation, so the results are slightly better.

(2) Spatial attention module (SAM)

The spatial attention module keeps the spatial dimension unchanged and compresses the channel dimension. It focuses on the location ("where") information of the target.

Illustration: the output of the channel attention stage is reduced to two 1*H*W feature maps by channel-wise max pooling and average pooling. The two maps are concatenated along the channel dimension, a 7*7 convolution (the experiments show 7*7 works better than 3*3) reduces them to a single-channel map, and a sigmoid turns that map into the spatial attention map. Finally, this map is multiplied with the input feature map to restore the C*H*W size.

Spatial attention formula:
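
With f^{7\times 7} denoting a 7*7 convolution and [;] channel-wise concatenation, the paper writes this as:

M_s(F) = \sigma\big( f^{7\times 7}([AvgPool(F); MaxPool(F)]) \big) = \sigma\big( f^{7\times 7}([F^s_{avg}; F^s_{max}]) \big)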

(3) Combining CAM and SAM

The channel attention and spatial attention modules can be combined in parallel or in series (in either order). The authors compared these arrangements experimentally and found that applying channel attention first and then spatial attention works slightly better. The results are as follows:

As can be seen from Table 3, based on the ResNet network, the two attention modules work best in the order channel attention followed by spatial attention.

2. Ablation experiments

(1) Channel attention

First, the different channel attention variants are compared: average pooling, max pooling, and both combined, each using a shared MLP for inference to save parameters. The results are as follows:

The differences in parameter count and memory cost are small, while the error rates make it clear that the combination of the two pooling operations is superior.

(2) Spatial attention

Comparing a 7*7 convolution kernel with a 3*3 convolution kernel, the 7*7 kernel works better.

(3) Channel attention + spatial attention

Comparing SENet, CAM and SAM in parallel, SAM followed by CAM, and CAM followed by SAM, the CAM followed by SAM arrangement works best.

3. Image classification

Comparison experiments were run with ResNet networks on the ImageNet-1K dataset.

4. Object detection

Datasets: MS COCO and VOC 2007.
As shown in the table below, on MS COCO CBAM brings a significant improvement in generalization performance over the baseline networks on the detection task.

As the table below shows, on VOC 2007 the StairNet detection framework is used, with SE and CBAM applied to the detectors. CBAM noticeably improves all of the strong baselines while adding a negligible number of parameters.

Visualization

The paper visualizes the different networks with Grad-CAM and finds that with CBAM added, the activated features cover more of the object to be recognized and the probability assigned to the correct class is higher, indicating that the attention mechanism does let the network learn to focus on the key information.

Code implementation


import torch
import torch.nn as nn


class CBAMLayer(nn.Module):
    def __init__(self, channel, reduction=16, spatial_kernel=7):
        super(CBAMLayer, self).__init__()

        # channel attention: compress H and W to 1
        self.max_pool = nn.AdaptiveMaxPool2d(1)
        self.avg_pool = nn.AdaptiveAvgPool2d(1)

        # shared MLP
        self.mlp = nn.Sequential(
            # a 1x1 Conv2d is used instead of Linear; it is more convenient here
            # (channel, channel // reduction, bias=False)
            nn.Conv2d(channel, channel // reduction, 1, bias=False),
            # inplace=True modifies the tensor in place to save memory
            nn.ReLU(inplace=True),
            # (channel // reduction, channel, bias=False)
            nn.Conv2d(channel // reduction, channel, 1, bias=False)
        )

        # spatial attention
        self.conv = nn.Conv2d(2, 1, kernel_size=spatial_kernel,
                              padding=spatial_kernel // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # channel attention: shared MLP on max-pooled and avg-pooled features
        max_out = self.mlp(self.max_pool(x))
        avg_out = self.mlp(self.avg_pool(x))
        channel_out = self.sigmoid(max_out + avg_out)
        x = channel_out * x

        # spatial attention: max and mean over the channel dimension
        max_out, _ = torch.max(x, dim=1, keepdim=True)
        avg_out = torch.mean(x, dim=1, keepdim=True)
        spatial_out = self.sigmoid(self.conv(torch.cat([max_out, avg_out], dim=1)))
        x = spatial_out * x
        return x


x = torch.randn(1, 1024, 32, 32)
net = CBAMLayer(1024)
y = net(x)
print(y.shape)
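
Since CBAM is plug-and-play, it is usually dropped into an existing block. As a rough sketch (my own illustration, not code from the paper or its official release; it reuses the imports and the CBAMLayer defined above), a simplified ResNet basic block could apply CBAM to the residual branch before the shortcut addition:

class BasicBlockWithCBAM(nn.Module):
    # simplified ResNet basic block (stride 1, equal in/out channels) with CBAM
    def __init__(self, channel):
        super().__init__()
        self.conv1 = nn.Conv2d(channel, channel, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channel)
        self.conv2 = nn.Conv2d(channel, channel, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channel)
        self.relu = nn.ReLU(inplace=True)
        self.cbam = CBAMLayer(channel)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = self.cbam(out)       # refine the residual branch with CBAM
        return self.relu(out + x)  # add the shortcut, then activate


block = BasicBlockWithCBAM(1024)
print(block(x).shape)  # torch.Size([1, 1024, 32, 32])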

Summary

  1. In CBAM, the authors experimented with the order in which the two attention mechanisms are applied and found that channel attention before spatial attention works best.
  2. In their experiments, the authors used Grad-CAM to visualize the feature maps. It is a very useful tool for inspecting which features of an image a network attends to in a classification task, and it helps explain why the model assigns the original image to a particular category.
  3. Adding the CBAM module will not necessarily improve the network's performance; depending on factors such as your own network and your data, it can even hurt. If the network already generalizes very well and your dataset is your own data rather than a benchmark, adding CBAM is not recommended. Although CBAM tends to bring a larger improvement than SE, it is by no means a module you can add blindly and expect a boost; whether to use it should be decided based on your data, your network and other factors.