纪录片评价建议
Criado em: 8 de janeiro de 2025
纪录片世上最不安的人好看吗
目前关于“《世上最不安的人》”这部纪录片的权威信息和观众评价相对有限,尚未成为广泛流传或讨论度较高的作品。如果您在一些视频网站或纪录片平台上看到了这个名字,建议可以先通过以下方式综合判断它是否值得观看:
查看简介和预告
了解创作背景
搜索观众反馈
关注影片主题
如果您找到了它的资源且对这个主题感兴趣,不妨先尝试观看一部分,再决定是否继续。也可以多搜寻一下有没有相关播出资讯或媒体报道,以便对影片做更充分的了解。若您看完后愿意分享体验,也能帮助更多人做判断。
世界上最不安的人
目前没有查到广泛传播或权威信息来源明确提及名为「世界上最不安的人」的纪录片、影视作品或图书。可能存在以下几种情况:
译名或别名差异
尚未正式发行或知名度较低
与其他同主题作品混淆
如果您是从其他渠道看到或听到「世界上最不安的人」的相关信息,建议尝试以下方式更进一步核实:
关注原始信息来源
搜索主题关键词
尝试原文或外文关键词
询问社区或平台
总之,目前并无主流或高知名度的作品专门以「世界上最不安的人」为正式片名。如果您希望了解与“焦虑”“不安”相关的社会或心理话题纪录片,以下几个已知作品可以参考:
如果您之后能确认「世界上最不安的人」确有其作,并找到更具体的信息来源,欢迎再来分享。这样不仅能帮助自己做观看判断,也可以让更多对该题材感兴趣的人获得参考。
Start Train
Epoch 1/100: 0%| | 0/2162 [00:00<?, ?it/s<class 'dict'>]
从您贴出的日志来看,模型已经开始训练,但在执行过程中出现了一个 TracerWarning:
```text
TracerWarning: Converting a tensor to a Python float might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future.
  attn_scores = attn_scores / math.sqrt(dim_k)
```
这一般说明在使用 PyTorch 的某些跟模型“跟踪”(trace)或导出(如 ONNX)相关的功能时,代码中有一处将 Tensor 转为了 Python float,导致 Tracer 不再能完整追踪到这个值的动态变化,而是把它当作了一个常量,从而可能造成推理或导出时行为不一致。
下面是一些可能的原因、影响,以及如何处理的思路。
从报错栈可以看出,代码里有:
```python
attn_scores = attn_scores / math.sqrt(dim_k)
```
其中 `dim_k` 很可能在运行时是一个 Tensor,却被 `math.sqrt(dim_k)` 这样直接用 Python 的 `math.sqrt` 处理。Python 的 `math` 模块只能处理纯 Python 的数字类型(int、float),于是 PyTorch 会先把 `dim_k` 从 Tensor 转为 Python float,再进行开方运算。
如果您只是普通训练,不涉及 tracing 或导出模型(如 `torch.jit.trace`、ONNX 导出等),这个警告本身并不会影响纯训练过程,但它是在提醒您:这一步在被追踪时会被固化成常量。

**`dim_k` 是什么?查看 `dim_k` 的定义**
在 WAFA_deeplabV3.py 第 144 行附近(或者在此之前的代码)找到 `dim_k` 的来源,看它是怎样赋值的。常见的可能是:
```python
dim_k = attn_scores.size(-1)  # 或者类似
```
这时候 `dim_k` 就是一个 Python int,可以直接用于 `math.sqrt(dim_k)`;若如此,理论上不应该出现警告。
**如果 `dim_k` 本身是一个 Tensor**
例如某些地方写了:
```python
dim_k = some_tensor.shape[-1]  # shape[-1] 通常返回 int,但有时是 Tensor?
```
或者 `dim_k` 并不单纯是一个 shape,而是某个动态张量的值。那么在后面直接 `math.sqrt(dim_k)` 就会触发这个警告。
如果 `dim_k` 依旧是一个 Tensor(且其数值代表标量),可以改用 PyTorch 的操作,而不是 Python 的 `math`。例如:
```python
import torch

attn_scores = attn_scores / torch.sqrt(dim_k)
```
这样 PyTorch 就可以在计算图中保留这步运算,不会因为把 Tensor 转成 Python float 而丢失跟踪信息。
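作为补充,下面给出一个可独立运行的小示例(仅为示意,函数名与张量形状均为演示用,和项目代码无关),直观对比 `math.sqrt` 与 `torch.sqrt` 在被 `torch.jit.trace` 追踪时的差别:

```python
import math
import torch

def scale_with_math(x, dim_k):
    # dim_k 是 0-D Tensor,math.sqrt 会把它转成 Python float,trace 时触发 TracerWarning
    return x / math.sqrt(dim_k)

def scale_with_torch(x, dim_k):
    # 始终使用张量运算,整步留在计算图内,不会因此报警
    return x / torch.sqrt(dim_k)

x = torch.randn(2, 4, 8)
dim_k = torch.tensor(8.0)

traced_bad = torch.jit.trace(scale_with_math, (x, dim_k))    # 控制台会打印 TracerWarning
traced_good = torch.jit.trace(scale_with_torch, (x, dim_k))  # 无该警告
```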
**将 `dim_k` 提前转为普通 Python 数值**
有时我们确实只想用它当作一个纯数字。这种情况下,如果 `dim_k` 本身就是“已知的标量”,可以在更上层(如初始化时)将其转成普通 Python float,再参与后续运算,这样也不会触发警告。举例:
```python
dim_k_value = dim_k.item()  # 如果 dim_k 是形如 [x] 的标量 Tensor
attn_scores = attn_scores / math.sqrt(dim_k_value)
```
但这么做也意味着后续导出时,这里会被视为常量,不再是动态可变的张量。
**`dim_k` 是否真需要是 Tensor**
很多时候,我们只是想根据 `attn_scores` 的通道数或形状做简单的缩放,那么 `dim_k` 通常是一个整数(或者说 Python int),这时可以在获取它的时候就把它转为 int,而不是把它保留为 Tensor。例如:
```python
dim_k = attn_scores.size(-1)  # 这样得到的就是 int
attn_scores = attn_scores / math.sqrt(dim_k)
```
如果这样写仍然出现警告,要仔细确认 `size()` 这一行是不是在哪个地方被改造成了 Tensor(比如被 `torch.tensor(...)` 包裹)。
训练过程中提示 <class 'dict'>
在您贴的日志最后一行出现:
Epoch 1/100: 0%| | 0/2162 [00:00<?, ?it/s<class 'dict'>]
这通常是因为某些进度条(tqdm)或日志打印把信息混杂在一起,并非真正的错误。可以查看一下 tqdm 或者训练日志的打印方式,确认日志是否正常输出。
其余超参数
日志中可以看到您配置的学习率、批次大小、训练集大小、网络骨干等都已经读取成功。如果训练没有报更严重的错误(如 shape 不匹配、显存溢出、梯度爆炸等),说明大概率可以正常进行。
是否需要担心 TracerWarning
- 如果只是普通训练、不涉及 trace 或导出:这个警告不影响训练本身,可以暂时保留现在的写法(`math.sqrt(dim_k)`)。
- 如果后续需要 `torch.jit.trace` 或 ONNX 导出:建议改用 PyTorch 原生操作(如 `torch.sqrt()`),或者在需要时把值明确地保存在 Python 层或 Tensor 层,避免出现警告或导出失败;也可以在合适的时机将其转为 Python int/float,并清楚这会“固定”某些动态值。

希望以上信息能帮助您理解并处理这个警告,让您的 DeeplabV3+ 模型继续顺利训练、验证以及可能的后续部署。若有更多报错或训练相关的问题,欢迎继续交流。
dim_k = K.shape[-1]
从您给出的代码片段
```python
dim_k = K.shape[-1]
```
来看,如果 K 是普通的 PyTorch 张量(`torch.Tensor`),那么 `K.shape` 通常会返回一个 tuple,其中的每个维度都是 Python int 而不是张量。因此,这种写法理论上取得的就是一个纯 Python int,直接用于 `math.sqrt(dim_k)` 并不会产生将 Tensor 转为 Python float 的 TracerWarning。
不过,如果您在某些场景(例如动态 shape、ONNX 导出、TorchScript 等)下,`K.shape` 的行为发生了变化,或者 K 不是普通的张量(可能是某种特殊包装类型)导致 `K.shape[-1]` 返回的是一个张量,那么就有可能触发 PyTorch 的 TracerWarning。下面是一些思路,帮您检查和解决这一问题。
**确认 `K` 的类型**
先确认以下两点:
- K 是否确实是一个常规的 `torch.Tensor`?可以打印 `type(K)`,看它是不是 `<class 'torch.Tensor'>`;如果 K 是某种特殊包装类型,`.shape` 返回的可能并不是常规的 Python tuple。
- K 的形状获取方式是否可能来自脚本化/导出环境?在 `torch.jit.trace` 或者 ONNX 导出时,有时获取张量形状的操作会被视为动态张量操作,而不是一个静态的 Python int。

在一般的 PyTorch 正常训练/推理里,`K.shape[-1]` 返回的应当就是一个 int,所以不会出现 TracerWarning。
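若想快速确认自己环境里 `K.shape[-1]` 返回的究竟是 int 还是 Tensor,可以用类似下面的自查片段(示意代码,张量形状随意设定):

```python
import torch

K = torch.randn(2, 8, 64)      # 任意示例张量
dim_k = K.shape[-1]

print(type(K))                 # 期望: <class 'torch.Tensor'>
print(type(dim_k), dim_k)      # 常规环境下期望: <class 'int'> 64
print(type(K.size(-1)))        # K.size(-1) 通常同样是 int
```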
万一在您的环境下,`K.shape[-1]` 返回了一个 0-D Tensor(即仅包含一个标量值的张量),则会导致类似:
```text
TracerWarning: Converting a tensor to a Python float might cause the trace to be incorrect. ...
```
这是因为 `math.sqrt(...)` 只能处理 Python float,而不是张量。解决方法一般有两种:其一是改用 PyTorch 原生算子:
```python
import torch

# 如果 dim_k 的确是一个 0-D Tensor
attn_scores = attn_scores / torch.sqrt(dim_k)
```
这样 PyTorch 可以在计算图中保留动态信息,而不会引发将张量转换为 Python float 的警告。
其二,如果您确定 `dim_k` 就是一个固定的整数(例如是某个通道维度、特征维度),并且后续不需要保持动态性,可以在取数时就转成 Python 类型:
```python
# 如果 dim_k 是 [x] 形状的标量张量
dim_k_value = dim_k.item()
attn_scores = attn_scores / math.sqrt(dim_k_value)
```
这样 PyTorch 会把它视为常量,也就没有动态可追踪的诉求;警告自然消失。不过,要清楚这样做会在脚本化/导出时将这个值“写死”,如果它本来应该是可变的,就会失去动态性。
常规 PyTorch 训练/推理场景
若 `K` 为普通张量,`K.shape[-1]` 一般就是一个 Python int。比如说:
```python
import math
import torch

K = torch.randn(2, 3, 4)   # (batch, channels, width)
dim_k = K.shape[-1]        # 这里 dim_k = 4, 是 int
val = math.sqrt(dim_k)     # 不会有任何警告
```
不会出错,也不会有任何 TracerWarning。
**在 TorchScript/ONNX 导出时,某些 dynamic shape 的处理**
可能出现 `shape` 变成了张量,或者某种符号类型,这时需要用 `torch.sqrt()` 等 PyTorch 原生算子来保持图的完整可追踪性。
**从张量中获取形状**
如果写法是 `dim_k = K.size(-1)`,在绝大多数情况下返回的也是 Python int,同理不会导致警告。
如果您是在自定义模块里写了:
```python
def forward(self, K):
    dim_k = K.shape[-1]
    ...
```
而后再用 `torch.jit.trace` 之类对这个函数做跟踪,一旦 `K.shape[-1]` 在 trace 时被视为动态值,PyTorch 可能会发出类似警告。另外,在其它框架(如 TensorFlow)中使用 `.shape[-1]` 时,返回的可能是一个 `tf.Tensor`(而不是 Python int),也会引发类似的“Python float 转换”问题,只是具体行为和 PyTorch 不同而已。

**小结:关于 `dim_k = K.shape[-1]`**
- 若 `K` 为 PyTorch 普通张量,`dim_k` 就是 Python int,一般不会触发警告。
- 若在您的环境下 `K.shape[-1]` 返回了 Tensor 而非 Python int,可通过改用 PyTorch 运算(如 `torch.sqrt()`)或提前转为 Python int/float 来消除警告。
- 只要确认 `K` 的类型和具体用途,然后根据上面的方法加以调整,就可以顺利解决或避免该 TracerWarning。
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
#------------------------------------------------#
#------------------------------------------------#
from nets.xception import xception
from nets.mobilenetv2 import mobilenetv2
class tongdao(nn.Module): #处理通道部分 函数名就是拼音名称
# 通道模块初始化,输入通道数为in_channel
def __init__(self, in_channel):
super().__init__()
self.avg_pool = nn.AdaptiveAvgPool2d(1) # 自适应平均池化,输出大小为1x1
self.fc = nn.Conv2d(in_channel, 1, kernel_size=1, bias=False) # 1x1卷积用于降维
self.relu = nn.ReLU(inplace=True) # ReLU激活函数,就地操作以节省内存
# 前向传播函数
def forward(self, x):
    b, c, _, _ = x.size()  # 提取批次大小和通道数
    y = self.avg_pool(x)   # 应用自适应平均池化
    y = self.fc(y)         # 应用1x1卷积
    y = self.relu(y)       # 应用ReLU激活
    y = nn.functional.interpolate(y, size=(x.size(2), x.size(3)), mode='nearest')  # 调整y的大小以匹配x的空间维度
    return x * y.expand_as(x)  # 将计算得到的通道权重应用到输入x上,实现特征重校准
class kongjian(nn.Module):
# 空间模块初始化,输入通道数为in_channel
def __init__(self, in_channel):
super().__init__()
self.Conv1x1 = nn.Conv2d(in_channel, 1, kernel_size=1, bias=False) # 1x1卷积用于产生空间激励
self.norm = nn.Sigmoid() # Sigmoid函数用于归一化
# 前向传播函数
def forward(self, x):
    y = self.Conv1x1(x)  # 应用1x1卷积
    y = self.norm(y)     # 应用Sigmoid函数
    return x * y         # 将空间权重应用到输入x上,实现空间激励
class hebing(nn.Module): #函数名为合并, 意思是把空间和通道分别提取的特征合并起来
# 合并模块初始化,输入通道数为in_channel
def __init__(self, in_channel):
super().__init__()
self.tongdao = tongdao(in_channel) # 创建通道子模块
self.kongjian = kongjian(in_channel) # 创建空间子模块
# 前向传播函数
def forward(self, U):
    U_kongjian = self.kongjian(U)  # 通过空间模块处理输入U
    U_tongdao = self.tongdao(U)    # 通过通道模块处理输入U
    return torch.max(U_tongdao, U_kongjian)  # 取两者的逐元素最大值,结合通道和空间激励
############################################
############################################
class MobileNetV2(nn.Module):
def __init__(self, downsample_factor=8, pretrained=True):
super(MobileNetV2, self).__init__()
from functools import partial
model = mobilenetv2(pretrained)
self.features = model.features[:-1]
textself.total_idx = len(self.features) self.down_idx = [2, 4, 7, 14] if downsample_factor == 8: for i in range(self.down_idx[-2], self.down_idx[-1]): self.features[i].apply(partial(self._nostride_dilate, dilate=2)) for i in range(self.down_idx[-1], self.total_idx): self.features[i].apply(partial(self._nostride_dilate, dilate=4)) elif downsample_factor == 16: for i in range(self.down_idx[-1], self.total_idx): self.features[i].apply(partial(self._nostride_dilate, dilate=2)) def _nostride_dilate(self, m, dilate): classname = m.__class__.__name__ if classname.find('Conv') != -1: if m.stride == (2, 2): m.stride = (1, 1) if m.kernel_size == (3, 3): m.dilation = (dilate // 2, dilate // 2) m.padding = (dilate // 2, dilate // 2) else: if m.kernel_size == (3, 3): m.dilation = (dilate, dilate) m.padding = (dilate, dilate) def forward(self, x): # 前4层 -> low_level_features low_level_features = self.features[:4](x) # 后面 -> deep_features x = self.features[4:](low_level_features) return low_level_features, x
############################################
############################################
def count_parameters(model):
"""统计模型参数数量"""
return sum(p.numel() for p in model.parameters() if p.requires_grad)
############################################
############################################
def window_partition(x, window_size):
"""
将 (B, C, H, W) -> (B*num_windows, C, window_size, window_size)
假设 H,W 能被 window_size 整除
"""
B, C, H, W = x.shape
x = x.view(B, C, H // window_size, window_size, W // window_size, window_size)
x = x.permute(0, 2, 4, 1, 3, 5).contiguous()
windows = x.view(-1, C, window_size, window_size)
return windows
def window_unpartition(windows, window_size, H, W):
"""
将 (B*num_windows, C, window_size, window_size) -> (B, C, H, W)
"""
num_windows = (H // window_size) * (W // window_size)
B = windows.size(0) // num_windows
C = windows.size(1)
x = windows.view(B, H // window_size, W // window_size, C, window_size, window_size)
x = x.permute(0, 3, 1, 4, 2, 5).contiguous()
x = x.view(B, C, H, W)
return x
############################################
############################################
def scaled_dot_product_attention(Q, K, V):
"""
Q, K, V: (B, heads, seq_len, dim_per_head)
返回: (B, heads, seq_len, dim_per_head)
"""
attn_scores = torch.matmul(Q, K.transpose(-2, -1)) # [B, heads, seq_len, seq_len]
dim_k = K.shape[-1]
attn_scores = attn_scores / math.sqrt(dim_k)
attn = F.softmax(attn_scores, dim=-1)
out = torch.matmul(attn, V) # [B, heads, seq_len, dim_per_head]
return out
############################################
############################################
class TransformerBlock(nn.Module):
"""
标准Transformer Encoder风格:
1) LayerNorm -> MultiHeadAttention -> Dropout -> Residual
2) LayerNorm -> FFN(MLP) -> Dropout -> Residual
不含位置编码,如需可自行加
"""
def __init__(self, dim, num_heads=4, mlp_ratio=4.0, attn_dropout=0.0, ffn_dropout=0.0):
super().__init__()
self.dim = dim
self.num_heads = num_heads
self.mlp_ratio = mlp_ratio
self.attn_dropout = attn_dropout
self.ffn_dropout = ffn_dropout
text#---- 注意力前的LN ---- self.norm1 = nn.LayerNorm(dim) # Q,K,V合并投影 self.qkv = nn.Linear(dim, dim*3, bias=True) self.attn_drop = nn.Dropout(attn_dropout) # 输出投影 self.proj = nn.Linear(dim, dim, bias=True) self.proj_drop = nn.Dropout(attn_dropout) #---- FFN 前的LN ---- self.norm2 = nn.LayerNorm(dim) hidden_dim = int(dim * mlp_ratio) self.ffn = nn.Sequential( nn.Linear(dim, hidden_dim), nn.GELU(), nn.Dropout(ffn_dropout), nn.Linear(hidden_dim, dim), nn.Dropout(ffn_dropout), ) def forward(self, x): """ x: (B, seq_len, dim) """ B, N, D = x.shape # (1) 自注意力子层 shortcut = x x_ln = self.norm1(x) qkv = self.qkv(x_ln).reshape(B, N, 3, self.num_heads, D//self.num_heads) qkv = qkv.permute(2, 0, 3, 1, 4) # (3, B, heads, seq_len, dim_ph) Q, K, V = qkv[0], qkv[1], qkv[2] # (B, heads, seq_len, dim_ph) # 多头注意力 attn_out = scaled_dot_product_attention(Q, K, V) attn_out = self.attn_drop(attn_out) # 合并heads attn_out = attn_out.permute(0, 2, 1, 3).contiguous() # => (B, seq_len, dim) attn_out = attn_out.view(B, N, D) # 输出投影 attn_out = self.proj(attn_out) attn_out = self.proj_drop(attn_out) # 残差 x = shortcut + attn_out # (2) FFN子层 shortcut2 = x x_ln2 = self.norm2(x) x_ffn = self.ffn(x_ln2) x = shortcut2 + x_ffn return x
############################################
############################################
class WindowTransformerBlock(nn.Module):
"""
- 将(B, C, H, W)划分为窗口
- 在每个窗口内, 将通道C视作embedding dim, (ws*ws)视作序列长度
- 送进 TransformerBlock
- 最终输出和输入维度相同 (in_channels不变)
"""
def __init__(self, in_channels, window_size=8, num_heads=4, mlp_ratio=4.0,
attn_dropout=0.0, ffn_dropout=0.0):
super().__init__()
self.in_channels = in_channels
self.window_size = window_size
textself.transformer = TransformerBlock( dim = in_channels, num_heads = num_heads, mlp_ratio = mlp_ratio, attn_dropout= attn_dropout, ffn_dropout = ffn_dropout ) def forward(self, x): """ x: (B, C, H, W) return: (B, C, H, W) """ B, C, H, W = x.shape ws = self.window_size # 1) 分窗口 => (B*nw, C, ws, ws) windows = window_partition(x, ws) # 2) => (B*nw, ws*ws, C) windows_reshape = windows.view(-1, C, ws*ws).permute(0, 2, 1).contiguous() # 3) TransformerBlock out = self.transformer(windows_reshape) # (B*nw, ws*ws, C) # 4) => (B*nw, C, ws, ws) out = out.permute(0, 2, 1).contiguous() out = out.view(-1, C, ws, ws) # 5) 拼回 (B, C, H, W) x_out = window_unpartition(out, ws, H, W) return x_out
############################################
############################################
class WA_ASPP(nn.Module):
"""
五条分支:
1) branch1: 1×1 conv => (320->256)
2) branch2/3/4: WindowTransformerBlock(320->320) + 1×1 conv(320->256) => 层次叠加
3) branch5: 全局平均池化(320->256)
最终拼接 => 1×1 conv => 256
"""
def __init__(self, dim_in, dim_out,
window_size=(4,8,16),
num_heads=4,
mlp_ratio=4.0,
attn_dropout=0.0,
ffn_dropout=0.0):
super().__init__()
text# branch1: 1×1 conv => [B, dim_out, H, W] self.branch1 = nn.Sequential( nn.Conv2d(dim_in, dim_out, kernel_size=1), nn.BatchNorm2d(dim_out), nn.ReLU(inplace=True) ) # branch2/3/4: WindowTransformerBlock (输入输出都是dim_in=320) self.branch2_attn = WindowTransformerBlock( in_channels = dim_in, window_size = window_size[0], num_heads = num_heads, mlp_ratio = mlp_ratio, attn_dropout = attn_dropout, ffn_dropout = ffn_dropout ) self.branch3_attn = WindowTransformerBlock( in_channels = dim_in, window_size = window_size[1], num_heads = num_heads, mlp_ratio = mlp_ratio, attn_dropout = attn_dropout, ffn_dropout = ffn_dropout ) self.branch4_attn = WindowTransformerBlock( in_channels = dim_in, window_size = window_size[2], num_heads = num_heads, mlp_ratio = mlp_ratio, attn_dropout = attn_dropout, ffn_dropout = ffn_dropout ) # 额外的 1×1 Conv:将分支2/3/4输出从 320 => 256 self.branch2_proj = nn.Conv2d(dim_in, dim_out, kernel_size=1, bias=False) self.branch3_proj = nn.Conv2d(dim_in, dim_out, kernel_size=1, bias=False) self.branch4_proj = nn.Conv2d(dim_in, dim_out, kernel_size=1, bias=False) # branch5: 全局平均池化 => 卷积(320->256) self.branch5_conv = nn.Conv2d(dim_in, dim_out, kernel_size=1) self.branch5_bn = nn.BatchNorm2d(dim_out) self.branch5_relu = nn.ReLU(inplace=True) # 拼接(5路) => 1×1 conv => dim_out self.fuse_conv = nn.Sequential( nn.Conv2d(dim_out * 5, dim_out, kernel_size=1,stride=1,padding=0,bias=True), nn.BatchNorm2d(dim_out), nn.ReLU(inplace=True) ) self.Hebing=hebing(in_channel=dim_out*5)# 整合通道和空间特征的合并模块 def forward(self, x): B, C, H, W = x.size() # branch1 => [B, 256, H, W] out1 = self.branch1(x) # branch2 => WindowAttn => [B, 320, H, W] -> 1x1Conv => [B,256,H,W] out2 = self.branch2_attn(x) out2 = self.branch2_proj(out2) out2_agg = out1 + out2 # branch3 out3 = self.branch3_attn(x) out3 = self.branch3_proj(out3) out3_agg = out1 + out2_agg + out3 # branch4 out4 = self.branch4_attn(x) out4 = self.branch4_proj(out4) out4_agg = out1 + out2_agg + out3_agg + out4 # branch5 => 全局池化 => 卷积=>bn=>relu => 上采样 => [B,256,H,W] gp = F.adaptive_avg_pool2d(x, 1) # (B, 320, 1, 1) gp = self.branch5_conv(gp) # => (B, 256, 1, 1) gp = self.branch5_bn(gp) gp = self.branch5_relu(gp) gp = F.interpolate(gp, size=(H, W), mode='bilinear', align_corners=False) # 拼接 [out1, out2_agg, out3_agg, out4_agg, gp] => [B, 256*5, H, W] out_cat = torch.cat([out1, out2_agg, out3_agg, out4_agg, gp], dim=1) # 应用合并模块进行通道和空间特征增强 larry = self.Hebing(out_cat) larry_feature_cat = larry * out_cat # 融合 => [B, 256, H, W] out = self.fuse_conv(larry_feature_cat) return out
############################################
############################################
class DeepLab(nn.Module):
def __init__(self, num_classes, backbone="mobilenet", pretrained=True, downsample_factor=16):
super(DeepLab, self).__init__()
if backbone == "xception":
self.backbone = xception(downsample_factor=downsample_factor, pretrained=pretrained)
in_channels = 2048
low_level_channels = 256
elif backbone == "mobilenet":
self.backbone = MobileNetV2(downsample_factor=downsample_factor, pretrained=pretrained)
#----------------------------------#
# mobilenetv2 输出deep_features: 320通道
# 浅层特征 low_level_features: 24通道
#----------------------------------#
in_channels = 320
low_level_channels = 24
else:
raise ValueError('Unsupported backbone - {}, Use mobilenet, xception.'.format(backbone))
text# 这就是我们自定义的多头窗口注意力ASPP self.aspp = WA_ASPP( dim_in = in_channels, # 320 dim_out = 256, window_size = (4,8,16), # 窗口大小 num_heads = 4, # 多头数 mlp_ratio = 4.0, # FFN扩展倍数 attn_dropout= 0.1, # 注意力dropout ffn_dropout = 0.1 # FFN dropout ) # 浅层特征 => 1×1卷积 => 48通道 self.shortcut_conv = nn.Sequential( nn.Conv2d(low_level_channels, 48, 1), nn.BatchNorm2d(48), nn.ReLU(inplace=True) ) # 拼接后 => 进一步卷积 self.cat_conv = nn.Sequential( nn.Conv2d(48 + 256, 256, 3, stride=1, padding=1), nn.BatchNorm2d(256), nn.ReLU(inplace=True), nn.Dropout(0.5), nn.Conv2d(256, 256, 3, stride=1, padding=1), nn.BatchNorm2d(256), nn.ReLU(inplace=True), nn.Dropout(0.1), ) # 最终1x1分类层 self.cls_conv = nn.Conv2d(256, num_classes, 1, stride=1) def forward(self, x): H, W = x.size(2), x.size(3) # 主干网络 => 浅层特征, 深层特征 low_level_features, deep_features = self.backbone(x) # 深层特征 => WA_ASPP deep_features = self.aspp(deep_features) # 浅层特征 => 1×1 conv => 48通道 low_level_features = self.shortcut_conv(low_level_features) # 将深层特征上采样到浅层特征大小,然后拼接 deep_features = F.interpolate(deep_features, size=(low_level_features.size(2), low_level_features.size(3)), mode='bilinear', align_corners=True) x_cat = torch.cat((deep_features, low_level_features), dim=1) x_cat = self.cat_conv(x_cat) # 最终分类头 x_cat = self.cls_conv(x_cat) x_cat = F.interpolate(x_cat, size=(H, W), mode='bilinear', align_corners=True) return x_cat
############################################
############################################
if __name__ == '__main__':
img = torch.randn(3, 3, 512, 512)
model = DeepLab(num_classes=3, backbone="mobilenet", pretrained=False, downsample_factor=16)
out = model(img)
print("输出维度:", out.shape) # 期望 (3, 3, 512, 512)
print("DeepLabV3+参数数量:", count_parameters(model))
用文字描述上述网络模型,分编码器和解码器两部分讲解,讲的详细一些,需要在论文中呈现
以下文字将对上述网络模型进行较为详细的说明,并按照**编码器(Encoder)与解码器(Decoder)**两部分来介绍其结构和工作原理,适合在论文或技术报告中呈现。
该网络整体基于 DeepLabV3+ 的思路,采用 MobileNetV2(或 Xception) 作为主干网络(Backbone),在此基础上用自定义的 WA_ASPP 模块来提取多尺度上下文信息,并通过编码器-解码器的结构完成对输入图像的语义分割。
在编码器部分的 ASPP 模块中,融合了 多头窗口注意力(WindowTransformerBlock) 与全局平均池化等多分支结构,以更好地捕捉不同尺度下的上下文信息。此外,模型还在通道和空间维度上分别进行注意力提取,并通过自定义的合并模块(hebing)来增强特征。下面将分别对编码器和解码器进行更详细的阐述。
MobileNetV2 作为轻量级网络,通过 Depthwise Separable Convolution 以及 Inverted Residual 等技术,在保持较高准确率的同时降低了模型复杂度。在这份代码中,主要对其做了以下处理:
**网络结构截断**:将 MobileNetV2 的 `features` 模块分段使用。其中,前 4 层输出的特征图记为 low_level_features,后续层输出的特征图记为 deep_features。low_level_features 一般维度较小(通道数少,空间分辨率相对大),在 DeepLabV3+ 结构中通常用作后续解码器的浅层特征;deep_features 则是主干网络下采样后的深层特征,其通道数相对较多(如 320 通道),空间分辨率更小。

**可变下采样率 (downsample_factor)**:若设置 `downsample_factor=8` 或 `downsample_factor=16`,则通过修改卷积层的 stride 和 dilation 来调整主干网络的最终下采样倍数,使输出特征图保持在相对更高的分辨率上进行推理,从而兼顾精度和速度。

整体而言,MobileNetV2 提供了从原始图像到深层特征的编码过程,输出两个关键张量:浅层的 low_level_features 与深层的 deep_features。
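为便于直观核对这两个关键张量的形状,可参考如下自检片段(假设可以直接导入上文 WAFA_deeplabV3.py 中的 MobileNetV2 封装类,形状数值为 512×512 输入、downsample_factor=16 时的预期值,仅作示意):

```python
import torch

backbone = MobileNetV2(downsample_factor=16, pretrained=False)  # 即上文代码中的主干封装类
x = torch.randn(1, 3, 512, 512)
low_level_features, deep_features = backbone(x)

print(low_level_features.shape)  # 预期约为 (1, 24, 128, 128):浅层特征,通道少、分辨率高
print(deep_features.shape)       # 预期约为 (1, 320, 32, 32):深层特征,通道多、分辨率低
```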
传统 DeepLabV3+ 中的 ASPP(空洞空间金字塔池化)通过多尺度空洞卷积和全局平均池化获取上下文信息。而在这里的 WA_ASPP 模块中,主要思路是:
该模块最终会拼接 5 条分支的输出,然后用额外的注意力整合模块(hebing)进行通道和空间维度的权重融合。
具体细节如下:

分支 1 (branch1):使用 1×1 卷积(配合 BatchNorm 与 ReLU)将输入深层特征从 320 通道压缩到 256 通道,保留逐像素的基础语义信息。

分支 2/3/4 (branch2_attn, branch3_attn, branch4_attn):分别以窗口大小 4、8、16 的多头窗口注意力(WindowTransformerBlock)在不同尺度的局部窗口内建模上下文,再经各自的 1×1 卷积(branch2/3/4_proj)投影到 256 通道;三条分支的输出按“层次叠加”的方式与前面分支的结果逐级相加,形成逐步增强的多尺度特征。

分支 5 (branch5):先利用全局平均池化(`F.adaptive_avg_pool2d`)将特征图压缩到 1×1,大幅抓取全局背景信息,再经 1×1 卷积、BatchNorm 与 ReLU 调整到 256 通道,最后通过双线性插值上采样回原特征图大小。

五路拼接 + 注意力融合:将五路输出沿通道维度拼接(共 256×5 通道),经 hebing 模块生成通道与空间注意力并与拼接特征相乘,再通过 1×1 卷积(`fuse_conv`)变换到 256 通道,完成对多尺度上下文信息的综合聚合。

综上,编码器(Encoder)部分输出的 deep_features,在通过 WA_ASPP 处理后,得到通道数为 256、空间尺寸与输入深层特征一致的、富含多尺度和注意力增强的语义特征。
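若需要在论文中更正式地表述分支 2/3/4 的“层次叠加”方式,可用如下公式示意(符号为本文为说明而引入:$X$ 为输入深层特征,$\mathrm{WT}_{w}$ 表示窗口大小为 $w$ 的 WindowTransformerBlock,$P_i$ 为对应分支的 1×1 投影,$F_1$ 为分支 1 的输出):

$$
\begin{aligned}
F_2 &= F_1 + P_2\big(\mathrm{WT}_{4}(X)\big),\\
F_3 &= F_1 + F_2 + P_3\big(\mathrm{WT}_{8}(X)\big),\\
F_4 &= F_1 + F_2 + F_3 + P_4\big(\mathrm{WT}_{16}(X)\big).
\end{aligned}
$$

该式与代码中 `out2_agg`、`out3_agg`、`out4_agg` 的累加逻辑一一对应,最终再与全局池化分支的输出一起拼接并经 hebing 与 `fuse_conv` 融合。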
DeepLabV3+ 的解码器阶段主要是将前面得到的深层特征与浅层特征(low_level_features)进行融合,以恢复空间细节并提升预测精度。具体做法如下:
浅层特征变换(shortcut_conv):对 low_level_features 做一次 1×1 卷积变换,将其从 24 通道提升到 48 通道,并跟进 BatchNorm + ReLU。

上采样深层特征 + 拼接:将经过 WA_ASPP 的深层特征以双线性插值上采样到与浅层特征相同的空间分辨率,再与 48 通道的浅层特征沿通道维度拼接。

融合卷积(cat_conv):对拼接后的特征依次通过两组 3×3 卷积 + BatchNorm + ReLU(中间配合 Dropout),进一步融合语义信息与空间细节。

分类头(cls_conv):用 1×1 卷积将通道数映射为 num_classes,再以双线性插值恢复到输入图像尺寸,即完成对每个像素的类别预测。

该网络整体结构既继承了 DeepLabV3+ 的主干-解码器思路,又在 ASPP 中融入多头注意力模块,有助于在保持相对较低参数量的同时,提升对多尺度目标及复杂场景的适应能力,可应用于各种语义分割任务。
编码器和解码器部分分别用一段话阐述
编码器(Encoder)部分以 MobileNetV2(或 Xception)为主干,通过逐层卷积和下采样提取多层次特征,其中浅层输出被称为“low_level_features”,深层输出被称为“deep_features”。在 deep_features 的末端,网络引入了自定义的 WA_ASPP 模块:它通过五条分支捕获多尺度信息,包括 1×1 卷积分支、全局平均池化分支,以及三个基于多头窗口注意力(WindowTransformerBlock)的分支。各分支提取的特征随后在通道维度拼接,并经由通道和空间注意力模块(hebing)进一步融合,输出多尺度且注意力增强的深层特征,为后续解码器提供高语义分辨率的特征表示。
解码器(Decoder)部分则将“low_level_features”和经过 WA_ASPP 的深层特征进行融合,以恢复空间细节:首先对 low_level_features 进行 1×1 卷积(shortcut_conv)扩充通道,然后将深层特征上采样到相同分辨率后与浅层特征拼接,并通过数次 3×3 卷积(cat_conv)提炼融合特征。最后,经过 1×1 卷积得到 num_classes 通道的预测图,并以双线性插值的方式还原至原图尺寸,从而完成对每个像素的分类。
import datetime
import os
from functools import partial
import torch.nn as nn
import numpy as np
import torch
import torch.backends.cudnn as cudnn
import torch.distributed as dist
import torch.optim as optim
from torch.utils.data import DataLoader
from nets.WAFA_deeplabV3 import DeepLab
from nets.deeplabv3_training import (get_lr_scheduler, set_optimizer_lr,
weights_init)
from utils.callbacks import EvalCallback, LossHistory
from utils.dataloader import DeeplabDataset, deeplab_dataset_collate
from utils.utils import (download_weights, seed_everything, show_config,
worker_init_fn)
from utils.utils_fit import fit_one_epoch
import argparse
'''
训练自己的语义分割模型一定需要注意以下几点:
1、训练前仔细检查自己的格式是否满足要求,该库要求数据集格式为VOC格式,需要准备好的内容有输入图片和标签
输入图片为.jpg图片,无需固定大小,传入训练前会自动进行resize。
灰度图会自动转成RGB图片进行训练,无需自己修改。
输入图片如果后缀非jpg,需要自己批量转成jpg后再开始训练。
标签为png图片,无需固定大小,传入训练前会自动进行resize。
由于许多同学的数据集是网络上下载的,标签格式并不符合,需要再度处理。一定要注意!标签的每个像素点的值就是这个像素点所属的种类。
网上常见的数据集总共对输入图片分两类,背景的像素点值为0,目标的像素点值为255。这样的数据集可以正常运行但是预测是没有效果的!
需要改成,背景的像素点值为0,目标的像素点值为1。
如果格式有误,参考:https://github.com/bubbliiiing/segmentation-format-fix
2、损失值的大小用于判断是否收敛,比较重要的是有收敛的趋势,即验证集损失不断下降,如果验证集损失基本上不改变的话,模型基本上就收敛了。
损失值的具体大小并没有什么意义,大和小只在于损失的计算方式,并不是接近于0才好。如果想要让损失好看点,可以直接到对应的损失函数里面除上10000。
训练过程中的损失值会保存在logs文件夹下的loss_%Y_%m_%d_%H_%M_%S文件夹中
3、训练好的权值文件保存在logs文件夹中,每个训练世代(Epoch)包含若干训练步长(Step),每个训练步长(Step)进行一次梯度下降。
如果只是训练了几个Step是不会保存的,Epoch和Step的概念要捋清楚一下。
'''
if __name__ == "__main__":
#---------------------------------#
# Cuda 是否使用Cuda
# 没有GPU可以设置成False
#---------------------------------#
Cuda = True
#----------------------------------------------#
# Seed 用于固定随机种子
# 使得每次独立训练都可以获得一样的结果
#----------------------------------------------#
seed = 11
#---------------------------------------------------------------------#
# distributed 用于指定是否使用单机多卡分布式运行
# 终端指令仅支持Ubuntu。CUDA_VISIBLE_DEVICES用于在Ubuntu下指定显卡。
# Windows系统下默认使用DP模式调用所有显卡,不支持DDP。
# DP模式:
# 设置 distributed = False
# 在终端中输入 CUDA_VISIBLE_DEVICES=0,1 python train.py
# DDP模式:
# 设置 distributed = True
# 在终端中输入 CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 train.py
#---------------------------------------------------------------------#
distributed = False
#---------------------------------------------------------------------#
# sync_bn 是否使用sync_bn,DDP模式多卡可用
#---------------------------------------------------------------------#
sync_bn = False
#---------------------------------------------------------------------#
# fp16 是否使用混合精度训练
# 可减少约一半的显存、需要pytorch1.7.1以上
#---------------------------------------------------------------------#
fp16 = False
#-----------------------------------------------------#
# num_classes 训练自己的数据集必须要修改的
# 自己需要的分类个数+1,如2+1
#-----------------------------------------------------#
num_classes = 3
#---------------------------------#
# 所使用的的主干网络:
# mobilenet
# xception
#---------------------------------#
backbone = ""
#----------------------------------------------------------------------------------------------------------------------------#
# pretrained 是否使用主干网络的预训练权重,此处使用的是主干的权重,因此是在模型构建的时候进行加载的。
# 如果设置了model_path,则主干的权值无需加载,pretrained的值无意义。
# 如果不设置model_path,pretrained = True,此时仅加载主干开始训练。
# 如果不设置model_path,pretrained = False,Freeze_Train = False,此时从0开始训练,且没有冻结主干的过程。
#----------------------------------------------------------------------------------------------------------------------------#
pretrained = False
#----------------------------------------------------------------------------------------------------------------------------#
# 权值文件的下载请看README,可以通过网盘下载。模型的 预训练权重 对不同数据集是通用的,因为特征是通用的。
# 模型的 预训练权重 比较重要的部分是 主干特征提取网络的权值部分,用于进行特征提取。
# 预训练权重对于99%的情况都必须要用,不用的话主干部分的权值太过随机,特征提取效果不明显,网络训练的结果也不会好
# 训练自己的数据集时提示维度不匹配正常,预测的东西都不一样了自然维度不匹配
#
# 如果训练过程中存在中断训练的操作,可以将model_path设置成logs文件夹下的权值文件,将已经训练了一部分的权值再次载入。
# 同时修改下方的 冻结阶段 或者 解冻阶段 的参数,来保证模型epoch的连续性。
#
# 当model_path = ''的时候不加载整个模型的权值。
#
# 此处使用的是整个模型的权重,因此是在train.py进行加载的,pretrain不影响此处的权值加载。
# 如果想要让模型从主干的预训练权值开始训练,则设置model_path = '',pretrain = True,此时仅加载主干。
# 如果想要让模型从0开始训练,则设置model_path = '',pretrain = False,Freeze_Train = False,此时从0开始训练,且没有冻结主干的过程。
#
# 一般来讲,网络从0开始的训练效果会很差,因为权值太过随机,特征提取效果不明显,因此非常、非常、非常不建议大家从0开始训练!
# 如果一定要从0开始,可以了解imagenet数据集,首先训练分类模型,获得网络的主干部分权值,分类模型的 主干部分 和该模型通用,基于此进行训练。
#----------------------------------------------------------------------------------------------------------------------------#
model_path = ""
#---------------------------------------------------------#
# downsample_factor 下采样的倍数8、16
# 8下采样的倍数较小、理论上效果更好。
# 但也要求更大的显存
#---------------------------------------------------------#
downsample_factor = 16
#------------------------------#
# 输入图片的大小
#------------------------------#
input_shape = [512, 512]
text#----------------------------------------------------------------------------------------------------------------------------# # 训练分为两个阶段,分别是冻结阶段和解冻阶段。设置冻结阶段是为了满足机器性能不足的同学的训练需求。 # 冻结训练需要的显存较小,显卡非常差的情况下,可设置Freeze_Epoch等于UnFreeze_Epoch,此时仅仅进行冻结训练。 # # 在此提供若干参数设置建议,各位训练者根据自己的需求进行灵活调整: # (一)从整个模型的预训练权重开始训练: # Adam: # Init_Epoch = 0,Freeze_Epoch = 50,UnFreeze_Epoch = 100,Freeze_Train = True,optimizer_type = 'adam',Init_lr = 5e-4,weight_decay = 0。(冻结) # Init_Epoch = 0,UnFreeze_Epoch = 100,Freeze_Train = False,optimizer_type = 'adam',Init_lr = 5e-4,weight_decay = 0。(不冻结) # SGD: # Init_Epoch = 0,Freeze_Epoch = 50,UnFreeze_Epoch = 100,Freeze_Train = True,optimizer_type = 'sgd',Init_lr = 7e-3,weight_decay = 1e-4。(冻结) # Init_Epoch = 0,UnFreeze_Epoch = 100,Freeze_Train = False,optimizer_type = 'sgd',Init_lr = 7e-3,weight_decay = 1e-4。(不冻结) # 其中:UnFreeze_Epoch可以在100-300之间调整。 # (二)从主干网络的预训练权重开始训练: # Adam: # Init_Epoch = 0,Freeze_Epoch = 50,UnFreeze_Epoch = 100,Freeze_Train = True,optimizer_type = 'adam',Init_lr = 5e-4,weight_decay = 0。(冻结) # Init_Epoch = 0,UnFreeze_Epoch = 100,Freeze_Train = False,optimizer_type = 'adam',Init_lr = 5e-4,weight_decay = 0。(不冻结) # SGD: # Init_Epoch = 0,Freeze_Epoch = 50,UnFreeze_Epoch = 120,Freeze_Train = True,optimizer_type = 'sgd',Init_lr = 7e-3,weight_decay = 1e-4。(冻结) # Init_Epoch = 0,UnFreeze_Epoch = 120,Freeze_Train = False,optimizer_type = 'sgd',Init_lr = 7e-3,weight_decay = 1e-4。(不冻结) # 其中:由于从主干网络的预训练权重开始训练,主干的权值不一定适合语义分割,需要更多的训练跳出局部最优解。 # UnFreeze_Epoch可以在120-300之间调整。 # Adam相较于SGD收敛的快一些。因此UnFreeze_Epoch理论上可以小一点,但依然推荐更多的Epoch。 # (三)batch_size的设置: # 在显卡能够接受的范围内,以大为好。显存不足与数据集大小无关,提示显存不足(OOM或者CUDA out of memory)请调小batch_size。 # 受到BatchNorm层影响,batch_size最小为2,不能为1。 # 正常情况下Freeze_batch_size建议为Unfreeze_batch_size的1-2倍。不建议设置的差距过大,因为关系到学习率的自动调整。 #----------------------------------------------------------------------------------------------------------------------------# #------------------------------------------------------------------# # 冻结阶段训练参数 # 此时模型的主干被冻结了,特征提取网络不发生改变 # 占用的显存较小,仅对网络进行微调 # Init_Epoch 模型当前开始的训练世代,其值可以大于Freeze_Epoch,如设置: # Init_Epoch = 60、Freeze_Epoch = 50、UnFreeze_Epoch = 100 # 会跳过冻结阶段,直接从60代开始,并调整对应的学习率。 # (断点续练时使用) # Freeze_Epoch 模型冻结训练的Freeze_Epoch # (当Freeze_Train=False时失效) # Freeze_batch_size 模型冻结训练的batch_size # (当Freeze_Train=False时失效) #------------------------------------------------------------------# Init_Epoch = 0 Freeze_Epoch = 50 Freeze_batch_size = 8 #------------------------------------------------------------------# # 解冻阶段训练参数 # 此时模型的主干不被冻结了,特征提取网络会发生改变 # 占用的显存较大,网络所有的参数都会发生改变 # UnFreeze_Epoch 模型总共训练的epoch # Unfreeze_batch_size 模型在解冻后的batch_size #------------------------------------------------------------------# UnFreeze_Epoch = 100 Unfreeze_batch_size = 2 #------------------------------------------------------------------# # Freeze_Train 是否进行冻结训练 # 默认先冻结主干训练后解冻训练。 #------------------------------------------------------------------# Freeze_Train = False #------------------------------------------------------------------# # 其它训练参数:学习率、优化器、学习率下降有关 #------------------------------------------------------------------# #------------------------------------------------------------------# # Init_lr 模型的最大学习率 # 当使用Adam优化器时建议设置 Init_lr=5e-4 # 当使用SGD优化器时建议设置 Init_lr=7e-3 # Min_lr 模型的最小学习率,默认为最大学习率的0.01 #------------------------------------------------------------------# Init_lr = 7e-3 Min_lr = Init_lr * 0.01 #------------------------------------------------------------------# # optimizer_type 使用到的优化器种类,可选的有adam、sgd # 当使用Adam优化器时建议设置 Init_lr=5e-4 
# 当使用SGD优化器时建议设置 Init_lr=7e-3 # momentum 优化器内部使用到的momentum参数 # weight_decay 权值衰减,可防止过拟合 # adam会导致weight_decay错误,使用adam时建议设置为0。 #------------------------------------------------------------------# optimizer_type = "sgd" momentum = 0.9 weight_decay = 1e-4 #------------------------------------------------------------------# # lr_decay_type 使用到的学习率下降方式,可选的有'step'、'cos' #------------------------------------------------------------------# lr_decay_type = 'cos' #------------------------------------------------------------------# # save_period 多少个epoch保存一次权值 #------------------------------------------------------------------# save_period = 5 #------------------------------------------------------------------# # save_dir 权值与日志文件保存的文件夹 #------------------------------------------------------------------# save_dir = 'logs_WAFA_deeplabV3+' #------------------------------------------------------------------# # eval_flag 是否在训练时进行评估,评估对象为验证集 # eval_period 代表多少个epoch评估一次,不建议频繁的评估 # 评估需要消耗较多的时间,频繁评估会导致训练非常慢 # 此处获得的mAP会与get_map.py获得的会有所不同,原因有二: # (一)此处获得的mAP为验证集的mAP。 # (二)此处设置评估参数较为保守,目的是加快评估速度。 #------------------------------------------------------------------# eval_flag = True eval_period = 5 #------------------------------------------------------------------# # VOCdevkit_path 数据集路径 #------------------------------------------------------------------# VOCdevkit_path = 'VOCdevkit' #------------------------------------------------------------------# # 建议选项: # 种类少(几类)时,设置为True # 种类多(十几类)时,如果batch_size比较大(10以上),那么设置为True # 种类多(十几类)时,如果batch_size比较小(10以下),那么设置为False #------------------------------------------------------------------# dice_loss = True #------------------------------------------------------------------# # 是否使用focal loss来防止正负样本不平衡 #------------------------------------------------------------------# focal_loss = True #------------------------------------------------------------------# # 是否给不同种类赋予不同的损失权值,默认是平衡的。 # 设置的话,注意设置成numpy形式的,长度和num_classes一样。 # 如: # num_classes = 3 # cls_weights = np.array([1, 2, 3], np.float32) #------------------------------------------------------------------# cls_weights = np.array([1, 2, 3], np.float32) #------------------------------------------------------------------# # num_workers 用于设置是否使用多线程读取数据,1代表关闭多线程 # 开启后会加快数据读取速度,但是会占用更多内存 # keras里开启多线程有些时候速度反而慢了许多 # 在IO为瓶颈的时候再开启多线程,即GPU运算速度远大于读取图片的速度。 #------------------------------------------------------------------# num_workers = 4 dims = [96, 120, 144] channels = [16, 32, 48, 48, 64, 64, 80, 80, 96, 96, 384] seed_everything(seed) #------------------------------------------------------# # 设置用到的显卡 #------------------------------------------------------# ngpus_per_node = torch.cuda.device_count() if distributed: dist.init_process_group(backend="nccl") local_rank = int(os.environ["LOCAL_RANK"]) rank = int(os.environ["RANK"]) device = torch.device("cuda", local_rank) if local_rank == 0: print(f"[{os.getpid()}] (rank = {rank}, local_rank = {local_rank}) training...") print("Gpu Device Count : ", ngpus_per_node) else: device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') local_rank = 0 rank = 0 #----------------------------------------------------# # 下载预训练权重 #----------------------------------------------------# if pretrained: if distributed: if local_rank == 0: download_weights(backbone) dist.barrier() else: download_weights(backbone) model = DeepLab(num_classes=3, backbone="mobilenet", pretrained=False, downsample_factor=16) if not pretrained: weights_init(model) if model_path != '': 
#------------------------------------------------------# # 权值文件请看README,百度网盘下载 #------------------------------------------------------# if local_rank == 0: print('Load weights {}.'.format(model_path)) #------------------------------------------------------# # 根据预训练权重的Key和模型的Key进行加载 #------------------------------------------------------# model_dict = model.state_dict() pretrained_dict = torch.load(model_path, map_location = device) load_key, no_load_key, temp_dict = [], [], {} for k, v in pretrained_dict.items(): if k in model_dict.keys() and np.shape(model_dict[k]) == np.shape(v): temp_dict[k] = v load_key.append(k) else: no_load_key.append(k) model_dict.update(temp_dict) model.load_state_dict(model_dict) #------------------------------------------------------# # 显示没有匹配上的Key #------------------------------------------------------# if local_rank == 0: print("\nSuccessful Load Key:", str(load_key)[:500], "……\nSuccessful Load Key Num:", len(load_key)) print("\nFail To Load Key:", str(no_load_key)[:500], "……\nFail To Load Key num:", len(no_load_key)) print("\n\033[1;33;44m温馨提示,head部分没有载入是正常现象,Backbone部分没有载入是错误的。\033[0m") #----------------------# # 记录Loss #----------------------# if local_rank == 0: time_str = datetime.datetime.strftime(datetime.datetime.now(),'%Y_%m_%d_%H_%M_%S') log_dir = os.path.join(save_dir, "loss_" + str(time_str)) loss_history = LossHistory(log_dir, model, input_shape=input_shape) else: loss_history = None #------------------------------------------------------------------# # torch 1.2不支持amp,建议使用torch 1.7.1及以上正确使用fp16 # 因此torch1.2这里显示"could not be resolve" #------------------------------------------------------------------# if fp16: from torch.cuda.amp import GradScaler as GradScaler scaler = GradScaler() else: scaler = None model_train = model.train() #----------------------------# # 多卡同步Bn #----------------------------# if sync_bn and ngpus_per_node > 1 and distributed: model_train = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model_train) elif sync_bn: print("Sync_bn is not support in one gpu or not distributed.") if Cuda: if distributed: #----------------------------# # 多卡平行运行 #----------------------------# model_train = model_train.cuda(local_rank) model_train = torch.nn.parallel.DistributedDataParallel(model_train, device_ids=[local_rank], find_unused_parameters=True) else: model_train = torch.nn.DataParallel(model) cudnn.benchmark = True model_train = model_train.cuda() #---------------------------# # 读取数据集对应的txt #---------------------------# with open(os.path.join(VOCdevkit_path, "VOC2007/ImageSets/Segmentation/train.txt"),"r") as f: train_lines = f.readlines() with open(os.path.join(VOCdevkit_path, "VOC2007/ImageSets/Segmentation/val.txt"),"r") as f: val_lines = f.readlines() num_train = len(train_lines) num_val = len(val_lines) if local_rank == 0: show_config( num_classes = num_classes, backbone = backbone, model_path = model_path, input_shape = input_shape, \ Init_Epoch = Init_Epoch, Freeze_Epoch = Freeze_Epoch, UnFreeze_Epoch = UnFreeze_Epoch, Freeze_batch_size = Freeze_batch_size, Unfreeze_batch_size = Unfreeze_batch_size, Freeze_Train = Freeze_Train, \ Init_lr = Init_lr, Min_lr = Min_lr, optimizer_type = optimizer_type, momentum = momentum, lr_decay_type = lr_decay_type, \ save_period = save_period, save_dir = save_dir, num_workers = num_workers, num_train = num_train, num_val = num_val ) #---------------------------------------------------------# # 总训练世代指的是遍历全部数据的总次数 # 总训练步长指的是梯度下降的总次数 # 每个训练世代包含若干训练步长,每个训练步长进行一次梯度下降。 # 此处仅建议最低训练世代,上不封顶,计算时只考虑了解冻部分 
#----------------------------------------------------------# wanted_step = 1.5e4 if optimizer_type == "sgd" else 0.5e4 total_step = num_train // Unfreeze_batch_size * UnFreeze_Epoch if total_step <= wanted_step: if num_train // Unfreeze_batch_size == 0: raise ValueError('数据集过小,无法进行训练,请扩充数据集。') wanted_epoch = wanted_step // (num_train // Unfreeze_batch_size) + 1 print("\n\033[1;33;44m[Warning] 使用%s优化器时,建议将训练总步长设置到%d以上。\033[0m"%(optimizer_type, wanted_step)) print("\033[1;33;44m[Warning] 本次运行的总训练数据量为%d,Unfreeze_batch_size为%d,共训练%d个Epoch,计算出总训练步长为%d。\033[0m"%(num_train, Unfreeze_batch_size, UnFreeze_Epoch, total_step)) print("\033[1;33;44m[Warning] 由于总训练步长为%d,小于建议总步长%d,建议设置总世代为%d。\033[0m"%(total_step, wanted_step, wanted_epoch)) #------------------------------------------------------# # 主干特征提取网络特征通用,冻结训练可以加快训练速度 # 也可以在训练初期防止权值被破坏。 # Init_Epoch为起始世代 # Interval_Epoch为冻结训练的世代 # Epoch总训练世代 # 提示OOM或者显存不足请调小Batch_size #------------------------------------------------------# if True: UnFreeze_flag = False #------------------------------------# # 冻结一定部分训练 #------------------------------------# if Freeze_Train: for param in model.backbone.parameters(): param.requires_grad = False #-------------------------------------------------------------------# # 如果不冻结训练的话,直接设置batch_size为Unfreeze_batch_size #-------------------------------------------------------------------# batch_size = Freeze_batch_size if Freeze_Train else Unfreeze_batch_size #-------------------------------------------------------------------# # 判断当前batch_size,自适应调整学习率 #-------------------------------------------------------------------# nbs = 16 lr_limit_max = 5e-4 if optimizer_type == 'adam' else 1e-1 lr_limit_min = 3e-4 if optimizer_type == 'adam' else 5e-4 if backbone == "xception": lr_limit_max = 1e-4 if optimizer_type == 'adam' else 1e-1 lr_limit_min = 1e-4 if optimizer_type == 'adam' else 5e-4 Init_lr_fit = min(max(batch_size / nbs * Init_lr, lr_limit_min), lr_limit_max) Min_lr_fit = min(max(batch_size / nbs * Min_lr, lr_limit_min * 1e-2), lr_limit_max * 1e-2) #---------------------------------------# # 根据optimizer_type选择优化器 #---------------------------------------# optimizer = { 'adam' : optim.Adam(model.parameters(), Init_lr_fit, betas = (momentum, 0.999), weight_decay = weight_decay), 'sgd' : optim.SGD(model.parameters(), Init_lr_fit, momentum = momentum, nesterov=True, weight_decay = weight_decay) }[optimizer_type] #---------------------------------------# # 获得学习率下降的公式 #---------------------------------------# lr_scheduler_func = get_lr_scheduler(lr_decay_type, Init_lr_fit, Min_lr_fit, UnFreeze_Epoch) #---------------------------------------# # 判断每一个世代的长度 #---------------------------------------# epoch_step = num_train // batch_size epoch_step_val = num_val // batch_size if epoch_step == 0 or epoch_step_val == 0: raise ValueError("数据集过小,无法继续进行训练,请扩充数据集。") train_dataset = DeeplabDataset(train_lines, input_shape, num_classes, True, VOCdevkit_path) val_dataset = DeeplabDataset(val_lines, input_shape, num_classes, False, VOCdevkit_path) if distributed: train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset, shuffle=True,) val_sampler = torch.utils.data.distributed.DistributedSampler(val_dataset, shuffle=False,) batch_size = batch_size // ngpus_per_node shuffle = False else: train_sampler = None val_sampler = None shuffle = True gen = DataLoader(train_dataset, shuffle = shuffle, batch_size = batch_size, num_workers = num_workers, pin_memory=True, drop_last = True, collate_fn = deeplab_dataset_collate, 
sampler=train_sampler, worker_init_fn=partial(worker_init_fn, rank=rank, seed=seed)) gen_val = DataLoader(val_dataset , shuffle = shuffle, batch_size = batch_size, num_workers = num_workers, pin_memory=True, drop_last = True, collate_fn = deeplab_dataset_collate, sampler=val_sampler, worker_init_fn=partial(worker_init_fn, rank=rank, seed=seed)) #----------------------# # 记录eval的map曲线 #----------------------# if local_rank == 0: eval_callback = EvalCallback(model, input_shape, num_classes, val_lines, VOCdevkit_path, log_dir, Cuda, \ eval_flag=eval_flag, period=eval_period) else: eval_callback = None #---------------------------------------# # 开始模型训练 #---------------------------------------# for epoch in range(Init_Epoch, UnFreeze_Epoch): #---------------------------------------# # 如果模型有冻结学习部分 # 则解冻,并设置参数 #---------------------------------------# if epoch >= Freeze_Epoch and not UnFreeze_flag and Freeze_Train: batch_size = Unfreeze_batch_size #-------------------------------------------------------------------# # 判断当前batch_size,自适应调整学习率 #-------------------------------------------------------------------# nbs = 16 lr_limit_max = 5e-4 if optimizer_type == 'adam' else 1e-1 lr_limit_min = 3e-4 if optimizer_type == 'adam' else 5e-4 if backbone == "xception": lr_limit_max = 1e-4 if optimizer_type == 'adam' else 1e-1 lr_limit_min = 1e-4 if optimizer_type == 'adam' else 5e-4 Init_lr_fit = min(max(batch_size / nbs * Init_lr, lr_limit_min), lr_limit_max) Min_lr_fit = min(max(batch_size / nbs * Min_lr, lr_limit_min * 1e-2), lr_limit_max * 1e-2) #---------------------------------------# # 获得学习率下降的公式 #---------------------------------------# lr_scheduler_func = get_lr_scheduler(lr_decay_type, Init_lr_fit, Min_lr_fit, UnFreeze_Epoch) for param in model.backbone.parameters(): param.requires_grad = True epoch_step = num_train // batch_size epoch_step_val = num_val // batch_size if epoch_step == 0 or epoch_step_val == 0: raise ValueError("数据集过小,无法继续进行训练,请扩充数据集。") if distributed: batch_size = batch_size // ngpus_per_node gen = DataLoader(train_dataset, shuffle = shuffle, batch_size = batch_size, num_workers = num_workers, pin_memory=True, drop_last = True, collate_fn = deeplab_dataset_collate, sampler=train_sampler, worker_init_fn=partial(worker_init_fn, rank=rank, seed=seed)) gen_val = DataLoader(val_dataset , shuffle = shuffle, batch_size = batch_size, num_workers = num_workers, pin_memory=True, drop_last = True, collate_fn = deeplab_dataset_collate, sampler=val_sampler, worker_init_fn=partial(worker_init_fn, rank=rank, seed=seed)) UnFreeze_flag = True if distributed: train_sampler.set_epoch(epoch) set_optimizer_lr(optimizer, lr_scheduler_func, epoch) fit_one_epoch(model_train, model, loss_history, eval_callback, optimizer, epoch, epoch_step, epoch_step_val, gen, gen_val, UnFreeze_Epoch, Cuda, dice_loss, focal_loss, cls_weights, num_classes, fp16, scaler, save_period, save_dir, local_rank) if distributed: dist.barrier() if local_rank == 0: loss_history.writer.close()
依据代码,阐述模型的超参数设置,写为一段话即可
在模型训练阶段,我们将学习率初始化为0.005,以确保加速收敛,同时保持训练稳定性。动量因子设置为0.9,以进一步增强收敛并减少训练期间的振荡。为了避免过度拟合,正则化的权重衰减因子被设定为0.0001。模型训练持续60个周期,批量为32,使用Adam优化算法。为了平衡计算资源和图像细节,将输入图像的分辨率设置为512x512像素。我们每10个历元输出一次结果,每20个历元保存一次权重文件。这些设置共同指导模型的训练。
仿照以上话术来写
在训练阶段,我们将初始学习率(Init_lr)设定为0.007,采用余弦退火(cos)策略逐渐衰减至最小学习率0.00007,从而兼顾加速收敛和后期精细微调的需求;动量因子(momentum)取0.9以抑制局部梯度震荡,并配合0.0001的权重衰减(weight_decay)来防止过拟合。整个训练流程分为冻结和解冻两阶段,共进行100个Epoch,其中冻结阶段批量大小(batch_size)为8,解冻后批量大小缩减为2以适配显存;输入分辨率则统一设为512×512,以在模型精度与计算开销之间取得平衡。优化器使用SGD,训练过程中每5个Epoch进行一次评估并保存权重文件,以便及时监控模型表现并记录训练过程。
训练部分没有冻结
在训练阶段,我们从起始便对全部网络参数进行更新(未进行任何冻结操作),单阶段总计训练 100 个 Epoch。初始学习率(Init_lr)设定为 0.007,并采用余弦退火(cos)策略逐渐衰减到 0.00007,从而兼顾前期快速收敛和后期精细微调的需求;动量(momentum)取 0.9 并结合 0.0001 的权重衰减(weight_decay),用以稳定训练过程、防止过拟合。为平衡训练效率和特征表达的细腻度,我们统一将输入图像分辨率设定为 512×512,批量大小(batch_size)则设置为 2;优化器选择了 SGD,每 5 个 Epoch 进行一次评估并保存权重文件,以便实时观察模型性能并保留训练进度。
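为便于查阅,这段描述对应的关键超参数在上文 train.py 中大致如下(节选示意,数值均取自前面贴出的脚本):

```python
# 与上文描述对应的关键超参数设置(示意节选)
input_shape         = [512, 512]      # 输入分辨率
Init_Epoch          = 0
UnFreeze_Epoch      = 100             # 总训练 Epoch(单阶段,不冻结)
Unfreeze_batch_size = 2               # 批量大小
Freeze_Train        = False           # 不进行冻结训练
Init_lr             = 7e-3            # 初始学习率
Min_lr              = Init_lr * 0.01  # 最小学习率(余弦退火下限)
optimizer_type      = "sgd"
momentum            = 0.9
weight_decay        = 1e-4
lr_decay_type       = 'cos'           # 余弦退火
save_period         = 5               # 每 5 个 Epoch 保存一次权重
eval_period         = 5               # 每 5 个 Epoch 评估一次
```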
我使用miou、mPA、precision、Recall、FPS为指标。
评价指标是量化语义分割模型分割性能的重要手段。在目前的研究中,选择了几个评估指标,包括平均交并比 (mIoU) (He et al., 2016)、平均像素精度 (mPA)、F1 分数和每秒帧数 (FPS)。此外,使用参数量作为指标来评估模型的复杂性。图 8 提供了这些评估指标的解释图。在图中,TP 表示正确分类为真实示例的像素数,TN 表示正确标记为真阴性的像素数,FN 表示错误标记为假阴性的像素数,FP 表示错误识别为假阳性的像素数。这些量度共同提供了对所评估模型分割有效性的全面评估。
参考以上话术,需要有较大变动,先用一段话简单介绍,再单独介绍每一种评价指标,包括公式
下文将介绍本研究中所采用的主要评价指标及其背后的数学定义。这些指标从不同角度量化了语义分割模型的性能与复杂度,包括模型在像素级别的准确度、分类边缘情况,以及处理速度和参数规模等。具体而言,我们选用了平均交并比(mIoU)、平均像素精度(mPA)、精度(Precision)、召回率(Recall)、F1 分数、每秒帧数(FPS)以及模型参数量(Params),它们通过对预测结果和真实标注之间的关系进行统计和分析,能够为模型的综合表现提供更全面、客观的衡量。
F1 分数:F1 分数是精度(Precision)和召回率(Recall)的调和平均,用于在二者之间取得平衡:

$$F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

当 Precision 和 Recall 失衡时,F1 分数能更好地体现模型的综合表现。
通过上述一系列度量指标(mIoU、mPA、Precision、Recall、F1、FPS 以及参数量),我们能够从精度、速度与模型规模等多角度对语义分割模型进行评估,为后续的模型改进和实际应用提供可靠参考。
miou、mPA、Recall使用单个像素点的计算公式
下面给出 mIoU、mPA、Recall 在语义分割中常用的“单像素级”计算公式。所谓“单像素级”,是指将整个预测过程拆分到每一个像素上,通过指示函数(Indicator Function)来统计各类像素被正确或错误识别的数量,从而得到全局的评价指标。
为了统计某类别 $c$ 的像素是否预测正确或错误,记像素 $i$ 的真实类别为 $y_i$、预测类别为 $\hat{y}_i$,并定义指示函数:

$$\mathbb{1}(\text{条件}) = \begin{cases} 1, & \text{条件成立} \\ 0, & \text{条件不成立} \end{cases}$$

据此,我们可以对所有像素求和,得到对某一类别 $c$ 而言的以下三个量:

真阳性(True Positive, TP)

$$TP_c = \sum_{i} \mathbb{1}(y_i = c)\,\mathbb{1}(\hat{y}_i = c)$$

表示对于类别 $c$,模型正确识别为 $c$ 的像素总数。

假阳性(False Positive, FP)

$$FP_c = \sum_{i} \mathbb{1}(y_i \neq c)\,\mathbb{1}(\hat{y}_i = c)$$

表示真实不是 $c$ 却被模型错误预测为 $c$ 的像素总数。

假阴性(False Negative, FN)

$$FN_c = \sum_{i} \mathbb{1}(y_i = c)\,\mathbb{1}(\hat{y}_i \neq c)$$

表示真实是 $c$ 却被模型错误预测成其它类别的像素总数。

有了这三个量(在所有像素上求和所得),即可计算相应指标。
交并比(IoU) 主要衡量预测区域和真实区域在空间上的重叠程度,对单一类别 $c$ 而言,定义为:

$$\mathrm{IoU}_c = \frac{TP_c}{TP_c + FP_c + FN_c}$$

在多类别情形下,将各类别 IoU 取平均,得到 平均交并比(mIoU)(设共有 $N$ 个类别,含背景):

$$\mathrm{mIoU} = \frac{1}{N} \sum_{c=1}^{N} \mathrm{IoU}_c$$

在分割任务中,“像素精度(Pixel Accuracy, PA)”有多种定义方式,这里给出的 平均像素精度(mPA) 多指“逐类像素准确率再取平均”的方式。对类别 $c$ 而言,可写为:

$$PA_c = \frac{TP_c}{TP_c + FN_c}$$

即在真实为 $c$ 的所有像素中,模型成功预测为 $c$ 的比例。将其对所有类别取平均,便得到 mPA:

$$\mathrm{mPA} = \frac{1}{N} \sum_{c=1}^{N} PA_c$$

对于类别 $c$,召回率(Recall)的定义是:

$$\mathrm{Recall}_c = \frac{TP_c}{TP_c + FN_c}$$

它与前述的 $PA_c$ 形式上相同,实际含义也都是“在真实为 $c$ 的像素中,被模型正确检出的比例”。若需要对多类别做综合评价,可再对各类别 Recall 取平均。
通过这些“单像素级”指标计算方式,可以更好地理解语义分割评估中各种统计量的由来,为模型结果的客观评价与比较提供理论依据。
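下面给出一个按上述定义、从混淆矩阵出发计算 mIoU、mPA 与逐类 Recall 的简化示例(函数名、实现与数值均为示意,并非项目 utils 中的实际代码):

```python
import numpy as np

def seg_metrics(conf_matrix):
    """conf_matrix[i, j]: 真实类别为 i、被预测为 j 的像素数"""
    cm = conf_matrix.astype(np.float64)
    tp = np.diag(cm)                           # 各类 TP_c
    fn = cm.sum(axis=1) - tp                   # 各类 FN_c
    fp = cm.sum(axis=0) - tp                   # 各类 FP_c
    iou    = tp / np.maximum(tp + fp + fn, 1e-10)
    recall = tp / np.maximum(tp + fn, 1e-10)   # 即逐类像素准确率 PA_c
    return iou.mean(), recall.mean(), recall   # mIoU、mPA、各类 Recall

# 三类(含背景)的示例混淆矩阵
cm = np.array([[500, 20, 10],
               [ 30, 200, 15],
               [  5, 10, 100]])
miou, mpa, recall_per_class = seg_metrics(cm)
print(miou, mpa, recall_per_class)
```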
self.shallow_fusion = ShallowFusionLite(
in_channels1=16, # <-- 与 F1.shape[1] 匹配
in_channels2=24, # <-- 与 F2.shape[1] 匹配
reduced_channels=16,
out_channels=40,
dropout=0.2
)
reduced_channels=16是什么意思
在这段代码里,`reduced_channels=16` 通常表示在特征融合或处理的过程中,会将输入特征先降维到 16 个通道(例如通过 1×1 卷积、线性变换等方式),再进行后续操作。这样做的好处是:一方面在融合前过滤了部分冗余通道、降低了后续卷积的计算量;另一方面,两路特征都被压缩到相同的 `reduced_channels`,可以让后续拼接、加法或注意力等操作更方便。因此,`reduced_channels=16` 就是这个 ShallowFusionLite 模块里的一个中间通道数,用来对输入特征图进行降维处理,让后续融合步骤更轻量、高效。
通常,在计算 mIoU 时,首先会针对测试集中的所有图像评估混淆矩阵。从此矩阵中,可以确定不同类别的交集与并集 (IoU) 值。最后,计算所有类别的平均值,得出 mIoU 分数。此指标通常用于评估图像分割模型的性能。mIoU 通过计算每个类别标签的预测值与真实值之间的交集和并集的平均比例来量化对象分割的准确性。mIoU 可以按如下方式计算:
其中 n 表示语义分割任务像素类别,
pii是被正确预测为 i 类的 i 类的像素数,
pij是被预测为 j 类的 i 类的像素数,
pji是被预测为 i 类的 j 类的像素数。
仿照上述写法,介绍miou,区别大一点
在进行 mIoU(Mean Intersection over Union) 计算时,通常会先对测试集中所有的图像进行预测,然后基于真实标签和预测结果生成对应的混淆矩阵。该矩阵中不仅包含了各类别被正确预测的像素数量,还记录了各类别之间可能的误分类情况。接下来,通过统计特定类别 $i$ 的像素在预测和真实标注中的重叠区域(交集)与总区域(并集),便可得到该类别的 IoU 指标;最终,对所有类别的 IoU 结果取平均,即可获得整体的 mIoU 分数。具体而言,令 $n$ 为语义分割任务的类别总数(含背景),$p_{ii}$ 表示真实为 $i$ 类且被正确预测为 $i$ 类的像素数,$p_{ij}$ 表示真实为 $i$ 类却被预测为 $j$ 类的像素数,而 $p_{ji}$ 为真实为 $j$ 类但被错误识别为 $i$ 类的像素数,则

$$\mathrm{mIoU} = \frac{1}{n} \sum_{i=1}^{n} \frac{p_{ii}}{\sum_{j} p_{ij} + \sum_{j} p_{ji} - p_{ii}}$$

通过将这些像素计数代入交并比公式并对各类别取平均,即可量化模型在对象分割任务中的整体准确度。由于 mIoU 能综合反映模型在不同类别上的性能,它成为评估分割质量最常用的指标之一。
我在mobilenetv2的倒残差结构的深度卷积后插入GGCA (全局分组坐标注意力),其代码如下
import torch
from torch import nn
class GGCA(nn.Module): #(Global Grouped Coordinate Attention) 全局分组坐标注意力
def __init__(self, channel, h, w, reduction=16, num_groups=4):
super(GGCA, self).__init__()
self.num_groups = num_groups # 分组数
self.group_channels = channel // num_groups # 每组的通道数
self.h = h # 高度方向的特定尺寸
self.w = w # 宽度方向的特定尺寸
text# 定义H方向的全局平均池化和最大池化 self.avg_pool_h = nn.AdaptiveAvgPool2d((h, 1)) # 输出大小为(h, 1) self.max_pool_h = nn.AdaptiveMaxPool2d((h, 1)) # 定义W方向的全局平均池化和最大池化 self.avg_pool_w = nn.AdaptiveAvgPool2d((1, w)) # 输出大小为(1, w) self.max_pool_w = nn.AdaptiveMaxPool2d((1, w)) # 定义共享的卷积层,用于通道间的降维和恢复 self.shared_conv = nn.Sequential( nn.Conv2d(in_channels=self.group_channels, out_channels=self.group_channels // reduction, kernel_size=(1, 1)), nn.BatchNorm2d(self.group_channels // reduction), nn.ReLU(inplace=True), nn.Conv2d(in_channels=self.group_channels // reduction, out_channels=self.group_channels, kernel_size=(1, 1)) ) # 定义sigmoid激活函数 self.sigmoid_h = nn.Sigmoid() self.sigmoid_w = nn.Sigmoid() def forward(self, x): batch_size, channel, height, width = x.size() # 确保通道数可以被分组数整除,一般分组数,要选择整数,不然不能被整除。而且是小一点.groups选择4挺好。 assert channel % self.num_groups == 0, "The number of channels must be divisible by the number of groups." # 将输入特征图按通道数分组 x = x.view(batch_size, self.num_groups, self.group_channels, height, width) # 分别在H方向进行全局平均池化和最大池化 x_h_avg = self.avg_pool_h(x.view(batch_size * self.num_groups, self.group_channels, height, width)).view( batch_size, self.num_groups, self.group_channels, self.h, 1) x_h_max = self.max_pool_h(x.view(batch_size * self.num_groups, self.group_channels, height, width)).view( batch_size, self.num_groups, self.group_channels, self.h, 1) # 分别在W方向进行全局平均池化和最大池化 x_w_avg = self.avg_pool_w(x.view(batch_size * self.num_groups, self.group_channels, height, width)).view( batch_size, self.num_groups, self.group_channels, 1, self.w) x_w_max = self.max_pool_w(x.view(batch_size * self.num_groups, self.group_channels, height, width)).view( batch_size, self.num_groups, self.group_channels, 1, self.w) # 应用共享卷积层进行特征处理 y_h_avg = self.shared_conv(x_h_avg.view(batch_size * self.num_groups, self.group_channels, self.h, 1)) y_h_max = self.shared_conv(x_h_max.view(batch_size * self.num_groups, self.group_channels, self.h, 1)) y_w_avg = self.shared_conv(x_w_avg.view(batch_size * self.num_groups, self.group_channels, 1, self.w)) y_w_max = self.shared_conv(x_w_max.view(batch_size * self.num_groups, self.group_channels, 1, self.w)) # 计算注意力权重 att_h = self.sigmoid_h(y_h_avg + y_h_max).view(batch_size, self.num_groups, self.group_channels, self.h, 1) att_w = self.sigmoid_w(y_w_avg + y_w_max).view(batch_size, self.num_groups, self.group_channels, 1, self.w) # 应用注意力权重 out = x * att_h * att_w out = out.view(batch_size, channel, height, width) return out
想问下加了这个有作用嘛,作用在哪
从整体流程来看,GGCA(Global Grouped Coordinate Attention) 模块可以视为在倒残差结构的深度可分离卷积之后,进一步强化特征表达的一种注意力机制。它将特征图按通道进行分组,并分别在 H 与 W 两个维度上进行全局平均池化、最大池化,再通过一系列 1×1 卷积(降维/升维)和 Sigmoid 激活来生成高度注意力与宽度注意力,最后将这些注意力与原特征图逐元素相乘,实现对不重要区域的抑制以及重要区域的增强。下面从几个方面说明它可能带来的作用与好处:
增强空间维度上的信息交互
分组(Grouped)操作带来的细粒度注意力
结合最大池化与平均池化
适配移动端或轻量级网络的注意力思路
“加了 GGCA 有没有作用?”——从原理与实践经验来看,像 GGCA 这样的一种分组式坐标注意力,可以帮助网络更好地捕捉全局上下文和关键细节,对提高特征表达能力、改善分割性能通常是有益的。但实际增益大小还要视数据集难度、网络结构、插入位置、训练设置等多方面因素而定。建议通过消融实验(加上或去掉 GGCA 的对比测试)以及可视化注意力分布来进一步验证其有效性。
主编码器分支由 17 个源自 MobileNetV2 模型的残差瓶颈层组成。标准的 MobileNetV2 块采用线性瓶颈倒置残差结构构建,首先将输入特征的维度增加 1 × 1 卷积,然后用深度可分卷积代替标准卷积对高维特征进行过滤。与标准卷积相比,深度可分离卷积的计算量较低,参数也较少。最后,特征通过 1 × 1 卷积投影到低维压缩表示。
根据卷积层的步幅,在 MobileNetV2 模型中提出了两个卷积块,一个是 block1 (Stride=1),另一个是 block2 (Stride=2)。两者的区别在于,block1 使用类似于 ResNet 的跳转连接进行输入和输出的融合,而 block2 用于压缩特征大小,如图 4 所示。这些块的表述如下:
标准的 MobileNetV2 块在局部感受野中构建特征,而局部感受野缺乏对信息丰富的全局通道特征的关注。为了最大限度地减少特征提取中的信息损失,我们提出了 SMB 机制,通过在深度过滤器后面插入一个软注意力模块来替换标准的 MobileNetV2 块。SMB 的结构如图 4 所示,其中软注意力的计算公式如下:
Soft-pooling layer 是 soft-attention 模块中的关键构建,它基于 Softmax 功能。池化方法首先为每个 activation 分配一个权重,该权重是此激活值的自然指数在邻域内所占的比例;然后,权重与相应 activation 的值一起用作非线性变换:
因此,权重值较高的激活值比激活值较低的激活更显著,与简单地选择最大值相比,减少了信息丢失。相反,通过保持整体区域特征强度,它的性能明显优于平均池化。
模仿上述话术,先介绍mobilenetv2,在说明其不足以突出我们所加的GGCA的作用,然后介绍GGCA的原理,包括公式
MobileNetV2 简介与局限性
MobileNetV2 采用线性瓶颈(Linear Bottleneck)和倒置残差结构(Inverted Residual)来构建网络,每个标准块(block)首先使用 1×1 卷积分支将通道数扩张,随后利用深度可分离卷积在高维空间中进行特征过滤,最后再用 1×1 卷积投影到低维表示。与传统的卷积相比,深度可分离卷积有效减少了计算量和参数量;同时,通过在部分块中设置 stride=2(记为 block2),网络能够压缩空间分辨率,从而在更深层次抽取特征。然而,这种以局部感受野为主导的结构在捕捉跨区域、跨通道的全局信息时往往存在不足,导致网络对关键特征的关注度不够,从而限制了分割或识别性能的进一步提升。
为此,我们在深度可分离卷积之后插入了一个全局分组坐标注意力(GGCA, Global Grouped Coordinate Attention)模块,以在不显著增加计算量的前提下提升网络对全局和局部信息的综合感知。这样做不仅能在空间维度上挖掘高、宽方向上更丰富的上下文信息,也能通过分组策略分别关注不同通道子空间,从而在特征提取阶段减少信息丢失。
GGCA 将输入特征图 $X$ 按通道数分为 $G$ 组,每组的通道数记为 $C_g = C/G$,然后分别在高度和宽度两个方向做全局平均池化与最大池化,捕捉在 H 和 W 维度上的统计信息。以高度方向为例,可写成下述形式(略去批量大小和分组的索引):

$$z_h^{avg} = \mathrm{AvgPool}_h(X), \qquad z_h^{max} = \mathrm{MaxPool}_h(X)$$

其中 $\mathrm{AvgPool}_h$ 与 $\mathrm{MaxPool}_h$ 分别表示在高度维度上的自适应平均池化与最大池化。对宽度方向也同理可得 $z_w^{avg}$ 与 $z_w^{max}$。随后,分别通过共享的 1×1 卷积 $f(\cdot)$ 降维/激活再升维,最终以逐元素相加的方式获得两套注意力特征(例如 $f(z_h^{avg}) + f(z_h^{max})$),并使用 Sigmoid 函数进行归一化。高度注意力 $A_h$ 与宽度注意力 $A_w$ 的计算可以抽象为:

$$A_h = \sigma\big(f(z_h^{avg}) + f(z_h^{max})\big), \qquad A_w = \sigma\big(f(z_w^{avg}) + f(z_w^{max})\big)$$

其中 $\sigma$ 为 Sigmoid 激活函数。最后将这两种注意力与原特征逐元素相乘:

$$Y = X \otimes A_h \otimes A_w$$

其中 $\otimes$ 表示广播维度上的逐元素乘法操作。由于分组(Grouped)策略的引入,网络能够在相对更小的通道子空间中学习高度与宽度上的全局注意力,从而在保持轻量化的同时,兼顾跨通道与跨空间的语义关联。
全局上下文感知:
通过高度和宽度方向的全局池化,GGCA 获取了更远距离的特征依赖,对于需要大范围上下文信息的目标区域,能够带来更准确的分割或识别。
分组机制提升效率:
若直接在所有通道上计算坐标注意力,可能会引入过多的参数和计算量。分组方式则能在保证精度的同时有效控制模型大小与推理开销。
显著减少信息损失:
在深度可分离卷积之后插入 GGCA,相当于给 MobileNetV2 的局部感受野网络补上了一条全局依赖的通路,使得关键特征更容易被放大,不再仅仅依赖最大值或平均值来代表整片区域。
综上所述,通过在 MobileNetV2 的倒置残差结构中加入 GGCA,我们希望在不牺牲模型轻量化特点的前提下,增强网络对空间坐标与通道全局关联的关注度,从而在分割、检测等视觉任务中获得更高的性能。
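若想快速验证该模块的输入输出形状,可参考下面的用法示意(数值仅为演示;假设可直接使用上文给出的 GGCA 类,并注意它在构造时就需要指定特征图的高宽 h、w,且通道数需能被 num_groups 整除):

```python
import torch

ggca = GGCA(channel=64, h=32, w=32, reduction=16, num_groups=4)  # 即上文贴出的注意力模块
x = torch.randn(2, 64, 32, 32)
y = ggca(x)
print(y.shape)  # 预期与输入一致: torch.Size([2, 64, 32, 32])
```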
mobilenetv2介绍中也加入公式
MobileNetV2 简介(含公式)
MobileNetV2 的核心思想在于采用**线性瓶颈(Linear Bottleneck)和倒置残差结构(Inverted Residual)**来构建网络。与传统卷积相比,其关键改进在于:
以倒置残差块(Inverted Residual Block)为例,不考虑 BatchNorm 和 ReLU 等操作,示意性公式如下所示(其中 $s$ 表示步幅,$t$ 表示通道扩张倍率,$X$ 为输入特征,$C$ 为输入通道数,$C'$ 为输出通道数):

$$Y = \mathrm{Conv}_{1\times1}^{\,tC \to C'}\Big(\mathrm{DWConv}_{3\times3,\,s}\big(\mathrm{Conv}_{1\times1}^{\,C \to tC}(X)\big)\Big)$$

当 $s = 1$ 且输入输出通道数相同($C' = C$)时,还可将 $Y$ 与原输入 $X$ 相加得到:

$$\hat{Y} = X + Y$$

即所谓的倒置残差(Inverted Residual)连接。相比传统卷积,深度可分离卷积能将大卷积分解为深度卷积(逐通道地进行空间卷积)和逐点卷积(1×1 卷积),从而显著减少运算量与参数量。然而,这种由局部感受野主导的结构也带来了一个潜在不足:难以充分捕捉跨区域、跨通道的全局特征。
由于 MobileNetV2 块主要在局部进行特征过滤,网络对远距离、跨通道的依赖信息关注不足,可能导致对于复杂场景或大尺寸目标的表征能力有限。为解决此问题,我们引入了全局分组坐标注意力(GGCA, Global Grouped Coordinate Attention) 模块,将其插入到深度可分离卷积之后,旨在增强网络的全局语义理解和精细特征抓取能力。
GGCA 通过按通道分组、在高度与宽度维度做全局池化来获取远距离上下文信息,并将其转化为注意力权重,用于加权原特征图。具体步骤如下:
分组(Grouped Channels)
将输入特征图的通道维度拆成 $G$ 组,每组通道数为
$$C_g = \frac{C}{G}$$
这样每组可单独学到一套“坐标注意力”,避免过多通道互相干扰并控制计算量。
高度/宽度方向的全局池化
对每一组特征 $X_g$ 同时做平均池化和最大池化,分别得到
$$z_h^{avg} = \mathrm{AvgPool}_h(X_g), \quad z_h^{max} = \mathrm{MaxPool}_h(X_g), \quad z_w^{avg} = \mathrm{AvgPool}_w(X_g), \quad z_w^{max} = \mathrm{MaxPool}_w(X_g)$$
其中 $\mathrm{AvgPool}_h$ / $\mathrm{MaxPool}_h$ 表示在高度维度的池化操作,$\mathrm{AvgPool}_w$ / $\mathrm{MaxPool}_w$ 表示在宽度维度的池化操作。
共享 1×1 卷积
将 H 和 W 分支各自通过同一个 1×1 卷积(或小 MLP)$f(\cdot)$ 降维-升维并相加:
$$u_h = f(z_h^{avg}) + f(z_h^{max}), \qquad u_w = f(z_w^{avg}) + f(z_w^{max})$$
Sigmoid 激活得到注意力图
采用 $\mathrm{Sigmoid}$ 函数将结果归一化到 $(0,1)$,得到高度注意力 $A_h$ 与宽度注意力 $A_w$:
$$A_h = \sigma(u_h), \qquad A_w = \sigma(u_w)$$
融合到原特征
最后,将上述注意力与原特征图逐元素相乘:
$$Y_g = X_g \otimes A_h \otimes A_w$$
其中 $\otimes$ 表示广播维度上的逐元素乘法操作,通过此机制能够突出关键区域、抑制无关或冗余信息。
全局信息感知
通过在 H 和 W 两个方向进行全局池化并计算注意力,GGCA 能捕捉到大范围上下文,补足 MobileNetV2 局部感受野的不足。
分组策略兼顾精度和效率
如果直接在所有通道上做坐标注意力,可能引入大量参数。分组式计算能在保持精度的同时有效压缩计算开销。
可嵌入式增强
GGCA 可以无缝嵌入到 MobileNetV2 的倒置残差块中,尤其在深度可分离卷积后插入,对网络整体推理速度影响较小,却能提供跨通道和跨空间的更丰富信息。
综上所述,通过在 MobileNetV2 块的深度可分离卷积后加入 GGCA,我们在轻量化与全局上下文学习之间取得更好的平衡,为后续的图像分割、检测等视觉任务提供了更强的表征能力。
⊗表示什么
在许多深度学习论文和代码中,“$\otimes$”常用于表示逐元素相乘(element-wise multiplication),有时也称 Hadamard 乘积。当我们写下

$$Y = X \otimes A_h \otimes A_w$$

时,意思是将特征张量 $X$ 与注意力权重 $A_h$ 和 $A_w$ 按照对应位置(像素、通道等)一一相乘,并保持相同的维度输出。
若在上下文中需要广播(broadcast)操作(例如注意力的 shape 可能在其中一个维度是 1,而特征图在该维度大于 1),深度学习框架会自动为相乘的张量做适当的广播,以便完成逐元素相乘。
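用一个简短的 PyTorch 片段可以直观看到这种“带广播的逐元素相乘”(数值与形状仅为示意):

```python
import torch

x   = torch.randn(2, 16, 32, 32)  # 特征图 (B, C, H, W)
a_h = torch.rand(2, 16, 32, 1)    # 高度注意力:W 维为 1
a_w = torch.rand(2, 16, 1, 32)    # 宽度注意力:H 维为 1

y = x * a_h * a_w                 # 框架自动广播,对应位置逐元素相乘
print(y.shape)                    # torch.Size([2, 16, 32, 32]),与 x 相同
```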
介绍评价指标MPA和Accuracy
在语义分割任务中,mPA(mean Pixel Accuracy) 与 Accuracy 都可以用来衡量模型在像素级别的预测准确程度,但它们在关注点和计算方式上略有差异。以下将对二者进行介绍和比较。
定义
mPA(mean Pixel Accuracy)是对各类别的“像素准确率”取平均。对于多类别情形,若共有 $N$ 个类别(含背景),记 $PA_i$ 表示第 $i$ 类的像素准确率,定义为:

$$PA_i = \frac{TP_i}{TP_i + FN_i}$$

其中 $TP_i$ 表示“真实为 $i$ 类且预测也为 $i$ 类”的像素总数,$FN_i$ 表示“真实为 $i$ 类但预测不是 $i$ 类”的像素总数。将各类别 $PA_i$ 的平均值作为最终指标:

$$\mathrm{mPA} = \frac{1}{N}\sum_{i=1}^{N} PA_i$$
理解:mPA 对每个类别的像素准确率一视同仁地取平均,即使某个类别的像素数量很少,它的表现也能被同等程度地反映出来。
定义:Accuracy(整体像素准确率)则把所有类别的像素放在一起统计,即被正确分类的像素占全部像素的比例:
$$\mathrm{Accuracy} = \frac{\sum_{i=1}^{N} TP_i}{\text{像素总数}}$$
理解:Accuracy 反映模型在整幅图像上的总体预测正确率,计算简单直观,但容易被占比较大的类别(如背景)主导。
在实际应用中,如果数据集类别分布相对均衡,Accuracy 往往能较好地反映整体性能;而当类别分布不平衡,或者需要重点关注小类时,mPA 往往能更准确地揭示模型的多类别性能差异,故二者可结合使用,帮助研究者全面评估模型质量。
介绍F1-score
在语义分割及其他机器学习任务中,F1-score(或简称 F1)是用来综合评估模型精度(Precision)和召回率(Recall)的一个重要指标,通常用于衡量分类、目标检测等任务在“正例”检测方面的综合表现。它能够在 Precision 与 Recall 之间取得平衡,尤其当数据存在不平衡(类别分布不均)或对误报、漏报都需要重视时,F1-score 能提供比单独看 Precision 或 Recall 更全面的度量。
Precision(精度):在所有被预测为正例的像素(或样本)中,真正属于正例的比例,即 $\mathrm{Precision} = \dfrac{TP}{TP + FP}$。
Recall(召回率):在所有真实为正例的像素(或样本)中,被模型成功检出的比例,即 $\mathrm{Recall} = \dfrac{TP}{TP + FN}$。
F1-score 是精度(Precision)与召回率(Recall)的调和平均(Harmonic Mean),定义为:

$$F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
F1-score 通过同时衡量 Precision 和 Recall,为二者提供一个平衡的评估方式,广泛应用于二分类、多分类以及语义分割场景下的“正例检测”情况。当数据类别不平衡、误报和漏报都需要关注时,F1-score 能更客观地反映模型的整体实用水平,常与其他指标(Accuracy、mIoU、mPA 等)联合使用,帮助研究者或工程师在不同需求下做出性能优化与权衡。
我的语义分割带背景有三类,我在计算F1-score时,可以先三类的Precision和Recall的平均值,然后用Precision和Recall的平均值计算F1-score
是的,这种做法通常被称为**“宏平均(Macro-average)”** F1-score 计算方式。具体来说,对于含背景在内的三类场景,可以先分别计算每一类的 Precision 与 Recall,然后对三类的 Precision 取平均、对三类的 Recall 取平均,最后用这两个平均值来计算 F1-score。公式形式可写为:
逐类计算 Precision 与 Recall:

$$\mathrm{Precision}_c = \frac{TP_c}{TP_c + FP_c}, \qquad \mathrm{Recall}_c = \frac{TP_c}{TP_c + FN_c}$$

其中 $c \in \{1,2,3\}$ 表示三类(含背景),$TP_c$、$FP_c$、$FN_c$ 分别为第 $c$ 类对应的真阳性、假阳性和假阴性。

对三类结果做宏平均(Macro-average):

$$\overline{\mathrm{Precision}} = \frac{1}{3}\sum_{c=1}^{3}\mathrm{Precision}_c, \qquad \overline{\mathrm{Recall}} = \frac{1}{3}\sum_{c=1}^{3}\mathrm{Recall}_c$$

由平均后的 Precision 与 Recall 计算 F1-score:

$$F_1 = \frac{2 \times \overline{\mathrm{Precision}} \times \overline{\mathrm{Recall}}}{\overline{\mathrm{Precision}} + \overline{\mathrm{Recall}}}$$
这样得到的 F1-score 有时也被称为 “Macro-F1”。它会对各类(包括背景类)给予相同的权重,不会因为某一类别样本量大、预测偏多或偏少就“淹没”掉其他类别的表现,因而在类间分布不均衡的情况下常常被用来更加平衡地评估多类别任务的整体表现。
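按照这种“先平均、再计算 F1”的宏平均思路,一个简化的计算示例如下(函数名与实现均为示意,并非项目中的实际代码):

```python
import numpy as np

def macro_f1(conf_matrix):
    """conf_matrix[i, j]: 真实类别为 i、被预测为 j 的像素数(三类含背景同样适用)"""
    cm = conf_matrix.astype(np.float64)
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    precision_mean = (tp / np.maximum(tp + fp, 1e-10)).mean()  # 各类 Precision 的平均
    recall_mean    = (tp / np.maximum(tp + fn, 1e-10)).mean()  # 各类 Recall 的平均
    return 2 * precision_mean * recall_mean / max(precision_mean + recall_mean, 1e-10)

cm = np.array([[500, 20, 10],
               [ 30, 200, 15],
               [  5, 10, 100]])
print(macro_f1(cm))
```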
import torch
import torch.nn as nn
import torch.nn.functional as F
from nets.CBAM import CBAM
class DepthwiseSeparableConv(nn.Module):
"""
标准的深度可分离卷积:
DW( groups = in_channels ) + PW( 1x1 卷积 )
"""
def __init__(self, inp, oup, kernel_size=3, stride=1, padding=1, bias=False):
super(DepthwiseSeparableConv, self).__init__()
self.depthwise = nn.Conv2d(inp, inp, kernel_size, stride, padding,
groups=inp, bias=bias)
self.pointwise = nn.Conv2d(inp, oup, 1, 1, 0, bias=bias)
self.bn = nn.BatchNorm2d(oup)
# 将原先的 nn.ReLU(inplace=True) 改为:
self.relu = nn.ReLU(inplace=False)
def forward(self, x):
    out = self.depthwise(x)
    out = self.pointwise(out)
    out = self.bn(out)
    out = self.relu(out)
    return out
class ShallowFusionLite(nn.Module):
"""
改进的浅层特征融合:
1) 对 F1、F2 先做 1x1 卷积减小通道数
2) 分别用一次 DepthwiseSeparableConv 提取
3) 上采样 F2 => 拼接 => 再做一层 DepthwiseSeparableConv + Dropout
"""
def __init__(self, in_channels1, in_channels2,
reduced_channels=16, out_channels=32, dropout=0.2):
super(ShallowFusionLite, self).__init__()
text# 1) 用1x1卷积先压缩通道 self.reduce_f1 = nn.Conv2d(in_channels1, reduced_channels, kernel_size=1, bias=False) self.reduce_f2 = nn.Conv2d(in_channels2, reduced_channels, kernel_size=1, bias=False) # 2) 深度可分离卷积提炼 self.conv1 = DepthwiseSeparableConv(reduced_channels, reduced_channels, kernel_size=3, padding=1) self.conv2 = DepthwiseSeparableConv(reduced_channels, reduced_channels, kernel_size=3, padding=1) # 3) 最终融合: 拼接后 => depthwise separable => out_channels # 并将 Dropout 的 inplace 改为 False self.fuse_conv = nn.Sequential( DepthwiseSeparableConv(2 * reduced_channels, out_channels, kernel_size=3, padding=1), nn.Dropout(dropout, inplace=False) # 改为 inplace=False ) self.attn = CBAM(channels=out_channels,reduction=16,kernel_size=7) def forward(self, F1, F2): # 1) 通道压缩 F1_reduced = self.reduce_f1(F1) F2_reduced = self.reduce_f2(F2) # 2) 深度可分离卷积 F1_out = self.conv1(F1_reduced) F2_out = self.conv2(F2_reduced) # 3) 上采样F2 F2_up = F.interpolate(F2_out, size=F1_out.shape[2:], mode='bilinear', align_corners=False) # 4) 拼接 & fuse_conv fused = torch.cat([F1_out, F2_up], dim=1) out = self.fuse_conv(fused) out = self.attn(out) return out
根据代码介绍浅层特征融合网络仿照一下话术,要有较大差异,话术专业些。
在深度网络中,层信息随着网络级别的增加而减少,因此裂纹纹理和边缘特征在浅层中得到更好的保留。通过融合浅层特征,可以保留更多的原始信息,从而有效补偿这个问题。如图 4 所示的 SFF 模块旨在保留原始裂纹信息。SFF 模块对 F2 进行 2×2 次上采样,将图像大小与 F1 对齐,然后将两张图像叠加在通道上。最后,将特征图输入到高效通道注意力 (ECA) 中,以调整焦点区域的权重。然后,根据具有更详细信息的浅层特征进行后续操作,以更好地还原裂纹形态的真实性。在改进的 EDeepLab 模型中,SFF 模块在其解码器过程中被采用,以结合 F1 和 F2 的特性。
在分割网络的深层阶段,模型往往更注重高层语义信息,但随之也会丢失大量细节特征,尤其是裂纹等目标的纹理细节与边缘结构。为此,本代码中的 ShallowFusionLite 模块旨在结合来自浅层(F1)和中层(F2)的特征,通过轻量化的卷积操作与注意力机制,在解码器阶段有效补偿裂纹形态等微细信息的丢失。其整体思路可概括为以下几个步骤:
通道压缩
深度可分离卷积提炼
空间对齐与拼接
融合与 Dropout
通道与空间注意力(CBAM)
通过上述流程,ShallowFusionLite 能有效整合来自不同深度的特征,继承了浅层分辨率高、边缘信息丰富的优势,并利用中层特征的判别能力,对裂纹与背景之间的差异做出更精准的区分。与直接采用传统卷积相比,模块在融合时使用深度可分离卷积和简化通道数,不仅在计算效率上更加轻量,而且配合通道/空间注意力进一步提升了对目标区域的聚焦能力。这样一来,网络在解码阶段便能充分利用更多原始纹理与形状线索,为后续的裂纹检测或语义分割提供更可靠的特征保障。
总结为一段话,不要分点阐述
在该设计中,ShallowFusionLite 模块通过多个轻量化操作将浅层 F1 与中层 F2 进行融合:先采用 1×1 卷积对两路特征分别进行通道压缩,以减少冗余并降低后续计算负担;随后以深度可分离卷积提取各自的局部信息,再将经过上采样的 F2 与 F1 沿通道维度拼接,形成含有丰富细节与中层语义的联合特征;接下来利用另一层深度可分离卷积和 Dropout 对融合后的特征进行混合和筛选;最后,通过 CBAM 对关键通道与显著空间区域进行注意力加权,进一步突出有助于裂纹检测或分割的区域,从而减少噪声干扰并保留更多纹理细节与边缘信息,为后续解码和预测提供更精准的特征支撑。
从头到尾总结为一段话,精简,专业
ShallowFusionLite 模块以轻量化方式整合了浅层 F1 与中层 F2 的互补信息:首先利用 1×1 卷积对两路特征进行降维,以减少冗余通道;随后通过深度可分离卷积分别提取关键局部表示,并将上采样后的 F2 与 F1 沿通道维度拼接形成融合特征;随后,再次使用深度可分离卷积和 Dropout 对融合结果加以筛选,去除噪声;最后引入 CBAM,对通道和空间注意力进行加权,从而强化裂纹等细节的显著区域并抑制干扰,为解码和预测提供更加丰富且精准的特征支撑。
在分割网络的深层阶段,模型往往更注重高层语义信息,但随之也会丢失大量细节特征,尤其是裂纹等目标的纹理细节与边缘结构。为此,本代码中的 ShallowFusionLite 模块旨在结合来自浅层(F1)和中层(F2)的特征,通过轻量化的卷积操作与注意力机制,在解码器阶段有效补偿裂纹形态等微细信息的丢失。其整体思路可概括为以下几个步骤:
通道压缩
对输入的浅层特征 F1 与中层特征 F2,分别使用 1×1 卷积进行降维,将冗余通道压缩为较小的“精简通道”(reduced_channels)。
这样做不仅在融合前初步过滤了特征冗余,也显著减少了后续卷积的计算量。
深度可分离卷积提炼
在获得降维后的 F1 与 F2 后,分别用 DepthwiseSeparableConv 进行局部特征挖掘。
深度可分离卷积(Depthwise + Pointwise)的设计能在保持表征能力的同时,减少参数与运算量,提升融合阶段的效率。
空间对齐与拼接
将经过提炼的 F2 上采样(双线性插值)至与 F1 相同的空间分辨率,保证二者在空间尺度上一致。
将上采样后的 F2 与 F1 沿通道维度拼接,从而将浅层细节特征与中层的语义信息整合在同一张量中。
融合与 Dropout
拼接后的特征先经过一层深度可分离卷积,以进一步提炼融合后的特征表示,再接入 Dropout 防止过拟合。
这一步相当于“混合两路信息”,同时对可能的噪声或冗余做消减。
通道与空间注意力(CBAM)
最后,将融合好的特征输入到 CBAM 中,分别在通道维度与空间维度学习注意力权重。
通过对关键通道与显著区域的突出,减少无用特征干扰,使得网络更专注于裂纹的边缘和细节,从而在后续解码和预测阶段保留更多微观形态信息。
通过上述流程,ShallowFusionLite 能有效整合来自不同深度的特征,继承了浅层分辨率高、边缘信息丰富的优势,并利用中层特征的判别能力,对裂纹与背景之间的差异做出更精准的区分。与直接采用传统卷积相比,模块在融合时使用深度可分离卷积和简化通道数,不仅在计算效率上更加轻量,而且配合通道/空间注意力进一步提升了对目标区域的聚焦能力。这样一来,网络在解码阶段便能充分利用更多原始纹理与形状线索,为后续的裂纹检测或语义分割提供更可靠的特征保障。
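为了更直观地说明上述轻量化设计的收益,下面给出一个简单的参数量对比示例(通道数取代码中的默认值 reduced_channels=16、out_channels=32,数值仅作说明):

```python
import torch.nn as nn

def count_params(m: nn.Module) -> int:
    return sum(p.numel() for p in m.parameters())

# 普通 3×3 卷积(16 -> 32)与 深度可分离卷积(Depthwise 3×3 + Pointwise 1×1)的参数量对比
plain = nn.Conv2d(16, 32, kernel_size=3, padding=1, bias=False)
dw_pw = nn.Sequential(
    nn.Conv2d(16, 16, kernel_size=3, padding=1, groups=16, bias=False),  # depthwise
    nn.Conv2d(16, 32, kernel_size=1, bias=False),                        # pointwise
)
print(count_params(plain), count_params(dw_pw))  # 4608 vs 656
```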
从头到尾总结为一段话,精简,专业
ShallowFusionLite 模块通过轻量化卷积与注意力机制在解码阶段融合浅层 F1 与中层 F2:首先用 1×1 卷积压缩多余通道以减轻后续计算负担,并分别采用深度可分离卷积提取局部信息;随后将上采样后的 F2 与 F1 沿通道拼接,利用再次的深度可分离卷积及 Dropout 去除冗余;最后将融合特征输入到 CBAM 进行通道与空间注意力加权,从而更好地聚焦裂纹等细节区域并抑制噪声。该设计既保留了浅层高分辨率与边缘特征优势,又借助中层特征实现对语义差异的精细区分,为后续检测或分割提供更强的特征支撑。
class WA_ASPP(nn.Module):
"""
五条分支:
1) branch1: 1×1 conv => (320->256)
2) branch2/3/4: WindowTransformerBlock(320->320) + 1×1 conv(320->256) => 层次叠加
3) branch5: 全局平均池化(320->256)
最终拼接 => 1×1 conv => 256
"""
def __init__(self, dim_in, dim_out,
window_size=(4,8,16),
num_heads=4,
mlp_ratio=4.0,
attn_dropout=0.0,
ffn_dropout=0.0):
super().__init__()
# branch1: 1×1 conv => [B, dim_out, H, W]
self.branch1 = nn.Sequential(
    nn.Conv2d(dim_in, dim_out, kernel_size=1),
    nn.BatchNorm2d(dim_out),
    nn.ReLU(inplace=True)
)

# branch2/3/4: WindowTransformerBlock (输入输出都是dim_in=320)
self.branch2_attn = WindowTransformerBlock(
    in_channels=dim_in, window_size=window_size[0], num_heads=num_heads,
    mlp_ratio=mlp_ratio, attn_dropout=attn_dropout, ffn_dropout=ffn_dropout
)
self.branch3_attn = WindowTransformerBlock(
    in_channels=dim_in, window_size=window_size[1], num_heads=num_heads,
    mlp_ratio=mlp_ratio, attn_dropout=attn_dropout, ffn_dropout=ffn_dropout
)
self.branch4_attn = WindowTransformerBlock(
    in_channels=dim_in, window_size=window_size[2], num_heads=num_heads,
    mlp_ratio=mlp_ratio, attn_dropout=attn_dropout, ffn_dropout=ffn_dropout
)

# 额外的 1×1 Conv:将分支2/3/4输出从 320 => 256
self.branch2_proj = nn.Conv2d(dim_in, dim_out, kernel_size=1, bias=False)
self.branch3_proj = nn.Conv2d(dim_in, dim_out, kernel_size=1, bias=False)
self.branch4_proj = nn.Conv2d(dim_in, dim_out, kernel_size=1, bias=False)

# branch5: 全局平均池化 => 卷积(320->256)
self.branch5_conv = nn.Conv2d(dim_in, dim_out, kernel_size=1)
self.branch5_bn = nn.BatchNorm2d(dim_out)
self.branch5_relu = nn.ReLU(inplace=True)

# 拼接(5路) => 1×1 conv => dim_out
self.fuse_conv = nn.Sequential(
    nn.Conv2d(dim_out * 5, dim_out, kernel_size=1),
    nn.BatchNorm2d(dim_out),
    nn.ReLU(inplace=True)
)

def forward(self, x):
    B, C, H, W = x.size()

    # branch1 => [B, 256, H, W]
    out1 = self.branch1(x)

    # branch2 => WindowAttn => [B, 320, H, W] -> 1x1Conv => [B, 256, H, W]
    out2 = self.branch2_attn(x)
    out2 = self.branch2_proj(out2)
    out2_agg = out1 + out2

    # branch3
    out3 = self.branch3_attn(x)
    out3 = self.branch3_proj(out3)
    out3_agg = out1 + out2_agg + out3

    # branch4
    out4 = self.branch4_attn(x)
    out4 = self.branch4_proj(out4)
    out4_agg = out1 + out2_agg + out3_agg + out4

    # branch5 => 全局池化 => 卷积 => bn => relu => 上采样 => [B, 256, H, W]
    gp = F.adaptive_avg_pool2d(x, 1)   # (B, 320, 1, 1)
    gp = self.branch5_conv(gp)         # => (B, 256, 1, 1)
    gp = self.branch5_bn(gp)
    gp = self.branch5_relu(gp)
    gp = F.interpolate(gp, size=(H, W), mode='bilinear', align_corners=False)

    # 拼接 [out1, out2_agg, out3_agg, out4_agg, gp] => [B, 256*5, H, W]
    out_cat = torch.cat([out1, out2_agg, out3_agg, out4_agg, gp], dim=1)

    # 融合 => [B, 256, H, W]
    out = self.fuse_conv(out_cat)
    return out
这是别人文章中设计的WA-ASPP,我不想和别人完全一样,给一些合理的修改意见,并且不要太复杂
可以考虑在以下几个方面做些小调整,既不大幅增加复杂度,又能与原始设计有所区分:
替换部分 1×1 卷积为轻量化卷积
调整最终融合方式
在全局池化分支添加微小变化
上述改动大都仅在卷积类型或融合环节做小规模调整,不会影响整体结构和复杂度太多,却能与原方案保持一定差异,并在某些场景下带来轻微的性能增益或更好的可解释性。
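下面给出一个与上述三点对应的简单示意(仅为可行做法之一,DWSeparable、分组数等均为示例中的假设,并非某篇文章的原始实现):

```python
import torch.nn as nn

# 轻量化的深度可分离卷积,用来替换部分 1×1 投影 / 融合卷积(示例定义)
class DWSeparable(nn.Sequential):
    def __init__(self, inp, oup):
        super().__init__(
            nn.Conv2d(inp, inp, 3, 1, 1, groups=inp, bias=False),  # depthwise
            nn.Conv2d(inp, oup, 1, bias=False),                    # pointwise
            nn.BatchNorm2d(oup),
            nn.ReLU(inplace=True),
        )

# 1) 分支投影:把普通 1×1 Conv 换成分组 1×1 Conv(参数量约降为 1/groups)
branch_proj = nn.Conv2d(320, 256, kernel_size=1, groups=4, bias=False)

# 2) 最终融合:用深度可分离卷积替换 1×1 融合卷积,兼顾跨通道与局部空间信息
fuse_conv = DWSeparable(256 * 5, 256)

# 3) 全局池化分支:在 1×1 卷积后追加一个极轻量的 3×3 深度卷积,微调该分支的表达
branch5_extra = nn.Conv2d(256, 256, kernel_size=3, padding=1, groups=256, bias=False)
```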
在 WA-ASPP 中,我们保留第一个分支和第五个分支的原始平均池化。中间的三个分支被设计成一个窗口注意力层、一个二维批量归一化层、一个激活函数、一个二维卷积层、另一个二维批量归一化层,最后是另一个激活函数。其结构如图 5 所示。这种设计不仅保持了输入图像的整体特性,而且提高了计算效率并增强了局部相关性。其次,我们将窗口注意力层在三个分支上的窗口大小分别设置为4×4、8×8、16×16,激活函数统一使用GELU。
这是文章中设计的WA-ASPP中间三层的结构,和我代码中的一样嘛
从您提供的代码和文章中对 WA-ASPP 中“三个中间分支”的描述来看,二者存在如下差异:
1. 激活函数:代码中使用的是 nn.Conv2d + (BatchNorm2d) + nn.ReLU(或者 inplace=True 的 ReLU),并没有统一替换成 GELU。
2. BN 与激活的先后顺序:代码在注意力之后只做一次 1×1 conv -> BN -> ReLU 投影,而不是外层再配合两次 BN + 两次激活。
3. WindowTransformerBlock 与文章的“窗口注意力+BN+激活+卷积+BN+激活”并不完全一致。
4. 运算流程上的对比。
综上,您文章中对“三个中间分支”的描述,和现在代码实际所实现的 WindowTransformerBlock + 1×1 Conv + ReLU 并不 完全 一致:文章更多是一个“注意力 + CNN”风格的管线,统一使用了 GELU 并两次应用 BN + 激活,而代码则将注意力部分替换为 TransformerBlock 形式,并且在投影时采用了 ReLU。尽管二者都引入了局部/窗口注意力并具备类似的多分支思路,但在激活函数、BN 次数及顺序、以及具体实现细节上仍有较大差异。
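如果希望让代码与文章描述的中间分支结构更接近,可以按下面的草图调整层的排列(WindowTransformerBlock 仍沿用您代码中的实现,其余层次仅为示意):

```python
import torch.nn as nn

class PaperStyleBranch(nn.Module):
    """示意:窗口注意力 -> BN -> GELU -> 3×3 Conv -> BN -> GELU。
    attn_layer 需传入一个输入输出通道均为 channels 的窗口注意力模块。"""
    def __init__(self, attn_layer, channels, out_channels):
        super().__init__()
        self.attn = attn_layer
        self.post = nn.Sequential(
            nn.BatchNorm2d(channels),
            nn.GELU(),
            nn.Conv2d(channels, out_channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.GELU(),
        )

    def forward(self, x):
        return self.post(self.attn(x))

# 用法示意:PaperStyleBranch(branch2_attn, 320, 256),其中 branch2_attn 为您代码中的 WindowTransformerBlock 实例
```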
我现在要说明我的数据集,我的数据集是钢丝绳损伤数据集共970张,其中包括断丝和损伤两种损伤类型,在数据增强与剔除模糊图片后有4805张,训练集、验证集和测试集按照 8:1:1 进行划分,图片像素大小不一,但输入模型的图片大小为512×512,请模仿以下话术,要有较大差异并且具有专业性,介绍我的数据集。
为了创建一个高分辨率的地铁隧道裂缝数据集,其中包含大量样本和隧道衬砌表面病害的综合数据,目前的研究主要集中在地铁盾构隧道上。收集图像后,对其进行仔细检查和组合,总共产生了 580 张衬里裂纹图像。这些图像的像素大小为 512×512,具有不同的背景,如图 5 所示。背景包括光照、水渍、管道、剥落、划痕和其他形式的噪声干扰的变化。这些图像准确地代表了现实生活中的隧道场景,并保证了训练模型的稳健性。
数据集中收集的数据会经历空间变换(例如旋转、镜像和缩放)以及像素级处理(包括增亮、变暗和模糊)。应用这些转换是为了增强样本的多样性并防止在模型训练过程中过度拟合。图 6 显示了经过这些增强后的图像。因此,地铁隧道裂缝数据集中的图像数量扩展到 8700 张。训练集、验证集和测试集按照 8:1:1 进行划分。这种划分可确保在大型和多样化的数据集上训练模型,同时维护一组单独的图像以用于评估目的。对于训练集和验证集,使用 Labelme 软件标记裂纹区域。
为充分覆盖钢丝绳的断丝及其他损伤形态,本研究先在现场采集了 970 张原始图像,并按照其清晰度剔除模糊样本,随后采用旋转、镜像、缩放及亮度调节等多种数据增强方法,将数据量扩增至 4805 张。由于原始图像像素尺度不尽相同,为实现训练过程的可控性与统一性,所有样本在输入网络前都被规范化至 512×512 分辨率。最后,为在模型评估时保持独立性与数据多样性,按照 8:1:1 的比例将所有图像划分为训练集、验证集和测试集,既能保证训练中对多种损伤形态的充分学习,也能在测试阶段提供客观检验,确保模型在识别钢丝绳断丝与损伤的各种场景中均能保持较高的稳健性。
内容再丰富一下,不要指明数据集来源
为充分覆盖钢丝绳的断丝与损伤形态,本研究首先采集了 970 张钢丝绳图像,涵盖了多种背景与不同程度的磨损情况。在对采集到的样本进行初步清晰度筛选后,为增强模型在识别多样损伤场景时的稳健性,还通过旋转、镜像、缩放以及亮度与模糊等像素级处理方式对图像进行了增广操作,使得最终可用样本数量扩展至 4805 张。由于原始图像分辨率各异,为实现统一的网络输入标准和保证训练的可控性,所有数据均被调整为 512×512 分辨率。最后,为同时兼顾训练的充分性与模型评估的独立性,将数据集按照 8:1:1 的比例划分为训练集、验证集和测试集,以便在绝大多数样本上进行模型参数优化,并保留足够数量的独立图像用于性能验证和测试。这样一来,经过多重预处理与划分的钢丝绳损伤数据集不仅涵盖了丰富的损伤类型,也确保了模型对真实工业环境下不同背景和复杂度的适应能力。
本研究数据集包含断丝和锈蚀共 970 张钢丝绳损伤图像,涵盖了多种背景与不同程度的损伤情况。按照其清晰度剔除模糊样本,随后采用旋转、镜像、缩放及亮度调节等多种数据增强方法,将数据量扩增至 4805 张,其中磨损1828张,断丝3092张,两种损伤均包含的有115张。由于原始图像分辨率各异,为实现统一的网络输入标准和保证训练的可控性,所有数据在输入网络前均被调整为 512×512 分辨率。数据集按照 8:1:1 的比例将所有图像划分为训练集、验证集和测试集,既能保证训练中对多种损伤形态的充分学习,也能在测试阶段提供客观检验,确保模型在识别钢丝绳断丝与损伤的各种场景中均能保持较高的稳健性。
润色一下
本研究的数据集共包含 970 张钢丝绳损伤图像,涵盖断丝和锈蚀两类问题,且背景环境及损伤程度多样。基于清晰度筛选剔除了模糊样本,并通过旋转、镜像、缩放及亮度调节等多种数据增强技术将数据量扩增至 4805 张,其中磨损类有 1828 张、断丝类 3092 张,另有 115 张同时包含两类损伤。为保证统一的网络输入标准并提高训练过程的稳定性,所有图像均在输入前被调整至 512×512 分辨率。最后,按照 8:1:1 的比例将图像划分为训练集、验证集和测试集,既能确保模型在多种损伤形态上获得充分学习,也能在测试时更客观地检验模型在不同背景场景下识别钢丝绳损伤的稳健性。
| Model | mIoU (%) | F1 Score (%) | FPS | Parameter (M) |
| --- | --- | --- | --- | --- |
| Xception | 65.55 | 76.78 | 45.51 | 54.71 |
| VGG16 | 66.84 | 77.89 | 46.98 | 20.15 |
| MobileNetV2 | 65.67 | 77.27 | 90.48 | 5.81 |
| MobileNetV4 | 64.23 | 75.73 | 73.26 | 31.79 |
| Swin Transformer v2 | 58.30 | 70.95 | 43.43 | 58.21 |
| MobileViT | 64.14 | 75.34 | 75.61 | 5.04 |
这是我的主干网络对比实验,我最终选择mobilenetv2作为主干用于提出的SFF-WAdeeplab网络,现在模仿以下论文的文笔,写出对主干网络对比实验的描述,要求有较大差异,并且专业
特征提取在语义分割任务中起着至关重要的作用,而 DeepLabV3+ 以其在这方面的卓越能力而闻名。但是,它需要大量的参数,这会导致更高的计算资源使用率。为了比较不同骨干网络的图像分割精度,在当前研究中对 Xception 、 MobileNetV2 、 EfficientNetV2 和优化的 EfficientNetV2 进行了对比实验。对这些网络适应性的评估包括分析它们的 mIoU、FPS 和参数大小。
表 1 显示了不同骨干网的分段性能。很明显,使用优化的 EfficientNetV2 作为骨干网络实现了 83.15% 的高 mIoU,处理速度为 13.84 f/s,参数为 92.03 M。相比之下,Xception 的分割精度略低(比优化的 EfficientNetV2 低 0.88%),处理速度相似,但参数数量明显更多。另一方面,MobileNetV2 需要的参数更少,图像处理速度更快,但与优化的 EfficientNetV2 相比,分割精度降低了 2.91%。这表明,优化后的 EfficientNetV2 在更高的特征提取精度、更快的图像处理速度和降低模型复杂性方面被证明是有利的。因此,优化后的 EfficientNetV2 将用于新提出的 EDeepLab。
如表所示,在对多种主干网络进行对比后可以发现,Xception 在分割精度方面(mIoU=65.55%)表现尚可,但其参数规模(54.71 M)相对较高,导致推理开销偏大;VGG16 以 66.84% 的 mIoU 和 20.15 M 的参数量在精度和复杂度之间取得了一定平衡,不过其推理速度(46.98 FPS)仍不及轻量化网络;MobileNetV4 和 MobileViT 均拥有较为适中的分割性能与中等或偏小的模型规模,但在推理速度和最终准确率上与更出色的主干仍有差距;Swin Transformer v2 由于其 Transformer 结构的特征提取机制,虽具备一定的跨区域建模能力,但 mIoU 仅达到 58.30%,并且参数量(58.21 M)和推理速度(43.43 FPS)均不占优势。相比之下,MobileNetV2 虽然分割精度(65.67%)略低于 VGG16 和 Xception,但其极少的参数需求(5.81 M)与远高于其他网络的推理速度(90.48 FPS)在整体性能上展现出了明显优势。综合考虑分割精度、推理速度以及模型复杂度,本研究最终选择 MobileNetV2 作为新网络的主干,为后续模块的特征提取与融合提供了相对高效且准确的基础。
mobilenetv2.py
import math
import os
import torch
import torch.nn as nn
import torch.utils.model_zoo as model_zoo
BatchNorm2d = nn.BatchNorm2d
def conv_bn(inp, oup, stride):
return nn.Sequential(
nn.Conv2d(inp, oup, 3, stride, 1, bias=False),
BatchNorm2d(oup),
nn.ReLU6(inplace=True)
)
def conv_1x1_bn(inp, oup):
return nn.Sequential(
nn.Conv2d(inp, oup, 1, 1, 0, bias=False),
BatchNorm2d(oup),
nn.ReLU6(inplace=True)
)
class InvertedResidual(nn.Module):
def __init__(self, inp, oup, stride, expand_ratio):
super(InvertedResidual, self).__init__()
self.stride = stride
assert stride in [1, 2]
texthidden_dim = round(inp * expand_ratio) self.use_res_connect = self.stride == 1 and inp == oup if expand_ratio == 1: self.conv = nn.Sequential( #--------------------------------------------# # 进行3x3的逐层卷积,进行跨特征点的特征提取 #--------------------------------------------# nn.Conv2d(hidden_dim, hidden_dim, 3, stride, 1, groups=hidden_dim, bias=False), BatchNorm2d(hidden_dim), nn.ReLU6(inplace=True), #-----------------------------------# # 利用1x1卷积进行通道数的调整 #-----------------------------------# nn.Conv2d(hidden_dim, oup, 1, 1, 0, bias=False), BatchNorm2d(oup), ) else: self.conv = nn.Sequential( #-----------------------------------# # 利用1x1卷积进行通道数的上升 #-----------------------------------# nn.Conv2d(inp, hidden_dim, 1, 1, 0, bias=False), BatchNorm2d(hidden_dim), nn.ReLU6(inplace=True), #--------------------------------------------# # 进行3x3的逐层卷积,进行跨特征点的特征提取 #--------------------------------------------# nn.Conv2d(hidden_dim, hidden_dim, 3, stride, 1, groups=hidden_dim, bias=False), BatchNorm2d(hidden_dim), nn.ReLU6(inplace=True), #-----------------------------------# # 利用1x1卷积进行通道数的下降 #-----------------------------------# nn.Conv2d(hidden_dim, oup, 1, 1, 0, bias=False), BatchNorm2d(oup), ) def forward(self, x): if self.use_res_connect: return x + self.conv(x) else: return self.conv(x)
class MobileNetV2(nn.Module):
def __init__(self, n_class=1000, input_size=224, width_mult=1.):
super(MobileNetV2, self).__init__()
block = InvertedResidual
input_channel = 32
last_channel = 1280
interverted_residual_setting = [
# t, c, n, s
[1, 16, 1, 1], # 256, 256, 32 -> 256, 256, 16
[6, 24, 2, 2], # 256, 256, 16 -> 128, 128, 24 2
[6, 32, 3, 2], # 128, 128, 24 -> 64, 64, 32 4
[6, 64, 4, 2], # 64, 64, 32 -> 32, 32, 64 7
[6, 96, 3, 1], # 32, 32, 64 -> 32, 32, 96
[6, 160, 3, 2], # 32, 32, 96 -> 16, 16, 160 14
[6, 320, 1, 1], # 16, 16, 160 -> 16, 16, 320
]
textassert input_size % 32 == 0 input_channel = int(input_channel * width_mult) self.last_channel = int(last_channel * width_mult) if width_mult > 1.0 else last_channel # 512, 512, 3 -> 256, 256, 32 self.features = [conv_bn(3, input_channel, 2)] for t, c, n, s in interverted_residual_setting: output_channel = int(c * width_mult) for i in range(n): if i == 0: self.features.append(block(input_channel, output_channel, s, expand_ratio=t)) else: self.features.append(block(input_channel, output_channel, 1, expand_ratio=t)) input_channel = output_channel self.features.append(conv_1x1_bn(input_channel, self.last_channel)) self.features = nn.Sequential(*self.features) self.classifier = nn.Sequential( nn.Dropout(0.2), nn.Linear(self.last_channel, n_class), ) self._initialize_weights() def forward(self, x): x = self.features(x) x = x.mean(3).mean(2) x = self.classifier(x) return x def _initialize_weights(self): for m in self.modules(): if isinstance(m, nn.Conv2d): n = m.kernel_size[0] * m.kernel_size[1] * m.out_channels m.weight.data.normal_(0, math.sqrt(2. / n)) if m.bias is not None: m.bias.data.zero_() elif isinstance(m, BatchNorm2d): m.weight.data.fill_(1) m.bias.data.zero_() elif isinstance(m, nn.Linear): n = m.weight.size(1) m.weight.data.normal_(0, 0.01) m.bias.data.zero_()
def load_url(url, model_dir='./model_data', map_location=None):
if not os.path.exists(model_dir):
os.makedirs(model_dir)
filename = url.split('/')[-1]
cached_file = os.path.join(model_dir, filename)
if os.path.exists(cached_file):
return torch.load(cached_file, map_location=map_location)
else:
return model_zoo.load_url(url,model_dir=model_dir)
def mobilenetv2(pretrained=False, **kwargs):
model = MobileNetV2(n_class=1000, **kwargs)
if pretrained:
model.load_state_dict(load_url('https://github.com/bubbliiiing/deeplabv3-plus-pytorch/releases/download/v1.0/mobilenet_v2.pth.tar'), strict=False)
return model
if __name__ == "__main__":
model = mobilenetv2()
for i, layer in enumerate(model.features):
print(i, layer)
GGCA.py
import torch
from torch import nn
class GGCA(nn.Module): #(Global Grouped Coordinate Attention) 全局分组坐标注意力
def __init__(self, channel, h, w, reduction=16, num_groups=4):
super(GGCA, self).__init__()
self.num_groups = num_groups # 分组数
self.group_channels = channel // num_groups # 每组的通道数
self.h = h # 高度方向的特定尺寸
self.w = w # 宽度方向的特定尺寸
text# 定义H方向的全局平均池化和最大池化 self.avg_pool_h = nn.AdaptiveAvgPool2d((h, 1)) # 输出大小为(h, 1) self.max_pool_h = nn.AdaptiveMaxPool2d((h, 1)) # 定义W方向的全局平均池化和最大池化 self.avg_pool_w = nn.AdaptiveAvgPool2d((1, w)) # 输出大小为(1, w) self.max_pool_w = nn.AdaptiveMaxPool2d((1, w)) # 定义共享的卷积层,用于通道间的降维和恢复 self.shared_conv = nn.Sequential( nn.Conv2d(in_channels=self.group_channels, out_channels=self.group_channels // reduction, kernel_size=(1, 1)), nn.BatchNorm2d(self.group_channels // reduction), nn.ReLU(inplace=True), nn.Conv2d(in_channels=self.group_channels // reduction, out_channels=self.group_channels, kernel_size=(1, 1)) ) # 定义sigmoid激活函数 self.sigmoid_h = nn.Sigmoid() self.sigmoid_w = nn.Sigmoid() def forward(self, x): batch_size, channel, height, width = x.size() # 确保通道数可以被分组数整除,一般分组数,要选择整数,不然不能被整除。而且是小一点.groups选择4挺好。 assert channel % self.num_groups == 0, "The number of channels must be divisible by the number of groups." # 将输入特征图按通道数分组 x = x.view(batch_size, self.num_groups, self.group_channels, height, width) # 分别在H方向进行全局平均池化和最大池化 x_h_avg = self.avg_pool_h(x.view(batch_size * self.num_groups, self.group_channels, height, width)).view( batch_size, self.num_groups, self.group_channels, self.h, 1) x_h_max = self.max_pool_h(x.view(batch_size * self.num_groups, self.group_channels, height, width)).view( batch_size, self.num_groups, self.group_channels, self.h, 1) # 分别在W方向进行全局平均池化和最大池化 x_w_avg = self.avg_pool_w(x.view(batch_size * self.num_groups, self.group_channels, height, width)).view( batch_size, self.num_groups, self.group_channels, 1, self.w) x_w_max = self.max_pool_w(x.view(batch_size * self.num_groups, self.group_channels, height, width)).view( batch_size, self.num_groups, self.group_channels, 1, self.w) # 应用共享卷积层进行特征处理 y_h_avg = self.shared_conv(x_h_avg.view(batch_size * self.num_groups, self.group_channels, self.h, 1)) y_h_max = self.shared_conv(x_h_max.view(batch_size * self.num_groups, self.group_channels, self.h, 1)) y_w_avg = self.shared_conv(x_w_avg.view(batch_size * self.num_groups, self.group_channels, 1, self.w)) y_w_max = self.shared_conv(x_w_max.view(batch_size * self.num_groups, self.group_channels, 1, self.w)) # 计算注意力权重 att_h = self.sigmoid_h(y_h_avg + y_h_max).view(batch_size, self.num_groups, self.group_channels, self.h, 1) att_w = self.sigmoid_w(y_w_avg + y_w_max).view(batch_size, self.num_groups, self.group_channels, 1, self.w) # 应用注意力权重 out = x * att_h * att_w out = out.view(batch_size, channel, height, width) return out
将GGCA模块嵌入mobilenetv2中的倒残差块中的深层卷积后,给出完整代码,带中文注释
下面给出一个示例,将 GGCA(Global Grouped Coordinate Attention)模块嵌入到 MobileNetV2 的倒残差结构(InvertedResidual)中的 3×3 深度卷积之后,供参考使用。示例代码基于您提供的 mobilenetv2.py 与 GGCA.py 做了少量改动,实现了在深度可分离卷积阶段插入 GGCA 注意力。为简化说明,此处采用动态获取特征图的大小,以便 GGCA 可以适应不同输入分辨率。代码示例如下(可放置于同一个 .py 文件中):
pythonimport math import os import torch import torch.nn as nn import torch.nn.functional as F import torch.utils.model_zoo as model_zoo #---------------------------------------------------# # 这里的 BatchNorm2d 定义可视需要进行替换 # 例如可换为 SyncBatchNorm 或其他自定义BN #---------------------------------------------------# BatchNorm2d = nn.BatchNorm2d def conv_bn(inp, oup, stride): """ 标准3x3卷积 + BN + ReLU6,用于 MobileNetV2 首层 inp : 输入通道数 oup : 输出通道数 stride: 步幅 """ return nn.Sequential( nn.Conv2d(inp, oup, 3, stride, 1, bias=False), BatchNorm2d(oup), nn.ReLU6(inplace=True) ) def conv_1x1_bn(inp, oup): """ 标准1x1卷积 + BN + ReLU6,用于 MobileNetV2 最后升维 """ return nn.Sequential( nn.Conv2d(inp, oup, 1, 1, 0, bias=False), BatchNorm2d(oup), nn.ReLU6(inplace=True) ) #---------------------------------------------------# # 动态版 GGCA,用于在 H、W 不固定时的注意力机制 # 不再在 __init__ 里固化 H, W,而是在 forward 中 # 根据输入张量的实际高宽进行自适应池化。 #---------------------------------------------------# class GGCA(nn.Module): """ Global Grouped Coordinate Attention (动态版): - 通道分组: 将通道分为 num_groups 组 - 在 H、W 两个方向分别做全局池化(Avg+Max) - 使用共享的 1x1 卷积序列对池化结果进行降维+激活 - 在高度和宽度两个维度上生成注意力权重 - 与原特征 x 逐元素相乘 """ def __init__(self, channel, reduction=16, num_groups=4): super(GGCA, self).__init__() self.num_groups = num_groups self.group_channels= channel // num_groups self.reduction = reduction #-------------------------------------------------# # 用于对 (group_channels) -> (group_channels//reduction) -> (group_channels) # 做降维和升维 #-------------------------------------------------# self.shared_conv = nn.Sequential( nn.Conv2d(self.group_channels, self.group_channels // reduction, kernel_size=1), nn.BatchNorm2d(self.group_channels // reduction), nn.ReLU(inplace=True), nn.Conv2d(self.group_channels // reduction, self.group_channels, kernel_size=1) ) self.sigmoid_h = nn.Sigmoid() self.sigmoid_w = nn.Sigmoid() def forward(self, x): """ x shape: (B, C, H, W) 在本函数内动态获取 H, W => 使用自适应池化 """ B, C, H, W = x.shape assert C % self.num_groups == 0, "通道数必须能被分组数整除" #-------------------------------------------------# # 将特征图按照通道维度拆分成 num_groups 份 # 例如: (B, C, H, W) -> (B, num_groups, group_channels, H, W) #-------------------------------------------------# x_grouped = x.view(B, self.num_groups, self.group_channels, H, W) #------------------------------# # 在高度维度进行自适应平均/最大池化 # 输出大小 (H, 1), 保留H这个维度 #------------------------------# x_h_avg = F.adaptive_avg_pool2d(x_grouped.view(B*self.num_groups, self.group_channels, H, W), output_size=(H, 1)) x_h_avg = x_h_avg.view(B, self.num_groups, self.group_channels, H, 1) x_h_max = F.adaptive_max_pool2d(x_grouped.view(B*self.num_groups, self.group_channels, H, W), output_size=(H, 1)) x_h_max = x_h_max.view(B, self.num_groups, self.group_channels, H, 1) #------------------------------# # 在宽度维度进行自适应平均/最大池化 # 输出大小 (1, W) #------------------------------# x_w_avg = F.adaptive_avg_pool2d(x_grouped.view(B*self.num_groups, self.group_channels, H, W), output_size=(1, W)) x_w_avg = x_w_avg.view(B, self.num_groups, self.group_channels, 1, W) x_w_max = F.adaptive_max_pool2d(x_grouped.view(B*self.num_groups, self.group_channels, H, W), output_size=(1, W)) x_w_max = x_w_max.view(B, self.num_groups, self.group_channels, 1, W) #------------------------------# # 将池化结果分别过共享的卷积层 # 得到 y_h, y_w #------------------------------# y_h_avg = self.shared_conv(x_h_avg.view(B*self.num_groups, self.group_channels, H, 1)) y_h_max = self.shared_conv(x_h_max.view(B*self.num_groups, self.group_channels, H, 1)) y_w_avg = self.shared_conv(x_w_avg.view(B*self.num_groups, self.group_channels, 1, W)) y_w_max 
= self.shared_conv(x_w_max.view(B*self.num_groups, self.group_channels, 1, W)) #------------------------------# # 相加后做 Sigmoid 激活 #------------------------------# att_h = self.sigmoid_h(y_h_avg + y_h_max).view(B, self.num_groups, self.group_channels, H, 1) att_w = self.sigmoid_w(y_w_avg + y_w_max).view(B, self.num_groups, self.group_channels, 1, W) #------------------------------# # 将注意力权重乘回到原特征 #------------------------------# out = x_grouped * att_h * att_w out = out.view(B, C, H, W) return out #---------------------------------------------------# # 倒残差结构,嵌入 GGCA #---------------------------------------------------# class InvertedResidual(nn.Module): """ 将 GGCA 加入到 3x3深度卷积之后 结构: 1) 如果 expand_ratio=1 : (DWConv3x3 -> BN -> ReLU6 -> GGCA -> Conv1x1 -> BN) 2) 如果 expand_ratio!=1 : (Conv1x1 -> BN -> ReLU6 -> DWConv3x3 -> BN -> ReLU6 -> GGCA -> Conv1x1 -> BN) """ def __init__(self, inp, oup, stride, expand_ratio, use_ggca=True, # 是否使用GGCA ggca_groups=4, # GGCA分组数 ggca_reduction=16): super(InvertedResidual, self).__init__() self.stride = stride assert stride in [1, 2] hidden_dim = round(inp * expand_ratio) self.use_res_connect = (self.stride == 1 and inp == oup) self.use_ggca = use_ggca #------------------------------# # 若 expand_ratio=1, 则跳过升维那一步 #------------------------------# if expand_ratio == 1: # 深度可分离卷积 + GGCA + PW layers = [] layers.append( nn.Conv2d(hidden_dim, hidden_dim, 3, stride, 1, groups=hidden_dim, bias=False) ) layers.append(BatchNorm2d(hidden_dim)) layers.append(nn.ReLU6(inplace=True)) # 加入GGCA if self.use_ggca: layers.append(GGCA(channel=hidden_dim, num_groups=ggca_groups, reduction=ggca_reduction)) layers.append( nn.Conv2d(hidden_dim, oup, 1, 1, 0, bias=False) ) layers.append(BatchNorm2d(oup)) self.conv = nn.Sequential(*layers) else: # 1x1 卷积升维 + 深度可分离卷积 + GGCA + 1x1卷积降维 layers = [] # 升维 layers.append( nn.Conv2d(inp, hidden_dim, 1, 1, 0, bias=False) ) layers.append(BatchNorm2d(hidden_dim)) layers.append(nn.ReLU6(inplace=True)) # 深度可分离卷积 layers.append( nn.Conv2d(hidden_dim, hidden_dim, 3, stride, 1, groups=hidden_dim, bias=False) ) layers.append(BatchNorm2d(hidden_dim)) layers.append(nn.ReLU6(inplace=True)) # 加入GGCA if self.use_ggca: layers.append(GGCA(channel=hidden_dim, num_groups=ggca_groups, reduction=ggca_reduction)) # 降维 layers.append( nn.Conv2d(hidden_dim, oup, 1, 1, 0, bias=False) ) layers.append(BatchNorm2d(oup)) self.conv = nn.Sequential(*layers) def forward(self, x): if self.use_res_connect: return x + self.conv(x) else: return self.conv(x) #---------------------------------------------------# # 整合了GGCA的MobileNetV2 # 在倒残差结构里插入GGCA #---------------------------------------------------# class MobileNetV2_GGCA(nn.Module): def __init__(self, n_class=1000, input_size=224, width_mult=1., use_ggca=True, ggca_groups=4, ggca_reduction=16): """ use_ggca, ggca_groups, ggca_reduction 用于控制GGCA """ super(MobileNetV2_GGCA, self).__init__() block = InvertedResidual input_channel = 32 last_channel = 1280 interverted_residual_setting = [ # t, c, n, s [1, 16, 1, 1], # 第1个倒残差 [6, 24, 2, 2], # 第2个倒残差 [6, 32, 3, 2], # 第3个倒残差 [6, 64, 4, 2], # 第4个倒残差 [6, 96, 3, 1], # 第5个倒残差 [6, 160,3, 2], # 第6个倒残差 [6, 320,1, 1], # 第7个倒残差 ] assert input_size % 32 == 0 input_channel = int(input_channel * width_mult) self.last_channel = int(last_channel * width_mult) if width_mult > 1.0 else last_channel # 首层: 3 -> 32 (conv_bn) self.features = [conv_bn(3, input_channel, 2)] # 构建倒残差层 for t, c, n, s in interverted_residual_setting: output_channel = int(c * width_mult) for i in range(n): if i == 0: # 使用 stride = 
s self.features.append( block(inp=input_channel, oup=output_channel, stride=s, expand_ratio=t, use_ggca=use_ggca, ggca_groups=ggca_groups, ggca_reduction=ggca_reduction) ) else: # 使用 stride = 1 self.features.append( block(inp=input_channel, oup=output_channel, stride=1, expand_ratio=t, use_ggca=use_ggca, ggca_groups=ggca_groups, ggca_reduction=ggca_reduction) ) input_channel = output_channel # 最后一层: 1x1 卷积把通道升到 1280 self.features.append(conv_1x1_bn(input_channel, self.last_channel)) self.features = nn.Sequential(*self.features) # 分类器部分 self.classifier = nn.Sequential( nn.Dropout(0.2), nn.Linear(self.last_channel, n_class), ) self._initialize_weights() def forward(self, x): # 前面部分: 卷积特征提取 x = self.features(x) # 池化 + 全连接 x = x.mean(dim=3).mean(dim=2) # Global Average Pooling x = self.classifier(x) return x def _initialize_weights(self): """ 权值初始化,与官方MobileNetV2类似 """ for m in self.modules(): if isinstance(m, nn.Conv2d): n = m.kernel_size[0] * m.kernel_size[1] * m.out_channels m.weight.data.normal_(0, math.sqrt(2. / n)) if m.bias is not None: m.bias.data.zero_() elif isinstance(m, BatchNorm2d): m.weight.data.fill_(1) m.bias.data.zero_() elif isinstance(m, nn.Linear): n = m.weight.size(1) m.weight.data.normal_(0, 0.01) m.bias.data.zero_() def load_url(url, model_dir='./model_data', map_location=None): if not os.path.exists(model_dir): os.makedirs(model_dir) filename = url.split('/')[-1] cached_file = os.path.join(model_dir, filename) if os.path.exists(cached_file): return torch.load(cached_file, map_location=map_location) else: return model_zoo.load_url(url, model_dir=model_dir) def mobilenetv2_ggca(pretrained=False, **kwargs): """ 构建带GGCA的MobileNetV2,可根据需要加载预训练权重 """ model = MobileNetV2_GGCA(n_class=1000, **kwargs) if pretrained: # 这里仅作演示,和原版预训练权重可能不完全匹配 model.load_state_dict( load_url('https://github.com/bubbliiiing/deeplabv3-plus-pytorch/releases/download/v1.0/mobilenet_v2.pth.tar'), strict=False ) return model if __name__ == "__main__": # 测试 net = mobilenetv2_ggca(pretrained=False, use_ggca=True, ggca_groups=4, ggca_reduction=16) print(net) x_test = torch.randn(1, 3, 224, 224) y_test = net(x_test) print("输出维度:", y_test.shape)
代码要点说明:

- GGCA 模块(动态版):改用 F.adaptive_avg_pool2d 和 F.adaptive_max_pool2d,根据当前输入的实际大小进行全局池化。
- InvertedResidual:通过 use_ggca、ggca_groups、ggca_reduction 控制是否启用注意力及注意力的分组与降维比。
- MobileNetV2_GGCA:将原先的 InvertedResidual 替换为带 GGCA 的实现。

通过这种方式,GGCA 模块就被嵌入到了 MobileNetV2 的倒置残差结构的深度卷积阶段,用以增强模型在高度、宽度维度对全局依赖的捕捉与建模能力,同时保留 MobileNetV2 轻量化的特性。
import math
import os
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.utils.model_zoo as model_zoo
#---------------------------------------------------#
#---------------------------------------------------#
BatchNorm2d = nn.BatchNorm2d
def conv_bn(inp, oup, stride):
"""
标准3x3卷积 + BN + ReLU6,用于 MobileNetV2 首层
inp : 输入通道数
oup : 输出通道数
stride: 步幅
"""
return nn.Sequential(
nn.Conv2d(inp, oup, 3, stride, 1, bias=False),
BatchNorm2d(oup),
nn.ReLU6(inplace=True)
)
def conv_1x1_bn(inp, oup):
"""
标准1x1卷积 + BN + ReLU6,用于 MobileNetV2 最后升维
"""
return nn.Sequential(
nn.Conv2d(inp, oup, 1, 1, 0, bias=False),
BatchNorm2d(oup),
nn.ReLU6(inplace=True)
)
#---------------------------------------------------#
#---------------------------------------------------#
class GGCA(nn.Module):
"""
Global Grouped Coordinate Attention (动态版):
- 通道分组: 将通道分为 num_groups 组
- 在 H、W 两个方向分别做全局池化(Avg+Max)
- 使用共享的 1x1 卷积序列对池化结果进行降维+激活
- 在高度和宽度两个维度上生成注意力权重
- 与原特征 x 逐元素相乘
"""
def __init__(self, channel, reduction=16, num_groups=4):
super(GGCA, self).__init__()
self.num_groups = num_groups
self.group_channels= channel // num_groups
self.reduction = reduction
text#-------------------------------------------------# # 用于对 (group_channels) -> (group_channels//reduction) -> (group_channels) # 做降维和升维 #-------------------------------------------------# self.shared_conv = nn.Sequential( nn.Conv2d(self.group_channels, self.group_channels // reduction, kernel_size=1), nn.BatchNorm2d(self.group_channels // reduction), nn.ReLU(inplace=True), nn.Conv2d(self.group_channels // reduction, self.group_channels, kernel_size=1) ) self.sigmoid_h = nn.Sigmoid() self.sigmoid_w = nn.Sigmoid() def forward(self, x): """ x shape: (B, C, H, W) 在本函数内动态获取 H, W => 使用自适应池化 """ B, C, H, W = x.shape assert C % self.num_groups == 0, "通道数必须能被分组数整除" #-------------------------------------------------# # 将特征图按照通道维度拆分成 num_groups 份 # 例如: (B, C, H, W) -> (B, num_groups, group_channels, H, W) #-------------------------------------------------# x_grouped = x.view(B, self.num_groups, self.group_channels, H, W) #------------------------------# # 在高度维度进行自适应平均/最大池化 # 输出大小 (H, 1), 保留H这个维度 #------------------------------# x_h_avg = F.adaptive_avg_pool2d(x_grouped.view(B*self.num_groups, self.group_channels, H, W), output_size=(H, 1)) x_h_avg = x_h_avg.view(B, self.num_groups, self.group_channels, H, 1) x_h_max = F.adaptive_max_pool2d(x_grouped.view(B*self.num_groups, self.group_channels, H, W), output_size=(H, 1)) x_h_max = x_h_max.view(B, self.num_groups, self.group_channels, H, 1) #------------------------------# # 在宽度维度进行自适应平均/最大池化 # 输出大小 (1, W) #------------------------------# x_w_avg = F.adaptive_avg_pool2d(x_grouped.view(B*self.num_groups, self.group_channels, H, W), output_size=(1, W)) x_w_avg = x_w_avg.view(B, self.num_groups, self.group_channels, 1, W) x_w_max = F.adaptive_max_pool2d(x_grouped.view(B*self.num_groups, self.group_channels, H, W), output_size=(1, W)) x_w_max = x_w_max.view(B, self.num_groups, self.group_channels, 1, W) #------------------------------# # 将池化结果分别过共享的卷积层 # 得到 y_h, y_w #------------------------------# y_h_avg = self.shared_conv(x_h_avg.view(B*self.num_groups, self.group_channels, H, 1)) y_h_max = self.shared_conv(x_h_max.view(B*self.num_groups, self.group_channels, H, 1)) y_w_avg = self.shared_conv(x_w_avg.view(B*self.num_groups, self.group_channels, 1, W)) y_w_max = self.shared_conv(x_w_max.view(B*self.num_groups, self.group_channels, 1, W)) #------------------------------# # 相加后做 Sigmoid 激活 #------------------------------# att_h = self.sigmoid_h(y_h_avg + y_h_max).view(B, self.num_groups, self.group_channels, H, 1) att_w = self.sigmoid_w(y_w_avg + y_w_max).view(B, self.num_groups, self.group_channels, 1, W) #------------------------------# # 将注意力权重乘回到原特征 #------------------------------# out = x_grouped * att_h * att_w out = out.view(B, C, H, W) return out
#---------------------------------------------------#
#---------------------------------------------------#
class InvertedResidual(nn.Module):
"""
将 GGCA 加入到 3x3深度卷积之后
结构:
1) 如果 expand_ratio=1 : (DWConv3x3 -> BN -> ReLU6 -> GGCA -> Conv1x1 -> BN)
2) 如果 expand_ratio!=1 : (Conv1x1 -> BN -> ReLU6 -> DWConv3x3 -> BN -> ReLU6 -> GGCA -> Conv1x1 -> BN)
"""
def __init__(self, inp, oup, stride, expand_ratio,
use_ggca=True, # 是否使用GGCA
ggca_groups=4, # GGCA分组数
ggca_reduction=16):
super(InvertedResidual, self).__init__()
self.stride = stride
assert stride in [1, 2]
texthidden_dim = round(inp * expand_ratio) self.use_res_connect = (self.stride == 1 and inp == oup) self.use_ggca = use_ggca #------------------------------# # 若 expand_ratio=1, 则跳过升维那一步 #------------------------------# if expand_ratio == 1: # 深度可分离卷积 + GGCA + PW layers = [] layers.append( nn.Conv2d(hidden_dim, hidden_dim, 3, stride, 1, groups=hidden_dim, bias=False) ) layers.append(BatchNorm2d(hidden_dim)) layers.append(nn.ReLU6(inplace=True)) # 加入GGCA if self.use_ggca: layers.append(GGCA(channel=hidden_dim, num_groups=ggca_groups, reduction=ggca_reduction)) layers.append( nn.Conv2d(hidden_dim, oup, 1, 1, 0, bias=False) ) layers.append(BatchNorm2d(oup)) self.conv = nn.Sequential(*layers) else: # 1x1 卷积升维 + 深度可分离卷积 + GGCA + 1x1卷积降维 layers = [] # 升维 layers.append( nn.Conv2d(inp, hidden_dim, 1, 1, 0, bias=False) ) layers.append(BatchNorm2d(hidden_dim)) layers.append(nn.ReLU6(inplace=True)) # 深度可分离卷积 layers.append( nn.Conv2d(hidden_dim, hidden_dim, 3, stride, 1, groups=hidden_dim, bias=False) ) layers.append(BatchNorm2d(hidden_dim)) layers.append(nn.ReLU6(inplace=True)) # 加入GGCA if self.use_ggca: layers.append(GGCA(channel=hidden_dim, num_groups=ggca_groups, reduction=ggca_reduction)) # 降维 layers.append( nn.Conv2d(hidden_dim, oup, 1, 1, 0, bias=False) ) layers.append(BatchNorm2d(oup)) self.conv = nn.Sequential(*layers) def forward(self, x): if self.use_res_connect: return x + self.conv(x) else: return self.conv(x)
#---------------------------------------------------#
#---------------------------------------------------#
class MobileNetV2_GGCA(nn.Module):
def __init__(self, n_class=1000, input_size=224, width_mult=1.,
use_ggca=True, ggca_groups=4, ggca_reduction=16):
"""
use_ggca, ggca_groups, ggca_reduction 用于控制GGCA
"""
super(MobileNetV2_GGCA, self).__init__()
block = InvertedResidual
input_channel = 32
last_channel = 1280
interverted_residual_setting = [
# t, c, n, s
[1, 16, 1, 1], # 第1个倒残差
[6, 24, 2, 2], # 第2个倒残差
[6, 32, 3, 2], # 第3个倒残差
[6, 64, 4, 2], # 第4个倒残差
[6, 96, 3, 1], # 第5个倒残差
[6, 160,3, 2], # 第6个倒残差
[6, 320,1, 1], # 第7个倒残差
]
textassert input_size % 32 == 0 input_channel = int(input_channel * width_mult) self.last_channel = int(last_channel * width_mult) if width_mult > 1.0 else last_channel # 首层: 3 -> 32 (conv_bn) self.features = [conv_bn(3, input_channel, 2)] # 构建倒残差层 for t, c, n, s in interverted_residual_setting: output_channel = int(c * width_mult) for i in range(n): if i == 0: # 使用 stride = s self.features.append( block(inp=input_channel, oup=output_channel, stride=s, expand_ratio=t, use_ggca=use_ggca, ggca_groups=ggca_groups, ggca_reduction=ggca_reduction) ) else: # 使用 stride = 1 self.features.append( block(inp=input_channel, oup=output_channel, stride=1, expand_ratio=t, use_ggca=use_ggca, ggca_groups=ggca_groups, ggca_reduction=ggca_reduction) ) input_channel = output_channel # 最后一层: 1x1 卷积把通道升到 1280 self.features.append(conv_1x1_bn(input_channel, self.last_channel)) self.features = nn.Sequential(*self.features) # 分类器部分 self.classifier = nn.Sequential( nn.Dropout(0.2), nn.Linear(self.last_channel, n_class), ) self._initialize_weights() def forward(self, x): # 前面部分: 卷积特征提取 x = self.features(x) # 池化 + 全连接 x = x.mean(dim=3).mean(dim=2) # Global Average Pooling x = self.classifier(x) return x def _initialize_weights(self): """ 权值初始化,与官方MobileNetV2类似 """ for m in self.modules(): if isinstance(m, nn.Conv2d): n = m.kernel_size[0] * m.kernel_size[1] * m.out_channels m.weight.data.normal_(0, math.sqrt(2. / n)) if m.bias is not None: m.bias.data.zero_() elif isinstance(m, BatchNorm2d): m.weight.data.fill_(1) m.bias.data.zero_() elif isinstance(m, nn.Linear): n = m.weight.size(1) m.weight.data.normal_(0, 0.01) m.bias.data.zero_()
def load_url(url, model_dir='./model_data', map_location=None):
if not os.path.exists(model_dir):
os.makedirs(model_dir)
filename = url.split('/')[-1]
cached_file = os.path.join(model_dir, filename)
if os.path.exists(cached_file):
return torch.load(cached_file, map_location=map_location)
else:
return model_zoo.load_url(url, model_dir=model_dir)
def mobilenetv2_ggca(pretrained=False, **kwargs):
"""
构建带GGCA的MobileNetV2,可根据需要加载预训练权重
"""
model = MobileNetV2_GGCA(n_class=1000, **kwargs)
if pretrained:
# 这里仅作演示,和原版预训练权重可能不完全匹配
model.load_state_dict(
load_url('https://github.com/bubbliiiing/deeplabv3-plus-pytorch/releases/download/v1.0/mobilenet_v2.pth.tar'),
strict=False
)
return model
if __name__ == "__main__":
# 测试
net = mobilenetv2_ggca(pretrained=False,
use_ggca=True,
ggca_groups=4,
ggca_reduction=16)
print(net)
textx_test = torch.randn(1, 3, 224, 224) y_test = net(x_test) print("输出维度:", y_test.shape)
以下是一个示例,将 GGCA(Global Grouped Coordinate Attention) 动态注意力模块嵌入到了 MobileNetV2 倒置残差结构(InvertedResidual)中的 深度卷积 后,并在相关位置配有详细的中文注释,以便更好地理解代码逻辑。
pythonimport math import os import torch import torch.nn as nn import torch.nn.functional as F import torch.utils.model_zoo as model_zoo #---------------------------------------------------# # BatchNorm2d 可以根据需要切换为 SyncBatchNorm # 或者自定义的 BN 形式,当前使用PyTorch自带 #---------------------------------------------------# BatchNorm2d = nn.BatchNorm2d def conv_bn(inp, oup, stride): """ 标准 3x3 卷积 + BN + ReLU6,用于 MobileNetV2 的首层: ---------------------------------------------------- inp : 输入通道数 oup : 输出通道数 stride : 步幅 (当 stride=2 时,可实现下采样) ---------------------------------------------------- 返回一个 nn.Sequential: 1) 3x3卷积,padding=1 保持特征图大小或结合stride进行空间变换 2) 批归一化 (BatchNorm2d) 3) ReLU6 激活,较ReLU在数值稳定性上略有区别 """ return nn.Sequential( nn.Conv2d(inp, oup, 3, stride, 1, bias=False), BatchNorm2d(oup), nn.ReLU6(inplace=True) ) def conv_1x1_bn(inp, oup): """ 标准 1x1 卷积 + BN + ReLU6: ---------------------------------------------------- 主要用于 MobileNetV2 最后升维到更高通道数 (如 1280) 或在网络深层替换通道数时 ---------------------------------------------------- """ return nn.Sequential( nn.Conv2d(inp, oup, 1, 1, 0, bias=False), BatchNorm2d(oup), nn.ReLU6(inplace=True) ) #---------------------------------------------------# # GGCA (Global Grouped Coordinate Attention) 动态版 # 通过自适应池化获取 H 和 W 两个维度上的全局信息 # 并在通道方向上进行分组 (num_groups),每组独立 # 学习坐标注意力后再乘回原特征 #---------------------------------------------------# class GGCA(nn.Module): """ GGCA 动态版: 1) 通道分组: channel // num_groups 2) 高度/宽度方向分别做全局平均池化 与 全局最大池化 3) 使用共享的 1×1 卷积序列对池化结果降维+升维 4) 分别得到 att_h, att_w 后乘回到原特征上 """ def __init__(self, channel, reduction=16, num_groups=4): super(GGCA, self).__init__() self.num_groups = num_groups self.group_channels = channel // num_groups self.reduction = reduction #-----------------------------------------# # shared_conv 用于对每组通道先做降维 # (group_channels -> group_channels//reduction) # 再升维到 (group_channels),并配合 BN + ReLU #-----------------------------------------# self.shared_conv = nn.Sequential( nn.Conv2d(self.group_channels, self.group_channels // reduction, kernel_size=1), nn.BatchNorm2d(self.group_channels // reduction), nn.ReLU(inplace=True), nn.Conv2d(self.group_channels // reduction, self.group_channels, kernel_size=1) ) self.sigmoid_h = nn.Sigmoid() self.sigmoid_w = nn.Sigmoid() def forward(self, x): """ x 的形状: [B, C, H, W] 动态获取 H, W, 通过 F.adaptive_xxx_pool2d 实现自适应的全局池化 (avg, max) """ B, C, H, W = x.shape # 确保能整除 assert C % self.num_groups == 0, "通道数不能被分组数整除" #-----------------------------------------# # (B, C, H, W) -> (B, num_groups, group_channels, H, W) # 以分组方式处理通道 #-----------------------------------------# x_grouped = x.view(B, self.num_groups, self.group_channels, H, W) #------------------------------# # 先对高度维度做avg + max池化 # output_size=(H, 1) 保留H #------------------------------# x_h_avg = F.adaptive_avg_pool2d( x_grouped.view(B*self.num_groups, self.group_channels, H, W), output_size=(H, 1) ).view(B, self.num_groups, self.group_channels, H, 1) x_h_max = F.adaptive_max_pool2d( x_grouped.view(B*self.num_groups, self.group_channels, H, W), output_size=(H, 1) ).view(B, self.num_groups, self.group_channels, H, 1) #------------------------------# # 再对宽度维度做avg + max池化 # output_size=(1, W) 保留W #------------------------------# x_w_avg = F.adaptive_avg_pool2d( x_grouped.view(B*self.num_groups, self.group_channels, H, W), output_size=(1, W) ).view(B, self.num_groups, self.group_channels, 1, W) x_w_max = F.adaptive_max_pool2d( x_grouped.view(B*self.num_groups, self.group_channels, H, W), output_size=(1, W) ).view(B, self.num_groups, self.group_channels, 1, W) 
#------------------------------# # 池化后 => 共享卷积层 # y_h_avg, y_h_max, y_w_avg, y_w_max #------------------------------# y_h_avg = self.shared_conv( x_h_avg.view(B*self.num_groups, self.group_channels, H, 1) ) y_h_max = self.shared_conv( x_h_max.view(B*self.num_groups, self.group_channels, H, 1) ) y_w_avg = self.shared_conv( x_w_avg.view(B*self.num_groups, self.group_channels, 1, W) ) y_w_max = self.shared_conv( x_w_max.view(B*self.num_groups, self.group_channels, 1, W) ) #------------------------------# # 相加 => Sigmoid 激活 #------------------------------# att_h = self.sigmoid_h(y_h_avg + y_h_max).view( B, self.num_groups, self.group_channels, H, 1 ) att_w = self.sigmoid_w(y_w_avg + y_w_max).view( B, self.num_groups, self.group_channels, 1, W ) #------------------------------# # 将注意力权重乘回原分组特征 # 然后恢复到 (B, C, H, W) #------------------------------# out = x_grouped * att_h * att_w out = out.view(B, C, H, W) return out #---------------------------------------------------# # 倒置残差结构 (InvertedResidual),插入 GGCA #---------------------------------------------------# class InvertedResidual(nn.Module): """ 若 expand_ratio=1: => (DWConv3x3 -> BN -> ReLU6 -> GGCA -> PWConv(1x1) -> BN) 若 expand_ratio!=1: => (PWConv(1x1)升维 -> BN -> ReLU6 -> DWConv3x3 -> BN -> ReLU6 -> GGCA -> PWConv(1x1)降维 -> BN) """ def __init__(self, inp, oup, stride, expand_ratio, use_ggca=True, # 是否插入 GGCA ggca_groups=4, # GGCA 分组数 ggca_reduction=16): super(InvertedResidual, self).__init__() self.stride = stride assert stride in [1, 2], "只能使用 stride=1 或 2" hidden_dim = round(inp * expand_ratio) self.use_res_connect = (self.stride == 1 and inp == oup) self.use_ggca = use_ggca #------------------------------# # expand_ratio=1 => 无升维 #------------------------------# if expand_ratio == 1: layers = [] # 深度可分离卷积DW layers.append(nn.Conv2d(hidden_dim, hidden_dim, kernel_size=3, stride=stride, padding=1, groups=hidden_dim, bias=False)) layers.append(BatchNorm2d(hidden_dim)) layers.append(nn.ReLU6(inplace=True)) # 选用GGCA if self.use_ggca: layers.append(GGCA(channel=hidden_dim, num_groups=ggca_groups, reduction=ggca_reduction)) # 1x1 PW 卷积,降维到 oup layers.append(nn.Conv2d(hidden_dim, oup, 1, 1, 0, bias=False)) layers.append(BatchNorm2d(oup)) self.conv = nn.Sequential(*layers) else: # expand_ratio!=1 => 先1x1卷积升维,再DWConv,再GGCA,再1x1卷积降维 layers = [] # 1) 升维 layers.append(nn.Conv2d(inp, hidden_dim, kernel_size=1, stride=1, padding=0, bias=False)) layers.append(BatchNorm2d(hidden_dim)) layers.append(nn.ReLU6(inplace=True)) # 2) DWConv3x3 layers.append(nn.Conv2d(hidden_dim, hidden_dim, kernel_size=3, stride=stride, padding=1, groups=hidden_dim, bias=False)) layers.append(BatchNorm2d(hidden_dim)) layers.append(nn.ReLU6(inplace=True)) # 3) GGCA if self.use_ggca: layers.append(GGCA(channel=hidden_dim, num_groups=ggca_groups, reduction=ggca_reduction)) # 4) 1x1 卷积降维 layers.append(nn.Conv2d(hidden_dim, oup, kernel_size=1, stride=1, padding=0, bias=False)) layers.append(BatchNorm2d(oup)) self.conv = nn.Sequential(*layers) def forward(self, x): if self.use_res_connect: # 如果满足残差连接的条件,输出= 输入 + 卷积结果 return x + self.conv(x) else: return self.conv(x) #---------------------------------------------------# # 将 GGCA 集成到 MobileNetV2 中 #---------------------------------------------------# class MobileNetV2_GGCA(nn.Module): def __init__(self, n_class=1000, input_size=224, width_mult=1., use_ggca=True, ggca_groups=4, ggca_reduction=16): """ n_class : 分类任务的类别数 (默认为1000, ImageNet示例) input_size : 输入图像的边长 (必须是32的倍数) width_mult : 宽度因子, 可调网络宽度 use_ggca : 是否在倒置残差结构里插入GGCA ggca_groups : 
GGCA分组数 ggca_reduction : GGCA通道缩放率 """ super(MobileNetV2_GGCA, self).__init__() block = InvertedResidual input_channel = 32 last_channel = 1280 #-------------------------------------------------# # MobileNetV2 中常见的倒置残差配置信息: # t: expand ratio, c: 输出通道, n: repeat次数, s: 步幅 #-------------------------------------------------# interverted_residual_setting = [ [1, 16, 1, 1], # 第1组 [6, 24, 2, 2], # 第2组 [6, 32, 3, 2], # 第3组 [6, 64, 4, 2], # 第4组 [6, 96, 3, 1], # 第5组 [6, 160,3, 2], # 第6组 [6, 320,1, 1], # 第7组 ] assert input_size % 32 == 0, "输入尺寸必须是32的倍数" # 计算初始通道 input_channel = int(input_channel * width_mult) # 如果 width_mult > 1, 最后一层 1280 也要等比例增大 self.last_channel = (int(last_channel * width_mult) if width_mult > 1.0 else last_channel) #------------------------------# # 第一个卷积层: 3->32 #------------------------------# self.features = [conv_bn(3, input_channel, stride=2)] #------------------------------# # 构建多个倒置残差层 #------------------------------# for t, c, n, s in interverted_residual_setting: output_channel = int(c * width_mult) for i in range(n): # 第一次使用步幅 s if i == 0: self.features.append( block(inp=input_channel, oup=output_channel, stride=s, expand_ratio=t, use_ggca=use_ggca, ggca_groups=ggca_groups, ggca_reduction=ggca_reduction) ) else: # 后续重复的层步幅=1 self.features.append( block(inp=input_channel, oup=output_channel, stride=1, expand_ratio=t, use_ggca=use_ggca, ggca_groups=ggca_groups, ggca_reduction=ggca_reduction) ) input_channel = output_channel #------------------------------# # 最后一层: 1×1 卷积 (output_channel -> self.last_channel) #------------------------------# self.features.append(conv_1x1_bn(input_channel, self.last_channel)) # 整合为 nn.Sequential 方便管理 self.features = nn.Sequential(*self.features) #------------------------------# # 分类器 (以 ImageNet 为例) #------------------------------# self.classifier = nn.Sequential( nn.Dropout(0.2), nn.Linear(self.last_channel, n_class), ) self._initialize_weights() def forward(self, x): """ 前向传播: 1) 特征提取 (self.features) 2) 全局平均池化, 然后用全连接分类 (self.classifier) """ x = self.features(x) # 全局平均池化: 在 H, W 维度做 mean x = x.mean(dim=3).mean(dim=2) # => (B, C) x = self.classifier(x) return x def _initialize_weights(self): """ 模型参数初始化: 与官方 MobileNetV2 一致 """ for m in self.modules(): if isinstance(m, nn.Conv2d): # kaiming-like 初始化 n = m.kernel_size[0] * m.kernel_size[1] * m.out_channels m.weight.data.normal_(0, math.sqrt(2. 
/ n)) if m.bias is not None: m.bias.data.zero_() elif isinstance(m, BatchNorm2d): m.weight.data.fill_(1) m.bias.data.zero_() elif isinstance(m, nn.Linear): # 全连接层初始化 n = m.weight.size(1) m.weight.data.normal_(0, 0.01) m.bias.data.zero_() def load_url(url, model_dir='./model_data', map_location=None): """ 用于从指定URL加载预训练权重(若本地不存在则下载) """ if not os.path.exists(model_dir): os.makedirs(model_dir) filename = url.split('/')[-1] cached_file = os.path.join(model_dir, filename) if os.path.exists(cached_file): return torch.load(cached_file, map_location=map_location) else: return model_zoo.load_url(url, model_dir=model_dir) def mobilenetv2_ggca(pretrained=False, **kwargs): """ 方便外部调用的函数: 构建带 GGCA 的 MobileNetV2, 并可选择是否加载预训练权重。 """ model = MobileNetV2_GGCA(n_class=1000, **kwargs) if pretrained: # 这里仅作演示,可能和原始预训练权重不完全匹配 model.load_state_dict( load_url('https://github.com/bubbliiiing/deeplabv3-plus-pytorch/releases/download/v1.0/mobilenet_v2.pth.tar'), strict=False ) return model if __name__ == "__main__": # 简单测试 net = mobilenetv2_ggca(pretrained=False, use_ggca=True, # 是否启用GGCA ggca_groups=4, # 默认分组数 ggca_reduction=16) # 默认通道压缩比 print(net) # 生成随机输入测试网络输出 x_test = torch.randn(1, 3, 224, 224) y_test = net(x_test) print("输出维度:", y_test.shape)
代码要点说明:

- GGCA 动态注意力模块:采用 adaptive_avg_pool2d 和 adaptive_max_pool2d,使其在不同分辨率输入下都能自适应地获取高度、宽度方向上的全局信息。
- InvertedResidual(倒置残差):可按需插入 GGCA(由 use_ggca 决定),并在升维和 DWConv 之间或之后插入注意力;expand_ratio=1 时,跳过升维,仅在 DWConv 后与最后的 1×1 卷积之前加上 GGCA。
- MobileNetV2_GGCA:可通过 width_mult 调整宽度,并在 __init__ 中通过参数 use_ggca, ggca_groups, ggca_reduction 来控制 GGCA 的使用与超参数设定。

注:这样即可得到一个在倒置残差阶段集成 GGCA 注意力的 MobileNetV2 变体,可以用于进一步下游的图像分类、分割或检测任务中。
C:\Users\wztbkl\anaconda3\envs\wuzhupytorch\python.exe C:\Users\wztbkl\PycharmProjects\deeplabv3-plus-pytorch-main\nets\mobilenetV2_GSCA.py
C:\Users\wztbkl\anaconda3\envs\wuzhupytorch\lib\site-packages\torch\nn\init.py:511: UserWarning: Initializing zero-element tensors is a no-op
warnings.warn("Initializing zero-element tensors is a no-op")
Traceback (most recent call last):
File "C:\Users\wztbkl\PycharmProjects\deeplabv3-plus-pytorch-main\nets\mobilenetV2_GSCA.py", line 326, in <module>
net = mobilenetv2_ggca(pretrained=False,
File "C:\Users\wztbkl\PycharmProjects\deeplabv3-plus-pytorch-main\nets\mobilenetV2_GSCA.py", line 314, in mobilenetv2_ggca
model = MobileNetV2_GGCA(n_class=1000, **kwargs)
File "C:\Users\wztbkl\PycharmProjects\deeplabv3-plus-pytorch-main\nets\mobilenetV2_GSCA.py", line 271, in init
self._initialize_weights()
File "C:\Users\wztbkl\PycharmProjects\deeplabv3-plus-pytorch-main\nets\mobilenetV2_GSCA.py", line 288, in initialize_weights
m.weight.data.normal(0, math.sqrt(2. / n))
ZeroDivisionError: float division by zero
进程已结束,退出代码为 1
该错误通常意味着在初始化卷积核权重时,出现了“除以 0”的情况。根据您贴出的报错信息,问题出现在:
```python
m.weight.data.normal_(0, math.sqrt(2. / n))
```
其中
```python
n = m.kernel_size[0] * m.kernel_size[1] * m.out_channels
```
若 n = 0(即 kernel_size 或 out_channels 其中之一为 0),就会发生 ZeroDivisionError: float division by zero。
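下面是一个最小复现草图(仅作演示;nn.Conv2d 允许 out_channels=0 这一点与您日志中的行为一致,具体以所用 PyTorch 版本为准):

```python
import math
import torch.nn as nn

# 最小复现(示意):out_channels=0 的卷积在构造时只会触发 no-op 警告,
# 但按 n = k*k*out_channels 计算得到 n=0,随后 2./n 即抛出 ZeroDivisionError
m = nn.Conv2d(8, 0, kernel_size=1)
n = m.kernel_size[0] * m.kernel_size[1] * m.out_channels
print(n)                  # 0
print(math.sqrt(2. / n))  # 这里抛出 ZeroDivisionError: float division by zero
```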
可能的原因主要有以下几类:

1. 卷积层的通道数为 0:例如分组注意力中 num_groups 设置过大,导致 channel // num_groups = 0,从而在后续某些卷积或权重形状中出现 out_channels = 0 的情况。
2. width_mult 等参数导致网络结构异常:若把 width_mult 设置成过小(例如 0 或一个极小值),就会引发某些层的 out_channels=0。
3. Group Convolution 设置问题:如果 InvertedResidual 或 GGCA 里出现了 groups=某个值,但计算出的通道数整除后为 0,也会产生类似问题。

排查与解决思路:

1. 检查输入通道数与 num_groups 的设置:若使用了 ggca_groups=4 之类的参数,要确保在插入 GGCA 的地方,通道数 channel ≥ num_groups,否则 channel // num_groups 可能为 0;同时最好保证 channel % num_groups == 0,否则会在其他地方报错或出现意料外的 shape 问题。
2. 查看是否将 width_mult 设置为过小:例如 width_mult=0.1 或更小,可能导致某些层计算出的通道数直接变成 0。可以逐层打印 inp, oup,查看是否存在 oup=0 的情况。
3. 在 _initialize_weights() 中做安全判断:
```python
if n == 0:
    # 跳过或对 n 赋值为 1,避免除 0
    continue
```
但若某一层真的出现 out_channels=0,那么网络结构本身也不合理,需要从根源上修复。
4. 确保通道数足够:插入 GGCA 的位置通道数应不小于 num_groups,并且能被其整除;可将 num_groups 设为一个较小值(如 2 或 4),同时检查该层 expand_ratio 后的通道数不要过小。
5. 适当调整 width_mult:一般 width_mult 不能小于 0.5 或 0.25(取决于您的需求),否则很多层的 out_channels 会变得非常小甚至为 0。
6. 调试打印:在 forward 前打印一下所有卷积层的 in_channels, out_channels,确认是否出现 0 的情况。
7. 修改初始化逻辑(下策):在 _initialize_weights() 里对 n=0 做特殊处理并跳过该层。但这种情况多半说明网络设计本身不合理,还是应在网络构造环节修复。

总结:报错的根本原因是某个卷积层出现了 out_channels=0 或 kernel_size=0,导致 n=0,在初始化权重时发生除以 0 的错误;常见诱因是分组设置不当或 width_mult 过小让通道数变 0。解决方法是检查各层通道数、合理设置分组数与 width_mult,并在 _initialize_weights() 中做相应的检查。只要保证每个卷积层输出的 out_channels 都大于 0,且没有层出现空 kernel 或空通道的极端情况,就不会再出现此错误。
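排查时也可以借助下面这个简单的辅助函数(仅为示意),快速找出通道数为 0 的卷积层:

```python
import torch.nn as nn

def find_empty_convs(model: nn.Module):
    """遍历模型,打印 in_channels 或 out_channels 为 0 的卷积层,帮助定位除 0 的根源。"""
    for name, m in model.named_modules():
        if isinstance(m, nn.Conv2d) and (m.in_channels == 0 or m.out_channels == 0):
            print(f"异常卷积层: {name}, in={m.in_channels}, out={m.out_channels}")

# 用法示意:find_empty_convs(net),其中 net 为已构建好的模型实例
```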
给出完整修改代码
下面给出一个示例,以您提供的 MobileNetV2 + GGCA 代码为基础,针对 “ZeroDivisionError: float division by zero” 的问题做了两处修正:
1) 在 GGCA 初始化时检查通道数与分组数的关系,避免出现 group_channels=0;
2) 在 _initialize_weights() 中对可能出现 n=0 的情况进行跳过或安全处理。
请根据您的实际需求进行合并或调整。
pythonimport math import os import torch import torch.nn as nn import torch.nn.functional as F import torch.utils.model_zoo as model_zoo #---------------------------------------------------# # BatchNorm2d 可以根据需要切换为 SyncBatchNorm # 或者自定义的 BN 形式,当前使用PyTorch自带 #---------------------------------------------------# BatchNorm2d = nn.BatchNorm2d def conv_bn(inp, oup, stride): """ 标准 3x3 卷积 + BN + ReLU6,用于 MobileNetV2 的首层。 inp : 输入通道数 oup : 输出通道数 stride: 步幅 (当 stride=2 时,可实现下采样) """ return nn.Sequential( nn.Conv2d(inp, oup, 3, stride, 1, bias=False), BatchNorm2d(oup), nn.ReLU6(inplace=True) ) def conv_1x1_bn(inp, oup): """ 标准 1x1 卷积 + BN + ReLU6,用于 MobileNetV2 中后期升维。 """ return nn.Sequential( nn.Conv2d(inp, oup, 1, 1, 0, bias=False), BatchNorm2d(oup), nn.ReLU6(inplace=True) ) class GGCA(nn.Module): """ Global Grouped Coordinate Attention (动态版): - 通道分组: 将通道分为 num_groups 组 - 高度(H)和宽度(W)方向分别做 自适应平均池化+最大池化 - 共享卷积层用于通道降维 + 升维 - 输出对高度、宽度的注意力权重后,与原特征逐元素相乘 """ def __init__(self, channel, reduction=16, num_groups=4): super(GGCA, self).__init__() # 先检查 channel 与 num_groups 的关系,避免后续通道数为0 if channel < num_groups: raise ValueError(f"GGCA num_groups={num_groups} 大于通道数 {channel},无法分组!") if channel % num_groups != 0: raise ValueError(f"channel={channel} 不能被 num_groups={num_groups} 整除!") self.num_groups = num_groups self.group_channels = channel // num_groups self.reduction = reduction # 共享卷积序列:先降维 (group_channels//reduction) 再升维回 (group_channels) self.shared_conv = nn.Sequential( nn.Conv2d(self.group_channels, self.group_channels // reduction, kernel_size=1), nn.BatchNorm2d(self.group_channels // reduction), nn.ReLU(inplace=True), nn.Conv2d(self.group_channels // reduction, self.group_channels, kernel_size=1) ) self.sigmoid_h = nn.Sigmoid() self.sigmoid_w = nn.Sigmoid() def forward(self, x): """ x shape: (B, C, H, W) """ B, C, H, W = x.shape # 再次检查,若后面某些情况下通道数不足,也会报错 assert C % self.num_groups == 0, "通道数不能被分组数整除" # (B, C, H, W) -> (B, num_groups, group_channels, H, W) x_grouped = x.view(B, self.num_groups, self.group_channels, H, W) #------------------------------# # 高度维度自适应平均+最大池化 #------------------------------# x_h_avg = F.adaptive_avg_pool2d( x_grouped.view(B*self.num_groups, self.group_channels, H, W), output_size=(H, 1) ).view(B, self.num_groups, self.group_channels, H, 1) x_h_max = F.adaptive_max_pool2d( x_grouped.view(B*self.num_groups, self.group_channels, H, W), output_size=(H, 1) ).view(B, self.num_groups, self.group_channels, H, 1) #------------------------------# # 宽度维度自适应平均+最大池化 #------------------------------# x_w_avg = F.adaptive_avg_pool2d( x_grouped.view(B*self.num_groups, self.group_channels, H, W), output_size=(1, W) ).view(B, self.num_groups, self.group_channels, 1, W) x_w_max = F.adaptive_max_pool2d( x_grouped.view(B*self.num_groups, self.group_channels, H, W), output_size=(1, W) ).view(B, self.num_groups, self.group_channels, 1, W) # 共享卷积层处理 y_h_avg = self.shared_conv(x_h_avg.view(B*self.num_groups, self.group_channels, H, 1)) y_h_max = self.shared_conv(x_h_max.view(B*self.num_groups, self.group_channels, H, 1)) y_w_avg = self.shared_conv(x_w_avg.view(B*self.num_groups, self.group_channels, 1, W)) y_w_max = self.shared_conv(x_w_max.view(B*self.num_groups, self.group_channels, 1, W)) # 相加 => Sigmoid att_h = self.sigmoid_h(y_h_avg + y_h_max).view(B, self.num_groups, self.group_channels, H, 1) att_w = self.sigmoid_w(y_w_avg + y_w_max).view(B, self.num_groups, self.group_channels, 1, W) # 应用注意力权重 out = x_grouped * att_h * att_w out = out.view(B, C, H, W) return out class 
InvertedResidual(nn.Module): """ 在深度可分离卷积后插入GGCA: expand_ratio=1: DWConv3x3 -> BN -> ReLU6 -> [GGCA] -> PWConv(1x1) -> BN expand_ratio!=1: PWConv(1x1)升维 -> BN -> ReLU6 -> DWConv3x3 -> BN -> ReLU6 -> [GGCA] -> PWConv(1x1)降维 -> BN """ def __init__(self, inp, oup, stride, expand_ratio, use_ggca=True, ggca_groups=4, ggca_reduction=16): super(InvertedResidual, self).__init__() self.stride = stride assert stride in [1, 2] hidden_dim = round(inp * expand_ratio) self.use_res_connect = (self.stride == 1 and inp == oup) self.use_ggca = use_ggca if expand_ratio == 1: # 无升维 layers = [ nn.Conv2d(hidden_dim, hidden_dim, 3, stride, 1, groups=hidden_dim, bias=False), BatchNorm2d(hidden_dim), nn.ReLU6(inplace=True), ] # 在DWConv之后插入GGCA if self.use_ggca: layers.append( GGCA(channel=hidden_dim, num_groups=ggca_groups, reduction=ggca_reduction) ) layers.extend([ nn.Conv2d(hidden_dim, oup, 1, 1, 0, bias=False), BatchNorm2d(oup) ]) self.conv = nn.Sequential(*layers) else: # 先1x1卷积升维 -> DWConv3x3 -> GGCA -> 1x1降维 layers = [ nn.Conv2d(inp, hidden_dim, 1, 1, 0, bias=False), BatchNorm2d(hidden_dim), nn.ReLU6(inplace=True), nn.Conv2d(hidden_dim, hidden_dim, 3, stride, 1, groups=hidden_dim, bias=False), BatchNorm2d(hidden_dim), nn.ReLU6(inplace=True), ] if self.use_ggca: layers.append( GGCA(channel=hidden_dim, num_groups=ggca_groups, reduction=ggca_reduction) ) layers.extend([ nn.Conv2d(hidden_dim, oup, 1, 1, 0, bias=False), BatchNorm2d(oup) ]) self.conv = nn.Sequential(*layers) def forward(self, x): if self.use_res_connect: return x + self.conv(x) else: return self.conv(x) class MobileNetV2_GGCA(nn.Module): """ 带 GGCA 的 MobileNetV2 可通过 use_ggca、ggca_groups、ggca_reduction 决定是否及如何使用GGCA。 """ def __init__(self, n_class=1000, input_size=224, width_mult=1., use_ggca=True, ggca_groups=4, ggca_reduction=16): super(MobileNetV2_GGCA, self).__init__() block = InvertedResidual input_channel = 32 last_channel = 1280 # 倒置残差配置: t=expand_ratio, c=输出通道, n=重复次数, s=stride interverted_residual_setting = [ [1, 16, 1, 1], [6, 24, 2, 2], [6, 32, 3, 2], [6, 64, 4, 2], [6, 96, 3, 1], [6, 160,3, 2], [6, 320,1, 1], ] assert input_size % 32 == 0, "输入尺寸需为32的倍数" # 计算初始通道 input_channel = int(input_channel * width_mult) self.last_channel = int(last_channel * width_mult) if width_mult > 1.0 else last_channel # 首层卷积: 3 -> input_channel self.features = [conv_bn(3, input_channel, 2)] # 构建倒置残差层 for t, c, n, s in interverted_residual_setting: output_channel = int(c * width_mult) for i in range(n): stride = s if i == 0 else 1 self.features.append( block(inp=input_channel, oup=output_channel, stride=stride, expand_ratio=t, use_ggca=use_ggca, ggca_groups=ggca_groups, ggca_reduction=ggca_reduction) ) input_channel = output_channel # 最后一层: 1x1 卷积 -> 1280通道 (或 width_mult 后的通道) self.features.append(conv_1x1_bn(input_channel, self.last_channel)) self.features = nn.Sequential(*self.features) # 分类器: 全局池化 -> Dropout -> 全连接 self.classifier = nn.Sequential( nn.Dropout(0.2), nn.Linear(self.last_channel, n_class), ) self._initialize_weights() def forward(self, x): x = self.features(x) # 全局平均池化 (B, C, 1, 1) x = x.mean(dim=3).mean(dim=2) x = self.classifier(x) return x def _initialize_weights(self): """ 避免 ZeroDivisionError 的安全处理: - 若 n == 0, 则跳过初始化 """ for m in self.modules(): if isinstance(m, nn.Conv2d): # kernel_size 乘 out_channels if m.kernel_size and m.out_channels > 0: n = m.kernel_size[0] * m.kernel_size[1] * m.out_channels else: n = 0 if n > 0: m.weight.data.normal_(0, math.sqrt(2. 
/ n)) if m.bias is not None: m.bias.data.zero_() elif isinstance(m, BatchNorm2d): m.weight.data.fill_(1) m.bias.data.zero_() elif isinstance(m, nn.Linear): # 全连接层初始化 if m.weight.size(1) > 0: # 避免出现空维度 n = m.weight.size(1) else: n = 0 if n > 0: m.weight.data.normal_(0, 0.01) m.bias.data.zero_() def load_url(url, model_dir='./model_data', map_location=None): """ 用于从指定URL加载预训练权重(若本地不存在则下载) """ if not os.path.exists(model_dir): os.makedirs(model_dir) filename = url.split('/')[-1] cached_file = os.path.join(model_dir, filename) if os.path.exists(cached_file): return torch.load(cached_file, map_location=map_location) else: return model_zoo.load_url(url, model_dir=model_dir) def mobilenetv2_ggca(pretrained=False, **kwargs): """ 统一构建函数: use_ggca, ggca_groups, ggca_reduction 等在 kwargs 中设置 """ model = MobileNetV2_GGCA(n_class=1000, **kwargs) if pretrained: # 仅作演示,原版权重可能不完全匹配 model.load_state_dict( load_url('https://github.com/bubbliiiing/deeplabv3-plus-pytorch/releases/download/v1.0/mobilenet_v2.pth.tar'), strict=False ) return model if __name__ == "__main__": # 测试 # 注意: 需保证 ggca_groups <= 通道数, 并能整除 net = mobilenetv2_ggca( pretrained=False, use_ggca=True, ggca_groups=4, # 例如4, 但要保证每层通道数>=4 ggca_reduction=16, width_mult=1.0 # 若太小可能导致通道=0 ) print(net) x_test = torch.randn(1, 3, 224, 224) y_test = net(x_test) print("输出维度:", y_test.shape)
1. GGCA 通道数检查:在 GGCA.__init__ 里增加了
```python
if channel < num_groups:
    raise ValueError(...)
if channel % num_groups != 0:
    raise ValueError(...)
```
以在初始化时就避免出现 group_channels=0 或无法分组的情况。
2. weights 初始化安全处理:在 _initialize_weights() 里,当计算 n = m.kernel_size[0] * m.kernel_size[1] * m.out_channels 时,先判断 out_channels 是否大于 0;若结果 n=0,就跳过 normal_() 初始化,从而不会再出现 ZeroDivisionError。
3. width_mult 注意事项:若要使用较小的 width_mult(如 0.1、0.25),需要确保中间层不会出现 out_channels=0,否则就算跳过了初始化,也会在模型 forward 时报其它错误;建议让各层通道数保持为 num_groups 的倍数,并至少为 1(例如 width_mult>=0.5 时大多可以正常工作)。

通过这两处改动,通常能够避免因通道过小或分组数设置不当而导致的除 0 错误。如果依然出现问题,则需在具体的层上打印实际的 inp, oup,或者查看 expand_ratio 是否被设得过小,从而做针对性调整。
C:\Users\wztbkl\anaconda3\envs\wuzhupytorch\python.exe C:\Users\wztbkl\PycharmProjects\deeplabv3-plus-pytorch-main\nets\mobilenetV2_GSCA.py
C:\Users\wztbkl\anaconda3\envs\wuzhupytorch\lib\site-packages\torch\nn\init.py:511: UserWarning: Initializing zero-element tensors is a no-op
warnings.warn("Initializing zero-element tensors is a no-op")
MobileNetV2_GGCA(
(features): Sequential(
(0): Sequential(
(0): Conv2d(3, 32, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU6(inplace=True)
)
(1): InvertedResidual(
(conv): Sequential(
(0): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=32, bias=False)
(1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU6(inplace=True)
(3): GGCA(
(shared_conv): Sequential(
(0): Conv2d(8, 0, kernel_size=(1, 1), stride=(1, 1))
(1): BatchNorm2d(0, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU(inplace=True)
(3): Conv2d(0, 8, kernel_size=(1, 1), stride=(1, 1))
)
(sigmoid_h): Sigmoid()
(sigmoid_w): Sigmoid()
)
(4): Conv2d(32, 16, kernel_size=(1, 1), stride=(1, 1), bias=False)
(5): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(2): InvertedResidual(
(conv): Sequential(
(0): Conv2d(16, 96, kernel_size=(1, 1), stride=(1, 1), bias=False)
(1): BatchNorm2d(96, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU6(inplace=True)
(3): Conv2d(96, 96, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), groups=96, bias=False)
(4): BatchNorm2d(96, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): ReLU6(inplace=True)
(6): GGCA(
(shared_conv): Sequential(
(0): Conv2d(24, 1, kernel_size=(1, 1), stride=(1, 1))
(1): BatchNorm2d(1, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU(inplace=True)
(3): Conv2d(1, 24, kernel_size=(1, 1), stride=(1, 1))
)
(sigmoid_h): Sigmoid()
(sigmoid_w): Sigmoid()
)
(7): Conv2d(96, 24, kernel_size=(1, 1), stride=(1, 1), bias=False)
(8): BatchNorm2d(24, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(3): InvertedResidual(
(conv): Sequential(
(0): Conv2d(24, 144, kernel_size=(1, 1), stride=(1, 1), bias=False)
(1): BatchNorm2d(144, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU6(inplace=True)
(3): Conv2d(144, 144, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=144, bias=False)
(4): BatchNorm2d(144, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): ReLU6(inplace=True)
(6): GGCA(
(shared_conv): Sequential(
(0): Conv2d(36, 2, kernel_size=(1, 1), stride=(1, 1))
(1): BatchNorm2d(2, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU(inplace=True)
(3): Conv2d(2, 36, kernel_size=(1, 1), stride=(1, 1))
)
(sigmoid_h): Sigmoid()
(sigmoid_w): Sigmoid()
)
(7): Conv2d(144, 24, kernel_size=(1, 1), stride=(1, 1), bias=False)
(8): BatchNorm2d(24, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(4): InvertedResidual(
(conv): Sequential(
(0): Conv2d(24, 144, kernel_size=(1, 1), stride=(1, 1), bias=False)
(1): BatchNorm2d(144, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU6(inplace=True)
(3): Conv2d(144, 144, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), groups=144, bias=False)
(4): BatchNorm2d(144, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): ReLU6(inplace=True)
(6): GGCA(
(shared_conv): Sequential(
(0): Conv2d(36, 2, kernel_size=(1, 1), stride=(1, 1))
(1): BatchNorm2d(2, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU(inplace=True)
(3): Conv2d(2, 36, kernel_size=(1, 1), stride=(1, 1))
)
(sigmoid_h): Sigmoid()
(sigmoid_w): Sigmoid()
)
(7): Conv2d(144, 32, kernel_size=(1, 1), stride=(1, 1), bias=False)
(8): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(5): InvertedResidual(
(conv): Sequential(
(0): Conv2d(32, 192, kernel_size=(1, 1), stride=(1, 1), bias=False)
(1): BatchNorm2d(192, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU6(inplace=True)
(3): Conv2d(192, 192, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=192, bias=False)
(4): BatchNorm2d(192, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): ReLU6(inplace=True)
(6): GGCA(
(shared_conv): Sequential(
(0): Conv2d(48, 3, kernel_size=(1, 1), stride=(1, 1))
(1): BatchNorm2d(3, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU(inplace=True)
(3): Conv2d(3, 48, kernel_size=(1, 1), stride=(1, 1))
)
(sigmoid_h): Sigmoid()
(sigmoid_w): Sigmoid()
)
(7): Conv2d(192, 32, kernel_size=(1, 1), stride=(1, 1), bias=False)
(8): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(6): InvertedResidual(
(conv): Sequential(
(0): Conv2d(32, 192, kernel_size=(1, 1), stride=(1, 1), bias=False)
(1): BatchNorm2d(192, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU6(inplace=True)
(3): Conv2d(192, 192, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=192, bias=False)
(4): BatchNorm2d(192, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): ReLU6(inplace=True)
(6): GGCA(
(shared_conv): Sequential(
(0): Conv2d(48, 3, kernel_size=(1, 1), stride=(1, 1))
(1): BatchNorm2d(3, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU(inplace=True)
(3): Conv2d(3, 48, kernel_size=(1, 1), stride=(1, 1))
)
(sigmoid_h): Sigmoid()
(sigmoid_w): Sigmoid()
)
(7): Conv2d(192, 32, kernel_size=(1, 1), stride=(1, 1), bias=False)
(8): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(7): InvertedResidual(
(conv): Sequential(
(0): Conv2d(32, 192, kernel_size=(1, 1), stride=(1, 1), bias=False)
(1): BatchNorm2d(192, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU6(inplace=True)
(3): Conv2d(192, 192, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), groups=192, bias=False)
(4): BatchNorm2d(192, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): ReLU6(inplace=True)
(6): GGCA(
(shared_conv): Sequential(
(0): Conv2d(48, 3, kernel_size=(1, 1), stride=(1, 1))
(1): BatchNorm2d(3, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU(inplace=True)
(3): Conv2d(3, 48, kernel_size=(1, 1), stride=(1, 1))
)
(sigmoid_h): Sigmoid()
(sigmoid_w): Sigmoid()
)
(7): Conv2d(192, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
(8): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(8): InvertedResidual(
(conv): Sequential(
(0): Conv2d(64, 384, kernel_size=(1, 1), stride=(1, 1), bias=False)
(1): BatchNorm2d(384, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU6(inplace=True)
(3): Conv2d(384, 384, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=384, bias=False)
(4): BatchNorm2d(384, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): ReLU6(inplace=True)
(6): GGCA(
(shared_conv): Sequential(
(0): Conv2d(96, 6, kernel_size=(1, 1), stride=(1, 1))
(1): BatchNorm2d(6, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU(inplace=True)
(3): Conv2d(6, 96, kernel_size=(1, 1), stride=(1, 1))
)
(sigmoid_h): Sigmoid()
(sigmoid_w): Sigmoid()
)
(7): Conv2d(384, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
(8): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(9): InvertedResidual(
(conv): Sequential(
(0): Conv2d(64, 384, kernel_size=(1, 1), stride=(1, 1), bias=False)
(1): BatchNorm2d(384, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU6(inplace=True)
(3): Conv2d(384, 384, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=384, bias=False)
(4): BatchNorm2d(384, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): ReLU6(inplace=True)
(6): GGCA(
(shared_conv): Sequential(
(0): Conv2d(96, 6, kernel_size=(1, 1), stride=(1, 1))
(1): BatchNorm2d(6, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU(inplace=True)
(3): Conv2d(6, 96, kernel_size=(1, 1), stride=(1, 1))
)
(sigmoid_h): Sigmoid()
(sigmoid_w): Sigmoid()
)
(7): Conv2d(384, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
(8): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(10): InvertedResidual(
(conv): Sequential(
(0): Conv2d(64, 384, kernel_size=(1, 1), stride=(1, 1), bias=False)
(1): BatchNorm2d(384, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU6(inplace=True)
(3): Conv2d(384, 384, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=384, bias=False)
(4): BatchNorm2d(384, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): ReLU6(inplace=True)
(6): GGCA(
(shared_conv): Sequential(
(0): Conv2d(96, 6, kernel_size=(1, 1), stride=(1, 1))
(1): BatchNorm2d(6, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU(inplace=True)
(3): Conv2d(6, 96, kernel_size=(1, 1), stride=(1, 1))
)
(sigmoid_h): Sigmoid()
(sigmoid_w): Sigmoid()
)
(7): Conv2d(384, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
(8): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(11): InvertedResidual(
(conv): Sequential(
(0): Conv2d(64, 384, kernel_size=(1, 1), stride=(1, 1), bias=False)
(1): BatchNorm2d(384, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU6(inplace=True)
(3): Conv2d(384, 384, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=384, bias=False)
(4): BatchNorm2d(384, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): ReLU6(inplace=True)
(6): GGCA(
(shared_conv): Sequential(
(0): Conv2d(96, 6, kernel_size=(1, 1), stride=(1, 1))
(1): BatchNorm2d(6, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU(inplace=True)
(3): Conv2d(6, 96, kernel_size=(1, 1), stride=(1, 1))
)
(sigmoid_h): Sigmoid()
(sigmoid_w): Sigmoid()
)
(7): Conv2d(384, 96, kernel_size=(1, 1), stride=(1, 1), bias=False)
(8): BatchNorm2d(96, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(12): InvertedResidual(
(conv): Sequential(
(0): Conv2d(96, 576, kernel_size=(1, 1), stride=(1, 1), bias=False)
(1): BatchNorm2d(576, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU6(inplace=True)
(3): Conv2d(576, 576, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=576, bias=False)
(4): BatchNorm2d(576, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): ReLU6(inplace=True)
(6): GGCA(
(shared_conv): Sequential(
(0): Conv2d(144, 9, kernel_size=(1, 1), stride=(1, 1))
(1): BatchNorm2d(9, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU(inplace=True)
(3): Conv2d(9, 144, kernel_size=(1, 1), stride=(1, 1))
)
(sigmoid_h): Sigmoid()
(sigmoid_w): Sigmoid()
)
(7): Conv2d(576, 96, kernel_size=(1, 1), stride=(1, 1), bias=False)
(8): BatchNorm2d(96, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(13): InvertedResidual(
(conv): Sequential(
(0): Conv2d(96, 576, kernel_size=(1, 1), stride=(1, 1), bias=False)
(1): BatchNorm2d(576, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU6(inplace=True)
(3): Conv2d(576, 576, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=576, bias=False)
(4): BatchNorm2d(576, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): ReLU6(inplace=True)
(6): GGCA(
(shared_conv): Sequential(
(0): Conv2d(144, 9, kernel_size=(1, 1), stride=(1, 1))
(1): BatchNorm2d(9, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU(inplace=True)
(3): Conv2d(9, 144, kernel_size=(1, 1), stride=(1, 1))
)
(sigmoid_h): Sigmoid()
(sigmoid_w): Sigmoid()
)
(7): Conv2d(576, 96, kernel_size=(1, 1), stride=(1, 1), bias=False)
(8): BatchNorm2d(96, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(14): InvertedResidual(
(conv): Sequential(
(0): Conv2d(96, 576, kernel_size=(1, 1), stride=(1, 1), bias=False)
(1): BatchNorm2d(576, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU6(inplace=True)
(3): Conv2d(576, 576, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), groups=576, bias=False)
(4): BatchNorm2d(576, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): ReLU6(inplace=True)
(6): GGCA(
(shared_conv): Sequential(
(0): Conv2d(144, 9, kernel_size=(1, 1), stride=(1, 1))
(1): BatchNorm2d(9, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU(inplace=True)
(3): Conv2d(9, 144, kernel_size=(1, 1), stride=(1, 1))
)
(sigmoid_h): Sigmoid()
(sigmoid_w): Sigmoid()
)
(7): Conv2d(576, 160, kernel_size=(1, 1), stride=(1, 1), bias=False)
(8): BatchNorm2d(160, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(15): InvertedResidual(
(conv): Sequential(
(0): Conv2d(160, 960, kernel_size=(1, 1), stride=(1, 1), bias=False)
(1): BatchNorm2d(960, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU6(inplace=True)
(3): Conv2d(960, 960, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=960, bias=False)
(4): BatchNorm2d(960, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): ReLU6(inplace=True)
(6): GGCA(
(shared_conv): Sequential(
(0): Conv2d(240, 15, kernel_size=(1, 1), stride=(1, 1))
(1): BatchNorm2d(15, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU(inplace=True)
(3): Conv2d(15, 240, kernel_size=(1, 1), stride=(1, 1))
)
(sigmoid_h): Sigmoid()
(sigmoid_w): Sigmoid()
)
(7): Conv2d(960, 160, kernel_size=(1, 1), stride=(1, 1), bias=False)
(8): BatchNorm2d(160, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(16): InvertedResidual(
(conv): Sequential(
(0): Conv2d(160, 960, kernel_size=(1, 1), stride=(1, 1), bias=False)
(1): BatchNorm2d(960, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU6(inplace=True)
(3): Conv2d(960, 960, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=960, bias=False)
(4): BatchNorm2d(960, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): ReLU6(inplace=True)
(6): GGCA(
(shared_conv): Sequential(
(0): Conv2d(240, 15, kernel_size=(1, 1), stride=(1, 1))
(1): BatchNorm2d(15, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU(inplace=True)
(3): Conv2d(15, 240, kernel_size=(1, 1), stride=(1, 1))
)
(sigmoid_h): Sigmoid()
(sigmoid_w): Sigmoid()
)
(7): Conv2d(960, 160, kernel_size=(1, 1), stride=(1, 1), bias=False)
(8): BatchNorm2d(160, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(17): InvertedResidual(
(conv): Sequential(
(0): Conv2d(160, 960, kernel_size=(1, 1), stride=(1, 1), bias=False)
(1): BatchNorm2d(960, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU6(inplace=True)
(3): Conv2d(960, 960, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=960, bias=False)
(4): BatchNorm2d(960, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): ReLU6(inplace=True)
(6): GGCA(
(shared_conv): Sequential(
(0): Conv2d(240, 15, kernel_size=(1, 1), stride=(1, 1))
(1): BatchNorm2d(15, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU(inplace=True)
(3): Conv2d(15, 240, kernel_size=(1, 1), stride=(1, 1))
)
(sigmoid_h): Sigmoid()
(sigmoid_w): Sigmoid()
)
(7): Conv2d(960, 320, kernel_size=(1, 1), stride=(1, 1), bias=False)
(8): BatchNorm2d(320, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(18): Sequential(
(0): Conv2d(320, 1280, kernel_size=(1, 1), stride=(1, 1), bias=False)
(1): BatchNorm2d(1280, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU6(inplace=True)
)
)
(classifier): Sequential(
(0): Dropout(p=0.2, inplace=False)
(1): Linear(in_features=1280, out_features=1000, bias=True)
)
)
Traceback (most recent call last):
File "C:\Users\wztbkl\PycharmProjects\deeplabv3-plus-pytorch-main\nets\mobilenetV2_GSCA.py", line 314, in <module>
y_test = net(x_test)
File "C:\Users\wztbkl\anaconda3\envs\wuzhupytorch\lib\site-packages\torch\nn\modules\module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\Users\wztbkl\anaconda3\envs\wuzhupytorch\lib\site-packages\torch\nn\modules\module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\wztbkl\PycharmProjects\deeplabv3-plus-pytorch-main\nets\mobilenetV2_GSCA.py", line 239, in forward
x = self.features(x)
File "C:\Users\wztbkl\anaconda3\envs\wuzhupytorch\lib\site-packages\torch\nn\modules\module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\Users\wztbkl\anaconda3\envs\wuzhupytorch\lib\site-packages\torch\nn\modules\module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\wztbkl\anaconda3\envs\wuzhupytorch\lib\site-packages\torch\nn\modules\container.py", line 250, in forward
input = module(input)
File "C:\Users\wztbkl\anaconda3\envs\wuzhupytorch\lib\site-packages\torch\nn\modules\module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\Users\wztbkl\anaconda3\envs\wuzhupytorch\lib\site-packages\torch\nn\modules\module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\wztbkl\PycharmProjects\deeplabv3-plus-pytorch-main\nets\mobilenetV2_GSCA.py", line 181, in forward
return self.conv(x)
File "C:\Users\wztbkl\anaconda3\envs\wuzhupytorch\lib\site-packages\torch\nn\modules\module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\Users\wztbkl\anaconda3\envs\wuzhupytorch\lib\site-packages\torch\nn\modules\module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\wztbkl\anaconda3\envs\wuzhupytorch\lib\site-packages\torch\nn\modules\container.py", line 250, in forward
input = module(input)
File "C:\Users\wztbkl\anaconda3\envs\wuzhupytorch\lib\site-packages\torch\nn\modules\module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\Users\wztbkl\anaconda3\envs\wuzhupytorch\lib\site-packages\torch\nn\modules\module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\wztbkl\PycharmProjects\deeplabv3-plus-pytorch-main\nets\mobilenetV2_GSCA.py", line 107, in forward
y_h_avg = self.shared_conv(x_h_avg.view(B*self.num_groups, self.group_channels, H, 1))
File "C:\Users\wztbkl\anaconda3\envs\wuzhupytorch\lib\site-packages\torch\nn\modules\module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\Users\wztbkl\anaconda3\envs\wuzhupytorch\lib\site-packages\torch\nn\modules\module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\wztbkl\anaconda3\envs\wuzhupytorch\lib\site-packages\torch\nn\modules\container.py", line 250, in forward
input = module(input)
File "C:\Users\wztbkl\anaconda3\envs\wuzhupytorch\lib\site-packages\torch\nn\modules\module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\Users\wztbkl\anaconda3\envs\wuzhupytorch\lib\site-packages\torch\nn\modules\module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\wztbkl\anaconda3\envs\wuzhupytorch\lib\site-packages\torch\nn\modules\conv.py", line 554, in forward
return self._conv_forward(input, self.weight, self.bias)
File "C:\Users\wztbkl\anaconda3\envs\wuzhupytorch\lib\site-packages\torch\nn\modules\conv.py", line 549, in _conv_forward
return F.conv2d(
RuntimeError: Given groups=1, expected weight to be at least 1 at dimension 0, but got weight of size [0, 8, 1, 1] instead
进程已结束,退出代码为 1
从报错信息可以看出,某个卷积层的权重尺寸为 [0, 8, 1, 1],说明输出通道数 = 0,导致了 RuntimeError: Given groups=1, expected weight to be at least 1 at dimension 0。

经分析,这多半发生在 GGCA 模块的 shared_conv 里:当 group_channels // reduction == 0 时,就会出现卷积的输出通道数是 0 的情况。
在您当前的设置中:

- 通道会被分成 num_groups 组,每组通道数为 group_channels;
- shared_conv 的第一层 1×1 卷积把通道数从 group_channels 降到 group_channels // reduction;
- 一旦 group_channels < reduction,就会出现 group_channels // reduction = 0。

从日志可见,branch1(最前面的层)或别的浅层特征阶段每组可能只有 8 个通道,却指定了 num_groups=4 且 reduction=16,那么就会导致 8 // 16 = 0,于是输出通道数就成了 0。
要防止“降维后通道=0”,可从以下几个角度着手:

1. 减小 reduction
   若 reduction=16 太大,则可改为 4 或 8,确保 group_channels // reduction 不为 0。例如把 ggca_reduction=16 改成 ggca_reduction=4 或者更小,就能让 8 // 16 = 0 变成 8 // 4 = 2;若某些层组内通道只有 2,则 2 // 4 依然是 0,如果还是不够,就要再进一步减小 reduction 或干脆不做降维。
2. 减少分组数 num_groups
   例如改为 num_groups=1 或 num_groups=2,这样单组通道就相对大一些。
3. 自动Clamp(取较大值)
   在 shared_conv 里做一个 max(1, group_channels // reduction) 的处理,例如:
```python
reduced_ch = max(1, self.group_channels // self.reduction)
self.shared_conv = nn.Sequential(
    nn.Conv2d(self.group_channels, reduced_ch, kernel_size=1),
    ...
    nn.Conv2d(reduced_ch, self.group_channels, kernel_size=1)
)
```
4. 保证浅层通道数足够
   需满足 channel // num_groups > reduction,才能防止挤压到 0 通道。

以下示例在 GGCA 的 __init__ 里加一行自动 Clamp,并将默认 reduction 设得小一些,以减少出现 0 通道的概率。若您更倾向于手动控制,可直接调小 num_groups 或 reduction。
```python
class GGCA(nn.Module):
    def __init__(self, channel, reduction=16, num_groups=4):
        super(GGCA, self).__init__()
        if channel < num_groups:
            raise ValueError(f"GGCA分组数num_groups={num_groups}超过通道数{channel}!")
        if channel % num_groups != 0:
            raise ValueError(f"GGCA: channel={channel}不是num_groups={num_groups}的整数倍!")

        self.num_groups = num_groups
        self.group_channels = channel // num_groups

        #----------- 修改:自动Clamp ------------#
        # 原本: out_channels = self.group_channels // reduction
        # 若 out_channels < 1,会引发错误
        # 故加一个 max(1, ...)
        out_ch = max(1, self.group_channels // reduction)

        self.shared_conv = nn.Sequential(
            nn.Conv2d(self.group_channels, out_ch, kernel_size=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, self.group_channels, kernel_size=1)
        )
        self.sigmoid_h = nn.Sigmoid()
        self.sigmoid_w = nn.Sigmoid()

    def forward(self, x):
        ...
        # 其余不变
```
同时,还可在构建 MobileNetV2_GGCA 时使用较小的 reduction,比如 4 或 8,这样出现 0 通道的概率就更小。
简单总结:

- 问题根源:group_channels // reduction == 0,导致 1×1 卷积权重的 out_channels=0。
- 解决办法:减小 reduction 或减小 num_groups,确保 group_channels >= reduction;或在代码中做自动 clamp。

这样即可修复运行时卷积 out_channels=0 的报错,同时保持主干网络 + GGCA 的整体结构逻辑,可用下面的小脚本快速自检。
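下面是一个很小的自检脚本(假设 GGCA 即上文加入 max(1, ...) 自动 Clamp 之后的版本,且可在当前脚本中直接构造):

```python
# 假设 GGCA 为上文带自动 Clamp 的版本
for channel in [32, 96, 144, 960]:
    ggca = GGCA(channel=channel, num_groups=4, reduction=16)
    first_conv = ggca.shared_conv[0]   # shared_conv 的第一层 1x1 卷积
    # 未加 Clamp 时 channel=32 会得到 out_channels=0,从而在 forward 时报错
    print(f"channel={channel}: 第一层 out_channels = {first_conv.out_channels}")
```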
Epoch 1/100: 0%| | 0/2162 [00:00<?, ?it/s<class 'dict'>]Start Train
这个 TracerWarning 通常是因为在使用 torch.jit.trace(或者脚本化)时,代码里出现了 Python 原生的断言(assert) 或类似的操作,对一个运行时的 Tensor 做布尔判断(如 assert C % self.num_groups == 0
)。在 tracing 过程中,PyTorch 不会跟踪这些 Python 级别的布尔逻辑,因此会发出警告,提示该操作可能导致后续无法泛化到其它输入。
由于在 forward(或与其相关的代码)中对动态形状做了这类 Python 级布尔判断(如 C % self.num_groups == 0),这部分逻辑在编译图中是不可追踪的,常见的处理方式有:
1. 在构造或初始化时进行检查
   把检查放到模块的构造函数(__init__)或外部创建网络时做静态检查,而不是在 forward 中或 forward 相关的逻辑里执行。
2. 移除或替换 assert
   若能保证输入总是满足 C % self.num_groups == 0,可以直接去掉这句 assert,或者将其改为对输入尺寸做一次“上层检查”,例如把类似 if x.shape[1] % self.num_groups != 0: raise ValueError(...) 的检查以 channel 参数的形式放在 __init__ 里,在网络初始化时就报错,而不是 forward 过程。
3. 将断言逻辑写成条件分支 + 触发某个固定的处理
   也可以使用可被追踪的断言(torch._assert(...) 在某些新版本可用)或在“图外”进行检查。

总之,建议把 assert C % self.num_groups == 0 移到网络构建或数据预处理的阶段,避免在 forward(或与之相关的构造函数)里对动态 Tensor 做 Python 布尔断言。这样便能消除 TracerWarning,同时还保留对通道与分组数合法性的检查逻辑,下面给出一个简单的改法示意。
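下面是一个简化的改法草图(类名 GGCACheckedInInit 为示例用的假设名称,核心思想是把整除性检查从 forward 挪到 __init__):

```python
import torch.nn as nn

class GGCACheckedInInit(nn.Module):
    """示意:通道整除性检查放在 __init__,tracing 时不再对动态 Tensor 做 Python 断言。"""
    def __init__(self, channel, num_groups=4):
        super().__init__()
        # 静态检查:channel 在构建网络时就是已知的 Python int
        if channel % num_groups != 0:
            raise ValueError(f"channel={channel} 必须能被 num_groups={num_groups} 整除")
        self.num_groups = num_groups
        self.group_channels = channel // num_groups

    def forward(self, x):
        B, C, H, W = x.shape
        # 原先这里的 assert C % self.num_groups == 0 已删除,
        # 直接使用 __init__ 中算好的 self.group_channels 做分组 view
        return x.view(B, self.num_groups, self.group_channels, H, W).view(B, C, H, W)
```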
加了GGCA后,训练速度和最后的预测速度都慢了很多,这是为什么?这个代码参数也不是很多啊,是不是代码错了
**在引入 GGCA(Global Grouped Coordinate Attention)之后,训练速度和推理速度都下降很多,主要原因并不一定是代码出错,而是该注意力机制本身带来了额外的计算开销。**以下几点有助于理解为什么会变慢:
1. 多次全局池化 + 分组操作
2. 分组数量和减维比
3. 动态或重复的池化操作
   每次 forward 都要调用自适应池化(adaptive_avg_pool2d / adaptive_max_pool2d),并且对多个分组执行,还会多次调用底层内核或产生重复的内存读写。
4. 时序分析:越深层的注意力越消耗
5. 对比其他注意力模块

作为缓解,可以先尝试 num_groups=1 或 2,而不是 4、8,这样可显著减少重复池化和卷积次数;也可以适当调小 num_groups、reduction、网络分辨率,或减少插入 GGCA 的层数,以在“注意力增益”和“计算开销”之间做平衡。

使用合并池化试试
在原始 GGCA 实现中,由于需要针对 高度 (H) 和 宽度 (W) 分别执行自适应平均池化与最大池化,导致出现 多次调用 F.adaptive_avg_pool2d
/ F.adaptive_max_pool2d
,而且是对每个分组都进行,这在大分辨率、较多通道或分组数较多时会显著增大计算量。
如果您的输入尺寸 在网络中保持固定(例如 512×512),或者可以不依赖自适应特性(只是想对“行”或“列”进行简单的 mean/max),就可以尝试以下思路:用按维度求均值/最大值 来代替自适应池化,从而减少函数调用和调度开销,这通常被称作“合并(融合)池化”或“基于维度的池化”。
对于 垂直方向(宽度) 的平均池化:
```python
# (B*g, groupC, H, W) => 在第 3 维做 mean,结果 (B*g, groupC, H, 1)
x_h_avg = x.mean(dim=3, keepdim=True)
```
同理,最大池化可用
```python
x_h_max, _ = x.max(dim=3, keepdim=True)
```
对于 水平方向(高度) 的平均池化:
```python
# (B*g, groupC, H, W) => 在第 2 维做 mean,结果 (B*g, groupC, 1, W)
x_w_avg = x.mean(dim=2, keepdim=True)
```
最大池化则
```python
x_w_max, _ = x.max(dim=2, keepdim=True)
```
这样就能完成同样的“按行(列)聚合”的操作,无需调用 adaptive_avg_pool2d
或 adaptive_max_pool2d
。
下面给出一个使用 “合并池化”(基于维度求均值/最大值)来替换自适应池化的 GGCA 动态版本示例。仅演示核心思路,其余流程与您原本的 GGCA 相同。
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GGCA_FusedPooling(nn.Module):
    """
    将 adaptive_avg_pool2d / adaptive_max_pool2d 替换为基于维度的 mean/max。
    注意:
    - 仅适合大多数情况下输入宽高固定,且无需真正自适应变换时使用。
    - 如果后续网络输入 H, W 会变化,依然可以用此方式,只要是想对行/列做均值即可。
    """
    def __init__(self, channel, reduction=16, num_groups=4):
        super(GGCA_FusedPooling, self).__init__()
        if channel < num_groups:
            raise ValueError(f"GGCA num_groups={num_groups} 超过通道数 {channel}")
        if channel % num_groups != 0:
            raise ValueError(f"GGCA: channel={channel} 不是 num_groups={num_groups} 的整数倍")

        self.num_groups = num_groups
        self.group_channels = channel // num_groups
        self.reduction = reduction

        # 这里做一个简单的 clamp,防止 out_ch=0
        out_ch = max(1, self.group_channels // self.reduction)

        self.shared_conv = nn.Sequential(
            nn.Conv2d(self.group_channels, out_ch, kernel_size=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, self.group_channels, kernel_size=1)
        )
        self.sigmoid_h = nn.Sigmoid()
        self.sigmoid_w = nn.Sigmoid()

    def forward(self, x):
        """
        x: (B, C, H, W)
        分组 => fused pooling => shared_conv => 注意力加权
        """
        B, C, H, W = x.shape

        # 分组
        x_grouped = x.view(B, self.num_groups, self.group_channels, H, W)
        # 重塑为 (B*g, groupC, H, W) 便于后续操作
        x_grouped_2d = x_grouped.view(B*self.num_groups, self.group_channels, H, W)

        #------------------------#
        # 1) 宽度方向聚合: mean + max
        #    x_h_avg, x_h_max => (B*g, groupC, H, 1)
        #------------------------#
        x_h_avg = x_grouped_2d.mean(dim=3, keepdim=True)
        x_h_max, _ = x_grouped_2d.max(dim=3, keepdim=True)

        #------------------------#
        # 2) 高度方向聚合: mean + max
        #    x_w_avg, x_w_max => (B*g, groupC, 1, W)
        #------------------------#
        x_w_avg = x_grouped_2d.mean(dim=2, keepdim=True)
        x_w_max, _ = x_grouped_2d.max(dim=2, keepdim=True)

        # 分别通过 shared_conv
        y_h_avg = self.shared_conv(x_h_avg)
        y_h_max = self.shared_conv(x_h_max)
        y_w_avg = self.shared_conv(x_w_avg)
        y_w_max = self.shared_conv(x_w_max)

        # Sigmoid
        att_h = self.sigmoid_h(y_h_avg + y_h_max)  # (B*g, groupC, H, 1)
        att_w = self.sigmoid_w(y_w_avg + y_w_max)  # (B*g, groupC, 1, W)

        # reshape回 (B, g, groupC, H, 1/W)
        att_h = att_h.view(B, self.num_groups, self.group_channels, H, 1)
        att_w = att_w.view(B, self.num_groups, self.group_channels, 1, W)

        # 原特征 x_grouped * att_h * att_w
        out = x_grouped * att_h * att_w

        # 恢复到 (B, C, H, W)
        out = out.view(B, C, H, W)
        return out
```
与原先代码相比,主要的差异是不再使用 F.adaptive_avg_pool2d(x, (H,1))
等,而是直接对 dim=3
或 dim=2
求 mean / max。这样做的好处:
- F.adaptive_avg_pool2d(x, (H,1)) 在内部要做较为通用的调度,而 .mean(dim=3, keepdim=True) 属于更底层、更直接的 Tensor 运算,可减少函数调用开销。
- 只用 .mean(dim=3)、.max(dim=3) 就能满足需求,且不会进行任何“插值”或“重缩放”,它本质上就是按单一维度取统计值。

在一般情况下,这种合并池化可以减少 Python 层多次调用 adaptive_*_pool2d 的 overhead,有望提升速度。不过,实际效果还取决于:
- 在某些硬件与 PyTorch 版本上,adaptive_avg_pool2d 本身可能被高度优化,速度也不错;在某些平台却不一定。
- 若特征图较大、num_groups 较多,那么本地维度运算(x.mean(dim=...) 等)的优化程度也可能带来可观加速。

因此,这种“合并池化/基于维度聚合”的改法在很多场景下确实能少量或中等程度地降低 GGCA 的计算开销,但能否“显著”减少要以实测为准(可参考下面的简单计时示例)。若仍然对速度较为敏感,还需要结合减少分组数、减小输入分辨率、在部分关键层使用 GGCA 等手段来进一步优化。
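若想在自己的机器上粗略验证两种写法的开销差异,可以用类似下面的计时小脚本(仅为示意,输入尺寸为假设值,结果依硬件与 PyTorch 版本而异):

```python
import time
import torch
import torch.nn.functional as F

x = torch.randn(8, 96, 128, 128)
if torch.cuda.is_available():
    x = x.cuda()

def bench(fn, name, iters=100):
    for _ in range(10):        # 预热
        fn(x)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(iters):
        fn(x)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    print(f"{name}: {(time.time() - t0) / iters * 1000:.3f} ms/次")

H = x.shape[2]
bench(lambda t: F.adaptive_avg_pool2d(t, (H, 1)), "adaptive_avg_pool2d -> (H,1)")
bench(lambda t: t.mean(dim=3, keepdim=True),      "mean(dim=3, keepdim=True)")
```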
- 用 mean(dim=...) / max(dim=...) 代替自适应池化,可减少函数调用和不必要的中间步骤,有望提高速度。
- 若确实需要任意输出尺寸的自适应变换,仍应使用 adaptive_*_pool2d;但若只想对行列做简单统计,此种“基于维度的聚合”更直接、高效。

在此基础上,再根据GGCA所在倒残差模块的位置调整参数,以减小计算量
在使用 GGCA(无论是原生自适应池化版本还是合并池化版本)时,如果仍希望进一步降低计算开销,可以在不同倒残差模块中对 GGCA 的 参数和启用策略进行细化设置,而非在所有层都一刀切地使用相同的分组、降维比和启用方式。以下给出几种可行思路,可视您的实际实验与需求自行组合:
1. 只在网络较深部分启用 GGCA
   对浅层或不需要注意力的层直接设置 use_ggca=False,让网络以原先的 MobileNetV2 方式执行;示例(伪代码):
```python
for idx, (t, c, n, s) in enumerate(interverted_residual_setting):
    for i in range(n):
        use_current_ggca = False
        if idx >= 3:   # 假设只从第4组开始启用GGCA
            use_current_ggca = True
        self.features.append(
            InvertedResidual(
                inp=input_channel,
                oup=output_channel,
                stride=(s if i == 0 else 1),
                expand_ratio=t,
                use_ggca=use_current_ggca,
                ggca_groups=ggca_groups,
                ggca_reduction=ggca_reduction,
            )
        )
```
2. 动态调节 num_groups 与 reduction
   在所有层使用同一套 num_groups、reduction 参数并不理想,可能会让通道“挤”到过小或者带来过多卷积开销,更合理的做法是按层级动态设置 num_groups 与 reduction。例如:
   - 浅层通道较少时,num_groups 也取小一点(比如 1 或 2),reduction 也取小或直接设为 1(不降维),以免出现降到 0 通道或额外 overhead。
   - 深层通道较多时,可适当增大 num_groups,提升对跨通道注意力的细粒度刻画,但 reduction 可以保持较小以控制额外卷积参数量。

   示例(伪代码):
```python
for idx, (t, c, n, s) in enumerate(interverted_residual_setting):
    output_channel = int(c * width_mult)
    for i in range(n):
        # 在浅层 idx<=2 时,num_groups=1;reduction=4
        # 在中深层 idx>2 时,num_groups=4;reduction=16
        if idx <= 2:
            this_groups = 1
            this_red = 4
        else:
            this_groups = 4
            this_red = 16

        self.features.append(
            InvertedResidual(
                inp=input_channel,
                oup=output_channel,
                stride=(s if i == 0 else 1),
                expand_ratio=t,
                use_ggca=True,
                ggca_groups=this_groups,
                ggca_reduction=this_red
            )
        )
        input_channel = output_channel
```
通过这种“分层级”设置,可以避免浅层通道太少时还死板地分成 4~8 组,也能在深层发挥更强的注意力表达。
3. 对下采样层(stride=2)禁用或简化 GGCA
   倒残差配置中部分层 stride=2,负责下采样;此时输入分辨率依然较大,如果再加 GGCA,往往会使得池化操作叠加在高分辨率上,额外耗时更明显。可以对 stride=2 的那几层禁用或简化 GGCA 逻辑(例如 num_groups=1 或者 reduction=1),避免过多重复计算;而在 stride=1 的层才启用更复杂的 GGCA 设置。示例(伪代码):
```python
if stride == 2:
    # 由于是下采样层,通道多、分辨率大 => 开销很大
    # 这里可以直接 use_ggca=False 或者 num_groups=1, reduction=1
    pass
else:
    # stride=1 => 空间分辨率更小 => 使用更完整的 GGCA
    pass
```
这样做能够保证在特征图分辨率依旧很大时不承担过重的注意力开销。
若您已经将 adaptive_*_pool2d
改成了维度聚合(x.mean(dim=...)
, x.max(dim=...)
),还可以再加分层策略,让 GGCA 在更少的模块或以更小的 num_groups
/ reduction
出现。这样会双管齐下:一方面用合并池化降低每次注意力计算的开销,另一方面按层动态调节 num_groups / reduction 以匹配每层的实际通道情况。这类分层式或模块化的优化思路,不仅适用于 GGCA,也同样适用于其他高计算注意力模块(如 CBAM、Dual Attention 等),帮助我们在大型分辨率或轻量化场景下更灵活地应用注意力机制。
只在网络较深部分开启 GGCA;
对浅层与下采样层禁用 GGCA;
动态调节 num_groups / reduction 以匹配每层的实际通道情况;
结合合并池化,避免大量自适应池化调用;
按照以上策略给出完整代码
下面给出一个示例代码,综合运用了以下策略来减少 GGCA 带来的额外计算量,同时仍在网络较深层获得全局注意力的益处:

- 只在网络较深部分启用 GGCA;
- 对下采样层(stride=2)禁用 GGCA;
- 按层动态调节 num_groups 和 reduction;
- 使用合并池化版本的 GGCA(GGCA_Fused)。

注意:此处代码仅为示例。您可根据实际网络设置(如 MobileNetV2 的倒残差配置、预期深层数、分辨率需求)在 MobileNetV2_GGCA 的构建循环中灵活修改。
pythonimport math import torch import torch.nn as nn import torch.nn.functional as F import torch.utils.model_zoo as model_zoo BatchNorm2d = nn.BatchNorm2d #------------------------------------# # 1. GGCA (fused pooling 版本) #------------------------------------# class GGCA_Fused(nn.Module): """ 使用合并池化(在 dim=2, dim=3 分别求 mean / max)替代自适应池化。 同时支持通道分组及降维比 (reduction),并可根据需要做自动clamp。 """ def __init__(self, channel, reduction=16, num_groups=4): super(GGCA_Fused, self).__init__() # 基础检查 if channel < num_groups: raise ValueError(f"GGCA: num_groups={num_groups} 超过通道数 {channel}") if channel % num_groups != 0: raise ValueError(f"GGCA: channel={channel} 不是 num_groups={num_groups} 的整数倍") self.num_groups = num_groups self.group_channels = channel // num_groups self.reduction = reduction # 确保降维后不为 0 out_ch = max(1, self.group_channels // reduction) self.shared_conv = nn.Sequential( nn.Conv2d(self.group_channels, out_ch, kernel_size=1), BatchNorm2d(out_ch), nn.ReLU(inplace=True), nn.Conv2d(out_ch, self.group_channels, kernel_size=1) ) self.sigmoid_h = nn.Sigmoid() self.sigmoid_w = nn.Sigmoid() def forward(self, x): """ x: (B, C, H, W) 分组 => fused pooling => shared_conv => 注意力加权 """ B, C, H, W = x.shape # 分组 x_grouped = x.view(B, self.num_groups, self.group_channels, H, W) x_grouped_2d = x_grouped.view(B*self.num_groups, self.group_channels, H, W) # (1) 在宽度方向 (dim=3) 上做 mean + max x_h_avg = x_grouped_2d.mean(dim=3, keepdim=True) # => (B*g, groupC, H, 1) x_h_max, _ = x_grouped_2d.max(dim=3, keepdim=True) # => (B*g, groupC, H, 1) # (2) 在高度方向 (dim=2) 上做 mean + max x_w_avg = x_grouped_2d.mean(dim=2, keepdim=True) # => (B*g, groupC, 1, W) x_w_max, _ = x_grouped_2d.max(dim=2, keepdim=True) # => (B*g, groupC, 1, W) # (3) 分别通过 共享卷积层 => (B*g, groupC, H,1) or (B*g, groupC,1,W) y_h_avg = self.shared_conv(x_h_avg) y_h_max = self.shared_conv(x_h_max) y_w_avg = self.shared_conv(x_w_avg) y_w_max = self.shared_conv(x_w_max) # (4) 相加 + sigmoid att_h = self.sigmoid_h(y_h_avg + y_h_max) # => (B*g, groupC, H,1) att_w = self.sigmoid_w(y_w_avg + y_w_max) # => (B*g, groupC,1,W) # reshape 回原分组形状 att_h = att_h.view(B, self.num_groups, self.group_channels, H, 1) att_w = att_w.view(B, self.num_groups, self.group_channels, 1, W) # (5) 元素乘 x_grouped out = x_grouped * att_h * att_w out = out.view(B, C, H, W) return out #------------------------------------# # 2. 
倒残差结构: 支持可选GGCA #------------------------------------# class InvertedResidual(nn.Module): """ - 若 expand_ratio=1: DW(3x3) -> BN -> ReLU6 -> [可选GGCA] -> PW(1x1) -> BN - 若 expand_ratio!=1: PW(1x1)升维 -> BN -> ReLU6 -> DW(3x3) -> BN -> ReLU6 -> [可选GGCA] -> PW(1x1)降维 -> BN """ def __init__(self, inp, oup, stride, expand_ratio, use_ggca=True, ggca_class=GGCA_Fused, # 指定使用 fused 版本的 GGCA ggca_groups=1, ggca_reduction=4): super(InvertedResidual, self).__init__() self.stride = stride assert stride in [1, 2] hidden_dim = round(inp * expand_ratio) self.use_res_connect = (self.stride == 1 and inp == oup) self.use_ggca = use_ggca layers = [] if expand_ratio == 1: # (DW -> BN -> ReLU6) layers.extend([ nn.Conv2d(hidden_dim, hidden_dim, 3, stride, 1, groups=hidden_dim, bias=False), BatchNorm2d(hidden_dim), nn.ReLU6(inplace=True), ]) # 可选 GGCA if self.use_ggca: layers.append(ggca_class(channel=hidden_dim, reduction=ggca_reduction, num_groups=ggca_groups)) # PW (1x1) layers.extend([ nn.Conv2d(hidden_dim, oup, 1, 1, 0, bias=False), BatchNorm2d(oup) ]) else: # 1x1 升维 layers.extend([ nn.Conv2d(inp, hidden_dim, 1, 1, 0, bias=False), BatchNorm2d(hidden_dim), nn.ReLU6(inplace=True), # DW nn.Conv2d(hidden_dim, hidden_dim, 3, stride, 1, groups=hidden_dim, bias=False), BatchNorm2d(hidden_dim), nn.ReLU6(inplace=True), ]) # 可选 GGCA if self.use_ggca: layers.append(ggca_class(channel=hidden_dim, reduction=ggca_reduction, num_groups=ggca_groups)) # 1x1 降维 layers.extend([ nn.Conv2d(hidden_dim, oup, 1, 1, 0, bias=False), BatchNorm2d(oup) ]) self.conv = nn.Sequential(*layers) def forward(self, x): if self.use_res_connect: return x + self.conv(x) else: return self.conv(x) #------------------------------------# # 3. MobileNetV2 主体 + 分层策略 #------------------------------------# class MobileNetV2_GGCA(nn.Module): """ - 仅在网络后半部分(或指定层)启用 GGCA - 对下采样层(stride=2)禁用 GGCA - 动态调节 num_groups / reduction - 使用合并池化的 GGCA_Fused """ def __init__(self, n_class=1000, input_size=224, width_mult=1.0): super(MobileNetV2_GGCA, self).__init__() block = InvertedResidual input_channel = 32 last_channel = 1280 # [t, c, n, s] interverted_residual_setting = [ [1, 16, 1, 1], [6, 24, 2, 2], [6, 32, 3, 2], [6, 64, 4, 2], [6, 96, 3, 1], [6, 160,3, 2], [6, 320,1, 1], ] assert input_size % 32 == 0 input_channel = int(input_channel * width_mult) self.last_channel = int(last_channel * width_mult) if width_mult > 1.0 else last_channel # 首层: 3 -> input_channel self.features = [self._conv_bn(3, input_channel, 2)] # 构建倒残差层 + 分层策略 for idx, (t, c, n, s) in enumerate(interverted_residual_setting): output_channel = int(c * width_mult) for i in range(n): # 如果是第一个块,用 stride=s,否则 stride=1 current_stride = s if i == 0 else 1 # ---------- 1) 仅在网络后半部分启用 GGCA ---------- # 这里假设 idx>=3 就算后半部分(可自行调整) use_ggca = (idx >= 3) # ---------- 2) 对下采样层 (stride=2) 禁用 GGCA ---------- if current_stride == 2: use_ggca = False # ---------- 3) 动态调节 num_groups / reduction ---------- # 举例:如果 output_channel < 64,就用较小分组1, reduction=4 # 否则用较大分组4, reduction=16 if output_channel < 64: this_groups = 1 this_reduction = 4 else: this_groups = 4 this_reduction = 16 # 构建本层 self.features.append( block( inp=input_channel, oup=output_channel, stride=current_stride, expand_ratio=t, use_ggca=use_ggca, ggca_class=GGCA_Fused, # 使用合并池化版本 GGCA ggca_groups=this_groups, ggca_reduction=this_reduction ) ) input_channel = output_channel # 最后 1x1 卷积 self.features.append(self._conv_1x1_bn(input_channel, self.last_channel)) self.features = nn.Sequential(*self.features) # 分类器 self.classifier = nn.Sequential( nn.Dropout(0.2), 
nn.Linear(self.last_channel, n_class) ) self._initialize_weights() def _conv_bn(self, inp, oup, stride): return nn.Sequential( nn.Conv2d(inp, oup, 3, stride, 1, bias=False), BatchNorm2d(oup), nn.ReLU6(inplace=True) ) def _conv_1x1_bn(self, inp, oup): return nn.Sequential( nn.Conv2d(inp, oup, 1, 1, 0, bias=False), BatchNorm2d(oup), nn.ReLU6(inplace=True) ) def forward(self, x): x = self.features(x) # 全局平均池化 + 全连接 x = x.mean(dim=3).mean(dim=2) x = self.classifier(x) return x def _initialize_weights(self): for m in self.modules(): if isinstance(m, nn.Conv2d): if m.kernel_size and m.out_channels > 0: n = m.kernel_size[0] * m.kernel_size[1] * m.out_channels else: n = 0 if n > 0: m.weight.data.normal_(0, math.sqrt(2. / n)) if m.bias is not None: m.bias.data.zero_() elif isinstance(m, BatchNorm2d): m.weight.data.fill_(1) m.bias.data.zero_() elif isinstance(m, nn.Linear): if m.weight.size(1) > 0: n = m.weight.size(1) else: n = 0 if n > 0: m.weight.data.normal_(0, 0.01) m.bias.data.zero_() def load_url(url, model_dir='./model_data', map_location=None): if not os.path.exists(model_dir): os.makedirs(model_dir) filename = url.split('/')[-1] cached_file = os.path.join(model_dir, filename) if os.path.exists(cached_file): return torch.load(cached_file, map_location=map_location) else: return model_zoo.load_url(url, model_dir=model_dir) def mobilenetv2_ggca_fused(pretrained=False, **kwargs): """ 外部构建函数,可选择是否加载预训练权重。 """ model = MobileNetV2_GGCA(n_class=1000, **kwargs) if pretrained: # 可能与官方预训练权重不完全匹配,仅作示例 model.load_state_dict( load_url('https://github.com/bubbliiiing/deeplabv3-plus-pytorch/releases/download/v1.0/mobilenet_v2.pth.tar'), strict=False ) return model if __name__ == "__main__": # 简单测试 net = mobilenetv2_ggca_fused( pretrained=False, input_size=224, width_mult=1.0 ) print(net) x_test = torch.randn(1, 3, 224, 224) out = net(x_test) print("输出维度:", out.shape)
代码要点说明:

- GGCA_Fused:对宽度维度 (dim=3) 求 mean / max,对高度维度 (dim=2) 求 mean / max,并用 max(1, group_channels // reduction) 防止通道降为 0。
- InvertedResidual:是否插入 GGCA 由 use_ggca 决定;若 expand_ratio=1,采用略简化的结构,否则包含 1×1 升维、3×3 DWConv、GGCA、1×1 降维。
- 分层策略:在 MobileNetV2_GGCA.__init__ 里,对每一个倒残差组 (idx, (t,c,n,s)),只在较深的组启用 GGCA;若 stride=2(下采样层),则再禁用 GGCA;并根据 output_channel 动态指定 num_groups 与 reduction,如小于 64 时设置得更小,以避免太多运算。
- 训练 & 推理优化:可再配合混合精度、分布式训练等手段(见后文的加速建议)。
通过以上完整示例,即可实现“只在网络较深部分启用 GGCA、对浅层和下采样层禁用、动态调节分组/降维参数、并使用合并池化”的方案,在保证一定注意力效果的同时,显著减少多次池化和分组卷积带来的 overhead。
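作为补充,下面是一个简单的使用示例(假设 mobilenetv2_ggca_fused 即上文定义的构建函数,且可在当前环境导入),用于粗略查看参数量与单次前向耗时,数值仅供参考:

```python
import time
import torch

# 假设 mobilenetv2_ggca_fused 为上文定义的构建函数
net = mobilenetv2_ggca_fused(pretrained=False, input_size=224, width_mult=1.0).eval()
print("参数量: %.2f M" % (sum(p.numel() for p in net.parameters()) / 1e6))

x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    net(x)                      # 预热,排除首次调用的初始化开销
    t0 = time.time()
    for _ in range(10):
        net(x)
print("平均前向耗时: %.1f ms" % ((time.time() - t0) / 10 * 1000))
```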
能否让训练速度再快一些
如果在经过“分层启用 GGCA”、“减少分组数/减小 reduction”、“合并池化”之后,训练和推理速度依然无法满足需求,可以从以下几个方面继续做进一步的加速与优化:

- 分组数:进一步减小 num_groups。
- 降维比:只有在通道充足时才把 reduction 设得大一些,否则就用小或直接设为 1~4。
- 分辨率:适当降低训练与推理的输入分辨率。
- Batch Size:结合显存情况调整批次大小。
- PyTorch Profiling:用分析工具定位真正耗时的算子(例如确认瓶颈是否在 x.mean(dim=...) 等注意力相关操作)。
- 更轻量化部署:部分推理后端对逐维度归约 (.mean(dim=...) / .max(dim=...)) 做了高效 kernel 优化。
- 混合精度 (Automatic Mixed Precision, AMP):torch.cuda.amp 可以自动管理混合精度,通常能提速 1.5~2 倍(视具体网络而定),用法见下面的示例。
- 分布式训练 (DDP):多卡环境下可进一步缩短训练时间。
在此基础上,逐层、逐项分析您的网络结构与训练流程,并结合 profiling 工具,可以最终找到最适合的方案,让训练速度“更快一些”且保证在可接受的精度范围内。
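下面给出一个最简的混合精度训练循环草图(假设 model、train_loader、optimizer、criterion、num_epochs 等沿用您现有训练脚本中的定义,此处仅演示 torch.cuda.amp 的典型用法):

```python
import torch
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()   # 负责 loss 缩放,防止 FP16 下梯度下溢

for epoch in range(num_epochs):              # num_epochs 等变量沿用现有脚本
    for images, labels in train_loader:
        images, labels = images.cuda(), labels.cuda()
        optimizer.zero_grad()

        # autocast 区域内的前向与损失计算自动使用混合精度
        with autocast():
            outputs = model(images)
            loss = criterion(outputs, labels)

        # 先缩放 loss 再反向传播,由 scaler 统一反缩放并更新参数
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
```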
idx是指倒残差块的位置嘛
是的,idx
在该示例中就是指当前倒残差块(InvertedResidual)在 interverted_residual_setting
列表中的位置索引。通常我们会在 for idx, (t, c, n, s) in enumerate(...)
这类循环里使用 idx
来表示“这是第几组倒残差配置”,从而对不同组的特征层(不同深度)做差异化设置,比如是否启用 GGCA、设定多少分组、用什么样的 reduction 等。
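例如,下面这个最小示意中(配置列表仅截取前文的前三组),idx 依次为 0、1、2:

```python
interverted_residual_setting = [
    # t, c, n, s
    [1, 16, 1, 1],
    [6, 24, 2, 2],
    [6, 32, 3, 2],
]

for idx, (t, c, n, s) in enumerate(interverted_residual_setting):
    # idx=0 对应第 1 组倒残差配置,idx=1 对应第 2 组,依此类推
    print(f"第 {idx} 组: expand_ratio={t}, 输出通道={c}, 重复次数={n}, stride={s}")
```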
import math
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.utils.model_zoo as model_zoo
import os
BatchNorm2d = nn.BatchNorm2d
#------------------------------------#
# 1. GGCA (fused pooling 版本)
#------------------------------------#
class GGCA_Fused(nn.Module):
"""
使用合并池化(在 dim=2, dim=3 分别求 mean / max)替代自适应池化。
同时支持通道分组及降维比 (reduction),并可根据需要做自动clamp。
"""
def __init__(self, channel, reduction=16, num_groups=4):
super(GGCA_Fused, self).__init__()
# 基础检查
if channel < num_groups:
raise ValueError(f"GGCA: num_groups={num_groups} 超过通道数 {channel}")
if channel % num_groups != 0:
raise ValueError(f"GGCA: channel={channel} 不是 num_groups={num_groups} 的整数倍")
textself.num_groups = num_groups self.group_channels = channel // num_groups self.reduction = reduction # 确保降维后不为 0 out_ch = max(1, self.group_channels // reduction) self.shared_conv = nn.Sequential( nn.Conv2d(self.group_channels, out_ch, kernel_size=1), BatchNorm2d(out_ch), nn.ReLU(inplace=True), nn.Conv2d(out_ch, self.group_channels, kernel_size=1) ) self.sigmoid_h = nn.Sigmoid() self.sigmoid_w = nn.Sigmoid() def forward(self, x): """ x: (B, C, H, W) 分组 => fused pooling => shared_conv => 注意力加权 """ B, C, H, W = x.shape # 分组 x_grouped = x.view(B, self.num_groups, self.group_channels, H, W) x_grouped_2d = x_grouped.view(B*self.num_groups, self.group_channels, H, W) # (1) 在宽度方向 (dim=3) 上做 mean + max x_h_avg = x_grouped_2d.mean(dim=3, keepdim=True) # => (B*g, groupC, H, 1) x_h_max, _ = x_grouped_2d.max(dim=3, keepdim=True) # => (B*g, groupC, H, 1) # (2) 在高度方向 (dim=2) 上做 mean + max x_w_avg = x_grouped_2d.mean(dim=2, keepdim=True) # => (B*g, groupC, 1, W) x_w_max, _ = x_grouped_2d.max(dim=2, keepdim=True) # => (B*g, groupC, 1, W) # (3) 分别通过 共享卷积层 => (B*g, groupC, H,1) or (B*g, groupC,1,W) y_h_avg = self.shared_conv(x_h_avg) y_h_max = self.shared_conv(x_h_max) y_w_avg = self.shared_conv(x_w_avg) y_w_max = self.shared_conv(x_w_max) # (4) 相加 + sigmoid att_h = self.sigmoid_h(y_h_avg + y_h_max) # => (B*g, groupC, H,1) att_w = self.sigmoid_w(y_w_avg + y_w_max) # => (B*g, groupC,1,W) # reshape 回原分组形状 att_h = att_h.view(B, self.num_groups, self.group_channels, H, 1) att_w = att_w.view(B, self.num_groups, self.group_channels, 1, W) # (5) 元素乘 x_grouped out = x_grouped * att_h * att_w out = out.view(B, C, H, W) return out
#------------------------------------#
# 2. 倒残差结构: 支持可选GGCA
#------------------------------------#
class InvertedResidual(nn.Module):
"""
- 若 expand_ratio=1:
DW(3x3) -> BN -> ReLU6 -> [可选GGCA] -> PW(1x1) -> BN
- 若 expand_ratio!=1:
PW(1x1)升维 -> BN -> ReLU6 -> DW(3x3) -> BN -> ReLU6 -> [可选GGCA] -> PW(1x1)降维 -> BN
"""
def __init__(self, inp, oup, stride, expand_ratio,
use_ggca=True,
ggca_class=GGCA_Fused, # 指定使用 fused 版本的 GGCA
ggca_groups=1,
ggca_reduction=4):
super(InvertedResidual, self).__init__()
self.stride = stride
assert stride in [1, 2]
texthidden_dim = round(inp * expand_ratio) self.use_res_connect = (self.stride == 1 and inp == oup) self.use_ggca = use_ggca layers = [] if expand_ratio == 1: # (DW -> BN -> ReLU6) layers.extend([ nn.Conv2d(hidden_dim, hidden_dim, 3, stride, 1, groups=hidden_dim, bias=False), BatchNorm2d(hidden_dim), nn.ReLU6(inplace=True), ]) # 可选 GGCA if self.use_ggca: layers.append(ggca_class(channel=hidden_dim, reduction=ggca_reduction, num_groups=ggca_groups)) # PW (1x1) layers.extend([ nn.Conv2d(hidden_dim, oup, 1, 1, 0, bias=False), BatchNorm2d(oup) ]) else: # 1x1 升维 layers.extend([ nn.Conv2d(inp, hidden_dim, 1, 1, 0, bias=False), BatchNorm2d(hidden_dim), nn.ReLU6(inplace=True), # DW nn.Conv2d(hidden_dim, hidden_dim, 3, stride, 1, groups=hidden_dim, bias=False), BatchNorm2d(hidden_dim), nn.ReLU6(inplace=True), ]) # 可选 GGCA if self.use_ggca: layers.append(ggca_class(channel=hidden_dim, reduction=ggca_reduction, num_groups=ggca_groups)) # 1x1 降维 layers.extend([ nn.Conv2d(hidden_dim, oup, 1, 1, 0, bias=False), BatchNorm2d(oup) ]) self.conv = nn.Sequential(*layers) def forward(self, x): if self.use_res_connect: return x + self.conv(x) else: return self.conv(x)
#------------------------------------#
# 3. MobileNetV2 主体 + 分层策略
#------------------------------------#
class MobileNetV2_GGCA(nn.Module):
"""
- 仅在网络后半部分(或指定层)启用 GGCA
- 对下采样层(stride=2)禁用 GGCA
- 动态调节 num_groups / reduction
- 使用合并池化的 GGCA_Fused
"""
def __init__(self, n_class=1000, input_size=224, width_mult=1.0):
super(MobileNetV2_GGCA, self).__init__()
block = InvertedResidual
input_channel = 32
last_channel = 1280
text# [t, c, n, s] interverted_residual_setting = [ [1, 16, 1, 1], [6, 24, 2, 2], [6, 32, 3, 2], [6, 64, 4, 2], [6, 96, 3, 1], [6, 160,3, 2], [6, 320,1, 1], ] assert input_size % 32 == 0 input_channel = int(input_channel * width_mult) self.last_channel = int(last_channel * width_mult) if width_mult > 1.0 else last_channel # 首层: 3 -> input_channel self.features = [self._conv_bn(3, input_channel, 2)] # 构建倒残差层 + 分层策略 for idx, (t, c, n, s) in enumerate(interverted_residual_setting): output_channel = int(c * width_mult) for i in range(n): # 如果是第一个块,用 stride=s,否则 stride=1 current_stride = s if i == 0 else 1 # ---------- 1) 仅在网络后半部分启用 GGCA ---------- # 这里假设 idx>=3 就算后半部分(可自行调整) use_ggca = (idx >= 4) # ---------- 2) 对下采样层 (stride=2) 禁用 GGCA ---------- if current_stride == 2: use_ggca = False # ---------- 3) 动态调节 num_groups / reduction ---------- # 举例:如果 output_channel < 64,就用较小分组1, reduction=4 # 否则用较大分组4, reduction=16 if output_channel < 64: this_groups = 1 this_reduction = 4 else: this_groups = 4 this_reduction = 16 # 构建本层 self.features.append( block( inp=input_channel, oup=output_channel, stride=current_stride, expand_ratio=t, use_ggca=use_ggca, ggca_class=GGCA_Fused, # 使用合并池化版本 GGCA ggca_groups=this_groups, ggca_reduction=this_reduction ) ) input_channel = output_channel # 最后 1x1 卷积 self.features.append(self._conv_1x1_bn(input_channel, self.last_channel)) self.features = nn.Sequential(*self.features) # 分类器 self.classifier = nn.Sequential( nn.Dropout(0.2), nn.Linear(self.last_channel, n_class) ) self._initialize_weights() def _conv_bn(self, inp, oup, stride): return nn.Sequential( nn.Conv2d(inp, oup, 3, stride, 1, bias=False), BatchNorm2d(oup), nn.ReLU6(inplace=True) ) def _conv_1x1_bn(self, inp, oup): return nn.Sequential( nn.Conv2d(inp, oup, 1, 1, 0, bias=False), BatchNorm2d(oup), nn.ReLU6(inplace=True) ) def forward(self, x): x = self.features(x) # 全局平均池化 + 全连接 x = x.mean(dim=3).mean(dim=2) x = self.classifier(x) return x def _initialize_weights(self): for m in self.modules(): if isinstance(m, nn.Conv2d): if m.kernel_size and m.out_channels > 0: n = m.kernel_size[0] * m.kernel_size[1] * m.out_channels else: n = 0 if n > 0: m.weight.data.normal_(0, math.sqrt(2. / n)) if m.bias is not None: m.bias.data.zero_() elif isinstance(m, BatchNorm2d): m.weight.data.fill_(1) m.bias.data.zero_() elif isinstance(m, nn.Linear): if m.weight.size(1) > 0: n = m.weight.size(1) else: n = 0 if n > 0: m.weight.data.normal_(0, 0.01) m.bias.data.zero_()
def load_url(url, model_dir='./model_data', map_location=None):
if not os.path.exists(model_dir):
os.makedirs(model_dir)
filename = url.split('/')[-1]
cached_file = os.path.join(model_dir, filename)
if os.path.exists(cached_file):
return torch.load(cached_file, map_location=map_location)
else:
return model_zoo.load_url(url, model_dir=model_dir)
def mobilenetv2_ggca_fused(pretrained=False, **kwargs):
"""
外部构建函数,可选择是否加载预训练权重。
"""
model = MobileNetV2_GGCA(n_class=1000, **kwargs)
if pretrained:
# 可能与官方预训练权重不完全匹配,仅作示例
model.load_state_dict(
load_url('https://github.com/bubbliiiing/deeplabv3-plus-pytorch/releases/download/v1.0/mobilenet_v2.pth.tar'),
strict=False
)
return model
用专业文字描述GGCA以及使用策略,并解释这样做的原因
GGCA(Global Grouped Coordinate Attention) 是一种将通道分组和全局空间坐标注意力相结合的轻量化模块,用于在不显著增加参数的前提下增强网络对远距离上下文和局部细节的建模能力。该模块通过先将通道按组(num_groups)拆分,然后分别在高度和宽度方向进行均值与最大值的聚合运算(即合并池化/fused pooling),再借助 1×1 卷积完成降维与升维,从而生成高度与宽度的注意力权重。最终将注意力权重与原特征逐元素相乘,实现对目标区域的显著强化与噪声抑制。
在实际应用中,为了在保证一定注意力增益的同时避免大幅度地拖慢网络训练和推理,通常采取分层式的使用策略:

- 选择性启用:只在网络较深、特征分辨率较低的阶段插入 GGCA,并对下采样层禁用;
- 动态调参:按各层实际通道数调整分组数与降维比;
- 合并池化:用直接按维度求均值/最大值的方式取代自适应池化(adaptive_pool),减少函数调度损耗,进一步降低运算负担。

这种“选择性启用 + 动态调参 + 合并池化”的组合方法能够显著减少 GGCA 在高分辨率、大通道数场景中的重复池化和分组卷积带来的额外开销,同时充分利用网络深层阶段对长程依赖的需求,让模型在计算速度、显存占用及精度增益上达到更合理的折衷。
num_groups、reduction、adaptive_pool的中文是什么
在中文语境下,常见的对应表达如下:

- num_groups:分组数(即把通道划分为多少组);
- reduction:降维比(也常称缩减率、压缩比);
- adaptive_pool:自适应池化。

在具体论文或代码说明中,可视语义、上下文或个人习惯来选用更贴切的中文表述。
合并池化取代自适应池化,可以通过对特征维度(dim=2/3)的均值与最大值快速计算行/列统计量,相比一次次调用自适应池化减少函数调度损耗,进一步降低运算负担。
把这句话润色下,使之通顺
将合并池化(fused pooling)用来替代自适应池化后,可直接在特征维度(dim=2/3)上计算行/列的均值与最大值,以快速获得所需统计量;相比反复调用自适应池化,这种做法能显著减少函数调度的开销,从而进一步降低整体运算负担。
| Model | mIoU(%) | F1 Score(%) | FPS(f/s) | Parameter(M) |
|---|---|---|---|---|
| Xception | 65.55 | 76.78 | 45.51 | 54.71 |
| VGG16 | 66.84 | 77.89 | 46.98 | 20.15 |
| MobileNetV2 | 65.67 | 77.27 | 90.48 | 5.81 |
| Improved MobileNetV2 | 67.82 | 78.86 | 66.87 | 5.84 |
| MobileNetV4 | 64.23 | 75.73 | 73.26 | 31.79 |
| Swin Transformer v2 | 58.30 | 70.95 | 43.43 | 58.21 |
| MobileViT | 64.14 | 75.34 | 75.61 | 5.04 |
根据表格重点说明mobilenetv2和improved mobilenetv2的数据,简单说明其他主干网络的数据,并说明最后选择improved mobilenetv2的原因
从表格中可以看到,MobileNetV2 在各项指标上展现了较为突出的综合性能:它以 65.67% 的 mIoU 和 77.27% 的 F1 分数达到了相对较高的准确率,同时凭借仅有 5.81 M 的参数量取得了 90.48 FPS 的推理速度,是轻量化与高效性的代表。对其进行改进后,Improved MobileNetV2 在 mIoU(67.82%)与 F1(78.86%)上又有进一步提升,参数量基本保持不变(5.84 M),只是推理速度略降至 66.87 FPS,但依旧属于较为出色的轻量化网络。
相比之下,其他主干网络如 Xception 和 VGG16 在准确率方面也具有一定优势,如 VGG16 的 mIoU(66.84%)及 F1(77.89%)略高于原版 MobileNetV2,但其推理速度(46.98 FPS)仍明显低于 MobileNet 系列;Swin Transformer v2 则在参数量(58.21 M)和速度(43.43 FPS)上并不占优;MobileNetV4 和 MobileViT 的精度(mIoU / F1)相对稍低或相近,速度与参数规模并未表现出明显优势。
综合考虑模型轻量化、推理速度与准确率之间的平衡,本研究最终选择 Improved MobileNetV2 作为主干网络,既能在精度指标上优于原版 MobileNetV2,又能保持较低的参数量和适中的推理速度,满足了在多场景应用中对模型效率与准确度的兼顾需求。
deeplabv3_lowfeature.py
import torch
import torch.nn as nn
import torch.nn.functional as F
from nets.xception import xception
from nets.mobilenetv2 import mobilenetv2
from nets.ShallowFusion import ShallowFusionLite
class MobileNetV2(nn.Module):
def __init__(self, downsample_factor=8, pretrained=True):
super(MobileNetV2, self).__init__()
from functools import partial
textmodel = mobilenetv2(pretrained) # 去掉最后一个classification层 self.features = model.features[:-1] self.total_idx = len(self.features) self.down_idx = [2, 4, 7, 14] # 此处设定 (24, 32, 16, 32, 0.2) 与下方切片对应 (F1=24通道, F2=32通道) self.shallow_fusion = ShallowFusionLite( in_channels1=16, # <-- 与 F1.shape[1] 匹配 in_channels2=24, # <-- 与 F2.shape[1] 匹配 reduced_channels=16, out_channels=40, dropout=0.2 ) # 根据downsample_factor修改空洞卷积 if downsample_factor == 8: for i in range(self.down_idx[-2], self.down_idx[-1]): self.features[i].apply( partial(self._nostride_dilate, dilate=2) ) for i in range(self.down_idx[-1], self.total_idx): self.features[i].apply( partial(self._nostride_dilate, dilate=4) ) elif downsample_factor == 16: for i in range(self.down_idx[-1], self.total_idx): self.features[i].apply( partial(self._nostride_dilate, dilate=2) ) def _nostride_dilate(self, m, dilate): classname = m.__class__.__name__ if classname.find('Conv') != -1: if m.stride == (2, 2): m.stride = (1, 1) if m.kernel_size == (3, 3): m.dilation = (dilate // 2, dilate // 2) m.padding = (dilate // 2, dilate // 2) else: if m.kernel_size == (3, 3): m.dilation = (dilate, dilate) m.padding = (dilate, dilate) def forward(self, x): # F1 => self.features[:4] F1 = self.features[:2](x) # 通道一般是 24 # F2 => self.features[4:7] # print("F1",F1.shape) F2 = self.features[2:4](F1) # 通道一般是 32 # print("F2",F2.shape) # deep => self.features[7:] out = self.features[4:](F2) # deep feature (320通道等) # 将 F1, F2 融合得到 low_level_features low_level_features = self.shallow_fusion(F1, F2) return low_level_features, out
def count_parameters(model):
"""统计模型参数数量"""
return sum(p.numel() for p in model.parameters() if p.requires_grad)
class ASPP(nn.Module):
def __init__(self, dim_in, dim_out, rate=1, bn_mom=0.1):
super(ASPP, self).__init__()
# -----------------------------------------#
# 去掉 inplace=True
# -----------------------------------------#
self.branch1 = nn.Sequential(
nn.Conv2d(dim_in, dim_out, 1, 1, padding=0, dilation=rate, bias=True),
nn.BatchNorm2d(dim_out, momentum=bn_mom),
nn.ReLU(inplace=False),
)
self.branch2 = nn.Sequential(
nn.Conv2d(dim_in, dim_out, 3, 1, padding=6 * rate, dilation=6 * rate, bias=True),
nn.BatchNorm2d(dim_out, momentum=bn_mom),
nn.ReLU(inplace=False),
)
self.branch3 = nn.Sequential(
nn.Conv2d(dim_in, dim_out, 3, 1, padding=12 * rate, dilation=12 * rate, bias=True),
nn.BatchNorm2d(dim_out, momentum=bn_mom),
nn.ReLU(inplace=False),
)
self.branch4 = nn.Sequential(
nn.Conv2d(dim_in, dim_out, 3, 1, padding=18 * rate, dilation=18 * rate, bias=True),
nn.BatchNorm2d(dim_out, momentum=bn_mom),
nn.ReLU(inplace=False),
)
self.branch5_conv = nn.Conv2d(dim_in, dim_out, 1, 1, 0, bias=True)
self.branch5_bn = nn.BatchNorm2d(dim_out, momentum=bn_mom)
self.branch5_relu = nn.ReLU(inplace=False)
textself.conv_cat = nn.Sequential( nn.Conv2d(dim_out * 5, dim_out, 1, 1, padding=0, bias=True), nn.BatchNorm2d(dim_out, momentum=bn_mom), nn.ReLU(inplace=False), ) def forward(self, x): [b, c, row, col] = x.size() conv1x1 = self.branch1(x) conv3x3_1 = self.branch2(x) conv3x3_2 = self.branch3(x) conv3x3_3 = self.branch4(x) # -----------------------------------------# # 第五个分支,全局平均池化+卷积 # -----------------------------------------# global_feature = torch.mean(x, 2, True) global_feature = torch.mean(global_feature, 3, True) global_feature = self.branch5_conv(global_feature) global_feature = self.branch5_bn(global_feature) global_feature = self.branch5_relu(global_feature) global_feature = F.interpolate(global_feature, (row, col), None, 'bilinear', True) # -----------------------------------------# # 将五个分支拼接后 => 1x1卷积 # -----------------------------------------# feature_cat = torch.cat([conv1x1, conv3x3_1, conv3x3_2, conv3x3_3, global_feature], dim=1) result = self.conv_cat(feature_cat) return result
class DeepLab(nn.Module):
def __init__(self, num_classes, backbone="mobilenet", pretrained=True, downsample_factor=16):
super(DeepLab, self).__init__()
if backbone == "xception":
# xception可选,忽略
self.backbone = xception(downsample_factor=downsample_factor, pretrained=pretrained)
in_channels = 2048
low_level_channels = 256
elif backbone == "mobilenet":
# 替换为我们上面改好的MobileNetV2
self.backbone = MobileNetV2(downsample_factor=downsample_factor, pretrained=pretrained)
in_channels = 320
low_level_channels = 40
else:
raise ValueError('Unsupported backbone - {}, Use mobilenet or xception.'.format(backbone))
text# -----------------------------------------# # ASPP特征提取模块 # -----------------------------------------# self.aspp = ASPP(dim_in=in_channels, dim_out=256, rate=16 // downsample_factor) # ----------------------------------# # 浅层特征 => conv => 48通道 # ----------------------------------# self.shortcut_conv = nn.Sequential( nn.Conv2d(low_level_channels, 48, 1), nn.BatchNorm2d(48), nn.ReLU(inplace=False) ) # ----------------------------------# # 上采样后与浅层特征堆叠,再卷积 # ----------------------------------# self.cat_conv = nn.Sequential( nn.Conv2d(48 + 256, 256, 3, stride=1, padding=1), nn.BatchNorm2d(256), nn.ReLU(inplace=False), nn.Dropout(0.5), nn.Conv2d(256, 256, 3, stride=1, padding=1), nn.BatchNorm2d(256), nn.ReLU(inplace=False), nn.Dropout(0.1), ) self.cls_conv = nn.Conv2d(256, num_classes, 1, stride=1) def forward(self, x): H, W = x.size(2), x.size(3) # backbone => (low_level_features, deep_feature) low_level_features, x = self.backbone(x) # ASPP on deep features x = self.aspp(x) # 浅层特征 1x1 => 48通道 low_level_features = self.shortcut_conv(low_level_features) # 上采样 ASPP 输出到与low_level匹配后拼接 x = F.interpolate(x, size=(low_level_features.size(2), low_level_features.size(3)), mode='bilinear', align_corners=True) x = torch.cat((x, low_level_features), dim=1) x = self.cat_conv(x) x = self.cls_conv(x) x = F.interpolate(x, size=(H, W), mode='bilinear', align_corners=True) return x
if __name__ == '__main__':
img = torch.randn(3, 3, 512, 512)
model = DeepLab(num_classes=3, backbone="mobilenet", pretrained=False, downsample_factor=16)
out = model(img)
print("Output shape:", out.shape)
print("DeepLabV3+ 参数数量:", count_parameters(model))
ShallowFusion.py
import torch
import torch.nn as nn
import torch.nn.functional as F
from nets.CBAM import CBAM
class DepthwiseSeparableConv(nn.Module):
"""
标准的深度可分离卷积:
DW( groups = in_channels ) + PW( 1x1 卷积 )
"""
def __init__(self, inp, oup, kernel_size=3, stride=1, padding=1, bias=False):
super(DepthwiseSeparableConv, self).__init__()
self.depthwise = nn.Conv2d(inp, inp, kernel_size, stride, padding,
groups=inp, bias=bias)
self.pointwise = nn.Conv2d(inp, oup, 1, 1, 0, bias=bias)
self.bn = nn.BatchNorm2d(oup)
# 将原先的 nn.ReLU(inplace=True) 改为:
self.relu = nn.ReLU(inplace=False)
    def forward(self, x):
        out = self.depthwise(x)
        out = self.pointwise(out)
        out = self.bn(out)
        out = self.relu(out)
        return out
class ShallowFusionLite(nn.Module):
"""
改进的浅层特征融合:
1) 对 F1、F2 先做 1x1 卷积减小通道数
2) 分别用一次 DepthwiseSeparableConv 提取
3) 上采样 F2 => 拼接 => 再做一层 DepthwiseSeparableConv + Dropout
"""
def __init__(self, in_channels1, in_channels2,
reduced_channels=16, out_channels=32, dropout=0.2):
super(ShallowFusionLite, self).__init__()
text# 1) 用1x1卷积先压缩通道 self.reduce_f1 = nn.Conv2d(in_channels1, reduced_channels, kernel_size=1, bias=False) self.reduce_f2 = nn.Conv2d(in_channels2, reduced_channels, kernel_size=1, bias=False) # 2) 深度可分离卷积提炼 self.conv1 = DepthwiseSeparableConv(reduced_channels, reduced_channels, kernel_size=3, padding=1) self.conv2 = DepthwiseSeparableConv(reduced_channels, reduced_channels, kernel_size=3, padding=1) # 3) 最终融合: 拼接后 => depthwise separable => out_channels # 并将 Dropout 的 inplace 改为 False self.fuse_conv = nn.Sequential( DepthwiseSeparableConv(2 * reduced_channels, out_channels, kernel_size=3, padding=1), nn.Dropout(dropout, inplace=False) # 改为 inplace=False ) self.attn = CBAM(channels=out_channels,reduction=16,kernel_size=7) def forward(self, F1, F2): # 1) 通道压缩 F1_reduced = self.reduce_f1(F1) F2_reduced = self.reduce_f2(F2) # 2) 深度可分离卷积 F1_out = self.conv1(F1_reduced) F2_out = self.conv2(F2_reduced) # 3) 上采样F2 F2_up = F.interpolate(F2_out, size=F1_out.shape[2:], mode='bilinear', align_corners=False) # 4) 拼接 & fuse_conv fused = torch.cat([F1_out, F2_up], dim=1) out = self.fuse_conv(fused) out = self.attn(out) return out
我在以上代码显示的浅层特征提取模块中做了四组对比实验,即在mobilenetv2的不同层倒残差块中提取两个作为F1和F2,结果如下
第一组
in_channels1=24, # <-- 与 F1.shape[1] 匹配
in_channels2=32, # <-- 与 F2.shape[1] 匹配
reduced_channels=16,
out_channels=32,
dropout=0.2
===>background: Iou-99.43; Recall (equal to the PA)-99.62; Precision-99.81
===>corrosion: Iou-56.69; Recall (equal to the PA)-82.52; Precision-64.43
===>Broken wire: Iou-44.57; Recall (equal to the PA)-67.73; Precision-56.58
===> mIoU: 66.9; mPA: 83.29; Accuracy: 99.43
83.31912998089719FPS
第二组
in_channels1=24, # <-- 与 F1.shape[1] 匹配
in_channels2=32, # <-- 与 F2.shape[1] 匹配
reduced_channels=24,
out_channels=56,
dropout=0.2
===>background: Iou-99.46; Recall (equal to the PA)-99.65; Precision-99.81
===>corrosion: Iou-60.1; Recall (equal to the PA)-82.69; Precision-68.75
===>Broken wire: Iou-44.54; Recall (equal to the PA)-67.75; Precision-56.52
===> mIoU: 68.03; mPA: 83.36; Accuracy: 99.47
80.5407908674515FPS
第三组
in_channels1=32, # <-- 与 F1.shape[1] 匹配
in_channels2=64, # <-- 与 F2.shape[1] 匹配
reduced_channels=24,
out_channels=96,
dropout=0.2
===>background: Iou-99.33; Recall (equal to the PA)-99.51; Precision-99.83
===>corrosion: Iou-56.8; Recall (equal to the PA)-82.36; Precision-64.67
===>Broken wire: Iou-39.49; Recall (equal to the PA)-72.6; Precision-46.41
===> mIoU: 65.21; mPA: 84.82; Accuracy: 99.34
102.72896593792107FPS
第四组
in_channels1=16, # <-- 与 F1.shape[1] 匹配
in_channels2=24, # <-- 与 F2.shape[1] 匹配
reduced_channels=16,
out_channels=40,
dropout=0.2
===>background: Iou-99.48; Recall (equal to the PA)-99.67; Precision-99.81
===>corrosion: Iou-59.67; Recall (equal to the PA)-83.68; Precision-67.52
===>Broken wire: Iou-45.74; Recall (equal to the PA)-65.53; Precision-60.23
===> mIoU: 68.29; mPA: 82.96; Accuracy: 99.48
41.07247440412605FPS
该如何设计表格,并用文字说明
下面给出一个示例表格与相应的文字说明,帮助呈现对比实验中不同浅层特征配置的结果。该表格包含了关键参数(in_channels1、in_channels2、reduced_channels、out_channels、dropout)以及主要性能指标(mIoU、mPA、Accuracy、FPS)等。如果需要更详细地对每个类别(如 background、corrosion、broken wire)的 IoU、Recall、Precision 进行展示,则可在备注列或附表中列出。
下表仅示例核心字段,可根据需求增加列,例如在单独列中记录每个类别的 Recall / Precision。
实验 | in_channels1 | in_channels2 | reduced_channels | out_channels | dropout | mIoU | mPA | Acc(%) | FPS |
---|---|---|---|---|---|---|---|---|---|
第1组 | 24 | 32 | 16 | 32 | 0.2 | 66.90 | 83.29 | 99.43 | 83.32 |
第2组 | 24 | 32 | 24 | 56 | 0.2 | 68.03 | 83.36 | 99.47 | 80.54 |
第3组 | 32 | 64 | 24 | 96 | 0.2 | 65.21 | 84.82 | 99.34 | 102.73 |
第4组 | 16 | 24 | 16 | 40 | 0.2 | 68.29 | 82.96 | 99.48 | 41.07 |
如果想要更细致地展示 background、corrosion、broken wire 各类别的 IoU / Recall / Precision,则可以在主表之外附上一个“精细结果表”,或者在每一行的备注列中补充:“背景IoU=99.43”等简要信息。
实验设置
本实验从 MobileNetV2 的不同倒残差块中提取两个浅层特征 F1、F2(其通道数对应参数 in_channels1, in_channels2),然后使用 ShallowFusionLite 模块对它们进行融合。
通过调整 reduced_channels 和 out_channels,我们改变了压缩与输出通道规模;dropout=0.2 用于一定程度上的防止过拟合。
结果对比
第 1 组采用 reduced_channels=16、out_channels=32,取得了 mIoU=66.90、mPA=83.29、推理速度 83.32 FPS。
第 2 组增大了 reduced_channels 和 out_channels(从 16/32 调整到 24/56),mIoU 提升至 68.03,mPA 也略微提高到 83.36,但速度小幅下降至 80.54 FPS。
第 3 组改用更深层的特征(32/64 通道)并采用 reduced_channels=24、out_channels=96,在 mPA=84.82 上表现最好,但 mIoU 降为 65.21;由于网络结构差异及通道数增加,其推理速度相对更高(102.73 FPS),可能与各层计算量或运算并行度有关。
第 4 组改用更浅层的特征(16/24 通道)并采用 reduced_channels=16、out_channels=40,在 mIoU=68.29 上优于第一组和第三组,但推理速度仅 41.07 FPS,说明此组合在精度较好的同时,对速度影响较大。
结论与分析
当 reduced_channels 和 out_channels 增大时,通常可带来精度的提升,但可能略微降低速度;浅层特征取自哪两个倒残差块,同样会明显影响精度与速度的权衡。
这种表格与文字相结合的呈现方式,能够清晰地对比各组在通道选择与网络结构上的差异,并直观反映在性能指标(mIoU、mPA、FPS)上的影响,从而指导更有针对性的网络配置与参数调优。
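若不想手工维护这张表,也可以用几行脚本把各组日志汇总成同样的表格(仅为示意,数值取自上文日志,假定环境中已安装 pandas):
import pandas as pd

rows = [
    ("第1组", 24, 32, 16, 32, 0.2, 66.90, 83.29, 99.43, 83.32),
    ("第2组", 24, 32, 24, 56, 0.2, 68.03, 83.36, 99.47, 80.54),
    ("第3组", 32, 64, 24, 96, 0.2, 65.21, 84.82, 99.34, 102.73),
    ("第4组", 16, 24, 16, 40, 0.2, 68.29, 82.96, 99.48, 41.07),
]
cols = ["实验", "in_channels1", "in_channels2", "reduced_channels",
        "out_channels", "dropout", "mIoU", "mPA", "Acc(%)", "FPS"]
df = pd.DataFrame(rows, columns=cols)
print(df.to_string(index=False))   # 也可导出为 csv 后再排版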
Groups F1 channels F2 channels Out channels mIoU(%) F1 Score(%) FPS(f/s)
1 24 32 32 66.9 78.15 83.32
2 24 32 56 68.03 78.98 80.54
3 32 64 96 65.21 76.88 102.73
4 16 24 40 68.29 79.26 41.07
这是实验最终结果,先说明设计本实验的目的是选择最适合的浅层特征通道,然后说明实验数据,最后解释说明选择第二组的原因
下表汇总了在不同浅层通道组合下(F1、F2 及对应输出通道)所取得的 mIoU、F1 分数以及推理速度(FPS)。通过对比,可以评估哪一种浅层特征通道更适合任务需求。
Groups | F1 channels | F2 channels | Out channels | mIoU (%) | F1 Score (%) | FPS (f/s) |
---|---|---|---|---|---|---|
1 | 24 | 32 | 32 | 66.90 | 78.15 | 83.32 |
2 | 24 | 32 | 56 | 68.03 | 78.98 | 80.54 |
3 | 32 | 64 | 96 | 65.21 | 76.88 | 102.73 |
4 | 16 | 24 | 40 | 68.29 | 79.26 | 41.07 |
本实验旨在对不同的浅层通道设置(F1 与 F2)进行对比评估,从而找到最合适的浅层特征提取方案,为后续的特征融合和分割推理提供更好的平衡点。实验主要关注 mIoU 与 F1 Score 等准确性指标,并结合 FPS 评估模型推理效率。
整体表现
从表中可以看出:第 4 组取得最高的 mIoU(68.29%)与 F1 Score(79.26%),但推理速度只有 41.07 f/s;第 3 组速度最快(102.73 f/s),精度却最低(mIoU=65.21%);第 1、2 组在精度与速度之间较为均衡,其中第 2 组精度更高。
选择第二组的原因
第 2 组的 mIoU 与 F1 Score 仅比第 4 组低 0.3 个百分点左右,推理速度却接近第 4 组的两倍;与第 3 组相比,速度略慢,但精度明显更高。
综上,本实验表明在浅层通道选择方面,第 2 组(F1=24, F2=32, Out channels=56)相对更适合本任务:它在获得可观精度(mIoU=68.03%、F1=78.98%)的同时,依旧能保持 80.54 fps 的推理效率,较好地满足了多场景实际应用对速度与精度的双重需求。
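表中的 F1 Score 可以由各类别的 Precision 与 Recall 先求调和平均、再按类取平均得到;下面给出一个计算示意(以第 2 组的日志数值为例,结果因四舍五入与表中数值可能略有出入):
def f1(precision, recall):
    """单个类别的 F1 分数(Precision 与 Recall 的调和平均)。"""
    return 2 * precision * recall / (precision + recall)

# 第 2 组三个类别的 (Precision, Recall),数值取自上文日志
classes = {
    "background":  (99.81, 99.65),
    "corrosion":   (68.75, 82.69),
    "Broken wire": (56.52, 67.75),
}
per_class_f1 = {name: f1(p, r) for name, (p, r) in classes.items()}
mean_f1 = sum(per_class_f1.values()) / len(per_class_f1)
print(per_class_f1)
print("mean F1 ≈ %.2f" % mean_f1)   # 约 78.8,与表中 78.98 接近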
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
#------------------------------------------------#
#------------------------------------------------#
from nets.xception import xception
from nets.mobilenetv2 import mobilenetv2
############################################
############################################
class MobileNetV2(nn.Module):
def __init__(self, downsample_factor=8, pretrained=True):
super(MobileNetV2, self).__init__()
from functools import partial
model = mobilenetv2(pretrained)
self.features = model.features[:-1]
textself.total_idx = len(self.features) self.down_idx = [2, 4, 7, 14] if downsample_factor == 8: for i in range(self.down_idx[-2], self.down_idx[-1]): self.features[i].apply(partial(self._nostride_dilate, dilate=2)) for i in range(self.down_idx[-1], self.total_idx): self.features[i].apply(partial(self._nostride_dilate, dilate=4)) elif downsample_factor == 16: for i in range(self.down_idx[-1], self.total_idx): self.features[i].apply(partial(self._nostride_dilate, dilate=2)) def _nostride_dilate(self, m, dilate): classname = m.__class__.__name__ if classname.find('Conv') != -1: if m.stride == (2, 2): m.stride = (1, 1) if m.kernel_size == (3, 3): m.dilation = (dilate // 2, dilate // 2) m.padding = (dilate // 2, dilate // 2) else: if m.kernel_size == (3, 3): m.dilation = (dilate, dilate) m.padding = (dilate, dilate) def forward(self, x): # 前4层 -> low_level_features low_level_features = self.features[:4](x) # 后面 -> deep_features x = self.features[4:](low_level_features) return low_level_features, x
############################################
############################################
def count_parameters(model):
"""统计模型参数数量"""
return sum(p.numel() for p in model.parameters() if p.requires_grad)
############################################
############################################
def window_partition(x, window_size):
"""
将 (B, C, H, W) -> (B*num_windows, C, window_size, window_size)
假设 H,W 能被 window_size 整除
"""
B, C, H, W = x.shape
x = x.view(B, C, H // window_size, window_size, W // window_size, window_size)
x = x.permute(0, 2, 4, 1, 3, 5).contiguous()
windows = x.view(-1, C, window_size, window_size)
return windows
def window_unpartition(windows, window_size, H, W):
"""
将 (B*num_windows, C, window_size, window_size) -> (B, C, H, W)
"""
num_windows = (H // window_size) * (W // window_size)
B = windows.size(0) // num_windows
C = windows.size(1)
x = windows.view(B, H // window_size, W // window_size, C, window_size, window_size)
x = x.permute(0, 3, 1, 4, 2, 5).contiguous()
x = x.view(B, C, H, W)
return x
############################################
############################################
def scaled_dot_product_attention(Q, K, V):
"""
Q, K, V: (B, heads, seq_len, dim_per_head)
返回: (B, heads, seq_len, dim_per_head)
"""
attn_scores = torch.matmul(Q, K.transpose(-2, -1)) # [B, heads, seq_len, seq_len]
dim_k = K.shape[-1]
attn_scores = attn_scores / math.sqrt(dim_k)
attn = F.softmax(attn_scores, dim=-1)
out = torch.matmul(attn, V) # [B, heads, seq_len, dim_per_head]
return out
############################################
############################################
class TransformerBlock(nn.Module):
"""
标准Transformer Encoder风格:
1) LayerNorm -> MultiHeadAttention -> Dropout -> Residual
2) LayerNorm -> FFN(MLP) -> Dropout -> Residual
不含位置编码,如需可自行加
"""
def __init__(self, dim, num_heads=4, mlp_ratio=4.0, attn_dropout=0.0, ffn_dropout=0.0):
super().__init__()
self.dim = dim
self.num_heads = num_heads
self.mlp_ratio = mlp_ratio
self.attn_dropout = attn_dropout
self.ffn_dropout = ffn_dropout
text#---- 注意力前的LN ---- self.norm1 = nn.LayerNorm(dim) # Q,K,V合并投影 self.qkv = nn.Linear(dim, dim*3, bias=True) self.attn_drop = nn.Dropout(attn_dropout) # 输出投影 self.proj = nn.Linear(dim, dim, bias=True) self.proj_drop = nn.Dropout(attn_dropout) #---- FFN 前的LN ---- self.norm2 = nn.LayerNorm(dim) hidden_dim = int(dim * mlp_ratio) self.ffn = nn.Sequential( nn.Linear(dim, hidden_dim), nn.GELU(), nn.Dropout(ffn_dropout), nn.Linear(hidden_dim, dim), nn.Dropout(ffn_dropout), ) def forward(self, x): """ x: (B, seq_len, dim) """ B, N, D = x.shape # (1) 自注意力子层 shortcut = x x_ln = self.norm1(x) qkv = self.qkv(x_ln).reshape(B, N, 3, self.num_heads, D//self.num_heads) qkv = qkv.permute(2, 0, 3, 1, 4) # (3, B, heads, seq_len, dim_ph) Q, K, V = qkv[0], qkv[1], qkv[2] # (B, heads, seq_len, dim_ph) # 多头注意力 attn_out = scaled_dot_product_attention(Q, K, V) attn_out = self.attn_drop(attn_out) # 合并heads attn_out = attn_out.permute(0, 2, 1, 3).contiguous() # => (B, seq_len, dim) attn_out = attn_out.view(B, N, D) # 输出投影 attn_out = self.proj(attn_out) attn_out = self.proj_drop(attn_out) # 残差 x = shortcut + attn_out # (2) FFN子层 shortcut2 = x x_ln2 = self.norm2(x) x_ffn = self.ffn(x_ln2) x = shortcut2 + x_ffn return x
############################################
############################################
class WindowTransformerBlock(nn.Module):
"""
- 将(B, C, H, W)划分为窗口
- 在每个窗口内, 将通道C视作embedding dim, (ws*ws)视作序列长度
- 送进 TransformerBlock
- 最终输出和输入维度相同 (in_channels不变)
"""
def __init__(self, in_channels, window_size=8, num_heads=4, mlp_ratio=4.0,
attn_dropout=0.0, ffn_dropout=0.0):
super().__init__()
self.in_channels = in_channels
self.window_size = window_size
textself.transformer = TransformerBlock( dim = in_channels, num_heads = num_heads, mlp_ratio = mlp_ratio, attn_dropout= attn_dropout, ffn_dropout = ffn_dropout ) def forward(self, x): """ x: (B, C, H, W) return: (B, C, H, W) """ B, C, H, W = x.shape ws = self.window_size # 1) 分窗口 => (B*nw, C, ws, ws) windows = window_partition(x, ws) # 2) => (B*nw, ws*ws, C) windows_reshape = windows.view(-1, C, ws*ws).permute(0, 2, 1).contiguous() # 3) TransformerBlock out = self.transformer(windows_reshape) # (B*nw, ws*ws, C) # 4) => (B*nw, C, ws, ws) out = out.permute(0, 2, 1).contiguous() out = out.view(-1, C, ws, ws) # 5) 拼回 (B, C, H, W) x_out = window_unpartition(out, ws, H, W) return x_out
############################################
############################################
class WA_ASPP(nn.Module):
"""
五条分支:
1) branch1: 1×1 conv => (320->256)
2) branch2/3/4: WindowTransformerBlock(320->320) + 1×1 conv(320->256) => 层次叠加
3) branch5: 全局平均池化(320->256)
最终拼接 => 1×1 conv => 256
"""
def __init__(self, dim_in, dim_out,
window_size=(4,8,16),
num_heads=4,
mlp_ratio=4.0,
attn_dropout=0.0,
ffn_dropout=0.0):
super().__init__()
text# branch1: 1×1 conv => [B, dim_out, H, W] self.branch1 = nn.Sequential( nn.Conv2d(dim_in, dim_out, kernel_size=1), nn.BatchNorm2d(dim_out), nn.ReLU(inplace=True) ) # branch2/3/4: WindowTransformerBlock (输入输出都是dim_in=320) self.branch2_attn = WindowTransformerBlock( in_channels = dim_in, window_size = window_size[0], num_heads = num_heads, mlp_ratio = mlp_ratio, attn_dropout = attn_dropout, ffn_dropout = ffn_dropout ) self.branch3_attn = WindowTransformerBlock( in_channels = dim_in, window_size = window_size[1], num_heads = num_heads, mlp_ratio = mlp_ratio, attn_dropout = attn_dropout, ffn_dropout = ffn_dropout ) self.branch4_attn = WindowTransformerBlock( in_channels = dim_in, window_size = window_size[2], num_heads = num_heads, mlp_ratio = mlp_ratio, attn_dropout = attn_dropout, ffn_dropout = ffn_dropout ) # 额外的 1×1 Conv:将分支2/3/4输出从 320 => 256 self.branch2_proj = nn.Conv2d(dim_in, dim_out, kernel_size=1, bias=False) self.branch3_proj = nn.Conv2d(dim_in, dim_out, kernel_size=1, bias=False) self.branch4_proj = nn.Conv2d(dim_in, dim_out, kernel_size=1, bias=False) # branch5: 全局平均池化 => 卷积(320->256) self.branch5_conv = nn.Conv2d(dim_in, dim_out, kernel_size=1) self.branch5_bn = nn.BatchNorm2d(dim_out) self.branch5_relu = nn.ReLU(inplace=True) # 拼接(5路) => 1×1 conv => dim_out self.fuse_conv = nn.Sequential( nn.Conv2d(dim_out * 5, dim_out, kernel_size=1), nn.BatchNorm2d(dim_out), nn.ReLU(inplace=True) ) def forward(self, x): B, C, H, W = x.size() # branch1 => [B, 256, H, W] out1 = self.branch1(x) # branch2 => WindowAttn => [B, 320, H, W] -> 1x1Conv => [B,256,H,W] out2 = self.branch2_attn(x) out2 = self.branch2_proj(out2) out2_agg = out1 + out2 # branch3 out3 = self.branch3_attn(x) out3 = self.branch3_proj(out3) out3_agg = out1 + out2_agg + out3 # branch4 out4 = self.branch4_attn(x) out4 = self.branch4_proj(out4) out4_agg = out1 + out2_agg + out3_agg + out4 # branch5 => 全局池化 => 卷积=>bn=>relu => 上采样 => [B,256,H,W] gp = F.adaptive_avg_pool2d(x, 1) # (B, 320, 1, 1) gp = self.branch5_conv(gp) # => (B, 256, 1, 1) gp = self.branch5_bn(gp) gp = self.branch5_relu(gp) gp = F.interpolate(gp, size=(H, W), mode='bilinear', align_corners=False) # 拼接 [out1, out2_agg, out3_agg, out4_agg, gp] => [B, 256*5, H, W] out_cat = torch.cat([out1, out2_agg, out3_agg, out4_agg, gp], dim=1) # 融合 => [B, 256, H, W] out = self.fuse_conv(out_cat) return out
############################################
############################################
class DeepLab(nn.Module):
def __init__(self, num_classes, backbone="mobilenet", pretrained=True, downsample_factor=16):
super(DeepLab, self).__init__()
if backbone == "xception":
self.backbone = xception(downsample_factor=downsample_factor, pretrained=pretrained)
in_channels = 2048
low_level_channels = 256
elif backbone == "mobilenet":
self.backbone = MobileNetV2(downsample_factor=downsample_factor, pretrained=pretrained)
#----------------------------------#
# mobilenetv2 输出deep_features: 320通道
# 浅层特征 low_level_features: 24通道
#----------------------------------#
in_channels = 320
low_level_channels = 24
else:
raise ValueError('Unsupported backbone - {}, Use mobilenet, xception.'.format(backbone))
text# 这就是我们自定义的多头窗口注意力ASPP self.aspp = WA_ASPP( dim_in = in_channels, # 320 dim_out = 256, window_size = (4,8,16), # 窗口大小 num_heads = 4, # 多头数 mlp_ratio = 4.0, # FFN扩展倍数 attn_dropout= 0.1, # 注意力dropout ffn_dropout = 0.1 # FFN dropout ) # 浅层特征 => 1×1卷积 => 48通道 self.shortcut_conv = nn.Sequential( nn.Conv2d(low_level_channels, 48, 1), nn.BatchNorm2d(48), nn.ReLU(inplace=True) ) # 拼接后 => 进一步卷积 self.cat_conv = nn.Sequential( nn.Conv2d(48 + 256, 256, 3, stride=1, padding=1), nn.BatchNorm2d(256), nn.ReLU(inplace=True), nn.Dropout(0.5), nn.Conv2d(256, 256, 3, stride=1, padding=1), nn.BatchNorm2d(256), nn.ReLU(inplace=True), nn.Dropout(0.1), ) # 最终1x1分类层 self.cls_conv = nn.Conv2d(256, num_classes, 1, stride=1) def forward(self, x): H, W = x.size(2), x.size(3) # 主干网络 => 浅层特征, 深层特征 low_level_features, deep_features = self.backbone(x) # 深层特征 => WA_ASPP deep_features = self.aspp(deep_features) # 浅层特征 => 1×1 conv => 48通道 low_level_features = self.shortcut_conv(low_level_features) # 将深层特征上采样到浅层特征大小,然后拼接 deep_features = F.interpolate(deep_features, size=(low_level_features.size(2), low_level_features.size(3)), mode='bilinear', align_corners=True) x_cat = torch.cat((deep_features, low_level_features), dim=1) x_cat = self.cat_conv(x_cat) # 最终分类头 x_cat = self.cls_conv(x_cat) x_cat = F.interpolate(x_cat, size=(H, W), mode='bilinear', align_corners=True) return x_cat
############################################
############################################
if __name__ == '__main__':
img = torch.randn(3, 3, 512, 512)
model = DeepLab(num_classes=3, backbone="mobilenet", pretrained=False, downsample_factor=16)
out = model(img)
print("输出维度:", out.shape) # 期望 (3, 3, 512, 512)
print("DeepLabV3+参数数量:", count_parameters(model))
我想把以上代码中的WA_ASPP中的窗口注意力模块替换为mobileVit中的注意力模块,mobileVit代码如下
import torch
import torch.nn as nn
from einops import rearrange
def conv_1x1_bn(inp, oup):
return nn.Sequential(
nn.Conv2d(inp, oup, 1, 1, 0, bias=False),
nn.BatchNorm2d(oup),
nn.SiLU()
)
def conv_nxn_bn(inp, oup, kernal_size=3, stride=1):
return nn.Sequential(
nn.Conv2d(inp, oup, kernal_size, stride, 1, bias=False),
nn.BatchNorm2d(oup),
nn.SiLU()
)
class PreNorm(nn.Module):
def __init__(self, dim, fn):
super().__init__()
self.norm = nn.LayerNorm(dim)
self.fn = fn
def forward(self, x, **kwargs):
    return self.fn(self.norm(x), **kwargs)
class FeedForward(nn.Module):
def __init__(self, dim, hidden_dim, dropout=0.):
super().__init__()
self.net = nn.Sequential(
nn.Linear(dim, hidden_dim),
nn.SiLU(),
nn.Dropout(dropout),
nn.Linear(hidden_dim, dim),
nn.Dropout(dropout)
)
def forward(self, x):
    return self.net(x)
class Attention(nn.Module):
def __init__(self, dim, heads=8, dim_head=64, dropout=0.):
super().__init__()
inner_dim = dim_head * heads
project_out = not (heads == 1 and dim_head == dim)
textself.heads = heads self.scale = dim_head ** -0.5 self.attend = nn.Softmax(dim=-1) self.to_qkv = nn.Linear(dim, inner_dim * 3, bias=False) self.to_out = nn.Sequential( nn.Linear(inner_dim, dim), nn.Dropout(dropout) ) if project_out else nn.Identity() def forward(self, x): qkv = self.to_qkv(x).chunk(3, dim=-1) q, k, v = map(lambda t: rearrange(t, 'b p n (h d) -> b p h n d', h=self.heads), qkv) dots = torch.matmul(q, k.transpose(-1, -2)) * self.scale attn = self.attend(dots) out = torch.matmul(attn, v) out = rearrange(out, 'b p h n d -> b p n (h d)') return self.to_out(out)
class Transformer(nn.Module):
def __init__(self, dim, depth, heads, dim_head, mlp_dim, dropout=0.):
super().__init__()
self.layers = nn.ModuleList([])
for _ in range(depth):
self.layers.append(nn.ModuleList([
PreNorm(dim, Attention(dim, heads, dim_head, dropout)),
PreNorm(dim, FeedForward(dim, mlp_dim, dropout))
]))
def forward(self, x):
    for attn, ff in self.layers:
        x = attn(x) + x
        x = ff(x) + x
    return x
class MV2Block(nn.Module):
def __init__(self, inp, oup, stride=1, expansion=4):
super().__init__()
self.stride = stride
assert stride in [1, 2]
texthidden_dim = int(inp * expansion) self.use_res_connect = self.stride == 1 and inp == oup if expansion == 1: self.conv = nn.Sequential( # dw nn.Conv2d(hidden_dim, hidden_dim, 3, stride, 1, groups=hidden_dim, bias=False), nn.BatchNorm2d(hidden_dim), nn.SiLU(), # pw-linear nn.Conv2d(hidden_dim, oup, 1, 1, 0, bias=False), nn.BatchNorm2d(oup), ) else: self.conv = nn.Sequential( # pw nn.Conv2d(inp, hidden_dim, 1, 1, 0, bias=False), nn.BatchNorm2d(hidden_dim), nn.SiLU(), # dw nn.Conv2d(hidden_dim, hidden_dim, 3, stride, 1, groups=hidden_dim, bias=False), nn.BatchNorm2d(hidden_dim), nn.SiLU(), # pw-linear nn.Conv2d(hidden_dim, oup, 1, 1, 0, bias=False), nn.BatchNorm2d(oup), ) def forward(self, x): if self.use_res_connect: return x + self.conv(x) else: return self.conv(x)
class MobileViTBlock(nn.Module):
def __init__(self, dim, depth, channel, kernel_size, patch_size, mlp_dim, dropout=0.):
super().__init__()
self.ph, self.pw = patch_size
textself.conv1 = conv_nxn_bn(channel, channel, kernel_size) self.conv2 = conv_1x1_bn(channel, dim) self.transformer = Transformer(dim, depth, 4, 8, mlp_dim, dropout) self.conv3 = conv_1x1_bn(dim, channel) self.conv4 = conv_nxn_bn(2 * channel, channel, kernel_size) def forward(self, x): y = x.clone() # Local representations x = self.conv1(x) x = self.conv2(x) # Global representations _, _, h, w = x.shape x = rearrange(x, 'b d (h ph) (w pw) -> b (ph pw) (h w) d', ph=self.ph, pw=self.pw) x = self.transformer(x) x = rearrange(x, 'b (ph pw) (h w) d -> b d (h ph) (w pw)', h=h // self.ph, w=w // self.pw, ph=self.ph, pw=self.pw) # Fusion x = self.conv3(x) x = torch.cat((x, y), 1) x = self.conv4(x) return x
class MobileViT(nn.Module):
def __init__(self, image_size, dims, channels, num_classes, expansion=4, kernel_size=3, patch_size=(2, 2)):
super().__init__()
ih, iw = image_size
ph, pw = patch_size
assert ih % ph == 0 and iw % pw == 0
textL = [2, 4, 3] self.conv1 = conv_nxn_bn(3, channels[0], stride=2) self.mv2 = nn.ModuleList([]) self.mv2.append(MV2Block(channels[0], channels[1], 1, expansion)) self.mv2.append(MV2Block(channels[1], channels[2], 2, expansion)) self.mv2.append(MV2Block(channels[2], channels[3], 1, expansion)) self.mv2.append(MV2Block(channels[2], channels[3], 1, expansion)) # Repeat self.mv2.append(MV2Block(channels[3], channels[4], 2, expansion)) self.mv2.append(MV2Block(channels[5], channels[6], 2, expansion)) self.mv2.append(MV2Block(channels[7], channels[8], 2, expansion)) self.mvit = nn.ModuleList([]) self.mvit.append(MobileViTBlock(dims[0], L[0], channels[5], kernel_size, patch_size, int(dims[0] * 2))) self.mvit.append(MobileViTBlock(dims[1], L[1], channels[7], kernel_size, patch_size, int(dims[1] * 4))) self.mvit.append(MobileViTBlock(dims[2], L[2], channels[9], kernel_size, patch_size, int(dims[2] * 4))) self.conv2 = conv_1x1_bn(channels[-2], channels[-1]) # self.conv3 = conv_1x1_bn(channels[-1],num_classes) # # self.upsample = nn.Upsample(size =image_size, mode='bilinear', align_corners=False) # 移除全局平均池化层 # self.pool = nn.AvgPool2d(ih//32, 1) # self.fc = nn.Linear(channels[-1], num_classes, bias=False) def forward(self, x): x = self.conv1(x) x = self.mv2[0](x) # print("1:", x.shape) x = self.mv2[1](x) x = self.mv2[2](x) x = self.mv2[3](x) # Repeat # print("2:", x.shape) x = self.mv2[4](x) x = self.mvit[0](x) # print("3:", x.shape) x = self.mv2[5](x) x = self.mvit[1](x) # print("4:", x.shape) x = self.mv2[6](x) x = self.mvit[2](x) x = self.conv2(x) # print("5:", x.shape) # 注释掉全局平均池化层 # x = self.pool(x).view(-1, x.shape[1]) # x = self.fc(x) # x = self.conv3(x) # x = self.upsample(x) return x
def mobilevit_xxs():
dims = [64, 80, 96]
channels = [16, 16, 24, 24, 48, 48, 64, 64, 80, 80, 320]
return MobileViT((256, 256), dims, channels, num_classes=1000, expansion=2)
def mobilevit_xs():
dims = [96, 120, 144]
channels = [16, 32, 48, 48, 64, 64, 80, 80, 96, 96, 384]
return MobileViT((256, 256), dims, channels, num_classes=1000)
def mobilevit_s():
dims = [144, 192, 240]
channels = [16, 32, 64, 64, 96, 96, 128, 128, 160, 160, 640]
return MobileViT((256, 256), dims, channels, num_classes=1000)
def count_parameters(model):
return sum(p.numel() for p in model.parameters() if p.requires_grad)
if __name__ == '__main__':
img = torch.randn(1, 3, 256, 256)
model = MobileViT(image_size=(512,512), dims = [96, 120, 144],
channels = [16, 32, 48, 48, 64, 64, 80, 80, 96, 96, 384],num_classes=3)
out= model(img)
print(out.shape)
text# vit = mobilevit_xxs() # out = vit(img) # print(out.shape) # print(count_parameters(vit)) # # vit = mobilevit_xs() # out = vit(img) # print(out.shape) # print(count_parameters(vit)) # # vit = mobilevit_s() # out = vit(img) # print(out.shape) # print(count_parameters(vit))
这样可行嘛
从您给出的 DeepLab + WA_ASPP 代码和 MobileViT 代码来看,您希望将 WA_ASPP 中的“窗口注意力模块”(WindowTransformerBlock
)替换为 MobileViT 式的注意力机制。整体来说,这种替换完全可行,只要保证以下几点即可使其顺利运行:
输入输出维度匹配
WindowTransformerBlock 接收形如 (B, C, H, W) 的特征图,并且输出维度也保持 (B, C, H, W);替换进来的 MobileViT 注意力模块同样要保证输入输出形状一致,才能直接接到后续的 1×1 投影卷积上。
内部注意力形式
WindowTransformerBlock 将通道 C 视为 embedding dim,而将窗口内的 ws×ws 个位置当作序列长度,每个窗口大小由 window_size 控制;MobileViT 的注意力则先做局部卷积(conv1, conv2),再 reshape 成 patch 序列的形式来做 Transformer,全局注意力后再 reshape 回去。
去掉或保留某些中间操作
MobileViT 原始 Block 中与下采样、分类头相关的部分在这里应当去掉,只保留“局部卷积 + Transformer + 融合”这一段。
下述思路是将 WA_ASPP 中的 branch2_attn / branch3_attn / branch4_attn(原先用 WindowTransformerBlock)替换为 MobileViT 块做“类似注意力”的处理:
python# 假设我们定义一个 MobileViTAttentionBlock,用于将 (B, C, H, W) -> (B, C, H, W) # 内部类似于 MobileViT 的一段 Transformer 处理 class MobileViTAttentionBlock(nn.Module): """ 将 MobileViT 的核心注意力流程封装为 (B, C, H, W) => (B, C, H, W), 不做下采样和最终分类,只保留局部卷积 + Transformer + 融合。 需确保这里面 dimension/channel 能匹配 320 => 320 """ def __init__(self, in_channels, patch_size=(2,2), depth=2, dim=192, mlp_dim=384): super().__init__() # 例: local conv (3x3), 1x1 conv, Transformer(...), 1x1 conv, local conv # 具体可从 MobileViTBlock 拷过来根据 in_channels、dim 适配 # 还需改成 in_channels=320 => conv => dim => Transformer => conv => out=>320 self.patch_size = patch_size self.conv_local1 = nn.Conv2d(in_channels, in_channels, kernel_size=3, padding=1, bias=False) # ... 这里写MobileViTBlock核心操作 # self.transformer = Transformer(...) # ... def forward(self, x): # x: (B, 320, H, W) # 1) local conv # x = self.conv_local1(x) # ... # 2) flatten => transformer => reshape # 3) 1x1 conv + concat # ... return x # (B, 320, H, W) 最终返回
然后在 WA_ASPP 里:
pythonclass WA_ASPP(nn.Module): def __init__(self, dim_in=320, dim_out=256, ...): super().__init__() # branch2/3/4 原先是 WindowTransformerBlock => proj # 改为 MobileViTAttentionBlock => proj self.branch2_attn = MobileViTAttentionBlock(in_channels=dim_in, ...) self.branch3_attn = MobileViTAttentionBlock(in_channels=dim_in, ...) self.branch4_attn = MobileViTAttentionBlock(in_channels=dim_in, ...) # 其余保持不变 # 1x1 conv => dim_out # ... def forward(self, x): # ... out2 = self.branch2_attn(x) out2 = self.branch2_proj(out2) # ... # 其余逻辑跟原WA_ASPP相同 return out
这样就用 MobileViT 的注意力模块替代原先的 WindowTransformerBlock。需要注意:封装后的模块必须保持 (B, C, H, W) 的输入输出形状,且特征图尺寸要能被 patch 大小整除。
简单结论:这种替换可行,核心工作就是把 WindowTransformerBlock 替换为一个“MobileViT-like Block”。
给出完整修改代码,并加上中文注释
下面给出一个示例,展示如何将 WA_ASPP 里原先使用的 WindowTransformerBlock
(窗口注意力)替换为基于 MobileViT 核心思想的“局部卷积 + Transformer”注意力模块。该示例以 MobileViTBlock 为基础做了一些简化和适配,使其能以 (B, 320, H, W) => (B, 320, H, W) 的形式嵌入到原 WA_ASPP 结构中。
请根据自己项目的实际需求,对通道数、patch_size、heads、MLP 维度等进行进一步调参和测试。
pythonimport torch import torch.nn as nn import torch.nn.functional as F from einops import rearrange ############################################# # 自定义的 MobileViT 注意力块 # (B, C, H, W) => (B, C, H, W) # 仅保留局部卷积 + Transformer + 再卷积,不做下采样和FC ############################################# class MobileViTAttentionBlock(nn.Module): """ 简化版的 MobileViT Block,用于取代原先 WA_ASPP 中的窗口注意力。 - in_channels: 输入/输出通道均设为 in_channels (如 320)。 - patch_size: 用于分块的大小 (ph, pw)。 - transformer_depth: Transformer层数 - transformer_dim: Transformer里使用的embedding维度(会将通道映射到这个dim) - transformer_mlp_dim: FFN的维度 - heads: 多头注意力头数 - dropout: Dropout概率 该模块需保证输入输出维度一致 (C保持不变, H/W也不改变)。 """ def __init__(self, in_channels=320, patch_size=(2,2), transformer_depth=2, transformer_dim=128, transformer_mlp_dim=256, heads=4, dropout=0.0): super().__init__() self.ph, self.pw = patch_size # step1: 局部卷积 (3x3) # 先保持通道数不变,做一次local conv self.local_conv1 = nn.Conv2d(in_channels, in_channels, kernel_size=3, padding=1, bias=False) self.local_bn1 = nn.BatchNorm2d(in_channels) self.local_act1 = nn.SiLU() # step2: 通道映射: (B, C, H, W) -> (B, D, H, W) # 其中 D = transformer_dim # 然后会进行 Transformer 的全局处理 self.conv_proj_in = nn.Conv2d(in_channels, transformer_dim, kernel_size=1, bias=False) self.bn_proj_in = nn.BatchNorm2d(transformer_dim) self.act_proj_in = nn.SiLU() # 构建Transformer: 需要embedding dim=transformer_dim # heads, mlp_dim, depth, dropout self.transformer = TransformerBlockGroup( dim=transformer_dim, depth=transformer_depth, heads=heads, mlp_dim=transformer_mlp_dim, dropout=dropout ) # step3: 将Transformer输出映射回原通道in_channels self.conv_proj_out = nn.Conv2d(transformer_dim, in_channels, kernel_size=1, bias=False) self.bn_proj_out = nn.BatchNorm2d(in_channels) self.act_proj_out = nn.SiLU() # step4: 第二次局部卷积融合 (3x3) self.local_conv2 = nn.Conv2d(in_channels*2, in_channels, kernel_size=3, padding=1, bias=False) self.local_bn2 = nn.BatchNorm2d(in_channels) self.local_act2 = nn.SiLU() def forward(self, x): """ x shape: (B, in_channels, H, W) return: (B, in_channels, H, W) """ # 备份原输入,用于后面concat y = x.clone() # 1) 局部卷积 x = self.local_conv1(x) x = self.local_bn1(x) x = self.local_act1(x) # 2) 映射到transformer_dim x = self.conv_proj_in(x) x = self.bn_proj_in(x) x = self.act_proj_in(x) # 3) 将 (B, D, H, W) => 先切分成 patch => Transformer => 拼回 B, D, H, W = x.shape # rearrange => (B, D, H, W) -> (B, H//ph * W//pw, ph*pw, D) # ph=patch_size[0], pw=patch_size[1] assert H % self.ph==0 and W % self.pw==0, "H,W必须能被patch_size整除" x = rearrange( x, 'b d (hh ph) (ww pw) -> b (hh ww) (ph pw) d', ph=self.ph, pw=self.pw ) # Transformer处理 x = self.transformer(x) # => (B, n_patch, ph*pw, D) # rearrange回 (B, D, H, W) x = rearrange( x, 'b (hh ww) (ph pw) d -> b d (hh ph) (ww pw)', hh=H//self.ph, ww=W//self.pw, ph=self.ph, pw=self.pw ) # 4) 映射回原通道 in_channels x = self.conv_proj_out(x) x = self.bn_proj_out(x) x = self.act_proj_out(x) # 5) 与原输入y concat,然后再做一次3x3局部卷积融合 x = torch.cat([x, y], dim=1) # (B, in_channels*2, H, W) x = self.local_conv2(x) x = self.local_bn2(x) x = self.local_act2(x) return x ############################################# # 定义一个简易的 TransformerBlockGroup # 用来模拟 MobileViT 的多层 Transformer ############################################# class TransformerBlockGroup(nn.Module): def __init__(self, dim, depth, heads, mlp_dim, dropout=0.0): """ - dim: embedding维度 (transformer_dim) - depth: 堆叠的Transformer层数 - heads: MultiHeadAttention的头数 - mlp_dim: FFN隐层 - dropout: dropout概率 """ super().__init__() self.layers = nn.ModuleList([ TransformerBlockLayer(dim, heads, mlp_dim, 
dropout) for _ in range(depth) ]) def forward(self, x): """ x: (B, n_patch, patch_area, dim) """ for layer in self.layers: x = layer(x) return x class TransformerBlockLayer(nn.Module): """ 单层: PreNorm(Attention) + PreNorm(FeedForward) """ def __init__(self, dim, heads, mlp_dim, dropout=0.0): super().__init__() self.norm1 = nn.LayerNorm(dim) self.attn = MultiHeadAttention(dim, heads, dropout=dropout) self.norm2 = nn.LayerNorm(dim) self.ffn = FeedForward(dim, mlp_dim, dropout=dropout) def forward(self, x): # x shape: (B, n_patch, patch_area, dim) x_ln = self.norm1(x) x_attn = self.attn(x_ln) + x x_ln2 = self.norm2(x_attn) x_out = self.ffn(x_ln2) + x_attn return x_out ############################################# # 多头注意力 (batch of patches) ############################################# class MultiHeadAttention(nn.Module): def __init__(self, dim, heads=4, dropout=0.0): super().__init__() self.heads = heads self.dim_head = dim // heads inner_dim = self.dim_head * heads self.scale = self.dim_head ** -0.5 self.to_qkv = nn.Linear(dim, inner_dim*3, bias=False) self.attend = nn.Softmax(dim=-1) self.proj = nn.Sequential( nn.Linear(inner_dim, dim), nn.Dropout(dropout) ) def forward(self, x): """ x: (B, n_patch, patch_area, dim) """ B, n_patch, patch_area, d = x.shape qkv = self.to_qkv(x) # => (B, n_patch, patch_area, 3*inner_dim) q, k, v = torch.chunk(qkv, 3, dim=-1) # each: (B, n_patch, patch_area, inner_dim) # 拆分多头 => (B, n_patch, patch_area, heads, dim_head) q = q.reshape(B, n_patch, patch_area, self.heads, self.dim_head) k = k.reshape(B, n_patch, patch_area, self.heads, self.dim_head) v = v.reshape(B, n_patch, patch_area, self.heads, self.dim_head) # 转置到 (B, n_patch, heads, patch_area, dim_head) for matmul q = q.permute(0,1,3,2,4) k = k.permute(0,1,3,2,4) v = v.permute(0,1,3,2,4) # scaled dot-product attention dots = torch.matmul(q, k.transpose(-1, -2)) * self.scale # => (B, n_patch, heads, patch_area, patch_area) attn = self.attend(dots) out = torch.matmul(attn, v) # => (B, n_patch, heads, patch_area, dim_head) # 拼回 => (B, n_patch, patch_area, heads*dim_head) out = out.permute(0,1,3,2,4).contiguous() out = out.view(B, n_patch, patch_area, self.heads*self.dim_head) out = self.proj(out) return out ############################################# # FeedForward ############################################# class FeedForward(nn.Module): def __init__(self, dim, hidden_dim, dropout=0.0): super().__init__() self.net = nn.Sequential( nn.Linear(dim, hidden_dim), nn.SiLU(), nn.Dropout(dropout), nn.Linear(hidden_dim, dim), nn.Dropout(dropout), ) def forward(self, x): # x: (B, n_patch, patch_area, dim) return self.net(x)
说明:
MobileViTAttentionBlock 相当于一个小号的 MobileViT Block:先做局部卷积,再用 1×1 卷积把通道映射到 transformer_dim;切分 patch 后做全局注意力;最后映射回原通道,与原输入 y 拼接,并再做一次 3×3 卷积融合。
transformer_dim、heads、transformer_depth、transformer_mlp_dim 等超参数也要结合实际情况(如 320 通道)进行调参,如果值过小或过大,会影响性能和速度。
您原先的 WA_ASPP 用 WindowTransformerBlock 做 branch2_attn / branch3_attn / branch4_attn,可以改成:
pythonclass WA_ASPP(nn.Module): """ 五条分支: 1) branch1: 1×1 conv => (320->256) 2) branch2/3/4: MobileViTAttentionBlock(320->320) + 1×1 conv(320->256) => 层次叠加 3) branch5: 全局平均池化(320->256) 最终拼接 => 1×1 conv => 256 """ def __init__(self, dim_in, dim_out, # 这里保持和原版一致 window_size=(4,8,16), # 改成 mobilevit 的相关超参数 mvit_patch_size=(2,2), mvit_dim=128, mvit_depth=2, mvit_mlp_dim=256, mvit_heads=4, mvit_dropout=0.0): super().__init__() # branch1 self.branch1 = nn.Sequential( nn.Conv2d(dim_in, dim_out, kernel_size=1), nn.BatchNorm2d(dim_out), nn.ReLU(inplace=True) ) # branch2/3/4 用 MobileViTAttentionBlock 而非 WindowTransformerBlock # 实际上可设置不同超参 self.branch2_attn = MobileViTAttentionBlock( in_channels=dim_in, patch_size=mvit_patch_size, transformer_depth=mvit_depth, transformer_dim=mvit_dim, transformer_mlp_dim=mvit_mlp_dim, heads=mvit_heads, dropout=mvit_dropout ) self.branch3_attn = MobileViTAttentionBlock( in_channels=dim_in, patch_size=mvit_patch_size, transformer_depth=mvit_depth, transformer_dim=mvit_dim, transformer_mlp_dim=mvit_mlp_dim, heads=mvit_heads, dropout=mvit_dropout ) self.branch4_attn = MobileViTAttentionBlock( in_channels=dim_in, patch_size=mvit_patch_size, transformer_depth=mvit_depth, transformer_dim=mvit_dim, transformer_mlp_dim=mvit_mlp_dim, heads=mvit_heads, dropout=mvit_dropout ) self.branch2_proj = nn.Conv2d(dim_in, dim_out, kernel_size=1, bias=False) self.branch3_proj = nn.Conv2d(dim_in, dim_out, kernel_size=1, bias=False) self.branch4_proj = nn.Conv2d(dim_in, dim_out, kernel_size=1, bias=False) # branch5: 全局平均池化 => 卷积(320->256) self.branch5_conv = nn.Conv2d(dim_in, dim_out, kernel_size=1) self.branch5_bn = nn.BatchNorm2d(dim_out) self.branch5_relu = nn.ReLU(inplace=True) # fuse: 1×1 conv => dim_out self.fuse_conv = nn.Sequential( nn.Conv2d(dim_out * 5, dim_out, kernel_size=1), nn.BatchNorm2d(dim_out), nn.ReLU(inplace=True) ) def forward(self, x): # x => (B, 320, H, W) B, C, H, W = x.size() out1 = self.branch1(x) # => (B, 256, H, W) # branch2 => MobileViT => proj => add out2 = self.branch2_attn(x) out2 = self.branch2_proj(out2) out2_agg = out1 + out2 # branch3 => MobileViT => proj => add out3 = self.branch3_attn(x) out3 = self.branch3_proj(out3) out3_agg = out1 + out2_agg + out3 # branch4 => MobileViT => proj => add out4 = self.branch4_attn(x) out4 = self.branch4_proj(out4) out4_agg = out1 + out2_agg + out3_agg + out4 # branch5 => 全局平均池化 => (B,256,1,1) => 上采样 =>(B,256,H,W) gp = F.adaptive_avg_pool2d(x, 1) gp = self.branch5_conv(gp) gp = self.branch5_bn(gp) gp = self.branch5_relu(gp) gp = F.interpolate(gp, size=(H, W), mode='bilinear', align_corners=False) # 拼接 => fuse out_cat = torch.cat([out1, out2_agg, out3_agg, out4_agg, gp], dim=1) out = self.fuse_conv(out_cat) return out
然后在 DeepLab 里,将 WA_ASPP
初始化替换为这个新的版本即可:
pythonclass DeepLab(nn.Module): def __init__(self, num_classes, backbone="mobilenet", pretrained=True, downsample_factor=16): super(DeepLab, self).__init__() ... if backbone == "mobilenet": self.backbone = MobileNetV2(downsample_factor=downsample_factor, pretrained=pretrained) in_channels = 320 low_level_channels = 24 ... # 注意这里换成我们定义的新 WA_ASPP_MobileViT self.aspp = WA_ASPP( dim_in = in_channels, dim_out = 256, # 下面是MobileViT相关的超参数可自定义 mvit_patch_size=(2,2), mvit_dim=128, mvit_depth=2, mvit_mlp_dim=256, mvit_heads=4, mvit_dropout=0.0 ) ...
这样就完成了在 WA_ASPP 中用 MobileViT 式注意力替换原先窗口注意力的改动。
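替换完成后,建议先做一次最小的形状自检,确认各分支维度衔接正确(仅为示意,假设上文的 WA_ASPP 与 MobileViTAttentionBlock 均已定义、einops 已安装):
import torch

aspp = WA_ASPP(dim_in=320, dim_out=256,
               mvit_patch_size=(2, 2), mvit_dim=128, mvit_depth=2,
               mvit_mlp_dim=256, mvit_heads=4, mvit_dropout=0.0)
aspp.eval()                          # 避免 batch=1 时 BN 的统计问题

x = torch.randn(1, 320, 32, 32)      # 32 可被 patch_size=2 整除
with torch.no_grad():
    out = aspp(x)
print(out.shape)                     # 预期: torch.Size([1, 256, 32, 32])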
Patch 切分大小与输入分辨率
若特征图的 H、W 不能被 (ph, pw) 整除,需要修改或自适应处理,否则会断言失败。
通道大小与Transformer维度
若 in_channels 非 320,或您需要另一个通道数,需要更改 “conv_proj_in => conv_proj_out” 以及 transformer_dim 的映射关系;也可以令 transformer_dim = in_channels 直接在相同维度上做注意力,但那就跟 MobileViT 的本意略有差异。
性能与效果
patch_size、transformer_dim、heads、depth 都会影响速度与精度,可根据实验再做调整。
总体来说,只需实现一个类似 MobileViTAttentionBlock 的子模块,模拟 MobileViT 的局部卷积 + Transformer 逻辑,然后在 WA_ASPP 中替代分支即可。
整合在一个代码中
下面给出一个单文件示例,其中将 WA_ASPP 中原先的窗口注意力替换为 MobileViT 风格的“局部卷积 + Transformer”注意力模块。
整个示例包括以下部分:简化的 MobileNetV2 骨干、MobileViTAttentionBlock(MobileViT 式注意力)、WA_ASPP_MViT、DeepLab_MViT 以及简单的测试入口。
请注意:
transformer_dim、patch_size、depth、heads、mlp_dim 等为示例超参数,需在实际项目中根据数据、算力与精度需求进行调整。
若特征图的 H, W 不能被 patch_size 整除,需要做自适应处理或修改 patch 大小。
若依赖自定义的 mobilenetv2、xception
等文件,请保证导入路径正确。pythonimport torch import torch.nn as nn import torch.nn.functional as F import math import os ############################################ # 如果有自己的 xception / mobilenetv2 文件 # 请在此处做正确的导入 ############################################ # from nets.xception import xception # from nets.mobilenetv2 import mobilenetv2 ############################################ # (可选) 简易的 MobileNetV2 Backbone 示例 ############################################ class MobileNetV2(nn.Module): """ 简化版MobileNetV2,仅示意: features[:4] -> 浅层, features[4:] -> 深层 若有自己更完整的实现,可直接替换. """ def __init__(self, downsample_factor=8, pretrained=False): super(MobileNetV2, self).__init__() # 这里随便用一堆卷积替代; 实际请使用真实MobileNetV2 self.features = nn.Sequential( nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU6(inplace=True), nn.Conv2d(32, 32, 3, stride=1, padding=1), nn.ReLU6(inplace=True), nn.Conv2d(32, 320, 3, stride=2, padding=1), nn.ReLU6(inplace=True) ) # downsample_factor=8 或16时,进行空洞卷积等处理,在此略 def forward(self, x): # 前4层 => 这里简单地[:2]当浅层, [2:]当深层 low_level_features = self.features[:2](x) # shape (B, 32, H/2, W/2) deep_features = self.features[2:](low_level_features) # shape (B,320, ...) return low_level_features, deep_features ############################################ # 工具函数: 统计模型可学习参数量 ############################################ def count_parameters(model): return sum(p.numel() for p in model.parameters() if p.requires_grad) ############################################ # 一、MobileViT注意力模块 (简化) ############################################ from einops import rearrange class MobileViTAttentionBlock(nn.Module): """ (B, in_channels, H, W) -> (B, in_channels, H, W) 参考MobileViT思想: 1) 先局部卷积 (3x3), 不改变通道 2) 1x1卷积将通道投影到 transformer_dim 3) 切分patch, 送Transformer, 再映射回 in_channels 4) 与原特征concat => 3x3融合 """ def __init__(self, in_channels=320, patch_size=(2,2), transformer_depth=2, transformer_dim=128, transformer_mlp_dim=256, heads=4, dropout=0.0): super().__init__() self.ph, self.pw = patch_size # step1: Local conv => 不改变通道 self.local_conv1 = nn.Conv2d(in_channels, in_channels, kernel_size=3, padding=1, bias=False) self.local_bn1 = nn.BatchNorm2d(in_channels) self.local_act1 = nn.SiLU() # step2: conv投影到 transformer_dim self.conv_proj_in = nn.Conv2d(in_channels, transformer_dim, kernel_size=1, bias=False) self.bn_proj_in = nn.BatchNorm2d(transformer_dim) self.act_proj_in = nn.SiLU() # 多层Transformer self.transformer = TransformerBlockGroup( dim=transformer_dim, depth=transformer_depth, heads=heads, mlp_dim=transformer_mlp_dim, dropout=dropout ) # 从 transformer_dim -> in_channels self.conv_proj_out = nn.Conv2d(transformer_dim, in_channels, kernel_size=1, bias=False) self.bn_proj_out = nn.BatchNorm2d(in_channels) self.act_proj_out = nn.SiLU() # 最后再 concat 原输入 => 3x3融合 self.local_conv2 = nn.Conv2d(in_channels*2, in_channels, kernel_size=3, padding=1, bias=False) self.local_bn2 = nn.BatchNorm2d(in_channels) self.local_act2 = nn.SiLU() def forward(self, x): y = x.clone() # (1) 局部卷积 x = self.local_conv1(x) x = self.local_bn1(x) x = self.local_act1(x) # (2) 映射到transformer_dim x = self.conv_proj_in(x) x = self.bn_proj_in(x) x = self.act_proj_in(x) B, D, H, W = x.shape # patch切分 assert H % self.ph==0 and W % self.pw==0, "H,W必须能被patch_size整除" x = rearrange(x, 'b d (hh ph) (ww pw) -> b (hh ww) (ph pw) d', ph=self.ph, pw=self.pw) # (3) Transformer x = self.transformer(x) # 拼回 (B, D, H, W) x = rearrange(x, 'b (hh ww) (ph pw) d -> b d (hh ph) (ww pw)', hh=H//self.ph, ww=W//self.pw, ph=self.ph, pw=self.pw) # (4) 映射回 in_channels x = self.conv_proj_out(x) 
x = self.bn_proj_out(x) x = self.act_proj_out(x) # concat 原输入 => 3x3融合 x = torch.cat([x, y], dim=1) x = self.local_conv2(x) x = self.local_bn2(x) x = self.local_act2(x) return x class TransformerBlockGroup(nn.Module): """ 将多个 TransformerBlockLayer 堆叠 """ def __init__(self, dim, depth, heads, mlp_dim, dropout=0.0): super().__init__() self.layers = nn.ModuleList([ TransformerBlockLayer(dim, heads, mlp_dim, dropout) for _ in range(depth) ]) def forward(self, x): for layer in self.layers: x = layer(x) return x class TransformerBlockLayer(nn.Module): """ 包含一次多头注意力 + FeedForward,每个分支前都做 LayerNorm """ def __init__(self, dim, heads, mlp_dim, dropout=0.0): super().__init__() self.norm1 = nn.LayerNorm(dim) self.attn = MultiHeadAttention(dim, heads, dropout) self.norm2 = nn.LayerNorm(dim) self.ffn = FeedForward(dim, mlp_dim, dropout) def forward(self, x): # x: (B, n_patch, patch_area, dim) shortcut = x x_ln = self.norm1(x) x_attn = self.attn(x_ln) x = x_attn + shortcut shortcut2 = x x_ln2 = self.norm2(x) x_ffn = self.ffn(x_ln2) x = x_ffn + shortcut2 return x class MultiHeadAttention(nn.Module): """ 多头注意力 (B, n_patch, patch_area, dim) """ def __init__(self, dim, heads=4, dropout=0.0): super().__init__() self.heads = heads self.dim_head = dim // heads inner_dim = self.dim_head * heads self.scale = self.dim_head ** -0.5 self.to_qkv = nn.Linear(dim, inner_dim*3, bias=False) self.attend = nn.Softmax(dim=-1) self.proj = nn.Sequential( nn.Linear(inner_dim, dim), nn.Dropout(dropout) ) def forward(self, x): B, n_patch, patch_area, dim = x.shape qkv = self.to_qkv(x) # => (B, n_patch, patch_area, 3*inner_dim) q, k, v = torch.chunk(qkv, 3, dim=-1) # each: (B, n_patch, patch_area, inner_dim) # reshape为多头 q = q.reshape(B, n_patch, patch_area, self.heads, self.dim_head) k = k.reshape(B, n_patch, patch_area, self.heads, self.dim_head) v = v.reshape(B, n_patch, patch_area, self.heads, self.dim_head) # 变换维度以便做MatMul q = q.permute(0,1,3,2,4) # (B, n_patch, heads, patch_area, dim_head) k = k.permute(0,1,3,2,4) v = v.permute(0,1,3,2,4) # dot-product dots = torch.matmul(q, k.transpose(-1, -2)) * self.scale attn = self.attend(dots) # (B, n_patch, heads, patch_area, patch_area) out = torch.matmul(attn, v) # (B, n_patch, heads, patch_area, dim_head) # reshape回 out = out.permute(0,1,3,2,4).contiguous() out = out.view(B, n_patch, patch_area, self.heads*self.dim_head) out = self.proj(out) return out class FeedForward(nn.Module): def __init__(self, dim, hidden_dim, dropout=0.0): super().__init__() self.net = nn.Sequential( nn.Linear(dim, hidden_dim), nn.SiLU(), nn.Dropout(dropout), nn.Linear(hidden_dim, dim), nn.Dropout(dropout) ) def forward(self, x): return self.net(x) ############################################ # 二、WA_ASPP_MViT: 将 branch2/3/4 替换为MobileViTAttentionBlock ############################################ class WA_ASPP_MViT(nn.Module): """ 五条分支: 1) branch1: 1×1 conv => (320->256) 2) branch2/3/4: MobileViTAttentionBlock(320->320) + 1×1 conv(320->256) => 层次叠加 3) branch5: 全局平均池化(320->256) 最终拼接 => 1×1 conv => 256 """ def __init__(self, dim_in=320, dim_out=256, # MobileViT相关超参数 mvit_patch_size=(2,2), mvit_depth=2, mvit_dim=128, mvit_mlp_dim=256, mvit_heads=4, mvit_dropout=0.0): super().__init__() # branch1: 1×1 => 256 self.branch1 = nn.Sequential( nn.Conv2d(dim_in, dim_out, kernel_size=1), nn.BatchNorm2d(dim_out), nn.ReLU(inplace=True) ) # branch2/3/4: MobileViT注意力 => 1×1 => 256 self.branch2_attn = MobileViTAttentionBlock( in_channels=dim_in, patch_size=mvit_patch_size, transformer_depth=mvit_depth, 
transformer_dim=mvit_dim, transformer_mlp_dim=mvit_mlp_dim, heads=mvit_heads, dropout=mvit_dropout ) self.branch3_attn = MobileViTAttentionBlock( in_channels=dim_in, patch_size=mvit_patch_size, transformer_depth=mvit_depth, transformer_dim=mvit_dim, transformer_mlp_dim=mvit_mlp_dim, heads=mvit_heads, dropout=mvit_dropout ) self.branch4_attn = MobileViTAttentionBlock( in_channels=dim_in, patch_size=mvit_patch_size, transformer_depth=mvit_depth, transformer_dim=mvit_dim, transformer_mlp_dim=mvit_mlp_dim, heads=mvit_heads, dropout=mvit_dropout ) self.branch2_proj = nn.Conv2d(dim_in, dim_out, kernel_size=1, bias=False) self.branch3_proj = nn.Conv2d(dim_in, dim_out, kernel_size=1, bias=False) self.branch4_proj = nn.Conv2d(dim_in, dim_out, kernel_size=1, bias=False) # branch5: 全局池化 => (B, 256, 1,1) => 上采样 => (B,256,H,W) self.branch5_conv = nn.Conv2d(dim_in, dim_out, kernel_size=1) self.branch5_bn = nn.BatchNorm2d(dim_out) self.branch5_relu = nn.ReLU(inplace=True) # 拼接 => 1×1 => 256 self.fuse_conv = nn.Sequential( nn.Conv2d(dim_out*5, dim_out, kernel_size=1), nn.BatchNorm2d(dim_out), nn.ReLU(inplace=True) ) def forward(self, x): B, C, H, W = x.size() # branch1 => out1 out1 = self.branch1(x) # (B,256,H,W) # branch2 => MobileViT => 1×1 => out2 out2 = self.branch2_attn(x) out2 = self.branch2_proj(out2) out2_agg = out1 + out2 # branch3 out3 = self.branch3_attn(x) out3 = self.branch3_proj(out3) out3_agg = out1 + out2_agg + out3 # branch4 out4 = self.branch4_attn(x) out4 = self.branch4_proj(out4) out4_agg = out1 + out2_agg + out3_agg + out4 # branch5 => 全局平均池化 => conv => bn => relu => 上采样 gp = F.adaptive_avg_pool2d(x, 1) # (B, C, 1, 1) gp = self.branch5_conv(gp) # => (B, 256, 1, 1) gp = self.branch5_bn(gp) gp = self.branch5_relu(gp) gp = F.interpolate(gp, size=(H, W), mode='bilinear', align_corners=False) # 拼接 => fuse => (B,256,H,W) out_cat = torch.cat([out1, out2_agg, out3_agg, out4_agg, gp], dim=1) out = self.fuse_conv(out_cat) return out ############################################ # 三、DeepLab + 我们的新WA_ASPP_MViT ############################################ class DeepLab_MViT(nn.Module): """ 使用 MobileNetV2 作为 backbone, 并在 ASPP 阶段使用我们自定义的 WA_ASPP_MViT (MobileViT注意力) """ def __init__(self, num_classes=3, pretrained=False, downsample_factor=16): super().__init__() self.backbone = MobileNetV2(downsample_factor=downsample_factor, pretrained=pretrained) # mobilenetv2 输出 deep_features=320, 浅层=32(示例) in_channels = 320 low_level_channels = 32 # 使用新的 WA_ASPP_MViT self.aspp = WA_ASPP_MViT( dim_in = in_channels, dim_out = 256, mvit_patch_size=(2,2), mvit_depth=2, mvit_dim=128, mvit_mlp_dim=256, mvit_heads=4, mvit_dropout=0.0 ) # 浅层特征 => 1×1 => 48通道 self.shortcut_conv = nn.Sequential( nn.Conv2d(low_level_channels, 48, 1), nn.BatchNorm2d(48), nn.ReLU(inplace=True) ) # 拼接 => 卷积 => 最终分类 self.cat_conv = nn.Sequential( nn.Conv2d(48+256, 256, 3, padding=1), nn.BatchNorm2d(256), nn.ReLU(inplace=True), nn.Dropout(0.5), nn.Conv2d(256, 256, 3, padding=1), nn.BatchNorm2d(256), nn.ReLU(inplace=True), nn.Dropout(0.1), ) self.cls_conv = nn.Conv2d(256, num_classes, 1) def forward(self, x): H, W = x.size(2), x.size(3) # backbone => low_level(32通道), deep(320通道) low_level, deep = self.backbone(x) # ASPP(MViT注意力) deep = self.aspp(deep) # 浅层 => 1×1 => 48 low_level = self.shortcut_conv(low_level) # 上采样 deep => 拼接 => cat_conv => cls_conv => resize deep_up = F.interpolate(deep, size=(low_level.size(2), low_level.size(3)), mode='bilinear', align_corners=False) x_cat = torch.cat([deep_up, low_level], dim=1) x_cat = self.cat_conv(x_cat) 
x_cat = self.cls_conv(x_cat) x_cat = F.interpolate(x_cat, size=(H, W), mode='bilinear', align_corners=False) return x_cat ############################################ # 四、简单测试 ############################################ if __name__ == "__main__": model = DeepLab_MViT(num_classes=3, pretrained=False, downsample_factor=16) print("模型参数总数:", count_parameters(model)) img = torch.randn(1, 3, 512, 512) pred = model(img) print("输出维度:", pred.shape) # (1,3,512,512)
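补充一点:上面的实现对 H、W 做了整除断言;若实际特征图尺寸不能被 patch_size 整除,可以在注意力前补边、结束后再裁剪。下面是一段可替换进 MobileViTAttentionBlock.forward 的思路示意(并非上文原实现):
import torch.nn.functional as F

def pad_to_multiple(x, ph, pw):
    """把特征图向右/向下补零到能被 (ph, pw) 整除,并返回原始尺寸以便裁剪。"""
    H, W = x.shape[2], x.shape[3]
    pad_h = (ph - H % ph) % ph
    pad_w = (pw - W % pw) % pw
    x = F.pad(x, (0, pad_w, 0, pad_h))   # pad 顺序: (左, 右, 上, 下)
    return x, H, W

# 用法示意:注意力前补边,注意力后裁回原尺寸
# x, H0, W0 = pad_to_multiple(x, self.ph, self.pw)
# ... patch 切分 + Transformer + 还原 ...
# x = x[:, :, :H0, :W0]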
MobileViTAttentionBlock
先用 3×3 局部卷积(local_conv1)对输入特征进行简单的空间处理;再用 1×1 卷积(conv_proj_in)将通道 in_channels 投影为 transformer_dim;随后按 patch_size=(ph,pw) 将特征分块(rearrange),在 TransformerBlockGroup 中做全局注意力;最后映射回 in_channels 并与原输入拼接,再做一次 3×3 卷积融合。
WA_ASPP_MViT
branch2_attn / branch3_attn / branch4_attn 原先是 WindowTransformerBlock,这里改为 MobileViTAttentionBlock;各分支与全局池化分支拼接后经 fuse_conv 进行融合,输出 (B,256,H,W)。
DeepLab_MViT
用简化的 MobileNetV2(提供浅层通道 32、深层通道 320)做骨干网络;深层特征送入 WA_ASPP_MViT;浅层特征经 1×1 卷积压到 48 通道,与上采样后的深层特征拼接,再经 cat_conv、cls_conv 得到最终分割输出。
实际应用
transformer_dim=128、heads=4、patch_size=(2,2) 等超参可以做进一步调参,观察速度与精度的变化;也可通过调整 transformer_dim、heads、mvit_depth 以及“是否在所有分支都使用该注意力”等策略来实现更好的平衡。
C:\Users\wztbkl\anaconda3\envs\wuzhupytorch\python.exe C:\Users\wztbkl\PycharmProjects\deeplabv3-plus-pytorch-main\nets\mobileVitASPP_deeplabV3.py
模型参数总数: 11475699
Traceback (most recent call last):
File "C:\Users\wztbkl\PycharmProjects\deeplabv3-plus-pytorch-main\nets\mobileVitASPP_deeplabV3.py", line 430, in <module>
pred = model(img)
File "C:\Users\wztbkl\anaconda3\envs\wuzhupytorch\lib\site-packages\torch\nn\modules\module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\Users\wztbkl\anaconda3\envs\wuzhupytorch\lib\site-packages\torch\nn\modules\module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\wztbkl\PycharmProjects\deeplabv3-plus-pytorch-main\nets\mobileVitASPP_deeplabV3.py", line 406, in forward
deep = self.aspp(deep)
File "C:\Users\wztbkl\anaconda3\envs\wuzhupytorch\lib\site-packages\torch\nn\modules\module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\Users\wztbkl\anaconda3\envs\wuzhupytorch\lib\site-packages\torch\nn\modules\module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\wztbkl\PycharmProjects\deeplabv3-plus-pytorch-main\nets\mobileVitASPP_deeplabV3.py", line 341, in forward
gp = self.branch5_bn(gp)
File "C:\Users\wztbkl\anaconda3\envs\wuzhupytorch\lib\site-packages\torch\nn\modules\module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\Users\wztbkl\anaconda3\envs\wuzhupytorch\lib\site-packages\torch\nn\modules\module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\wztbkl\anaconda3\envs\wuzhupytorch\lib\site-packages\torch\nn\modules\batchnorm.py", line 193, in forward
return F.batch_norm(
File "C:\Users\wztbkl\anaconda3\envs\wuzhupytorch\lib\site-packages\torch\nn\functional.py", line 2810, in batch_norm
_verify_batch_size(input.size())
File "C:\Users\wztbkl\anaconda3\envs\wuzhupytorch\lib\site-packages\torch\nn\functional.py", line 2776, in _verify_batch_size
raise ValueError(
ValueError: Expected more than 1 value per channel when training, got input size torch.Size([1, 256, 1, 1])
进程已结束,退出代码为 1
给出完整修改代码
下面给出一个整合在同一文件的示例,将前面使用 MobileViT 式注意力的 WA_ASPP 等代码放在一起,并修正在 BatchNorm 上出现的 “Expected more than 1 value per channel” 错误。该错误最常见的原因是 batch size=1 时,BN 在训练模式下无法正确统计均值和方差,尤其对 (B, C, 1, 1) 的张量执行 BN 会报错。
为简单起见,这里对 branch5 的 BN 改用 InstanceNorm2d(或将其 track_running_stats=False)来避免在全局池化输出只有 1×1 大小时发生错误。你也可以在实际项目中根据需求选择别的方式(例如将模型设为 eval 模式,或增大 batch size)。
请注意:该示例用于演示修复方式,并非唯一可行方案,也可在 batch size > 1 的环境下正常使用原 BatchNorm。
pythonimport torch import torch.nn as nn import torch.nn.functional as F import math ############################################ # 工具函数: 统计模型可学习参数量 ############################################ def count_parameters(model): return sum(p.numel() for p in model.parameters() if p.requires_grad) ############################################ # 简易的 MobileNetV2 Backbone (示例) ############################################ class MobileNetV2(nn.Module): """ 仅演示用:features[:2] 当浅层, features[2:] 当深层 真正项目应使用完整的 mobilenetv2 实现 """ def __init__(self, downsample_factor=16, pretrained=False): super().__init__() # 简化写法: 先用几个Conv2d模拟 self.features = nn.Sequential( nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU6(inplace=True), nn.Conv2d(32, 32, 3, stride=1, padding=1), nn.ReLU6(inplace=True), nn.Conv2d(32, 320, 3, stride=2, padding=1), nn.ReLU6(inplace=True) ) # 省略根据 downsample_factor 做空洞卷积处理的逻辑 def forward(self, x): # 假设前2层 => 浅层(形状 ~ (B,32,H/2,W/2)) low_level_features = self.features[:2](x) # 后面 => deep deep_features = self.features[2:](low_level_features) # (B,320, ...) return low_level_features, deep_features ############################################ # 一、MobileViT注意力模块 (简化) ############################################ from einops import rearrange class MobileViTAttentionBlock(nn.Module): """ (B, in_channels, H, W) => (B, in_channels, H, W) 1) 局部卷积(3x3) 2) conv -> transformer_dim 3) patch切分 -> 多层Transformer -> 拼回 4) conv -> in_channels 5) concat原输入 => 3x3融合 """ def __init__(self, in_channels=320, patch_size=(2,2), transformer_depth=2, transformer_dim=128, transformer_mlp_dim=256, heads=4, dropout=0.0): super().__init__() self.ph, self.pw = patch_size # (1) 局部卷积 self.local_conv1 = nn.Conv2d(in_channels, in_channels, 3, padding=1, bias=False) self.local_bn1 = nn.BatchNorm2d(in_channels) self.local_act1 = nn.SiLU() # (2) conv映射 => transformer_dim self.conv_proj_in = nn.Conv2d(in_channels, transformer_dim, kernel_size=1, bias=False) self.bn_proj_in = nn.BatchNorm2d(transformer_dim) self.act_proj_in = nn.SiLU() # Transformer self.transformer = TransformerBlockGroup( dim=transformer_dim, depth=transformer_depth, heads=heads, mlp_dim=transformer_mlp_dim, dropout=dropout ) # 再映射回 in_channels self.conv_proj_out = nn.Conv2d(transformer_dim, in_channels, kernel_size=1, bias=False) self.bn_proj_out = nn.BatchNorm2d(in_channels) self.act_proj_out = nn.SiLU() # 最后 concat 并做 3x3融合 self.local_conv2 = nn.Conv2d(in_channels*2, in_channels, 3, padding=1, bias=False) self.local_bn2 = nn.BatchNorm2d(in_channels) self.local_act2 = nn.SiLU() def forward(self, x): # 备份原输入 y = x.clone() # 1) 局部卷积 x = self.local_conv1(x) x = self.local_bn1(x) x = self.local_act1(x) # 2) conv => transformer_dim x = self.conv_proj_in(x) x = self.bn_proj_in(x) x = self.act_proj_in(x) B, D, H, W = x.shape # patch切分 assert H%self.ph==0 and W%self.pw==0, "H,W需能被patch_size整除" x = rearrange(x, 'b d (hh ph) (ww pw) -> b (hh ww) (ph pw) d', ph=self.ph, pw=self.pw) # 3) transformer x = self.transformer(x) # 拼回 x = rearrange(x, 'b (hh ww) (ph pw) d -> b d (hh ph) (ww pw)', hh=H//self.ph, ww=W//self.pw, ph=self.ph, pw=self.pw) # 映射回 in_channels x = self.conv_proj_out(x) x = self.bn_proj_out(x) x = self.act_proj_out(x) # concat + 3x3融合 x = torch.cat([x, y], dim=1) x = self.local_conv2(x) x = self.local_bn2(x) x = self.local_act2(x) return x class TransformerBlockGroup(nn.Module): def __init__(self, dim, depth, heads, mlp_dim, dropout=0.0): super().__init__() self.layers = nn.ModuleList([ TransformerBlockLayer(dim, heads, mlp_dim, dropout) for _ in range(depth) 
]) def forward(self, x): for layer in self.layers: x = layer(x) return x class TransformerBlockLayer(nn.Module): def __init__(self, dim, heads, mlp_dim, dropout=0.0): super().__init__() self.norm1 = nn.LayerNorm(dim) self.attn = MultiHeadAttention(dim, heads, dropout) self.norm2 = nn.LayerNorm(dim) self.ffn = FeedForward(dim, mlp_dim, dropout) def forward(self, x): shortcut = x x_ln = self.norm1(x) x_attn = self.attn(x_ln) x = x_attn + shortcut shortcut2 = x x_ln2 = self.norm2(x) x_ffn = self.ffn(x_ln2) x = x_ffn + shortcut2 return x class MultiHeadAttention(nn.Module): def __init__(self, dim, heads=4, dropout=0.0): super().__init__() self.heads = heads self.dim_head = dim//heads inner_dim = self.dim_head*heads self.scale = self.dim_head**-0.5 self.to_qkv = nn.Linear(dim, inner_dim*3, bias=False) self.attend = nn.Softmax(dim=-1) self.proj = nn.Sequential( nn.Linear(inner_dim, dim), nn.Dropout(dropout) ) def forward(self, x): # x: (B, n_patch, patch_area, dim) B, n_patch, patch_area, dim = x.shape qkv = self.to_qkv(x) q, k, v = torch.chunk(qkv, 3, dim=-1) q = q.reshape(B, n_patch, patch_area, self.heads, self.dim_head).permute(0,1,3,2,4) k = k.reshape(B, n_patch, patch_area, self.heads, self.dim_head).permute(0,1,3,2,4) v = v.reshape(B, n_patch, patch_area, self.heads, self.dim_head).permute(0,1,3,2,4) dots = torch.matmul(q, k.transpose(-1, -2))*self.scale attn = self.attend(dots) out = torch.matmul(attn, v) # (B,n_patch,heads,patch_area,dim_head) out = out.permute(0,1,3,2,4).contiguous() out = out.view(B, n_patch, patch_area, self.heads*self.dim_head) out = self.proj(out) return out class FeedForward(nn.Module): def __init__(self, dim, hidden_dim, dropout=0.0): super().__init__() self.net = nn.Sequential( nn.Linear(dim, hidden_dim), nn.SiLU(), nn.Dropout(dropout), nn.Linear(hidden_dim, dim), nn.Dropout(dropout) ) def forward(self, x): return self.net(x) ############################################ # 二、WA_ASPP中的MobileViT分支 ############################################ class WA_ASPP_MViT(nn.Module): """ 五条分支: 1) branch1: 1×1 conv => (dim_in -> dim_out) 2) branch2/3/4: MobileViTAttentionBlock => 1×1 conv => dim_out 3) branch5: 全局池化 => 卷积 => dim_out => (改用 InstanceNorm2d) => relu => 上采样 拼接 => 1×1 => dim_out """ def __init__(self, dim_in=320, dim_out=256, # MobileViT相关超参 patch_size=(2,2), mvit_depth=2, mvit_dim=128, mvit_mlp_dim=256, mvit_heads=4, mvit_dropout=0.0): super().__init__() # branch1 self.branch1 = nn.Sequential( nn.Conv2d(dim_in, dim_out, kernel_size=1), nn.BatchNorm2d(dim_out), nn.ReLU(inplace=True) ) # branch2/3/4 => MobileViT注意力 self.branch2_attn = MobileViTAttentionBlock( in_channels=dim_in, patch_size=patch_size, transformer_depth=mvit_depth, transformer_dim=mvit_dim, transformer_mlp_dim=mvit_mlp_dim, heads=mvit_heads, dropout=mvit_dropout ) self.branch3_attn = MobileViTAttentionBlock( in_channels=dim_in, patch_size=patch_size, transformer_depth=mvit_depth, transformer_dim=mvit_dim, transformer_mlp_dim=mvit_mlp_dim, heads=mvit_heads, dropout=mvit_dropout ) self.branch4_attn = MobileViTAttentionBlock( in_channels=dim_in, patch_size=patch_size, transformer_depth=mvit_depth, transformer_dim=mvit_dim, transformer_mlp_dim=mvit_mlp_dim, heads=mvit_heads, dropout=mvit_dropout ) self.branch2_proj = nn.Conv2d(dim_in, dim_out, kernel_size=1, bias=False) self.branch3_proj = nn.Conv2d(dim_in, dim_out, kernel_size=1, bias=False) self.branch4_proj = nn.Conv2d(dim_in, dim_out, kernel_size=1, bias=False) # branch5 => 全局平均池化 => 卷积 => InstanceNorm2d => relu => 上采样 self.branch5_conv = 
nn.Conv2d(dim_in, dim_out, kernel_size=1) # 将原 BatchNorm2d 改为 InstanceNorm2d, 避免 batch=1 时报错 self.branch5_in = nn.InstanceNorm2d(dim_out, affine=True) self.branch5_relu = nn.ReLU(inplace=True) # 拼接 => 1×1 => dim_out self.fuse_conv = nn.Sequential( nn.Conv2d(dim_out*5, dim_out, kernel_size=1), nn.BatchNorm2d(dim_out), nn.ReLU(inplace=True) ) def forward(self, x): B, C, H, W = x.shape # branch1 => out1 out1 = self.branch1(x) # => (B,256,H,W) # branch2 => MViT => proj => add out2 = self.branch2_attn(x) out2 = self.branch2_proj(out2) out2_agg = out1 + out2 # branch3 out3 = self.branch3_attn(x) out3 = self.branch3_proj(out3) out3_agg = out1 + out2_agg + out3 # branch4 out4 = self.branch4_attn(x) out4 = self.branch4_proj(out4) out4_agg = out1 + out2_agg + out3_agg + out4 # branch5 => 全局池化 => conv => IN => relu => 上采样 gp = F.adaptive_avg_pool2d(x, 1) # (B, C, 1,1) gp = self.branch5_conv(gp) # => (B,256,1,1) gp = self.branch5_in(gp) # => 使用 InstanceNorm2d gp = self.branch5_relu(gp) gp = F.interpolate(gp, size=(H, W), mode='bilinear', align_corners=False) # 拼接 => fuse => (B,256,H,W) out_cat = torch.cat([out1, out2_agg, out3_agg, out4_agg, gp], dim=1) out = self.fuse_conv(out_cat) return out ############################################ # 三、DeepLab + WA_ASPP_MViT ############################################ class DeepLab_MViT(nn.Module): """ 使用简易MobileNetV2作为backbone, ASPP换成WA_ASPP_MViT (MobileViT 注意力) """ def __init__(self, num_classes=3, downsample_factor=16): super().__init__() self.backbone = MobileNetV2(downsample_factor=downsample_factor, pretrained=False) # 假设: mobilenetv2输出 deep=320, low_level=32 in_channels = 320 low_level_channels = 32 # ASPP => WA_ASPP_MViT self.aspp = WA_ASPP_MViT( dim_in = in_channels, dim_out = 256, patch_size = (2,2), # MobileViTAttention mvit_depth = 2, mvit_dim = 128, mvit_mlp_dim = 256, mvit_heads=4, mvit_dropout=0.0 ) # low_level => 1×1 => 48 self.shortcut_conv = nn.Sequential( nn.Conv2d(low_level_channels, 48, 1), nn.BatchNorm2d(48), nn.ReLU(inplace=True) ) # 拼接 => cat_conv => cls_conv self.cat_conv = nn.Sequential( nn.Conv2d(48+256, 256, 3, padding=1), nn.BatchNorm2d(256), nn.ReLU(inplace=True), nn.Dropout(0.5), nn.Conv2d(256, 256, 3, padding=1), nn.BatchNorm2d(256), nn.ReLU(inplace=True), nn.Dropout(0.1) ) self.cls_conv = nn.Conv2d(256, num_classes, 1) def forward(self, x): H, W = x.shape[2], x.shape[3] # backbone => (low_level, deep) low_level, deep = self.backbone(x) # deep => MViT ASPP deep = self.aspp(deep) # low_level => 1×1 => 48 low_level = self.shortcut_conv(low_level) # 上采样 deep => 拼接 => cat_conv => cls_conv => 上采样 deep_up = F.interpolate(deep, size=(low_level.size(2), low_level.size(3)), mode='bilinear', align_corners=False) x_cat = torch.cat([deep_up, low_level], dim=1) x_cat = self.cat_conv(x_cat) x_cat = self.cls_conv(x_cat) x_cat = F.interpolate(x_cat, size=(H, W), mode='bilinear', align_corners=False) return x_cat ############################################ # 四、测试 ############################################ if __name__ == "__main__": model = DeepLab_MViT(num_classes=3, downsample_factor=16) print("模型参数总数:", count_parameters(model)) # 构造单张图 (batch=1, channel=3, 512x512) img = torch.randn(1, 3, 512, 512) pred = model(img) print("输出维度:", pred.shape) # 期望: (1,3,512,512)
How the BatchNorm error was fixed

- Branch 5's normalization was switched to `self.branch5_in = nn.InstanceNorm2d(dim_out, affine=True)`, so the "Expected more than 1 value per channel" error no longer appears at batch size = 1. If you still want to keep BN there, run it in inference mode (`model.eval()`) or make sure batch size > 1.
- MobileViTAttentionBlock and WA_ASPP_MViT: hyperparameters such as `patch_size=(2,2)`, `mvit_dim=128`, `mvit_mlp_dim=256` and `heads=4` are only examples; tune them to your model size and GPU memory.
- The MobileNetV2 here is a minimal stand-in for demonstration; use your project's full implementation.
- If a BN error still appears, check whether any other normalization layer receives a 1×1 (or otherwise single-element) input.

This way, a single Python file demonstrates how to replace the window attention in WA_ASPP with MobileViT-style attention and how to avoid the normalization error that can occur when a (B, 256, 1, 1) tensor is batch-normalized at batch size = 1.
C:\Users\wztbkl\anaconda3\envs\wuzhupytorch\python.exe C:\Users\wztbkl\PycharmProjects\deeplabv3-plus-pytorch-main\nets\mobileVitASPP_deeplabV3.py
模型参数总数: 11475699
Traceback (most recent call last):
File "C:\Users\wztbkl\PycharmProjects\deeplabv3-plus-pytorch-main\nets\mobileVitASPP_deeplabV3.py", line 399, in <module>
pred = model(img)
File "C:\Users\wztbkl\anaconda3\envs\wuzhupytorch\lib\site-packages\torch\nn\modules\module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\Users\wztbkl\anaconda3\envs\wuzhupytorch\lib\site-packages\torch\nn\modules\module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\wztbkl\PycharmProjects\deeplabv3-plus-pytorch-main\nets\mobileVitASPP_deeplabV3.py", line 375, in forward
deep = self.aspp(deep)
File "C:\Users\wztbkl\anaconda3\envs\wuzhupytorch\lib\site-packages\torch\nn\modules\module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\Users\wztbkl\anaconda3\envs\wuzhupytorch\lib\site-packages\torch\nn\modules\module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\wztbkl\PycharmProjects\deeplabv3-plus-pytorch-main\nets\mobileVitASPP_deeplabV3.py", line 310, in forward
gp = self.branch5_in(gp) # => 使用 InstanceNorm2d
File "C:\Users\wztbkl\anaconda3\envs\wuzhupytorch\lib\site-packages\torch\nn\modules\module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\Users\wztbkl\anaconda3\envs\wuzhupytorch\lib\site-packages\torch\nn\modules\module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\wztbkl\anaconda3\envs\wuzhupytorch\lib\site-packages\torch\nn\modules\instancenorm.py", line 124, in forward
return self._apply_instance_norm(input)
File "C:\Users\wztbkl\anaconda3\envs\wuzhupytorch\lib\site-packages\torch\nn\modules\instancenorm.py", line 47, in _apply_instance_norm
return F.instance_norm(
File "C:\Users\wztbkl\anaconda3\envs\wuzhupytorch\lib\site-packages\torch\nn\functional.py", line 2865, in instance_norm
_verify_spatial_size(input.size())
File "C:\Users\wztbkl\anaconda3\envs\wuzhupytorch\lib\site-packages\torch\nn\functional.py", line 2831, in _verify_spatial_size
raise ValueError(
ValueError: Expected more than 1 spatial element when training, got input size torch.Size([1, 256, 1, 1])
进程已结束,退出代码为 1
Even after replacing BatchNorm2d with InstanceNorm2d, the model can still fail when batch size = 1 and the output spatial size collapses to 1×1, with "Expected more than 1 spatial element when training, got input size torch.Size([1, 256, 1, 1])". The reason is that InstanceNorm in training mode also needs at least 2 spatial elements per channel to compute statistics (especially with affine=True).

Below are several common fixes; pick one or combine them as needed.

**Option 1: use GroupNorm or LayerNorm instead**

GroupNorm normalizes each sample over groups of channels and does not depend on the batch size; it also works when the input is only (1, C, 1, 1). For example:

```python
# Replace self.branch5_in = nn.InstanceNorm2d(...) with:
self.branch5_gn = nn.GroupNorm(num_groups=1, num_channels=dim_out)
```

or

```python
self.branch5_ln = nn.LayerNorm([dim_out, 1, 1], elementwise_affine=True)
```

With either layer, a (1, C, 1, 1) input is normalized without triggering the "more than 1 spatial element" error.
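A quick standalone check (not part of the model code) makes the difference concrete: feed a (1, 256, 1, 1) tensor through both layers. With the PyTorch version from the traceback above, GroupNorm goes through while InstanceNorm2d in training mode raises the error:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 256, 1, 1)           # same shape as the pooled branch-5 feature

# GroupNorm normalizes over (channels-in-group, H, W) per sample,
# so 256 values are available even when H = W = 1.
gn = nn.GroupNorm(num_groups=1, num_channels=256)
print(gn(x).shape)                       # torch.Size([1, 256, 1, 1])

# InstanceNorm2d in training mode needs more than one spatial element
# per channel and raises the ValueError seen in the traceback above.
inorm = nn.InstanceNorm2d(256, affine=True).train()
try:
    inorm(x)
except ValueError as e:
    print("InstanceNorm2d failed:", e)
```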
**Option 2: drop the normalization on this branch**

After global pooling only a 1×1 map is left; in many cases this branch just provides a global feature and does not need fine-grained normalization. You can simply remove the normalization here, or replace it with a plain bias/scale:

```python
# branch5 => conv => relu (no norm at all)
self.branch5_conv = nn.Conv2d(dim_in, dim_out, kernel_size=1)
self.branch5_relu = nn.ReLU(inplace=True)
```

Then no BN/IN runs on the 1×1 spatial map, and the error cannot be triggered.
**Option 3: keep BN/IN but run this branch in inference mode**

If you really want to keep BatchNorm/InstanceNorm and can accept that this branch always behaves as in inference (its statistics are not updated during training), you can wrap this path in `with torch.no_grad(): ...` inside `forward`, or put the corresponding module into `eval()`, for example:

```python
# Assume self.branch5_norm is still InstanceNorm2d
# Inside forward():
with torch.no_grad():
    gp = self.branch5_conv(gp)
    gp = self.branch5_norm(gp)
    gp = self.branch5_relu(gp)
```

Note that `torch.no_grad()` by itself only disables gradient tracking; it is putting the module into `eval()` that makes BN use its running statistics and skip the batch/spatial size check. The price is that this branch's statistics are never updated during training.
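One practical detail with the eval-mode variant: a later `model.train()` call flips every submodule back to training mode, so the frozen norm has to be pinned to eval each time. Below is a minimal sketch that keeps the original BatchNorm2d but overrides `train()` for just this branch; the class and attribute names are illustrative, not from the project code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Branch5WithFrozenBN(nn.Module):
    """Toy global-pooling branch whose BatchNorm is pinned to eval mode."""
    def __init__(self, dim_in=320, dim_out=256):
        super().__init__()
        self.conv = nn.Conv2d(dim_in, dim_out, kernel_size=1)
        self.bn = nn.BatchNorm2d(dim_out)
        self.relu = nn.ReLU(inplace=True)

    def train(self, mode=True):
        # Everything else follows `mode`; the BN stays in eval mode so it
        # normalizes with its running statistics instead of the single
        # (B, C, 1, 1) batch, and never updates them.
        super().train(mode)
        self.bn.eval()
        return self

    def forward(self, x):
        gp = F.adaptive_avg_pool2d(x, 1)            # (B, C, 1, 1)
        return self.relu(self.bn(self.conv(gp)))

branch = Branch5WithFrozenBN().train()              # no error even at batch size 1
print(branch(torch.randn(1, 320, 32, 32)).shape)    # torch.Size([1, 256, 1, 1])
```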
**Option 4: train with batch size ≥ 2**

The most direct, though not always practical, fix is to make every training step use a batch of at least 2, so that B ≥ 2 when the feature collapses to (B, C, 1, 1) and BN/IN have enough values to work with. This is hard to do when GPU memory is tight or when you must train on single images.
**Recommended change: GroupNorm in the global-pooling branch**

Below is the concrete replacement of InstanceNorm2d with GroupNorm (with num_groups=1 it is effectively a channel-wise LayerNorm), which avoids the single-image error:

```python
# Original:
# self.branch5_in = nn.InstanceNorm2d(dim_out, affine=True)
# Changed to:
self.branch5_gn = nn.GroupNorm(num_groups=1, num_channels=dim_out)
```

and call `self.branch5_gn` at the corresponding place in `forward`:

```python
gp = F.adaptive_avg_pool2d(x, 1)
gp = self.branch5_conv(gp)
gp = self.branch5_gn(gp)    # GroupNorm
gp = self.branch5_relu(gp)
gp = F.interpolate(gp, size=(H, W), ...)
```

Everything else stays the same.
Give the complete modified code
Below is a complete example, kept in a single file, that replaces the normalization in the global-pooling branch with GroupNorm (LayerNorm would also work). Even with batch size = 1 and a 1×1 output spatial size, it avoids the BatchNorm/InstanceNorm error.
Note that the MobileNetV2 and some other parts are heavily simplified for demonstration; adapt or replace them with the full implementations from your own project, especially the MobileNetV2 part.
pythonimport torch import torch.nn as nn import torch.nn.functional as F import math ############################################ # 工具函数: 统计模型可学习参数量 ############################################ def count_parameters(model): return sum(p.numel() for p in model.parameters() if p.requires_grad) ############################################ # 简易的 MobileNetV2 Backbone (仅示意) ############################################ class MobileNetV2(nn.Module): """ 仅演示用:features[:2] 当浅层, features[2:] 当深层 真正项目应使用完整的 mobilenetv2 实现 """ def __init__(self, downsample_factor=16, pretrained=False): super().__init__() # 用几个Conv2d来模拟 self.features = nn.Sequential( nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU6(inplace=True), nn.Conv2d(32, 32, 3, stride=1, padding=1), nn.ReLU6(inplace=True), nn.Conv2d(32, 320, 3, stride=2, padding=1), nn.ReLU6(inplace=True) ) # 省略根据 downsample_factor 做空洞卷积等 def forward(self, x): # 假设前2层 => 浅层(形状 ~ (B,32,H/2,W/2)) low_level_features = self.features[:2](x) # (B,32, ...) # 后面 => deep deep_features = self.features[2:](low_level_features) # (B,320, ...) return low_level_features, deep_features ############################################ # 一、MobileViT注意力模块 (简化) ############################################ from einops import rearrange class MobileViTAttentionBlock(nn.Module): """ (B, in_channels, H, W) => (B, in_channels, H, W) 1) 局部卷积(3x3) 2) conv -> transformer_dim 3) patch切分 -> 多层Transformer -> 拼回 4) conv -> in_channels 5) concat原输入 => 3x3融合 """ def __init__(self, in_channels=320, patch_size=(2,2), transformer_depth=2, transformer_dim=128, transformer_mlp_dim=256, heads=4, dropout=0.0): super().__init__() self.ph, self.pw = patch_size # (1) 局部卷积 self.local_conv1 = nn.Conv2d(in_channels, in_channels, 3, padding=1, bias=False) self.local_bn1 = nn.BatchNorm2d(in_channels) self.local_act1 = nn.SiLU() # (2) conv映射 => transformer_dim self.conv_proj_in = nn.Conv2d(in_channels, transformer_dim, kernel_size=1, bias=False) self.bn_proj_in = nn.BatchNorm2d(transformer_dim) self.act_proj_in = nn.SiLU() # 多层Transformer self.transformer = TransformerBlockGroup( dim=transformer_dim, depth=transformer_depth, heads=heads, mlp_dim=transformer_mlp_dim, dropout=dropout ) # 再映射回 in_channels self.conv_proj_out = nn.Conv2d(transformer_dim, in_channels, kernel_size=1, bias=False) self.bn_proj_out = nn.BatchNorm2d(in_channels) self.act_proj_out = nn.SiLU() # 最后 concat 并做 3x3融合 self.local_conv2 = nn.Conv2d(in_channels*2, in_channels, 3, padding=1, bias=False) self.local_bn2 = nn.BatchNorm2d(in_channels) self.local_act2 = nn.SiLU() def forward(self, x): # 备份原输入 y = x.clone() # 1) 局部卷积 x = self.local_conv1(x) x = self.local_bn1(x) x = self.local_act1(x) # 2) conv => transformer_dim x = self.conv_proj_in(x) x = self.bn_proj_in(x) x = self.act_proj_in(x) B, D, H, W = x.shape # patch切分 assert H%self.ph==0 and W%self.pw==0, "H,W需能被patch_size整除" x = rearrange(x, 'b d (hh ph) (ww pw) -> b (hh ww) (ph pw) d', ph=self.ph, pw=self.pw) # 3) transformer x = self.transformer(x) # 拼回 x = rearrange(x, 'b (hh ww) (ph pw) d -> b d (hh ph) (ww pw)', hh=H//self.ph, ww=W//self.pw, ph=self.ph, pw=self.pw) # 映射回 in_channels x = self.conv_proj_out(x) x = self.bn_proj_out(x) x = self.act_proj_out(x) # concat + 3x3融合 x = torch.cat([x, y], dim=1) x = self.local_conv2(x) x = self.local_bn2(x) x = self.local_act2(x) return x class TransformerBlockGroup(nn.Module): def __init__(self, dim, depth, heads, mlp_dim, dropout=0.0): super().__init__() self.layers = nn.ModuleList([ TransformerBlockLayer(dim, heads, mlp_dim, dropout) for _ in 
range(depth) ]) def forward(self, x): for layer in self.layers: x = layer(x) return x class TransformerBlockLayer(nn.Module): def __init__(self, dim, heads, mlp_dim, dropout=0.0): super().__init__() self.norm1 = nn.LayerNorm(dim) self.attn = MultiHeadAttention(dim, heads, dropout) self.norm2 = nn.LayerNorm(dim) self.ffn = FeedForward(dim, mlp_dim, dropout) def forward(self, x): shortcut = x x_ln = self.norm1(x) x_attn = self.attn(x_ln) x = x_attn + shortcut shortcut2 = x x_ln2 = self.norm2(x) x_ffn = self.ffn(x_ln2) x = x_ffn + shortcut2 return x class MultiHeadAttention(nn.Module): def __init__(self, dim, heads=4, dropout=0.0): super().__init__() self.heads = heads self.dim_head = dim//heads inner_dim = self.dim_head*heads self.scale = self.dim_head**-0.5 self.to_qkv = nn.Linear(dim, inner_dim*3, bias=False) self.attend = nn.Softmax(dim=-1) self.proj = nn.Sequential( nn.Linear(inner_dim, dim), nn.Dropout(dropout) ) def forward(self, x): # x: (B, n_patch, patch_area, dim) B, n_patch, patch_area, dim = x.shape qkv = self.to_qkv(x) q, k, v = torch.chunk(qkv, 3, dim=-1) q = q.reshape(B, n_patch, patch_area, self.heads, self.dim_head).permute(0,1,3,2,4) k = k.reshape(B, n_patch, patch_area, self.heads, self.dim_head).permute(0,1,3,2,4) v = v.reshape(B, n_patch, patch_area, self.heads, self.dim_head).permute(0,1,3,2,4) dots = torch.matmul(q, k.transpose(-1, -2))*self.scale attn = self.attend(dots) out = torch.matmul(attn, v) # (B,n_patch,heads,patch_area,dim_head) out = out.permute(0,1,3,2,4).contiguous() out = out.view(B, n_patch, patch_area, self.heads*self.dim_head) out = self.proj(out) return out class FeedForward(nn.Module): def __init__(self, dim, hidden_dim, dropout=0.0): super().__init__() self.net = nn.Sequential( nn.Linear(dim, hidden_dim), nn.SiLU(), nn.Dropout(dropout), nn.Linear(hidden_dim, dim), nn.Dropout(dropout) ) def forward(self, x): return self.net(x) ############################################ # 二、WA_ASPP中的MobileViT分支 (使用 GroupNorm) ############################################ class WA_ASPP_MViT(nn.Module): """ 五条分支: 1) branch1: 1×1 conv => (dim_in -> dim_out) 2) branch2/3/4: MobileViTAttentionBlock => 1×1 conv => dim_out 3) branch5: 全局池化 => 卷积 => GroupNorm => relu => 上采样 拼接 => 1×1 => dim_out """ def __init__(self, dim_in=320, dim_out=256, # MobileViT相关超参 patch_size=(2,2), mvit_depth=2, mvit_dim=128, mvit_mlp_dim=256, mvit_heads=4, mvit_dropout=0.0): super().__init__() # branch1 self.branch1 = nn.Sequential( nn.Conv2d(dim_in, dim_out, kernel_size=1), nn.BatchNorm2d(dim_out), nn.ReLU(inplace=True) ) # branch2/3/4 => MobileViT注意力 self.branch2_attn = MobileViTAttentionBlock( in_channels=dim_in, patch_size=patch_size, transformer_depth=mvit_depth, transformer_dim=mvit_dim, transformer_mlp_dim=mvit_mlp_dim, heads=mvit_heads, dropout=mvit_dropout ) self.branch3_attn = MobileViTAttentionBlock( in_channels=dim_in, patch_size=patch_size, transformer_depth=mvit_depth, transformer_dim=mvit_dim, transformer_mlp_dim=mvit_mlp_dim, heads=mvit_heads, dropout=mvit_dropout ) self.branch4_attn = MobileViTAttentionBlock( in_channels=dim_in, patch_size=patch_size, transformer_depth=mvit_depth, transformer_dim=mvit_dim, transformer_mlp_dim=mvit_mlp_dim, heads=mvit_heads, dropout=mvit_dropout ) self.branch2_proj = nn.Conv2d(dim_in, dim_out, kernel_size=1, bias=False) self.branch3_proj = nn.Conv2d(dim_in, dim_out, kernel_size=1, bias=False) self.branch4_proj = nn.Conv2d(dim_in, dim_out, kernel_size=1, bias=False) # branch5 => 全局池化 => conv => GroupNorm => relu => 上采样 self.branch5_conv = 
nn.Conv2d(dim_in, dim_out, kernel_size=1) # 这里用 GroupNorm(num_groups=1, ...) 避免单图时报错 self.branch5_gn = nn.GroupNorm(num_groups=1, num_channels=dim_out) self.branch5_relu = nn.ReLU(inplace=True) # 拼接 => 1×1 => dim_out self.fuse_conv = nn.Sequential( nn.Conv2d(dim_out*5, dim_out, kernel_size=1), nn.BatchNorm2d(dim_out), nn.ReLU(inplace=True) ) def forward(self, x): B, C, H, W = x.shape # branch1 => out1 out1 = self.branch1(x) # => (B,256,H,W) # branch2 => MViT => proj => add out2 = self.branch2_attn(x) out2 = self.branch2_proj(out2) out2_agg = out1 + out2 # branch3 out3 = self.branch3_attn(x) out3 = self.branch3_proj(out3) out3_agg = out1 + out2_agg + out3 # branch4 out4 = self.branch4_attn(x) out4 = self.branch4_proj(out4) out4_agg = out1 + out2_agg + out3_agg + out4 # branch5 => 全局池化 => conv => GroupNorm => relu => 上采样 gp = F.adaptive_avg_pool2d(x, 1) # (B, C, 1,1) gp = self.branch5_conv(gp) # => (B,256,1,1) gp = self.branch5_gn(gp) # => GroupNorm gp = self.branch5_relu(gp) gp = F.interpolate(gp, size=(H, W), mode='bilinear', align_corners=False) # 拼接 => fuse => (B,256,H,W) out_cat = torch.cat([out1, out2_agg, out3_agg, out4_agg, gp], dim=1) out = self.fuse_conv(out_cat) return out ############################################ # 三、DeepLab + WA_ASPP_MViT ############################################ class DeepLab_MViT(nn.Module): """ 使用简易MobileNetV2作为backbone, ASPP换成WA_ASPP_MViT (MobileViT 注意力 + GroupNorm) """ def __init__(self, num_classes=3, downsample_factor=16): super().__init__() self.backbone = MobileNetV2(downsample_factor=downsample_factor, pretrained=False) # 假设: mobilenetv2输出 deep=320, low_level=32 in_channels = 320 low_level_channels = 32 # ASPP => WA_ASPP_MViT self.aspp = WA_ASPP_MViT( dim_in = in_channels, dim_out = 256, patch_size = (2,2), mvit_depth = 2, mvit_dim = 128, mvit_mlp_dim = 256, mvit_heads=4, mvit_dropout=0.0 ) # low_level => 1×1 => 48 self.shortcut_conv = nn.Sequential( nn.Conv2d(low_level_channels, 48, 1), nn.BatchNorm2d(48), nn.ReLU(inplace=True) ) # 拼接 => cat_conv => cls_conv self.cat_conv = nn.Sequential( nn.Conv2d(48+256, 256, 3, padding=1), nn.BatchNorm2d(256), nn.ReLU(inplace=True), nn.Dropout(0.5), nn.Conv2d(256, 256, 3, padding=1), nn.BatchNorm2d(256), nn.ReLU(inplace=True), nn.Dropout(0.1) ) self.cls_conv = nn.Conv2d(256, num_classes, 1) def forward(self, x): H, W = x.shape[2], x.shape[3] # backbone => (low_level, deep) low_level, deep = self.backbone(x) # deep => MViT ASPP deep = self.aspp(deep) # low_level => 1×1 => 48 low_level = self.shortcut_conv(low_level) # 上采样 deep => 拼接 => cat_conv => cls_conv => 上采样 deep_up = F.interpolate(deep, size=(low_level.size(2), low_level.size(3)), mode='bilinear', align_corners=False) x_cat = torch.cat([deep_up, low_level], dim=1) x_cat = self.cat_conv(x_cat) x_cat = self.cls_conv(x_cat) x_cat = F.interpolate(x_cat, size=(H, W), mode='bilinear', align_corners=False) return x_cat ############################################ # 测试 ############################################ if __name__ == "__main__": model = DeepLab_MViT(num_classes=3, downsample_factor=16) print("模型参数总数:", count_parameters(model)) # 构造单张图 (batch=1, channel=3, 512x512) img = torch.randn(1, 3, 512, 512) pred = model(img) print("输出维度:", pred.shape) # (1,3,512,512)
- GroupNorm replacement: in the branch5 path, `nn.GroupNorm(num_groups=1, num_channels=dim_out)` replaces the original BatchNorm2d / InstanceNorm2d.
- Everything else is unchanged.
- If you still want to keep BN there, run that path in inference mode (`model.eval()`) so BN does not compute statistics on a (B=1, C, 1, 1) input, or guarantee batch size > 1.

With this, the example network runs in a single file, and even with batch = 1 and the spatial size collapsing to 1×1 there is no more BN/IN runtime error.
C:\Users\wztbkl\anaconda3\envs\wuzhupytorch\python.exe C:\Users\wztbkl\PycharmProjects\deeplabv3-plus-pytorch-main\nets\mobileVitASPP_deeplabV3.py
模型参数总数: 11475699
输出维度: torch.Size([1, 3, 512, 512])
进程已结束,退出代码为 0
The model has too many parameters. What can I do to reduce the parameter count?
If the current parameter count (about 11M+) is too large and you want to trim it without breaking the core design, the following directions reduce parameters while keeping segmentation performance as intact as possible:
**1. Slim down the MobileViTAttentionBlock**

Most of the parameters inside MobileViTAttentionBlock come from:

- Projection width (`transformer_dim`): currently `transformer_dim=128`. Mapping the 320 input channels down to 128 dimensions adds a 1×1 conv plus the stacked Transformer layers, whose `qkv` and FFN linear layers dominate the count. A very low `transformer_dim` may limit the global-attention capacity, but it removes a large number of parameters (especially when a deep, high-channel feature map is projected onto a large embedding).
- Transformer depth (`mvit_depth`): currently `transformer_depth=2` (two Transformer layers). You can use a single layer (`depth=1`), or keep depth 2 in only one or two branches and depth 1 elsewhere, which also shrinks the stacked attention parameters.
- FeedForward width (`mvit_mlp_dim`): usually a multiple of `transformer_dim`; lower it according to the task, e.g. `mlp_dim=128` or even smaller, to reduce the FFN linear layers.
- Number of heads (`mvit_heads`): can also be reduced (e.g. from 4 to 2) together with the settings above.
- Number of branches: in `WA_ASPP_MViT`, branch2, branch3 and branch4 all use MobileViTAttentionBlock. Enabling the MobileViT attention in only one or two branches and using plain 3×3 or dilated convolutions for the rest also saves parameters.

A rough per-layer estimate of how `transformer_dim`, `mlp_dim` and depth affect the count is sketched right after this list.
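To get a feel for how much these knobs matter, the rough count below uses only the shapes of the `to_qkv`, output-projection and FFN linear layers of the MultiHeadAttention/FeedForward pair shown earlier (biases and LayerNorm ignored); the lighter configuration is just one possible choice:

```python
def transformer_layer_params(dim, mlp_dim):
    """Approximate weight count of one Transformer layer: qkv + out-proj + 2-layer FFN."""
    qkv = dim * 3 * dim        # nn.Linear(dim, 3*dim, bias=False)
    proj = dim * dim           # projection back to dim
    ffn = 2 * dim * mlp_dim    # Linear(dim, mlp_dim) + Linear(mlp_dim, dim)
    return qkv + proj + ffn

# Current setting: transformer_dim=128, mlp_dim=256, depth=2
heavy = 2 * transformer_layer_params(128, 256)
# One lighter setting: transformer_dim=96, mlp_dim=192, depth=1
light = 1 * transformer_layer_params(96, 192)
print(heavy, light, f"{light / heavy:.2%}")   # 262144 73728 28.12%

# Note: with inner_dim == dim in this MultiHeadAttention, changing `heads`
# alone does not change these linear-layer sizes.
```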
**2. Trim the backbone and the ASPP output**

Some reductions can also be made in the backbone or at the end of the ASPP:

- ASPP output channels (`dim_out`): halving it (e.g. 256 to 128) cuts roughly half of the parameters in `branch1/2/3/4/5` and in the `fuse_conv` stage.
- Backbone `deep_features` channels: with a full MobileNetV2, a `width_mult < 1.0` (e.g. 0.5 or 0.75) shrinks the overall channel widths.
- Depthwise-separable channels: the `expand_ratio` of the inverted-residual layers can be lowered to reduce the intermediate channels and compress the total parameter count.

**3. Simplify the ASPP branches**

WA_ASPP currently has five branches (one 1×1, three attention branches and one global-pooling branch) that are concatenated at the end. To cut parameters as much as possible you can:

- reduce the number of branches;
- remove or simplify the global-pooling branch;
- apply MobileViT attention only once, at the final stage.

**4. Compression techniques**

- Knowledge distillation
- Network pruning (a minimal pruning sketch follows this list)
- Low-bit / quantization
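Of these, pruning is the easiest to try directly in PyTorch. The sketch below uses `torch.nn.utils.prune` for unstructured magnitude pruning on a single conv layer; note that it only zeroes weights, so the stored parameter count shrinks only after exporting a sparse/compact model or moving to structured pruning:

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

conv = nn.Conv2d(320, 256, kernel_size=1, bias=False)

# Zero out the 50% of weights with the smallest L1 magnitude.
prune.l1_unstructured(conv, name="weight", amount=0.5)
sparsity = (conv.weight == 0).float().mean().item()
print(f"sparsity after pruning: {sparsity:.2f}")   # ~0.50

# Make the pruning permanent (removes the mask / reparametrization hooks).
prune.remove(conv, "weight")
```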
Without changing the overall structure, reduce the parameter count by adjusting hyperparameters only. Based on the optimizations above, give one concrete modification scheme and the complete modified code.
Below is an example that keeps the overall network structure unchanged (still the five ASPP branches, the MobileViT attention branches, and the DeepLab backbone + decoder flow) and sharply reduces the parameter count only by adjusting key dimensions and channel widths. The main changes are:

- **Shrink the internal dimensions of the MobileViT attention module**: `transformer_dim` goes from 128 to 96, `transformer_mlp_dim` from 256 to 192, `transformer_depth` from 2 to 1 (a single Transformer layer), and `mvit_heads` from 4 to 2.
- **Lower the ASPP output channels**: the final `dim_out` of `WA_ASPP_MViT` goes from 256 to 128.
- **Everything else stays the same.**

This keeps the overall architecture and the multi-branch ASPP + MobileViT attention idea while cutting a large share of the multi-head attention and conv-projection parameters. Note that this is only one example configuration; depending on your results on the actual dataset you can tune `transformer_dim`, `dim_out` and the other hyperparameters further. The complete example code follows.
pythonimport torch import torch.nn as nn import torch.nn.functional as F import math ############################################ # 工具函数: 统计模型可学习参数量 ############################################ def count_parameters(model): return sum(p.numel() for p in model.parameters() if p.requires_grad) ############################################ # 简易的 MobileNetV2 Backbone (仅示意) ############################################ class MobileNetV2(nn.Module): """ 仅演示用:features[:2] 当浅层, features[2:] 当深层 真正项目应使用完整的 mobilenetv2 实现 """ def __init__(self, downsample_factor=16, pretrained=False): super().__init__() # 用几个Conv2d来模拟 self.features = nn.Sequential( nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU6(inplace=True), nn.Conv2d(32, 32, 3, stride=1, padding=1), nn.ReLU6(inplace=True), nn.Conv2d(32, 320, 3, stride=2, padding=1), nn.ReLU6(inplace=True) ) # 省略根据 downsample_factor 做空洞卷积等 def forward(self, x): # 假设前2层 => 浅层(形状 ~ (B,32,H/2,W/2)) low_level_features = self.features[:2](x) # (B,32, ...) # 后面 => deep deep_features = self.features[2:](low_level_features) # (B,320, ...) return low_level_features, deep_features ############################################ # 一、MobileViT 注意力模块 (减参版) ############################################ from einops import rearrange class MobileViTAttentionBlock(nn.Module): """ (B, in_channels, H, W) => (B, in_channels, H, W) 减参思路: - transformer_dim 从128降到96 - transformer_depth从2降到1 - heads从4降到2 - mlp_dim从256降到192 """ def __init__(self, in_channels=320, patch_size=(2,2), transformer_depth=1, # 减少深度 transformer_dim=96, # 减少embedding transformer_mlp_dim=192, # 减少FFN维度 heads=2, # 减少多头 dropout=0.0): super().__init__() self.ph, self.pw = patch_size # (1) 局部卷积 self.local_conv1 = nn.Conv2d(in_channels, in_channels, 3, padding=1, bias=False) self.local_bn1 = nn.BatchNorm2d(in_channels) self.local_act1 = nn.SiLU() # (2) conv映射 => transformer_dim self.conv_proj_in = nn.Conv2d(in_channels, transformer_dim, kernel_size=1, bias=False) self.bn_proj_in = nn.BatchNorm2d(transformer_dim) self.act_proj_in = nn.SiLU() # 多层Transformer (此处只1层) self.transformer = TransformerBlockGroup( dim=transformer_dim, depth=transformer_depth, heads=heads, mlp_dim=transformer_mlp_dim, dropout=dropout ) # 映射回 in_channels self.conv_proj_out = nn.Conv2d(transformer_dim, in_channels, kernel_size=1, bias=False) self.bn_proj_out = nn.BatchNorm2d(in_channels) self.act_proj_out = nn.SiLU() # 最后 concat 并做 3x3融合 self.local_conv2 = nn.Conv2d(in_channels*2, in_channels, 3, padding=1, bias=False) self.local_bn2 = nn.BatchNorm2d(in_channels) self.local_act2 = nn.SiLU() def forward(self, x): # 备份原输入 y = x.clone() # 1) 局部卷积 x = self.local_conv1(x) x = self.local_bn1(x) x = self.local_act1(x) # 2) conv => transformer_dim x = self.conv_proj_in(x) x = self.bn_proj_in(x) x = self.act_proj_in(x) B, D, H, W = x.shape # patch切分 assert H%self.ph==0 and W%self.pw==0, "H,W需能被patch_size整除" x = rearrange(x, 'b d (hh ph) (ww pw) -> b (hh ww) (ph pw) d', ph=self.ph, pw=self.pw) # 3) transformer x = self.transformer(x) # 拼回 x = rearrange(x, 'b (hh ww) (ph pw) d -> b d (hh ph) (ww pw)', hh=H//self.ph, ww=W//self.pw, ph=self.ph, pw=self.pw) # 映射回 in_channels x = self.conv_proj_out(x) x = self.bn_proj_out(x) x = self.act_proj_out(x) # concat + 3x3融合 x = torch.cat([x, y], dim=1) x = self.local_conv2(x) x = self.local_bn2(x) x = self.local_act2(x) return x class TransformerBlockGroup(nn.Module): def __init__(self, dim, depth, heads, mlp_dim, dropout=0.0): super().__init__() self.layers = nn.ModuleList([ TransformerBlockLayer(dim, heads, mlp_dim, 
dropout) for _ in range(depth) ]) def forward(self, x): for layer in self.layers: x = layer(x) return x class TransformerBlockLayer(nn.Module): def __init__(self, dim, heads, mlp_dim, dropout=0.0): super().__init__() self.norm1 = nn.LayerNorm(dim) self.attn = MultiHeadAttention(dim, heads, dropout) self.norm2 = nn.LayerNorm(dim) self.ffn = FeedForward(dim, mlp_dim, dropout) def forward(self, x): shortcut = x x_ln = self.norm1(x) x_attn = self.attn(x_ln) x = x_attn + shortcut shortcut2 = x x_ln2 = self.norm2(x) x_ffn = self.ffn(x_ln2) x = x_ffn + shortcut2 return x class MultiHeadAttention(nn.Module): def __init__(self, dim, heads=2, dropout=0.0): super().__init__() self.heads = heads self.dim_head = dim//heads inner_dim = self.dim_head*heads self.scale = self.dim_head**-0.5 self.to_qkv = nn.Linear(dim, inner_dim*3, bias=False) self.attend = nn.Softmax(dim=-1) self.proj = nn.Sequential( nn.Linear(inner_dim, dim), nn.Dropout(dropout) ) def forward(self, x): # x: (B, n_patch, patch_area, dim) B, n_patch, patch_area, dim = x.shape qkv = self.to_qkv(x) q, k, v = torch.chunk(qkv, 3, dim=-1) q = q.reshape(B, n_patch, patch_area, self.heads, self.dim_head).permute(0,1,3,2,4) k = k.reshape(B, n_patch, patch_area, self.heads, self.dim_head).permute(0,1,3,2,4) v = v.reshape(B, n_patch, patch_area, self.heads, self.dim_head).permute(0,1,3,2,4) dots = torch.matmul(q, k.transpose(-1, -2))*self.scale attn = self.attend(dots) out = torch.matmul(attn, v) # (B,n_patch,heads,patch_area,dim_head) out = out.permute(0,1,3,2,4).contiguous() out = out.view(B, n_patch, patch_area, self.heads*self.dim_head) out = self.proj(out) return out class FeedForward(nn.Module): def __init__(self, dim, hidden_dim, dropout=0.0): super().__init__() self.net = nn.Sequential( nn.Linear(dim, hidden_dim), nn.SiLU(), nn.Dropout(dropout), nn.Linear(hidden_dim, dim), nn.Dropout(dropout) ) def forward(self, x): return self.net(x) ############################################ # 二、WA_ASPP中的MobileViT分支 (使用 GroupNorm + 减参) ############################################ class WA_ASPP_MViT(nn.Module): """ 五条分支: 1) branch1: 1×1 conv => (dim_in -> dim_out=128) 2) branch2/3/4: MobileViTAttentionBlock => 1×1 conv => 128 - mobilevit维度大幅缩减 3) branch5: 全局池化 => conv => GroupNorm => relu => 上采样 拼接 => 1×1 => 128 """ def __init__(self, dim_in=320, dim_out=128, # MobileViT相关超参(减参) patch_size=(2,2), mvit_depth=1, mvit_dim=96, mvit_mlp_dim=192, mvit_heads=2, mvit_dropout=0.0): super().__init__() # branch1 self.branch1 = nn.Sequential( nn.Conv2d(dim_in, dim_out, kernel_size=1), nn.BatchNorm2d(dim_out), nn.ReLU(inplace=True) ) # branch2/3/4 => MobileViT注意力(减参) self.branch2_attn = MobileViTAttentionBlock( in_channels=dim_in, patch_size=patch_size, transformer_depth=mvit_depth, transformer_dim=mvit_dim, transformer_mlp_dim=mvit_mlp_dim, heads=mvit_heads, dropout=mvit_dropout ) self.branch3_attn = MobileViTAttentionBlock( in_channels=dim_in, patch_size=patch_size, transformer_depth=mvit_depth, transformer_dim=mvit_dim, transformer_mlp_dim=mvit_mlp_dim, heads=mvit_heads, dropout=mvit_dropout ) self.branch4_attn = MobileViTAttentionBlock( in_channels=dim_in, patch_size=patch_size, transformer_depth=mvit_depth, transformer_dim=mvit_dim, transformer_mlp_dim=mvit_mlp_dim, heads=mvit_heads, dropout=mvit_dropout ) self.branch2_proj = nn.Conv2d(dim_in, dim_out, kernel_size=1, bias=False) self.branch3_proj = nn.Conv2d(dim_in, dim_out, kernel_size=1, bias=False) self.branch4_proj = nn.Conv2d(dim_in, dim_out, kernel_size=1, bias=False) # branch5 => 全局池化 => conv => GroupNorm 
=> relu => 上采样 self.branch5_conv = nn.Conv2d(dim_in, dim_out, kernel_size=1) self.branch5_gn = nn.GroupNorm(num_groups=1, num_channels=dim_out) self.branch5_relu = nn.ReLU(inplace=True) # 拼接 => 1×1 => dim_out=128 self.fuse_conv = nn.Sequential( nn.Conv2d(dim_out*5, dim_out, kernel_size=1), nn.BatchNorm2d(dim_out), nn.ReLU(inplace=True) ) def forward(self, x): B, C, H, W = x.shape # branch1 => out1 out1 = self.branch1(x) # => (B,128,H,W) # branch2 => MViT => proj => add out2 = self.branch2_attn(x) out2 = self.branch2_proj(out2) out2_agg = out1 + out2 # branch3 out3 = self.branch3_attn(x) out3 = self.branch3_proj(out3) out3_agg = out1 + out2_agg + out3 # branch4 out4 = self.branch4_attn(x) out4 = self.branch4_proj(out4) out4_agg = out1 + out2_agg + out3_agg + out4 # branch5 => 全局池化 => conv => GroupNorm => relu => 上采样 gp = F.adaptive_avg_pool2d(x, 1) # (B, C, 1,1) gp = self.branch5_conv(gp) # => (B,128,1,1) gp = self.branch5_gn(gp) gp = self.branch5_relu(gp) gp = F.interpolate(gp, size=(H, W), mode='bilinear', align_corners=False) # 拼接 => fuse => (B,128,H,W) out_cat = torch.cat([out1, out2_agg, out3_agg, out4_agg, gp], dim=1) out = self.fuse_conv(out_cat) return out ############################################ # 三、DeepLab + WA_ASPP_MViT(减参) ############################################ class DeepLab_MViT_Light(nn.Module): """ 使用简易MobileNetV2作为backbone, ASPP换成WA_ASPP_MViT(减参), 并减少后续通道 """ def __init__(self, num_classes=3, downsample_factor=16): super().__init__() self.backbone = MobileNetV2(downsample_factor=downsample_factor, pretrained=False) # 假设: mobilenetv2输出 deep=320, low_level=32 in_channels = 320 low_level_channels = 32 # 将 ASPP 输出通道减少到 128 self.aspp = WA_ASPP_MViT( dim_in = in_channels, dim_out = 128, # 减少到 128 patch_size = (2,2), mvit_depth = 1, # Transformer层数=1 mvit_dim = 96, # Transformer dim=96 mvit_mlp_dim = 192, # MLP dim=192 mvit_heads=2, # heads=2 mvit_dropout=0.0 ) # low_level => 1×1 => 24 (也减小) self.shortcut_conv = nn.Sequential( nn.Conv2d(low_level_channels, 24, 1), nn.BatchNorm2d(24), nn.ReLU(inplace=True) ) # 拼接 => cat_conv => cls_conv # 这里(24 + 128) => 152, 可以再让cat_conv输出通道也减为128 self.cat_conv = nn.Sequential( nn.Conv2d(24+128, 128, 3, padding=1), nn.BatchNorm2d(128), nn.ReLU(inplace=True), nn.Dropout(0.5), nn.Conv2d(128, 128, 3, padding=1), nn.BatchNorm2d(128), nn.ReLU(inplace=True), nn.Dropout(0.1) ) self.cls_conv = nn.Conv2d(128, num_classes, 1) def forward(self, x): H, W = x.shape[2], x.shape[3] # backbone => (low_level=32, deep=320) low_level, deep = self.backbone(x) # deep => MViT ASPP (减参) deep = self.aspp(deep) # low_level => 1×1 => 24 low_level = self.shortcut_conv(low_level) # 上采样 deep => 拼接 => cat_conv => cls_conv => 上采样 deep_up = F.interpolate(deep, size=(low_level.size(2), low_level.size(3)), mode='bilinear', align_corners=False) x_cat = torch.cat([deep_up, low_level], dim=1) # => (B,152,H/?,W/?) x_cat = self.cat_conv(x_cat) x_cat = self.cls_conv(x_cat) x_cat = F.interpolate(x_cat, size=(H, W), mode='bilinear', align_corners=False) return x_cat ############################################ # 四、测试 ############################################ if __name__ == "__main__": model = DeepLab_MViT_Light(num_classes=3, downsample_factor=16) print("模型参数总数:", count_parameters(model)) # 构造单张图 (batch=1, channel=3, 512x512) img = torch.randn(1, 3, 512, 512) pred = model(img) print("输出维度:", pred.shape)
**WA_ASPP_MViT (reduced-parameter version):**

- `dim_out` is changed to 128 (instead of the original 256), and branch1, `branch2_proj`, `branch5_conv`, etc. all output 128 channels;
- `MobileViTAttentionBlock` now uses `transformer_dim=96`, `heads=2`, `depth=1`, `mlp_dim=192`, which sharply cuts the attention and FFN linear-layer parameters.

**DeepLab_MViT_Light:**

- `shortcut_conv` maps the low-level features to only 24 channels (down from 48);
- the convolutions in `cat_conv` output 128 channels instead of 256;
- `cls_conv` is still 128 -> num_classes=3.

The overall structure is unchanged.

**Expected effect:** a clearly smaller parameter count; the actual accuracy trade-off still has to be verified on your dataset.
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
#------------------------------------------------#
#------------------------------------------------#
from nets.xception import xception
from nets.mobilenetv2 import mobilenetv2
############################################
############################################
class MobileNetV2(nn.Module):
def __init__(self, downsample_factor=8, pretrained=True):
super(MobileNetV2, self).__init__()
from functools import partial
model = mobilenetv2(pretrained)
self.features = model.features[:-1]
textself.total_idx = len(self.features) self.down_idx = [2, 4, 7, 14] if downsample_factor == 8: for i in range(self.down_idx[-2], self.down_idx[-1]): self.features[i].apply(partial(self._nostride_dilate, dilate=2)) for i in range(self.down_idx[-1], self.total_idx): self.features[i].apply(partial(self._nostride_dilate, dilate=4)) elif downsample_factor == 16: for i in range(self.down_idx[-1], self.total_idx): self.features[i].apply(partial(self._nostride_dilate, dilate=2)) def _nostride_dilate(self, m, dilate): classname = m.__class__.__name__ if classname.find('Conv') != -1: if m.stride == (2, 2): m.stride = (1, 1) if m.kernel_size == (3, 3): m.dilation = (dilate // 2, dilate // 2) m.padding = (dilate // 2, dilate // 2) else: if m.kernel_size == (3, 3): m.dilation = (dilate, dilate) m.padding = (dilate, dilate) def forward(self, x): # 前4层 -> low_level_features low_level_features = self.features[:4](x) # 后面 -> deep_features x = self.features[4:](low_level_features) return low_level_features, x
############################################
############################################
def count_parameters(model):
"""统计模型参数数量"""
return sum(p.numel() for p in model.parameters() if p.requires_grad)
############################################
############################################
def window_partition(x, window_size):
"""
将 (B, C, H, W) -> (B*num_windows, C, window_size, window_size)
假设 H,W 能被 window_size 整除
"""
B, C, H, W = x.shape
x = x.view(B, C, H // window_size, window_size, W // window_size, window_size)
x = x.permute(0, 2, 4, 1, 3, 5).contiguous()
windows = x.view(-1, C, window_size, window_size)
return windows
def window_unpartition(windows, window_size, H, W):
"""
将 (B*num_windows, C, window_size, window_size) -> (B, C, H, W)
"""
num_windows = (H // window_size) * (W // window_size)
B = windows.size(0) // num_windows
C = windows.size(1)
x = windows.view(B, H // window_size, W // window_size, C, window_size, window_size)
x = x.permute(0, 3, 1, 4, 2, 5).contiguous()
x = x.view(B, C, H, W)
return x
############################################
############################################
def scaled_dot_product_attention(Q, K, V):
"""
Q, K, V: (B, heads, seq_len, dim_per_head)
返回: (B, heads, seq_len, dim_per_head)
"""
attn_scores = torch.matmul(Q, K.transpose(-2, -1)) # [B, heads, seq_len, seq_len]
dim_k = K.shape[-1]
attn_scores = attn_scores / math.sqrt(dim_k)
attn = F.softmax(attn_scores, dim=-1)
out = torch.matmul(attn, V) # [B, heads, seq_len, dim_per_head]
return out
############################################
############################################
class TransformerBlock(nn.Module):
"""
标准Transformer Encoder风格:
1) LayerNorm -> MultiHeadAttention -> Dropout -> Residual
2) LayerNorm -> FFN(MLP) -> Dropout -> Residual
不含位置编码,如需可自行加
"""
def __init__(self, dim, num_heads=4, mlp_ratio=4.0, attn_dropout=0.0, ffn_dropout=0.0):
super().__init__()
self.dim = dim
self.num_heads = num_heads
self.mlp_ratio = mlp_ratio
self.attn_dropout = attn_dropout
self.ffn_dropout = ffn_dropout
text#---- 注意力前的LN ---- self.norm1 = nn.LayerNorm(dim) # Q,K,V合并投影 self.qkv = nn.Linear(dim, dim*3, bias=True) self.attn_drop = nn.Dropout(attn_dropout) # 输出投影 self.proj = nn.Linear(dim, dim, bias=True) self.proj_drop = nn.Dropout(attn_dropout) #---- FFN 前的LN ---- self.norm2 = nn.LayerNorm(dim) hidden_dim = int(dim * mlp_ratio) self.ffn = nn.Sequential( nn.Linear(dim, hidden_dim), nn.GELU(), nn.Dropout(ffn_dropout), nn.Linear(hidden_dim, dim), nn.Dropout(ffn_dropout), ) def forward(self, x): """ x: (B, seq_len, dim) """ B, N, D = x.shape # (1) 自注意力子层 shortcut = x x_ln = self.norm1(x) qkv = self.qkv(x_ln).reshape(B, N, 3, self.num_heads, D//self.num_heads) qkv = qkv.permute(2, 0, 3, 1, 4) # (3, B, heads, seq_len, dim_ph) Q, K, V = qkv[0], qkv[1], qkv[2] # (B, heads, seq_len, dim_ph) # 多头注意力 attn_out = scaled_dot_product_attention(Q, K, V) attn_out = self.attn_drop(attn_out) # 合并heads attn_out = attn_out.permute(0, 2, 1, 3).contiguous() # => (B, seq_len, dim) attn_out = attn_out.view(B, N, D) # 输出投影 attn_out = self.proj(attn_out) attn_out = self.proj_drop(attn_out) # 残差 x = shortcut + attn_out # (2) FFN子层 shortcut2 = x x_ln2 = self.norm2(x) x_ffn = self.ffn(x_ln2) x = shortcut2 + x_ffn return x
############################################
############################################
class WindowTransformerBlock(nn.Module):
"""
- 将(B, C, H, W)划分为窗口
- 在每个窗口内, 将通道C视作embedding dim, (ws*ws)视作序列长度
- 送进 TransformerBlock
- 最终输出和输入维度相同 (in_channels不变)
"""
def __init__(self, in_channels, window_size=8, num_heads=4, mlp_ratio=4.0,
attn_dropout=0.0, ffn_dropout=0.0):
super().__init__()
self.in_channels = in_channels
self.window_size = window_size
textself.transformer = TransformerBlock( dim = in_channels, num_heads = num_heads, mlp_ratio = mlp_ratio, attn_dropout= attn_dropout, ffn_dropout = ffn_dropout ) def forward(self, x): """ x: (B, C, H, W) return: (B, C, H, W) """ B, C, H, W = x.shape ws = self.window_size # 1) 分窗口 => (B*nw, C, ws, ws) windows = window_partition(x, ws) # 2) => (B*nw, ws*ws, C) windows_reshape = windows.view(-1, C, ws*ws).permute(0, 2, 1).contiguous() # 3) TransformerBlock out = self.transformer(windows_reshape) # (B*nw, ws*ws, C) # 4) => (B*nw, C, ws, ws) out = out.permute(0, 2, 1).contiguous() out = out.view(-1, C, ws, ws) # 5) 拼回 (B, C, H, W) x_out = window_unpartition(out, ws, H, W) return x_out
############################################
############################################
class WA_ASPP(nn.Module):
"""
五条分支:
1) branch1: 1×1 conv => (320->256)
2) branch2/3/4: WindowTransformerBlock(320->320) + 1×1 conv(320->256) => 层次叠加
3) branch5: 全局平均池化(320->256)
最终拼接 => 1×1 conv => 256
"""
def __init__(self, dim_in, dim_out,
window_size=(4,8,16),
num_heads=4,
mlp_ratio=4.0,
attn_dropout=0.0,
ffn_dropout=0.0):
super().__init__()
text# branch1: 1×1 conv => [B, dim_out, H, W] self.branch1 = nn.Sequential( nn.Conv2d(dim_in, dim_out, kernel_size=1), nn.BatchNorm2d(dim_out), nn.ReLU(inplace=True) ) # branch2/3/4: WindowTransformerBlock (输入输出都是dim_in=320) self.branch2_attn = WindowTransformerBlock( in_channels = dim_in, window_size = window_size[0], num_heads = num_heads, mlp_ratio = mlp_ratio, attn_dropout = attn_dropout, ffn_dropout = ffn_dropout ) self.branch3_attn = WindowTransformerBlock( in_channels = dim_in, window_size = window_size[1], num_heads = num_heads, mlp_ratio = mlp_ratio, attn_dropout = attn_dropout, ffn_dropout = ffn_dropout ) self.branch4_attn = WindowTransformerBlock( in_channels = dim_in, window_size = window_size[2], num_heads = num_heads, mlp_ratio = mlp_ratio, attn_dropout = attn_dropout, ffn_dropout = ffn_dropout ) # 额外的 1×1 Conv:将分支2/3/4输出从 320 => 256 self.branch2_proj = nn.Conv2d(dim_in, dim_out, kernel_size=1, bias=False) self.branch3_proj = nn.Conv2d(dim_in, dim_out, kernel_size=1, bias=False) self.branch4_proj = nn.Conv2d(dim_in, dim_out, kernel_size=1, bias=False) # branch5: 全局平均池化 => 卷积(320->256) self.branch5_conv = nn.Conv2d(dim_in, dim_out, kernel_size=1) self.branch5_bn = nn.BatchNorm2d(dim_out) self.branch5_relu = nn.ReLU(inplace=True) # 拼接(5路) => 1×1 conv => dim_out self.fuse_conv = nn.Sequential( nn.Conv2d(dim_out * 5, dim_out, kernel_size=1), nn.BatchNorm2d(dim_out), nn.ReLU(inplace=True) ) def forward(self, x): B, C, H, W = x.size() # branch1 => [B, 256, H, W] out1 = self.branch1(x) # branch2 => WindowAttn => [B, 320, H, W] -> 1x1Conv => [B,256,H,W] out2 = self.branch2_attn(x) out2 = self.branch2_proj(out2) out2_agg = out1 + out2 # branch3 out3 = self.branch3_attn(x) out3 = self.branch3_proj(out3) out3_agg = out1 + out2_agg + out3 # branch4 out4 = self.branch4_attn(x) out4 = self.branch4_proj(out4) out4_agg = out1 + out2_agg + out3_agg + out4 # branch5 => 全局池化 => 卷积=>bn=>relu => 上采样 => [B,256,H,W] gp = F.adaptive_avg_pool2d(x, 1) # (B, 320, 1, 1) gp = self.branch5_conv(gp) # => (B, 256, 1, 1) gp = self.branch5_bn(gp) gp = self.branch5_relu(gp) gp = F.interpolate(gp, size=(H, W), mode='bilinear', align_corners=False) # 拼接 [out1, out2_agg, out3_agg, out4_agg, gp] => [B, 256*5, H, W] out_cat = torch.cat([out1, out2_agg, out3_agg, out4_agg, gp], dim=1) # 融合 => [B, 256, H, W] out = self.fuse_conv(out_cat) return out
############################################
############################################
class DeepLab(nn.Module):
def __init__(self, num_classes, backbone="mobilenet", pretrained=True, downsample_factor=16):
super(DeepLab, self).__init__()
if backbone == "xception":
self.backbone = xception(downsample_factor=downsample_factor, pretrained=pretrained)
in_channels = 2048
low_level_channels = 256
elif backbone == "mobilenet":
self.backbone = MobileNetV2(downsample_factor=downsample_factor, pretrained=pretrained)
#----------------------------------#
# mobilenetv2 输出deep_features: 320通道
# 浅层特征 low_level_features: 24通道
#----------------------------------#
in_channels = 320
low_level_channels = 24
else:
raise ValueError('Unsupported backbone - `{}`, Use mobilenet, xception.'.format(backbone))
text# 这就是我们自定义的多头窗口注意力ASPP self.aspp = WA_ASPP( dim_in = in_channels, # 320 dim_out = 256, window_size = (4,8,16), # 窗口大小 num_heads = 4, # 多头数 mlp_ratio = 4.0, # FFN扩展倍数 attn_dropout= 0.1, # 注意力dropout ffn_dropout = 0.1 # FFN dropout ) # 浅层特征 => 1×1卷积 => 48通道 self.shortcut_conv = nn.Sequential( nn.Conv2d(low_level_channels, 48, 1), nn.BatchNorm2d(48), nn.ReLU(inplace=True) ) # 拼接后 => 进一步卷积 self.cat_conv = nn.Sequential( nn.Conv2d(48 + 256, 256, 3, stride=1, padding=1), nn.BatchNorm2d(256), nn.ReLU(inplace=True), nn.Dropout(0.5), nn.Conv2d(256, 256, 3, stride=1, padding=1), nn.BatchNorm2d(256), nn.ReLU(inplace=True), nn.Dropout(0.1), ) # 最终1x1分类层 self.cls_conv = nn.Conv2d(256, num_classes, 1, stride=1) def forward(self, x): H, W = x.size(2), x.size(3) # 主干网络 => 浅层特征, 深层特征 low_level_features, deep_features = self.backbone(x) # 深层特征 => WA_ASPP deep_features = self.aspp(deep_features) # 浅层特征 => 1×1 conv => 48通道 low_level_features = self.shortcut_conv(low_level_features) # 将深层特征上采样到浅层特征大小,然后拼接 deep_features = F.interpolate(deep_features, size=(low_level_features.size(2), low_level_features.size(3)), mode='bilinear', align_corners=True) x_cat = torch.cat((deep_features, low_level_features), dim=1) x_cat = self.cat_conv(x_cat) # 最终分类头 x_cat = self.cls_conv(x_cat) x_cat = F.interpolate(x_cat, size=(H, W), mode='bilinear', align_corners=True) return x_cat
############################################
############################################
if __name__ == '__main__':
img = torch.randn(3, 3, 512, 512)
model = DeepLab(num_classes=3, backbone="mobilenet", pretrained=False, downsample_factor=16)
out = model(img)
print("输出维度:", out.shape) # 期望 (3, 3, 512, 512)
print("DeepLabV3+参数数量:", count_parameters(model))
Following the code above, modify the current code accordingly.
Below is an example that mimics the current DeepLab + WA_ASPP structure but reduces the total parameter count **by shrinking a few key hyperparameters**. Compared with the original code, only the following are adjusted:

- `dim_out` is reduced from 256 to 128.
- `num_heads` is reduced and the FFN expansion ratio (`mlp_ratio`) is lowered, cutting the attention and fully connected layer parameters.
- The channel width of the low-level projection (`shortcut_conv`) is reduced accordingly (e.g. from 48 to 24), further trimming the decoder convolutions.

The rest of the network is unchanged, including the five branches, the concatenation scheme and the overall DeepLab decoding flow. The complete modified example code follows (it runs as a single file); adapt the import paths and backbone details to your project.
pythonimport torch import torch.nn as nn import torch.nn.functional as F import math #------------------------------------------------# # 若您的项目里已包含 xception 和 mobilenetv2, # 请确保导入路径正确;此处仅示例 #------------------------------------------------# from nets.xception import xception from nets.mobilenetv2 import mobilenetv2 ############################################ # 1. MobileNetV2 主干 (与原DeepLabV3+相同) # 仅示例,使用 features[:4]当low_level ############################################ class MobileNetV2(nn.Module): def __init__(self, downsample_factor=8, pretrained=True): super(MobileNetV2, self).__init__() from functools import partial model = mobilenetv2(pretrained) # 截断最后一层 classification self.features = model.features[:-1] self.total_idx = len(self.features) self.down_idx = [2, 4, 7, 14] if downsample_factor == 8: for i in range(self.down_idx[-2], self.down_idx[-1]): self.features[i].apply(partial(self._nostride_dilate, dilate=2)) for i in range(self.down_idx[-1], self.total_idx): self.features[i].apply(partial(self._nostride_dilate, dilate=4)) elif downsample_factor == 16: for i in range(self.down_idx[-1], self.total_idx): self.features[i].apply(partial(self._nostride_dilate, dilate=2)) def _nostride_dilate(self, m, dilate): """将stride=2的Conv改为dilation,保持感受野""" classname = m.__class__.__name__ if classname.find('Conv') != -1: if m.stride == (2, 2): m.stride = (1, 1) if m.kernel_size == (3, 3): m.dilation = (dilate // 2, dilate // 2) m.padding = (dilate // 2, dilate // 2) else: if m.kernel_size == (3, 3): m.dilation = (dilate, dilate) m.padding = (dilate, dilate) def forward(self, x): # 前4层 => low_level_features low_level_features = self.features[:4](x) # 后面 => deep_features x = self.features[4:](low_level_features) return low_level_features, x ############################################ # 2. 统计可训练参数数量 ############################################ def count_parameters(model): """统计模型参数数量""" return sum(p.numel() for p in model.parameters() if p.requires_grad) ############################################ # 3. 工具函数: window partition / unpartition ############################################ def window_partition(x, window_size): B, C, H, W = x.shape x = x.view(B, C, H // window_size, window_size, W // window_size, window_size) x = x.permute(0, 2, 4, 1, 3, 5).contiguous() windows = x.view(-1, C, window_size, window_size) return windows def window_unpartition(windows, window_size, H, W): num_windows = (H // window_size)*(W // window_size) B = windows.size(0)//num_windows C = windows.size(1) x = windows.view(B, H//window_size, W//window_size, C, window_size, window_size) x = x.permute(0,3,1,4,2,5).contiguous() x = x.view(B, C, H, W) return x ############################################ # 4. Scaled Dot-Product Attention ############################################ def scaled_dot_product_attention(Q, K, V): attn_scores = torch.matmul(Q, K.transpose(-2, -1)) dim_k = K.shape[-1] attn_scores = attn_scores / math.sqrt(dim_k) attn = F.softmax(attn_scores, dim=-1) out = torch.matmul(attn, V) return out ############################################ # 5. 
TransformerBlock (LN + 多头注意力 + FFN + 残差) ############################################ class TransformerBlock(nn.Module): """ 减参: 适当调低 mlp_ratio, num_heads等 """ def __init__(self, dim, num_heads=2, mlp_ratio=2.0, attn_dropout=0.0, ffn_dropout=0.0): super().__init__() self.dim = dim self.num_heads = num_heads self.mlp_ratio = mlp_ratio # LN self.norm1 = nn.LayerNorm(dim) self.qkv = nn.Linear(dim, dim*3, bias=True) self.attn_drop = nn.Dropout(attn_dropout) self.proj = nn.Linear(dim, dim, bias=True) self.proj_drop = nn.Dropout(attn_dropout) # LN2 self.norm2 = nn.LayerNorm(dim) hidden_dim = int(dim*mlp_ratio) self.ffn = nn.Sequential( nn.Linear(dim, hidden_dim), nn.GELU(), nn.Dropout(ffn_dropout), nn.Linear(hidden_dim, dim), nn.Dropout(ffn_dropout), ) def forward(self, x): B, N, D = x.shape # Multi-head Self-Attn shortcut = x x_ln = self.norm1(x) qkv = self.qkv(x_ln).reshape(B, N, 3, self.num_heads, D//self.num_heads) qkv = qkv.permute(2, 0, 3, 1, 4) Q, K, V = qkv[0], qkv[1], qkv[2] attn_out = scaled_dot_product_attention(Q, K, V) attn_out = self.attn_drop(attn_out) attn_out = attn_out.permute(0,2,1,3).contiguous() attn_out = attn_out.view(B, N, D) attn_out = self.proj(attn_out) attn_out = self.proj_drop(attn_out) x = shortcut + attn_out # FFN shortcut2 = x x_ln2 = self.norm2(x) x_ffn = self.ffn(x_ln2) x = shortcut2 + x_ffn return x ############################################ # 6. WindowTransformerBlock ############################################ class WindowTransformerBlock(nn.Module): def __init__(self, in_channels, window_size=8, num_heads=2, mlp_ratio=2.0, attn_dropout=0.0, ffn_dropout=0.0): super().__init__() self.in_channels = in_channels self.window_size = window_size self.transformer = TransformerBlock( dim = in_channels, num_heads = num_heads, mlp_ratio = mlp_ratio, attn_dropout= attn_dropout, ffn_dropout = ffn_dropout ) def forward(self, x): B, C, H, W = x.shape ws = self.window_size # 分窗口 => (B*nw, C, ws, ws) windows = window_partition(x, ws) # => (B*nw, ws*ws, C) windows_reshape = windows.view(-1, C, ws*ws).permute(0,2,1) # Transformer out = self.transformer(windows_reshape) # => (B*nw, C, ws, ws) out = out.permute(0,2,1).contiguous().view(-1, C, ws, ws) # 拼回 x_out = window_unpartition(out, ws, H, W) return x_out ############################################ # 7. 
多头WA_ASPP (减参) ############################################ class WA_ASPP(nn.Module): """ 五分支: branch1 (1x1), branch2/3/4(WinAttn+1x1), branch5(全局池化)-> 1x1 fuse => 1x1 此处将dim_out=128, window_size, num_heads=2, mlp_ratio=2.0来减少参数 """ def __init__(self, dim_in=320, dim_out=128, window_size=(4,8,16), num_heads=2, mlp_ratio=2.0, attn_dropout=0.0, ffn_dropout=0.0): super().__init__() # branch1 => 1x1 => 128 self.branch1 = nn.Sequential( nn.Conv2d(dim_in, dim_out, kernel_size=1), nn.BatchNorm2d(dim_out), nn.ReLU(inplace=True) ) # branch2/3/4 => WindowTransformerBlock => 1x1 => 128 self.branch2_attn = WindowTransformerBlock( in_channels=dim_in, window_size=window_size[0], num_heads=num_heads, mlp_ratio=mlp_ratio, attn_dropout=attn_dropout, ffn_dropout=ffn_dropout ) self.branch3_attn = WindowTransformerBlock( in_channels=dim_in, window_size=window_size[1], num_heads=num_heads, mlp_ratio=mlp_ratio, attn_dropout=attn_dropout, ffn_dropout=ffn_dropout ) self.branch4_attn = WindowTransformerBlock( in_channels=dim_in, window_size=window_size[2], num_heads=num_heads, mlp_ratio=mlp_ratio, attn_dropout=attn_dropout, ffn_dropout=ffn_dropout ) self.branch2_proj = nn.Conv2d(dim_in, dim_out, kernel_size=1, bias=False) self.branch3_proj = nn.Conv2d(dim_in, dim_out, kernel_size=1, bias=False) self.branch4_proj = nn.Conv2d(dim_in, dim_out, kernel_size=1, bias=False) # branch5 => 全局池化 => conv(=>128) self.branch5_conv = nn.Conv2d(dim_in, dim_out, kernel_size=1) self.branch5_bn = nn.BatchNorm2d(dim_out) self.branch5_relu = nn.ReLU(inplace=True) # fuse => 1x1 => 128 self.fuse_conv = nn.Sequential( nn.Conv2d(dim_out*5, dim_out, kernel_size=1), nn.BatchNorm2d(dim_out), nn.ReLU(inplace=True) ) def forward(self, x): B, C, H, W = x.shape # branch1 => out1 out1 = self.branch1(x) # =>(B,128,H,W) # branch2 => WinAttn => 1x1 => out2 out2 = self.branch2_attn(x) out2 = self.branch2_proj(out2) out2_agg = out1 + out2 # branch3 => WinAttn => ... out3 = self.branch3_attn(x) out3 = self.branch3_proj(out3) out3_agg = out1 + out2_agg + out3 # branch4 => WinAttn => ... out4 = self.branch4_attn(x) out4 = self.branch4_proj(out4) out4_agg = out1 + out2_agg + out3_agg + out4 # branch5 => 全局池化 => conv=>bn=>relu => 上采样 =>(B,128,H,W) gp = F.adaptive_avg_pool2d(x, 1) # (B,320,1,1) gp = self.branch5_conv(gp) # => (B,128,1,1) gp = self.branch5_bn(gp) gp = self.branch5_relu(gp) gp = F.interpolate(gp, size=(H, W), mode='bilinear', align_corners=False) # 拼接 => fuse =>(B,128,H,W) out_cat = torch.cat([out1, out2_agg, out3_agg, out4_agg, gp], dim=1) out = self.fuse_conv(out_cat) return out ############################################ # 8. 
DeepLab (减参) ############################################ class DeepLab(nn.Module): """ 不改变整体结构, 但将ASPP输出通道减到128, WindowTransformerBlock里num_heads=2, mlp_ratio=2.0 同时浅层shortcut_conv也改为 24通道(原本48) """ def __init__(self, num_classes, backbone="mobilenet", pretrained=True, downsample_factor=16): super(DeepLab, self).__init__() if backbone == "xception": self.backbone = xception(downsample_factor=downsample_factor, pretrained=pretrained) in_channels = 2048 low_level_channels = 256 elif backbone == "mobilenet": self.backbone = MobileNetV2(downsample_factor=downsample_factor, pretrained=pretrained) in_channels = 320 # deep low_level_channels = 24 # 浅层 else: raise ValueError('Unsupported backbone - `{}`, Use mobilenet, xception.'.format(backbone)) # 使用减参版 WA_ASPP self.aspp = WA_ASPP( dim_in = in_channels, # 320 dim_out = 128, # 减少到128 window_size = (4,8,16), num_heads = 2, # 减少多头 mlp_ratio = 2.0, # 减少FFN维度 attn_dropout= 0.1, ffn_dropout = 0.1 ) # 浅层特征 => 1x1 => 24通道(原先48减少) self.shortcut_conv = nn.Sequential( nn.Conv2d(low_level_channels, 24, 1), nn.BatchNorm2d(24), nn.ReLU(inplace=True) ) # 拼接后 => 进一步卷积 (这儿也可调小, 例如128) self.cat_conv = nn.Sequential( nn.Conv2d(24 + 128, 128, 3, stride=1, padding=1), nn.BatchNorm2d(128), nn.ReLU(inplace=True), nn.Dropout(0.5), nn.Conv2d(128, 128, 3, stride=1, padding=1), nn.BatchNorm2d(128), nn.ReLU(inplace=True), nn.Dropout(0.1), ) self.cls_conv = nn.Conv2d(128, num_classes, 1, stride=1) def forward(self, x): H, W = x.size(2), x.size(3) # backbone => low_level=24, deep=320 low_level_features, deep_features = self.backbone(x) # ASPP (减参) deep_features = self.aspp(deep_features) # low_level => 1x1 => 24 low_level_features = self.shortcut_conv(low_level_features) # 上采样 => 拼接 => cat_conv => cls_conv => 上采样 deep_features = F.interpolate(deep_features, size=(low_level_features.size(2), low_level_features.size(3)), mode='bilinear', align_corners=True) x_cat = torch.cat((deep_features, low_level_features), dim=1) x_cat = self.cat_conv(x_cat) x_cat = self.cls_conv(x_cat) x_cat = F.interpolate(x_cat, size=(H, W), mode='bilinear', align_corners=True) return x_cat ############################################ # 9. 测试 ############################################ if __name__ == '__main__': img = torch.randn(1, 3, 512, 512) model = DeepLab(num_classes=3, backbone="mobilenet", pretrained=False, downsample_factor=16) out = model(img) print("输出维度:", out.shape) # (1, 3, 512, 512) print("DeepLabV3+ 参数数量:", count_parameters(model))
**WA_ASPP:**

- `dim_out=128` (originally 256), so all five branches are finally projected to 128 channels;
- `num_heads=2`, `mlp_ratio=2.0`, while `attn_dropout=0.1` and `ffn_dropout=0.1` can be kept.

**DeepLab:**

- `shortcut_conv` now outputs 24 channels (originally 48), reducing the input channels of the following convolutions;
- `cls_conv` is still 128 -> `num_classes`.

In this way a large number of parameters are removed without changing the overall structure (branches, concatenation logic, DeepLab pipeline), simply by lowering channel widths, head counts and the FFN expansion ratio. The actual performance (mIoU, FPS, etc.) has to be evaluated after training, but in principle this is markedly slimmer than the earlier 20-30M-parameter version. A rough count of where the per-block savings come from follows.
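The estimate below covers only the linear layers of one TransformerBlock operating on the 320-channel windows (LayerNorm and dropout ignored); each of the three window-attention branches carries one such block, so the difference is roughly tripled across the full WA_ASPP:

```python
def window_block_params(dim=320, mlp_ratio=4.0):
    """Approximate weight + bias count of one TransformerBlock (qkv, out-proj, FFN only)."""
    qkv = dim * 3 * dim + 3 * dim          # nn.Linear(dim, 3*dim, bias=True)
    proj = dim * dim + dim                 # output projection
    hidden = int(dim * mlp_ratio)
    ffn = dim * hidden + hidden + hidden * dim + dim
    return qkv + proj + ffn

print(window_block_params(320, 4.0))   # 1231680
print(window_block_params(320, 2.0))   # 821440
# Lowering num_heads alone does not shrink these layers, since the qkv/proj
# sizes depend only on `dim`.
```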
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
from nets.xception import xception
from nets.mobilenetv2 import mobilenetv2
############################################
############################################
class MobileNetV2(nn.Module):
def __init__(self, downsample_factor=8, pretrained=True):
super(MobileNetV2, self).__init__()
from functools import partial
model = mobilenetv2(pretrained)
self.features = model.features[:-1]
textself.total_idx = len(self.features) self.down_idx = [2, 4, 7, 14] if downsample_factor == 8: for i in range(self.down_idx[-2], self.down_idx[-1]): self.features[i].apply(partial(self._nostride_dilate, dilate=2)) for i in range(self.down_idx[-1], self.total_idx): self.features[i].apply(partial(self._nostride_dilate, dilate=4)) elif downsample_factor == 16: for i in range(self.down_idx[-1], self.total_idx): self.features[i].apply(partial(self._nostride_dilate, dilate=2)) def _nostride_dilate(self, m, dilate): classname = m.__class__.__name__ if classname.find('Conv') != -1: if m.stride == (2, 2): m.stride = (1, 1) if m.kernel_size == (3, 3): m.dilation = (dilate // 2, dilate // 2) m.padding = (dilate // 2, dilate // 2) else: if m.kernel_size == (3, 3): m.dilation = (dilate, dilate) m.padding = (dilate, dilate) def forward(self, x): # 前4层 -> low_level_features low_level_features = self.features[:4](x) # 后面 -> deep_features x = self.features[4:](low_level_features) return low_level_features, x
############################################
############################################
def count_parameters(model):
"""统计模型参数数量"""
return sum(p.numel() for p in model.parameters() if p.requires_grad)
############################################
############################################
from einops import rearrange
class MobileViTAttentionBlock(nn.Module):
"""
(B, in_channels, H, W) => (B, in_channels, H, W)
减参思路:
- transformer_dim 从128降到96
- transformer_depth从2降到1
- heads从4降到2
- mlp_dim从256降到192
"""
def __init__(self,
in_channels=320,
patch_size=(2,2),
transformer_depth=1, # 减少深度
transformer_dim=96, # 减少embedding
transformer_mlp_dim=192, # 减少FFN维度
heads=2, # 减少多头
dropout=0.0):
super().__init__()
self.ph, self.pw = patch_size
text# (1) 局部卷积 self.local_conv1 = nn.Conv2d(in_channels, in_channels, 3, padding=1, bias=False) self.local_bn1 = nn.BatchNorm2d(in_channels) self.local_act1 = nn.SiLU() # (2) conv映射 => transformer_dim self.conv_proj_in = nn.Conv2d(in_channels, transformer_dim, kernel_size=1, bias=False) self.bn_proj_in = nn.BatchNorm2d(transformer_dim) self.act_proj_in = nn.SiLU() # 多层Transformer (此处只1层) self.transformer = TransformerBlockGroup( dim=transformer_dim, depth=transformer_depth, heads=heads, mlp_dim=transformer_mlp_dim, dropout=dropout ) # 映射回 in_channels self.conv_proj_out = nn.Conv2d(transformer_dim, in_channels, kernel_size=1, bias=False) self.bn_proj_out = nn.BatchNorm2d(in_channels) self.act_proj_out = nn.SiLU() # 最后 concat 并做 3x3融合 self.local_conv2 = nn.Conv2d(in_channels*2, in_channels, 3, padding=1, bias=False) self.local_bn2 = nn.BatchNorm2d(in_channels) self.local_act2 = nn.SiLU() def forward(self, x): # 备份原输入 y = x.clone() # 1) 局部卷积 x = self.local_conv1(x) x = self.local_bn1(x) x = self.local_act1(x) # 2) conv => transformer_dim x = self.conv_proj_in(x) x = self.bn_proj_in(x) x = self.act_proj_in(x) B, D, H, W = x.shape # patch切分 assert H%self.ph==0 and W%self.pw==0, "H,W需能被patch_size整除" x = rearrange(x, 'b d (hh ph) (ww pw) -> b (hh ww) (ph pw) d', ph=self.ph, pw=self.pw) # 3) transformer x = self.transformer(x) # 拼回 x = rearrange(x, 'b (hh ww) (ph pw) d -> b d (hh ph) (ww pw)', hh=H//self.ph, ww=W//self.pw, ph=self.ph, pw=self.pw) # 映射回 in_channels x = self.conv_proj_out(x) x = self.bn_proj_out(x) x = self.act_proj_out(x) # concat + 3x3融合 x = torch.cat([x, y], dim=1) x = self.local_conv2(x) x = self.local_bn2(x) x = self.local_act2(x) return x
class TransformerBlockGroup(nn.Module):
def __init__(self, dim, depth, heads, mlp_dim, dropout=0.0):
super().__init__()
self.layers = nn.ModuleList([
TransformerBlockLayer(dim, heads, mlp_dim, dropout)
for _ in range(depth)
])
def forward(self, x):
for layer in self.layers:
x = layer(x)
return x
class TransformerBlockLayer(nn.Module):
def __init__(self, dim, heads, mlp_dim, dropout=0.0):
super().__init__()
self.norm1 = nn.LayerNorm(dim)
self.attn = MultiHeadAttention(dim, heads, dropout)
self.norm2 = nn.LayerNorm(dim)
self.ffn = FeedForward(dim, mlp_dim, dropout)
def forward(self, x):
shortcut = x
x_ln = self.norm1(x)
x_attn = self.attn(x_ln)
x = x_attn + shortcut
shortcut2 = x
x_ln2 = self.norm2(x)
x_ffn = self.ffn(x_ln2)
x = x_ffn + shortcut2
return x
class MultiHeadAttention(nn.Module):
def __init__(self, dim, heads=2, dropout=0.0):
super().__init__()
self.heads = heads
self.dim_head = dim//heads
inner_dim = self.dim_head*heads
textself.scale = self.dim_head**-0.5 self.to_qkv = nn.Linear(dim, inner_dim*3, bias=False) self.attend = nn.Softmax(dim=-1) self.proj = nn.Sequential( nn.Linear(inner_dim, dim), nn.Dropout(dropout) ) def forward(self, x): # x: (B, n_patch, patch_area, dim) B, n_patch, patch_area, dim = x.shape qkv = self.to_qkv(x) q, k, v = torch.chunk(qkv, 3, dim=-1) q = q.reshape(B, n_patch, patch_area, self.heads, self.dim_head).permute(0,1,3,2,4) k = k.reshape(B, n_patch, patch_area, self.heads, self.dim_head).permute(0,1,3,2,4) v = v.reshape(B, n_patch, patch_area, self.heads, self.dim_head).permute(0,1,3,2,4) dots = torch.matmul(q, k.transpose(-1, -2))*self.scale attn = self.attend(dots) out = torch.matmul(attn, v) # (B,n_patch,heads,patch_area,dim_head) out = out.permute(0,1,3,2,4).contiguous() out = out.view(B, n_patch, patch_area, self.heads*self.dim_head) out = self.proj(out) return out
class FeedForward(nn.Module):
def __init__(self, dim, hidden_dim, dropout=0.0):
super().__init__()
self.net = nn.Sequential(
nn.Linear(dim, hidden_dim),
nn.SiLU(),
nn.Dropout(dropout),
nn.Linear(hidden_dim, dim),
nn.Dropout(dropout)
)
def forward(self, x):
return self.net(x)
############################################
############################################
class WA_ASPP_MViT(nn.Module):
"""
五条分支:
1) branch1: 1×1 conv => (dim_in -> dim_out=128)
2) branch2/3/4: MobileViTAttentionBlock => 1×1 conv => 128
- mobilevit维度大幅缩减
3) branch5: 全局池化 => conv => GroupNorm => relu => 上采样
拼接 => 1×1 => 128
"""
def __init__(self, dim_in=320, dim_out=128,
# MobileViT相关超参(减参)
patch_size=(2,2),
mvit_depth=1,
mvit_dim=96,
mvit_mlp_dim=192,
mvit_heads=2,
mvit_dropout=0.0):
super().__init__()
text# branch1 self.branch1 = nn.Sequential( nn.Conv2d(dim_in, dim_out, kernel_size=1), nn.BatchNorm2d(dim_out), nn.ReLU(inplace=True) ) # branch2/3/4 => MobileViT注意力(减参) self.branch2_attn = MobileViTAttentionBlock( in_channels=dim_in, patch_size=patch_size, transformer_depth=mvit_depth, transformer_dim=mvit_dim, transformer_mlp_dim=mvit_mlp_dim, heads=mvit_heads, dropout=mvit_dropout ) self.branch3_attn = MobileViTAttentionBlock( in_channels=dim_in, patch_size=patch_size, transformer_depth=mvit_depth, transformer_dim=mvit_dim, transformer_mlp_dim=mvit_mlp_dim, heads=mvit_heads, dropout=mvit_dropout ) self.branch4_attn = MobileViTAttentionBlock( in_channels=dim_in, patch_size=patch_size, transformer_depth=mvit_depth, transformer_dim=mvit_dim, transformer_mlp_dim=mvit_mlp_dim, heads=mvit_heads, dropout=mvit_dropout ) self.branch2_proj = nn.Conv2d(dim_in, dim_out, kernel_size=1, bias=False) self.branch3_proj = nn.Conv2d(dim_in, dim_out, kernel_size=1, bias=False) self.branch4_proj = nn.Conv2d(dim_in, dim_out, kernel_size=1, bias=False) # branch5 => 全局池化 => conv => GroupNorm => relu => 上采样 self.branch5_conv = nn.Conv2d(dim_in, dim_out, kernel_size=1) self.branch5_gn = nn.GroupNorm(num_groups=1, num_channels=dim_out) self.branch5_relu = nn.ReLU(inplace=True) # 拼接 => 1×1 => dim_out=128 self.fuse_conv = nn.Sequential( nn.Conv2d(dim_out*5, dim_out, kernel_size=1), nn.BatchNorm2d(dim_out), nn.ReLU(inplace=True) ) def forward(self, x): B, C, H, W = x.shape # branch1 => out1 out1 = self.branch1(x) # => (B,128,H,W) # branch2 => MViT => proj => add out2 = self.branch2_attn(x) out2 = self.branch2_proj(out2) out2_agg = out1 + out2 # branch3 out3 = self.branch3_attn(x) out3 = self.branch3_proj(out3) out3_agg = out1 + out2_agg + out3 # branch4 out4 = self.branch4_attn(x) out4 = self.branch4_proj(out4) out4_agg = out1 + out2_agg + out3_agg + out4 # branch5 => 全局池化 => conv => GroupNorm => relu => 上采样 gp = F.adaptive_avg_pool2d(x, 1) # (B, C, 1,1) gp = self.branch5_conv(gp) # => (B,128,1,1) gp = self.branch5_gn(gp) gp = self.branch5_relu(gp) gp = F.interpolate(gp, size=(H, W), mode='bilinear', align_corners=False) # 拼接 => fuse => (B,128,H,W) out_cat = torch.cat([out1, out2_agg, out3_agg, out4_agg, gp], dim=1) out = self.fuse_conv(out_cat) return out
############################################
############################################
class DeepLab(nn.Module):
def __init__(self, num_classes, backbone="mobilenet", pretrained=True, downsample_factor=16):
super(DeepLab, self).__init__()
if backbone == "xception":
self.backbone = xception(downsample_factor=downsample_factor, pretrained=pretrained)
in_channels = 2048
low_level_channels = 256
elif backbone == "mobilenet":
self.backbone = MobileNetV2(downsample_factor=downsample_factor, pretrained=pretrained)
# ----------------------------------#
# mobilenetv2 输出deep_features: 320通道
# 浅层特征 low_level_features: 24通道
# ----------------------------------#
in_channels = 320
low_level_channels = 24
else:
raise ValueError('Unsupported backbone - `{}`, Use mobilenet, xception.'.format(backbone))
text# 将 ASPP 输出通道减少到 128 self.aspp = WA_ASPP_MViT( dim_in = in_channels, dim_out = 128, # 减少到 128 patch_size = (2,2), mvit_depth = 1, # Transformer层数=1 mvit_dim = 96, # Transformer dim=96 mvit_mlp_dim = 192, # MLP dim=192 mvit_heads=2, # heads=2 mvit_dropout=0.0 ) # low_level => 1×1 => 24 (也减小) self.shortcut_conv = nn.Sequential( nn.Conv2d(low_level_channels, 24, 1), nn.BatchNorm2d(24), nn.ReLU(inplace=True) ) # 拼接 => cat_conv => cls_conv # 这里(24 + 128) => 152, 可以再让cat_conv输出通道也减为128 self.cat_conv = nn.Sequential( nn.Conv2d(24+128, 128, 3, padding=1), nn.BatchNorm2d(128), nn.ReLU(inplace=True), nn.Dropout(0.5), nn.Conv2d(128, 128, 3, padding=1), nn.BatchNorm2d(128), nn.ReLU(inplace=True), nn.Dropout(0.1) ) self.cls_conv = nn.Conv2d(128, num_classes, 1) def forward(self, x): H, W = x.shape[2], x.shape[3] # backbone => (low_level=32, deep=320) low_level, deep = self.backbone(x) # deep => MViT ASPP (减参) deep = self.aspp(deep) # low_level => 1×1 => 24 low_level = self.shortcut_conv(low_level) # 上采样 deep => 拼接 => cat_conv => cls_conv => 上采样 deep_up = F.interpolate(deep, size=(low_level.size(2), low_level.size(3)), mode='bilinear', align_corners=False) x_cat = torch.cat([deep_up, low_level], dim=1) # => (B,152,H/?,W/?) x_cat = self.cat_conv(x_cat) x_cat = self.cls_conv(x_cat) x_cat = F.interpolate(x_cat, size=(H, W), mode='bilinear', align_corners=False) return x_cat
############################################
############################################
if name == "main":
model = DeepLab(num_classes=3, backbone="mobilenet", pretrained=False, downsample_factor=16)
print("模型参数总数:", count_parameters(model))
text# 构造单张图 (batch=1, channel=3, 512x512) img = torch.randn(1, 3, 512, 512) pred = model(img) print("输出维度:", pred.shape)
C:\Users\wztbkl\anaconda3\envs\wuzhupytorch\python.exe C:\Users\wztbkl\PycharmProjects\deeplabv3-plus-pytorch-main\nets\mobileVitASPP_deeplabV3.py
模型参数总数: 11132491
输出维度: torch.Size([1, 3, 512, 512])
进程已结束,退出代码为 0
This is the current parameter count. Where else can the model be trimmed to reduce the number of parameters?
In the current "slimmed-down" DeepLab + MobileViT ASPP you have already cut a fair number of parameters, for example by reducing the ASPP `dim_out` from 256 to 128. If you want to shrink the model further without major changes to the overall architecture, the following optimizations are worth trying:
Shrink the backbone with `width_mult`

If you are using a "full" mobilenetv2 implementation, you can usually lower the overall channel width by setting `width_mult < 1.0`, which reduces the convolution parameters between every pair of layers:

```python
# For example:
model = mobilenetv2(pretrained=False, width_mult=0.75)
```

Every layer's channel count then scales by 0.75 and the total parameter count drops noticeably. Since you keep the `downsample_factor` dilation handling, you can leave the 8× or 16× downsampling unchanged and simply add `width_mult=0.75` or `width_mult=0.5` to make the backbone substantially lighter.

If your mobilenetv2 implementation exposes an `expand_ratio` (usually 6), lowering it to 4 or 3 shrinks the expanded channels of the intermediate layers and trims some more parameters and compute; a rough estimate of the effect is sketched right after this paragraph.
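To make the effect of a smaller `expand_ratio` concrete, here is a rough, illustrative sketch (not code from your mobilenetv2 implementation; the helper name `inverted_residual_params` is invented for this example) that estimates the weight count of one standard inverted-residual block:

```python
# Rough weight count of one MobileNetV2-style inverted residual block:
# 1x1 expand conv -> 3x3 depthwise conv -> 1x1 project conv (BN/bias terms ignored).
def inverted_residual_params(c_in, c_out, expand_ratio):
    hidden = c_in * expand_ratio
    expand = c_in * hidden          # 1x1 pointwise expansion
    depthwise = hidden * 3 * 3      # 3x3 depthwise (one filter per channel)
    project = hidden * c_out        # 1x1 pointwise projection
    return expand + depthwise + project

for t in (6, 4, 3):
    print(t, inverted_residual_params(c_in=64, c_out=64, expand_ratio=t))
# expand_ratio=3 costs roughly half of expand_ratio=6 for the same block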
The current WA_ASPP has five branches (a 1×1 conv, three window-attention branches, and global pooling). If you can afford to lose some multi-scale context, you can keep fewer of them: every branch you drop removes its convolution and attention parameters. For example, keeping only branch1 + branch2 + branch5 deletes all of the attention and projection layers of branch3/4.

Even if you keep all five branches, you can make the output `dim_out=128` of branch1 / branch2_proj / branch5_conv smaller (say 96 or 64), as long as the later cat_conv is adjusted to match. Of course, shrinking the channels too far may hurt representational capacity.
In the DeepLab decoder you have already reduced the low-level features from 48 to 24 channels, but cat_conv still outputs 128 channels. You can lower its output to 64 so that `cls_conv` maps 64 -> num_classes, and keep a single 3×3 conv before `cls_conv` instead of two 3×3 convs + Dropout, which removes a few more kernels. For example:

```python
# cat_conv reduced to a single convolution
self.cat_conv = nn.Sequential(
    nn.Conv2d(24 + 128, 64, 3, padding=1),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
    nn.Dropout(0.5)
)
self.cls_conv = nn.Conv2d(64, num_classes, 1)
```

This trims the decoder part a bit further.
If you want a large cut in attention parameters, you can keep MobileViTAttention in only one ASPP branch (e.g. branch2) and use plain or dilated convolutions in the others, which removes several copies of the TransformerBlock projection matrices.

In short: shrink the backbone (`width_mult < 1.0`) or reduce the number of ASPP branches; lower the ASPP `dim_out` and the decoder cat_conv output channels further (e.g. to 64), keeping only the essential feature paths. A quick way to compare parameter counts after each change is sketched below.
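As a quick way to check how much each change buys you, a minimal sketch along these lines (reusing the `WA_ASPP_MViT` class and the `count_parameters` helper already in this file; the smaller hyperparameter values are only illustrative) compares the default and a slimmer ASPP head:

```python
# Compare the default ASPP head with a slimmer configuration.
default_aspp = WA_ASPP_MViT(dim_in=320, dim_out=128,
                            mvit_depth=1, mvit_dim=96, mvit_mlp_dim=192, mvit_heads=2)
slim_aspp = WA_ASPP_MViT(dim_in=320, dim_out=64,      # narrower branch outputs
                         mvit_depth=1, mvit_dim=64,   # smaller transformer width
                         mvit_mlp_dim=128, mvit_heads=2)

print("default ASPP params:", count_parameters(default_aspp))
print("slim    ASPP params:", count_parameters(slim_aspp))
```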
Start Train
Why is training still fast even though the parameter count is this large?
For a segmentation network, a large parameter count does not automatically mean slow training. The main reasons you can see a high parameter count yet still keep a relatively fast training speed are:

- Parallelization efficiency on the GPU
- Memory access vs. compute bottlenecks
- Depthwise-separable convolutions / small kernels
- Many of the parameters sit in only a few branches or linear layers
- A small batch size, or distributed training
- Automatic mixed precision (AMP) / FP16 training
- A multi-worker DataLoader (with an appropriate `num_workers`), so the GPU never waits for data and keeps computing at high utilization
- Efficient convolution / attention kernels

A per-module breakdown (sketched after this list) makes the fourth point easy to verify.
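To see where the parameters actually sit, a small sketch like the following (plain PyTorch plus the `DeepLab` class from this file) prints the parameter count of each top-level submodule, which makes it easy to see which parts dominate:

```python
model = DeepLab(num_classes=3, backbone="mobilenet", pretrained=False, downsample_factor=16)

# Parameter count per top-level submodule (backbone / aspp / shortcut_conv / cat_conv / cls_conv)
for name, module in model.named_children():
    n_params = sum(p.numel() for p in module.parameters() if p.requires_grad)
    print(f"{name:15s} {n_params:,}")
```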
How does this differ from the window attention mechanism in Swin Transformer?
The window attention you are using and the window attention in Swin Transformer share the same basic idea: partition the feature map into small windows/patches and run self-attention inside each local window. The main differences in implementation and purpose are:

- Role in the network: in Swin Transformer, window attention is the basic building block of a complete hierarchical backbone; in your case (WA_ASPP / MobileViTAttentionBlock), it is inserted locally into a CNN module (the ASPP head).
- Window shifting and cross-window interaction: Swin alternates regular and shifted (cyclically rolled) windows so information can flow across window borders; the current ASPP / MobileViT blocks use fixed patches with no shift.
- Evolution of layers and channels: Swin downsamples stage by stage, growing the channel width as resolution shrinks; the plug-in window attention here keeps the resolution and channel count of the feature map it is attached to.

So both rest on local window self-attention, but Swin Transformer is a hierarchical, shifted-window backbone, while your window attention is a lightweight plug-in inside a CNN, with no hierarchical downsampling or shift mechanism and no large change to the surrounding network. Same origin, quite different usage and implementation; the small sketch below shows what Swin's shift step actually does.
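For intuition, here is a minimal, standalone sketch (plain PyTorch, unrelated to your model code; the function name `window_partition` is ours) of the two ingredients Swin adds on top of plain window attention: non-overlapping window partitioning and the cyclic shift applied in alternating blocks:

```python
import torch

def window_partition(x, ws):
    # x: (B, H, W, C) -> (num_windows*B, ws, ws, C), non-overlapping windows
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws, ws, C)

x = torch.randn(1, 8, 8, 96)           # (B, H, W, C) feature map
windows = window_partition(x, ws=4)     # plain window attention operates per window

# Swin's extra step: cyclically shift the map by half a window before partitioning,
# so the next layer's windows straddle the previous layer's window boundaries.
shifted = torch.roll(x, shifts=(-2, -2), dims=(1, 2))
shifted_windows = window_partition(shifted, ws=4)
print(windows.shape, shifted_windows.shape)   # (4, 4, 4, 96) both times
```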
class WA_ASPP_MViT(nn.Module):
"""
五条分支:
1) branch1: 1×1 conv => (dim_in -> dim_out=128)
2) branch2/3/4: MobileViTAttentionBlock => 1×1 conv => 128
- mobilevit维度大幅缩减
3) branch5: 全局池化 => conv => GroupNorm => relu => 上采样
拼接 => 1×1 => 128
"""
def init(self, dim_in=320, dim_out=128,
# MobileViT相关超参(减参)
patch_size=(2,2),
mvit_depth=1,
mvit_dim=96,
mvit_mlp_dim=192,
mvit_heads=2,
mvit_dropout=0.0):
super().init()
text# branch1 self.branch1 = nn.Sequential( nn.Conv2d(dim_in, dim_out, kernel_size=1), nn.BatchNorm2d(dim_out), nn.ReLU(inplace=True) ) # branch2/3/4 => MobileViT注意力(减参) self.branch2_attn = MobileViTAttentionBlock( in_channels=dim_in, patch_size=patch_size, transformer_depth=mvit_depth, transformer_dim=mvit_dim, transformer_mlp_dim=mvit_mlp_dim, heads=mvit_heads, dropout=mvit_dropout ) self.branch3_attn = MobileViTAttentionBlock( in_channels=dim_in, patch_size=patch_size, transformer_depth=mvit_depth, transformer_dim=mvit_dim, transformer_mlp_dim=mvit_mlp_dim, heads=mvit_heads, dropout=mvit_dropout ) self.branch4_attn = MobileViTAttentionBlock( in_channels=dim_in, patch_size=patch_size, transformer_depth=mvit_depth, transformer_dim=mvit_dim, transformer_mlp_dim=mvit_mlp_dim, heads=mvit_heads, dropout=mvit_dropout ) self.branch2_proj = nn.Conv2d(dim_in, dim_out, kernel_size=1, bias=False) self.branch3_proj = nn.Conv2d(dim_in, dim_out, kernel_size=1, bias=False) self.branch4_proj = nn.Conv2d(dim_in, dim_out, kernel_size=1, bias=False) # branch5 => 全局池化 => conv => GroupNorm => relu => 上采样 self.branch5_conv = nn.Conv2d(dim_in, dim_out, kernel_size=1) self.branch5_gn = nn.GroupNorm(num_groups=1, num_channels=dim_out) self.branch5_relu = nn.ReLU(inplace=True) # 拼接 => 1×1 => dim_out=128 self.fuse_conv = nn.Sequential( nn.Conv2d(dim_out*5, dim_out, kernel_size=1), nn.BatchNorm2d(dim_out), nn.ReLU(inplace=True) ) def forward(self, x): B, C, H, W = x.shape # branch1 => out1 out1 = self.branch1(x) # => (B,128,H,W) # branch2 => MViT => proj => add out2 = self.branch2_attn(x) out2 = self.branch2_proj(out2) out2_agg = out1 + out2 # branch3 out3 = self.branch3_attn(x) out3 = self.branch3_proj(out3) out3_agg = out1 + out2_agg + out3 # branch4 out4 = self.branch4_attn(x) out4 = self.branch4_proj(out4) out4_agg = out1 + out2_agg + out3_agg + out4 # branch5 => 全局池化 => conv => GroupNorm => relu => 上采样 gp = F.adaptive_avg_pool2d(x, 1) # (B, C, 1,1) gp = self.branch5_conv(gp) # => (B,128,1,1) gp = self.branch5_gn(gp) gp = self.branch5_relu(gp) gp = F.interpolate(gp, size=(H, W), mode='bilinear', align_corners=False) # 拼接 => fuse => (B,128,H,W) out_cat = torch.cat([out1, out2_agg, out3_agg, out4_agg, gp], dim=1) out = self.fuse_conv(out_cat) return out
Change the final output instead of the current accumulated ("stacked") output, since that comes from someone else's paper, and give this module a new, fitting name.
Below is an example that renames the original WA_ASPP_MViT module and removes the cross-branch accumulation (`out1 + out2`, `out1 + out2_agg + out3`, and so on). Instead, each branch's output is concatenated in parallel and then fused into the final output. This avoids reproducing the "stacking" logic of the original paper and gives the module a new, independent name such as "LiteMobileViTASPP". Example code:
pythonimport torch import torch.nn as nn import torch.nn.functional as F from einops import rearrange ############################################ # MobileViTAttentionBlock (减参思路) # 仅供演示 ############################################ class MobileViTAttentionBlock(nn.Module): def __init__( self, in_channels=320, patch_size=(2,2), transformer_depth=1, transformer_dim=96, transformer_mlp_dim=192, heads=2, dropout=0.0 ): super().__init__() self.ph, self.pw = patch_size # (1) 局部卷积 self.local_conv1 = nn.Conv2d(in_channels, in_channels, 3, padding=1, bias=False) self.local_bn1 = nn.BatchNorm2d(in_channels) self.local_act1 = nn.SiLU() # (2) conv => transformer_dim self.conv_proj_in = nn.Conv2d(in_channels, transformer_dim, 1, bias=False) self.bn_proj_in = nn.BatchNorm2d(transformer_dim) self.act_proj_in = nn.SiLU() # (3) 简化的 Transformer => 只 depth=1 self.transformer = TransformerBlockGroup( dim=transformer_dim, depth=transformer_depth, heads=heads, mlp_dim=transformer_mlp_dim, dropout=dropout ) # (4) 回到 in_channels self.conv_proj_out = nn.Conv2d(transformer_dim, in_channels, 1, bias=False) self.bn_proj_out = nn.BatchNorm2d(in_channels) self.act_proj_out = nn.SiLU() # (5) concat原输入 => 3×3卷积融合 self.local_conv2 = nn.Conv2d(in_channels*2, in_channels, 3, padding=1, bias=False) self.local_bn2 = nn.BatchNorm2d(in_channels) self.local_act2 = nn.SiLU() def forward(self, x): y = x.clone() # step1: local conv x = self.local_conv1(x) x = self.local_bn1(x) x = self.local_act1(x) # step2: proj -> transformer_dim x = self.conv_proj_in(x) x = self.bn_proj_in(x) x = self.act_proj_in(x) B, D, H, W = x.shape assert (H % self.ph == 0) and (W % self.pw == 0), "H,W需能被patch_size整除" # patch切分 => (B, n_patch, patch_area, D) x = rearrange(x, 'b d (hh ph) (ww pw) -> b (hh ww) (ph pw) d', ph=self.ph, pw=self.pw) # step3: Transformer x = self.transformer(x) # 拼回 (B, D, H, W) x = rearrange(x, 'b (hh ww) (ph pw) d -> b d (hh ph) (ww pw)', hh=H//self.ph, ww=W//self.pw, ph=self.ph, pw=self.pw) # step4: proj out => in_channels x = self.conv_proj_out(x) x = self.bn_proj_out(x) x = self.act_proj_out(x) # step5: concat => 3x3 => fuse x = torch.cat([x, y], dim=1) x = self.local_conv2(x) x = self.local_bn2(x) x = self.local_act2(x) return x ############################################ # 简化TransformerBlockGroup ############################################ class TransformerBlockGroup(nn.Module): def __init__(self, dim, depth, heads, mlp_dim, dropout=0.0): super().__init__() self.layers = nn.ModuleList([ TransformerBlockLayer(dim, heads, mlp_dim, dropout) for _ in range(depth) ]) def forward(self, x): for layer in self.layers: x = layer(x) return x class TransformerBlockLayer(nn.Module): def __init__(self, dim, heads, mlp_dim, dropout=0.0): super().__init__() self.norm1 = nn.LayerNorm(dim) self.attn = MultiHeadAttention(dim, heads, dropout) self.norm2 = nn.LayerNorm(dim) self.ffn = FeedForward(dim, mlp_dim, dropout) def forward(self, x): # Self-Attn shortcut = x x_ln = self.norm1(x) x_attn = self.attn(x_ln) x = x_attn + shortcut # FFN shortcut2 = x x_ln2 = self.norm2(x) x_ffn = self.ffn(x_ln2) x = x_ffn + shortcut2 return x class MultiHeadAttention(nn.Module): def __init__(self, dim, heads=2, dropout=0.0): super().__init__() self.heads = heads self.dim_head = dim // heads inner_dim = self.dim_head * heads self.scale = self.dim_head ** -0.5 self.to_qkv = nn.Linear(dim, inner_dim * 3, bias=False) self.attend = nn.Softmax(dim=-1) self.proj = nn.Sequential( nn.Linear(inner_dim, dim), nn.Dropout(dropout) ) def forward(self, x): # x: (B, 
n_patch, patch_area, dim) B, n_patch, patch_area, dim = x.shape qkv = self.to_qkv(x) q, k, v = torch.chunk(qkv, 3, dim=-1) q = q.reshape(B, n_patch, patch_area, self.heads, self.dim_head).permute(0,1,3,2,4) k = k.reshape(B, n_patch, patch_area, self.heads, self.dim_head).permute(0,1,3,2,4) v = v.reshape(B, n_patch, patch_area, self.heads, self.dim_head).permute(0,1,3,2,4) dots = torch.matmul(q, k.transpose(-1, -2)) * self.scale attn = self.attend(dots) out = torch.matmul(attn, v) out = out.permute(0,1,3,2,4).contiguous() out = out.view(B, n_patch, patch_area, self.heads*self.dim_head) out = self.proj(out) return out class FeedForward(nn.Module): def __init__(self, dim, hidden_dim, dropout=0.0): super().__init__() self.net = nn.Sequential( nn.Linear(dim, hidden_dim), nn.SiLU(), nn.Dropout(dropout), nn.Linear(hidden_dim, dim), nn.Dropout(dropout) ) def forward(self, x): return self.net(x) ############################################ # 重新命名模块 => LiteMobileViTASPP # 并把分支输出改为“直接并行拼接”,不做累加 ############################################ class LiteMobileViTASPP(nn.Module): """ 五条分支: 1) branch1: 1×1 => dim_out (例如128) 2) branch2/3/4: MobileViTAttentionBlock => 1×1 => dim_out - 不再相互 add 3) branch5: 全局池化 => conv => GN => ReLU => 上采样 => dim_out 最后拼接 => 1×1 => dim_out """ def __init__(self, dim_in=320, dim_out=128, patch_size=(2,2), mvit_depth=1, mvit_dim=96, mvit_mlp_dim=192, mvit_heads=2, mvit_dropout=0.0): super().__init__() # branch1: 1x1 => dim_out self.branch1 = nn.Sequential( nn.Conv2d(dim_in, dim_out, kernel_size=1), nn.BatchNorm2d(dim_out), nn.ReLU(inplace=True) ) # branch2/3/4 => MobileViT注意力(减参) + 1x1 => dim_out self.branch2_attn = MobileViTAttentionBlock( in_channels=dim_in, patch_size=patch_size, transformer_depth=mvit_depth, transformer_dim=mvit_dim, transformer_mlp_dim=mvit_mlp_dim, heads=mvit_heads, dropout=mvit_dropout ) self.branch2_proj = nn.Conv2d(dim_in, dim_out, kernel_size=1, bias=False) self.branch3_attn = MobileViTAttentionBlock( in_channels=dim_in, patch_size=patch_size, transformer_depth=mvit_depth, transformer_dim=mvit_dim, transformer_mlp_dim=mvit_mlp_dim, heads=mvit_heads, dropout=mvit_dropout ) self.branch3_proj = nn.Conv2d(dim_in, dim_out, kernel_size=1, bias=False) self.branch4_attn = MobileViTAttentionBlock( in_channels=dim_in, patch_size=patch_size, transformer_depth=mvit_depth, transformer_dim=mvit_dim, transformer_mlp_dim=mvit_mlp_dim, heads=mvit_heads, dropout=mvit_dropout ) self.branch4_proj = nn.Conv2d(dim_in, dim_out, kernel_size=1, bias=False) # branch5 => 全局池化 => conv => GroupNorm => relu => 上采样 => dim_out self.branch5_conv = nn.Conv2d(dim_in, dim_out, kernel_size=1) self.branch5_gn = nn.GroupNorm(num_groups=1, num_channels=dim_out) self.branch5_relu = nn.ReLU(inplace=True) # 拼接 => 1×1 => dim_out self.fuse_conv = nn.Sequential( nn.Conv2d(dim_out*5, dim_out, kernel_size=1), nn.BatchNorm2d(dim_out), nn.ReLU(inplace=True) ) def forward(self, x): B, C, H, W = x.shape # branch1 out1 = self.branch1(x) # => (B,128,H,W) # branch2 => MViT => 1x1 => no add x2 = self.branch2_attn(x) # => shape same (B,320,H,W) out2 = self.branch2_proj(x2) # => (B,128,H,W) # branch3 x3 = self.branch3_attn(x) out3 = self.branch3_proj(x3) # branch4 x4 = self.branch4_attn(x) out4 = self.branch4_proj(x4) # branch5 => global pooling => conv => GN => upsample =>(B,128,H,W) gp = F.adaptive_avg_pool2d(x, 1) # => (B,320,1,1) gp = self.branch5_conv(gp) # => (B,128,1,1) gp = self.branch5_gn(gp) gp = self.branch5_relu(gp) gp = F.interpolate(gp, size=(H, W), mode='bilinear', align_corners=False) # 拼接 => 
fuse => (B,128,H,W) out_cat = torch.cat([out1, out2, out3, out4, gp], dim=1) out = self.fuse_conv(out_cat) return out ########################################## # 仅供示例:DeepLab 整合 LiteMobileViTASPP ########################################## class DeepLabLiteMViT(nn.Module): def __init__(self, num_classes=3, backbone="mobilenet", pretrained=True, downsample_factor=16): super(DeepLabLiteMViT, self).__init__() if backbone == "xception": self.backbone = xception(downsample_factor=downsample_factor, pretrained=pretrained) in_channels = 2048 low_level_channels = 256 elif backbone == "mobilenet": self.backbone = MobileNetV2(downsample_factor=downsample_factor, pretrained=pretrained) in_channels = 320 low_level_channels = 24 else: raise ValueError("Unsupported backbone, use 'mobilenet' or 'xception'") # 使用新命名的 LiteMobileViTASPP self.aspp = LiteMobileViTASPP( dim_in = in_channels, dim_out = 128, patch_size = (2,2), mvit_depth = 1, mvit_dim = 96, mvit_mlp_dim=192, mvit_heads=2, mvit_dropout=0.0 ) # 浅层特征 => 1x1 => 24 (举例) self.shortcut_conv = nn.Sequential( nn.Conv2d(low_level_channels, 24, 1), nn.BatchNorm2d(24), nn.ReLU(inplace=True) ) # cat_conv => (24 + 128) => 128 => ... self.cat_conv = nn.Sequential( nn.Conv2d(24 + 128, 128, 3, padding=1), nn.BatchNorm2d(128), nn.ReLU(inplace=True), nn.Dropout(0.5), nn.Conv2d(128, 128, 3, padding=1), nn.BatchNorm2d(128), nn.ReLU(inplace=True), nn.Dropout(0.1), ) self.cls_conv = nn.Conv2d(128, num_classes, 1) def forward(self, x): H, W = x.size(2), x.size(3) low_level, deep = self.backbone(x) deep = self.aspp(deep) # => (B,128,H/xx,W/xx) # low-level => 1x1 => 24 low_level = self.shortcut_conv(low_level) # 上采样 => 拼接 => cat_conv => cls_conv => 再上采样 deep_up = F.interpolate(deep, size=(low_level.size(2), low_level.size(3)), mode='bilinear', align_corners=False) x_cat = torch.cat([deep_up, low_level], dim=1) x_cat = self.cat_conv(x_cat) x_cat = self.cls_conv(x_cat) x_cat = F.interpolate(x_cat, size=(H, W), mode='bilinear', align_corners=False) return x_cat ########################################## # 测试示例 ########################################## if __name__ == "__main__": model = DeepLabLiteMViT(num_classes=3, backbone="mobilenet", pretrained=False, downsample_factor=16) # 统计可学习参数 def count_parameters(model): return sum(p.numel() for p in model.parameters() if p.requires_grad) print("参数数量:", count_parameters(model)) # forward test img = torch.randn(1, 3, 512, 512) out = model(img) print("输出形状:", out.shape)
Key changes: `WA_ASPP_MViT` is renamed to `LiteMobileViTASPP` (an example name) to distinguish it from the original paper, and the accumulations `out2_agg = out1 + out2`, `out3_agg = out1 + out2_agg + out3`, etc. are replaced by independent `out2, out3, out4` that are only combined in the final `torch.cat`. In this way the `LiteMobileViTASPP` module differs from the original WA_ASPP_MViT (no accumulation, new name) while keeping the multi-branch + MobileViT attention + global pooling ideas, forming a new ASPP-style block.
Besides parallel (concatenated) output and accumulated output, are there other fusion options? Also, can you suggest module names in the style of WA_ASPP from that paper?
In a multi-branch ASPP structure, besides the two common fusion methods, "concatenate the branches" and "add the branch outputs", several other fusion strategies are worth considering. A few ideas, with example module names in the style of "WA_ASPP", are listed below:

- Weighted linear combination (Weighted Sum)
- Attention gating (Gated Fusion)
- Sequential cascading (Sequential)
- Iterative refinement (Iterative Refinement)
- A custom multi-branch "voting" mechanism

(A short sketch of the sequential-cascade option, which the later code examples do not cover, follows this list.)
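As a concrete illustration of that third option, here is a minimal sketch of sequential cascading, where each stage refines the running feature instead of all branches being fused at the end; the class name `CascadeFusion` and its exact structure are assumptions made for this example, not something from the paper:

```python
import torch
import torch.nn as nn

class CascadeFusion(nn.Module):
    """Sequential cascade: each stage takes the previously fused feature as input."""
    def __init__(self, channels, num_stages=3):
        super().__init__()
        self.stages = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=1, bias=False),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            )
            for _ in range(num_stages)
        ])

    def forward(self, x):
        out = x
        for stage in self.stages:
            out = out + stage(out)   # each stage refines the running feature
        return out

feat = torch.randn(2, 128, 32, 32)
print(CascadeFusion(128)(feat).shape)   # torch.Size([2, 128, 32, 32])
```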
If you want to keep the "WA_ASPP" flavour in the name, add a prefix or suffix that marks the new version or the new fusion method, for example:

- "WA_ASPPv2"
- "WF_ASPP" (Weighted-Fusion ASPP)
- "AG_ASPP" (Attention-Gated ASPP)
- "MSA_ASPP" (Multi-Scale-Attention ASPP)

In short, the name can summarize whichever fusion or attention strategy you want to emphasize, while keeping the "_ASPP" suffix to show it is an ASPP-style module.
The following only illustrates how to apply "weighting" or "gating" to the branches, under the new name "WA_ASPPv2".
pythonclass WA_ASPPv2(nn.Module): """ 基于原 WA_ASPP/MobileViT 结构,但分支之间用加权或门控代替简单相加/拼接 并可用“v2”在论文/项目中表明改进版本。 """ def __init__(self, dim_in=320, dim_out=128, patch_size=(2,2), mvit_depth=1, mvit_dim=96, mvit_mlp_dim=192, mvit_heads=2, mvit_dropout=0.0): super().__init__() # branch1: 1x1 => dim_out self.branch1 = nn.Sequential( nn.Conv2d(dim_in, dim_out, kernel_size=1), nn.BatchNorm2d(dim_out), nn.ReLU(inplace=True) ) # branch2: MobileViT => 1x1 => dim_out self.branch2_attn = MobileViTAttentionBlock( in_channels=dim_in, patch_size=patch_size, transformer_depth=mvit_depth, transformer_dim=mvit_dim, transformer_mlp_dim=mvit_mlp_dim, heads=mvit_heads, dropout=mvit_dropout ) self.branch2_proj = nn.Conv2d(dim_in, dim_out, kernel_size=1, bias=False) # ... 同理 branch3, branch4 # branch5: global pool => conv => gn => relu => upsample => dim_out self.branch5_conv = nn.Conv2d(dim_in, dim_out, kernel_size=1) self.branch5_gn = nn.GroupNorm(num_groups=1, num_channels=dim_out) self.branch5_relu = nn.ReLU(inplace=True) # 另外引入可学习的权重 # 例如: alpha1 ~ alpha5, 每个分支一个 self.alphas = nn.Parameter(torch.ones(5), requires_grad=True) # fuse self.fuse_conv = nn.Sequential( nn.Conv2d(dim_out, dim_out, kernel_size=1), nn.BatchNorm2d(dim_out), nn.ReLU(inplace=True) ) def forward(self, x): B, C, H, W = x.shape out1 = self.branch1(x) x2 = self.branch2_attn(x) out2 = self.branch2_proj(x2) # ... # branch3, branch4 同理 # for brevity, let's skip them here out3 = out1 out4 = out1 # branch5 gp = F.adaptive_avg_pool2d(x, 1) gp = self.branch5_conv(gp) gp = self.branch5_gn(gp) gp = self.branch5_relu(gp) gp = F.interpolate(gp, size=(H, W), mode='bilinear', align_corners=False) # “加权和”的方式 # alpha1, alpha2, alpha3, alpha4, alpha5 alpha_norm = F.softmax(self.alphas, dim=0) # 让它们在 [0,1], sum=1 # 这里仅做演示 out = alpha_norm[0]*out1 + alpha_norm[1]*out2 + alpha_norm[2]*out3 + \ alpha_norm[3]*out4 + alpha_norm[4]*gp # 或者 "门控": out = gate1 * out1 + gate2 * out2 + ... # ... # 最终 => 1×1 => dim_out out = self.fuse_conv(out) return out
In this "v2" implementation you can see another fusion style: each branch output is scaled by a learnable scalar weight and then summed, instead of being concatenated or accumulated one by one. Together with the fuse conv for channel integration, this is the third option, linear weighted fusion. As for naming, anything that keeps the `_ASPP` suffix is fine.

Give the complete code with attention gating added.
Below is a complete code example that adds attention gating to the multi-branch ASPP. It shows how, after each branch's output, a simple attention gate (an SE-style channel attention) adaptively controls that branch's contribution to the final output.

In this example the module is called AG_ASPP (Attention-Gated ASPP). Unlike plain summation or parallel concatenation, every branch output is first gated along the channel dimension, and the gated results are then concatenated channel-wise (or merged at the end). If you prefer the summation style, the fusion logic is easy to change. The code is for demonstration only; adjust the details to your needs.
pythonimport torch import torch.nn as nn import torch.nn.functional as F ############################################ # 简易 MobileViTAttentionBlock(可换成您已有的窗口注意力或其他模块) ############################################ class MobileViTAttentionBlock(nn.Module): def __init__( self, in_channels=320, patch_size=(2,2), transformer_depth=1, transformer_dim=96, transformer_mlp_dim=192, heads=2, dropout=0.0 ): super().__init__() self.ph, self.pw = patch_size # 局部卷积 self.local_conv1 = nn.Conv2d(in_channels, in_channels, 3, padding=1, bias=False) self.local_bn1 = nn.BatchNorm2d(in_channels) self.local_act1 = nn.SiLU() # Conv => transformer_dim self.conv_proj_in = nn.Conv2d(in_channels, transformer_dim, 1, bias=False) self.bn_proj_in = nn.BatchNorm2d(transformer_dim) self.act_proj_in = nn.SiLU() # 省略: TransformerBlockGroup 等,您可自行引用 self.transformer = nn.Identity() # 这里简化为 Identity 以演示 # 回到 in_channels self.conv_proj_out = nn.Conv2d(transformer_dim, in_channels, 1, bias=False) self.bn_proj_out = nn.BatchNorm2d(in_channels) self.act_proj_out = nn.SiLU() # 拼接后再 3×3 self.local_conv2 = nn.Conv2d(in_channels*2, in_channels, 3, padding=1, bias=False) self.local_bn2 = nn.BatchNorm2d(in_channels) self.local_act2 = nn.SiLU() def forward(self, x): y = x.clone() x = self.local_conv1(x) x = self.local_bn1(x) x = self.local_act1(x) x = self.conv_proj_in(x) x = self.bn_proj_in(x) x = self.act_proj_in(x) # TransformerBlock (此处用 Identity 代替) # 在真实实现中,请插入您的窗口注意力或Transformer x = self.transformer(x) x = self.conv_proj_out(x) x = self.bn_proj_out(x) x = self.act_proj_out(x) x = torch.cat([x, y], dim=1) x = self.local_conv2(x) x = self.local_bn2(x) x = self.local_act2(x) return x ############################################ # Attention Gate: 用于对每条分支输出做通道注意力 # 类似SE模块,但可以更简化 ############################################ class AttnGate(nn.Module): """ 对输入特征 (B, C, H, W) 做通道注意力: 先全局池化 => (B, C, 1, 1) => (fc/conv) => 激活 => 再映射到(C通道) 用 sigmoid 产生门控权重 => 逐通道乘回 """ def __init__(self, channels, reduction=16): super().__init__() hidden_dim = max(1, channels // reduction) self.pool = nn.AdaptiveAvgPool2d(1) self.fc = nn.Sequential( nn.Linear(channels, hidden_dim), nn.ReLU(inplace=True), nn.Linear(hidden_dim, channels), nn.Sigmoid() ) def forward(self, x): B, C, _, _ = x.shape # 全局池化 y = self.pool(x).view(B, C) # => (B, C) y = self.fc(y) # => (B, C) y = y.view(B, C, 1, 1) return x * y # 广播乘 => (B, C, H, W) ############################################ # AG_ASPP: 5条分支 + MobileViT + Attention Gate ############################################ class AG_ASPP(nn.Module): """ 5条分支: 1) branch1: 1×1 => dim_out 2) branch2/3/4: MobileViTAttentionBlock => 1×1 => dim_out => +AttnGate 3) branch5: 全局池化 => conv => GN/BN => ReLU => up => dim_out => +AttnGate 最终将5路输出 "并列拼接" => 1×1 => dim_out 若想改成"加和",可自行改成 sum(...)。 """ def __init__(self, dim_in=320, dim_out=128, patch_size=(2,2), mvit_depth=1, mvit_dim=96, mvit_mlp_dim=192, mvit_heads=2, mvit_dropout=0.0, use_groupnorm=False): super().__init__() # branch1 => 1×1 => dim_out self.branch1 = nn.Sequential( nn.Conv2d(dim_in, dim_out, 1), nn.BatchNorm2d(dim_out), nn.ReLU(inplace=True) ) self.gate1 = AttnGate(dim_out, reduction=16) # branch2 => MobileViT => proj => dim_out => AttnGate self.branch2_attn = MobileViTAttentionBlock( in_channels=dim_in, patch_size=patch_size, transformer_depth=mvit_depth, transformer_dim=mvit_dim, transformer_mlp_dim=mvit_mlp_dim, heads=mvit_heads, dropout=mvit_dropout ) self.branch2_proj = nn.Conv2d(dim_in, dim_out, 1, bias=False) self.gate2 = AttnGate(dim_out, reduction=16) # branch3 => 
同理 self.branch3_attn = MobileViTAttentionBlock( in_channels=dim_in, patch_size=patch_size, transformer_depth=mvit_depth, transformer_dim=mvit_dim, transformer_mlp_dim=mvit_mlp_dim, heads=mvit_heads, dropout=mvit_dropout ) self.branch3_proj = nn.Conv2d(dim_in, dim_out, 1, bias=False) self.gate3 = AttnGate(dim_out, reduction=16) # branch4 => 同理 self.branch4_attn = MobileViTAttentionBlock( in_channels=dim_in, patch_size=patch_size, transformer_depth=mvit_depth, transformer_dim=mvit_dim, transformer_mlp_dim=mvit_mlp_dim, heads=mvit_heads, dropout=mvit_dropout ) self.branch4_proj = nn.Conv2d(dim_in, dim_out, 1, bias=False) self.gate4 = AttnGate(dim_out, reduction=16) # branch5 => 全局池化 => conv => GN/BN => ReLU => up => dim_out => AttnGate self.branch5_conv = nn.Conv2d(dim_in, dim_out, 1) if use_groupnorm: # 若要GroupNorm self.branch5_norm = nn.GroupNorm(1, dim_out) else: # 或用BatchNorm self.branch5_norm = nn.BatchNorm2d(dim_out) self.branch5_relu = nn.ReLU(inplace=True) self.gate5 = AttnGate(dim_out, reduction=16) # 最终 fuse => 1×1 => dim_out self.fuse_conv = nn.Sequential( nn.Conv2d(dim_out*5, dim_out, 1), nn.BatchNorm2d(dim_out), nn.ReLU(inplace=True) ) def forward(self, x): B, C, H, W = x.shape # branch1 out1 = self.branch1(x) # => (B,128,H,W) out1 = self.gate1(out1) # => gate通道 # branch2 x2 = self.branch2_attn(x) # => (B, in_channels, H,W) out2 = self.branch2_proj(x2) # => (B,128,H,W) out2 = self.gate2(out2) # branch3 x3 = self.branch3_attn(x) out3 = self.branch3_proj(x3) out3 = self.gate3(out3) # branch4 x4 = self.branch4_attn(x) out4 = self.branch4_proj(x4) out4 = self.gate4(out4) # branch5 => 全局池化 => conv => norm => relu => up => gate gp = F.adaptive_avg_pool2d(x, 1) gp = self.branch5_conv(gp) gp = self.branch5_norm(gp) gp = self.branch5_relu(gp) gp = F.interpolate(gp, size=(H, W), mode='bilinear', align_corners=False) gp = self.gate5(gp) # 并列拼接 => fuse out_cat = torch.cat([out1, out2, out3, out4, gp], dim=1) out = self.fuse_conv(out_cat) return out ########################################## # 使用示例 ########################################## if __name__ == "__main__": # 构造一个AG_ASPP ag_aspp = AG_ASPP( dim_in=320, dim_out=128, patch_size=(2,2), mvit_depth=1, mvit_dim=96, mvit_mlp_dim=192, mvit_heads=2, use_groupnorm=False # 或 True ) # 测试forward img = torch.randn(1, 320, 64, 64) out = ag_aspp(img) print("输入:", img.shape, "输出:", out.shape) # (1,320,64,64)->(1,128,64,64)
Summary of the pieces:

- `AttnGate`: an SE-style channel-attention gate (global pooling -> FC -> Sigmoid -> channel-wise multiplication).
- `AG_ASPP`: every branch output passes through its own `AttnGate`; after the gated outputs are concatenated with `torch.cat`, a 1×1 fusion conv (`fuse_conv`) produces the final feature. To switch from concatenation to summation, change

```python
out_cat = torch.cat([out1, out2, out3, out4, gp], dim=1)
out = self.fuse_conv(out_cat)
```

to something like

```python
out_sum = out1 + out2 + out3 + out4 + gp
out = self.fuse_conv(out_sum)
```

(note that `fuse_conv` then has to accept `dim_out` instead of `dim_out*5` input channels).
- Gating vs. a learnable scalar: the gate produces a per-channel, input-dependent weight, whereas the `alphas` in WA_ASPPv2 are a handful of global scalars shared across all inputs.

With this AG_ASPP (Attention-Gated ASPP) module you get the third fusion style: each branch first produces its feature, is gated by channel attention, and the results are finally concatenated or summed and integrated to the target dimension by a 1×1 conv. The network can thus learn which branches matter more and which should be suppressed, giving more flexible representations in the multi-branch setting, as the short check below illustrates.
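As a quick standalone check of the gate itself (reusing the `AttnGate` class defined in the code above), the snippet below confirms that the gate keeps the tensor shape and only rescales each channel by a factor in (0, 1):

```python
import torch

gate = AttnGate(channels=128, reduction=16)
feat = torch.randn(2, 128, 32, 32)
gated = gate(feat)

print(feat.shape, "->", gated.shape)    # shape is unchanged: (2, 128, 32, 32)
# Each channel is multiplied by a sigmoid weight, so the gated feature is never amplified:
print(bool((gated.abs() <= feat.abs() + 1e-6).all()))
```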
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
from nets.xception import xception
from nets.mobilenetv2 import mobilenetv2
############################################
############################################
class MobileNetV2(nn.Module):
def __init__(self, downsample_factor=8, pretrained=True):
super(MobileNetV2, self).__init__()
from functools import partial
model = mobilenetv2(pretrained)
self.features = model.features[:-1]
textself.total_idx = len(self.features) self.down_idx = [2, 4, 7, 14] if downsample_factor == 8: for i in range(self.down_idx[-2], self.down_idx[-1]): self.features[i].apply(partial(self._nostride_dilate, dilate=2)) for i in range(self.down_idx[-1], self.total_idx): self.features[i].apply(partial(self._nostride_dilate, dilate=4)) elif downsample_factor == 16: for i in range(self.down_idx[-1], self.total_idx): self.features[i].apply(partial(self._nostride_dilate, dilate=2)) def _nostride_dilate(self, m, dilate): classname = m.__class__.__name__ if classname.find('Conv') != -1: if m.stride == (2, 2): m.stride = (1, 1) if m.kernel_size == (3, 3): m.dilation = (dilate // 2, dilate // 2) m.padding = (dilate // 2, dilate // 2) else: if m.kernel_size == (3, 3): m.dilation = (dilate, dilate) m.padding = (dilate, dilate) def forward(self, x): # 前4层 -> low_level_features low_level_features = self.features[:4](x) # 后面 -> deep_features x = self.features[4:](low_level_features) return low_level_features, x
############################################
############################################
def count_parameters(model):
"""统计模型参数数量"""
return sum(p.numel() for p in model.parameters() if p.requires_grad)
############################################
############################################
from einops import rearrange
class MobileViTAttentionBlock(nn.Module):
"""
(B, in_channels, H, W) => (B, in_channels, H, W)
减参思路:
- transformer_dim 从128降到96
- transformer_depth从2降到1
- heads从4降到2
- mlp_dim从256降到192
"""
def __init__(self,
in_channels=320,
patch_size=(2,2),
transformer_depth=1, # 减少深度
transformer_dim=96, # 减少embedding
transformer_mlp_dim=192, # 减少FFN维度
heads=2, # 减少多头
dropout=0.0):
super().__init__()
self.ph, self.pw = patch_size
text# (1) 局部卷积 self.local_conv1 = nn.Conv2d(in_channels, in_channels, 3, padding=1, bias=False) self.local_bn1 = nn.BatchNorm2d(in_channels) self.local_act1 = nn.SiLU() # (2) conv映射 => transformer_dim self.conv_proj_in = nn.Conv2d(in_channels, transformer_dim, kernel_size=1, bias=False) self.bn_proj_in = nn.BatchNorm2d(transformer_dim) self.act_proj_in = nn.SiLU() # 多层Transformer (此处只1层) self.transformer = TransformerBlockGroup( dim=transformer_dim, depth=transformer_depth, heads=heads, mlp_dim=transformer_mlp_dim, dropout=dropout ) # 映射回 in_channels self.conv_proj_out = nn.Conv2d(transformer_dim, in_channels, kernel_size=1, bias=False) self.bn_proj_out = nn.BatchNorm2d(in_channels) self.act_proj_out = nn.SiLU() # 最后 concat 并做 3x3融合 self.local_conv2 = nn.Conv2d(in_channels*2, in_channels, 3, padding=1, bias=False) self.local_bn2 = nn.BatchNorm2d(in_channels) self.local_act2 = nn.SiLU() def forward(self, x): # 备份原输入 y = x.clone() # 1) 局部卷积 x = self.local_conv1(x) x = self.local_bn1(x) x = self.local_act1(x) # 2) conv => transformer_dim x = self.conv_proj_in(x) x = self.bn_proj_in(x) x = self.act_proj_in(x) B, D, H, W = x.shape # patch切分 assert H%self.ph==0 and W%self.pw==0, "H,W需能被patch_size整除" x = rearrange(x, 'b d (hh ph) (ww pw) -> b (hh ww) (ph pw) d', ph=self.ph, pw=self.pw) # 3) transformer x = self.transformer(x) # 拼回 x = rearrange(x, 'b (hh ww) (ph pw) d -> b d (hh ph) (ww pw)', hh=H//self.ph, ww=W//self.pw, ph=self.ph, pw=self.pw) # 映射回 in_channels x = self.conv_proj_out(x) x = self.bn_proj_out(x) x = self.act_proj_out(x) # concat + 3x3融合 x = torch.cat([x, y], dim=1) x = self.local_conv2(x) x = self.local_bn2(x) x = self.local_act2(x) return x
class TransformerBlockGroup(nn.Module):
def __init__(self, dim, depth, heads, mlp_dim, dropout=0.0):
super().__init__()
self.layers = nn.ModuleList([
TransformerBlockLayer(dim, heads, mlp_dim, dropout)
for _ in range(depth)
])
def forward(self, x):
for layer in self.layers:
x = layer(x)
return x
class TransformerBlockLayer(nn.Module):
def __init__(self, dim, heads, mlp_dim, dropout=0.0):
super().__init__()
self.norm1 = nn.LayerNorm(dim)
self.attn = MultiHeadAttention(dim, heads, dropout)
self.norm2 = nn.LayerNorm(dim)
self.ffn = FeedForward(dim, mlp_dim, dropout)
def forward(self, x):
shortcut = x
x_ln = self.norm1(x)
x_attn = self.attn(x_ln)
x = x_attn + shortcut
textshortcut2 = x x_ln2 = self.norm2(x) x_ffn = self.ffn(x_ln2) x = x_ffn + shortcut2 return x
class MultiHeadAttention(nn.Module):
def __init__(self, dim, heads=2, dropout=0.0):
super().__init__()
self.heads = heads
self.dim_head = dim//heads
inner_dim = self.dim_head*heads
textself.scale = self.dim_head**-0.5 self.to_qkv = nn.Linear(dim, inner_dim*3, bias=False) self.attend = nn.Softmax(dim=-1) self.proj = nn.Sequential( nn.Linear(inner_dim, dim), nn.Dropout(dropout) ) def forward(self, x): # x: (B, n_patch, patch_area, dim) B, n_patch, patch_area, dim = x.shape qkv = self.to_qkv(x) q, k, v = torch.chunk(qkv, 3, dim=-1) q = q.reshape(B, n_patch, patch_area, self.heads, self.dim_head).permute(0,1,3,2,4) k = k.reshape(B, n_patch, patch_area, self.heads, self.dim_head).permute(0,1,3,2,4) v = v.reshape(B, n_patch, patch_area, self.heads, self.dim_head).permute(0,1,3,2,4) dots = torch.matmul(q, k.transpose(-1, -2))*self.scale attn = self.attend(dots) out = torch.matmul(attn, v) # (B,n_patch,heads,patch_area,dim_head) out = out.permute(0,1,3,2,4).contiguous() out = out.view(B, n_patch, patch_area, self.heads*self.dim_head) out = self.proj(out) return out
class FeedForward(nn.Module):
def __init__(self, dim, hidden_dim, dropout=0.0):
super().__init__()
self.net = nn.Sequential(
nn.Linear(dim, hidden_dim),
nn.SiLU(),
nn.Dropout(dropout),
nn.Linear(hidden_dim, dim),
nn.Dropout(dropout)
)
def forward(self, x):
return self.net(x)
############################################
############################################
class WA_ASPP_MViT(nn.Module):
"""
五条分支:
1) branch1: 1×1 conv => (dim_in -> dim_out=128)
2) branch2/3/4: MobileViTAttentionBlock => 1×1 conv => 128
- mobilevit维度大幅缩减
3) branch5: 全局池化 => conv => GroupNorm => relu => 上采样
拼接 => 1×1 => 128
"""
def __init__(self, dim_in=320, dim_out=128,
# MobileViT相关超参(减参)
patch_size=(2,2),
mvit_depth=1,
mvit_dim=96,
mvit_mlp_dim=192,
mvit_heads=2,
mvit_dropout=0.0):
super().__init__()
text# branch1 self.branch1 = nn.Sequential( nn.Conv2d(dim_in, dim_out, kernel_size=1), nn.BatchNorm2d(dim_out), nn.ReLU(inplace=True) ) # branch2/3/4 => MobileViT注意力(减参) self.branch2_attn = MobileViTAttentionBlock( in_channels=dim_in, patch_size=patch_size, transformer_depth=mvit_depth, transformer_dim=mvit_dim, transformer_mlp_dim=mvit_mlp_dim, heads=mvit_heads, dropout=mvit_dropout ) self.branch3_attn = MobileViTAttentionBlock( in_channels=dim_in, patch_size=patch_size, transformer_depth=mvit_depth, transformer_dim=mvit_dim, transformer_mlp_dim=mvit_mlp_dim, heads=mvit_heads, dropout=mvit_dropout ) self.branch4_attn = MobileViTAttentionBlock( in_channels=dim_in, patch_size=patch_size, transformer_depth=mvit_depth, transformer_dim=mvit_dim, transformer_mlp_dim=mvit_mlp_dim, heads=mvit_heads, dropout=mvit_dropout ) self.branch2_proj = nn.Conv2d(dim_in, dim_out, kernel_size=1, bias=False) self.branch3_proj = nn.Conv2d(dim_in, dim_out, kernel_size=1, bias=False) self.branch4_proj = nn.Conv2d(dim_in, dim_out, kernel_size=1, bias=False) # branch5 => 全局池化 => conv => GroupNorm => relu => 上采样 self.branch5_conv = nn.Conv2d(dim_in, dim_out, kernel_size=1) self.branch5_gn = nn.GroupNorm(num_groups=1, num_channels=dim_out) self.branch5_relu = nn.ReLU(inplace=True) # 拼接 => 1×1 => dim_out=128 self.fuse_conv = nn.Sequential( nn.Conv2d(dim_out*5, dim_out, kernel_size=1), nn.BatchNorm2d(dim_out), nn.ReLU(inplace=True) ) def forward(self, x): B, C, H, W = x.shape # branch1 => out1 out1 = self.branch1(x) # => (B,128,H,W) # branch2 => MViT => proj => add out2 = self.branch2_attn(x) out2 = self.branch2_proj(out2) out2_agg = out1 + out2 # branch3 out3 = self.branch3_attn(x) out3 = self.branch3_proj(out3) out3_agg = out1 + out2_agg + out3 # branch4 out4 = self.branch4_attn(x) out4 = self.branch4_proj(out4) out4_agg = out1 + out2_agg + out3_agg + out4 # branch5 => 全局池化 => conv => GroupNorm => relu => 上采样 gp = F.adaptive_avg_pool2d(x, 1) # (B, C, 1,1) gp = self.branch5_conv(gp) # => (B,128,1,1) gp = self.branch5_gn(gp) gp = self.branch5_relu(gp) gp = F.interpolate(gp, size=(H, W), mode='bilinear', align_corners=False) # 拼接 => fuse => (B,128,H,W) out_cat = torch.cat([out1, out2_agg, out3_agg, out4_agg, gp], dim=1) out = self.fuse_conv(out_cat) return out
############################################
############################################
class DeepLab(nn.Module):
def __init__(self, num_classes, backbone="mobilenet", pretrained=True, downsample_factor=16):
super(DeepLab, self).__init__()
if backbone == "xception":
self.backbone = xception(downsample_factor=downsample_factor, pretrained=pretrained)
in_channels = 2048
low_level_channels = 256
elif backbone == "mobilenet":
self.backbone = MobileNetV2(downsample_factor=downsample_factor, pretrained=pretrained)
# ----------------------------------#
# mobilenetv2 输出deep_features: 320通道
# 浅层特征 low_level_features: 24通道
# ----------------------------------#
in_channels = 320
low_level_channels = 24
else:
raise ValueError('Unsupported backbone - `{}`, Use mobilenet, xception.'.format(backbone))
text# 将 ASPP 输出通道减少到 128 self.aspp = WA_ASPP_MViT( dim_in = in_channels, dim_out = 128, # 减少到 128 patch_size = (2,2), mvit_depth = 1, # Transformer层数=1 mvit_dim = 96, # Transformer dim=96 mvit_mlp_dim = 192, # MLP dim=192 mvit_heads=2, # heads=2 mvit_dropout=0.0 ) # low_level => 1×1 => 24 (也减小) self.shortcut_conv = nn.Sequential( nn.Conv2d(low_level_channels, 24, 1), nn.BatchNorm2d(24), nn.ReLU(inplace=True) ) # 拼接 => cat_conv => cls_conv # 这里(24 + 128) => 152, 可以再让cat_conv输出通道也减为128 self.cat_conv = nn.Sequential( nn.Conv2d(24+128, 128, 3, padding=1), nn.BatchNorm2d(128), nn.ReLU(inplace=True), nn.Dropout(0.5), nn.Conv2d(128, 128, 3, padding=1), nn.BatchNorm2d(128), nn.ReLU(inplace=True), nn.Dropout(0.1) ) self.cls_conv = nn.Conv2d(128, num_classes, 1) def forward(self, x): H, W = x.shape[2], x.shape[3] # backbone => (low_level=32, deep=320) low_level, deep = self.backbone(x) # deep => MViT ASPP (减参) deep = self.aspp(deep) # low_level => 1×1 => 24 low_level = self.shortcut_conv(low_level) # 上采样 deep => 拼接 => cat_conv => cls_conv => 上采样 deep_up = F.interpolate(deep, size=(low_level.size(2), low_level.size(3)), mode='bilinear', align_corners=False) x_cat = torch.cat([deep_up, low_level], dim=1) # => (B,152,H/?,W/?) x_cat = self.cat_conv(x_cat) x_cat = self.cls_conv(x_cat) x_cat = F.interpolate(x_cat, size=(H, W), mode='bilinear', align_corners=False) return x_cat
############################################
############################################
if name == "main":
model = DeepLab(num_classes=3, backbone="mobilenet", pretrained=False, downsample_factor=16)
print("模型参数总数:", count_parameters(model))
text# 构造单张图 (batch=1, channel=3, 512x512) img = torch.randn(1, 3, 512, 512) pred = model(img) print("输出维度:", pred.shape)
Modify the code above: replace the original accumulation with attention gating followed by parallel outputs, and give the complete modified code.
Below is a complete example with the following changes on top of the original `WA_ASPP_MViT` multi-branch structure: the cross-branch accumulation (e.g. `out2_agg = out1 + out2`) is removed in favour of parallel outputs, each branch gets a channel-attention gate, and the module is renamed `AG_ASPP_MViT` (Attention-Gated ASPP with MobileViT). Depending on your project, copy the code below straight into mobileVitASPP_deeplabV3.py (or a new file) and use it as is.
pythonimport torch import torch.nn as nn import torch.nn.functional as F import math from nets.xception import xception from nets.mobilenetv2 import mobilenetv2 from einops import rearrange ############################################ # 1. MobileNetV2 主干 (与原DeepLabV3+相同) ############################################ class MobileNetV2(nn.Module): def __init__(self, downsample_factor=8, pretrained=True): super(MobileNetV2, self).__init__() from functools import partial model = mobilenetv2(pretrained) self.features = model.features[:-1] self.total_idx = len(self.features) self.down_idx = [2, 4, 7, 14] if downsample_factor == 8: for i in range(self.down_idx[-2], self.down_idx[-1]): self.features[i].apply(partial(self._nostride_dilate, dilate=2)) for i in range(self.down_idx[-1], self.total_idx): self.features[i].apply(partial(self._nostride_dilate, dilate=4)) elif downsample_factor == 16: for i in range(self.down_idx[-1], self.total_idx): self.features[i].apply(partial(self._nostride_dilate, dilate=2)) def _nostride_dilate(self, m, dilate): classname = m.__class__.__name__ if classname.find('Conv') != -1: if m.stride == (2, 2): m.stride = (1, 1) if m.kernel_size == (3, 3): m.dilation = (dilate // 2, dilate // 2) m.padding = (dilate // 2, dilate // 2) else: if m.kernel_size == (3, 3): m.dilation = (dilate, dilate) m.padding = (dilate, dilate) def forward(self, x): # 前4层 -> low_level_features low_level_features = self.features[:4](x) # 后面 -> deep_features x = self.features[4:](low_level_features) return low_level_features, x ############################################ # 2. 统计可训练参数数量 ############################################ def count_parameters(model): return sum(p.numel() for p in model.parameters() if p.requires_grad) ############################################ # 简易TransformerBlockGroup (见您原先的减参思路) ############################################ class TransformerBlockGroup(nn.Module): def __init__(self, dim, depth, heads, mlp_dim, dropout=0.0): super().__init__() self.layers = nn.ModuleList([ TransformerBlockLayer(dim, heads, mlp_dim, dropout) for _ in range(depth) ]) def forward(self, x): for layer in self.layers: x = layer(x) return x class TransformerBlockLayer(nn.Module): def __init__(self, dim, heads, mlp_dim, dropout=0.0): super().__init__() self.norm1 = nn.LayerNorm(dim) self.attn = MultiHeadAttention(dim, heads, dropout) self.norm2 = nn.LayerNorm(dim) self.ffn = FeedForward(dim, mlp_dim, dropout) def forward(self, x): shortcut = x x_ln = self.norm1(x) x_attn = self.attn(x_ln) x = x_attn + shortcut shortcut2 = x x_ln2 = self.norm2(x) x_ffn = self.ffn(x_ln2) x = x_ffn + shortcut2 return x class MultiHeadAttention(nn.Module): def __init__(self, dim, heads=2, dropout=0.0): super().__init__() self.heads = heads self.dim_head = dim // heads inner_dim = self.dim_head*heads self.scale = self.dim_head**-0.5 self.to_qkv = nn.Linear(dim, inner_dim*3, bias=False) self.attend = nn.Softmax(dim=-1) self.proj = nn.Sequential( nn.Linear(inner_dim, dim), nn.Dropout(dropout) ) def forward(self, x): # x: (B, n_patch, patch_area, dim) B, n_patch, patch_area, dim = x.shape qkv = self.to_qkv(x) q, k, v = torch.chunk(qkv, 3, dim=-1) q = q.reshape(B, n_patch, patch_area, self.heads, self.dim_head).permute(0,1,3,2,4) k = k.reshape(B, n_patch, patch_area, self.heads, self.dim_head).permute(0,1,3,2,4) v = v.reshape(B, n_patch, patch_area, self.heads, self.dim_head).permute(0,1,3,2,4) dots = torch.matmul(q, k.transpose(-1, -2)) * self.scale attn = self.attend(dots) out = torch.matmul(attn, v) # => 
(B,n_patch,heads,patch_area,dim_head) out = out.permute(0,1,3,2,4).contiguous() out = out.view(B, n_patch, patch_area, self.heads*self.dim_head) out = self.proj(out) return out class FeedForward(nn.Module): def __init__(self, dim, hidden_dim, dropout=0.0): super().__init__() self.net = nn.Sequential( nn.Linear(dim, hidden_dim), nn.SiLU(), nn.Dropout(dropout), nn.Linear(hidden_dim, dim), nn.Dropout(dropout) ) def forward(self, x): return self.net(x) ############################################ # MobileViTAttentionBlock (减参) ############################################ class MobileViTAttentionBlock(nn.Module): def __init__(self, in_channels=320, patch_size=(2,2), transformer_depth=1, transformer_dim=96, transformer_mlp_dim=192, heads=2, dropout=0.0): super().__init__() self.ph, self.pw = patch_size # (1) 局部卷积 self.local_conv1 = nn.Conv2d(in_channels, in_channels, 3, padding=1, bias=False) self.local_bn1 = nn.BatchNorm2d(in_channels) self.local_act1 = nn.SiLU() # (2) conv映射 => transformer_dim self.conv_proj_in = nn.Conv2d(in_channels, transformer_dim, 1, bias=False) self.bn_proj_in = nn.BatchNorm2d(transformer_dim) self.act_proj_in = nn.SiLU() # transformer self.transformer = TransformerBlockGroup( dim=transformer_dim, depth=transformer_depth, heads=heads, mlp_dim=transformer_mlp_dim, dropout=dropout ) # 回到 in_channels self.conv_proj_out = nn.Conv2d(transformer_dim, in_channels, 1, bias=False) self.bn_proj_out = nn.BatchNorm2d(in_channels) self.act_proj_out = nn.SiLU() # concat => 3x3 self.local_conv2 = nn.Conv2d(in_channels*2, in_channels, 3, padding=1, bias=False) self.local_bn2 = nn.BatchNorm2d(in_channels) self.local_act2 = nn.SiLU() def forward(self, x): y = x.clone() # local conv x = self.local_conv1(x) x = self.local_bn1(x) x = self.local_act1(x) # proj in x = self.conv_proj_in(x) x = self.bn_proj_in(x) x = self.act_proj_in(x) B, D, H, W = x.shape assert H % self.ph == 0 and W % self.pw == 0, "H,W需能被patch_size整除" # patch => transformer x = rearrange(x, 'b d (hh ph) (ww pw) -> b (hh ww) (ph pw) d', ph=self.ph, pw=self.pw) x = self.transformer(x) x = rearrange(x, 'b (hh ww) (ph pw) d -> b d (hh ph) (ww pw)', hh=H//self.ph, ww=W//self.pw, ph=self.ph, pw=self.pw) # proj out x = self.conv_proj_out(x) x = self.bn_proj_out(x) x = self.act_proj_out(x) # concat + conv x = torch.cat([x, y], dim=1) x = self.local_conv2(x) x = self.local_bn2(x) x = self.local_act2(x) return x ############################################ # Attention Gate: 通道注意力 ############################################ class AttnGate(nn.Module): """ SE-like Gate: 全局池化 -> FC -> Sigmoid => 与输入逐通道相乘 """ def __init__(self, channels, reduction=16): super().__init__() hidden_dim = max(1, channels // reduction) self.pool = nn.AdaptiveAvgPool2d(1) self.fc = nn.Sequential( nn.Linear(channels, hidden_dim), nn.ReLU(inplace=True), nn.Linear(hidden_dim, channels), nn.Sigmoid() ) def forward(self, x): B, C, H, W = x.shape y = self.pool(x).view(B, C) # => (B, C) y = self.fc(y) # => (B, C) y = y.view(B, C, 1, 1) return x * y # 通道注意力 ############################################ # 三、AG_ASPP_MViT (Attention Gated + MobileViT) ############################################ class AG_ASPP_MViT(nn.Module): """ 改动: - 分支输出不再相加, 而是每条分支独立, 并在输出后加一个AttnGate - 最终并列拼接 => fuse """ def __init__(self, dim_in=320, dim_out=128, patch_size=(2,2), mvit_depth=1, mvit_dim=96, mvit_mlp_dim=192, mvit_heads=2, mvit_dropout=0.0, use_groupnorm=False): super().__init__() # branch1 self.branch1 = nn.Sequential( nn.Conv2d(dim_in, dim_out, kernel_size=1), 
nn.BatchNorm2d(dim_out), nn.ReLU(inplace=True) ) self.gate1 = AttnGate(dim_out, reduction=16) # branch2 => MobileViT => 1x1 => gate self.branch2_attn = MobileViTAttentionBlock( in_channels=dim_in, patch_size=patch_size, transformer_depth=mvit_depth, transformer_dim=mvit_dim, transformer_mlp_dim=mvit_mlp_dim, heads=mvit_heads, dropout=mvit_dropout ) self.branch2_proj = nn.Conv2d(dim_in, dim_out, kernel_size=1, bias=False) self.gate2 = AttnGate(dim_out, reduction=16) # branch3 self.branch3_attn = MobileViTAttentionBlock( in_channels=dim_in, patch_size=patch_size, transformer_depth=mvit_depth, transformer_dim=mvit_dim, transformer_mlp_dim=mvit_mlp_dim, heads=mvit_heads, dropout=mvit_dropout ) self.branch3_proj = nn.Conv2d(dim_in, dim_out, kernel_size=1, bias=False) self.gate3 = AttnGate(dim_out, reduction=16) # branch4 self.branch4_attn = MobileViTAttentionBlock( in_channels=dim_in, patch_size=patch_size, transformer_depth=mvit_depth, transformer_dim=mvit_dim, transformer_mlp_dim=mvit_mlp_dim, heads=mvit_heads, dropout=mvit_dropout ) self.branch4_proj = nn.Conv2d(dim_in, dim_out, kernel_size=1, bias=False) self.gate4 = AttnGate(dim_out, reduction=16) # branch5 => global pooling => conv => norm => relu => up => gate self.branch5_conv = nn.Conv2d(dim_in, dim_out, kernel_size=1) if use_groupnorm: self.branch5_norm = nn.GroupNorm(1, dim_out) else: self.branch5_norm = nn.BatchNorm2d(dim_out) self.branch5_relu = nn.ReLU(inplace=True) self.gate5 = AttnGate(dim_out, reduction=16) # fuse => 1x1 => dim_out self.fuse_conv = nn.Sequential( nn.Conv2d(dim_out*5, dim_out, 1), nn.BatchNorm2d(dim_out), nn.ReLU(inplace=True) ) def forward(self, x): B, C, H, W = x.shape # branch1 out1 = self.branch1(x) out1 = self.gate1(out1) # branch2 x2 = self.branch2_attn(x) out2 = self.branch2_proj(x2) out2 = self.gate2(out2) # branch3 x3 = self.branch3_attn(x) out3 = self.branch3_proj(x3) out3 = self.gate3(out3) # branch4 x4 = self.branch4_attn(x) out4 = self.branch4_proj(x4) out4 = self.gate4(out4) # branch5 => global pool => conv => norm => relu => up => gate gp = F.adaptive_avg_pool2d(x, 1) gp = self.branch5_conv(gp) gp = self.branch5_norm(gp) gp = self.branch5_relu(gp) gp = F.interpolate(gp, size=(H, W), mode='bilinear', align_corners=False) gp = self.gate5(gp) # 并行拼接 => fuse out_cat = torch.cat([out1, out2, out3, out4, gp], dim=1) # => (B, 5*dim_out, H, W) out = self.fuse_conv(out_cat) # => (B, dim_out, H, W) return out ############################################ # 四、DeepLab + AG_ASPP_MViT ############################################ class DeepLab(nn.Module): def __init__(self, num_classes, backbone="mobilenet", pretrained=True, downsample_factor=16): super(DeepLab, self).__init__() if backbone == "xception": self.backbone = xception(downsample_factor=downsample_factor, pretrained=pretrained) in_channels = 2048 low_level_channels = 256 elif backbone == "mobilenet": self.backbone = MobileNetV2(downsample_factor=downsample_factor, pretrained=pretrained) in_channels = 320 low_level_channels = 24 else: raise ValueError('Unsupported backbone - `{}`, Use mobilenet, xception.'.format(backbone)) # 使用 AG_ASPP_MViT self.aspp = AG_ASPP_MViT( dim_in = in_channels, dim_out = 128, patch_size = (2,2), mvit_depth = 1, mvit_dim = 96, mvit_mlp_dim=192, mvit_heads=2, mvit_dropout=0.0, use_groupnorm=False ) # 浅层特征 => 1x1 => 24 self.shortcut_conv = nn.Sequential( nn.Conv2d(low_level_channels, 24, 1), nn.BatchNorm2d(24), nn.ReLU(inplace=True) ) # cat_conv => (24 + 128)->128 self.cat_conv = nn.Sequential( nn.Conv2d(24+128, 128, 3, 
padding=1), nn.BatchNorm2d(128), nn.ReLU(inplace=True), nn.Dropout(0.5), nn.Conv2d(128, 128, 3, padding=1), nn.BatchNorm2d(128), nn.ReLU(inplace=True), nn.Dropout(0.1) ) self.cls_conv = nn.Conv2d(128, num_classes, 1) def forward(self, x): H, W = x.shape[2], x.shape[3] low_level, deep = self.backbone(x) # ASPP with MobileViT + Attn Gate deep = self.aspp(deep) low_level = self.shortcut_conv(low_level) deep_up = F.interpolate(deep, size=(low_level.size(2), low_level.size(3)), mode='bilinear', align_corners=False) x_cat = torch.cat([deep_up, low_level], dim=1) x_cat = self.cat_conv(x_cat) x_cat = self.cls_conv(x_cat) x_cat = F.interpolate(x_cat, size=(H, W), mode='bilinear', align_corners=False) return x_cat ############################################ # 测试 ############################################ if __name__ == "__main__": model = DeepLab(num_classes=3, backbone="mobilenet", pretrained=False, downsample_factor=16) from torchsummary import summary # 如果需要查看summary # summary(model, (3,512,512)) # 统计可学习参数 print("模型参数总数:", count_parameters(model)) # forward test img = torch.randn(1, 3, 512, 512) pred = model(img) print("输出维度:", pred.shape)
Key points:

- The module is renamed `AG_ASPP_MViT` (Attention-Gated ASPP + MobileViT) to distinguish it from the original `WA_ASPP_MViT`.
- Operations like `branch2_agg = out1 + out2` are removed entirely; each branch produces an independent output.
- A new `AttnGate` module applies SE-like channel attention to each branch output; inside `AG_ASPP_MViT` every branch goes through its own `self.gateX(...)`.
- `out1, out2, out3, out4, gp` are concatenated with `torch.cat([...], dim=1)` and fused by a 1×1 conv into the final `(B, dim_out, H, W)` feature.
- In DeepLab only `self.aspp = AG_ASPP_MViT(...)` changes; the following `cat_conv -> cls_conv` stays the same.

This fulfils the "attention gating followed by parallel outputs" requirement and gives a self-contained, runnable example. If needed, you can further extend `AttnGate` (e.g. add spatial attention) or switch `fuse_conv` to another fusion scheme (such as a weighted sum).
C:\Users\wztbkl\anaconda3\envs\wuzhupytorch\python.exe C:\Users\wztbkl\PycharmProjects\deeplabv3-plus-pytorch-main\nets\mobileVitASPP_deeplabV3.py
模型参数总数: 11143411
Traceback (most recent call last):
File "C:\Users\wztbkl\PycharmProjects\deeplabv3-plus-pytorch-main\nets\mobileVitASPP_deeplabV3.py", line 436, in <module>
pred = model(img)
File "C:\Users\wztbkl\anaconda3\envs\wuzhupytorch\lib\site-packages\torch\nn\modules\module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\Users\wztbkl\anaconda3\envs\wuzhupytorch\lib\site-packages\torch\nn\modules\module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\wztbkl\PycharmProjects\deeplabv3-plus-pytorch-main\nets\mobileVitASPP_deeplabV3.py", line 412, in forward
deep = self.aspp(deep)
File "C:\Users\wztbkl\anaconda3\envs\wuzhupytorch\lib\site-packages\torch\nn\modules\module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\Users\wztbkl\anaconda3\envs\wuzhupytorch\lib\site-packages\torch\nn\modules\module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\wztbkl\PycharmProjects\deeplabv3-plus-pytorch-main\nets\mobileVitASPP_deeplabV3.py", line 346, in forward
gp = self.branch5_norm(gp)
File "C:\Users\wztbkl\anaconda3\envs\wuzhupytorch\lib\site-packages\torch\nn\modules\module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\Users\wztbkl\anaconda3\envs\wuzhupytorch\lib\site-packages\torch\nn\modules\module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\wztbkl\anaconda3\envs\wuzhupytorch\lib\site-packages\torch\nn\modules\batchnorm.py", line 193, in forward
return F.batch_norm(
File "C:\Users\wztbkl\anaconda3\envs\wuzhupytorch\lib\site-packages\torch\nn\functional.py", line 2810, in batch_norm
_verify_batch_size(input.size())
File "C:\Users\wztbkl\anaconda3\envs\wuzhupytorch\lib\site-packages\torch\nn\functional.py", line 2776, in _verify_batch_size
raise ValueError(
ValueError: Expected more than 1 value per channel when training, got input size torch.Size([1, 128, 1, 1])
进程已结束,退出代码为 1
Apply the attention gate only to the middle three outputs.
Below is a complete example showing how, in the original `WA_ASPP_MViT` structure, channel-attention gating (`AttnGate`) is applied only to the middle three branches (branch2, branch3, branch4), while branch1 and branch5 are left ungated. This keeps the first and last branches simple while adding gating to the MobileViT attention branches for more flexible feature control.

Also, because branch5 still produces a (B, 128, 1, 1) tensor after global pooling, BatchNorm raises an error there when batch size = 1, so we switch to GroupNorm (or drop the normalization) to avoid the "Expected more than 1 value per channel" error; the short snippet right below reproduces the difference. If you can guarantee batch size > 1 during training, the original BatchNorm can stay.
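To see why this switch matters, a minimal standalone sketch (plain PyTorch, independent of the model code) reproduces the failure and the fix on a (1, 128, 1, 1) tensor:

```python
import torch
import torch.nn as nn

gp = torch.randn(1, 128, 1, 1)   # global-pooled feature, batch size 1

bn = nn.BatchNorm2d(128).train()
try:
    bn(gp)                       # only one value per channel -> cannot compute batch statistics
except ValueError as e:
    print("BatchNorm2d:", e)

gn = nn.GroupNorm(num_groups=1, num_channels=128).train()
print("GroupNorm:", gn(gp).shape)   # works: normalizes over channels, not over the batch
```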
pythonimport torch import torch.nn as nn import torch.nn.functional as F import math from einops import rearrange from nets.xception import xception from nets.mobilenetv2 import mobilenetv2 ############################################ # 1. MobileNetV2 主干 (与原DeepLabV3+相同) ############################################ class MobileNetV2(nn.Module): def __init__(self, downsample_factor=8, pretrained=True): super(MobileNetV2, self).__init__() from functools import partial model = mobilenetv2(pretrained) self.features = model.features[:-1] self.total_idx = len(self.features) self.down_idx = [2, 4, 7, 14] if downsample_factor == 8: for i in range(self.down_idx[-2], self.down_idx[-1]): self.features[i].apply(partial(self._nostride_dilate, dilate=2)) for i in range(self.down_idx[-1], self.total_idx): self.features[i].apply(partial(self._nostride_dilate, dilate=4)) elif downsample_factor == 16: for i in range(self.down_idx[-1], self.total_idx): self.features[i].apply(partial(self._nostride_dilate, dilate=2)) def _nostride_dilate(self, m, dilate): classname = m.__class__.__name__ if classname.find('Conv') != -1: if m.stride == (2, 2): m.stride = (1, 1) if m.kernel_size == (3, 3): m.dilation = (dilate // 2, dilate // 2) m.padding = (dilate // 2, dilate // 2) else: if m.kernel_size == (3, 3): m.dilation = (dilate, dilate) m.padding = (dilate, dilate) def forward(self, x): # 前4层 -> low_level_features low_level_features = self.features[:4](x) # 后面 -> deep_features x = self.features[4:](low_level_features) return low_level_features, x ############################################ # 2. 统计可学习参数数量 ############################################ def count_parameters(model): return sum(p.numel() for p in model.parameters() if p.requires_grad) ############################################ # 简易 多层 TransformerBlockGroup ############################################ class TransformerBlockGroup(nn.Module): def __init__(self, dim, depth, heads, mlp_dim, dropout=0.0): super().__init__() self.layers = nn.ModuleList([ TransformerBlockLayer(dim, heads, mlp_dim, dropout) for _ in range(depth) ]) def forward(self, x): for layer in self.layers: x = layer(x) return x class TransformerBlockLayer(nn.Module): def __init__(self, dim, heads, mlp_dim, dropout=0.0): super().__init__() self.norm1 = nn.LayerNorm(dim) self.attn = MultiHeadAttention(dim, heads, dropout) self.norm2 = nn.LayerNorm(dim) self.ffn = FeedForward(dim, mlp_dim, dropout) def forward(self, x): shortcut = x x_ln = self.norm1(x) x_attn = self.attn(x_ln) x = x_attn + shortcut shortcut2 = x x_ln2 = self.norm2(x) x_ffn = self.ffn(x_ln2) x = x_ffn + shortcut2 return x class MultiHeadAttention(nn.Module): def __init__(self, dim, heads=2, dropout=0.0): super().__init__() self.heads = heads self.dim_head = dim // heads inner_dim = self.dim_head * heads self.scale = self.dim_head**-0.5 self.to_qkv = nn.Linear(dim, inner_dim*3, bias=False) self.attend = nn.Softmax(dim=-1) self.proj = nn.Sequential( nn.Linear(inner_dim, dim), nn.Dropout(dropout) ) def forward(self, x): B, n_patch, patch_area, dim = x.shape qkv = self.to_qkv(x) q, k, v = torch.chunk(qkv, 3, dim=-1) # each (B,n_patch,patch_area,inner_dim//3) q = q.reshape(B, n_patch, patch_area, self.heads, self.dim_head).permute(0,1,3,2,4) k = k.reshape(B, n_patch, patch_area, self.heads, self.dim_head).permute(0,1,3,2,4) v = v.reshape(B, n_patch, patch_area, self.heads, self.dim_head).permute(0,1,3,2,4) dots = torch.matmul(q, k.transpose(-1, -2))*self.scale attn = self.attend(dots) out = torch.matmul(attn, v) out = 
out.permute(0,1,3,2,4).contiguous() out = out.view(B, n_patch, patch_area, self.heads*self.dim_head) out = self.proj(out) return out class FeedForward(nn.Module): def __init__(self, dim, hidden_dim, dropout=0.0): super().__init__() self.net = nn.Sequential( nn.Linear(dim, hidden_dim), nn.SiLU(), nn.Dropout(dropout), nn.Linear(hidden_dim, dim), nn.Dropout(dropout) ) def forward(self, x): return self.net(x) ############################################ # MobileViTAttentionBlock (减参) ############################################ class MobileViTAttentionBlock(nn.Module): def __init__(self, in_channels=320, patch_size=(2,2), transformer_depth=1, transformer_dim=96, transformer_mlp_dim=192, heads=2, dropout=0.0): super().__init__() self.ph, self.pw = patch_size # 局部conv self.local_conv1 = nn.Conv2d(in_channels, in_channels, 3, padding=1, bias=False) self.local_bn1 = nn.BatchNorm2d(in_channels) self.local_act1 = nn.SiLU() # conv => transformer_dim self.conv_proj_in = nn.Conv2d(in_channels, transformer_dim, 1, bias=False) self.bn_proj_in = nn.BatchNorm2d(transformer_dim) self.act_proj_in = nn.SiLU() # Transformer self.transformer = TransformerBlockGroup( dim=transformer_dim, depth=transformer_depth, heads=heads, mlp_dim=transformer_mlp_dim, dropout=dropout ) # 回到 in_channels self.conv_proj_out = nn.Conv2d(transformer_dim, in_channels, 1, bias=False) self.bn_proj_out = nn.BatchNorm2d(in_channels) self.act_proj_out = nn.SiLU() # concat => 3x3 self.local_conv2 = nn.Conv2d(in_channels*2, in_channels, 3, padding=1, bias=False) self.local_bn2 = nn.BatchNorm2d(in_channels) self.local_act2 = nn.SiLU() def forward(self, x): y = x.clone() x = self.local_conv1(x) x = self.local_bn1(x) x = self.local_act1(x) x = self.conv_proj_in(x) x = self.bn_proj_in(x) x = self.act_proj_in(x) B, D, H, W = x.shape assert H % self.ph == 0 and W % self.pw == 0, "H,W需能被patch_size整除" x = rearrange(x, 'b d (hh ph) (ww pw) -> b (hh ww) (ph pw) d', ph=self.ph, pw=self.pw) x = self.transformer(x) x = rearrange(x, 'b (hh ww) (ph pw) d -> b d (hh ph) (ww pw)', hh=H//self.ph, ww=W//self.pw, ph=self.ph, pw=self.pw) x = self.conv_proj_out(x) x = self.bn_proj_out(x) x = self.act_proj_out(x) x = torch.cat([x, y], dim=1) x = self.local_conv2(x) x = self.local_bn2(x) x = self.local_act2(x) return x ############################################ # 通道注意力 AttnGate,只在 branch2/3/4使用 ############################################ class AttnGate(nn.Module): """ 类似 SE: 全局池化 -> MLP -> Sigmoid => 通道加权 """ def __init__(self, channels, reduction=16): super().__init__() hidden_dim = max(1, channels//reduction) self.pool = nn.AdaptiveAvgPool2d(1) self.fc = nn.Sequential( nn.Linear(channels, hidden_dim), nn.ReLU(inplace=True), nn.Linear(hidden_dim, channels), nn.Sigmoid() ) def forward(self, x): B, C, H, W = x.shape y = self.pool(x).view(B, C) # => (B,C) y = self.fc(y) # => (B,C) y = y.view(B, C, 1, 1) return x * y ############################################ # 三、只对中间三个输出做注意力门控 # branch2/3/4 => gate, branch1/5 不做 gating ############################################ class AG_ASPP_MViT(nn.Module): """ 5条分支, branch1 + branch5 没有门控, branch2/3/4 做 MobileViT + gating, 最后并行拼接 => fuse """ def __init__(self, dim_in=320, dim_out=128, patch_size=(2,2), mvit_depth=1, mvit_dim=96, mvit_mlp_dim=192, mvit_heads=2, mvit_dropout=0.0, use_groupnorm=True): super().__init__() # branch1 => 1×1 => dim_out, 无 gating self.branch1 = nn.Sequential( nn.Conv2d(dim_in, dim_out, 1), nn.BatchNorm2d(dim_out), nn.ReLU(inplace=True) ) # branch2 => MViT => 1×1 => gating self.branch2_attn = 
MobileViTAttentionBlock( in_channels=dim_in, patch_size=patch_size, transformer_depth=mvit_depth, transformer_dim=mvit_dim, transformer_mlp_dim=mvit_mlp_dim, heads=mvit_heads, dropout=mvit_dropout ) self.branch2_proj = nn.Conv2d(dim_in, dim_out, 1, bias=False) self.gate2 = AttnGate(dim_out) # branch3 => 同理 self.branch3_attn = MobileViTAttentionBlock( in_channels=dim_in, patch_size=patch_size, transformer_depth=mvit_depth, transformer_dim=mvit_dim, transformer_mlp_dim=mvit_mlp_dim, heads=mvit_heads, dropout=mvit_dropout ) self.branch3_proj = nn.Conv2d(dim_in, dim_out, 1, bias=False) self.gate3 = AttnGate(dim_out) # branch4 => 同理 self.branch4_attn = MobileViTAttentionBlock( in_channels=dim_in, patch_size=patch_size, transformer_depth=mvit_depth, transformer_dim=mvit_dim, transformer_mlp_dim=mvit_mlp_dim, heads=mvit_heads, dropout=mvit_dropout ) self.branch4_proj = nn.Conv2d(dim_in, dim_out, 1, bias=False) self.gate4 = AttnGate(dim_out) # branch5 => global pooling => conv => norm => relu => up (无 gating) self.branch5_conv = nn.Conv2d(dim_in, dim_out, kernel_size=1) if use_groupnorm: # 避免 (B=1, C=128,1,1) 的 BN 报错 self.branch5_norm = nn.GroupNorm(num_groups=1, num_channels=dim_out) else: self.branch5_norm = nn.BatchNorm2d(dim_out) self.branch5_relu = nn.ReLU(inplace=True) # fuse => 1x1 => dim_out self.fuse_conv = nn.Sequential( nn.Conv2d(dim_out*5, dim_out, 1), nn.BatchNorm2d(dim_out), nn.ReLU(inplace=True) ) def forward(self, x): B, C, H, W = x.shape # branch1 (no gating) out1 = self.branch1(x) # => (B,128,H,W) # branch2 => gating x2 = self.branch2_attn(x) out2 = self.branch2_proj(x2) out2 = self.gate2(out2) # branch3 => gating x3 = self.branch3_attn(x) out3 = self.branch3_proj(x3) out3 = self.gate3(out3) # branch4 => gating x4 = self.branch4_attn(x) out4 = self.branch4_proj(x4) out4 = self.gate4(out4) # branch5 (no gating) gp = F.adaptive_avg_pool2d(x, 1) # => (B,C,1,1) gp = self.branch5_conv(gp) # => (B,128,1,1) gp = self.branch5_norm(gp) # => BN or GN gp = self.branch5_relu(gp) gp = F.interpolate(gp, size=(H, W), mode='bilinear', align_corners=False) # 并行拼接 => fuse out_cat = torch.cat([out1, out2, out3, out4, gp], dim=1) out = self.fuse_conv(out_cat) return out ############################################ # 四、DeepLab + AG_ASPP_MViT(只对中间3个分支门控) ############################################ class DeepLab(nn.Module): def __init__(self, num_classes, backbone="mobilenet", pretrained=True, downsample_factor=16): super(DeepLab, self).__init__() if backbone == "xception": self.backbone = xception(downsample_factor=downsample_factor, pretrained=pretrained) in_channels = 2048 low_level_channels = 256 elif backbone == "mobilenet": self.backbone = MobileNetV2(downsample_factor=downsample_factor, pretrained=pretrained) in_channels = 320 low_level_channels = 24 else: raise ValueError('Unsupported backbone - `{}`, Use mobilenet, xception.'.format(backbone)) # 这里改成AG_ASPP_MViT, 并仅对中间3分支做gating self.aspp = AG_ASPP_MViT( dim_in = in_channels, dim_out = 128, patch_size = (2,2), mvit_depth = 1, mvit_dim = 96, mvit_mlp_dim=192, mvit_heads=2, mvit_dropout=0.0, use_groupnorm=True # 避免 (1,128,1,1) BN报错 ) # 浅层 => 1x1 => 24 self.shortcut_conv = nn.Sequential( nn.Conv2d(low_level_channels, 24, 1), nn.BatchNorm2d(24), nn.ReLU(inplace=True) ) # cat_conv => (24+128)->128 self.cat_conv = nn.Sequential( nn.Conv2d(24+128, 128, 3, padding=1), nn.BatchNorm2d(128), nn.ReLU(inplace=True), nn.Dropout(0.5), nn.Conv2d(128, 128, 3, padding=1), nn.BatchNorm2d(128), nn.ReLU(inplace=True), nn.Dropout(0.1), ) self.cls_conv = 
nn.Conv2d(128, num_classes, 1) def forward(self, x): H, W = x.shape[2], x.shape[3] low_level, deep = self.backbone(x) deep = self.aspp(deep) # => (B,128, ...) low_level = self.shortcut_conv(low_level) deep_up = F.interpolate(deep, size=(low_level.size(2), low_level.size(3)), mode='bilinear', align_corners=False) x_cat = torch.cat([deep_up, low_level], dim=1) x_cat = self.cat_conv(x_cat) x_cat = self.cls_conv(x_cat) x_cat = F.interpolate(x_cat, size=(H, W), mode='bilinear', align_corners=False) return x_cat ############################################ # 测试 ############################################ if __name__ == "__main__": model = DeepLab(num_classes=3, backbone="mobilenet", pretrained=False, downsample_factor=16) print("模型参数总数:", count_parameters(model)) img = torch.randn(1, 3, 512, 512) pred = model(img) print("输出维度:", pred.shape)
Key points of this version:
- Gating: only the middle three branches are gated. `AG_ASPP_MViT` applies an `AttnGate` (channel attention) to branches 2/3/4, while branch1 and branch5 stay ungated.
- Branch fusion: there is no hierarchical accumulation such as `out2_agg = out1 + out2`; instead `out1, out2, out3, out4, gp` are produced in parallel, the middle three are gated, and all five are concatenated with `torch.cat` and fused by a 1×1 convolution.
- Training and inference: `use_groupnorm=True` by default, which avoids the single-image BatchNorm error in branch5.
In this way, without changing the rest of the DeepLab pipeline, you gate only the middle three branches and keep branch1/branch5 in their simple form to reduce potential conflicts or errors. A quick shape check for the module is sketched below.
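As a quick sanity check, the snippet below (a minimal sketch; it assumes the classes above are saved in one importable file, and the file name is just a placeholder) verifies that `AG_ASPP_MViT` preserves the spatial size and returns `dim_out` channels:

```python
import torch
# from your_model_file import AG_ASPP_MViT   # placeholder: import the module defined above

aspp = AG_ASPP_MViT(dim_in=320, dim_out=128, patch_size=(2, 2), use_groupnorm=True)
aspp.eval()                                   # eval mode; branch5 uses GroupNorm, so batch size 1 is safe
with torch.no_grad():
    feat = torch.randn(1, 320, 32, 32)        # a 512x512 input with downsample_factor=16 gives a 32x32 deep feature
    out = aspp(feat)
print(out.shape)                               # expected: torch.Size([1, 128, 32, 32])
```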
Start Train
Epoch 1/100: 0%| | 0/2162 [00:00<?, ?it/s<class 'dict'>]
From the log you pasted, training has started normally; PyTorch only emits a TracerWarning saying that, while scripting (or tracing) the network, the assertion

```python
assert H % self.ph == 0 and W % self.pw == 0
```

converted a Tensor to a Python boolean. This is a common situation with TorchScript or FX tracing: an `assert ...` (or similar Python-level logic) is not traced against the dynamic input size, so PyTorch warns that the resulting trace may not generalize.
- Scripting/tracing does not support arbitrary Python logic. `torch.jit.trace` and `torch.fx.symbolic_trace` have to abstract the computation into a portable IR; a Python boolean test on a runtime Tensor such as `assert H % self.ph == 0` lives at the Python level, not in the graph itself. If it evaluates to True during one trace, it is treated as a constant, so consistent behaviour for other input sizes cannot be guaranteed. The code still runs normally.
- Move the assertion into `__init__` or into the data pipeline: if the inputs are known to be divisible by `patch_size`, check that outside the network or in the constructor instead of asserting inside `forward`.
- Alternatively, enforce divisibility with tensor-level reshaping/padding (see the sketch after this list).
- Or simply ignore the warning if you never need scripting/tracing.
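A minimal sketch of the padding option (not the original code, just one way to drop the Python-level assert; only standard torch ops are used):

```python
import torch
import torch.nn.functional as F

def pad_to_multiple(x: torch.Tensor, ph: int, pw: int) -> torch.Tensor:
    """Right/bottom-pad so H % ph == 0 and W % pw == 0; F.pad is an ordinary tensor op."""
    H, W = x.shape[-2], x.shape[-1]
    pad_h = (ph - H % ph) % ph
    pad_w = (pw - W % pw) % pw
    return F.pad(x, (0, pad_w, 0, pad_h))   # (left, right, top, bottom) for the last two dims

x = torch.randn(1, 96, 31, 33)              # deliberately not divisible by 2
print(pad_to_multiple(x, 2, 2).shape)       # torch.Size([1, 96, 32, 34])
```

Note that when tracing with a fixed example input, pad amounts derived from `x.shape` are still baked in as constants; the padding only removes the Python-level assert, it does not by itself make the trace shape-generic.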
In summary: **Start Train** has appeared and Epoch 1/100 is already reading batches, so training itself is fine. If you later need `torch.jit.trace`/FX tracing, it is best to move this kind of assertion outside the network or handle it differently (fix the input size, or assert when loading data) to eliminate the potential scripting problem.
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
from einops import rearrange
from nets.xception import xception
from nets.mobilenetv2 import mobilenetv2
############################################
############################################
class MobileNetV2(nn.Module):
def __init__(self, downsample_factor=8, pretrained=True):
super(MobileNetV2, self).__init__()
from functools import partial
model = mobilenetv2(pretrained)
self.features = model.features[:-1]
textself.total_idx = len(self.features) self.down_idx = [2, 4, 7, 14] if downsample_factor == 8: for i in range(self.down_idx[-2], self.down_idx[-1]): self.features[i].apply(partial(self._nostride_dilate, dilate=2)) for i in range(self.down_idx[-1], self.total_idx): self.features[i].apply(partial(self._nostride_dilate, dilate=4)) elif downsample_factor == 16: for i in range(self.down_idx[-1], self.total_idx): self.features[i].apply(partial(self._nostride_dilate, dilate=2)) def _nostride_dilate(self, m, dilate): classname = m.__class__.__name__ if classname.find('Conv') != -1: if m.stride == (2, 2): m.stride = (1, 1) if m.kernel_size == (3, 3): m.dilation = (dilate // 2, dilate // 2) m.padding = (dilate // 2, dilate // 2) else: if m.kernel_size == (3, 3): m.dilation = (dilate, dilate) m.padding = (dilate, dilate) def forward(self, x): # 前4层 -> low_level_features low_level_features = self.features[:4](x) # 后面 -> deep_features x = self.features[4:](low_level_features) return low_level_features, x
############################################
############################################
def count_parameters(model):
return sum(p.numel() for p in model.parameters() if p.requires_grad)
############################################
############################################
class TransformerBlockGroup(nn.Module):
def __init__(self, dim, depth, heads, mlp_dim, dropout=0.0):
super().__init__()
self.layers = nn.ModuleList([
TransformerBlockLayer(dim, heads, mlp_dim, dropout)
for _ in range(depth)
])
textdef forward(self, x): for layer in self.layers: x = layer(x) return x
class TransformerBlockLayer(nn.Module):
def __init__(self, dim, heads, mlp_dim, dropout=0.0):
super().__init__()
self.norm1 = nn.LayerNorm(dim)
self.attn = MultiHeadAttention(dim, heads, dropout)
self.norm2 = nn.LayerNorm(dim)
self.ffn = FeedForward(dim, mlp_dim, dropout)
textdef forward(self, x): shortcut = x x_ln = self.norm1(x) x_attn = self.attn(x_ln) x = x_attn + shortcut shortcut2 = x x_ln2 = self.norm2(x) x_ffn = self.ffn(x_ln2) x = x_ffn + shortcut2 return x
class MultiHeadAttention(nn.Module):
def __init__(self, dim, heads=2, dropout=0.0):
super().__init__()
self.heads = heads
self.dim_head = dim // heads
inner_dim = self.dim_head * heads
textself.scale = self.dim_head ** -0.5 self.to_qkv = nn.Linear(dim, inner_dim * 3, bias=False) self.attend = nn.Softmax(dim=-1) self.proj = nn.Sequential( nn.Linear(inner_dim, dim), nn.Dropout(dropout) ) def forward(self, x): B, n_patch, patch_area, dim = x.shape qkv = self.to_qkv(x) q, k, v = torch.chunk(qkv, 3, dim=-1) # each (B,n_patch,patch_area,inner_dim//3) q = q.reshape(B, n_patch, patch_area, self.heads, self.dim_head).permute(0, 1, 3, 2, 4) k = k.reshape(B, n_patch, patch_area, self.heads, self.dim_head).permute(0, 1, 3, 2, 4) v = v.reshape(B, n_patch, patch_area, self.heads, self.dim_head).permute(0, 1, 3, 2, 4) dots = torch.matmul(q, k.transpose(-1, -2)) * self.scale attn = self.attend(dots) out = torch.matmul(attn, v) out = out.permute(0, 1, 3, 2, 4).contiguous() out = out.view(B, n_patch, patch_area, self.heads * self.dim_head) out = self.proj(out) return out
class FeedForward(nn.Module):
def __init__(self, dim, hidden_dim, dropout=0.0):
super().__init__()
self.net = nn.Sequential(
nn.Linear(dim, hidden_dim),
nn.SiLU(),
nn.Dropout(dropout),
nn.Linear(hidden_dim, dim),
nn.Dropout(dropout)
)
textdef forward(self, x): return self.net(x)
############################################
############################################
class MobileViTAttentionBlock(nn.Module):
def __init__(self,
in_channels=320,
patch_size=(2, 2),
transformer_depth=1,
transformer_dim=96,
transformer_mlp_dim=192,
heads=2,
dropout=0.0):
super().__init__()
self.ph, self.pw = patch_size
text# 局部conv self.local_conv1 = nn.Conv2d(in_channels, in_channels, 3, padding=1, bias=False) self.local_bn1 = nn.BatchNorm2d(in_channels) self.local_act1 = nn.SiLU() # conv => transformer_dim self.conv_proj_in = nn.Conv2d(in_channels, transformer_dim, 1, bias=False) self.bn_proj_in = nn.BatchNorm2d(transformer_dim) self.act_proj_in = nn.SiLU() # Transformer self.transformer = TransformerBlockGroup( dim=transformer_dim, depth=transformer_depth, heads=heads, mlp_dim=transformer_mlp_dim, dropout=dropout ) # 回到 in_channels self.conv_proj_out = nn.Conv2d(transformer_dim, in_channels, 1, bias=False) self.bn_proj_out = nn.BatchNorm2d(in_channels) self.act_proj_out = nn.SiLU() # concat => 3x3 self.local_conv2 = nn.Conv2d(in_channels * 2, in_channels, 3, padding=1, bias=False) self.local_bn2 = nn.BatchNorm2d(in_channels) self.local_act2 = nn.SiLU() def forward(self, x): y = x.clone() x = self.local_conv1(x) x = self.local_bn1(x) x = self.local_act1(x) x = self.conv_proj_in(x) x = self.bn_proj_in(x) x = self.act_proj_in(x) B, D, H, W = x.shape assert H % self.ph == 0 and W % self.pw == 0, "H,W需能被patch_size整除" x = rearrange(x, 'b d (hh ph) (ww pw) -> b (hh ww) (ph pw) d', ph=self.ph, pw=self.pw) x = self.transformer(x) x = rearrange(x, 'b (hh ww) (ph pw) d -> b d (hh ph) (ww pw)', hh=H // self.ph, ww=W // self.pw, ph=self.ph, pw=self.pw) x = self.conv_proj_out(x) x = self.bn_proj_out(x) x = self.act_proj_out(x) x = torch.cat([x, y], dim=1) x = self.local_conv2(x) x = self.local_bn2(x) x = self.local_act2(x) return x
############################################
############################################
class AttnGate(nn.Module):
"""
类似 SE: 全局池化 -> MLP -> Sigmoid => 通道加权
"""
textdef __init__(self, channels, reduction=16): super().__init__() hidden_dim = max(1, channels // reduction) self.pool = nn.AdaptiveAvgPool2d(1) self.fc = nn.Sequential( nn.Linear(channels, hidden_dim), nn.ReLU(inplace=True), nn.Linear(hidden_dim, channels), nn.Sigmoid() ) def forward(self, x): B, C, H, W = x.shape y = self.pool(x).view(B, C) # => (B,C) y = self.fc(y) # => (B,C) y = y.view(B, C, 1, 1) return x * y
############################################
############################################
class AG_ASPP_MViT(nn.Module):
"""
5条分支, branch1 + branch5 没有门控,
branch2/3/4 做 MobileViT + gating,
最后并行拼接 => fuse
"""
textdef __init__(self, dim_in=320, dim_out=128, patch_size=(2, 2), mvit_depth=1, mvit_dim=96, mvit_mlp_dim=192, mvit_heads=2, mvit_dropout=0.0, use_groupnorm=True): super().__init__() # branch1 => 1×1 => dim_out, 无 gating self.branch1 = nn.Sequential( nn.Conv2d(dim_in, dim_out, 1), nn.BatchNorm2d(dim_out), nn.ReLU(inplace=True) ) # branch2 => MViT => 1×1 => gating self.branch2_attn = MobileViTAttentionBlock( in_channels=dim_in, patch_size=patch_size, transformer_depth=mvit_depth, transformer_dim=mvit_dim, transformer_mlp_dim=mvit_mlp_dim, heads=mvit_heads, dropout=mvit_dropout ) self.branch2_proj = nn.Conv2d(dim_in, dim_out, 1, bias=False) self.gate2 = AttnGate(dim_out) # branch3 => 同理 self.branch3_attn = MobileViTAttentionBlock( in_channels=dim_in, patch_size=patch_size, transformer_depth=mvit_depth, transformer_dim=mvit_dim, transformer_mlp_dim=mvit_mlp_dim, heads=mvit_heads, dropout=mvit_dropout ) self.branch3_proj = nn.Conv2d(dim_in, dim_out, 1, bias=False) self.gate3 = AttnGate(dim_out) # branch4 => 同理 self.branch4_attn = MobileViTAttentionBlock( in_channels=dim_in, patch_size=patch_size, transformer_depth=mvit_depth, transformer_dim=mvit_dim, transformer_mlp_dim=mvit_mlp_dim, heads=mvit_heads, dropout=mvit_dropout ) self.branch4_proj = nn.Conv2d(dim_in, dim_out, 1, bias=False) self.gate4 = AttnGate(dim_out) # branch5 => global pooling => conv => norm => relu => up (无 gating) self.branch5_conv = nn.Conv2d(dim_in, dim_out, kernel_size=1) if use_groupnorm: # 避免 (B=1, C=128,1,1) 的 BN 报错 self.branch5_norm = nn.GroupNorm(num_groups=1, num_channels=dim_out) else: self.branch5_norm = nn.BatchNorm2d(dim_out) self.branch5_relu = nn.ReLU(inplace=True) # fuse => 1x1 => dim_out self.fuse_conv = nn.Sequential( nn.Conv2d(dim_out * 5, dim_out, 1), nn.BatchNorm2d(dim_out), nn.ReLU(inplace=True) ) def forward(self, x): B, C, H, W = x.shape # branch1 (no gating) out1 = self.branch1(x) # => (B,128,H,W) # branch2 => gating x2 = self.branch2_attn(x) out2 = self.branch2_proj(x2) out2 = self.gate2(out2) # branch3 => gating x3 = self.branch3_attn(x) out3 = self.branch3_proj(x3) out3 = self.gate3(out3) # branch4 => gating x4 = self.branch4_attn(x) out4 = self.branch4_proj(x4) out4 = self.gate4(out4) # branch5 (no gating) gp = F.adaptive_avg_pool2d(x, 1) # => (B,C,1,1) gp = self.branch5_conv(gp) # => (B,128,1,1) gp = self.branch5_norm(gp) # => BN or GN gp = self.branch5_relu(gp) gp = F.interpolate(gp, size=(H, W), mode='bilinear', align_corners=False) # 并行拼接 => fuse out_cat = torch.cat([out1, out2, out3, out4, gp], dim=1) out = self.fuse_conv(out_cat) return out
############################################
############################################
class DeepLab(nn.Module):
def __init__(self, num_classes, backbone="mobilenet", pretrained=True, downsample_factor=16):
super(DeepLab, self).__init__()
if backbone == "xception":
self.backbone = xception(downsample_factor=downsample_factor, pretrained=pretrained)
in_channels = 2048
low_level_channels = 256
elif backbone == "mobilenet":
self.backbone = MobileNetV2(downsample_factor=downsample_factor, pretrained=pretrained)
in_channels = 320
low_level_channels = 24
else:
raise ValueError('Unsupported backbone - `{}`, Use mobilenet, xception.'.format(backbone))
text# 这里改成AG_ASPP_MViT, 并仅对中间3分支做gating self.aspp = AG_ASPP_MViT( dim_in=in_channels, dim_out=128, patch_size=(2, 2), mvit_depth=1, mvit_dim=96, mvit_mlp_dim=192, mvit_heads=2, mvit_dropout=0.0, use_groupnorm=True # 避免 (1,128,1,1) BN报错 ) # 浅层 => 1x1 => 24 self.shortcut_conv = nn.Sequential( nn.Conv2d(low_level_channels, 24, 1), nn.BatchNorm2d(24), nn.ReLU(inplace=True) ) # cat_conv => (24+128)->128 self.cat_conv = nn.Sequential( nn.Conv2d(24 + 128, 128, 3, padding=1), nn.BatchNorm2d(128), nn.ReLU(inplace=True), nn.Dropout(0.5), nn.Conv2d(128, 128, 3, padding=1), nn.BatchNorm2d(128), nn.ReLU(inplace=True), nn.Dropout(0.1), ) self.cls_conv = nn.Conv2d(128, num_classes, 1) def forward(self, x): H, W = x.shape[2], x.shape[3] low_level, deep = self.backbone(x) deep = self.aspp(deep) # => (B,128, ...) low_level = self.shortcut_conv(low_level) deep_up = F.interpolate(deep, size=(low_level.size(2), low_level.size(3)), mode='bilinear', align_corners=False) x_cat = torch.cat([deep_up, low_level], dim=1) x_cat = self.cat_conv(x_cat) x_cat = self.cls_conv(x_cat) x_cat = F.interpolate(x_cat, size=(H, W), mode='bilinear', align_corners=False) return x_cat
############################################
############################################
if __name__ == "__main__":
model = DeepLab(num_classes=3, backbone="mobilenet", pretrained=False, downsample_factor=16)
print("模型参数总数:", count_parameters(model))
textimg = torch.randn(1, 3, 512, 512) pred = model(img) print("输出维度:", pred.shape)
An atrous spatial pyramid pooling (ASPP) module usually consists of one 1×1 convolution, three 3×3 dilated convolutions with different dilation rates, and a spatial pooling branch (Chen et al., 2018). By setting different dilation rates, feature maps at different scales can be obtained. However, the ASPP structure has to execute several convolutions in parallel, which increases the number of model parameters and requires more computational resources and longer training time. In addition, because Panax notoginseng leaf spots vary in size, the receptive field expands as the dilation rate increases, which degrades segmentation of small leaf spots. To save computational resources while preserving rich semantic information and improving small-object segmentation, we redesign the ASPP module in this paper and name it WA (Window Attention)-ASPP; its structure is shown in Fig. 4. We change how the output feature maps of the middle three branches are generated so that features are hierarchically accumulated. Specifically, the feature map of the second branch is the sum of the outputs of the first and second branches; the feature map of the third branch is the sum of the outputs of the first two branches and the third branch; this process continues progressively, raising the level of the feature maps. In this way, feature information at different scales is fused effectively, improving the aggregation of semantic features.
The core of WA-ASPP is the window attention layer, which comes from the window-based multi-head self-attention (W-MSA) of the Swin Transformer (Liu et al., 2021); its formula is defined below. Concretely, the input feature map is first split into non-overlapping windows and attention is computed independently for each window. When computing attention, only the relationship between the current window and a few surrounding windows is considered, rather than correlations over the whole image. Finally, a reshaping operation restores the same shape as the input. By designing a window attention layer to replace the original dilated convolution layers, we improve the scalability of the model; this modification reduces computation and memory requirements from a global scope to a local window scope.
In WA-ASPP, the first branch and the original average pooling of the fifth branch are retained. Each of the middle three branches is designed as a window attention layer, a 2-D batch normalization layer, an activation function, a 2-D convolution layer, another 2-D batch normalization layer, and finally another activation function; the structure is shown in Fig. 5. This design not only preserves the overall characteristics of the input image but also improves computational efficiency and strengthens local correlation. The window sizes of the window attention layers on the three branches are set to 4×4, 8×8 and 16×16 respectively, and GELU is used uniformly as the activation function; in lightweight networks, GELU has been shown to outperform the rectified linear unit (ReLU) (Hendrycks and Gimpel, 2023). With this configuration the model can capture feature information across scales and levels, which allows it to understand and analyze the image comprehensively, improving its performance and generalization ability.
Mimicking the WA_ASPP description above, introduce the AG_ASPP_MViT module from the code above. It should differ substantially, be professional, and preferably include formulas.
Below is a fairly complete, professional piece of explanatory text, written in a style similar to the original, introducing the idea and implementation of the AG_ASPP_MViT module. It covers the combined use of window attention and channel-attention gating and sketches the key operations with formulas. It follows the same line of thought as the earlier "WA_ASPP" description, but the middle three branches use MobileViT-style local attention plus channel gating, aiming at both speed and better small-object segmentation.
Atrous spatial pyramid pooling (ASPP) usually contains a 1×1 convolution, several 3×3 convolutions with different dilation rates, and a global (or large-scale) average pooling branch, so as to capture features under different receptive fields. In lightweight networks or small-object segmentation, however, too many dilated convolutions add parameters and hurt small objects. We therefore borrow the window attention of the Swin Transformer, keep ASPP's multi-branch framework, rework the "middle three branches", and propose the AG_ASPP_MViT module.
The overall structure is illustrated in the figure: the simple convolution of the first branch and the global pooling of the fifth branch are kept, while the middle three branches are replaced by MobileViTAttentionBlock (which can be viewed as local window multi-head attention) combined with channel-attention gating (AttnGate). On top of the multi-scale parallel layout, only the middle branches receive the finer attention treatment, which both saves computation and improves local feature interaction.
Unlike the "dilated convolution + multi-scale" design of the original text, the middle three branches apply window-based multi-head self-attention (W-MSA), whose core idea can be summarized as follows (compared with conventional global attention, the attention is computed only inside local windows):

$$\mathrm{W\text{-}MSA}(X)=\mathrm{Merge}\big(\mathrm{Attn}(X_1),\,\mathrm{Attn}(X_2),\,\dots,\,\mathrm{Attn}(X_N)\big),\qquad \mathrm{Attn}(X_i)=\mathrm{Softmax}\!\Big(\frac{Q_iK_i^{\top}}{\sqrt{d}}\Big)V_i$$

where $X_1,\dots,X_N$ denote the input feature split into $N$ non-overlapping windows; self-attention is computed only inside each window, which reduces computation effectively while preserving local spatial correlation. In addition, to better control how much each branch contributes to the final result, a channel-attention gate (AttnGate) is applied after each branch output:

$$\hat{F}=F\odot\sigma\big(W_2\,\delta(W_1\,\mathrm{GAP}(F))\big)$$

where $\mathrm{GAP}(F)$ is global average pooling of the feature $F$ (keeping the channel dimension), $W_1$ and $W_2$ with the ReLU $\delta$ form the two fully connected layers ending in the Sigmoid $\sigma$, and $\odot$ denotes channel-wise multiplication. In this way the model can adaptively amplify or suppress each branch output, which is especially helpful for retaining features related to small objects.
The AG_ASPP_MViT module has five branches in total: the first branch is a plain 1×1 convolution; the second, third and fourth branches each apply a MobileViTAttentionBlock followed by a 1×1 projection and an AttnGate, with patch sizes (2,2), (2,2), (2,2) by default, or (4,4), (8,8), (16,16) if a multi-scale setting is preferred; the fifth branch performs global average pooling, a 1×1 convolution and upsampling. Finally the five feature streams are concatenated in parallel and reduced to dim_out by a 1×1 convolution (Fuse Conv), giving the integrated multi-scale context feature. Compared with the dilated convolutions of the original ASPP, experiments suggest that, for small-object segmentation or tasks that need finer structural detail, AG_ASPP_MViT keeps or improves the quality of semantic aggregation while reducing extra computation. Combining it with a "shift window" or multi-scale patch-partition strategy leaves further room for optimization.
In short, the AG_ASPP_MViT module is an improved version of traditional atrous spatial pyramid pooling (ASPP): it can noticeably improve scalability and accuracy for lightweight networks (such as MobileNet) or small-object segmentation, without letting the parameter count and computational cost explode, making it an efficient compromise among multi-branch attention modules.
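For reference, the whole module can be summarized in one line (the symbols $B_k$ and $G_k$ are shorthand introduced here, not notation from the original text):

$$Y=\mathrm{Conv}_{1\times1}\big(\mathrm{Concat}[\,B_1(X),\;G_2(X),\;G_3(X),\;G_4(X),\;B_5(X)\,]\big),\qquad G_k(X)=\mathrm{AttnGate}\big(\mathrm{Conv}_{1\times1}(\mathrm{MViT}_k(X))\big),\ k=2,3,4$$

where $B_1$ is the 1×1 convolution branch and $B_5$ is the pooled branch upsampled back to $H\times W$.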
Make it more concise.
Below is a more concise version of the AG_ASPP_MViT description, highlighting its motivation, key method and advantages, for quick use in a paper or project.
Atrous spatial pyramid pooling (ASPP) usually captures multi-scale context through several convolution branches with different dilation rates, which increases the parameter count and tends to lose detail on small objects. We therefore introduce AG_ASPP_MViT, which replaces the middle three dilated-convolution branches with window attention + channel-attention gating while keeping the lightweight first and last branches, so that redundant dilated convolutions are removed and fine-grained objects are modelled better, without giving up the multi-scale parallel layout.
AG_ASPP_MViT contains five branches: a 1×1 convolution branch, three MobileViT-attention branches with channel gating, and a global-pooling branch. Finally the five outputs are concatenated in parallel and fused to a common channel number by a 1×1 convolution, producing semantic features of the same size as the original ASPP.
While preserving multi-scale context, AG_ASPP_MViT uses local self-attention and gated enhancement, which makes it friendlier to small-object segmentation and lightweight deployment; it is a flexible redesign and optimization of the traditional ASPP framework.
Don't use bullet points; imitate the way the paper is written and produce just three paragraphs.
In lightweight networks, traditional atrous spatial pyramid pooling (ASPP) captures multi-scale context with convolutions of different dilation rates, which brings extra parameter overhead and performs poorly on small objects. We therefore redesign the ASPP module and propose AG_ASPP_MViT, which uses window attention and channel-attention gating only in the middle three branches while keeping a simple convolution and global pooling structure in the first and fifth branches, so that redundant dilated convolutions are reduced and the representation of fine-grained objects is strengthened on top of the multi-branch parallel layout.
Specifically, in the middle three branches we replace the dilated convolutions with a MobileViTAttentionBlock, which splits the feature map into local windows and performs multi-head self-attention inside them, capturing local dependencies while avoiding the computational burden of large kernels. Each branch output is then passed through a channel-attention gate (AttnGate), which adaptively re-weights the branch features along the channel dimension: global pooling followed by two fully connected layers produces a channel-wise weight vector that is multiplied back onto the branch output, emphasising key information and suppressing noise. Finally, as in the traditional ASPP, the five branches are concatenated in parallel, fused to a common channel number by a 1×1 convolution, and combined with shallow features in the decoding stage, exploiting both multi-scale context and local detail.
Thanks to the efficiency of window attention and the flexible selection provided by the gating mechanism, AG_ASPP_MViT segments small objects and fine regions more precisely while keeping or reducing the overall parameter count. It retains the multi-scale parallel advantage of ASPP, lightens the dependence on large kernels, and balances global context with local detail, which makes it a good fit when both efficiency and accuracy matter in lightweight networks.
Include the formulas.
In lightweight networks, traditional atrous spatial pyramid pooling (ASPP) usually relies on convolutions with different dilation rates to capture multi-scale context, which significantly increases the parameter count and performs poorly on small objects. We therefore redesign the ASPP module and propose AG_ASPP_MViT: window attention and channel-attention gating are used only in the middle three branches, while the first and fifth branches keep a simple convolution and global pooling, reducing redundant dilated convolutions and strengthening fine-grained object representation on top of the multi-branch parallel layout.
Specifically, the middle branches replace the dilated convolutions with a MobileViTAttentionBlock that performs window-based multi-head self-attention (W-MSA) on the input features, which can be abstracted as

$$\mathrm{W\text{-}MSA}(X)=\mathrm{Concat}\big(\mathrm{Attn}(X_1),\,\mathrm{Attn}(X_2),\,\dots,\,\mathrm{Attn}(X_N)\big)$$

where $X_i$ is the $i$-th feature block after partitioning into local windows and $\mathrm{Concat}$ denotes concatenation over the window dimension. The network thus computes attention only within local regions, avoiding the computational burden of large kernels. Each middle branch output is then re-weighted along the channel dimension by the gating mechanism (AttnGate):

$$\hat{F}=F\odot\sigma\big(\mathrm{MLP}(\mathrm{GAP}(F))\big)$$

where $\mathrm{GAP}$ is global average pooling of the feature $F$, $\mathrm{MLP}$ is the two fully connected layers, and $\sigma$ is the Sigmoid; the resulting channel-level weights are multiplied channel-wise with the branch output, emphasising key information and suppressing noise.
On this basis we still keep ASPP's multi-scale parallelism: the first branch uses a 1×1 convolution and the last branch applies global average pooling followed by a convolution, while the three reworked middle branches perform window attention and then gated selection. The five branch features are concatenated in parallel and fused by a 1×1 convolution, giving a semantically enhanced multi-scale representation. Experiments indicate that AG_ASPP_MViT keeps or reduces the overall parameter count while, thanks to local attention and channel gating, showing stronger fine-grained segmentation on small objects, combining computational efficiency with semantic expressiveness.
What is the difference between window_size=(4,8,16) in WA-ASPP and patch_size=(2, 2) in AG_ASPP_MViT?
In both implementations, "window_size" and "patch_size" control how the feature map is partitioned for window/patch-based local self-attention, but their exact meaning and purpose differ slightly:
- window_size=(4,8,16) in WA-ASPP: three window sizes are used in parallel, one per middle branch, so the same layer captures local context at three scales, replacing the multiple dilation rates of the original ASPP.
- patch_size=(2,2) in AG_ASPP_MViT: a single fixed patch size that usually corresponds to one (or a few) MobileViTAttentionBlock. If you want more scales, you can extend it to (4,4), (8,8) and so on, but you then need several parallel paths or layer-wise switching to emulate a multi-scale effect.
- Multi-scale vs. a single/fixed window: patch_size=(2,2) is normally a fixed setting, and a single module does not branch over several window sizes (unless you write out multiple branches, each with (2,2)/(4,4)/(8,8)).
- Naming: window_size is the term used directly by Swin/WA-ASPP-style window attention ("window size"); patch_size comes from MobileViT-style or other ViT designs, where "patch" is associated with flattening and local self-attention, and the (H, W) feature map is split into ph×pw patches on which attention is computed.
- Multi-path vs. single-path: in WA-ASPP the branches with different window_size values exist in parallel, so the network captures several levels of local context at once; AG_ASPP_MViT applies a single patch_size=(2,2) to the middle branches and does not add (4,4)/(8,8) views. If you want multiple windows in parallel, you can design it after WA-ASPP's "multi-scale, multi-branch" pattern, as in the sketch after this list.
- Where the change is made: in WA-ASPP the window sizes replace the dilated convolutions at the branch level, whereas patch_size works inside the MobileViTAttentionBlock, splitting the feature map into (2,2) patches on top of which multi-head attention and channel gating are stacked.
In short, window_size=(4,8,16) means three window sizes run in parallel in the same layer, forming "multi-scale local-attention branches" that replace the multiple dilation rates of traditional ASPP, while patch_size=(2,2) is mostly a fixed window split for lightweight, single-scale local attention and is not necessarily run with (4,8,16) in parallel.
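If you do want AG_ASPP_MViT to mimic WA-ASPP's multi-scale windows, a minimal sketch follows; `build_multiscale_branches` is a hypothetical helper (not in the original code) and it assumes the `MobileViTAttentionBlock` class defined earlier in this conversation:

```python
import torch.nn as nn
# from your_model_file import MobileViTAttentionBlock   # the block defined earlier; file name is a placeholder

def build_multiscale_branches(dim_in: int = 320,
                              patch_sizes=((2, 2), (4, 4), (8, 8))) -> nn.ModuleList:
    """One MobileViTAttentionBlock per patch size, mimicking WA-ASPP's multi-scale windows."""
    return nn.ModuleList([
        MobileViTAttentionBlock(in_channels=dim_in, patch_size=ps,
                                transformer_depth=1, transformer_dim=96,
                                transformer_mlp_dim=192, heads=2, dropout=0.0)
        for ps in patch_sizes
    ])
```

The feature map fed to these branches then has to be divisible by the largest patch size (8 here), so the divisibility handling discussed earlier becomes more important.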
Traditional atrous spatial pyramid pooling (ASPP) usually relies on convolutions with different dilation rates to capture multi-scale context, but this significantly increases computation and performs poorly on small objects. This study therefore redesigns the ASPP module and proposes MSA (Multi-Scale-Attention)_ASPP: window attention and channel-attention gating are used in the middle three branches, while the first and fifth branches keep a simple convolution and global pooling, reducing redundant dilated convolutions and strengthening fine-grained object representation on top of the multi-branch parallel layout.
Polish the paragraph above: explain that although MSA increases the parameter count, the computation it needs does not grow much; then describe the weakness of the original ASPP on small objects and how MSA improves small-object detection. Make it more professional.
Polished version:
Traditional atrous spatial pyramid pooling (ASPP) relies on convolutions with different dilation rates to extract multi-scale context. Although this strengthens global perception, the large kernels noticeably increase the parameter count and the cost of the dilated convolutions, and the module struggles to capture small objects. This study therefore redesigns ASPP as MSA (Multi-Scale-Attention)_ASPP: the middle three branches introduce window-based multi-head self-attention and channel-attention gating, while the first and fifth branches keep a simple convolution and global pooling. Although this adds some parameters, self-attention is executed only inside local windows and the branch outputs are re-weighted by channel gating, so the actual computational resources required do not increase much. More importantly, traditional ASPP often loses detail on small objects because of its large dilation rates, whereas the fine local modelling provided by the "window attention + gating" of MSA_ASPP effectively improves the representation of small objects and subtle structures, clearly boosting accuracy and generalization without sacrificing much compute.
In "the fine local modelling provided by the 'window attention + gating' of MSA_ASPP", replace "fine local modelling" with a more professional term.
Through the fine-grained local representation mechanism of "window attention + gating" in MSA_ASPP, the model can focus better on small-object regions, significantly improving accuracy and generalization without a noticeable increase in computational cost.
Could it be changed to "local feature extraction" instead?
Yes. Here is the sentence with "local feature extraction" in place of the original "fine local modelling":
Through the local feature extraction mechanism realized by "window attention + gating" in MSA_ASPP, the model can concentrate more important information on small-object regions, effectively improving its ability to distinguish subtle structures without significantly increasing computation.
What is the full English name of W-MSA in multi-head self-attention (W-MSA)?
W-MSA stands for "Window-based Multi-Head Self-Attention", often shortened to Window-MSA or W-MSA. It was first proposed in the Swin Transformer and performs multi-head self-attention inside local windows of the feature map, capturing local correlations while keeping computation efficient.
Does MSA-ASPP perform multi-head self-attention inside local windows of the feature map?
Yes. MSA-ASPP does use window attention (Window-based Multi-Head Self-Attention, W-MSA) in its middle branches: the feature map is partitioned into local windows and multi-head self-attention runs inside each window, extracting fine local features. Unlike the large dilated convolutions of traditional ASPP that operate on global features, this window-wise attention captures small objects and subtle structures more precisely while still retaining an understanding of the larger scene context; the complexity comparison below makes the efficiency argument concrete.
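For reference, the standard complexity comparison from the Swin Transformer paper (quoted here from memory, not derived in this conversation) for an $h\times w$ feature map with $C$ channels and window size $M$ is:

$$\Omega(\mathrm{MSA})=4hwC^{2}+2(hw)^{2}C,\qquad \Omega(\text{W-MSA})=4hwC^{2}+2M^{2}hwC$$

so restricting attention to $M\times M$ windows turns the quadratic dependence on $hw$ into a linear one.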
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
from einops import rearrange
from nets.xception import xception
from nets.mobilenetv2 import mobilenetv2
############################################
############################################
class MobileNetV2(nn.Module):
def init(self, downsample_factor=8, pretrained=True):
super(MobileNetV2, self).init()
from functools import partial
model = mobilenetv2(pretrained)
self.features = model.features[:-1]
textself.total_idx = len(self.features) self.down_idx = [2, 4, 7, 14] if downsample_factor == 8: for i in range(self.down_idx[-2], self.down_idx[-1]): self.features[i].apply(partial(self._nostride_dilate, dilate=2)) for i in range(self.down_idx[-1], self.total_idx): self.features[i].apply(partial(self._nostride_dilate, dilate=4)) elif downsample_factor == 16: for i in range(self.down_idx[-1], self.total_idx): self.features[i].apply(partial(self._nostride_dilate, dilate=2)) def _nostride_dilate(self, m, dilate): classname = m.__class__.__name__ if classname.find('Conv') != -1: if m.stride == (2, 2): m.stride = (1, 1) if m.kernel_size == (3, 3): m.dilation = (dilate // 2, dilate // 2) m.padding = (dilate // 2, dilate // 2) else: if m.kernel_size == (3, 3): m.dilation = (dilate, dilate) m.padding = (dilate, dilate) def forward(self, x): # 前4层 -> low_level_features low_level_features = self.features[:4](x) # 后面 -> deep_features x = self.features[4:](low_level_features) return low_level_features, x
############################################
############################################
def count_parameters(model):
return sum(p.numel() for p in model.parameters() if p.requires_grad)
############################################
############################################
class TransformerBlockGroup(nn.Module):
def init(self, dim, depth, heads, mlp_dim, dropout=0.0):
super().init()
self.layers = nn.ModuleList([
TransformerBlockLayer(dim, heads, mlp_dim, dropout)
for _ in range(depth)
])
textdef forward(self, x): for layer in self.layers: x = layer(x) return x
class TransformerBlockLayer(nn.Module):
def init(self, dim, heads, mlp_dim, dropout=0.0):
super().init()
self.norm1 = nn.LayerNorm(dim)
self.attn = MultiHeadAttention(dim, heads, dropout)
self.norm2 = nn.LayerNorm(dim)
self.ffn = FeedForward(dim, mlp_dim, dropout)
textdef forward(self, x): shortcut = x x_ln = self.norm1(x) x_attn = self.attn(x_ln) x = x_attn + shortcut shortcut2 = x x_ln2 = self.norm2(x) x_ffn = self.ffn(x_ln2) x = x_ffn + shortcut2 return x
class MultiHeadAttention(nn.Module):
def init(self, dim, heads=2, dropout=0.0):
super().init()
self.heads = heads
self.dim_head = dim // heads
inner_dim = self.dim_head * heads
textself.scale = self.dim_head ** -0.5 self.to_qkv = nn.Linear(dim, inner_dim * 3, bias=False) self.attend = nn.Softmax(dim=-1) self.proj = nn.Sequential( nn.Linear(inner_dim, dim), nn.Dropout(dropout) ) def forward(self, x): B, n_patch, patch_area, dim = x.shape qkv = self.to_qkv(x) q, k, v = torch.chunk(qkv, 3, dim=-1) # each (B,n_patch,patch_area,inner_dim//3) q = q.reshape(B, n_patch, patch_area, self.heads, self.dim_head).permute(0, 1, 3, 2, 4) k = k.reshape(B, n_patch, patch_area, self.heads, self.dim_head).permute(0, 1, 3, 2, 4) v = v.reshape(B, n_patch, patch_area, self.heads, self.dim_head).permute(0, 1, 3, 2, 4) dots = torch.matmul(q, k.transpose(-1, -2)) * self.scale attn = self.attend(dots) out = torch.matmul(attn, v) out = out.permute(0, 1, 3, 2, 4).contiguous() out = out.view(B, n_patch, patch_area, self.heads * self.dim_head) out = self.proj(out) return out
class FeedForward(nn.Module):
def init(self, dim, hidden_dim, dropout=0.0):
super().init()
self.net = nn.Sequential(
nn.Linear(dim, hidden_dim),
nn.SiLU(),
nn.Dropout(dropout),
nn.Linear(hidden_dim, dim),
nn.Dropout(dropout)
)
textdef forward(self, x): return self.net(x)
############################################
############################################
class MobileViTAttentionBlock(nn.Module):
def init(self,
in_channels=320,
patch_size=(2, 2),
transformer_depth=1,
transformer_dim=96,
transformer_mlp_dim=192,
heads=2,
dropout=0.0):
super().init()
self.ph, self.pw = patch_size
text# 局部conv self.local_conv1 = nn.Conv2d(in_channels, in_channels, 3, padding=1, bias=False) self.local_bn1 = nn.BatchNorm2d(in_channels) self.local_act1 = nn.SiLU() # conv => transformer_dim self.conv_proj_in = nn.Conv2d(in_channels, transformer_dim, 1, bias=False) self.bn_proj_in = nn.BatchNorm2d(transformer_dim) self.act_proj_in = nn.SiLU() # Transformer self.transformer = TransformerBlockGroup( dim=transformer_dim, depth=transformer_depth, heads=heads, mlp_dim=transformer_mlp_dim, dropout=dropout ) # 回到 in_channels self.conv_proj_out = nn.Conv2d(transformer_dim, in_channels, 1, bias=False) self.bn_proj_out = nn.BatchNorm2d(in_channels) self.act_proj_out = nn.SiLU() # concat => 3x3 self.local_conv2 = nn.Conv2d(in_channels * 2, in_channels, 3, padding=1, bias=False) self.local_bn2 = nn.BatchNorm2d(in_channels) self.local_act2 = nn.SiLU() def forward(self, x): y = x.clone() x = self.local_conv1(x) x = self.local_bn1(x) x = self.local_act1(x) x = self.conv_proj_in(x) x = self.bn_proj_in(x) x = self.act_proj_in(x) B, D, H, W = x.shape assert H % self.ph == 0 and W % self.pw == 0, "H,W需能被patch_size整除" x = rearrange(x, 'b d (hh ph) (ww pw) -> b (hh ww) (ph pw) d', ph=self.ph, pw=self.pw) x = self.transformer(x) x = rearrange(x, 'b (hh ww) (ph pw) d -> b d (hh ph) (ww pw)', hh=H // self.ph, ww=W // self.pw, ph=self.ph, pw=self.pw) x = self.conv_proj_out(x) x = self.bn_proj_out(x) x = self.act_proj_out(x) x = torch.cat([x, y], dim=1) x = self.local_conv2(x) x = self.local_bn2(x) x = self.local_act2(x) return x
############################################
############################################
class AttnGate(nn.Module):
"""
类似 SE: 全局池化 -> MLP -> Sigmoid => 通道加权
"""
textdef __init__(self, channels, reduction=16): super().__init__() hidden_dim = max(1, channels // reduction) self.pool = nn.AdaptiveAvgPool2d(1) self.fc = nn.Sequential( nn.Linear(channels, hidden_dim), nn.ReLU(inplace=True), nn.Linear(hidden_dim, channels), nn.Sigmoid() ) def forward(self, x): B, C, H, W = x.shape y = self.pool(x).view(B, C) # => (B,C) y = self.fc(y) # => (B,C) y = y.view(B, C, 1, 1) return x * y
############################################
############################################
class AG_ASPP_MViT(nn.Module):
"""
5条分支, branch1 + branch5 没有门控,
branch2/3/4 做 MobileViT + gating,
最后并行拼接 => fuse
"""
textdef __init__(self, dim_in=320, dim_out=256, patch_size=(2, 2), mvit_depth=1, mvit_dim=96, mvit_mlp_dim=192, mvit_heads=2, mvit_dropout=0.0, use_groupnorm=True): super().__init__() # branch1 => 1×1 => dim_out, 无 gating self.branch1 = nn.Sequential( nn.Conv2d(dim_in, dim_out, 1), nn.BatchNorm2d(dim_out), nn.ReLU(inplace=True) ) # branch2 => MViT => 1×1 => gating self.branch2_attn = MobileViTAttentionBlock( in_channels=dim_in, patch_size=patch_size, transformer_depth=mvit_depth, transformer_dim=mvit_dim, transformer_mlp_dim=mvit_mlp_dim, heads=mvit_heads, dropout=mvit_dropout ) self.branch2_proj = nn.Conv2d(dim_in, dim_out, 1, bias=False) self.gate2 = AttnGate(dim_out) # branch3 => 同理 self.branch3_attn = MobileViTAttentionBlock( in_channels=dim_in, patch_size=patch_size, transformer_depth=mvit_depth, transformer_dim=mvit_dim, transformer_mlp_dim=mvit_mlp_dim, heads=mvit_heads, dropout=mvit_dropout ) self.branch3_proj = nn.Conv2d(dim_in, dim_out, 1, bias=False) self.gate3 = AttnGate(dim_out) # branch4 => 同理 self.branch4_attn = MobileViTAttentionBlock( in_channels=dim_in, patch_size=patch_size, transformer_depth=mvit_depth, transformer_dim=mvit_dim, transformer_mlp_dim=mvit_mlp_dim, heads=mvit_heads, dropout=mvit_dropout ) self.branch4_proj = nn.Conv2d(dim_in, dim_out, 1, bias=False) self.gate4 = AttnGate(dim_out) # branch5 => global pooling => conv => norm => relu => up (无 gating) self.branch5_conv = nn.Conv2d(dim_in, dim_out, kernel_size=1) if use_groupnorm: # 避免 (B=1, C=128,1,1) 的 BN 报错 self.branch5_norm = nn.GroupNorm(num_groups=1, num_channels=dim_out) else: self.branch5_norm = nn.BatchNorm2d(dim_out) self.branch5_relu = nn.ReLU(inplace=True) # fuse => 1x1 => dim_out self.fuse_conv = nn.Sequential( nn.Conv2d(dim_out * 5, dim_out, 1), nn.BatchNorm2d(dim_out), nn.ReLU(inplace=True) ) def forward(self, x): B, C, H, W = x.shape # branch1 (no gating) out1 = self.branch1(x) # => (B,128,H,W) # branch2 => gating x2 = self.branch2_attn(x) out2 = self.branch2_proj(x2) out2 = self.gate2(out2) # branch3 => gating x3 = self.branch3_attn(x) out3 = self.branch3_proj(x3) out3 = self.gate3(out3) # branch4 => gating x4 = self.branch4_attn(x) out4 = self.branch4_proj(x4) out4 = self.gate4(out4) # branch5 (no gating) gp = F.adaptive_avg_pool2d(x, 1) # => (B,C,1,1) gp = self.branch5_conv(gp) # => (B,128,1,1) gp = self.branch5_norm(gp) # => BN or GN gp = self.branch5_relu(gp) gp = F.interpolate(gp, size=(H, W), mode='bilinear', align_corners=False) # 并行拼接 => fuse out_cat = torch.cat([out1, out2, out3, out4, gp], dim=1) out = self.fuse_conv(out_cat) return out
############################################
############################################
class DeepLab(nn.Module):
def init(self, num_classes, backbone="mobilenet", pretrained=True, downsample_factor=16):
super(DeepLab, self).init()
if backbone == "xception":
self.backbone = xception(downsample_factor=downsample_factor, pretrained=pretrained)
in_channels = 2048
low_level_channels = 256
elif backbone == "mobilenet":
self.backbone = MobileNetV2(downsample_factor=downsample_factor, pretrained=pretrained)
in_channels = 320
low_level_channels = 24
else:
raise ValueError('Unsupported backbone - {}
, Use mobilenet, xception.'.format(backbone))
text# 这里改成AG_ASPP_MViT, 并仅对中间3分支做gating self.aspp = AG_ASPP_MViT( dim_in=in_channels, dim_out=256, patch_size=(2, 2), mvit_depth=1, mvit_dim=96, mvit_mlp_dim=192, mvit_heads=2, mvit_dropout=0.0, use_groupnorm=True # 避免 (1,128,1,1) BN报错 ) # 浅层 => 1x1 => 24 self.shortcut_conv = nn.Sequential( nn.Conv2d(low_level_channels, 48, 1), nn.BatchNorm2d(48), nn.ReLU(inplace=True) ) # cat_conv => (24+128)->128 self.cat_conv = nn.Sequential( nn.Conv2d(48 + 256, 256, 3, padding=1), nn.BatchNorm2d(256), nn.ReLU(inplace=True), nn.Dropout(0.5), nn.Conv2d(256, 256, 3, padding=1), nn.BatchNorm2d(256), nn.ReLU(inplace=True), nn.Dropout(0.1), ) self.cls_conv = nn.Conv2d(256, num_classes, 1) def forward(self, x): H, W = x.shape[2], x.shape[3] low_level, deep = self.backbone(x) deep = self.aspp(deep) # => (B,128, ...) low_level = self.shortcut_conv(low_level) deep_up = F.interpolate(deep, size=(low_level.size(2), low_level.size(3)), mode='bilinear', align_corners=False) x_cat = torch.cat([deep_up, low_level], dim=1) x_cat = self.cat_conv(x_cat) x_cat = self.cls_conv(x_cat) x_cat = F.interpolate(x_cat, size=(H, W), mode='bilinear', align_corners=False) return x_cat
############################################
############################################
if name == "main":
model = DeepLab(num_classes=3, backbone="mobilenet", pretrained=False, downsample_factor=16)
print("模型参数总数:", count_parameters(model))
textimg = torch.randn(1, 3, 512, 512) pred = model(img) print("输出维度:", pred.shape)
Where in the code is the feature map divided into local windows?
In this example, the key operations that split the feature map into local windows (patches) are the two `rearrange` statements inside `MobileViTAttentionBlock`:

```python
x = rearrange(x, 'b d (hh ph) (ww pw) -> b (hh ww) (ph pw) d', ph=self.ph, pw=self.pw)
...
x = rearrange(x, 'b (hh ww) (ph pw) d -> b d (hh ph) (ww pw)',
              hh=H // self.ph, ww=W // self.pw, ph=self.ph, pw=self.pw)
```

The first pattern, `'b d (hh ph) (ww pw) -> b (hh ww) (ph pw) d'`, cuts the (H, W) plane into hh×ww windows of ph×pw positions each, so the transformer that follows attends only within each window; the second pattern, `'b (hh ww) (ph pw) d -> b d (hh ph) (ww pw)'`, stitches the windows back into the original spatial layout. It is therefore these `rearrange(...)` calls that implement, at the code level, "split the feature map into local windows (patches), then run attention inside each window". The snippet below shows the shapes concretely.
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
from einops import rearrange
from nets.xception import xception
from nets.mobilenetv2 import mobilenetv2
############################################
############################################
class MobileNetV2(nn.Module):
def init(self, downsample_factor=8, pretrained=True):
super(MobileNetV2, self).init()
from functools import partial
model = mobilenetv2(pretrained)
self.features = model.features[:-1]
textself.total_idx = len(self.features) self.down_idx = [2, 4, 7, 14] if downsample_factor == 8: for i in range(self.down_idx[-2], self.down_idx[-1]): self.features[i].apply(partial(self._nostride_dilate, dilate=2)) for i in range(self.down_idx[-1], self.total_idx): self.features[i].apply(partial(self._nostride_dilate, dilate=4)) elif downsample_factor == 16: for i in range(self.down_idx[-1], self.total_idx): self.features[i].apply(partial(self._nostride_dilate, dilate=2)) def _nostride_dilate(self, m, dilate): classname = m.__class__.__name__ if classname.find('Conv') != -1: if m.stride == (2, 2): m.stride = (1, 1) if m.kernel_size == (3, 3): m.dilation = (dilate // 2, dilate // 2) m.padding = (dilate // 2, dilate // 2) else: if m.kernel_size == (3, 3): m.dilation = (dilate, dilate) m.padding = (dilate, dilate) def forward(self, x): # 前4层 -> low_level_features low_level_features = self.features[:4](x) # 后面 -> deep_features x = self.features[4:](low_level_features) return low_level_features, x
############################################
############################################
def count_parameters(model):
return sum(p.numel() for p in model.parameters() if p.requires_grad)
############################################
############################################
class TransformerBlockGroup(nn.Module):
def init(self, dim, depth, heads, mlp_dim, dropout=0.0):
super().init()
self.layers = nn.ModuleList([
TransformerBlockLayer(dim, heads, mlp_dim, dropout)
for _ in range(depth)
])
textdef forward(self, x): for layer in self.layers: x = layer(x) return x
class TransformerBlockLayer(nn.Module):
def init(self, dim, heads, mlp_dim, dropout=0.0):
super().init()
self.norm1 = nn.LayerNorm(dim)
self.attn = MultiHeadAttention(dim, heads, dropout)
self.norm2 = nn.LayerNorm(dim)
self.ffn = FeedForward(dim, mlp_dim, dropout)
textdef forward(self, x): shortcut = x x_ln = self.norm1(x) x_attn = self.attn(x_ln) x = x_attn + shortcut shortcut2 = x x_ln2 = self.norm2(x) x_ffn = self.ffn(x_ln2) x = x_ffn + shortcut2 return x
class MultiHeadAttention(nn.Module):
def init(self, dim, heads=2, dropout=0.0):
super().init()
self.heads = heads
self.dim_head = dim // heads
inner_dim = self.dim_head * heads
textself.scale = self.dim_head ** -0.5 self.to_qkv = nn.Linear(dim, inner_dim * 3, bias=False) self.attend = nn.Softmax(dim=-1) self.proj = nn.Sequential( nn.Linear(inner_dim, dim), nn.Dropout(dropout) ) def forward(self, x): B, n_patch, patch_area, dim = x.shape qkv = self.to_qkv(x) q, k, v = torch.chunk(qkv, 3, dim=-1) # each (B,n_patch,patch_area,inner_dim//3) q = q.reshape(B, n_patch, patch_area, self.heads, self.dim_head).permute(0, 1, 3, 2, 4) k = k.reshape(B, n_patch, patch_area, self.heads, self.dim_head).permute(0, 1, 3, 2, 4) v = v.reshape(B, n_patch, patch_area, self.heads, self.dim_head).permute(0, 1, 3, 2, 4) dots = torch.matmul(q, k.transpose(-1, -2)) * self.scale attn = self.attend(dots) out = torch.matmul(attn, v) out = out.permute(0, 1, 3, 2, 4).contiguous() out = out.view(B, n_patch, patch_area, self.heads * self.dim_head) out = self.proj(out) return out
class FeedForward(nn.Module):
def init(self, dim, hidden_dim, dropout=0.0):
super().init()
self.net = nn.Sequential(
nn.Linear(dim, hidden_dim),
nn.SiLU(),
nn.Dropout(dropout),
nn.Linear(hidden_dim, dim),
nn.Dropout(dropout)
)
textdef forward(self, x): return self.net(x)
############################################
############################################
class MobileViTAttentionBlock(nn.Module):
def init(self,
in_channels=320,
patch_size=(2, 2),
transformer_depth=1,
transformer_dim=96,
transformer_mlp_dim=192,
heads=2,
dropout=0.0):
super().init()
self.ph, self.pw = patch_size
text# 局部conv self.local_conv1 = nn.Conv2d(in_channels, in_channels, 3, padding=1, bias=False) self.local_bn1 = nn.BatchNorm2d(in_channels) self.local_act1 = nn.SiLU() # conv => transformer_dim self.conv_proj_in = nn.Conv2d(in_channels, transformer_dim, 1, bias=False) self.bn_proj_in = nn.BatchNorm2d(transformer_dim) self.act_proj_in = nn.SiLU() # Transformer self.transformer = TransformerBlockGroup( dim=transformer_dim, depth=transformer_depth, heads=heads, mlp_dim=transformer_mlp_dim, dropout=dropout ) # 回到 in_channels self.conv_proj_out = nn.Conv2d(transformer_dim, in_channels, 1, bias=False) self.bn_proj_out = nn.BatchNorm2d(in_channels) self.act_proj_out = nn.SiLU() # concat => 3x3 self.local_conv2 = nn.Conv2d(in_channels * 2, in_channels, 3, padding=1, bias=False) self.local_bn2 = nn.BatchNorm2d(in_channels) self.local_act2 = nn.SiLU() def forward(self, x): y = x.clone() x = self.local_conv1(x) x = self.local_bn1(x) x = self.local_act1(x) x = self.conv_proj_in(x) x = self.bn_proj_in(x) x = self.act_proj_in(x) B, D, H, W = x.shape assert H % self.ph == 0 and W % self.pw == 0, "H,W需能被patch_size整除" x = rearrange(x, 'b d (hh ph) (ww pw) -> b (hh ww) (ph pw) d', ph=self.ph, pw=self.pw) x = self.transformer(x) x = rearrange(x, 'b (hh ww) (ph pw) d -> b d (hh ph) (ww pw)', hh=H // self.ph, ww=W // self.pw, ph=self.ph, pw=self.pw) x = self.conv_proj_out(x) x = self.bn_proj_out(x) x = self.act_proj_out(x) x = torch.cat([x, y], dim=1) x = self.local_conv2(x) x = self.local_bn2(x) x = self.local_act2(x) return x
############################################
############################################
class AttnGate(nn.Module):
"""
类似 SE: 全局池化 -> MLP -> Sigmoid => 通道加权
"""
textdef __init__(self, channels, reduction=16): super().__init__() hidden_dim = max(1, channels // reduction) self.pool = nn.AdaptiveAvgPool2d(1) self.fc = nn.Sequential( nn.Linear(channels, hidden_dim), nn.ReLU(inplace=True), nn.Linear(hidden_dim, channels), nn.Sigmoid() ) def forward(self, x): B, C, H, W = x.shape y = self.pool(x).view(B, C) # => (B,C) y = self.fc(y) # => (B,C) y = y.view(B, C, 1, 1) return x * y
############################################
############################################
class AG_ASPP_MViT(nn.Module):
"""
5条分支, branch1 + branch5 没有门控,
branch2/3/4 做 MobileViT + gating,
最后并行拼接 => fuse
"""
textdef __init__(self, dim_in=320, dim_out=256, patch_size=(2, 2), mvit_depth=1, mvit_dim=96, mvit_mlp_dim=192, mvit_heads=2, mvit_dropout=0.0, use_groupnorm=True): super().__init__() # branch1 => 1×1 => dim_out, 无 gating self.branch1 = nn.Sequential( nn.Conv2d(dim_in, dim_out, 1), nn.BatchNorm2d(dim_out), nn.ReLU(inplace=True) ) # branch2 => MViT => 1×1 => gating self.branch2_attn = MobileViTAttentionBlock( in_channels=dim_in, patch_size=patch_size, transformer_depth=mvit_depth, transformer_dim=mvit_dim, transformer_mlp_dim=mvit_mlp_dim, heads=mvit_heads, dropout=mvit_dropout ) self.branch2_proj = nn.Conv2d(dim_in, dim_out, 1, bias=False) self.gate2 = AttnGate(dim_out) # branch3 => 同理 self.branch3_attn = MobileViTAttentionBlock( in_channels=dim_in, patch_size=patch_size, transformer_depth=mvit_depth, transformer_dim=mvit_dim, transformer_mlp_dim=mvit_mlp_dim, heads=mvit_heads, dropout=mvit_dropout ) self.branch3_proj = nn.Conv2d(dim_in, dim_out, 1, bias=False) self.gate3 = AttnGate(dim_out) # branch4 => 同理 self.branch4_attn = MobileViTAttentionBlock( in_channels=dim_in, patch_size=patch_size, transformer_depth=mvit_depth, transformer_dim=mvit_dim, transformer_mlp_dim=mvit_mlp_dim, heads=mvit_heads, dropout=mvit_dropout ) self.branch4_proj = nn.Conv2d(dim_in, dim_out, 1, bias=False) self.gate4 = AttnGate(dim_out) # branch5 => global pooling => conv => norm => relu => up (无 gating) self.branch5_conv = nn.Conv2d(dim_in, dim_out, kernel_size=1) if use_groupnorm: # 避免 (B=1, C=128,1,1) 的 BN 报错 self.branch5_norm = nn.GroupNorm(num_groups=1, num_channels=dim_out) else: self.branch5_norm = nn.BatchNorm2d(dim_out) self.branch5_relu = nn.ReLU(inplace=True) # fuse => 1x1 => dim_out self.fuse_conv = nn.Sequential( nn.Conv2d(dim_out * 5, dim_out, 1), nn.BatchNorm2d(dim_out), nn.ReLU(inplace=True) ) def forward(self, x): B, C, H, W = x.shape # branch1 (no gating) out1 = self.branch1(x) # => (B,128,H,W) # branch2 => gating x2 = self.branch2_attn(x) out2 = self.branch2_proj(x2) out2 = self.gate2(out2) # branch3 => gating x3 = self.branch3_attn(x) out3 = self.branch3_proj(x3) out3 = self.gate3(out3) # branch4 => gating x4 = self.branch4_attn(x) out4 = self.branch4_proj(x4) out4 = self.gate4(out4) # branch5 (no gating) gp = F.adaptive_avg_pool2d(x, 1) # => (B,C,1,1) gp = self.branch5_conv(gp) # => (B,128,1,1) gp = self.branch5_norm(gp) # => BN or GN gp = self.branch5_relu(gp) gp = F.interpolate(gp, size=(H, W), mode='bilinear', align_corners=False) # 并行拼接 => fuse out_cat = torch.cat([out1, out2, out3, out4, gp], dim=1) out = self.fuse_conv(out_cat) return out
############################################
############################################
class DeepLab(nn.Module):
def init(self, num_classes, backbone="mobilenet", pretrained=True, downsample_factor=16):
super(DeepLab, self).init()
if backbone == "xception":
self.backbone = xception(downsample_factor=downsample_factor, pretrained=pretrained)
in_channels = 2048
low_level_channels = 256
elif backbone == "mobilenet":
self.backbone = MobileNetV2(downsample_factor=downsample_factor, pretrained=pretrained)
in_channels = 320
low_level_channels = 24
else:
raise ValueError('Unsupported backbone - {}
, Use mobilenet, xception.'.format(backbone))
text# 这里改成AG_ASPP_MViT, 并仅对中间3分支做gating self.aspp = AG_ASPP_MViT( dim_in=in_channels, dim_out=256, patch_size=(2, 2), mvit_depth=1, mvit_dim=96, mvit_mlp_dim=192, mvit_heads=2, mvit_dropout=0.0, use_groupnorm=True # 避免 (1,128,1,1) BN报错 ) # 浅层 => 1x1 => 24 self.shortcut_conv = nn.Sequential( nn.Conv2d(low_level_channels, 48, 1), nn.BatchNorm2d(48), nn.ReLU(inplace=True) ) # cat_conv => (24+128)->128 self.cat_conv = nn.Sequential( nn.Conv2d(48 + 256, 256, 3, padding=1), nn.BatchNorm2d(256), nn.ReLU(inplace=True), nn.Dropout(0.5), nn.Conv2d(256, 256, 3, padding=1), nn.BatchNorm2d(256), nn.ReLU(inplace=True), nn.Dropout(0.1), ) self.cls_conv = nn.Conv2d(256, num_classes, 1) def forward(self, x): H, W = x.shape[2], x.shape[3] low_level, deep = self.backbone(x) deep = self.aspp(deep) # => (B,128, ...) low_level = self.shortcut_conv(low_level) deep_up = F.interpolate(deep, size=(low_level.size(2), low_level.size(3)), mode='bilinear', align_corners=False) x_cat = torch.cat([deep_up, low_level], dim=1) x_cat = self.cat_conv(x_cat) x_cat = self.cls_conv(x_cat) x_cat = F.interpolate(x_cat, size=(H, W), mode='bilinear', align_corners=False) return x_cat
############################################
############################################
if name == "main":
model = DeepLab(num_classes=3, backbone="mobilenet", pretrained=False, downsample_factor=16)
print("模型参数总数:", count_parameters(model))
textimg = torch.randn(1, 3, 512, 512) pred = model(img) print("输出维度:", pred.shape)
For the first 80 training epochs the metrics rose fairly steadily, but they dropped in the last 20 epochs. What are good ways to improve this?
When performance keeps improving for the first 80 epochs but falls back in the last 20 (accuracy drops, loss rises), the usual suspects are overfitting or a learning-rate schedule that no longer matches the training stage. Depending on the situation, the following ideas can be used alone or combined:
- Check the learning-rate schedule (e.g. cosine annealing, or a lower final learning rate).
- Increase regularization (weight decay, dropout).
- Use early stopping and keep the best-validation checkpoint.
- Try mixed precision / a larger batch size.
- Add data augmentation.
- Freeze some of the attention branches or reduce the gating strength late in training.
In short, a late drop in performance is usually caused by overfitting or a poor learning-rate plan; start with the LR, stronger regularization, early stopping and augmentation. If the attention branches add a lot of freedom, freezing or weakening some of their learnable parameters late in training can also help keep the model stable and preserve its fine-grained behaviour on small objects. A sketch of the first two ideas follows.
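A minimal sketch of the first two suggestions (cosine learning-rate schedule plus early stopping). The model here is a stand-in, the training/validation steps are placeholders for your own loop, and the hyperparameter values are illustrative only:

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 3, 3, padding=1)     # stand-in for DeepLab; plug in your own model
optimizer = torch.optim.SGD(model.parameters(), lr=7e-3, momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100, eta_min=7e-5)

best_miou, patience, bad_epochs = 0.0, 10, 0
for epoch in range(100):
    # train_one_epoch(model, optimizer)   # your existing training step goes here
    val_miou = 0.0                        # replace with your validation mIoU
    scheduler.step()
    if val_miou > best_miou:
        best_miou, bad_epochs = val_miou, 0
        torch.save(model.state_dict(), "best_epoch_weights.pth")
    else:
        bad_epochs += 1
        if bad_epochs >= patience:        # early stopping once validation stalls
            break
```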
C:\Users\wztbkl\anaconda3\envs\wuzhupytorch\python.exe C:\Users\wztbkl\PycharmProjects\deeplabv3-plus-pytorch-main\nets\GGCA_mobilenetV2_mobileVitASPP_deeplabV3_low.py
模型参数总数: 12611684
Traceback (most recent call last):
File "C:\Users\wztbkl\PycharmProjects\deeplabv3-plus-pytorch-main\nets\GGCA_mobilenetV2_mobileVitASPP_deeplabV3_low.py", line 456, in <module>
pred = model(img)
File "C:\Users\wztbkl\anaconda3\envs\wuzhupytorch\lib\site-packages\torch\nn\modules\module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\Users\wztbkl\anaconda3\envs\wuzhupytorch\lib\site-packages\torch\nn\modules\module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\wztbkl\PycharmProjects\deeplabv3-plus-pytorch-main\nets\GGCA_mobilenetV2_mobileVitASPP_deeplabV3_low.py", line 438, in forward
low_level = self.shortcut_conv(low_level)
File "C:\Users\wztbkl\anaconda3\envs\wuzhupytorch\lib\site-packages\torch\nn\modules\module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\Users\wztbkl\anaconda3\envs\wuzhupytorch\lib\site-packages\torch\nn\modules\module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\wztbkl\anaconda3\envs\wuzhupytorch\lib\site-packages\torch\nn\modules\container.py", line 250, in forward
input = module(input)
File "C:\Users\wztbkl\anaconda3\envs\wuzhupytorch\lib\site-packages\torch\nn\modules\module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\Users\wztbkl\anaconda3\envs\wuzhupytorch\lib\site-packages\torch\nn\modules\module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\wztbkl\anaconda3\envs\wuzhupytorch\lib\site-packages\torch\nn\modules\conv.py", line 554, in forward
return self._conv_forward(input, self.weight, self.bias)
File "C:\Users\wztbkl\anaconda3\envs\wuzhupytorch\lib\site-packages\torch\nn\modules\conv.py", line 549, in _conv_forward
return F.conv2d(
RuntimeError: Given groups=1, weight of size [48, 24, 1, 1], expected input[1, 32, 128, 128] to have 24 channels, but got 32 channels instead
进程已结束,退出代码为 1
From the error message:
RuntimeError: Given groups=1, weight of size [48, 24, 1, 1], expected input[1, 32, 128, 128] to have 24 channels, but got 32 channels instead
we can see that the convolution layer processing the `low_level` features expects 24 input channels, but the feature map actually passed in has 32 channels. This is a classic channel mismatch.
`low_level_channels` does not match the real channel count. In your code, the shallow output `low_level_features` is typically declared as

```python
low_level_channels = 24
```

and `shortcut_conv` is then built as

```python
self.shortcut_conv = nn.Sequential(
    nn.Conv2d(low_level_channels, 48, 1),
    ...
)
```

In your current MobileNetV2 implementation, however, the output of the 4th feature block is apparently (B, 32, H, W), so the forward pass hits the **"expected 24 channels, got 32"** conflict.
How to fix it:
- Correct the `low_level_channels` setting: change `low_level_channels=24` to `low_level_channels=32`, i.e. something like `nn.Conv2d(32, 48, 1)`.
- Or confirm the backbone's shallow output: keep the definition of `low_level_channels`, but change where `low_level_features` is taken from (e.g. `self.features[:3]` or another cut point) so that the channel counts agree.
- Either way, keep the input channels of `shortcut_conv` consistent with the backbone output: the `in_channels` of the `nn.Conv2d(...)` just has to match `low_level_features.size(1)`.
Root cause: the network definition assumes the shallow feature has 24 channels while the actual MobileNetV2 shallow output has 32, so Conv2d expects 24 channels but gets 32. Fix: change `low_level_channels` or change the backbone cut point so the two agree, and the `RuntimeError: expected ... to have 24 channels, but got 32 channels instead` disappears. A small probing snippet follows.
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
from einops import rearrange
from nets.xception import xception
from nets.mobilenetv2 import mobilenetv2
############################################
############################################
class MobileNetV2(nn.Module):
def init(self, downsample_factor=8, pretrained=True):
super(MobileNetV2, self).init()
from functools import partial
model = mobilenetv2(pretrained)
self.features = model.features[:-1]
textself.total_idx = len(self.features) self.down_idx = [2, 4, 7, 14] if downsample_factor == 8: for i in range(self.down_idx[-2], self.down_idx[-1]): self.features[i].apply(partial(self._nostride_dilate, dilate=2)) for i in range(self.down_idx[-1], self.total_idx): self.features[i].apply(partial(self._nostride_dilate, dilate=4)) elif downsample_factor == 16: for i in range(self.down_idx[-1], self.total_idx): self.features[i].apply(partial(self._nostride_dilate, dilate=2)) def _nostride_dilate(self, m, dilate): classname = m.__class__.__name__ if classname.find('Conv') != -1: if m.stride == (2, 2): m.stride = (1, 1) if m.kernel_size == (3, 3): m.dilation = (dilate // 2, dilate // 2) m.padding = (dilate // 2, dilate // 2) else: if m.kernel_size == (3, 3): m.dilation = (dilate, dilate) m.padding = (dilate, dilate) def forward(self, x): # 前4层 -> low_level_features low_level_features = self.features[:4](x) # 后面 -> deep_features x = self.features[4:](low_level_features) return low_level_features, x
############################################
############################################
def count_parameters(model):
return sum(p.numel() for p in model.parameters() if p.requires_grad)
############################################
############################################
class TransformerBlockGroup(nn.Module):
def init(self, dim, depth, heads, mlp_dim, dropout=0.0):
super().init()
self.layers = nn.ModuleList([
TransformerBlockLayer(dim, heads, mlp_dim, dropout)
for _ in range(depth)
])
textdef forward(self, x): for layer in self.layers: x = layer(x) return x
class TransformerBlockLayer(nn.Module):
def init(self, dim, heads, mlp_dim, dropout=0.0):
super().init()
self.norm1 = nn.LayerNorm(dim)
self.attn = MultiHeadAttention(dim, heads, dropout)
self.norm2 = nn.LayerNorm(dim)
self.ffn = FeedForward(dim, mlp_dim, dropout)
textdef forward(self, x): shortcut = x x_ln = self.norm1(x) x_attn = self.attn(x_ln) x = x_attn + shortcut shortcut2 = x x_ln2 = self.norm2(x) x_ffn = self.ffn(x_ln2) x = x_ffn + shortcut2 return x
class MultiHeadAttention(nn.Module):
def init(self, dim, heads=2, dropout=0.0):
super().init()
self.heads = heads
self.dim_head = dim // heads
inner_dim = self.dim_head * heads
textself.scale = self.dim_head ** -0.5 self.to_qkv = nn.Linear(dim, inner_dim * 3, bias=False) self.attend = nn.Softmax(dim=-1) self.proj = nn.Sequential( nn.Linear(inner_dim, dim), nn.Dropout(dropout) ) def forward(self, x): B, n_patch, patch_area, dim = x.shape qkv = self.to_qkv(x) q, k, v = torch.chunk(qkv, 3, dim=-1) # each (B,n_patch,patch_area,inner_dim//3) q = q.reshape(B, n_patch, patch_area, self.heads, self.dim_head).permute(0, 1, 3, 2, 4) k = k.reshape(B, n_patch, patch_area, self.heads, self.dim_head).permute(0, 1, 3, 2, 4) v = v.reshape(B, n_patch, patch_area, self.heads, self.dim_head).permute(0, 1, 3, 2, 4) dots = torch.matmul(q, k.transpose(-1, -2)) * self.scale attn = self.attend(dots) out = torch.matmul(attn, v) out = out.permute(0, 1, 3, 2, 4).contiguous() out = out.view(B, n_patch, patch_area, self.heads * self.dim_head) out = self.proj(out) return out
class FeedForward(nn.Module):
    def __init__(self, dim, hidden_dim, dropout=0.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.SiLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, dim),
            nn.Dropout(dropout)
        )

    def forward(self, x):
        return self.net(x)
############################################
############################################
class MobileViTAttentionBlock(nn.Module):
    def __init__(self,
                 in_channels=320,
                 patch_size=(2, 2),
                 transformer_depth=1,
                 transformer_dim=96,
                 transformer_mlp_dim=192,
                 heads=2,
                 dropout=0.0):
        super().__init__()
        self.ph, self.pw = patch_size

        # Local convolution
        self.local_conv1 = nn.Conv2d(in_channels, in_channels, 3, padding=1, bias=False)
        self.local_bn1 = nn.BatchNorm2d(in_channels)
        self.local_act1 = nn.SiLU()

        # conv => transformer_dim
        self.conv_proj_in = nn.Conv2d(in_channels, transformer_dim, 1, bias=False)
        self.bn_proj_in = nn.BatchNorm2d(transformer_dim)
        self.act_proj_in = nn.SiLU()

        # Transformer
        self.transformer = TransformerBlockGroup(
            dim=transformer_dim,
            depth=transformer_depth,
            heads=heads,
            mlp_dim=transformer_mlp_dim,
            dropout=dropout
        )

        # Project back to in_channels
        self.conv_proj_out = nn.Conv2d(transformer_dim, in_channels, 1, bias=False)
        self.bn_proj_out = nn.BatchNorm2d(in_channels)
        self.act_proj_out = nn.SiLU()

        # concat => 3x3
        self.local_conv2 = nn.Conv2d(in_channels * 2, in_channels, 3, padding=1, bias=False)
        self.local_bn2 = nn.BatchNorm2d(in_channels)
        self.local_act2 = nn.SiLU()

    def forward(self, x):
        y = x.clone()

        x = self.local_conv1(x)
        x = self.local_bn1(x)
        x = self.local_act1(x)

        x = self.conv_proj_in(x)
        x = self.bn_proj_in(x)
        x = self.act_proj_in(x)

        B, D, H, W = x.shape
        assert H % self.ph == 0 and W % self.pw == 0, "H and W must be divisible by patch_size"

        x = rearrange(x, 'b d (hh ph) (ww pw) -> b (hh ww) (ph pw) d', ph=self.ph, pw=self.pw)
        x = self.transformer(x)
        x = rearrange(x, 'b (hh ww) (ph pw) d -> b d (hh ph) (ww pw)',
                      hh=H // self.ph, ww=W // self.pw, ph=self.ph, pw=self.pw)

        x = self.conv_proj_out(x)
        x = self.bn_proj_out(x)
        x = self.act_proj_out(x)

        x = torch.cat([x, y], dim=1)
        x = self.local_conv2(x)
        x = self.local_bn2(x)
        x = self.local_act2(x)
        return x
############################################
############################################
class AttnGate(nn.Module):
"""
类似 SE: 全局池化 -> MLP -> Sigmoid => 通道加权
"""
textdef __init__(self, channels, reduction=16): super().__init__() hidden_dim = max(1, channels // reduction) self.pool = nn.AdaptiveAvgPool2d(1) self.fc = nn.Sequential( nn.Linear(channels, hidden_dim), nn.ReLU(inplace=True), nn.Linear(hidden_dim, channels), nn.Sigmoid() ) def forward(self, x): B, C, H, W = x.shape y = self.pool(x).view(B, C) # => (B,C) y = self.fc(y) # => (B,C) y = y.view(B, C, 1, 1) return x * y
############################################
############################################
class AG_ASPP_MViT(nn.Module):
"""
5条分支, branch1 + branch5 没有门控,
branch2/3/4 做 MobileViT + gating,
最后并行拼接 => fuse
"""
textdef __init__(self, dim_in=320, dim_out=256, patch_size=(2, 2), mvit_depth=1, mvit_dim=96, mvit_mlp_dim=192, mvit_heads=2, mvit_dropout=0.0, use_groupnorm=True): super().__init__() # branch1 => 1×1 => dim_out, 无 gating self.branch1 = nn.Sequential( nn.Conv2d(dim_in, dim_out, 1), nn.BatchNorm2d(dim_out), nn.ReLU(inplace=True) ) # branch2 => MViT => 1×1 => gating self.branch2_attn = MobileViTAttentionBlock( in_channels=dim_in, patch_size=patch_size, transformer_depth=mvit_depth, transformer_dim=mvit_dim, transformer_mlp_dim=mvit_mlp_dim, heads=mvit_heads, dropout=mvit_dropout ) self.branch2_proj = nn.Conv2d(dim_in, dim_out, 1, bias=False) self.gate2 = AttnGate(dim_out) # branch3 => 同理 self.branch3_attn = MobileViTAttentionBlock( in_channels=dim_in, patch_size=patch_size, transformer_depth=mvit_depth, transformer_dim=mvit_dim, transformer_mlp_dim=mvit_mlp_dim, heads=mvit_heads, dropout=mvit_dropout ) self.branch3_proj = nn.Conv2d(dim_in, dim_out, 1, bias=False) self.gate3 = AttnGate(dim_out) # branch4 => 同理 self.branch4_attn = MobileViTAttentionBlock( in_channels=dim_in, patch_size=patch_size, transformer_depth=mvit_depth, transformer_dim=mvit_dim, transformer_mlp_dim=mvit_mlp_dim, heads=mvit_heads, dropout=mvit_dropout ) self.branch4_proj = nn.Conv2d(dim_in, dim_out, 1, bias=False) self.gate4 = AttnGate(dim_out) # branch5 => global pooling => conv => norm => relu => up (无 gating) self.branch5_conv = nn.Conv2d(dim_in, dim_out, kernel_size=1) if use_groupnorm: # 避免 (B=1, C=128,1,1) 的 BN 报错 self.branch5_norm = nn.GroupNorm(num_groups=1, num_channels=dim_out) else: self.branch5_norm = nn.BatchNorm2d(dim_out) self.branch5_relu = nn.ReLU(inplace=True) # fuse => 1x1 => dim_out self.fuse_conv = nn.Sequential( nn.Conv2d(dim_out * 5, dim_out, 1), nn.BatchNorm2d(dim_out), nn.ReLU(inplace=True) ) def forward(self, x): B, C, H, W = x.shape # branch1 (no gating) out1 = self.branch1(x) # => (B,128,H,W) # branch2 => gating x2 = self.branch2_attn(x) out2 = self.branch2_proj(x2) out2 = self.gate2(out2) # branch3 => gating x3 = self.branch3_attn(x) out3 = self.branch3_proj(x3) out3 = self.gate3(out3) # branch4 => gating x4 = self.branch4_attn(x) out4 = self.branch4_proj(x4) out4 = self.gate4(out4) # branch5 (no gating) gp = F.adaptive_avg_pool2d(x, 1) # => (B,C,1,1) gp = self.branch5_conv(gp) # => (B,128,1,1) gp = self.branch5_norm(gp) # => BN or GN gp = self.branch5_relu(gp) gp = F.interpolate(gp, size=(H, W), mode='bilinear', align_corners=False) # 并行拼接 => fuse out_cat = torch.cat([out1, out2, out3, out4, gp], dim=1) out = self.fuse_conv(out_cat) return out
############################################
############################################
class DeepLab(nn.Module):
    def __init__(self, num_classes, backbone="mobilenet", pretrained=True, downsample_factor=16):
        super(DeepLab, self).__init__()
        if backbone == "xception":
            self.backbone = xception(downsample_factor=downsample_factor, pretrained=pretrained)
            in_channels = 2048
            low_level_channels = 256
        elif backbone == "mobilenet":
            self.backbone = MobileNetV2(downsample_factor=downsample_factor, pretrained=pretrained)
            in_channels = 320
            low_level_channels = 24
        else:
            raise ValueError('Unsupported backbone - `{}`, Use mobilenet, xception.'.format(backbone))

        # Use AG_ASPP_MViT here; only the three middle branches are gated
        self.aspp = AG_ASPP_MViT(
            dim_in=in_channels,
            dim_out=256,
            patch_size=(2, 2),
            mvit_depth=1,
            mvit_dim=96,
            mvit_mlp_dim=192,
            mvit_heads=2,
            mvit_dropout=0.0,
            use_groupnorm=True  # avoid BN failing on (1, C, 1, 1)
        )

        # Low-level features => 1x1 => 48 channels
        self.shortcut_conv = nn.Sequential(
            nn.Conv2d(low_level_channels, 48, 1),
            nn.BatchNorm2d(48),
            nn.ReLU(inplace=True)
        )

        # cat_conv => (48 + 256) -> 256
        self.cat_conv = nn.Sequential(
            nn.Conv2d(48 + 256, 256, 3, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Conv2d(256, 256, 3, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
            nn.Dropout(0.1),
        )
        self.cls_conv = nn.Conv2d(256, num_classes, 1)

    def forward(self, x):
        H, W = x.shape[2], x.shape[3]
        low_level, deep = self.backbone(x)

        deep = self.aspp(deep)                     # => (B, 256, H', W')
        low_level = self.shortcut_conv(low_level)  # => (B, 48, H'', W'')

        deep_up = F.interpolate(deep, size=(low_level.size(2), low_level.size(3)),
                                mode='bilinear', align_corners=False)
        x_cat = torch.cat([deep_up, low_level], dim=1)
        x_cat = self.cat_conv(x_cat)
        x_cat = self.cls_conv(x_cat)
        x_cat = F.interpolate(x_cat, size=(H, W), mode='bilinear', align_corners=False)
        return x_cat
############################################
############################################
if __name__ == "__main__":
    model = DeepLab(num_classes=3, backbone="mobilenet", pretrained=False, downsample_factor=16)
    print("Total trainable parameters:", count_parameters(model))

    img = torch.randn(1, 3, 512, 512)
    pred = model(img)
    print("Output shape:", pred.shape)
import torch
import torch.nn as nn
import torch.nn.functional as F
from nets.xception import xception
from nets.mobilenetv2 import mobilenetv2
from nets.ShallowFusion import ShallowFusionLite
class MobileNetV2(nn.Module):
    def __init__(self, downsample_factor=8, pretrained=True):
        super(MobileNetV2, self).__init__()
        from functools import partial

        model = mobilenetv2(pretrained)
        # Drop the final classification layer
        self.features = model.features[:-1]
        self.total_idx = len(self.features)
        self.down_idx = [2, 4, 7, 14]

        # The (24, 32, 16, 32, 0.2) settings match the feature slices below
        # (F1 = 24 channels, F2 = 32 channels)
        self.shallow_fusion = ShallowFusionLite(
            in_channels1=24,       # <-- matches F1.shape[1]
            in_channels2=32,       # <-- matches F2.shape[1]
            reduced_channels=16,
            out_channels=32,
            dropout=0.2
        )

        # Adjust the dilated convolutions according to downsample_factor
        if downsample_factor == 8:
            for i in range(self.down_idx[-2], self.down_idx[-1]):
                self.features[i].apply(
                    partial(self._nostride_dilate, dilate=2)
                )
            for i in range(self.down_idx[-1], self.total_idx):
                self.features[i].apply(
                    partial(self._nostride_dilate, dilate=4)
                )
        elif downsample_factor == 16:
            for i in range(self.down_idx[-1], self.total_idx):
                self.features[i].apply(
                    partial(self._nostride_dilate, dilate=2)
                )

    def _nostride_dilate(self, m, dilate):
        classname = m.__class__.__name__
        if classname.find('Conv') != -1:
            if m.stride == (2, 2):
                m.stride = (1, 1)
                if m.kernel_size == (3, 3):
                    m.dilation = (dilate // 2, dilate // 2)
                    m.padding = (dilate // 2, dilate // 2)
            else:
                if m.kernel_size == (3, 3):
                    m.dilation = (dilate, dilate)
                    m.padding = (dilate, dilate)

    def forward(self, x):
        # F1 => self.features[:4]  (typically 24 channels)
        F1 = self.features[:4](x)
        # print("F1", F1.shape)
        # F2 => self.features[4:7] (typically 32 channels)
        F2 = self.features[4:7](F1)
        # print("F2", F2.shape)
        # deep => self.features[7:] (typically 320 channels)
        out = self.features[7:](F2)

        # Fuse F1 and F2 into low_level_features
        low_level_features = self.shallow_fusion(F1, F2)
        return low_level_features, out
def count_parameters(model):
"""统计模型参数数量"""
return sum(p.numel() for p in model.parameters() if p.requires_grad)
class ASPP(nn.Module):
    def __init__(self, dim_in, dim_out, rate=1, bn_mom=0.1):
        super(ASPP, self).__init__()
        # -----------------------------------------#
        #   inplace=True removed on the ReLUs
        # -----------------------------------------#
        self.branch1 = nn.Sequential(
            nn.Conv2d(dim_in, dim_out, 1, 1, padding=0, dilation=rate, bias=True),
            nn.BatchNorm2d(dim_out, momentum=bn_mom),
            nn.ReLU(inplace=False),
        )
        self.branch2 = nn.Sequential(
            nn.Conv2d(dim_in, dim_out, 3, 1, padding=6 * rate, dilation=6 * rate, bias=True),
            nn.BatchNorm2d(dim_out, momentum=bn_mom),
            nn.ReLU(inplace=False),
        )
        self.branch3 = nn.Sequential(
            nn.Conv2d(dim_in, dim_out, 3, 1, padding=12 * rate, dilation=12 * rate, bias=True),
            nn.BatchNorm2d(dim_out, momentum=bn_mom),
            nn.ReLU(inplace=False),
        )
        self.branch4 = nn.Sequential(
            nn.Conv2d(dim_in, dim_out, 3, 1, padding=18 * rate, dilation=18 * rate, bias=True),
            nn.BatchNorm2d(dim_out, momentum=bn_mom),
            nn.ReLU(inplace=False),
        )
        self.branch5_conv = nn.Conv2d(dim_in, dim_out, 1, 1, 0, bias=True)
        self.branch5_bn = nn.BatchNorm2d(dim_out, momentum=bn_mom)
        self.branch5_relu = nn.ReLU(inplace=False)

        self.conv_cat = nn.Sequential(
            nn.Conv2d(dim_out * 5, dim_out, 1, 1, padding=0, bias=True),
            nn.BatchNorm2d(dim_out, momentum=bn_mom),
            nn.ReLU(inplace=False),
        )

    def forward(self, x):
        [b, c, row, col] = x.size()
        conv1x1 = self.branch1(x)
        conv3x3_1 = self.branch2(x)
        conv3x3_2 = self.branch3(x)
        conv3x3_3 = self.branch4(x)
        # -----------------------------------------#
        #   Branch 5: global average pooling + conv
        # -----------------------------------------#
        global_feature = torch.mean(x, 2, True)
        global_feature = torch.mean(global_feature, 3, True)
        global_feature = self.branch5_conv(global_feature)
        global_feature = self.branch5_bn(global_feature)
        global_feature = self.branch5_relu(global_feature)
        global_feature = F.interpolate(global_feature, (row, col), None, 'bilinear', True)

        # -----------------------------------------#
        #   Concatenate the five branches => 1x1 conv
        # -----------------------------------------#
        feature_cat = torch.cat([conv1x1, conv3x3_1, conv3x3_2, conv3x3_3, global_feature], dim=1)
        result = self.conv_cat(feature_cat)
        return result
class DeepLab(nn.Module):
    def __init__(self, num_classes, backbone="mobilenet", pretrained=True, downsample_factor=16):
        super(DeepLab, self).__init__()
        if backbone == "xception":
            # xception is optional here
            self.backbone = xception(downsample_factor=downsample_factor, pretrained=pretrained)
            in_channels = 2048
            low_level_channels = 256
        elif backbone == "mobilenet":
            # Use the modified MobileNetV2 defined above
            self.backbone = MobileNetV2(downsample_factor=downsample_factor, pretrained=pretrained)
            in_channels = 320
            low_level_channels = 32
        else:
            raise ValueError('Unsupported backbone - `{}`, Use mobilenet or xception.'.format(backbone))

        # -----------------------------------------#
        #   ASPP feature-extraction module
        # -----------------------------------------#
        self.aspp = ASPP(dim_in=in_channels, dim_out=256, rate=16 // downsample_factor)

        # ----------------------------------#
        #   Low-level features => conv => 48 channels
        # ----------------------------------#
        self.shortcut_conv = nn.Sequential(
            nn.Conv2d(low_level_channels, 48, 1),
            nn.BatchNorm2d(48),
            nn.ReLU(inplace=False)
        )

        # ----------------------------------#
        #   Upsample, concatenate with low-level features, then convolve
        # ----------------------------------#
        self.cat_conv = nn.Sequential(
            nn.Conv2d(48 + 256, 256, 3, stride=1, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=False),
            nn.Dropout(0.5),
            nn.Conv2d(256, 256, 3, stride=1, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=False),
            nn.Dropout(0.1),
        )
        self.cls_conv = nn.Conv2d(256, num_classes, 1, stride=1)

    def forward(self, x):
        H, W = x.size(2), x.size(3)
        # backbone => (low_level_features, deep_features)
        low_level_features, x = self.backbone(x)
        # ASPP on the deep features
        x = self.aspp(x)
        # Low-level features: 1x1 => 48 channels
        low_level_features = self.shortcut_conv(low_level_features)

        # Upsample the ASPP output to the low-level resolution, then concatenate
        x = F.interpolate(x, size=(low_level_features.size(2), low_level_features.size(3)),
                          mode='bilinear', align_corners=True)
        x = torch.cat((x, low_level_features), dim=1)
        x = self.cat_conv(x)
        x = self.cls_conv(x)
        x = F.interpolate(x, size=(H, W), mode='bilinear', align_corners=True)
        return x
if __name__ == '__main__':
    img = torch.randn(3, 3, 512, 512)
    model = DeepLab(num_classes=3, backbone="mobilenet", pretrained=False, downsample_factor=16)
    out = model(img)
    print("Output shape:", out.shape)
    print("DeepLabV3+ parameter count:", count_parameters(model))
Combine these two pieces of code: replace the ASPP in the second file with the AG_ASPP_MViT from the first, merge the second file's DeepLab class into the first, and make sure the merged code runs end to end.
Below is an integrated example: the classic ASPP in the second file is replaced by the first file's AG_ASPP_MViT (window attention on the middle branches plus channel gating), and the ShallowFusionLite-based MobileNetV2 from the second file is wired in accordingly, giving a single runnable file. Note that this code only demonstrates how to stitch the two implementations together; a real project may still need further adjustments to channel counts, naming, and so on.
pythonimport torch import torch.nn as nn import torch.nn.functional as F import math from einops import rearrange # 假设您有 xception, mobilenetv2, ShallowFusionLite 等文件 # 这里仅演示 import from nets.xception import xception from nets.mobilenetv2 import mobilenetv2 from nets.ShallowFusion import ShallowFusionLite ########################################################## # 1) MobileNetV2 改造版:含 shallow_fusion ########################################################## class MobileNetV2_ShallowFusion(nn.Module): """ 替换后 MobileNetV2, 前几层提取 F1(24通道), 中间层提取 F2(32通道), 二者融合 => 生成 low_level_features(32通道), 剩余特征 => deep_features(320通道). """ def __init__(self, downsample_factor=8, pretrained=True): super(MobileNetV2_ShallowFusion, self).__init__() from functools import partial model = mobilenetv2(pretrained) # 去掉最后一个 classification self.features = model.features[:-1] self.total_idx = len(self.features) self.down_idx = [2, 4, 7, 14] # ShallowFusionLite: F1=24通道, F2=32通道 => out_channels=32 self.shallow_fusion = ShallowFusionLite( in_channels1=24, in_channels2=32, reduced_channels=16, out_channels=32, dropout=0.2 ) # 根据 downsample_factor 修改空洞卷积 if downsample_factor == 8: for i in range(self.down_idx[-2], self.down_idx[-1]): self.features[i].apply(partial(self._nostride_dilate, dilate=2)) for i in range(self.down_idx[-1], self.total_idx): self.features[i].apply(partial(self._nostride_dilate, dilate=4)) elif downsample_factor == 16: for i in range(self.down_idx[-1], self.total_idx): self.features[i].apply(partial(self._nostride_dilate, dilate=2)) def _nostride_dilate(self, m, dilate): classname = m.__class__.__name__ if classname.find('Conv') != -1: if m.stride == (2, 2): m.stride = (1, 1) if m.kernel_size == (3, 3): m.dilation = (dilate // 2, dilate // 2) m.padding = (dilate // 2, dilate // 2) else: if m.kernel_size == (3, 3): m.dilation = (dilate, dilate) m.padding = (dilate, dilate) def forward(self, x): """ 假设: features[:4] => F1 (24通道), features[4:7] => F2 (32通道), features[7:] => deep (320通道) """ # F1 => self.features[:4] F1 = self.features[:4](x) # F2 => self.features[4:7] F2 = self.features[4:7](F1) # deep => self.features[7:] out = self.features[7:](F2) # ShallowFusionLite => 生成 low_level_features(32通道) low_level_features = self.shallow_fusion(F1, F2) # out => deep_features(320通道) return low_level_features, out # => (B,32,H,W), (B,320,H',W') ########################################################## # 2) 多层Transformer (减参) ########################################################## class TransformerBlockGroup(nn.Module): def __init__(self, dim, depth, heads, mlp_dim, dropout=0.0): super().__init__() self.layers = nn.ModuleList([ TransformerBlockLayer(dim, heads, mlp_dim, dropout) for _ in range(depth) ]) def forward(self, x): for layer in self.layers: x = layer(x) return x class TransformerBlockLayer(nn.Module): def __init__(self, dim, heads, mlp_dim, dropout=0.0): super().__init__() self.norm1 = nn.LayerNorm(dim) self.attn = MultiHeadAttention(dim, heads, dropout) self.norm2 = nn.LayerNorm(dim) self.ffn = FeedForward(dim, mlp_dim, dropout) def forward(self, x): shortcut = x x_ln = self.norm1(x) x_attn = self.attn(x_ln) x = x_attn + shortcut shortcut2 = x x_ln2 = self.norm2(x) x_ffn = self.ffn(x_ln2) x = x_ffn + shortcut2 return x class MultiHeadAttention(nn.Module): def __init__(self, dim, heads=2, dropout=0.0): super().__init__() self.heads = heads self.dim_head = dim//heads inner_dim = self.dim_head * heads self.scale = self.dim_head**-0.5 self.to_qkv = nn.Linear(dim, inner_dim*3, bias=False) self.attend 
= nn.Softmax(dim=-1) self.proj = nn.Sequential( nn.Linear(inner_dim, dim), nn.Dropout(dropout) ) def forward(self, x): B, n_patch, patch_area, dim = x.shape qkv = self.to_qkv(x) q, k, v = torch.chunk(qkv, 3, dim=-1) # reshape for multi-head q = q.reshape(B, n_patch, patch_area, self.heads, self.dim_head).permute(0,1,3,2,4) k = k.reshape(B, n_patch, patch_area, self.heads, self.dim_head).permute(0,1,3,2,4) v = v.reshape(B, n_patch, patch_area, self.heads, self.dim_head).permute(0,1,3,2,4) dots = torch.matmul(q, k.transpose(-1, -2)) * self.scale attn = self.attend(dots) out = torch.matmul(attn, v) out = out.permute(0,1,3,2,4).contiguous() out = out.view(B, n_patch, patch_area, self.heads*self.dim_head) out = self.proj(out) return out class FeedForward(nn.Module): def __init__(self, dim, hidden_dim, dropout=0.0): super().__init__() self.net = nn.Sequential( nn.Linear(dim, hidden_dim), nn.SiLU(), nn.Dropout(dropout), nn.Linear(hidden_dim, dim), nn.Dropout(dropout) ) def forward(self, x): return self.net(x) ########################################################## # 3) MobileViTAttentionBlock ########################################################## class MobileViTAttentionBlock(nn.Module): def __init__(self, in_channels=320, patch_size=(2,2), transformer_depth=1, transformer_dim=96, transformer_mlp_dim=192, heads=2, dropout=0.0): super().__init__() self.ph, self.pw = patch_size # 局部卷积 self.local_conv1 = nn.Conv2d(in_channels, in_channels, 3, padding=1, bias=False) self.local_bn1 = nn.BatchNorm2d(in_channels) self.local_act1 = nn.SiLU() # conv => transformer_dim self.conv_proj_in = nn.Conv2d(in_channels, transformer_dim, 1, bias=False) self.bn_proj_in = nn.BatchNorm2d(transformer_dim) self.act_proj_in = nn.SiLU() # Transformer self.transformer = TransformerBlockGroup( dim=transformer_dim, depth=transformer_depth, heads=heads, mlp_dim=transformer_mlp_dim, dropout=dropout ) # 回到 in_channels self.conv_proj_out = nn.Conv2d(transformer_dim, in_channels, 1, bias=False) self.bn_proj_out = nn.BatchNorm2d(in_channels) self.act_proj_out = nn.SiLU() # concat => 3x3 self.local_conv2 = nn.Conv2d(in_channels*2, in_channels, 3, padding=1, bias=False) self.local_bn2 = nn.BatchNorm2d(in_channels) self.local_act2 = nn.SiLU() def forward(self, x): y = x.clone() x = self.local_conv1(x) x = self.local_bn1(x) x = self.local_act1(x) x = self.conv_proj_in(x) x = self.bn_proj_in(x) x = self.act_proj_in(x) B, D, H, W = x.shape assert (H % self.ph==0) and (W % self.pw==0), "H,W需能被patch_size整除" # rearrange => (B, n_patch, patch_area, D) x = rearrange(x, 'b d (hh ph) (ww pw) -> b (hh ww) (ph pw) d', ph=self.ph, pw=self.pw) x = self.transformer(x) x = rearrange(x, 'b (hh ww) (ph pw) d -> b d (hh ph) (ww pw)', hh=H//self.ph, ww=W//self.pw, ph=self.ph, pw=self.pw) x = self.conv_proj_out(x) x = self.bn_proj_out(x) x = self.act_proj_out(x) # concat residual x = torch.cat([x, y], dim=1) x = self.local_conv2(x) x = self.local_bn2(x) x = self.local_act2(x) return x ########################################################## # 4) 通道注意力 AttnGate (仅在某些分支用) ########################################################## class AttnGate(nn.Module): """ 类似 SE: 全局池化 -> MLP -> Sigmoid => 通道加权 """ def __init__(self, channels, reduction=16): super().__init__() hidden_dim = max(1, channels // reduction) self.pool = nn.AdaptiveAvgPool2d(1) self.fc = nn.Sequential( nn.Linear(channels, hidden_dim), nn.ReLU(inplace=True), nn.Linear(hidden_dim, channels), nn.Sigmoid() ) def forward(self, x): B, C, H, W = x.shape y = self.pool(x).view(B, C) y = 
self.fc(y) y = y.view(B, C, 1, 1) return x * y ########################################################## # 5) AG_ASPP_MViT 代替 ASPP ########################################################## class AG_ASPP_MViT(nn.Module): """ 5条分支: - branch1: 1×1 => dim_out (不做门控) - branch2/3/4: MobileViT + 1x1 => AttnGate - branch5: 全局池化 => 1x1 => norm => relu => 上采样 (不做门控) 并行 -> fuse """ def __init__(self, dim_in=320, dim_out=256, patch_size=(2,2), mvit_depth=1, mvit_dim=96, mvit_mlp_dim=192, mvit_heads=2, mvit_dropout=0.0, use_groupnorm=True): super().__init__() # branch1 => 1×1 => dim_out self.branch1 = nn.Sequential( nn.Conv2d(dim_in, dim_out, kernel_size=1), nn.BatchNorm2d(dim_out), nn.ReLU(inplace=True) ) # branch2 => MobileViT + gating self.branch2_attn = MobileViTAttentionBlock( in_channels=dim_in, patch_size=patch_size, transformer_depth=mvit_depth, transformer_dim=mvit_dim, transformer_mlp_dim=mvit_mlp_dim, heads=mvit_heads, dropout=mvit_dropout ) self.branch2_proj = nn.Conv2d(dim_in, dim_out, 1, bias=False) self.gate2 = AttnGate(dim_out) # branch3 self.branch3_attn = MobileViTAttentionBlock( in_channels=dim_in, patch_size=patch_size, transformer_depth=mvit_depth, transformer_dim=mvit_dim, transformer_mlp_dim=mvit_mlp_dim, heads=mvit_heads, dropout=mvit_dropout ) self.branch3_proj = nn.Conv2d(dim_in, dim_out, 1, bias=False) self.gate3 = AttnGate(dim_out) # branch4 self.branch4_attn = MobileViTAttentionBlock( in_channels=dim_in, patch_size=patch_size, transformer_depth=mvit_depth, transformer_dim=mvit_dim, transformer_mlp_dim=mvit_mlp_dim, heads=mvit_heads, dropout=mvit_dropout ) self.branch4_proj = nn.Conv2d(dim_in, dim_out, 1, bias=False) self.gate4 = AttnGate(dim_out) # branch5 => global pool => conv => norm => relu => up self.branch5_conv = nn.Conv2d(dim_in, dim_out, 1) if use_groupnorm: self.branch5_norm = nn.GroupNorm(num_groups=1, num_channels=dim_out) else: self.branch5_norm = nn.BatchNorm2d(dim_out) self.branch5_relu = nn.ReLU(inplace=True) # fuse => 1x1 => dim_out self.fuse_conv = nn.Sequential( nn.Conv2d(dim_out*5, dim_out, 1), nn.BatchNorm2d(dim_out), nn.ReLU(inplace=True) ) def forward(self, x): B, C, H, W = x.shape # branch1 out1 = self.branch1(x) # branch2 x2 = self.branch2_attn(x) out2 = self.branch2_proj(x2) out2 = self.gate2(out2) # branch3 x3 = self.branch3_attn(x) out3 = self.branch3_proj(x3) out3 = self.gate3(out3) # branch4 x4 = self.branch4_attn(x) out4 = self.branch4_proj(x4) out4 = self.gate4(out4) # branch5 gp = F.adaptive_avg_pool2d(x, 1) gp = self.branch5_conv(gp) gp = self.branch5_norm(gp) gp = self.branch5_relu(gp) gp = F.interpolate(gp, size=(H, W), mode='bilinear', align_corners=False) # 并行拼接 => fuse out_cat = torch.cat([out1, out2, out3, out4, gp], dim=1) out = self.fuse_conv(out_cat) return out ########################################################## # 6) DeepLab,用 AG_ASPP_MViT + MobileNetV2_ShallowFusion ########################################################## class DeepLab(nn.Module): def __init__(self, num_classes=3, backbone="mobilenet", pretrained=True, downsample_factor=16): super(DeepLab, self).__init__() if backbone == "xception": # 省略 xception self.backbone = xception(downsample_factor=downsample_factor, pretrained=pretrained) in_channels = 2048 low_level_channels = 256 elif backbone == "mobilenet": # 改用 "MobileNetV2_ShallowFusion" self.backbone = MobileNetV2_ShallowFusion(downsample_factor=downsample_factor, pretrained=pretrained) in_channels = 320 # deep_features 通道 low_level_channels = 32 # shallow_fusion输出通道 else: raise ValueError('Unsupported 
backbone - `{}`, Use mobilenet, xception.'.format(backbone)) # 用 AG_ASPP_MViT 替换原 ASPP self.aspp = AG_ASPP_MViT( dim_in = in_channels, # 320 dim_out = 256, patch_size = (2,2), mvit_depth = 1, mvit_dim = 96, mvit_mlp_dim=192, mvit_heads = 2, mvit_dropout=0.0, use_groupnorm=True ) # 浅层特征 => 1x1 => 48 (随意可改) self.shortcut_conv = nn.Sequential( nn.Conv2d(low_level_channels, 48, 1), nn.BatchNorm2d(48), nn.ReLU(inplace=True) ) # cat_conv => (48 + 256)->256 self.cat_conv = nn.Sequential( nn.Conv2d(48 + 256, 256, 3, padding=1), nn.BatchNorm2d(256), nn.ReLU(inplace=True), nn.Dropout(0.5), nn.Conv2d(256, 256, 3, padding=1), nn.BatchNorm2d(256), nn.ReLU(inplace=True), nn.Dropout(0.1), ) self.cls_conv = nn.Conv2d(256, num_classes, 1) def forward(self, x): B, C, H, W = x.shape # => shallow_fusion => low_level(32), deep(320) low_level, deep = self.backbone(x) # => AG_ASPP_MViT deep = self.aspp(deep) # => (B,256, ...) # 浅层 => (B,32,H/?,W/?) => conv => (B,48,H/?,W/?) low_level = self.shortcut_conv(low_level) # 上采样 deep => 与 low_level 拼接 => cat_conv => cls_conv deep_up = F.interpolate(deep, size=(low_level.size(2), low_level.size(3)), mode='bilinear', align_corners=False) x_cat = torch.cat([deep_up, low_level], dim=1) # => (B, 48+256, ...) x_cat = self.cat_conv(x_cat) x_cat = self.cls_conv(x_cat) # 再上采样回输入大小 x_cat = F.interpolate(x_cat, size=(H, W), mode='bilinear', align_corners=False) return x_cat ########################################################## # 7) 测试示例 ########################################################## if __name__ == "__main__": model = DeepLab(num_classes=3, backbone="mobilenet", pretrained=False, downsample_factor=16) print("模型参数总数:", count_parameters(model)) img = torch.randn(1, 3, 512, 512) pred = model(img) print("输出维度:", pred.shape)
MobileNetV2_ShallowFusion: uses ShallowFusionLite to fuse F1 and F2 into low_level_features (32 output channels); the 320-channel deep_features are passed on to the ASPP stage.
AG_ASPP_MViT: replaces the classic ASPP with five parallel branches: branch1 (1×1), branch2/3/4 (window attention + gating), branch5 (global pooling); the output is (B, 256, H, W).
DeepLab: uses MobileNetV2_ShallowFusion and obtains (low_level=32, deep=320) in forward; AG_ASPP_MViT processes deep (320 -> 256), low_level (32) is mapped to 48 channels, and the two are concatenated and decoded into the final prediction.
Test run: the if __name__ == "__main__": block builds a 1×3×512×512 input and checks the output shape and total parameter count to confirm the merged file actually runs.
This completes the merge of the first file (AG_ASPP_MViT module + window attention) with the second (DeepLab + MobileNetV2 + ShallowFusion) into a single runnable, testable script. Depending on your needs, you can further adjust the channel counts, downsample factor, heads, or patch_size to match your dataset and compute budget.
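Before training, it is also worth confirming that the channel assumptions (F1 = 24, F2 = 32, deep = 320) actually hold for your nets.mobilenetv2 and nets.ShallowFusion implementations. A minimal sanity check is sketched below, assuming the merged file above is importable; the expected shapes in the comments are only what a 512×512 input with downsample_factor=16 would normally give, and the low-level spatial size ultimately depends on how ShallowFusionLite aligns F1 and F2.

```python
import torch

# Hypothetical smoke test for the merged model defined above.
model = DeepLab(num_classes=3, backbone="mobilenet", pretrained=False, downsample_factor=16)
model.eval()

x = torch.randn(1, 3, 512, 512)
with torch.no_grad():
    low_level, deep = model.backbone(x)      # ShallowFusionLite output and deep features
    print("low_level:", low_level.shape)     # e.g. (1, 32, 128, 128) if fused at F1 resolution
    print("deep:     ", deep.shape)          # expected (1, 320, 32, 32)
    pred = model(x)
    print("pred:     ", pred.shape)          # expected (1, 3, 512, 512)
```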
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
from einops import rearrange
from nets.xception import xception
from nets.mobilenetV2_GGCA import mobilenetv2
from nets.ShallowFusion import ShallowFusionLite
##########################################################
##########################################################
class MobileNetV2_ShallowFusion(nn.Module):
"""
替换后 MobileNetV2,
前几层提取 F1(24通道), 中间层提取 F2(32通道),
二者融合 => 生成 low_level_features(32通道),
剩余特征 => deep_features(320通道).
"""
def init(self, downsample_factor=8, pretrained=True):
super(MobileNetV2_ShallowFusion, self).init()
from functools import partial
textmodel = mobilenetv2(pretrained) # 去掉最后一个 classification self.features = model.features[:-1] self.total_idx = len(self.features) self.down_idx = [2, 4, 7, 14] # ShallowFusionLite: F1=24通道, F2=32通道 => out_channels=32 self.shallow_fusion = ShallowFusionLite( in_channels1=24, in_channels2=32, reduced_channels=24, out_channels=56, dropout=0.2 ) # 根据 downsample_factor 修改空洞卷积 if downsample_factor == 8: for i in range(self.down_idx[-2], self.down_idx[-1]): self.features[i].apply(partial(self._nostride_dilate, dilate=2)) for i in range(self.down_idx[-1], self.total_idx): self.features[i].apply(partial(self._nostride_dilate, dilate=4)) elif downsample_factor == 16: for i in range(self.down_idx[-1], self.total_idx): self.features[i].apply(partial(self._nostride_dilate, dilate=2)) def _nostride_dilate(self, m, dilate): classname = m.__class__.__name__ if classname.find('Conv') != -1: if m.stride == (2, 2): m.stride = (1, 1) if m.kernel_size == (3, 3): m.dilation = (dilate // 2, dilate // 2) m.padding = (dilate // 2, dilate // 2) else: if m.kernel_size == (3, 3): m.dilation = (dilate, dilate) m.padding = (dilate, dilate) def forward(self, x): """ 假设: features[:4] => F1 (24通道), features[4:7] => F2 (32通道), features[7:] => deep (320通道) """ # F1 => self.features[:4] F1 = self.features[:4](x) # print("F1",F1.shape) # F2 => self.features[4:7] F2 = self.features[4:7](F1) # print("F2",F2.shape) # deep => self.features[7:] out = self.features[7:](F2) # ShallowFusionLite => 生成 low_level_features(32通道) low_level_features = self.shallow_fusion(F1, F2) # out => deep_features(320通道) return low_level_features, out # => (B,32,H,W), (B,320,H',W')
def count_parameters(model):
"""统计模型参数数量"""
return sum(p.numel() for p in model.parameters() if p.requires_grad)
##########################################################
##########################################################
class TransformerBlockGroup(nn.Module):
    def __init__(self, dim, depth, heads, mlp_dim, dropout=0.0):
        super().__init__()
        self.layers = nn.ModuleList([
            TransformerBlockLayer(dim, heads, mlp_dim, dropout)
            for _ in range(depth)
        ])

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x
class TransformerBlockLayer(nn.Module):
    def __init__(self, dim, heads, mlp_dim, dropout=0.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = MultiHeadAttention(dim, heads, dropout)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = FeedForward(dim, mlp_dim, dropout)

    def forward(self, x):
        shortcut = x
        x_ln = self.norm1(x)
        x_attn = self.attn(x_ln)
        x = x_attn + shortcut

        shortcut2 = x
        x_ln2 = self.norm2(x)
        x_ffn = self.ffn(x_ln2)
        x = x_ffn + shortcut2
        return x
class MultiHeadAttention(nn.Module):
    def __init__(self, dim, heads=2, dropout=0.0):
        super().__init__()
        self.heads = heads
        self.dim_head = dim // heads
        inner_dim = self.dim_head * heads
textself.scale = self.dim_head**-0.5 self.to_qkv = nn.Linear(dim, inner_dim*3, bias=False) self.attend = nn.Softmax(dim=-1) self.proj = nn.Sequential( nn.Linear(inner_dim, dim), nn.Dropout(dropout) ) def forward(self, x): B, n_patch, patch_area, dim = x.shape qkv = self.to_qkv(x) q, k, v = torch.chunk(qkv, 3, dim=-1) # reshape for multi-head q = q.reshape(B, n_patch, patch_area, self.heads, self.dim_head).permute(0,1,3,2,4) k = k.reshape(B, n_patch, patch_area, self.heads, self.dim_head).permute(0,1,3,2,4) v = v.reshape(B, n_patch, patch_area, self.heads, self.dim_head).permute(0,1,3,2,4) dots = torch.matmul(q, k.transpose(-1, -2)) * self.scale attn = self.attend(dots) out = torch.matmul(attn, v) out = out.permute(0,1,3,2,4).contiguous() out = out.view(B, n_patch, patch_area, self.heads*self.dim_head) out = self.proj(out) return out
class FeedForward(nn.Module):
    def __init__(self, dim, hidden_dim, dropout=0.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.SiLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, dim),
            nn.Dropout(dropout)
        )

    def forward(self, x):
        return self.net(x)
##########################################################
##########################################################
class MobileViTAttentionBlock(nn.Module):
    def __init__(self,
                 in_channels=320,
                 patch_size=(2, 2),
                 transformer_depth=1,
                 transformer_dim=96,
                 transformer_mlp_dim=192,
                 heads=2,
                 dropout=0.0):
        super().__init__()
        self.ph, self.pw = patch_size
text# 局部卷积 self.local_conv1 = nn.Conv2d(in_channels, in_channels, 3, padding=1, bias=False) self.local_bn1 = nn.BatchNorm2d(in_channels) self.local_act1 = nn.SiLU() # conv => transformer_dim self.conv_proj_in = nn.Conv2d(in_channels, transformer_dim, 1, bias=False) self.bn_proj_in = nn.BatchNorm2d(transformer_dim) self.act_proj_in = nn.SiLU() # Transformer self.transformer = TransformerBlockGroup( dim=transformer_dim, depth=transformer_depth, heads=heads, mlp_dim=transformer_mlp_dim, dropout=dropout ) # 回到 in_channels self.conv_proj_out = nn.Conv2d(transformer_dim, in_channels, 1, bias=False) self.bn_proj_out = nn.BatchNorm2d(in_channels) self.act_proj_out = nn.SiLU() # concat => 3x3 self.local_conv2 = nn.Conv2d(in_channels*2, in_channels, 3, padding=1, bias=False) self.local_bn2 = nn.BatchNorm2d(in_channels) self.local_act2 = nn.SiLU() def forward(self, x): y = x.clone() x = self.local_conv1(x) x = self.local_bn1(x) x = self.local_act1(x) x = self.conv_proj_in(x) x = self.bn_proj_in(x) x = self.act_proj_in(x) B, D, H, W = x.shape assert (H % self.ph==0) and (W % self.pw==0), "H,W需能被patch_size整除" # rearrange => (B, n_patch, patch_area, D) x = rearrange(x, 'b d (hh ph) (ww pw) -> b (hh ww) (ph pw) d', ph=self.ph, pw=self.pw) x = self.transformer(x) x = rearrange(x, 'b (hh ww) (ph pw) d -> b d (hh ph) (ww pw)', hh=H//self.ph, ww=W//self.pw, ph=self.ph, pw=self.pw) x = self.conv_proj_out(x) x = self.bn_proj_out(x) x = self.act_proj_out(x) # concat residual x = torch.cat([x, y], dim=1) x = self.local_conv2(x) x = self.local_bn2(x) x = self.local_act2(x) return x
##########################################################
##########################################################
class AttnGate(nn.Module):
"""
类似 SE: 全局池化 -> MLP -> Sigmoid => 通道加权
"""
def init(self, channels, reduction=16):
super().init()
hidden_dim = max(1, channels // reduction)
self.pool = nn.AdaptiveAvgPool2d(1)
self.fc = nn.Sequential(
nn.Linear(channels, hidden_dim),
nn.ReLU(inplace=True),
nn.Linear(hidden_dim, channels),
nn.Sigmoid()
)
def forward(self, x):
B, C, H, W = x.shape
y = self.pool(x).view(B, C)
y = self.fc(y)
y = y.view(B, C, 1, 1)
return x * y
##########################################################
##########################################################
class AG_ASPP_MViT(nn.Module):
"""
5条分支:
- branch1: 1×1 => dim_out (不做门控)
- branch2/3/4: MobileViT + 1x1 => AttnGate
- branch5: 全局池化 => 1x1 => norm => relu => 上采样 (不做门控)
并行 -> fuse
"""
def init(self, dim_in=320, dim_out=256,
patch_size=(2,2),
mvit_depth=1,
mvit_dim=96,
mvit_mlp_dim=192,
mvit_heads=2,
mvit_dropout=0.0,
use_groupnorm=True):
super().init()
text# branch1 => 1×1 => dim_out self.branch1 = nn.Sequential( nn.Conv2d(dim_in, dim_out, kernel_size=1), nn.BatchNorm2d(dim_out), nn.ReLU(inplace=True) ) # branch2 => MobileViT + gating self.branch2_attn = MobileViTAttentionBlock( in_channels=dim_in, patch_size=patch_size, transformer_depth=mvit_depth, transformer_dim=mvit_dim, transformer_mlp_dim=mvit_mlp_dim, heads=mvit_heads, dropout=mvit_dropout ) self.branch2_proj = nn.Conv2d(dim_in, dim_out, 1, bias=False) self.gate2 = AttnGate(dim_out) # branch3 self.branch3_attn = MobileViTAttentionBlock( in_channels=dim_in, patch_size=patch_size, transformer_depth=mvit_depth, transformer_dim=mvit_dim, transformer_mlp_dim=mvit_mlp_dim, heads=mvit_heads, dropout=mvit_dropout ) self.branch3_proj = nn.Conv2d(dim_in, dim_out, 1, bias=False) self.gate3 = AttnGate(dim_out) # branch4 self.branch4_attn = MobileViTAttentionBlock( in_channels=dim_in, patch_size=patch_size, transformer_depth=mvit_depth, transformer_dim=mvit_dim, transformer_mlp_dim=mvit_mlp_dim, heads=mvit_heads, dropout=mvit_dropout ) self.branch4_proj = nn.Conv2d(dim_in, dim_out, 1, bias=False) self.gate4 = AttnGate(dim_out) # branch5 => global pool => conv => norm => relu => up self.branch5_conv = nn.Conv2d(dim_in, dim_out, 1) if use_groupnorm: self.branch5_norm = nn.GroupNorm(num_groups=1, num_channels=dim_out) else: self.branch5_norm = nn.BatchNorm2d(dim_out) self.branch5_relu = nn.ReLU(inplace=True) # fuse => 1x1 => dim_out self.fuse_conv = nn.Sequential( nn.Conv2d(dim_out*5, dim_out, 1), nn.BatchNorm2d(dim_out), nn.ReLU(inplace=True) ) def forward(self, x): B, C, H, W = x.shape # branch1 out1 = self.branch1(x) # branch2 x2 = self.branch2_attn(x) out2 = self.branch2_proj(x2) out2 = self.gate2(out2) # branch3 x3 = self.branch3_attn(x) out3 = self.branch3_proj(x3) out3 = self.gate3(out3) # branch4 x4 = self.branch4_attn(x) out4 = self.branch4_proj(x4) out4 = self.gate4(out4) # branch5 gp = F.adaptive_avg_pool2d(x, 1) gp = self.branch5_conv(gp) gp = self.branch5_norm(gp) gp = self.branch5_relu(gp) gp = F.interpolate(gp, size=(H, W), mode='bilinear', align_corners=False) # 并行拼接 => fuse out_cat = torch.cat([out1, out2, out3, out4, gp], dim=1) out = self.fuse_conv(out_cat) return out
##########################################################
##########################################################
class DeepLab(nn.Module):
    def __init__(self, num_classes=3, backbone="mobilenet", pretrained=True, downsample_factor=16):
        super(DeepLab, self).__init__()
        if backbone == "xception":
            # xception path kept for completeness
            self.backbone = xception(downsample_factor=downsample_factor, pretrained=pretrained)
            in_channels = 2048
            low_level_channels = 256
        elif backbone == "mobilenet":
            # Use MobileNetV2_ShallowFusion instead of the plain backbone
            self.backbone = MobileNetV2_ShallowFusion(downsample_factor=downsample_factor, pretrained=pretrained)
            in_channels = 320        # deep_features channels
            low_level_channels = 56  # shallow_fusion output channels
        else:
            raise ValueError('Unsupported backbone - `{}`, Use mobilenet, xception.'.format(backbone))
text# 用 AG_ASPP_MViT 替换原 ASPP self.aspp = AG_ASPP_MViT( dim_in = in_channels, # 320 dim_out = 256, patch_size = (2,2), mvit_depth = 1, mvit_dim = 96, mvit_mlp_dim=192, mvit_heads = 2, mvit_dropout=0.0, use_groupnorm=True ) # 浅层特征 => 1x1 => 48 (随意可改) self.shortcut_conv = nn.Sequential( nn.Conv2d(low_level_channels, 48, 1), nn.BatchNorm2d(48), nn.ReLU(inplace=True) ) # cat_conv => (48 + 256)->256 self.cat_conv = nn.Sequential( nn.Conv2d(48 + 256, 256, 3, padding=1), nn.BatchNorm2d(256), nn.ReLU(inplace=True), nn.Dropout(0.5), nn.Conv2d(256, 256, 3, padding=1), nn.BatchNorm2d(256), nn.ReLU(inplace=True), nn.Dropout(0.1), ) self.cls_conv = nn.Conv2d(256, num_classes, 1) def forward(self, x): B, C, H, W = x.shape # => shallow_fusion => low_level(32), deep(320) low_level, deep = self.backbone(x) # => AG_ASPP_MViT deep = self.aspp(deep) # => (B,256, ...) # 浅层 => (B,32,H/?,W/?) => conv => (B,48,H/?,W/?) low_level = self.shortcut_conv(low_level) # 上采样 deep => 与 low_level 拼接 => cat_conv => cls_conv deep_up = F.interpolate(deep, size=(low_level.size(2), low_level.size(3)), mode='bilinear', align_corners=False) x_cat = torch.cat([deep_up, low_level], dim=1) # => (B, 48+256, ...) x_cat = self.cat_conv(x_cat) x_cat = self.cls_conv(x_cat) # 再上采样回输入大小 x_cat = F.interpolate(x_cat, size=(H, W), mode='bilinear', align_corners=False) return x_cat
##########################################################
##########################################################
if __name__ == "__main__":
    model = DeepLab(num_classes=3, backbone="mobilenet", pretrained=False, downsample_factor=16)
    print("Total trainable parameters:", count_parameters(model))

    img = torch.randn(1, 3, 512, 512)
    pred = model(img)
    print("Output shape:", pred.shape)
Why does each module score quite well when trained on its own, but the metrics drop once everything is combined?
When each module is trained separately (for example the shallow-feature fusion, the MobileViT block, or the attention gate), each one can reach decent metrics in its own small-scale experiment or standalone task; yet when these modules are combined into one network, overall performance stalls or even drops. The usual causes are:
"Conflicts" caused by overlapping feature distributions or redundant functionality
Inconsistent hyperparameters and training procedures
Overfitting and weak discrimination of small objects
Training order / fine-tuning strategy
Inter-module dependencies and information-flow problems
Things worth trying:
Staged or multi-task training
Re-tuning hyperparameters / the learning-rate schedule
Adding calibration or intermediate adapter layers between modules
Checking for functional overlap between modules
Distillation / feature alignment
When individual sub-modules perform well in isolation but the combined network gets worse, the root cause is usually conflicting feature distributions or a training strategy that no longer fits. Work on the joint training procedure, re-tune the hyperparameters, and fine-tune in stages so that the modules learn to cooperate inside the same network; only then do the strengths of each module translate into a gain for the whole model. A minimal staged-training sketch is given below.
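As a concrete illustration of the staged strategy, here is a minimal two-stage sketch, assuming the merged DeepLab class from earlier; `train_one_epoch` is only a placeholder for your existing training loop and is not part of the original code.

```python
import torch

model = DeepLab(num_classes=3, backbone="mobilenet", pretrained=False, downsample_factor=16)

# Stage 1: freeze the backbone so the newly added modules (AG_ASPP_MViT, decoder)
# can adapt first without disturbing the feature extractor.
for p in model.backbone.parameters():
    p.requires_grad = False
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=5e-4, weight_decay=1e-4
)
# for epoch in range(warmup_epochs):
#     train_one_epoch(model, optimizer, ...)   # placeholder for your existing loop

# Stage 2: unfreeze everything and continue with per-group learning rates,
# smaller for the backbone than for the freshly initialized heads.
for p in model.backbone.parameters():
    p.requires_grad = True
optimizer = torch.optim.AdamW([
    {"params": model.backbone.parameters(), "lr": 1e-4},
    {"params": model.aspp.parameters(), "lr": 5e-4},
    {"params": list(model.shortcut_conv.parameters())
             + list(model.cat_conv.parameters())
             + list(model.cls_conv.parameters()), "lr": 5e-4},
], weight_decay=1e-4)
# for epoch in range(remaining_epochs):
#     train_one_epoch(model, optimizer, ...)
```

The exact learning rates and the length of stage 1 are placeholders; the point is only the ordering: let the new branches settle before the whole network is trained jointly.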
I trained without loading any pretrained weights.
If you train the combined network entirely from scratch, without loading any pretrained weights, then even though each sub-module can learn a reasonable representation from scratch in its own isolated setting, the merged network usually needs more patience and dedicated tuning before it matches or beats the metrics the sub-modules reached individually. The key reasons:
The initial state of the network changes
Hyperparameter settings no longer match
Complementary features are not learned sufficiently
Convergence becomes harder
What usually helps:
Pretrain the modules independently first / train in stages
Re-tune the hyperparameters
Align branch outputs / add calibration layers
Distillation / auxiliary supervision
Watch the training curves and logs closely
Without pretrained weights, training a merged network that contains several sub-modules (shallow fusion + window attention + channel gating, etc.) from scratch converges far less easily than training each sub-module on its own. To let the combined network actually exploit the complementary strengths of its parts, you usually need staged, module-by-module tuning and a carefully designed training schedule rather than a simple one-shot concatenation with random initialization. Handled that way, the larger network still has a higher ceiling and can ultimately outperform the individual sub-modules. A minimal sketch of reusing existing backbone weights with a learning-rate warmup follows.
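To make the from-scratch situation less punishing, the single most effective step is usually to give the backbone some form of initialization and to warm up the learning rate. A minimal sketch, assuming the merged DeepLab class above; `mobilenet_v2_weights.pth` is a hypothetical checkpoint path, not a file shipped with the code.

```python
import torch

model = DeepLab(num_classes=3, backbone="mobilenet", pretrained=False, downsample_factor=16)

# Option 1: if nets.mobilenetv2 supports it, simply build with pretrained=True
# so the backbone starts from ImageNet-style weights.
# model = DeepLab(num_classes=3, backbone="mobilenet", pretrained=True, downsample_factor=16)

# Option 2: load whatever overlapping weights exist from an earlier experiment.
ckpt = torch.load("mobilenet_v2_weights.pth", map_location="cpu")  # hypothetical path
missing, unexpected = model.backbone.load_state_dict(ckpt, strict=False)
print("missing keys:", len(missing), "| unexpected keys:", len(unexpected))

# A short linear warmup keeps the randomly initialized AG_ASPP_MViT and decoder
# from destabilizing the first few hundred iterations.
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=1e-4)
warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.1, total_iters=500)
# Call warmup.step() once per iteration during the warmup phase.
```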