Created: May 6, 2026
Answered using GPT-5.5 Thinking Chat01
Attached is the version revised according to the reviewers' comments. The reviewers' main criticisms are that the novelty is insufficient and that the proposed components lack interconnection. To show that the paper proposes a complete framework, Fig. 1 and Fig. 2 were added, but they do not seem to address the problem. How should the innovations be linked together? Which paragraphs of the paper need to be revised? Write out the revised English paragraphs; the Abstract and the summary of contributions must be rewritten. Which other parts must also be revised?
I have reviewed the attached revised manuscript. It now covers the 1×1 pixel token, multiscale convolution, Head-Domain Attention, CAB, and the lightweight complexity analysis, but the main problem remains: the innovations still read like a parallel stack rather than a complete framework derived from one underlying contradiction. The Abstract, the contribution list in the Introduction, the last paragraph of Related Work, the Motivation, the opening of the Methodology, and the Ablation all need to be revised together; adding Fig. 1/Fig. 2 alone is not enough. In the attachment, Fig. 1 mainly compares conventional MHA with LMCAN, Fig. 2 mainly illustrates the spectral-fidelity advantage of 1×1 tokens, Fig. 3 is the overall network architecture, and Fig. 4 shows the CAB mechanism; therefore Fig. 1 and Fig. 2 currently cannot by themselves answer the reviewers' question of why these components must be combined into one framework. (LMCAN_v2.pdf)
Do not present the innovation as:
multiscale convolution + head-domain attention + center attention + lightweight
but rather as:
HS–MS fusion requires pixel-level spectral fidelity. Therefore, LMCAN keeps 1×1 pixel tokens instead of patch merging. However, pixel tokens reduce the effective receptive field under window attention. To resolve this spectral-fidelity/spatial-context trade-off, LMCAN first injects multiscale receptive fields before attention and assigns them to different heads. Since these scale-aware heads would otherwise remain independent in conventional MHA, Head-Domain Attention is introduced to model cross-scale head correlations. Finally, because window attention still misses neighborhoods across window boundaries, CAB performs center-based local compensation.
In other words:
1×1 pixel tokens preserve the spectra → the receptive field becomes insufficient → pre-attention multiscale heads enlarge the receptive field → HAB lets the multiscale heads interact → CAB repairs the missing neighborhoods at window boundaries → together forming a unified shallow multiscale attention framework for HS–MS fusion.
This logic must appear consistently throughout the Abstract, Introduction, Related Work, Motivation, Methodology, and Ablation. The current manuscript already adds a "not a simple combination" paragraph after the contribution list, but it reads more like a declaration and lacks the closed loop of problem → derivation → corresponding design → evidence. (LMCAN_v2.pdf)
| Location | Why it must be revised | Revision direction |
|---|---|---|
| Abstract | The modules are listed, but why they are interrelated is still conveyed too weakly | Reorganize it around one core contradiction: spectral fidelity vs. spatial context |
| Last two paragraphs of the Introduction | The contributions are currently a list of modules | Rewrite as "one framework, three coupled designs" |
| Contributions | Must be rewritten | Each contribution should state which problem raised by the previous design it resolves |
| End of Related Work B | A gap is stated, but the wording is generic and "some studies…" is not pointed enough | State explicitly that existing methods lack a unified "pixel-token-preserving + cross-scale-head interaction + boundary compensation" design |
| Motivation | The derivation after Fig. 2 goes in the right direction, but Fig. 2 must be tied to the three modules more explicitly | Rewrite the Motivation as design principles instead of separately explaining the 1×1 token and CAB |
| Opening of Methodology A | Saying the network has three components invites reviewers to read it as an assembly | Change to "the architecture implements the above design principle" |
| Openings of the MMH / CAB subsections | The interfaces must be explicit: MSC generates scale-aware heads, HAB handles head-domain communication, and CAB is the boundary compensation after MMH | Do not present CAB as an independent patch |
| Fig. 1 / Fig. 2 captions and in-text references | The figures currently cannot adequately illustrate the complete framework | Redraw Fig. 1 as a "problem-to-solution framework" and turn Fig. 2 into quantitative support for spectral fidelity |
| Ablation Study | The current ablation differences are small, so reviewers may find the contribution of each innovation unconvincing | Add a cumulative ablation, cross-scale/head affinity visualizations, and boundary-region metrics |
| Conclusion | Still a list of modules | Restate the unified framework and the closed design loop |
The following can directly replace the original Abstract. Although the current Abstract mentions scale-specific features, Head-Domain Attention, and CAB, it has not yet framed them explicitly as a unified framework driven by the same contradiction. (LMCAN_v2.pdf)
Abstract— Hyperspectral–multispectral (HS–MS) image fusion aims to reconstruct a high-resolution hyperspectral image by combining the rich spectral signatures of low-resolution hyperspectral observations with the fine spatial structures of high-resolution multispectral images. A key challenge in Transformer-based fusion is the trade-off between spectral fidelity and spatial context modeling: patch merging and deep hierarchical designs enlarge the receptive field but may mix pixel-wise spectral signatures and increase computational cost, whereas 1 × 1 pixel tokens preserve spectra but limit contextual modeling within window attention. To address this trade-off, we propose LMCAN, a Lightweight Multi-Scale Centralized Attention Network that formulates HS–MS fusion as a unified pixel-token-preserving multiscale attention framework. Specifically, LMCAN first introduces pre-attention multiscale convolution to generate scale-specific features while maintaining pixel-level spectral representations. These features are explicitly assigned to different attention heads, so that each head models dependencies under a distinct receptive field. To avoid independent processing among scale-aware heads, a Head-Domain Attention module is further developed to learn adaptive head-to-head correlations and enable cross-scale communication within a shallow attention layer. Finally, a Center Attention Block is designed as a boundary-compensation mechanism to recover neighborhood interactions missed by fixed window partitioning. The proposed design avoids hierarchical patch merging, preserves spectral integrity, and achieves multiscale spatial–spectral interaction with low computational complexity. Experiments on CAVE, Harvard, Chikusei, and Pavia datasets demonstrate that LMCAN achieves superior reconstruction accuracy and efficiency compared with representative conventional, CNN-based, and Transformer-based fusion methods.
Suggested replacement for the Introduction paragraph spanning from "However, most Transformer-based fusion models…" to "To overcome these challenges…". The original already points out deep stacking, patch merging, cross-scale interaction, and the window-boundary problem, but the causal links between them are still too weak. (LMCAN_v2.pdf)
Existing Transformer-based HS–MS fusion methods usually enlarge the receptive field by deep stacking, shifted-window operations, or hierarchical patch merging. Although these strategies are effective for natural image restoration, they are not fully aligned with the spectral-fidelity requirement of HS–MS fusion. In hyperspectral data, each spatial pixel corresponds to a high-dimensional spectral signature; therefore, merging neighboring pixels into larger tokens may introduce artificial spectral mixing. A straightforward solution is to preserve 1 × 1 pixel tokens, but this choice narrows the effective receptive field of window-based attention and weakens multiscale spatial context modeling. This reveals a fundamental dilemma: HS–MS fusion requires pixel-level spectral preservation and multiscale spatial interaction simultaneously, while conventional hierarchical Transformers usually improve one at the expense of the other. Moreover, standard multi-head attention treats different heads as independent subspaces and fuses them only by concatenation and projection, which is insufficient for explicit cross-scale communication. Fixed window partitioning further causes missing neighborhood interactions near window boundaries, which is harmful to edge and texture reconstruction.
Suggested placement: before the contribution list, replacing the existing "To overcome these challenges, we propose..." paragraph.
To address the above dilemma, we propose LMCAN, a Lightweight Multi-Scale Centralized Attention Network for HS–MS image fusion. The central idea of LMCAN is not to independently add multiscale convolution, head interaction, and local refinement, but to organize them into a single pixel-token-preserving attention framework. First, LMCAN keeps the 1 × 1 token representation to avoid spectral mixing. Second, it injects multiscale receptive fields before attention and assigns scale-specific features to different heads, thereby compensating for the limited receptive field caused by pixel tokens without using patch merging. Third, Head-Domain Attention explicitly models correlations among these scale-aware heads, enabling cross-scale information exchange within a shallow structure. Finally, Center Attention is introduced after window attention to compensate for boundary-deficient neighborhoods. In this way, LMCAN jointly addresses spectral preservation, multiscale context modeling, cross-scale interaction, and boundary reconstruction under a lightweight design.
Must be rewritten. Do not phrase it as "we propose module A, module B, module C"; phrase it as "we propose one unified mechanism, and the modules are different stages of that mechanism". The current contributions are fairly clear but still read like four parallel modules. (LMCAN_v2.pdf)
The main contributions of this work are summarized as follows:
A unified pixel-token-preserving multiscale attention framework for HS–MS fusion. We identify the spectral-fidelity/spatial-context trade-off in Transformer-based HS–MS fusion and propose a shallow attention framework that preserves 1 × 1 pixel tokens while introducing multiscale contextual modeling without hierarchical patch merging.
Scale-aware head construction before attention. Instead of splitting homogeneous feature channels as in conventional multi-head attention, LMCAN first extracts scale-specific features using parallel dilated convolutions and then assigns them to different heads. This design makes each head correspond to a distinct receptive field, allowing multiscale spatial–spectral modeling while maintaining pixel-level spectral signatures.
Head-Domain Attention for explicit cross-scale communication. To overcome the independence of conventional attention heads, we design a Head-Domain Attention module that learns adaptive head-to-head affinities among scale-aware heads. This enables information exchange across receptive fields before final feature projection and strengthens the coupling between multiscale feature extraction and attention-based fusion.
Center Attention for boundary-aware neighborhood compensation. We introduce a Center Attention Block that uses each reconstruction pixel as the query center and aggregates information from its spatial neighborhood. This mechanism compensates for the missing local interactions caused by fixed window partitioning and improves edge and boundary reconstruction without sacrificing spectral fidelity.
Comprehensive validation of accuracy–efficiency trade-off. Experiments on CAVE, Harvard, Chikusei, and Pavia datasets, together with complexity analysis, efficiency evaluation, and ablation studies, demonstrate that LMCAN achieves competitive or superior fusion performance with significantly lower model complexity than representative CNN-based and Transformer-based methods.
Suggested replacement for the last paragraph of Related Work B. The current manuscript already says "two issues remain insufficiently addressed", but it can be more focused. (LMCAN_v2.pdf)
Despite the progress of Transformer-based HS–MS fusion, existing methods still lack a unified design that simultaneously satisfies spectral fidelity, multiscale context modeling, and efficient feature interaction. Hierarchical Transformers usually obtain multiscale representations through repeated stacking or patch merging, which increases channel dimensionality and may compromise pixel-wise spectral signatures. Some recent studies introduce convolutional priors, inter-head communication, or local neighborhood attention, but these mechanisms are typically used as independent enhancements rather than as coupled components derived from the specific requirements of HS–MS fusion. In contrast, LMCAN starts from the need to preserve 1 × 1 pixel tokens and then addresses the resulting limitations in a progressive manner: multiscale receptive fields are injected before attention, scale-aware heads are explicitly correlated by Head-Domain Attention, and boundary-deficient neighborhoods are compensated by Center Attention. Therefore, the novelty of LMCAN lies in a unified shallow multiscale attention framework tailored to pixel-level spectral preservation, rather than in a simple combination of existing modules.
Suggested replacement for the first half of the Motivation. Fig. 2 currently shows that 1×1 tokens preserve spectra better, but it needs to become the explicit starting point of the entire framework. (LMCAN_v2.pdf)
The motivation of LMCAN comes from a central requirement of HS–MS fusion: the reconstructed HR-HSI should preserve the spectral signature of each spatial pixel while recovering fine spatial details. Conventional Transformer designs often use K × K patch tokens or patch merging to reduce computational cost and enlarge the receptive field. However, for hyperspectral imagery, a patch may contain pixels with different material compositions and thus different spectral responses. Aggregating such pixels into a single token inevitably mixes their spectral signatures, which may degrade spectral reconstruction. As illustrated in Fig. 2, the 1 × 1 pixel-token strategy better preserves spectral profiles than conventional patch-based tokenization, making it more suitable for HS–MS fusion.
However, preserving 1 × 1 pixel tokens introduces a new difficulty: the receptive field of window-based attention becomes limited, and long-range or multiscale spatial dependencies cannot be sufficiently captured by a shallow Transformer block. Directly adopting hierarchical patch merging would reintroduce spectral mixing and increase the model complexity. Therefore, LMCAN follows a different design principle: multiscale context should be introduced before attention without changing the pixel-token representation. This motivates the proposed multiscale convolution module, which assigns different receptive fields to different channel groups and converts conventional homogeneous attention heads into scale-aware heads.
Once scale-aware heads are constructed, another limitation arises. In standard multi-head attention, different heads are processed independently and interact only after concatenation and output projection. This is insufficient for HS–MS fusion, because features extracted at different receptive fields should communicate adaptively during fusion. We therefore introduce Head-Domain Attention to learn head-to-head correlations among scale-aware heads. Finally, since fixed window partitioning still misses some spatially adjacent pixels at window boundaries, Center Attention is employed as a boundary-compensation mechanism. These three designs are thus not independent additions, but consecutive solutions to the same spectral-fidelity-preserving multiscale modeling problem.
Suggested replacement for the passage around "The network primarily consists of three components..." in Network Architecture. That sentence currently reinforces the impression of assembled components. (LMCAN_v2.pdf)
The architecture of LMCAN is designed to implement the above pixel-token-preserving multiscale attention principle. Given the upsampled LR-HSI and the HR-MSI, the network first forms a spatially aligned input that contains both complete spectral information and high-resolution spatial structures. Instead of using patch merging or hierarchical token downsampling, LMCAN keeps the spatial resolution unchanged throughout the feature interaction stage. The multiscale convolution module first converts the input into scale-aware pixel-token features. These features are then fed into the Feature Interaction Module, where MMH performs token-domain attention within each scale-aware head and Head-Domain Attention exchanges information across heads. After several MMH blocks, CAB further compensates for neighborhood information missed by fixed window partitioning. Therefore, the three modules play complementary roles within one framework: multiscale convolution constructs scale-aware heads, Head-Domain Attention couples these heads, and CAB restores boundary-local context.
The MMH equations themselves are reasonable, but the following paragraph should be added before them to emphasize the interface between MSC and HAB. The Head-Domain Attention part already presents the head-to-head affinity and the equations distinguishing it from conventional MHA, but reviewers should be told earlier why it is not a minor tweak of ordinary head attention. (LMCAN_v2.pdf)
The MMH block is the core component that transforms multiscale convolutional features into an explicit cross-scale attention representation. Different from conventional MHA, where all heads are obtained by linearly splitting the same token sequence into homogeneous subspaces, the heads in MMH are generated from scale-specific features. Therefore, each head has a clear physical meaning: it describes spatial–spectral dependencies under a specific receptive field. Token-domain self-attention is first performed within each head to model dependencies at that scale. Then, Head-Domain Attention is applied across head descriptors to learn how different receptive fields should be adaptively combined. This design enables scale-specific modeling and cross-scale communication within the same shallow attention layer.
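The two-stage coupling described above can be illustrated with a small self-contained sketch. This is not the paper's actual formulation (Eqs. 19–29 are not reproduced here); it is a minimal NumPy approximation in which the linear projections are omitted, head descriptors are obtained by average pooling (an assumption), and the head-domain affinity uses the square-root scaling discussed later:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mmh_sketch(X_heads):
    """X_heads: (H, N, d) tokens of H scale-aware heads, N pixel tokens each."""
    H, N, d = X_heads.shape
    # 1) token-domain self-attention within each head (identity projections)
    attended = np.stack([softmax(x @ x.T / np.sqrt(d)) @ x for x in X_heads])
    # 2) one descriptor per head (average pooling over tokens, an assumption)
    D = attended.mean(axis=1)                      # (H, d)
    # 3) head-domain attention: head-to-head affinities with sqrt(d) scaling
    A = softmax(D @ D.T / np.sqrt(d))              # (H, H), rows sum to 1
    # 4) mix the heads according to the learned affinities
    return np.einsum('hk,knd->hnd', A, attended)
```

The point of the sketch is the order of operations: token-domain attention runs independently per scale-aware head, and only afterwards are the heads mixed through the learned affinity matrix, which is exactly what conventional concatenation-plus-projection MHA does not do.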
The current definition of CAB is fairly complete, but it must be emphasized that CAB is not "another attention module"; it is the boundary-compensation mechanism for window attention. (LMCAN_v2.pdf)
CAB is introduced as a boundary-compensation step after MMH rather than as an independent refinement module. Although MMH provides multiscale interaction within window attention, fixed window partitioning may still prevent spatially adjacent pixels from attending to each other when they fall into different windows. This problem is particularly harmful in HS–MS fusion, where the reconstruction of a target pixel depends not only on global context but also on its immediate local neighborhood. CAB addresses this limitation by treating each reconstruction pixel as the query center and sampling keys and values from a center-aligned neighborhood. In this way, each pixel can directly aggregate the most relevant surrounding information, including boundary-crossing neighbors, while the 1 × 1 spectral representation remains unchanged.
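A minimal sketch of the center-attention idea, assuming identity Q/K/V projections and a single head (the actual CAB uses multi-head projections and head-specific dilations, which are omitted here for brevity):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def center_attention_sketch(F, k=3):
    """F: (H, W, C) feature map. Each pixel is the query center of a k×k
    neighborhood; keys/values come from that neighborhood, so adjacent
    pixels interact even when a window partition would separate them."""
    H, W, C = F.shape
    p = k // 2
    Fp = np.pad(F, ((p, p), (p, p), (0, 0)), mode='edge')  # border handling
    out = np.empty_like(F)
    for i in range(H):
        for j in range(W):
            nb = Fp[i:i + k, j:j + k].reshape(-1, C)  # k*k keys/values
            q = F[i, j]                                # query = center pixel
            w = softmax(nb @ q / np.sqrt(C))           # attention weights
            out[i, j] = w @ nb                         # aggregated neighborhood
    return out
```

Because the neighborhood is always centered on the query pixel, the sampling region crosses any fixed window boundary, which is the compensation property the text argues for; the 1×1 channel-wise spectral representation is left untouched.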
Fig. 1 currently shows only "Conventional MHA vs LMCAN", but the reviewers care about the complete framework and the relations among components. Redraw the new Fig. 1 as a problem-to-solution framework:
Top row:
HS–MS fusion requirement → preserve pixel spectra → use 1×1 tokens
Middle row:
Problem caused by 1×1 tokens → limited receptive field → pre-attention multiscale heads
Bottom row:
Problem caused by independent heads → HAB cross-scale communication → problem caused by window boundaries → CAB boundary compensation
Output on the right:
spectral fidelity + multiscale spatial context + low complexity
New caption:
Fig. 1. Design rationale of LMCAN. The proposed framework is derived from the spectral-fidelity requirement of HS–MS fusion. LMCAN first preserves 1 × 1 pixel tokens to avoid spectral mixing, then injects multiscale receptive fields before attention to compensate for the limited context of pixel tokens. Head-Domain Attention explicitly models cross-scale correlations among scale-aware heads, while Center Attention compensates for missing boundary neighborhoods caused by fixed window partitioning. These coupled designs form a unified shallow multiscale attention framework rather than a simple combination of independent modules.
Fig. 2 is going in the right direction, but its curves and annotations need to be clearer. Add numbers for "Average SAM / mean spectral error for n×n token and 1×1 token", ideally as a bar plot or an embedded table. Fig. 2 currently only shows that 1×1 tokens preserve spectra better without directly connecting to HAB/CAB. The in-text reference can be revised to:
As shown in Fig. 2, patch-based tokenization introduces larger spectral deviations because each token aggregates multiple spatial pixels with potentially different spectral signatures. This observation motivates the use of 1 × 1 pixel tokens in LMCAN. However, this choice reduces the receptive field of window attention, which further motivates the subsequent multiscale head construction and head-domain interaction. Therefore, Fig. 2 provides the empirical basis for the first design decision of LMCAN, while Fig. 1 summarizes how the remaining modules are derived from this decision.
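The spectral-mixing argument can be made concrete with a toy computation. Assuming two made-up material spectra (hypothetical numbers, not taken from the paper), merging a mixed 2×2 patch into one token shifts the token spectrum away from either pure pixel, as measured by the spectral angle (SAM):

```python
import numpy as np

def sam_degrees(a, b):
    """Spectral angle between two spectra, in degrees."""
    cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

# hypothetical 3-band spectra of two materials inside one 2x2 patch
grass = np.array([0.10, 0.40, 0.80])
soil  = np.array([0.70, 0.50, 0.30])
patch = np.stack([grass, grass, grass, soil])

token_1x1   = grass           # pixel token keeps the pure spectrum
token_patch = patch.mean(0)   # merged token mixes both materials

assert sam_degrees(grass, token_1x1) < 1e-6   # no spectral distortion
assert sam_degrees(grass, token_patch) > 5.0  # clear angular deviation
```

This is exactly the kind of quantitative evidence the revised Fig. 2 should report on real CAVE pixels, with average SAM per token size instead of a single synthetic pair.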
The differences in the current Table VI ablation are small, e.g., the full model reaches 49.04 dB while Exp. I/II/III reach 48.92, 48.99, and 48.86 dB, so reviewers may consider the module contributions insignificant. (LMCAN_v2.pdf) Add a cumulative ablation instead of only a replacement ablation.
Suggested new table structure:
| Variant | 1×1 token | MSC | HAB | CAB | PSNR | SAM | ERGAS | SSIM |
|---|---|---|---|---|---|---|---|---|
| Baseline W-MSA | ✓ | × | × | × | | | | |
| + MSC | ✓ | ✓ | × | × | | | | |
| + MSC + HAB | ✓ | ✓ | ✓ | × | | | | |
| + MSC + HAB + CAB | ✓ | ✓ | ✓ | ✓ | | | | |
Suggested replacement for the ablation introduction:
To verify that the proposed modules are coupled components of a unified framework rather than independent enhancements, we conduct cumulative and replacement ablation studies on the CAVE dataset. Starting from a pixel-token window-attention baseline, we progressively add the multiscale convolution module, Head-Domain Attention, and Center Attention. This setting directly evaluates the proposed design logic: MSC compensates for the limited receptive field of 1 × 1 tokens, HAB enables communication among the resulting scale-aware heads, and CAB recovers boundary-local neighborhoods missed by fixed window partitioning.
Suggested replacement for the ablation analysis:
The cumulative results show that each component improves the fusion performance in a progressive manner. Adding MSC improves spatial context modeling by assigning different receptive fields to different heads. Introducing HAB further improves spectral–spatial reconstruction because the scale-aware heads are no longer processed independently; instead, their correlations are adaptively learned in the head domain. Finally, adding CAB improves boundary-sensitive reconstruction by allowing each target pixel to aggregate information from its center-aligned neighborhood. These results confirm that the performance gain of LMCAN comes from the coordinated interaction among the three designs, rather than from a simple increase in model capacity.
Two additional types of supplementary experiments are also recommended.
Suggested replacement for the current Conclusion. The present conclusion is still a module list in the order "first MSC, then MMH, then CAB". (LMCAN_v2.pdf)
In this paper, we proposed LMCAN, a lightweight pixel-token-preserving multiscale attention framework for HS–MS image fusion. The proposed method is motivated by the trade-off between spectral fidelity and spatial context modeling in Transformer-based fusion. By preserving 1 × 1 pixel tokens, LMCAN avoids spectral mixing caused by patch merging. To compensate for the limited receptive field of pixel tokens, multiscale receptive fields are introduced before attention and assigned to different heads. Head-Domain Attention further enables adaptive cross-scale communication among these scale-aware heads, while Center Attention compensates for missing local neighborhoods near window boundaries. In this way, LMCAN integrates spectral preservation, multiscale spatial modeling, cross-scale interaction, and boundary refinement into a unified shallow architecture. Extensive experiments on simulated and remote sensing datasets demonstrate that LMCAN achieves a favorable balance between reconstruction accuracy, spectral fidelity, and computational efficiency.
Unify the naming: the text currently uses SL-MMHA, MMH, HAB, Head-Domain Attention, Center Attention, and Centralized Attention interchangeably. Suggested unified naming:
MSC → MMH → HDA/HAB → CAB. The "Centralized Attention" in the title versus the "Center Attention Block" in the body must also be reconciled, otherwise reviewers will find the concepts inconsistent.
Clarify the relation between H and the 3 dilation branches: MSC says the features are split into three heads with dilation = 1, 2, 3, while the MMH equations use a general H heads. Add a statement such as:
“For simplicity, we use three scale-aware groups in implementation; the formulation can be generalized to H heads.”
Check the scaling term of Eq. 25: the softmax scaling in head-domain attention should be written as sqrt(d_a), not as the easily misread sqrt(H)·d_a or some ambiguous form.
Avoid the symbol clash on M: earlier M denotes the window size, while in CAB M denotes the number of heads. Use S for the window size and H_c or M_h for the head number.
Fig. 3 should be moved forward or renumbered as Fig. 2: the overall network diagram currently appears only as Fig. 3, but the review asks for the "complete framework". Make the new design-rationale figure Fig. 1, the overall architecture Fig. 2, and the spectral-fidelity comparison Fig. 3 or supplementary material.
Support "not a simple combination" with experimental evidence: text alone is not enough. Cumulative ablation, head-affinity visualization, and boundary-mask evaluation are required.
Tone down the experimental claims: for example, "state-of-the-art performance across all datasets" should be used cautiously, because the SSIM on Pavia ties several methods at 0.996 and the SAM on CAVE ties PSRT. Write "competitive or superior" instead.
Language details:
Change "Another way" to "In addition" or "Recent studies also show".
Unify "Cave dataset" as "CAVE dataset".
Unify "Table 4" as "Table IV".
Use either "Multiscale" or "multi-scale" consistently.
In the figure captions, change "LMCAN(Ours)" to "LMCAN (Ours)".
Related Work needs more targeted comparisons with MSST-Net, PSRT, PanFormer, HyperTransformer, etc.: do not merely list methods; state explicitly what each fails to address, namely patch merging, head independence, boundary neighborhoods, and pixel-token spectral preservation.
The response to the reviewers should state explicitly:
“We have revised the Abstract, Introduction, Related Work, Motivation, Methodology, Fig. 1, Fig. 2, and Ablation Study to clarify that LMCAN is a unified framework derived from the spectral-fidelity/spatial-context trade-off, rather than a simple combination of independent components.”
Based on the comments above, the attachment is my revised version. Please review the article carefully in terms of both content and format: does it meet the standard of a high-level SCI paper? Check the logic of the exposition, whether the figures and tables are reasonable, whether the experimental setup is complete, whether each subsection is correct and appropriate, whether the equations are correct, and so on.
Conclusion first: this version is clearly closer to expressing a "complete framework" than the previous one, but it has not yet reached the submission-ready standard of a high-level SCI/TGRS-style paper. The main reason is not that the method is fundamentally flawed, but that the presentation, the rigor of the equations, the experimental support, and the persuasiveness of the figures still have obvious shortcomings. In particular, the text still contains a large amount of red revision markup, the naming is inconsistent, several equations/symbols are not rigorous enough, Fig. 1/Fig. 2 are still not professional enough, the ablation in Table VI is insufficient to support the strong claim that the modules work synergistically, and the efficiency conclusion from Table V also needs more caution. (LMCAN_v2+(2).pdf)
The core narrative of the manuscript is now largely in place:
1×1 pixel tokens preserve the spectra → the receptive field becomes insufficient → MSC provides multiscale receptive fields → HDA establishes interaction among the multiscale heads → CAB compensates for the missing neighborhoods at window boundaries.
This logic already appears in the Abstract, Introduction, contributions, Related Work gap, and Conclusion, which is the right direction. The Abstract now explicitly states the contradiction between spectral fidelity and spatial context modeling and describes LMCAN as a pixel-token-preserving multiscale attention framework, a clear improvement over the previous version. (LMCAN_v2+(2).pdf)
However, a high-level SCI paper requires not only a coherent story but also figures, equations, experiments, language, and layout that withstand item-by-item scrutiny. Several high-risk problems remain.
My judgment is therefore:
The current version can serve as a heavily revised revision draft, but it should not be submitted directly as a high-level SCI clean version. Another round of systematic technical review, equation correction, figure redrawing, and experiment strengthening is needed.
The logic of the Abstract is now correct: it first raises the contradiction between spectral fidelity and spatial context in Transformer-based fusion and then introduces the 1×1 token, MSC, HDA, and CAB. It also emphasizes avoiding hierarchical patch merging, preserving spectral integrity, and low complexity, all of which respond to the review comments about insufficient novelty and weak component interconnection. (LMCAN_v2+(2).pdf)
Two points still need revision.
First, "superior reconstruction accuracy and efficiency" in the Abstract is too strong. In Table V, LMCAN has very low FLOPs and parameter counts, but its latency is not the lowest; GPPNN has better latency and FPS, and on Pavia the SSIM only ties several methods. Suggested wording:
achieves competitive or superior reconstruction accuracy with a favorable accuracy–efficiency trade-off.
Second, whether a code link belongs at the end of the Abstract depends on the submission guidelines. Under double-blind review, a GitHub link may reveal the authors' identity; otherwise, place it in a footnote or a Code Availability statement rather than in the body of the Abstract.
The contributions are much better than in the previous version. In particular, the first point defines the method as a "unified pixel-token-preserving multiscale attention framework", and points 2–4 are organized along the design chain instead of simply listing modules. (LMCAN_v2+(2).pdf)
The fifth point still needs more caution. It currently claims "significantly lower model complexity", which holds for parameters and FLOPs but not entirely for latency. Suggested revision:
The proposed method achieves a favorable trade-off among reconstruction accuracy, parameter count, FLOPs, memory usage, and inference speed.
The main line of the Introduction basically holds. It now summarizes the problems of existing Transformers as follows: patch merging may destroy pixel-wise spectral signatures, 1×1 tokens limit the receptive field of window attention, standard MHA lacks inter-head interaction, and fixed window partitioning causes missing boundary neighborhoods. This logic is sound. (LMCAN_v2+(2).pdf)
Three problems remain.
First, use "state-of-the-art results across multiple benchmarks" cautiously. Saying that INNformer, Panformer, PSRT, etc. achieve state-of-the-art results is fine, but do not apply "state-of-the-art" repeatedly to your own method. Use "competitive or superior" consistently.
Second, "Center" and "Centralized" must be unified. The title and Abstract say "Center Attention Network", but the body still says "Lightweight Multi-Scale Centralized Attention Network". Suggested unified name:
LMCAN: Lightweight Multi-Scale Center Attention Network
If you prefer to keep "Centralized", then the title, Abstract, and CAB name must all follow it; however, I recommend "Center Attention", because the CAB mechanism is center-aligned neighborhood attention.
Third, the reference to Fig. 1 at the end of the Introduction is still weak. The Fig. 1 caption is currently "Difference from Conventional Multi-Head Attention", while the text is trying to convey the complete design rationale. Suggested revision:
Fig. 1 illustrates the design rationale of LMCAN rather than only the difference from conventional MHA.
Related Work has broad coverage, including traditional methods, CNNs, deep unfolding, Transformer-based fusion, and convolution–Transformer hybrids, which is sufficient in scope. (LMCAN_v2+(2).pdf)
For a high-level SCI paper, however, Related Work is still not critical enough. Some paragraphs remain literature stacking; the Transformer part in particular lists PanFormer, PSRT, CSAKD, HyperFormer, DCTransformer, CasFormer, MSST-Net, PSCINN, etc. in a row, without a focused analysis of why each class of methods cannot solve the problem addressed in this paper.
It is advisable to regroup the Transformer-related work into three categories.
The wording in this part must also be revised.
The direction of the Motivation is right: Fig. 2 shows that patch tokenization may cause spectral mixing, which explains why 1×1 pixel tokens are used; the insufficient receptive field of 1×1 tokens then leads to MSC, HDA, and CAB. (LMCAN_v2+(2).pdf)
But Fig. 2 is still not strong enough. The figure contains a patch crop, a 1×1 crop, spectral curves, and "PSRT / Our" labels, but it looks like a schematic rather than a rigorous experimental figure. The text claims "quantitative evaluations on the CAVE dataset demonstrate...", yet the figure gives no clear quantitative metrics such as average SAM, MRAE, RMSE, or spectral error bars. (LMCAN_v2+(2).pdf)
Suggested three-part structure for Fig. 2:
(a) a schematic of patch tokens versus 1×1 tokens;
(b) average SAM / MRAE / RMSE under different patch sizes;
(c) a spectral curve comparison for a representative pixel.
Otherwise reviewers may object that "Fig. 2 merely illustrates an assumption and does not prove that 1×1 tokens preserve spectra better."
In addition, the complexity derivation in Eqs. 2–5 of the Motivation partly duplicates the later Section III-D complexity analysis. Keep the intuitive explanation in the Motivation and concentrate the detailed derivation in the Complexity Analysis to avoid repetition.
The overall structure of this subsection is reasonable. The input is the concatenation of the upsampled LR-HSI and the HR-MSI, which passes through the multiscale convolution, the FIM, and the reconstruction module, and residual learning produces the HR-HSI; this pipeline is clear. The text also explains the complementary roles of MSC, MMH/HDA, and CAB. (LMCAN_v2+(2).pdf)
There are two high-risk points.
First, inconsistent terminology. The subsection opens with "The proposed Multiscale Multi-Head Network", but the paper's name is LMCAN. Suggested revision:
The proposed LMCAN adopts a single-stream residual learning paradigm...
第二,Eq.14 写成:
IHR-HSI = Xup + ReLU(Conv(Fdeep))
这个公式需要慎重。残差学习中 residual 可能为正也可能为负,如果对 residual 使用 ReLU,会限制网络只能做非负修正,理论上不合理。除非你的代码确实这样实现且有实验验证,否则建议改为:
其中 是 reconstruction head。最后是否 clamp 到有效范围可以单独说明。
The MSC subsection is basically sound: after the 1×1 embedding, the features are split into three groups, processed by 3×3 convolutions with dilation = 1, 2, 3, and concatenated. (LMCAN_v2+(2).pdf)
One point needs clarification.
Add a sentence such as:
In our implementation, the number of scale-aware groups is set to three, corresponding to dilation rates 1, 2, and 3. When a larger number of attention heads is used, heads sharing the same dilation rate are grouped into the same scale category.
If the implementation does not support this, do not claim "generalized to H heads".
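For illustration, the three-group construction can be sketched as follows, assuming depthwise 3×3 dilated convolutions (the paper's convolutions may also include pointwise channel mixing; this sketch only shows the grouping and dilation pattern):

```python
import numpy as np

def dilated_dwconv3x3(x, w, d):
    """Depthwise 3×3 convolution with dilation d. x: (C, H, W), w: (C, 3, 3)."""
    C, H, W = x.shape
    xp = np.pad(x, ((0, 0), (d, d), (d, d)))
    out = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            out += w[:, i, j][:, None, None] * xp[:, i * d:i * d + H, j * d:j * d + W]
    return out

def msc_sketch(x, weights):
    """Split channels into three scale groups, apply dilation 1/2/3, concat.
    Spatial size and channel count are unchanged, so 1×1 pixel tokens survive."""
    groups = np.array_split(x, 3, axis=0)
    outs = [dilated_dwconv3x3(g, w, d)
            for g, w, d in zip(groups, weights, (1, 2, 3))]
    return np.concatenate(outs, axis=0)
```

The key property the sketch demonstrates is that each channel group sees a different effective receptive field (3×3, 5×5, 7×7) while the per-pixel token layout is preserved, which is what later allows the groups to serve as scale-aware heads.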
The basic idea of MMH is sound: token-domain attention is first performed within each scale-aware head, and Head-Domain Attention then provides head-to-head interaction. The framework of Eqs. 19–29 is acceptable. (LMCAN_v2+(2).pdf)
Several issues nonetheless need correction or explanation.
The CAB equations are fairly complete overall. Each pixel serves as the query center, keys/values are sampled from a center-aligned neighborhood, and multiple heads with different dilation factors enlarge the receptive field. This mechanism is consistent with the boundary-compensation goal. (LMCAN_v2+(2).pdf)
Three issues remain.
This is one of the equations most in need of rigor. The text currently writes the spatial degradation as a plain right-multiplication, but the HR-HSI is a third-order tensor while the spectral response is a matrix; the spatial degradation cannot simply be written as a right matrix product unless a tensor unfolding or mode-wise multiplication is explicitly defined. (LMCAN_v2+(2).pdf)
Rewrite it in one of the following more rigorous forms, for example with explicit operators,
Y = D(X), Z = R(X),
where D denotes blur + downsampling and R applies the spectral response matrix. Alternatively, use the matrix unfolding form,
Y_(s) = X_(s) B S, Z = R X_(s),
where X_(s) is the spatial unfolding of the HR-HSI with one column per pixel.
The key is that all dimensions must match exactly.
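As a sanity check on the standard matricized HS–MS degradation model (spatial degradation Y = X B S, spectral degradation Z = R X, with X the spatially unfolded HR-HSI), the following sketch verifies that every product is well defined; the sizes are toy values and the blur is an identity placeholder:

```python
import numpy as np

# toy sizes: L HSI bands, H×W spatial grid, l MSI bands, downsampling ratio r
L, H, W, l, r = 8, 12, 12, 3, 4
HW, hw = H * W, (H // r) * (W // r)

rng = np.random.default_rng(0)
X = rng.normal(size=(L, HW))   # spatially unfolded HR-HSI, one column per pixel
B = np.eye(HW)                 # blur matrix (identity placeholder in this sketch)
S = np.zeros((HW, hw))         # downsampling selection matrix, one-hot columns
for c, (i, j) in enumerate((i, j) for i in range(H // r) for j in range(W // r)):
    S[(i * r) * W + j * r, c] = 1.0
R = rng.normal(size=(l, L))    # spectral response matrix

Y = X @ B @ S                  # LR-HSI:  (L, hw)
Z = R @ X                      # HR-MSI:  (l, HW)
assert Y.shape == (L, hw) and Z.shape == (l, HW)
```

Making these shapes explicit in the paper (X ∈ R^{L×HW}, B ∈ R^{HW×HW}, S ∈ R^{HW×hw}, R ∈ R^{l×L}) is exactly the dimensional-consistency check the reviewers will perform.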
The derivation idea is basically acceptable: after patch merging the spatial size shrinks, the channels grow, and the projection term increases. But the wording must be more careful. The text currently states that the projection parameters after patch merging exhibit quartic growth O(k⁴); this holds for the projection parameters of a single layer whose channels grow from C to k²C under k×k merging, but if an actual Swin stage applies channel reduction or uses a different number of blocks, it cannot be generalized to all hierarchical Transformers. (LMCAN_v2+(2).pdf)
Suggested wording:
Under the simplified setting without additional channel reduction, the projection parameters scale as...
This is more rigorous.
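The simplified scaling claim is easy to verify numerically. Under the stated assumption of a single square projection layer and k×k merging without channel reduction, channels grow from C to k²C and the projection parameters grow by a factor of k⁴:

```python
def proj_params(channels):
    """Parameter count of one square channel projection (channels × channels),
    under the simplified no-channel-reduction setting."""
    return channels * channels

C = 32
base = proj_params(C)
# k×k merging multiplies channels by k², hence projection parameters by k⁴
growth = {k: proj_params(k * k * C) // base for k in (1, 2, 3, 4)}
assert growth == {1: 1, 2: 16, 3: 81, 4: 256}
```

Stating the assumption explicitly, as in the suggested sentence above, keeps the claim correct without overgeneralizing it to every hierarchical Transformer.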
In the current rendering, the scaling term of Eq. 25 is sqrt(d_a), which is correct; the easily misread form from the previous version is essentially fixed. (LMCAN_v2+(2).pdf)
The complexity analysis supports the lightweight claim, but its boundary conditions still need to be stated.
The experiments cover the four datasets CAVE, Harvard, Chikusei, and Pavia; the metrics include PSNR, SAM, ERGAS, and SSIM; and the baselines include traditional methods, CNN/deep-unfolding methods, and Transformer methods, so the overall framework is complete. The data simulation uses Gaussian blur, 4× bicubic downsampling, and the Nikon D700 spectral response function, which follows the common HS-MS fusion experimental paradigm. (LMCAN_v2+(2).pdf)
However, to reach a high-level SCI standard, several elements are still missing.
The four main result tables are convincing overall. On CAVE, LMCAN reaches PSNR 49.04, ERGAS 1.18, and SSIM 0.996, a strong overall result; on Harvard, PSNR 45.91, SAM 2.20, and ERGAS 1.95 are also good; the Chikusei and Pavia results are likewise solid. (LMCAN_v2+(2).pdf)
A few points must still be changed.
Table V is valuable because it adds FLOPs, latency, FPS, and memory. With 0.157M parameters, 12.06G FLOPs, and 333.17MB memory, LMCAN is indeed very light. (LMCAN_v2+(2).pdf)
But the conclusion must be cautious: LMCAN is not the fastest. GPPNN has latency 28.81ms and FPS 34.70, whereas LMCAN has latency 70.25ms and FPS 14.23. You may write:
LMCAN achieves the lowest parameter count, FLOPs, and memory usage among the compared methods while maintaining the highest reconstruction accuracy; although it is not the fastest model, it provides the best overall accuracy–complexity trade-off.
Do not write "best efficiency" or "fastest".
Table VI is currently the weakest experiment. The full model has the best PSNR, but not the best ERGAS: Exp. I and Exp. II both reach ERGAS 1.17 while Ours is 1.18, and SSIM is 0.996 for all four experiments. (LMCAN_v2+(2).pdf)
This means you cannot write:
each component improves the fusion performance in a progressive manner.
It should be rephrased more cautiously:
The full model achieves the best PSNR and SAM, while maintaining comparable ERGAS and SSIM. This suggests that the three modules jointly improve pixel-wise reconstruction and spectral angular consistency, although their effects on ERGAS are relatively marginal.
More importantly, the current ablation is not a cumulative ablation. The following must be added:
| Variant | MSC | HDA | CAB | PSNR | SAM | ERGAS | SSIM |
|---|---|---|---|---|---|---|---|
| Baseline W-MSA | × | × | × | | | | |
| + MSC | ✓ | × | × | | | | |
| + MSC + HDA | ✓ | ✓ | × | | | | |
| + MSC + HDA + CAB | ✓ | ✓ | ✓ | | | | |
Otherwise the interconnection among the components cannot be demonstrated convincingly.
The purpose of Fig. 1 is right, but visually it is still below standard. It currently compares conventional MHA with LMCAN and shows the scale-specific heads, HDA, and CAB, yet this framing still does not convey the design rationale the reviewers asked for.
Redraw it as a problem-driven framework:
Left: Spectral fidelity requirement → 1×1 token
Middle: Limited receptive field → MSC scale-aware heads
Right: Independent heads → HDA cross-scale communication
Bottom: Window boundary missing neighbors → CAB boundary compensation
Output: Spectral fidelity + multiscale context + low complexity
Fig. 2 is one of the figures most in need of revision. It is supposed to prove the necessity of 1×1 tokens, but the current image is overly schematic and the quantitative information is unclear. (LMCAN_v2+(2).pdf)
The revision should follow the three-part structure suggested earlier: a schematic, quantitative metrics under different patch sizes, and representative spectral curves.
Fig. 3 is the overall architecture diagram; its content is fairly complete and it looks more like a network architecture than Fig. 1. (LMCAN_v2+(2).pdf)
Its caption, however, should be changed to:
Overall architecture of LMCAN for HS–MS image fusion.
Fig. 4 shows the Q/K/V, the sampling, and the multi-head center attention of CAB, which is basically sound. (LMCAN_v2+(2).pdf)
However, "adaptive expansion rate" in the caption is inaccurate. The formula uses a fixed, head-specific dilation rate, not an adaptive one. Suggested wording:
samples K/V within a center-aligned neighborhood using head-specific dilation rates.
The visual comparison results are usable overall but still need improvement.
The following formatting problems must also be fixed.
Ranked by priority, my judgment is:
Innovation narrative: basically adequate, but it still needs figure and experimental support.
The manuscript can now explain that the method is not a simple module assembly, which is the biggest improvement.
Methodological rigor: not yet adequate.
The degradation model, the notation, the residual equation, and the complexity boundary conditions still need correction.
Experimental sufficiency: medium to weak.
Four datasets and many baselines are strengths, but the ablation, robustness, statistical analysis, and boundary/HDA-specific evidence are insufficient.
Figure and table quality: not adequate.
Fig. 1/Fig. 2 are still not professional enough, and the conclusions drawn from Table VI do not fully match its data.
Language and formatting: not adequate.
Red revision markup, inconsistent naming, formatting details, and grammatical slips will clearly hurt the reviewers' impression.
Final recommendation:
Do not submit the current version as the clean manuscript. Perform one more round of major polishing plus experiment strengthening. After completing the required revisions above, the paper can meet the basic requirements of a fairly high-level SCI journal; to target TGRS/Information Fusion-class journals, the cumulative ablation, boundary evaluation, head-affinity visualization, and a stricter experimental setup must be added.
The Abstract and contribution section of the current version need to be rewritten. The Abstract should avoid dashes, colons, and similar expressions with an obvious AI flavor. The contributions should be condensed to three points, written as a compound narrative that chains the modules together and emphasizes that the whole is one innovation; again, avoid an AI style.
Below are an Abstract and Contributions that can be used as direct replacements. I removed dashes, colons, overly templated sentence patterns, and obviously AI-flavored expressions as far as possible, and wrote MSC, HDA, and CAB as one continuous design logic rather than three isolated modules. The current Abstract and contributions are still module-list expressions, and there are five contribution points, which does not match your stated revision goal. (LMCAN_v2+(2).pdf)
Abstract
Hyperspectral and multispectral image fusion aims to recover a high-resolution hyperspectral image by combining the spectral richness of low-resolution hyperspectral observations with the spatial details of high-resolution multispectral images. A major difficulty in Transformer-based fusion lies in preserving pixel-wise spectral signatures while still capturing sufficient spatial context. Existing hierarchical designs usually enlarge the receptive field through deep stacking or patch merging, but these operations may mix the spectra of neighboring pixels and increase the computational burden. In this paper, we propose LMCAN, a lightweight multi-scale center attention network designed around a unified pixel-preserving attention scheme. LMCAN keeps each spatial location as a 1 × 1 pixel token to reduce spectral mixing. Since this choice limits the receptive field of window attention, multi-scale convolution is first used to construct scale-aware feature groups before attention, allowing different heads to model dependencies under different receptive fields. The resulting heads are no longer treated as independent branches. Instead, a head-domain attention mechanism is introduced to learn their correlations and promote information exchange across scales. After window attention, a center attention block further aggregates neighborhood information around each reconstruction pixel, which compensates for local interactions missed by fixed window partitioning and improves boundary detail reconstruction. In this way, spectral preservation, multi-scale context modeling, cross-scale interaction, and local boundary compensation are integrated into a compact fusion framework. Experiments on the CAVE, Harvard, Chikusei, and Pavia datasets show that LMCAN achieves competitive or superior reconstruction accuracy with a favorable balance among parameter count, computational cost, memory usage, and inference efficiency.
The main contributions of this work are summarized as follows.
We propose a lightweight pixel-preserving attention framework for hyperspectral and multispectral image fusion. The framework is motivated by the conflict between spectral fidelity and spatial context modeling. Instead of relying on patch merging or deep hierarchical attention, LMCAN maintains the 1 × 1 pixel token representation and introduces multi-scale receptive fields before attention, so that spatial context can be enriched without sacrificing pixel-wise spectral signatures.
We develop a coupled multi-scale attention mechanism in which convolutional scale construction and attention head interaction are jointly considered. Multi-scale convolution first assigns different receptive fields to different feature groups, which gives each attention head a clear scale meaning. Head-domain attention then learns adaptive correlations among these scale-aware heads, allowing information from different receptive fields to be exchanged before final feature fusion. This design turns conventional independent attention heads into an interacting cross-scale representation.
We introduce center attention as a local compensation mechanism within the same fusion framework. While the multi-scale head interaction strengthens contextual modeling inside window attention, fixed window partitioning can still miss adjacent pixels near window boundaries. Center attention uses each reconstruction pixel as the query center and aggregates information from its surrounding neighborhood, thereby improving boundary and texture recovery while keeping the pixel-level spectral representation unchanged. Together with the scale-aware head design, this forms a unified lightweight architecture for accurate and efficient hyperspectral image fusion.
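As an aside on the second contribution, the cross-head interaction it describes can be sketched in a few lines. The sketch below is a hypothetical NumPy illustration of the idea that scale-aware heads exchange information through learned head-to-head affinities; the function name, the mean-pooled head descriptor, and the softmax mixing are my assumptions, not the paper's actual HDA formulation.

```python
# Hypothetical sketch of head-domain attention: each head is summarized
# by a descriptor, descriptors attend to each other, and every head
# becomes an affinity-weighted mixture of all scale-aware heads.
import numpy as np

def head_domain_attention(head_feats):
    """head_feats: (n_heads, n_tokens, d) per-head token features.
    Returns features of the same shape after cross-head mixing."""
    n_heads, n_tokens, d = head_feats.shape
    desc = head_feats.mean(axis=1)                       # (n_heads, d) descriptors
    logits = desc @ desc.T / np.sqrt(d)                  # head-to-head affinities
    w = np.exp(logits - logits.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)                    # row-wise softmax
    # mix: head h receives sum_g w[h, g] * head_feats[g]
    return np.einsum('hg,gtd->htd', w, head_feats)
```

The design point the sketch makes visible is that without this step the heads never interact (conventional MHA only concatenates them), whereas here each head's output depends on all receptive-field scales.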
In the abstract, the full method name should be unified as Lightweight Multi-Scale Center Attention Network. Do not let Centralized Attention Network appear again, since it conflicts with the "Center Attention Block" concept behind CAB.
The Fig. 1 lead-in paragraph after the contributions must also change accordingly. After compressing the contributions from five points to three, the later text can no longer use sentences like "not a simple combination of multiscale convolution, local attention, and neighborhood refinement", which reviewers can easily read as module stacking. A more natural transition is suggested:
Fig. 1 illustrates how the proposed components are derived from a single design objective. LMCAN first preserves pixel-level spectra by keeping 1 × 1 tokens, then restores the lost spatial context through scale-aware heads and head-domain interaction, and finally compensates for boundary neighborhoods by center attention. This progressive design distinguishes LMCAN from conventional multi-head attention, where different heads are homogeneous and independent, and from hierarchical Transformers, where multi-scale context is mainly obtained through patch merging and deeper stacking.