附件是根据审稿人的意见做出修改的版本，审稿人主要意见是创新点不足，提出的组件之间缺少关联性。为了说明...

创建于：2026年5月6日

问题

LMCAN_v2.pdf

附件是根据审稿人的意见做出修改的版本，审稿人主要意见是创新点不足，提出的组件之间缺少关联性。为了说明论文提出的是一个完整的框架，补充了图1和图2，但是图1和图2看起来并不能说明问题所在。如何串联创新点，在论文哪些段落需要做出修改，写出修改后的英文段落，摘要和贡献点的总结必须重写。还有哪些部分必须修改

思考

Citing appropriately with line-less references

I need to make sure I cite the right parts—turn2file0 seems like the choice here, and maybe I can cite it multiple times. I won’t use line numbers, though, because they're not available. I could also include turn0file0 in places where it's relevant. The response should be clear but also concise.

Thought for 57s

回答

我看了附件的修订稿。现在稿件已经把“1×1 pixel token、multiscale convolution、Head-Domain Attention、CAB、轻量化复杂度”都写进去了，但主要问题是：创新点仍然像并列堆叠，而不是一个由同一矛盾推导出来的完整框架。目前摘要、引言贡献点、Related Work 末段、Motivation、Methodology 开头、Ablation 都需要一起改；只补 Fig.1/Fig.2 不够。附件中 Fig.1 主要是在比较传统 MHA 与 LMCAN，Fig.2 主要说明 1×1 token 的谱保真优势，Fig.3 才是整体网络结构，Fig.4 是 CAB 机制，因此 Fig.1 和 Fig.2 目前不能独立回答“为什么这些组件必须组合成一个框架”这个审稿意见。(LMCAN_v2.pdf)

一、建议把创新点串成这一条主线

不要把创新写成：

multiscale convolution + head-domain attention + center attention + lightweight

而要写成：

HS–MS fusion requires pixel-level spectral fidelity. Therefore, LMCAN keeps 1×1 pixel tokens instead of patch merging. However, pixel tokens reduce the effective receptive field under window attention. To resolve this spectral-fidelity/spatial-context trade-off, LMCAN first injects multiscale receptive fields before attention and assigns them to different heads. Since these scale-aware heads would otherwise remain independent in conventional MHA, Head-Domain Attention is introduced to model cross-scale head correlations. Finally, because window attention still misses neighborhoods across window boundaries, CAB performs center-based local compensation.

也就是：

1×1 pixel token 保谱 → receptive field 不足 → pre-attention multiscale heads 扩感受野 → HAB 让多尺度 head 发生交互 → CAB 修补窗口边界邻域缺失 → 形成一个面向 HS–MS fusion 的统一 shallow multiscale attention framework。

这个逻辑需要从摘要、引言、Related Work、Motivation、Methodology、Ablation 反复一致地出现。当前稿件已经在引言贡献点后写了“not a simple combination”的段落，但它更像声明，缺少“问题—推导—对应设计—证据”的闭环。(LMCAN_v2.pdf)

二、必须修改的段落位置

位置	必须修改的原因	修改方向
Abstract	现在摘要已经列出模块，但“为什么这些模块相互关联”仍不够强	改成围绕一个核心矛盾：spectral fidelity vs. spatial context
Introduction 最后两段	现在贡献点是模块罗列	改成“一个框架、三个耦合设计”
Contributions	必须重写	每条贡献都要说明它解决上一条设计引出的什么问题
Related Work B 末段	现在有 gap，但语言偏泛，且 “some studies…” 不够尖锐	明确现有方法缺少“pixel-token-preserving + cross-scale-head interaction + boundary compensation”的统一设计
Motivation	当前 Fig.2 后的推导方向正确，但需要更明确地把 Fig.2 与三个模块串起来	把 Motivation 写成设计原则，而不是单独解释 1×1 token 和 CAB
Methodology A 开头	现在说网络有三个 components，容易被审稿人理解为拼装	改成“the architecture implements the above design principle”
MMH / CAB 小节开头	需要明确接口关系：MSC 生成 scale-aware heads，HAB 负责 head-domain communication，CAB 是 MMH 后的 boundary compensation	不要把 CAB 写成独立补丁
Fig.1 / Fig.2 caption 和正文引用	目前图不能充分说明完整框架	Fig.1 改成“problem-to-solution framework”，Fig.2 改成定量支撑谱保真
Ablation Study	当前消融差异很小，审稿人会觉得创新点贡献不充分	需要补 cumulative ablation、cross-scale/head affinity 可视化、boundary-region 指标
Conclusion	现在仍是模块罗列	改成重申统一框架和设计闭环

三、可直接替换的英文段落

1. Rewritten Abstract

可直接替换原摘要。当前摘要虽然提到 scale-specific features、Head-Domain Attention 和 CAB，但还没有把它们明确写成“由同一矛盾驱动的统一框架”。(LMCAN_v2.pdf)

Abstract— Hyperspectral–multispectral (HS–MS) image fusion aims to reconstruct a high-resolution hyperspectral image by combining the rich spectral signatures of low-resolution hyperspectral observations with the fine spatial structures of high-resolution multispectral images. A key challenge in Transformer-based fusion is the trade-off between spectral fidelity and spatial context modeling: patch merging and deep hierarchical designs enlarge the receptive field but may mix pixel-wise spectral signatures and increase computational cost, whereas 1 × 1 pixel tokens preserve spectra but limit contextual modeling within window attention. To address this trade-off, we propose LMCAN, a Lightweight Multi-Scale Centralized Attention Network that formulates HS–MS fusion as a unified pixel-token-preserving multiscale attention framework. Specifically, LMCAN first introduces pre-attention multiscale convolution to generate scale-specific features while maintaining pixel-level spectral representations. These features are explicitly assigned to different attention heads, so that each head models dependencies under a distinct receptive field. To avoid independent processing among scale-aware heads, a Head-Domain Attention module is further developed to learn adaptive head-to-head correlations and enable cross-scale communication within a shallow attention layer. Finally, a Center Attention Block is designed as a boundary-compensation mechanism to recover neighborhood interactions missed by fixed window partitioning. The proposed design avoids hierarchical patch merging, preserves spectral integrity, and achieves multiscale spatial–spectral interaction with low computational complexity. Experiments on CAVE, Harvard, Chikusei, and Pavia datasets demonstrate that LMCAN achieves superior reconstruction accuracy and efficiency compared with representative conventional, CNN-based, and Transformer-based fusion methods.

2. Rewritten Introduction problem paragraph

建议替换 Introduction 中从 “However, most Transformer-based fusion models…” 到 “To overcome these challenges…” 这一段。现在原文已经指出 deep stacking、patch merging、cross-scale interaction 和 window boundary 问题，但它们之间的因果关系还不够强。(LMCAN_v2.pdf)

Existing Transformer-based HS–MS fusion methods usually enlarge the receptive field by deep stacking, shifted-window operations, or hierarchical patch merging. Although these strategies are effective for natural image restoration, they are not fully aligned with the spectral-fidelity requirement of HS–MS fusion. In hyperspectral data, each spatial pixel corresponds to a high-dimensional spectral signature; therefore, merging neighboring pixels into larger tokens may introduce artificial spectral mixing. A straightforward solution is to preserve 1 × 1 pixel tokens, but this choice narrows the effective receptive field of window-based attention and weakens multiscale spatial context modeling. This reveals a fundamental dilemma: HS–MS fusion requires pixel-level spectral preservation and multiscale spatial interaction simultaneously, while conventional hierarchical Transformers usually improve one at the expense of the other. Moreover, standard multi-head attention treats different heads as independent subspaces and fuses them only by concatenation and projection, which is insufficient for explicit cross-scale communication. Fixed window partitioning further causes missing neighborhood interactions near window boundaries, which is harmful to edge and texture reconstruction.

3. Rewritten transition paragraph before contributions

建议放在贡献点前，替换现有 “To overcome these challenges, we propose...” 段落。

To address the above dilemma, we propose LMCAN, a Lightweight Multi-Scale Centralized Attention Network for HS–MS image fusion. The central idea of LMCAN is not to independently add multiscale convolution, head interaction, and local refinement, but to organize them into a single pixel-token-preserving attention framework. First, LMCAN keeps the 1 × 1 token representation to avoid spectral mixing. Second, it injects multiscale receptive fields before attention and assigns scale-specific features to different heads, thereby compensating for the limited receptive field caused by pixel tokens without using patch merging. Third, Head-Domain Attention explicitly models correlations among these scale-aware heads, enabling cross-scale information exchange within a shallow structure. Finally, Center Attention is introduced after window attention to compensate for boundary-deficient neighborhoods. In this way, LMCAN jointly addresses spectral preservation, multiscale context modeling, cross-scale interaction, and boundary reconstruction under a lightweight design.

4. Rewritten Contributions

必须重写。建议不要写成“我们提出模块 A、模块 B、模块 C”，而要写成“我们提出一个统一机制；模块是机制中的不同环节”。当前贡献点虽然已经较清楚，但仍然像四条并列模块。(LMCAN_v2.pdf)

The main contributions of this work are summarized as follows:

A unified pixel-token-preserving multiscale attention framework for HS–MS fusion. We identify the spectral-fidelity/spatial-context trade-off in Transformer-based HS–MS fusion and propose a shallow attention framework that preserves 1 × 1 pixel tokens while introducing multiscale contextual modeling without hierarchical patch merging.
Scale-aware head construction before attention. Instead of splitting homogeneous feature channels as in conventional multi-head attention, LMCAN first extracts scale-specific features using parallel dilated convolutions and then assigns them to different heads. This design makes each head correspond to a distinct receptive field, allowing multiscale spatial–spectral modeling while maintaining pixel-level spectral signatures.
Head-Domain Attention for explicit cross-scale communication. To overcome the independence of conventional attention heads, we design a Head-Domain Attention module that learns adaptive head-to-head affinities among scale-aware heads. This enables information exchange across receptive fields before final feature projection and strengthens the coupling between multiscale feature extraction and attention-based fusion.
Center Attention for boundary-aware neighborhood compensation. We introduce a Center Attention Block that uses each reconstruction pixel as the query center and aggregates information from its spatial neighborhood. This mechanism compensates for the missing local interactions caused by fixed window partitioning and improves edge and boundary reconstruction without sacrificing spectral fidelity.
Comprehensive validation of accuracy–efficiency trade-off. Experiments on CAVE, Harvard, Chikusei, and Pavia datasets, together with complexity analysis, efficiency evaluation, and ablation studies, demonstrate that LMCAN achieves competitive or superior fusion performance with significantly lower model complexity than representative CNN-based and Transformer-based methods.

5. Rewritten Related Work gap paragraph

建议替换 Related Work B 最后一段。当前稿件已经写了 “two issues remain insufficiently addressed”，但还可以更聚焦。(LMCAN_v2.pdf)

Despite the progress of Transformer-based HS–MS fusion, existing methods still lack a unified design that simultaneously satisfies spectral fidelity, multiscale context modeling, and efficient feature interaction. Hierarchical Transformers usually obtain multiscale representations through repeated stacking or patch merging, which increases channel dimensionality and may compromise pixel-wise spectral signatures. Some recent studies introduce convolutional priors, inter-head communication, or local neighborhood attention, but these mechanisms are typically used as independent enhancements rather than as coupled components derived from the specific requirements of HS–MS fusion. In contrast, LMCAN starts from the need to preserve 1 × 1 pixel tokens and then addresses the resulting limitations in a progressive manner: multiscale receptive fields are injected before attention, scale-aware heads are explicitly correlated by Head-Domain Attention, and boundary-deficient neighborhoods are compensated by Center Attention. Therefore, the novelty of LMCAN lies in a unified shallow multiscale attention framework tailored to pixel-level spectral preservation, rather than in a simple combination of existing modules.

6. Rewritten Motivation section opening

建议替换 Motivation 的前半部分。现在 Fig.2 说明 1×1 token 更保谱，但需要把 Fig.2 明确变成整个框架的出发点。(LMCAN_v2.pdf)

The motivation of LMCAN comes from a central requirement of HS–MS fusion: the reconstructed HR-HSI should preserve the spectral signature of each spatial pixel while recovering fine spatial details. Conventional Transformer designs often use K × K patch tokens or patch merging to reduce computational cost and enlarge the receptive field. However, for hyperspectral imagery, a patch may contain pixels with different material compositions and thus different spectral responses. Aggregating such pixels into a single token inevitably mixes their spectral signatures, which may degrade spectral reconstruction. As illustrated in Fig. 2, the 1 × 1 pixel-token strategy better preserves spectral profiles than conventional patch-based tokenization, making it more suitable for HS–MS fusion.

However, preserving 1 × 1 pixel tokens introduces a new difficulty: the receptive field of window-based attention becomes limited, and long-range or multiscale spatial dependencies cannot be sufficiently captured by a shallow Transformer block. Directly adopting hierarchical patch merging would reintroduce spectral mixing and increase the model complexity. Therefore, LMCAN follows a different design principle: multiscale context should be introduced before attention without changing the pixel-token representation. This motivates the proposed multiscale convolution module, which assigns different receptive fields to different channel groups and converts conventional homogeneous attention heads into scale-aware heads.

Once scale-aware heads are constructed, another limitation arises. In standard multi-head attention, different heads are processed independently and interact only after concatenation and output projection. This is insufficient for HS–MS fusion, because features extracted at different receptive fields should communicate adaptively during fusion. We therefore introduce Head-Domain Attention to learn head-to-head correlations among scale-aware heads. Finally, since fixed window partitioning still misses some spatially adjacent pixels at window boundaries, Center Attention is employed as a boundary-compensation mechanism. These three designs are thus not independent additions, but consecutive solutions to the same spectral-fidelity-preserving multiscale modeling problem.

7. Rewritten Methodology A opening

建议替换 Network Architecture 中 “The network primarily consists of three components...” 附近。当前这句话会强化“组件拼装”的印象。(LMCAN_v2.pdf)

The architecture of LMCAN is designed to implement the above pixel-token-preserving multiscale attention principle. Given the upsampled LR-HSI and the HR-MSI, the network first forms a spatially aligned input that contains both complete spectral information and high-resolution spatial structures. Instead of using patch merging or hierarchical token downsampling, LMCAN keeps the spatial resolution unchanged throughout the feature interaction stage. The multiscale convolution module first converts the input into scale-aware pixel-token features. These features are then fed into the Feature Interaction Module, where MMH performs token-domain attention within each scale-aware head and Head-Domain Attention exchanges information across heads. After several MMH blocks, CAB further compensates for neighborhood information missed by fixed window partitioning. Therefore, the three modules play complementary roles within one framework: multiscale convolution constructs scale-aware heads, Head-Domain Attention couples these heads, and CAB restores boundary-local context.

8. Rewritten MMH explanatory paragraph

当前 MMH 公式部分是合理的，但建议在公式前加入下面这段，强调 MSC 和 HAB 的接口关系。Head-Domain Attention 目前已经写了 head-to-head affinity 和区别于 conventional MHA 的公式说明，但需要更早告诉审稿人“它为什么不是普通 head attention 的小改动”。(LMCAN_v2.pdf)

The MMH block is the core component that transforms multiscale convolutional features into an explicit cross-scale attention representation. Different from conventional MHA, where all heads are obtained by linearly splitting the same token sequence into homogeneous subspaces, the heads in MMH are generated from scale-specific features. Therefore, each head has a clear physical meaning: it describes spatial–spectral dependencies under a specific receptive field. Token-domain self-attention is first performed within each head to model dependencies at that scale. Then, Head-Domain Attention is applied across head descriptors to learn how different receptive fields should be adaptively combined. This design enables scale-specific modeling and cross-scale communication within the same shallow attention layer.

9. Rewritten CAB explanatory paragraph

当前 CAB 的定义较完整，但要强调它不是“另一个 attention 模块”，而是 window attention 的边界补偿机制。(LMCAN_v2.pdf)

CAB is introduced as a boundary-compensation step after MMH rather than as an independent refinement module. Although MMH provides multiscale interaction within window attention, fixed window partitioning may still prevent spatially adjacent pixels from attending to each other when they fall into different windows. This problem is particularly harmful in HS–MS fusion, where the reconstruction of a target pixel depends not only on global context but also on its immediate local neighborhood. CAB addresses this limitation by treating each reconstruction pixel as the query center and sampling keys and values from a center-aligned neighborhood. In this way, each pixel can directly aggregate the most relevant surrounding information, including boundary-crossing neighbors, while the 1 × 1 spectral representation remains unchanged.

10. Rewritten Fig.1 / Fig.2 captions and正文引用

Fig.1 建议重画，不建议保留现有结构作为 Fig.1

现在 Fig.1 只是“Conventional MHA vs LMCAN”，但审稿人关心的是“完整框架”和“组件关联”。建议新 Fig.1 画成 Problem-to-solution framework：

上排：
HS–MS fusion requirement → preserve pixel spectra → use 1×1 tokens

中排：
Problem caused by 1×1 tokens → limited receptive field → pre-attention multiscale heads

下排：
Problem caused by independent heads → HAB cross-scale communication → problem caused by window boundaries → CAB boundary compensation

右侧输出：
spectral fidelity + multiscale spatial context + low complexity

新的 caption：

Fig. 1. Design rationale of LMCAN. The proposed framework is derived from the spectral-fidelity requirement of HS–MS fusion. LMCAN first preserves 1 × 1 pixel tokens to avoid spectral mixing, then injects multiscale receptive fields before attention to compensate for the limited context of pixel tokens. Head-Domain Attention explicitly models cross-scale correlations among scale-aware heads, while Center Attention compensates for missing boundary neighborhoods caused by fixed window partitioning. These coupled designs form a unified shallow multiscale attention framework rather than a simple combination of independent modules.

Fig.2 可以保留，但需要加强定量信息

当前 Fig.2 的方向是对的，但图中曲线和说明需要更清晰：建议增加 “Average SAM / mean spectral error for n×n token and 1×1 token” 的数值，最好用 bar plot 或表格嵌入图中。当前 Fig.2 只说明“1×1 更保谱”，但没有直接连接 HAB/CAB。正文引用可以改成：

As shown in Fig. 2, patch-based tokenization introduces larger spectral deviations because each token aggregates multiple spatial pixels with potentially different spectral signatures. This observation motivates the use of 1 × 1 pixel tokens in LMCAN. However, this choice reduces the receptive field of window attention, which further motivates the subsequent multiscale head construction and head-domain interaction. Therefore, Fig. 2 provides the empirical basis for the first design decision of LMCAN, while Fig. 1 summarizes how the remaining modules are derived from this decision.

11. Rewritten Ablation Study section

当前 Table VI 的消融差距很小，例如 full model 49.04 dB，而 Exp.I/II/III 分别是 48.92、48.99、48.86 dB，审稿人可能认为模块贡献不显著。(LMCAN_v2.pdf) 建议增加 cumulative ablation，而不是只做 replacement ablation。

建议新增表格结构：

Variant	1×1 token	MSC	HAB	CAB
Baseline W-MSA	✓	×	×	×
+ MSC	✓	✓	×	×
+ MSC + HAB	✓	✓	✓	×
+ MSC + HAB + CAB	✓	✓	✓	✓

建议替换消融引言为：

To verify that the proposed modules are coupled components of a unified framework rather than independent enhancements, we conduct cumulative and replacement ablation studies on the CAVE dataset. Starting from a pixel-token window-attention baseline, we progressively add the multiscale convolution module, Head-Domain Attention, and Center Attention. This setting directly evaluates the proposed design logic: MSC compensates for the limited receptive field of 1 × 1 tokens, HAB enables communication among the resulting scale-aware heads, and CAB recovers boundary-local neighborhoods missed by fixed window partitioning.

建议替换消融分析为：

The cumulative results show that each component improves the fusion performance in a progressive manner. Adding MSC improves spatial context modeling by assigning different receptive fields to different heads. Introducing HAB further improves spectral–spatial reconstruction because the scale-aware heads are no longer processed independently; instead, their correlations are adaptively learned in the head domain. Finally, adding CAB improves boundary-sensitive reconstruction by allowing each target pixel to aggregate information from its center-aligned neighborhood. These results confirm that the performance gain of LMCAN comes from the coordinated interaction among the three designs, rather than from a simple increase in model capacity.

还建议增加两类补充实验：

Boundary-region evaluation：在边缘 mask 或 gradient mask 上单独计算 PSNR/SAM/ERGAS，证明 CAB 的必要性。
Head-affinity visualization：显示 HAB 学到的 H×H affinity matrix，证明不同 scale heads 确实发生了自适应交互，而不是公式上的“伪创新”。

12. Rewritten Conclusion

建议替换当前 Conclusion。当前结论仍然是“先 MSC，再 MMH，再 CAB”的模块罗列。(LMCAN_v2.pdf)

In this paper, we proposed LMCAN, a lightweight pixel-token-preserving multiscale attention framework for HS–MS image fusion. The proposed method is motivated by the trade-off between spectral fidelity and spatial context modeling in Transformer-based fusion. By preserving 1 × 1 pixel tokens, LMCAN avoids spectral mixing caused by patch merging. To compensate for the limited receptive field of pixel tokens, multiscale receptive fields are introduced before attention and assigned to different heads. Head-Domain Attention further enables adaptive cross-scale communication among these scale-aware heads, while Center Attention compensates for missing local neighborhoods near window boundaries. In this way, LMCAN integrates spectral preservation, multiscale spatial modeling, cross-scale interaction, and boundary refinement into a unified shallow architecture. Extensive experiments on simulated and remote sensing datasets demonstrate that LMCAN achieves a favorable balance between reconstruction accuracy, spectral fidelity, and computational efficiency.

四、还必须修改的其他部分

统一命名：文中同时出现 SL-MMHA、MMH、HAB、Head-Domain Attention、Center Attention、Centralized Attention。建议统一为：
MSC → MMH → HDA/HAB → CAB。标题中的 “Centralized Attention” 与正文 “Center Attention Block” 也要解释清楚，否则审稿人会觉得概念不一致。
明确 H 和 3 个 dilation branches 的关系：MSC 写的是分成三个 heads，dilation = 1, 2, 3；MMH 公式又写一般 H heads。需要说明：
“For simplicity, we use three scale-aware groups in implementation; the formulation can be generalized to H heads.”
检查 Eq.25 的缩放项：Head-domain attention 的 softmax 缩放建议写成 sqrt(d_a)，不要写成容易被误读的 sqrt(H) d_a 或不清楚的形式。
避免 M 的符号冲突：前文 M 表示 window size，CAB 中 M 又表示 number of heads。建议 window size 用 S，head number 用 H_c 或 M_h。
Fig.3 应该前移或改为 Fig.2：现在 Fig.3 才是整体网络图，但审稿意见要求“完整框架”。建议把新的 design rationale 作为 Fig.1，整体架构作为 Fig.2，spectral fidelity 对比作为 Fig.3 或 supplementary。
补充“不是简单组合”的实验证据：仅靠文字说不是简单组合不够。必须用 cumulative ablation、head affinity visualization、boundary mask evaluation 来支撑。
实验结果表述要减少夸张：例如 “state-of-the-art performance across all datasets” 要谨慎，因为 Pavia 中 SSIM 与多个方法同为 0.996，CAVE 中 SAM 与 PSRT 持平。建议写 “competitive or superior”。
语言细节：
“Another way” 应改成 “In addition” 或 “Recent studies also show”。
“Cave dataset” 应统一为 “CAVE dataset”。
“Table 4” 应统一为 “Table IV”。
“Multiscale” 和 “multi-scale” 统一一种写法。
Fig. caption 中 “LMCAN(Ours)” 改成 “LMCAN (Ours)”。
Related Work 需要更有针对性地比较 MSST-Net、PSRT、PanFormer、HyperTransformer 等：不要只列方法，要明确它们没有解决什么：patch merging、head independence、boundary neighborhood、pixel-token spectral preservation。
回复审稿人时要明确写：
“We have revised the Abstract, Introduction, Related Work, Motivation, Methodology, Fig. 1, Fig. 2, and Ablation Study to clarify that LMCAN is a unified framework derived from the spectral-fidelity/spatial-context trade-off, rather than a simple combination of independent components.”

问题

LMCAN_v2+(2).pdf

根据回答的意见，附件是我修改之后的版本，从内容和格式上认真审阅这篇文章，是否满足SCI高水平论文的要求，检查表述逻辑、图表是否合理，实验设置是否完善。每一小节内容是否正确且适当，公式是否正确等问题。

思考

Reviewing calculations and dataset verification

The spatial and channel terms for projection look consistent. The attention term aligns with my earlier calculations, so that seems fine. However, "linear growth with depth" in LMCAN and PM isn’t accurate, as their growth is geometric. Regarding datasets, it’s important to verify the dimensions of the Pavia dataset to avoid inaccuracies. I might check this online for confirmation.

Thought for 2m 12s

回答

结论先说：这版比上一版明显更接近“完整框架”的表达，但还没有达到高水平 SCI/TGRS 类论文的可投稿终稿标准。主要原因不是方法完全不行，而是形式、公式严谨性、实验支撑和图表说服力仍有明显短板。尤其是：文中仍有大量红色修订文字、命名不统一、若干公式/符号不够严谨、Fig.1/Fig.2 仍然不够专业、Table VI 的消融结果不足以支撑“各模块协同有效”的强结论，Table V 的效率结论也需要更谨慎。(LMCAN_v2+(2).pdf)

一、总体评价

目前稿件的核心叙事已经基本形成：

1×1 pixel token 保持光谱 → 导致感受野不足 → MSC 提供多尺度感受野 → HDA 建立多尺度 head 交互 → CAB 补偿窗口边界邻域缺失。

这个逻辑在摘要、引言、贡献点、Related Work gap 和结论中都已经出现，方向是对的。摘要现在明确提出了 spectral fidelity 与 spatial context modeling 的矛盾，并把 LMCAN 描述为 pixel-token-preserving multiscale attention framework，这比上一版有明显提升。(LMCAN_v2+(2).pdf)

但是，高水平 SCI 论文不仅要求逻辑能讲通，还要求图表、公式、实验、语言、排版都能经得起审稿人逐项推敲。目前仍存在以下高风险问题：

红色修订文字没有去掉，正文、公式、结论多处仍为红色。若这是 clean manuscript，绝对不合格；若是 revision marked version，可以保留，但必须另交 clean version。
标题、摘要、正文命名不统一：标题是 “Center Attention Network”，正文仍出现 “Centralized Attention Network”，图中也有 “Centralized”。这会让审稿人认为概念没有统一。
Fig.1 和 Fig.2 仍然不够高水平：Fig.1 太拥挤，图中文字很小，caption 格式不规范；Fig.2 看起来像示意图，而不是严格实验图。
Table VI 消融结果支撑不足：Full model 的 PSNR 最好，但 ERGAS 不是最好；SSIM 全部相同。不能简单宣称所有模块均全面提升。
实验设置仍不够完整：缺少验证集、随机种子、weight decay、batch size、学习率策略、数据增强、重复实验/标准差、不同尺度因子、噪声/错配鲁棒性实验。
公式有若干需要重写或澄清的地方，尤其是退化模型、残差重建公式、符号复用、复杂度统计边界。
语言仍有不少非母语表达和格式问题，例如 “Another way”、Fig. caption 空格、Table 4/IV 混用、Cave/CAVE 混用等。

因此，我的判断是：

目前版本可以作为“较大幅修改后的 revision draft”，但不建议直接作为高水平 SCI clean version 提交。需要再做一次系统性技术审查、公式修正、图表重绘和实验补强。

二、摘要和贡献点是否合格

摘要

摘要现在的逻辑是正确的：先提出 Transformer-based fusion 中 spectral fidelity 和 spatial context 的矛盾，再引出 1×1 token、MSC、HDA、CAB。摘要中还强调了避免 hierarchical patch merging、保持 spectral integrity 和低复杂度，这些都能回应“创新点不足、组件之间缺少关联性”的审稿意见。(LMCAN_v2+(2).pdf)

但还需要修改两点：

第一，摘要中写 “superior reconstruction accuracy and efficiency” 过强。因为 Table V 中 LMCAN 的 FLOPs 和参数量很低，但 latency 不是最低，GPPNN 的 latency 和 FPS 更好；Pavia 中 SSIM 也只是与多个方法并列。建议改成：

achieves competitive or superior reconstruction accuracy with a favorable accuracy–efficiency trade-off.

第二，代码链接放在 Abstract 末尾是否合适要看投稿规范。如果是匿名审稿，GitHub 链接可能暴露作者身份；如果是非匿名，可以放在 footnote 或 Code Availability，而不是 Abstract 主体中。

贡献点

贡献点已经比上一版好很多，尤其是第 1 点把方法定义为 “unified pixel-token-preserving multiscale attention framework”，第 2–4 点也按设计链条组织，而不是简单罗列模块。(LMCAN_v2+(2).pdf)

但第 5 点仍需更谨慎。现在贡献点说 “significantly lower model complexity”，这在参数量和 FLOPs 上成立，但在 latency 上不完全成立。建议改为：

The proposed method achieves a favorable trade-off among reconstruction accuracy, parameter count, FLOPs, memory usage, and inference speed.

三、Introduction 是否合理

Introduction 的主线基本成立。你现在把现有 Transformer 的问题归纳为 patch merging 可能破坏 pixel-wise spectral signature、1×1 token 会限制 window attention 感受野、standard MHA 缺少 head 间交互、fixed window partition 会造成边界邻域缺失，这个逻辑是合理的。(LMCAN_v2+(2).pdf)

但仍有三个问题：

第一，“state-of-the-art results across multiple benchmarks” 应谨慎使用。引言中说 INNformer、Panformer、PSRT 等达到 state-of-the-art，这可以保留，但对自己方法不要频繁使用 “state-of-the-art”。建议统一为 “competitive or superior”。

第二，“Center” 和 “Centralized” 必须统一。标题是 “Center Attention Network”，摘要是 “Center Attention Network”，但正文中仍写 “Lightweight Multi-Scale Centralized Attention Network”。建议全文统一为：

LMCAN: Lightweight Multi-Scale Center Attention Network

如果想保留 Centralized，则标题、摘要、CAB 名称也都要统一，但我建议用 Center Attention，因为 CAB 的机制是 center-aligned neighborhood attention。

第三，Introduction 最后 Fig.1 的引用仍然偏弱。Fig.1 caption 现在是 “Difference from Conventional Multi-Head Attention”，但正文想说明的是完整设计 rationale。建议改为：

Fig. 1 illustrates the design rationale of LMCAN rather than only the difference from conventional MHA.

四、Related Work 是否适当

Related Work 的内容覆盖较广，包含传统方法、CNN、deep unfolding、Transformer-based fusion、卷积与 Transformer 结合等，这从覆盖面上是够的。(LMCAN_v2+(2).pdf)

但作为高水平 SCI，Related Work 还不够“批判性”。现在有些段落仍像文献堆砌，尤其是 Transformer 部分连续列举 PanFormer、PSRT、CSAKD、HyperFormer、DCTransformer、CasFormer、MSST-Net、PSCINN 等，但对每类方法“为什么不能解决本文问题”的分析还不够集中。

建议把 Transformer 相关工作重组为三类：

Patch/hierarchical Transformer methods：强调它们依赖 patch merging 或深层堆叠，可能不利于 pixel-level spectral fidelity。
Cross-modal / cross-attention methods：强调它们关注 HS/MS 模态交互，但没有解决 scale-aware heads 的显式关联。
CNN-enhanced Transformer methods：强调卷积先验有效，但通常没有形成 pixel-token-preserving multiscale head framework。

语言上必须修改：

“Another way, studies have also shown...” 应改为 “In addition, recent studies have shown...”
“fusion.These” 缺少空格。
“Multiscale” 和 “multi-scale” 需要全文统一。
“introduces / proposed / have developed” 时态需要统一。

五、Motivation 是否充分

Motivation 的方向是对的：你用 Fig.2 说明 patch tokenization 可能造成 spectral mixing，进而解释为什么要使用 1×1 pixel token；然后指出 1×1 token 会带来感受野不足，于是引出 MSC、HDA、CAB。(LMCAN_v2+(2).pdf)

但 Fig.2 目前仍然不够强。图中有 patch crop、1×1 crop、谱曲线和 “PSRT / Our” 标注，但它更像示意图，不像严格实验图。正文中说 “quantitative evaluations on the CAVE dataset demonstrate...”，但图里没有清晰给出定量指标，例如 average SAM、MRAE、RMSE 或 spectral error bar。(LMCAN_v2+(2).pdf)

建议 Fig.2 改为三部分：

(a) patch token 与 1×1 token 的示意；
(b) 不同 patch size 下的 average SAM / MRAE / RMSE；
(c) 一个代表像素的 spectral curve comparison。

否则审稿人会质疑：“Fig.2 只是画了一个假设，并没有证明 1×1 token 更保谱。”

另外，Motivation 中 Eq.2–5 的复杂度推导和后文 Section III-D 的复杂度分析有一定重复。建议保留 Motivation 中的直观解释，把详细推导集中放在 Complexity Analysis，避免前后重复。

六、Methodology 是否正确

A. Network Architecture

该小节整体结构是合理的。输入为 upsampled LR-HSI 与 HR-MSI 拼接，经过 multiscale convolution、FIM、重建模块，再用 residual learning 生成 HR-HSI，这条流程清楚。文中也解释了 MSC、MMH/HDA、CAB 的互补关系。(LMCAN_v2+(2).pdf)

但有两个高风险点。

第一，术语不统一。小节开头写 “The proposed Multiscale Multi-Head Network”，但论文名称是 LMCAN。建议统一为：

The proposed LMCAN adopts a single-stream residual learning paradigm...

第二，Eq.14 写成：

IHR-HSI = Xup + ReLU(Conv(Fdeep))

这个公式需要慎重。残差学习中 residual 可能为正也可能为负，如果对 residual 使用 ReLU，会限制网络只能做非负修正，理论上不合理。除非你的代码确实这样实现且有实验验证，否则建议改为：

\hat{Z}=X_{up}+\mathcal{R}(F_{deep})

其中 $\mathcal{R}(\cdot)$ 是 reconstruction head。最后是否 clamp 到有效范围可以单独说明。

B. Multiscale Convolution Module

MSC 小节基本合理，1×1 embedding 后分成三组，分别用 dilation = 1, 2, 3 的 3×3 卷积，再 concat。(LMCAN_v2+(2).pdf)

但需要澄清：

这里实现是 three scale-aware groups，后文 MMH 用的是一般 $H$ heads，CAB 又用 $M$ heads。必须说明实际实现中 $H=3$ 还是 $H$ 可以大于 3。
如果 attention heads 数量大于 3，那么 scale groups 与 attention heads 如何映射？是每个 scale group 对应多个 heads，还是 total heads = 3？
Fig.3 中写 C/3，说明实现似乎是 3 组。那 MMH 公式最好写成 $H_s=3$ scale groups，再说明可推广到 $H$ heads。

建议加一句：

In our implementation, the number of scale-aware groups is set to three, corresponding to dilation rates 1, 2, and 3. When a larger number of attention heads is used, heads sharing the same dilation rate are grouped into the same scale category.

如果没有这种实现，就不要写 “generalized to H heads”。

C. MMH / Head-Domain Attention

MMH 的基本想法合理：先在每个 scale-aware head 内做 token-domain attention，再通过 Head-Domain Attention 做 head-to-head interaction。公式 Eq.19–29 的框架是可以接受的。(LMCAN_v2+(2).pdf)

但仍有几个需要修正或解释的问题：

$F$ 被定义为 normalized feature tokens，但 Eq.28 又把 $F$ 作为 residual 加回去。这里要说明 $F$ 是 LN 前还是 LN 后的输入。标准 Transformer 通常是： $X' = X + \mathrm{Attention}(\mathrm{LN}(X))$ 现在公式写法容易让人误解为把 normalized feature 作为 residual。
$F_H$ 是由 $\hat{F}$ 混合得到，而不是由 value 或 token-attended features 得到。这个设计可以成立，但要解释为什么 head-domain branch 对原始 scale-aware features 做 mixing，而不是对 $F^S_h$ 做 mixing。
$\alpha$ 写成 learnable or fixed scaling factor。高水平论文不能这么模糊，必须明确：实际实验中 $\alpha$ 是 learnable 还是 fixed？初始化是多少？
Conventional MHA 公式 Eq.30 中 $B_h$ 是 relative position bias 吗？如果是，需要说明；如果不是，删除 $B_h$ 。

D. Center Attention Block

CAB 公式总体较完整。它以每个像素为 query center，从 $K \times K$ 邻域采样 key/value，再用多头不同 dilation factor $r_m$ 扩大感受野。这个机制与“boundary compensation”目标基本一致。(LMCAN_v2+(2).pdf)

但有三个问题：

Fig.4 caption 写 “adaptive expansion rate”，而公式里 $r_m$ 看起来是固定 dilation factor。如果不是可学习或输入自适应，就不能叫 adaptive。建议改成 “head-specific dilation rate”。
CAB 被称为边界补偿，但实验没有 boundary-region evaluation。只靠整体 PSNR/SAM 不足以证明 CAB 对边界有效。
符号冲突严重： $H$ $H$ 同时表示图像高度和 head number； $M$ $M$ 在 Motivation 中表示 window size，在 CAB 中表示 head number； $K$ $K$ 既表示 patch size，又表示 neighborhood size。建议统一为：
- image size: $H_0, W_0$
- head number: $N_h$
- window size: $S$
- CAB neighborhood size: $K_c$

七、公式和理论分析是否正确

退化模型 Eq.6–8

这是目前最需要严谨化的公式之一。文中写：

X = ZH_1 + N_{spa}

Y = ZH_2 + N_{spe}

但 $Z$ 是 $H \times W \times C$ 张量， $X$ 是 $h \times w \times C$ ，空间退化不能简单写成右乘 $H_1$ ，除非你明确定义了张量展开或 mode-wise multiplication。(LMCAN_v2+(2).pdf)

建议改成更规范的形式之一：

\mathcal{X} = \mathcal{D}_s(\mathcal{Z}) + \mathcal{N}_{spa}

\mathcal{Y} = \mathcal{Z} \times_3 R + \mathcal{N}_{spe}

其中 $\mathcal{D}_s(\cdot)$ 表示 blur + downsampling， $R \in \mathbb{R}^{C \times c}$ 是 spectral response matrix。

或者使用矩阵展开形式：

X_{(3)} = Z_{(3)} P_s + N_x

Y_{(3)} = R Z_{(3)} + N_y

关键是要保证维度完全一致。

W-MSA 与 patch merging 复杂度 Eq.2–5

推导思路基本可以接受：patch merging 后空间减少、通道增加，projection term 增大。但表述需要更谨慎。你现在说 patch merging 后参数 quartic growth $(\propto n^4)$ ，这是针对单层通道从 $C$ 变成 $n^2C$ 的 projection 参数而言成立；但如果实际 Swin stage 里有 channel reduction 或不同 block 数，不能泛化成所有 hierarchical Transformer 都如此。(LMCAN_v2+(2).pdf)

建议改为：

Under the simplified setting without additional channel reduction, the projection parameters scale as...

这样更严谨。

HDA 公式 Eq.25

当前渲染图中 Eq.25 的缩放项是 $\sqrt{d_a}$ ，这是正确的；之前容易被误读的问题在这版中基本解决。(LMCAN_v2+(2).pdf)

Complexity Analysis Eq.48–55

复杂度分析能说明 lightweight 方向，但仍需补充边界：

当前复杂度只统计 feature interaction layer，未包含 MSC、MLP、reconstruction head。
HAB 的复杂度写为 $2HWC^2 + HWMC$ ，但 HDA 实际是对 head descriptors 做 $H \times H$ attention，不一定需要 $HW C^2$ 级别的 value projection。需要和代码实现严格一致。
CAB 的复杂度中包含 $4HWC^2$ ，这相当于 QKV+O projection；如果 Q/K/V 来自共享投影或轻量投影，公式需要改。
Table V 的低 FLOPs 与 latency 不完全一致，说明复杂度分析不能单独作为效率结论。

八、实验设置是否完善

实验覆盖了 CAVE、Harvard、Chikusei、Pavia 四个数据集，指标包括 PSNR、SAM、ERGAS、SSIM，对比方法包括传统方法、CNN/deep unfolding 方法和 Transformer 方法，整体框架是完整的。数据模拟包括 Gaussian blur、4× bicubic downsampling 和 Nikon D700 spectral response function，这些设置也符合 HS-MS fusion 常见实验范式。(LMCAN_v2+(2).pdf)

但距离高水平 SCI 仍缺少以下内容：

必须补充的实验设置细节

batch size。
optimizer betas、weight decay。
learning rate schedule，是否 cosine decay / step decay。
validation set，如何选择 best checkpoint。
随机种子和重复实验次数。
数据增强策略，如 flip、rotation、crop。
训练 patch 数量，每个数据集具体多少 patches。
是否所有 baseline 都重新训练 2000 epochs，还是使用官方参数。
baseline 的输入通道适配细节，尤其是 pansharpening 方法如何从 PAN 输入改成 HR-MSI 输入。
是否使用相同 degradation setting 和 spectral response function。

建议补充的高水平实验

不同尺度因子实验：至少增加 8×，最好 2×/4×/8×。
噪声鲁棒性实验：LR-HSI 或 HR-MSI 加 Gaussian noise。
光谱响应函数错配实验：测试 spectral response mismatch。
空间错配实验：模拟 HR-MSI 与 LR-HSI misregistration。
真实无 GT 数据实验：如果无法提供真实 GT，可以做 no-reference 或 visual consistency。
per-scene results 或 mean±std：只给平均值不够。
统计显著性：至少给标准差，或者 paired t-test。
boundary-region evaluation：专门证明 CAB 对边界有效。
head-affinity visualization：证明 HDA 学到了跨尺度 head 关系。
spectral profile comparison：用若干像素的光谱曲线证明 spectral fidelity。

九、结果表格是否合理

Tables I–IV

四个主结果表总体是有说服力的。CAVE 中 LMCAN 的 PSNR 49.04、ERGAS 1.18、SSIM 0.996，整体表现较强；Harvard 中 PSNR 45.91、SAM 2.20、ERGAS 1.95，也较好；Chikusei 和 Pavia 也有较好结果。(LMCAN_v2+(2).pdf)

但有几点必须改：

Pavia 的 SSIM 不是唯一最优，多个方法都是 0.996，不能写 “across all criteria”。
CAVE 的 SAM 与 PSRT 并列，不能写 SAM 明显优于所有方法。
Table 4 应改为 Table IV。
params 列建议统一为 “Params (M)”，不要写 “params”。
传统方法没有参数量时用 “–” 比 “/” 更正式。
小数位不统一，例如 Chikusei GLP-HS 为 31.6，其他是两位小数，建议统一两位。
表格中的最佳/次优颜色需要确保在黑白打印中可区分，IEEE 风格通常不依赖颜色，可用 bold/underline。

Table V

Table V 是有价值的，因为它加入了 FLOPs、latency、FPS、memory。LMCAN 参数量 0.157M、FLOPs 12.06G、memory 333.17MB，确实很轻。(LMCAN_v2+(2).pdf)

但结论要谨慎：LMCAN 不是最快的。GPPNN latency 28.81ms、FPS 34.70，而 LMCAN latency 70.25ms、FPS 14.23。你可以说：

LMCAN achieves the lowest parameter count, FLOPs, and memory usage among the compared methods while maintaining the highest reconstruction accuracy; although it is not the fastest model, it provides the best overall accuracy–complexity trade-off.

不要写 “best efficiency” 或 “fastest”。

Table VI

Table VI 是目前最薄弱的实验。Full model 的 PSNR 最好，但 ERGAS 不是最好：Exp.I 和 Exp.II 的 ERGAS 都是 1.17，而 Our 是 1.18。SSIM 四个实验全是 0.996。(LMCAN_v2+(2).pdf)

这意味着不能写：

each component improves the fusion performance in a progressive manner.

应该改成更谨慎：

The full model achieves the best PSNR and SAM, while maintaining comparable ERGAS and SSIM. This suggests that the three modules jointly improve pixel-wise reconstruction and spectral angular consistency, although their effects on ERGAS are relatively marginal.

更重要的是，当前 ablation 不是 cumulative ablation。建议必须增加：

Variant	MSC	HDA	CAB
Baseline W-MSA	×	×	×
+ MSC	✓	×	×
+ MSC + HDA	✓	✓	×
+ MSC + HDA + CAB	✓	✓	✓

否则不能有力证明“组件之间有关联”。

十、图是否合理

Fig.1

Fig.1 的目的正确，但视觉上还不够高水平。它现在比较 Conventional MHA 和 LMCAN，并展示 scale-specific heads、HDA 和 CAB。但问题是：

图中文字过小。
模块太多，信息拥挤。
caption 格式不规范：“Attention:(a)Conventional” 缺少空格。
“LMCAN(Ours)” 应为 “LMCAN (Ours)”。
图中仍出现 “Centralized”，与标题 “Center” 不一致。
它看起来像方法结构图，而不是 design rationale 图。(LMCAN_v2+(2).pdf)

建议重画为 problem-driven framework：

左侧：Spectral fidelity requirement → 1×1 token
中间：Limited receptive field → MSC scale-aware heads
右侧：Independent heads → HDA cross-scale communication
下方：Window boundary missing neighbors → CAB boundary compensation
输出：Spectral fidelity + multiscale context + low complexity

Fig.2

Fig.2 是最需要修改的图之一。它要证明 1×1 token 的必要性，但当前图像过于示意化，且定量信息不清楚。(LMCAN_v2+(2).pdf)

建议：

增加 patch size = 1、2、4、8 的平均 SAM 或 MRAE 柱状图。
明确图中曲线对应哪个像素、哪个场景、哪个 band range。
删除 “PSRT Transformer” 这种容易引起误解的标注。
legend 中不要使用奇怪的中文引号 “——”。

Fig.3

Fig.3 是整体结构图，内容比较完整，比 Fig.1 更像 network architecture。(LMCAN_v2+(2).pdf)

但需要修改：

图中 “4 blocks”“2 blocks” 与正文 $m, n$ 的关系要明确。
“Detail reconstruction module” 中 Conv + ReLU 与 Eq.14 的 ReLU 问题要统一。
caption 建议改成：

Overall architecture of LMCAN for HS–MS image fusion.

Fig.4

Fig.4 说明 CAB 的 Q/K/V、sampling 和 multi-head center attention，基本合理。(LMCAN_v2+(2).pdf)

但 caption 中 “adaptive expansion rate” 不准确。公式里是 $r_m$ ，看起来是 head-specific dilation，不是 adaptive。建议改成：

samples K/V within a center-aligned neighborhood using head-specific dilation rates.

Fig.5–8

可视化结果整体可用，但仍需改进：

residual map 的 colorbar 太小。
不同方法的 residual color scale 是否统一需要说明。
文中提到 red boxes / magnified regions，但图里不够明显。
“Cave dataset” 应统一为 “CAVE dataset”。
对 Chikusei 和 Pavia 的可视化图，伪彩色图偏暗，细节不够清晰。

十一、格式和语言问题

必须修改以下格式问题：

红色修订文字必须在 clean version 中全部去掉。
“Center” 和 “Centralized” 必须统一。
“Multi-scale / multiscale / Multiscale” 全文统一。
“HS-MS” 和 “HS–MS” 统一，建议用 en dash：HS–MS。
“Cave” 统一为 “CAVE”。
“TableI / TableII” 改为 “Table I / Table II”。
“Table 4” 改为 “Table IV”。
“Fig.1” 改为 “Fig. 1”，全文统一。
“LMCAN(Ours)” 改为 “LMCAN (Ours)”。
“Another way” 改为 “In addition”。
“fusion.These” 改为 “fusion. These”。
“self-attention among multiscale features within a single-stage multiscale modeling within each MMH block” 这句话语病严重，应重写。
“superior / state-of-the-art / significantly” 要减少使用，避免过度声明。
IEEE 模板中不要长期保留彩色修订文字，除非是 marked revision。

十二、必须修改清单

按照优先级排序：

必须立即修改

删除 clean version 中所有红色修订文字。
全文统一 “Center Attention Network” 和 “Center Attention Block”，删除 “Centralized”。
重写 Eq.6–8 的退化模型，使空间退化和光谱退化维度严格正确。
修改 Eq.14，避免对 residual 使用 ReLU，除非代码确实如此且有充分解释。
统一符号：图像高宽、head 数、window size、neighborhood size 不要共用 H/M/K。
修改 Table VI 的分析，不能声称所有模块在所有指标上均提升。
增加 cumulative ablation。
修改 Table V 的效率结论，说明 LMCAN 不是最快，而是 accuracy–complexity trade-off 最好。
重画或大幅修改 Fig.1 和 Fig.2。
修正所有 caption 和表格格式问题。

强烈建议补充

Boundary-region evaluation，用来证明 CAB。
Head-affinity visualization，用来证明 HDA。
不同尺度因子实验，例如 2×/4×/8×。
噪声或错配鲁棒性实验。
per-scene results 或 mean±std。
训练细节补全：batch size、weight decay、scheduler、validation、seed、augmentation。
对 Pavia 数据集尺寸、来源和预处理重新核对。
增加 spectral profile comparison 图。

十三、是否满足高水平 SCI 要求

我的判断是：

创新叙事：基本达标，但还需图和实验支撑。
现在已经能讲清“不是简单模块拼接”，这是最大进步。

方法严谨性：未完全达标。
退化模型、符号、残差公式、复杂度边界仍需修正。

实验充分性：中等偏弱。
四个数据集和多个 baseline 是优点，但 ablation、robustness、statistical analysis、boundary/HDA-specific evidence 不足。

图表质量：未达标。
Fig.1/Fig.2 仍不够专业，Table VI 的结论与数据不完全匹配。

语言和格式：未达标。
红色修订文字、命名不统一、格式细节和语病会明显影响审稿印象。

最终建议：

不要直接提交当前版本作为 clean manuscript。建议做一次 major polishing + experiment strengthening。完成上述必须修改项后，论文有机会达到较高水平 SCI 的基本要求；若想冲击 TGRS/Information Fusion 类期刊，必须补充 cumulative ablation、boundary evaluation、head-affinity visualization 和更严格的实验设置。

问题

当前论文版本的摘要和contribution部分需要重新写。摘要部分不要出现破折号和冒号一类AI痕迹明显的表述，contribution总结创新点3点就可以，要复合叙述串联模块的逻辑，要强调是整个一个创新，注意不要AI风格。

思考

Rewriting abstract and contributions in Chinese

I need to rewrite the abstract and contributions in Chinese based on the provided materials. The abstract includes five points from pages 1-2. I'll avoid dashes and colons, as per the user's request. Citations should be included; however, I’m not sure if I need exact line references or just a general citation based on the file provided. I’ll figure that out!

Thought for 13s

回答

下面给出可直接替换的 Abstract 和 Contributions。我尽量去掉了破折号、冒号、过度模板化句式和明显 AI 风格表达，同时把 MSC、HDA、CAB 写成一个连续设计逻辑，而不是三个孤立模块。当前版本摘要和贡献点仍然是模块罗列式表达，且贡献点有 5 条，与你现在的修改目标不一致。(LMCAN_v2+(2).pdf)

Revised Abstract

Abstract
Hyperspectral and multispectral image fusion aims to recover a high resolution hyperspectral image by combining the spectral richness of low resolution hyperspectral observations with the spatial details of high resolution multispectral images. A major difficulty in Transformer fusion lies in preserving pixel wise spectral signatures while still capturing sufficient spatial context. Existing hierarchical designs usually enlarge the receptive field through deep stacking or patch merging, but these operations may mix the spectra of neighboring pixels and increase the computational burden. In this paper, we propose LMCAN, a lightweight multi scale center attention network designed around a unified pixel preserving attention scheme. LMCAN keeps each spatial location as a 1 × 1 pixel token to reduce spectral mixing. Since this choice limits the receptive field of window attention, multi scale convolution is first used to construct scale aware feature groups before attention, allowing different heads to model dependencies under different receptive fields. The resulting heads are no longer treated as independent branches. Instead, a head domain attention mechanism is introduced to learn their correlations and promote information exchange across scales. After window attention, a center attention block further aggregates neighborhood information around each reconstruction pixel, which compensates for local interactions missed by fixed window partitioning and improves boundary detail reconstruction. In this way, spectral preservation, multi scale context modeling, cross scale interaction, and local boundary compensation are integrated into a compact fusion framework. Experiments on the CAVE, Harvard, Chikusei, and Pavia datasets show that LMCAN achieves competitive or superior reconstruction accuracy with a favorable balance among parameter count, computational cost, memory usage, and inference efficiency.

Revised Contributions

The main contributions of this work are summarized as follows.

We propose a lightweight pixel preserving attention framework for hyperspectral and multispectral image fusion. The framework is motivated by the conflict between spectral fidelity and spatial context modeling. Instead of relying on patch merging or deep hierarchical attention, LMCAN maintains the 1 × 1 pixel token representation and introduces multi scale receptive fields before attention, so that spatial context can be enriched without sacrificing pixel wise spectral signatures.
We develop a coupled multi scale attention mechanism in which convolutional scale construction and attention head interaction are jointly considered. Multi scale convolution first assigns different receptive fields to different feature groups, which gives each attention head a clear scale meaning. Head domain attention then learns adaptive correlations among these scale aware heads, allowing information from different receptive fields to be exchanged before final feature fusion. This design turns conventional independent attention heads into an interacting cross scale representation.
We introduce center attention as a local compensation mechanism within the same fusion framework. While the multi scale head interaction strengthens contextual modeling inside window attention, fixed window partitioning can still miss adjacent pixels near window boundaries. Center attention uses each reconstruction pixel as the query center and aggregates information from its surrounding neighborhood, thereby improving boundary and texture recovery while keeping the pixel level spectral representation unchanged. Together with the scale aware head design, this forms a unified lightweight architecture for accurate and efficient hyperspectral image fusion.

需要同步调整的地方

摘要中建议把方法全称统一为 Lightweight Multi Scale Center Attention Network，不要再出现 Centralized Attention Network，否则和 CAB 的 “Center Attention Block” 概念不一致。

贡献点后面原文的 Fig. 1 引导段也要跟着改，因为现在贡献点从 5 条压缩成 3 条后，后文不能再用 “not a simple combination of multiscale convolution, local attention, and neighborhood refinement” 这种容易被审稿人理解为模块拼接的句子。建议改成一句更自然的承接：

Fig. 1 illustrates how the proposed components are derived from a single design objective. LMCAN first preserves pixel level spectra by keeping 1 × 1 tokens, then restores the lost spatial context through scale aware heads and head domain interaction, and finally compensates for boundary neighborhoods by center attention. This progressive design distinguishes LMCAN from conventional multi head attention, where different heads are homogeneous and independent, and from hierarchical Transformers, where multi scale context is mainly obtained through patch merging and deeper stacking.

问题

思考

回答