I am writing a high-level SCI paper, titled: GuideHDR-T: Guided HDR Transfo...

Created: April 20, 2026

Answered using GPT-5.4 Thinking by Chat01

Question

I am writing a high-level SCI paper. The title is: GuideHDR-T: Guided HDR Transformer with Motion Suppression and Implicit Cross-Exposure Calibration
The abstract is: High dynamic range (HDR) imaging reconstructs an HDR radiance map from multi-exposure low dynamic range (LDR) images, but dynamic scenes remain challenging due to the coupled effects of scene motion and severe exposure variation. Existing methods mainly rely on alignment or attention-based fusion, yet their performance degrades when large motion co-occurs with over-exposed or under-exposed regions, where degraded observations yield weak and poorly comparable features across exposures, leading to unreliable correspondence, incomplete detail transfer, and residual ghosting artifacts. To address this issue, we propose a dynamic HDR reconstruction framework based on cross-exposure structure calibration. Specifically, a Motion-aware Pre-modulation Module (MPM) first suppresses motion-induced outliers in non-reference exposures, a Cross-exposure Structure Calibration (CASC) module then enhances feature comparability by injecting exposure-robust structural cues into degraded representations, and a Decoupled Attention Fusion Network (DAFN) finally performs progressive fusion on the calibrated features to recover globally coherent structures and fine local details. Extensive experiments on public benchmarks show that the proposed method consistently outperforms previous approaches in both quantitative accuracy and perceptual quality, especially in challenging scenes with large motion and extreme exposure gaps.
The Introduction is:
High dynamic range (HDR) imaging aims to faithfully reproduce real-world scenes with extreme luminance variations by overcoming the limited dynamic range of conventional imaging sensors. A widely adopted solution is multi-exposure HDR imaging, which reconstructs an HDR radiance map from a sequence of low dynamic range (LDR) images captured at different exposure levels. This paradigm has been extensively studied due to its ability to preserve both highlight and shadow details and has found broad applications in display systems, computational photography, and visual content creation. While impressive results have been achieved in static scenes, extending multi-exposure HDR imaging to dynamic environments remains a fundamental and unresolved challenge, as scene motion and exposure disparities jointly complicate correspondence estimation and information fusion across exposures.

When applied to dynamic scenes, multi-exposure HDR imaging often suffers from severe ghosting artifacts caused by object motion and camera movement during exposure bracketing. Misaligned content across exposures leads to inconsistent observations of the same scene points, resulting in duplicated structures, blurred edges, or missing details in the reconstructed HDR image. To alleviate this issue, a large body of prior work has focused on improving motion handling through explicit alignment, motion estimation, or attention-based correspondence modeling. More recently, Transformer-based architectures have been introduced to dynamic HDR reconstruction, leveraging cross-attention to adaptively aggregate complementary information across exposures without relying on strict pixel-wise alignment. Despite these advances, we observe that when large motion and extreme exposure differences occur simultaneously, cross-attention frequently degenerates into unstable and diffused matching, preventing reliable detail transfer from non-reference exposures and leading to residual ghosting or texture loss. This observation suggests that the effectiveness of attention-based fusion critically depends on certain prerequisites that are not always satisfied in challenging dynamic HDR scenarios.

To further analyze the origin of this failure, we revisit a fundamental prerequisite shared by both alignment-based and attention-based fusion methods, namely the availability of reliable cross-exposure correspondences. Further investigation reveals that the degradation of cross-attention arises from the coupling of two factors. First, dynamic motion introduces cross-frame inconsistencies, causing features from different exposures to be spatially and temporally misaligned, which injects significant noise into correspondence estimation. Second, extreme exposure variation itself leads to representation degradation: in heavily over-exposed or under-exposed regions, saturation and quantization effects suppress structural and semantic cues, making features of the same scene content inherently incomparable across exposures. Notably, these two factors are not independent—motion ambiguity is amplified in poorly exposed regions, while degraded exposure further weakens motion correspondence. As a result, even when approximate spatial alignment is achieved, cross-attention may still fail because corresponding structures cannot be reliably matched in the feature space. This analysis indicates that the core challenge in dynamic HDR reconstruction lies not merely in designing more sophisticated fusion operators, but in first establishing cross-exposure representations that are matchable under both motion and extreme exposure variation.

Motivated by this insight, we reformulate dynamic HDR reconstruction as a two-stage objective: instead of directly fusing multi-exposure observations, we first establish matchable cross-exposure representations that remain stable under motion and severe exposure changes, and then perform reliable information fusion on top of them. To this end, we propose a divide-and-conquer framework consisting of three progressively coupled components. First, a Motion-robust Prior Modulation (MPM) module is introduced to suppress dynamic outliers and reduce the interference of large-scale motion, providing a stable basis for subsequent cross-exposure modeling. Second, as our core contribution, we propose a Cross-exposure Adaptive Structure Calibration (CASC) module, which explicitly enforces structure-consistent calibration across exposures and restores feature comparability in saturated or severely under-exposed regions. By anchoring representations to exposure-invariant structural cues, CASC significantly improves the reliability of cross-exposure correspondence and attention modeling. Finally, based on the calibrated and matchable representations, we design a Decoupled Attention Fusion Network (DAFN) that performs structure-guided and semantics-aware fusion in a staged manner, enabling faithful detail transfer while avoiding unstable propagation. With this design, our method effectively disentangles motion-induced mismatch and exposure-induced representation collapse, leading to robust HDR reconstruction in challenging dynamic scenes.

In summary, our contributions are three-fold:
We provide a principled analysis revealing that the failure of attention-based dynamic HDR methods stems from the breakdown of cross-exposure matchability under the coupled effects of motion and extreme exposure variation.
We propose a novel Cross-exposure Adaptive Structure Calibration module that restores representation comparability across exposures by enforcing structure-consistent calibration, enabling reliable correspondence modeling.

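Based on the above, revise the text around the core content of this paper. It must meet the requirements of a high-level SCI journal: first point out the existing problems, then make specific, targeted revisions.

Thinking

I will first apply the usual standards for high-level SCI papers (whether the problem is sharply positioned, whether the chain of logic closes, whether terminology is consistent, and whether the contributions are verifiable) to identify the core problems in this version of your Introduction, and then give a targeted revision.

Thought for 6s

Answer

Below I first diagnose the problems in the current version, then give targeted revision suggestions against the standards of high-level SCI journals, and finally provide a revised Introduction that can be used directly in the paper. I focus on the core thread of your paper: the coupled problem of motion interference and exposure mismatch in dynamic HDR reconstruction, and the three-stage MPM–CASC–DAFN framework you propose to solve it.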

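I. Main problems in the current version

Judged by the writing standards of high-level SCI papers, especially journals in image processing and computer vision, this version of the Introduction already has a solid technical basis, but several key problems remain and need targeted revision.

1. The core problem is not yet stated in a focused, sharp way

The current introduction mentions ghosting, large motion, and extreme exposure gaps in dynamic scenes, but the problem statement is still mostly a description of phenomena and has not yet risen to the core contradiction at the level of mechanism.

The problems with the current wording are:

The first two paragraphs mostly say that dynamic HDR is hard and that existing methods degrade;
"cross-exposure matchability breakdown" is only really introduced in the third paragraph;
as a result, the first half reads as a restatement of an old problem rather than a precise exposure of a new one.

High-level SCI journals look for more than this: you must not only say that existing methods perform poorly, but also state clearly why, and that "why" must be the theoretical starting point of the method design for the whole paper.

Your core novelty is actually very clear:

The fundamental bottleneck of existing dynamic HDR methods is not only inaccurate alignment or insufficient fusion. Under coupled motion and extreme exposure, cross-exposure features lose their matchability, which causes the subsequent alignment, attention, and information transfer to all fail.

This point needs to be moved earlier and made the main thread of the paper.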

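2. The motivation and the module design are not yet tightly bound

Your current Introduction already states that:

MPM suppresses motion outliers;
CASC restores structural consistency;
DAFN performs staged fusion.

The problem is that the mapping between these three modules and the scientific problem raised earlier is not tight enough, which can leave reviewers with the impression of stacked modules.

High-level SCI journals usually care a great deal about whether:

problem → cause analysis → method design → expected benefit forms a strictly closed loop.

You need the reader to see clearly that:

because motion introduces dynamic outliers, MPM is needed;
because extreme exposure destroys structural comparability, CASC is needed;
because reliable fusion is only justified after outlier removal and matchability calibration are complete, DAFN is needed.

In other words, the three modules are not listed side by side; they unfold progressively around the same core contradiction.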

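3. Inconsistent terminology undermines the professional impression

This is a serious flaw in a high-level paper. Your current text contains the following inconsistencies:

Inconsistent name for MPM

In the abstract it is:

Motion-aware Pre-modulation Module (MPM)

but in the Introduction it is written as:

Motion-robust Prior Modulation (MPM)

This immediately gives reviewers the impression that the paper is not carefully polished.

Inconsistent name for CASC

In the abstract it is:

Cross-exposure Structure Calibration (CASC)

but in the Introduction it is written as:

Cross-exposure Adaptive Structure Calibration (CASC)

This must also be unified. Keep only one version throughout the paper.
Given the overall style of your abstract, I suggest unifying on:

Motion-aware Pre-modulation Module (MPM)
Cross-exposure Structure Calibration (CASC)
Decoupled Attention Fusion Network (DAFN)

This matches the abstract exactly and is the safest choice.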
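A naming mismatch like the two MPM expansions above is easy to miss by eye, so a small script can flag it before submission. The following is a minimal sketch, not part of the original advice: the function names are hypothetical, and the regular expression assumes each long form is written as a run of capitalised words directly before the parenthesised acronym. A capitalised word such as a leading "The" would be absorbed into the long form, so results still need a manual check.

```python
import re
from collections import defaultdict


def find_acronym_expansions(text):
    """Map each acronym to the distinct long forms written as 'Long Form (ACR)'."""
    # Assumption: a long form is a run of capitalised (possibly hyphenated)
    # words immediately followed by the acronym in parentheses.
    pattern = re.compile(r"((?:[A-Z][\w-]*\s+)+)\(([A-Z]{2,})\)")
    expansions = defaultdict(set)
    for long_form, acronym in pattern.findall(text):
        # Collapse internal whitespace so line breaks do not create false variants.
        expansions[acronym].add(" ".join(long_form.split()))
    return expansions


def inconsistent_acronyms(text):
    """Return only the acronyms that appear with more than one long form."""
    return {
        acronym: sorted(forms)
        for acronym, forms in find_acronym_expansions(text).items()
        if len(forms) > 1
    }


if __name__ == "__main__":
    manuscript = (
        "Specifically, a Motion-aware Pre-modulation Module (MPM) first suppresses outliers. "
        "First, a Motion-robust Prior Modulation (MPM) module is introduced. "
        "Finally, a Decoupled Attention Fusion Network (DAFN) performs fusion."
    )
    for acronym, forms in inconsistent_acronyms(manuscript).items():
        print(acronym, "->", forms)
```

Running this over the concatenated abstract and Introduction should surface both the MPM and the CASC mismatch described above; lowercase expansions such as "high dynamic range (HDR)" are deliberately ignored by this pattern.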

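4. The argument for "attention failure" could be more scholarly and more concise

You currently use the phrase "degenerates into unstable and diffused matching". The direction is right, but it describes a symptom rather than a cause. High-level SCI writing favours more mechanistic expressions, such as:

produces ambiguous correspondence distributions
weakens discriminability of cross-exposure features
causes unreliable token association in degraded regions
undermines transferable structural evidence across exposures

These are more scholarly and more convincing than simply saying "unstable and diffused matching".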
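The phrase "ambiguous correspondence distributions" can be made concrete with a toy experiment. The sketch below is an illustration under strong assumptions, not the paper's model: features are random vectors, saturation is imitated by a brightness offset followed by clipping at a ceiling, and the function names and parameter values are hypothetical. It shows only the general mechanism: when keys lose their distinguishing structure, softmax attention spreads toward a uniform distribution and its entropy rises.

```python
import numpy as np


def attention_entropy(query, keys):
    """Shannon entropy (in nats) of softmax(keys @ query / sqrt(d)) over key positions."""
    dim = query.shape[0]
    logits = keys @ query / np.sqrt(dim)
    logits = logits - logits.max()          # stabilise the exponentials
    weights = np.exp(logits)
    weights = weights / weights.sum()
    return float(-(weights * np.log(weights + 1e-12)).sum())


def overexpose(features, offset=20.0, ceiling=1.0):
    """Crude stand-in for saturation: brighten, then clip at the ceiling."""
    return np.minimum(features + offset, ceiling)


rng = np.random.default_rng(0)
n_keys, dim = 64, 32

# Toy "well-exposed" non-reference features; key 10 is the true match of the query.
keys = rng.normal(size=(n_keys, dim)) * 4.0
query = keys[10].copy()

sharp = attention_entropy(query, keys)               # distinctive keys
diffuse = attention_entropy(query, overexpose(keys)) # near-identical keys

if __name__ == "__main__":
    print(f"entropy with intact keys:    {sharp:.3f}")
    print(f"entropy with saturated keys: {diffuse:.3f}")
    print(f"uniform upper bound:         {np.log(n_keys):.3f}")
```

With intact keys the attention mass concentrates on the matching position, while the clipped keys are almost indistinguishable and the entropy approaches the uniform bound log(64). A diagnostic of this kind, computed on real attention maps, is one possible way to support the mechanistic wording with evidence.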

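5. The contributions are incomplete, and the second item is not outcome-oriented enough

Your contributions currently list only two items, and the list stops after the second.
In addition, a contribution statement should not merely say "we propose module X"; it should also stress the capability gained or the problem solved.

For example:

do not just say "we propose CASC";
say "we propose CASC to restore cross-exposure feature comparability in degraded regions, thereby enabling more reliable correspondence estimation and attention-based fusion".

That is: module + mechanism + resulting effect.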

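II. Principles for targeted revision

Based on the problems above, I suggest the following targeted adjustments to the Introduction:

Revision focus 1: move the core scientific problem forward

State explicitly in the second or third paragraph that:

the fundamental bottleneck of existing methods is that cross-exposure representations lose matchability under the coupled effects of motion and exposure degradation.

Readers will then understand much sooner what is new about your paper.

Revision focus 2: stress that the issue is not weak fusion but unmatchable inputs

This is an important part of the theoretical value of your paper.
You should lift the argument from:

"existing fusion is not good enough" to:
"until matchability is established, even sophisticated alignment or attention-based fusion cannot work reliably".

This raises the theoretical level of the paper considerably.

Revision focus 3: write the three modules as a progressive problem-solving chain

I suggest fixing the logic as:

MPM: first suppress the dynamic outliers caused by motion;
CASC: then repair the cross-exposure structural incomparability caused by exposure degradation;
DAFN: finally perform stable fusion on the calibrated representations.

Organised this way, the design reads as problem-driven rather than module-driven.

Revision focus 4: unify the naming and strengthen formality

Use throughout the paper:

Motion-aware Pre-modulation Module (MPM)
Cross-exposure Structure Calibration (CASC)
Decoupled Attention Fusion Network (DAFN)

Revision focus 5: write the contributions as "understanding of the problem + method design + experimental conclusion"

Contributions in a high-level SCI paper ideally map to three layers:

Theoretical understanding: identify the root cause of the failure of existing methods;
Methodological novelty: propose the corresponding solution;
Experimental validation: show that the method is effective and outperforms the state of the art.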

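III. Revised Introduction (recommended version, can be used as a direct replacement)

Below is an Introduction restructured around the core content of your paper. Its style is closer to what high-level SCI journals expect, and its logic is tighter.

Introduction (Revised Version)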

High dynamic range (HDR) imaging aims to recover scene radiance with a significantly wider luminance range than that captured by conventional image sensors. Among various HDR acquisition strategies, multi-exposure HDR imaging has been widely adopted due to its practicality and effectiveness, where several low dynamic range (LDR) images with different exposure levels are fused to reconstruct a single HDR image. This paradigm is able to preserve complementary details in both bright and dark regions and has been extensively applied in computational photography, display systems, and visual media production. However, despite its success in static scenes, reliable HDR reconstruction in dynamic environments remains a long-standing challenge, because scene motion and exposure variation jointly disrupt correspondence estimation and cross-exposure information transfer.

In dynamic scenes, multi-exposure HDR imaging often suffers from severe ghosting artifacts caused by object motion or camera displacement during exposure bracketing. To mitigate this issue, a large body of prior work has focused on explicit alignment, optical-flow-based compensation, deformable aggregation, or attention-based fusion. More recently, Transformer-based methods have shown promising performance by leveraging cross-attention to adaptively aggregate complementary information across exposures without relying on strict pixel-wise registration. Nevertheless, existing methods still degrade significantly in challenging cases where large motion co-occurs with extreme exposure gaps. In such scenarios, over-exposed or under-exposed regions usually contain severely corrupted observations, making the corresponding features weak, ambiguous, and poorly comparable across exposures. As a result, even advanced attention mechanisms often fail to establish reliable cross-exposure associations, leading to incomplete detail transfer, texture inconsistency, and residual ghosting artifacts.

This failure suggests that the core difficulty of dynamic HDR reconstruction lies beyond fusion design itself. We argue that both alignment-based and attention-based methods implicitly rely on a shared prerequisite, namely, the existence of matchable cross-exposure representations. However, this prerequisite is frequently violated in dynamic HDR scenes due to the coupled effects of motion and exposure degradation. On the one hand, scene motion introduces spatial and temporal inconsistencies, causing dynamic outliers and interfering with correspondence estimation. On the other hand, severe over-exposure and under-exposure suppress structural and semantic cues through saturation and quantization, making features of the same scene content inherently less discriminative and less comparable across exposures. More importantly, these two factors reinforce each other: motion ambiguity becomes more severe in poorly exposed regions, while degraded observations further weaken motion-aware correspondence modeling. Consequently, the performance bottleneck of existing dynamic HDR methods is not merely insufficient alignment or inadequate fusion capacity, but the breakdown of cross-exposure matchability under coupled motion and exposure variation.

Motivated by this observation, we reformulate dynamic HDR reconstruction as a progressive process that first establishes reliable and matchable cross-exposure representations and then performs robust information fusion on top of them. Following this principle, we propose GuideHDR-T, a guided HDR Transformer framework with motion suppression and implicit cross-exposure calibration. Specifically, we design a Motion-aware Pre-modulation Module (MPM) to suppress motion-induced outliers in non-reference exposures before cross-exposure interaction, thereby reducing dynamic interference at an early stage. Built upon this motion-suppressed representation, we further introduce a Cross-exposure Structure Calibration (CASC) module, which injects exposure-robust structural guidance into degraded features and restores their comparability across exposures, especially in saturated or severely under-exposed regions. Finally, we develop a Decoupled Attention Fusion Network (DAFN) that performs progressive fusion over the calibrated features to recover globally coherent structures and fine local details while avoiding unreliable propagation from corrupted observations. In this way, our framework explicitly disentangles motion suppression, exposure-aware calibration, and reliable fusion, enabling robust HDR reconstruction in challenging dynamic scenes.

Extensive experiments on public benchmark datasets demonstrate that the proposed method consistently outperforms previous state-of-the-art approaches in both quantitative accuracy and perceptual quality. In particular, our method shows clear advantages in scenes with large motion and extreme exposure discrepancies, where existing methods often suffer from noticeable ghosting, structural distortion, or loss of fine details.

In summary, the main contributions of this work are three-fold:

We provide a principled analysis showing that the failure of existing dynamic HDR methods mainly stems from the breakdown of cross-exposure matchability under the coupled effects of scene motion and severe exposure variation, rather than from fusion design alone.
We propose a novel Cross-exposure Structure Calibration (CASC) module that restores feature comparability across exposures by injecting exposure-robust structural cues into degraded representations, thereby improving the reliability of cross-exposure correspondence modeling.
We develop a unified dynamic HDR reconstruction framework, GuideHDR-T, which integrates motion suppression, implicit cross-exposure calibration, and decoupled attention fusion in a progressive manner, achieving superior reconstruction quality on challenging dynamic HDR benchmarks.

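IV. Improvements of this revision over the original

This revision makes the following targeted improvements:

1. Stronger problem orientation

Instead of saying in general terms that dynamic HDR is difficult, it lifts the problem directly to:

cross-exposure matchability breakdown

This gives your paper a firmer theoretical handle.

2. A tighter methodological loop

The logic is now very clear:

motion interference → MPM
incomparability caused by exposure degradation → CASC
reliable fusion on matchable representations → DAFN

This makes it easier for reviewers to accept the rationale of the module design.

3. Closer to the narrative conventions of high-level SCI journals

High-level journals typically favour this structure:

application background
existing methods and their limitations
analysis of the root cause of failure
new perspective or new problem definition
method overview
summary of contributions

This version is closer to that standard form.

4. No terminology inconsistency

I have unified the names as:

Motion-aware Pre-modulation Module (MPM)
Cross-exposure Structure Calibration (CASC)
Decoupled Attention Fusion Network (DAFN)

I suggest using them consistently throughout the paper, including the abstract, method section, figures, and experiments.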

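V. Further revisions suggested for the current manuscript

Beyond the Introduction, for the paper to meet the requirements of a high-level SCI journal I suggest focusing next on the following:

1. Unify the terminology in the Abstract as well

The module names in the abstract and the Introduction must match exactly.

2. Make the contributions more verifiable

The experimental section should later verify each of the following explicitly:

whether MPM reduces motion artifacts;
whether CASC improves cross-exposure matching quality;
whether DAFN improves the recovery of global structure and local detail.

In other words, the ablation study must correspond one-to-one with the contributions.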
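One way to keep the ablation study and the contribution list in step is to generate the ablation variants from the module list and check that every claim has a variant in which its module is switched off. The sketch below is only a planning aid under stated assumptions: the configuration format, the function names, and the short claim strings are hypothetical and do not come from the manuscript.

```python
MODULES = ("MPM", "CASC", "DAFN")

# Hypothetical one-line summaries of what each module is claimed to improve.
CLAIMS = {
    "MPM": "fewer motion artifacts",
    "CASC": "better cross-exposure matching quality",
    "DAFN": "better global structure and local detail",
}


def leave_one_out_variants(modules=MODULES):
    """Return the full model plus one variant per removed module."""
    variants = {"full": {name: True for name in modules}}
    for removed in modules:
        variants[f"w/o {removed}"] = {name: name != removed for name in modules}
    return variants


def unverified_claims(variants, claims=CLAIMS):
    """List modules with a claim but no variant in which they are disabled."""
    ablated = {
        name
        for config in variants.values()
        for name, enabled in config.items()
        if not enabled
    }
    return sorted(set(claims) - ablated)


if __name__ == "__main__":
    plan = leave_one_out_variants()
    for label, config in plan.items():
        print(label, config)
    print("claims without an ablation:", unverified_claims(plan))
```

If the ablation table in the paper omits a row, passing the reduced plan to `unverified_claims` names the contribution that is left without direct experimental support.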

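3. The method section should strengthen the account of "implicit calibration"

Your title contains:

Implicit Cross-Exposure Calibration

so the method overview needs to explain more explicitly:

why it is implicit;
how it differs in essence from explicit alignment or explicit exposure correction;
how it performs calibration in feature space rather than image space.

This directly affects the consistency between the title and the body.

4. The Introduction can set up points for experimental validation

For example, the last paragraph can carry a phrase such as:

"especially under large motion and extreme exposure gaps". This is already in place, but the section titles and analysis in the experiments must later correspond to it.

If you like, as a next step I can polish the Contribution part separately into a stronger SCI-style version, or go on to revise the Related Work and Method Overview so that the language style of the whole paper stays consistent.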

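Question

Do not confine yourself to the content of each paragraph of the current Introduction. Point out the problems in every paragraph of the updated version and in the logic connecting the paragraphs. Write a new Introduction in the stronger SCI style, and also produce a new English abstract. Summarise the contributions as naturally as possible; "In summary, the main contributions of this work are three-fold:" sounds very abrupt.

Thinking

I will first diagnose, paragraph by paragraph, the specific weaknesses of your current Introduction in problem statement, paragraph focus, transitions between paragraphs, and the lead-in to the method. I will then give you a reorganised Introduction in high-level SCI style, together with a rewritten English abstract that is more concise and natural.

Thought for 8s

Answer

Below I no longer confine myself to the paragraph framework of your existing Introduction. Starting from the overall logic of high-level SCI writing, I first set out systematically the problems within each paragraph of this version and in the transitions between paragraphs, then give a completely rewritten Introduction, and finally attach a new English abstract.
I pay particular attention to the requirements you raised: a sharper problem statement, smoother logic, wording closer to that of high-level journals, and a natural rather than abrupt lead-in to the contributions.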

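I. Paragraph-by-paragraph diagnosis of your current Introduction

Your current Introduction can be viewed as five parts:

HDR background and the challenge of dynamic scenes
Existing methods and the attention failure phenomenon
Root-cause analysis: motion + exposure coupling
Motivation and overview of the three modules
Contributions

Each is analysed below.

Paragraph 1: the background is complete, but the paper's own problem arrives too slowly

The core content of the original paragraph is:

the significance of HDR imaging;
the common multi-exposure HDR paradigm;
good performance in static scenes;
persistent difficulty in dynamic scenes.

Problems within this paragraph

First, the paragraph is written in a standard way but stays at the level of generic background, and does not move quickly enough to the contradiction your paper actually addresses.
For a high-level SCI paper, especially in computer vision or image reconstruction, the first paragraph of the Introduction should ideally do two things at once:

set out the background of the problem;
make concrete, as early as possible, why dynamic HDR is hard, rather than only saying "scene motion and exposure disparities jointly complicate...".

Second, the problem statement in this paragraph is still rather abstract.
For example, the sentence "scene motion and exposure disparities jointly complicate correspondence estimation and information fusion" is not wrong, but it remains a broad summary, and a reviewer reading it cannot yet tell what the entry point of your paper is.
Stronger writing usually makes clear earlier that:

the issue is not simply motion;
nor simply the exposure gap;
but the loss of cross-exposure matchability once the two are coupled.

Transition to the next paragraph

The transition from this paragraph to the second is conventional but does not drive the argument forward.
The current logic is:

Paragraph 1: dynamic HDR is hard;
Paragraph 2: existing methods have made many attempts.

There is nothing wrong with this, but it reads like a background survey rather than problem-driven writing.
A stronger approach would be:

Paragraph 1 states the core failure phenomenon of dynamic HDR directly;
Paragraph 2 then discusses why existing alignment and attention methods still cannot solve it.

In other words, the second paragraph should respond to the problem raised in the first, not simply continue with a literature overview.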

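Paragraph 2: existing methods are covered well, but the critique is not focused enough

The core content of the original paragraph is:

dynamic HDR suffers from ghosting;
prior work used alignment, motion estimation, and attention;
Transformers introduced cross-attention;
failure still occurs under large motion combined with extreme exposure.

Problems within this paragraph

First, the paragraph covers existing methods fairly comprehensively but feels somewhat like a stacked survey.
You mention explicit alignment, motion estimation, attention-based correspondence, and Transformer cross-attention. All of these are correct, but for an Introduction the survey density is slightly high and the focus on the problem is not strong enough.
High-level SCI journals prefer to see that:

you are not listing categories of methods;
you are serving the question of why these methods still fall short.

Second, the description of attention failure is a good start but has not fully risen to mechanism analysis.
You write:

"cross-attention frequently degenerates into unstable and diffused matching"

The direction is right, but a stronger formulation would be:

cross-attention depends on discriminative and comparable features across exposures;
when observations are severely degraded, token association becomes ambiguous and structurally unreliable.

This leads naturally to the "matchability prerequisite" of the next paragraph.

Transition to the next paragraph

The transition between the second and third paragraphs is the most important one in the paper, but it is not yet sharp enough.
The second paragraph says:

existing attention fails in certain scenarios;

and only the third paragraph says:

the root cause is the lack of reliable cross-exposure correspondences.

The problem is that this root cause arrives slightly late.
High-level papers usually end the second paragraph with more tension, for example:

These observations suggest that the limitation may not lie in the fusion operator itself, but in the lack of matchable cross-exposure representations on which fusion implicitly relies.

The third paragraph can then open directly with the theoretical analysis, and the transition becomes much smoother.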

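Paragraph 3: this is the most critical paragraph, but it can be tightened around a single central claim

The core content of the original paragraph is:

revisiting the shared prerequisite of alignment and attention;
motion introduces inconsistency;
exposure degradation causes representation collapse;
the two reinforce each other;
the fundamental challenge is matchable representations.

Problems within this paragraph

This paragraph is in fact the most valuable part of the whole paper, because it carries the redefinition of the problem.
Three details still need attention:

Problem 1: the topic sentence is not strong enough

The core claim of this paragraph should be:

The failure of existing dynamic HDR methods is rooted in the loss of matchability of cross-exposure representations.

At present you reach this conclusion slowly through a long build-up.
A stronger approach states the central claim at the start of the paragraph and then explains the reasons in two points.

Problem 2: the two factors are presented in parallel, and their coupling is not prominent enough

You already mention that:

motion ambiguity is amplified in poorly exposed regions
degraded exposure further weakens motion correspondence

This is good, but it can be strengthened further:

the two problems do not simply add up;
each amplifies the other, so that correspondence estimation and feature fusion collapse simultaneously.

This shows better that your paper addresses a coupled degradation mechanism, not a routine pair of difficulties.

Problem 3: the bridge to the method is not direct enough

The analysis in this paragraph is good, but moving on to the method still requires a more explicit logical turn:

since the essence of the problem is missing matchability, the method should not perform fusion directly but should first build matchable representations.

You say this in the next paragraph, but the end of this paragraph should already prepare the ground.

Transition to the next paragraph

The current transition from the third to the fourth paragraph is better than the earlier ones, but it still reads as analysis first, followed by a separate paragraph about the method.
A stronger approach lets the end of the third paragraph lead naturally to:

This calls for a reconstruction pipeline that explicitly restores cross-exposure matchability before feature fusion.

The fourth paragraph then no longer "introduces the method" but presents the solution as a logical consequence.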

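Paragraph 4: the method overview is essentially complete, but still reads as a list of modules

The core content of the original paragraph is:

reformulate dynamic HDR as a two-stage objective;
MPM suppresses motion outliers;
CASC restores comparability;
DAFN fuses calibrated features.

Problems within this paragraph

First, the overall design is already fairly clear, but the main problem is that the modules are introduced side by side rather than building on one another.

The strongest aspect of your paper is not that it proposes three modules, but that these three modules correspond to three progressively narrowing problems:

first suppress dynamic outliers;
then restore cross-exposure structural comparability;
finally fuse in a matchable space.

High-level writing should therefore avoid, as far as possible, patterns such as:

"First, we do A. Second, we do B. Finally, we do C."

This is engineering-style enumeration. Write instead:

because motion contamination should be suppressed before calibration...
once cross-exposure comparability is restored...
fusion can then be performed on a more reliable basis...

Second, the title uses Implicit Cross-Exposure Calibration, but the current paragraph does not convey "implicit" clearly enough.
If the title is to keep "implicit", the Introduction must let the reader understand that:

this is not an explicit exposure mapping;
nor an explicit radiometric correction;
but calibration carried out in feature space through structural guidance.

Otherwise the title and the body will be slightly out of step.

Transition to the next paragraph

The transition from the fourth paragraph to the contribution paragraph is currently rather stiff.
The preceding text is still a technical narrative, and then this sentence suddenly appears:

In summary, our contributions are three-fold

This is why you yourself find it abrupt.
High-level writing usually adds a natural closing sentence first, for example:

The resulting framework not only addresses the coupled degradation caused by motion and exposure variation, but also leads to consistent gains on challenging dynamic HDR benchmarks.
On this basis, the present work makes the following contributions.

This reads much more naturally.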

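Paragraph 5: the contribution statements are not yet mature and overlap noticeably with the preceding text

The current way of writing the contributions has three main problems:

Problem 1: the lead-in is stiff

As you pointed out,
"In summary, the main contributions of this work are three-fold:"
is indeed too formulaic and tends to look mechanical.

Problem 2: the contributions are not clearly layered

High-level SCI journals usually expect the contributions to correspond to three kinds of value:

Understanding: what problem is discovered or revealed;
Method: what new mechanism is proposed;
System and validation: what the overall framework achieves and what the experiments show.

Your current contributions are mostly "analysis + proposed module" and do not yet spell out the value of the overall system.

Problem 3: the wording is too declarative and not natural enough

A more mature contribution statement should read as a natural conclusion of the preceding text, not as a sudden switch to a conference-style listicle.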

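II. Overall logical problems of your current Introduction

Beyond the paragraph-level issues, the overall logic of this version has three structural weaknesses.

1. The main thread exists, but the main problem appears rather late

Your real line of novelty is:

the essential bottleneck of dynamic HDR is the loss of matchability of cross-exposure representations under coupled motion and exposure.

However, this central claim is only stated clearly in the third paragraph.
For a high-level paper, I suggest making it visible within the first two paragraphs.

2. The loop between existing methods, failure analysis, and method design is not tight enough

The ideal logic is:

why existing methods fail;
what the root cause of the failure is;
what kind of new solution the root cause calls for;
how your framework meets this requirement step by step.

In your current version there is still a sense that a step is skipped between existing methods and the method design.

3. The move from problem analysis to module design does not yet read as a scientific argument

At present it reads more like:

we found a problem;
then we proposed three modules.

A stronger version makes the reader feel that:

these modules were not simply thought up but were derived from the problem analysis.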

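III. Suggested logical framework for the new Introduction

I suggest that your new Introduction use the following five-paragraph logic:

Paragraph 1: background + the core challenge of dynamic HDR

Establish the task background quickly and point out as early as possible that dynamic HDR is not only about motion: motion and exposure disparity together destroy recoverable information.

Paragraph 2: progress of existing methods + why they still fall short

Briefly describe the alignment, attention, and Transformer lines of work, then point out that they all implicitly depend on one prerequisite: cross-exposure information must be comparable and associable.

Paragraph 3: the core problem definition of this paper

State explicitly that the real bottleneck is cross-exposure matchability breakdown.
Analyse its source: the coupling of motion contamination and exposure-induced representation degradation.

Paragraph 4: the idea of the method and the overall framework

Explain that matchability should therefore be restored first and fusion performed afterwards.
Introduce the progressive relationship between MPM, CASC, and DAFN naturally.

Paragraph 5: a natural close + contributions

Do not use an abrupt template sentence; instead lead into the contributions with "Built on this perspective..." or "Viewed from this perspective...".

IV. Fully rewritten Introduction (strong SCI version)

Below is a new Introduction.
It is not a local patch of your original text but a version restructured according to stronger SCI writing practice.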

Introduction

High dynamic range (HDR) imaging seeks to recover scene radiance beyond the limited sensing range of conventional cameras, so that details in both bright highlights and dark shadows can be faithfully preserved. Among existing acquisition strategies, multi-exposure HDR imaging remains one of the most practical and widely adopted solutions, where several low dynamic range (LDR) images captured with different exposure settings are combined to reconstruct a single HDR image. Owing to its effectiveness and low hardware cost, this paradigm has been extensively used in computational photography, image-based content creation, and display-related applications. Yet its success still largely depends on scene staticity. Once the scene becomes dynamic, HDR reconstruction becomes substantially more difficult, because motion and exposure variation simultaneously distort cross-frame observations and undermine the transfer of complementary information across exposures.

A large body of prior work has therefore focused on dynamic HDR reconstruction. Earlier methods mainly addressed the problem through explicit alignment, optical-flow estimation, or motion-aware image registration, with the goal of compensating for inter-frame displacement before fusion. More recently, attention-based and Transformer-based models have shown promising performance by adaptively aggregating informative content from non-reference exposures without relying on strict pixel-wise alignment. Despite these advances, performance still degrades markedly in scenes involving large motion together with extreme exposure gaps. In such cases, non-reference images often contain severely saturated or under-exposed regions, where structural evidence is weak, incomplete, or highly distorted. As a consequence, the information to be transferred across exposures becomes not only spatially inconsistent but also intrinsically hard to compare, causing unreliable correspondence estimation, unstable feature interaction, and persistent ghosting artifacts in the reconstructed HDR image.

This observation suggests that the main bottleneck of dynamic HDR reconstruction may not lie in the fusion operator alone. Whether a method relies on explicit alignment or implicit attention, both paradigms presuppose that corresponding content across exposures remains sufficiently matchable in the feature space. However, this prerequisite is often violated in challenging dynamic scenes. On the one hand, scene motion introduces dynamic outliers and cross-frame inconsistencies, contaminating the estimation of reliable correspondences. On the other hand, severe exposure degradation destroys discriminative structural cues through saturation and quantization, making features from the same scene content poorly comparable across exposures. More importantly, these two effects are tightly coupled rather than independent: motion ambiguity becomes more pronounced in poorly exposed regions, while degraded observations further weaken the reliability of motion-aware association. Under this coupled degradation, attention can no longer establish stable cross-exposure interactions, and alignment alone cannot restore missing comparability. From this perspective, the essential challenge is to construct cross-exposure representations that remain matchable under both motion disturbance and severe exposure variation.

Guided by this view, we approach dynamic HDR reconstruction as a progressive process in which cross-exposure matchability must be established before reliable fusion can take place. Based on this principle, we propose GuideHDR-T, a guided HDR Transformer with motion suppression and implicit cross-exposure calibration. The framework is organized in a coarse-to-fine and problem-driven manner. First, a Motion-aware Pre-modulation Module (MPM) suppresses motion-induced outliers in non-reference exposures, reducing the interference caused by dynamic content before cross-exposure interaction. Building on the motion-suppressed features, we then introduce a Cross-exposure Structure Calibration (CASC) module, which injects exposure-robust structural guidance into degraded representations and implicitly restores their comparability across exposures, especially in saturated and severely under-exposed regions. Once a more reliable feature basis is established, a Decoupled Attention Fusion Network (DAFN) performs staged fusion over the calibrated features to recover globally consistent structures and fine-grained local details while avoiding unstable propagation from corrupted observations. In this way, motion suppression, cross-exposure calibration, and feature fusion are no longer entangled in a single step, but are handled in a coordinated and progressive manner.

Seen from this perspective, the present work makes one conceptual and two methodological contributions. First, we show that the failure of existing dynamic HDR methods in challenging scenes is closely related to the breakdown of cross-exposure matchability under the coupled effects of motion and severe exposure variation. Second, we introduce an implicit structure-guided calibration mechanism that restores feature comparability across exposures and provides a more reliable basis for correspondence modeling and information transfer. Third, by integrating motion suppression, cross-exposure calibration, and decoupled attention fusion into a unified Transformer framework, the proposed method achieves consistently better quantitative and perceptual performance on public dynamic HDR benchmarks, particularly in cases involving large motion and extreme exposure differences.

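V. Transition logic between the paragraphs of the new Introduction

To help you continue revising the rest of the paper, I also spell out how the paragraphs of this new introduction relate to one another.

Paragraph 1 → Paragraph 2

Paragraph 1 states that:

multi-exposure HDR works in static scenes;
in dynamic scenes, motion and exposure variation together cause difficulty.

Paragraph 2 then answers:

given this, how existing work has responded;
and why these methods still fall short.

This is a natural progression from problem statement to existing responses.

Paragraph 2 → Paragraph 3

The end of Paragraph 2 points out that:

the observations are not only inconsistent but also intrinsically hard to compare.

Paragraph 3 then abstracts further:

the real problem is not fusion itself but the failure of its implicit prerequisite;
that is, the breakdown of cross-exposure matchability.

This is a progression from describing the phenomenon to analysing the root cause.

Paragraph 3 → Paragraph 4

Paragraph 3 concludes that:

the core challenge is to build matchable representations first.

Paragraph 4 follows naturally:

we therefore treat dynamic HDR as a progressive process that establishes matchability first and then performs fusion;
MPM, CASC, and DAFN are then introduced.

This is a progression from problem redefinition to method design.

Paragraph 4 → Paragraph 5

Paragraph 4 explains the design logic of the system and its overall advantages.
Paragraph 5 no longer opens abruptly with "In summary..." but uses:

Seen from this perspective...

as the closing move.
The contributions then read not as an appended list but as the natural end point of the preceding argument.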

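VI. New English abstract (closer to high-level SCI style)

Below is a new English abstract.
This version places more emphasis on the following:

the core problem is not simply ghosting but cross-exposure matchability breakdown;
the three modules stand in a progressive relationship;
the language suits an SCI journal abstract better.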

Abstract

High dynamic range (HDR) imaging reconstructs an HDR radiance map from multiple low dynamic range (LDR) images captured at different exposures. Although substantial progress has been made in static scenes, dynamic HDR reconstruction remains challenging because scene motion and severe exposure variation jointly impair cross-exposure correspondence and information transfer. Existing methods mainly rely on alignment or attention-based fusion, yet their performance often deteriorates when large motion co-occurs with over-exposed or under-exposed regions, where degraded observations provide weak and poorly comparable features across exposures. This leads to unreliable matching, incomplete detail transfer, and residual ghosting artifacts.

In this work, we argue that the key bottleneck of dynamic HDR reconstruction lies in the breakdown of cross-exposure matchability under the coupled effects of motion and exposure degradation. To address this issue, we propose GuideHDR-T, a guided HDR Transformer with motion suppression and implicit cross-exposure calibration. Specifically, a Motion-aware Pre-modulation Module (MPM) first suppresses motion-induced outliers in non-reference exposures, providing a more stable basis for subsequent cross-exposure interaction. A Cross-exposure Structure Calibration (CASC) module then injects exposure-robust structural guidance into degraded representations, implicitly restoring feature comparability across exposures. On top of the calibrated features, a Decoupled Attention Fusion Network (DAFN) performs progressive fusion to recover globally consistent structures and fine local details while reducing unreliable propagation from corrupted observations. Extensive experiments on public benchmarks demonstrate that the proposed method consistently outperforms previous state-of-the-art approaches in both quantitative accuracy and perceptual quality, with particularly clear advantages in challenging scenes involving large motion and extreme exposure gaps.

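VII. Main improvements of this abstract over your original

1. A more focused problem definition

Your original abstract was already good, but it mostly said:

Existing methods degrade under large motion and exposure differences.

The new abstract goes a step further and states explicitly:

The fundamental bottleneck is the breakdown of cross-exposure matchability.

This gives the abstract a stronger conceptual hook.

2. Clearer relationships between modules

The three modules now form an explicit progression:

MPM: first suppresses motion outliers;
CASC: then restores feature comparability;
DAFN: finally performs reliable fusion.

This is more persuasive than simply listing the modules.

3. Closer to the conventions of abstracts in high-level journals

Strong abstracts usually follow this sequence: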

task and challenge
limitation of prior work
key insight
proposed method
experimental validation

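This version is organized along exactly those lines.

VIII. Further suggestions

What most deserves strengthening in this paper now is not sentence-level polishing but unifying the whole paper around one central thesis: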

Dynamic HDR reconstruction fails in difficult scenes because cross-exposure representations are no longer reliably matchable under coupled motion and exposure degradation.

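As long as this thread stays consistent across the abstract, introduction, method overview, ablation study, and conclusion, the paper as a whole will read at a noticeably higher level.

You can follow this framework consistently in the rest of your writing:

Abstract: introduce the matchability breakdown
Introduction: analyze its origin and significance
Method: explain how MPM, CASC, and DAFN each address it
Experiments: show that once matchability is restored, ghosting and detail loss improve together
Ablation: verify the necessity of motion suppression, structure calibration, and decoupled fusion individually

The most valuable next step is to rewrite Related Work and the Method Overview along the same thread; otherwise a strong Introduction followed by conventionally written later sections will read as inconsistent.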

Question

v1.pdf

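Following the earlier revisions, the attachment is the current version of my paper. Please review it carefully and specifically from the perspective of a reviewer for a high-level SCI journal, covering whether the exposition is logically coherent, whether the formulas are correct, whether the figures are appropriate, and so on, and give concrete, appropriate, and sufficiently detailed revision suggestions.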

Answer

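Below I review your manuscript by the standards of a reviewer for a high-level SCI journal, as concretely, strictly, and actionably as I can. Overall, the topic is meaningful and the main thread of the method is fairly clear: you have built the three-stage MPM–CASC–DAFN framework around the core claim that, in dynamic HDR reconstruction, cross-exposure representations lose their matchability when large motion and extreme exposure occur together, and the abstract, introduction, and method are broadly consistent with that claim. At the same time, the current manuscript still has many problems that would affect an acceptance decision, in reference completeness, terminology consistency, rigor of the formulas, experimental credibility, figure and table numbering, figure-text correspondence, and the conventions of quantitative analysis; several of them must be fixed. The comments below are based on the version you uploaded. (v1.pdf)

I. Overall assessment

1. Strengths of the paper

The manuscript has three main strengths.

First, the problem awareness is stronger than in a typical HDR reconstruction paper. The introduction no longer stops at the routine observation that dynamic scenes produce ghosting; it elevates the problem to the breakdown of cross-exposure matchability, which gives a firmer conceptual handle than many works that only emphasize alignment or fusion. Both the abstract and the introduction are built around this point, and the overall thread holds. (v1.pdf)

Second, the method has a well-ordered progression. In Section 3, the network first suppresses motion with MPM, then calibrates cross-exposure structure with CASC, and finally performs deep fusion with DAFN. This order is broadly consistent with the introduction's argument of restoring matchability first and fusing reliably afterwards. The overall formulation in Section 3.2 and the division of labor among the modules in Sections 3.3–3.5 reflect the same idea. (v1.pdf)

Third, the experimental results look competitive on the surface. Tables 1 and 2 both show the method outperforming the compared methods on multiple metrics, with the best result on every metric on both the Kalantari and SCTNet/Tel datasets. You also provide visual comparisons and ablation experiments, which shows an awareness of what a complete evaluation requires. (v1.pdf)

2. Main problems affecting acceptance

From a reviewer's standpoint, however, the core problem is not that the method lacks value but that the manuscript is clearly not mature enough, particularly in the following respects:

The references and the Related Work section contain many placeholders such as “?”, “??”, and “?????”. This is a serious flaw that directly damages the paper's credibility. It occurs in many places in Section 2, and there are also unresolved citations in tables, figures, and the text. (v1.pdf)
Terminology and module names are inconsistent. For example, the abstract and introduction use “Cross-exposure Structure Calibration (CASC)”, but the later text, the conclusion, and the ablation analysis use variants such as “Chroma Anchored Structure Calibration” and “Double Attention Fusion Network”. (v1.pdf)
Figure numbers, table numbers, and in-text references are confused. For example, the visual analysis refers to the Kalantari comparison as “Figure 2”, but the actual qualitative figure appears on page 9 as Figure 5; the text for the SCTNet qualitative figure still says “Figure X”; and the caption of Figure 7 on page 11 reads “Fig. 4. Visual ablation analysis…”. Problems like these make a reviewer conclude that the manuscript has not been through basic cleanup. (v1.pdf)
Some formula definitions are not rigorous enough relative to what the modules actually do, and some may be wrong, especially the Fourier phase fusion, the Sigmoid output for HDR, and several incompletely defined symbols. Section 3 is formally complete, but its modeling and exposition need to be noticeably more rigorous. (v1.pdf)
The experiments lack sufficient detail on reproducibility and fairness. For example, the source of the compared methods, whether training settings were unified, the specific HDR-VDP2 configuration, whether the SCTNet/Tel dataset has a public split, and whether the authors' original models or reimplementations were used are all left unclear. (v1.pdf)

By the standards of a high-level SCI review, these problems make the paper read more like **“a promising submission draft that is not yet finalized”** than a manuscript of final submission quality.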

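II. Detailed comments by section

1. Title, abstract, and keywords

Strengths

The title “GuideHDR-T: Guided HDR Transformer with Motion Suppression and Implicit Cross-Exposure Calibration” is broadly consistent with the main thread of the paper and reflects the core of the method, namely motion suppression + cross-exposure calibration + transformer-based fusion. The abstract is also more focused than usual on the problem that the coupling of motion and exposure reduces feature comparability. (v1.pdf)

Main problems and suggestions

(1) “Implicit Cross-Exposure Calibration” is not explained sufficiently in the abstract and method

The title emphasizes implicit calibration, but neither the abstract nor the main text explains what “implicit” is implicit relative to. CASC actually calibrates in feature space through a YCbCr branch, frequency-domain phase injection, and cross-attention, yet a reader cannot immediately tell from the abstract or the method overview:

Why is it not explicit calibration?
How does it differ from explicit exposure mapping, explicit alignment, or explicit radiometric correction?

I suggest adding one layer of explanation to the last method sentence of the abstract, for example: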

…implicitly restores feature comparability in the latent space without requiring explicit exposure transfer or geometric warping.

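Only then does the “implicit” in the title have concrete meaning. (v1.pdf)

(2) The keywords are too generic

The current keywords are “High dynamic range imaging, Deghosting, Multi-exposure fusion, Deep Learning”. They are too broad and do not reflect what is distinctive about the paper. I suggest adding keywords that better reflect your contribution, such as: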

Dynamic HDR reconstruction
Cross-exposure calibration
Motion suppression
Transformer
Structure-guided fusion

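These are better for retrieval and closer to the conventions of high-level journals. (v1.pdf)

2. Introduction

Strengths

The logic of the introduction is stronger than in a typical paper. It moves from the general difficulty of dynamic HDR to the limitations of alignment and attention, and then abstracts the central problem of “cross-exposure matchability”. The four paragraphs on pages 1–2 connect more maturely than in the first draft, and the contributions are introduced more naturally than with template sentences. (v1.pdf)

Main problems and suggestions

(1) The thesis holds, but it is not yet clear why your problem definition is better than existing formulations

You argue that the root cause of existing methods' failure is “cross-exposure matchability breakdown”, which is good. A high-level journal will press further, though:

What explanatory power does this definition add over the accounts of ghosting, misalignment, and exposure inconsistency in existing papers?
Why is “matchability” not merely a restatement of known problems?

I suggest adding a more explicit contrast in the third paragraph of the introduction or at the end of Related Work: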

Existing works typically formulate the challenge as alignment failure or fusion instability, whereas we argue that both are downstream symptoms of a more fundamental issue: the loss of cross-exposure matchability under coupled motion and radiometric degradation.

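Only then does your new perspective count as a conceptual contribution. (v1.pdf)

(2) The contribution paragraph reads naturally but is not yet “verifiable” enough

The contributions are written naturally, but the third one is still mostly a statement of results. I suggest turning it into a more concrete “system-level design + experimental validation” claim:

Not only “integrating motion suppression, cross-exposure calibration, and decoupled attention fusion into a unified Transformer framework”
but also “and demonstrate consistent gains in both physical fidelity and perceptual quality across two public benchmarks and multiple ablation settings”

This closes the loop between the contributions and the experiments more tightly. (v1.pdf)

3. Related Work

This is one of the most problematic sections of the current manuscript.

Serious problem 1: citations are thoroughly incomplete

Section 2 contains many placeholders such as (??Kalantari et al., 2017; ?; ?; ?; ?; ?), (?????), (?), and (??). In SCI review this is a very serious problem: it hampers reading and directly damages the paper's scholarly rigor. Nearly every subsection on pages 2–3 is affected. (v1.pdf)

Required changes:

Check the bib database and every citation command in the text thoroughly;
Replace every ?, ??, and ????? with the correct reference;
Cite at least 3–5 representative papers for each category of methods;
Give full sources for methods discussed by name, such as SAFNet, AFUNet, and SCTNet.

Suggestions

I suggest restructuring Related Work into three more conventional subsections: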

Explicit alignment-based dynamic HDR
Implicit alignment / attention-based dynamic HDR
Alignment-free or fusion-centric dynamic HDR

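In each subsection, present the representative methods first, then their limitations, and position your method once at the end of the section. The current text has this intent, but with missing citations and overlong paragraphs it reads as scattered. (v1.pdf)

Serious problem 2: the section is too long and tends toward survey-style accumulation

Related Work is currently very long. The explicit alignment-fusion and implicit alignment-fusion passages in particular exceed what most SCI papers need. You explain the technical route of each category in detail, but that level of detail does not necessarily help in a related-work section; it can make the section read like a survey and dilute the focus of the paper. That is the impression pages 2–4 give. (v1.pdf)

Suggestions

Condense each category of methods into:

one paragraph on representative methods
one paragraph summarizing limitations

Move more of the analysis of technical mechanisms into the introduction or a discussion section instead of spreading it at length through Related Work.

Important problem 3: the paragraph positioning your method is valuable, but its terminology is inconsistent

The last paragraph of Related Work proposes “Our Paradigm: Implicit Alignment-Calibration-Fusion”. This positioning is good and more sophisticated than simply writing “our method differs from previous works”. The problem is that Section 3.2 then calls the third stage the “Dual-Attention Fusion Network (DAFN)”, while the conclusion and ablation call it the “Double Attention Fusion Network”; CASC also appears under different names. (v1.pdf)

The following must be unified:

Throughout the paper, use Motion-aware Pre-modulation Module (MPM)
Cross-exposure Structure Calibration (CASC)
Decoupled Attention Fusion Network (DAFN)

The following should no longer appear: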

Chroma Anchored Structure Calibration
Double Attention Fusion Network
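Confusing wording in which Dual-Attention Refinement Block does not fully correspond to the definition of DAFN

4. Method

Section 3 is the core of the paper. Its overall structure is already fairly clear, but it needs strengthening in the rigor of its formulas, the definition of its symbols, and the consistency between module descriptions and figures.

4.1 Dual-Domain Input Representation

Strengths

Eqs. (1)–(2) clearly describe the mapping from the LDR images to the linear domain and the construction of the six-channel dual-domain input, and this part is fairly standard. The accompanying text on page 4 also explains why both nonlinear and linear information are kept. (v1.pdf)

Problems and suggestions

(1) The physical notation is not strict enough

Eq. (1) uses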

\breve{Z}_i = \frac{Z_i^\gamma}{t_i}

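This is a common approximation in dynamic HDR, but strictly speaking it is not true “irradiance recovery”; it is an approximate linearization based on a simplified response function and gamma expansion. The current wording, “mapped into the linear domain through gamma expansion”, is reasonably safe, but I suggest stating the assumptions explicitly: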

assumes a fixed gamma camera response approximation,
ignores sensor-specific response nonlinearity and clipping irreversibility.

Otherwise reviewers will question the physical accuracy of this transformation. (v1.pdf)
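
To make the assumptions behind Eq. (1) concrete, here is a minimal Python sketch of the gamma-expansion linearization and a six-channel dual-domain sample. It is an illustration only: the gamma value 2.2, the [0, 1] input range, the channel order (nonlinear RGB first, linearized RGB second), and the names `linearize` and `dual_domain_pixel` are assumptions introduced here, not details taken from the manuscript.

```python
def linearize(z, exposure_time, gamma=2.2):
    """Approximate linear-domain value Z^gamma / t for one normalized LDR sample.

    Assumes a fixed-gamma camera response (gamma=2.2 is an assumed value) and
    ignores sensor-specific nonlinearity; clipped samples stay unrecoverable.
    """
    if not 0.0 <= z <= 1.0:
        raise ValueError("LDR sample must be normalized to [0, 1]")
    if exposure_time <= 0:
        raise ValueError("exposure time must be positive")
    return (z ** gamma) / exposure_time


def dual_domain_pixel(rgb, exposure_time, gamma=2.2):
    """Concatenate a nonlinear RGB pixel with its linearized copy (6 channels)."""
    linear = [linearize(c, exposure_time, gamma) for c in rgb]
    return list(rgb) + linear


# A saturated sample (1.0) always maps to 1/t, whatever the true radiance was.
pixel = dual_domain_pixel((1.0, 0.5, 0.0), exposure_time=2.0)
```

The saturated channel shows why clipping is irreversible under this model: every over-exposed value collapses to the same linear value, which is the limitation the manuscript should acknowledge.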

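(2) Add the unit of exposure time and a note on normalization

You write that $t_i$ denotes exposure time, but you do not state:

whether it is normalized to the reference exposure
whether input values are already normalized to [0,1]
whether black-level/white-level clipping has been applied

I suggest adding a line to the implementation details; otherwise others will find it hard to reproduce the work. (v1.pdf)

4.2 Overall Architecture

Problem 1: the text says DAFN, and the caption and internal module labels of Figure 1 must match it exactly

Figure 1 on pages 4–5 shows the overall structure, and the text calls the third stage DAFN, but you need to make sure that every label inside the figure corresponds one to one to DAFN / DARB / GCA / CGSF. In terms of readability, the text in Figure 1 is small, the colors are pale, and the abbreviated module names are hard to make out, which is very unfriendly to reviewers. (v1.pdf)

Suggestions

Redraw Figure 1 as a vector graphic, with a font size that stays legible when printed in a two-column layout;
Mark the three stages with numbered arrows;
Label “motion suppression”, “structure calibration”, and “fusion” directly in the figure;
Increase the color contrast and avoid light text on a light background.

4.3 Motion-aware Pre-modulation Module (MPM)

Strengths

The logic of Section 3.3 basically holds: shared convolutions extract shallow features, the reference branch produces a deformable anchor mask, and each non-reference branch interacts with the reference through ECA to generate a confidence mask. Overall this matches your problem definition: suppress dynamic outlier responses first so that the subsequent cross-exposure interaction is more stable. (v1.pdf)

Main problems and suggestions

(1) “Elastic cross-attention (ECA)” is not clearly defined

Eq. (8) simply states: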

M_i = ECA(F_i^{rgb}, \tilde F_2^{rgb})

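The manuscript never explains what ECA is, how it differs from standard cross-attention, whether it includes deformable sampling, whether it shares offsets with the DCN, or whether it outputs an attention map or a mask. The text on page 4 only calls it an “elastic cross-attention operator”, which is far from enough. (v1.pdf)

Required additions:

Give a concrete definition or pseudocode for ECA;
At minimum, state where the query, key, and value come from;
Explain in what sense it is “elastic”;
State the size and range of the output $M_i$ and whether it is normalized by a sigmoid.

Otherwise MPM remains a “conceptual module” instead of a reproducible method.

(2) The logic of the reference-branch mask is not yet convincing

In Eqs. (5)–(7), the reference feature passes through an offset convolution, then a deformable convolution, and then a sigmoid to produce a structural anchor mask. The open questions are:

Why does the reference itself need a deformable convolution to obtain a structural anchor?
What is the theoretical basis for predicting the offsets from the reference alone and not from multi-frame relations?
What exactly does this mask suppress: locally unstable texture, saturated regions, or does it purely emphasize structure?

The text only mentions a “more robust geometric structural anchoring capability”, which is too abstract. I suggest adding an explanatory paragraph, ideally giving the intuition with reference to Fig. 2. (v1.pdf)

(3) The name “Pre-modulation” could be tied more precisely to “suppresses unreliable motion responses”

If the essence of this module is to generate motion-aware confidence masks, the wording could stay closer to that function, for example by using these phrases more often when explaining it in the text: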

motion-aware confidence gating
pre-fusion motion suppression

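This would make the module's function easier to understand. The current “pre-modulation” is somewhat generic. (v1.pdf)

4.4 Cross-Exposure Structure Calibration Module (CASC)

This is the most novel part of your method, and also the part most in need of mathematical and conceptual clarification.

Strengths

CASC injects the relatively stable structural cues of the YCbCr branch into the RGB features and introduces a structural prior from frequency-domain phase, which is a distinctive design. The general idea of Fig. 3 (two branches, pooling, PSI, bidirectional cross-attention, FFN aggregation) is clear. (v1.pdf)

Main problems and suggestions

(1) The phase fusion in Eq. (14) is not rigorous and may be mathematically problematic

You define: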

P_i^{mix} = P_i^{rgb} + P_i^{ycbcr}

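This mixed phase is then combined with each branch's own amplitude in the inverse transform. There are at least three problems here:

Phase is a periodic variable, so direct linear addition is not necessarily meaningful;
There is no statement of whether any wrapping or normalization is applied;
If $P^{rgb}$ and $P^{ycbcr}$ both come from complex spectra, does the complex-exponential phase obtained after adding them still have a clear physical or geometric meaning?

Strictly speaking, phase is not an ordinary real-valued feature, and summing it directly may introduce discontinuities and aliasing. (v1.pdf)

Suggestions:

If this is an empirical design, say explicitly that it is an “empirical mixed phase formulation”;
Better, change it to a weighted combination and state that the phase is normalized, for example: $P_i^{mix} = \mathrm{Wrap}(\lambda P_i^{rgb} + (1-\lambda)P_i^{ycbcr})$
Or fuse unit vectors in the complex representation directly instead of adding raw phases.

Otherwise reviewers are very likely to question the validity of the formula.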
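
The two alternatives suggested above can be compared in a short Python sketch on scalar phases. This is an illustration under stated assumptions, not the manuscript's implementation: the weight `lam` (standing in for $\lambda$), its default of 0.5, the wrapping interval $[-\pi, \pi)$, and the function names are all introduced here.

```python
import cmath
import math


def wrap_phase(p):
    """Wrap an angle in radians to the interval [-pi, pi)."""
    return (p + math.pi) % (2.0 * math.pi) - math.pi


def weighted_phase(p_rgb, p_ycbcr, lam=0.5):
    """Weighted average of raw phase values, wrapped afterwards."""
    return wrap_phase(lam * p_rgb + (1.0 - lam) * p_ycbcr)


def unit_vector_phase(p_rgb, p_ycbcr, lam=0.5):
    """Fuse phases as unit complex vectors, then read the angle back.

    The direction is ill-defined when the weighted vectors cancel exactly.
    """
    z = lam * cmath.exp(1j * p_rgb) + (1.0 - lam) * cmath.exp(1j * p_ycbcr)
    return cmath.phase(z)


# Two phases pointing in almost the same direction, on either side of +/-pi.
naive = weighted_phase(3.0, -3.0)
circular = unit_vector_phase(3.0, -3.0)
```

For this pair, averaging the raw angles lands at about 0, which is the opposite direction, while the unit-vector form stays at about ±π. Wrapping alone therefore does not remove the periodicity problem near the boundary, which is one reason the complex-vector formulation may be the safer choice.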

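(2) The frequency-domain transforms do not say what is real-valued and what is complex-valued

Eqs. (12)–(16) write the FFT result as $A e^{jP}$, which is a common form, but are the $\bar F_i^{rgb}$ and $\bar F_i^{ycbcr}$ obtained from the subsequent inverse transform complex-valued, or is the real part taken? The text does not say. In practice the real part is most likely taken. I suggest writing it explicitly: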

\bar F = \mathrm{Re}\left(\mathcal F^{-1}(A e^{jP})\right)

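Otherwise the formulation is not mathematically closed. (v1.pdf)

(3) The sentence “YCbCr branch provides relatively stable luminance-structure cues” needs to be more rigorous

YCbCr does not inherently mean “more stable structure”. If what you actually use as structural guidance is the luminance in the Y channel, I suggest stating directly that: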

the Y component emphasizes luminance structure,
while CbCr provide complementary chromatic cues if used.

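However, Eq. (10) applies the convolution to the whole YCbCr(I_i), which means you are not using only the Y channel. The phrase “luminance-structure cues” is then somewhat imprecise. (v1.pdf)

(4) The explanation of the pooling design is weak

Eq. (11) applies MaxPool to the YCbCr branch and AvgPool to the RGB branch. You give a heuristic explanation, but no experiment shows that this choice is better than the reverse or than using the same pooling for both. I suggest adding to the CASC ablation: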

Avg vs Max vs Mixed
with/without YCbCr branch
with/without PSI

Otherwise readers will see this as empirical stacking. (v1.pdf)

(5) Consider adding a residual connection to the original branch in the final calibrated RGB feature

The current Eq. (19):

\hat F_i^{rgb} = FFN(F_{i}^{y\leftarrow r}) + FFN(F_i^{r\leftarrow y})

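This formulation does not keep a residual path from $\tilde F_i^{rgb}$. Formally, the original RGB representation may therefore be overwritten entirely. A safer and more common formulation is: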

\hat F_i^{rgb} = \tilde F_i^{rgb} + FFN(\cdot) + FFN(\cdot)

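If the implementation does have a residual connection, show it in the formula; if not, consider adding one. (v1.pdf)

4.5 DAFN

Strengths

Section 3.5 breaks DAFN down into DARBs and divides each DARB into two stages, GCA and CGSF. The logic is clear: window-based self-attention aggregates context first, and bidirectional cross-attention between the reference and non-reference features then performs semantic fusion. Fig. 4 broadly supports this structure. (v1.pdf)

Main problems and suggestions

(1) The naming hierarchy of DAFN and DARB should be clearer

Readers may not be able to tell that:

DAFN is the entire third stage;
DARB is the basic block within DAFN;
CGSF and GCA are the submodules of DARB.

I suggest stating explicitly at the start of Section 3.5: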

DAFN is instantiated as a stack of $N$ DARB blocks, each consisting of GCA followed by CGSF.

The text says something similar now, but not cleanly enough. (v1.pdf)

(2) The symbols in Eqs. (24)–(25) are incompletely defined

You write:

F_{2\leftarrow i}^j = CA(Q_{F_2}^{GCA}, K_{F_i}^{GCA}, V_{F_i}^{GCA})

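The symbols $Q_{F_2}^{GCA}$, $K_{F_i}^{GCA}$, and $V_{F_i}^{GCA}$ are never formally defined. I suggest stating whether the query, key, and value are obtained by linear projection or taken directly from the split features. (v1.pdf)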
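
For illustration only, one common way to define these symbols is through learned linear projections followed by scaled dot-product attention. The projection matrices $W_Q$, $W_K$, $W_V$, the per-head dimension $d$, and the names $F_2^{GCA}$ and $F_i^{GCA}$ for the reference and non-reference GCA outputs are assumptions introduced here, not notation confirmed by the manuscript:

Q_{F_2}^{GCA} = F_2^{GCA} W_Q, \quad K_{F_i}^{GCA} = F_i^{GCA} W_K, \quad V_{F_i}^{GCA} = F_i^{GCA} W_V

CA(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V

If the actual implementation differs (for example, shared projections or window-restricted attention), the definition in the paper should say so.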

(3) The concatenation dimension in Eq. (26) is unclear

F^j = MLP(\mathrm{Concat}_{i\in\{1,3\}}(F_{2\leftarrow i}^j, F_{i\leftarrow 2}^j)) + F_{GCA}^j

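Is the concatenation here along the channel dimension or the token dimension? If the reference and non-reference branches are concatenated, how is the size matched to $F^j_{GCA}$ for the residual connection? I suggest making this explicit. (v1.pdf)

(4) Using a Sigmoid in the output layer to reconstruct HDR values needs justification

Eq. (28):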

\hat H = \sigma(\mathrm{Conv}(H_{res}))

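For HDR reconstruction, the Sigmoid restricts the output to $[0,1]$. This is reasonable only if the ground-truth HDR has been normalized or compressed beforehand and both training and testing take place in the normalized space. Otherwise, from the physical standpoint of HDR radiance recovery, the Sigmoid may be too strong a constraint. (v1.pdf)

You must state:

whether the $H$ in the training objective has been normalized to $[0,1]$
if so, how it is normalized
how the true HDR radiance is recovered at test time
if only normalized linear HDR is evaluated, why the task is still called HDR image reconstruction and not normalized HDR estimation

Otherwise reviewers will regard the formula as not rigorous.

4.6 Objective Function

Strengths

Section 3.6 uses $\mu$-law compression together with $L_1$, perceptual, SSIM, and frequency-domain losses, which is common and reasonable for HDR tasks. The overall loss design is reasonably complete. (v1.pdf)

Main problems and suggestions

(1) The feature extractor of the perceptual loss is not specified

Eq. (31) only says that $\phi(\cdot)$ denotes the feature extractor, without stating:

whether VGG-19 or another backbone is used
which layers are extracted
whether the parameters are frozen
whether the input is a single-channel or three-channel tone-mapped image

All of these details must be added. (v1.pdf)

(2) The frequency-domain loss compares only amplitudes; explain why

Eq. (33) compares $|\mathcal F(T(\hat H))| - |\mathcal F(T(H))|$. Since your method emphasizes phase-guided structure injection, why does the loss not compare phase or structure-related spectral information? The current choice is not wrong, but it needs an explanation:

why constraining only the amplitude is enough
whether any phase-related regularization was tried
how the frequency-domain loss relates to PSI

Otherwise the paper gives an inconsistent impression: phase is important in the method yet ignored in the loss. (v1.pdf)

(3) No specific values are given for $\alpha,\beta,\gamma$

Eq. (34) gives the weight symbols, but I could not find the specific hyperparameter settings in the experimental section. This directly affects reproducibility, and the values must be stated in the implementation details. (v1.pdf)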

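5. Figures and Tables

This is another very prominent weakness of the current manuscript.

Serious problem 1: figure numbers and in-text references are confused

Section 5.2 says “Figure 2 demonstrates a challenging scene…”, but the Kalantari qualitative figure actually appears on page 9, labeled Figure 5. (v1.pdf)
The qualitative analysis on the SCTNet/Tel dataset says “as shown in Figure X”, an obviously unfilled placeholder. (v1.pdf)
The caption of Figure 7 on page 11 reads “Fig. 4. Visual ablation analysis of the CASC module”. This is a serious editing error. (v1.pdf)

All of these must be corrected; they are basic formatting errors that should be caught before submission.

Serious problem 2: figure readability is mediocre

The fonts in the network diagrams of Figures 1–4 are small and hard to read in a two-column layout, especially on pages 5–6. (v1.pdf)
The visual comparisons in Figures 5 and 6 have a complete layout, but the ROI zoom boxes and arrows are dense and the captions are long, so readers must work hard to make things out. Your intent is visible on pages 9–10, but the figures are somewhat crowded. (v1.pdf)

Suggestions

Convert the network diagrams to vector graphics;
Keep only the necessary arrows and boxes in each figure;
Make each caption describe what the figure shows instead of packing in the paper's conclusions;
Add a one-line takeaway to each qualitative figure, for example: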
“GuideHDR-T better preserves text contours and suppresses halo artifacts under saturated exposure mismatch.”

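Important problem 3: captions and text repeat each other

The captions of Figures 5 and 6 already explain the red arrows, the blue boxes, and the shortcomings of existing methods, and the text then repeats all of it, which makes the paper verbose. (v1.pdf)

I suggest a division of labor:

captions explain only the elements shown in the figure;
the text provides targeted analysis without repeating every visual observation.

6. Experiments

6.1 Dataset and Implementation Details

Strengths

You give the training and test set sizes of the Kalantari and SCTNet/Tel datasets, the patch size, augmentation, optimizer, learning rate, batch size, number of epochs, and GPU information, which is more complete than in many manuscripts. The overall framework is in place on page 7. (v1.pdf)

Main problems and suggestions

(1) Dataset names are inconsistent

Section 4.1 first writes “SCTNet dataset”, but the title of Table 2 says “Tel dataset”, and Section 5.2 later has “Comparison on the Tel Dataset”. The corresponding reference is Tel et al., 2023. Readers will be left wondering whether it is the SCTNet dataset or the Tel dataset. (v1.pdf)

Suggested unification:

If the dataset was introduced by Tel et al., write “the SCTNet dataset introduced by Tel et al.” at first mention;
Use “SCTNet dataset” everywhere afterwards.

(2) No statement on training/testing fairness

For the compared methods, you must state:

Do the results come from the original papers, from reproduction with public code, or from your own retraining?
If you retrained, were the training settings the same?
If the methods were originally trained on different data, were published results used directly?

Pages 7–8 say nothing about this. High-level journals care a great deal about fairness. (v1.pdf)

(3) The HDR-VDP2 settings need to be stated

HDR-VDP2 is sensitive to display parameters, viewing distance, and similar settings. You only say that the metric is used, without any configuration details. Please add:

the HDR-VDP-2 version used
the display conditions
whether default parameters were used
whether the input is linear HDR radiance

Otherwise the results for this metric are not sufficiently reproducible. (v1.pdf)

(4) Evaluation metric definitions are repeated

Eq. (29) already defines the $\mu$-law transform in the loss, and Section 4.1 defines it again in Eq. (35). Such repetition in a short paper is not a major error, but it is not concise. I suggest defining it once and referring back to it, or placing it in a single preliminaries/metrics subsection. (v1.pdf)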
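
As a reference point for a single shared definition, the following Python sketch implements the $\mu$-law tone mapping in its commonly used form, $T(H) = \log(1 + \mu H) / \log(1 + \mu)$, and a PSNR computed on the tone-mapped values. The value $\mu = 5000$, the assumption that $H$ is normalized to $[0, 1]$, and the function names are assumptions made here; the manuscript's Eqs. (29) and (35) should be checked against them.

```python
import math


def mu_law(h, mu=5000.0):
    """Mu-law tone mapping of a linear HDR value assumed to lie in [0, 1]."""
    return math.log(1.0 + mu * h) / math.log(1.0 + mu)


def psnr(pred, target, peak=1.0):
    """PSNR in dB between two equal-length sequences of values in [0, peak]."""
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    return math.inf if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)


def mu_psnr(pred_hdr, target_hdr, mu=5000.0):
    """PSNR after mu-law tone mapping, computed on flat lists of HDR values."""
    return psnr([mu_law(p, mu) for p in pred_hdr],
                [mu_law(t, mu) for t in target_hdr])


# The same small error in a dark region is penalized more after tone mapping.
linear_score = psnr([0.011, 0.5], [0.010, 0.5])
tonemapped_score = mu_psnr([0.011, 0.5], [0.010, 0.5])
```

Defining `mu_law` once and reusing it for both the loss and the metric mirrors the suggestion above: one definition, referenced wherever it is needed.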

6.2 Quantitative Results

Strengths

Tables 1 and 2 are presented intuitively, cover a full set of metrics, and mark the best results. Both tables on page 8 are visually clear. (v1.pdf)

Main problems and suggestions

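(1) Method names are inconsistent

Table 1 lists “Kalantari (Wu et al., 2018)” and “DeepHDR (Kalantari et al., 2017)”, and the text refers to “Kalantari (Wu et al., 2018) and SCTNet … baseline models”. The naming is confused: Wu et al., 2018 is not “Kalantari”, and DeepHDR is the name commonly given to the method of Wu et al. (2018), so the two citations appear to be swapped. (v1.pdf)

I suggest choosing one convention, method name or author name, and pairing each label with the correct citation:

Kalantari et al. (2017)
DeepHDR (Wu et al., 2018)
AHDRNet (Yan et al., 2019)
...

Do not mix first-author names, method names, and years.

(2) The label for Kalantari et al. (2017) differs between tables

Table 2 lists “DHDRNet (Kalantari et al., 2017)”, whereas Table 1 attaches “DeepHDR” to the same citation. One paper should carry one label in both tables; the mismatch indicates that the tables and text were not proofread carefully. (v1.pdf)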

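(3) Be more cautious with “new SOTA” claims

On pages 7–8 you repeatedly write “successfully setting new SOTA records”. I suggest toning this down, especially given the missing fairness statement, the absence of statistical significance analysis, and the very small gains on some metrics. For example, in Table 1 the $\mu$-PSNR of 44.50 is only 0.01 dB above SCTNet's 44.49, which does not justify such strong language. (v1.pdf)

More measured wording would be: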

achieves the best overall performance
consistently outperforms prior methods across all reported metrics
shows measurable gains, especially on perceptual metrics

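(4) Add parameter counts, FLOPs, and inference speed

Your method introduces complex components such as MPM, CASC, DAFN, FFT/IFFT, and cross-attention, but the experiments include no efficiency comparison at all. High-level journals will usually ask:

How large is the model compared with SCTNet / HDR-Transformer?
How fast is inference?
Do the frequency-domain operations in CASC add noticeable overhead?

I suggest adding an efficiency table. Otherwise, even if accuracy is high, the practicality of the method is hard to assess. (v1.pdf)

6.3 Qualitative Results

Strengths

The cases chosen for Figures 5 and 6 are representative, especially the recovery of text detail, edge halos, and ghosting from human motion, and they correspond to the settings the method targets. (v1.pdf)

Main problems and suggestions

(1) Figure-text mismatches

The Kalantari qualitative analysis refers to “Figure 2”, but the actual figure is Figure 5. The SCTNet/Tel qualitative analysis even still says “Figure X”. These editing errors must be fixed. (v1.pdf)

(2) Some explanations make causal attributions beyond the evidence

For example, you directly attribute a local improvement to a module: “thanks to the precise feature repair of overexposed areas by CASC” or “due to the clean feature foundation provided by MPM and the fine fusion strategy of DAFN”. Without module-level local visualization as evidence, such causal claims are somewhat strong. There are several similar statements on pages 8–9. (v1.pdf)

I suggest rephrasing with: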

likely attributable to
consistent with the design of CASC
suggests that MPM/CASC/DAFN contributes to…

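This is more objective.

(3) Add error maps

At present there are only ROI crops and arrow boxes. The analysis would be more convincing with:

absolute error map
structure error / edge error
attention/confidence visualization of MPM or CASC

Figure 7 already contains some error visualization; consider moving this kind of visualization forward into the main results. (v1.pdf)

7. Ablation Study

Strengths

Table 3 reports ablation results for adding MPM, CASC, and DAFN step by step, which gives initial evidence that each component is effective. Table 4 further analyzes the internals of CASC, which shows a deliberate effort to break down module contributions. (v1.pdf)

Main problems and suggestions

(1) The baseline is not clearly defined

Table 3 says the baseline is a Swin Transformer–based backbone, but the text does not specify what the baseline consists of:

Are the three frames simply concatenated and fed into DAFN?
Is the dual-domain input kept?
Is the patch embedding / DARB structure kept?
How does it relate to SCTNet?

Without a clear baseline, the ablation results are much less convincing. Page 10 does say “We define the Swin Transformer–based backbone as our Baseline.”, but that is far from enough. (v1.pdf)

Suggestion: add a paragraph defining the baseline, or provide a structure diagram in the supplementary material.

(2) Forcing a one-to-one mapping from individual metrics to individual modules overreaches

On page 10 you write that:

l-PSNR directly reflects the capability of MPM
$\mu$-SSIM is best suited for evaluating CASC
HDR-VDP2 validates DAFN

This one-to-one mapping is too strong. In practice all three modules jointly affect multiple metrics, and reviewers may read such wording as post hoc explanation. (v1.pdf)

I suggest more cautious wording: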

l-PSNR is particularly sensitive to radiometric/registration errors and is therefore informative for assessing the effect of MPM
$\mu$ -SSIM is more indicative of structural restoration and is thus helpful for evaluating CASC
HDR-VDP2 complements the above metrics from a perceptual perspective

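Do not present a single metric as belonging exclusively to a single module.

(3) Table 4 has obvious formatting errors

In Table 4 on page 11, the row “w/o Dual-Domain (RGB Only)” contains something like 42.58 0.03 0.9934 0.0001 67.22 0.07. The intent is evidently to show each main value together with an increment, but the formatting is completely broken. The “Full CASC (Ours)” row has a similar problem. (v1.pdf)

Required fixes:

If you want to show increments, put them in parentheses, e.g. 42.58 (+0.03);
Otherwise keep only the main values;
The three metric columns must be aligned so that reviewers do not have to guess what the numbers mean.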
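
A small Python helper can generate such cells consistently. It is a sketch only: it assumes the second number in each pair is a signed change relative to some reference row, which the manuscript should confirm, and the name `format_with_delta` is hypothetical.

```python
def format_with_delta(value, delta, decimals):
    """Render a table cell as 'value (+delta)' with a fixed number of decimals."""
    return f"{value:.{decimals}f} ({delta:+.{decimals}f})"


# Number pairs quoted from the broken Table 4 row; treating the second number
# of each pair as a positive increment is an assumption.
cells = [
    format_with_delta(42.58, 0.03, 2),
    format_with_delta(0.9934, 0.0001, 4),
    format_with_delta(67.22, 0.07, 2),
]
row = " | ".join(cells)
```

Using one formatter for every cell keeps the number of decimals fixed per column and makes the sign of each change explicit, so the columns stay aligned.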

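(4) Figure 7 and the text do not correspond

The caption of Figure 7 on page 11 reads “Fig. 4. Visual ablation analysis…”, and the text also says “The visualization in Fig. 4…”. The figure numbering has evidently not been fully updated. (v1.pdf)

(5) Terminology is inconsistent again in the CASC ablation

Page 11 writes “Chroma Anchored Structure Calibration (CASC)”, which is inconsistent with the earlier “Cross-exposure Structure Calibration (CASC)”. This is a serious terminology problem. (v1.pdf)

8. Conclusion

Problems and suggestions

(1) Terminology is inconsistent yet again

The conclusion again writes:

“Chroma-Anchored Structure Calibration (CASC)”
“backend Dual-Attention Fusion Network (DAFN)”, whereas the earlier text sometimes calls it Decoupled or Double. These must be unified. (v1.pdf)

(2) The conclusion does not summarize limitations

Conclusions in high-level SCI papers usually do more than repeat the contributions; they also briefly point out limitations or future directions. You mention only “lightweighting for real-time applications”, which is weak. (v1.pdf)

I suggest adding more specific limitations: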

reliance on three-frame exposure setting
lack of explicit benchmarking on extreme saturation or handheld burst noise
computational overhead introduced by frequency-domain calibration

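This makes the paper look more mature.

III. Focused comments on formula correctness and mathematical rigor

Here I summarize separately the formula issues in your manuscript that most deserve priority.

1. Adding phases directly in Eq. (14) is not rigorous

This is the point that most warrants caution. Phase is a periodic quantity, and direct linear addition can cause wrapping problems that are hard to interpret. If this is an empirical design, you must say so explicitly and give the implementation details in an appendix. A normalized, weighted, or complex-vector fusion form is preferable. (v1.pdf)

2. Taking the real part after FFT/IFFT is not stated

The outputs of Eqs. (15) and (16) are in theory complex, so you need to state explicitly that the real part is taken. (v1.pdf)

3. ECA is not formalized

The key operator of MPM has no formal definition, which limits the reproducibility of the method. (v1.pdf)

4. Q/K/V are undefined

The cross-attention formulas in DAFN lack the definitions of the projections. (v1.pdf)

5. The physical validity of a Sigmoid HDR output needs explanation

Otherwise readers will question whether you are merely performing normalized HDR mapping. (v1.pdf)

6. Loss weights have no values

The values of $\alpha,\beta,\gamma$ must be given. (v1.pdf)

IV. Focused comments on writing and logical coherence

1. The maturity of the English is uneven across the paper

The language of the abstract and introduction is relatively good, but from Related Work onward there are noticeable stacked long sentences and over-embellishment, for example: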

“fundamental advantages of the GuideHDR-T architecture in complex motion feature extraction and alignment over traditional convolutional networks”
“successfully setting a new SOTA record”
“decisive jump”, “breaks this bottleneck”, “driving force of CASC”

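Such wording comes across as promotional in top-tier or high-level SCI journals. (v1.pdf)

I suggest a more restrained style overall:

Replace “decisive jump” with “measured improvement”
Replace “strongly demonstrates / fully verifies” with “suggests / indicates”
Replace “sets a new SOTA record” with “achieves the best overall results”

2. Paragraph structure needs further tightening

Both Related Work and the qualitative analysis contain overlong single paragraphs, which hurts readability. I suggest:

let each paragraph carry one central point;
keep sentence length moderate;
break the image analysis into a bullet-like prose structure (the final text can still be written as connected paragraphs).

3. Terminology must be unified throughout

This is one of the issues that most undermines the professional impression of the manuscript. Please unify the following:

CASC must have exactly one full name
DAFN must have exactly one full name
Keep only one primary name for the SCTNet dataset / Tel dataset
Cite the Kalantari/Wu/DeepHDR methods in one consistent way
Renumber all figures and tables

V. I suggest proceeding in three tiers: must fix, strongly recommended, and further enhancement

Must fix

Fix all reference placeholders ? / ?? / ?????. (v1.pdf)
Unify module names and terminology throughout. (v1.pdf)
Fix all cross-reference errors in figure and table numbers, including Figure X and Fig. 4/Fig. 7. (v1.pdf)
Correct the formatting errors in Table 4. (v1.pdf)
Add key implementation details such as ECA, Q/K/V, the loss weights, the perceptual feature extractor, and the HDR-VDP2 settings. (v1.pdf)
Justify the Sigmoid HDR output or rewrite the output definition. (v1.pdf)
Clarify the mathematical validity of the phase fusion and correct the wording of Eqs. (14)–(16). (v1.pdf)

Strongly recommended

Condense Related Work and reduce survey-style accumulation. (v1.pdf)
Add an efficiency comparison (parameters/FLOPs/speed). (v1.pdf)
Add finer-grained ablations, especially on the roles of pooling, phase fusion, and the YCbCr branch. (v1.pdf)
Adjust the tone of the experimental claims and avoid promotional wording. (v1.pdf)

Further enhancement

Add visualizations of intermediate features, attention, or confidence to make the modules more interpretable. (v1.pdf)
Discuss limitations and future work more specifically in the conclusion. (v1.pdf)
If space allows, add a discussion paragraph devoted to the relationship between “matchability” and alignment/fusion. (v1.pdf)

VI. Likely recommendation

Judged by strict high-level SCI review standards, I would give a recommendation close to Major Revision, and for the current version even leaning toward Reject but encourage resubmission. This is not because the main thread of the method lacks value, but because the manuscript has too many problems of scholarly convention and completeness, in particular:

unfinished references,
obvious cross-reference errors in figures and tables,
inconsistent module names and terminology,
several key formulas and implementation details that are not rigorous enough.

These problems directly weaken the paper's credibility.
That said, the manuscript has strong potential for revision. If you work through the “must fix” items above one by one and then tighten the experiments and writing further, this work has a good chance of reaching solid submission quality.

The most effective next step is not more piecemeal polishing but a full pass for consistency and conformity across the whole paper, followed by a second round of strengthening focused on the method and experiments.
As a next step I can help with one of two things:
first, turn the review comments directly into a section-by-section revision checklist;
second, continue rewriting the English text of Section 2 (Related Work) and Section 5 (Experiments) page by page.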