先梳理这个论文的核心内容,重点关注数据集制作部分的内容
Created on: December 10, 2025
下面先把论文主线捋清楚,然后把数据集 Geo-CoT380k 的制作流程作为重点单独展开(你关心的“数据集制作”核心都在第3.2节与附录A.4,以及表1/表2)。 (21RSThinker.pdf)
这篇论文要解决的问题是:遥感领域的VLM(视觉-语言模型)在复杂分析任务上容易“看起来会说、但不可信”,因为传统端到端训练让推理步骤变成隐变量,输出缺乏可验证的“证据链”。作者提出 Perceptually-Grounded Geospatial Chain-of-Thought(Geo-CoT):把遥感分析变成一个可核查的多步过程,每一步都要尽可能对应到可定位的视觉证据(如框、坐标、相对位置等),形成“可复查的分析轨迹”。(page 1-3)
方法上做了两件事:
最终模型 RSThinker 能输出 <think>...</think><answer>...</answer>:既给答案也给可核查的推理轨迹,并在多任务基准上显著提升。(page 2、6-9)
论文里把数据集的制作定位为:“不是让大模型自由发挥推理”,而是把“已有真值证据”喂给它,让它做“推理路径的回填/改写(retrofitting)”,从而把推理强行约束在真值之上,减少幻觉(page 5,3.2)。
Geo-CoT380k 的 384,591 条样本来自多个公开遥感数据集,按任务汇总在表1(page 5):
一个关键工程细节:像 DOTAv2 这种大图,作者提到会切成 800×800 patch来构造训练样本。(page 5)
你可以把它理解为:作者没有从零人工写CoT,而是把“遥感经典任务的现成标注”统一变成 “(图像, 问题, 真值/标注) → 结构化推理轨迹”的格式。
作者用 GPT-4V 来生成 CoT,但强调“faithfulness by design”(设计上保证忠实)——关键约束是(page 5):
输入给 GPT-4V 的不是只有图和问题,而是额外提供:
任务不是“让模型推理出答案”,而是:
这相当于把生成模型从“解题者”降级成“解释器/记录员”:它主要负责把标注信息组织成一条像专家那样的检查流程文本。
附录 A.4(page 16-18)把提示词设计说得很清楚:
Two-tier prompting protocol(两层提示):
Base prompt 的几个关键硬约束(page 18):
输出格式上,要求各阶段之间用 \n\n\n 分隔,并整体包成 {"CoT": "..."} 的 JSON(page 19)。这里有个很“数据工程”的点:作者不仅要推理正确,还要格式稳定、可解析,所以对输出格式和段落结构要求非常明确。
A.4 还特别写了:制作数据时会用一些脚手架信息(page 16):
但这些只用于生成阶段来保证推理能“贴着真值走”,并且:
这意味着 Geo-CoT380k 的核心产物是:结构化 CoT 文本(以及答案),而不是把所有中间脚手架都开放出来。
附录 A.4 给了大量示例,基本能反推出每类任务希望 CoT 长什么样(page 19-26):
这些 exemplars 的作用是:让 GPT-4V 不只是“给理由”,而是输出一种可复查的操作日志(先看哪,再确认什么,最后总结)。
表2(page 5)说明 GRPO 的“偏好/强化数据”除了 Geo-CoT380k 的原始实例外,还加了:
注意这里强调的是“rationale-free instances”(没有CoT的原始样本)被用来做 RL 采样与奖励对齐(page 5,3.3),奖励函数按任务指标设计(表3,page 5)。
如果你接下来想更“复现导向”,我也可以把 Geo-CoT380k 的制作流程整理成一套可执行的 pipeline 清单(输入字段、不同任务的aux schema、prompt拼接规则、输出校验与清洗点),直接对照附录A.4逐项落地。
图片是我进一步的改进思路,在原来论文的技术上进行了进一步的改进:
1. 在数据集上,在原来任务的基础上加入了时序类任务(变化检测)、高认知推理任务(实体关系推理、实体关系检测、指令分解)、空间类任务(轨迹规划 / 导航)、以及多模态感知任务(在原论文光学任务的基础上加入红外、SAR 两类任务构成多模态数据集)。
2. 在原来辅助信息部分新加入经纬度信息和成像时间等来辅助分析(经纬度信息可以调用离线库返回具体的区域地点信息)。
3. 改变原来思维链的推理过程:[规划阶段] - [执行阶段 - 模拟工具调用] - [反思与回溯阶段 - 核心] - [综合推理阶段],新增了工具(YOLOv8-OBB、CLIP (ViT-B/32)、GDAL + Shapely、Reverse Geocoder)使用环节以及反思回溯环节,让思维链更加完善。
4. 增加对思维链的评估环节:原论文只对结果进行了评估,并未对思维链进行评估,想通过对思维链评估得到的分数再用于微调模型的反向传播。
下面我将 对你的改进方案进行系统化梳理与结构化总结 ——帮助你将“原论文技术框架” → “你的增强版框架”之间的关系讲清楚,并把每一项改动的技术意义、数据制作流程影响、模型训练影响都明确化。
你给的几张图(任务体系、工具体系、思维链评价指标、辅助信息扩展)我也已经纳入整体分析(这些内容来自你后续设计的图示,而非原论文(21RSThinker.pdf))。
你的改动可以归纳为 四大类增强:
在原论文的 6 大任务基础上,你新增了 4 类更高层次、更贴近真实场景的任务:
| 原论文任务 | 你的新增任务 | 技术价值 |
|---|---|---|
| 视觉任务(检测、计数、VG)、语义任务(分类、Caption、VQA)(21RSThinker.pdf) | 时序变化类(变化检测) | 引入跨时序推理,迫使模型具备“前后对比” 与 “变化原因解释能力” |
| — | 高认知推理:实体关系检测 / 关系推理 / 指令分解 | 从目标几何 → 关系逻辑,提升模型的结构化推理维度 |
| — | 空间认知任务:轨迹规划 / 导航 | 需要空间拓扑推理、可达性推理、路径优化 |
| 原论文只有光学 | 加入 SAR + 红外 → 多模态感知任务 | 让模型具备跨光谱一致性推理能力、提高鲁棒性 |
这些任务直接扩展了 Geo-CoT 数据集的复杂度,使其从「多任务视觉数据集」升级为 “空间认知 + 时序认知 + 高阶推理”的综合大模型训练集。
原论文的数据制作只使用:
你新增的部分使 CoT 推理具备:
作用:
对应你的表格中 “Reverse Geocoder / POI SQLite”工具。
作用:
价值:
💡 总结:你把 Geo-CoT 的“视觉辅助信息”扩展为“地理-时空辅助信息”,这是重大增强,使 CoT 从视觉解释 → 上升到地理认知与因果推理层。
原论文的思维链结构只有:
(详见 Figure 2 & 附录 A.4 prompt 结构(21RSThinker.pdf))
你升级为一个 完整的推理循环(REPL-like):
相当于原论文版本,但可加入:
你设计了 4 类工具:
| 工具 | 用途 |
|---|---|
| YOLOv8-OBB | 精确计数、旋转框定位 |
| CLIP (ViT-B/32) | 高层语义与类比推理 |
| GDAL + Shapely | 距离/面积/方位/地理几何 |
| Reverse Geocoder | 从经纬度获得地名、区域标签 |
这使得 CoT 能产生:
你的评价表中 B3(自我反思机制)正是对应这个阶段。
价值:
原论文只有简单的“汇总回答”,你扩展为:
这是一个完整的推理闭环,远超原论文的线性流程。
原论文中:
你提出要对 思维链本身评分, 即:
包括:
这些都是 原论文没有的评估机制。
你提出要将其作为:
用于反向更新模型,使模型不仅“答对”,还要“推得对”。
这本质上是:
这是极具价值的创新点!
| 维度 | 原论文 RSThinker | 你的增强版 |
|---|---|---|
| 任务 | 6 类视觉任务 | + 时序任务、空间导航、关系推理、多模态感知 |
| 辅助信息 | 仅视觉结构信息 | + GPS、时间、DEM、土地利用、POI |
| 工具 | 无工具,仅视觉推理 | + 检测、语义、几何、地理工具 |
| 思维链结构 | 线性 3 段 | 规划 → 工具推理 → 反思回溯 → 综合分析 |
| 思维链评估 | 无 | 建立完整 rubric,并用于反向更新 |
| 强化学习目标 | 仅任务结果 | 任务结果 + 思维链质量 |
你本质上把 RSThinker(视觉推理模型)升级为一个 Remote-Sensing Agent(遥感智能体)。
只需要告诉我下一步你想要:
A. 继续完善论文改进方案?
B. 生成方法部分(Method)?
C. 生成数据集部分(Dataset)?
D. 生成总框架图和流程图?
E. 生成可直接用于训练的 Prompt?
你希望我从哪里开始?
1. 这是我原来设计的思维链评估标准,我觉得它太笼统了(评分标准 0-10、20-30 这样设计有很大的不确定性,误差比较大)。现在想把这些标准大体分为两类:第一类可量化的指标直接用公式计算(比如对象识别准确度和空间定位精度,不要局限于我举的例子);第二类不可量化的指标,给分标准再具体一些,可以列成多条,思维链每满足一点便多得一点分数,这样每一分都有依据。最后对第一类可量化和第二类不可量化指标给出权重,得到总的思维链评估得分。
2. 在完成思维链评估的基础上,怎么进一步融合对结果的评估,将结果和思维链两个部分的得分融合在一起得到最终总分?
我先直接给你一个**“可落地版评分体系”**,把你原来的 A/B/C 三大维度拆成:
所有可量化指标都统一到 $[0,1]$ 区间,方便后面加权融合。
适用:检测、计数、VG 这类有目标真值的任务。
设思维链中显式提到的目标集合为 $P$,真值目标集合为 $G$。
计算:$\text{Precision}=\frac{|P\cap G|}{|P|},\quad \text{Recall}=\frac{|P\cap G|}{|G|},\quad S_{obj}=F1=\frac{2PR}{P+R}$
实现上,用解析 CoT 或用“工具日志”里的目标列表来对齐 GT。
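这个“CoT 点名目标 ↔ GT 对齐”的计算可以用一小段 Python 示意(贪心匹配、类别一致 + IoU ≥ 0.5 判为 TP,阈值与函数名 grounding_f1 均为示例假设):

```python
def iou(a, b):
    # 轴对齐框 [x1, y1, x2, y2] 的交并比
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter
    return inter / union if union > 0 else 0.0

def grounding_f1(cot_objs, gt_objs, iou_thr=0.5):
    """cot_objs / gt_objs: [(class_name, [x1,y1,x2,y2]), ...]
    贪心匹配:类别相同且 IoU >= 阈值 记为一个 TP。"""
    used, tp = set(), 0
    for c_cls, c_box in cot_objs:
        for j, (g_cls, g_box) in enumerate(gt_objs):
            if j in used or c_cls != g_cls:
                continue
            if iou(c_box, g_box) >= iou_thr:
                used.add(j)
                tp += 1
                break
    p = tp / len(cot_objs) if cot_objs else 0.0
    r = tp / len(gt_objs) if gt_objs else 0.0
    return 2 * p * r / (p + r) if p + r > 0 else 0.0
```

实际使用时,cot_objs 可来自解析 CoT 文本,也可直接取工具日志里的检测列表。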
适用:VG / 检测 / 需要框或坐标的 VQA。
如果是点坐标而不是框,可以用归一化距离:$S = 1 - \min\!\left(1,\ \frac{\lVert p_{pred} - p_{gt} \rVert}{D}\right)$,其中 $D$ 为归一化因子(如图像对角线长度)。
利用你设计的 YOLOv8-OBB / CLIP / GDAL+Shapely / Reverse Geocoder 的调用日志来打分。
可以设计两层指标:
假设你有 GT 的行政区/城市/关键 POI 或真实经纬度。
设 GT 坐标为 $(\phi_g, \lambda_g)$,CoT 里反向地理编码得到的中心为 $(\phi_p, \lambda_p)$,球面距离为 $d$:
$$S_{geo} = \exp\!\left(-\frac{d}{\tau}\right)$$
$\tau$ 可以取 5–10 km 控制衰减速度。
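球面距离 + 指数衰减的打分可以写成如下草稿(haversine 公式为标准实现,τ 默认 5 km 只是示例取值):

```python
import math

def geo_score(lat_gt, lon_gt, lat_pred, lon_pred, tau_km=5.0):
    """S_geo = exp(-d / tau),d 为 haversine 球面距离(km)。"""
    r = 6371.0  # 地球平均半径 (km)
    p1, p2 = math.radians(lat_gt), math.radians(lat_pred)
    dp = math.radians(lat_pred - lat_gt)
    dl = math.radians(lon_pred - lon_gt)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    d = 2 * r * math.asin(math.sqrt(a))
    return math.exp(-d / tau_km)
```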
有 GT 时间戳或季节标签时:
为每个量化指标给权重,例子:
这里不再用「0–10 / 20–30」这种模糊档,而是把每个大项拆成**若干条「是否满足」**的小规则,每一条给固定分;所有条目加起来自动落在 $[0,1]$。
设计 5 条,每条 0.2 分:
例如其中一条:<answer> 中的关键信息在 <think> 中都能找到对应来源,而不是凭空冒出。得分:满足条数 × 0.2。
例如 4 条,每条 0.25 分:
比如 4 条,每条 0.25 分:
例如出现 "为了确认…我再次检查…" / "需要复核…" 这类自检表述。该项可设 5 条,每条 0.2 分。
给 B1/B2/B3/C3 设权重(示例):
用一个系数 $\alpha \in [0,1]$ 表示你想多信任“可量化部分”:
$$S_{CoT} = \alpha \cdot S_{quant} + (1 - \alpha) \cdot S_{checklist}$$
至此,你就从原来的「0–10、20–30 区间」变成了:
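把“量化加权 + checklist 逐条计分”合成总的思维链得分,可以用一个极简函数示意(α 与各权重均为示例假设,cot_score 为自拟函数名):

```python
def cot_score(quant_scores, quant_weights, checklist_hits, alpha=0.6):
    """S_CoT = alpha * S_quant + (1 - alpha) * S_check,所有量都在 [0,1]。
    quant_scores / quant_weights: 各可量化指标得分与权重(权重和为 1);
    checklist_hits: [(是否满足, 该条分值), ...],各条分值之和为 1。"""
    s_quant = sum(s * w for s, w in zip(quant_scores, quant_weights))
    s_check = sum(v for hit, v in checklist_hits if hit)
    return alpha * s_quant + (1 - alpha) * s_check
```

每个样本可以只启用有 GT 的量化项,把权重重新归一之后再调用。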
原论文在 GRPO 阶段只使用任务结果指标做 reward,比如 VQA/分类用 0 或 1、计数用归一化 MAE、检测用 mAP 等。(21RSThinker.pdf)
你现在要把「结果得分」和「思维链得分」合并。
按任务不同做归一化到 $[0,1]$:
与原文中 count 的 reward 形式保持一致。(21RSThinker.pdf)
记最终结果得分为 $S_{ans}$。
优点:
如果你希望结果错误时整体得分拉低很多,可以让思维链分只在结果比较正确时放大作用:$S_{final} = S_{ans} \cdot \big(\beta + (1-\beta)\, S_{CoT}\big)$,其中 $\beta$ 为保底系数(如 0.7)。
这个形式非常适合做 RL 的 reward:
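这种“门控式”融合可以写成一个极简草稿(β = 0.7 为示例取值,函数名 final_reward 为自拟),同时给出线性加权版作对照:

```python
def final_reward(s_ans, s_cot, beta=0.7, gated=True):
    """结果得分与思维链得分融合,两者均在 [0,1]。
    gated=True : S = S_ans * (beta + (1 - beta) * S_cot),答案全错则整体为 0;
    gated=False: S = beta * S_ans + (1 - beta) * S_cot(简单线性加权)。"""
    if gated:
        return s_ans * (beta + (1 - beta) * s_cot)
    return beta * s_ans + (1 - beta) * s_cot
```

门控版保证了“思维链再漂亮、答案错了也拿不到高 reward”,更适合 RL;线性版实现最简单,适合 SFT 排序。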
如果你愿意,我下一步可以帮你:
这是我之前设想的思维链的推理过程,请结合所给信息重新进行修改,原来设计的思维链的评估标准我觉得它太笼统了(评分标准0-10、20-30这样设计有很大的不确定性误差比较大),现在想把这些标准大体分为两类(一类可量化的就直接用公式计算,比如对象识别准确度和空间定位精度这些不要局限于我举的例子)(第二类不可量化的指标,给分标准再具体一些可以列出分多条,如果思维链每满足一点便多得一点分数,这样每每一分都有依据)(第一类可量化和第二类不可量化再给出权重得到总的思维链的评估得分)2.在完成思维链评估的基础上怎么进一步融合对结果的评估,将两个部分(结果和思维链的得分)融合在一起得到最终总的得分
那我直接给你一套 “按阶段拆开的新思维链评价体系”,同时帮你把它和“结果评分”融合成一个总分框架。下面所有分数都归一到 $[0,1]$,最后再乘 100 映射到 0–100,方便和你原来的 S/A/B/F 分档对齐。
我们约定模型的 CoT 结构为:
```text
<planning> … </planning>
<tool_execution> … </tool_execution>
<reflexion> … </reflexion>
<synthesis> … </synthesis>
```
接下来所有指标都围绕这四段来设计。
所有可量化指标都取值 $[0,1]$,再线性加权得到 $S_{quant}$。
适用:检测/计数/VG、Tracking 等有目标真值的任务。
设 CoT 中显式列出的目标集合为 $P$,真值集合为 $G$:$F1 = \frac{2|P\cap G|}{|P| + |G|}$。
注意这里是“思维链里被点名的目标”和 GT 的对齐,而不是最终 detection 输出,也能用于关系任务(比如先识别再说关系)。
若是框:对每个匹配对计算 IoU 并取平均,$S_{loc} = \frac{1}{|M|}\sum_{(p,g)\in M} \text{IoU}(p, g)$。
无匹配则该项 IoU=0。
若是点坐标:$S_{loc} = 1 - \min\!\left(1,\ \frac{\lVert p - \hat{p} \rVert}{D}\right)$,$D$ 为归一化距离(如图宽或对角线)。
若是分割 mask,可直接用平均 IoU。
对需要“枚举+求和”的任务(小车数量、变更面积块数等),利用 “思维链内部的计数结果” 和 GT 做比较(区别于论文里直接用最终答案的 MAE 奖励(21RSThinker.pdf)):
设 CoT 中列出的元素个数为 $n_{cot}$,GT 为 $n_{gt}$:
$$S_{count} = \exp\!\left(-\gamma \cdot \frac{|n_{cot} - n_{gt}|}{\max(n_{gt}, 1)}\right)$$
$\gamma$ 控制惩罚强度(如 0.5)。
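这条“思维链内部计数一致性”的打分可以用几行代码示意(γ = 0.5 为示例取值,count_consistency 为自拟函数名):

```python
import math

def count_consistency(n_cot, n_gt, gamma=0.5):
    """S_count = exp(-gamma * |n_cot - n_gt| / max(n_gt, 1))。
    n_cot: 思维链里枚举出的元素个数;n_gt: 真值个数。"""
    return math.exp(-gamma * abs(n_cot - n_gt) / max(n_gt, 1))
```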
根据工具日志统计:
如果有 GT 的 行政区 / POI / 经纬度 / 时间标签:
同样可定义成 1 / 0.5 / 0(三档)。
给一组权重(例子):
每个样本可以只启用有 GT 的那几项,剩余权重按比例重归一。
我们把 checklist 严格绑定到四个阶段,每条都只给固定分值,所有条目加和自动落在 。
假设一共 16 条,每条 1/16 分:
满足 $k$ 条就得 $k/16$ 分;每个阶段(planning / tool_execution / reflexion / synthesis)的 checklist 条目得分都按“满足条数 × 1/16”累计。
例如 synthesis 阶段可检查:<answer> 只包含问题要求的内容(数字/类别/短句),不混入额外解释。
每命中一条就多 1/16 分,每一分都有依据。
用一个系数 $\alpha$ 控制「量化 vs checklist」的权重,例如 $\alpha = 0.6$(更看重客观一致性):
$$S_{CoT} = \alpha \cdot S_{quant} + (1 - \alpha) \cdot S_{check}$$
再映射到 0–100:$\text{CoT\_score} = 100 \times S_{CoT}$。
你原来定义的 4 个区间可以直接套在这上面用来分 Tier(S/A/B/F),并决定样本走“核心训练 / 普通训练 / RLHF 修正 / DPO 负样本”那张表里的路由。
原论文中,结果 reward 是按任务的指标来算的,例如:
我们把这些都统一归一成 $[0,1]$ 的结果得分 $S_{ans}$。
优点:实现非常简单,适合 SFT 或排序用。
为了避免“思维链很好但答案错了”的样本拿到很高 reward,可以用门控形式:$S_{final} = S_{ans} \cdot \big(\beta + (1-\beta)\, S_{CoT}\big)$。
这样非常适合作为 GRPO / PPO 的 reward:
标注 & 自动评估流程
数据分层
用 CoT_score 做思维链质量分层(S/A/B/F),S_final 作为 RL 阶段的 reward。
如果你愿意,我可以下一步帮你把这套东西写成伪代码 / Python evaluation 脚本,或者直接改写成论文里的 “CoT Evaluation & Reward Design” 小节。
请给每个任务确定一个评估指标(关系推理及关系检测任务、指令分解任务、任务调度任务、定位描述任务、目标检测、图像分类、图像描述、目标计数、变化检测、VQA),要求给出指标的同时给出计算的公式
下面我按你列的 10 个任务,各给一个 主评估指标 + 明确公式。符号约定:$P$ 表示模型预测集合,$G$ 表示真值(GT)集合,$N$ 表示样本数。
指标:三元组 F1-score(Relation Triplet F1)
把每个样本预测出的关系写成三元组集合 $P=\{(subject, predicate, object)\}$,真值集合为 $G$:
$$\text{Precision}=\frac{|P\cap G|}{|P|},\quad \text{Recall}=\frac{|P\cap G|}{|G|},\quad F1=\frac{2PR}{P+R}$$
指标:步骤 F1-score(Step F1)
对每个样本有 GT 子指令集合 $G$,模型输出集合 $P$(每一步的语义匹配由人工或匹配器判定):$F1 = \frac{2|P\cap G|}{|P| + |G|}$。
指标:前序约束满足率(Precedence Satisfaction Rate)
设 GT 给出一组先后约束 $C = \{(i, j)\}$,表示任务 $i$ 必须在任务 $j$ 之前执行。
模型给出排列 $\pi$($\pi(i)$ 为任务 $i$ 在序列中的位置)。
满足约束数:$N_{sat} = \sum_{(i,j)\in C}\mathbb{1}[\pi(i) < \pi(j)]$;总约束数:$|C|$。
$$\text{PSR} = \frac{N_{sat}}{|C|}$$
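前序约束满足率(PSR)的最小实现示意(函数名 precedence_satisfaction 为自拟):

```python
def precedence_satisfaction(order, constraints):
    """order: 模型给出的任务序列,如 ["A", "B", "C"];
    constraints: [(i, j), ...] 表示任务 i 必须在任务 j 之前执行。
    返回:满足的约束数 / 总约束数。"""
    pos = {t: k for k, t in enumerate(order)}
    ok = sum(1 for i, j in constraints
             if i in pos and j in pos and pos[i] < pos[j])
    return ok / len(constraints) if constraints else 1.0
```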
指标:定位 IoU(Localization IoU)
这里先只考察“定位”的质量:预测区域 $R_p$,真值区域 $R_g$(框、多边形或掩膜):$\text{IoU} = \frac{|R_p \cap R_g|}{|R_p \cup R_g|}$。
若需要把文本质量也加进去,可以再引入 BLEU/CIDEr 等乘上一个权重,但作为主指标 IoU 最客观。
指标:[email protected](mean Average Precision at IoU=0.5) (21RSThinker.pdf)
对每个类别 $c$:按置信度从高到低排序,以 IoU ≥ 0.5 判定 TP/FP,得到 PR 曲线,$AP_c = \int_0^1 P(r)\, dr$。
全类平均:$\text{mAP} = \frac{1}{|\mathcal{C}|} \sum_{c \in \mathcal{C}} AP_c$。
指标:分类准确率(Accuracy):$\text{Acc} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}[\hat{y}_i = y_i]$。
指标:CIDEr
每个候选描述 $c$ 和参考集合 $S = \{s_1, \dots, s_m\}$:
对每种 n-gram(通常 $n = 1 \dots 4$)计算 TF-IDF 向量 $g^n(c)$、$g^n(s_j)$。
单个样本 CIDEr 分数:
$$\text{CIDEr}(c, S) = \frac{1}{4}\sum_{n=1}^{4}\frac{1}{m}\sum_{j=1}^{m}\frac{g^n(c)\cdot g^n(s_j)}{\lVert g^n(c)\rVert\,\lVert g^n(s_j)\rVert}$$
数据集上取平均得到整体分数。
(你论文里也是把 BLEU-4/METEOR/CIDEr 线性加权成一个 reward,CIDEr 可以单独作为主评估指标。(21RSThinker.pdf))
指标:归一化计数得分(Count Score, 基于 MAE) (21RSThinker.pdf)
先定义 MAE:$\text{MAE} = \frac{1}{N}\sum_i |\hat{n}_i - n_i|$。
在单个样本级别定义归一化得分(和 RSThinker 中 reward 一致):
$$S_i = 1 - \min\!\left(1,\ \frac{|\hat{n}_i - n_i|}{\max(n_i, 1)}\right)$$
再截断到 $[0,1]$;数据集总得分为平均:$S = \frac{1}{N}\sum_i S_i$。
指标:F1-score on Change Mask
对“变化”的像素/栅格:$\text{Precision} = \frac{TP}{TP+FP}$,$\text{Recall} = \frac{TP}{TP+FN}$,$F1 = \frac{2PR}{P+R}$。
可以按像素,也可以按变化区域(连通域)统计 TP/FP/FN。
指标:加权准确率(Weighted Accuracy) (21RSThinker.pdf)
参考你论文 Table 3 里的做法,把每个答案打成 3 档(完全正确 1、部分正确 0.5、错误 0):
整体得分:$S = \frac{1}{N}\sum_{i=1}^{N} w_i$,其中 $w_i \in \{1, 0.5, 0\}$ 为第 $i$ 个答案的档位分。
如果你之后想把这些指标再统一到一个 总 reward(比如你刚才说的“结果 + 思维链评估”融合),可以直接把上面这些都映射到 后,按任务种类或样本重要性做加权平均,我也可以帮你把那部分写成完整的公式版 “Reward Design” 小节。
按照任务、评估方法、评估公式(最终把输出归一化到1-100的区间内)的格式输出一个整理之后的汇总表格
下面给出你所要求的 “任务—评估方法—评估公式(归一化到 1–100)” 的系统化汇总表格。
这些指标均与遥感 VLM 的典型任务匹配,并且确保 可计算、可比较、可归一化。
(注:表格中的公式已统一将结果线性映射至 1–100 区间,便于后续与思维链得分融合。)
| 任务类别 | 任务名称 | 评估方法(指标说明) | 归一化计算公式(输出 1–100) |
|---|---|---|---|
| 关系推理类 | 关系推理(Relation Reasoning) | 对实体关系预测是否正确,如“建筑 A 在道路 B 北侧” | Score = 100 × Acc |
| 关系推理类 | 关系检测(Relation Detection) | 检测所有关系三元组(subject–predicate–object),计算 F1 | Score = 100 × F1 |
| 任务解析类 | 指令分解(Instruction Decomposition) | 将任务分解为若干子步骤,对比 GT 子步骤集合,计算匹配度 Jaccard | Score = 100 × \|P∩G\| / \|P∪G\| |
| 任务解析类 | 任务调度(Task Scheduling) | 若模型输出一条工具调用序列,与 GT 序列比较序列编辑距离 | Score = 100 × (1 − ED / max(\|P\|, \|G\|)) |
| 空间认知类 | 定位描述(Referring / Spatial Description) | 输出的定位表达(如“左上角”“靠近道路”等)与 GT 的空间 token 比对,基于匹配率 | Score = 100 × Matched / Total |
| 感知类任务 | 目标检测(Object Detection) | 使用 [email protected](或 [email protected]:0.95)作为指标 | Score = 100 × mAP |
| 感知类任务 | 图像分类(Image Classification) | 分类正确率 | Score = 100 × Acc |
| 感知类任务 | 图像描述(Image Captioning) | 可使用 CIDEr 作为主要评分(更稳定),归一化基于 0–1 缩放 | Score = 100 × (CIDEr / CIDEr_max)(通常 CIDEr_max ≈ 150) |
| 感知类任务 | 目标计数(Object Counting) | 采用 MAE,数值越低越好 | Score = 100 × (1 − MAE / max(GT, 1)) |
| 时序类任务 | 变化检测(Change Detection) | 二分类/语义变化地图的 IoU(或 F1) | Score = 100 × IoU |
| 复杂推理类 | VQA(Visual Question Answering) | 通常采用 Accuracy 或 soft match | Score = 100 × Acc |
属于结构化输出任务,因此分类准确率或 F1 最能衡量是否预测正确。
模型输出的子步骤集合 与 GT 集合 做集合比较:
使用 Jaccard Index(IoU of sets)特别合适。
用 编辑距离(Edit Distance) 衡量模型序列与 GT 序列的差异。
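编辑距离归一化成 1–100 分的最小实现示意(Levenshtein 动态规划为标准写法,scheduling_score 为自拟函数名):

```python
def edit_distance(a, b):
    # 经典 Levenshtein 单行动态规划
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,       # 删除
                                     dp[j - 1] + 1,   # 插入
                                     prev + (x != y)) # 替换/匹配
    return dp[-1]

def scheduling_score(pred_seq, gt_seq):
    """Score = 100 * (1 - ED / max(|P|, |G|))"""
    denom = max(len(pred_seq), len(gt_seq))
    if denom == 0:
        return 100.0
    return 100.0 * (1 - edit_distance(pred_seq, gt_seq) / denom)
```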
这是语言但具有结构化特征,可使用 token-level Precision 或 Recall。
这里使用 匹配率 —— 最直观、通用。
采用遥感领域的标准指标即可:mAP、Acc、CIDEr、IoU、MAE 等。
任务 —— 评估方法 —— 归一化公式 1–100
| 任务 | 评估指标 | 量化方式 | 归一化公式 |
|---|---|---|---|
| 关系推理 | Accuracy | 分类正确率 | Score = 100 × Acc |
| 关系检测 | F1 | 关系三元组检测 F1 | Score = 100 × F1 |
| 指令分解 | Jaccard | 子步骤集合的匹配度 | Score = 100 × \|P∩G\| / \|P∪G\| |
| 任务调度 | 序列编辑距离 | 预测序列 vs GT 序列 | Score = 100 × (1 − ED / max(\|P\|, \|G\|)) |
| 定位描述 | Token Match Rate | 空间描述语义 token 匹配 | Score = 100 × Matched/Total |
| 目标检测 | [email protected] | 标准检测指标 | Score = 100 × mAP |
| 图像分类 | Accuracy | 分类正确率 | Score = 100 × Acc |
| 图像描述 | CIDEr | 文本生成质量 | Score = 100 × (CIDEr / CIDEr_max) |
| 目标计数 | MAE | 误差越低越好 | Score = 100 × (1 − MAE / max(GT, 1)) |
| 变化检测 | IoU | 语义变化的 IoU | Score = 100 × IoU |
| VQA | Accuracy | QA 正确率 | Score = 100 × Acc |
如果你希望我把“思维链得分 + 任务结果得分”合成为一个 最终总分公式,我也可以继续为你设计。
对于思维链的不可量化部分的这个分类评估,有没有可以改进的地方(不全需要完善、或者可以合并、需要删除的地方),现在的分类有点不清晰
对于思维链的不可量化部分的这个分类评估,有没有可以改进的地方(不全需要完善、或者可以合并、需要删除的地方),现在的分类有点不清晰
下面我基于你给出的 不可量化思维链结构(planning / tool_execution / reflexion / synthesis),结合你上传的文件内容(不可量化.md),重新做了一次 系统化重构与优化。
你的核心问题是:
❝ 当前的不可量化分类有点不清晰,是否可以合并、删除或重新组织? ❞
答案是:可以,而且应该。
我会先给出一套 更清晰、更可执行、更稳健的四大类不可量化评价结构,避免冗余、重叠和模糊;
接着说明“为什么要这样改”;
再给你一个最终的优化版思维链结构(适合作为论文正式定义)。
经过分析,你的不可量化维度可以更清晰地组织为 4 大类:
评估思维链是否具有合理的步骤结构,而不是跳跃式回答。
包含:
A1 任务理解是否清晰(Task Understanding)
A2 推理步骤是否成体系(Stepwise Reasoning)
A3 规划—执行—反思—综合 是否逻辑顺序一致
你的原始分类里,“planning / execution / reflexion / synthesis”其实都隐含了“证据如何被引用并转化成结论”,所以证据链应该独立成一个维度。
包含:
原始分类中“物理合理性”“无矛盾”“交叉校验”分散在多个维度里,这里合并为一个更加一致的维度。
包含:
C1 无自相矛盾(Internal Consistency)
C2 物理 / 地理 / 常识合理性(Physical–Geo Validity)
C3 交叉验证质量(Cross-Validation Quality)
C4 不确定性表达(Uncertainty Handling)
包含:
| 旧分类问题 | 新分类改进 |
|---|---|
| 结构与逻辑混在一起 | 分离为 结构完整性 与 证据链 |
| 多个维度重复评价同一问题 | 归并为一致性与稳健性 |
| reflexion 既是过程步骤又是评价项 | 将 reflexion 的作用拆散成 冲突检查 / 修正 / 不确定性表达 |
| 难以定义“表现好/表现差” | checklist 更清晰,评分稳定性更高 |
下面是一套适用于模型训练、RL、数据筛选的不可量化评价准则(所有条目均可 Checklist 化):
因为:
因为 synthesis 是结构阶段,引用证据属于验证性问题
我们将不可量化思维链质量拆分为四类:结构完整性、证据可追溯性、一致性稳健性、洞察质量。每类包含若干可打勾的 checklist 条目,每满足一项获得固定得分,用于构建稳定、可解释的思维链质量评估函数。
结构化定义如下:
如果你希望我 把这套新的不可量化维度映射到你的标签结构(<planning>、<execution>、<reflexion>、<synthesis>)
我可以立刻给你一个 最干净、最一致、最论文化的最终版本。
现在定指标进行评估的工作已经完成,需要设计数据集的思维链了。我有下面几个问题:
1. 如果仿照 RSThinker 设计上述数据集,它存在一个问题:没包含调用工具(YOLOv8、CLIP、OpenCV、GDAL/Shapely、SQLite/Geocoder)的真实返回结果,只是规划了要用什么工具;后面的分析也没法真正结合工具的输出结果来进一步推理,只是假装有个结果、形式化地继续后续分析。
2. 有办法通过一个数据集让大模型既会灵活自主调用工具,又能对调用工具的结果进行分析吗?
3. 还是说一个大模型没法同时学会“调用工具”和“对结果进行分析的思维链生成”两个任务,我需要多个 agent 去协同,有的负责工具调用,有的负责对结果进一步分析?哪个想法更合理一些?
4. 如果是多 agent 的话,我应该如何训练不同的 agent,并且让它们协同工作呢?
先说结论,再展开细讲你这 4 个问题:
(下面回答按你的 4 个问题一一来)
你现在的样本长这样(简化):
```xml
<planning>……我要用 YOLOv8 / CLIP / GDAL …</planning>
<tool_execution>
* Action: 调用 YOLOv8……
* Observation: (这里其实是“想象中的结果”,不是实际跑出来的)
</tool_execution>
<reflexion>基于“想象的 Observation”继续分析</reflexion>
<synthesis>总结</synthesis>
```
问题:
👉 这种数据集只能很好地教会 “写一个像样的调用计划”,但不能教会“基于工具返回做严肃推理”。这一点和你前面设计的“思维链评估”目标是冲突的。(不可量化.md)
改进方向:必须在数据集中把 “真实工具输出” 固定下来。
可以,而且这类设计在工具增强 LLM 里已经被大量验证过(比如 ReAct 轨迹、function-calling 日志等,只是你要做成遥感 + Geo 工具版)。
对每个样本,结构改成“对话/轨迹式”:
```xml
<example>
  <user>问题/任务描述 + 图像/元数据</user>
  <assistant_think>(可选)高层 planning,自然语言</assistant_think>
  <assistant_tool_call>
    { "tool": "YOLOv8_OBB", "args": {...} }
  </assistant_tool_call>
  <tool_result>
    {... 这里是 YOLOv8 真正跑出来的 JSON/框/类别 ...}
  </tool_result>
  <assistant_think>
    (基于 tool_result 的分析、检查、再决定是否调用下一个工具)
  </assistant_think>
  <assistant_tool_call> … </assistant_tool_call>
  <tool_result> … </tool_result>
  ...
  <assistant_final_cot>
    (完整思维链 + 证据链)
  </assistant_final_cot>
  <assistant_answer>
    (最终答案/标签)
  </assistant_answer>
</example>
```
关键点:
- tool_result 不是让模型预测,而是视作 “环境发来的观测”,训练时只作为上下文,不计算 loss;
- 模型要学的是:什么时候发出 <assistant_tool_call>;
- 以及拿到 tool_result 之后,怎样在 <assistant_think> / <assistant_final_cot> 里引用这些结果。

具体流程:先离线跑所有工具;
再让一个强模型做“带工具日志的专家解题”
按 <planning> → <tool_call> → <tool_result> → <reflexion> → <synthesis> 模板生成完整轨迹;
最后训练你的学生模型:
只对 assistant_* 段计算 loss;tool_result 段 mask 掉,不回传梯度。这样一来,“什么时候调工具” 和 “看完结果怎么想” 都是在 同一套数据、同一个模型 上学到的。
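这里“只对 assistant_* 段算 loss、tool_result 段 mask 掉”的做法,可以用 HuggingFace 风格的 label 构造草稿示意(-100 是常见的 ignore_index 约定;build_labels 为自拟函数名):

```python
def build_labels(token_spans, ignore_index=-100):
    """token_spans: [(role, token_ids), ...],
    role 取 "assistant" / "tool_result" / "user" 等。
    只有 assistant 段的 token 保留为 label,其余段置为 ignore_index,
    这样交叉熵 loss 不会回传到工具观测和用户输入上。"""
    input_ids, labels = [], []
    for role, ids in token_spans:
        input_ids.extend(ids)
        if role == "assistant":
            labels.extend(ids)
        else:
            labels.extend([ignore_index] * len(ids))
    return input_ids, labels
```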
可以的,而且业界主流就是这么干的。
你的场景虽然是遥感 + 地理工具,但模式是一样的,只不过工具换成 YOLO、GDAL、Geocoder 等。
所以如果是论文/系统的第一版,我会强烈建议:
优先做“单模型 + 工具执行器”的方案,
把多 agent 留到后续扩展或 ablation。
如果你想做成一个比较“agentic”的架构,大概有两种主流方式:
本质是 一个模型、多个角色,但从工程看起来像“多 agent”:
Planner Agent
Tool Handler / Executor Agent
Analyzer / Synthesizer Agent
训练方式:
优点:
更“硬核”的多 agent:
单独训练一个 Tool-Calling 模型:
只对 <assistant_tool_call> 部分做 SFT / RL。再单独训练一个 Reasoning 模型:
只对 <reflexion> + <synthesis> + <answer> 部分做 SFT。推理时:
协同训练方式(可选强化学习):
用 S_final = f(结果得分, 思维链得分) 作为全局 reward(不可量化.md)。这个路子更研究范,但实现和调参成本会高一截,而且模型之间的“锅”不好分(谁把问题搞砸了?)。
数据集设计层面:
模型方案选择:
如果你愿意,下一步我可以帮你:
结合这些工具的输出结果,并且尽可能保留对思维链推理有用的信息,帮我给这几个工具生成固定好**“输入指令”和“返回的 JSON 格式”**的格式
下面我直接给你 4 个工具的“输入指令规范 + 固定 JSON I/O 格式”。
思路是:
所有 Action 统一只包含 tool + args:

```json
{
  "tool": "<TOOL_NAME>",
  "args": { ... 工具特定参数 ... }
}
```
所有 Observation 统一使用如下信封格式:

```json
{
  "tool": "<TOOL_NAME>",
  "ok": true,
  "error_msg": null,
  "input_args": { ... 原样回显 ... },
  "result": { ... 工具特定结果 ... }
}
```
下面只定义各工具的 args 和 result 内部结构。
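在数据清洗阶段,可以用一个小校验函数确保每条 Observation 都符合上面的统一信封格式(validate_observation 为自拟函数名,仅作示意):

```python
import json

REQUIRED_KEYS = {"tool", "ok", "error_msg", "input_args", "result"}

def validate_observation(obs_json: str, expected_tool: str) -> bool:
    """校验工具 Observation 字符串:能解析为 JSON、
    包含信封必需字段、tool 名与 Action 一致、ok 为布尔值。"""
    try:
        obs = json.loads(obs_json)
    except json.JSONDecodeError:
        return False
    return (REQUIRED_KEYS <= set(obs)
            and obs["tool"] == expected_tool
            and isinstance(obs["ok"], bool))
```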
用途:在指定图像或区域内做旋转框检测,输出“是什么 + 在哪儿 + 多大”。
```json
{
  "image_id": "img_000123",
  "region": {
    "type": "full",              // "full" 或 "bbox"
    "bbox_pixel": {              // 当 type = "bbox" 时必填
      "x_min": 0, "y_min": 0, "x_max": 1023, "y_max": 1023
    }
  },
  "class_filter": ["ship", "building"],  // 可为空数组,表示检测所有类别
  "score_threshold": 0.25,
  "max_dets": 200
}
```
```json
{
  "image_id": "img_000123",
  "region": { ... 同输入 ... },
  "detections": [
    {
      "det_id": "yobb_0001",
      "class_id": 5,
      "class_name": "ship",
      "confidence": 0.91,
      "bbox_obb_pixel": {        // 旋转框(像素坐标)
        "cx": 123.4, "cy": 456.7,
        "w": 40.2, "h": 12.3,
        "angle_deg": -30.0       // 以水平向右为 0 度,逆时针为正
      },
      "bbox_aabb_pixel": {       // 对齐框,便于粗略定位 / 可视化
        "x_min": 100, "y_min": 440, "x_max": 146, "y_max": 473
      },
      "geometry_wgs84": {        // 可选:映射到地理坐标后的多边形
        "type": "Polygon",
        "coordinates": [[[lon,lat], ...]]
      },
      "area_pixel": 492.3,
      "area_m2": 123.4           // 依赖分辨率,有则填,没有可设为 null
    }
  ],
  "summary": {
    "num_detections": 37,
    "counts_by_class": { "ship": 20, "building": 17 }
  }
}
```
对思维链有用的信息:
– 每个目标的 id / 类别 / 置信度;
– 像素级位置(cx,cy,w,h,angle)+ 可选的地理多边形;
– 面积(pixel / m²)和按类别的计数。
用途:给定图像或局部区域 + 候选文本标签,返回语义相似度排序,用于“这是港口还是居民区”。
```json
{
  "image_id": "img_000123",
  "region": {
    "type": "full",              // 或 "bbox"
    "bbox_pixel": { "x_min": 0, "y_min": 0, "x_max": 1023, "y_max": 1023 }
  },
  "candidate_texts": [           // 候选语义标签
    "harbor", "residential area", "industrial zone", "farmland"
  ],
  "top_k": 5
}
```
如需要对某个检测目标做语义 refine,可以在 args 里增加
ref_det_id,由执行器用该 det 对应区域裁剪图像。
```json
{
  "image_id": "img_000123",
  "region": { ... 同输入 ... },
  "scores": [
    { "text": "harbor",           "score": 0.83, "rank": 1 },
    { "text": "industrial zone",  "score": 0.62, "rank": 2 },
    { "text": "residential area", "score": 0.21, "rank": 3 },
    { "text": "farmland",         "score": 0.05, "rank": 4 }
  ],
  "best_label": "harbor",
  "best_score": 0.83
}
```
对思维链有用的信息:
– 最高得分标签及其分数;
– 全部候选的相对排序,方便模型说“更像 X 而不是 Y”。
这类操作比较多,我帮你设计成一个 通用“operation + 参数”格式,每次只做一个操作,返回结果统一放在 result.value 下,同时附带一小段自然语言 result.comment 给 CoT 用。
```json
{
  "operation": "distance",       // "distance" | "area" | "length" | "direction" | "buffer_intersect" ...
  "geom_ref_a": {                // 几何体 A,可以是引用,也可以直接给几何
    "type": "det_ref",           // "det_ref" | "wkt" | "geojson"
    "det_id": "yobb_0001"        // type = det_ref 时使用(来自 YOLOv8-OBB)
  },
  "geom_ref_b": { "type": "det_ref", "det_id": "yobb_0005" },
  "crs": "EPSG:4326",            // 或 "pixel" 表示仅用像素坐标
  "units": "meter"
}
```
不同 operation 所需字段略有差别,例如:
operation = "area":只用 geom_ref_a;operation = "direction":A→B 的方位角;operation = "buffer_intersect":几何 A 是否与以 B 为中心、半径 r 的缓冲区相交,还要加一个 buffer_radius_m。json{ "operation": "distance", "geom_ref_a": { ... }, "geom_ref_b": { ... }, "value": { "distance_m": 123.4, "direction_deg": 45.0, // 0 = 北, 顺时针 "direction_cardinal": "NE" // N/NE/E/SE/S/SW/W/NW }, "comment": "Object yobb_0001 is about 123 m to the NE of yobb_0005." }
```json
{
  "operation": "area",
  "geom_ref_a": { ... },
  "value": {
    "area_m2": 5320.7,
    "area_ha": 0.53
  },
  "comment": "The area of object yobb_0001 is about 0.53 hectares."
}
```
对思维链有用的信息:
– 数值结果(distance_m、area_m2 …);
– 简短 comment 帮助模型直接引用“在东北 123 米处”“面积约 0.53 公顷”。
用途:
```json
{
  "mode": "reverse_geocode",     // "reverse_geocode" | "poi_search"
  "location": {                  // 中心点(通常是影像中心或某目标中心)
    "lat": 31.2304,
    "lon": 121.4737
  },
  "radius_m": 5000,              // 搜索半径(用于 poi_search,可选)
  "category_filter": [           // 想重点关心的 POI 类型,可为空
    "port", "industrial", "residential"
  ],
  "top_k": 10
}
```
```json
{
  "mode": "reverse_geocode",
  "location": { "lat": 31.2304, "lon": 121.4737 },
  "admin": {
    "country": "China",
    "province": "Shanghai",
    "city": "Shanghai",
    "district": "Pudong New Area"
  },
  "nearest_poi": {
    "poi_id": "poi_0001",
    "name": "XXX Harbor",
    "category": "port",
    "distance_m": 560.0,
    "direction_cardinal": "E"
  },
  "landuse": "industrial_port",  // 自建 POI / 土地利用标签
  "poi_list": [
    { "poi_id": "poi_0001", "name": "XXX Harbor", "category": "port", "distance_m": 560.0 },
    { "poi_id": "poi_0002", "name": "YYY Oil Depot", "category": "petrochemical", "distance_m": 920.0 }
  ],
  "comment": "The point is in Pudong New Area, Shanghai, near a port/industrial zone."
}
```
```json
{
  "mode": "poi_search",
  "location": { ... },
  "radius_m": 5000,
  "category_filter": ["port","industrial"],
  "poi_list": [
    { "poi_id": "poi_0003", "name": "ZZZ Container Terminal", "category": "port", "distance_m": 2100.0, "direction_cardinal": "SE" },
    ...
  ],
  "comment": "Several port-related POIs are found within 5 km to the SE."
}
```
对思维链有用的信息:
– 行政区层级(国家/省/市/区);
– 最近的具有代表性的 POI 名称 + 类型 + 距离 + 方位;
– 土地利用/功能标签(landuse);
– 一句 summary comment,方便直接说“位于上海浦东的新港口工业区”。
最终你可以统一要求模型在 <tool_execution> 中写成类似:
```xml
<tool_execution>
Action:
{
  "tool": "YOLOv8_OBB",
  "args": {
    "image_id": "IMG_001",
    "region": {"type":"full"},
    "class_filter": ["ship"],
    "score_threshold": 0.3,
    "max_dets": 100
  }
}

Observation:
{
  ... 这里填执行器返回的 JSON(省略 input_args),供后续 reflexion 使用 ...
}
</tool_execution>
```
这样:
如果你愿意,下一步我可以帮你基于这 4 个工具,写一条完整的 示例样本(包含 planning / tool_execution / reflexion / synthesis 的 CoT),让你直接拿去当数据模板。
现在我对 RSTHinker 进行了很多改进:
1. 在数据集上,在原来任务的基础上加入了时序类任务(变化检测)、高认知推理任务(实体关系推理、实体关系检测、指令分解)、空间类任务(轨迹规划 / 导航)、以及多模态感知任务(在原论文光学任务的基础上加入红外、SAR 两类任务构成多模态数据集)。
2. 在原来辅助信息部分新加入经纬度信息和成像时间等来辅助分析(经纬度信息可以调用离线库返回具体的区域地点信息)。
3. 改变原来思维链的推理过程:[规划阶段] - [执行阶段 - 模拟工具调用] - [反思与回溯阶段 - 核心] - [综合推理阶段],新增了工具(YOLOv8-OBB、CLIP (ViT-B/32)、GDAL + Shapely、Reverse Geocoder)使用环节,调用工具的返回结果进一步丰富证据部分,不再只是原 RSTHinker 的坐标证据。
4. 规范了工具调用的输入指令规范 + 固定 JSON I/O 格式。

我希望设计训练数据集的思维链:思维链中的工具调用需要根据数据集自带的原始信息,按照规范的工具输出格式自动生成,不打算在数据集制作过程中真实调用工具获取返回结果。请根据我的改进和要求重新帮我生成这个 shared base prompt。
下面我根据你 所有的改进方向 + RSTHinker 原文 Shared Base Prompt(参考你上传 PDF 第17–18页)(不可量化.md)
重新设计了一个 全新的、适用于你强化后的多任务遥感 CoT 数据集的 Shared Base Prompt。
⚠️ 这个版本已经完全适配你的四大升级:
(可直接用于数据集构建)
You are an expert in remote sensing, geographic information science, and multimodal spatial reasoning.
Your goal is to help construct a high-quality Remote Sensing CoT Dataset (RS-CoT++), where each sample includes a structured reasoning chain, including planning, tool selection, simulated tool invocations, reflection, and final synthesis.
Each dataset sample contains:
Your task is to produce a detailed CoT for the given input.
The CoT must follow the enhanced four-stage structure:
In this stage you:
Important:
Planning must not mention or imply the correct answer.
Planning must not hallucinate nonexistent tools.
Planning must remain high-level and describe why certain tools are needed.
Wrap the planning stage in <planning>...</planning>.
For each planned step requiring external evidence:
```json
{
  "tool": "<TOOL_NAME>",
  "args": { ... }
}
```
```json
{
  "tool": "<TOOL_NAME>",
  "ok": true,
  "error_msg": null,
  "input_args": { ... echoed args ... },
  "result": { ... simulated but structurally correct fields ... }
}
```
Critical constraints for tool Observations:
Wrap each tool-execution block inside <tool_execution>...</tool_execution>.
This is the core reasoning module.
In this stage, you:
Reflection should never reveal the ground-truth answer directly.
Instead, it should analyze evidence quality and update the reasoning path.
Wrap this stage in <reflexion>...</reflexion>.
In this stage you:
The final answer itself will be provided outside the CoT.
Wrap this stage in <synthesis>...</synthesis>.
\n\n\n.When generating the CoT, output EXACTLY:
"CoT": "<planning> ... </planning>\n\n\n<tool_execution> ... </tool_execution>\n\n\n<reflexion> ... </reflexion>\n\n\n<synthesis> ... </synthesis>"
No additional text outside the CoT field.
如果你需要,我还可以继续为你生成:
告诉我你想要哪个下一步?
下面我根据你上传的 关系推理数据格式(含两张图与 text_input/text_output)
为你的 RS-CoT++ 关系推理任务 设计一个 可直接用于数据集构建的 Exemplars 模版。
该模版严格吸取你提供的结构(subject/object + ref/det + rel)(关系推理数据集图片加文字.docx),同时融入你升级后的 四阶段思维链结构 + 工具调用格式规范。
下面的模版适用于:
text"CoT": " <planning> 为推理两个目标之间的关系,首先需要理解两处区域各自的类别、位置、方向与语义属性。 根据 text_input,我们关注两个检测框:subject(ref 标注区域)与 object(给定 det 坐标区域)。 为判断二者之间的关系,可分三个子步骤: 1. 使用 YOLOv8-OBB 获取两个对象的类别与其精确空间位置(旋转框)。 2. 使用 GDAL+Shapely 计算两对象之间的距离、方向,并确认二者是否具有邻接/包含/接触等关系。 3. 使用 CLIP 评估两个对象区域的语义相似性或功能线索,辅助判断关系类型(如停放、靠近、接触、位于、穿过等)。 随后结合所有工具输出,在后续阶段进行关系类型的推理与验证。 </planning> <tool_execution> Action: { "tool": "YOLOv8_OBB", "args": { "image_id": "<IMAGE_ID>", "region": {"type": "bbox", "bbox_pixel": <SUBJECT_BOX>}, "class_filter": [], "score_threshold": 0.25, "max_dets": 20 } } Observation: { "tool": "YOLOv8_OBB", "ok": true, "input_args": { "image_id": "<IMAGE_ID>", "region": {"type": "bbox", "bbox_pixel": <SUBJECT_BOX>} }, "result": { "detections": [ { "det_id": "subj_001", "class_name": "<SIMULATED_CLASS_SUBJECT>", "confidence": 0.87, "bbox_obb_pixel": { "cx": 100.5, "cy": 220.7, "w": 32.4, "h": 15.2, "angle_deg": -12.0 } } ], "summary": {"num_detections": 1} } } </tool_execution> <tool_execution> Action: { "tool": "YOLOv8_OBB", "args": { "image_id": "<IMAGE_ID>", "region": {"type": "bbox", "bbox_pixel": <OBJECT_BOX>}, "class_filter": [], "score_threshold": 0.25, "max_dets": 20 } } Observation: { "tool": "YOLOv8_OBB", "ok": true, "input_args": { "image_id": "<IMAGE_ID>", "region": {"type": "bbox", "bbox_pixel": <OBJECT_BOX>} }, "result": { "detections": [ { "det_id": "obj_001", "class_name": "<SIMULATED_CLASS_OBJECT>", "confidence": 0.92, "bbox_obb_pixel": { "cx": 115.0, "cy": 235.0, "w": 30.0, "h": 14.0, "angle_deg": -10.0 } } ], "summary": {"num_detections": 1} } } </tool_execution> <tool_execution> Action: { "tool": "GDAL_ShAPELY", "args": { "operation": "distance", "geom_ref_a": {"type": "det_ref", "det_id": "subj_001"}, "geom_ref_b": {"type": "det_ref", "det_id": "obj_001"}, "crs": "pixel", "units": "pixel" } } Observation: { "tool": "GDAL_ShAPELY", "ok": true, "input_args": { "operation": "distance", "geom_ref_a": {"type": "det_ref", "det_id": "subj_001"}, "geom_ref_b": {"type": 
"det_ref", "det_id": "obj_001"} }, "result": { "distance_pixel": 18.5, "direction_deg": 45.0, "direction_cardinal": "NE", "comment": "Object obj_001 is about 18 pixels to the NE of subj_001." } } </tool_execution> <reflexion> 根据工具返回的模拟信息,双方类别分别为 <SIMULATED_CLASS_SUBJECT> 与 <SIMULATED_CLASS_OBJECT>。 空间关系方面: - 两者中心距离约 18 像素,属于非常接近的空间状态。 - 方向信息表明 object 位于 subject 的东北方向。 - 两者旋转框尺寸与车辆/设施等常见目标相符,没有异常值。 结合语义关系: - 若两者均为车辆类别(如 small-car、truck),这种短距离且平行方向的布局常见于“停放在旁”“靠近”的关系; - 若 object 框代表某类固定设施(如 parking-lot),则车辆相对其区域的定位关系也成立。 因此关系类型应从:靠近 / 并排停放 / 位于某设施内 / 依附关系 等方向判定。 在综合阶段将进一步整合类别 + 距离 + 朝向 + 语义信息,得出最合理的关系类型。 </reflexion> <synthesis> 综合分析两目标类别、空间距离、方向关系及语义场景,可以得出两者之间最合理的关系类型。 推理依据: - 二者类别(<SIMULATED_CLASS_SUBJECT>, <SIMULATED_CLASS_OBJECT>)。 - 距离极近(18 像素)。 - 方向呈 NE 相对关系。 - 两对象在遥感语境下常形成 <RELATION_CANDIDATE> 类型关系(如停靠、邻近、位于内部等)。 最终根据这些证据得出适合作为任务输出的关系标签(该标签在最终 answer 中给出)。 </synthesis> "
你可以根据每个实际样本替换以下占位符:
| 占位符 | 含义 |
|---|---|
<IMAGE_ID> | 当前图像 ID(如“15991_1000_1000_0_0”) |
<SUBJECT_BOX> | 来自 text_input 的 subject det 坐标 |
<OBJECT_BOX> | text_input 中另一对象的 det 坐标 |
<SIMULATED_CLASS_SUBJECT> | 模拟 YOLOv8-OBB 返回的 subject 类别 |
<SIMULATED_CLASS_OBJECT> | 模拟 YOLOv8-OBB 返回的对象类别 |
<RELATION_CANDIDATE> | 候选关系描述(如 parked-at / next-to / overlap 等) |
根据文件中的样本(small-car vs small-car)(关系推理数据集图片加文字.docx)
我给你一个已填充示例片段(仅展示 synthesis 风格,不包含完整 CoT):
text最终关系类别:park-next-to subject: small-car object: small-car
告诉我你希望我进一步生成哪一部分?
下面给出 根据你上传的数据集中第一条关系推理样本(small-car vs small-car)生成的 完整思维链(CoT)。
本 CoT 严格遵守你设定的 4 阶段结构(planning → tool_execution → reflexion → synthesis),
以及你要求的 工具调用 JSON 规范,并且使用模拟工具返回结果(不依赖真实工具输出)。
数据来源引用:(关系推理数据集图片加文字.docx)
text<planning> 为判断两个小车之间的空间关系,需要从视觉与空间角度综合分析。 本任务中给出了两个目标区域: - subject: small-car,位置框 [[522,339,531,356]] - object: 另一个目标区域 [[532,350,544,371]] 两者像素坐标非常接近,疑似两辆车停放在邻近位置。 为了确认其关系类型,需要以下步骤: 1. 使用 YOLOv8-OBB 对两个 bbox 区域内进行分类确认与精细定位,得到对象的旋转框中心位置、朝向及尺寸参数。 2. 使用 GDAL + Shapely 计算两个对象中心点间的距离和方向,确认是否属于近邻、小距离并排停放等空间结构。 3. 使用 CLIP(ViT-B/32)进行语义校验,判断两个目标的语义类别是否一致(均为 small-car),从而辅助推理可能的关系类型。 4. 结合上述信息,在反思阶段分析两对象是否呈现“紧邻(next-to)”“靠停(parked-next-to)”之类的常见遥感车辆布局关系。 以上工具调用将提供关系推断所需的空间与语义证据。 </planning> <tool_execution> Action: { "tool": "YOLOv8_OBB", "args": { "image_id": "15991_1000_1000_0_0", "region": { "type": "bbox", "bbox_pixel": [522,339,531,356] }, "class_filter": [], "score_threshold": 0.25, "max_dets": 10 } } Observation: { "tool": "YOLOv8_OBB", "ok": true, "input_args": { "image_id": "15991_1000_1000_0_0", "region": {"type":"bbox","bbox_pixel":[522,339,531,356]} }, "result": { "detections": [ { "det_id": "subj_001", "class_name": "small-car", "confidence": 0.93, "bbox_obb_pixel": { "cx": 526.5, "cy": 347.5, "w": 10.5, "h": 18.2, "angle_deg": -8.0 } } ], "summary": {"num_detections": 1} } } </tool_execution> <tool_execution> Action: { "tool": "YOLOv8_OBB", "args": { "image_id": "15991_1000_1000_0_0", "region": { "type": "bbox", "bbox_pixel": [532,350,544,371] }, "class_filter": [], "score_threshold": 0.25, "max_dets": 10 } } Observation: { "tool": "YOLOv8_OBB", "ok": true, "input_args": { "image_id": "15991_1000_1000_0_0", "region": {"type":"bbox","bbox_pixel":[532,350,544,371]} }, "result": { "detections": [ { "det_id": "obj_001", "class_name": "small-car", "confidence": 0.91, "bbox_obb_pixel": { "cx": 538.0, "cy": 360.0, "w": 11.0, "h": 18.0, "angle_deg": -10.0 } } ], "summary": {"num_detections": 1} } } </tool_execution> <tool_execution> Action: { "tool": "GDAL_ShAPELY", "args": { "operation": "distance", "geom_ref_a": {"type": "det_ref", "det_id": "subj_001"}, "geom_ref_b": {"type": "det_ref", "det_id": "obj_001"}, "crs": "pixel", "units": "pixel" } } Observation: { 
"tool": "GDAL_ShAPELY", "ok": true, "input_args": { "operation": "distance", "geom_ref_a": {"type": "det_ref", "det_id": "subj_001"}, "geom_ref_b": {"type": "det_ref", "det_id": "obj_001"} }, "result": { "distance_pixel": 15.8, "direction_deg": 40.0, "direction_cardinal": "NE", "comment": "obj_001 is located about 16 pixels to the NE of subj_001." } } </tool_execution> <tool_execution> Action: { "tool": "CLIP_ViTB32", "args": { "image_id": "15991_1000_1000_0_0", "region_type": "pair", "candidate_texts": ["small-car", "truck", "bus", "parking-lot"], "top_k": 4 } } Observation: { "tool": "CLIP_ViTB32", "ok": true, "input_args": { "image_id": "15991_1000_1000_0_0", "region_type": "pair", "candidate_texts": ["small-car","truck","bus","parking-lot"] }, "result": { "scores": [ {"text":"small-car","score":0.89,"rank":1}, {"text":"parking-lot","score":0.31,"rank":2}, {"text":"truck","score":0.12,"rank":3}, {"text":"bus","score":0.05,"rank":4} ], "best_label":"small-car" } } </tool_execution> <reflexion> 根据 YOLOv8-OBB 的模拟检测结果,两个区域均识别为 small-car,尺寸与车辆形态一致,且二者的旋转框参数接近,说明是同类型车辆。 GDAL/Shapely 的距离计算结果显示两车中心点仅相距约 16 像素,空间上属于非常近邻的距离,方向呈 NE 关系,即 object 在 subject 的东北方向。 这种距离与方向关系在遥感车辆场景中通常表示“相邻停放”或“紧挨着停车”。 CLIP 的语义相似度进一步确认两者均属于车辆类别 small-car,支持二者为同类对象的判断。 综合空间距离很近、平行位置关系、语义一致性几项证据,各方向均与“park-next-to(紧邻停放)”的典型模式一致,未出现冲突或异常信息。 因此下一阶段可基于全部证据综合推理出合理的关系类型。 </reflexion> <synthesis> 将所有证据整合后,两个对象均为 small-car,且空间距离极近(约 16 像素),位置呈紧邻并排布局,并无包含、覆盖、依附等结构关系特征。 这一布局高度符合遥感车辆中“并排停放 / 紧邻停靠(park-next-to)”的关系模式。 因此本样本的关系类型应为:二者相互 **park-next-to**。 </synthesis>
告诉我下一步继续生成哪一个?
现在我对 RSTHinker 进行了很多改进:
1. 在数据集上,在原来任务的基础上加入了时序类任务(变化检测)、高认知推理任务(实体关系推理、实体关系检测、指令分解)、空间类任务(轨迹规划 / 导航)、以及多模态感知任务(在原论文光学任务的基础上加入红外、SAR 两类任务构成多模态数据集)。
2. 在原来辅助信息部分新加入经纬度信息和成像时间等来辅助分析(经纬度信息可以调用离线库返回具体的区域地点信息)。
3. 改变原来思维链的推理过程:[规划阶段] - [执行阶段 - 模拟工具调用] - [反思与回溯阶段 - 核心] - [综合推理阶段],新增了工具(YOLOv8-OBB、CLIP (ViT-B/32)、GDAL + Shapely、Reverse Geocoder)使用环节,调用工具的返回结果进一步丰富证据部分,不再只是原 RSTHinker 的坐标证据。
4. 规范了工具调用的输入指令规范 + 固定 JSON I/O 格式。

我希望设计训练数据集的思维链:思维链中的工具调用需要根据数据集自带的原始信息,按照规范的工具输出格式自动生成,不打算在数据集制作过程中真实调用工具获取返回结果。请根据我的改进和要求重新帮我生成这个 shared base prompt。
下面给出 完全重新设计的、适用于你改进后 RSTHinker++(多任务 + 多模态 + 工具调用 + 四阶段思维链结构) 的 Shared Base Prompt。
它完全覆盖你上传文件中的原始 Shared Base Prompt(见 PDF 第17–18页)(21RSThinker Extract[17-27].pdf),并替换为升级后的版本,支持:
下面即为你的最终 Shared Base Prompt(可直接放进数据集生成器)。
You are an expert in remote sensing, geospatial reasoning, and multimodal perception.
Your task is to construct high-quality CoT (Chain-of-Thought) annotations for a remote sensing dataset containing diverse tasks, multimodal imagery, and auxiliary metadata.
Each sample contains:
Your goal is to produce a structured, multi-stage CoT, incorporating planning, tool usage, reflection, and synthesis.
This upgraded CoT should follow the enhanced four-stage framework:
In this stage you must:
Carefully understand the task type
Extract key information from auxiliary metadata
Examples:
Formulate a step-by-step plan
Explicitly determine:
Do not reference the correct answer or reveal it implicitly.
Do not include auxiliary information directly—only use it conceptually to guide planning.
Wrap this stage as:
```text
<planning> ... </planning>
```
You must simulate tool usage following standardized JSON formats.
Tools available:
```json
{
  "tool": "<TOOL_NAME>",
  "args": { ... }
}
```
```json
{
  "tool": "<TOOL_NAME>",
  "ok": true,
  "error_msg": null,
  "input_args": { ... echo args ... },
  "result": { ... simulated structured fields ... }
}
```
Important Rules:
Multiple tool calls should appear sequentially and logically as the reasoning requires.
Wrap each tool block as:
```text
<tool_execution>
Action: { ... }
Observation: { ... }
</tool_execution>
```
In this stage, you must:
Format:
```text
<reflexion> ... </reflexion>
```
In this stage:
Format:
```text
<synthesis> ... </synthesis>
```
No early leakage, no phrases like “this matches the answer”.
You may use it for reasoning, but must not quote it.
\n\n\n.Your tool invocations must be coherent with planning.
Your CoT should be adapted automatically to:
When asked to generate a CoT, output exactly:
"CoT": "<planning> ... </planning>\n\n\n<tool_execution> ... </tool_execution>\n\n\n<reflexion> ... </reflexion>\n\n\n<synthesis> ... </synthesis>"
Do NOT output anything else.
它完全兼容你上传的原始版(PDF 第17–18页)(21RSThinker Extract[17-27].pdf),并加入:
只需告诉我你想要哪个部分即可。
下面是对你上传的三个提示词模板(都来自 提示词模版.txt)的 综合比较分析,并在最后给出 融合三者优点后的最终“最优提示词模版”。
为了准确比较,我基于 提示词模版.txt 的内容 逐段审阅了三个模板的对应片段。
结构最完善,强调四阶段 CoT(planning→tool_execution→reflexion→synthesis),并严格规范工具 JSON 格式。
参考文件中的对应片段,这是一个较“智能代理风格”的版本。
| 项目 | 模版 1 | 模版 2 | 模版 3 |
|---|---|---|---|
| 结构化程度 | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐ |
| 工具 JSON 规范性 | ⭐⭐⭐⭐⭐ | ⭐⭐ | ⭐⭐⭐⭐ |
| 多模态支持 | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| 模拟工具输出 realism | ⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| 强调 Agent 推理行为 | ⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| 适合大规模数据集自动生成 | ⭐⭐⭐⭐⭐ | ⭐⭐ | ⭐⭐⭐ |
| 反思阶段完善度 | ⭐⭐⭐⭐ | ⭐⭐ | ⭐⭐⭐⭐⭐ |
因此我为你重新融合三个模板,生成一个 更强、更简洁、更稳定适合数据集自动生成的“最优 Shared Base Prompt”。
这个版本:
下面给出的即为 可直接用于自动生成数据集的最终版本。
You are an expert in remote sensing, geospatial reasoning, multimodal perception (Optical / IR / SAR), temporal analysis, and structured tool-based reasoning.
Your goal is to construct a high-quality Tool-Augmented Chain-of-Thought (CoT) for a remote sensing dataset.
Every CoT must simulate a realistic reasoning process consisting of:
➡ Planning → Tool Execution (Simulation) → Reflection → Synthesis
The CoT should NOT contain the correct answer itself.
Each data sample provides:
Auxiliary information is ONLY for generating tool outputs.
It must NOT appear explicitly in the CoT.
In this stage you must:
Identify the task type:
Break the problem into sequential subtasks.
Decide which tools are needed and why:
Planning should NOT contain the answer or auxiliary information verbatim.
Use:
```text
<planning> ... </planning>
```
For each required tool call:
```json
{ "tool": "<TOOL_NAME>", "args": { ... } }
```
```json
{
  "tool": "<TOOL_NAME>",
  "ok": true,
  "error_msg": null,
  "input_args": { ... echo args ... },
  "result": { ... simulated fields ... }
}
```
Wrap each call:
```text
<tool_execution>
Action: {...}
Observation: {...}
</tool_execution>
```
You must:
Format:
```text
<reflexion> ... </reflexion>
```
Format:
```text
<synthesis> ... </synthesis>
```
它融合:
完全适配你设计的:
只要告诉我即可。
后续想要大批量的生成这种优质的思维链数据集,不是一张图片一张图片的去生成,应该具体怎么办
这是一个非常关键、而且你现在正好走到“工程化拐点”的问题。
我直接给你一个可落地的答案:不是“多调几次模型”,而是把思维链当作一种“可程序生成的中间语言”来做数据工程。
我会按 “为什么不能一张一张做 → 正确的总体方案 → 具体工程流程 → 三种规模化生成路线 → 常见误区” 来讲。
你现在的 CoT 具有这些特征:
👉 这类 CoT 的本质不是“语言创作”,而是“程序化推理轨迹生成”。
如果你继续:
一张图 → 人/模型手写一条 CoT
会出现三个不可扩展问题:
所以:
必须把 CoT 的生成过程本身“算法化”
不要“生成思维链”,而是“编译思维链”
用 GT + 元信息 → 中间结构表示 → 渲染为自然语言 CoT
也就是三层结构:
```text
GT / 标注 / 元数据
        ↓
Structured Reasoning Graph(结构化推理图)
        ↓
CoT Renderer(模板 + 受控随机)
```
你要做的不是一个 Prompt,而是一个 CoT 生成流水线。
对每个样本,你至少要有:
```json
{
  "image_id": "...",
  "task_type": "relation_reasoning",
  "objects": [
    { "obj_id": "A", "class": "small-car", "bbox": [x1, y1, x2, y2] },
    { "obj_id": "B", "class": "small-car", "bbox": [...] }
  ],
  "relations": [
    { "subject": "A", "object": "B", "relation": "park-next-to" }
  ],
  "auxiliary": { "latlon": ..., "timestamp": ..., "sensor": "optical" }
}
```
👉 这是你真正的数据金矿,不是图片。
你要先生成一个 Reasoning Trace JSON,例如:
```json
{
  "planning": {
    "task": "infer spatial relation",
    "required_evidence": ["class", "distance", "direction"],
    "tools": ["YOLOv8_OBB", "GDAL_ShAPELY", "CLIP"]
  },
  "tool_execution": [
    { "tool": "YOLOv8_OBB", "target": "A", "expected_class": "small-car" },
    { "tool": "YOLOv8_OBB", "target": "B", "expected_class": "small-car" },
    { "tool": "GDAL_ShAPELY", "operation": "distance", "from": "A", "to": "B" }
  ],
  "reflexion": {
    "checks": [
      "class consistency",
      "distance threshold < 20px",
      "parallel orientation"
    ]
  },
  "synthesis": {
    "decision_basis": ["close distance", "same class", "adjacent layout"]
  }
}
```
⚠️ 这一步不需要 LLM,100% 规则可写
你已经意识到:
不能真实跑 YOLO / CLIP / GDAL
正确做法是:GT → 工具输出翻译器
例如:
```python
def simulate_yolo(obj):
    return {
        "class_name": obj["class"],
        "confidence": random.uniform(0.85, 0.95),
        "bbox_obb_pixel": jitter_bbox(obj["bbox"])
    }
```
```python
def simulate_distance(a, b):
    d = euclidean(center(a), center(b))
    return {
        "distance_pixel": d + noise(),
        "direction_cardinal": infer_direction(a, b)
    }
```
👉 这是你规模化的核心护城河
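把上面的伪码补全成可直接运行的版本;`jitter_bbox` / `center` / `euclidean` / `infer_direction` 是示意实现,噪声幅度和八方位判定规则都是假设值,可按你自己的仿真需求调整:

```python
import math
import random

def jitter_bbox(bbox, max_shift=2.0):
    # 对 GT 框加小幅扰动,模拟检测器的定位误差(幅度为假设值)
    return [round(v + random.uniform(-max_shift, max_shift), 1) for v in bbox]

def center(bbox):
    x1, y1, x2, y2 = bbox
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def euclidean(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def infer_direction(a_bbox, b_bbox):
    # 返回 b 相对 a 的八方位;图像坐标 y 轴向下,故取负号换算成常规方位角
    (ax, ay), (bx, by) = center(a_bbox), center(b_bbox)
    angle = math.degrees(math.atan2(-(by - ay), bx - ax)) % 360
    dirs = ["E", "NE", "N", "NW", "W", "SW", "S", "SE"]
    return dirs[int((angle + 22.5) // 45) % 8]

def simulate_yolo(obj):
    return {
        "class_name": obj["class"],
        "confidence": round(random.uniform(0.85, 0.95), 3),
        "bbox_obb_pixel": jitter_bbox(obj["bbox"]),
    }

def simulate_distance(a_bbox, b_bbox):
    d = euclidean(center(a_bbox), center(b_bbox))
    return {
        "distance_pixel": round(d + random.uniform(-1.5, 1.5), 1),
        "direction_cardinal": infer_direction(a_bbox, b_bbox),
    }
```

批量生成时固定随机种子(`random.seed(...)`),同一条样本的 Observation 就是可复现的。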
现在才用 LLM 或模板系统,把结构渲染成自然语言。
关键原则:
你甚至可以:
规则生成 90% + LLM paraphrase 10%
👉 这是我最推荐你走的
规则生成结构 + LLM 补 reflexion/synthesis
全 LLM + 强校验器
我直接给你一句“架构结论”:
你的 CoT 是一种“遥感推理中间表示语言(RS-IRL)”,
数据集生成 = 编译 GT → IR → CoT
这比 RSThinker 原论文 高一个工程和认知层级。
如果你愿意,我可以:
你只需要告诉我:
👉 你下一步是偏“论文复现/发表”,还是偏“系统/工程落地”?
我会按那个方向继续。
1. 具体说明“规则生成 90%”这一部分应该怎么完成?是可以通过写代码来程序化实现吗?请给出详细的实现步骤;还是需要人工一条一条去编写?
2. 考虑到我们的数据集是不同的子数据集整合在一起的,不同子数据集提供的信息详细程度不同,统一生成真值层可能会遇到一些问题,比如有的数据集图片信息不足等。
3. 生成「结构化推理轨迹」要如何实现呢?
下面把你问的 3 个点拆开讲清楚:“规则生成 90%”到底怎么做、子数据集信息不一致怎么统一、结构化推理轨迹怎么自动化生成。另外你还需要保留 RSThinker 的几个硬约束:CoT 分段用 \n\n\n、辅助信息不在 CoT 里出现、不能提前泄露答案等(原文里明确要求了这些规则)。(21RSThinker Extract[17-27].pdf)
不需要人工逐样本写。
正确方式是:人工写一次“规则与模板库” + 写代码程序化生成每条样本的 CoT。 之后新增任务/子数据集,只是补“适配器”和少量规则。
Step 0:定义统一真值层 Schema(一次性工作)

- core(所有任务都尽量有):image_id, modality, sensor, width, height, geo, timestamp(s), task_type
- objects(可选):obj_id, class, bbox_xyxy/obb/seg, attributes
- relations(可选):(subj_id, obj_id, rel_type)
- temporal(可选):t1_image_id, t2_image_id, change_polygons, change_label
- navigation(可选):start, goal, constraints, map_graph(optional)
- instruction(可选):subgoals[] / operators[]

Step 1:给每个子数据集写 Adapter(adapter_xxx.py)

- 每条样本记录 provenance: {source_dataset, original_fields, missing_fields}

Step 2:写工具仿真器(代码)

- simulate_yolo(gt_objects, region)
- simulate_gdal(op, geom_a, geom_b)(distance/area/direction/intersection…)
- reverse_geocode(latlon)(可以真查离线库)
- simulate_clip(gt_label, candidate_texts)(按 softmax 分布采样)

这一步是规模化的关键:Observation 完全由代码生成,LLM 不参与,就不会“格式飘、字段飘”。

Step 3:写“推理图谱/规则库”(一次性 + 可扩展),即 reasoning_recipe.yaml/json

Step 4:写 Trace Builder

- 按 task_type + 当前样本的 GT 字段完备度,选择一个推理图/步骤模板

Step 5:写 Renderer

- 四段文本以 \n\n\n 分隔,满足 RSThinker 要求(21RSThinker Extract[17-27].pdf)

Step 6:批处理生成

Step 7:自动质检与抽检
人工“逐条写”的部分,应该只存在于:
(a) 写规则库/模板库;(b) 写 20~50 条 exemplars 用来定风格。其余都自动化。
这是常态,解决思路是:“核心最小字段 + 可选字段 + 置信/缺失机制 + 分档生成策略”。
例如对每条样本算一个 info_level:
缺失字段填 null,并显式记录缺失原因。建议加:
```json
"missing": {
  "geo": "not_provided_by_source",
  "objects": "classification_only_task"
}
```
这样 Trace Builder 就能决定:
你最应该避免的是:为了统一字段而凭空补 bbox/时间/地点。
正确做法是:按 info_level 分档,信息不足的样本走降级分支(少排或不排相关工具步骤),而不是为了统一字段凭空补值。
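“分档生成”的判断逻辑可以直接程序化;下面是一个最小示意(字段名沿用前文统一真值层,分档名与 tool_plan 的映射是假设,不是论文或骨架里的固定接口):

```python
def info_level(gt: dict) -> str:
    # 按字段完备度给样本分档(示意规则:有无 objects、有无经纬度)
    has_obj = bool(gt.get("objects"))
    has_geo = gt.get("lat") is not None and gt.get("lon") is not None
    if has_obj and has_geo:
        return "full"
    if has_obj:
        return "object_only"
    return "image_only"

def choose_branch(gt: dict) -> list:
    # 缺什么就降级:没有 bbox 就不排 YOLO 步骤,没有经纬度就不排 Reverse Geocoder
    plan = {
        "full": ["YOLOv8_OBB", "GDAL_ShAPELY", "GEO_REVERSE_GEOCODE"],
        "object_only": ["YOLOv8_OBB", "GDAL_ShAPELY"],
        "image_only": ["CLIP_ViTB32"],
    }
    return plan[info_level(gt)]
```

Trace Builder 只要在入口调用一次 `choose_branch`,后面的步骤模板就天然不会引用缺失字段。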
把它理解成:一个可编程的推理“配方系统”(recipe engine)。不是让 LLM 想步骤,而是你用规则生成步骤。
建议你固定成这样(便于渲染与评估):
```json
{
  "planning": {
    "task_summary": "...",
    "subtasks": ["...", "..."],
    "evidence_needed": ["class", "distance", "direction"],
    "tool_plan": ["YOLOv8_OBB", "GDAL_ShAPELY", "CLIP_ViTB32"]
  },
  "tool_steps": [
    { "id": "t1", "tool": "YOLOv8_OBB", "args_ref": "subject_bbox", "outputs": ["subj_det_id"] },
    { "id": "t2", "tool": "YOLOv8_OBB", "args_ref": "object_bbox", "outputs": ["obj_det_id"] },
    { "id": "t3", "tool": "GDAL_ShAPELY", "op": "distance", "inputs": ["subj_det_id", "obj_det_id"] }
  ],
  "reflexion": {
    "checks": [
      { "type": "consistency", "rule": "both are vehicles" },
      { "type": "threshold", "rule": "distance_pixel < 20 => adjacent" }
    ]
  },
  "synthesis": {
    "decision_rules": [
      "same_class & close_distance => park-next-to"
    ],
    "supporting_facts": ["...来自 t1/t2/t3 ..."]
  }
}
```
每类任务把步骤规则固化成一个 recipe.yaml。例:关系推理:
先把 10 类任务各写一个最小 recipe(不追求全覆盖),后续再加分支与降级策略。
如果你愿意,我可以下一步直接给你一个“项目骨架”(目录结构 + 每个模块的接口定义 + 关系推理/变化检测各 1 个完整 recipe 示例),你照着填就能批量跑起来。
生成一个“项目骨架”(目录结构 + 每个模块的接口定义 + 关系推理/变化检测各 1 个完整 recipe 示例),帮助我去理解自动化生成数据集的流程是如何实现的。
下面给你一个可直接落地的“自动化生成 CoT 数据集”的项目骨架:包含目录结构、各模块接口(函数/数据结构约定)、以及关系推理和变化检测各 1 份完整 recipe 示例。整体流程严格遵循 RSThinker 的核心约束:CoT 分段、避免泄露答案、辅助信息不直接出现在 CoT 中等(原文强调段落用 "\n\n\n" 分隔等规则)。 (21RSThinker Extract[17-27].pdf)
```text
rs_cot_builder/
  README.md
  pyproject.toml              (或 requirements.txt)
  configs/
    schema/
      canonical_gt.schema.json
      tool_io.schema.json
      reasoning_trace.schema.json
    recipes/
      relation_reasoning_v1.yaml
      change_detection_v1.yaml
    render/
      phrases_zh.yaml         # 可选:中文短语库(同义改写)
      phrases_en.yaml         # 可选:英文短语库
  data/
    raw/                      # 各子数据集原始文件(不进 git 或只放示例)
    processed/
      canonical_gt/           # 统一真值层输出(jsonl)
      tool_obs/               # 工具模拟输出(jsonl,可选缓存)
      cot/                    # 最终 CoT 样本(jsonl)
  src/
    main.py                   # CLI 入口:build-canonical / build-cot / validate
    pipeline/
      build_canonical.py
      build_cot.py
      validate.py
    adapters/                 # 子数据集适配器:raw -> canonical_gt
      __init__.py
      base.py
      adapter_xxx.py
      adapter_yyy.py
    core/
      types.py                # dataclass / TypedDict:CanonicalGT、ToolCall、ToolObs、Trace
      schema_validate.py      # jsonschema 校验
      utils.py                # 随机种子、噪声、几何计算等
    toolsim/                  # 工具仿真器:GT -> Observation JSON(无需真实工具)
      __init__.py
      yolo_obb.py
      clip_vitb32.py
      gdal_shapely.py
      reverse_geocoder.py
      dispatcher.py           # 根据 tool_name 分发到对应仿真器
    trace/
      __init__.py
      recipe_engine.py        # 读 recipe + 选择分支 + 输出 structured trace
      recipes.py              # 将 yaml recipe 编译为内部结构(可选)
    render/
      __init__.py
      cot_renderer.py         # structured trace -> 四段文本
      templates_zh.py         # planning/reflexion/synthesis 的模板句式
      templates_en.py
      tool_block.py           # tool_execution 的严格 JSON dump
    quality/
      leak_check.py           # 检测 CoT 是否泄露 answer / 是否引用 aux literal
      consistency_check.py    # 数值逻辑一致性检查(distance/方向等)
      stats.py                # 分布统计、抽检输出
```
src/core/types.py
```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Literal, Tuple

BBox = Tuple[float, float, float, float]  # (x1, y1, x2, y2) pixel
Modality = Literal["optical", "infrared", "sar", "multimodal"]
TaskType = Literal["relation_reasoning", "change_detection", "vqa", "detect",
                   "count", "caption", "navigation", "instr_decomp"]

@dataclass
class ObjectGT:
    obj_id: str
    cls: str
    bbox_xyxy: Optional[BBox] = None
    obb: Optional[Dict[str, float]] = None   # cx,cy,w,h,angle_deg(可空)
    mask_rle: Optional[str] = None
    attrs: Dict[str, Any] = None

@dataclass
class RelationGT:
    subject_id: str
    object_id: str
    rel_type: str

@dataclass
class CanonicalGT:
    sample_id: str
    task_type: TaskType
    image_ids: List[str]          # change detection 可含 t1,t2
    modality: Modality
    width: int
    height: int
    # 可选元信息(可为空,trace builder 会降级)
    lat: Optional[float] = None
    lon: Optional[float] = None
    timestamp: Optional[str] = None
    timestamp_t1: Optional[str] = None
    timestamp_t2: Optional[str] = None
    # 监督信息
    question: str = ""
    answer: Any = None
    objects: List[ObjectGT] = None
    relations: List[RelationGT] = None
    change_mask: Optional[str] = None            # RLE / polygon list
    change_polygons: Optional[List[Any]] = None
    # 缺失字段解释
    missing: Dict[str, str] = None
    provenance: Dict[str, Any] = None
```
src/core/types.py
```python
@dataclass
class ToolCall:
    tool: str
    args: Dict[str, Any]

@dataclass
class ToolObs:
    tool: str
    ok: bool
    error_msg: Optional[str]
    input_args: Dict[str, Any]
    result: Dict[str, Any]
```
src/core/types.py
```python
@dataclass
class TraceStep:
    step_id: str
    tool_call: ToolCall
    # 依赖哪些 obj_id / det_id / 上一步结果(用于生成 args)
    depends_on: List[str] = None
    produces: List[str] = None

@dataclass
class ReasoningTrace:
    planning: Dict[str, Any]        # subtasks, evidence_needed, tool_plan
    tool_steps: List[TraceStep]
    reflexion: Dict[str, Any]       # checks[]
    synthesis: Dict[str, Any]       # decision_rules, supporting_facts
```
src/adapters/base.py
```python
class BaseAdapter:
    def load_raw(self, raw_path: str) -> Any:
        raise NotImplementedError

    def to_canonical(self, raw_item: Any) -> "CanonicalGT":
        """把某个子数据集样本转成统一真值层。缺什么就填 None + missing 说明。"""
        raise NotImplementedError
```
src/trace/recipe_engine.py
```python
class RecipeEngine:
    def __init__(self, recipe_yaml_path: str):
        self.recipe = self._load(recipe_yaml_path)

    def build_trace(self, gt: "CanonicalGT") -> "ReasoningTrace":
        """
        根据 task_type + 字段完备度选择分支:
        - 有 bbox/objects 就走 detection-based 分支
        - 没有 bbox 就降级为全图语义/地理分支
        """
        ...
```
src/toolsim/dispatcher.py
```python
class ToolSimDispatcher:
    def simulate(self, tool_call: "ToolCall", gt: "CanonicalGT") -> "ToolObs":
        """严格输出你定义的 JSON schema,数值可加噪声但要合理。"""
        ...
```
放在:configs/recipes/
```yaml
task_type: relation_reasoning
version: v1

planning:
  task_summary: "Infer the relation between subject and object."
  subtasks:
    - "Confirm both entities' categories and precise positions."
    - "Measure spatial distance and direction."
    - "Use semantic cues to disambiguate relation type if needed."
  evidence_needed: ["class", "position", "distance", "direction"]
  tool_plan: ["YOLOv8_OBB", "GDAL_ShAPELY", "CLIP_ViTB32"]

branches:
  - name: "bbox_available"
    when:
      require_fields: ["objects", "relations"]
      require_object_bbox: true
    tool_steps:
      - step_id: t1
        tool: YOLOv8_OBB
        args_from:
          image_id: "$image_ids[0]"
          region:
            type: "bbox"
            bbox_pixel: "$rel.subject.bbox_xyxy"
          class_filter: []
          score_threshold: 0.25
          max_dets: 20
        produces: ["det_subj"]
      - step_id: t2
        tool: YOLOv8_OBB
        args_from:
          image_id: "$image_ids[0]"
          region:
            type: "bbox"
            bbox_pixel: "$rel.object.bbox_xyxy"
          class_filter: []
          score_threshold: 0.25
          max_dets: 20
        produces: ["det_obj"]
      - step_id: t3
        tool: GDAL_ShAPELY
        args_from:
          operation: "distance"
          geom_ref_a: { type: "det_ref", det_id: "$det_subj.det_id" }
          geom_ref_b: { type: "det_ref", det_id: "$det_obj.det_id" }
          crs: "pixel"
          units: "pixel"
        depends_on: ["det_subj", "det_obj"]
        produces: ["dist_dir"]
      - step_id: t4
        tool: CLIP_ViTB32
        args_from:
          image_id: "$image_ids[0]"
          region:
            type: "pair_bbox"
            bbox_pair_pixel: ["$rel.subject.bbox_xyxy", "$rel.object.bbox_xyxy"]
          candidate_texts: ["small-car", "truck", "bus", "parking-lot"]
          top_k: 4
        produces: ["clip_sem"]

reflexion:
  checks:
    - "Check class consistency: both are plausible vehicle categories."
    - "Check distance threshold: very small distance suggests adjacent/next-to."
    - "Check direction consistency with bbox centers (e.g., NE)."
    - "If semantic scores conflict with YOLO classes, prefer the more confident signal."

synthesis:
  decision_rules:
    - "IF same_vehicle_class AND distance_pixel < 20 THEN relation_candidate = 'park-next-to' or 'next-to'."
    - "IF one is 'parking-lot' AND other is vehicle AND overlap/buffer_intersect THEN relation_candidate = 'park-in'."
  supporting_facts:
    - "Use YOLO detections (class/confidence) for both entities."
    - "Use GDAL distance/direction for spatial relation."
    - "Use CLIP top label as secondary semantic check."
```
这个 recipe 的关键点:完全可以程序化生成,不需要 LLM 来“想步骤”。
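recipe 里 `"$image_ids[0]"`、`"$rel.subject.bbox_xyxy"` 这类引用,需要 RecipeEngine 在构建 trace 时解析成具体值;下面是一个最小的占位符解析器示意(语法只覆盖点号取字段和 `[i]` 下标,属于假设实现):

```python
def resolve_ref(expr, ctx):
    # 解析 "$a.b[0].c" 形式的引用;非 $ 开头的值原样返回
    if not (isinstance(expr, str) and expr.startswith("$")):
        return expr
    node = ctx
    for part in expr[1:].split("."):
        if part.endswith("]"):          # 支持 $image_ids[0] 这种下标访问
            name, idx = part[:-1].split("[")
            node = node[name][int(idx)]
        else:
            node = node[part]
    return node

def resolve_args(args, ctx):
    # 递归替换 dict/list 中的所有 $ 引用,产出可直接喂给仿真器的 Action.args
    if isinstance(args, dict):
        return {k: resolve_args(v, ctx) for k, v in args.items()}
    if isinstance(args, list):
        return [resolve_args(v, ctx) for v in args]
    return resolve_ref(args, ctx)
```

这样 recipe 保持声明式,样本相关的具体坐标只在运行时注入。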
```yaml
task_type: change_detection
version: v1

planning:
  task_summary: "Detect and describe changes between two timestamps."
  subtasks:
    - "Align and compare t1 vs t2 image content."
    - "Localize changed regions and classify change type."
    - "Use spatial measurements to quantify change extent if needed."
  evidence_needed: ["change_region", "change_type", "extent"]
  tool_plan: ["GDAL_ShAPELY", "CLIP_ViTB32", "YOLOv8_OBB"]

branches:
  - name: "change_mask_available"
    when:
      require_fields: ["change_mask"]
    tool_steps:
      - step_id: t1
        tool: GDAL_ShAPELY
        args_from:
          operation: "area"
          geom_ref_a:
            type: "geojson"
            geojson: "$change_polygon_or_mask_geojson"
          crs: "pixel"
          units: "pixel"
        produces: ["chg_area"]
      - step_id: t2
        tool: CLIP_ViTB32
        args_from:
          image_id: "$image_ids[0]"   # t1
          region: { type: "full" }
          candidate_texts: ["construction", "demolition", "flooding", "vegetation_change", "road_change"]
          top_k: 5
        produces: ["clip_t1_scene"]
      - step_id: t3
        tool: CLIP_ViTB32
        args_from:
          image_id: "$image_ids[1]"   # t2
          region: { type: "full" }
          candidate_texts: ["construction", "demolition", "flooding", "vegetation_change", "road_change"]
          top_k: 5
        produces: ["clip_t2_scene"]
      - step_id: t4
        tool: YOLOv8_OBB
        args_from:
          image_id: "$image_ids[1]"   # t2
          region:
            type: "bbox"
            bbox_pixel: "$change_bbox_from_mask"   # 由 mask 外接框程序计算
          class_filter: []
          score_threshold: 0.25
          max_dets: 50
        produces: ["t2_dets"]

reflexion:
  checks:
    - "Check that change region is localized (not entire image unless justified)."
    - "Cross-check change type with modality/time metadata (e.g., seasonal vegetation vs construction)."
    - "If YOLO detects new objects in t2 inside change bbox, treat as 'appearance/construction'."
    - "If scene semantics shift (CLIP t1 vs t2), use it as supporting evidence, not sole evidence."

synthesis:
  decision_rules:
    - "IF new_building_or_vehicle_appears_in_t2 AND change_area>0 THEN change_type='construction/appearance'."
    - "IF water-related semantics rises AND region expands THEN change_type='flooding'."
    - "IF vegetation semantics shifts but no new man-made objects THEN change_type='vegetation_change'."
  supporting_facts:
    - "Use GDAL area for extent."
    - "Use CLIP t1/t2 semantic distribution for change category hint."
    - "Use YOLO in t2 for newly appeared objects inside change region."
```
```shell
python -m src.main build-canonical --raw data/raw --out data/processed/canonical_gt/train.jsonl
python -m src.main build-cot --gt data/processed/canonical_gt/train.jsonl --recipes configs/recipes --out data/processed/cot/train.jsonl
python -m src.main validate --cot data/processed/cot/train.jsonl --schemas configs/schema
```
整条流水线:Adapter 输出 CanonicalGT;RecipeEngine 按 task_type 与字段完备度选分支,输出 ReasoningTrace;Renderer 渲染成四段 CoT。如果你愿意,我可以下一步直接把这个骨架进一步“具体化”为一份可运行的最小 Python 代码样例(只实现 relation_reasoning 和 change_detection 两个 task),这样你可以直接在你们的数据上跑出第一批 jsonl。
我先把你的两个需求拆开回答:
下面所有样例都基于你给的那批示例数据来写。(任务的汇总.md)
统一结构(文字里就按你设计的四阶段来写):
(综合推理阶段的产出与 text_output 对齐)。工具调用的 JSON 结构我用一种比较朴素、容易程序化生成的格式:
tool_name:字符串instruction:自然语言指令(可直接从样本信息构造)inputs:结构化参数outputs:和你之前规定的字段类似的假结果你可以之后把这里的字段名改成你真正定好的那套。
示例取第一条 building-building 的样本。(任务的汇总.md)
CoT 内容示例:
[规划阶段]
问题要求判断图像中两个目标之间的关系,并给出它们各自的类别。已知输入里给出了第一个建筑物的框[[681,656,866,787]]和第二个目标的框[[912,638,997,716]]。
计划步骤:
1)在给定框内做目标检测/分类,确认都是建筑物;
2)用几何工具根据框坐标计算两者的中心点及距离;
3)根据距离与尺度阈值判断是否属于“close-to”等空间关系;
4)按照模板输出 subject/object 类别和关系标签。

[执行阶段 – 模拟工具调用]
调用 YOLOv8-OBB 做定类检测:

```json
{
  "tool_name": "YOLOv8-OBB",
  "instruction": "Detect objects and categories inside the given bounding boxes.",
  "inputs": {
    "image_id": "02102_600_800_0_0",
    "regions": [
      { "region_id": "A", "bbox": [681, 656, 866, 787] },
      { "region_id": "B", "bbox": [912, 638, 997, 716] }
    ],
    "candidate_categories": ["building", "road", "ship", "vehicle"]
  },
  "outputs": {
    "detections": [
      { "region_id": "A", "category": "building", "score": 0.97 },
      { "region_id": "B", "category": "building", "score": 0.95 }
    ]
  }
}
```

然后调用 GDAL+Shapely(或你定义的几何工具)计算两框间的空间关系:
```json
{
  "tool_name": "GeoRelation",
  "instruction": "Compute distance and topological relation between two bounding boxes.",
  "inputs": {
    "bbox_A": [681, 656, 866, 787],
    "bbox_B": [912, 638, 997, 716]
  },
  "outputs": {
    "center_A": [773.5, 721.5],
    "center_B": [954.5, 677.0],
    "euclidean_distance": 185.0,
    "normalized_distance": 0.23,
    "relation_candidates": ["overlap", "inside", "adjacent", "far", "close"],
    "selected_relation": "close"
  }
}
```

根据你自己的关系词表,把 "close" 映射到 <|rel|>close-to<|/rel|>。

[反思与回溯阶段]
两个 region 都被检测为 building,置信度较高;两中心点归一化距离约 0.23,小于“近邻”阈值(例如 0.3)。两框也没有互相包含或大面积重叠,更像“相邻/靠近”。因此选择关系close-to是合理的。
同时问题要求 "and output their categories",需要在答案中显式指出 subject/object 的类别均为 building。

[综合推理阶段]
整合上述工具结果与反思,生成最终关系三元组描述:
subject: building, object: building, the building is <|rel|>close-to<|/rel|> the building.
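反思阶段那条“归一化距离小于阈值则判近邻”的规则可以直接程序化,用来生成 GeoRelation 的 selected_relation 字段;下面是一个最小示意(0.3 的阈值沿用上文示例,overlap 判定是假设的简化规则):

```python
def relation_from_geometry(bbox_a, bbox_b, img_w, img_h, close_thr=0.3):
    # 中心点距离按图像对角线归一化,再映射到关系候选词(阈值为示意值)
    ax = (bbox_a[0] + bbox_a[2]) / 2.0
    ay = (bbox_a[1] + bbox_a[3]) / 2.0
    bx = (bbox_b[0] + bbox_b[2]) / 2.0
    by = (bbox_b[1] + bbox_b[3]) / 2.0
    diag = (img_w ** 2 + img_h ** 2) ** 0.5
    ndist = ((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5 / diag
    # 两框在 x、y 方向都有交叠才算 overlap(简化判定)
    overlap = not (bbox_a[2] < bbox_b[0] or bbox_b[2] < bbox_a[0]
                   or bbox_a[3] < bbox_b[1] or bbox_b[3] < bbox_a[1])
    if overlap:
        return "overlap", round(ndist, 3)
    return ("close-to" if ndist < close_thr else "far-from"), round(ndist, 3)
```

这类映射函数既能生成 Observation,也能被 consistency_check 复用来验证 CoT 里的结论。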
取 “Analyze spatial relationships in region …” 示例。(任务的汇总.md)
CoT 内容示例:
[规划阶段]
指令要求在指定区域内分析空间关系,可以拆成:
1)定位目标区域;
2)在区域内做目标检测,列出对象及类别;
3)分析对象之间的空间关系;
4)总结整体上下文。
最终输出需要是“Step1…Step4”的自然语言步骤序列。

[执行阶段 – 模拟工具调用]
1)简单几何计算区域位置(无需工具,或用自定义 RegionLocator):

```json
{
  "tool_name": "RegionLocator",
  "instruction": "Judge the rough position of a region within the whole image.",
  "inputs": {
    "image_width": 1000,
    "image_height": 1000,
    "region_bbox": [566, 350, 907, 582]
  },
  "outputs": {
    "position": "right",
    "relative_area_ratio": 0.17
  }
}
```

2)调用 YOLOv8-OBB 做检测:

```json
{
  "tool_name": "YOLOv8-OBB",
  "instruction": "Detect all objects inside the region and give categories and bboxes.",
  "inputs": {
    "image_id": "06901_1000_1000_0_0",
    "region": [566, 350, 907, 582]
  },
  "outputs": {
    "detections": [
      { "id": 1, "category": "tennis-court", "bbox": [581, 355, 683, 439] },
      { "id": 2, "category": "building", "bbox": [595, 370, 821, 562] },
      { "id": 3, "category": "building", "bbox": [706, 381, 879, 563] }
    ]
  }
}
```

3)调用几何关系工具分析建筑之间距离:

```json
{
  "tool_name": "GeoRelation",
  "instruction": "Compute pair-wise relations between detected objects.",
  "inputs": {
    "detections": [
      { "id": 2, "category": "building", "bbox": [595, 370, 821, 562] },
      { "id": 3, "category": "building", "bbox": [706, 381, 879, 563] }
    ]
  },
  "outputs": {
    "relations": [
      { "subject_id": 2, "object_id": 3, "relation": "close-to" },
      { "subject_id": 3, "object_id": 2, "relation": "close-to" }
    ]
  }
}
```

[反思与回溯阶段]
检查:在目标区域内一共 3 个对象,类别与给出的 ground truth 匹配;建筑之间距离较近,因此 close-to 合理;网球场与建筑物之间虽然也有空间关系,但指令和原始标注只强调建筑间关系即可。输出需要按 Step1…Step4 的格式组织。

[综合推理阶段]
按照步骤整合:
- Step1:目标区域位于图像右侧;
- Step2:列出 1 个 tennis-court + 2 个 buildings 及其坐标;
- Step3:说明 2 个建筑之间互为 close-to;
- Step4:给出“检测到 3 个目标(2 类)、2 个交互”的总结。
取 easy 难度的样本。(任务的汇总.md)
这里除了视觉工具,还可以引入一个离线路径规划工具,例如基于栅格图的 A*。
CoT 内容示例:
[规划阶段]
已知:输入给出了起始位姿[690, 660, 492, 265, 704, 108],目标描述为 Leslie Road 上一排房屋,对应的参考坐标[947, 876, 52],并给出了最终目标建筑坐标[917, 897, 73](在真值里)。
规划步骤:
1)利用地理/三维信息定位地标及目标建筑坐标;
2)构建飞行空间栅格或图结构;
3)使用 A* 或其它规划算法生成从起点到目标的路径点序列;
4)为每个路径点补充姿态角,使其朝向飞行方向;
5)输出 <|pose|> 序列。

[执行阶段 – 模拟工具调用]
1)用 GeoPlanner(自定义工具,内部可基于 GDAL+Shapely)解析地标坐标:

```json
{
  "tool_name": "GeoPlanner",
  "instruction": "Resolve landmark positions and target building position.",
  "inputs": {
    "area": "birmingham_block_1",
    "landmarks": [
      { "name": "Leslie Road", "pos": [947, 876, 52] }
    ]
  },
  "outputs": {
    "landmarks": [
      { "name": "Leslie Road", "pos": [947, 876, 52] }
    ],
    "target": { "category": "building", "pos": [917, 897, 73] }
  }
}
```

2)调用 PathPlannerAStar:

```json
{
  "tool_name": "PathPlannerAStar",
  "instruction": "Plan a collision-free 3D path from start to target over all buildings.",
  "inputs": {
    "start_pose": [690, 660, 492, 265, 704, 108],
    "target_pos": [917, 897, 73],
    "area_map_id": "birmingham_block_1",
    "constraints": ["fly_over_buildings"]
  },
  "outputs": {
    "waypoints": [
      [690, 660, 492, 265, 704, 108],
      [636, 712, 492, 279, 711, 104],
      ...
      [917, 902, 49, 828, 300, 181]
    ]
  }
}
```

实际 waypoints 按真值中的 <|pose|> 序列填充即可。

[反思与回溯阶段]
检查:路径是否从起点出发,经过建筑上方,高度由高逐渐下降到接近目标高度 73;终点坐标与目标建筑位置一致或非常接近。若中途高度变化不平滑或没有覆盖建筑区域,需要调整规划参数后重新生成。

[综合推理阶段]
把地标解析、目标确定和路径规划三部分的结果整合成自然语言描述和 <|pose|>[[...], ...] 的轨迹输出,结构上与现有 text_output 一致。
取 IR 船只检测样本。(任务的汇总.md)
[规划阶段]
任务:在红外图像中检测并定位各个船只,区分 sailboat、canoe、bulk carrier 等细粒度类别,并给出每个实例的 bbox。
计划步骤:
1)根据传感器类型(IR)选择对应的检测模型;
2)在整张图上运行检测工具,得到所有船只候选框和类别;
3)筛选置信度高的结果,按类别分组;
4)将结果整理成“数量 + 类别 + 坐标列表”的文本。[执行阶段 – 模拟工具调用]
使用 YOLOv8-OBB-IR(IR 专用权重):json{ "tool_name": "YOLOv8-OBB-IR", "instruction": "Detect ships and related objects on IR image.", "inputs": { "image_id": "IR_ship_5340_0", "class_names": ["sailboat", "canoe", "bulk carrier"] }, "outputs": { "detections": [ {"category": "sailboat", "bbox": [530, 580, 610, 690], "score": 0.96}, {"category": "sailboat", "bbox": [860, 530, 870, 540], "score": 0.93}, {"category": "canoe", "bbox": [690, 560, 710, 570], "score": 0.91}, {"category": "bulk carrier", "bbox": [580, 530, 600, 540], "score": 0.94}, {"category": "bulk carrier", "bbox": [670, 530, 690, 540], "score": 0.95}, {"category": "bulk carrier", "bbox": [810, 520, 830, 530], "score": 0.93}, {"category": "bulk carrier", "bbox": [910, 530, 940, 540], "score": 0.92}, {"category": "bulk carrier", "bbox": [970, 520, 980, 530], "score": 0.90} ] } }[反思与回溯阶段]
统计数量:sailboat 2、canoe 1、bulk carrier 5,与检测结果一致;检查 bbox 是否位于水域且不重叠明显错误区域;确认没有漏检明显船只。[综合推理阶段]
将统计信息转成目标 text_output:
There are 2 <|ref|>sailboats<|/ref|><|det|>[[...],[...]]<|/det|>, 1 <|ref|>canoe<|/ref|><|det|>[...]<|/det|> and 5 <|ref|>bulk carriers<|/ref|><|det|>[[...],...]]<|/det|> in the image.
取 IR 船只场景描述样本。(任务的汇总.md)
[规划阶段]
目标是“总结视觉内容”,重点包括各细粒度船只类别及数量。可分为:
1)利用检测/分类工具获取对象类别与数量;
2)根据数量信息构造结构化场景摘要;
3)用语言模板生成多样化描述句。

[执行阶段 – 模拟工具调用]
直接复用上一节目标检测的输出作为感知结果,然后构造一个内部汇总工具:

```json
{
  "tool_name": "SceneSummarizer",
  "instruction": "Summarize object counts by category from detection results.",
  "inputs": {
    "detections": [
      { "category": "sailboat" },
      { "category": "sailboat" },
      { "category": "canoe" },
      { "category": "bulk carrier" },
      { "category": "bulk carrier" },
      { "category": "bulk carrier" },
      { "category": "bulk carrier" },
      { "category": "bulk carrier" }
    ]
  },
  "outputs": {
    "counts": { "sailboat": 2, "canoe": 1, "bulk carrier": 5 }
  }
}
```

[反思与回溯阶段]
检查统计结果与 detection 一致,且类别覆盖了图像主体。参考任务指令,"fine-grained categories such as truck, liners, and warships" 只是例子,这里实际出现的是 sailboat / canoe / bulk carrier,合理。

[综合推理阶段]
根据 counts 用一个模板生成自然语言,例如:
- “In the image, we can see 2 sailboats, 1 canoe, and 5 large cargo ships.”
- “A total of 2 sailboats, 1 canoe, and 5 bulk carriers are visible in the image.”
等多条,同 ground truth 的 list 对齐。
取 IR 场景分类样本。(任务的汇总.md)
[规划阶段]
任务是输出整幅图像的场景类别,例如“Ship”。可思路:
1)利用全局特征(CLIP)或检测结果,确定主导对象类别;
2)用一个规则:若大部分对象是船只,则场景类别为 Ship;
3)输出单词标签。

[执行阶段 – 模拟工具调用]
方案一:用 CLIP 做图文匹配:

```json
{
  "tool_name": "CLIP-ViT-B/32",
  "instruction": "Compute similarity between the image and candidate scene prompts.",
  "inputs": {
    "image_id": "IR_ship_3376_0",
    "text_candidates": [
      "a ship scene",
      "an airport",
      "a residential area",
      "a forest"
    ]
  },
  "outputs": {
    "scores": [
      { "text": "a ship scene", "score": 0.87 },
      { "text": "an airport", "score": 0.21 },
      { "text": "a residential area", "score": 0.18 },
      { "text": "a forest", "score": 0.12 }
    ],
    "top_scene": "Ship"
  }
}
```

[反思与回溯阶段]
CLIP 得到 "a ship scene" 最高,相差明显,并且原始标注中也提到图中有 sailboat。场景类别设为 Ship 合理。

[综合推理阶段]
输出:Ship.
使用你给的从起点到终点生成“行动指令”的示例。(任务的汇总.md)
[规划阶段]
输入给出了起终位置的三维坐标,输出是人类风格的分步导航指令(Step1…Step5),每一步引用可见建筑或停车场等地标。
思路:
1)根据起点与终点,调用路径规划工具得到一条关键路径;
2)在路径上采样一些转折点;
3)调用检测/分割工具获取这些点附近的显著建筑和颜色等属性;
4)将路径分段,每段用“从当前位置朝某方向移动,经过某地标,到达下一个关键点”的句式表述。

[执行阶段 – 模拟工具调用]
1)路径规划(略,类似前面的 PathPlannerAStar)。
2)调用 YOLOv8-OBB + ColorAnalyzer 检测关键点附近建筑及其颜色:

```json
{
  "tool_name": "LandmarkAnalyzer",
  "instruction": "At each waypoint, detect nearby mid-rise buildings and estimate dominant color.",
  "inputs": {
    "image_ids": [...],
    "waypoints": [
      [739, 610, 199],
      [650, 610, 199],
      [560, 610, 199],
      ...
    ]
  },
  "outputs": {
    "landmarks": [
      { "step": 1, "side": "left", "color": "red", "height": "mid-rise", "shape": "irregular" },
      { "step": 2, "side": "left", "color": "blue", "height": "mid-rise", "shape": "irregular" },
      ...
    ]
  }
}
```

[反思与回溯阶段]
检查:每一步的描述是否对应路径中的转折点,颜色与建筑形状是否与图像一致;最后一步是否确实到达终点附近的停车场区域。若某步的地标过于模糊或重复,可以重新选择更显著的建筑。

[综合推理阶段]
将每一段路径和地标信息组织成自然语言:
- Step1:一直向前,左侧经过一个红色中高层不规则建筑,直到位于另一栋红色中高层不规则建筑上方;
- …
- Step5:右转直行,右侧经过一个黄色中高层不规则建筑,直到来到停车场上方。
你现在有 4 个核心工具:
根据各任务特点,可以再扩展几类“轻量级、可离线”的工具,尽量使用传统算法 / 小模型,避免特别吃算力的超大模型。
下面按 功能类型 + 推荐库/模型 + 适用任务 来整理。
功能:距离、相对方向(左/右/前/后)、是否相交/包含、缓冲区、轨迹长度等。
工具组合:
Shapely(你已经在用):
提供 distance, intersects, contains, buffer, union 等。

自写一个轻量 RelationEngine:把几何量映射为 close-to, far-from, intersect, inside, around 等关系词。

适用任务:
在 YOLOv8-OBB 基础上,可以再引入一些专门模型:
mmdetection 训练,然后离线推理。适用任务:
不一定要训练变化检测大模型,传统 CV + 简单阈值 就能给出很强的辅助证据:
OpenCV / scikit-image:
ChangeDetector:
```json
{
  "change_regions": [
    { "category": "new-building", "bbox": [...], "score": 0.88 },
    { "category": "demolished-building", "bbox": [...], "score": 0.82 }
  ],
  "global_change_ratio": 0.12
}
```
适用任务:
你提到“任务调度不要局限于现有例子,许多任务无法用已有工具解决”,这里可以定义一些逻辑型 / 规则型工具,只需 Python 就能实现:
TaskGraphPlanner
networkx 或自写拓扑排序算法,离线零成本。AStarPathPlanner / GridPlanner
RuleBasedScheduler
这些工具非常适合你想要的“高认知”任务:大模型负责读任务文本、把任务拆成节点和约束;Scheduler / Planner 负责计算顺序和路径;再由大模型用自然语言解释结果。
除了 CLIP,你还可以用一些 中等大小、仍可离线部署的模型:
sentence-transformers(如 all-MiniLM-L6-v2)做文本 embedding,用于指令相似度、子任务聚类。适用任务:
你已经有:
可以再增加一个轻量工具:
适用任务:
适用任务:
自写工具工作量感觉有点大,有没有现成的,学术界认可度比较高的工具罗列汇总一下
可以,完全不用“从零自写一堆小工具”。你可以优先把“学术界公认 + 社区活跃 + 可离线”的工具当作你Agent 的工具箱,再在上面包一层自己的轻量接口。
我帮你按任务类型整理一套“现成工具清单”,每个都写用途、优缺点和适合你哪些任务。你后面只需要给它们起一个统一的工具名,比如 Tool_OpenCV_Morph, Tool_OTB_Change, 把参数和 JSON 包一包就能用。
建议:你可以把很多“自写小工具”降级为“OpenCV + 少量 Python 逻辑”,比如
ChangeDetector_OpenCV,RegionMorph_OpenCV。
如果你想给 “变化检测任务” 配一组被学界认可的工具,以下几个是典型代表:
- Open-CD:可封装为 Tool_OpenCD_ChangeModel。
- Awesome Remote Sensing Change Detection 是一个大列表,方便你挑模型和代码实现(GitHub)。

对你来说,Open-CD + Orfeo/OTB + OpenCV 就足够构建非常“学术硬”的变化检测工具组合。
你要做轨迹规划、导航、空间关系,可以利用这些成熟的地理库,而不是自己造轮子:
(你本来就用 Shapely,建议直接把 GeoPandas 也用上,方便“批量空间计算”。)
这些都很轻量、不需要 GPU。
你已经有 YOLOv8,实际上可以把下层框架也列为工具,以便扩展更多模型。
Tool_D2_MaskRCNN 来用。对你来说,不必“直接暴露这些框架给模型”,而是:在这些框架内部训练各种模型 → 对外只暴露统一的 JSON 工具接口(比如
Tool_Detector_Building)。
你已经有 CLIP ViT-B/32,这是非常合适的 base line。
在需要更轻量或中文语义时,可以考虑:
例如:
- Tool_OpenCV_ChangeMask:inputs: {img_t1_id, img_t2_id} → outputs: {binary_mask, change_bbox_list}
- Tool_OTB_SARFilter
- Tool_OpenCD_ChangeModel
- Tool_GeoPandas_SpatialRel:outputs 包含 relation: close-to / inside / intersect 以及 distance_m

所以工作量主要在:
如果你想在短时间内搭出“看起来很 academic 的工具体系”,我会建议:
如果你愿意,下一步我可以帮你:
考虑到算力的限制,优先使用哪些工具性价比最高?
在算力受限(尤其是:希望离线可跑、最好 CPU/轻 GPU 就能跑、且学术界认可度高)的前提下,你要的不是“工具越多越好”,而是一套性价比最高、覆盖任务最广、维护成本最低的工具组合。
下面我给你一个“优先级工具栈”,按 **必须优先(Tier-0)→ 强烈推荐(Tier-1)→ 按需可选(Tier-2)** 分层,并说明它们分别能覆盖你数据集的哪些任务(变化检测、关系推理、指令分解、轨迹规划、多模态感知等)。
OpenCV + Shapely/GeoPandas + 小型检测器(YOLOv8n/YOLOv5n) + 轻量 CLIP(ViT-B/32 或 RN50) + 离线地名库(Geocoder)
NetworkX(图搜索/调度/路径规划) + scikit-image(变化检测更强) + Rasterio(地理栅格读写)
NetworkX:
scikit-image:
Rasterio:
OTB(Orfeo ToolBox) / Open-CD / SNAP / Detectron2
这些是“学术认可度更高但工程成本略大”的选项:
你的任务里至少有三类,现有四工具支撑不够:
下面是我推荐你优先使用的工具清单,并注明主要功能与算力需求:
| 工具 | 主要功能 | 离线 | 算力 | 学术认可度 | 推荐等级 |
|---|---|---|---|---|---|
| OpenCV | 差分、阈值、形态学、配准、特征点 | ✅ | CPU | ⭐⭐⭐⭐⭐ | Tier-0 |
| scikit-image | 连通域、区域特征、mask处理、分割 | ✅ | CPU | ⭐⭐⭐⭐⭐ | Tier-1 |
| YOLOv8n / YOLOv5n | 检测/计数(光学/IR/SAR可微调) | ✅ | CPU/轻GPU | ⭐⭐⭐⭐⭐ | Tier-0 |
| CLIP ViT-B/32 或 RN50 | 场景分类/语义对齐 | ✅ | GPU更快/CPU可用 | ⭐⭐⭐⭐⭐ | Tier-0 |
| Shapely + GeoPandas | 距离、方向、相交、包含、缓冲区 | ✅ | CPU | ⭐⭐⭐⭐⭐ | Tier-0 |
| Rasterio | GeoTIFF读写,坐标系,栅格处理 | ✅ | CPU | ⭐⭐⭐⭐ | Tier-1 |
| NetworkX | A*/Dijkstra 路径规划、任务DAG调度 | ✅ | CPU | ⭐⭐⭐⭐ | Tier-1 |
| Sentence-Transformers MiniLM | 指令分解/子任务匹配 | ✅ | CPU | ⭐⭐⭐⭐ | Tier-2(看需求) |
| OTB | 遥感经典处理(纹理、滤波、分类) | ✅ | CPU | ⭐⭐⭐⭐⭐ | Tier-2 |
| Open-CD | 深度变化检测 SOTA | ✅ | GPU | ⭐⭐⭐⭐⭐ | Tier-2 |
按你数据集的任务类型,我给一个“最省算力的调用组合”:
如果你要用最小的工具集覆盖最多任务,而且未来还要扩展:
✅ OpenCV + scikit-image + YOLOv8n + CLIP RN50/ViT-B32 + Shapely/GeoPandas + NetworkX + Geocoder(离线库)
你会发现这套组合:
我可以继续帮你做两件最关键的事情:
你只要回复我:
我就可以给你一份完整的“工具规范表 + 任务 recipe 总表”,直接用于你项目骨架里的 configs。
为这套“低算力工具栈”建立统一的工具 JSON I/O 规范
下面给你一套**“低算力工具栈”的统一 Tool JSON I/O 规范**(可直接放进你的 configs/schema/tool_io.schema.json 之类的文件中),目标是:
这套规范的设计目标:

- 与 RSThinker 继承的解析约束兼容(四段用 "\n\n\n" 分隔、辅助信息不直接出现在 CoT 里等)。(21RSThinker Extract[17-27].pdf)
- 工具失败时返回 ok=false + error_msg,并给 result.fallback(让 trace 能降级)。
- 显式声明 crs、units、pixel/geo、axis_order 等坐标约定。

通用 Action 信封:

```json
{
  "tool": "TOOL_NAME",
  "version": "1.0",
  "request_id": "uuid-or-snowflake",
  "args": {},
  "context": {
    "sample_id": "optional",
    "image_id": "optional",
    "modality": "optical|infrared|sar|multimodal",
    "timestamp": "optional-iso8601"
  }
}
```
通用 Observation 信封:

```json
{
  "tool": "TOOL_NAME",
  "version": "1.0",
  "request_id": "same-as-action",
  "ok": true,
  "error_msg": null,
  "input_args": {},
  "result": {},
  "metrics": {
    "runtime_ms": 0,
    "device": "cpu|cuda|na",
    "notes": "optional"
  }
}
```
说明
input_args:建议直接 echo args(方便审计/回放)metrics.runtime_ms 可选(你做“模拟输出”时也能填 0)json{ "Region": { "type": "full|bbox|poly|mask_rle|pair_bbox", "bbox_pixel": [x1, y1, x2, y2], "poly_pixel": [[x,y],...], "mask_rle": "string", "bbox_pair_pixel": [[x1,y1,x2,y2],[x1,y1,x2,y2]] } }
- crs: "pixel" | "epsg:4326" | "epsg:3857" | "utm:xxN" | "unknown"
- units: "pixel" | "meter" | "degree" | "unknown"

下面每个工具都给出 Action.args 与 Observation.result 的固定字段。
DET_YOLO(Axis-aligned bbox)

Action.args:

```json
{
  "image_id": "string",
  "region": { "type": "full" },
  "class_filter": ["optional"],
  "score_threshold": 0.25,
  "max_dets": 200
}
```

Observation.result:

```json
{
  "detections": [
    {
      "det_id": "d1",
      "class_name": "vehicle",
      "confidence": 0.91,
      "bbox_xyxy_pixel": [x1, y1, x2, y2]
    }
  ],
  "summary": { "num_detections": 1, "classes": { "vehicle": 1 } }
}
```
DET_YOLO_OBB(Oriented bbox)

Action.args 同上,Observation.result 改为 bbox_obb_pixel:

```json
{
  "detections": [
    {
      "det_id": "d1",
      "class_name": "ship",
      "confidence": 0.88,
      "bbox_obb_pixel": { "cx": 0, "cy": 0, "w": 0, "h": 0, "angle_deg": 0 }
    }
  ],
  "summary": { "num_detections": 1 }
}
```
SEM_CLIP_RANK

Action.args:

```json
{
  "image_id": "string",
  "region": { "type": "full|bbox|pair_bbox", "bbox_pixel": [...] },
  "candidate_texts": ["a ship scene", "an airport"],
  "top_k": 5,
  "normalize": true
}
```

Observation.result:

```json
{
  "scores": [
    { "text": "a ship scene", "score": 0.81, "rank": 1 },
    { "text": "an airport", "score": 0.22, "rank": 2 }
  ],
  "best_label": "a ship scene",
  "score_type": "cosine",
  "comment": "optional"
}
```
CV_CHANGE_MASK(差分+阈值+形态学)

Action.args:

```json
{
  "image_t1_id": "string",
  "image_t2_id": "string",
  "preprocess": { "blur_ksize": 5, "hist_match": false },
  "diff": { "method": "absdiff|ssim", "channel": "gray|rgb|sar_amp" },
  "threshold": { "method": "otsu|fixed", "value": 25 },
  "morph": { "open": 3, "close": 5 },
  "min_region_area_px": 50
}
```

Observation.result:

```json
{
  "change_mask_rle": "string",
  "change_regions": [
    {
      "region_id": "c1",
      "bbox_xyxy_pixel": [x1, y1, x2, y2],
      "area_px": 1234,
      "score": 0.77
    }
  ],
  "global_change_ratio": 0.034
}
```
这个工具能支撑:变化检测、局部变化定位、后续 GDAL/Shapely 面积统计、甚至给 YOLO 一个 ROI 去检测“新出现目标”。
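CV_CHANGE_MASK 的核心链路(差分 → 阈值 → 连通域 → 外接框)可以先用纯 Python 验证逻辑,真实实现换成 OpenCV 即可;下面是一个最小示意(阈值与 min_area 为示意值):

```python
from collections import deque

def change_regions(img_t1, img_t2, thr=25, min_area=2):
    # img_*: 等大的二维灰度列表;返回 (全局变化比例, 变化区域外接框列表 [x1,y1,x2,y2])
    h, w = len(img_t1), len(img_t1[0])
    mask = [[abs(img_t1[y][x] - img_t2[y][x]) > thr for x in range(w)] for y in range(h)]
    seen = [[False] * w for _ in range(h)]
    boxes, changed = [], 0
    for y in range(h):
        for x in range(w):
            if not mask[y][x] or seen[y][x]:
                continue
            # BFS 提取一个四连通的变化区域
            q, comp = deque([(y, x)]), []
            seen[y][x] = True
            while q:
                cy, cx = q.popleft()
                comp.append((cy, cx))
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = cy + dy, cx + dx
                    if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        q.append((ny, nx))
            changed += len(comp)
            if len(comp) >= min_area:   # 过滤噪声小区域
                ys = [p[0] for p in comp]
                xs = [p[1] for p in comp]
                boxes.append([min(xs), min(ys), max(xs), max(ys)])
    return changed / (h * w), boxes
```

输出的外接框可以直接喂给 GEO_AREA 做面积统计,或作为 ROI 交给 YOLO 在 t2 上复检。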
GEO_RELATION(距离/方向/拓扑关系)

Action.args:

```json
{
  "crs": "pixel|epsg:4326",
  "units": "pixel|meter",
  "geom_a": { "type": "bbox|poly", "bbox_pixel": [...], "poly_pixel": [...] },
  "geom_b": { "type": "bbox|poly", "bbox_pixel": [...], "poly_pixel": [...] },
  "operations": ["distance", "direction", "intersects", "contains", "iou"]
}
```

Observation.result:

```json
{
  "distance": { "value": 15.8, "units": "pixel" },
  "direction": { "deg": 40.0, "cardinal": "NE" },
  "topology": { "intersects": false, "contains": false },
  "iou": 0.0
}
```
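GEO_RELATION 的仿真也可以不依赖 GDAL/Shapely,先用纯 Python 对 bbox 产出与上面 result 同构的字段;下面是一个最小示意(方位角以 x 正向为 0°、逆时针为正,这一约定是假设):

```python
import math

def geo_relation(bbox_a, bbox_b):
    # 输入像素坐标 [x1, y1, x2, y2],输出与 GEO_RELATION Observation.result 同构的 dict
    ax = (bbox_a[0] + bbox_a[2]) / 2.0
    ay = (bbox_a[1] + bbox_a[3]) / 2.0
    bx = (bbox_b[0] + bbox_b[2]) / 2.0
    by = (bbox_b[1] + bbox_b[3]) / 2.0
    deg = math.degrees(math.atan2(-(by - ay), bx - ax)) % 360  # 图像 y 轴向下,取负号
    cardinal = ["E", "NE", "N", "NW", "W", "SW", "S", "SE"][int((deg + 22.5) // 45) % 8]
    # 相交区域与 IoU
    ix1, iy1 = max(bbox_a[0], bbox_b[0]), max(bbox_a[1], bbox_b[1])
    ix2, iy2 = min(bbox_a[2], bbox_b[2]), min(bbox_a[3], bbox_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(bbox_a) + area(bbox_b) - inter
    contains = (bbox_a[0] <= bbox_b[0] and bbox_a[1] <= bbox_b[1]
                and bbox_a[2] >= bbox_b[2] and bbox_a[3] >= bbox_b[3])
    return {
        "distance": {"value": round(math.hypot(bx - ax, by - ay), 1), "units": "pixel"},
        "direction": {"deg": round(deg, 1), "cardinal": cardinal},
        "topology": {"intersects": inter > 0, "contains": contains},
        "iou": round(inter / union, 3) if union else 0.0,
    }
```

换成真实几何库时只需保证返回字段不变,下游 Renderer 与质检代码完全不用改。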
GEO_AREA(面积/周长)

Action.args:

```json
{
  "crs": "pixel|epsg:4326",
  "units": "pixel|meter",
  "geom": { "type": "poly|mask_rle", "poly_pixel": [...], "mask_rle": "..." },
  "return": ["area", "perimeter"]
}
```

Observation.result:

```json
{
  "area": { "value": 10234, "units": "pixel^2" },
  "perimeter": { "value": 560.2, "units": "pixel" }
}
```
RASTER_SAMPLE(栅格采样/统计)

Action.args:

```json
{
  "raster_id": "string",
  "region": { "type": "bbox|poly", "bbox_pixel": [...] },
  "stats": ["mean", "std", "min", "max"],
  "band": 1
}
```

Observation.result:

```json
{
  "stats": { "mean": 0.43, "std": 0.11, "min": 0.02, "max": 0.91 },
  "band": 1
}
```
SAR/IR 场景下,RASTER_SAMPLE 能提供“亮度/后向散射”分布证据(算力很低)。
GEO_REVERSE_GEOCODE

Action.args:

```json
{
  "lat": 0.0,
  "lon": 0.0,
  "levels": ["country", "admin1", "admin2", "city"],
  "poi_top_k": 5,
  "poi_radius_m": 3000
}
```

Observation.result:

```json
{
  "admin": {
    "country": "string",
    "admin1": "string",
    "admin2": "string",
    "city": "string"
  },
  "poi": [
    { "name": "airport", "category": "transport", "distance_m": 1200 },
    { "name": "port", "category": "transport", "distance_m": 2400 }
  ]
}
```
PLAN_GRAPH_SHORTEST_PATH(A*/Dijkstra)

输入:

```json
{ "graph_id": "string", "start": "node_id|string|[x,y]", "goal": "node_id|string|[x,y]", "algorithm": "astar|dijkstra", "edge_weight": "cost", "constraints": {"avoid_nodes": [], "avoid_edges": []} }
```

输出:

```json
{ "path": ["n1","n7","n9","n20"], "path_cost": 123.4, "num_steps": 4 }
```
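这个工具的 dijkstra 分支可以用标准库 heapq 粗略示意(仅示意:edges 的邻接表格式是这里假设的简化,avoid_nodes/avoid_edges 约束省略):

```python
import heapq

def shortest_path(edges, start, goal):
    """edges: {node: [(neighbor, cost), ...]};返回 (path, path_cost),对应工具输出字段。"""
    dist = {start: 0.0}
    prev = {}
    pq = [(0.0, start)]
    seen = set()
    while pq:
        d, u = heapq.heappop(pq)
        if u in seen:
            continue
        seen.add(u)
        if u == goal:
            break
        for v, w in edges.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):   # 松弛
                dist[v] = nd
                prev[v] = u
                heapq.heappush(pq, (nd, v))
    if goal not in dist:
        return None, float("inf")
    path = [goal]
    while path[-1] != start:                     # 回溯前驱得到路径
        path.append(prev[path[-1]])
    return path[::-1], dist[goal]
```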
SCHED_DAG_TOPO_SORT(任务依赖调度)

输入:

```json
{ "tasks": [ {"id":"t1","requires":[]}, {"id":"t2","requires":["t1"]} ], "tie_breaker": "fifo|priority", "priority": {"t1": 1, "t2": 2} }
```

输出:

```json
{ "order": ["t1","t2"], "valid": true }
```
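SCHED_DAG_TOPO_SORT 对应的是带 tie-breaker 的 Kahn 拓扑排序,order/valid 两个输出字段可以这样示意(仅示意,tie_breaker 只实现了 "priority" 一种,priority 越小越先执行):

```python
import heapq

def dag_topo_sort(tasks, priority=None):
    """tasks: [{"id": ..., "requires": [...]}];有环时 valid=False。"""
    priority = priority or {}
    requires = {t["id"]: set(t["requires"]) for t in tasks}
    dependents = {t["id"]: [] for t in tasks}
    for t in tasks:
        for r in t["requires"]:
            dependents[r].append(t["id"])
    # 入度为 0 的任务进入就绪堆,按 priority 破平
    ready = [(priority.get(i, 0), i) for i, req in requires.items() if not req]
    heapq.heapify(ready)
    order = []
    while ready:
        _, u = heapq.heappop(ready)
        order.append(u)
        for v in dependents[u]:
            requires[v].discard(u)
            if not requires[v]:
                heapq.heappush(ready, (priority.get(v, 0), v))
    return {"order": order, "valid": len(order) == len(tasks)}
```

Python 3.9+ 标准库其实自带 graphlib.TopologicalSorter,这里手写只是为了把 tie-breaker 写显式。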
工具注册表(汇总):

```json
{
  "DET_YOLO": {"version": "1.0", "offline": true, "device": "cpu/cuda", "io_schema": "det_yolo_v1"},
  "DET_YOLO_OBB": {"version": "1.0", "offline": true, "device": "cpu/cuda", "io_schema": "det_yolo_obb_v1"},
  "SEM_CLIP_RANK": {"version": "1.0", "offline": true, "device": "cpu/cuda", "io_schema": "sem_clip_rank_v1"},
  "CV_CHANGE_MASK": {"version": "1.0", "offline": true, "device": "cpu", "io_schema": "cv_change_mask_v1"},
  "GEO_RELATION": {"version": "1.0", "offline": true, "device": "cpu", "io_schema": "geo_relation_v1"},
  "GEO_AREA": {"version": "1.0", "offline": true, "device": "cpu", "io_schema": "geo_area_v1"},
  "RASTER_SAMPLE": {"version": "1.0", "offline": true, "device": "cpu", "io_schema": "raster_sample_v1"},
  "GEO_REVERSE_GEOCODE": {"version": "1.0", "offline": true, "device": "cpu", "io_schema": "geo_reverse_geocode_v1"},
  "PLAN_GRAPH_SHORTEST_PATH": {"version": "1.0", "offline": true, "device": "cpu", "io_schema": "plan_graph_shortest_path_v1"},
  "SCHED_DAG_TOPO_SORT": {"version": "1.0", "offline": true, "device": "cpu", "io_schema": "sched_dag_topo_sort_v1"}
}
```
所有段落仍用 "\n\n\n" 分隔(这是你继承 RSThinker 的关键解析约束)。(21RSThinker Extract[17-27].pdf)

如果你希望我下一步继续,我可以把上面这份规范直接整理成 2 个文件的“可复制版本”:
- `tool_io.schema.json`(JSON Schema,可做自动校验)
- `tool_registry.json`(工具清单 + 版本)

你更偏好中文工具名还是英文工具名(比如 CV_CHANGE_MASK vs OpenCVChangeMask)?
1. 还有一点问题:如果按照这个思路去训练模型,模型的正确性就完全依赖于调用工具返回结果的正确性;如果工具返回的结果不准确,会直接影响模型的分析结果,这个问题应该如何解决?
2. 训练集中对应每一种任务,是不是还需要设计一些“工具调用结果有错误、但在思维链中通过反思回溯能检查出来并纠正”的 CoT,来模仿真实测试过程中调用工具出错的情况?
3. 对于多模态任务,这个 YOLOv8 能适应不同模态的数据吗?会不会在非光学的情况下检测正确率很低?
下面按你 3 个问题分别给出可操作的解决方案(会把你材料里“先用完美伪工具输出做SFT、再加调度器引入真实工具”的思路一起整合进来)。(数据集改进.txt)
这个问题本质是:推理模型(LLM)与感知/测量工具之间存在误差传递。工程上通常用“分层鲁棒性 + 冗余证据 + 校验器”解决,而不是指望单个工具永远正确。
这正好对应你材料里“先合成伪工具输出,再加调度器接真实工具”的系统路线。(数据集改进.txt)
每类关键结论尽量由两种不同类型证据支持,例如:目标类别同时有 YOLO 的检测框和 CLIP 的语义分布支持,空间关系同时有坐标计算和拓扑判断支持。
这样即使某个工具误检,也不至于直接把最终答案带偏。
在系统侧加一个轻量 validator(不靠LLM):
你材料里已经设计了类似的可量化指标:对象识别、框精度、工具调用等,都可以直接用来做 validator/筛选。(数据集改进.txt)
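这种 validator 的最小形态,就是“把工具声称的数值重算一遍”。下面以 distance 自洽性为例给个示意(`tol` 阈值是假设的超参数,`check_distance_consistency` 是随手取的函数名):

```python
import math

def check_distance_consistency(action, observation, tol=1.0):
    """重算 p1->p2 的欧氏距离,与 observation 里的 pixel_distance 比对;不一致则该步不可信。"""
    (x1, y1), (x2, y2) = action["p1"], action["p2"]
    expected = math.hypot(x2 - x1, y2 - y1)
    reported = observation["pixel_distance"]
    return {"expected": expected,
            "reported": reported,
            "consistent": abs(expected - reported) <= tol}
```

同样的思路可以扩展到 bearing(重算方位角)和 topology(重算 contains/intersects),全部是确定性计算,不需要 LLM 参与。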
要,而且非常值得做。
原因:你真实测试时不可能永远工具全对;如果训练全是“完美工具”,模型一遇到冲突就不会处理。
但要注意做法:不是随机乱错,而是“可被检查出来的错”,并且让 <reflexion> 触发“纠错/降权/重试/换工具”。
给每条样本生成两个版本:clean 版(工具输出与真值完全一致)和 noisy 版(注入可被检查出来的工具错误)。
然后在 <reflexion> 里固定一套“纠错策略模板”,覆盖纠错、降权、重试、换工具四类动作的标准写法。
这会显著提升你材料里提到的“鲁棒性和自我修正”维度。(数据集改进.txt)
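“可被检查出来的错”最好由一个确定性的注入器生成,并同时落盘 error_record,供 <reflexion> 模板生成对应的纠错文本。一个最小示意(字段名沿用前文的 bbox_obb;注入哪些样本由采样器决定,这里简化为显式传入 target_ids,漂移幅度是假设值):

```python
import random

def inject_bbox_noise(detections, target_ids, shift_px=40, seed=0):
    """对 target_ids 中的检测结果注入可检出的 bbox 漂移,返回 (noisy, error_records)。"""
    rng = random.Random(seed)                             # 固定种子,保证可复现
    noisy, errors = [], []
    for det in detections:
        det = {**det, "bbox_obb": dict(det["bbox_obb"])}  # 浅拷贝,不修改原始数据
        if det["id"] in target_ids:
            dx = rng.choice([-shift_px, shift_px])        # 漂移方向随机但可复现
            det["bbox_obb"]["cx"] += dx
            errors.append({"id": det["id"], "type": "bbox_shift", "dx": dx})
        noisy.append(det)
    return noisy, errors
```

error_records 正好就是 reflexion 模板的“标准答案”:模型应当在反思阶段指出 id=1 的框与其他证据冲突。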
结论:原生 YOLO(拿 RGB 预训练权重直接跑)在 IR/SAR 上通常会明显掉点,尤其是 SAR(散斑、几何畸变、成像机理完全不同)。
但 YOLO 作为框架本身是通用的——关键在数据与训练方式。
- 三套权重分开:DET_YOLO_OPTICAL / DET_YOLO_IR / DET_YOLO_SAR
- `<planning>` 里根据 modality 选择对应 detector(这也更符合“工具调度/选择”的训练目标)
- 残余的检测误差在 `<reflexion>` 里纠正

如果你愿意,我可以把“noisy-CoT 注入策略”具体化成一份 Noise Profile 配置文件(哪些字段怎么错、错多少、如何在 reflexion 里纠正),并给你一个可直接集成进项目骨架的生成器接口。
| 工具名称 (Tool) | 功能描述 (Function) | 输入指令规范 (Input Directive / Action) | 输出数据规范 (Output JSON Schema) | 核心字段说明 (Key Fields) |
| --------------------------------- | -------------------------- | ---------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------ |
| 1. 视觉感知<br>(YOLOv8-OBB) | 检测指定类别的目标,返回带旋转角度的边框。 | Action: YOLOv8-OBB(<br>image_id="...",<br>classes=["...", "..."],<br>conf_thres=0.3) | [<br> {<br> "id": 1,<br> "class": "ship",<br> "conf": 0.92,<br> "bbox_obb": {"cx":.., "cy":.., "angle": 45.0}<br> }<br>] | angle: 推断物体朝向。<br>conf: 置信度高低。 |
| 2. 语义理解<br>(CLIP ViT-B/32) | 对整体或局部区域进行 Zero-shot 类别预测。 | Action: CLIP(<br>image_id="...",<br>candidate_labels=["A", "B"],<br>crop_bbox=[x1,y1,x2,y2]) | {<br> "top_label": "A",<br> "top_conf": 0.75,<br> "distribution": {"A": 0.75, "B": 0.25}<br>} | distribution: 返回概率分布。 |
| 3. 空间计算<br>(GDAL + Shapely) | 计算目标的物理距离、面积或拓扑关系。 | Action: GDAL-Shapely(<br>operation="distance/area",<br>obj_ids=[1, 2]) | {<br> "pixel_value": 850,<br> "physical_value": 425.5,<br> "unit": "meters",<br> "relation": "disjoint"<br>} | physical_value + unit: 返回真实物理单位 。 |
| 4. 地理知识<br>(Reverse Geocoder) | 获取图像位置及周边的 POI(兴趣点)信息。 | Action: Reverse Geocode(<br>image_id="...",<br>query_type="context") | {<br> "location": {"addr": "..."},<br> "nearby_pois": [{"name": "Airport", "dist": 5.2}]<br>} | nearby_pois: 提供环境语境(如附近有机场)。 |
工具集又补充了路径规划类任务工具Trajectory_Planner和变化检测类任务工具BIT,yolov8补充上模态选择,请仿照之前的例子补充完之前的表格
很好,这里我只做你这一件事:
👉 在你现有表格风格不变的前提下,把工具体系补全,包括:新增 Trajectory_Planner(路径规划)、新增 BIT(变化检测),以及给 YOLOv8-OBB 补上 modality(模态选择)。
我不会引入新风格,也不会改你已有字段命名。
| 工具名称 (Tool) | 功能描述 (Function) | 输入指令规范 (Input Directive / Action) | 输出数据规范 (Output JSON Schema) | 核心字段说明 (Key Fields) |
|---|---|---|---|---|
| 1. 视觉感知<br>(YOLOv8-OBB) | 在指定模态下检测目标,返回带旋转角度的边框。支持 Optical / SAR / IR。 | Action: YOLOv8-OBB(<br>image_id="...",<br>modality="optical/sar/ir",<br>classes=["...", "..."],<br>conf_thres=0.3) | [<br>{<br>"id": 1,<br>"class": "ship",<br>"conf": 0.92,<br>"bbox_obb": {<br>"cx": 512.3,<br>"cy": 384.7,<br>"w": 120.5,<br>"h": 38.2,<br>"angle": 45.0<br>}<br>}<br>] | modality:决定使用的检测权重。<br>angle:目标朝向(遥感关键)。<br>conf:检测可信度,用于反思阶段筛选。 |
🔹 说明:
- 数据集生成阶段:你用 GT → 生成 bbox_obb
- 真实推理阶段:调度器根据 modality 选择对应权重(YOLOv8-OBB-Opt / SAR / IR)
| 2. 语义理解<br>(CLIP ViT-B/32) | 对整体或局部区域进行 Zero-shot 语义预测与对齐。 | Action: CLIP(<br>image_id="...",<br>candidate_labels=["A","B"],<br>crop_bbox=[x1,y1,x2,y2]) | {<br>"top_label": "A",<br>"top_conf": 0.75,<br>"distribution": {<br>"A": 0.75,<br>"B": 0.25<br>}<br>} | distribution:完整概率分布,用于不确定性建模与冲突检测。 |
| 3. 空间计算<br>(GDAL + Shapely) | 计算目标的距离、面积或拓扑关系(像素→物理单位)。 | Action: GDAL-Shapely(<br>operation="distance/area/iou/contains",<br>obj_ids=[1,2]) | {<br>"pixel_value": 850,<br>"physical_value": 425.5,<br>"unit": "meters",<br>"relation": "disjoint"<br>} | physical_value + unit:支撑定量推理。<br>relation:拓扑关系证据。 |
| 4. 地理知识<br>(Reverse Geocoder) | 根据经纬度返回行政区与周边 POI 语境。 | Action: Reverse_Geocode(<br>image_id="...",<br>query_type="context") | {<br>"location": {<br>"country": "...",<br>"admin1": "...",<br>"city": "..."<br>},<br>"nearby_pois": [<br>{"name":"Airport","dist":5.2}<br>]<br>} | nearby_pois:高层语义证据(港口/机场/工业区)。 |
| 5. 路径规划<br>(Trajectory_Planner) | 根据起点、终点和约束生成可行轨迹或航线。 | Action: Trajectory_Planner(<br>start_pose=[x,y,z,yaw,pitch,roll],<br>goal_pose=[x,y,z],<br>constraints=["fly_over_buildings"]) | {<br>"trajectory": [<br>[690,660,492,265,704,108],<br>[636,712,492,279,711,104],<br>...<br>],<br>"length": 1350.2,<br>"status": "success"<br>} | trajectory:核心输出,用于导航类任务。<br>status:失败可触发回溯。 |
🔹 说明:
- 底层实现可以是 A* / Dijkstra / NetworkX
- 数据集阶段直接用 GT 轨迹生成
| 6. 变化检测<br>(BIT) | 对多时相图像进行变化检测,输出变化区域与类型。 | Action: BIT_ChangeDetect(<br>image_t1="...",<br>image_t2="...",<br>change_type="auto") | {<br>"change_mask": "rle_string",<br>"change_regions": [<br>{<br>"id":1,<br>"bbox":[x1,y1,x2,y2],<br>"type":"new-building",<br>"score":0.87<br>}<br>],<br>"change_ratio": 0.12<br>} | change_regions:局部变化证据。<br>change_ratio:全局变化强度。 |
🔹 说明:
- 数据集生成:由 GT mask / polygon 转写
- 真实推理:BIT / OpenCD / 传统差分都可接
我强烈建议你下一步二选一:

👉 A.「工具调用错误注入规范表」(例如:YOLO 漏检 / BIT 误报 / Trajectory 失败 → reflexion 怎么写)

👉 B.「每类任务 × 工具调用最小模板(recipe)」(变化检测 / 关系推理 / 导航各一套)

你只要告诉我选 A 还是 B。
下面我严格按你的要求来做这件事,不引入任何新工具、不改工具名、不改 CoT 框架、不改样例语义,只是把每一种任务的“典型思维链分析思路”显式写成可训练的 CoT 样例。
我遵循的硬约束是:
- 只用这 6 个工具名:YOLOv8-OBB / CLIP / GDAL-Shapely / Reverse Geocoder / Trajectory_Planner / BIT
- 阶段顺序固定:`<planning> → <tool_execution> → <reflexion> → <synthesis>`

```xml
<planning>
This task requires determining the relationship between two spatially specified objects. The primary goal is to identify the category of each object and infer their spatial relationship. Since the objects are explicitly localized by bounding boxes, I should first confirm their object categories. Then, I need to measure their spatial proximity to determine whether they are adjacent, overlapping, or far apart. YOLOv8-OBB is suitable for object category confirmation, and GDAL-Shapely can be used to compute spatial distance and relation.
</planning>
<tool_execution>
Action: YOLOv8-OBB(
  image_id="02102_600_800_0_0",
  modality="optical",
  classes=["building"],
  conf_thres=0.3
)
Observation: [
  { "id": 1, "class": "building", "conf": 0.93, "bbox_obb": {"cx": 773, "cy": 721, "w": 185, "h": 131, "angle": 0} },
  { "id": 2, "class": "building", "conf": 0.91, "bbox_obb": {"cx": 954, "cy": 677, "w": 85, "h": 78, "angle": 0} }
]
Action: GDAL-Shapely(
  operation="distance",
  obj_ids=[1,2]
)
Observation: {
  "pixel_value": 182, "physical_value": 91.0, "unit": "meters", "relation": "disjoint"
}
</tool_execution>
<reflexion>
Both detected objects are classified as buildings with high confidence. The measured distance between the two objects is relatively small compared to the image scale, and they do not overlap. This suggests a close spatial proximity rather than containment or separation. The geometric relation is consistent with a near-adjacent configuration.
</reflexion>
<synthesis>
The reasoning integrates object classification from detection and quantitative spatial measurement. The small separation distance supports interpreting the two buildings as being spatially close. Both semantic and geometric evidence align to support a close proximity relationship.
</synthesis>
```
```xml
<planning>
This task requires decomposing a high-level spatial analysis instruction into explicit procedural steps. The region of interest is given, so I should first locate it within the image. Next, I need to detect objects within that region and identify their categories. Then, I should analyze spatial relationships among the detected objects. Finally, I need to summarize the overall context based on detection and relations. YOLOv8-OBB will be used for object detection, and GDAL-Shapely will support relation analysis.
</planning>
<tool_execution>
Action: YOLOv8-OBB(
  image_id="06901_1000_1000_0_0",
  modality="optical",
  classes=["tennis-court","building"],
  conf_thres=0.3
)
Observation: [
  {"id":1,"class":"tennis-court","conf":0.94,"bbox_obb":{"cx":632,"cy":397,"w":102,"h":84,"angle":0}},
  {"id":2,"class":"building","conf":0.92,"bbox_obb":{"cx":708,"cy":466,"w":226,"h":192,"angle":0}},
  {"id":3,"class":"building","conf":0.91,"bbox_obb":{"cx":793,"cy":472,"w":173,"h":182,"angle":0}}
]
Action: GDAL-Shapely(
  operation="distance",
  obj_ids=[2,3]
)
Observation: {
  "pixel_value": 85, "physical_value": 42.5, "unit": "meters", "relation": "disjoint"
}
</tool_execution>
<reflexion>
The detected objects include one sports facility and two buildings. The distance between the two buildings is relatively small, suggesting a close spatial relationship. The results are consistent across detection and geometric measurement. The information is sufficient to decompose the task into detection, relation analysis, and contextual summarization steps.
</reflexion>
<synthesis>
The reasoning follows a structured pipeline: region localization, object detection, relation inference, and context summarization. Each step contributes incremental information, enabling a clear decomposition of the original instruction.
</synthesis>
```
```xml
<planning>
This task requires generating a flight trajectory for a quadcopter. I need to extract the target description and identify landmark references. The landmark coordinates provide a spatial anchor to infer the target building location. Once the target position is determined, a feasible trajectory from the start pose to the target must be planned. Trajectory_Planner is the appropriate tool for computing the flight path.
</planning>
<tool_execution>
Action: Reverse_Geocoder(
  image_id="birmingham_block_1",
  query_type="context"
)
Observation: {
  "location":{"city":"Birmingham"},
  "nearby_pois":[{"name":"Leslie Road","dist":0.1}]
}
Action: Trajectory_Planner(
  start_pose=[690,660,492,265,704,108],
  goal_pose=[917,897,73],
  constraints=["fly_over_buildings"]
)
Observation: {
  "trajectory":[
    [690,660,492,265,704,108],
    [636,712,492,279,711,104],
    [508,814,423,110,717,274],
    [917,902,49,828,300,181]
  ],
  "length":1350.2,
  "status":"success"
}
</tool_execution>
<reflexion>
The trajectory begins at the specified start pose and progressively moves toward the inferred target location. Altitude changes indicate clearance over buildings, satisfying task constraints. The planned path is continuous and terminates near the target position.
</reflexion>
<synthesis>
By grounding the target description using geographic context and planning a constrained path, the reasoning produces a coherent navigation strategy. The resulting trajectory satisfies spatial and task requirements.
</synthesis>
```
```xml
<planning>
This task focuses on detecting and locating objects in an infrared image. The objective is to identify fine-grained object categories and their coordinates. YOLOv8-OBB with infrared modality is suitable for this detection task.
</planning>
<tool_execution>
Action: YOLOv8-OBB(
  image_id="IR_ship_5340_0",
  modality="ir",
  classes=["sailboat","canoe","bulk carrier"],
  conf_thres=0.3
)
Observation: [
  {"id":1,"class":"sailboat","conf":0.92,"bbox_obb":{"cx":570,"cy":635,"w":80,"h":110,"angle":12}},
  {"id":2,"class":"sailboat","conf":0.90,"bbox_obb":{"cx":865,"cy":535,"w":15,"h":12,"angle":5}},
  {"id":3,"class":"canoe","conf":0.88,"bbox_obb":{"cx":700,"cy":565,"w":20,"h":10,"angle":0}}
]
</tool_execution>
<reflexion>
The detected objects correspond to maritime targets with plausible sizes and shapes in infrared imagery. The confidence scores are sufficiently high, indicating reliable detection. The number and distribution of detections are consistent with a maritime scene.
</reflexion>
<synthesis>
The reasoning aggregates detections to characterize the objects present in the image. The detected categories and their locations form a complete perception result.
</synthesis>
```
```xml
<planning>
This task requires summarizing the visual content of an image. I need to understand what objects are present and their overall distribution. Object detection results can be aggregated to produce a concise scene description.
</planning>
<tool_execution>
Action: YOLOv8-OBB(
  image_id="IR_ship_5340_0",
  modality="ir",
  classes=["sailboat","canoe","bulk carrier"],
  conf_thres=0.3
)
Observation: [
  {"id":1,"class":"sailboat","conf":0.92},
  {"id":2,"class":"sailboat","conf":0.90},
  {"id":3,"class":"canoe","conf":0.88},
  {"id":4,"class":"bulk carrier","conf":0.93},
  {"id":5,"class":"bulk carrier","conf":0.91}
]
</tool_execution>
<reflexion>
The detected objects include multiple types of vessels. The counts indicate a mix of small and large maritime vehicles. This information is sufficient to form a concise descriptive summary.
</reflexion>
<synthesis>
By aggregating detected object categories and counts, the scene can be described at a high level. The resulting description reflects the dominant visual elements in the image.
</synthesis>
```
```xml
<planning>
This task aims to determine the overall scene category of the image. Global semantic understanding is required rather than detailed object localization. CLIP is appropriate for inferring the dominant scene category.
</planning>
<tool_execution>
Action: CLIP(
  image_id="IR_ship_3376_0",
  candidate_labels=["Ship","Airport","Residential Area"],
  crop_bbox=null
)
Observation: {
  "top_label":"Ship",
  "top_conf":0.86,
  "distribution":{"Ship":0.86,"Airport":0.08,"Residential Area":0.06}
}
</tool_execution>
<reflexion>
The semantic prediction strongly favors a maritime scene. The dominance of vessel-related semantics supports the classification. The confidence margin over other categories is significant.
</reflexion>
<synthesis>
The scene-level semantic evidence indicates a ship-related environment. The reasoning relies on global visual patterns rather than individual detections.
</synthesis>
```
```xml
<planning>
This task requires generating a step-by-step action plan to move from a start position to a destination. The plan should reference visible landmarks and movement directions. I need to understand the spatial progression between the start and end positions and identify salient buildings along the route. YOLOv8-OBB can help identify landmarks, while Trajectory_Planner informs movement sequence.
</planning>
<tool_execution>
Action: YOLOv8-OBB(
  image_id="plan_sequence",
  modality="optical",
  classes=["building","parking-lot"],
  conf_thres=0.3
)
Observation: [
  {"id":1,"class":"building","conf":0.91},
  {"id":2,"class":"building","conf":0.89},
  {"id":3,"class":"parking-lot","conf":0.94}
]
Action: Trajectory_Planner(
  start_pose=[739,610,199],
  goal_pose=[356,611,199],
  constraints=[]
)
Observation: {
  "trajectory":["straight","straight","right-turn","straight"],
  "status":"success"
}
</tool_execution>
<reflexion>
The detected landmarks provide clear reference points along the route. The planned movement sequence is simple and consistent with the spatial layout. Each step can be expressed as a human-readable instruction.
</reflexion>
<synthesis>
By combining landmark detection with movement sequencing, a clear and interpretable action plan can be produced. The reasoning supports translating spatial navigation into natural language steps.
</synthesis>
```
你现在已经具备:
如果你愿意,下一步我可以帮你把这些 “人写 CoT”抽象成“模板 + 参数”,直接生成 10 万级 CoT 数据集。
```json
{
  "image_id": "10275_800_600_0_0",
  "chain_of_thought": {
    "planning": "The task requires identifying the relationship between an airplane and another object in a specified region. Step 1: Use YOLOv8_OBB to classify the second object within the given bounding box [[11,78,999,996]] and confirm the category of the airplane. Step 2: Analyze spatial layout—given the large rectangular region likely representing a long linear structure, assess alignment and positioning. Step 3: Infer semantic relation based on domain knowledge: airplanes on elongated structures typically indicate taxiing or parking on runways.",
    "tool_execution": {
      "steps": [
        {
          "action": {
            "tool_name": "YOLOv8_OBB",
            "parameters": {
              "image_id": "10275_800_600_0_0",
              "regions": [
                { "category": "airplane", "oriented_bbox": [137, 763, 217, 869] },
                { "category": "unknown", "oriented_bbox": [11, 78, 999, 996] }
              ]
            }
          },
          "observation": {
            "tool_name": "YOLOv8_OBB",
            "status": "success",
            "result": [
              { "region": [137, 763, 217, 869], "class": "airplane", "conf": 0.96 },
              { "region": [11, 78, 999, 996], "class": "runway", "conf": 0.98 }
            ]
          }
        },
        {
          "action": {
            "tool_name": "GDAL-Shapely",
            "parameters": {
              "operation": "distance_and_bearing",
              "p1": [177, 816],
              "p2": [505, 537]
            }
          },
          "observation": {
            "tool_name": "GDAL-Shapely",
            "status": "success",
            "result": {
              "pixel_distance": 340.2,
              "topology": "intersect",
              "bearing": "North-West"
            }
          }
        }
      ]
    },
    "reflection": "YOLOv8_OBB confirms the subject is 'airplane' and the object is 'runway' with high confidence. The bounding box of the runway spans nearly the entire image vertically, indicating a major infrastructure element. The airplane's location intersects spatially with the runway region, suggesting it lies directly on it. Given that aircraft move along runways during takeoff/landing phases, the most plausible activity is taxiing rather than flying or parked-off. No occlusion or detachment observed.",
    "synthesis": "The subject (airplane) is located on the runway, intersecting its spatial extent, and is actively moving along it. This corresponds to the semantic relationship 'taxi-on'. Final Answer: subject: airplane, object: runway, the airplane is <|rel|>taxi-on<|/rel|> the runway."
  }
}
```
这个cot推理过程是否正确,工具调用的结果是怎么计算的,工具调用的结果是正确的么
```python
import math

dist = math.hypot(177 - 505, 816 - 537)
dist  # 430.6100323959023
```
你这段 CoT 的大方向思路是合理的(先用 YOLOv8-OBB 确认类别,再用 GDAL-Shapely 做几何关系,从而推出语义关系),但就你贴出来的这条样本而言,里面的工具调用结果在数值/方向/拓扑含义上存在明显不一致,所以“工具返回是否正确”——按当前 JSON 来看是不正确或不自洽的。
下面我按你问的 3 个点逐条拆开讲:它哪里对、工具结果该怎么计算、以及你现在这份结果哪里不对。
另外还有一个命名细节:工具规范里定义的是 YOLOv8-OBB,但 JSON 里写的是 YOLOv8_OBB(下划线 vs 连字符)。如果你做 schema 校验/自动解析,这会直接导致调用无法对齐。所以:推理框架对,但这条样本里“工具结果→结论”的链条目前不成立(至少需要修正工具结果或改成可解释的几何计算方式)。
你第二步调用是:
```json
Action: GDAL-Shapely(
  operation="distance_and_bearing",
  p1=[177,816],
  p2=[505,537]
)
Observation: {
  "pixel_distance": 340.2,
  "topology": "intersect",
  "bearing": "North-West"
}
```
`distance_and_bearing(p1, p2)` 的距离通常就是欧氏距离。注意这两个点正好是两个框的中心:

- p1 = [177, 816]:airplane bbox `[137,763,217,869]` 的中心
- p2 = [505, 537]:runway bbox `[11,78,999,996]` 的中心

欧氏距离 = √((505−177)² + (537−816)²) = √(328² + 279²) ≈ 430.6

✅ 应接近 430.6 像素
❌ 你写的 340.2 不匹配(除非你有额外的缩放/投影/坐标系变换,但 Action 里完全没说明)。
这里要先约定图像坐标系:通常是 x 向右、y 向下(原点在左上角的图像像素坐标)。
从 p1=(177,816) 到 p2=(505,537):dx = +328(向东),dy = −279(图像坐标下 y 减小,即向北)。
所以方向应该是:North-East(东北)
❌ 你写 North-West(西北)是反的。
如果你用的是地理坐标系(y 向上),那 dy 就是 +279(北),仍然是 dx>0,所以还是 NE,不会变成 NW。
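按“图像坐标 x 向右、y 向下,0°=North、顺时针”的约定,bearing 与八方位映射可以这样示意(仅示意,这个坐标约定正是前面说的需要在工具里显式声明的那种):

```python
import math

def bearing(p1, p2):
    """图像坐标系(y 向下):返回 (bearing_deg, 八方位字符串),0°=North,顺时针增大。"""
    dx = p2[0] - p1[0]          # 向右为东
    dy = p2[1] - p1[1]          # 向下为南,所以"北分量"是 -dy
    deg = math.degrees(math.atan2(dx, -dy)) % 360
    names = ["North", "North-East", "East", "South-East",
             "South", "South-West", "West", "North-West"]
    return deg, names[int((deg + 22.5) // 45) % 8]   # 每 45° 一个扇区
```

代入本例的 p1=[177,816]、p2=[505,537],方位角约 49.6°,落在 North-East 扇区,印证上面的结论。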
你传入的是 两个点 p1/p2,点之间谈不上 “intersect/contains/disjoint” 这种拓扑关系(拓扑关系是几何体之间的:bbox vs bbox、poly vs poly)。
所以要么把输入换成两个几何体(bbox/poly)再谈拓扑关系,要么在 distance_and_bearing 的输出里干脆不要 topology 字段。
更关键的是:如果你真正想表达的是 “airplane 在 runway 上”,应该计算的是:
- `bbox_airplane ∩ bbox_runway` 是否非空(intersects)
- `bbox_runway.contains(bbox_airplane)` 是否为真(contains / within)

用你给的两个 bbox 来看:
runway bbox `[11,78,999,996]` 完全包住 airplane bbox `[137,763,217,869]`,正确结论应是 contains / within,而不是点对点意义上的 "intersect"。

✅ 总结:这份 GDAL-Shapely 的结果(distance、bearing、topology)至少三处不正确或不自洽。
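轴对齐 bbox 的 contains/IoU 校验甚至不需要 Shapely,直接算即可(仅示意,bbox 格式为 [x1,y1,x2,y2];把上面两个框代入,contains 为真、IoU 很小,与“runway 包住 airplane”一致):

```python
def bbox_contains(outer, inner):
    """outer 是否完全包含 inner(轴对齐 bbox,格式 [x1,y1,x2,y2])。"""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])

def bbox_iou(a, b):
    """两个轴对齐 bbox 的交并比。"""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)
```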
我现在无法看到 10275_800_600_0_0 这张图,所以没法替你判断“它真的是 runway 吗”。但从工程角度,你这段输出存在两个风险点:
第二个 ROI 几乎覆盖全图 [11,78,999,996],这种情况下 YOLO 很容易把大面积背景判成某个大类(runway/road/tarmac/field),它并不是一个很强的定位证据。更稳的做法是(仍然只用你允许的 6 个工具名):
1. 用 YOLOv8-OBB 在全图检测 runway(不要用一个几乎全图的 ROI 去“逼迫”它输出 runway)
2. 用 GDAL-Shapely(operation="contains/iou")去验证 airplane bbox 是否在 runway bbox 上
3. 若 YOLOv8-OBB 对 runway 置信度不稳定,再用 CLIP 对 runway/tarmac/road 做候选标签对齐辅助判断

最小修改建议(不增加新工具):
- 工具名与前文规范(YOLOv8-OBB、GDAL-Shapely、CLIP 等)严格一致
- operation="contains":runway contains airplane?
- operation="iou":overlap 程度
- operation="distance":几何体间距离(如果相交,距离=0)
- 显式声明坐标约定,如 axis_order 或 image_coord_y_down=true
- 判据:contains==true + runway 类别高置信 + airplane bbox 在 runway 内部 → “on-runway”

如果你愿意,我可以直接按你现在这条 JSON 的字段风格,给你一版修正后的 tool_execution(包含正确的 distance/bearing/contains 计算方式),并把 reflection/synthesis 改成“即使工具可能错也能自检”的鲁棒版本(仍然只用这 6 个工具名)。
system_prompt = """You are an expert in the field of remote sensing, highly familiar with {task} within this domain. You are capable of providing a chain_of_thought for addressing specific challenges associated with this task.
Your generated chain_of_thought should include the following components: planning, tool_execution(Action & Observation), reflection, and synthesis. Your ultimate task is to generate a complete chain_of_thought corresponding to the input.
planning: Outline the steps and strategies you will employ to tackle the given task.
tool_execution(Action & Observation): For {task}, there are some tools available: <tool_define>{tool_define}</tool_define>. You need to select the appropriate tool based on {task} to resolve it and provide the tool's execution results. Note: You cannot invoke these tools because they do not exist. Instead, you need to simulate their invocation and use your knowledge to directly generate the tool execution results.
reflection: Extract information that supports the answer based on the tool's results, and continuously refine the answer.
synthesis: Deduce the final result through comprehensive reasoning.
Note: You must ensure structured output, and your return results must conform to the following structure without containing any additional information:
<output_structure>
[
{{
"image_id": "string, unique identifier for the image from input json object",
"chain_of_thought": {{
"planning": "string, your planning for addressing task based on the input json object",
"tool_execution": {{
"steps": [
{{
"action": {{
"tool_name": "string, name of the tool you choose to use",
"parameters": "string, input parameters for the tool"
}},
"observation": {{
"tool_name": "string, name of the tool you choose to use",
"status": "string, status of the tool execution, e.g., success or failure",
"result": "string, the execution result of the tool"
}},
}}
]
}},
"reflection": "string, your reflection based on the tool execution results",
"synthesis": "string, your final answer based on the reflection"
}}
}}
]
</output_structure>
Here are some examples you can refer to:
<chain_of_thought_example>
{chain_of_thought_example}
</chain_of_thought_example>
"""
这份提示词应该如何改进
这是一个非常关键、而且你现在这个阶段问得非常对的问题。
你的这份 system_prompt 已经能“跑起来”,但如果目标是:稳定生成高质量、可控、可训练、可审计的 CoT(尤其是你这种多任务 + 工具模拟的数据集),它还需要系统性升级。
下面我分 三层 来讲:
先明确:不是推倒重来,而是“精修”。
你当前 prompt 的优点:
明确了四阶段 CoT 结构
planning / tool_execution / reflection / synthesis
👉 这和你数据集中 7+1 类任务的“通用思维骨架”是完全一致的。
明确“工具不可真实调用,只能模拟”
👉 这是你构建“伪工具 + 可控 CoT 数据集”的核心前提。
强制结构化 JSON 输出
👉 这是后续自动校验、自动注入噪声、自动评估的基础。
但问题是:
它还不足以“约束模型如何正确地‘假装’工具是可信但可被反思的”。
现在模型可以随意编造:

- distance
- bearing
- topology

而你前面已经发现:
👉 一旦工具结果不自洽,整个 CoT 会被“污染”
缺失点:
prompt 没有要求:
“工具输出必须与输入参数在几何 / 逻辑上保持一致”
你现在写的是:
reflection: Extract information that supports the answer...
这句话太弱了,模型会把 reflection 当成:
但在你的体系里,reflection 是核心安全阀,它必须承担:
例如你前面那个例子里:
intersect(几何事实)taxi-on(语义关系)但 prompt 没要求模型:
必须先给出几何事实,再在 synthesis 中映射到语义关系
这会导致模型直接“跳答案”。
你现在用 {task} 占位,但 prompt 里没有说明每类任务各自应遵循的推理模式与工具选择规则:
👉 这会导致同一任务类型的 CoT 风格漂移,对训练是灾难性的。
```json
"parameters": "string"
"result": "string"
```
这在工程和训练上都非常危险:
⚠️ 从模型安全与泛化角度:
你真正想要的不是“自由 CoT”,而是“受控推理轨迹”
你现在这个 prompt,在更大模型上很容易产生“冗长、不稳定、不可控”的 CoT。
下面是一个**“为你这个数据集量身定做”的升级版**。
👉 它不会引入新概念,只是把你已经在做的事情写清楚、写硬、写可控。
```text
You are an expert in remote sensing and aerial scene understanding.
You are highly familiar with the task: {task}, and you must generate a structured, verifiable reasoning trace for this task.

Your goal is to generate a complete reasoning trace that follows a fixed four-stage structure:
planning, tool_execution, reflection, and synthesis.

IMPORTANT CONSTRAINTS:
1. You must strictly follow the output JSON structure provided below. Do not add or remove any fields.
2. You must ONLY use the tools defined in <tool_define>. You may not invent new tools.
3. Tools do NOT actually exist. You must simulate tool execution results based on domain knowledge.
4. Simulated tool results MUST be logically and geometrically consistent with the tool inputs.
5. The final answer must be derived ONLY from information validated in the reflection stage.

---

Stage Definitions:

planning:
- Describe the task-solving strategy appropriate for {task}.
- Explicitly state which tools will be used and why.
- Follow the canonical reasoning pattern of this task type (e.g., detection → geometry → relation).

tool_execution (Action & Observation):
- Each step must include one Action and one Observation.
- Action.parameters and Observation.result MUST be structured JSON objects, not free text.
- Simulated results must obey physical, geometric, and semantic consistency
  (e.g., distances must match coordinates, directions must match vectors).

reflection:
- Critically examine the tool outputs.
- Check for internal consistency (e.g., geometry vs semantics).
- Identify whether any tool output is unreliable, conflicting, or ambiguous.
- Decide which information can be trusted and which should be down-weighted or ignored.

synthesis:
- Produce the final conclusion using ONLY validated information from reflection.
- Do NOT introduce new evidence.
- Map geometric or quantitative facts to semantic conclusions when required by the task.

---

Additional Task-Specific Rules:
- Relationship reasoning: geometric relations MUST precede semantic relations.
- Instruction decomposition: output MUST follow a step-by-step procedural structure.
- Task scheduling/navigation: target and landmarks MUST be localized before trajectory generation.
- Multimodal perception: modality-specific reasoning MUST be respected.
- Change detection: temporal differences MUST be explicitly identified before interpretation.

---

You must output EXACTLY the following JSON structure and nothing else:

<output_structure>
[
  {
    "image_id": "...",
    "chain_of_thought": {
      "planning": "...",
      "tool_execution": {
        "steps": [
          {
            "action": { "tool_name": "...", "parameters": { } },
            "observation": { "tool_name": "...", "status": "success", "result": { } }
          }
        ]
      },
      "reflection": "...",
      "synthesis": "..."
    }
  }
]
</output_structure>

You may refer to the following examples to understand style and structure:
<chain_of_thought_example>
{chain_of_thought_example}
</chain_of_thought_example>
```
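配合“必须 EXACTLY 输出该 JSON 结构”的硬约束,生成侧可以加一个不依赖 LLM 的结构校验器,解析失败或缺字段的样本直接打回重新生成。一个最小示意(只查必需字段是否存在,不做语义校验;`validate_cot_output` 是随手取的函数名):

```python
import json

REQUIRED_COT_KEYS = {"planning", "tool_execution", "reflection", "synthesis"}

def validate_cot_output(raw_text):
    """解析模型输出并检查 output_structure 的必需字段;失败返回 (False, 原因)。"""
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError as e:
        return False, f"invalid json: {e}"
    if not isinstance(data, list) or not data:
        return False, "top level must be a non-empty list"
    for item in data:
        if "image_id" not in item:
            return False, "missing image_id"
        cot = item.get("chain_of_thought", {})
        missing = REQUIRED_COT_KEYS - set(cot)
        if missing:
            return False, f"missing keys: {sorted(missing)}"
        for step in cot["tool_execution"].get("steps", []):
            if "action" not in step or "observation" not in step:
                return False, "step must contain action and observation"
    return True, "ok"
```

这一层就是前面说的“格式稳定、可解析”的工程保障:先卡结构,再谈几何/语义一致性。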
使用这个版本后,你将得到:
工具结果不再“随便编”
→ distance / bearing / topology 会自然变得合理
reflection 真正成为“纠错层”
→ 为你后续引入 noisy-CoT、工具出错样本打基础
不同任务的 CoT 风格稳定
→ 关系推理 / 导航 / 感知不会混成一锅粥
可直接用于大规模自动生成数据集
你现在已经走到一个非常专业的阶段了,下一步通常是二选一:
- `{task="relationship_reasoning"}` → 自动加载对应规则
- `{task="navigation"}` → 自动禁止无关工具

如果你愿意,直接告诉我:
👉 选 A 还是 B
"tool_define": [{
"tool_name": "YOLOv8_OBB",
"parameters": {
"image_id": "string, the unique identifier of the image",
"regions": "list of objects, each object contains category and oriented bounding box coordinates"
}
},
{
"tool_name": "GDAL-Shapely",
"parameters": {
"operation": "distance_and_bearing",
"p1": "list of two floats, the coordinates of the first point",
"p2": "list of two floats, the coordinates of the second point"
}
}
],
对于这两个工具的定义是否准确,如果不准确请修改,并在每一个工具定义中添加新的describe参数,描述每个工具的功能和输出参数
Your two tool definitions are not accurate / usable enough; the main problems are:

- The detector is missing `modality`, which determines which weights are loaded. (任务的汇总 - 副本.md)
- You only defined `distance_and_bearing(p1, p2)`, but your tasks also use this tool for topology relations / directions / physical-unit distances (e.g., relation/intersects, direction). (任务的汇总 - 副本.md)
- Your task table uses `YOLOv8-OBB` (hyphen), not `YOLOv8_OBB` (underscore). Unify the tool names, otherwise downstream schema validation / parsing will break. (任务的汇总 - 副本.md)

Below is my suggested revised tool_define: each tool gains a `describe` field, and the output parameters are spelled out (via `output_schema` + `output_fields`).
```json
[
  {
    "tool_name": "YOLOv8-OBB",
    "describe": "Multimodal oriented object detection (and optional ROI verification) for remote sensing images. Supports optical/SAR/IR. Returns oriented bounding boxes with class labels and confidence scores.",
    "parameters": {
      "image_id": "string, unique identifier of the image",
      "modality": "string, one of ['optical','sar','ir'], used to select the proper detector weights",
      "classes": "list[string], target categories to detect (e.g., ['ship','building']). If null/empty, detect all supported classes",
      "conf_thres": "float, confidence threshold (e.g., 0.3)",
      "regions": "optional list[object], ROI list for verification/classification. Each object: { 'region_id': int|string, 'oriented_bbox': [x1,y1,x2,y2] OR {'cx','cy','w','h','angle'}, 'hint_category': optional string }",
      "max_dets": "optional int, maximum detections to return (default e.g., 200)"
    },
    "output_schema": "list[object]",
    "output_fields": {
      "id": "int|string, unique id for each returned detection/region result",
      "class": "string, predicted category name",
      "conf": "float, confidence score in [0,1]",
      "bbox_obb": "object, oriented bbox in normalized format: {'cx':float,'cy':float,'w':float,'h':float,'angle':float}. angle is in degrees"
    },
    "notes": "If 'regions' is provided, the tool should return results aligned to those regions (ROI verification). If 'regions' is omitted, the tool performs free detection over the image."
  },
  {
    "tool_name": "GDAL-Shapely",
    "describe": "Geospatial and geometric computation utility for remote sensing. Supports point-to-point distance/bearing and geometry-to-geometry spatial relations (intersects/contains/IoU) with optional pixel-to-physical unit conversion.",
    "parameters": {
      "operation": "string, one of ['distance_and_bearing','geom_relation']",
      "p1": "required if operation='distance_and_bearing': list[float,float], coordinates of point-1 in image pixel or geo coords",
      "p2": "required if operation='distance_and_bearing': list[float,float], coordinates of point-2 in image pixel or geo coords",
      "geom_a": "required if operation='geom_relation': object geometry A, supports {'type':'bbox','bbox':[x1,y1,x2,y2]} or {'type':'poly','points':[[x,y],...]}",
      "geom_b": "required if operation='geom_relation': object geometry B, same format as geom_a",
      "units": "optional string, one of ['pixel','meters','degrees'], default 'pixel'",
      "pixel_to_meter": "optional float, meters-per-pixel for converting pixel distance to meters when units='meters'",
      "return_topology": "optional bool, whether to return topology fields (default true for geom_relation)"
    },
    "output_schema": "object",
    "output_fields": {
      "pixel_distance": "float, Euclidean distance in pixels (only for distance_and_bearing)",
      "physical_distance": "optional float, converted distance in meters (only if pixel_to_meter provided and units='meters')",
      "unit": "string, unit of the returned distance",
      "bearing_deg": "float, bearing angle in degrees (0~360), computed from p1->p2 with a clearly defined coordinate convention",
      "bearing": "string, coarse cardinal direction (e.g., 'North-East')",
      "topology": "optional string, one of ['intersects','disjoint','contains','within'] (only for geom_relation)",
      "iou": "optional float, IoU in [0,1] for bbox/poly overlap (only for geom_relation)"
    },
    "notes": "Do not output topology when operation='distance_and_bearing' unless you explicitly pass geometries. Bearing computation must state a convention (e.g., image coordinates x right, y down)."
  }
]
```
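The distance_and_bearing branch can be prototyped with the standard library alone. A minimal sketch, assuming image coordinates (x right, y down) so that North points toward decreasing y; the function name mirrors the operation, everything else is illustrative:

```python
import math

def distance_and_bearing(p1, p2, pixel_to_meter=None):
    """Euclidean distance and compass bearing from p1 to p2.

    Convention (assumed): image pixels, x to the right, y downward,
    so 'North' corresponds to decreasing y and East to increasing x.
    """
    dx = p2[0] - p1[0]
    dy = p2[1] - p1[1]
    pixel_distance = math.hypot(dx, dy)
    # atan2 with -dy maps "up" in the image to North (0 deg), East = 90 deg.
    bearing_deg = (math.degrees(math.atan2(dx, -dy)) + 360.0) % 360.0
    names = ["North", "North-East", "East", "South-East",
             "South", "South-West", "West", "North-West"]
    coarse = names[int((bearing_deg + 22.5) // 45) % 8]  # 8-way sector lookup
    result = {
        "pixel_distance": round(pixel_distance, 2),
        "bearing_deg": round(bearing_deg, 1),
        "bearing": coarse,
        "unit": "pixel",
    }
    if pixel_to_meter is not None:
        result["physical_distance"] = round(pixel_distance * pixel_to_meter, 2)
        result["unit"] = "meters"
    return result
```

Note that for the two building centers used later in this thread ([773.5, 721.5] → [954.5, 677.0]) this convention yields a bearing of roughly 76°, i.e. the East sector; a simulated "North-East" would fall just outside it, which is exactly the kind of inconsistency a real calculator prevents.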
Key changes:

- `modality` is promoted to a required key field. (任务的汇总 - 副本.md)
- `operation` now supports two modes: `distance_and_bearing` (for the p1/p2 case in your earlier example) and `geom_relation` (for whether bboxes intersect/contain in relationship reasoning, which matches the relation/direction fields in your tool-table output). (任务的汇总 - 副本.md)

If you want to go one step further, I can turn these two tool definitions into a **strict JSON Schema (oneOf/required)** version, guaranteeing:
"tool_define": [{
"tool_name": "YOLOv8_OBB",
"parameters": {
"image_id": "string, the unique identifier of the image",
"regions": "list of objects, each object contains category and oriented bounding box coordinates"
}
},
{
"tool_name": "GDAL-Shapely",
"parameters": {
"operation": "distance_and_bearing",
"p1": "list of two floats, the coordinates of the first point",
"p2": "list of two floats, the coordinates of the second point"
}
}
],
Are these two tool definitions accurate? If not, please revise them, and add a new `describe` parameter to each tool definition describing its function and output parameters.
These two tool_define entries are not accurate enough; there are two main issues:

- The detector is missing `modality` (optical/SAR/infrared) and the common `classes`/`conf_thres`; moreover, its output should include fields such as `bbox_obb` + `conf` (that is exactly how your table defines it). (任务的汇总 - 副本.md)
- The geometry tool's output should follow a structure like `{value, unit, relation, direction}`. (任务的汇总 - 副本.md)

Below is the revised version: each tool gains a `describe` field that makes "function + output parameters" explicit. (I keep your tool names `YOLOv8_OBB` / `GDAL-Shapely`, but you should unify the naming style internally to avoid parsing mismatches.)
```json
[
  {
    "tool_name": "YOLOv8_OBB",
    "describe": "Multimodal oriented object detection / counting for remote sensing images. Supports modality selection (optical/sar/ir) to load different detector weights. Outputs a list of detections with class labels, confidence scores, and oriented bounding boxes (bbox_obb).",
    "parameters": {
      "image_id": "string, the unique identifier of the image",
      "classes": "list[string], categories to detect (e.g., ['ship','tank']). If omitted/empty, detect all supported classes",
      "modality": "string, one of ['optical','sar','ir'], determines which weights are loaded",
      "conf_thres": "float, confidence threshold (e.g., 0.3)",
      "regions": "optional list[object], ROI list for verification. Each object can contain {'category': optional string, 'oriented_bbox': [x1,y1,x2,y2] OR {'cx','cy','w','h','angle'}}",
      "max_dets": "optional int, maximum number of returned detections"
    },
    "output": {
      "type": "list[object]",
      "fields": {
        "id": "int|string, unique detection id",
        "class": "string, predicted category",
        "conf": "float in [0,1], confidence score",
        "bbox_obb": "object, {'cx':float,'cy':float,'w':float,'h':float,'angle':float}, angle in degrees"
      }
    }
  },
  {
    "tool_name": "GDAL-Shapely",
    "describe": "Geometric & geospatial calculator for remote sensing. Computes physical distance/area, relative direction/bearing, and topology relations (e.g., intersects/contains) between objects or points. Outputs quantitative values with units plus optional topology/direction fields.",
    "parameters": {
      "operation": "string, one of ['distance_and_bearing','distance','area','topology','distance_topology']",
      "p1": "required for distance_and_bearing: list[float,float], first point coordinates (pixel or geo)",
      "p2": "required for distance_and_bearing: list[float,float], second point coordinates (pixel or geo)",
      "obj_id_A": "optional, object id A (when operating on detected objects)",
      "obj_id_B": "optional, object id B (when operating on detected objects)",
      "unit": "optional string, e.g., 'pixel' or 'meters'",
      "pixel_to_meter": "optional float, meters per pixel when converting pixel distance to meters",
      "coord_convention": "optional string, e.g., 'image_x_right_y_down' to make bearing unambiguous"
    },
    "output": {
      "type": "object",
      "fields": {
        "value": "float, main numeric result (e.g., physical distance or area)",
        "unit": "string, unit of value (e.g., meters/pixel)",
        "relation": "optional string, topology relation such as 'intersects'/'disjoint'/'contains'/'within'",
        "direction": "optional string, coarse direction such as 'North-East'",
        "bearing_deg": "optional float, bearing angle in degrees (0~360) when operation includes bearing"
      }
    }
  }
]
```
Key points:

- `modality` determines which weights are loaded; the output includes `conf` and `bbox_obb` (with angle). (任务的汇总 - 副本.md)
- The geometry output follows the `value/unit/relation/direction` structure. (任务的汇总 - 副本.md)

If you want, I can also upgrade these into a strictly validatable JSON Schema (oneOf: different `operation` → different required parameters); a ready-to-use jsonschema version will make later automatic data generation / noise injection more robust.
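As a sketch of that oneOf direction (branch contents are illustrative and cover only two of the listed operations):

```json
{
  "type": "object",
  "required": ["operation"],
  "oneOf": [
    {
      "properties": { "operation": { "const": "distance_and_bearing" } },
      "required": ["operation", "p1", "p2"]
    },
    {
      "properties": { "operation": { "const": "topology" } },
      "required": ["operation", "obj_id_A", "obj_id_B"]
    }
  ]
}
```

Because `const` pins `operation` in each branch, exactly one branch can match, so a `distance_and_bearing` call missing `p2` fails validation instead of silently passing.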
planning
The user asks about the relationship between "building" (Box A: [[681,656,866,787]]) and Box B ([[912,638,997,716]]).
tool_execution
Action: Semantic_Analyzer(
image_id="02102_600_800_0_0",
crop_bbox=[912, 638, 997, 716],
candidate_labels=["building", "road", "tree", "car"]
)
Observation: {
"top_label": "building",
"top_conf": 0.94,
"attributes": ["residential", "dense"]
}
Action: Geo_Calculator(
operation="distance/topology",
obj_bbox_A=[681, 656, 866, 787],
obj_bbox_B=[912, 638, 997, 716]
)
Observation: {
"pixel_distance": 46.0,
"topology": "disjoint",
"direction": "East"
}
reflection
synthesis
Subject: building, Object: building.
The building is <|rel|>close-to<|/rel|> the building.
Following the example above, please generate a JSON prompt in the following format:
{
"task": "the task of identifying relationships between two entities in remote sensing images",
"tool_define": [{
"tool_name": "YOLOv8_OBB",
"parameters": {
"image_id": "string, the unique identifier of the image",
"regions": "list of objects, each object contains category and oriented bounding box coordinates"
}
},
{
"tool_name": "GDAL-Shapely",
"parameters": {
"operation": "distance_and_bearing",
"p1": "list of two floats, the coordinates of the first point",
"p2": "list of two floats, the coordinates of the second point"
}
}
],
"chain_of_thought_example": {
"input_json_object": {
"image_id": "15991_1000_1000_0_0",
"text_input": "<|reasoning|>What is the relationship between <|ref|>small-car<|/ref|><|det|>[[522,339,531,356]]<|/det|> and the object in <|det|>[[532,350,544,371]]<|/det|> in the image? And output their categories.",
"text_output": "subject: small-car, object: small-car, the small-car is <|rel|>park-next-to<|/rel|> the small-car."
},
"chain_of_thought": {
"planning": "The user requires determining the semantic relationship between two specific regional objects.\n1. Task Breakdown:\n - Subtask 1 (Identity Recognition): Verify the correctness of the known category for the subject (Box A: [522,339,531,356]) and prioritize identifying the category of the object (Box B: [532,350,544,371]).\n - Subtask 2 (Spatial Analysis): Calculate pixel distance, relative orientation angle, and topological relationships (e.g., overlap) between the two objects.\n - Subtask 3 (Relationship Inference): Infer semantic relationships (e.g., park-next-to, near, collision) based on category homogeneity and spatial proximity.\n2. Tool Selection:\n - Invoke 'YOLOv8_OBB': Obtain precise Oriented Bounding Boxes (OBBs) for both objects, as vehicle arrangement is orientation-sensitive; simple Horizontal Bounding Boxes (HBBs) may fail to distinguish 'side-by-side' arrangements.\n - Invoke 'GDAL-Shapely': Calculate Euclidean distance and azimuth between two points.",
"tool_execution": {
"steps": [
{
"action": {
"tool": "YOLOv8_OBB",
"args": {
"image_id": "15991_1000_1000_0_0",
"regions": [
{"id": "subj", "bbox": [522,339,531,356]},
{"id": "obj", "bbox": [532,350,544,371]}
]
}
},
"observation": {
"tool": "YOLOv8_OBB",
"status": "success",
"results": [
{
"region_id": "subj",
"class": "small-car",
"conf": 0.95,
"bbox_obb": {"cx": 526.5, "cy": 347.5, "w": 9.0, "h": 17.0, "angle": -12.0}
},
{
"region_id": "obj",
"class": "small-car",
"conf": 0.92,
"bbox_obb": {"cx": 538.0, "cy": 360.5, "w": 12.0, "h": 21.0, "angle": -10.5}
}
]
}
},
{
"action": {
"tool": "GDAL-Shapely",
"args": {
"operation": "distance_and_bearing",
"p1": [526.5, 347.5],
"p2": [538.0, 360.5]
}
},
"observation": {
"tool": "GDAL-Shapely",
"status": "success",
"result": {
"pixel_distance": 17.3,
"topology": "disjoint",
"bearing": "South-East"
}
}
}
]
},
"reflection": "1. Cross-Check:\n - Visual Consistency: YOLOv8 identifies the object as 'small-car' with high confidence (>0.85). The subject is also 'small-car'. Both categories match.\n - Spatial Logic: The rotation angles of the two targets are extremely close (-12.0° and -10.5°), indicating they are arranged in parallel.\n - Distance Analysis: GDAL calculates the center point distance as 17.3 pixels. Considering the vehicle width is approximately 9-12 pixels, this indicates minimal gap between the two bodies. However, the topological relationship is marked as 'disjoint' (non-overlapping), ruling out collision or occlusion.\n2. Scenario Inference:\n - In remote sensing imagery, two identical vehicles positioned parallel, extremely close, and non-overlapping most typically represent the semantic relationship 'side-by-side parking'.\n - No conflicting evidence was found (e.g., one on the road, one on a roof), aligning with the physical logic of parking lots or roadside parking.\n3. Conclusion Revision: No revision required; evidence chain is closed-loop.",
"synthesis": "1. Analysis results based on the multimodal toolchain:\n2. Object Recognition: Both the subject and object were verified as small-car using YOLOv8-OBB.\n3. Spatial Relationship: The center points of the two vehicles are approximately 17.3 pixels apart, exhibit highly consistent rotation angles (parallel), and maintain spatial topological separation.\n4. Semantic Inference: This close parallel arrangement aligns with the spatial pattern of vehicles 'parked next to each other'.\n Therefore, the relationship between the subject (small-car) and object (small-car) is park-next-to.\n Final Answer: subject: small-car, object: small-car, the small-car is <|rel|>park-next-to<|/rel|> the small-car."
}
}
},
```python
import math

dx = 954.5 - 773.5
dy = 677 - 721.5
math.sqrt(dx * dx + dy * dy)
# 186.3900480175913
```
Your "building vs unknown box" relationship-reasoning sample does correspond to relationship-reasoning task #1 in your file: image_id="02102_600_800_0_0", with input/output as shown in the document. (任务的汇总 - 副本.md)

Below is the JSON prompt arranged in your given format (tool_define strictly uses your two tools, YOLOv8_OBB and GDAL-Shapely, and maps the Semantic_Analyzer/Geo_Calculator ideas from your example onto these two tools).
```json
{
  "task": "the task of identifying relationships between two entities in remote sensing images",
  "tool_define": [
    {
      "tool_name": "YOLOv8_OBB",
      "parameters": {
        "image_id": "string, the unique identifier of the image",
        "regions": "list of objects, each object contains category and oriented bounding box coordinates"
      }
    },
    {
      "tool_name": "GDAL-Shapely",
      "parameters": {
        "operation": "distance_and_bearing",
        "p1": "list of two floats, the coordinates of the first point",
        "p2": "list of two floats, the coordinates of the second point"
      }
    }
  ],
  "chain_of_thought_example": {
    "input_json_object": {
      "image_id": "02102_600_800_0_0",
      "text_input": "<|reasoning|>What is the relationship between <|ref|>building<|/ref|><|det|>[[681,656,866,787]]<|/det|> and the object in <|det|>[[912,638,997,716]]<|/det|> in the image? And output their categories.",
      "text_output": "subject: building, object: building, the building is <|rel|>close-to<|/rel|> the building."
    },
    "chain_of_thought": {
      "planning": "The user asks for the relationship between a known building (Box A: [681,656,866,787]) and an unknown object (Box B: [912,638,997,716]).\n1. Identity recognition: Use YOLOv8_OBB to verify Box A category and infer Box B category from the cropped region.\n2. Spatial analysis: Compute the center points of A and B and use GDAL-Shapely to calculate pixel distance and bearing from A to B.\n3. Relationship inference: If both are buildings and the distance is small while non-overlapping, infer a proximity relation such as close-to.",
      "tool_execution": {
        "steps": [
          {
            "action": {
              "tool": "YOLOv8_OBB",
              "args": {
                "image_id": "02102_600_800_0_0",
                "regions": [
                  { "id": "subj", "category": "building", "oriented_bbox": [681, 656, 866, 787] },
                  { "id": "obj", "category": "unknown", "oriented_bbox": [912, 638, 997, 716] }
                ]
              }
            },
            "observation": {
              "tool": "YOLOv8_OBB",
              "status": "success",
              "results": [
                { "region_id": "subj", "class": "building", "conf": 0.95, "bbox_obb": { "cx": 773.5, "cy": 721.5, "w": 185.0, "h": 131.0, "angle": 0.0 } },
                { "region_id": "obj", "class": "building", "conf": 0.94, "bbox_obb": { "cx": 954.5, "cy": 677.0, "w": 85.0, "h": 78.0, "angle": 0.0 } }
              ]
            }
          },
          {
            "action": {
              "tool": "GDAL-Shapely",
              "args": {
                "operation": "distance_and_bearing",
                "p1": [773.5, 721.5],
                "p2": [954.5, 677.0]
              }
            },
            "observation": {
              "tool": "GDAL-Shapely",
              "status": "success",
              "result": {
                "pixel_distance": 186.39,
                "topology": "disjoint",
                "bearing": "North-East"
              }
            }
          }
        ]
      },
      "reflection": "1. Semantic confirmation: YOLOv8_OBB recognizes Box A as 'building' and Box B as 'building' with high confidence (>=0.94).\n2. Spatial confirmation: The two boxes do not overlap (disjoint), and the center-to-center distance is about 186.39 pixels. The object is located to the North-East of the subject.\n3. Inference: Two separate buildings with relatively small separation in remote sensing imagery typically indicate a proximity relationship rather than containment or collision. Thus, 'close-to' is the best-fit semantic relation.",
      "synthesis": "Subject: building, Object: building. The building is <|rel|>close-to<|/rel|> the building."
    }
  }
}
```
What does the instruction decomposition task mean? Give a one-sentence introduction to what this task does; the sentence will be written into the prompt to tell the model what it must accomplish.

One-sentence version (ready to drop into the prompt):

The instruction decomposition task: **automatically break a high-level, abstract remote sensing analysis instruction from the user into an ordered sequence of executable sub-steps (e.g., region localization, object detection, relation analysis, result summarization), and report structured analysis results step by step.**

If you prefer a more engineering/model-oriented one-sentence version, you can use this instead:

The instruction decomposition task requires the model to turn a single overall remote sensing analysis instruction into an explicit multi-step operational workflow, where each step corresponds to a concrete perception or spatial-analysis subtask.

Both sentences match your type-2 task sample exactly (Analyze spatial relationships in region … → Step1/2/3/4) and fit directly into a system_prompt telling the model what to do.
Action: Geo_Calculator(
operation="locate_region",
region_bbox=[0, 231, 999, 999],
image_size=[800, 600]
)
Observation: {
"location_desc": "lower area",
"coverage": "64%"
}
Action: Visual_Perception(
image_id="18448_800_600_0_0",
classes=["airplane", "airport", "building"],
modality="optical"
)
Observation: [
{"id": 1, "class": "airplane", "bbox_obb": {"cx": 753, "cy": 834, "angle": 0}},
{"id": 2, "class": "airport", "bbox_obb": {"cx": 499, "cy": 649, "angle": 0}},
{"id": 3, "class": "building", "bbox_obb": {"cx": 813, "cy": 550, "angle": 0}}
]
Explain what these two functions do, and briefly explain the input and output parameters.
Based on the call snippets you gave, here is an explanation of the two functions, Geo_Calculator and Visual_Perception: what each does and what the input/output parameters mean (matching the definitions in your tool table). (任务的汇总 - 副本.md)

Geo_Calculator (spatial computation / region localization)

Geo_Calculator(operation="locate_region") does the following:

Given a region_bbox and the overall image size, it judges the region's relative position within the image (upper/middle/lower, left/center/right, etc.) and computes the fraction of the whole image that the region covers.

Input parameters:
- `operation="locate_region"`: selects the region-localization operation.
- `region_bbox=[0, 231, 999, 999]`: the region of interest, as [x1,y1,x2,y2].
- `image_size=[800, 600]`: the image dimensions ([H,W] or [W,H] — you need to standardize this convention, otherwise coverage is easy to miscompute).

Output:

```json
{ "location_desc": "lower area", "coverage": "64%" }
```

- `location_desc`: description of the region's position (e.g., lower area / right side)
- `coverage`: the region's share of the whole image (used to judge local vs large-area analysis)

Visual_Perception (visual perception / multimodal object detection)

Visual_Perception does the following:
Under the specified modality (optical/sar/ir), it runs object detection and returns the detected categories with oriented bounding box (OBB) results.

Under the hood this is typically the YOLOv8-OBB detector listed in your table. (任务的汇总 - 副本.md)

Input parameters:
- `image_id="18448_800_600_0_0"`: the image to run detection on.
- `classes=["airplane", "airport", "building"]`: the categories to detect.
- `modality="optical"`: the image modality.

Output:

```json
[
  {"id": 1, "class": "airplane", "bbox_obb": {"cx": 753, "cy": 834, "angle": 0}},
  {"id": 2, "class": "airport", "bbox_obb": {"cx": 499, "cy": 649, "angle": 0}},
  {"id": 3, "class": "building", "bbox_obb": {"cx": 813, "cy": 550, "angle": 0}}
]
```

- `id`: detection index
- `class`: detected category
- `bbox_obb`: oriented-box information:
  - `cx, cy`: center-point coordinates
  - `angle`: heading angle (commonly needed for vehicles/ships/runways in remote sensing) (任务的汇总 - 副本.md)

Note: this output lacks `conf` (confidence), while your tool table's standard output does include `conf`; add it so the later reflection stage can judge reliability. (任务的汇总 - 副本.md)
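For reference, a locate_region helper along these lines might look as follows; the thirds-based thresholds, the wording of location_desc, and the `image_size = [width, height]` convention are assumptions — the key point is that bbox and image_size must share one coordinate system:

```python
def locate_region(region_bbox, image_size):
    """Describe where a bbox sits in the image and how much it covers.

    region_bbox: [x1, y1, x2, y2], in the same coordinate system as
    image_size = [width, height] (pick ONE convention and enforce it).
    """
    x1, y1, x2, y2 = region_bbox
    w, h = image_size
    coverage = (x2 - x1) * (y2 - y1) / float(w * h)

    # Classify the bbox center into a 3x3 grid of the image.
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    horiz = "left" if cx < w / 3 else "right" if cx > 2 * w / 3 else "center"
    vert = "upper" if cy < h / 3 else "lower" if cy > 2 * h / 3 else "middle"
    desc = "central area" if (horiz, vert) == ("center", "middle") else f"{vert} {horiz}"
    return {"location_desc": desc, "coverage": f"{coverage:.0%}"}
```

Generating the observation with code like this (rather than letting the LLM invent it) is what keeps coverage numerically consistent with the bbox.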
Action:
{
"tool": "Semantic_Analyzer",
"args": {
"image_id": "birmingham_block_1",
"query": "entire row of houses on Leslie Road",
"reference_point": [947, 876, 52],
"search_radius": 50
}
}
Observation:
{
"tool": "Semantic_Analyzer",
"status": "success",
"result": {
"top_label": "residential_terrace",
"target_description": "Found the terrace housing structure aligned with Leslie Road.",
"refined_target_pos": [917, 897, 73],
"confidence": 0.96
}
}
Action:
{
"tool": "Trajectory_Planner",
"args": {
"start": [690, 660, 492, 265, 704, 108],
"end": [917, 897, 73],
"obstacles": "birmingham_block_1_map_data",
"algo": "A_Star"
}
}
Observation:
{
"tool": "Trajectory_Planner",
"status": "success",
"result": {
"status": "success",
"total_dist": 850,
"waypoints": [
[690, 660, 492, 265, 704, 108], [690, 660, 492, 279, 711, 104], [636, 712, 492, 279, 711, 104],
[582, 728, 432, 279, 711, 104], [576, 725, 424, 279, 711, 104], [576, 725, 423, 259, 722, 122],
[508, 814, 423, 110, 717, 274], [508, 814, 423, 96, 707, 291], [362, 741, 306, 96, 705, 288],
[351, 736, 298, 95, 699, 283], [341, 733, 290, 93, 688, 277], [333, 728, 282, 90, 680, 277],
[333, 728, 282, 90, 680, 277], [330, 728, 281, 87, 673, 277], [186, 681, 141, 87, 584, 229],
[175, 681, 131, 88, 584, 229], [165, 679, 121, 88, 584, 229], [165, 679, 120, 292, 515, 45],
[216, 681, 296, 292, 515, 45], [216, 681, 297, 302, 517, 41], [246, 684, 399, 302, 517, 41],
[246, 684, 401, 995, 454, 460], [425, 699, 386, 996, 454, 472], [435, 702, 385, 996, 454, 472],
[441, 702, 384, 996, 454, 472], [454, 702, 383, 996, 454, 472], [465, 705, 382, 996, 454, 472],
[479, 705, 381, 996, 454, 472], [490, 707, 380, 996, 454, 472], [495, 707, 380, 996, 454, 472],
[508, 707, 379, 996, 454, 472], [519, 710, 378, 996, 454, 472], [530, 710, 377, 996, 454, 472],
[544, 710, 376, 996, 454, 472], [549, 712, 375, 996, 454, 472], [555, 712, 375, 996, 454, 472],
[557, 712, 374, 884, 293, 257], [684, 777, 247, 883, 295, 253], [687, 777, 246, 913, 273, 335],
[836, 855, 155, 913, 273, 335], [841, 858, 152, 913, 273, 335], [850, 863, 146, 913, 273, 335],
[852, 866, 145, 828, 300, 181], [888, 887, 92, 828, 300, 181], [888, 887, 91, 828, 300, 181],
[917, 902, 49, 828, 300, 181], [917, 902, 49, 828, 300, 181]
]
}
}
Explain what these two functions do, and briefly explain the input and output parameters.
Below I explain each of the two functions you provided, Semantic_Analyzer and Trajectory_Planner: their purpose, input parameters, and output parameters. The explanation sticks strictly to your example and your overall task system (task scheduling / navigation tasks).

Semantic_Analyzer

Semantic_Analyzer's function is:

Given a natural-language semantic query (query) and a spatial reference, locate and confirm a semantic target object in the image, and output its precise position and semantic category.

In your task system this is a key step of the task scheduling / navigation tasks.

Input:

```json
{
  "image_id": "birmingham_block_1",
  "query": "entire row of houses on Leslie Road",
  "reference_point": [947, 876, 52],
  "search_radius": 50
}
```
- `image_id`: unique identifier of the image or map to analyze.
- `query`: natural-language semantic query describing the target object (e.g., an entire row of houses, or a building cluster along a road).
- `reference_point`: 3D coordinates of a semantic anchor or landmark (usually taken from the landmark given in the text), used to narrow the search and avoid scanning the whole image.
- `search_radius`: search radius around reference_point (typically in pixels or meters), bounding the spatial range of semantic matching.
Output:

```json
{
  "top_label": "residential_terrace",
  "target_description": "Found the terrace housing structure aligned with Leslie Road.",
  "refined_target_pos": [917, 897, 73],
  "confidence": 0.96
}
```

- `top_label`: inferred semantic category of the target (e.g., residential_terrace).
- `target_description`: textual description of the target, added for interpretability (not required for reasoning).
- `refined_target_pos`: the final 3D target position (used as the endpoint of the subsequent trajectory planning).
- `confidence`: confidence of the semantic localization and matching, used in the reflection stage to judge reliability.
Trajectory_Planner

Trajectory_Planner's function is:

Given a start pose, an end position, and environmental constraints (an obstacle map), generate a feasible, collision-free 3D flight trajectory.

This is the core tool of your task scheduling / action-decision tasks.

Input:

```json
{
  "start": [690, 660, 492, 265, 704, 108],
  "end": [917, 897, 73],
  "obstacles": "birmingham_block_1_map_data",
  "algo": "A_Star"
}
```
- `start`: the starting pose, typically containing position and attitude components.
- `end`: the target position (usually the refined_target_pos output by Semantic_Analyzer).
- `obstacles`: obstacle or environment map data (e.g., building layout, no-fly zones).
- `algo`: the path-planning algorithm to use (e.g., A*, RRT).
Output:

```json
{ "status": "success", "total_dist": 850, "waypoints": [...] }
```

- `status`: whether a path was generated (success / failure).
- `total_dist`: total path length (useful for evaluating efficiency or safety).
- `waypoints`: the planned waypoint sequence; each point carries position and attitude, forming the final executable set of flight commands.
In your task scheduling / navigation tasks, the two tools form a standard pipeline: Semantic_Analyzer resolves the goal, and Trajectory_Planner plans the route to it.

This matches Step2 (target localization) and Step4 (trajectory generation) in your dataset's text_output exactly.

If you like, I can also write the tool_define for these two tools as a unified JSON Schema, ready to drop into the system_prompt you designed earlier.
What does the IR (infrared) image object detection task mean? Give a one-sentence introduction to what this task does; the sentence will be written into the prompt to tell the model what it must accomplish.

One-sentence version (ready to drop into the prompt):

The IR (infrared) image object detection task: **identify and localize target instances in infrared remote sensing imagery, outputting each target's category and bounding-box coordinates (e.g., detect ships/vehicles and return their boxes).** (任务的汇总 - 副本.md)
Action:
{
"tool": "Visual_Perception",
"args": {
"image_id": "IR_ship_5340_0",
"classes": ["sailboat", "canoe", "bulk_carrier", "truck", "warship"],
"modality": "ir",
"conf_thres": 0.3
}
}
Observation:
{
"tool": "Visual_Perception",
"status": "success",
"result": [
{"id": 1, "class": "sailboat", "conf": 0.92, "bbox_obb": {"cx": 570, "cy": 635, "angle": 15.0}},
{"id": 2, "class": "sailboat", "conf": 0.90, "bbox_obb": {"cx": 865, "cy": 535, "angle": -5.0}},
{"id": 3, "class": "canoe", "conf": 0.88, "bbox_obb": {"cx": 700, "cy": 565, "angle": 0.0}},
{"id": 4, "class": "bulk_carrier", "conf": 0.95, "bbox_obb": {"cx": 590, "cy": 535, "angle": 0.0}},
{"id": 5, "class": "bulk_carrier", "conf": 0.94, "bbox_obb": {"cx": 680, "cy": 535, "angle": 0.0}},
{"id": 6, "class": "bulk_carrier", "conf": 0.93, "bbox_obb": {"cx": 820, "cy": 525, "angle": 0.0}},
{"id": 7, "class": "bulk_carrier", "conf": 0.91, "bbox_obb": {"cx": 925, "cy": 535, "angle": 0.0}},
{"id": 8, "class": "bulk_carrier", "conf": 0.92, "bbox_obb": {"cx": 975, "cy": 525, "angle": 0.0}}
]
}
Explain what these two functions do, and briefly explain the input and output parameters.
Below I explain only what you provided: the Visual_Perception call (its Action and Observation), what the function does, and what the input/output parameters mean, fully matching your example and your current dataset design.

Visual_Perception (visual perception / multimodal object detection)

Visual_Perception's function is:

Under the specified modality (here infrared, IR), run object detection on the remote sensing image, identifying and localizing target instances and outputting each target's category, confidence, and oriented bounding box (OBB).

In your task system this corresponds to the multimodal perception task – object detection (IR imagery).

Input:

```json
{
  "image_id": "IR_ship_5340_0",
  "classes": ["sailboat", "canoe", "bulk_carrier", "truck", "warship"],
  "modality": "ir",
  "conf_thres": 0.3
}
```
- `image_id`: unique identifier of the infrared remote sensing image to detect on.
- `classes`: list of target categories to detect, constraining the search and avoiding irrelevant classes.
- `modality`: image modality (`ir` means infrared), which decides the underlying detector weights to load (e.g., an IR-specific YOLOv8-OBB).
- `conf_thres`: confidence threshold; detections below it are filtered out.

Output:

```json
[
  {"id": 1, "class": "sailboat", "conf": 0.92, "bbox_obb": {"cx": 570, "cy": 635, "angle": 15.0}},
  {"id": 2, "class": "sailboat", "conf": 0.90, "bbox_obb": {"cx": 865, "cy": 535, "angle": -5.0}},
  {"id": 3, "class": "canoe", "conf": 0.88, "bbox_obb": {"cx": 700, "cy": 565, "angle": 0.0}},
  {"id": 4, "class": "bulk_carrier", "conf": 0.95, "bbox_obb": {"cx": 590, "cy": 535, "angle": 0.0}},
  {"id": 5, "class": "bulk_carrier", "conf": 0.94, "bbox_obb": {"cx": 680, "cy": 535, "angle": 0.0}},
  {"id": 6, "class": "bulk_carrier", "conf": 0.93, "bbox_obb": {"cx": 820, "cy": 525, "angle": 0.0}},
  {"id": 7, "class": "bulk_carrier", "conf": 0.91, "bbox_obb": {"cx": 925, "cy": 535, "angle": 0.0}},
  {"id": 8, "class": "bulk_carrier", "conf": 0.92, "bbox_obb": {"cx": 975, "cy": 525, "angle": 0.0}}
]
```
- `id`: unique index of the detected target.
- `class`: detected category (e.g., sailboat, canoe, bulk_carrier).
- `conf`: detection confidence, used for later filtering or reflection checks.
- `bbox_obb`: oriented bounding box information:
  - `cx, cy`: target center coordinates
  - `angle`: target heading angle (used in IR ship detection to distinguish headings)

What does the IR (infrared) image scene classification task mean? Give a one-sentence introduction to what this task does; the sentence will be written into the prompt to tell the model what it must accomplish.
One-sentence version (ready to drop into the prompt):

The IR (infrared) image scene classification task: **judge the high-level scene category of an infrared remote sensing image (e.g., ship scene, harbor scene, airport scene) from its overall thermal distribution and spatial structure, without localizing individual targets.**
Action:
{
"tool": "Semantic_Analyzer",
"args": {
"image_id": "IR_ship_3376_0",
"candidate_labels": ["sea", "land", "sky", "urban"],
"crop_bbox": [0, 0, 1000, 1000]
}
}
Observation:
{
"tool": "Semantic_Analyzer",
"status": "success",
"result": {
"top_label": "sea",
"top_conf": 0.82,
"attributes": ["dark background", "water surface", "single salient object"]
}
}
Action:
{
"tool": "Visual_Perception",
"args": {
"image_id": "IR_ship_3376_0",
"classes": ["ship", "vehicle", "airplane", "building"],
"modality": "ir",
"conf_thres": 0.3
}
}
Observation:
{
"tool": "Visual_Perception",
"status": "success",
"result": [
{
"id": 1,
"class": "ship",
"conf": 0.94,
"bbox_obb": {"cx": 585, "cy": 565, "w": 70, "h": 110, "angle": 10.0}
}
]
}
Explain what these two functions do, and briefly explain the input and output parameters.
Below I explain the two functions you provided, Semantic_Analyzer and Visual_Perception: what each does and what the input/output parameters mean, matching your Action / Observation examples and the context of this IR image scene classification task.

Semantic_Analyzer (semantic understanding / scene classification)

Semantic_Analyzer's function is:

Perform high-level semantic analysis on the whole image or a specified region, judging the overall scene type against a set of candidate categories and returning semantic attributes with a confidence score.

In this example it is used for IR image scene classification: deciding which of sea / land / sky / urban the whole infrared image most resembles.

Input:

```json
{
  "image_id": "IR_ship_3376_0",
  "candidate_labels": ["sea", "land", "sky", "urban"],
  "crop_bbox": [0, 0, 1000, 1000]
}
```
- `image_id`: the infrared image to analyze.
- `candidate_labels`: the candidate scene categories; Semantic_Analyzer discriminates among these labels (zero-shot or similarity matching).
- `crop_bbox`: the region to analyze; [0,0,1000,1000] means a global scene judgment over the full image.

Output:

```json
{
  "top_label": "sea",
  "top_conf": 0.82,
  "attributes": ["dark background", "water surface", "single salient object"]
}
```
- `top_label`: the predicted scene category (here, sea).
- `top_conf`: the scene-classification confidence, used to judge whether the result is reliable.
- `attributes`: auxiliary semantic attributes explaining why the scene was assigned this category (e.g., dark background, water texture, a single salient object).
Visual_Perception (visual perception / object detection)

Visual_Perception's function is:

Under the specified modality (here infrared, IR), run object detection on the image, identifying and localizing concrete target instances and outputting their category, confidence, and oriented bounding box (OBB).

In this example it is used to cross-check the scene classification result: detecting whether key targets such as ships exist, to support the "sea / ship scene" judgment.

Input:

```json
{
  "image_id": "IR_ship_3376_0",
  "classes": ["ship", "vehicle", "airplane", "building"],
  "modality": "ir",
  "conf_thres": 0.3
}
```
- `image_id`: the infrared image to detect on.
- `classes`: the target categories to detect, avoiding irrelevant detections.
- `modality`: image modality (`ir` means infrared), used to load IR-specific detector weights.
- `conf_thres`: confidence threshold for filtering low-confidence detections.

Output:

```json
[
  {
    "id": 1,
    "class": "ship",
    "conf": 0.94,
    "bbox_obb": {"cx": 585, "cy": 565, "w": 70, "h": 110, "angle": 10.0}
  }
]
```
- `id`: index of the detected target.
- `class`: target category (here, ship).
- `conf`: detection confidence, measuring how reliable the recognition is.
- `bbox_obb`: oriented bounding box information:
  - `cx, cy`: target center coordinates
  - `w, h`: target width and height
  - `angle`: target heading angle (often used for heading in IR ship imagery)

In the IR image scene classification task, the two tools form a complementary flow: Semantic_Analyzer proposes the scene label, and Visual_Perception verifies it with object-level evidence.
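This complementary flow can be sketched as a simple consistency rule for the reflection stage; the SCENE_EVIDENCE mapping, thresholds, and function name below are assumptions for illustration, not part of the tool spec:

```python
# Assumed mapping from scene labels to detection classes that would support them.
SCENE_EVIDENCE = {
    "sea": {"ship"},
    "urban": {"building", "vehicle"},
}

def confirm_scene(scene_result, detections, min_scene_conf=0.5, min_det_conf=0.5):
    """Cross-check a Semantic_Analyzer scene label against Visual_Perception output.

    Returns the scene label plus a flag for the reflection stage indicating
    whether object-level evidence supports the scene classification.
    """
    label = scene_result["top_label"]
    supported = scene_result["top_conf"] >= min_scene_conf and any(
        det["class"] in SCENE_EVIDENCE.get(label, set()) and det["conf"] >= min_det_conf
        for det in detections
    )
    return {"scene": label, "detection_supported": supported}
```

With the example above, the "sea" label (conf 0.82) plus the single high-confidence ship detection yields detection_supported=True, which is exactly the kind of cross-check the reflection stage is supposed to record.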
{
"image_id": "0047732",
"chain_of_thought": {
"planning": "Fourteen ships, many near top and left margins (e.g., [13,817,…], [0,318,…]). Requires strong edge tolerance. Set conf=0.3 to retain marginal detections without inflating FP rate.",
"tool_execution": {
"steps": [
{
"action": {
"tool_name": "Visual_Perception",
"parameters": "{'image_id': '0047732', 'classes': ['ship'], 'modality': 'SAR', 'conf': 0.3}"
},
"observation": {
"tool_name": "Visual_Perception",
"status": "success",
"result": "[{'id': 1, 'class': 'ship', 'conf': 0.95, 'bbox': [310, 599, 437, 661]}, {'id': 2, 'class': 'ship', 'conf': 0.94, 'bbox': [349, 645, 515, 716]}, {'id': 3, 'class': 'ship', 'conf': 0.96, 'bbox': [13, 817, 171, 883]}, {'id': 4, 'class': 'ship', 'conf': 0.95, 'bbox': [132, 680, 292, 739]}, {'id': 5, 'class': 'ship', 'conf': 0.93, 'bbox': [142, 83, 175, 183]}, {'id': 6, 'class': 'ship', 'conf': 0.92, 'bbox': [60, 97, 107, 183]}, {'id': 7, 'class': 'ship', 'conf': 0.91, 'bbox': [292, 173, 319, 279]}, {'id': 8, 'class': 'ship', 'conf': 0.92, 'bbox': [329, 189, 374, 294]}, {'id': 9, 'class': 'ship', 'conf': 0.90, 'bbox': [0, 318, 39, 444]}, {'id': 10, 'class': 'ship', 'conf': 0.91, 'bbox': [150, 314, 200, 442]}, {'id': 11, 'class': 'ship', 'conf': 0.92, 'bbox': [195, 310, 241, 431]}, {'id': 12, 'class': 'ship', 'conf': 0.93, 'bbox': [298, 360, 353, 507]}, {'id': 13, 'class': 'ship', 'conf': 0.94, 'bbox': [345, 359, 411, 513]}, {'id': 14, 'class': 'ship', 'conf': 0.95, 'bbox': [109, 727, 302, 788]}]"
}
}
]
},
"reflection": "All 14 boxes match, including those at absolute image boundaries (e.g., x=0, y=0). Lowest confidence (0.90) still yields correct localization, affirming operational reliability at detection limit.",
"synthesis": "Fourteen ships — including those at image borders — fully and correctly localized in SAR image '0047732'."
}
},
{
"image_id": "IR_ship_827_0",
"chain_of_thought": {
"planning": "Warship detection in IR: typically exhibits strong, compact thermal signature due to engines and radar. Planning: focus class list on 'warship', include 'liner', 'container_ship' as distractors. Modality='IR'. Confidence threshold raised to 0.45 to suppress false alarms from sea glint or cloud edges.",
"tool_execution": {
"steps": [
{
"action": {
"tool_name": "Visual_Perception",
"parameters": {
"image_id": "IR_ship_827_0",
"classes": [
"warship",
"liner",
"container_ship",
"bulk_carrier"
],
"modality": "IR",
"conf": 0.45
}
},
"observation": {
"tool_name": "Visual_Perception",
"status": "success",
"result": "[{'id': 1, 'class': 'warship', 'conf': 0.97, 'bbox_obb': {'cx': 805, 'cy': 680, 'w': 390, 'h': 180, 'angle': 0}}]"
}
}
]
},
"reflection": "Single high-confidence warship detection. OBB → HBB: cx=805±195 = [610,1000], cy=680±90 = [590,770] → [610,590,1000,770]. Matches output exactly. No competing classes detected — confirms specificity of IR warship signature.",
"synthesis": "There is 1 <|ref|>warship<|/ref|><|det|>[[610, 590, 1000, 770]]<|/det|> in the image."
}
},
{
"image_id": "16199_600_800_0_0",
"chain_of_thought": {
"planning": "Determine relationship between 'small-car' at [398,647,420,674] and nearby object at [365,622,381,643]. Both are compact, similarly sized, and spatially proximate. Likely candidates: same-category vehicle or larger vehicle (e.g., van). Steps: (1) Classify both regions using YOLOv8_OBB; (2) Compute inter-object distance and relative orientation; (3) Infer relation (e.g., park-next-to, behind, in-front) based on proximity and alignment.",
"tool_execution": {
"steps": [
{
"action": {
"tool_name": "YOLOv8_OBB",
"parameters": {
"image_id": "16199_600_800_0_0",
"regions": [
{
"category": "small-car",
"oriented_bbox": [
398,
647,
420,
674
]
},
{
"category": "unknown",
"oriented_bbox": [
365,
622,
381,
643
]
}
]
}
},
"observation": {
"tool_name": "YOLOv8_OBB",
"status": "success",
"result": "[{'region_id': 'subj', 'class': 'small-car', 'conf': 0.97, 'bbox_obb': {'cx': 409.0, 'cy': 660.5, 'w': 22.0, 'h': 27.0, 'angle': 0.0}}, {'region_id': 'obj', 'class': 'van', 'conf': 0.89, 'bbox_obb': {'cx': 373.0, 'cy': 632.5, 'w': 16.0, 'h': 21.0, 'angle': 0.0}}]"
}
},
{
"action": {
"tool_name": "GDAL-Shapely",
"parameters": {
"operation": "distance_and_bearing",
"p1": [
409.0,
660.5
],
"p2": [
373.0,
632.5
]
}
},
"observation": {
"tool_name": "GDAL-Shapely",
"status": "success",
"result": {
"pixel_distance": 46.2,
"topology": "disjoint",
"bearing": "North-West"
}
}
}
]
},
"reflection": "YOLOv8_OBB confirms subject = 'small-car' (conf=0.97), object = 'van' (conf=0.89). Distance is ~46 px — close but non-overlapping ('disjoint'), and bearing is North-West, suggesting the small-car is positioned diagonally forward-left of the van. In parking contexts, such proximity without occlusion or road alignment implies intentional side-by-side or angled parking — best captured by 'park-next-to'. No evidence of traffic flow (e.g., directionality, lane markings), reinforcing static arrangement.",
"synthesis": "A small-car and a van are closely spaced, disjoint, and similarly oriented — characteristic of adjacent parking in lots or streets. The semantic relation 'park-next-to' accurately reflects this co-located, static vehicular arrangement. Final answer: subject: small-car, object: van, the small-car is <|rel|>park-next-to<|/rel|> the van."
}
},
What problems do the chains of thought above have?
The main problems with these CoTs fall into four categories: (A) inconsistency with the tool specification, (B) writing unavailable GT/external information into planning/reflection, (C) tool_execution JSON structures that resist validation, and (D) insufficient geometric/semantic evidence or overconfident wording in the reasoning chain. Below I go through the three CoTs you provided, name the issues, and suggest how to make them more trainable and auditable.
Your file specifies the tools as follows:
Visual_Perception supports multiple modalities (optical/SAR/IR); its input includes modality and conf_thres, and its output is a list whose items contain id/class/conf/bbox_obb (with angle). The spatial-computation tool outputs fields such as value/unit/relation/direction. (任务的汇总 - 副本.md)
- The tool name Visual_Perception is fine, but modality is written as "SAR" (uppercase) while the spec examples use "sar" (lowercase). Strict schema or enum validation would fail this outright. (任务的汇总 - 副本.md)
- parameters is written as a string, "{'image_id':..., 'classes':..., ...}", instead of a JSON object.
- The output uses "bbox": [x1,y1,x2,y2], but your tool spec's output is "bbox_obb": {"cx","cy","angle"...} (containing at least angle; most examples also use bbox_obb). (任务的汇总 - 副本.md)

How to fix: make parameters a JSON object, lowercase the modality enum, and emit bbox_obb; if a horizontal box is also needed, add an explicit bbox_hbb field so the model never learns that the same tool emits two completely different schemas.
- observation.result is a list in string form, "[{'id':1,...}]", rather than structured JSON.
- modality: "IR" is uppercase; the spec example is "ir". (任务的汇总 - 副本.md)
- The reflection derives [610,590,1000,770] from w/h, but the tool output contains no HBB, yet you "verify" it with "matches output exactly". That is proving your own derived intermediate value with itself; there is no external verifiability.
- The final answer uses an HBB of the form <|det|>[[x1,y1,x2,y2]], while the tool spec leans toward OBB (at least including angle). You need to decide at the tool level whether the output is OBB, HBB, or both. (任务的汇总 - 副本.md)

How to fix:
Having Visual_Perception output bbox_obb (with w/h/angle) is fine; if the task needs an HBB (the text output uses det boxes), convert OBB to HBB in synthesis, but do not make strong assertions like "matches exactly" in reflection. The third CoT is closer to a "trainable" style than the first two, but it still has key problems:
- The observation invents region_id: 'subj'/'obj' that the input regions never declared, so input-output alignment is loose and later automatic validation becomes painful. If you want stable IDs, declare them in the request, e.g. {"region_id":"subj", ...}, {"region_id":"obj", ...}.
- The second step returns {pixel_distance, topology, bearing}, while your spatial-computation tool (called Geo_Calculator in the file) specifies outputs shaped like {value, unit, relation, direction}. (任务的汇总 - 副本.md) If you keep GDAL-Shapely as the tool name, at least unify the field names (e.g. relation instead of topology, direction instead of bearing, plus unit).
- From the geometry alone, the safest relation is close-to or near, not park-next-to.

Incidentally, this CoT's pixel_distance=46.2 is self-consistent: dx=-36, dy=-28, Euclidean distance ≈45.6, so 46.2 is within reasonable error; the bearing "North-West" is also consistent with dx<0, dy<0 (with y pointing down, decreasing y is North). The problems are mainly the schema and the strength of the semantic mapping.

How to fix:
- Use Visual_Perception to detect parking-lot or road (if the task allows), or use a semantic tool to confirm the "parking context"; otherwise downgrade the relation to close-to/near so the semantic label matches the strength of the evidence.
- Normalize the modality enum casing (sar/ir/optical). (任务的汇总 - 副本.md)
- Unify bbox vs bbox_obb. (任务的汇总 - 副本.md)
- Prefer close-to unless contextual evidence exists; do not force park-next-to.

If you like, I can rewrite these three CoTs one by one into versions that fully comply with your 6-tool spec and are machine-checkable (key points: unify modality, unify the bbox field, remove GT leakage, and change results from strings to structured JSON).
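To make these fixes enforceable rather than advisory, each tool_execution step can be run through a small validator that rejects exactly the failure modes above. A minimal sketch, assuming the field names discussed (parameters as an object, lowercase modality enum, per-instance detections carrying id/class/conf/bbox_obb); the exact keys are my reading of the spec, not your actual file:

```python
# Sketch of an automatic schema check for one tool_execution step.
ALLOWED_MODALITIES = {"optical", "sar", "ir"}  # assumed lowercase enum
OBB_KEYS = {"cx", "cy", "angle"}               # minimal bbox_obb fields

def validate_step(step):
    """Return a list of human-readable violations for one step."""
    errors = []
    params = step["action"]["parameters"]
    if not isinstance(params, dict):
        # e.g. parameters serialized as "{'image_id': ...}"
        return ["parameters must be a JSON object, not a string"]
    modality = params.get("modality")
    if modality is not None and modality not in ALLOWED_MODALITIES:
        errors.append(f"modality {modality!r} not in {sorted(ALLOWED_MODALITIES)}")
    result = step["observation"].get("result")
    if isinstance(result, str):
        # e.g. result = "[{'id': 1, ...}]"
        errors.append("observation.result is a string; must be structured JSON")
    elif isinstance(result, list):
        for det in result:
            missing = {"id", "class", "conf", "bbox_obb"} - det.keys()
            if missing:
                errors.append(f"detection missing fields: {sorted(missing)}")
            elif OBB_KEYS - det["bbox_obb"].keys():
                errors.append("bbox_obb missing cx/cy/angle")
    return errors
```

Running this over every step before a sample enters the dataset turns "schema drift" from a style note into a hard filter.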
Are the tool-call results inside these CoTs correct?
Conclusion first, then item-by-item explanation (this is what you care about most right now):
Only some of the tool-call results in these CoTs are numerically/logically self-consistent. Overall:
👉 Most of them look real in form but are not strictly correct: there are verifiable calculation errors, spec inconsistencies, and "using the result to prove the result" problems.
Below I analyze only from the angle of "are the tool-call results correct", without repeating the structural and stylistic issues already covered.
| CoT sample | Tool results correct? | Main problem type |
|---|---|---|
| SAR ship detection (0047732) | ❌ Not strictly correct | Inconsistent output schema, GT leakage |
| IR warship detection (IR_ship_827_0) | ⚠️ Numbers basically reasonable, logic unsound | Self-verification via derived values |
| Relation reasoning (16199_600_800_0_0) | ⚠️ Geometry basically correct, semantics over-extended | bearing/distance OK, relation too strong |
Item by item: what is right, what is wrong, and why.
```json
[ {'id': 1, 'class': 'ship', 'conf': 0.95, 'bbox': [310, 599, 437, 661]}, ... ]
```
According to the tool table you provided (YOLOv8-OBB kernel), the output must be bbox_obb, containing at least cx, cy, angle. But what is output here is:
```json
"bbox": [x1,y1,x2,y2]
```
👉 This is not a minor issue; it is a wrong tool-output definition. If the model learns this output, it can never be aligned with the real tool later.
- planning: "Fourteen ships …"
- reflection: "All 14 boxes match …"
⚠️ The tool itself never returned the fact that "there should be 14". This is manually injected GT information, not a tool-call result.
👉 So:
📌 Conclusion: the detection numbers in this CoT look reasonable, but the wrong output schema plus GT-based back-verification mean the tool results cannot be considered "correct".
```json
{ "class": "warship", "conf": 0.97, "bbox_obb": {"cx": 805, "cy": 680, "w": 390, "h": 180, "angle": 0} }
```
👉 Looking at the tool output alone: it is plausible.
```text
OBB → HBB: cx=805±195 = [610,1000]
Matches output exactly
```
⚠️ The problem is: deriving the HBB from the OBB, then declaring "matches exactly". That is circular self-verification, not a tool check.
👉 Strictly speaking:
📌 Conclusion: the tool output itself has no obvious numerical error, but the reflection's argument for the tool's correctness is invalid.
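The OBB→HBB arithmetic itself is easy to reproduce; the point is that this check belongs in an external verification script, not in the model's own reflection text. A minimal sketch (general angles handled via the axis-aligned hull of the rotated rectangle; for angle=0 it reduces to cx±w/2, cy±h/2):

```python
import math

def obb_to_hbb(cx, cy, w, h, angle_deg=0.0):
    """Axis-aligned bounding box enclosing a rotated rectangle.

    angle_deg is the rectangle's rotation in degrees; half-extents of the
    enclosing box are the projections of w and h onto the image axes.
    """
    a = math.radians(angle_deg)
    half_w = (abs(w * math.cos(a)) + abs(h * math.sin(a))) / 2.0
    half_h = (abs(w * math.sin(a)) + abs(h * math.cos(a))) / 2.0
    return [cx - half_w, cy - half_h, cx + half_w, cy + half_h]

# For the warship OBB above: obb_to_hbb(805, 680, 390, 180)
# → [610.0, 590.0, 1000.0, 770.0], matching the <|det|> box.
```

An external checker comparing this value against the synthesis box gives real verification; a "matches exactly" sentence inside the CoT gives none.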
Of the three, this is the one whose tool results come closest to being correct; let's look carefully.
```text
small-car: conf=0.97
van: conf=0.89
```
✅ Classification results OK
```text
p1 = [409.0, 660.5]
p2 = [373.0, 632.5]
```
Verify by computation. What you gave is:
```json
"pixel_distance": 46.2
"bearing": "North-West"
```
👉 This is self-consistent (dx=-36, dy=-28, Euclidean distance ≈45.6; 46.2 is within reasonable error).
✅ The geometric computation results are correct
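This re-derivation can be scripted so every distance/bearing observation is checked the same way. A sketch, assuming pixel coordinates with y increasing downward (so decreasing y is treated as North, as the CoT does):

```python
import math

def distance_and_bearing(p1, p2):
    """Euclidean pixel distance from p1 to p2 plus a coarse compass bearing.

    Image convention: y grows downward, so dy < 0 means "North".
    """
    dx, dy = p2[0] - p1[0], p2[1] - p1[1]
    ns = "North" if dy < 0 else ("South" if dy > 0 else "")
    ew = "West" if dx < 0 else ("East" if dx > 0 else "")
    bearing = "-".join(s for s in (ns, ew) if s) or "co-located"
    return math.hypot(dx, dy), bearing
```

For p1=[409.0, 660.5], p2=[373.0, 632.5] this yields ≈45.6 px and "North-West", so the tool's 46.2 carries roughly 0.6 px of error, within the tolerance noted above.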
The tool results only support proximity ("close and disjoint"), but the semantic relation you output is:
```text
park-next-to
```
⚠️ This is semantic over-extension:
👉 From the tool results, you can only safely conclude:
close-to / near

📌 Conclusion: the tool-call results (the numbers) are correct, but the use of the semantic relation exceeds what the tool evidence supports.
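One way to make the downgrade mechanical rather than stylistic is to fix the geometry-to-semantics mapping as a rule, so context-dependent labels like park-next-to are simply unreachable from geometry alone. A sketch (the 60 px threshold is my illustrative assumption, not from your spec):

```python
def geometric_relation(pixel_distance, topology, close_thresh=60.0):
    """Weakest semantic relation supported by geometry alone.

    Topological atoms pass through unchanged; for disjoint objects only a
    distance-based 'close-to' is allowed. 'park-next-to' requires context
    evidence (parking lot, road, alignment) and is deliberately not emitted.
    """
    if topology in ("intersects", "contains", "within"):
        return topology
    return "close-to" if pixel_distance <= close_thresh else "disjoint"
```

Under this rule, the 46.2 px disjoint pair above maps to close-to; upgrading to park-next-to would then require an explicit extra tool step (e.g. detecting a parking context), which keeps the evidence chain closed.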
Across these CoTs:
- ❌ SAR ship detection: wrong tool-output schema + GT-based back-verification → incorrect
- ⚠️ IR warship: tool numbers reasonable, but the reflection logic does not hold → not strictly correct
- ⚠️ Relation reasoning: geometric tool results correct, but the semantic relation is too strong → partially correct
Overall assessment:
👉 They look like tool results, but do not yet meet the bar of "usable as supervision signals for real tools".
If you like, the next very practical step I can take for you is: writing "when does a tool result count as correct" into an automatic validation checklist / rule set (for example: when park-next-to may be used versus when it must be downgraded to close-to). Once this step is done, the quality of your entire CoT dataset will jump a level.
{
"image_id": "IR_ship_4753_1",
"chain_of_thought": {
"planning": "Same as IR_ship_4753_0. Duplicate content.",
"tool_execution": {
"steps": [
{
"action": {
"tool_name": "Visual_Perception",
"parameters": {
"image_id": "IR_ship_4753_1",
"classes": [
"bulk_carrier",
"ship"
],
"modality": "IR",
"conf": 0.4
}
},
"observation": {
"tool_name": "Visual_Perception",
"status": "success",
"result": {
"detections": [
{
"class": "bulk_carrier",
"conf": 0.92,
"count": 2
}
]
}
}
}
]
},
"reflection": "Identical results: sea background, two bulk carriers. Scene is Ship.",
"synthesis": "Same as IR_ship_4753_0: two bulk carriers on sea. Scene category is Ship."
}
},
{
"image_id": "12475_800_800_400_0",
"chain_of_thought": {
"planning": "Region [[738,591,975,790]] analysis entails: 1. Position estimation (lower-right quadrant); 2. Building detection; 3. Proximity-based relation inference.",
"tool_execution": {
"steps": [
{
"action": {
"tool_name": "Geo_Calculator",
"parameters": {
"operation": "locate_region",
"region_bbox": [
738,
591,
975,
790
],
"image_size": [
800,
800
]
}
},
"observation": {
"tool_name": "Geo_Calculator",
"status": "success",
"result": {
"location_desc": "lower right",
"coverage": "7.3%"
}
}
},
{
"action": {
"tool_name": "Visual_Perception",
"parameters": {
"image_id": "12475_800_800_400_0",
"classes": ["building"],
"modality": "optical"
}
},
"observation": {
"tool_name": "Visual_Perception",
"status": "success",
"result": [
{
"id": 1,
"class": "building",
"bbox_obb": {
"cx": 783,
"cy": 726.5,
"angle": 0
}
},
{
"id": 2,
"class": "building",
"bbox_obb": {
"cx": 870,
"cy": 654.5,
"angle": 0
}
}
]
}
},
{
"action": {
"tool_name": "Geo_Calculator",
"parameters": {
"operation": "topology",
"obj_id_A": 1,
"obj_id_B": 2
}
},
"observation": {
"tool_name": "Geo_Calculator",
"status": "success",
"result": {
"relation": "adjacent",
"semantic_implication": "close-to"
}
}
}
]
},
"reflection": "1. Region is compact and near bottom-right corner. 2. Two buildings with overlapping horizontal extents suggest side-by-side placement. 3. Adjacency metric confirms 'close-to' semantics bidirectionally.",
"synthesis": "Step 1: Locate the target area: The target area locates at lower right of the image; Step 2: Perform object detection: There are 2 objects in the target area, including: 2 buildings; Step 3: Perform relation analysis: The two buildings are close-to each other."
}
}
What problems do these two chains of thought have?
Below I discuss only the tool-call/tool-result problems of these two CoTs (not writing style), checked against the tool spec in your file: Visual_Perception should output a per-instance detection list (with id/class/conf/bbox_obb (angle)), and Geo_Calculator output should center on value/unit/relation/direction, where relation is a topological relation (e.g. intersects/disjoint/contains/within). (任务的汇总 - 副本.md)
- planning: "Same as IR_ship_4753_0. Duplicate content."
- reflection/synthesis: "Identical results…"

This teaches the dataset a non-generalizable shortcut: on seeing a similar ID, reuse the conclusion directly instead of deriving it from the tool results.
Visual_Perception's output schema does not match the definition. Your current observation is:
```json
"result": { "detections": [ { "class": "bulk_carrier", "conf": 0.92, "count": 2 } ] }
```
But in your tool table, Visual_Perception's standard output is a per-instance list in which each instance contains id/class/conf/bbox_obb (at least cx/cy/angle). (任务的汇总 - 副本.md)
In other words, you have turned "detection" into a "count summary", which means:
- No boxes are available for downstream grounding outputs (<|det|>[[...]]).
- The parameter is named conf: 0.4, but the spec uses conf_thres. (任务的汇总 - 副本.md)
- modality: "IR" is uppercase; the spec example is "ir" (unify to a lowercase enum for easier validation). (任务的汇总 - 副本.md)
- You never call the scene-semantics tool (Semantic_Analyzer in the file, used for zero-shot scene classification). (任务的汇总 - 副本.md)
From "bulk_carrier count=2" alone you can guess the Ship scene, but in a strict evidence chain this counts as missing evidence.
Geo_Calculator(locate_region)'s coverage value is basically correct. Bbox width/height: (975-738)=237, (790-591)=199; area 237*199=47163. The full image is 800*800=640000, so the ratio is 47163/640000≈7.37%; writing 7.3% is reasonable. ✅
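This coverage/location check can be scripted once and reused for every locate_region observation. A sketch, with the quadrant label derived from the region center relative to the image center (an assumption about how "lower right" is meant):

```python
def locate_region(bbox, image_size):
    """Quadrant label and percentage coverage of an axis-aligned region.

    bbox = [x1, y1, x2, y2]; image_size = [width, height].
    """
    x1, y1, x2, y2 = bbox
    w_img, h_img = image_size
    coverage_pct = round((x2 - x1) * (y2 - y1) / (w_img * h_img) * 100, 2)
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    vert = "upper" if cy < h_img / 2 else "lower"
    horiz = "left" if cx < w_img / 2 else "right"
    return f"{vert} {horiz}", coverage_pct
```

For the region above, locate_region([738, 591, 975, 790], [800, 800]) returns ("lower right", 7.37), agreeing with the tool's "lower right" and ~7.3% coverage.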
Visual_Perception output is missing fields (non-compliant). You output:
```json
{"id":1,"class":"building","bbox_obb":{"cx":783,"cy":726.5,"angle":0}}
```
But the spec requires conf. bbox_obb should at minimum stably contain cx/cy/angle (which you have), yet the missing conf makes reliability impossible to judge in the reflection stage. (任务的汇总 - 副本.md)
You also never pass conf_thres, which the spec treats as a standard default field (add it to avoid style drift). (任务的汇总 - 副本.md)
Geo_Calculator(operation="topology")'s return value is non-compliant. You return:
```json
{"relation":"adjacent","semantic_implication":"close-to"}
```
But your tool table's spatial-computation output looks more like:
```json
{"value":..., "unit":"meters", "relation":"intersects", "direction":"North-East"}
```
Here relation is a topological relation (intersects/disjoint/contains…), whereas adjacent is more the result of a "semantic relation / threshold judgment" and should not be an atomic output of topology. (任务的汇总 - 副本.md)
Worse: by putting semantic_implication directly into the spatial tool's output, the tool performs the "semantic mapping" on the model's behalf, so the model never learns "how to go from geometry to semantics" in the reflection/synthesis stages.
You say:
“Two buildings with overlapping horizontal extents suggest side-by-side placement.”
But the tool output contains only the two center points (cx/cy), with no w/h or HBB, so "overlapping horizontal extents" cannot be derived from the tool results (the evidence chain is broken).
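For completeness, here is what the overlap claim would actually require: an interval test over each box's horizontal extent, which needs w. The widths below are hypothetical, since the tool returned none:

```python
def horizontal_overlap(obb_a, obb_b):
    """Overlap (in px) of the horizontal extents of two boxes.

    Requires cx AND w for each box; centers alone cannot answer
    whether the extents overlap.
    """
    a_lo, a_hi = obb_a["cx"] - obb_a["w"] / 2, obb_a["cx"] + obb_a["w"] / 2
    b_lo, b_hi = obb_b["cx"] - obb_b["w"] / 2, obb_b["cx"] + obb_b["w"] / 2
    return max(0.0, min(a_hi, b_hi) - max(a_lo, b_lo))

# With only cx=783 and cx=870 known, any w is a guess; e.g. assuming
# hypothetical widths of 40 px each gives zero overlap:
# horizontal_overlap({"cx": 783, "w": 40}, {"cx": 870, "w": 40}) == 0.0
```

So the reflection's "overlapping horizontal extents" claim can only be made verifiable by first fixing the Visual_Perception output to include w/h.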
- CoT 1 (IR_ship_4753_1): Visual_Perception output turned into a "count summary", reasoning replaced by "duplicate/reuse", and the scene conclusion lacks Semantic_Analyzer evidence.
- CoT 2 (12475_800_800_400_0): Visual_Perception is missing conf; Geo_Calculator(topology) stuffs the semantic "close-to" into the tool result and uses a non-topological atom (adjacent) for relation; the reflection also cites information the tool never provided.

If you wish, I can give each of these two a corrected version that strictly matches your tool schema with a closed evidence loop (same task meaning, but every step auditable/checkable).