I'm using anomalib for foreign-object detection on in-vehicle images, labeling data that contains a target as anomalous. I currently have a normal dataset of 8000...

Created on: December 26, 2025

Answered using GPT-5.2 Thinking via Chat01

Question

I'm using anomalib for foreign-object detection on in-vehicle images, labeling images that contain a target as anomalous. I have 8000 normal training images, 200+ abnormal images, and 400+ normal test images. My fastflow config file is as follows:

```yaml
dataset:
  name: mvtec
  format: folder
  path: /media/jd/4997BB1603CFE2C4/lw/liwei23/anomalib/dataset
  normal_dir: normal # subfolder with the normal (positive) samples of the custom dataset
  abnormal_dir: abnormal0 # subfolder with the abnormal (negative) samples of the custom dataset
  mask_dir: label_mask0 # path to binary masks; custom datasets usually have none, set null
  category: bottle
  normal_test_dir: normal_test0 # name of the folder containing normal test images.
  task: segmentation
  extensions: null
  train_batch_size: 16
  eval_batch_size: 16
  split_ratio: 0.3 # ratio of the normal images that will be used to create a test split
  num_workers: 8
  image_size: 256 # dimensions to which images are resized (mandatory)
  center_crop: null # dimensions to which images are center-cropped after resizing (optional)
  normalization: imagenet # data distribution to which the images will be normalized: [none, imagenet]
  transform_config:
    train: null
    eval: null
  test_split_mode: from_dir # options: [from_dir, synthetic]
  test_split_ratio: 0.2 # fraction of train images held out for testing (usage depends on test_split_mode)
  val_split_mode: same_as_test # options: [same_as_test, from_test, synthetic]
  val_split_ratio: 0.5 # fraction of train/test images held out for validation (usage depends on val_split_mode)
  tiling:
    apply: false
    tile_size: null
    stride: null
    remove_border_count: 0
    use_random_tiling: False
    random_tile_count: 16

model:
  name: fastflow
  backbone: resnet18 # options: [resnet18, wide_resnet50_2, cait_m48_448, deit_base_distilled_patch16_384]
  pre_trained: true
  flow_steps: 8 # options: [8, 8, 20, 20] - for each supported backbone
  hidden_ratio: 1.0 # options: [1.0, 1.0, 0.16, 0.16] - for each supported backbone
  conv3x3_only: True # options: [True, False, False, False] - for each supported backbone
  lr: 0.001
  weight_decay: 0.00001
  early_stopping:
    patience: 3
    metric: pixel_AUROC
    mode: max
  normalization_method: min_max # options: [null, min_max, cdf]

metrics:
  image:
    - F1Score
    - AUROC
  pixel:
    - F1Score
    - AUROC
    - AUPRO
  threshold:
    method: adaptive # options: [adaptive, manual]
    manual_image: null
    manual_pixel: null

visualization:
  show_images: False # show images on the screen
  save_images: True # save images to the file system
  log_images: True # log images to the available loggers (if any)
  image_save_path: ./results/1217fastflow/fastflow/mvtec/run/images # path to which images will be saved
  mode: simple # options: ["full", "simple"]

project:
  seed: 42
  path: ./results/1217fastflow

logging:
  logger: [tensorboard, csv]
  log_graph: true

optimization:
  export_mode: null # options: torch, onnx, openvino

# PL Trainer Args. Don't add extra parameter here.
trainer:
  enable_checkpointing: true
  default_root_dir: null
  gradient_clip_val: 0
  gradient_clip_algorithm: norm
  num_nodes: 1
  devices: 1
  enable_progress_bar: true
  overfit_batches: 0.0
  track_grad_norm: -1
  check_val_every_n_epoch: 1 # Don't validate before extracting features.
  fast_dev_run: false
  accumulate_grad_batches: 1
  max_epochs: 500
  min_epochs: null
  max_steps: -1
  min_steps: null
  max_time: null
  limit_train_batches: 1.0
  limit_val_batches: 1.0
  limit_test_batches: 1.0
  limit_predict_batches: 1.0
  val_check_interval: 1.0 # Don't validate before extracting features.
  log_every_n_steps: 50
  accelerator: auto # <"cpu", "gpu", "tpu", "ipu", "hpu", "auto">
  strategy: null
  sync_batchnorm: false
  precision: 32
  enable_model_summary: true
  num_sanity_val_steps: 0
  profiler: null
  benchmark: false
  deterministic: false
  reload_dataloaders_every_n_epochs: 0
  auto_lr_find: false
  replace_sampler_ddp: true
  detect_anomaly: false
  auto_scale_batch_size: false
  plugins: null
  move_metrics_to_cpu: false
  multiple_trainloader_mode: max_size_cycle
```

Classification performance is currently poor:

image_AUROC 0.6587263345718384
image_F1Score 0.6217764616012573
pixel_AUPRO 0.4064096212387085
pixel_AUROC 0.8441610336303711
pixel_F1Score 0.11360228806734085

How can I first improve the image-level classification metric (image_AUROC) without changing the network architecture? Please give specific suggestions.

Thinking

Improving classification task setup

The provided doc includes a config example for classification tasks using normal and abnormal directories, as well as the proper split ratio. It's a good source for referencing split modes and tasks.

For improved accuracy, they might want to specify Engine task: classification in the config. Since their dataset has a segmentation task, switching to classification may improve image-level results, but could sacrifice localization. We’ll cite relevant validation info too about how abnormal images during validation help normalize scores.

Suggesting a classification focus for image AUROC improvement

Since they’re focused on improving image classification (AUROC), a good first step could be setting task: classification in the config and evaluating image AUROC. The doc shows the difference. FastFlow can indeed be used for classification, and this could simplify evaluation since segmentation involves more pixel-level work. Once image classification is stable, they can revert to segmentation or adjust the aggregator.

Considering performance and dataset adjustments

The classification task might perform better by using pred_score directly since FastFlow outputs maps, but the classification score could be similar. It’s a bit tricky. Also, if their dataset includes masks, they can skip them for classification, as masks aren’t needed. Setting the early stopping patience to 10 instead of 3 would help. Tiling docs and stacking might be useful too. I'll suggest these config adjustments based on their needs.

Clarifying steps for improvement

To start, I’ll verify the setup. Then, for Step 1, I’ll suggest using the classification task and fixing the splits. If they are using a normal_test_dir like their setup, they should set normal_split_ratio: 0. That way, everything should align better for their needs.

Clarifying the approach

To start, let's verify everything.

For Step 1, use the classification task and fix splits: set normal_split_ratio: 0 when using normal_test_dir, to avoid mixing training data. Then, for Step 2, either increase image_size or enable tiling, with specific config noted. In Step 3, consider using center_crop for ROI, and add augmentations in Step 4.

Thought for 4m 5s

Answer

What you're seeing is quite typical: pixel_AUROC is decent (0.84) but image_AUROC is low (0.66). That usually means the model assigns high scores locally on anomalous images, but normal images also produce plenty of spiky high-score noise, so the image-level scores (pred_score) of anomalous and normal images overlap in ranking and AUROC can't rise.

Below is a prioritized set of measures (highest priority first) that don't change the network architecture (FastFlow's core structure stays intact); all of them are data, config, or post-processing changes:


1) First make the data split deterministic, so the split can't eat into your training set

You already have a dedicated normal_test_dir (normal_test0). In that case, set normal_split_ratio/split_ratio to 0 so training uses only normal_dir while testing uses normal_test_dir + abnormal_dir. This avoids the distribution mess of carving another chunk out of normal_dir for test/val.

The official Folder-dataset example does exactly this: when a normal_test_dir exists, set normal_split_ratio: 0 and use test_split_mode: from_dir. (Anomalib Documentation)

The corresponding changes in your config (a minimal sketch follows below):

  • split_ratio: 0 (or null, depending on the field name in your anomalib version)
  • keep test_split_mode: from_dir
  • keep normal_test_dir: normal_test0
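
For example, the relevant dataset fields could end up looking like this (a minimal sketch; whether the field is named split_ratio or normal_split_ratio depends on your anomalib version):

```yaml
dataset:
  normal_dir: normal
  abnormal_dir: abnormal0
  normal_test_dir: normal_test0
  split_ratio: 0            # or normal_split_ratio: 0, depending on version
  test_split_mode: from_dir # test data comes only from the dedicated folders
```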

This step alone can sometimes lift image_AUROC by a good margin (especially in vehicle scenarios where the normal distribution varies a lot).


2) Increase resolution / enable tiling (the biggest win for small foreign objects)

With image_size: 256, small foreign objects in vehicle images are easily wiped out by downsampling, so image-level scores can't separate the classes.

Option A: simply increase image_size

  • Try image_size: 512 first (fall back if you run out of GPU memory)

Option B: enable tiling (recommended)

Anomalib's tiling exists precisely for small defects/objects: it cuts large images into tiles via tile_size/stride, processes them, and stitches the results back together. The docs explain that the main tiling parameters are tile_size and stride, and that tiles are stacked batch-wise by default, so memory usually doesn't explode as long as you don't shrink the batch. (Anomalib Documentation)

Suggested starting config:

```yaml
dataset:
  image_size: 512   # or keep the original aspect ratio and resize afterwards
  tiling:
    apply: true
    tile_size: 256
    stride: 128     # stride < tile_size means overlapping tiles, usually more stable
    remove_border_count: 0
    use_random_tiling: False
    random_tile_count: 16
```

⚠️ Note: some older anomalib versions additionally require a tiler callback for tiling to actually take effect (otherwise you think it's on, but nothing is being tiled). People have hit exactly this in community discussions. (GitHub)
(If you use the pipeline from the newer docs, it's usually configured automatically; older versions need the callback added manually, e.g. as sketched below.)
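
For older 0.x versions, wiring in the callback could look roughly like this (a sketch; verify the import path and constructor signature against your installed anomalib before relying on it):

```python
import pytorch_lightning as pl
from anomalib.utils.callbacks import TilerConfigurationCallback  # 0.x location

callbacks = [
    TilerConfigurationCallback(
        enable=True,    # without enable=True the tiler is never attached to the model
        tile_size=256,
        stride=128,     # stride < tile_size gives overlapping tiles
    )
]
trainer = pl.Trainer(max_epochs=500, callbacks=callbacks)
```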


3) Use ROI/cropping to remove regions you don't care about (critical for vehicle data)

In vehicle footage, the sky, distant buildings, reflections, wipers, water streaks, dashboard edges, and so on all produce high responses in normal images. They directly inflate the image scores of normal frames and drag image_AUROC down.

You can handle this without touching the network (see the sketch after this list):

  • If foreign objects only appear in a certain region (say the lower-middle windshield or the hood area):
    • use center_crop to crop to the ROI (or crop the images offline yourself)
  • Once the ROI is stable, image_AUROC often improves dramatically (far fewer noise sources in the normal scores)
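
A minimal offline-cropping sketch with PIL; the folder names and crop box are placeholders you would tune for your own camera layout:

```python
from pathlib import Path
from PIL import Image

SRC = Path("dataset/normal")       # hypothetical input folder
DST = Path("dataset/normal_roi")   # hypothetical output folder
DST.mkdir(parents=True, exist_ok=True)

ROI = (0, 400, 1920, 1080)         # (left, upper, right, lower); placeholder values

for img_path in SRC.glob("*.jpg"):
    Image.open(img_path).crop(ROI).save(DST / img_path.name)
```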

4) Add augmentations that can't fake anomalies, so the normal distribution covers vehicle lighting/weather variation

FastFlow belongs to the family of methods that only learn the normal distribution. Vehicle scenes vary hugely in lighting, exposure, rain/snow, and night-time noise; if your normal set doesn't cover that range, the model will treat such normal variation as anomalous, which directly hurts image_AUROC.

Anomalib supports custom augmentation by pointing dataset.transform_config at an Albumentations YAML. (GitHub)
Albumentations can also serialize an augmentation pipeline to a YAML file. (Albumentations)

Recommended augmentation directions (lightweight, imaging-variation style):

  • RandomBrightnessContrast / Gamma / HueSaturationValue (keep them mild)
  • a little GaussianNoise / MotionBlur (very mild)
  • slight affine transforms (small rotations/translations), provided your ROI alignment allows it

Principle: augmentations should simulate normal in-vehicle variation. Don't use RandomCrop in a way that crops foreign objects in or out (it introduces label noise / distribution shift).


5) Change the early-stopping monitor to image_AUROC (right now you're selecting checkpoints by a pixel metric)

You currently have:

```yaml
early_stopping:
  metric: pixel_AUROC
```

This favors checkpoints with better pixel-level localization, but what you care about most is image_AUROC. Change it to:

```yaml
early_stopping:
  metric: image_AUROC
  mode: max
  patience: 10 # vehicle data fluctuates a lot; 3 is usually too aggressive
```

6) (Very effective, but needs a small post-processing change) Switch the image score from "max" to a top-k mean / percentile

Many anomaly-segmentation pipelines compute the image score as the max of the anomaly map, which is extremely sensitive to a single noisy spike in a normal image. Your pattern of "pixel_AUROC is fine but image_AUROC is poor" looks exactly like this problem.

Anomalib's post-processing layer is exactly where thresholding/normalization/output formatting happens, and it supports custom post-processors. (Anomalib Documentation)

The idea (no network change, only score aggregation; see the sketch after this list):

  • pred_score = mean(top 1% pixels) or p99(anomaly_map)
    That way, scattered spikes in a normal image won't push the whole image to a high anomaly score, and image_AUROC usually rises.
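
A minimal sketch of such an aggregation in plain PyTorch (where to hook it in, e.g. a custom post-processor or an override of the model's score computation, depends on your anomalib version):

```python
import torch

def image_score_topk(anomaly_map: torch.Tensor, ratio: float = 0.01) -> torch.Tensor:
    """Mean of the top `ratio` fraction of pixel scores instead of the global max."""
    flat = anomaly_map.flatten(-2)              # [..., H*W]
    k = max(1, int(flat.shape[-1] * ratio))
    return flat.topk(k, dim=-1).values.mean(dim=-1)

# demo on a dummy anomaly map
amap = torch.rand(1, 256, 256)
print(image_score_topk(amap))                   # robust to single-pixel spikes
print(torch.quantile(amap.flatten(), 0.99))     # p99 alternative
```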

7) The shortcut if you "just want image-level classification working first": switch the task to classification

If your immediate goal is image_AUROC, anomalib's Folder dataset supports task: classification out of the box (no masks needed). The docs explicitly distinguish the classification and segmentation task configurations. (Anomalib Documentation)
You can first push image_AUROC up with the classification task, then return to segmentation to optimize localization.
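
A minimal sketch of the switch (only the fields that change; classification ignores masks):

```yaml
dataset:
  task: classification # image-level only
  mask_dir: null       # masks are not needed for classification
```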


A few extra reminders (don't skip these)

  • With only ~200 abnormal images, AUPR often reflects real-world performance better than AUROC (AUROC can "look fine" while producing many false alarms in production). Anomalib's metrics docs also note that AUPR suits imbalanced data better. (Anomalib Documentation)
  • Abnormal samples in the validation stage are useful for score normalization / threshold search (especially since you also track F1). The docs mention that anomalous validation images are used for normalization and threshold finding. (Anomalib Documentation)

My suggested order (least time, most likely to raise image_AUROC)

  1. split_ratio/normal_split_ratio → 0 (make sure nothing gets split off by accident)
  2. image_size → 512, or enable tiling directly (apply: true)
  3. Add ROI cropping (remove sky/edges/wipers etc.)
  4. Add imaging-variation augmentations to cover the normal distribution
  5. Switch early stopping to monitor image_AUROC
  6. If that's still not enough, do top-k/percentile image-score aggregation (post-processing only, no network change)

If you're willing to share the normal images with the highest false-alarm scores and the missed anomalous images (5 of each, screenshots/examples), I can help judge whether this is a resolution/ROI problem or an image-score spike-aggregation problem, and give you more precise tiling/cropping parameters.

Question

How do I implement data augmentation now? My current setting is:

transform_config:
  train: null
  eval: null

Is changing it here also an option? Please tell me how to modify these settings to enable data augmentation.

Thinking

Instructions on modifying dataset transforms

To modify dataset augmentations, update dataset.transform_config to the path of your YAML with transformations. Provide separate files for training and evaluation. Recommended augmentations include Resize, Normalize, and ToTensorV2; refer to GitHub discussion lines 279-337 for further details. For newer versions (>=2.0), use train_augmentations. Check the version by running pip show anomalib.

Creating custom augmentation YAML files

To modify options, create an Albumentations YAML file, either by hand or using Python's A.save. Recommended augmentations for training include Resize, RandomBrightnessContrast, HueSaturationValue, RandomGamma, GaussNoise, MotionBlur, ShiftScaleRotate, Normalize, and ToTensorV2. For evaluation, include only Resize, Normalize, and ToTensorV2. Ensure the YAML is in the correct format using Albumentations' saving method.

Clarifying usage of augmentations in the new pipeline

Anomalib now uses torchvision v2 transforms, and augmentations can be configured in the DataModule, but for the user's config, it's still valid to follow the old approach with transform_config. For eval, avoid random transforms to maintain stable metrics. Also, if normalization: imagenet is set, adding Normalize may double-normalize, depending on the Anomalib version. If using a newer version with preprocessing, skip Normalize in the augmentations.

Crafting a clear guide on updating the config

To update the user’s config, I’ll suggest the following:

  1. Create car_train_aug.yaml and car_eval_aug.yaml in the project with sample YAML contents for train and eval.
  2. Modify the main config to reference these files.
  3. Ensure image_size and Resize sizes match (e.g., 256 for image_size should match Resize 256).
  4. Run training and debug by printing a batch to verify.

Thought for 55s

Answer

Yes. transform_config.train/eval: null means "no augmentation enabled". In anomalib (with your dataset:-style config), the most common approach is:

  • point dataset.transform_config.train / eval at a serialized Albumentations YAML file
  • make sure that YAML contains the "required" transforms (Resize / Normalize / ToTensorV2), otherwise you'll hit size or dtype mismatches; a maintainer states explicitly in the discussions that Resize and ToTensorV2 are mandatory, and that most models also need Normalize (ImageNet). (GitHub)

Below is a set of changes you can use as-is (I suggest getting it running first, then tuning the augmentation strength).


1) Prepare two augmentation files: random augmentations for train, deterministic preprocessing for eval

A. car_aug_train.yaml (training augmentations: simulate in-vehicle lighting/noise/slight jitter)

Don't start with aggressive RandomCrop (it changes the scale/position distribution of foreign objects and often tanks image_AUROC); imaging-variation augmentations are the safer choice for vehicle data.

```yaml
__version__: 1.3.0
transform:
  __class_fullname__: Compose
  additional_targets: {}
  bbox_params: null
  keypoint_params: null
  p: 1.0
  transforms:
    - __class_fullname__: Resize
      always_apply: true
      height: 256
      width: 256
      interpolation: 1
      p: 1.0
    - __class_fullname__: RandomBrightnessContrast
      always_apply: false
      brightness_by_max: true
      brightness_limit: [-0.2, 0.2]
      contrast_limit: [-0.2, 0.2]
      p: 0.6
    - __class_fullname__: HueSaturationValue
      always_apply: false
      hue_shift_limit: [-8, 8]
      sat_shift_limit: [-15, 15]
      val_shift_limit: [-10, 10]
      p: 0.3
    - __class_fullname__: RandomGamma
      always_apply: false
      gamma_limit: [80, 120]
      p: 0.3
    - __class_fullname__: GaussNoise
      always_apply: false
      var_limit: [5.0, 30.0]
      mean: 0
      per_channel: true
      p: 0.2
    - __class_fullname__: MotionBlur
      always_apply: false
      blur_limit: [3, 5]
      p: 0.15
    - __class_fullname__: ShiftScaleRotate
      always_apply: false
      shift_limit: 0.02
      scale_limit: 0.05
      rotate_limit: 5
      interpolation: 1
      border_mode: 4
      value: null
      mask_value: null
      p: 0.25
    - __class_fullname__: Normalize
      always_apply: true
      max_pixel_value: 255.0
      mean: [0.485, 0.456, 0.406]
      std: [0.229, 0.224, 0.225]
      p: 1.0
    - __class_fullname__: ToTensorV2
      always_apply: true
      p: 1.0
      transpose_mask: false
```

B. car_aug_eval.yaml (eval/test: no random augmentations, deterministic preprocessing only)

```yaml
__version__: 1.3.0
transform:
  __class_fullname__: Compose
  additional_targets: {}
  bbox_params: null
  keypoint_params: null
  p: 1.0
  transforms:
    - __class_fullname__: Resize
      always_apply: true
      height: 256
      width: 256
      interpolation: 1
      p: 1.0
    - __class_fullname__: Normalize
      always_apply: true
      max_pixel_value: 255.0
      mean: [0.485, 0.456, 0.406]
      std: [0.229, 0.224, 0.225]
      p: 1.0
    - __class_fullname__: ToTensorV2
      always_apply: true
      p: 1.0
      transpose_mask: false
```

The maintainer's example follows the same pattern: random augmentations in train, ending with Normalize + ToTensorV2. (GitHub)


2) Point transform_config in your anomalib config at these two YAML files

Change this part of your config:

```yaml
transform_config:
  train: null
  eval: null
```

to (substitute the paths where you actually put the YAMLs):

```yaml
transform_config:
  train: /media/jd/4997BB1603CFE2C4/lw/liwei23/anomalib/configs/car_aug_train.yaml
  eval: /media/jd/4997BB1603CFE2C4/lw/liwei23/anomalib/configs/car_aug_eval.yaml
```

3) Two key pitfalls: consistent sizes + no double Normalize

Pitfall 1: Resize(height/width) must match your image_size

You currently have image_size: 256, so the Resize in the YAMLs above uses 256.
If you later change image_size to 512, change the Resize in both YAMLs to 512 as well, or you'll easily get dimension mismatches (a frequent community issue). (GitHub)

Pitfall 2: if you're on the newer anomalib pipeline with torchvision v2 augmentations, don't Normalize again inside the augmentations

The latest docs clearly separate model preprocessing (Resize/Normalize) from (random) data augmentation and warn against sneaking Normalize in as an augmentation (it causes unpredictable behavior). (Anomalib)
But with your current dataset.transform_config usage (serialized Albumentations), the rule is usually: if you supply a custom transform, you must include Normalize/ToTensorV2 yourself. (GitHub)
If you're unsure which version you're on: run pip show anomalib to check the version number; if training runs, trust your framework's actual behavior.


4) Optional: generate the YAML with Python (avoids hand-written format errors)

If you prefer generating it with a script (more robust), the idea is to use Albumentations' A.save(...) (the maintainer's example was generated the same way). (GitHub)
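
A sketch of that generation script; the pipeline mirrors car_aug_train.yaml above (adjust transforms and the output path to taste):

```python
import albumentations as A
from albumentations.pytorch import ToTensorV2

train_tf = A.Compose([
    A.Resize(height=256, width=256),
    A.RandomBrightnessContrast(brightness_limit=0.2, contrast_limit=0.2, p=0.6),
    A.RandomGamma(gamma_limit=(80, 120), p=0.3),
    A.GaussNoise(var_limit=(5.0, 30.0), p=0.2),
    A.MotionBlur(blur_limit=(3, 5), p=0.15),
    A.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
    ToTensorV2(),
])

# Serialize so dataset.transform_config.train can point at the file.
A.save(train_tf, "car_aug_train.yaml", data_format="yaml")
```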


If image_AUROC still doesn't move after adding these augmentations, the next two things I'd have you do (again without changing the network):

  1. Augment only the training set; keep eval/test completely free of random transforms (already done above)
  2. Pick the 20 normal images with the highest false-alarm scores and check whether regions like "sky/wipers/reflections" cause the score spikes, then apply ROI cropping or tiling

You can also paste your anomalib version number plus the transforms your datamodule prints in the training log (or a screenshot), and I can confirm whether these augmentations are actually being loaded.

Question

My config file is as follows:

```yaml
dataset:
  name: mvtec
  format: folder
  path: /media/jd/4997BB1603CFE2C4/lw/liwei23/anomalib/dataset
  normal_dir: normal # subfolder with the normal (positive) samples of the custom dataset
  abnormal_dir: abnormal0 # subfolder with the abnormal (negative) samples of the custom dataset
  mask_dir: label_mask0 # path to binary masks; custom datasets usually have none, set null
  category: bottle
  normal_test_dir: normal_test0 # name of the folder containing normal test images.
  task: segmentation
  extensions: null
  train_batch_size: 32
  eval_batch_size: 32
  num_workers: 8
  image_size: 512 # dimensions to which images are resized (mandatory)
  center_crop: null # dimensions to which images are center-cropped after resizing (optional)
  normalization: imagenet # data distribution to which the images will be normalized: [none, imagenet]
  transform_config:
    train: null
    eval: null
  test_split_mode: from_dir # options: [from_dir, synthetic]
  test_split_ratio: 0.2 # fraction of train images held out for testing (usage depends on test_split_mode)
  val_split_mode: same_as_test # options: [same_as_test, from_test, synthetic]
  val_split_ratio: 0.5 # fraction of train/test images held out for validation (usage depends on val_split_mode)
  tiling:
    apply: true
    tile_size: 256
    stride: 128
    remove_border_count: 0
    use_random_tiling: False
    random_tile_count: 16

model:
  name: fastflow
  backbone: resnet18 # options: [resnet18, wide_resnet50_2, cait_m48_448, deit_base_distilled_patch16_384]
  pre_trained: true
  flow_steps: 8 # options: [8, 8, 20, 20] - for each supported backbone
  hidden_ratio: 1.0 # options: [1.0, 1.0, 0.16, 0.16] - for each supported backbone
  conv3x3_only: True # options: [True, False, False, False] - for each supported backbone
  lr: 0.001
  weight_decay: 0.00001
  early_stopping:
    patience: 3
    metric: pixel_AUROC
    mode: max
  normalization_method: min_max # options: [null, min_max, cdf]

metrics:
  image:
    - F1Score
    - AUROC
  pixel:
    - F1Score
    - AUROC
    - AUPRO
  threshold:
    method: adaptive # options: [adaptive, manual]
    manual_image: null
    manual_pixel: null

visualization:
  show_images: False # show images on the screen
  save_images: True # save images to the file system
  log_images: True # log images to the available loggers (if any)
  image_save_path: ./results/1217fastflow/fastflow/mvtec/run/images # path to which images will be saved
  mode: simple # options: ["full", "simple"]

project:
  seed: 42
  path: ./results/1217fastflow

logging:
  logger: [tensorboard, csv]
  log_graph: true

optimization:
  export_mode: null # options: torch, onnx, openvino

# PL Trainer Args. Don't add extra parameter here.
trainer:
  enable_checkpointing: true
  default_root_dir: null
  gradient_clip_val: 0
  gradient_clip_algorithm: norm
  num_nodes: 1
  devices: 1
  enable_progress_bar: true
  overfit_batches: 0.0
  track_grad_norm: -1
  check_val_every_n_epoch: 1 # Don't validate before extracting features.
  fast_dev_run: false
  accumulate_grad_batches: 1
  max_epochs: 500
  min_epochs: null
  max_steps: -1
  min_steps: null
  max_time: null
  limit_train_batches: 1.0
  limit_val_batches: 1.0
  limit_test_batches: 1.0
  limit_predict_batches: 1.0
  val_check_interval: 1.0 # Don't validate before extracting features.
  log_every_n_steps: 50
  accelerator: auto # <"cpu", "gpu", "tpu", "ipu", "hpu", "auto">
  strategy: null
  sync_batchnorm: false
  precision: 32
  enable_model_summary: true
  num_sanity_val_steps: 0
  profiler: null
  benchmark: false
  deterministic: false
  reload_dataloaders_every_n_epochs: 0
  auto_lr_find: false
  replace_sampler_ddp: true
  detect_anomaly: false
  auto_scale_batch_size: false
  plugins: null
  move_metrics_to_cpu: false
  multiple_trainloader_mode: max_size_cycle
```
The code errors out as follows:

```
Epoch 0:  92%|█████████████████████████████████████████████████████████████████████████████▏ | 250/272 [01:46<00:09, 2.35it/s, loss=-1.68e+06, v_num=0, train_loss_step=-1.69e+6]
/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/utilities/data.py:83: UserWarning: Trying to infer the batch_size from an ambiguous collection. The batch size we found is 6. To avoid any miscalculations, use self.log(..., batch_size=batch_size).
  warning_cache.warn(
Epoch 3: 100%|█| 272/272 [03:06<00:00, 1.46it/s, loss=-4.93e+06, v_num=0, train_loss_step=-4.98e+6, pixel_F1Score=0.0544, pixel_AUROC=0.776, pixel_AUPRO=0.336, train_loss_epoch=-3.98e+6]
Given normalized_shape=[64, 128, 128], expected input with shape [*, 64, 128, 128], but got input of size [1, 64, 64, 64]
Error occurs, No graph saved
Traceback (most recent call last):
  File "tools/train.py", line 81, in <module>
    train(args)
  File "tools/train.py", line 66, in train
    trainer.fit(model=model, datamodule=datamodule)
  File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 608, in fit
    call._call_and_handle_interrupt(
  File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1112, in _run
    results = self._run_stage()
  File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1191, in _run_stage
    self._run_train()
  File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1214, in _run_train
    self.fit_loop.run()
  File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 206, in run
    output = self.on_run_end()
  File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 323, in on_run_end
    self.trainer._call_callback_hooks("on_train_end")
  File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1394, in _call_callback_hooks
    fn(self, self.lightning_module, *args, **kwargs)
  File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/anomalib/utils/callbacks/graph.py", line 44, in on_train_end
    logger.log_graph(pl_module, input_array=torch.ones((1, 3, 256, 256)))
  File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/lightning_utilities/core/rank_zero.py", line 42, in wrapped_fn
    return fn(*args, **kwargs)
  File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/loggers/tensorboard.py", line 205, in log_graph
    self.experiment.add_graph(model, input_array)
  File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/utils/tensorboard/writer.py", line 841, in add_graph
    graph(model, input_to_model, verbose, use_strict_trace)
  File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/utils/tensorboard/_pytorch_graph.py", line 338, in graph
    raise e
  File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/utils/tensorboard/_pytorch_graph.py", line 331, in graph
    trace = torch.jit.trace(model, args, strict=False)
  File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/jit/_trace.py", line 1000, in trace
    traced_func = _trace_impl(
  File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/jit/_trace.py", line 695, in _trace_impl
    return trace_module(
  File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/jit/_trace.py", line 1275, in trace_module
    module._c._create_method_from_trace(
  File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1543, in _slow_forward
    result = self.forward(*input, **kwargs)
  File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/anomalib/models/components/base/anomaly_module.py", line 67, in forward
    return self.model(batch)
  File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1543, in _slow_forward
    result = self.forward(*input, **kwargs)
  File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/anomalib/models/fastflow/torch_model.py", line 190, in forward
    features = self._get_cnn_features(input_tensor)
  File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/anomalib/models/fastflow/torch_model.py", line 219, in _get_cnn_features
    features = [self.norms[i](feature) for i, feature in enumerate(features)]
  File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/anomalib/models/fastflow/torch_model.py", line 219, in <listcomp>
    features = [self.norms[i](feature) for i, feature in enumerate(features)]
  File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1543, in _slow_forward
    result = self.forward(*input, **kwargs)
  File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/nn/modules/normalization.py", line 202, in forward
    return F.layer_norm(
  File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/nn/functional.py", line 2576, in layer_norm
    return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)
RuntimeError: Given normalized_shape=[64, 128, 128], expected input with shape [*, 64, 128, 128], but got input of size [1, 64, 64, 64]
Epoch 3: 100%|██████████| 272/272 [03:08<00:00, 1.44it/s, loss=-4.93e+06, v_num=0, train_loss_step=-4.98e+6, pixel_F1Score=0.0544, pixel_AUROC=0.776, pixel_AUPRO=0.336, train_loss_epoch=-3.98e+6]
```

Thinking

Addressing tiling and callback issues

The user’s config likely comes from an older version of the pipeline, where tiling might have worked without the TilerConfigurationCallback. The error stack shows the use of an older GraphLogger, so it looks like the user is working with an older version of Anomalib. To fix the issue, the TilerConfigurationCallback should be added for tiling in the v2.x engine.

Addressing tiling mismatch issue

It looks like there’s a mismatch between image size (512) and tile size (256) while tiling is active. If tiling is working properly, the model should be built with the tile size, not image size. Since the model expects 512 and not tile size 256, tiling might not be effective. To fix it, add the TilerConfigurationCallback to the config, but you’ll need to make sure the config is updated correctly for the newer version.

Addressing user error and configuration fixes

It seems like the user is running into errors during training when using tools/train.py from Anomalib. We can suggest fixing the crash by setting log_graph: false. For augmentation, we should give an example YAML for setting image_size 512 and the proper configuration, along with clarifying how to handle normalization. Additionally, we need to address data split issues.

Explaining and solving the GraphLogger error

The error occurs because GraphLogger uses a fixed 256 input size, but your model is designed with an image_size of 512, so layer norms expect 128x128 feature maps. When tracing with 256, the feature maps become 64x64, causing a mismatch.

Quick solution: disable graph logging (logging.log_graph: false). If you need the graph, you can patch the callback to use the correct input size.

Clarifying graph input size and augmentation setup

If tiling is active, the model expects a tile size of 256, but the model was built for 512. To fix the model, change the graph input to 512. If tiling is intended, use the tiler callback, setting the model’s tile size.

For data augmentation, set the transform_config file paths for training and evaluation, ensuring Resize matches the image size (512). Normalize to "none" to prevent double normalization.

Thought for 1m 12s

Answer

This error isn't the training itself blowing up: it happens at the end of training, while writing the model graph to TensorBoard.

The traceback makes this clear: the exception occurs in on_train_end of anomalib/utils/callbacks/graph.py, which traces the model with a fixed-size dummy input to log the graph (many versions hard-code torch.ones((1, 3, 256, 256))). (Anomalib)
But you now have image_size: 512, and some LayerNorm layers in FastFlow initialize their normalized_shape from the feature-map size that corresponds to the input resolution (512 → 128×128 feature maps). So when the graph logger traces with 256, the feature maps come out 64×64, producing exactly the mismatch you see:

expected [*, 64, 128, 128] but got [1, 64, 64, 64]
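
A minimal reproduction of that shape constraint in plain PyTorch (independent of anomalib; 128 = 512/4 at resnet18's first feature level):

```python
import torch
import torch.nn as nn

norm = nn.LayerNorm([64, 128, 128])    # built for image_size=512 (512/4 = 128)
ok = torch.randn(1, 64, 128, 128)      # what 512x512 inputs produce
bad = torch.randn(1, 64, 64, 64)       # what a 256x256 trace input produces

print(norm(ok).shape)                  # works
try:
    norm(bad)
except RuntimeError as e:
    print("RuntimeError:", e)          # same mismatch as in your traceback
```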


Part 1: fix the error itself (two options; try A first)

A) Simplest: disable graph logging (recommended)

Just change this in your config:

```yaml
logging:
  log_graph: false
```

GraphLogger only logs at the end of training anyway (TensorBoard and Comet both use on_train_end), so disabling it affects neither training nor metrics. (Anomalib)


B) If you really need the graph: change the graph logger's dummy input from 256 to 512

In your anomalib version the graph logger most likely hard-codes 256×256 (your log confirms this). (Lightrun)
The fix is to edit this file in your environment (the path is already in your traceback):

.../site-packages/anomalib/utils/callbacks/graph.py

Change the line that looks like:

```python
torch.ones((1, 3, 256, 256))
```

to:

```python
torch.ones((1, 3, 512, 512))
```

Note: if you later make tiling actually effective so the model really consumes tile_size=256 patches, then a 256 graph input would be correct; but right now your model is clearly operating at 512-sized feature maps, so 512 is what you need for the trace to pass.


Part 2: your question about implementing augmentation in transform_config (yes, this is the place)

You currently have:

```yaml
transform_config:
  train: null
  eval: null
```

To add augmentation, point these at Albumentations YAML file paths. The community maintainers state it explicitly: point dataset.transform_config at an albumentations config file, and a custom transform file must include Resize + ToTensorV2 (most models also need ImageNet Normalize). (GitHub)

1) Prepare two files (recommended)

car_train_aug_512.yaml (training: random augmentations + required transforms)

```yaml
__version__: 1.3.0
transform:
  __class_fullname__: Compose
  p: 1.0
  transforms:
    - __class_fullname__: Resize
      always_apply: true
      height: 512
      width: 512
      interpolation: 1
      p: 1.0
    # -- start with "imaging-variation" augmentations (more stable for in-vehicle data) --
    - __class_fullname__: RandomBrightnessContrast
      brightness_limit: [-0.2, 0.2]
      contrast_limit: [-0.2, 0.2]
      brightness_by_max: true
      p: 0.6
    - __class_fullname__: RandomGamma
      gamma_limit: [80, 120]
      p: 0.3
    - __class_fullname__: GaussNoise
      var_limit: [5.0, 30.0]
      mean: 0
      per_channel: true
      p: 0.2
    - __class_fullname__: MotionBlur
      blur_limit: [3, 5]
      p: 0.15
    - __class_fullname__: ShiftScaleRotate
      shift_limit: 0.02
      scale_limit: 0.05
      rotate_limit: 5
      interpolation: 1
      border_mode: 4
      p: 0.25
    # -- required: Normalize + ToTensorV2 (many anomalib versions make you spell
    #    these out when providing custom transforms) --
    - __class_fullname__: Normalize
      max_pixel_value: 255.0
      mean: [0.485, 0.456, 0.406]
      std: [0.229, 0.224, 0.225]
      p: 1.0
    - __class_fullname__: ToTensorV2
      p: 1.0
      transpose_mask: false
```

car_eval_512.yaml (validation/test: deterministic preprocessing only, no random augmentations)

```yaml
__version__: 1.3.0
transform:
  __class_fullname__: Compose
  p: 1.0
  transforms:
    - __class_fullname__: Resize
      always_apply: true
      height: 512
      width: 512
      interpolation: 1
      p: 1.0
    - __class_fullname__: Normalize
      max_pixel_value: 255.0
      mean: [0.485, 0.456, 0.406]
      std: [0.229, 0.224, 0.225]
      p: 1.0
    - __class_fullname__: ToTensorV2
      p: 1.0
      transpose_mask: false
```

This pattern, random augmentations for train and only Resize/Normalize/ToTensor for eval, is the most common approach in anomalib. (GitHub)


2) Reference them in your config

Change your config to this (substitute your real paths):

```yaml
dataset:
  transform_config:
    train: /media/jd/4997BB1603CFE2C4/lw/liwei23/anomalib/configs/car_train_aug_512.yaml
    eval: /media/jd/4997BB1603CFE2C4/lw/liwei23/anomalib/configs/car_eval_512.yaml
```

Part 3: about your current tiling: verify that it actually took effect

The latest docs state explicitly that tiling requires a TilerConfigurationCallback to bind the tiler to the model (otherwise you may believe it's on while nothing gets tiled). (Anomalib)
With your older-style "config + tools/train.py" setup, whether this callback is added automatically depends on how your training script is written (many people miss it). So confirm in your logs whether the model actually receives 512 images or 256 tiles; a quick sanity check is sketched below.
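
One quick check (a sketch; attribute locations vary across anomalib 0.x versions): after building the model in tools/train.py, see whether a tiler was actually bound to it.

```python
# `model` is the AnomalyModule created in tools/train.py; its inner torch
# model is what the tiler callback is expected to decorate.
tiler = getattr(model.model, "tiler", None)
print("tiler attached:", tiler is not None)  # False means tiling never took effect
print(tiler)                                 # shows tile size / stride if present
```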


What you should change right now (shortest path)

  1. Set logging.log_graph: false so the run completes without the error. (Anomalib)
  2. Add the two transform_config.train/eval YAMLs above, starting with brightness/contrast/noise/slight-jitter augmentations. (GitHub)

If you paste your pip show anomalib version (the single Version line is enough), I can give you transform/tiling/callback config written exactly for your version, avoiding the mix-up between "new docs style" and "old project style".

Question

I'd like the graph tensor size to be customizable, inferred automatically from whatever I pass in, so I don't have to edit it by hand every time.

Thinking

Proposing solutions for dynamic graph tensor size

Given that anomalib has its own callback, we can suggest overriding the GraphLogger callback. One option is to disable its log_graph via log_graph=false and create a custom callback. Alternatively, if they want to keep log_graph true, we can modify their installed code to automatically compute the tensor size from the datamodule or input batch.

Using dynamic batch shape within the callback

To get the correct tensor shape dynamically, I'll suggest inside the callback, grabbing a batch with next(iter(trainer.datamodule.train_dataloader())) and using its "image" tensor. This ensures we handle any potential transformations, like tiling or cropping. The approach remains robust, even if the image_size or tile_size changes.

Handling device mismatch during graph logging

It looks like there might be some device mismatch issues when tracing with logger.log_graph. The mistake might be that the input tensor is on the CPU while the model is on the GPU. To fix this, I'll ensure the input tensor is moved to the same device as the model using input_array.to(device). That should handle any device-related errors.

Handling device and size dynamics for graph logging

I need to ensure the batch is on the proper device after training, as on_train_end occurs after training finishes. So in the custom callback, I’ll gather the sample on CPU, then move it to the device. I’ll also set up a dynamic function to infer the correct input size, based on the dataloader or config. This will handle tiling and size flexibility, using dynamic sizes instead of hardcoding them.

Proposing callback options for user needs

To avoid modifying the site-packages each update, I'll present two options for customizing the graph logging callback. Option 1: Add a custom callback into the project and disable the default one for better maintainability. Option 2: Quickly patch the anomalib callback to infer input size, with specific code and diff provided.

Thought for 1m 1s

Answer

What you want boils down to this: stop tracing the model with anomalib's hard-coded torch.ones((1, 3, 256, 256)) and instead automatically pick up the real input size that will be fed to the model (so it adapts when you change image_size / tiling.tile_size). Then the TensorBoard graph never breaks on a 256↔512 mismatch again.

Your current error is exactly anomalib's GraphLogger logging the graph with a fixed 256 in on_train_end (many versions really are written that way). (Lightrun)

Below are two implementations: option 1 is recommended (cleanest, no repeated edits to site-packages); option 2 is a one-off patch to site-packages.


Option 1 (recommended): disable anomalib's built-in log_graph + use your own "dynamic graph callback"

Idea: at the end of training, pull one real batch from the dataloader, use the shape of its image as input_array, then call logger.log_graph. This way:

  • changing image_size=512 works
  • enabling tiling (when it's truly effective and the input becomes tile_size=256) adapts automatically
  • no more LayerNorm shape mismatch

1) First disable anomalib's graph in the config (so it doesn't trace with 256 again)

```yaml
logging:
  log_graph: false
```

(This field corresponds to anomalib's GraphLogger callback. (Anomalib Documentation))

2) Add a callback: dynamic_graph_logger.py

Put the following file in your project, e.g. callbacks/dynamic_graph_logger.py:

```python
# callbacks/dynamic_graph_logger.py
from __future__ import annotations

import torch
from pytorch_lightning import Callback
from lightning_utilities.core.rank_zero import rank_zero_warn


def _extract_4d_image_tensor(batch):
    """Find a [B, C, H, W] image tensor in an anomalib / custom dataloader batch."""
    if torch.is_tensor(batch) and batch.ndim == 4:
        return batch
    if isinstance(batch, dict):
        # common anomalib key: "image"
        for k in ("image", "input", "inputs", "img"):
            v = batch.get(k, None)
            if torch.is_tensor(v) and v.ndim == 4:
                return v
        # fall back to the first 4D tensor
        for v in batch.values():
            if torch.is_tensor(v) and v.ndim == 4:
                return v
    if isinstance(batch, (list, tuple)):
        for v in batch:
            if torch.is_tensor(v) and v.ndim == 4:
                return v
            if isinstance(v, dict):
                t = _extract_4d_image_tensor(v)
                if t is not None:
                    return t
    return None


class DynamicGraphLogger(Callback):
    """Log the graph with a real batch's input size, avoiding the shape mismatch
    caused by anomalib's hard-coded 256."""

    def __init__(self, when: str = "train_end"):
        """
        when: "fit_start" or "train_end"
          - fit_start: logs earlier (but some models/loggers may not be ready yet)
          - train_end: safest (your error also occurred at train_end)
        """
        assert when in ("fit_start", "train_end")
        self.when = when
        self._logged = False

    def _log_graph(self, trainer, pl_module):
        if self._logged:
            return
        if not getattr(trainer, "loggers", None):
            return

        # 1) Grab one batch from the train dataloader (closest to the real
        #    training input, including tiling/resize etc.)
        input_array = None
        try:
            dl = trainer.datamodule.train_dataloader()
            batch = next(iter(dl))
            img = _extract_4d_image_tensor(batch)
            if img is None:
                raise RuntimeError("could not find a [B,C,H,W] image tensor in the batch")
            input_array = img[:1]  # one image is enough to draw the graph
        except Exception as e:
            rank_zero_warn(f"[DynamicGraphLogger] failed to fetch a batch; skipping graph logging. Reason: {e}")
            return

        # 2) Move the input to the model's device (otherwise CPU/GPU mismatch)
        try:
            device = next(pl_module.parameters()).device
            input_array = input_array.to(device)
        except Exception:
            pass

        # 3) Try log_graph on every logger that supports it (TensorBoardLogger accepts
        #    an input_array argument:
        #    https://pytorch-lightning.readthedocs.io/en/1.2.10/extensions/generated/pytorch_lightning.loggers.TensorBoardLogger.html)
        ok = False
        for logger in trainer.loggers:
            if hasattr(logger, "log_graph"):
                try:
                    logger.log_graph(pl_module, input_array=input_array)
                    ok = True
                except Exception as e:
                    rank_zero_warn(f"[DynamicGraphLogger] logger {type(logger)} failed to log graph: {e}")
        self._logged = ok

    def on_fit_start(self, trainer, pl_module):
        if self.when == "fit_start":
            self._log_graph(trainer, pl_module)

    def on_train_end(self, trainer, pl_module):
        if self.when == "train_end":
            self._log_graph(trainer, pl_module)
```

3) Add the callback to the Trainer in your tools/train.py

Example (slot it into your own code structure):

```python
from callbacks.dynamic_graph_logger import DynamicGraphLogger

callbacks = []
callbacks.append(DynamicGraphLogger(when="train_end"))

trainer = pl.Trainer(
    ...,
    callbacks=callbacks,
)
```

From then on you only edit image_size / tile_size in the YAML; the graph input automatically follows your dataloader, and you never touch a tensor size by hand.


Option 2: patch anomalib's GraphLogger once, so it infers the input size itself

If you'd rather not touch train.py, directly edit this file in your environment:
.../site-packages/anomalib/utils/callbacks/graph.py
and replace the torch.ones((1, 3, 256, 256)) line with "dynamic inference":

  • first try pulling one batch from trainer.datamodule.train_dataloader() and take its size
  • if that fails, fall back to reading trainer.datamodule.image_size or the config
  • only then fall back to 256

The idea is the same as the callback above; you're just moving the logic back into anomalib's GraphLogger (a sketch follows below).
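
A sketch of what that patched dummy-input logic could look like (hypothetical code to place inside graph.py; adapt names to your version):

```python
import torch

def _infer_input_array(trainer, fallback: int = 256) -> torch.Tensor:
    """Pick a graph-trace input sized like the real training input."""
    try:
        batch = next(iter(trainer.datamodule.train_dataloader()))
        return batch["image"][:1]                # anomalib batches carry "image"
    except Exception:
        size = getattr(trainer.datamodule, "image_size", fallback)
        if isinstance(size, int):
            size = (size, size)
        return torch.ones((1, 3, *size))         # last resort: old behavior
```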

(Your current error chain comes from exactly this hard-coded 256. (Lightrun))


Which one do I recommend?

  • If you want "set and forget, maintainable, survives anomalib upgrades": option 1
  • If you want "no training-code change, one edit to an environment file": option 2

If you paste the few lines of your tools/train.py that create the Trainer (just the trainer/callbacks part), I can write the callback insertion as a paste-ready snippet for your code.
