I'm using anomalib for foreign-object detection on vehicle-mounted images, labeling images that contain a target as anomalous. I have a normal training set of 8000 images, 200+ anomalous images, and 400+ normal test images. My current FastFlow config file is:

```yaml
dataset:
name: mvtec
format: folder
path: /media/jd/4997BB1603CFE2C4/lw/liwei23/anomalib/dataset
normal_dir: normal # subfolder of normal (positive) samples in the custom dataset
abnormal_dir: abnormal0 # subfolder of anomalous (negative) samples in the custom dataset
mask_dir: label_mask0 # path to binary masks; custom datasets usually have none, so set null
category: bottle
normal_test_dir: normal_test0 # name of the folder containing normal test images.
task: segmentation
extensions: null
train_batch_size: 16
eval_batch_size: 16
split_ratio: 0.3 # ratio of the normal images that will be used to create a test split
num_workers: 8
image_size: 256 # dimensions to which images are resized (mandatory)
center_crop: null # dimensions to which images are center-cropped after resizing (optional)
normalization: imagenet # data distribution to which the images will be normalized: [none, imagenet]
transform_config:
train: null
eval: null
test_split_mode: from_dir # options: [from_dir, synthetic]
test_split_ratio: 0.2 # fraction of train images held out for testing (usage depends on test_split_mode)
val_split_mode: same_as_test # options: [same_as_test, from_test, synthetic]
val_split_ratio: 0.5 # fraction of train/test images held out for validation (usage depends on val_split_mode)
tiling:
apply: false
tile_size: null
stride: null
remove_border_count: 0
use_random_tiling: False
random_tile_count: 16
model:
name: fastflow
backbone: resnet18 # options: [resnet18, wide_resnet50_2, cait_m48_448, deit_base_distilled_patch16_384]
pre_trained: true
flow_steps: 8 # options: [8, 8, 20, 20] - for each supported backbone
hidden_ratio: 1.0 # options: [1.0, 1.0, 0.16, 0.16] - for each supported backbone
conv3x3_only: True # options: [True, False, False, False] - for each supported backbone
lr: 0.001
weight_decay: 0.00001
early_stopping:
patience: 3
metric: pixel_AUROC
mode: max
normalization_method: min_max # options: [null, min_max, cdf]
metrics:
image:
- F1Score
- AUROC
pixel:
- F1Score
- AUROC
- AUPRO
threshold:
method: adaptive #options: [adaptive, manual]
manual_image: null
manual_pixel: null
visualization:
show_images: False # show images on the screen
save_images: True # save images to the file system
log_images: True # log images to the available loggers (if any)
image_save_path: ./results/1217fastflow/fastflow/mvtec/run/images # path to which images will be saved
mode: simple # options: ["full", "simple"]
project:
seed: 42
path: ./results/1217fastflow
logging:
logger: [tensorboard,csv]
log_graph: true
optimization:
export_mode: null # options: torch, onnx, openvino
trainer:
enable_checkpointing: true
default_root_dir: null
gradient_clip_val: 0
gradient_clip_algorithm: norm
num_nodes: 1
devices: 1
enable_progress_bar: true
overfit_batches: 0.0
track_grad_norm: -1
check_val_every_n_epoch: 1 # Don't validate before extracting features.
fast_dev_run: false
accumulate_grad_batches: 1
max_epochs: 500
min_epochs: null
max_steps: -1
min_steps: null
max_time: null
limit_train_batches: 1.0
limit_val_batches: 1.0
limit_test_batches: 1.0
limit_predict_batches: 1.0
val_check_interval: 1.0 # Don't validate before extracting features.
log_every_n_steps: 50
accelerator: auto # <"cpu", "gpu", "tpu", "ipu", "hpu", "auto">
strategy: null
sync_batchnorm: false
precision: 32
enable_model_summary: true
num_sanity_val_steps: 0
profiler: null
benchmark: false
deterministic: false
reload_dataloaders_every_n_epochs: 0
auto_lr_find: false
replace_sampler_ddp: true
detect_anomaly: false
auto_scale_batch_size: false
plugins: null
move_metrics_to_cpu: false
multiple_trainloader_mode: max_size_cycle
```

The classification performance is now very poor:

image_AUROC 0.6587263345718384
image_F1Score 0.6217764616012573
pixel_AUPRO 0.4064096212387085
pixel_AUROC 0.8441610336303711
pixel_F1Score 0.11360228806734085
How can I first improve the image-level classification metric, image_AUROC, without changing the network architecture? Please give concrete suggestions.
What you're seeing is quite typical: pixel_AUROC is acceptable (0.84) but image_AUROC is low (0.66). That usually means the model can score anomalous images highly in local regions, but normal images also produce plenty of spurious high-score spikes, so the image-level scores (pred_score) of anomalous and normal images overlap in ranking and AUROC stays low.

Below is a set of fixes, ordered from highest to lowest priority, that do not change the network architecture (the FastFlow core): all of them touch only the data, the configuration, or post-processing.

You already have an independent normal_test_dir (normal_test0). In that case, set normal_split_ratio/split_ratio to 0 so that training uses only normal_dir while testing uses normal_test_dir + abnormal_dir; this avoids carving yet another chunk out of normal_dir for test/val and muddling the distributions.

The official Folder dataset example does exactly this: when normal_test_dir exists, set normal_split_ratio: 0 and use test_split_mode: from_dir. (Anomalib Documentation)

The corresponding changes in your config:

- `split_ratio: 0` (or `null`, depending on your anomalib version's field name)
- `test_split_mode: from_dir`
- keep `normal_test_dir: normal_test0`

This step alone can sometimes lift image_AUROC noticeably (especially when the normal distribution varies a lot, as it does in vehicle-mounted scenes).
Your current image_size: 256 makes small on-vehicle foreign objects very easy to lose to downsampling, so the image-level scores can't separate the classes. Two options:

- Raise `image_size` to 512 (fall back if GPU memory runs short).
- Anomalib's tiling exists precisely for small defects/foreign objects: split large images into tiles with `tile_size`/`stride`, process them, and stitch the results back. The docs state that the main tiling parameters are `tile_size` and `stride`, and that tiles are stacked batch-wise by default, so memory usually doesn't blow up even without reducing the batch size. (Anomalib Documentation)

Suggested starting configuration:

```yaml
dataset:
  image_size: 512        # or keep the original aspect ratio, then resize
  tiling:
    apply: true
    tile_size: 256
    stride: 128          # stride < tile_size means overlap, usually more stable
    remove_border_count: 0
    use_random_tiling: False
    random_tile_count: 16
```
⚠️ Note: some older anomalib versions additionally require a tiler callback for this to actually take effect (otherwise you think tiling is on, but nothing is being tiled). People in community discussions have hit exactly this trap. (GitHub)

(If you're on the pipeline from the newer docs, this is usually configured automatically; older versions need the callback added manually.)
In vehicle-mounted footage, the sky, distant buildings, reflections, wiper blades, water stains, dashboard edges, and so on all produce high responses in normal images, directly inflating the normal images' scores and dragging image_AUROC down.

Things you can do without touching the network:

- Use `center_crop` to crop to a fixed ROI (or crop the images yourself offline).
- FastFlow belongs to the family of methods that only learn the normal distribution: lighting, exposure, rain/snow, and night-time noise vary a lot in vehicle scenes, and if your normal set doesn't cover them, the model treats these "normal variations" as anomalies, which directly hurts image_AUROC.

Anomalib supports custom augmentation by pointing dataset.transform_config at an Albumentations YAML. (GitHub)

Albumentations can also serialize an augmentation pipeline to a YAML file. (Albumentations)

Recommended augmentation directions (lightweight, biased toward imaging changes):

Principle: augmentations should simulate normal vehicle-scene variation. Don't use RandomCrop, which can crop foreign objects in or out (introducing label noise / distribution shift).
You currently have:

```yaml
early_stopping:
  metric: pixel_AUROC
```

This biases checkpoint selection toward better pixel-level localization, but what you care about most is image_AUROC. Change it to:

```yaml
early_stopping:
  metric: image_AUROC
  mode: max
  patience: 10   # vehicle scenes fluctuate a lot; 3 is often too aggressive
```
Many anomaly-segmentation pipelines take the max of the anomaly map as the image score by default, which is extremely sensitive to a single noise spike in a normal image; your pattern of "pixel_AUROC is fine but image_AUROC is poor" looks exactly like that.

Anomalib's post-processing system exists precisely for thresholding/normalization/output formatting, and it supports custom post-processors. (Anomalib Documentation)

The idea (don't touch the network, only change how scores are aggregated):

- Use `pred_score = mean(top 1% pixels)` or `p99(anomaly_map)` instead of the global max; see the sketch after the next paragraph.

If your goal at this stage is purely image_AUROC: Anomalib's Folder dataset natively supports `task: classification` (no masks needed). The docs explicitly distinguish the classification and segmentation task configurations. (Anomalib Documentation)
You can first lift image_AUROC with the classification task, then come back to segmentation to optimize localization.
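A minimal sketch of the score-aggregation idea in plain PyTorch (this is not anomalib's built-in post-processor; the function names are illustrative):

```python
import torch

def image_score_topk(anomaly_map: torch.Tensor, top_fraction: float = 0.01) -> torch.Tensor:
    """Image score = mean of the top 1% highest-scoring pixels (a robust 'max')."""
    flat = anomaly_map.flatten(start_dim=1)        # [B, H*W] from [B, H, W] or [B, 1, H, W]
    k = max(1, int(flat.shape[1] * top_fraction))  # number of pixels in the top fraction
    return flat.topk(k, dim=1).values.mean(dim=1)  # [B] image-level scores

def image_score_p99(anomaly_map: torch.Tensor) -> torch.Tensor:
    """Alternative: a high quantile of the anomaly map instead of the hard max."""
    return torch.quantile(anomaly_map.flatten(start_dim=1), 0.99, dim=1)
```

Both aggregates score a normal image low even when it contains a single noise spike, which is exactly what the image-level ranking needs.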
Quick checklist:

- `split_ratio`/`normal_split_ratio` → 0 (make sure the splits aren't scrambled)
- `image_size` → 512, or enable tiling directly (`apply: true`)
- early-stop on `image_AUROC`

If you're willing to send the 5 normal images with the highest false-positive scores and 5 missed anomalous images (screenshots/examples), I can help you judge whether this is a resolution/ROI problem or an image-score spike-aggregation problem, and give you more precise tiling/cropping parameters.
How do I implement data augmentation now? I currently have:

```yaml
transform_config:
  train: null
  eval: null
```

Can I change it here as well? Please show me how to change these options to enable augmentation.
Yes. Your transform_config.train/eval: null means augmentation is simply off. In anomalib (with your dataset:-style config), the most common approach is:

- point `dataset.transform_config.train` / `eval` at a serialized Albumentations YAML file;
- the file must include `Resize` and `ToTensorV2`, and most models also need `Normalize` (ImageNet). (GitHub)

Below is a set of changes you can use directly (get it running as-is first, then tune the augmentation strength).

car_aug_train.yaml (training augmentation: simulating vehicle lighting/noise/slight jitter). Don't start with an aggressive RandomCrop (it shifts the scale/position distribution of the foreign objects and often collapses image_AUROC); "imaging-variation" augmentations are a better fit for vehicle scenes.
```yaml
__version__: 1.3.0
transform:
  __class_fullname__: Compose
  additional_targets: {}
  bbox_params: null
  keypoint_params: null
  p: 1.0
  transforms:
    - __class_fullname__: Resize
      always_apply: true
      height: 256
      width: 256
      interpolation: 1
      p: 1.0
    - __class_fullname__: RandomBrightnessContrast
      always_apply: false
      brightness_by_max: true
      brightness_limit: [-0.2, 0.2]
      contrast_limit: [-0.2, 0.2]
      p: 0.6
    - __class_fullname__: HueSaturationValue
      always_apply: false
      hue_shift_limit: [-8, 8]
      sat_shift_limit: [-15, 15]
      val_shift_limit: [-10, 10]
      p: 0.3
    - __class_fullname__: RandomGamma
      always_apply: false
      gamma_limit: [80, 120]
      p: 0.3
    - __class_fullname__: GaussNoise
      always_apply: false
      var_limit: [5.0, 30.0]
      mean: 0
      per_channel: true
      p: 0.2
    - __class_fullname__: MotionBlur
      always_apply: false
      blur_limit: [3, 5]
      p: 0.15
    - __class_fullname__: ShiftScaleRotate
      always_apply: false
      shift_limit: 0.02
      scale_limit: 0.05
      rotate_limit: 5
      interpolation: 1
      border_mode: 4
      value: null
      mask_value: null
      p: 0.25
    - __class_fullname__: Normalize
      always_apply: true
      max_pixel_value: 255.0
      mean: [0.485, 0.456, 0.406]
      std: [0.229, 0.224, 0.225]
      p: 1.0
    - __class_fullname__: ToTensorV2
      always_apply: true
      p: 1.0
      transpose_mask: false
```
car_aug_eval.yaml (evaluation/testing: no random augmentation, deterministic processing only):

```yaml
__version__: 1.3.0
transform:
  __class_fullname__: Compose
  additional_targets: {}
  bbox_params: null
  keypoint_params: null
  p: 1.0
  transforms:
    - __class_fullname__: Resize
      always_apply: true
      height: 256
      width: 256
      interpolation: 1
      p: 1.0
    - __class_fullname__: Normalize
      always_apply: true
      max_pixel_value: 255.0
      mean: [0.485, 0.456, 0.406]
      std: [0.229, 0.224, 0.225]
      p: 1.0
    - __class_fullname__: ToTensorV2
      always_apply: true
      p: 1.0
      transpose_mask: false
```
The maintainer's example follows the same pattern: random augmentations in train, ending with Normalize + ToTensorV2. (GitHub)

Point transform_config at these two YAML files. Change this part of your config:
```yaml
transform_config:
  train: null
  eval: null
```

to (replace the paths with wherever you actually put the YAML files):

```yaml
transform_config:
  train: /media/jd/4997BB1603CFE2C4/lw/liwei23/anomalib/configs/car_aug_train.yaml
  eval: /media/jd/4997BB1603CFE2C4/lw/liwei23/anomalib/configs/car_aug_eval.yaml
```
Resize (height/width) must match your image_size. You currently have image_size: 256, so the Resize in the YAMLs above should be 256.

If you later change image_size to 512, change the Resize in both YAML files to 512 as well, or you'll quickly hit dimension mismatches (a frequent community issue). (GitHub)

The latest docs draw a clear line between model preprocessing (Resize/Normalize) and (random) data augmentation, and warn against stuffing Normalize in as an augmentation (it produces uncontrolled behavior). (Anomalib)

With your current dataset.transform_config usage (serialized Albumentations), however, the convention is: once you supply a custom transform, you must include Normalize/ToTensorV2 yourself. (GitHub)

If you're unsure of your version: run pip show anomalib to check the version number; if training runs, go by what your framework actually does.

If you'd rather generate the YAML with a script (more robust), the idea is to use Albumentations' A.save(...) (that's also how the maintainer's example was generated). (GitHub)
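A hedged sketch of such a generation script, assuming an Albumentations 1.x-style API (the transform list is abbreviated; extend it with the entries from the YAML above):

```python
# Generate the serialized augmentation configs with A.save instead of hand-writing YAML.
import albumentations as A
from albumentations.pytorch import ToTensorV2

train_tf = A.Compose([
    A.Resize(256, 256),
    A.RandomBrightnessContrast(brightness_limit=0.2, contrast_limit=0.2, p=0.6),
    A.GaussNoise(var_limit=(5.0, 30.0), p=0.2),
    A.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),  # ImageNet stats
    ToTensorV2(),
])
eval_tf = A.Compose([
    A.Resize(256, 256),
    A.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
    ToTensorV2(),
])

A.save(train_tf, "car_aug_train.yaml", data_format="yaml")
A.save(eval_tf, "car_aug_eval.yaml", data_format="yaml")
```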
If image_AUROC still doesn't move after adding these augmentations, there are two things I'd have you do next (again without changing the network structure).

You can also paste your anomalib version number plus the transforms printed by the datamodule in the training log (or a screenshot), and I can confirm whether these augmentations are actually being loaded.
My config file is as follows:

```yaml
dataset:
name: mvtec
format: folder
path: /media/jd/4997BB1603CFE2C4/lw/liwei23/anomalib/dataset
normal_dir: normal # subfolder of normal (positive) samples in the custom dataset
abnormal_dir: abnormal0 # subfolder of anomalous (negative) samples in the custom dataset
mask_dir: label_mask0 # path to binary masks; custom datasets usually have none, so set null
category: bottle
normal_test_dir: normal_test0 # name of the folder containing normal test images.
task: segmentation
extensions: null
train_batch_size: 32
eval_batch_size: 32
num_workers: 8
image_size: 512 # dimensions to which images are resized (mandatory)
center_crop: null # dimensions to which images are center-cropped after resizing (optional)
normalization: imagenet # data distribution to which the images will be normalized: [none, imagenet]
transform_config:
train: null
eval: null
test_split_mode: from_dir # options: [from_dir, synthetic]
test_split_ratio: 0.2 # fraction of train images held out for testing (usage depends on test_split_mode)
val_split_mode: same_as_test # options: [same_as_test, from_test, synthetic]
val_split_ratio: 0.5 # fraction of train/test images held out for validation (usage depends on val_split_mode)
tiling:
apply: true
tile_size: 256
stride: 128
remove_border_count: 0
use_random_tiling: False
random_tile_count: 16
model:
name: fastflow
backbone: resnet18 # options: [resnet18, wide_resnet50_2, cait_m48_448, deit_base_distilled_patch16_384]
pre_trained: true
flow_steps: 8 # options: [8, 8, 20, 20] - for each supported backbone
hidden_ratio: 1.0 # options: [1.0, 1.0, 0.16, 0.16] - for each supported backbone
conv3x3_only: True # options: [True, False, False, False] - for each supported backbone
lr: 0.001
weight_decay: 0.00001
early_stopping:
patience: 3
metric: pixel_AUROC
mode: max
normalization_method: min_max # options: [null, min_max, cdf]
metrics:
image:
- F1Score
- AUROC
pixel:
- F1Score
- AUROC
- AUPRO
threshold:
method: adaptive #options: [adaptive, manual]
manual_image: null
manual_pixel: null
visualization:
show_images: False # show images on the screen
save_images: True # save images to the file system
log_images: True # log images to the available loggers (if any)
image_save_path: ./results/1217fastflow/fastflow/mvtec/run/images # path to which images will be saved
mode: simple # options: ["full", "simple"]
project:
seed: 42
path: ./results/1217fastflow
logging:
logger: [tensorboard,csv]
log_graph: true
optimization:
export_mode: null # options: torch, onnx, openvino
trainer:
enable_checkpointing: true
default_root_dir: null
gradient_clip_val: 0
gradient_clip_algorithm: norm
num_nodes: 1
devices: 1
enable_progress_bar: true
overfit_batches: 0.0
track_grad_norm: -1
check_val_every_n_epoch: 1 # Don't validate before extracting features.
fast_dev_run: false
accumulate_grad_batches: 1
max_epochs: 500
min_epochs: null
max_steps: -1
min_steps: null
max_time: null
limit_train_batches: 1.0
limit_val_batches: 1.0
limit_test_batches: 1.0
limit_predict_batches: 1.0
val_check_interval: 1.0 # Don't validate before extracting features.
log_every_n_steps: 50
accelerator: auto # <"cpu", "gpu", "tpu", "ipu", "hpu", "auto">
strategy: null
sync_batchnorm: false
precision: 32
enable_model_summary: true
num_sanity_val_steps: 0
profiler: null
benchmark: false
deterministic: false
reload_dataloaders_every_n_epochs: 0
auto_lr_find: false
replace_sampler_ddp: true
detect_anomaly: false
auto_scale_batch_size: false
plugins: null
move_metrics_to_cpu: false
multiple_trainloader_mode: max_size_cycle
```

The code errors out as follows:

```
Epoch 0: 92%|█████████████████████████████████████████████████████████████████████████████▏ | 250/272 [01:46<00:09, 2.35it/s, loss=-1.68e+06, v_num=0, train_loss_step=-1.69e+6]
/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/utilities/data.py:83: UserWarning: Trying to infer the batch_size from an ambiguous collection. The batch size we found is 6. To avoid any miscalculations, use self.log(..., batch_size=batch_size).
warning_cache.warn(
Epoch 3: 100%|█| 272/272 [03:06<00:00, 1.46it/s, loss=-4.93e+06, v_num=0, train_loss_step=-4.98e+6, pixel_F1Score=0.0544, pixel_AUROC=0.776, pixel_AUPRO=0.336, train_loss_epoch=-3.98e+6]
Given normalized_shape=[64, 128, 128], expected input with shape [*, 64, 128, 128], but got input of size [1, 64, 64, 64]
Error occurs, No graph saved
Traceback (most recent call last):
File "tools/train.py", line 81, in <module>
train(args)
File "tools/train.py", line 66, in train
trainer.fit(model=model, datamodule=datamodule)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 608, in fit
call._call_and_handle_interrupt(
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _fit_impl
self._run(model, ckpt_path=self.ckpt_path)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1112, in _run
results = self._run_stage()
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1191, in _run_stage
self._run_train()
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1214, in _run_train
self.fit_loop.run()
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 206, in run
output = self.on_run_end()
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 323, in on_run_end
self.trainer._call_callback_hooks("on_train_end")
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1394, in _call_callback_hooks
fn(self, self.lightning_module, *args, **kwargs)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/anomalib/utils/callbacks/graph.py", line 44, in on_train_end
logger.log_graph(pl_module, input_array=torch.ones((1, 3, 256, 256)))
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/lightning_utilities/core/rank_zero.py", line 42, in wrapped_fn
return fn(*args, **kwargs)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/loggers/tensorboard.py", line 205, in log_graph
self.experiment.add_graph(model, input_array)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/utils/tensorboard/writer.py", line 841, in add_graph
graph(model, input_to_model, verbose, use_strict_trace)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/utils/tensorboard/_pytorch_graph.py", line 338, in graph
raise e
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/utils/tensorboard/_pytorch_graph.py", line 331, in graph
trace = torch.jit.trace(model, args, strict=False)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/jit/_trace.py", line 1000, in trace
traced_func = _trace_impl(
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/jit/_trace.py", line 695, in _trace_impl
return trace_module(
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/jit/_trace.py", line 1275, in trace_module
module._c._create_method_from_trace(
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1543, in _slow_forward
result = self.forward(*input, **kwargs)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/anomalib/models/components/base/anomaly_module.py", line 67, in forward
return self.model(batch)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1543, in _slow_forward
result = self.forward(*input, **kwargs)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/anomalib/models/fastflow/torch_model.py", line 190, in forward
features = self._get_cnn_features(input_tensor)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/anomalib/models/fastflow/torch_model.py", line 219, in _get_cnn_features
features = [self.norms[i](feature) for i, feature in enumerate(features)]
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/anomalib/models/fastflow/torch_model.py", line 219, in <listcomp>
features = [self.norms[i](feature) for i, feature in enumerate(features)]
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1543, in _slow_forward
result = self.forward(*input, **kwargs)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/nn/modules/normalization.py", line 202, in forward
return F.layer_norm(
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/nn/functional.py", line 2576, in layer_norm
return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)
RuntimeError: Given normalized_shape=[64, 128, 128], expected input with shape [*, 64, 128, 128], but got input of size [1, 64, 64, 64]
Epoch 3: 100%|██████████| 272/272 [03:08<00:00, 1.44it/s, loss=-4.93e+06, v_num=0, train_loss_step=-4.98e+6, pixel_F1Score=0.0544, pixel_AUROC=0.776, pixel_AUPRO=0.336, train_loss_epoch=-3.98e+6]
```
This error is not the training loop itself blowing up; it happens at the end of training, when the TensorBoard model graph is written.

The traceback makes it clear: the exception occurs in on_train_end of anomalib/utils/callbacks/graph.py, which traces the model with a fixed-size dummy input to log the graph to TensorBoard (hard-coded as torch.ones((1, 3, 256, 256)) in many versions). (Anomalib)

But you now have image_size: 512, and some of FastFlow's LayerNorms have their normalized_shape initialized from the feature-map size implied by the input resolution (512 → 128×128 feature maps). When the graph logger traces with 256, the feature maps come out 64×64, producing exactly the mismatch you saw:

expected [*, 64, 128, 128] but got [1, 64, 64, 64]
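A tiny standalone repro of that mismatch (the input/4 feature scale at this level is inferred from the shapes in your log):

```python
import torch
from torch import nn

# FastFlow builds its LayerNorms from the feature-map shape implied by image_size:
# a 512x512 input gives 128x128 features at this level; a 256x256 input gives 64x64.
norm = nn.LayerNorm([64, 128, 128])   # initialized for image_size=512

norm(torch.randn(1, 64, 128, 128))    # features from a 512 input: fine
try:
    norm(torch.randn(1, 64, 64, 64))  # features from the 256 graph-trace input
except RuntimeError as err:
    print(err)  # same "Given normalized_shape=[64, 128, 128] ..." error as in your log
```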
Just change this line in your config:

```yaml
logging:
  log_graph: false
```
The GraphLogger only records at the end of training anyway (on_train_end for both TensorBoard and Comet), so disabling it affects neither training nor metrics. (Anomalib)

In your anomalib version, the graph logger is most likely hard-coded to 256×256 (your log confirms this). (Lightrun)

The fix is to edit this file in your environment (the path is already in your traceback):

.../site-packages/anomalib/utils/callbacks/graph.py

Change the line that looks like:

```python
torch.ones((1, 3, 256, 256))
```

to:

```python
torch.ones((1, 3, 512, 512))
```

Note: if you later get tiling to actually take effect and the model really consumes tile_size=256 patches, then 256 is the correct graph input; but right now your model is clearly operating at 512-scale features, so change it to 512 first to get past the error.
Implementing augmentation via transform_config (yes, you can change it here). You currently have:

```yaml
transform_config:
  train: null
  eval: null
```

To enable augmentation, point these at Albumentations YAML file paths. The maintainers have said explicitly: point dataset.transform_config at an albumentations config file, and a custom transform file must include Resize + ToTensorV2 (most models also need ImageNet Normalize). (GitHub)
car_train_aug_512.yaml (training: random augmentations + mandatory transforms):

```yaml
__version__: 1.3.0
transform:
  __class_fullname__: Compose
  p: 1.0
  transforms:
    - __class_fullname__: Resize
      always_apply: true
      height: 512
      width: 512
      interpolation: 1
      p: 1.0
    # -- Start with "imaging-variation" augmentations (more stable for vehicle scenes) --
    - __class_fullname__: RandomBrightnessContrast
      brightness_limit: [-0.2, 0.2]
      contrast_limit: [-0.2, 0.2]
      brightness_by_max: true
      p: 0.6
    - __class_fullname__: RandomGamma
      gamma_limit: [80, 120]
      p: 0.3
    - __class_fullname__: GaussNoise
      var_limit: [5.0, 30.0]
      mean: 0
      per_channel: true
      p: 0.2
    - __class_fullname__: MotionBlur
      blur_limit: [3, 5]
      p: 0.15
    - __class_fullname__: ShiftScaleRotate
      shift_limit: 0.02
      scale_limit: 0.05
      rotate_limit: 5
      interpolation: 1
      border_mode: 4
      p: 0.25
    # -- Mandatory: Normalize + ToTensorV2 (many anomalib versions require you to
    #    write these explicitly when supplying custom transforms) --
    - __class_fullname__: Normalize
      max_pixel_value: 255.0
      mean: [0.485, 0.456, 0.406]
      std: [0.229, 0.224, 0.225]
      p: 1.0
    - __class_fullname__: ToTensorV2
      p: 1.0
      transpose_mask: false
```
car_eval_512.yaml (validation/testing: deterministic preprocessing only, no random augmentation):

```yaml
__version__: 1.3.0
transform:
  __class_fullname__: Compose
  p: 1.0
  transforms:
    - __class_fullname__: Resize
      always_apply: true
      height: 512
      width: 512
      interpolation: 1
      p: 1.0
    - __class_fullname__: Normalize
      max_pixel_value: 255.0
      mean: [0.485, 0.456, 0.406]
      std: [0.229, 0.224, 0.225]
      p: 1.0
    - __class_fullname__: ToTensorV2
      p: 1.0
      transpose_mask: false
```
This pattern (random augmentations in train, only Resize/Normalize/ToTensor in eval) is the most common setup in anomalib. (GitHub)

Change your config to this (swap in your real paths):

```yaml
dataset:
  transform_config:
    train: /media/jd/4997BB1603CFE2C4/lw/liwei23/anomalib/configs/car_train_aug_512.yaml
    eval: /media/jd/4997BB1603CFE2C4/lw/liwei23/anomalib/configs/car_eval_512.yaml
```
The latest docs state explicitly: for tiling to take effect, a TilerConfigurationCallback must bind the tiler to the model (otherwise you may think tiling is on while nothing is actually being tiled). (Anomalib)

Whether your old-style setup (config file + tools/train.py) adds this callback automatically depends on how your training script is written (many people miss it). So verify in the logs whether the model actually receives 512 images or 256 tiles; a sketch of adding the callback manually follows.
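A minimal sketch, assuming an anomalib 0.x-style import (check the class name and signature against your installed version):

```python
# In tools/train.py, before the Trainer is created: bind the tiler to the model
# so the dataset.tiling section of the config actually takes effect.
from anomalib.utils.callbacks import TilerConfigurationCallback

callbacks.append(
    TilerConfigurationCallback(
        enable=True,
        tile_size=256,  # match dataset.tiling.tile_size
        stride=128,     # match dataset.tiling.stride
    )
)
# Then pass `callbacks` to the Trainer as usual.
```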
Summary:

- Set `logging.log_graph: false` so the run can finish without errors. (Anomalib)
- Wire up the two `transform_config.train`/`eval` YAMLs; start with brightness/contrast/noise/slight-jitter augmentations. (GitHub)
- If you're willing, paste your `pip show anomalib` version (just the Version line) and I can write transform/tiling/callback configs that exactly match your version, avoiding the trap of mixing the "new-docs style" with the "old-project style".
I want the graph tensor size to be configurable and to adapt automatically to whatever I feed in, so I don't have to edit it by hand every time.
What you're asking for is essentially: stop tracing the model with anomalib's hard-coded torch.ones((1, 3, 256, 256)) and instead automatically pick up the input size that will really be fed to the model (so it adapts when you change image_size / tiling.tile_size). Then the TensorBoard graph will no longer fail on 256↔512 mismatches.

Your current error is exactly anomalib's GraphLogger logging the graph with a fixed 256 in on_train_end (many versions are written this way). (Lightrun)

Here are two implementations: option 1 is recommended (cleanest, no repeated edits to site-packages); option 2 is a one-time patch of site-packages.

Idea: at the end of training, pull one real batch from the dataloader, use the shape of its image as input_array, and call logger.log_graph. This way:
- `image_size=512` just works;
- tiled inputs (e.g. `tile_size=256`) are tracked automatically as well.

First, disable the built-in graph logger:

```yaml
logging:
  log_graph: false
```

(This field corresponds to anomalib's GraphLogger callback. (Anomalib Documentation))

dynamic_graph_logger.py: put the following file in your project, e.g. at callbacks/dynamic_graph_logger.py:
```python
# callbacks/dynamic_graph_logger.py
from __future__ import annotations

import torch
from pytorch_lightning import Callback
from lightning_utilities.core.rank_zero import rank_zero_warn


def _extract_4d_image_tensor(batch):
    """Best-effort extraction of a [B, C, H, W] image tensor from an anomalib/custom batch."""
    if torch.is_tensor(batch) and batch.ndim == 4:
        return batch
    if isinstance(batch, dict):
        # Common anomalib key: "image"
        for k in ("image", "input", "inputs", "img"):
            v = batch.get(k, None)
            if torch.is_tensor(v) and v.ndim == 4:
                return v
        # Otherwise take the first 4D tensor found
        for v in batch.values():
            if torch.is_tensor(v) and v.ndim == 4:
                return v
    if isinstance(batch, (list, tuple)):
        for v in batch:
            if torch.is_tensor(v) and v.ndim == 4:
                return v
            if isinstance(v, dict):
                t = _extract_4d_image_tensor(v)
                if t is not None:
                    return t
    return None


class DynamicGraphLogger(Callback):
    """Log the model graph with a real batch's input size, avoiding anomalib's hard-coded 256."""

    def __init__(self, when: str = "train_end"):
        """
        when: "fit_start" or "train_end"
          - fit_start: logs earlier (but some models/loggers may not be ready yet)
          - train_end: safest (your error also happened at train_end)
        """
        assert when in ("fit_start", "train_end")
        self.when = when
        self._logged = False

    def _log_graph(self, trainer, pl_module):
        if self._logged:
            return
        if not getattr(trainer, "loggers", None):
            return
        # 1) Take one batch from the train dataloader (closest to the real training
        #    input size, including tiling/resize etc.)
        input_array = None
        try:
            dl = trainer.datamodule.train_dataloader()
            batch = next(iter(dl))
            img = _extract_4d_image_tensor(batch)
            if img is None:
                raise RuntimeError("could not find a [B, C, H, W] image tensor in the batch")
            input_array = img[:1]  # a single image is enough to draw the graph
        except Exception as e:
            rank_zero_warn(f"[DynamicGraphLogger] failed to fetch a batch, skipping graph logging: {e}")
            return
        # 2) Move the input onto the model's device (avoids CPU/GPU mismatches)
        try:
            device = next(pl_module.parameters()).device
            input_array = input_array.to(device)
        except Exception:
            pass
        # 3) Try log_graph on every logger that supports it (TensorBoardLogger accepts an
        #    input_array argument: https://pytorch-lightning.readthedocs.io/en/1.2.10/extensions/generated/pytorch_lightning.loggers.TensorBoardLogger.html)
        ok = False
        for logger in trainer.loggers:
            if hasattr(logger, "log_graph"):
                try:
                    logger.log_graph(pl_module, input_array=input_array)
                    ok = True
                except Exception as e:
                    rank_zero_warn(f"[DynamicGraphLogger] logger {type(logger)} failed to log the graph: {e}")
        self._logged = ok

    def on_fit_start(self, trainer, pl_module):
        if self.when == "fit_start":
            self._log_graph(trainer, pl_module)

    def on_train_end(self, trainer, pl_module):
        if self.when == "train_end":
            self._log_graph(trainer, pl_module)
```
Add the callback to the Trainer in tools/train.py. For example (place it according to your own code structure):

```python
from callbacks.dynamic_graph_logger import DynamicGraphLogger

callbacks = []
callbacks.append(DynamicGraphLogger(when="train_end"))

trainer = pl.Trainer(
    ...,
    callbacks=callbacks,
)
```
From then on, you only change image_size/tile_size in the YAML; the graph input automatically follows your dataloader, and you never touch a tensor size by hand.

If you don't want to modify train.py, directly edit this file in your environment:

.../site-packages/anomalib/utils/callbacks/graph.py

and replace the torch.ones((1, 3, 256, 256)) line with dynamic inference:

- pull one batch from `trainer.datamodule.train_dataloader()` and take its size, or
- read `trainer.datamodule.image_size` or the config.

The idea is the same as the callback above; you're just pushing the logic back into anomalib's GraphLogger, as sketched below.
(Your current error chain comes precisely from the 256 hard-coded here. (Lightrun))
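A minimal sketch of the patched hook (assumption: graph.py already imports torch, and your datamodule's batches carry the image under the "image" key):

```python
def on_train_end(self, trainer, pl_module) -> None:
    """Log the model graph using the real input size instead of a hard-coded 256."""
    try:
        batch = next(iter(trainer.datamodule.train_dataloader()))
        input_array = batch["image"][:1].cpu()      # [1, C, H, W] at the true resolution
    except Exception:
        input_array = torch.ones((1, 3, 256, 256))  # original hard-coded fallback
    for logger in trainer.loggers:
        if hasattr(logger, "log_graph"):
            logger.log_graph(pl_module, input_array=input_array)
```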
If you paste the few lines of your tools/train.py that create the Trainer (just the trainer/callbacks part), I can write the callback insertion as a directly paste-able snippet for your code.