anomalib notes
Created: December 10, 2025
Answered using GPT-5.1 Thinking (Chat01)
To use wandb logger install it using pip install wandb
PrecisionRecallCurve will save all targets and predictions in buffer. For large datasets this may lead to large memory footprint.
Trainer(limit_train_batches=1.0) was configured so 100% of the batches per epoch will be used.
Trainer(limit_val_batches=1.0) was configured so 100% of the batches will be used.
Trainer(limit_test_batches=1.0) was configured so 100% of the batches will be used.
Trainer(limit_predict_batches=1.0) was configured so 100% of the batches will be used.
Trainer(val_check_interval=1.0) was configured so validation will run at the end of the training epoch.
You are using a CUDA device that has Tensor Cores. To properly utilize them, you should set torch.set_float32_matmul_precision('medium' | 'high') which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
ROC will save all targets and predictions in buffer. For large datasets this may lead to large memory footprint.
36.3 M Trainable params
4.0 M Non-trainable params
40.3 M Total params
161.195 Total estimated model params size (MB)
/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/loggers/tensorboard.py:191: UserWarning: Could not log computational graph to TensorBoard: The model.example_input_array attribute is not set or input_array was not given.
rank_zero_warn(
/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py:1609: PossibleUserWarning: The number of training batches (18) is smaller than the logging interval Trainer(log_every_n_steps=50). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.
rank_zero_warn(
Training: 0it [00:00, ?it/s]2025-12-10 10:50:34,277 - anomalib.data.utils.download - INFO - Downloading the vq_model_pretrained_128_4096.pckl dataset.
vq_model_pretrained_128_4096.pckl: 1%|█▏ | 172k/15.0M [14:25<20:47:08, 199B/s]
/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py:48: UserWarning: Detected KeyboardInterrupt, attempting graceful shutdown...
rank_zero_warn("Detected KeyboardInterrupt, attempting graceful shutdown...")
2025-12-10 11:05:00,173 - anomalib - INFO - Loading the best model weights.
2025-12-10 11:05:00,174 - anomalib - INFO - Testing the model.
2025-12-10 11:05:00,180 - pytorch_lightning.utilities.rank_zero - INFO - You are using a CUDA device ('NVIDIA GeForce RTX 3090') that has Tensor Cores. To properly utilize them, you should set torch.set_float32_matmul_precision('medium' | 'high') which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
2025-12-10 11:05:00,181 - anomalib.utils.callbacks.model_loader - INFO - Loading the model from
/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/anomalib/utils/callbacks/model_loader.py:32: FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
pl_module.load_state_dict(torch.load(self.weights_path, map_location=pl_module.device)["state_dict"])
^CTraceback (most recent call last):
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 842, in _test_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1051, in _run
self._call_setup_hook() # allow user to setup lightning_module in accelerator environment
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1299, in _call_setup_hook
self._call_callback_hooks("setup", stage=fn)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1394, in _call_callback_hooks
fn(self, self.lightning_module, *args, **kwargs)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/anomalib/utils/callbacks/model_loader.py", line 32, in setup
pl_module.load_state_dict(torch.load(self.weights_path, map_location=pl_module.device)["state_dict"])
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/serialization.py", line 1065, in load
with _open_file_like(f, 'rb') as opened_file:
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/serialization.py", line 468, in _open_file_like
return _open_file(name_or_buffer, mode)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/serialization.py", line 449, in init
super().init(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: ''
During handling of the above exception, another exception occurred:
36.3 M Trainable params
4.0 M Non-trainable params
40.3 M Total params
161.195 Total estimated model params size (MB)
/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/loggers/tensorboard.py:191: UserWarning: Could not log computational graph to TensorBoard: The model.example_input_array attribute is not set or input_array was not given.
rank_zero_warn(
/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py:1609: PossibleUserWarning: The number of training batches (18) is smaller than the logging interval Trainer(log_every_n_steps=50). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.
rank_zero_warn(
Training: 0it [00:00, ?it/s]2025-12-10 11:07:35,317 - anomalib.data.utils.download - INFO - Existing dataset archive found. Skipping download stage.
2025-12-10 11:07:35,317 - anomalib.data.utils.download - INFO - Extracting dataset into root folder.
Traceback (most recent call last):
File "tools/train.py", line 81, in <module>
train(args)
File "tools/train.py", line 66, in train
trainer.fit(model=model, datamodule=datamodule)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 608, in fit
call._call_and_handle_interrupt(
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _fit_impl
self._run(model, ckpt_path=self.ckpt_path)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1112, in _run
results = self._run_stage()
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1191, in _run_stage
self._run_train()
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1214, in _run_train
self.fit_loop.run()
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 194, in run
self.on_run_start(*args, **kwargs)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 218, in on_run_start
self.trainer._call_lightning_module_hook("on_train_start")
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1356, in _call_lightning_module_hook
output = fn(*args, **kwargs)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/anomalib/models/dsr/lightning_model.py", line 93, in on_train_start
ckpt: Path = self.prepare_pretrained_model()
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/anomalib/models/dsr/lightning_model.py", line 60, in prepare_pretrained_model
download_and_extract(pretrained_models_dir, WEIGHTS_DOWNLOAD_INFO)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/anomalib/data/utils/download.py", line 312, in download_and_extract
extract(downloaded_file_path, root)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/anomalib/data/utils/download.py", line 265, in extract
with ZipFile(file_name, "r") as zip_file:
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/zipfile.py", line 1271, in init
self._RealGetContents()
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/zipfile.py", line 1338, in _RealGetContents
raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file
Training: 0it [00:00, ?it/s]
What is going on here? Isn't this zip normally placed under /home/jd/.cache/torch/hub/checkpoints after download? My connection is slow, so I downloaded the file myself and put it there, but it still fails and my downloaded file isn't used.
The root cause is not DSR itself: the pretrained-weights archive got corrupted, and anomalib keeps trying to extract that broken file, which is why you see:
zipfile.BadZipFile: File is not a zip file
Based on your logs, the timeline is roughly this.
First run:
Downloading the vq_model_pretrained_128_4096.pckl dataset.
vq_model_pretrained_128_4096.pckl: 1%|█▏ | 172k/15.0M [...]
...
UserWarning: Detected KeyboardInterrupt, attempting graceful shutdown...
In other words: DSR was downloading the vq_model_pretrained_128_4096 pretrained model when you pressed Ctrl+C.
→ Result: a partially downloaded zip/data file was left on disk.
Every run after that:
Existing dataset archive found. Skipping download stage.
Extracting dataset into root folder.
...
zipfile.BadZipFile: File is not a zip file
That is: download_and_extract → extract(...) tries to open the file with ZipFile(...), and gets BadZipFile: File is not a zip file.
💡 So the chain of events is:
first download interrupted → corrupted zip left behind → every later run skips the download and extracts the broken file → BadZipFile.
Why doesn't /home/jd/.cache/torch/hub/checkpoints work? You wrote:
Isn't this zip normally placed under /home/jd/.cache/torch/hub/checkpoints after download? ... it doesn't use the file I downloaded either.
The key point is that these weights are not fetched via torch.hub. Anomalib uses its own helper:
anomalib.data.utils.download.download_and_extract(root, info)
which has its own cache path and naming rules and has nothing to do with ~/.cache/torch/hub/checkpoints. (Anomalib Documentation) The file goes into the root directory computed in Dsr.prepare_pretrained_model() (pretrained_models_dir), and the filename comes from the filename field of WEIGHTS_DOWNLOAD_INFO, or the last segment of the URL. (Anomalib Documentation) So:
a zip you manually place in ~/.cache/torch/hub/checkpoints is somewhere anomalib never looks, which is exactly why it "doesn't use the file you downloaded".
You need to find the archive anomalib thinks already exists, and delete it.
In a terminal, first search globally for the filename (note the exact name from your log):
find ~ -iname "vq_model_pretrained_128_4096*"
It commonly turns up in directories like these (trust the find output over this list):
~/.cache/anomalib/...
<your anomalib project dir>/pretrained_models/...
paths like results/1207dsr/.../pretrained/...
If you see a zip of a few tens or hundreds of KB there (clearly not the full 15 MB file), that is the half-finished leftover from the interrupted download.
Delete the offending archive(s), for example:
rm /path/to/vq_model_pretrained_128_4096*.zip
# or, depending on what find returned:
rm /path/to/vq_model_pretrained_128_4096*
If a same-named extracted directory sits next to it but is empty, delete that too, so the state is clean.
Then run again:
python tools/train.py --model dsr --config src/anomalib/models/dsr/config.yaml
This time you should again see:
Downloading the vq_model_pretrained_128_4096.pckl dataset.
Do not Ctrl+C the download this time; once the pretrained weights finish downloading and extracting, training will no longer trigger the BadZipFile.
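Before rerunning, a suspect archive can also be checked directly with Python's standard zipfile module (a minimal sketch; the path you pass is whatever your find command turned up):

```python
import zipfile
from pathlib import Path

def check_archive(path: str) -> str:
    """Classify a downloaded archive: missing, not a zip, corrupt member, or ok."""
    p = Path(path)
    if not p.exists():
        return "missing"
    # is_zipfile only checks the zip signature; a truncated download usually fails here.
    if not zipfile.is_zipfile(p):
        return "not a zip (truncated or wrong format)"
    with zipfile.ZipFile(p) as zf:
        bad = zf.testzip()  # CRC-checks every member; returns first bad name or None
    return "ok" if bad is None else f"corrupt member: {bad}"
```

A partially downloaded file typically fails the is_zipfile check already, which matches the BadZipFile you saw.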
If you instead want to use a file you downloaded manually, there are roughly two approaches.
Open src/anomalib/models/dsr/lightning_model.py and find prepare_pretrained_model() and WEIGHTS_DOWNLOAD_INFO:
how pretrained_models_dir is constructed (what the root directory is);
what WEIGHTS_DOWNLOAD_INFO.filename is (or, if filename is not set, the last segment of the url).
Suppose you see something like:
pretrained_models_dir = Path.home() / ".cache" / "anomalib" / "dsr"
WEIGHTS_DOWNLOAD_INFO = DownloadInfo(
    name="vq_model_pretrained_128_4096.pckl",
    url="https://xxx/vq_model_pretrained_128_4096.zip",
    hashsum="xxxx",
)
Then download_and_extract will download the zip to:
~/.cache/anomalib/dsr/vq_model_pretrained_128_4096.zip
and extract it there.
What you should do:
mkdir -p ~/.cache/anomalib/dsr
cp /path/to/your-downloaded.zip ~/.cache/anomalib/dsr/vq_model_pretrained_128_4096.zip
On the next training run the log will show Existing dataset archive found. Skipping download stage., and it will extract the zip you placed there; if the archive is complete and valid, the error is gone.
Note: if what you downloaded is a .pckl rather than a .zip, and extract then tries to unzip it, BadZipFile is inevitable, so both the format and the filename must match WEIGHTS_DOWNLOAD_INFO.
If you already have the raw vq_model_pretrained_128_4096.pckl from the original DSR repository (uncompressed), you can also bypass download_and_extract entirely, e.g. (sketch):
have Dsr.prepare_pretrained_model() return the path of your .pckl directly:
def prepare_pretrained_model(self) -> Path:
    return Path("/your/own/path/vq_model_pretrained_128_4096.pckl")
or short-circuit the logic before the download_and_extract call so it skips extraction and just uses the existing file. This requires a small source edit; it works if you don't mind, but note it may be overwritten when you later upgrade anomalib.
About this part of the first run's log:
FileNotFoundError: [Errno 2] No such file or directory: ''
...
pl_module.load_state_dict(torch.load(self.weights_path, ... )["state_dict"])
That happened because you pressed Ctrl+C before training finished, so no checkpoint had been saved yet,
but the Lightning ModelLoader callback still tried to load the "best model weights"; weights_path was an empty string, hence the error.
This is not the root problem: once a training run completes normally and produces a *.ckpt file, this error disappears on its own.
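For what it's worth, the empty weights_path reproduces exactly this traceback: torch.load("") has to open the file first, and opening an empty path raises the same FileNotFoundError (plain-Python sketch, no torch needed):

```python
# The ModelLoader callback effectively does torch.load(self.weights_path);
# with weights_path == "" that boils down to opening an empty path:
try:
    open("", "rb")
except FileNotFoundError as err:
    print(err)  # [Errno 2] No such file or directory: ''
```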
In summary: the crash is the BadZipFile caused by the corrupted archive; /home/jd/.cache/torch/hub/checkpoints is torch.hub's location, this DSR code never looks there, so a file you put there is never used. Start with step 3 above: find and delete the bad zip and rerun. If it still fails, paste the paths that find returns and I can help decide what to delete or where the file belongs.
How do I solve this one: 2025-12-13 14:44:03,503 - anomalib.models.efficient_ad.lightning_model - INFO - Load pretrained teacher model from pre_trained/efficientad_pretrained_weights/pretrained_teacher_small.pth
/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/anomalib/models/efficient_ad/lightning_model.py:105: FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
self.model.teacher.load_state_dict(torch.load(teacher_path, map_location=torch.device(self.device)))
Traceback (most recent call last):
File "tools/train.py", line 81, in <module>
train(args)
File "tools/train.py", line 60, in train
model = get_model(config)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/anomalib/models/init.py", line 136, in get_model
model = getattr(module, f"{convert_snake_to_pascal_case(config.model.name)}Lightning")(config)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/anomalib/models/efficient_ad/lightning_model.py", line 278, in init
super().init(
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/anomalib/models/efficient_ad/lightning_model.py", line 94, in init
self.prepare_pretrained_model()
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/anomalib/models/efficient_ad/lightning_model.py", line 105, in prepare_pretrained_model
self.model.teacher.load_state_dict(torch.load(teacher_path, map_location=torch.device(self.device)))
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/serialization.py", line 1065, in load
with _open_file_like(f, 'rb') as opened_file:
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/serialization.py", line 468, in _open_file_like
return _open_file(name_or_buffer, mode)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/serialization.py", line 449, in init
super().init(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: 'pre_trained/efficientad_pretrained_weights/pretrained_teacher_small.pth'
Where should I put the file I downloaded?
This error simply means the code cannot find the pretrained teacher weights at this path:
pre_trained/efficientad_pretrained_weights/pretrained_teacher_small.pth
and indeed you haven't yet put the file where it is looking.
In EfficientAd.prepare_pretrained_model(), the path is assembled like this: (GitHub)
pretrained_models_dir = Path("./pre_trained/")
if not (pretrained_models_dir / "efficientad_pretrained_weights").is_dir():
    download_and_extract(pretrained_models_dir, WEIGHTS_DOWNLOAD_INFO)
teacher_path = (
    pretrained_models_dir
    / "efficientad_pretrained_weights"
    / f"pretrained_teacher_{self.model_size}.pth"
)
logger.info(f"Load pretrained teacher model from {teacher_path}")
self.model.teacher.load_state_dict(torch.load(teacher_path, map_location=torch.device(self.device)))
Combined with the directory you run the command from:
/media/jd/4997BB1603CFE2C4/lw/liwei23/anomalib$ python tools/train.py ...
the actual lookup path is:
/media/jd/4997BB1603CFE2C4/lw/liwei23/anomalib/pre_trained/efficientad_pretrained_weights/pretrained_teacher_small.pth
✅ So: your downloaded pretrained_teacher_small.pth must go into that path under the anomalib repository root, not into ~/.cache/torch/hub/checkpoints or anywhere similar.
Suppose you have already downloaded pretrained_teacher_small.pth somewhere, say /home/jd/Downloads/pretrained_teacher_small.pth.
In the anomalib root directory (the one where you run python tools/train.py), execute:
cd /media/jd/4997BB1603CFE2C4/lw/liwei23/anomalib
# 1. Create the directory structure
mkdir -p pre_trained/efficientad_pretrained_weights
# 2. Copy the weights in (the filename must match exactly)
cp /home/jd/Downloads/pretrained_teacher_small.pth \
   pre_trained/efficientad_pretrained_weights/pretrained_teacher_small.pth
# 3. Check
ls -lh pre_trained/efficientad_pretrained_weights
You should see something like:
pretrained_teacher_small.pth
Then run:
python tools/train.py --model efficient_ad --config <your_efficient_ad_config>.yaml
and the FileNotFoundError will be gone.
The official EfficientAD pretrained weights are usually packaged as a zip (e.g. efficientad_pretrained_weights.zip); download_and_extract exists precisely to unzip it and produce the efficientad_pretrained_weights directory. (GitHub)
If what you have is the zip:
cd /media/jd/4997BB1603CFE2C4/lw/liwei23/anomalib
mkdir -p pre_trained
unzip /home/jd/Downloads/efficientad_pretrained_weights.zip -d pre_trained
After extraction the directory structure should be:
pre_trained/
  efficientad_pretrained_weights/
    pretrained_teacher_small.pth
    (possibly other files)
As long as pretrained_teacher_small.pth is there, you are done.
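A quick way to confirm the layout before rerunning is to rebuild the same path the code does (a sketch mirroring the snippet above; model_size is assumed to be "small" here):

```python
from pathlib import Path

def teacher_weights_path(repo_root: str, model_size: str = "small") -> Path:
    """Mirror EfficientAd.prepare_pretrained_model()'s path construction."""
    pretrained_models_dir = Path(repo_root) / "pre_trained"
    return (
        pretrained_models_dir
        / "efficientad_pretrained_weights"
        / f"pretrained_teacher_{model_size}.pth"
    )

# Check whether the file is where training will look for it.
path = teacher_weights_path("/media/jd/4997BB1603CFE2C4/lw/liwei23/anomalib")
print(path, "exists" if path.exists() else "MISSING")
```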
The log also contains:
FutureWarning: You are using `torch.load` with `weights_only=False` ...
That is only a heads-up about a future PyTorch behavior change, not an error; it won't make training fail and can be ignored for now. What actually crashes the program is the FileNotFoundError.
You asked: "Where should I put the file I downloaded?"
Answer:
Put pretrained_teacher_small.pth under the anomalib root directory you run python tools/train.py from, at:
pre_trained/efficientad_pretrained_weights/pretrained_teacher_small.pth
The full absolute path is roughly:
/media/jd/4997BB1603CFE2C4/lw/liwei23/anomalib/pre_trained/efficientad_pretrained_weights/pretrained_teacher_small.pth
Run the mkdir + cp commands above once and try again. If you paste the ls output of the directory tree, I can confirm whether everything is in the right place.
What is this problem: Traceback (most recent call last):
File "tools/train.py", line 81, in <module>
train(args)
File "tools/train.py", line 76, in train
trainer.test(model=model, datamodule=datamodule)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 794, in test
return call._call_and_handle_interrupt(
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 842, in _test_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1112, in _run
results = self._run_stage()
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1188, in _run_stage
return self._run_evaluate()
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1228, in _run_evaluate
eval_loop_results = self._evaluation_loop.run()
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 152, in advance
dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 143, in advance
self._on_evaluation_batch_end(output, **kwargs)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 275, in _on_evaluation_batch_end
self.trainer._call_callback_hooks(hook_name, output, *kwargs.values())
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1394, in _call_callback_hooks
fn(self, self.lightning_module, *args, **kwargs)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/anomalib/utils/callbacks/visualizer/visualizer_image.py", line 84, in on_test_batch_end
for i, image in enumerate(self.visualizer.visualize_batch(outputs)):
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/anomalib/post_processing/visualizer.py", line 122, in visualize_batch
yield self.visualize_image(image_result)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/anomalib/post_processing/visualizer.py", line 134, in visualize_image
return self._visualize_full(image_result)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/anomalib/post_processing/visualizer.py", line 165, in _visualize_full
assert image_result.pred_mask is not None
AssertionError
Does this mean that if the task in the anomalib model's config is classification, it cannot do segmentation?
The essence of this error: the visualization callback is drawing images for a "segmentation" task, but the model never produced a pixel-level mask pred_mask, so the assert blows up.
The key line is visible in the stack (this is how the official source reads): (Anomalib Documentation)
# anomalib.post_processing.visualizer.Visualizer._visualize_full
if self.task == "segmentation":
    assert image_result.pred_mask is not None
    ...
In other words:
ImageVisualizerCallback is currently configured with task="segmentation";
image_result.pred_mask is None (the model/pipeline produced no pixel-level prediction mask);
assert image_result.pred_mask is not None fails and raises AssertionError.
You asked: does a config task of classification mean it can't do segmentation?
More precisely:
the task in config.yaml only tells the data, visualization, and post-processing whether to follow "classification" or "segmentation" logic:
task: classification → only image-level scores/labels are needed; pred_mask is not required;
task: segmentation → assumes the model provides an anomaly map plus pred_mask, and visualization draws heatmaps and masks.
Writing task: segmentation does not magically turn a classification-only model into a segmentation model; the model has no branch producing pixel-level masks, so pred_mask stays None and visualization is guaranteed to crash.
So this error is a configuration mismatch: visualization believes it is a segmentation task, while the actual outputs contain only classification information.
What you do next depends on which of two things you want: just classification, or actual segmentation results. The two cases are handled differently.
If you only want classification, make the whole pipeline follow classification instead of segmentation.
Check/modify these places (rough guide; exact field names may vary slightly by version):
dataset:
  task: classification
Or, if you use the Folder datamodule through code/API, pass task="classification" there as well. (Anomalib Documentation)
Find the ImageVisualizerCallback / image_visualizer section of your config and set its task to classification too, for example:
callbacks:
  image_visualizer:
    task: classification
    mode: full  # or 'simple'
This makes the Visualizer take the classification branch of _visualize_full and never touch pred_mask. (Anomalib Documentation)
The bluntest option is to disable the callback altogether: delete/comment out image_visualizer in the config, or set enable: false / save_images: false if your version has such a switch.
In short: classification only → set task to classification everywhere; don't let the visualizer believe it is segmentation, and the assert never fires.
If you actually want segmentation, confirm two things:
The model itself supports segmentation.
A model whose Model Type is Classification (e.g. GANomaly) is inherently image-level and has no pred_mask output; merely writing task: segmentation in the config will never produce masks. (GitHub)
The data and config are set up for a segmentation task.
dataset.task and the callbacks' task must consistently be "segmentation", so the PostProcessor generates pred_mask and the Visualizer can draw it.
If you take a classification-type model (such as GANomaly) and force task: segmentation, there will never be a pred_mask, and it will crash on this assert forever.
To recap: the visualizer requires pred_mask to be non-None, but your model's output has no pred_mask, hence the AssertionError. Either set everything to classification (or disable the image visualizer), or pick a genuinely segmentation-capable model; don't expect the task field alone to make a classification model output masks. If you paste the exact model name plus a short slice of your config.yaml (the dataset and callbacks sections), I can write a minimal set of changes for your version that runs without this error.
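The failure mode reduces to a tiny sketch of the visualizer's branching (illustrative only, not the real class):

```python
def visualize(task: str, pred_mask=None) -> str:
    """Illustrative reduction of Visualizer._visualize_full's branching."""
    if task == "segmentation":
        # This is the assert that fires when a classification-only model
        # is run with task: segmentation in the config.
        assert pred_mask is not None, "segmentation task but model produced no pred_mask"
        return "draw heatmap + mask"
    return "draw image-level score only"
```

With task="classification" the mask is never touched; with task="segmentation" and a model that yields no mask, the assert is guaranteed to fire.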
I have two GPUs. My anomalib training config is below; I want training to use my GPU 1 instead of GPU 0:
dataset:
name: mvtec
format: folder
path: /media/jd/4997BB1603CFE2C4/lw/liwei23/anomalib/dataset
normal_dir: normal # subfolder with the normal samples of the custom dataset
abnormal_dir: abnormal # subfolder with the abnormal samples of the custom dataset
mask_dir: label_mask1 # path to binary masks; custom datasets usually have none, use null
category: bottle
normal_test_dir: normal_test # name of the folder containing normal test images.
task: segmentation
extensions: null
train_batch_size: 32
eval_batch_size: 32
split_ratio: 0.2 # ratio of the normal images that will be used to create a test split
num_workers: 8
image_size: 256 # dimensions to which images are resized (mandatory)
center_crop: null # dimensions to which images are center-cropped after resizing (optional)
normalization: imagenet # data distribution to which the images will be normalized: [none, imagenet]
transform_config:
train: null
eval: null
test_split_mode: from_dir # options: [from_dir, synthetic]
test_split_ratio: 0.2 # fraction of train images held out testing (usage depends on test_split_mode)
val_split_mode: same_as_test # options: [same_as_test, from_test, synthetic]
val_split_ratio: 0.5 # fraction of train/test images held out for validation (usage depends on val_split_mode)
tiling:
apply: false
tile_size: null
stride: null
remove_border_count: 0
use_random_tiling: False
random_tile_count: 16
model:
name: cflow
backbone: wide_resnet50_2
pre_trained: true
layers:
- layer2
- layer3
- layer4
decoder: freia-cflow
condition_vector: 128
coupling_blocks: 8
clamp_alpha: 1.9
fiber_batch_size: 64
permute_soft: false
lr: 0.0001
early_stopping:
patience: 2
metric: pixel_AUROC
mode: max
normalization_method: min_max # options: [null, min_max, cdf]
metrics:
image:
- F1Score
- AUROC
pixel:
- F1Score
- AUROC
- AUPRO
threshold:
method: adaptive #options: [adaptive, manual]
manual_image: null
manual_pixel: null
visualization:
show_images: False # show images on the screen
save_images: True # save images to the file system
log_images: True # log images to the available loggers (if any)
image_save_path: ./results/1217cflow # path to which images will be saved
mode: full # options: ["full", "simple"]
project:
seed: 0
path: ./results/1217cflow
logging:
logger: [tensorboard,csv]
log_graph: true
optimization:
export_mode: null # options: torch, onnx, openvino
To use wandb logger install it using pip install wandb
...use None for the seed value (null in the YAML file) or remove the seed key from the YAML file.
PrecisionRecallCurve will save all targets and predictions in buffer. For large datasets this may lead to large memory footprint.
Trainer(limit_train_batches=1.0) was configured so 100% of the batches per epoch will be used.
Trainer(limit_val_batches=1.0) was configured so 100% of the batches will be used.
Trainer(limit_test_batches=1.0) was configured so 100% of the batches will be used.
Trainer(limit_predict_batches=1.0) was configured so 100% of the batches will be used.
Trainer(val_check_interval=1.0) was configured so validation will run at the end of the training epoch.
You are using a CUDA device ('NVIDIA GeForce RTX 3090') that has Tensor Cores. To properly utilize them, you should set torch.set_float32_matmul_precision('medium' | 'high') which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
Missing logger folder: results/1217cflow/cflow/mvtec/run/logs/Tensorboard Logs
/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torchmetrics/utilities/prints.py:36: UserWarning: Metric ROC will save all targets and predictions in buffer. For large datasets this may lead to large memory footprint.
warnings.warn(*args, **kwargs)
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1]
81.6 M Trainable params
154 M Non-trainable params
236 M Total params
946.249 Total estimated model params size (MB)
/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/loggers/tensorboard.py:191: UserWarning: Could not log computational graph to TensorBoard: The model.example_input_array attribute is not set or input_array was not given.
rank_zero_warn(
/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:208: UserWarning: num_workers>0, persistent_workers=False, and strategy=ddp_spawn may result in data loading bottlenecks. Consider setting persistent_workers=True (this is a limitation of Python .spawn() and PyTorch)
rank_zero_warn(
Epoch 0:   0%|          | 0/190 [00:00<?, ?it/s]
/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/albumentations/__init__.py:13: UserWarning: A new version of Albumentations is available: 2.0.8 (you have 1.4.14). Upgrade using: pip install -U albumentations. To disable automatic update checks, set the environment variable NO_ALBUMENTATIONS_UPDATE to 1.
check_for_updates()
OpenVINO is not installed. Please install OpenVINO to use OpenVINOInferencer.
OpenVINO is not installed. Please install OpenVINO to use OpenVINOInferencer.
To use wandb logger install it using `pip install wandb`
[... the same Albumentations / OpenVINO / wandb warnings repeat once per DataLoader worker process ...]
^C/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py:48: UserWarning: Detected KeyboardInterrupt, attempting graceful shutdown...
rank_zero_warn("Detected KeyboardInterrupt, attempting graceful shutdown...")
2025-12-23 00:50:43,762 - anomalib - INFO - Loading the best model weights.
2025-12-23 00:50:43,763 - anomalib - INFO - Testing the model.
What is going on with this output, and how do I fix it?
The short answer first:
This output is not an error. Because you configured devices: 2, Lightning launched DDP with multiple processes training on two GPUs (0 and 1), so all of those Albumentations / OpenVINO / wandb warnings are printed once per process, which makes it look like the run is stuck and just spamming logs.
What you want is: use only GPU 1, not GPU 0, and ideally without all the duplicated output. Let's fix it step by step.
The key lines:

```text
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2 ...
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2 ...
distributed_backend=nccl
All distributed processes registered. Starting with 2 processes ...
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1]
Epoch 0:   0%| 0/190 [00:00<?, ?it/s]
```
What this means:

- devices: 2 + accelerator: gpu → Lightning automatically uses DDP (multi-GPU distributed training). GLOBAL_RANK 0 runs on cuda:0, GLOBAL_RANK 1 on cuda:1.
- The tail of the log is just your Ctrl-C:

```text
^C ... Detected KeyboardInterrupt, attempting graceful shutdown...
2025-12-23 00:50:43,762 - anomalib - INFO - Loading the best model weights.
2025-12-23 00:50:43,763 - anomalib - INFO - Testing the model.
```

So: the current output = DDP multi-processing, with every library warning printed once per process. Noisy, but not a bug.
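Incidentally, this is also why frameworks guard their own messages with "rank-zero-only" printing. A minimal sketch of the idea (the LOCAL_RANK environment variable is what Lightning sets for each DDP process; the helper itself is hypothetical, not a Lightning API):

```python
import os

def rank_zero_print(*args, **kwargs):
    """Print only on the main process (LOCAL_RANK 0 or unset)."""
    if int(os.environ.get("LOCAL_RANK", "0")) == 0:
        print(*args, **kwargs)

# With two DDP processes, only rank 0 emits the message:
os.environ["LOCAL_RANK"] = "1"
rank_zero_print("loaded dataset")   # suppressed on rank 1
os.environ["LOCAL_RANK"] = "0"
rank_zero_print("loaded dataset")   # printed exactly once
```

Third-party warnings (Albumentations, OpenVINO, wandb) don't go through such a guard, which is why you see them once per process.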
Change these two lines in the trainer section of your config:

```yaml
trainer:
  accelerator: gpu
  devices: 1        # was 2; use only one GPU
  strategy: null    # or delete this line and let Lightning choose
```
Then launch it like this in the terminal:

```bash
# expose only physical GPU 1 to the program
CUDA_VISIBLE_DEVICES=1 \
python tools/train.py --model cflow --config src/anomalib/models/cflow/config.yaml
```
With this setup:

- the program only "sees" one card (physical GPU 1);
- cuda:0 inside Lightning is actually your physical GPU 1;
- the log becomes something like:

```text
GPU available: True (cuda), used: True
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
```

No more GLOBAL_RANK 1, and no second process grabbing GPU 0.
Caveat: if you keep devices: 2 but CUDA_VISIBLE_DEVICES=1 exposes only one card, Lightning will error out (it wants 2 GPUs but only sees 1), so be sure to change devices to 1 at the same time.
If you would rather not touch environment variables, you can try this instead (it depends on your PyTorch Lightning version):

```yaml
trainer:
  accelerator: gpu
  devices: [1]      # explicitly pick GPU index 1
  strategy: null
```

In some versions devices: [1] means "use only the card with index 1". In practice I still recommend option 1 with CUDA_VISIBLE_DEVICES: its behaviour is the most stable and the easiest to reason about.
The remaining warnings are not errors, just annoying, and can be silenced.

The one that repeats constantly:

```text
A new version of Albumentations is available: 2.0.8 (you have 1.4.14).
To disable automatic update checks, set the environment variable NO_ALBUMENTATIONS_UPDATE to 1.
```

Set an environment variable in the terminal and the check goes away:

```bash
export NO_ALBUMENTATIONS_UPDATE=1
```
Or in a single command:

```bash
NO_ALBUMENTATIONS_UPDATE=1 CUDA_VISIBLE_DEVICES=1 \
python tools/train.py --model cflow --config src/anomalib/models/cflow/config.yaml
```
```text
OpenVINO is not installed. Please install OpenVINO to use OpenVINOInferencer.
To use wandb logger install it using `pip install wandb`
```

These only matter if you actually want OpenVINO export or the wandb logger; in that case run pip install openvino-dev wandb. Otherwise they are safe to ignore.
If your goal is: train CFlow on GPU 1 only (not GPU 0), single-card, with cleaner logs — do the following in order:
1. Edit the trainer section of the config:

```yaml
trainer:
  accelerator: gpu
  devices: 1        # was 2
  strategy: null    # or delete this line
```
2. Launch:

```bash
cd /media/jd/4997BB1603CFE2C4/lw/liwei23/anomalib
NO_ALBUMENTATIONS_UPDATE=1 CUDA_VISIBLE_DEVICES=1 \
python tools/train.py --model cflow --config src/anomalib/models/cflow/config.yaml
```

3. Run nvidia-smi to confirm: only your training process should be running, and only GPU 1 should be busy.
If you paste the updated YAML here, I can also double-check it for any other small things worth tidying up.
Epoch 0:  66%| | 125/190 [1:44:22<54:16, 50.10s/it, v_num=0, train_loss_step=3.55e+4]
/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/utilities/data.py:83: UserWarning: Trying to infer the batch_size from an ambiguous collection. The batch size we found is 3. To avoid any miscalculations, use self.log(..., batch_size=batch_size).
warning_cache.warn(
Epoch 0: 100%|██████████| 190/190 [1:48:23<00:00, 34.23s/it, v_num=0, train_loss_step=3.43e+3]
W1223 02:47:47.428338 140265947027264 torch/multiprocessing/spawn.py:146] Terminating process 2829716 via signal SIGTERM
100%|██████████| 64/64 [02:55<00:00, 2.75s/it]
Traceback (most recent call last):
File "tools/train.py", line 81, in <module>
train(args)
File "tools/train.py", line 66, in train
trainer.fit(model=model, datamodule=datamodule)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 608, in fit
call._call_and_handle_interrupt(
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 36, in _call_and_handle_interrupt
return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/multiprocessing.py", line 113, in launch
mp.start_processes(
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 238, in start_processes
while not context.join():
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 189, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 76, in _wrap
fn(i, *args)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/multiprocessing.py", line 139, in _wrapping_function
results = function(*args, **kwargs)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _fit_impl
self._run(model, ckpt_path=self.ckpt_path)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1112, in _run
results = self._run_stage()
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1191, in _run_stage
self._run_train()
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1214, in _run_train
self.fit_loop.run()
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 267, in advance
self._outputs = self.epoch_loop.run(self._data_fetcher)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
self.on_advance_end()
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 250, in on_advance_end
self._run_validation()
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 308, in _run_validation
self.val_loop.run()
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 206, in run
output = self.on_run_end()
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 180, in on_run_end
self._evaluation_epoch_end(self._outputs)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 288, in _evaluation_epoch_end
self.trainer._call_lightning_module_hook(hook_name, output_or_outputs)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1356, in _call_lightning_module_hook
output = fn(*args, **kwargs)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/anomalib/models/components/base/anomaly_module.py", line 145, in validation_epoch_end
self._compute_adaptive_threshold(outputs)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/anomalib/models/components/base/anomaly_module.py", line 162, in _compute_adaptive_threshold
self.image_threshold.compute()
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torchmetrics/metric.py", line 529, in wrapped_func
with self.sync_context(
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/contextlib.py", line 113, in __enter__
return next(self.gen)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torchmetrics/metric.py", line 500, in sync_context
self.sync(
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torchmetrics/metric.py", line 452, in sync
self._sync_dist(dist_sync_fn, process_group=process_group)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torchmetrics/metric.py", line 364, in _sync_dist
output_dict = apply_to_collection(
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torchmetrics/utilities/data.py", line 203, in apply_to_collection
return elem_type({k: apply_to_collection(v, dtype, function, *args, **kwargs) for k, v in data.items()})
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torchmetrics/utilities/data.py", line 203, in <dictcomp>
return elem_type({k: apply_to_collection(v, dtype, function, *args, **kwargs) for k, v in data.items()})
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torchmetrics/utilities/data.py", line 209, in apply_to_collection
return elem_type([apply_to_collection(d, dtype, function, *args, **kwargs) for d in data])
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torchmetrics/utilities/data.py", line 209, in <listcomp>
return elem_type([apply_to_collection(d, dtype, function, *args, **kwargs) for d in data])
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torchmetrics/utilities/data.py", line 199, in apply_to_collection
return function(data, *args, **kwargs)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torchmetrics/utilities/distributed.py", line 131, in gather_all_tensors
torch.distributed.all_gather(local_sizes, local_size, group=group)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
return func(*args, **kwargs)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 3108, in all_gather
work = group.allgather([tensor_list], [tensor])
RuntimeError: No backend type associated with device type cpu
What is the cause this time?
This error has nothing to do with your model or your data; it is a bug in how this version of anomalib implements its metrics under multi-GPU training.

The line it actually dies on:

```text
RuntimeError: No backend type associated with device type cpu
...
torch.distributed.all_gather(local_sizes, local_size, group=group)
```

Meaning: at all_gather time the tensor lives on the CPU, but the distributed backend is NCCL, which only handles CUDA tensors — hence "No backend type associated with device type cpu". The rest of your stack trace matches:

```text
... validation_epoch_end
    self._compute_adaptive_threshold(outputs)
... self.image_threshold.compute()
... gather_all_tensors(...)
... all_gather(...)
RuntimeError: No backend type associated with device type cpu
```
anomalib's AnomalyModule does this (visible in the docs/source) (Anomalib Documentation):

```python
def _compute_adaptive_threshold(self, outputs):
    self._collect_outputs(self.image_threshold, self.pixel_threshold, outputs)
    self.image_threshold.compute()
    ...

@staticmethod
def _collect_outputs(image_metric, pixel_metric, outputs):
    for output in outputs:
        image_metric.cpu()  # <-- forcibly moved to CPU
        image_metric.update(output["pred_scores"], output["label"].int())
        if "mask" in output.keys() and "anomaly_maps" in output.keys():
            pixel_metric.cpu()
            pixel_metric.update(output["anomaly_maps"], output["mask"].int())
```
In other words:

- the DDP backend is nccl;
- when computing the adaptive threshold, anomalib moves the image_threshold metric to the CPU via .cpu();
- torchmetrics then tries to all_gather that CPU tensor through NCCL, which fails with "No backend type associated with device type cpu".

This has already been reported on GitHub; the issue title is essentially "AnomalyScoreThreshold is incompatible with multi-GPU". (GitHub)
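The failure mode can be modelled without any GPUs. Below is a toy sketch (every name here is made up for illustration — none of these are torch or anomalib APIs): a fake "NCCL-style" gather that, like the real one, refuses anything not on a GPU, and a metric that gets moved to "cpu" before syncing:

```python
class FakeTensor:
    def __init__(self, value, device="cuda"):
        self.value, self.device = value, device

def nccl_all_gather(tensor):
    # Like the real NCCL backend: only CUDA tensors are supported.
    if tensor.device != "cuda":
        raise RuntimeError("No backend type associated with device type "
                           + tensor.device)
    return [tensor.value]  # single-process stand-in for the gathered list

class Metric:
    def __init__(self):
        self.state = FakeTensor(0.0)
    def cpu(self):
        self.state.device = "cpu"  # what anomalib's _collect_outputs does
        return self
    def compute(self):
        return nccl_all_gather(self.state)  # sync state before computing

m = Metric()
m.cpu()            # metric forced onto the CPU...
try:
    m.compute()    # ...so the NCCL-style gather raises
except RuntimeError as err:
    print(err)     # No backend type associated with device type cpu
```

Leave the metric on "cuda" (i.e. never call .cpu()) and compute() succeeds; that is exactly the fix the GitHub issue discusses.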
Even though you meant to use only GPU 1, this stack trace shows you are still running multi-process / multi-GPU:

```text
torch.multiprocessing.spawn.ProcessRaisedException
...
/pytorch_lightning/strategies/launchers/multiprocessing.py: mp.start_processes(...)
```

So the Trainer is still using DDP (spawned processes), i.e.:

- trainer.devices is still > 1 (it was 2 in your config);
- with accelerator: gpu, Lightning then defaults to a DDP / DDP-spawn strategy.

Any multi-GPU run with this version of anomalib will hit this bug in the adaptive-threshold computation.
You wanted single-GPU on GPU 1 anyway, so simply sidestep the multi-GPU pitfall and train on one card.

Find the trainer: section in your config and change these fields:

```yaml
trainer:
  accelerator: gpu
  devices: 1        # was 2; turns multi-GPU off entirely
  strategy: null    # or delete it and let Lightning pick
  # leave everything else unchanged
```

With devices: 1, Lightning no longer uses DDP, multiple processes, or distributed sync, so it never reaches the all_gather on that CPU metric.
Prepend the environment variable so only physical GPU 1 is exposed to the program:

```bash
cd /media/jd/4997BB1603CFE2C4/lw/liwei23/anomalib
NO_ALBUMENTATIONS_UPDATE=1 CUDA_VISIBLE_DEVICES=1 \
python tools/train.py --model cflow --config src/anomalib/models/cflow/config.yaml
```

With this:

- cuda:0 inside the process is actually your machine's GPU 1;
- with devices=1 there is no second process and no distributed metric sync;
- the RuntimeError: No backend type associated with device type cpu can no longer occur.

If you genuinely want multi-GPU later, there are a few more involved options (none of which you need right now):
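The CUDA_VISIBLE_DEVICES remapping is easy to misread, so here is a tiny illustration (the helper is hypothetical — CUDA performs this mapping internally): logical device indices inside the process are positions within the visible list, not physical IDs:

```python
import os

def logical_to_physical(logical_index: int) -> int:
    """Map a logical cuda:<n> index to a physical GPU id,
    following CUDA_VISIBLE_DEVICES semantics."""
    visible = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    ids = [int(x) for x in visible.split(",") if x.strip()]
    return ids[logical_index]

os.environ["CUDA_VISIBLE_DEVICES"] = "1"
print(logical_to_physical(0))  # cuda:0 in the process is physical GPU 1
```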
1. Upgrade anomalib to the latest version.
   Newer releases reworked the metrics and threshold implementations substantially, and multi-GPU support is better. (Anomalib Documentation)
2. Patch the source yourself: remove the .cpu() calls and use a custom all_gather.
   This is the hack discussed in the issue: keep the metric on the GPU, and when syncing, manually move tensors to cuda before gathering. (GitHub)
   It requires being comfortable with the anomalib code base, and you must verify carefully that the metrics are still correct.
3. Switch the distributed backend to gloo (which supports CPU tensors).
   In Lightning you can pass a custom strategy such as DDPStrategy(process_group_backend="gloo") so that CPU tensors can be all_gathered.
   But that usually means constructing the Trainer in Python code rather than pure YAML, which does not fit your current CLI + YAML workflow well.

Given that what you want right now is "get it running on GPU 1" rather than "debug multi-GPU", a plain single-card run is by far the least painful.
One more warning in your log:

```text
UserWarning: Trying to infer the `batch_size` from an ambiguous collection. The batch size we found is 3.
```

This is just Lightning guessing the batch_size for logging; it does not cause crashes and can be ignored for now. If you ever want to silence it, pass batch_size=... explicitly to self.log(...), but anomalib wraps that call itself, so leave it alone.

To recap:

- the crash is RuntimeError: No backend type associated with device type cpu, a multi-GPU metric bug;
- set trainer.devices: 1, not 2;
- launch with CUDA_VISIBLE_DEVICES=1 to expose only GPU 1.

Make those two changes and rerun; if a new error appears, paste the updated trainer config and the full stack trace and I'll keep untangling it.
I'm now using batch_size 8. The normal training set has 8000 images, there are 2028 anomalous images and 2000+ normal test images, image size 256. I have a single RTX 3090 (24 GB), but I keep getting: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.05 GiB. GPU 0 has a total capacity of 23.69 GiB of which 2.83 GiB is free. Including non-PyTorch memory, this process has 20.85 GiB memory in use. Of the allocated memory 18.15 GiB is allocated by PyTorch, and 2.38 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Testing DataLoader 0: 100%|██████████| 507/507 [37:17<00:00, 4.41s/it] — what should I do?
This time nothing in your code is wrong; the GPU genuinely runs out of memory (specifically under the combination "test stage + CFlow + segmentation metrics").

Unpack the error message first:

```text
CUDA out of memory. Tried to allocate 3.05 GiB.
GPU 0 total 23.69 GiB, free 2.83 GiB.
This process has 20.85 GiB in use.
18.15 GiB allocated by PyTorch, 2.38 GiB reserved but unallocated.
Testing DataLoader 0: 100%|██████████| 507/507 [...]
```

Key points: the failure happens during testing, with nearly the whole card already consumed. Combine that with the earlier warnings:

```text
Metric `ROC` will save all targets and predictions in buffer.
Metric `PrecisionRecallCurve` will save all targets and predictions in buffer.
```

plus the fact that you run CFlow with segmentation and a pile of image/pixel metrics, and the diagnosis is fairly clear: it is not one oversized forward pass; the test-stage metrics accumulate all predictions and labels in GPU memory until it finally blows up.
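To get a feel for the scale, a back-of-the-envelope estimate (dataset sizes are from your description; the float32 full-resolution buffer layout is an assumption) of what buffering every pixel-level prediction costs:

```python
# Rough size of buffering full-resolution anomaly maps for pixel metrics.
num_test_images = 2000          # ~2000 normal test images (anomalous ones add more)
h = w = 256                     # image / anomaly-map resolution
bytes_per_float32 = 4

preds_bytes = num_test_images * h * w * bytes_per_float32
gib = preds_bytes / 1024**3
print(f"predictions alone: {gib:.2f} GiB")           # ~0.49 GiB

# A curve metric keeps predictions AND targets, and several such metrics
# run at once (ROC, PrecisionRecallCurve, AUPRO, ...), so it multiplies:
print(f"preds+targets x 3 metrics: {gib * 2 * 3:.2f} GiB")
```

Against the 2.83 GiB that was left free, a few gigabytes of metric buffers is exactly the kind of thing that tips the run over.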
First, before training, open a terminal and run:

```bash
nvidia-smi
```

Make sure your python tools/train.py is the only thing using the GPU. If anything else is there (another Python process, a browser, an IDE, a leftover process from a previous run), kill it first; that alone can free several GB.
Your training and test batch sizes are separate:

```yaml
dataset:
  train_batch_size: 8   # you said this is already 8
  eval_batch_size: 8    # this is the one to shrink separately
```

Set eval_batch_size very conservatively, e.g. 1 or 2:

```yaml
dataset:
  train_batch_size: 8   # keep for now
  eval_batch_size: 2    # try 2 first, then 1 if needed
```

Since the error happens at

```text
Testing DataLoader 0: 100%|██████████| 507/507 [...]
```

it is the test stage that fails, so start with eval_batch_size and leave the training batch size alone.
Your config currently has:

```yaml
trainer:
  precision: 32
```

Switch to mixed precision:

```yaml
trainer:
  precision: 16   # mixed precision; cuts memory substantially
```

The 3090 is very well suited to mixed precision; both speed and memory improve noticeably. (precision: 16 requires a healthy driver + CUDA setup; since you could already train normally, that should be fine.)
From your earlier log, the CFlow model is close to 1 GB of parameters:

```text
Total estimated model params size (MB): 946.249
```

That is mostly because wide_resnet50_2 is big. Under model: you can do a few things.

Swap wide_resnet50_2 for resnet18 or resnet34 (if you are not chasing paper-identical numbers):

```yaml
model:
  name: cflow
  backbone: resnet34   # or resnet18; both far lighter than wide_resnet50_2
  pre_trained: true
  layers:
    - layer2
    - layer3            # start with just two layers; friendlier on memory
  # leave the other parameters alone for now
```
Your current flow settings are:

```yaml
condition_vector: 128
coupling_blocks: 8
fiber_batch_size: 64
```

You can trim them a bit:

```yaml
condition_vector: 64   # halved
coupling_blocks: 4     # half as many flow blocks; saves a fair amount
fiber_batch_size: 32   # smaller fiber batches cut memory noticeably, at some speed cost
```

All of these shrink the flow's internal dimensions and intermediate feature sizes, which is very memory-friendly.
Your metrics config is:

```yaml
metrics:
  image:
    - F1Score
    - AUROC
  pixel:
    - F1Score
    - AUROC
    - AUPRO
  threshold:
    method: adaptive
```

These metrics (especially AUROC and AUPRO) store every image's predictions and labels before computing the curve in one pass, which is extremely memory-hungry. Trim them to the ones you actually care about:

```yaml
metrics:
  image:
    - AUROC        # just one image-level metric
  pixel:
    - AUROC        # one pixel-level metric (or keep only AUPRO)
  threshold:
    method: adaptive
```

If for now you only want the run to finish and do not care about the full metric suite, go further:

```yaml
metrics:
  image: []
  pixel: []
  threshold:
    method: manual
    manual_image: 0.5
    manual_pixel: 0.5
```

A manual threshold avoids buffering a pile of predictions for the adaptive estimate, which saves memory, but the threshold is hand-picked and may be worse than adaptive.
Your visualization config is:

```yaml
visualization:
  show_images: False
  save_images: True
  log_images: True
  image_save_path: ./results/1217cflow
  mode: full
```

During testing, full mode with both save and log can generate heaps of anomaly maps, masks and overlays, which may sit in GPU or CPU memory for a while. Simplify it temporarily:

```yaml
visualization:
  show_images: False
  save_images: False   # turn saving off until the run is stable
  log_images: False    # stop writing images to the logger for now
  mode: simple         # full -> simple
```

Once OOM is gone, re-enable the parts you need one by one.
The error message itself suggests:

```text
If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
```

Add it to the launch command:

```bash
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
CUDA_VISIBLE_DEVICES=0 \
python tools/train.py --model cflow --config src/anomalib/models/cflow/config.yaml
```

This mainly mitigates fragmentation rather than fixing the root cause, but it combines well with the other changes.
If you don't want to change too much at once, here is a minimal-change, high-success-rate combination you can copy verbatim:

```yaml
dataset:
  train_batch_size: 8
  eval_batch_size: 2          # the main change

model:
  name: cflow
  backbone: resnet34          # was wide_resnet50_2
  pre_trained: true
  layers:
    - layer2
    - layer3
  decoder: freia-cflow
  condition_vector: 64        # was 128
  coupling_blocks: 4          # was 8
  fiber_batch_size: 32        # was 64
  lr: 0.0001
  normalization_method: min_max

metrics:
  image:
    - AUROC
  pixel:
    - AUROC
  threshold:
    method: adaptive

visualization:
  show_images: False
  save_images: False          # off for now
  log_images: False           # off for now
  image_save_path: ./results/1217cflow
  mode: simple                # was full

trainer:
  accelerator: gpu
  devices: 1                  # single GPU
  precision: 16               # mixed precision
  # everything else unchanged
```

```bash
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
CUDA_VISIBLE_DEVICES=0 \
NO_ALBUMENTATIONS_UPDATE=1 \
python tools/train.py --model cflow --config src/anomalib/models/cflow/config.yaml
```

(If you still want GPU 1, just change CUDA_VISIBLE_DEVICES=0 to 1.)

If it still OOMs after these changes, paste the latest error plus the updated config (just the dataset / model / metrics / trainer sections) and I can tighten memory usage further.
If I change imgsize to 512, does that increase or decrease the pressure?
It increases it — substantially, not just a little 🥲

Intuitively:

- image_size = 256 vs image_size = 512: the side length doubles, so the pixel count (and hence activation memory and compute) goes up roughly 4×.

If you already OOM at 256, switching to 512 is almost guaranteed to blow up even sooner; it never reduces pressure.

Memory scales roughly as: batch_size × H × W × (a factor depending on channels and layers). So 256 at batch=8 is about on par with 512 at batch≈2. But you already OOM in the test stage at 256 with batch=8, so going to 512 only makes things harder to rescue — it is not a fix.

If anything, go the other direction to relieve memory, e.g. 256 → 224 or 256 → 192, combined with a smaller eval_batch_size (even 1) and precision: 16.

In one sentence: 256 → 512 is strictly "heavier", never "lighter"; given your current OOM, this change is not advisable.
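The scaling argument above in numbers (only the proportionality — constant factors for channels, layers, etc. are omitted, so this is not a real memory model):

```python
def relative_activation_cost(batch, side):
    """Activation memory is proportional to batch * H * W (channels held constant)."""
    return batch * side * side

base = relative_activation_cost(8, 256)
print(relative_activation_cost(8, 512) / base)   # 4.0: doubling the side quadruples memory
print(relative_activation_cost(2, 512) / base)   # 1.0: 512 at batch 2 ~ 256 at batch 8
```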
I went to train fastflow and changed the input to 512; after a few epochs it fails as below. Why? I have already reduced the anomalous dataset to 200+ images: Epoch 7: 100%|█| 543/543 [03:16<00:00, 2.77it/s, loss=-9.39e+06, v_num=1, train_loss_step=-9.18e+6, pixel_F1Score=0.0435, pixel_AUROC=0.836, pixel_AUPRO=0.450, train_loss_epoch=-9.58 — Given normalized_shape=[64, 128, 128], expected input with shape [*, 64, 128, 128], but got input of size [1, 64, 64, 64]
Error occurs, No graph saved
Traceback (most recent call last):
File "tools/train.py", line 81, in <module>
train(args)
File "tools/train.py", line 66, in train
trainer.fit(model=model, datamodule=datamodule)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 608, in fit
call._call_and_handle_interrupt(
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _fit_impl
self._run(model, ckpt_path=self.ckpt_path)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1112, in _run
results = self._run_stage()
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1191, in _run_stage
self._run_train()
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1214, in _run_train
self.fit_loop.run()
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 206, in run
output = self.on_run_end()
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 323, in on_run_end
self.trainer._call_callback_hooks("on_train_end")
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1394, in _call_callback_hooks
fn(self, self.lightning_module, *args, **kwargs)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/anomalib/utils/callbacks/graph.py", line 44, in on_train_end
logger.log_graph(pl_module, input_array=torch.ones((1, 3, 256, 256)))
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/lightning_utilities/core/rank_zero.py", line 42, in wrapped_fn
return fn(*args, **kwargs)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/loggers/tensorboard.py", line 205, in log_graph
self.experiment.add_graph(model, input_array)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/utils/tensorboard/writer.py", line 841, in add_graph
graph(model, input_to_model, verbose, use_strict_trace)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/utils/tensorboard/_pytorch_graph.py", line 338, in graph
raise e
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/utils/tensorboard/_pytorch_graph.py", line 331, in graph
trace = torch.jit.trace(model, args, strict=False)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/jit/_trace.py", line 1000, in trace
traced_func = _trace_impl(
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/jit/_trace.py", line 695, in _trace_impl
return trace_module(
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/jit/_trace.py", line 1275, in trace_module
module._c._create_method_from_trace(
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1543, in _slow_forward
result = self.forward(*input, **kwargs)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/anomalib/models/components/base/anomaly_module.py", line 67, in forward
return self.model(batch)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1543, in _slow_forward
result = self.forward(*input, **kwargs)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/anomalib/models/fastflow/torch_model.py", line 190, in forward
features = self._get_cnn_features(input_tensor)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/anomalib/models/fastflow/torch_model.py", line 219, in _get_cnn_features
features = [self.norms[i](feature) for i, feature in enumerate(features)]
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/anomalib/models/fastflow/torch_model.py", line 219, in <listcomp>
features = [self.norms[i](feature) for i, feature in enumerate(features)]
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1543, in _slow_forward
result = self.forward(*input, **kwargs)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/nn/modules/normalization.py", line 202, in forward
return F.layer_norm(
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/nn/functional.py", line 2576, in layer_norm
return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)
RuntimeError: Given normalized_shape=[64, 128, 128], expected input with shape [*, 64, 128, 128], but got input of size [1, 64, 64, 64]
Epoch 7: 100%|██████████| 543/543 [03:18<00:00, 2.74it/s, loss=-9.39e+06, v_num=1, train_loss_step=-9.18e+6, pixel_F1Score=0.0435, pixel_AUROC=0.836, pixel_AUPRO=0.450, train_loss_epoch=-9.58e+6]
This time it is not a data-volume problem (cutting the anomalous samples down to ~200 images is fine). After you changed the input size to 512, FastFlow built its internal LayerNorm shapes for 512; but when anomalib logs the TensorBoard computational graph at on_train_end, it still traces the model with a fixed 256x256 dummy input, so the shapes no longer match and you get:
expected normalized_shape=[64, 128, 128] ... but got ... [1, 64, 64, 64]
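The mismatch is easy to reproduce in isolation (plain PyTorch, not anomalib code): nn.LayerNorm initialized with a fixed normalized_shape only accepts inputs whose trailing dimensions match it exactly:

```python
import torch
from torch import nn

# LayerNorm built for a 512x512 input: a /4-downsampled feature map is 128x128
norm = nn.LayerNorm([64, 128, 128])

# Works: feature map shape produced by a 512 input
out = norm(torch.ones(1, 64, 128, 128))

# Fails: feature map from the 256x256 trace input (/4 -> 64x64)
try:
    norm(torch.ones(1, 64, 64, 64))
    failed = False
except RuntimeError:
    failed = True
```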
Your stack trace also shows exactly where it happens:
- anomalib/utils/callbacks/graph.py: logger.log_graph(pl_module, input_array=torch.ones((1, 3, 256, 256)))
- fastflow/torch_model.py: self.norms[i](feature) → the LayerNorm complains about the shape
To decode the numbers: with a 512 input, a /4-downsampled feature map is 128x128, so the LayerNorm was initialized as [64, 128, 128]; the 256x256 trace input produces 64x64 feature maps, and the LayerNorm blows up when handed [1, 64, 64, 64]. The simplest fix: your config currently should have logging.log_graph: true; change it to:
```yaml
logging:
  logger: [tensorboard, csv]
  log_graph: false
```
Training and testing then run as before; you just stop writing the computational graph to TensorBoard, so on_train_end no longer crashes.
Alternatively, edit this file in your environment (your traceback already points at it):
.../site-packages/anomalib/utils/callbacks/graph.py
and change the hardcoded 256 to 512 (or read image_size from the config). The crudest change that works:
```python
logger.log_graph(pl_module, input_array=torch.ones((1, 3, 512, 512)))
```
If you expect to switch to other sizes later, read image_size from the config / datamodule instead of hardcoding it.
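A minimal sketch of that idea, assuming an anomalib-style config object that exposes dataset.image_size as either an int or an (H, W) pair (the helper name and the attribute path are illustrative; adjust them to your actual config):

```python
import types

import torch


def make_graph_input(config):
    """Build the dummy input for log_graph from the config instead of a hardcoded 256."""
    size = config.dataset.image_size  # int or (H, W) pair, depending on the config
    if isinstance(size, int):
        size = (size, size)
    return torch.ones((1, 3, *size))


# Stand-in for the real config object, just for demonstration
cfg = types.SimpleNamespace(dataset=types.SimpleNamespace(image_size=512))
dummy = make_graph_input(cfg)
```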
If you just want to get it running quickly, changing dataset.image_size back to 256 also works.
This crash happened after Epoch 7 ended, in the train_end graph-writing callback, so the model had already trained for several epochs and has probably saved a checkpoint; only the final graph write failed and killed the process. First check whether there are ckpt files under your results/... directory.
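A quick way to check (the exact results/ layout depends on your model, category, and run version, so adjust the path):

```shell
# List any checkpoints saved before the crash; 2>/dev/null keeps find
# quiet if the directory does not exist yet.
find results -name '*.ckpt' 2>/dev/null | sort
```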
If you paste the logging: section of your current FastFlow YAML (especially log_graph), I can point you to the exact line to change.