anomalib notes
Created: December 10, 2025
Answered using GPT-5.1 Thinking (Chat01)
To use wandb logger install it using pip install wandb
PrecisionRecallCurve will save all targets and predictions in buffer. For large datasets this may lead to large memory footprint.
Trainer(limit_train_batches=1.0) was configured so 100% of the batches per epoch will be used.
Trainer(limit_val_batches=1.0) was configured so 100% of the batches will be used.
Trainer(limit_test_batches=1.0) was configured so 100% of the batches will be used.
Trainer(limit_predict_batches=1.0) was configured so 100% of the batches will be used.
Trainer(val_check_interval=1.0) was configured so validation will run at the end of the training epoch.
You are using a CUDA device that has Tensor Cores. To properly utilize them, you should set torch.set_float32_matmul_precision('medium' | 'high') which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
ROC will save all targets and predictions in buffer. For large datasets this may lead to large memory footprint.
36.3 M Trainable params
4.0 M Non-trainable params
40.3 M Total params
161.195 Total estimated model params size (MB)
/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/loggers/tensorboard.py:191: UserWarning: Could not log computational graph to TensorBoard: The model.example_input_array attribute is not set or input_array was not given.
rank_zero_warn(
/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py:1609: PossibleUserWarning: The number of training batches (18) is smaller than the logging interval Trainer(log_every_n_steps=50). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.
rank_zero_warn(
Training: 0it [00:00, ?it/s]2025-12-10 10:50:34,277 - anomalib.data.utils.download - INFO - Downloading the vq_model_pretrained_128_4096.pckl dataset.
vq_model_pretrained_128_4096.pckl: 1%|█▏ | 172k/15.0M [14:25<20:47:08, 199B/s]
/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py:48: UserWarning: Detected KeyboardInterrupt, attempting graceful shutdown...
rank_zero_warn("Detected KeyboardInterrupt, attempting graceful shutdown...")
2025-12-10 11:05:00,173 - anomalib - INFO - Loading the best model weights.
2025-12-10 11:05:00,174 - anomalib - INFO - Testing the model.
2025-12-10 11:05:00,180 - pytorch_lightning.utilities.rank_zero - INFO - You are using a CUDA device ('NVIDIA GeForce RTX 3090') that has Tensor Cores. To properly utilize them, you should set torch.set_float32_matmul_precision('medium' | 'high') which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
2025-12-10 11:05:00,181 - anomalib.utils.callbacks.model_loader - INFO - Loading the model from
/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/anomalib/utils/callbacks/model_loader.py:32: FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
pl_module.load_state_dict(torch.load(self.weights_path, map_location=pl_module.device)["state_dict"])
^CTraceback (most recent call last):
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 842, in _test_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1051, in _run
self._call_setup_hook() # allow user to setup lightning_module in accelerator environment
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1299, in _call_setup_hook
self._call_callback_hooks("setup", stage=fn)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1394, in _call_callback_hooks
fn(self, self.lightning_module, *args, **kwargs)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/anomalib/utils/callbacks/model_loader.py", line 32, in setup
pl_module.load_state_dict(torch.load(self.weights_path, map_location=pl_module.device)["state_dict"])
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/serialization.py", line 1065, in load
with _open_file_like(f, 'rb') as opened_file:
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/serialization.py", line 468, in _open_file_like
return _open_file(name_or_buffer, mode)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/serialization.py", line 449, in init
super().init(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: ''
During handling of the above exception, another exception occurred:
36.3 M Trainable params
4.0 M Non-trainable params
40.3 M Total params
161.195 Total estimated model params size (MB)
/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/loggers/tensorboard.py:191: UserWarning: Could not log computational graph to TensorBoard: The model.example_input_array attribute is not set or input_array was not given.
rank_zero_warn(
/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py:1609: PossibleUserWarning: The number of training batches (18) is smaller than the logging interval Trainer(log_every_n_steps=50). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.
rank_zero_warn(
Training: 0it [00:00, ?it/s]2025-12-10 11:07:35,317 - anomalib.data.utils.download - INFO - Existing dataset archive found. Skipping download stage.
2025-12-10 11:07:35,317 - anomalib.data.utils.download - INFO - Extracting dataset into root folder.
Traceback (most recent call last):
File "tools/train.py", line 81, in <module>
train(args)
File "tools/train.py", line 66, in train
trainer.fit(model=model, datamodule=datamodule)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 608, in fit
call._call_and_handle_interrupt(
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _fit_impl
self._run(model, ckpt_path=self.ckpt_path)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1112, in _run
results = self._run_stage()
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1191, in _run_stage
self._run_train()
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1214, in _run_train
self.fit_loop.run()
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 194, in run
self.on_run_start(*args, **kwargs)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 218, in on_run_start
self.trainer._call_lightning_module_hook("on_train_start")
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1356, in _call_lightning_module_hook
output = fn(*args, **kwargs)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/anomalib/models/dsr/lightning_model.py", line 93, in on_train_start
ckpt: Path = self.prepare_pretrained_model()
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/anomalib/models/dsr/lightning_model.py", line 60, in prepare_pretrained_model
download_and_extract(pretrained_models_dir, WEIGHTS_DOWNLOAD_INFO)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/anomalib/data/utils/download.py", line 312, in download_and_extract
extract(downloaded_file_path, root)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/anomalib/data/utils/download.py", line 265, in extract
with ZipFile(file_name, "r") as zip_file:
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/zipfile.py", line 1271, in init
self._RealGetContents()
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/zipfile.py", line 1338, in _RealGetContents
raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file
Training: 0it [00:00, ?it/s]
What is going on here? Isn't this zip normally placed under /home/jd/.cache/torch/hub/checkpoints after download? My connection is slow, so I downloaded the file myself and put it there, but it still fails and my downloaded file isn't used.
The root cause is not DSR itself: the pretrained-weights archive got corrupted, and anomalib keeps trying to extract that broken file, which is why you see:
zipfile.BadZipFile: File is not a zip file
Based on your logs, the timeline is roughly this.
First run:
Downloading the vq_model_pretrained_128_4096.pckl dataset.
vq_model_pretrained_128_4096.pckl: 1%|█▏ | 172k/15.0M [...]
...
UserWarning: Detected KeyboardInterrupt, attempting graceful shutdown...
In other words: DSR was downloading the vq_model_pretrained_128_4096 pretrained model when you pressed Ctrl+C.
→ Result: a partially downloaded zip/data file was left on disk.
Every run after that:
Existing dataset archive found. Skipping download stage.
Extracting dataset into root folder.
...
zipfile.BadZipFile: File is not a zip file
That is: download_and_extract → extract(...) tries to open the file with ZipFile(...), and gets BadZipFile: File is not a zip file.
💡 So the chain of events is:
first download interrupted → corrupted zip left behind → every later run skips the download and extracts the broken file → BadZipFile.
Why doesn't /home/jd/.cache/torch/hub/checkpoints work? You wrote:
Isn't this zip normally placed under /home/jd/.cache/torch/hub/checkpoints after download? ... it doesn't use the file I downloaded either.
The key point is that these weights are not fetched via torch.hub. Anomalib uses its own helper:
anomalib.data.utils.download.download_and_extract(root, info)
which has its own cache path and naming rules and has nothing to do with ~/.cache/torch/hub/checkpoints. (Anomalib Documentation) The file goes into the root directory computed in Dsr.prepare_pretrained_model() (pretrained_models_dir), and the filename comes from the filename field of WEIGHTS_DOWNLOAD_INFO, or the last segment of the URL. (Anomalib Documentation) So:
a zip you manually place in ~/.cache/torch/hub/checkpoints is somewhere anomalib never looks, which is exactly why it "doesn't use the file you downloaded".
You need to find the archive anomalib thinks already exists, and delete it.
In a terminal, first search globally for the filename (note the exact name from your log):
find ~ -iname "vq_model_pretrained_128_4096*"
It commonly turns up in directories like these (trust the find output over this list):
~/.cache/anomalib/...
<your anomalib project dir>/pretrained_models/...
paths like results/1207dsr/.../pretrained/...
If you see a zip of a few tens or hundreds of KB there (clearly not the full 15 MB file), that is the half-finished leftover from the interrupted download.
Delete the offending archive(s), for example:
rm /path/to/vq_model_pretrained_128_4096*.zip
# or, depending on what find returned:
rm /path/to/vq_model_pretrained_128_4096*
If a same-named extracted directory sits next to it but is empty, delete that too, so the state is clean.
Then run again:
python tools/train.py --model dsr --config src/anomalib/models/dsr/config.yaml
This time you should again see:
Downloading the vq_model_pretrained_128_4096.pckl dataset.
Do not Ctrl+C the download this time; once the pretrained weights finish downloading and extracting, training will no longer trigger the BadZipFile.
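Before rerunning, a suspect archive can also be checked directly with Python's standard zipfile module (a minimal sketch; the path you pass is whatever your find command turned up):

```python
import zipfile
from pathlib import Path

def check_archive(path: str) -> str:
    """Classify a downloaded archive: missing, not a zip, corrupt member, or ok."""
    p = Path(path)
    if not p.exists():
        return "missing"
    # is_zipfile only checks the zip signature; a truncated download usually fails here.
    if not zipfile.is_zipfile(p):
        return "not a zip (truncated or wrong format)"
    with zipfile.ZipFile(p) as zf:
        bad = zf.testzip()  # CRC-checks every member; returns first bad name or None
    return "ok" if bad is None else f"corrupt member: {bad}"
```

A partially downloaded file typically fails the is_zipfile check already, which matches the BadZipFile you saw.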
If you instead want to use a file you downloaded manually, there are roughly two approaches.
Open src/anomalib/models/dsr/lightning_model.py and find prepare_pretrained_model() and WEIGHTS_DOWNLOAD_INFO:
how pretrained_models_dir is constructed (what the root directory is);
what WEIGHTS_DOWNLOAD_INFO.filename is (or, if filename is not set, the last segment of the url).
Suppose you see something like:
pretrained_models_dir = Path.home() / ".cache" / "anomalib" / "dsr"
WEIGHTS_DOWNLOAD_INFO = DownloadInfo(
    name="vq_model_pretrained_128_4096.pckl",
    url="https://xxx/vq_model_pretrained_128_4096.zip",
    hashsum="xxxx",
)
Then download_and_extract will download the zip to:
~/.cache/anomalib/dsr/vq_model_pretrained_128_4096.zip
and extract it there.
What you should do:
mkdir -p ~/.cache/anomalib/dsr
cp /path/to/your-downloaded.zip ~/.cache/anomalib/dsr/vq_model_pretrained_128_4096.zip
On the next training run the log will show Existing dataset archive found. Skipping download stage., and it will extract the zip you placed there; if the archive is complete and valid, the error is gone.
Note: if what you downloaded is a .pckl rather than a .zip, and extract then tries to unzip it, BadZipFile is inevitable, so both the format and the filename must match WEIGHTS_DOWNLOAD_INFO.
If you already have the raw vq_model_pretrained_128_4096.pckl from the original DSR repository (uncompressed), you can also bypass download_and_extract entirely, e.g. (sketch):
have Dsr.prepare_pretrained_model() return the path of your .pckl directly:
def prepare_pretrained_model(self) -> Path:
    return Path("/your/own/path/vq_model_pretrained_128_4096.pckl")
or short-circuit the logic before the download_and_extract call so it skips extraction and just uses the existing file. This requires a small source edit; it works if you don't mind, but note it may be overwritten when you later upgrade anomalib.
About this part of the first run's log:
FileNotFoundError: [Errno 2] No such file or directory: ''
...
pl_module.load_state_dict(torch.load(self.weights_path, ... )["state_dict"])
That happened because you pressed Ctrl+C before training finished, so no checkpoint had been saved yet,
but the Lightning ModelLoader callback still tried to load the "best model weights"; weights_path was an empty string, hence the error.
This is not the root problem: once a training run completes normally and produces a *.ckpt file, this error disappears on its own.
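For what it's worth, the empty weights_path reproduces exactly this traceback: torch.load("") has to open the file first, and opening an empty path raises the same FileNotFoundError (plain-Python sketch, no torch needed):

```python
# The ModelLoader callback effectively does torch.load(self.weights_path);
# with weights_path == "" that boils down to opening an empty path:
try:
    open("", "rb")
except FileNotFoundError as err:
    print(err)  # [Errno 2] No such file or directory: ''
```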
In summary: the crash is the BadZipFile caused by the corrupted archive; /home/jd/.cache/torch/hub/checkpoints is torch.hub's location, this DSR code never looks there, so a file you put there is never used. Start with step 3 above: find and delete the bad zip and rerun. If it still fails, paste the paths that find returns and I can help decide what to delete or where the file belongs.
How do I solve this one: 2025-12-13 14:44:03,503 - anomalib.models.efficient_ad.lightning_model - INFO - Load pretrained teacher model from pre_trained/efficientad_pretrained_weights/pretrained_teacher_small.pth
/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/anomalib/models/efficient_ad/lightning_model.py:105: FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
self.model.teacher.load_state_dict(torch.load(teacher_path, map_location=torch.device(self.device)))
Traceback (most recent call last):
File "tools/train.py", line 81, in <module>
train(args)
File "tools/train.py", line 60, in train
model = get_model(config)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/anomalib/models/init.py", line 136, in get_model
model = getattr(module, f"{convert_snake_to_pascal_case(config.model.name)}Lightning")(config)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/anomalib/models/efficient_ad/lightning_model.py", line 278, in init
super().init(
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/anomalib/models/efficient_ad/lightning_model.py", line 94, in init
self.prepare_pretrained_model()
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/anomalib/models/efficient_ad/lightning_model.py", line 105, in prepare_pretrained_model
self.model.teacher.load_state_dict(torch.load(teacher_path, map_location=torch.device(self.device)))
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/serialization.py", line 1065, in load
with _open_file_like(f, 'rb') as opened_file:
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/serialization.py", line 468, in _open_file_like
return _open_file(name_or_buffer, mode)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/serialization.py", line 449, in init
super().init(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: 'pre_trained/efficientad_pretrained_weights/pretrained_teacher_small.pth'
Where should I put the file I downloaded?
This error simply means the code cannot find the pretrained teacher weights at this path:
pre_trained/efficientad_pretrained_weights/pretrained_teacher_small.pth
and indeed you haven't yet put the file where it is looking.
In EfficientAd.prepare_pretrained_model(), the path is assembled like this: (GitHub)
pretrained_models_dir = Path("./pre_trained/")
if not (pretrained_models_dir / "efficientad_pretrained_weights").is_dir():
    download_and_extract(pretrained_models_dir, WEIGHTS_DOWNLOAD_INFO)
teacher_path = (
    pretrained_models_dir
    / "efficientad_pretrained_weights"
    / f"pretrained_teacher_{self.model_size}.pth"
)
logger.info(f"Load pretrained teacher model from {teacher_path}")
self.model.teacher.load_state_dict(torch.load(teacher_path, map_location=torch.device(self.device)))
Combined with the directory you run the command from:
/media/jd/4997BB1603CFE2C4/lw/liwei23/anomalib$ python tools/train.py ...
the actual lookup path is:
/media/jd/4997BB1603CFE2C4/lw/liwei23/anomalib/pre_trained/efficientad_pretrained_weights/pretrained_teacher_small.pth
✅ So: your downloaded pretrained_teacher_small.pth must go into that path under the anomalib repository root, not into ~/.cache/torch/hub/checkpoints or anywhere similar.
Suppose you have already downloaded pretrained_teacher_small.pth somewhere, say /home/jd/Downloads/pretrained_teacher_small.pth.
In the anomalib root directory (the one where you run python tools/train.py), execute:
cd /media/jd/4997BB1603CFE2C4/lw/liwei23/anomalib
# 1. Create the directory structure
mkdir -p pre_trained/efficientad_pretrained_weights
# 2. Copy the weights in (the filename must match exactly)
cp /home/jd/Downloads/pretrained_teacher_small.pth \
   pre_trained/efficientad_pretrained_weights/pretrained_teacher_small.pth
# 3. Check
ls -lh pre_trained/efficientad_pretrained_weights
You should see something like:
pretrained_teacher_small.pth
Then run:
python tools/train.py --model efficient_ad --config <your_efficient_ad_config>.yaml
and the FileNotFoundError will be gone.
The official EfficientAD pretrained weights are usually packaged as a zip (e.g. efficientad_pretrained_weights.zip); download_and_extract exists precisely to unzip it and produce the efficientad_pretrained_weights directory. (GitHub)
If what you have is the zip:
cd /media/jd/4997BB1603CFE2C4/lw/liwei23/anomalib
mkdir -p pre_trained
unzip /home/jd/Downloads/efficientad_pretrained_weights.zip -d pre_trained
After extraction the directory structure should be:
pre_trained/
  efficientad_pretrained_weights/
    pretrained_teacher_small.pth
    (possibly other files)
As long as pretrained_teacher_small.pth is there, you are done.
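A quick way to confirm the layout before rerunning is to rebuild the same path the code does (a sketch mirroring the snippet above; model_size is assumed to be "small" here):

```python
from pathlib import Path

def teacher_weights_path(repo_root: str, model_size: str = "small") -> Path:
    """Mirror EfficientAd.prepare_pretrained_model()'s path construction."""
    pretrained_models_dir = Path(repo_root) / "pre_trained"
    return (
        pretrained_models_dir
        / "efficientad_pretrained_weights"
        / f"pretrained_teacher_{model_size}.pth"
    )

# Check whether the file is where training will look for it.
path = teacher_weights_path("/media/jd/4997BB1603CFE2C4/lw/liwei23/anomalib")
print(path, "exists" if path.exists() else "MISSING")
```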
The log also contains:
FutureWarning: You are using `torch.load` with `weights_only=False` ...
That is only a heads-up about a future PyTorch behavior change, not an error; it won't make training fail and can be ignored for now. What actually crashes the program is the FileNotFoundError.
You asked: "Where should I put the file I downloaded?"
Answer:
Put pretrained_teacher_small.pth under the anomalib root directory you run python tools/train.py from, at:
pre_trained/efficientad_pretrained_weights/pretrained_teacher_small.pth
The full absolute path is roughly:
/media/jd/4997BB1603CFE2C4/lw/liwei23/anomalib/pre_trained/efficientad_pretrained_weights/pretrained_teacher_small.pth
Run the mkdir + cp commands above once and try again. If you paste the ls output of the directory tree, I can confirm whether everything is in the right place.
What is this problem: Traceback (most recent call last):
File "tools/train.py", line 81, in <module>
train(args)
File "tools/train.py", line 76, in train
trainer.test(model=model, datamodule=datamodule)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 794, in test
return call._call_and_handle_interrupt(
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 842, in _test_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1112, in _run
results = self._run_stage()
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1188, in _run_stage
return self._run_evaluate()
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1228, in _run_evaluate
eval_loop_results = self._evaluation_loop.run()
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 152, in advance
dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 143, in advance
self._on_evaluation_batch_end(output, **kwargs)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 275, in _on_evaluation_batch_end
self.trainer._call_callback_hooks(hook_name, output, *kwargs.values())
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1394, in _call_callback_hooks
fn(self, self.lightning_module, *args, **kwargs)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/anomalib/utils/callbacks/visualizer/visualizer_image.py", line 84, in on_test_batch_end
for i, image in enumerate(self.visualizer.visualize_batch(outputs)):
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/anomalib/post_processing/visualizer.py", line 122, in visualize_batch
yield self.visualize_image(image_result)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/anomalib/post_processing/visualizer.py", line 134, in visualize_image
return self._visualize_full(image_result)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/anomalib/post_processing/visualizer.py", line 165, in _visualize_full
assert image_result.pred_mask is not None
AssertionError
Does this mean that if the task in the anomalib model's config is classification, it cannot do segmentation?
The essence of this error: the visualization callback is drawing images for a "segmentation" task, but the model never produced a pixel-level mask pred_mask, so the assert blows up.
The key line is visible in the stack (this is how the official source reads): (Anomalib Documentation)
# anomalib.post_processing.visualizer.Visualizer._visualize_full
if self.task == "segmentation":
    assert image_result.pred_mask is not None
    ...
In other words:
ImageVisualizerCallback is currently configured with task="segmentation";
image_result.pred_mask is None (the model/pipeline produced no pixel-level prediction mask);
assert image_result.pred_mask is not None fails and raises AssertionError.
You asked: does a config task of classification mean it can't do segmentation?
More precisely:
the task in config.yaml only tells the data, visualization, and post-processing whether to follow "classification" or "segmentation" logic:
task: classification → only image-level scores/labels are needed; pred_mask is not required;
task: segmentation → assumes the model provides an anomaly map plus pred_mask, and visualization draws heatmaps and masks.
Writing task: segmentation does not magically turn a classification-only model into a segmentation model; the model has no branch producing pixel-level masks, so pred_mask stays None and visualization is guaranteed to crash.
So this error is a configuration mismatch: visualization believes it is a segmentation task, while the actual outputs contain only classification information.
What you do next depends on which of two things you want: just classification, or actual segmentation results. The two cases are handled differently.
If you only want classification, make the whole pipeline follow classification instead of segmentation.
Check/modify these places (rough guide; exact field names may vary slightly by version):
dataset:
  task: classification
Or, if you use the Folder datamodule through code/API, pass task="classification" there as well. (Anomalib Documentation)
Find the ImageVisualizerCallback / image_visualizer section of your config and set its task to classification too, for example:
callbacks:
  image_visualizer:
    task: classification
    mode: full  # or 'simple'
This makes the Visualizer take the classification branch of _visualize_full and never touch pred_mask. (Anomalib Documentation)
The bluntest option is to disable the callback altogether: delete/comment out image_visualizer in the config, or set enable: false / save_images: false if your version has such a switch.
In short: classification only → set task to classification everywhere; don't let the visualizer believe it is segmentation, and the assert never fires.
If you actually want segmentation, confirm two things:
The model itself supports segmentation.
A model whose Model Type is Classification (e.g. GANomaly) is inherently image-level and has no pred_mask output; merely writing task: segmentation in the config will never produce masks. (GitHub)
The data and config are set up for a segmentation task.
dataset.task and the callbacks' task must consistently be "segmentation", so the PostProcessor generates pred_mask and the Visualizer can draw it.
If you take a classification-type model (such as GANomaly) and force task: segmentation, there will never be a pred_mask, and it will crash on this assert forever.
To recap: the visualizer requires pred_mask to be non-None, but your model's output has no pred_mask, hence the AssertionError. Either set everything to classification (or disable the image visualizer), or pick a genuinely segmentation-capable model; don't expect the task field alone to make a classification model output masks. If you paste the exact model name plus a short slice of your config.yaml (the dataset and callbacks sections), I can write a minimal set of changes for your version that runs without this error.
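The failure mode reduces to a tiny sketch of the visualizer's branching (illustrative only, not the real class):

```python
def visualize(task: str, pred_mask=None) -> str:
    """Illustrative reduction of Visualizer._visualize_full's branching."""
    if task == "segmentation":
        # This is the assert that fires when a classification-only model
        # is run with task: segmentation in the config.
        assert pred_mask is not None, "segmentation task but model produced no pred_mask"
        return "draw heatmap + mask"
    return "draw image-level score only"
```

With task="classification" the mask is never touched; with task="segmentation" and a model that yields no mask, the assert is guaranteed to fire.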
I have two GPUs. My anomalib training config is below; I want training to use my GPU 1 instead of GPU 0:
dataset:
name: mvtec
format: folder
path: /media/jd/4997BB1603CFE2C4/lw/liwei23/anomalib/dataset
normal_dir: normal # subfolder with the normal samples of the custom dataset
abnormal_dir: abnormal # subfolder with the abnormal samples of the custom dataset
mask_dir: label_mask1 # path to binary masks; custom datasets usually have none, use null
category: bottle
normal_test_dir: normal_test # name of the folder containing normal test images.
task: segmentation
extensions: null
train_batch_size: 32
eval_batch_size: 32
split_ratio: 0.2 # ratio of the normal images that will be used to create a test split
num_workers: 8
image_size: 256 # dimensions to which images are resized (mandatory)
center_crop: null # dimensions to which images are center-cropped after resizing (optional)
normalization: imagenet # data distribution to which the images will be normalized: [none, imagenet]
transform_config:
train: null
eval: null
test_split_mode: from_dir # options: [from_dir, synthetic]
test_split_ratio: 0.2 # fraction of train images held out testing (usage depends on test_split_mode)
val_split_mode: same_as_test # options: [same_as_test, from_test, synthetic]
val_split_ratio: 0.5 # fraction of train/test images held out for validation (usage depends on val_split_mode)
tiling:
apply: false
tile_size: null
stride: null
remove_border_count: 0
use_random_tiling: False
random_tile_count: 16
model:
name: cflow
backbone: wide_resnet50_2
pre_trained: true
layers:
- layer2
- layer3
- layer4
decoder: freia-cflow
condition_vector: 128
coupling_blocks: 8
clamp_alpha: 1.9
fiber_batch_size: 64
permute_soft: false
lr: 0.0001
early_stopping:
patience: 2
metric: pixel_AUROC
mode: max
normalization_method: min_max # options: [null, min_max, cdf]
metrics:
image:
- F1Score
- AUROC
pixel:
- F1Score
- AUROC
- AUPRO
threshold:
method: adaptive #options: [adaptive, manual]
manual_image: null
manual_pixel: null
visualization:
show_images: False # show images on the screen
save_images: True # save images to the file system
log_images: True # log images to the available loggers (if any)
image_save_path: ./results/1217cflow # path to which images will be saved
mode: full # options: ["full", "simple"]
project:
seed: 0
path: ./results/1217cflow
logging:
logger: [tensorboard,csv]
log_graph: true
optimization:
export_mode: null # options: torch, onnx, openvino
To use wandb logger install it using pip install wandb
...use None for the seed value (null in the YAML file) or remove the seed key from the YAML file.
PrecisionRecallCurve will save all targets and predictions in buffer. For large datasets this may lead to large memory footprint.
Trainer(limit_train_batches=1.0) was configured so 100% of the batches per epoch will be used.
Trainer(limit_val_batches=1.0) was configured so 100% of the batches will be used.
Trainer(limit_test_batches=1.0) was configured so 100% of the batches will be used.
Trainer(limit_predict_batches=1.0) was configured so 100% of the batches will be used.
Trainer(val_check_interval=1.0) was configured so validation will run at the end of the training epoch.
You are using a CUDA device ('NVIDIA GeForce RTX 3090') that has Tensor Cores. To properly utilize them, you should set torch.set_float32_matmul_precision('medium' | 'high') which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
Missing logger folder: results/1217cflow/cflow/mvtec/run/logs/Tensorboard Logs
/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torchmetrics/utilities/prints.py:36: UserWarning: Metric ROC will save all targets and predictions in buffer. For large datasets this may lead to large memory footprint.
warnings.warn(*args, **kwargs)
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1]
81.6 M Trainable params
154 M Non-trainable params
236 M Total params
946.249 Total estimated model params size (MB)
/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/loggers/tensorboard.py:191: UserWarning: Could not log computational graph to TensorBoard: The model.example_input_array attribute is not set or input_array was not given.
rank_zero_warn(
/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:208: UserWarning: num_workers>0, persistent_workers=False, and strategy=ddp_spawn may result in data loading bottlenecks. Consider setting persistent_workers=True (this is a limitation of Python .spawn() and PyTorch)
rank_zero_warn(
Epoch 0:   0%|          | 0/190 [00:00<?, ?it/s]
/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/albumentations/__init__.py:13: UserWarning: A new version of Albumentations is available: 2.0.8 (you have 1.4.14). Upgrade using: pip install -U albumentations. To disable automatic update checks, set the environment variable NO_ALBUMENTATIONS_UPDATE to 1.
check_for_updates()
OpenVINO is not installed. Please install OpenVINO to use OpenVINOInferencer.
OpenVINO is not installed. Please install OpenVINO to use OpenVINOInferencer.
To use wandb logger install it using `pip install wandb`
[... the same Albumentations / OpenVINO / wandb warnings repeat once per DataLoader worker process ...]
^C/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py:48: UserWarning: Detected KeyboardInterrupt, attempting graceful shutdown...
rank_zero_warn("Detected KeyboardInterrupt, attempting graceful shutdown...")
2025-12-23 00:50:43,762 - anomalib - INFO - Loading the best model weights.
2025-12-23 00:50:43,763 - anomalib - INFO - Testing the model.
What is going on with this output, and how do I fix it?
The short answer first:
This output is not an error. Because you configured devices: 2, Lightning launched DDP with multiple processes training on two GPUs (0 and 1), so all of those Albumentations / OpenVINO / wandb warnings are printed once per process, which makes it look like the run is stuck and just spamming logs.
What you want is: use only GPU 1, not GPU 0, and ideally without all the duplicated output. Let's fix it step by step.
The key lines:

```text
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2 ...
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2 ...
distributed_backend=nccl
All distributed processes registered. Starting with 2 processes ...
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1]
Epoch 0:   0%| 0/190 [00:00<?, ?it/s]
```
What this means:

- devices: 2 + accelerator: gpu → Lightning automatically uses DDP (multi-GPU distributed training). GLOBAL_RANK 0 runs on cuda:0, GLOBAL_RANK 1 on cuda:1.
- The tail of the log is just your Ctrl-C:

```text
^C ... Detected KeyboardInterrupt, attempting graceful shutdown...
2025-12-23 00:50:43,762 - anomalib - INFO - Loading the best model weights.
2025-12-23 00:50:43,763 - anomalib - INFO - Testing the model.
```

So: the current output = DDP multi-processing, with every library warning printed once per process. Noisy, but not a bug.
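Incidentally, this is also why frameworks guard their own messages with "rank-zero-only" printing. A minimal sketch of the idea (the LOCAL_RANK environment variable is what Lightning sets for each DDP process; the helper itself is hypothetical, not a Lightning API):

```python
import os

def rank_zero_print(*args, **kwargs):
    """Print only on the main process (LOCAL_RANK 0 or unset)."""
    if int(os.environ.get("LOCAL_RANK", "0")) == 0:
        print(*args, **kwargs)

# With two DDP processes, only rank 0 emits the message:
os.environ["LOCAL_RANK"] = "1"
rank_zero_print("loaded dataset")   # suppressed on rank 1
os.environ["LOCAL_RANK"] = "0"
rank_zero_print("loaded dataset")   # printed exactly once
```

Third-party warnings (Albumentations, OpenVINO, wandb) don't go through such a guard, which is why you see them once per process.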
Change these two lines in the trainer section of your config:

```yaml
trainer:
  accelerator: gpu
  devices: 1        # was 2; use only one GPU
  strategy: null    # or delete this line and let Lightning choose
```
Then launch it like this in the terminal:

```bash
# expose only physical GPU 1 to the program
CUDA_VISIBLE_DEVICES=1 \
python tools/train.py --model cflow --config src/anomalib/models/cflow/config.yaml
```
With this setup:

- the program only "sees" one card (physical GPU 1);
- cuda:0 inside Lightning is actually your physical GPU 1;
- the log becomes something like:

```text
GPU available: True (cuda), used: True
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
```

No more GLOBAL_RANK 1, and no second process grabbing GPU 0.
Caveat: if you keep devices: 2 but CUDA_VISIBLE_DEVICES=1 exposes only one card, Lightning will error out (it wants 2 GPUs but only sees 1), so be sure to change devices to 1 at the same time.
If you would rather not touch environment variables, you can try this instead (it depends on your PyTorch Lightning version):

```yaml
trainer:
  accelerator: gpu
  devices: [1]      # explicitly pick GPU index 1
  strategy: null
```

In some versions devices: [1] means "use only the card with index 1". In practice I still recommend option 1 with CUDA_VISIBLE_DEVICES: its behaviour is the most stable and the easiest to reason about.
The remaining warnings are not errors, just annoying, and can be silenced.

The one that repeats constantly:

```text
A new version of Albumentations is available: 2.0.8 (you have 1.4.14).
To disable automatic update checks, set the environment variable NO_ALBUMENTATIONS_UPDATE to 1.
```

Set an environment variable in the terminal and the check goes away:

```bash
export NO_ALBUMENTATIONS_UPDATE=1
```
Or in a single command:

```bash
NO_ALBUMENTATIONS_UPDATE=1 CUDA_VISIBLE_DEVICES=1 \
python tools/train.py --model cflow --config src/anomalib/models/cflow/config.yaml
```
```text
OpenVINO is not installed. Please install OpenVINO to use OpenVINOInferencer.
To use wandb logger install it using `pip install wandb`
```

These only matter if you actually want OpenVINO export or the wandb logger; in that case run pip install openvino-dev wandb. Otherwise they are safe to ignore.
If your goal is: train CFlow on GPU 1 only (not GPU 0), single-card, with cleaner logs — do the following in order:
1. Edit the trainer section of the config:

```yaml
trainer:
  accelerator: gpu
  devices: 1        # was 2
  strategy: null    # or delete this line
```
2. Launch:

```bash
cd /media/jd/4997BB1603CFE2C4/lw/liwei23/anomalib
NO_ALBUMENTATIONS_UPDATE=1 CUDA_VISIBLE_DEVICES=1 \
python tools/train.py --model cflow --config src/anomalib/models/cflow/config.yaml
```

3. Run nvidia-smi to confirm: only your training process should be running, and only GPU 1 should be busy.
If you paste the updated YAML here, I can also double-check it for any other small things worth tidying up.
Epoch 0:  66%| | 125/190 [1:44:22<54:16, 50.10s/it, v_num=0, train_loss_step=3.55e+4]
/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/utilities/data.py:83: UserWarning: Trying to infer the batch_size from an ambiguous collection. The batch size we found is 3. To avoid any miscalculations, use self.log(..., batch_size=batch_size).
warning_cache.warn(
Epoch 0: 100%|██████████| 190/190 [1:48:23<00:00, 34.23s/it, v_num=0, train_loss_step=3.43e+3]
W1223 02:47:47.428338 140265947027264 torch/multiprocessing/spawn.py:146] Terminating process 2829716 via signal SIGTERM
100%|██████████| 64/64 [02:55<00:00, 2.75s/it]
Traceback (most recent call last):
File "tools/train.py", line 81, in <module>
train(args)
File "tools/train.py", line 66, in train
trainer.fit(model=model, datamodule=datamodule)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 608, in fit
call._call_and_handle_interrupt(
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 36, in _call_and_handle_interrupt
return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/multiprocessing.py", line 113, in launch
mp.start_processes(
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 238, in start_processes
while not context.join():
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 189, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 76, in _wrap
fn(i, *args)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/multiprocessing.py", line 139, in _wrapping_function
results = function(*args, **kwargs)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _fit_impl
self._run(model, ckpt_path=self.ckpt_path)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1112, in _run
results = self._run_stage()
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1191, in _run_stage
self._run_train()
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1214, in _run_train
self.fit_loop.run()
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 267, in advance
self._outputs = self.epoch_loop.run(self._data_fetcher)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
self.on_advance_end()
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 250, in on_advance_end
self._run_validation()
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 308, in _run_validation
self.val_loop.run()
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 206, in run
output = self.on_run_end()
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 180, in on_run_end
self._evaluation_epoch_end(self._outputs)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 288, in _evaluation_epoch_end
self.trainer._call_lightning_module_hook(hook_name, output_or_outputs)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1356, in _call_lightning_module_hook
output = fn(*args, **kwargs)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/anomalib/models/components/base/anomaly_module.py", line 145, in validation_epoch_end
self._compute_adaptive_threshold(outputs)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/anomalib/models/components/base/anomaly_module.py", line 162, in _compute_adaptive_threshold
self.image_threshold.compute()
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torchmetrics/metric.py", line 529, in wrapped_func
with self.sync_context(
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/contextlib.py", line 113, in __enter__
return next(self.gen)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torchmetrics/metric.py", line 500, in sync_context
self.sync(
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torchmetrics/metric.py", line 452, in sync
self._sync_dist(dist_sync_fn, process_group=process_group)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torchmetrics/metric.py", line 364, in _sync_dist
output_dict = apply_to_collection(
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torchmetrics/utilities/data.py", line 203, in apply_to_collection
return elem_type({k: apply_to_collection(v, dtype, function, *args, **kwargs) for k, v in data.items()})
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torchmetrics/utilities/data.py", line 203, in <dictcomp>
return elem_type({k: apply_to_collection(v, dtype, function, *args, **kwargs) for k, v in data.items()})
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torchmetrics/utilities/data.py", line 209, in apply_to_collection
return elem_type([apply_to_collection(d, dtype, function, *args, **kwargs) for d in data])
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torchmetrics/utilities/data.py", line 209, in <listcomp>
return elem_type([apply_to_collection(d, dtype, function, *args, **kwargs) for d in data])
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torchmetrics/utilities/data.py", line 199, in apply_to_collection
return function(data, *args, **kwargs)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torchmetrics/utilities/distributed.py", line 131, in gather_all_tensors
torch.distributed.all_gather(local_sizes, local_size, group=group)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
return func(*args, **kwargs)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 3108, in all_gather
work = group.allgather([tensor_list], [tensor])
RuntimeError: No backend type associated with device type cpu
What is the cause this time?
This error has nothing to do with your model or your data; it is a bug in how this version of anomalib implements its metrics under multi-GPU training.

The line it actually dies on:

```text
RuntimeError: No backend type associated with device type cpu
...
torch.distributed.all_gather(local_sizes, local_size, group=group)
```

Meaning: at all_gather time the tensor lives on the CPU, but the distributed backend is NCCL, which only handles CUDA tensors — hence "No backend type associated with device type cpu". The rest of your stack trace matches:

```text
... validation_epoch_end
    self._compute_adaptive_threshold(outputs)
... self.image_threshold.compute()
... gather_all_tensors(...)
... all_gather(...)
RuntimeError: No backend type associated with device type cpu
```
anomalib's AnomalyModule does this (visible in the docs/source) (Anomalib Documentation):

```python
def _compute_adaptive_threshold(self, outputs):
    self._collect_outputs(self.image_threshold, self.pixel_threshold, outputs)
    self.image_threshold.compute()
    ...

@staticmethod
def _collect_outputs(image_metric, pixel_metric, outputs):
    for output in outputs:
        image_metric.cpu()  # <-- forcibly moved to CPU
        image_metric.update(output["pred_scores"], output["label"].int())
        if "mask" in output.keys() and "anomaly_maps" in output.keys():
            pixel_metric.cpu()
            pixel_metric.update(output["anomaly_maps"], output["mask"].int())
```
In other words:

- the DDP backend is nccl;
- when computing the adaptive threshold, anomalib moves the image_threshold metric to the CPU via .cpu();
- torchmetrics then tries to all_gather that CPU tensor through NCCL, which fails with "No backend type associated with device type cpu".

This has already been reported on GitHub; the issue title is essentially "AnomalyScoreThreshold is incompatible with multi-GPU". (GitHub)
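The failure mode can be modelled without any GPUs. Below is a toy sketch (every name here is made up for illustration — none of these are torch or anomalib APIs): a fake "NCCL-style" gather that, like the real one, refuses anything not on a GPU, and a metric that gets moved to "cpu" before syncing:

```python
class FakeTensor:
    def __init__(self, value, device="cuda"):
        self.value, self.device = value, device

def nccl_all_gather(tensor):
    # Like the real NCCL backend: only CUDA tensors are supported.
    if tensor.device != "cuda":
        raise RuntimeError("No backend type associated with device type "
                           + tensor.device)
    return [tensor.value]  # single-process stand-in for the gathered list

class Metric:
    def __init__(self):
        self.state = FakeTensor(0.0)
    def cpu(self):
        self.state.device = "cpu"  # what anomalib's _collect_outputs does
        return self
    def compute(self):
        return nccl_all_gather(self.state)  # sync state before computing

m = Metric()
m.cpu()            # metric forced onto the CPU...
try:
    m.compute()    # ...so the NCCL-style gather raises
except RuntimeError as err:
    print(err)     # No backend type associated with device type cpu
```

Leave the metric on "cuda" (i.e. never call .cpu()) and compute() succeeds; that is exactly the fix the GitHub issue discusses.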
Even though you meant to use only GPU 1, this stack trace shows you are still running multi-process / multi-GPU:

```text
torch.multiprocessing.spawn.ProcessRaisedException
...
/pytorch_lightning/strategies/launchers/multiprocessing.py: mp.start_processes(...)
```

So the Trainer is still using DDP (spawned processes), i.e.:

- trainer.devices is still > 1 (it was 2 in your config);
- with accelerator: gpu, Lightning then defaults to a DDP / DDP-spawn strategy.

Any multi-GPU run with this version of anomalib will hit this bug in the adaptive-threshold computation.
You wanted single-GPU on GPU 1 anyway, so simply sidestep the multi-GPU pitfall and train on one card.

Find the trainer: section in your config and change these fields:

```yaml
trainer:
  accelerator: gpu
  devices: 1        # was 2; turns multi-GPU off entirely
  strategy: null    # or delete it and let Lightning pick
  # leave everything else unchanged
```

With devices: 1, Lightning no longer uses DDP, multiple processes, or distributed sync, so it never reaches the all_gather on that CPU metric.
Prepend the environment variable so only physical GPU 1 is exposed to the program:

```bash
cd /media/jd/4997BB1603CFE2C4/lw/liwei23/anomalib
NO_ALBUMENTATIONS_UPDATE=1 CUDA_VISIBLE_DEVICES=1 \
python tools/train.py --model cflow --config src/anomalib/models/cflow/config.yaml
```

With this:

- cuda:0 inside the process is actually your machine's GPU 1;
- with devices=1 there is no second process and no distributed metric sync;
- the RuntimeError: No backend type associated with device type cpu can no longer occur.

If you genuinely want multi-GPU later, there are a few more involved options (none of which you need right now):
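The CUDA_VISIBLE_DEVICES remapping is easy to misread, so here is a tiny illustration (the helper is hypothetical — CUDA performs this mapping internally): logical device indices inside the process are positions within the visible list, not physical IDs:

```python
import os

def logical_to_physical(logical_index: int) -> int:
    """Map a logical cuda:<n> index to a physical GPU id,
    following CUDA_VISIBLE_DEVICES semantics."""
    visible = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    ids = [int(x) for x in visible.split(",") if x.strip()]
    return ids[logical_index]

os.environ["CUDA_VISIBLE_DEVICES"] = "1"
print(logical_to_physical(0))  # cuda:0 in the process is physical GPU 1
```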
1. Upgrade anomalib to the latest version.
   Newer releases reworked the metrics and threshold implementations substantially, and multi-GPU support is better. (Anomalib Documentation)
2. Patch the source yourself: remove the .cpu() calls and use a custom all_gather.
   This is the hack discussed in the issue: keep the metric on the GPU, and when syncing, manually move tensors to cuda before gathering. (GitHub)
   It requires being comfortable with the anomalib code base, and you must verify carefully that the metrics are still correct.
3. Switch the distributed backend to gloo (which supports CPU tensors).
   In Lightning you can pass a custom strategy such as DDPStrategy(process_group_backend="gloo") so that CPU tensors can be all_gathered.
   But that usually means constructing the Trainer in Python code rather than pure YAML, which does not fit your current CLI + YAML workflow well.

Given that what you want right now is "get it running on GPU 1" rather than "debug multi-GPU", a plain single-card run is by far the least painful.
One more warning in your log:

```text
UserWarning: Trying to infer the `batch_size` from an ambiguous collection. The batch size we found is 3.
```

This is just Lightning guessing the batch_size for logging; it does not cause crashes and can be ignored for now. If you ever want to silence it, pass batch_size=... explicitly to self.log(...), but anomalib wraps that call itself, so leave it alone.

To recap:

- the crash is RuntimeError: No backend type associated with device type cpu, a multi-GPU metric bug;
- set trainer.devices: 1, not 2;
- launch with CUDA_VISIBLE_DEVICES=1 to expose only GPU 1.

Make those two changes and rerun; if a new error appears, paste the updated trainer config and the full stack trace and I'll keep untangling it.
I'm now using batch_size 8. The normal training set has 8000 images, there are 2028 anomalous images and 2000+ normal test images, image size 256. I have a single RTX 3090 (24 GB), but I keep getting: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.05 GiB. GPU 0 has a total capacity of 23.69 GiB of which 2.83 GiB is free. Including non-PyTorch memory, this process has 20.85 GiB memory in use. Of the allocated memory 18.15 GiB is allocated by PyTorch, and 2.38 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Testing DataLoader 0: 100%|██████████| 507/507 [37:17<00:00, 4.41s/it] — what should I do?
This time nothing in your code is wrong; the GPU genuinely runs out of memory (specifically under the combination "test stage + CFlow + segmentation metrics").

Unpack the error message first:

```text
CUDA out of memory. Tried to allocate 3.05 GiB.
GPU 0 total 23.69 GiB, free 2.83 GiB.
This process has 20.85 GiB in use.
18.15 GiB allocated by PyTorch, 2.38 GiB reserved but unallocated.
Testing DataLoader 0: 100%|██████████| 507/507 [...]
```

Key points: the failure happens during testing, with nearly the whole card already consumed. Combine that with the earlier warnings:

```text
Metric `ROC` will save all targets and predictions in buffer.
Metric `PrecisionRecallCurve` will save all targets and predictions in buffer.
```

plus the fact that you run CFlow with segmentation and a pile of image/pixel metrics, and the diagnosis is fairly clear: it is not one oversized forward pass; the test-stage metrics accumulate all predictions and labels in GPU memory until it finally blows up.
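To get a feel for the scale, a back-of-the-envelope estimate (dataset sizes are from your description; the float32 full-resolution buffer layout is an assumption) of what buffering every pixel-level prediction costs:

```python
# Rough size of buffering full-resolution anomaly maps for pixel metrics.
num_test_images = 2000          # ~2000 normal test images (anomalous ones add more)
h = w = 256                     # image / anomaly-map resolution
bytes_per_float32 = 4

preds_bytes = num_test_images * h * w * bytes_per_float32
gib = preds_bytes / 1024**3
print(f"predictions alone: {gib:.2f} GiB")           # ~0.49 GiB

# A curve metric keeps predictions AND targets, and several such metrics
# run at once (ROC, PrecisionRecallCurve, AUPRO, ...), so it multiplies:
print(f"preds+targets x 3 metrics: {gib * 2 * 3:.2f} GiB")
```

Against the 2.83 GiB that was left free, a few gigabytes of metric buffers is exactly the kind of thing that tips the run over.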
First, before training, open a terminal and run:

```bash
nvidia-smi
```

Make sure your python tools/train.py is the only thing using the GPU. If anything else is there (another Python process, a browser, an IDE, a leftover process from a previous run), kill it first; that alone can free several GB.
Your training and test batch sizes are separate:

```yaml
dataset:
  train_batch_size: 8   # you said this is already 8
  eval_batch_size: 8    # this is the one to shrink separately
```

Set eval_batch_size very conservatively, e.g. 1 or 2:

```yaml
dataset:
  train_batch_size: 8   # keep for now
  eval_batch_size: 2    # try 2 first, then 1 if needed
```

Since the error happens at

```text
Testing DataLoader 0: 100%|██████████| 507/507 [...]
```

it is the test stage that fails, so start with eval_batch_size and leave the training batch size alone.
Your config currently has:

```yaml
trainer:
  precision: 32
```

Switch to mixed precision:

```yaml
trainer:
  precision: 16   # mixed precision; cuts memory substantially
```

The 3090 is very well suited to mixed precision; both speed and memory improve noticeably. (precision: 16 requires a healthy driver + CUDA setup; since you could already train normally, that should be fine.)
From your earlier log, the CFlow model is close to 1 GB of parameters:

```text
Total estimated model params size (MB): 946.249
```

That is mostly because wide_resnet50_2 is big. Under model: you can do a few things.

Swap wide_resnet50_2 for resnet18 or resnet34 (if you are not chasing paper-identical numbers):

```yaml
model:
  name: cflow
  backbone: resnet34   # or resnet18; both far lighter than wide_resnet50_2
  pre_trained: true
  layers:
    - layer2
    - layer3            # start with just two layers; friendlier on memory
  # leave the other parameters alone for now
```
Your current flow settings are:

```yaml
condition_vector: 128
coupling_blocks: 8
fiber_batch_size: 64
```

You can trim them a bit:

```yaml
condition_vector: 64   # halved
coupling_blocks: 4     # half as many flow blocks; saves a fair amount
fiber_batch_size: 32   # smaller fiber batches cut memory noticeably, at some speed cost
```

All of these shrink the flow's internal dimensions and intermediate feature sizes, which is very memory-friendly.
Your metrics config is:

```yaml
metrics:
  image:
    - F1Score
    - AUROC
  pixel:
    - F1Score
    - AUROC
    - AUPRO
  threshold:
    method: adaptive
```

These metrics (especially AUROC and AUPRO) store every image's predictions and labels before computing the curve in one pass, which is extremely memory-hungry. Trim them to the ones you actually care about:

```yaml
metrics:
  image:
    - AUROC        # just one image-level metric
  pixel:
    - AUROC        # one pixel-level metric (or keep only AUPRO)
  threshold:
    method: adaptive
```

If for now you only want the run to finish and do not care about the full metric suite, go further:

```yaml
metrics:
  image: []
  pixel: []
  threshold:
    method: manual
    manual_image: 0.5
    manual_pixel: 0.5
```

A manual threshold avoids buffering a pile of predictions for the adaptive estimate, which saves memory, but the threshold is hand-picked and may be worse than adaptive.
Your visualization config is:

```yaml
visualization:
  show_images: False
  save_images: True
  log_images: True
  image_save_path: ./results/1217cflow
  mode: full
```

During testing, full mode with both save and log can generate heaps of anomaly maps, masks and overlays, which may sit in GPU or CPU memory for a while. Simplify it temporarily:

```yaml
visualization:
  show_images: False
  save_images: False   # turn saving off until the run is stable
  log_images: False    # stop writing images to the logger for now
  mode: simple         # full -> simple
```

Once OOM is gone, re-enable the parts you need one by one.
The error message itself suggests:

```text
If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
```

Add it to the launch command:

```bash
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
CUDA_VISIBLE_DEVICES=0 \
python tools/train.py --model cflow --config src/anomalib/models/cflow/config.yaml
```

This mainly mitigates fragmentation rather than fixing the root cause, but it combines well with the other changes.
If you don't want to change too much at once, here is a minimal-change, high-success-rate combination you can copy verbatim:

```yaml
dataset:
  train_batch_size: 8
  eval_batch_size: 2          # the main change

model:
  name: cflow
  backbone: resnet34          # was wide_resnet50_2
  pre_trained: true
  layers:
    - layer2
    - layer3
  decoder: freia-cflow
  condition_vector: 64        # was 128
  coupling_blocks: 4          # was 8
  fiber_batch_size: 32        # was 64
  lr: 0.0001
  normalization_method: min_max

metrics:
  image:
    - AUROC
  pixel:
    - AUROC
  threshold:
    method: adaptive

visualization:
  show_images: False
  save_images: False          # off for now
  log_images: False           # off for now
  image_save_path: ./results/1217cflow
  mode: simple                # was full

trainer:
  accelerator: gpu
  devices: 1                  # single GPU
  precision: 16               # mixed precision
  # everything else unchanged
```

```bash
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
CUDA_VISIBLE_DEVICES=0 \
NO_ALBUMENTATIONS_UPDATE=1 \
python tools/train.py --model cflow --config src/anomalib/models/cflow/config.yaml
```

(If you still want GPU 1, just change CUDA_VISIBLE_DEVICES=0 to 1.)

If it still OOMs after these changes, paste the latest error plus the updated config (just the dataset / model / metrics / trainer sections) and I can tighten memory usage further.
If I change imgsize to 512, does that increase or decrease the pressure?
It increases it — substantially, not just a little 🥲

Intuitively:

- image_size = 256 vs image_size = 512: the side length doubles, so the pixel count (and hence activation memory and compute) goes up roughly 4×.

If you already OOM at 256, switching to 512 is almost guaranteed to blow up even sooner; it never reduces pressure.

Memory scales roughly as: batch_size × H × W × (a factor depending on channels and layers). So 256 at batch=8 is about on par with 512 at batch≈2. But you already OOM in the test stage at 256 with batch=8, so going to 512 only makes things harder to rescue — it is not a fix.

If anything, go the other direction to relieve memory, e.g. 256 → 224 or 256 → 192, combined with a smaller eval_batch_size (even 1) and precision: 16.

In one sentence: 256 → 512 is strictly "heavier", never "lighter"; given your current OOM, this change is not advisable.
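The scaling argument above in numbers (only the proportionality — constant factors for channels, layers, etc. are omitted, so this is not a real memory model):

```python
def relative_activation_cost(batch, side):
    """Activation memory is proportional to batch * H * W (channels held constant)."""
    return batch * side * side

base = relative_activation_cost(8, 256)
print(relative_activation_cost(8, 512) / base)   # 4.0: doubling the side quadruples memory
print(relative_activation_cost(2, 512) / base)   # 1.0: 512 at batch 2 ~ 256 at batch 8
```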
I went to train fastflow and changed the input to 512; after a few epochs it fails as below. Why? I have already reduced the anomalous dataset to 200+ images: Epoch 7: 100%|█| 543/543 [03:16<00:00, 2.77it/s, loss=-9.39e+06, v_num=1, train_loss_step=-9.18e+6, pixel_F1Score=0.0435, pixel_AUROC=0.836, pixel_AUPRO=0.450, train_loss_epoch=-9.58 — Given normalized_shape=[64, 128, 128], expected input with shape [*, 64, 128, 128], but got input of size [1, 64, 64, 64]
Error occurs, No graph saved
Traceback (most recent call last):
File "tools/train.py", line 81, in <module>
train(args)
File "tools/train.py", line 66, in train
trainer.fit(model=model, datamodule=datamodule)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 608, in fit
call._call_and_handle_interrupt(
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _fit_impl
self._run(model, ckpt_path=self.ckpt_path)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1112, in _run
results = self._run_stage()
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1191, in _run_stage
self._run_train()
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1214, in _run_train
self.fit_loop.run()
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 206, in run
output = self.on_run_end()
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 323, in on_run_end
self.trainer._call_callback_hooks("on_train_end")
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1394, in _call_callback_hooks
fn(self, self.lightning_module, *args, **kwargs)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/anomalib/utils/callbacks/graph.py", line 44, in on_train_end
logger.log_graph(pl_module, input_array=torch.ones((1, 3, 256, 256)))
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/lightning_utilities/core/rank_zero.py", line 42, in wrapped_fn
return fn(*args, **kwargs)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/loggers/tensorboard.py", line 205, in log_graph
self.experiment.add_graph(model, input_array)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/utils/tensorboard/writer.py", line 841, in add_graph
graph(model, input_to_model, verbose, use_strict_trace)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/utils/tensorboard/_pytorch_graph.py", line 338, in graph
raise e
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/utils/tensorboard/_pytorch_graph.py", line 331, in graph
trace = torch.jit.trace(model, args, strict=False)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/jit/_trace.py", line 1000, in trace
traced_func = _trace_impl(
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/jit/_trace.py", line 695, in _trace_impl
return trace_module(
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/jit/_trace.py", line 1275, in trace_module
module._c._create_method_from_trace(
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1543, in _slow_forward
result = self.forward(*input, **kwargs)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/anomalib/models/components/base/anomaly_module.py", line 67, in forward
return self.model(batch)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1543, in _slow_forward
result = self.forward(*input, **kwargs)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/anomalib/models/fastflow/torch_model.py", line 190, in forward
features = self._get_cnn_features(input_tensor)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/anomalib/models/fastflow/torch_model.py", line 219, in _get_cnn_features
features = [self.norms[i](feature) for i, feature in enumerate(features)]
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/anomalib/models/fastflow/torch_model.py", line 219, in <listcomp>
features = [self.norms[i](feature) for i, feature in enumerate(features)]
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1543, in _slow_forward
result = self.forward(*input, **kwargs)
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/nn/modules/normalization.py", line 202, in forward
return F.layer_norm(
File "/home/jd/anaconda3/envs/anomalib/lib/python3.8/site-packages/torch/nn/functional.py", line 2576, in layer_norm
return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)
RuntimeError: Given normalized_shape=[64, 128, 128], expected input with shape [*, 64, 128, 128], but got input of size [1, 64, 64, 64]
Epoch 7: 100%|██████████| 543/543 [03:18<00:00, 2.74it/s, loss=-9.39e+06, v_num=1, train_loss_step=-9.18e+6, pixel_F1Score=0.0435, pixel_AUROC=0.836, pixel_AUPRO=0.450, train_loss_epoch=-9.58e+6]
This time it is not a data-volume problem (cutting the anomalous samples down to ~200 images is fine). After you changed the input size to 512, FastFlow built its internal LayerNorm shapes for 512; but when anomalib logs the TensorBoard computational graph at on_train_end, it still traces the model with a fixed 256x256 dummy input, so the shapes no longer match and you get:
expected normalized_shape=[64, 128, 128] ... but got ... [1, 64, 64, 64]
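The mismatch is easy to reproduce in isolation (plain PyTorch, not anomalib code): nn.LayerNorm initialized with a fixed normalized_shape only accepts inputs whose trailing dimensions match it exactly:

```python
import torch
from torch import nn

# LayerNorm built for a 512x512 input: a /4-downsampled feature map is 128x128
norm = nn.LayerNorm([64, 128, 128])

# Works: feature map shape produced by a 512 input
out = norm(torch.ones(1, 64, 128, 128))

# Fails: feature map from the 256x256 trace input (/4 -> 64x64)
try:
    norm(torch.ones(1, 64, 64, 64))
    failed = False
except RuntimeError:
    failed = True
```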
Your stack trace also shows exactly where it happens:
- anomalib/utils/callbacks/graph.py: logger.log_graph(pl_module, input_array=torch.ones((1, 3, 256, 256)))
- fastflow/torch_model.py: self.norms[i](feature) → the LayerNorm complains about the shape
To decode the numbers: with a 512 input, a /4-downsampled feature map is 128x128, so the LayerNorm was initialized as [64, 128, 128]; the 256x256 trace input produces 64x64 feature maps, and the LayerNorm blows up when handed [1, 64, 64, 64]. The simplest fix: your config currently should have logging.log_graph: true; change it to:
```yaml
logging:
  logger: [tensorboard, csv]
  log_graph: false
```
Training and testing then run as before; you just stop writing the computational graph to TensorBoard, so on_train_end no longer crashes.
Alternatively, edit this file in your environment (your traceback already points at it):
.../site-packages/anomalib/utils/callbacks/graph.py
and change the hardcoded 256 to 512 (or read image_size from the config). The crudest change that works:
```python
logger.log_graph(pl_module, input_array=torch.ones((1, 3, 512, 512)))
```
If you expect to switch to other sizes later, read image_size from the config / datamodule instead of hardcoding it.
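A minimal sketch of that idea, assuming an anomalib-style config object that exposes dataset.image_size as either an int or an (H, W) pair (the helper name and the attribute path are illustrative; adjust them to your actual config):

```python
import types

import torch


def make_graph_input(config):
    """Build the dummy input for log_graph from the config instead of a hardcoded 256."""
    size = config.dataset.image_size  # int or (H, W) pair, depending on the config
    if isinstance(size, int):
        size = (size, size)
    return torch.ones((1, 3, *size))


# Stand-in for the real config object, just for demonstration
cfg = types.SimpleNamespace(dataset=types.SimpleNamespace(image_size=512))
dummy = make_graph_input(cfg)
```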
If you just want to get it running quickly, changing dataset.image_size back to 256 also works.
This crash happened after Epoch 7 ended, in the train_end graph-writing callback, so the model had already trained for several epochs and has probably saved a checkpoint; only the final graph write failed and killed the process. First check whether there are ckpt files under your results/... directory.
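A quick way to check (the exact results/ layout depends on your model, category, and run version, so adjust the path):

```shell
# List any checkpoints saved before the crash; 2>/dev/null keeps find
# quiet if the directory does not exist yet.
find results -name '*.ckpt' 2>/dev/null | sort
```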
If you paste the logging: section of your current FastFlow YAML (especially log_graph), I can point you to the exact line to change.