Invalidate trace cache @ step 474 and module 474: ...

Created on: March 21, 2025

Answered using O1 by Chat01

Question

Invalidate trace cache @ step 474 and module 474: cache has only 474 modules
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/data/guoweis/zero/tinyzero/qwen3b_grpo/why_grpo_epochs.py", line 531, in <module>
[rank0]: trainer.train()
[rank0]: File "/home/data/guoweis/miniconda3/envs/zero/lib/python3.9/site-packages/transformers/trainer.py", line 2171, in train
[rank0]: return inner_training_loop(
[rank0]: File "/home/data/guoweis/miniconda3/envs/zero/lib/python3.9/site-packages/transformers/trainer.py", line 2531, in _inner_training_loop
[rank0]: tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank0]: File "/home/data/guoweis/miniconda3/envs/zero/lib/python3.9/site-packages/transformers/trainer.py", line 3712, in training_step
[rank0]: self.accelerator.backward(loss, **kwargs)
[rank0]: File "/home/data/guoweis/miniconda3/envs/zero/lib/python3.9/site-packages/accelerate/accelerator.py", line 2238, in backward
[rank0]: self.deepspeed_engine_wrapped.backward(loss, **kwargs)
[rank0]: File "/home/data/guoweis/miniconda3/envs/zero/lib/python3.9/site-packages/accelerate/utils/deepspeed.py", line 270, in backward
[rank0]: self.engine.step()
[rank0]: File "/home/data/guoweis/miniconda3/envs/zero/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2249, in step
[rank0]: self._take_model_step(lr_kwargs)
[rank0]: File "/home/data/guoweis/miniconda3/envs/zero/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2152, in _take_model_step
[rank0]: self.optimizer.step()
[rank0]: File "/home/data/guoweis/miniconda3/envs/zero/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank0]: ret_val = func(*args, **kwargs)
[rank0]: File "/home/data/guoweis/miniconda3/envs/zero/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 2089, in step
[rank0]: if self._overflow_check_and_loss_scale_update():
[rank0]: File "/home/data/guoweis/miniconda3/envs/zero/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank0]: ret_val = func(*args, **kwargs)
[rank0]: File "/home/data/guoweis/miniconda3/envs/zero/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 2034, in _overflow_check_and_loss_scale_update
[rank0]: self._update_scale(self.overflow)
[rank0]: File "/home/data/guoweis/miniconda3/envs/zero/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 2459, in _update_scale
[rank0]: self.loss_scaler.update_scale(has_overflow)
[rank0]: File "/home/data/guoweis/miniconda3/envs/zero/lib/python3.9/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 175, in update_scale
[rank0]: raise Exception(
[rank0]: Exception: Current loss scale already at minimum - cannot decrease scale anymore. Exiting run.
[rank1]: Traceback (most recent call last):
[rank1]: File "/home/data/guoweis/zero/tinyzero/qwen3b_grpo/why_grpo_epochs.py", line 531, in <module>
[rank1]: trainer.train()
[rank1]: File "/home/data/guoweis/miniconda3/envs/zero/lib/python3.9/site-packages/transformers/trainer.py", line 2171, in train
[rank1]: return inner_training_loop(
[rank1]: File "/home/data/guoweis/miniconda3/envs/zero/lib/python3.9/site-packages/transformers/trainer.py", line 2531, in _inner_training_loop
[rank1]: tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank1]: File "/home/data/guoweis/miniconda3/envs/zero/lib/python3.9/site-packages/transformers/trainer.py", line 3712, in training_step
[rank1]: self.accelerator.backward(loss, **kwargs)
[rank1]: File "/home/data/guoweis/miniconda3/envs/zero/lib/python3.9/site-packages/accelerate/accelerator.py", line 2238, in backward
[rank1]: self.deepspeed_engine_wrapped.backward(loss, **kwargs)
[rank1]: File "/home/data/guoweis/miniconda3/envs/zero/lib/python3.9/site-packages/accelerate/utils/deepspeed.py", line 270, in backward
[rank1]: self.engine.step()
[rank1]: File "/home/data/guoweis/miniconda3/envs/zero/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2249, in step
[rank1]: self._take_model_step(lr_kwargs)
[rank1]: File "/home/data/guoweis/miniconda3/envs/zero/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2152, in _take_model_step
[rank1]: self.optimizer.step()
[rank1]: File "/home/data/guoweis/miniconda3/envs/zero/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank1]: ret_val = func(*args, **kwargs)
[rank1]: File "/home/data/guoweis/miniconda3/envs/zero/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 2089, in step
[rank1]: if self._overflow_check_and_loss_scale_update():
[rank1]: File "/home/data/guoweis/miniconda3/envs/zero/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank1]: ret_val = func(*args, **kwargs)
[rank1]: File "/home/data/guoweis/miniconda3/envs/zero/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 2034, in _overflow_check_and_loss_scale_update
[rank1]: self._update_scale(self.overflow)
[rank1]: File "/home/data/guoweis/miniconda3/envs/zero/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 2459, in _update_scale
[rank1]: self.loss_scaler.update_scale(has_overflow)
[rank1]: File "/home/data/guoweis/miniconda3/envs/zero/lib/python3.9/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 175, in update_scale
[rank1]: raise Exception(
[rank1]: Exception: Current loss scale already at minimum - cannot decrease scale anymore. Exiting run.
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
The following code produces the errors above. Please help me fix this bug.

# -*- coding: utf-8 -*-

import os
import re
import json
import gc
import langid
import wandb
import torch
import deepspeed
from typing import List
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk import edit_distance
from datasets import load_dataset, Dataset
from transformers import AutoTokenizer, AutoModelForCausalLM
from trl import GRPOConfig, GRPOTrainer
from nltk.translate.meteor_score import meteor_score

1. Login to wandb if desired

wandb.login(key="91223cd0edc96e39bfa8e98641d2a8d1aa3c0c4e")

2. Device assignment; let DeepSpeed handle device_map

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Running on device: {device}")

3. Supported languages

SUPPORTED_LANGUAGES = {
"en_to_zh": ("英语", "中文"),
"zh_to_en": ("中文", "英语"),
}

4. System Prompt (modified as requested)

SYSTEM_PROMPT = """
You are a lyrics translation assistant.
You MUST ALWAYS output in the exact XML format:
<why1>
[First question you ask yourself to complete the task]
</why1>
<why2>
[Second question you ask yourself to complete the task]
</why2>
<why3>
[Third question you ask yourself to complete the task]
</why3>
<answer>
[The final translation only goes here]
</answer>
"""

5. Dataset Preparation

def get_lyric_datasets(path: str) -> Dataset:
"""
Convert the JSON dataset at the given path into a HuggingFace Dataset and, for each example,
build the prompt [system, user] together with its reference translation.
"""
data = Dataset.from_json(path)

    # Filter dataset to only include 'en_to_zh' and 'zh_to_en'
    data = data.filter(lambda x: x['type'] in ["en_to_zh", "zh_to_en"])

    def map_fn(x):
        lang_src = SUPPORTED_LANGUAGES[x['type']][0]
        lang_tgt = SUPPORTED_LANGUAGES[x['type']][1]
        # Append the translation instruction to the new SYSTEM_PROMPT
        system_plus = SYSTEM_PROMPT + f"\nTranslate the following from {lang_src} to {lang_tgt}. Do not add commentary."
        return {
            'prompt': [
                {'role': 'system', 'content': system_plus},
                {'role': 'user', 'content': x['lyric']}
            ],
            'answer': x['target_lyric']
        }

    data = data.map(map_fn)
    return data

6. Utility to extract <answer> ... </answer> from text

def extract_xml_answer(text: str) -> str:
"""
Extract the content between <answer> ... </answer> from the given text.
"""
pattern = r"<answer>\s*(.*?)\s*</answer>"
match = re.search(pattern, text, re.DOTALL | re.IGNORECASE)
if match:
return match.group(1).strip()
return ""

7. Reward computation helpers

def compute_length_acc(
preds: List[str],
refs: List[str],
tokenizer,
max_tolerance: float = 0.5
) -> List[float]:
"""
Compute the length-accuracy reward.
If the relative difference between the predicted and reference lengths stays within a given ratio, award the corresponding score.
"""
rewards = []
for pred, ref in zip(preds, refs):
pred_tokens = tokenizer.tokenize(pred)
ref_tokens = tokenizer.tokenize(ref)
if len(ref_tokens) == 0:
rewards.append(0.0)
continue
length_ratio = abs(len(pred_tokens) - len(ref_tokens)) / len(ref_tokens)
if length_ratio <= 0.1:
score = 1.0
elif length_ratio <= 0.2:
score = 0.8
elif length_ratio <= 0.3:
score = 0.6
elif length_ratio <= 0.4:
score = 0.4
elif length_ratio <= 0.5:
score = 0.2
else:
score = 0.0
rewards.append(score)
return rewards

def compute_bleu(preds: List[str], refs: List[str], tokenizer) -> List[float]:
"""
Compute a list of BLEU scores.
"""
smoothie = SmoothingFunction().method1
weights = (0.25, 0.25, 0.25, 0.25)
scores = []
for pred, ref in zip(preds, refs):
pred_tokens = tokenizer.tokenize(pred)
ref_tokens = tokenizer.tokenize(ref)
if not pred_tokens or not ref_tokens:
scores.append(0.0)
continue
bleu = sentence_bleu(
[ref_tokens],
pred_tokens,
weights=weights,
smoothing_function=smoothie
)
scores.append(bleu)
return scores

def compute_ter(preds: List[str], refs: List[str], tokenizer) -> List[float]:
"""
Compute TER (Translation Edit Rate).
"""
ter_scores = []
for pred, ref in zip(preds, refs):
pred_tokens = tokenizer.tokenize(pred)
ref_tokens = tokenizer.tokenize(ref)
if len(ref_tokens) == 0:
if len(pred_tokens) > 0:
ter_scores.append(100.0)
else:
ter_scores.append(0.0)
continue
dist_val = edit_distance(pred_tokens, ref_tokens)
ter = (dist_val / len(ref_tokens)) * 100
ter_scores.append(ter)
return ter_scores

def detect_language(text: str) -> str:
"""
Automatically detect the language with langid and return the language code.
"""
return langid.classify(text)[0]

8. Reward Function Decorator (example; not used in this project)

def reward_func_decorator(func):
    def wrapper(prompts, completions, answer, step=0, **kwargs):
        # Pre- or post-processing around func could be added here
        return func(prompts, completions, answer, step=step, **kwargs)
    return wrapper

9. Define Reward Functions

def length_acc_reward_func(prompts, completions, answer, step=0, **kwargs) -> List[float]:
"""示例奖励函数:关注预测的长度。"""
responses = [completion[0]['content'] for completion in completions]
extracted_responses = [extract_xml_answer(r) for r in responses]

    # Print a few sample rows
    if step % 200 == 0:
        print(f"\n[Step {step}] Displaying up to 3 samples from the batch:\n")
        for i in range(min(3, len(responses))):
            q = prompts[i][-1]['content']
            print("-" * 20)
            print(f"Sample {i + 1}")
            print(f"Question:\n{q}")
            print(f"GroundTruth Answer:\n{answer[i]}")
            print(f"Model Response (raw):\n{responses[i]}")
            print(f"Extracted <answer>:\n{extracted_responses[i]}")
            print("-" * 20)

    length_rewards = compute_length_acc(
        preds=extracted_responses,
        refs=answer,
        tokenizer=tokenizer
    )
    return length_rewards

def bleu_reward_func(prompts, completions, answer, step=0, **kwargs) -> List[float]:
"""
Compute BLEU scores and map them to a custom reward range.
"""
responses = [c[0]["content"] for c in completions]
q = prompts[0][-1]['content']
extracted = [extract_xml_answer(r) for r in responses]
print('-' * 20, f"Original Lyrics:\n{q}", f"\nAnswer:\n{answer[0]}", f"\nResponse:\n{responses[0]}",
f"\nExtracted:\n{extracted[0]}")
bleu_scores = compute_bleu(preds=extracted, refs=answer, tokenizer=tokenizer)
rewards = []
for score in bleu_scores:
if score >= 0.9:
rewards.append(5.0)
elif score >= 0.8:
rewards.append(4.5)
elif score >= 0.7:
rewards.append(4.0)
elif score >= 0.6:
rewards.append(3.5)
elif score >= 0.5:
rewards.append(2.5)
elif score >= 0.4:
rewards.append(2.0)
elif score >= 0.3:
rewards.append(1.5)
elif score >= 0.2:
rewards.append(1.0)
elif score >= 0.1:
rewards.append(0.5)
else:
rewards.append(0.0)
return rewards

def ter_reward_func(completions, answer, step=0, **kwargs) -> List[float]:
"""
Compute TER scores and map them to a reward range.
"""
responses = [c[0]["content"] for c in completions]
extracted = [extract_xml_answer(r) for r in responses]
ter_scores = compute_ter(preds=extracted, refs=answer, tokenizer=tokenizer)
rewards = []
for t in ter_scores:
if t >= 80:
rewards.append(0.0)
elif t >= 60:
rewards.append(0.5)
elif t >= 40:
rewards.append(1.0)
elif t >= 20:
rewards.append(1.5)
else:
rewards.append(2.0)
return rewards

def language_recognition(completions, answer, step=0, **kwargs) -> List[float]:
"""
Simply check whether the predicted language matches the reference-answer language and reward accordingly.
"""
responses = [c[0]["content"] for c in completions]
extracted = [extract_xml_answer(r) for r in responses]
rewards = []
for pred, ref in zip(extracted, answer):
if not pred.strip():
rewards.append(0.0)
continue
pred_lang = detect_language(pred)
ref_lang = detect_language(ref)
rewards.append(1.0 if pred_lang == ref_lang else 0.0)
return rewards

def strict_format_reward_func(completions, answer=None, step=0, **kwargs) -> list:
"""
Strict format check on the output:
it must contain <why1>...</why1> <why2>...</why2> <why3>...</why3> <answer>...</answer> in the correct order.
"""
pattern = (
    r"<why1>[\s\S]+?</why1>\s*"
    r"<why2>[\s\S]+?</why2>\s*"
    r"<why3>[\s\S]+?</why3>\s*"
    r"<answer>[\s\S]+?</answer>"
)
responses = [completion[0]["content"] for completion in completions]
scores = []
for r in responses:
if re.search(pattern, r):
scores.append(1.0)
else:
scores.append(0.0)
return scores

def soft_format_reward_func(completions, answer=None, step=0, **kwargs) -> list:
"""
Soft format check on the output:
as long as all four tag pairs are present, regardless of order or extra content in between, award a partial score.
"""
responses = [completion[0]["content"] for completion in completions]
scores = []
for r in responses:
tags_present = all(
tag in r
for tag in ["<why1>", "</why1>", "<why2>", "</why2>", "<why3>", "</why3>", "<answer>", "</answer>"]
)
scores.append(0.5 if tags_present else 0.0)
return scores

def xmlcount_reward_func(completions, answer=None, step=0, **kwargs) -> List[float]:
"""
Score the number of XML tag occurrences: each correct opening or closing tag adds points.
If there is leftover text after </answer>, apply an extra penalty.
"""
    def count_xml(text) -> float:
        count = 0.0
        # For the new four-part tags why1 / why2 / why3 / answer:
        # each correct opening or closing tag is worth 0.125, for a maximum of 1.0
        if text.count("<why1>") == 1:
            count += 0.125
        if text.count("</why1>") == 1:
            count += 0.125
        if text.count("<why2>") == 1:
            count += 0.125
        if text.count("</why2>") == 1:
            count += 0.125
        if text.count("<why3>") == 1:
            count += 0.125
        if text.count("</why3>") == 1:
            count += 0.125
        if text.count("<answer>") == 1:
            count += 0.125
        if text.count("</answer>") == 1:
            count += 0.125
        # If there is leftover text after </answer>, apply a small penalty
        if "</answer>" in text:
            leftover = text.split("</answer>")[-1]
            count -= len(leftover.strip()) * 0.001
        # Clamp at 0 so the score is never negative
        if count < 0:
            count = 0.0
        return count

    responses = [c[0]["content"] for c in completions]
    return [count_xml(c) for c in responses]

def meteor_reward_func(completions, answer, step=0, tokenizer=None, **kwargs) -> List[float]:
"""
2) Semantic similarity based on the METEOR score
- use nltk.translate.meteor_score.meteor_score to compute the value
- then map it roughly into the [0, 2.0] range
responses = [c[0]["content"] for c in completions]
extracted_preds = [extract_xml_answer(r) for r in responses]

    scores = []
    for pred, ref in zip(extracted_preds, answer):
        if not pred.strip():
            scores.append(0.0)
            continue
        m = meteor_score([ref], pred)  # meteor_score args: (list of references, prediction string)
        # Example mapping; adjust the thresholds as needed
        if m >= 0.9:
            scores.append(2.0)
        elif m >= 0.75:
            scores.append(1.5)
        elif m >= 0.5:
            scores.append(1.0)
        elif m >= 0.3:
            scores.append(0.5)
        else:
            scores.append(0.0)
    return scores

10. Training Arguments for GRPO

training_args = GRPOConfig(
use_vllm=False,
learning_rate=5e-6,
adam_beta1=0.9,
adam_beta2=0.99,
weight_decay=0.1,
warmup_ratio=0.1,
lr_scheduler_type="cosine",
logging_steps=50,
fp16=True,
per_device_train_batch_size=4,
gradient_checkpointing=False,
gradient_accumulation_steps=2,
num_generations=8,
max_prompt_length=768,
max_completion_length=768,
num_train_epochs=11, # We'll manually loop over epochs below
save_steps=500,
max_grad_norm=0.1,
report_to='none',
output_dir="outputs",
# Specify the DeepSpeed config
deepspeed="ds_config.json"
)

11. Load Model/Tokenizer

model_name = "../model/Qwen2.5-3B-Instruct" # Adjust path as needed
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.float16,
device_map=None, # Let DeepSpeed handle it
use_cache=False
)
model.config.use_cache = True
model.to(device)

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

12. Load Dataset

dataset = get_lyric_datasets("../data_pack/multi_lyric.json")

--------------------------- Change: split into train and test sets (9 : 1) ---------------------------

train_test = dataset.train_test_split(test_size=0.1, seed=42)
train_dataset = train_test["train"]
test_dataset = train_test["test"]

13. Initialize Trainer (on the train split)

trainer = GRPOTrainer(
model=model,
processing_class=tokenizer,
reward_funcs=[
xmlcount_reward_func,
soft_format_reward_func,
strict_format_reward_func,
bleu_reward_func
],
args=training_args,
train_dataset=train_dataset,
)

14. Clean up CUDA before training

torch.cuda.empty_cache()

gc.collect()

15. Define the test-set evaluation function (with torch.no_grad())

def evaluate_model(
model,
tokenizer,
dataset: Dataset,
batch_size: int = 2,
max_new_tokens: int = 512
):
"""
Simple evaluation function: generate translations batch by batch, extract <answer>, and compute BLEU, TER, and length_acc.
Adjust the inference parameters (temperature, max_new_tokens, etc.) as needed.
"""
from statistics import mean

    all_preds = []
    all_refs = []
    model.eval()
    with torch.no_grad():
        for i in range(0, len(dataset), batch_size):
            batch = dataset[i:i + batch_size]
            prompts = [item["prompt"] for item in batch]
            refs = [item["answer"] for item in batch]

            inputs = []
            for p in prompts:
                # p is a list[{'role': ..., 'content': ...}, ...]
                # Simply concatenate the system + user contents here
                role_content = [pc["content"] for pc in p]
                joined = "\n".join(role_content)
                inputs.append(joined)

            encodings = tokenizer(inputs, return_tensors="pt", padding=True, truncation=True).to(device)
            outputs = model.generate(
                **encodings,
                max_new_tokens=max_new_tokens,
                do_sample=False  # greedy decoding
            )
            for o in outputs:
                text = tokenizer.decode(o, skip_special_tokens=True)
                pred_ans = extract_xml_answer(text)
                all_preds.append(pred_ans)
            all_refs.extend(refs)

    bleu_scores = compute_bleu(all_preds, all_refs, tokenizer)
    ter_scores = compute_ter(all_preds, all_refs, tokenizer)
    len_acc_scores = compute_length_acc(all_preds, all_refs, tokenizer)

    metrics = {
        "bleu": mean(bleu_scores) if bleu_scores else 0.0,
        "ter": mean(ter_scores) if ter_scores else 0.0,
        "length_acc": mean(len_acc_scores) if len_acc_scores else 0.0
    }
    return metrics

16. Manual per-epoch training and evaluation

num_epochs = int(training_args.num_train_epochs)
for epoch in range(num_epochs):
print(f"======= Starting Epoch {epoch+1}/{num_epochs} =======")
# 临时把 num_train_epochs 设为1,执行单次 epoch 训练
trainer.args.num_train_epochs = 1
trainer.train()

text
# 评估 test_metrics = evaluate_model(model, tokenizer, test_dataset) print(f"\n===== Evaluation at Epoch {epoch+1} =====") print(f"Test BLEU: {test_metrics['bleu']:.4f}") print(f"Test TER: {test_metrics['ter']:.4f}") print(f"Test Length Acc: {test_metrics['length_acc']:.4f}") print("==========================================")

Reflection

Examining model configurations

I checked the use of cache settings and gradient checkpointing in the model configuration. This might clarify the exception issue.

Adjusting configurations

Considering the "fp16" setting and "ds_config.json" for DeepSpeed, along with reducing "gradient_accumulation_steps" or batch size, and examining the train dataset size.

Adjusting training parameters

Reducing batch size, learning rate, or using "bf16" are potential solutions to prevent gradient overflows. Also, tweaking configurations like "max_grad_norm" and "train_dataset" size could help.

Tackling memory issues

I’m curious about potential memory problems, possibly linked to token generation and prompt/completion lengths. Adjusting configurations like "fp16_auto_cast" or "min_loss_scale" might mitigate these issues.

Contemplating bf16 and NaN handling

I'm investigating enabling bf16 if the GPU supports it, adjusting learning rates, and managing NaN values in the data.

Addressing gradient checkpointing

I'm noting the potential issue of "use_cache=False" being overridden by "model.config.use_cache=True," possibly causing memory usage spikes. Adjusting configurations to align with gradient checkpointing needs might help.

Offering guidance

To alleviate the "current loss scale already at minimum" error, I recommend enabling gradient checkpointing, tweaking the learning rate or batch size, or opting for "bf16" in the ds_config.

Laying out the options

I’m suggesting setting "min_loss_scale=1" in ds_config, tweaking "trainer.args.num_train_epochs," or resolving contradictory settings like "model.config.use_cache=True" with "use_cache=False" in from_pretrained.

Addressing HPC environment issues

I'm noting the importance of resolving HPC environment errors, possibly by tweaking "ds_config.json" and modifying training arguments to enable "bf16" and disable "fp16" for optimization.

Revisiting the approach

OK, let me see. Disabling dynamic loss scaling, using 'bf16' instead of 'fp16', and tweaking batch size and configuration settings could effectively resolve the "Current loss scale already at minimum" error.

Reasoned about loss scale error for 29 seconds

Response

The error below means that during mixed-precision training DeepSpeed kept hitting gradient overflows, so the dynamic loss scale was reduced all the way to its minimum value; it cannot be lowered any further, and the run is forcibly terminated:

Exception: Current loss scale already at minimum - cannot decrease scale anymore. Exiting run.

This usually points to one of the following common causes (the sketch after this list illustrates how the dynamic loss scaler ends up in that state):

  1. Learning rate too high: gradients blow up easily.
  2. Batch size or sequence length too large: memory is tight, or gradient values fluctuate too much under FP16.
  3. Contradictory configuration: for example, having use_cache=True enabled while also wanting gradient checkpointing (or having set use_cache=False elsewhere) causes internal conflicts.
  4. Model depth or other hyperparameter settings that make the network prone to numerical overflow under FP16 (e.g. weight decay too large, poorly chosen momentum, etc.).
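To make the failure mode concrete, here is a minimal sketch (not DeepSpeed's actual implementation; the class and argument names are illustrative) of how a dynamic loss scaler ends up raising this exception:

```python
class ToyDynamicLossScaler:
    """Toy model of FP16 dynamic loss scaling (illustrative only)."""

    def __init__(self, init_scale=2 ** 16, scale_window=1000, min_scale=1.0):
        self.scale = init_scale           # the loss is multiplied by this before backward()
        self.scale_window = scale_window  # consecutive clean steps before scaling back up
        self.min_scale = min_scale
        self.good_steps = 0

    def update_scale(self, has_overflow: bool):
        if has_overflow:
            if self.scale <= self.min_scale:
                # Every recent step overflowed and the scale cannot drop further:
                # this is the state behind "Current loss scale already at minimum".
                raise RuntimeError("Current loss scale already at minimum - cannot decrease scale anymore.")
            self.scale = max(self.scale / 2, self.min_scale)
            self.good_steps = 0
        else:
            self.good_steps += 1
            if self.good_steps >= self.scale_window:
                self.scale *= 2
                self.good_steps = 0
```

In other words, the exception itself is not the root cause; it only tells you that gradients overflowed on many consecutive steps, which is why all of the fixes below aim at making the optimization numerically stable.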

I. Quick fixes

To resolve it, try the following ideas; the common theme is to make training as "stable" as possible and avoid NaNs or gradient explosions:

  1. Switch to BF16 precision (if the hardware supports it, this is the simplest fix)

    • Most A100 / 40GB and newer architectures support BF16; it has a much more forgiving numeric range and is nearly as fast as FP16 in practice.
    • With HF Transformers + DeepSpeed, disable fp16 and set "bf16": {"enabled": true} in the DeepSpeed config, and set bf16=True, fp16=False in the corresponding Trainer config.
  2. Lower the learning rate

    • For example, reduce 5e-6 to 1e-6 or lower, then check whether it still blows up.
  3. Reduce the batch size or sequence length

    • For example, change per_device_train_batch_size from 4 to 2, or moderately reduce max_prompt_length / max_completion_length.
    • Insufficient GPU memory, long sequences, or a large model all make gradient values less stable.
  4. Make sure the configuration has no conflicts

    • If you want gradient checkpointing, you generally need use_cache=False.
    • You turn the cache off in AutoModelForCausalLM.from_pretrained(..., use_cache=False) but turn it back on in the next line with model.config.use_cache = True; this is a conflict, so keep them consistent.
  5. Raise or pin the minimum loss scale in the DeepSpeed config

    • For example, modify ds_config.json:

      ```json
      "fp16": {
        "enabled": true,
        "min_loss_scale": 1
      }
      ```

      or do finer-grained AMP configuration so the loss scale is not repeatedly driven all the way down to 1 before the run errors out. (A fuller FP16 config sketch follows right after this list.)
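For reference, if you decide to stay on FP16, a fuller fp16 block for ds_config.json might look like the sketch below. The key names are standard DeepSpeed fp16 options; the values are only starting points and should be tuned for your run:

```json
{
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "initial_scale_power": 16,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  }
}
```

With "loss_scale": 0 DeepSpeed uses dynamic scaling; raising initial_scale_power or min_loss_scale gives the scaler more room before it bottoms out, but it does not fix the underlying overflow if the gradients themselves are diverging.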

II. Example: switch to BF16 and remove conflicting settings

Here is a directly workable set of changes, focusing on three points:

  • Use BF16 instead of FP16 (if the GPU supports it).
  • Remove or comment out model.config.use_cache = True so it no longer conflicts with use_cache=False.
  • Reduce per_device_train_batch_size a bit to further lower the chance of gradient overflow.

1. Example DeepSpeed config (ds_config.json)

Suppose your current ds_config.json looks roughly like this:

```json
{
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 2,
  "gradient_clipping": 1.0,
  "bf16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": 5e-6,
      "betas": [0.9, 0.99],
      "eps": 1e-8,
      "weight_decay": 0.1
    }
  }
}
```

Compared with the original, the fp16 block is replaced by "bf16": {"enabled": true}; just make sure no fp16 block remains. This enables BF16 training.

2. Trainer configuration changes on the Python side

In your original GRPOConfig, drop fp16=True and use:

```python
training_args = GRPOConfig(
    use_vllm=False,
    learning_rate=5e-6,
    adam_beta1=0.9,
    adam_beta2=0.99,
    weight_decay=0.1,
    warmup_ratio=0.1,
    lr_scheduler_type="cosine",
    logging_steps=50,
    # fp16=True,                      # <--- comment out
    bf16=True,                        # <--- enable BF16
    per_device_train_batch_size=2,    # <--- try a smaller batch
    gradient_checkpointing=False,
    gradient_accumulation_steps=2,
    num_generations=8,
    max_prompt_length=768,
    max_completion_length=768,
    num_train_epochs=11,
    save_steps=500,
    max_grad_norm=0.1,
    report_to='none',
    output_dir="outputs",
    # Specify the DeepSpeed config
    deepspeed="ds_config.json"
)
```

3. Avoid the use_cache conflict

These two statements in your code contradict each other:

```python
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map=None,
    use_cache=False   # <-- disabled here
)
model.config.use_cache = True  # <-- but re-enabled here
```

If you really want training plus gradient checkpointing, it is generally recommended to keep use_cache off, so change the code to:

```python
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,  # switch to BF16
    device_map=None,
    use_cache=False
)
# model.config.use_cache = True  # comment out this line
```

If your GPU does not support BF16, you can keep this line as FP16, but watch out for overflow; alternatively, drop torch_dtype entirely and let DeepSpeed handle the precision itself.
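If you are not sure whether the GPU supports BF16, a quick runtime check (a small sketch, assuming a reasonably recent PyTorch) is:

```python
import torch

# True on Ampere (A100, A30, RTX 30xx) and newer GPUs; False on e.g. V100
use_bf16 = torch.cuda.is_available() and torch.cuda.is_bf16_supported()
dtype = torch.bfloat16 if use_bf16 else torch.float16
print(f"BF16 supported: {use_bf16}, using dtype: {dtype}")
```

You can then feed dtype into torch_dtype= and toggle bf16 / fp16 in the training config accordingly.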

4. Check whether training is stable

  • If the same error still appears, keep lowering the learning rate (e.g. 1e-6) or shrink the batch size.
  • Check that the data itself is clean; empty texts, extremely long texts, or strings made up entirely of special characters can also produce abnormal gradients.
  • For larger-scale training, enable gradient_checkpointing=True (and keep use_cache=False consistent with it); this significantly reduces memory pressure. A short sketch follows below.
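If you do turn on gradient checkpointing, a minimal sketch of keeping the settings consistent (reusing the model and training_args names from the script above) is:

```python
# Gradient checkpointing and the KV cache are mutually exclusive during training
model.config.use_cache = False
model.gradient_checkpointing_enable()

# Or, equivalently, set it in the config instead of calling the method:
# training_args = GRPOConfig(..., gradient_checkpointing=True, ...)
```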

III. Conclusion

The most common fix is: switch to BF16, lower the batch size or learning rate, and keep the use_cache settings consistent. That usually avoids the "loss scale already at minimum" overflow. If the hardware really does not support BF16, you will have to shrink the hyperparameters further (batch size / learning rate / sequence length), or raise the initial loss scale in ds_config.json and allow dynamic scaling over a wider range.

You can apply the example changes above directly; removing the conflicting and overly aggressive settings will very likely resolve this error. Good luck with the tuning!
