Created: March 21, 2025
Invalidate trace cache @ step 474 and module 474: cache has only 474 modules
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/data/guoweis/zero/tinyzero/qwen3b_grpo/why_grpo_epochs.py", line 531, in <module>
[rank0]: trainer.train()
[rank0]: File "/home/data/guoweis/miniconda3/envs/zero/lib/python3.9/site-packages/transformers/trainer.py", line 2171, in train
[rank0]: return inner_training_loop(
[rank0]: File "/home/data/guoweis/miniconda3/envs/zero/lib/python3.9/site-packages/transformers/trainer.py", line 2531, in _inner_training_loop
[rank0]: tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank0]: File "/home/data/guoweis/miniconda3/envs/zero/lib/python3.9/site-packages/transformers/trainer.py", line 3712, in training_step
[rank0]: self.accelerator.backward(loss, **kwargs)
[rank0]: File "/home/data/guoweis/miniconda3/envs/zero/lib/python3.9/site-packages/accelerate/accelerator.py", line 2238, in backward
[rank0]: self.deepspeed_engine_wrapped.backward(loss, **kwargs)
[rank0]: File "/home/data/guoweis/miniconda3/envs/zero/lib/python3.9/site-packages/accelerate/utils/deepspeed.py", line 270, in backward
[rank0]: self.engine.step()
[rank0]: File "/home/data/guoweis/miniconda3/envs/zero/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2249, in step
[rank0]: self._take_model_step(lr_kwargs)
[rank0]: File "/home/data/guoweis/miniconda3/envs/zero/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2152, in _take_model_step
[rank0]: self.optimizer.step()
[rank0]: File "/home/data/guoweis/miniconda3/envs/zero/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank0]: ret_val = func(*args, **kwargs)
[rank0]: File "/home/data/guoweis/miniconda3/envs/zero/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 2089, in step
[rank0]: if self._overflow_check_and_loss_scale_update():
[rank0]: File "/home/data/guoweis/miniconda3/envs/zero/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank0]: ret_val = func(*args, **kwargs)
[rank0]: File "/home/data/guoweis/miniconda3/envs/zero/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 2034, in _overflow_check_and_loss_scale_update
[rank0]: self._update_scale(self.overflow)
[rank0]: File "/home/data/guoweis/miniconda3/envs/zero/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 2459, in _update_scale
[rank0]: self.loss_scaler.update_scale(has_overflow)
[rank0]: File "/home/data/guoweis/miniconda3/envs/zero/lib/python3.9/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 175, in update_scale
[rank0]: raise Exception(
[rank0]: Exception: Current loss scale already at minimum - cannot decrease scale anymore. Exiting run.
[rank1]: Traceback (most recent call last):
[rank1]: (call stack identical to rank0, elided)
[rank1]: Exception: Current loss scale already at minimum - cannot decrease scale anymore. Exiting run.
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
The following code produces the error above; please help me fix the bug.
import os
import re
import json
import gc
import langid
import wandb
import torch
import deepspeed
from typing import List
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk import edit_distance
from datasets import load_dataset, Dataset
from transformers import AutoTokenizer, AutoModelForCausalLM
from trl import GRPOConfig, GRPOTrainer
from nltk.translate.meteor_score import meteor_score
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Running on device: {device}")
SUPPORTED_LANGUAGES = {
"en_to_zh": ("英语", "中文"),
"zh_to_en": ("中文", "英语"),
}
SYSTEM_PROMPT = """
You are a lyrics translation assistant.
You MUST ALWAYS output in the exact XML format:
<why1>
[First question you ask yourself to complete the task]
</why1>
<why2>
[Second question you ask yourself to complete the task]
</why2>
<why3>
[Third question you ask yourself to complete the task]
</why3>
<answer>
[The final translation only goes here]
</answer>
"""
def get_lyric_datasets(path: str) -> Dataset:
"""
将指定路径的 JSON 数据集转换为 HuggingFace Dataset,并为每条数据
生成 prompt: [system, user] 及其真实翻译。
"""
data = Dataset.from_json(path)
text# Filter dataset to only include 'en_to_zh' and 'zh_to_en' data = data.filter(lambda x: x['type'] in ["en_to_zh", "zh_to_en"]) def map_fn(x): lang_src = SUPPORTED_LANGUAGES[x['type']][0] lang_tgt = SUPPORTED_LANGUAGES[x['type']][1] # 将新的 SYSTEM_PROMPT 拼接上提示 system_plus = SYSTEM_PROMPT + f"\nTranslate the following from {lang_src} to {lang_tgt}. Do not add commentary." return { 'prompt': [ {'role': 'system', 'content': system_plus}, {'role': 'user', 'content': x['lyric']} ], 'answer': x['target_lyric'] } data = data.map(map_fn) return data
def extract_xml_answer(text: str) -> str:
"""
从给定文本中提取 <answer> ... </answer> 内容。
"""
pattern = r"<answer>\s*(.?)\s</answer>"
match = re.search(pattern, text, re.DOTALL | re.IGNORECASE)
if match:
return match.group(1).strip()
return ""
def compute_length_acc(
    preds: List[str],
    refs: List[str],
    tokenizer,
    max_tolerance: float = 0.5
) -> List[float]:
    """
    Compute a length-accuracy reward: the smaller the relative difference
    between predicted and reference length, the higher the score.
    """
    rewards = []
    for pred, ref in zip(preds, refs):
        pred_tokens = tokenizer.tokenize(pred)
        ref_tokens = tokenizer.tokenize(ref)
        if len(ref_tokens) == 0:
            rewards.append(0.0)
            continue
        length_ratio = abs(len(pred_tokens) - len(ref_tokens)) / len(ref_tokens)
        if length_ratio <= 0.1:
            score = 1.0
        elif length_ratio <= 0.2:
            score = 0.8
        elif length_ratio <= 0.3:
            score = 0.6
        elif length_ratio <= 0.4:
            score = 0.4
        elif length_ratio <= 0.5:
            score = 0.2
        else:
            score = 0.0
        rewards.append(score)
    return rewards
def compute_bleu(preds: List[str], refs: List[str], tokenizer) -> List[float]:
"""
计算 BLEU 分数列表
"""
smoothie = SmoothingFunction().method1
weights = (0.25, 0.25, 0.25, 0.25)
scores = []
for pred, ref in zip(preds, refs):
pred_tokens = tokenizer.tokenize(pred)
ref_tokens = tokenizer.tokenize(ref)
if not pred_tokens or not ref_tokens:
scores.append(0.0)
continue
bleu = sentence_bleu(
[ref_tokens],
pred_tokens,
weights=weights,
smoothing_function=smoothie
)
scores.append(bleu)
return scores
def compute_ter(preds: List[str], refs: List[str], tokenizer) -> List[float]:
"""
计算 TER (Translation Edit Rate)。
"""
ter_scores = []
for pred, ref in zip(preds, refs):
pred_tokens = tokenizer.tokenize(pred)
ref_tokens = tokenizer.tokenize(ref)
if len(ref_tokens) == 0:
if len(pred_tokens) > 0:
ter_scores.append(100.0)
else:
ter_scores.append(0.0)
continue
dist_val = edit_distance(pred_tokens, ref_tokens)
ter = (dist_val / len(ref_tokens)) * 100
ter_scores.append(ter)
return ter_scores
def detect_language(text: str) -> str:
"""
使用 langid 自动检测语言,返回语言代码。
"""
return langid.classify(text)[0]
def reward_func_decorator(func):
    def wrapper(prompts, completions, answer, step=0, **kwargs):
        # Pre-/post-processing around func can be added here;
        # make sure to forward the call and return its result
        return func(prompts, completions, answer, step=step, **kwargs)
    return wrapper
def length_acc_reward_func(prompts, completions, answer, step=0, **kwargs) -> List[float]:
"""示例奖励函数:关注预测的长度。"""
responses = [completion[0]['content'] for completion in completions]
extracted_responses = [extract_xml_answer(r) for r in responses]
text# 打印一些示例数据 if step % 200 == 0: print(f"\n[Step {step}] Displaying up to 3 samples from the batch:\n") for i in range(min(3, len(responses))): q = prompts[i][-1]['content'] print("-" * 20) print(f"Sample {i + 1}") print(f"Question:\n{q}") print(f"GroundTruth Answer:\n{answer[i]}") print(f"Model Response (raw):\n{responses[i]}") print(f"Extracted <answer>:\n{extracted_responses[i]}") print("-" * 20) length_rewards = compute_length_acc( preds=extracted_responses, refs=answer, tokenizer=tokenizer ) return length_rewards
def bleu_reward_func(prompts, completions, answer, step=0, **kwargs) -> List[float]:
"""
计算 BLEU 分数,并转换为某种自定义奖励区间。
"""
responses = [c[0]["content"] for c in completions]
q = prompts[0][-1]['content']
extracted = [extract_xml_answer(r) for r in responses]
print('-' * 20, f"Original Lyrics:\n{q}", f"\nAnswer:\n{answer[0]}", f"\nResponse:\n{responses[0]}",
f"\nExtracted:\n{extracted[0]}")
bleu_scores = compute_bleu(preds=extracted, refs=answer, tokenizer=tokenizer)
rewards = []
for score in bleu_scores:
if score >= 0.9:
rewards.append(5.0)
elif score >= 0.8:
rewards.append(4.5)
elif score >= 0.7:
rewards.append(4.0)
elif score >= 0.6:
rewards.append(3.5)
elif score >= 0.5:
rewards.append(2.5)
elif score >= 0.4:
rewards.append(2.0)
elif score >= 0.3:
rewards.append(1.5)
elif score >= 0.2:
rewards.append(1.0)
elif score >= 0.1:
rewards.append(0.5)
else:
rewards.append(0.0)
return rewards
def ter_reward_func(completions, answer, step=0, **kwargs) -> List[float]:
"""
计算 TER 分数,并映射到某种奖励区间。
"""
responses = [c[0]["content"] for c in completions]
extracted = [extract_xml_answer(r) for r in responses]
ter_scores = compute_ter(preds=extracted, refs=answer, tokenizer=tokenizer)
rewards = []
for t in ter_scores:
if t >= 80:
rewards.append(0.0)
elif t >= 60:
rewards.append(0.5)
elif t >= 40:
rewards.append(1.0)
elif t >= 20:
rewards.append(1.5)
else:
rewards.append(2.0)
return rewards
def language_recognition(completions, answer, step=0, **kwargs) -> List[float]:
"""
简单地检测 预测语种 是否与 参考答案语种 相同,以此给予奖励。
"""
responses = [c[0]["content"] for c in completions]
extracted = [extract_xml_answer(r) for r in responses]
rewards = []
for pred, ref in zip(extracted, answer):
if not pred.strip():
rewards.append(0.0)
continue
pred_lang = detect_language(pred)
ref_lang = detect_language(ref)
rewards.append(1.0 if pred_lang == ref_lang else 0.0)
return rewards
def strict_format_reward_func(completions, answer=None, step=0, **kwargs) -> list:
    """
    Strict format check: the output must contain
    <why1>...</why1> <why2>...</why2> <why3>...</why3> <answer>...</answer>
    in that exact order.
    """
    pattern = (
        r"<why1>[\s\S]+?</why1>\s*"
        r"<why2>[\s\S]+?</why2>\s*"
        r"<why3>[\s\S]+?</why3>\s*"
        r"<answer>[\s\S]+?</answer>"
    )
    responses = [completion[0]["content"] for completion in completions]
    scores = []
    for r in responses:
        if re.search(pattern, r):
            scores.append(1.0)
        else:
            scores.append(0.0)
    return scores
def soft_format_reward_func(completions, answer=None, step=0, **kwargs) -> list:
"""
对输出做软格式检查:
只要包含所有四组标签即可,无论顺序或中间有无其他内容,都给一定的分数。
"""
responses = [completion[0]["content"] for completion in completions]
scores = []
for r in responses:
tags_present = all(
tag in r
for tag in ["<why1>", "</why1>", "<why2>", "</why2>", "<why3>", "</why3>", "<answer>", "</answer>"]
)
scores.append(0.5 if tags_present else 0.0)
return scores
def xmlcount_reward_func(completions, answer=None, step=0, **kwargs) -> List[float]:
"""
对XML标签出现次数进行计分,每出现一个正确的起始和结束标签加分。
如果 </answer> 后面还有残余文字,则额外扣分。
"""
def count_xml(text) -> float:
count = 0.0
text# 针对新的四段式标签:why1、why2、why3、answer,每个起止标签 0.125 分 # 总分最高1.0 if text.count("<why1>") == 1: count += 0.125 if text.count("</why1>") == 1: count += 0.125 if text.count("<why2>") == 1: count += 0.125 if text.count("</why2>") == 1: count += 0.125 if text.count("<why3>") == 1: count += 0.125 if text.count("</why3>") == 1: count += 0.125 if text.count("<answer>") == 1: count += 0.125 if text.count("</answer>") == 1: count += 0.125 # 如果 </answer> 后还有多余文本,稍微给个负分 if "</answer>" in text: leftover = text.split("</answer>")[-1] count -= len(leftover.strip()) * 0.001 # 夹在 0 到 1 之间,不要出现负分 if count < 0: count = 0.0 return count responses = [c[0]["content"] for c in completions] return [count_xml(c) for c in responses]
def meteor_reward_func(completions, answer, step=0, tokenizer=None, **kwargs) -> List[float]:
"""
2) 基于 METEOR 分数的语义相似度
- 使用 nltk.translate.meteor_score.meteor_score 计算分值
- 再简单映射到 [0, 2.0] 区间
"""
responses = [c[0]["content"] for c in completions]
extracted_preds = [extract_xml_answer(r) for r in responses]
textscores = [] for pred, ref in zip(extracted_preds, answer): if not pred.strip(): scores.append(0.0) continue m = meteor_score([ref], pred) # meteor_score 参数: (参考列表, 预测字符串) # 映射区间,这里只是示例,你可根据需要调节 if m >= 0.9: scores.append(2.0) elif m >= 0.75: scores.append(1.5) elif m >= 0.5: scores.append(1.0) elif m >= 0.3: scores.append(0.5) else: scores.append(0.0) return scores
training_args = GRPOConfig(
    use_vllm=False,
    learning_rate=5e-6,
    adam_beta1=0.9,
    adam_beta2=0.99,
    weight_decay=0.1,
    warmup_ratio=0.1,
    lr_scheduler_type="cosine",
    logging_steps=50,
    fp16=True,
    per_device_train_batch_size=4,
    gradient_checkpointing=False,
    gradient_accumulation_steps=2,
    num_generations=8,
    max_prompt_length=768,
    max_completion_length=768,
    num_train_epochs=11,  # We'll manually loop over epochs below
    save_steps=500,
    max_grad_norm=0.1,
    report_to='none',
    output_dir="outputs",
    # Specify the DeepSpeed config
    deepspeed="ds_config.json"
)
model_name = "../model/Qwen2.5-3B-Instruct" # Adjust path as needed
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map=None,  # Let DeepSpeed handle placement
    use_cache=False
)
model.config.use_cache = True
model.to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
dataset = get_lyric_datasets("../data_pack/multi_lyric.json")
train_test = dataset.train_test_split(test_size=0.1, seed=42)
train_dataset = train_test["train"]
test_dataset = train_test["test"]
trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[
        xmlcount_reward_func,
        soft_format_reward_func,
        strict_format_reward_func,
        bleu_reward_func
    ],
    args=training_args,
    train_dataset=train_dataset,
)
def evaluate_model(
    model,
    tokenizer,
    dataset: Dataset,
    batch_size: int = 2,
    max_new_tokens: int = 512
):
    """
    Simple evaluation: generate translations, extract <answer>, and compute
    BLEU, TER, and length_acc. Adjust inference parameters (temperature,
    max_new_tokens, etc.) as needed.
    """
    from statistics import mean

    all_preds = []
    all_refs = []
    model.eval()
    with torch.no_grad():
        for i in range(0, len(dataset), batch_size):
            # Note: slicing a HF Dataset returns a dict of columns,
            # so index row by row to get a list of examples
            batch = [dataset[j] for j in range(i, min(i + batch_size, len(dataset)))]
            prompts = [item["prompt"] for item in batch]
            refs = [item["answer"] for item in batch]
            inputs = []
            for p in prompts:
                # p is a list of {'role': ..., 'content': ...} dicts;
                # simply join the system and user contents
                role_content = [pc["content"] for pc in p]
                joined = "\n".join(role_content)
                inputs.append(joined)
            encodings = tokenizer(inputs, return_tensors="pt", padding=True, truncation=True).to(device)
            outputs = model.generate(
                **encodings,
                max_new_tokens=max_new_tokens,
                do_sample=False  # greedy decoding
            )
            for o in outputs:
                text = tokenizer.decode(o, skip_special_tokens=True)
                pred_ans = extract_xml_answer(text)
                all_preds.append(pred_ans)
            all_refs.extend(refs)

    bleu_scores = compute_bleu(all_preds, all_refs, tokenizer)
    ter_scores = compute_ter(all_preds, all_refs, tokenizer)
    len_acc_scores = compute_length_acc(all_preds, all_refs, tokenizer)
    metrics = {
        "bleu": mean(bleu_scores) if bleu_scores else 0.0,
        "ter": mean(ter_scores) if ter_scores else 0.0,
        "length_acc": mean(len_acc_scores) if len_acc_scores else 0.0
    }
    return metrics
num_epochs = int(training_args.num_train_epochs)
for epoch in range(num_epochs):
    print(f"======= Starting Epoch {epoch+1}/{num_epochs} =======")
    # Temporarily set num_train_epochs to 1 to run a single training epoch
    trainer.args.num_train_epochs = 1
    trainer.train()

    # Evaluate
    test_metrics = evaluate_model(model, tokenizer, test_dataset)
    print(f"\n===== Evaluation at Epoch {epoch+1} =====")
    print(f"Test BLEU: {test_metrics['bleu']:.4f}")
    print(f"Test TER: {test_metrics['ter']:.4f}")
    print(f"Test Length Acc: {test_metrics['length_acc']:.4f}")
    print("==========================================")
The error below means that, during mixed-precision training, DeepSpeed's dynamic loss scale was halved repeatedly because gradients kept overflowing, until it reached its configured minimum; it cannot be lowered any further, so the run is forcibly terminated:

Exception: Current loss scale already at minimum - cannot decrease scale anymore. Exiting run.
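For intuition, a dynamic loss scaler roughly halves the scale on overflow and grows it back only after a streak of clean steps; the exception above fires when halving would go below the configured floor. The sketch below is a simplified illustration of that policy only; the class name, defaults, and `update` method are illustrative and are not DeepSpeed's actual API.

```python
class ToyLossScaler:
    """Simplified dynamic loss scaling: halve on overflow, double after a
    streak of clean steps. Illustrative only, not DeepSpeed's real scaler."""

    def __init__(self, scale=2.0 ** 16, min_scale=1.0, growth_interval=100):
        self.scale = scale
        self.min_scale = min_scale
        self.growth_interval = growth_interval
        self.clean_steps = 0

    def update(self, has_overflow: bool):
        if has_overflow:
            self.clean_steps = 0
            if self.scale <= self.min_scale:
                # The condition that aborts the run in the traceback above
                raise RuntimeError("loss scale already at minimum")
            self.scale = max(self.scale / 2, self.min_scale)
        else:
            self.clean_steps += 1
            if self.clean_steps >= self.growth_interval:
                self.scale *= 2
                self.clean_steps = 0
```

If every step overflows (e.g. because the learning rate is too high or FP16's range is too small for the gradients), the scale ratchets down to `min_scale` and the next overflow raises, which is exactly the failure mode in the log.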
This usually comes down to one of a few common causes: gradients that repeatedly overflow the FP16 range, or an internal configuration conflict, e.g. keeping `use_cache=True` while also wanting gradient checkpointing (or setting `use_cache=False` in one place and overriding it elsewhere). To fix it, the usual approach is to make training as stable as possible so it does not produce NaNs or exploding gradients:
1. Switch to BF16 precision (if the hardware supports it, this is the simplest fix). In the `deepspeed` config, change `"fp16.enabled": true` to `"bf16.enabled": true`, and in the corresponding Trainer config set `bf16=True, fp16=False`.
2. Lower the learning rate. Reduce `5e-6` to `1e-6` or smaller, then watch whether the overflow still occurs.
3. Reduce the batch size or sequence length. Change `per_device_train_batch_size` from `4` to `2`, or moderately reduce `max_prompt_length` / `max_completion_length`.
4. Make sure the configuration has no conflicts, and keep `use_cache=False`. The code disables the cache in `AutoModelForCausalLM.from_pretrained(..., use_cache=False)` but then re-enables it on the next line with `model.config.use_cache = True`; this is a conflict, so keep the two consistent.
5. Raise or pin the minimum loss scale in the DeepSpeed config. In `ds_config.json`:

```json
"fp16": {
    "enabled": true,
    "min_loss_scale": 1
}
```

Or use finer-grained AMP settings so the loss scale is not repeatedly driven all the way down to 1 until it errors out.
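To find out which step first produces bad gradients, a quick finite-check before the optimizer step can help. This is a hedged sketch in plain PyTorch; `find_nonfinite_grads` is an illustrative helper, and under ZeRO-3 the full gradients are sharded, so in the real run you would need to go through DeepSpeed's own utilities rather than reading `p.grad` directly.

```python
import torch

def find_nonfinite_grads(named_parameters):
    """Return the names of parameters whose gradients contain NaN or Inf.
    Illustrative debugging helper; assumes unsharded .grad tensors."""
    bad = []
    for name, p in named_parameters:
        if p.grad is not None and not torch.isfinite(p.grad).all():
            bad.append(name)
    return bad
```

Call it after `backward()` and before the optimizer step; if it fires on most steps, lowering the learning rate or switching to BF16 is usually more effective than raising the loss scale.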
Below is a directly workable set of changes, focused on switching to BF16 and removing `model.config.use_cache = True` so it no longer conflicts with `use_cache=False`. Suppose your current `ds_config.json` looks something like this:
json{ "train_micro_batch_size_per_gpu": 1, "gradient_accumulation_steps": 2, "gradient_clipping": 1.0, "bf16": { "enabled": true }, "zero_optimization": { "stage": 3, "overlap_comm": true, "contiguous_gradients": true }, "optimizer": { "type": "AdamW", "params": { "lr": 5e-6, "betas": [ 0.9, 0.99 ], "eps": 1e-8, "weight_decay": 0.1 } } }
Compared with the original, the `"fp16": {"enabled": ...}` section has been replaced by `"bf16": {"enabled": true}`; just make sure no leftover `fp16` section remains. This enables BF16 training.
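A note on why BF16 sidesteps the loss-scale machinery entirely: BF16 keeps float32's 8-bit exponent (trading mantissa precision for range), so gradient magnitudes that overflow FP16 still fit and dynamic loss scaling is simply not needed. You can verify the ranges locally:

```python
import torch

# FP16 tops out at 65504, so large gradients overflow easily;
# BF16 shares float32's exponent range (max on the order of 1e38)
fp16 = torch.finfo(torch.float16)
bf16 = torch.finfo(torch.bfloat16)
print(f"float16 max:  {fp16.max}")
print(f"bfloat16 max: {bf16.max}")
```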
In your original `GRPOConfig`, remove `fp16=True` and change it to:

```python
training_args = GRPOConfig(
    use_vllm=False,
    learning_rate=5e-6,
    adam_beta1=0.9,
    adam_beta2=0.99,
    weight_decay=0.1,
    warmup_ratio=0.1,
    lr_scheduler_type="cosine",
    logging_steps=50,
    # fp16=True,                    # <--- commented out
    bf16=True,                      # <--- enable BF16
    per_device_train_batch_size=2,  # <--- try a smaller batch
    gradient_checkpointing=False,
    gradient_accumulation_steps=2,
    num_generations=8,
    max_prompt_length=768,
    max_completion_length=768,
    num_train_epochs=11,
    save_steps=500,
    max_grad_norm=0.1,
    report_to='none',
    output_dir="outputs",
    # Specify the DeepSpeed config
    deepspeed="ds_config.json"
)
```
Fix the `use_cache` conflict. These two statements in your code contradict each other:

```python
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map=None,
    use_cache=False   # <-- disabled here
)
model.config.use_cache = True  # <-- but re-enabled here
```
If you really want training plus gradient checkpointing, it is generally recommended to turn `use_cache` off. So change it to:

```python
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,  # switch to BF16
    device_map=None,
    use_cache=False
)
# model.config.use_cache = True  # remove/comment out this line
```
If your GPU does not support BF16, you can leave this as FP16, but be careful about overflow; alternatively, drop the `torch_dtype` argument and let DeepSpeed handle the dtype itself.
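Before committing to BF16, you can check support at runtime. `torch.cuda.is_bf16_supported()` is a real PyTorch API (Ampere and newer GPUs return True); the `pick_dtype` helper wrapping it below is an illustrative sketch.

```python
import torch

def pick_dtype() -> torch.dtype:
    """Prefer bfloat16 when the GPU supports it; otherwise fall back to
    float16. On CPU-only machines bfloat16 is generally usable.
    Illustrative helper, not part of any library."""
    if torch.cuda.is_available():
        return torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
    return torch.bfloat16

print(f"Selected dtype: {pick_dtype()}")
```

You could then pass the result as `torch_dtype=pick_dtype()` when loading the model, keeping the FP16 fallback path aware of the overflow caveats above.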
If training is still unstable, lower the learning rate further (e.g. to `1e-6`) or shrink the batch size. You can also enable `gradient_checkpointing=True` (keeping it consistent with `use_cache=False`) to significantly reduce memory pressure.

In short, the most common fix is: switch to BF16, reduce the batch size or learning rate, and keep the `use_cache` settings consistent. That usually avoids the "loss scale already at minimum" overflow. If the hardware really does not support BF16, the remaining options are to shrink the hyperparameters further (batch / lr / seq_len), or to raise the initial loss scale in `ds_config.json` and allow a wider dynamic-scaling range.

You can apply the example changes above directly; removing the conflicting and overly aggressive settings will very likely resolve this error. Good luck with the tuning!