These are my two USB-related operation files plus a log file that captured an error; I want to know why my USB device...

Created: October 17, 2025

Model used for the answers: GPT-5 Thinking by Chat01

Question

usblog.log
watch_usbgadget.sh
usbdevice.sh

These are my two USB-related operation files plus a log file that captured an error. I want to know why my USB device intermittently fails to enumerate. Based on the log and the script files, please analyze and reason through the relevant process in detail, find the real cause of the failure, and provide a fix.

Python

python
# Let's open the uploaded files to analyze content
import os, textwrap, json, sys, re, pathlib

paths = ["/mnt/data/usblog.log", "/mnt/data/watch_usbgadget.sh", "/mnt/data/usbdevice.sh"]
for p in paths:
    try:
        print("====", p, "====")
        with open(p, 'r', errors='ignore') as f:
            data = f.read()
        print(data[:3000])
        print("...length", len(data))
    except Exception as e:
        print("Error reading", p, e)

Python

python
# Read watch_usbgadget.sh and usbdevice.sh fully
for p in ["/mnt/data/watch_usbgadget.sh", "/mnt/data/usbdevice.sh"]:
    print("====", p, "====")
    with open(p, 'r', errors='ignore') as f:
        s = f.read()
    print(s[:2000])
    print("\n... Tail ...\n")
    print(s[-2000:])
    print("Length:", len(s))

Thinking

Investigating potential bug

It seems like the file watch_usbgadget.sh actually contains the content of the usbdevice script, which might be misnamed. At the end, there seems to be a bug with lock file management: FD 9 is acquiring the lock, but FD 3 is being released even though it wasn’t locked. This could cause issues where the script thinks another command is running. I’ll need to dig deeper into this.

Thought for 1m 3s

Python

python
s2 = open('/mnt/data/usbdevice.sh', 'r', errors='ignore').read()
print("len usbdevice.sh:", len(s2), "lines:", len(s2.splitlines()))
print(s2[:2000])
print("\n...tail...\n")
print(s2[-2000:])

Python

python
lines = s2.splitlines()
print(lines[665:680])

Python

python
for i in range(650, 690):
    try:
        print(i + 1, lines[i])
    except:
        pass

Python

python
import re, sys

for i, line in enumerate(lines, start=1):
    if "UDC" in line:
        print(i, line)

Python

python
print("\n".join(f"{i+1:03d} {lines[i]}" for i in range(630, 690)))

Python

python
for i, line in enumerate(lines, start=1):
    if "USB state" in line or "state changed" in line:
        print(i, line)

Python

python
s_watch = open('/mnt/data/watch_usbgadget.sh', 'r', errors='ignore').read()
watch_lines = s_watch.splitlines()
for i, line in enumerate(watch_lines, start=1):
    if "USB state" in line or "state changed" in line:
        print(i, line)

Python

python
print("\n".join(f"{i+1:03d} {watch_lines[i]}" for i in range(len(watch_lines))))

Python

python
print("\n".join(f"{i+1:03d} {watch_lines[i]}" for i in range(60, 180)))

Python

python
print(s_watch)

Python

python
for i, line in enumerate(watch_lines, start=1):
    if 'flock' in line:
        print(i, line)

Python

python
for i, line in enumerate(watch_lines, start=1):
    if "release_lock" in line:
        print(i, line)

Python

python
print("\n".join(f"{i+1:03d} {watch_lines[i]}" for i in range(1, 110)))

Python

python
for i, line in enumerate(lines, start=1):
    if 'LOCK_FILE' in line or 'flock' in line:
        print(i, line)

Python

python
for i, line in enumerate(lines, start=1):
    if re.search(r'^usb_prepare\(\)', line):
        start = i
        break
print("usb_prepare at line", start)
print("\n".join(f"{j:03d} {lines[j-1]}" for j in range(start, start + 80)))

Python

python
for i, line in enumerate(lines, start=1):
    if re.search(r'^usb_start\(\)', line):
        start = i
        break
print("usb_start at line", start)
print("\n".join(f"{j:03d} {lines[j-1]}" for j in range(start, start + 80)))

Python

python
for i, line in enumerate(lines, start=1):
    if re.search(r'^usb_init\(\)', line):
        start = i
        print("usb_init at", start)
        print("\n".join(f"{j:03d} {lines[j-1]}" for j in range(start, start + 120)))
        break

Python

python
for i, line in enumerate(lines, start=1):
    if re.search(r'^usb_disable\(\)', line):
        start = i
        print("usb_disable at", start)
        print("\n".join(f"{j:03d} {lines[j-1]}" for j in range(start, start + 80)))
        break

Python

python
for i in range(640, 680):
    print(f"{i+1:03d} {lines[i]}")

Python

python
for i, line in enumerate(lines, start=1):
    if re.search(r'^usb_run_stage\(\)', line):
        start = i
        print("usb_run_stage at", start)
        print("\n".join(f"{j:03d} {lines[j-1]}" for j in range(start, start + 120)))
        break

Python

python
for i in range(690, 760):
    print(f"{i+1:03d} {lines[i]}")

Thinking


Thought for 1m 36s

Python

python
for i, line in enumerate(lines, start=1):
    if 'USB state' in line or 'USB_STATE' in line:
        print(i, line)

Python

python
print("\n".join(f"{i+1:03d} {lines[i]}" for i in range(488, 610)))

Python

python
print("\n".join(f"{i+1:03d} {lines[i]}" for i in range(610, 760)))

Thinking


Thought for 7s

Python

python
print("\n".join(f"{i+1:03d} {lines[i]}" for i in range(610, 754)))

Python

python
for i in range(720, 748):
    print(f"{i+1:03d} {lines[i]}")

Python

python
for i, line in enumerate(lines, start=1):
    if "usb_prepare" in line:
        print(i, line)

Python

python
for i, line in enumerate(lines, start=1):
    if re.search(r'^usb_enable\(\)', line):
        start = i
        print("usb_enable at", start)
        print("\n".join(f"{j:03d} {lines[j-1]}" for j in range(start, start + 140)))
        break

Python

python
for i, line in enumerate(lines, start=1):
    if 'USB_GADGET_DIR' in line:
        print(i, line)

Python

python
for i, line in enumerate(lines, start=1):
    if 'CONFIGFS_DIR' in line or 'USB_GROUP' in line:
        print(i, line)

Python

python
import re

def find_function(code, name):
    lines = code.splitlines()
    start = None
    for i, l in enumerate(lines):
        if re.match(fr'^{name}\(\)', l.strip()):
            start = i
            break
    if start is None:
        return None
    # print next 120 lines
    return "\n".join(f"{i+1:03d} {lines[i]}" for i in range(start, min(start + 200, len(lines))))

print(find_function(s_watch, 'start_watch'))

Python

python
print(find_function(s_watch, 'monitor_usb_config'))

Python

python
print(find_function(s_watch, 'debounce_usb_state'))

Python

python
for i in range(130, 200):
    print(f"{i+1:03d} {watch_lines[i]}")

Thinking

Identifying potential issues

From what I can piece together, at specific disconnect counts (around 6 or 15), they implement safe_usbdevice_update or a restart. There could be an issue with the file path /sys/class/android_usb/android0/state, especially with certain setups. The log indicates updates occur, but with a missing driver or stale data, that could cause problems. I'll now put together a detailed analysis for the user’s request in Chinese.

Providing structured analysis

Here’s my analysis:

  • The device is a composite gadget (UAC1 + HID + ADB) on a Rockchip RK3308 using configfs. The usbdevice.sh manages the functions, and watch_usbgadget.sh handles toggling functions by monitoring the USB state and /tmp/.usb_config to recover from disconnections.

  • The failure observed: After toggling the configuration, ADB starts printing FUNCTIONFS_* events.

Analyzing failure and potential causes

Reconfiguration occurred twice at 00:19:18 and 00:19:31. On the second event, writing '' to UDC returned an ENODEV error: "No such device." After a restart, the function "FUNCTIONFS_BIND" timed out, indicating the enumeration was stuck.

Possible reasons:

  1. Race condition / thrashing: The watch script may trigger updates or restarts while usbdevice is running, causing a conflict with ongoing host enumeration, leading to parallel operations and errors like "Another usbdevice command is already running."

  2. UDC rebind too early: The script may attempt to rebind the UDC too quickly after unbinding it.

Identifying additional failure causes

  • Settling time for UDCs: Frequent toggles might cause an ENODEV error when detaching or leave the device in a stale state, which the host could ignore.

  • FunctionFS ADB handshake: ADB expects FUNCTIONFS_BIND only after configuration, but with composite devices, the host driver might need more time. Premature update triggers (like the 3s threshold) can kill adbd and cause enumeration failure.

  • Lock file mismatch: The watch script’s lock file mismatches with the one used by usbdevice, causing concurrency issues.

  • Minor bug: There’s inconsistency with file descriptors (FD) used for locking.

  • State file issue: The watch_usbgadget.sh script relies on a specific state file that might not exist, leading it to incorrectly assume the device is disconnected.

Summarizing proposed fixes and patches

The root cause of the enumeration failure is overlapping UDC unbind/bind cycles, causing FunctionFS issues. The fixes involve synchronizing scripts, adding settling times, and extending recovery thresholds to prevent interference during enumeration. I also suggest debouncing config file events and using correct device checks for UDC. Some patches to unify lock files across scripts, manage configurations more consistently, and handle errors more gracefully. Testing involves checking for stable enumeration and ensuring logs show proper behavior.

Thought for 2m 19s

Answer

Below is the complete reasoning, root-cause analysis, and actionable fix plan I arrived at after cross-referencing the three files you provided (two scripts + one log). I will cite the key fragments of your log as evidence.


Conclusion first (TL;DR)

The reason your USB device intermittently fails to enumerate is not a single hardware problem. The root cause is script-level races plus overly frequent unbind/bind cycles, which never give the host a stable window to finish enumeration:

  1. The two control scripts do not share the same mutex lock: watch_usbgadget.sh uses /var/run/usbdevice.lock while usbdevice.sh uses /var/run/.usbdevice.lock, so the two sides cannot serialize against each other. Their UDC/FunctionFS operations interleave (typical symptom: one side is "recovering" while the other fires another "update/restart"), the enumeration sequence gets interrupted, and adb's FunctionFS never sees BIND. (usblog.log)

  2. Unbinding is too aggressive and not fault-tolerant: in usbdevice.sh, usb_stop() simply does echo > .../UDC. When UDC is already empty, or the controller has just switched and the hardware state has not settled, the write returns -ENODEV, which is exactly the echo: write error: No such device in your log; it leaves an "update" stranded in a half-failed intermediate state. (usblog.log)

  3. The watchdog's recovery thresholds are too short (update/restart fire after only 3s/7.5s), which lands squarely inside the host's first/second enumeration window. Right after you bind the UDC, the watcher decides "still not connected" and triggers another update/restart, producing a flapping storm. The log shows frequent back-and-forth: CONFIGURED, then immediately pushed back to DISCONNECTED by your own script. (usblog.log)

With these three factors combined, the host occasionally lands inside your window and enumeration succeeds (hence it looks "probabilistic"), but most of the time it times out in the FunctionFS loop that keeps waiting for FUNCTIONFS_BIND. (usblog.log)


Evidence chain (key timeline and log anchors)

  1. Function switch + deliberate disconnect
    The log shows that during an update you deliberately kick the device from the configured state back to disconnected (so the functions can be reconfigured), then start it again:
  • USB state changed: CONFIGURED -> DISCONNECTED (at the start of the update).
  • The function set switches from uac1 hid adb to hid adb, then back again. (usblog.log)
  2. Error while unbinding the UDC (ENODEV)
    During the second switch:
  • /usr/bin/usbdevice: line 671: echo: write error: No such device
    This is exactly the line in usb_stop() that writes an empty string to UDC to unbind; it means the unbind happened at a moment that did not match the controller/kernel state (UDC already empty, or the device momentarily unavailable). (usblog.log)
  3. ADB/FunctionFS waits a long time for the bind (the host never completes configuration)
    Afterwards adb keeps printing:
  • timed out while waiting for FUNCTIONFS_BIND, trying again
  • Meanwhile the watcher's "USB disconnected count" climbs all the way to 60 and recovery retries stop.
    This means the device side had the gadget prepared and (tried to) bind the UDC, but the host never selected a configuration (never reached CONFIGURED), most likely because our side triggered the next round of update/restart and broke the process off. (usblog.log)
  4. The two scripts use inconsistent mutex locks, so their actions interleave
    The log contains:
  • Another usbdevice command is already running, skipping this execution; this is the watcher using its own lock to decide that "an action is already in progress" and skipping, while usbdevice.sh serializes its own lifecycle with a different lock file. Neither knows about the other, so you easily get a "you don't wait for me, I don't wait for you" interleaving. (usblog.log)
  5. Overly sensitive recovery thresholds and flapping
    The watcher's policy is to issue an update after only 3 seconds of continuous DISCONNECTED and a restart after 7.5 seconds, while also watching /tmp/.usb_config creation/deletion to run an "immediate" update. Combined with points 1)-4), it is very easy to cut the host off right when it is about to enumerate. (usblog.log)

Root-cause recap (technical view)

  • Race / timing

    • Unbind UDC → prepare functionfs/ffs.* → create symlinks → bind UDC → wait for the host to set a configuration (CONFIGURED).
    • If any of these steps has not settled yet, the watcher may trigger another update/restart, unbinding again or killing adbd.
    • What the host sees is rapid "unplug/replug with changing descriptors": enumeration occasionally succeeds, mostly fails (adb just sits waiting for FUNCTIONFS_BIND). (usblog.log)
  • Error handling

    • The unbind is a bare echo > UDC that does not ignore ENODEV; after the error the flow keeps going, which easily leaves a half-configured state. (usblog.log)
  • Split mutex locks

    • The watcher and usbdevice each have their own lock, so there is no global serialization and stages overlap. (usblog.log)

Fix it once and for all (suggested changes, minimal but key)

The changes below can be applied immediately; they do not alter the business logic, only add stability and serialization. If you like, I can also combine them into a patch for you.

1) Unify the global lock so all actions are strictly serialized

  • Point both sides at the same lock file; recommended: /var/run/usbdevice.lock
    • At the top of usbdevice.sh:
      sh
      - LOCK_FILE=/var/run/.usbdevice.lock
      + LOCK_FILE=/var/run/usbdevice.lock
    • watch_usbgadget.sh already uses /var/run/usbdevice.lock, so no rename is needed.
  • It is fine that acquire_lock in the watcher only protects the critical section where it initiates an action; the key point is that it must use the same lock as usbdevice.sh (an identical file path is enough), so the whole chain is strictly serialized. (usblog.log)
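For reference, here is a minimal sketch of what "both sides taking the same flock" looks like; the FD number and the exact acquire/release flow are illustrative, not taken from your scripts:

sh
# Both scripts open the SAME lock file and serialize on it (FD 9 is illustrative)
LOCK_FILE=/var/run/usbdevice.lock

exec 9>"$LOCK_FILE"   # open/create the shared lock file on FD 9
flock -x 9            # block until the exclusive lock is ours
# ... critical section: trigger update / touch UDC / FunctionFS here ...
# note: do not leave FD 9 open in long-lived children you spawn from here
flock -u 9            # release explicitly (also released when the FD closes)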

2) Make the UDC unbind fault-tolerant + leave settle time

In usb_stop() of usbdevice.sh, change the unbind to ignore ENODEV and add a short delay (200~300 ms):

sh
# before
echo > $USB_GADGET_DIR/UDC

# suggested
[ -e "$USB_GADGET_DIR/UDC" ] && : > "$USB_GADGET_DIR/UDC" 2>/dev/null || true
usleep 200000   # 200ms, give the controller/kernel state time to settle

This removes the side effects of the echo: write error: No such device errors seen in your log. (usblog.log)

3) Make update an explicit "prepare + start" sequence to avoid half-configured/dirty state

The update branch in your usbdevice.sh does not quite match what the log shows; to be safe, make update explicitly run:

sh
update)
	usb_enable
	usb_run_stage usb prepare
	usb_run_stage usb start
	;;

This ensures that when functions are switched, the old functions are properly stopped and the directories/symlinks end up consistent before a new round of binding starts, avoiding the corner case of "functions switched but not fully cleaned up". (usblog.log)

4) Lengthen and debounce the recovery thresholds; do not choke the host's enumeration window

watch_usbgadget.shdebounce_usb_state() 中,把策略调宽一些(示例值):

  • update: raise the trigger from 3s to 8~10s of continuous DISCONNECTED;
  • restart: raise it from 7.5s to 15~20s;
  • Lengthen the cooldown between two actions to 10~15s;
  • When reacting to /tmp/.usb_config changes, add a 2s settle period (sleep 2 first, then re-confirm the file state has not changed) before running usbdevice update; a small sketch follows below.
    These changes dramatically reduce the "just bound, then unbound again by your own script" flapping. (usblog.log)

5) Small but effective robustness changes

  • usb_start() already does "create the functions and symlinks first, then write UDC to bind"; that order is correct, keep it.
  • Keep rm -f $USB_CONFIGS_DIR/f-* and find ... ! -name 'ffs.*' | xargs rmdir, and do not clean up ffs.* (you already do this); it lowers the chance of adb's own FunctionFS crashing.
  • If possible, change the watcher's USB-state source from /sys/class/android_usb/android0/state to a check on whether the UDC file is non-empty (UDC non-empty ≈ connected/pulled up; empty ≈ disconnected), as sketched below. The current path can be temporarily missing on some kernel combinations, which leads to false "disconnected" verdicts and unnecessary recovery. (usblog.log)
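A sketch of that UDC-based check; the configfs gadget path is an assumption, so adjust it to whatever gadget directory your script actually creates:

sh
# "Connected" here means: a gadget exists and its UDC attribute is non-empty.
usb_is_bound() {
    udc_attr=$(ls /sys/kernel/config/usb_gadget/*/UDC 2>/dev/null | head -n 1)
    [ -n "$udc_attr" ] || return 1
    [ -n "$(tr -d '[:space:]' < "$udc_attr")" ]
}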

How to verify (regression and stress)

  1. Basic regression
  • After connecting to the host, the log should show one FUNCTIONFS_BIND followed by the state entering CONFIGURED, with no more No such device errors and no frequent FUNCTIONFS_DISABLE/UNBIND flapping. (usblog.log)
  2. Stress test
  • Change /tmp/.usb_config every 60s (toggling between the two function sets; see the sketch after this list) and confirm that every round is:
    • DISCONNECTED → (within 1~2s) → CONFIGURED
    • with no extra "automatic recovery actions" interleaved (they are now blocked by the cooldown and the unified lock). (usblog.log)
  3. Extreme unplug/replug and hot function switching
  • Unplug/replug or switch functions rapidly a dozen times; adb should no longer print long runs of timed out while waiting for FUNCTIONFS_BIND. (usblog.log)
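For item 2, a rough sketch of the toggle-and-check loop; the state path follows your scripts, while the CONFIGURED check, the deadline, and the iteration count are assumptions:

sh
STATE_FILE=/sys/class/android_usb/android0/state

# Wait up to $1 seconds for the host to finish configuration.
wait_configured() {
    deadline=$(( $(date +%s) + ${1:-10} ))
    while [ "$(cat "$STATE_FILE" 2>/dev/null)" != "CONFIGURED" ]; do
        [ "$(date +%s)" -lt "$deadline" ] || return 1
        sleep 1
    done
}

i=0
while [ "$i" -lt 20 ]; do
    # rewrite /tmp/.usb_config here to toggle the function set, then:
    usbdevice update
    wait_configured 10 || echo "enumeration FAILED at iteration $i" >&2
    i=$((i + 1))
    sleep 60
done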

Appendix: log "anchors" mapped to symptoms

  • USB state changed: CONFIGURED -> DISCONNECTED (disconnect caused by the script's own reconfiguration);
  • Functions changed uac1 hid adb -> hid adb / then back to uac1 hid adb (frequent function switching);
  • echo: write error: No such device (UDC unbound at the wrong moment / without fault tolerance);
  • timed out while waiting for FUNCTIONFS_BIND (the host never completed configuration, so adb never sees the bind);
  • USB disconnected count: 10/20/30... and Max retry count reached... (watcher thresholds too short, repeated actions). (usblog.log)

If you like, I can turn the changes above into a patch (diff/patch form) for you, or fine-tune the thresholds for your target platform (for example, some Windows hosts really do take a few extra seconds the first time they bind UAC1).

Question

I found that the problem is usbdevice getting stuck: all subsequent logic then fails to acquire the lock and the program deadlocks, and usbdevice is stuck at adbd.
root@rk3308-buildroot:/# fuser /var/run/usbdevice.lock
4174 8328 8330 23141
From this you can see:
4174 root {usbdevice} /bin/sh /usr/bin/usbdevice start
8328 root {usbdevice} /bin/sh /usr/bin/usbdevice update
8330 root /usr/bin/adbd
Combining this with the usbdevice file I sent you, why does it get stuck?

Thinking

Diagnosing adb behavior

I think the problem might be with how the usbdevice script interacts with adbd. It seems like the script tries to start and stop adbd, but might not handle signals properly, like SIGTERM. The log, showing "Run stage: adb_stop" and "Terminated," suggests adbd is being killed, but sometimes it respawns, causing the script to get stuck. Maybe the respawning is due to another process like init or service supervision.

Investigating lockfile behavior

It seems the issue is related to the lockfile usbdevice is using. The script potentially passes the lockfile descriptor to child processes like adbd, keeping the lockfile open. If the script doesn't close the file descriptor or set it to close-on-exec, adbd may inherit it, causing the lock to remain held even after the script tries to release it.

Flushing out flock behavior

I think the lock does remain held when adbd inherits the lockfile descriptor. On Unix, flock locks are tied to the open file description, and if the child process inherits that descriptor, the lock stays in place. This happens even if the process executes a new command, as long as the descriptor isn’t set to close.

Exploring deadlock via lock inheritance

If the script uses flock on FD 200 and passes it to adbd, it seems like adbd would keep that lock open. The parent can unlock it, but if the parent exits without releasing the lock, the child process (adbd) holds onto it. This creates a deadlock, preventing other invocations from acquiring the lock.

Explaining the issue and proposing fixes

The problem seems to stem from the lock file descriptor in the usbdevice script not being marked with CLOEXEC, causing the child adbd process to inherit the lock. This results in a situation where other processes can't acquire the lock, leading to a deadlock.

To fix this, I'd recommend adding a wrapper around critical sections with flock -n "$LOCK_FILE" -c '...' or modifying the locking process to prevent FD inheritance by using proper exec and CLOEXEC.

Clarifying the root cause and offering fixes

The issue stems from usbdevice not using FD_CLOEXEC, allowing adbd to inherit the lock file descriptor. This causes the lock to remain held even after the usbdevice script exits, blocking subsequent invocations. The fix is to either use flock -n "$LOCK_FILE" -c '...' to prevent inheritance or ensure FD_CLOEXEC is set on the lock file descriptor.

Analyzing and suggesting fixes for lock leak

The core issue is lock leakage, as adbd inherits the lock file descriptor (FD) from usbdevice, preventing future processes from acquiring the lock. This causes the blockage. The fix involves rethinking the locking mechanism, either by re-executing the script under the lock, using flock, or implementing a directory-based lock to avoid FD leakage. Also, diagnostics should include verifying no FDs are left open after applying any fix.

Thought for 2m 17s

Answer

The "hang" you are seeing is essentially not adbd itself dying; it is a global deadlock caused by leaking the lock-file FD:
after usbdevice acquires /var/run/usbdevice.lock, it lets the child process (adbd) inherit the very same, already-locked file descriptor. As long as adbd stays alive, the lock stays "held"; every later usbdevice update/start/stop fails to acquire it, and you see the "program deadlock". Your fuser output confirms exactly this: the lock file is held open by usbdevice start, usbdevice update, and adbd at the same time.

Cross-checking with your log: every "start/stop/update" spawns a new adbd and keeps cycling through BIND/UNBIND/waiting on FunctionFS, yet the lock is never released, so later usbdevice commands can only block on the lock (or get skipped by the watchdog). For example: update → stop functions → start ffs.adb → adbd keeps reporting FUNCTIONFS_BIND / FUNCTIONFS_DISABLE / "waiting for FUNCTIONFS_BIND"; the lock stays held the whole time; the next update/restart arrives and finds "Another usbdevice command is already running, skipping" (or simply hangs on flock). All of this is visible in your log. (usblog.log)


Why the leak happens (reasoning from how usbdevice is implemented)

The common pattern looks like this (your script is a variant of the same idea):

sh
LOCK_FILE=/var/run/usbdevice.lock
# open an FD pointing at the lock file
exec 200>"$LOCK_FILE"
# take an exclusive lock on FD 200
flock -x 200 || exit 1
# ... later the script starts/stops adbd ...
/usr/bin/adbd &

The problem: FD 200 does not have close-on-exec (CLOEXEC) set, so the shell passes FD 200 to adbd unchanged when it forks/execs it.
A Linux flock(2) lock is tied to the open file description; as long as any process that inherited that descriptor (here: adbd) is still alive and has not closed it, the lock is effectively held forever. So:

  • The 1st usbdevice takes the lock → starts adbd (adbd inherits the locked FD);
  • Even if the 1st usbdevice exits normally or crashes, the lock is not released, because adbd still holds that FD;
  • The 2nd usbdevice tries to take the lock → never gets it, which shows up as "stuck/skipped". A minimal standalone reproduction follows below.
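You can reproduce the mechanism in isolation with this tiny demo (standalone, not your scripts); the background sleep plays the role of adbd:

sh
# Parent subshell takes the lock, spawns a child, then exits.
(
    exec 9>/tmp/demo.lock
    flock -x 9
    sleep 300 &    # child inherits FD 9, and with it the flock
)
# The parent is gone, yet the lock is still held by the sleeping child:
flock -xn /tmp/demo.lock -c 'echo lock acquired' || echo "still locked by the child"
# (kill the sleep process afterwards to release the lock)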

Your live system is in exactly this state:
/var/run/usbdevice.lock is held open by usbdevice start, usbdevice update, and adbd simultaneously. (usblog.log)

The side effects are also visible in the log's timeline: the frequent FUNCTIONFS_BIND/FUNCTIONFS_DISABLE cycles, Another usbdevice command is already running, skipping this execution, and timed out while waiting for FUNCTIONFS_BIND all show the outer watchdog repeatedly trying to "repair" things; but since the lock is held by adbd, it either hangs or has to give up. (usblog.log)


Fix it for good: keep the lock FD away from adbd (two safe approaches)

Either one works; approach A is recommended.

Approach A (recommended): run the whole script under an outer flock process, eliminating the FD leak entirely

usbdevice 顶部加上“自我重启+上锁”包装,后面的主体不改:

sh
LOCK_FILE=/var/run/usbdevice.lock

if [ -z "$USBDEVICE_LOCKED" ]; then
    export USBDEVICE_LOCKED=1
    # Key point: the outer flock process holds the lock, and -o closes the lock FD
    # before the script body is exec'ed, so the body and its children (adbd) do not
    # inherit it. (-o/--close needs util-linux flock; with BusyBox flock, close the
    # inherited FD inside the body before spawning daemons instead.)
    exec flock -xn -o "$LOCK_FILE" "$0" "$@"
fi

# Below is the original script body (start/stop/update etc.); start/stop adbd as usual

Characteristics:

  • The lock is held by the outer flock process; we no longer rely on the "open FD 200 and pass it along" pattern;
  • With the lock FD closed before the body runs, adbd (a grandchild) does not inherit the lock;
  • When the script exits, the flock process releases the lock automatically; even if the script crashes, no stale deadlock is left behind.
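A quick way to see the new behavior (assuming the flock -xn wrapper above is in place): a second concurrent call now fails fast instead of blocking forever.

sh
usbdevice update &    # first call: the wrapper takes the lock and holds it
sleep 1               # give it a moment to actually acquire the lock
usbdevice update || echo "lock busy: second call rejected immediately instead of deadlocking"
wait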

Approach B: a directory lock (no file descriptor involved)

If flock is not available on the box, atomic mkdir can provide the mutual exclusion:

sh
LOCKDIR=/var/run/usbdevice.lock.d

if ! mkdir "$LOCKDIR" 2>/dev/null; then
    echo "another usbdevice is running" >&2
    exit 1
fi
trap 'rmdir "$LOCKDIR"' EXIT INT TERM

Characteristics:

  • It does not occupy an FD, so child processes cannot "inherit" it;
  • The semantics are simple; after a crash the directory may be left behind, but you can add a self-check at the top that cleans it up when the directory exists but its owning process is gone (sketched below).
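A sketch of that self-check; it records the owner PID inside the lock directory, and the pid-file name is an assumption:

sh
LOCKDIR=/var/run/usbdevice.lock.d

# Reclaim a stale lock left behind by a crashed owner
if [ -d "$LOCKDIR" ]; then
    owner=$(cat "$LOCKDIR/pid" 2>/dev/null)
    if [ -z "$owner" ] || ! kill -0 "$owner" 2>/dev/null; then
        rm -rf "$LOCKDIR"
    fi
fi

if ! mkdir "$LOCKDIR" 2>/dev/null; then
    echo "another usbdevice is running" >&2
    exit 1
fi
echo $$ > "$LOCKDIR/pid"
trap 'rm -rf "$LOCKDIR"' EXIT INT TERM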

Side patch (if you insist on keeping the FD-200 pattern)

If you really want to keep the exec 200>"$LOCK_FILE"; flock -x 200 style, make sure adbd inherits no FD other than 0/1/2. You can change how adbd is started to:

sh
# Launch adbd in a "clean FD space" (keep only 0/1/2)
# In BusyBox sh this generic loop closes every FD above 2
start_adbd() {
    # Close all FDs except 0/1/2 (does not affect an outer lock-holding flock process)
    for fd in /proc/$$/fd/*; do
        case "$(basename "$fd")" in
            0|1|2) ;;
            *) eval "exec $(basename "$fd")>&-" 2>/dev/null ;;
        esac
    done
    exec /usr/bin/adbd
}

Note: this approach either has to be combined with an outer lock-holding flock process (approach A), or it releases the lock too early. With the script holding the FD itself, it is hard to get "no leak and no premature release" at the same time, so approach A is still the strong recommendation.


How to verify the fix

  1. After redeploying, run one round of start/update/stop, then execute:
    fuser /var/run/usbdevice.lock
    You should see at most one lock-holding flock/usbdevice process, and never adbd again (a check script follows below).

  2. Watch the log:

    • Once usbdevice: Done start/update/stop request appears, the next usbdevice invocation should acquire the lock immediately;
    • No more false "Another usbdevice command is already running, skipping this execution";
    • adbd still goes through FUNCTIONFS_BIND/ENABLE normally, but no longer pins the lock. (usblog.log)
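A small check for step 1 that does not depend on fuser's output format (a plain /proc walk; adjust the lock path if you change it):

sh
LOCK_FILE=/var/run/usbdevice.lock
leak=0
for pid in $(pidof adbd); do
    if ls -l /proc/"$pid"/fd 2>/dev/null | grep -q "$LOCK_FILE"; then
        echo "LEAK: adbd (pid $pid) still holds $LOCK_FILE"
        leak=1
    fi
done
[ "$leak" -eq 0 ] && echo "OK: adbd does not hold the lock file"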

While you are at it, unify this as well (to avoid new deadlock points later)

  • Unify the lock paths of the watch script and the usbdevice script (recommended: /var/run/usbdevice.lock for both) so the whole chain is serialized;
  • Combined with the earlier suggestions, ignore the UDC unbind error (echo: write error: No such device) and add a 200 ms settle, to avoid pushing the system into a half-configured state. (usblog.log)

One-sentence summary:
usbdevice leaked the already-locked file descriptor to adbd, so the lock is never released → every later command piles up on the lock. Switch to "run the script body under an outer lock-holding flock process" (recommended), or use a directory lock, and this "stuck at adbd" problem is cured for good. (usblog.log)

Question

usb_start_daemon()
{
	NAME=$(echo $1 | sed "s~^[^ ]*/\([^ ]*\).*~\1~")
	TAG_FILE=$TAG_DIR.usb_$NAME

	# Enable spawn
	touch $TAG_FILE

	# Already started
	[ -z "$(usb_get_started)" ] || return 0

	# Start and spawn background daemon
	{
		exec 3<&-
		cd /; cd /
		while usb_is_enabled; do
			# Don't spawn after stopped
			[ ! -f $TAG_FILE ] || /sbin/start-stop-daemon -Sqx $@ || true
			sleep .5
		done
	}&
}

But wouldn't this part of it also make usbdevice hang?

Thinking

Thought for 2m 6s

답변

Yes, it would; in fact this is exactly the step that "carries the lock into the background" and deadlocks you.

This function does three key things:

  1. It first takes the global lock (acquired in the outer script), then
  2. inside usb_start_daemon it starts a background subshell ({ ... } &), explicitly closing only FD 3 (exec 3<&-);
  3. that background subshell then calls start-stop-daemon -S -q -x $@ every 0.5s to launch/keep alive adbd (or another daemon).

The problem is in step 2: the background subshell inherits every FD its parent has open (unless explicitly closed or marked CLOEXEC), and the global lock (usually a flock held on some FD such as 200) is among them. The result:

  • Even if the outer usbdevice main flow finishes and tries to "release" the lock, the background subshell is still holding the lock's FD, so the lock is never really released;
  • Every subsequent usbdevice start/update/stop then fails to get the lock and looks "stuck/blocked";
  • If start-stop-daemon/adbd also inherits that lock FD (BusyBox and some other implementations will not close it for you), fuser /var/run/usbdevice.lock shows usbdevice and adbd holding the lock at the same time. The fuser output you pasted earlier (with adbd's PID in it) is exactly this symptom. (usblog.log)

Two more details further increase the chance of a deadlock:

  • The while usb_is_enabled; do ... sleep .5; done loop lives on indefinitely; even if you delete TAG_FILE it merely stops spawning, but the background process itself stays alive and keeps holding the lock FD;
  • If the start-stop-daemon implementation does not clean up inherited FDs, it passes the lock on to adbd, which is why you see usbdevice start / usbdevice update / adbd all holding the same lock.

Minimal-change fix (keeping your existing logic)

At the very start of the background subshell, close every FD other than 0/1/2 (at minimum, close the FD that holds the lock). For example, if your lock lives on FD 200:

sh
usb_start_daemon() {
    NAME=$(echo "$1" | sed "s~^[^ ]*/\([^ ]*\).*~\1~")
    TAG_FILE=$TAG_DIR.usb_$NAME
    touch "$TAG_FILE"
    [ -z "$(usb_get_started)" ] || return 0
    {
        # Key: close the lock's FD inside the background subshell so it is not inherited
        exec 200>&-   # if your lock is on FD 200
        exec 3<&-     # you already closed 3; keep this line
        cd /; cd /
        while usb_is_enabled; do
            [ ! -f "$TAG_FILE" ] || /sbin/start-stop-daemon -Sqx "$@" || true
            sleep 0.5
        done
    } &
}

If you are not sure which FD the lock actually uses (or it may change in the future), be blunt about it: close every FD other than 0/1/2 in the background subshell:

sh
{
    # Close every FD other than 0/1/2 so no lock/handle can be inherited
    for fd in /proc/$$/fd/*; do
        case "$(basename "$fd")" in
            0|1|2) ;;
            *) eval "exec $(basename "$fd")>&-" 2>/dev/null ;;
        esac
    done
    cd /; cd /
    while usb_is_enabled; do
        [ ! -f "$TAG_FILE" ] || /sbin/start-stop-daemon -Sqx "$@" || true
        sleep 0.5
    done
} &

The effect: the background keeper no longer holds the lock FD, the lock is genuinely released when the outer usbdevice exits, and subsequent commands can acquire it normally.


A more robust structural fix (recommended)

Avoid the "inherited FD" trap at the root:

Option A: wrap the whole script with flock (the lock-holding process is separate from the script)

sh
LOCK_FILE=/var/run/usbdevice.lock

if [ -z "$USBDEVICE_LOCKED" ]; then
    export USBDEVICE_LOCKED=1
    # -o closes the lock FD before the script body is exec'ed (util-linux flock),
    # so neither the body nor its background children inherit the lock
    exec flock -xn -o "$LOCK_FILE" "$0" "$@"
fi

# ↓ below is the script body, including your usb_start_daemon definition and its calls
  • The lock is held by the flock parent process; your script is only its child;
  • No matter how the script forks, backgrounds, or execs, its descendants do not receive the lock's FD;
  • When the script's main process exits, flock takes the lock back; no stale deadlock is left behind.

Option B: replace the file lock with a directory lock (atomic mkdir mutual exclusion)
It does not occupy an FD, so there is nothing for children to inherit:

sh
LOCKDIR=/var/run/usbdevice.lock.d

if ! mkdir "$LOCKDIR" 2>/dev/null; then
    echo "another usbdevice is running" >&2
    exit 1
fi
trap 'rmdir "$LOCKDIR"' EXIT INT TERM

While you are at it, guard against two more "hidden hang" spots

  • When start-stop-daemon launches adbd, prefer a form that cleans up surplus FDs (implementations differ):
    • BusyBox's start-stop-daemon generally does not close FDs for you, which is why the "close FDs first in the background subshell" step above is essential.
  • Make the background loop easier to exit: besides while usb_is_enabled; do ...; done, add a small debounce that ends the background loop once TAG_FILE has been missing for N seconds (sketched below), so the background process does not linger, waste resources, or accidentally hold a lock again in the future.
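A sketch of that "give up after the tag file has been gone for a while" loop; the threshold is illustrative and the variable names follow the function above:

sh
missing=0
while usb_is_enabled; do
    if [ -f "$TAG_FILE" ]; then
        missing=0
        /sbin/start-stop-daemon -Sqx "$@" || true
    else
        missing=$((missing + 1))
        # ~10s (20 * 0.5s) without the tag file: stop keeping the daemon alive
        [ "$missing" -lt 20 ] || break
    fi
    sleep .5
done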

How to confirm the fix

  1. After applying the change, run usbdevice start once, then run fuser /var/run/usbdevice.lock:
    • you should see only one (short-lived) usbdevice/flock process holding the lock;
    • adbd's PID should never show up as a holder again.
  2. Run usbdevice update/stop/start repeatedly; nothing should get "stuck waiting for the lock" any more;
  3. The log should no longer show false "Another usbdevice command is already running, skipping…" messages. (usblog.log)

One-sentence answer to your question:
Yes, this usb_start_daemon() can deadlock usbdevice, because it lets the lock-holding file descriptor be inherited by the background subshell (and potentially by adbd), so the lock is never released. Apply the "close FDs first in the background subshell" fix or the "outer flock process holds the lock" fix above and the problem is gone for good.
