Move 37 和 LLM
Created on: January 29, 2025
Translate the following English text into Chinese:
"Move 37" is the word of the day - it's when an AI, trained via the trial-and-error process of reinforcement learning, discovers actions that are new, surprising, and secretly brilliant even to expert humans. It is a magical, just slightly unnerving, emergent phenomenon only achievable by large-scale reinforcement learning. You can't get there by expert imitation. It's when AlphaGo played move 37 in Game 2 against Lee Sedol, a weird move that was estimated to only have 1 in 10,000 chance to be played by a human, but one that was creative and brilliant in retrospect, leading to a win in that game. We've seen Move 37 in a closed, game-like environment like Go, but with the latest crop of "thinking" LLM models (e.g. OpenAI-o1, DeepSeek-R1, Gemini 2.0 Flash Thinking), we are seeing the first very early glimmers of things like it in open world domains. The models discover, in the process of trying to solve many diverse math/code/etc. problems, strategies that resemble the internal monologue of humans, which are very hard (/impossible) to directly program into the models. I call these "cognitive strategies" - things like approaching a problem from different angles, trying out different ideas, finding analogies, backtracking, re-examining, etc. Weird as it sounds, it's plausible that LLMs can discover better ways of thinking, of solving problems, of connecting ideas across disciplines, and do so in a way we will find surprising, puzzling, but creative and brilliant in retrospect. It could get plenty weirder too - it's plausible (even likely, if it's done well) that the optimization invents its own language that is inscrutable to us, but that is more efficient or effective at problem solving. The weirdness of reinforcement learning is in principle unbounded. I don't think we've seen equivalents of Move 37 yet. I don't know what it will look like. I think we're still quite early and that there is a lot of work ahead, both engineering and research. But the technology feels on track to find them.
「Move 37」是今日的關鍵字——它指的是一個透過增強式學習的反覆試驗過程所訓練的 AI,發現對人類專家而言也全新、令人驚奇且暗藏深刻巧思之行動的時刻。這是一種神奇、又稍稍令人不安的湧現現象,只有透過大規模增強式學習才可能出現。單純模仿專家並無法抵達那種境界。「Move 37」這個名詞來自 AlphaGo 與李世乭第二局比賽中的第 37 手——這是一個極其怪異的下法,被估算在真人棋手身上僅有萬分之一的機率出現,但事後證明它極具創意與高明之處,最終助 AlphaGo 贏得那一局的勝利。
我們在圍棋這種封閉、類遊戲環境中見識到「Move 37」,但隨著新一代「會思考」的大型語言模型(例如 OpenAI-o1、DeepSeek-R1、Gemini 2.0 Flash Thinking)的出現,我們首次在更開放的領域隱約看到了類似跡象。這些模型在嘗試解決各式各樣的數學、程式碼等問題的過程中,會自行發現類似人類「內在思考」的策略,而這些策略幾乎無法直接被程式設計或硬性規定進模型裡。我把它們稱為「認知策略」——像是從不同角度思考問題、嘗試不同做法、尋找類比、回溯、反覆檢視等等。雖然聽起來有些不可思議,但很可能這些模型能發現更好的思考方法、更有效的解決問題途徑,以及跨領域連結想法的方式,並用我們或許會感到驚奇、甚至困惑,但在事後又會讚嘆其創意與高妙的形式展現出來。它也可能變得更加古怪——假如整個優化過程做得好,模型甚至可能自創一種我們完全無法理解的語言,卻能更有效率或更有成效地進行問題解決。增強式學習的怪異與可能性,理論上是沒有上限的。
我不認為我們已經見到真正相當於「Move 37」的案例,也不確定它最終會以什麼形式出現。我相信我們仍處在相當早期的階段,還需要大量的工程與研究努力。然而,就目前的技術發展走向來看,我們似乎正在通往那一步的路上。
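The trial-and-error dynamic the passage describes can be illustrated with a toy reinforcement-learning sketch (illustrative only; the function name, reward values, and hyperparameters below are my own assumptions, not part of the original text). An epsilon-greedy bandit mostly exploits its current best estimate but occasionally tries a random action, and that small amount of exploration is enough to discover that an initially unremarkable option has the highest payoff:

```python
import random

def epsilon_greedy_bandit(true_means, steps=20000, epsilon=0.1, seed=0):
    """Minimal trial-and-error loop: estimate each arm's value from noisy
    rewards, mostly exploiting the current best arm while occasionally
    exploring a random one."""
    rng = random.Random(seed)
    n = len(true_means)
    counts = [0] * n          # how many times each arm was pulled
    estimates = [0.0] * n     # running mean reward per arm
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: try an arm at random, even an unpromising one.
            arm = rng.randrange(n)
        else:
            # Exploit: play the arm that currently looks best.
            arm = max(range(n), key=lambda a: estimates[a])
        reward = true_means[arm] + rng.gauss(0.0, 1.0)  # noisy feedback
        counts[arm] += 1
        # Incremental update of the running mean for this arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates

# Arm 3 has the highest true payoff even though nothing marks it as
# special a priori; pure trial and error still identifies it.
est = epsilon_greedy_bandit([0.2, 0.5, 0.4, 0.9])
best = max(range(len(est)), key=lambda a: est[a])
```

In miniature, this mirrors the essay's claim: the winning action is found through feedback from trying things, not through imitating a prior expert policy — which is why imitation alone cannot produce a Move 37.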