
DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search

2024/8/23

Papers Read on AI

People
Huajian Xin et al.
Topics
We introduce DeepSeek-Prover-V1.5, an open-source language model for theorem proving in Lean 4 that improves on DeepSeek-Prover-V1 by optimizing both the training and inference processes. The model is pre-trained on DeepSeekMath-Base with a focus on formal mathematical languages, supervised fine-tuned on a formal theorem proving dataset enhanced from DeepSeek-Prover-V1, and further improved through reinforcement learning from proof assistant feedback (RLPAF). Going beyond the single-pass whole-proof generation of DeepSeek-Prover-V1, we propose RMaxTS, an improved Monte-Carlo tree search algorithm that uses an intrinsic-reward-driven exploration strategy to generate diverse proof paths. DeepSeek-Prover-V1.5 achieves significant improvements and new state-of-the-art results on the high school level miniF2F benchmark and the undergraduate level ProofNet benchmark, reaching 63.5% on the miniF2F test set and 25.3% on the ProofNet test set. We also conduct detailed ablation studies on the training strategy, the different prompting modes (CoT and non-CoT), and the individual components of RMaxTS, showing that the combination of reinforcement learning, chain-of-thought prompting, and RMaxTS is essential for improving the model's theorem proving ability.
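For readers unfamiliar with Lean 4, the toy example below shows what a formal statement and proof in the target language look like; it is purely illustrative and not taken from the paper's data or benchmarks.

```lean
-- Toy Lean 4 example (illustrative only, not from the paper's datasets):
-- the prover receives the statement and must generate a proof that Lean accepts.
theorem add_comm_example (a b : Nat) : a + b = b + a := by
  exact Nat.add_comm a b
```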

Chapters
DeepSeek-Prover-V1.5 is an open-source language model for theorem proving in Lean 4 that enhances its predecessor by optimizing both training and inference. It uses reinforcement learning from proof assistant feedback (RLPAF) and a Monte-Carlo tree search variant (RMaxTS) to generate diverse proof paths. The model achieves state-of-the-art results on the miniF2F and ProofNet benchmarks.
  • Open-source language model for theorem proving in Lean 4
  • Optimized training and inference processes
  • Uses RLPAF and RMaxTS
  • State-of-the-art results on miniF2F (63.5%) and ProofNet (25.3%) benchmarks

Shownotes

We introduce DeepSeek-Prover-V1.5, an open-source language model designed for theorem proving in Lean 4, which enhances DeepSeek-Prover-V1 by optimizing both training and inference processes. Pre-trained on DeepSeekMath-Base with specialization in formal mathematical languages, the model undergoes supervised fine-tuning using an enhanced formal theorem proving dataset derived from DeepSeek-Prover-V1. Further refinement is achieved through reinforcement learning from proof assistant feedback (RLPAF). Beyond the single-pass whole-proof generation approach of DeepSeek-Prover-V1, we propose RMaxTS, a variant of Monte-Carlo tree search that employs an intrinsic-reward-driven exploration strategy to generate diverse proof paths. DeepSeek-Prover-V1.5 demonstrates significant improvements over DeepSeek-Prover-V1, achieving new state-of-the-art results on the test set of the high school level miniF2F benchmark (63.5%) and the undergraduate level ProofNet benchmark (25.3%).
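The sketch below illustrates the intrinsic-reward idea behind RMaxTS as described above: a rollout earns a bonus whenever it adds a previously unseen node (e.g., a new tactic state) to the search tree, which drives exploration even before a verified proof supplies any extrinsic reward. All names, data structures, and the selection rule here are assumptions made for illustration; this is a minimal sketch, not the paper's implementation.

```python
# Minimal sketch of an RMax-style intrinsic reward inside a Monte-Carlo tree
# search loop. Hypothetical structure for illustration only; not the paper's code.
import math
import random


class TreeNode:
    def __init__(self, state):
        self.state = state            # e.g. a serialized Lean tactic state
        self.children = {}            # action -> TreeNode
        self.visits = 0
        self.value = 0.0              # running mean of (extrinsic + intrinsic) reward


def ucb_select(node, c=1.4):
    """Pick the child maximizing a UCB1-style score (stand-in for the paper's rule)."""
    return max(
        node.children.values(),
        key=lambda ch: ch.value + c * math.sqrt(math.log(node.visits + 1) / (ch.visits + 1)),
    )


def rollout(root, propose_actions, apply_action, is_proved, seen_states):
    """One selection/expansion/backpropagation step with an RMax-style bonus."""
    node, path = root, [root]
    # Selection: walk down existing children via the UCB rule.
    while node.children:
        node = ucb_select(node)
        path.append(node)
    # Expansion: sample a candidate action (e.g. a model-generated tactic).
    action = random.choice(propose_actions(node.state))
    next_state = apply_action(node.state, action)
    child = TreeNode(next_state)
    node.children[action] = child
    path.append(child)
    # Intrinsic reward: 1 if this state was never seen anywhere in the tree, else 0.
    intrinsic = 1.0 if next_state not in seen_states else 0.0
    seen_states.add(next_state)
    # Extrinsic reward: 1 if the proof assistant accepts a complete proof.
    extrinsic = 1.0 if is_proved(next_state) else 0.0
    reward = extrinsic + intrinsic
    # Backpropagation: update visit counts and running means along the path.
    for n in path:
        n.visits += 1
        n.value += (reward - n.value) / n.visits
```

The callbacks (`propose_actions`, `apply_action`, `is_proved`) are placeholders standing in for the language model's proof generation and the Lean proof assistant's verification, which the paper couples with the tree search.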

2024: Huajian Xin, Z. Z. Ren, Jun-Mei Song, Zhihong Shao, Wanjia Zhao, Haocheng Wang, Bo Liu, Liyue Zhang, Xuan Lu, Qiushi Du, W. Gao, Qihao Zhu, Dejian Yang, Zhibin Gou, Z. F. Wu, Fuli Luo, Chong Ruan

https://arxiv.org/pdf/2408.08152