Papers Read on AI, with Rob. Keeping you up to date with the latest research. DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search.
Authored 2024 by:
Huajian Xin, Z.Z. Ren, Junxiao Song, Zhihong Shao, Wanjia Zhao, Haocheng Wang, Bo Liu, Liyue Zhang, Xuan Lu, Qiushi Du, Wenjun Gao, Qihao Zhu, Dejian Yang, Zhibin Gou, Z.F. Wu, Fuli Luo, Chong Ruan.
We introduce DeepSeek-Prover-V1.5, an open-source language model designed for theorem proving in Lean 4, which enhances DeepSeek-Prover-V1 by optimizing both training and inference processes. Pre-trained on DeepSeekMath-Base with specialization in formal mathematical languages, the model undergoes supervised fine-tuning using an enhanced formal theorem proving dataset derived from DeepSeek-Prover-V1.
Further refinement is achieved through reinforcement learning from proof assistant feedback (RLPAF). Beyond the single-pass whole-proof generation approach of DeepSeek-Prover-V1, we propose RMaxTS, a variant of Monte Carlo tree search that employs an intrinsic-reward-driven exploration strategy to generate diverse proof paths.
DeepSeek-Prover-V1.5 demonstrates significant improvements over DeepSeek-Prover-V1, achieving new state-of-the-art results on the test set of the high-school-level MiniF2F benchmark (63.5%) and the undergraduate-level ProofNet benchmark (25.3%). 1. Introduction. 1.1. Contributions.
We present a comprehensive framework for developing a language model-based formal mathematics prover, integrating several key components: large-scale mathematical pre-training, formal mathematics corpus construction and augmentation, online reinforcement learning from proof assistant feedback, and a tree search methodology for long-term planning and theorem proving.
The pre-trained model, supervised fine-tuned model, and reinforcement learning model, along with the code for the Monte Carlo tree search algorithm, are publicly available for further research and application. Pre-training. We enhance our base model's capabilities in formal theorem proving and mathematical reasoning by further pre-training on high-quality mathematics and code data, with a focus on formal languages such as Lean, Isabelle, and Metamath.
Supervised fine-tuning. We improve the Lean 4 code completion dataset by implementing two data augmentation techniques. First, we use DeepSeek-Coder-V2 236B (Zhu et al., 2024) to annotate natural language chain-of-thought comments alongside Lean 4 code, aligning formal theorem proving with natural language reasoning.
Second, we insert intermediate tactic state information within the Lean 4 proof code, enabling our model to leverage compiler feedback effectively. The resulting dataset is then used to fine-tune the pre-trained model. Reinforcement learning. We employ the GRPO algorithm (Shao et al., 2024) to perform reinforcement learning from proof assistant feedback (RLPAF) on the supervised fine-tuned model.
Verification results from the Lean prover serve as reward supervision, enhancing the model's alignment with the formal specifications of the verification system. Monte Carlo tree search. We advance the tree search method in formal theorem proving by introducing a novel abstraction and a corresponding search algorithm. Our truncate-and-resume mechanism acts as a state-action abstraction, seamlessly integrating the tree search process into the whole-proof generation framework.
We present RMaxTS, an innovative Monte Carlo tree search algorithm that leverages the RMax strategy (Brafman and Tennenholtz, 2002) to tackle exploration challenges in sparse-reward proof search problems. By assigning intrinsic rewards, this algorithm encourages the prover agent to generate diverse planning paths, thereby fostering extensive exploration of the proof space.
1.2 Summary of Evaluations and Metrics. MiniF2F. In the single-pass whole-proof generation setting, DeepSeek-Prover-V1.5 achieved a pass rate of 60.2% on the test set of MiniF2F, an absolute improvement of 10.2 percentage points over DeepSeek-Prover-V1's 50.0%. Incorporating tree search techniques further elevated the pass rate to a new state of the art of 63.5%.
ProofNet. DeepSeek-Prover-V1.5 also demonstrated strong performance in the single-pass whole-proof generation setting for ProofNet, with pass rates of 21.6% on the validation set and 23.7% on the test set. The integration of tree search techniques further enhanced these results, achieving new state-of-the-art pass rates of 25.4% on the validation set and 25.3% on the test set. 2. Model Training. 2.1.
Pre-training. To enhance our language model's proficiency in generating formal proofs and reasoning through mathematical language, we further pre-train our base model, DeepSeekMath-Base (Shao et al., 2024). This refinement involved training on high-quality datasets that include both code and natural language mathematical content. We specifically focused on formal languages widely used in proof assistants, such as Lean, Isabelle, and Metamath. We designate this improved model as DeepSeek-Prover-V1.5-Base.
2.2 Supervised Fine-Tuning. In this section, we explore the methodology and processes involved in the supervised fine-tuning of DeepSeek-Prover-V1.5. Specifically, we augment the proof dataset from DeepSeek-Prover-V1 by adding detailed explanatory comments. This enhancement aims to improve the alignment between natural language descriptions and Lean 4 code, thereby facilitating better formal mathematical reasoning.
Additionally, we incorporate intermediate tactic state information. Training setting. We conduct supervised fine-tuning based on the pre-trained model and train for 9B tokens, using a batch size of 2048 and a constant learning rate of 1e-4. The training process begins with 100 warm-up steps to stabilize the learning dynamics. Training examples are randomly concatenated to form sequences, with a maximum context length of 4096 tokens.
2.3 Reinforcement Learning from Proof Assistant Feedback. Reinforcement learning has been proven effective in enhancing the mathematical reasoning capabilities of supervised fine-tuned language models. To further advance DeepSeek-Prover-V1.5-SFT, we incorporate a reinforcement learning phase, resulting in the model DeepSeek-Prover-V1.5-RL.
This phase leverages RL to enhance performance based on verification feedback from the Lean 4 prover. The specifics of this RL process are detailed below. Prompts. In the reinforcement learning stage, we use a subset of theorem statements from the supervised fine-tuning dataset as training prompts. We select theorems for which DeepSeek-Prover-V1.5-SFT has a moderate success rate in generating correct proofs upon multiple attempts.
This ensures that the model has room for improvement while still being able to receive positive feedback. After filtering, we retain approximately 4.5K unique theorem statements. Each theorem is prefixed with both CoT and non-CoT guiding prompts to enhance the model's proof generation capabilities in both modes. Rewards. When training LLMs via RL, a trained reward model typically provides feedback signals.
In contrast, formal theorem proving benefits from the rigorous verification of generated proofs by proof assistants, offering a significant advantage. Specifically, each generated proof receives a reward of 1 if verified as correct and 0 otherwise. While this binary reward signal is accurate, it is also sparse, especially for theorems that are challenging for the supervised fine-tuned model.
To mitigate this sparsity, we select training prompts that are challenging yet achievable for the supervised fine-tuned model, as described above. Reinforcement learning algorithm. We employ Group Relative Policy Optimization (GRPO; Shao et al., 2024) as our RL algorithm, which has demonstrated superior effectiveness and efficiency compared to PPO (Schulman et al., 2017), primarily because it eliminates the necessity of training an additional critic model.
Specifically, GRPO samples a group of candidate proofs for each theorem prompt and optimizes the model based on the relative rewards of the outputs within the group. Our prompt selection strategy is designed to likely include both correct and incorrect proofs among the candidates, aligning well with the group relative nature of GRPO and thereby enhancing the training process.
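To make the group-relative idea concrete, here is a minimal Python sketch (illustrative only, not the authors' implementation) of how binary Lean verification rewards within one sampled group could be turned into normalized advantages; the function name and exact normalization are assumptions.

```python
# Minimal sketch of GRPO-style group-relative advantages for one theorem prompt.
# Assumes binary Lean verification rewards (1 = proof verified, 0 = failed);
# names and the exact normalization are illustrative, not from the paper.
from typing import List

def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Score each candidate proof relative to its group's mean and std."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Example: a group of 8 candidate proofs where 3 were verified by Lean.
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0]
print(group_relative_advantages(rewards))
# Verified proofs receive positive advantages and failed ones negative; these
# advantages then weight the policy-gradient update for the corresponding tokens.
```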
Training setting. We conduct RL training based on the SFT model, which serves as both the initial model and the reference model for imposing the Kullback-Leibler divergence penalty. We use a constant learning rate of 5e-6, and the KL penalty coefficient is set to 0.02. For each theorem, we sample a group of 32 candidate proofs, with a maximum length of 2048 tokens. The training batch size is configured to 512.
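For readability, the RL hyperparameters just described can be summarized as a plain configuration dictionary; this is only a restatement of the settings above with hypothetical key names, not the authors' actual configuration schema.

```python
# Hedged summary of the RL training settings described above; key names are
# illustrative, the values restate the text.
grpo_training_config = {
    "initial_and_reference_model": "DeepSeek-Prover-V1.5-SFT",  # also anchors the KL penalty
    "learning_rate": 5e-6,           # constant
    "kl_penalty_coefficient": 0.02,
    "group_size": 32,                # candidate proofs sampled per theorem
    "max_proof_length_tokens": 2048,
    "train_batch_size": 512,
    "num_training_theorems": 4500,   # filtered prompts, each with CoT and non-CoT variants
}
```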
2.4. Evaluation. Benchmarks. We evaluate theorem proving performance on the following benchmarks to compare model capabilities after each training stage. MiniF2F (Zheng et al., 2022) focuses on formal problem-solving skills for high school level exercises and competitions, such as AMC, AIME, and IMO, with an emphasis on algebra and number theory.
The benchmark includes 244 validation and 244 test problems, originally in Lean 3 and manually converted to Lean 4.9.0, based on the version provided by Yang et al. (2023). ProofNet (Azerbayev et al., 2023) evaluates formal theorem-proving capabilities at the undergraduate level in mathematics.
It comprises 185 validation and 186 test problems from widely used undergraduate textbooks, covering real and complex analysis, linear algebra, abstract algebra, and topology. These problems were initially in Lean 3 and manually converted to Lean 4.9.0. Prompting configurations. For each proof attempt of DeepSeek-Prover-V1.5-Base, we independently sample three proof demonstrations from the validation set to construct the few-shot prompts.
For the MiniF2F benchmark, we use human-written proofs from Yang et al. (2023), while for the ProofNet benchmark, we use correct proofs generated by DeepSeek-Prover-V1.5-RL as few-shot demonstrations. For DeepSeek-Prover-V1.5-SFT and DeepSeek-Prover-V1.5-RL, we employ two types of guiding prompts: one that encourages chain-of-thought (CoT) reasoning before each proof step, and one that does not (non-CoT).
Detailed examples are provided in Appendix A. Metric. We evaluate theorem proving performance using the pass@K accuracy metric, which measures the model's success in generating a correct proof within K attempts. Each model is deployed on a single A100-40G GPU, utilizing the vLLM framework (Kwon et al., 2023) for sample generation.
The sampling parameters are set with a temperature of 1, a top-p value of 0.95, and a maximum token limit of 2048. The generated proofs are then verified using the Lean 4 theorem prover. For this verification, we import Mathlib4 (mathlib Community, 2020) and Aesop (Limperg and From, 2023) to access predefined premises and tactics. The verification process is subject to a time limit of 300 seconds.
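As a side note on the metric, the commonly used unbiased estimator of pass@K (popularized by code-generation evaluations) can be computed as below; whether the authors use this exact estimator or plain empirical pass rates is not stated in this reading, so treat it as an assumption.

```python
# Standard unbiased pass@k estimator: n samples drawn, c of them verified correct.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k proofs drawn from n samples is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 3 verified proofs among 128 samples, evaluated at k = 32.
print(pass_at_k(n=128, c=3, k=32))
```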
Comparison across training stages. Figure 3 presents a comparative analysis of each training stage on the MiniF2F and ProofNet datasets. Our base model, DeepSeek-Prover-V1.5-Base, achieves a notable pass rate, solving nearly one-third of the problems on the test set of the MiniF2F benchmark using three-shot prompting.
The supervised fine-tuning stage, resulting in DeepSeek-Prover-V1.5-SFT, significantly outperforms the base model, with pass@128 accuracy increasing by approximately two-thirds on MiniF2F and doubling on ProofNet. The subsequent reinforcement learning stage further enhances the model's performance, improving pass@K accuracy across all values of K. In contrast to findings in natural language mathematics, such as those reported in DeepSeekMath (Shao et al., 2024), where reinforcement learning primarily improves the ranking of correct responses within the top-K candidates, we observe a genuine enhancement of fundamental capabilities in formal theorem proving. This improvement is evident not only with a small sample budget but also remains stable as the sample budget increases.
This conclusion is further supported by later Monte Carlo tree search experiments with larger sample budgets, as discussed in Section 4.2. Comparison between CoT and non-CoT. We compare the performance of the non-CoT and CoT generation modes for both DeepSeek-Prover-V1.5-SFT and DeepSeek-Prover-V1.5-RL. The results in Figure 3 demonstrate that the CoT mode consistently outperforms the non-CoT mode across most settings.
Specifically, DeepSeek-Prover-V1.5-RL, leveraging these enhanced theorem-proving patterns, achieves superior performance on both benchmarks, with an average accuracy of 51.6% on MiniF2F and 18.2% on ProofNet. The integration of natural language reasoning in CoT mode significantly enhances the planning and execution of formal proof writing.
For a detailed comparison of proof strategies with and without the use of natural language chain of thought, refer to the examples provided in Appendix A. 3. Exploration-Oriented Monte Carlo Tree Search. 3.1. Tactic-Level Tree Abstraction. To implement the tree search method in the whole-proof generation setting, we introduce a proof tree abstraction to define the tailored state and action space, leveraging a truncate-and-resume mechanism.
Roughly following the paradigm of Yao et al. 2023, we begin by decomposing an incomplete proof into a sequence of tree nodes that correspond to individual proof steps, and then we utilize the partial content stored in these tree nodes to continue the proof generation process. Figure 4 illustrates the process of constructing a proof search tree from whole proof generation.
Truncate. Proof decomposition into tree nodes. We construct the proof search tree at the tactic level, where each tree edge represents a single transition step of the tactic state. Initially, we submit the entire model-generated proof to the Lean prover to parse it into tactics. We then truncate the proof at the earliest verification error, ensuring that all retained tactic codes can be successfully applied to advance the proof towards the desired theorem.
The retained tactic codes are segmented into several code fragments, each containing a valid tactic code and its associated chain-of-thought comments, corresponding to a single tree edge that represents a tactic state transition. Through this abstraction, each generated proof is converted into a series of tree nodes, forming a path from the root to a specific node.
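To illustrate the truncate step, here is a small Python sketch, assuming the generated proof has already been split into tactic-level fragments and that Lean feedback gives the index of the first failing fragment; the data structures and helper names are hypothetical, not the paper's code.

```python
# Illustrative truncate step: keep only the prefix of tactics Lean verified.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Fragment:
    code: str         # one tactic plus its preceding chain-of-thought comments
    goal_state: str   # tactic state reported by Lean after applying the tactic

def truncate_proof(fragments: List[Fragment],
                   first_error_index: Optional[int]) -> List[Fragment]:
    """Drop everything from the earliest verification error onward."""
    if first_error_index is None:     # the whole proof compiled; nothing to cut
        return fragments
    return fragments[:first_error_index]

# The surviving prefix becomes a root-to-node path in the search tree,
# one tree edge per verified tactic-state transition.
```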
Resume. Proof generation from a tree node. In Lean 4, different tactics can lead to the same tactic state, meaning each node in our proof tree can correspond to various tactic codes that achieve the same outcome. To handle this, we store a set of these equivalent tactic codes at each node. When the tree search agent expands a node, it randomly selects one tactic to use as a prompt for the language model.
This prompt includes the incomplete proof code ending with the chosen tactic, plus the tactic state information from the Lean prover as a comment block. The fine-tuned model (see Section 2.2) has been trained to recognize and utilize this format, using the incomplete code augmented with tactic state comments to guide subsequent proof generation.
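The resume step can be pictured with the sketch below, which assembles a prompt from the incomplete proof and the Lean tactic state wrapped in a comment block; the exact prompt template is not given in this reading, so the formatting, theorem, and goal-state text here are purely illustrative.

```python
# Hypothetical resume prompt: incomplete proof ending with the chosen tactic,
# followed by the Lean tactic state as a comment block.
def build_resume_prompt(theorem_header: str, verified_prefix: str, tactic_state: str) -> str:
    state_comment = "/- tactic state:\n" + tactic_state + "\n-/"
    return theorem_header + "\n" + verified_prefix + "\n  " + state_comment + "\n"

prompt = build_resume_prompt(
    "theorem my_thm (a b : Nat) : a + b = b + a := by",
    "  induction b",
    "case zero\na : Nat\n⊢ a + 0 = 0 + a",   # illustrative goal-state text
)
# The fine-tuned model continues from this prompt, emitting the remaining
# tactics as a single whole-proof completion.
print(prompt)
```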
3.2 Interactive Theorem Proving via Monte Carlo Tree Search. Our proof search tree is developed using the standard Monte Carlo tree search paradigm (Coulom, 2006; Browne et al., 2012), which iteratively applies four steps: selection, expansion, simulation, and backpropagation. We integrate the simulation step into expansion because our whole-proof generation model inherently performs a rollout from the expanded node.
The detailed design of the algorithm workflow is as follows: The selection step, aka the tree policy, starts from the root node and traverses downward to identify a promising node for expansion. The objective of this algorithmic step is to trade off between exploration and exploitation . The tree policy at a tree node S is computed by selecting the action that maximizes the value from the set of valid operations,
TreePolicy(s) = argmax over a ∈ Children(s) ∪ {∅} of Q_UCB(s, a),  (Eq. 1)
where Q(s, a) denotes a sample-based estimation of action values derived from the selection history, functioning as the exploitation component that retrieves high-value candidates from previous trials. UCB(s, a) denotes the exploration bonus computed by upper confidence bounds (Auer, 2002), which diminishes with repeated execution of the state-action pair (s, a). More specifically, Q_UCB(s, a), combining the value estimate with this bonus, stands for an optimistic estimation of Q(s, a) and can serve as an upper bound with high probability. We defer the discussion of detailed settings of node values and the UCB bonus to Section 3.3.
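For intuition, a minimal, self-contained Python sketch of the selection rule in Eq. 1 follows. It is not the authors' implementation: the node fields and exploration constant are assumptions, and for simplicity it always walks down to a leaf rather than explicitly modeling the virtual "expand here" action ∅.

```python
# Illustrative MCTS selection: descend by maximizing an optimistic Q + UCB score.
import math
from typing import Dict

class Node:
    def __init__(self) -> None:
        self.children: Dict[str, "Node"] = {}   # keyed by tactic-state signature
        self.visit_count = 0
        self.value_sum = 0.0

def q_ucb(parent: "Node", child: "Node", c: float = 2.0) -> float:
    if child.visit_count == 0:
        return float("inf")                      # unvisited actions look maximally promising
    exploit = child.value_sum / child.visit_count
    explore = math.sqrt(c * math.log(parent.visit_count) / child.visit_count)
    return exploit + explore

def select(root: "Node") -> "Node":
    node = root
    while node.children:
        node = max(node.children.values(), key=lambda ch: q_ucb(node, ch))
    return node   # node nominated for expansion by whole-proof generation
```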
Expansion. The next step is invoking the proof generation model to expand the node nominated by the selection phase. Resuming the incomplete proof codes stored on the node designated for expansion, we perform whole-proof generation to propose a series of subsequent tactics and submit the generated proof to the Lean prover for verification. Such a trial of proof completion is equivalent to conducting a single rollout of simulation within the standard MCTS framework.
When the verification result indicates the proof is complete, the search procedure is ready to be terminated, having found a new proof of the desired theorem. Otherwise, we parse the verification feedback and truncate the generated proof at the earliest verification error. The remaining tactics are transformed into a path of nodes to be merged into the search tree.
It is important to note that, because we use the whole-proof generation setting, where the output is an entire proof consisting of a sequence of tactics rather than just the next tactic, our expansion procedure may insert a path of tree nodes into the search tree during each iteration. This differs from conventional MCTS designed for competitive games, which typically expands only one layer of children nodes per iteration (Silver et al., 2016-2018; Schrittwieser et al., 2020).
Backpropagation. The final phase of each tree search iteration is to update value statistics along the selection trajectory from the root to the expanded node, i.e., updating the values associated with the tree policy stated in Eq. 1. Let τ = (root = s_1, a_1, s_2, a_2, ..., s_{|τ|} = s_t, ∅) denote the selection trajectory of the t-th iteration that ends with s_t as the expanding node. We update Q_UCB(s, a) for all (s, a) ∈ τ by taking the most recent trajectory reward R(τ) into account (details in Eq. 7). The extrinsic source of rewards comes from the compiler feedback, specifically assigning a reward of R_extrinsic = 1 for completed proofs and R_extrinsic = 0 for unsolved ones. In Section 3.3, we will introduce an intrinsic reward mechanism to augment the reward assignment and enhance the agent's incentive for exploration.
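A bare-bones sketch of the backpropagation update follows, reusing the illustrative Node fields from the selection sketch above; the reward here is the extrinsic 0/1 compiler signal, with the intrinsic term of Section 3.3 added on top in the full algorithm.

```python
# Illustrative backpropagation: fold the trajectory reward R(tau) into every
# node statistic along the root-to-expansion selection path.
from typing import Iterable

def backpropagate(path: Iterable["Node"], reward: float) -> None:
    for node in path:
        node.visit_count += 1
        node.value_sum += reward   # e.g. 1.0 for a completed proof, 0.0 otherwise
```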
3.3. Intrinsic Rewards for Monte Carlo Tree Search. In the search problem of formal theorem proving, the extrinsic rewards are extremely sparse, i.e., the search agent only obtains non-zero rewards when the proof is completely solved. More specifically, the proof search process forms a tree structure with only a narrow set of leaves delivering non-zero rewards, which matches a well-known hard exploration case (Krishnamurthy et al., 2016) in the literature of statistical reinforcement learning.
To promote exploration in sparse-reward sequential decision-making, one classical paradigm is constructing intrinsic rewards (Schmidhuber, 2010) that encourage the agent to not only optimize extrinsic rewards but also acquire general information about the interactive environment (Bellemare et al., 2016; Houthooft et al., 2016; Pathak et al., 2017; Burda et al., 2019).
In this section, we present our intrinsic-reward-driven exploration algorithm, RMax applied to Tree Search (RMaxTS), to incorporate reward-free exploration in the proof search problem. RMax applied to MCTS. We adopt RMax (Brafman and Tennenholtz, 2002), a classical exploration mechanism, to construct intrinsic rewards for Monte Carlo tree search. The core idea of RMax is to explore a broad coverage of the state space: the agent awards itself a maximal amount of reward upon reaching an unseen state. In the context of proof search, where no extrinsic rewards are provided until the proof is completed, our algorithmic procedure resembles ZeroRMax (Jin et al., 2020), in which the agent's exploration is driven solely by intrinsic rewards, i.e., setting R(τ) = R_intrinsic(τ).
The intrinsic reward of a tree expansion step is determined by whether a new node is added to the search tree: R_intrinsic(τ) = 1 if at least one new node is added to the search tree, and 0 otherwise (Eq. 3), where τ denotes the most recent selection trajectory that requires a reward assignment for backpropagation. This exploration strategy prioritizes the expansion of nodes where the prover model generates tactics that lead to a diverse range of tactic states.
As multiple Lean codes can result in the same transition of intermediate states, this heuristic can potentially reduce redundant generation and improve sample efficiency. UCB for non-stationary rewards. The common setting of the UCB exploration bonus for Monte Carlo tree search is UCB1 (Auer et al., 2002), computed from the visitation counts N(s, a) = |Γ(s, a)| (Eq. 4), where Γ(s, a) = {τ_i : (s, a) ∈ τ_i} denotes the list of tree policy trajectories containing (s, a) as an intermediate selection step. To facilitate discussion, we order the list Γ(s, a) such that newly collected trajectories have larger subscript indices. In this work, we propose to use an alternative variant of the UCB method.
Note that the derived intrinsic reward in Eq. 3 is a non-stationary reward signal whose expected value decays with the progress of exploration. That is because it inevitably becomes harder to discover new nodes with unseen tactic states as the search tree expands through sophisticated exploration.
To tackle the non-stationarity, we consider discounted upper confidence bounds (DUCB; Garivier and Moulines, 2011), which use a discount factor γ ∈ (0, 1) to smoothly drop outdated feedback records, so that newly received feedback is assigned a larger weight in the value estimation.
In practice, we set γ = 0.99. Note that the role of the discount factor γ in DUCB differs from its role in value iteration for infinite-horizon MDPs. The discounting is applied to tree search iterations rather than to the action step horizon within a single trajectory.
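The discounted-UCB idea can be sketched as follows (a rough illustration in the spirit of Garivier and Moulines, 2011, not the paper's exact Eq. 7): rewards from older trajectories through a state-action pair are down-weighted geometrically before forming the value estimate and its confidence bonus. The constants and normalization here are assumptions.

```python
# Illustrative discounted UCB: newer feedback gets weight ~1, older feedback decays.
import math
from typing import List

def discounted_ucb(rewards_through_sa: List[float],
                   discounted_visits_at_state: float,
                   gamma: float = 0.99,
                   c: float = 2.0) -> float:
    # rewards are ordered oldest -> newest, matching the subscript convention above
    k = len(rewards_through_sa)
    n_sa = sum(gamma ** (k - 1 - i) for i in range(k))                 # discounted visit count
    w_sa = sum(gamma ** (k - 1 - i) * r for i, r in enumerate(rewards_through_sa))
    exploit = w_sa / n_sa
    explore = math.sqrt(c * math.log(discounted_visits_at_state) / n_sa)
    return exploit + explore

# Example: four past trajectories through (s, a) with intrinsic rewards 0, 1, 1, 0.
print(discounted_ucb([0.0, 1.0, 1.0, 0.0], discounted_visits_at_state=20.0))
```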
3.4 Parallelization of Monte Carlo Tree Search. To enhance the efficiency of Monte Carlo tree search, we implement several established parallelization techniques as described by Chaslot et al. We deploy 256 MCTS runners per node, with one language model per GPU and a batch size of 512 for proof generation.
The Lean prover is invoked through REPL and executed on a cluster with thousands of CPU cores, where each proof verification task is handled by an individual process, created and terminated in a sandbox. Both proof generation by language models and verification by the Lean prover are handled asynchronously. This setup allows MCTS runners to perform concurrent tree search operations, significantly accelerating the process.
Tree parallelization. We manage each search tree with 32 thread workers to parallelize the tree iteration steps. This method effectively schedules and balances the tasks of proof generation and Lean verification. Each thread worker iteratively performs the tree search loop by selecting a candidate node for expansion, invoking the language model to generate the proof, verifying the generated proof with the Lean prover, and performing backpropagation.
Virtual loss. To encourage diverse node selection among concurrent thread workers, we assign a virtual reward of r = 0 for ongoing iterations. This involves backpropagating a reward of zero temporarily and updating it to the true reward upon completion. This strategy promotes exploration of different nodes for expansion, thereby enhancing the overall search efficiency.
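A sketch of the virtual-loss bookkeeping follows, again reusing the illustrative Node fields from the earlier sketches: a worker provisionally backpropagates a reward of 0 along its selected path, then corrects the value once Lean verification returns. This is an assumption-level illustration, not the authors' concurrency code.

```python
# Illustrative virtual loss for tree parallelization.
from typing import Iterable

def apply_virtual_loss(path: Iterable["Node"]) -> None:
    for node in path:
        node.visit_count += 1      # count the in-flight iteration immediately
        node.value_sum += 0.0      # provisional reward of 0 steers other workers away

def resolve_virtual_loss(path: Iterable["Node"], true_reward: float) -> None:
    for node in path:
        node.value_sum += true_reward   # visit count was already incremented above
```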
3.5. Comparison with existing methods. Methods for using language models in formal mathematics proof search generally fall into two main strategies. Multipass proof step generation. This strategy breaks down the proving process into multiple episodes of tactic generation and verification, typically following a tree search pattern. It involves generating and verifying one tactic at a time, repeating the process for the next tactic until no proof goals remain.
Notable examples include GPT-f (Polu and Sutskever; Polu et al.), Thor (Jiang et al.), ReProver (Yang et al.), HyperTree Proof Search (Lample et al.), and InternLM2-StepProver (Wu et al.). Single-pass whole-proof generation. This approach generates and verifies an entire proof in one attempt.
If the proof is incorrect, the model generates a new proof in the next attempt. Methods in this category include DSP (Jiang et al.), Subgoal-Prover (Zhao et al.), LEGO-Prover (Wang et al.), Lyra (Zheng et al.), and miniCTX (Hu et al.). Our proof tree search method uniquely bridges these two strategies, offering a novel hybrid approach.
It starts with whole-proof generation, similar to the single-pass approach, but extends this by implementing a sophisticated truncate-and-resume mechanism. This process involves truncating the generated proof to its successful initial segment, parsing this segment into individual tactics, and resuming the tree search from this point. This iterative process effectively implements a Monte Carlo tree search, seamlessly integrating single-pass whole-proof generation with multi-pass proof-step generation.
Consequently, we can train a single model with nearly identical objectives to support both strategies simultaneously. Our experimental results demonstrate that this unified approach achieves superior performance in both settings. By combining the strengths of existing methods and introducing innovative techniques, our method offers a more versatile and effective solution for formal mathematics proof search, potentially paving the way for future advancements in this field. 4. Experimental Results.
In this section, we evaluate the theorem proving capabilities of DeepSeek-Prover-V1.5 using two distinct benchmarks: MiniF2F, which encompasses high school level exercises and competition problems, and ProofNet, which pertains to undergraduate level theorems. We present the results for both complete proof generation and Monte Carlo tree search methodologies, utilizing the same trained model and inference configuration as in Section 2.4 to ensure consistency. 4.1.
Main results. We present a comparative analysis of DeepSeek-Prover-V1.5 against previous state-of-the-art language models, highlighting its performance and advancements. General-purpose models. GPT-3.5 and GPT-4 (OpenAI, 2023) are advanced generative AI models developed by OpenAI, known for their effectiveness across diverse tasks, including code generation.
Despite not being specifically designed for theorem proving, their extensive parameter scales provide significant capabilities. The evaluation of these models in formal theorem proving is facilitated by COPRA (Thakur et al., 2023), an in-context learning agent that leverages these large language models to propose tactic applications. Additionally, we examine Llemma (Azerbayev et al., 2024), a series of
language models trained on extensive general mathematical corpora, commonly used as the base model for formal theorem proving. Specialized models for formal mathematics. GPT-f (Polu and Sutskever, 2020; Polu et al.) represents an initial effort to apply transformers to proof-step generation for theorem proving tasks, utilizing a best-first search module to construct complete proofs.
Subsequent advancements include ReProver (Yang et al., 2023), LLMSTEP (Welleck and Saha, 2023), and Lean-STaR (Lin et al., 2024). HyperTree Proof Search (Lample et al., 2022) explores the use of Monte Carlo tree search in formal theorem proving using Lean.
Concurrent works, InternLM2-Math (Ying et al.) and InternLM2-StepProver (Wu et al.), also demonstrate outstanding performance. Metric. We compare the performance of DeepSeek-Prover-V1.5 with state-of-the-art models using the pass@K accuracy metric, which evaluates the model's ability to generate a correct proof within K attempts.
We display the sample budget K according to the following rules to align the computation budget across different generation schemes. For single-pass sampling methods, we define the sample budget K as the total number of proofs generated, with large values of K factorized for ease of comparison to tree search methods. For best-first search methods, following the notation of Azerbayev et al. (2024), we present K = N × S × T, where N denotes the number of best-first search attempts, S denotes the number of tactics generated for each expansion, and T denotes the number of expansion iterations. For tree search methods, e.g., RMaxTS and HTPS (Lample et al., 2022), we present K = N × T, where N denotes the number of tree search attempts, and T denotes the number of model generations invoked in tree expansions.
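As a quick sanity check on how these budget notations compare, the arithmetic below restates the conventions above using figures mentioned in this section; it adds no new results.

```python
# Budget bookkeeping across generation schemes (numbers taken from the surrounding text).
tree_search_budget = 4 * 6400        # K = N x T for RMaxTS: four searches of 6,400 generations each
step_prover_budget = 64 * 32 * 100   # K = N x S x T notation; equals the 64 x 3,200 searches cited below
print(tree_search_budget, step_prover_budget, step_prover_budget == 64 * 3200)
```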
Results on MiniF2F. Table 1 provides a comparative analysis of various theorem-proving methods on the MiniF2F test dataset. In the single-pass whole-proof generation setting, DeepSeek-Prover-V1.5-RL achieved the highest pass rate at 60.2%, marking a significant improvement of 10.2 percentage points over DeepSeek-Prover-V1's 50.0%.
With a sampling budget limited to 128 attempts, DeepSeek-Prover-V1.5-RL proved 51.6% of the problems, significantly outperforming other whole-proof generation methods and comparable to the leading tree search methods. In the tree search methods category, DeepSeek-Prover-V1.5-RL + RMaxTS leads with a pass rate of 62.7%, establishing a new state of the art and creating a substantial gap with existing methods.
Notably, DeepSeek-Prover-V1.5-RL requires only 3,200 whole-proof generation samplings to achieve a pass rate of 54.9%, surpassing the previous state-of-the-art result of InternLM2-StepProver, which performs 64 × 3,200 tree searches to achieve 54.5%. Results on ProofNet. Table 2 presents a comparative analysis of various theorem-proving methods on the ProofNet dataset.
DeepSeek-Prover-V1.5-RL achieved pass rates of 22.6% and 25.3% for the overall ProofNet dataset in the single-pass whole-proof generation setting and with the enhancement of RMaxTS, respectively. These results surpass the existing state-of-the-art methods, ReProver (13.8%) and InternLM2-StepProver (18.1%).
When the number of whole-proof generation attempts is restricted to 3,200, DeepSeek-Prover-V1.5 also proved 21.7% of the theorems, demonstrating a 3.6% improvement over the previous state-of-the-art InternLM2-StepProver.
Method; Sample Budget; ProofNet pass rate (Valid / Test / All).
Single-pass whole-proof generation methods:
DeepSeek-Prover-V1.5-Base; 128; 6.6% ± 0.9% / 9.7% ± 0.7% / 7.5% ± 0.7%. At 3,200: 10.8% / 15.6% / 13.2%.
DeepSeek-Prover-V1.5-SFT; 128; 19.9% ± 0.4% / 15.9% ± 0.6% / 17.9% ± 0.3%. At 3,200: 20.7% ± 0.7% / 21.0% ± 0.9% / 20.9% ± 0.6%. At 4 × 6,400: 22.2% / 23.7% / 22.9%.
DeepSeek-Prover-V1.5-RL; 128; 20.1% ± 0.5% / 18.2% ± 0.5% / 19.1% ± 0.4%. At 3,200: 21.4% ± 0.3% / 22.0% ± 0.5% / 21.7% ± 0.4%. At 4 × 6,400: 21.6% / 23.7% / 22.6%.
Tree search methods:
ReProver [55]; 13.8%.
InternLM2-StepProver [52]; 1 × 32 × 100; 18.1%.
DeepSeek-Prover-V1.5-SFT + RMaxTS; 1 × 3,200; 22.2% ± 0.7% / 21.6% ± 0.2% / 21.9% ± 0.4%. At 4 × 6,400: 23.8% / 25.8% / 24.8%.
DeepSeek-Prover-V1.5-RL + RMaxTS; 1 × 3,200; 22.0% ± 0.3% / 21.5% ± 0.8% / 21.8% ± 0.4%. At 4 × 6,400: 25.4% / 25.3% / 25.3%.
Table 2. Comparing with the state of the art on the ProofNet dataset.
Note that the validation set of ProofNet is used to perform expert iteration in supervised fine-tuning. 4.2. Re-examining the Effectiveness of Training Strategies on Large-Scale Sampling. We revisit the effects of several training modules in a large-scale sampling setting, focusing on both single-pass whole-proof generation and Monte Carlo tree search. The results demonstrate that the observations and findings presented in Section 2.4 generalize to sampling scenarios with a large sample size.
General enhancement of reinforcement learning. To support the claim that online reinforcement learning from verification feedback generally enhances the model capabilities, we compare our final model to the SFT-only version using a large sample budget. The comparison results are presented as two columns in Table 3. DeepSeek-Prover-V1.5-RL consistently outperforms the SFT model across all generation settings, regardless of whether the chain-of-thought strategy is applied.
The results also indicate that the improvements gained from conducting online RL are orthogonal to those achieved through RMaxTS, and the two can be combined to further boost performance. By integrating both CoT prompting and RMaxTS, DeepSeek-Prover-V1.5-RL achieves a pass rate of 62.7% on the MiniF2F test set.
This performance shows a notable 3.7% improvement over the SFT model, highlighting the critical role of reinforcement learning in enhancing the overall effectiveness of the proof completion model. CoT, non-CoT, and mixture strategy. We compare the performance of two generation modes, i.e., non-CoT and CoT, on the MiniF2F test dataset. The results, shown in Table 3, indicate that the advantage of CoT over the non-CoT mode is amplified as the sample budget increases.
This suggests that the incorporation of natural language chain of thought can diversify the planning pathways of theorem proving, potentially leading to a broader range of reasoning strategies and more innovative solutions.
Results also show that these two modes have complementary advantages across different problems. The model's theorem-proving strategy in the COT mode is more systematic and proactive in mathematical thinking, while in the non-COT mode, the model can efficiently use Lean high-level tactics to solve computational problems that can be addressed within Lean's automation mechanisms.
To leverage these advantages, we consider a mixture strategy, denoted by non-CoT & CoT in Table 3, which allocates half of the sample budget to the CoT mode and the remainder to the non-CoT mode. This simple combination of two guiding prompts shows great promise in further bootstrapping the performance of our proof completion model, achieving a pass rate of 63.5% on the MiniF2F test set. In Appendix B, we present example problems that illustrate the different advantages of the two generation modes.
4.3. Ablation Studies on RMaxTS. Intrinsic rewards and discounted UCB. We investigate the effectiveness of two core components of RMaxTS, i.e., the intrinsic rewards defined in Eq. 3 and the discounted upper confidence bound stated in Eq. 7. We start with a baseline implementing the standard UCT algorithm (Kocsis and Szepesvári, 2006) without intrinsic rewards, in which exploration is driven exclusively by the UCB bonus. Note that, since no non-zero rewards are provided for this baseline, all variants of the UCB formula become equivalent, as node selection is determined solely by visitation counts.
The experimental results in Figure 5 show that, in the absence of intrinsic rewards, the performance of UCT without R_intrinsic degenerates to a level comparable to that of non-search methods. Furthermore, we consider RMaxTS using the standard UCB1 bonus (Eq. 4) instead of the discounted UCB, denoted by RMaxTS (DUCB → UCB1).
The results indicate that the performance of RMaxTS with the UCB1 bonus is also moderate, comparable to that of UCT without R_intrinsic. That is because UCB1 is designed to guarantee asymptotic performance through exhaustive exploration (Auer et al., 2002), assuming the sample size to be sufficiently large.
In contrast, the discounted UCB can accelerate the value propagation of non-stationary intrinsic rewards, preventing the guidance of R intrinsic from being dominated by that of visitation counts. This demonstrates that the discounted UCB mechanism is a crucial complement to intrinsic reward-driven exploration. Guidance of Tactic State Information When expanding a tree node, we concatenate the intermediate tactic state information as a comment block to the incomplete code to guide the proof completion.
With the provided auxiliary information, the proof completion model can enhance its internal representation of the tactic state, offering intermediate guidance for long-horizon planning. To demonstrate this advantage, we present experiments on RMaxTS that perform code completion directly from the raw incomplete code without accessing tactic state information, denoted by RMaxTS (without tactic state) in Figure 5.
The results indicate that the performance gained from applying tree search becomes moderate in the absence of tactic state information, especially when tackling hard problems that require a large number of samples. This highlights that the integration of compiler information is an essential component of the tree search algorithm, enhancing its overall effectiveness and sample efficiency.
5. Conclusion, Limitation, and Future Work. We present DeepSeek-Prover-V1.5, a language model with 7 billion parameters that outperforms all open-source models in formal theorem proving in Lean 4. DeepSeek-Prover-V1.5 is initialized with DeepSeek-Prover-V1.5-Base, which extends the pre-training of DeepSeekMath-Base 7B using a specialized corpus for formal mathematical reasoning.
Supervised fine-tuning is conducted on a comprehensive Lean 4 code completion dataset, encompassing a wide range of formal theorems from various mathematical domains. Subsequently, we employ GRPO to enhance whole-proof generation through online reinforcement learning. Upon developing the DeepSeek-Prover-V1.5 model, we introduce RMaxTS, a variant of Monte Carlo tree search, to improve problem-solving capabilities via large-scale search with extensive exploration.
These components form a comprehensive pipeline for training an LLM-based proof assistant, enabling DeepSeek-Prover-V1.5 to achieve significant improvements over DeepSeek-Prover-V1. The framework of DeepSeek-Prover-V1.5 is designed to establish an AlphaZero-like pipeline for formal theorem proving. The use of expert iteration and synthetic data mirrors the core trial-and-error loop of reinforcement learning, with the compiler oracle serving as the world model to provide environmental supervision.
Within the RL paradigm, the integrated tree search module has proven to be highly effective in advancing superhuman performance across various domains (Silver et al.; Fawzi et al.; Lutz et al.). In developing DeepSeek-Prover-V1.5, we focus on the exploration aspect of RL, introducing RMaxTS to diversify the generation of proof steps.
However, the exploitation aspect, specifically the problem of proof search, remains unexplored. A promising future direction is training a critic model to assess incomplete proofs and prune search branches. Such a partial-proof critic model would implicitly perform temporal credit assignment (Sutton, 1984), decomposing proof-level feedback into stepwise value differences (Arjona-Medina et al., 2019).
Developing critic models for assessing long planning paths and providing guidance rewards presents a crucial and challenging problem (Ng and Russell, 2000; Sorg et al., 2010) that warrants further investigation. Finally, recent work has progressed beyond proving individual theorems to addressing real-world theorem proving within complex, multi-theorem Lean files (Hu et al., 2024). This shift is a natural extension of our whole-proof generation approach.
Our observations indicate that the current model already possesses some understanding of file-level context. Moving forward, we will focus on enhancing this aspect to support developers of cutting-edge Lean mathematical formalization projects with our language model advancements. References. Reference 1. J. A. Arjona-Medina, M. Gillhofer, M. Widrich, T. Unterthiner, J. Brandstetter, and S. Hochreiter. RUDDER: Return Decomposition for Delayed Rewards. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, pages 13566-13577, 2019. Reference 2. P. Auer. Using Confidence Bounds for Exploitation-Exploration Trade-offs. Journal of Machine Learning Research, 3(Nov):397-422, 2002.
Reference 3. P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-Time Analysis of the Multiarmed Bandit Problem. Machine Learning, 47:235-256, 2002. Reference 4. Z. Azerbayev, B. Piotrowski, H. Schoelkopf, E. W. Ayers, D. Radev, and J. Avigad. ProofNet: Autoformalizing and Formally Proving Undergraduate-Level Mathematics. arXiv preprint arXiv:2302.12433, 2023. Reference 5. Z. Azerbayev, H. Schoelkopf, K. Paster, M. Dos Santos, S. M. McAleer, A. Q. Jiang, J. Deng, S. Biderman, and S. Welleck. Llemma: An Open Language Model for Mathematics. In The Twelfth International Conference on Learning Representations, 2024.
Reference 6. M. G. Bellemare, S. Srinivasan, G. Ostrovski, T. Schaul, D. Saxton, and R. Munos. Unifying Count-Based Exploration and Intrinsic Motivation. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pages 1479-1487, 2016. Reference 7. R. I. Brafman and M. Tennenholtz. R-max: A General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning. Journal of Machine Learning Research, 3(Oct):213-231, 2002. Reference 8. C. B. Browne, E. Powley, D. Whitehouse, S. M. Lucas, P. I. Cowling, P. Rohlfshagen, S. Tavener, D. Perez, S. Samothrakis, and S. Colton. A Survey of Monte Carlo Tree Search Methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1):1-43, 2012. Reference 9. Y. Burda, H. Edwards, A. Storkey, and O. Klimov. Exploration by Random Network Distillation. In 7th International Conference on Learning Representations, pages 1-17, 2019.
Thanks for listening to this reading. For the entire paper, and more, check out our homepage.