DreamPRM: Domain-Reweighted Process Reward Model for Multimodal Reasoning
This paper presents DreamPRM, a domain-reweighted process reward model that improves how multimodal language models reason by addressing imbalanced dataset quality and the gap between how the reward model is trained and how it is used at inference.
- DreamPRM helps multimodal models reason better by focusing on high-quality data.
- The method uses a two-step optimization process to improve model performance.
- It addresses challenges that arise when combining different types of data, like text and images.
Introduction
The introduction discusses the significant improvements in reasoning capabilities of large language models (LLMs) and the role of Process Reward Models (PRMs) in enhancing reasoning processes. It highlights the challenges of applying PRMs to multimodal large language models (MLLMs) due to distribution shifts and dataset quality imbalances.
The Proposed Domain-reweighting Method
This section details the DreamPRM framework, which addresses dataset quality imbalance and training-inference discrepancies through bi-level optimization. The lower level updates the PRM's parameters on the domain-weighted training data, and the upper level adjusts the domain importance weights.
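The two levels can be illustrated with a deliberately tiny sketch. This is not the paper's implementation: the PRM is replaced by a single scalar `theta`, each domain's loss is the toy quadratic `(theta - t_k)^2`, and the upper-level gradient is taken through a single lower-level step. The names `bilevel_domain_reweighting`, `domain_targets`, and `meta_target`, and the learning rates, are all assumptions made for illustration.

```python
def bilevel_domain_reweighting(domain_targets, meta_target, steps=200,
                               lr_theta=0.1, lr_alpha=0.05):
    """Toy bi-level loop; returns (theta, alpha) after `steps` iterations."""
    theta = 0.0                             # stand-in for the PRM parameters
    alpha = [1.0] * len(domain_targets)     # one importance weight per domain

    for _ in range(steps):
        # Gradient of each domain's toy loss (theta - t_k)^2 w.r.t. theta.
        grads = [2.0 * (theta - t) for t in domain_targets]

        # Lower level: one gradient step on the domain-weighted training loss.
        theta_new = theta - lr_theta * sum(a * g for a, g in zip(alpha, grads))

        # Upper level: meta loss (theta_new - meta_target)^2 on held-out data.
        # Chain rule through the lower-level step gives
        #   d meta / d alpha_k = 2 * (theta_new - meta_target) * (-lr_theta * grads[k])
        meta_residual = 2.0 * (theta_new - meta_target)
        alpha = [max(0.0, a - lr_alpha * meta_residual * (-lr_theta * g))
                 for a, g in zip(alpha, grads)]

        theta = theta_new
    return theta, alpha


# Domain 0 agrees with the meta set (target 1.0); domain 1 does not (target 5.0).
theta, alpha = bilevel_domain_reweighting(domain_targets=[1.0, 5.0], meta_target=1.0)
print(f"theta={theta:.3f}, alpha={[round(a, 3) for a in alpha]}")
```

In this toy setting the weight of the domain that disagrees with the meta set shrinks toward zero while the other weight grows, which is the qualitative behavior the summary attributes to the upper level.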
Domain-reweighting
The domain-reweighting section outlines the problem setting and the need for effective training of PRMs in MLLMs. It introduces the concept of using Monte Carlo methods to generate approximated supervision signals for training PRMs.
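A common way to build such Monte Carlo signals, shown here as a minimal sketch rather than the paper's exact procedure, is to sample several completions from each intermediate step and use the fraction that reach the correct final answer as that step's label. The function `monte_carlo_step_labels`, the `rollout` callable, and the stub used in the demo are hypothetical names introduced for illustration; in practice `rollout` would call a multimodal model to finish the solution.

```python
from typing import Callable, List, Sequence


def monte_carlo_step_labels(
    question: str,
    steps: Sequence[str],
    gold_answer: str,
    rollout: Callable[[str, List[str]], str],
    num_rollouts: int = 8,
) -> List[float]:
    """Estimate one soft label per reasoning step.

    For each prefix steps[:i], run `rollout` num_rollouts times and record
    the fraction of completions whose final answer equals gold_answer.
    """
    labels = []
    for i in range(1, len(steps) + 1):
        prefix = list(steps[:i])
        hits = sum(rollout(question, prefix) == gold_answer
                   for _ in range(num_rollouts))
        labels.append(hits / num_rollouts)
    return labels


def stub_rollout(question: str, prefix: List[str]) -> str:
    # Deterministic stand-in for model sampling (assumption for the demo):
    # once a flawed step is in the prefix, every completion ends up wrong.
    return "wrong" if any("flawed" in step for step in prefix) else "4"


labels = monte_carlo_step_labels(
    question="What is 2 + 2?",
    steps=["add the numbers", "flawed carry", "state the answer"],
    gold_answer="4",
    rollout=stub_rollout,
)
print(labels)  # [1.0, 0.0, 0.0]
```

The labels are approximate by construction: with a real sampler they depend on the number of rollouts and on the model used to complete the solutions.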
Experimental settings
The experimental settings section describes the methodology used for testing the proposed DreamPRM framework, including the base models, training hyperparameters, and baseline comparisons with state-of-the-art models.
Figures Explained
The paper’s visual material highlights the workflow and the main system components.
- Figure 2: General flow of training PRM and using PRM for inference. Training phase: Train PRM with Monte Carlo signals from intermediate steps of Chain-of-Thoughts (CoTs). Inference phase: Use the trained PRM to verify CoTs step by step and select the best CoT. Conventional training of PRM has poor generalization capability due to distribution shift between training set and testing set.
- Figure 3: The proposed bi-level optimization based domain-reweighting method. Lower-level optimization: In this stage, PRM’s parameters are updated on multiple datasets with domain weights, allowing the PRM to prioritize domains with better quality. Upper-level optimization: In this stage, the PRM is evaluated on a separate meta dataset to compute an aggregation function loss and optimize the domain weights. DreamPRM helps address dataset quality imbalance problems and leads to stronger and more generalizable reasoning performance.
- Domain-reweighted objective (text accompanying Figure 3): Building on the PRM training loss for a single domain D_k, defined on the PRM output V_ϕ(x, ŷ_i), the paper defines the domain-reweighted training objective of the PRM over multiple training domains D = {D_k}, k = 1, ..., K. The overall objective is a weighted sum of the single-domain PRM training losses, allowing the contribution of each domain to be adjusted during the learning process.
- Figure 4: Leaderboard on MathVista (as of October 15, 2025). The first column (“o4-mini + DreamPRM”) reports the authors' own evaluation, while the remaining results are taken from the official MathVista leaderboard. The compared models include VL-Rethinker [62], Step R1-V-Mini [58], Kimi-k1.6-preview [43], Kimi-k1.5 [24], Doubao-pro-1.5 [60], Ovis2-34B [1], OpenAI o1 [45], Llama 4 Maverick [41, 42], and Vision-R1-7B [18].
- Figure 5: Comparative evaluation of DreamPRM on multimodal reasoning benchmarks. Radar charts report accuracy (%) on five datasets (WEMATH, MATHVISTA, MATHVISION, MMVET, and MMSTAR). (a) Impact of different data selection strategies. (b) Comparison with existing test-time scaling methods. (c) Ablation study of three key components, i.e., without aggregation function loss (AFL), without bi-level optimization (BLO), and without structural thinking (ST).
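The inference phase described for Figure 2 (score each candidate CoT step by step, then keep the best one) can be sketched as follows. This is an illustrative sketch, not the paper's code: `select_best_cot` and `toy_score` are hypothetical names, the scorer is a stand-in for a trained PRM, and averaging step scores is only one possible aggregation, chosen here as an assumption.

```python
from statistics import mean
from typing import Callable, List, Sequence, Tuple


def select_best_cot(
    candidates: Sequence[Sequence[str]],
    score_step: Callable[[List[str]], float],
    aggregate: Callable[[List[float]], float] = mean,
) -> Tuple[int, float]:
    """Return (index, aggregated score) of the highest-scoring candidate.

    score_step receives the reasoning prefix up to and including the
    current step and returns a score for that step. Each candidate is
    assumed to contain at least one step.
    """
    best_index, best_score = -1, float("-inf")
    for idx, steps in enumerate(candidates):
        step_scores = [score_step(list(steps[: i + 1])) for i in range(len(steps))]
        total = aggregate(step_scores)
        if total > best_score:
            best_index, best_score = idx, total
    return best_index, best_score


def toy_score(prefix: List[str]) -> float:
    # Stand-in for a trained PRM (assumption): penalize steps that guess.
    return 0.1 if "guess" in prefix[-1] else 0.9


candidates = [
    ["read the chart", "guess the value"],
    ["read the chart", "compute the ratio"],
]
best_index, best_score = select_best_cot(candidates, toy_score)
print(best_index, round(best_score, 2))
```

Passing `aggregate=min` instead of the default would rank each candidate by its weakest step, which is another common choice when a single bad step invalidates the whole chain.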