DreamPRM: Domain-Reweighted Process Reward Model for Multimodal Reasoning

This paper presents DreamPRM, a domain-reweighted process reward model that improves how multimodal language models reason by addressing dataset quality imbalance and the gap between training and inference.

Content & Liability Disclaimer

This article and its accompanying video are automated summaries derived from the original research paper by Unknown authors. The original research was conducted solely by the paper's authors; PDFdigest did not conduct any of the research and makes no claims of ownership over the underlying scientific work.

The video narration is generated by artificial intelligence and references the paper's authors for attribution. The video is not narrated by any of the paper's authors. This content may contain inaccuracies, omissions, or misinterpretations of the original research. First-person language (e.g., "we found", "our results") reflects the original authors' voice, not PDFdigest's. Always read the original paper for accurate, verified information before making any decisions based on this content.

Key Takeaways
  1. DreamPRM improves multimodal reasoning by giving more weight to training domains with higher-quality data.
  2. It uses bi-level optimization: the lower level updates the PRM's parameters, and the upper level adjusts the domain weights.
  3. It targets the distribution shift and dataset quality imbalance that arise when PRMs are applied to multimodal data such as text and images.

Introduction

The introduction discusses the significant improvements in reasoning capabilities of large language models (LLMs) and the role of Process Reward Models (PRMs) in enhancing reasoning processes. It highlights the challenges of applying PRMs to multimodal large language models (MLLMs) due to distribution shifts and dataset quality imbalances.

The Proposed Domain-reweighting Method

This section details the DreamPRM framework, which addresses dataset quality imbalance and training-inference discrepancies through a bi-level optimization approach. It explains the lower-level optimization for updating PRM weights and the upper-level optimization for adjusting domain importance weights.
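The two levels described above can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes a toy one-parameter regression model in place of the PRM, mean squared error in place of the PRM training loss and the aggregation function loss, and a one-step unrolled gradient for the upper level. The names `bilevel_step`, `train`, `CLEAN`, `NOISY`, and `META` are hypothetical.

```python
from typing import List, Sequence, Tuple

Example = Tuple[float, float]  # (input x, target y) for a toy model y ≈ w * x


def mse_grad(w: float, data: Sequence[Example]) -> float:
    """Gradient of the mean squared error of y ≈ w * x with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in data) / len(data)


def bilevel_step(
    w: float,
    alphas: List[float],
    domains: Sequence[Sequence[Example]],
    meta: Sequence[Example],
    lr_w: float = 0.05,
    lr_alpha: float = 0.05,
) -> Tuple[float, List[float]]:
    """One lower-level update of w followed by one upper-level update of alphas."""
    # Lower level: gradient step on the domain-weighted sum of per-domain losses.
    grads = [mse_grad(w, d) for d in domains]
    w_new = w - lr_w * sum(a * g for a, g in zip(alphas, grads))

    # Upper level: evaluate the updated model on the separate meta set and
    # differentiate through the single step above (assumed approximation):
    # d meta_loss / d alpha_k = meta_grad * d w_new / d alpha_k
    #                         = meta_grad * (-lr_w * grads[k])
    meta_grad = mse_grad(w_new, meta)
    alpha_grads = [meta_grad * (-lr_w * g) for g in grads]

    # Gradient descent on the domain weights, kept non-negative.
    new_alphas = [max(0.0, a - lr_alpha * ag) for a, ag in zip(alphas, alpha_grads)]
    return w_new, new_alphas


def train(
    domains: Sequence[Sequence[Example]],
    meta: Sequence[Example],
    steps: int = 50,
) -> Tuple[float, List[float]]:
    """Alternate the two levels, starting from equal domain weights."""
    w, alphas = 0.0, [1.0] * len(domains)
    for _ in range(steps):
        w, alphas = bilevel_step(w, alphas, domains, meta)
    return w, alphas


# Toy data: one domain agrees with the meta set (y = 2x), one contradicts it.
CLEAN: List[Example] = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
NOISY: List[Example] = [(1.0, -1.0), (2.0, -2.0), (3.0, -3.0)]
META: List[Example] = [(1.0, 2.0), (2.0, 4.0)]

if __name__ == "__main__":
    final_w, final_alphas = train([CLEAN, NOISY], META)
    print("w:", final_w, "domain weights:", final_alphas)
```

In this toy setting the weight of the domain that agrees with the meta set grows while the weight of the contradicting domain shrinks toward zero, which is the behavior the domain-reweighting idea relies on. A real PRM would need automatic differentiation and a more careful upper-level solver.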

Related Works

This section reviews recent studies on multimodal reasoning and the limitations of existing methods, such as Chain-of-Thought (CoT) prompting. It discusses various approaches to enhance reasoning capabilities in MLLMs and the challenges associated with obtaining process supervision signals for PRMs.

Domain-reweighting

The domain-reweighting section outlines the problem setting and the need for effective training of PRMs in MLLMs. It introduces the concept of using Monte Carlo methods to generate approximated supervision signals for training PRMs.
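One common way to realize Monte Carlo supervision for a PRM is to sample several completions from each intermediate step and use the fraction that reach the correct final answer as that step's label. The summary does not spell out the paper's exact estimator, so the sketch below is an assumption, and `mc_step_scores`, `complete`, `is_correct`, and `toy_complete` are hypothetical names. In practice `complete` would sample from the multimodal model.

```python
from typing import Callable, List, Sequence


def mc_step_scores(
    steps: Sequence[str],
    complete: Callable[[List[str]], str],
    is_correct: Callable[[str], bool],
    n_rollouts: int = 8,
) -> List[float]:
    """Estimate a label for each step as the share of rollouts that end correctly.

    `complete` takes the reasoning prefix up to and including a step and returns
    a final answer; `is_correct` checks that answer against the ground truth.
    """
    if n_rollouts <= 0:
        raise ValueError("n_rollouts must be positive")
    scores: List[float] = []
    for i in range(1, len(steps) + 1):
        prefix = list(steps[:i])
        hits = sum(1 for _ in range(n_rollouts) if is_correct(complete(prefix)))
        scores.append(hits / n_rollouts)
    return scores


def toy_complete(prefix: List[str]) -> str:
    """Hypothetical stand-in for a model rollout: fails once a bad step appears."""
    return "42" if all(step == "ok" for step in prefix) else "0"


if __name__ == "__main__":
    labels = mc_step_scores(
        ["ok", "ok", "bad", "ok"], toy_complete, lambda ans: ans == "42", n_rollouts=4
    )
    print(labels)
```

With a stochastic model the scores fall between 0 and 1, and more rollouts give a less noisy estimate at a higher sampling cost.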

Experimental settings

The experimental settings section describes the methodology used for testing the proposed DreamPRM framework, including the base models, training hyperparameters, and baseline comparisons with state-of-the-art models.

Figures Explained

The paper’s visual material highlights the workflow and the main system components.

  • Figure 2: General flow of training PRM and using PRM for inference. Training phase: Train PRM with Monte Carlo signals from intermediate steps of Chain-of-Thoughts (CoTs). Inference phase: Use the trained PRM to verify CoTs step by step and select the best CoT. Conventional training of PRM has poor generalization capability due to distribution shift between training set and testing set.
  • Figure 3: The proposed bi-level optimization based domain-reweighting method. Lower-level optimization: In this stage, PRM’s parameters are updated on multiple datasets with domain weights, allowing the PRM to prioritize domains with better quality. Upper-level optimization: In this stage, the PRM is evaluated on a separate meta dataset to compute an aggregation function loss and optimize the domain weights. DreamPRM helps address dataset quality imbalance problems and leads to stronger and more generalizable reasoning performance.
  • Text accompanying Figure 3: Given the PRM training loss on a single domain $\mathcal{D}_k$, the domain-reweighted training objective over the training domains $\mathcal{D} = \{\mathcal{D}_k\}_{k=1}^{K}$ is a weighted sum of the single-domain PRM training losses, so the contribution of each domain can be adjusted during learning.
  • Figure 4: Leaderboard on MathVista (as of October 15, 2025). The first column (“o4-mini + DreamPRM”) reports our own evaluation, while the remaining results are taken from the official MathVista leaderboard. The compared models include VL-Rethinker [62], Step R1-V-Mini [58], Kimi-k1.6-preview [43], Kimi-k1.5 [24], Doubao-pro-1.5 [60], Ovis2-34B [1], OpenAI o1 [45], Llama 4 Maverick [41, 42], and Vision-R1-7B [18].
  • Figure 5: Comparative evaluation of DreamPRM on multimodal reasoning benchmarks. Radar charts report accuracy (%) on five datasets (WEMATH, MATHVISTA, MATHVISION, MMVET, and MMSTAR). (a) Impact of different data selection strategies. (b) Comparison with existing test-time scaling methods. (c) Ablation study of three key components, i.e., w/o aggregation function loss (AFL), w/o bi-level optimization (BLO), and w/o structural thinking (ST).
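The inference phase in the Figure 2 caption, where a trained PRM verifies each CoT step by step and the best CoT is selected, can be sketched as a best-of-N selection. This is a sketch under stated assumptions: `select_best_cot` and `toy_score_step` are hypothetical names, the step scorer stands in for a trained PRM, and the mean is only one possible way to aggregate step scores, since the summary does not state which aggregation function the paper uses.

```python
from statistics import mean
from typing import Callable, List, Sequence, Tuple


def select_best_cot(
    candidates: Sequence[Sequence[str]],
    score_step: Callable[[List[str]], float],
    aggregate: Callable[[List[float]], float] = mean,
) -> Tuple[int, float]:
    """Return the index and aggregated score of the highest-scoring candidate.

    Each candidate is a non-empty list of reasoning steps. `score_step` receives
    the prefix ending at the step being verified and returns a score for it.
    """
    if not candidates:
        raise ValueError("at least one candidate CoT is required")
    best_index, best_score = -1, float("-inf")
    for idx, steps in enumerate(candidates):
        # Verify the chain step by step, always scoring the step in context.
        step_scores = [score_step(list(steps[: i + 1])) for i in range(len(steps))]
        total = aggregate(step_scores)
        if total > best_score:
            best_index, best_score = idx, total
    return best_index, best_score


def toy_score_step(prefix: List[str]) -> float:
    """Hypothetical PRM stand-in: penalize a step that is labeled as a guess."""
    return 0.2 if "guess" in prefix[-1] else 0.9


if __name__ == "__main__":
    cots = [
        ["read the chart", "guess the value"],
        ["read the chart", "compute the difference"],
    ]
    print(select_best_cot(cots, toy_score_step))
```

Passing `aggregate=min` instead of the default makes the selection depend on the weakest step of each chain, which is a stricter alternative to averaging.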