Self-Evolving Visual Questioner

¹University of Maryland, College Park  ²University of California, Los Angeles
³Peking University  ⁴Arena  ⁵MBZUAI

Abstract

A VLM teaches itself to ask better visual questions — harder, more grounded, more diverse — and becomes a stronger questioner without losing its answering ability.

No human labels · No teacher models

Vision-language models (VLMs) are typically trained as passive answerers, while their ability to actively ask diverse, non-trivial, visual-centric, and grounded questions remains underexplored. Existing visual questioners are bottlenecked by the availability of high-quality training data or the cost of curating it. We show that a VLM can continuously improve itself as a visual questioner without any external supervision. We propose a self-evolving framework that uses a VLM itself as both a proposer and a filter to produce harder, more informative, and visual-centric questions, while maintaining exploration diversity to avoid training collapse. These questions are then used to train the VLM in both questioner and answerer modes. To evaluate the questioner, we introduce an agentic protocol that assesses questions along perception, reasoning, and diversity dimensions. Experiments across various backbone VLMs show that our method substantially enhances the quality and expands the difficulty boundary of autonomous question generation. Under the same budget, our self-supervision is more effective than training on static source data — and the self-evolving questioner remains a competitive or even better answerer.

Overview

A VLM that learns to ask better questions by proposing, rewriting, filtering, and learning from its own questions.

[Figure: Overview of the self-evolving visual questioner framework]
1 Propose: the current model Mₜ generates candidate questions from unlabeled images and visual-intent prompts.
2 Rewrite: a stable base model M₀ makes them harder and more visual-centric.
3 Filter: M₀ keeps only questions that are answerable, better grounded, and more demanding.
4 Train: the refined data trains the model in both question-generation (QG) and question-answering (QA) formats.
Repeat: the updated model Mₜ₊₁ becomes the next proposer, forming a continuous self-evolution loop.
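The loop above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: every name here (`self_evolve`, `keep`, `train`, the prompt strings, the record fields) is hypothetical, models are reduced to plain prompt-to-text callables, and the choice to have the base model produce the QA-format answers is an assumption, since this page does not say where those answers come from.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in: a "model" is any function from a prompt to text.
Model = Callable[[str], str]


def self_evolve(
    base_model: Model,                          # M0: stable rewriter and filter
    images: List[str],                          # unlabeled image identifiers
    intents: List[str],                         # visual-intent prompts
    keep: Callable[[Model, str, str], bool],    # filter rule, applied with M0
    train: Callable[[Model, List[Dict[str, str]]], Model],
    rounds: int = 2,
) -> Model:
    current = base_model  # the proposer M_t starts out as M0
    for _ in range(rounds):
        data: List[Dict[str, str]] = []
        for image in images:
            for intent in intents:
                # 1 Propose: the current model drafts a question.
                draft = current(f"[{image}] Ask a question about: {intent}")
                # 2 Rewrite: the base model makes it harder and more visual.
                question = base_model(f"[{image}] Rewrite to be harder: {draft}")
                # 3 Filter: drop questions the base model does not accept.
                if not keep(base_model, image, question):
                    continue
                # ASSUMPTION: answers for the QA format come from the base model.
                answer = base_model(f"[{image}] Answer: {question}")
                # 4 Train: emit the same item in both QG and QA formats.
                data.append({"mode": "QG", "image": image,
                             "prompt": intent, "target": question})
                data.append({"mode": "QA", "image": image,
                             "prompt": question, "target": answer})
        # Repeat: the updated model becomes the next proposer.
        current = train(current, data)
    return current


# Toy stand-ins so the sketch runs end to end without a real VLM.
def toy_model(prompt: str) -> str:
    return prompt.split(": ", 1)[-1].strip() + " (refined)"


log: List[int] = []


def toy_train(model: Model, data: List[Dict[str, str]]) -> Model:
    log.append(len(data))  # record how many training records each round saw
    return model


final = self_evolve(toy_model, ["img_0", "img_1"], ["spatial layout"],
                    keep=lambda m, img, q: len(q) > 0,
                    train=toy_train, rounds=2)
```

Keeping M₀ fixed as rewriter and filter while only the proposer changes mirrors the description above, where the base model is the stable reference and only Mₜ is updated between rounds.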

Why it works: harder questions demand richer visual evidence and deeper reasoning, so the model’s own questions become a more effective training signal than static data. This requires no external supervision and keeps the model’s answering ability intact.

Self-Improving Question Quality

Question quality climbs every round — while answering ability holds.

[Figure: QG and QA results across VLM backbones]
~82% relative QG gain after two rounds
5/5 QG dimensions improved on every backbone
QA ≈ downstream answering stays competitive

Across Qwen2.5-VL-3B/7B and Qwen3-VL-4B, the self-evolving framework consistently lifts question-generation quality. QG average roughly doubles from the base model (e.g., 0.25 → 0.50 on the 3B model), with the largest gains on spatial and contextual reasoning. The second round improves over the first, so the gains compound rather than saturate.
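The "roughly doubles" claim can be checked with a simple relative-gain calculation. The sketch below uses only the 0.25 → 0.50 figures quoted for the 3B model; the function name is illustrative, and the ~82% headline number is not re-derived here because the per-backbone scores behind it are not listed in this section.

```python
def relative_gain(before: float, after: float) -> float:
    """Return the fractional improvement of `after` over `before`."""
    if before <= 0:
        raise ValueError("baseline score must be positive")
    return (after - before) / before


# QG average on the 3B backbone, as quoted in the text.
gain_3b = relative_gain(0.25, 0.50)
print(f"{gain_3b:.0%}")  # prints "100%"
```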

Key takeaway: better questions don’t come at the cost of answering. Dual QA+QG training keeps QA accuracy competitive (even improving on several benchmarks) while the model becomes a markedly stronger questioner.

Do Better Questions Help Supervision?

With answers held fixed, better questions alone make better training data.

[Figure: Downstream QA from base vs improved question sources]
61.9 → 63.3 average QA from better questions
+6.3 CVBench-3D (69.3 → 75.6)

Are better questions useful beyond the questioning task? We build two QA training sets that differ in only one thing — whether the questions come from the base model (Base-Q) or the improved questioner (Improved-Q). GPT generates the answers for both, so any difference comes purely from question quality.

Key takeaway: questions from the improved questioner train a stronger answerer — +1.4 average and a striking +6.3 on CVBench-3D. Questions that demand richer visual evidence and spatial reasoning are simply more informative supervision.
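The controlled comparison can be sketched as follows: two training sets are built over the same images with the same answerer, and only the question source changes. This is a hedged illustration, not the authors' pipeline; `build_qa_set`, the toy questioners, and the toy answerer are hypothetical placeholders (the page states that GPT produces the answers). The last lines recompute the reported deltas from the numbers quoted above.

```python
from typing import Callable, Dict, List

Questioner = Callable[[str], str]        # image -> question
Answerer = Callable[[str, str], str]     # (image, question) -> answer


def build_qa_set(images: List[str], ask: Questioner,
                 answer: Answerer) -> List[Dict[str, str]]:
    """Pair each image with one question and one answer from a fixed answerer."""
    records = []
    for image in images:
        question = ask(image)
        records.append({"image": image, "question": question,
                        "answer": answer(image, question)})
    return records


# Hypothetical stand-ins for the base and improved questioners.
def base_q(image: str) -> str:
    return f"What color is the main object in {image}?"


def improved_q(image: str) -> str:
    return f"Which of the two objects in {image} is closer to the camera?"


# The answerer is shared, so the two sets differ only in their questions.
def shared_answerer(image: str, question: str) -> str:
    return f"answer to: {question}"


images = ["img_0", "img_1", "img_2"]
base_set = build_qa_set(images, base_q, shared_answerer)          # Base-Q
improved_set = build_qa_set(images, improved_q, shared_answerer)  # Improved-Q

# Deltas recomputed from the scores quoted in the text.
avg_delta = round(63.3 - 61.9, 1)        # average QA
cvbench3d_delta = round(75.6 - 69.3, 1)  # CVBench-3D
```

Holding the images and the answerer fixed is what lets any downstream difference be attributed to question quality alone.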

Watch the Questions Evolve

From shallow recognition to grounded, multi-step reasoning — on the same image, across self-evolving rounds.

M₀ · Base: What is the color of the fireplace in the picture?
M₁ · Round 1: What is placed on top of the fireplace?
M₂ · Round 2: Is the object placed on top of the fireplace shown in the mirror?
Cross-region reflection reasoning

M₀ · Base: What is the color and design of the water tower?
M₁ · Round 1: What color is the water tower with “Balsam Lake” written on it?
M₂ · Round 2: Which flag is larger, the American flag on the water tower or the red flag near the trees?
Relative-size comparison across regions

M₀ · Base: What is the color and condition of the grass?
M₁ · Round 1: What is the purpose of the white building in the background in this image?
M₂ · Round 2: Which building is closer, the grandstand on the right or the small white building on the left?
Depth and layout reasoning

M₀ · Base: What is the color scheme of the bedroom?
M₁ · Round 1: What items have the color yellow on the bed in this bedroom?
M₂ · Round 2: How does the color scheme of the bedroom affect its mood and atmosphere?
Scene mood reasoning

More examples in the paper (Figures 9–10).