Self-Evolving Visual Questioner

Liang, Yijun; Zhou, Hengguang; Li, Ming; Li, Lichen; Hsieh, Cho-Jui; Zhou, Tianyi

Yijun Liang¹, Hengguang Zhou², Ming Li¹, Lichen Li³, Cho-Jui Hsieh²,⁴, Tianyi Zhou⁵

¹University of Maryland, College Park ²University of California, Los Angeles
³Peking University ⁴Arena ⁵MBZUAI

Abstract

A VLM teaches itself to ask better visual questions — harder, more grounded, more diverse — and becomes a stronger questioner without losing its answering ability.

No human labels · No teacher models

Vision-language models (VLMs) are typically trained as passive answerers, while their ability to actively ask diverse, non-trivial, visual-centric, and grounded questions remains underexplored. Existing visual questioners are bottlenecked by the availability of high-quality training data or the cost of curating it. We show that a VLM can continuously improve itself as a visual questioner without any external supervision. We propose a self-evolving framework that uses a VLM itself as both a proposer and a filter to produce harder, more informative, and visual-centric questions, while maintaining exploration diversity to avoid training collapse. These questions are then used to train the VLM in both questioner and answerer modes. To evaluate the questioner, we introduce an agentic protocol that assesses questions along perception, reasoning, and diversity dimensions. Experiments across various backbone VLMs show that our method substantially enhances the quality and expands the difficulty boundary of autonomous question generation. Under the same budget, our self-supervision is more effective than training on static source data — and the self-evolving questioner remains a competitive or even better answerer.

Overview

A VLM that learns to ask better questions by proposing, rewriting, filtering, and learning from its own questions.

1 Propose — the current model M_t generates candidate questions from unlabeled images and visual-intent prompts.

2 Rewrite — a stable base model M₀ makes them harder and more visual-centric.

3 Filter — M₀ keeps only questions that are answerable, better-grounded, and more demanding.

4 Train — the refined data trains the model in both question-generation (QG) and question-answering (QA) formats.

↻ Repeat — the updated model M_{t+1} becomes the next proposer, forming a continuous self-evolution loop.
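The loop above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: every callable (`propose`, `rewrite`, `keep`, `answer`, `train`) is a hypothetical stand-in for a VLM call, and the page does not say where the answers for QA-format training come from, so the `answer` callable is an assumption.

```python
from typing import Callable, Dict, List

# All callables below are hypothetical stand-ins for VLM calls.
Propose = Callable[[str, str], str]     # (image_id, intent_prompt) -> question from M_t
Rewrite = Callable[[str, str], str]     # (image_id, question) -> harder question from M_0
Keep = Callable[[str, str, str], bool]  # (image_id, original, rewritten) -> M_0 filter verdict
Answer = Callable[[str, str], str]      # (image_id, question) -> answer (assumed source)
Record = Dict[str, str]
Train = Callable[[Propose, List[Record]], Propose]  # returns the updated proposer


def to_training_records(image_id: str, question: str, answer: str) -> List[Record]:
    """Format one kept question in both QG and QA modes (step 4)."""
    return [
        {"mode": "QG", "image_id": image_id, "target": question},
        {"mode": "QA", "image_id": image_id, "prompt": question, "target": answer},
    ]


def evolve(
    image_ids: List[str],
    intent_prompts: List[str],
    propose: Propose,
    rewrite: Rewrite,
    keep: Keep,
    answer: Answer,
    train: Train,
    rounds: int = 2,
) -> Propose:
    """Run the propose -> rewrite -> filter -> train loop for a fixed number of rounds."""
    for _ in range(rounds):
        records: List[Record] = []
        for image_id in image_ids:
            for intent in intent_prompts:
                question = propose(image_id, intent)        # 1 Propose with M_t
                harder = rewrite(image_id, question)        # 2 Rewrite with M_0
                if keep(image_id, question, harder):        # 3 Filter with M_0
                    records.extend(
                        to_training_records(image_id, harder, answer(image_id, harder))
                    )
        propose = train(propose, records)                   # 4 Train, then repeat with M_{t+1}
    return propose
```

Keeping the rewriter and filter fixed (M₀) while only the proposer changes mirrors the "stable base model" described in steps 2 and 3.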

Why it works: harder questions demand richer visual evidence and deeper reasoning, so the model’s own questions become a more effective training signal than static data. This requires no external supervision and keeps the model’s answering ability intact.

Evaluation Framework

How do you measure whether a generated question is actually good? We introduce an agentic protocol that goes beyond surface fluency.

Evaluating question quality is fundamentally harder than evaluating answers: there is no single ground-truth question for a given image, and naive metrics (BLEU, CIDEr) reward lexical overlap rather than visual grounding or cognitive demand. We design an agentic evaluation protocol that uses a VLM judge to probe each generated question across five interpretable dimensions.
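A toy example shows why lexical overlap is a poor proxy for question quality. The sketch below uses unigram Jaccard overlap as a deliberately simplified stand-in for BLEU or CIDEr (it is not either metric). The reference and the grounded question are taken from the examples on this page; the "water bottle" question is a made-up distractor.

```python
import re


def tokens(text: str) -> set:
    """Lowercase word tokens, ignoring punctuation."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Unigram Jaccard overlap: a simplified stand-in for n-gram metrics."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


reference = "What color is the water tower?"
# Hypothetical distractor: nearly identical wording, different (wrong) object.
near_copy = "What color is the water bottle?"
# Round-2 question from this page: more demanding, but worded very differently.
grounded = (
    "Which flag is larger, the American flag on the water tower "
    "or the red flag near the trees?"
)

overlap_near_copy = jaccard(reference, near_copy)
overlap_grounded = jaccard(reference, grounded)
```

Here the near-copy scores higher than the grounded comparison question, even though it asks about a different object and demands no extra visual reasoning.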

Five Quality Dimensions

Search · Visual Search Difficulty: Does the question require actively searching the image for the relevant region or detail, rather than answering from a salient object?

Coverage · Visual Evidence Coverage: Does answering the question require attending to a broad portion of the image, rather than a single local patch?

Context · Visual Context Reasoning: Does the question require integrating multiple visual cues or scene-level context rather than reading off a single isolated detail?

Spatial · Visual Spatial Reasoning: Does the question probe spatial relations, such as layout, depth, size comparison, or positional reference across image regions?

Diversity · Questioning Diversity: Across a set of questions for the same image, do they cover distinct visual aspects rather than rephrasing the same idea?

Each dimension is scored on a 0–1 scale by the VLM judge, and the QG Average is the mean across all five. This multi-dimensional score is more informative than a single scalar: a model can improve spatial reasoning while still lacking contextual diversity, and the breakdown pinpoints exactly where gains are concentrated.
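The aggregation described above is a plain mean over the five dimension scores. A minimal sketch follows, assuming the per-dimension scores have already been produced by the judge and averaged per model; the dictionary keys and the example values are illustrative, not results from the paper.

```python
from statistics import mean

# Short labels for the five dimensions listed above (key names are an assumption).
DIMENSIONS = ("search", "coverage", "context", "spatial", "diversity")


def qg_average(scores: dict) -> float:
    """Mean of the five dimension scores, each expected to lie in [0, 1]."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    for d in DIMENSIONS:
        if not 0.0 <= scores[d] <= 1.0:
            raise ValueError(f"{d} score {scores[d]} is outside [0, 1]")
    return mean(scores[d] for d in DIMENSIONS)


# Made-up scores, only to show the per-dimension breakdown next to the average.
example_scores = {
    "search": 0.4,
    "coverage": 0.5,
    "context": 0.3,
    "spatial": 0.6,
    "diversity": 0.2,
}
example_average = qg_average(example_scores)
```

Reporting the dictionary alongside the average is what lets a reader see, for instance, a strong spatial score next to a weak diversity score.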

Self-Improving Question Quality

Question quality climbs every round — while answering ability holds.

~82% relative QG gain after two rounds

5/5 QG dimensions improved on every backbone

QA ≈ downstream answering stays competitive

Across Qwen2.5-VL-3B/7B and Qwen3-VL-4B, the self-evolving framework consistently lifts question-generation quality. QG average roughly doubles from the base model (e.g., 0.25 → 0.50 on the 3B), with the largest gains on spatial and contextual reasoning. The second round improves over the first: the gains compound rather than saturate.
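For reference, relative gain is the change divided by the starting value. The short sketch below applies that standard formula to the 3B example quoted above (0.25 → 0.50). How the headline "~82%" figure is aggregated is not stated on this page, so the code does not attempt to reproduce it.

```python
def relative_gain(before: float, after: float) -> float:
    """Relative improvement (after - before) / before, as a fraction."""
    if before <= 0:
        raise ValueError("baseline score must be positive")
    return (after - before) / before


# 3B example from the text: QG average 0.25 -> 0.50.
gain_3b = relative_gain(0.25, 0.50)
gain_3b_percent = 100 * gain_3b
```

On these two numbers the relative gain is 100%, which is what "doubles" means for this backbone.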

Key takeaway: better questions don’t come at the cost of answering. Dual QA+QG training keeps QA accuracy competitive (even improving on several benchmarks) while the model becomes a markedly stronger questioner.

Do Better Questions Help Supervision?

With answers held fixed, better questions alone make better training data.

Downstream QA from base vs improved question sources

61.9 → 63.3 average QA from better questions

+6.3 CVBench-3D (69.3 → 75.6)

Are better questions useful beyond the questioning task? We build two QA training sets that differ in only one thing — whether the questions come from the base model (Base-Q) or the improved questioner (Improved-Q). GPT generates the answers for both, so any difference comes purely from question quality.
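The controlled comparison can be sketched as building two QA sets from the same images and the same answerer, swapping only the questioner. The function and argument names below are hypothetical; in the experiment described above, the answerer role is played by GPT and the questioners are the base and improved models.

```python
from typing import Callable, Dict, List

Questioner = Callable[[str], str]     # image_id -> question (hypothetical interface)
Answerer = Callable[[str, str], str]  # (image_id, question) -> answer (hypothetical interface)


def build_qa_set(
    image_ids: List[str], questioner: Questioner, answerer: Answerer
) -> List[Dict[str, str]]:
    """Build one QA training set: one question per image, answered by a shared answerer."""
    records = []
    for image_id in image_ids:
        question = questioner(image_id)
        records.append(
            {
                "image_id": image_id,
                "question": question,
                "answer": answerer(image_id, question),
            }
        )
    return records


def same_images(set_a: List[Dict[str, str]], set_b: List[Dict[str, str]]) -> bool:
    """Sanity check that two sets cover the same images in the same order."""
    return [r["image_id"] for r in set_a] == [r["image_id"] for r in set_b]
```

Calling `build_qa_set` twice with the same `image_ids` and `answerer`, once per questioner, yields the Base-Q and Improved-Q sets; `same_images` confirms that the image side of the comparison is held fixed.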

Key takeaway: questions from the improved questioner train a stronger answerer — +1.4 average and a striking +6.3 on CVBench-3D. Questions that demand richer visual evidence and spatial reasoning are simply more informative supervision.

Watch the Questions Evolve

From shallow recognition to grounded, multi-step reasoning — on the same image, across self-evolving rounds.

M₀ · Base: What is the color and design of the water tower?

M₁ · Round 1: What color is the water tower with “Balsam Lake” written on it?

M₂ · Round 2: Which flag is larger, the American flag on the water tower or the red flag near the trees?

Relative-size comparison across regions

M₀ · Base: What is the color and condition of the grass?

M₁ · Round 1: What is the purpose of the white building in the background in this image?

M₂ · Round 2: Which building is closer, the grandstand on the right or the small white building on the left?

Depth and layout reasoning

M₀ · Base: What is the color scheme of the bedroom?

M₁ · Round 1: What items have the color yellow on the bed in this bedroom?

M₂ · Round 2: How does the color scheme of the bedroom affect its mood and atmosphere?

Scene mood reasoning

More examples in the paper (Figures 9–10).