InsightVQA: High-Dimensional Emotion-Cognitive Visual Question Answering Benchmark
Shiyu Wang, Ziyu Liu, Chaoyi Yu, Yujie Yin, Zhongqian Mao, Jing Chen, Jiaqi Song, Yunshi Lan, Yan Wang
East China Normal University
Figure 1: Overview of InsightVQA. The proposed task decomposes multimodal emotional reasoning into three stages: Perception for recognizing emotion and valence, Understanding for grounding emotional triggers, and Cognition for predicting response intent and performing sequential insight reasoning.
Abstract
Visual emotion understanding requires models not only to recognize emotional states, but also to explain why they arise and to perform higher-level cognitive reasoning. However, existing benchmarks mainly focus on emotion recognition, offering limited support for grounded understanding and response-oriented analysis. To address this gap, we introduce InsightVQA, a large-scale dataset for hierarchical visual question answering on emotion understanding and cognitive reasoning. Starting from 351K images collected from six public sources, we apply a rigorous multi-stage filtering pipeline to curate 138K high-confidence images. Each image is annotated at three hierarchical levels: perception QA for emotion and valence recognition, grounded understanding QA constructed from extracted visual triggers through constraint-guided generation, and cognition QA centered on response intent prediction and sequential insight reasoning. We further present InsightVQA-Bench, a high-quality evaluation benchmark comprising 30K samples for fine-grained evaluation. To support evaluation, we also introduce InsightNet, an emotion-tuned MLLM baseline. Results demonstrate that InsightVQA poses significant challenges for grounded emotion understanding and reasoning.
InsightVQA Dataset Pipeline
Figure 2: InsightVQA dataset construction pipeline. The pipeline comprises three stages: (A1) Image collection and emotion classification from 351K images across six public sources, producing 138K balanced samples; (A2) Perception annotation via template-based QA generation, yielding 276K label and valence QA pairs; (B) Grounded understanding annotation using VLM-extracted visual triggers and LLM-based QA generation, producing 330K visual, contextual, and counterfactual QA pairs; and (C) Cognition annotation via structured prompts and human-curated samples, generating 119K intent and insight QA pairs. Quality is controlled by rule-based filtering and VLM checking throughout. The final InsightVQA dataset comprises 138K images and 725K QA pairs.
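For illustration, a single annotated image might carry QA pairs at all three levels along the lines of the sketch below; the field names and example answers are hypothetical and do not reflect the released data format.

```python
# Hypothetical layout of one InsightVQA record; all field names and values are
# illustrative assumptions, not the released annotation schema.
sample = {
    "image_id": "insightvqa_000123",  # assumed identifier format
    "perception": [
        {"question": "Which emotion does the main person express?", "answer": "sadness"},
        {"question": "Is the overall emotional valence positive or negative?", "answer": "negative"},
    ],
    "understanding": [
        {"type": "visual",  # visual / contextual / counterfactual
         "question": "Which visual element most likely triggers this emotion?",
         "answer": "the wilted flowers left on the empty bench"},
    ],
    "cognition": [
        {"type": "intent",
         "question": "How would a supportive bystander most likely respond?",
         "answer": "approach the person and offer comfort"},
    ],
}
```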
Methods: InsightNet
Building on the InsightVQA dataset, we propose InsightNet, a unified architecture for human state understanding. To instill foundational reasoning capabilities, we perform LoRA-based fine-tuning on InsightVQA, enabling the model to capture hierarchical perception, understanding, and cognition for high-dimensional cognitive–affective representations.
To enhance both cognitive depth and generalization in complex emotional scenarios, we further apply instruction-driven supervised fine-tuning (SFT) to Qwen2.5-VL-7B on the InsightVQA dataset. Because full-parameter fine-tuning is prone to overfitting on emotionally biased data, we adopt LoRA as an implicit regularization mechanism that reduces the number of trainable parameters via low-rank decomposition.
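For concreteness, the snippet below sketches what such a LoRA adapter setup could look like in the Hugging Face transformers + peft stack; the rank, scaling factor, dropout, and target modules are illustrative assumptions rather than the configuration used for InsightNet.

```python
# Minimal sketch of a LoRA-based SFT setup for Qwen2.5-VL-7B, assuming a recent
# transformers release that ships the Qwen2.5-VL classes and the peft library.
# All hyperparameters below are illustrative assumptions, not InsightNet's values.
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from peft import LoraConfig, get_peft_model

model_name = "Qwen/Qwen2.5-VL-7B-Instruct"
processor = AutoProcessor.from_pretrained(model_name)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(model_name)

# Low-rank adapters on the attention projections keep the 7B base weights frozen,
# so only the rank-r update matrices are trained, which gives the implicit
# regularization effect described above.
lora_config = LoraConfig(
    r=16,                    # rank of the low-rank decomposition (assumed)
    lora_alpha=32,           # scaling factor (assumed)
    lora_dropout=0.05,       # adapter dropout (assumed)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the base parameters
```

The adapters can then be trained with a standard SFT loop over the three-level QA pairs, while the frozen backbone preserves the general vision-language capabilities of the base model.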
Experiments
Table 1: Comparison of model performance across the Perception, Understanding, and Cognition tasks.

| Model | Year | Perception ACC | Perception F1 | Perception Precision | Perception Recall | Understanding Ranking | Understanding Top-1 | Cognition ACC |
|---|---|---|---|---|---|---|---|---|
| **Open-source MLLMs** | | | | | | | | |
| Qwen2.5-VL-7B | 2025 | 57.95 | 88.24 | 87.74 | 88.80 | 34.95 | 51.27 | 30.92 |
| LLaVA-OneVision-1.5-8B | 2025 | 56.74 | 87.81 | 88.60 | 87.07 | 60.84 | 67.50 | 45.97 |
| InternVL3.5-8B | 2025 | 53.40 | 88.54 | 88.53 | 88.60 | 31.81 | 43.44 | 31.18 |
| Deepseek-VL-7B-chat | 2024 | 52.54 | 87.25 | 88.42 | 86.12 | 40.52 | 43.48 | 38.07 |
| Qwen2.5-VL-32B | 2025 | 52.80 | 88.04 | 87.90 | 88.19 | 69.24 | 65.32 | 44.02 |
| Qwen2.5-VL-72B | 2025 | 53.83 | 88.86 | 88.13 | 89.62 | 76.30 | 56.83 | 57.24 |
| **Closed-source MLLMs** | | | | | | | | |
| Qwen3-Max | 2026 | 52.64 | 87.73 | 86.96 | 88.53 | 75.65 | 61.14 | 42.51 |
| Deepseek-V3.2 | 2025 | 51.20 | 87.86 | 87.23 | 88.53 | 80.24 | 65.77 | 41.60 |
| GPT-4o | 2024 | 50.63 | 88.34 | 88.49 | 88.24 | 77.51 | 61.06 | 50.34 |
| Gemini-2.5-flash | 2025 | 56.83 | 89.05 | 88.02 | 90.13 | 80.56 | 63.51 | 56.08 |
| Claude-3.7-sonnet | 2026 | 56.37 | 88.72 | 89.16 | 88.31 | 81.77 | 63.49 | 61.87 |
| **Emotion-Oriented MLLMs** | | | | | | | | |
| EmoViT | 2024 | 53.69 | 84.50 | 84.85 | 84.19 | - | 12.53 | 42.86 |
| Emotion-Qwen | 2025 | 52.98 | 86.92 | 86.37 | 87.48 | - | 33.61 | 30.18 |
| EmoCaliber | 2025 | 40.29 | 84.23 | 81.52 | 87.18 | - | 25.47 | 27.92 |
| InsightNet (Ours) | 2026 | 76.25 | 90.56 | 90.50 | 90.63 | 82.79 | 71.21 | 69.18 |

"-" indicates that the model is not applicable to the evaluation.
BibTeX
If you find InsightVQA useful in your research, please consider citing:
@article{wang2026insightvqa,
title = {InsightVQA: High-Dimensional Emotion-Cognitive Visual Question Answering Benchmark},
author = {Wang, Shiyu and Liu, Ziyu and Yu, Chaoyi and Yin, Yujie and Mao, Zhongqian and Chen, Jing and Song, Jiaqi and Lan, Yunshi and Wang, Yan},
journal = {arXiv preprint arXiv:XXXX.XXXXX},
year = {2026}
}