Official public benchmark release for VisDoT: Enhancing Visual Reasoning through Human-Like Interpretation Grounding and Decomposition of Thought, with an additional mirror on arXiv:2603.11631.
VisDoTQA is a chart visual reasoning benchmark introduced in VisDoT: Enhancing Visual Reasoning through Human-Like Interpretation Grounding and Decomposition of Thought.
Large vision-language models often struggle to reliably ground visual primitives in charts and align them with semantic references. VisDoT addresses this challenge through perception-grounded task design and decomposition-of-thought (DoT) reasoning, and VisDoTQA is introduced as a benchmark for evaluating these capabilities.
- Paper (ACL Anthology, canonical)
- Paper (arXiv:2603.11631 mirror)
- DOI
- GitHub Repository
- Hugging Face Dataset
VisDoTQA is designed to evaluate visual grounding and compositional reasoning on chart images through four core task types:
- Position
- Length
- Pattern
- Extract
These task families follow the perception-grounded taxonomy introduced in the paper.
This repository releases the public VisDoTQA test set only.
The public test set consists of visually grounded and complex chart questions, enabling evaluation of both visual grounding ability and compositional reasoning ability. The questions require models to identify relevant chart elements, interpret perceptual cues, and reason over multiple pieces of visual information.
- 1,120 test samples
- 609 held-out charts used to construct the test set
- Position: 350 samples. Compare object positions along a common scale.
- Length: 240 samples. Reason over length-based visual encodings.
- Pattern: 267 samples. Map visual patterns or category cues to chart elements.
- Extract: 263 samples. Read explicitly shown values from charts.
The full VisDoTQA dataset described in the paper contains 331,969 QA pairs across the same four perceptual task types. This repository does not release the full research dataset; it releases the public benchmark test set only.
VisDoTQA is organized around four core perceptual task families:
Position
Compares object positions along a common scale to determine relative order.

Length
Uses length as a perceptual cue for comparing chart elements.

Pattern
Links pattern cues to legends and data categories for visual label mapping.

Extract
Reads explicitly shown values from charts and evaluates direct numerical recognition.
This release includes:
- data/VisDoTQA.json
- data/test.jsonl
- data/images/
Each record in data/VisDoTQA.json contains the following fields:
- imgname
- query
- label
- source
The Hugging Face dataset viewer uses data/test.jsonl, which contains the same QA rows plus an image column for rendering chart images in the table view.
- imgname: image filename for the chart instance
- query: benchmark question
- label: ground-truth answer
- source: VisDoTQA task category
- image: relative image path used by the Hugging Face dataset viewer
In this public release, the source field denotes the VisDoTQA task category:
- Position
- Length
- Pattern
- Extract
It does not denote the original chart source website.
- Explanation fields from the internal research artifact are intentionally excluded from this public release.
- Only evaluation-facing fields required for benchmark use are included.
- Each JSON record is expected to resolve to a matching image file in data/images/.
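The documented record schema can be sanity-checked programmatically. The sketch below is illustrative only: the field names come from the list above, while the sample record and the `validate_record` helper are fabricated for this example.

```python
# Fields documented for records in data/VisDoTQA.json.
REQUIRED_FIELDS = {"imgname", "query", "label", "source"}
# In this release, `source` denotes the VisDoTQA task category.
TASK_CATEGORIES = {"Position", "Length", "Pattern", "Extract"}

def validate_record(record: dict) -> list:
    """Return a list of schema problems for one QA record (empty if clean)."""
    problems = []
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    elif record["source"] not in TASK_CATEGORIES:
        problems.append(f"unknown task category: {record['source']!r}")
    return problems

# Fabricated sample record, shaped like the documented schema.
sample = {"imgname": "chart_0001.png", "query": "Which bar is tallest?",
          "label": "B", "source": "Position"}
print(validate_record(sample))  # → []
```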
VisDoTQA/
├── README.md
├── CITATION.cff
└── data/
├── VisDoTQA.json
├── test.jsonl
└── images/
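The expectation that every record resolves to an image under data/images/ can be checked with a short script. This is a sketch, not part of the release: the function name and defaults are illustrative, and it assumes data/VisDoTQA.json holds a JSON array of records.

```python
import json
from pathlib import Path

def missing_images(json_path: str = "data/VisDoTQA.json",
                   image_dir: str = "data/images") -> list:
    """Return imgname values whose files are absent from the image directory."""
    records = json.loads(Path(json_path).read_text(encoding="utf-8"))
    root = Path(image_dir)
    return [r["imgname"] for r in records if not (root / r["imgname"]).is_file()]
```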
VisDoTQA is evaluated using Relaxed Accuracy (RA), following the evaluation protocol described in the paper.
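The authoritative RA definition is in the paper. As a rough reference, Relaxed Accuracy in the chart-QA literature (e.g., ChartQA) usually accepts a numeric prediction within a 5% relative tolerance of the gold value and otherwise requires an exact string match; a minimal sketch under that common convention:

```python
def relaxed_match(pred: str, label: str, tol: float = 0.05) -> bool:
    """ChartQA-style relaxed match: 5% relative tolerance for numeric answers,
    case-insensitive exact match for everything else. The official VisDoTQA
    protocol is defined in the paper and may differ in details."""
    try:
        p, g = float(pred), float(label)
    except ValueError:
        return pred.strip().lower() == label.strip().lower()
    if g == 0.0:
        return p == 0.0
    return abs(p - g) / abs(g) <= tol

def relaxed_accuracy(pairs) -> float:
    """Fraction of (prediction, label) pairs that match under relaxed_match."""
    return sum(relaxed_match(p, g) for p, g in pairs) / len(pairs)
```

For example, under this convention "24.0" against a gold label of "25" counts as correct (4% off), while "30" does not (20% off).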
In the paper, VisDoT is evaluated on ChartQA, ChartQAPro, and VisDoTQA. The reported results show strong gains from combining perception-following supervision with DoT-based reasoning, including a +33.2% improvement on VisDoTQA for the reported InternVL-4B setting.
The canonical publication record for this dataset is the ACL Anthology page below. We also provide the arXiv mirror for convenience.
If you use this dataset in your research, please cite our paper:
@inproceedings{lee2026visdot,
title={VisDoT: Enhancing Visual Reasoning through Human-Like Interpretation Grounding and Decomposition of Thought},
author={Lee, Eunsoo and Lee, Jeongwoo and Hong, Minki and Choi, Jangho and Kim, Jihie},
booktitle={Findings of the Association for Computational Linguistics: EACL 2026},
pages={610--640},
year={2026},
doi={10.18653/v1/2026.findings-eacl.30},
url={https://aclanthology.org/2026.findings-eacl.30/}
}

License information for this release is pending.