Liyan Tang1,2, Shreyas Pimpalgaonkar2, Kartik Sharma2, Alexandros G. Dimakis2, Mahesh Sathiamoorthy2, Greg Durrett1,2
1Bespoke Labs, 2The University of Texas at Austin
TL;DR
We present Bespoke-MiniChart-7B, a 7B open chart-understanding model that sets a new state-of-the-art in chart question answering for models of its size, outperforming larger models like Gemini-1.5-Pro and Claude-3.5 across seven benchmarks.
You can try the model at the playground. The model and the inference code are available on Hugging Face 🤗.
1. Data curation matters: Carefully curated synthetic data can significantly enhance pre-trained Large Vision Language Models (LVLMs), improving their robustness to out-of-domain charts and questions.
2. DPO is critical: Unlike in math tasks, SFT alone is not enough to reach the best performance. DPO helps improve chain-of-thought reasoning in chart understanding.
Chart understanding, or being able to answer questions and evaluate claims about charts, is an interesting testbed for LVLMs. This task requires that LVLMs accurately parse visual information and perform reasoning tasks. Some reasoning tasks such as identifying trends in data are distinct from those observed in other problem domains.
Problem Setup: Chart Question Answering - Given a chart image i and a corresponding question q, the model takes both as input and produces the final answer a: M(i, q) = a.
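For concreteness, here is a minimal inference sketch for M(i, q) = a, written against the standard Qwen2.5-VL interface in transformers (the model family we fine-tune, as described below). The repository ID, image path, and question are placeholders; see the Bespoke-MiniChart-7B Hugging Face repo for the exact inference code and prompt.

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_id = "bespokelabs/Bespoke-MiniChart-7B"  # placeholder; check the Hugging Face repo for the exact ID
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# i = the chart image, q = the question
messages = [{"role": "user", "content": [
    {"type": "image", "image": "chart.png"},
    {"type": "text", "text": "Which category grew the most between 2020 and 2023?"},
]}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)

# a = the model's final answer (preceded by its chain-of-thought)
output_ids = model.generate(**inputs, max_new_tokens=1024)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```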
Existing training data for chart question answering does not reflect the real-world complexities of charts and questions. Many datasets use synthetic images generated by models (Yang et al., 2025; He et al., 2024; Xu et al., 2024), which have limited visual diversity and are error-prone due to the imperfect chart generation process. On the question side, these datasets frequently use template-generated questions that primarily focus on basic information extraction and require limited reasoning (Methani et al., 2020; Kahou et al., 2018). Modern LVLMs are already equipped to handle relatively complex chart questions and do not benefit from training on such data.
To address these challenges, we developed a rigorous data generation pipeline aimed at curating a high-quality training dataset for chart question answering from real-world images. We collected 40K real-world images from online sources and processed them through a four-step pipeline to construct a high-quality synthetic dataset for chart QA. The data construction was done efficiently with the help of Bespoke Curator. To ensure the model's generalizability, we intentionally avoided using questions or images similar to those in the benchmarks described later.
Our initial small-scale experiments showed that directly generating questions with LLMs yielded simple questions with low diversity, many of which were overly general or unanswerable from the image. We therefore developed a novel pipeline for generating questions by reasoning over atomic facts extracted from images.
For each image, we extracted atomic facts using a large LVLM. These atomic facts serve as the basis for question generation and answer verification in subsequent steps. To ensure consistency in model responses to the questions generated in Step 2, we removed images where the extracted facts contained uncertainty words (e.g., "approximately," "roughly," "about"); such facts made it difficult to generate questions with uniquely specified answers later in the pipeline. This filtered out around 15% of the images we started with. The extracted facts may still contain information-extraction errors that misalign with the image content, so we implemented additional quality control in Step 3 to filter out such cases.
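As a concrete illustration, here is a minimal sketch of the uncertainty filter, assuming the atomic facts for each image are stored as a list of strings; the word list and example facts below are illustrative rather than the exact ones we used.

```python
# Illustrative uncertainty-word filter for Step 1 (the word list is an assumption).
UNCERTAINTY_WORDS = {"approximately", "roughly", "about", "around", "nearly"}

def has_uncertain_fact(facts: list[str]) -> bool:
    """Return True if any extracted atomic fact hedges a value with an uncertainty word."""
    return any(word in fact.lower().split() for fact in facts for word in UNCERTAINTY_WORDS)

image_to_facts = {
    "chart_001.png": ["Revenue in 2020 was 4.2M.", "Revenue in 2021 was 5.1M."],
    "chart_002.png": ["Roughly 30% of respondents chose option A."],
}

# Keep only images whose atomic facts are stated precisely (~85% of images survived this step).
clean_images = {img: facts for img, facts in image_to_facts.items()
                if not has_uncertain_fact(facts)}
print(list(clean_images))  # only chart_001.png remains
```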
To reduce the data curation costs and simplify the quality control process mentioned in Step 3, we generated questions directly from the extracted atomic facts.
Specifically, we prompted the LLM to first generate step-by-step reasoning (e.g., mathematical computations and logical reasoning) from the atomic facts and then derive the final question from these intermediate steps. We found that questions generated this way require more diverse reasoning skills. For each image, we sampled 12 questions with varying numbers of reasoning steps and then removed duplicate questions with the same answers.
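Below is a hypothetical sketch of this sampling loop; `generate_question` stands in for the LLM call (orchestrated with Bespoke Curator) that reasons over the atomic facts for a given number of steps and returns a question-answer pair, and the step range is an assumption.

```python
import random

def sample_questions(facts: list[str], generate_question, n_samples: int = 12):
    """Sample questions with varying reasoning depth, then drop duplicates with the same answer."""
    seen, kept = set(), []
    for _ in range(n_samples):
        num_steps = random.randint(1, 4)  # vary the length of the intermediate reasoning (assumed range)
        question, answer = generate_question(facts, num_steps)
        key = (question.strip().lower(), str(answer).strip().lower())
        if key not in seen:
            seen.add(key)
            kept.append({"question": question, "answer": answer})
    return kept
```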
We generate and verify answers from both the facts and the images. First, to generate an answer from the atomic facts, we used two distinct LLMs to produce step-by-step CoT answers for each question. We kept only questions where both models agreed on the final answer; this filtered out 20% of questions.
Then, we further validated answers by prompting an LVLM to answer the question directly from the image. We kept only examples where the answers derived from the atomic facts matched those obtained directly from the image, removing an additional 20% of questions. This filtering ensures that the question is answerable from the image and that the answer is consistent with the extracted atomic facts. By the end of Step 3, 60% of the original questions from Step 2 remained.
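A sketch of these two filters, assuming callables that wrap the underlying models (two text-only LLMs that answer from the facts, one LVLM that answers from the image, and an answer-matching check); the function names are illustrative.

```python
def passes_step3_filters(question, facts, image_path,
                         llm_a, llm_b, lvlm, answers_match) -> bool:
    """Keep a question only if the fact-based and image-based answers all agree."""
    # Filter 1: two distinct LLMs must reach the same final answer from the atomic facts.
    answer_a = llm_a(question, facts)
    answer_b = llm_b(question, facts)
    if not answers_match(answer_a, answer_b):
        return False  # ~20% of questions were dropped here
    # Filter 2: an LVLM answering directly from the image must agree with the fact-based answer.
    answer_image = lvlm(question, image_path)
    return answers_match(answer_a, answer_image)  # another ~20% dropped here
```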
To further enhance the dataset's diversity, we augmented each remaining question from Step 3 by replacing key entities in the question with alternative entities from the atomic facts. We find that this effectively constructs questions that push the model to extract information from different parts of the chart than the original question, bringing higher diversity to the final dataset. We then applied the same answer-generation and filtering pipeline from Step 3 to these augmented questions.
The final training dataset contains 270K chart-question-CoT-answer tuples, built from 13K images and 91K curated unique question-answer pairs with 3 CoT traces each. A training example can be found above.
We chose Qwen2.5-VL-7B-Instruct as our base model due to its strong baseline performance across popular benchmarks. Our training approach followed a three-stage framework.
Stage 1: Supervised Fine-tuning (SFT) warmup
We began with an initial supervised fine-tuning phase using our curated dataset of 270K CoT reasoning traces. This training phase produced our first model variant, M1.
Stage 2: CoT Collection via Rejection Sampling
Recent work (Li et al., 2025; Guan et al., 2025; Yin et al., 2025) has shown that training a model on its own traces yields better performance than training on traces from larger teacher models, likely due to distribution shift. We therefore use rejection sampling to collect high-quality in-distribution training data.
In particular, for each unique chart-question-answer tuple, we sampled 16 CoT traces from our M1 model and kept only the reasoning paths that reached the correct answer (up to 8 per question). For examples where the model did not produce a correct answer, we sampled an additional 16 CoTs each to collect correct traces. This process expanded our training corpus from 270K to 1M high-quality reasoning examples. We then ran a second round of SFT on these 1M examples to obtain our model M2.
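A minimal sketch of this rejection-sampling loop, assuming a `sample_cot` helper that draws one CoT trace from M1 and an `is_correct` checker that compares its final answer to the gold answer (both names are placeholders):

```python
def collect_correct_cots(image, question, gold_answer,
                         sample_cot, is_correct,
                         n_samples: int = 16, max_keep: int = 8):
    """Keep up to 8 correct CoTs out of 16 samples; resample once if none are correct."""
    traces = [sample_cot(image, question) for _ in range(n_samples)]
    kept = [t for t in traces if is_correct(t, gold_answer)][:max_keep]
    if kept:
        return kept
    # No correct trace in the first round: sample an additional 16 CoTs.
    traces = [sample_cot(image, question) for _ in range(n_samples)]
    return [t for t in traces if is_correct(t, gold_answer)][:max_keep]
```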
Stage 3: DPO training
Finally, we used DPO training to further improve performance. Starting from M2, we again sampled 16 CoTs for each unique question-answer pair in the training dataset, evaluated the correctness of each trace, and constructed 240K DPO preference pairs. After 800 optimization steps, we obtained our final model, Bespoke-MiniChart-7B.
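A hedged sketch of how such preference data can be assembled: correct CoTs become "chosen" responses and incorrect ones become "rejected" responses for the same prompt. The pairing strategy and cap below are assumptions on our part, not the exact recipe behind the 240K pairs.

```python
def build_dpo_pairs(prompt, traces, gold_answer, is_correct, max_pairs: int = 4):
    """Pair correct (chosen) and incorrect (rejected) CoTs sampled for the same question."""
    correct = [t for t in traces if is_correct(t, gold_answer)]
    incorrect = [t for t in traces if not is_correct(t, gold_answer)]
    return [{"prompt": prompt, "chosen": c, "rejected": r}
            for c, r in list(zip(correct, incorrect))[:max_pairs]]
```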
Evaluation Benchmarks: We evaluate on the test sets of a collection of recent chart question answering datasets, including ChartQAPro (Masry et al., 2025), ChartQA (Masry et al., 2022), EvoChart (Huang et al., 2025), CharXiv (Wang et al., 2025), ChartX (Xia et al., 2025), ChartBench (Xu et al., 2024), and MMC (Liu et al., 2024). For ChartX, ChartBench, and MMC, we focus on the QA subset of each dataset. For ChartQAPro, we focus on answerable, single-turn questions. For the fact-checking problems in ChartQAPro, we prepend "Determine whether the following statement is supported by the image (Yes/No): " to the claims to be checked. The number of examples from each dataset is included in Table 1.
Comparison Models: We benchmark the performance of state-of-the-art LVLMs for chart QA, including the following models. All models use the same prompt across benchmarks; the prompt can be found in the Bespoke-MiniChart-7B Hugging Face repo.
Evaluation Metric: Following previous work, we use LLM-as-a-judge to evaluate the models' answers against the ground truth and allow a 5% error margin for numeric answers. For ChartQAPro, we use exact match for questions where the answer is a year, to avoid bias from minimal differences.
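The numeric side of this metric can be expressed directly: a numeric prediction counts as correct when it is within 5% of the gold value, while year answers in ChartQAPro require an exact match. The LLM judge handles the remaining free-form answers and is not shown in this sketch.

```python
def numeric_match(pred: float, gold: float, rel_tol: float = 0.05) -> bool:
    """Allow a 5% relative error margin for numeric answers."""
    if gold == 0:
        return pred == 0
    return abs(pred - gold) / abs(gold) <= rel_tol

def year_match(pred: str, gold: str) -> bool:
    """Years are compared exactly so near-misses like 2019 vs. 2020 are not rewarded."""
    return pred.strip() == gold.strip()

assert numeric_match(95.1, 100.0)       # within the 5% margin
assert not numeric_match(94.0, 100.0)   # outside the margin
assert not year_match("2019", "2020")
```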
Our model achieves state-of-the-art performance on chart understanding among models of similar size. It even surpasses closed models such as Gemini-1.5-Pro and Claude-3.5.
Finding 1: Training on synthetic data can improve a strong pre-trained LVLM
We now analyze the importance of each training stage in our framework. As shown in Table 2, after the warmup training stage alone, our model M1 already performs better than the base model across most benchmark datasets. Note that our model is not trained on chart examples or questions from our evaluation datasets. This shows that with properly curated synthetic data, we can ensure the model's robustness and generalizability to unseen charts and question types.
Finding 2: DPO is important in refining models’ CoT reasoning
Adding the DPO stage consistently improves model performance across datasets, by an additional 1.5% on average. We also experimented with scaling up the SFT training data to see whether an equivalent effect could be achieved. However, when we added more SFT data from M2's rejection-sampled traces, the model started to lose generalizability across datasets.
The examples below showcase how Bespoke-MiniChart-7B can perform both visual perception and textual reasoning.
When Bespoke-MiniChart-7B makes mistakes, they usually fall into one of three categories. The first two error types are more common than the third.
Error 1: Vision Perception Errors
We find that both our model and SOTA closed-source models make perception errors, especially when visual information is tightly packed together. Our hypothesis is that this stems from limitations of existing visual encoders.
Error 2: Mathematical/Logical Errors
Although Bespoke-MiniChart-7B can perform mathematical computations correctly, as shown in the first example, it may still make simple math mistakes. When such errors occur, sampling another answer from the model usually yields the correct one. We hypothesize that additional focused math training could improve this, but it may trade off against the ability to do visual reasoning.
Error 3: Inconsistency between thinking and final answer
We occasionally found inconsistencies between the thinking and the final answer. This happens most often when Bespoke-MiniChart-7B's final answer is wrong, and occasionally when the thinking is incorrect but the final answer is correct.
We think the Bespoke-MiniChart-7B model is a great starting point for further open model development! Visual reasoning is an interesting testbed showing some differences from mathematical reasoning.
Our model has some limitations, including its inability to address unanswerable questions. Moreover, on very out-of-domain charts or graphics, we see that frontier models like Claude-3.7 perform substantially better. We’re interested in studying this more and understanding whether it’s a fundamental limitation of model size or a consequence of the data.
If you found our work useful, please consider citing our work.
@misc{bespoke_minichart_7b,
  author = {Liyan Tang and Shreyas Pimpalgaonkar and Kartik Sharma and Alexandros G. Dimakis and Maheswaran Sathiamoorthy and Greg Durrett},
  title = {Bespoke-MiniChart-7B: pushing the frontiers of open VLMs for chart understanding},
  howpublished = {https://www.bespokelabs.ai/blog/bespoke-minichart-7b},
  note = {Accessed: 2025-04-23},
  year = {2025}
}
Join the Bespoke Labs community on Discord and follow us on Twitter and LinkedIn for discussions and future updates.