We trained Bespoke-Stratos-32B, our reasoning model distilled from DeepSeek-R1 using Berkeley NovaSky’s Sky-T1 data pipeline. The model outperforms Sky-T1 and o1-preview on reasoning (math and code) benchmarks and almost reaches the performance of DeepSeek-R1-Distill-Qwen-32B while being trained on 47x fewer examples:


| Benchmark | Bespoke-Stratos-32B | Sky-T1-32B | o1-preview | DeepSeek-R1 (reported) | DeepSeek-R1-Distill-Qwen-32B (ours / reported) |
|---|---|---|---|---|---|
| AIME2024 | 63.3 | 43.3 | 40.0 | 79.8 | 66.7 / 72.6 |
| MATH500 | 93.0 | 82.4 | 81.4 | 97.3 | 89.8 / 94.3 |
| GPQA-Diamond | 58.1 | 56.8 | 75.2 | 71.5 | 61.1 / 62.1 |
| LiveCodeBench v2 Easy | 96.7 | 86.3 | 92.9 | - | 91.2 / - |
| LiveCodeBench v2 Medium | 75.2 | 56.8 | 54.9 | - | 75.7 / - |
| LiveCodeBench v2 Hard | 26.2 | 17.9 | 16.3 | - | 38.2 / - |
| LiveCodeBench v2 All | 71.1 | 57.9 | 59.1 | - | 72.2 / - |

We open-source everything, including the reasoning dataset, to continue experimenting together with the community!

Please also refer to Sky-T1’s codebase for the training and evaluation code.

Data Curation

We used Bespoke Curator to create the synthetic reasoning dataset. We ported the Sky-T1 data pipeline into Curator, making it faster and fault-tolerant, which helped us generate the reasoning dataset with DeepSeek-R1 within 1.5 hours.

Rejection sampling involves filtering out reasoning traces with incorrect solutions. This is challenging for code, where verification means executing each candidate solution, so we speed it up using a Ray cluster. We are currently integrating a code execution verifier directly into Curator, so stay tuned.
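To make the rejection sampling step concrete, here is a minimal sketch of filtering code traces by executing them against tests. It is not the pipeline's actual verifier: the trace fields (`solution`, `tests`), the function names, and the timeout are hypothetical, and a local thread pool stands in for the Ray cluster.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def passes_tests(solution_code: str, test_code: str, timeout_s: float = 5.0) -> bool:
    """Run a candidate solution followed by its tests in a fresh interpreter."""
    program = solution_code + "\n\n" + test_code
    try:
        result = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False  # a solution that does not finish in time is rejected
    return result.returncode == 0  # a failed assert exits with a non-zero code


def rejection_sample(traces, max_workers: int = 4):
    """Keep only the traces whose solution passes its tests."""
    # Threads are enough here because each check blocks on a subprocess.
    # Assumption: in a real setup this fan-out would go to a cluster instead.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        verdicts = list(
            pool.map(lambda t: passes_tests(t["solution"], t["tests"]), traces)
        )
    return [t for t, ok in zip(traces, verdicts) if ok]


# Hypothetical traces: one correct solution, one incorrect.
traces = [
    {"id": "good", "solution": "def add(a, b):\n    return a + b",
     "tests": "assert add(2, 3) == 5"},
    {"id": "bad", "solution": "def add(a, b):\n    return a - b",
     "tests": "assert add(2, 3) == 5"},
]
kept = rejection_sample(traces)
print([t["id"] for t in kept])  # ['good']
```

Running each candidate in its own process keeps a crashing or hanging solution from taking down the filter, which is also why this step is expensive and benefits from parallelism.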

We followed the same recipe as Sky-T1, with the following differences:

  • We used DeepSeek-R1 as the teacher reasoning model instead of QwQ.

  • The Sky-T1 recipe used gpt-4o-mini to reformat QwQ’s traces, whereas we did not reformat DeepSeek-R1’s. We found that DeepSeek-R1’s reasoning traces were sufficiently well-formatted and coherent for parsing and fine-tuning even without an intermediate reformatting step.

  • We used gpt-4o-mini instead of Sky-T1’s parsing logic to filter out incorrect math solutions. We found that Sky-T1’s parsing logic, which relies on regex and sympy, often fails to extract the right answer given a solution and thus tends to filter out solutions that were actually correct (an issue also reported here). Using gpt-4o-mini allowed us to reduce the number of false negatives, increasing the number of retained correct solutions from 25% to 73%.
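The false-negative problem in the last point can be illustrated with a small sketch. This is not Sky-T1's parsing logic or our gpt-4o-mini prompt: `extract_boxed`, `strict_check`, and the sample data are hypothetical, and `stub_judge` is only a deterministic placeholder marking where an LLM judge call would go.

```python
import re
from fractions import Fraction
from typing import Callable, Optional


def extract_boxed(solution: str) -> Optional[str]:
    """Return the contents of the last boxed answer; nested braces are not handled."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", solution)
    return matches[-1].strip() if matches else None


def strict_check(solution: str, reference: str) -> bool:
    """Parser-style check: exact string match on the extracted answer."""
    answer = extract_boxed(solution)
    return answer is not None and answer == reference.strip()


def stub_judge(solution: str, reference: str) -> bool:
    """Placeholder for an LLM judge (assumption: a real judge would be a model call)."""
    text = re.sub(r"\\frac\{(\d+)\}\{(\d+)\}", r"\1/\2", solution)
    answer = extract_boxed(text)
    if answer is None:
        return False
    try:
        return Fraction(answer) == Fraction(reference)
    except (ValueError, ZeroDivisionError):
        return False


def filter_solutions(samples, is_correct: Callable[[str, str], bool]):
    """Rejection sampling: keep samples that the given checker accepts."""
    return [s for s in samples if is_correct(s["solution"], s["reference"])]


# Hypothetical samples: two correct solutions and one wrong one.
samples = [
    {"solution": r"The answer is \boxed{42}.", "reference": "42"},
    {"solution": r"Thus \boxed{\frac{1}{2}}", "reference": "1/2"},
    {"solution": r"So \boxed{7}", "reference": "8"},
]
strict_kept = filter_solutions(samples, strict_check)
judge_kept = filter_solutions(samples, stub_judge)
print(len(strict_kept), len(judge_kept))  # 1 2
```

The strict checker drops the correct `\frac{1}{2}` solution because its regex cannot parse nested braces, which is the kind of false negative a more flexible judge avoids.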

7B model

We also release Bespoke-Stratos-7B, a fine-tune of Qwen2.5-7B-Instruct.


| Benchmark | Bespoke-Stratos-7B | Qwen2.5-7B-Instruct | DeepSeek-R1-Distill-Qwen-7B (ours / reported) |
|---|---|---|---|
| AIME2024 | 20.0 | 10.0 | 43.3 / 55.5 |
| MATH500 | 82.0 | 74.2 | 89.4 / 92.8 |
| GPQA-Diamond | 37.8 | 33.3 | 44.9 / 49.1 |
| LiveCodeBench v2 Easy | 71.4 | 65.9 | 81.3 / - |
| LiveCodeBench v2 Medium | 25.5 | 18.9 | 42.2 / - |
| LiveCodeBench v2 Hard | 1.6 | 3.3 | 2.4 / - |
| LiveCodeBench v2 All | 36.1 | 31.9 | 46.6 / - |

The authors of Sky-T1 had noted that they saw little or no improvement training 7B or 14B models with their data.

With only 17k examples, we find that distillation is effective at the 7B scale, possibly due to the higher quality of the data. For comparison, DeepSeek-R1-Distill-Qwen-7B used 800k examples.
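As a quick sanity check on these figures, the snippet below recomputes the data ratio and the 7B gains over the base model. It uses only the rounded counts quoted in the text (17k and 800k) and the scores in the 7B table; the variable names are ours.

```python
# Rounded example counts as quoted in the text.
stratos_examples = 17_000
r1_distill_examples = 800_000
ratio = r1_distill_examples / stratos_examples
print(f"{ratio:.1f}x fewer examples")  # 47.1x fewer examples

# (Bespoke-Stratos-7B, Qwen2.5-7B-Instruct) scores from the 7B table.
scores = {
    "AIME2024": (20.0, 10.0),
    "MATH500": (82.0, 74.2),
    "GPQA-Diamond": (37.8, 33.3),
    "LiveCodeBench v2 Easy": (71.4, 65.9),
    "LiveCodeBench v2 Medium": (25.5, 18.9),
    "LiveCodeBench v2 Hard": (1.6, 3.3),
    "LiveCodeBench v2 All": (36.1, 31.9),
}

# Gain in points of the fine-tuned model over its base model.
gains = {name: round(ours - base, 1) for name, (ours, base) in scores.items()}
for name, gain in gains.items():
    print(f"{name}: {gain:+.1f}")
```

The gains are positive on every benchmark except LiveCodeBench v2 Hard, where both models score in the low single digits.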

Thoughts and future work

We are pleasantly surprised by the results, but acknowledge that benchmarks convey only one side of the story. We invite the community to try the models out and evaluate them on other benchmarks so we can figure out what to improve.

There are many open questions we would like to understand better. For example, what is the Pareto frontier between the student model size and the number of SFT examples?

Beyond this work, we are excited about what reasoning distillation unlocks. We are building Curator to democratize the creation of powerful reasoning models and agents by enterprises and developers.

Citation

@misc{bespoke_stratos,  
    author = {Bespoke Labs},  
    title = {Bespoke-Stratos: The unreasonable effectiveness of reasoning distillation},  
    howpublished = {www.bespokelabs.ai/blog/bespoke-stratos-the-unreasonable-effectiveness-of-reasoning-distillation},  
    note = {Accessed: 2025-01-22},  
    year = {2025}
}
Acknowledgement

We are standing on the shoulders of giants. Bespoke Labs would like to thank Berkeley Sky Computing Lab for their work on Sky-T1 and for releasing the code and data, DeepSeek for releasing the DeepSeek-R1 model, and the Datacomp community for insightful discussions.

[ Environment research ] & infrastructure for the agent era.

©2026 BespokeLabs.AI, Inc.