We trained Bespoke-Stratos-32B, our reasoning model distilled from DeepSeek-R1 using Berkeley NovaSky’s Sky-T1 data pipeline. The model outperforms Sky-T1 and o1-preview on reasoning (math and code) benchmarks, and nearly matches the performance of DeepSeek-R1-Distill-Qwen-32B while being trained on 47x fewer examples:
We open-source everything, including the reasoning dataset, to continue experimenting together with the community!
Please also refer to Sky-T1’s codebase for the training and evaluation code.
We used Bespoke Curator to create the synthetic reasoning dataset. We ported the Sky-T1 data pipeline into Curator, making it faster and fault-tolerant, which helped us generate the reasoning dataset with DeepSeek-R1 in about 1.5 hours.
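For a sense of what the generation step looks like, here is a minimal sketch of a Curator-based generator driven by DeepSeek-R1. The `curator.LLM` subclassing pattern is Curator's interface, but the model name, field names, and response handling below are illustrative assumptions rather than our production pipeline.

```python
from bespokelabs import curator


class Reasoner(curator.LLM):
    """Generate a long reasoning trace plus a final solution per problem (sketch)."""

    def prompt(self, input: dict) -> str:
        # One prompt per source problem from the Sky-T1 data mix.
        return input["question"]

    def parse(self, input: dict, response: str) -> dict:
        # Assumption: R1 returns the chain of thought followed by the final
        # solution; we store the raw output and split it in a later step.
        return {"question": input["question"], "r1_response": response}


# Hypothetical usage: `problems` is a list of dicts or a HuggingFace dataset.
# reasoner = Reasoner(model_name="deepseek-reasoner")
# traces = reasoner(problems)
```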
Rejection sampling involves filtering out reasoning traces whose solutions are incorrect. This is especially challenging for code, where verification requires executing the solutions, so we speed it up using a Ray cluster. We are currently integrating a code execution verifier directly into Curator, so stay tuned.
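A rough sketch of what parallel code verification can look like with Ray is below; the row field names and the subprocess-based checker are hypothetical, not our exact verifier.

```python
import subprocess

import ray

ray.init()  # connect to the local or remote Ray cluster


@ray.remote
def passes_tests(solution: str, tests: str, timeout_s: int = 10) -> bool:
    """Run a candidate solution against its tests in a subprocess (sketch)."""
    program = solution + "\n\n" + tests
    try:
        result = subprocess.run(
            ["python", "-c", program], capture_output=True, timeout=timeout_s
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False


# Fan verification out across the cluster and keep only passing traces.
# futures = [passes_tests.remote(row["solution"], row["tests"]) for row in traces]
# kept = [row for row, ok in zip(traces, ray.get(futures)) if ok]
```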
We followed the same recipe as Sky-T1, but with the following differences:
We also release Bespoke-Stratos-7B, a fine-tune of Qwen-2.5-7B-Instruct.
The authors of Sky-T1 noted that they saw little or no improvement when training 7B or 14B models with their data.
With only 17k examples, we find that distillation is effective at the 7B scale, possibly due to the higher quality of the data. For comparison, DeepSeek-R1-Distill-Qwen-7B used 800k examples.
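For readers who want to try a small-scale reproduction, a minimal SFT sketch along these lines is shown below. It uses TRL's SFTTrainer with our released Bespoke-Stratos-17k dataset; the hyperparameters and the data formatting are illustrative, not our exact recipe (please refer to the Sky-T1 codebase for that).

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Bespoke-Stratos-17k: the reasoning traces we release with this post.
# Converting the conversations into the trainer's expected chat format
# is omitted here for brevity.
dataset = load_dataset("bespokelabs/Bespoke-Stratos-17k", split="train")

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-7B-Instruct",
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="bespoke-stratos-7b-sft",
        num_train_epochs=3,               # illustrative hyperparameters
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        learning_rate=1e-5,
        bf16=True,
    ),
)
trainer.train()
```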
We are pleasantly surprised by the results, but acknowledge that benchmarks tell only one side of the story. We invite the community to try out the models and evaluate them on other benchmarks so we can figure out what to improve.
There are many open questions we would like to understand better. For example, what is the Pareto frontier between the student model size and the number of SFT examples?
Beyond this work, we are excited about what reasoning distillation unlocks. We are building Curator to democratize the creation of powerful reasoning models and agents by enterprises and developers.
@misc{bespoke_stratos,
  author = {Bespoke Labs},
  title = {Bespoke-Stratos: The unreasonable effectiveness of reasoning distillation},
  howpublished = {www.bespokelabs.ai/blog/bespoke-stratos-the-unreasonable-effectiveness-of-reasoning-distillation},
  note = {Accessed: 2025-01-22},
  year = {2025}
}
We are standing on the shoulders of giants. Bespoke Labs would like to thank the Berkeley Sky Computing Lab for their work on Sky-T1 and for releasing the code and data, DeepSeek for releasing the DeepSeek-R1 model, and the DataComp community for insightful discussions.