Launching Code Executor

In today's agentic world, the ability to process and transform large datasets containing LLM-generated code is becoming increasingly valuable. Whether you're creating synthetic code datasets, verifying outputs, generating automated test cases, or producing complex outputs that require code execution, you often need to execute code snippets automatically, securely, and at scale. This is where Curator's CodeExecutor comes in.

Curator provides a robust solution for these needs with CodeExecutor, a flexible framework that handles the complexities of code execution while giving you full control over the process.

Hello world example

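Creating a custom code executor with Curator is straightforward. You extend the curator.CodeExecutor class and implement two methods: code, which returns the code to run for a given row, and code_output, which stores the execution result back on the row: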

from bespokelabs import curator
from datasets import Dataset


class HelloExecutor(curator.CodeExecutor):
    def code(self, row):
        # Return the source code to execute for this row.
        location = row['location']
        return f"""print("Hello {location}")"""

    def code_output(self, row, execution_output):
        # Attach whatever the snippet printed to stdout to the row.
        row['output'] = execution_output.stdout
        return row


# One row per snippet to execute.
locations = Dataset.from_list([
    {'location': 'New York'},
    {'location': 'Tokyo'}
])

hello_executor = HelloExecutor()

# Calling the executor on a dataset runs the code for every row.
print(hello_executor(locations).to_pandas())

Synthetically generating 3Blue1Brown-like videos

While CodeExecutor is powerful on its own, the real value comes from coupling it with curator.LLM. In this example, we generate a fully synthetic dataset of math animation videos using curator.LLM and curator.CodeExecutor. The data generation recipe is as follows:

Data generation recipe for Math Animated Videos

Our process uses a chained Curator pipeline to generate math animation video scripts. We begin by selecting a diverse range of subjects of varying difficulty, then identify topics within each subject, and finally craft multiple questions for each topic. In total, we produce 1,000 unique questions, each paired with its own tailored script for a math animation video. Then, for each script, we generate the corresponding Manim code using a single Curator block with Claude 3.7 Sonnet, with thinking and self-consistency enabled to improve the quality of the generated code. Finally, we pass the 1,000 examples through Curator's code executor with the Docker backend and the Manim Docker image. We run this on a CPU machine on GCP, and the full run takes about 20 minutes. Here is an example output:

We find that Claude 3.7 Sonnet is able to generate simple mathematical animations in one shot. It sometimes struggles with more complicated animations that involve camera movement. In our case, around 25% of the Manim snippets generate valid videos. The failures are mostly due to incorrect code being generated for the given script, and sometimes due to hallucinated function names and calls.
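To make the chained structure of the recipe concrete, here is a rough sketch of the subject, topic, and question fan-out in plain Python. Everything in it is an assumption for illustration: the three stage functions are hypothetical stubs standing in for LLM calls (in the real pipeline each stage would be a curator.LLM block), and the 10 x 10 x 10 split is invented, since we only state the total of 1,000 questions above.

```python
from typing import Dict, List

Row = Dict[str, str]


def list_subjects(n: int) -> List[Row]:
    # Hypothetical stub for stage 1: an LLM call that proposes subjects.
    return [{"subject": f"subject-{i}"} for i in range(n)]


def expand_topics(row: Row, n: int) -> List[Row]:
    # Hypothetical stub for stage 2: one subject row fans out into n topic rows.
    return [{**row, "topic": f"{row['subject']}/topic-{j}"} for j in range(n)]


def expand_questions(row: Row, n: int) -> List[Row]:
    # Hypothetical stub for stage 3: one topic row fans out into n question rows.
    return [{**row, "question": f"{row['topic']}/question-{k}"} for k in range(n)]


def build_questions(n_subjects: int = 10, n_topics: int = 10, n_questions: int = 10) -> List[Row]:
    # Each stage consumes the rows of the previous stage, so context
    # (subject, topic) is carried along into every question row.
    rows = list_subjects(n_subjects)
    rows = [t for r in rows for t in expand_topics(r, n_topics)]
    rows = [q for r in rows for q in expand_questions(r, n_questions)]
    return rows


if __name__ == "__main__":
    questions = build_questions()
    print(len(questions), questions[0])
```

The point of the sketch is only the shape of the data flow: each stage returns a list of rows per input row, and the next stage runs over the flattened result.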

We think there is definite room for improvement in the code generation pipeline, for example by adding Manim's documentation to the prompt and/or by using models fine-tuned for Manim code generation. Also, using CodeExecutor, one can create a pipeline with a feedback loop that regenerates code based on the error message. This is quite simple to do with Curator and CodeExecutor, and we'd love for you to try it out!
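As a starting point, here is a minimal sketch of such a feedback loop. It deliberately uses only the Python standard library as a stand-in: run_snippet, generate_with_feedback, and fake_generate are hypothetical names and not part of Curator's API. In a real pipeline, generation would presumably be a curator.LLM call and execution would go through CodeExecutor rather than a local subprocess.

```python
import subprocess
import sys
from typing import Callable, Dict, Optional, Tuple


def run_snippet(code: str, timeout: float = 30.0) -> Tuple[bool, str, str]:
    """Run a Python snippet in a fresh interpreter; return (ok, stdout, stderr)."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False, "", f"timed out after {timeout} seconds"
    return proc.returncode == 0, proc.stdout, proc.stderr


def generate_with_feedback(
    generate: Callable[[str, Optional[str]], str],
    script: str,
    max_attempts: int = 3,
) -> Dict[str, object]:
    """Regenerate code until it runs cleanly or attempts are exhausted."""
    error: Optional[str] = None
    code = ""
    for attempt in range(1, max_attempts + 1):
        # The previous error (None on the first attempt) is fed back to the generator.
        code = generate(script, error)
        ok, stdout, stderr = run_snippet(code)
        if ok:
            return {"ok": True, "code": code, "stdout": stdout, "attempts": attempt}
        error = stderr
    return {"ok": False, "code": code, "stdout": "", "attempts": max_attempts, "error": error}


def fake_generate(script: str, error: Optional[str]) -> str:
    # Hypothetical stand-in for an LLM call: the first attempt is deliberately
    # broken, and the retry (which receives the error message) is valid.
    if error is None:
        return "print(undefined_name)"
    return f"print({script!r})"


if __name__ == "__main__":
    print(generate_with_feedback(fake_generate, "area of a circle"))
```

With the stub generator above, the first attempt fails with a NameError and the second attempt succeeds. Capping the number of attempts keeps a stubbornly failing script from looping forever.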

Get started

As you can see, Curator makes it very easy to define complex workflows involving several stages of LLM calls followed by code execution.

To get started, please check the docs and the following examples:

  1. Math video generation
  2. Colab example

For feature requests and bugs, please use the Curator repository on GitHub. Join our Discord community, TokenTown.

Star us on GitHub to support!
