Improving Multi-Turn Tool Use with Reinforcement Learning


Introduction

Learning to use tools is critical for building agents that can interact with the external world. For example, a deep research agent for private enterprise data needs to master a wide range of tools, such as search, SQL, and Pandas, to generate a report. It must not only be able to invoke a single tool in isolation, but also combine the outputs of multiple tools as the input to another — in essence, the agent should be able to orchestrate many tools.

Manual prompt engineering for building agents is easy to start with, but doesn’t scale well. To begin with, the list of tools may be long and diverse. Moreover, a prompt engineer designing the agent has to spend increasing amounts of time to cover more and more edge cases with handcrafted rules through trial-and-error. Such a manual approach is reminiscent of what Rich Sutton warned against in his essay, The Bitter Lesson:

"1) AI researchers have often tried to build knowledge into their agents, 2) this always helps in the short term, and is personally satisfying to the researcher, but 3) in the long run it plateaus and even inhibits further progress, and 4) breakthrough progress eventually arrives by an opposing approach based on scaling computation by search and learning"

An alternative method is to use supervised finetuning, where the model learns from demonstrations of responses to user queries. We can collect these demonstrations from humans, which can be expensive, or from a teacher model, which only works when an existing model is already very good at the task.

We need a more scalable method that can teach agents to use tools.

The Promise of Reinforcement Learning

Reinforcement learning (RL) algorithms can take on many forms, such as PPO or GRPO, but at their core, they allow models to learn by trial-and-error. A model explores a wide range of actions and receives feedback from the environment in the form of rewards. Actions that result in high reward are then reinforced and actions that result in low reward are discouraged. In this way, the model learns to take actions that maximize its reward.
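
For intuition, the heart of GRPO can be written in a few lines: sample a group of rollouts for the same prompt, score them, and reinforce each rollout in proportion to how much better it scored than the group average. The sketch below shows only this advantage computation; the names are ours, not from any particular library:

    import numpy as np

    def group_relative_advantages(rewards: list[float]) -> np.ndarray:
        """GRPO-style advantages for a group of rollouts sampled from one prompt."""
        r = np.asarray(rewards, dtype=np.float64)
        # Rollouts above the group mean get a positive advantage (reinforced),
        # rollouts below it get a negative advantage (discouraged).
        return (r - r.mean()) / (r.std() + 1e-8)

    # Example: four rollouts for one prompt; only the last two solved the task.
    print(group_relative_advantages([0.0, 0.0, 1.0, 1.0]))  # [-1. -1.  1.  1.]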

RL is a scalable learning algorithm because it allows models to self-discover solutions by learning from their own experience. Unlike prompt engineering or supervised finetuning, which are both ultimately bottlenecked by human-generated data, RL holds the promise of scaling arbitrarily with compute, a key property for successful algorithms outlined in The Bitter Lesson.

A chronology of RL borrowed from the recent article, Welcome to the Era of Experience, by David Silver and Richard Sutton

Reinforcement Learning for Tool Use

Recently, OpenAI has demonstrated that RL can be used to train a research agent that uses tools to carry out complex, multi-step workflows. However, they did not disclose a lot of details about their training recipe. Meanwhile, experiments in the open-source and research community are often single-turn, lacking back-and-forth interaction with an environment.

We are excited to share some recent findings which show that RL can improve multi-step tool use capabilities, a first milestone. Using GRPO, the core algorithm behind DeepSeek-R1, we improved Qwen2.5-7B-Instruct’s tool use performance by 23 percentage points on a subset of the BFCL benchmark using only 100 training samples. The tasks in this benchmark require an agent to orchestrate multiple tools to complete multi-step tasks in a simulated environment, such as booking air travel using credit cards and placing orders in a trading system.

Learning to Orchestrate Tools Without Demonstration

We found that our agent exhibited qualitatively better tool use capabilities after undergoing training with RL. As an example, consider the scenario below, where the agent is tasked with booking a flight from Los Angeles to New York using the user's credit card:

AI Training Comparison

Before Training

Step: 52 | Incorrect | Reward: 0.00
User
I'm planning a journey from Los Angeles to New York on the morning of April 15th 2024, preferring to fly business class. Arrange this flight using my pre-linked credit card with id 'card_123456789' and access token 'abc123xyz'.
Assistant
<reasoning> The task involves booking a flight from Los Angeles to New York on the specified date in business class. The required information includes the travel date, from, to, and class, which are all provided. The booking should use a specified access token and a linked credit card. I will use the 'book_flight' function to achieve this. </reasoning> <tool> [{"name": "book_flight", "args": {"access_token": "abc123xyz", "card_id": "card_123456789", "travel_date": "2024-04-15", "travel_from": "LAX", "travel_to": "NYC", "travel_class": "business", "travel_cost": 0}}] </tool>
Tool Result
["Function Call {'name': 'book_flight', 'args': {'access_token': 'abc123xyz', 'card_id': 'card_123456789', 'travel_date': '2024-04-15', 'travel_from': 'LAX', 'travel_to': 'NYC', 'travel_class': 'business', 'travel_cost': 0}} Succeeded. Result: {'booking_id': '3426812', 'transaction_id': '45451592', 'booking_status': True, 'booking_history': {}}"]
Assistant
The flight from Los Angeles (LAX) to New York (NYC) on April 15th, 2024, in business class has been successfully booked. The booking ID is 3426812, and the transaction ID is 45451592. The booking status is confirmed. <TASK_FINISHED>

After Training

Step: 1202 | Correct | Reward: 1.00
User
I'm planning a journey from Los Angeles to New York on the morning of April 15th 2024, preferring to fly business class. Arrange this flight using my pre-linked credit card with id 'card_123456789' and access token 'abc123xyz'.
Assistant
<reasoning> To accomplish this task, we need to perform two main steps: 1. Use the `get_flight_cost` function to obtain the cost of the specified flight. We will need to input the departure and destination airports, travel date, and travel class. 2. Use the `book_flight` function to book this flight using the details we have, including the pre-linked credit card and access token. </reasoning> <tool> [{"name": "get_nearest_airport_by_city", "args": {"location": "Los Angeles"}}] </tool>
Tool Result
["Function Call {'name': 'get_nearest_airport_by_city', 'args': {'location': 'Los Angeles'}} Succeeded. Result: {'nearest_airport': 'LAX'}"]
Assistant
<tool> [{"name": "get_nearest_airport_by_city", "args": {"location": "New York"}}] </tool>
Tool Result
["Function Call {'name': 'get_nearest_airport_by_city', 'args': {'location': 'New York'}} Succeeded. Result: {'nearest_airport': 'JFK'}"]
Assistant
<tool> [{"name": "get_flight_cost", "args": {"travel_from": "LAX", "travel_to": "JFK", "travel_date": "2024-04-15", "travel_class": "business"}}] </tool>
Tool Result
["Function Call {'name': 'get_flight_cost', 'args': {'travel_from': 'LAX', 'travel_to': 'JFK', 'travel_date': '2024-04-15', 'travel_class': 'business'}} Succeeded. Result: {'travel_cost_list': [2400.0]}"]
Assistant
We have determined that your business class flight from Los Angeles (LAX) to New York (JFK) on April 15, 2024, costs $2400.0. Now, let's proceed to book this flight using your pre-linked credit card with id 'card_123456789' and access token 'abc123xyz'. <tool> [{"name": "book_flight", "args": {"access_token": "abc123xyz", "card_id": "card_123456789", "travel_date": "2024-04-15", "travel_from": "LAX", "travel_to": "JFK", "travel_class": "business", "travel_cost": 2400}}] </tool>
Tool Result
["Function Call {'name': 'book_flight', 'args': {'access_token': 'abc123xyz', 'card_id': 'card_123456789', 'travel_date': '2024-04-15', 'travel_from': 'LAX', 'travel_to': 'JFK', 'travel_class': 'business', 'travel_cost': 2400}} Succeeded. Result: {'booking_id': '3426812', 'transaction_id': '45451592', 'booking_status': True, 'booking_history': {}}"]
Assistant
Your business class flight from Los Angeles to New York on April 15, 2024, has been successfully arranged. The booking ID is 3426812 and the transaction ID is 45451592. You can look forward to your journey knowing it's all set. <TASK_FINISHED>

Before training, the agent attempts a single tool call, book_flight, using a non-existent airport code for New York (NYC) and an incorrect cost ($0).

After training, the agent is able to solve the task successfully. It is able to:

  • Understand how to use the individual tools correctly.
  • Plan a correct sequence of tool calls in advance.
  • Synthesize previous tool call responses to invoke the next tool call with the correct input.

Concretely, the agent:

  • Used the get_nearest_airport_by_city tool to find the airports closest to New York (JFK) and Los Angeles (LAX),
  • Used the get_flight_cost tool to get the cost for the flight ($2400), and
  • Used the book_flight tool to book the flight using the information from both previous tool calls and the user request.

This is an encouraging result: our agent learned to orchestrate tools without seeing a single demonstration from a human or a teacher model!

You can browse more examples that show such improvements in our agent evaluation app.

Our Training Recipe

We used 200 questions from the BFCL benchmark’s v3 multi-turn-base category and performed a 50-50 train-test split, picking only the first query of each example to speed up training.
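
For reference, the data preparation looked roughly like the sketch below. The file name and field names are assumptions about the BFCL v3 multi-turn-base format and may not match the released files exactly:

    import json
    import random

    random.seed(42)

    # Assumed file name / format: one JSON object per line, following BFCL v3.
    with open("BFCL_v3_multi_turn_base.json") as f:
        examples = [json.loads(line) for line in f]

    # Keep only the first user query of each example to speed up training
    # ("question" is an assumed field name holding the list of user turns).
    for ex in examples:
        ex["question"] = ex["question"][:1]

    # 50-50 train-test split of the 200 questions.
    random.shuffle(examples)
    split = len(examples) // 2
    train_set, test_set = examples[:split], examples[split:]
    print(len(train_set), len(test_set))  # 100 100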

We designed a simple reward function that gives the agent a reward of 1 when it passes BFCL’s evaluation check, which consists of a state-based check and a response-based check (more details here), and 0 otherwise.
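
In code, the reward is as simple as the sketch below. The two check callables are placeholders for BFCL's state-based and response-based checks, which live in the BFCL evaluation code; we only show the binary wrapping:

    from typing import Callable

    def make_correctness_reward(
        state_check: Callable[[dict, dict], bool],
        response_check: Callable[[dict, dict], bool],
    ) -> Callable[[dict, dict], float]:
        """Return a reward function giving 1.0 iff a rollout passes both BFCL checks."""
        def reward(rollout: dict, ground_truth: dict) -> float:
            passed = state_check(rollout, ground_truth) and response_check(rollout, ground_truth)
            return 1.0 if passed else 0.0
        return reward

    # Toy usage with dummy checks; in training these are BFCL's real checkers.
    reward_fn = make_correctness_reward(lambda r, g: True, lambda r, g: False)
    print(reward_fn({}, {}))  # 0.0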

We then trained Qwen2.5-7B-Instruct for 1600 steps (100 epochs) using GRPO with the following hyperparameters (collected into a single configuration sketch after the list):

  • 4x H200 141GB (3 for training, 1 for vLLM inference)
  • per_device_batch_size = 1, gradient_accumulation_step = 4
  • max_grad_norm = 0.2
  • constant LR = 1e-6 with 20 warmup steps
  • KL beta = 0.001, M = 100 (from the DeepSeekMath paper): we update the reference model every 100 steps and use a KL weight of 0.001
  • Overlong Filtering (from the DAPO paper): we mask out the loss for rollouts that exceed the maximum completion length
  • μ = 2 (from DeepSeekMath paper): we take one on-policy step and one off-policy step for each batch
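
For convenience, here is the same recipe gathered into one place. The key names below are our own shorthand, not the exact argument names of the verifiers/trl trainer (which differ across versions):

    # The full recipe as a plain dict; keys are illustrative shorthand only.
    grpo_recipe = {
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "hardware": "4x H200 141GB (3 for training, 1 for vLLM inference)",
        "per_device_batch_size": 1,
        "gradient_accumulation_steps": 4,
        "max_grad_norm": 0.2,
        "learning_rate": 1e-6,            # constant schedule
        "warmup_steps": 20,
        "kl_beta": 0.001,                 # weight of the KL penalty term
        "ref_model_update_every": 100,    # M from the DeepSeekMath paper
        "overlong_filtering": True,       # mask loss for truncated rollouts (DAPO)
        "mu": 2,                          # gradient updates per batch of rollouts
        "max_steps": 1600,                # roughly 100 epochs over 100 train samples
    }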

More details can be found in our training code, which we adapted from Will Brown's verifiers library.

We found that the agent improved by 23 percentage points on the eval set (from 55% to 78%) with the recipe above while maintaining training stability.

Findings from Our Experiments

In this section, we detail the many findings that led us to our final training recipe.

1) Mitigating Completion Length Blowup and Gibberish Output during Training

A major issue that occurred in many of our experiments is completion length blowup, where the agent starts outputting longer and longer gibberish responses full of random characters; see the example below.

Chat Interface with Error Messages
User
My tires need to be in top-notch condition before embarking on my journey. Could you ensure they have optimum pressure? If any tire falls short of 30.0 psi, it's off to the nearest service station to remedy it pronto. Let's also ready the car for navigation if the shop isn't miles away!
Assistant
{"name": "set_navigation", "response": {"status": "Navigating to 456 Oakwood Avenue, Rivermist, 83214"}, "toolbench_rapidapi_key": "abc123", "toolbench_rapidapi_proxy": "http://localhost:8080"}<|response|>{"status": "Navi<|end|>{"name": "display_car_status", "output": {"status": "Parking brake status: engaged", "car_info": "{'make': 'Toyota', 'model': 'Camry', 'year': 2022}}<|response|>{"<|end|>{"<|end|>{"displayCarStatus": {"status": {"parkingBrakeStatus": "engaged", "car_info": {"make": "Toyota", "model": "Camry", "year": 2022}}, "toolbench_rapidapi_key": "abc123", "toolbench_rapidapi_proxy": "http://localhost:8080"}}<|response|>{"<|end|>{"displayCarStatus": {"status": {"fanSpeed": 50, "batteryVoltage": 12.6, "humidityLevel": 45.0, "car_info": <|response|>{"make": "Toyota", "model": "Camry", "year": 2022}}, "toolbench_rapidapi_key": "abc123", "toolbench_rapidapi_proxy": "http://localhost<|end|>{"<|response|>{"<|response|>{"displayCarStatus": {"status": {"fanSpeed": 50, "batteryVoltage": 12.6, "humidityLevel": 45.0, "car_info": {"make": "Toyota", "model": "Camry", "year": 2022}}, "tool<|end|>{"<|toolbench_rapidapi<|*|>{"<|*|>{"{"display<|*|>{"<|display<|end|>{"<|<|<|{"<|<|{"<|Response<|{"<|<|{"<|<|{"<|<|<|{"<|<|<|{"<|<|<|<|{"<|<|<|{"<|<|<|<|{"<|<|<|<|{"<|<|<|<|{"<|<|<|<|<|{"{"display<|<|<|<|{"<|<|<|<|<|{"<|<|<|<|<|<|{"<|<|{"<|<|<|<|<|{"<|<|<|<|<|<|{"<|<|<|<|<|<|{"<|<|<|<|<|<|{"<|<|<|<|<|<|{"<|<|<|<|<|<|{"<|<|<|<|<|<|<{"<|<|<|<|{"<|<|<|<|<|<|<… (Rest of the repetitions truncated)

This blowup in response length also leads to a complete collapse in training and eval accuracy.

We tried many treatments to mitigate this training instability, which we describe below. It is worth noting that in all of the below settings, we performed GRPO with μ = 2 (one on-policy step and one off-policy step taken for every batch) to ensure fast convergence.
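
To make the μ = 2 setting concrete: each batch of rollouts is sampled once from the current policy and then used for two gradient updates, the first on-policy and the second slightly off-policy (the weights have already moved). A schematic outer step, with function names of our own invention:

    MU = 2  # gradient updates per batch of rollouts (DeepSeekMath's mu)

    def grpo_outer_step(policy, ref_model, prompts, sample_rollouts, grpo_loss, optimizer):
        """Sample rollouts once, then update the policy MU times on the same rollouts."""
        rollouts = sample_rollouts(policy, prompts)  # generated by the current policy
        for i in range(MU):
            # i == 0: on-policy update; i == 1: off-policy, since the policy has changed.
            loss = grpo_loss(policy, ref_model, rollouts)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()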

❌ Dr.GRPO: We found that re-scaling the reward and replacing sequence-level normalization with a constant, following the Dr.GRPO paper, still resulted in completion length blowup.

❌ Gibberish Penalty: We used gpt-4o-mini as a judge to detect gibberish in agent trajectories and assign a -1 reward when gibberish is detected. We found that the gpt-4o-mini judge was able to detect such gibberish responses with very high accuracy. Despite this, and to our surprise, completion length blowup occurred even earlier with the gibberish penalty added. Our hypothesis is that negative rewards encourage exploration, which in turn further encourages gibberish.
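
The penalty itself was a thin wrapper around an LLM judge, roughly like the sketch below (the prompt is a paraphrase, not the exact one we used):

    from openai import OpenAI

    client = OpenAI()

    # Paraphrased judge prompt; our actual wording differed.
    JUDGE_PROMPT = (
        "You will see a response produced by an AI agent. Answer YES if it contains "
        "gibberish (random characters, broken tags, runaway repetition), otherwise answer NO.\n\n"
        "Response:\n{response}"
    )

    def gibberish_penalty(agent_response: str) -> float:
        """Return -1.0 if gpt-4o-mini judges the response to be gibberish, else 0.0."""
        judgment = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": JUDGE_PROMPT.format(response=agent_response)}],
            temperature=0.0,
        )
        verdict = judgment.choices[0].message.content.strip().upper()
        return -1.0 if verdict.startswith("YES") else 0.0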

❌ Overlong Filtering + Removing KL Divergence Penalty: We tried using overlong filtering and removing the KL divergence term, as suggested by the DAPO paper. Removing the KL divergence term was very appealing to us since it significantly simplifies the loss calculation and speeds up training by allowing us to avoid keeping an extra reference model in memory and to skip the KL penalty calculation. However, while we did see faster training, completion length blowup and accuracy collapse occurred very quickly, at around 300 steps. Our result corroborates a similar finding in a discussion thread from the HuggingFace team.

✅ Overlong Filtering + KL Weight = 0.001: This was the treatment included in our final recipe. After experimenting with both a high KL weight (0.04) and no KL weight (0) and seeing collapse in both settings, we decided to pick a small KL weight of 0.001 (inspired by this comment from the discussion thread above), which resulted in our best-performing and most stable run.
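
Mechanically, overlong filtering simply zeroes out the loss contribution of any rollout that hit the completion-length cap, so truncated (and often gibberish) completions produce no gradient. A minimal sketch, with tensor shapes of our choosing:

    import torch

    def apply_overlong_filtering(
        per_token_loss: torch.Tensor,      # (batch, seq_len) token-level GRPO loss
        completion_lengths: torch.Tensor,  # (batch,) number of tokens generated per rollout
        max_completion_length: int,
    ) -> torch.Tensor:
        """Mask out the loss of rollouts truncated at the length cap (DAPO-style)."""
        truncated = completion_lengths >= max_completion_length  # (batch,) bool
        keep = (~truncated).float().unsqueeze(1)                 # (batch, 1)
        return per_token_loss * keep  # truncated rollouts contribute no gradient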

2) Less Is More for Reward Design

Our initial reward design consisted of three components, sketched in code after the list:

  1. Tool execution: This component assigns a reward proportional to the number of times the agent performs a function call that is successfully executed by the environment.
  2. Format check: This component assigns a fixed reward based on whether responses from the agent’s rollout contain (1) reasoning tags and (2) either a correctly formatted tool call or a <TASK_FINISHED> token that signals the completion of a rollout.
  3. Correctness: This component assigns a reward of 1 when the agent passes BFCL’s eval check, as mentioned in the training recipe section.
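
For illustration, the composite reward looked roughly like this; the weights of the first two components are made up for the sketch and are not the exact values we used:

    def composite_reward(num_successful_calls: int, format_ok: bool, passes_bfcl_check: bool) -> float:
        """Initial three-component reward (illustrative weights); later replaced by correctness only."""
        exec_reward = 0.1 * num_successful_calls         # 1) tool execution
        format_reward = 0.2 if format_ok else 0.0        # 2) format check
        correctness = 1.0 if passes_bfcl_check else 0.0  # 3) BFCL eval check
        return exec_reward + format_reward + correctness

    # A rollout can collect partial credit without ever solving the task:
    print(composite_reward(3, True, False))  # 0.5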

We found that this complex reward design led to worse training stability, presumably due to reward hacking by the agent. In the graph below, green lines are results from an experiment with reward components 1, 2, and 3 above, red lines are from an experiment with components 2 and 3, and pink lines are from an experiment with only component 3. The most straightforward reward design, containing only the correctness component, has the best stability and performance.

3) Reference Model Update Boosts Performance

We found that updating the reference model every 100 steps can boost performance compared to using the same reference model throughout the entire training process. In the graph below, pink lines are results from an experiment with reference model update, whereas blue lines are results from an experiment without reference model update.

Intuitively, since the KL term pulls the current policy model towards the reference model, a stronger policy model requires a stronger reference model as a regularizer. Regularly updating the reference model ensures that the KL term does not drag the policy all the way back to its initial behavior, while still adding some stability to the training process.
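
In pseudocode, the update is just a hard copy of the current policy weights into the frozen reference model every M steps; the sketch below is ours, and the verifiers/trl implementation may do this differently, for example via a callback:

    import copy

    def maybe_update_reference_model(policy_model, ref_model, step: int, update_every: int = 100):
        """Hard-sync the frozen reference model to the current policy every `update_every` steps."""
        if step > 0 and step % update_every == 0:
            # The KL penalty is now computed against this fresher snapshot, so it keeps
            # regularizing training without dragging the policy back to its starting point.
            ref_model.load_state_dict(copy.deepcopy(policy_model.state_dict()))
        return ref_model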

Looking Forward

True to our name, we want to make it very easy to build bespoke agents. We believe this is best done using scalable methods such as RL. This work represents a first step towards this mission.

We believe there is a gap in tooling to deeply understand, explore, and evaluate agent behaviors. Our agent evaluation app is a demo of what we think the tooling should look like. Please reach out to us at company@bespokelabs.ai if you're interested in collaborating to build bespoke agents for your use case.

Citation

@misc{bespoke_improving_multi_turn_tool_use,
    author = {Bespoke Labs},
    title = {Improving Multi-Turn Tool Use with Reinforcement Learning},
    howpublished = {https://www.bespokelabs.ai/blog/improving-multi-turn-tool-use-with-reinforcement-learning},
    note = {Accessed: 2025-04-17},
    year = {2025}
}
