MCP servers scale in a way that punishes success.
A server with ten tools works beautifully. The LLM sees all ten schemas, picks the right one, calls it. A server with two hundred tools dumps two hundred schemas into the context window before the LLM reads a single word of the user’s request: tens of thousands of tokens, most of them irrelevant.
The execution model compounds the problem. Every tool call is a round-trip. The LLM calls a tool, the result passes back through the context window, the LLM reasons about it, calls another tool. Intermediate results that only exist to feed the next step burn tokens flowing through the model on every turn.
The code mode pattern, introduced by Cloudflare and explored by Anthropic, addresses both problems at once: instead of calling tools one at a time, the LLM writes a script that composes them. Search for what’s available, write code, execute it in a sandbox. The intermediate results stay inside the sandbox. The context window stays clean. Cloudflare recently shipped a server-side implementation for their own API: two tools covering 2,500 endpoints in roughly 1,000 tokens.
FastMCP 3.1 ships server-side code mode with fully configurable discovery, and the server-side part matters more than it sounds.
## CodeMode
Here’s a normal FastMCP server with CodeMode applied:
```python
from fastmcp import FastMCP
from fastmcp.experimental.transforms.code_mode import CodeMode

mcp = FastMCP("Server", transforms=[CodeMode()])

@mcp.tool
def add(x: int, y: int) -> int:
    """Add two numbers."""
    return x + y

@mcp.tool
def multiply(x: int, y: int) -> int:
    """Multiply two numbers."""
    return x * y
```

The only difference from a standard server is `transforms=[CodeMode()]`. The tool functions stay the same. But clients connecting to this server no longer see `add` and `multiply` directly; they see the meta-tools that CodeMode provides: tools for discovering what’s available and for writing code that calls them.
The default flow has three stages. Granted, three stages might sound like a lot for something intended to reduce server round-trips. The original code mode pattern, introduced by Cloudflare, had no discovery phase at all: clients loaded every tool definition into context, then executed code against them. This solved the sequential calling problem but not the context bloat problem. Anthropic introduced a two-stage approach: search for relevant tools, then execute. This addressed both problems.
For servers complex enough to need code mode, we’ve found that an additional stage makes a meaningful difference. Separating search from schema retrieval lets the search tool stay lightweight, returning only names and brief descriptions, while a dedicated schema step provides the precision the LLM needs to write correct code. But if you want something else, FastMCP permits full customization of this flow to have as few or as many stages as you need.
Here’s how the three default stages play out with the server above:
First, the LLM searches. It calls `search(query="math numbers")` and gets back tool names and descriptions: a lightweight index. Instead of loading two hundred schemas, it sees a few lines of text about the tools that match.
Next, it requests parameter details for the tools it found. `get_schema(tools=["add", "multiply"])` returns parameter names, types, and required markers. Not the full JSON schema (by default), but enough to write code against.
Finally, it writes a Python script and executes it in a sandbox:
```python
a = await call_tool("add", {"x": 3, "y": 4})
b = await call_tool("multiply", {"x": a, "y": 2})
return b
```

Three round-trips: search, schema, execute. The intermediate result (`a`) never enters the context window. `call_tool` is the only function available inside the sandbox; no filesystem, no network, just tool calls and Python.
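To make the three stages concrete, here is a self-contained simulation in plain Python. Nothing below is FastMCP's actual implementation: the registry, `search`, `get_schema`, and `call_tool` helpers are invented for illustration, and the final script runs synchronously rather than with `await`.

```python
# Illustrative simulation of the three-stage flow: search -> schema -> execute.
# All helpers here are invented for the example, not FastMCP internals.
import inspect

def add(x: int, y: int) -> int:
    """Add two numbers."""
    return x + y

def multiply(x: int, y: int) -> int:
    """Multiply two numbers."""
    return x * y

REGISTRY = {"add": add, "multiply": multiply}

def search(query: str) -> list[str]:
    """Stage 1: names plus one-line descriptions that match the query."""
    words = set(query.lower().split())
    return [
        f"{name}: {fn.__doc__}"
        for name, fn in REGISTRY.items()
        if words & set(fn.__doc__.lower().replace(".", "").split())
    ]

def get_schema(tools: list[str]) -> dict[str, list[str]]:
    """Stage 2: parameter names and types, not the full JSON Schema."""
    return {
        name: [
            f"{p.name}: {p.annotation.__name__}"
            for p in inspect.signature(REGISTRY[name]).parameters.values()
        ]
        for name in tools
    }

def call_tool(name: str, params: dict):
    """Stage 3: the only function visible inside the sandbox."""
    return REGISTRY[name](**params)

# The script the LLM would execute. The intermediate `a` stays in the sandbox.
a = call_tool("add", {"x": 3, "y": 4})
result = call_tool("multiply", {"x": a, "y": 2})
```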
## Discovery
The three-stage flow is the default. CodeMode’s discovery surface is fully configurable, because different tool catalogs need different approaches.
CodeMode ships four discovery tools. All of them share a tunable detail level that controls how much information each response includes:
| Level | Output | Token cost |
|---|---|---|
| `"brief"` | Tool names and one-line descriptions | Cheapest |
| `"detailed"` | Compact markdown with parameter names, types, and required markers | Medium |
| `"full"` | Complete JSON Schema | Most expensive |
This is significant. Even ListTools, which dumps the entire catalog, can produce substantially fewer tokens than a standard MCP handshake when set to "brief" or "detailed". A standard tools/list response includes the full JSON Schema for every tool: argument names, types, nested objects, descriptions, constraints. ListTools at "brief" returns just names and descriptions. The context dump tax is still there, but it’s a fraction of what it would be, and the sequential calling tax is eliminated entirely because tool calls happen inside the sandbox.
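To see why the levels differ in cost, here is a hedged sketch of what each one might render for a single tool. The helper functions are invented for the example (FastMCP's real renderers are not shown in this post); the point is the rough size ordering.

```python
# Invented renderers for the three detail levels -- not FastMCP's code.
import inspect
import json

def add(x: int, y: int) -> int:
    """Add two numbers."""
    return x + y

def brief(fn) -> str:
    # "brief": tool name plus one-line description only.
    return f"{fn.__name__}: {fn.__doc__}"

def detailed(fn) -> str:
    # "detailed": compact markdown with parameter names and types.
    params = ", ".join(
        f"{p.name}: {p.annotation.__name__}"
        for p in inspect.signature(fn).parameters.values()
    )
    return f"**{fn.__name__}**({params}): {fn.__doc__}"

def full(fn) -> str:
    # "full": a complete JSON Schema for the arguments (minimal hand-rolled
    # version here; real schemas also carry descriptions and constraints).
    names = [p.name for p in inspect.signature(fn).parameters.values()]
    schema = {
        "type": "object",
        "properties": {n: {"type": "integer"} for n in names},
        "required": names,
    }
    return json.dumps(schema)
```

Even for a two-parameter tool the `"full"` output is several times the size of `"brief"`; multiplied across hundreds of tools, that gap is the context dump tax.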
By default, two discovery tools are enabled:
**Search** finds tools by natural-language query using BM25 ranking. Defaults to `"brief"` detail. The LLM can override the detail level per call, requesting `"detailed"` for inline schemas or `"full"` for the complete JSON Schema.
**GetSchemas** takes a list of tool names and returns parameter details. Defaults to `"detailed"`. The fallback for when search results aren’t enough to write code against.
Two more are opt-in:
**ListTools** dumps the entire catalog. At `"brief"` detail, this is a lightweight alternative to standard MCP tool listing. For small servers, under twenty tools or so, seeing everything upfront can be faster than searching.
**GetTags** lets the LLM browse tools by tag metadata, then pass tags into Search to narrow results. Useful when tools have a natural taxonomy.
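The opt-in tools are enabled by listing them in the discovery configuration. A hedged sketch, assuming `ListTools` and `GetTags` are importable from the same module as `Search` and `GetSchemas` (check the FastMCP docs for the exact names):

```python
# Hedged sketch: all four discovery tools enabled. The import names for
# ListTools and GetTags are assumed, not confirmed by this post.
from fastmcp.experimental.transforms.code_mode import (
    CodeMode,
    Search,
    GetSchemas,
    ListTools,
    GetTags,
)

code_mode = CodeMode(
    discovery_tools=[
        GetTags(),                       # browse the taxonomy for orientation
        Search(default_detail="brief"),  # narrow by query, optionally by tag
        GetSchemas(),                    # precise parameter details
        ListTools(),                     # full catalog, useful on small servers
    ],
)
```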
The discovery configuration is where the server author’s knowledge becomes design. A large platform server might use all four tools with progressive detail levels: tags for orientation, search for narrowing, schemas for precision. A smaller server can collapse to two stages by bumping search detail:
```python
from fastmcp.experimental.transforms.code_mode import CodeMode, Search, GetSchemas

code_mode = CodeMode(
    discovery_tools=[Search(default_detail="detailed"), GetSchemas()],
)
```

Now search returns parameter schemas inline, and the LLM goes straight from search to execute. GetSchemas stays available as a fallback for complex parameter trees.
This two-stage configuration is exactly the pattern Cloudflare shipped for their API: search returns enough detail to write code, execute runs it. In FastMCP, it’s one line applied to any server. Cloudflare’s results, along with early usage patterns, suggest two-stage may be the better default for most servers. It’s something we’re actively evaluating.
A very simple server can skip discovery entirely and bake tool instructions into the execute tool’s description:
```python
code_mode = CodeMode(
    discovery_tools=[],
    execute_description=(
        "Available tools:\n"
        "- add(x: int, y: int) -> int: Add two numbers\n"
        "- multiply(x: int, y: int) -> int: Multiply two numbers\n\n"
        "Write Python using `await call_tool(name, params)` and `return` the result."
    ),
)
```

Each of these patterns is a conscious choice about the tradeoff between token cost and discovery accuracy. The server author makes that choice once, and every client benefits. This is the fundamental advantage of server-side code mode: the person who knows the tools best is the one deciding how they’re discovered and composed.
## Composition
In the FastMCP 3.0 architecture, components flow through a pipeline. Providers source them; transforms modify them on the way to clients. A transform can rename, filter, namespace, or reshape what a provider exposes, and transforms compose: stack them, and each one processes the output of the previous.
CodeMode is a transform. It works with everything else in the system without special-casing.
Apply it to an entire server, or to just one provider. Some tools go through code mode, others stay directly accessible. Chain it with other transforms: add a namespace to a mounted sub-server, then apply CodeMode to the result. Filter tools by tag or version, then wrap whatever passes through.
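As a sketch of how stacking might look in code: only `CodeMode` and the `transforms` parameter are confirmed by this post, so the namespacing transform is shown commented out as a hypothetical stand-in for whatever your FastMCP version provides.

```python
# Hedged sketch of transform stacking. Each transform processes the output of
# the previous one; the Namespace line is a hypothetical placeholder.
from fastmcp import FastMCP
from fastmcp.experimental.transforms.code_mode import CodeMode

mcp = FastMCP(
    "Server",
    transforms=[
        # Namespace("billing"),  # e.g. prefix tools from a mounted sub-server
        CodeMode(),              # then wrap whatever passes through
    ],
)
```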
One pattern worth highlighting is to proxy a remote server, then apply CodeMode:
```python
from fastmcp.server import create_proxy
from fastmcp.experimental.transforms.code_mode import CodeMode

remote = create_proxy("https://api.example.com/mcp")
remote.add_transform(CodeMode())
remote.run()
```

That remote server now has a code execution interface with tunable discovery. The original authors didn’t build one. The person running the proxy configured one that fits their application.
The behavior falls out of the architecture.
Coming soon: We’re adding configurable code mode for every server hosted on Prefect Horizon. No code changes required.
## The Sandbox
The Python execution environment is sandboxed via Pydantic’s Monty project, an experimental Python sandbox that restricts LLM-generated code to call_tool and standard Python. No filesystem access, no network access, nothing outside the sandbox boundary.
Building a Python sandbox that’s secure enough for production and flexible enough to be useful is genuinely hard. The Pydantic team has been doing excellent work on Monty, and CodeMode wouldn’t exist without it.
Resource limits are configurable: timeouts, memory caps, recursion depth.
```python
from fastmcp.experimental.transforms.code_mode import CodeMode, MontySandboxProvider

sandbox = MontySandboxProvider(
    limits={"max_duration_secs": 10, "max_memory": 50_000_000},
)

mcp = FastMCP("Server", transforms=[CodeMode(sandbox_provider=sandbox)])
```

The sandbox provider itself is replaceable. Implement the `SandboxProvider` protocol and point CodeMode at a Docker container, a remote execution service, whatever fits the deployment.
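To illustrate the replaceable-provider idea, here is a self-contained toy sketch. The real `SandboxProvider` protocol's method names are not shown in this post, so the single `execute` method below is an assumption, and the in-process "sandbox" is deliberately crude; a real provider would ship the code to Docker or a remote service.

```python
# Toy sketch of a swappable sandbox provider. The `execute` method name and
# signature are assumed, not FastMCP's actual protocol.
from typing import Any, Callable, Protocol

class SandboxProvider(Protocol):
    def execute(self, code: str, call_tool: Callable[..., Any]) -> Any: ...

class InProcessSandbox:
    """Runs code with only `call_tool` in scope. Crude isolation only:
    builtins are stripped, but this is nowhere near production-safe."""

    def execute(self, code: str, call_tool: Callable[..., Any]) -> Any:
        scope: dict[str, Any] = {"call_tool": call_tool}
        exec(code, {"__builtins__": {}}, scope)
        return scope.get("result")

def call_tool(name: str, params: dict) -> Any:
    # Minimal stand-in for the server's tool dispatch.
    return {"add": lambda x, y: x + y}[name](**params)

sandbox: SandboxProvider = InProcessSandbox()
out = sandbox.execute('result = call_tool("add", {"x": 3, "y": 4})', call_tool)
```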
## Getting Started
```bash
pip install "fastmcp[code-mode]"
```

CodeMode is experimental. The core interface is stable, but the specific discovery tools and their parameters may evolve as we learn more about what works in practice.
Happy (context) engineering!