If you’re looking to build agentic AI solutions that perform well, one area that doesn’t get discussed nearly often enough is the core runtime environment. For AI developers, that runtime is almost always synonymous with Python.
But as agent-oriented designs take on more performance-critical roles, I’m speaking to more and more developers who are starting to question whether Python is the right long-term solution. It is this question that’s led them to use Rust for some of their projects, and you might want to consider it as well.
Python's dominance in AI
The evolution of machine learning among data scientists led to Python becoming the lingua franca of AI. A simple core language with a large ecosystem of libraries well suited for AI tasks (Hugging Face, LangChain, PyTorch, TensorFlow, and so on) means we can spin up a proof-of-concept agent in an afternoon.
But there’s a critical question we need to ask as we move from a single cool demo to a production system running hundreds of agents: How does it scale?
For CPU-bound tasks in Python, the answer isn’t great. The reason is the Global Interpreter Lock (GIL), and it’s the bottleneck that will force you to rethink your architecture. This is where Rust comes in, offering a path to build concurrent, scalable agentic systems.
The scaling problem: From 5 to 500 agents
An agentic framework is inherently concurrent. At any moment, you can have multiple agents performing different tasks: one making an API call, another processing a large text file, and a third running a simulation.
With 5 agents, you can often get by. The tasks are spread out, and the performance hiccups aren't critical.
With 500 agents, any inefficiency is magnified exponentially. If each agent's "thinking" process is a CPU-bound task, a Python-based system will grind to a halt. The GIL ensures that no matter how many CPU cores you have, only one agent can "think" at a time. It’s like having a 16-lane highway that narrows down to a single-lane bridge.
Multiprocessing combined with message passing is the typical way to work around the GIL in Python, but thread-based programming has several advantages, a big one being reduced complexity. Threads also scale with less overhead: there are no extra interpreter processes to spawn, and context switches are cheaper.
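As a point of reference, here is a minimal sketch of that multiprocessing workaround (the function and pool size are illustrative, not taken from any particular framework). Each worker gets its own interpreter process, and therefore its own GIL, at the cost of process start-up and the message passing needed to collect results:
import time
from multiprocessing import Pool

def sum_primes(start, end):
    # The same kind of CPU-bound "thinking" used in the benchmarks below
    total = 0
    for num in range(start, end):
        if num > 1 and all(num % i for i in range(2, int(num**0.5) + 1)):
            total += num
    return total

if __name__ == "__main__":
    LIMIT = 200_000
    ranges = [(2, LIMIT // 2), (LIMIT // 2, LIMIT)]
    start = time.perf_counter()
    with Pool(processes=2) as pool:
        # starmap sends each (start, end) tuple to a separate worker process
        results = pool.starmap(sum_primes, ranges)
    print(f"Sum: {sum(results)}, time: {time.perf_counter() - start:.4f} seconds")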
Python's GIL bottleneck: A practical demonstration
Let’s demonstrate this with a common CPU-bound task. We'll have our "agents" perform a heavy computation (summing prime numbers) using both a single thread and multiple threads. I’ve provided examples so you can try it out yourself.
Python: Hitting the wall
Because this task is CPU-bound, the GIL prevents the threads from running in parallel. The multi-threaded version is actually slightly slower due to the overhead of managing the threads. Here’s some code to copy into a file we’ll call cpu_perf.py:
import time
import threading
# A simple, CPU-intensive task to simulate an agent "thinking"
def sum_primes(start, end):
    total = 0
    for num in range(start, end):
        is_prime = True
        for i in range(2, int(num**0.5) + 1):
            if num % i == 0:
                is_prime = False
                break
        if is_prime:
            total += num
    # This print is just for verification; it can be removed for pure speed tests
    # print(f"Sum for range {start}-{end}: {total}")
LIMIT = 200000
MIDPOINT = LIMIT // 2
# --- Single-threaded version (1 agent) ---
start_time = time.perf_counter()
sum_primes(2, LIMIT)
end_time = time.perf_counter()
print(f"Single-threaded (1 agent) time: {end_time - start_time:.4f} seconds\n")
# --- Multi-threaded version (2 agents) ---
thread1 = threading.Thread(target=sum_primes, args=(2, MIDPOINT))
thread2 = threading.Thread(target=sum_primes, args=(MIDPOINT, LIMIT))
start_time = time.perf_counter()
thread1.start()
thread2.start()
thread1.join()
thread2.join()
end_time = time.perf_counter()
print(f"Multi-threaded (2 agents) time: {end_time - start_time:.4f} seconds")When I run this on my Fedora system with:
python cpu_perf.py
I see the following output:
Single-threaded (1 agent) time: 0.1408 seconds
Multi-threaded (2 agents) time: 0.1520 seconds
Splitting the work into two "agents" took even longer than just running one. This is the GIL bottleneck in action.
Rust: Concurrency that delivers parallelism
If you’re new to Rust, you start by setting up a project with cargo:
sudo dnf install rust cargo
cargo new cpu_perf
You’ll see the output:
Creating binary (application) `cpu_perf` package
note: see more `Cargo.toml` keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
Cargo creates a directory called cpu_perf containing a Cargo.toml file and a src directory with a stub main.rs. The only file we need to edit is src/main.rs; leave the generated Cargo.toml alone, because the cargo add command shown below will update it for us.
Place the following in the file src/main.rs:
use rayon::prelude::*;
use std::time::Instant;
// A function to check if a single number is prime
fn is_prime(n: u32) -> bool {
    if n <= 1 { return false; }
    for i in 2..=(n as f64).sqrt() as u32 {
        if n % i == 0 {
            return false;
        }
    }
    true
}
const LIMIT: u32 = 200_000;
fn main() {
    // --- Single-threaded version (for comparison) ---
    let start = Instant::now();
    let single_core_sum: u64 = (2..LIMIT).filter(|&n| is_prime(n)).map(|n| n as u64).sum();
    let duration = start.elapsed();
    println!("Sum: {}", single_core_sum);
    println!("Single-threaded time: {:.4?} seconds\n", duration.as_secs_f64());
    // --- Multi-threaded version with Rayon ---
    let start = Instant::now();
    // Rayon's `into_par_iter` automatically distributes the work across all available CPU cores
    let multi_core_sum: u64 = (2..LIMIT).into_par_iter().filter(|&n| is_prime(n)).map(|n| n as u64).sum();
    let duration = start.elapsed();
    println!("Sum: {}", multi_core_sum);
    println!("Multi-threaded time with Rayon: {:.4?} seconds", duration.as_secs_f64());
}
The following commands, when executed from the cpu_perf project directory, will bring in the Rayon crate and then compile and run our Rust code:
cargo add rayon
cargo run --release
The resulting output will look something like this:
Compiling crossbeam-utils v0.8.21
Compiling rayon-core v1.13.0
Compiling either v1.15.0
Compiling crossbeam-epoch v0.9.18
Compiling crossbeam-deque v0.8.6
Compiling rayon v1.11.0
Compiling cpu_perf v0.1.0 (/home/limershe/cpu_perf)
Finished `release` profile [optimized] target(s) in 2.11s
Running `target/release/cpu_perf`
Sum: 1709600813
Single-threaded time: 0.0107 seconds
Sum: 1709600813
Multi-threaded time with Rayon: 0.0025 seconds
This time, adding threads actually pays off. The single-threaded Rust run is already an order of magnitude faster than the Python one, and Rayon then spreads the work across every available core, cutting the time roughly in four on my machine. Now imagine this scenario with 500 agents on a 64-core machine. The Rust version would continue to scale, while the Python version would not.
Why GPU isn't the bottleneck (it's the CPU and network)
So how did Python get so popular for AI workloads in the first place if it’s that much slower at multi-threaded work?
It turns out that the GIL isn’t an issue for GPU-intensive tasks. When a Python thread hands work off to the GPU (via CUDA), the underlying extension code releases the GIL while the kernel runs, which lets other threads keep running on the CPU.
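As a rough illustration of that pattern, here is a sketch of several threads dispatching heavy tensor work at once. It assumes PyTorch is installed and falls back to the CPU if no CUDA device is available; the matrix sizes and loop counts are arbitrary:
import threading
import torch  # assumption: PyTorch is installed; CUDA is optional

device = "cuda" if torch.cuda.is_available() else "cpu"

def gpu_task(name):
    # The matrix multiplications run in PyTorch's C++/CUDA code, which can
    # release the GIL while the heavy work executes outside the interpreter
    a = torch.randn(2048, 2048, device=device)
    b = torch.randn(2048, 2048, device=device)
    for _ in range(20):
        a = a @ b
    print(f"{name} finished on {device}")

threads = [threading.Thread(target=gpu_task, args=(f"agent-{i}",)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()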
The real bottlenecks for Python agents are:
- CPU-bound tasks: As demonstrated above, any "thinking" or data processing done by the agent is severely limited by the GIL.
- Network I/O: Python supports asynchronous I/O through the standard library's asyncio module, and awaiting non-blocking calls is the usual pattern for network-heavy Python code (see the short asyncio sketch after this list). While threading also works well for I/O in Python, a large number of concurrent network requests can still be handled more efficiently in Rust, thanks to its lower per-task overhead, its mature async runtime (Tokio), and easy-to-use libraries built on top of it such as Actix Web and Reqwest.
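Here is a minimal asyncio sketch of the same two simulated one-second requests that the threaded example in the next section makes; asyncio.sleep stands in for a non-blocking network call, and both waits overlap on a single thread:
import asyncio
import time

async def simulate_network_request():
    """Simulates a 1-second network delay without blocking the event loop."""
    await asyncio.sleep(1)

async def main():
    start = time.perf_counter()
    # Both "requests" are awaited concurrently on a single thread
    await asyncio.gather(simulate_network_request(), simulate_network_request())
    print(f"Async I/O took: {time.perf_counter() - start:.4f} seconds")

asyncio.run(main())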
A practical I/O example in Python
This example shows that threading in Python is effective for I/O-bound tasks because the GIL is released during the wait. To see the effects of this, create the file io_perf.py and run it on your system:
import time
import threading
def simulate_network_request():
    """Simulates a 1-second network delay."""
    time.sleep(1)
# --- Single-threaded I/O ---
start_time = time.perf_counter()
simulate_network_request()
simulate_network_request()
end_time = time.perf_counter()
print(f"Single-threaded I/O took: {end_time - start_time:.4f} seconds")
# --- Multi-threaded I/O ---
thread1 = threading.Thread(target=simulate_network_request)
thread2 = threading.Thread(target=simulate_network_request)
start_time = time.perf_counter()
thread1.start()
thread2.start()
thread1.join()
thread2.join()
end_time = time.perf_counter()
print(f"Multi-threaded I/O took: {end_time - start_time:.4f} seconds")When I run this on my Fedora system with:
python io_perf.py
The following output is the result:
Single-threaded I/O took: 2.0006 seconds
Multi-threaded I/O took: 1.0008 seconds
As we anticipated, the multi-threaded version is about twice as fast, because the GIL is released while each thread waits. Remember, though, that a real agent does both I/O and CPU work. Rust side-steps the problem because it delivers true parallelism for both.
The pragmatic path: A hybrid approach
For most developers I’ve spoken with who are adopting Rust, the solution isn't to abandon Python entirely. Instead, they’re using the two together. A practical approach is to prototype in Python first, using its rich ecosystem to build and test agent logic. Then you identify bottlenecks by profiling your application to find the most resource-intensive parts. Finally, you rewrite those critical components in Rust and expose them as native Python modules using a tool like PyO3, a Rust library that provides bindings for the Python interpreter.
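To give a feel for that last step, here is what the Python side can look like once a hot path has been moved to Rust. The module name fast_agent and its sum_primes function are hypothetical; the extension itself would be built from the Rust source with a tool such as maturin:
import time

import fast_agent  # hypothetical PyO3 extension module built with maturin

start = time.perf_counter()
# Same call signature as the pure-Python version, but the loop runs in compiled Rust
total = fast_agent.sum_primes(2, 200_000)
print(f"Sum: {total}, time: {time.perf_counter() - start:.4f} seconds")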
While it’s certainly possible to use native AI crates with Rust, the hybrid model gives you the best of both worlds: the development speed (and proven AI module support) of Python and the execution speed of Rust, allowing you to build AI agents that are not only intelligent but also highly performant and ready to scale.