If you’re looking to build agentic AI solutions that perform well, one area that doesn’t get discussed nearly often enough is the core runtime environment. For AI developers, the term runtime is almost always synonymous with Python.
But as agent-oriented designs take on more performance-critical roles, I’m speaking with more and more developers who are starting to question whether Python is the right long-term solution. That question has led them to use Rust for some of their projects, and you might want to consider it as well.
Python's dominance in AI
The evolution of machine learning among data scientists led to Python becoming the lingua franca of AI. A simple core language with a large ecosystem of libraries well suited for AI tasks (Hugging Face, LangChain, PyTorch, TensorFlow, and so on) means we can spin up a proof-of-concept agent in an afternoon.
But there’s a critical question we need to ask as we move from a single cool demo to a production system running hundreds of agents: How does it scale?
For CPU-bound tasks in Python, the answer isn’t great. The reason is the Global Interpreter Lock (GIL), and it’s the bottleneck that will force you to rethink your architecture. This is where Rust comes in, offering a path to build concurrent, scalable agentic systems.
The scaling problem: From 5 to 500 agents
An agentic framework is inherently concurrent. At any moment, you can have multiple agents performing different tasks: one making an API call, another processing a large text file, and a third running a simulation.
With 5 agents, you can often get by. The tasks are spread out, and the performance hiccups aren't critical.
With 500 agents, any inefficiency is magnified exponentially. If each agent's "thinking" process is a CPU-bound task, a Python-based system will grind to a halt. The GIL ensures that no matter how many CPU cores you have, only one agent can "think" at a time. It’s like having a 16-lane highway that narrows down to a single-lane bridge.
Multiprocessing combined with message passing is the typical way to work around the GIL in Python, but thread-based programming has several advantages, a big one being reduced complexity. Threads also carry less overhead from process creation and context switching, and that overhead matters more and more as you scale.
Python's GIL bottleneck: A practical demonstration
Let’s demonstrate this with a common CPU-bound task. We'll have our "agents" perform a heavy computation (summing prime numbers) using both a single thread and multiple threads. I’ve provided examples so you can try it out yourself.
Python: Hitting the wall
Because this task is CPU-bound, the GIL prevents the threads from running in parallel. The multi-threaded version is actually slightly slower due to the overhead of managing the threads. Here’s some code to copy into a file we’ll call cpu_perf.py:
import time
import threading

# A simple, CPU-intensive task to simulate an agent "thinking"
def sum_primes(start, end):
    total = 0
    for num in range(start, end):
        is_prime = True
        for i in range(2, int(num**0.5) + 1):
            if num % i == 0:
                is_prime = False
                break
        if is_prime:
            total += num
    # This print is just for verification; it can be removed for pure speed tests
    # print(f"Sum for range {start}-{end}: {total}")

LIMIT = 200000
MIDPOINT = LIMIT // 2

# --- Single-threaded version (1 agent) ---
start_time = time.perf_counter()
sum_primes(2, LIMIT)
end_time = time.perf_counter()
print(f"Single-threaded (1 agent) time: {end_time - start_time:.4f} seconds\n")

# --- Multi-threaded version (2 agents) ---
thread1 = threading.Thread(target=sum_primes, args=(2, MIDPOINT))
thread2 = threading.Thread(target=sum_primes, args=(MIDPOINT, LIMIT))
start_time = time.perf_counter()
thread1.start()
thread2.start()
thread1.join()
thread2.join()
end_time = time.perf_counter()
print(f"Multi-threaded (2 agents) time: {end_time - start_time:.4f} seconds")
When I run this on my Fedora system with:
python cpu_perf.py
I see the following output:
Single-threaded (1 agent) time: 0.1408 seconds
Multi-threaded (2 agents) time: 0.1520 seconds
Splitting the work into two "agents" took even longer than just running one. This is the GIL bottleneck in action.
Rust: Concurrency that delivers parallelism
If you’re new to Rust, start by installing the toolchain (shown here for Fedora) and creating a new project with cargo:
sudo dnf install rust cargo
cargo new cpu_perf
You’ll see the output:
Creating binary (application) `cpu_perf` package
note: see more `Cargo.toml` keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
Rust will create a directory called cpu_perf containing a Cargo.toml file and a src directory, which already holds a main.rs file generated by cargo. We only need to edit that one file; the Cargo.toml in the project root will be updated for us when we add the Rayon dependency in a moment. Replace the contents of src/main.rs with the following:
use rayon::prelude::*;
use std::time::Instant;

// A function to check if a single number is prime
fn is_prime(n: u32) -> bool {
    if n <= 1 { return false; }
    for i in 2..=(n as f64).sqrt() as u32 {
        if n % i == 0 {
            return false;
        }
    }
    true
}

const LIMIT: u32 = 200_000;

fn main() {
    // --- Single-threaded version (for comparison) ---
    let start = Instant::now();
    let single_core_sum: u64 = (2..LIMIT).filter(|&n| is_prime(n)).map(|n| n as u64).sum();
    let duration = start.elapsed();
    println!("Sum: {}", single_core_sum);
    println!("Single-threaded time: {:.4?} seconds\n", duration.as_secs_f64());

    // --- Multi-threaded version with Rayon ---
    let start = Instant::now();
    // Rayon's `into_par_iter` automatically distributes the work across all available CPU cores
    let multi_core_sum: u64 = (2..LIMIT).into_par_iter().filter(|&n| is_prime(n)).map(|n| n as u64).sum();
    let duration = start.elapsed();
    println!("Sum: {}", multi_core_sum);
    println!("Multi-threaded time with Rayon: {:.4?} seconds", duration.as_secs_f64());
}
The following commands, when executed from the cpu_perf project directory, will bring in the Rayon crate and then compile and run our Rust code:
cargo add rayon
cargo run --release
The resulting output will look something like this:
Compiling crossbeam-utils v0.8.21
Compiling rayon-core v1.13.0
Compiling either v1.15.0
Compiling crossbeam-epoch v0.9.18
Compiling crossbeam-deque v0.8.6
Compiling rayon v1.11.0
Compiling cpu_perf v0.1.0 (/home/limershe/cpu_perf)
Finished `release` profile [optimized] target(s) in 2.11s
Running `target/release/cpu_perf`
Sum: 1709600813
Single-threaded time: 0.0107 seconds
Sum: 1709600813
Multi-threaded time with Rayon: 0.0025 seconds
As you can see, the Rust story is very different: adding threads doesn't add overhead, it adds speed. The Rayon version runs roughly four times faster than the single-threaded version because the work is spread across all available CPU cores. Now imagine this scenario with 500 agents on a 64-core machine. The Rust version would continue to scale, while the Python version would not.
Why GPU isn't the bottleneck (it's the CPU and network)
So how did Python get so popular for AI workloads in the first place if it’s that much slower at multi-threaded work?
It turns out that the GIL isn’t an issue for GPU-intensive tasks. When a Python thread sends a task to a GPU (via CUDA), it releases the GIL. This allows other threads to run on the CPU.
The real bottlenecks for Python agents are:
- CPU-bound tasks: As demonstrated above, any "thinking" or data processing done by the agent is severely limited by the GIL.
- Network I/O: Python supports asynchronous I/O through the asyncio module in the standard library, and awaiting non-blocking calls is especially common in network-heavy code. While threading also works well for I/O in Python, a very large number of concurrent network requests can still be handled more efficiently in Rust, thanks to its lower per-task overhead, its mature async runtime (Tokio), and easy-to-use libraries built on top of it such as Actix Web and Reqwest (see the sketch after this list).
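To make the Rust side of that comparison concrete, here is a minimal sketch of two concurrent HTTP requests on Tokio. It assumes a fresh Cargo project with the tokio (full features) and reqwest crates added via cargo add, and the URLs are placeholders; treat it as an illustration of the async shape, not a production client.

use std::time::Instant;

// Assumes: cargo add tokio --features full && cargo add reqwest
#[tokio::main]
async fn main() -> Result<(), reqwest::Error> {
    let start = Instant::now();

    // Issue both requests concurrently; neither one blocks an OS thread while waiting.
    let (first, second) = tokio::join!(
        reqwest::get("https://example.com"), // placeholder URLs
        reqwest::get("https://example.org")
    );

    println!("First response: {} bytes", first?.text().await?.len());
    println!("Second response: {} bytes", second?.text().await?.len());
    println!("Both requests finished in {:?}", start.elapsed());
    Ok(())
}

Every await point hands control back to the Tokio scheduler, so thousands of in-flight requests can be multiplexed over a small pool of OS threads rather than one thread per request.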
A practical I/O example in Python
This example shows that threading in Python is effective for I/O-bound tasks because the GIL is released during the wait. To see the effects of this, create the file io_perf.py and run it on your system:
import time
import threading

def simulate_network_request():
    """Simulates a 1-second network delay."""
    time.sleep(1)

# --- Single-threaded I/O ---
start_time = time.perf_counter()
simulate_network_request()
simulate_network_request()
end_time = time.perf_counter()
print(f"Single-threaded I/O took: {end_time - start_time:.4f} seconds")

# --- Multi-threaded I/O ---
thread1 = threading.Thread(target=simulate_network_request)
thread2 = threading.Thread(target=simulate_network_request)
start_time = time.perf_counter()
thread1.start()
thread2.start()
thread1.join()
thread2.join()
end_time = time.perf_counter()
print(f"Multi-threaded I/O took: {end_time - start_time:.4f} seconds")
When I run this on my Fedora system with:
python io_perf.py
I see the following output:
Single-threaded I/O took: 2.0006 seconds
Multi-threaded I/O took: 1.0008 seconds
As we anticipated, the multi-threaded version is about twice as fast, because each thread releases the GIL while it sleeps. Remember, though, that a real agent does both I/O and CPU work. Rust addresses both: it delivers true parallelism for the CPU-bound parts and an efficient async runtime for the I/O, as the sketch below illustrates.
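Here is a minimal sketch of that combination, reusing the prime-summing "thinking" task from earlier. It assumes only the tokio crate; tokio::task::spawn_blocking moves the CPU-bound work onto Tokio's blocking thread pool so the async runtime can overlap it with the (simulated) network wait.

use std::time::{Duration, Instant};
use tokio::{task, time::sleep};

// Same CPU-bound "thinking" task as the earlier Rayon example.
fn is_prime(n: u32) -> bool {
    if n <= 1 { return false; }
    for i in 2..=(n as f64).sqrt() as u32 {
        if n % i == 0 { return false; }
    }
    true
}

fn sum_primes(limit: u32) -> u64 {
    (2..limit).filter(|&n| is_prime(n)).map(|n| n as u64).sum()
}

// Assumes: cargo add tokio --features full
#[tokio::main]
async fn main() {
    let start = Instant::now();

    // CPU-bound work runs on Tokio's blocking thread pool...
    let thinking = task::spawn_blocking(|| sum_primes(200_000));

    // ...while the async runtime overlaps it with a simulated 1-second network call.
    let network = sleep(Duration::from_secs(1));

    let (sum, _) = tokio::join!(thinking, network);
    println!("Sum: {}", sum.expect("blocking task panicked"));
    println!("Overlapped CPU + I/O finished in {:?}", start.elapsed());
}

The same pattern extends to many agents: the blocking pool spreads the CPU work across cores while the async tasks keep the network traffic flowing.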
The pragmatic path: A hybrid approach
For most developers I’ve spoken with who are adopting Rust, the solution isn't to abandon Python entirely; instead, they use the two together. A practical approach is to prototype in Python first, using its rich ecosystem to build and test the agent logic. Then profile the application to find the bottlenecks, the most resource-intensive parts of the system. Finally, rewrite those critical components in Rust and expose them as native Python modules using tools like PyO3, a Rust library that provides bindings for the Python interpreter so that modules written in Rust can be consumed from Python applications.
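To give a feel for that last step, here is a minimal PyO3 sketch that exposes the prime-summing routine from this article to Python. The module name agent_core is hypothetical, the exact macro signatures vary a little between PyO3 releases, and a project like this is typically built into your virtual environment with maturin (for example, maturin develop), so treat it as a starting point rather than a finished implementation.

use pyo3::prelude::*;

// CPU-bound prime summing, compiled to native code and callable from Python.
#[pyfunction]
fn sum_primes(start: u32, end: u32) -> u64 {
    (start..end)
        .filter(|&n| {
            if n <= 1 {
                return false;
            }
            (2..=(n as f64).sqrt() as u32).all(|i| n % i != 0)
        })
        .map(|n| n as u64)
        .sum()
}

// Hypothetical module name; the function name becomes the importable Python module.
#[pymodule]
fn agent_core(m: &Bound<'_, PyModule>) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(sum_primes, m)?)?;
    Ok(())
}

From Python, the compiled module imports like any other (import agent_core), so the rest of the agent logic stays in Python while the hot loop runs as native code.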
While it’s certainly possible to use native AI crates with Rust, the hybrid model gives you the best of both worlds: the development speed (and proven AI module support) of Python and the execution speed of Rust, allowing you to build AI agents that are not only intelligent but also highly performant and ready to scale.