Saša Zelenović's contributions
Article
Run Voxtral Mini 4B Realtime on vLLM with Red Hat AI on Day 1: A step-by-step guide
Saša Zelenović and 1 other author
Learn how to deploy Voxtral Mini 4B Realtime, a streaming automatic speech recognition model for low-latency voice workloads, using Red Hat AI Inference Server.
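For a quick sanity check once the model is serving, a client can call the server's OpenAI-compatible transcription endpoint. This is a minimal sketch, not code from the article: it assumes the Red Hat AI Inference Server instance exposes vLLM's OpenAI-compatible API, and the URL, model name, and audio file are placeholders to replace with your own values.

    from openai import OpenAI

    # Point the OpenAI client at a locally running, OpenAI-compatible vLLM endpoint.
    # Base URL, API key, and model ID are placeholders, not values from the article.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    with open("sample.wav", "rb") as audio_file:
        transcript = client.audio.transcriptions.create(
            model="voxtral-mini-realtime",  # replace with the served model name
            file=audio_file,
        )

    print(transcript.text)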
Article
Run Mistral Large 3 & Ministral 3 on vLLM with Red Hat AI on Day 0: A step-by-step guide
Saša Zelenović and 6 other authors
Run the latest Mistral Large 3 and Ministral 3 models on vLLM with Red Hat AI, which provides day 0 access for immediate experimentation and deployment.
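Once the weights are available, the quickest way to try a newly released model is vLLM's offline Python API. A minimal sketch, assuming vLLM is installed; the model ID is a placeholder rather than the exact Hugging Face repository named in the article.

    from vllm import LLM, SamplingParams

    # Placeholder model ID; substitute the actual Mistral Large 3 or Ministral 3 checkpoint.
    llm = LLM(model="mistralai/Ministral-3-Instruct")
    params = SamplingParams(temperature=0.7, max_tokens=128)

    # Generate a completion for a single prompt and print the text.
    outputs = llm.generate(["Summarize what day 0 model support means."], params)
    for output in outputs:
        print(output.outputs[0].text)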
Article
DeepSeek-V3.2-Exp on vLLM, Day 0: Sparse Attention for long-context inference, ready for experimentation today with Red Hat AI
Saša Zelenović and 3 other authors
DeepSeek-V3.2-Exp brings major long-context efficiency gains to vLLM on Day 0 and deploys easily on the latest leading hardware and on Red Hat AI platforms.
Article
vLLM with torch.compile: Efficient LLM inference on PyTorch
Luka Govedič and 5 other authors
Learn how to optimize PyTorch code with minimal effort using torch.compile, a just-in-time compiler that generates optimized kernels automatically.
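As a minimal illustration of the idea (not code from the article), wrapping an ordinary nn.Module with torch.compile is typically a one-line change: the first call triggers compilation, and later calls reuse the generated kernels.

    import torch

    class TinyMLP(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.fc1 = torch.nn.Linear(128, 256)
            self.fc2 = torch.nn.Linear(256, 128)

        def forward(self, x):
            return self.fc2(torch.relu(self.fc1(x)))

    model = TinyMLP()
    compiled_model = torch.compile(model)    # just-in-time compiles on first use
    y = compiled_model(torch.randn(8, 128))  # later calls hit the optimized kernels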
Article
Ollama or vLLM? How to choose the right LLM serving tool for your use case
Addie Stevens and 2 other authors
Ollama makes it easy for developers to get started with local model experimentation, while vLLM provides a path to reliable, efficient, and scalable deployment.
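The practical difference shows up in the client code. A rough sketch under assumed defaults (Ollama running locally with a model already pulled, and a vLLM OpenAI-compatible server on port 8000); the model names are placeholders, not recommendations from the article.

    # Local experimentation: Ollama's Python client talks to the local Ollama daemon.
    import ollama

    reply = ollama.chat(
        model="llama3.2",  # placeholder model tag
        messages=[{"role": "user", "content": "Hello!"}],
    )
    print(reply["message"]["content"])

    # Scalable serving: vLLM exposes an OpenAI-compatible endpoint.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    completion = client.chat.completions.create(
        model="meta-llama/Llama-3.2-3B-Instruct",  # placeholder served model name
        messages=[{"role": "user", "content": "Hello!"}],
    )
    print(completion.choices[0].message.content)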