Robert Shaw
Robert Shaw's contributions
Article
Why vLLM is the best choice for AI inference today
Fatih E. Nar and 4 others
Discover the advantages of vLLM, an open source inference server that speeds up generative AI applications by making better use of GPU memory.
Article
Scaling DeepSeek-style MoEs with vLLM and llm-d using Wide EP
Robert Shaw and 2 others
Learn how to deploy and scale Mixture of Experts (MoE) models using vLLM's new execution model and llm-d's intelligent Kubernetes-native inference framework.
Article
llm-d: Kubernetes-native distributed inferencing
Robert Shaw and 2 others
llm-d delivers Kubernetes-native distributed inference with advanced optimizations, reducing latency and maximizing throughput.
Article
Performance boosts in vLLM 0.8.1: Switching to the V1 engine
Robert Shaw and 1 other
Explore performance and usability improvements in vLLM 0.8.1 on OpenShift, including crucial architectural overhauls and multimodal inference optimizations.
Article
How we optimized vLLM for DeepSeek-R1
Michael Goin and 4 others
Explore inference performance improvements that help vLLM serve DeepSeek AI models more efficiently in this technical deep dive.
Article
LLM Compressor is here: Faster inference with vLLM
Robert Shaw and 3 others
Discover LLM Compressor, a unified library for creating accurate compressed models for cheaper and faster inference with vLLM.
Article
Sparse fine-tuning for accelerating large language models with DeepSparse
Robert Shaw and 1 other
Sparse fine-tuning, combined with sparsity-aware inference software like DeepSparse, unlocks ubiquitous CPU hardware as a deployment target for LLM inference.
Article
SparseGPT: Remove 100 billion parameters for free
Robert Shaw and 1 other
Compress large language models (LLMs) with SparseGPT to make your machine learning inference fast and efficient. Prune in one shot with minimal accuracy loss.