Robert Shaw
Robert Shaw's contributions
Article
Why vLLM is the best choice for AI inference today
Fatih E. Nar and 4 others
Discover the advantages of vLLM, an open source inference server that speeds up generative AI applications by making better use of GPU memory.
Article
Scaling DeepSeek-style MoEs with vLLM and llm-d using Wide EP
Robert Shaw and 2 others
Learn how to deploy and scale Mixture of Experts (MoE) models using vLLM's new execution model and llm-d's intelligent Kubernetes-native inference framework.
Article
llm-d: Kubernetes-native distributed inferencing
Robert Shaw and 2 others
llm-d delivers Kubernetes-native distributed inference with advanced optimizations, reducing latency and maximizing throughput.
Article
Performance boosts in vLLM 0.8.1: Switching to the V1 engine
Robert Shaw and 1 other
Explore performance and usability improvements in vLLM 0.8.1 on OpenShift, including crucial architectural overhauls and multimodal inference optimizations.
Article
How we optimized vLLM for DeepSeek-R1
Michael Goin and 4 others
Explore inference performance improvements that help vLLM serve DeepSeek AI models more efficiently in this technical deep dive.
Article
LLM Compressor is here: Faster inference with vLLM
Robert Shaw and 3 others
Discover LLM Compressor, a unified library for creating accurate compressed models for cheaper and faster inference with vLLM.
Article
Sparse fine-tuning for accelerating large language models with DeepSparse
Robert Shaw and 1 other
Sparse fine-tuning, combined with sparsity-aware inference software like DeepSparse, unlocks ubiquitous CPU hardware as a deployment target for LLM inference.
Article
SparseGPT: Remove 100 billion parameters for free
Robert Shaw and 1 other
Compress large language models (LLMs) with SparseGPT to make your machine learning inference fast and efficient. Prune in one shot with minimal accuracy loss.