Michael Goin
Michael Goin's contributions
Article
Structured outputs in vLLM: Guiding AI responses
Michael Goin and 2 others
Learn how to control the responses of vLLM-served models with structured outputs. Discover how to define choice lists, JSON schemas, regex patterns, and more.
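For a quick taste of what the article covers, here is a minimal sketch of guided decoding against a vLLM OpenAI-compatible server; the port, model name, and choice list are placeholder assumptions, and a JSON schema or regular expression can be supplied the same way via guided_json or guided_regex.

from openai import OpenAI

# Assumes a vLLM server is already running, e.g. `vllm serve Qwen/Qwen2.5-7B-Instruct`.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Constrain the model to answer from a fixed choice list.
completion = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",  # placeholder model name
    messages=[{"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"}],
    extra_body={"guided_choice": ["positive", "negative"]},
)
print(completion.choices[0].message.content)
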
Article
LLM Compressor: Optimize LLMs for low-latency deployments
Kyle Sayers and 3 others
LLM Compressor bridges the gap between model training and efficient deployment via quantization and sparsity, enabling cost-effective, low-latency inference.
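As a rough sketch of the workflow (not the article's exact recipe), one-shot W4A16 quantization with LLM Compressor looks roughly like the following; the model, calibration dataset, and output directory are placeholders, and import paths can differ between releases.

from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot  # newer releases also expose llmcompressor.oneshot

# Quantize all Linear layers to 4-bit weights with 16-bit activations (W4A16).
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

oneshot(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # placeholder model
    dataset="open_platypus",                     # placeholder calibration data
    recipe=recipe,
    output_dir="TinyLlama-1.1B-Chat-v1.0-W4A16",
    max_seq_length=2048,
    num_calibration_samples=512,
)

The saved checkpoint can then be loaded directly by vLLM for low-latency serving.
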
Article
How we optimized vLLM for DeepSeek-R1
Michael Goin and 4 others
Explore inference performance improvements that help vLLM serve DeepSeek AI models more efficiently in this technical deep dive.
Article
vLLM V1: Accelerating multimodal inference for large language models
Michael Goin and 3 others
Explore how vLLM's new multimodal AI inference capabilities enhance performance, scalability, and flexibility across diverse hardware platforms.
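For flavor, a small sketch of multimodal inference through vLLM's offline LLM API follows; the model name, prompt template, and image file are placeholder assumptions.

from vllm import LLM, SamplingParams
from PIL import Image

# Placeholder model; any vLLM-supported vision-language model works similarly.
llm = LLM(model="llava-hf/llava-1.5-7b-hf")

image = Image.open("cat.jpg")  # placeholder local image

outputs = llm.generate(
    {
        "prompt": "USER: <image>\nWhat is shown in this image? ASSISTANT:",
        "multi_modal_data": {"image": image},
    },
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
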
Article
Distributed inference with vLLM
Michael Goin
Explore how distributed inference works within vLLM in this recap of Neural Magic's vLLM Office Hours with Michael Goin and Murali Andoorveedu, a vLLM committer from CentML.
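As a minimal sketch of the idea, tensor parallelism in vLLM is enabled with a single argument; the model and GPU count below are placeholders, and pipeline parallelism can be layered on top for multi-node setups.

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # placeholder model
    tensor_parallel_size=4,                     # shard each layer's weights across 4 GPUs
)

outputs = llm.generate(
    ["Explain tensor parallelism in one sentence."],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
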
Article
vLLM brings FP8 inference to the open source community
Michael Goin and 5 others
Explore the integration of FP8 in vLLM. Learn how to achieve up to a 2x reduction in latency on NVIDIA GPUs with minimal accuracy degradation.
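As a sketch only (the article covers the details), dynamic FP8 quantization can be turned on with a single flag; the model below is a placeholder, and FP8-capable hardware such as NVIDIA Hopper or Ada GPUs is assumed.

from vllm import LLM, SamplingParams

# Quantize weights (and activations dynamically) to FP8 at load time.
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", quantization="fp8")

outputs = llm.generate(["FP8 inference is useful because"],
                       SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
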
Article
How Marlin pushes the boundaries of mixed-precision LLM inference
Michael Goin and 1 other
Learn about Marlin, a mixed-precision matrix multiplication kernel that delivers a 4x speedup with FP16xINT4 computations for batch sizes up to 32.
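A back-of-the-envelope sketch of why roughly 4x is attainable at small batch sizes: the GEMM is dominated by reading the weight matrix, and INT4 weights move a quarter of the bytes of FP16 weights. The layer shape below is an arbitrary placeholder.

# Approximate memory traffic for one FP16 vs. FP16xINT4 linear layer.
batch, in_features, out_features = 16, 4096, 4096

fp16_weight_bytes = in_features * out_features * 2           # 2 bytes per FP16 weight
int4_weight_bytes = in_features * out_features // 2           # 0.5 bytes per INT4 weight
activation_bytes = batch * (in_features + out_features) * 2   # FP16 activations in/out

ratio = (fp16_weight_bytes + activation_bytes) / (int4_weight_bytes + activation_bytes)
print(f"approximate memory-traffic reduction: {ratio:.2f}x")  # ~3.9x for these shapes
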
Article
Sparse fine-tuning for accelerating large language models with DeepSparse
Robert Shaw and 1 other
Sparse fine-tuning in combination with sparsity-aware inference software, like DeepSparse, unlocks ubiquitous CPU hardware as a deployment target for LLM inference.
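As a loose sketch of the deployment side, running a sparse, quantized LLM on CPU looks roughly like this; the SparseZoo stub is a placeholder and the pipeline API details may differ between DeepSparse releases.

from deepsparse import TextGeneration

# Placeholder SparseZoo stub for a pruned and quantized model.
pipeline = TextGeneration(model="zoo:mpt-7b-dolly_mpt_pretrain-pruned50_quantized")

result = pipeline(prompt="Write a haiku about sparse neural networks.")
print(result.generations[0].text)
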
