Mark Kurtz's contributions

Deployment-ready reasoning with quantized DeepSeek-R1 models
Explore new open source quantized reasoning models based on the DeepSeek-R1-Distill suite that deliver near-perfect accuracy recovery and faster inference.

2:4 Sparse Llama: Smaller models for efficient GPU inference
Discover Sparse Llama: A 50% pruned, GPU-optimized Llama 3.1 model with 2:4 sparsity, enabling faster, cost-effective inference without sacrificing accuracy.

Multimodal model quantization support through LLM Compressor
Explore multimodal model quantization in LLM Compressor, a unified library for optimizing models for deployment with vLLM.

Compressed Granite 3.1: Powerful performance in a small package
Open-sourced on Hugging Face, deployment-ready with vLLM, and extensible using LLM Compressor.

2:4 Sparse Llama FP8: SOTA performance for NVIDIA Hopper GPUs
Combining 2:4 sparsity with FP8 quantization, Sparse Llama delivers state-of-the-art inference performance on NVIDIA Hopper GPUs while advancing AI efficiency.

We ran over half a million evaluations on quantized LLMs—here's what we found
Across more than 500,000 evaluations, quantized LLMs achieved near-full accuracy with minimal trade-offs, making them efficient, high-performance options for AI model deployment.

LLM Compressor is here: Faster inference with vLLM
Discover LLM Compressor, a unified library for creating accurate compressed models for cheaper and faster inference with vLLM.
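
As a taste of the workflow the post covers, here is a minimal sketch of one-shot quantization with LLM Compressor. The model ID, calibration dataset, and recipe settings are illustrative assumptions rather than details from the post, and import paths may differ across library versions:

```python
# Minimal LLM Compressor sketch: one-shot W4A16 quantization of a
# Hugging Face model for serving with vLLM. The concrete choices below
# (model, dataset, scheme) are illustrative assumptions.
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot

# Quantize all Linear layers to 4-bit weights; keep lm_head in full precision.
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

oneshot(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # example model choice
    dataset="open_platypus",                      # example calibration dataset
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
    output_dir="Meta-Llama-3-8B-Instruct-W4A16",  # load this directory with vLLM
)
```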

Deploy Llama 3 8B with vLLM
Llama 3's advancements, particularly at the 8-billion-parameter scale, make capable AI more accessible and efficient to deploy.
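
For reference, a minimal sketch of serving Llama 3 8B with vLLM's offline Python API; the model ID, prompt, and sampling settings are illustrative choices, not details from the post:

```python
# Minimal vLLM sketch: load Llama 3 8B Instruct and generate offline.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")  # fetched from Hugging Face
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Why do 8B-parameter models hit a deployment sweet spot?"], params)
for out in outputs:
    print(out.outputs[0].text)
```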