Mark Kurtz's contributions
Article
Compressed Granite 3.1: Powerful performance in a small package
Shubhra Pandit and 2 others
Open-sourced on Hugging Face, deployment-ready with vLLM, and extensible using LLM Compressor.
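As a taste of the deployment path this article describes, here is a minimal sketch using vLLM's offline Python API. The quantized Granite repo id is an illustrative assumption, not confirmed by the article itself.

```python
# Minimal sketch: generating with a compressed Granite 3.1 checkpoint via
# vLLM's offline Python API. The model id below is assumed for illustration.
from vllm import LLM, SamplingParams

llm = LLM(model="neuralmagic/granite-3.1-8b-instruct-quantized.w4a16")  # assumed repo id
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Summarize what model compression does."], params)
print(outputs[0].outputs[0].text)
```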
Article
2:4 Sparse Llama FP8: SOTA performance for NVIDIA Hopper GPUs
Alexandre Marques and 5 others
Advancing AI efficiency is more critical than ever, and sparsity has proven to be a cornerstone in this pursuit.
Article
We ran over half a million evaluations on quantized LLMs—here's what we found
Eldar Kurtić and 3 others
Across more than 500,000 evaluations, quantized LLMs retain near-full accuracy with minimal trade-offs, making them efficient, high-performance options for AI model deployment.
Article
LLM Compressor is here: Faster inference with vLLM
Robert Shaw and 3 others
Discover LLM Compressor, a unified library for creating accurate compressed models for cheaper and faster inference with vLLM.
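For context, a minimal sketch of LLM Compressor's one-shot quantization flow, patterned on the project's published FP8 examples; treat the exact import paths and scheme name as assumptions that may shift between releases.

```python
# Minimal sketch, assuming the import paths from LLM Compressor's early
# releases: one-shot FP8 dynamic quantization of a Llama 3 checkpoint.
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

recipe = QuantizationModifier(
    targets="Linear",        # quantize all Linear layers...
    scheme="FP8_DYNAMIC",    # ...to FP8 weights with dynamic activation scales
    ignore=["lm_head"],      # keep the output head in higher precision
)

oneshot(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    recipe=recipe,
    output_dir="Meta-Llama-3-8B-Instruct-FP8-Dynamic",
)
```

The saved directory can then be passed directly to vLLM as a model path, which is the cheaper-and-faster inference loop the article highlights.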
Article
Deploy Llama 3 8B with vLLM
Mark Kurtz
Llama 3's advancements, particularly in the 8-billion-parameter model, make AI more accessible and efficient.
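As an illustration of the deployment the article covers, a minimal sketch that queries a vLLM OpenAI-compatible server assumed to be running locally (for example, started with `vllm serve meta-llama/Meta-Llama-3-8B-Instruct`); the endpoint and placeholder key are vLLM's defaults.

```python
# Minimal sketch: calling a locally running vLLM OpenAI-compatible server.
# Assumes the server was started separately, e.g.:
#   vllm serve meta-llama/Meta-Llama-3-8B-Instruct
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Give one benefit of an 8B model."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```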
Article
How well do quantized models handle long-context tasks?
Eldar Kurtić and 3 others
4-bit and 8-bit quantized LLMs excel in long-context tasks, retaining over 99% accuracy across 4K to 64K sequence lengths.
