Alexandre Marques's contributions
Article
Accelerating large language models with NVFP4 quantization
Shubhra Pandit and 3 co-authors
Learn about NVFP4, a 4-bit floating-point format for high-performance inference on modern GPUs that can deliver near-baseline accuracy at large scale.
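As a taste of what deployment looks like, here is a minimal sketch of serving an NVFP4 checkpoint with vLLM. The model ID is hypothetical; vLLM reads the quantization scheme from the checkpoint configuration, so no extra quantization flag should be needed.

```python
from vllm import LLM, SamplingParams

# Hypothetical NVFP4-quantized checkpoint ID, for illustration only.
# vLLM detects the quantization scheme from the checkpoint config.
llm = LLM(model="your-org/Llama-3.1-8B-Instruct-NVFP4")

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain 4-bit floating-point formats."], params)
print(outputs[0].outputs[0].text)
```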
Article
Speculators: Standardized, production-ready speculative decoding
Alexandre Marques and 7 co-authors
Speculators standardizes speculative decoding for large language models, with a unified Hugging Face format, vLLM integration, and more.
Article
Fly Eagle(3) fly: Faster inference with vLLM & speculative decoding
Alexandre Marques and 2 co-authors
Boost inference performance by up to 2.5X with vLLM's Eagle 3 speculative decoding integration. Discover how in this blog post.
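For readers who want to try it before opening the post, here is a minimal sketch of EAGLE-3 speculative decoding through vLLM's speculative_config, assuming a recent vLLM release; the draft model ID is hypothetical and must be a speculator trained for the chosen target model.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    speculative_config={
        "method": "eagle3",
        # Hypothetical draft ID: an EAGLE-3 speculator for the target above.
        "model": "your-org/llama-3.1-8b-eagle3-speculator",
        "num_speculative_tokens": 3,
    },
)

outputs = llm.generate(
    ["Why does speculative decoding speed up autoregressive inference?"],
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)
```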
Article
Axolotl meets LLM Compressor: Fast, sparse, open
Rahul Tuli and 3 co-authors
Discover how to deploy compressed, fine-tuned models for efficient inference with the new Axolotl and LLM Compressor integration.
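The post covers the Axolotl side in detail; the compression step itself looks roughly like this one-shot W4A16 (GPTQ) pass with LLM Compressor. The checkpoint path is a placeholder for an Axolotl fine-tuning output, and the calibration settings are illustrative.

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

# One-shot W4A16 quantization of a fine-tuned checkpoint.
# "./axolotl-out/checkpoint-final" is a placeholder path.
oneshot(
    model="./axolotl-out/checkpoint-final",
    dataset="open_platypus",          # calibration dataset
    recipe=GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]),
    output_dir="./checkpoint-final-W4A16",
    max_seq_length=2048,
    num_calibration_samples=512,
)
```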
Article
Enable 3.5 times faster vision language models with quantization
Shubhra Pandit and 4 co-authors
Learn how quantized vision-language models enable faster inference, lower costs, and scalable AI deployment without compromising capability.
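As a rough illustration, a quantized vision-language model can be served with vLLM like any other checkpoint. The model ID below is hypothetical, and the prompt template with its `<image>` placeholder is model-specific.

```python
from vllm import LLM, SamplingParams
from PIL import Image

# Hypothetical quantized vision-language checkpoint, for illustration only.
llm = LLM(model="your-org/llava-1.5-7b-w4a16", max_model_len=4096)

# Prompt templates and image placeholders vary by model family.
prompt = "USER: <image>\nDescribe this picture. ASSISTANT:"
image = Image.open("photo.jpg")

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```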
Article
Deployment-ready reasoning with quantized DeepSeek-R1 models
Eldar Kurtić and 3 co-authors
Explore new open source quantized reasoning models based on the DeepSeek-R1-Distill suite that deliver near-perfect accuracy and inference speed improvements.
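A hedged sketch of running one of these models with vLLM: the checkpoint ID is a placeholder for one of the quantized R1-Distill releases, and the generous max_tokens budget reflects the long chains of thought reasoning models emit before their final answer.

```python
from vllm import LLM, SamplingParams

# Placeholder ID for a quantized DeepSeek-R1-Distill checkpoint.
llm = LLM(model="your-org/DeepSeek-R1-Distill-Llama-8B-quantized.w8a8")

# Reasoning models produce long intermediate thoughts, so leave room.
params = SamplingParams(temperature=0.6, max_tokens=4096)
outputs = llm.generate(["How many prime numbers are there below 100?"], params)
print(outputs[0].outputs[0].text)
```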
Article
2:4 Sparse Llama: Smaller models for efficient GPU inference
Eldar Kurtić and 4 co-authors
Discover Sparse Llama: A 50% pruned, GPU-optimized Llama 3.1 model with 2:4 sparsity, enabling faster, cost-effective inference without sacrificing accuracy.
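For context on what 2:4 sparsity involves, here is a rough sketch of one-shot pruning with LLM Compressor's SparseGPTModifier; the recipe parameters are illustrative rather than the exact settings used for Sparse Llama.

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.obcq import SparseGPTModifier

# 2:4 sparsity zeroes two of every four consecutive weights, a pattern
# NVIDIA sparse tensor cores can execute at higher throughput.
oneshot(
    model="meta-llama/Llama-3.1-8B-Instruct",
    dataset="open_platypus",          # calibration dataset
    recipe=SparseGPTModifier(
        sparsity=0.5,
        mask_structure="2:4",
        targets=["Linear"],
        ignore=["lm_head"],
    ),
    output_dir="./llama-3.1-8b-2of4-sparse",
    num_calibration_samples=512,
)
```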
Article
Compressed Granite 3.1: Powerful performance in a small package
Shubhra Pandit and 2 co-authors
Compressed Granite 3.1 models are open-sourced on Hugging Face, deployment-ready with vLLM, and extensible with LLM Compressor.