vLLM brings FP8 inference to the open source community
Explore the integration of FP8 in vLLM. Learn how to achieve up to a 2x reduction in latency on NVIDIA GPUs with minimal accuracy degradation.
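As a quick illustration (not from the post itself), here is a minimal sketch of offline FP8 inference with vLLM's Python API. It assumes a vLLM build with FP8 support and a compatible recent NVIDIA GPU; the model name is a placeholder.

```python
# Minimal sketch: offline FP8 inference with vLLM's Python API.
# Assumes a vLLM build with FP8 support on a compatible NVIDIA GPU;
# the model name below is a placeholder, not from the original post.
from vllm import LLM, SamplingParams

# quantization="fp8" requests FP8 weight quantization at load time.
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", quantization="fp8")

params = SamplingParams(temperature=0.8, max_tokens=64)
outputs = llm.generate(["What is FP8 quantization?"], params)
for out in outputs:
    print(out.outputs[0].text)
```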
Llama 3's advancements, particularly at the 8-billion-parameter scale, make AI more accessible and efficient.
Learn about Marlin, a mixed-precision matrix multiplication kernel that delivers a 4x speedup with FP16xINT4 computations for batch sizes up to 32.
4-bit and 8-bit quantized LLMs excel in long-context tasks, retaining over 99% accuracy across 4K to 64K sequence lengths.
Sparse fine-tuning in combination with sparsity-aware inference software, like DeepSparse, unlocks ubiquitous CPU hardware as a deployment target for LLM inference.
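As a hedged sketch (not from the post), running a sparse LLM on CPU with DeepSparse's text-generation pipeline might look like the following; the model stub is a placeholder, and exact argument names can vary across DeepSparse versions.

```python
# Minimal sketch: CPU inference with DeepSparse's text-generation pipeline.
# The model stub below is a placeholder; substitute a real SparseZoo stub
# or a local deployment directory for an actual run.
from deepsparse import TextGeneration

pipeline = TextGeneration(model="zoo:mpt-7b-dolly_mpt_pretrain-pruned50_quantized")
output = pipeline(prompt="Explain sparsity-aware inference in one sentence.")
print(output.generations[0].text)
```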
Compress large language models (LLMs) with SparseGPT to make your machine learning inference fast and efficient. Prune in one shot with minimal accuracy loss.
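SparseGPT itself uses second-order (Hessian-based) information to decide which weights to remove; as a much simpler stand-in, this PyTorch sketch illustrates only the one-shot idea with plain magnitude pruning. It is illustrative, not the SparseGPT algorithm.

```python
# Illustrative one-shot pruning sketch in plain PyTorch.
# NOTE: this is simple magnitude pruning, NOT SparseGPT's
# second-order weight selection; it only shows the one-shot workflow.
import torch

def oneshot_magnitude_prune(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Zero out the smallest-magnitude entries of `weight` in a single pass."""
    k = int(weight.numel() * sparsity)
    if k == 0:
        return weight
    # kthvalue on the flattened magnitudes gives the pruning threshold.
    threshold = weight.abs().flatten().kthvalue(k).values
    return torch.where(weight.abs() <= threshold, torch.zeros_like(weight), weight)

layer = torch.nn.Linear(512, 512)
with torch.no_grad():
    layer.weight.copy_(oneshot_magnitude_prune(layer.weight, sparsity=0.5))
print(f"achieved sparsity: {(layer.weight == 0).float().mean():.2%}")
```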