Red Hat Developer Blog
Here's our most recent blog content. Explore our featured monthly resource as well as our latest posts, and don't miss the chance to learn more about our contributors.
An update on reproducing R1-like reasoning in small LLMs: Granite models show...
Open-sourced on Hugging Face, deployment-ready with vLLM, and extensible...
Learn about the alpha release of vLLM V1, a major upgrade to vLLM’s core...
Learn how to integrate Model Context Protocol (MCP) with LLMs using Node.js...
Advancing AI efficiency is more critical than ever, and sparsity has proven...
Quantized LLMs achieve near-full accuracy with minimal trade-offs after 500K+...
Machete, Neural Magic’s optimized kernel for NVIDIA Hopper GPUs, achieves...
Discover LLM Compressor, a unified library for creating accurate compressed...
Explore the integration of FP8 in vLLM. Learn how to receive up to a 2x...