The state of open source AI models in 2025

2025 was an exciting year for AI hobbyists running large language models (LLMs) on their own hardware and organizations that need on-premises and sovereign AI. These use cases require open models you can download locally from a public registry like Hugging Face. You can then run them on inference engines such as Ollama or RamaLama (for simple deployments) or production-ready inference servers such as vLLM.

As we help developers deploy these models for customer service and knowledge management (using patterns like retrieval-augmented generation) or code assistance (through agentic AI), we see a trend toward specific models for specific use cases. Let's look at which models are used most in real-world applications and how you can start using them.

Leading to 2025: The pre-DeepSeek landscape

Before DeepSeek gained popularity at the beginning of 2025, the open model ecosystem was simpler (Figure 1). Meta's Llama family of models was quite dominant, and these dense models (ranging from 7 to 405 billion parameters) were easy to deploy or customize. Mistral was also competing (certainly in the EU market), but models from Asia, such as DeepSeek (with its V3) or Qwen were not yet popular.

Timeline of open model releases from 2023 to 2025, featuring Llama, Mistral, and IBM Granite, concluding with the OpenAI gpt-oss release in August 2025. — Figure 1: A brief recap of the open model ecosystem for 2025 and prior years.

Through the stock market effect and media attention, DeepSeek's reasoning model validated that open weights can deliver high-value reasoning. It showed that open models are capable options for teams that need cost control or air-gapped deployments. In fact, many of the models I'll discuss here come from Chinese labs and lead in total downloads per region. As per The ATOM Project, total model downloads switched from USA-dominant to China-dominant during the summer of 2025.

The highest-performing open models

Benchmarks show a model's capabilities on certain predefined tasks, but you can also measure capabilities through the LMArena. This crowdsourced AI evaluation platform lets users vote for a result from two models through a "battle." Figure 2 shows what this leaderboard looks like.

Leaderboard table from LMArena ranking AI models like Gemini-3-pro and Grok-4.1-thinking across categories including coding, math, and creative writing. — Figure 2: The LMArena leaderboard aggregates user votes into an interactive dashboard to understand model capabilities across writing, long queries, and more.

After filtering out the proprietary models such as Gemini, Claude, and ChatGPT, we're left with a few contenders. These include Kimi K2 from the Moonshot lab, Qwen3 from the Alibaba team, and of course, DeepSeek. This is quite interesting, as most folks know DeepSeek, but they might not be familiar with the others.

Qwen, Llama, and Gemma

Different AI use cases require different model sizes and capabilities, which is why open models are so useful. Instead of a general-purpose, one-size-fits-all scenario, model families such as Qwen offer various model sizes (ranging from as small as .5 B) and modalities (text or vision). The Qwen team maintains a transparent strategy for documentation and deployment instructions on GitHub and is active on X (formerly Twitter) to tease upcoming releases (Figure 4).

X post by Nathan Lambert: Airbnb CEO Brian Chesky notes they use Alibaba Qwen in production because it is faster and cheaper than OpenAI’s models. — Figure 4: A testimonial of how Qwen, thanks to their active presence and transparency online, are being used by the largest organizations for their AI strategy.

Llama and Gemma offer similar "families" of models, but the Qwen ecosystem has seen impressive adoption. While they might not have the highest benchmarks, their commitment to the open model community makes them one of the most used local models available. Farther down on The ATOM Project webpage shows how the Qwen family of models have become the most used through metrics of cumulative downloads as 2025 closes out.

Frontier models for RAG, agents, and AI-assisted coding

While labs like Qwen build models for specific use cases, other frontier labs are building capable models that perform like proprietary models (think ChatGPT or Gemini) at a fraction of the cost. A fair way to understand their capabilities, speed, and price is through Artificial Analysis, which incorporates evaluations (like MMLU-Pro and LiveCodeBench to compare all models, both proprietary and open (Figure 5).

Comparison charts showing Gemini 3 Pro as the most intelligent model, while gpt-oss-120B is the fastest and most cost-effective option available. — Figure 5: The Artificial Analysis dashboard combines intelligence, but also speed and price for models to help understand a model's strengths and weaknesses.

Let's look at the two with the highest intelligence score: Kimi K2 from Moonshot AI and gpt-oss from OpenAI.

Kimi: For tool calling and AI-assisted coding

Kimi K2 is one of the largest open models in terms of total parameters (about 1 trillion). It is designed with only roughly 32 billion active parameters per token to provide a smaller runtime footprint that can run on NVIDIA A100s, an H100, or even an A6000 (at 48 GB of VRAM if using 4-bit quantization). It performs quite well with agentic workflows, where you might need an AI assistant to search data, analyze trends and patterns, summarize, and generate a report. See Figure 6.

Grouped bar charts comparing Kimi K2 and GPT-5 Thinking models. Kimi K2 leads in expert reasoning and agentic web search, achieving a state-of-the-art 44.9% on Humanity’s Last Exam. — Figure 6: Kimi models meet or surpass proprietary and paid models in certain benchmarks.

Additionally, the "thinking" variant has a context window of up to 256,000 tokens. This is helpful in "vibe" or "spec" coding where you need to generate code, tests, and integrate Model Context Protocol (MCP) servers for additional capabilities.

OpenAI's gpt-oss: A (surprise) high performing open model

Because OpenAI is the de facto brand name for AI, its release of an open model a surprise. The gpt-oss model matches the performance of slightly older ChatGPT models. After a bit of a rough launch (due to a "harmony" chat template breaking tools), this model is now known for accurate tool use. Its 120b variant fits on a single 80 GB GPU (like an H100), and the 20b version fits on consumer hardware. It also provides a solid alternative to Qwen for organizations that are still evaluating their model decisions (Figure 7).

X post by Sam Altman: OpenAI releases gpt-oss, an open-weights reasoning model with performance similar to o4-mini, designed for local use on PCs and phones. — Figure 7: OpenAI's Sam Altman promoting the model on X (formerly Twitter) due to its strong performance for model size.

Small models for consumer devices and the edge

Perhaps the biggest win for AI in 2025 has been the advancement of small language models (SLMs) that can run on almost any consumer device, including mobile phones. Small models are improving faster than most people realize (Figure 8). Although parameter counts might not be changing, their capabilities are increasing. This is the result of improved attention kernels, efficient block layouts, and synthetic data generation techniques that were not available two years ago.

Scatter plot showing IBM Granite models outperforming larger SLMs. Granite 4.0 1B reaches nearly 70% accuracy, while 300M models surpass competing 1B parameter models. — Figure 8: An example of model improvements in SLMs, which typically couldn't perform this well in years past.

For example, the Granite 4 from IBM focused on edge and on-device deployments. It is even ISO 42001 certified for responsible development and governance. Models from Qwen, Gemma (Google), and Llama are also part of this adoption. They provide small models for developers that need air-gapped inference with predictable costs and no API key requirement.

Real-world use cases for open models

While smaller models helped the majority of people to run and experiment their own AI, open models are also used in the enterprise (Figure 9). We see this especially in highly regulated sectors (like telecommunications or banking) that have a strict requirement for on-premise deployment and data sovereignty. For example, due to data residency regulations, the usage of AI needs to stay local, so open models are a requirement.

Stacked diagram of an integrated AI platform. It shows capabilities like MLOps and resource management running on container engines across physical, cloud, and edge environments. — Figure 9: AI deployment in the enterprise requires not just inferencing a model, but many other capabilities to monitor, automate, and scale AI workloads.

In areas ranging from customer service automation (call centers and chatbots) to internal knowledge management (legal and document processing), we see a combination of tools. Teams use data processing tools like Docling along with a "smaller" language model like Llama 4 Scout, DeepSeek R1, or Llama 3 to process and respond to requests. Retrieval-augmented generation (RAG) workflows become important here (Figure 10), as these AI pipelines typically need unstructured data for accurate responses.

Flowchart of a RAG pipeline: a user question triggers data retrieval from a vector database; relevant text is then combined with a prompt and fed into an LLM to generate an answer. — Figure 10: RAG, or retrieval-augmented generation, remains the top way language models are customized to meet customer needs.

How to run these models on your own hardware

You can easily test and evaluate these models. To run locally on your own device (using a GPU or even just a CPU), use Ollama and RamaLama. These command-line interface (CLI) tools can pull and run a model with a single command. These projects both use the llama.cpp inference engine but provide a simple CLI to get you started.

Beyond just chatting, you can also replace run with serve to get a OpenAI-compatible API for your applications instead of a remote endpoint.

ollama run gpt-oss

If you're using Docker or Podman for containerized applications, RamaLama is a great option. It runs models in containers for enhanced isolation and security.

ramalama run gpt-oss

vLLM is suited for large-scale inference with concurrent users and repeated requests where caching is useful. A UC Berkeley project now with Red Hat as the main corporate contributor, it supports "any model, on any accelerator, on any cloud." It's also the top open source project by contributors for GitHub in 2025 (Figure 11).

Split graphic for vLLM. Left: Line graph showing GitHub stars growing from 0 to over 60,000 by late 2025. Right: Technical specs highlighting over 1,700 contributors and broad hardware support. — Figure 11: For scaling up model deployments, vLLM provides extensive model and hardware support across accelerators and model families.

Through a full application platform like Red Hat OpenShift and OpenShift AI, your containerized applications run alongside open models. This provides the observability, guardrails, and more that are needed for enterprise AI deployments.

Wrapping up

The power of open has provided a huge ecosystem of models for right-sized use cases and inference capabilities that power everything from Raspberry Pis to distributed Kubernetes environments. You've learned about models like Qwen, DeepSeek, gpt-oss, and platforms like LMArena and Artificial Analysis that make it easy to select the right one. Now, it's time for you to test them out, see what works for you, and control your AI narrative.

Red Hat Developer Sandbox

Programming languages & frameworks

System design & architecture

Developer experience

Automated data processing

Platform engineering

Secure development & architectures

E-books

Cheat sheets

Documentation

The state of open source AI models in 2025

Leading to 2025: The pre-DeepSeek landscape

The highest-performing open models

Qwen, Llama, and Gemma

Frontier models for RAG, agents, and AI-assisted coding

Kimi: For tool calling and AI-assisted coding

OpenAI's gpt-oss: A (surprise) high performing open model

Small models for consumer devices and the edge

Real-world use cases for open models

How to run these models on your own hardware

Wrapping up

Agent Skills: Explore security threats and controls

How to run Slurm workloads on OpenShift with Slinky operator

Effortless Red Hat Enterprise Linux virtual machines with Libvirt and Kickstart

5 steps to triage vLLM performance

Automate AI agents with the Responses API in Llama Stack

Platforms

Build

Quicklinks

Communicate

RED HAT DEVELOPER

Red Hat legal and privacy links

Red Hat legal and privacy links

Report a website issue