Granite, LIMO, and small LLM reasoning

Lessons on reproducing R1-like reasoning in small LLMs without using DeepSeek-R1-Zero (or its derivatives)

This is the second update on our journey to reproduce R1-like reasoning in small LLMs. In case you missed it, catch up on the previous installments:

Today’s updates: More experiments, more insights

Yesterday, we ran two new experiments to push our small models even further.

Testing the LIMO dataset on Granite

Can a really small model develop reasoning abilities with just ~800 high-quality examples?

Unfortunately, this one didn’t pan out. Neither Llama nor Granite showed much improvement, even though this dataset significantly boosted Phi-4’s performance. The original paper demonstrated strong results on Qwen-32B, but based on our experiment, it’s clear that the effectiveness of this approach is very model-dependent.

In short: Qwen-32B is just a beast. It already has strong mathematical and reasoning abilities, so training on a relatively tiny dataset helps refine what’s already there. For smaller models? Not so much. (Guess there’s no such thing as a free lunch after all!)

Generating synthetic data using particle filtering on LIMO dataset questions

Could this enhance reasoning abilities?

This one was interesting! Using our particle filtering-based inference scaling method with 512 particles, Phi-4 solved about half of the ~800 LIMO problems.
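For readers who haven't seen particle filtering applied to generation, here is a minimal sketch of the general idea, not our actual implementation. It assumes each particle is a partial solution, that a hypothetical `extend` callable adds one reasoning step, and that a hypothetical `score` callable (standing in for a reward model) rates each partial solution. Particles are resampled in proportion to a softmax over their scores, so promising partial solutions get duplicated and weak ones die out.

```python
import math
import random


def resample_particles(particles, scores, temperature=1.0, rng=None):
    """Draw len(particles) survivors with probability softmax(scores / temperature)."""
    rng = rng or random.Random()
    # Subtract the max score before exponentiating for numerical stability.
    top = max(scores)
    weights = [math.exp((s - top) / temperature) for s in scores]
    return rng.choices(particles, weights=weights, k=len(particles))


def particle_filter(prompt, extend, score, n_particles=8, n_steps=4, rng=None):
    """Toy loop: extend every particle by one step, score it, then resample.

    `extend` and `score` are hypothetical stand-ins for a generation step
    and a reward model; they are assumptions made for this sketch.
    """
    rng = rng or random.Random()
    particles = [prompt] * n_particles
    for _ in range(n_steps):
        particles = [extend(p) for p in particles]
        scores = [score(p) for p in particles]
        particles = resample_particles(particles, scores, rng=rng)
    return particles
```

In this sketch the particle count (`n_particles`) is the knob that corresponds to the 512 particles mentioned above: more particles means more candidate solutions explored per problem, at a proportionally higher inference cost.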

Here’s what happened next:

We built a backtracking-based reasoning dataset using these filtered solutions and fine-tuned the same Phi-4 model that we used for generation.
Did it work? Nope. The model actually solved fewer AIME24 problems than the base model.
However, when we trained on only the correct solutions, the model preserved its base performance.
Comparing the LIMO dataset solutions with those from Phi-4, we found that LIMO solutions were 2–3 times longer.
Training on a 380-sample subset of this data slightly improved AIME24 performance, but only by solving one more question.
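The filtering and length comparison described above can be sketched roughly as follows. This is an illustration under stated assumptions rather than our actual pipeline: the record fields `predicted_answer` and `reference_answer` are hypothetical names, correctness is judged by exact string match on the final answer, and length is measured in whitespace-separated tokens as a simple proxy.

```python
from statistics import mean


def keep_correct(samples):
    """Keep only generated solutions whose final answer matches the reference.

    Each sample is a dict with hypothetical keys `predicted_answer`
    and `reference_answer` (assumed for this sketch).
    """
    return [
        s for s in samples
        if s["predicted_answer"].strip() == s["reference_answer"].strip()
    ]


def length_ratio(reference_solutions, generated_solutions):
    """Mean length of reference solutions divided by mean length of generated ones.

    Length is counted in whitespace-separated tokens, a rough proxy.
    """
    ref = mean(len(text.split()) for text in reference_solutions)
    gen = mean(len(text.split()) for text in generated_solutions)
    return ref / gen
```

A ratio between 2 and 3 from `length_ratio` would correspond to the observation that LIMO solutions were 2–3 times longer than Phi-4's.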

What’s next? New experiments underway

We finally killed off our older GRPO runs after running them for quite a few iterations. The reason? The reward had plateaued, and the trained models showed no further improvements on AIME24.

At this point, we're starting to wonder: Is AIME24 just too difficult for small models unless they’ve been trained with distilled data from larger reasoning models? We’ll keep using it for now, but we might switch to another benchmark later.

Today, we launched two new experiments:

GRPO on “But Wait” Phi checkpoint and LIMO questions: We’re testing whether the increased difficulty of the LIMO questions can trigger any reasoning sparks in our best “But Wait” Phi checkpoint, which already shows R1-style reflection and reasoning.
Introducing GRPO-Direct: Instead of our usual “generate synthetic data → SFT → GRPO” loop, we’re trying a direct approach:
1. Generate synthetic data inside GRPO itself.
2. Immediately train the model on it within the same loop.

We’re running this on the LIMO dataset, using a Phi checkpoint that has already been trained on synthetic data it generated from the 380 LIMO samples.
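As background for both experiments, the core of GRPO is a group-relative advantage: several completions are sampled for the same question, each gets a reward, and each reward is normalized against the group's mean and standard deviation. The sketch below shows only that normalization step, assuming a population standard deviation and a small `eps` for stability (implementations differ on both details). It is not the GRPO-Direct loop itself.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own group's mean and spread.

    `rewards` holds one scalar reward per completion sampled for the
    same question. `eps` (an assumption of this sketch) avoids division
    by zero when every completion receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Note that when every completion in a group gets the same reward, for example when none of them solves the problem, all advantages are zero and that group contributes no learning signal.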

Read the next part in the series here: On reasoning versus inference-time scaling

Last updated: May 15, 2025
