Allen Institute for AI · April 2026

OlmPool: How Minor Architecture Choices
Crack Long-Context Performance

A controlled suite of 26 comparable 7B models trained with 170,000 H100 hours, showing that combining just 3 architectural choices can drop long-context performance by up to 47% — and standard benchmarks won't warn you.

26 Controlled Models
170K H100 GPU Hours
47% Max Performance Drop
38 Checkpoints / Model

"Cracks in the Foundation: Seemingly Minor Architectural Choices Impact Long Context Extension"
Bertsch, Soldaini, Gormley, Neubig, Hajishirzi, Lo, Groeneveld · Allen Institute for AI / CMU

Why OlmPool Matters

Most LLMs are pretrained on short text, then extended to handle long contexts via a midtraining phase. But the base architecture is locked in before anyone tests long-context behavior. OlmPool is the first large-scale controlled study to isolate exactly which architectural decisions make that extension succeed or fail.

🔬

Controlled Ablations

Data, tokenizer, and extension recipe are held constant. Only the architecture varies — making causal attribution possible for the first time.

⚠️

Short-Context Metrics Fail

Training loss, perplexity, and 16 standard benchmarks all fail to predict long-context performance. You can't see the problem coming.

🏗️

Architecture Is the Culprit

Llama 3's long-context strength is architectural, not data-driven. The same extension recipe applied to OLMo 3 or Qwen 3 performs significantly worse.

📦

Fully Open Release

All 26 models with 38 checkpoints each are released on HuggingFace, amortizing the 170,000 H100-hour training cost across the research community.

Video Explainer

Vinh Nguyen's 41-minute deep-dive podcast walkthrough of the OlmPool paper — covering the key findings, methodology, and implications for LLM architecture design.

📺 Paper Walkthrough

Cracks in the Foundation — OlmPool Deep Dive

A thorough walkthrough of the paper's methodology, the four architectural choices, and what the results mean for practitioners building long-context LLMs.

  • Why short-context metrics fail to predict long-context performance
  • How QK norm suppresses attention sinks
  • The SWA + GQA interaction effect
  • What Llama 3's architecture gets right
  • Practical implications for model architecture decisions
Watch on YouTube ↗

Key Research Findings

01

Short-Context Metrics Cannot Predict Long-Context Performance

HELMET 32K scores range from 29.9 to 56.4 across OlmPool models — a 26.5-point spread. Yet standard pretraining metrics show almost no correlation with this outcome:

  • Training loss (pretraining): R² = 0.29
  • Training loss (context extension): R² = 0.06
  • 16 downstream BPB benchmarks: R² = 0.17
  • HELMET 8K (pre-extension): R² = 0.32
  • Feature count (# of harmful choices): R² = 0.67

The single best predictor is simply counting how many of the four harmful architectural choices are present.
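
As a concrete illustration of how such a fit works, here is a minimal sketch that regresses HELMET 32K scores against the count of harmful architectural features; the per-model numbers below are placeholders, not values from the paper.

# Illustrative only: regress HELMET 32K scores on the number of harmful
# architectural choices present in each model. The data points are
# made-up placeholders, not the paper's results.
from scipy.stats import linregress

# (harmful-feature count, HELMET 32K score) per model
models = [
    (0, 56.4), (0, 54.0), (1, 50.3), (1, 48.7),
    (2, 44.2), (2, 41.9), (3, 35.0), (3, 29.9),
]
feature_counts = [n for n, _ in models]
helmet_32k = [score for _, score in models]

fit = linregress(feature_counts, helmet_32k)
print(f"slope = {fit.slope:.2f} points per harmful feature")
print(f"R^2   = {fit.rvalue ** 2:.2f}")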

02

Effects Compound — Up to 47% Drop

Each feature alone causes modest degradation. Combined, they are catastrophic:

  • Pretraining ctx 4K vs 8K: −1 to −2 pts
  • SWA alone: −1.1 pts
  • QK Norm (OLMo arch): −6 pts
  • SWA + GQA combined: −9 pts avg
  • GQA + SWA + headwise QK Norm: −26.5 pts

03

Llama 3's Strength Is Architectural

It was previously unclear whether Llama 3's long-context extensibility came from its architecture or its (undisclosed) training data. OlmPool confirms: it's the architecture.

The model built on Llama 3's architecture is one of the best in the design space. The same extension recipe applied to the OLMo 3 and Qwen 3 architectures performs significantly worse, even with identical data and training duration.

04

More Data Cannot Bridge the Gap

Three representative models (Llama 3 arch, OLMo 3 arch, worst arch) were extended with 1B, 10B, and 50B tokens.

Even after 50B tokens of context extension, the worst architecture does not reach the performance that the Llama 3 architecture achieves after just 1B tokens. The architectural gap is fundamental, not a matter of training budget.

05

Attention Sinks Are a Positive Signal

Models without QK norm develop strong attention sinks — early tokens that absorb excess attention weight. While often considered a negative property, in OlmPool these sinks correlate with better long-context performance (R² = 0.38).

QK norm suppresses attention sinks and increases entropy, making it harder for the model to attend selectively over long contexts.
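
One way to probe attention sinks yourself is to measure how much attention mass lands on the first token. The sketch below is illustrative, not the paper's measurement protocol: it assumes the checkpoint supports output_attentions, uses a placeholder prompt, and omits the revision argument for brevity (see Quick Start).

# Illustrative sketch: estimate attention-sink strength as the average
# attention mass placed on the first token, across heads and layers.
# Not the paper's protocol; assumes the model supports output_attentions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "allenai/K_post_HQK_8kv_12k"  # any OlmPool model
tok = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    attn_implementation="eager",  # needed so attention weights are returned
)

inputs = tok("A long passage of text. " * 200, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions: one (batch, heads, query, key) tensor per layer
sink_mass = torch.stack([layer[..., 0].float().mean() for layer in out.attentions])
print(f"mean attention mass on the first token: {sink_mass.mean():.3f}")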

06

Detectable Early in Pretraining

Architectural differences in long-context behavior are detectable from context extension runs as early as 70B tokens into pretraining — well before full-scale training completes.

This opens a path to cheaper architectural validation: run a short context extension early in training to screen for long-context viability before committing to full pretraining.

The 4 Architectural Choices

All four choices are used by at least one of OLMo, Llama, or Qwen model families. Each has a legitimate engineering justification — but their combination is toxic for long-context extension.

1

QK Normalization

Used by: OLMo 2/3 (layerwise), Qwen 3 / Gemma 3 (headwise)

QK norm applies RMS normalization to query and key matrices before computing attention scores. It improves training stability — but at a cost to long-context performance.

  • Layerwise (OLMo 2/3): one γ per layer — used in OLMo 2 and OLMo 3
  • Headwise (Qwen 3, Gemma 3): one γ per attention head — slightly worse than layerwise
  • Removing QK norm from OLMo 3 architecture: +6 pts HELMET 32K
  • Adding QK norm to Llama 3 architecture: −3.8 pts HELMET 32K
✓ Training stability ✗ Long-context performance
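
For intuition, here is a minimal sketch of where QK norm sits in the attention computation, using a single learned γ shared across the layer (the layerwise variant). The shapes and RMSNorm placement follow a common implementation and are an assumption, not OlmPool's exact code.

# Minimal sketch of layerwise QK norm: RMS-normalize queries and keys
# before computing attention scores. Shapes and placement are assumptions
# based on common implementations, not OlmPool's exact code.
import torch

def rms_norm(x, gamma, eps=1e-6):
    # x: (batch, heads, seq, head_dim); gamma: (head_dim,), shared layerwise
    return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps) * gamma

def attention_weights(q, k, gamma_q, gamma_k, use_qk_norm=True):
    if use_qk_norm:
        q, k = rms_norm(q, gamma_q), rms_norm(k, gamma_k)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1)

The headwise variant used by Qwen 3 and Gemma 3 would instead learn a separate γ per attention head rather than one shared per layer.
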
2

Grouped-Query Attention (GQA)

Used by: Llama 3 (8 KV heads), Qwen 3, Gemma 3

GQA reduces the KV cache size by sharing key-value matrices across multiple query heads. The standard Llama 3 configuration uses 8 KV heads for 32 query heads.

  • More KV heads = better long-context performance (monotonically)
  • 32 KV heads (full MHA) outperforms 8 KV heads on HELMET and RULER
  • 4 KV heads is worse than 8 KV heads
  • GQA alone: modest degradation; GQA + SWA: −9 pts avg
✓ Inference efficiency, smaller KV cache ✗ Reduced attention expressivity
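
As a rough illustration of the mechanism, this sketch expands 8 KV heads to cover 32 query heads by repeating each KV head across a group of 4 query heads, the standard GQA trick; all sizes are placeholders, not OlmPool's exact configuration.

# Illustrative GQA sketch: 32 query heads share 8 KV heads by repeating
# each KV head across a group of 4 query heads. Sizes are placeholders.
import torch

batch, seq, head_dim = 1, 16, 64
n_q_heads, n_kv_heads = 32, 8
group = n_q_heads // n_kv_heads  # 4 query heads per KV head

q = torch.randn(batch, n_q_heads, seq, head_dim)
k = torch.randn(batch, n_kv_heads, seq, head_dim)
v = torch.randn(batch, n_kv_heads, seq, head_dim)

# Repeat each KV head so every query head has a matching key/value head
k = k.repeat_interleave(group, dim=1)  # -> (batch, 32, seq, head_dim)
v = v.repeat_interleave(group, dim=1)

weights = torch.softmax(q @ k.transpose(-2, -1) / head_dim ** 0.5, dim=-1)
out = weights @ v  # (batch, 32, seq, head_dim)
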
3

Sliding Window Attention (SWA)

Used by: OLMo 3, Gemma 3

SWA intersperses local-window attention layers with full-attention layers. OlmPool uses the OLMo 3 configuration: 3 local layers (4096-token window) per 1 global layer (sketched below).

  • SWA alone: −1.1 pts (modest)
  • SWA + GQA: −9 pts average (severe interaction)
  • SWA models show less attention sink behavior on full-attention layers
  • Restricts model to 4K context at 75% of layers during pretraining
✓ Pretraining efficiency ✗ Compounds badly with GQA
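
To make the layer layout concrete, this sketch builds a causal sliding-window mask and interleaves three local layers per global layer, following the OLMo 3 configuration described above. The exact ordering of local and global layers within each block of four is an assumption for illustration.

# Illustrative sketch: causal sliding-window mask (window = 4096) and a
# 3-local : 1-global layer pattern. Ordering within each block of four
# layers is an assumption, not OlmPool's exact schedule.
import torch

def causal_mask(seq_len):
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

def sliding_window_mask(seq_len, window=4096):
    pos = torch.arange(seq_len)
    dist = pos[:, None] - pos[None, :]      # query index minus key index
    return (dist >= 0) & (dist < window)    # causal and within the window

def mask_for_layer(layer_idx, seq_len, window=4096):
    # Layers 0, 1, 2 use the local window; layer 3 is global; then repeat.
    if layer_idx % 4 == 3:
        return causal_mask(seq_len)
    return sliding_window_mask(seq_len, window)
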
4

Pretraining Context Length

4K (Llama 2, OLMo 1) vs 8K (Llama 3, OLMo 3)

The context length used during pretraining sets a ceiling on what the model can learn about long-range dependencies before context extension begins.

  • 4K vs 8K pretraining: −1 to −2 pts on HELMET 32K
  • 4K + SWA = model sees only 4K context at all layers during pretraining
  • Longer pretraining context enables better post-extension performance
✓ Shorter = faster pretraining ✗ Limits long-context ceiling

Architecture Comparison: Llama 3 vs OLMo 3 vs Qwen 3

How the three major model families differ on the four critical choices

  • QK Normalization: Llama 3 none · OLMo 3 layerwise · Qwen 3 headwise. Impact: none = best, headwise = worst (up to −6 pts HELMET 32K)
  • GQA (KV heads): Llama 3 8 · OLMo 3 32 · Qwen 3 8. Impact: more KV heads = better (32 > 16 > 8 > 4)
  • Sliding Window Attention: Llama 3 none · OLMo 3 3/4 layers local · Qwen 3 none. Impact: SWA alone −1.1 pts; SWA + GQA −9 pts avg
  • Pretraining Context: Llama 3 8K · OLMo 3 8K · Qwen 3 8K. Impact: 8K > 4K (−1 to −2 pts for 4K)

OlmPool HELMET 32K with the same data and extension recipe: Init G equiv. ~52.4 · Init H, SWA+LQK ~47.5 · SWA+LQK+8kv ~38.8

* OlmPool holds data and training recipe constant — differences are purely architectural.

All 26 OlmPool Models

Complete results from Appendix A of the paper. Sorted by HELMET 32K score (worst to best). All models are 7B parameters, pretrained on 140B tokens, then extended with 10B tokens to 64K context.

Legend:
  • SWA — Sliding Window Attention (3 local layers per 4)
  • QKNorm — ✓ layerwise · hw headwise · × none
  • Norm — post-sublayer norm vs prenorm
  • Init — weight initialization code (A–K)
  • KV — KV heads (32 = full MHA)
  • Ctx — pretraining context length

Table columns: SWA · QKNorm · Norm · fp8 · Init · KV · Ctx · Params · H-8K · H-16K · H-32K ↑ · H-64K · LongPPL ↓ · R-32K ↑ · HuggingFace

Source: Table 1 & Table 2, Appendix A, "Cracks in the Foundation" (Bertsch et al., 2026)

Quick Start

OlmPool models are hosted on HuggingFace. Each model has 38 checkpoints covering the full pretraining and context extension process. You must specify a revision and set trust_remote_code=True.

Load a Model

# Install dependencies
pip install transformers torch

# Load any OlmPool model
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "allenai/K_post_HQK_8kv_12k"

# revision = specific checkpoint (see HF Files tab)
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    revision="<checkpoint_revision>",
    trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    revision="<checkpoint_revision>",
    trust_remote_code=True
)
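
To iterate over checkpoints programmatically rather than reading the Files tab by hand, huggingface_hub can list a repository's refs. The snippet below is a sketch and assumes the 38 checkpoints are exposed as branch-style revisions; confirm the actual revision names on each model's HuggingFace page.

# Sketch: list available checkpoint revisions for an OlmPool model.
# Assumes checkpoints are exposed as branches; verify on the HF Files tab.
from huggingface_hub import list_repo_refs

refs = list_repo_refs("allenai/K_post_HQK_8kv_12k")
for branch in refs.branches:
    print(branch.name)  # pass one of these as revision= when loading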

Choosing a Model

🏆
Best long-context performance
Init J — prenorm, no QK norm, 16 KV heads, no SWA
HELMET 32K: 56.4 · RULER 32K: 67.7
🦙
Llama 3 architecture equivalent
prenorm, no QK norm, 8 KV heads, no SWA (Init G, 8K ctx)
🔬
Worst architecture (for research)
Init A — SWA + headwise QK norm + fp8 + 8 KV heads
HELMET 32K: 29.9
📍
38 checkpoints per model
Use the pre models for pre-extension checkpoints and the post models for the final extended model.

Community & Citation

Cite This Work

@article{bertsch2026olmpool,
  title   = {Cracks in the Foundation: Seemingly
             Minor Architectural Choices Impact
             Long Context Extension},
  author  = {Bertsch, Amanda and
             Soldaini, Luca and
             Gormley, Matthew R. and
             Neubig, Graham and
             Hajishirzi, Hannaneh and
             Lo, Kyle and
             Groeneveld, Dirk},
  year    = {2026},
  url     = {https://allenai.org/papers/olmpool},
  institution = {Allen Institute for AI / CMU}
}

Authors

Amanda Bertsch CMU · Now at Google DeepMind
Luca Soldaini Allen Institute for AI
Matthew R. Gormley CMU
Graham Neubig CMU · Now at Google DeepMind
Hannaneh Hajishirzi UW · Allen Institute for AI
Kyle Lo Allen Institute for AI
Dirk Groeneveld Allen Institute for AI

FAQ

What is OlmPool?

OlmPool is a suite of 26 comparable 7B language models released by Allen Institute for AI (Ai2) in April 2026. Each model is trained with identical data and optimization settings, varying only in four architectural choices: QK normalization, grouped-query attention (GQA), sliding window attention (SWA), and pretraining context length. The goal is to isolate how these choices affect long-context extensibility.

What are the four architectural choices studied?

The four choices are: (1) QK Normalization — layerwise or headwise RMS norm applied to queries and keys; (2) Grouped-Query Attention (GQA) — sharing KV matrices across query heads to reduce cache size; (3) Sliding Window Attention (SWA) — using local-window attention in most layers; (4) Pretraining Context Length — 4K vs 8K tokens during pretraining.

Why does combining these choices hurt so much?

Each choice individually reduces the expressivity of the attention mechanism. GQA reduces the number of independent attention patterns. SWA restricts the context window at most layers. QK norm suppresses attention sinks and increases entropy. When combined, these effects compound non-linearly — the worst configuration (GQA + SWA + headwise QK norm) drops HELMET 32K by 26.5 points.

Why can't I use training loss to predict long-context performance?

Training loss correlates only weakly with long-context performance (R² = 0.29 for pretraining loss, R² = 0.06 for context extension loss). All 26 OlmPool models have very similar short-context metrics — the architectural differences only manifest after context extension. This is why the problem is so insidious: you can't see it coming from standard monitoring.

Is Llama 3 the best architecture for long context?

Llama 3 is one of the best architectures in the OlmPool design space, but not the absolute best. The top-performing model (Init J: prenorm, no QK norm, 16 KV heads, no SWA) outperforms the Llama 3 architecture equivalent. OlmPool confirms that Llama 3's long-context strength is architectural (not data-driven), and identifies configurations that surpass it.

How do I load OlmPool models?

Use the HuggingFace transformers library with trust_remote_code=True and a specific revision parameter pointing to one of the 38 available checkpoints. See the Quick Start section above for a code example.

What is the "Init" letter in the model names?

The Init letter (A–K) is a weight initialization code, not an architecture name. It identifies which initialization scheme was used for that run. Multiple models can share the same Init letter while differing in other architectural choices.