Allen Institute for AI · April 2026

OlmPool: How Minor Architecture Choices
Crack Long-Context Performance

A controlled suite of 26 comparable 7B models trained with 170,000 H100 hours, showing that combining just 3 architectural choices can drop long-context performance by up to 47% — and standard benchmarks won't warn you.

26 Controlled Models
170K H100 GPU Hours
47% Max Performance Drop
38 Checkpoints / Model

"Cracks in the Foundation: Seemingly Minor Architectural Choices Impact Long Context Extension"
Bertsch, Soldaini, Gormley, Neubig, Hajishirzi, Lo, Groeneveld · Allen Institute for AI / CMU

Why OlmPool Matters

Most LLMs are pretrained on short text, then extended to handle long contexts via a midtraining phase. But the base architecture is locked in before anyone tests long-context behavior. OlmPool is the first large-scale controlled study to isolate exactly which architectural decisions make that extension succeed or fail.

🔬

Controlled Ablations

Data, tokenizer, and extension recipe are held constant. Only the architecture varies — making causal attribution possible for the first time.

⚠️

Short-Context Metrics Fail

Training loss, perplexity, and 16 standard benchmarks all fail to predict long-context performance. You can't see the problem coming.

🏗️

Architecture Is the Culprit

Llama 3's long-context strength is architectural, not data-driven. The same extension recipe applied to OLMo 3 or Qwen 3 performs significantly worse.

📦

Fully Open Release

All 26 models with 38 checkpoints each are released on HuggingFace, amortizing the 170,000 H100-hour training cost across the research community.

Video Explainer

Vinh Nguyen's 41-minute deep-dive podcast walkthrough of the OlmPool paper — covering the key findings, methodology, and implications for LLM architecture design.

📺 Paper Walkthrough

Cracks in the Foundation — OlmPool Deep Dive

A thorough walkthrough of the paper's methodology, the four architectural choices, and what the results mean for practitioners building long-context LLMs.

  • Why short-context metrics fail to predict long-context performance
  • How QK norm suppresses attention sinks
  • The SWA + GQA interaction effect
  • What Llama 3's architecture gets right
  • Practical implications for model architecture decisions
Watch on YouTube ↗

Key Research Findings

01

Short-Context Metrics Cannot Predict Long-Context Performance

HELMET 32K scores range from 29.9 to 56.4 across OlmPool models — a 26.5-point spread. Yet standard pretraining metrics show almost no correlation with this outcome:

  • Training loss (pretraining): R² = 0.29
  • Training loss (context extension): R² = 0.06
  • 16 downstream BPB benchmarks: R² = 0.17
  • HELMET 8K (pre-extension): R² = 0.32
  • Feature count (# of harmful choices): R² = 0.67

The single best predictor is simply counting how many of the four harmful architectural choices are present.
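
As a concrete illustration of how such a fit works, here is a minimal sketch that regresses HELMET 32K scores against the count of harmful architectural features; the per-model numbers below are placeholders, not values from the paper.

# Illustrative only: regress HELMET 32K scores on the number of harmful
# architectural choices present in each model. The data points are
# made-up placeholders, not the paper's results.
from scipy.stats import linregress

# (harmful-feature count, HELMET 32K score) per model
models = [
    (0, 56.4), (0, 54.0), (1, 50.3), (1, 48.7),
    (2, 44.2), (2, 41.9), (3, 35.0), (3, 29.9),
]
feature_counts = [n for n, _ in models]
helmet_32k = [score for _, score in models]

fit = linregress(feature_counts, helmet_32k)
print(f"slope = {fit.slope:.2f} points per harmful feature")
print(f"R^2   = {fit.rvalue ** 2:.2f}")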

02

Effects Compound — Up to 47% Drop

Each feature alone causes modest degradation. Combined, they are catastrophic:

  • Pretraining ctx 4K vs 8K: −1 to −2 pts
  • SWA alone: −1.1 pts
  • QK Norm (OLMo arch): −6 pts
  • SWA + GQA combined: −9 pts avg
  • GQA + SWA + headwise QK Norm: −26.5 pts

03

Llama 3's Strength Is Architectural

It was previously unclear whether Llama 3's long-context extensibility came from its architecture or its (undisclosed) training data. OlmPool confirms: it's the architecture.

The model built on Llama 3's architecture is one of the best in the design space. The same extension recipe applied to the OLMo 3 and Qwen 3 architectures performs significantly worse, even with identical data and training duration.

04

More Data Cannot Bridge the Gap

Three representative models (Llama 3 arch, OLMo 3 arch, worst arch) were extended with 1B, 10B, and 50B tokens.

Even after 50B tokens of context extension, the worst architecture does not reach the performance that the Llama 3 architecture achieves after just 1B tokens. The architectural gap is fundamental, not a matter of training budget.

05

Attention Sinks Are a Positive Signal

Models without QK norm develop strong attention sinks — early tokens that absorb excess attention weight. While often considered a negative property, in OlmPool these sinks correlate with better long-context performance (R² = 0.38).

QK norm suppresses attention sinks and increases entropy, making it harder for the model to attend selectively over long contexts.
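
One way to probe attention sinks yourself is to measure how much attention mass lands on the first token. The sketch below is illustrative, not the paper's measurement protocol: it assumes the checkpoint supports output_attentions, uses a placeholder prompt, and omits the revision argument for brevity (see Quick Start).

# Illustrative sketch: estimate attention-sink strength as the average
# attention mass placed on the first token, across heads and layers.
# Not the paper's protocol; assumes the model supports output_attentions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "allenai/K_post_HQK_8kv_12k"  # any OlmPool model
tok = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    attn_implementation="eager",  # needed so attention weights are returned
)

inputs = tok("A long passage of text. " * 200, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions: one (batch, heads, query, key) tensor per layer
sink_mass = torch.stack([layer[..., 0].float().mean() for layer in out.attentions])
print(f"mean attention mass on the first token: {sink_mass.mean():.3f}")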

06

Detectable Early in Pretraining

Architectural differences in long-context behavior are detectable from context extension runs as early as 70B tokens into pretraining — well before full-scale training completes.

This opens a path to cheaper architectural validation: run a short context extension early in training to screen for long-context viability before committing to full pretraining.

The 4 Architectural Choices

All four choices are used by at least one of OLMo, Llama, or Qwen model families. Each has a legitimate engineering justification — but their combination is toxic for long-context extension.

1

QK Normalization

Used by: OLMo 2/3 (layerwise), Qwen 3 / Gemma 3 (headwise)

QK norm applies RMS normalization to query and key matrices before computing attention scores. It improves training stability — but at a cost to long-context performance.

  • Layerwise (OLMo 2/3): one γ per layer — used in OLMo 2 and OLMo 3
  • Headwise (Qwen 3, Gemma 3): one γ per attention head — slightly worse than layerwise
  • Removing QK norm from OLMo 3 architecture: +6 pts HELMET 32K
  • Adding QK norm to Llama 3 architecture: −3.8 pts HELMET 32K
✓ Training stability ✗ Long-context performance
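
For intuition, here is a minimal sketch of where QK norm sits in the attention computation, using a single learned γ shared across the layer (the layerwise variant). The shapes and RMSNorm placement follow a common implementation and are an assumption, not OlmPool's exact code.

# Minimal sketch of layerwise QK norm: RMS-normalize queries and keys
# before computing attention scores. Shapes and placement are assumptions
# based on common implementations, not OlmPool's exact code.
import torch

def rms_norm(x, gamma, eps=1e-6):
    # x: (batch, heads, seq, head_dim); gamma: (head_dim,), shared layerwise
    return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps) * gamma

def attention_weights(q, k, gamma_q, gamma_k, use_qk_norm=True):
    if use_qk_norm:
        q, k = rms_norm(q, gamma_q), rms_norm(k, gamma_k)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1)

The headwise variant used by Qwen 3 and Gemma 3 would instead learn a separate γ per attention head rather than one shared per layer.
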
2

Grouped-Query Attention (GQA)

Used by: Llama 3 (8 KV heads), Qwen 3, Gemma 3

GQA reduces the KV cache size by sharing key-value matrices across multiple query heads. The standard Llama 3 configuration uses 8 KV heads for 32 query heads.

  • More KV heads = better long-context performance (monotonically)
  • 32 KV heads (full MHA) outperforms 8 KV heads on HELMET and RULER
  • 4 KV heads is worse than 8 KV heads
  • GQA alone: modest degradation; GQA + SWA: −9 pts avg
✓ Inference efficiency, smaller KV cache ✗ Reduced attention expressivity
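
As a rough illustration of the mechanism, this sketch expands 8 KV heads to cover 32 query heads by repeating each KV head across a group of 4 query heads, the standard GQA trick; all sizes are placeholders, not OlmPool's exact configuration.

# Illustrative GQA sketch: 32 query heads share 8 KV heads by repeating
# each KV head across a group of 4 query heads. Sizes are placeholders.
import torch

batch, seq, head_dim = 1, 16, 64
n_q_heads, n_kv_heads = 32, 8
group = n_q_heads // n_kv_heads  # 4 query heads per KV head

q = torch.randn(batch, n_q_heads, seq, head_dim)
k = torch.randn(batch, n_kv_heads, seq, head_dim)
v = torch.randn(batch, n_kv_heads, seq, head_dim)

# Repeat each KV head so every query head has a matching key/value head
k = k.repeat_interleave(group, dim=1)  # -> (batch, 32, seq, head_dim)
v = v.repeat_interleave(group, dim=1)

weights = torch.softmax(q @ k.transpose(-2, -1) / head_dim ** 0.5, dim=-1)
out = weights @ v  # (batch, 32, seq, head_dim)
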
3

Sliding Window Attention (SWA)

Used by: OLMo 3, Gemma 3

SWA intersperses local-window attention layers with full-attention layers. OlmPool uses the OLMo 3 configuration: 3 local layers (4096-token window) per 1 global layer (sketched below).

  • SWA alone: −1.1 pts (modest)
  • SWA + GQA: −9 pts average (severe interaction)
  • SWA models show less attention sink behavior on full-attention layers
  • Restricts model to 4K context at 75% of layers during pretraining
✓ Pretraining efficiency ✗ Compounds badly with GQA
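
To make the layer layout concrete, this sketch builds a causal sliding-window mask and interleaves three local layers per global layer, following the OLMo 3 configuration described above. The exact ordering of local and global layers within each block of four is an assumption for illustration.

# Illustrative sketch: causal sliding-window mask (window = 4096) and a
# 3-local : 1-global layer pattern. Ordering within each block of four
# layers is an assumption, not OlmPool's exact schedule.
import torch

def causal_mask(seq_len):
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

def sliding_window_mask(seq_len, window=4096):
    pos = torch.arange(seq_len)
    dist = pos[:, None] - pos[None, :]      # query index minus key index
    return (dist >= 0) & (dist < window)    # causal and within the window

def mask_for_layer(layer_idx, seq_len, window=4096):
    # Layers 0, 1, 2 use the local window; layer 3 is global; then repeat.
    if layer_idx % 4 == 3:
        return causal_mask(seq_len)
    return sliding_window_mask(seq_len, window)
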
4

Pretraining Context Length

4K (Llama 2, OLMo 1) vs 8K (Llama 3, OLMo 3)

The context length used during pretraining sets a ceiling on what the model can learn about long-range dependencies before context extension begins.

  • 4K vs 8K pretraining: −1 to −2 pts on HELMET 32K
  • 4K + SWA = model sees only 4K context at all layers during pretraining
  • Longer pretraining context enables better post-extension performance
✓ Shorter = faster pretraining ✗ Limits long-context ceiling

Architecture Comparison: Llama 3 vs OLMo 3 vs Qwen 3

How the three major model families differ on the four critical choices

  • QK Normalization: Llama 3 none · OLMo 3 layerwise · Qwen 3 headwise. Impact: none = best, headwise = worst (up to −6 pts HELMET 32K)
  • GQA (KV heads): Llama 3 8 · OLMo 3 32 · Qwen 3 8. Impact: more KV heads = better (32 > 16 > 8 > 4)
  • Sliding Window Attention: Llama 3 none · OLMo 3 3/4 layers local · Qwen 3 none. Impact: SWA alone −1.1 pts; SWA + GQA −9 pts avg
  • Pretraining Context: Llama 3 8K · OLMo 3 8K · Qwen 3 8K. Impact: 8K > 4K (−1 to −2 pts for 4K)

OlmPool HELMET 32K with the same data and extension recipe: Init G equiv. ~52.4 · Init H, SWA+LQK ~47.5 · SWA+LQK+8kv ~38.8

* OlmPool holds data and training recipe constant — differences are purely architectural.

All 26 OlmPool Models

Complete results from Appendix A of the paper. Sorted by HELMET 32K score (worst to best). All models are 7B parameters, pretrained on 140B tokens, then extended with 10B tokens to 64K context.

Legend:
  • SWA — Sliding Window Attention (3 local layers per 4)
  • QKNorm — ✓ layerwise · hw headwise · × none
  • Norm — post-sublayer norm vs prenorm
  • Init — weight initialization code (A–K)
  • KV — KV heads (32 = full MHA)
  • Ctx — pretraining context length

Table columns: SWA · QKNorm · Norm · fp8 · Init · KV · Ctx · Params · H-8K · H-16K · H-32K ↑ · H-64K · LongPPL ↓ · R-32K ↑ · HuggingFace

Source: Table 1 & Table 2, Appendix A, "Cracks in the Foundation" (Bertsch et al., 2026)

Quick Start

OlmPool models are hosted on HuggingFace. Each model has 38 checkpoints covering the full pretraining and context extension process. You must specify a revision and set trust_remote_code=True.

Load a Model

# Install dependencies
pip install transformers torch

# Load any OlmPool model
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "allenai/K_post_HQK_8kv_12k"

# revision = specific checkpoint (see HF Files tab)
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    revision="<checkpoint_revision>",
    trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    revision="<checkpoint_revision>",
    trust_remote_code=True
)
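
To iterate over checkpoints programmatically rather than reading the Files tab by hand, huggingface_hub can list a repository's refs. The snippet below is a sketch and assumes the 38 checkpoints are exposed as branch-style revisions; confirm the actual revision names on each model's HuggingFace page.

# Sketch: list available checkpoint revisions for an OlmPool model.
# Assumes checkpoints are exposed as branches; verify on the HF Files tab.
from huggingface_hub import list_repo_refs

refs = list_repo_refs("allenai/K_post_HQK_8kv_12k")
for branch in refs.branches:
    print(branch.name)  # pass one of these as revision= when loading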

Choosing a Model

🏆
Best long-context performance
Init J — prenorm, no QK norm, 16 KV heads, no SWA
HELMET 32K: 56.4 · RULER 32K: 67.7
🦙
Llama 3 architecture equivalent
prenorm, no QK norm, 8 KV heads, no SWA (Init G, 8K ctx)
🔬
Worst architecture (for research)
Init A — SWA + headwise QK norm + fp8 + 8 KV heads
HELMET 32K: 29.9
📍
38 checkpoints per model
Use the pre models for pre-extension checkpoints and the post models for the final extended model.

Community & Citation

Cite This Work

@article{bertsch2026olmpool,
  title   = {Cracks in the Foundation: Seemingly
             Minor Architectural Choices Impact
             Long Context Extension},
  author  = {Bertsch, Amanda and
             Soldaini, Luca and
             Gormley, Matthew R. and
             Neubig, Graham and
             Hajishirzi, Hannaneh and
             Lo, Kyle and
             Groeneveld, Dirk},
  year    = {2026},
  url     = {https://allenai.org/papers/olmpool},
  institution = {Allen Institute for AI / CMU}
}

Authors

Amanda Bertsch CMU · Now at Google DeepMind
Luca Soldaini Allen Institute for AI
Matthew R. Gormley CMU
Graham Neubig CMU · Now at Google DeepMind
Hannaneh Hajishirzi UW · Allen Institute for AI
Kyle Lo Allen Institute for AI
Dirk Groeneveld Allen Institute for AI

FAQ

What is OlmPool?

OlmPool is a suite of 26 comparable 7B language models released by Allen Institute for AI (Ai2) in April 2026. Each model is trained with identical data and optimization settings, varying only in four architectural choices: QK normalization, grouped-query attention (GQA), sliding window attention (SWA), and pretraining context length. The goal is to isolate how these choices affect long-context extensibility.

What are the four architectural choices studied?

The four choices are: (1) QK Normalization — layerwise or headwise RMS norm applied to queries and keys; (2) Grouped-Query Attention (GQA) — sharing KV matrices across query heads to reduce cache size; (3) Sliding Window Attention (SWA) — using local-window attention in most layers; (4) Pretraining Context Length — 4K vs 8K tokens during pretraining.

Why does combining these choices hurt so much?

Each choice individually reduces the expressivity of the attention mechanism. GQA reduces the number of independent attention patterns. SWA restricts the context window at most layers. QK norm suppresses attention sinks and increases entropy. When combined, these effects compound non-linearly — the worst configuration (GQA + SWA + headwise QK norm) drops HELMET 32K by 26.5 points.

Why can't I use training loss to predict long-context performance?

Training loss correlates only weakly with long-context performance (R² = 0.29 for pretraining loss, R² = 0.06 for context extension loss). All 26 OlmPool models have very similar short-context metrics — the architectural differences only manifest after context extension. This is why the problem is so insidious: you can't see it coming from standard monitoring.

Is Llama 3 the best architecture for long context?

Llama 3 is one of the best architectures in the OlmPool design space, but not the absolute best. The top-performing model (Init J: prenorm, no QK norm, 16 KV heads, no SWA) outperforms the Llama 3 architecture equivalent. OlmPool confirms that Llama 3's long-context strength is architectural (not data-driven), and identifies configurations that surpass it.

How do I load OlmPool models?

Use the HuggingFace transformers library with trust_remote_code=True and a specific revision parameter pointing to one of the 38 available checkpoints. See the Quick Start section above for a code example.

What is the "Init" letter in the model names?

The Init letter (A–K) is a weight initialization code, not an architecture name. It identifies which initialization scheme was used for that run. Multiple models can share the same Init letter while differing in other architectural choices.