A controlled suite of 26 comparable 7B models trained with 170,000 H100 hours, showing that combining just 3 architectural choices can drop long-context performance by up to 47% — and standard benchmarks won't warn you.
"Cracks in the Foundation: Seemingly Minor Architectural Choices Impact Long Context Extension"
Bertsch, Soldaini, Gormley, Neubig, Hajishirzi, Lo, Groeneveld · Allen Institute for AI / CMU
Most LLMs are pretrained on short text, then extended to handle long contexts via a midtraining phase. But the base architecture is locked in before anyone tests long-context behavior. OlmPool is the first large-scale controlled study to isolate exactly which architectural decisions make that extension succeed or fail.
Data, tokenizer, and extension recipe are held constant. Only the architecture varies — making causal attribution possible for the first time.
Training loss, perplexity, and 16 standard benchmarks all fail to predict long-context performance. You can't see the problem coming.
Llama 3's long-context strength is architectural, not data-driven. The same extension recipe applied to OLMo 3 or Qwen 3 performs significantly worse.
All 26 models with 38 checkpoints each are released on HuggingFace, amortizing the 170,000 H100-hour training cost across the research community.
Vinh Nguyen's 41-minute deep-dive podcast walkthrough of the OlmPool paper — covering the key findings, methodology, and implications for LLM architecture design.
HELMET 32K scores range from 29.9 to 56.4 across OlmPool models — a 26.5-point spread. Yet standard pretraining metrics show almost no correlation with this outcome:
The single best predictor is simply counting how many of the four harmful architectural choices are present.
Each feature alone causes modest degradation. Combined, they are catastrophic.
It was previously unclear whether Llama 3's long-context extensibility came from its architecture or its (undisclosed) training data. OlmPool confirms: it's the architecture.
The Llama 3 architecture model is one of the best in the design space. The same extension recipe applied to OLMo 3 and Qwen 3 architectures performs significantly worse, even with identical data and training duration.
Three representative models (Llama 3 arch, OLMo 3 arch, worst arch) were extended with 1B, 10B, and 50B tokens.
Even after 50B tokens of context extension, the worst architecture does not reach the performance that the Llama 3 architecture achieves after just 1B tokens. The architectural gap is fundamental, not a matter of training budget.
Models without QK norm develop strong attention sinks — early tokens that absorb excess attention weight. While often considered a negative property, in OlmPool these sinks correlate with better long-context performance (R² = 0.38).
QK norm suppresses attention sinks and increases entropy, making it harder for the model to attend selectively over long contexts.
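Both statistics can be probed directly from a checkpoint's attention maps. Below is a minimal sketch (not the paper's measurement code) that computes the share of attention mass landing on the first token and the mean attention entropy; it assumes the model returns attention weights via output_attentions=True, which for many checkpoints requires loading with the eager attention implementation.

```python
import torch

@torch.no_grad()
def sink_and_entropy(model, tokenizer, text):
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    out = model(**inputs, output_attentions=True)
    sink_mass, entropy = [], []
    for attn in out.attentions:              # each layer: (batch, heads, seq, seq)
        probs = attn.float()
        # Share of each query's attention that lands on token 0 (the "sink").
        sink_mass.append(probs[..., 0].mean().item())
        # Shannon entropy of each query's attention distribution over keys.
        ent = -(probs * (probs + 1e-9).log()).sum(dim=-1)
        entropy.append(ent.mean().item())
    return sum(sink_mass) / len(sink_mass), sum(entropy) / len(entropy)
```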
Architectural differences in long-context behavior are detectable from context extension runs as early as 70B tokens into pretraining — well before full-scale training completes.
This opens a path to cheaper architectural validation: run a short context extension early in training to screen for long-context viability before committing to full pretraining.
All four choices appear in at least one of the OLMo, Llama, or Qwen model families. Each has a legitimate engineering justification, but their combination is toxic for long-context extension.
QK norm applies RMS normalization to query and key matrices before computing attention scores. It improves training stability — but at a cost to long-context performance.
GQA reduces the KV cache size by sharing key-value matrices across multiple query heads. The standard Llama 3 configuration uses 8 KV heads for 32 query heads.
SWA intersperses local-window attention layers with full-attention layers. OlmPool uses the OLMo 3 configuration: three local layers (4,096-token window) for every global layer.
The context length used during pretraining sets a ceiling on what the model can learn about long-range dependencies before context extension begins.
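To make the mechanics concrete, here is a minimal sketch of a single attention layer showing where three of these choices enter the computation. The shapes, the epsilon, and the parameter-free normalization are illustrative, not the OlmPool implementation.

```python
import torch
import torch.nn.functional as F

def attention(q, k, v, *, qk_norm=False, window=None):
    """q: (B, Hq, T, D); k, v: (B, Hkv, T, D) with Hkv <= Hq (GQA)."""
    B, Hq, T, D = q.shape
    Hkv = k.shape[1]
    if qk_norm:
        # QK norm: RMS-normalize queries and keys before the dot product
        # (parameter-free here; real layers learn a scale, per layer or per head).
        q = q * torch.rsqrt(q.pow(2).mean(dim=-1, keepdim=True) + 1e-6)
        k = k * torch.rsqrt(k.pow(2).mean(dim=-1, keepdim=True) + 1e-6)
    # GQA: each KV head is shared by Hq // Hkv query heads.
    k = k.repeat_interleave(Hq // Hkv, dim=1)
    v = v.repeat_interleave(Hq // Hkv, dim=1)
    scores = (q @ k.transpose(-2, -1)) / D ** 0.5
    # Causal mask; on an SWA layer, also restrict attention to the local window.
    i = torch.arange(T).unsqueeze(-1)
    j = torch.arange(T).unsqueeze(0)
    mask = j <= i
    if window is not None:
        mask &= (i - j) < window
    scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v
```

Calling this with window=4096 on three of every four layers mirrors the OLMo 3-style SWA schedule described above; qk_norm=False with 8 KV heads shared across 32 query heads corresponds to the Llama 3-style configuration.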
How the three major model families differ on the four critical choices
| Feature | Llama 3 | OLMo 3 | Qwen 3 | Impact on Long Context |
|---|---|---|---|---|
| QK Normalization | None | Layerwise | Headwise | None = best, headwise = worst (up to −6 pts HELMET 32K) |
| GQA (KV heads) | 8 KV heads | 32 KV heads | 8 KV heads | More KV heads = better (32 > 16 > 8 > 4) |
| Sliding Window Attn | None | 3/4 layers local | None | SWA alone: −1.1 pts; SWA + GQA: −9 pts avg |
| Pretraining Context | 8K | 8K | 8K | 8K > 4K (−1 to −2 pts for 4K) |
| OlmPool HELMET 32K | ~52.4 (Init G equiv.) | ~47.5 (Init H, SWA+LQK) | ~38.8 (SWA+LQK+8kv) | Same data, same extension recipe |
* OlmPool holds data and training recipe constant — differences are purely architectural.
Complete results from Appendix A of the paper. Sorted by HELMET 32K score (worst to best). All models are 7B parameters, pretrained on 140B tokens, then extended with 10B tokens to 64K context.
| SWA | QKNorm | Norm | fp8 | Init | KV | Ctx | Params | H-8K | H-16K | H-32K ↑ | H-64K | LongPPL ↓ | R-32K ↑ | HuggingFace |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Source: Table 1 & Table 2, Appendix A, "Cracks in the Foundation" (Bertsch et al., 2026)
OlmPool models are hosted on HuggingFace. Each model has 38 checkpoints covering the full pretraining and context extension process.
You must specify a revision and set trust_remote_code=True.
# Install dependencies
pip install transformers torch
# Load any OlmPool model
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "allenai/K_post_HQK_8kv_12k"
# revision = specific checkpoint (see HF Files tab)
tokenizer = AutoTokenizer.from_pretrained(
model_name,
revision="<checkpoint_revision>",
trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
model_name,
revision="<checkpoint_revision>",
trust_remote_code=True
)
Use the pre variants for pre-extension checkpoints and the post variants for the final context-extended models.
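Once loaded, the checkpoint behaves like any other causal LM in transformers, assuming the custom model code supports the standard generate API. A quick sanity check with a hypothetical prompt:

```python
# Greedy generation with the loaded checkpoint.
inputs = tokenizer("Long-context extension matters because", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```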
"OlmPool: Cracks in the Foundation — 26 models showing how architecture kills long context"
"Cracks in the Foundation: Seemingly Minor Architectural Choices Impact Long Context Extension"
Official announcement from Allen AI — OlmPool release and key findings
@article{bertsch2026olmpool,
title = {Cracks in the Foundation: Seemingly
Minor Architectural Choices Impact
Long Context Extension},
author = {Bertsch, Amanda and
Soldaini, Luca and
Gormley, Matthew R. and
Neubig, Graham and
Hajishirzi, Hannaneh and
Lo, Kyle and
Groeneveld, Dirk},
year = {2026},
url = {https://allenai.org/papers/olmpool},
institution = {Allen Institute for AI / CMU}
}
OlmPool is a suite of 26 comparable 7B language models released by Allen Institute for AI (Ai2) in April 2026. Each model is trained with identical data and optimization settings, varying only in four architectural choices: QK normalization, grouped-query attention (GQA), sliding window attention (SWA), and pretraining context length. The goal is to isolate how these choices affect long-context extensibility.
The four choices are: (1) QK Normalization — layerwise or headwise RMS norm applied to queries and keys; (2) Grouped-Query Attention (GQA) — sharing KV matrices across query heads to reduce cache size; (3) Sliding Window Attention (SWA) — using local-window attention in most layers; (4) Pretraining Context Length — 4K vs 8K tokens during pretraining.
Each choice individually reduces the expressivity of the attention mechanism. GQA reduces the number of independent attention patterns. SWA restricts the context window at most layers. QK norm suppresses attention sinks and increases entropy. When combined, these effects compound non-linearly — the worst configuration (GQA + SWA + headwise QK norm) drops HELMET 32K by 26.5 points.
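As a rough back-of-envelope using the comparison table above: a naive additive estimate (about −6 points for headwise QK norm plus about −9 points for SWA combined with GQA) lands around −15, well short of the observed −26.5 for the worst configuration. The interaction between the choices accounts for the remaining roughly ten points.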
Training loss correlates only weakly with long-context performance (R² = 0.29 for pretraining loss, R² = 0.06 for context extension loss). All 26 OlmPool models have very similar short-context metrics — the architectural differences only manifest after context extension. This is why the problem is so insidious: you can't see it coming from standard monitoring.
Llama 3 is one of the best architectures in the OlmPool design space, but not the absolute best. The top-performing model (Init J: prenorm, no QK norm, 16 KV heads, no SWA) outperforms the Llama 3 architecture equivalent. OlmPool confirms that Llama 3's long-context strength is architectural (not data-driven), and identifies configurations that surpass it.
Use the HuggingFace transformers library with trust_remote_code=True and a specific revision parameter pointing to one of the 38 available checkpoints. See the Quick Start section above for a code example.
The Init letter (A–K) is a weight initialization code, not an architecture name. It identifies which initialization scheme was used for that run. Multiple models can share the same Init letter while differing in other architectural choices.