Recursive Language Models (RLMs): A Brief Overview

February 20, 2026

AI LLMs Architecture

Recursive Language Models (RLMs) change how AI systems handle very large contexts. Instead of fitting millions of tokens into a single model's window, which often degrades performance (an effect known as "Context Rot"), an RLM uses a recursive inference strategy: it decomposes the context and interacts with it programmatically.

The Core Problem: Context Rot

Traditional LLMs suffer from "Context Rot": accuracy declines as the context window fills, even before the model reaches its hard token limit. The problem is one of quality, not just capacity. Models lose track of details and reason less reliably as the conversation history or input data grows.

Figure: An RLM interacts with a REPL environment to manage massive context, recursively sub-querying itself or other LMs to efficiently parse information. (Source: alexzhang13.github.io)

How RLMs Work

An RLM is a thin wrapper around a language model that lets the model interact with a computational environment, where the context is stored and manipulated instead of being placed directly in the prompt.

1. Programmatic vs. Tokenized Context

RLMs maintain a distinction between two types of context:

Programmatic Context: The raw, massive data (e.g., 400MB logs or 10M+ tokens) stored as variables in a coding environment.
Tokenized Context: The specific snippets of information currently "active" in the LLM's prompt window.
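
The distinction above can be sketched in a few lines of plain Python. This is a minimal illustration, not the rlms implementation: the log data is synthetic, and build_prompt is a hypothetical helper introduced only for this example.

```python
# Programmatic context: the full input lives in an ordinary variable.
# Synthetic stand-in for a large log file (hypothetical data).
context = "\n".join(
    f"2026-02-20 12:00:{i % 60:02d} INFO request {i} ok" for i in range(50_000)
)


def build_prompt(question: str, snippet: str) -> str:
    """Tokenized context: only this string would be sent to the model."""
    return f"Question: {question}\n\nRelevant excerpt:\n{snippet}"


# The model is shown a small window, never the whole variable.
snippet = context[:200]
prompt = build_prompt("What format are these log lines in?", snippet)

print(f"programmatic context: {len(context):,} characters")
print(f"tokenized context:    {len(prompt):,} characters")
```

The variable can grow without changing the size of the prompt, which is the property the rest of the design relies on.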

2. The REPL Environment

The RLM operates within a REPL (Read-Eval-Print Loop) environment, typically Python. The long context is loaded as a variable in this environment. The "Root LLM" does not see the entire context; instead, it writes code to explore it.
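
A rough sketch of that loop, assuming nothing about the real library: scripted_root_model is a hypothetical stand-in that returns pre-written code cells where a real Root LLM would generate them, and run_in_repl is an illustrative helper. The code runs unsandboxed here, which a real system would not do.

```python
import contextlib
import io


def run_in_repl(code: str, namespace: dict) -> str:
    """Execute one code cell in a persistent namespace and capture stdout."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, namespace)  # illustration only: real systems should sandbox this
    return buffer.getvalue()


def scripted_root_model(history: list[str]) -> str:
    """Hypothetical stand-in for the Root LLM: returns the next code cell."""
    cells = [
        "print(len(context))",
        "print(context[:40])",
        "answer = context.count('ERROR')\nprint(answer)",
    ]
    return cells[len(history)]


# The long context is loaded once as a variable in the environment.
namespace = {
    "context": "INFO boot\nERROR disk full\nINFO retry\nERROR timeout\n" * 1000
}

history: list[str] = []
for _ in range(3):
    cell = scripted_root_model(history)
    output = run_in_repl(cell, namespace)
    history.append(output[:200])  # only truncated output goes back to the model

print(history)
print(namespace["answer"])
```

The namespace persists between cells, so a variable set in one step (answer here) is available to later steps without passing through the prompt.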

3. Recursive Decomposition

The Root LLM can call other LLM instances (sub-calls) from within the REPL. This allows for a "divide and conquer" approach:

Peeking: The model looks at the first few characters to understand the data structure.
Filtering/Grepping: Using deterministic code (like regex) to narrow down relevant sections.
Partitioning: Breaking the context into smaller chunks.
Mapping: Assigning sub-LMs to analyze specific chunks and return summaries.
Synthesis: The Root LLM gathers all sub-findings to produce a final answer.
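
The five steps above could look roughly like this inside the REPL. It is a sketch under stated assumptions: sub_lm is a hypothetical placeholder that counts lines locally where a real sub-call would query a model, the context is synthetic, and the final sum stands in for the synthesis that the Root LLM would perform.

```python
import re


def sub_lm(chunk: str) -> str:
    """Hypothetical sub-LM call; here it just counts timeout lines in its chunk."""
    return str(sum("timeout" in line for line in chunk.splitlines()))


def partition(lines: list[str], size: int) -> list[str]:
    """Group lines into chunks of at most `size` lines each."""
    return ["\n".join(lines[i:i + size]) for i in range(0, len(lines), size)]


# Synthetic context (hypothetical data): every 7th line is an error.
context = "\n".join(
    f"line {i}: ERROR timeout" if i % 7 == 0 else f"line {i}: INFO ok"
    for i in range(1000)
)

peek = context[:60]                                      # Peeking
error_lines = re.findall(r"^.*ERROR.*$", context, re.M)  # Filtering/grepping
chunks = partition(error_lines, size=50)                 # Partitioning
findings = [sub_lm(chunk) for chunk in chunks]           # Mapping
total = sum(int(finding) for finding in findings)        # Synthesis

print(peek)
print(len(error_lines), len(chunks), total)
```

The deterministic filter runs before any model call, so the sub-calls only see the lines that matched.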

Key Benefits

Unbounded Context: RLMs have demonstrated stable performance on contexts exceeding 10 million tokens, where traditional models fail entirely.
Deterministic Accuracy: By using code for filtering and searching, RLMs combine the "fuzzy" reasoning of LLMs with the "exact" precision of programmatic tools.
Efficiency: RLMs can often use smaller, cheaper models (like GPT-4o-mini) for sub-tasks, orchestrated by a stronger "frontier" model, reducing overall costs.
Inference-Time Scaling: RLMs provide a new axis for scaling performance, since the model can "think" longer and run more iterations at inference time.
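
To make the efficiency point concrete, here is a back-of-the-envelope cost comparison. Every number in it is a made-up placeholder (prices, token counts, and call counts alike), and call_cost is a hypothetical helper. Whether an RLM is actually cheaper depends on how much the code-based filtering removes and how many root turns a task needs.

```python
def call_cost(input_tokens: int, output_tokens: int,
              price_in_per_m: float, price_out_per_m: float) -> float:
    """Cost of one call in dollars, given per-million-token prices."""
    total = input_tokens * price_in_per_m + output_tokens * price_out_per_m
    return total / 1_000_000


# Placeholder prices in dollars per million tokens, not real price lists.
FRONTIER = {"price_in_per_m": 10.0, "price_out_per_m": 30.0}
SMALL = {"price_in_per_m": 0.5, "price_out_per_m": 1.5}

# Single-shot: the frontier model reads all 1M tokens itself.
single_shot = call_cost(1_000_000, 1_000, **FRONTIER)

# RLM-style: filtering leaves 200k tokens, read by 20 small-model sub-calls;
# the root model only sees short summaries across 5 REPL turns.
sub_calls = 20 * call_cost(10_000, 200, **SMALL)
root_calls = 5 * call_cost(4_000, 500, **FRONTIER)
rlm_total = sub_calls + root_calls

print(f"single-shot: ${single_shot:.3f}")
print(f"RLM-style:   ${rlm_total:.3f}")
```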

Getting Started with RLMs

You can easily spin up a new environment and run RLMs in Python.

1. Set up your environment

# Create a new Python 3.14 virtual environment
uv venv --python 3.14 .venv

# Activate it
source .venv/bin/activate

# Install the rlms package
uv pip install rlms

2. Run your first RLM

from rlm import RLM

rlm = RLM(
    backend="gemini",
    backend_kwargs={"model_name": "gemini-2.0-flash"},
    verbose=True,  # Print progress to the console with rich; disabled by default.
)

prompt = (
    "Find the nearest city to Stockholm and tell me how far it is in km "
    "and the travel time by car to get there."
)
print(rlm.completion(prompt).response)

The Future: Agent Discovery

Beyond solving context limits, RLMs act as Agent Discovery Mechanisms. By observing the "traces" of how an RLM solves a complex problem—what strategies it tries, how it chunks data, and which sub-queries it makes—developers can identify repeating patterns. These patterns can then be "hard-coded" into optimized, low-latency agent architectures, effectively using RLMs to "invent" the best agent for a specific task.
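
One simple way to look for repeating patterns in traces is to count which short action sequences show up across runs. The sketch below assumes a trace is just an ordered list of action names. The trace format, the action labels, and count_action_runs are all hypothetical, since the source does not specify how traces are recorded.

```python
from collections import Counter

# Hypothetical traces: each is the ordered list of actions one RLM run took.
traces = [
    ["peek", "grep", "partition", "map", "synthesize"],
    ["peek", "grep", "map", "synthesize"],
    ["peek", "partition", "map", "synthesize"],
    ["peek", "grep", "partition", "map", "synthesize"],
]


def count_action_runs(traces: list[list[str]], n: int) -> Counter:
    """Count how many traces contain each contiguous run of n actions."""
    counts: Counter = Counter()
    for trace in traces:
        # Use a set so a run is counted at most once per trace.
        seen = {tuple(trace[i:i + n]) for i in range(len(trace) - n + 1)}
        counts.update(seen)
    return counts


patterns = count_action_runs(traces, n=2)
top_pattern, support = patterns.most_common(1)[0]
print(top_pattern, support)
```

Runs that appear in most traces are candidates for a fixed pipeline step, while rare ones may be better left to the model's discretion.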

Current Limitations

Latency: Because they often involve multiple synchronous LLM calls, RLMs are currently slower than single-shot completions.
Model Strength: The Root LLM must be a high-reasoning "frontier" model to effectively manage the REPL and sub-tasking logic.
Synchronicity: Current implementations are often blocking and do not yet fully utilize asynchronous parallel processing for sub-calls.
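
The latency and synchronicity points can be illustrated with Python's asyncio. In this sketch sub_lm is a hypothetical async placeholder whose sleep stands in for the network latency of a model call. It does not show how any current RLM implementation issues its sub-calls.

```python
import asyncio
import time


async def sub_lm(chunk: str) -> str:
    """Hypothetical async sub-LM call; the sleep stands in for network latency."""
    await asyncio.sleep(0.1)
    return f"{len(chunk)} chars"


async def map_sequential(chunks: list[str]) -> list[str]:
    """Blocking style: each sub-call waits for the previous one to finish."""
    return [await sub_lm(chunk) for chunk in chunks]


async def map_concurrent(chunks: list[str]) -> list[str]:
    """Concurrent style: all sub-calls are in flight at once."""
    # asyncio.gather returns results in the same order as its inputs.
    return await asyncio.gather(*(sub_lm(chunk) for chunk in chunks))


def timed(coro) -> tuple[list[str], float]:
    """Run a coroutine to completion and return its result and elapsed seconds."""
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start


chunks = ["a" * n for n in (10, 20, 30, 40, 50)]
seq_result, seq_seconds = timed(map_sequential(chunks))
con_result, con_seconds = timed(map_concurrent(chunks))

print(f"sequential: {seq_seconds:.2f}s, concurrent: {con_seconds:.2f}s")
```

With five simulated calls, the sequential version waits for each call in turn while the concurrent version overlaps them, so independent sub-calls are the natural place to recover latency.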