Putting the “Infinite Context Window” to the Test
Evaluating Recursive Language Models (RLMs) on a real-world agent analytics workload

A new research technique proposed by Alex Zhang and colleagues at MIT enables LLMs to approach something close to an “infinite context window.”
Known as Recursive Language Models (RLMs), this technique rethinks how LLMs handle inputs vastly larger than their native context limits. Rather than attempting to ingest the entire input in a single forward pass, RLMs treat the full input as an external, programmable environment. The model interacts with this environment through a Python REPL, where it can programmatically inspect, decompose, summarize, search, and recursively delegate subtasks to identical child agents—effectively scaling reasoning to 10 million+ tokens while avoiding traditional context-window collapse and information loss.
In the original paper, the authors show that RLMs can handle inputs of this magnitude and, in many cases, outperform standard long-context approaches on both quality and cost by selectively allocating reasoning effort where it matters most. Instead of focusing on generic benchmarks, we wanted to see what happens when you apply this idea to a real, production problem.
At Voker, we’re pioneering the first analytics platform for monitoring your AI agent ecosystem—focused on understanding agent ↔ user behavior. Not just tokens, traces, or latency, but what agents are actually doing and whether they’re helping.
This post covers:
- The concrete problem we tested
- Its complexity with standard LLM approaches
- How RLM performed in practice
- Where it broke down
- Why we’re not using it (yet)
The problem we tested
A core part of agent analytics is giving AI builders a high-level understanding of how an agent is used.
Ultimately, the goal is to provide a clear view into what an agent is doing, how well it’s doing it, and to detect anomalies in its behavior.
To do that, we need to:
- Compile all agent ↔ user interactions into conversations
- Create a conversation-level overview for each:
Summary
Duration
Number of turns
General topic(s)
Key points
Resolution status
- Aggregate those conversation-level insights into an agent-level overview
- Re-run this aggregation continuously as new conversations arrive
In practice, this means summarizing hundreds of multi-turn conversations and synthesizing them into a single coherent picture.
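To make the target concrete, here is a minimal sketch of the conversation-level overview described above and how it rolls up into an agent-level view. The class and field names are our own illustration, not part of any RLM or Voker API.

```python
from dataclasses import dataclass, field

@dataclass
class ConversationOverview:
    """Per-conversation insight record (hypothetical shape)."""
    conversation_id: str
    summary: str
    duration_seconds: float
    turn_count: int
    topics: list = field(default_factory=list)
    key_points: list = field(default_factory=list)
    status: str = "Ongoing"  # Resolved | Ongoing | Escalated | Abandoned

def aggregate(overviews):
    """Roll conversation-level overviews up into one agent-level overview."""
    by_status = {}
    for o in overviews:
        by_status[o.status] = by_status.get(o.status, 0) + 1
    return {
        "total_conversations": len(overviews),
        "avg_turns": sum(o.turn_count for o in overviews) / max(len(overviews), 1),
        "status_breakdown": by_status,
    }
```

Re-running `aggregate` as new conversations arrive is what makes this a continuous pipeline rather than a one-off report.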
Its complexity with standard LLM approaches
The core issue: the scope of the data quickly exceeds a typical context window.
Without something like RLM, you’re forced to:
- Break user conversations into chunks
- Run multiple LLM calls on these to extract insights
- Process insights to generate an agent-level overview
- Re-aggregate as new conversations arrive
This adds latency, code complexity, and more failure modes—along with more places where summaries can drift or degrade, introducing a requirement for granular guardrails.
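The chunk-and-aggregate baseline above is essentially a map-reduce over conversations. A minimal sketch, assuming a generic `llm_call(prompt) -> str` client (a placeholder, not a real library call):

```python
def summarize_all(conversations, llm_call, chunk_size=20):
    """Map-reduce baseline: summarize conversations in chunks, then merge.

    `llm_call(prompt) -> str` is a placeholder for any LLM client.
    """
    # Map: summarize each chunk of conversations independently
    partials = []
    for i in range(0, len(conversations), chunk_size):
        chunk = conversations[i:i + chunk_size]
        partials.append(llm_call("Summarize these conversations:\n" + "\n".join(chunk)))
    # Reduce: merge the partial summaries into one agent-level overview
    return llm_call("Merge these partial summaries into one overview:\n" + "\n".join(partials))
```

Every map call and the final reduce are separate places where summaries can drift, which is exactly the failure surface the guardrails have to cover.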
So the question was:
Can we leverage the RLM architecture to do this in a single pass by feeding all conversation data into its infinite context window?
Setting up RLM
First, we cloned the repository and followed the instructions to get it up and running.
What was easy
- The quickstart.py file provides a great starting point for spinning up your first RLM instance
- There’s an out-of-the-box visualizer that traces:
- Code executions
- Sub-agents the RLM spawned
The visualizer is a bit hard to navigate, but genuinely useful.
What was confusing
The main point of entry for querying an RLM is
Python
rlm.completion(prompt=..., root_prompt=...)
where:
- prompt: content or data you want the model to analyze (this becomes the context in the REPL environment)
- root_prompt: high-level instructions for how the model should analyze the provided data
But this distinction isn’t immediately obvious from the quickstart example, which only passes a single string:
Python
result = rlm.completion(
    "Print me the first 5 powers of two, each on a newline."
)
To make the separation more explicit and intuitive, it could be rewritten like this:
Python
data = ""  # ← the context/data to analyze (often empty for simple tasks)
instructions = "Print me the first 5 powers of two, each on a newline."

result = rlm.completion(
    prompt=data,
    root_prompt=instructions
)
The experiment
We passed in raw, unsorted message data across 110 conversations, structured like this:
JSON
{
  "conversation_id": "00000000-0000-0000-0000-000000000001",
  "created_at": "2026-01-19T00:00:00Z",
  "user_message": "Okay, add the Pixel 8 128GB to my cart.",
  "assistant_response": "Added! Your cart now has Google Pixel 8 (128GB). Anything else?"
}
Along with an RLM prompt that looked something like this:
Python
"""
You are analyzing conversation logs from an AI customer service agent. Your task is to create structured summaries of each conversation thread.
INSTRUCTIONS:
1. Group all messages by conversation_id to reconstruct complete conversations
2. For each conversation, analyze the chronological flow using created_at timestamps
3. Parse the user_message and assistant_response fields
...
"""
What RLM actually did (best run)
In the best run, RLM behaved like a lightweight data analyst + orchestrator.
1. It identified useful Python packages for the task and imported them before loading the context
Python
import json, re, datetime
data = json.loads(context)
2. Grouped data by conversation
Python
from collections import defaultdict

by_conversation = defaultdict(list)
for row in data:
    by_conversation[row["conversation_id"]].append(row)
3. Sorted the messages within each conversation chronologically
Python
from datetime import datetime

def parse_ts(s):
    return datetime.fromisoformat(s)

for sid in by_conversation:
    by_conversation[sid].sort(key=lambda r: parse_ts(r["created_at"]))
4. Spawned 110 sub-LLMs to extract insights from each conversation
Python
prompts = []
for txt in convo_texts:
    prompts.append(
        "You are summarizing one customer-service chat conversation. "
        "Return a JSON object with keys: conversation_id, topic_category, "
        "summary, key_points (array of 3 strings), "
        "status (Resolved|Ongoing|Escalated|Abandoned). "
        "Conversation:\n" + txt
    )
summaries = llm_query_batched(prompts)
5. Aggregated the per-conversation results into a final output
Python
import json

final = {"conversations": []}
for s in summaries:
    try:
        final["conversations"].append(json.loads(s))
    except Exception as e:
        final["conversations"].append({"error": str(e), "raw": s})
When it worked, this was genuinely impressive. Here is a snippet of the output.
JSON
{
  "conversations": [{
    "conversation_id": "00000000-0000-0000-0000-000000000001",
    "timeline": {
      "start": "2026-01-19 13:28:31",
      "end": "2026-01-19 13:28:31"
    },
    "turn_count": 22,
    "topic_category": "E-commerce purchase assistance (shipping/returns, product selection, cart and checkout)",
    "summary": "The customer asked about shipping time, tracking availability, order confirmation, and the return policy. They then shopped for a smartphone under $800 with a strong camera, selected a Google Pixel 8 with a discount, added a black case, reviewed the total, and proceeded through checkout by providing shipping and payment details.",
    "key_points": [
      "Provided policies: standard shipping 3–5 business days, tracking emailed after shipment, 30-day returns with free return shipping",
      "Product guidance led to selecting Google Pixel 8 (128GB) with 10% off code HOLIDAY10; compared battery and storage with Galaxy S23",
      "Checkout progressed: cart contained Pixel 8 + black case; subtotal $658.10 plus tax; shipping address saved and credit card added"
    ],
    "status": "Resolved"
  }]
}
Results
Pros
- Can work beautifully with expensive reasoning models (gpt-5.2)
- The reasoning trace showed the model:
- Exploring the data
- Writing helper code
- Refining its own approach
- The behavior felt closer to a data analyst than a prompt template
Cons
- Failed badly with cheaper models
- Would sometimes hang indefinitely with no error signal
- Inconsistent execution strategy (e.g. importing random libraries at start)
- Occasionally produced correct data but did not adhere to the structured output format
In one case, it re-ran subcalls three times (330 total calls) and still failed to return valid output.
In short: extremely powerful, extremely unstable.
Conclusion
We’re not using RLM in production today.
Not because the idea isn’t compelling—but because we can replicate the behavior with:
- More guardrails
- Better determinism
- Lower cost
- Fewer catastrophic failure modes
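As one example of the guardrails we mean: the structured-output failures above can be caught with a bounded-retry schema check before a summary is ever accepted. This is a sketch of our own validation approach, not part of the RLM codebase; `llm_call(prompt) -> str` is again a placeholder client.

```python
import json

REQUIRED_KEYS = {"conversation_id", "topic_category", "summary", "key_points", "status"}

def guarded_summarize(txt, llm_call, max_attempts=3):
    """Bounded-retry guardrail: validate the model's JSON before accepting it."""
    for attempt in range(max_attempts):
        raw = llm_call("Summarize as JSON with keys "
                       f"{sorted(REQUIRED_KEYS)}:\n{txt}")
        try:
            parsed = json.loads(raw)
            if REQUIRED_KEYS.issubset(parsed):
                return parsed  # valid structured output
        except json.JSONDecodeError:
            pass  # malformed JSON; fall through and retry
    # Give up after max_attempts rather than re-running subcalls indefinitely
    return {"error": "failed validation", "raw": raw}
```

The hard cap on attempts is the point: it trades occasional missing summaries for a guarantee that one bad conversation can never trigger hundreds of retries.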
Right now, RLM is not production-safe for this class of workload.
That said, as the technique matures—with stronger constraints and execution guarantees—we’ll revisit it.
What’s next
- Curious to see how others apply RLM to real workloads
- Excited to watch how the technique evolves
- Open to feedback if we missed something or could improve the approach
Google just released a supposedly enterprise-ready implementation of the original codebase, which we’ll be putting to the test as well. (https://discuss.google.dev/t/recursive-language-models-in-adk/323523)
References
Zhang, A. L., Kraska, T., & Khattab, O. (2025). Recursive Language Models. arXiv:2512.24601.
https://arxiv.org/abs/2512.24601
