Agent Engineering

Putting the “Infinite Context Window” to the Test

Evaluating Recursive Language Models (RLMs) on a real-world agent analytics workload

Christoph Bathon
Jan 18, 2026

A new research technique proposed by Alex Zhang and colleagues at MIT enables LLMs to approach something close to an “infinite context window.”

Known as Recursive Language Models (RLMs), this technique rethinks how LLMs handle inputs vastly larger than their native context limits. Rather than attempting to ingest the entire input in a single forward pass, RLMs treat the full input as an external, programmable environment. The model interacts with this environment through a Python REPL, where it can programmatically inspect, decompose, summarize, search, and recursively delegate subtasks to identical child agents—effectively scaling reasoning to 10 million+ tokens while avoiding traditional context-window collapse and information loss.
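To make the idea concrete, here is a minimal sketch of that interaction loop. Everything in it is illustrative: `context` stands for the oversized input living as a plain variable in the REPL, and `llm_query` is a hypothetical stand-in for the recursive child call the real system would make.

```python
# Hedged sketch of the RLM idea: the root model never ingests the full input.
# The input sits in the REPL as an ordinary variable, and the model emits code
# that slices it and delegates pieces to child calls.

def llm_query(prompt: str) -> str:
    # Placeholder: a real RLM would invoke a child language model here.
    return f"summary of {len(prompt)} chars"

context = "..." * 1_000_000  # far larger than any native context window

# Code the root model might emit: chunk the input, then recurse selectively.
chunk_size = 50_000
chunks = [context[i:i + chunk_size] for i in range(0, len(context), chunk_size)]
partials = [llm_query("Summarize:\n" + c) for c in chunks[:3]]  # effort where it matters
```

The key design point is that the model decides *which* slices deserve a sub-call, rather than paying for uniform attention over the whole input.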

In the original paper, the authors show that RLMs can handle inputs of this magnitude and, in many cases, outperform standard long-context approaches in both quality and cost by selectively allocating reasoning effort where it matters most. Instead of focusing on generic benchmarks, we wanted to see what happens when you apply this idea to a real, production problem.

At Voker, we’re pioneering the first analytics platform for monitoring your AI agent ecosystem—focused on understanding agent ↔ user behavior. Not just tokens, traces, or latency, but what agents are actually doing and whether they’re helping.

This post covers:

  • The concrete problem we tested
  • Its complexity with standard LLM approaches
  • How RLM performed in practice
  • Where it broke down
  • Why we’re not using it (yet)

The problem we tested

A core part of agent analytics is giving AI builders a high-level understanding of how an agent is used.

Ultimately, the goal is to provide a clear view into what an agent is doing, how well it’s doing it, and to detect anomalies in its behavior.

To do that, we need to:

  • Compile all agent ↔ user interactions into conversations
  • Create a conversation-level overview for each:
    • Summary
    • Duration
    • Number of turns
    • General topic(s)
    • Key points
    • Resolution status

  • Aggregate those conversation-level insights into an agent-level overview
  • Re-run this aggregation continuously as new conversations arrive
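The conversation-level overview above can be sketched as a small record type. This is a minimal sketch with illustrative field names mirroring the bullet list, not Voker's actual schema:

```python
# Hedged sketch of the per-conversation overview described above.
from dataclasses import dataclass, field

@dataclass
class ConversationOverview:
    conversation_id: str
    summary: str
    duration_seconds: float
    turn_count: int
    topics: list[str] = field(default_factory=list)
    key_points: list[str] = field(default_factory=list)
    resolution_status: str = "Ongoing"  # e.g. Resolved | Ongoing | Escalated

# One record per conversation; the agent-level overview aggregates these.
overview = ConversationOverview(
    conversation_id="00000000-0000-0000-0000-000000000001",
    summary="User bought a phone after asking about shipping and returns.",
    duration_seconds=312.0,
    turn_count=22,
)
```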

In practice, this means summarizing hundreds of multi-turn conversations and synthesizing them into a single coherent picture.

Its complexity with standard LLM approaches

The core issue: the scope of the data quickly exceeds a typical context window.

Without something like RLM, you’re forced to:

  • Break user conversations into chunks
  • Run multiple LLM calls on these to extract insights
  • Process insights to generate an agent-level overview
  • Re-aggregate as new conversations arrive

This adds latency, code complexity, and more failure modes—along with more places where summaries can drift or degrade, introducing a requirement for granular guardrails.
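The chunked workaround amounts to a two-stage map/reduce. A minimal sketch, where `summarize` is a stub standing in for an LLM call (the two-stage structure is the point, not the stub bodies):

```python
# Hedged sketch of the pipeline you are forced into without RLM.

def summarize(text: str, max_chars: int = 80) -> str:
    return text[:max_chars]  # placeholder for an LLM summarization call

def chunked(items: list[str], size: int) -> list[list[str]]:
    return [items[i:i + size] for i in range(0, len(items), size)]

conversations = [f"conversation {i} ..." for i in range(110)]

# Map: one call per chunk, each small enough to fit a context window.
partials = [summarize("\n".join(chunk)) for chunk in chunked(conversations, 20)]

# Reduce: a final call over the intermediate summaries.
agent_overview = summarize("\n".join(partials))
```

Every stage is another place where detail gets lost, which is exactly the drift problem mentioned above.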

So the question was:

Can we leverage the RLM architecture to do this in a single pass by feeding all conversation data into its infinite context window?

Setting up RLM

First, we cloned the repository and followed the instructions to get it up and running.

What was easy

  • The quickstart.py file provides a great starting point for spinning up your first RLM instance.
  • There’s an out-of-the-box visualizer that traces:
    • Code executions
    • Sub-agents the RLM spawned

The visualizer is a bit hard to navigate, but genuinely useful.

What was confusing

The main point of entry for querying an RLM is

Python
rlm.completion(prompt=..., root_prompt=...)

where:

  • prompt: content or data you want the model to analyze (this becomes the context in the REPL environment)
  • root_prompt: high-level instructions for how the model should analyze the provided data

But this distinction isn’t immediately obvious from the quickstart example, which only passes a single string:

Python
result = rlm.completion(
    "Print me the first 5 powers of two, each on a newline."
)

To make the separation more explicit and intuitive, it could be rewritten like this:

Python
data = ""  # ← the context/data to analyze (often empty for simple tasks)
instructions = "Print me the first 5 powers of two, each on a newline."

result = rlm.completion(
    prompt=data,
    root_prompt=instructions
)

The experiment

We passed in raw, unsorted message data across 110 conversations, structured like this:

JSON 
{
  "conversation_id": "00000000-0000-0000-0000-000000000001",
  "created_at": "2026-01-19T00:00:00Z",
  "user_message": "Okay, add the Pixel 8 128GB to my cart.",
  "assistant_response": "Added! Your cart now has Google Pixel 8 (128GB). Anything else?"
}

Along with an RLM prompt that looked something like this:

Python

"""
You are analyzing conversation logs from an AI customer service agent. Your task is to create structured summaries of each conversation thread.

INSTRUCTIONS:

1. Group all messages by conversation_id to reconstruct complete conversations

2. For each conversation, analyze the chronological flow using created_at timestamps

3. Parse the user_message and assistant_response fields

...
"""

What RLM actually did (best run)

In the best run, RLM behaved like a lightweight data analyst + orchestrator.

1. It identified useful Python packages for the task and imported them before loading the context

Python
import json, re, datetime
data = json.loads(context)

2. Grouped data by conversation

Python
from collections import defaultdict

by_conversation = defaultdict(list)
for row in data:
    by_conversation[row["conversation_id"]].append(row)

3. Sorted the messages within each conversation chronologically

Python
from datetime import datetime

def parse_ts(s):
    # Note: timestamps with a trailing "Z" require Python 3.11+ here.
    return datetime.fromisoformat(s)

for sid in by_conversation:
    by_conversation[sid].sort(key=lambda r: parse_ts(r["created_at"]))

4. Spawned 110 sub-LLMs to extract insights from each conversation

Python
prompts = []
for txt in convo_texts:
    prompts.append(
        "You are summarizing one customer-service chat conversation. "
        "Return a JSON object with keys: conversation_id, topic_category, summary, "
        "key_points (array of 3 strings), status (Resolved|Ongoing|Escalated|Abandoned). "
        "Conversation:\n" + txt
    )

summaries = llm_query_batched(prompts)

5. Aggregated the per-conversation results into a final output

Python
import json

final = {"conversations": []}

for s in summaries:
    try:
        final["conversations"].append(json.loads(s))
    except Exception as e:
        final["conversations"].append({"error": str(e), "raw": s})

When it worked, this was genuinely impressive. Here is a snippet of the output.

JSON 
{
  "conversations": [{
    "conversation_id": "00000000-0000-0000-0000-000000000001",
    "timeline": {
      "start": "2026-01-19 13:28:31",
      "end": "2026-01-19 13:28:31"
    },
    "turn_count": 22,
    "topic_category": "E-commerce purchase assistance (shipping/returns, product selection, cart and checkout)",
    "summary": "The customer asked about shipping time, tracking availability, order confirmation, and the return policy. They then shopped for a smartphone under $800 with a strong camera, selected a Google Pixel 8 with a discount, added a black case, reviewed the total, and proceeded through checkout by providing shipping and payment details.",
    "key_points": [
      "Provided policies: standard shipping 3–5 business days, tracking emailed after shipment, 30-day returns with free return shipping",
      "Product guidance led to selecting Google Pixel 8 (128GB) with 10% off code HOLIDAY10; compared battery and storage with Galaxy S23",
      "Checkout progressed: cart contained Pixel 8 + black case; subtotal $658.10 plus tax; shipping address saved and credit card added"
    ],
    "status": "Resolved"
  }]
}

Results

Pros

  • Can work beautifully with expensive reasoning models (gpt-5.2)
  • The reasoning trace showed the model:
    • Exploring the data
    • Writing helper code
    • Refining its own approach
  • The behavior felt closer to a data analyst than a prompt template

Cons

  • Failed badly with cheaper models 
  • Would sometimes hang indefinitely with no error signal
  • Inconsistent execution strategy (e.g. importing arbitrary libraries at the start)
  • Occasionally produced correct data but did not adhere to the structured output format

In one case, it re-ran subcalls three times (330 total calls) and still failed to return valid output.

In short: extremely powerful, extremely unstable.

Conclusion

We’re not using RLM in production today.

Not because the idea isn’t compelling—but because we can replicate the behavior with:

  • More guardrails
  • Better determinism
  • Lower cost
  • Fewer catastrophic failure modes

Right now, RLM is not production-safe for this class of workload.

That said, as the technique matures—with stronger constraints and execution guarantees—we’ll revisit it.

What’s next

  • Curious to see how others apply RLM to real workloads
  • Excited to watch how the technique evolves
  • Open to feedback if we missed something or could improve the approach

Google just released a supposedly enterprise-ready implementation of the original codebase, which we’ll be putting to the test as well: https://discuss.google.dev/t/recursive-language-models-in-adk/323523

References

Zhang, A. L., Kraska, T., & Khattab, O. (2025). Recursive Language Models. arXiv:2512.24601.
https://arxiv.org/abs/2512.24601

Christoph Bathon
Founding Engineer, Voker
