Putting the “Infinite Context Window” to the Test
Evaluating Recursive Language Models (RLMs) on a real-world agent analytics workload

A new research technique proposed by Alex Zhang and colleagues at MIT enables LLMs to approach something close to an “infinite context window.”
Known as Recursive Language Models (RLMs), this technique rethinks how LLMs handle inputs vastly larger than their native context limits. Rather than attempting to ingest the entire input in a single forward pass, RLMs treat the full input as an external, programmable environment. The model interacts with this environment through a Python REPL, where it can programmatically inspect, decompose, summarize, search, and recursively delegate subtasks to identical child agents—effectively scaling reasoning to 10 million+ tokens while avoiding traditional context-window collapse and information loss.
In the original paper, the authors show that RLMs can handle inputs of this magnitude and, in many cases, outperform standard long-context approaches on both quality and cost by selectively allocating reasoning effort where it matters most. Instead of focusing on generic benchmarks, we wanted to see what happens when you apply this idea to a real, production problem.
At Voker, we’re pioneering the first analytics platform for monitoring your AI agent ecosystem—focused on understanding agent ↔ user behavior. Not just tokens, traces, or latency, but what agents are actually doing and whether they’re helping.
This post covers:
- The concrete problem we tested
- Its complexity with standard LLM approaches
- How RLM performed in practice
- Where it broke down
- Why we’re not using it (yet)
The problem we tested
A core part of agent analytics is giving AI builders a high-level understanding of how an agent is used.
Ultimately, the goal is to provide a clear view into what an agent is doing, how well it’s doing it, and to detect anomalies in its behavior.
To do that, we need to:
- Compile all agent ↔ user interactions into conversations
- Create a conversation-level overview for each:
Summary
Duration
Number of turns
General topic(s)
Key points
Resolution status
- Aggregate those conversation-level insights into an agent-level overview
- Re-run this aggregation continuously as new conversations arrive
In practice, this means summarizing hundreds of multi-turn conversations and synthesizing them into a single coherent picture.
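To make the target concrete, here is a minimal sketch of the conversation-level overview described above and how it rolls up into an agent-level view. The class and field names are our own illustration, not part of any RLM or Voker API.

```python
from dataclasses import dataclass, field

@dataclass
class ConversationOverview:
    """Per-conversation insight record (hypothetical shape)."""
    conversation_id: str
    summary: str
    duration_seconds: float
    turn_count: int
    topics: list = field(default_factory=list)
    key_points: list = field(default_factory=list)
    status: str = "Ongoing"  # Resolved | Ongoing | Escalated | Abandoned

def aggregate(overviews):
    """Roll conversation-level overviews up into one agent-level overview."""
    by_status = {}
    for o in overviews:
        by_status[o.status] = by_status.get(o.status, 0) + 1
    return {
        "total_conversations": len(overviews),
        "avg_turns": sum(o.turn_count for o in overviews) / max(len(overviews), 1),
        "status_breakdown": by_status,
    }
```

Re-running `aggregate` as new conversations arrive is what makes this a continuous pipeline rather than a one-off report.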
Its complexity with standard LLM approaches
The core issue: the scope of the data quickly exceeds a typical context window.
Without something like RLM, you’re forced to:
- Break user conversations into chunks
- Run multiple LLM calls on these to extract insights
- Process insights to generate an agent-level overview
- Re-aggregate as new conversations arrive
This adds latency, code complexity, and more failure modes—along with more places where summaries can drift or degrade, introducing a requirement for granular guardrails.
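The chunk-and-aggregate baseline above is essentially a map-reduce over conversations. A minimal sketch, assuming a generic `llm_call(prompt) -> str` client (a placeholder, not a real library call):

```python
def summarize_all(conversations, llm_call, chunk_size=20):
    """Map-reduce baseline: summarize conversations in chunks, then merge.

    `llm_call(prompt) -> str` is a placeholder for any LLM client.
    """
    # Map: summarize each chunk of conversations independently
    partials = []
    for i in range(0, len(conversations), chunk_size):
        chunk = conversations[i:i + chunk_size]
        partials.append(llm_call("Summarize these conversations:\n" + "\n".join(chunk)))
    # Reduce: merge the partial summaries into one agent-level overview
    return llm_call("Merge these partial summaries into one overview:\n" + "\n".join(partials))
```

Every map call and the final reduce are separate places where summaries can drift, which is exactly the failure surface the guardrails have to cover.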
So the question was:
Can we leverage the RLM architecture to do this in a single pass by feeding all conversation data into its infinite context window?
Setting up RLM
First, we cloned the repository and followed the instructions to get it up and running.
What was easy
- The quickstart.py file provides a great starting point for spinning up your first RLM instance
- There’s an out-of-the-box visualizer that traces:
- Code executions
- Sub-agents the RLM spawned
The visualizer is a bit hard to navigate, but genuinely useful.
What was confusing
The main point of entry for querying an RLM is
Python
rlm.completion(prompt=..., root_prompt=...)
where:
- prompt: content or data you want the model to analyze (this becomes the context in the REPL environment)
- root_prompt: high-level instructions for how the model should analyze the provided data
But this distinction isn’t immediately obvious from the quickstart example, which only passes a single string:
Python
result = rlm.completion(
    "Print me the first 5 powers of two, each on a newline."
)
To make the separation more explicit and intuitive, it could be rewritten like this:
Python
data = ""  # ← the context/data to analyze (often empty for simple tasks)
instructions = "Print me the first 5 powers of two, each on a newline."

result = rlm.completion(
    prompt=data,
    root_prompt=instructions
)
The experiment
We passed in raw, unsorted message data across 110 conversations, structured like this:
JSON
{
  "conversation_id": "00000000-0000-0000-0000-000000000001",
  "created_at": "2026-01-19T00:00:00Z",
  "user_message": "Okay, add the Pixel 8 128GB to my cart.",
  "assistant_response": "Added! Your cart now has Google Pixel 8 (128GB). Anything else?"
}
Along with an RLM prompt that looked something like this:
Python
"""
You are analyzing conversation logs from an AI customer service agent. Your task is to create structured summaries of each conversation thread.
INSTRUCTIONS:
1. Group all messages by conversation_id to reconstruct complete conversations
2. For each conversation, analyze the chronological flow using created_at timestamps
3. Parse the user_message and assistant_response fields
...
"""
What RLM actually did (best run)
In the best run, RLM behaved like a lightweight data analyst + orchestrator.
1. It identified useful Python packages for the task and imported them before loading the context
Python
import json, re, datetime
data = json.loads(context)
2. Grouped data by conversation
Python
from collections import defaultdict

by_conversation = defaultdict(list)
for row in data:
    by_conversation[row["conversation_id"]].append(row)
3. Sorted the messages within each conversation chronologically
Python
from datetime import datetime

def parse_ts(s):
    return datetime.fromisoformat(s)

for sid in by_conversation:
    by_conversation[sid].sort(key=lambda r: parse_ts(r["created_at"]))
4. Spawned 110 sub-LLMs to extract insights from each conversation
Python
prompts = []
for txt in convo_texts:
    prompts.append(
        "You are summarizing one customer-service chat conversation. "
        "Return a JSON object with keys: conversation_id, topic_category, "
        "summary, key_points (array of 3 strings), "
        "status (Resolved|Ongoing|Escalated|Abandoned). "
        "Conversation:\n" + txt
    )
summaries = llm_query_batched(prompts)
5. Aggregated the per-conversation results into a final output
Python
import json

final = {"conversations": []}
for s in summaries:
    try:
        final["conversations"].append(json.loads(s))
    except Exception as e:
        final["conversations"].append({"error": str(e), "raw": s})
When it worked, this was genuinely impressive. Here is a snippet of the output.
JSON
{
  "conversations": [{
    "conversation_id": "00000000-0000-0000-0000-000000000001",
    "timeline": {
      "start": "2026-01-19 13:28:31",
      "end": "2026-01-19 13:28:31"
    },
    "turn_count": 22,
    "topic_category": "E-commerce purchase assistance (shipping/returns, product selection, cart and checkout)",
    "summary": "The customer asked about shipping time, tracking availability, order confirmation, and the return policy. They then shopped for a smartphone under $800 with a strong camera, selected a Google Pixel 8 with a discount, added a black case, reviewed the total, and proceeded through checkout by providing shipping and payment details.",
    "key_points": [
      "Provided policies: standard shipping 3–5 business days, tracking emailed after shipment, 30-day returns with free return shipping",
      "Product guidance led to selecting Google Pixel 8 (128GB) with 10% off code HOLIDAY10; compared battery and storage with Galaxy S23",
      "Checkout progressed: cart contained Pixel 8 + black case; subtotal $658.10 plus tax; shipping address saved and credit card added"
    ],
    "status": "Resolved"
  }]
}
Results
Pros
- Can work beautifully with expensive reasoning models (gpt-5.2)
- The reasoning trace showed the model:
- Exploring the data
- Writing helper code
- Refining its own approach
- The behavior felt closer to a data analyst than a prompt template
Cons
- Failed badly with cheaper models
- Would sometimes hang indefinitely with no error signal
- Inconsistent execution strategy (e.g. importing random libraries at start)
- Occasionally produced correct data but did not adhere to the structured output format
In one case, it re-ran subcalls three times (330 total calls) and still failed to return valid output.
In short: extremely powerful, extremely unstable.
Conclusion
We’re not using RLM in production today.
Not because the idea isn’t compelling—but because we can replicate the behavior with:
- More guardrails
- Better determinism
- Lower cost
- Fewer catastrophic failure modes
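As one example of the guardrails we mean: the structured-output failures above can be caught with a bounded-retry schema check before a summary is ever accepted. This is a sketch of our own validation approach, not part of the RLM codebase; `llm_call(prompt) -> str` is again a placeholder client.

```python
import json

REQUIRED_KEYS = {"conversation_id", "topic_category", "summary", "key_points", "status"}

def guarded_summarize(txt, llm_call, max_attempts=3):
    """Bounded-retry guardrail: validate the model's JSON before accepting it."""
    for attempt in range(max_attempts):
        raw = llm_call("Summarize as JSON with keys "
                       f"{sorted(REQUIRED_KEYS)}:\n{txt}")
        try:
            parsed = json.loads(raw)
            if REQUIRED_KEYS.issubset(parsed):
                return parsed  # valid structured output
        except json.JSONDecodeError:
            pass  # malformed JSON; fall through and retry
    # Give up after max_attempts rather than re-running subcalls indefinitely
    return {"error": "failed validation", "raw": raw}
```

The hard cap on attempts is the point: it trades occasional missing summaries for a guarantee that one bad conversation can never trigger hundreds of retries.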
Right now, RLM is not production-safe for this class of workload.
That said, as the technique matures—with stronger constraints and execution guarantees—we’ll revisit it.
What’s next
- Curious to see how others apply RLM to real workloads
- Excited to watch how the technique evolves
- Open to feedback if we missed something or could improve the approach
Google just released a supposedly enterprise-ready implementation of the original codebase, which we’ll be putting to the test as well. (https://discuss.google.dev/t/recursive-language-models-in-adk/323523)
References
Zhang, A. L., Kraska, T., & Khattab, O. (2025). Recursive Language Models. arXiv:2512.24601.
https://arxiv.org/abs/2512.24601
