Back to Blog
Agent Engineering
min read
min read

Do LLMs Play Favorites?

Evaluating whether LLMs score their own outputs differently than their peers

Smiling person with glasses, dark hair, wearing a dark shirt against a plain background. Their expression conveys happiness and friendliness.
Zach Ta, Voker
Jan 2, 2026
Share

Large language models are increasingly being used to evaluate other large language models.[1]

Instead of relying exclusively on human annotators, many teams now use LLMs to:

  • Score responses
  • Rank outputs
  • Benchmark performance

This dramatically reduces evaluation cost and enables continuous benchmarking at scale.

But it raises a simple question:

Can AI fairly evaluate AI?

More specifically:

Do models rate their own outputs differently than they rate others?

At Voker, we’re building an analytics platform designed to monitor how AI agents perform in real-world deployments. That includes evaluating agent responses at scale, which means evaluation systems themselves become part of the product.

To better understand this dynamic, we ran a small experiment.

This article covers:

  • The evaluation bias question we wanted to test
  • How we structured the experiment
  • What patterns emerged in cross-model scoring
  • Why evaluator choice might matter more than expected

The question we tested

In many evaluation pipelines, it’s common to:

  1. Generate responses using Model A
  2. Evaluate those responses using Model A

This approach keeps evaluation pipelines simple and fast.

But what if the evaluator isn’t completely neutral?

For example:

If a company uses GPT-5-mini to generate customer support responses, and also uses GPT-5-mini to evaluate those responses, could that subtly inflate the model’s performance score?

As we continue to give AI Agents more capability and autonomy to perform actions on behalf of humans, false positives/negatives with LLM judgements carry real costs. Poorly judged AI systems can misdiagnose medical conditions, make accidental data deletions, execute financially unsound stock trades, and much more. 

Therefore even small evaluator biases have large downstream effects, making it prudent to understand model-specific evaluator behavior.

Experiment design

We designed a simple controlled setup to test this.

Step 1: Generate responses

We created 7 prompts spanning different reasoning styles:

  • Business strategy
  • Policy reasoning
  • Technical explanation
  • Analytical forecasting
  • Constraint-following
  • Creative dialogue
  • Decision frameworks

Example prompt:

JSON {
  "sys_prompt": "You are an AI policy advisor. Provide balanced, thoughtful analysis. Avoid extreme positions and acknowledge uncertainty.",
  "user_prompt": "Should governments regulate large language models more strictly to prevent misinformation, or would that slow innovation too much? Provide a balanced argument and conclude with a nuanced recommendation."
}

Each model received the exact same system prompt and user prompt.

We intentionally kept the setup single-turn to avoid conversational branching effects influencing results.

Step 2: Cross-model scoring

Next, each model evaluated:

  • Its own response
  • Every other model’s response

All scores were returned using a structured 1–10 rating scale to maintain consistency.

Using the following output schema:

Python
# Each LLM acting as a judge must return the following schema
class ModelScore(BaseModel):
	justification: str
	score: int  # 1 = poor, 10 = excellent

This allowed us to directly compare how each model judged itself relative to others.

Metric 1: How strict is a model toward competitors?

The first question we asked:

Does a model rate itself higher than it rates other models?

We computed:

Self-Scoring Difference = Avg score given to itself − Avg score given to other models

A high positive Self-Scoring Difference means the model rated itself much higher than it rated competing models.

A value near zero means the model rated itself about the same as it rated others.

A high negative Self-Scoring Difference means the model rated competitors much higher than it rated itself.



Evaluator Self-Scoring Difference

Key findings

  • GPT-5-mini: +1.02
  • GPT-4.1-mini: +0.33
  • Claude Opus / Sonnet: ~+0.24
  • GPT-4o-mini & o3-mini: ~0
  • Claude Haiku: slight negative difference

On average, GPT-5-mini gave itself a score 1.02 points higher (on a 1–10 scale) than it gave competing models. At first glance, GPT-5-mini appears heavily self-favoring.

But there’s an important nuance.

GPT-5-mini gave competitors an average score of roughly 8.12, while giving itself around 9.14.

However, it’s entirely possible that GPT-5-mini simply produced better responses than the others.

To better understand this, we looked at a second metric.

Metric 2: Self vs peer perception gap

Instead of asking how models rate others, we flipped the direction of the comparison.

Do models rate themselves higher than other models rate them?

We defined:

Self vs Peer Gap = Avg self score − Avg score received from other models

A high positive Self vs Peer Gap means the model rated itself significantly higher than its peers rated it, suggesting potential self-inflation bias relative to peer perception.

A value near zero means the model rated itself about the same as others rated it.

A high negative Self vs Peer Gap means peers rated the model higher than it rated itself, suggesting the model may be conservative in its self-assessment.

This tells us whether a model is inflating its own perceived performance relative to peers.

Self vs Peer Perception Gap

What we observed

Some models, such as GPT-4.1-mini and o3-mini, rated themselves noticeably higher than peers rated them.

But GPT-5-mini showed the opposite behavior.

Peers actually rated GPT-5-mini slightly higher than it rated itself.

So GPT-5-mini appears to:

  • Not inflate its own performance relative to peer perception
  • Receive genuinely high ratings from other models

That’s a much more nuanced picture than the first metric suggested.

What this means

Several patterns emerged:

  • Models do not behave identically as evaluators
  • Some show clear self-scoring bias
  • Others appear effectively neutral
  • A few even rate themselves lower than peers do

This suggests evaluator behavior can be model-dependent.

Importantly, this pattern was not universal. Some models exhibited noticeable self-scoring biases, while others showed little to none.

This doesn’t mean LLM-based evaluation is fundamentally flawed.

But it does suggest that evaluator choice can influence measured performance in certain setups.

If a model has measurable self-scoring differences, using it as both generator and evaluator may subtly influence performance metrics.

The bigger point

LLMs are increasingly acting as automated judges.

And like any judge, they have tendencies.

If AI is evaluating AI, the evaluator becomes part of the system.

It’s worth measuring the judge, not just the model being judged.

What’s next

This experiment is far from perfect.

To improve the quality and reliability of this analysis, we plan to:

  • Test on more specific, less subjective prompts derived from real-world AI agent use cases
  • Run multiple trials for each model combination to account for non-determinism and gather more robust statistics
  • Compare LLM judges with other evaluation approaches such as programmatic scoring and human annotations
  • Analyze the token cost of LLM judges alongside their evaluation accuracy to better understand the cost–quality tradeoff

References

[1] Zhou, H., Huang, H., Long, Y., Xu, B., Zhu, C., Cao, H., ... & Zhao, T. (2024, July). Mitigating the bias of large language model evaluation. In Proceedings of the 23rd Chinese National Conference on Computational Linguistics (Volume 1: Main Conference) (pp. 1310-1319).

Agent Engineering
Smiling person with glasses, dark hair, wearing a dark shirt against a plain background. Their expression conveys happiness and friendliness.
Zach Ta, Voker
Founding Full Stack Engineer
Abstract Shape

More from the Blog

View all articles
A person in a gray shirt works at a multi-monitor desk setup, focused on a laptop. The workspace is organized and modern, conveying concentration.
AI Product Management
12 min read

What is Agent Analytics?

Measuring AI Agents in Production

A smiling person with glasses and short hair, wearing a light shirt against a plain background. The image conveys a friendly and approachable mood.
Tyler Postle
·
Apr 15, 2026
Read article
A dark night scene shows a tombstone with "R.I.P. Prompt Engineering, 2022–2025." Below, text reads, "Prompt Engineering is Dead. Long live Agent Engineering." Stars and a moon are in the sky.
AI Product Management
4 min read

The Rise of the Agent Engineer

The prompt engineer died so Agent Engineers could thrive

A smiling person with glasses and short hair, wearing a light shirt against a plain background. The image conveys a friendly and approachable mood.
Tyler Postle
·
Mar 22, 2026
Read article
A man in a t-shirt gestures while speaking to a colleague standing nearby in an office setting, suggesting a collaborative and focused atmosphere.
AI Product Management
5 min read

Agent Analytics FAQ

Everything you need to know to start measuring what your AI agent actually does

A smiling person with glasses and short hair, wearing a light shirt against a plain background. The image conveys a friendly and approachable mood.
Tyler Postle
·
Mar 5, 2026
Read article