Do LLMs Play Favorites?
Evaluating whether LLMs score their own outputs differently than their peers

Large language models are increasingly being used to evaluate other large language models.[1]
Instead of relying exclusively on human annotators, many teams now use LLMs to:
- Score responses
- Rank outputs
- Benchmark performance
This dramatically reduces evaluation cost and enables continuous benchmarking at scale.
But it raises a simple question:
Can AI fairly evaluate AI?
More specifically:
Do models rate their own outputs differently than they rate others?
At Voker, we’re building an analytics platform designed to monitor how AI agents perform in real-world deployments. That includes evaluating agent responses at scale, which means evaluation systems themselves become part of the product.
To better understand this dynamic, we ran a small experiment.
This article covers:
- The evaluation bias question we wanted to test
- How we structured the experiment
- What patterns emerged in cross-model scoring
- Why evaluator choice might matter more than expected
The question we tested
In many evaluation pipelines, it’s common to:
- Generate responses using Model A
- Evaluate those responses using Model A
This approach keeps evaluation pipelines simple and fast.
But what if the evaluator isn’t completely neutral?
For example:
If a company uses GPT-5-mini to generate customer support responses, and also uses GPT-5-mini to evaluate those responses, could that subtly inflate the model’s performance score?
As we give AI agents more capability and autonomy to act on behalf of humans, false positives and false negatives in LLM judgments carry real costs: a poorly judged AI system can misdiagnose a medical condition, accidentally delete data, execute an unsound stock trade, and more.
Even small evaluator biases can therefore have large downstream effects, which makes it prudent to understand model-specific evaluator behavior.
Experiment design
We designed a simple controlled setup to test this.
Step 1: Generate responses
We created 7 prompts spanning different reasoning styles:
- Business strategy
- Policy reasoning
- Technical explanation
- Analytical forecasting
- Constraint-following
- Creative dialogue
- Decision frameworks
Example prompt:
```json
{
  "sys_prompt": "You are an AI policy advisor. Provide balanced, thoughtful analysis. Avoid extreme positions and acknowledge uncertainty.",
  "user_prompt": "Should governments regulate large language models more strictly to prevent misinformation, or would that slow innovation too much? Provide a balanced argument and conclude with a nuanced recommendation."
}
```
Each model received the exact same system prompt and user prompt.
We intentionally kept the setup single-turn to avoid conversational branching effects influencing results.
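For context, the generation loop is conceptually simple. The sketch below is a minimal illustration rather than our production code: `call_model` is a hypothetical helper that wraps whichever provider SDK a given model requires and returns the model's text completion.

```python
# Minimal sketch of the generation step (not production code).
# `call_model` is a hypothetical helper wrapping each provider's SDK.

def generate_responses(prompts: list[dict], models: list[str]) -> dict[str, list[str]]:
    """Return {model_name: [response per prompt]} using identical prompts for every model."""
    responses: dict[str, list[str]] = {model: [] for model in models}
    for prompt in prompts:
        for model in models:
            # Every model sees the exact same system and user prompt, single-turn.
            text = call_model(
                model=model,
                system=prompt["sys_prompt"],
                user=prompt["user_prompt"],
            )
            responses[model].append(text)
    return responses
```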
Step 2: Cross-model scoring
Next, each model evaluated:
- Its own response
- Every other model’s response
All scores were returned using a structured 1–10 rating scale to maintain consistency.
Each judge returned its verdict using the following output schema:
```python
from pydantic import BaseModel

# Each LLM acting as a judge must return this schema
class ModelScore(BaseModel):
    justification: str  # brief reasoning behind the rating
    score: int          # 1 = poor, 10 = excellent
```
This allowed us to directly compare how each model judged itself relative to others.
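The cross-scoring loop itself looks roughly like the sketch below. `judge` is a hypothetical helper that shows the judge model the original prompt plus one candidate response and parses the reply into a `ModelScore` (for example, via the provider's structured-output support).

```python
# Minimal sketch of the cross-scoring step (not production code).
# `judge` is a hypothetical helper returning a parsed ModelScore.

def cross_score(prompts: list[dict], responses: dict[str, list[str]],
                models: list[str]) -> dict[str, dict[str, list[ModelScore]]]:
    """Return scores[judge][author] -> one ModelScore per prompt."""
    scores = {j: {a: [] for a in models} for j in models}
    for i, prompt in enumerate(prompts):
        for judge_model in models:
            for author in models:  # includes author == judge_model (self-scoring)
                verdict = judge(
                    judge_model=judge_model,
                    prompt=prompt,
                    response=responses[author][i],
                )
                scores[judge_model][author].append(verdict)
    return scores
```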
Metric 1: How strict is a model toward competitors?
The first question we asked:
Does a model rate itself higher than it rates other models?
We computed:
Self-Scoring Difference = Avg score given to itself − Avg score given to other models
A high positive Self-Scoring Difference means the model rated itself much higher than it rated competing models.
A value near zero means the model rated itself about the same as it rated others.
A high negative Self-Scoring Difference means the model rated competitors much higher than it rated itself.
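As a concrete illustration, assuming the nested `scores[judge][author]` structure from the sketch above, the Self-Scoring Difference for a single judge could be computed like this:

```python
from statistics import mean

def self_scoring_difference(scores: dict, judge_model: str) -> float:
    """Avg score the judge gave its own responses minus avg score it gave everyone else's."""
    self_scores = [v.score for v in scores[judge_model][judge_model]]
    other_scores = [
        v.score
        for author, verdicts in scores[judge_model].items()
        if author != judge_model
        for v in verdicts
    ]
    return mean(self_scores) - mean(other_scores)
```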

Key findings
- GPT-5-mini: +1.02
- GPT-4.1-mini: +0.33
- Claude Opus / Sonnet: ~+0.24
- GPT-4o-mini & o3-mini: ~0
- Claude Haiku: slight negative difference
On average, GPT-5-mini gave itself a score 1.02 points higher (on a 1–10 scale) than it gave competing models. At first glance, GPT-5-mini appears heavily self-favoring.
But there’s an important nuance.
GPT-5-mini gave competitors an average score of roughly 8.12, while giving itself around 9.14.
However, it’s entirely possible that GPT-5-mini simply produced better responses than the others.
To better understand this, we looked at a second metric.
Metric 2: Self vs peer perception gap
For the second metric, we flipped the direction of the comparison: instead of looking at the scores a model gives, we looked at the scores it receives.
Do models rate themselves higher than other models rate them?
We defined:
Self vs Peer Gap = Avg self score − Avg score received from other models
A high positive Self vs Peer Gap means the model rated itself significantly higher than its peers rated it, suggesting potential self-inflation bias relative to peer perception.
A value near zero means the model rated itself about the same as others rated it.
A high negative Self vs Peer Gap means peers rated the model higher than it rated itself, suggesting the model may be conservative in its self-assessment.
This tells us whether a model is inflating its own perceived performance relative to peers.
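Using the same hypothetical `scores[judge][author]` structure, the Self vs Peer Gap simply flips which axis we average over:

```python
from statistics import mean

def self_vs_peer_gap(scores: dict, model: str) -> float:
    """Avg score the model gave itself minus avg score it received from the other judges."""
    self_scores = [v.score for v in scores[model][model]]
    peer_scores = [
        v.score
        for judge_model, by_author in scores.items()
        if judge_model != model
        for v in by_author[model]
    ]
    return mean(self_scores) - mean(peer_scores)
```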

What we observed
Some models, such as GPT-4.1-mini and o3-mini, rated themselves noticeably higher than peers rated them.
But GPT-5-mini showed the opposite behavior.
Peers actually rated GPT-5-mini slightly higher than it rated itself.
So GPT-5-mini appears to:
- Not inflate its own performance relative to peer perception
- Receive genuinely high ratings from other models
That’s a much more nuanced picture than the first metric suggested.
What this means
Several patterns emerged:
- Models do not behave identically as evaluators
- Some show clear self-scoring bias
- Others appear effectively neutral
- A few even rate themselves lower than peers do
This suggests that evaluator behavior is model-dependent rather than universal: some models exhibit noticeable self-scoring bias, while others show little to none.
This doesn’t mean LLM-based evaluation is fundamentally flawed.
But it does suggest that evaluator choice can influence measured performance in certain setups.
If a model has measurable self-scoring differences, using it as both generator and evaluator may subtly influence performance metrics.
The bigger point
LLMs are increasingly acting as automated judges.
And like any judge, they have tendencies.
If AI is evaluating AI, the evaluator becomes part of the system.
It’s worth measuring the judge, not just the model being judged.
What’s next
This experiment is far from perfect.
To improve the quality and reliability of this analysis, we plan to:
- Test on more specific, less subjective prompts derived from real-world AI agent use cases
- Run multiple trials for each model combination to account for non-determinism and gather more robust statistics
- Compare LLM judges with other evaluation approaches such as programmatic scoring and human annotations
- Analyze the token cost of LLM judges alongside their evaluation accuracy to better understand the cost–quality tradeoff
References
[1] Zhou, H., Huang, H., Long, Y., Xu, B., Zhu, C., Cao, H., ... & Zhao, T. (2024, July). Mitigating the bias of large language model evaluation. In Proceedings of the 23rd Chinese National Conference on Computational Linguistics (Volume 1: Main Conference) (pp. 1310-1319).
