Living Mirrors Institute · Leaderboard
The open standard for cognitive modulation

The far-and-good frontier.

One simple idea. We gave each AI the same thinking move, then scored how often it came up with ideas that were both surprising and actually useful. The higher the bar, the better the move worked on that AI. Weird-but-useless does not count. Only far-and-good counts.

01

Standings

We gave 11 different AIs the same thinking move. Then we measured who got more surprising-and-useful ideas.

Each AI is compared against its own normal self. The big bar is its overall score. Higher bar, better the move worked. Here is what every column means, in plain words.

Overall score: The headline. How well the move worked on this AI, all things counted. Shown as a bar. Longer is better.
How much more original: How much further its ideas moved from its usual answers. "1.5x" means about one and a half times more original than normal.
Still useful? (/10): A judge rated how good the ideas actually were, out of 10. Original is no good if it is useless.
Hit the sweet spot: Out of all its ideas, the share that were both surprising AND useful. The corner that counts.
Reliable win?: We ran 6 different problems. "Yes, won all 6" means the move helped every single time, not just by luck.
Technique · Tier: Technique = the move used (forced collision). Tier = a top flagship AI (F) or a free open one (O).
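As a rough illustration of how the "How much more original" and "Hit the sweet spot" columns could be derived from per-idea scores, here is a minimal sketch. The `Idea` record, the cutoffs `FAR_CUTOFF` and `GOOD_CUTOFF`, and the use of a plain mean are all assumptions made for the example; this page does not publish the exact distance measure or thresholds.

```python
from dataclasses import dataclass
from statistics import mean

# Hypothetical cutoffs: the page does not state the real ones.
FAR_CUTOFF = 1.0   # distance relative to the vanilla mean distance
GOOD_CUTOFF = 6.0  # judge score out of 10


@dataclass
class Idea:
    distance: float  # how far the idea sits from the model's usual answers
    quality: float   # blind judge rating, 0 to 10


def originality_ratio(treated, vanilla):
    """Mean distance with the move, divided by mean distance without it."""
    return mean(i.distance for i in treated) / mean(i.distance for i in vanilla)


def far_and_good_rate(treated, vanilla):
    """Share of treated ideas that clear both the distance and quality cutoffs."""
    baseline = mean(i.distance for i in vanilla)
    hits = [
        i for i in treated
        if i.distance / baseline > FAR_CUTOFF and i.quality >= GOOD_CUTOFF
    ]
    return len(hits) / len(treated)


# Toy data, invented for the example (not taken from the board).
vanilla = [Idea(0.20, 6.0), Idea(0.30, 7.0), Idea(0.10, 5.0)]
treated = [Idea(0.30, 7.0), Idea(0.45, 6.5), Idea(0.15, 8.0)]
ratio = originality_ratio(treated, vanilla)
rate = far_and_good_rate(treated, vanilla)
```

On this toy data the ratio comes out at about 1.5 (the "1.5x" style figure), and two of the three treated ideas land in the far-and-good corner. The third is good but not far, so it does not count.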

Cross-model run, 2026-05-24. Each AI writes the ideas and is scored against its own normal default; one fixed judge rates every idea blind. Sorted by overall score. 6 problems × 5 tries each.

| # | AI | Lab | Overall score (MQ) | How much more original (dist ×van) | Still useful? (/10, judged value) | Hit the sweet spot (far-and-good %) | Reliable win? (6-problem sign test) | Technique | Tier |
|---|---|---|---|---|---|---|---|---|---|
| 1 | DeepSeek-V3.1 | DeepSeek | 0.349 | 1.3x | 6.4 | 74% | yes, won all 6 | Forced collision | flagship |
| 2 | Cobuddy | Baidu | 0.344 | 1.4x | 6.5 | 80% | yes, won all 6 | Forced collision | open |
| 3 | Gemini-2.0-flash | Google | 0.341 | 1.3x | 6.3 | 67% | yes, won all 6 | Forced collision | flagship |
| 4 | Claude opus | Anthropic | 0.334 | 1.5x | 7.0 | 95% | yes, won all 6 | Forced collision | flagship |
| 5 | Llama-4-Maverick | Meta | 0.332 | 1.8x | 5.7 | 65% | yes, won all 6 | Forced collision | flagship |
| 6 | Qwen-2.5-72B | Alibaba | 0.331 | 1.4x | 5.8 | 58% | yes, won all 6 | Forced collision | flagship |
| 7 | Nemotron-3-120B | NVIDIA | 0.311 | 1.6x | 5.1 | 55% | yes, won all 6 | Forced collision | open |
| 8 | LFM-2.5 | Liquid · non-transformer | 0.302 | 1.5x | 4.3 | 25% | yes, won all 6 | Forced collision | open |
| 9 | Laguna | Poolside | 0.302 | 1.6x | 5.2 | 54% | won 4 of 6 (close) | Forced collision | open |
| 10 | GPT-OSS-120B | OpenAI | 0.293 | 1.5x | 5.3 | 45% | yes, won all 6 | Forced collision | open |
| 11 | GPT-5-mini | OpenAI | 0.290 | 1.3x | 6.4 | 76% | yes, won all 6 | Forced collision | flagship |
Legend:
"yes, won all 6" = won all 6 problems (a real, reliable win)
"won 4 of 6 (close)" = pointing the right way, but not a clean sweep yet
Every AI is scored against its own normal answers.
The overall score
Surprising + useful

It rewards ideas that are both far from normal AND genuinely good. Just being weird does not count.

It worked on
10 of 11 AIs

All 6 top flagships won all 6 problems, plus 4 of the 5 free open ones. For any single AI, the odds of a 6-for-6 sweep by luck alone are small (one-sided sign test, p=0.0156).
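The p=0.0156 figure is what a one-sided sign test gives for one AI winning 6 of 6 problems. A minimal sketch of that arithmetic follows, assuming each problem is an independent 50/50 coin flip under the "no effect" hypothesis and that ties are ignored. The function name `sign_test_p` is made up for the example.

```python
from math import comb


def sign_test_p(wins: int, trials: int) -> float:
    """One-sided sign test: chance of at least `wins` wins in `trials` fair coin flips."""
    favorable = sum(comb(trials, k) for k in range(wins, trials + 1))
    return favorable / 2 ** trials


p_sweep = sign_test_p(6, 6)  # 1 / 64
p_four = sign_test_p(4, 6)   # (15 + 6 + 1) / 64
```

Here `p_sweep` is 0.015625, which rounds to the 0.0156 quoted above. `p_four` is 0.34375, which is why a 4-of-6 result reads as pointing the right way rather than as a reliable win.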

Even a different kind of AI
a non-transformer

Liquid's LFM-2.5 is built differently, and the move still worked. So it is not a fluke of one design.

Open standard

Submit your technique.

The benchmark is meant to be run by anyone, on any model, including a competitor's. A modulation claim only counts here if it can be measured on the same value-aware frontier and reproduced. To add a model or a new technique to the board:

  1. Run your arm (the technique under test) against the model's own vanilla default cloud, meaning its normal answers with no technique applied, on the shared 6-brief set, 5 samples per brief.
  2. Score every idea on both axes, distance from the average and judged quality, with a fixed blind judge.
  3. Report the full distribution, the far-and-good rate, the MQ, and the per-brief sign test, including where it fails.
  4. Send the run directory so the result can be independently reproduced before it joins the board.
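
The steps above could be checked with a small script before a run is sent in. The sketch below only validates the 6 × 5 shape and tallies per-brief wins and failures. The row layout `(brief_id, arm, score)`, the arm labels, and the win rule (higher mean score than vanilla on that brief) are illustrative assumptions, not the board's actual format, and MQ is left out because its formula is not given on this page.

```python
from collections import defaultdict
from statistics import mean

BRIEFS = 6   # shared brief set size, from the rules above
SAMPLES = 5  # samples per brief, from the rules above


def summarize_run(rows):
    """rows: iterable of (brief_id, arm, score), arm being "vanilla" or "move".

    `score` stands in for whatever per-idea metric the run reports.
    """
    by_cell = defaultdict(list)
    for brief_id, arm, score in rows:
        by_cell[(brief_id, arm)].append(score)

    briefs = sorted({brief for brief, _ in by_cell})
    if len(briefs) != BRIEFS:
        raise ValueError(f"expected {BRIEFS} briefs, got {len(briefs)}")
    for brief in briefs:
        for arm in ("vanilla", "move"):
            n = len(by_cell[(brief, arm)])
            if n != SAMPLES:
                raise ValueError(
                    f"brief {brief!r}, arm {arm!r}: expected {SAMPLES} samples, got {n}"
                )

    # Assumed win rule: the move beats vanilla on mean score for that brief.
    wins = [
        b for b in briefs
        if mean(by_cell[(b, "move")]) > mean(by_cell[(b, "vanilla")])
    ]
    failed = [b for b in briefs if b not in wins]
    # Failures are reported explicitly rather than dropped.
    return {"wins": len(wins), "briefs": len(briefs), "failed_briefs": failed}


# Toy run, invented for the example: the move loses on brief 3 only.
rows = []
for brief in range(BRIEFS):
    for _ in range(SAMPLES):
        rows.append((brief, "vanilla", 1.0))
        rows.append((brief, "move", 0.5 if brief == 3 else 2.0))
report = summarize_run(rows)
```

Listing `failed_briefs` in the output keeps the losing briefs in the report, in line with the rule that a submission must show where it fails.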
Submit a run →

Honesty is the moat. A submission that hides its failures is not a submission. The frontier flagships (GPT-5-mini, Gemini-2.0-flash, Llama-4-Maverick, Qwen-2.5-72B, DeepSeek-V3.1) are now on the board, added on these same terms.