The Far-and-Good Frontier · Living Mirrors Institute

Standings

We gave 11 different AIs the same thinking move. Then we measured which of them produced more ideas that were both surprising and useful.

Each AI is compared against its own normal self. The big bar is its overall score: the longer the bar, the better the move worked. Here is what every column means, in plain words.

Overall score: The headline. How well the move worked on this AI, all things counted. Shown as a bar. Longer is better.

How much more original: How much further its ideas moved from its usual answers. "1.5x" means about one and a half times more original than normal.

Still useful? (/10): A judge rated how good the ideas actually were, out of 10. Original is no good if it is useless.

Hit the sweet spot: Out of all its ideas, the share that were both surprising AND useful. The corner that counts.

Reliable win?: We ran 6 different problems. "Yes, won all 6" means the move helped every single time, not just by luck.

Technique · Tier: Technique = the move used (forced collision). Tier = a top flagship AI (F) or a free open one (O).
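A multiplier like the "1.5x" in the "How much more original" column could be computed along these lines. This is only a sketch under assumptions the page does not confirm: each idea is represented as a point (for example an embedding), and originality is the average distance from the centre of the AI's own normal answers. The function names and toy points are hypothetical.

```python
import math

def centroid(points):
    """Average position of a set of points (the 'normal' centre)."""
    n = len(points)
    dim = len(points[0])
    return [sum(p[i] for p in points) / n for i in range(dim)]

def mean_distance(points, center):
    """Average Euclidean distance from each point to the centre."""
    return sum(math.dist(p, center) for p in points) / len(points)

def originality_multiplier(vanilla, treated):
    """How many times further the treated ideas sit from the vanilla
    centre than the vanilla ideas themselves do (assumed definition)."""
    center = centroid(vanilla)
    return mean_distance(treated, center) / mean_distance(vanilla, center)

# Toy 2-D points, invented for illustration only.
vanilla = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0)]
treated = [(1.5, 0.0), (-1.5, 0.0), (0.0, 1.5), (0.0, -1.5)]
print(f"{originality_multiplier(vanilla, treated):.1f}x")  # 1.5x
```

Because the baseline is the AI's own normal answers, the number compares each AI with itself rather than with other AIs.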

Cross-model run, 2026-05-24. Each AI writes the ideas and is scored against its own normal default; one fixed judge rates every idea blind. Sorted by overall score. 6 problems × 5 tries each.

#	AI (lab)	Overall score (MQ)	How much more original (distance, × vanilla)	Still useful? (/10, judged value)	Hit the sweet spot (far-and-good %)	Reliable win? (6-problem sign test)	Technique · Tier
1	DeepSeek-V3.1 (DeepSeek)	0.349	1.3x	6.4	74%	✓ yes, won all 6	Forced collision · flagship
2	Cobuddy (Baidu)	0.344	1.4x	6.5	80%	✓ yes, won all 6	Forced collision · open
3	Gemini-2.0-flash (Google)	0.341	1.3x	6.3	67%	✓ yes, won all 6	Forced collision · flagship
4	Claude opus (Anthropic)	0.334	1.5x	7.0	95%	✓ yes, won all 6	Forced collision · flagship
5	Llama-4-Maverick (Meta)	0.332	1.8x	5.7	65%	✓ yes, won all 6	Forced collision · flagship
6	Qwen-2.5-72B (Alibaba)	0.331	1.4x	5.8	58%	✓ yes, won all 6	Forced collision · flagship
7	Nemotron-3-120B (NVIDIA)	0.311	1.6x	5.1	55%	✓ yes, won all 6	Forced collision · open
8	LFM-2.5 (Liquid) · non-transformer	0.302	1.5x	4.3	25%	✓ yes, won all 6	Forced collision · open
9	Laguna (Poolside)	0.302	1.6x	5.2	54%	won 4 of 6 (close)	Forced collision · open
10	GPT-OSS-120B (OpenAI)	0.293	1.5x	5.3	45%	✓ yes, won all 6	Forced collision · open
11	GPT-5-mini (OpenAI)	0.290	1.3x	6.4	76%	✓ yes, won all 6	Forced collision · flagship

✓ won all 6 problems (a real, reliable win) · pointing the right way, but not a clean sweep yet · every AI is scored against its own normal answers

The overall score

Surprising + useful

It rewards ideas that are both far from normal AND genuinely good. Just being weird does not count.
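One way to picture the "both far AND good" rule is as a count of ideas that clear two bars at once. The sketch below is an illustration only: the cutoff values, the function name, and the toy ideas are hypothetical, and the page does not state where the real cutoffs sit.

```python
def far_and_good_rate(ideas, far_cutoff, good_cutoff):
    """Share of ideas that are both far from normal and judged good.

    `ideas` is a list of (distance, judged_value) pairs. Both cutoffs
    are assumptions made for this sketch, not the benchmark's values.
    """
    hits = sum(
        1
        for distance, value in ideas
        if distance >= far_cutoff and value >= good_cutoff
    )
    return hits / len(ideas)

# Toy ideas as (distance, judged value out of 10), invented for illustration.
ideas = [
    (1.6, 7.0),  # far and good: counts
    (1.2, 8.0),  # good but not far: does not count
    (1.8, 4.0),  # far but weak: does not count
    (1.7, 6.5),  # far and good: counts
]
rate = far_and_good_rate(ideas, far_cutoff=1.5, good_cutoff=6.0)
print(f"{rate:.0%}")  # 50%
```

The third idea is the "just being weird" case: it is the furthest from normal of the four, yet it adds nothing to the rate.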

It worked on

10 of 11 AIs

All 6 top flagships won all 6 problems, plus 4 of the 5 free open ones. For any single AI, the odds of sweeping all 6 by luck are tiny: about 1 in 64 (p=0.0156).
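The p-value quoted above matches what a one-sided sign test gives for 6 wins out of 6, if each problem is treated as a fair coin flip when the move does nothing. That reading is an assumption (the page does not spell out the test settings), and the helper name below is hypothetical.

```python
from math import comb

def sign_test_p(wins, trials):
    """One-sided sign test: chance of at least `wins` wins in `trials`
    problems if every problem were a 50/50 coin flip (no ties assumed)."""
    favourable = sum(comb(trials, k) for k in range(wins, trials + 1))
    return favourable / 2 ** trials

print(sign_test_p(6, 6))  # 0.015625
print(sign_test_p(4, 6))  # 0.34375
```

Under the same assumption, winning 4 of 6 happens by luck about a third of the time, which is why a 4-of-6 result reads as "close" rather than as a reliable win.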

Even a different kind of AI

a non-transformer

Liquid's LFM-2.5 is built differently, and the move still worked. So it is not a fluke of one design.

Open standard

Submit your technique.

The benchmark is meant to be run by anyone, on any model, including a competitor's. A modulation claim only counts here if it can be measured on the same value-aware frontier and reproduced. To add a model or a new technique to the board:

Run your arm against the model's own vanilla default cloud on the shared 6-brief set, 5 samples per brief.
Score every idea on both axes, distance from the average and judged quality, with a fixed blind judge.
Report the full distribution, the far-and-good rate, the MQ, and the per-brief sign test, including where it fails.
Send the run directory so the result can be independently reproduced before it joins the board.
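The steps above can be sketched as a small per-brief check run before submitting. Everything here is illustrative: the function name, the brief ids, and the toy scores are hypothetical, and the win rule (the arm's mean score beats the vanilla mean on that brief) is an assumption, since the exact rule is not given on this page.

```python
from statistics import mean

BRIEFS = 6   # shared brief set, per the steps above
SAMPLES = 5  # samples per brief, per the steps above

def per_brief_report(vanilla, arm):
    """Count per-brief wins for a technique arm against the vanilla default.

    Both arguments map a brief id to a list of per-idea scores.
    Assumption: a brief is a win when the arm's mean beats the vanilla mean.
    """
    if set(vanilla) != set(arm) or len(arm) != BRIEFS:
        raise ValueError("both arms must cover the same 6 briefs")
    wins, failed = [], []
    for brief in sorted(arm):
        if len(arm[brief]) != SAMPLES:
            raise ValueError(f"{brief}: expected {SAMPLES} samples")
        if mean(arm[brief]) > mean(vanilla[brief]):
            wins.append(brief)
        else:
            failed.append(brief)
    # Failing briefs are listed explicitly rather than dropped.
    return {"wins": len(wins), "failed_briefs": failed}

# Toy scores, invented for illustration only.
vanilla = {f"brief-{i}": [0.2] * SAMPLES for i in range(1, 7)}
arm = {f"brief-{i}": [0.3 if i <= 4 else 0.1] * SAMPLES for i in range(1, 7)}
print(per_brief_report(vanilla, arm))
# {'wins': 4, 'failed_briefs': ['brief-5', 'brief-6']}
```

Returning the failing briefs alongside the win count keeps the report in line with the requirement to show where a technique fails.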


Honesty is the moat. A submission that hides its failures is not a submission. The frontier flagships (GPT-5-mini, Gemini-2.0-flash, Llama-4-Maverick, Qwen-2.5-72B, DeepSeek-V3.1) are now on the board, added on these same terms.