One simple idea. We gave each AI the same thinking move, then scored how often it came up with ideas that were both surprising and actually useful. The higher the bar, the better the move worked on that AI. Weird-but-useless does not count. Only far-and-good counts.
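The counting rule above can be sketched in a few lines of Python. This is only an illustration: the `Idea` fields, the two cut-offs `FAR_THRESHOLD` and `GOOD_THRESHOLD`, and the sample numbers are all hypothetical, since the source does not say where the benchmark draws the line for "far" or "good".

```python
from dataclasses import dataclass


@dataclass
class Idea:
    # Hypothetical field: how far the idea sits from the model's normal
    # output, as a ratio (1.0 = no farther than its default self).
    distance_ratio: float
    # Judged value on a 0-10 scale, as rated by the judge.
    value: float


# Assumed cut-offs for illustration only; the real thresholds are not given.
FAR_THRESHOLD = 1.0
GOOD_THRESHOLD = 6.0


def far_and_good_rate(ideas: list[Idea]) -> float:
    """Share of ideas that are both far from the default AND judged useful."""
    if not ideas:
        return 0.0
    hits = sum(
        1
        for idea in ideas
        if idea.distance_ratio > FAR_THRESHOLD and idea.value >= GOOD_THRESHOLD
    )
    return hits / len(ideas)


# Made-up sample, not benchmark data.
sample = [
    Idea(distance_ratio=1.6, value=7.0),  # far and good: counts
    Idea(distance_ratio=1.8, value=3.0),  # weird but useless: does not count
    Idea(distance_ratio=0.9, value=8.0),  # useful but not far: does not count
    Idea(distance_ratio=1.4, value=6.5),  # far and good: counts
]
rate = far_and_good_rate(sample)
print(f"far-and-good rate: {rate:.0%}")
```

The point of requiring both conditions at once is that a model cannot raise its score just by being strange, or just by being safe.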
Each AI is compared against its own normal self. The big bar is its overall score. Higher bar, better the move worked. Here is what every column means, in plain words.
Cross-model run, 2026-05-24. Each AI writes the ideas and is scored against its own normal default; one fixed judge rates every idea blind. Sorted by overall score. 6 problems × 5 tries each.
| # | AI (lab) | Overall score (MQ) | How much more original (dist ×van) | Still useful? (/10, judged value) | Hit the sweet spot (far-and-good %) | Reliable win? (6-problem sign test) | Technique (tier) |
|---|---|---|---|---|---|---|---|
| 1 | DeepSeek-V3.1 (DeepSeek) | | 1.3x | 6.4 | 74% | ✓ yes, won all 6 | Forced collision (flagship) |
| 2 | Cobuddy (Baidu) | | 1.4x | 6.5 | 80% | ✓ yes, won all 6 | Forced collision (open) |
| 3 | Gemini-2.0-flash (Google) | | 1.3x | 6.3 | 67% | ✓ yes, won all 6 | Forced collision (flagship) |
| 4 | Claude opus (Anthropic) | | 1.5x | 7.0 | 95% | ✓ yes, won all 6 | Forced collision (flagship) |
| 5 | Llama-4-Maverick (Meta) | | 1.8x | 5.7 | 65% | ✓ yes, won all 6 | Forced collision (flagship) |
| 6 | Qwen-2.5-72B (Alibaba) | | 1.4x | 5.8 | 58% | ✓ yes, won all 6 | Forced collision (flagship) |
| 7 | Nemotron-3-120B (NVIDIA) | | 1.6x | 5.1 | 55% | ✓ yes, won all 6 | Forced collision (open) |
| 8 | LFM-2.5 (Liquid · non-transformer) | | 1.5x | 4.3 | 25% | ✓ yes, won all 6 | Forced collision (open) |
| 9 | Laguna (Poolside) | | 1.6x | 5.2 | 54% | won 4 of 6 (close) | Forced collision (open) |
| 10 | GPT-OSS-120B (OpenAI) | | 1.5x | 5.3 | 45% | ✓ yes, won all 6 | Forced collision (open) |
| 11 | GPT-5-mini (OpenAI) | | 1.3x | 6.4 | 76% | ✓ yes, won all 6 | Forced collision (flagship) |
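For readers who want to check the "Reliable win?" column, here is a minimal sketch of a sign test over the 6 problems. It assumes a one-sided exact test with no ties, where the null hypothesis is that the move helps or hurts with equal chance on each problem. The source does not say which variant the benchmark uses, and the function name `sign_test_p_value` is made up for this example.

```python
from math import comb


def sign_test_p_value(wins: int, n: int) -> float:
    """One-sided exact sign test.

    Returns P(X >= wins) for X ~ Binomial(n, 0.5), i.e. the chance of
    seeing at least this many wins if each problem were a coin flip.
    """
    if not 0 <= wins <= n:
        raise ValueError("wins must be between 0 and n")
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2**n


p_all_six = sign_test_p_value(6, 6)      # 1/64
p_four_of_six = sign_test_p_value(4, 6)  # (15 + 6 + 1)/64 = 22/64

print(f"won 6 of 6: p = {p_all_six:.4f}")
print(f"won 4 of 6: p = {p_four_of_six:.4f}")
```

Under these assumptions, winning all 6 gives p = 1/64 (about 0.016), while winning 4 of 6 gives p = 22/64 (about 0.34). That is one way to read why the 4-of-6 row is marked "close" rather than as a reliable win. With only 6 problems, anything short of a clean sweep cannot reach the usual 0.05 level in this test.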
The benchmark is meant to be run by anyone, on any model, including a competitor's. A modulation claim only counts here if it can be measured on the same value-aware frontier and reproduced. To add a model or a new technique to the board:
Honesty is the moat. A submission that hides its failures is not a submission. The frontier flagships (GPT-5-mini, Gemini-2.0-flash, Llama-4-Maverick, Qwen-2.5-72B, DeepSeek-V3.1) are now on the board, added on these same terms.