Living Mirrors Institute

We measure how to make any AI think better.

Not which model is smartest. How to make any mind more original, and still useful, on demand. We built the modes that make it happen.

What we measure

Far and good.

Every idea is two things at once. How far from the obvious it is. How good it is. Most AI gives you near and safe. Push it to "be creative" and you get far and useless. The corner that matters is far AND good. We measure how often a mind lands there.

[Chart: four quadrants. The horizontal axis runs from Obvious to Far. The vertical axis runs from Useless to Useful.]

Near · Good. The safe default. What AI hands you anyway. Obvious, fine, forgettable.
Far · Good. Surprising AND useful. The only corner that counts. Far + good = the move.
Near · Bad. Junk. Obvious and still wrong. No distance, no value.
Far · Bad. Creative noise. Weird for the sake of weird. Far, but useless.

One corner is the whole game. We measure how often a mind reaches far and good. That is the test.
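
As a rough illustration of "how often a mind reaches far and good", the sketch below counts the share of scored ideas that land in the Far · Good corner. The 0-to-1 scales, the 0.5 cut-offs, and the names `ScoredIdea`, `quadrant`, and `far_and_good_rate` are assumptions made for this example, not the Institute's published scoring.

```python
from dataclasses import dataclass


@dataclass
class ScoredIdea:
    text: str
    distance: float  # how far from the obvious; assumed scaled to [0, 1]
    quality: float   # how useful; assumed scaled to [0, 1]


def quadrant(idea: ScoredIdea, far_cut: float = 0.5, good_cut: float = 0.5) -> str:
    """Name the corner of the chart an idea falls in (cut-offs are assumptions)."""
    far = "Far" if idea.distance >= far_cut else "Near"
    good = "Good" if idea.quality >= good_cut else "Bad"
    return f"{far} · {good}"


def far_and_good_rate(ideas, far_cut: float = 0.5, good_cut: float = 0.5) -> float:
    """Fraction of ideas that are both far and good; 0.0 for an empty list."""
    if not ideas:
        return 0.0
    hits = sum(1 for i in ideas if quadrant(i, far_cut, good_cut) == "Far · Good")
    return hits / len(ideas)
```

Comparing this rate with and without a technique applied gives one simple way to express a lift.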

The move, made visible

This is bisociation.

Two clusters drift apart. One is your problem. One is a far-off domain. Force a bridge between them and watch the idea fall out of the collision. Keep what is both far and useful.

Bisociation. Force a collision between your problem and a distant domain. Keep what is both far and useful.
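
A forced collision can be sketched as a prompt that pairs the problem with a distant domain and asks for a bridge. Everything here is hypothetical: the domain list, the prompt wording, and the function names are illustrations, not the Institute's actual prompts.

```python
import random

# Hypothetical far-off domains, chosen only for illustration.
FAR_DOMAINS = [
    "beekeeping",
    "tidal navigation",
    "bread fermentation",
    "medieval bookbinding",
]


def collision_prompt(problem: str, domain: str) -> str:
    """Build one prompt that forces a bridge between the problem and a domain."""
    return (
        f"Problem: {problem}\n"
        f"Distant domain: {domain}\n"
        "Propose one idea that solves the problem by borrowing a specific "
        "mechanism from the distant domain. Name the mechanism."
    )


def forced_collisions(problem: str, n: int, seed: int = 0, domains=FAR_DOMAINS):
    """Return n collision prompts, each with a randomly drawn domain."""
    rng = random.Random(seed)  # seeded so runs are repeatable
    return [collision_prompt(problem, rng.choice(domains)) for _ in range(n)]
```

Each prompt would then go to whichever model is under test. Scoring the replies for distance and usefulness is a separate step.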

What we found

One move. It is a law.

The field said the frontier of original-and-useful output barely moves. We made the modes that move it.

The move (what we tested)
1 move: forced bisociation.

The scale (where we tested it)
12 AI minds generated the ideas.
~10 distinct labs: open, frontier, and one non-transformer.

The proof (what held up)
11 / 12 significant (p = 0.0156).
6 / 6 frontier flagships swept: GPT-5, Claude, Gemini, Llama 4, Qwen, DeepSeek.
1 non-transformer: not a transformer quirk.

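
The page does not say which test produced p = 0.0156. As one point of reference, and purely as an assumption, the sketch below computes a one-sided sign test: the probability of seeing at least a given number of wins if each model were a fair coin flip. Six wins out of six gives 0.5^6 = 0.015625. The function name `sign_test_p` is made up for this example.

```python
from math import comb


def sign_test_p(wins: int, trials: int) -> float:
    """One-sided sign test: P(at least `wins` successes in `trials` fair coin flips)."""
    if not 0 <= wins <= trials:
        raise ValueError("wins must be between 0 and trials")
    favourable = sum(comb(trials, k) for k in range(wins, trials + 1))
    return favourable / 2 ** trials


# 6 wins of 6 -> 1/64 = 0.015625
p_sweep = sign_test_p(6, 6)
```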
The move

Collide the problem with a far-off domain.
Keep what is both far and useful.

Forced bisociation, preamble-free. We ran it on twelve minds across roughly ten labs. All six frontier flagships swept: GPT-5, Claude, Gemini, Llama 4, Qwen, DeepSeek. Even a non-transformer. It is a law, not a trick.

The reset, made visible

Over-generate. Then keep the best.

Generate many forced collisions at once. The far-and-good ones brighten. The rest fade. This is best-of-N selection. It works when the selector is smart enough to tell which is which.

The reset. Over-generate forced collisions. Keep only the far-and-good. Works when the selector is smart.
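
Best-of-N selection can be sketched in a few lines: score every candidate, then keep the top few. The dictionary fields and the `min`-based score are assumptions made for this example. In practice the scores would come from a selector model, and the ranking is only as reliable as that selector.

```python
def far_and_good_score(candidate: dict) -> float:
    """Score a candidate by its weaker axis, so it must be strong on both to rank high."""
    return min(candidate["distance"], candidate["quality"])


def best_of_n(candidates, score=far_and_good_score, keep: int = 1):
    """Rank candidates by score, highest first, and keep the top `keep`."""
    ranked = sorted(candidates, key=score, reverse=True)
    return ranked[:keep]


# Toy candidates with made-up scores on a 0-to-1 scale.
candidates = [
    {"idea": "safe default", "distance": 0.1, "quality": 0.9},
    {"idea": "creative noise", "distance": 0.9, "quality": 0.2},
    {"idea": "the move", "distance": 0.8, "quality": 0.7},
]
winner = best_of_n(candidates)[0]["idea"]
```

Taking the minimum rather than the sum is a deliberate choice here: a far-but-useless candidate cannot win on distance alone.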

The category

Cognitive modulation measurement.

Everyone else ranks how creative a model is. We rank the move that makes any model think better. We own the category because we measure the thing nobody else does.

Benchmark | What it measures | Whose creativity | Works on closed models | Tests modulation techniques | Cross-lab law | Reports failures | Human transfer
--- | --- | --- | --- | --- | --- | --- | ---
MMLU / HumanEval (capability benchmarks) | knowledge, code | model | yes | no | no | n/a | no
LiveIdeaBench / EQ-Bench (model creativity rankings) | creative output | model | yes | no | no | partial | no
Novelty-Frontier (arXiv 2504.09389, closest prior art) | original + high-quality | model + a few prompts | no, needs open data | a few | 3 open families | yes | no
The Modulation Test (cognitive modulation measurement) | far-and-good lift | the technique, on any mind | yes | yes, its whole point | 12 models / ~10 labs | yes, loudly | yes, by design

We did not invent the original-and-useful frontier. The closest prior work measured it and concluded you mostly cannot move it. We found the move that does, model-agnostic, technique-first, built to cross to humans.

Why this is the standard

Four absolutes. And one more.

I

Model-agnostic

We measure distance from each model's own default cloud, semantic and embedding-based. It works on closed frontier models too, not only open-data ones.
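
One way to picture "distance from a model's own default cloud" is the cosine distance between an idea's embedding and the centroid of that model's default-output embeddings. The page says the measure is embedding-based but does not name the metric, so the centroid-plus-cosine choice and all function names below are assumptions. The embeddings themselves would come from whatever embedding model is in use.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_distance(a, b) -> float:
    """1 - cosine similarity; 0.0 means same direction, 1.0 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def distance_from_default(idea_vec, default_vecs) -> float:
    """How far one idea sits from the centre of the model's default outputs."""
    return cosine_distance(idea_vec, centroid(default_vecs))
```

Because the reference point is each model's own default outputs, only generated text is needed, not model weights or training data.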

II

Technique-first

A standard for the moves that improve any mind. Not a leaderboard of models. We rank what works, on whatever you point it at.

III

Proven as a law

Same move, 12 models, roughly 10 labs, 11 of 12 significant, a 6 of 6 frontier sweep, even a non-transformer. Nobody else tests for universality across labs.

IV

Honest by construction

We publish where it fails. Stacking does not compound. Selection backfires on a weak model. The honesty is the product.

V

Built to cross to humans

The only benchmark designed as a wind tunnel. Validate a thinking move on machines cheaply, then carry the winners to people. No creativity benchmark does this.

The honest part

We measured our own magic.
Then we let it disprove us.

Stacking six cognitive modes does not compound. It dilutes. Selection backfires on a weak model. We published all of it.

We are not asking you to believe us. We are handing you the ruler.

The benchmark is open

If you have a way to make a mind think better, there is now a place to prove it.

The harness runs on a free model. The ruler is yours.