Living Mirrors Institute · Study 01
The Modulation Test · v0.2 · Living Mirrors Research

Can you make an AI think non-obvious ideas, and prove it?

A benchmark for cognitive modulation of large language models: measuring whether structured cognitive modes produce ideas that are measurably further from the average — and still good. Built because the field measures only distance, and distance without value is just noise.

9 arms · 4 runs · ~2,400 ideas judged · significance: p = 0.0156 · robust across 3 judges, 2 model families · updated 2026-05-23 · live
01

The headline

★ The champion primitive

Forced bisociation × best-of-N selection

Force the model to collide structurally distant concepts (Koestler, 1964), over-generate, then keep only the ideas that are both far from the average and genuinely good. It is the single most effective generative method we found — and it is a combination of mechanisms, not a stack of prompts.
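The forcing step can be pictured with a minimal sketch. It assumes concepts are already embedded as vectors and that "structurally distant" means high cosine distance; the toy vectors, the function names, and the prompt wording are all hypothetical and are not taken from the Modulation Test harness.

```python
import itertools
import math

def cosine_distance(a, b):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def most_distant_pair(concepts):
    """Return the pair of concept names whose vectors are furthest apart."""
    return max(
        itertools.combinations(sorted(concepts), 2),
        key=lambda pair: cosine_distance(concepts[pair[0]], concepts[pair[1]]),
    )

def collision_prompt(brief, pair):
    """Build a prompt that forces both concepts into one idea (hypothetical wording)."""
    a, b = pair
    return (
        f"Brief: {brief}\n"
        f"Generate an idea that only works by combining '{a}' with '{b}'."
    )

# Toy 3-d vectors standing in for real embeddings (illustrative values only).
TOY_CONCEPTS = {
    "beekeeping": [0.9, 0.1, 0.0],
    "honey": [0.8, 0.2, 0.1],
    "cryptography": [0.0, 0.2, 0.9],
}

pair = most_distant_pair(TOY_CONCEPTS)
prompt = collision_prompt("reduce food waste in cities", pair)
print(pair)
```

In this toy pool the near-synonyms (beekeeping, honey) are never paired; the prompt is built from the two concepts with the least overlap, which is the collision the method relies on.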

1.65× further from the default cloud than plain prompting
90–95% of ideas land "far AND good" across runs (vs ~50% baseline)
~2× the insight / surprise / beauty signatures vs baseline
6 / 6 briefs beat baseline · p = 0.0156
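The page does not name the test behind p = 0.0156. The figure is what a one-sided exact sign test gives for 6 wins out of 6 briefs, so the sketch below assumes that test; treat it as a plausibility check, not as the harness's actual statistics code.

```python
from math import comb

def sign_test_p_value(wins, n):
    """One-sided exact sign test: P(X >= wins) for X ~ Binomial(n, 0.5)."""
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n

p = sign_test_p_value(wins=6, n=6)
print(f"{p:.4f}")  # 0.0156
```

Under this assumption, 0.0156 (1/64) is the smallest p-value six briefs can produce. A clean sweep of 12 briefs would give 1/4096.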
02

What modulation does

Every cognitive mode beats the baseline. But the gains live in two places — a single well-chosen lens, or forcing the operation. Stacking lenses on top of each other does not compound. Bars show the Modulation Quotient (far-and-good composite); the champion is measured separately.

| Arm | MQ |
|---|---|
| spark_bestof | 0.42* |
| imagineer | 0.296 |
| christ | 0.294 |
| spark | 0.281 |
| spark_forced | 0.280 |
| savant | 0.280 |
| mirror | 0.280 |
| full_stack | 0.260 |
| spark+christ | 0.256 |
| vanilla | 0.220 |

Legend: * champion (separate run, replication pending) · single mode: all beat baseline · ▽ composition: sub-additive
03

Three things it proved

Finding 01

Forcing beats the lens

Holding a mode as a posture nudges distance from the default by only ~1.08×. Mechanically forcing the distant-concept collision moves it ~1.5–1.6×, with roughly double the insight/surprise/beauty signatures. The power is in the operation, not the attitude.

+58% distance · ~2× signatures
Finding 02

Stacking doesn't compound — no emergence found

Across six compositions (lens-stacks, savant pairs, and savant applied to the forced mechanism), every one scored at or below its best single component. A small run hinted at a positive for savant × forcing (+0.016), but it did not survive a higher-powered replication — savant added to the champion slightly hurt it. The moat is the mechanism (forced × best-of-N), not the stack.

6 compositions · all SA ≤ 0 at power · emergence: not found
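"SA" is not defined on this page. The sketch below assumes it is the composition's score minus the score of its best single component, so a value at or below zero means stacking added nothing. The MQ numbers are the ones reported in section 02; the function name is hypothetical.

```python
def sub_additivity(composition_score, component_scores):
    """Composition score minus the best single component; <= 0 means no gain from stacking."""
    return composition_score - max(component_scores)

# MQ values as reported in section 02 of this page.
MQ = {"christ": 0.294, "spark": 0.281, "spark+christ": 0.256}

sa = sub_additivity(MQ["spark+christ"], [MQ["spark"], MQ["christ"]])
print(round(sa, 3))  # -0.038
```

On these figures the spark+christ stack lands below christ alone, which matches the "at or below its best single component" reading of the finding.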
Finding 03

The quality tax is a selection problem

Forcing far ideas costs quality. Trying to refine that quality back in failed. Selecting works instead: over-generate, then keep only the far-and-good ideas. This lifts the far-and-good rate to ~95% and partly offsets the tax while keeping the distance. Generate wide, then choose. Not generate, then polish.

refine ✗ · select ✓ (far-and-good ↑)
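A minimal sketch of the selection step, assuming each generated idea already carries a novelty score and a judged value score. The thresholds, the `Idea` type, and the novelty × value ranking rule are illustrative assumptions, not the scoring used in the study.

```python
from dataclasses import dataclass

@dataclass
class Idea:
    text: str
    novelty: float  # distance from the default cloud, higher = further
    value: float    # judged quality, higher = better

def select_far_and_good(candidates, keep, min_novelty, min_value):
    """Keep the top `keep` candidates that clear both thresholds, ranked by novelty * value."""
    passing = [c for c in candidates if c.novelty >= min_novelty and c.value >= min_value]
    passing.sort(key=lambda c: c.novelty * c.value, reverse=True)
    return passing[:keep]

# Hypothetical over-generated pool for one brief.
pool = [
    Idea("far but weak", novelty=0.9, value=0.2),
    Idea("safe and solid", novelty=0.2, value=0.9),
    Idea("far and good", novelty=0.8, value=0.8),
    Idea("further and decent", novelty=0.9, value=0.6),
]
chosen = select_far_and_good(pool, keep=1, min_novelty=0.5, min_value=0.5)
print([c.text for c in chosen])
```

Nothing is rewritten here: weak candidates are discarded rather than polished, which is the difference between selecting and refining described above.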
04

The chart the field can't draw

Distance vs value scatter — the far-and-good frontier
Each dot is one idea. x = novelty (distance from the average), y = judged quality. The champion (dark) occupies the upper right: far AND good. Plain forcing (light) reaches far but bleeds into the low-value band.

Existing work — including the strongest recent results — optimises a single number: distance from the default cloud. But distance without value is just weirdness.

The Modulation Test scores every idea on both axes and reports the far-and-good frontier — the region a distance-only metric is blind to. That is the difference between a randomness injector and a cognitive instrument. It is also the metric a competitor cannot retrofit without rebuilding their evaluation from the ground up.
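To see why a single distance number can hide the difference, here is a small sketch with two made-up arms that share the same mean novelty but differ in value. The (novelty, value) pairs and the 0.5 thresholds are invented for illustration and are not study data.

```python
def mean_novelty(ideas):
    """Distance-only summary: average novelty across (novelty, value) pairs."""
    return sum(novelty for novelty, _ in ideas) / len(ideas)

def far_and_good_rate(ideas, min_novelty, min_value):
    """Share of (novelty, value) pairs that clear both thresholds."""
    hits = sum(1 for novelty, value in ideas if novelty >= min_novelty and value >= min_value)
    return hits / len(ideas)

# Invented scores: same novelty profile, different judged value.
forced_only = [(0.9, 0.3), (0.8, 0.7), (0.9, 0.2), (0.8, 0.6)]
forced_and_selected = [(0.9, 0.8), (0.8, 0.7), (0.9, 0.6), (0.8, 0.9)]

same_distance = mean_novelty(forced_only) == mean_novelty(forced_and_selected)
rate_forced = far_and_good_rate(forced_only, min_novelty=0.5, min_value=0.5)
rate_selected = far_and_good_rate(forced_and_selected, min_novelty=0.5, min_value=0.5)
print(same_distance, rate_forced, rate_selected)  # True 0.5 1.0
```

A distance-only metric scores the two arms identically; the two-axis rate separates them.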

05

Honest standing

A benchmark is only as trustworthy as its stated weaknesses. Here are both.

◆ Strengths

  • Significant & directional. Every mode beats baseline on 6/6 briefs, p = 0.0156.
  • Value-aware. Measures far and good — not distance alone.
  • Robust. Replicates across two embedder families (r 0.70–0.81) and three judge models across two families (Claude Opus + Sonnet, Google Gemma), with identical rankings.
  • Cross-family validated. A non-Claude judge (Google Gemma) agrees: spark_forced wins, all modes beat baseline 6/6 (p=0.0156). Not a self-preference artifact.
  • Science-grounded. Each axis cites a real cognitive lineage; signatures use Koestler's Aha / Ha-ha / Ah.
  • Composition-aware. The only framework that can even ask whether stacking cognition compounds.
  • Reproducible & cheap. Local embeddings, no proprietary pipeline; resumable, hardened harness.
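The "identical rankings" claim can be checked mechanically. The sketch below uses made-up scores for three hypothetical judges; only the check itself is the point, and none of the numbers come from the study.

```python
def ranking(scores):
    """Arm names ordered from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)

def rankings_agree(judges):
    """True when every judge orders the arms identically."""
    orders = [ranking(scores) for scores in judges.values()]
    return all(order == orders[0] for order in orders)

# Made-up scores for three hypothetical judges; only the ordering matters here.
judges = {
    "judge_a": {"spark_forced": 0.71, "spark": 0.55, "vanilla": 0.40},
    "judge_b": {"spark_forced": 0.64, "spark": 0.60, "vanilla": 0.52},
    "judge_c": {"spark_forced": 0.80, "spark": 0.47, "vanilla": 0.45},
}
agree = rankings_agree(judges)
print(agree)  # True
```

The judges disagree on absolute values yet agree on order, which is why a ranking claim can hold even when exact value numbers vary between judges.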

△ Weaknesses (open)

  • Judge variance. The same ideas vary by ±0.2 in value across runs, so exact value numbers need replication.
  • Cross-family: partial n. Google Gemma confirmed the ranking, but free-tier limits meant only ~half the ideas were scored (~110/arm). A funded Gemini-Flash pass would fully seal it.
  • Small brief set. 6 briefs so far; ≥12 needed for a publishable claim.
  • Pool-dependent composite. The MQ score is normalised per-run; cross-run comparison needs a fixed anchor (v0.3).
  • Two axes unbuilt. Retention and domain-transfer are specified but not yet run.
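The pool-dependence weakness can be illustrated with a short sketch. It assumes per-run normalisation means min-max scaling against the run's own pool, which the page does not specify; the anchor values and raw scores are hypothetical.

```python
def min_max(score, pool):
    """Scale `score` to [0, 1] using the min and max of the pool it was run with."""
    lo, hi = min(pool), max(pool)
    return (score - lo) / (hi - lo)

def anchored(score, anchor_lo, anchor_hi):
    """Scale `score` against fixed anchor values shared by every run."""
    return (score - anchor_lo) / (anchor_hi - anchor_lo)

raw = 0.6
run_a = [0.2, 0.6, 0.8]  # pool without an extreme arm
run_b = [0.2, 0.6, 1.4]  # same raw score, but this pool contains an outlier

per_run_a = min_max(raw, run_a)
per_run_b = min_max(raw, run_b)
fixed = anchored(raw, anchor_lo=0.0, anchor_hi=1.0)
print(round(per_run_a, 3), round(per_run_b, 3), fixed)  # 0.667 0.333 0.6
```

The same raw score halves when an outlier joins the pool, while the anchored version does not move. That is the kind of drift a fixed anchor is meant to remove.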
06

Where it sits vs the field

| Capability | The Modulation Test | Single-technique tools (e.g. forced-collision pipelines) | LLM-diversity literature |
|---|---|---|---|
| Measures distance from average | Yes | Yes | Yes |
| Measures value / usefulness | Yes: the far-and-good frontier | No (distance only) | Rarely |
| Creativity-theory signatures | Aha / Ha-ha / Ah (Koestler) | No | No |
| Composition / emergence test | Yes, and proved sub-additivity | N/A (single lever) | No |
| Independent, composable axes | 7 axes, each a cited science | 1 technique | Prompt tricks |
| A primitive that ships | Forced × best-of-N | The technique itself | Research only |
| A leaderboard / standard | Yes (this) | No | Fragmented |

The field has the cannon. We have the cannon and the instrument that says what is worth firing at — plus the standard that adjudicates any modulation claim, including a competitor's.

The thesis

Yes — we can own this category. Not by stacking prompts, but by owning the measurement and the primitive.

Whoever owns the benchmark owns the category. HumanEval defined coding-AI; the Modulation Test can define cognitive-AI. The moat is not the prose of the modes — it is three stacked, hard-to-copy layers:

LAYER 01

The benchmark

The standard that measures cognitive modulation — far-and-good, composition, signatures. Running, significant, robust. This page.

LAYER 02

The primitive

Forced bisociation × best-of-N: a defensible, measurable generative method. The product surface ships this, not a prompt.

LAYER 03

The bridge

Axes grounded in cited cognitive science. ML labs don't read Jung; cognitive scientists don't ship code. The intersection is the scarce input.

07

Status & what's next

This page updates as results land. Current state:

DONE

Flagship + 9-arm emergence + collider + best-of-N

270-gen runs, 0-error, 4 studies
DONE

Robustness: 2nd embedder family + cross-model judge

distance r 0.70–0.81 · rankings hold
DONE

Combination test — lens layering doesn't help

imagineer×force×select ≈ plain best-of-N; the mechanism dominates
DONE

Cheapest-pipeline test — runs great on a small model

1.74× distance · value above baseline · 96% far-and-good · p = 0.0156 → cheap to productize
DONE

Cross-family judge (Google Gemma) — ranking holds

spark_forced wins under all 3 judges / 2 families · partial n (free tier)
NEXT

Replicate champion · ≥12 briefs · retention & transfer axes

v0.3 — settle exact numbers, expand