The Modulation Test — Living Mirrors

The headline

★ The champion primitive

Forced bisociation × best-of-N selection

Force the model to collide structurally distant concepts (Koestler, 1964), over-generate, then keep only the ideas that are both far from the average and genuinely good. It is the single most effective generative method we found — and it is a combination of mechanisms, not a stack of prompts.
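A minimal sketch of the forcing step, assuming concepts are already embedded as vectors. The concept names, the toy vectors, `most_distant_pair`, and the prompt wording are all hypothetical; the page does not say how distant concepts are chosen or how the prompt is phrased.

```python
import math
from itertools import combinations

def cosine_distance(a, b):
    """1 - cosine similarity; larger means the concepts are further apart."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def most_distant_pair(concepts):
    """Return the two concept names whose embeddings are farthest apart."""
    return max(
        combinations(sorted(concepts), 2),
        key=lambda pair: cosine_distance(concepts[pair[0]], concepts[pair[1]]),
    )

def forced_prompt(brief, pair):
    """Hypothetical prompt wording that forces the collision."""
    return (
        f"Brief: {brief}\n"
        f"Combine '{pair[0]}' and '{pair[1]}' into one idea that answers the brief."
    )

# Toy 3-d embeddings, invented for illustration only.
concepts = {
    "beehive": [0.9, 0.1, 0.0],
    "honeycomb": [0.8, 0.2, 0.1],
    "tax audit": [0.0, 0.1, 0.9],
}
pair = most_distant_pair(concepts)
prompt = forced_prompt("a new kind of library", pair)
```

The point of the sketch is that the pairing is done mechanically, before the model is called, rather than left to the model's own choice of associations.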

1.65×

further from the default cloud than plain prompting

90–95%

of ideas land "far AND good" across runs (vs ~50% baseline)

~2×

the insight / surprise / beauty signatures vs baseline

6 / 6

briefs beat baseline · p = 0.0156
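The p = 0.0156 figure matches what a one-sided exact sign test gives when all 6 of 6 briefs favour the modulated arm (0.5^6 = 1/64). The page does not name its test, so the sign test here is an assumption, and `sign_test_p` is an illustrative helper.

```python
from math import comb

def sign_test_p(wins, n):
    """One-sided exact sign test: P(at least `wins` successes in n fair coin flips)."""
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n

p_six = sign_test_p(6, 6)        # 1/64
p_twelve = sign_test_p(12, 12)   # 1/4096
```

Under that assumption, 0.0156 is the smallest p-value a 6-brief sign test can produce; a clean sweep of 12 briefs would give 1/4096, roughly 0.00024.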

What modulation does

Every cognitive mode beats the baseline. But the gains live in two places — a single well-chosen lens, or forcing the operation. Stacking lenses on top of each other does not compound. Bars show the Modulation Quotient (far-and-good composite); the champion is measured separately.

Arm	Modulation Quotient
spark_bestof ✦	0.42*
imagineer	0.296
christ	0.294
spark	0.281
spark_forced	0.280
savant	0.280
mirror	0.280
full_stack ▽	0.260
spark+christ ▽	0.256
vanilla	0.220

✦ champion (separate run, replication pending *) · single mode: all beat baseline · ▽ composition: sub-additive

Three things it proved

Finding 01

Forcing beats the lens

Holding a mode as a posture nudges output ~1.08×. Mechanically forcing the distant-concept collision moves it ~1.5–1.6× with roughly double the insight/surprise/beauty signatures. The power is in the operation, not the attitude.

+58% distance · ~2× signatures

Finding 02

Stacking doesn't compound — no emergence found

Across six compositions (lens-stacks, savant pairs, and savant applied to the forced mechanism), every one scored at or below its best single component. A small run hinted at a positive for savant × forcing (+0.016), but it did not survive a higher-powered replication — savant added to the champion slightly hurt it. The moat is the mechanism (forced × best-of-N), not the stack.

6 compositions · all SA ≤ 0 at power · emergence: not found
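One way to read "at or below its best single component" is as a simple difference: composite score minus the best component score. The sketch below applies that to the spark+christ arm using the Modulation Quotient values reported on this page. `composition_gain` is a hypothetical stand-in; the page's own "SA" statistic is not defined here and may be computed differently.

```python
# Modulation Quotient values as reported on this page.
mq = {"spark": 0.281, "christ": 0.294, "spark+christ": 0.256}

def composition_gain(scores, composite, components):
    """Composite score minus its best single component.

    A positive value would suggest emergence; zero or negative means
    stacking did not beat the best lens on its own.
    """
    return scores[composite] - max(scores[c] for c in components)

gain = composition_gain(mq, "spark+christ", ["spark", "christ"])
print(f"{gain:+.3f}")
```

Here the stack scores below christ alone, which is the sub-additive pattern the finding describes.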

Finding 03

The quality tax is a selection problem

Forcing far ideas costs quality, and trying to refine that quality back in failed. Selecting works instead: over-generate, then keep the far-and-good. That lifts the far-and-good rate to ~95% and partly offsets the tax while keeping the distance. Generate wide, then choose. Not generate, then polish.

refine ✗ · select ✓ (far-and-good ↑)
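The select step can be sketched as a filter plus a ranking. This assumes each idea already carries a distance score and a judged quality score; the thresholds, the product ranking, the `Idea` type, and the sample pool are all illustrative choices, not the harness's actual settings.

```python
from dataclasses import dataclass

@dataclass
class Idea:
    text: str
    distance: float  # novelty: distance from the default cloud
    quality: float   # judged value

def select_far_and_good(ideas, min_distance, min_quality, keep):
    """Keep only ideas that clear both bars, best-first."""
    survivors = [
        i for i in ideas
        if i.distance >= min_distance and i.quality >= min_quality
    ]
    # Rank by the product so neither axis can be traded away entirely.
    survivors.sort(key=lambda i: i.distance * i.quality, reverse=True)
    return survivors[:keep]

# Invented pool standing in for an over-generated batch.
pool = [
    Idea("safe and dull", 0.20, 0.80),
    Idea("weird but empty", 0.90, 0.30),
    Idea("far and good A", 0.80, 0.70),
    Idea("far and good B", 0.70, 0.90),
]
chosen = select_far_and_good(pool, min_distance=0.6, min_quality=0.6, keep=2)
```

Nothing is rewritten in this step: the weak ideas are dropped rather than polished, which is the contrast the finding draws between selecting and refining.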

The chart the field can't draw

Distance vs value scatter — the far-and-good frontier

Each dot is one idea. x = novelty (distance from the average), y = judged quality. The champion (dark) occupies the upper right: far AND good. Plain forcing (light) reaches far but bleeds into the low-value band.

Existing work — including the strongest recent results — optimises a single number: distance from the default cloud. But distance without value is just weirdness.

The Modulation Test scores every idea on both axes and reports the far-and-good frontier — the region a distance-only metric is blind to. That is the difference between a randomness injector and a cognitive instrument. It is also the metric a competitor cannot retrofit without rebuilding their evaluation from the ground up.
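To illustrate why a distance-only metric is blind here, the sketch below compares two invented arms with identical mean distance but very different far-and-good rates. The (distance, quality) pairs and both thresholds are made up for the example.

```python
def mean_distance(ideas):
    """Average novelty across an arm; the only number a distance-only metric sees."""
    return sum(d for d, _ in ideas) / len(ideas)

def far_and_good_rate(ideas, min_distance, min_quality):
    """Share of ideas that are both far enough and good enough."""
    hits = sum(1 for d, q in ideas if d >= min_distance and q >= min_quality)
    return hits / len(ideas)

# (distance, quality) pairs, invented for illustration only.
arm_a = [(0.8, 0.7), (0.8, 0.8), (0.8, 0.9), (0.8, 0.7)]
arm_b = [(0.8, 0.2), (0.8, 0.8), (0.8, 0.3), (0.8, 0.1)]

same_distance = mean_distance(arm_a) == mean_distance(arm_b)
rate_a = far_and_good_rate(arm_a, min_distance=0.6, min_quality=0.6)
rate_b = far_and_good_rate(arm_b, min_distance=0.6, min_quality=0.6)
```

Both arms look identical on distance alone, yet every idea in the first clears both bars and only one in four does in the second.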

Honest standing

A benchmark is only as trustworthy as its stated weaknesses. Here are both.

◆ Strengths

Significant & directional. Every mode beats baseline on 6/6 briefs, p = 0.0156.
Value-aware. Measures far and good — not distance alone.
Robust. Replicates across two embedder families (r 0.70–0.81) and three judge models across two families (Claude Opus + Sonnet, Google Gemma), with identical rankings.
Cross-family validated. A non-Claude judge (Google Gemma) agrees: spark_forced wins, all modes beat baseline 6/6 (p=0.0156). Not a self-preference artifact.
Science-grounded. Each axis cites a real cognitive lineage; signatures use Koestler's Aha / Ha-ha / Ah.
Composition-aware. The only framework that can even ask whether stacking cognition compounds.
Reproducible & cheap. Local embeddings, no proprietary pipeline; resumable, hardened harness.
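The robustness claims above rest on two kinds of agreement: correlation between distance scores from different embedders, and identical arm rankings under different judges. A minimal sketch of both checks follows, assuming Pearson correlation for r (the page does not say which coefficient it uses). All numbers and arm names are invented.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equally long score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def ranking(scores):
    """Arm names ordered best-first by score."""
    return sorted(scores, key=scores.get, reverse=True)

# Distances for the same four ideas under two hypothetical embedders.
embedder_a = [0.2, 0.4, 0.6, 0.8]
embedder_b = [0.3, 0.35, 0.7, 0.75]
r = pearson_r(embedder_a, embedder_b)

# Per-arm scores under two hypothetical judges.
judge_a = {"arm_x": 0.30, "arm_y": 0.25, "arm_z": 0.20}
judge_b = {"arm_x": 0.50, "arm_y": 0.45, "arm_z": 0.35}
rankings_agree = ranking(judge_a) == ranking(judge_b)
```

Rankings can agree even when the raw scores differ in scale, which is why the two checks are reported separately.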

△ Weaknesses (open)

Judge variance. The value score for the same idea varies by ±0.2 across runs, so exact value numbers need replication.
Cross-family: partial n. Google Gemma confirmed the ranking, but free-tier limits meant only ~half the ideas were scored (~110/arm). A funded Gemini-Flash pass would fully seal it.
Small brief set. 6 briefs so far; ≥12 needed for a publishable claim.
Pool-dependent composite. The MQ score is normalised per-run; cross-run comparison needs a fixed anchor (v0.3).
Two axes unbuilt. Retention and domain-transfer are specified but not yet run.
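The pool-dependence weakness can be shown with a toy case. The sketch assumes min-max scaling as the per-run normalisation; the page only says the score is "normalised per-run", so the actual scheme may differ. The runs and the anchor bounds are invented.

```python
def minmax(values):
    """Per-run normalisation: scale by the run's own min and max."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def anchored(values, lo, hi):
    """Fixed-anchor normalisation: scale by bounds shared across runs."""
    return [(v - lo) / (hi - lo) for v in values]

# The raw value 0.5 appears in both runs.
run_1 = [0.2, 0.5, 0.8]
run_2 = [0.5, 0.8, 1.1]

per_run_1 = minmax(run_1)[1]   # 0.5 sits mid-pool in run 1
per_run_2 = minmax(run_2)[0]   # 0.5 is the pool minimum in run 2

fixed_1 = anchored(run_1, 0.0, 2.0)[1]
fixed_2 = anchored(run_2, 0.0, 2.0)[0]
```

With per-run scaling the same raw score lands in different places depending on what else is in the pool; with shared bounds it lands in the same place, which is what makes cross-run comparison possible.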

Where it sits vs the field

Capability	The Modulation Test	Single-technique tools (e.g. forced-collision pipelines)	LLM-diversity literature
Measures distance from average	Yes	Yes	Yes
Measures value / usefulness	Yes — the far-and-good frontier	No (distance only)	Rarely
Creativity-theory signatures	Aha / Ha-ha / Ah (Koestler)	No	No
Composition / emergence test	Yes — and proved sub-additivity	N/A (single lever)	No
Independent, composable axes	7 axes, each a cited science	1 technique	Prompt tricks
A primitive that ships	Forced × best-of-N	The technique itself	Research only
A leaderboard / standard	Yes (this)	No	Fragmented

The field has the cannon. We have the cannon and the instrument that says what is worth firing at — plus the standard that adjudicates any modulation claim, including a competitor's.

The thesis

Yes — we can own this category. Not by stacking prompts, but by owning the measurement and the primitive.

Whoever owns the benchmark owns the category. HumanEval defined coding-AI; the Modulation Test can define cognitive-AI. The moat is not the prose of the modes — it is three stacked, hard-to-copy layers:

LAYER 01

The benchmark

The standard that measures cognitive modulation — far-and-good, composition, signatures. Running, significant, robust. This page.

LAYER 02

The primitive

Forced bisociation × best-of-N: a defensible, measurable generative method. The product surface ships this, not a prompt.

LAYER 03

The bridge

Axes grounded in cited cognitive science. ML labs don't read Jung; cognitive scientists don't ship code. The intersection is the scarce input.

Status & what's next

This page updates as results land. Current state:

DONE

Flagship + 9-arm emergence + collider + best-of-N

270-gen runs, 0-error, 4 studies

DONE

Robustness: 2nd embedder family + cross-model judge

distance r 0.70–0.81 · rankings hold

DONE

Combination test — lens layering doesn't help

imagineer×force×select ≈ plain best-of-N; the mechanism dominates

DONE

Cheapest-pipeline test — runs great on a small model

1.74× distance · value above baseline · 96% far-and-good · p=0.0156 → cheap to productize

DONE

Cross-family judge (Google Gemma) — ranking holds

spark_forced wins under all 3 judges / 2 families · partial n (free tier)

NEXT

Replicate champion · ≥12 briefs · retention & transfer axes

v0.3 — settle exact numbers, expand