A benchmark for cognitive modulation of large language models: measuring whether structured cognitive modes produce ideas that are measurably further from the average — and still good. Built because the field measures only distance, and distance without value is just noise.
Force the model to collide structurally distant concepts (Koestler, 1964), over-generate, then keep only the ideas that are both far from the average and genuinely good. It is the single most effective generative method we found — and it is a combination of mechanisms, not a stack of prompts.
Every cognitive mode beats the baseline. But the gains live in two places — a single well-chosen lens, or forcing the operation. Stacking lenses on top of each other does not compound. Bars show the Modulation Quotient (far-and-good composite); the champion is measured separately.
Holding a mode as a posture nudges output ~1.08×. Mechanically forcing the distant-concept collision moves it ~1.5–1.6× with roughly double the insight/surprise/beauty signatures. The power is in the operation, not the attitude.
Across six compositions (lens stacks, savant pairs, and savant applied to the forced mechanism), every one scored at or below its best single component. A small run hinted at a positive effect for savant × forcing (+0.016), but it did not survive a higher-powered replication: savant added to the champion slightly hurt it. The moat is the mechanism (forced × best-of-N), not the stack.
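The composition check above can be sketched in a few lines. This is a minimal illustration, assuming each arm is summarised by one scalar score; the function names and the numbers are hypothetical and are not taken from the benchmark's code or results.

```python
def composition_gain(component_scores, composed_score):
    """Return the composed score minus the best single component score."""
    if not component_scores:
        raise ValueError("need at least one component score")
    return composed_score - max(component_scores)


def compounds(component_scores, composed_score, margin=0.0):
    """True only if the composition beats its best component by more than `margin`."""
    return composition_gain(component_scores, composed_score) > margin


# Made-up scores for illustration only; these are not results from the benchmark.
lens_a, lens_b = 0.55, 0.61
stacked = 0.58

gain = composition_gain([lens_a, lens_b], stacked)
print(f"gain over best component: {gain:+.3f}")
print("compounds:", compounds([lens_a, lens_b], stacked))
```

A `margin` above zero is one way to avoid reading a small positive gain as a real effect before it has been replicated.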
Forcing far ideas costs quality, and trying to refine that quality back in failed. Selecting (over-generate, then keep the far-and-good) lifts the far-and-good rate to ~95% and partly offsets that quality tax while keeping the distance. Generate wide, then choose. Not generate, then polish.
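"Generate wide, then choose" might look like the sketch below. It assumes every candidate already carries a distance score and a value score; the `Candidate` type, the thresholds, and the tie-breaking rule are all hypothetical choices made for illustration, not the benchmark's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    text: str
    distance: float  # assumed: distance from the baseline cloud, higher is farther
    value: float     # assumed: judged quality in [0, 1]


def select_far_and_good(candidates, min_distance, min_value, keep=1):
    """Keep up to `keep` candidates that clear both thresholds, farthest first."""
    passing = [
        c for c in candidates
        if c.distance >= min_distance and c.value >= min_value
    ]
    # Sort by distance, then by value as a tie-breaker (an arbitrary choice).
    passing.sort(key=lambda c: (c.distance, c.value), reverse=True)
    return passing[:keep]


# Hand-written pool standing in for N over-generated ideas.
pool = [
    Candidate("safe idea", 0.10, 0.90),
    Candidate("weird idea", 0.80, 0.20),
    Candidate("far and good idea", 0.70, 0.80),
    Candidate("farther and good idea", 0.75, 0.70),
]

chosen = select_far_and_good(pool, min_distance=0.5, min_value=0.6, keep=1)
print([c.text for c in chosen])
```

Note that the farthest candidate in the pool ("weird idea") is dropped: it clears the distance threshold but not the value threshold, which is the point of selecting on both axes.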
Existing work — including the strongest recent results — optimises a single number: distance from the default cloud. But distance without value is just weirdness.
The Modulation Test scores every idea on both axes and reports the far-and-good frontier — the region a distance-only metric is blind to. That is the difference between a randomness injector and a cognitive instrument. It is also the metric a competitor cannot retrofit without rebuilding their evaluation from the ground up.
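As a rough sketch of two-axis scoring: measure each idea's distance from the centroid of baseline embeddings, then combine it with a separate value score. The cosine-distance choice, the cut-offs, and the toy 2-D vectors are assumptions for illustration; the benchmark's actual embedder and thresholds are not specified here.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_distance(a, b):
    """1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / norm


def classify(distance, value, far_cut, good_cut):
    """Place an idea in one of four regions of the distance/value plane."""
    far = distance >= far_cut
    good = value >= good_cut
    if far and good:
        return "far-and-good"
    if far:
        return "far-only"
    if good:
        return "good-only"
    return "neither"


# Toy 2-D "embeddings"; real embeddings would have hundreds of dimensions.
baseline_cloud = [[1.0, 0.0], [0.9, 0.1]]
center = centroid(baseline_cloud)

far_idea = cosine_distance([0.0, 1.0], center)
near_idea = cosine_distance([1.0, 0.0], center)

# Same distance, different value: a distance-only metric cannot tell these apart.
print(classify(far_idea, value=0.8, far_cut=0.5, good_cut=0.6))
print(classify(far_idea, value=0.2, far_cut=0.5, good_cut=0.6))
print(classify(near_idea, value=0.8, far_cut=0.5, good_cut=0.6))
```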
A benchmark is only as trustworthy as its stated weaknesses. Here are both.
| Capability | The Modulation Test | Single-technique tools (e.g. forced-collision pipelines) | LLM-diversity literature |
|---|---|---|---|
| Measures distance from average | Yes | Yes | Yes |
| Measures value / usefulness | Yes — the far-and-good frontier | No (distance only) | Rarely |
| Creativity-theory signatures | Aha / Ha-ha / Ah (Koestler) | No | No |
| Composition / emergence test | Yes — and proved sub-additivity | N/A (single lever) | No |
| Independent, composable axes | 7 axes, each a cited science | 1 technique | Prompt tricks |
| A primitive that ships | Forced × best-of-N | The technique itself | Research only |
| A leaderboard / standard | Yes (this) | No | Fragmented |
The field has the cannon. We have the cannon and the instrument that says what is worth firing at — plus the standard that adjudicates any modulation claim, including a competitor's.
Whoever owns the benchmark owns the category. HumanEval defined coding-AI; the Modulation Test can define cognitive-AI. The moat is not the prose of the modes — it is three stacked, hard-to-copy layers:
The standard that measures cognitive modulation — far-and-good, composition, signatures. Running, significant, robust. This page.
Forced bisociation × best-of-N: a defensible, measurable generative method. The product surface ships this, not a prompt.
Axes grounded in cited cognitive science. ML labs don't read Jung; cognitive scientists don't ship code. The intersection is the scarce input.
This page updates as results land. Current state:
Flagship + 9-arm emergence + collider + best-of-N
270-gen runs, 0-error, 4 studies
Robustness: 2nd embedder family + cross-model judge
distance r 0.70–0.81 · rankings hold
Combination test: lens layering doesn't help
imagineer×force×select ≈ plain best-of-N; the mechanism dominates
Cheapest-pipeline test: runs great on a small model
1.74× distance · value above baseline · 96% far-and-good · p=0.0156 → productizes cheap
Cross-family judge (Google Gemma): ranking holds
spark_forced wins under all 3 judges / 2 families · partial n (free tier)
Replicate champion · ≥12 briefs · retention & transfer axes
v0.3 — settle exact numbers, expand
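One way to read the robustness line in the status list is as a check that a second embedder agrees with the first on distances and on the ordering of arms. The sketch below assumes the reported r is a Pearson correlation, which is not confirmed here. Apart from `spark_forced`, the arm names are hypothetical, and every number is made up for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def ranking(scores):
    """Arm names ordered from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


# Made-up mean distances per arm under two embedder families (illustration only).
embedder_a = {"baseline": 0.30, "lens": 0.33, "spark_forced": 0.47}
embedder_b = {"baseline": 0.41, "lens": 0.47, "spark_forced": 0.60}

arms = list(embedder_a)
r = pearson_r([embedder_a[a] for a in arms], [embedder_b[a] for a in arms])
rankings_hold = ranking(embedder_a) == ranking(embedder_b)

print(f"r = {r:.2f}, rankings hold: {rankings_hold}")
```

In practice the correlation would more plausibly be computed over per-idea distances rather than three arm means; the arm-level version is used here only to keep the example short.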