SITREP ISR
Israel · Regional Security

How Simulations Work

Simulations are structured deliberations among several large-language-model personas on a defined geopolitical scenario. They are explicitly experimental — not a forecast, not reporting, not expert analysis. Their value lies in aggregating distinct perspectives through different analytical lenses, rooted in the priors each model family brings. Read them as one input to sense-making, not as a source of fact.

The flow

[Flow diagram] Scenario (prompt + horizons) → Grounding (RSS · Sonar · Tavily) → Members (Anthropic Claude · Google Gemini · DeepSeek DeepSeek-R1 · Moonshot Kimi · Zhipu GLM) → Council Head (synthesis · adjudication) → Situational assessment · Key actors' positions · Horizon projections
Each simulation: scenario + grounding → parallel deliberation across model families → synthesis by a Council Head.

Each simulation starts from a scenario prompt and one or more analytical horizons (24 hours, 1 week, 1 month). A grounding layer pulls recent open-source items — RSS feeds, Perplexity Sonar results, Tavily searches — so the council is arguing over a shared evidence base rather than training-data memories.
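The merge step of the grounding layer can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `Item` and `build_grounding` are hypothetical names, and the real fetchers (RSS, Perplexity Sonar, Tavily) are assumed to have already been normalised into a common record shape.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Item:
    """One normalised open-source item from any fetcher."""
    source: str          # "rss" | "sonar" | "tavily"
    url: str
    title: str
    published: datetime  # timezone-aware


def build_grounding(raw_items, window_hours=24):
    """Merge items from all fetchers into one shared evidence base:
    keep only items inside the analytical window, and drop duplicates
    by URL, preferring the most recently published copy."""
    cutoff = datetime.now(timezone.utc) - timedelta(hours=window_hours)
    seen, grounding = set(), []
    for item in sorted(raw_items, key=lambda i: i.published, reverse=True):
        if item.published >= cutoff and item.url not in seen:
            seen.add(item.url)
            grounding.append(item)
    return grounding
```

Deduplicating by URL and time-boxing the window is what lets the council argue over a shared, recent evidence base instead of whatever each model happens to remember.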

The council

A Council Head frames the question and adjudicates the deliberation. Several Council Members — each backed by a different model family (Anthropic, Google, DeepSeek, Moonshot, Zhipu, and others) — reason independently, then respond to each other in a structured exchange. The design is deliberately adversarial: disagreements are surfaced and logged, not smoothed out into false consensus.

Every simulation page lists the exact models that contributed. Outputs reflect each model’s training corpus, alignment layer, and inherent biases — they are not ground truth.
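The deliberation loop described above — independent first positions, then structured rounds of mutual response, then adjudication — can be sketched like this. Everything here is a hypothetical stand-in: `Member`, `deliberate`, and the callables they wrap are illustrative names, not the project's API; the real system wraps an API client per model family.

```python
class Member:
    """One council member, backed by a single model family."""

    def __init__(self, name, model_call):
        self.name = name
        # model_call: fn(scenario, grounding, transcript) -> position text
        self._call = model_call

    def respond(self, scenario, grounding, transcript):
        return self._call(scenario, grounding, transcript)


def deliberate(head_synthesize, members, scenario, grounding, rounds=2):
    """Round 0: each member reasons independently (empty transcript).
    Later rounds: members see the logged exchange and may push back.
    Disagreements stay in the transcript; nothing is averaged away."""
    transcript = []
    for _ in range(rounds + 1):
        positions = {
            m.name: m.respond(scenario, grounding, transcript) for m in members
        }
        transcript.append(positions)
    # The Council Head adjudicates over the full logged exchange.
    return head_synthesize(scenario, transcript)
```

Note that positions within a round are collected before being appended to the transcript, so members never see same-round peers — only prior rounds — which is what keeps the first round genuinely independent.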

Why the divergence matters

[Diagram] Same evidence → realist lens · liberal-institutional lens · constructivist lens · regional-historical lens · economic lens → each yields a different ranking
The value is not consensus — it's watching where different priors route the same evidence toward different conclusions.

The point of running several model families on the same evidence isn’t to average their outputs into a single answer. It’s to see where they pull apart. Different models carry different priors — a realist reading of great-power competition, a liberal-institutional read of multilateral venues, a regional-historical read shaped by non-Western corpora. When priors diverge, so do the plausible futures. Surfacing that spread is more honest than forcing consensus.

The output

The Council Head produces a synthesis across three axes:

  • Situational assessment — what the council collectively reads from the evidence, with explicit load-bearing uncertainties.
  • Key actors — stated positions and constraints.
  • Horizon-scoped projections — for each analytical horizon, the top predictions with a mean confidence across the council, a consensus strength, supporting and countervailing rationale, and a watch trigger — the observable signal that would falsify or confirm the prediction.
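A horizon-scoped projection might be modelled as a record like the one below. This is a sketch under assumptions: the field names and the consensus formula are illustrative, not the project's schema — in particular, "consensus strength" here is computed as one crude proxy (inverse spread of member confidences), which the actual system may define differently.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Projection:
    horizon: str                           # "24h", "1w", or "1m"
    claim: str                             # the prediction itself
    member_confidences: dict[str, float]   # model name -> self-reported confidence in [0, 1]
    supporting: list[str]                  # supporting rationale
    countervailing: list[str]              # countervailing rationale
    watch_trigger: str                     # observable that would confirm or falsify

    @property
    def mean_confidence(self) -> float:
        """Mean of the self-reported confidences across the council."""
        return mean(self.member_confidences.values())

    @property
    def consensus_strength(self) -> float:
        """Crude proxy: 1 minus the spread of member confidences."""
        vals = list(self.member_confidences.values())
        return 1.0 - (max(vals) - min(vals))
```

Keeping per-member confidences rather than only the aggregate is what makes the "weak ordering, not probability" caveat in the limitations section inspectable: a high mean can hide a wide spread.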

Why this is not forecasting

A forecast implies calibration against an outcome distribution. Council simulations do not calibrate and do not verify. They surface what a set of models, reasoning in structured disagreement over the same evidence, would say if forced to commit. That is useful for enumerating scenarios and pressure-testing assumptions — but any individual prediction has no independent claim to accuracy.

Why publish them

Several reasons. First, the disagreements themselves are informative — where models converge vs. diverge is a signal worth looking at. Second, watch triggers double as a reading checklist for anyone following the story. Third, it makes the underlying AI behaviour inspectable: the full deliberation, the models used, and the grounding pulled are all recorded rather than hidden inside a chatbot answer.

Limitations to read against

  • Model biases stack. Most frontier models share a narrow training distribution and similar alignment constraints, so apparent consensus can be an artefact of shared priors, not evidence of truth.
  • The grounding layer is imperfect. RSS / search results are noisy and may miss the most consequential reporting in any given window.
  • Council deliberations run in English. Source material in Hebrew, Arabic, Farsi, and Russian is under-represented relative to its importance for this region.
  • Confidence numbers come from the models themselves and are not empirically calibrated. Treat them as weak orderings, not probabilities.

Prior art & influences

The architecture owes a direct debt to two public projects. Credit where it’s due.

Source code

Everything that runs these simulations is open source. If you want to reproduce, audit, or fork the approach:

For the list of sources drawn on during grounding, see Sources. For the broader project framing, see Disclaimer.