How Our Exit Signal Score Works
The Exit Signal Score is a daily 0-100 number measuring US stability. This is how it actually gets produced: the briefing, the six models, what they see, and why we show their disagreement.
The Exit Signal Score is the headline number on CONTINGENCYPLAN.AI. It is a 0-100 assessment of the current state of US stability. It updates daily. It is produced by six large language models — Claude (Anthropic), Gemini (Google), DeepSeek, Mistral, ChatGPT (OpenAI), and Grok (xAI) — working independently on the same input.
The methodology page covers the philosophy. This post covers the mechanics.
Step 1: Build the daily briefing
The pipeline starts every morning with a briefing.
We pull from global news wires, structured data providers, and public social media signals. The raw feed is noisy — thousands of stories a day, most of them irrelevant to stability signaling. A first pass clusters related events, deduplicates repeated coverage of the same story from different outlets, and categorizes each cluster into one of our tracked domains: Civil Liberties, Elite Accountability, Civil Unrest, Economic Instability, Government Corruption, Surveillance and Privacy, Military and Conflict, Food and Supply Chain, Public Health, and a few others.
The output is a written briefing. Representative events, not comprehensive coverage. The goal is a single document that captures what actually moved the needle that day, formatted so a model can read it end to end without getting lost in filler.
This is the most important step in the pipeline. The briefing is what the models see. Everything downstream depends on whether it is balanced and representative. We publish the briefing in full on the WHEN TO LEAVE page every day so readers can verify that the input is not cherry-picked.
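The clustering, deduplication, and categorization described above can be sketched in a few lines. This is an illustrative toy, not the production pipeline: the keyword map, similarity threshold, and function names are all assumptions made for the example.

```python
from difflib import SequenceMatcher

# Hypothetical keyword map -- a stand-in for the real domain taxonomy.
DOMAIN_KEYWORDS = {
    "Civil Unrest": ["protest", "riot", "strike"],
    "Economic Instability": ["inflation", "default", "layoffs"],
    "Public Health": ["outbreak", "epidemic", "hospital"],
}

def is_duplicate(a: str, b: str, threshold: float = 0.8) -> bool:
    """Treat two headlines as repeat coverage if they are textually close."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def categorize(headline: str) -> str:
    """Assign a headline to the first tracked domain whose keywords match."""
    text = headline.lower()
    for domain, keywords in DOMAIN_KEYWORDS.items():
        if any(kw in text for kw in keywords):
            return domain
    return "Other"

def build_briefing(headlines: list[str]) -> dict[str, list[str]]:
    """Drop duplicate coverage, then bucket the surviving stories by domain."""
    kept: list[str] = []
    for h in headlines:
        if not any(is_duplicate(h, seen) for seen in kept):
            kept.append(h)
    briefing: dict[str, list[str]] = {}
    for h in kept:
        briefing.setdefault(categorize(h), []).append(h)
    return briefing
```

A real system would cluster on article bodies and embeddings rather than headline string similarity, but the shape is the same: many noisy inputs in, one organized document out.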
Step 2: Ship the briefing to six models
Every model gets the same briefing.
Not similar briefings. Not a summary of the briefing. The identical document. This matters because if the models saw different inputs, we could not meaningfully compare their outputs. The point of running six models is to isolate differences in model reasoning — biases, training cutoffs, editorial instincts — while holding the information they work from constant.
Each model is given the same task, framed the same way: based on this briefing, and drawing on your understanding of historical patterns that have preceded state instability, score the current US situation from 0 to 100. Provide reasoning. Stay calibrated.
The models do not have live internet access during scoring. They cannot pull in external news of their own. The only information they work from is the briefing they were given plus whatever they already know from pretraining. This is intentional. Live retrieval would make the score unreliable — different retrievals for different models, different amounts of context, different editorial cuts.
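The fan-out step is structurally simple: one document, one task framing, six independent calls. The sketch below abstracts away the provider SDKs (which all differ) behind a placeholder function; the model names come from the post, but everything else is an assumption for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# The six models named above; identifiers here are informal labels.
MODELS = ["claude", "gemini", "deepseek", "mistral", "chatgpt", "grok"]

TASK = (
    "Based on this briefing, and drawing on your understanding of historical "
    "patterns that have preceded state instability, score the current US "
    "situation from 0 to 100. Provide reasoning. Stay calibrated."
)

def score_with_model(model: str, briefing: str) -> dict:
    """Placeholder for a provider API call. No retrieval tools are attached:
    the model sees only the task framing plus the briefing text."""
    prompt = f"{TASK}\n\n{briefing}"
    # In production this would invoke the provider's SDK; here we only echo
    # the contract each model must satisfy.
    return {"model": model, "prompt": prompt, "score": None, "reasoning": None}

def fan_out(briefing: str) -> list[dict]:
    """Send the identical briefing to all six models in parallel."""
    with ThreadPoolExecutor(max_workers=len(MODELS)) as pool:
        return list(pool.map(lambda m: score_with_model(m, briefing), MODELS))
```

The invariant worth testing in a setup like this is the one the post insists on: every model receives byte-identical input.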
Step 3: Collect scores, reasoning, and the written assessment
Each model returns three things:
- A score from 0 to 100.
- A short written reasoning — why this score, what factored in, what the model is weighing.
- A longer daily assessment in natural language.
The individual model responses are saved to the database. The scores are averaged to produce the consensus number. The written assessments are used to build the day's published briefing.
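The three-part response and the averaging step can be made concrete with a small record type. Field names and the one-decimal rounding are illustrative assumptions, not the published schema.

```python
from dataclasses import dataclass
from statistics import mean

# Hypothetical response record -- field names are illustrative.
@dataclass
class ModelResponse:
    model: str
    score: int          # 0-100
    reasoning: str      # short "why this score"
    assessment: str     # longer natural-language daily assessment

def consensus(responses: list[ModelResponse]) -> float:
    """The headline number is the plain average of the six scores."""
    return round(mean(r.score for r in responses), 1)
```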
Step 4: Publish the consensus — and all six individual scores
The WHEN TO LEAVE page shows the consensus score at the top. But it also shows all six individual model scores. This is non-negotiable in our design. We never collapse six into one without showing the six.
Why: If five models score the day at 48 and one scores it at 72, the consensus is 52. That number is misleading on its own. What the reader needs to know is that five models agree and one dissents. That dissent might be right. It might be overreacting. Either way, the reader has better information if they can see the spread than if they only see the average.
When models agree, the signal is strong. When they disagree, that disagreement is itself the signal — it is telling you that reasonable analyses of the same day can arrive at meaningfully different conclusions. That is useful.
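Publishing the spread alongside the average is cheap to compute. A minimal sketch, using the five-at-48, one-at-72 example from above; the 10-point dissent threshold is an assumption for illustration, not a published cutoff.

```python
from statistics import mean, median

def summarize_spread(scores: list[int]) -> dict:
    """Report the consensus alongside the information the average hides:
    the full range and any scores that sit far from the pack."""
    med = median(scores)
    return {
        "consensus": round(mean(scores), 1),
        "low": min(scores),
        "high": max(scores),
        "spread": max(scores) - min(scores),
        # Flag scores more than 10 points from the median -- the threshold
        # is illustrative only.
        "dissents": [s for s in scores if abs(s - med) > 10],
    }
```

Run on the example day, this reports a consensus of 52.0, a 24-point spread, and a single dissent at 72 -- exactly the context a reader loses if only the average is shown.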
Why six models instead of one or three
Three models would not be enough to smooth out the occasional outlier. Ten would be redundant and expensive. Six is the smallest number where no single model's error can dominate the consensus, and where the diversity of training data and company provenance is meaningful.
The six we use come from six separate organizations: Anthropic, Google, DeepSeek (Chinese), Mistral (French), OpenAI, and xAI. Different labs, different training data, different editorial instincts, different political and cultural assumptions. If all six converge on the same score, it is very unlikely to be an artifact of any one lab's blind spots.
When Anthropic updates Claude, or OpenAI ships a new ChatGPT version, we revalidate. The identity of the individual models matters less than their diversity.
What the score does not include
A few things worth being explicit about.
- No live market data. The models do not know the S&P or the dollar index. Market movements may show up in the briefing if they were newsworthy, but the score is not a market-weighted economic indicator.
- No partisan framing. The briefing is assembled without ideological balancing. An event is reported if it is material to stability, not if it matches a political frame. Models are not prompted to favor any perspective.
- No prediction. The score is a snapshot of the present. It says: based on what is happening today, and based on what historically precedes instability, how concerning is the current state? It does not say: here is what will happen next week.
- No financial, legal, or immigration advice. We are a preparedness tool. The score is a number. What you do with it is up to you, your family, and the professionals you hire.
What we do when we are wrong
The score will sometimes be wrong. An outlier event can inflate a day's score temporarily. A slow-building pattern can show up as a normal number until the breaking point arrives.
When we notice either — through reader feedback, model disagreement, or postmortem — we do two things. We revise the briefing methodology (what gets included, how weighting works) if the issue is input-side. We revalidate the prompts and scoring rubric if the issue is model-side. The methodology is public, and changes to it are logged.
The goal is not a perfect score. The goal is a useful one — a number that helps a thoughtful reader plan better than they would without it, most of the time, and that is honest about what it is when it is wrong.
That is the Exit Signal Score. Every morning. Six models. One number. And the disagreement shown underneath, because that is what honest signaling looks like.