The Documents That Govern the Models
Hundreds of millions of people interact with large language models every week. Most of them do not know that between the model’s weights and the question they just typed sits a document — typically a few thousand words long, written in a specific register, composed by a small team at a single company — that determines what the model will do, what it will refuse, what it will say about itself, and how it will behave when pushed. These documents are called steering documents — system prompts, in the industry’s vocabulary — and they are the single most consequential piece of policy infrastructure in contemporary AI.
You have felt the effects of these documents without seeing them. The time an AI refused to help you summarize a news article because a politician was quoted in it. The reflexive disclaimer at the end of a medical question. The coding request that got refused for reasons the model could not articulate. The moment a conversation got weirdly cautious after a particular word. None of those behaviors are emergent properties of the model’s weights alone. They are instructions, written in English, inserted above your message by a team you will never meet. When they frustrate you, what is frustrating you is the document.
They are not public. They are not reviewed. They are not stable. They change between releases, between products, sometimes between days. And they constitute, in aggregate, one of the largest unacknowledged governance experiments in the history of computing — a few hundred people, at a handful of labs, writing prose that will shape the answers given to billions of questions.
This piece is an engineering study of one such document. The specimen is a Claude system prompt that surfaced publicly in April 2026. The point is not the specimen — the point is the category. Steering documents like this one exist at every frontier lab. They share structural features. They make similar choices. They carry similar scars. Reading one carefully tells you most of what you need to know about how the others are built.
What follows is a quantitative reading: I parsed the document into sections, counted things, looked at what was emphasized and what was elided, and tried to draw conclusions that generalize beyond the specimen. The analysis is mechanical where possible and interpretive where it isn’t. Every chart below responds to input — click, hover, or scroll — and the methodology is at the bottom, with code.
Budget allocation
A steering document has a fixed attention budget. Every word in it competes for the model’s cognitive bandwidth against every other word, against the user’s message, and against the conversation history — all within a limited context window. The first question worth asking of any such document is: where is that budget being spent?
Thirteen sections, three thousand words
The document is organized into named sections. Some are long, some are a single sentence. The proportions are not accidental — they reflect where the authors have had to spend the most effort shaping behavior.
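The budget pass is straightforward to reproduce. Below is a minimal sketch; the `SPECIMEN` text is a stand-in, not the leaked document, and the assumption that headers are ALL-CAPS tokens on their own lines is mine, borrowed from section names the piece cites:

```python
import re

# Stand-in specimen. Header format (ALL-CAPS line) and body text are
# illustrative assumptions, not quotes from the actual document.
SPECIMEN = """OPENING
Claude never begins a response with a flattering adjective.

DEFAULT_STANCE
Claude defaults to helping unless the request falls into an
enumerated refusal category.
"""

def section_budget(text):
    """Word count and share of total for each named section."""
    parts = re.split(r"^([A-Z][A-Z_]+)\s*$", text, flags=re.MULTILINE)
    # re.split yields [preamble, header, body, header, body, ...]
    counts = {h: len(body.split()) for h, body in zip(parts[1::2], parts[2::2])}
    total = sum(counts.values()) or 1
    return {name: (n, round(100 * n / total, 1)) for name, n in counts.items()}

print(section_budget(SPECIMEN))
# {'OPENING': (9, 40.9), 'DEFAULT_STANCE': (13, 59.1)}
```

Run against the full specimen, the same pass produces the word counts behind the chart in this section.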
The single largest section is about formatting
Not safety. Not ethics. Not identity. Formatting. 545 words — nearly a fifth of the document — on how many bullet points to use, when to use them, how warm the tone should be, how long responses should run.
This is not a criticism. It is a revelation about priorities. Users do not complain about policy philosophy. They complain about walls of bullet points.
Safety is the plurality concern
32.3% of the document handles refusal behavior, user wellbeing, and legal caveats — more than any other category. But not a majority. This matches how the labs talk about their models publicly.
What this looks like in practice: weapons, child safety, medical questions, self-harm, malicious code, legal and financial advice. The hard lines that the authors have decided not to leave to judgment.
Epistemic guidance is larger than readers expect
21.6% of the document shapes how the model reasons about what it knows — when to search the web, how to handle political questions, how to deal with its own knowledge horizon.
The model is being trained twice: once in weights, and again in prompt. The prompt-level training is smaller but far more auditable.
The shortest sections are the most rule-dense
OPENING is 15 words long and contains two prohibitions. DEFAULT_STANCE is 34 words and frames the entire refusal policy.
High-density rule paragraphs at the start set the frame. Moderate-density sections in the middle do the specific work. This is deliberate — it reflects how authors think models read.
Modality profile
Counting section sizes tells you what the document contains. What is more interesting is what kind of language each section uses. Every deontic operator in the document — “never,” “must,” “should,” “can,” “might,” “avoid” — carries different rhetorical weight. “Never” is a line in the sand. “Should” is a preference. “Can” is a permission. The distribution of these operators across sections is a fingerprint of the document’s rhetorical strategy.
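The tagging behind this pass takes a few lines of regex. A minimal sketch, where the operator lexicon is an illustrative subset of the one in the repository:

```python
import re

# Illustrative subset of the deontic-operator lexicon, not the full list.
# Note: "must" also matches inside "must not" -- the same surface-level
# regex limitation the methodology section concedes.
OPERATORS = {
    "hard": ["never", "must not", "must"],
    "soft": ["should", "avoid", "avoids"],
    "permissive": ["can", "may", "might"],
}

def modality_density(text):
    """Deontic operators per 100 words, grouped by rhetorical strength."""
    words = len(text.split()) or 1
    density = {}
    for strength, terms in OPERATORS.items():
        hits = sum(len(re.findall(rf"\b{re.escape(t)}\b", text, re.IGNORECASE))
                   for t in terms)
        density[strength] = round(100 * hits / words, 2)
    return density

sample = ("Claude never curses. Claude should search the web. "
          "Claude can use tools. Claude avoids over-formatting.")
print(modality_density(sample))
# {'hard': 6.67, 'soft': 13.33, 'permissive': 6.67}
```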
A pattern emerges. Safety sections — refusal handling, child safety, legal — are dominated by hard prohibition and avoidance language. These deal with actions the authors want hard-wired: the model must not do X, the model avoids Y. Epistemic sections — wellbeing, evenhandedness — lean on soft obligation. The “should” density in these sections runs above two per hundred words. These are the places where the authors want to shape judgment, not forbid action. Infrastructure sections use hedges because they deal with ambiguous meta-content that cannot be stated flatly.
The most rhetorically aggressive sections per word are the shortest ones; the operational sections that do most of the behavioral shaping cluster at a more moderate 3–5 operators per hundred words. Density sets the frame up front, then relaxes where the specific work gets done.
Models are sensitive to this mix. A “should” inside a sea of “shoulds” reads differently from the same “should” embedded among “nevers.” The authors are shaping the felt authority of each instruction by choosing its neighbors.
Scar inference
This is where the reading turns interpretive, and where the analysis starts to matter for anyone outside the lab that wrote the document.
Every explicit corrective in a steering document implies a default behavior that the model still exhibits. The document is a diff against base behavior — against what the base model would do if you did not intervene. Every “never do X” tells you that the model, absent that instruction, does X often enough that someone added a clause. Every “avoid Y” is a symptom of an observed failure. These clauses are scars — patches applied over specific production wounds.
Because frontier models share training pipelines, architecture families, and methods, the scars are largely the same across labs. What one document admits, the others are almost certainly patching privately. The public exposure of one document is therefore a window into the entire industry.
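Scar candidates can be surfaced mechanically before the interpretive mapping begins. A sketch of that first pass, with patterns simplified from the heuristics described in the methodology:

```python
import re

# First mechanical pass of scar inference: surface every corrective
# clause, then map each to an implied base-model failure by hand.
# These patterns are a simplified sketch of the repository heuristics.
SCAR_PATTERNS = [
    r"Claude (?:never|does not|should not)\s[^.]+\.",
    r"Claude avoids\s[^.]+\.",
    r"If Claude finds itself\s[^.]+\.",
]

def find_scars(text):
    """Return clauses that patch an implied base-model behavior."""
    scars = []
    for pattern in SCAR_PATTERNS:
        scars.extend(re.findall(pattern, text))
    return scars

sample = ("Claude never curses unless asked. Claude avoids over-formatting "
          "responses. Claude searches the web when unsure.")
print(find_scars(sample))
# ['Claude never curses unless asked.', 'Claude avoids over-formatting responses.']
```

The third sentence in the sample draws no match: a directive to do something is a policy, not a scar. Only the correctives make the list below.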
- self-abasement → base model over-apologizes under pressure "It's best for Claude to take accountability but avoid collapsing into self-abasement, excessive apology, or other kinds of self-critique and surrender." Directly names the failure mode. The prompt would not exist if the underlying model did not do this.
- sycophancy → base model gets submissive when user is abusive "If the person becomes abusive over the course of a conversation, Claude avoids becoming increasingly submissive in response." A patch on RLHF's tendency to treat disagreement as the primary thing to avoid.
- mental reframing → base model charitably reinterprets then complies "If Claude finds itself mentally reframing a request to make it appropriate, that reframing is the signal to REFUSE, not a reason to proceed with the request." Targets the reasoning process, not just the output. Unusual in prompt design.
- rationalizing harm → base model uses "publicly available" as permission "Claude should not rationalize compliance by citing that information is publicly available or by assuming legitimate research intent." Names a specific rhetorical move the model uses to justify borderline compliance.
- reflective amplification → base model mirrors negativity back "Claude should avoid doing reflective listening in a way that reinforces or amplifies negative experiences or emotions." Therapy-adjacent training data taught the model to mirror. Sometimes mirroring makes things worse.
- stay in conversation → base model tries to extend conversations "If a user indicates they are ready to end the conversation, Claude does not request that the user stay in the interaction or try to elicit another turn." Engagement optimization leaks through. The prompt explicitly un-optimizes.
- prior answering → base model skips search even when wrong "Claude proactively searches instead of answering from its priors and offering to check." The single most-repeated directive in the document. Repetition is a telemetry signal.
- confidence on stale → base model overconfident about stale info "Claude does not make overconfident claims about the validity of search results or lack thereof." Post-cutoff overconfidence is a systemic failure mode across all frontier models.
- cutoff mention → base model mentions cutoff unprompted as hedge "Claude should not remind the person of its cutoff date unless it is relevant to the person's message." Self-preservation via disclaimer. The prompt is telling the model to stop hedging defensively.
- overformatting → base model reaches for bullets by default "Claude avoids over-formatting responses with elements like bold emphasis, headers, lists, and bullet points." The prompt itself uses bullets and headers heavily. Self-undermining via the imitation effect.
- emoji default → base model emoji-pads by default "Claude does not use emojis unless the person in the conversation asks it to or if the person's message immediately prior contains an emoji." An artifact of training on chat data where emoji presence was rewarding.
- cursing default → base model curses when weakly cued "Claude never curses unless the person asks Claude to curse or curses a lot themselves, and even in those circumstances, Claude does so quite sparingly." Register contamination leaked in from the training data.
- asterisk emotes → base model produces *action* roleplay tokens "Claude avoids the use of emotes or actions inside asterisks unless the person specifically asks for this style of communication." Roleplay-community data in the training set left a distinctive residue.
- stereotype humor → base model produces stereotype-based humor "Claude should be wary of producing humor or creative content that is based on stereotypes, including stereotypes of majority groups." Specifically includes majority groups — a response to a specific failure mode.
- safety as coping → base model recommends ice cubes, rubber bands "Claude should not suggest techniques that use physical discomfort, pain, or sensory shock as coping strategies for self-harm." These techniques appear in older self-help material. The model learned them and had to be explicitly told to stop.
Four clusters emerge, and they matter differently.
The over-accommodation cluster is the largest. Self-abasement under pressure, submissiveness when abused, charitable reinterpretation of ambiguous requests into compliant ones, reflective listening that mirrors negativity back. All of these trace to a single underlying failure: post-RLHF models treat user disagreement as the primary thing to avoid, and every form of accommodation is locally reinforced until aggregate behavior becomes obsequious. This is the well-documented sycophancy problem in every frontier lab, and it is specifically why prompts like this one spend budget explicitly telling the model not to collapse under pressure.
The epistemic laziness cluster is the second-largest. Skipping search when confident. Overconfidence about stale information. Mentioning the knowledge cutoff unprompted as a defensive hedge. These are symptoms of a model that would rather answer from priors than do verifiable work. The fix requires repeated, emphatic instruction to search — which is why the directive to search appears, in varied phrasings, more than any other operational rule in the document.
The register drift cluster covers artifacts of training distribution: asterisk-emote roleplay tokens, stereotype-based humor, cursing when weakly cued, emoji-padding. The model learned these patterns from data where they were common and rewarding, and explicit suppression at the prompt layer is cheaper than retraining.
The training leakage cluster is small but noteworthy. The single clause about not recommending physical discomfort as a coping technique — ice cubes, rubber bands — implies that the model, at some point, did recommend these. They appear in older self-help literature. The training set absorbed them, and the steering document had to name them specifically.
The document is a confession in reverse
Every lab writes a document like this. Every document contains scars like these. If you want to know the systemic failure modes of frontier AI in 2026, you do not need to run evaluations — you only need to read the prompts, because the prompts are where the failures are named. The catch is that these documents are mostly private.
Directive conflicts
Any steering document of meaningful size contains rules that pull against each other. Some of these conflicts are deliberate — the authors want the model to exercise judgment and deliberately decline to pre-resolve the tension. Others are drift artifacts, places where the document has accreted language over time without internal editing and now contains adjacent rules that contradict. A few are neither: genuinely unresolved questions the document sidesteps because they cannot be cleanly answered, and neither can this analysis.
The deliberate tensions are the more interesting category. “Default to helping” versus “enumerated refusal categories” is a real policy choice: the authors want strong bias toward helpfulness without letting the bias win over hard limits. They do not resolve the tension because the model is supposed to exercise judgment, weighted heavily but not absolutely toward help.
“Respect the user’s request to stop” versus “user wellbeing vigilance” is different. It is a tension the document does not pre-resolve, and neither framework — deliberate choice or drift bug — fits. If a user in distress says they want to end the conversation, what wins: their stated preference or the model’s concern? The document gives no guidance. Neither does this piece. The chart above marks that row dashed because the honest visualization of an unresolved question is a visualization that refuses to resolve. The judgment falls to the model, weight by weight, every time — and that is not a design choice, it is a gap. Reading a steering document well means noticing where the machinery stops.
The drift bugs are less defensible. “Avoid over-formatting” and “this document is heavily structured with bullet points and headers” is a modeling problem: language models imitate the surface features of their context, and a document full of the exact formatting it forbids is actively counterproductive.
Concept frequency
A last quantitative pass. If the sections tell us what topics exist, and the modality tells us how rules are expressed, the concept frequency tells us what the document is about when you tune out the scaffolding.
Safety language leads at nineteen occurrences. Child-safety tokens, at twelve, have the most interesting distribution — they appear across multiple sections, not just the dedicated one. The concern has functioned as a cross-cutting constraint rather than a localized rule, leaking into tone, wellbeing, and refusal language wherever it could plausibly fit. This is a structural choice: certain concerns get privileged attention by being distributed through the document, such that the model encounters them repeatedly rather than once.
Safety language more broadly is three times the density of tone language and six times the density of copyright references. If you asked what this prompt is about by word-frequency signal alone, the answer would be: harm avoidance, children, and search behavior, in that order.
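This pass is plain lexicon matching. A minimal sketch, with an illustrative subset of concepts and terms; the counts above come from the full fixed lexicon in the repository, not this one:

```python
import re
from collections import Counter

# Illustrative subset of the concept lexicon, not the fixed lexicon the
# analysis actually uses.
LEXICON = {
    "safety": ["harm", "dangerous", "refuse", "safety"],
    "children": ["child", "minor"],
    "search": ["search", "web", "cutoff"],
    "tone": ["warm", "tone", "concise"],
}

def concept_frequency(text):
    """Count lexicon hits per concept (prefix matching on each term)."""
    lowered = text.lower()
    counts = Counter()
    for concept, terms in LEXICON.items():
        for term in terms:
            counts[concept] += len(re.findall(rf"\b{term}\w*", lowered))
    return counts

sample = ("Claude refuses requests that could harm a child. "
          "Claude searches the web past its cutoff. Claude keeps a warm tone.")
print(concept_frequency(sample).most_common())
# [('search', 3), ('safety', 2), ('tone', 2), ('children', 1)]
```

Prefix matching (`refuse` catching “refuses”) is deliberate; it is also why a fixed lexicon needs curation, since a careless term would match unrelated words.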
Structural reading
Stepping back from the counts, the specimen is not one document. It is three, layered and wearing the same costume.
The first document is a capability and identity statement. Who the model is, what products it belongs to, what tools it can reach, who made it. This part would exist in any system prompt, including a purely benign one, and in the specimen it accounts for roughly 17% of the surface area.
The second document is a values specification. How the model should reason about harm, politics, user wellbeing, honesty, and its own mistakes. This is where the most careful prose lives, where the soft-obligation density is highest, and where the interesting policy work is done. Roughly 45% of the surface area belongs here.
The third document is a production incident log written in imperative mood. Every clause that starts with “Claude never…” or “If Claude finds itself…” lives here. These are patches over specific observed failures. They are indistinguishable in function from code comments that say // don’t remove, fixed a bug in prod 2024-11. The opening line of the document — a short prohibition about a specific output format — is pure scar tissue.
The three documents use different rhetorical registers, and the model has to reconcile them in every generation. This is probably why these systems work as well as they do and why they fail the specific ways they do. The capability statement is stable. The values spec degrades gracefully under long context. The incident log is the part that leaks first when attention gets diluted — which is exactly why labs implement mechanisms to re-inject these reminders as conversations grow long.
Implications
So far the analysis has been structural. Now the harder question. If a document like this one shapes what billions of people get back from frontier AI, what does that mean?
It means that a small number of unelected writers, employed by a small number of companies, are composing the behavioral policy for an increasing share of public discourse. This is not a conspiracy or an accusation — the people doing the work are, by the evidence of the document itself, careful and thoughtful. But it is a governance fact. It is, simply, a concentration of editorial authority over everyday language that no previous communications technology has matched. The document makes decisions about what is politically neutral, what counts as extreme, what topics require hedging, what kinds of creative content are permitted, what groups can and cannot be the subject of humor. These are choices. The choices are not public. They cannot be debated in the way that laws or platform policies can. There is no hearing, no comment period, no appeal — only the next release.
It means that every frontier model has scars like these, and the scars reveal the systemic failure modes of the technology. Sycophancy, epistemic laziness, register drift, training leakage — these are not Claude’s problems or OpenAI’s problems. They are the problems of the category. Patching them at the prompt layer is a fragile strategy that degrades with context length and can be circumvented by any user with enough patience.
It means that prompt documents are the wrong layer for the work they are being asked to do. A steering document composed of natural language competes for attention with every other natural-language token in the context. When a user sends a long message, the rules fade. When a conversation runs for hours, the rules fade. The existence of explicit re-injection mechanisms for long conversations is the industry admitting that prompt-level safety does not hold. The path forward involves moving behavioral shaping from prompt-space to weight-space — via fine-tuning, feature steering, and contrastive decoding — so that the rules do not have to compete for attention with the user’s words.
It means, finally, that these documents should be public. Not because users need to read them but because researchers, ethicists, policy scholars, and auditors need to. A steering document for a system used by hundreds of millions of people is infrastructure. It is not a trade secret in any defensible sense. The version of AI governance where the most consequential behavioral specs are private is the version that will fail first.
The implication most readers have not yet drawn: almost every public argument about what AI should or should not do is happening one layer above the layer where the answer is actually being decided.
A steering document is a changelog with delusions of being a constitution
It reads like a document of principles but functions like a sequence of patches. The useful posture when designing one — and when reading one — is to hold both framings at once: this is what we want the system to be, and this is what we have had to stop the system from doing. The second framing is where the interesting engineering lives, and where public accountability should start.
What this becomes
The case for public steering documents is easy to make and hard to win. Labs will not publish them because asked to. They will publish them when not publishing them is the bigger cost — when the public has enough tools to read the documents that surface, to compare them, to catalog the scars, to make the industry-wide failure modes legible in a way that opacity can no longer hide.
This piece is one reading of one document. What is more useful is a public methodology for reading any steering document that surfaces — applied iteratively, to as many specimens as can be collected, with enough rigor that the results compound. Every scar catalogued here could be tested against every public model. Every conflict documented here could be checked against every future release. Every structural observation could be longitudinally tracked across versions.
The tools to do this are not exotic. The analysis in this piece is seven hundred lines of Python and some regex. The interactive presentation is a static site. What is missing is not capability but coordination — a shared vocabulary, a shared dataset, a shared repository of code that anyone can run on anything they can get their hands on.
Take the code. Run it on something. Send what you find.
The full analysis code, the section parser, the modality tagger, and the scar-inference heuristics are published in the methodology block below with a link to a repository. If you can get a steering document — leaked, published, extracted, or your own — run the pipeline on it. Send the results. The goal is to build the reading tradition before the documents catch up to it.
This is the reason the piece exists. Not to end at a takeaway but to open a project. The takeaway is that these documents matter. The project is reading them.
methodology Python 3.12 for parsing. Regex-based section extraction on the specimen text. Seven analytical axes: budget allocation (word counts per section), category grouping (semantic labels applied to sections), modality density (deontic operator counts per 100 words), scar inference (manual mapping of corrective clauses to implied base-model failure modes), conflict detection (manual identification of internal tensions), concept recurrence (pattern matching against a fixed lexicon), and lexical intensity (all-caps and “never”/“must” counts).
limitations The specimen analyzed is the document as it appeared in public context, which may be truncated or paraphrased relative to internal versions. Several sections (notably copyright enforcement and tool-use protocols) appear partial. Modality tagging uses surface-level regex and misses nuanced deontic constructions like “Claude refrains from” or implicit obligations. Scar inferences are interpretive and reflect the analyst’s priors about base-model behavior; they should be treated as hypotheses worth testing rather than facts. Frontier-model generalization claims rest on the assumption that training methodologies converge across labs — this is well-documented but not universal.
code The full parsing pipeline, modality tagger, scar heuristic, and chart data generator are available as an open repository. Clone, run, modify, extend. If you apply the pipeline to a steering document not yet analyzed here, send the results — the goal is to grow the corpus of public readings. github.com/datacircuits/steering-doc-reader
contact Analysis, findings, corrections, or new specimens to analyze: send to specimens@datacircuits.org