Methodology · Sovereign Speech Index

Sentence extraction

Every speech in the Index is split into individual sentences, and the sentence serves as the unit of analysis. A sentence is a tractable, recognisable unit: it has clear boundaries (punctuation), it is short enough that a single dominant frame usually applies, and it maps cleanly to the way readers and listeners actually parse rhetoric.

An alternative unit — the quasi-sentence used by parts of the political-text-analysis literature — splits sentences further whenever a single sentence carries more than one distinct argument. Quasi-sentences are useful in principle but fraught in practice: their boundaries are subjective, two coders rarely segment a long sentence the same way, and the resulting tag fragments are hard to compare across a corpus that already spans 240 years of evolving prose. The Index therefore stays at the sentence level and tags a sentence by its dominant frame.

Bottom-up AI classification

For each dimension, every sentence is passed to a frontier language model with a fixed rubric (see below). The model returns one tag per dimension, or null when no tag fits. Classification runs in batches of 40 sentences per request to limit drop-off in long requests, with a single retry when the number of tags returned does not match the number of sentences sent. Tags are stored alongside the sentence text in each speech's JSON file.

The rubric is identical for every model call. Speakers, dates, and external metadata are deliberately withheld from the prompt — the tag must come from the sentence in front of the model, not from priors about who is speaking.
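The batching and retry behaviour can be sketched as follows. This is a minimal sketch under stated assumptions, not the Index's actual pipeline: classify_batch is a hypothetical stand-in for the model call, "length mismatch" is read as "the number of returned tags differs from the number of sentences sent", and leaving a batch untagged after a failed retry is also an assumption. Only sentence text reaches the classifier, in line with withholding speaker and date metadata.

```python
from typing import Callable, List, Optional

BATCH_SIZE = 40  # sentences per request, as stated in the methodology

Tag = Optional[str]  # None stands for "no tag fits"


def chunk(items: List[str], size: int = BATCH_SIZE) -> List[List[str]]:
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def classify_sentences(
    sentences: List[str],
    classify_batch: Callable[[List[str]], List[Tag]],  # hypothetical model call
) -> List[Tag]:
    """Tag every sentence for one dimension, batch by batch."""
    tags: List[Tag] = []
    for batch in chunk(sentences):
        result = classify_batch(batch)
        if len(result) != len(batch):
            # Assumed meaning of "length mismatch": retry the batch once.
            result = classify_batch(batch)
        if len(result) != len(batch):
            # Assumption: after the single retry, leave the batch untagged.
            result = [None] * len(batch)
        tags.extend(result)
    return tags
```

Checking the length of each response before accepting it keeps tags aligned with their sentences, so one bad batch cannot shift every later tag by a position.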

Top-down human validation

Where resources allow, a sample of speeches is also tagged by human reviewers using the same rubric. Reviewers see the sentence and the rubric definitions but not the AI's tag.

Reconciliation

AI and human tags are then compared sentence by sentence. A sentence's dominant classification is the one both methods agree on; disagreements are flagged for re-tagging at scale and used to refine the rubric for the next pass. The Index is therefore a moving artefact — rubrics get tighter as the corpus grows.
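A sentence-by-sentence comparison of this kind might look like the sketch below. The function name reconcile and the return shape are hypothetical, and counting two null tags as agreement is an assumption rather than something the methodology states.

```python
from typing import Dict, List, Optional, Tuple

Tag = Optional[str]  # None stands for "no tag fits"


def reconcile(ai_tags: List[Tag], human_tags: List[Tag]) -> Tuple[Dict[int, Tag], List[int]]:
    """Compare two tag lists for the same sentences, position by position.

    Returns (agreed, flagged): `agreed` maps sentence index to the shared tag,
    `flagged` lists the indices where the two methods disagree.
    """
    if len(ai_tags) != len(human_tags):
        raise ValueError("both tag lists must cover the same sentences")
    agreed: Dict[int, Tag] = {}
    flagged: List[int] = []
    for i, (ai, human) in enumerate(zip(ai_tags, human_tags)):
        if ai == human:
            agreed[i] = ai  # assumption: two None tags count as agreement
        else:
            flagged.append(i)
    return agreed, flagged


# Illustrative tags only, not taken from the Index.
ai = ["Past", "Future", None, "Present"]
human = ["Past", "Present", None, "Present"]
agreed, flagged = reconcile(ai, human)
agreement_rate = len(agreed) / len(ai)
```

The flagged indices are the sentences that would go back for re-tagging; the agreement rate is one simple way to track whether a rubric revision helped.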

Dimensions and rubrics

Each sentence is tagged on six dimensions. Use the options below as the working rubric.

Time orientation

Time orientation reveals whether a leader is framing the moment as one of inheritance, of present condition, or of imminent change — a quick proxy for whether a speech is defending a record, asserting authority, or making a promise.

Past

Apply when the sentence anchors its main claim in a specific historical event, prior decision, or pre-existing condition. Markers: was / were / did / inherited / since / in [past year]. Example: "We inherited a crisis last winter."

Present

Apply when the sentence describes a state of affairs asserted as currently true at the moment of speaking. Markers: is / are / now / today / at this moment. Example: "Inflation is plummeting."

Future

Apply when the sentence commits to, predicts, or warns about a state of affairs yet to come. Markers: will / shall / going to / next year / by [future year]. Example: "We will rebuild this nation."

Expression (speech act)

Expression classifies each sentence by its rhetorical function: factual claim, binding commitment, or call to act. The mix exposes how much of a leader's discourse is descriptive, performative, or mobilising.

Assertion

Apply when the sentence states a fact, condition, judgment, or characterisation without binding the speaker to action. Markers: declarative verbs (is / has / represents / means). Example: "America is the most innovative nation on earth."

Commitment

Apply when the sentence binds the speaker or their government to a specific future course of action. Markers: we will / I pledge / this administration will / we commit to. Example: "We will deliver universal childcare within four years."

Call to action

Apply when the sentence asks the audience — Congress, citizens, allies — to take a specific action. Markers: imperatives, let us / I ask / I urge / send / stand up. Example: "I ask Congress to pass this bill before recess."

Stance to addressee

Stance captures the power posture a speaker takes toward the audience — humble petitioner, equal collaborator, or judging authority. It is how the speaker positions themselves relative to those who can act on the message.

Deferential

Apply when the speaker positions themselves below the addressee — thanks, tribute, apology, request, deference to higher authority. Markers: I thank / with humility / I ask for / Mr Speaker / Your Majesty. Example: "I am honoured to stand before this distinguished body."

Coordinate

Apply when the speaker stands alongside the addressee as a peer — shared identity, common cause, joint endeavour. Markers: we / together / our nation / side by side, inclusive pronouns. Example: "Together we will face this challenge."

Dominant

Apply when the speaker positions themselves above an opponent, institution, or rival — judgment, condemnation, asserting authority. Markers: I refuse / they must / I will not allow / no one can. Example: "I will not negotiate with those who would dismantle our democracy."

Agency

Agency identifies who the speech casts as the actor. Patterns here reveal whether a leader is centring their own state, building solidarity with allies, or focusing the audience on adversaries.

Nation

Apply when the principal actor (subject/agent) is the speaker's own nation, government, party, military, or people. Markers: America / our country / this administration / the United States. Example: "America led the world out of recession."

Ally

Apply when the principal actor is a partner, friend, or named ally — a foreign government on the speaker's side. Markers: named allied countries, our partners / NATO / the British / Japan. Example: "Our allies in Europe have stepped up sanctions enforcement."

Adversary

Apply when the principal actor is a rival, threat, or named adversary. Markers: named hostile actors, the regime in / terrorists / Beijing / Moscow / cartels. Example: "Russia continues to escalate its aggression in Ukraine."

Reference

Reference tracks the scope a leader speaks to — domestic, bilateral, or regional/global. It exposes whether a speech is internally focused or projecting outward, and whose attention it seeks.

Domestic

Apply when the sentence concerns the speaker's own country's internal affairs, citizens, institutions, or economy. Markers: American workers / our schools / Main Street / this nation. Example: "Our schools need every dollar we can spare."

Bilateral

Apply when the sentence concerns the relationship between the speaker's country and ONE other specifically named country. Markers: with China / between the US and Japan, named pair. Example: "We will negotiate a new trade agreement with Mexico."

Regional / Global

Apply when the sentence concerns a region, bloc, multilateral institution, or the world as a whole. Markers: the world / Europe / Asia-Pacific / UN / G7 / international community. Example: "The free world must stand together against authoritarianism."

Capability (GINC National Capability Framework, nine domains)

Capability maps each sentence onto the nine domains of the GINC National Capability Framework. It surfaces which levers of national power — hard, soft, or economic — a leader is signalling intent to invest in, defend, or wield.

CT · Hard

Critical Technology

Apply when the sentence concerns strategically important advanced or emerging technology: AI, semiconductors, quantum, biotech, advanced materials, dual-use R&D. Markers: artificial intelligence / chips / frontier technology / compute. Example: "We will lead the world in AI and semiconductor manufacturing."

SI · Hard

Strategic Infrastructure

Apply when the sentence concerns the physical sinews of national power: energy grid, transport, broadband, ports, pipelines, supply chains, data centres. Markers: infrastructure / grid / broadband / ports / rail. Example: "We are rebuilding America's highways, bridges, and broadband."

NS · Hard

National Security

Apply when the sentence concerns military force, defence, deterrence, intelligence, security operations, terrorism, or alliances framed militarily. Markers: armed forces / deterrence / troops / weapons / NATO. Example: "Our armed forces remain the most lethal fighting force in history."

HC · Soft

Human Capital

Apply when the sentence concerns education, health, workforce skills, demographics, or immigration as talent flow. Markers: schools / universities / skills / healthcare / workforce. Example: "Every American child deserves a world-class education."

II · Soft

Information & Influence

Apply when the sentence concerns media, culture, narrative-shaping diplomacy, broadcasting, or civilisational positioning. Markers: values / narrative / media / Voice of America / soft power. Example: "We must counter disinformation with the truth."

GI · Soft

Governance & Integrity

Apply when the sentence concerns rule of law, institutions, courts, corruption, elections, democracy, accountability, or leadership integrity. Markers: rule of law / Constitution / courts / democracy / free and fair elections. Example: "The integrity of our elections is non-negotiable."

FS · Economic

Financial Strength

Apply when the sentence concerns currency, budgets, debt, deficits, fiscal or monetary policy, reserves, taxes, or inflation framed monetarily. Markers: deficit / debt / interest rates / Fed / balance the budget. Example: "We will balance the budget within ten years."

PI · Economic

Productivity & Innovation

Apply when the sentence concerns industry, manufacturing, factories, R&D investment, productivity, or supply-side reform. Markers: manufacturing / made in America / factories / productivity / small business. Example: "American manufacturing is roaring back."

TI · Economic

Trade & Investment

Apply when the sentence concerns tariffs, trade deals, exports/imports, foreign direct investment, sanctions, or currency as a trade weapon. Markers: trade deal / tariff / exports / sanctions / FDI. Example: "We will impose tariffs on any nation that dumps cheap steel into our markets."
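The six dimensions and their options can be written down as a lookup table for checking stored tags. The sketch below is one possible layout: the tag names come from the rubric above, while the dictionary keys, the validate helper, and the example record are hypothetical and not the Index's actual JSON schema.

```python
from typing import Dict, Optional, Tuple

# Allowed tags per dimension, copied from the rubric. Key names are assumed.
RUBRIC: Dict[str, Tuple[str, ...]] = {
    "time_orientation": ("Past", "Present", "Future"),
    "expression": ("Assertion", "Commitment", "Call to action"),
    "stance": ("Deferential", "Coordinate", "Dominant"),
    "agency": ("Nation", "Ally", "Adversary"),
    "reference": ("Domestic", "Bilateral", "Regional / Global"),
    "capability": ("CT", "SI", "NS", "HC", "II", "GI", "FS", "PI", "TI"),
}

# Grouping of the nine capability domains, as labelled in the rubric.
CAPABILITY_GROUP: Dict[str, str] = {
    "CT": "Hard", "SI": "Hard", "NS": "Hard",
    "HC": "Soft", "II": "Soft", "GI": "Soft",
    "FS": "Economic", "PI": "Economic", "TI": "Economic",
}


def validate(record: Dict[str, Optional[str]]) -> bool:
    """True if the record covers all six dimensions with allowed tags or None."""
    if set(record) != set(RUBRIC):
        return False
    return all(tag is None or tag in RUBRIC[dim] for dim, tag in record.items())


# Illustrative tagging of "We will negotiate a new trade agreement with Mexico."
example = {
    "time_orientation": "Future",
    "expression": "Commitment",
    "stance": None,  # None stands for "no tag fits"
    "agency": "Nation",
    "reference": "Bilateral",
    "capability": "TI",
}
is_valid = validate(example)
```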

Reading scores

Four numeric scores characterise the texture of the prose itself, independent of the sentence-level tags. Each is computed per sentence, and the speech-level score is the mean across all sentences in the address. Year and corpus aggregates then weight each speech by its word count, so longer addresses carry more weight in the trend. Empty, single-token, or content-less sentences are skipped.
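The two levels of aggregation can be illustrated with a short sketch. The function names and the numbers are hypothetical; the arithmetic follows the description above: an unweighted mean within a speech, then a word-count-weighted mean across speeches.

```python
from statistics import mean
from typing import List, Tuple


def speech_score(sentence_scores: List[float]) -> float:
    """Speech-level score: unweighted mean of the per-sentence scores."""
    return mean(sentence_scores)


def weighted_aggregate(speeches: List[Tuple[float, int]]) -> float:
    """Year or corpus aggregate: speech scores weighted by word count.

    `speeches` holds (speech_score, word_count) pairs.
    """
    total_words = sum(words for _, words in speeches)
    return sum(score * words for score, words in speeches) / total_words


# Illustrative numbers only, not taken from the Index.
short_speech = (speech_score([70.0, 80.0]), 1000)  # speech score 75.0
long_speech = (speech_score([40.0, 50.0]), 3000)   # speech score 45.0
year_score = weighted_aggregate([short_speech, long_speech])  # 52.5
```

An unweighted mean of the two speeches would give 60.0; weighting by word count pulls the year figure toward the longer address.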

Readability

How approachable is the language? Higher scores mean easier reading — broadcast-era addresses sit above 70; 19th-century formal prose can drop below 40.

Reading Ease = 206.835 − 1.015 · (words / sentences) − 84.6 · (syllables / words)

Source: Flesch Reading Ease formula (1948).
Per sentence: The formula is applied to each sentence treated as a one-sentence document. words/sentences equals the sentence's token count; syllables/words is the mean syllables-per-token.
Syllables: Estimated by a vowel-group heuristic with the standard silent-e correction — accurate enough for cross-speech comparison but not a substitute for a pronouncing dictionary.
Speech score: Arithmetic mean of the per-sentence Reading Ease values.
Speech metadata: The stored metrics.readability object also keeps the median, sd, min, max, and n (sentences scored) so downstream charts can show distribution, not just mean.
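The per-sentence calculation can be sketched as below. The exact syllable heuristic is not specified in this document, so count_syllables is one plausible vowel-group implementation with a silent-e correction, and the letters-only tokenizer is likewise an assumption.

```python
import re
from statistics import mean
from typing import List

WORD_RE = re.compile(r"[A-Za-z']+")  # assumption: a simple letters-only tokenizer


def count_syllables(word: str) -> int:
    """Estimate syllables as vowel groups, minus one for a silent final 'e'."""
    w = re.sub(r"[^a-z]", "", word.lower())
    if not w:
        return 0
    groups = len(re.findall(r"[aeiouy]+", w))
    # Silent-e correction: "make" has two vowel groups but one syllable.
    # Words ending in "le" ("table") keep their final syllable.
    if w.endswith("e") and not w.endswith("le") and groups > 1:
        groups -= 1
    return max(groups, 1)


def reading_ease(sentence: str) -> float:
    """Flesch Reading Ease for one sentence treated as a one-sentence document."""
    words = WORD_RE.findall(sentence)
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * len(words) - 84.6 * (syllables / len(words))


def speech_reading_ease(sentences: List[str]) -> float:
    """Speech score: mean of per-sentence values, skipping empty and single-token sentences."""
    scored = [reading_ease(s) for s in sentences if len(WORD_RE.findall(s)) > 1]
    return mean(scored)


# 5 words, 7 estimated syllables: 206.835 - 5.075 - 118.44, about 83.3
score = reading_ease("We will rebuild this nation.")
```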

Grade level

What reading grade does the text demand? Lower is easier — modern campaign rhetoric runs 6–8; 19th-century inaugural addresses can demand 16+.

Grade Level = 0.39 · (words / sentences) + 11.8 · (syllables / words) − 15.59

Source: Flesch–Kincaid Grade Level formula (Kincaid et al., 1975).
Per sentence: Applied to each sentence individually with the same syllable heuristic as Readability.
Speech score: Arithmetic mean of the per-sentence grade levels. Negative values can occur for very short sentences with monosyllabic words and are kept as-is rather than floored.
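Relation: Reading Ease and Grade Level move in opposite directions on the same inputs. They are reported separately because the two formulas, both linear, weight those inputs differently: relative to sentence length, syllables per word counts roughly 83 times as much in Reading Ease (84.6 / 1.015) but only about 30 times as much in Grade Level (11.8 / 0.39). The two scores therefore carry slightly different information about long, latinate sentences vs. long, simple ones.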
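A short worked comparison shows both formulas on the same per-sentence inputs, including the negative grade level that a very short monosyllabic sentence produces. The function names are hypothetical and the word and syllable counts are illustrative, not drawn from the corpus.

```python
def grade_level(words: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level for one sentence (words / sentences = words)."""
    return 0.39 * words + 11.8 * (syllables / words) - 15.59


def reading_ease(words: int, syllables: int) -> float:
    """Flesch Reading Ease for one sentence, on the same two inputs."""
    return 206.835 - 1.015 * words - 84.6 * (syllables / words)


# Very short, monosyllabic sentence: 3 words, 3 syllables.
short_gl = grade_level(3, 3)           # 1.17 + 11.8 - 15.59 = -2.62, kept as-is

# Long, simple sentence: 30 words, 36 syllables (1.2 per word).
simple_gl = grade_level(30, 36)        # 11.7 + 14.16 - 15.59 = 10.27
simple_re = reading_ease(30, 36)       # 206.835 - 30.45 - 101.52 = 74.865

# Long, latinate sentence: 30 words, 60 syllables (2.0 per word).
latinate_gl = grade_level(30, 60)      # 11.7 + 23.6 - 15.59 = 19.71
latinate_re = reading_ease(30, 60)     # 206.835 - 30.45 - 169.2 = 7.185
```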

Syntax

How structurally complex is each sentence? Measured as mean dependency distance — how far, on average, a word sits from the word it modifies. Higher = nested clauses, longer reach.

Syntax = mean( |token.i − head.i| ) over all non-root tokens in the sentence

Source: spaCy en_core_web_sm dependency parse.
Per sentence: For each non-root token in the parse tree, take the absolute distance (in tokens) between the token's position and its syntactic head's position. The mean across all non-root tokens is the sentence's syntax score.
Why this measure: Long-distance dependencies signal embedded clauses, parentheticals, and front-loaded modifiers — the structural fingerprints of formal 19th-century rhetoric. Modern broadcast speech keeps dependencies short.
Speech score: Arithmetic mean of the per-sentence dependency distances.
Skipped: Empty sentences and single-token sentences (no dependencies to measure).

Vocabulary

How rare is the lexicon? Measured as average rarity of content words — common-word speeches score low; latinate, jargon-rich, or archaic speeches score high.

Vocabulary = 7 − mean( Zipf-frequency( content word ) )

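Source: wordfreq Python package; Zipf scale (log10 of occurrences per billion words, so a value of 7 means about one occurrence per hundred words, the range of the most frequent function words such as "the").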
Content words: Only NOUN, PROPN, VERB, ADJ, and ADV tokens count (per the spaCy POS tag) — function words like determiners and prepositions are excluded so the score reflects the speaker's lexical reach, not grammatical fillers.
Per sentence: Look up each content word's Zipf frequency, take the arithmetic mean, then subtract from 7 so that higher means rarer (matching the direction of "harder").
Speech score: Arithmetic mean of the per-sentence rarity values.
Skipped: Sentences with no content words (rare — interjections, dates, single-name vocatives).
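The rarity calculation can be sketched with the frequency lookup passed in as a function, which keeps the example runnable without the wordfreq package. The name vocabulary_score is hypothetical, and the Zipf values in the stub table are invented placeholders, not real wordfreq output. In practice the lookup could be lambda w: wordfreq.zipf_frequency(w, "en").

```python
from statistics import mean
from typing import Callable, List, Optional, Tuple

CONTENT_POS = {"NOUN", "PROPN", "VERB", "ADJ", "ADV"}


def vocabulary_score(
    tokens: List[Tuple[str, str]],  # (word, POS tag) pairs
    zipf: Callable[[str], float],   # word -> Zipf frequency
) -> Optional[float]:
    """7 minus the mean Zipf frequency of the sentence's content words."""
    content = [word.lower() for word, pos in tokens if pos in CONTENT_POS]
    if not content:
        return None  # skipped: no content words
    return 7 - mean([zipf(word) for word in content])


# Placeholder frequencies for illustration only.
STUB_ZIPF = {"inflation": 4.5, "plummeting": 3.0}

tokens = [
    ("Inflation", "NOUN"),
    ("is", "AUX"),       # function word, excluded
    ("plummeting", "VERB"),
    (".", "PUNCT"),
]
score = vocabulary_score(tokens, lambda w: STUB_ZIPF[w])  # 7 - (4.5 + 3.0) / 2 = 3.25
```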

Sources

Transcripts are sourced from public archives. In addition to those listed below, GINC sources transcripts from public addresses to top up the index where useful coverage is missing.