Fascist Language Analyzer — Chapter 2 (LangExtract Deep Dive)

Chapter 1 is preserved in README_chapter1.md and captures the original trait-first analysis pass over Project 2025 using LangChain.

This Chapter 2 README reframes the project around the entity extraction stack (langextract) and explains how entities connect back to the Chapter 1 Ur-Fascism rhetorical analysis.

Chapter 1 Snapshot

Conclusion: The strongest rhetorical pattern in this corpus concentrates around Selective Populism and Obsession-with-a-Plot dynamics, with institutional actors frequently linked to those frames.
LangChain: Chapter 1 used schema-constrained LangChain analysis to classify quote-level evidence into Eco’s 14 Ur-Fascism properties with confidence scoring.
LangExtract: Chapter 2 extends that foundation by extracting and normalizing grounded entities (people, agencies, organizations, programs, laws, locations) for cross-analysis.
Entity ↔ Theme Bridge: The current release links entities to theme-bearing passages with evidence, chunk provenance, and optional normalized scoring (raw, lift, pmi).
Full Chapter 1: The long-form narrative, methodology, and original framing remain in README_chapter1.md.

Table of Contents

Chapter 1 Snapshot
Open the Web App
Web App Functions
What Chapter 2 Covers
System Overview
Analysis Highlights (First Release)
LangExtract Deep Dive
Entity ↔ Theme Linking (New in Chapter 2)
LangChain Summary (Condensed)
Setup & Run (Chapter 2)
GitHub Pages Artifact Sync
Known Limits (Chapter 2)
Next Merge-Ready Targets

Open the Web App

Live app (GitHub Pages):

https://andyed.github.io/fascist-language-analyzer/

Direct routes:

Theme browser: https://andyed.github.io/fascist-language-analyzer/#/themes
Entity browser: https://andyed.github.io/fascist-language-analyzer/#/entities
Entity ↔ Theme graph: https://andyed.github.io/fascist-language-analyzer/#/entity-themes

Local dev app:

Run npm run dev --prefix web
Open http://localhost:3001/

Web App Functions

The web app has four core views:

Rhetoric Graph (#/)

Interactive network of chunk-to-theme relationships.
Useful for seeing macro rhetorical structure.

Theme Browser (#/themes, #/theme/:traitId)

Theme-by-theme exploration of extracted quotes, explanations, and confidence.
Fast path for reviewing evidence by Eco trait.

Entity Browser (#/entities, #/entity/:entityId)

Canonicalized entity index (people, agencies, organizations, policies, legal refs, locations).
Entity detail pages include highlighted context snippets and mention counts.

Entity ↔ Theme View (#/entity-themes)

Cross-links entities to themes using co-mention evidence.
Supports score inspection (raw, lift, pmi in data), edge filtering, and quote provenance (chunk_id, source link).

What Chapter 2 Covers

A technical deep dive into the LangExtract pipeline.
A condensed summary of the existing LangChain trait-classification system.
The static + interactive publish flow for themes, entities, and entity↔theme links.

System Overview

The repo now has two complementary tracks:

Trait analysis track (LangChain)
- Input: data/project_2025.txt
- Output: data/analysis_results.json
- Purpose: classify quotes into Eco’s 14 Ur-Fascism properties.

Entity extraction track (LangExtract)
- Input: data/project_2025.txt
- Outputs:
  - data/entities_langextract.jsonl
  - data/entities_langextract.normalized*.jsonl
  - data/entities_langextract.normalization_report*.json
- Purpose: extract grounded named entities, policies, laws, and locations, and normalize their aliases.

These tracks are fused in Chapter 2 visualizations through scripts/build_entity_theme_links.py.

Analysis Highlights (First Release)

This first release uses --score-mode raw for ranking in entity↔theme links, with lift/PMI available for analysis.

Dominant linked themes (by aggregate edge weight) are Selective Populism, Obsession with a Plot, and Machismo and Weaponry.
Relationship mass is led by government_agency entities; location and person classes are secondary contributors.
A recurring narrative pattern frames government organizations and administrative bodies as impediments to democratic intent, often paired with calls to centralize authority under aligned executive control.
Top recurring entity-side terms include broad governance labels (for example State, Administration, President, department), indicating strong institutional framing in theme-bearing passages.
Strongest edges repeatedly connect executive/administrative actors to Selective Populism and Obsession-with-a-Plot language.
Each edge now carries evidence with quote, chunk_id, and source_url for auditability.

Quick interpretation guidance:

Treat this as a macro rhetorical map first, not final actor attribution.
Generic high-frequency entities can dominate raw rankings; lift/PMI is included to help de-bias follow-up analysis.
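To make the raw versus lift/PMI distinction concrete, here is a minimal sketch that uses the standard textbook definitions of lift and pointwise mutual information over co-mention counts. The function name edge_scores and its arguments are hypothetical, and the formulas in scripts/build_entity_theme_links.py may differ (for example, through confidence weighting or smoothing).

```python
import math


def edge_scores(co_count, entity_count, theme_count, total):
    """Return raw, lift, and PMI scores for one entity-theme pair.

    co_count:     passages that mention the entity AND carry the theme
    entity_count: passages that mention the entity
    theme_count:  passages that carry the theme
    total:        all passages considered
    (All four names are illustrative, not taken from the repo.)
    """
    if min(co_count, entity_count, theme_count, total) <= 0:
        raise ValueError("all counts must be positive")
    # lift = P(entity, theme) / (P(entity) * P(theme)), written with counts
    lift = (co_count * total) / (entity_count * theme_count)
    return {"raw": co_count, "lift": lift, "pmi": math.log2(lift)}


# A generic, high-frequency entity versus a rarer, more specific one.
generic = edge_scores(co_count=40, entity_count=400, theme_count=100, total=1000)
specific = edge_scores(co_count=10, entity_count=20, theme_count=100, total=1000)
```

In this toy case the generic entity wins on raw count (40 versus 10) but has a lift of 1.0, meaning no association beyond chance, while the specific entity has a lift of 5.0. That is the kind of re-ranking the normalized modes are meant to surface.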

LangExtract Deep Dive

1) Extraction Contract

Extraction is constrained to six classes in scripts/extract_entities_langextract.py:

person
organization
government_agency
policy_program
legal_reference
location

The prompt enforces:

exact span grounding (no paraphrase)
in-order extraction
non-overlapping spans
optional lightweight attributes for graph utility
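The grounding rules above can also be checked after extraction. The following is a minimal sketch of such a check, assuming each extraction is available as a (start, end, text) tuple; validate_spans is a hypothetical helper, not a function from the repo or from LangExtract.

```python
def validate_spans(source, extractions):
    """Return a list of contract violations for (start, end, text) spans.

    Checks three of the prompt rules: exact span grounding, in-order
    extraction, and non-overlapping spans.
    """
    problems = []
    prev_end = 0
    for i, (start, end, text) in enumerate(extractions):
        # Exact grounding: the text must be a verbatim slice of the source.
        if source[start:end] != text:
            problems.append(f"{i}: text is not an exact span of the source")
        # Ordering and overlap: each span must start at or after the
        # end of every span seen so far.
        if start < prev_end:
            problems.append(f"{i}: span overlaps or precedes an earlier span")
        prev_end = max(prev_end, end)
    return problems


source = "The Department of Justice and the FBI report to the President."
doj = source.index("Department of Justice")
fbi = source.index("FBI")
good = [
    (doj, doj + len("Department of Justice"), "Department of Justice"),
    (fbi, fbi + len("FBI"), "FBI"),
]
print(validate_spans(source, good))
```

A paraphrased span such as "Justice Department" fails the first check because it is not a verbatim slice of the source text.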

2) Few-Shot Biasing

EXAMPLES in scripts/extract_entities_langextract.py acts as a schema prior:

demonstrates span precision
demonstrates class boundaries
demonstrates attribute style

This is the main guardrail that keeps extraction coherent across long political prose.

3) Long-Document Strategy

Key extraction parameters:

--extraction-passes (default 2): multiple sweeps for higher recall
--max-workers (default 20): parallelization
--batch-length (default 20): chunk processing cadence
--max-char-buffer (default 1200): local context radius

Operationally, --max-char-buffer governs a recall/precision trade-off: higher values give the model more surrounding text for disambiguation but can increase noisy captures in dense lists.
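To illustrate what the buffer size controls, here is a minimal sketch of sentence-aware chunking under a character budget. It is not LangExtract's internal chunker; chunk_text is a hypothetical helper that only shows how a larger max_char_buffer packs more context into each unit sent to the model.

```python
import re


def chunk_text(text, max_char_buffer=1200):
    """Greedily pack whole sentences into chunks of at most max_char_buffer chars.

    A single sentence longer than the buffer becomes its own oversized chunk.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_char_buffer:
            # Adding this sentence would overflow: close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = "Alpha one. Beta two. Gamma three."
small = chunk_text(sample, max_char_buffer=22)
large = chunk_text(sample, max_char_buffer=100)
```

With the small buffer the sample splits into two chunks; with the large buffer it stays in one, so every sentence shares context with the others.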

4) Provider Routing

build_extract_kwargs() supports:

native provider mode (direct model key)
poe mode (OpenAI-compatible Poe endpoint)
auto mode (environment-driven fallback)

This lets one script run across Gemini/OpenAI-like backends with minimal changes.
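The three modes can be pictured as a small resolution step that runs before the extraction call. The sketch below is an assumption-heavy illustration, not the body of build_extract_kwargs(): the function name resolve_provider, the environment variable names, the fallback order in auto mode, the endpoint URL, and the returned keys are all hypothetical.

```python
import os


def resolve_provider(mode, model_id, env=None):
    """Pick connection settings for 'native', 'poe', or 'auto' mode."""
    env = os.environ if env is None else env
    if mode == "auto":
        # Assumed fallback order: prefer Poe when its key is present.
        mode = "poe" if env.get("POE_API_KEY") else "native"
    if mode == "poe":
        return {
            "mode": "poe",
            "model_id": model_id,
            "api_key": env.get("POE_API_KEY"),      # assumed variable name
            "base_url": "https://api.poe.com/v1",   # assumed endpoint URL
        }
    if mode == "native":
        return {
            "mode": "native",
            "model_id": model_id,
            "api_key": env.get("LANGEXTRACT_API_KEY"),  # assumed variable name
        }
    raise ValueError(f"unknown provider mode: {mode}")


settings = resolve_provider("auto", "Gemini-3-Flash", env={"POE_API_KEY": "test-key"})
```

Passing env explicitly keeps the function easy to test without touching real credentials.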

5) Grounded Output Shape

LangExtract records include:

extraction_text
extraction_class
char_interval.start_pos / end_pos
alignment_status
optional attributes

Because offsets are preserved, downstream snippet generation can anchor context windows directly in original source text.
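As an illustration of offset-anchored snippets, here is a minimal sketch that cuts a context window around a character interval, trims it to word boundaries, and marks the mention in bold. make_snippet is a hypothetical helper; the `**` markers and the "..." ellipsis are assumptions about presentation, and the repo's generator may format snippets differently.

```python
def make_snippet(source, start_pos, end_pos, context_chars=180):
    """Return a context window around source[start_pos:end_pos], mention in bold."""
    left = max(0, start_pos - context_chars)
    right = min(len(source), end_pos + context_chars)
    # Boundary-aware trimming: if an edge falls inside a word, move it
    # inward to the nearest space so no word is cut in half.
    if left > 0 and not source[left - 1].isspace():
        space = source.find(" ", left, start_pos)
        left = space + 1 if space != -1 else start_pos
    if right < len(source) and not source[right].isspace():
        space = source.rfind(" ", end_pos, right)
        right = space if space != -1 else end_pos
    prefix = "..." if left > 0 else ""
    suffix = "..." if right < len(source) else ""
    mention = source[start_pos:end_pos]
    return f"{prefix}{source[left:start_pos]}**{mention}**{source[end_pos:right]}{suffix}"


# Synthetic sentence for illustration only, not a quote from the corpus.
text = "Critics say the Department of Justice has been weaponized against ordinary citizens."
start = text.index("Department of Justice")
end = start + len("Department of Justice")
print(make_snippet(text, start, end, context_chars=10))
```

Because the window is computed from start_pos and end_pos rather than by searching for the mention string, repeated mentions of the same entity each get their own correct context.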

6) Normalization Layer

scripts/normalize_entities.py adds canonical identity fields:

canonical_id
canonical_label
normalization_method
normalized (boolean)

Modes:

lenient (default): acronym + parenthetical + title-boundary aware matching
strict: exact alias matching only

Additional normalization supports:

person alias bootstrap from data/gold/entities_gold_v0.jsonl
alias expansion catalog (data/normalization_aliases_v2.json)
collision reporting in normalization report
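The difference between the two modes can be sketched as follows. This is a simplified illustration under stated assumptions: the alias table, the slugify helper, the normalization_method values, and the use of None for unresolved IDs are all hypothetical, and only two of the lenient rules (leading article and trailing parenthetical) are shown. The real scripts/normalize_entities.py also handles acronyms and title boundaries.

```python
import re

# Hypothetical alias catalog: lowercase alias -> canonical label.
ALIASES = {
    "dhs": "Department of Homeland Security",
    "department of homeland security": "Department of Homeland Security",
}


def slugify(label):
    """Turn a canonical label into a stable identifier."""
    return re.sub(r"[^a-z0-9]+", "_", label.lower()).strip("_")


def normalize(mention, mode="lenient", aliases=ALIASES):
    """Attach canonical identity fields to one mention string."""
    key = mention.strip().lower()
    method = "exact_alias"
    if key not in aliases and mode == "lenient":
        # Lenient extras: drop a leading article and a trailing "(...)".
        key = re.sub(r"^the\s+", "", key)
        key = re.sub(r"\s*\([^)]*\)$", "", key)
        method = "lenient_rule"
    label = aliases.get(key)
    if label is None:
        return {"canonical_id": None, "canonical_label": mention.strip(),
                "normalization_method": "none", "normalized": False}
    return {"canonical_id": slugify(label), "canonical_label": label,
            "normalization_method": method, "normalized": True}


mention = "the Department of Homeland Security (DHS)"
lenient_result = normalize(mention, mode="lenient")
strict_result = normalize(mention, mode="strict")
```

In this sketch the same mention resolves under lenient mode but stays unresolved under strict mode, which is the kind of gap the normalization report helps to find.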

7) Evaluation & Diagnostics

Gold/eval scripts:

scripts/predict_gold_langextract.py
scripts/evaluate_entity_gold.py

Reports provide strict/lenient diagnostics and unresolved top mentions, which are critical for iterative alias curation.
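One common way to define strict and lenient span scoring is sketched below: strict requires identical offsets and class, lenient accepts any character overlap with the same class. This is an assumed definition for illustration; prf is a hypothetical helper, and scripts/evaluate_entity_gold.py may define its matching rules differently.

```python
def prf(gold, predicted, lenient=False):
    """Precision, recall, and F1 over (start, end, entity_class) tuples."""

    def matches(p, g):
        if p[2] != g[2]:
            return False
        if lenient:
            return p[0] < g[1] and g[0] < p[1]  # any character overlap
        return p[0] == g[0] and p[1] == g[1]    # identical offsets

    unmatched = list(gold)
    tp = 0
    for p in predicted:
        hit = next((g for g in unmatched if matches(p, g)), None)
        if hit is not None:
            unmatched.remove(hit)  # each gold span can be matched only once
            tp += 1
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


gold = [(4, 25, "government_agency"), (34, 37, "government_agency")]
pred = [(4, 25, "government_agency"), (30, 37, "government_agency")]
strict_scores = prf(gold, pred)
lenient_scores = prf(gold, pred, lenient=True)
```

The second prediction has a wrong start offset, so it counts as a miss under strict scoring and as a hit under lenient scoring. A large gap between the two usually points to boundary errors rather than missed entities.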

8) Publishing Layer (Chapter 2)

The entity publish path is now intentionally SEO-friendly:

Grouped static pages by class in docs/entities/
Per-page caps to control page bloat
Longer context snippets with boundary-aware trimming
Bold mention highlighting in snippet text

Script:

scripts/generate_entity_index_pages.py

Key switches:

--max-entities-per-class 50
--max-snippets 0 (all mentions)
--snippet-context-chars 180

Entity ↔ Theme Linking (New in Chapter 2)

scripts/build_entity_theme_links.py computes co-mention edges by matching entity aliases inside trait quote/explanation text.

Release note for v1:

supports --score-mode raw|lift|pmi (default raw)
every edge evidence item now includes chunk_id and source_url

Output:

web/public/entity_theme_data.json
docs/graph/entity_theme_data.json

Each edge currently carries:

a weight: the weighted sum of concept confidences
a match count
attached quote evidence
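The co-mention computation described in this section can be sketched as follows. This is a simplified illustration, not the code in scripts/build_entity_theme_links.py: build_edges, the input shapes, and the field names trait, quote, explanation, and confidence are assumptions, and the weight here is a plain sum of confidences.

```python
import re
from collections import defaultdict


def build_edges(entities, concepts):
    """Link entities to themes by alias matches inside quote/explanation text.

    entities: {canonical_id: [alias, ...]}
    concepts: dicts with trait, quote, explanation, confidence, chunk_id, source_url
    """
    edges = defaultdict(lambda: {"weight": 0.0, "matches": 0, "evidence": []})
    for concept in concepts:
        text = f"{concept['quote']} {concept['explanation']}"
        for entity_id, aliases in entities.items():
            if not aliases:
                continue
            # Whole-word, case-insensitive match on any alias (purely lexical).
            pattern = r"\b(?:" + "|".join(re.escape(a) for a in aliases) + r")\b"
            if re.search(pattern, text, flags=re.IGNORECASE):
                edge = edges[(entity_id, concept["trait"])]
                edge["weight"] += concept["confidence"]
                edge["matches"] += 1
                edge["evidence"].append({
                    "quote": concept["quote"],
                    "chunk_id": concept["chunk_id"],
                    "source_url": concept["source_url"],
                })
    return dict(edges)


# Synthetic placeholder records, not corpus data.
entities = {"department_of_justice": ["Department of Justice", "DOJ"]}
concepts = [
    {"trait": "Obsession with a Plot", "quote": "Placeholder quote naming the DOJ.",
     "explanation": "Placeholder explanation.", "confidence": 0.9,
     "chunk_id": 1, "source_url": "https://example.org/chunk/1"},
    {"trait": "Obsession with a Plot", "quote": "Placeholder quote.",
     "explanation": "Refers to the Department of Justice.", "confidence": 0.6,
     "chunk_id": 2, "source_url": "https://example.org/chunk/2"},
    {"trait": "Selective Populism", "quote": "No entity here.",
     "explanation": "Nothing to link.", "confidence": 0.8,
     "chunk_id": 3, "source_url": "https://example.org/chunk/3"},
]
edges = build_edges(entities, concepts)
```

Because matching is purely lexical, a short or generic alias such as "State" would match many unrelated passages, which is why generic terms can over-link.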

Interactive view:

route: #/entity-themes
filtered graph with edge-threshold and link-count controls
top-link panel with evidence and click-through to entity detail

LangChain Summary (Condensed)

The LangChain track remains the rhetorical backbone:

taxonomy constrained to Eco’s 14 properties via Pydantic schema (src/schema.py)
structured output for quote + explanation + confidence
parallel chunk analysis over project_2025.txt
outputs consumed by static and React visualizations

Chapter 2 does not replace this track; it makes it explainable through concrete actors/institutions/policies linked to trait rhetoric.

Setup & Run (Chapter 2)

# Python + web deps
pip install -r requirements.txt
npm install --prefix web

# 1) Trait analysis (LangChain)
python src/main.py

# 2) Static theme pages (capped)
python src/generate_site.py --max-items-per-theme 50

# 3) Entity extraction (LangExtract)
python scripts/extract_entities_langextract.py \
  --model-id Gemini-3-Flash \
  --provider-mode auto \
  --extraction-passes 2 \
  --max-workers 20 \
  --max-char-buffer 1200

# 4) Entity normalization
python scripts/normalize_entities.py \
  --mode lenient \
  --input data/entities_langextract.jsonl \
  --output data/entities_langextract.normalized.v2.jsonl \
  --report data/entities_langextract.normalization_report.v2.json

# 5) Static entity pages + Vite entity data
python scripts/generate_entity_index_pages.py \
  --max-entities-per-class 50 \
  --max-snippets 0 \
  --snippet-context-chars 180

# 6) Build entity↔theme link data
python scripts/build_entity_theme_links.py --score-mode raw

# 7) Build React app
npm run build --prefix web

GitHub Pages Artifact Sync

After web build, sync to docs/graph:

rm -rf docs/graph/assets
cp -R web/dist/assets docs/graph/assets
cp web/dist/index.html docs/graph/index.html
cp web/public/data.json docs/graph/data.json
cp web/public/graph_data.json docs/graph/graph_data.json
cp web/public/entities_data.json docs/graph/entities_data.json
cp web/public/entity_theme_data.json docs/graph/entity_theme_data.json

Known Limits (Chapter 2)

Alias matching in entity↔theme linking is lexical and can over-link generic terms.
High-frequency entities can dominate when using --score-mode raw; use --score-mode lift or --score-mode pmi for normalized views.
Table-of-contents style sections still introduce noisy entity contexts.

Next Merge-Ready Targets

Add PMI/lift-normalized edge score in build_entity_theme_links.py.
Add chapter-level partitioning in static entity pages (A–M / N–Z) if payloads grow.
Add a dedicated “evidence-only” toggle in the entity↔theme view.
