Methodology

How records enter the site, how OCR and entity references are derived, where AI-generated summaries fit, and where the current collection is incomplete.

37,141
Records indexed

Curated subset of roughly 300,000 JFK Collection documents.

149,642
OCR passages

Chunked full-text passages available for search and citation.

2026
Pending releases

Release years not yet represented in the local index.

Scope and ingest cutoff

This research tool indexes 37,141 records, a curated subset of the roughly 300,000 documents in the JFK Assassination Records Collection held at the U.S. National Archives (NARA). Of those, 2,165 records (about 5.8%) have full-text OCR attached, producing 149,642 indexed OCR passages. The remaining 34,976 records are metadata-only.

Latest indexed release date: 2025-03-18. Releases indexed: 2017-2018, 2021, 2022, 2023, 2025. Releases not yet indexed: 2026.

2025 re-release layer: for 2,162 records, the OCR text shown on this site was sourced from the March 2025 unredaction (Executive Order 14176). NARA has not yet published an XLSX manifest for the 2025 release, so the archival metadata fields for these records still come from the most recent earlier XLSX manifest in which each record appears. Every document page shows the record's full release history as a strip.

Data pipeline

Record metadata is loaded from the NARA JFK Records XLSX manifests and normalized into a unified schema. OCR is streamed from ABBYY's public JFK-OCR repository rather than regenerated in-house; this keeps the VM footprint small and defers OCR cost to an upstream provider.

OCR text is chunked into passages of 1,200 characters, with page labels preserved. Entity mentions are produced by tiered substring matching against hand-curated alias lists. Topic membership is assigned by rules over agency, title tokens, and record groups; it is not model-derived.
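As an illustration of the chunking step, here is a minimal sketch in Python. It is not the site's actual code: the `Passage` type, the `chunk_pages` function, and the choice to never let a passage cross a page boundary are assumptions made for the example. Only the 1,200-character size and the preserved page label come from the description above.

```python
from dataclasses import dataclass

CHUNK_SIZE = 1200  # characters per passage, as stated in the methodology


@dataclass(frozen=True)
class Passage:
    record_id: str   # hypothetical record identifier
    page_label: str  # page label carried over from the OCR source
    text: str


def chunk_pages(record_id, pages, size=CHUNK_SIZE):
    """Split OCR text into fixed-size passages, one page at a time.

    `pages` is an iterable of (page_label, page_text) pairs. Chunking per
    page (an assumption of this sketch) means every passage maps to exactly
    one page label, which keeps citations simple.
    """
    passages = []
    for label, text in pages:
        text = text.strip()
        # An empty page produces no passages because range(0, 0, size) is empty.
        for start in range(0, len(text), size):
            passages.append(Passage(record_id, label, text[start:start + size]))
    return passages
```

Under these assumptions a 2,500-character page yields three passages (1,200, 1,200, and 100 characters), all carrying the same page label.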

Models

AI-generated content uses Google Vertex AI via BigQuery ML remote models:

  • Gemini 2.5 Flash - short topic summaries (140-200 words).
  • Gemini 2.5 Pro - long-form topic articles (600-900 words) and Open Questions map-reduce synthesis.

Every AI panel on the site displays the model name, generation date, and source-record count inline. Outputs are pre-generated and stored; the app does not call models at request time.
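One way to picture the pre-generated panels is as stored records that carry their own provenance. The sketch below is hypothetical: the `AIPanel` type, its field names, the in-memory `PANELS` store, and the placeholder values are assumptions for illustration, not the site's schema. It only shows that rendering a panel is a lookup and never a model call.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class AIPanel:
    topic_id: str             # hypothetical key
    model_name: str           # e.g. "Gemini 2.5 Flash"
    generated_on: date        # when the text was generated offline
    source_record_count: int  # number of records fed to the model
    body: str                 # the stored summary text

    def byline(self) -> str:
        """Provenance line shown inline with the panel."""
        return (
            f"{self.model_name} · generated {self.generated_on.isoformat()}"
            f" · {self.source_record_count} source records"
        )


# Stand-in for whatever storage holds the pre-generated outputs.
PANELS = {
    "example-topic": AIPanel(
        topic_id="example-topic",
        model_name="Gemini 2.5 Flash",
        generated_on=date(2025, 1, 1),  # placeholder date
        source_record_count=12,         # placeholder count
        body="(stored summary text)",
    )
}


def get_panel(topic_id: str) -> Optional[AIPanel]:
    """Request-time path: a plain lookup, with no model invocation."""
    return PANELS.get(topic_id)
```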

Editorial posture

The site surfaces tensions and anomalies visible in the records but does not advocate for any theory of the assassination. Open Questions threads are paired with primary-source citations; readers are expected to cross-check against the underlying documents.

Entity bios and timeline entries are curated from Warren Commission, HSCA, and ARRB materials. Factual errors should be reported via the corrections form.

Known limitations

  • The 2025 release is only partially ingested (OCR text for 2,162 records, with no 2025 XLSX manifest metadata), and the 2026 release is not yet indexed; users seeking those documents should consult archives.gov/research/jfk/release-2025.
  • OCR quality varies by document; expect noise in older typewritten or hand-annotated pages.
  • Entity extraction is alias-based, not model-based; rare spellings and redacted cryptonyms may be under-counted.
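
The under-counting in the last limitation follows directly from how alias matching works. The sketch below is a guess at what "tiered" matching could look like, not the site's implementation: the `ALIASES` table, the tier numbers, the `Mention` type, and the use of word-boundary regular expressions in place of raw substring search are all assumptions. Stronger tiers (full names) are matched first, and weaker aliases are skipped wherever they overlap a span that is already claimed.

```python
import re
from dataclasses import dataclass

# Hypothetical alias table: entity id -> list of (tier, alias).
# Tier 1 = full name, tier 2 = shorter or more ambiguous alias.
ALIASES = {
    "lee-harvey-oswald": [(1, "Lee Harvey Oswald"), (2, "Oswald"), (2, "LHO")],
    "jack-ruby": [(1, "Jack Ruby"), (2, "Ruby")],
}


@dataclass(frozen=True)
class Mention:
    entity_id: str
    alias: str
    tier: int
    start: int  # character offset in the passage


def find_mentions(text, aliases=ALIASES):
    """Return non-overlapping alias matches, preferring stronger tiers."""
    entries = sorted(
        ((tier, alias, eid) for eid, pairs in aliases.items() for tier, alias in pairs),
        key=lambda e: (e[0], -len(e[1])),  # best tier first, longer alias first
    )
    claimed = []  # (start, end) spans already assigned to a mention
    mentions = []
    for tier, alias, eid in entries:
        pattern = re.compile(r"\b" + re.escape(alias) + r"\b", re.IGNORECASE)
        for m in pattern.finditer(text):
            overlaps = any(m.start() < end and start < m.end() for start, end in claimed)
            if overlaps:
                continue  # already covered by a stronger or longer alias
            claimed.append((m.start(), m.end()))
            mentions.append(Mention(eid, alias, tier, m.start()))
    return sorted(mentions, key=lambda x: x.start)
```

With this table, a passage containing "Lee Harvey Oswald", "Oswald", and the OCR-damaged "0swald" (leading zero) produces two mentions, not three: the damaged spelling matches no alias, which is exactly the kind of miss described above.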
Methodology · JFK Research Center