diff --git a/docs/RFC-TESSERA-4.0-001.md b/docs/RFC-TESSERA-4.0-001.md
new file mode 100644
index 0000000..d1cdd5c
--- /dev/null
+++ b/docs/RFC-TESSERA-4.0-001.md
@@ -0,0 +1,325 @@
+# RFC-TESSERA-4.0-001
+## Per-Hex Pipeline, Row Lifecycle, and Provenance Model
+### Status: Draft v0.2
+### Date: 2026-04-26
+### Supersedes: RFC-TESSERA-2.0-001 (tile assembly), RFC-TESSERA-3.0-OCC-001 (occupation encoding)
+
+---
+
+## Abstract
+
+TESSERA 4.0 replaces the one-shot global pipeline model with a per-hex
+pipeline that is resumable, updatable, and portable. Any H3 hex at any
+resolution can be inserted, updated, superseded, or retired independently
+of all others. The database is the product from day one, not a tile
+archive assembled in a second pass.
+
+This RFC defines:
+- The `otivm.sqlite3` schema: integer-normalized, compact, TESSERA 4.0 format
+- Row lifecycle states as a lookup table
+- Per-field provenance as integer foreign keys
+- H5 coverage completeness tracking: no ragged edges at hex boundaries
+- The pipeline contract: what a pipeline run must produce to be valid
+- The staging protocol: `staging_otivm.sqlite3` → `otivm.sqlite3`
+
+---
+
+## 1. Design Principles
+
+### 1.1 The hex is the unit of work
+
+One pipeline run processes one H3 hex. It fetches source data for that
+hex, derives the field values, records provenance, and writes to the
+staging database. No global operations. No dependency on adjacent hexes.
+A run that fails leaves no partial state visible to the game: its rows
+stay `draft` and are never promoted (Section 5).
+
+### 1.2 The database is the product
+
+There is no tile archive. There is no intermediate sidecar format.
+The SQLite database is written directly by the pipeline. The game reads
+directly from the database. There is no assembly stage.
+
+### 1.3 Rows are never deleted
+
+A row is never deleted. It is superseded or retired. The full history
+of every cell is queryable.
+When a new dataset version produces a better
+value, the old row becomes `superseded` and the new row becomes `current`.
+The game reads only `current` rows.
+
+### 1.4 Dataset versions evolve independently per field
+
+Suppose GEBCO releases a new bathymetry dataset. Only the `elev_cm` field
+needs refreshing; terrain, hydrology, and geology are unchanged. The pipeline
+runs only for the affected field. Only the affected rows are superseded.
+
+### 1.5 Academic credibility travels with the cell
+
+Every field value carries a confidence grade as an integer FK. A cell
+with IGME5000 geology coverage has `measured` or `indicated` confidence.
+A cell outside IGME5000 coverage has `inferred` or `no_data`. This grade
+is queryable, displayable, and affects how the game and simulator use
+the data. It is not optional metadata; it is a first-class field.
+
+### 1.6 No ragged edges at hex boundaries
+
+The game reads only cells within H5 hexes that are marked complete in
+`h5_coverage`. An H5 hex is complete when all its H9 cells are present
+with `status = current`. Incomplete H5 hexes are treated as staging data:
+they are never served to the game, which prevents elevation seams at
+partially-fetched boundaries.
+
+### 1.7 The 32 GB constraint is the reality check
+
+The pipeline fetches and stores only what the game needs. Storage grows
+proportionally to game coverage, not to Earth's surface area.
+
+---
+
+## 2. Lookup Tables
+
+All enumerated values are stored as integers referencing these tables.
+`lifecycle_states` and `confidence_grades` are written once at database
+creation and never modified; `source_registry` only gains rows as new
+dataset versions are registered.
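One practical point about integer foreign keys in SQLite: `REFERENCES` clauses are only enforced when `PRAGMA foreign_keys = ON` has been issued on the connection. The following is a minimal sketch, not the project's implementation. It uses the `lifecycle_states` table from Section 2.1; `demo_rows` and `state_id()` are hypothetical names introduced for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite leaves REFERENCES clauses unenforced unless this pragma is set
# on each connection, before any transaction is opened.
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
CREATE TABLE lifecycle_states (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
INSERT INTO lifecycle_states VALUES
    (1, 'draft'), (2, 'current'), (3, 'superseded'), (4, 'retired');

-- Hypothetical stand-in for any table that stores a status FK.
CREATE TABLE demo_rows (
    id     INTEGER PRIMARY KEY,
    status INTEGER NOT NULL REFERENCES lifecycle_states(id)
);
""")


def state_id(conn, name):
    """Resolve a lifecycle state name to its integer id (hypothetical helper)."""
    row = conn.execute(
        "SELECT id FROM lifecycle_states WHERE name = ?", (name,)
    ).fetchone()
    if row is None:
        raise KeyError(name)
    return row[0]


# A valid status is accepted.
conn.execute("INSERT INTO demo_rows (status) VALUES (?)", (state_id(conn, "draft"),))

# A status with no matching lookup row is rejected once enforcement is on.
try:
    conn.execute("INSERT INTO demo_rows (status) VALUES (99)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```

Resolving names through the lookup table once per run, rather than hard-coding the integers in several places, keeps pipeline code readable while the stored rows stay compact.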
+
+### 2.1 `lifecycle_states`
+
+```sql
+CREATE TABLE lifecycle_states (
+    id   INTEGER PRIMARY KEY,
+    name TEXT NOT NULL UNIQUE
+);
+
+INSERT INTO lifecycle_states VALUES
+    (1, 'draft'),
+    (2, 'current'),
+    (3, 'superseded'),
+    (4, 'retired');
+```
+
+### 2.2 `confidence_grades`
+
+```sql
+CREATE TABLE confidence_grades (
+    id          INTEGER PRIMARY KEY,
+    name        TEXT NOT NULL UNIQUE,
+    description TEXT NOT NULL
+);
+
+INSERT INTO confidence_grades VALUES
+    (1, 'measured', 'Directly observed or instrumentally measured. Published dataset with explicit methodology.'),
+    (2, 'indicated', 'Recorded in registry or survey without direct measurement. Classification may be broad.'),
+    (3, 'inferred', 'Derived from landscape position, proximity to measured cells, or modelled from adjacent data.'),
+    (4, 'no_data', 'Source dataset has no coverage for this cell. Field value is a known placeholder.');
+```
+
+### 2.3 `source_registry`
+
+```sql
+CREATE TABLE source_registry (
+    id            INTEGER PRIMARY KEY AUTOINCREMENT,
+    source_key    TEXT NOT NULL UNIQUE,  -- e.g. 'GEBCO_2025'
+    source_name   TEXT NOT NULL,
+    source_url    TEXT,
+    version       TEXT NOT NULL,
+    license       TEXT,
+    citation      TEXT,
+    registered_at TEXT NOT NULL          -- ISO 8601 UTC
+);
+```
+
+---
+
+## 3. Core Tables
+
+### 3.1 `pipeline_runs`
+
+```sql
+CREATE TABLE pipeline_runs (
+    id              INTEGER PRIMARY KEY AUTOINCREMENT,
+    run_key         TEXT NOT NULL UNIQUE,  -- human label, e.g. 'tessera3-seed-2026-04-26'
+    started_at      TEXT NOT NULL,         -- ISO 8601 UTC
+    completed_at    TEXT,                  -- NULL while running
+    status          INTEGER NOT NULL REFERENCES lifecycle_states(id),
+    h5_cells        TEXT NOT NULL,         -- JSON array of H3 res-5 integer IDs
+    fields_updated  TEXT NOT NULL,         -- JSON array of field names
+    source_versions TEXT NOT NULL,         -- JSON object: {source_key: version}
+    row_count       INTEGER,               -- NULL while running
+    notes           TEXT
+);
+```
+
+### 3.2 `tessera_cells`: one row per H9 cell per pipeline run
+
+H3 cell IDs are stored as INTEGER (64-bit H3 index) not TEXT.
+The canonical H3 string form is the hexadecimal rendering of that integer.
+Convert with the H3 library's string/integer helpers, for example
+`h3.str_to_int()` / `h3.int_to_str()` in h3-py 4.x or
+`stringToH3()` / `h3ToString()` in the C library.
+
+```sql
+CREATE TABLE tessera_cells (
+    -- Identity
+    id            INTEGER PRIMARY KEY AUTOINCREMENT,
+    h9            INTEGER NOT NULL,  -- H3 res-9 index (64-bit)
+    h7            INTEGER NOT NULL,  -- H3 res-7 ancestor index
+    h5            INTEGER NOT NULL,  -- H3 res-5 ancestor index (waypoint)
+    lat           REAL NOT NULL,     -- H9 centroid latitude
+    lon           REAL NOT NULL,     -- H9 centroid longitude
+
+    -- Physical fields (RFC-TESSERA-2.0-001 byte layout preserved)
+    elev_cm       INTEGER,           -- Elevation in cm, signed 24-bit range
+    terrain       INTEGER,           -- Appendix A terrain code
+    hydro         INTEGER,           -- Section 3.3 hydrology code
+    geo_dep       INTEGER,           -- Section 3.4 deposit code
+    geo_flag      INTEGER,           -- Section 3.5 geology flag code
+    occ_flag      INTEGER,           -- RFC-TESSERA-3.0-OCC-001 Section 2 code
+
+    -- Provenance per field (source FK + confidence FK)
+    elev_src      INTEGER REFERENCES source_registry(id),
+    elev_conf     INTEGER REFERENCES confidence_grades(id),
+    terr_src      INTEGER REFERENCES source_registry(id),
+    terr_conf     INTEGER REFERENCES confidence_grades(id),
+    hydro_src     INTEGER REFERENCES source_registry(id),
+    hydro_conf    INTEGER REFERENCES confidence_grades(id),
+    gdep_src      INTEGER REFERENCES source_registry(id),
+    gdep_conf     INTEGER REFERENCES confidence_grades(id),
+    gflag_src     INTEGER REFERENCES source_registry(id),
+    gflag_conf    INTEGER REFERENCES confidence_grades(id),
+    occ_src       INTEGER REFERENCES source_registry(id),
+    occ_conf      INTEGER REFERENCES confidence_grades(id),
+
+    -- Lifecycle
+    status        INTEGER NOT NULL DEFAULT 1
+                  REFERENCES lifecycle_states(id),
+    run_id        INTEGER NOT NULL REFERENCES pipeline_runs(id),
+    created_at    TEXT NOT NULL,     -- ISO 8601 UTC
+    superseded_by INTEGER REFERENCES tessera_cells(id),
+    retired_reason TEXT
+);
+
+CREATE INDEX idx_cells_h9_status ON tessera_cells(h9, status);
+CREATE INDEX idx_cells_h5_status ON tessera_cells(h5, status);
+CREATE INDEX idx_cells_h7_status ON tessera_cells(h7, status);
+CREATE INDEX idx_cells_run ON tessera_cells(run_id);
+```
+
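The INTEGER storage of H3 IDs described in Section 3.2 can be sketched without any H3 binding, because the H3 string form is simply the hexadecimal rendering of the 64-bit index. This is a minimal illustration, not the pipeline's code. The waypoint strings are copied from Section 7, while `waypoint_demo` and the three helper functions are hypothetical names. A real pipeline would more likely call the H3 library's own conversion functions.

```python
import sqlite3

# H5 waypoint IDs in canonical H3 string form, copied from Section 7.
WAYPOINTS = {
    "Ostia": "851e805bfffffff",
    "Capua": "851e8333fffffff",
    "Brundisium": "851e8ba3fffffff",
    "Carthago": "85386e23fffffff",
    "Alexandria": "853f5ba7fffffff",
}


def h3_str_to_int(h3_str: str) -> int:
    """Parse the hexadecimal H3 string into its 64-bit integer index."""
    return int(h3_str, 16)


def h3_int_to_str(h3_int: int) -> str:
    """Render the integer index back to the lowercase hexadecimal string."""
    return format(h3_int, "x")


def h3_resolution(h3_int: int) -> int:
    """The resolution occupies bits 52-55 of an H3 index."""
    return (h3_int >> 52) & 0xF


conn = sqlite3.connect(":memory:")
# Hypothetical demo table, not part of the RFC schema.
conn.execute("CREATE TABLE waypoint_demo (city TEXT PRIMARY KEY, h5 INTEGER NOT NULL)")
conn.executemany(
    "INSERT INTO waypoint_demo VALUES (?, ?)",
    [(city, h3_str_to_int(h)) for city, h in WAYPOINTS.items()],
)

stored = dict(conn.execute("SELECT city, h5 FROM waypoint_demo"))
for city, h5 in stored.items():
    # The top bit of an H3 index is reserved as 0, so the value fits
    # SQLite's signed 64-bit INTEGER without wrapping.
    assert 0 < h5 < 2**63
    assert h3_int_to_str(h5) == WAYPOINTS[city]
    assert h3_resolution(h5) == 5
```

Storing the integer form keeps the `h9`, `h7`, and `h5` index columns to 8 bytes per value, and the round trip above suggests the "resolved at extraction" column in Section 7 can be filled deterministically from the TEXT column.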
+### 3.3 `h5_coverage`: H5 completeness tracking
+
+An H5 hex is `complete` (status=2) when all its H9 cells are present
+with `status = 2` (current). The game reads only cells in complete H5
+hexes. This prevents elevation seams at partially-fetched boundaries.
+
+```sql
+CREATE TABLE h5_coverage (
+    h5           INTEGER PRIMARY KEY,  -- H3 res-5 index
+    status       INTEGER NOT NULL REFERENCES lifecycle_states(id),
+                 -- 1=draft (in progress), 2=current (complete), 4=retired
+    h9_total     INTEGER NOT NULL,     -- Expected H9 count (typically 2401)
+    h9_current   INTEGER NOT NULL DEFAULT 0,
+    last_updated TEXT NOT NULL,        -- ISO 8601 UTC
+    run_id       INTEGER NOT NULL REFERENCES pipeline_runs(id),
+    notes        TEXT
+);
+```
+
+A pipeline step increments `h9_current` on each H9 promotion and sets
+`status = 2` when `h9_current = h9_total`.
+
+---
+
+## 4. Row Lifecycle
+
+```
+[pipeline run]                    → status 1 (draft)
+[validation pass]                 → status 2 (current)
+[new pipeline run for same cell]  → old row: status 3 (superseded), new row: status 2 (current)
+[data error confirmed]            → status 4 (retired, with reason)
+```
+
+The game's canonical query:
+
+```sql
+SELECT tc.*
+FROM tessera_cells tc
+JOIN h5_coverage h5c ON tc.h5 = h5c.h5
+WHERE h5c.status = 2   -- complete H5 hexes only
+  AND tc.status = 2    -- current rows only
+  AND tc.h5 = ?        -- specific waypoint
+```
+
+This compound filter guarantees no boundary seams and no stale data.
+
+---
+
+## 5. Pipeline Contract
+
+A pipeline run is valid if and only if it satisfies all of the following:
+
+1. It writes a `pipeline_runs` row with `status = 1` before writing any cells
+2. All source datasets used are in `source_registry` before the run starts
+3. Every `tessera_cells` row has `status = 1`, the correct `run_id`, and
+   non-null provenance FKs for every field written
+4. On completion, it updates `pipeline_runs` to `status = 2` and sets `row_count`
+5.
On failure, it updates `pipeline_runs` to `status = 4`; draft rows from
+   this run remain draft and are invisible to the game
+
+Promotion (draft → current) is a separate explicit validation step:
+- Verify that the row count matches the expected H9 count for the H5 hex
+- Update `tessera_cells.status` 1 → 2 for the run's rows
+- Mark the previous current rows for the same H9 cells as `status = 3`
+- Update `h5_coverage` accordingly
+
+---
+
+## 6. Staging Protocol
+
+**`staging_otivm.sqlite3`**: the pipeline writes here. The schema is identical
+to production. New hexes are fetched, processed, and validated here before
+production. It can be deleted and rebuilt without affecting the game.
+
+**`otivm.sqlite3`**: production. The game reads only `current` rows within
+`complete` H5 hexes. Promotion from staging is explicit and never automatic.
+
+---
+
+## 7. Seed Data: TESSERA 3.0 → TESSERA 4.0
+
+Source: `tessera.db` (TESSERA 3.0, SpatiaLite, 158 GB, Dell SSD).
+Seed run: `run_key = 'tessera3-seed-2026-04-26'`
+
+The five OTIVM launch H5 waypoints:
+
+| City | H5 (TEXT) | H5 (INTEGER) |
+|---|---|---|
+| Ostia | `851e805bfffffff` | resolved at extraction |
+| Capua | `851e8333fffffff` | resolved at extraction |
+| Brundisium | `851e8ba3fffffff` | resolved at extraction |
+| Carthago | `85386e23fffffff` | resolved at extraction |
+| Alexandria | `853f5ba7fffffff` | resolved at extraction |
+
+Each H5 contains 2,401 H9 cells (49 H7 × 49 H9 per H7, i.e. 7^4), which
+matches the typical `h9_total` in `h5_coverage`. Total seed rows:
+5 × 2,401 = 12,005 H9 cells minimum. Adjacent H5 cells along trade routes
+may also be seeded to prevent boundary seams in route rendering.
+
+Seed confidence grades:
+- `elev_conf`: 2 (indicated), from GEBCO 2025, direct sample
+- `terr_conf`: 2 (indicated), from ESA WorldCover v200
+- `hydro_conf`: 2 (indicated), from HydroSHEDS v1.1
+- `gdep_conf`: 2 or 4: MRDS point data where present, `no_data` elsewhere
+- `gflag_conf`: 2 or 4: IGME5000 polygon where present, `no_data` elsewhere
+- `occ_conf`: 4 (no_data): stage 06 not yet run; `occ_flag = 0`
+
+---
+
+## 8.
What This RFC Does Not Cover
+
+| Topic | Where it belongs |
+|---|---|
+| Game queries and views | OTIVM game code |
+| Occupation evidence detail | RFC-TESSERA-3.0-OCC-001 (adapted) |
+| CIVICVS simulation state | RFC-CIVICVS stack |
+| Online source fetch scripts | Pipeline implementation |
+| OTIVM roadmap changes | docs/roadmap.md |
+
+---
+
+*RFC-TESSERA-4.0-001 Draft v0.2, 2026-04-26*
+*Status: Draft, pending project owner approval before implementation*
+*Next action: project owner approves schema, then seed extraction from tessera.db begins*
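As an illustration of the promotion step in Section 5 and the canonical query in Section 4, the following is a minimal sketch under stated assumptions, not the project's implementation. The schema is trimmed to the columns the step touches. `promote()` is a hypothetical function name, the `h9` values are placeholders rather than real H3 indexes, and `h9_total` is set to 3 instead of the typical 2401 to keep the demo small. The sketch supersedes old rows before promoting new ones so that the two sets never share `status = 2`.

```python
import sqlite3

DRAFT, CURRENT, SUPERSEDED = 1, 2, 3
H5 = int("851e805bfffffff", 16)  # Ostia waypoint from Section 7

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Trimmed versions of the Section 3 tables (assumption: only these columns matter here).
CREATE TABLE tessera_cells (
    id            INTEGER PRIMARY KEY AUTOINCREMENT,
    h9            INTEGER NOT NULL,
    h5            INTEGER NOT NULL,
    status        INTEGER NOT NULL DEFAULT 1,
    run_id        INTEGER NOT NULL,
    superseded_by INTEGER REFERENCES tessera_cells(id)
);
CREATE TABLE h5_coverage (
    h5         INTEGER PRIMARY KEY,
    status     INTEGER NOT NULL,
    h9_total   INTEGER NOT NULL,
    h9_current INTEGER NOT NULL DEFAULT 0
);
""")
conn.execute("INSERT INTO h5_coverage (h5, status, h9_total) VALUES (?, ?, 3)", (H5, DRAFT))


def promote(conn, run_id, h5):
    """Promote one run's draft rows for one H5 hex in a single transaction."""
    with conn:  # commits on success, rolls back if an exception escapes
        (total,) = conn.execute(
            "SELECT h9_total FROM h5_coverage WHERE h5 = ?", (h5,)).fetchone()
        (drafts,) = conn.execute(
            "SELECT COUNT(*) FROM tessera_cells WHERE run_id = ? AND h5 = ? AND status = ?",
            (run_id, h5, DRAFT)).fetchone()
        if drafts != total:
            raise ValueError(f"run {run_id}: {drafts} draft rows, expected {total}")
        # Supersede the previous current rows, linking each to its replacement.
        conn.execute(
            """UPDATE tessera_cells
               SET status = ?, superseded_by = (
                   SELECT d.id FROM tessera_cells d
                   WHERE d.h9 = tessera_cells.h9 AND d.run_id = ? AND d.status = ?)
               WHERE h5 = ? AND status = ?""",
            (SUPERSEDED, run_id, DRAFT, h5, CURRENT))
        # Promote this run's drafts, then mark the H5 hex complete.
        conn.execute(
            "UPDATE tessera_cells SET status = ? WHERE run_id = ? AND h5 = ? AND status = ?",
            (CURRENT, run_id, h5, DRAFT))
        conn.execute(
            "UPDATE h5_coverage SET h9_current = ?, status = ? WHERE h5 = ?",
            (total, CURRENT, h5))


def write_drafts(conn, run_id, h9_ids):
    conn.executemany(
        "INSERT INTO tessera_cells (h9, h5, run_id) VALUES (?, ?, ?)",
        [(h9, H5, run_id) for h9 in h9_ids])


write_drafts(conn, 1, [101, 102, 103])  # placeholder H9 ids
promote(conn, 1, H5)
write_drafts(conn, 2, [101, 102, 103])
promote(conn, 2, H5)

# The game's canonical query from Section 4.
visible = conn.execute(
    """SELECT tc.h9, tc.run_id FROM tessera_cells tc
       JOIN h5_coverage h5c ON tc.h5 = h5c.h5
       WHERE h5c.status = 2 AND tc.status = 2 AND tc.h5 = ?""", (H5,)).fetchall()
```

After both promotions, the canonical query returns only the second run's rows, and the first run's rows remain queryable as `superseded` history.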