diff --git a/docs/TESSERA-pipeline-registry.md b/docs/TESSERA-pipeline-registry.md
new file mode 100644
index 0000000..350ca9b
--- /dev/null
+++ b/docs/TESSERA-pipeline-registry.md
@@ -0,0 +1,273 @@
+# TESSERA Pipeline Registry
+### Date: 2026-04-26
+### Author: Claude Sonnet 4.6 — written with full session context
+### Status: Normative reference for all pipeline work
+
+---
+
+## What this document is
+
+A single authoritative reference for every pipeline stage — what it
+does, what source it reads, what it writes, where its output lives,
+and what its current status is. Written by the assistant that ran the
+pipeline end-to-end. Read this before touching any pipeline script.
+
+---
+
+## The 8-byte cell record (RFC-TESSERA-2.0-001)
+
+Every H9 cell in tessera.db is described by 8 bytes:
+
+```
+Byte 0-2: elev_cm  — elevation in cm, signed 24-bit
+Byte 3:   terrain  — RFC-TESSERA-2.0-001 Appendix A terrain code
+Byte 4:   hydro    — RFC-TESSERA-2.0-001 Section 3.3 hydrology code
+Byte 5:   geo_dep  — RFC-TESSERA-2.0-001 Section 3.4 deposit code
+Byte 6:   geo_flag — RFC-TESSERA-2.0-001 Section 3.5 geology flag code
+Byte 7:   occ_flag — RFC-TESSERA-3.0-OCC-001 Section 2 occupation code
+```
+
+In `otivm.sqlite3` (TESSERA 4.0), these are stored as separate INTEGER
+columns with the same names, plus per-field provenance FKs.
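The record layout above can be illustrated with a minimal Python sketch. The byte order of the signed 24-bit `elev_cm` field is not stated in this document, so little-endian is an assumption here, and the names `CellRecord`, `unpack_cell`, and `pack_cell` are hypothetical helpers rather than part of the pipeline scripts.

```python
from typing import NamedTuple


class CellRecord(NamedTuple):
    """One H9 cell, mirroring the 8-byte layout described above."""
    elev_cm: int   # bytes 0-2, signed 24-bit
    terrain: int   # byte 3
    hydro: int     # byte 4
    geo_dep: int   # byte 5
    geo_flag: int  # byte 6
    occ_flag: int  # byte 7


def unpack_cell(record: bytes) -> CellRecord:
    """Decode an 8-byte record into its six fields."""
    if len(record) != 8:
        raise ValueError(f"expected 8 bytes, got {len(record)}")
    # ASSUMPTION: little-endian byte order; check the RFC before relying on it.
    elev_cm = int.from_bytes(record[0:3], "little", signed=True)
    return CellRecord(elev_cm, *record[3:8])


def pack_cell(cell: CellRecord) -> bytes:
    """Encode a CellRecord back into 8 bytes (inverse of unpack_cell)."""
    elev = cell.elev_cm.to_bytes(3, "little", signed=True)
    return elev + bytes(cell[1:])
```

A round trip such as `unpack_cell(pack_cell(cell)) == cell` is a cheap sanity check when reading tile archives, and the signed 24-bit range (about ±83.9 km in centimetres) comfortably covers both bathymetry and topography.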
+
+---
+
+## Scale
+
+- Interaction sphere: 15–72N, 15W–75E
+- H7 tiles: 8,591,961
+- H9 cells: 421,006,081
+- Primary resolution: H9 (~180 m edge length)
+- Tile unit: H7 (~5 km² area, contains 49 H9 cells)
+
+---
+
+## Stage 00 — Elevation
+
+| Property | Value |
+|---|---|
+| Script | `build_tessera_db.py` (integrated) |
+| Source | GEBCO 2025 Grid — global 15 arc-second bathymetry/topography |
+| Source URL | https://www.gebco.net/data_and_products/gridded_bathymetry_data/ |
+| License | CC-BY 4.0 |
+| Output field | `elev_cm` (bytes 0-2) |
+| Output file | `/mnt/tessera-tiles/{h7}/tile_values.bin.gz` |
+| Fingerprint | per-tile SHA-256 |
+| Status | **COMPLETE** — all 8,591,961 H7 tiles |
+| Notes | GEBCO is a modern dataset (2025). Elevation reflects current sea level. Doggerland cells are ocean in this dataset — they will require palaeoDEM correction in a future stage (RFC-TESSERA-3.0-PALEO-001, not yet written). |
+
+---
+
+## Stage 01 — Terrain
+
+| Property | Value |
+|---|---|
+| Script | `01_sample_terrain.py` |
+| Source | ESA WorldCover 2021 v200 — global 10m land cover classification |
+| Source URL | https://esa-worldcover.org/ |
+| License | CC-BY 4.0 |
+| Fingerprint | `ac7f5d74a006d248` |
+| Output field | `terrain` (byte 3) |
+| Output file | `/mnt/tessera-scratch/terrain/{h7}/terrain_values.bin.gz` |
+| Magic | `b'TES\x01'` |
+| Status | **COMPLETE** — all H7 tiles |
+| Notes | Modern land cover, not Mesolithic. Forest, wetland, urban classifications reflect 2021 conditions. Mesolithic terrain correction is a future RFC (RFC-TESSERA-3.0-PALEO-001). The dataset is the ground truth for current physical terrain; simulation layers apply temporal corrections on top. |
+
+---
+
+## Stage 02 — Hydrology
+
+| Property | Value |
+|---|---|
+| Script | `02_sample_hydrology.py` |
+| Source | HydroSHEDS v1.1 — flow direction and accumulation at 15 arc-second |
+| Source URL | https://www.hydrosheds.org/ |
+| License | CC-BY 4.0 |
+| Fingerprint | `dcf6460a2bc0ebb5` |
+| Output field | `hydro` (byte 4) |
+| Output file | `/mnt/tessera-scratch/hydrology/{h7}/hydrology_values.bin.gz` |
+| Magic | `b'TES\x02'` |
+| Status | **COMPLETE** — all H7 tiles |
+| Notes | One cross-sidecar correction applied in stage 03: where WorldCover identifies a lake or river but HydroSHEDS has no water body type (WB_NONE), the terrain sidecar overrides. HydroSHEDS v2.0 expected October 2026 — review then. |
+
+---
+
+## Stage 03 — Tile Assembly
+
+| Property | Value |
+|---|---|
+| Script | `03_assemble_tiles.py` |
+| Source | Stages 00 + 01 + 02 sidecars |
+| Output field | bytes 0-4 (all physical fields except geology and occupation) |
+| Output file | `/mnt/tessera-tiles/{h7}/tile_values_final.bin.gz` |
+| Magic | `b'TES2'` |
+| Status | **COMPLETE** — all H7 tiles |
+| Notes | Bytes 5-6 (geo_dep, geo_flag) written as placeholders: byte 5 = 0xFF (NO_DEPOSIT), byte 6 = 0x00. Byte 7 (occ_flag) = 0x00. These placeholders were later updated in tessera.db by stage 05 for cells where geology data exists. The tile archive on USB still has placeholder bytes 5-6 for most tiles — the authoritative values are in tessera.db. |
+
+---
+
+## Stage 04a — Geology Flag
+
+| Property | Value |
+|---|---|
+| Script | `04a_sample_igme5000.py` |
+| Source | BGR IGME 5000 — 1:5M International Geological Map of Europe, layer 23 |
+| Source URL | https://services.bgr.de/arcgis/rest/services/geologie/igme5000/MapServer/23 |
+| License | GeoNutzV 2013 — open, no registration |
+| Citation | Datenquelle: IGME5000, (c) BGR Hannover, 2007 |
+| Fingerprint | `97448797fc4e3e31` |
+| Output field | `geo_flag` (byte 6) |
+| Output file | `/mnt/tessera-scratch/geology_flag/{h7}/geology_flag_values.bin.gz` |
+| Magic | `b'TES\x04'` |
+| Status | **COMPLETE** — all H7 tiles |
+| Notes | Bit layout: bits 5-4 = rock class (00=superficial, 01=sedimentary, 10=metamorphic, 11=igneous), bits 3-2 = confidence (00=no_data, 01=inferred, 10=indicated, 11=measured). Coverage gaps outside European shelf return 0x00 (no_data). Method: H5 bounding box query → shapely point-in-polygon for H9 centroids. v2 of this script (geometry-based) replaced v1 (per-H9-centroid API query) to avoid 421M API calls. |
+
+---
+
+## Stage 04b — Geology Deposit
+
+| Property | Value |
+|---|---|
+| Script | `04b_sample_mrds.py` |
+| Source | USGS MRDS — Mineral Resources Data System, mrds.csv downloaded 2022-08-23 |
+| Source URL | https://mrdata.usgs.gov/mrds/ |
+| DOI | 10.3133/ds52 |
+| License | USGS public domain |
+| Fingerprint | `ebf10a548e617164` |
+| Output field | `geo_dep` (byte 5) |
+| Output file | `/mnt/tessera-scratch/geology_dep/{h7}/geology_dep_values.bin.gz` |
+| Magic | `b'TES\x05'` |
+| Status | **COMPLETE** — all H7 tiles |
+| Notes | Commodity codes in `mrds_commodity_map.yaml`. Only the highest-priority deposit per H9 cell is encoded. European coverage is uneven — MRDS systematic updates ceased 2011. Almadén mercury mine: RESOLVED 2026-04-18. MRDS coordinates are ~34km from actual mine due to MRDS data quality, not a pipeline error. Deposit correctly encoded as Mercury (0x1d) in H7 `87390e4d9ffffff`. |
+
+---
+
+## Stage 05 — Geology Assembly into tessera.db
+
+| Property | Value |
+|---|---|
+| Script | `05_assemble_geology.py` (v5 — bulk load approach) |
+| Source | Stage 03 tile archive + stages 04a + 04b sidecars (all USB, read-only) |
+| Target | `tessera.db` — UPDATE tessera_cells SET geo_dep=?, geo_flag=? |
+| Status | **PARTIALLY COMPLETE** |
+| Notes | See below. |
+
+### Stage 05 detailed status
+
+Five versions were written. V5 (bulk load: stage db → batch UPDATE) ran
+twice but crashed at exactly the same point both times:
+
+- Crash point: 8,361,990 / 8,591,961 H7 tiles (97.3% complete)
+- Crash time: ~80 hours into Phase 1 (reading USB sidecars)
+- Root cause: unknown — clean exit (code 0), no traceback captured,
+  no OOM, no disk full, no system reboot. Deterministic crash at same H7
+  count suggests a specific problematic tile or resource exhaustion in
+  the staging SQLite db at ~410M rows.
+
+**Consequence for otivm.sqlite3:** The five OTIVM Mediterranean waypoints
+(Ostia, Capua, Brundisium, Carthago, Alexandria) were processed well
+before the crash point. Their `geo_dep` and `geo_flag` values are
+correctly populated in tessera.db and were correctly seeded into
+otivm.sqlite3.
+
+**The remaining ~230,000 H7 tiles** (the last 2.7%) have `geo_dep = 255`
+and `geo_flag = 0` placeholders in tessera.db. These tiles are at the
+edge of the interaction sphere — not OTIVM waypoints.
+
+**Decision taken:** Stage 05 is not being restarted. The OTIVM seed
+database has correct geology for all five waypoints. Future runs of
+stage 06 against otivm.sqlite3 directly (TESSERA 4.0 model) do not
+require stage 05 to be complete in tessera.db.
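For anyone who wants to gauge how many cells still carry the stage 03 placeholder pair, the following is a minimal read-only sketch. It relies only on the table and column names quoted above (`tessera_cells`, `geo_dep`, `geo_flag`); the helper names `open_read_only` and `count_placeholder_cells` are hypothetical. Note that the result is an upper bound: a cell that was processed but genuinely has no deposit and no geology data carries the same values (0xFF, 0x00), so the two cases cannot be told apart from these columns alone.

```python
import sqlite3
from typing import Tuple

NO_DEPOSIT = 0xFF      # stage 03 placeholder for geo_dep
GEO_FLAG_EMPTY = 0x00  # stage 03 placeholder for geo_flag


def open_read_only(path: str) -> sqlite3.Connection:
    """Open a SQLite file read-only, so the query cannot modify it."""
    return sqlite3.connect(f"file:{path}?mode=ro", uri=True)


def count_placeholder_cells(conn: sqlite3.Connection) -> Tuple[int, int]:
    """Return (cells_with_placeholder_pair, total_cells) for tessera_cells."""
    row = conn.execute(
        "SELECT COALESCE(SUM(geo_dep = ? AND geo_flag = ?), 0), COUNT(*) "
        "FROM tessera_cells",
        (NO_DEPOSIT, GEO_FLAG_EMPTY),
    ).fetchone()
    return int(row[0]), int(row[1])
```

Opening the database through the `mode=ro` URI keeps the check consistent with treating tessera.db as an immutable source. On a table of roughly 421M rows this is a full scan, so it is likely to take a while.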
+
+---
+
+## Stage 06 — Occupation / Culture Sampling
+
+| Property | Value |
+|---|---|
+| Script | **NOT YET WRITTEN** |
+| Source | Archaeological databases — ARIADNE, SEAD, published excavation records |
+| Target field | `occ_flag` (byte 7) — RFC-TESSERA-3.0-OCC-001 |
+| Status | **NOT STARTED** |
+
+### Stage 06 design — TESSERA 4.0 approach
+
+Under TESSERA 4.0, stage 06 does NOT run against the global tessera.db.
+It runs against `otivm.sqlite3` directly, updating only the 12,005 H9
+cells already in production.
+
+`occ_flag` bit layout (RFC-TESSERA-3.0-OCC-001 Section 2):
+```
+Bits 7-6: Occupation period
+Bits 5-4: Evidence type
+Bits 3-2: Confidence
+Bits 1-0: Reserved
+```
+
+Four Mesolithic cultures for the Mediterranean waypoints:
+
+| Code | Culture | Period BCE | Region |
+|---|---|---|---|
+| MAGL | Maglemosian | 9000-6000 | Denmark, S.Sweden, N.Germany, N.Poland |
+| ERTE | Ertebølle | 5400-3900 | Denmark, S.Sweden, N.Germany coast |
+| SAUV | Sauveterrian | 9000-6500 | SW France, N.Spain, N.Italy |
+| AZIL | Azilian | 10000-8500 | SW France, N.Spain, Switzerland |
+
+**Source investigation required before writing stage 06:**
+- ARIADNE portal: https://portal.ariadne-infrastructure.eu/
+- SEAD: https://www.sead.se/
+- Each source must be documented in `otivm.sqlite3` `source_registry`
+  before any rows are written
+
+**Stage 06 script structure (when written):**
+- Reads culture polygon GIS data for the OTIVM waypoint regions
+- Point-in-polygon test for each H9 centroid
+- Updates `occ_flag`, `occ_src`, `occ_conf` in `otivm.sqlite3`
+- Follows RFC-TESSERA-4.0-001 pipeline contract (draft → validate → promote)
+
+---
+
+## Current state of otivm.sqlite3
+
+| Field | Status | Notes |
+|---|---|---|
+| `elev_cm` | ✓ Current | GEBCO 2025, indicated confidence |
+| `terrain` | ✓ Current | ESA WorldCover v200, indicated confidence |
+| `hydro` | ✓ Current | HydroSHEDS v1.1, indicated confidence |
+| `geo_dep` | ✓ Current | USGS MRDS — indicated where present, no_data elsewhere |
+| `geo_flag` | ✓ Current | BGR IGME5000 — indicated where present, no_data elsewhere |
+| `occ_flag` | ✗ Placeholder | 0x00 everywhere — stage 06 not yet written |
+
+---
+
+## Scripts on tessera-pipeline CT
+
+Location: `/opt/tessera-pipeline/`
+Python venv: `/opt/tessera-pipeline/venv/bin/python3`
+
+| Script | Stage | Status |
+|---|---|---|
+| `01_sample_terrain.py` | 01 | Complete — do not re-run |
+| `02_sample_hydrology.py` | 02 | Complete — do not re-run |
+| `03_assemble_tiles.py` | 03 | Complete — do not re-run |
+| `04a_sample_igme5000.py` | 04a | Complete — do not re-run |
+| `04b_sample_mrds.py` | 04b | Complete — do not re-run |
+| `05_assemble_geology.py` | 05 | Crashed at 97% — abandoned |
+| `build_tessera_db.py` | DB build | Complete — do not re-run |
+| `seed_extract.py` | TESSERA 4.0 seed | Complete — do not re-run |
+| `seed_promote.py` | TESSERA 4.0 promote | Complete — do not re-run |
+
+---
+
+## Hard rules
+
+- USB drive (`/mnt/tessera-tiles`, `/mnt/tessera-scratch`, `/mnt/tessera-source`) is **READ-ONLY**
+- `tessera.db` on SSD (`/mnt/tessera-db/tessera.db`) is the immutable source — do not modify
+- `otivm.sqlite3` is the production game database — write only via RFC-TESSERA-4.0-001 pipeline contract
+- Do not re-run any completed stage without explicit project owner instruction
+
+---
+
+*TESSERA-pipeline-registry.md — 2026-04-26*
+*Written by Claude Sonnet 4.6 with full pipeline session context*
+*Next pipeline work: stage 06 (occ_flag) against otivm.sqlite3 directly*
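Since stage 06 is the next piece of work, here is a minimal sketch of how the `occ_flag` bit layout from RFC-TESSERA-3.0-OCC-001 Section 2 (period in bits 7-6, evidence type in bits 5-4, confidence in bits 3-2, bits 1-0 reserved) could be packed and unpacked. The function names are hypothetical, and the meaning of each 2-bit code is not defined in this document, so the fields are treated as raw integers in the range 0-3.

```python
from typing import Tuple


def pack_occ_flag(period: int, evidence: int, confidence: int) -> int:
    """Combine three 2-bit fields into one occ_flag byte (reserved bits = 0)."""
    for name, value in (("period", period),
                        ("evidence", evidence),
                        ("confidence", confidence)):
        if not 0 <= value <= 3:
            raise ValueError(f"{name} must fit in 2 bits, got {value}")
    return (period << 6) | (evidence << 4) | (confidence << 2)


def unpack_occ_flag(flag: int) -> Tuple[int, int, int]:
    """Split an occ_flag byte into (period, evidence, confidence)."""
    if not 0 <= flag <= 0xFF:
        raise ValueError(f"occ_flag must be a single byte, got {flag}")
    return (flag >> 6) & 0b11, (flag >> 4) & 0b11, (flag >> 2) & 0b11
```

With this layout the current placeholder value 0x00 decodes to `(0, 0, 0)`, and any value written by a future stage 06 script should round-trip through both functions unchanged.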