# TESSERA Pipeline Registry
### Date: 2026-04-26
### Author: Claude Sonnet 4.6 — written with full session context
### Status: Normative reference for all pipeline work
---
## What this document is
A single authoritative reference for every pipeline stage — what it
does, what source it reads, what it writes, where its output lives,
and what its current status is. Written by the assistant that ran the
pipeline end-to-end. Read this before touching any pipeline script.
---
## The 8-byte cell record (RFC-TESSERA-2.0-001)
Every H9 cell in tessera.db is described by 8 bytes:
```
Byte 0-2: elev_cm — elevation in cm, signed 24-bit
Byte 3: terrain — RFC-TESSERA-2.0-001 Appendix A terrain code
Byte 4: hydro — RFC-TESSERA-2.0-001 Section 3.3 hydrology code
Byte 5: geo_dep — RFC-TESSERA-2.0-001 Section 3.4 deposit code
Byte 6: geo_flag — RFC-TESSERA-2.0-001 Section 3.5 geology flag code
Byte 7: occ_flag — RFC-TESSERA-3.0-OCC-001 Section 2 occupation code
```
In `otivm.sqlite3` (TESSERA 4.0), these are stored as separate INTEGER
columns with the same names, plus per-field provenance FKs.
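
The layout above can be sketched as a pack/unpack pair. This is a minimal illustration, not the pipeline's own code: the names `CellRecord`, `pack_cell`, and `unpack_cell` are hypothetical, and the byte order of `elev_cm` is an assumption (big-endian here), since the layout does not state it.

```python
from typing import NamedTuple


class CellRecord(NamedTuple):
    elev_cm: int   # bytes 0-2, signed 24-bit, centimetres
    terrain: int   # byte 3
    hydro: int     # byte 4
    geo_dep: int   # byte 5
    geo_flag: int  # byte 6
    occ_flag: int  # byte 7


def pack_cell(rec: CellRecord) -> bytes:
    """Pack one cell into 8 bytes. ASSUMPTION: elev_cm is big-endian."""
    if not -(1 << 23) <= rec.elev_cm < (1 << 23):
        raise ValueError("elev_cm does not fit in a signed 24-bit integer")
    elev = rec.elev_cm.to_bytes(3, byteorder="big", signed=True)
    # bytes() rejects any value outside 0..255, so the one-byte fields
    # are range-checked as well.
    return elev + bytes(rec[1:])


def unpack_cell(raw: bytes) -> CellRecord:
    """Inverse of pack_cell, under the same byte-order assumption."""
    if len(raw) != 8:
        raise ValueError("a cell record is exactly 8 bytes")
    elev_cm = int.from_bytes(raw[:3], byteorder="big", signed=True)
    return CellRecord(elev_cm, *raw[3:])
```

A signed 24-bit field covers roughly ±83.9 km in centimetres, which leaves ample room for both bathymetry and topography.
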
---
## Scale
- Interaction sphere: 15°N to 72°N, 15°W to 75°E
- H7 tiles: 8,591,961
- H9 cells: 421,006,081
- Primary resolution: H9 (~180m diameter)
- Tile unit: H7 (~5km, contains 49 H9 cells)
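
The two totals above are not related by exactly 49: 8,591,961 × 49 is 8 more than the stated H9 count. The check below is a worked sketch under one assumption that this document does not state: that H7 and H9 are H3 resolutions 7 and 9, where a pentagon cell has 41 descendants two resolutions down instead of 49. On that assumption the shortfall is consistent with exactly one pentagon tile inside the interaction sphere.

```python
H7_TILES = 8_591_961
H9_CELLS = 421_006_081

# Two resolution steps, 7 children per step for a hexagon.
CHILDREN_HEX = 7 ** 2                           # 49
# ASSUMPTION: H3 pentagon child count, 1 + 5 * (7**n - 1) / 6 for n = 2.
CHILDREN_PENTAGON = 1 + 5 * (7 ** 2 - 1) // 6   # 41

naive_total = H7_TILES * CHILDREN_HEX
shortfall = naive_total - H9_CELLS
pentagon_tiles, remainder = divmod(shortfall, CHILDREN_HEX - CHILDREN_PENTAGON)

print(f"naive total:    {naive_total:,}")
print(f"shortfall:      {shortfall}")
print(f"pentagon tiles: {pentagon_tiles} (remainder {remainder})")
```
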
---
## Stage 00 — Elevation
| Property | Value |
|---|---|
| Script | `build_tessera_db.py` (integrated) |
| Source | GEBCO 2025 Grid — global 15 arc-second bathymetry/topography |
| Source URL | https://www.gebco.net/data_and_products/gridded_bathymetry_data/ |
| License | CC-BY 4.0 |
| Output field | `elev_cm` (bytes 0-2) |
| Output file | `/mnt/tessera-tiles/{h7}/tile_values.bin.gz` |
| Fingerprint | per-tile SHA-256 |
| Status | **COMPLETE** — all 8,591,961 H7 tiles |
| Notes | GEBCO is a modern dataset (2025). Elevation reflects current sea level. Doggerland cells are ocean in this dataset — they will require palaeoDEM correction in a future stage (RFC-TESSERA-3.0-PALEO-001, not yet written). |
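
A per-tile SHA-256 fingerprint can be recomputed with the standard library. This is a sketch only: the helper name `sha256_file` is hypothetical, and it assumes the fingerprint covers the bytes of the `.bin.gz` file as stored rather than the decompressed payload, which this document does not specify.

```python
import hashlib


def sha256_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops at the first empty read (end of file).
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The file is opened read-only, so this is compatible with the read-only USB mount.
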
---
## Stage 01 — Terrain
| Property | Value |
|---|---|
| Script | `01_sample_terrain.py` |
| Source | ESA WorldCover 2021 v200 — global 10m land cover classification |
| Source URL | https://esa-worldcover.org/ |
| License | CC-BY 4.0 |
| Fingerprint | `ac7f5d74a006d248` |
| Output field | `terrain` (byte 3) |
| Output file | `/mnt/tessera-scratch/terrain/{h7}/terrain_values.bin.gz` |
| Magic | `b'TES\x01'` |
| Status | **COMPLETE** — all H7 tiles |
| Notes | Modern land cover, not Mesolithic. Forest, wetland, urban classifications reflect 2021 conditions. Mesolithic terrain correction is a future RFC (RFC-TESSERA-3.0-PALEO-001). The dataset is the ground truth for current physical terrain; simulation layers apply temporal corrections on top. |
---
## Stage 02 — Hydrology
| Property | Value |
|---|---|
| Script | `02_sample_hydrology.py` |
| Source | HydroSHEDS v1.1 — flow direction and accumulation at 15 arc-second |
| Source URL | https://www.hydrosheds.org/ |
| License | CC-BY 4.0 |
| Fingerprint | `dcf6460a2bc0ebb5` |
| Output field | `hydro` (byte 4) |
| Output file | `/mnt/tessera-scratch/hydrology/{h7}/hydrology_values.bin.gz` |
| Magic | `b'TES\x02'` |
| Status | **COMPLETE** — all H7 tiles |
| Notes | One cross-sidecar correction applied in stage 03: where WorldCover identifies a lake or river but HydroSHEDS has no water body type (WB_NONE), the terrain sidecar overrides. HydroSHEDS v2.0 expected October 2026 — review then. |
---
## Stage 03 — Tile Assembly
| Property | Value |
|---|---|
| Script | `03_assemble_tiles.py` |
| Source | Stages 00 + 01 + 02 sidecars |
| Output field | bytes 0-4 (all physical fields except geology and occupation) |
| Output file | `/mnt/tessera-tiles/{h7}/tile_values_final.bin.gz` |
| Magic | `b'TES2'` |
| Status | **COMPLETE** — all H7 tiles |
| Notes | Bytes 5-6 (geo_dep, geo_flag) written as placeholders: byte 5 = 0xFF (NO_DEPOSIT), byte 6 = 0x00. Byte 7 (occ_flag) = 0x00. These placeholders were later updated in tessera.db by stage 05 for cells where geology data exists. The tile archive on USB still has placeholder bytes 5-6 for most tiles — the authoritative values are in tessera.db. |
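
A reader for the assembled tiles that counts records still carrying the geology placeholders might look like the sketch below. Two things are assumptions, not documented facts: that the magic is the first four bytes of the decompressed stream, and that packed 8-byte records follow it immediately with no further header. The function names are hypothetical.

```python
import gzip

RECORD_SIZE = 8
TILE_MAGIC = b"TES2"            # stage 03 magic, from the table above
NO_DEPOSIT = 0xFF               # byte 5 placeholder
GEO_FLAG_PLACEHOLDER = 0x00     # byte 6 placeholder


def read_tile_records(path: str) -> bytes:
    """Return the record payload of one tile file.

    ASSUMPTION: layout is magic (4 bytes) followed directly by records.
    """
    with gzip.open(path, "rb") as fh:
        data = fh.read()
    if data[:4] != TILE_MAGIC:
        raise ValueError(f"unexpected magic {data[:4]!r} in {path}")
    payload = data[4:]
    if len(payload) % RECORD_SIZE:
        raise ValueError("payload is not a whole number of 8-byte records")
    return payload


def count_geology_placeholders(payload: bytes) -> int:
    """Count records whose bytes 5 and 6 both hold the placeholder values."""
    return sum(
        1
        for off in range(0, len(payload), RECORD_SIZE)
        if payload[off + 5] == NO_DEPOSIT
        and payload[off + 6] == GEO_FLAG_PLACEHOLDER
    )
```

A cell that genuinely has no deposit and no geology data is byte-identical to an unprocessed one, so the count is an upper bound on unprocessed cells.
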
---
## Stage 04a — Geology Flag
| Property | Value |
|---|---|
| Script | `04a_sample_igme5000.py` |
| Source | BGR IGME 5000 — 1:5M International Geological Map of Europe, layer 23 |
| Source URL | https://services.bgr.de/arcgis/rest/services/geologie/igme5000/MapServer/23 |
| License | GeoNutzV 2013 (open, no registration) |
| Citation | Datenquelle: IGME5000, (c) BGR Hannover, 2007 |
| Fingerprint | `97448797fc4e3e31` |
| Output field | `geo_flag` (byte 6) |
| Output file | `/mnt/tessera-scratch/geology_flag/{h7}/geology_flag_values.bin.gz` |
| Magic | `b'TES\x04'` |
| Status | **COMPLETE** — all H7 tiles |
| Notes | Bit layout: bits 5-4 = rock class (00=superficial, 01=sedimentary, 10=metamorphic, 11=igneous), bits 3-2 = confidence (00=no_data, 01=inferred, 10=indicated, 11=measured). Coverage gaps outside European shelf return 0x00 (no_data). Method: H5 bounding box query → shapely point-in-polygon for H9 centroids. v2 of this script (geometry-based) replaced v1 (per-H9-centroid API query) to avoid 421M API calls. |
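
The `geo_flag` bit layout in the notes above can be decoded as follows. This is a minimal sketch with hypothetical function names. Bits 7-6 and 1-0 are not described in this document, so the sketch ignores them on decode and leaves them zero on encode.

```python
ROCK_CLASS = {0b00: "superficial", 0b01: "sedimentary",
              0b10: "metamorphic", 0b11: "igneous"}
CONFIDENCE = {0b00: "no_data", 0b01: "inferred",
              0b10: "indicated", 0b11: "measured"}


def decode_geo_flag(value: int) -> tuple[str, str]:
    """Return (rock class, confidence) from bits 5-4 and 3-2."""
    if not 0 <= value <= 0xFF:
        raise ValueError("geo_flag is a single byte")
    rock = (value >> 4) & 0b11
    conf = (value >> 2) & 0b11
    return ROCK_CLASS[rock], CONFIDENCE[conf]


def encode_geo_flag(rock: int, conf: int) -> int:
    """Build a geo_flag byte; undocumented bits are left at zero."""
    if rock not in ROCK_CLASS or conf not in CONFIDENCE:
        raise ValueError("rock and conf must each be in 0..3")
    return (rock << 4) | (conf << 2)
```

Note that `0x00` decodes to `("superficial", "no_data")`: when confidence is `no_data`, the rock class bits carry no information and should not be read as a superficial deposit.
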
---
## Stage 04b — Geology Deposit
| Property | Value |
|---|---|
| Script | `04b_sample_mrds.py` |
| Source | USGS MRDS — Mineral Resources Data System, mrds.csv downloaded 2022-08-23 |
| Source URL | https://mrdata.usgs.gov/mrds/ |
| DOI | 10.3133/ds52 |
| License | USGS public domain |
| Fingerprint | `ebf10a548e617164` |
| Output field | `geo_dep` (byte 5) |
| Output file | `/mnt/tessera-scratch/geology_dep/{h7}/geology_dep_values.bin.gz` |
| Magic | `b'TES\x05'` |
| Status | **COMPLETE** — all H7 tiles |
| Notes | Commodity codes in `mrds_commodity_map.yaml`. Only the highest-priority deposit per H9 cell is encoded. European coverage is uneven — MRDS systematic updates ceased 2011. Almadén mercury mine: RESOLVED 2026-04-18. MRDS coordinates are ~34km from actual mine due to MRDS data quality, not a pipeline error. Deposit correctly encoded as Mercury (0x1d) in H7 `87390e4d9ffffff`. |
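
Selecting one deposit per H9 cell could be sketched as below. Only `0x1d` (Mercury) and `0xFF` (NO_DEPOSIT) come from this document. The second commodity code, the priority ranks, and the rule that a lower rank wins are all hypothetical, chosen for illustration; the real values presumably live in `mrds_commodity_map.yaml`.

```python
from typing import Iterable, Mapping

NO_DEPOSIT = 0xFF
MERCURY = 0x1D
HYPOTHETICAL_OTHER = 0x02   # illustrative code, not taken from the real map

# HYPOTHETICAL ranks: lower number means higher priority.
PRIORITY: Mapping[int, int] = {MERCURY: 0, HYPOTHETICAL_OTHER: 1}


def pick_deposit(candidates: Iterable[int],
                 priority: Mapping[int, int] = PRIORITY) -> int:
    """Return the highest-priority known commodity code for one H9 cell."""
    ranked = [code for code in candidates if code in priority]
    if not ranked:
        # No MRDS record maps to a known commodity: keep the placeholder.
        return NO_DEPOSIT
    return min(ranked, key=lambda code: priority[code])
```
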
---
## Stage 05 — Geology Assembly into tessera.db
| Property | Value |
|---|---|
| Script | `05_assemble_geology.py` (v5 — bulk load approach) |
| Source | Stage 03 tile archive + stages 04a + 04b sidecars (all USB, read-only) |
| Target | `tessera.db` — UPDATE tessera_cells SET geo_dep=?, geo_flag=? |
| Status | **PARTIALLY COMPLETE** |
| Notes | See below. |
### Stage 05 detailed status
Five versions were written. V5 (bulk load: stage db → batch UPDATE) ran
twice but crashed at exactly the same point both times:
- Crash point: 8,361,990 / 8,591,961 H7 tiles (97.3% complete)
- Crash time: ~80 hours into Phase 1 (reading USB sidecars)
- Root cause: unknown. The process exited with code 0 and left no
  traceback; there was no OOM, no full disk, and no system reboot. A
  deterministic crash at the same H7 count suggests either a specific
  problematic tile or resource exhaustion in the staging SQLite db at
  ~410M rows.
**Consequence for otivm.sqlite3:** The five OTIVM Mediterranean waypoints
(Ostia, Capua, Brundisium, Carthago, Alexandria) were processed well
before the crash point. Their `geo_dep` and `geo_flag` values are
correctly populated in tessera.db and were correctly seeded into
otivm.sqlite3.
**The remaining ~230,000 H7 tiles** (the last 2.7%) have `geo_dep = 255`
and `geo_flag = 0` placeholders in tessera.db. These tiles are at the
edge of the interaction sphere — not OTIVM waypoints.
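
The extent of the remaining placeholders can be measured with a read-only query. The sketch below uses the `tessera_cells` table and the `geo_dep` / `geo_flag` columns named in the stage 05 table; the function name is hypothetical, and the caller is expected to pass a connection that was opened read-only.

```python
import sqlite3


def count_placeholder_cells(conn: sqlite3.Connection) -> int:
    """Count cells whose geology fields still hold the stage 03 placeholders."""
    row = conn.execute(
        "SELECT COUNT(*) FROM tessera_cells "
        "WHERE geo_dep = 255 AND geo_flag = 0"
    ).fetchone()
    return row[0]
```

The result overstates the unprocessed remainder, because a processed cell with no deposit and no geology coverage carries the same two values.
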
**Decision taken:** Stage 05 is not being restarted. The OTIVM seed
database has correct geology for all five waypoints. Future runs of
stage 06 against otivm.sqlite3 directly (TESSERA 4.0 model) do not
require stage 05 to be complete in tessera.db.
---
## Stage 06 — Occupation / Culture Sampling
| Property | Value |
|---|---|
| Script | **NOT YET WRITTEN** |
| Source | Archaeological databases — ARIADNE, SEAD, published excavation records |
| Target field | `occ_flag` (byte 7) — RFC-TESSERA-3.0-OCC-001 |
| Status | **NOT STARTED** |
### Stage 06 design — TESSERA 4.0 approach
Under TESSERA 4.0, stage 06 does NOT run against the global tessera.db.
It runs against `otivm.sqlite3` directly, updating only the 12,005 H9
cells already in production.
`occ_flag` bit layout (RFC-TESSERA-3.0-OCC-001 Section 2):
```
Bits 7-6: Occupation period
Bits 5-4: Evidence type
Bits 3-2: Confidence
Bits 1-0: Reserved
```
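
The four 2-bit fields can be packed and unpacked as sketched below. The function names are hypothetical, and the meaning of each field value is defined by RFC-TESSERA-3.0-OCC-001, not here, so the sketch treats them as plain integers in 0..3 and keeps the reserved bits at zero.

```python
def pack_occ_flag(period: int, evidence: int, confidence: int) -> int:
    """Build an occ_flag byte; reserved bits 1-0 are always written as 0."""
    for name, value in (("period", period), ("evidence", evidence),
                        ("confidence", confidence)):
        if not 0 <= value <= 3:
            raise ValueError(f"{name} must fit in 2 bits, got {value}")
    return (period << 6) | (evidence << 4) | (confidence << 2)


def unpack_occ_flag(value: int) -> dict[str, int]:
    """Split an occ_flag byte into its four 2-bit fields."""
    if not 0 <= value <= 0xFF:
        raise ValueError("occ_flag is a single byte")
    return {
        "period": (value >> 6) & 0b11,
        "evidence": (value >> 4) & 0b11,
        "confidence": (value >> 2) & 0b11,
        "reserved": value & 0b11,
    }
```

The current placeholder `0x00` unpacks to all-zero fields.
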
Four Mesolithic cultures for the Mediterranean waypoints:
| Code | Culture | Period BCE | Region |
|---|---|---|---|
| MAGL | Maglemosian | 9000-6000 | Denmark, S. Sweden, N. Germany, N. Poland |
| ERTE | Ertebølle | 5400-3900 | Denmark, S. Sweden, N. Germany coast |
| SAUV | Sauveterrian | 9000-6500 | SW France, N. Spain, N. Italy |
| AZIL | Azilian | 10000-8500 | SW France, N. Spain, Switzerland |
**Source investigation required before writing stage 06:**
- ARIADNE portal: https://portal.ariadne-infrastructure.eu/
- SEAD: https://www.sead.se/
- Each source must be documented in `otivm.sqlite3` `source_registry`
before any rows are written
**Stage 06 script structure (when written):**
- Reads culture polygon GIS data for the OTIVM waypoint regions
- Point-in-polygon test for each H9 centroid
- Updates `occ_flag`, `occ_src`, `occ_conf` in `otivm.sqlite3`
- Follows RFC-TESSERA-4.0-001 pipeline contract (draft → validate → promote)
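
The point-in-polygon step in the list above can be illustrated without any GIS dependency. Stage 04a used shapely for the same test; the sketch below swaps in a plain ray-casting routine so that it stays self-contained. It treats longitude and latitude as planar coordinates and handles a single ring without holes, which is an approximation a real stage 06 would need to revisit. The function name is hypothetical.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]  # (lon, lat)


def point_in_polygon(lon: float, lat: float, ring: Sequence[Point]) -> bool:
    """Ray casting: count how many edges a ray cast towards +lon crosses."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]   # wrap around to close the ring
        # The edge straddles the ray's latitude only if its endpoints
        # lie on opposite sides of it.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside
```

Points lying exactly on an edge are not classified consistently by this method, so centroids on a culture boundary would need an explicit tie-break rule.
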
---
## Current state of otivm.sqlite3
| Field | Status | Notes |
|---|---|---|
| `elev_cm` | ✓ Current | GEBCO 2025, indicated confidence |
| `terrain` | ✓ Current | ESA WorldCover v200, indicated confidence |
| `hydro` | ✓ Current | HydroSHEDS v1.1, indicated confidence |
| `geo_dep` | ✓ Current | USGS MRDS — indicated where present, no_data elsewhere |
| `geo_flag` | ✓ Current | BGR IGME5000 — indicated where present, no_data elsewhere |
| `occ_flag` | ✗ Placeholder | 0x00 everywhere — stage 06 not yet written |
---
## Scripts on tessera-pipeline CT
Location: `/opt/tessera-pipeline/`
Python venv: `/opt/tessera-pipeline/venv/bin/python3`
| Script | Stage | Status |
|---|---|---|
| `01_sample_terrain.py` | 01 | Complete — do not re-run |
| `02_sample_hydrology.py` | 02 | Complete — do not re-run |
| `03_assemble_tiles.py` | 03 | Complete — do not re-run |
| `04a_sample_igme5000.py` | 04a | Complete — do not re-run |
| `04b_sample_mrds.py` | 04b | Complete — do not re-run |
| `05_assemble_geology.py` | 05 | Crashed at 97% — abandoned |
| `build_tessera_db.py` | DB build | Complete — do not re-run |
| `seed_extract.py` | TESSERA 4.0 seed | Complete — do not re-run |
| `seed_promote.py` | TESSERA 4.0 promote | Complete — do not re-run |
---
## Hard rules
- USB drive (`/mnt/tessera-tiles`, `/mnt/tessera-scratch`, `/mnt/tessera-source`) is **READ-ONLY**
- `tessera.db` on SSD (`/mnt/tessera-db/tessera.db`) is the immutable source — do not modify
- `otivm.sqlite3` is the production game database — write only via RFC-TESSERA-4.0-001 pipeline contract
- Do not re-run any completed stage without explicit project owner instruction
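
One way to make the "do not modify" rule for `tessera.db` hard to break by accident is to open it through an SQLite URI with `mode=ro`. This is a sketch of that technique with a hypothetical helper name, not a description of how the existing scripts connect.

```python
import sqlite3
from pathlib import Path


def open_read_only(db_path: str) -> sqlite3.Connection:
    """Open an existing SQLite file so that any write attempt fails."""
    # as_uri() needs an absolute path and yields file:///...;
    # mode=ro asks SQLite to refuse writes on this connection.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    return sqlite3.connect(uri, uri=True)
```

With a connection opened this way, an `INSERT` or `UPDATE` raises `sqlite3.OperationalError`, and a missing file is an error as well, since `mode=ro` never creates the database.
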
---
*TESSERA-pipeline-registry.md — 2026-04-26*
*Written by Claude Sonnet 4.6 with full pipeline session context*
*Next pipeline work: stage 06 (occ_flag) against otivm.sqlite3 directly*