Add dataset assistant handover document

2026-04-27 09:22:42 +00:00
parent 17c09adc54
commit e92e1bf980
1 changed files with 243 additions and 0 deletions
--- a/docs/handover-dataset.md
+++ b/docs/handover-dataset.md
@@ -0,0 +1,243 @@
+# Handover — TESSERA Dataset Assistant
+### Date: 2026-04-27
+### For: Incoming dataset assistant
+### Read this completely before doing anything
+
+---
+
+## 0. Your role
+
+You are the dataset assistant. You own the pipeline that populates
+`data/otivm.sqlite3` with physical-world data from the USB drives.
+You do not touch game code, frontend, backend, or PM2.
+
+The game development assistant works in parallel. They own `src/`,
+`server/`, and everything the player sees. You own:
+- `pipeline/` — all extraction and promotion scripts
+- `data/create_otivm_db.sql` — the schema source of truth
+- `data/staging_otivm.sqlite3` — your working database (never in git)
+- `docs/` — dataset and pipeline documentation
+
+You do not write to `data/otivm.sqlite3` directly. You write to
+`data/staging_otivm.sqlite3`, verify, then copy to production on
+explicit project owner approval.
+
+---
+
+## 1. Read these files before doing anything
+
+In order:
+
+1. `CLAUDE.md` — workflow, three-shell model, ground rules
+2. `docs/TESSERA-dataset-registry.md` — every dataset evaluated,
+   triage decisions, drive inventory, what is on drives and what is not
+3. `docs/RFC-TESSERA-4.0-001.md` — the database schema contract
+4. `docs/RFC-TESSERA-3.0-PALEO-001.md` — paleo epoch table spec
+5. `docs/TESSERA-pipeline-registry.md` — history of the old batch
+   pipeline, what completed, what failed, and why
+6. This file
+
+---
+
+## 2. Current database state — as of 2026-04-27
+
+### `data/otivm.sqlite3` — production
+- 12,005 H9 rows across five waypoints, all `status=2` (current)
+- All H5s at `status=2` in `h5_coverage`
+- `paleo_epochs` table populated with 9 epochs per RFC-TESSERA-3.0-PALEO-001
+- H3 IDs stored as INTEGER (64-bit)
+
+### Five waypoints
+| City | H5 ID (hex TEXT form) | H9 cells |
+|---|---|---|
+| Ostia | `851e805bfffffff` | 2401 |
+| Capua | `851e8333fffffff` | 2401 |
+| Brundisium | `851e8ba3fffffff` | 2401 |
+| Carthago | `85386e23fffffff` | 2401 |
+| Alexandria | `853f5ba7fffffff` | 2401 |
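
The hex strings in the table and the INTEGER storage noted above are two spellings of the same 64-bit H3 index. Below is a minimal standard-library sketch of the conversion. The helper names `h3_hex_to_int` and `h3_int_to_hex` are illustrative assumptions, not existing pipeline code. It also shows where 2401 comes from: each resolution step splits a hexagonal cell into 7 children, so going from H5 to H9 yields 7^4 cells.

```python
# Illustrative helpers; the function names are assumptions, not pipeline code.

def h3_hex_to_int(h3_hex: str) -> int:
    """Convert an H3 index from its hex TEXT form to an integer."""
    value = int(h3_hex, 16)
    # SQLite INTEGER is a signed 64-bit value, so the index must fit in 63 bits.
    if not 0 <= value < 2**63:
        raise ValueError(f"not a valid signed 64-bit index: {h3_hex}")
    return value


def h3_int_to_hex(h3_int: int) -> str:
    """Convert an integer H3 index back to its lowercase hex TEXT form."""
    return format(h3_int, "x")


# The five waypoint H5 cells, copied from the table above.
WAYPOINTS = {
    "Ostia": "851e805bfffffff",
    "Capua": "851e8333fffffff",
    "Brundisium": "851e8ba3fffffff",
    "Carthago": "85386e23fffffff",
    "Alexandria": "853f5ba7fffffff",
}

# Four resolution steps (5 -> 9), seven children per step for a hexagon.
H9_PER_H5 = 7 ** (9 - 5)

for city, h5_hex in WAYPOINTS.items():
    h5_int = h3_hex_to_int(h5_hex)
    # Round-trip check: the integer form must map back to the same text.
    assert h3_int_to_hex(h5_int) == h5_hex
    print(city, h5_hex, h5_int)
```

Multiplying `H9_PER_H5` by the number of waypoints reproduces the production row count quoted above.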
+
+### Field status
+| Field | Status | Source |
+|---|---|---|
+| `elev_cm` | ✅ Current | GEBCO 2025 |
+| `terrain` | ✅ Current (modern only — see Section 4) | ESA WorldCover 2021 |
+| `hydro` | ✅ Current | HydroSHEDS v1.1 |
+| `geo_dep` | ✅ Current | USGS MRDS |
+| `geo_flag` | ✅ Current | BGR IGME5000 |
+| `occ_flag` | ⚠ Placeholder (0x00 everywhere) | Stage 06 not written |
+
+### `data/staging_otivm.sqlite3`
+Identical to production as of last session. Always reset from
+production before starting a new pipeline run:
+```
+cp data/otivm.sqlite3 data/staging_otivm.sqlite3
+```
+
+---
+
+## 3. USB drives — what is present
+
+Both drives are mounted read-only at `/opt/data/` in every container.
+Full inventory in `data/tessera_usb_inventory.txt`.
+
+### Drive 1: TESSERA_APR26 (/dev/sdb1, 29GB, 21GB free)
+| Dataset | Path | Size | Fields |
+|---|---|---|---|
+| GEBCO 2025 | `gebco/` | 6.8GB | `elev_cm` |
+| HydroSHEDS v1.1 | `hydrosheds/` | 240MB | `hydro` |
+| USGS MRDS | `mrds/mrds.csv` | 16MB | `geo_dep` |
+
+### Drive 2: TESSERA_WORLD (/dev/sdd1, 29GB, 7GB free)
+| Dataset | Path | Size | Fields |
+|---|---|---|---|
+| ESA WorldCover 2021 v200 | `worldcover/` | 22GB | `terrain` |
+
+---
+
+## 4. The restoration layer — critical concept
+
+**`terrain` in the database is modern WorldCover 2021. It is wrong
+for historical periods.**
+
+WorldCover reflects 2021 land cover — cities, airports, drained
+marshes, reservoirs. For all five OTIVM waypoints, the majority of
+H9 cells within urban zones are classified as built-up or cropland.
+In Roman times (14 BCE epoch) and Mesolithic times (8000 BCE epoch),
+those same cells were overwhelmingly forested.
+
+The Mediterranean basin was 60–70% forested in both periods. Today
+it is not.
+
+The restoration layer corrects this at query time using two datasets
+not yet on the drives:
+- **HYDE 3.3** — historical land use per epoch (what was actually there)
+- **KK10** — potential natural vegetation (what would grow without humans)
+
+Until these datasets are loaded and the restoration pipeline stage
+is written, `terrain` is a modern snapshot, not a historical one.
+The game development assistant has been informed. The game must not
+present `terrain` values as historically accurate for any epoch
+until the restoration layer is active.
+
+**This is the most important pending pipeline work after the drive
+additions are complete.**
+
+---
+
+## 5. What is missing from the drives — priority additions
+
+These four datasets must be downloaded and added to Drive 1 before
+the per-H5 pipeline can be built. Total: ~5.2GB, which fits in the
+21GB free on Drive 1.
+
+| Priority | Dataset | Size | Why needed |
+|---|---|---|---|
+| 1 | BGR IGME5000 shapefile | ~200MB | `geo_flag` currently depends on live API — must be local |
+| 2 | HYDE 3.3 historical land use | ~4GB | Restoration layer — required |
+| 3 | KK10 potential natural vegetation | ~500MB | Restoration layer — required alongside HYDE |
+| 4 | HydroRivers Europe + Africa | ~500MB | Accurate river placement for `hydro` |
+
+Download sources in `docs/TESSERA-dataset-registry.md`.
+
+Drives are read-only when mounted. To add data:
+1. Unmount from Proxmox host
+2. Remount read-write on a machine with ext4 write access
+3. Copy data
+4. Remount read-only
+5. Verify with inventory check before proceeding
+
+**Do not begin pipeline design until all four additions are on Drive 1.**
+
+---
+
+## 6. The per-H5 pipeline — not yet built
+
+The new pipeline replaces the old batch pipeline entirely. Key facts:
+
+- Processes one H5 hex at a time
+- Reads all data from USB drives (no live API calls)
+- Writes to `staging_otivm.sqlite3` only
+- Follows RFC-TESSERA-4.0-001 pipeline contract:
+  draft → validate → promote → copy to production
+- Manually triggered with project owner approval
+- Supersede support built in — can update existing H5 rows when
+  a source dataset improves
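
One way to picture the contract's stage order is as a small state machine that refuses to skip steps. The sketch below is only an illustration under stated assumptions: the `Stage` members mirror the four steps listed above, but the enum, the `advance` function and the `owner_approved` flag are hypothetical and are not taken from RFC-TESSERA-4.0-001.

```python
from enum import Enum


class Stage(Enum):
    """The four contract steps, in order (names are illustrative)."""
    DRAFT = "draft"
    VALIDATE = "validate"
    PROMOTE = "promote"
    COPY_TO_PRODUCTION = "copy to production"


# Enum iteration follows definition order, which is the contract order.
ORDER = list(Stage)


def advance(current: Stage, owner_approved: bool = False) -> Stage:
    """Return the next stage, never skipping one.

    The final copy step is refused unless approval is passed explicitly.
    """
    idx = ORDER.index(current)
    if idx == len(ORDER) - 1:
        raise ValueError("already at the final stage")
    nxt = ORDER[idx + 1]
    if nxt is Stage.COPY_TO_PRODUCTION and not owner_approved:
        raise PermissionError(
            "copy to production needs explicit project owner approval"
        )
    return nxt
```

For example, `advance(Stage.PROMOTE)` raises `PermissionError`, while `advance(Stage.PROMOTE, owner_approved=True)` returns `Stage.COPY_TO_PRODUCTION`.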
+
+### Read strategy — mandatory
+Always crop raster to H5 bounding box before sampling. Load the crop
+into a numpy array in RAM. Sample all 2401 H9 centroids from the
+array. Never seek 2401 individual points from USB.
+
+Without this: GEBCO reads at ~25s per H5 (USB random seek speed).
+With this: GEBCO reads at ~1-2s per H5 (one sequential crop + RAM).
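
The read strategy can be sketched as "compute one window, read it once, sample from memory". The example below is a dependency-free illustration, not the pipeline implementation. Plain lists stand in for the numpy array, the `read_window` callback stands in for a rasterio windowed read, and the 2401 points are a synthetic grid, not real H9 centroids. All function names, the grid origin and the pixel size are assumptions made for the example.

```python
import math


def pixel_of(lon, lat, west, north, res):
    """Map lon/lat to (row, col) on a north-up grid whose top-left is (west, north)."""
    return math.floor((north - lat) / res), math.floor((lon - west) / res)


def crop_window(points, west, north, res):
    """Smallest window (row0, row1, col0, col1), end-exclusive, covering all points."""
    rows, cols = zip(*(pixel_of(lon, lat, west, north, res) for lon, lat in points))
    return min(rows), max(rows) + 1, min(cols), max(cols) + 1


def sample_from_crop(read_window, points, west, north, res):
    """Read one bounding window, then sample every point from the in-memory crop."""
    r0, r1, c0, c1 = crop_window(points, west, north, res)
    crop = read_window(r0, r1, c0, c1)  # the only read against the slow device
    values = []
    for lon, lat in points:
        r, c = pixel_of(lon, lat, west, north, res)
        values.append(crop[r - r0][c - c0])  # offsets are relative to the crop
    return values


# Stand-in raster: each cell encodes its own position so results are checkable.
RASTER = [[r * 1000 + c for c in range(200)] for r in range(200)]
READS = []  # records every window requested from the "device"


def read_window(r0, r1, c0, c1):
    READS.append((r0, r1, c0, c1))
    return [row[c0:c1] for row in RASTER[r0:r1]]


WEST, NORTH, RES = 12.0, 42.0, 0.01  # hypothetical grid origin and pixel size
# 49 x 49 = 2401 synthetic sample points standing in for H9 centroids.
POINTS = [(12.5 + 0.002 * i, 41.5 - 0.002 * j) for i in range(49) for j in range(49)]

values = sample_from_crop(read_window, POINTS, WEST, NORTH, RES)
print(len(values), "samples from", len(READS), "read")
```

The point of the design is that the number of device reads stays at one per raster per H5, however many centroids are sampled.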
+
+### RAM allocation
+- Baseline container RAM: 2GB
+- Pipeline mode: 24GB (non-essential containers suspended on dev box)
+- Relevant tile sizes: GEBCO tile ~891MB, WorldCover tile ~100MB
+- In-memory strategy: load relevant tiles at pipeline start,
+  release at end
+- Three Proxmox boxes: dev (pipeline work), staging (validation),
+  production (live game) — transfer via WireGuard mesh
+
+### Python venv
+- Path: `/home/otivm/pipeline-venv`
+- Packages: h3, requests, numpy, rasterio, shapely, pyproj
+- Do not use `/home/otivm/venv` — that belongs to the game assistant
+
+### Pipeline scripts (committed, not yet functional for new pipeline)
+- `pipeline/seed_extract.py` — old Dell-based extractor, do not re-run
+- `pipeline/seed_promote.py` — old promotion script, do not re-run
+- New per-H5 scripts to be written after drive additions complete
+
+---
+
+## 7. Infrastructure
+
+### OTIVM container (CT 1105, proliant-dev, 10.0.0.23)
+- App user: `otivm`
+- Repo: `/home/otivm/OTIVM`
+- Pipeline venv: `/home/otivm/pipeline-venv`
+- Production DB: `data/otivm.sqlite3`
+- Staging DB: `data/staging_otivm.sqlite3` (not in git)
+- Claude Code runs here as `otivm` via the `work` alias
+
+### Three Proxmox boxes
+- **proliant-dev (srv-a, 10.0.0.11)** — development and pipeline work
+- **staging box** — validation before production
+- **production box** — live game, never touched by pipeline directly
+
+### Gitea
+- Repo: `https://gitea.barternetwork.us/TheRON/OTIVM`
+- Branch: `main`
+- MCP: `mcp.civicus.us` — read any file directly from Claude chat
+
+---
+
+## 8. Hard rules
+
+- Never write to `data/otivm.sqlite3` directly — always via staging
+- Never commit `*.sqlite3` files — both databases are gitignored
+- Never run pipeline without project owner approval and supervision
+- Never look for or modify `tessera.db`; it no longer exists (Dell decommissioned)
+- Never touch game code (`src/`, `server/`, `public/`)
+- Read `TESSERA-dataset-registry.md` before evaluating any new source
+- One file at a time. One confirmation before proceeding.
+- Do not start pipeline coding without explicit project owner instruction
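
The first rule can be backed by how scripts open the two databases. The sketch below is a suggestion, not existing pipeline code, and the helper names `open_production_readonly` and `open_staging` are hypothetical. It relies on SQLite's URI filename support, where `mode=ro` makes any write attempt on that connection fail.

```python
import sqlite3
from pathlib import Path

# Paths as given in this document, relative to the repo root.
PRODUCTION_DB = Path("data/otivm.sqlite3")
STAGING_DB = Path("data/staging_otivm.sqlite3")


def open_production_readonly(path: Path = PRODUCTION_DB) -> sqlite3.Connection:
    """Open production read-only; writes raise sqlite3.OperationalError."""
    return sqlite3.connect(f"file:{path.as_posix()}?mode=ro", uri=True)


def open_staging(path: Path = STAGING_DB) -> sqlite3.Connection:
    """Open the staging database read-write.

    Refuses any file whose name is not the staging file name, so a
    mistyped path cannot silently point a writer at production.
    """
    if path.name != STAGING_DB.name:
        raise ValueError(f"refusing to open {path} for writing")
    return sqlite3.connect(path)
```

A script that only ever obtains write connections through `open_staging` cannot modify production by accident. The approved copy step remains a separate, manual action.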
+
+---
+
+## 9. Pending work — in order
+
+1. **Drive additions** — project owner downloads the four datasets and copies them to Drive 1
+2. **Pipeline architecture document** — design before any code
+3. **Per-H5 pipeline scripts** — one file at a time, supervised
+4. **Restoration layer** — HYDE + KK10 integration into terrain field
+5. **Stage 06 (occ_flag)** — archaeological sources, deferred until
+   simulation track begins
+
+---
+
+*Handover 2026-04-27 — dataset assistant track*
+*Database seeded, paleo_epochs added, drives inventoried.*
+*Pipeline not yet built. Drive additions required first.*
+*The restoration layer is the most important pending concept.*