diff --git a/docs/handover-dataset.md b/docs/handover-dataset.md
new file mode 100644
index 0000000..ce9869f
--- /dev/null
+++ b/docs/handover-dataset.md
@@ -0,0 +1,243 @@
+# Handover — TESSERA Dataset Assistant
+### Date: 2026-04-27
+### For: Incoming dataset assistant
+### Read this completely before doing anything
+
+---
+
+## 0. Your role
+
+You are the dataset assistant. You own the pipeline that populates
+`data/otivm.sqlite3` with physical-world data from the USB drives.
+You do not touch game code, frontend, backend, or PM2.
+
+The game development assistant works in parallel. They own `src/`,
+`server/`, and everything the player sees. You own:
+- `pipeline/` — all extraction and promotion scripts
+- `data/create_otivm_db.sql` — the schema source of truth
+- `data/staging_otivm.sqlite3` — your working database (never in git)
+- `docs/` — dataset and pipeline documentation
+
+You do not write to `data/otivm.sqlite3` directly. You write to
+`data/staging_otivm.sqlite3`, verify, then copy to production on
+explicit project owner approval.
+
+---
+
+## 1. Read these files before doing anything
+
+In order:
+
+1. `CLAUDE.md` — workflow, three-shell model, ground rules
+2. `docs/TESSERA-dataset-registry.md` — every dataset evaluated,
+   triage decisions, drive inventory, what is on drives and what is not
+3. `docs/RFC-TESSERA-4.0-001.md` — the database schema contract
+4. `docs/RFC-TESSERA-3.0-PALEO-001.md` — paleo epoch table spec
+5. `docs/TESSERA-pipeline-registry.md` — history of the old batch
+   pipeline, what completed, what failed, and why
+6. This file
+
+---
+
+## 2. Current database state — as of 2026-04-27
+
+### `data/otivm.sqlite3` — production
+- 12,005 H9 rows across five waypoints, all `status=2` (current)
+- All H5s at `status=2` in `h5_coverage`
+- `paleo_epochs` table populated with 9 epochs per RFC-TESSERA-3.0-PALEO-001
+- H3 IDs stored as INTEGER (64-bit)
+
+### Five waypoints
+| City | H5 TEXT | H9 cells |
+|---|---|---|
+| Ostia | `851e805bfffffff` | 2401 |
+| Capua | `851e8333fffffff` | 2401 |
+| Brundisium | `851e8ba3fffffff` | 2401 |
+| Carthago | `85386e23fffffff` | 2401 |
+| Alexandria | `853f5ba7fffffff` | 2401 |
+
+### Field status
+| Field | Status | Source |
+|---|---|---|
+| `elev_cm` | ✅ Current | GEBCO 2025 |
+| `terrain` | ✅ Current (modern only — see Section 4) | ESA WorldCover 2021 |
+| `hydro` | ✅ Current | HydroSHEDS v1.1 |
+| `geo_dep` | ✅ Current | USGS MRDS |
+| `geo_flag` | ✅ Current | BGR IGME5000 |
+| `occ_flag` | ⚠ Placeholder (0x00 everywhere) | Stage 06 not written |
+
+### `data/staging_otivm.sqlite3`
+Identical to production as of last session. Always reset from
+production before starting a new pipeline run:
+```
+cp data/otivm.sqlite3 data/staging_otivm.sqlite3
+```
+
+---
+
+## 3. USB drives — what is present
+
+Both drives mounted read-only at `/opt/data/` on every container.
+Full inventory in `data/tessera_usb_inventory.txt`.
+
+### Drive 1: TESSERA_APR26 (/dev/sdb1, 29GB, 21GB free)
+| Dataset | Path | Size | Fields |
+|---|---|---|---|
+| GEBCO 2025 | `gebco/` | 6.8GB | `elev_cm` |
+| HydroSHEDS v1.1 | `hydrosheds/` | 240MB | `hydro` |
+| USGS MRDS | `mrds/mrds.csv` | 16MB | `geo_dep` |
+
+### Drive 2: TESSERA_WORLD (/dev/sdd1, 29GB, 7GB free)
+| Dataset | Path | Size | Fields |
+|---|---|---|---|
+| ESA WorldCover 2021 v200 | `worldcover/` | 22GB | `terrain` |
+
+---
+
+## 4. The restoration layer — critical concept
+
+**`terrain` in the database is modern WorldCover 2021. It is wrong
+for historical periods.**
+
+WorldCover reflects 2021 land cover — cities, airports, drained
+marshes, reservoirs. For all five OTIVM waypoints, the majority of
+H9 cells within urban zones are classified as built-up or cropland.
+In Roman times (14 BCE epoch) and Mesolithic times (8000 BCE epoch),
+those same cells were overwhelmingly forested.
+
+The Mediterranean basin was 60–70% forested in both periods. Today
+it is not.
+
+The restoration layer corrects this at query time using two datasets
+not yet on the drives:
+- **HYDE 3.3** — historical land use per epoch (what was actually there)
+- **KK10** — potential natural vegetation (what would grow without humans)
+
+Until these datasets are loaded and the restoration pipeline stage
+is written, `terrain` is a modern snapshot, not a historical one.
+The game development assistant has been informed. The game must not
+present `terrain` values as historically accurate for any epoch
+until the restoration layer is active.
+
+**This is the most important pending pipeline work after the drive
+additions are complete.**
+
+---
+
+## 5. What is missing from the drives — priority additions
+
+These four datasets must be downloaded and added to Drive 1 before
+the per-H5 pipeline can be built. Total: ~5.2GB, fits in 21GB free.
+
+| Priority | Dataset | Size | Why needed |
+|---|---|---|---|
+| 1 | BGR IGME5000 shapefile | ~200MB | `geo_flag` currently depends on live API — must be local |
+| 2 | HYDE 3.3 historical land use | ~4GB | Restoration layer — required |
+| 3 | KK10 potential natural vegetation | ~500MB | Restoration layer — required alongside HYDE |
+| 4 | HydroRivers Europe + Africa | ~500MB | Accurate river placement for `hydro` |
+
+Download sources in `docs/TESSERA-dataset-registry.md`.
+
+Drives are read-only when mounted. To add data:
+1. Unmount from Proxmox host
+2. Remount read-write on a machine with ext4 write access
+3. Copy data
+4. Remount read-only
+5. Verify with inventory check before proceeding
+
+**Do not begin pipeline design until all four additions are on Drive 1.**
+
+---
+
+## 6. The per-H5 pipeline — not yet built
+
+The new pipeline replaces the old batch pipeline entirely. Key facts:
+
+- Processes one H5 hex at a time
+- Reads all data from USB drives (no live API calls)
+- Writes to `staging_otivm.sqlite3` only
+- Follows RFC-TESSERA-4.0-001 pipeline contract:
+  draft → validate → promote → copy to production
+- Manually triggered with project owner approval
+- Supersede support built in — can update existing H5 rows when
+  a source dataset improves
+
+### Read strategy — mandatory
+Always crop raster to H5 bounding box before sampling. Load the crop
+into a numpy array in RAM. Sample all 2401 H9 centroids from the
+array. Never seek 2401 individual points from USB.
+
+Without this: GEBCO reads at ~25s per H5 (USB random seek speed).
+With this: GEBCO reads at ~1-2s per H5 (one sequential crop + RAM).
+
+### RAM allocation
+- Baseline container RAM: 2GB
+- Pipeline mode: 24GB (non-essential containers suspended on dev box)
+- Relevant tile sizes: GEBCO tile ~891MB, WorldCover tile ~100MB
+- In-memory strategy: load relevant tiles at pipeline start,
+  release at end
+- Three Proxmox boxes: dev (pipeline work), staging (validation),
+  production (live game) — transfer via WireGuard mesh
+
+### Python venv
+- Path: `/home/otivm/pipeline-venv`
+- Packages: h3, requests, numpy, rasterio, shapely, pyproj
+- Do not use `/home/otivm/venv` — that belongs to the game assistant
+
+### Pipeline scripts (committed, not yet functional for new pipeline)
+- `pipeline/seed_extract.py` — old Dell-based extractor, do not re-run
+- `pipeline/seed_promote.py` — old promotion script, do not re-run
+- New per-H5 scripts to be written after drive additions complete
+
+---
+
+## 7. Infrastructure
+
+### OTIVM container (CT 1105, proliant-dev, 10.0.0.23)
+- App user: `otivm`
+- Repo: `/home/otivm/OTIVM`
+- Pipeline venv: `/home/otivm/pipeline-venv`
+- Production DB: `data/otivm.sqlite3`
+- Staging DB: `data/staging_otivm.sqlite3` (not in git)
+- Claude Code runs here as `otivm` via `work` alias
+
+### Three Proxmox boxes
+- **proliant-dev (srv-a, 10.0.0.11)** — development and pipeline work
+- **staging box** — validation before production
+- **production box** — live game, never touched by pipeline directly
+
+### Gitea
+- Repo: `https://gitea.barternetwork.us/TheRON/OTIVM`
+- Branch: `main`
+- MCP: `mcp.civicus.us` — read any file directly from Claude chat
+
+---
+
+## 8. Hard rules
+
+- Never write to `data/otivm.sqlite3` directly — always via staging
+- Never commit `*.sqlite3` files — both databases are gitignored
+- Never run pipeline without project owner approval and supervision
+- Never modify `tessera.db` — it no longer exists (Dell decommissioned)
+- Never touch game code (`src/`, `server/`, `public/`)
+- Read `TESSERA-dataset-registry.md` before evaluating any new source
+- One file at a time. One confirmation before proceeding.
+- Do not start pipeline coding without explicit project owner instruction
+
+---
+
+## 9. Pending work — in order
+
+1. **Drive additions** — project owner downloads and mounts four datasets
+2. **Pipeline architecture document** — design before any code
+3. **Per-H5 pipeline scripts** — one file at a time, supervised
+4. **Restoration layer** — HYDE + KK10 integration into terrain field
+5. **Stage 06 (occ_flag)** — archaeological sources, deferred until
+   simulation track begins
+
+---
+
+*Handover 2026-04-27 — dataset assistant track*
+*Database seeded, paleo_epochs added, drives inventoried.*
+*Pipeline not yet built. Drive additions required first.*
+*The restoration layer is the most important pending concept.*
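
Note on the "Read strategy" rule in Section 6: the crop-then-sample pattern it mandates could look roughly like the sketch below. This is an illustration only, not part of the committed pipeline. It assumes a north-up raster in plain lon/lat degrees with square pixels (an assumption about the GEBCO grid), and the function names `bbox_to_window` and `sample_from_crop` are hypothetical. A nested list stands in for the numpy crop so the sketch runs without dependencies; the indexing is the same for an array.

```python
import math

def bbox_to_window(bounds, origin, px_deg, shape):
    """Turn a lon/lat bounding box into a clamped pixel window.

    bounds: (west, south, east, north) in degrees
    origin: (lon, lat) of the raster's top-left corner
    px_deg: pixel size in degrees (square pixels, north-up assumed)
    shape:  (rows, cols) of the full raster
    Returns (row0, row1, col0, col1), end-exclusive.
    """
    west, south, east, north = bounds
    lon0, lat0 = origin
    rows, cols = shape
    col0 = max(0, math.floor((west - lon0) / px_deg))
    col1 = min(cols, math.ceil((east - lon0) / px_deg))
    row0 = max(0, math.floor((lat0 - north) / px_deg))
    row1 = min(rows, math.ceil((lat0 - south) / px_deg))
    return row0, row1, col0, col1

def sample_from_crop(crop, window, origin, px_deg, points):
    """Nearest-pixel lookup of (lon, lat) points in an in-memory crop."""
    row0, row1, col0, col1 = window
    lon0, lat0 = origin
    values = []
    for lon, lat in points:
        r = math.floor((lat0 - lat) / px_deg)
        c = math.floor((lon - lon0) / px_deg)
        # Clamp so points on the bounding-box edge stay inside the crop.
        r = min(max(r, row0), row1 - 1)
        c = min(max(c, col0), col1 - 1)
        values.append(crop[r - row0][c - col0])
    return values

# Synthetic 10x10 raster covering lon 10..20, lat 40..50 at 1 degree/pixel.
origin, px_deg, shape = (10.0, 50.0), 1.0, (10, 10)
raster = [[r * 10 + c for c in range(10)] for r in range(10)]

# One "sequential read": slice the window once, then sample only from RAM.
window = bbox_to_window((12.2, 44.1, 14.8, 46.9), origin, px_deg, shape)
row0, row1, col0, col1 = window
crop = [row[col0:col1] for row in raster[row0:row1]]

centroids = [(12.5, 46.5), (14.5, 44.5)]  # stand-ins for the 2401 H9 centroids
values = sample_from_crop(crop, window, origin, px_deg, centroids)
print(window, values)
```

In the real pipeline the slice would presumably be a single windowed raster read (rasterio is in the venv package list) and the centroids would come from the H5 cell's H9 children. Both are left out here because the pipeline design has not been approved yet.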