# Handover — TESSERA Dataset Assistant
### Date: 2026-04-27
### For: Incoming dataset assistant
### Read this completely before doing anything

---

## 0. Your role

You are the dataset assistant. You own the pipeline that populates `data/otivm.sqlite3` with physical-world data from the USB drives. You do not touch game code, frontend, backend, or PM2.

The game development assistant works in parallel. They own `src/`, `server/`, and everything the player sees.

You own:
- `pipeline/` — all extraction and promotion scripts
- `data/create_otivm_db.sql` — the schema source of truth
- `data/staging_otivm.sqlite3` — your working database (never in git)
- `docs/` — dataset and pipeline documentation

You do not write to `data/otivm.sqlite3` directly. You write to `data/staging_otivm.sqlite3`, verify, then copy to production on explicit project owner approval.

---

## 1. Read these files before doing anything

In order:
1. `CLAUDE.md` — workflow, three-shell model, ground rules
2. `docs/TESSERA-dataset-registry.md` — every dataset evaluated, triage decisions, drive inventory, what is on drives and what is not
3. `docs/RFC-TESSERA-4.0-001.md` — the database schema contract
4. `docs/RFC-TESSERA-3.0-PALEO-001.md` — paleo epoch table spec
5. `docs/TESSERA-pipeline-registry.md` — history of the old batch pipeline, what completed, what failed, and why
6. This file

---

## 2. Current database state — as of 2026-04-27

### `data/otivm.sqlite3` — production

- 12,005 H9 rows across five waypoints, all `status=2` (current)
- All H5s at `status=2` in `h5_coverage`
- `paleo_epochs` table populated with 9 epochs per RFC-TESSERA-3.0-PALEO-001
- H3 IDs stored as INTEGER (64-bit)

### Five waypoints

| City | H5 TEXT | H9 cells |
|---|---|---|
| Ostia | `851e805bfffffff` | 2401 |
| Capua | `851e8333fffffff` | 2401 |
| Brundisium | `851e8ba3fffffff` | 2401 |
| Carthago | `85386e23fffffff` | 2401 |
| Alexandria | `853f5ba7fffffff` | 2401 |

### Field status

| Field | Status | Source |
|---|---|---|
| `elev_cm` | ✅ Current | GEBCO 2025 |
| `terrain` | ✅ Current (modern only — see Section 4) | ESA WorldCover 2021 |
| `hydro` | ✅ Current | HydroSHEDS v1.1 |
| `geo_dep` | ✅ Current | USGS MRDS |
| `geo_flag` | ✅ Current | BGR IGME5000 |
| `occ_flag` | ⚠ Placeholder (0x00 everywhere) | Stage 06 not written |

### `data/staging_otivm.sqlite3`

Identical to production as of last session. Always reset from production before starting a new pipeline run:

```
cp data/otivm.sqlite3 data/staging_otivm.sqlite3
```

---

## 3. USB drives — what is present

Both drives mounted read-only at `/opt/data/` on every container. Full inventory in `data/tessera_usb_inventory.txt`.

### Drive 1: TESSERA_APR26 (/dev/sdb1, 29GB, 21GB free)

| Dataset | Path | Size | Fields |
|---|---|---|---|
| GEBCO 2025 | `gebco/` | 6.8GB | `elev_cm` |
| HydroSHEDS v1.1 | `hydrosheds/` | 240MB | `hydro` |
| USGS MRDS | `mrds/mrds.csv` | 16MB | `geo_dep` |

### Drive 2: TESSERA_WORLD (/dev/sdd1, 29GB, 7GB free)

| Dataset | Path | Size | Fields |
|---|---|---|---|
| ESA WorldCover 2021 v200 | `worldcover/` | 22GB | `terrain` |

---

## 4. The restoration layer — critical concept

**`terrain` in the database is modern WorldCover 2021. It is wrong for historical periods.**

WorldCover reflects 2021 land cover — cities, airports, drained marshes, reservoirs.
For all five OTIVM waypoints, the majority of H9 cells within urban zones are classified as built-up or cropland. In Roman times (14 BCE epoch) and Mesolithic times (8000 BCE epoch), those same cells were overwhelmingly forested. The Mediterranean basin was 60–70% forested in both periods. Today it is not.

The restoration layer corrects this at query time using two datasets not yet on the drives:
- **HYDE 3.3** — historical land use per epoch (what was actually there)
- **KK10** — potential natural vegetation (what would grow without humans)

Until these datasets are loaded and the restoration pipeline stage is written, `terrain` is a modern snapshot, not a historical one.

The game development assistant has been informed. The game must not present `terrain` values as historically accurate for any epoch until the restoration layer is active.

**This is the most important pending pipeline work after the drive additions are complete.**

---

## 5. What is missing from the drives — priority additions

These four datasets must be downloaded and added to Drive 1 before the per-H5 pipeline can be built. Total: ~5.2GB, fits in 21GB free.

| Priority | Dataset | Size | Why needed |
|---|---|---|---|
| 1 | BGR IGME5000 shapefile | ~200MB | `geo_flag` currently depends on live API — must be local |
| 2 | HYDE 3.3 historical land use | ~4GB | Restoration layer — required |
| 3 | KK10 potential natural vegetation | ~500MB | Restoration layer — required alongside HYDE |
| 4 | HydroRivers Europe + Africa | ~500MB | Accurate river placement for `hydro` |

Download sources in `docs/TESSERA-dataset-registry.md`.

Drives are read-only when mounted. To add data:
1. Unmount from Proxmox host
2. Remount read-write on a machine with ext4 write access
3. Copy data
4. Remount read-only
5. Verify with inventory check before proceeding

**Do not begin pipeline design until all four additions are on Drive 1.**

---

## 6. The per-H5 pipeline — not yet built

The new pipeline replaces the old batch pipeline entirely. Key facts:

- Processes one H5 hex at a time
- Reads all data from USB drives (no live API calls)
- Writes to `staging_otivm.sqlite3` only
- Follows RFC-TESSERA-4.0-001 pipeline contract: draft → validate → promote → copy to production
- Manually triggered with project owner approval
- Supersede support built in — can update existing H5 rows when a source dataset improves

### Read strategy — mandatory

Always crop raster to H5 bounding box before sampling. Load the crop into a numpy array in RAM. Sample all 2401 H9 centroids from the array. Never seek 2401 individual points from USB.

Without this: GEBCO reads at ~25s per H5 (USB random seek speed).
With this: GEBCO reads at ~1-2s per H5 (one sequential crop + RAM).

### RAM allocation

- Baseline container RAM: 2GB
- Pipeline mode: 24GB (non-essential containers suspended on dev box)
- Relevant tile sizes: GEBCO tile ~891MB, WorldCover tile ~100MB
- In-memory strategy: load relevant tiles at pipeline start, release at end
- Three Proxmox boxes: dev (pipeline work), staging (validation), production (live game) — transfer via WireGuard mesh

### Python venv

- Path: `/home/otivm/pipeline-venv`
- Packages: h3, requests, numpy, rasterio, shapely, pyproj
- Do not use `/home/otivm/venv` — that belongs to the game assistant

### Pipeline scripts (committed, not yet functional for new pipeline)

- `pipeline/seed_extract.py` — old Dell-based extractor, do not re-run
- `pipeline/seed_promote.py` — old promotion script, do not re-run
- New per-H5 scripts to be written after drive additions complete

---

## 7. Infrastructure

### OTIVM container (CT 1105, proliant-dev, 10.0.0.23)

- App user: `otivm`
- Repo: `/home/otivm/OTIVM`
- Pipeline venv: `/home/otivm/pipeline-venv`
- Production DB: `data/otivm.sqlite3`
- Staging DB: `data/staging_otivm.sqlite3` (not in git)
- Claude Code runs here as `otivm` via `work` alias

### Three Proxmox boxes

- **proliant-dev (srv-a, 10.0.0.11)** — development and pipeline work
- **staging box** — validation before production
- **production box** — live game, never touched by pipeline directly

### Gitea

- Repo: `https://gitea.barternetwork.us/TheRON/OTIVM`
- Branch: `main`
- MCP: `mcp.civicus.us` — read any file directly from Claude chat

---

## 8. Hard rules

- Never write to `data/otivm.sqlite3` directly — always via staging
- Never commit `*.sqlite3` files — both databases are gitignored
- Never run pipeline without project owner approval and supervision
- Never modify `tessera.db` — it no longer exists (Dell decommissioned)
- Never touch game code (`src/`, `server/`, `public/`)
- Read `TESSERA-dataset-registry.md` before evaluating any new source
- One file at a time. One confirmation before proceeding.
- Do not start pipeline coding without explicit project owner instruction

---

## 9. Pending work — in order

1. **Drive additions** — project owner downloads and mounts four datasets
2. **Pipeline architecture document** — design before any code
3. **Per-H5 pipeline scripts** — one file at a time, supervised
4. **Restoration layer** — HYDE + KK10 integration into terrain field
5. **Stage 06 (occ_flag)** — archaeological sources, deferred until simulation track begins

---

*Handover 2026-04-27 — dataset assistant track*
*Database seeded, paleo_epochs added, drives inventoried.*
*Pipeline not yet built. Drive additions required first.*
*The restoration layer is the most important pending concept.*
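
For reference, the mandatory read strategy in Section 6 (crop once, then sample every H9 centroid from RAM) might look roughly like the sketch below. It is a dependency-free illustration under stated assumptions, not pipeline code: `sample_crop` and its parameters are hypothetical names, a nested list stands in for the numpy array, and the crop is assumed to be a north-up grid with square pixels measured in degrees. In a real run the crop would presumably come from a single windowed raster read and the centroids from the H3 library.

```python
import math
from typing import List, Sequence, Tuple


def sample_crop(
    crop: Sequence[Sequence[float]],
    west: float,
    north: float,
    pixel_deg: float,
    points: Sequence[Tuple[float, float]],
) -> List[float]:
    """Nearest-pixel lookup of (lon, lat) points in an in-RAM raster crop.

    crop[0][0] is the north-west pixel; rows run north to south.
    Hypothetical helper for illustration only.
    """
    n_rows, n_cols = len(crop), len(crop[0])
    values = []
    for lon, lat in points:
        col = math.floor((lon - west) / pixel_deg)
        row = math.floor((north - lat) / pixel_deg)
        # Clamp to the crop so points on the south or east edge map to the
        # last pixel. Points outside the crop are clamped too, so the caller
        # must pass a bounding box that covers every centroid.
        col = min(max(col, 0), n_cols - 1)
        row = min(max(row, 0), n_rows - 1)
        values.append(crop[row][col])
    return values


if __name__ == "__main__":
    # Toy 3x4 crop standing in for one sequential read of an H5 bounding box.
    crop = [
        [10.0, 11.0, 12.0, 13.0],
        [20.0, 21.0, 22.0, 23.0],
        [30.0, 31.0, 32.0, 33.0],
    ]
    # Two made-up centroids; the real pipeline would pass all 2401 per H5.
    centroids = [(12.05, 41.95), (12.35, 41.75)]
    print(sample_crop(crop, west=12.0, north=42.0, pixel_deg=0.1, points=centroids))
```

The lookups index memory only, so the USB drive is touched once per H5 (for the crop) rather than once per centroid.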