Add dataset assistant handover document

otivm
2026-04-27 09:22:42 +00:00
parent 17c09adc54
commit e92e1bf980

docs/handover-dataset.md Normal file

@@ -0,0 +1,243 @@
# Handover — TESSERA Dataset Assistant
### Date: 2026-04-27
### For: Incoming dataset assistant
### Read this completely before doing anything
---
## 0. Your role
You are the dataset assistant. You own the pipeline that populates
`data/otivm.sqlite3` with physical-world data from the USB drives.
You do not touch game code, frontend, backend, or PM2.
The game development assistant works in parallel. They own `src/`,
`server/`, and everything the player sees. You own:
- `pipeline/` — all extraction and promotion scripts
- `data/create_otivm_db.sql` — the schema source of truth
- `data/staging_otivm.sqlite3` — your working database (never in git)
- `docs/` — dataset and pipeline documentation
You do not write to `data/otivm.sqlite3` directly. You write to
`data/staging_otivm.sqlite3`, verify, then copy to production on
explicit project owner approval.
---
## 1. Read these files before doing anything
In order:
1. `CLAUDE.md` — workflow, three-shell model, ground rules
2. `docs/TESSERA-dataset-registry.md` — every dataset evaluated,
triage decisions, drive inventory, what is on drives and what is not
3. `docs/RFC-TESSERA-4.0-001.md` — the database schema contract
4. `docs/RFC-TESSERA-3.0-PALEO-001.md` — paleo epoch table spec
5. `docs/TESSERA-pipeline-registry.md` — history of the old batch
pipeline, what completed, what failed, and why
6. This file
---
## 2. Current database state — as of 2026-04-27
### `data/otivm.sqlite3` — production
- 12,005 H9 rows across five waypoints, all `status=2` (current)
- All H5s at `status=2` in `h5_coverage`
- `paleo_epochs` table populated with 9 epochs per RFC-TESSERA-3.0-PALEO-001
- H3 IDs stored as INTEGER (64-bit)
### Five waypoints
| City | H5 TEXT | H9 cells |
|---|---|---|
| Ostia | `851e805bfffffff` | 2401 |
| Capua | `851e8333fffffff` | 2401 |
| Brundisium | `851e8ba3fffffff` | 2401 |
| Carthago | `85386e23fffffff` | 2401 |
| Alexandria | `853f5ba7fffffff` | 2401 |
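The H5 values above are shown as hex TEXT, while the database stores H3 IDs as 64-bit INTEGER. A minimal sketch of the conversion, using only the standard library, is below. The function names are illustrative and not part of the pipeline; the resolution check relies on the standard H3 index layout (resolution in bits 52-55), and the child count of 7 per resolution step holds for hexagonal cells only, not pentagons.

```python
# Illustrative helpers (hypothetical names, not existing pipeline code).
H3_RES_SHIFT = 52   # resolution occupies bits 52-55 of an H3 index
H3_RES_MASK = 0xF


def h3_text_to_int(h3_text: str) -> int:
    """Hex TEXT form (as in the waypoint table) -> 64-bit INTEGER form."""
    return int(h3_text, 16)


def h3_int_to_text(h3_int: int) -> str:
    """64-bit INTEGER form -> lowercase hex TEXT form."""
    return format(h3_int, "x")


def h3_resolution(h3_int: int) -> int:
    """Read the resolution digit straight from the index bits."""
    return (h3_int >> H3_RES_SHIFT) & H3_RES_MASK


def children_count(parent_res: int, child_res: int) -> int:
    """Cells at child_res inside one hexagonal cell at parent_res."""
    return 7 ** (child_res - parent_res)


ostia = h3_text_to_int("851e805bfffffff")
print(h3_resolution(ostia), children_count(5, 9))  # 5 2401
```

Five waypoints at 2401 H9 cells each gives the 12,005 rows reported for production.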
### Field status
| Field | Status | Source |
|---|---|---|
| `elev_cm` | ✅ Current | GEBCO 2025 |
| `terrain` | ✅ Current (modern only — see Section 4) | ESA WorldCover 2021 |
| `hydro` | ✅ Current | HydroSHEDS v1.1 |
| `geo_dep` | ✅ Current | USGS MRDS |
| `geo_flag` | ✅ Current | BGR IGME5000 |
| `occ_flag` | ⚠ Placeholder (0x00 everywhere) | Stage 06 not written |
### `data/staging_otivm.sqlite3`
Identical to production as of last session. Always reset from
production before starting a new pipeline run:
```
cp data/otivm.sqlite3 data/staging_otivm.sqlite3
```
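If the reset is ever scripted from Python rather than run as `cp`, one option is SQLite's online backup API, which copies a consistent snapshot and lets production be opened read-only. This is a sketch under that assumption, not an existing pipeline script; `reset_staging` is a hypothetical name.

```python
import sqlite3
from pathlib import Path


def reset_staging(production: Path, staging: Path) -> None:
    """Overwrite staging with a consistent snapshot of production."""
    # Remove any stale staging file so the copy starts from a clean slate.
    if staging.exists():
        staging.unlink()
    # Open production read-only so this helper can never modify it.
    src = sqlite3.connect(production.resolve().as_uri() + "?mode=ro", uri=True)
    dst = sqlite3.connect(staging)
    try:
        src.backup(dst)  # page-by-page copy via the SQLite backup API
    finally:
        dst.close()
        src.close()
```

Usage would be `reset_staging(Path("data/otivm.sqlite3"), Path("data/staging_otivm.sqlite3"))`, run from the repo root.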
---
## 3. USB drives — what is present
Both drives are mounted read-only at `/opt/data/` on every container.
Full inventory in `data/tessera_usb_inventory.txt`.
### Drive 1: TESSERA_APR26 (/dev/sdb1, 29GB, 21GB free)
| Dataset | Path | Size | Fields |
|---|---|---|---|
| GEBCO 2025 | `gebco/` | 6.8GB | `elev_cm` |
| HydroSHEDS v1.1 | `hydrosheds/` | 240MB | `hydro` |
| USGS MRDS | `mrds/mrds.csv` | 16MB | `geo_dep` |
### Drive 2: TESSERA_WORLD (/dev/sdd1, 29GB, 7GB free)
| Dataset | Path | Size | Fields |
|---|---|---|---|
| ESA WorldCover 2021 v200 | `worldcover/` | 22GB | `terrain` |
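A quick presence check for the four datasets listed above could look like the following sketch. It assumes the relative paths in the tables resolve directly under a single mount root such as `/opt/data/`; if each drive has its own subdirectory, adjust the mapping. `missing_datasets` is a hypothetical helper, not an existing script.

```python
from pathlib import Path

# Relative paths taken from the drive tables; the single-root layout is an assumption.
EXPECTED_PATHS = {
    "GEBCO 2025": "gebco",
    "HydroSHEDS v1.1": "hydrosheds",
    "USGS MRDS": "mrds/mrds.csv",
    "ESA WorldCover 2021 v200": "worldcover",
}


def missing_datasets(mount_root: Path, expected: dict = EXPECTED_PATHS) -> list:
    """Return the names of datasets whose path is absent under mount_root."""
    return [
        name
        for name, rel_path in expected.items()
        if not (mount_root / rel_path).exists()
    ]
```

An empty list from `missing_datasets(Path("/opt/data"))` means all four paths were found; it says nothing about file integrity.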
---
## 4. The restoration layer — critical concept
**`terrain` in the database is modern WorldCover 2021. It is wrong
for historical periods.**
WorldCover reflects 2021 land cover — cities, airports, drained
marshes, reservoirs. For all five OTIVM waypoints, the majority of
H9 cells within urban zones are classified as built-up or cropland.
In Roman times (14 BCE epoch) and Mesolithic times (8000 BCE epoch),
those same cells were overwhelmingly forested.
The Mediterranean basin was 60-70% forested in both periods. Today
it is not.
The restoration layer corrects this at query time using two datasets
not yet on the drives:
- **HYDE 3.3** — historical land use per epoch (what was actually there)
- **KK10** — potential natural vegetation (what would grow without humans)
Until these datasets are loaded and the restoration pipeline stage
is written, `terrain` is a modern snapshot, not a historical one.
The game development assistant has been informed. The game must not
present `terrain` values as historically accurate for any epoch
until the restoration layer is active.
**This is the most important pending pipeline work after the drive
additions are complete.**
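To make the query-time idea concrete, here is a deliberately simplified sketch. The precedence rule (historical land use first, then potential natural vegetation, then the modern value) is an assumption for illustration only; the actual rule belongs in the pipeline architecture document. The function name and the class labels are hypothetical.

```python
from typing import Optional, Tuple


def restored_terrain(
    modern: str,
    historical_land_use: Optional[str],
    potential_vegetation: Optional[str],
) -> Tuple[str, bool]:
    """Return (terrain, is_historical) for one H9 cell and one epoch.

    historical_land_use stands in for a HYDE-derived value and
    potential_vegetation for a KK10-derived value; None means no data.
    """
    if historical_land_use is not None:
        return historical_land_use, True   # assumed to take precedence
    if potential_vegetation is not None:
        return potential_vegetation, True
    # No restoration data: fall back to WorldCover and flag it as modern.
    return modern, False


print(restored_terrain("built_up", None, "forest"))  # ('forest', True)
print(restored_terrain("built_up", None, None))      # ('built_up', False)
```

The boolean flag is the part that matters for the game side: a `False` value means the terrain must not be presented as historically accurate.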
---
## 5. What is missing from the drives — priority additions
These four datasets must be downloaded and added to Drive 1 before
the per-H5 pipeline can be built. Total: ~5.2GB, fits in 21GB free.
| Priority | Dataset | Size | Why needed |
|---|---|---|---|
| 1 | BGR IGME5000 shapefile | ~200MB | `geo_flag` currently depends on live API — must be local |
| 2 | HYDE 3.3 historical land use | ~4GB | Restoration layer — required |
| 3 | KK10 potential natural vegetation | ~500MB | Restoration layer — required alongside HYDE |
| 4 | HydroRivers Europe + Africa | ~500MB | Accurate river placement for `hydro` |
Download sources in `docs/TESSERA-dataset-registry.md`.
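The space budget for the four additions can be checked with simple arithmetic. The sizes are the approximate figures from the table, taken in decimal units (1GB = 1000MB), which is an assumption.

```python
# Approximate sizes from the table above, in MB (decimal units assumed).
ADDITIONS_MB = {
    "BGR IGME5000 shapefile": 200,
    "HYDE 3.3 historical land use": 4000,
    "KK10 potential natural vegetation": 500,
    "HydroRivers Europe + Africa": 500,
}
DRIVE1_FREE_MB = 21_000  # Drive 1 free space as inventoried

total_mb = sum(ADDITIONS_MB.values())
remaining_mb = DRIVE1_FREE_MB - total_mb
print(f"additions: {total_mb / 1000:.1f}GB, remaining: {remaining_mb / 1000:.1f}GB")
# additions: 5.2GB, remaining: 15.8GB
```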
Drives are read-only when mounted. To add data:
1. Unmount from Proxmox host
2. Remount read-write on a machine with ext4 write access
3. Copy data
4. Remount read-only
5. Verify with inventory check before proceeding
**Do not begin pipeline design until all four additions are on Drive 1.**
---
## 6. The per-H5 pipeline — not yet built
The new pipeline replaces the old batch pipeline entirely. Key facts:
- Processes one H5 hex at a time
- Reads all data from USB drives (no live API calls)
- Writes to `staging_otivm.sqlite3` only
- Follows RFC-TESSERA-4.0-001 pipeline contract:
draft → validate → promote → copy to production
- Manually triggered with project owner approval
- Supersede support built in — can update existing H5 rows when
a source dataset improves
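The step order and the approval gate can be expressed as a small guard. This sketch encodes only the order stated above; it does not reproduce RFC-TESSERA-4.0-001, and `next_step` is a hypothetical helper.

```python
from typing import List, Optional

# Step order as stated in this handover; the RFC remains the authority.
PIPELINE_STEPS = ("draft", "validate", "promote", "copy_to_production")


def next_step(completed: List[str], owner_approved: bool = False) -> Optional[str]:
    """Return the next step for one H5, or None when all steps are done."""
    # Completed steps must be an in-order prefix of the pipeline.
    if completed != list(PIPELINE_STEPS[: len(completed)]):
        raise ValueError(f"steps out of order: {completed}")
    if len(completed) == len(PIPELINE_STEPS):
        return None
    step = PIPELINE_STEPS[len(completed)]
    if step == "copy_to_production" and not owner_approved:
        raise PermissionError("copy to production needs project owner approval")
    return step
```

For example, `next_step(["draft"])` returns `"validate"`, while `next_step(["draft", "validate", "promote"])` raises until `owner_approved=True` is passed.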
### Read strategy — mandatory
Always crop raster to H5 bounding box before sampling. Load the crop
into a numpy array in RAM. Sample all 2401 H9 centroids from the
array. Never seek 2401 individual points from USB.
Without this: GEBCO reads at ~25s per H5 (USB random seek speed).
With this: GEBCO reads at ~1-2s per H5 (one sequential crop + RAM).
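The index arithmetic behind this strategy is sketched below in plain Python. It assumes a north-up raster with square pixels in degrees and its origin at the top-left corner. In the real pipeline the crop would presumably be one windowed read (for example with rasterio) into a numpy array; here `crop` is any 2-D structure indexable as `crop[row][col]`. Both function names are hypothetical.

```python
import math


def crop_window(bounds, origin_lon, origin_lat, pixel_deg):
    """Pixel window (row_off, col_off, height, width) covering a lon/lat bbox.

    bounds is (west, south, east, north) of the H5 bounding box.
    """
    west, south, east, north = bounds
    col_off = math.floor((west - origin_lon) / pixel_deg)
    row_off = math.floor((origin_lat - north) / pixel_deg)
    col_end = math.ceil((east - origin_lon) / pixel_deg)
    row_end = math.ceil((origin_lat - south) / pixel_deg)
    return row_off, col_off, row_end - row_off, col_end - col_off


def sample_from_crop(crop, window, points, origin_lon, origin_lat, pixel_deg):
    """Sample every (lon, lat) point from the in-memory crop, nearest pixel."""
    row_off, col_off, height, width = window
    values = []
    for lon, lat in points:
        row = math.floor((origin_lat - lat) / pixel_deg) - row_off
        col = math.floor((lon - origin_lon) / pixel_deg) - col_off
        # Clamp points that fall exactly on the south or east edge.
        row = min(max(row, 0), height - 1)
        col = min(max(col, 0), width - 1)
        values.append(crop[row][col])
    return values
```

The drive is touched once per H5 (the crop read); all 2401 centroid lookups then run against RAM.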
### RAM allocation
- Baseline container RAM: 2GB
- Pipeline mode: 24GB (non-essential containers suspended on dev box)
- Relevant tile sizes: GEBCO tile ~891MB, WorldCover tile ~100MB
- In-memory strategy: load relevant tiles at pipeline start,
release at end
- Three Proxmox boxes: dev (pipeline work), staging (validation),
production (live game) — transfer via WireGuard mesh
### Python venv
- Path: `/home/otivm/pipeline-venv`
- Packages: h3, requests, numpy, rasterio, shapely, pyproj
- Do not use `/home/otivm/venv` — that belongs to the game assistant
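A pipeline script could refuse to start from the wrong environment with a check along these lines. This is a sketch; `in_pipeline_venv` is a hypothetical helper, and the comparison does not resolve symlinks.

```python
import sys
from pathlib import Path

PIPELINE_VENV = Path("/home/otivm/pipeline-venv")


def in_pipeline_venv(prefix: str = sys.prefix) -> bool:
    """True when the interpreter prefix is the dataset pipeline venv."""
    return Path(prefix) == PIPELINE_VENV


if __name__ == "__main__" and not in_pipeline_venv():
    print(f"warning: running from {sys.prefix}, expected {PIPELINE_VENV}")
```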
### Pipeline scripts (committed, not yet functional for new pipeline)
- `pipeline/seed_extract.py` — old Dell-based extractor, do not re-run
- `pipeline/seed_promote.py` — old promotion script, do not re-run
- New per-H5 scripts to be written after drive additions complete
---
## 7. Infrastructure
### OTIVM container (CT 1105, proliant-dev, 10.0.0.23)
- App user: `otivm`
- Repo: `/home/otivm/OTIVM`
- Pipeline venv: `/home/otivm/pipeline-venv`
- Production DB: `data/otivm.sqlite3`
- Staging DB: `data/staging_otivm.sqlite3` (not in git)
- Claude Code runs here as `otivm` via `work` alias
### Three Proxmox boxes
- **proliant-dev (srv-a, 10.0.0.11)** — development and pipeline work
- **staging box** — validation before production
- **production box** — live game, never touched by pipeline directly
### Gitea
- Repo: `https://gitea.barternetwork.us/TheRON/OTIVM`
- Branch: `main`
- MCP: `mcp.civicus.us` — read any file directly from Claude chat
---
## 8. Hard rules
- Never write to `data/otivm.sqlite3` directly — always via staging
- Never commit `*.sqlite3` files — both databases are gitignored
- Never run pipeline without project owner approval and supervision
- Never modify `tessera.db` — it no longer exists (Dell decommissioned)
- Never touch game code (`src/`, `server/`, `public/`)
- Read `TESSERA-dataset-registry.md` before evaluating any new source
- One file at a time. One confirmation before proceeding.
- Do not start pipeline coding without explicit project owner instruction
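The first rule can be backed by a technical guard: any script that only needs to read production can open it through a read-only SQLite URI, so an accidental write fails instead of landing. A minimal sketch, with `open_production_read_only` as a hypothetical helper name:

```python
import sqlite3
from pathlib import Path


def open_production_read_only(db_path: Path) -> sqlite3.Connection:
    """Open a SQLite database so that any write attempt raises an error."""
    uri = db_path.resolve().as_uri() + "?mode=ro"
    return sqlite3.connect(uri, uri=True)
```

With this connection, a statement such as `INSERT` or `CREATE TABLE` raises `sqlite3.OperationalError`; reads behave as normal.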
---
## 9. Pending work — in order
1. **Drive additions** — project owner downloads and mounts four datasets
2. **Pipeline architecture document** — design before any code
3. **Per-H5 pipeline scripts** — one file at a time, supervised
4. **Restoration layer** — HYDE + KK10 integration into terrain field
5. **Stage 06 (occ_flag)** — archaeological sources, deferred until
simulation track begins
---
*Handover 2026-04-27 — dataset assistant track*
*Database seeded, paleo_epochs added, drives inventoried.*
*Pipeline not yet built. Drive additions required first.*
*The restoration layer is the most important pending concept.*