# Handover — TESSERA Dataset Assistant
### Date: 2026-04-27
### For: Incoming dataset assistant
### Read this completely before doing anything
---
## 0. Your role
You are the dataset assistant. You own the pipeline that populates
`data/otivm.sqlite3` with physical-world data from the USB drives.
You do not touch game code, frontend, backend, or PM2.
The game development assistant works in parallel. They own `src/`,
`server/`, and everything the player sees. You own:
- `pipeline/` — all extraction and promotion scripts
- `data/create_otivm_db.sql` — the schema source of truth
- `data/staging_otivm.sqlite3` — your working database (never in git)
- `docs/` — dataset and pipeline documentation
You do not write to `data/otivm.sqlite3` directly. You write to
`data/staging_otivm.sqlite3`, verify, then copy to production on
explicit project owner approval.
---
## 1. Read these files before doing anything
In order:
1. `CLAUDE.md` — workflow, three-shell model, ground rules
2. `docs/TESSERA-dataset-registry.md` — every dataset evaluated,
triage decisions, drive inventory, what is on drives and what is not
3. `docs/RFC-TESSERA-4.0-001.md` — the database schema contract
4. `docs/RFC-TESSERA-3.0-PALEO-001.md` — paleo epoch table spec
5. `docs/TESSERA-pipeline-registry.md` — history of the old batch
pipeline, what completed, what failed, and why
6. This file
---
## 2. Current database state — as of 2026-04-27
### `data/otivm.sqlite3` — production
- 12,005 H9 rows across five waypoints, all `status=2` (current)
- All H5s at `status=2` in `h5_coverage`
- `paleo_epochs` table populated with 9 epochs per RFC-TESSERA-3.0-PALEO-001
- H3 IDs stored as INTEGER (64-bit)
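The "all H5s at `status=2`" claim can be re-checked before each session. A minimal sketch, assuming `h5_coverage` has a `status` column as described above; the function name is hypothetical and the rest of the table layout is not assumed:

```python
import sqlite3

CURRENT_STATUS = 2  # the status value this handover calls "current"


def non_current_h5_count(conn: sqlite3.Connection) -> int:
    """Count h5_coverage rows whose status is anything other than current.

    IS NOT is used instead of != so that NULL statuses are counted too.
    """
    row = conn.execute(
        "SELECT COUNT(*) FROM h5_coverage WHERE status IS NOT ?",
        (CURRENT_STATUS,),
    ).fetchone()
    return row[0]
```

To run it against production without any risk of writing, the database can be opened read-only with `sqlite3.connect("file:data/otivm.sqlite3?mode=ro", uri=True)`. A result of 0 matches the state described here.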
### Five waypoints
| City | H5 TEXT | H9 cells |
|---|---|---|
| Ostia | `851e805bfffffff` | 2401 |
| Capua | `851e8333fffffff` | 2401 |
| Brundisium | `851e8ba3fffffff` | 2401 |
| Carthago | `85386e23fffffff` | 2401 |
| Alexandria | `853f5ba7fffffff` | 2401 |
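The table lists H5 IDs in text form, while the database stores them as 64-bit integers. A short sketch of the conversion and of the row-count arithmetic, assuming H5 and H9 denote H3 resolutions 5 and 9 (the `85` prefix of the IDs is consistent with this) and that each hexagon has 7 children per resolution step; the helper names are hypothetical:

```python
# H5 IDs copied from the waypoint table above.
H5_WAYPOINTS = {
    "Ostia": "851e805bfffffff",
    "Capua": "851e8333fffffff",
    "Brundisium": "851e8ba3fffffff",
    "Carthago": "85386e23fffffff",
    "Alexandria": "853f5ba7fffffff",
}


def h3_text_to_int(h3_text: str) -> int:
    """Convert the hexadecimal text form of an H3 ID to its integer form."""
    return int(h3_text, 16)


def h3_int_to_text(h3_int: int) -> str:
    """Convert an integer H3 ID back to lowercase hexadecimal text."""
    return format(h3_int, "x")


# Four resolution steps (5 -> 9), 7 child hexagons per step.
H9_PER_H5 = 7 ** (9 - 5)
TOTAL_H9_ROWS = H9_PER_H5 * len(H5_WAYPOINTS)
```

Here `H9_PER_H5` evaluates to 2401 and `TOTAL_H9_ROWS` to 12005, matching the figures above. Every ID in the table fits in a signed 64-bit integer, which is what SQLite's INTEGER type holds.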
### Field status
| Field | Status | Source |
|---|---|---|
| `elev_cm` | ✅ Current | GEBCO 2025 |
| `terrain` | ✅ Current (modern only — see Section 4) | ESA WorldCover 2021 |
| `hydro` | ✅ Current | HydroSHEDS v1.1 |
| `geo_dep` | ✅ Current | USGS MRDS |
| `geo_flag` | ✅ Current | BGR IGME5000 |
| `occ_flag` | ⚠ Placeholder (0x00 everywhere) | Stage 06 not written |
### `data/staging_otivm.sqlite3`
Identical to production as of last session. Always reset from
production before starting a new pipeline run:
```
cp data/otivm.sqlite3 data/staging_otivm.sqlite3
```
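If the reset is scripted rather than typed, the copy can be verified before the run starts. A minimal sketch, assuming a plain file copy is sufficient (as the `cp` command above implies); the function names are hypothetical:

```python
import hashlib
import shutil


def sha256_of(path: str) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def reset_staging(production: str, staging: str) -> str:
    """Overwrite staging with a copy of production and confirm they match.

    Returns the shared digest, or raises if the copy differs from the source.
    """
    shutil.copyfile(production, staging)
    source_digest = sha256_of(production)
    if sha256_of(staging) != source_digest:
        raise RuntimeError("staging copy does not match production")
    return source_digest
```

Called as `reset_staging("data/otivm.sqlite3", "data/staging_otivm.sqlite3")`, this only reads production and only writes staging, in line with the ownership rules in Section 0.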
---
## 3. USB drives — what is present
Both drives are mounted read-only at `/opt/data/` in every container.
Full inventory in `data/tessera_usb_inventory.txt`.
### Drive 1: TESSERA_APR26 (/dev/sdb1, 29GB, 21GB free)
| Dataset | Path | Size | Fields |
|---|---|---|---|
| GEBCO 2025 | `gebco/` | 6.8GB | `elev_cm` |
| HydroSHEDS v1.1 | `hydrosheds/` | 240MB | `hydro` |
| USGS MRDS | `mrds/mrds.csv` | 16MB | `geo_dep` |
### Drive 2: TESSERA_WORLD (/dev/sdd1, 29GB, 7GB free)
| Dataset | Path | Size | Fields |
|---|---|---|---|
| ESA WorldCover 2021 v200 | `worldcover/` | 22GB | `terrain` |
---
## 4. The restoration layer — critical concept
**`terrain` in the database is modern WorldCover 2021. It is wrong
for historical periods.**
WorldCover reflects 2021 land cover — cities, airports, drained
marshes, reservoirs. For all five OTIVM waypoints, the majority of
H9 cells within urban zones are classified as built-up or cropland.
In Roman times (14 BCE epoch) and Mesolithic times (8000 BCE epoch),
those same cells were overwhelmingly forested.
The Mediterranean basin was 60 to 70% forested in both periods. Today
it is not.
The restoration layer corrects this at query time using two datasets
not yet on the drives:
- **HYDE 3.3** — historical land use per epoch (what was actually there)
- **KK10** — potential natural vegetation (what would grow without humans)
Until these datasets are loaded and the restoration pipeline stage
is written, `terrain` is a modern snapshot, not a historical one.
The game development assistant has been informed. The game must not
present `terrain` values as historically accurate for any epoch
until the restoration layer is active.
**This is the most important pending pipeline work after the drive
additions are complete.**
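This handover does not specify how the two datasets are combined. As an illustration only, here is a sketch of one plausible rule, assuming HYDE takes precedence wherever it records human land use for the epoch and KK10 fills in everywhere else. The function name and the class labels in the usage note are hypothetical:

```python
from typing import Optional


def restore_terrain(
    hyde_land_use: Optional[str],
    kk10_vegetation: Optional[str],
) -> Optional[str]:
    """Pick a historical terrain class for one H9 cell in one epoch.

    hyde_land_use:   land use HYDE records for the epoch, or None if the
                     cell was not under human use (assumed encoding).
    kk10_vegetation: potential natural vegetation from KK10, or None if
                     the cell has no KK10 value (assumed encoding).

    Returns None when neither source covers the cell, so the caller can
    tell "no restoration available" apart from a restored value.
    """
    if hyde_land_use is not None:
        return hyde_land_use
    return kk10_vegetation
```

Under this rule, `restore_terrain(None, "forest")` returns `"forest"` for a cell that WorldCover now classifies as built-up. The modern value is deliberately not an input, so it can never be returned as a historical one.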
---
## 5. What is missing from the drives — priority additions
These four datasets must be downloaded and added to Drive 1 before
the per-H5 pipeline can be built. Total: ~5.2GB, fits in 21GB free.
| Priority | Dataset | Size | Why needed |
|---|---|---|---|
| 1 | BGR IGME5000 shapefile | ~200MB | `geo_flag` currently depends on live API — must be local |
| 2 | HYDE 3.3 historical land use | ~4GB | Restoration layer — required |
| 3 | KK10 potential natural vegetation | ~500MB | Restoration layer — required alongside HYDE |
| 4 | HydroRivers Europe + Africa | ~500MB | Accurate river placement for `hydro` |
Download sources in `docs/TESSERA-dataset-registry.md`.
Drives are read-only when mounted. To add data:
1. Unmount from Proxmox host
2. Remount read-write on a machine with ext4 write access
3. Copy data
4. Remount read-only
5. Verify with inventory check before proceeding
**Do not begin pipeline design until all four additions are on Drive 1.**
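The inventory check in step 5 can be as simple as confirming that every expected dataset path exists under the mount point. A minimal sketch; the function name is hypothetical, and the layout of each drive beneath `/opt/data/` is an assumption that should be checked against `data/tessera_usb_inventory.txt`:

```python
from pathlib import Path
from typing import Iterable, List


def missing_paths(root: str, expected: Iterable[str]) -> List[str]:
    """Return the expected relative paths that do not exist under root.

    An empty list means every expected file or directory is present.
    Order follows the order of the expected argument.
    """
    base = Path(root)
    return [rel for rel in expected if not (base / rel).exists()]
```

For Drive 1 as currently described, the expected list would include `gebco/`, `hydrosheds/` and `mrds/mrds.csv`, plus whatever directory names are chosen for the four additions.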
---
## 6. The per-H5 pipeline — not yet built
The new pipeline replaces the old batch pipeline entirely. Key facts:
- Processes one H5 hex at a time
- Reads all data from USB drives (no live API calls)
- Writes to `staging_otivm.sqlite3` only
- Follows RFC-TESSERA-4.0-001 pipeline contract:
draft → validate → promote → copy to production
- Manually triggered with project owner approval
- Supersede support built in — can update existing H5 rows when
a source dataset improves
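The stage order and the approval gate can be made explicit in a small driver. The following is a sketch only: the real contract is defined in RFC-TESSERA-4.0-001, and the stage callables, the `owner_approved` flag and the function name are hypothetical placeholders rather than existing pipeline code:

```python
from typing import Callable, Dict, List

# Stage order as listed in the pipeline contract bullet above.
STAGES = ("draft", "validate", "promote", "copy_to_production")


def run_contract(
    steps: Dict[str, Callable[[], bool]],
    owner_approved: bool,
) -> List[str]:
    """Run the stages in order and return the names of those that completed.

    Stops at the first stage that returns False. The final copy to
    production is skipped unless owner_approved is True.
    """
    completed: List[str] = []
    for name in STAGES:
        if name == "copy_to_production" and not owner_approved:
            break
        if not steps[name]():
            break
        completed.append(name)
    return completed
```

Because the approval check sits in the driver rather than in the copy step, a run without approval still exercises draft, validate and promote against staging and stops before production is touched.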
### Read strategy — mandatory
Always crop raster to H5 bounding box before sampling. Load the crop
into a numpy array in RAM. Sample all 2401 H9 centroids from the
array. Never seek 2401 individual points from USB.
Without this: GEBCO reads at ~25s per H5 (USB random seek speed).
With this: GEBCO reads at ~1-2s per H5 (one sequential crop + RAM).
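The in-RAM sampling half of this strategy can be sketched with numpy alone. The crop itself would come from a single windowed read of the raster (for example with rasterio), which is not shown here. The sketch assumes a north-up crop with square pixels; the function and parameter names are hypothetical:

```python
import numpy as np


def sample_from_crop(crop, west, north, pixel_size, lons, lats):
    """Nearest-pixel lookup of many points from one in-memory raster crop.

    crop:       2D array, row 0 at the northern edge.
    west/north: coordinates of the outer corner of pixel [0, 0].
    pixel_size: pixel width and height in the same units (e.g. degrees).
    lons/lats:  point coordinates, e.g. the 2401 H9 centroids of one H5.
    """
    lons = np.asarray(lons, dtype=float)
    lats = np.asarray(lats, dtype=float)
    cols = np.floor((lons - west) / pixel_size).astype(int)
    rows = np.floor((north - lats) / pixel_size).astype(int)
    inside = (
        (rows >= 0) & (rows < crop.shape[0])
        & (cols >= 0) & (cols < crop.shape[1])
    )
    if not inside.all():
        # Fail loudly rather than return values from the wrong pixels.
        raise ValueError("some points fall outside the crop")
    # One vectorised index operation replaces thousands of USB seeks.
    return crop[rows, cols]
```

All centroids of an H5 are resolved with a single array indexing call, so the only USB access is the sequential read that produced `crop`. A margin around the H5 bounding box helps keep edge centroids inside the crop.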
### RAM allocation
- Baseline container RAM: 2GB
- Pipeline mode: 24GB (non-essential containers suspended on dev box)
- Relevant tile sizes: GEBCO tile ~891MB, WorldCover tile ~100MB
- In-memory strategy: load relevant tiles at pipeline start,
release at end
- Three Proxmox boxes: dev (pipeline work), staging (validation),
production (live game) — transfer via WireGuard mesh
### Python venv
- Path: `/home/otivm/pipeline-venv`
- Packages: h3, requests, numpy, rasterio, shapely, pyproj
- Do not use `/home/otivm/venv` — that belongs to the game assistant
### Pipeline scripts (committed, not yet functional for new pipeline)
- `pipeline/seed_extract.py` — old Dell-based extractor, do not re-run
- `pipeline/seed_promote.py` — old promotion script, do not re-run
- New per-H5 scripts to be written after drive additions complete
---
## 7. Infrastructure
### OTIVM container (CT 1105, proliant-dev, 10.0.0.23)
- App user: `otivm`
- Repo: `/home/otivm/OTIVM`
- Pipeline venv: `/home/otivm/pipeline-venv`
- Production DB: `data/otivm.sqlite3`
- Staging DB: `data/staging_otivm.sqlite3` (not in git)
- Claude Code runs here as `otivm` via `work` alias
### Three Proxmox boxes
- **proliant-dev (srv-a, 10.0.0.11)** — development and pipeline work
- **staging box** — validation before production
- **production box** — live game, never touched by pipeline directly
### Gitea
- Repo: `https://gitea.barternetwork.us/TheRON/OTIVM`
- Branch: `main`
- MCP: `mcp.civicus.us` — read any file directly from Claude chat
---
## 8. Hard rules
- Never write to `data/otivm.sqlite3` directly — always via staging
- Never commit `*.sqlite3` files — both databases are gitignored
- Never run pipeline without project owner approval and supervision
- Never modify `tessera.db` — it no longer exists (Dell decommissioned)
- Never touch game code (`src/`, `server/`, `public/`)
- Read `TESSERA-dataset-registry.md` before evaluating any new source
- One file at a time. One confirmation before proceeding.
- Do not start pipeline coding without explicit project owner instruction
---
## 9. Pending work — in order
1. **Drive additions** — project owner downloads the four datasets and adds them to Drive 1
2. **Pipeline architecture document** — design before any code
3. **Per-H5 pipeline scripts** — one file at a time, supervised
4. **Restoration layer** — HYDE + KK10 integration into terrain field
5. **Stage 06 (occ_flag)** — archaeological sources, deferred until
simulation track begins
---
*Handover 2026-04-27 — dataset assistant track*
*Database seeded, paleo_epochs added, drives inventoried.*
*Pipeline not yet built. Drive additions required first.*
*The restoration layer is the most important pending concept.*