Add dataset assistant handover document
docs/handover-dataset.md
# Handover — TESSERA Dataset Assistant
### Date: 2026-04-27
### For: Incoming dataset assistant
### Read this completely before doing anything

---

## 0. Your role

You are the dataset assistant. You own the pipeline that populates
`data/otivm.sqlite3` with physical-world data from the USB drives.
You do not touch game code, frontend, backend, or PM2.

The game development assistant works in parallel. They own `src/`,
`server/`, and everything the player sees. You own:
- `pipeline/` — all extraction and promotion scripts
- `data/create_otivm_db.sql` — the schema source of truth
- `data/staging_otivm.sqlite3` — your working database (never in git)
- `docs/` — dataset and pipeline documentation

You do not write to `data/otivm.sqlite3` directly. You write to
`data/staging_otivm.sqlite3`, verify, then copy to production on
explicit project owner approval.

---

## 1. Read these files before doing anything

In order:

1. `CLAUDE.md` — workflow, three-shell model, ground rules
2. `docs/TESSERA-dataset-registry.md` — every dataset evaluated,
   triage decisions, drive inventory, what is on drives and what is not
3. `docs/RFC-TESSERA-4.0-001.md` — the database schema contract
4. `docs/RFC-TESSERA-3.0-PALEO-001.md` — paleo epoch table spec
5. `docs/TESSERA-pipeline-registry.md` — history of the old batch
   pipeline, what completed, what failed, and why
6. This file

---

## 2. Current database state — as of 2026-04-27

### `data/otivm.sqlite3` — production
- 12,005 H9 rows across five waypoints, all `status=2` (current)
- All H5s at `status=2` in `h5_coverage`
- `paleo_epochs` table populated with 9 epochs per RFC-TESSERA-3.0-PALEO-001
- H3 IDs stored as INTEGER (64-bit)

### Five waypoints
| City | H5 TEXT | H9 cells |
|---|---|---|
| Ostia | `851e805bfffffff` | 2401 |
| Capua | `851e8333fffffff` | 2401 |
| Brundisium | `851e8ba3fffffff` | 2401 |
| Carthago | `85386e23fffffff` | 2401 |
| Alexandria | `853f5ba7fffffff` | 2401 |
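
The relationship between the TEXT form in the table, the INTEGER form stored in the database, and the row counts can be checked with plain arithmetic. The sketch below uses only the standard library; the function names are illustrative and are not part of the pipeline. It relies on two properties of H3 indexes: the resolution sits in bits 52 to 55 of the 64-bit value, and each hexagonal cell has 7 children per resolution step.

```python
# Illustrative helpers (hypothetical names, not pipeline code).
H5_WAYPOINTS = {
    "Ostia": "851e805bfffffff",
    "Capua": "851e8333fffffff",
    "Brundisium": "851e8ba3fffffff",
    "Carthago": "85386e23fffffff",
    "Alexandria": "853f5ba7fffffff",
}


def h3_text_to_int(cell: str) -> int:
    """Parse the hexadecimal TEXT form of an H3 index into an integer."""
    return int(cell, 16)


def h3_resolution(cell_int: int) -> int:
    """Extract the resolution, which occupies bits 52-55 of the index."""
    return (cell_int >> 52) & 0xF


def children_per_cell(parent_res: int, child_res: int) -> int:
    """Count descendants of a hexagonal cell: 7 per resolution step."""
    return 7 ** (child_res - parent_res)


# Every waypoint ID should decode as a resolution-5 cell.
for city, text in H5_WAYPOINTS.items():
    assert h3_resolution(h3_text_to_int(text)) == 5, city

# 7**4 H9 cells per H5, times five waypoints.
h9_per_h5 = children_per_cell(5, 9)
total_h9 = len(H5_WAYPOINTS) * h9_per_h5
```

Here `h9_per_h5` evaluates to 2401 and `total_h9` to 12,005, matching the production row count. The top bit of an H3 index is zero, so the value fits in SQLite's signed 64-bit INTEGER.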

### Field status
| Field | Status | Source |
|---|---|---|
| `elev_cm` | ✅ Current | GEBCO 2025 |
| `terrain` | ✅ Current (modern only — see Section 4) | ESA WorldCover 2021 |
| `hydro` | ✅ Current | HydroSHEDS v1.1 |
| `geo_dep` | ✅ Current | USGS MRDS |
| `geo_flag` | ✅ Current | BGR IGME5000 |
| `occ_flag` | ⚠ Placeholder (0x00 everywhere) | Stage 06 not written |

### `data/staging_otivm.sqlite3`
Identical to production as of last session. Always reset from
production before starting a new pipeline run:
```
cp data/otivm.sqlite3 data/staging_otivm.sqlite3
```
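
If the reset is scripted rather than typed, it may be worth confirming that the copy matches production before any pipeline stage runs. The following is a minimal sketch using only the standard library. The function names are hypothetical, and it assumes no other process is writing to production during the copy. It compares row counts for every table found in `sqlite_master`, so it does not depend on any table name.

```python
import shutil
import sqlite3
from contextlib import closing
from pathlib import Path


def table_counts(db_path: Path) -> dict[str, int]:
    """Return {table_name: row_count} for every user table, opened read-only."""
    uri = db_path.resolve().as_uri() + "?mode=ro"
    with closing(sqlite3.connect(uri, uri=True)) as conn:
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master "
                "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
            )
        ]
        return {
            name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names
        }


def reset_staging(production: Path, staging: Path) -> dict[str, int]:
    """Overwrite staging with a copy of production and verify row counts."""
    shutil.copyfile(production, staging)
    prod_counts = table_counts(production)
    staging_counts = table_counts(staging)
    if prod_counts != staging_counts:
        raise RuntimeError(
            f"staging differs from production: {prod_counts} != {staging_counts}"
        )
    return staging_counts
```

A call such as `reset_staging(Path("data/otivm.sqlite3"), Path("data/staging_otivm.sqlite3"))` from the repo root would perform the same copy as the `cp` command and return the per-table counts for a quick visual check.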

---

## 3. USB drives — what is present

Both drives mounted read-only at `/opt/data/` on every container.
Full inventory in `data/tessera_usb_inventory.txt`.

### Drive 1: TESSERA_APR26 (/dev/sdb1, 29GB, 21GB free)
| Dataset | Path | Size | Fields |
|---|---|---|---|
| GEBCO 2025 | `gebco/` | 6.8GB | `elev_cm` |
| HydroSHEDS v1.1 | `hydrosheds/` | 240MB | `hydro` |
| USGS MRDS | `mrds/mrds.csv` | 16MB | `geo_dep` |

### Drive 2: TESSERA_WORLD (/dev/sdd1, 29GB, 7GB free)
| Dataset | Path | Size | Fields |
|---|---|---|---|
| ESA WorldCover 2021 v200 | `worldcover/` | 22GB | `terrain` |

---

## 4. The restoration layer — critical concept

**`terrain` in the database is modern WorldCover 2021. It is wrong
for historical periods.**

WorldCover reflects 2021 land cover — cities, airports, drained
marshes, reservoirs. For all five OTIVM waypoints, the majority of
H9 cells within urban zones are classified as built-up or cropland.
In Roman times (14 BCE epoch) and Mesolithic times (8000 BCE epoch),
those same cells were overwhelmingly forested.

The Mediterranean basin was 60–70% forested in both periods. Today
it is not.

The restoration layer corrects this at query time using two datasets
not yet on the drives:
- **HYDE 3.3** — historical land use per epoch (what was actually there)
- **KK10** — potential natural vegetation (what would grow without humans)

Until these datasets are loaded and the restoration pipeline stage
is written, `terrain` is a modern snapshot, not a historical one.
The game development assistant has been informed. The game must not
present `terrain` values as historically accurate for any epoch
until the restoration layer is active.
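
The restoration stage is not designed yet, so the following is only one possible shape for the query-time lookup, not the agreed method. It assumes a simple precedence: a HYDE value for the epoch wins, a KK10 value is the fallback where HYDE has nothing, and the modern WorldCover value is returned only as a last resort and is flagged as not historically valid. All names, the string-valued classes, and the precedence rule itself are assumptions for illustration.

```python
from typing import NamedTuple, Optional


class TerrainResult(NamedTuple):
    value: str                 # terrain class to present
    source: str                # which dataset supplied it
    historically_valid: bool   # False means "modern snapshot only"


def restored_terrain(
    modern: str,
    hyde_land_use: Optional[str],
    kk10_vegetation: Optional[str],
) -> TerrainResult:
    """Pick a terrain value for one cell and one epoch (assumed precedence)."""
    if hyde_land_use is not None:
        # Historical land use recorded for this epoch.
        return TerrainResult(hyde_land_use, "HYDE", True)
    if kk10_vegetation is not None:
        # No recorded land use: fall back to potential natural vegetation.
        return TerrainResult(kk10_vegetation, "KK10", True)
    # Neither dataset loaded: modern value, explicitly marked as unrestored.
    return TerrainResult(modern, "WorldCover", False)
```

Until HYDE and KK10 are on the drives, every lookup under this sketch would take the last branch, which is the state described above: the game can read `terrain` but must treat `historically_valid` as false.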

**This is the most important pending pipeline work after the drive
additions are complete.**

---

## 5. What is missing from the drives — priority additions

These four datasets must be downloaded and added to Drive 1 before
the per-H5 pipeline can be built. Total: ~5.2GB, fits in 21GB free.

| Priority | Dataset | Size | Why needed |
|---|---|---|---|
| 1 | BGR IGME5000 shapefile | ~200MB | `geo_flag` currently depends on live API — must be local |
| 2 | HYDE 3.3 historical land use | ~4GB | Restoration layer — required |
| 3 | KK10 potential natural vegetation | ~500MB | Restoration layer — required alongside HYDE |
| 4 | HydroRivers Europe + Africa | ~500MB | Accurate river placement for `hydro` |

Download sources in `docs/TESSERA-dataset-registry.md`.

Drives are read-only when mounted. To add data:
1. Unmount from Proxmox host
2. Remount read-write on a machine with ext4 write access
3. Copy data
4. Remount read-only
5. Verify with inventory check before proceeding
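
The capacity figure and the final verification step can both be expressed as small checks. This is a sketch with hypothetical helper names. The sizes are the approximate values from the table above, and the directory names for the new datasets are deliberately not guessed, so the path check is shown only with paths already documented for Drive 1.

```python
from pathlib import Path

# Approximate sizes from the priority table, in MB.
PLANNED_ADDITIONS_MB = {
    "BGR IGME5000 shapefile": 200,
    "HYDE 3.3 historical land use": 4000,
    "KK10 potential natural vegetation": 500,
    "HydroRivers Europe + Africa": 500,
}
DRIVE1_FREE_MB = 21_000

# Paths already documented for Drive 1, relative to its mount point.
DRIVE1_KNOWN_PATHS = ["gebco", "hydrosheds", "mrds/mrds.csv"]


def additions_fit(additions_mb: dict[str, int], free_mb: int) -> tuple[int, bool]:
    """Return (total size in MB, whether it fits in the free space)."""
    total = sum(additions_mb.values())
    return total, total <= free_mb


def missing_paths(mount_root: Path, expected: list[str]) -> list[str]:
    """List expected relative paths that do not exist under mount_root."""
    return [rel for rel in expected if not (mount_root / rel).exists()]


total_mb, fits = additions_fit(PLANNED_ADDITIONS_MB, DRIVE1_FREE_MB)
```

`total_mb` comes to 5200, which is the ~5.2GB quoted above. After remounting, `missing_paths` can be pointed at wherever Drive 1 is mounted; an empty list means every expected path is present.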

**Do not begin pipeline design until all four additions are on Drive 1.**

---

## 6. The per-H5 pipeline — not yet built

The new pipeline replaces the old batch pipeline entirely. Key facts:

- Processes one H5 hex at a time
- Reads all data from USB drives (no live API calls)
- Writes to `staging_otivm.sqlite3` only
- Follows RFC-TESSERA-4.0-001 pipeline contract:
  draft → validate → promote → copy to production
- Manually triggered with project owner approval
- Supersede support built in — can update existing H5 rows when
  a source dataset improves
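
The stage order in the contract can be modeled as a strictly linear sequence with an approval gate on the last step. The sketch below is an illustration only; RFC-TESSERA-4.0-001 remains the authority, and the class and function names here are hypothetical.

```python
from enum import Enum


class Stage(Enum):
    DRAFT = "draft"
    VALIDATE = "validate"
    PROMOTE = "promote"
    COPY_TO_PRODUCTION = "copy to production"


# Enum members iterate in definition order, which is the contract order.
STAGE_ORDER = list(Stage)


def next_stage(current: Stage, owner_approved: bool = False) -> Stage:
    """Return the stage after `current`; no skipping, no going back."""
    index = STAGE_ORDER.index(current)
    if index == len(STAGE_ORDER) - 1:
        raise ValueError("already at the final stage")
    following = STAGE_ORDER[index + 1]
    if following is Stage.COPY_TO_PRODUCTION and not owner_approved:
        # Production is only written on explicit project owner approval.
        raise PermissionError("copy to production requires owner approval")
    return following
```

Keeping the transition logic in one function means a per-H5 run cannot reach production by accident: the only path to `COPY_TO_PRODUCTION` passes through `next_stage(Stage.PROMOTE, owner_approved=True)`.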

### Read strategy — mandatory
Always crop raster to H5 bounding box before sampling. Load the crop
into a numpy array in RAM. Sample all 2401 H9 centroids from the
array. Never seek 2401 individual points from USB.

Without this: GEBCO reads at ~25s per H5 (USB random seek speed).
With this: GEBCO reads at ~1-2s per H5 (one sequential crop + RAM).
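
The in-RAM half of this strategy can be sketched without touching the drives. The function below is illustrative, not pipeline code: it assumes a north-up crop with square pixels in degrees, and takes the crop as any 2D structure indexable as `grid[row][col]`, which covers both a nested list and a numpy array produced by a single windowed read. The read itself is omitted because the file layout on the drives is not assumed here.

```python
import math


def sample_grid(grid, west, north, pixel_deg, points):
    """Nearest-pixel lookup of (lon, lat) points in an in-memory raster crop.

    grid      -- 2D array indexed [row][col]; row 0 is the northern edge
    west      -- longitude of the crop's western edge, in degrees
    north     -- latitude of the crop's northern edge, in degrees
    pixel_deg -- pixel size in degrees (square pixels assumed)
    points    -- iterable of (lon, lat) pairs, e.g. the 2401 H9 centroids
    """
    n_rows, n_cols = len(grid), len(grid[0])
    values = []
    for lon, lat in points:
        # floor() rather than int() so points just outside the west or
        # north edge yield -1 and are rejected instead of clamping to 0.
        col = math.floor((lon - west) / pixel_deg)
        row = math.floor((north - lat) / pixel_deg)
        if not (0 <= row < n_rows and 0 <= col < n_cols):
            raise ValueError(f"point ({lon}, {lat}) lies outside the crop")
        values.append(grid[row][col])
    return values
```

All lookups after the crop is loaded are pure memory accesses, which is where the difference between roughly 25s and 1-2s per H5 comes from. With a numpy array the loop could be replaced by vectorized index arithmetic, but the per-point version makes the bounds check explicit.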

### RAM allocation
- Baseline container RAM: 2GB
- Pipeline mode: 24GB (non-essential containers suspended on dev box)
- Relevant tile sizes: GEBCO tile ~891MB, WorldCover tile ~100MB
- In-memory strategy: load relevant tiles at pipeline start,
  release at end
- Three Proxmox boxes: dev (pipeline work), staging (validation),
  production (live game) — transfer via WireGuard mesh

### Python venv
- Path: `/home/otivm/pipeline-venv`
- Packages: h3, requests, numpy, rasterio, shapely, pyproj
- Do not use `/home/otivm/venv` — that belongs to the game assistant

### Pipeline scripts (committed, not yet functional for new pipeline)
- `pipeline/seed_extract.py` — old Dell-based extractor, do not re-run
- `pipeline/seed_promote.py` — old promotion script, do not re-run
- New per-H5 scripts to be written after drive additions complete

---

## 7. Infrastructure

### OTIVM container (CT 1105, proliant-dev, 10.0.0.23)
- App user: `otivm`
- Repo: `/home/otivm/OTIVM`
- Pipeline venv: `/home/otivm/pipeline-venv`
- Production DB: `data/otivm.sqlite3`
- Staging DB: `data/staging_otivm.sqlite3` (not in git)
- Claude Code runs here as `otivm` via `work` alias

### Three Proxmox boxes
- **proliant-dev (srv-a, 10.0.0.11)** — development and pipeline work
- **staging box** — validation before production
- **production box** — live game, never touched by pipeline directly

### Gitea
- Repo: `https://gitea.barternetwork.us/TheRON/OTIVM`
- Branch: `main`
- MCP: `mcp.civicus.us` — read any file directly from Claude chat

---

## 8. Hard rules

- Never write to `data/otivm.sqlite3` directly — always via staging
- Never commit `*.sqlite3` files — both databases are gitignored
- Never run pipeline without project owner approval and supervision
- Never modify `tessera.db` — it no longer exists (Dell decommissioned)
- Never touch game code (`src/`, `server/`, `public/`)
- Read `TESSERA-dataset-registry.md` before evaluating any new source
- One file at a time. One confirmation before proceeding.
- Do not start pipeline coding without explicit project owner instruction

---

## 9. Pending work — in order

1. **Drive additions** — project owner downloads and mounts four datasets
2. **Pipeline architecture document** — design before any code
3. **Per-H5 pipeline scripts** — one file at a time, supervised
4. **Restoration layer** — HYDE + KK10 integration into terrain field
5. **Stage 06 (occ_flag)** — archaeological sources, deferred until
   simulation track begins

---

*Handover 2026-04-27 — dataset assistant track*
*Database seeded, paleo_epochs added, drives inventoried.*
*Pipeline not yet built. Drive additions required first.*
*The restoration layer is the most important pending concept.*