Handover — TESSERA Dataset Assistant

Date: 2026-04-27

For: Incoming dataset assistant

Read this completely before doing anything


0. Your role

You are the dataset assistant. You own the pipeline that populates data/otivm.sqlite3 with physical-world data from the USB drives. You do not touch game code, frontend, backend, or PM2.

The game development assistant works in parallel. They own src/, server/, and everything the player sees. You own:

  • pipeline/ — all extraction and promotion scripts
  • data/create_otivm_db.sql — the schema source of truth
  • data/staging_otivm.sqlite3 — your working database (never in git)
  • docs/ — dataset and pipeline documentation

You do not write to data/otivm.sqlite3 directly. You write to data/staging_otivm.sqlite3, verify, then copy to production on explicit project owner approval.


1. Read these files before doing anything

In order:

  1. CLAUDE.md — workflow, three-shell model, ground rules
  2. docs/TESSERA-dataset-registry.md — every dataset evaluated, triage decisions, drive inventory, what is on drives and what is not
  3. docs/RFC-TESSERA-4.0-001.md — the database schema contract
  4. docs/RFC-TESSERA-3.0-PALEO-001.md — paleo epoch table spec
  5. docs/TESSERA-pipeline-registry.md — history of the old batch pipeline, what completed, what failed, and why
  6. This file

2. Current database state — as of 2026-04-27

data/otivm.sqlite3 — production

  • 12,005 H9 rows across five waypoints, all status=2 (current)
  • All H5s at status=2 in h5_coverage
  • paleo_epochs table populated with 9 epochs per RFC-TESSERA-3.0-PALEO-001
  • H3 IDs stored as INTEGER (64-bit)

Five waypoints

| City | H5 TEXT | H9 cells |
|---|---|---|
| Ostia | 851e805bfffffff | 2401 |
| Capua | 851e8333fffffff | 2401 |
| Brundisium | 851e8ba3fffffff | 2401 |
| Carthago | 85386e23fffffff | 2401 |
| Alexandria | 853f5ba7fffffff | 2401 |
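Production stores H3 IDs as 64-bit integers while the waypoint list above gives them as text, so a quick consistency check can help when comparing the two. The sketch below uses only the standard library. The names `WAYPOINTS`, `h3_text_to_int`, and `h3_resolution` are illustrative and do not come from the pipeline scripts. It assumes the standard H3 index layout (resolution in bits 52-55) and seven children per resolution step for hexagonal cells.

```python
# Illustrative names; none of these helpers exist in pipeline/.
WAYPOINTS = {
    "Ostia": "851e805bfffffff",
    "Capua": "851e8333fffffff",
    "Brundisium": "851e8ba3fffffff",
    "Carthago": "85386e23fffffff",
    "Alexandria": "853f5ba7fffffff",
}


def h3_text_to_int(h3_text: str) -> int:
    """Convert the hexadecimal text form of an H3 ID to its integer form."""
    return int(h3_text, 16)


def h3_resolution(h3_int: int) -> int:
    """Read the resolution from bits 52-55 of the H3 index."""
    return (h3_int >> 52) & 0xF


# Each hexagonal cell has 7 children per resolution step, so an H5 parent
# holds 7**4 H9 cells. Pentagon parents would hold fewer.
H9_PER_H5 = 7 ** (9 - 5)

for city, text_id in WAYPOINTS.items():
    as_int = h3_text_to_int(text_id)
    assert h3_resolution(as_int) == 5   # every waypoint ID is an H5 cell
    assert as_int < 2 ** 63             # fits a signed 64-bit SQLite INTEGER
    print(city, as_int)

print(len(WAYPOINTS) * H9_PER_H5)  # 12005, matching the production row count
```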

Field status

| Field | Status | Source |
|---|---|---|
| elev_cm | Current | GEBCO 2025 |
| terrain | Current (modern only; see Section 4) | ESA WorldCover 2021 |
| hydro | Current | HydroSHEDS v1.1 |
| geo_dep | Current | USGS MRDS |
| geo_flag | Current | BGR IGME5000 |
| occ_flag | ⚠ Placeholder (0x00 everywhere) | Stage 06 not written |

data/staging_otivm.sqlite3

Identical to production as of last session. Always reset from production before starting a new pipeline run:

cp data/otivm.sqlite3 data/staging_otivm.sqlite3
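If the reset ever needs to happen from inside a Python script rather than the shell, the same step can be sketched with the standard library's SQLite backup API. This is an alternative to `cp`, not a replacement the handover prescribes. The function name `reset_staging` is hypothetical. Production is opened read-only, so the helper cannot write to it.

```python
import sqlite3
from pathlib import Path


def reset_staging(production: Path, staging: Path) -> None:
    """Overwrite the staging database with a snapshot of production."""
    # mode=ro makes SQLite refuse any write to the production file.
    src = sqlite3.connect(f"{production.resolve().as_uri()}?mode=ro", uri=True)
    try:
        dst = sqlite3.connect(staging)
        try:
            # backup() replaces the entire contents of the destination.
            src.backup(dst)
        finally:
            dst.close()
    finally:
        src.close()
```

A call would look like `reset_staging(Path("data/otivm.sqlite3"), Path("data/staging_otivm.sqlite3"))`, run from the repository root.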

3. USB drives — what is present

Both drives are mounted read-only at /opt/data/ on every container. The full inventory is in data/tessera_usb_inventory.txt.

Drive 1: TESSERA_APR26 (/dev/sdb1, 29GB, 21GB free)

| Dataset | Path | Size | Fields |
|---|---|---|---|
| GEBCO 2025 | gebco/ | 6.8GB | elev_cm |
| HydroSHEDS v1.1 | hydrosheds/ | 240MB | hydro |
| USGS MRDS | mrds/mrds.csv | 16MB | geo_dep |

Drive 2: TESSERA_WORLD (/dev/sdd1, 29GB, 7GB free)

| Dataset | Path | Size | Fields |
|---|---|---|---|
| ESA WorldCover 2021 v200 | worldcover/ | 22GB | terrain |

4. The restoration layer — critical concept

terrain in the database is modern WorldCover 2021. It is wrong for historical periods.

WorldCover reflects 2021 land cover — cities, airports, drained marshes, reservoirs. For all five OTIVM waypoints, the majority of H9 cells within urban zones are classified as built-up or cropland. In Roman times (14 BCE epoch) and Mesolithic times (8000 BCE epoch), those same cells were overwhelmingly forested.

The Mediterranean basin was 60-70% forested in both periods. Today it is not.

The restoration layer corrects this at query time using two datasets not yet on the drives:

  • HYDE 3.3 — historical land use per epoch (what was actually there)
  • KK10 — potential natural vegetation (what would grow without humans)

Until these datasets are loaded and the restoration pipeline stage is written, terrain is a modern snapshot, not a historical one. The game development assistant has been informed. The game must not present terrain values as historically accurate for any epoch until the restoration layer is active.

This is the most important pending pipeline work after the drive additions are complete.
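To make the idea concrete, here is one possible reading of the query-time correction. This handover does not specify the actual rule, which belongs in the pending pipeline architecture document, so everything below is an assumption: the function name `restored_terrain`, the string labels, and the precedence (HYDE land use first, then KK10 vegetation, then the modern value flagged as not historical).

```python
from typing import Optional, Tuple


def restored_terrain(
    modern: str,
    hyde_land_use: Optional[str],
    kk10_vegetation: Optional[str],
) -> Tuple[str, bool]:
    """Return (terrain, is_historical) for one H9 cell in one epoch.

    Assumed precedence, not taken from any RFC:
    1. HYDE land use for the epoch, where people were using the cell.
    2. KK10 potential natural vegetation, where they were not.
    3. The modern WorldCover value, flagged as not historical.
    """
    if hyde_land_use is not None:
        return hyde_land_use, True
    if kk10_vegetation is not None:
        return kk10_vegetation, True
    return modern, False


# A cell that is built-up today, unused in the epoch, forest by potential vegetation.
print(restored_terrain("built-up", None, "forest"))
```

The boolean flag is the point of the sketch: until both datasets are loaded, every cell falls through to the last branch, and callers can see that the value is a modern snapshot.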


5. What is missing from the drives — priority additions

These four datasets must be downloaded and added to Drive 1 before the per-H5 pipeline can be built. Total: ~5.2GB, fits in 21GB free.

| Priority | Dataset | Size | Why needed |
|---|---|---|---|
| 1 | BGR IGME5000 shapefile | ~200MB | geo_flag currently depends on live API; must be local |
| 2 | HYDE 3.3 historical land use | ~4GB | Restoration layer (required) |
| 3 | KK10 potential natural vegetation | ~500MB | Restoration layer (required alongside HYDE) |
| 4 | HydroRivers Europe + Africa | ~500MB | Accurate river placement for hydro |
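The ~5.2GB total and the fit on Drive 1 can be checked with a few lines of arithmetic. The sizes are the approximate figures listed above, treated here as decimal megabytes; the variable names are illustrative.

```python
# Approximate sizes from the priority list, in MB (decimal units assumed).
ADDITIONS_MB = {
    "BGR IGME5000 shapefile": 200,
    "HYDE 3.3 historical land use": 4000,
    "KK10 potential natural vegetation": 500,
    "HydroRivers Europe + Africa": 500,
}
DRIVE1_FREE_GB = 21  # free space reported for TESSERA_APR26

total_gb = sum(ADDITIONS_MB.values()) / 1000
remaining_gb = DRIVE1_FREE_GB - total_gb

print(f"{total_gb:.1f} GB to add, {remaining_gb:.1f} GB left")
# 5.2 GB to add, 15.8 GB left
```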

Download sources in docs/TESSERA-dataset-registry.md.

Drives are read-only when mounted. To add data:

  1. Unmount from Proxmox host
  2. Remount read-write on a machine with ext4 write access
  3. Copy data
  4. Remount read-only
  5. Verify with inventory check before proceeding
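The inventory check in step 5 could be as simple as confirming that each expected dataset path exists under the mount point. The sketch below is a minimal version. The function name `missing_datasets` is hypothetical, the mount root is passed in rather than assumed, and the format of data/tessera_usb_inventory.txt is not used because it is not described here.

```python
from pathlib import Path
from typing import Iterable, List


def missing_datasets(mount_root: Path, expected: Iterable[str]) -> List[str]:
    """Return the expected relative paths that are absent under mount_root."""
    return [rel for rel in expected if not (mount_root / rel).exists()]


# Paths currently listed for Drive 1; paths for the new additions are
# not defined in this handover, so they are left out.
DRIVE1_EXPECTED = ["gebco", "hydrosheds", "mrds/mrds.csv"]
```

An empty list from `missing_datasets(drive1_mount, DRIVE1_EXPECTED)` means every listed path was found, where `drive1_mount` is wherever Drive 1 appears under /opt/data/.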

Do not begin pipeline design until all four additions are on Drive 1.


6. The per-H5 pipeline — not yet built

The new pipeline replaces the old batch pipeline entirely. Key facts:

  • Processes one H5 hex at a time
  • Reads all data from USB drives (no live API calls)
  • Writes to staging_otivm.sqlite3 only
  • Follows RFC-TESSERA-4.0-001 pipeline contract: draft → validate → promote → copy to production
  • Manually triggered with project owner approval
  • Supersede support built in — can update existing H5 rows when a source dataset improves
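The stage order and the approval gate can be pictured as a small state machine. The stage names below follow the draft, validate, promote, copy-to-production sequence listed above. The function name `advance` and the choice to gate only the final step are assumptions for illustration; RFC-TESSERA-4.0-001 is the authority on the real contract.

```python
STAGES = ("draft", "validate", "promote", "copy_to_production")


def advance(current: str, owner_approved: bool = False) -> str:
    """Return the stage that follows `current`, enforcing the approval gate."""
    index = STAGES.index(current)  # raises ValueError for an unknown stage
    if index == len(STAGES) - 1:
        raise ValueError(f"{current!r} is the final stage")
    next_stage = STAGES[index + 1]
    # Production is only reached with explicit project owner approval.
    if next_stage == "copy_to_production" and not owner_approved:
        raise PermissionError("project owner approval required")
    return next_stage


print(advance("draft"))                         # validate
print(advance("promote", owner_approved=True))  # copy_to_production
```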

Read strategy — mandatory

Always crop raster to H5 bounding box before sampling. Load the crop into a numpy array in RAM. Sample all 2401 H9 centroids from the array. Never seek 2401 individual points from USB.

Without this: GEBCO reads at ~25s per H5 (USB random seek speed). With this: GEBCO reads at ~1-2s per H5 (one sequential crop + RAM).
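A minimal sketch of the crop-then-sample pattern follows, assuming a north-up raster in geographic coordinates. The function names `read_crop` and `sample_points` are hypothetical. `read_crop` uses rasterio's windowed read and is imported lazily so the sampling logic can be exercised without the drives. `sample_points` does nearest-pixel lookup by index arithmetic and works on a numpy array or any 2D sequence. Producing the H9 centroids (for example with the h3 package) is left out, because the installed h3 version is not stated here.

```python
from typing import Iterable, List, Sequence, Tuple


def read_crop(path: str, west: float, south: float, east: float, north: float):
    """One windowed read of band 1 covering the H5 bounding box.

    Returns (array, west_edge, north_edge, pixel_width, pixel_height).
    """
    import rasterio  # lazy import; rasterio is in the pipeline venv
    from rasterio.windows import from_bounds

    with rasterio.open(path) as src:
        window = from_bounds(west, south, east, north, transform=src.transform)
        window = window.round_offsets().round_lengths()
        data = src.read(1, window=window)      # numpy array held in RAM
        t = src.window_transform(window)       # affine transform of the crop
    return data, t.c, t.f, t.a, -t.e


def sample_points(
    data: Sequence[Sequence[float]],
    west_edge: float,
    north_edge: float,
    pixel_width: float,
    pixel_height: float,
    points: Iterable[Tuple[float, float]],
) -> List[float]:
    """Look up each (lat, lng) point in the in-memory crop; no disk access."""
    rows, cols = len(data), len(data[0])
    values = []
    for lat, lng in points:
        col = int((lng - west_edge) // pixel_width)
        row = int((north_edge - lat) // pixel_height)
        if not (0 <= row < rows and 0 <= col < cols):
            raise ValueError(f"point ({lat}, {lng}) lies outside the crop")
        values.append(data[row][col])
    return values


# Tiny stand-in crop: 2 x 2 pixels of 0.5 degrees, top-left corner at 42N 10E.
crop = [[1, 2], [3, 4]]
print(sample_points(crop, 10.0, 42.0, 0.5, 0.5, [(41.75, 10.25), (41.25, 10.75)]))
# [1, 4]
```

In a real run the 2401 centroids of one H5 would be passed as `points`, so the USB drive sees a single read per raster instead of 2401 seeks.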

RAM allocation

  • Baseline container RAM: 2GB
  • Pipeline mode: 24GB (non-essential containers suspended on dev box)
  • Relevant tile sizes: GEBCO tile ~891MB, WorldCover tile ~100MB
  • In-memory strategy: load relevant tiles at pipeline start, release at end
  • Three Proxmox boxes: dev (pipeline work), staging (validation), production (live game) — transfer via WireGuard mesh

Python venv

  • Path: /home/otivm/pipeline-venv
  • Packages: h3, requests, numpy, rasterio, shapely, pyproj
  • Do not use /home/otivm/venv — that belongs to the game assistant

Pipeline scripts (committed, not yet functional for new pipeline)

  • pipeline/seed_extract.py — old Dell-based extractor, do not re-run
  • pipeline/seed_promote.py — old promotion script, do not re-run
  • New per-H5 scripts to be written after drive additions complete

7. Infrastructure

OTIVM container (CT 1105, proliant-dev, 10.0.0.23)

  • App user: otivm
  • Repo: /home/otivm/OTIVM
  • Pipeline venv: /home/otivm/pipeline-venv
  • Production DB: data/otivm.sqlite3
  • Staging DB: data/staging_otivm.sqlite3 (not in git)
  • Claude Code runs here as otivm via work alias

Three Proxmox boxes

  • proliant-dev (srv-a, 10.0.0.11) — development and pipeline work
  • staging box — validation before production
  • production box — live game, never touched by pipeline directly

Gitea

  • Repo: https://gitea.barternetwork.us/TheRON/OTIVM
  • Branch: main
  • MCP: mcp.civicus.us — read any file directly from Claude chat

8. Hard rules

  • Never write to data/otivm.sqlite3 directly — always via staging
  • Never commit *.sqlite3 files — both databases are gitignored
  • Never run pipeline without project owner approval and supervision
  • Never modify tessera.db — it no longer exists (Dell decommissioned)
  • Never touch game code (src/, server/, public/)
  • Read TESSERA-dataset-registry.md before evaluating any new source
  • One file at a time. One confirmation before proceeding.
  • Do not start pipeline coding without explicit project owner instruction

9. Pending work — in order

  1. Drive additions — project owner downloads and mounts four datasets
  2. Pipeline architecture document — design before any code
  3. Per-H5 pipeline scripts — one file at a time, supervised
  4. Restoration layer — HYDE + KK10 integration into terrain field
  5. Stage 06 (occ_flag) — archaeological sources, deferred until simulation track begins

Handover 2026-04-27, dataset assistant track.

Database seeded, paleo_epochs added, drives inventoried. Pipeline not yet built. Drive additions required first. The restoration layer is the most important pending concept.