# TESSERA Dataset Registry
### Date: 2026-04-27
### Status: Authoritative — read before adding any data source to the pipeline
### Author: Claude Sonnet 4.6 with full session context
---
## Purpose
This document is the permanent record of every dataset considered for the
TESSERA pipeline. It documents what each dataset contains, where it comes
from, what it costs to use, and whether it belongs to the game, the
simulation, or the landscape restoration layer.
Datasets are documented regardless of whether they are currently loaded.
The goal is to ensure no future session needs to rediscover sources that
have already been evaluated.
**The pipeline processes one H5 hex at a time. All data is read from
local USB drives. No live API calls during pipeline runs.**
---
## Hardware constraints — non-negotiable
These numbers govern every pipeline design decision.
| Constraint | Value |
|---|---|
| USB read speed (sequential) | 20.5 MB/s |
| USB random read time (2401 pts, GEBCO) | ~25s per H5 |
| USB random read time (2401 pts, WorldCover) | ~2.1s per H5 |
| SQLite INSERT 2401 rows | ~0.007s (negligible) |
| OTIVM container RAM (baseline) | 2 GB |
| OTIVM container RAM (pipeline mode) | 24 GB (non-essential containers suspended) |
| Proliant DL360 G7 total RAM | 32 GB |
| SAS RAID1+0 write speed | adequate — not the bottleneck |
| USB 2.0 interface | bottleneck for large rasters |
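
The SQLite figure in the table is small enough that row writes need no special tuning. As a minimal sketch of how the 2401 rows for one H5 might be written in a single transaction (the `cells` table, its columns, and the `insert_cells` helper are hypothetical and are not the pipeline's actual schema):

```python
import sqlite3

def insert_cells(conn, rows):
    """Insert all rows for one H5 inside a single transaction."""
    with conn:  # commits once on success, rolls back on error
        conn.executemany(
            "INSERT INTO cells (h9_id, elev_cm) VALUES (?, ?)", rows
        )

# In-memory database with a hypothetical two-column schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cells (h9_id TEXT PRIMARY KEY, elev_cm INTEGER)")

# 2401 synthetic rows: one per H9 cell in an H5 (7**4 = 2401).
rows = [(f"h9_{i:04d}", i * 10) for i in range(7 ** 4)]
insert_cells(conn, rows)

row_count = conn.execute("SELECT COUNT(*) FROM cells").fetchone()[0]
```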
**Key optimisation:** Always crop raster to H5 bounding box before sampling.
Load the crop into a numpy array in RAM. Sample all 2401 H9 centroids from
RAM, not from disk. This reduces GEBCO read time from ~25s to ~1-2s per H5.
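
The crop-then-sample pattern comes down to pixel-window arithmetic. The sketch below shows that arithmetic in plain Python so it runs without a raster library; in the pipeline the crop would presumably be a numpy array filled by one windowed read. The function names, the synthetic crop, and the example bounding box are illustrative assumptions. The tile origin and the 240 px/degree figure follow from the 15 arc-sec GEBCO grid described in this document.

```python
import math

def bbox_window(bbox, origin_lon, origin_lat, px_per_deg, width, height):
    """Return (row0, row1, col0, col1) covering bbox inside one tile.

    bbox is (min_lon, min_lat, max_lon, max_lat). The origin is the
    tile's top-left corner, so rows count downward from the north edge.
    """
    min_lon, min_lat, max_lon, max_lat = bbox
    col0 = max(0, math.floor((min_lon - origin_lon) * px_per_deg))
    col1 = min(width, math.ceil((max_lon - origin_lon) * px_per_deg))
    row0 = max(0, math.floor((origin_lat - max_lat) * px_per_deg))
    row1 = min(height, math.ceil((origin_lat - min_lat) * px_per_deg))
    return row0, row1, col0, col1

def sample(crop, window, origin_lon, origin_lat, px_per_deg, points):
    """Nearest-pixel lookup of (lon, lat) points in the in-memory crop."""
    row0, row1, col0, col1 = window
    values = []
    for lon, lat in points:
        row = int((origin_lat - lat) * px_per_deg) - row0
        col = int((lon - origin_lon) * px_per_deg) - col0
        # Clamp points that fall exactly on the south or east edge.
        row = min(max(row, 0), row1 - row0 - 1)
        col = min(max(col, 0), col1 - col0 - 1)
        values.append(crop[row][col])
    return values

# A GEBCO-like tile: top-left corner at (0E, 90N), 240 px per degree.
window = bbox_window((12.0, 41.5, 12.5, 42.0), 0.0, 90.0, 240, 21600, 21600)
n_rows, n_cols = window[1] - window[0], window[3] - window[2]

# Stand-in for the single disk read: a synthetic crop held in RAM.
crop = [[r * 1000 + c for c in range(n_cols)] for r in range(n_rows)]

values = sample(crop, window, 0.0, 90.0, 240, [(12.25, 41.75)])
```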
---
## USB drives — current inventory
### Drive 1: TESSERA_APR26 (/dev/sdb1)
- Mount: `/opt/data/TESSERA_APR26` (read-only, ext4)
- Total: 29 GB | Used: 7.0 GB | Free: 21 GB
- Inventoried: 2026-04-27
- Full inventory: `data/tessera_usb_inventory.txt`
### Drive 2: TESSERA_WORLD (/dev/sdd1)
- Mount: `/opt/data/TESSERA_WORLD` (read-only, ext4)
- Total: 29 GB | Used: 22 GB | Free: 7 GB (after WorldCover)
- Inventoried: 2026-04-27
- Full inventory: `data/tessera_usb_inventory.txt`
---
## Triage key
| Tag | Meaning |
|---|---|
| **GAME** | Needed for OTIVM trade routes, terrain rendering, economic logic |
| **SIMULATION** | Needed for CIVICVS scientific rigour, academic defensibility, Mesolithic AI |
| **RESTORATION** | Needed to correct modern land cover back to Roman/Mesolithic reality |
| **DEFERRED** | Documented, evaluated, not yet loaded — load when the relevant release begins |
| **ON DRIVE** | Present on USB, ready to use |
| **NOT ON DRIVE** | Not yet downloaded — must be added before pipeline can use it |
| **SAMPLE ONLY** | Pipeline uses a small sample of records, not the full dataset |
---
## Datasets — currently on drives
---
### GEBCO 2025 Grid
| Property | Value |
|---|---|
| **Triage** | GAME · SIMULATION |
| **Status** | ON DRIVE — TESSERA_APR26/gebco/ |
| **Format** | GeoTIFF, 8 tiles, global coverage in 90°×90° quadrants |
| **Size on drive** | 6.8 GB |
| **CRS** | EPSG:4326 |
| **Resolution** | 15 arc-sec (~450m at equator) |
| **Tile shape** | 21600 × 21600 px per tile |
| **Bands** | 1 (int16, nodata=-32767) |
| **Field populated** | `elev_cm` |
| **Confidence** | `indicated` (2) — GEBCO is a composite; per-cell quality varies |
| **Source URL** | https://www.gebco.net/data_and_products/gridded_bathymetry_data/ |
| **License** | CC-BY 4.0 |
| **Citation** | GEBCO Compilation Group (2025) GEBCO 2025 Grid (doi:10.5285/a29c5465-b138-234d-e053-6c86abc0dc7f) |
| **Notes** | Elevation reflects modern sea level. Paleo epoch offsets applied at query time via `paleo_epochs` table — not stored in cell rows. One or two tiles cover the full Mediterranean. Read strategy: crop to H5 bounding box, load into RAM, sample from array. |
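
Applying the epoch offset at query time could look roughly like the sketch below. Only the `paleo_epochs` table name and the `elev_cm` field come from this document; the `cells` table, the `epoch` and `sea_level_cm` columns, and the offset values are placeholders chosen for illustration, not real registry data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cells (h9_id TEXT PRIMARY KEY, elev_cm INTEGER);
CREATE TABLE paleo_epochs (epoch TEXT PRIMARY KEY, sea_level_cm INTEGER);
-- Placeholder rows: modern elevations and an invented lowstand offset.
INSERT INTO cells VALUES ('cell_a', -5000), ('cell_b', 1200);
INSERT INTO paleo_epochs VALUES ('modern', 0), ('example_lowstand', -12000);
""")

def elevation_at_epoch(conn, epoch):
    """Return {h9_id: elevation in cm relative to that epoch's sea level}.

    The stored elev_cm stays untouched; the offset is subtracted in the query.
    """
    rows = conn.execute(
        """SELECT c.h9_id, c.elev_cm - p.sea_level_cm
           FROM cells AS c CROSS JOIN paleo_epochs AS p
           WHERE p.epoch = ?""",
        (epoch,),
    ).fetchall()
    return dict(rows)

modern = elevation_at_epoch(conn, "modern")
lowstand = elevation_at_epoch(conn, "example_lowstand")
```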
---
### GEBCO 2025 Source Identifier Grid (SID)
| Property | Value |
|---|---|
| **Triage** | SIMULATION · DEFERRED |
| **Status** | NOT ON DRIVE |
| **Format** | GeoTIFF, same tile structure as GEBCO elevation |
| **Size** | ~6.8 GB |
| **Field populated** | `elev_src` refinement, `elev_conf` upgrade path |
| **Source URL** | https://www.gebco.net/data_and_products/gridded_bathymetry_data/ |
| **License** | CC-BY 4.0 |
| **Notes** | Per-cell identifier of the underlying data source (ship soundings, satellite altimetry, modelled). Required to upgrade `elev_conf` from `indicated` to `measured` for well-surveyed cells. Load when academic participation begins. |
---
### ESA WorldCover 2021 v200
| Property | Value |
|---|---|
| **Triage** | GAME · RESTORATION (modern baseline) |
| **Status** | ON DRIVE — TESSERA_WORLD/worldcover/ |
| **Format** | GeoTIFF, 513 tiles, 3°×3° each |
| **Size on drive** | 22 GB |
| **CRS** | EPSG:4326 |
| **Resolution** | 1/12000° (~10m at equator) |
| **Tile shape** | 36000 × 36000 px per tile |
| **Bands** | 1 (uint8, land-cover class, nodata=0) |
| **Field populated** | `terrain` |
| **Confidence** | `indicated` (2) |
| **Source URL** | https://esa-worldcover.org/ |
| **License** | CC-BY 4.0 |
| **Citation** | Zanaga et al. (2022) ESA WorldCover 10m 2021 v200 (doi:10.5281/zenodo.7254221) |
| **Coverage** | ~15°N–72°N, 15°W–75°E; all five OTIVM waypoints covered |
| **Notes** | Snapshot of 2021 land cover. Reflects modern urbanisation, drainage, agriculture. This is the **modern baseline** — the restoration layer corrects it backward to Roman or Mesolithic conditions. Pipeline stores the modern WorldCover value in `terrain`; restoration is applied at query time using HYDE 3.3 and KK10. Read strategy: identify tile by H5 centroid coords, crop to H5 bounding box (~7200×7200px at 10m), load into RAM. |
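
Identifying the tile from the H5 centroid can be sketched as flooring the coordinates to the 3° grid. The filename pattern below is an assumption based on the published WorldCover naming scheme (south-west corner in the name) and should be checked against `data/tessera_usb_inventory.txt`; `worldcover_tile` is a hypothetical helper. An H5 bounding box that straddles a tile edge would need more than one tile, which this sketch does not handle.

```python
import math

def worldcover_tile(lat, lon, tile_deg=3):
    """Return the assumed filename of the 3x3 degree tile containing a point."""
    # Floor to the tile grid so negative coordinates round toward the
    # south-west corner rather than toward zero.
    sw_lat = math.floor(lat / tile_deg) * tile_deg
    sw_lon = math.floor(lon / tile_deg) * tile_deg
    ns = "N" if sw_lat >= 0 else "S"
    ew = "E" if sw_lon >= 0 else "W"
    return (
        "ESA_WorldCover_10m_2021_v200_"
        f"{ns}{abs(sw_lat):02d}{ew}{abs(sw_lon):03d}_Map.tif"
    )

# Approximate coordinates near Ostia, for illustration only.
ostia_tile = worldcover_tile(41.75, 12.29)
# A point west of Greenwich, to show the floor behaviour.
west_tile = worldcover_tile(36.5, -5.5)
```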
---
### HydroSHEDS v1.1
| Property | Value |
|---|---|
| **Triage** | GAME · SIMULATION |
| **Status** | ON DRIVE — TESSERA_APR26/hydrosheds/ |
| **Format** | GeoTIFF, 10 tiles (flow direction + flow accumulation, per region) |
| **Size on drive** | 240 MB |
| **CRS** | EPSG:4326 |
| **Resolution** | 15 arc-sec |
| **Bands** | Flow direction: uint8 — Flow accumulation: uint32 |
| **Regions on drive** | Africa, Arctic, Asia, Europe, Siberia |
| **Field populated** | `hydro` |
| **Confidence** | `indicated` (2) |
| **Source URL** | https://www.hydrosheds.org/ |
| **License** | CC-BY 4.0 |
| **Citation** | Lehner et al. (2022) HydroSHEDS v1.1 Technical Documentation. WWF US, Washington DC. |
| **Notes** | HydroSHEDS v2.0 is expected in October 2026; review then. The flow accumulation threshold for `hydro` classification is defined in RFC-TESSERA-2.0-001 Section 3.3. Rivers have migrated since the Roman period; the restoration layer corrects major drainage changes. |
---
### USGS MRDS — Mineral Resources Data System
| Property | Value |
|---|---|
| **Triage** | GAME · SIMULATION · SAMPLE ONLY |
| **Status** | ON DRIVE — TESSERA_APR26/mrds/mrds.csv |
| **Format** | CSV, 16 MB |
| **Field populated** | `geo_dep` |
| **Confidence** | `indicated` (2) where deposit present, `no_data` (4) elsewhere |
| **Source URL** | https://mrdata.usgs.gov/mrds/ |
| **DOI** | 10.3133/ds52 |
| **License** | USGS public domain |
| **Notes** | Point dataset — bounding box query per H5, assign deposit code to nearest H9 centroid within threshold distance. Coverage uneven — MRDS systematic updates ceased 2011. European coverage sparse. Known issue: Almadén mercury mine coordinates in MRDS are ~34km from actual mine location — MRDS data quality issue, not a pipeline error. |
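
The bounding-box filter and nearest-centroid assignment might be sketched as follows. The 200 m threshold, the deposit codes, the coordinates, and the "first deposit wins" tie policy are all illustrative assumptions; the real threshold is not stated in this document.

```python
import math

EARTH_RADIUS_M = 6_371_000

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/lon points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def assign_deposits(deposits, centroids, bbox, max_dist_m):
    """Map h9_id -> deposit code for deposits near an H9 centroid.

    deposits: iterable of (code, lat, lon)
    centroids: list of (h9_id, lat, lon)
    bbox: (min_lon, min_lat, max_lon, max_lat) of the H5
    """
    min_lon, min_lat, max_lon, max_lat = bbox
    assigned = {}
    for code, lat, lon in deposits:
        # Cheap bounding-box filter before any distance maths.
        if not (min_lat <= lat <= max_lat and min_lon <= lon <= max_lon):
            continue
        h9_id, dist = min(
            ((cid, haversine_m(lat, lon, clat, clon)) for cid, clat, clon in centroids),
            key=lambda pair: pair[1],
        )
        if dist <= max_dist_m:
            assigned.setdefault(h9_id, code)  # first deposit wins (assumed policy)
    return assigned

centroids = [("a", 41.750, 12.290), ("b", 41.752, 12.290)]
deposits = [
    ("Fe", 41.7501, 12.2901),  # about 14 m from centroid "a"
    ("Cu", 41.80, 12.29),      # inside the bbox but kilometres from any centroid
    ("Au", 45.0, 12.0),        # outside the bbox
]
assigned = assign_deposits(deposits, centroids, (12.0, 41.5, 12.5, 42.0), 200)
```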
---
## Datasets — not yet on drives, recommended for addition
---
### BGR IGME5000 — International Geological Map of Europe
| Property | Value |
|---|---|
| **Triage** | GAME · SIMULATION · DEFERRED |
| **Status** | NOT ON DRIVE |
| **Format** | Shapefile, ~200 MB |
| **Field populated** | `geo_flag` |
| **Confidence** | `indicated` (2) where covered, `no_data` (4) outside European shelf |
| **Source URL** | https://www.bgr.bund.de/igme5000 |
| **ArcGIS REST** | https://services.bgr.de/arcgis/rest/services/geologie/igme5000/MapServer/23 |
| **License** | GeoNutzV 2013; open, no registration |
| **Citation** | Datenquelle: IGME5000, (c) BGR Hannover, 2007 |
| **Drive target** | TESSERA_APR26 (21 GB free — fits easily) |
| **Notes** | Rock class and confidence in bit layout per RFC-TESSERA-2.0-001. Method: H5 bounding box query → shapely point-in-polygon for H9 centroids. Previously used as live API in the old pipeline (stage 04a) — download the shapefile to eliminate the API dependency. Add to Drive 1 before running geo_flag stage. |
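
The bounding-box query followed by point-in-polygon can be illustrated without any GIS dependency. The sketch below substitutes a plain ray-casting test for shapely's point-in-polygon so that it is self-contained; the rock class names, polygons, and helper names are invented for illustration, and holes and multipolygons are ignored.

```python
def point_in_polygon(lon, lat, ring):
    """Ray-casting test; ring is a list of (lon, lat) vertices."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Count edges that cross the horizontal ray running east from the point.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

def ring_overlaps_bbox(ring, bbox):
    """True if the ring's own bounds intersect bbox (min_lon, min_lat, max_lon, max_lat)."""
    xs = [x for x, _ in ring]
    ys = [y for _, y in ring]
    min_lon, min_lat, max_lon, max_lat = bbox
    return not (max(xs) < min_lon or min(xs) > max_lon
                or max(ys) < min_lat or min(ys) > max_lat)

def classify_centroids(centroids, polygons, bbox):
    """Return {h9_id: rock_class or None} for (h9_id, lat, lon) centroids."""
    # Step 1: keep only polygons that can touch this H5 at all.
    candidates = [(cls, ring) for cls, ring in polygons if ring_overlaps_bbox(ring, bbox)]
    result = {}
    for h9_id, lat, lon in centroids:
        # Step 2: first containing polygon wins; None means no coverage.
        result[h9_id] = next(
            (cls for cls, ring in candidates if point_in_polygon(lon, lat, ring)),
            None,
        )
    return result

polygons = [
    ("limestone", [(12.0, 41.0), (13.0, 41.0), (13.0, 42.0), (12.0, 42.0)]),
    ("granite", [(20.0, 30.0), (21.0, 30.0), (21.0, 31.0), (20.0, 31.0)]),
]
centroids = [("a", 41.75, 12.29), ("b", 41.75, 13.5)]
classes = classify_centroids(centroids, polygons, (12.0, 41.5, 13.6, 42.0))
```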
---
### HYDE 3.3 — History Database of the Global Environment
| Property | Value |
|---|---|
| **Triage** | GAME · RESTORATION · SIMULATION |
| **Status** | NOT ON DRIVE |
| **Format** | ASCII grid / NetCDF, ~4 GB full dataset |
| **Resolution** | 5 arc-min (~10km) |
| **Coverage** | Global, 10000 BCE → 2017 CE, decadal snapshots |
| **Field populated** | Restoration layer input — `terrain_restored` derivation |
| **Source URL** | https://www.pbl.nl/en/image/links/hyde |
| **License** | Free for non-commercial use with attribution |
| **Citation** | Klein Goldewijk et al. (2017) Anthropocene global land use. Earth System Science Data 9. |
| **Drive target** | TESSERA_APR26 (21 GB free — fits) |
| **Notes** | Gridded land use and population density per epoch. Primary tool for identifying which modern built-up/cropland cells were forested or natural at Roman or Mesolithic period. Essential for the restoration layer. Without HYDE, terrain is wrong for all five waypoints. Load before building the restoration pipeline stage. |
---
### KK10 — Kaplan et al. Potential Natural Vegetation
| Property | Value |
|---|---|
| **Triage** | RESTORATION · SIMULATION |
| **Status** | NOT ON DRIVE |
| **Format** | NetCDF / ASCII, ~500 MB |
| **Coverage** | Global, pre-agricultural baseline |
| **Field populated** | Restoration layer input — forest biome type |
| **Source URL** | https://www.geo.uzh.ch/en/units/h2k/Services/KK10-Reconstruction.html |
| **License** | Academic use — verify before commercial deployment |
| **Citation** | Kaplan et al. (2011) Holocene carbon emissions as a result of anthropogenic land cover change. The Holocene 21(5). |
| **Drive target** | TESSERA_APR26 |
| **Notes** | Reconstructs what vegetation would naturally grow at each location without human interference: the forest baseline before 10000 BCE. Answers the question: was this cell forest, grassland, or shrub in its natural state? Used in combination with HYDE to determine restoration class. The 60–70% Mediterranean forest cover target is grounded in this dataset. |
---
### ESA CCI Land Cover Time Series (1992–2020)
| Property | Value |
|---|---|
| **Triage** | RESTORATION · DEFERRED |
| **Status** | NOT ON DRIVE |
| **Format** | NetCDF, ~3 GB Mediterranean subset (~20 GB global) |
| **Resolution** | 300m |
| **Coverage** | Global, annual snapshots 1992–2020 |
| **Source URL** | https://www.esa-landcover-cci.org |
| **License** | Free for research use |
| **Drive target** | TESSERA_APR26 (download Mediterranean subset only) |
| **Notes** | 29 annual land cover maps (1992–2020 inclusive). Used to identify the trajectory of recent urbanisation: which cells changed from natural to built-up in living memory, versus which have been urban since before the record. Helps identify cells where WorldCover 2021 is misleading. Secondary to HYDE for restoration work. Defer until the restoration pipeline stage is active. |
---
### Copernicus DEM GLO-30
| Property | Value |
|---|---|
| **Triage** | SIMULATION · DEFERRED |
| **Status** | NOT ON DRIVE |
| **Format** | GeoTIFF tiles, ~8 GB Mediterranean subset (~170 GB global) |
| **Resolution** | 30m (~15× finer than GEBCO's ~450m) |
| **Coverage** | Global (Mediterranean subset: Italy, Greece, North Africa, Levant, Iberia) |
| **Source URL** | https://spacedata.copernicus.eu/collections/copernicus-digital-elevation-model |
| **License** | Free for non-commercial use |
| **Drive target** | TESSERA_WORLD (7 GB free — Mediterranean subset fits, just) |
| **Notes** | Already used in CIVICVS: hillshade.png was generated from it. For OTIVM, GEBCO 450m is adequate for H9 cells (~180m). GLO-30 becomes relevant at H11–H13 resolution (future CIVICVS work) and for archaeologically precise terrain modelling. Load when CIVICVS simulation reaches academic production. |
---
### HydroRivers (HydroSHEDS river network polylines)
| Property | Value |
|---|---|
| **Triage** | GAME · RESTORATION · DEFERRED |
| **Status** | NOT ON DRIVE |
| **Format** | Shapefile, ~500 MB for Europe + Africa |
| **Source URL** | https://www.hydrosheds.org/products/hydrorivers |
| **License** | CC-BY 4.0 |
| **Drive target** | TESSERA_APR26 |
| **Notes** | Named river polylines with flow order and discharge estimates. Relevant for OTIVM trade route logic (river vs sea routing) and CIVICVS foraging and settlement. Rivers have migrated since the Roman period: the Po, Tiber, and Nile deltas have all changed substantially. The restoration layer will need this to place river channels in Roman-era positions. |
---
### HydroLAKES
| Property | Value |
|---|---|
| **Triage** | GAME · RESTORATION · DEFERRED |
| **Status** | NOT ON DRIVE |
| **Format** | Shapefile, ~800 MB global |
| **Source URL** | https://www.hydrosheds.org/products/hydrolakes |
| **License** | CC-BY 4.0 |
| **Drive target** | TESSERA_APR26 |
| **Notes** | Polygon dataset of lakes and reservoirs. Many Mediterranean reservoirs are modern — Lake Nasser, various Iberian reservoirs. Restoration layer must identify and remove modern reservoirs, restore natural lake extents. |
---
### ICE-6G_C — Peltier et al. Glacial Isostatic Adjustment model
| Property | Value |
|---|---|
| **Triage** | SIMULATION · DEFERRED |
| **Status** | NOT ON DRIVE |
| **Format** | NetCDF, ~2 GB |
| **Coverage** | Global, 26000 BP → present, 500-year intervals |
| **Source URL** | http://www.atmosp.physics.utoronto.ca/~peltier/data.php |
| **License** | Academic use — contact Peltier group for data access |
| **Citation** | Peltier et al. (2015) Space geodesy constrains ice age terminal deglaciation. Journal of Geophysical Research: Solid Earth 120. |
| **Drive target** | TESSERA_APR26 |
| **Notes** | The `paleo_epochs` table currently uses global eustatic offsets with no GIA correction — acknowledged in the table notes. ICE-6G_C provides per-cell relative sea level change including isostatic rebound, which is significant in the Baltic and North Sea (Doggerland) but modest in the Mediterranean. Required for RFC-TESSERA-3.0-PALEO-001 to be academically rigorous. Defer until simulation reaches academic production. |
---
### BIOME 6000 — Pollen-based biome reconstructions
| Property | Value |
|---|---|
| **Triage** | SIMULATION · DEFERRED |
| **Status** | NOT ON DRIVE |
| **Format** | CSV / Shapefile, ~100 MB |
| **Coverage** | Global pollen sites, 6000 BP and 21000 BP snapshots |
| **Source URL** | https://www.bridge.bris.ac.uk/projects/BIOME_6000 |
| **License** | Open academic |
| **Drive target** | TESSERA_APR26 |
| **Notes** | Empirical ground truth from pollen records for vegetation at 6000 BP and 21000 BP. Validates the KK10 and HYDE restoration assumptions against real palaeoecological data. Essential for academic defensibility — this is the dataset that reviewers will ask about. Defer until simulation track begins. |
---
### Zanon et al. 2018 — European Holocene forest cover reconstruction
| Property | Value |
|---|---|
| **Triage** | SIMULATION · DEFERRED |
| **Status** | NOT ON DRIVE |
| **Format** | NetCDF, ~200 MB |
| **Coverage** | Europe, 11700 BP → present |
| **Source URL** | https://doi.org/10.1177/0959683617715643 |
| **License** | Academic — open access paper, data on request |
| **Notes** | Spatially explicit reconstruction of European forest cover from pollen records. Directly quantifies the 60–70% Mesolithic forest cover by region. The scientific authority behind our restoration target. Defer until simulation track. |
---
### Natural Earth 1:10m vectors
| Property | Value |
|---|---|
| **Triage** | GAME · DEFERRED |
| **Status** | NOT ON DRIVE |
| **Format** | Shapefile, ~500 MB |
| **Coverage** | Global |
| **Source URL** | https://www.naturalearthdata.com |
| **License** | Public domain |
| **Drive target** | Either drive |
| **Notes** | Country boundaries, coastlines, urban areas, rivers at 1:10m scale. Lightweight. Useful for the Azgaar bridge (OTIVM-VII) and CIVICVS territory mapping. Load when OTIVM-VII begins. |
---
### HYDE 3.3 — note on Roman-era urban erasure
This note applies to all five OTIVM waypoints and is the central challenge
of the restoration layer.
The Mediterranean basin today bears the accumulated footprint of 2000 years
of continuous urban development, land drainage, deforestation, and
agricultural intensification. The modern WorldCover classification for cells
around our five waypoints reflects:
| City | Modern WorldCover reality |
|---|---|
| Ostia | Built-up, drained wetland, Tiber mouth migrated ~3km south |
| Capua | Dense Campanian urban sprawl over former ager campanus farmland |
| Brundisium | Port city, coastal modifications, harbour dredged and extended |
| Carthago | Tunis urban core directly over the Punic and Roman city |
| Alexandria | Continuously occupied for 2300 years — modern city 5× Roman extent |
For every one of these cells, `terrain` from WorldCover 2021 is wrong for
any historical period. The restoration pipeline must detect these cells
(HYDE classification = built-up or cropland AND KK10 baseline = forest or
shrubland) and override with the historically appropriate class.
This is not a minor adjustment. Within 10km of any of our five waypoints,
the majority of H9 cells will require restoration. The pipeline must treat
WorldCover as the modern baseline and HYDE+KK10 as the correction layer,
not an optional add-on.
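
The detection rule stated above (HYDE built-up or cropland AND KK10 forest or shrubland) can be sketched as a single predicate. The class label strings are hypothetical stand-ins for whatever codes HYDE and KK10 actually use, and returning the KK10 baseline as the override class is an assumption; this document only says the override should be the "historically appropriate class".

```python
# Hypothetical class labels; the real datasets use their own codes.
HYDE_MODIFIED = {"built_up", "cropland"}
KK10_NATURAL = {"forest", "shrubland"}

def needs_restoration(hyde_class, kk10_class):
    """Rule from the text: human-modified in HYDE AND natural in KK10."""
    return hyde_class in HYDE_MODIFIED and kk10_class in KK10_NATURAL

def restored_terrain(worldcover_class, hyde_class, kk10_class):
    """Return the terrain class to use for a historical query.

    WorldCover stays the baseline; it is overridden only when the rule fires.
    Using the KK10 class as the replacement is an assumption of this sketch.
    """
    if needs_restoration(hyde_class, kk10_class):
        return kk10_class
    return worldcover_class

urban_cell = restored_terrain("built_up", "built_up", "forest")
farmed_cell = restored_terrain("cropland", "cropland", "shrubland")
untouched_cell = restored_terrain("grassland", "grazing", "forest")
```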
---
## Drive addition plan — next actions before pipeline build
The following datasets should be downloaded and added to the drives before
the pipeline design is finalised. In priority order:
1. **BGR IGME5000 shapefile** (~200 MB) → Drive 1
   Required for `geo_flag` field. The only blocked field in the current schema.
2. **HYDE 3.3** (~4 GB) → Drive 1
   Required for the restoration layer. Without it, terrain is wrong for all
   five waypoints.
3. **KK10 potential natural vegetation** (~500 MB) → Drive 1
   Required alongside HYDE for restoration biome typing.
4. **HydroRivers Europe + Africa** (~500 MB) → Drive 1
   Required for accurate `hydro` classification near rivers.
The remaining datasets (ICE-6G_C, BIOME 6000, Zanon, CCI, GLO-30) are
deferred until the simulation track begins. The priority additions total
~5.2 GB, which fits on Drive 1 with 21 GB free.
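
The space budget for the priority additions can be checked with a few lines of arithmetic. The sizes are the approximate figures quoted in this document, so the result is an estimate rather than a measured value.

```python
# Approximate sizes in GB, as quoted in the priority list.
PRIORITY_ADDITIONS_GB = {
    "BGR IGME5000": 0.2,
    "HYDE 3.3": 4.0,
    "KK10": 0.5,
    "HydroRivers Europe + Africa": 0.5,
}
DRIVE1_FREE_GB = 21.0  # free space on TESSERA_APR26 at inventory time

total_gb = round(sum(PRIORITY_ADDITIONS_GB.values()), 1)
remaining_gb = round(DRIVE1_FREE_GB - total_gb, 1)
fits = total_gb <= DRIVE1_FREE_GB
```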
---
## Schema fields — dataset mapping
| Field | Source dataset | Triage | Status |
|---|---|---|---|
| `elev_cm` | GEBCO 2025 | GAME | ON DRIVE |
| `terrain` | ESA WorldCover 2021 (modern) + HYDE 3.3 (restored) | GAME + RESTORATION | WorldCover ON DRIVE; HYDE NOT ON DRIVE |
| `hydro` | HydroSHEDS v1.1 + HydroRivers (refined) | GAME | HydroSHEDS ON DRIVE; HydroRivers NOT ON DRIVE |
| `geo_dep` | USGS MRDS | GAME | ON DRIVE |
| `geo_flag` | BGR IGME5000 | GAME | NOT ON DRIVE |
| `occ_flag` | Stage 06 — archaeological sources (ARIADNE, SEAD) | SIMULATION | NOT STARTED |
| `paleo_epochs` | Lambeck 2014 + ICE-6G_C (future) | GAME + SIMULATION | Table populated with eustatic values; ICE-6G_C deferred |
---
*TESSERA-dataset-registry.md — 2026-04-27*
*Written by Claude Sonnet 4.6 with full session and inventory context.*
*Update this document whenever a new dataset is evaluated or added to a drive.*