TESSERA Dataset Registry

Date: 2026-04-27

Status: Authoritative — read before adding any data source to the pipeline

Author: Claude Sonnet 4.6 with full session context


Purpose

This document is the permanent record of every dataset considered for the TESSERA pipeline. It documents what each dataset contains, where it comes from, what it costs to use, and whether it belongs to the game, the simulation, or the landscape restoration layer.

Datasets are documented regardless of whether they are currently loaded. The goal is to ensure no future session needs to rediscover sources that have already been evaluated.

The pipeline processes one H5 hex at a time. All data is read from local USB drives. No live API calls during pipeline runs.


Hardware constraints — non-negotiable

These numbers govern every pipeline design decision.

| Constraint | Value |
| --- | --- |
| USB read speed (sequential) | 20.5 MB/s |
| USB read speed (random 2401 pts, GEBCO) | ~25s per H5 |
| USB read speed (random 2401 pts, WorldCover) | ~2.1s per H5 |
| SQLite INSERT 2401 rows | ~0.007s (negligible) |
| OTIVM container RAM (baseline) | 2 GB |
| OTIVM container RAM (pipeline mode) | 24 GB (non-essential containers suspended) |
| Proliant DL360 G7 total RAM | 32 GB |
| SAS RAID1+0 write speed | adequate, not the bottleneck |
| USB 2.0 interface | bottleneck for large rasters |
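The SQLite figure above refers to writing one H5 batch of 2401 rows. A minimal sketch of such a batch insert, assuming a hypothetical `cells` table with an `h9_id` key (the real schema is defined elsewhere and may differ), could look like this:

```python
import sqlite3

H9_PER_H5 = 2401  # rows per H5 batch (2401 == 7 ** 4)

def insert_cells(conn, rows):
    """Insert one H5 batch inside a single transaction."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO cells (h9_id, elev_cm, terrain) VALUES (?, ?, ?)", rows
        )

# Hypothetical table and placeholder values, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cells (h9_id TEXT PRIMARY KEY, elev_cm INTEGER, terrain INTEGER)"
)
rows = [(f"cell-{i:04d}", 100 * i, 10) for i in range(H9_PER_H5)]
insert_cells(conn, rows)
```

Wrapping the whole batch in one transaction is what keeps the insert cost negligible next to the USB reads.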

Key optimisation: Always crop raster to H5 bounding box before sampling. Load the crop into a numpy array in RAM. Sample all 2401 H9 centroids from RAM, not from disk. This reduces GEBCO read time from ~25s to ~1-2s per H5.
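The index arithmetic behind this optimisation can be sketched as follows. This is a minimal illustration, assuming a north-up EPSG:4326 tile with square pixels; the function names are hypothetical, and the actual windowed read from disk (done once per H5 with whatever raster library the pipeline uses) is not shown.

```python
import math

def bbox_to_window(bbox, origin_lon, origin_lat, px_per_deg, width, height):
    """Return (row_off, col_off, n_rows, n_cols) covering bbox, clamped to the tile.

    bbox is (west, south, east, north) in degrees; (origin_lon, origin_lat)
    is the tile's top-left corner.
    """
    west, south, east, north = bbox
    col0 = max(0, math.floor((west - origin_lon) * px_per_deg))
    col1 = min(width, math.ceil((east - origin_lon) * px_per_deg))
    row0 = max(0, math.floor((origin_lat - north) * px_per_deg))
    row1 = min(height, math.ceil((origin_lat - south) * px_per_deg))
    return row0, col0, row1 - row0, col1 - col0

def sample_crop(crop, window, origin_lon, origin_lat, px_per_deg, points):
    """Look up each (lon, lat) in the in-RAM crop; None if it falls outside."""
    row0, col0, n_rows, n_cols = window
    out = []
    for lon, lat in points:
        r = math.floor((origin_lat - lat) * px_per_deg) - row0
        c = math.floor((lon - origin_lon) * px_per_deg) - col0
        out.append(crop[r][c] if 0 <= r < n_rows and 0 <= c < n_cols else None)
    return out

# GEBCO-like tile: top-left at (0°E, 90°N), 240 px per degree, 21600 px square.
window = bbox_to_window((12.0, 41.5, 12.5, 42.0), 0.0, 90.0, 240, 21600, 21600)
# Synthetic stand-in for the cropped array that would be read from disk once.
crop = [[r * window[3] + c for c in range(window[3])] for r in range(window[2])]
values = sample_crop(crop, window, 0.0, 90.0, 240, [(12.25, 41.75), (13.0, 41.75)])
```

`crop` is a nested list here so the sketch has no dependencies; the same indexing works unchanged on a 2-D numpy array.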


USB drives — current inventory

Drive 1: TESSERA_APR26 (/dev/sdb1)

  • Mount: /opt/data/TESSERA_APR26 (read-only, ext4)
  • Total: 29 GB | Used: 7.0 GB | Free: 21 GB
  • Inventoried: 2026-04-27
  • Full inventory: data/tessera_usb_inventory.txt

Drive 2: TESSERA_WORLD (/dev/sdd1)

  • Mount: /opt/data/TESSERA_WORLD (read-only, ext4)
  • Total: 29 GB | Used: 22 GB | Free: 7 GB (after WorldCover)
  • Inventoried: 2026-04-27
  • Full inventory: data/tessera_usb_inventory.txt

Triage key

| Tag | Meaning |
| --- | --- |
| GAME | Needed for OTIVM trade routes, terrain rendering, economic logic |
| SIMULATION | Needed for CIVICVS scientific rigour, academic defensibility, Mesolithic AI |
| RESTORATION | Needed to correct modern land cover back to Roman/Mesolithic reality |
| DEFERRED | Documented, evaluated, not yet loaded; load when the relevant release begins |
| ON DRIVE | Present on USB, ready to use |
| NOT ON DRIVE | Not yet downloaded; must be added before pipeline can use it |
| SAMPLE ONLY | Pipeline uses a small sample of records, not the full dataset |

Datasets — currently on drives


GEBCO 2025 Grid

| Property | Value |
| --- | --- |
| Triage | GAME · SIMULATION |
| Status | ON DRIVE: TESSERA_APR26/gebco/ |
| Format | GeoTIFF, 8 tiles, global coverage in 90°×90° quadrants |
| Size on drive | 6.8 GB |
| CRS | EPSG:4326 |
| Resolution | 15 arc-sec (~450m at equator) |
| Tile shape | 21600 × 21600 px per tile |
| Bands | 1 (int16, nodata=-32767) |
| Field populated | elev_cm |
| Confidence | indicated (2); GEBCO is a composite, so per-cell quality varies |
| Source URL | https://www.gebco.net/data_and_products/gridded_bathymetry_data/ |
| License | CC-BY 4.0 |
| Citation | GEBCO Compilation Group (2025) GEBCO 2025 Grid (doi:10.5285/a29c5465-b138-234d-e053-6c86abc0dc7f) |
| Notes | Elevation reflects modern sea level. Paleo epoch offsets are applied at query time via the paleo_epochs table and are not stored in cell rows. One or two tiles cover the full Mediterranean. Read strategy: crop to H5 bounding box, load into RAM, sample from array. |
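A minimal sketch of applying an epoch offset at query time. It assumes the offset is stored as the epoch's sea level relative to today, in centimetres and negative when the sea stood lower. The function names, the sign convention, and the example offset are illustrative assumptions, not values taken from the paleo_epochs table.

```python
def elevation_at_epoch(elev_cm, sea_level_offset_cm):
    """Elevation in cm relative to the epoch's sea level.

    elev_cm is the stored modern value; sea_level_offset_cm is the epoch's
    sea level relative to today (assumed negative for lower stands).
    """
    return elev_cm - sea_level_offset_cm

def is_land(elev_cm, sea_level_offset_cm):
    """True if the cell stood above the sea at the given epoch."""
    return elevation_at_epoch(elev_cm, sea_level_offset_cm) > 0

# Illustrative only: a cell 50 m below modern sea level, and an epoch whose
# sea level stood 120 m lower than today.
shelf_cell_cm = -5000
example_offset_cm = -12000
exposed = is_land(shelf_cell_cm, example_offset_cm)
```

Because the stored elev_cm never changes, a revised offset for an epoch takes effect without re-running the pipeline.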

GEBCO 2025 Source Identifier Grid (SID)

| Property | Value |
| --- | --- |
| Triage | SIMULATION · DEFERRED |
| Status | NOT ON DRIVE |
| Format | GeoTIFF, same tile structure as GEBCO elevation |
| Size | ~6.8 GB |
| Field populated | elev_src refinement, elev_conf upgrade path |
| Source URL | https://www.gebco.net/data_and_products/gridded_bathymetry_data/ |
| License | CC-BY 4.0 |
| Notes | Per-cell identifier of the underlying data source (ship soundings, satellite altimetry, modelled). Required to upgrade elev_conf from indicated to measured for well-surveyed cells. Load when academic participation begins. |

ESA WorldCover 2021 v200

| Property | Value |
| --- | --- |
| Triage | GAME · RESTORATION (modern baseline) |
| Status | ON DRIVE: TESSERA_WORLD/worldcover/ |
| Format | GeoTIFF, 513 tiles, 3°×3° each |
| Size on drive | 22 GB |
| CRS | EPSG:4326 |
| Resolution | 1/12000° (~10m at equator) |
| Tile shape | 36000 × 36000 px per tile |
| Bands | 1 (uint8, land-cover class, nodata=0) |
| Field populated | terrain |
| Confidence | indicated (2) |
| Source URL | https://esa-worldcover.org/ |
| License | CC-BY 4.0 |
| Citation | Zanaga et al. (2022) ESA WorldCover 10m 2021 v200 (doi:10.5281/zenodo.7254221) |
| Coverage | ~15°N–72°N, 15°W–75°E; all five OTIVM waypoints covered |
| Notes | Snapshot of 2021 land cover. Reflects modern urbanisation, drainage, agriculture. This is the modern baseline: the restoration layer corrects it backward to Roman or Mesolithic conditions. Pipeline stores the modern WorldCover value in terrain; restoration is applied at query time using HYDE 3.3 and KK10. Read strategy: identify tile by H5 centroid coords, crop to H5 bounding box (~7200×7200px at 10m), load into RAM. |
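Identifying the tile from the H5 centroid can be sketched as below. The sketch assumes the files on the drive keep ESA's distribution naming, in which each 3°×3° tile is labelled by its south-west corner. If the drive uses different names, only the format string changes. The function name is hypothetical.

```python
import math

def worldcover_tile_name(lon, lat):
    """File name of the 3°×3° WorldCover tile containing (lon, lat).

    Assumes ESA's naming by south-west corner, e.g. N39E012.
    """
    lat0 = math.floor(lat / 3) * 3  # southern edge of the tile
    lon0 = math.floor(lon / 3) * 3  # western edge of the tile
    ns = "N" if lat0 >= 0 else "S"
    ew = "E" if lon0 >= 0 else "W"
    return (
        f"ESA_WorldCover_10m_2021_v200_{ns}{abs(lat0):02d}{ew}{abs(lon0):03d}_Map.tif"
    )

# Approximate coordinates, for illustration only.
ostia_tile = worldcover_tile_name(12.29, 41.75)
alexandria_tile = worldcover_tile_name(29.90, 31.20)
```

An H5 bounding box that straddles a tile edge needs the tile name computed for each corner and the crops stitched; that case is not handled here.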

HydroSHEDS v1.1

| Property | Value |
| --- | --- |
| Triage | GAME · SIMULATION |
| Status | ON DRIVE: TESSERA_APR26/hydrosheds/ |
| Format | GeoTIFF, 10 tiles (flow direction + flow accumulation, per region) |
| Size on drive | 240 MB |
| CRS | EPSG:4326 |
| Resolution | 15 arc-sec |
| Bands | Flow direction: uint8; flow accumulation: uint32 |
| Regions on drive | Africa, Arctic, Asia, Europe, Siberia |
| Field populated | hydro |
| Confidence | indicated (2) |
| Source URL | https://www.hydrosheds.org/ |
| License | CC-BY 4.0 |
| Citation | Lehner et al. (2022) HydroSHEDS v1.1 Technical Documentation. WWF US, Washington DC. |
| Notes | HydroSHEDS v2.0 expected October 2026; review then. Flow accumulation threshold for hydro classification defined in RFC-TESSERA-2.0-001 Section 3.3. Rivers have migrated since the Roman period; the restoration layer corrects major drainage changes. |

USGS MRDS — Mineral Resources Data System

| Property | Value |
| --- | --- |
| Triage | GAME · SIMULATION · SAMPLE ONLY |
| Status | ON DRIVE: TESSERA_APR26/mrds/mrds.csv |
| Format | CSV, 16 MB |
| Field populated | geo_dep |
| Confidence | indicated (2) where deposit present, no_data (4) elsewhere |
| Source URL | https://mrdata.usgs.gov/mrds/ |
| DOI | 10.3133/ds52 |
| License | USGS public domain |
| Notes | Point dataset: bounding box query per H5, then assign the deposit code to the nearest H9 centroid within a threshold distance. Coverage is uneven: MRDS systematic updates ceased in 2011 and European coverage is sparse. Known issue: the Almadén mercury mine coordinates in MRDS are ~34km from the actual mine location. This is an MRDS data quality issue, not a pipeline error. |
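The MRDS assignment step can be sketched as follows, assuming deposits arrive as (lon, lat, code) tuples and centroids as a dict keyed by H9 id. The function names, the brute-force nearest search, and the rule that the closer deposit wins when two land on the same cell are all illustrative assumptions; the threshold distance is left as a parameter because this registry does not state it.

```python
import math

EARTH_RADIUS_M = 6_371_000.0

def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in metres between two lon/lat points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def assign_deposits(deposits, centroids, bbox, max_dist_m):
    """Map h9_id -> deposit code for deposits inside bbox and within max_dist_m."""
    west, south, east, north = bbox
    best = {}  # h9_id -> (distance_m, code)
    if not centroids:
        return {}
    for lon, lat, code in deposits:
        if not (west <= lon <= east and south <= lat <= north):
            continue  # outside this H5's bounding box
        h9_id, dist = min(
            ((cid, haversine_m(lon, lat, clon, clat))
             for cid, (clon, clat) in centroids.items()),
            key=lambda pair: pair[1],
        )
        # Assumed tie-break: keep the closest deposit per cell.
        if dist <= max_dist_m and (h9_id not in best or dist < best[h9_id][0]):
            best[h9_id] = (dist, code)
    return {cid: code for cid, (_, code) in best.items()}

centroids = {"a": (12.000, 41.000), "b": (12.010, 41.000)}
deposits = [(12.0005, 41.0, "Cu"), (12.5, 41.0, "Fe"), (12.05, 41.0, "Pb")]
assigned = assign_deposits(deposits, centroids, (11.9, 40.9, 12.1, 41.1), 200.0)
```

In the example, "Fe" is dropped by the bounding box and "Pb" by the distance threshold, so only "Cu" is assigned.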


BGR IGME5000 — International Geological Map of Europe

| Property | Value |
| --- | --- |
| Triage | GAME · SIMULATION · DEFERRED |
| Status | NOT ON DRIVE |
| Format | Shapefile, ~200 MB |
| Field populated | geo_flag |
| Confidence | indicated (2) where covered, no_data (4) outside European shelf |
| Source URL | https://www.bgr.bund.de/igme5000 |
| ArcGIS REST | https://services.bgr.de/arcgis/rest/services/geologie/igme5000/MapServer/23 |
| License | GeoNutzV 2013 (open, no registration) |
| Citation | Datenquelle: IGME5000, (c) BGR Hannover, 2007 |
| Drive target | TESSERA_APR26 (21 GB free, fits easily) |
| Notes | Rock class and confidence in bit layout per RFC-TESSERA-2.0-001. Method: H5 bounding box query → shapely point-in-polygon for H9 centroids. Previously used as a live API in the old pipeline (stage 04a); download the shapefile to eliminate the API dependency. Add to Drive 1 before running the geo_flag stage. |
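The point-in-polygon step can be illustrated without dependencies. The sketch below uses a plain ray-casting test as a stand-in for the shapely call named in the method. It handles a single outer ring only (no holes, no multipolygons), and the function names and the (rock_class, ring) input shape are hypothetical.

```python
def point_in_ring(lon, lat, ring):
    """Ray-casting test: True if (lon, lat) lies inside the ring.

    ring is a list of (lon, lat) vertices; repeating the first vertex at the
    end is allowed but not required.
    """
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        if (y1 > lat) != (y2 > lat):  # edge crosses the horizontal ray
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

def classify_centroids(centroids, polygons):
    """Map h9_id -> rock class of the first containing polygon, else None."""
    return {
        cid: next(
            (rock for rock, ring in polygons if point_in_ring(lon, lat, ring)),
            None,
        )
        for cid, (lon, lat) in centroids.items()
    }

# Synthetic square polygon and two centroids, for illustration only.
polygons = [("limestone", [(0, 0), (2, 0), (2, 2), (0, 2)])]
result = classify_centroids({"in": (1.0, 1.0), "out": (3.0, 1.0)}, polygons)
```

Filtering the polygons to those intersecting the H5 bounding box first keeps the inner loop small, which is the purpose of the bounding box query in the method.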

HYDE 3.3 — History Database of the Global Environment

| Property | Value |
| --- | --- |
| Triage | GAME · RESTORATION · SIMULATION |
| Status | NOT ON DRIVE |
| Format | ASCII grid / NetCDF, ~4 GB full dataset |
| Resolution | 5 arc-min (~10km) |
| Coverage | Global, 10000 BCE → 2017 CE; snapshot interval varies by period (coarser for early epochs, decadal for recent centuries) |
| Field populated | Restoration layer input: terrain_restored derivation |
| Source URL | https://www.pbl.nl/en/image/links/hyde |
| License | Free for non-commercial use with attribution |
| Citation | Klein Goldewijk et al. (2017) Anthropogenic land use estimates for the Holocene: HYDE 3.2. Earth System Science Data 9. |
| Drive target | TESSERA_APR26 (21 GB free, fits) |
| Notes | Gridded land use and population density per epoch. Primary tool for identifying which modern built-up/cropland cells were forested or natural in the Roman or Mesolithic period. Essential for the restoration layer. Without HYDE, terrain is wrong for all five waypoints. Load before building the restoration pipeline stage. |

KK10 — Kaplan et al. Potential Natural Vegetation

| Property | Value |
| --- | --- |
| Triage | RESTORATION · SIMULATION |
| Status | NOT ON DRIVE |
| Format | NetCDF / ASCII, ~500 MB |
| Coverage | Global, pre-agricultural baseline |
| Field populated | Restoration layer input: forest biome type |
| Source URL | https://www.geo.uzh.ch/en/units/h2k/Services/KK10-Reconstruction.html |
| License | Academic use; verify before commercial deployment |
| Citation | Kaplan et al. (2011) Holocene carbon emissions as a result of anthropogenic land cover change. The Holocene 21(5). |
| Drive target | TESSERA_APR26 |
| Notes | Reconstructs what vegetation would naturally grow at each location without human interference: the forest baseline before 10000 BCE. Answers the question: was this cell forest, grassland, or shrub in its natural state? Used in combination with HYDE to determine restoration class. The 60–70% Mediterranean forest cover target is grounded in this dataset. |

ESA CCI Land Cover Time Series (1992–2020)

| Property | Value |
| --- | --- |
| Triage | RESTORATION · DEFERRED |
| Status | NOT ON DRIVE |
| Format | NetCDF, ~3 GB Mediterranean subset (~20 GB global) |
| Resolution | 300m |
| Coverage | Global, annual snapshots 1992–2020 |
| Source URL | https://www.esa-landcover-cci.org |
| License | Free for research use |
| Drive target | TESSERA_APR26 (download Mediterranean subset only) |
| Notes | 29 annual land cover maps (1992 to 2020 inclusive). Used to identify the trajectory of recent urbanisation: which cells changed from natural to built-up in living memory, versus which have been urban since before the record. Helps identify cells where WorldCover 2021 is misleading. Secondary to HYDE for restoration work. Defer until the restoration pipeline stage is active. |
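The trajectory test described in the notes can be sketched per cell as below. The sketch assumes the cell's annual classes are available as a year-sorted list and that code 190 marks urban areas, as in the CCI land cover legend. The function name and the three labels are hypothetical.

```python
URBAN_CODE = 190  # urban areas in the ESA CCI land cover legend

def urban_trajectory(series, urban_code=URBAN_CODE):
    """Classify one cell's urbanisation history.

    series is a list of (year, class_code) sorted by year. Returns a
    (label, year) pair, where year is the first urban year or None.
    """
    urban_years = [year for year, code in series if code == urban_code]
    if not urban_years:
        return ("never_urban", None)
    if urban_years[0] == series[0][0]:
        # Already urban in the first map, so the change predates the record.
        return ("urban_before_record", urban_years[0])
    return ("urbanised_in_record", urban_years[0])

# Synthetic series, for illustration only (10 is a cropland code).
recent = urban_trajectory([(1992, 10), (1993, 10), (1994, 190), (1995, 190)])
old = urban_trajectory([(1992, 190), (1993, 190)])
```

Cells labelled urbanised_in_record are the ones where the pre-urban CCI class is directly observed; for the others, the earlier state has to come from HYDE.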

Copernicus DEM GLO-30

| Property | Value |
| --- | --- |
| Triage | SIMULATION · DEFERRED |
| Status | NOT ON DRIVE |
| Format | GeoTIFF tiles, ~8 GB Mediterranean subset (~170 GB global) |
| Resolution | 30m (~15× finer grid spacing than GEBCO's ~450m) |
| Coverage | Global (Mediterranean subset: Italy, Greece, North Africa, Levant, Iberia) |
| Source URL | https://spacedata.copernicus.eu/collections/copernicus-digital-elevation-model |
| License | Free for non-commercial use |
| Drive target | TESSERA_WORLD (7 GB free; the ~8 GB Mediterranean subset does not fit unless the subset is trimmed or space is freed) |
| Notes | Already used in CIVICVS: hillshade.png was generated from it. For OTIVM, GEBCO 450m is adequate for H9 cells (~180m). GLO-30 becomes relevant at H11–H13 resolution (future CIVICVS work) and for archaeologically precise terrain modelling. Load when the CIVICVS simulation reaches academic production. |

HydroRivers (HydroSHEDS river network polylines)

| Property | Value |
| --- | --- |
| Triage | GAME · RESTORATION · DEFERRED |
| Status | NOT ON DRIVE |
| Format | Shapefile, ~500 MB for Europe + Africa |
| Source URL | https://www.hydrosheds.org/products/hydrorivers |
| License | CC-BY 4.0 |
| Drive target | TESSERA_APR26 |
| Notes | Named river polylines with flow order and discharge estimates. Relevant for OTIVM trade route logic (river vs sea routing) and CIVICVS foraging and settlement. Rivers have migrated since the Roman period: the Po, Tiber, and Nile deltas have all changed substantially. The restoration layer will need this to place river channels in Roman-era positions. |

HydroLAKES

| Property | Value |
| --- | --- |
| Triage | GAME · RESTORATION · DEFERRED |
| Status | NOT ON DRIVE |
| Format | Shapefile, ~800 MB global |
| Source URL | https://www.hydrosheds.org/products/hydrolakes |
| License | CC-BY 4.0 |
| Drive target | TESSERA_APR26 |
| Notes | Polygon dataset of lakes and reservoirs. Many Mediterranean reservoirs are modern, for example Lake Nasser and various Iberian reservoirs. The restoration layer must identify and remove modern reservoirs and restore natural lake extents. |

ICE-6G_C — Peltier et al. Glacial Isostatic Adjustment model

| Property | Value |
| --- | --- |
| Triage | SIMULATION · DEFERRED |
| Status | NOT ON DRIVE |
| Format | NetCDF, ~2 GB |
| Coverage | Global, 26000 BP → present, 500-year intervals |
| Source URL | http://www.atmosp.physics.utoronto.ca/~peltier/data.php |
| License | Academic use; contact Peltier group for data access |
| Citation | Peltier et al. (2015) Space geodesy constrains ice age terminal deglaciation: The global ICE-6G_C (VM5a) model. Journal of Geophysical Research: Solid Earth 120. |
| Drive target | TESSERA_APR26 |
| Notes | The paleo_epochs table currently uses global eustatic offsets with no GIA correction, as acknowledged in the table notes. ICE-6G_C provides per-cell relative sea level change including isostatic rebound, which is significant in the Baltic and North Sea (Doggerland) but modest in the Mediterranean. Required for RFC-TESSERA-3.0-PALEO-001 to be academically rigorous. Defer until simulation reaches academic production. |

BIOME 6000 — Pollen-based biome reconstructions

| Property | Value |
| --- | --- |
| Triage | SIMULATION · DEFERRED |
| Status | NOT ON DRIVE |
| Format | CSV / Shapefile, ~100 MB |
| Coverage | Global pollen sites, 6000 BP and 21000 BP snapshots |
| Source URL | https://www.bridge.bris.ac.uk/projects/BIOME_6000 |
| License | Open academic |
| Drive target | TESSERA_APR26 |
| Notes | Empirical ground truth from pollen records for vegetation at 6000 BP and 21000 BP. Validates the KK10 and HYDE restoration assumptions against real palaeoecological data. Essential for academic defensibility: this is the dataset that reviewers will ask about. Defer until simulation track begins. |

Zanon et al. 2018 — European Holocene forest cover reconstruction

| Property | Value |
| --- | --- |
| Triage | SIMULATION · DEFERRED |
| Status | NOT ON DRIVE |
| Format | NetCDF, ~200 MB |
| Coverage | Europe, 11700 BP → present |
| Source URL | https://doi.org/10.1177/0959683617715643 |
| License | Academic: open access paper, data on request |
| Notes | Spatially explicit reconstruction of European forest cover from pollen records. Directly quantifies the 60–70% Mesolithic forest cover by region. The scientific authority behind our restoration target. Defer until simulation track. |

Natural Earth 1:10m vectors

| Property | Value |
| --- | --- |
| Triage | GAME · DEFERRED |
| Status | NOT ON DRIVE |
| Format | Shapefile, ~500 MB |
| Coverage | Global |
| Source URL | https://www.naturalearthdata.com |
| License | Public domain |
| Drive target | Either drive |
| Notes | Country boundaries, coastlines, urban areas, rivers at 1:10m scale. Lightweight. Useful for the Azgaar bridge (OTIVM-VII) and CIVICVS territory mapping. Load when OTIVM-VII begins. |

HYDE 3.3 — note on Roman-era urban erasure

This note applies to all five OTIVM waypoints and is the central challenge of the restoration layer.

The Mediterranean basin today bears the accumulated footprint of 2000 years of continuous urban development, land drainage, deforestation, and agricultural intensification. The modern WorldCover classification for cells around our five waypoints reflects:

| City | Modern WorldCover reality |
| --- | --- |
| Ostia | Built-up, drained wetland, Tiber mouth migrated ~3km south |
| Capua | Dense Campanian urban sprawl over former ager campanus farmland |
| Brundisium | Port city, coastal modifications, harbour dredged and extended |
| Carthago | Tunis urban core directly over the Punic and Roman city |
| Alexandria | Continuously occupied for 2300 years; modern city 5× Roman extent |

For every one of these cells, terrain from WorldCover 2021 is wrong for any historical period. The restoration pipeline must detect these cells (HYDE classification = built-up or cropland AND KK10 baseline = forest or shrubland) and override with the historically appropriate class.

This is not a minor adjustment. For cells within 10km of any of our five waypoints, the majority of H9 cells will require restoration. The pipeline must treat WorldCover as the modern baseline and HYDE+KK10 as the correction layer, not an optional add-on.
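The detection rule stated above can be written down directly. In this minimal sketch the class labels are hypothetical strings standing in for the real HYDE and KK10 codes, and returning the KK10 baseline class as the override is an assumption: the text only requires "the historically appropriate class".

```python
MODERN_HUMAN = {"built_up", "cropland"}   # HYDE classes that trigger restoration
NATURAL_BASELINE = {"forest", "shrubland"}  # KK10 classes that trigger restoration

def needs_restoration(hyde_class, kk10_class):
    """True when HYDE is built-up/cropland AND the KK10 baseline is forest/shrubland."""
    return hyde_class in MODERN_HUMAN and kk10_class in NATURAL_BASELINE

def restored_terrain(worldcover_class, hyde_class, kk10_class):
    """Return the query-time terrain class for one H9 cell.

    Assumption: when the rule fires, the KK10 baseline class is used as the
    override; otherwise the stored WorldCover value passes through unchanged.
    """
    if needs_restoration(hyde_class, kk10_class):
        return kk10_class
    return worldcover_class

# Illustrative cells only.
urban_cell = restored_terrain("built_up", "built_up", "forest")
grass_cell = restored_terrain("grassland", "pasture", "grassland")
```

Keeping the rule as a pure function of three inputs matches the design above: WorldCover stays in the stored row and the correction is computed at query time.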


Drive addition plan — next actions before pipeline build

The following datasets should be downloaded and added to the drives before the pipeline design is finalised. In priority order:

  1. BGR IGME5000 shapefile (~200MB) → Drive 1. Required for geo_flag field. The only blocked field in the current schema.

  2. HYDE 3.3 (~4GB) → Drive 1. Required for the restoration layer. Without it, terrain is wrong for all five waypoints.

  3. KK10 potential natural vegetation (~500MB) → Drive 1. Required alongside HYDE for restoration biome typing.

  4. HydroRivers Europe + Africa (~500MB) → Drive 1. Required for accurate hydro classification near rivers.

Remaining datasets (ICE-6G_C, BIOME 6000, Zanon, CCI, GLO-30) deferred until simulation track begins. Total for priority additions: ~5.2GB — fits on Drive 1 with 21GB free.
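The space budget above can be checked with a few lines. The sizes are the approximate figures quoted in this registry; the variable names are illustrative.

```python
# Approximate sizes in GB, as quoted in the priority list above.
PRIORITY_ADDITIONS_GB = {
    "BGR IGME5000": 0.2,
    "HYDE 3.3": 4.0,
    "KK10": 0.5,
    "HydroRivers (Europe + Africa)": 0.5,
}
DRIVE1_FREE_GB = 21.0  # TESSERA_APR26 free space at inventory time

total_gb = sum(PRIORITY_ADDITIONS_GB.values())
remaining_gb = DRIVE1_FREE_GB - total_gb
fits = total_gb <= DRIVE1_FREE_GB

print(f"priority additions: {total_gb:.1f} GB, remaining on Drive 1: {remaining_gb:.1f} GB")
```

Re-running this after each download, with measured sizes in place of the estimates, keeps the inventory figures honest.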


Schema fields — dataset mapping

| Field | Source dataset | Triage | Status |
| --- | --- | --- | --- |
| elev_cm | GEBCO 2025 | GAME | ON DRIVE |
| terrain | ESA WorldCover 2021 (modern) + HYDE 3.3 (restored) | GAME + RESTORATION | WorldCover ON DRIVE; HYDE NOT ON DRIVE |
| hydro | HydroSHEDS v1.1 + HydroRivers (refined) | GAME | HydroSHEDS ON DRIVE; HydroRivers NOT ON DRIVE |
| geo_dep | USGS MRDS | GAME | ON DRIVE |
| geo_flag | BGR IGME5000 | GAME | NOT ON DRIVE |
| occ_flag | Stage 06: archaeological sources (ARIADNE, SEAD) | SIMULATION | NOT STARTED |
| paleo_epochs | Lambeck 2014 + ICE-6G_C (future) | GAME + SIMULATION | Table populated with eustatic values; ICE-6G_C deferred |

TESSERA-dataset-registry.md — 2026-04-27 Written by Claude Sonnet 4.6 with full session and inventory context. Update this document whenever a new dataset is evaluated or added to a drive.