civicvs/docs/corpus/mesolithic-corpus-standard-v1.md
TheRON 34316d2429 init: CIVICVS repository — CBPs, corpus standard, directory structure
- README.md: project identity, TESSERA relationship, directory layout
- CIVICVS-CBPs.md: CBP-001 through CBP-006 adapted for CIVICVS
- docs/corpus/mesolithic-corpus-standard-v1.md: 10-table schema, 6-sprint plan, 25 seed concepts
Per CBP-001: committed same session as produced.
2026-04-18 05:29:09 +00:00


Mesolithic Corpus Standard

Version: 1.0

Status: Normative

Date: 2026-04-13

Author: Claude Sonnet 4.6, approved by project owner


1. Mission and scope

Build a defensible Mesolithic Thesaurus, Vocabulary, and Dictionary in Saltcorn to support controlled corpus generation for a language model grounded in prehistoric lifeways.

1.1 Core outputs

| Output | Purpose |
|---|---|
| Thesaurus | Meaning relationships: domains, concepts, scales, frames |
| Vocabulary | Approved lexical forms per concept |
| Dictionary | Human-readable entries combining concept + vocabulary |
| Ground truth corpus | Stable causal relations for model training |
| Simulation triage corpus | Decision and priority patterns for model training |

1.2 Constraints

  • No modern units or modern-only categories in any generated language
  • Meaning-first design — surface forms are secondary to semantic structure
  • Culture-aware context — concepts tagged to applicable culture horizons
  • UI-first workflow — table → view → page → data, without exception
  • Constraint enforcement is editorial, not schema-enforced. A future model analysis pass will check the corpus for violations. No constraint tables in this schema.

1.3 Initial focus

Maglemosian / Nerava northern wetland context. All four culture horizons are represented in the schema but Maglemosian is populated first.

1.4 Out of scope

  • Game systems and full simulation engines
  • Speculative conlang reconstruction
  • Broad ontology sprawl
  • Academic citation management
  • Constraint enforcement tables (deferred to model analysis)

2. Schema

Eleven tables, defined in 2.1 through 2.11. No table is added without a proven workflow need.

2.1 domain

The semantic domain hierarchy. Domains are self-referential — a domain can have a parent domain.

| Field | Type | Notes |
|---|---|---|
| id | integer | Primary key |
| label | text | Human-readable name (e.g. "Weather", "Wetness") |
| parent_id | integer | References domain.id; null for top-level domains |

Seed domains (in priority order): Weather, Wetness, Fire, Shelter, Water travel, Hunting, Fishing, Injury, Storage, Terrain, Time cycles, Social roles.
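
The self-referential parent_id can be walked upward to recover a domain's full path. A minimal sketch, assuming the rows are held in memory as a dictionary; the child domain "Rain" and the function name `domain_path` are hypothetical illustrations, not part of the standard:

```python
from typing import Optional

# Hypothetical in-memory rows mirroring the domain table: id -> (label, parent_id).
# "Rain" as a child of "Weather" is an invented example, not a seed domain.
DOMAINS: dict[int, tuple[str, Optional[int]]] = {
    1: ("Weather", None),
    2: ("Wetness", None),
    3: ("Rain", 1),
}


def domain_path(domain_id: int, domains: dict[int, tuple[str, Optional[int]]]) -> list[str]:
    """Return labels from the top-level domain down to domain_id."""
    path: list[str] = []
    seen: set[int] = set()
    current: Optional[int] = domain_id
    while current is not None:
        if current in seen:
            # A parent chain that loops back on itself is a data-entry error.
            raise ValueError(f"cycle in domain hierarchy at id {current}")
        seen.add(current)
        label, parent_id = domains[current]
        path.append(label)
        current = parent_id
    return list(reversed(path))


print(domain_path(3, DOMAINS))  # ['Weather', 'Rain']
```

The cycle guard matters because nothing in the table definition itself prevents a domain from being set as its own ancestor.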


2.2 culture

The four target Mesolithic culture horizons. Lookup table — values are fixed and do not grow without explicit decision.

| Field | Type | Notes |
|---|---|---|
| id | integer | Primary key |
| label | text | Culture name |
| ecology_note | text | Brief ecological context |
| date_range_note | text | Approximate date range |

Fixed values:

| Label | Ecology | Date range |
|---|---|---|
| Maglemosian | Northern lake/peatland, open woodland | ~9500-6000 BCE |
| Ertebølle | Coastal, lagoonal, shell midden | ~5400-3900 BCE |
| Sauveterrian | Western Mediterranean upland/lowland | ~9000-6000 BCE |
| Azilian | Franco-Cantabrian cave/rock-shelter | ~12000-9000 BCE |

2.3 concept

The core meaning nodes of the thesaurus. Each concept belongs to a domain and carries an evidence grade.

| Field | Type | Notes |
|---|---|---|
| id | integer | Primary key |
| domain_id | integer | References domain.id |
| label | text | Concept identifier (e.g. "wet", "ember", "crossing") |
| definition | text | Plain language definition, measurement-free |
| evidence_grade | enum | direct / analogue / inferred |
| notes | text | Optional authoring notes |

Evidence grade values:

  • direct — concept is directly supported by archaeological record
  • analogue — concept is supported by ethnographic analogue
  • inferred — concept follows from physical or ecological inference

Culture applicability is stored in concept_culture, not here.


2.4 concept_culture

Join table linking concepts to applicable culture horizons. A concept with no rows here applies to all cultures.

| Field | Type | Notes |
|---|---|---|
| id | integer | Primary key |
| concept_id | integer | References concept.id |
| culture_id | integer | References culture.id |
| context_note | text | Optional note on culture-specific usage |
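
The "no rows means all cultures" rule is easy to get backwards in a query, so it may help to see it spelled out. A minimal sketch, assuming concept_culture rows are available as (concept_id, culture_id) pairs; the ids and the function name `applies_to_culture` are hypothetical:

```python
from typing import Iterable


def applies_to_culture(
    concept_id: int,
    culture_id: int,
    links: Iterable[tuple[int, int]],
) -> bool:
    """True if the concept applies to the culture under the concept_culture rule.

    links holds (concept_id, culture_id) rows from concept_culture.
    A concept with no rows at all is treated as applying to every culture.
    """
    tagged = {cul for con, cul in links if con == concept_id}
    return not tagged or culture_id in tagged


# Hypothetical ids: concept 10 is tagged to culture 1 only; concept 11 has no rows.
LINKS = [(10, 1)]
print(applies_to_culture(10, 1, LINKS))  # True
print(applies_to_culture(10, 2, LINKS))  # False
print(applies_to_culture(11, 2, LINKS))  # True
```

One consequence worth noting: adding the first concept_culture row for a concept narrows it from all cultures to exactly one.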

2.5 scale

A gradient dimension associated with a concept. A concept may have multiple scales (e.g. "wetness" has a dryness scale and a weight scale).

| Field | Type | Notes |
|---|---|---|
| id | integer | Primary key |
| concept_id | integer | References concept.id |
| label | text | Scale name (e.g. "dryness", "ice safety") |

2.6 scale_step

Ordered steps within a scale. Steps are ordered by rank and may reference an antonym step.

| Field | Type | Notes |
|---|---|---|
| id | integer | Primary key |
| scale_id | integer | References scale.id |
| rank | integer | Ordering; lower rank = one end of the spectrum |
| label | text | Step label (e.g. "dry", "damp", "soaked") |
| antonym_step_id | integer | References another scale_step.id; optional |
| is_danger_threshold | boolean | Marks steps that represent hazard onset |
| notes | text | Optional authoring notes |

Example — wetness scale:

| Rank | Label | Danger threshold |
|---|---|---|
| 1 | dry | No |
| 2 | damp | No |
| 3 | wet | No |
| 4 | soaked | Yes |
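
The wetness example can be expressed as data to show how rank and is_danger_threshold work together. A minimal sketch, assuming steps are loaded into plain Python objects; the class `ScaleStep` and the helper names are hypothetical and carry only the fields needed here:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ScaleStep:
    rank: int
    label: str
    is_danger_threshold: bool = False


# The wetness scale from the example table, deliberately listed out of order.
WETNESS = [
    ScaleStep(3, "wet"),
    ScaleStep(1, "dry"),
    ScaleStep(4, "soaked", is_danger_threshold=True),
    ScaleStep(2, "damp"),
]


def ordered(steps: list[ScaleStep]) -> list[ScaleStep]:
    """Sort steps by rank, lowest first."""
    return sorted(steps, key=lambda step: step.rank)


def first_danger_step(steps: list[ScaleStep]) -> Optional[ScaleStep]:
    """Return the lowest-ranked step marked as hazard onset, if any."""
    return next((s for s in ordered(steps) if s.is_danger_threshold), None)


print([s.label for s in ordered(WETNESS)])  # ['dry', 'damp', 'wet', 'soaked']
print(first_danger_step(WETNESS).label)     # soaked
```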

2.7 frame

An action frame associated with a concept. Stores the typical roles (actor, patient, tool, place) for actions involving this concept. One frame per concept is the norm; complex concepts may have more.

| Field | Type | Notes |
|---|---|---|
| id | integer | Primary key |
| concept_id | integer | References concept.id |
| label | text | Frame name (e.g. "drying hides", "crossing river") |
| actor | text | Who performs the action |
| patient | text | What is acted upon |
| tool | text | What instrument is used |
| place | text | Where the action occurs |
| notes | text | Optional authoring notes |

2.8 vocabulary_item

Approved lexical forms for a concept. A concept may have multiple vocabulary items — one preferred, others allowed alternates.

| Field | Type | Notes |
|---|---|---|
| id | integer | Primary key |
| concept_id | integer | References concept.id |
| term | text | The surface form (e.g. "wet", "soaked", "waterlogged") |
| preferred | boolean | True for the primary term |
| register | text | Usage register (e.g. "narrative", "triage", "both") |
| status | enum | approved / deprecated / restricted |
| notes | text | Optional governance notes |

Status values:

  • approved — use freely
  • deprecated — do not use in new corpus items; kept for historical record
  • restricted — use only in specified contexts (noted in notes)

2.9 corpus_item

A single ground truth or triage corpus item. Ground truth items teach stable causal relations. Triage items teach decisions and priorities.

| Field | Type | Notes |
|---|---|---|
| id | integer | Primary key |
| corpus_type | enum | ground_truth / triage |
| culture_id | integer | References culture.id; null means all cultures |
| text | text | The corpus statement (ground truth) or scenario (triage) |
| confidence | enum | high / medium / low |
| approved | boolean | True when reviewed and approved for training use |
| notes | text | Optional authoring notes |

Ground truth example:

corpus_type: ground_truth
text: "Fire dries wet hides."
confidence: high
approved: true

Triage example:

corpus_type: triage
text: "Hunter returns with deep leg wound and cannot walk unassisted."
confidence: high
approved: true

Triage options are stored in triage_option.


2.10 corpus_concept

Join table linking corpus items to the concepts they involve. Enables completeness checks and concept-driven corpus browsing.

| Field | Type | Notes |
|---|---|---|
| id | integer | Primary key |
| corpus_item_id | integer | References corpus_item.id |
| concept_id | integer | References concept.id |
| role_note | text | Optional note on how concept appears in this item |
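
The completeness checks this join table enables can be run in both directions: corpus items with no concept, and concepts with no corpus item. A minimal sketch, assuming corpus_concept rows are available as (corpus_item_id, concept_id) pairs; all ids and function names are hypothetical:

```python
from typing import Iterable


def unlinked_items(item_ids: Iterable[int], links: Iterable[tuple[int, int]]) -> list[int]:
    """Corpus item ids that have no corpus_concept row."""
    linked = {item_id for item_id, _ in links}
    return sorted(set(item_ids) - linked)


def uncovered_concepts(concept_ids: Iterable[int], links: Iterable[tuple[int, int]]) -> list[int]:
    """Concept ids that no corpus item refers to yet."""
    covered = {concept_id for _, concept_id in links}
    return sorted(set(concept_ids) - covered)


# Hypothetical data: items 1-3, concepts 10-12, two link rows.
ITEM_IDS = [1, 2, 3]
CONCEPT_IDS = [10, 11, 12]
LINKS = [(1, 10), (2, 10)]

print(unlinked_items(ITEM_IDS, LINKS))         # [3]
print(uncovered_concepts(CONCEPT_IDS, LINKS))  # [11, 12]
```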

2.11 triage_option

Structured options for triage corpus items. Each triage item has 2-4 options, exactly one marked as preferred.

| Field | Type | Notes |
|---|---|---|
| id | integer | Primary key |
| corpus_item_id | integer | References corpus_item.id |
| option_text | text | Description of this option |
| is_preferred | boolean | True for the recommended action |
| reason | text | Why this option is preferred or not |
| rank | integer | Display order |

Example — triage options for wounded hunter scenario:

| Option | Preferred | Reason |
|---|---|---|
| Carry hunter back immediately | Yes | Wound is deep, cannot walk, delay increases risk |
| Continue hunt, send one person back | No | Splits group, leaves hunter without full support |
| Make camp here and rest | No | Wound needs shelter and fire, not open ground |
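
The 2-4 options and exactly-one-preferred rule is not enforced by the table itself, so a reviewer or export script would need to check it. A minimal sketch, assuming each option is a dictionary keyed by the field names above; the function name `triage_option_errors` is hypothetical:

```python
def triage_option_errors(options: list[dict]) -> list[str]:
    """Return rule violations for the options of one triage corpus item."""
    errors: list[str] = []
    if not 2 <= len(options) <= 4:
        errors.append(f"expected 2-4 options, found {len(options)}")
    preferred = sum(1 for option in options if option["is_preferred"])
    if preferred != 1:
        errors.append(f"expected exactly one preferred option, found {preferred}")
    return errors


# The wounded hunter options from the example table.
WOUNDED_HUNTER = [
    {"option_text": "Carry hunter back immediately", "is_preferred": True},
    {"option_text": "Continue hunt, send one person back", "is_preferred": False},
    {"option_text": "Make camp here and rest", "is_preferred": False},
]

print(triage_option_errors(WOUNDED_HUNTER))  # []
```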

3. Workflow rule

Every table follows this delivery sequence without exception:

1. Table     — created in Saltcorn
2. View      — at minimum a list view and a detail view
3. Page      — at minimum one usable entry/edit page
4. Data      — production records entered only via pages, never raw grids

Rules:

  • No production records entered in raw table grids
  • Every new table ships with at least one usable page before data entry begins
  • Build vertically, not horizontally — one complete table/view/page/data cycle before starting the next table

4. Sprint plan

Sprints are ordered by dependency. Do not start a sprint until the previous sprint's data entry phase is complete and verified.

Sprint 1 — Foundation

Tables: domain, culture
Data: 12 seed domains, 4 culture records
Deliverable: domain browser page, culture lookup page

Sprint 2 — Core concepts

Tables: concept, concept_culture
Data: 25 seed concepts from DOC-006, tagged to Maglemosian
Deliverable: concept editor page with domain and culture assignment

Sprint 3 — Scales

Tables: scale, scale_step
Data: scales for wetness, fire state, ice safety, injury severity
Deliverable: scale builder page with ordered steps

Sprint 4 — Frames

Table: frame
Data: frames for key action concepts (drying, crossing, fishing, triage)
Deliverable: frame editor page

Sprint 5 — Vocabulary

Table: vocabulary_item
Data: preferred terms for all 25 seed concepts
Deliverable: vocabulary editor with preferred/alternate/deprecated status

Sprint 6 — Corpus

Tables: corpus_item, corpus_concept, triage_option
Data: first 20 ground truth items, first 10 triage items
Deliverable: corpus entry page, triage option builder, concept linkage
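
The claim that sprints are ordered by dependency can be checked mechanically against the "References" notes in section 2. A minimal sketch, with the sprint contents and references transcribed by hand from this document; the names `SPRINTS`, `REFERENCES`, and `dependency_violations` are hypothetical:

```python
# Tables delivered in each sprint, in sprint order (section 4).
SPRINTS = [
    ["domain", "culture"],
    ["concept", "concept_culture"],
    ["scale", "scale_step"],
    ["frame"],
    ["vocabulary_item"],
    ["corpus_item", "corpus_concept", "triage_option"],
]

# Tables each table references, per the "References" notes in section 2.
REFERENCES = {
    "domain": ["domain"],
    "culture": [],
    "concept": ["domain"],
    "concept_culture": ["concept", "culture"],
    "scale": ["concept"],
    "scale_step": ["scale", "scale_step"],
    "frame": ["concept"],
    "vocabulary_item": ["concept"],
    "corpus_item": ["culture"],
    "corpus_concept": ["corpus_item", "concept"],
    "triage_option": ["corpus_item"],
}


def dependency_violations(
    sprints: list[list[str]],
    references: dict[str, list[str]],
) -> list[tuple[str, str]]:
    """Return (table, referenced_table) pairs where the reference ships in a later sprint."""
    sprint_of = {
        table: number
        for number, tables in enumerate(sprints, start=1)
        for table in tables
    }
    return [
        (table, target)
        for table, targets in references.items()
        for target in targets
        if sprint_of[target] > sprint_of[table]
    ]


print(dependency_violations(SPRINTS, REFERENCES))  # []
```

An empty result means every referenced table is delivered in the same sprint or an earlier one. Within a sprint, the vertical-build rule in section 3 still decides which table comes first.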


5. Seed concepts — Sprint 2 data

From DOC-006. All tagged Maglemosian initially.

| Concept | Domain | Evidence grade |
|---|---|---|
| wet | Wetness | direct |
| dry | Wetness | direct |
| damp | Wetness | direct |
| soaked | Wetness | inferred |
| fire | Fire | direct |
| ember | Fire | direct |
| smoke | Fire | direct |
| shelter | Shelter | direct |
| hide | Shelter | direct |
| bark | Shelter | direct |
| marsh | Terrain | direct |
| reed | Terrain | direct |
| path | Terrain | inferred |
| river | Water travel | direct |
| crossing | Water travel | inferred |
| fish | Fishing | direct |
| trap | Fishing | direct |
| spear | Hunting | direct |
| wound | Injury | direct |
| limp | Injury | inferred |
| carry | Injury | inferred |
| dawn | Time cycles | inferred |
| dusk | Time cycles | inferred |
| elder | Social roles | analogue |
| child | Social roles | analogue |

6. Corpus specification

6.1 Ground truth corpus

Teaches stable causal relations. Statements must be:

  • Present tense, declarative
  • Measurement-free
  • Culturally plausible for the tagged culture
  • Linked to at least one concept via corpus_concept

Field summary:

  • text — the causal statement
  • culture_id — null for universal statements
  • confidence — high/medium/low
  • approved — reviewed and ready for training

Examples:

  • Fire dries wet hides.
  • Rain softens paths.
  • Smoke drives insects away.
  • Wet wood makes reluctant fire.
  • Soaked bark floor cannot be slept on dry.
  • Rising water warns of flood.

6.2 Simulation triage corpus

Teaches decisions and priorities under constraint. Each item must have 2-4 structured options via triage_option, exactly one marked preferred.

Field summary:

  • text — the scenario description
  • culture_id — null for universal scenarios
  • confidence — high/medium/low
  • approved — reviewed and ready for training

Triage option fields:

  • option_text — what this choice involves
  • is_preferred — the recommended action
  • reason — why preferred or not preferred
  • rank — display order

Examples:

  • Wounded hunter cannot walk. (carry first vs continue hunt vs make camp)
  • Fire goes out in heavy rain. (seek dry tinder vs use ember from shelter vs wait)
  • Path floods at crossing. (find higher crossing vs wait vs wade)

7. Lexical governance

7.1 Purpose

Prevent semantic drift. Ensure vocabulary items remain measurement-free and culturally coherent across authors and sessions.

7.2 Controls per vocabulary item

| Control | Field | Notes |
|---|---|---|
| Preferred term | preferred = true | One per concept |
| Allowed alternates | status = approved, preferred = false | Multiple allowed |
| Deprecated terms | status = deprecated | Kept for record, not used in new corpus |
| Restricted terms | status = restricted | Context specified in notes |
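
The "one per concept" control on preferred terms is an editorial rule, so a periodic audit may be useful. A minimal sketch, assuming vocabulary_item rows are available as dictionaries keyed by the field names in 2.8; the concept ids, the term "parched", and the function name `preferred_term_errors` are hypothetical:

```python
from collections import Counter


def preferred_term_errors(items: list[dict]) -> dict[int, int]:
    """Map concept_id -> preferred-term count, for concepts where the count is not 1."""
    counts = Counter(item["concept_id"] for item in items if item["preferred"])
    concept_ids = {item["concept_id"] for item in items}
    # Counter returns 0 for concepts that have no preferred term at all.
    return {cid: counts[cid] for cid in sorted(concept_ids) if counts[cid] != 1}


# Hypothetical rows: concept 1 is well formed, concept 2 has two preferred terms.
ITEMS = [
    {"concept_id": 1, "term": "wet", "preferred": True, "status": "approved"},
    {"concept_id": 1, "term": "waterlogged", "preferred": False, "status": "approved"},
    {"concept_id": 2, "term": "dry", "preferred": True, "status": "approved"},
    {"concept_id": 2, "term": "parched", "preferred": True, "status": "approved"},
]

print(preferred_term_errors(ITEMS))  # {2: 2}
```

The check only sees concepts that already have at least one vocabulary item; concepts with none would need to be found by comparing against the concept table.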

7.3 Approval history

Saltcorn's built-in record history tracks who changed what and when. No separate approval log table is needed at this stage.

7.4 Constraint enforcement

Modern units and modern-only categories are excluded by editorial discipline at authoring time. A future model analysis pass will scan the corpus for violations and flag them for review. No constraint tables are maintained in this schema version.


8. What this does not decide

  • The language model architecture or training pipeline
  • How corpus items are exported to training format
  • Whether vocabulary items are used as literal tokens or as semantic seeds for generation
  • The multi-clan expansion beyond Maglemosian
  • The integration between this corpus and the TESSERA spatial data layer
  • Constraint enforcement implementation (deferred to model analysis pass)

Mesolithic Corpus Standard v1.0, 2026-04-13
Status: Normative
Next review: after Sprint 2 data entry is complete