civicvs/docs/corpus/mesolithic-corpus-standard-v1.md
TheRON 34316d2429 init: CIVICVS repository — CBPs, corpus standard, directory structure
- README.md: project identity, TESSERA relationship, directory layout
- CIVICVS-CBPs.md: CBP-001 through CBP-006 adapted for CIVICVS
- docs/corpus/mesolithic-corpus-standard-v1.md: 10-table schema, 6-sprint plan, 25 seed concepts
Per CBP-001: committed same session as produced.
2026-04-18 05:29:09 +00:00

# Mesolithic Corpus Standard
### Version: 1.0
### Status: Normative
### Date: 2026-04-13
### Author: Claude Sonnet 4.6, approved by project owner
---
## 1. Mission and scope
Build a defensible Mesolithic Thesaurus, Vocabulary, and Dictionary in
Saltcorn to support controlled corpus generation for a language model
grounded in prehistoric lifeways.
### 1.1 Core outputs
| Output | Purpose |
|---|---|
| Thesaurus | Meaning relationships — domains, concepts, scales, frames |
| Vocabulary | Approved lexical forms per concept |
| Dictionary | Human-readable entries combining concept + vocabulary |
| Ground truth corpus | Stable causal relations for model training |
| Simulation triage corpus | Decision and priority patterns for model training |
### 1.2 Constraints
- No modern units or modern-only categories in any generated language
- Meaning-first design — surface forms are secondary to semantic structure
- Culture-aware context — concepts tagged to applicable culture horizons
- UI-first workflow — table → view → page → data, without exception
- Constraint enforcement is editorial, not schema-enforced. A future
model analysis pass will check the corpus for violations, so this
schema contains no constraint tables.
### 1.3 Initial focus
Maglemosian / Nerava northern wetland context. All four culture horizons
are represented in the schema but Maglemosian is populated first.
### 1.4 Out of scope
- Game systems and full simulation engines
- Speculative conlang reconstruction
- Broad ontology sprawl
- Academic citation management
- Constraint enforcement tables (deferred to model analysis)
---
## 2. Schema
Eleven tables. No table is added without a proven workflow need.
### 2.1 `domain`
The semantic domain hierarchy. Domains are self-referential — a domain
can have a parent domain.
| Field | Type | Notes |
|---|---|---|
| id | integer | Primary key |
| label | text | Human-readable name (e.g. "Weather", "Wetness") |
| parent_id | integer | References `domain.id` — null for top-level domains |
**Seed domains (in priority order):**
Weather, Wetness, Fire, Shelter, Water travel, Hunting, Fishing,
Injury, Storage, Terrain, Time cycles, Social roles.
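The self-referential `parent_id` can be walked upward to produce a full path for any domain. The following is a minimal sketch, assuming rows are loaded as plain dictionaries; the `domain_path` helper and the "Rain" child domain are hypothetical and are not part of the seed list.

```python
from typing import Optional

# Hypothetical rows shaped like the `domain` table.
# "Rain" is an invented child domain used only for illustration.
DOMAINS = [
    {"id": 1, "label": "Weather", "parent_id": None},
    {"id": 2, "label": "Wetness", "parent_id": None},
    {"id": 3, "label": "Rain", "parent_id": 1},
]


def domain_path(domain_id: int, rows: list[dict]) -> list[str]:
    """Return labels from the top-level domain down to `domain_id`."""
    by_id = {row["id"]: row for row in rows}
    path: list[str] = []
    seen: set[int] = set()
    current: Optional[int] = domain_id
    while current is not None:
        # Guard against a malformed hierarchy that loops back on itself.
        if current in seen:
            raise ValueError(f"cycle in domain hierarchy at id {current}")
        seen.add(current)
        row = by_id[current]
        path.append(row["label"])
        current = row["parent_id"]
    return list(reversed(path))
```

A top-level domain yields a one-element path, since its `parent_id` is null.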
---
### 2.2 `culture`
The four target Mesolithic culture horizons. Lookup table — values are
fixed and do not grow without explicit decision.
| Field | Type | Notes |
|---|---|---|
| id | integer | Primary key |
| label | text | Culture name |
| ecology_note | text | Brief ecological context |
| date_range_note | text | Approximate date range |
**Fixed values:**
| Label | Ecology | Date range |
|---|---|---|
| Maglemosian | Northern lake/peatland, open woodland | ~9500–6000 BCE |
| Ertebølle | Coastal, lagoonal, shell midden | ~5400–3900 BCE |
| Sauveterrian | Western Mediterranean upland/lowland | ~9000–6000 BCE |
| Azilian | Franco-Cantabrian cave/rock-shelter | ~12000–9000 BCE |
---
### 2.3 `concept`
The core meaning nodes of the thesaurus. Each concept belongs to a
domain and carries an evidence grade.
| Field | Type | Notes |
|---|---|---|
| id | integer | Primary key |
| domain_id | integer | References `domain.id` |
| label | text | Concept identifier (e.g. "wet", "ember", "crossing") |
| definition | text | Plain language definition, measurement-free |
| evidence_grade | enum | `direct` / `analogue` / `inferred` |
| notes | text | Optional authoring notes |
**Evidence grade values:**
- `direct` — concept is directly supported by archaeological record
- `analogue` — concept is supported by ethnographic analogue
- `inferred` — concept follows from physical or ecological inference
Culture applicability is stored in `concept_culture`, not here.
---
### 2.4 `concept_culture`
Join table linking concepts to applicable culture horizons. A concept
with no rows here applies to all cultures.
| Field | Type | Notes |
|---|---|---|
| id | integer | Primary key |
| concept_id | integer | References `concept.id` |
| culture_id | integer | References `culture.id` |
| context_note | text | Optional note on culture-specific usage |
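The "no rows means all cultures" rule is easy to get wrong in queries, so it may help to state it as code. This is a sketch only, assuming `concept_culture` rows are available as dictionaries; the function name and the ids are hypothetical.

```python
# Hypothetical `concept_culture` rows: concept 1 is tagged to culture 1 only,
# concept 2 has no rows at all.
CONCEPT_CULTURE = [
    {"id": 1, "concept_id": 1, "culture_id": 1},
]


def applies_to_culture(concept_id: int, culture_id: int, rows: list[dict]) -> bool:
    """True if the concept applies to the culture under the join-table rule."""
    linked = {r["culture_id"] for r in rows if r["concept_id"] == concept_id}
    # An empty set means the concept is untagged and therefore universal.
    return not linked or culture_id in linked
```

One consequence worth noting: tagging a previously untagged concept to a single culture narrows it from universal to that culture alone.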
---
### 2.5 `scale`
A gradient dimension associated with a concept. A concept may have
multiple scales (e.g. "wetness" has a dryness scale and a weight scale).
| Field | Type | Notes |
|---|---|---|
| id | integer | Primary key |
| concept_id | integer | References `concept.id` |
| label | text | Scale name (e.g. "dryness", "ice safety") |
---
### 2.6 `scale_step`
Ordered steps within a scale. Steps are ordered by rank and may
reference an antonym step.
| Field | Type | Notes |
|---|---|---|
| id | integer | Primary key |
| scale_id | integer | References `scale.id` |
| rank | integer | Ordering within the scale; rank 1 is the low end (e.g. "dry" on the wetness scale) |
| label | text | Step label (e.g. "dry", "damp", "soaked") |
| antonym_step_id | integer | References another `scale_step.id` — optional |
| is_danger_threshold | boolean | Marks steps that represent hazard onset |
| notes | text | Optional authoring notes |
**Example — wetness scale:**
| Rank | Label | Danger threshold |
|---|---|---|
| 1 | dry | No |
| 2 | damp | No |
| 3 | wet | No |
| 4 | soaked | Yes |
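The wetness example can be expressed as data to show how `rank` and `is_danger_threshold` might be used together. This sketch assumes that every step at or beyond the first flagged step counts as hazardous; that reading of "hazard onset" is an assumption, and the `ScaleStep` class and helper names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScaleStep:
    rank: int
    label: str
    is_danger_threshold: bool = False


# The wetness scale from the table above, deliberately stored out of order.
WETNESS = [
    ScaleStep(3, "wet"),
    ScaleStep(1, "dry"),
    ScaleStep(4, "soaked", is_danger_threshold=True),
    ScaleStep(2, "damp"),
]


def ordered(steps: list[ScaleStep]) -> list[ScaleStep]:
    """Return steps sorted by rank, lowest first."""
    return sorted(steps, key=lambda s: s.rank)


def is_hazardous(label: str, steps: list[ScaleStep]) -> bool:
    """True if `label` sits at or beyond the first danger-threshold step."""
    ranked = ordered(steps)
    threshold_ranks = [s.rank for s in ranked if s.is_danger_threshold]
    if not threshold_ranks:
        return False  # no step on this scale marks hazard onset
    rank_of = {s.label: s.rank for s in ranked}
    return rank_of[label] >= threshold_ranks[0]
```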
---
### 2.7 `frame`
An action frame associated with a concept. Stores the typical roles
(actor, patient, tool, place) for actions involving this concept.
One frame per concept is the norm; complex concepts may have more.
| Field | Type | Notes |
|---|---|---|
| id | integer | Primary key |
| concept_id | integer | References `concept.id` |
| label | text | Frame name (e.g. "drying hides", "crossing river") |
| actor | text | Who performs the action |
| patient | text | What is acted upon |
| tool | text | What instrument is used |
| place | text | Where the action occurs |
| notes | text | Optional authoring notes |
---
### 2.8 `vocabulary_item`
Approved lexical forms for a concept. A concept may have multiple
vocabulary items — one preferred, others allowed alternates.
| Field | Type | Notes |
|---|---|---|
| id | integer | Primary key |
| concept_id | integer | References `concept.id` |
| term | text | The surface form (e.g. "wet", "soaked", "waterlogged") |
| preferred | boolean | True for the primary term |
| register | text | Usage register (e.g. "narrative", "triage", "both") |
| status | enum | `approved` / `deprecated` / `restricted` |
| notes | text | Optional governance notes |
**Status values:**
- `approved` — use freely
- `deprecated` — do not use in new corpus items; kept for historical record
- `restricted` — use only in specified contexts (noted in `notes`)
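The status values translate into a simple filter when selecting terms for new corpus items. The sketch below assumes restricted terms are opted into explicitly by the caller; the rows, their statuses, and the `usable_terms` helper are illustrative and not taken from approved data.

```python
# Hypothetical rows shaped like `vocabulary_item`; statuses are illustrative.
VOCABULARY = [
    {"concept_id": 1, "term": "waterlogged", "preferred": False, "status": "deprecated"},
    {"concept_id": 1, "term": "soaked", "preferred": False, "status": "restricted"},
    {"concept_id": 1, "term": "wet", "preferred": True, "status": "approved"},
]


def usable_terms(rows: list[dict], concept_id: int, allow_restricted: bool = False) -> list[str]:
    """Return terms allowed in new corpus items, preferred term first."""
    allowed = {"approved"}
    if allow_restricted:
        allowed.add("restricted")
    matches = [
        r for r in rows
        if r["concept_id"] == concept_id and r["status"] in allowed
    ]
    # False sorts before True, so preferred rows come first; the sort is stable.
    matches.sort(key=lambda r: not r["preferred"])
    return [r["term"] for r in matches]
```

Deprecated terms never appear in the result, which matches the rule that they are kept for the historical record only.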
---
### 2.9 `corpus_item`
A single ground truth or triage corpus item. Ground truth items teach
stable causal relations. Triage items teach decisions and priorities.
| Field | Type | Notes |
|---|---|---|
| id | integer | Primary key |
| corpus_type | enum | `ground_truth` / `triage` |
| culture_id | integer | References `culture.id` — null means all cultures |
| text | text | The corpus statement (ground truth) or scenario (triage) |
| confidence | enum | `high` / `medium` / `low` |
| approved | boolean | True when reviewed and approved for training use |
| notes | text | Optional authoring notes |
**Ground truth example:**
```
corpus_type: ground_truth
text: "Fire dries wet hides."
confidence: high
approved: true
```
**Triage example:**
```
corpus_type: triage
text: "Hunter returns with deep leg wound and cannot walk unassisted."
confidence: high
approved: true
```
Triage options are stored in `triage_option`.
---
### 2.10 `corpus_concept`
Join table linking corpus items to the concepts they involve. Enables
completeness checks and concept-driven corpus browsing.
| Field | Type | Notes |
|---|---|---|
| id | integer | Primary key |
| corpus_item_id | integer | References `corpus_item.id` |
| concept_id | integer | References `concept.id` |
| role_note | text | Optional note on how concept appears in this item |
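A completeness check of the kind mentioned above could be as small as a set difference. This is a sketch under the assumption that concept ids and `corpus_concept` rows have already been exported from Saltcorn; the function name and sample ids are hypothetical.

```python
def concepts_without_corpus_items(concept_ids: list[int], corpus_concept_rows: list[dict]) -> list[int]:
    """Return ids of concepts that no corpus item links to yet."""
    linked = {row["concept_id"] for row in corpus_concept_rows}
    return sorted(set(concept_ids) - linked)


# Hypothetical data: three concepts, only two of them linked to corpus items.
SAMPLE_CONCEPT_IDS = [1, 2, 3]
SAMPLE_LINKS = [
    {"id": 1, "corpus_item_id": 10, "concept_id": 1},
    {"id": 2, "corpus_item_id": 10, "concept_id": 3},
    {"id": 3, "corpus_item_id": 11, "concept_id": 1},
]
```

Running the check against the sample data reports concept 2 as uncovered.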
---
### 2.11 `triage_option`
Structured options for triage corpus items. Each triage item has 2-4
options, with exactly one marked as preferred.
| Field | Type | Notes |
|---|---|---|
| id | integer | Primary key |
| corpus_item_id | integer | References `corpus_item.id` |
| option_text | text | Description of this option |
| is_preferred | boolean | True for the recommended action |
| reason | text | Why this option is preferred or not |
| rank | integer | Display order |
**Example — triage options for wounded hunter scenario:**
| Option | Preferred | Reason |
|---|---|---|
| Carry hunter back immediately | Yes | Wound is deep, cannot walk, delay increases risk |
| Continue hunt, send one person back | No | Splits group, leaves hunter without full support |
| Make camp here and rest | No | Wound needs shelter and fire, not open ground |
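Because the schema does not enforce the option rules, a reviewer might check them with a small script before approving a triage item. The following is a sketch only; `triage_option_problems` is a hypothetical helper, and the rows reproduce the wounded hunter example above.

```python
# Options for the wounded hunter scenario, shaped like `triage_option` rows.
WOUNDED_HUNTER_OPTIONS = [
    {"option_text": "Carry hunter back immediately", "is_preferred": True, "rank": 1},
    {"option_text": "Continue hunt, send one person back", "is_preferred": False, "rank": 2},
    {"option_text": "Make camp here and rest", "is_preferred": False, "rank": 3},
]


def triage_option_problems(options: list[dict]) -> list[str]:
    """Return human-readable rule violations; an empty list means the item passes."""
    problems: list[str] = []
    if not 2 <= len(options) <= 4:
        problems.append(f"expected 2-4 options, found {len(options)}")
    preferred = sum(1 for o in options if o["is_preferred"])
    if preferred != 1:
        problems.append(f"expected exactly one preferred option, found {preferred}")
    return problems
```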
---
## 3. Workflow rule
Every table follows this delivery sequence without exception:
```
1. Table — created in Saltcorn
2. View — at minimum a list view and a detail view
3. Page — at minimum one usable entry/edit page
4. Data — production records entered only via pages, never raw grids
```
**Rules:**
- No production records entered in raw table grids
- Every new table ships with at least one usable page before data entry begins
- Build vertically, not horizontally — one complete table/view/page/data
cycle before starting the next table
---
## 4. Sprint plan
Sprints are ordered by dependency. Do not start a sprint until the
previous sprint's data entry phase is complete and verified.
### Sprint 1 — Foundation
Tables: `domain`, `culture`
Data: 12 seed domains, 4 culture records
Deliverable: domain browser page, culture lookup page
### Sprint 2 — Core concepts
Tables: `concept`, `concept_culture`
Data: 25 seed concepts from DOC-006, tagged to Maglemosian
Deliverable: concept editor page with domain and culture assignment
### Sprint 3 — Scales
Tables: `scale`, `scale_step`
Data: scales for wetness, fire state, ice safety, injury severity
Deliverable: scale builder page with ordered steps
### Sprint 4 — Frames
Table: `frame`
Data: frames for key action concepts (drying, crossing, fishing, triage)
Deliverable: frame editor page
### Sprint 5 — Vocabulary
Table: `vocabulary_item`
Data: preferred terms for all 25 seed concepts
Deliverable: vocabulary editor with preferred flag and approved/deprecated/restricted status
### Sprint 6 — Corpus
Tables: `corpus_item`, `corpus_concept`, `triage_option`
Data: first 20 ground truth items, first 10 triage items
Deliverable: corpus entry page, triage option builder, concept linkage
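The dependency ordering can be checked mechanically against the foreign-key references listed in section 2. This sketch encodes the sprint plan and those references as plain dictionaries; `dependency_violations` is a hypothetical helper, and it assumes a table may only reference tables from the same or an earlier sprint.

```python
# Tables delivered per sprint, as listed in the plan above.
SPRINT_TABLES = {
    1: ["domain", "culture"],
    2: ["concept", "concept_culture"],
    3: ["scale", "scale_step"],
    4: ["frame"],
    5: ["vocabulary_item"],
    6: ["corpus_item", "corpus_concept", "triage_option"],
}

# Foreign-key references taken from the field tables in section 2.
REFERENCES = {
    "domain": ["domain"],
    "concept": ["domain"],
    "concept_culture": ["concept", "culture"],
    "scale": ["concept"],
    "scale_step": ["scale", "scale_step"],
    "frame": ["concept"],
    "vocabulary_item": ["concept"],
    "corpus_item": ["culture"],
    "corpus_concept": ["corpus_item", "concept"],
    "triage_option": ["corpus_item"],
}


def dependency_violations(sprint_tables: dict, references: dict) -> list[tuple[str, str]]:
    """Return (table, target) pairs where the target ships in a later sprint."""
    sprint_of = {t: s for s, tables in sprint_tables.items() for t in tables}
    violations = []
    for table, targets in references.items():
        for target in targets:
            if sprint_of[target] > sprint_of[table]:
                violations.append((table, target))
    return violations
```

With the plan as written, the check returns no violations.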
---
## 5. Seed concepts — Sprint 2 data
From DOC-006. All tagged Maglemosian initially.
| Concept | Domain | Evidence grade |
|---|---|---|
| wet | Wetness | direct |
| dry | Wetness | direct |
| damp | Wetness | direct |
| soaked | Wetness | inferred |
| fire | Fire | direct |
| ember | Fire | direct |
| smoke | Fire | direct |
| shelter | Shelter | direct |
| hide | Shelter | direct |
| bark | Shelter | direct |
| marsh | Terrain | direct |
| reed | Terrain | direct |
| path | Terrain | inferred |
| river | Water travel | direct |
| crossing | Water travel | inferred |
| fish | Fishing | direct |
| trap | Fishing | direct |
| spear | Hunting | direct |
| wound | Injury | direct |
| limp | Injury | inferred |
| carry | Injury | inferred |
| dawn | Time cycles | inferred |
| dusk | Time cycles | inferred |
| elder | Social roles | analogue |
| child | Social roles | analogue |
---
## 6. Corpus specification
### 6.1 Ground truth corpus
Teaches stable causal relations. Statements must be:
- Present tense, declarative
- Measurement-free
- Culturally plausible for the tagged culture
- Linked to at least one concept via `corpus_concept`
**Field summary:**
- `text` — the causal statement
- `culture_id` — null for universal statements
- `confidence` — high/medium/low
- `approved` — reviewed and ready for training
**Examples:**
- Fire dries wet hides.
- Rain softens paths.
- Smoke drives insects away.
- Wet wood makes reluctant fire.
- Soaked bark floor cannot be slept on dry.
- Rising water warns of flood.
### 6.2 Simulation triage corpus
Teaches decisions and priorities under constraint. Each item must have
2-4 structured options via `triage_option`, exactly one marked preferred.
**Field summary:**
- `text` — the scenario description
- `culture_id` — null for universal scenarios
- `confidence` — high/medium/low
- `approved` — reviewed and ready for training
**Triage option fields:**
- `option_text` — what this choice involves
- `is_preferred` — the recommended action
- `reason` — why preferred or not preferred
- `rank` — display order
**Examples:**
- Wounded hunter cannot walk. (carry first vs continue hunt vs make camp)
- Fire goes out in heavy rain. (seek dry tinder vs use ember from shelter vs wait)
- Path floods at crossing. (find higher crossing vs wait vs wade)
---
## 7. Lexical governance
### 7.1 Purpose
Prevent semantic drift. Ensure vocabulary items remain measurement-free
and culturally coherent across authors and sessions.
### 7.2 Controls per vocabulary item
| Control | Field | Notes |
|---|---|---|
| Preferred term | `preferred = true` | One per concept |
| Allowed alternates | `status = approved, preferred = false` | Multiple allowed |
| Deprecated terms | `status = deprecated` | Kept for record, not used in new corpus |
| Restricted terms | `status = restricted` | Context specified in `notes` |
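The "one preferred term per concept" control is not enforced by the schema, so it may be worth auditing after each data entry session. A minimal sketch, assuming exported `vocabulary_item` rows; the helper name and sample rows are hypothetical.

```python
from collections import Counter


def preferred_term_problems(rows: list[dict]) -> list[int]:
    """Return concept ids that do not have exactly one preferred term."""
    concept_ids = {r["concept_id"] for r in rows}
    counts = Counter(r["concept_id"] for r in rows if r["preferred"])
    # Counter returns 0 for concepts with no preferred rows at all.
    return sorted(cid for cid in concept_ids if counts[cid] != 1)


# Hypothetical rows: concept 1 is valid, concept 2 has no preferred term,
# concept 3 has two.
SAMPLE_VOCABULARY = [
    {"concept_id": 1, "term": "wet", "preferred": True},
    {"concept_id": 1, "term": "waterlogged", "preferred": False},
    {"concept_id": 2, "term": "ember", "preferred": False},
    {"concept_id": 3, "term": "marsh", "preferred": True},
    {"concept_id": 3, "term": "bog", "preferred": True},
]
```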
### 7.3 Approval history
Saltcorn's built-in record history tracks who changed what and when.
No separate approval log table is needed at this stage.
### 7.4 Constraint enforcement
Modern units and modern-only categories are excluded by editorial
discipline at authoring time. A future model analysis pass will scan
the corpus for violations and flag them for review. No constraint
tables are maintained in this schema version.
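Until the model analysis pass exists, a crude keyword scan can serve as a stand-in during review. The sketch below is not that analysis pass; it is a simple regular-expression check, and the word list is an illustrative assumption rather than an agreed list of excluded units.

```python
import re

# Illustrative, incomplete list of modern unit words (an assumption).
MODERN_UNIT_WORDS = (
    "metre", "meter", "kilometre", "kilometer",
    "gram", "kilogram", "litre", "liter",
    "celsius", "fahrenheit",
)

# Whole-word match with an optional plural "s", case-insensitive.
_UNIT_PATTERN = re.compile(
    r"\b(?:" + "|".join(MODERN_UNIT_WORDS) + r")s?\b",
    re.IGNORECASE,
)


def find_modern_units(text: str) -> list[str]:
    """Return modern unit words found in `text`, lowercased, in order of appearance."""
    return [m.group(0).lower() for m in _UNIT_PATTERN.finditer(text)]
```

A scan like this only flags literal unit words; modern-only categories and paraphrased measurements would still need editorial or model review.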
---
## 8. What this does not decide
- The language model architecture or training pipeline
- How corpus items are exported to training format
- Whether vocabulary items are used as literal tokens or as semantic
seeds for generation
- The multi-clan expansion beyond Maglemosian
- The integration between this corpus and the TESSERA spatial data layer
- Constraint enforcement implementation (deferred to model analysis pass)
---
*Mesolithic Corpus Standard v1.0 — 2026-04-13*
*Status: Normative*
*Next review: after Sprint 2 data entry is complete*