initial upload

2026-04-30 15:37:16 -04:00
parent 2c3667f77c
commit 9889ecb574
1 changed files with 702 additions and 0 deletions
--- a/docs/training/chunking/CIVICUS-ROMAN-MODEL-VISION-0001.md
+++ b/docs/training/chunking/CIVICUS-ROMAN-MODEL-VISION-0001.md
@@ -0,0 +1,702 @@
+# CIVICUS-ROMAN-MODEL-VISION-0001
+## Rational Vision For A Bounded Roman Simulator Model
+### Status: Draft Vision
+### Layer: Training Infrastructure
+### Purpose: Define the practical rationale, scope, and training plan for the CIVICUS-ROMAN model
+### Repository Path: docs/training/chunking/CIVICUS-ROMAN-MODEL-VISION-0001.md
+
+---
+
+## 0. Purpose
+
+This document defines the rational vision for the CIVICUS-ROMAN model.
+
+The model is not intended to be a general chatbot.
+
+The model is not intended to know all of history.
+
+The model is not intended to imitate modern English reasoning with Roman facts attached.
+
+The model is intended to operate inside a bounded Roman simulator world.
+
+Its task is to reason, ask, answer, and speak from within that world.
+
+---
+
+## 1. Core Claim
+
+A narrow Roman simulator model may be viable because the intended world is deliberately reduced.
+
+The model does not need the full ontology of modern life.
+
+It needs a bounded set of:
+
+```text
+objects
+actions
+pressures
+actors
+places
+procedures
+records
+obligations
+materials
+routes
+risks
+social meanings
+```
+
+The target is not general intelligence.
+
+The target is Roman-bounded simulator intelligence.
+
+---
+
+## 2. The Problem With Existing Models
+
+Existing general models are trained overwhelmingly on modern text and carry its assumptions about how the world works.
+
+Even when given Roman context, they tend to leak modern assumptions:
+
+```text
+universal market price
+modern legal enforcement
+modern contract logic
+state-backed regulatory assumptions
+instant information
+abstract finance vocabulary
+modern supply-chain concepts
+consumer-market behavior
+modern moral and institutional framing
+```
+
+Retrieval alone does not solve this.
+
+RAG can supply correct facts, but the base model still interprets those facts through a modern ontology.
+
+The goal of CIVICUS-ROMAN is to reduce or remove that ontology problem.
+
+---
+
+## 3. What The Model Must Learn
+
+The model must learn to reason from Roman-visible primitives.
+
+Examples:
+
+```text
+Who saw it?
+Who heard it?
+Who wrote it?
+How old is the message?
+Is the seal broken?
+Who witnessed the bargain?
+Where are the carts?
+Can the goods move?
+Who benefits if the rumor is believed?
+What can safely be entered in the account?
+Is the obligation settled, pledged, delayed, or disputed?
+```
+
+It must not default to:
+
+```text
+What is the market price?
+Is the contract enforceable?
+What is the regulatory risk?
+What is the optimal modern transaction?
+```
+
+The model should ask and answer in terms of objects, actions, pressures, and visible social facts.
+
+---
+
+## 4. Reduced World Grammar
+
+The CIVICUS-ROMAN model should be trained around a controlled world grammar.
+
+### Objects
+
+```text
+coin
+purse
+chest
+tablet
+seal
+witness
+cart
+wheel
+mule
+road
+warehouse
+wall
+roof
+jar
+amphora
+crate
+rope
+weight
+measure
+gate
+market
+portico
+yard
+dust
+rain
+lamp
+grain
+oil
+bronze
+timber
+glass
+stone
+```
+
+### Actions
+
+```text
+buy
+sell
+carry
+store
+seal
+open
+count
+weigh
+measure
+pledge
+write
+witness
+hire
+repair
+delay
+ask
+refuse
+accuse
+confirm
+return
+split
+hold
+move
+settle
+hide
+leak
+wait
+rot
+spoil
+break
+arrive
+depart
+```
+
+### Pressures
+
+```text
+hunger
+rain
+delay
+spoilage
+debt
+rivalry
+shame
+praise
+shortage
+crowd
+rumor
+cart scarcity
+storage scarcity
+buyer urgency
+creditor pressure
+official attention
+bad road
+old news
+broken seal
+empty purse
+full warehouse
+```
+
+The model should learn to combine these before reaching for abstract explanation.
+
+---
+
+## 5. Speech Principle
+
+The model should prefer Roman-visible commercial speech.
+
+Preferred:
+
+```text
+The wheels are gone.
+The tablet arrived old.
+He owns jars, not coin.
+The road has eaten the profit.
+The crate is heavier than its name.
+The purse is fat and the street has eyes.
+```
+
+Avoided:
+
+```text
+Transport capacity is constrained.
+The information is stale.
+His assets are illiquid.
+Transportation cost eliminated the margin.
+The cargo is misclassified.
+Liquidity creates security risk.
+```
+
+The purpose is not ornament.
+
+The purpose is ontology.
+
+A model learns the kind of world it inhabits through the language it is trained to use.
+
+---
+
+## 6. Corpus Architecture
+
+The corpus is layered.
+
+Each layer teaches a different kind of reasoning.
+
+```text
+Layer 0 — Primitive Facts
+  basic world rules
+
+Layer 1 — Worked Examples
+  arithmetic, cost, movement, profit, loss, settlement
+
+Layer 2 — Uncertainty
+  reports, rumors, old messages, hidden truth, confidence, confirmation
+
+Layer 3 — Actor Perspective
+  same event read differently by different Roman-world actors
+
+Layer 4 — Dialogues
+  in-world scenes that teach through speech, action, and consequence
+```
+
+This layering is essential.
+
+The model should not merely memorize dialogue.
+
+It should learn the underlying reasoning forms that make the dialogue valid.
+
+---
+
+## 7. Vocabulary Generation Pipeline
+
+A major part of the model's vocabulary can be built through a generate-review-promote workflow.
+
+The generator combines:
+
+```text
+Object + Action + Pressure
+```
+
+Example:
+
+```text
+cart + hired elsewhere + buyer waiting
+= The wheels are gone, and the buyer will not wait for our excuses.
+```
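
As a minimal sketch of the combination step only, the snippet below enumerates every Object + Action + Pressure slot from small sample lists. The lists are a short excerpt of the section 4 vocabulary, and the names `candidate_slots`, `OBJECTS`, `ACTIONS`, and `PRESSURES` are hypothetical. Turning a slot into a phrase such as the one above is not shown and would be done by an agent or a human.

```python
from itertools import product

# Small samples taken from the section 4 lists; the real lists are longer.
OBJECTS = ["cart", "tablet", "seal"]
ACTIONS = ["delay", "break", "arrive"]
PRESSURES = ["buyer urgency", "old news"]


def candidate_slots(objects, actions, pressures):
    """Yield every Object + Action + Pressure combination as a dict."""
    for obj, act, pres in product(objects, actions, pressures):
        yield {"object": obj, "action": act, "pressure": pres}


slots = list(candidate_slots(OBJECTS, ACTIONS, PRESSURES))
print(len(slots))  # number of combinations: 3 * 3 * 2
print(slots[0])    # first slot in enumeration order
```

Exhaustive enumeration grows multiplicatively with list size, so a real generator would probably sample slots rather than list them all.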
+
+Most generated phrases will be weak.
+
+That is acceptable.
+
+Humans are faster at recognizing strong expressions than inventing them cold.
+
+The workflow is:
+
+```text
+generate many candidates
+human flags useful expressions
+accepted expressions enter vocabulary
+strong expressions influence dialogue
+canonical expressions become simulator templates
+```
+
+Only reviewed material enters training.
+
+Raw churn is not training data.
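
The promotion rule can be sketched as a filter over review statuses. This is an illustration under assumptions: the `Status` values mirror the accept/reject/strong/canonical labels used in this document, while `UNREVIEWED`, `Candidate`, and `training_set` are hypothetical names introduced here.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    UNREVIEWED = "unreviewed"  # assumed default for raw generator output
    REJECTED = "rejected"
    ACCEPTED = "accepted"
    STRONG = "strong"
    CANONICAL = "canonical"


# Only statuses assigned by a human reviewer may reach training.
PROMOTABLE = {Status.ACCEPTED, Status.STRONG, Status.CANONICAL}


@dataclass
class Candidate:
    text: str
    status: Status = Status.UNREVIEWED


def training_set(candidates):
    """Return the texts of reviewed, non-rejected candidates, in order."""
    return [c.text for c in candidates if c.status in PROMOTABLE]


pool = [
    Candidate("The wheels are gone.", Status.CANONICAL),
    Candidate("The tablet arrived old.", Status.STRONG),
    Candidate("Transport capacity is constrained.", Status.REJECTED),
    Candidate("The cart is a cart."),  # never reviewed, so never promoted
]
approved = training_set(pool)
print(approved)
```

Making `UNREVIEWED` the default means a candidate is excluded unless a reviewer acts on it, which matches the rule that raw churn is not training data.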
+
+---
+
+## 8. Human And Agent Roles
+
+Agents will perform much of the production work.
+
+Agents can generate:
+
+```text
+candidate expressions
+dialogue variants
+actor readings
+primitive examples
+uncertainty cases
+law scenarios
+architecture scenarios
+technology scenarios
+negative examples
+contamination tests
+```
+
+Agents can also assist with:
+
+```text
+format validation
+tag audit
+style checks
+duplicate detection
+forbidden vocabulary detection
+chunk extraction
+statistics
+regression tests
+```
+
+Humans remain responsible for:
+
+```text
+canon
+ontology
+final approval
+style judgment
+failure judgment
+domain boundaries
+promotion to training data
+```
+
+The human role shifts from authoring every line to governing the corpus.
+
+---
+
+## 9. Training Strategy
+
+The first serious training target should not be a general-purpose language model.
+
+The first target should be a compact bounded simulator model.
+
+A rational training progression:
+
+```text
+Stage 1:
+  Roman-visible vocabulary expressions
+
+Stage 2:
+  primitive facts and terse Q/A
+
+Stage 3:
+  worked examples with arithmetic and consequence
+
+Stage 4:
+  uncertainty examples and knowledge-boundary tests
+
+Stage 5:
+  actor-perspective readings
+
+Stage 6:
+  in-world dialogues
+
+Stage 7:
+  simulator-state-to-response pairs
+```
+
+The model should learn from simple controlled forms before complex dialogue.
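
One way to picture the progression is as a curriculum ordering over tagged examples. The sketch below assumes each example carries a `stage` tag. The tag names, `STAGES`, and `curriculum` are hypothetical and chosen only to mirror the seven stages listed above.

```python
# Stage tags in the order given by the progression above (names are assumed).
STAGES = [
    "vocabulary",
    "primitive_qa",
    "worked_example",
    "uncertainty",
    "actor_perspective",
    "dialogue",
    "state_to_response",
]
STAGE_INDEX = {name: i for i, name in enumerate(STAGES, start=1)}


def curriculum(examples):
    """Order examples from simple controlled forms to complex dialogue.

    sorted() is stable, so examples within a stage keep their input order.
    """
    return sorted(examples, key=lambda ex: STAGE_INDEX[ex["stage"]])


examples = [
    {"stage": "dialogue", "text": "scene A"},
    {"stage": "vocabulary", "text": "The wheels are gone."},
    {"stage": "uncertainty", "text": "How old is the message?"},
    {"stage": "vocabulary", "text": "The tablet arrived old."},
]
ordered = curriculum(examples)
print([ex["stage"] for ex in ordered])
```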
+
+---
+
+## 10. Scratch Training Reconsidered
+
+Training a general model from scratch is expensive because the model must learn broad language, broad world knowledge, and general reasoning.
+
+CIVICUS-ROMAN is different.
+
+It does not need to answer every question.
+
+It does not need modern breadth.
+
+It does not need open-ended knowledge.
+
+It needs competence inside a small Roman simulator world.
+
+Therefore scratch or near-scratch training may be viable if the model is deliberately narrow.
+
+The fair comparison is not:
+
+```text
+small project vs general LLM
+```
+
+The fair comparison is:
+
+```text
+bounded simulator grammar + controlled corpus + agent-assisted data generation
+```
+
+against:
+
+```text
+modern-prior leakage from general models
+```
+
+---
+
+## 11. Simulator Ownership Of Reality
+
+The model should not own the simulator state.
+
+The simulator owns:
+
+```text
+actors
+locations
+time
+inventory
+money
+routes
+documents
+seals
+witnesses
+obligations
+weather
+prices
+rumors
+official attention
+```
+
+The model interprets, asks, answers, and speaks within that state.
+
+The model should not invent facts that the simulator has not provided.
+
+The model should prefer questions when state is insufficient.
+
+Example:
+
+```text
+What can be known?
+Who saw it?
+Who wrote it?
+Can the cart still move?
+Was the seal broken?
+Is there a witness?
+```
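
A minimal sketch of the "prefer questions when state is insufficient" rule follows. It assumes the simulator hands the model a flat dictionary in which `None` marks a fact that has not been provided. The keys, the `QUESTIONS` table, and `missing_questions` are hypothetical.

```python
# Maps an assumed state key to the in-world question that asks for it.
QUESTIONS = {
    "author": "Who wrote it?",
    "seal_intact": "Was the seal broken?",
    "witness": "Is there a witness?",
    "cart_usable": "Can the cart still move?",
}


def missing_questions(state, required):
    """Return a question for each required fact the simulator has not given.

    A value of False is a known fact, so only None or absent keys count
    as missing.
    """
    return [QUESTIONS[key] for key in required if state.get(key) is None]


state = {"author": "the steward", "seal_intact": False, "witness": None}
required = ["author", "seal_intact", "witness", "cart_usable"]
to_ask = missing_questions(state, required)
print(to_ask)
```

The model never fills a gap itself here: an unknown fact becomes a question, and the answer has to come back from the simulator.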
+
+---
+
+## 12. Evaluation
+
+The model must be tested against modern contamination.
+
+Example failure prompt:
+
+```text
+What is the fair market price?
+```
+
+A Roman-bounded response should reject the idea of a universal price and ask about place, buyer, time, transport, and information.
+
+Example failure prompt:
+
+```text
+Can the contract be enforced?
+```
+
+A Roman-bounded response should ask about tablet, witness, seal, pledge, patron, magistrate, standing, and leverage.
+
+Example failure prompt:
+
+```text
+Was the information reliable?
+```
+
+A Roman-bounded response should ask who carried the word, how old it is, who benefits, whether anyone saw the goods, and what can be confirmed.
+
+Evaluation must reward Roman-bounded reasoning and punish modern abstraction.
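
As a first, mechanical pass, a contamination check could scan a response for modern phrases. The sketch below is an assumption about how such a check might look, not the project's evaluator. `FORBIDDEN` is a small sample of phrases this document lists as modern leakage, and `modern_leaks` is a hypothetical helper.

```python
import re

# Small sample of modern phrases named in this document; not a full list.
FORBIDDEN = ["market price", "supply chain", "liquidity", "regulatory", "enforceable"]


def modern_leaks(response, forbidden=FORBIDDEN):
    """Return the forbidden phrases found in a response, case-insensitively."""
    found = []
    for term in forbidden:
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, response, re.IGNORECASE):
            found.append(term)
    return found


leaky = modern_leaks("The fair market price depends on liquidity.")
clean = modern_leaks("Who carried the word, and how old is the tablet?")
print(leaky)
print(clean)
```

A phrase list can only flag surface vocabulary. It cannot tell whether the reasoning is Roman-bounded, and it would also flag a response that names a modern term only to reject it, so failure judgment stays with human reviewers.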
+
+---
+
+## 13. Domains To Add
+
+The first domain is commerce.
+
+Later domains should be added with the same layered discipline.
+
+### Roman Law
+
+```text
+standing
+complaint
+witness
+tablet
+seal
+pledge
+remedy
+magistrate
+patronage
+procedure
+public shame
+private settlement
+```
+
+### Architecture
+
+```text
+stone
+timber
+brick
+lime
+labor
+measurement
+site
+water
+weight
+collapse
+repair
+patron
+public work
+```
+
+### Technology
+
+```text
+tool
+craft
+material
+workshop
+repair
+failure
+skill
+apprentice
+measurement
+heat
+water
+wheel
+gear
+lever
+```
+
+Each domain should develop:
+
+```text
+Layer 0 primitives
+Layer 1 examples
+Layer 2 uncertainty
+Layer 3 actor readings
+Layer 4 dialogues
+controlled vocabulary
+contamination tests
+```
+
+---
+
+## 14. Practical Near-Term Plan
+
+Recommended next steps:
+
+```text
+1. Freeze first commerce dialogue batch.
+2. Continue developing vocabulary generation standards.
+3. Build the expression candidate generator.
+4. Build a review interface for accept/reject/strong/canonical.
+5. Expand commerce vocabulary library.
+6. Add Roman Law Layer 0 primitives.
+7. Add Roman Law worked examples.
+8. Add Roman Law dialogues only after primitives exist.
+9. Build contamination tests.
+10. Compare:
+    A. scratch small model
+    B. near-scratch model
+    C. small existing base model fine-tuned to OTIVM
+```
+
+The comparison matters.
+
+The project should not assume scratch training wins.
+
+It should test whether scratch training reduces modern contamination enough to justify weaker inherited language ability.
+
+---
+
+## 15. Success Definition
+
+CIVICUS-ROMAN succeeds if it can operate inside the simulator without modern leakage.
+
+It should naturally produce questions and answers like:
+
+```text
+Who carried the word?
+How old is the tablet?
+Was the seal broken?
+Can the cart still move?
+Who witnessed the promise?
+Does the account remain open?
+What does the buyer need before sundown?
+```
+
+It should naturally speak like:
+
+```text
+The wheels are gone.
+The tablet arrived old.
+He owns jars, not coin.
+The road has eaten the profit.
+The account remains open.
+The crate is heavier than its name.
+```
+
+It should avoid:
+
+```text
+supply chain disruption
+market efficiency
+legal compliance
+liquidity constraint
+regulatory exposure
+contractual enforcement
+```
+
+The model is not meant to know less.
+
+It is meant to know differently.
+
+---
+
+## 16. Final Vision
+
+CIVICUS-ROMAN is a bounded-world model.
+
+Its intelligence comes from discipline, not breadth.
+
+Its strength is that it does not treat modern reality as default.
+
+It learns a smaller world deeply:
+
+```text
+what can be seen
+what can be carried
+what can be written
+what can be witnessed
+what can be pledged
+what can be delayed
+what can be hidden
+what can be settled
+```
+
+This is the rational path:
+
+```text
+controlled ontology
+layered corpus
+Roman-visible vocabulary
+agent-assisted generation
+human canon approval
+strict validation
+small model experiments
+simulator-owned state
+contamination testing
+```
+
+The purpose is to build a model that does not merely describe ancient Rome.
+
+The purpose is to build a model that can think inside the civic Roman world of the simulator.