# VOCABULARY-GENERATION-0001
## Generate, Review, And Promote Roman-Visible Expressions
### Status: Draft Standard
### Layer: Training Infrastructure
### Purpose: Define a fast human-in-the-loop workflow for building OTIVM's Roman-visible model vocabulary
### Repository Path: docs/training/chunking/VOCABULARY-GENERATION-0001.md
---
## 0. Purpose
This document defines a workflow for generating and selecting Roman-visible commercial expressions.
The purpose is to build the model vocabulary faster than hand-authoring every line.
The generator may produce large amounts of weak or useless material. That is acceptable.
The training corpus must only receive reviewed and accepted material.
The workflow is:
```text
generate many candidates
human flags useful expressions
accepted expressions become vocabulary records
strong expressions become dialogue material
canonical expressions become simulator templates
```
The churn is not the asset.
The approved expression is the asset.
---
## 1. Core Idea
A Roman-visible expression can often be generated from three elements:
```text
Object + Action + Pressure
```
Examples:
```text
coin + hide + street eyes
= The purse is fat and the street has eyes.
cart + hired elsewhere + buyer waiting
= The wheels are gone while the buyer counts the hours.
tablet + old + road delay
= The tablet arrived older than its promise.
jar + no cart + delivery obligation
= A jar without wheels is a promise sitting in straw.
warehouse roof + rain + merchant urgency
= The roof earns coin when rain walks the street.
```
This is not ordinary paraphrase.
It is ontology building.
The model learns what kind of world it inhabits by seeing which objects, actions, and pressures are allowed to combine.
---
## 2. Why This Works
Humans are often faster at recognizing a good phrase than inventing one from nothing.
A generator can produce hundreds or thousands of combinations.
Most will be poor.
A human reviewer can scroll quickly and mark:
```text
accept
reject
revise
strong
canonical
```
The useful lines will emerge faster than through direct composition.
The process is closer to quarrying stone than writing prose.
The generator produces rough stone.
The reviewer selects blocks worth dressing.
The corpus receives only dressed blocks.
---
## 3. Controlled Input Sets
The generator should not begin with unrestricted language.
It should combine controlled lists.
### Objects
```text
coin
purse
chest
tablet
seal
witness
cart
wheel
mule
road
warehouse
wall
roof
jar
amphora
crate
rope
weight
measure
gate
market
portico
yard
dust
rain
lamp
grain
oil
bronze
timber
glass
stone
```
### Actions
```text
buy
sell
carry
store
seal
open
count
weigh
measure
pledge
write
witness
hire
repair
delay
ask
refuse
accuse
confirm
return
split
hold
move
settle
hide
leak
wait
rot
spoil
break
arrive
depart
```
### Pressures
```text
hunger
rain
delay
spoilage
debt
rivalry
shame
praise
shortage
crowd
rumor
cart scarcity
storage scarcity
buyer urgency
creditor pressure
official attention
bad road
old news
broken seal
empty purse
full warehouse
```
### Actor Voices
```text
Varro
Felix
Lentulus
Crispus
Secundus
Chresimus
neutral narrator
```
The generator should combine these into candidate expressions, not final truth.
---
## 4. Candidate Expression Record
Each generated expression should be stored as a reviewable record.
Recommended JSONL form:
```json
{
  "expression_id": "expr_000142",
  "domain": "commerce",
  "object": "cart",
  "action": "hired_elsewhere",
  "pressure": "buyer_waiting",
  "actor_voice": "Secundus",
  "candidate": "The wheels are gone, and the buyer will not wait for our excuses.",
  "modern_meaning": "Cart capacity has been lost, but partial shipment may still be possible.",
  "concept_tags": [
    "transport_capacity",
    "delay_cost",
    "buyer_need"
  ],
  "status": "candidate",
  "strength": null,
  "review_note": null
}
```
Candidate records are review material only.
They are not training material until promoted.
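A minimal sketch of writing and checking such records, assuming the JSONL form above; the helper names (`validate_candidate`, `append_candidate`) and the validation policy are illustrative, not part of this standard:

```python
import json

# Fields every candidate record should carry before review,
# following the recommended JSONL form above.
REQUIRED_FIELDS = {
    "expression_id", "domain", "object", "action", "pressure",
    "actor_voice", "candidate", "modern_meaning", "concept_tags",
    "status", "strength", "review_note",
}

def validate_candidate(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record is reviewable."""
    problems = [f"missing field: {f}" for f in sorted(REQUIRED_FIELDS - record.keys())]
    if record.get("status") != "candidate":
        problems.append("new records must start with status 'candidate'")
    return problems

def append_candidate(path: str, record: dict) -> None:
    """Append one validated record to the candidate pool (one JSON object per line)."""
    problems = validate_candidate(record)
    if problems:
        raise ValueError("; ".join(problems))
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Rejecting malformed records at generation time keeps the candidate pool uniformly reviewable, so the review interface never has to branch on missing fields.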
---
## 5. Review Status
Use a small status vocabulary.
```text
candidate
accepted
rejected
revise
strong
canonical
```
Meaning:
```text
candidate:
generated but not reviewed
accepted:
good enough to enter the vocabulary library
rejected:
not useful; do not train on it
revise:
promising but needs human rewrite
strong:
useful enough to inspire dialogue lines
canonical:
preferred phrasing for a recurring simulator condition
```
Only expressions with these statuses should enter training or simulator-facing data:
```text
accepted
strong
canonical
```
Rejected and unreviewed candidates should be retained only for audit or generator improvement.
---
## 6. Human Review Rules
The reviewer should ask:
1. Is the line Roman-visible?
2. Does it avoid modern abstraction?
3. Does it express a real commercial condition?
4. Does it use objects, actions, or pressures rather than explanation?
5. Could one of the six actor voices plausibly say it?
6. Is it compact enough to be useful?
7. Does it avoid parody or over-stylized speech?
8. Does it teach the model a useful pattern?
Reject lines that are merely clever.
Accept lines that create usable world-language.
Promote lines that can recur across scenes.
---
## 7. Rejection Reasons
Common rejection reasons:
```text
too modern
too abstract
too theatrical
too vague
wrong actor voice
no commercial meaning
no Roman-visible object
mixed metaphor
unusable in dialogue
duplicates existing phrase
```
Optional review fields:
```json
{
  "status": "rejected",
  "review_note": "too modern: sounds like business-school language"
}
```
or:
```json
{
  "status": "revise",
  "review_note": "good image, but too ornate for Secundus"
}
```
---
## 8. Promotion Levels
### Accepted
Useful phrase. Can be stored in the vocabulary library.
Example:
```text
The tablet arrived old.
```
### Strong
Useful phrase that should influence dialogue writing.
Example:
```text
A jar without wheels is a promise sitting in straw.
```
### Canonical
Preferred phrase for a repeated simulator condition.
Example:
```text
The wheels are gone.
```
Canonical expressions should be few.
If too many phrases are canonical, none are canonical.
---
## 9. Output Libraries
The workflow should produce three outputs.
### Candidate Pool
```text
data/vocabulary/candidates.jsonl
```
Generated material, mostly unreviewed.
### Reviewed Vocabulary
```text
data/vocabulary/roman_visible_expressions.jsonl
```
Accepted, strong, and canonical expressions only.
### Canonical Templates
```text
data/vocabulary/canonical_templates.jsonl
```
Small set of recurring simulator-ready expressions.
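A sketch of routing the candidate pool into the two downstream libraries, assuming the file paths and status vocabulary above; the function name `split_pool` is illustrative:

```python
import json

# Statuses allowed into reviewed, trainable material (Section 5).
TRAINABLE = {"accepted", "strong", "canonical"}

def split_pool(candidates_path: str, reviewed_path: str, canonical_path: str) -> None:
    """Route reviewed records into the two downstream libraries.

    Rejected and unreviewed records are not copied anywhere; they
    remain only in the candidate pool for audit.
    """
    with open(candidates_path, encoding="utf-8") as src, \
         open(reviewed_path, "w", encoding="utf-8") as reviewed, \
         open(canonical_path, "w", encoding="utf-8") as canonical:
        for line in src:
            record = json.loads(line)
            status = record.get("status")
            if status in TRAINABLE:
                reviewed.write(json.dumps(record, ensure_ascii=False) + "\n")
            if status == "canonical":
                canonical.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Canonical records land in both output files: they are part of the reviewed vocabulary and additionally form the small simulator-ready template set.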
---
## 10. Training Rule
Do not train on raw generated churn.
Training material may use:
```text
accepted expressions
strong expressions
canonical expressions
human-revised expressions
dialogues that naturally include reviewed expressions
```
Training material must not use:
```text
unreviewed candidate output
rejected output
bulk generated noise
expressions marked revise but not rewritten
```
The generator is a discovery tool, not an author of record.
---
## 11. Simulator Use
Canonical expressions can help the simulator narrate recurring conditions.
Example simulator state:
```yaml
condition: transport_capacity_lost
object: cart
cause: rival_hired_carts
urgency: buyer_waiting
actor_voice: Secundus
```
Possible canonical output:
```text
The wheels are gone.
```
Expanded output:
```text
The wheels are gone, and the buyer will not wait for our excuses.
```
Actor variants:
```text
Varro:
The bridge was taken before the column moved.
Felix:
Naso bought the road, not the oil.
Chresimus:
The account must show why the jars did not move.
Secundus:
The wheels are gone. Ten jars can still go by mule.
```
The simulator should prefer canonical lines for repeated conditions and strong lines for color.
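One way to sketch that preference in code, assuming canonical lines are keyed by simulator condition; the table structure and the `narrate` helper are assumptions, with the actor variants taken from the examples above:

```python
# Illustrative lookup table keyed on simulator condition.
# Actor-specific variants are preferred, then the canonical default.
CANONICAL_LINES = {
    "transport_capacity_lost": {
        "default": "The wheels are gone.",
        "Secundus": "The wheels are gone. Ten jars can still go by mule.",
        "Felix": "Naso bought the road, not the oil.",
    },
}

def narrate(condition: str, actor_voice: str = "neutral narrator") -> str:
    """Prefer an actor-specific canonical line, fall back to the
    condition's default, then to a bare condition marker."""
    variants = CANONICAL_LINES.get(condition, {})
    return variants.get(actor_voice, variants.get("default", f"[{condition}]"))
```

The bare-marker fallback makes missing canonical coverage visible in simulator output rather than silently substituting modern phrasing.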
---
## 12. Generator Design
A simple generator can begin as a Cartesian combiner with templates.
Template examples:
```text
The {object} {action_phrase} while {pressure_phrase}.
A {object} without {support_object} is {metaphor_result}.
The {pressure_object} has reached {target} before {expected_event}.
{actor_voice} would say: "{expression}"
```
But the generator should be constrained by compatibility rules.
Bad combinations should be filtered before review where possible.
Example:
```text
coin + hired_elsewhere + rain
```
may produce nonsense unless transformed carefully.
Good combinations:
```text
cart + hired_elsewhere + buyer_waiting
tablet + old + road_delay
warehouse + full + merchant_urgency
coin + visible + street_eyes
seal + broken + official_attention
```
The generator should prefer semantically compatible sets.
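A minimal combiner sketch along these lines, assuming compatibility rules in the spirit of Section 13; the particular coverage of `COMPATIBLE` and the crude sentence template are illustrative assumptions:

```python
from itertools import product

# Small compatibility map; real rules would cover the full object list.
COMPATIBLE = {
    "cart": {
        "actions": {"hired_elsewhere", "broken", "delayed"},
        "pressures": {"buyer_waiting", "bad_road"},
    },
    "tablet": {
        "actions": {"old", "sealed", "disputed"},
        "pressures": {"road_delay", "stale_news"},
    },
}

def generate_candidates(objects, actions, pressures):
    """Cartesian combiner that keeps only compatible triples and
    renders each as rough stone for human review."""
    for obj, act, pres in product(objects, actions, pressures):
        rules = COMPATIBLE.get(obj)
        if rules and act in rules["actions"] and pres in rules["pressures"]:
            yield {
                "object": obj,
                "action": act,
                "pressure": pres,
                "candidate": "The {} is {} while {} presses.".format(
                    obj, act.replace("_", " "), pres.replace("_", " ")),
            }
```

The rendered lines are deliberately rough; the point of the filter is only to keep nonsense triples like `coin + hired_elsewhere + rain` out of the reviewer's queue.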
---
## 13. Compatibility Tags
Objects, actions, and pressures should eventually carry compatibility tags.
Example:
```yaml
object: cart
compatible_actions:
  - hired
  - missing
  - broken
  - delayed
  - overloaded
compatible_pressures:
  - buyer_waiting
  - rival_obstruction
  - bad_road
  - delivery_deadline
```
Example:
```yaml
object: tablet
compatible_actions:
  - written
  - sealed
  - old
  - disputed
  - witnessed
compatible_pressures:
  - stale_news
  - legal_exposure
  - source_motive
  - settlement_dispute
```
This improves candidate quality without eliminating human review.
---
## 14. Review Speed Target
The process is designed for fast human selection.
Target review speed:
```text
200 to 500 candidates per hour
```
This is realistic only if the review interface is simple.
Each candidate should support one-key marking:
```text
a = accept
r = reject
v = revise
s = strong
c = canonical
```
The reviewer should not be forced to edit every line.
Editing should be reserved for promising expressions.
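A terminal review loop at this level of simplicity might look like the following sketch; the `review` helper and its injectable `read_key` parameter are assumptions made for testability, and the key mapping is the one above:

```python
# Single-key marks, matching the review interface above.
MARKS = {"a": "accepted", "r": "rejected", "v": "revise",
         "s": "strong", "c": "canonical"}

def review(candidates, read_key=input):
    """Minimal review loop: show each candidate line, read one key,
    stamp the status. An unrecognized key skips the record, leaving
    it a candidate for a later pass."""
    for record in candidates:
        print(record["candidate"])
        key = read_key("[a/r/v/s/c] > ").strip().lower()
        if key in MARKS:
            record["status"] = MARKS[key]
        yield record
```

Because skipping costs nothing, the reviewer can move at full speed and leave hard calls for a second pass, which is what makes the 200-to-500-per-hour target plausible.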
---
## 15. Success Condition
This workflow is successful if it produces a growing library of Roman-visible expressions faster than direct hand-authoring.
A good result is not a clean generator.
A good result is a strong reviewed vocabulary.
The approved vocabulary should improve:
```text
dialogue writing
simulator narration
actor voice consistency
contamination resistance
model training data
```
The final test is whether the model prefers:
```text
The wheels are gone.
The tablet arrived old.
He owns jars, not coin.
The purse is fat and the street has eyes.
```
over:
```text
Transport capacity is constrained.
The information is stale.
His assets are illiquid.
His liquidity creates security risk.
```
The purpose is not style alone.
The purpose is to build a bounded Roman commercial ontology one approved phrase at a time.