# VOCABULARY-GENERATION-0001
## Generate, Review, And Promote Roman-Visible Expressions
### Status: Draft Standard
### Layer: Training Infrastructure
### Purpose: Define a fast human-in-the-loop workflow for building OTIVM's Roman-visible model vocabulary
### Repository Path: docs/training/chunking/VOCABULARY-GENERATION-0001.md

---

## 0. Purpose

This document defines a workflow for generating and selecting Roman-visible commercial expressions.

The purpose is to build the model vocabulary faster than hand-authoring every line.

The generator may produce large amounts of weak or useless material. That is acceptable.

The training corpus must only receive reviewed and accepted material.

The workflow is:

```text
generate many candidates
human flags useful expressions
accepted expressions become vocabulary records
strong expressions become dialogue material
canonical expressions become simulator templates
```

The churn is not the asset.

The approved expression is the asset.

---

## 1. Core Idea

A Roman-visible expression can often be generated from three elements:

```text
Object + Action + Pressure
```

Examples:

```text
coin + hide + street eyes
= The purse is fat and the street has eyes.

cart + hired elsewhere + buyer waiting
= The wheels are gone while the buyer counts the hours.

tablet + old + road delay
= The tablet arrived older than its promise.

jar + no cart + delivery obligation
= A jar without wheels is a promise sitting in straw.

warehouse roof + rain + merchant urgency
= The roof earns coin when rain walks the street.
```

This is not ordinary paraphrase.

It is ontology building.

The model learns what kind of world it inhabits by seeing which objects, actions, and pressures are allowed to combine.

---

## 2. Why This Works

Humans are often faster at recognizing a good phrase than inventing one from nothing.

A generator can produce hundreds or thousands of combinations.

Most will be poor.

A human reviewer can scroll quickly and mark:

```text
accept
reject
revise
strong
canonical
```

The useful lines will emerge faster than through direct composition.

The process is closer to quarrying stone than writing prose.

The generator produces rough stone.

The reviewer selects blocks worth dressing.

The corpus receives only dressed blocks.

---

## 3. Controlled Input Sets

The generator should not begin with unrestricted language.

It should combine controlled lists.

### Objects

```text
coin
purse
chest
tablet
seal
witness
cart
wheel
mule
road
warehouse
wall
roof
jar
amphora
crate
rope
weight
measure
gate
market
portico
yard
dust
rain
lamp
grain
oil
bronze
timber
glass
stone
```

### Actions

```text
buy
sell
carry
store
seal
open
count
weigh
measure
pledge
write
witness
hire
repair
delay
ask
refuse
accuse
confirm
return
split
hold
move
settle
hide
leak
wait
rot
spoil
break
arrive
depart
```

### Pressures

```text
hunger
rain
delay
spoilage
debt
rivalry
shame
praise
shortage
crowd
rumor
cart scarcity
storage scarcity
buyer urgency
creditor pressure
official attention
bad road
old news
broken seal
empty purse
full warehouse
```

### Actor Voices

```text
Varro
Felix
Lentulus
Crispus
Secundus
Chresimus
neutral narrator
```

The generator should combine these into candidate expressions, not final truth.

---

## 4. Candidate Expression Record

Each generated expression should be stored as a reviewable record.

Recommended JSONL form:

```json
{
  "expression_id": "expr_000142",
  "domain": "commerce",
  "object": "cart",
  "action": "hired_elsewhere",
  "pressure": "buyer_waiting",
  "actor_voice": "Secundus",
  "candidate": "The wheels are gone, and the buyer will not wait for our excuses.",
  "modern_meaning": "Cart capacity has been lost, but partial shipment may still be possible.",
  "concept_tags": [
    "transport_capacity",
    "delay_cost",
    "buyer_need"
  ],
  "status": "candidate",
  "strength": null,
  "review_note": null
}
```
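
The record above is pretty-printed for reading; in a JSONL file each record would occupy one line. The following is a minimal sketch of how such a record might be typed and serialized in Python. The class name `CandidateExpression` and the helpers `to_jsonl_line` and `from_jsonl_line` are illustrative assumptions, not part of this standard; only the field names come from the example.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class CandidateExpression:
    """One reviewable record; field names mirror the JSON example above."""
    expression_id: str
    domain: str
    object: str
    action: str
    pressure: str
    actor_voice: str
    candidate: str
    modern_meaning: str
    concept_tags: List[str] = field(default_factory=list)
    status: str = "candidate"        # every new record starts unreviewed
    strength: Optional[str] = None
    review_note: Optional[str] = None


def to_jsonl_line(record: CandidateExpression) -> str:
    """Serialize one record as a single JSONL line (no trailing newline)."""
    return json.dumps(asdict(record), ensure_ascii=False)


def from_jsonl_line(line: str) -> CandidateExpression:
    """Parse one JSONL line back into a record."""
    return CandidateExpression(**json.loads(line))


example = CandidateExpression(
    expression_id="expr_000142",
    domain="commerce",
    object="cart",
    action="hired_elsewhere",
    pressure="buyer_waiting",
    actor_voice="Secundus",
    candidate="The wheels are gone, and the buyer will not wait for our excuses.",
    modern_meaning="Cart capacity has been lost, but partial shipment may still be possible.",
    concept_tags=["transport_capacity", "delay_cost", "buyer_need"],
)
line = to_jsonl_line(example)
```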

Candidate records are review material only.

They are not training material until promoted.

---

## 5. Review Status

Use a small status vocabulary.

```text
candidate
accepted
rejected
revise
strong
canonical
```

Meaning:

```text
candidate:
generated but not reviewed

accepted:
good enough to enter the vocabulary library

rejected:
not useful; do not train on it

revise:
promising but needs human rewrite

strong:
useful enough to inspire dialogue lines

canonical:
preferred phrasing for a recurring simulator condition
```

Only these should enter training or simulator-facing data:

```text
accepted
strong
canonical
```
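
As a sketch of how this gate could be enforced in code, the filter below keeps only records whose `status` is one of the three promoted values. The names `PROMOTED_STATUSES` and `promoted_only` and the sample `expression_id` values are hypothetical; records are assumed to be plain dictionaries loaded from JSONL.

```python
from typing import Dict, Iterable, List

# Statuses this section allows into training or simulator-facing data.
PROMOTED_STATUSES = frozenset({"accepted", "strong", "canonical"})


def promoted_only(records: Iterable[Dict]) -> List[Dict]:
    """Keep only records whose status is accepted, strong, or canonical."""
    return [r for r in records if r.get("status") in PROMOTED_STATUSES]


# Hypothetical sample covering promoted and non-promoted statuses.
records = [
    {"expression_id": "expr_000001", "status": "candidate"},
    {"expression_id": "expr_000002", "status": "accepted"},
    {"expression_id": "expr_000003", "status": "rejected"},
    {"expression_id": "expr_000004", "status": "revise"},
    {"expression_id": "expr_000005", "status": "canonical"},
]
kept_ids = [r["expression_id"] for r in promoted_only(records)]
```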

Rejected and unreviewed candidates should be retained only for audit or generator improvement.

---

## 6. Human Review Rules

The reviewer should ask:

1. Is the line Roman-visible?
2. Does it avoid modern abstraction?
3. Does it express a real commercial condition?
4. Does it use objects, action, or pressure rather than explanation?
5. Could one of the six actor voices plausibly say it?
6. Is it compact enough to be useful?
7. Does it avoid parody or over-stylized speech?
8. Does it teach the model a useful pattern?

Reject lines that are merely clever.

Accept lines that create usable world-language.

Promote lines that can recur across scenes.

---

## 7. Rejection Reasons

Common rejection reasons:

```text
too modern
too abstract
too theatrical
too vague
wrong actor voice
no commercial meaning
no Roman-visible object
mixed metaphor
unusable in dialogue
duplicates existing phrase
```

Optional review fields:

```json
{
  "status": "rejected",
  "review_note": "too modern: sounds like business-school language"
}
```

or:

```json
{
  "status": "revise",
  "review_note": "good image, but too ornate for Secundus"
}
```

---

## 8. Promotion Levels

### Accepted

Useful phrase. Can be stored in the vocabulary library.

Example:

```text
The tablet arrived old.
```

### Strong

Useful phrase that should influence dialogue writing.

Example:

```text
A jar without wheels is a promise sitting in straw.
```

### Canonical

Preferred phrase for a repeated simulator condition.

Example:

```text
The wheels are gone.
```

Canonical expressions should be few.

If too many phrases are canonical, none are canonical.

---

## 9. Output Libraries

The workflow should produce three outputs.

### Candidate Pool

```text
data/vocabulary/candidates.jsonl
```

Generated material, mostly unreviewed.

### Reviewed Vocabulary

```text
data/vocabulary/roman_visible_expressions.jsonl
```

Accepted, strong, and canonical expressions only.

### Canonical Templates

```text
data/vocabulary/canonical_templates.jsonl
```

Small set of recurring simulator-ready expressions.
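
One way to picture the three libraries is as a routing step keyed on review status. The sketch below is an assumption-laden illustration: the function names `destinations` and `route` are hypothetical, canonical records are assumed to appear in both the reviewed vocabulary and the canonical templates, and every other status is assumed to stay in the candidate pool. It groups records in memory and does not write files.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

# Paths taken from this section.
CANDIDATE_POOL = "data/vocabulary/candidates.jsonl"
REVIEWED_VOCABULARY = "data/vocabulary/roman_visible_expressions.jsonl"
CANONICAL_TEMPLATES = "data/vocabulary/canonical_templates.jsonl"


def destinations(status: str) -> List[str]:
    """Return the library paths a record with this status belongs in."""
    if status == "canonical":
        # Assumption: canonical records live in both reviewed libraries.
        return [REVIEWED_VOCABULARY, CANONICAL_TEMPLATES]
    if status in ("accepted", "strong"):
        return [REVIEWED_VOCABULARY]
    # candidate, rejected, and revise stay out of the reviewed libraries.
    return [CANDIDATE_POOL]


def route(records: Iterable[Dict]) -> Dict[str, List[Dict]]:
    """Group records by the library path(s) they should be written to."""
    libraries: Dict[str, List[Dict]] = defaultdict(list)
    for record in records:
        for path in destinations(record["status"]):
            libraries[path].append(record)
    return dict(libraries)


# Hypothetical sample records.
sample = [
    {"expression_id": "expr_000001", "status": "candidate"},
    {"expression_id": "expr_000002", "status": "accepted"},
    {"expression_id": "expr_000003", "status": "strong"},
    {"expression_id": "expr_000004", "status": "canonical"},
    {"expression_id": "expr_000005", "status": "rejected"},
]
libraries = route(sample)
```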

---

## 10. Training Rule

Do not train on raw generated churn.

Training material may use:

```text
accepted expressions
strong expressions
canonical expressions
human-revised expressions
dialogues that naturally include reviewed expressions
```

Training material must not use:

```text
unreviewed candidate output
rejected output
bulk generated noise
expressions marked revise but not rewritten
```

The generator is a discovery tool, not an author of record.

---

## 11. Simulator Use

Canonical expressions can help the simulator narrate recurring conditions.

Example simulator state:

```yaml
condition: transport_capacity_lost
object: cart
cause: rival_hired_carts
urgency: buyer_waiting
actor_voice: Secundus
```

Possible canonical output:

```text
The wheels are gone.
```

Expanded output:

```text
The wheels are gone, and the buyer will not wait for our excuses.
```

Actor variants:

```text
Varro:
The bridge was taken before the column moved.

Felix:
Naso bought the road, not the oil.

Chresimus:
The account must show why the jars did not move.

Secundus:
The wheels are gone. Ten jars can still go by mule.
```

The simulator should prefer canonical lines for repeated conditions and strong lines for color.
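
The preference rule can be sketched as a small lookup. Everything structural here is an assumption for illustration: the in-memory `LIBRARY` list, its `condition`, `status`, and `text` keys, the `narrate` function, and the `want_color` flag are hypothetical, and tagging the expanded line as `strong` is a guess since this document does not assign it a status.

```python
from typing import Dict, List, Optional

# Hypothetical library entries keyed by simulator condition.
LIBRARY: List[Dict[str, str]] = [
    {
        "condition": "transport_capacity_lost",
        "status": "canonical",
        "text": "The wheels are gone.",
    },
    {
        "condition": "transport_capacity_lost",
        "status": "strong",  # assumed status for the expanded line
        "text": "The wheels are gone, and the buyer will not wait for our excuses.",
    },
]


def narrate(condition: str, library: List[Dict[str, str]],
            want_color: bool = False) -> Optional[str]:
    """Pick a line: canonical by default, strong when color is requested.

    Falls back to the other status if the preferred one is missing,
    and returns None when the condition has no reviewed line at all.
    """
    order = ("strong", "canonical") if want_color else ("canonical", "strong")
    for status in order:
        for entry in library:
            if entry["condition"] == condition and entry["status"] == status:
                return entry["text"]
    return None


plain = narrate("transport_capacity_lost", LIBRARY)
colored = narrate("transport_capacity_lost", LIBRARY, want_color=True)
```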

---

## 12. Generator Design

A simple generator can begin as a Cartesian combiner with templates.

Template examples:

```text
The {object} {action_phrase} while {pressure_phrase}.
A {object} without {support_object} is {metaphor_result}.
The {pressure_object} has reached {target} before {expected_event}.
{actor_voice} would say: "{expression}"
```
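
A minimal sketch of such a combiner, using the first template and `itertools.product`, might look like the following. The phrase mappings `ACTION_PHRASES` and `PRESSURE_PHRASES`, the function `generate_candidates`, and the sequential id scheme are assumptions made for illustration; the output is deliberately unfiltered and every record starts as `candidate`.

```python
from itertools import product
from typing import Dict, List

OBJECTS = ["cart", "tablet"]
# Hypothetical surface phrases for a few controlled actions and pressures.
ACTION_PHRASES = {"delay": "is delayed", "break": "is broken"}
PRESSURE_PHRASES = {"buyer urgency": "the buyer waits", "rain": "rain falls"}

TEMPLATE = "The {object} {action_phrase} while {pressure_phrase}."


def generate_candidates(objects: List[str],
                        action_phrases: Dict[str, str],
                        pressure_phrases: Dict[str, str],
                        template: str = TEMPLATE) -> List[Dict[str, str]]:
    """Fill the template with every object x action x pressure combination."""
    combos = product(objects, action_phrases.items(), pressure_phrases.items())
    candidates = []
    for n, (obj, (action, a_phrase), (pressure, p_phrase)) in enumerate(combos, start=1):
        candidates.append({
            "expression_id": f"expr_{n:06d}",  # assumed id scheme
            "object": obj,
            "action": action,
            "pressure": pressure,
            "candidate": template.format(
                object=obj, action_phrase=a_phrase, pressure_phrase=p_phrase
            ),
            "status": "candidate",  # nothing is promoted without review
        })
    return candidates


candidates = generate_candidates(OBJECTS, ACTION_PHRASES, PRESSURE_PHRASES)
```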

But the generator should be constrained by compatibility rules.

Bad combinations should be filtered before review where possible.

Example:

```text
coin + hired_elsewhere + rain
```

may produce nonsense unless transformed carefully.

Good combinations:

```text
cart + hired_elsewhere + buyer_waiting
tablet + old + road_delay
warehouse + full + merchant_urgency
coin + visible + street_eyes
seal + broken + official_attention
```

The generator should prefer semantically compatible sets.

---

## 13. Compatibility Tags

Objects, actions, and pressures should eventually carry compatibility tags.

Example:

```yaml
object: cart
compatible_actions:
  - hired
  - missing
  - broken
  - delayed
  - overloaded
compatible_pressures:
  - buyer_waiting
  - rival_obstruction
  - bad_road
  - delivery_deadline
```

Example:

```yaml
object: tablet
compatible_actions:
  - written
  - sealed
  - old
  - disputed
  - witnessed
compatible_pressures:
  - stale_news
  - legal_exposure
  - source_motive
  - settlement_dispute
```

This improves candidate quality without eliminating human review.
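
The two tag sets above could drive a pre-review filter along these lines. This is a sketch under stated assumptions: the `COMPATIBILITY` table simply transcribes the two YAML examples, the function names are hypothetical, and treating untagged objects as incompatible is one possible policy rather than a rule from this document.

```python
from itertools import product
from typing import Dict, List, Set, Tuple

# Transcribed from the two YAML examples in this section.
COMPATIBILITY: Dict[str, Dict[str, Set[str]]] = {
    "cart": {
        "actions": {"hired", "missing", "broken", "delayed", "overloaded"},
        "pressures": {"buyer_waiting", "rival_obstruction", "bad_road",
                      "delivery_deadline"},
    },
    "tablet": {
        "actions": {"written", "sealed", "old", "disputed", "witnessed"},
        "pressures": {"stale_news", "legal_exposure", "source_motive",
                      "settlement_dispute"},
    },
}


def is_compatible(obj: str, action: str, pressure: str) -> bool:
    """True only if the object is tagged and both tags allow the pairing."""
    entry = COMPATIBILITY.get(obj)
    if entry is None:
        return False  # assumed policy: untagged objects are held back
    return action in entry["actions"] and pressure in entry["pressures"]


def compatible_triples(objects: List[str], actions: List[str],
                       pressures: List[str]) -> List[Tuple[str, str, str]]:
    """Enumerate the Cartesian product and keep only compatible triples."""
    return [t for t in product(objects, actions, pressures) if is_compatible(*t)]


triples = compatible_triples(
    ["cart", "tablet"], ["old", "broken"], ["bad_road", "stale_news"]
)
```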

---

## 14. Review Speed Target

The process is designed for fast human selection.

Target review speed:

```text
200 to 500 candidates per hour
```

This is realistic only if the review interface is simple.

Each candidate should support one-key marking:

```text
a = accept
r = reject
v = revise
s = strong
c = canonical
```
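
At 200 to 500 candidates per hour the reviewer has roughly 7 to 18 seconds per line, so the key handling has to be trivial. A minimal sketch of the key-to-status mapping follows; `KEY_TO_STATUS` and `apply_key` are hypothetical names, and the status strings are taken from the status vocabulary in section 5.

```python
from typing import Dict

# One key per decision, mapped to the section 5 status vocabulary.
KEY_TO_STATUS: Dict[str, str] = {
    "a": "accepted",
    "r": "rejected",
    "v": "revise",
    "s": "strong",
    "c": "canonical",
}


def apply_key(record: dict, key: str) -> dict:
    """Return a copy of the record with the status chosen by one keypress."""
    status = KEY_TO_STATUS.get(key.lower())
    if status is None:
        raise ValueError(f"unknown review key: {key!r}")
    updated = dict(record)  # leave the original record untouched
    updated["status"] = status
    return updated


original = {"expression_id": "expr_000142", "status": "candidate"}
marked = apply_key(original, "s")
```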

The reviewer should not be forced to edit every line.

Editing should be reserved for promising expressions.

---

## 15. Success Condition

This workflow is successful if it produces a growing library of Roman-visible expressions faster than direct hand-authoring.

A good result is not a clean generator.

A good result is a strong reviewed vocabulary.

The approved vocabulary should improve:

```text
dialogue writing
simulator narration
actor voice consistency
contamination resistance
model training data
```

The final test is whether the model prefers:

```text
The wheels are gone.
The tablet arrived old.
He owns jars, not coin.
The purse is fat and the street has eyes.
```

over:

```text
Transport capacity is constrained.
The information is stale.
His assets are illiquid.
His liquidity creates security risk.
```

The purpose is not style alone.

The purpose is to build a bounded Roman commercial ontology one approved phrase at a time.