VOCABULARY-GENERATION-0001
Generate, Review, And Promote Roman-Visible Expressions
Status: Draft Standard
Layer: Training Infrastructure
Purpose: Define a fast human-in-the-loop workflow for building OTIVM's Roman-visible model vocabulary
Repository Path: docs/training/chunking/VOCABULARY-GENERATION-0001.md
0. Purpose
This document defines a workflow for generating and selecting Roman-visible commercial expressions.
The purpose is to build the model vocabulary faster than hand-authoring every line.
The generator may produce large amounts of weak or useless material. That is acceptable.
The training corpus must receive only reviewed and accepted material.
The workflow is:
generate many candidates
human flags useful expressions
accepted expressions become vocabulary records
strong expressions become dialogue material
canonical expressions become simulator templates
The churn is not the asset.
The approved expression is the asset.
1. Core Idea
A Roman-visible expression can often be generated from three elements:
Object + Action + Pressure
Examples:
coin + hide + street eyes
= The purse is fat and the street has eyes.
cart + hired elsewhere + buyer waiting
= The wheels are gone while the buyer counts the hours.
tablet + old + road delay
= The tablet arrived older than its promise.
jar + no cart + delivery obligation
= A jar without wheels is a promise sitting in straw.
warehouse roof + rain + merchant urgency
= The roof earns coin when rain walks the street.
This is not ordinary paraphrase.
It is ontology building.
The model learns what kind of world it inhabits by seeing which objects, actions, and pressures are allowed to combine.
2. Why This Works
Humans are often faster at recognizing a good phrase than inventing one from nothing.
A generator can produce hundreds or thousands of combinations.
Most will be poor.
A human reviewer can scroll quickly and mark:
accept
reject
revise
strong
canonical
The useful lines will emerge faster than through direct composition.
The process is closer to quarrying stone than writing prose.
The generator produces rough stone.
The reviewer selects blocks worth dressing.
The corpus receives only dressed blocks.
3. Controlled Input Sets
The generator should not begin with unrestricted language.
It should combine controlled lists.
Objects
coin
purse
chest
tablet
seal
witness
cart
wheel
mule
road
warehouse
wall
roof
jar
amphora
crate
rope
weight
measure
gate
market
portico
yard
dust
rain
lamp
grain
oil
bronze
timber
glass
stone
Actions
buy
sell
carry
store
seal
open
count
weigh
measure
pledge
write
witness
hire
repair
delay
ask
refuse
accuse
confirm
return
split
hold
move
settle
hide
leak
wait
rot
spoil
break
arrive
depart
Pressures
hunger
rain
delay
spoilage
debt
rivalry
shame
praise
shortage
crowd
rumor
cart scarcity
storage scarcity
buyer urgency
creditor pressure
official attention
bad road
old news
broken seal
empty purse
full warehouse
Actor Voices
Varro
Felix
Lentulus
Crispus
Secundus
Chresimus
neutral narrator
The generator should combine these into candidate expressions, not final truth.
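A minimal Python sketch of that combiner follows. The lists are abbreviated from the full sets above, and the function name is illustrative, not a required implementation.
import itertools

# Abbreviated controlled lists; the full sets appear above.
OBJECTS = ["coin", "cart", "tablet", "jar", "warehouse"]
ACTIONS = ["hide", "hired_elsewhere", "old", "delay", "store"]
PRESSURES = ["street_eyes", "buyer_waiting", "road_delay", "rain"]

def raw_triples():
    """Yield every object/action/pressure triple as raw candidate material."""
    for obj, action, pressure in itertools.product(OBJECTS, ACTIONS, PRESSURES):
        yield {"object": obj, "action": action, "pressure": pressure}

# Most triples will be weak or nonsensical; that is expected.
# They become candidate expressions only after templating (section 12)
# and compatibility filtering (section 13), and never skip human review.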
4. Candidate Expression Record
Each generated expression should be stored as a reviewable record.
Recommended JSONL form:
{
  "expression_id": "expr_000142",
  "domain": "commerce",
  "object": "cart",
  "action": "hired_elsewhere",
  "pressure": "buyer_waiting",
  "actor_voice": "Secundus",
  "candidate": "The wheels are gone, and the buyer will not wait for our excuses.",
  "modern_meaning": "Cart capacity has been lost, but partial shipment may still be possible.",
  "concept_tags": [
    "transport_capacity",
    "delay_cost",
    "buyer_need"
  ],
  "status": "candidate",
  "strength": null,
  "review_note": null
}
Candidate records are review material only.
They are not training material until promoted.
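A sketch of appending candidate records to the pool follows. The field names mirror the record above and the file path matches section 9; the helper itself is illustrative.
import json

def write_candidate(record, path="data/vocabulary/candidates.jsonl"):
    """Append one candidate record as a single JSON line (review material only)."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

candidate = {
    "expression_id": "expr_000142",
    "object": "cart",
    "action": "hired_elsewhere",
    "pressure": "buyer_waiting",
    "actor_voice": "Secundus",
    "candidate": "The wheels are gone, and the buyer will not wait for our excuses.",
    "status": "candidate",
    # remaining fields follow the record shown above
}
write_candidate(candidate)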
5. Review Status
Use a small status vocabulary.
candidate
accepted
rejected
revise
strong
canonical
Meaning:
candidate: generated but not reviewed
accepted: good enough to enter the vocabulary library
rejected: not useful; do not train on it
revise: promising but needs human rewrite
strong: useful enough to inspire dialogue lines
canonical: preferred phrasing for a recurring simulator condition
Only these should enter training or simulator-facing data:
accepted
strong
canonical
Rejected and unreviewed candidates should be retained only for audit or generator improvement.
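One way to keep this vocabulary closed is a small enum, sketched below; the class and constant names are illustrative, not part of the standard.
from enum import Enum

class Status(str, Enum):
    CANDIDATE = "candidate"
    ACCEPTED = "accepted"
    REJECTED = "rejected"
    REVISE = "revise"
    STRONG = "strong"
    CANONICAL = "canonical"

# Only these statuses may reach training or simulator-facing data.
PROMOTED = {Status.ACCEPTED, Status.STRONG, Status.CANONICAL}

def is_promoted(status):
    """True when a status string belongs to the promoted set."""
    return Status(status) in PROMOTED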
6. Human Review Rules
The reviewer should ask:
- Is the line Roman-visible?
- Does it avoid modern abstraction?
- Does it express a real commercial condition?
- Does it use an object, an action, or a pressure rather than explanation?
- Could one of the listed actor voices plausibly say it?
- Is it compact enough to be useful?
- Does it avoid parody or over-stylized speech?
- Does it teach the model a useful pattern?
Reject lines that are merely clever.
Accept lines that create usable world-language.
Promote lines that can recur across scenes.
7. Rejection Reasons
Common rejection reasons:
too modern
too abstract
too theatrical
too vague
wrong actor voice
no commercial meaning
no Roman-visible object
mixed metaphor
unusable in dialogue
duplicates existing phrase
Optional review fields:
{
  "status": "rejected",
  "review_note": "too modern: sounds like business-school language"
}
or:
{
  "status": "revise",
  "review_note": "good image, but too ornate for Secundus"
}
8. Promotion Levels
Accepted
Useful phrase. Can be stored in the vocabulary library.
Example:
The tablet arrived old.
Strong
Useful phrase that should influence dialogue writing.
Example:
A jar without wheels is a promise sitting in straw.
Canonical
Preferred phrase for a repeated simulator condition.
Example:
The wheels are gone.
Canonical expressions should be few.
If too many phrases are canonical, none are canonical.
9. Output Libraries
The workflow should produce three outputs.
Candidate Pool
data/vocabulary/candidates.jsonl
Generated material, mostly unreviewed.
Reviewed Vocabulary
data/vocabulary/roman_visible_expressions.jsonl
Accepted, strong, and canonical expressions only.
Canonical Templates
data/vocabulary/canonical_templates.jsonl
Small set of recurring simulator-ready expressions.
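A sketch of the split pass that produces the second and third files from the first is shown below. The paths match the outputs above and the statuses follow section 5; the function itself is only an assumption about how the pass might be written.
import json

PROMOTED = {"accepted", "strong", "canonical"}

def build_libraries(candidates_path="data/vocabulary/candidates.jsonl",
                    reviewed_path="data/vocabulary/roman_visible_expressions.jsonl",
                    canonical_path="data/vocabulary/canonical_templates.jsonl"):
    """Copy promoted records to the reviewed library; canonical records also become templates."""
    with open(candidates_path, encoding="utf-8") as src, \
         open(reviewed_path, "w", encoding="utf-8") as reviewed, \
         open(canonical_path, "w", encoding="utf-8") as canonical:
        for line in src:
            record = json.loads(line)
            status = record.get("status")
            if status in PROMOTED:
                reviewed.write(json.dumps(record, ensure_ascii=False) + "\n")
            if status == "canonical":
                canonical.write(json.dumps(record, ensure_ascii=False) + "\n")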
10. Training Rule
Do not train on raw generated churn.
Training material may use:
accepted expressions
strong expressions
canonical expressions
human-revised expressions
dialogues that naturally include reviewed expressions
Training material must not use:
unreviewed candidate output
rejected output
bulk generated noise
expressions marked revise but not rewritten
The generator is a discovery tool, not an author of record.
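The rule above can be expressed as a single predicate, sketched here with the field names of the candidate record; the helper is illustrative.
TRAINABLE_STATUSES = {"accepted", "strong", "canonical"}

def is_trainable(record):
    """True only for reviewed, promoted expressions.

    Unreviewed candidates, rejected output, and lines still marked 'revise'
    are excluded; a revised line qualifies only after a human rewrite moves
    its status to accepted, strong, or canonical.
    """
    return record.get("status") in TRAINABLE_STATUSES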
11. Simulator Use
Canonical expressions can help the simulator narrate recurring conditions.
Example simulator state:
condition: transport_capacity_lost
object: cart
cause: rival_hired_carts
urgency: buyer_waiting
actor_voice: Secundus
Possible canonical output:
The wheels are gone.
Expanded output:
The wheels are gone, and the buyer will not wait for our excuses.
Actor variants:
Varro:
The bridge was taken before the column moved.
Felix:
Naso bought the road, not the oil.
Chresimus:
The account must show why the jars did not move.
Secundus:
The wheels are gone. Ten jars can still go by mule.
The simulator should prefer canonical lines for repeated conditions and strong lines for color.
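A sketch of how that preference might be resolved follows. The lookup table, condition key, and function are assumptions for illustration, not a defined simulator API; the lines themselves are taken from the examples above.
# Hypothetical lookup table keyed by simulator condition.
CANONICAL_LINES = {
    "transport_capacity_lost": {
        "canonical": "The wheels are gone.",
        "variants": {
            "Varro": "The bridge was taken before the column moved.",
            "Felix": "Naso bought the road, not the oil.",
            "Chresimus": "The account must show why the jars did not move.",
            "Secundus": "The wheels are gone. Ten jars can still go by mule.",
        },
    },
}

def narrate(condition, actor_voice=None):
    """Prefer an actor variant when one exists, otherwise fall back to the canonical line."""
    entry = CANONICAL_LINES[condition]
    variants = entry["variants"]
    if actor_voice in variants:
        return variants[actor_voice]
    return entry["canonical"]

# narrate("transport_capacity_lost", "Secundus")
# -> "The wheels are gone. Ten jars can still go by mule."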
12. Generator Design
A simple generator can begin as a Cartesian combiner with templates.
Template examples:
The {object} {action_phrase} while {pressure_phrase}.
A {object} without {support_object} is {metaphor_result}.
The {pressure_object} has reached {target} before {expected_event}.
{actor_voice} would say: "{expression}"
But the generator should be constrained by compatibility rules.
Bad combinations should be filtered before review where possible.
Example:
coin + hired_elsewhere + rain
may produce nonsense unless transformed carefully.
Good combinations:
cart + hired_elsewhere + buyer_waiting
tablet + old + road_delay
warehouse + full + merchant_urgency
coin + visible + street_eyes
seal + broken + official_attention
The generator should prefer semantically compatible sets.
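A sketch of template filling restricted to known-good combinations follows. The template mirrors the first example above; the phrase tables and the whitelist are illustrative stand-ins for real compatibility data.
# Known-good combinations taken from the examples above (abbreviated).
GOOD_COMBINATIONS = [
    ("cart", "hired_elsewhere", "buyer_waiting"),
    ("tablet", "old", "road_delay"),
    ("warehouse", "full", "merchant_urgency"),
]

# Illustrative phrase fragments for the template slots.
ACTION_PHRASES = {
    "hired_elsewhere": "is hired elsewhere",
    "old": "arrives old",
    "full": "stands full",
}
PRESSURE_PHRASES = {
    "buyer_waiting": "the buyer counts the hours",
    "road_delay": "the road eats the days",
    "merchant_urgency": "the merchant will not wait",
}

TEMPLATE = "The {object} {action_phrase} while {pressure_phrase}."

def generate_candidates():
    """Fill the template only for combinations already judged compatible."""
    for obj, action, pressure in GOOD_COMBINATIONS:
        yield TEMPLATE.format(
            object=obj,
            action_phrase=ACTION_PHRASES[action],
            pressure_phrase=PRESSURE_PHRASES[pressure],
        )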
13. Compatibility Tags
Objects, actions, and pressures should eventually carry compatibility tags.
Example:
object: cart
compatible_actions:
  - hired
  - missing
  - broken
  - delayed
  - overloaded
compatible_pressures:
  - buyer_waiting
  - rival_obstruction
  - bad_road
  - delivery_deadline
Example:
object: tablet
compatible_actions:
  - written
  - sealed
  - old
  - disputed
  - witnessed
compatible_pressures:
  - stale_news
  - legal_exposure
  - source_motive
  - settlement_dispute
This improves candidate quality without eliminating human review.
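A sketch of tag-driven filtering follows. The dictionary mirrors the cart and tablet examples above; the fallback behaviour for untagged objects is an assumption.
COMPATIBILITY = {
    "cart": {
        "actions": {"hired", "missing", "broken", "delayed", "overloaded"},
        "pressures": {"buyer_waiting", "rival_obstruction", "bad_road", "delivery_deadline"},
    },
    "tablet": {
        "actions": {"written", "sealed", "old", "disputed", "witnessed"},
        "pressures": {"stale_news", "legal_exposure", "source_motive", "settlement_dispute"},
    },
}

def is_compatible(obj, action, pressure):
    """Drop a triple before review only when its object explicitly disallows the action or pressure."""
    entry = COMPATIBILITY.get(obj)
    if entry is None:
        return True  # untagged objects pass through to human review unfiltered
    return action in entry["actions"] and pressure in entry["pressures"]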
14. Review Speed Target
The process is designed for fast human selection.
Target review speed:
200 to 500 candidates per hour
This is realistic only if the review interface is simple.
Each candidate should support one-key marking:
a = accept
r = reject
v = revise
s = strong
c = canonical
The reviewer should not be forced to edit every line.
Editing should be reserved for promising expressions.
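A terminal review loop might look like the sketch below. The key bindings match the list above; the output path is illustrative, and input() is a line-based approximation of true one-key marking.
import json

KEYMAP = {"a": "accepted", "r": "rejected", "v": "revise", "s": "strong", "c": "canonical"}

def review(candidates_path="data/vocabulary/candidates.jsonl",
           output_path="data/vocabulary/candidates_reviewed.jsonl"):
    """Show each candidate line and record a one-key verdict; unknown keys leave it unreviewed."""
    with open(candidates_path, encoding="utf-8") as src, \
         open(output_path, "w", encoding="utf-8") as out:
        for line in src:
            record = json.loads(line)
            print(record["candidate"])
            key = input("[a]ccept [r]eject re[v]ise [s]trong [c]anonical > ").strip().lower()
            record["status"] = KEYMAP.get(key, "candidate")
            out.write(json.dumps(record, ensure_ascii=False) + "\n")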
15. Success Condition
This workflow is successful if it produces a growing library of Roman-visible expressions faster than direct hand-authoring.
A good result is not a clean generator.
A good result is a strong reviewed vocabulary.
The approved vocabulary should improve:
dialogue writing
simulator narration
actor voice consistency
contamination resistance
model training data
The final test is whether the model prefers:
The wheels are gone.
The tablet arrived old.
He owns jars, not coin.
The purse is fat and the street has eyes.
over:
Transport capacity is constrained.
The information is stale.
His assets are illiquid.
His liquidity creates security risk.
The purpose is not style alone.
The purpose is to build a bounded Roman commercial ontology one approved phrase at a time.