# VOCABULARY-GENERATION-0001

## Generate, Review, And Promote Roman-Visible Expressions

### Status: Draft Standard
### Layer: Training Infrastructure
### Purpose: Define a fast human-in-the-loop workflow for building OTIVM's Roman-visible model vocabulary
### Repository Path: docs/training/chunking/VOCABULARY-GENERATION-0001.md

---

## 0. Purpose

This document defines a workflow for generating and selecting Roman-visible commercial expressions. The purpose is to build the model vocabulary faster than hand-authoring every line.

The generator may produce large amounts of weak or useless material. That is acceptable. The training corpus must only receive reviewed and accepted material.

The workflow is:

```text
generate many candidates
human flags useful expressions
accepted expressions become vocabulary records
strong expressions become dialogue material
canonical expressions become simulator templates
```

The churn is not the asset. The approved expression is the asset.

---

## 1. Core Idea

A Roman-visible expression can often be generated from three elements:

```text
Object + Action + Pressure
```

Examples:

```text
coin + hide + street eyes = The purse is fat and the street has eyes.
cart + hired elsewhere + buyer waiting = The wheels are gone while the buyer counts the hours.
tablet + old + road delay = The tablet arrived older than its promise.
jar + no cart + delivery obligation = A jar without wheels is a promise sitting in straw.
warehouse roof + rain + merchant urgency = The roof earns coin when rain walks the street.
```

This is not ordinary paraphrase. It is ontology building. The model learns what kind of world it inhabits by seeing which objects, actions, and pressures are allowed to combine.

---

## 2. Why This Works

Humans are often faster at recognizing a good phrase than inventing one from nothing. A generator can produce hundreds or thousands of combinations. Most will be poor.
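The scale of that churn follows directly from the controlled lists in Section 3. A quick count of the raw, unfiltered combination space (list sizes counted from that section):

```python
# Rough size of the raw combination space before any compatibility
# filtering. Sizes counted from the Section 3 lists.
objects, actions, pressures = 32, 32, 21

combinations = objects * actions * pressures
print(combinations)  # → 21504
```

Over twenty thousand raw triples from three short lists, which is why cheap generation plus fast human selection beats hand-authoring.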
A human reviewer can scroll quickly and mark:

```text
accept
reject
revise
strong
canonical
```

The useful lines will emerge faster than through direct composition.

The process is closer to quarrying stone than writing prose. The generator produces rough stone. The reviewer selects blocks worth dressing. The corpus receives only dressed blocks.

---

## 3. Controlled Input Sets

The generator should not begin with unrestricted language. It should combine controlled lists.

### Objects

```text
coin
purse
chest
tablet
seal
witness
cart
wheel
mule
road
warehouse
wall
roof
jar
amphora
crate
rope
weight
measure
gate
market
portico
yard
dust
rain
lamp
grain
oil
bronze
timber
glass
stone
```

### Actions

```text
buy
sell
carry
store
seal
open
count
weigh
measure
pledge
write
witness
hire
repair
delay
ask
refuse
accuse
confirm
return
split
hold
move
settle
hide
leak
wait
rot
spoil
break
arrive
depart
```

### Pressures

```text
hunger
rain
delay
spoilage
debt
rivalry
shame
praise
shortage
crowd
rumor
cart scarcity
storage scarcity
buyer urgency
creditor pressure
official attention
bad road
old news
broken seal
empty purse
full warehouse
```

### Actor Voices

```text
Varro
Felix
Lentulus
Crispus
Secundus
Chresimus
neutral narrator
```

The generator should combine these into candidate expressions, not final truth.

---

## 4. Candidate Expression Record

Each generated expression should be stored as a reviewable record.

Recommended JSONL form:

```json
{
  "expression_id": "expr_000142",
  "domain": "commerce",
  "object": "cart",
  "action": "hired_elsewhere",
  "pressure": "buyer_waiting",
  "actor_voice": "Secundus",
  "candidate": "The wheels are gone, and the buyer will not wait for our excuses.",
  "modern_meaning": "Cart capacity has been lost, but partial shipment may still be possible.",
  "concept_tags": [
    "transport_capacity",
    "delay_cost",
    "buyer_need"
  ],
  "status": "candidate",
  "strength": null,
  "review_note": null
}
```

Candidate records are review material only.
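A minimal sketch of building one record in this shape, with a hypothetical `new_candidate` helper (field values illustrative):

```python
import json

def new_candidate(idx, obj, action, pressure, voice, text, meaning, tags):
    """Build one reviewable record in the recommended JSONL shape."""
    return {
        "expression_id": f"expr_{idx:06d}",
        "domain": "commerce",
        "object": obj,
        "action": action,
        "pressure": pressure,
        "actor_voice": voice,
        "candidate": text,
        "modern_meaning": meaning,
        "concept_tags": tags,
        "status": "candidate",   # every generated record starts unreviewed
        "strength": None,
        "review_note": None,
    }

record = new_candidate(
    142, "cart", "hired_elsewhere", "buyer_waiting", "Secundus",
    "The wheels are gone, and the buyer will not wait for our excuses.",
    "Cart capacity has been lost, but partial shipment may still be possible.",
    ["transport_capacity", "delay_cost", "buyer_need"],
)

# One JSONL line, ready to append to the candidate pool file.
print(json.dumps(record))
```

Every record leaves this helper as an unreviewed candidate in the pool.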
They are not training material until promoted.

---

## 5. Review Status

Use a small status vocabulary.

```text
candidate
accepted
rejected
revise
strong
canonical
```

Meaning:

```text
candidate: generated but not reviewed
accepted: good enough to enter the vocabulary library
rejected: not useful; do not train on it
revise: promising but needs human rewrite
strong: useful enough to inspire dialogue lines
canonical: preferred phrasing for a recurring simulator condition
```

Only these statuses should enter training or simulator-facing data:

```text
accepted
strong
canonical
```

Rejected and unreviewed candidates should be retained only for audit or generator improvement.

---

## 6. Human Review Rules

The reviewer should ask:

1. Is the line Roman-visible?
2. Does it avoid modern abstraction?
3. Does it express a real commercial condition?
4. Does it use objects, actions, and pressures rather than explanation?
5. Could one of the six actor voices plausibly say it?
6. Is it compact enough to be useful?
7. Does it avoid parody or over-stylized speech?
8. Does it teach the model a useful pattern?

Reject lines that are merely clever. Accept lines that create usable world-language. Promote lines that can recur across scenes.

---

## 7. Rejection Reasons

Common rejection reasons:

```text
too modern
too abstract
too theatrical
too vague
wrong actor voice
no commercial meaning
no Roman-visible object
mixed metaphor
unusable in dialogue
duplicates existing phrase
```

Optional review fields:

```json
{
  "status": "rejected",
  "review_note": "too modern: sounds like business-school language"
}
```

or:

```json
{
  "status": "revise",
  "review_note": "good image, but too ornate for Secundus"
}
```

---

## 8. Promotion Levels

### Accepted

Useful phrase. Can be stored in the vocabulary library.

Example:

```text
The tablet arrived old.
```

### Strong

Useful phrase that should influence dialogue writing.

Example:

```text
A jar without wheels is a promise sitting in straw.
```

### Canonical

Preferred phrase for a repeated simulator condition.

Example:

```text
The wheels are gone.
```

Canonical expressions should be few. If too many phrases are canonical, none are canonical.

---

## 9. Output Libraries

The workflow should produce three outputs.

### Candidate Pool

```text
data/vocabulary/candidates.jsonl
```

Generated material, mostly unreviewed.

### Reviewed Vocabulary

```text
data/vocabulary/roman_visible_expressions.jsonl
```

Accepted, strong, and canonical expressions only.

### Canonical Templates

```text
data/vocabulary/canonical_templates.jsonl
```

Small set of recurring simulator-ready expressions.

---

## 10. Training Rule

Do not train on raw generated churn.

Training material may use:

```text
accepted expressions
strong expressions
canonical expressions
human-revised expressions
dialogues that naturally include reviewed expressions
```

Training material must not use:

```text
unreviewed candidate output
rejected output
bulk generated noise
expressions marked revise but not rewritten
```

The generator is a discovery tool, not an author of record.

---

## 11. Simulator Use

Canonical expressions can help the simulator narrate recurring conditions.

Example simulator state:

```yaml
condition: transport_capacity_lost
object: cart
cause: rival_hired_carts
urgency: buyer_waiting
actor_voice: Secundus
```

Possible canonical output:

```text
The wheels are gone.
```

Expanded output:

```text
The wheels are gone, and the buyer will not wait for our excuses.
```

Actor variants:

```text
Varro: The bridge was taken before the column moved.
Felix: Naso bought the road, not the oil.
Chresimus: The account must show why the jars did not move.
Secundus: The wheels are gone. Ten jars can still go by mule.
```

The simulator should prefer canonical lines for repeated conditions and strong lines for color.

---

## 12. Generator Design

A simple generator can begin as a Cartesian combiner with templates.
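A minimal sketch of such a combiner, using tiny illustrative input lists and one fixed template (all names hypothetical):

```python
import itertools

# Tiny illustrative input sets; the full controlled lists are in Section 3.
objects = ["cart", "tablet", "seal"]
action_phrases = ["was hired elsewhere", "arrived old", "sat unsealed"]
pressure_phrases = ["the buyer waits", "the road delays", "the street has eyes"]

# Naive Cartesian combiner with one fixed template. Every triple
# becomes one rough candidate line for human review; no compatibility
# filtering is applied yet.
candidates = [
    f"The {obj} {act} while {prs}."
    for obj, act, prs in itertools.product(objects, action_phrases, pressure_phrases)
]

print(len(candidates))   # → 27
print(candidates[0])     # → The cart was hired elsewhere while the buyer waits.
```

Most of these 27 lines will be rough stone; the reviewer, not the combiner, decides which are worth dressing.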
Template examples:

```text
The {object} {action_phrase} while {pressure_phrase}.
A {object} without {support_object} is {metaphor_result}.
The {pressure_object} has reached {target} before {expected_event}.
{actor_voice} would say: "{expression}"
```

But the generator should be constrained by compatibility rules. Bad combinations should be filtered before review where possible.

Example:

```text
coin + hired_elsewhere + rain
```

may produce nonsense unless transformed carefully.

Good combinations:

```text
cart + hired_elsewhere + buyer_waiting
tablet + old + road_delay
warehouse + full + merchant_urgency
coin + visible + street_eyes
seal + broken + official_attention
```

The generator should prefer semantically compatible sets.

---

## 13. Compatibility Tags

Objects, actions, and pressures should eventually carry compatibility tags.

Example:

```yaml
object: cart
compatible_actions:
  - hired
  - missing
  - broken
  - delayed
  - overloaded
compatible_pressures:
  - buyer_waiting
  - rival_obstruction
  - bad_road
  - delivery_deadline
```

Example:

```yaml
object: tablet
compatible_actions:
  - written
  - sealed
  - old
  - disputed
  - witnessed
compatible_pressures:
  - stale_news
  - legal_exposure
  - source_motive
  - settlement_dispute
```

This improves candidate quality without eliminating human review.

---

## 14. Review Speed Target

The process is designed for fast human selection.

Target review speed:

```text
200 to 500 candidates per hour
```

This is realistic only if the review interface is simple. Each candidate should support one-key marking:

```text
a = accept
r = reject
v = revise
s = strong
c = canonical
```

The reviewer should not be forced to edit every line. Editing should be reserved for promising expressions.

---

## 15. Success Condition

This workflow is successful if it produces a growing library of Roman-visible expressions faster than direct hand-authoring.

A good result is not a clean generator. A good result is a strong reviewed vocabulary.
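Materializing that reviewed vocabulary from the candidate pool can be sketched as a status filter over the JSONL files; a sketch, assuming the file paths from Section 9 and a hypothetical `promote` helper:

```python
import json

# Statuses allowed into training and simulator-facing data (Section 5).
PROMOTABLE = {"accepted", "strong", "canonical"}

def promote(candidates_path, vocabulary_path):
    """Copy reviewed, promotable records from the candidate pool
    into the reviewed vocabulary library, one JSON object per line."""
    with open(candidates_path, encoding="utf-8") as src, \
         open(vocabulary_path, "w", encoding="utf-8") as dst:
        for line in src:
            record = json.loads(line)
            if record.get("status") in PROMOTABLE:
                dst.write(json.dumps(record) + "\n")

# promote("data/vocabulary/candidates.jsonl",
#         "data/vocabulary/roman_visible_expressions.jsonl")
```

Rejected, unreviewed, and un-rewritten `revise` records never cross this boundary, which is what keeps churn out of the corpus.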
The approved vocabulary should improve:

```text
dialogue writing
simulator narration
actor voice consistency
contamination resistance
model training data
```

The final test is whether the model prefers:

```text
The wheels are gone.
The tablet arrived old.
He owns jars, not coin.
The purse is fat and the street has eyes.
```

over:

```text
Transport capacity is constrained.
The information is stale.
His assets are illiquid.
His liquidity creates security risk.
```

The purpose is not style alone. The purpose is to build a bounded Roman commercial ontology one approved phrase at a time.