diff --git a/docs/training/chunking/VOCABULARY-GENERATION-0001.md b/docs/training/chunking/VOCABULARY-GENERATION-0001.md
new file mode 100644
index 0000000..f3133a0
--- /dev/null
+++ b/docs/training/chunking/VOCABULARY-GENERATION-0001.md
@@ -0,0 +1,649 @@
+# VOCABULARY-GENERATION-0001
+## Generate, Review, And Promote Roman-Visible Expressions
+### Status: Draft Standard
+### Layer: Training Infrastructure
+### Purpose: Define a fast human-in-the-loop workflow for building OTIVM's Roman-visible model vocabulary
+### Repository Path: docs/training/chunking/VOCABULARY-GENERATION-0001.md
+
+---
+
+## 0. Purpose
+
+This document defines a workflow for generating and selecting Roman-visible commercial expressions.
+
+The purpose is to build the model vocabulary faster than hand-authoring every line.
+
+The generator may produce large amounts of weak or useless material. That is acceptable.
+
+The training corpus must receive only reviewed and accepted material.
+
+The workflow is:
+
+```text
+generate many candidates
+human flags useful expressions
+accepted expressions become vocabulary records
+strong expressions become dialogue material
+canonical expressions become simulator templates
+```
+
+The churn is not the asset.
+
+The approved expression is the asset.
+
+---
+
+## 1. Core Idea
+
+A Roman-visible expression can often be generated from three elements:
+
+```text
+Object + Action + Pressure
+```
+
+Examples:
+
+```text
+coin + hide + street eyes
+= The purse is fat and the street has eyes.
+
+cart + hired elsewhere + buyer waiting
+= The wheels are gone while the buyer counts the hours.
+
+tablet + old + road delay
+= The tablet arrived older than its promise.
+
+jar + no cart + delivery obligation
+= A jar without wheels is a promise sitting in straw.
+
+warehouse roof + rain + merchant urgency
+= The roof earns coin when rain walks the street.
+```
+
+This is not ordinary paraphrase.
+
+It is ontology building.
+
+The model learns what kind of world it inhabits by seeing which objects, actions, and pressures are allowed to combine.
+
+---
+
+## 2. Why This Works
+
+Humans are often faster at recognizing a good phrase than inventing one from nothing.
+
+A generator can produce hundreds or thousands of combinations.
+
+Most will be poor.
+
+A human reviewer can scroll quickly and mark:
+
+```text
+accept
+reject
+revise
+strong
+canonical
+```
+
+The useful lines will emerge faster than through direct composition.
+
+The process is closer to quarrying stone than writing prose.
+
+The generator produces rough stone.
+
+The reviewer selects blocks worth dressing.
+
+The corpus receives only dressed blocks.
+
+---
+
+## 3. Controlled Input Sets
+
+The generator should not begin with unrestricted language.
+
+It should combine controlled lists.
+
+### Objects
+
+```text
+coin
+purse
+chest
+tablet
+seal
+witness
+cart
+wheel
+mule
+road
+warehouse
+wall
+roof
+jar
+amphora
+crate
+rope
+weight
+measure
+gate
+market
+portico
+yard
+dust
+rain
+lamp
+grain
+oil
+bronze
+timber
+glass
+stone
+```
+
+### Actions
+
+```text
+buy
+sell
+carry
+store
+seal
+open
+count
+weigh
+measure
+pledge
+write
+witness
+hire
+repair
+delay
+ask
+refuse
+accuse
+confirm
+return
+split
+hold
+move
+settle
+hide
+leak
+wait
+rot
+spoil
+break
+arrive
+depart
+```
+
+### Pressures
+
+```text
+hunger
+rain
+delay
+spoilage
+debt
+rivalry
+shame
+praise
+shortage
+crowd
+rumor
+cart scarcity
+storage scarcity
+buyer urgency
+creditor pressure
+official attention
+bad road
+old news
+broken seal
+empty purse
+full warehouse
+```
+
+### Actor Voices
+
+```text
+Varro
+Felix
+Lentulus
+Crispus
+Secundus
+Chresimus
+neutral narrator
+```
+
+The generator should combine these into candidate expressions, not final truth.
+
+---
+
+## 4. Candidate Expression Record
+
+Each generated expression should be stored as a reviewable record.
+
+Recommended JSONL form (pretty-printed here for readability; in the file each record occupies a single line):
+
+```json
+{
+  "expression_id": "expr_000142",
+  "domain": "commerce",
+  "object": "cart",
+  "action": "hired_elsewhere",
+  "pressure": "buyer_waiting",
+  "actor_voice": "Secundus",
+  "candidate": "The wheels are gone, and the buyer will not wait for our excuses.",
+  "modern_meaning": "Cart capacity has been lost, but partial shipment may still be possible.",
+  "concept_tags": [
+    "transport_capacity",
+    "delay_cost",
+    "buyer_need"
+  ],
+  "status": "candidate",
+  "strength": null,
+  "review_note": null
+}
+```
+
+Candidate records are review material only.
+
+They are not training material until promoted.
+
+---
+
+## 5. Review Status
+
+Use a small status vocabulary.
+
+```text
+candidate
+accepted
+rejected
+revise
+strong
+canonical
+```
+
+Meaning:
+
+```text
+candidate:
+  generated but not reviewed
+
+accepted:
+  good enough to enter the vocabulary library
+
+rejected:
+  not useful; do not train on it
+
+revise:
+  promising but needs human rewrite
+
+strong:
+  useful enough to inspire dialogue lines
+
+canonical:
+  preferred phrasing for a recurring simulator condition
+```
+
+Only these should enter training or simulator-facing data:
+
+```text
+accepted
+strong
+canonical
+```
+
+Rejected and unreviewed candidates should be retained only for audit or generator improvement.
+
+---
+
+## 6. Human Review Rules
+
+The reviewer should ask:
+
+1. Is the line Roman-visible?
+2. Does it avoid modern abstraction?
+3. Does it express a real commercial condition?
+4. Does it use objects, actions, or pressures rather than explanation?
+5. Could one of the six actor voices plausibly say it?
+6. Is it compact enough to be useful?
+7. Does it avoid parody or over-stylized speech?
+8. Does it teach the model a useful pattern?
+
+Reject lines that are merely clever.
+
+Accept lines that create usable world-language.
+
+Promote lines that can recur across scenes.
+
+---
+
+## 7. Rejection Reasons
+
+Common rejection reasons:
+
+```text
+too modern
+too abstract
+too theatrical
+too vague
+wrong actor voice
+no commercial meaning
+no Roman-visible object
+mixed metaphor
+unusable in dialogue
+duplicates existing phrase
+```
+
+Optional review fields:
+
+```json
+{
+  "status": "rejected",
+  "review_note": "too modern: sounds like business-school language"
+}
+```
+
+or:
+
+```json
+{
+  "status": "revise",
+  "review_note": "good image, but too ornate for Secundus"
+}
+```
+
+---
+
+## 8. Promotion Levels
+
+### Accepted
+
+Useful phrase. Can be stored in the vocabulary library.
+
+Example:
+
+```text
+The tablet arrived old.
+```
+
+### Strong
+
+Useful phrase that should influence dialogue writing.
+
+Example:
+
+```text
+A jar without wheels is a promise sitting in straw.
+```
+
+### Canonical
+
+Preferred phrase for a repeated simulator condition.
+
+Example:
+
+```text
+The wheels are gone.
+```
+
+Canonical expressions should be few.
+
+If too many phrases are canonical, none are canonical.
+
+---
+
+## 9. Output Libraries
+
+The workflow should produce three outputs.
+
+### Candidate Pool
+
+```text
+data/vocabulary/candidates.jsonl
+```
+
+Generated material, mostly unreviewed.
+
+### Reviewed Vocabulary
+
+```text
+data/vocabulary/roman_visible_expressions.jsonl
+```
+
+Accepted, strong, and canonical expressions only.
+
+### Canonical Templates
+
+```text
+data/vocabulary/canonical_templates.jsonl
+```
+
+Small set of recurring simulator-ready expressions.
+
+---
+
+## 10. Training Rule
+
+Do not train on raw generated churn.
+
+Training material may use:
+
+```text
+accepted expressions
+strong expressions
+canonical expressions
+human-revised expressions
+dialogues that naturally include reviewed expressions
+```
+
+Training material must not use:
+
+```text
+unreviewed candidate output
+rejected output
+bulk generated noise
+expressions marked revise but not rewritten
+```
+
+The generator is a discovery tool, not an author of record.
+
+---
+
+## 11. Simulator Use
+
+Canonical expressions can help the simulator narrate recurring conditions.
+
+Example simulator state:
+
+```yaml
+condition: transport_capacity_lost
+object: cart
+cause: rival_hired_carts
+urgency: buyer_waiting
+actor_voice: Secundus
+```
+
+Possible canonical output:
+
+```text
+The wheels are gone.
+```
+
+Expanded output:
+
+```text
+The wheels are gone, and the buyer will not wait for our excuses.
+```
+
+Actor variants:
+
+```text
+Varro:
+  The bridge was taken before the column moved.
+
+Felix:
+  Naso bought the road, not the oil.
+
+Chresimus:
+  The account must show why the jars did not move.
+
+Secundus:
+  The wheels are gone. Ten jars can still go by mule.
+```
+
+The simulator should prefer canonical lines for repeated conditions and strong lines for color.
+
+---
+
+## 12. Generator Design
+
+A simple generator can begin as a Cartesian combiner with templates.
+
+Template examples:
+
+```text
+The {object} {action_phrase} while {pressure_phrase}.
+A {object} without {support_object} is {metaphor_result}.
+The {pressure_object} has reached {target} before {expected_event}.
+{actor_voice} would say: "{expression}"
+```
+
+But the generator should be constrained by compatibility rules.
+
+Bad combinations should be filtered before review where possible.
+
+Example:
+
+```text
+coin + hired_elsewhere + rain
+```
+
+may produce nonsense unless transformed carefully.
+
+Good combinations:
+
+```text
+cart + hired_elsewhere + buyer_waiting
+tablet + old + road_delay
+warehouse + full + merchant_urgency
+coin + visible + street_eyes
+seal + broken + official_attention
+```
+
+The generator should prefer semantically compatible sets.
+
+---
+
+## 13. Compatibility Tags
+
+Objects, actions, and pressures should eventually carry compatibility tags.
+
+Example:
+
+```yaml
+object: cart
+compatible_actions:
+  - hired
+  - missing
+  - broken
+  - delayed
+  - overloaded
+compatible_pressures:
+  - buyer_waiting
+  - rival_obstruction
+  - bad_road
+  - delivery_deadline
+```
+
+Example:
+
+```yaml
+object: tablet
+compatible_actions:
+  - written
+  - sealed
+  - old
+  - disputed
+  - witnessed
+compatible_pressures:
+  - stale_news
+  - legal_exposure
+  - source_motive
+  - settlement_dispute
+```
+
+This improves candidate quality without eliminating human review.
+
+---
+
+## 14. Review Speed Target
+
+The process is designed for fast human selection.
+
+Target review speed:
+
+```text
+200 to 500 candidates per hour
+```
+
+This is realistic only if the review interface is simple.
+
+Each candidate should support one-key marking:
+
+```text
+a = accept
+r = reject
+v = revise
+s = strong
+c = canonical
+```
+
+The reviewer should not be forced to edit every line.
+
+Editing should be reserved for promising expressions.
+
+---
+
+## 15. Success Condition
+
+This workflow is successful if it produces a growing library of Roman-visible expressions faster than direct hand-authoring.
+
+A good result is not a clean generator.
+
+A good result is a strong reviewed vocabulary.
+
+The approved vocabulary should improve:
+
+```text
+dialogue writing
+simulator narration
+actor voice consistency
+contamination resistance
+model training data
+```
+
+The final test is whether the model prefers:
+
+```text
+The wheels are gone.
+The tablet arrived old.
+He owns jars, not coin.
+The purse is fat and the street has eyes.
+```
+
+over:
+
+```text
+Transport capacity is constrained.
+The information is stale.
+His assets are illiquid.
+His liquidity creates security risk.
+```
+
+The purpose is not style alone.
+
+The purpose is to build a bounded Roman commercial ontology one approved phrase at a time.
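The constrained combiner described in Sections 12 and 13 could be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not the project's generator: `COMPATIBILITY` and `generate_candidates` are hypothetical names, the table copies only the two YAML examples from Section 13, and the `candidate` phrasing is left empty for a later template step that is not shown.

```python
from itertools import product

# Hypothetical compatibility table, copied from the two Section 13 YAML examples.
COMPATIBILITY = {
    "cart": {
        "actions": {"hired", "missing", "broken", "delayed", "overloaded"},
        "pressures": {"buyer_waiting", "rival_obstruction", "bad_road",
                      "delivery_deadline"},
    },
    "tablet": {
        "actions": {"written", "sealed", "old", "disputed", "witnessed"},
        "pressures": {"stale_news", "legal_exposure", "source_motive",
                      "settlement_dispute"},
    },
}


def generate_candidates(objects, actions, pressures, start_id=1):
    """Yield a candidate record for each compatible object/action/pressure triple."""
    next_id = start_id
    for obj, action, pressure in product(objects, actions, pressures):
        tags = COMPATIBILITY.get(obj)
        if tags is None:
            continue  # assumption: untagged objects are skipped, not guessed at
        if action not in tags["actions"] or pressure not in tags["pressures"]:
            continue  # incompatible triple: filtered before human review
        yield {
            "expression_id": f"expr_{next_id:06d}",
            "object": obj,
            "action": action,
            "pressure": pressure,
            "candidate": None,  # phrasing comes from a template step, not shown
            "status": "candidate",
            "strength": None,
            "review_note": None,
        }
        next_id += 1
```

Under these assumptions, `list(generate_candidates(["cart", "coin"], ["hired", "old"], ["buyer_waiting", "rain"]))` keeps only the cart, hired, buyer_waiting triple. Every other combination is dropped before it reaches a reviewer, which is the filtering Section 12 asks for; human review still decides what is promoted.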