VOCABULARY-GENERATION-0001
Generate, Review, And Promote Roman-Visible Expressions
Status: Draft Standard
Layer: Training Infrastructure
Purpose: Define a fast human-in-the-loop workflow for building OTIVM's Roman-visible model vocabulary
Repository Path: docs/training/chunking/VOCABULARY-GENERATION-0001.md
0. Purpose
This document defines a workflow for generating and selecting Roman-visible commercial expressions.
The purpose is to build the model vocabulary faster than hand-authoring every line.
The generator may produce large amounts of weak or useless material. That is acceptable.
The training corpus must receive only reviewed and accepted material.
The workflow is:
generate many candidates
human flags useful expressions
accepted expressions become vocabulary records
strong expressions become dialogue material
canonical expressions become simulator templates
The churn is not the asset.
The approved expression is the asset.
1. Core Idea
A Roman-visible expression can often be generated from three elements:
Object + Action + Pressure
Examples:
coin + hide + street eyes
= The purse is fat and the street has eyes.
cart + hired elsewhere + buyer waiting
= The wheels are gone while the buyer counts the hours.
tablet + old + road delay
= The tablet arrived older than its promise.
jar + no cart + delivery obligation
= A jar without wheels is a promise sitting in straw.
warehouse roof + rain + merchant urgency
= The roof earns coin when rain walks the street.
This is not ordinary paraphrase.
It is ontology building.
The model learns what kind of world it inhabits by seeing which objects, actions, and pressures are allowed to combine.
2. Why This Works
Humans are often faster at recognizing a good phrase than inventing one from nothing.
A generator can produce hundreds or thousands of combinations.
Most will be poor.
A human reviewer can scroll quickly and mark:
accept
reject
revise
strong
canonical
The useful lines will emerge faster than through direct composition.
The process is closer to quarrying stone than writing prose.
The generator produces rough stone.
The reviewer selects blocks worth dressing.
The corpus receives only dressed blocks.
3. Controlled Input Sets
The generator should not begin with unrestricted language.
It should combine controlled lists.
Objects
coin
purse
chest
tablet
seal
witness
cart
wheel
mule
road
warehouse
wall
roof
jar
amphora
crate
rope
weight
measure
gate
market
portico
yard
dust
rain
lamp
grain
oil
bronze
timber
glass
stone
Actions
buy
sell
carry
store
seal
open
count
weigh
measure
pledge
write
witness
hire
repair
delay
ask
refuse
accuse
confirm
return
split
hold
move
settle
hide
leak
wait
rot
spoil
break
arrive
depart
Pressures
hunger
rain
delay
spoilage
debt
rivalry
shame
praise
shortage
crowd
rumor
cart scarcity
storage scarcity
buyer urgency
creditor pressure
official attention
bad road
old news
broken seal
empty purse
full warehouse
Actor Voices
Varro
Felix
Lentulus
Crispus
Secundus
Chresimus
neutral narrator
The generator should combine these into candidate expressions, not final truth.
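A minimal Python sketch of that combiner follows. The lists are abbreviated from the full sets above, and the function name is illustrative, not a required implementation.
import itertools

# Abbreviated controlled lists; the full sets appear above.
OBJECTS = ["coin", "cart", "tablet", "jar", "warehouse"]
ACTIONS = ["hide", "hired_elsewhere", "old", "delay", "store"]
PRESSURES = ["street_eyes", "buyer_waiting", "road_delay", "rain"]

def raw_triples():
    """Yield every object/action/pressure triple as raw candidate material."""
    for obj, action, pressure in itertools.product(OBJECTS, ACTIONS, PRESSURES):
        yield {"object": obj, "action": action, "pressure": pressure}

# Most triples will be weak or nonsensical; that is expected.
# They become candidate expressions only after templating (section 12)
# and compatibility filtering (section 13), and never skip human review.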
4. Candidate Expression Record
Each generated expression should be stored as a reviewable record.
Recommended JSONL form:
{
  "expression_id": "expr_000142",
  "domain": "commerce",
  "object": "cart",
  "action": "hired_elsewhere",
  "pressure": "buyer_waiting",
  "actor_voice": "Secundus",
  "candidate": "The wheels are gone, and the buyer will not wait for our excuses.",
  "modern_meaning": "Cart capacity has been lost, but partial shipment may still be possible.",
  "concept_tags": [
    "transport_capacity",
    "delay_cost",
    "buyer_need"
  ],
  "status": "candidate",
  "strength": null,
  "review_note": null
}
Candidate records are review material only.
They are not training material until promoted.
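A sketch of appending candidate records to the pool follows. The field names mirror the record above and the file path matches section 9; the helper itself is illustrative.
import json

def write_candidate(record, path="data/vocabulary/candidates.jsonl"):
    """Append one candidate record as a single JSON line (review material only)."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

candidate = {
    "expression_id": "expr_000142",
    "object": "cart",
    "action": "hired_elsewhere",
    "pressure": "buyer_waiting",
    "actor_voice": "Secundus",
    "candidate": "The wheels are gone, and the buyer will not wait for our excuses.",
    "status": "candidate",
    # remaining fields follow the record shown above
}
write_candidate(candidate)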
5. Review Status
Use a small status vocabulary.
candidate
accepted
rejected
revise
strong
canonical
Meaning:
candidate: generated but not reviewed
accepted: good enough to enter the vocabulary library
rejected: not useful; do not train on it
revise: promising but needs human rewrite
strong: useful enough to inspire dialogue lines
canonical: preferred phrasing for a recurring simulator condition
Only these should enter training or simulator-facing data:
accepted
strong
canonical
Rejected and unreviewed candidates should be retained only for audit or generator improvement.
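One way to keep this vocabulary closed is a small enum, sketched below; the class and constant names are illustrative, not part of the standard.
from enum import Enum

class Status(str, Enum):
    CANDIDATE = "candidate"
    ACCEPTED = "accepted"
    REJECTED = "rejected"
    REVISE = "revise"
    STRONG = "strong"
    CANONICAL = "canonical"

# Only these statuses may reach training or simulator-facing data.
PROMOTED = {Status.ACCEPTED, Status.STRONG, Status.CANONICAL}

def is_promoted(status):
    """True when a status string belongs to the promoted set."""
    return Status(status) in PROMOTED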
6. Human Review Rules
The reviewer should ask:
- Is the line Roman-visible?
- Does it avoid modern abstraction?
- Does it express a real commercial condition?
- Does it use an object, an action, or a pressure rather than explanation?
- Could one of the listed actor voices plausibly say it?
- Is it compact enough to be useful?
- Does it avoid parody or over-stylized speech?
- Does it teach the model a useful pattern?
Reject lines that are merely clever.
Accept lines that create usable world-language.
Promote lines that can recur across scenes.
7. Rejection Reasons
Common rejection reasons:
too modern
too abstract
too theatrical
too vague
wrong actor voice
no commercial meaning
no Roman-visible object
mixed metaphor
unusable in dialogue
duplicates existing phrase
Optional review fields:
{
  "status": "rejected",
  "review_note": "too modern: sounds like business-school language"
}
or:
{
  "status": "revise",
  "review_note": "good image, but too ornate for Secundus"
}
8. Promotion Levels
Accepted
Useful phrase. Can be stored in the vocabulary library.
Example:
The tablet arrived old.
Strong
Useful phrase that should influence dialogue writing.
Example:
A jar without wheels is a promise sitting in straw.
Canonical
Preferred phrase for a repeated simulator condition.
Example:
The wheels are gone.
Canonical expressions should be few.
If too many phrases are canonical, none are canonical.
9. Output Libraries
The workflow should produce three outputs.
Candidate Pool
data/vocabulary/candidates.jsonl
Generated material, mostly unreviewed.
Reviewed Vocabulary
data/vocabulary/roman_visible_expressions.jsonl
Accepted, strong, and canonical expressions only.
Canonical Templates
data/vocabulary/canonical_templates.jsonl
Small set of recurring simulator-ready expressions.
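A sketch of the split pass that produces the second and third files from the first is shown below. The paths match the outputs above and the statuses follow section 5; the function itself is only an assumption about how the pass might be written.
import json

PROMOTED = {"accepted", "strong", "canonical"}

def build_libraries(candidates_path="data/vocabulary/candidates.jsonl",
                    reviewed_path="data/vocabulary/roman_visible_expressions.jsonl",
                    canonical_path="data/vocabulary/canonical_templates.jsonl"):
    """Copy promoted records to the reviewed library; canonical records also become templates."""
    with open(candidates_path, encoding="utf-8") as src, \
         open(reviewed_path, "w", encoding="utf-8") as reviewed, \
         open(canonical_path, "w", encoding="utf-8") as canonical:
        for line in src:
            record = json.loads(line)
            status = record.get("status")
            if status in PROMOTED:
                reviewed.write(json.dumps(record, ensure_ascii=False) + "\n")
            if status == "canonical":
                canonical.write(json.dumps(record, ensure_ascii=False) + "\n")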
10. Training Rule
Do not train on raw generated churn.
Training material may use:
accepted expressions
strong expressions
canonical expressions
human-revised expressions
dialogues that naturally include reviewed expressions
Training material must not use:
unreviewed candidate output
rejected output
bulk generated noise
expressions marked revise but not rewritten
The generator is a discovery tool, not an author of record.
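The rule above can be expressed as a single predicate, sketched here with the field names of the candidate record; the helper is illustrative.
TRAINABLE_STATUSES = {"accepted", "strong", "canonical"}

def is_trainable(record):
    """True only for reviewed, promoted expressions.

    Unreviewed candidates, rejected output, and lines still marked 'revise'
    are excluded; a revised line qualifies only after a human rewrite moves
    its status to accepted, strong, or canonical.
    """
    return record.get("status") in TRAINABLE_STATUSES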
11. Simulator Use
Canonical expressions can help the simulator narrate recurring conditions.
Example simulator state:
condition: transport_capacity_lost
object: cart
cause: rival_hired_carts
urgency: buyer_waiting
actor_voice: Secundus
Possible canonical output:
The wheels are gone.
Expanded output:
The wheels are gone, and the buyer will not wait for our excuses.
Actor variants:
Varro:
The bridge was taken before the column moved.
Felix:
Naso bought the road, not the oil.
Chresimus:
The account must show why the jars did not move.
Secundus:
The wheels are gone. Ten jars can still go by mule.
The simulator should prefer canonical lines for repeated conditions and strong lines for color.
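A sketch of how that preference might be resolved follows. The lookup table, condition key, and function are assumptions for illustration, not a defined simulator API; the lines themselves are taken from the examples above.
# Hypothetical lookup table keyed by simulator condition.
CANONICAL_LINES = {
    "transport_capacity_lost": {
        "canonical": "The wheels are gone.",
        "variants": {
            "Varro": "The bridge was taken before the column moved.",
            "Felix": "Naso bought the road, not the oil.",
            "Chresimus": "The account must show why the jars did not move.",
            "Secundus": "The wheels are gone. Ten jars can still go by mule.",
        },
    },
}

def narrate(condition, actor_voice=None):
    """Prefer an actor variant when one exists, otherwise fall back to the canonical line."""
    entry = CANONICAL_LINES[condition]
    variants = entry["variants"]
    if actor_voice in variants:
        return variants[actor_voice]
    return entry["canonical"]

# narrate("transport_capacity_lost", "Secundus")
# -> "The wheels are gone. Ten jars can still go by mule."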
12. Generator Design
A simple generator can begin as a Cartesian combiner with templates.
Template examples:
The {object} {action_phrase} while {pressure_phrase}.
A {object} without {support_object} is {metaphor_result}.
The {pressure_object} has reached {target} before {expected_event}.
{actor_voice} would say: "{expression}"
But the generator should be constrained by compatibility rules.
Bad combinations should be filtered before review where possible.
Example:
coin + hired_elsewhere + rain
may produce nonsense unless transformed carefully.
Good combinations:
cart + hired_elsewhere + buyer_waiting
tablet + old + road_delay
warehouse + full + merchant_urgency
coin + visible + street_eyes
seal + broken + official_attention
The generator should prefer semantically compatible sets.
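A sketch of template filling restricted to known-good combinations follows. The template mirrors the first example above; the phrase tables and the whitelist are illustrative stand-ins for real compatibility data.
# Known-good combinations taken from the examples above (abbreviated).
GOOD_COMBINATIONS = [
    ("cart", "hired_elsewhere", "buyer_waiting"),
    ("tablet", "old", "road_delay"),
    ("warehouse", "full", "merchant_urgency"),
]

# Illustrative phrase fragments for the template slots.
ACTION_PHRASES = {
    "hired_elsewhere": "is hired elsewhere",
    "old": "arrives old",
    "full": "stands full",
}
PRESSURE_PHRASES = {
    "buyer_waiting": "the buyer counts the hours",
    "road_delay": "the road eats the days",
    "merchant_urgency": "the merchant will not wait",
}

TEMPLATE = "The {object} {action_phrase} while {pressure_phrase}."

def generate_candidates():
    """Fill the template only for combinations already judged compatible."""
    for obj, action, pressure in GOOD_COMBINATIONS:
        yield TEMPLATE.format(
            object=obj,
            action_phrase=ACTION_PHRASES[action],
            pressure_phrase=PRESSURE_PHRASES[pressure],
        )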
13. Compatibility Tags
Objects, actions, and pressures should eventually carry compatibility tags.
Example:
object: cart
compatible_actions:
  - hired
  - missing
  - broken
  - delayed
  - overloaded
compatible_pressures:
  - buyer_waiting
  - rival_obstruction
  - bad_road
  - delivery_deadline
Example:
object: tablet
compatible_actions:
  - written
  - sealed
  - old
  - disputed
  - witnessed
compatible_pressures:
  - stale_news
  - legal_exposure
  - source_motive
  - settlement_dispute
This improves candidate quality without eliminating human review.
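A sketch of tag-driven filtering follows. The dictionary mirrors the cart and tablet examples above; the fallback behaviour for untagged objects is an assumption.
COMPATIBILITY = {
    "cart": {
        "actions": {"hired", "missing", "broken", "delayed", "overloaded"},
        "pressures": {"buyer_waiting", "rival_obstruction", "bad_road", "delivery_deadline"},
    },
    "tablet": {
        "actions": {"written", "sealed", "old", "disputed", "witnessed"},
        "pressures": {"stale_news", "legal_exposure", "source_motive", "settlement_dispute"},
    },
}

def is_compatible(obj, action, pressure):
    """Drop a triple before review only when its object explicitly disallows the action or pressure."""
    entry = COMPATIBILITY.get(obj)
    if entry is None:
        return True  # untagged objects pass through to human review unfiltered
    return action in entry["actions"] and pressure in entry["pressures"]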
14. Review Speed Target
The process is designed for fast human selection.
Target review speed:
200 to 500 candidates per hour
This is realistic only if the review interface is simple.
Each candidate should support one-key marking:
a = accept
r = reject
v = revise
s = strong
c = canonical
The reviewer should not be forced to edit every line.
Editing should be reserved for promising expressions.
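A terminal review loop might look like the sketch below. The key bindings match the list above; the output path is illustrative, and input() is a line-based approximation of true one-key marking.
import json

KEYMAP = {"a": "accepted", "r": "rejected", "v": "revise", "s": "strong", "c": "canonical"}

def review(candidates_path="data/vocabulary/candidates.jsonl",
           output_path="data/vocabulary/candidates_reviewed.jsonl"):
    """Show each candidate line and record a one-key verdict; unknown keys leave it unreviewed."""
    with open(candidates_path, encoding="utf-8") as src, \
         open(output_path, "w", encoding="utf-8") as out:
        for line in src:
            record = json.loads(line)
            print(record["candidate"])
            key = input("[a]ccept [r]eject re[v]ise [s]trong [c]anonical > ").strip().lower()
            record["status"] = KEYMAP.get(key, "candidate")
            out.write(json.dumps(record, ensure_ascii=False) + "\n")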
15. Success Condition
This workflow is successful if it produces a growing library of Roman-visible expressions faster than direct hand-authoring.
A good result is not a clean generator.
A good result is a strong reviewed vocabulary.
The approved vocabulary should improve:
dialogue writing
simulator narration
actor voice consistency
contamination resistance
model training data
The final test is whether the model prefers:
The wheels are gone.
The tablet arrived old.
He owns jars, not coin.
The purse is fat and the street has eyes.
over:
Transport capacity is constrained.
The information is stale.
His assets are illiquid.
His liquidity creates security risk.
The purpose is not style alone.
The purpose is to build a bounded Roman commercial ontology one approved phrase at a time.