# VOCABULARY-GENERATION-0001
## Generate, Review, And Promote Roman-Visible Expressions
### Status: Draft Standard
### Layer: Training Infrastructure
### Purpose: Define a fast human-in-the-loop workflow for building OTIVM's Roman-visible model vocabulary
### Repository Path: docs/training/chunking/VOCABULARY-GENERATION-0001.md

---

## 0. Purpose

This document defines a workflow for generating and selecting Roman-visible commercial expressions.

The purpose is to build the model vocabulary faster than hand-authoring every line.

The generator may produce large amounts of weak or useless material. That is acceptable.

The training corpus must only receive reviewed and accepted material.

The workflow is:

```text
generate many candidates
human flags useful expressions
accepted expressions become vocabulary records
strong expressions become dialogue material
canonical expressions become simulator templates
```

The churn is not the asset.

The approved expression is the asset.

---

## 1. Core Idea

A Roman-visible expression can often be generated from three elements:

```text
Object + Action + Pressure
```

Examples:

```text
coin + hide + street eyes
= The purse is fat and the street has eyes.

cart + hired elsewhere + buyer waiting
= The wheels are gone while the buyer counts the hours.

tablet + old + road delay
= The tablet arrived older than its promise.

jar + no cart + delivery obligation
= A jar without wheels is a promise sitting in straw.

warehouse roof + rain + merchant urgency
= The roof earns coin when rain walks the street.
```

This is not ordinary paraphrase.

It is ontology building.

The model learns what kind of world it inhabits by seeing which objects, actions, and pressures are allowed to combine.

---

## 2. Why This Works

Humans are often faster at recognizing a good phrase than inventing one from nothing.

A generator can produce hundreds or thousands of combinations.

Most will be poor.

A human reviewer can scroll quickly and mark:

```text
accept
reject
revise
strong
canonical
```

The useful lines will emerge faster than through direct composition.

The process is closer to quarrying stone than writing prose.

The generator produces rough stone.

The reviewer selects blocks worth dressing.

The corpus receives only dressed blocks.

---

## 3. Controlled Input Sets

The generator should not begin with unrestricted language.

It should combine controlled lists.

### Objects

```text
coin
purse
chest
tablet
seal
witness
cart
wheel
mule
road
warehouse
wall
roof
jar
amphora
crate
rope
weight
measure
gate
market
portico
yard
dust
rain
lamp
grain
oil
bronze
timber
glass
stone
```

### Actions

```text
buy
sell
carry
store
seal
open
count
weigh
measure
pledge
write
witness
hire
repair
delay
ask
refuse
accuse
confirm
return
split
hold
move
settle
hide
leak
wait
rot
spoil
break
arrive
depart
```

### Pressures

```text
hunger
rain
delay
spoilage
debt
rivalry
shame
praise
shortage
crowd
rumor
cart scarcity
storage scarcity
buyer urgency
creditor pressure
official attention
bad road
old news
broken seal
empty purse
full warehouse
```

### Actor Voices

```text
Varro
Felix
Lentulus
Crispus
Secundus
Chresimus
neutral narrator
```

The generator should combine these into candidate expressions, not final truth.

---

## 4. Candidate Expression Record

Each generated expression should be stored as a reviewable record.

Recommended JSONL form:

```json
{
  "expression_id": "expr_000142",
  "domain": "commerce",
  "object": "cart",
  "action": "hired_elsewhere",
  "pressure": "buyer_waiting",
  "actor_voice": "Secundus",
  "candidate": "The wheels are gone, and the buyer will not wait for our excuses.",
  "modern_meaning": "Cart capacity has been lost, and the waiting buyer will not tolerate further delay.",
  "concept_tags": [
    "transport_capacity",
    "delay_cost",
    "buyer_need"
  ],
  "status": "candidate",
  "strength": null,
  "review_note": null
}
```

Candidate records are review material only.

They are not training material until promoted.

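
The record shape above maps naturally onto a small typed structure. The following is a minimal Python sketch, not part of this standard: the class name `CandidateExpression` and the helper `from_jsonl` are hypothetical, typing `strength` as an optional string is an assumption (the example only shows `null`), and the sample `expression_id` and `concept_tags` values are illustrative. It shows a record being written as one JSONL line and read back unchanged.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class CandidateExpression:
    """One reviewable record; field names mirror the JSON example."""
    expression_id: str
    domain: str
    object: str
    action: str
    pressure: str
    actor_voice: str
    candidate: str
    modern_meaning: str
    concept_tags: List[str] = field(default_factory=list)
    status: str = "candidate"           # new records start unreviewed
    strength: Optional[str] = None      # assumption: type not fixed by the standard
    review_note: Optional[str] = None

    def to_jsonl(self) -> str:
        """Serialize to a single JSONL line (json.dumps emits no raw newlines)."""
        return json.dumps(asdict(self), ensure_ascii=False)


def from_jsonl(line: str) -> CandidateExpression:
    """Parse one JSONL line back into a record."""
    return CandidateExpression(**json.loads(line))


# Illustrative values only; the id and tag are made up for this sketch.
record = CandidateExpression(
    expression_id="expr_000001",
    domain="commerce",
    object="tablet",
    action="old",
    pressure="road_delay",
    actor_voice="neutral narrator",
    candidate="The tablet arrived older than its promise.",
    modern_meaning="The information is stale.",
    concept_tags=["stale_news"],
)
line = record.to_jsonl()
```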
---

## 5. Review Status

Use a small status vocabulary.

```text
candidate
accepted
rejected
revise
strong
canonical
```

Meaning:

```text
candidate:
generated but not reviewed

accepted:
good enough to enter the vocabulary library

rejected:
not useful; do not train on it

revise:
promising but needs human rewrite

strong:
useful enough to inspire dialogue lines

canonical:
preferred phrasing for a recurring simulator condition
```

Only records with these statuses should enter training or simulator-facing data:

```text
accepted
strong
canonical
```

Rejected and unreviewed candidates should be retained only for audit or generator improvement.

---

## 6. Human Review Rules

The reviewer should ask:

1. Is the line Roman-visible?
2. Does it avoid modern abstraction?
3. Does it express a real commercial condition?
4. Does it use objects, action, or pressure rather than explanation?
5. Could one of the listed actor voices plausibly say it?
6. Is it compact enough to be useful?
7. Does it avoid parody or over-stylized speech?
8. Does it teach the model a useful pattern?

Reject lines that are merely clever.

Accept lines that create usable world-language.

Promote lines that can recur across scenes.

---

## 7. Rejection Reasons

Common rejection reasons:

```text
too modern
too abstract
too theatrical
too vague
wrong actor voice
no commercial meaning
no Roman-visible object
mixed metaphor
unusable in dialogue
duplicates existing phrase
```

Optional review fields:

```json
{
  "status": "rejected",
  "review_note": "too modern: sounds like business-school language"
}
```

or:

```json
{
  "status": "revise",
  "review_note": "good image, but too ornate for Secundus"
}
```

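
Because rejected records are kept for generator improvement, it can help to count which reasons dominate. The sketch below is one possible approach, not a required tool: it assumes reviewers write the reason as a prefix before a colon (as in the `too modern: ...` note above), and the names `reason_of` and `tally_rejections` are hypothetical. Notes that do not start with a listed reason are counted as `other`.

```python
from collections import Counter
from typing import Dict, Iterable, Optional

# The common rejection reasons listed in this section.
COMMON_REASONS = (
    "too modern", "too abstract", "too theatrical", "too vague",
    "wrong actor voice", "no commercial meaning", "no Roman-visible object",
    "mixed metaphor", "unusable in dialogue", "duplicates existing phrase",
)
# Case-insensitive lookup back to the listed spelling.
_LOOKUP = {reason.lower(): reason for reason in COMMON_REASONS}


def reason_of(review_note: Optional[str]) -> str:
    """Return the listed reason that prefixes the note, else 'other'."""
    if not review_note:
        return "other"
    head = review_note.split(":", 1)[0].strip().lower()
    return _LOOKUP.get(head, "other")


def tally_rejections(records: Iterable[Dict]) -> Counter:
    """Count rejection reasons across records whose status is 'rejected'."""
    return Counter(
        reason_of(record.get("review_note"))
        for record in records
        if record.get("status") == "rejected"
    )
```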
---

## 8. Promotion Levels

### Accepted

Useful phrase. Can be stored in the vocabulary library.

Example:

```text
The tablet arrived old.
```

### Strong

Useful phrase that should influence dialogue writing.

Example:

```text
A jar without wheels is a promise sitting in straw.
```

### Canonical

Preferred phrase for a repeated simulator condition.

Example:

```text
The wheels are gone.
```

Canonical expressions should be few.

If too many phrases are canonical, none are canonical.

---

## 9. Output Libraries

The workflow should produce three outputs.

### Candidate Pool

```text
data/vocabulary/candidates.jsonl
```

Generated material, mostly unreviewed.

### Reviewed Vocabulary

```text
data/vocabulary/roman_visible_expressions.jsonl
```

Accepted, strong, and canonical expressions only.

### Canonical Templates

```text
data/vocabulary/canonical_templates.jsonl
```

Small set of recurring simulator-ready expressions.

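
The split into three libraries can be sketched as a routing step over reviewed records. This is an illustrative Python sketch under two assumptions that the standard does not state outright: canonical records are written to both the reviewed vocabulary and the canonical templates, and everything else (unreviewed, rejected, or marked revise) stays in the candidate pool. The function name `route` is hypothetical; the paths are the ones listed above.

```python
from typing import Dict, Iterable, List

CANDIDATE_POOL = "data/vocabulary/candidates.jsonl"
REVIEWED_VOCABULARY = "data/vocabulary/roman_visible_expressions.jsonl"
CANONICAL_TEMPLATES = "data/vocabulary/canonical_templates.jsonl"

REVIEWED_STATUSES = frozenset({"accepted", "strong", "canonical"})


def route(records: Iterable[dict]) -> Dict[str, List[dict]]:
    """Group records by the output library they belong in."""
    outputs: Dict[str, List[dict]] = {
        CANDIDATE_POOL: [],
        REVIEWED_VOCABULARY: [],
        CANONICAL_TEMPLATES: [],
    }
    for record in records:
        status = record.get("status", "candidate")
        if status in REVIEWED_STATUSES:
            outputs[REVIEWED_VOCABULARY].append(record)
            if status == "canonical":
                # Assumption: canonical lines also remain in the reviewed vocabulary.
                outputs[CANONICAL_TEMPLATES].append(record)
        else:
            # Unreviewed, rejected, and revise records are kept for audit only.
            outputs[CANDIDATE_POOL].append(record)
    return outputs
```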
---

## 10. Training Rule

Do not train on raw generated churn.

Training material may use:

```text
accepted expressions
strong expressions
canonical expressions
human-revised expressions
dialogues that naturally include reviewed expressions
```

Training material must not use:

```text
unreviewed candidate output
rejected output
bulk generated noise
expressions marked revise but not rewritten
```

The generator is a discovery tool, not an author of record.

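
The rule above can be enforced as a gate in front of any training export. The following is a minimal Python sketch, assuming a hypothetical `revised_text` field that holds the human rewrite of a record marked `revise`; the standard does not define where a rewrite is stored, and the names `is_trainable` and `training_text` are illustrative.

```python
TRAINABLE_STATUSES = frozenset({"accepted", "strong", "canonical"})


def is_trainable(record: dict) -> bool:
    """Return True only for reviewed material allowed into training data."""
    status = record.get("status", "candidate")
    if status in TRAINABLE_STATUSES:
        return True
    if status == "revise":
        # Hypothetical field: a non-empty `revised_text` means a human rewrote the line.
        return bool((record.get("revised_text") or "").strip())
    # Unreviewed candidates, rejected output, and unknown statuses are excluded.
    return False


def training_text(record: dict) -> str:
    """Prefer the human rewrite when one exists, else the reviewed candidate."""
    if not is_trainable(record):
        raise ValueError("record is not approved for training")
    revised = (record.get("revised_text") or "").strip()
    return revised or record["candidate"]
```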
---

## 11. Simulator Use

Canonical expressions can help the simulator narrate recurring conditions.

Example simulator state:

```yaml
condition: transport_capacity_lost
object: cart
cause: rival_hired_carts
urgency: buyer_waiting
actor_voice: Secundus
```

Possible canonical output:

```text
The wheels are gone.
```

Expanded output:

```text
The wheels are gone, and the buyer will not wait for our excuses.
```

Actor variants:

```text
Varro:
The bridge was taken before the column moved.

Felix:
Naso bought the road, not the oil.

Chresimus:
The account must show why the jars did not move.

Secundus:
The wheels are gone. Ten jars can still go by mule.
```

The simulator should prefer canonical lines for repeated conditions and strong lines for color.

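
One way to read the preference above is as a lookup keyed by condition, with an optional actor-specific line. The sketch below is an assumption-laden illustration, not simulator code: the tables, the `narrate` function, and the `want_color` flag are hypothetical, and treating the actor variants as the "color" lines is an interpretation. When no variant exists for the actor, it falls back to the canonical line.

```python
from typing import Dict, Optional, Tuple

# Canonical line per recurring condition (from the example above).
CANONICAL: Dict[str, str] = {
    "transport_capacity_lost": "The wheels are gone.",
}

# Actor-specific variants keyed by (condition, actor_voice).
ACTOR_VARIANTS: Dict[Tuple[str, str], str] = {
    ("transport_capacity_lost", "Varro"): "The bridge was taken before the column moved.",
    ("transport_capacity_lost", "Felix"): "Naso bought the road, not the oil.",
    ("transport_capacity_lost", "Chresimus"): "The account must show why the jars did not move.",
    ("transport_capacity_lost", "Secundus"): "The wheels are gone. Ten jars can still go by mule.",
}


def narrate(state: dict, want_color: bool = False) -> Optional[str]:
    """Pick a line for a simulator state; None if the condition is unknown."""
    condition = state.get("condition")
    if want_color:
        variant = ACTOR_VARIANTS.get((condition, state.get("actor_voice")))
        if variant is not None:
            return variant
    # Default and fallback: the canonical line for the condition.
    return CANONICAL.get(condition)
```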
---

## 12. Generator Design

A simple generator can begin as a Cartesian combiner with templates.

Template examples:

```text
The {object} {action_phrase} while {pressure_phrase}.
A {object} without {support_object} is {metaphor_result}.
The {pressure_object} has reached {target} before {expected_event}.
{actor_voice} would say: "{expression}"
```

But the generator should be constrained by compatibility rules.

Bad combinations should be filtered before review where possible.

Example:

```text
coin + hired_elsewhere + rain
```

may produce nonsense unless transformed carefully.

Good combinations:

```text
cart + hired_elsewhere + buyer_waiting
tablet + old + road_delay
warehouse + full + merchant_urgency
coin + visible + street_eyes
seal + broken + official_attention
```

The generator should prefer semantically compatible sets.

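
A Cartesian combiner over the first template can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: the phrase tables `ACTION_PHRASES` and `PRESSURE_PHRASES` and their wordings are invented for the example, the function name `generate_candidates` is hypothetical, and no compatibility filtering is applied, so most output is expected to be weak. Every emitted record starts with status `candidate`.

```python
from itertools import product
from typing import Dict, Iterator, List

TEMPLATE = "The {object} {action_phrase} while {pressure_phrase}."

# Hypothetical phrase tables; real tables would be authored and reviewed.
OBJECTS: List[str] = ["cart", "tablet"]
ACTION_PHRASES: Dict[str, str] = {
    "hired_elsewhere": "is hired elsewhere",
    "old": "arrives old",
}
PRESSURE_PHRASES: Dict[str, str] = {
    "buyer_waiting": "the buyer waits",
    "road_delay": "the road delays",
}


def generate_candidates(
    objects: List[str],
    action_phrases: Dict[str, str],
    pressure_phrases: Dict[str, str],
) -> Iterator[dict]:
    """Yield one candidate record per object/action/pressure combination."""
    combos = product(objects, action_phrases.items(), pressure_phrases.items())
    for n, (obj, (action, a_phrase), (pressure, p_phrase)) in enumerate(combos, start=1):
        yield {
            "expression_id": f"expr_{n:06d}",
            "object": obj,
            "action": action,
            "pressure": pressure,
            "candidate": TEMPLATE.format(
                object=obj, action_phrase=a_phrase, pressure_phrase=p_phrase
            ),
            "status": "candidate",  # never pre-approved by the generator
        }
```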
---

## 13. Compatibility Tags

Objects, actions, and pressures should eventually carry compatibility tags.

Example:

```yaml
object: cart
compatible_actions:
- hired
- missing
- broken
- delayed
- overloaded
compatible_pressures:
- buyer_waiting
- rival_obstruction
- bad_road
- delivery_deadline
```

Example:

```yaml
object: tablet
compatible_actions:
- written
- sealed
- old
- disputed
- witnessed
compatible_pressures:
- stale_news
- legal_exposure
- source_motive
- settlement_dispute
```

This improves candidate quality without eliminating human review.

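
The two tag examples above are enough to sketch a pre-review filter. The Python below is a minimal illustration, assuming the tags are loaded into a plain dictionary; the names `COMPATIBILITY`, `is_compatible`, and `compatible_combinations` are hypothetical. Holding back objects that have no tags yet is also an assumption, and a looser policy (pass untagged objects through to the reviewer) would be equally defensible.

```python
from itertools import product
from typing import Dict, List, Set, Tuple

# Compatibility tags transcribed from the two YAML examples.
COMPATIBILITY: Dict[str, Dict[str, Set[str]]] = {
    "cart": {
        "actions": {"hired", "missing", "broken", "delayed", "overloaded"},
        "pressures": {"buyer_waiting", "rival_obstruction", "bad_road", "delivery_deadline"},
    },
    "tablet": {
        "actions": {"written", "sealed", "old", "disputed", "witnessed"},
        "pressures": {"stale_news", "legal_exposure", "source_motive", "settlement_dispute"},
    },
}


def is_compatible(obj: str, action: str, pressure: str) -> bool:
    """True when both the action and the pressure are tagged for the object."""
    entry = COMPATIBILITY.get(obj)
    if entry is None:
        return False  # assumption: untagged objects are held back
    return action in entry["actions"] and pressure in entry["pressures"]


def compatible_combinations(
    objects: List[str], actions: List[str], pressures: List[str]
) -> List[Tuple[str, str, str]]:
    """Filter the full Cartesian product down to tagged-compatible triples."""
    return [
        combo for combo in product(objects, actions, pressures)
        if is_compatible(*combo)
    ]
```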
---

## 14. Review Speed Target

The process is designed for fast human selection.

Target review speed:

```text
200 to 500 candidates per hour
```

This is realistic only if the review interface is simple.

Each candidate should support one-key marking:

```text
a = accept
r = reject
v = revise
s = strong
c = canonical
```

The reviewer should not be forced to edit every line.

Editing should be reserved for promising expressions.

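
The key map and the speed target can be made concrete with a short Python sketch. The function names `mark` and `seconds_per_candidate` are hypothetical, and mapping the keys onto the status vocabulary of Section 5 (for example `a` to `accepted`) is an assumption about how the interface would store its result. The arithmetic shows what the target implies: 200 to 500 candidates per hour leaves between 18 and 7.2 seconds per candidate.

```python
KEY_TO_STATUS = {
    "a": "accepted",
    "r": "rejected",
    "v": "revise",
    "s": "strong",
    "c": "canonical",
}


def mark(record: dict, key: str) -> dict:
    """Return a copy of the record with the status chosen by a single key."""
    status = KEY_TO_STATUS.get(key.lower())
    if status is None:
        raise ValueError(f"unrecognized review key: {key!r}")
    updated = dict(record)  # leave the original record untouched
    updated["status"] = status
    return updated


def seconds_per_candidate(candidates_per_hour: int) -> float:
    """Average time budget per candidate at a given review speed."""
    return 3600 / candidates_per_hour
```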
---

## 15. Success Condition

This workflow is successful if it produces a growing library of Roman-visible expressions faster than direct hand-authoring.

A good result is not a clean generator.

A good result is a strong reviewed vocabulary.

The approved vocabulary should improve:

```text
dialogue writing
simulator narration
actor voice consistency
contamination resistance
model training data
```

The final test is whether the model prefers:

```text
The wheels are gone.
The tablet arrived old.
He owns jars, not coin.
The purse is fat and the street has eyes.
```

over:

```text
Transport capacity is constrained.
The information is stale.
His assets are illiquid.
His liquidity creates security risk.
```

The purpose is not style alone.

The purpose is to build a bounded Roman commercial ontology one approved phrase at a time.