# DIALOGUE-STANDARD-0001
## OTIVM Layer 4 Dialogue Style Standard
### Status: Draft Standard
### Layer: Training Infrastructure
### Purpose: Define how OTIVM dialogue files should be written, marked, and validated
### Repository Path: docs/training/chunking/DIALOGUE-STANDARD-0001.md

---

## 0. Purpose

This standard defines how Layer 4 dialogue files should be authored for the OTIVM training corpus.

Layer 4 dialogue is not metadata.

Layer 4 dialogue is in-world scene material. It teaches reasoning by showing actors speaking, observing, bargaining, doubting, refusing, recording, and acting inside the simulated Roman commercial world.

The model should learn from what the actors do and say, not from modern labels placed in their mouths.

---

## 1. Primary Rule

Dialogue body text must be Roman-world prose and speech only.

Chunk markers may contain modern metadata.

Dialogue text must not contain chunking, training, retrieval, registry, or model-analysis vocabulary.

The source file may contain:

```text
HTML comment chunk markers
YAML metadata inside those markers
Roman-world dialogue and scene prose
```

The retrievable chunk text should read as a plausible scene, not as a lesson plan.
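
As a rough illustration of that separation, the sketch below strips marker comments from a source file and returns only the scene text. It assumes markers are ordinary HTML comments (`<!-- ... -->`); the function name `retrievable_text` and the sample chunk are hypothetical and are not part of any existing OTIVM tool.

```python
import re

# Assumption: chunk markers are ordinary HTML comments (<!-- ... -->),
# possibly spanning several lines. This standard does not fix the syntax further.
MARKER_RE = re.compile(r"<!--.*?-->", re.DOTALL)


def retrievable_text(source: str) -> str:
    """Return the scene text with every HTML comment marker removed."""
    body = MARKER_RE.sub("", source)
    # Collapse the runs of blank lines left behind by removed markers.
    body = re.sub(r"\n{3,}", "\n\n", body)
    return body.strip()


# Hypothetical sample chunk, for illustration only.
sample = """<!--
id: DIALOGUE-0001::01::opening
concept_tags:
  - stale_report
-->
"A cart at the warehouse tells us something."
"""

print(retrievable_text(sample))
```

If the output still reads like a lesson plan once the markers are gone, the problem is in the dialogue body, not in the metadata.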

---

## 2. Separation Of Layers

Each dialogue file has three separate layers:

```text
1. Document header
   Human-readable file identity and purpose.

2. Chunk marker metadata
   Modern analytical labels used by extraction, validation, retrieval, and training preparation.

3. Dialogue body
   In-world Roman prose and speech only.
```

Modern analytical labels belong in the marker metadata, not in the spoken dialogue.

Example allowed in metadata:

```yaml
concept_tags:
  - stale_report
  - source_chain
  - confirmation_cost
knowledge_state:
  - reported
  - actor_visible
  - inferred
```

Example not allowed in dialogue speech, because "visible signal" is an analytical label:

```text
"Then we have a visible signal, not a settled price."
```

Better in-world dialogue:

```text
"A cart at the warehouse tells us something. It does not tell us what the oil will fetch."
```

---

## 3. Forbidden Dialogue Vocabulary

The following terms must not appear in character speech or scene narration as training or analysis language. A term is acceptable only where it is an ordinary Roman-world word in context.

Forbidden as training language:

```text
metadata
chunk
chunking
retrieval
training
model
parameter
registry
token
concept tag
knowledge state
visible signal
reported state
known state
hidden true state
settled result
actor perspective
decision threshold
uncertainty structure
correct model behavior
incorrect model behavior
confidence problem
designer analysis
```

In their analytical sense, these terms may appear only inside HTML comment metadata.
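
A check of this kind could be sketched as follows. It is a minimal illustration, not the project's extractor: it assumes markers are HTML comments, matches whole words case-insensitively, and reports hits for human review, since a listed word can still be a legitimate Roman-world word in context. The names `FORBIDDEN_TERMS` and `find_forbidden` are hypothetical.

```python
import re

# Terms copied from the forbidden list in this section.
FORBIDDEN_TERMS = [
    "metadata", "chunk", "chunking", "retrieval", "training", "model",
    "parameter", "registry", "token", "concept tag", "knowledge state",
    "visible signal", "reported state", "known state", "hidden true state",
    "settled result", "actor perspective", "decision threshold",
    "uncertainty structure", "correct model behavior",
    "incorrect model behavior", "confidence problem", "designer analysis",
]

# Assumption: chunk markers are HTML comments; their contents are exempt.
MARKER_RE = re.compile(r"<!--.*?-->", re.DOTALL)

PATTERNS = {
    term: re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    for term in FORBIDDEN_TERMS
}


def find_forbidden(source: str) -> list[tuple[int, str]]:
    """Return (line_number, term) pairs for listed terms found outside markers."""
    # Blank out markers but keep their newlines so line numbers stay accurate.
    body = MARKER_RE.sub(lambda m: "\n" * m.group(0).count("\n"), source)
    hits = []
    for number, line in enumerate(body.splitlines(), start=1):
        for term, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((number, term))
    return hits
```

Each hit is a prompt for a reviewer rather than an automatic rejection: a speaker may use an everyday word that happens to be on the list.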

---

## 4. In-World Substitutions

Use Roman-visible language instead of modern analytical phrasing.

| Modern analytical idea | In-world expression |
|---|---|
| visible signal | cart, seal, smoke, crowd, empty stall, late messenger, wet cloak, broken jar |
| reported state | word, rumor, letter, tablet, witness, clerk's note, market talk |
| hidden true state | what is really inside the crate, what the buyer already knows, what the rival has done |
| confirmation cost | rider's fee, lost time, cart hire, missed buyer, waiting until market closes |
| source motive | why the clerk speaks, why the carter lies, why the rival spreads word |
| partial commitment | sell ten jars, hold the rest; send one cart, keep two; pledge now, settle later |
| settlement | receipt, tablet, witness, seal, pledge, repair, offset, delivery |
| opportunity cost | cart used elsewhere, wall occupied, buyer lost, labor tied up |
| actor perspective | each actor's habits, fears, duties, ambitions, and practical concerns |

Characters should reason with things they can see, hear, count, carry, pledge, inspect, or write.

---

## 5. Preferred Dialogue Shape

Each dialogue file should normally contain six marked scene beats, one per chunk.

Preferred pattern:

```text
1. Scene opening and visible trouble
2. First interpretation or opportunity
3. Challenge, caution, or competing reading
4. Practical cost, arithmetic, obligation, or risk
5. Decision point with buyer, rival, official, worker, or witness
6. Closing result or changed account
```

This is a preference, not a hard rule.

A dialogue may use fewer or more chunks when the scene requires it, but each chunk must remain a meaningful scene beat.
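
Because the six-beat shape is a preference, an automated check would more naturally advise than reject. The sketch below counts markers and returns a note when the count differs from six. It assumes one HTML comment marker per chunk; `beat_count_note` is a hypothetical helper name.

```python
import re
from typing import Optional

# Assumption: each chunk is introduced by exactly one HTML comment marker.
MARKER_RE = re.compile(r"<!--.*?-->", re.DOTALL)
PREFERRED_BEATS = 6


def beat_count_note(source: str) -> Optional[str]:
    """Return an advisory note if the file departs from the preferred beat count."""
    count = len(MARKER_RE.findall(source))
    if count == PREFERRED_BEATS:
        return None
    # Advisory only: fewer or more beats are allowed when the scene requires it.
    return (
        f"{count} marked beats found; {PREFERRED_BEATS} is the usual shape. "
        "Confirm each chunk is a complete scene beat."
    )
```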

---

## 6. Dialogue Chunk Quality

A dialogue chunk is useful when it contains:

```text
Roman-visible situation
+ actor speech/action
+ pressure or uncertainty
+ commercial consequence
```

A dialogue chunk is weak when it contains only:

```text
banter
style
exposition
modern explanation
metadata terms
isolated moral lesson
```

Do not split a question from the answer that gives it meaning.

Do not split a false claim from the correction that makes it useful.

Do not split a joke or quip from the economic point it reveals.

---

## 7. Character Voice Rules

The six commerce NPC lenses may appear in dialogue, but they must not speak as metadata labels.

Use their practical habits:

```text
Varro:
  discipline, order, risk, proof, defensive caution, logistics by analogy to marching or guarding

Felix:
  opportunity, bargaining, speed, pressure, profit, social agility, controlled risk

Lentulus:
  status, access, patronage, public standing, elite expectations, shame, favor

Crispus:
  procedure, remedy, enforceability, authority, complaint, written standing

Secundus:
  carts, roads, capacity, labor, timing, breakage, substitution, practical feasibility

Chresimus:
  tablets, receipts, witnesses, seals, account entries, obligations, what can be written safely
```

The actor's reasoning should emerge from voice and action, not from explanatory labels.

---

## 8. Metadata Requirements

Each dialogue chunk marker should include:

```yaml
id: <DIALOGUE-XXXX::NN::role>
source_file: <filename>
repository_path: <repo path>
domain: commerce
layer: Layer_4--Dialogues
document_id: <DIALOGUE-XXXX>
document_title: "<title>"
section_heading: "<nearest section heading>"
chunk_role: dialogue_beat
concept_tags:
  - <tag>
knowledge_state:
  - <state>
speakers:
  - <actor>
scene_location: <place>
scene_signal: <visible event, rumor, cargo, document, price, or social change>
demonstrated_concepts:
  - <concept>
```

Metadata is for the pipeline. It is not part of the Roman scene.
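
A validator for these fields might look like the sketch below. It takes marker metadata that has already been parsed into a dictionary and returns a list of problems. Several details are assumptions rather than rules stated in this standard: that `XXXX` means four digits and `NN` two digits, that the chunk `id` begins with the `document_id`, and that the tag, state, speaker, and concept fields are lists. The name `validate_marker` is hypothetical.

```python
import re

REQUIRED_FIELDS = (
    "id", "source_file", "repository_path", "domain", "layer",
    "document_id", "document_title", "section_heading", "chunk_role",
    "concept_tags", "knowledge_state", "speakers", "scene_location",
    "scene_signal", "demonstrated_concepts",
)
LIST_FIELDS = ("concept_tags", "knowledge_state", "speakers", "demonstrated_concepts")

# Assumption: XXXX is four digits and NN is two digits.
ID_RE = re.compile(r"^DIALOGUE-\d{4}::\d{2}::\w+$")


def validate_marker(meta: dict) -> list[str]:
    """Return a list of problems found in one parsed chunk marker."""
    errors = []
    for field in REQUIRED_FIELDS:
        if field not in meta or meta[field] in ("", None, []):
            errors.append(f"missing or empty: {field}")
    for field in LIST_FIELDS:
        value = meta.get(field)
        if value and not isinstance(value, list):
            errors.append(f"not a list: {field}")
    chunk_id = meta.get("id")
    if isinstance(chunk_id, str) and chunk_id:
        if not ID_RE.match(chunk_id):
            errors.append(f"unexpected id shape: {chunk_id}")
        doc_id = meta.get("document_id")
        # Assumption: a chunk id is prefixed by its document id.
        if isinstance(doc_id, str) and not chunk_id.startswith(doc_id + "::"):
            errors.append("id does not start with document_id")
    return errors
```

An empty list means the marker carries every field named above. It says nothing about whether the dialogue body itself meets the standard.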

---

## 9. Knowledge Boundary Rule

Dialogue must preserve what actors know.

If the reader is shown a hidden truth, the scene must make clear whether the actors also know it.

Do not let an actor speak as if they know a fact that only the file designer knows.

Use distinctions visible in Roman terms:

```text
"I saw it."
"I heard it."
"The tablet says it."
"The carter claims it."
"The seal is unbroken."
"The buyer has not yet agreed."
"The witness can say this much."
"The rest is guesswork."
```

---

## 10. Arithmetic And Practical Cost

When dialogue includes arithmetic or cost, characters should express it through practical accounting.

Allowed:

```text
"Two jars lost. Hire paid. Half a day gone."
"If we pay double for carts, the venture thins."
"Ten jars now, the rest tomorrow."
"Repair stands against part of the debt."
```

Avoid modern teaching phrasing:

```text
"This demonstrates opportunity cost."
"The correct calculation is..."
"The model should infer..."
```

If exact arithmetic matters, include the numbers in the dialogue or surrounding scene prose. Do not leave calculation only in metadata.

---

## 11. Review Checklist

Before accepting a dialogue file:

1. Does every spoken line sound like a person in the world, not a trainer?
2. Are modern analytical terms confined to chunk metadata?
3. Does each chunk contain a complete scene beat?
4. Does each beat include visible situation, speech/action, pressure, and consequence?
5. Are knowledge boundaries preserved?
6. Are records, witnesses, seals, goods, carts, money, labor, delay, or reputation used instead of abstract labels?
7. Does the file teach through action rather than explanation?
8. Does the extractor validate all chunks without errors?

---

## 12. Success Condition

This standard is functioning correctly if Layer 4 dialogue can be retrieved as natural Roman-world scene material while still carrying precise modern metadata for training preparation.

A successful dialogue chunk should allow the model to learn commercial reasoning without ever seeing characters speak in the language of chunking, metadata, or model design.