diff --git a/docs/training/chunking/DIALOGUE-STANDARD-0001.md b/docs/training/chunking/DIALOGUE-STANDARD-0001.md
new file mode 100644
index 0000000..af0dfbc
--- /dev/null
+++ b/docs/training/chunking/DIALOGUE-STANDARD-0001.md
@@ -0,0 +1,320 @@
+# DIALOGUE-STANDARD-0001
+## OTIVM Layer 4 Dialogue Style Standard
+### Status: Draft Standard
+### Layer: Training Infrastructure
+### Purpose: Define how OTIVM dialogue files should be written, marked, and validated
+### Repository Path: docs/training/chunking/DIALOGUE-STANDARD-0001.md
+
+---
+
+## 0. Purpose
+
+This standard defines how Layer 4 dialogue files should be authored for the OTIVM training corpus.
+
+Layer 4 dialogue is not metadata.
+
+Layer 4 dialogue is in-world scene material. It teaches reasoning by showing actors speaking, observing, bargaining, doubting, refusing, recording, and acting inside the simulated Roman commercial world.
+
+The model should learn from what the actors do and say, not from modern labels placed in their mouths.
+
+---
+
+## 1. Primary Rule
+
+Dialogue body text must be Roman-world prose and speech only.
+
+Chunk markers may contain modern metadata.
+
+Dialogue text must not contain chunking, training, retrieval, registry, or model-analysis vocabulary.
+
+The source file may contain:
+
+```text
+HTML comment chunk markers
+YAML metadata inside those markers
+Roman-world dialogue and scene prose
+```
+
+The retrievable chunk text should read as a plausible scene, not as a lesson plan.
+
+---
+
+## 2. Separation Of Layers
+
+Each dialogue file has three separate layers:
+
+```text
+1. Document header
+   Human-readable file identity and purpose.
+
+2. Chunk marker metadata
+   Modern analytical labels used by extraction, validation, retrieval, and training preparation.
+
+3. Dialogue body
+   In-world Roman prose and speech only.
+```
+
+Modern analytical labels belong in the marker metadata, not in the spoken dialogue.
+
+Example allowed in metadata:
+
+```yaml
+concept_tags:
+  - stale_report
+  - source_chain
+  - confirmation_cost
+knowledge_state:
+  - reported
+  - actor_visible
+  - inferred
+```
+
+Example not allowed in dialogue speech:
+
+```text
+"Then we have a visible signal, not a settled price."
+```
+
+Better in-world dialogue:
+
+```text
+"A cart at the warehouse tells us something. It does not tell us what the oil will fetch."
+```
+
+---
+
+## 3. Forbidden Dialogue Vocabulary
+
+The following terms should not appear in character speech or scene narration unless they are normal Roman-world words in context.
+
+Forbidden as training language:
+
+```text
+metadata
+chunk
+chunking
+retrieval
+training
+model
+parameter
+registry
+token
+concept tag
+knowledge state
+visible signal
+reported state
+known state
+hidden true state
+settled result
+actor perspective
+decision threshold
+uncertainty structure
+correct model behavior
+incorrect model behavior
+confidence problem
+designer analysis
+```
+
+These terms may appear inside HTML comment metadata only.
+
+---
+
+## 4. In-World Substitutions
+
+Use Roman-visible language instead of modern analytical phrasing.
+
+| Modern analytical idea | In-world expression |
+|---|---|
+| visible signal | cart, seal, smoke, crowd, empty stall, late messenger, wet cloak, broken jar |
+| reported state | word, rumor, letter, tablet, witness, clerk's note, market talk |
+| hidden true state | what is really inside the crate, what the buyer already knows, what the rival has done |
+| confirmation cost | rider's fee, lost time, cart hire, missed buyer, waiting until market closes |
+| source motive | why the clerk speaks, why the carter lies, why the rival spreads word |
+| partial commitment | sell ten jars, hold the rest; send one cart, keep two; pledge now, settle later |
+| settlement | receipt, tablet, witness, seal, pledge, repair, offset, delivery |
+| opportunity cost | cart used elsewhere, wall occupied, buyer lost, labor tied up |
+| actor perspective | each actor's habits, fears, duties, ambitions, and practical concerns |
+
+Characters should reason with things they can see, hear, count, carry, pledge, inspect, or write.
+
+---
+
+## 5. Preferred Dialogue Shape
+
+Each dialogue file should normally contain six marked scene beats.
+
+Preferred pattern:
+
+```text
+1. Scene opening and visible trouble
+2. First interpretation or opportunity
+3. Challenge, caution, or competing reading
+4. Practical cost, arithmetic, obligation, or risk
+5. Decision point with buyer, rival, official, worker, or witness
+6. Closing result or changed account
+```
+
+This is a preference, not a hard rule.
+
+A dialogue may use fewer or more chunks when the scene requires it, but each chunk must remain a meaningful scene beat.
+
+---
+
+## 6. Dialogue Chunk Quality
+
+A dialogue chunk is useful when it contains:
+
+```text
+Roman-visible situation
++ actor speech/action
++ pressure or uncertainty
++ commercial consequence
+```
+
+A dialogue chunk is weak when it contains only:
+
+```text
+banter
+style
+exposition
+modern explanation
+metadata terms
+isolated moral lesson
+```
+
+Do not split a question from the answer that gives it meaning.
+
+Do not split a false claim from the correction that makes it useful.
+
+Do not split a joke or quip from the economic point it reveals.
+
+---
+
+## 7. Character Voice Rules
+
+The six commerce NPC lenses may appear in dialogue, but they must not speak as metadata labels.
+
+Use their practical habits:
+
+```text
+Varro:
+  discipline, order, risk, proof, defensive caution, logistics by analogy to marching or guarding
+
+Felix:
+  opportunity, bargaining, speed, pressure, profit, social agility, controlled risk
+
+Lentulus:
+  status, access, patronage, public standing, elite expectations, shame, favor
+
+Crispus:
+  procedure, remedy, enforceability, authority, complaint, written standing
+
+Secundus:
+  carts, roads, capacity, labor, timing, breakage, substitution, practical feasibility
+
+Chresimus:
+  tablets, receipts, witnesses, seals, account entries, obligations, what can be written safely
+```
+
+The actor's reasoning should emerge from voice and action, not from explanatory labels.
+
+---
+
+## 8. Metadata Requirements
+
+Each dialogue chunk marker should include:
+
+```yaml
+id:
+source_file:
+repository_path:
+domain: commerce
+layer: Layer_4--Dialogues
+document_id:
+document_title: ""
+section_heading: "<nearest section heading>"
+chunk_role: dialogue_beat
+concept_tags:
+  - <tag>
+knowledge_state:
+  - <state>
+speakers:
+  - <actor>
+scene_location: <place>
+scene_signal: <visible event, rumor, cargo, document, price, or social change>
+demonstrated_concepts:
+  - <concept>
+```
+
+Metadata is for the pipeline. It is not part of the Roman scene.
+
+---
+
+## 9. Knowledge Boundary Rule
+
+Dialogue must preserve what actors know.
+
+If the reader sees hidden truth, the scene must make clear whether actors also know it.
+
+Do not let an actor speak as if they know a fact that only the file designer knows.
+
+Use distinctions visible in Roman terms:
+
+```text
+"I saw it."
+"I heard it."
+"The tablet says it."
+"The carter claims it."
+"The seal is unbroken."
+"The buyer has not yet agreed."
+"The witness can say this much."
+"The rest is guesswork."
+```
+
+---
+
+## 10. Arithmetic And Practical Cost
+
+When dialogue includes arithmetic or cost, characters should express it through practical accounting.
+
+Allowed:
+
+```text
+"Two jars lost. Hire paid. Half a day gone."
+"If we pay double for carts, the venture thins."
+"Ten jars now, the rest tomorrow."
+"Repair stands against part of the debt."
+```
+
+Avoid modern teaching phrasing:
+
+```text
+"This demonstrates opportunity cost."
+"The correct calculation is..."
+"The model should infer..."
+```
+
+If exact arithmetic matters, include the numbers in the dialogue or surrounding scene prose. Do not leave calculation only in metadata.
+
+---
+
+## 11. Review Checklist
+
+Before accepting a dialogue file:
+
+1. Does every spoken line sound like a person in the world, not a trainer?
+2. Are modern analytical terms confined to chunk metadata?
+3. Does each chunk contain a complete scene beat?
+4. Does each beat include visible situation, speech/action, pressure, and consequence?
+5. Are knowledge boundaries preserved?
+6. Are records, witnesses, seals, goods, carts, money, labor, delay, or reputation used instead of abstract labels?
+7. Does the file teach through action rather than explanation?
+8. Does the extractor validate all chunks without errors?
+
+---
+
+## 12. Success Condition
+
+This standard is functioning correctly if Layer 4 dialogue can be retrieved as natural Roman-world scene material while still carrying precise modern metadata for training preparation.
+
+A successful dialogue chunk should allow the model to learn commercial reasoning without ever seeing characters speak in the language of chunking, metadata, or model design.
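
Sections 3 and 8 and checklist item 8 imply two mechanical checks: every chunk marker carries the listed metadata keys, and none of the forbidden terms appear outside the markers. The Python sketch below is one possible way to express those checks; it is not the project's extractor. The function names are hypothetical, and it assumes that each marker is a plain `<!-- ... -->` comment with its YAML keys starting at column 0, which the standard does not specify.

```python
import re

# Required marker keys, copied from section 8 of the standard.
REQUIRED_KEYS = [
    "id", "source_file", "repository_path", "domain", "layer",
    "document_id", "document_title", "section_heading", "chunk_role",
    "concept_tags", "knowledge_state", "speakers", "scene_location",
    "scene_signal", "demonstrated_concepts",
]

# Forbidden terms, copied from section 3 of the standard.
FORBIDDEN_TERMS = [
    "metadata", "chunk", "chunking", "retrieval", "training", "model",
    "parameter", "registry", "token", "concept tag", "knowledge state",
    "visible signal", "reported state", "known state", "hidden true state",
    "settled result", "actor perspective", "decision threshold",
    "uncertainty structure", "correct model behavior",
    "incorrect model behavior", "confidence problem", "designer analysis",
]

# Assumption: a chunk marker is an ordinary HTML comment.
COMMENT_RE = re.compile(r"<!--(.*?)-->", re.DOTALL)
# Assumption: top-level YAML keys start at column 0 inside the marker.
KEY_RE = re.compile(r"^([A-Za-z_]+):", re.MULTILINE)


def missing_keys(marker_text):
    """Return the required keys that do not appear in one marker."""
    present = set(KEY_RE.findall(marker_text))
    return [key for key in REQUIRED_KEYS if key not in present]


def forbidden_hits(body_text):
    """Return forbidden terms found as whole words or phrases in body text."""
    hits = []
    for term in FORBIDDEN_TERMS:
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, body_text, re.IGNORECASE):
            hits.append(term)
    return hits


def check_dialogue(source):
    """Check marker keys and scan everything outside the markers."""
    markers = COMMENT_RE.findall(source)
    body = COMMENT_RE.sub("", source)
    return {
        "missing_keys": [missing_keys(marker) for marker in markers],
        "forbidden_terms": forbidden_hits(body),
    }
```

Section 3 allows a listed term when it is a normal Roman-world word in context, so a hit from `forbidden_hits` is better treated as a prompt for human review than as an automatic rejection. The key check only confirms that a key is present, not that its value is filled in or valid.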