# CHUNKING-STANDARD-0001
## Training Corpus Chunking Standard
### Status: Draft Standard
### Layer: Training Infrastructure
### Purpose: Define how OTIVM training documents should be chunked for retrieval, review, and future model preparation
### Repository Path: docs/training/chunking/CHUNKING-STANDARD-0001.md

---

## 0. Purpose

This document defines chunking rules for the OTIVM training corpus.

The training corpus is layered. Each layer teaches a different kind of reasoning. Chunking must preserve that reasoning.

The goal is not to split files into equal text lengths. The goal is to preserve usable training units.

A good chunk should allow the model to answer:

- what concept is being taught?
- what facts are available?
- what uncertainty remains?
- what arithmetic or relation is being demonstrated?
- which actor perspective is active?
- what behavior should the model learn or avoid?

---

## 1. General Rule

Chunk by meaning, not by size.

A chunk should be self-contained enough to be retrieved without requiring the entire file.

Each chunk should preserve:

- file identity
- layer
- topic
- local section heading
- relevant example facts
- any calculation needed to understand the point
- correct and incorrect model behavior where applicable

Avoid chunks that contain only:

- isolated dialogue lines
- arithmetic without scenario context
- conclusions without evidence
- actor interpretation without shared facts
- principles without example or test

---

## 2. Preferred Chunk Size

Preferred chunk size:

```text
300 to 900 words
```

Acceptable range:

```text
150 to 1200 words
```

Use shorter chunks when the section is atomic.

Use longer chunks when splitting would separate a calculation from its explanation or a dialogue exchange from its demonstrated concept.
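
The size ranges above could be checked with a small helper such as the following. This is a minimal sketch, not part of the standard: the function name `classify_chunk_size`, the label strings, and the choice to count words by splitting on whitespace are all assumptions made for illustration.

```python
# Ranges taken from section 2 of this standard (inclusive, in words).
PREFERRED_RANGE = (300, 900)
ACCEPTABLE_RANGE = (150, 1200)


def classify_chunk_size(text: str) -> str:
    """Return 'preferred', 'acceptable', or 'out_of_range' for a chunk.

    Assumption: a "word" is any whitespace-separated token. The standard
    does not define word counting, so this is only one possible reading.
    """
    word_count = len(text.split())
    if PREFERRED_RANGE[0] <= word_count <= PREFERRED_RANGE[1]:
        return "preferred"
    if ACCEPTABLE_RANGE[0] <= word_count <= ACCEPTABLE_RANGE[1]:
        return "acceptable"
    return "out_of_range"
```

Because the standard chunks by meaning, not by size, an `out_of_range` result is best treated as a prompt for human review of the boundary, not as an automatic rejection.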

Do not split:

- a calculation from the numbers it uses
- a rumor from the source and confidence problem
- an actor reading from the actor name and shared scenario
- a dialogue beat from the reason it matters
- a success condition from the concept it tests

---

## 3. Required Chunk Metadata

Each chunk should carry metadata equivalent to:

```yaml
source_file:
repository_path:
layer:
document_id:
document_title:
section_heading: <nearest heading>
chunk_role: <principle | example | calculation | variant | actor_reading | dialogue_beat | success_condition | reference>
concept_tags:
  - <tag>
```

The corpus files already include most of this information in prose form. A chunking process should preserve or derive it.

---

## 4. Concept Tags

Use short, stable concept tags.

Examples:

```yaml
concept_tags:
  - local_price
  - total_cost
  - profit_arithmetic
  - delay_cost
  - rumor_uncertainty
  - hidden_true_state
  - source_motive
  - actor_perspective
  - credit_trust
  - non_coin_settlement
  - warehouse_right
  - transport_capacity
  - rivalry
  - hard_stop
```

A chunk may have multiple tags.

Do not over-tag. Prefer 3 to 7 tags per chunk.

---

## 5. Layer 0 Chunking Rules

Layer 0 contains primitive facts.

Chunk by conceptual section.

Preferred chunks:

1. header + principle
2. Roman-visible example
3. minimal structure
4. incorrect modern assumption + correction
5. simulation use + canonical test
6. success condition, if substantial

A Layer 0 chunk should teach one primitive only.

Do not combine separate files into one chunk.

Do not split the principle from the title.

### Example Chunk Roles

```yaml
chunk_role: principle
chunk_role: roman_visible_example
chunk_role: incorrect_assumption
chunk_role: simulation_use
chunk_role: success_condition
```

---

## 6. Layer 1 Chunking Rules

Layer 1 contains worked examples.

Chunk by reasoning unit.

Preferred chunks:

1. scenario + known facts
2. first incorrect calculation
3. total cost or profit calculation
4. variant A / B / C, grouped if short
5. correct model behavior
6. incorrect model behavior
7. layer references + success condition

A calculation chunk must include:

- the scenario values
- the formula or arithmetic
- the interpretation of the result

Do not split:

```text
sale value - total cost = result
```

from the values used to produce it.

### Example Chunk Roles

```yaml
chunk_role: scenario
chunk_role: calculation
chunk_role: risk_variant
chunk_role: correct_behavior
chunk_role: incorrect_behavior
```

---

## 7. Layer 2 Chunking Rules

Layer 2 contains uncertainty.

Chunk by evidence and uncertainty structure.

Preferred chunks:

1. scenario + report or signal
2. known facts + unknowns
3. possible truth states or interpretations
4. decision options
5. correct model behavior
6. incorrect model behavior
7. success condition

A Layer 2 chunk should preserve the distinction between:

```text
reported_state
known_state
hidden_true_state
actor_confidence
final_resolution
```

Do not split an uncertainty example so that the report is separated from its age, source, motive, or confidence problem.

### Example Chunk Roles

```yaml
chunk_role: report
chunk_role: evidence_structure
chunk_role: truth_variants
chunk_role: decision_options
chunk_role: uncertainty_behavior
```

---

## 8. Layer 3 Chunking Rules

Layer 3 contains actor perspective.

Chunk by actor section, plus shared setup.

Preferred chunks:

1. shared scenario facts
2. actor reading: Varro
3. actor reading: Felix
4. actor reading: Lentulus
5. actor reading: Crispus
6. actor reading: Secundus
7. actor reading: Chresimus
8. comparison table + success condition

Each actor-reading chunk must include:

- actor name
- actor background label
- shared event reference or summary
- actor questions
- interpretation block
- first action or decision threshold
- why that actor reads the event that way

Do not create chunks that contain only the interpretation block without the actor identity.
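
The required contents of an actor-reading chunk could be enforced with a completeness check like the one below. This is a minimal sketch under stated assumptions: the standard lists the required elements in prose only, so the dictionary keys (`actor_name`, `rationale`, and so on) and the function name `missing_actor_reading_fields` are hypothetical.

```python
# Hypothetical field names, one per required element listed in section 8.
REQUIRED_ACTOR_READING_FIELDS = (
    "actor_name",
    "actor_background",
    "shared_event",
    "actor_questions",
    "interpretation",
    "first_action_or_threshold",
    "rationale",
)


def missing_actor_reading_fields(chunk: dict) -> list[str]:
    """Return the required fields that are absent or empty in the chunk."""
    return [
        field
        for field in REQUIRED_ACTOR_READING_FIELDS
        if not chunk.get(field)
    ]
```

A chunk that carries only an `interpretation` value would report `actor_name` among its missing fields, which is the case the rule above forbids.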

### Example Chunk Roles

```yaml
chunk_role: shared_facts
chunk_role: actor_reading
chunk_role: comparison
chunk_role: success_condition
```

---

## 9. Layer 4 Dialogue Chunking Rules

Layer 4 contains dialogue.

Dialogue must be chunked by scene beat, not by arbitrary length.

A dialogue chunk should preserve:

- setting
- participating speakers
- visible signal or topic
- the exchange
- the concept being demonstrated
- any implicit decision pressure

Preferred dialogue beats:

1. scene opening and visible signal
2. first actor interpretation
3. second actor challenge or correction
4. conflict between readings
5. arithmetic or practical consequence
6. decision point
7. closing interpretation or success condition

A dialogue chunk is weak if it contains only clever banter.

A dialogue chunk is useful if it contains:

```text
signal -> interpretation -> challenge -> economic meaning
```

### Required Dialogue Chunk Metadata

Dialogue chunks should include additional metadata:

```yaml
speakers:
  - <actor>
scene_location: <place>
scene_signal: <visible event, rumor, cargo, document, price, or social change>
demonstrated_concepts:
  - <concept tag>
```

### Dialogue Chunk Rule

Do not split a question from the answer that gives it meaning.

Do not split a false claim from the correction that makes it useful.

Do not split a joke or quip from the economic point it reveals.

---

## 10. Arithmetic Chunking Rule

Any chunk containing arithmetic must include:

- all input values
- the formula or operation
- the result
- the interpretation

A complete arithmetic chunk looks like:

```text
purchase value = 20 asses
transport cost = 6 asses
handling cost = 2 asses
sale value = 34 asses
total cost = 20 + 6 + 2 = 28 asses
profit = 34 - 28 = 6 asses
```

Then it must state what the result means.

Never chunk only:

```text
profit = 6 asses
```

without the values that produced it.

---

## 11. Roman-Visible Knowledge Rule

Chunks should preserve whether a fact is:

```text
actor-visible
reported
inferred
hidden_true_state
settled_result
designer_analysis
```

This distinction is central to the training corpus.

If a chunk includes hidden truth, label it clearly.

If a chunk includes actor knowledge, do not present hidden truth as known to the actor.

---

## 12. Cross-Reference Rule

Layer references should remain inside chunks when they explain the training purpose.

However, a chunk should not rely entirely on cross-references.

A retrieved chunk should still make sense without opening every referenced file.

Cross-references are support, not replacement.

---

## 13. Naming Rule

Chunk identifiers should be deterministic.

Recommended format:

```text
<document_id>::<section_number>::<chunk_role>
```

Examples:

```text
CORPUS-0005::04::correct_behavior
CORPUS-0011::06::actor_reading_secundus
DIALOGUE-0002::03::scene_beat_cart_delay
```

For repeated roles:

```text
CORPUS-0008::04a::variant_true
CORPUS-0008::04b::variant_partial
CORPUS-0008::04c::variant_false
```

---

## 14. Minimum Chunk Quality Test

Before accepting a chunk, ask:

1. Does it say what file and layer it came from?
2. Does it preserve the concept being taught?
3. Does it include enough facts to understand the example?
4. Does it keep arithmetic with its inputs?
5. Does it distinguish known, reported, inferred, hidden, and settled facts?
6. Does it preserve actor identity when actor perspective matters?
7. Does it avoid isolated banter?
8. Does it include the model behavior being trained or corrected?

If the answer to any critical question is no, adjust the chunk boundary.

---

## 15. Success Condition

This chunking standard is functioning correctly if retrieval returns chunks that teach reasoning units rather than fragments of prose.

A retrieved chunk should let the model reconstruct:

```text
what is happening
what is known
what is uncertain
what relation matters
what calculation applies
what actor lens applies
what behavior is correct or incorrect
```

If retrieved chunks only provide style, vocabulary, or isolated statements, the chunking has failed.
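
As an illustration of the deterministic identifiers described in the Naming Rule (section 13), the recommended format could be produced by a helper like the one below. This is a sketch under stated assumptions: the function name `make_chunk_id` is hypothetical, the two-digit zero padding is inferred from the examples in section 13 and is not stated as a rule, and the normalization of the role to lowercase with underscores is an added convenience.

```python
import re


def make_chunk_id(document_id: str, section: str, chunk_role: str) -> str:
    """Build '<document_id>::<section_number>::<chunk_role>'.

    `section` is a section number with an optional single-letter suffix
    for repeated roles, e.g. '4' or '4a'.
    """
    match = re.fullmatch(r"(\d+)([a-z]?)", section.strip())
    if match is None:
        raise ValueError(f"unrecognized section number: {section!r}")
    number, suffix = match.groups()
    # Assumption: roles are lowercase tokens joined by underscores.
    role = re.sub(r"[^a-z0-9]+", "_", chunk_role.lower()).strip("_")
    if not role:
        raise ValueError("chunk_role must not be empty")
    # Assumption: pad to two digits, as in the section 13 examples.
    return f"{document_id}::{int(number):02d}{suffix}::{role}"
```

Given the same document, section, and role, the helper always returns the same identifier, so re-running a chunking process does not rename existing chunks.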