From 38a7048c0ba867097591c7491d6ed7d2fb4af97c Mon Sep 17 00:00:00 2001
From: TheRON
Date: Thu, 30 Apr 2026 12:00:35 -0400
Subject: [PATCH] initial upload

---
 .../chunking/CHUNKING-STANDARD-0001.md | 473 ++++++++++++++++++
 1 file changed, 473 insertions(+)
 create mode 100644 docs/training/chunking/CHUNKING-STANDARD-0001.md

diff --git a/docs/training/chunking/CHUNKING-STANDARD-0001.md b/docs/training/chunking/CHUNKING-STANDARD-0001.md
new file mode 100644
index 0000000..e9ffa35
--- /dev/null
+++ b/docs/training/chunking/CHUNKING-STANDARD-0001.md
@@ -0,0 +1,473 @@
+# CHUNKING-STANDARD-0001
+## Training Corpus Chunking Standard
+### Status: Draft Standard
+### Layer: Training Infrastructure
+### Purpose: Define how OTIVM training documents should be chunked for retrieval, review, and future model preparation
+### Repository Path: docs/training/chunking/CHUNKING-STANDARD-0001.md
+
+---
+
+## 0. Purpose
+
+This document defines chunking rules for the OTIVM training corpus.
+
+The training corpus is layered. Each layer teaches a different kind of reasoning. Chunking must preserve that reasoning.
+
+The goal is not to split files into equal text lengths.
+
+The goal is to preserve usable training units.
+
+A good chunk should allow the model to answer:
+
+- what concept is being taught?
+- what facts are available?
+- what uncertainty remains?
+- what arithmetic or relation is being demonstrated?
+- which actor perspective is active?
+- what behavior should the model learn or avoid?
+
+---
+
+## 1. General Rule
+
+Chunk by meaning, not by size.
+
+A chunk should be self-contained enough to be retrieved without requiring the entire file.
+
+Each chunk should preserve:
+
+- file identity
+- layer
+- topic
+- local section heading
+- relevant example facts
+- any calculation needed to understand the point
+- correct and incorrect model behavior where applicable
+
+Avoid chunks that contain only:
+
+- isolated dialogue lines
+- arithmetic without scenario context
+- conclusions without evidence
+- actor interpretation without shared facts
+- principles without example or test
+
+---
+
+## 2. Preferred Chunk Size
+
+Preferred chunk size:
+
+```text
+300 to 900 words
+```
+
+Acceptable range:
+
+```text
+150 to 1200 words
+```
+
+Use shorter chunks when the section is atomic.
+
+Use longer chunks when splitting would separate a calculation from its explanation or a dialogue exchange from its demonstrated concept.
+
+Do not split:
+
+- a calculation from the numbers it uses
+- a rumor from the source and confidence problem
+- an actor reading from the actor name and shared scenario
+- a dialogue beat from the reason it matters
+- a success condition from the concept it tests
+
+---
+
+## 3. Required Chunk Metadata
+
+Each chunk should carry metadata equivalent to:
+
+```yaml
+source_file:
+repository_path:
+layer:
+document_id:
+document_title:
+section_heading: <nearest heading>
+chunk_role: <principle | example | calculation | variant | actor_reading | dialogue_beat | success_condition | reference>
+concept_tags:
+  - <tag>
+```
+
+The corpus files already include most of this information in prose form. A chunking process should preserve or derive it.
+
+---
+
+## 4. Concept Tags
+
+Use short, stable concept tags.
+
+Examples:
+
+```yaml
+concept_tags:
+  - local_price
+  - total_cost
+  - profit_arithmetic
+  - delay_cost
+  - rumor_uncertainty
+  - hidden_true_state
+  - source_motive
+  - actor_perspective
+  - credit_trust
+  - non_coin_settlement
+  - warehouse_right
+  - transport_capacity
+  - rivalry
+  - hard_stop
+```
+
+A chunk may have multiple tags.
+
+Do not over-tag. Prefer 3 to 7 tags per chunk.
+
+---
+
+## 5. Layer 0 Chunking Rules
+
+Layer 0 contains primitive facts.
+
+Chunk by conceptual section.
+
+Preferred chunks:
+
+1. header + principle
+2. Roman-visible example
+3. minimal structure
+4. incorrect modern assumption + correction
+5. simulation use + canonical test
+6. success condition, if substantial
+
+A Layer 0 chunk should teach one primitive only.
+
+Do not combine separate files into one chunk.
+
+Do not split the principle from the title.
+
+### Example Chunk Roles
+
+```yaml
+chunk_role: principle
+chunk_role: roman_visible_example
+chunk_role: incorrect_assumption
+chunk_role: simulation_use
+chunk_role: success_condition
+```
+
+---
+
+## 6. Layer 1 Chunking Rules
+
+Layer 1 contains worked examples.
+
+Chunk by reasoning unit.
+
+Preferred chunks:
+
+1. scenario + known facts
+2. first incorrect calculation
+3. total cost or profit calculation
+4. variant A / B / C, grouped if short
+5. correct model behavior
+6. incorrect model behavior
+7. layer references + success condition
+
+A calculation chunk must include:
+
+- the scenario values
+- the formula or arithmetic
+- the interpretation of the result
+
+Do not split:
+
+```text
+sale value - total cost = result
+```
+
+from the values used to produce it.
+
+### Example Chunk Roles
+
+```yaml
+chunk_role: scenario
+chunk_role: calculation
+chunk_role: risk_variant
+chunk_role: correct_behavior
+chunk_role: incorrect_behavior
+```
+
+---
+
+## 7. Layer 2 Chunking Rules
+
+Layer 2 contains uncertainty.
+
+Chunk by evidence and uncertainty structure.
+
+Preferred chunks:
+
+1. scenario + report or signal
+2. known facts + unknowns
+3. possible truth states or interpretations
+4. decision options
+5. correct model behavior
+6. incorrect model behavior
+7. success condition
+
+A Layer 2 chunk should preserve the distinction between:
+
+```text
+reported_state
+known_state
+hidden_true_state
+actor_confidence
+final_resolution
+```
+
+Do not split an uncertainty example so that the report is separated from its age, source, motive, or confidence problem.
+
+### Example Chunk Roles
+
+```yaml
+chunk_role: report
+chunk_role: evidence_structure
+chunk_role: truth_variants
+chunk_role: decision_options
+chunk_role: uncertainty_behavior
+```
+
+---
+
+## 8. Layer 3 Chunking Rules
+
+Layer 3 contains actor perspective.
+
+Chunk by actor section, plus shared setup.
+
+Preferred chunks:
+
+1. shared scenario facts
+2. actor reading: Varro
+3. actor reading: Felix
+4. actor reading: Lentulus
+5. actor reading: Crispus
+6. actor reading: Secundus
+7. actor reading: Chresimus
+8. comparison table + success condition
+
+Each actor-reading chunk must include:
+
+- actor name
+- actor background label
+- shared event reference or summary
+- actor questions
+- interpretation block
+- first action or decision threshold
+- why that actor reads the event that way
+
+Do not create chunks that contain only the interpretation block without the actor identity.
+
+### Example Chunk Roles
+
+```yaml
+chunk_role: shared_facts
+chunk_role: actor_reading
+chunk_role: comparison
+chunk_role: success_condition
+```
+
+---
+
+## 9. Layer 4 Dialogue Chunking Rules
+
+Layer 4 contains dialogue.
+
+Dialogue must be chunked by scene beat, not by arbitrary length.
+
+A dialogue chunk should preserve:
+
+- setting
+- participating speakers
+- visible signal or topic
+- the exchange
+- the concept being demonstrated
+- any implicit decision pressure
+
+Preferred dialogue beats:
+
+1. scene opening and visible signal
+2. first actor interpretation
+3. second actor challenge or correction
+4. conflict between readings
+5. arithmetic or practical consequence
+6. decision point
+7. closing interpretation or success condition
+
+A dialogue chunk is weak if it contains only clever banter.
+
+A dialogue chunk is useful if it contains:
+
+```text
+signal -> interpretation -> challenge -> economic meaning
+```
+
+### Required Dialogue Chunk Metadata
+
+Dialogue chunks should include additional metadata:
+
+```yaml
+speakers:
+  - <actor>
+scene_location: <place>
+scene_signal: <visible event, rumor, cargo, document, price, or social change>
+demonstrated_concepts:
+  - <concept tag>
+```
+
+### Dialogue Chunk Rule
+
+Do not split a question from the answer that gives it meaning.
+
+Do not split a false claim from the correction that makes it useful.
+
+Do not split a joke or quip from the economic point it reveals.
+
+---
+
+## 10. Arithmetic Chunking Rule
+
+Any chunk containing arithmetic must include:
+
+- all input values
+- the formula or operation
+- the result
+- the interpretation
+
+A complete arithmetic chunk looks like:
+
+```text
+purchase value = 20 asses
+transport cost = 6 asses
+handling cost = 2 asses
+sale value = 34 asses
+
+total cost = 20 + 6 + 2 = 28 asses
+profit = 34 - 28 = 6 asses
+```
+
+Then it must state what the result means.
+
+Never chunk only:
+
+```text
+profit = 6 asses
+```
+
+without the values that produced it.
+
+---
+
+## 11. Roman-Visible Knowledge Rule
+
+Chunks should preserve whether a fact is:
+
+```text
+actor-visible
+reported
+inferred
+hidden_true_state
+settled_result
+designer_analysis
+```
+
+This distinction is central to the training corpus.
+
+If a chunk includes hidden truth, label it clearly.
+
+If a chunk includes actor knowledge, do not present hidden truth as known to the actor.
+
+---
+
+## 12. Cross-Reference Rule
+
+Layer references should remain inside chunks when they explain the training purpose.
+
+However, a chunk should not rely entirely on cross-references.
+
+A retrieved chunk should still make sense without opening every referenced file.
+
+Cross-references are support, not replacement.
+
+---
+
+## 13. Naming Rule
+
+Chunk identifiers should be deterministic.
+
+Recommended format:
+
+```text
+<document_id>::<section_number>::<chunk_role>
+```
+
+Examples:
+
+```text
+CORPUS-0005::04::correct_behavior
+CORPUS-0011::06::actor_reading_secundus
+DIALOGUE-0002::03::scene_beat_cart_delay
+```
+
+For repeated roles:
+
+```text
+CORPUS-0008::04a::variant_true
+CORPUS-0008::04b::variant_partial
+CORPUS-0008::04c::variant_false
+```
+
+---
+
+## 14. Minimum Chunk Quality Test
+
+Before accepting a chunk, ask:
+
+1. Does it say what file and layer it came from?
+2. Does it preserve the concept being taught?
+3. Does it include enough facts to understand the example?
+4. Does it keep arithmetic with its inputs?
+5. Does it distinguish known, reported, inferred, hidden, and settled facts?
+6. Does it preserve actor identity when actor perspective matters?
+7. Does it avoid isolated banter?
+8. Does it include the model behavior being trained or corrected?
+
+If the answer to any critical question is no, adjust the chunk boundary.
+
+---
+
+## 15. Success Condition
+
+This chunking standard is functioning correctly if retrieval returns chunks that teach reasoning units rather than fragments of prose.
+
+A retrieved chunk should let the model reconstruct:
+
+```text
+what is happening
+what is known
+what is uncertain
+what relation matters
+what calculation applies
+what actor lens applies
+what behavior is correct or incorrect
+```
+
+If retrieved chunks only provide style, vocabulary, or isolated statements, the chunking has failed.
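
A possible illustration of the mechanical parts of the standard above: the deterministic identifier format from section 13, the word-count ranges from section 2, and the tag-count preference from section 4. This is only a sketch. The `Chunk` class and the functions `chunk_id`, `size_band`, and `review_flags` are hypothetical names that do not appear in the standard, and the flags are meant as prompts for a human reviewer, since the standard asks for chunking by meaning rather than by size.

```python
from dataclasses import dataclass, field
from typing import List

# Ranges taken from sections 2 and 4 of the standard.
PREFERRED_WORDS = (300, 900)
ACCEPTABLE_WORDS = (150, 1200)
PREFERRED_TAGS = (3, 7)


@dataclass
class Chunk:
    """Hypothetical container for the fields used in this sketch."""
    document_id: str
    section_number: str
    chunk_role: str
    text: str
    concept_tags: List[str] = field(default_factory=list)


def chunk_id(chunk: Chunk) -> str:
    """Build <document_id>::<section_number>::<chunk_role>."""
    return f"{chunk.document_id}::{chunk.section_number}::{chunk.chunk_role}"


def size_band(text: str) -> str:
    """Classify a whitespace-based word count against the stated ranges."""
    words = len(text.split())
    if PREFERRED_WORDS[0] <= words <= PREFERRED_WORDS[1]:
        return "preferred"
    if ACCEPTABLE_WORDS[0] <= words <= ACCEPTABLE_WORDS[1]:
        return "acceptable"
    return "out_of_range"


def review_flags(chunk: Chunk) -> List[str]:
    """Return reviewer prompts; an empty list means nothing was flagged."""
    flags = []
    if size_band(chunk.text) == "out_of_range":
        flags.append("word count outside 150 to 1200")
    if not PREFERRED_TAGS[0] <= len(chunk.concept_tags) <= PREFERRED_TAGS[1]:
        flags.append("tag count outside 3 to 7")
    return flags


example = Chunk(
    document_id="CORPUS-0008",
    section_number="04a",
    chunk_role="variant_true",
    text="word " * 320,
    concept_tags=["rumor_uncertainty", "hidden_true_state", "source_motive"],
)
print(chunk_id(example))      # CORPUS-0008::04a::variant_true
print(size_band(example.text))
print(review_flags(example))
```

Because the identifier is built only from fields the chunk already carries, re-running the chunker on an unchanged file yields the same identifiers. A size or tag flag does not by itself mean the chunk boundary is wrong; section 2 allows shorter chunks for atomic sections.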