initial upload

2026-04-30 12:00:35 -04:00
parent bac7e0c8ba
commit 38a7048c0b
1 changed files with 473 additions and 0 deletions
--- a/docs/training/chunking/CHUNKING-STANDARD-0001.md
+++ b/docs/training/chunking/CHUNKING-STANDARD-0001.md
@@ -0,0 +1,473 @@
 # CHUNKING-STANDARD-0001
 ## Training Corpus Chunking Standard
 ### Status: Draft Standard
 ### Layer: Training Infrastructure
 ### Purpose: Define how OTIVM training documents should be chunked for retrieval, review, and future model preparation
 ### Repository Path: docs/training/chunking/CHUNKING-STANDARD-0001.md
 ---
 ## 0. Purpose
 This document defines chunking rules for the OTIVM training corpus.
 The training corpus is layered. Each layer teaches a different kind of reasoning. Chunking must preserve that reasoning.
 The goal is not to split files into equal text lengths.
 The goal is to preserve usable training units.
 A good chunk should allow the model to answer:
 - what concept is being taught?
 - what facts are available?
 - what uncertainty remains?
 - what arithmetic or relation is being demonstrated?
 - which actor perspective is active?
 - what behavior should the model learn or avoid?
 ---
 ## 1. General Rule
 Chunk by meaning, not by size.
 A chunk should be self-contained enough to be understood when retrieved on its own, without the rest of the file.
 Each chunk should preserve:
 - file identity
 - layer
 - topic
 - local section heading
 - relevant example facts
 - any calculation needed to understand the point
 - correct and incorrect model behavior where applicable
 Avoid chunks that contain only:
 - isolated dialogue lines
 - arithmetic without scenario context
 - conclusions without evidence
 - actor interpretation without shared facts
 - principles without example or test
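One way to start chunking by meaning is to split on section headings and keep each heading attached to its body. The sketch below is a minimal illustration, not a prescribed implementation. The function name `split_by_heading`, the output keys, and the sample text (including the document ID `CORPUS-0000`) are assumptions made up for this example. It does not skip headings that appear inside fenced code blocks, and it does not merge or re-split sections; a real process would still need to do both.

```python
import re

# Matches markdown headings such as "## 1. General Rule".
HEADING = re.compile(r"^(#{1,6})\s+(.+)$")


def split_by_heading(text: str) -> list[dict]:
    """Split markdown text into sections, keeping each heading with its body."""
    sections: list[dict] = []
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        match = HEADING.match(line)
        if match:
            # A new heading opens a new candidate chunk.
            current = {"section_heading": match.group(2), "body_lines": []}
            sections.append(current)
        elif current is not None and line:
            current["body_lines"].append(line)
    return [
        {"section_heading": s["section_heading"], "body": "\n".join(s["body_lines"])}
        for s in sections
    ]


# Hypothetical sample text, invented for illustration only.
sample = """# CORPUS-0000
## Principle
A price is local.
## Example
Grain costs more inland.
"""
sections = split_by_heading(sample)
```

Each returned section is only a candidate. Whether it stands alone as a chunk still depends on the rules in this standard.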
 ---
 ## 2. Preferred Chunk Size
 Preferred chunk size:
 ```text
 300 to 900 words
 ```
 Acceptable range:
 ```text
 150 to 1200 words
 ```
 Use shorter chunks when the section is atomic.
 Use longer chunks when splitting would separate a calculation from its explanation or a dialogue exchange from its demonstrated concept.
 Do not split:
 - a calculation from the numbers it uses
 - a rumor from the source and confidence problem
 - an actor reading from the actor name and shared scenario
 - a dialogue beat from the reason it matters
 - a success condition from the concept it tests
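The size ranges above can be checked mechanically. The following is a minimal sketch, assuming words are counted by splitting on whitespace. The function name `classify_chunk_size` and the three return labels are assumptions for this example. A result of `out_of_range` is a prompt for review, since the do-not-split rules take priority over size.

```python
def classify_chunk_size(text: str) -> str:
    """Classify a chunk by word count against the ranges in this section."""
    words = len(text.split())  # assumption: whitespace-separated tokens count as words
    if 300 <= words <= 900:
        return "preferred"
    if 150 <= words <= 1200:
        return "acceptable"
    return "out_of_range"


size_label = classify_chunk_size("word " * 500)
```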
 ---
 ## 3. Required Chunk Metadata
 Each chunk should carry metadata equivalent to:
 ```yaml
 source_file: <filename>
 repository_path: <repo path>
 layer: <Layer_0--Primitive_Facts | Layer_1--Worked_Examples | Layer_2--Uncertainty | Layer_3--Actor_Perspective | Layer_4--Dialogues>
 document_id: <CORPUS-XXXX or DIALOGUE-XXXX>
 document_title: <title>
 section_heading: <nearest heading>
 chunk_role: <principle | example | calculation | variant | actor_reading | dialogue_beat | success_condition | reference>
 concept_tags:
  - <tag>
 ```
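 The corpus files already include most of this information in prose form. A chunking process should preserve it where it is present and derive it where it is not.
 The chunk_role values listed above are the general set. Sections 5 to 8 list layer-specific example roles.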
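The metadata block above could be carried as a small record type. This is a sketch under stated assumptions: the class name `ChunkMetadata`, the `problems` method, and the example values (file name, path, title, heading) are hypothetical. Only the field names and layer names come from this standard.

```python
from dataclasses import dataclass, field

# Layer names as listed in the metadata template.
LAYERS = {
    "Layer_0--Primitive_Facts",
    "Layer_1--Worked_Examples",
    "Layer_2--Uncertainty",
    "Layer_3--Actor_Perspective",
    "Layer_4--Dialogues",
}


@dataclass
class ChunkMetadata:
    source_file: str
    repository_path: str
    layer: str
    document_id: str
    document_title: str
    section_heading: str
    chunk_role: str
    concept_tags: list[str] = field(default_factory=list)

    def problems(self) -> list[str]:
        """Return a list of metadata problems; an empty list means none found."""
        issues = [f"missing {name}" for name, value in vars(self).items() if not value]
        if self.layer and self.layer not in LAYERS:
            issues.append(f"unknown layer: {self.layer}")
        return issues


# Hypothetical values, for illustration only.
meta = ChunkMetadata(
    source_file="CORPUS-0005.md",
    repository_path="docs/training/example/CORPUS-0005.md",
    layer="Layer_1--Worked_Examples",
    document_id="CORPUS-0005",
    document_title="Example title",
    section_heading="Correct Model Behavior",
    chunk_role="calculation",
    concept_tags=["total_cost", "profit_arithmetic", "delay_cost"],
)
metadata_problems = meta.problems()
```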
 ---
 ## 4. Concept Tags
 Use short, stable concept tags.
 Examples:
 ```yaml
 concept_tags:
  - local_price
  - total_cost
  - profit_arithmetic
  - delay_cost
  - rumor_uncertainty
  - hidden_true_state
  - source_motive
  - actor_perspective
  - credit_trust
  - non_coin_settlement
  - warehouse_right
  - transport_capacity
  - rivalry
  - hard_stop
 ```
 A chunk may have multiple tags.
 Do not over-tag. Prefer 3 to 7 tags per chunk.
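Tag discipline can be checked with a few lines of code. The sketch below assumes that "short, stable" means lowercase snake_case, which matches the examples above but is not stated as a rule. The function name `check_concept_tags` is hypothetical. The 3 to 7 range is a preference, so the messages are better read as warnings than as errors.

```python
import re

# Assumption: tags follow the lowercase snake_case shape of the examples above.
TAG_PATTERN = re.compile(r"^[a-z0-9]+(_[a-z0-9]+)*$")


def check_concept_tags(tags: list[str]) -> list[str]:
    """Return warnings for a chunk's concept tags; an empty list means none."""
    warnings = []
    if not 3 <= len(tags) <= 7:
        warnings.append(f"expected 3 to 7 tags, got {len(tags)}")
    for tag in tags:
        if not TAG_PATTERN.match(tag):
            warnings.append(f"tag is not lowercase snake_case: {tag}")
    if len(set(tags)) != len(tags):
        warnings.append("duplicate tags")
    return warnings


tag_warnings = check_concept_tags(["local_price", "total_cost", "profit_arithmetic"])
```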
 ---
 ## 5. Layer 0 Chunking Rules
 Layer 0 contains primitive facts.
 Chunk by conceptual section.
 Preferred chunks:
 1. header + principle
 2. Roman-visible example
 3. minimal structure
 4. incorrect modern assumption + correction
 5. simulation use + canonical test
 6. success condition, if substantial
 A Layer 0 chunk should teach one primitive only.
 Do not combine separate files into one chunk.
 Do not split the principle from the title.
 ### Example Chunk Roles
 ```yaml
 chunk_role: principle
 chunk_role: roman_visible_example
 chunk_role: incorrect_assumption
 chunk_role: simulation_use
 chunk_role: success_condition
 ```
 ---
 ## 6. Layer 1 Chunking Rules
 Layer 1 contains worked examples.
 Chunk by reasoning unit.
 Preferred chunks:
 1. scenario + known facts
 2. first incorrect calculation
 3. total cost or profit calculation
 4. variant A / B / C, grouped if short
 5. correct model behavior
 6. incorrect model behavior
 7. layer references + success condition
 A calculation chunk must include:
 - the scenario values
 - the formula or arithmetic
 - the interpretation of the result
 Do not split:
 ```text
 sale value - total cost = result
 ```
 from the values used to produce it.
 ### Example Chunk Roles
 ```yaml
 chunk_role: scenario
 chunk_role: calculation
 chunk_role: risk_variant
 chunk_role: correct_behavior
 chunk_role: incorrect_behavior
 ```
 ---
 ## 7. Layer 2 Chunking Rules
 Layer 2 contains uncertainty.
 Chunk by evidence and uncertainty structure.
 Preferred chunks:
 1. scenario + report or signal
 2. known facts + unknowns
 3. possible truth states or interpretations
 4. decision options
 5. correct model behavior
 6. incorrect model behavior
 7. success condition
 A Layer 2 chunk should preserve the distinction between:
 ```text
 reported_state
 known_state
 hidden_true_state
 actor_confidence
 final_resolution
 ```
 Do not split an uncertainty example so that the report is separated from its age, source, motive, or confidence problem.
 ### Example Chunk Roles
 ```yaml
 chunk_role: report
 chunk_role: evidence_structure
 chunk_role: truth_variants
 chunk_role: decision_options
 chunk_role: uncertainty_behavior
 ```
 ---
 ## 8. Layer 3 Chunking Rules
 Layer 3 contains actor perspective.
 Chunk by actor section, and keep the shared setup as its own chunk.
 Preferred chunks:
 1. shared scenario facts
 2. actor reading: Varro
 3. actor reading: Felix
 4. actor reading: Lentulus
 5. actor reading: Crispus
 6. actor reading: Secundus
 7. actor reading: Chresimus
 8. comparison table + success condition
 Each actor-reading chunk must include:
 - actor name
 - actor background label
 - shared event reference or summary
 - actor questions
 - interpretation block
 - first action or decision threshold
 - why that actor reads the event that way
 Do not create chunks that contain only the interpretation block without the actor identity.
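The required parts of an actor-reading chunk can be verified before the chunk is accepted. This is a minimal sketch. The field names in `REQUIRED_ACTOR_FIELDS` and the function `missing_actor_fields` are hypothetical names chosen to mirror the list above, and the placeholder values stand in for real corpus text.

```python
# Hypothetical field names, mirroring the required parts listed above.
REQUIRED_ACTOR_FIELDS = (
    "actor_name",
    "background_label",
    "shared_event",
    "actor_questions",
    "interpretation",
    "first_action",
    "reasoning",
)


def missing_actor_fields(chunk: dict) -> list[str]:
    """Return the required parts that are absent or empty in an actor-reading chunk."""
    return [name for name in REQUIRED_ACTOR_FIELDS if not chunk.get(name)]


# A chunk holding only the interpretation block fails the check.
orphan = {"interpretation": "<interpretation block>"}
orphan_missing = missing_actor_fields(orphan)

# Placeholder values stand in for real corpus text.
complete = {name: f"<{name}>" for name in REQUIRED_ACTOR_FIELDS}
complete_missing = missing_actor_fields(complete)
```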
 ### Example Chunk Roles
 ```yaml
 chunk_role: shared_facts
 chunk_role: actor_reading
 chunk_role: comparison
 chunk_role: success_condition
 ```
 ---
 ## 9. Layer 4 Dialogue Chunking Rules
 Layer 4 contains dialogue.
 Dialogue must be chunked by scene beat, not by arbitrary length.
 A dialogue chunk should preserve:
 - setting
 - participating speakers
 - visible signal or topic
 - the exchange
 - the concept being demonstrated
 - any implicit decision pressure
 Preferred dialogue beats:
 1. scene opening and visible signal
 2. first actor interpretation
 3. second actor challenge or correction
 4. conflict between readings
 5. arithmetic or practical consequence
 6. decision point
 7. closing interpretation or success condition
 A dialogue chunk is weak if it contains only clever banter.
 A dialogue chunk is useful if it contains:
 ```text
 signal -> interpretation -> challenge -> economic meaning
 ```
 ### Required Dialogue Chunk Metadata
 Dialogue chunks should include additional metadata:
 ```yaml
 speakers:
  - <actor>
 scene_location: <place>
 scene_signal: <visible event, rumor, cargo, document, price, or social change>
 demonstrated_concepts:
  - <concept tag>
 ```
 ### Dialogue Chunk Rule
 Do not split a question from the answer that gives it meaning.
 Do not split a false claim from the correction that makes it useful.
 Do not split a joke or quip from the economic point it reveals.
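If each line of a dialogue has been labeled with a stage, a candidate chunk can be tested for the signal -> interpretation -> challenge -> economic meaning pattern. The sketch below assumes such labels already exist, whether added by a reviewer or by an earlier step. The label strings and the function name `has_full_beat` are hypothetical.

```python
# Hypothetical stage labels, following the pattern named in this section.
BEAT_STAGES = ("signal", "interpretation", "challenge", "economic_meaning")


def has_full_beat(stages: list[str]) -> bool:
    """True if the four stages appear in order; other labels may sit between them."""
    remaining = iter(stages)
    # Each membership test consumes the iterator up to the match,
    # so the stages must be found in sequence.
    return all(stage in remaining for stage in BEAT_STAGES)


useful = has_full_beat(["signal", "interpretation", "banter", "challenge", "economic_meaning"])
weak = has_full_beat(["banter", "banter"])
```

A chunk that fails this check is a candidate for widening its boundary, not for deletion.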
 ---
 ## 10. Arithmetic Chunking Rule
 Any chunk containing arithmetic must include:
 - all input values
 - the formula or operation
 - the result
 - the interpretation
 A complete arithmetic chunk looks like:
 ```text
 purchase value = 20 asses
 transport cost = 6 asses
 handling cost = 2 asses
 sale value = 34 asses
 total cost = 20 + 6 + 2 = 28 asses
 profit = 34 - 28 = 6 asses
 ```
 The chunk must then state what the result means, for example that the sale leaves 6 asses only after transport and handling are counted.
 Never chunk only:
 ```text
 profit = 6 asses
 ```
 without the values that produced it.
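The complete arithmetic chunk shown above can be reproduced from its inputs, which keeps the values, the operation, the result, and the interpretation together. This is a sketch only. The function name `render_arithmetic_chunk` and the wording of the interpretation line are assumptions. The numbers are the ones used in this section.

```python
def render_arithmetic_chunk(costs: dict[str, int], sale_value: int) -> str:
    """Render inputs, operations, result, and interpretation as one text block."""
    total_cost = sum(costs.values())
    profit = sale_value - total_cost
    lines = [f"{name} = {value} asses" for name, value in costs.items()]
    lines.append(f"sale value = {sale_value} asses")
    terms = " + ".join(str(value) for value in costs.values())
    lines.append(f"total cost = {terms} = {total_cost} asses")
    lines.append(f"profit = {sale_value} - {total_cost} = {profit} asses")
    # Assumption: a one-line interpretation is enough for this sketch.
    if profit > 0:
        meaning = f"the sale covers every cost and leaves {profit} asses"
    elif profit == 0:
        meaning = "the sale only breaks even"
    else:
        meaning = f"the sale loses {-profit} asses"
    lines.append(f"interpretation: {meaning}")
    return "\n".join(lines)


chunk_text = render_arithmetic_chunk(
    {"purchase value": 20, "transport cost": 6, "handling cost": 2},
    sale_value=34,
)
```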
 ---
 ## 11. Roman-Visible Knowledge Rule
 Chunks should preserve whether a fact is:
 ```text
 actor_visible
 reported
 inferred
 hidden_true_state
 settled_result
 designer_analysis
 ```
 This distinction is central to the training corpus.
 If a chunk includes hidden truth, label it clearly.
 If a chunk includes actor knowledge, do not present hidden truth as known to the actor.
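The knowledge labels can be attached to individual facts so that an actor's view never includes hidden truth. The following is a minimal sketch. The names `KnowledgeStatus`, `Fact`, `ACTOR_KNOWABLE`, and `actor_view` are hypothetical, and the sample facts are invented for illustration. Treating only visible, reported, and inferred facts as actor knowledge at decision time is an assumption of this sketch.

```python
from dataclasses import dataclass
from enum import Enum, auto


class KnowledgeStatus(Enum):
    ACTOR_VISIBLE = auto()
    REPORTED = auto()
    INFERRED = auto()
    HIDDEN_TRUE_STATE = auto()
    SETTLED_RESULT = auto()
    DESIGNER_ANALYSIS = auto()


# Assumption: what an actor may hold at decision time.
ACTOR_KNOWABLE = {
    KnowledgeStatus.ACTOR_VISIBLE,
    KnowledgeStatus.REPORTED,
    KnowledgeStatus.INFERRED,
}


@dataclass(frozen=True)
class Fact:
    text: str
    status: KnowledgeStatus


def actor_view(facts: list[Fact]) -> list[Fact]:
    """Keep only the facts an actor could hold; hidden truth is filtered out."""
    return [fact for fact in facts if fact.status in ACTOR_KNOWABLE]


# Invented sample facts, for illustration only.
facts = [
    Fact("The cart has not arrived.", KnowledgeStatus.ACTOR_VISIBLE),
    Fact("A carter says the road is blocked.", KnowledgeStatus.REPORTED),
    Fact("The road is in fact clear.", KnowledgeStatus.HIDDEN_TRUE_STATE),
]
visible_to_actor = actor_view(facts)
```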
 ---
 ## 12. Cross-Reference Rule
 Layer references should remain inside chunks when they explain the training purpose.
 However, a chunk should not rely entirely on cross-references.
 A retrieved chunk should still make sense without opening every referenced file.
 Cross-references are support, not replacement.
 ---
 ## 13. Naming Rule
 Chunk identifiers should be deterministic.
 Recommended format:
 ```text
 <document_id>::<section_number>::<chunk_role>
 ```
 Examples (the role segment may carry a short qualifier, such as an actor name or scene label):
 ```text
 CORPUS-0005::04::correct_behavior
 CORPUS-0011::06::actor_reading_secundus
 DIALOGUE-0002::03::scene_beat_cart_delay
 ```
 For repeated roles within one section, add a letter suffix to the section number:
 ```text
 CORPUS-0008::04a::variant_true
 CORPUS-0008::04b::variant_partial
 CORPUS-0008::04c::variant_false
 ```
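Deterministic identifiers are simple to generate. The sketch below assumes two-digit zero-padded section numbers and lowercase letter suffixes, both inferred from the examples above. The function names `make_chunk_id` and `make_variant_ids` are hypothetical.

```python
import string


def make_chunk_id(document_id: str, section: int, role: str, suffix: str = "") -> str:
    """Build an identifier of the form <document_id>::<section_number>::<chunk_role>."""
    # Assumption: section numbers are zero-padded to two digits, as in the examples.
    return f"{document_id}::{section:02d}{suffix}::{role}"


def make_variant_ids(document_id: str, section: int, roles: list[str]) -> list[str]:
    """Assign letter suffixes a, b, c, ... to repeated roles within one section."""
    return [
        make_chunk_id(document_id, section, role, letter)
        for letter, role in zip(string.ascii_lowercase, roles)
    ]


single_id = make_chunk_id("CORPUS-0005", 4, "correct_behavior")
variant_ids = make_variant_ids(
    "CORPUS-0008", 4, ["variant_true", "variant_partial", "variant_false"]
)
```

The same inputs always give the same identifier, so re-chunking a file does not create new names for unchanged chunks.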
 ---
 ## 14. Minimum Chunk Quality Test
 Before accepting a chunk, ask:
 1. Does it say what file and layer it came from?
 2. Does it preserve the concept being taught?
 3. Does it include enough facts to understand the example?
 4. Does it keep arithmetic with its inputs?
 5. Does it distinguish known, reported, inferred, hidden, and settled facts?
 6. Does it preserve actor identity when actor perspective matters?
 7. Does it avoid isolated banter?
 8. Does it include the model behavior being trained or corrected?
 If the answer to any question that applies to the chunk is no, adjust the chunk boundary.
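The checklist can be recorded per chunk so that failures are explicit. This is a sketch, not a scoring system. The short question labels, the function name `failed_questions`, and the convention that `None` marks a question as not applicable are assumptions of this example. The answers themselves still come from a reviewer.

```python
from typing import Optional

# Short labels for the eight questions above.
QUALITY_QUESTIONS = {
    1: "states source file and layer",
    2: "preserves the concept being taught",
    3: "includes enough facts to understand the example",
    4: "keeps arithmetic with its inputs",
    5: "distinguishes known, reported, inferred, hidden, and settled facts",
    6: "preserves actor identity where actor perspective matters",
    7: "avoids isolated banter",
    8: "includes the model behavior being trained or corrected",
}


def failed_questions(answers: dict[int, Optional[bool]]) -> list[str]:
    """Return failed questions. None means not applicable; a missing answer fails."""
    failed = []
    for number, question in QUALITY_QUESTIONS.items():
        if answers.get(number, False) is False:
            failed.append(f"{number}. {question}")
    return failed


# Hypothetical review of a calculation chunk with no actors or dialogue.
review = {1: True, 2: True, 3: True, 4: False, 5: True, 6: None, 7: None, 8: True}
failures = failed_questions(review)
```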
 ---
 ## 15. Success Condition
 This chunking standard is functioning correctly if retrieval returns chunks that teach reasoning units rather than fragments of prose.
 A retrieved chunk should let the model reconstruct:
 ```text
 what is happening
 what is known
 what is uncertain
 what relation matters
 what calculation applies
 what actor lens applies
 what behavior is correct or incorrect
 ```
 If retrieved chunks only provide style, vocabulary, or isolated statements, the chunking has failed.