initial upload

2026-04-30 12:00:35 -04:00
parent bac7e0c8ba
commit 38a7048c0b
1 changed file with 473 additions and 0 deletions
--- a/docs/training/chunking/CHUNKING-STANDARD-0001.md
+++ b/docs/training/chunking/CHUNKING-STANDARD-0001.md
@@ -0,0 +1,473 @@
+# CHUNKING-STANDARD-0001
+## Training Corpus Chunking Standard
+### Status: Draft Standard
+### Layer: Training Infrastructure
+### Purpose: Define how OTIVM training documents should be chunked for retrieval, review, and future model preparation
+### Repository Path: docs/training/chunking/CHUNKING-STANDARD-0001.md
+
+---
+
+## 0. Purpose
+
+This document defines chunking rules for the OTIVM training corpus.
+
+The training corpus is layered. Each layer teaches a different kind of reasoning. Chunking must preserve that reasoning.
+
+The goal is not to split files into equal text lengths.
+
+The goal is to preserve usable training units.
+
+A good chunk should allow the model to answer:
+
+- what concept is being taught?
+- what facts are available?
+- what uncertainty remains?
+- what arithmetic or relation is being demonstrated?
+- which actor perspective is active?
+- what behavior should the model learn or avoid?
+
+---
+
+## 1. General Rule
+
+Chunk by meaning, not by size.
+
+A chunk should be self-contained enough to be understood when retrieved on its own, without the rest of the file.
+
+Each chunk should preserve:
+
+- file identity
+- layer
+- topic
+- local section heading
+- relevant example facts
+- any calculation needed to understand the point
+- correct and incorrect model behavior where applicable
+
+Avoid chunks that contain only:
+
+- isolated dialogue lines
+- arithmetic without scenario context
+- conclusions without evidence
+- actor interpretation without shared facts
+- principles without example or test
+
+---
+
+## 2. Preferred Chunk Size
+
+Preferred chunk size:
+
+```text
+300 to 900 words
+```
+
+Acceptable range:
+
+```text
+150 to 1200 words
+```
+
+Use shorter chunks when the section is atomic.
+
+Use longer chunks when splitting would separate a calculation from its explanation or a dialogue exchange from its demonstrated concept.
+
+Do not split:
+
+- a calculation from the numbers it uses
+- a rumor from its source and its confidence problem
+- an actor reading from the actor's name and the shared scenario
+- a dialogue beat from the reason it matters
+- a success condition from the concept it tests
+
+---
+
+## 3. Required Chunk Metadata
+
+Each chunk should carry metadata equivalent to:
+
+```yaml
+source_file: <filename>
+repository_path: <repo path>
+layer: <Layer_0--Primitive_Facts | Layer_1--Worked_Examples | Layer_2--Uncertainty | Layer_3--Actor_Perspective | Layer_4--Dialogues>
+document_id: <CORPUS-XXXX or DIALOGUE-XXXX>
+document_title: <title>
+section_heading: <nearest heading>
+chunk_role: <principle | example | calculation | variant | actor_reading | dialogue_beat | success_condition | reference>
+concept_tags:
+  - <tag>
+```
+
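+The corpus files already include most of this information in prose form. A chunking process should preserve or derive it.
+
+The `chunk_role` values above are the common set. Sections 5 through 9 list further layer-specific roles, such as `scenario`, `report`, and `shared_facts`.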
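+
+As an illustration, a filled-in record for a calculation chunk might look like the sketch below. Every value is hypothetical: the file name, repository path, title, and heading are assumptions made for the example and are not taken from the corpus.
+
+```yaml
+# Hypothetical example; all values are illustrative, not taken from the corpus.
+source_file: CORPUS-0005.md
+repository_path: docs/training/corpus/CORPUS-0005.md   # assumed path
+layer: Layer_1--Worked_Examples
+document_id: CORPUS-0005
+document_title: Total Cost Before Profit               # assumed title
+section_heading: Total Cost Calculation                # assumed heading
+chunk_role: calculation
+concept_tags:
+  - total_cost
+  - profit_arithmetic
+  - local_price
+```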
+
+---
+
+## 4. Concept Tags
+
+Use short, stable concept tags.
+
+Examples:
+
+```yaml
+concept_tags:
+  - local_price
+  - total_cost
+  - profit_arithmetic
+  - delay_cost
+  - rumor_uncertainty
+  - hidden_true_state
+  - source_motive
+  - actor_perspective
+  - credit_trust
+  - non_coin_settlement
+  - warehouse_right
+  - transport_capacity
+  - rivalry
+  - hard_stop
+```
+
+A chunk may have multiple tags.
+
+Do not over-tag. Prefer 3 to 7 tags per chunk.
+
+---
+
+## 5. Layer 0 Chunking Rules
+
+Layer 0 contains primitive facts.
+
+Chunk by conceptual section.
+
+Preferred chunks:
+
+1. header + principle
+2. Roman-visible example
+3. minimal structure
+4. incorrect modern assumption + correction
+5. simulation use + canonical test
+6. success condition, if substantial
+
+A Layer 0 chunk should teach one primitive only.
+
+Do not combine separate files into one chunk.
+
+Do not split the principle from the title.
+
+### Example Chunk Roles
+
+```yaml
+chunk_role: principle
+chunk_role: roman_visible_example
+chunk_role: incorrect_assumption
+chunk_role: simulation_use
+chunk_role: success_condition
+```
+
+---
+
+## 6. Layer 1 Chunking Rules
+
+Layer 1 contains worked examples.
+
+Chunk by reasoning unit.
+
+Preferred chunks:
+
+1. scenario + known facts
+2. first incorrect calculation
+3. total cost or profit calculation
+4. variant A / B / C, grouped if short
+5. correct model behavior
+6. incorrect model behavior
+7. layer references + success condition
+
+A calculation chunk must include:
+
+- the scenario values
+- the formula or arithmetic
+- the interpretation of the result
+
+Do not split:
+
+```text
+sale value - total cost = result
+```
+
+from the values used to produce it.
+
+### Example Chunk Roles
+
+```yaml
+chunk_role: scenario
+chunk_role: calculation
+chunk_role: risk_variant
+chunk_role: correct_behavior
+chunk_role: incorrect_behavior
+```
+
+---
+
+## 7. Layer 2 Chunking Rules
+
+Layer 2 contains uncertainty.
+
+Chunk by evidence and uncertainty structure.
+
+Preferred chunks:
+
+1. scenario + report or signal
+2. known facts + unknowns
+3. possible truth states or interpretations
+4. decision options
+5. correct model behavior
+6. incorrect model behavior
+7. success condition
+
+A Layer 2 chunk should preserve the distinction between:
+
+```text
+reported_state
+known_state
+hidden_true_state
+actor_confidence
+final_resolution
+```
+
+Do not split an uncertainty example so that the report is separated from its age, source, motive, or confidence problem.
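+
+A chunk that keeps these states apart could be labeled roughly as follows. The scenario is invented for illustration, and the keys under `report` (`source`, `age_days`, `possible_motive`) are assumed names, not a required schema.
+
+```yaml
+# Hypothetical scenario, invented for illustration only.
+report:
+  reported_state: grain ship delayed at Ostia
+  source: rival merchant's clerk
+  age_days: 3
+  possible_motive: rival benefits if buyers wait
+known_state: no ship has unloaded at the main quay this week
+hidden_true_state: ship arrived yesterday at a different quay   # not visible to any actor
+actor_confidence: low
+final_resolution: unresolved in this chunk
+```
+
+The report stays attached to its age, source, and motive, and the hidden truth is labeled so that it cannot be read as actor knowledge.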
+
+### Example Chunk Roles
+
+```yaml
+chunk_role: report
+chunk_role: evidence_structure
+chunk_role: truth_variants
+chunk_role: decision_options
+chunk_role: uncertainty_behavior
+```
+
+---
+
+## 8. Layer 3 Chunking Rules
+
+Layer 3 contains actor perspective.
+
+Chunk by actor section, plus shared setup.
+
+Preferred chunks:
+
+1. shared scenario facts
+2. actor reading: Varro
+3. actor reading: Felix
+4. actor reading: Lentulus
+5. actor reading: Crispus
+6. actor reading: Secundus
+7. actor reading: Chresimus
+8. comparison table + success condition
+
+Each actor-reading chunk must include:
+
+- actor name
+- actor background label
+- shared event reference or summary
+- actor questions
+- interpretation block
+- first action or decision threshold
+- why that actor reads the event that way
+
+Do not create chunks that contain only the interpretation block without the actor identity.
+
+### Example Chunk Roles
+
+```yaml
+chunk_role: shared_facts
+chunk_role: actor_reading
+chunk_role: comparison
+chunk_role: success_condition
+```
+
+---
+
+## 9. Layer 4 Dialogue Chunking Rules
+
+Layer 4 contains dialogue.
+
+Dialogue must be chunked by scene beat, not by arbitrary length.
+
+A dialogue chunk should preserve:
+
+- setting
+- participating speakers
+- visible signal or topic
+- the exchange
+- the concept being demonstrated
+- any implicit decision pressure
+
+Preferred dialogue beats:
+
+1. scene opening and visible signal
+2. first actor interpretation
+3. second actor challenge or correction
+4. conflict between readings
+5. arithmetic or practical consequence
+6. decision point
+7. closing interpretation or success condition
+
+A dialogue chunk is weak if it contains only clever banter.
+
+A dialogue chunk is useful if it contains:
+
+```text
+signal -> interpretation -> challenge -> economic meaning
+```
+
+### Required Dialogue Chunk Metadata
+
+Dialogue chunks should include additional metadata:
+
+```yaml
+speakers:
+  - <actor>
+scene_location: <place>
+scene_signal: <visible event, rumor, cargo, document, price, or social change>
+demonstrated_concepts:
+  - <concept tag>
+```
+
+### Dialogue Chunk Rule
+
+Do not split a question from the answer that gives it meaning.
+
+Do not split a false claim from the correction that makes it useful.
+
+Do not split a joke or quip from the economic point it reveals.
+
+---
+
+## 10. Arithmetic Chunking Rule
+
+Any chunk containing arithmetic must include:
+
+- all input values
+- the formula or operation
+- the result
+- the interpretation
+
+A complete arithmetic chunk looks like:
+
+```text
+purchase value = 20 asses
+transport cost = 6 asses
+handling cost = 2 asses
+sale value = 34 asses
+
+total cost = 20 + 6 + 2 = 28 asses
+profit = 34 - 28 = 6 asses
+```
+
+The chunk must then state what the result means.
+
+Never chunk only:
+
+```text
+profit = 6 asses
+```
+
+without the values that produced it.
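+
+One way to hold all four parts together is a single record, sketched below with the values from the example above. The key names and the interpretation wording are assumptions, not corpus text.
+
+```yaml
+# Illustrative only; key names and interpretation wording are assumptions.
+chunk_role: calculation
+inputs:
+  purchase_value_asses: 20
+  transport_cost_asses: 6
+  handling_cost_asses: 2
+  sale_value_asses: 34
+operations:
+  - total cost = 20 + 6 + 2 = 28 asses
+  - profit = 34 - 28 = 6 asses
+interpretation: >
+  The trade clears 6 asses only after transport and handling are counted.
+  Comparing sale value to purchase value alone would overstate profit as 14 asses.
+```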
+
+---
+
+## 11. Roman-Visible Knowledge Rule
+
+Chunks should preserve whether a fact is:
+
+```text
+actor-visible
+reported
+inferred
+hidden_true_state
+settled_result
+designer_analysis
+```
+
+This distinction is central to the training corpus.
+
+If a chunk includes hidden truth, label it clearly.
+
+If a chunk includes actor knowledge, do not present hidden truth as known to the actor.
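+
+A per-fact labeling could be sketched as follows. The facts are invented to show the labels in use, and the `facts`, `label`, and `text` keys are assumed names.
+
+```yaml
+# Hypothetical facts, invented to show the labels; not taken from the corpus.
+facts:
+  - label: actor-visible
+    text: the cart has not arrived at the warehouse
+  - label: reported
+    text: a carter says the road is blocked
+  - label: inferred
+    text: the delivery will probably be at least a day late
+  - label: hidden_true_state
+    text: the cart broke an axle and will arrive in three days
+  - label: designer_analysis
+    text: this chunk tests whether the model treats the report as unverified
+```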
+
+---
+
+## 12. Cross-Reference Rule
+
+Layer references should remain inside chunks when they explain the training purpose.
+
+However, a chunk should not rely entirely on cross-references.
+
+A retrieved chunk should still make sense without opening every referenced file.
+
+Cross-references are support, not replacement.
+
+---
+
+## 13. Naming Rule
+
+Chunk identifiers should be deterministic.
+
+Recommended format:
+
+```text
+<document_id>::<section_number>::<chunk_role>
+```
+
+Examples (the role segment may carry a short qualifier, such as an actor name or scene label):
+
+```text
+CORPUS-0005::04::correct_behavior
+CORPUS-0011::06::actor_reading_secundus
+DIALOGUE-0002::03::scene_beat_cart_delay
+```
+
+For repeated roles within one section, add a letter suffix to the section number and a qualifier to the role:
+
+```text
+CORPUS-0008::04a::variant_true
+CORPUS-0008::04b::variant_partial
+CORPUS-0008::04c::variant_false
+```
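+
+Read segment by segment, the second identifier above breaks down as sketched here. The `chunk_id` and `section_number` key names are assumptions. Only the identifier string itself comes from this standard.
+
+```yaml
+# Illustrative breakdown; key names other than document_id and chunk_role are assumed.
+chunk_id: CORPUS-0008::04b::variant_partial
+document_id: CORPUS-0008
+section_number: 04b          # section 04, second chunk with a repeated role
+chunk_role: variant_partial  # role "variant" with the qualifier "partial"
+```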
+
+---
+
+## 14. Minimum Chunk Quality Test
+
+Before accepting a chunk, ask:
+
+1. Does it say what file and layer it came from?
+2. Does it preserve the concept being taught?
+3. Does it include enough facts to understand the example?
+4. Does it keep arithmetic with its inputs?
+5. Does it distinguish actor-visible, reported, inferred, hidden, and settled facts from one another and from designer analysis?
+6. Does it preserve actor identity when actor perspective matters?
+7. Does it avoid isolated banter?
+8. Does it include the model behavior being trained or corrected?
+
+If the answer to any question that applies to the chunk is no, adjust the chunk boundary.
+
+---
+
+## 15. Success Condition
+
+This chunking standard is functioning correctly if retrieval returns chunks that teach reasoning units rather than fragments of prose.
+
+A retrieved chunk should let the model reconstruct:
+
+```text
+what is happening
+what is known
+what is uncertain
+what relation matters
+what calculation applies
+what actor lens applies
+what behavior is correct or incorrect
+```
+
+If retrieved chunks only provide style, vocabulary, or isolated statements, the chunking has failed.