initial upload

2026-04-30 12:00:35 -04:00
parent bac7e0c8ba
commit 38a7048c0b

# CHUNKING-STANDARD-0001
## Training Corpus Chunking Standard
### Status: Draft Standard
### Layer: Training Infrastructure
### Purpose: Define how OTIVM training documents should be chunked for retrieval, review, and future model preparation
### Repository Path: docs/training/chunking/CHUNKING-STANDARD-0001.md
---
## 0. Purpose
This document defines chunking rules for the OTIVM training corpus.
The training corpus is layered. Each layer teaches a different kind of reasoning. Chunking must preserve that reasoning.
The goal is not to split files into equal text lengths.
The goal is to preserve usable training units.
A good chunk should allow the model to answer:
- what concept is being taught?
- what facts are available?
- what uncertainty remains?
- what arithmetic or relation is being demonstrated?
- which actor perspective is active?
- what behavior should the model learn or avoid?
---
## 1. General Rule
Chunk by meaning, not by size.
A chunk should be self-contained enough to be understood when retrieved on its own, without the rest of the file.
Each chunk should preserve:
- file identity
- layer
- topic
- local section heading
- relevant example facts
- any calculation needed to understand the point
- correct and incorrect model behavior where applicable
Avoid chunks that contain only:
- isolated dialogue lines
- arithmetic without scenario context
- conclusions without evidence
- actor interpretation without shared facts
- principles without example or test
---
## 2. Preferred Chunk Size
Preferred chunk size:
```text
300 to 900 words
```
Acceptable range:
```text
150 to 1200 words
```
Use shorter chunks when the section is atomic.
Use longer chunks when splitting would separate a calculation from its explanation or a dialogue exchange from its demonstrated concept.
Do not split:
- a calculation from the numbers it uses
- a rumor from the source and confidence problem
- an actor reading from the actor name and shared scenario
- a dialogue beat from the reason it matters
- a success condition from the concept it tests
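The size guidance above can be checked mechanically. The following is a minimal sketch, assuming that words are counted by splitting on whitespace. The function name `classify_chunk_size` and the returned labels are illustrative and are not defined by this standard.

```python
PREFERRED_RANGE = (300, 900)    # preferred chunk size, in words
ACCEPTABLE_RANGE = (150, 1200)  # acceptable chunk size, in words


def classify_chunk_size(text: str) -> str:
    """Classify a chunk by its whitespace-delimited word count (an assumption)."""
    words = len(text.split())
    if PREFERRED_RANGE[0] <= words <= PREFERRED_RANGE[1]:
        return "preferred"
    if ACCEPTABLE_RANGE[0] <= words <= ACCEPTABLE_RANGE[1]:
        return "acceptable"
    return "out_of_range"
```

An `out_of_range` result is best treated as a prompt for review rather than an automatic rejection, because the do-not-split rules above take priority over size.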
---
## 3. Required Chunk Metadata
Each chunk should carry metadata equivalent to:
```yaml
source_file: <filename>
repository_path: <repo path>
layer: <Layer_0--Primitive_Facts | Layer_1--Worked_Examples | Layer_2--Uncertainty | Layer_3--Actor_Perspective | Layer_4--Dialogues>
document_id: <CORPUS-XXXX or DIALOGUE-XXXX>
document_title: <title>
section_heading: <nearest heading>
chunk_role: <principle | example | calculation | variant | actor_reading | dialogue_beat | success_condition | reference>
concept_tags:
- <tag>
```
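The corpus files already include most of this information in prose form. A chunking process should preserve or derive it.
The `chunk_role` values above are the general set. The layer sections below list additional, more specific roles.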
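One way to carry this metadata in a chunking tool is a small record type. The sketch below is an assumption about how a tool might represent it: the class name `ChunkMetadata` and the helper `missing_fields` are hypothetical, while the field names mirror the template above.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# General roles taken from the metadata template above.
GENERAL_ROLES = {
    "principle", "example", "calculation", "variant",
    "actor_reading", "dialogue_beat", "success_condition", "reference",
}


@dataclass
class ChunkMetadata:
    source_file: str
    repository_path: str
    layer: str
    document_id: str
    document_title: str
    section_heading: str
    chunk_role: str
    concept_tags: list[str] = field(default_factory=list)

    def missing_fields(self) -> list[str]:
        """Return the names of fields that are empty, in declaration order."""
        return [name for name, value in vars(self).items() if not value]
```

A chunk whose `missing_fields()` result is non-empty would need its metadata derived from the source file before it is accepted.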
---
## 4. Concept Tags
Use short, stable concept tags.
Examples:
```yaml
concept_tags:
- local_price
- total_cost
- profit_arithmetic
- delay_cost
- rumor_uncertainty
- hidden_true_state
- source_motive
- actor_perspective
- credit_trust
- non_coin_settlement
- warehouse_right
- transport_capacity
- rivalry
- hard_stop
```
A chunk may have multiple tags.
Do not over-tag. Prefer 3 to 7 tags per chunk.
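The tag guidance can be expressed as a small lint step. This is a sketch only: the snake_case pattern is an assumption inferred from the example tags above, and the function name `check_concept_tags` is hypothetical.

```python
from __future__ import annotations

import re

# Assumed convention: lowercase words joined by underscores, as in the examples.
TAG_PATTERN = re.compile(r"^[a-z0-9]+(_[a-z0-9]+)*$")
MIN_TAGS, MAX_TAGS = 3, 7  # preferred tag count per chunk


def check_concept_tags(tags: list[str]) -> list[str]:
    """Return human-readable warnings; an empty list means no issues found."""
    warnings = []
    for tag in tags:
        if not TAG_PATTERN.match(tag):
            warnings.append(f"tag not snake_case: {tag!r}")
    if len(set(tags)) != len(tags):
        warnings.append("duplicate tags")
    if not MIN_TAGS <= len(tags) <= MAX_TAGS:
        warnings.append(
            f"tag count {len(tags)} outside preferred range {MIN_TAGS}-{MAX_TAGS}"
        )
    return warnings
```

Because 3 to 7 is a preference rather than a hard limit, the sketch returns warnings instead of raising an error.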
---
## 5. Layer 0 Chunking Rules
Layer 0 contains primitive facts.
Chunk by conceptual section.
Preferred chunks:
1. header + principle
2. Roman-visible example
3. minimal structure
4. incorrect modern assumption + correction
5. simulation use + canonical test
6. success condition, if substantial
A Layer 0 chunk should teach one primitive only.
Do not combine separate files into one chunk.
Do not split the principle from the title.
### Example Chunk Roles
```yaml
chunk_role: principle
chunk_role: roman_visible_example
chunk_role: incorrect_assumption
chunk_role: simulation_use
chunk_role: success_condition
```
---
## 6. Layer 1 Chunking Rules
Layer 1 contains worked examples.
Chunk by reasoning unit.
Preferred chunks:
1. scenario + known facts
2. first incorrect calculation
3. total cost or profit calculation
4. variant A / B / C, grouped if short
5. correct model behavior
6. incorrect model behavior
7. layer references + success condition
A calculation chunk must include:
- the scenario values
- the formula or arithmetic
- the interpretation of the result
Do not split:
```text
sale value - total cost = result
```
from the values used to produce it.
### Example Chunk Roles
```yaml
chunk_role: scenario
chunk_role: calculation
chunk_role: risk_variant
chunk_role: correct_behavior
chunk_role: incorrect_behavior
```
---
## 7. Layer 2 Chunking Rules
Layer 2 contains uncertainty.
Chunk by evidence and uncertainty structure.
Preferred chunks:
1. scenario + report or signal
2. known facts + unknowns
3. possible truth states or interpretations
4. decision options
5. correct model behavior
6. incorrect model behavior
7. success condition
A Layer 2 chunk should preserve the distinction between:
```text
reported_state
known_state
hidden_true_state
actor_confidence
final_resolution
```
Do not split an uncertainty example so that the report is separated from its age, source, motive, or confidence problem.
### Example Chunk Roles
```yaml
chunk_role: report
chunk_role: evidence_structure
chunk_role: truth_variants
chunk_role: decision_options
chunk_role: uncertainty_behavior
```
---
## 8. Layer 3 Chunking Rules
Layer 3 contains actor perspective.
Chunk by actor section, plus shared setup.
Preferred chunks:
1. shared scenario facts
2. actor reading: Varro
3. actor reading: Felix
4. actor reading: Lentulus
5. actor reading: Crispus
6. actor reading: Secundus
7. actor reading: Chresimus
8. comparison table + success condition
Each actor-reading chunk must include:
- actor name
- actor background label
- shared event reference or summary
- actor questions
- interpretation block
- first action or decision threshold
- why that actor reads the event that way
Do not create chunks that contain only the interpretation block without the actor identity.
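The required contents of an actor-reading chunk can be checked with a simple completeness test. The sketch below assumes the chunk has already been parsed into a dictionary; the key names are hypothetical labels for the seven items listed above and are not defined by this standard.

```python
from __future__ import annotations

# Hypothetical keys, one per required item in the list above.
REQUIRED_ACTOR_READING_FIELDS = (
    "actor_name",
    "actor_background",
    "shared_event",
    "actor_questions",
    "interpretation",
    "first_action_or_threshold",
    "rationale",
)


def missing_actor_reading_fields(chunk: dict) -> list[str]:
    """Return the required keys that are absent or empty in the chunk."""
    return [key for key in REQUIRED_ACTOR_READING_FIELDS if not chunk.get(key)]
```

A chunk that contains only `interpretation` would report the other six keys as missing, which is the failure case the rule above describes.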
### Example Chunk Roles
```yaml
chunk_role: shared_facts
chunk_role: actor_reading
chunk_role: comparison
chunk_role: success_condition
```
---
## 9. Layer 4 Dialogue Chunking Rules
Layer 4 contains dialogue.
Dialogue must be chunked by scene beat, not by arbitrary length.
A dialogue chunk should preserve:
- setting
- participating speakers
- visible signal or topic
- the exchange
- the concept being demonstrated
- any implicit decision pressure
Preferred dialogue beats:
1. scene opening and visible signal
2. first actor interpretation
3. second actor challenge or correction
4. conflict between readings
5. arithmetic or practical consequence
6. decision point
7. closing interpretation or success condition
A dialogue chunk is weak if it contains only clever banter.
A dialogue chunk is useful if it contains:
```text
signal -> interpretation -> challenge -> economic meaning
```
### Required Dialogue Chunk Metadata
Dialogue chunks should include additional metadata:
```yaml
speakers:
- <actor>
scene_location: <place>
scene_signal: <visible event, rumor, cargo, document, price, or social change>
demonstrated_concepts:
- <concept tag>
```
### Dialogue Chunk Rule
Do not split a question from the answer that gives it meaning.
Do not split a false claim from the correction that makes it useful.
Do not split a joke or quip from the economic point it reveals.
---
## 10. Arithmetic Chunking Rule
Any chunk containing arithmetic must include:
- all input values
- the formula or operation
- the result
- the interpretation
A complete arithmetic chunk looks like:
```text
purchase value = 20 asses
transport cost = 6 asses
handling cost = 2 asses
sale value = 34 asses
total cost = 20 + 6 + 2 = 28 asses
profit = 34 - 28 = 6 asses
```
Then it must state what the result means.
Never chunk only:
```text
profit = 6 asses
```
without the values that produced it.
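As an illustration, the complete arithmetic form can be generated from its inputs so that the result never travels without them. This is a minimal sketch: the function name `render_arithmetic_chunk` and the wording of the interpretation line are assumptions, and the cost categories are the three used in the example above.

```python
def render_arithmetic_chunk(purchase: int, transport: int, handling: int, sale: int) -> str:
    """Render inputs, operations, results, and interpretation as one text block."""
    total_cost = purchase + transport + handling
    profit = sale - total_cost
    if profit > 0:
        meaning = f"the sale covers all costs and leaves {profit} asses"
    elif profit < 0:
        meaning = f"the sale falls short of total cost by {-profit} asses"
    else:
        meaning = "the sale exactly covers total cost"
    lines = [
        f"purchase value = {purchase} asses",
        f"transport cost = {transport} asses",
        f"handling cost = {handling} asses",
        f"sale value = {sale} asses",
        f"total cost = {purchase} + {transport} + {handling} = {total_cost} asses",
        f"profit = {sale} - {total_cost} = {profit} asses",
        f"interpretation: {meaning}",
    ]
    return "\n".join(lines)


print(render_arithmetic_chunk(20, 6, 2, 34))
```

With the values from the example above, the output reproduces the six lines of the complete arithmetic chunk and adds an interpretation line.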
---
## 11. Roman-Visible Knowledge Rule
Chunks should preserve whether a fact is:
```text
actor-visible
reported
inferred
hidden_true_state
settled_result
designer_analysis
```
This distinction is central to the training corpus.
If a chunk includes hidden truth, label it clearly.
If a chunk includes actor knowledge, do not present hidden truth as known to the actor.
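The labels above can be modeled as an enumeration so that every fact in a chunk carries exactly one status. The sketch below is an assumption about representation: the names `KnowledgeStatus` and `actor_view` are hypothetical, and the filter removes only `hidden_true_state`, since that is the one case this section rules out explicitly.

```python
from __future__ import annotations

from enum import Enum


class KnowledgeStatus(Enum):
    ACTOR_VISIBLE = "actor-visible"
    REPORTED = "reported"
    INFERRED = "inferred"
    HIDDEN_TRUE_STATE = "hidden_true_state"
    SETTLED_RESULT = "settled_result"
    DESIGNER_ANALYSIS = "designer_analysis"


Fact = tuple[str, KnowledgeStatus]


def actor_view(facts: list[Fact]) -> list[Fact]:
    """Drop hidden-truth facts so they are never presented as actor knowledge."""
    return [
        (text, status)
        for text, status in facts
        if status is not KnowledgeStatus.HIDDEN_TRUE_STATE
    ]
```

Whether `settled_result` or `designer_analysis` should also be withheld from an actor's view is not stated here, so the sketch leaves them in place.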
---
## 12. Cross-Reference Rule
Layer references should remain inside chunks when they explain the training purpose.
However, a chunk should not rely entirely on cross-references.
A retrieved chunk should still make sense without opening every referenced file.
Cross-references are support, not replacement.
---
## 13. Naming Rule
Chunk identifiers should be deterministic.
Recommended format:
```text
<document_id>::<section_number>::<chunk_role>
```
Examples (the role may carry a short qualifier, such as an actor name or scene beat):
```text
CORPUS-0005::04::correct_behavior
CORPUS-0011::06::actor_reading_secundus
DIALOGUE-0002::03::scene_beat_cart_delay
```
For repeated roles within one section, add a letter suffix to the section number:
```text
CORPUS-0008::04a::variant_true
CORPUS-0008::04b::variant_partial
CORPUS-0008::04c::variant_false
```
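A deterministic identifier can be produced by a pure function of its parts. The following sketch assumes two-digit zero padding for the section number, inferred from the examples above; the function name `build_chunk_id` and its `suffix` parameter are hypothetical.

```python
def build_chunk_id(document_id: str, section_number: int, chunk_role: str, suffix: str = "") -> str:
    """Build '<document_id>::<section_number>::<chunk_role>' with an optional letter suffix."""
    for part in (document_id, chunk_role, suffix):
        if "::" in part:
            raise ValueError(f"identifier part must not contain '::': {part!r}")
    # Two-digit zero padding is an assumption based on the examples in this section.
    return f"{document_id}::{section_number:02d}{suffix}::{chunk_role}"


print(build_chunk_id("CORPUS-0005", 4, "correct_behavior"))
print(build_chunk_id("CORPUS-0008", 4, "variant_true", suffix="a"))
```

The same inputs always yield the same identifier, so re-running the chunking process does not rename existing chunks.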
---
## 14. Minimum Chunk Quality Test
Before accepting a chunk, ask:
1. Does it say what file and layer it came from?
2. Does it preserve the concept being taught?
3. Does it include enough facts to understand the example?
4. Does it keep arithmetic with its inputs?
5. Does it distinguish known, reported, inferred, hidden, and settled facts?
6. Does it preserve actor identity when actor perspective matters?
7. Does it avoid isolated banter?
8. Does it include the model behavior being trained or corrected?
If the answer to any question that applies to the chunk is no, adjust the chunk boundary.
---
## 15. Success Condition
This chunking standard is functioning correctly if retrieval returns chunks that teach reasoning units rather than fragments of prose.
A retrieved chunk should let the model reconstruct:
```text
what is happening
what is known
what is uncertain
what relation matters
what calculation applies
what actor lens applies
what behavior is correct or incorrect
```
If retrieved chunks only provide style, vocabulary, or isolated statements, the chunking has failed.