# CHUNKING-STANDARD-0001
## Training Corpus Chunking Standard
### Status: Draft Standard
### Layer: Training Infrastructure
### Purpose: Define how OTIVM training documents should be chunked for retrieval, review, and future model preparation
### Repository Path: docs/training/chunking/CHUNKING-STANDARD-0001.md

---

## 0. Purpose

This document defines chunking rules for the OTIVM training corpus.

The training corpus is layered. Each layer teaches a different kind of reasoning. Chunking must preserve that reasoning.

The goal is not to split files into equal text lengths.

The goal is to preserve usable training units.

A good chunk should allow the model to answer:

- what concept is being taught?
- what facts are available?
- what uncertainty remains?
- what arithmetic or relation is being demonstrated?
- which actor perspective is active?
- what behavior should the model learn or avoid?

---

## 1. General Rule

Chunk by meaning, not by size.

A chunk should be self-contained enough to be retrieved without requiring the entire file.

Each chunk should preserve:

- file identity
- layer
- topic
- local section heading
- relevant example facts
- any calculation needed to understand the point
- correct and incorrect model behavior where applicable

Avoid chunks that contain only:

- isolated dialogue lines
- arithmetic without scenario context
- conclusions without evidence
- actor interpretation without shared facts
- principles without example or test

---

## 2. Preferred Chunk Size

Preferred chunk size:

```text
300 to 900 words
```

Acceptable range:

```text
150 to 1200 words
```

Use shorter chunks when the section is atomic.

Use longer chunks when splitting would separate a calculation from its explanation or a dialogue exchange from its demonstrated concept.

Do not split:

- a calculation from the numbers it uses
- a rumor from the source and confidence problem
- an actor reading from the actor name and shared scenario
- a dialogue beat from the reason it matters
- a success condition from the concept it tests
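The two word ranges above can be checked mechanically. The following is a minimal sketch, not part of the standard: the function name, the return labels, and the whitespace-based word count are all assumptions. The result is advisory only, since chunk boundaries are set by meaning and a size check must never force a split that this section forbids.

```python
PREFERRED = (300, 900)    # preferred word range from this section
ACCEPTABLE = (150, 1200)  # acceptable word range from this section


def classify_chunk_size(text: str) -> str:
    """Return 'preferred', 'acceptable', or 'out_of_range' for a chunk.

    Words are counted by splitting on whitespace, which is an assumed
    counting method; the standard does not define one.
    """
    words = len(text.split())
    if PREFERRED[0] <= words <= PREFERRED[1]:
        return "preferred"
    if ACCEPTABLE[0] <= words <= ACCEPTABLE[1]:
        return "acceptable"
    return "out_of_range"
```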

---

## 3. Required Chunk Metadata

Each chunk should carry metadata equivalent to:

```yaml
source_file: <filename>
repository_path: <repo path>
layer: <Layer_0--Primitive_Facts | Layer_1--Worked_Examples | Layer_2--Uncertainty | Layer_3--Actor_Perspective | Layer_4--Dialogues>
document_id: <CORPUS-XXXX or DIALOGUE-XXXX>
document_title: <title>
section_heading: <nearest heading>
chunk_role: <principle | example | calculation | variant | actor_reading | dialogue_beat | success_condition | reference>
concept_tags:
- <tag>
```

The corpus files already include most of this information in prose form. A chunking process should preserve or derive it.
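One possible in-memory shape for this metadata is sketched below. The field names and layer values come from the YAML above; the class name `ChunkMetadata` and the two validation checks are assumptions added for illustration. `chunk_role` is deliberately left unvalidated here, because later sections introduce roles beyond the list shown above.

```python
from dataclasses import dataclass, field

# Layer names copied from the metadata template in this section.
LAYERS = {
    "Layer_0--Primitive_Facts",
    "Layer_1--Worked_Examples",
    "Layer_2--Uncertainty",
    "Layer_3--Actor_Perspective",
    "Layer_4--Dialogues",
}


@dataclass
class ChunkMetadata:
    """Hypothetical container for the required chunk metadata."""

    source_file: str
    repository_path: str
    layer: str
    document_id: str
    document_title: str
    section_heading: str
    chunk_role: str
    concept_tags: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject layers that are not named in the template.
        if self.layer not in LAYERS:
            raise ValueError(f"unknown layer: {self.layer!r}")
        # Assumed check: the template only shows CORPUS- and DIALOGUE- ids.
        if not self.document_id.startswith(("CORPUS-", "DIALOGUE-")):
            raise ValueError(f"unexpected document_id: {self.document_id!r}")
```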

---

## 4. Concept Tags

Use short, stable concept tags.

Examples:

```yaml
concept_tags:
- local_price
- total_cost
- profit_arithmetic
- delay_cost
- rumor_uncertainty
- hidden_true_state
- source_motive
- actor_perspective
- credit_trust
- non_coin_settlement
- warehouse_right
- transport_capacity
- rivalry
- hard_stop
```

A chunk may have multiple tags.

Do not over-tag. Prefer 3 to 7 tags per chunk.
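A tag list can be linted against these preferences. The sketch below is illustrative only: the function name is hypothetical, and the lowercase snake_case pattern is an assumption inferred from the example tags rather than a stated rule. It returns warnings instead of raising, because the 3 to 7 range is a preference, not a hard limit.

```python
import re

# Assumed tag shape, inferred from the example tags: lowercase snake_case.
TAG_PATTERN = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")


def check_concept_tags(tags: list[str]) -> list[str]:
    """Return a list of warnings; an empty list means the tags look fine."""
    warnings = []
    if not 3 <= len(tags) <= 7:
        warnings.append(f"{len(tags)} tags; prefer 3 to 7")
    if len(set(tags)) != len(tags):
        warnings.append("duplicate tags")
    for tag in tags:
        if not TAG_PATTERN.match(tag):
            warnings.append(f"tag not in snake_case: {tag!r}")
    return warnings
```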

---

## 5. Layer 0 Chunking Rules

Layer 0 contains primitive facts.

Chunk by conceptual section.

Preferred chunks:

1. header + principle
2. Roman-visible example
3. minimal structure
4. incorrect modern assumption + correction
5. simulation use + canonical test
6. success condition, if substantial

A Layer 0 chunk should teach one primitive only.

Do not combine separate files into one chunk.

Do not split the principle from the title.

### Example Chunk Roles

```yaml
chunk_role: principle
chunk_role: roman_visible_example
chunk_role: incorrect_assumption
chunk_role: simulation_use
chunk_role: success_condition
```

---

## 6. Layer 1 Chunking Rules

Layer 1 contains worked examples.

Chunk by reasoning unit.

Preferred chunks:

1. scenario + known facts
2. first incorrect calculation
3. total cost or profit calculation
4. variant A / B / C, grouped if short
5. correct model behavior
6. incorrect model behavior
7. layer references + success condition

A calculation chunk must include:

- the scenario values
- the formula or arithmetic
- the interpretation of the result

Do not split:

```text
sale value - total cost = result
```

from the values used to produce it.

### Example Chunk Roles

```yaml
chunk_role: scenario
chunk_role: calculation
chunk_role: risk_variant
chunk_role: correct_behavior
chunk_role: incorrect_behavior
```

---

## 7. Layer 2 Chunking Rules

Layer 2 contains uncertainty.

Chunk by evidence and uncertainty structure.

Preferred chunks:

1. scenario + report or signal
2. known facts + unknowns
3. possible truth states or interpretations
4. decision options
5. correct model behavior
6. incorrect model behavior
7. success condition

A Layer 2 chunk should preserve the distinction between:

```text
reported_state
known_state
hidden_true_state
actor_confidence
final_resolution
```

Do not split an uncertainty example so that the report is separated from its age, source, motive, or confidence problem.

### Example Chunk Roles

```yaml
chunk_role: report
chunk_role: evidence_structure
chunk_role: truth_variants
chunk_role: decision_options
chunk_role: uncertainty_behavior
```

---

## 8. Layer 3 Chunking Rules

Layer 3 contains actor perspective.

Chunk by actor section, plus shared setup.

Preferred chunks:

1. shared scenario facts
2. actor reading: Varro
3. actor reading: Felix
4. actor reading: Lentulus
5. actor reading: Crispus
6. actor reading: Secundus
7. actor reading: Chresimus
8. comparison table + success condition

Each actor-reading chunk must include:

- actor name
- actor background label
- shared event reference or summary
- actor questions
- interpretation block
- first action or decision threshold
- why that actor reads the event that way

Do not create chunks that contain only the interpretation block without the actor identity.

### Example Chunk Roles

```yaml
chunk_role: shared_facts
chunk_role: actor_reading
chunk_role: comparison
chunk_role: success_condition
```

---

## 9. Layer 4 Dialogue Chunking Rules

Layer 4 contains dialogue.

Dialogue must be chunked by scene beat, not by arbitrary length.

A dialogue chunk should preserve:

- setting
- participating speakers
- visible signal or topic
- the exchange
- the concept being demonstrated
- any implicit decision pressure

Preferred dialogue beats:

1. scene opening and visible signal
2. first actor interpretation
3. second actor challenge or correction
4. conflict between readings
5. arithmetic or practical consequence
6. decision point
7. closing interpretation or success condition

A dialogue chunk is weak if it contains only clever banter.

A dialogue chunk is useful if it contains:

```text
signal -> interpretation -> challenge -> economic meaning
```

### Required Dialogue Chunk Metadata

Dialogue chunks should include additional metadata:

```yaml
speakers:
- <actor>
scene_location: <place>
scene_signal: <visible event, rumor, cargo, document, price, or social change>
demonstrated_concepts:
- <concept tag>
```

### Dialogue Chunk Rule

Do not split a question from the answer that gives it meaning.

Do not split a false claim from the correction that makes it useful.

Do not split a joke or quip from the economic point it reveals.

---

## 10. Arithmetic Chunking Rule

Any chunk containing arithmetic must include:

- all input values
- the formula or operation
- the result
- the interpretation

A complete arithmetic chunk looks like:

```text
purchase value = 20 asses
transport cost = 6 asses
handling cost = 2 asses
sale value = 34 asses

total cost = 20 + 6 + 2 = 28 asses
profit = 34 - 28 = 6 asses
```

Then it must state what the result means.

Never chunk only:

```text
profit = 6 asses
```

without the values that produced it.
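The four required parts can be kept together by generating them from one place. The sketch below reproduces the worked example above; the function name and the wording of the interpretation line are assumptions, since the standard does not prescribe how the meaning of a result is phrased.

```python
def arithmetic_chunk(purchase: int, transport: int, handling: int, sale: int) -> str:
    """Render inputs, operations, results, and an interpretation as one text block."""
    total_cost = purchase + transport + handling
    profit = sale - total_cost

    # Assumed phrasing for the interpretation; only the sign of the result is used.
    if profit > 0:
        meaning = "the sale covers every cost and leaves a gain"
    elif profit == 0:
        meaning = "the sale only recovers the costs"
    else:
        meaning = "the sale does not recover the costs"

    return "\n".join([
        f"purchase value = {purchase} asses",
        f"transport cost = {transport} asses",
        f"handling cost = {handling} asses",
        f"sale value = {sale} asses",
        "",
        f"total cost = {purchase} + {transport} + {handling} = {total_cost} asses",
        f"profit = {sale} - {total_cost} = {profit} asses",
        "",
        f"interpretation: {meaning}",
    ])


print(arithmetic_chunk(purchase=20, transport=6, handling=2, sale=34))
```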

---

## 11. Roman-Visible Knowledge Rule

Chunks should preserve whether a fact is:

```text
actor-visible
reported
inferred
hidden_true_state
settled_result
designer_analysis
```

This distinction is central to the training corpus.

If a chunk includes hidden truth, label it clearly.

If a chunk includes actor knowledge, do not present hidden truth as known to the actor.
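One way to keep these labels attached to facts is sketched below. The label values are taken from the list above. The `Fact` class, the `actor_view` helper, and in particular the choice of which labels count as knowable to an actor at decision time are assumptions made for illustration; the example facts in any usage would likewise be invented.

```python
from dataclasses import dataclass
from enum import Enum


class Visibility(Enum):
    ACTOR_VISIBLE = "actor-visible"
    REPORTED = "reported"
    INFERRED = "inferred"
    HIDDEN_TRUE_STATE = "hidden_true_state"
    SETTLED_RESULT = "settled_result"
    DESIGNER_ANALYSIS = "designer_analysis"


@dataclass(frozen=True)
class Fact:
    """Hypothetical pairing of a statement with its visibility label."""

    text: str
    visibility: Visibility


# Assumption: an actor deciding in the moment holds only what is seen,
# reported, or inferred. Hidden truth, later settlement, and designer
# commentary are excluded from the actor's view.
ACTOR_KNOWABLE = {
    Visibility.ACTOR_VISIBLE,
    Visibility.REPORTED,
    Visibility.INFERRED,
}


def actor_view(facts: list[Fact]) -> list[Fact]:
    """Return only the facts an actor could hold, preserving order."""
    return [fact for fact in facts if fact.visibility in ACTOR_KNOWABLE]
```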

---

## 12. Cross-Reference Rule

Layer references should remain inside chunks when they explain the training purpose.

However, a chunk should not rely entirely on cross-references.

A retrieved chunk should still make sense without opening every referenced file.

Cross-references are support, not replacement.

---

## 13. Naming Rule

Chunk identifiers should be deterministic.

Recommended format:

```text
<document_id>::<section_number>::<chunk_role>
```

Examples:

```text
CORPUS-0005::04::correct_behavior
CORPUS-0011::06::actor_reading_secundus
DIALOGUE-0002::03::scene_beat_cart_delay
```

For repeated roles:

```text
CORPUS-0008::04a::variant_true
CORPUS-0008::04b::variant_partial
CORPUS-0008::04c::variant_false
```
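The recommended format can be produced by a small pure function, which keeps identifiers deterministic: the same inputs always give the same string. This is a sketch; the function names and the `suffix` parameter for repeated roles are assumptions, and the two-digit zero padding is inferred from the examples above.

```python
def make_chunk_id(document_id: str, section_number: int, chunk_role: str, suffix: str = "") -> str:
    """Build '<document_id>::<section_number>::<chunk_role>'.

    The section number is zero-padded to two digits, as in the examples.
    An optional suffix such as 'a' or 'b' separates repeated roles.
    """
    return f"{document_id}::{section_number:02d}{suffix}::{chunk_role}"


def parse_chunk_id(chunk_id: str) -> tuple[str, str, str]:
    """Split an identifier back into its three parts."""
    document_id, section, chunk_role = chunk_id.split("::")
    return document_id, section, chunk_role
```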

---

## 14. Minimum Chunk Quality Test

Before accepting a chunk, ask:

1. Does it say what file and layer it came from?
2. Does it preserve the concept being taught?
3. Does it include enough facts to understand the example?
4. Does it keep arithmetic with its inputs?
5. Does it distinguish known, reported, inferred, hidden, and settled facts?
6. Does it preserve actor identity when actor perspective matters?
7. Does it avoid isolated banter?
8. Does it include the model behavior being trained or corrected?

If the answer to any critical question is no, adjust the chunk boundary.
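The checklist can be recorded as reviewer answers and reduced to a list of failures. The sketch below is an assumption-heavy illustration: the key names are hypothetical shorthand for the eight questions, unanswered questions are treated as "no", and because this standard does not say which questions are critical, all eight are treated as critical unless the caller passes a narrower set.

```python
# Hypothetical short keys for the eight questions, in order.
QUESTIONS = (
    "source_file_and_layer",
    "concept_preserved",
    "enough_facts",
    "arithmetic_with_inputs",
    "fact_status_distinguished",
    "actor_identity_preserved",
    "no_isolated_banter",
    "model_behavior_included",
)


def failed_critical_checks(answers: dict[str, bool], critical: tuple[str, ...] = QUESTIONS) -> list[str]:
    """Return the critical questions answered 'no' or left unanswered.

    A non-empty result means the chunk boundary should be adjusted.
    """
    return [question for question in critical if not answers.get(question, False)]
```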

---

## 15. Success Condition

This chunking standard is functioning correctly if retrieval returns chunks that teach reasoning units rather than fragments of prose.

A retrieved chunk should let the model reconstruct:

```text
what is happening
what is known
what is uncertain
what relation matters
what calculation applies
what actor lens applies
what behavior is correct or incorrect
```

If retrieved chunks only provide style, vocabulary, or isolated statements, the chunking has failed.