otivm/docs/training/chunking/CHUNKING-STANDARD-0001.md
2026-04-30 12:00:35 -04:00

CHUNKING-STANDARD-0001

Training Corpus Chunking Standard

Status: Draft Standard

Layer: Training Infrastructure

Purpose: Define how OTIVM training documents should be chunked for retrieval, review, and future model preparation

Repository Path: docs/training/chunking/CHUNKING-STANDARD-0001.md


0. Purpose

This document defines chunking rules for the OTIVM training corpus.

The training corpus is layered. Each layer teaches a different kind of reasoning. Chunking must preserve that reasoning.

The goal is not to split files into equal text lengths.

The goal is to preserve usable training units.

A good chunk should allow the model to answer:

  • what concept is being taught?
  • what facts are available?
  • what uncertainty remains?
  • what arithmetic or relation is being demonstrated?
  • which actor perspective is active?
  • what behavior should the model learn or avoid?

1. General Rule

Chunk by meaning, not by size.

A chunk should be self-contained enough to be understood when retrieved on its own, without the rest of the file.

Each chunk should preserve:

  • file identity
  • layer
  • topic
  • local section heading
  • relevant example facts
  • any calculation needed to understand the point
  • correct and incorrect model behavior where applicable

Avoid chunks that contain only:

  • isolated dialogue lines
  • arithmetic without scenario context
  • conclusions without evidence
  • actor interpretation without shared facts
  • principles without example or test

2. Preferred Chunk Size

Preferred chunk size:

300 to 900 words

Acceptable range:

150 to 1200 words

Use shorter chunks when the section is atomic.

Use longer chunks when splitting would separate a calculation from its explanation or a dialogue exchange from its demonstrated concept.

Do not split:

  • a calculation from the numbers it uses
  • a rumor from the source and confidence problem
  • an actor reading from the actor name and shared scenario
  • a dialogue beat from the reason it matters
  • a success condition from the concept it tests
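The size bands above can be checked mechanically. The following is a minimal sketch, assuming words are counted by splitting on whitespace; the function name `size_band` and the label `"review"` are illustrative and not part of the standard.

```python
# Word-count bands taken from this section; names and labels are illustrative.
PREFERRED = (300, 900)
ACCEPTABLE = (150, 1200)


def size_band(text: str) -> str:
    """Classify a chunk by its whitespace-delimited word count."""
    n = len(text.split())
    if PREFERRED[0] <= n <= PREFERRED[1]:
        return "preferred"
    if ACCEPTABLE[0] <= n <= ACCEPTABLE[1]:
        return "acceptable"
    # Outside both bands: flag for a human look rather than reject outright.
    return "review"


print(size_band("word " * 500))   # preferred
print(size_band("word " * 1000))  # acceptable
print(size_band("word " * 50))    # review
```

A chunk outside both bands is flagged rather than rejected, because meaning takes priority over size and an atomic section may legitimately be short.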

3. Required Chunk Metadata

Each chunk should carry metadata equivalent to:

source_file: <filename>
repository_path: <repo path>
layer: <Layer_0--Primitive_Facts | Layer_1--Worked_Examples | Layer_2--Uncertainty | Layer_3--Actor_Perspective | Layer_4--Dialogues>
document_id: <CORPUS-XXXX or DIALOGUE-XXXX>
document_title: <title>
section_heading: <nearest heading>
chunk_role: <principle | example | calculation | variant | actor_reading | dialogue_beat | success_condition | reference>
concept_tags:
  - <tag>

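The corpus files already include most of this information in prose form. A chunking process should preserve or derive it. The chunk_role values above are a general set; the layer sections below list further layer-specific roles.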
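One way to carry this metadata is a small record type with a basic completeness check. This is a sketch only, assuming Python 3.9 or later: the class name `ChunkMetadata`, the `problems` method, and the example filename, path, title, and heading are hypothetical. Only the field names and layer names come from this standard.

```python
from dataclasses import dataclass, field, fields

# Layer names as listed in the metadata template above.
LAYERS = {
    "Layer_0--Primitive_Facts",
    "Layer_1--Worked_Examples",
    "Layer_2--Uncertainty",
    "Layer_3--Actor_Perspective",
    "Layer_4--Dialogues",
}


@dataclass
class ChunkMetadata:
    source_file: str
    repository_path: str
    layer: str
    document_id: str
    document_title: str
    section_heading: str
    chunk_role: str
    concept_tags: list[str] = field(default_factory=list)

    def problems(self) -> list[str]:
        """Return a list of human-readable issues; empty means complete."""
        issues = []
        for f in fields(self):
            value = getattr(self, f.name)
            if isinstance(value, str) and not value.strip():
                issues.append(f"empty field: {f.name}")
        if self.layer not in LAYERS:
            issues.append(f"unknown layer: {self.layer!r}")
        if not self.concept_tags:
            issues.append("no concept_tags")
        return issues


meta = ChunkMetadata(
    source_file="CORPUS-0005.md",                            # hypothetical filename
    repository_path="docs/training/corpus/CORPUS-0005.md",   # hypothetical path
    layer="Layer_1--Worked_Examples",
    document_id="CORPUS-0005",
    document_title="Placeholder title",                      # placeholder
    section_heading="Placeholder heading",                   # placeholder
    chunk_role="calculation",
    concept_tags=["total_cost", "profit_arithmetic", "local_price"],
)
print(meta.problems())  # []
```

The check deliberately does not restrict `chunk_role` to a fixed list, since the layer sections introduce additional roles.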


4. Concept Tags

Use short, stable concept tags.

Examples:

concept_tags:
  - local_price
  - total_cost
  - profit_arithmetic
  - delay_cost
  - rumor_uncertainty
  - hidden_true_state
  - source_motive
  - actor_perspective
  - credit_trust
  - non_coin_settlement
  - warehouse_right
  - transport_capacity
  - rivalry
  - hard_stop

A chunk may have multiple tags.

Do not over-tag. Prefer 3 to 7 tags per chunk.
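The tag guidance can be turned into a quick lint. The sketch below assumes tags are lowercase snake_case, which is inferred from the examples above rather than stated as a rule; the function name `check_tags` is illustrative.

```python
import re

# Assumed tag shape: lowercase words joined by single underscores.
TAG_PATTERN = re.compile(r"[a-z][a-z0-9]*(?:_[a-z0-9]+)*")


def check_tags(tags: list) -> list:
    """Return issues with a chunk's concept tags; empty means acceptable."""
    issues = []
    for tag in tags:
        if not TAG_PATTERN.fullmatch(tag):
            issues.append(f"not snake_case: {tag!r}")
    if len(set(tags)) != len(tags):
        issues.append("duplicate tags")
    if not 3 <= len(tags) <= 7:
        issues.append(f"{len(tags)} tags; 3 to 7 preferred")
    return issues


print(check_tags(["local_price", "total_cost", "profit_arithmetic"]))  # []
print(check_tags(["Local Price"]))
```

The tag count is reported as a preference rather than an error, matching the wording "prefer 3 to 7".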


5. Layer 0 Chunking Rules

Layer 0 contains primitive facts.

Chunk by conceptual section.

Preferred chunks:

  1. header + principle
  2. Roman-visible example
  3. minimal structure
  4. incorrect modern assumption + correction
  5. simulation use + canonical test
  6. success condition, if substantial

A Layer 0 chunk should teach one primitive only.

Do not combine separate files into one chunk.

Do not split the principle from the title.

Example Chunk Roles

chunk_role: principle
chunk_role: roman_visible_example
chunk_role: incorrect_assumption
chunk_role: simulation_use
chunk_role: success_condition

6. Layer 1 Chunking Rules

Layer 1 contains worked examples.

Chunk by reasoning unit.

Preferred chunks:

  1. scenario + known facts
  2. first incorrect calculation
  3. total cost or profit calculation
  4. variant A / B / C, grouped if short
  5. correct model behavior
  6. incorrect model behavior
  7. layer references + success condition

A calculation chunk must include:

  • the scenario values
  • the formula or arithmetic
  • the interpretation of the result

Do not split:

sale value - total cost = result

from the values used to produce it.

Example Chunk Roles

chunk_role: scenario
chunk_role: calculation
chunk_role: risk_variant
chunk_role: correct_behavior
chunk_role: incorrect_behavior

7. Layer 2 Chunking Rules

Layer 2 contains uncertainty.

Chunk by evidence and uncertainty structure.

Preferred chunks:

  1. scenario + report or signal
  2. known facts + unknowns
  3. possible truth states or interpretations
  4. decision options
  5. correct model behavior
  6. incorrect model behavior
  7. success condition

A Layer 2 chunk should preserve the distinction between:

reported_state
known_state
hidden_true_state
actor_confidence
final_resolution

Do not split an uncertainty example so that the report is separated from its age, source, motive, or confidence problem.

Example Chunk Roles

chunk_role: report
chunk_role: evidence_structure
chunk_role: truth_variants
chunk_role: decision_options
chunk_role: uncertainty_behavior

8. Layer 3 Chunking Rules

Layer 3 contains actor perspective.

Chunk by actor section, plus shared setup.

Preferred chunks:

  1. shared scenario facts
  2. actor reading: Varro
  3. actor reading: Felix
  4. actor reading: Lentulus
  5. actor reading: Crispus
  6. actor reading: Secundus
  7. actor reading: Chresimus
  8. comparison table + success condition

Each actor-reading chunk must include:

  • actor name
  • actor background label
  • shared event reference or summary
  • actor questions
  • interpretation block
  • first action or decision threshold
  • why that actor reads the event that way

Do not create chunks that contain only the interpretation block without the actor identity.
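A completeness check for actor-reading chunks might look like the sketch below. The dictionary keys are hypothetical names for the seven required elements listed above, and the example values are placeholders, not corpus content.

```python
# Hypothetical keys, one per required element of an actor-reading chunk.
ACTOR_READING_FIELDS = (
    "actor_name",
    "background_label",
    "shared_event",
    "actor_questions",
    "interpretation",
    "first_action",
    "rationale",
)


def missing_actor_fields(chunk: dict) -> list:
    """Return required actor-reading fields that are absent or empty."""
    return [name for name in ACTOR_READING_FIELDS if not chunk.get(name)]


# An interpretation block with a name but nothing else around it.
reading = {
    "actor_name": "Secundus",
    "interpretation": "placeholder interpretation text",
}
print(missing_actor_fields(reading))
```

A non-empty result signals that the chunk boundary should be widened until the missing elements are included.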

Example Chunk Roles

chunk_role: shared_facts
chunk_role: actor_reading
chunk_role: comparison
chunk_role: success_condition

9. Layer 4 Dialogue Chunking Rules

Layer 4 contains dialogue.

Dialogue must be chunked by scene beat, not by arbitrary length.

A dialogue chunk should preserve:

  • setting
  • participating speakers
  • visible signal or topic
  • the exchange
  • the concept being demonstrated
  • any implicit decision pressure

Preferred dialogue beats:

  1. scene opening and visible signal
  2. first actor interpretation
  3. second actor challenge or correction
  4. conflict between readings
  5. arithmetic or practical consequence
  6. decision point
  7. closing interpretation or success condition

A dialogue chunk is weak if it contains only clever banter.

A dialogue chunk is useful if it contains:

signal -> interpretation -> challenge -> economic meaning
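If dialogue lines have been annotated with the element each one carries, the chain above can be tested as an ordered subsequence. This is a sketch under that assumption: the annotation labels and the function name `follows_beat_chain` are hypothetical, and the standard does not prescribe any such annotation.

```python
# The four elements of a useful dialogue chunk, in order.
BEAT_CHAIN = ("signal", "interpretation", "challenge", "economic_meaning")


def follows_beat_chain(labels: list) -> bool:
    """True if the chain elements appear in order; other labels may be interleaved."""
    remaining = iter(labels)
    # Each membership test consumes the iterator up to the match,
    # so later steps can only match later labels.
    return all(step in remaining for step in BEAT_CHAIN)


print(follows_beat_chain(
    ["signal", "interpretation", "quip", "challenge", "economic_meaning"]
))  # True
print(follows_beat_chain(["quip", "quip"]))  # False
```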

Required Dialogue Chunk Metadata

Dialogue chunks should include additional metadata:

speakers:
  - <actor>
scene_location: <place>
scene_signal: <visible event, rumor, cargo, document, price, or social change>
demonstrated_concepts:
  - <concept tag>

Dialogue Chunk Rule

Do not split a question from the answer that gives it meaning.

Do not split a false claim from the correction that makes it useful.

Do not split a joke or quip from the economic point it reveals.


10. Arithmetic Chunking Rule

Any chunk containing arithmetic must include:

  • all input values
  • the formula or operation
  • the result
  • the interpretation

A complete arithmetic chunk looks like:

purchase value = 20 asses
transport cost = 6 asses
handling cost = 2 asses
sale value = 34 asses

total cost = 20 + 6 + 2 = 28 asses
profit = 34 - 28 = 6 asses

Then it must state what the result means.

Never chunk only:

profit = 6 asses

without the values that produced it.
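One way to keep a result attached to its inputs is to generate the whole block from the values in a single step. The sketch below reproduces the worked figures above; the function name `arithmetic_block` and the wording of the interpretation line are illustrative assumptions.

```python
def arithmetic_block(purchase: int, transport: int, handling: int,
                     sale: int, unit: str = "asses") -> str:
    """Render inputs, operations, result, and interpretation as one unit."""
    total_cost = purchase + transport + handling
    profit = sale - total_cost
    if profit > 0:
        meaning = f"the sale covers all costs with {profit} {unit} to spare"
    elif profit < 0:
        meaning = f"the sale falls {-profit} {unit} short of covering costs"
    else:
        meaning = "the sale exactly covers costs"
    lines = [
        f"purchase value = {purchase} {unit}",
        f"transport cost = {transport} {unit}",
        f"handling cost = {handling} {unit}",
        f"sale value = {sale} {unit}",
        "",
        f"total cost = {purchase} + {transport} + {handling} = {total_cost} {unit}",
        f"profit = {sale} - {total_cost} = {profit} {unit}",
        "",
        f"interpretation: {meaning}",
    ]
    return "\n".join(lines)


print(arithmetic_block(20, 6, 2, 34))
```

Because the text is built from the inputs, a chunk produced this way cannot contain "profit = 6 asses" without the figures behind it.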


11. Roman-Visible Knowledge Rule

Chunks should preserve whether a fact is:

actor-visible
reported
inferred
hidden_true_state
settled_result
designer_analysis

This distinction is central to the training corpus.

If a chunk includes hidden truth, label it clearly.

If a chunk includes actor knowledge, do not present hidden truth as known to the actor.
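The labels above map naturally onto an enumeration, which makes the second rule checkable. This is a minimal sketch: the `Fact` record, its `presented_as_actor_knowledge` flag, and the function name are hypothetical, and the example texts are placeholders rather than corpus content.

```python
from dataclasses import dataclass
from enum import Enum


class KnowledgeStatus(Enum):
    # Values follow the labels listed in this section.
    ACTOR_VISIBLE = "actor-visible"
    REPORTED = "reported"
    INFERRED = "inferred"
    HIDDEN_TRUE_STATE = "hidden_true_state"
    SETTLED_RESULT = "settled_result"
    DESIGNER_ANALYSIS = "designer_analysis"


@dataclass
class Fact:
    text: str
    status: KnowledgeStatus
    presented_as_actor_knowledge: bool = False  # hypothetical flag


def hidden_truth_leaks(facts: list) -> list:
    """Return facts that expose hidden truth as something the actor knows."""
    return [
        f for f in facts
        if f.status is KnowledgeStatus.HIDDEN_TRUE_STATE
        and f.presented_as_actor_knowledge
    ]


facts = [
    Fact("placeholder: what the report claims", KnowledgeStatus.REPORTED, True),
    Fact("placeholder: what is actually the case", KnowledgeStatus.HIDDEN_TRUE_STATE, True),
]
for leak in hidden_truth_leaks(facts):
    print("hidden truth shown as actor knowledge:", leak.text)
```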


12. Cross-Reference Rule

Layer references should remain inside chunks when they explain the training purpose.

However, a chunk should not rely entirely on cross-references.

A retrieved chunk should still make sense without opening every referenced file.

Cross-references are support, not replacement.


13. Naming Rule

Chunk identifiers should be deterministic.

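Recommended format:

<document_id>::<section_number>::<chunk_role>

The final segment may extend the chunk role with a short qualifier, such as an actor name or topic.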

Examples:

CORPUS-0005::04::correct_behavior
CORPUS-0011::06::actor_reading_secundus
DIALOGUE-0002::03::scene_beat_cart_delay

For repeated roles:

CORPUS-0008::04a::variant_true
CORPUS-0008::04b::variant_partial
CORPUS-0008::04c::variant_false
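Identifiers in this format can be built by a small pure function, so the same inputs always give the same identifier. The sketch below is one possible reading of the examples: the parameter names `qualifier` and `repeat`, and the assumption that document IDs always carry four digits, are not stated in the standard.

```python
import re

# Assumed shape of a document ID, based on the CORPUS-XXXX / DIALOGUE-XXXX template.
DOC_ID_PATTERN = re.compile(r"(CORPUS|DIALOGUE)-\d{4}")


def chunk_id(document_id: str, section: int, role: str,
             qualifier: str = "", repeat: str = "") -> str:
    """Build <document_id>::<section_number>::<chunk_role> deterministically.

    qualifier: optional suffix on the role, e.g. an actor name.
    repeat:    optional letter after the section number for repeated roles.
    """
    if not DOC_ID_PATTERN.fullmatch(document_id):
        raise ValueError(f"unexpected document_id: {document_id!r}")
    label = f"{role}_{qualifier}" if qualifier else role
    return f"{document_id}::{section:02d}{repeat}::{label}"


print(chunk_id("CORPUS-0005", 4, "correct_behavior"))
# CORPUS-0005::04::correct_behavior
print(chunk_id("CORPUS-0011", 6, "actor_reading", qualifier="secundus"))
# CORPUS-0011::06::actor_reading_secundus
print(chunk_id("CORPUS-0008", 4, "variant", qualifier="true", repeat="a"))
# CORPUS-0008::04a::variant_true
```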

14. Minimum Chunk Quality Test

Before accepting a chunk, ask:

  1. Does it say what file and layer it came from?
  2. Does it preserve the concept being taught?
  3. Does it include enough facts to understand the example?
  4. Does it keep arithmetic with its inputs?
  5. Does it distinguish known, reported, inferred, hidden, and settled facts?
  6. Does it preserve actor identity when actor perspective matters?
  7. Does it avoid isolated banter?
  8. Does it include the model behavior being trained or corrected?

If the answer to any applicable question is no, adjust the chunk boundary.
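The eight questions can be recorded as a checklist and evaluated per chunk. In this sketch the short question keys are hypothetical, and it assumes a reviewer answers True, False, or None, where None marks a question that does not apply (for example, the arithmetic question on a chunk with no arithmetic).

```python
# Hypothetical short keys, one per question in the list above, in order.
QUESTIONS = (
    "file_and_layer_stated",
    "concept_preserved",
    "enough_facts",
    "arithmetic_with_inputs",
    "knowledge_status_distinguished",
    "actor_identity_preserved",
    "no_isolated_banter",
    "model_behavior_included",
)


def failed_checks(answers: dict) -> list:
    """Return the questions answered False or left unanswered.

    True passes, None marks a question as not applicable, and a missing
    key counts as a failure so that nothing is skipped silently.
    """
    return [q for q in QUESTIONS if answers.get(q, False) is False]


answers = {q: True for q in QUESTIONS}
answers["arithmetic_with_inputs"] = None       # chunk has no arithmetic
answers["actor_identity_preserved"] = False    # actor name was cut off
print(failed_checks(answers))  # ['actor_identity_preserved']
```

Any question returned by the check points at the boundary that needs adjusting.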


15. Success Condition

This chunking standard is functioning correctly if retrieval returns chunks that teach reasoning units rather than fragments of prose.

A retrieved chunk should let the model reconstruct:

what is happening
what is known
what is uncertain
what relation matters
what calculation applies
what actor lens applies
what behavior is correct or incorrect

If retrieved chunks only provide style, vocabulary, or isolated statements, the chunking has failed.