TheRON/otivm

TheRON 38a7048c0b initial upload

2026-04-30 12:00:35 -04:00

9.8 KiB

CHUNKING-STANDARD-0001

Training Corpus Chunking Standard

Status: Draft Standard

Layer: Training Infrastructure

Purpose: Define how OTIVM training documents should be chunked for retrieval, review, and future model preparation

Repository Path: docs/training/chunking/CHUNKING-STANDARD-0001.md

0. Purpose

This document defines chunking rules for the OTIVM training corpus.

The training corpus is layered. Each layer teaches a different kind of reasoning. Chunking must preserve that reasoning.

The goal is not to split files into equal text lengths.

The goal is to preserve usable training units.

A good chunk should allow the model to answer:

what concept is being taught?
what facts are available?
what uncertainty remains?
what arithmetic or relation is being demonstrated?
which actor perspective is active?
what behavior should the model learn or avoid?

1. General Rule

Chunk by meaning, not by size.

A chunk should be self-contained enough to be retrieved without requiring the entire file.

Each chunk should preserve:

file identity
layer
topic
local section heading
relevant example facts
any calculation needed to understand the point
correct and incorrect model behavior where applicable

Avoid chunks that contain only:

isolated dialogue lines
arithmetic without scenario context
conclusions without evidence
actor interpretation without shared facts
principles without example or test

2. Preferred Chunk Size

Preferred chunk size:

300 to 900 words

Acceptable range:

150 to 1200 words

Use shorter chunks when the section is atomic.

Use longer chunks when splitting would separate a calculation from its explanation or a dialogue exchange from its demonstrated concept.

Do not split:

a calculation from the numbers it uses
a rumor from the source and confidence problem
an actor reading from the actor name and shared scenario
a dialogue beat from the reason it matters
a success condition from the concept it tests
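The size ranges above can serve as a check applied after boundaries are chosen by meaning. The following is a minimal sketch, assuming words are counted by splitting on whitespace. The function name and the three return labels are illustrative and not part of this standard.

```python
def classify_chunk_size(text: str) -> str:
    """Classify a chunk by word count against the ranges in section 2.

    Assumption: a "word" is any whitespace-separated token.
    """
    words = len(text.split())
    if 300 <= words <= 900:
        return "preferred"
    if 150 <= words <= 1200:
        return "acceptable"
    # Out of range is a prompt to review the boundary, not an automatic
    # reason to split: the "Do not split" list takes priority over size.
    return "out_of_range"
```

A chunk flagged `out_of_range` may still be correct if shortening it would separate a calculation from its inputs.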

3. Required Chunk Metadata

Each chunk should carry metadata equivalent to:

source_file: <filename>
repository_path: <repo path>
layer: <Layer_0--Primitive_Facts | Layer_1--Worked_Examples | Layer_2--Uncertainty | Layer_3--Actor_Perspective | Layer_4--Dialogues>
document_id: <CORPUS-XXXX or DIALOGUE-XXXX>
document_title: <title>
section_heading: <nearest heading>
chunk_role: <principle | example | calculation | variant | actor_reading | dialogue_beat | success_condition | reference>
concept_tags:
  - <tag>

The corpus files already include most of this information in prose form. A chunking process should preserve or derive it.
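One possible in-memory form of this metadata is sketched below. The field names follow the list above. The class name, the helper methods, and the sample values are assumptions made for illustration only.

```python
from dataclasses import dataclass, field

# Layer labels copied from the metadata template in this section.
LAYERS = (
    "Layer_0--Primitive_Facts",
    "Layer_1--Worked_Examples",
    "Layer_2--Uncertainty",
    "Layer_3--Actor_Perspective",
    "Layer_4--Dialogues",
)


@dataclass
class ChunkMetadata:
    source_file: str
    repository_path: str
    layer: str
    document_id: str
    document_title: str
    section_heading: str
    chunk_role: str
    concept_tags: list[str] = field(default_factory=list)

    def missing_fields(self) -> list[str]:
        """Return the names of fields that are empty."""
        return [name for name, value in vars(self).items() if not value]

    def has_known_layer(self) -> bool:
        """True if the layer label is one of the five listed above."""
        return self.layer in LAYERS


# Placeholder values, not taken from any real corpus file.
meta = ChunkMetadata(
    source_file="CORPUS-XXXX.md",
    repository_path="<repo path>",
    layer="Layer_1--Worked_Examples",
    document_id="CORPUS-XXXX",
    document_title="<title>",
    section_heading="<nearest heading>",
    chunk_role="calculation",
    concept_tags=["total_cost", "profit_arithmetic", "delay_cost"],
)
```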

4. Concept Tags

Use short, stable concept tags.

Examples:

concept_tags:
  - local_price
  - total_cost
  - profit_arithmetic
  - delay_cost
  - rumor_uncertainty
  - hidden_true_state
  - source_motive
  - actor_perspective
  - credit_trust
  - non_coin_settlement
  - warehouse_right
  - transport_capacity
  - rivalry
  - hard_stop

A chunk may have multiple tags.

Do not over-tag. Prefer 3 to 7 tags per chunk.
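A tag list can be checked mechanically. This sketch assumes tags are lower snake_case, which is inferred from the examples above rather than stated as a rule. The 3 to 7 range is a preference, so the function returns warnings rather than raising errors.

```python
import re

# Assumption: tags look like the examples above (lower snake_case).
TAG_PATTERN = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")


def check_concept_tags(tags: list[str]) -> list[str]:
    """Return a list of warnings for a chunk's concept tags."""
    warnings = []
    if not 3 <= len(tags) <= 7:
        warnings.append(f"expected 3 to 7 tags, got {len(tags)}")
    for tag in tags:
        if not TAG_PATTERN.match(tag):
            warnings.append(f"tag is not lower snake_case: {tag!r}")
    if len(set(tags)) != len(tags):
        warnings.append("duplicate tags")
    return warnings
```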

5. Layer 0 Chunking Rules

Layer 0 contains primitive facts.

Chunk by conceptual section.

Preferred chunks:

header + principle
Roman-visible example
minimal structure
incorrect modern assumption + correction
simulation use + canonical test
success condition, if substantial

A Layer 0 chunk should teach one primitive only.

Do not combine separate files into one chunk.

Do not split the principle from the title.

Example Chunk Roles

chunk_role: principle
chunk_role: roman_visible_example
chunk_role: incorrect_assumption
chunk_role: simulation_use
chunk_role: success_condition

6. Layer 1 Chunking Rules

Layer 1 contains worked examples.

Chunk by reasoning unit.

Preferred chunks:

scenario + known facts
first incorrect calculation
total cost or profit calculation
variant A / B / C, grouped if short
correct model behavior
incorrect model behavior
layer references + success condition

A calculation chunk must include:

the scenario values
the formula or arithmetic
the interpretation of the result

Do not split:

sale value - total cost = result

from the values used to produce it.

Example Chunk Roles

chunk_role: scenario
chunk_role: calculation
chunk_role: risk_variant
chunk_role: correct_behavior
chunk_role: incorrect_behavior

7. Layer 2 Chunking Rules

Layer 2 contains uncertainty.

Chunk by evidence and uncertainty structure.

Preferred chunks:

scenario + report or signal
known facts + unknowns
possible truth states or interpretations
decision options
correct model behavior
incorrect model behavior
success condition

A Layer 2 chunk should preserve the distinction between:

reported_state
known_state
hidden_true_state
actor_confidence
final_resolution

Do not split an uncertainty example so that the report is separated from its age, source, motive, or confidence problem.

Example Chunk Roles

chunk_role: report
chunk_role: evidence_structure
chunk_role: truth_variants
chunk_role: decision_options
chunk_role: uncertainty_behavior

8. Layer 3 Chunking Rules

Layer 3 contains actor perspective.

Chunk by actor section, plus shared setup.

Preferred chunks:

shared scenario facts
actor reading: Varro
actor reading: Felix
actor reading: Lentulus
actor reading: Crispus
actor reading: Secundus
actor reading: Chresimus
comparison table + success condition

Each actor-reading chunk must include:

actor name
actor background label
shared event reference or summary
actor questions
interpretation block
first action or decision threshold
why that actor reads the event that way

Do not create chunks that contain only the interpretation block without the actor identity.
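The required parts of an actor-reading chunk can be expressed as a completeness check. The key names below are hypothetical labels for the seven items listed above. A real pipeline would need its own way of detecting each part in the source text.

```python
# Hypothetical keys, one per required item in the list above.
ACTOR_READING_PARTS = (
    "actor_name",
    "actor_background_label",
    "shared_event",
    "actor_questions",
    "interpretation",
    "first_action_or_threshold",
    "rationale",
)


def missing_actor_parts(chunk: dict) -> list[str]:
    """Return the required parts that are absent or empty in the chunk."""
    return [part for part in ACTOR_READING_PARTS if not chunk.get(part)]


# An interpretation block on its own fails the check.
orphan = {"interpretation": "<interpretation text>"}
```

Here `missing_actor_parts(orphan)` lists every part except `interpretation`, which is the case this section prohibits.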

Example Chunk Roles

chunk_role: shared_facts
chunk_role: actor_reading
chunk_role: comparison
chunk_role: success_condition

9. Layer 4 Dialogue Chunking Rules

Layer 4 contains dialogue.

Dialogue must be chunked by scene beat, not by arbitrary length.

A dialogue chunk should preserve:

setting
participating speakers
visible signal or topic
the exchange
the concept being demonstrated
any implicit decision pressure

Preferred dialogue beats:

scene opening and visible signal
first actor interpretation
second actor challenge or correction
conflict between readings
arithmetic or practical consequence
decision point
closing interpretation or success condition

A dialogue chunk is weak if it contains only clever banter.

A dialogue chunk is useful if it contains:

signal -> interpretation -> challenge -> economic meaning

Required Dialogue Chunk Metadata

Dialogue chunks should include additional metadata:

speakers:
  - <actor>
scene_location: <place>
scene_signal: <visible event, rumor, cargo, document, price, or social change>
demonstrated_concepts:
  - <concept tag>

Dialogue Chunk Rule

Do not split a question from the answer that gives it meaning.

Do not split a false claim from the correction that makes it useful.

Do not split a joke or quip from the economic point it reveals.

10. Arithmetic Chunking Rule

Any chunk containing arithmetic must include:

all input values
the formula or operation
the result
the interpretation

A complete arithmetic chunk looks like:

purchase value = 20 asses
transport cost = 6 asses
handling cost = 2 asses
sale value = 34 asses

total cost = 20 + 6 + 2 = 28 asses
profit = 34 - 28 = 6 asses

Then it must state what the result means.

Never chunk only:

profit = 6 asses

without the values that produced it.
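The example above can be reproduced with a small helper that always emits inputs, operations, and results together. This is a sketch only. The function name is an assumption, and the interpretation sentence still has to be written alongside the output.

```python
def render_profit_chunk(purchase: int, transport: int, handling: int, sale: int):
    """Return (text, profit) with every input and operation spelled out."""
    total_cost = purchase + transport + handling
    profit = sale - total_cost
    lines = [
        f"purchase value = {purchase} asses",
        f"transport cost = {transport} asses",
        f"handling cost = {handling} asses",
        f"sale value = {sale} asses",
        "",
        f"total cost = {purchase} + {transport} + {handling} = {total_cost} asses",
        f"profit = {sale} - {total_cost} = {profit} asses",
    ]
    return "\n".join(lines), profit


# Values from the example in this section.
text, profit = render_profit_chunk(20, 6, 2, 34)
print(text)
```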

11. Roman-Visible Knowledge Rule

Chunks should preserve whether a fact is:

actor-visible
reported
inferred
hidden_true_state
settled_result
designer_analysis

This distinction is central to the training corpus.

If a chunk includes hidden truth, label it clearly.

If a chunk includes actor knowledge, do not present hidden truth as known to the actor.
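The six labels above can be carried as an enumeration so that each fact in a chunk is tagged explicitly. In this sketch, the set of labels an actor may know at decision time is an assumption, and the sample facts are invented for illustration rather than taken from the corpus.

```python
from enum import Enum


class Visibility(Enum):
    ACTOR_VISIBLE = "actor-visible"
    REPORTED = "reported"
    INFERRED = "inferred"
    HIDDEN_TRUE_STATE = "hidden_true_state"
    SETTLED_RESULT = "settled_result"
    DESIGNER_ANALYSIS = "designer_analysis"


# Assumption: what an actor can draw on when deciding.
ACTOR_KNOWABLE = {
    Visibility.ACTOR_VISIBLE,
    Visibility.REPORTED,
    Visibility.INFERRED,
}


def actor_view(facts: list[tuple[str, Visibility]]) -> list[str]:
    """Keep only the facts an actor could know at decision time."""
    return [text for text, visibility in facts if visibility in ACTOR_KNOWABLE]


# Invented sample facts.
facts = [
    ("the cart has not arrived", Visibility.ACTOR_VISIBLE),
    ("a porter says the road is blocked", Visibility.REPORTED),
    ("the road is in fact open", Visibility.HIDDEN_TRUE_STATE),
]
```

Calling `actor_view(facts)` drops the hidden truth, so it cannot be presented as something the actor knows.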

12. Cross-Reference Rule

Layer references should remain inside chunks when they explain the training purpose.

However, a chunk should not rely entirely on cross-references.

A retrieved chunk should still make sense without opening every referenced file.

Cross-references are support, not replacement.

13. Naming Rule

Chunk identifiers should be deterministic.

Recommended format:

<document_id>::<section_number>::<chunk_role>

Examples:

CORPUS-0005::04::correct_behavior
CORPUS-0011::06::actor_reading_secundus
DIALOGUE-0002::03::scene_beat_cart_delay

For repeated roles:

CORPUS-0008::04a::variant_true
CORPUS-0008::04b::variant_partial
CORPUS-0008::04c::variant_false
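A deterministic identifier can be built from the three parts directly. This sketch assumes section numbers are zero-padded to two digits, as in the examples above, and that repeated roles take a single-letter suffix. The function name and the `suffix` parameter are illustrative.

```python
def chunk_id(document_id: str, section: int, role: str, suffix: str = "") -> str:
    """Build <document_id>::<section_number>::<chunk_role>.

    Assumption: two-digit zero padding, with an optional letter suffix
    (a, b, c, ...) for repeated roles within one section.
    """
    for part in (document_id, role, suffix):
        if "::" in part:
            raise ValueError(f"part must not contain '::': {part!r}")
    return f"{document_id}::{section:02d}{suffix}::{role}"


print(chunk_id("CORPUS-0005", 4, "correct_behavior"))
print(chunk_id("CORPUS-0008", 4, "variant_true", suffix="a"))
```

The same inputs always yield the same identifier, so re-chunking an unchanged file does not produce new names.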

14. Minimum Chunk Quality Test

Before accepting a chunk, ask:

Does it say what file and layer it came from?
Does it preserve the concept being taught?
Does it include enough facts to understand the example?
Does it keep arithmetic with its inputs?
Does it distinguish known, reported, inferred, hidden, and settled facts?
Does it preserve actor identity when actor perspective matters?
Does it avoid isolated banter?
Does it include the model behavior being trained or corrected?

If the answer to any critical question is no, adjust the chunk boundary.

15. Success Condition

This chunking standard is functioning correctly if retrieval returns chunks that teach reasoning units rather than fragments of prose.

A retrieved chunk should let the model reconstruct:

what is happening
what is known
what is uncertain
what relation matters
what calculation applies
what actor lens applies
what behavior is correct or incorrect

If retrieved chunks only provide style, vocabulary, or isolated statements, the chunking has failed.
