GENERATOR-MODEL-SELECTION-0001

Local Model Selection And Deployment For The OTIVM Vocabulary Generator

Status: Draft Standard

Layer: Training Infrastructure

Purpose: Select and deploy a small local model for generating Roman-visible vocabulary candidates

Repository Path: docs/training/chunking/GENERATOR-MODEL-SELECTION-0001.md


0. Purpose

This document defines a practical model-selection and deployment plan for the OTIVM Roman-visible expression generator.

The generator is not the CIVICUS-ROMAN model.

The generator is a tool used to produce candidate phrases.

Most generated phrases will be weak.

Only reviewed and accepted expressions become training material.

The generator is quarry equipment.

The reviewed vocabulary is the stone.


1. Hardware Constraint

Current local hardware target:

NVIDIA GPU with 6GB VRAM

This is enough for small quantized local models.

It is not the right target for full model training.

It is sufficient for:

candidate expression generation
small-batch phrase variation
actor-voice experiments
object/action/pressure recombination
quick local iteration
offline review workflows

It should not be used yet for:

full CIVICUS-ROMAN training
large-context corpus analysis
unsupervised corpus promotion
automatic canonical selection

2. Primary Recommendation

Start with:

Model: Qwen2.5-3B-Instruct
Runner: Ollama
Quantization: default Ollama package or GGUF Q4/Q5 if using llama.cpp

Reason:

small enough for 6GB VRAM
good instruction following
good short-form generation
available through Ollama
available in GGUF form
suitable for high-volume candidate generation

The generator task is not deep reasoning.

It is constrained phrase production.

A 3B instruct model is enough to begin.


3. Backup Models

Phi-3.5-mini-instruct

Use if Qwen2.5-3B produces too much decorative prose or follows instructions poorly.

Strengths:

terse output
structured generation
reasoning-dense behavior
good for compact candidate lists

Risk:

may produce more modern analytical phrasing unless prompts are strict

Gemma small instruct models

Use for comparison, especially if phrase tone from Qwen or Phi is poor.

Strengths:

small model family
local deployment support
useful for style comparison

Risk:

may require more prompt tuning for OTIVM-specific compression

Qwen2.5-Coder-3B

Use only for generator tooling scripts, not phrase generation.

Strengths:

code generation
JSONL tools
review UI helpers
validator scripts

Risk:

not the right primary voice generator

4. Deployment Path

Phase 1: Ollama

Use Ollama first because it minimizes deployment friction.

Install and run:

ollama pull qwen2.5:3b
ollama run qwen2.5:3b

Test with direct prompt batches.

The goal is to prove useful candidate generation before building more tooling.

Phase 2: Scripted Batch Generation

Use Python to send object/action/pressure combinations to the local Ollama endpoint, as sketched after the example records below.

Input:

{
  "object": "cart",
  "action": "hired_elsewhere",
  "pressure": "buyer_waiting",
  "actor_voice": "Secundus",
  "count": 20
}

Output:

{
  "expression_id": "expr_000001",
  "object": "cart",
  "action": "hired_elsewhere",
  "pressure": "buyer_waiting",
  "actor_voice": "Secundus",
  "candidate": "The wheels are gone, and the buyer will not wait for our excuses.",
  "status": "candidate"
}
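
A minimal sketch of this loop, using the requests library against Ollama's default local endpoint (http://localhost:11434/api/generate). The prompt text follows the pattern in section 5, and the one-candidate-per-line parsing is an assumption about how the model formats its output:

import json
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint

def build_prompt(spec):
    # Strict prompt pattern from section 5, filled from one input spec.
    return (
        "You generate Roman-visible commercial expressions for OTIVM.\n\n"
        "Rules:\n"
        "- Do not explain.\n"
        "- Do not use modern business language.\n"
        "- Use concrete objects, actions, and pressures.\n"
        "- Prefer terse lines.\n"
        "- Produce candidate lines only.\n\n"
        f"Object: {spec['object']}\n"
        f"Action: {spec['action']}\n"
        f"Pressure: {spec['pressure']}\n"
        f"Actor voice: {spec['actor_voice']}\n\n"
        f"Generate {spec.get('count', 20)} candidates.\n"
    )

def generate_batch(spec, model="qwen2.5:3b"):
    # One non-streaming request per spec; one candidate per output line.
    resp = requests.post(OLLAMA_URL, json={
        "model": model,
        "prompt": build_prompt(spec),
        "stream": False,
    })
    resp.raise_for_status()
    lines = resp.json()["response"].splitlines()
    return [ln.strip() for ln in lines if ln.strip()]

def append_candidates(spec, lines, out_path, start_id):
    # Every generated line enters the file as status "candidate".
    with open(out_path, "a", encoding="utf-8") as f:
        for i, line in enumerate(lines, start=start_id):
            f.write(json.dumps({
                "expression_id": f"expr_{i:06d}",
                "object": spec["object"],
                "action": spec["action"],
                "pressure": spec["pressure"],
                "actor_voice": spec["actor_voice"],
                "candidate": line,
                "status": "candidate",
            }, ensure_ascii=False) + "\n")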

Phase 3: Review Interface

Build a fast human review tool.

Required markings:

accept
reject
revise
strong
canonical

Preferred one-key controls:

a = accept
r = reject
v = revise
s = strong
c = canonical

The review tool matters more than the generator model.
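
A minimal terminal sketch of such a tool. It uses Enter-terminated input rather than true single-keypress capture (which would need something like curses), and the file layout and field names follow sections 11 and 12:

import json

MARKS = {"a": "accepted", "r": "rejected", "v": "revise",
         "s": "strong", "c": "canonical"}

def review(candidates_path, reviewed_path):
    # Walk unreviewed candidates one at a time; q stops the session.
    with open(candidates_path, encoding="utf-8") as fin, \
         open(reviewed_path, "a", encoding="utf-8") as fout:
        for raw in fin:
            rec = json.loads(raw)
            if rec.get("status") != "candidate":
                continue
            print(f"\n[{rec['actor_voice']}] {rec['candidate']}")
            key = input("a/r/v/s/c (q to quit): ").strip().lower()
            if key == "q":
                break
            if key in MARKS:
                rec["status"] = MARKS[key]
                rec["reviewed_by"] = "human"
                fout.write(json.dumps(rec, ensure_ascii=False) + "\n")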


5. Generator Prompt Pattern

Use a strict prompt.

Example:

You generate Roman-visible commercial expressions for OTIVM.

Rules:
- Do not explain.
- Do not use modern business language.
- Do not use words like logistics, liquidity, market efficiency, regulatory, contract compliance, metadata, model, training, or optimization.
- Use concrete objects, actions, and pressures.
- Prefer terse lines.
- Produce candidate lines only.

Object: cart
Action: hired elsewhere
Pressure: buyer waiting
Actor voice: Secundus

Generate 20 candidates.

Expected useful outputs:

The wheels are gone.
The buyer will not wait for empty ruts.
Ten jars can still go by mule.
Naso bought the road before the oil moved.

Bad outputs:

Transport capacity is constrained.
The supply chain is disrupted.
We need to optimize the delivery channel.
This represents a logistical bottleneck.
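
The forbidden list also supports a crude pre-review screen. A sketch, assuming simple substring matching is acceptable for flagging; it only marks candidates for faster rejection and does not replace human review:

FORBIDDEN = {
    "logistics", "liquidity", "market efficiency", "regulatory",
    "contract compliance", "metadata", "model", "training",
    "optimization", "optimize", "supply chain", "bottleneck",
}

def is_modern_contaminated(candidate):
    # Substring screen over the forbidden modern vocabulary.
    lower = candidate.lower()
    return any(term in lower for term in FORBIDDEN)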

6. Output Rule

The generator output must never enter training directly.

All generated output begins as:

status: candidate

Only reviewed material can become:

accepted
strong
canonical

Training may use:

accepted expressions
strong expressions
canonical expressions
human-revised expressions
dialogue lines based on reviewed expressions

Training must not use:

raw generated candidates
rejected candidates
unreviewed batches
candidate churn
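
In code, the gate is a simple status filter. A sketch, assuming human-revised lines re-enter the reviewed file as accepted after editing:

import json

TRAINABLE = {"accepted", "strong", "canonical"}

def training_expressions(reviewed_path):
    # Yield only expressions whose reviewed status permits training use.
    with open(reviewed_path, encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            if rec.get("status") in TRAINABLE:
                yield rec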

7. Why Modern-Contaminated Generator Models Are Acceptable

The generator model may contain modern assumptions.

That is acceptable because it is not the final model.

The generator is not trusted.

The human review gate is trusted.

This distinction is central:

generator output = candidate quarry stone
reviewed output = vocabulary material
canonical output = simulator-ready phrase

The generator may suggest bad phrases.

The review process prevents them from becoming corpus material.


8. Local Model Evaluation

Evaluate local generator models by candidate yield, not by benchmark scores.

Useful metric:

accepted candidates per 100 generated lines

Example:

Qwen2.5-3B:
  1000 generated
  130 accepted
  22 strong
  5 canonical

Phi-3.5-mini:
  1000 generated
  90 accepted
  18 strong
  7 canonical

Gemma small:
  1000 generated
  110 accepted
  15 strong
  4 canonical

The best generator is the one that gives the most reviewable Roman-visible candidates per hour.

Not the one with the highest general model score.
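
A sketch of the yield computation, assuming strong and canonical lines count toward the accepted total since they passed acceptance first:

import json
from collections import Counter

def yield_report(reviewed_path, generated_total):
    # Accepted candidates per 100 generated lines, plus raw status counts.
    counts = Counter()
    with open(reviewed_path, encoding="utf-8") as f:
        for line in f:
            counts[json.loads(line).get("status")] += 1
    accepted = counts["accepted"] + counts["strong"] + counts["canonical"]
    return {"per_100": round(100.0 * accepted / generated_total, 1),
            "counts": dict(counts)}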


9. Batch Generation Strategy

Generate many small batches instead of one huge batch.

Recommended:

20 candidates per prompt
50 prompts per run
1000 candidates per review session

Vary one dimension at a time.

Example batch family:

object: cart
action: hired_elsewhere
pressure: buyer_waiting
actor_voice: Secundus

object: cart
action: hired_elsewhere
pressure: buyer_waiting
actor_voice: Felix

object: cart
action: hired_elsewhere
pressure: buyer_waiting
actor_voice: Chresimus

This reveals actor voice differences without changing the underlying simulator condition.
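
A sketch of building such a family, holding the simulator condition fixed and varying only the actor voice:

from itertools import product

def batch_family(objects, actions, pressures, voices, count=20):
    # One spec per combination; vary voice last so families stay grouped.
    for obj, act, prs in product(objects, actions, pressures):
        for voice in voices:
            yield {"object": obj, "action": act, "pressure": prs,
                   "actor_voice": voice, "count": count}

# The example family above:
# batch_family(["cart"], ["hired_elsewhere"], ["buyer_waiting"],
#              ["Secundus", "Felix", "Chresimus"])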


10. Temperature And Sampling

Start conservative.

Suggested settings:

temperature: 0.8
top_p: 0.9
repeat_penalty: 1.1
num_predict: modest (enough for one 20-line batch)
context: modest (the strict prompt is short)

If output is too dull:

raise temperature slightly
increase candidate count
add actor-specific examples

If output is too theatrical:

lower temperature
add terse rule
add rejection examples

If output is too modern:

strengthen forbidden terms
add Roman-visible examples
reduce abstract wording in prompt
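
These settings map directly onto Ollama request options. A starting payload sketch; the num_predict and num_ctx values are illustrative assumptions, not measured requirements:

# Pass as the "options" field of the /api/generate request.
OPTIONS = {
    "temperature": 0.8,
    "top_p": 0.9,
    "repeat_penalty": 1.1,
    "num_predict": 512,   # modest: enough for 20 short lines
    "num_ctx": 2048,      # modest: the strict prompt is short
}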

11. Data Files

Recommended folder layout:

data/vocabulary/
  generator_inputs/
    objects.yaml
    actions.yaml
    pressures.yaml
    actor_voices.yaml

  candidates/
    candidates_YYYYMMDD.jsonl

  reviewed/
    roman_visible_expressions.jsonl
    canonical_templates.jsonl

  reports/
    generator_yield_report.txt
    review_summary.txt

12. Minimum Candidate Schema

{
  "expression_id": "expr_000001",
  "created_at": "YYYY-MM-DD",
  "generator_model": "qwen2.5:3b",
  "domain": "commerce",
  "object": "cart",
  "action": "hired_elsewhere",
  "pressure": "buyer_waiting",
  "actor_voice": "Secundus",
  "candidate": "The wheels are gone, and the buyer will not wait for our excuses.",
  "modern_meaning": "Cart capacity has been lost while the buyer is waiting.",
  "concept_tags": [
    "transport_capacity",
    "delay_cost",
    "buyer_need"
  ],
  "status": "candidate",
  "strength": null,
  "review_note": null
}
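
A sketch of a validator over this schema; strength and review_note may be null, so only presence of the other fields is checked:

REQUIRED = {
    "expression_id", "created_at", "generator_model", "domain",
    "object", "action", "pressure", "actor_voice",
    "candidate", "modern_meaning", "concept_tags", "status",
}
STATUSES = {"candidate", "accepted", "rejected", "revise",
            "strong", "canonical"}

def validate_candidate(rec):
    # Return a list of problems; an empty list means the record passes.
    problems = [f"missing field: {f}" for f in sorted(REQUIRED - rec.keys())]
    if rec.get("status") not in STATUSES:
        problems.append(f"bad status: {rec.get('status')!r}")
    return problems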

13. Promotion Schema

When promoted:

{
  "expression_id": "expr_000001",
  "status": "strong",
  "reviewed_by": "human",
  "review_note": "Good Secundus line; concrete and reusable.",
  "promoted_to": [
    "roman_visible_expressions"
  ]
}

Canonical lines should be rare:

{
  "expression_id": "expr_000019",
  "status": "canonical",
  "candidate": "The wheels are gone.",
  "canonical_condition": "transport_capacity_lost"
}
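
A sketch of applying a promotion to a record; the canonical_condition argument is only attached for canonical lines:

def promote(rec, status, note, targets, canonical_condition=None):
    # Record a human review decision on a reviewed expression.
    assert status in {"accepted", "strong", "canonical"}
    rec.update({"status": status, "reviewed_by": "human",
                "review_note": note, "promoted_to": targets})
    if status == "canonical":
        rec["canonical_condition"] = canonical_condition
    return rec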

14. When To Move Beyond Ollama

Move from Ollama to llama.cpp or vLLM only if needed.

Reasons to move:

need exact GGUF quant choice
need better batching control
need lower latency
need reproducible runtime parameters
need integration with a custom review server

Until then, Ollama is sufficient.

The priority is vocabulary yield, not infrastructure elegance.


15. Near-Term Test Plan

Run a small bakeoff.

Models:

qwen2.5:3b
Phi-3.5-mini-instruct (quantized)
a small Gemma instruct model

Prompts:

10 object/action/pressure combinations
6 actor voices
20 candidates each

Total:

10 * 6 * 20 = 1200 candidates per model

Human review outcome:

accepted count
strong count
canonical count
modern contamination count
too theatrical count
duplicate count

Pick the generator model by accepted/strong yield per review hour.
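
A sketch of the tally, assuming each reviewed record keeps its generator_model field and that contamination, theatrical, and duplicate marks are recorded during review as a hypothetical "flags" list on each record:

import json
from collections import Counter, defaultdict

def bakeoff_tally(reviewed_path):
    # Per-model counts of review outcomes and review flags.
    per_model = defaultdict(Counter)
    with open(reviewed_path, encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            tally = per_model[rec.get("generator_model", "unknown")]
            tally[rec["status"]] += 1
            for flag in rec.get("flags", []):  # e.g. "modern", "theatrical", "duplicate"
                tally[flag] += 1
    return dict(per_model)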


16. Recommendation

Begin with:

Ollama + qwen2.5:3b

Use it to generate candidate vocabulary only.

Do not use it as authority.

Do not train on its raw output.

Do not let it decide canonical vocabulary.

The first success condition is simple:

Can the local generator produce enough reviewable Roman-visible candidates to make human review faster than hand-authoring?

If yes, the deployment is successful.

If no, test Phi-3.5-mini and Gemma small models with the same input batches.


17. Success Condition

This model-selection process is working if it produces:

high candidate volume
low deployment friction
fast human review
rising accepted-expression count
a small canonical phrase library
better dialogue voice
less modern vocabulary

The correct measure is not model intelligence.

The correct measure is vocabulary throughput.

The generator does not need to be Roman.

The reviewed output does.