# GENERATOR-MODEL-SELECTION-0001
## Local Model Selection And Deployment For The OTIVM Vocabulary Generator
### Status: Draft Standard
### Layer: Training Infrastructure
### Purpose: Select and deploy a small local model for generating Roman-visible vocabulary candidates
### Repository Path: docs/training/chunking/GENERATOR-MODEL-SELECTION-0001.md
---
## 0. Purpose
This document defines a practical model-selection and deployment plan for the OTIVM Roman-visible expression generator.
The generator is not the CIVICUS-ROMAN model.
The generator is a tool used to produce candidate phrases.
Most generated phrases will be weak.
Only reviewed and accepted expressions become training material.
The generator is quarry equipment.
The reviewed vocabulary is the stone.
---
## 1. Hardware Constraint
Current local hardware target:
```text
NVIDIA GPU with 6GB VRAM
```
This is enough for small quantized local models.
It is not the right target for full model training.
It is sufficient for:
```text
candidate expression generation
small-batch phrase variation
actor-voice experiments
object/action/pressure recombination
quick local iteration
offline review workflows
```
It should not be used yet for:
```text
full CIVICUS-ROMAN training
large-context corpus analysis
unsupervised corpus promotion
automatic canonical selection
```
---
## 2. Primary Recommendation
Start with:
```text
Model: Qwen2.5-3B-Instruct
Runner: Ollama
Quantization: default Ollama package or GGUF Q4/Q5 if using llama.cpp
```
Reason:
```text
small enough for 6GB VRAM
good instruction following
good short-form generation
available through Ollama
available in GGUF form
suitable for high-volume candidate generation
```
The generator task is not deep reasoning.
It is constrained phrase production.
A 3B instruct model is enough to begin.
---
## 3. Backup Models
### Phi-3.5-mini-instruct
Use if Qwen2.5-3B gives too much decorative prose or weak instruction following.
Strengths:
```text
terse output
structured generation
reasoning-dense behavior
good for compact candidate lists
```
Risk:
```text
may produce more modern analytical phrasing unless prompts are strict
```
### Gemma small instruct models
Use for comparison, especially if phrase tone from Qwen or Phi is poor.
Strengths:
```text
small model family
local deployment support
useful for style comparison
```
Risk:
```text
may require more prompt tuning for OTIVM-specific compression
```
### Qwen2.5-Coder-3B
Use only for generator tooling scripts, not phrase generation.
Strengths:
```text
code generation
JSONL tools
review UI helpers
validator scripts
```
Risk:
```text
not the right primary voice generator
```
---
## 4. Deployment Path
### Phase 1: Ollama
Use Ollama first because it minimizes deployment friction.
Install and run:
```bash
ollama pull qwen2.5:3b
ollama run qwen2.5:3b
```
Test with direct prompt batches.
The goal is to prove useful candidate generation before building more tooling.
### Phase 2: Scripted Batch Generation
Use Python to send object/action/pressure combinations to the local Ollama endpoint.
Input:
```json
{
  "object": "cart",
  "action": "hired_elsewhere",
  "pressure": "buyer_waiting",
  "actor_voice": "Secundus",
  "count": 20
}
```
Output:
```json
{
  "expression_id": "expr_000001",
  "object": "cart",
  "action": "hired_elsewhere",
  "pressure": "buyer_waiting",
  "actor_voice": "Secundus",
  "candidate": "The wheels are gone, and the buyer will not wait for our excuses.",
  "status": "candidate"
}
```
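A minimal batch-generation sketch follows. It assumes the default Ollama HTTP endpoint at `http://localhost:11434/api/generate` and the `requests` library; the helper names and prompt wiring are illustrative, not part of this standard.
```python
# Sketch only: Phase 2 batch generation against a local Ollama endpoint.
# Assumes Ollama's default HTTP API and the `requests` package.
import json
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_prompt(spec: dict) -> str:
    # Mirrors the strict prompt pattern in section 5.
    return (
        "You generate Roman-visible commercial expressions for OTIVM.\n"
        "Rules:\n"
        "- Do not explain.\n"
        "- Do not use modern business language.\n"
        "- Use concrete objects, actions, and pressures.\n"
        "- Prefer terse lines.\n"
        "- Produce candidate lines only.\n"
        f"Object: {spec['object']}\n"
        f"Action: {spec['action']}\n"
        f"Pressure: {spec['pressure']}\n"
        f"Actor voice: {spec['actor_voice']}\n"
        f"Generate {spec['count']} candidates."
    )

def generate_candidates(spec: dict, model: str = "qwen2.5:3b") -> list[str]:
    payload = {"model": model, "prompt": build_prompt(spec), "stream": False}
    reply = requests.post(OLLAMA_URL, json=payload, timeout=300)
    reply.raise_for_status()
    # Treat each non-empty output line as one candidate; review happens later.
    return [line.strip("- ").strip()
            for line in reply.json()["response"].splitlines()
            if line.strip()]

if __name__ == "__main__":
    spec = {"object": "cart", "action": "hired_elsewhere",
            "pressure": "buyer_waiting", "actor_voice": "Secundus", "count": 20}
    for i, candidate in enumerate(generate_candidates(spec), start=1):
        record = {
            "expression_id": f"expr_{i:06d}",
            "object": spec["object"],
            "action": spec["action"],
            "pressure": spec["pressure"],
            "actor_voice": spec["actor_voice"],
            "candidate": candidate,
            "status": "candidate",
        }
        print(json.dumps(record, ensure_ascii=False))
```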
### Phase 3: Review Interface
Build a fast human review tool.
Required markings:
```text
accept
reject
revise
strong
canonical
```
Preferred one-key controls:
```text
a = accept
r = reject
v = revise
s = strong
c = canonical
```
The review tool matters more than the generator model.
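A minimal terminal sketch of the one-key review loop, assuming candidates are stored as JSONL in the section 12 schema; the file paths and the revise flow are illustrative.
```python
# Sketch only: one-key terminal review loop. File paths are illustrative.
import json
from pathlib import Path

MARKS = {"a": "accepted", "r": "rejected", "v": "revise",
         "s": "strong", "c": "canonical"}

def review(candidates_path: str, reviewed_path: str) -> None:
    with Path(reviewed_path).open("a", encoding="utf-8") as out:
        for line in Path(candidates_path).read_text(encoding="utf-8").splitlines():
            record = json.loads(line)
            print(f"\n[{record['actor_voice']}] {record['candidate']}")
            key = input("a/r/v/s/c > ").strip().lower()
            if key not in MARKS:
                continue  # unmarked lines stay candidates
            record["status"] = MARKS[key]
            if key == "v":
                # Assumption: a human-revised line is recorded as accepted.
                record["candidate"] = input("revised line > ").strip()
                record["status"] = "accepted"
            out.write(json.dumps(record, ensure_ascii=False) + "\n")

if __name__ == "__main__":
    review("data/vocabulary/candidates/candidates_YYYYMMDD.jsonl",
           "data/vocabulary/reviewed/roman_visible_expressions.jsonl")
```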
---
## 5. Generator Prompt Pattern
Use a strict prompt.
Example:
```text
You generate Roman-visible commercial expressions for OTIVM.
Rules:
- Do not explain.
- Do not use modern business language.
- Do not use words like logistics, liquidity, market efficiency, regulatory, contract compliance, metadata, model, training, or optimization.
- Use concrete objects, actions, and pressures.
- Prefer terse lines.
- Produce candidate lines only.
Object: cart
Action: hired elsewhere
Pressure: buyer waiting
Actor voice: Secundus
Generate 20 candidates.
```
Expected useful outputs:
```text
The wheels are gone.
The buyer will not wait for empty ruts.
Ten jars can still go by mule.
Naso bought the road before the oil moved.
```
Bad outputs:
```text
Transport capacity is constrained.
The supply chain is disrupted.
We need to optimize the delivery channel.
This represents a logistical bottleneck.
```
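A small pre-review screen can catch the most obvious modern contamination before a human sees it. The term list in this sketch is seeded from the forbidden terms and bad-output examples above; flagged lines still go to review rather than being dropped automatically.
```python
# Sketch only: flag obviously modern candidates before review.
FORBIDDEN = {
    "logistics", "liquidity", "market efficiency", "regulatory",
    "contract compliance", "metadata", "model", "training", "optimization",
    "supply chain", "optimize", "bottleneck",
}

def looks_modern(candidate: str) -> bool:
    lowered = candidate.lower()
    return any(term in lowered for term in FORBIDDEN)

assert looks_modern("This represents a logistical bottleneck.")
assert not looks_modern("The wheels are gone.")
```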
---
## 6. Output Rule
The generator output must never enter training directly.
All generated output begins as:
```text
status: candidate
```
Only reviewed material can become:
```text
accepted
strong
canonical
```
Training may use:
```text
accepted expressions
strong expressions
canonical expressions
human-revised expressions
dialogue lines based on reviewed expressions
```
Training must not use:
```text
raw generated candidates
rejected candidates
unreviewed batches
candidate churn
```
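A sketch of the training gate, assuming the reviewed JSONL carries the status values defined above:
```python
# Sketch only: only reviewed statuses may enter training material.
import json

TRAINABLE = {"accepted", "strong", "canonical"}

def trainable_records(reviewed_jsonl_path: str):
    with open(reviewed_jsonl_path, encoding="utf-8") as handle:
        for line in handle:
            record = json.loads(line)
            # Raw candidates, rejected lines, and unreviewed batches never pass.
            if record.get("status") in TRAINABLE:
                yield record
```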
---
## 7. Why Modern-Contaminated Generator Models Are Acceptable
The generator model may contain modern assumptions.
That is acceptable because it is not the final model.
The generator is not trusted.
The human review gate is trusted.
This distinction is central:
```text
generator output = candidate quarry stone
reviewed output = vocabulary material
canonical output = simulator-ready phrase
```
The generator may suggest bad phrases.
The review process prevents them from becoming corpus material.
---
## 8. Local Model Evaluation
Evaluate local generator models by candidate yield, not by benchmark scores.
Useful metric:
```text
accepted candidates per 100 generated lines
```
Example:
```text
Qwen2.5-3B:
  1000 generated
  130 accepted
  22 strong
  5 canonical

Phi-3.5-mini:
  1000 generated
  90 accepted
  18 strong
  7 canonical

Gemma small:
  1000 generated
  110 accepted
  15 strong
  4 canonical
```
The best generator is the one that gives the most reviewable Roman-visible candidates per hour.
Not the one with the highest general model score.
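A sketch of the yield computation, assuming one merged JSONL in which every generated line keeps its generator_model field and carries its final review status:
```python
# Sketch only: accepted (or better) candidates per 100 generated, per model.
import json
from collections import Counter

def yield_per_100(merged_jsonl_path: str) -> dict[str, float]:
    generated, accepted = Counter(), Counter()
    with open(merged_jsonl_path, encoding="utf-8") as handle:
        for line in handle:
            record = json.loads(line)
            model = record.get("generator_model", "unknown")
            generated[model] += 1
            if record.get("status") in {"accepted", "strong", "canonical"}:
                accepted[model] += 1
    return {model: 100.0 * accepted[model] / generated[model]
            for model in generated}
```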
---
## 9. Batch Generation Strategy
Generate many small batches instead of one huge batch.
Recommended:
```text
20 candidates per prompt
50 prompts per run
1000 candidates per review session
```
Vary one dimension at a time.
Example batch family:
```text
object: cart
action: hired_elsewhere
pressure: buyer_waiting
actor_voice: Secundus

object: cart
action: hired_elsewhere
pressure: buyer_waiting
actor_voice: Felix

object: cart
action: hired_elsewhere
pressure: buyer_waiting
actor_voice: Chresimus
```
This reveals actor voice differences without changing the underlying simulator condition.
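A sketch of building such a batch family, holding every dimension fixed except one; the base combination and voice list are the examples above:
```python
# Sketch only: vary exactly one dimension per batch family.
def batch_family(base: dict, vary_key: str, values: list[str],
                 count: int = 20) -> list[dict]:
    family = []
    for value in values:
        spec = dict(base)       # hold the other dimensions fixed
        spec[vary_key] = value  # vary exactly one dimension
        spec["count"] = count
        family.append(spec)
    return family

base = {"object": "cart", "action": "hired_elsewhere", "pressure": "buyer_waiting"}
prompts = batch_family(base, "actor_voice", ["Secundus", "Felix", "Chresimus"])
```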
---
## 10. Temperature And Sampling
Start conservative.
Suggested settings:
```text
temperature: 0.8
top_p: 0.9
repeat_penalty: 1.1
num_predict: modest
context: modest
```
If output is too dull:
```text
raise temperature slightly
increase candidate count
add actor-specific examples
```
If output is too theatrical:
```text
lower temperature
add terse rule
add rejection examples
```
If output is too modern:
```text
strengthen forbidden terms
add Roman-visible examples
reduce abstract wording in prompt
```
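These settings map directly onto Ollama's request options. A sketch, in which the num_predict and num_ctx values are illustrative stand-ins for "modest":
```python
# Sketch only: suggested sampling settings as an Ollama options payload.
OPTIONS = {
    "temperature": 0.8,
    "top_p": 0.9,
    "repeat_penalty": 1.1,
    "num_predict": 256,   # modest: enough for ~20 terse lines (illustrative)
    "num_ctx": 2048,      # modest context window (illustrative)
}

payload = {"model": "qwen2.5:3b", "prompt": "...", "stream": False,
           "options": OPTIONS}
```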
---
## 11. Data Files
Recommended folder layout:
```text
data/vocabulary/
  generator_inputs/
    objects.yaml
    actions.yaml
    pressures.yaml
    actor_voices.yaml
  candidates/
    candidates_YYYYMMDD.jsonl
  reviewed/
    roman_visible_expressions.jsonl
    canonical_templates.jsonl
  reports/
    generator_yield_report.txt
    review_summary.txt
```
---
## 12. Minimum Candidate Schema
```json
{
  "expression_id": "expr_000001",
  "created_at": "YYYY-MM-DD",
  "generator_model": "qwen2.5:3b",
  "domain": "commerce",
  "object": "cart",
  "action": "hired_elsewhere",
  "pressure": "buyer_waiting",
  "actor_voice": "Secundus",
  "candidate": "The wheels are gone, and the buyer will not wait for our excuses.",
  "modern_meaning": "Cart capacity has been lost while the buyer is waiting.",
  "concept_tags": [
    "transport_capacity",
    "delay_cost",
    "buyer_need"
  ],
  "status": "candidate",
  "strength": null,
  "review_note": null
}
```
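A shallow validation sketch against this schema; the field names match the record above:
```python
# Sketch only: shallow check that a candidate record matches the minimum schema.
REQUIRED = (
    "expression_id", "created_at", "generator_model", "domain",
    "object", "action", "pressure", "actor_voice",
    "candidate", "modern_meaning", "concept_tags", "status",
)

def validate_candidate(record: dict) -> list[str]:
    problems = [field for field in REQUIRED if field not in record]
    if record.get("status") != "candidate":
        problems.append("new records must start with status: candidate")
    if not isinstance(record.get("concept_tags", []), list):
        problems.append("concept_tags must be a list")
    return problems
```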
---
## 13. Promotion Schema
When promoted:
```json
{
  "expression_id": "expr_000001",
  "status": "strong",
  "reviewed_by": "human",
  "review_note": "Good Secundus line; concrete and reusable.",
  "promoted_to": [
    "roman_visible_expressions"
  ]
}
```
Canonical lines should be rare:
```json
{
  "expression_id": "expr_000019",
  "status": "canonical",
  "candidate": "The wheels are gone.",
  "canonical_condition": "transport_capacity_lost"
}
```
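A sketch of applying promotion records back onto the candidate store, assuming both are keyed by expression_id:
```python
# Sketch only: merge promotion records into candidates by expression_id.
def apply_promotions(candidates: dict[str, dict], promotions: list[dict]) -> None:
    for promotion in promotions:
        record = candidates.get(promotion["expression_id"])
        if record is None:
            continue  # promotion refers to an unknown candidate; skip it
        record.update(promotion)  # status, reviewed_by, review_note, promoted_to
```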
---
## 14. When To Move Beyond Ollama
Move from Ollama to llama.cpp or vLLM only if needed.
Reasons to move:
```text
need exact GGUF quant choice
need better batching control
need lower latency
need reproducible runtime parameters
need integration with a custom review server
```
Until then, Ollama is sufficient.
The priority is vocabulary yield, not infrastructure elegance.
---
## 15. Near-Term Test Plan
Run a small bakeoff.
Models:
```text
qwen2.5:3b
phi3.5-mini-instruct quantized
gemma small instruct model
```
Prompts:
```text
10 object/action/pressure combinations
6 actor voices
20 candidates each
```
Total:
```text
10 * 6 * 20 = 1200 candidates per model
```
Human review outcome:
```text
accepted count
strong count
canonical count
modern contamination count
too theatrical count
duplicate count
```
Pick the generator model by accepted/strong yield per review hour.
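A sketch of tallying bakeoff outcomes per model; the rejection_reason field used here is an assumption beyond the section 12 schema, added to carry the contamination, theatrical, and duplicate counts:
```python
# Sketch only: per-model tally of bakeoff review outcomes.
import json
from collections import Counter, defaultdict

def bakeoff_summary(reviewed_jsonl_path: str) -> dict[str, Counter]:
    summary: dict[str, Counter] = defaultdict(Counter)
    with open(reviewed_jsonl_path, encoding="utf-8") as handle:
        for line in handle:
            record = json.loads(line)
            model = record.get("generator_model", "unknown")
            outcome = record.get("rejection_reason") or record.get("status")
            summary[model][outcome] += 1
    return summary
```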
---
## 16. Recommendation
Begin with:
```text
Ollama + qwen2.5:3b
```
Use it to generate candidate vocabulary only.
Do not use it as authority.
Do not train on its raw output.
Do not let it decide canonical vocabulary.
The first success condition is simple:
```text
Can the local generator produce enough reviewable Roman-visible candidates to make human review faster than hand-authoring?
```
If yes, the deployment is successful.
If no, test Phi-3.5-mini and Gemma small models with the same input batches.
---
## 17. Success Condition
This model-selection process is working if it produces:
```text
high candidate volume
low deployment friction
fast human review
rising accepted-expression count
a small canonical phrase library
better dialogue voice
less modern vocabulary
```
The correct measure is not model intelligence.
The correct measure is vocabulary throughput.
The generator does not need to be Roman.
The reviewed output does.