# GENERATOR-MODEL-SELECTION-0001

## Local Model Selection And Deployment For The OTIVM Vocabulary Generator

### Status: Draft Standard
### Layer: Training Infrastructure
### Purpose: Select and deploy a small local model for generating Roman-visible vocabulary candidates
### Repository Path: docs/training/chunking/GENERATOR-MODEL-SELECTION-0001.md

---

## 0. Purpose

This document defines a practical model-selection and deployment plan for the OTIVM Roman-visible expression generator.

The generator is not the CIVICUS-ROMAN model. The generator is a tool used to produce candidate phrases. Most generated phrases may be weak. Only reviewed and accepted expressions become training material.

The generator is quarry equipment. The reviewed vocabulary is the stone.

---

## 1. Hardware Constraint

Current local hardware target:

```text
NVIDIA GPU with 6GB VRAM
```

This is enough for small quantized local models. It is not the right target for full model training.

It is sufficient for:

```text
candidate expression generation
small-batch phrase variation
actor-voice experiments
object/action/pressure recombination
quick local iteration
offline review workflows
```

It should not be used yet for:

```text
full CIVICUS-ROMAN training
large-context corpus analysis
unsupervised corpus promotion
automatic canonical selection
```

---

## 2. Primary Recommendation

Start with:

```text
Model: Qwen2.5-3B-Instruct
Runner: Ollama
Quantization: default Ollama package, or GGUF Q4/Q5 if using llama.cpp
```

Reason:

```text
small enough for 6GB VRAM
good instruction following
good short-form generation
available through Ollama
available in GGUF form
suitable for high-volume candidate generation
```

The generator task is not deep reasoning. It is constrained phrase production. A 3B instruct model is enough to begin.

---

## 3. Backup Models

### Phi-3.5-mini-instruct

Use if Qwen2.5-3B gives too much decorative prose or weak instruction following.
Strengths:

```text
terse output
structured generation
reasoning-dense behavior
good for compact candidate lists
```

Risk:

```text
may produce more modern analytical phrasing unless prompts are strict
```

### Gemma small instruct models

Use for comparison, especially if phrase tone from Qwen or Phi is poor.

Strengths:

```text
small model family
local deployment support
useful for style comparison
```

Risk:

```text
may require more prompt tuning for OTIVM-specific compression
```

### Qwen2.5-Coder-3B

Use only for generator tooling scripts, not phrase generation.

Strengths:

```text
code generation
JSONL tools
review UI helpers
validator scripts
```

Risk:

```text
not the right primary voice generator
```

---

## 4. Deployment Path

### Phase 1: Ollama

Use Ollama first because it minimizes deployment friction.

Install and run:

```bash
ollama pull qwen2.5:3b
ollama run qwen2.5:3b
```

Test with direct prompt batches. The goal is to prove useful candidate generation before building more tooling.

### Phase 2: Scripted Batch Generation

Use Python to send object/action/pressure combinations to the local Ollama endpoint.

Input:

```json
{
  "object": "cart",
  "action": "hired_elsewhere",
  "pressure": "buyer_waiting",
  "actor_voice": "Secundus",
  "count": 20
}
```

Output:

```json
{
  "expression_id": "expr_000001",
  "object": "cart",
  "action": "hired_elsewhere",
  "pressure": "buyer_waiting",
  "actor_voice": "Secundus",
  "candidate": "The wheels are gone, and the buyer will not wait for our excuses.",
  "status": "candidate"
}
```

### Phase 3: Review Interface

Build a fast human review tool.

Required markings:

```text
accept
reject
revise
strong
canonical
```

Preferred one-key controls:

```text
a = accept
r = reject
v = revise
s = strong
c = canonical
```

The review tool matters more than the generator model.

---

## 5. Generator Prompt Pattern

Use a strict prompt. Example:

```text
You generate Roman-visible commercial expressions for OTIVM.

Rules:
- Do not explain.
- Do not use modern business language.
- Do not use words like logistics, liquidity, market efficiency, regulatory, contract compliance, metadata, model, training, or optimization.
- Use concrete objects, actions, and pressures.
- Prefer terse lines.
- Produce candidate lines only.

Object: cart
Action: hired elsewhere
Pressure: buyer waiting
Actor voice: Secundus

Generate 20 candidates.
```

Expected useful outputs:

```text
The wheels are gone.
The buyer will not wait for empty ruts.
Ten jars can still go by mule.
Naso bought the road before the oil moved.
```

Bad outputs:

```text
Transport capacity is constrained.
The supply chain is disrupted.
We need to optimize the delivery channel.
This represents a logistical bottleneck.
```

---

## 6. Output Rule

The generator output must never enter training directly.

All generated output begins as:

```text
status: candidate
```

Only reviewed material can become:

```text
accepted
strong
canonical
```

Training may use:

```text
accepted expressions
strong expressions
canonical expressions
human-revised expressions
dialogue lines based on reviewed expressions
```

Training must not use:

```text
raw generated candidates
rejected candidates
unreviewed batches
candidate churn
```

---

## 7. Why Modern-Contaminated Generator Models Are Acceptable

The generator model may contain modern assumptions. That is acceptable because it is not the final model.

The generator is not trusted. The human review gate is trusted.

This distinction is central:

```text
generator output = candidate quarry stone
reviewed output = vocabulary material
canonical output = simulator-ready phrase
```

The generator may suggest bad phrases. The review process prevents them from becoming corpus material.

---

## 8. Local Model Evaluation

Evaluate local generator models by candidate yield, not by benchmark scores.
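A minimal sketch of computing that yield from a reviewed JSONL file (the `status` values follow this document's schemas; the file path and helper name are illustrative):

```python
import json

def yield_per_100(path):
    """Accepted candidates per 100 generated lines, from a reviewed JSONL file.

    Assumes one JSON object per line with a "status" field, as in the
    candidate schema later in this document.
    """
    total = 0
    accepted = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            total += 1
            # "accepted", "strong", and "canonical" all count toward yield
            if record.get("status") in ("accepted", "strong", "canonical"):
                accepted += 1
    return 100.0 * accepted / total if total else 0.0
```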
Useful metric:

```text
accepted candidates per 100 generated lines
```

Example:

```text
Qwen2.5-3B:
  1000 generated
  130 accepted
  22 strong
  5 canonical

Phi-3.5-mini:
  1000 generated
  90 accepted
  18 strong
  7 canonical

Gemma small:
  1000 generated
  110 accepted
  15 strong
  4 canonical
```

The best generator is the one that gives the most reviewable Roman-visible candidates per hour, not the one with the highest general model score.

---

## 9. Batch Generation Strategy

Generate many small batches instead of one huge batch.

Recommended:

```text
20 candidates per prompt
50 prompts per run
1000 candidates per review session
```

Vary one dimension at a time.

Example batch family:

```text
object: cart
action: hired_elsewhere
pressure: buyer_waiting
actor_voice: Secundus

object: cart
action: hired_elsewhere
pressure: buyer_waiting
actor_voice: Felix

object: cart
action: hired_elsewhere
pressure: buyer_waiting
actor_voice: Chresimus
```

This reveals actor-voice differences without changing the underlying simulator condition.

---

## 10. Temperature And Sampling

Start conservative.

Suggested settings:

```text
temperature: 0.8
top_p: 0.9
repeat_penalty: 1.1
num_predict: modest
context: modest
```

If output is too dull:

```text
raise temperature slightly
increase candidate count
add actor-specific examples
```

If output is too theatrical:

```text
lower temperature
add terse rule
add rejection examples
```

If output is too modern:

```text
strengthen forbidden terms
add Roman-visible examples
reduce abstract wording in prompt
```

---

## 11. Data Files

Recommended folder layout:

```text
data/vocabulary/
  generator_inputs/
    objects.yaml
    actions.yaml
    pressures.yaml
    actor_voices.yaml
  candidates/
    candidates_YYYYMMDD.jsonl
  reviewed/
    roman_visible_expressions.jsonl
    canonical_templates.jsonl
  reports/
    generator_yield_report.txt
    review_summary.txt
```

---

## 12. Minimum Candidate Schema

```json
{
  "expression_id": "expr_000001",
  "created_at": "YYYY-MM-DD",
  "generator_model": "qwen2.5:3b",
  "domain": "commerce",
  "object": "cart",
  "action": "hired_elsewhere",
  "pressure": "buyer_waiting",
  "actor_voice": "Secundus",
  "candidate": "The wheels are gone, and the buyer will not wait for our excuses.",
  "modern_meaning": "Cart capacity has been lost while the buyer is waiting.",
  "concept_tags": ["transport_capacity", "delay_cost", "buyer_need"],
  "status": "candidate",
  "strength": null,
  "review_note": null
}
```

---

## 13. Promotion Schema

When promoted:

```json
{
  "expression_id": "expr_000001",
  "status": "strong",
  "reviewed_by": "human",
  "review_note": "Good Secundus line; concrete and reusable.",
  "promoted_to": ["roman_visible_expressions"]
}
```

Canonical lines should be rare:

```json
{
  "expression_id": "expr_000019",
  "status": "canonical",
  "candidate": "The wheels are gone.",
  "canonical_condition": "transport_capacity_lost"
}
```

---

## 14. When To Move Beyond Ollama

Move from Ollama to llama.cpp or vLLM only if needed.

Reasons to move:

```text
need exact GGUF quant choice
need better batching control
need lower latency
need reproducible runtime parameters
need integration with a custom review server
```

Until then, Ollama is sufficient. The priority is vocabulary yield, not infrastructure elegance.

---

## 15. Near-Term Test Plan

Run a small bakeoff.

Models:

```text
qwen2.5:3b
phi3.5-mini-instruct quantized
gemma small instruct model
```

Prompts:

```text
10 object/action/pressure combinations
6 actor voices
20 candidates each
```

Total:

```text
10 * 6 * 20 = 1200 candidates per model
```

Human review outcome:

```text
accepted count
strong count
canonical count
modern contamination count
too theatrical count
duplicate count
```

Pick the generator model by accepted/strong yield per review hour.

---

## 16. Recommendation

Begin with:

```text
Ollama + qwen2.5:3b
```

Use it to generate candidate vocabulary only.
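Phase 2 of the deployment path can be sketched against this setup using Ollama's local HTTP API (`/api/generate` with `"stream": false` is Ollama's documented request shape; the prompt template condenses Section 5, and the helper name and parsing logic are illustrative assumptions):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

PROMPT_TEMPLATE = """You generate Roman-visible commercial expressions for OTIVM.

Rules:
- Do not explain.
- Do not use modern business language.
- Use concrete objects, actions, and pressures.
- Prefer terse lines.
- Produce candidate lines only.

Object: {object}
Action: {action}
Pressure: {pressure}
Actor voice: {actor_voice}

Generate {count} candidates."""

def generate_candidates(combo):
    """Send one object/action/pressure combination to the local Ollama endpoint."""
    payload = {
        "model": "qwen2.5:3b",
        "prompt": PROMPT_TEMPLATE.format(**combo),
        "stream": False,
        # Suggested settings from Section 10
        "options": {"temperature": 0.8, "top_p": 0.9, "repeat_penalty": 1.1},
    }
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.loads(resp.read())
    # One candidate per non-empty output line, stored with status "candidate"
    return [
        {**{k: combo[k] for k in ("object", "action", "pressure", "actor_voice")},
         "candidate": line.strip(), "status": "candidate"}
        for line in body["response"].splitlines() if line.strip()
    ]
```

Each returned record starts as `status: candidate`, so nothing from this path can enter training without passing the review gate.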
Do not use it as authority. Do not train on its raw output. Do not let it decide canonical vocabulary.

The first success condition is simple:

```text
Can the local generator produce enough reviewable Roman-visible candidates to make human review faster than hand-authoring?
```

If yes, the deployment is successful. If no, test Phi-3.5-mini and Gemma small models with the same input batches.

---

## 17. Success Condition

This model-selection process is working if it produces:

```text
high candidate volume
low deployment friction
fast human review
rising accepted-expression count
a small canonical phrase library
better dialogue voice
less modern vocabulary
```

The correct measure is not model intelligence. The correct measure is vocabulary throughput.

The generator does not need to be Roman. The reviewed output does.
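As a closing sketch, the Phase 3 one-key review controls can be driven by a minimal terminal loop (the key-to-status mapping comes from this document; the function shape and injectable `read_key` parameter are assumptions made for testability):

```python
# Key mapping from the Phase 3 preferred one-key controls
KEYS = {"a": "accepted", "r": "rejected", "v": "revise",
        "s": "strong", "c": "canonical"}

def review(candidates, read_key=input):
    """Walk candidates one at a time, marking each with a one-key decision.

    Candidates are dicts in this document's candidate schema; `read_key`
    is injectable so the loop can be driven interactively or by a test.
    Unrecognized keys leave the record as "candidate" for a later pass.
    """
    reviewed = []
    for record in candidates:
        print(f'{record["actor_voice"]}: {record["candidate"]}')
        key = read_key("[a]ccept [r]eject re[v]ise [s]trong [c]anonical > ")
        reviewed.append(dict(record, status=KEYS.get(key.strip().lower(), "candidate")))
    return reviewed
```

Because throughput is the measure that matters, this loop should stay this small: one line shown, one key pressed, next candidate.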