6.4 Fine-Tuning and Model Customization
Key Takeaways
- Fine-tuning continues training a base model on your labeled examples to lock in a style, format, or task — it does NOT teach new live facts; use RAG for that.
- Training data is JSONL: one line per example, each a {"messages":[...]} object with system/user/assistant roles mirroring inference-time prompts.
- Minimum 10 examples; Microsoft recommends 50-100+ high-quality, representative examples and a held-out validation split.
- Workflow: upload file → create fine-tuning job (set n_epochs, batch_size, and learning_rate_multiplier, or leave them on 'auto') → monitor loss → deploy the resulting custom model.
- Fine-tuned deployments bill at a higher per-token rate (plus training and hosting cost) than base models, so justify them against prompt engineering and RAG first.
Quick Answer: Fine-tuning customizes a base model on your JSONL examples to enforce a consistent format, style, or task and to shrink prompts. Use it only when prompt engineering and RAG fall short. Minimum 10 examples; 50-100+ recommended. Fine-tuning does not teach new facts — that is RAG's job. Custom models cost more per token than base models.
Fine-Tune vs. Prompt vs. RAG — the Decision
| Need | Best tool |
|---|---|
| Consistent output format prompting can't pin down | Fine-tune |
| Specific tone/style every time | Fine-tune |
| Cut a long few-shot prompt to save tokens | Fine-tune |
| Up-to-date or proprietary facts | RAG |
| One-off or easily prompted task | Prompt engineering |
| Fewer than ~10 quality examples | Not fine-tuning yet |
On the Exam: The single most common fine-tuning question is fine-tune vs. RAG. If the requirement mentions current, changing, or factual knowledge → RAG. If it mentions format/style/consistency that prompts can't achieve → fine-tune. Fine-tuning is a point-in-time snapshot; it never refreshes itself.
JSONL Training Data Contract
Each line is one full example whose messages mirror what you'll send at inference:
{"messages":[{"role":"system","content":"You are a Contoso support agent. Be concise and professional."},{"role":"user","content":"How do I reset my password?"},{"role":"assistant","content":"Go to contoso.com/reset, enter your email, then click Reset. The link expires in 24 hours."}]}
{"messages":[{"role":"system","content":"You are a Contoso support agent. Be concise and professional."},{"role":"user","content":"What is the refund window?"},{"role":"assistant","content":"Returns are accepted within 30 days for a full refund if items are unused."}]}
| Requirement | Minimum | Recommended |
|---|---|---|
| Examples | 10 | 50-100+ |
| Format | JSONL (one object/line) | Identical schema each line |
| Validation set | Optional (auto-split) | 10-20% held out |
| Quality | Correct & representative | Gold-standard, human-reviewed |
Trap: the assistant message in each line must be the ideal answer — fine-tuning copies whatever you put there, errors included. Keep the system prompt identical to production.
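A minimal sketch of a pre-upload check for the contract above, assuming the chat format shown in the two sample lines. The function name `validate_jsonl_lines` and the rule that every example must end with an assistant message are assumptions of this sketch, not a documented validator; the service runs its own validation when the job starts.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}
MIN_EXAMPLES = 10  # minimum stated in this section


def _is_valid_message(message):
    """A message needs a known role and string content."""
    return (
        isinstance(message, dict)
        and message.get("role") in ALLOWED_ROLES
        and isinstance(message.get("content"), str)
    )


def validate_jsonl_lines(lines):
    """Return (line_number, problem) tuples; line 0 means a file-level problem."""
    problems = []
    valid = 0
    for number, raw in enumerate(lines, start=1):
        text = raw.strip()
        if not text:
            continue  # ignore blank lines such as a trailing newline
        try:
            record = json.loads(text)
        except json.JSONDecodeError as exc:
            problems.append((number, f"not valid JSON: {exc.msg}"))
            continue
        messages = record.get("messages") if isinstance(record, dict) else None
        if not isinstance(messages, list) or not messages:
            problems.append((number, 'expected an object with a non-empty "messages" list'))
            continue
        if not all(_is_valid_message(m) for m in messages):
            problems.append((number, "each message needs a known role and string content"))
            continue
        # Assumption of this sketch: the example ends with the ideal assistant answer.
        if messages[-1]["role"] != "assistant":
            problems.append((number, "last message should be the assistant's ideal answer"))
            continue
        valid += 1
    if valid < MIN_EXAMPLES:
        problems.append((0, f"only {valid} valid examples; need at least {MIN_EXAMPLES}"))
    return problems
```

To check a file, pass the open file object, since iterating it yields lines: `validate_jsonl_lines(open("train.jsonl", encoding="utf-8"))`. An empty list means no problems were found by these particular checks.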
Why example quality dominates
Fine-tuning does not reason about your examples; it nudges the model's weights toward reproducing them. Three consequences follow. First, consistency beats volume — 60 clean, on-style examples outperform 300 noisy ones. Second, the distribution must match production: if real users ask short questions but your data is all long ones, the tuned model skews wrong. Third, garbage in, garbage out — a few mislabeled answers teach the model the wrong behavior with no warning. Curate, deduplicate, and spot-check the JSONL before spending money on a training run.
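The curation step can be partly automated. The sketch below removes exact duplicate examples and holds out a validation slice; the function name `dedupe_and_split`, the fixed seed, and the 10% default are illustrative assumptions. It only catches byte-for-byte duplicates, so near-duplicates and mislabeled answers still need a human spot-check.

```python
import json
import random


def dedupe_and_split(records, validation_fraction=0.1, seed=0):
    """Drop exact duplicates, shuffle, and return (train, validation) lists."""
    seen = set()
    unique = []
    for record in records:
        # Serialize with sorted keys so key order does not hide a duplicate.
        key = json.dumps(record, sort_keys=True)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(unique)
    n_validation = round(len(unique) * validation_fraction)
    return unique[n_validation:], unique[:n_validation]
```

Each returned list can then be written back out as JSONL, one `json.dumps(record)` per line, and uploaded as the training and validation files respectively.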
Cost reality check
Fine-tuning carries three cost layers the exam expects you to weigh: a one-time training cost (priced per 1,000 training tokens × epochs), an ongoing hosting cost for keeping the custom deployment available, and a higher per-token inference rate than the base model. Because of this, the recommended order of escalation is always prompt engineering → RAG → fine-tuning: only fine-tune once cheaper, faster-to-iterate options demonstrably fail to meet the format or style requirement.
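The three cost layers can be put into a back-of-the-envelope calculation. This is a sketch only: the function names are hypothetical and every price passed in is a placeholder you must replace with the current rate for your model and region.

```python
HOURS_PER_MONTH = 730  # average number of hours in a month


def training_cost(training_tokens, n_epochs, price_per_1k):
    """One-time cost: the training tokens are billed once per epoch."""
    return training_tokens / 1000 * n_epochs * price_per_1k


def monthly_inference_cost(requests, tokens_per_request, price_per_1k,
                           hosting_per_hour=0.0):
    """Recurring cost: per-token usage plus hourly hosting (0 for a base model)."""
    usage = requests * tokens_per_request / 1000 * price_per_1k
    return usage + hosting_per_hour * HOURS_PER_MONTH


# Illustrative comparison with made-up prices (NOT real Azure rates):
# a base model carrying a long few-shot prompt vs. a tuned model with a short one.
base = monthly_inference_cost(50_000, 2_000, price_per_1k=0.001)
tuned = monthly_inference_cost(50_000, 300, price_per_1k=0.002, hosting_per_hour=1.0)
one_time = training_cost(200_000, n_epochs=3, price_per_1k=0.01)
```

Comparing `base` against `tuned` plus the amortized `one_time` cost shows whether the shorter prompt actually pays for the higher rate and the hosting charge at your request volume.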
The Fine-Tuning Workflow
# Python (openai SDK), steps 1-3. Env var names and api_version are placeholders.
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-10-21")

# 1. Upload training (and optional validation) data
with open("train.jsonl", "rb") as f:
    train = client.files.create(file=f, purpose="fine-tune")
# 2. Create the job
job = client.fine_tuning.jobs.create(
    training_file=train.id,
    model="gpt-4o-mini-2024-07-18",
    hyperparameters={"n_epochs": 3, "batch_size": "auto",
                     "learning_rate_multiplier": "auto"})
# 3. Monitor until status == "succeeded"
job = client.fine_tuning.jobs.retrieve(job.id)
print(job.status, job.fine_tuned_model)

# Azure CLI (shell), step 4. Run this in a terminal, not in Python.
# 4. Deploy the resulting custom model as a NEW deployment
az cognitiveservices account deployment create \
    --name my-openai-service --resource-group rg-ai-prod \
    --deployment-name contoso-support \
    --model-name <fine-tuned-model-id> --model-format OpenAI \
    --model-version "1" \
    --sku-name Standard --sku-capacity 10
The job runs server-side: poll it until its status is succeeded (or read the job's events), and treat failed or cancelled as terminal too. The output is a new model ID that you must deploy like any other model before calling it; training alone does not create an endpoint.
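The monitoring step is a plain polling loop. The sketch below is one way to write it, assuming the job object exposes a `status` attribute as in the workflow code; the function name `wait_for_job`, the poll interval, and the timeout are illustrative choices. The retrieve function is passed in so the loop can be exercised without calling the service.

```python
import time

# Statuses after which the job will not change again.
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}


def wait_for_job(retrieve, job_id, poll_seconds=30,
                 timeout_seconds=6 * 3600, sleep=time.sleep):
    """Call retrieve(job_id) repeatedly until the job reaches a terminal status."""
    waited = 0
    while True:
        job = retrieve(job_id)
        if job.status in TERMINAL_STATUSES:
            return job
        if waited >= timeout_seconds:
            raise TimeoutError(f"job {job_id} still {job.status} after {waited}s")
        sleep(poll_seconds)
        waited += poll_seconds
```

With the client from the workflow, a call would look like `job = wait_for_job(client.fine_tuning.jobs.retrieve, job.id)`. Check `job.status == "succeeded"` before reading `job.fine_tuned_model`, because a failed job has no model to deploy.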
Hyperparameters
| Parameter | Meaning | Default |
|---|---|---|
| n_epochs | Passes over the dataset | auto (~3-4) |
| batch_size | Examples per step | auto |
| learning_rate_multiplier | Scales the learning rate | auto |
More epochs fit the data harder; too many cause overfitting (memorizing training answers, failing on new inputs). Start with auto and only tune if evaluation shows under- or over-fitting.
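To see how the first two parameters interact, the arithmetic below counts weight updates for a run. It assumes the usual convention that one step processes one batch and that a partial final batch still counts as a step; the function name is hypothetical and the service may schedule steps differently when values are left on auto.

```python
import math


def optimizer_steps(n_examples, batch_size, n_epochs):
    """Total weight updates: batches per pass over the data, times passes."""
    if n_examples <= 0 or batch_size <= 0 or n_epochs <= 0:
        raise ValueError("all arguments must be positive")
    steps_per_epoch = math.ceil(n_examples / batch_size)
    return steps_per_epoch * n_epochs
```

For example, 100 examples with a batch size of 8 give 13 steps per epoch, so 3 epochs come to 39 updates. On a small dataset each extra epoch shows the model the same few examples again, which is why high epoch counts overfit quickly.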
Evaluating the Result
| Signal | Healthy | Warning |
|---|---|---|
| Training loss | Decreases, then plateaus | Rising / never settles |
| Validation loss | Tracks training loss | Large gap above training = overfitting |
| Held-out test answers | Correct on unseen prompts | Right on training, wrong on new = overfit |
| A/B vs. base + prompt | Custom wins on the metric | No improvement = don't ship it |
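The first two rows of the table can be turned into a rough automated check. This sketch assumes you have already extracted per-checkpoint training and validation losses into two equal-length lists; the function name `diagnose_losses` and the 1.5x gap threshold are arbitrary illustrations, not a service-defined rule.

```python
def diagnose_losses(train_loss, validation_loss, gap_ratio=1.5):
    """Classify a run as 'healthy', 'overfitting', or 'not converging'."""
    if len(train_loss) != len(validation_loss) or len(train_loss) < 2:
        raise ValueError("need two equal-length lists with at least 2 points")
    # Training loss should end lower than it started.
    if train_loss[-1] >= train_loss[0]:
        return "not converging"
    # A final validation loss far above the final training loss signals overfitting.
    if validation_loss[-1] > gap_ratio * train_loss[-1]:
        return "overfitting"
    return "healthy"
```

A check like this only flags candidates for review. The last two rows of the table, held-out test answers and an A/B comparison against the base model with a good prompt, still decide whether the custom model ships.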
When iterating, you can also continue fine-tuning from a previously fine-tuned model (incremental fine-tuning) by passing the earlier custom model as the base — useful for adding new examples without retraining from scratch. Always validate against a held-out test set after each round so you catch regressions early.
Common AI-102 fine-tuning distractors
| Wrong option you'll see | Why it's wrong |
|---|---|
| "Fine-tune to add this week's data" | Snapshot in time; use RAG |
| "Raise temperature for consistent format" | Higher temperature reduces consistency |
| "Use CSV / a single JSON array" | Must be JSONL, one object per line |
| "Just deploy the job" | Training produces a model ID; you still deploy it |
| "Fine-tune with 5 examples" | Below the 10-example minimum |
On the Exam: Know the four-step lifecycle (upload → create job → monitor → deploy), the JSONL shape, that fine-tuned models cost more per token and incur hosting charges, and that a validation/training loss gap signals overfitting. If a scenario needs fresh facts, the answer is RAG, not a new fine-tune; if it needs a reliable format or tone that prompts can't pin down, fine-tuning is the right call.
What file format is required for Azure OpenAI fine-tuning training data?
A company needs its assistant to answer using this week's constantly-updated product inventory. Which approach fits best?
During fine-tuning, validation loss is much higher than training loss while training loss keeps dropping. What does this indicate?
After a fine-tuning job reports status 'succeeded', what must you do before calling the custom model?