6.4 Fine-Tuning and Model Customization

Key Takeaways

  • Fine-tuning continues training a base model on your labeled examples to lock in a style, format, or task — it does NOT teach new live facts; use RAG for that.
  • Training data is JSONL: one line per example, each a {"messages":[...]} object with system/user/assistant roles mirroring inference-time prompts.
  • Minimum 10 examples; Microsoft recommends 50-100+ high-quality, representative examples and a held-out validation split.
  • Workflow: upload file -> create fine-tuning job (set n_epochs, batch_size, learning_rate_multiplier or 'auto') -> monitor loss -> deploy the resulting custom model.
  • Fine-tuned deployments bill at a higher per-token rate (plus training and hosting cost) than base models, so justify them against prompt engineering and RAG first.

Last updated: June 2026

Quick Answer: Fine-tuning customizes a base model on your JSONL examples to enforce a consistent format, style, or task and to shrink prompts. Use it only when prompt engineering and RAG fall short. Minimum 10 examples; 50-100+ recommended. Fine-tuning does not teach new facts — that is RAG's job. Custom models cost more per token than base models.

Fine-Tune vs. Prompt vs. RAG — the Decision

Need | Best tool
Consistent output format prompting can't pin down | Fine-tune
Specific tone/style every time | Fine-tune
Cut a long few-shot prompt to save tokens | Fine-tune
Up-to-date or proprietary facts | RAG
One-off or easily prompted task | Prompt engineering
Fewer than ~10 quality examples | Not fine-tuning yet

On the Exam: The single most common fine-tuning question is fine-tune vs. RAG. If the requirement mentions current, changing, or factual knowledge → RAG. If it mentions format/style/consistency that prompts can't achieve → fine-tune. Fine-tuning is a point-in-time snapshot; it never refreshes itself.

JSONL Training Data Contract

Each line is one full example whose messages mirror what you'll send at inference:

{"messages":[{"role":"system","content":"You are a Contoso support agent. Be concise and professional."},{"role":"user","content":"How do I reset my password?"},{"role":"assistant","content":"Go to contoso.com/reset, enter your email, then click Reset. The link expires in 24 hours."}]}
{"messages":[{"role":"system","content":"You are a Contoso support agent. Be concise and professional."},{"role":"user","content":"What is the refund window?"},{"role":"assistant","content":"Returns are accepted within 30 days for a full refund if items are unused."}]}

Requirement | Minimum | Recommended
Examples | 10 | 50-100+
Format | JSONL (one object/line) | Identical schema each line
Validation set | Optional (auto-split) | 10-20% held out
Quality | Correct & representative | Gold-standard, human-reviewed

Trap: the assistant message in each line must be the ideal answer — fine-tuning copies whatever you put there, errors included. Keep the system prompt identical to production.
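
The contract above can be checked mechanically before you upload anything. The following is a minimal sketch, not an official validator: `validate_jsonl`, `ALLOWED_ROLES`, and `MIN_EXAMPLES` are hypothetical names introduced here, and the checks cover only the rules stated in this section (one JSON object per line, a `messages` list, known roles, an assistant answer last, at least 10 examples).

```python
import json

# Assumptions for illustration: role names and the 10-example minimum come from this section.
ALLOWED_ROLES = ("system", "user", "assistant")
MIN_EXAMPLES = 10


def validate_jsonl(text: str) -> list[str]:
    """Return a list of problems found; an empty list means every check passed."""
    problems: list[str] = []
    lines = [ln for ln in text.splitlines() if ln.strip()]
    for n, line in enumerate(lines, start=1):
        try:
            example = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"example {n}: not valid JSON ({exc.msg})")
            continue
        messages = example.get("messages") if isinstance(example, dict) else None
        if not isinstance(messages, list) or not messages:
            problems.append(f"example {n}: expected an object with a non-empty 'messages' list")
            continue
        if not all(
            isinstance(m, dict)
            and m.get("role") in ALLOWED_ROLES
            and isinstance(m.get("content"), str)
            for m in messages
        ):
            problems.append(f"example {n}: each message needs a known role and string content")
            continue
        if messages[-1]["role"] != "assistant":
            problems.append(f"example {n}: last message should be the assistant's ideal answer")
    if len(lines) < MIN_EXAMPLES:
        problems.append(f"only {len(lines)} examples; need at least {MIN_EXAMPLES}")
    return problems
```

A file saved as a single JSON array fails the per-line object check, which matches the "must be JSONL" rule. The function cannot judge whether an assistant answer is actually correct; that still needs human review.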

Why example quality dominates

Fine-tuning does not reason about your examples; it nudges the model's weights toward reproducing them. Three consequences follow. First, consistency beats volume: a small set of clean, on-style examples (say 60) typically serves better than a few hundred noisy ones. Second, the distribution must match production: if real users ask short questions but your data contains only long ones, the tuned model is biased toward long-form inputs and may handle short questions poorly. Third, garbage in, garbage out: a few mislabeled answers teach the model the wrong behavior, and the training job gives no warning. Curate, deduplicate, and spot-check the JSONL before spending money on a training run.
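
The deduplication and held-out split mentioned above can be sketched in a few lines. This is one possible approach under stated assumptions: `dedupe_and_split` is a hypothetical helper, duplicates are detected by exact match after JSON normalization (near-duplicates are not caught), and the 20% default follows the 10-20% recommendation from the requirements table.

```python
import json
import random


def dedupe_and_split(lines, validation_fraction=0.2, seed=0):
    """Drop exact-duplicate examples, then hold out a validation slice.

    Returns (training_lines, validation_lines).
    """
    seen = set()
    unique = []
    for line in lines:
        if not line.strip():
            continue
        # Normalize key order and spacing so formatting differences do not hide duplicates.
        key = json.dumps(json.loads(line), sort_keys=True)
        if key not in seen:
            seen.add(key)
            unique.append(line.strip())
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(unique)
    n_val = max(1, round(len(unique) * validation_fraction)) if len(unique) > 1 else 0
    return unique[n_val:], unique[:n_val]
```

Splitting after deduplication matters: if the same example lands in both sets, validation loss looks better than it should and hides overfitting.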

Cost reality check

Fine-tuning carries three cost layers the exam expects you to weigh: a one-time training cost (priced per 1,000 training tokens × epochs), an ongoing hosting cost for keeping the custom deployment available, and a higher per-token inference rate than the base model. Because of this, the recommended order of escalation is always prompt engineering → RAG → fine-tuning: only fine-tune once cheaper, faster-to-iterate options demonstrably fail to meet the format or style requirement.
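
The three cost layers can be added up with simple arithmetic. The sketch below is illustrative only: `estimate_fine_tuning_cost` is a hypothetical helper, every price in the example call is a made-up placeholder rather than a published rate, and inference is simplified to a single blended per-token price.

```python
def estimate_fine_tuning_cost(
    training_tokens: int,
    n_epochs: int,
    training_price_per_1k: float,
    hosting_price_per_hour: float,
    hosted_hours: float,
    inference_tokens: int,
    inference_price_per_1k: float,
) -> dict[str, float]:
    """Sum the three cost layers: one-time training, ongoing hosting, per-token inference."""
    training = training_tokens / 1000 * n_epochs * training_price_per_1k
    hosting = hosting_price_per_hour * hosted_hours
    inference = inference_tokens / 1000 * inference_price_per_1k
    return {
        "training": training,
        "hosting": hosting,
        "inference": inference,
        "total": training + hosting + inference,
    }


# Hypothetical prices, for illustration only. Check the current pricing page for real rates.
example = estimate_fine_tuning_cost(
    training_tokens=200_000, n_epochs=3, training_price_per_1k=0.003,
    hosting_price_per_hour=1.70, hosted_hours=24 * 30,
    inference_tokens=5_000_000, inference_price_per_1k=0.0005,
)
for layer, cost in example.items():
    print(f"{layer}: ${cost:,.2f}")
```

With numbers shaped like these, hosting tends to dominate for low-traffic deployments because it accrues whether or not the model is called, which is one reason to exhaust prompt engineering and RAG first.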

The Fine-Tuning Workflow

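import os
import time

from openai import AzureOpenAI

# Assumes the endpoint and key are set in these environment variables;
# the api_version is an example value, use one your resource supports.
client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-10-21",
)

# 1. Upload training (and optional validation) data
with open("train.jsonl", "rb") as f:
    train = client.files.create(file=f, purpose="fine-tune")

# 2. Create the job
job = client.fine_tuning.jobs.create(
    training_file=train.id,
    model="gpt-4o-mini-2024-07-18",
    hyperparameters={"n_epochs": 3, "batch_size": "auto",
                     "learning_rate_multiplier": "auto"})

# 3. Monitor until the job reaches a terminal status ("succeeded" is the goal)
while job.status not in ("succeeded", "failed", "cancelled"):
    time.sleep(60)
    job = client.fine_tuning.jobs.retrieve(job.id)
print(job.status, job.fine_tuned_model)

# 4. Deploy the resulting custom model as a NEW deployment (Azure CLI, run in a shell)
az cognitiveservices account deployment create \
  --name my-openai-service --resource-group rg-ai-prod \
  --deployment-name contoso-support \
  --model-name <fine-tuned-model-id> --model-version "1" --model-format OpenAI \
  --sku-name Standard --sku-capacity 10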

The job runs server-side; you poll until the status is succeeded (or failed), or read the job's events to follow progress. The output is a new model ID that you must deploy like any other model before calling it. Training alone does not create an endpoint.
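
Once the custom model is deployed, you call it like any other chat deployment, passing the deployment name as `model`. The sketch below assumes the `contoso-support` deployment name from the CLI step and reuses the system prompt from the training data. `build_messages` and `ask` are hypothetical helpers, and `client` is expected to be an already configured `AzureOpenAI` client.

```python
# The system prompt is copied from the training examples so inference mirrors training.
SYSTEM_PROMPT = "You are a Contoso support agent. Be concise and professional."


def build_messages(question: str) -> list[dict[str, str]]:
    """Build the same message shape the model saw during fine-tuning."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": question},
    ]


def ask(client, deployment_name: str, question: str) -> str:
    """Send one question to a deployed model and return the answer text."""
    response = client.chat.completions.create(
        model=deployment_name,  # on Azure OpenAI this is the deployment name
        messages=build_messages(question),
    )
    return response.choices[0].message.content
```

A call would look like `ask(client, "contoso-support", "How do I reset my password?")`. Keeping the system prompt in one constant makes it harder for training data and production prompts to drift apart.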

Hyperparameters

Parameter | Meaning | Default
n_epochs | Passes over the dataset | auto (~3-4)
batch_size | Examples per step | auto
learning_rate_multiplier | Scales the learning rate | auto

More epochs fit the data harder; too many cause overfitting (memorizing training answers, failing on new inputs). Start with auto and only tune if evaluation shows under- or over-fitting.

Evaluating the Result

Signal | Healthy | Warning
Training loss | Decreases, then plateaus | Rising / never settles
Validation loss | Tracks training loss | Large gap above training = overfitting
Held-out test answers | Correct on unseen prompts | Right on training, wrong on new = overfit
A/B vs. base + prompt | Custom wins on the metric | No improvement = don't ship it
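
The loss signals in this table can be turned into a rough automated check. This is a sketch under assumptions: `diagnose_losses` is a hypothetical helper, the inputs are per-step or per-epoch loss values you have already collected from the job's metrics, and the `gap_ratio` threshold of 1.5 is an arbitrary illustrative choice rather than an official cutoff.

```python
def diagnose_losses(train_loss: list[float], val_loss: list[float], gap_ratio: float = 1.5) -> str:
    """Classify a pair of loss curves using the first and last recorded values."""
    if not train_loss or not val_loss:
        raise ValueError("both loss series must be non-empty")
    # Training loss should end lower than it started.
    if train_loss[-1] >= train_loss[0]:
        return "training loss not decreasing"
    # Validation loss far above training loss suggests memorization of the training set.
    if val_loss[-1] > gap_ratio * train_loss[-1]:
        return "possible overfitting"
    return "healthy"


print(diagnose_losses([2.0, 1.0, 0.2], [2.1, 1.5, 1.6]))
```

A check like this only flags curves worth a closer look. Held-out test prompts and an A/B comparison against the base model with a good prompt remain the deciding evidence.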

When iterating, you can also continue fine-tuning from a previously fine-tuned model (incremental fine-tuning) by passing the earlier custom model as the base — useful for adding new examples without retraining from scratch. Always validate against a held-out test set after each round so you catch regressions early.
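
An incremental round can be sketched as follows, assuming a configured client and an already uploaded training file. `continue_fine_tuning`, `exact_match_rate`, and `regressed` are hypothetical helpers, exact string match is only one possible metric, and the 0.02 tolerance is an arbitrary illustrative value.

```python
def continue_fine_tuning(client, previous_model_id: str, training_file_id: str):
    """Start a new job whose base model is an earlier fine-tuned model."""
    return client.fine_tuning.jobs.create(
        training_file=training_file_id,
        model=previous_model_id,  # the custom model ID produced by the previous round
    )


def exact_match_rate(predictions: list[str], references: list[str]) -> float:
    """Fraction of held-out answers that match the reference after trimming whitespace."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    matches = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return matches / len(references)


def regressed(previous_rate: float, new_rate: float, tolerance: float = 0.02) -> bool:
    """True when the new round scores meaningfully worse on the same held-out set."""
    return new_rate < previous_rate - tolerance
```

Scoring every round against the same held-out set is what makes the comparison meaningful; if the test set changes between rounds, a drop in score may reflect the data rather than the model.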

Common AI-102 fine-tuning distractors

Wrong option you'll see | Why it's wrong
"Fine-tune to add this week's data" | Snapshot in time; use RAG
"Raise temperature for consistent format" | Higher temperature reduces consistency
"Use CSV / a single JSON array" | Must be JSONL, one object per line
"Just deploy the job" | Training produces a model ID; you still deploy it
"Fine-tune with 5 examples" | Below the 10-example minimum

On the Exam: Know the four-step lifecycle (upload → create job → monitor → deploy), the JSONL shape, that fine-tuned models cost more per token and incur hosting charges, and that a validation/training loss gap signals overfitting. If a scenario needs fresh facts, the answer is RAG, not a new fine-tune; if it needs a reliable format or tone that prompts can't pin down, fine-tuning is the right call.

Test Your Knowledge

What file format is required for Azure OpenAI fine-tuning training data?


A company needs its assistant to answer using this week's constantly-updated product inventory. Which approach fits best?


During fine-tuning, validation loss is much higher than training loss while training loss keeps dropping. What does this indicate?


After a fine-tuning job reports status 'succeeded', what must you do before calling the custom model?
