What file format is required for Azure OpenAI fine-tuning training data?

JSONL — one JSON object per line, each with a messages array. Fine-tuning data must be JSONL, where every line is a standalone JSON object containing a messages array of system, user, and assistant turns. A single combined JSON array, CSV, or XML is not accepted by the fine-tuning job.

A company needs its assistant to answer using this week's constantly-updated product inventory. Which approach fits best?

Use RAG to retrieve current inventory at query time. Fine-tuning produces a point-in-time snapshot and cannot stay current with constantly changing data. RAG retrieves the latest inventory from a data source at query time, so answers reflect whatever the data source currently holds, with no retraining.

During fine-tuning, validation loss is much higher than training loss while training loss keeps dropping. What does this indicate?

Overfitting; the model is memorizing training data and generalizing poorly. A large gap where validation loss exceeds steadily falling training loss is the classic overfitting signature: the model memorizes the training examples but performs worse on unseen data. Reducing epochs or adding more varied examples typically helps.

After a fine-tuning job reports status 'succeeded', what must you do before calling the custom model?

Create a new deployment that points to the fine-tuned model ID. Training only produces a model ID; it does not expose an endpoint. You must create a new deployment referencing that fine-tuned model ID, exactly as you deploy any base model, before the customized model can serve requests.

Fine-Tuning and Model Customization — Free Study Guide 2026

Quick Answer: Fine-tuning customizes a base model on your JSONL examples to enforce a consistent format, style, or task and to shrink prompts. Use it only when prompt engineering and RAG fall short. Minimum 10 examples; 50-100+ recommended. Fine-tuning does not teach new facts — that is RAG's job. Custom models cost more per token than base models.

Fine-Tune vs. Prompt vs. RAG — the Decision

Need	Best tool
Consistent output format prompting can't pin down	Fine-tune
Specific tone/style every time	Fine-tune
Cut a long few-shot prompt to save tokens	Fine-tune
Up-to-date or proprietary facts	RAG
One-off or easily prompted task	Prompt engineering
Fewer than ~10 quality examples	Not fine-tuning yet

On the Exam: The single most common fine-tuning question is fine-tune vs. RAG. If the requirement mentions current, changing, or factual knowledge → RAG. If it mentions format/style/consistency that prompts can't achieve → fine-tune. Fine-tuning is a point-in-time snapshot; it never refreshes itself.

JSONL Training Data Contract

Each line is one full example whose messages mirror what you'll send at inference:

{"messages":[{"role":"system","content":"You are a Contoso support agent. Be concise and professional."},{"role":"user","content":"How do I reset my password?"},{"role":"assistant","content":"Go to contoso.com/reset, enter your email, then click Reset. The link expires in 24 hours."}]}
{"messages":[{"role":"system","content":"You are a Contoso support agent. Be concise and professional."},{"role":"user","content":"What is the refund window?"},{"role":"assistant","content":"Returns are accepted within 30 days for a full refund if items are unused."}]}

Requirement	Minimum	Recommended
Examples	10	50-100+
Format	JSONL (one object/line)	Identical schema each line
Validation set	Optional (auto-split)	10-20% held out
Quality	Correct & representative	Gold-standard, human-reviewed

Trap: the assistant message in each line must be the ideal answer — fine-tuning copies whatever you put there, errors included. Keep the system prompt identical to production.
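The contract above can be checked mechanically before uploading. The following is a minimal sketch, not an official validator: the function name `validate_jsonl`, the problem messages, and the rule that the last message must be the assistant answer are assumptions made for illustration. The 10-example minimum comes from the table above.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}
MIN_EXAMPLES = 10  # minimum example count stated in the table above


def _is_message(item):
    """True if item looks like {"role": <allowed role>, "content": <str>}."""
    return (
        isinstance(item, dict)
        and item.get("role") in ALLOWED_ROLES
        and isinstance(item.get("content"), str)
    )


def validate_jsonl(lines):
    """Return (line_number, problem) tuples; line 0 marks a file-level problem."""
    problems = []
    valid = 0
    for number, raw in enumerate(lines, start=1):
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            problems.append((number, "not valid JSON"))
            continue
        # A JSON array (or any non-object) on a line fails this check.
        messages = record.get("messages") if isinstance(record, dict) else None
        if not isinstance(messages, list) or not messages:
            problems.append((number, "missing or empty messages array"))
            continue
        if not all(_is_message(m) for m in messages):
            problems.append((number, "message with bad role or non-string content"))
            continue
        if messages[-1]["role"] != "assistant":
            problems.append((number, "last message is not the assistant answer"))
            continue
        valid += 1
    if valid < MIN_EXAMPLES:
        problems.append((0, f"only {valid} valid examples; need at least {MIN_EXAMPLES}"))
    return problems
```

A typical call would pass an open file, for example `validate_jsonl(open("train.jsonl", encoding="utf-8"))`, and treat an empty result as "safe to upload". The check covers structure only; whether each assistant answer is actually the ideal answer still needs human review.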

Why example quality dominates

Fine-tuning does not reason about your examples; it nudges the model's weights toward reproducing them. Three consequences follow. First, consistency beats volume: a small set of clean, on-style examples (say 60) typically outperforms a much larger noisy one (say 300). Second, the distribution must match production: if real users ask short questions but your data contains only long ones, the tuned model is optimized for inputs it will rarely see. Third, garbage in, garbage out: a few mislabeled answers teach the model the wrong behavior with no warning. Curate, deduplicate, and spot-check the JSONL before spending money on a training run.
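The deduplication step, together with the 10-20% hold-out recommended earlier, might look like the sketch below. The function name `dedupe_and_split`, the 15% default, and the fixed seed are illustrative assumptions. Only exact duplicates are caught; near-duplicates and wrong answers still need a manual spot-check.

```python
import json
import random


def dedupe_and_split(lines, validation_fraction=0.15, seed=0):
    """Drop exact-duplicate examples, then hold out a validation slice.

    Returns (train_lines, validation_lines) as lists of JSONL strings.
    """
    seen = set()
    unique = []
    for raw in lines:
        if not raw.strip():
            continue  # ignore blank lines
        # Normalize key order so formatting differences do not hide duplicates.
        key = json.dumps(json.loads(raw), sort_keys=True)
        if key not in seen:
            seen.add(key)
            unique.append(raw.strip())
    # Shuffle with a fixed seed so the split is reproducible between runs.
    random.Random(seed).shuffle(unique)
    n_validation = round(len(unique) * validation_fraction)
    return unique[n_validation:], unique[:n_validation]
```

Writing the two returned lists to separate files (for example `train.jsonl` and `validation.jsonl`, both hypothetical names) keeps the held-out examples out of training so they can reveal overfitting later.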

Cost reality check

Fine-tuning carries three cost layers the exam expects you to weigh: a one-time training cost (priced per 1,000 training tokens × epochs), an ongoing hosting cost for keeping the custom deployment available, and a higher per-token inference rate than the base model. Because of this, the recommended order of escalation is always prompt engineering → RAG → fine-tuning: only fine-tune once cheaper, faster-to-iterate options demonstrably fail to meet the format or style requirement.
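To make the three layers concrete, here is a rough cost calculator. Every price passed in is a placeholder you must look up on the current pricing page, and billing hosting by the hour is an assumption of this sketch; the function and parameter names are hypothetical.

```python
def fine_tuning_cost(training_tokens, epochs, train_price_per_1k,
                     hosting_hours, hosting_price_per_hour,
                     inference_tokens, inference_price_per_1k):
    """Estimate the three cost layers of a fine-tuned deployment.

    All prices are caller-supplied placeholders, not real Azure rates.
    """
    # One-time: every epoch is a full pass over the training tokens.
    training = training_tokens / 1000 * epochs * train_price_per_1k
    # Ongoing: the deployment accrues cost while it exists (assumed hourly).
    hosting = hosting_hours * hosting_price_per_hour
    # Usage: tokens processed by the custom model at its per-token rate.
    inference = inference_tokens / 1000 * inference_price_per_1k
    return {
        "training": training,
        "hosting": hosting,
        "inference": inference,
        "total": training + hosting + inference,
    }
```

Running the same function with the base model's inference rate and zero training and hosting cost gives the baseline to compare against, which is the comparison the escalation order above asks you to make.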

The Fine-Tuning Workflow

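# Setup: create the Azure OpenAI client. The environment variable names and
# API version below are assumptions; substitute your resource's own values.
import os
import time
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-10-21")

# 1. Upload training (and optional validation) data
with open("train.jsonl", "rb") as f:
    train = client.files.create(file=f, purpose="fine-tune")

# 2. Create the job
job = client.fine_tuning.jobs.create(
    training_file=train.id,
    model="gpt-4o-mini-2024-07-18",
    hyperparameters={"n_epochs": 3, "batch_size": "auto",
                     "learning_rate_multiplier": "auto"})

# 3. Monitor: poll until the job reaches a terminal status
while job.status not in ("succeeded", "failed", "cancelled"):
    time.sleep(60)
    job = client.fine_tuning.jobs.retrieve(job.id)
print(job.status, job.fine_tuned_model)

# 4. Deploy the resulting custom model as a NEW deployment
#    (Azure CLI: run this in a shell, not in Python)
az cognitiveservices account deployment create \
  --name my-openai-service --resource-group rg-ai-prod \
  --deployment-name contoso-support \
  --model-name <fine-tuned-model-id> --model-version "1" \
  --model-format OpenAI \
  --sku-name Standard --sku-capacity 10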

The job runs server-side; you poll until succeeded (or read events). The output is a new model ID you must deploy like any other before calling it — training alone does not create an endpoint.
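Once the deployment exists, requests target the deployment name rather than the fine-tuned model ID. The helper below is a small sketch of that call; `ask_custom_model` is a hypothetical name, and `client` is assumed to be an `AzureOpenAI` client from the `openai` package.

```python
def ask_custom_model(client, deployment_name, question, system_prompt):
    """Send one chat request to a fine-tuned deployment and return the reply text."""
    response = client.chat.completions.create(
        model=deployment_name,  # the deployment name, not the fine-tuned model ID
        messages=[
            # Use the same system prompt the training examples used.
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content
```

For the deployment created in step 4, a call could look like `ask_custom_model(client, "contoso-support", "How do I reset my password?", "You are a Contoso support agent. Be concise and professional.")`.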

Hyperparameters

Parameter	Meaning	Default
n_epochs	Passes over the dataset	auto (~3-4)
batch_size	Examples per step	auto
learning_rate_multiplier	Scales the learning rate	auto

More epochs fit the data harder; too many cause overfitting (memorizing training answers, failing on new inputs). Start with auto and only tune if evaluation shows under- or over-fitting.

Evaluating the Result

Signal	Healthy	Warning
Training loss	Decreases, then plateaus	Rising / never settles
Validation loss	Tracks training loss	Large gap above training = overfitting
Held-out test answers	Correct on unseen prompts	Right on training, wrong on new = overfit
A/B vs. base + prompt	Custom wins on the metric	No improvement = don't ship it
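The two loss signals in the table can be turned into a quick automated check. This is only a sketch: `diagnose_losses` is a hypothetical helper, and the `gap_ratio` threshold of 1.5 is an arbitrary illustrative value, not a documented cutoff. The inputs are assumed to be per-checkpoint loss values read from the job's result metrics.

```python
def diagnose_losses(train_loss, val_loss, gap_ratio=1.5):
    """Classify a pair of loss curves using the signals from the table.

    gap_ratio is an illustrative threshold: validation loss more than
    gap_ratio times the training loss is treated as a "large gap".
    """
    if len(train_loss) < 2 or len(train_loss) != len(val_loss):
        raise ValueError("need two equal-length curves with at least 2 points")
    # Warning sign from the table: training loss rising or never settling.
    if train_loss[-1] >= train_loss[0]:
        return "training loss not decreasing"
    # Training loss fell, but validation loss sits far above it: overfitting.
    if val_loss[-1] > gap_ratio * train_loss[-1]:
        return "overfitting"
    return "healthy"
```

A result of "overfitting" suggests fewer epochs or more varied examples for the next run; a "healthy" result still needs the held-out test and the A/B comparison against the base model before shipping.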

When iterating, you can also continue fine-tuning from a previously fine-tuned model (incremental fine-tuning) by passing the earlier custom model as the base — useful for adding new examples without retraining from scratch. Always validate against a held-out test set after each round so you catch regressions early.

Common AI-102 fine-tuning distractors

Wrong option you'll see	Why it's wrong
"Fine-tune to add this week's data"	Snapshot in time; use RAG
"Raise temperature for consistent format"	Higher temperature reduces consistency
"Use CSV / a single JSON array"	Must be JSONL, one object per line
"Just deploy the job"	Training produces a model ID; you still deploy it
"Fine-tune with 5 examples"	Below the 10-example minimum

On the Exam: Know the four-step lifecycle (upload → create job → monitor → deploy), the JSONL shape, that fine-tuned models cost more per token and incur hosting charges, and that a validation/training loss gap signals overfitting. If a scenario needs fresh facts, the answer is RAG, not a new fine-tune; if it needs a reliable format or tone that prompts can't pin down, fine-tuning is the right call.

Azure AI Engineer Associate

Azure AI-102

6.4 Fine-Tuning and Model Customization
