3.5 Practitioner MLOps: Repeatability, Monitoring, and Retraining
Key Takeaways
- MLOps is the discipline of making ML work repeatable, governed, monitored, and improvable after launch.
- Practitioners should understand artifacts, versions, approvals, monitoring, rollback, and retraining triggers without needing to build pipelines.
- Repeatability protects the business: teams must be able to show which data, code, parameters, model version, and approval produced the model running in production.
- Retraining should be triggered by evidence such as drift, new labels, changed business rules, incidents, or scheduled refresh needs.
MLOps without overbuilding
MLOps combines ML development practices with operations, governance, automation, monitoring, and continuous improvement. For the AWS Certified AI Practitioner, this is a recognition topic. You should know why MLOps matters and what good controls look like, but you are not expected to build SageMaker Pipelines, implement CI/CD infrastructure, or tune training jobs in detail.
A model is not just a file. A production ML system includes data sources, preprocessing steps, feature definitions, training code or service configuration, model artifacts, evaluation results, approvals, deployment configuration, monitoring rules, logs, user feedback, and retraining decisions. If the team cannot explain which version is live and why it was approved, the organization has an operational risk.
Repeatability means the team can trace how a model was created and reproduce the process when needed. This does not always mean every training run gives bit-for-bit identical results, because some algorithms and distributed systems have randomness. It means the organization tracks enough information to understand the lineage: dataset version, code version, configuration, parameters, training environment, evaluation report, approver, deployment time, and rollback target.
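A minimal sketch of what such a lineage record might capture, written as a plain Python structure; the field names and values are illustrative assumptions, not a SageMaker or registry schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class ModelLineageRecord:
    """Illustrative lineage record; fields are assumptions, not an AWS schema."""
    model_version: str          # e.g. a model package version in a registry
    dataset_version: str        # e.g. an S3 prefix, object version, or snapshot date
    code_commit: str            # source control revision that produced the training code
    training_config: dict       # hyperparameters and environment details
    evaluation_report_uri: str  # where the evaluation results are stored
    approver: str               # who accepted the model for production
    approved_at: datetime
    rollback_target: str        # previous model version to fall back to

record = ModelLineageRecord(
    model_version="fraud-scoring-v7",
    dataset_version="s3://example-bucket/training/2024-05-01/",
    code_commit="9f3c2ab",
    training_config={"max_depth": 6, "eta": 0.2},
    evaluation_report_uri="s3://example-bucket/reports/v7/eval.json",
    approver="risk-review-board",
    approved_at=datetime(2024, 5, 10, tzinfo=timezone.utc),
    rollback_target="fraud-scoring-v6",
)
```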
| MLOps concern | Practitioner question | AWS context to recognize |
|---|---|---|
| Versioning | Which model, data, and code version is live? | SageMaker Model Registry, S3 versioning, source control |
| Approval | Who accepted the model for production? | Model registry approval states, change management |
| Deployment | How is the model served or consumed? | SageMaker endpoints, batch transform, managed AI API, application integration |
| Monitoring | What alerts if behavior changes? | CloudWatch, SageMaker Model Monitor, application logs |
| Governance | Is access least privilege and auditable? | IAM, CloudTrail, KMS, tagging, data governance |
| Retraining | What evidence triggers refresh? | Drift, new labels, scheduled review, incidents |
| Rollback | What happens if the model fails? | Previous model version, rules fallback, manual process |
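The Versioning and Approval rows map to concrete calls a team can make. A minimal sketch, assuming a hypothetical model package group named "fraud-scoring" already exists in the SageMaker Model Registry, of asking which versions carry an Approved status:

```python
import boto3

# Ask the SageMaker Model Registry which versions in a group are approved.
# "fraud-scoring" is a placeholder model package group name.
sm = boto3.client("sagemaker")

response = sm.list_model_packages(
    ModelPackageGroupName="fraud-scoring",
    ModelApprovalStatus="Approved",
    SortBy="CreationTime",
    SortOrder="Descending",
)

for pkg in response["ModelPackageSummaryList"]:
    print(pkg["ModelPackageArn"], pkg["ModelApprovalStatus"], pkg["CreationTime"])
```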
Monitoring as an operating agreement
Monitoring should be defined before launch, not after complaints arrive. Technical teams may monitor latency, error rates, endpoint CPU or GPU utilization, failed invocations, and request volume. Data teams may monitor missing values, schema changes, feature distributions, and prediction drift. Business teams may monitor conversion, fraud loss, review volume, customer satisfaction, or false-positive complaints.
The practitioner should insist on owners for each metric. An alarm that no one reviews is theater. A dashboard with no threshold is weak. A model that collects feedback but never uses it is not improving. The operating agreement should say what is measured, where it is measured, who responds, how quickly they respond, and what action they can take.
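As a sketch of what a threshold with an owner can look like in practice, the following uses CloudWatch to alarm on endpoint invocation errors and notify an SNS topic; the endpoint name, thresholds, and topic ARN are placeholders, and the real values belong in the operating agreement.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm on sustained invocation errors for a placeholder endpoint;
# the SNS topic stands in for the named owner who responds.
cloudwatch.put_metric_alarm(
    AlarmName="fraud-endpoint-5xx-errors",
    Namespace="AWS/SageMaker",
    MetricName="Invocation5XXErrors",
    Dimensions=[
        {"Name": "EndpointName", "Value": "fraud-scoring-endpoint"},
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    Statistic="Sum",
    Period=300,                    # five-minute windows
    EvaluationPeriods=3,           # three consecutive windows in breach
    Threshold=10.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ml-oncall"],
)
```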
Monitoring also includes responsible AI concerns. If outputs affect people differently across segments, the team needs a way to measure and address that. If a generative AI application produces unsafe or unsupported responses, the team needs logging, user feedback, guardrails, and escalation. If sensitive data appears in prompts or logs, security and privacy teams need retention and access controls.
Retraining triggers
Retraining is the process of updating a model with new or improved data. It should not happen randomly or only because a calendar says so. A calendar may be useful for review, but retraining should be tied to evidence and business need. Common triggers include performance degradation, detected drift, new labeled data, changed product catalogs, changed fraud patterns, policy changes, new regions, new customer segments, or a major incident.
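Detected drift is often quantified with a simple distribution comparison. The sketch below uses the population stability index, a common drift measure; the samples, bin count, and 0.25 cutoff are illustrative assumptions rather than recommended settings.

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """Compare two samples of one feature; a higher PSI suggests more drift."""
    cuts = np.quantile(expected, np.linspace(0, 1, bins + 1))
    expected_frac = np.histogram(expected, bins=cuts)[0] / len(expected)
    actual_frac = np.histogram(np.clip(actual, cuts[0], cuts[-1]), bins=cuts)[0] / len(actual)
    expected_frac = np.clip(expected_frac, 1e-6, None)  # avoid log(0) and division by zero
    actual_frac = np.clip(actual_frac, 1e-6, None)
    return float(np.sum((actual_frac - expected_frac) * np.log(actual_frac / expected_frac)))

# Stand-in data: training-time feature values versus recent production traffic.
rng = np.random.default_rng(0)
training_sample = rng.normal(0.0, 1.0, 5000)
production_sample = rng.normal(0.4, 1.2, 5000)

psi = population_stability_index(training_sample, production_sample)
if psi > 0.25:  # illustrative cutoff; the real one belongs in the operating agreement
    print("Large shift detected: investigate before considering retraining.")
```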
Retraining can also create risk. A new model can perform worse than the old one, break segment fairness, change user behavior, increase cost, or fail compliance review. The team should compare the candidate model with the current production model, review business metrics, approve the change, deploy in a controlled way, and keep a rollback option. For high-risk workflows, human review and phased rollout are often appropriate.
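A minimal sketch of such a comparison gate, with hypothetical metric names, segments, and tolerances; the real gate should reflect the metrics and segments named in the operating agreement.

```python
def candidate_passes_gate(candidate_metrics, production_metrics,
                          min_gain=0.01, max_segment_drop=0.02):
    """Illustrative gate: the candidate must improve overall quality without
    degrading any tracked segment beyond an agreed tolerance."""
    overall_gain = candidate_metrics["auc"] - production_metrics["auc"]
    if overall_gain < min_gain:
        return False
    for segment, prod_auc in production_metrics["segment_auc"].items():
        cand_auc = candidate_metrics["segment_auc"].get(segment, 0.0)
        if prod_auc - cand_auc > max_segment_drop:
            return False
    return True

production = {"auc": 0.91, "segment_auc": {"new_customers": 0.88, "returning": 0.93}}
candidate = {"auc": 0.93, "segment_auc": {"new_customers": 0.90, "returning": 0.92}}

if candidate_passes_gate(candidate, production):
    print("Candidate can go to human approval; keep the current version as rollback.")
else:
    print("Keep the production model; document why the candidate was rejected.")
```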
Retraining decision workflow
1. Detect drift, performance change, new labels, or business rule change.
2. Confirm the issue is not a data pipeline or application bug.
3. Prepare an updated dataset under approved governance controls.
4. Train or update the candidate model.
5. Evaluate against the current production model and segment requirements.
6. Approve, deploy, monitor, and keep rollback available.
Governance and security in the MLOps loop
MLOps is not only a data science practice. It intersects with security, compliance, and cost management. IAM should restrict who can access training data, invoke endpoints, approve models, update pipelines, and read logs. AWS KMS can protect data and artifacts with encryption keys. CloudTrail can help audit API activity. CloudWatch can collect logs and metrics. Tags can support ownership and cost reporting. Data governance tools such as Lake Formation can help control access to shared datasets.
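As one example of least privilege, a policy can allow only invocation of a single endpoint. A minimal boto3 sketch, where the account ID, region, endpoint name, and policy name are placeholders:

```python
import json
import boto3

iam = boto3.client("iam")

# Allow invocation of one named endpoint and nothing else.
policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "sagemaker:InvokeEndpoint",
            "Resource": "arn:aws:sagemaker:us-east-1:123456789012:endpoint/fraud-scoring-endpoint",
        }
    ],
}

iam.create_policy(
    PolicyName="invoke-fraud-scoring-endpoint-only",
    PolicyDocument=json.dumps(policy_document),
)
```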
Cost matters because ML workloads can scale quickly. Training jobs can use expensive compute. Real-time endpoints can run continuously. Foundation model calls can grow with token volume. Batch jobs can process more records than expected. A practitioner should ask for budget alarms, usage forecasts, and right-sized deployment patterns.
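A budget alarm can be expressed through AWS Budgets. A minimal sketch, assuming placeholder values for the account ID, spend limit, and notification address:

```python
import boto3

budgets = boto3.client("budgets")

# Alert when forecasted monthly ML spend passes 80% of a fixed limit.
budgets.create_budget(
    AccountId="123456789012",
    Budget={
        "BudgetName": "ml-workloads-monthly",
        "BudgetLimit": {"Amount": "5000", "Unit": "USD"},
        "TimeUnit": "MONTHLY",
        "BudgetType": "COST",
    },
    NotificationsWithSubscribers=[
        {
            "Notification": {
                "NotificationType": "FORECASTED",
                "ComparisonOperator": "GREATER_THAN",
                "Threshold": 80.0,
                "ThresholdType": "PERCENTAGE",
            },
            "Subscribers": [
                {"SubscriptionType": "EMAIL", "Address": "ml-owners@example.com"}
            ],
        }
    ],
)
```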
What good practitioner MLOps looks like
A healthy MLOps plan is understandable to business and technical stakeholders. It names the production model version, the data lineage, the approval path, the monitoring metrics, the responsible owner, the incident response path, and the retraining criteria. It also explains when the model should not be used.
A weak plan says the model is done after launch. It has no baseline, no owner, no drift monitoring, no feedback loop, no rollback plan, and no cost controls. In that situation, the practitioner should not approve expansion. The right next step is to define operating controls, narrow the use case, or keep the model in pilot until the team can run it responsibly.
Review questions
1. Which statement best describes practitioner-level MLOps?
2. A model has been in production for six months and input distributions have shifted. What is the best next step?
3. Which artifact is most useful for proving which model version was approved for production?