10.6 Cost, Performance, and Operations Review Lab
Key Takeaways
- AI cost review should include tokens, documents, pages, inference calls, training jobs, vector storage, data movement, monitoring, human review, and support operations.
- Performance review should focus on latency, throughput, concurrency, retrieval quality, model size, prompt length, fallback paths, and user experience thresholds.
- Operations review should define owners for monitoring, alerts, cost budgets, evaluation, incident response, model or data changes, and rollback.
- CloudWatch, CloudTrail, Cost Explorer, AWS Budgets, tagging, Trusted Advisor, Well-Architected Tool, and service metrics help convert AI pilots into manageable workloads.
- The best practitioner answer is often to reduce scope, use a smaller model, improve retrieval, shorten prompts, or add caching before buying more capacity.
Lab scenario: operations review for AI pilots
The company now has four AI pilots: a Bedrock support assistant, a Textract document workflow, a Personalize recommendation test, and a SageMaker Canvas forecasting experiment. Usage is growing, and finance asks why costs are rising. Support asks why assistant responses sometimes take too long. Security asks who reviews logs. Product asks whether the models are still accurate. The operations review turns a set of demos into managed workloads.
Start by listing cost drivers by service and workflow. Bedrock cost is often shaped by model choice, input tokens, output tokens, request volume, context length, provisioned or on-demand capacity choices where available, embeddings, retrieval, and evaluation. Textract cost can depend on document pages and feature type. Transcribe depends on audio duration. Personalize costs come from data import and processing, training, campaigns, and recommendation request volume. SageMaker AI can include notebook, training, endpoint, storage, and processing costs. Vector stores, S3, OpenSearch, logs, data transfer, and human review also matter.
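A simple spreadsheet-style model makes these drivers concrete. The sketch below estimates monthly cost for two of the pilots; every unit price and volume is a hypothetical placeholder, not published AWS pricing, and the function names are illustrative.

```python
# Per-workflow monthly cost model for two pilots.
# All unit prices and volumes are hypothetical placeholders,
# not published AWS pricing.

def bedrock_cost(requests, in_tokens, out_tokens,
                 price_in_per_1k, price_out_per_1k):
    """Inference cost: input and output tokens are priced separately."""
    return requests * (in_tokens / 1000 * price_in_per_1k
                       + out_tokens / 1000 * price_out_per_1k)

def textract_cost(pages, price_per_page):
    """Document cost scales with pages times the feature type's page price."""
    return pages * price_per_page

# Example month: the assistant dominates because of long prompts.
assistant = bedrock_cost(requests=50_000, in_tokens=3_000, out_tokens=400,
                         price_in_per_1k=0.003, price_out_per_1k=0.015)
documents = textract_cost(pages=20_000, price_per_page=0.015)
print(f"assistant ${assistant:,.0f}  documents ${documents:,.0f}")
```

Running the model per workflow makes it obvious which lever matters: here, cutting average input tokens in half reduces assistant cost far more than any change to the document pipeline.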
| Review area | What to measure | Common improvement |
|---|---|---|
| Model inference cost | Requests, tokens, output length, model choice, peak usage | Shorter prompts, smaller model, response limits, prompt reuse, or better routing. |
| Retrieval cost and quality | Embedding volume, vector storage, index refresh, retrieved chunks | Remove stale sources, improve metadata, tune chunking, reduce unnecessary context. |
| Document processing | Pages, retries, low-confidence review rate | Improve scan quality, split document types, route exceptions earlier. |
| Personalization and forecasting | Training frequency, dataset size, campaign usage, analyst experiments | Match refresh cadence to business change and retire unused experiments. |
| Monitoring and logs | Log volume, retention, dashboard count, alarm noise | Set retention, filter sensitive data, and keep actionable alerts. |
| Human operations | Review minutes, escalation queues, support tickets | Automate only low-risk steps and measure reviewer workload. |
Performance review should start from user expectations. A support agent drafting a reply may tolerate a few seconds if citations are useful. A checkout fraud screen may need near real-time response or a deterministic fallback. A document batch review may run asynchronously. An executive forecast refresh can run overnight. Do not use one latency target for every AI workload. Define the workflow threshold, then test p50, p95, and failure behavior under realistic volume.
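The per-workflow thresholds above can be sketched as a small check over measured latency samples. The threshold values, workflow names, and nearest-rank percentile method are all illustrative assumptions.

```python
# Per-workflow latency thresholds checked against measured samples.
# Thresholds, workflow names, and sample data are illustrative.

def percentile(samples, p):
    """Nearest-rank percentile over a list of latencies in seconds."""
    ordered = sorted(samples)
    k = max(0, round(p / 100 * len(ordered)) - 1)
    return ordered[k]

THRESHOLDS = {                 # seconds, per workflow, not per AI program
    "support_draft": {"p95": 5.0},     # a few seconds is fine with citations
    "fraud_screen":  {"p95": 0.3},     # near real time or deterministic fallback
    "doc_batch":     {"p95": 3600.0},  # asynchronous, overnight is acceptable
}

def within_threshold(workflow, samples):
    """True when the workflow's p95 sample meets its own target."""
    return percentile(samples, 95) <= THRESHOLDS[workflow]["p95"]
```

Testing p50 alone hides tail behavior: a fraud screen that is fast on average but slow for one request in ten still fails its p95 target, which is exactly the case the review should catch.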
For Bedrock applications, latency can be affected by model size, prompt length, output length, retrieval calls, tool calls, guardrails, and downstream APIs. A larger model may not be needed for simple classification or extraction. A long prompt that includes irrelevant case history increases cost and latency. A knowledge base that retrieves too many chunks can confuse the model and slow the response. A strong operator asks whether quality improves enough to justify each extra token and call.
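One cheap fix for the irrelevant-history problem is to build prompts from a case summary plus only the most recent turns. This is a minimal sketch: the turn limit, field names, and the 4-characters-per-token estimate are assumptions, not a real tokenizer or Bedrock API.

```python
# Trim irrelevant case history before building the prompt.
# MAX_HISTORY_TURNS and the token heuristic are assumptions.

MAX_HISTORY_TURNS = 3  # keep only the most recent exchanges

def estimate_tokens(text):
    return len(text) // 4  # rough heuristic, not a real tokenizer

def build_prompt(case_summary, history, question):
    """Concise summary plus recent turns instead of the full transcript."""
    recent = history[-MAX_HISTORY_TURNS:]
    parts = [f"Case summary: {case_summary}"]
    parts += [f"{turn['role']}: {turn['text']}" for turn in recent]
    parts.append(f"Question: {question}")
    return "\n".join(parts)
```

The same discipline applies to retrieval: capping retrieved chunks and output length bounds both cost and latency, and the evaluation set shows whether answer quality actually dropped.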
For operations, build an ownership map. Product owns whether the feature solves the business problem. Data owners own source freshness and data quality. Security owns access and incident review. Cloud operations owns alarms, dashboards, and cost guardrails. Model or application owners own prompt templates, evaluation sets, and release changes. Finance owns budget thresholds, but the workload team must explain the drivers. Without named owners, every AI issue becomes an unclear cross-team dispute.
Operations checklist:
- Tag AI resources by application, owner, environment, cost center, and data classification.
- Create AWS Budgets alerts for pilot and production spend thresholds.
- Use Cost Explorer to review service, Region, tag, and usage trends.
- Use CloudWatch metrics and logs for latency, errors, throttling, blocked prompts, and application outcomes.
- Use CloudTrail to investigate API activity and access patterns.
- Define log retention and encryption before storing prompts or responses.
- Keep evaluation sets for prompt, model, retrieval, and data-source changes.
- Document rollback: previous prompt, previous model, disabled action, or human-only workflow.
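The tagging item in the checklist is easy to verify mechanically. The sketch below checks an inventory against the required tag keys; the resource records and ARNs are fabricated examples, and in practice the inventory would come from a tool such as the Resource Groups Tagging API.

```python
# Tag-compliance check against the checklist's required keys.
# Resource records and ARNs below are fabricated examples.

REQUIRED_TAGS = {"application", "owner", "environment",
                 "cost-center", "data-classification"}

def missing_tags(resource):
    """Return the required tag keys this resource lacks."""
    return REQUIRED_TAGS - set(resource.get("tags", {}))

resources = [
    {"arn": "arn:aws:bedrock:us-east-1:111122223333:agent/support-assistant",
     "tags": {"application": "support-assistant", "owner": "team-cx",
              "environment": "pilot", "cost-center": "cc-1234",
              "data-classification": "internal"}},
    {"arn": "arn:aws:sagemaker:us-east-1:111122223333:endpoint/forecast-pilot",
     "tags": {"owner": "analyst-departed"}},
]

untagged = {r["arn"]: sorted(missing_tags(r))
            for r in resources if missing_tags(r)}
```

Resources that fail the check are exactly the ones Cost Explorer cannot attribute to an application or cost center, so fixing tags is usually the first step of the cost review, not an afterthought.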
Failure modes often look like cost or latency symptoms, but the root cause is design. A support assistant may become expensive because every prompt includes a full transcript instead of a concise case summary. A RAG app may be slow because it retrieves a large number of chunks from stale documents. A document workflow may cost more than expected because failed scans are retried repeatedly. A forecasting experiment may keep unused compute resources running. A Personalize campaign may serve traffic even after the experiment ended.
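The repeated-retry failure mode is a design fix, not a capacity fix. A hedged sketch, assuming a hypothetical `extract` callable that bills per page: cap attempts and route the exception to a human instead of retrying a bad scan forever.

```python
# Cap retries on low-quality scans so a bad document stops
# consuming paid pages. The extract callable and limits are
# illustrative, not a real Textract integration.

MAX_ATTEMPTS = 2

def process_with_retry_cap(document, extract):
    """extract(document) returns (ok, pages_billed); stop after MAX_ATTEMPTS."""
    total_pages = 0
    for _ in range(MAX_ATTEMPTS):
        ok, pages = extract(document)
        total_pages += pages
        if ok:
            return "done", total_pages
    return "route_to_human", total_pages  # exception path, not endless retries
```

The same pattern addresses the other failure modes: a hard cap plus an explicit exception path turns a silent cost leak into a visible queue someone owns.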
The review should also consider value. A low-cost model is still waste if nobody uses the output. A high-cost model may be justified if it reduces regulated review time with strong human oversight and audit evidence. Use business metrics next to technical metrics: handle time, avoided rework, conversion rate, forecast error, fraud losses, review backlog, user satisfaction, and complaint rate. Cost optimization without outcome measurement can push teams toward cheaper but ineffective systems.
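Putting a business metric next to the cost metric can be as simple as cost per used output. The numbers below are illustrative: a cheap model nobody trusts can cost more per output that actually gets used than a pricier model with high adoption.

```python
# Cost per used output: a value metric next to the technical cost.
# Monthly costs, volumes, and usage fractions are illustrative.

def cost_per_used_output(monthly_cost, outputs, used_fraction):
    """Cost divided by outputs that were actually used downstream."""
    used = outputs * used_fraction
    return float("inf") if used == 0 else monthly_cost / used

cheap  = cost_per_used_output(monthly_cost=200,   outputs=10_000, used_fraction=0.02)
strong = cost_per_used_output(monthly_cost=2_000, outputs=10_000, used_fraction=0.60)
# cheap ≈ 1.00 vs strong ≈ 0.33 per used output: the "expensive" model wins
```

An unused system returns infinite cost per used output, which is the precise version of the takeaway that a low-cost model is still waste if nobody uses it.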
Review prompts before the quiz:
- Which cost driver is growing fastest: tokens, documents, storage, training, endpoints, or logs?
- Which latency target belongs to each workflow rather than the whole AI program?
- Which resources are untagged or owned by a departed pilot team?
- What prompt, model, data, or retrieval change requires a regression evaluation before release?
- What fallback keeps the business running if the AI service is unavailable?
A Bedrock support assistant is expensive and slow because each request includes long irrelevant case history. What is the best first optimization?
Which AWS tools help review and control spend for AI workloads?
A checkout fraud screen has a strict real-time requirement, while a document batch review can run overnight. What should the operations review conclude?