2.6 AWS Step Functions and Workflow Orchestration
Key Takeaways
- Step Functions orchestrate multi-step workflows as visual state machines with built-in retry, catch, timeout, and parallel/branching logic — no custom retry code.
- Standard Workflows run up to 1 year with exactly-once execution and full execution history; Express Workflows run up to 5 minutes with at-least-once execution and keep history only if you send it to CloudWatch Logs.
- State types include Task, Choice, Parallel, Map, Wait, Pass, and Succeed/Fail, letting you express sequencing, branching, fan-out, and delays declaratively.
- Use orchestration (Step Functions) when steps depend on each other with error handling; use choreography (SQS/SNS/EventBridge) for simple decoupling between components.
- Service integration patterns — Request-Response, Run-a-Job (.sync), and Wait-for-Callback (task token) — control whether the workflow waits for downstream completion.
Quick Answer: Step Functions coordinate complex, multi-step workflows as visual state machines with built-in retries and error handling. Use Standard Workflows for long-running, exactly-once processes (up to 1 year) and Express Workflows for high-volume, short, at-least-once processing (up to 5 minutes). Reach for Step Functions when steps depend on each other; reach for SQS/SNS when they do not.
What Step Functions Solve
When a business process spans several services (validate, charge, reserve stock, notify), chaining them with plain Lambda functions and queues forces you to hand-code retries, timeouts, branching, and state tracking, and that glue code becomes brittle. AWS Step Functions externalize that logic into a managed state machine defined in Amazon States Language (a JSON-based format). You get a visual graph, execution history, and centralized error handling, and the service integrates with 200+ AWS services. This is orchestration (a central coordinator directing steps) versus choreography (independent components reacting to events via SNS/SQS/EventBridge).
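To make the idea of a state machine concrete, here is a minimal sketch of the validate, charge, reserve, notify flow written as an Amazon States Language definition, built as a Python dictionary so it can be serialized to JSON. The account ID, region, and function names are hypothetical placeholders, and `execution_order` is an illustrative helper, not part of any AWS SDK.

```python
import json


def lambda_arn(name: str) -> str:
    # Hypothetical account ID and region; substitute your own function ARNs.
    return f"arn:aws:lambda:us-east-1:123456789012:function:{name}"


# Four Task states chained with "Next"; the last one ends the execution.
order_workflow = {
    "Comment": "Illustrative order pipeline",
    "StartAt": "Validate",
    "States": {
        "Validate": {"Type": "Task", "Resource": lambda_arn("validate-order"), "Next": "Charge"},
        "Charge": {"Type": "Task", "Resource": lambda_arn("charge-card"), "Next": "ReserveStock"},
        "ReserveStock": {"Type": "Task", "Resource": lambda_arn("reserve-stock"), "Next": "Notify"},
        "Notify": {"Type": "Task", "Resource": lambda_arn("send-confirmation"), "End": True},
    },
}


def execution_order(definition: dict) -> list:
    """Follow Next pointers from StartAt (works for linear workflows only)."""
    order, name = [], definition["StartAt"]
    while name is not None:
        order.append(name)
        name = definition["States"][name].get("Next")
    return order


if __name__ == "__main__":
    print(json.dumps(order_workflow, indent=2))  # JSON you could paste into the console
    print(execution_order(order_workflow))
```

The sequencing lives in the definition itself rather than in any one function, which is what the text means by a central orchestrator.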
State Types
| State | What it does | Example |
|---|---|---|
| Task | Do work (invoke Lambda, call a service) | Charge the card |
| Choice | Branch on conditions (if/else) | Route by order type |
| Parallel | Run branches at the same time | Pay AND reserve stock together |
| Map | Run steps for each array item | Process every line item |
| Wait | Pause for a time or until a timestamp | Wait 24h before a reminder |
| Pass | Inject or reshape data | Transform between steps |
| Succeed / Fail | End the execution | Mark complete or errored |
The Map state is the exam's go-to for "process each item in a list" and supports distributed mode for very large datasets.
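As a rough illustration of the Choice and Map rows in the table, the sketch below defines one state of each type as Python dictionaries in Amazon States Language form. The field names (`orderType`, `lineItems`), state names, and the `reserve-item` function are hypothetical. `pick_branch` is a local toy evaluator that only understands top-level `StringEquals` rules; it is not how the service itself evaluates conditions.

```python
# Choice: branch on a field of the state input, with a fallback.
choice_state = {
    "Type": "Choice",
    "Choices": [
        {"Variable": "$.orderType", "StringEquals": "digital", "Next": "DeliverDownload"},
        {"Variable": "$.orderType", "StringEquals": "physical", "Next": "ProcessLineItems"},
    ],
    "Default": "RejectOrder",
}

# Map: run the inner workflow once per element of $.lineItems.
map_state = {
    "Type": "Map",
    "ItemsPath": "$.lineItems",
    "MaxConcurrency": 5,  # at most five items processed at a time
    "ItemProcessor": {
        "ProcessorConfig": {"Mode": "INLINE"},
        "StartAt": "ReserveItem",
        "States": {
            "ReserveItem": {
                "Type": "Task",
                "Resource": "arn:aws:states:::lambda:invoke",
                "Parameters": {"FunctionName": "reserve-item", "Payload.$": "$"},
                "End": True,
            }
        },
    },
    "End": True,
}


def pick_branch(choice: dict, data: dict) -> str:
    """Toy evaluator: supports only top-level "$.field" StringEquals rules."""
    for rule in choice["Choices"]:
        field = rule["Variable"][2:]  # strip the leading "$."
        if data.get(field) == rule["StringEquals"]:
            return rule["Next"]
    return choice["Default"]


if __name__ == "__main__":
    print(pick_branch(choice_state, {"orderType": "physical"}))
    print(pick_branch(choice_state, {"orderType": "gift-card"}))
```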
Standard vs. Express Workflows
Choosing the workflow type is the most common Step Functions exam decision.
| Feature | Standard | Express |
|---|---|---|
| Max duration | 1 year | 5 minutes |
| Execution model | Exactly-once | At-least-once |
| Pricing | Per state transition | Per execution + duration/memory |
| Execution history | Full, in console | CloudWatch Logs only |
| Start rate | ~2,000/sec | ~100,000/sec |
| Best for | Order processing, ETL, human approval | IoT/event streaming, high-volume APIs |
A durable rule: long-running or needs auditable, exactly-once history → Standard; very high volume and short-lived → Express. A multi-day human-approval workflow must be Standard because Express tops out at five minutes.
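The rule of thumb above can be written down as a small decision function. This is only a sketch of the exam heuristic, using the limits from the table; the function name and parameters are made up for illustration, and real designs also weigh cost and start rate.

```python
EXPRESS_MAX_SECONDS = 5 * 60                 # Express: up to 5 minutes
STANDARD_MAX_SECONDS = 365 * 24 * 60 * 60    # Standard: up to 1 year


def choose_workflow_type(max_duration_seconds: int,
                         needs_exactly_once: bool = False,
                         needs_console_history: bool = False) -> str:
    """Return "STANDARD" or "EXPRESS" following the guide's heuristic."""
    if max_duration_seconds > STANDARD_MAX_SECONDS:
        raise ValueError("Longer than one year: split the process into several executions")
    if (max_duration_seconds > EXPRESS_MAX_SECONDS
            or needs_exactly_once
            or needs_console_history):
        return "STANDARD"
    return "EXPRESS"


if __name__ == "__main__":
    three_days = 3 * 24 * 60 * 60
    print(choose_workflow_type(three_days))                       # multi-day approval
    print(choose_workflow_type(30))                               # short, high-volume job
    print(choose_workflow_type(30, needs_exactly_once=True))      # short but exactly-once
```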
Built-In Error Handling
Step Functions eliminate custom retry plumbing:
- Retry — re-attempt a failed state with configurable interval, backoff rate, and max attempts.
- Catch — trap specific errors and route to a recovery or cleanup state.
- Timeout / Heartbeat: a timeout bounds how long a state may run in total, while a heartbeat interval fails a long-running task early if it stops reporting progress.
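The three mechanisms above map onto fields of a Task state. The sketch below shows one plausible configuration as a Python dictionary in Amazon States Language form; the `charge-card` function and the `RefundAndNotify` state are hypothetical. The helper `retry_delays` is an illustrative calculation of the wait before each retry that ignores optional settings such as a maximum delay or jitter.

```python
charge_state = {
    "Type": "Task",
    "Resource": "arn:aws:states:::lambda:invoke",
    "Parameters": {"FunctionName": "charge-card", "Payload.$": "$"},
    "TimeoutSeconds": 30,  # fail the state if it runs longer than 30 s
    "Retry": [
        {
            "ErrorEquals": ["States.TaskFailed"],
            "IntervalSeconds": 2,   # wait before the first retry
            "BackoffRate": 2.0,     # multiply the wait after each attempt
            "MaxAttempts": 3,       # retries, not counting the original try
        }
    ],
    "Catch": [
        {
            "ErrorEquals": ["States.ALL"],  # anything the retrier did not resolve
            "ResultPath": "$.error",        # keep the original input, attach the error
            "Next": "RefundAndNotify",
        }
    ],
    "Next": "ReserveStock",
}


def retry_delays(retrier: dict) -> list:
    """Seconds waited before each retry, assuming no max delay and no jitter."""
    return [
        retrier["IntervalSeconds"] * retrier["BackoffRate"] ** attempt
        for attempt in range(retrier["MaxAttempts"])
    ]


if __name__ == "__main__":
    print(retry_delays(charge_state["Retry"][0]))
```

With these values the waits grow as 2, 4, then 8 seconds before the Catch block takes over.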
Service Integration Patterns
| Pattern | Behavior |
|---|---|
| Request-Response | Call the service and continue immediately |
| Run-a-Job (.sync) | Start a job (Glue, Batch, ECS) and wait for it to finish |
| Wait-for-Callback (task token) | Pause, hand a token to an external system, resume when it calls back |
The task-token pattern is the answer when a workflow must pause for an external event or human action and resume later (e.g., a manager approval delivered through SQS or an API).
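In a state definition, the pattern is selected by a suffix on the Task state's `Resource` string. The sketch below shows one Task per pattern as Python dictionaries; the topic ARN, queue URL, job name, and state names are hypothetical placeholders. `integration_pattern` is an illustrative helper that simply inspects the suffix.

```python
request_response = {
    "Type": "Task",
    "Resource": "arn:aws:states:::sns:publish",  # no suffix: continue once the API call returns
    "Parameters": {
        "TopicArn": "arn:aws:sns:us-east-1:123456789012:order-events",
        "Message.$": "$.message",
    },
    "Next": "RunEtl",
}

run_a_job = {
    "Type": "Task",
    "Resource": "arn:aws:states:::glue:startJobRun.sync",  # .sync: wait for the job to finish
    "Parameters": {"JobName": "nightly-etl"},
    "Next": "AskManager",
}

wait_for_callback = {
    "Type": "Task",
    "Resource": "arn:aws:states:::sqs:sendMessage.waitForTaskToken",
    "Parameters": {
        "QueueUrl": "https://sqs.us-east-1.amazonaws.com/123456789012/approvals",
        "MessageBody": {
            "orderId.$": "$.orderId",
            "taskToken.$": "$$.Task.Token",  # token taken from the context object
        },
    },
    "End": True,
}


def integration_pattern(resource: str) -> str:
    """Classify a Task resource string by its suffix."""
    if resource.endswith(".waitForTaskToken"):
        return "Wait-for-Callback"
    if resource.endswith(".sync") or resource.endswith(".sync:2"):
        return "Run-a-Job"
    return "Request-Response"


if __name__ == "__main__":
    for state in (request_response, run_a_job, wait_for_callback):
        print(integration_pattern(state["Resource"]))
```

In the callback case, the external system that receives the message would later return the token through the Step Functions `SendTaskSuccess` or `SendTaskFailure` API to resume the paused execution.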
On the Exam: "Coordinate a multi-step order workflow with branching and automatic retries" → Step Functions. "Decouple two services with simple async messaging" → SQS. "Wait for a human approval that may take days, then continue" → Step Functions Standard with the Wait-for-Callback task-token pattern. Do not pick Express for anything longer than five minutes or anything needing exactly-once guarantees.
Orchestration vs. Choreography — How to Decide
The central design judgment in this section is choosing between a central orchestrator and event-driven choreography. Orchestration (Step Functions) suits processes where steps depend on one another, where you need branching, parallel fan-out, retries, and a clear audit trail of where each execution is. Choreography (SNS/SQS/EventBridge) suits loosely related reactions where each component independently responds to an event and no central state is required.
| Signal in the question | Pick |
|---|---|
| Multi-step process with order and dependencies | Step Functions |
| Built-in retry/catch without writing retry code | Step Functions |
| Visual workflow and per-execution history | Step Functions |
| Fire-and-forget notification to many subscribers | SNS |
| Buffer/level a spiky workload | SQS |
| Content-filtered routing of AWS/SaaS events | EventBridge |
Cost, Limits, and Integration Notes
Standard Workflows bill per state transition, so a workflow with many tiny states can get expensive at high volume — that economic crossover is exactly why Express (billed per execution plus duration and memory) wins for very high-frequency, short jobs. Step Functions integrate with 200+ services and support optimized integrations (e.g., lambda:invoke, dynamodb:putItem, sns:publish, sqs:sendMessage, glue:startJobRun) plus generic AWS SDK integrations for almost any API, so you rarely need glue Lambda functions just to call another service.
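A back-of-the-envelope calculation shows the crossover. The prices below are assumptions chosen for illustration (roughly in line with published us-east-1 rates at one point in time), so check the current pricing page before relying on them. The model is also simplified: it ignores the free tier and the way Express rounds duration and memory into billing increments.

```python
# ASSUMED prices for illustration only; verify against current AWS pricing.
STANDARD_PRICE_PER_TRANSITION = 0.025 / 1000       # USD per state transition
EXPRESS_PRICE_PER_REQUEST = 1.00 / 1_000_000       # USD per execution
EXPRESS_PRICE_PER_GB_SECOND = 0.00001667           # USD per GB-second of duration


def standard_cost(executions: int, transitions_per_execution: int) -> float:
    """Standard: pay for every state transition."""
    return executions * transitions_per_execution * STANDARD_PRICE_PER_TRANSITION


def express_cost(executions: int, seconds_per_execution: float, memory_mb: int = 64) -> float:
    """Express: pay per execution plus duration scaled by memory."""
    request_part = executions * EXPRESS_PRICE_PER_REQUEST
    gb_seconds = executions * seconds_per_execution * (memory_mb / 1024)
    return request_part + gb_seconds * EXPRESS_PRICE_PER_GB_SECOND


if __name__ == "__main__":
    runs = 1_000_000  # one million short executions
    print(f"Standard: ${standard_cost(runs, transitions_per_execution=10):,.2f}")
    print(f"Express:  ${express_cost(runs, seconds_per_execution=0.5):,.2f}")
```

Under these assumptions, a million half-second executions with ten transitions each cost a few hundred dollars on Standard and only a few dollars on Express, which is the gap the paragraph above describes.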
Common Traps
- Express for long or exactly-once work — wrong; Express caps at five minutes and is at-least-once.
- Using a Wait state to span days on Express — impossible; the whole execution is bounded by five minutes. Use Standard.
- Hand-coding retries in Lambda when a Step Functions Retry block with backoff would be simpler and auditable.
- Choosing Step Functions for trivial two-service async messaging — over-engineered; SQS or EventBridge is the right altitude.
The takeaway: when a scenario describes a coordinated, stateful, multi-step business process with error handling, Step Functions is almost always the intended answer, and the Standard-versus-Express choice hinges on duration, volume, and whether exactly-once execution and a full execution history are required.
A workflow must validate payment, reserve inventory, send a confirmation email, and automatically retry each step on transient failure, with visibility into where the process is. Which service is best suited?
A purchase-approval workflow can take several days while it waits for a manager to approve, then it must resume. Which choice is appropriate?
An IoT platform starts hundreds of thousands of very short orchestrations per second, each lasting under a minute, and does not need a full console execution history. Which Step Functions workflow type fits best?