2.6 AWS Step Functions and Workflow Orchestration
Key Takeaways
- AWS Step Functions orchestrate multi-step workflows using a visual state machine with built-in error handling, retries, and parallel execution.
- Standard Workflows run for up to 1 year with exactly-once execution; Express Workflows run for up to 5 minutes with at-least-once execution (at-most-once when started synchronously).
- Step Functions integrate with 200+ AWS services including Lambda, ECS, DynamoDB, SQS, SNS, and Glue.
- Use Step Functions when you need to coordinate multiple services in sequence, parallel, or with branching logic — not just simple event-driven processing.
- Built-in error handling with the Retry and Catch fields removes most of the need for custom retry logic in application code.
Quick Answer: Step Functions orchestrate complex workflows as visual state machines. Use Standard Workflows for long-running processes (up to 1 year, exactly-once). Use Express Workflows for high-volume, short-duration processes (up to 5 minutes, at-least-once). Built-in error handling eliminates custom retry logic.
What Are Step Functions?
AWS Step Functions is a serverless orchestration service that lets you coordinate multiple AWS services into visual workflows called state machines.
State Types
| State | Description | Use Case |
|---|---|---|
| Task | Perform work (invoke Lambda, call API) | Process an order |
| Choice | Branching logic (if/else) | Route based on order type |
| Parallel | Execute branches simultaneously | Process payment AND update inventory simultaneously |
| Map | Run the same steps for each item in an array | Process each item in an order |
| Wait | Delay for a specified time | Wait 24 hours before sending follow-up |
| Pass | Pass input to output (useful for data transformation) | Format data between steps |
| Succeed/Fail | End execution successfully or with an error | Mark workflow complete |
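To make the state types concrete, here is a minimal sketch of a state machine definition in Amazon States Language (ASL), built as a Python dictionary and serialized to JSON. It combines Choice, Map, Task, Succeed, and Fail. The Lambda ARN, the state names, and the `orderType` and `items` input fields are hypothetical and chosen only for illustration; `dangling_targets` is a small helper written for this example, not part of any AWS SDK.

```python
import json

# Hypothetical Lambda ARN, used only for illustration.
PROCESS_ITEM_ARN = "arn:aws:lambda:us-east-1:123456789012:function:ProcessItem"


def build_definition() -> dict:
    """Return an ASL definition that uses Choice, Map, Task, Succeed and Fail."""
    return {
        "Comment": "Sketch: route an order, then process each item",
        "StartAt": "RouteByOrderType",
        "States": {
            # Choice: branch on a field of the input document.
            "RouteByOrderType": {
                "Type": "Choice",
                "Choices": [
                    {
                        "Variable": "$.orderType",
                        "StringEquals": "standard",
                        "Next": "ProcessEachItem",
                    }
                ],
                "Default": "UnknownOrderType",
            },
            # Map: run the same sub-workflow for every element of $.items.
            "ProcessEachItem": {
                "Type": "Map",
                "ItemsPath": "$.items",
                "ItemProcessor": {
                    "StartAt": "ProcessItem",
                    "States": {
                        # Task: perform the actual work.
                        "ProcessItem": {
                            "Type": "Task",
                            "Resource": PROCESS_ITEM_ARN,
                            "End": True,
                        }
                    },
                },
                "Next": "OrderComplete",
            },
            "OrderComplete": {"Type": "Succeed"},
            "UnknownOrderType": {
                "Type": "Fail",
                "Error": "UnknownOrderType",
                "Cause": "orderType was not recognised",
            },
        },
    }


def dangling_targets(definition: dict) -> set:
    """Return transition targets that are not defined as top-level states."""
    states = definition["States"]
    targets = {definition["StartAt"]}
    for state in states.values():
        if "Next" in state:
            targets.add(state["Next"])
        if "Default" in state:
            targets.add(state["Default"])
        for rule in state.get("Choices", []):
            targets.add(rule["Next"])
    return targets - set(states)


definition = build_definition()
asl_json = json.dumps(definition, indent=2)
```

The resulting `asl_json` string is the kind of document you would supply as a state machine definition. The helper only checks that every transition points at a defined state; it is not a full ASL validator.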
Standard vs. Express Workflows
| Feature | Standard | Express |
|---|---|---|
| Max duration | 1 year | 5 minutes |
| Execution model | Exactly-once | At-least-once |
| Pricing | Per state transition | Per request, plus duration and memory used |
| Execution history | Yes (visible in console) | CloudWatch Logs only |
| Best for | Order processing, ETL, human approval | IoT data processing, streaming ETL, high-volume APIs |
| Execution start rate | Over 2,000 per second | Over 100,000 per second |
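The duration and execution-model rows of the table can be turned into a simple decision rule. The sketch below assumes two inputs that you would have to estimate yourself: the longest time one execution may take, and whether every step is safe to run more than once (idempotent). The function name and parameters are hypothetical; only the limits come from the table above.

```python
STANDARD_MAX_SECONDS = 365 * 24 * 60 * 60  # 1 year
EXPRESS_MAX_SECONDS = 5 * 60               # 5 minutes


def choose_workflow_type(max_duration_seconds: int, steps_are_idempotent: bool) -> str:
    """Pick a workflow type from the duration and execution-model limits.

    Express is chosen only when the run fits in 5 minutes and every step
    tolerates being executed more than once (at-least-once semantics).
    """
    if max_duration_seconds > STANDARD_MAX_SECONDS:
        raise ValueError("No workflow type supports executions longer than 1 year")
    if max_duration_seconds <= EXPRESS_MAX_SECONDS and steps_are_idempotent:
        return "EXPRESS"
    return "STANDARD"


# A multi-day approval process exceeds the Express limit.
approval = choose_workflow_type(3 * 24 * 60 * 60, steps_are_idempotent=True)

# A 30-second, idempotent IoT transformation fits Express.
iot = choose_workflow_type(30, steps_are_idempotent=True)
```

This rule deliberately ignores cost and throughput, which also matter in practice; treat it as a first filter rather than a complete selection method.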
Error Handling
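Step Functions provides built-in error handling that replaces most custom retry logic:
- Retry: automatically retries a failed state with a configurable interval, maximum number of attempts, and backoff rate
- Catch: routes an error to a recovery state when retries are exhausted or no retrier matches
- Timeout: sets the maximum run time of a Task state (`TimeoutSeconds`)
- Heartbeat: fails a long-running task that stops reporting progress within the heartbeat interval (`HeartbeatSeconds`)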
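As an illustration of how these fields look in a definition, the sketch below shows a single Task state with `Retry`, `Catch`, and `TimeoutSeconds`, again as a Python dictionary. The Lambda ARN and the state names `RefundAndNotify` and `UpdateInventory` are hypothetical. `HeartbeatSeconds` is left out to keep the sketch small. The helper `retry_delays` is written for this example and computes the wait before each retry as `IntervalSeconds * BackoffRate ** n`.

```python
import json

# Hypothetical Lambda ARN and state names, used only for illustration.
CHARGE_CARD_ARN = "arn:aws:lambda:us-east-1:123456789012:function:ChargeCard"

charge_card_state = {
    "Type": "Task",
    "Resource": CHARGE_CARD_ARN,
    "TimeoutSeconds": 300,  # fail the task if it runs longer than 5 minutes
    "Retry": [
        {
            "ErrorEquals": ["States.TaskFailed", "States.Timeout"],
            "IntervalSeconds": 2,  # wait before the first retry
            "MaxAttempts": 3,      # number of retries, not counting the first try
            "BackoffRate": 2.0,    # multiplier applied to the wait after each retry
        }
    ],
    "Catch": [
        {
            "ErrorEquals": ["States.ALL"],  # anything still failing after retries
            "ResultPath": "$.error",        # keep the original input, add error info
            "Next": "RefundAndNotify",
        }
    ],
    "Next": "UpdateInventory",
}


def retry_delays(retrier: dict) -> list:
    """Return the wait in seconds before each retry attempt."""
    interval = retrier.get("IntervalSeconds", 1)
    rate = retrier.get("BackoffRate", 2.0)
    attempts = retrier.get("MaxAttempts", 3)
    return [interval * rate ** n for n in range(attempts)]


delays = retry_delays(charge_card_state["Retry"][0])
print(delays)  # [2.0, 4.0, 8.0]
print(json.dumps(charge_card_state["Catch"], indent=2))
```

With these values the task is retried after 2, 4, and 8 seconds; if it still fails, the Catch rule sends the execution to the recovery state instead of failing the whole workflow.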
Integration Patterns
| Pattern | Description |
|---|---|
| Request Response | Call service, wait for HTTP response, continue |
| Run a Job (.sync) | Start a job (e.g., Glue, Batch), wait for completion |
| Wait for Callback (.waitForTaskToken) | Send a task token, pause execution, resume when the token is returned |
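In a definition, the pattern is selected by the suffix on the Task state's `Resource` ARN. The sketch below shows one Task state per pattern and a small helper, written for this example, that classifies a resource by its suffix. The topic ARN, queue URL, Glue job name, and input fields are hypothetical.

```python
# Hypothetical resource identifiers, used only for illustration.
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:order-events"
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/approvals"

# Request Response: call the service and continue once it acknowledges the call.
publish_event = {
    "Type": "Task",
    "Resource": "arn:aws:states:::sns:publish",
    "Parameters": {"TopicArn": TOPIC_ARN, "Message.$": "$.message"},
    "End": True,
}

# Run a Job (.sync): start the job and wait until it finishes.
run_etl_job = {
    "Type": "Task",
    "Resource": "arn:aws:states:::glue:startJobRun.sync",
    "Parameters": {"JobName": "nightly-etl"},
    "End": True,
}

# Wait for Callback: pass the task token out and pause until it is sent back.
request_approval = {
    "Type": "Task",
    "Resource": "arn:aws:states:::sqs:sendMessage.waitForTaskToken",
    "Parameters": {
        "QueueUrl": QUEUE_URL,
        "MessageBody": {
            "taskToken.$": "$$.Task.Token",  # token read from the context object
            "order.$": "$.order",
        },
    },
    "End": True,
}


def integration_pattern(resource: str) -> str:
    """Classify a Task resource ARN by its integration-pattern suffix."""
    if resource.endswith(".sync") or resource.endswith(".sync:2"):
        return "Run a Job"
    if resource.endswith(".waitForTaskToken"):
        return "Wait for Callback"
    return "Request Response"


patterns = [
    integration_pattern(state["Resource"])
    for state in (publish_event, run_etl_job, request_approval)
]
```

In the callback case, whatever consumes the queue message is expected to return the token through the Step Functions API (for example `SendTaskSuccess` or `SendTaskFailure`) so that the paused execution can resume.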
On the Exam: "Orchestrate a multi-step order processing workflow with error handling" → Step Functions. "Simple async message processing between two services" → SQS. Step Functions are for orchestration; SQS/SNS are for simple decoupling.
A workflow needs to process an order: validate payment, update inventory, send confirmation email, and handle failures at each step with retries. Which service is BEST suited?
Which Step Functions workflow type should you use for a long-running order approval process that may take several days to complete?