2.6 AWS Step Functions and Workflow Orchestration

Key Takeaways

  • Step Functions orchestrate multi-step workflows as visual state machines with built-in retry, catch, timeout, and parallel/branching logic — no custom retry code.
  • Standard Workflows run up to 1 year with exactly-once execution and full execution history; Express Workflows run up to 5 minutes with at-least-once execution, and their history is available only through CloudWatch Logs.
  • State types include Task, Choice, Parallel, Map, Wait, Pass, and Succeed/Fail, letting you express sequencing, branching, fan-out, and delays declaratively.
  • Use orchestration (Step Functions) when steps depend on each other and need coordinated error handling; use choreography (SQS/SNS/EventBridge) for simple decoupling between components.
  • Service integration patterns — Request-Response, Run-a-Job (.sync), and Wait-for-Callback (task token) — control whether the workflow waits for downstream completion.
Last updated: June 2026

Quick Answer: Step Functions coordinate complex, multi-step workflows as visual state machines with built-in retries and error handling. Use Standard Workflows for long-running, exactly-once processes (up to 1 year) and Express Workflows for high-volume, short, at-least-once processing (up to 5 minutes). Reach for Step Functions when steps depend on each other; reach for SQS/SNS when they do not.

What Step Functions Solve

When a business process spans several services — validate, charge, reserve stock, notify — chaining them with raw Lambda + queues forces you to hand-code retries, timeouts, branching, and state tracking, which becomes brittle. AWS Step Functions externalize that logic into a managed state machine defined in Amazon States Language (JSON). You get a visual graph, execution history, and centralized error handling, and the service integrates with 200+ AWS services. This is orchestration (a central brain directing steps) versus choreography (independent components reacting to events via SNS/SQS/EventBridge).
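As a rough illustration of the validate, charge, reserve stock, notify process above, the sketch below builds an Amazon States Language definition as a Python dictionary and serializes it to JSON. The Lambda ARNs, function names, and the `task` helper are hypothetical placeholders, not part of any real account or library.

```python
import json

# Hypothetical ARN template; substitute real Lambda function ARNs.
ARN = "arn:aws:lambda:us-east-1:123456789012:function:{}"


def task(function_name, next_state=None):
    """Build a Task state that invokes one Lambda function."""
    state = {"Type": "Task", "Resource": ARN.format(function_name)}
    if next_state is None:
        state["End"] = True  # terminal state of the workflow
    else:
        state["Next"] = next_state
    return state


# Sequential workflow: each state names the state that follows it.
order_workflow = {
    "Comment": "Validate, charge, reserve stock, notify",
    "StartAt": "Validate",
    "States": {
        "Validate": task("validate-order", "Charge"),
        "Charge": task("charge-card", "ReserveStock"),
        "ReserveStock": task("reserve-stock", "Notify"),
        "Notify": task("notify-customer"),
    },
}

definition_json = json.dumps(order_workflow, indent=2)
print(definition_json)
```

The resulting JSON string is the kind of document a state machine definition expects; sequencing lives in the `Next` fields rather than in application code.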

State Types

State          | What it does                            | Example
---------------+-----------------------------------------+-------------------------------
Task           | Do work (invoke Lambda, call a service) | Charge the card
Choice         | Branch on conditions (if/else)          | Route by order type
Parallel       | Run branches at the same time           | Pay AND reserve stock together
Map            | Run steps for each array item           | Process every line item
Wait           | Pause for a time or until a timestamp   | Wait 24h before a reminder
Pass           | Inject or reshape data                  | Transform between steps
Succeed / Fail | End the execution                       | Mark complete or errored

The Map state is the exam's go-to for "process each item in a list" and supports distributed mode for very large datasets.
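To make the Choice and Map states more concrete, here is a minimal sketch of a definition that branches on an order type and then fans out over line items in inline mode. The input fields (`$.orderType`, `$.lineItems`), the ARN, and the `transition_targets` helper are assumptions made for illustration; the helper is only a local sanity check, not an AWS validation API.

```python
import json

# Hypothetical Lambda ARN for the per-item worker.
PROCESS_ITEM_ARN = "arn:aws:lambda:us-east-1:123456789012:function:process-line-item"

definition = {
    "StartAt": "RouteByOrderType",
    "States": {
        # Choice: digital orders skip item processing entirely.
        "RouteByOrderType": {
            "Type": "Choice",
            "Choices": [
                {"Variable": "$.orderType", "StringEquals": "digital", "Next": "Done"}
            ],
            "Default": "ProcessLineItems",
        },
        # Map: run the nested workflow once per element of $.lineItems.
        "ProcessLineItems": {
            "Type": "Map",
            "ItemsPath": "$.lineItems",
            "MaxConcurrency": 5,
            "ItemProcessor": {
                "ProcessorConfig": {"Mode": "INLINE"},
                "StartAt": "ProcessItem",
                "States": {
                    "ProcessItem": {
                        "Type": "Task",
                        "Resource": PROCESS_ITEM_ARN,
                        "End": True,
                    }
                },
            },
            "Next": "Done",
        },
        "Done": {"Type": "Succeed"},
    },
}


def transition_targets(states):
    """Collect every state name referenced by Next, Default, or a Choice rule."""
    targets = set()
    for state in states.values():
        for key in ("Next", "Default"):
            if key in state:
                targets.add(state[key])
        for rule in state.get("Choices", []):
            targets.add(rule["Next"])
    return targets


# Any name left in `missing` would be a transition to a state that does not exist.
missing = transition_targets(definition["States"]) - set(definition["States"])
print(json.dumps(definition, indent=2))
print("dangling transitions:", sorted(missing))
```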

Standard vs. Express Workflows

Choosing the workflow type is the most common Step Functions exam decision.

Feature           | Standard                              | Express
------------------+---------------------------------------+--------------------------------------
Max duration      | 1 year                                | 5 minutes
Execution model   | Exactly-once                          | At-least-once
Pricing           | Per state transition                  | Per execution + duration/memory
Execution history | Full, in console                      | CloudWatch Logs only
Start rate        | ~2,000/sec                            | ~100,000/sec
Best for          | Order processing, ETL, human approval | IoT/event streaming, high-volume APIs

A durable rule: long-running or needs auditable, exactly-once history → Standard; very high volume and short-lived → Express. A multi-day human-approval workflow must be Standard because Express tops out at five minutes.
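The rule of thumb above can be written down as a small decision helper. This is only a sketch of the reasoning in this section: the function name and its parameters are hypothetical, and the returned strings simply mirror the two workflow type names.

```python
EXPRESS_MAX_SECONDS = 5 * 60  # Express duration cap described in this section


def choose_workflow_type(max_duration_seconds,
                         needs_exactly_once=False,
                         needs_console_history=False):
    """Return "STANDARD" or "EXPRESS" using the rule of thumb from the text."""
    if max_duration_seconds > EXPRESS_MAX_SECONDS:
        return "STANDARD"  # Express cannot run this long
    if needs_exactly_once or needs_console_history:
        return "STANDARD"  # guarantees only Standard provides
    return "EXPRESS"       # short, high-volume, at-least-once is acceptable


# A multi-day human approval versus a 30-second event handler.
print(choose_workflow_type(3 * 24 * 3600))
print(choose_workflow_type(30))
print(choose_workflow_type(30, needs_exactly_once=True))
```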

Built-In Error Handling

Step Functions eliminate custom retry plumbing:

  • Retry — re-attempt a failed state with configurable interval, backoff rate, and max attempts.
  • Catch — trap specific errors and route to a recovery or cleanup state.
  • Timeout / Heartbeat — bound how long a state or a long-running task may run.
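The three mechanisms above can be sketched on a single Task state. The ARN and the state names (`ReserveStock`, `RefundAndNotify`) are hypothetical. The `retry_delays` helper is an illustrative calculation of the nominal wait before each retry (interval multiplied by the backoff rate on each attempt); it ignores optional settings such as a maximum delay or jitter.

```python
import json

charge_state = {
    "Type": "Task",
    # Hypothetical Lambda ARN.
    "Resource": "arn:aws:lambda:us-east-1:123456789012:function:charge-card",
    "TimeoutSeconds": 30,  # fail the task if it runs longer than this
    "Retry": [
        {
            "ErrorEquals": ["States.Timeout", "States.TaskFailed"],
            "IntervalSeconds": 2,   # wait before the first retry
            "BackoffRate": 2.0,     # multiplier applied to each later wait
            "MaxAttempts": 3,       # number of retries, not counting the first try
        }
    ],
    "Catch": [
        # After retries are exhausted, route any error to a cleanup state.
        {"ErrorEquals": ["States.ALL"], "ResultPath": "$.error", "Next": "RefundAndNotify"}
    ],
    "Next": "ReserveStock",
}


def retry_delays(retrier):
    """Nominal seconds waited before each retry for one Retry entry."""
    interval = retrier.get("IntervalSeconds", 1)
    rate = retrier.get("BackoffRate", 2.0)
    attempts = retrier.get("MaxAttempts", 3)
    return [interval * rate ** n for n in range(attempts)]


delays = retry_delays(charge_state["Retry"][0])
print(json.dumps(charge_state, indent=2))
print(delays)  # [2.0, 4.0, 8.0]
```

Because the policy is data in the state definition, changing the backoff means editing the workflow, not redeploying the Lambda function.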

Service Integration Patterns

Pattern                        | Behavior
-------------------------------+---------------------------------------------------------------------
Request-Response               | Call the service and continue immediately
Run-a-Job (.sync)              | Start a job (Glue, Batch, ECS) and wait for it to finish
Wait-for-Callback (task token) | Pause, hand a token to an external system, resume when it calls back

The task-token pattern is the answer when a workflow must pause for an external event or human action and resume later (e.g., a manager approval delivered through SQS or an API).
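In a state definition, the three patterns differ mainly in the suffix on the Task's `Resource` ARN. The sketch below shows that, plus a callback state that passes the task token (read from the context object as `$$.Task.Token`) in an SQS message. The queue URL, the `orderId` field, the `Fulfil` state name, the timeout value, and the `integration_resource` helper are assumptions for illustration.

```python
import json

# Suffix appended to the integration ARN for each pattern.
PATTERN_SUFFIX = {
    "request_response": "",
    "run_a_job": ".sync",
    "wait_for_callback": ".waitForTaskToken",
}


def integration_resource(service, action, pattern="request_response"):
    """Build a service-integration Resource ARN (the 'aws' partition is assumed)."""
    return f"arn:aws:states:::{service}:{action}{PATTERN_SUFFIX[pattern]}"


# Pause until an external approver returns the task token.
approval_state = {
    "Type": "Task",
    "Resource": integration_resource("sqs", "sendMessage", "wait_for_callback"),
    "Parameters": {
        # Hypothetical queue URL.
        "QueueUrl": "https://sqs.us-east-1.amazonaws.com/123456789012/approvals",
        "MessageBody": {
            "orderId.$": "$.orderId",        # taken from the state input
            "taskToken.$": "$$.Task.Token",  # taken from the context object
        },
    },
    "TimeoutSeconds": 3 * 24 * 3600,  # give up after three days (illustrative)
    "Next": "Fulfil",
}

print(integration_resource("sns", "publish"))
print(integration_resource("glue", "startJobRun", "run_a_job"))
print(json.dumps(approval_state, indent=2))
```

The system that receives the message would later resume the execution by returning the token through the Step Functions `SendTaskSuccess` or `SendTaskFailure` API.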

On the Exam: "Coordinate a multi-step order workflow with branching and automatic retries" → Step Functions. "Decouple two services with simple async messaging" → SQS. "Wait for a human approval that may take days, then continue" → Step Functions Standard with the Wait-for-Callback task-token pattern. Do not pick Express for anything longer than five minutes or anything needing exactly-once guarantees.

Orchestration vs. Choreography — How to Decide

The central design judgment in this section is choosing between a central orchestrator and event-driven choreography. Orchestration (Step Functions) suits processes where steps depend on one another, where you need branching, parallel fan-out, retries, and a clear audit trail of where each execution is. Choreography (SNS/SQS/EventBridge) suits loosely related reactions where each component independently responds to an event and no central state is required.

Signal in the question                          | Pick
------------------------------------------------+---------------
Multi-step process with order and dependencies  | Step Functions
Built-in retry/catch without writing retry code | Step Functions
Visual workflow and per-execution history       | Step Functions
Fire-and-forget notification to many subscribers | SNS
Buffer/level a spiky workload                   | SQS
Content-filtered routing of AWS/SaaS events     | EventBridge

Cost, Limits, and Integration Notes

Standard Workflows bill per state transition, so a workflow with many tiny states can get expensive at high volume — that economic crossover is exactly why Express (billed per execution plus duration and memory) wins for very high-frequency, short jobs. Step Functions integrate with 200+ services and support optimized integrations (e.g., lambda:invoke, dynamodb:putItem, sns:publish, sqs:sendMessage, glue:startJobRun) plus generic AWS SDK integrations for almost any API, so you rarely need glue Lambda functions just to call another service.
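The economic crossover can be estimated with a few lines of arithmetic. The rates used as defaults below are illustrative assumptions, not quoted prices, and the model ignores free tiers, tiered duration pricing, and memory rounding; check the current pricing page before relying on any figure.

```python
def standard_cost(executions, transitions_per_execution,
                  price_per_transition=0.000025):  # assumed rate, USD
    """Standard: billed per state transition."""
    return executions * transitions_per_execution * price_per_transition


def express_cost(executions, avg_duration_seconds, memory_mb=64,
                 price_per_request=0.000001,       # assumed rate, USD
                 price_per_gb_second=0.00001667):  # assumed rate, USD
    """Express: billed per request plus duration multiplied by memory."""
    gb_seconds = executions * avg_duration_seconds * (memory_mb / 1024)
    return executions * price_per_request + gb_seconds * price_per_gb_second


# Hypothetical workload: one million 1-second executions with 10 transitions each.
standard = standard_cost(1_000_000, transitions_per_execution=10)
express = express_cost(1_000_000, avg_duration_seconds=1)
print(f"Standard: ${standard:,.2f}")
print(f"Express:  ${express:,.2f}")
```

Under these assumed rates, the per-transition charge dominates for short, high-volume workloads, which is the pattern the paragraph above describes.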

Common Traps

  • Express for long or exactly-once work — wrong; Express caps at five minutes and is at-least-once.
  • Using a Wait state to span days on Express — impossible; the whole execution is bounded by five minutes. Use Standard.
  • Hand-coding retries in Lambda when a Step Functions Retry block with backoff would be simpler and auditable.
  • Choosing Step Functions for trivial two-service async messaging is over-engineered; SQS or EventBridge is the simpler fit.

The takeaway: when a scenario describes a coordinated, stateful, multi-step business process with error handling, Step Functions is almost always the intended answer, and the Standard-versus-Express choice hinges on duration, volume, and whether exactly-once execution and a full execution history are required.

Test Your Knowledge

A workflow must validate payment, reserve inventory, send a confirmation email, and automatically retry each step on transient failure, with visibility into where the process is. Which service is best suited?


A purchase-approval workflow can take several days while it waits for a manager to approve, then it must resume. Which choice is appropriate?


An IoT platform starts hundreds of thousands of very short orchestrations per second, each lasting under a minute, and does not need a full console execution history. Which Step Functions workflow type fits best?
