9.3 Alert Rules, Action Groups, and Alert Processing Rules
Key Takeaways
- Alert rules define the signal, scope, condition, evaluation frequency, and action behavior for monitored resources.
- Action groups define reusable notification and automation targets such as email, SMS, webhook, ITSM, Azure Function, and Logic App actions.
- Alert processing rules can suppress, add, or modify actions for alerts during maintenance windows or scoped operational changes.
- Metric alerts are best for simple numeric thresholds, while log alerts are best for record-based conditions and KQL aggregation.
- Troubleshooting alerts requires checking signal availability, scope, dimensions, evaluation window, action group health, and suppression rules.
Alert architecture
An Azure Monitor alert has four practical parts: scope, condition, action, and processing. Scope is the resource or set of resources being evaluated. Condition is the rule logic, such as CPU greater than 90 percent or a KQL count greater than 5. Action is what happens when the alert fires. Processing rules can change action behavior for matching alerts, often during maintenance.
Portal path: Azure portal > Monitor > Alerts > Create > Alert rule. Select a scope, select a signal, define the condition, attach or create an action group, then set details such as severity, name, resource group, and whether the rule is enabled.
The signal choice matters. Metric alerts evaluate metric time series. Log search alerts run KQL against a workspace or resource logs. Activity log alerts evaluate subscription-level events such as service health notifications or administrative operations. Resource health alerts notify when Azure reports a resource health state change. Smart detection and application alerts may appear in broader Azure Monitor contexts, but AZ-104 usually emphasizes metric, log, activity, and service health patterns.
| Requirement | Likely alert type | Why |
|---|---|---|
| CPU above 90 percent for 10 minutes | Metric alert | Numeric time-series threshold. |
| More than five failed deployments in 15 minutes | Log alert | Requires KQL count over activity records. |
| Notify when a service health incident affects a region | Activity log or Service Health alert | Subscription event from Azure service health. |
| Notify when a VM becomes unavailable due to platform health | Resource health alert | Resource health state change. |
| Suppress alerts during planned patching | Alert processing rule | Changes actions for matching alerts. |
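For the service health row above, an activity log alert scoped to the subscription is the usual answer. A minimal CLI sketch, with a placeholder rule name and action group ID:
az monitor activity-log alert create \
--name service-health-incidents \
--resource-group rg-monitor \
--condition category=ServiceHealth \
--action-group <action-group-resource-id> \
--description "Notify on service health events for the subscription"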
Metric alert workflow
A metric alert is the cleanest answer for simple numeric thresholds. You choose the metric, aggregation, operator, threshold, evaluation frequency, and lookback window. The aggregation must match the meaning of the metric. Average CPU above 90 percent for 10 minutes is different from maximum CPU above 90 percent once within 10 minutes. The first reduces noise; the second catches spikes.
Dimension splitting is important. A storage account metric can be split by API name or response type, and a VM scale set metric can be split by instance. Dimension filters let the rule target only the relevant component. If the requirement is to alert only when a specific disk has high latency, a broad VM-level alert may not satisfy it.
CLI example:
az monitor metrics alert create \
--name vm-high-cpu \
--resource-group rg-monitor \
--scopes <vm-resource-id> \
--condition "avg Percentage CPU > 90" \
--window-size 10m \
--evaluation-frequency 1m \
--severity 2 \
--action <action-group-resource-id>
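If the requirement targets a specific dimension value, the same command accepts a where clause inside the condition. A sketch for a storage account, assuming the Transactions metric and a ResponseType dimension value of ClientOtherError (metric, dimension, and value are illustrative):
az monitor metrics alert create \
--name storage-client-errors \
--resource-group rg-monitor \
--scopes <storage-account-resource-id> \
--condition "total Transactions > 10 where ResponseType includes ClientOtherError" \
--window-size 15m \
--evaluation-frequency 5m \
--severity 3 \
--action <action-group-resource-id>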
Log alert workflow
A log alert runs a KQL query on a schedule. The query must return something the rule can measure, such as a row count or a numeric column. The rule then defines the measurement, aggregation, threshold, evaluation frequency, and lookback period. Log alerts are powerful, but they depend on data ingestion: if diagnostic settings or agent collection are missing, the alert never fires because the records never arrive.
Example KQL for failed delete operations:
AzureActivity
| where TimeGenerated > ago(15m)
| where OperationNameValue has "delete"
| where ActivityStatusValue == "Failed"
| summarize FailedDeletes=count()
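A rule that wraps this query can be created with az monitor scheduled-query create. A sketch, assuming a Log Analytics workspace scope; the condition grammar and action flag names vary slightly across CLI versions, so verify against your installed version:
az monitor scheduled-query create \
--name failed-delete-operations \
--resource-group rg-monitor \
--scopes <log-analytics-workspace-resource-id> \
--condition "count 'FailedDeletes' > 5" \
--condition-query FailedDeletes="AzureActivity | where OperationNameValue has 'delete' | where ActivityStatusValue in~ ('Failure', 'Failed')" \
--window-size 15m \
--evaluation-frequency 5m \
--severity 2 \
--action-groups <action-group-resource-id>
The rule's window size supplies the lookback, so the query passed to the rule omits the ago(15m) filter used in the standalone example.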
Use a log alert when the condition depends on text fields, status codes, joins, or counting records. Do not choose a log alert just because it is flexible when a metric alert meets the requirement more directly; metric alerts usually have lower latency and simpler configuration.
Action groups
Action groups are reusable routing definitions. One alert rule can reference one or more action groups, and one action group can be reused by many rules. Receivers include email, SMS, push, voice, webhook, Azure Function, Logic App, Automation runbook, Event Hub, and ITSM actions, depending on configuration. The exam often gives a requirement such as notify administrators by email and start remediation; that points to an action group with both notification and automation receivers.
Action groups also support common alert schema. When enabled, the payload format is consistent across alert types, which helps webhooks, functions, and Logic Apps parse alert data. If a webhook integration breaks because payload shape differs between metric and log alerts, common alert schema is a likely fix.
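A sketch of a reusable action group with an email receiver and a webhook receiver that requests the common alert schema (the names and webhook URL are placeholders):
az monitor action-group create \
--name ag-ops-notify \
--resource-group rg-monitor \
--short-name opsnotify \
--action email oncall-email oncall@contoso.com \
--action webhook ticket-hook https://tickets.contoso.com/alerts usecommonalertschema
The resulting action group ID can then be passed to --action on metric alert rules and to the corresponding action parameter on log and activity log alert rules.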
A practical routing model separates severity and audience. Severity 0 and 1 alerts might page an on-call engineer and open an incident. Severity 2 might notify a team channel and create a ticket. Severity 3 or 4 might send email only or feed a dashboard. AZ-104 does not test an organization's incident policy, but it does expect the administrator to map alerts to the right action mechanism.
Alert processing rules
Alert processing rules modify alert behavior after rules evaluate. They can suppress notifications or apply action groups to matching alerts based on scope, alert rule, severity, monitor service, resource type, or schedule. The most common exam scenario is planned maintenance. You should not delete alert rules for a four-hour patch window. Use an alert processing rule to suppress actions for the affected scope and schedule.
Processing rules do not stop alert evaluation. Alerts can still be created, but actions may be suppressed or changed. This distinction matters for audit and post-maintenance review. If a question requires preserving alert history while preventing notifications, processing rules are better than disabling every alert rule.
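A sketch of a suppression rule for a maintenance window, assuming the alert-processing-rule commands are available in your CLI (older versions need the alertsmanagement extension); the scope, dates, and time zone are placeholders:
az monitor alert-processing-rule create \
--name suppress-patching-window \
--resource-group rg-monitor \
--scopes <vm-resource-id> \
--rule-type RemoveAllActionGroups \
--schedule-start-datetime "2025-06-01 22:00:00" \
--schedule-end-datetime "2025-06-02 00:00:00" \
--schedule-time-zone "UTC" \
--description "Suppress notifications during planned patching"
RemoveAllActionGroups strips the action groups from matching alerts during the schedule, while the alerts themselves are still created and remain visible in the portal.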
Troubleshooting alert failures
If an expected alert did not fire, check the signal first. For metrics, confirm the metric exists for the resource and has values in the evaluated period. For log alerts, run the KQL query manually with the same time range and verify it returns the expected count. Check whether the evaluation window and frequency align with the event. A five-minute event can be missed or diluted by an overly broad aggregation design.
Then check scope. A rule scoped to one VM will not evaluate a new VM unless it is included. A rule scoped to a resource group may not cover all resource types or signals in the way you assume. Check dimension filters. A filter for status code 500 will not fire on 503.
Finally, check actions. Confirm the action group is enabled, the receiver address is correct, webhook endpoints are reachable, and common schema expectations match the receiver. Look for alert processing rules that suppress the action. Also check whether the alert rule is enabled and whether RBAC allows the administrator to see the alert resource.
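A few quick CLI checks, as a sketch (resource IDs are placeholders and the rule and action group names reuse the earlier examples):
# Confirm the metric has values in the evaluated period
az monitor metrics list \
--resource <vm-resource-id> \
--metric "Percentage CPU" \
--interval 1m \
--offset 1h
# Confirm the alert rule and action group are enabled
az monitor metrics alert show --name vm-high-cpu --resource-group rg-monitor --query enabled
az monitor action-group show --name ag-ops-notify --resource-group rg-monitor --query enabled
# Look for processing rules that may be suppressing actions
az monitor alert-processing-rule list --resource-group rg-monitor --output table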
Scenario workflow
A production VM will be patched from 10:00 PM to midnight. The team wants high CPU and heartbeat alerts to continue being evaluated but not notify the on-call phone during the window. Create an alert processing rule scoped to the VM or resource group, select the alert rules or severities, suppress notifications, and set a schedule for the maintenance window. After patching, review fired alerts to confirm whether anything unexpected occurred.
A storage account receives intermittent 403 responses after a SAS rotation. A metric alert on availability may show symptoms, but a log alert can count 403 records and trigger a targeted action group. The KQL query should filter StatusCode == 403, group by a short window, and fire when the count exceeds a threshold. The action group can notify the storage administrators and trigger a Logic App that adds context to a ticket.
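A sketch of that query, assuming blob diagnostic logs flow to the workspace (the StorageBlobLogs table exists only when the storage account's resource logs are sent to a Log Analytics workspace):
StorageBlobLogs
| where TimeGenerated > ago(5m)
| where StatusCode == 403
| summarize Denied403 = count() by bin(TimeGenerated, 5m)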
- During planned VM maintenance, alerts should still be recorded but notifications should not be sent for four hours. What should be configured?
- Which alert type is best for CPU average greater than 90 percent over 10 minutes?
- An alert fires but a webhook receiver cannot parse different payloads from metric and log alerts. What action group setting should be considered?