9.2 Log Analytics, KQL, and Operational Queries
Key Takeaways
- KQL queries start with a table, filter early with where, shape columns with project, and aggregate with summarize.
- TimeGenerated is the primary timestamp used to control the analysis window in Log Analytics.
- Operational KQL should answer a specific support question, such as what failed, where it failed, and whether it is still happening.
- Joins, parse, extend, bin, and render are useful for correlation, extraction, time grouping, and visualization.
- AZ-104 usually tests practical query interpretation rather than advanced data science.
KQL mental model
Kusto Query Language, or KQL, is a pipeline language. A query normally starts with a table, then each pipe character sends the current result set to the next operation. For AZ-104, the most important habit is to filter early, summarize intentionally, and project only the columns needed to answer the operational question.
A basic pattern looks like this:
AzureActivity
| where TimeGenerated > ago(24h)
| where ActivityStatusValue == "Failed"
| project TimeGenerated, ResourceGroup, ResourceProviderValue, OperationNameValue, Caller, ActivityStatusValue
| order by TimeGenerated desc
This query asks for failed control plane activity in the last day. It uses TimeGenerated for the time window, where to reduce the data, project to keep useful columns, and order by to put recent events first. If a question asks why a deployment failed or who attempted a change, this is the kind of query you should recognize.
KQL operators differ in case handling: == is case-sensitive, while =~ compares case-insensitively. For text search, has matches whole terms and can use the indexed term store, so it is usually faster and more precise than contains, which scans for substrings anywhere in the value. Exam questions generally avoid obscure syntax, but they may expect you to identify the query that counts failures or filters the correct time range.
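As a quick illustration of those operators (the IP address and values here are placeholders, not taken from a real environment):

```kusto
AzureActivity
| where CallerIpAddress == "203.0.113.10"      // exact, case-sensitive match
| where ActivityStatusValue =~ "failed"        // case-insensitive match
| where OperationNameValue has "deployments"   // whole-term match, usually faster than contains
```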
Common operators
| Operator | Purpose | Example |
|---|---|---|
| where | Filter rows | where TimeGenerated > ago(1h) |
| project | Select or rename columns | project TimeGenerated, Computer, CounterValue |
| extend | Add calculated columns | extend Percent = CounterValue * 100 |
| summarize | Aggregate rows | summarize Count=count() by ResourceGroup |
| bin | Group time into intervals | by bin(TimeGenerated, 5m) |
| order by | Sort results | order by Count desc |
| join | Combine tables | join kind=leftouter (...) on ResourceId |
| render | Visualize result | render timechart |
A useful metric-like log query groups records into time buckets:
AzureActivity
| where TimeGenerated > ago(6h)
| where ActivityStatusValue in ("Failed", "Succeeded")
| summarize Events=count() by ActivityStatusValue, bin(TimeGenerated, 15m)
| render timechart
This helps an administrator see whether failures are isolated or part of a larger deployment issue.
VM operational queries
VM monitoring has multiple sources. Platform metrics can show CPU percentage, but guest-level logs and performance counters require Azure Monitor Agent and data collection rules. Depending on configuration, you may query tables such as Perf, InsightsMetrics, Heartbeat, Event, or Syslog.
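For example, guest-level CPU from the Perf table can be averaged into five-minute buckets. This sketch assumes the agent and a data collection rule are already sending the % Processor Time counter:

```kusto
Perf
| where TimeGenerated > ago(1h)
| where ObjectName == "Processor" and CounterName == "% Processor Time"
| where InstanceName == "_Total"
| summarize AvgCPU = avg(CounterValue) by Computer, bin(TimeGenerated, 5m)
| order by TimeGenerated desc
```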
Heartbeat is a simple availability clue:
Heartbeat
| where TimeGenerated > ago(30m)
| summarize LastHeartbeat=max(TimeGenerated) by Computer, _ResourceId
| extend MinutesSinceHeartbeat = datetime_diff("minute", now(), LastHeartbeat)
| where MinutesSinceHeartbeat > 10
| order by MinutesSinceHeartbeat desc
This does not prove the application is healthy. It only shows that the monitoring agent has not reported recently. Use it to separate monitoring-agent silence from platform availability questions.
A Windows event investigation might start like this:
Event
| where TimeGenerated > ago(2h)
| where EventLevelName in ("Error", "Critical")
| project TimeGenerated, Computer, EventLog, Source, EventID, RenderedDescription
| order by TimeGenerated desc
A Linux syslog investigation might use:
Syslog
| where TimeGenerated > ago(2h)
| where SeverityLevel in ("err", "crit", "alert", "emerg")
| project TimeGenerated, Computer, Facility, ProcessName, SyslogMessage
| order by TimeGenerated desc
Storage and network queries
Storage diagnostic logs can support troubleshooting that metrics alone cannot. If users report intermittent authorization failures, summarize by status and caller address:
StorageBlobLogs
| where TimeGenerated > ago(1h)
| where StatusCode in (401, 403)
| summarize Failures=count() by StatusCode, AuthenticationType, CallerIpAddress, bin(TimeGenerated, 5m)
| order by TimeGenerated desc
The exact table depends on the resource and diagnostic configuration. The exam will often give table names in the choices. Choose the query that filters the relevant time range and status code instead of one that counts unrelated successful traffic.
For network security group flow logs or traffic analytics, the table names and schema depend on the enabled feature version. The operational goal is the same: filter by source, destination, port, direction, and action, then summarize to identify denied flows. A troubleshooting query should answer whether the network blocked traffic, whether traffic never arrived, or whether a later layer failed.
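A sketch of that pattern, assuming the Traffic Analytics table AzureNetworkAnalytics_CL and its suffixed column names (verify the schema pane in your own workspace, since field names vary by feature version):

```kusto
AzureNetworkAnalytics_CL
| where TimeGenerated > ago(1h)
| where SubType_s == "FlowLog" and FlowStatus_s == "D"   // "D" = denied by an NSG rule
| summarize DeniedFlows = count() by SrcIP_s, DestIP_s, DestPort_d, NSGRule_s
| order by DeniedFlows desc
```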
Alert query design
A log alert query should return a numeric result that can be evaluated. A support query may return rows for humans. Those are different shapes. For a log alert that fires when more than five failed deployments occur in ten minutes, write a query that counts failures:
AzureActivity
| where TimeGenerated > ago(10m)
| where ActivityStatusValue == "Failed"
| where OperationNameValue has "deployments"
| summarize FailedDeployments=count()
Then configure the alert condition so it fires when FailedDeployments is greater than 5. If a query returns many columns and raw rows, it may be useful for investigation but awkward for a numeric alert rule.
Troubleshooting bad queries
When a query returns no data, confirm the table has data before debugging the filters. Start with the table name and a short time window:
AzureActivity
| take 10
Then add a time filter such as where TimeGenerated > ago(24h), and reintroduce the remaining filters one at a time. Check the time picker in the portal because it can override or interact with query expectations. Confirm column names from the schema pane. If you search text with exact case and get no result, try a case-insensitive operator such as =~ or search a broader field.
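That incremental approach looks like this in practice; each stage is run separately, and the stage where data disappears points to the faulty filter:

```kusto
// Stage 1: does the table have any data at all?
AzureActivity
| take 10

// Stage 2: does data exist in the expected window?
AzureActivity
| where TimeGenerated > ago(24h)
| take 10

// Stage 3: reintroduce one filter at a time
AzureActivity
| where TimeGenerated > ago(24h)
| where ActivityStatusValue =~ "failed"
| take 10
```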
When a query is slow, reduce the time range, filter early, avoid unnecessary contains, project fewer columns, and summarize before joining. For AZ-104, performance tuning is not deep, but the administrator should know that querying seven years of raw logs interactively is not the same design as exporting logs for compliance retention.
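The tuning advice can be shown as a before-and-after sketch, reusing the AzureActivity examples from earlier in this section:

```kusto
// Slower: wide window, substring scan, every column retained
AzureActivity
| where OperationNameValue contains "deploy"
| where TimeGenerated > ago(30d)

// Faster: narrow the time window first, use a term match, keep few columns
AzureActivity
| where TimeGenerated > ago(1h)
| where OperationNameValue has "deployments"
| project TimeGenerated, ResourceGroup, ActivityStatusValue
```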
Scenario recognition
If a question says "show a chart of average CPU by VM every five minutes," a metric chart or metric query may be enough. If it says "show failed administrative operations by caller," use AzureActivity. If it says "find VMs whose monitoring agent stopped reporting," query Heartbeat. If it says "count 403 blob operations by IP," use the storage resource log table after diagnostic settings are configured. If it says "build an alert from a complex count of log records," shape the KQL query to return a numeric result.
Which KQL clause is most appropriate for grouping failed activity events into five-minute intervals?
A log alert must fire when more than ten 403 storage requests occur in fifteen minutes. What should the KQL query return?
Which table is commonly used to identify machines whose monitoring agent has stopped reporting recently?