2.1 Data Profiling in Power Query
Key Takeaways
- Data profiling tools include Column Quality, Column Distribution, and Column Profile in the View tab.
- Column Quality shows percentages of Valid, Error, and Empty values per column.
- Column Distribution displays distinct count and unique count, helping identify potential key columns.
- Column Profile provides detailed statistics (min, max, average, standard deviation) and value distribution for the selected column.
- By default, profiling is based on the first 1,000 rows — change to 'Column profiling based on entire dataset' for accuracy.
Data Profiling in Power Query
Quick Answer: Power Query has three data profiling tools in the View tab: Column Quality (valid/error/empty percentages), Column Distribution (distinct/unique counts), and Column Profile (detailed statistics). Always switch to "entire dataset" profiling — the default only scans the first 1,000 rows.
Data profiling is the process of examining data to understand its structure, quality, and content before applying transformations. Power Query provides built-in profiling tools that are essential for making informed data cleaning decisions.
Enabling Data Profiling
Data profiling tools are found in the View tab of Power Query Editor:
View tab → Data Preview section → Check the boxes:
☑ Column Quality
☑ Column Distribution
☑ Column Profile
Critical Setting: By default, profiling is based on the first 1,000 rows. For accurate profiling, change this by clicking the status bar at the bottom of Power Query Editor and selecting "Column profiling based on entire dataset".
Column Quality
Column Quality adds a color-coded bar at the top of each column showing the percentage of:
| Indicator | Color | Meaning |
|---|---|---|
| Valid | Green | Values conform to the column's data type |
| Error | Red | Values that cannot be resolved (type mismatches, division by zero) |
| Empty | Gray | Null or blank values |
What to look for:
- Columns with high error percentages → may need type changes or error handling
- Columns with high empty percentages → decide whether to filter, replace, or fill
- 100% valid doesn't mean the data is correct — just that it matches the expected type
Column Distribution
Column Distribution shows two key metrics as a histogram at the top of each column:
| Metric | Definition | Use Case |
|---|---|---|
| Distinct | Total number of different values | Understanding cardinality |
| Unique | Values that appear exactly once | Identifying potential keys |
Interpreting the metrics:
- Distinct = Row Count → every value is different (likely a key column)
- Distinct = 1 → all values are the same (column may be unnecessary)
- Unique = Distinct → no duplicate values (good candidate for a primary key)
- Unique << Distinct → many values repeat (typical for category columns)
Example:
- A CustomerID column with 10,000 rows, 10,000 distinct values, and 10,000 unique values → perfect primary key candidate
- A Region column with 10,000 rows but only 5 distinct values → categorical column
Column Profile
Column Profile provides the most detailed analysis for the selected column. It appears in a panel below the data preview.
For Text Columns
| Statistic | Description |
|---|---|
| Count | Number of non-null values |
| Error | Number of error values |
| Empty | Number of null/blank values |
| Distinct | Number of unique values |
| Unique | Values appearing exactly once |
| Min Length | Shortest text value |
| Max Length | Longest text value |
| Value Distribution | Bar chart of value frequencies |
For Numeric Columns
| Statistic | Description |
|---|---|
| Count | Number of non-null values |
| Error | Number of error values |
| Empty | Number of null/blank values |
| Min | Smallest value |
| Max | Largest value |
| Average | Arithmetic mean |
| Standard Deviation | Measure of data spread |
| Even/Odd Count | Distribution of even and odd values |
| Zero Count | Number of zero values |
For Date Columns
| Statistic | Description |
|---|---|
| Min | Earliest date |
| Max | Latest date |
| Count | Number of non-null dates |
| Value Distribution | Timeline visualization of date frequencies |
Profiling Workflow
A systematic profiling approach:
- Enable all three profiling tools and switch to entire dataset
- Check Column Quality — flag columns with errors or high empty rates
- Review Column Distribution — identify keys, categories, and anomalies
- Examine Column Profile — deep-dive into suspicious columns
- Document findings — note which columns need cleaning, type changes, or removal
- Prioritize actions — fix errors first, then handle nulls, then optimize types
On the Exam
The PL-300 frequently tests:
- Knowing where to find profiling tools (View tab in Power Query Editor)
- Interpreting Column Quality percentages (valid, error, empty)
- Distinguishing between distinct count and unique count
- Understanding the default 1,000-row profiling limitation
- Using profiling results to inform data cleaning decisions
In Power Query, Column Distribution shows that a column has 500 distinct values and 500 unique values. What does this tell you?
By default, Power Query column profiling is based on:
Column Quality shows a column is 85% Valid, 10% Error, and 5% Empty. Which profiling tool shows these percentages?