10.1 Data Classification Foundations

Key Takeaways

  • Microsoft Purview classifies data with two engines: sensitive information types (pattern-based) and trainable classifiers (machine-learning, trained from samples).
  • A sensitive information type (SIT) detects patterns such as credit card, passport, or national ID numbers using regular expressions, keywords, checksums, and a confidence level.
  • Trainable classifiers recognize content by meaning rather than pattern, and ship as pre-trained categories (for example resumes, source code, harassment) plus custom classifiers you train.
  • Classification feeds the rest of Purview: the same SITs and classifiers are reused by sensitivity labels, DLP policies, retention, and auto-labeling.
Last updated: June 2026

What Data Classification Does

Data classification in Microsoft Purview is the capability that discovers, identifies, and tags data by type and sensitivity so the organization knows what it holds before it tries to protect it. Classification answers the first question of every data-governance project: where is our sensitive data, and what kind is it? For SC-900 you need to recognize the two classification engines, because the same engines are reused everywhere else in Purview — by sensitivity labels, data loss prevention (DLP), retention, and auto-labeling.

The first engine is the sensitive information type (SIT). A SIT is a pattern-based classifier that looks for a recognizable format. Microsoft ships hundreds of built-in SITs covering common identifiers across many regions, including credit card numbers, bank account numbers, ABA routing numbers, passport numbers, national IDs such as the U.S. Social Security number, and health identifiers. Each SIT is defined by a combination of:

  • A primary pattern: usually a regular expression or a keyword list (for example, the digit format of a credit card).
  • Supporting evidence: additional keywords, dates, or formatting near the match (such as the word "card" or an expiry date) that raise confidence.
  • A checksum, where the identifier has one (for example, a credit card number must pass the Luhn check).
  • A confidence level (low, medium, or high) and a proximity window that sets how close the supporting evidence must be to the primary match.
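
The four ingredients above can be sketched in a few lines of Python. This is only an illustration of the idea, not Purview's real rule format: the regular expression, keyword list, 300-character window, and the two-level confidence result are all assumptions chosen for the example.

```python
import re

# Hypothetical, simplified stand-in for a credit card SIT (not Purview's rule schema).
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){15}\d\b")  # primary pattern: 16 digits, optional separators
SUPPORTING_KEYWORDS = ("card", "visa", "mastercard", "expiry", "cvv")
PROXIMITY_WINDOW = 300  # characters on either side of the match searched for evidence


def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(number)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def detect_cards(text: str) -> list[tuple[str, str]]:
    """Return (digits, confidence) for each candidate that passes the checksum."""
    results = []
    lowered = text.lower()
    for m in CARD_PATTERN.finditer(text):
        digits = re.sub(r"\D", "", m.group())
        if not luhn_valid(digits):
            continue  # format matched but checksum failed, so no detection
        start = max(0, m.start() - PROXIMITY_WINDOW)
        window = lowered[start:m.end() + PROXIMITY_WINDOW]
        has_evidence = any(k in window for k in SUPPORTING_KEYWORDS)
        results.append((digits, "high" if has_evidence else "low"))
    return results
```

Calling `detect_cards("Card: 4111 1111 1111 1111 exp 12/27")` reports the well-known test number with high confidence because the word "card" sits inside the window, while the same digits with no nearby keyword are reported with low confidence.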

You can also create custom SITs when the built-ins do not cover an organization-specific format, such as an internal employee ID. SITs are ideal when sensitive data has a predictable shape.

Classification engine | How it decides | Best for
Sensitive information type (SIT) | Pattern match: regex, keywords, checksum, confidence level | Structured identifiers (credit cards, SSNs, passports)
Trainable classifier | Machine learning trained on sample documents | Unstructured content by meaning (contracts, resumes, source code)
Exact Data Match (EDM) | Compares against a hashed table of your real values | Reducing false positives on known records (optional, advanced)
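
The Exact Data Match row is easier to picture with a small sketch. The Python below only illustrates the general idea of comparing content against hashes of real values; the salt, the normalization, and the `PT-` record format are assumptions for the example, and Purview's actual EDM schema and hashing process are not shown here.

```python
import hashlib

# Hypothetical salt and record values, for illustration only.
SALT = b"example-salt"


def fingerprint(value: str) -> str:
    """Normalize a value (strip whitespace, lowercase) and return its salted SHA-256 hash."""
    normalized = "".join(value.split()).lower()
    return hashlib.sha256(SALT + normalized.encode("utf-8")).hexdigest()


# Only the hashes of the real values are kept, never the values themselves.
known_patient_ids = {fingerprint(v) for v in ("PT-100234", "PT-100981")}


def is_exact_match(candidate: str) -> bool:
    """True only if the candidate equals one of the known records."""
    return fingerprint(candidate) in known_patient_ids
```

A value such as `PT-999999` has the right shape but is not in the table, so it is not reported. That is how matching against known records cuts false positives compared with a pattern alone.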

Trainable Classifiers and Auto-Labeling

The second engine is the trainable classifier, which recognizes content by what it is about rather than by a fixed pattern. You cannot write a regular expression for "this looks like a resume" or "this is harassing language," so Purview uses machine learning. A trainable classifier is taught by feeding it sample content; it learns the characteristics of that category and can then identify similar items. There are two kinds:

  • Pre-trained classifiers that Microsoft ships ready to use, including categories such as Resume, Source Code, Harassment, Profanity, Threat, Discrimination, Customer Complaints, Healthcare, and Finance. These are available immediately with no training.
  • Custom trainable classifiers that you build by providing seed content (Microsoft recommends roughly 50–500 sample items) and then validating the results before publishing.
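
To make "taught by feeding it sample content" concrete, here is a toy word-frequency model in Python. It is a deliberately crude sketch and not how Purview's classifiers work internally: the function names, the weighting scheme, and the two-sample seed sets are all assumptions, and a real custom classifier needs far more seed content, as noted above.

```python
import re
from collections import Counter


def tokens(text: str) -> list[str]:
    """Split text into lowercase alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def train(positive: list[str], negative: list[str]) -> dict[str, float]:
    """Learn a weight per word: how much more often it appears in positive samples."""
    pos = Counter(w for doc in positive for w in tokens(doc))
    neg = Counter(w for doc in negative for w in tokens(doc))
    vocab = set(pos) | set(neg)
    return {w: (pos[w] + 1) / (neg[w] + 1) for w in vocab}  # add-one smoothing


def looks_like_category(text: str, weights: dict[str, float]) -> bool:
    """Multiply word weights; a product above 1 leans toward the trained category."""
    score = 1.0
    for w in tokens(text):
        score *= weights.get(w, 1.0)  # unseen words are neutral
    return score > 1.0


# Tiny hypothetical seed sets, far smaller than a real classifier would need.
resume_samples = [
    "Experienced engineer seeking position. Skills: Python, SQL. Education: BSc.",
    "Work experience and education history. Skills include project management.",
]
other_samples = [
    "Meeting moved to Tuesday, please update the invoice.",
    "Quarterly invoice attached for the project.",
]
weights = train(resume_samples, other_samples)
```

No regular expression appears anywhere in the model: the decision comes entirely from what the samples had in common, which is the contrast with a SIT.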

Once trained, a classifier can be used wherever Purview accepts a condition: to auto-apply a sensitivity label, to drive a DLP policy, to trigger an auto-apply retention label, or inside Communication Compliance. This reuse is the key idea SC-900 tests — you classify once, then many controls consume the result.

Classification powers automatic labeling. Instead of relying on every user to tag content correctly, an administrator can configure policies that apply a label whenever a SIT or trainable classifier matches: an auto-labeling policy for sensitivity labels, or an auto-apply policy for retention labels. Sensitivity auto-labeling runs in two places: client-side, as a recommended or automatic label inside Office apps, and service-side, as a policy that labels files at rest in SharePoint and OneDrive and email as it is sent or received through Exchange.

Where Classification Fits in the Workflow

Classification is the first stage of nearly every Microsoft Purview governance scenario, and SC-900 likes to test the order of operations. The workflow is: know your data → protect your data → prevent data loss → govern your data. Classification serves the know your data stage, producing the signals that the protection, prevention, and governance stages consume.

A worked example makes the reuse obvious. Suppose an organization wants to protect customer health records. The steps are:

  1. Classify — create or select a sensitive information type for the health identifier, and optionally a trainable classifier for clinical documents.
  2. Label — build an auto-apply sensitivity label ("Highly Confidential\Health") whose condition is that SIT or classifier matches.
  3. Prevent — create a DLP policy whose condition is content labeled Highly Confidential\Health and whose action blocks external sharing.
  4. Govern — add an auto-apply retention label so those records are kept for the legally required period.

Notice that the classification result from step 1 drives every later step, which is why classification is foundational rather than a standalone feature.
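
The four steps can be modeled as a small Python sketch to show how one classification result is reused. Everything here is hypothetical: the `Document` class, the `MRN-` identifier format, the function names, and the seven-year retention value are placeholders for illustration, not Purview APIs or legal guidance.

```python
from dataclasses import dataclass, field
from typing import Optional

HEALTH_LABEL = "Highly Confidential\\Health"


@dataclass
class Document:
    text: str
    classifications: set = field(default_factory=set)
    sensitivity_label: Optional[str] = None
    retention_years: Optional[int] = None


def classify(doc: Document) -> None:
    """Step 1: identify only. Nothing about the document is protected yet."""
    if "MRN-" in doc.text:  # hypothetical health identifier format
        doc.classifications.add("HealthRecordId")


def auto_label(doc: Document) -> None:
    """Step 2: apply the sensitivity label when the classification matched."""
    if "HealthRecordId" in doc.classifications:
        doc.sensitivity_label = HEALTH_LABEL


def dlp_allows_external_share(doc: Document) -> bool:
    """Step 3: the DLP rule keys off the label, not the raw text."""
    return doc.sensitivity_label != HEALTH_LABEL


def apply_retention(doc: Document) -> None:
    """Step 4: retention reuses the same step 1 result."""
    if "HealthRecordId" in doc.classifications:
        doc.retention_years = 7  # placeholder value, not a legal requirement


record = Document("Patient MRN-00123 visit notes")
classify(record)
classified_only = (record.sensitivity_label, dlp_allows_external_share(record))
auto_label(record)
apply_retention(record)
```

Right after `classify` runs, `classified_only` is `(None, True)`: the record has been identified but is still unlabeled and shareable. The block on external sharing and the retention period only appear once the later steps act on that classification.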

Common trap: classification is identification, not protection. A SIT match by itself does nothing to the file — it simply tells Purview the data is there. Protection only happens once a sensitivity label, DLP policy, or retention rule acts on that classification. On the exam, if a scenario says "identify" or "detect" sensitive data, the answer is classification (SITs / trainable classifiers). If it says "protect," "encrypt," "prevent sharing," or "keep for X years," the answer is a different control that uses the classification.

Finally, do not confuse Purview classification with a security-detection product. Microsoft Defender products detect threats and protect endpoints, identities, and workloads; Microsoft Sentinel is the enterprise SIEM/SOAR. Discovering, classifying, and tagging data itself is always a Microsoft Purview task, never a Defender or Sentinel one.

Test Your Knowledge

An administrator needs to detect documents that contain credit card numbers in the predictable card format. Which Microsoft Purview classification engine is designed for this?

An organization wants to identify documents that 'look like resumes' even though resumes have no fixed pattern. Which capability fits best?

Why is data classification considered the foundation for other Microsoft Purview controls?
