4.3 Custom Text Classification and Custom NER
Key Takeaways
- Custom text classification trains models to categorize documents into your domain-specific classes (single-label or multi-label).
- Custom NER trains models to extract domain-specific entities that are not covered by the pre-built NER model.
- Both features use labeled training data provided in Azure Blob Storage, with labels defined in Language Studio or via the REST API.
- Training data format requires a JSON labels file that maps documents to their class labels (classification) or entity annotations (NER).
- Model evaluation uses precision, recall, and F1 score metrics per class/entity, available after training completes.
Custom Text Classification and Custom NER
Quick Answer: Custom text classification categorizes documents into your classes (single-label or multi-label). Custom NER extracts domain-specific entities. Both require labeled training data in Azure Blob Storage, training via Language Studio or REST API, and deployment to a prediction endpoint.
Custom Text Classification
Single-Label vs. Multi-Label
| Type | Description | Example |
|---|---|---|
| Single-label | Each document gets exactly one class | Support tickets: "Billing" OR "Technical" OR "Account" |
| Multi-label | Each document can have multiple classes | Movie reviews: "Action" AND "Comedy" AND "Sci-Fi" |
Training Data Requirements
| Requirement | Minimum | Recommended |
|---|---|---|
| Documents | 10 per class | 50+ per class |
| Classes | 2 | Depends on use case |
| Document format | .txt files in Azure Blob Storage | UTF-8 encoded |
| Labels file | JSON file mapping documents to classes | Consistent labeling |
Data Format
```json
{
  "projectFileVersion": "2022-05-01",
  "stringIndexType": "Utf16CodeUnit",
  "metadata": {
    "projectKind": "CustomSingleLabelClassification",
    "projectName": "SupportTicketClassifier",
    "language": "en"
  },
  "assets": {
    "projectKind": "CustomSingleLabelClassification",
    "classes": [
      {"category": "Billing"},
      {"category": "Technical"},
      {"category": "Account"}
    ],
    "documents": [
      {
        "location": "ticket001.txt",
        "language": "en",
        "class": {"category": "Billing"}
      },
      {
        "location": "ticket002.txt",
        "language": "en",
        "class": {"category": "Technical"}
      }
    ]
  }
}
```
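Rather than hand-editing the labels file, you can generate it from a simple document-to-class mapping. A minimal sketch (the filenames and the `doc_labels` mapping are illustrative):

```python
import json

# Map each training document (blob name) to its single class label.
doc_labels = {
    "ticket001.txt": "Billing",
    "ticket002.txt": "Technical",
}

labels_file = {
    "projectFileVersion": "2022-05-01",
    "stringIndexType": "Utf16CodeUnit",
    "metadata": {
        "projectKind": "CustomSingleLabelClassification",
        "projectName": "SupportTicketClassifier",
        "language": "en",
    },
    "assets": {
        "projectKind": "CustomSingleLabelClassification",
        # One entry per distinct class.
        "classes": [{"category": c} for c in sorted(set(doc_labels.values()))],
        # One entry per labeled document.
        "documents": [
            {"location": name, "language": "en", "class": {"category": c}}
            for name, c in doc_labels.items()
        ],
    },
}

print(json.dumps(labels_file, indent=2))
```

Generating the file this way keeps labeling consistent as the training set grows.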
Calling the Custom Classification Endpoint
```python
from azure.ai.textanalytics import TextAnalyticsClient
from azure.core.credentials import AzureKeyCredential

client = TextAnalyticsClient(
    endpoint="https://my-language.cognitiveservices.azure.com/",
    credential=AzureKeyCredential("<your-key>")
)

# Single-label classification
poller = client.begin_single_label_classify(
    documents=["I was charged twice for my subscription renewal"],
    project_name="SupportTicketClassifier",
    deployment_name="production"
)

for result in poller.result():
    for classification in result.classifications:
        print(f"Class: {classification.category}")
        print(f"Confidence: {classification.confidence_score:.2f}")
```
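For multi-label projects, the pattern is the same but each document can carry several classifications; the SDK's multi-label counterpart is `begin_multi_label_classify`. A sketch, written as a helper that takes the `TextAnalyticsClient` constructed above:

```python
def classify_multi(client, documents, project_name, deployment_name):
    """Run multi-label classification; `client` is a TextAnalyticsClient.

    Returns, per input document, a list of (category, confidence) pairs --
    note the plural: one document may match several classes at once.
    """
    poller = client.begin_multi_label_classify(
        documents=documents,
        project_name=project_name,
        deployment_name=deployment_name,
    )
    return [
        [(c.category, c.confidence_score) for c in doc.classifications]
        for doc in poller.result()
    ]
```

For example, a movie-review classifier might return `[("Action", 0.91), ("Comedy", 0.74)]` for a single review.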
Custom Named Entity Recognition
Custom NER trains a model to extract entities specific to your domain:
Example Use Cases
| Domain | Custom Entities |
|---|---|
| Legal | Case numbers, judge names, statute references, legal terms |
| Healthcare | Drug names, dosages, symptoms, procedures |
| Finance | Account types, transaction codes, policy numbers |
| Manufacturing | Part numbers, defect types, machine identifiers |
Training Data Format for Custom NER
```json
{
  "projectFileVersion": "2022-05-01",
  "metadata": {
    "projectKind": "CustomEntityRecognition",
    "projectName": "LegalEntityExtractor",
    "language": "en"
  },
  "assets": {
    "projectKind": "CustomEntityRecognition",
    "entities": [
      {"category": "CaseNumber"},
      {"category": "JudgeName"},
      {"category": "StatuteReference"}
    ],
    "documents": [
      {
        "location": "legal_doc_001.txt",
        "language": "en",
        "entities": [
          {
            "regionOffset": 45,
            "regionLength": 12,
            "labels": [
              {
                "category": "CaseNumber",
                "offset": 45,
                "length": 12
              }
            ]
          }
        ]
      }
    ]
  }
}
```
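The offsets and lengths in the labels file are measured in UTF-16 code units (the `"stringIndexType": "Utf16CodeUnit"` setting shown earlier), which differ from Python character indices for text containing characters outside the Basic Multilingual Plane. A sketch of computing spans the way the labels file expects (the sample sentence is illustrative):

```python
def utf16_len(s: str) -> int:
    # UTF-16-LE uses exactly 2 bytes per code unit, so halve the byte length.
    return len(s.encode("utf-16-le")) // 2

def utf16_span(text: str, entity: str):
    """Return (offset, length) of the first occurrence of `entity` in `text`,
    measured in UTF-16 code units as the labels file expects."""
    i = text.find(entity)
    if i == -1:
        raise ValueError(f"{entity!r} not found in text")
    return utf16_len(text[:i]), utf16_len(entity)

# For plain ASCII text, UTF-16 offsets match Python character offsets.
print(utf16_span("In re case 2023-CV-01987, the court held...", "2023-CV-01987"))
```

For ASCII-only documents the two index types coincide, but emoji and other supplementary characters occupy two UTF-16 code units each, so computing offsets with `str.find` alone can mislabel spans.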
Training and Evaluation
```python
import requests

endpoint = "https://my-language.cognitiveservices.azure.com"
project_name = "LegalEntityExtractor"
headers = {"Ocp-Apim-Subscription-Key": "<your-key>"}

# Train a custom NER model
train_url = f"{endpoint}/language/authoring/analyze-text/projects/{project_name}/:train?api-version=2023-04-01"
train_body = {
    "modelLabel": "v1",
    "trainingConfigVersion": "latest",
    "evaluationOptions": {
        "kind": "percentage",
        "trainingSplitPercentage": 80,
        "testingSplitPercentage": 20
    }
}
response = requests.post(train_url, headers=headers, json=train_body)
```
Model Evaluation Metrics
| Metric | Description | Formula |
|---|---|---|
| Precision | Correctness of positive predictions | TP / (TP + FP) |
| Recall | Completeness of positive predictions | TP / (TP + FN) |
| F1 Score | Harmonic mean of precision and recall | 2 * (P * R) / (P + R) |
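The formulas above can be checked with a tiny helper. The counts below are illustrative, chosen to produce the conservative-model profile discussed next (high precision, low recall):

```python
def prf1(tp: int, fp: int, fn: int):
    """Precision, recall, and F1 from true positive, false positive,
    and false negative counts. Guards against division by zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# A conservative model: few false positives (high precision), many misses (low recall).
p, r, f = prf1(tp=60, fp=3, fn=40)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

With these counts precision is about 0.95 while recall is only 0.60, and F1 lands in between, penalizing the imbalance.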
On the Exam: Know how to interpret a confusion matrix. High precision but low recall means the model is conservative (misses some entities). Low precision but high recall means the model over-predicts (many false positives). F1 balances both.
Data Splitting Strategies
| Strategy | Description | Best For |
|---|---|---|
| Automatic (percentage) | Random split (e.g., 80/20) | Most projects |
| Manual | You define which documents are train vs. test | When you need specific test cases |
Best Practices for Training Data
- Label consistency: Ensure the same type of text is always labeled the same way
- Boundary precision: Entity labels must have exact character offsets
- Negative examples: Include documents WITHOUT the target entities
- Class balance: Roughly equal examples per entity type
- Real-world data: Use data that represents actual production inputs
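Boundary precision in particular can be checked mechanically before uploading labels. A sketch, assuming the label dicts from the NER format above and the document text loaded as a string (the sample text and entity names are illustrative; for simplicity this treats offsets as Python character offsets rather than UTF-16 code units):

```python
def validate_labels(doc_text: str, labels: list) -> list:
    """Return a list of problems found in one document's entity labels."""
    problems = []
    for lab in labels:
        start, length = lab["offset"], lab["length"]
        span = doc_text[start:start + length]
        if start < 0 or start + length > len(doc_text):
            problems.append(f"{lab['category']}: span {start}+{length} falls outside document")
        elif span != span.strip():
            problems.append(f"{lab['category']}: span has leading/trailing whitespace")
    return problems

text = "Case 2023-CV-01987 was heard by Judge Rivera."
good = {"category": "CaseNumber", "offset": 5, "length": 13}
bad = {"category": "JudgeName", "offset": 40, "length": 20}
print(validate_labels(text, [good, bad]))
```

Running a check like this over the whole training set catches off-by-one offsets, which otherwise silently degrade model quality.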
Review Questions
Where must custom text classification training documents be stored?
A custom NER model has high precision (0.95) but low recall (0.60) for the "CaseNumber" entity. What does this mean?
Which method in the TextAnalyticsClient performs custom single-label text classification?