3.4 Optical Character Recognition (OCR)
Key Takeaways
- The Azure AI Vision Read API extracts printed text in 164 languages and handwritten text in 9 languages from images and PDFs.
- Read results are organized hierarchically as pages -> blocks -> lines -> words, each with bounding-polygon coordinates and per-word confidence.
- Read runs two ways: synchronously inside the Image Analysis 4.0 'Read' feature, or asynchronously via the standalone Read 3.2 operation that returns an operation-location to poll.
- Use the Read API for general text on signs, labels, and photos; switch to Document Intelligence when you need structured key-value fields and tables from invoices, receipts, or forms.
- Standalone Read accepts files up to 500 MB and PDFs up to 2,000 pages; the Image Analysis Read feature caps input at 20 MB.
Quick Answer: The Azure AI Vision Read API extracts printed text (164 languages) and handwritten text (9 languages) from images and PDFs. Output is pages -> blocks -> lines -> words with bounding polygons and confidence. Use Read for general text; use Document Intelligence when you need structured fields and tables from invoices, receipts, or forms.
Read API vs. Document Intelligence
The most frequently repeated AI-102 OCR question asks you to choose between these two services. Read gives you text and positions; Document Intelligence gives you meaning, in the form of typed key-value pairs and table cells.
| Aspect | Read API (Vision) | Document Intelligence |
|---|---|---|
| Best for | General text from images/PDFs | Structured field & table extraction |
| Output | Raw text + polygons | Key-value pairs, tables, typed fields |
| Typical input | Signs, labels, screenshots, books | Invoices, receipts, IDs, W-2s, forms |
| Tables | Not as structured cells | Full table extraction |
| Handwriting | Yes (9 languages) | Yes |
| Prebuilt models | No | Invoice, receipt, ID, layout, W-2, etc. |
Mnemonic: if the prompt names invoices, receipts, forms, or specific fields, pick Document Intelligence. If it says read the text on a sign/label/photo, pick the Read API.
Two Ways to Call Read
1. Inside Image Analysis 4.0 (synchronous): request the Read visual feature; the text comes straight back in read.blocks. Input is capped at 20 MB.
2. Standalone Read 3.2 (asynchronous): POST .../vision/v3.2/read/analyze returns 202 Accepted with an Operation-Location header. You then GET that URL repeatedly, polling the status field until it reads succeeded (or failed); only then is the text available. This path handles multi-page PDFs and files up to 500 MB.
from azure.ai.vision.imageanalysis import ImageAnalysisClient
from azure.ai.vision.imageanalysis.models import VisualFeatures
from azure.core.credentials import AzureKeyCredential

# endpoint and key come from your Azure AI Vision resource.
client = ImageAnalysisClient(endpoint, AzureKeyCredential(key))

# Send the image bytes and request only the Read (OCR) feature.
with open("sign.jpg", "rb") as f:
    result = client.analyze(image_data=f.read(),
                            visual_features=[VisualFeatures.READ])

# Walk blocks -> lines -> words; only words carry a confidence score.
for block in result.read.blocks:
    for line in block.lines:
        print(line.text)
        for word in line.words:
            print(word.text, round(word.confidence, 2))
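The asynchronous path (option 2 above) follows a submit-then-poll pattern. The sketch below uses only the Python standard library and is an illustration under assumptions, not the official SDK. The helper names submit_read, fetch_status, and wait_for_read, the one-second poll interval, and the 60-second timeout are hypothetical choices. The analyzeResult.readResults path is assumed from the Read 3.2 REST response shape.

```python
import json
import time
import urllib.request


def submit_read(endpoint, key, image_bytes):
    """POST raw image/PDF bytes to Read 3.2; return the Operation-Location URL."""
    request = urllib.request.Request(
        f"{endpoint.rstrip('/')}/vision/v3.2/read/analyze",
        data=image_bytes,
        headers={
            "Ocp-Apim-Subscription-Key": key,
            "Content-Type": "application/octet-stream",
        },
        method="POST",
    )
    # The service answers 202 Accepted; the result URL is in a response header.
    with urllib.request.urlopen(request) as response:
        return response.headers["Operation-Location"]


def fetch_status(operation_url, key):
    """GET the operation URL once and return the parsed JSON body."""
    request = urllib.request.Request(
        operation_url, headers={"Ocp-Apim-Subscription-Key": key}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def wait_for_read(operation_url, key, fetch=fetch_status,
                  interval=1.0, timeout=60.0, sleep=time.sleep):
    """Poll until status is 'succeeded' and return the per-page results.

    fetch and sleep are injectable so the loop can be exercised offline.
    """
    deadline = time.monotonic() + timeout
    while True:
        body = fetch(operation_url, key)
        status = body.get("status")
        if status == "succeeded":
            return body["analyzeResult"]["readResults"]
        if status == "failed":
            raise RuntimeError("Read operation failed")
        if time.monotonic() >= deadline:
            raise TimeoutError("Read operation did not finish in time")
        sleep(interval)
```

A typical call chains the two steps: operation_url = submit_read(endpoint, key, data), then pages = wait_for_read(operation_url, key). Keeping the poll loop separate from the HTTP calls makes the timeout and failure handling easy to test without a live resource.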
Response Hierarchy (exam-critical)
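Text is nested pages -> blocks -> lines -> words. Lines and words carry a bounding polygon, and words also carry a confidence score; lines do not. The Image Analysis 4.0 sample below describes a single image, so it starts at readResult.blocks with no page wrapper.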
{
  "readResult": {
    "blocks": [{
      "lines": [{
        "text": "Hello World",
        "boundingPolygon": [{"x":10,"y":10},{"x":200,"y":10},
                            {"x":200,"y":40},{"x":10,"y":40}],
        "words": [
          {"text":"Hello","confidence":0.99},
          {"text":"World","confidence":0.98}
        ]
      }]
    }]
  }
}
Reading order matters: to rebuild a paragraph you iterate lines within a block; to score per-token reliability you read word.confidence. Picking tagsResult or captionResult to find OCR text is a planted wrong answer — OCR always lives under readResult.
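The two traversals just described (lines for text, words for confidence) can be sketched against the raw JSON. This is a minimal illustration that assumes the response has already been parsed into a Python dict with the readResult shape shown above. The function name collect_text_and_flags and the 0.985 threshold are hypothetical.

```python
def collect_text_and_flags(response, threshold=0.9):
    """Return (line_texts, flagged_words) from a parsed read response.

    line_texts preserves reading order; flagged_words lists
    (text, confidence) for every word scoring below the threshold.
    """
    line_texts = []
    flagged_words = []
    for block in response.get("readResult", {}).get("blocks", []):
        for line in block.get("lines", []):
            line_texts.append(line["text"])
            for word in line.get("words", []):
                if word["confidence"] < threshold:
                    flagged_words.append((word["text"], word["confidence"]))
    return line_texts, flagged_words


# Same content as the sample response, minus the polygon for brevity.
sample = {
    "readResult": {
        "blocks": [{
            "lines": [{
                "text": "Hello World",
                "words": [
                    {"text": "Hello", "confidence": 0.99},
                    {"text": "World", "confidence": 0.98},
                ],
            }]
        }]
    }
}

texts, flagged = collect_text_and_flags(sample, threshold=0.985)
print("\n".join(texts))   # Hello World
print(flagged)            # [('World', 0.98)]
```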
Language Support
| Category | Count | Examples |
|---|---|---|
| Printed text | 164 languages | English, Chinese, Arabic, Hindi, Japanese, Korean, Russian |
| Handwritten text | 9 languages | English, Chinese Simplified, French, German, Italian, Japanese, Korean, Portuguese, Spanish |
The service auto-detects language; you do not usually pass a language hint for Read. Note the asymmetry — far more print languages than handwriting languages — because the exam likes the "how many handwriting languages?" twist (the answer is 9, not 164).
File and Size Limits
| Limit | Standalone Read 3.2 | Image Analysis Read feature |
|---|---|---|
| Max file size | 500 MB (S tier) | 20 MB |
| Max PDF/TIFF pages | 2,000 | n/a (single image) |
| Min dimension | 50 x 50 px | 50 x 50 px |
| Formats | JPEG, PNG, BMP, TIFF, PDF | JPEG, PNG, BMP, GIF, TIFF, WEBP |
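The limits in the table can be turned into a pre-flight check that picks a call path before anything is uploaded. The sketch below is only an illustration. The function name choose_read_path and the returned labels are hypothetical, 1 MB is assumed to mean 1024 * 1024 bytes, and the table's maxima are treated as inclusive.

```python
MB = 1024 * 1024  # assumption: the limits are expressed in binary megabytes

SYNC_MAX_BYTES = 20 * MB      # Image Analysis Read feature
ASYNC_MAX_BYTES = 500 * MB    # standalone Read 3.2 (S tier)
ASYNC_MAX_PAGES = 2000        # PDF/TIFF page cap for standalone Read

SYNC_FORMATS = {"jpeg", "png", "bmp", "gif", "tiff", "webp"}
ASYNC_FORMATS = {"jpeg", "png", "bmp", "tiff", "pdf"}

ALIASES = {"jpg": "jpeg", "tif": "tiff"}


def choose_read_path(extension, size_bytes, pages=1):
    """Return which Read call pattern fits the file, per the table's limits."""
    ext = extension.lower().lstrip(".")
    ext = ALIASES.get(ext, ext)

    # Prefer the synchronous path for a single small image.
    if ext in SYNC_FORMATS and pages == 1 and size_bytes <= SYNC_MAX_BYTES:
        return "image-analysis-read"

    # Otherwise fall back to the asynchronous operation if it fits.
    if (ext in ASYNC_FORMATS and pages <= ASYNC_MAX_PAGES
            and size_bytes <= ASYNC_MAX_BYTES):
        return "standalone-read-3.2"

    raise ValueError(f"{ext!r} file of {size_bytes} bytes and {pages} page(s) "
                     "fits neither Read path")


print(choose_read_path(".jpg", 2 * MB))             # image-analysis-read
print(choose_read_path("pdf", 30 * MB, pages=50))   # standalone-read-3.2
```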
Accuracy Best Practices
| Factor | Recommendation |
|---|---|
| Resolution | Higher resolution = better; tiny text needs more pixels |
| Contrast | Maximize contrast between text and background |
| Skew | Keep text roughly horizontal; Read tolerates moderate rotation |
| Compression | Avoid heavily compressed JPEGs that smear glyphs |
| Glare/shadow | Even lighting; avoid reflections on glossy labels |
Worked Example
A logistics app must digitize shipping labels photographed on phones (printed + occasional handwriting, single images): use the Image Analysis Read feature synchronously and read read.blocks[].lines[].text. But if the same app must also pull vendor, total, and line items from invoices, that part switches to Document Intelligence's prebuilt invoice model — Read alone cannot return typed fields.
Containers and Disconnected OCR
A recurring AI-102 theme is running OCR where data cannot leave the premises. The Read OCR capability ships as a Docker container you pull from the Microsoft Container Registry and run on-premises or at the edge; the container still requires billing connectivity to report usage, except under a disconnected (air-gapped) container commitment tier that you purchase specifically for fully offline operation.
When a scenario stresses "text extraction with no internet access" or "data residency forbids cloud calls," the answer is the Read OCR container, optionally the disconnected variant — not the cloud Read endpoint and not Document Intelligence's cloud service.
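For the connected (billing-reporting) case, starting the container comes down to a docker run command that accepts the license terms and points at a billing endpoint. The sketch below only builds that command as an argument list. The function name read_container_command is hypothetical, and the default image tag is an assumption that should be checked against the tags currently published in the Microsoft Container Registry. Resource flags and the separate startup procedure for disconnected containers are not shown.

```python
import shlex

# Assumption: this repository and tag exist; verify before use.
DEFAULT_IMAGE = (
    "mcr.microsoft.com/azure-cognitive-services/vision/read:3.2-model-2022-04-30"
)


def read_container_command(billing_endpoint, api_key,
                           image=DEFAULT_IMAGE, host_port=5000):
    """Build the docker argv for a connected Read OCR container."""
    return [
        "docker", "run", "--rm",
        "-p", f"{host_port}:5000",        # expose the container's HTTP port
        image,
        "Eula=accept",                    # required licence acknowledgement
        f"Billing={billing_endpoint}",    # Azure resource endpoint for metering
        f"ApiKey={api_key}",              # key of that same resource
    ]


command = read_container_command(
    "https://example.cognitiveservices.azure.com/", "<your-key>"
)
print(shlex.join(command))
```

Once the container is running, the application sends its OCR requests to the local host port instead of the cloud endpoint, so image data stays on-premises while only usage metering reaches Azure.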
Cost and Performance Notes
Read is billed per transaction (per image, or per page for PDFs/TIFFs), so a 50-page PDF is 50 transactions. For latency, the synchronous Image Analysis Read feature is best for one small image in an interactive app, while the asynchronous Read operation is built for large multi-page documents where you can tolerate a poll. Sending an oversized image to the synchronous path returns an error rather than silently switching modes, so right-sizing the input and choosing the matching call pattern is part of correct design.
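The per-page billing rule above is simple arithmetic, sketched here for a mixed batch. The function name count_transactions and the sample batch are hypothetical, and no price is assumed; the function counts transactions only.

```python
def count_transactions(documents):
    """Count billable Read transactions for (name, page_count) pairs.

    A single image counts as one transaction; a PDF or TIFF counts
    once per page.
    """
    return sum(max(1, pages) for _name, pages in documents)


batch = [("label.jpg", 1), ("manifest.pdf", 50), ("scan.tiff", 3)]
print(count_transactions(batch))  # 54
```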
On the Exam: Memorize 164 print / 9 handwriting languages, the pages -> blocks -> lines -> words order, the 202 + Operation-Location poll pattern for async Read, the Read-vs-Document-Intelligence decision rule, and that disconnected OCR uses the Read container.
How many languages does the Azure AI Vision Read API support for handwritten text recognition?
A company needs to extract the invoice number, date, vendor, and line-item totals from scanned invoices as structured fields. Which service fits best?
In the Read API response, what is the correct containment order from top to bottom?
Using the asynchronous standalone Read operation, what must your code do after the initial POST returns 202 Accepted?