3.1 Document Extraction, Cleaning & Chunking

Key Takeaways

  • Store unstructured source files (PDFs, HTML, images) in Unity Catalog volumes, not DBFS root, notebook output folders, or ephemeral cluster local disk.
  • The `binaryFile` Spark reader loads files as a BINARY column, which is exactly the input shape `ai_parse_document()` requires; a STRING path will not work.
  • Chunk overlap exists to preserve context that spans a chunk boundary; small chunks raise precision but risk fragmenting context, and large chunks dilute relevance.
  • Remove repetitive boilerplate and deduplicate near-duplicate documents before embedding, because both pollute the index and crowd retrieval results.
  • Any change to chunk boundaries requires re-embedding and reindexing; you cannot reuse old embeddings against new chunks.
Last updated: July 2026

From Raw Files to Retrievable Chunks

Data Preparation is 14% of the Databricks Certified Generative AI Engineer Associate blueprint, but it carries outsized weight in practice: in a Retrieval-Augmented Generation (RAG) system the final answer can only be as good as the context that was retrieved, so weak data prep caps the quality of everything downstream. This section covers the front of the pipeline — extracting text from unstructured sources, cleaning and normalizing it, and splitting it into chunks a retriever can match to user questions.

Landing and extracting source documents

On Databricks, arbitrary unstructured files (PDFs, HTML exports, Word documents, scanned images) belong in Unity Catalog volumes. Volumes are the governed, persistent, access-controlled storage layer for non-tabular files. They are a better fit than the DBFS root, a notebook output folder, or cluster local disk, each of which is either ungoverned or ephemeral; cluster local disk, for instance, is lost when the cluster terminates. Landing files in a volume also positions them for the AI Functions, which expect governed storage.

To parse a document you first need its raw bytes. The binaryFile Spark reader loads each file's contents into a BINARY content column, alongside path, modificationTime, and length metadata. This BINARY shape matters: ai_parse_document(), Databricks' built-in document parser, operates on document bytes, not on a STRING file path. A common ingestion pattern is to call spark.read.format("binaryFile") over a volume path and then pass the resulting content column into ai_parse_document(). Readers such as jdbc (relational sources) or rate (streaming test data) solve entirely different problems and will not produce the byte column you need.
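The ingestion pattern described above can be sketched as a small helper. This is a minimal sketch, not a reference implementation: the function name parse_documents and the volume path in the usage note are hypothetical, and it assumes a workspace and runtime where ai_parse_document() is available as a SQL function.

```python
def parse_documents(spark, volume_path):
    """Read raw file bytes from a volume and parse them with ai_parse_document().

    `spark` is an active SparkSession; `volume_path` is a Unity Catalog
    volume path (hypothetical in this sketch).
    """
    raw = (
        spark.read.format("binaryFile")  # columns: path, modificationTime, length, content
        .load(volume_path)
    )
    # `content` is BINARY, which is the input the parser operates on;
    # passing the STRING `path` column instead would not work.
    return raw.selectExpr("path", "ai_parse_document(content) AS parsed")
```

In a notebook this might be called as parse_documents(spark, "/Volumes/main/rag/raw_docs"), where the catalog, schema, and volume names are placeholders for your own.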

Databricks AI Functions for extraction

Databricks AI Functions run AI tasks directly on data in SQL, notebooks, and pipelines, so row-wise inference scales without a hand-built chain. Three are worth knowing for data prep:

Function | Purpose | Note
ai_parse_document() | Parse PDFs/docs from a BINARY column into text/structure | imageOutputPath saves rendered page images to a UC volume for multimodal RAG or human review
ai_extract() | Pull named fields from text | Ideal for deterministic field extraction (invoice number, coverage limit)
ai_classify() | Label rows into categories | Useful for routing or filtering the corpus

For deterministic extraction such as pulling an invoice_number or coverage_limit, request a fixed JSON schema rather than free text — schema-driven output is easy to validate, test, and wire into downstream systems.
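To illustrate why a fixed schema is easier to validate than free text, here is a minimal sketch that parses a model's JSON reply into a typed record and rejects anything off-schema. The InvoiceFields class and parse_extraction function are hypothetical names introduced for illustration, and the field types are assumptions.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class InvoiceFields:
    invoice_number: str
    coverage_limit: float


def parse_extraction(raw_json):
    """Parse a JSON reply into InvoiceFields, rejecting off-schema output."""
    data = json.loads(raw_json)  # free text fails here with a ValueError subclass
    expected = {"invoice_number", "coverage_limit"}
    if not isinstance(data, dict) or set(data) != expected:
        raise ValueError("reply does not match the expected field set")
    if not isinstance(data["invoice_number"], str):
        raise ValueError("invoice_number must be a string")
    limit = data["coverage_limit"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(limit, bool) or not isinstance(limit, (int, float)):
        raise ValueError("coverage_limit must be a number")
    return InvoiceFields(data["invoice_number"], float(limit))
```

Because every failure mode raises the same exception type, a pipeline can route rejected rows to a review table instead of letting malformed values flow downstream.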

Cleaning and normalization

Extracted text is rarely index-ready. Two data-quality problems recur on the exam. First, repetitive boilerplate: a footer like "Confidential — Company Internal" that appears on every page will dominate many chunks, pollute their embeddings, and pull irrelevant matches to the top. The fix is to remove repetitive boilerplate before embedding, not to lowercase everything or switch to a bigger chat model. Second, near-duplicate documents: if two versions of the same policy are both indexed, retrieval over-surfaces the same snippet and loses diversity. Deduplicate or version chunks before indexing to keep the index clean before the LLM ever sees the context.
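Both cleaning steps can be sketched in plain Python. This is one possible approach under stated assumptions: boilerplate is detected as any line that recurs on at least 80% of pages, and near-duplicates are detected with Jaccard similarity over three-word shingles. The function names and both thresholds are illustrative choices, not values taken from Databricks documentation.

```python
import re
from collections import Counter


def strip_repeated_lines(pages, min_fraction=0.8):
    """Drop lines that recur on at least min_fraction of pages (likely headers/footers)."""
    if len(pages) < 2:
        return list(pages)  # too few pages to tell boilerplate from content
    counts = Counter()
    for page in pages:
        # Count each distinct non-empty line once per page.
        counts.update({ln.strip() for ln in page.splitlines() if ln.strip()})
    threshold = min_fraction * len(pages)
    boilerplate = {ln for ln, n in counts.items() if n >= threshold}
    return [
        "\n".join(ln for ln in page.splitlines() if ln.strip() not in boilerplate)
        for page in pages
    ]


def _shingles(text, k=3):
    """Set of k-word tuples from lowercased text, ignoring punctuation."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def drop_near_duplicates(docs, threshold=0.8):
    """Keep the first of any documents whose shingle Jaccard similarity >= threshold."""
    kept, kept_shingles = [], []
    for doc in docs:
        s = _shingles(doc)
        if any(len(s & t) / len(s | t) >= threshold for t in kept_shingles):
            continue  # too similar to a document already kept
        kept.append(doc)
        kept_shingles.append(s)
    return kept
```

The pairwise comparison is quadratic in the number of documents, which is fine for a sketch; a large corpus would normally use an approximate method such as MinHash instead.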

A third, subtler control is schema-aware ingestion validation. If a form-parsing workflow silently misses a newly added coverage_limit field, a validation step that checks extracted fields against an expected schema catches the drift at ingestion — far earlier and more reliably than waiting for a downstream answer to look wrong.
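A batch-level version of this check might look like the following sketch. It assumes extracted records arrive as dictionaries and treats a field as missing when it is absent, None, or an empty string; the function names and the 5% tolerance are hypothetical.

```python
def missing_field_rates(records, expected_fields):
    """Fraction of records in which each expected field is absent or empty."""
    total = len(records)
    rates = {}
    for field in expected_fields:
        missing = sum(1 for r in records if r.get(field) in (None, ""))
        rates[field] = missing / total if total else 0.0
    return rates


def check_batch(records, expected_fields, max_missing_rate=0.05):
    """Raise if any expected field is missing more often than the tolerated rate."""
    rates = missing_field_rates(records, expected_fields)
    drifted = {f: r for f, r in rates.items() if r > max_missing_rate}
    if drifted:
        raise ValueError(f"schema drift suspected: {drifted}")
    return rates
```

Running check_batch at ingestion turns a silent gap, such as a new coverage_limit field the parser never picks up, into an immediate failure with the affected field named in the error.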

Chunking strategies

Chunking splits long documents into smaller, semantically searchable passages so the retriever can return only the most relevant text instead of an entire document. The main strategies:

  • Fixed-size — every chunk is roughly N tokens/characters; simplest, best for uniform documents with little structure.
  • Sentence-based — split on sentence boundaries; keeps sentences intact.
  • Recursive — split hierarchically (sections → paragraphs → sentences) until chunks fit a target size; a robust general default.
  • Semantic — split at topic/meaning boundaries where the content shifts.
  • By-structure / section-aware — chunk along a document's own headings, numbered clauses, and appendices.
  • Parent-child — index small child chunks for precise matching but return the larger parent for context.
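The recursive strategy in the list above can be sketched in a few lines of plain Python. This is a simplified illustration, not the algorithm of any particular library: it measures size in characters rather than tokens, the separator order is an assumption, and the separator is dropped wherever a chunk boundary falls. When no separator is left it falls back to a hard fixed-size cut.

```python
def recursive_split(text, max_chars=500, separators=("\n\n", "\n", ". ", " ")):
    """Split on the coarsest separator first, recursing into finer ones for oversized pieces."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not separators:
        # No separator left: fall back to a hard fixed-size cut.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    sep, finer = separators[0], separators[1:]
    chunks, buffer = [], ""
    for piece in text.split(sep):
        candidate = f"{buffer}{sep}{piece}" if buffer else piece
        if len(candidate) <= max_chars:
            buffer = candidate  # keep merging small neighbouring pieces
            continue
        if buffer.strip():
            chunks.append(buffer)
        if len(piece) <= max_chars:
            buffer = piece
        else:
            # Piece is still too large: retry with the next, finer separator.
            chunks.extend(recursive_split(piece, max_chars, finer))
            buffer = ""
    if buffer.strip():
        chunks.append(buffer)
    return chunks
```

Paragraphs that fit are kept whole, and only the oversized ones are broken down further, which is why this approach tends to be a reasonable default for mixed content.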

For structured documents — contracts with headings and numbered clauses, or legal/policy files with cross-references — section-aware (structural or hierarchical) chunking is usually the best first choice because the document's structure already maps to natural retrieval units. Embedding an entire contract as one vector buries the signal; splitting every sentence into an isolated chunk destroys context. Both fail, in opposite directions.
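A section-aware splitter for a numbered contract could be sketched as follows. It assumes headings sit on their own line in the form "1. Title" or "2.1. Title"; real documents vary, so the HEADING pattern is an assumption to adapt, and the function and key names are hypothetical. Text before the first heading is kept as a preamble chunk.

```python
import re

# Assumed heading style: "1. Definitions" or "2.1. Renewal" at the start of a line.
HEADING = re.compile(r"^(\d+(?:\.\d+)*)\.\s+(.+)$", re.MULTILINE)


def split_by_sections(text):
    """Return one chunk per numbered clause, carrying the clause number and title."""
    matches = list(HEADING.finditer(text))
    chunks = []
    preamble = text[:matches[0].start()].strip() if matches else text.strip()
    if preamble:
        chunks.append({"clause": None, "title": "Preamble", "text": preamble})
    for i, m in enumerate(matches):
        # A clause body runs from the end of its heading to the next heading.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        chunks.append({
            "clause": m.group(1),
            "title": m.group(2).strip(),
            "text": text[m.end():end].strip(),
        })
    return chunks
```

Keeping the clause number and title alongside the body means each chunk can later be cited precisely, and a clause that is still too long can be passed through a size-based splitter.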

Chunk size, overlap, and metadata

Chunk size is a trade-off. Small chunks improve precision but risk fragmenting an idea so the model never sees enough context to answer. Large chapter-sized chunks preserve context but dilute relevance (each embedding mixes many topics) and burn the context window faster. There is no universal number — pick based on document structure and answer type, then measure retrieval quality rather than guessing.

Chunk overlap lets adjacent chunks share text so a fact that straddles a boundary still appears intact in at least one chunk. Overlap costs extra storage and tokens, so apply it where boundary splits actually hurt, not reflexively on every pipeline. When citation precision is poor because answers cite large irrelevant sections, the fix is usually smaller, semantically coherent chunks with modest overlap, not larger ones.
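Overlap is easiest to see as a sliding window. The sketch below works on any token list (for example, text.split() as a rough stand-in for a real tokenizer); the function name and the default size and overlap are illustrative assumptions, not recommended values.

```python
def chunk_with_overlap(tokens, size=200, overlap=40):
    """Slide a window of `size` tokens, advancing by size - overlap each step."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks
```

With size=4 and overlap=2, ten tokens produce four windows, and every adjacent pair shares two tokens, so a fact that straddles one boundary appears whole in the neighbouring window. The cost is visible too: 16 tokens are stored for 10 tokens of source text.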

Finally, preserve metadata with every chunk — a source document URI, page, and section/title. This metadata powers citations, enables retrieval filtering (for example, 2026 policies only), and makes debugging traceable. Operational trivia like GPU type or notebook theme adds nothing. And remember: because embeddings encode the exact text of each chunk, changing chunk boundaries means you must re-embed and reindex — reusing old vectors leaves the index inconsistent with the new segmentation.
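One way to make the re-embedding rule enforceable is to derive each chunk's id from its text, so a boundary change automatically produces new ids. The sketch below is an assumption-laden illustration: the Chunk class, its field names, and the helper functions are hypothetical, and the id is simply a truncated SHA-256 of the source URI plus the chunk text.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    source_uri: str
    page: int
    section: str

    @property
    def chunk_id(self):
        # Hash of source + text: changing a boundary changes the text, hence the id.
        payload = f"{self.source_uri}\n{self.text}".encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:16]


def needs_embedding(new_chunks, indexed_ids):
    """Chunks whose ids are not yet in the index and must be (re-)embedded."""
    return [c for c in new_chunks if c.chunk_id not in indexed_ids]


def stale_ids(new_chunks, indexed_ids):
    """Indexed ids that no longer match any current chunk and should be removed."""
    return set(indexed_ids) - {c.chunk_id for c in new_chunks}
```

After a re-chunk, needs_embedding lists what to send to the embedding model and stale_ids lists the vectors to delete, which keeps the index consistent with the new segmentation. The page and section fields travel with each chunk for citations and filtering.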

Test Your Knowledge

A team is building a governed RAG pipeline and must decide where to store the source PDFs. Which location is the recommended choice on Databricks?


A footer reading 'Confidential - Company Internal' appears on every page and now dominates many retrieved chunks. What is the best preprocessing step before embedding?


You change your chunking strategy from page-based to section-based segments. What must you normally do before re-evaluating retrieval quality?
