4.1 Foundation Models, LLMs, Transformers, and Modalities
Key Takeaways
- Foundation models are broad pretrained models that can be adapted to many tasks through prompting, retrieval, or customization rather than task-specific training from scratch.
- LLMs are foundation models focused on language, while multimodal models can work with text, images, audio, video, or combinations of those inputs and outputs.
- Transformers matter at practitioner depth because attention lets a model weigh relationships across tokens, but candidates do not need to build transformer architectures.
- AWS service fit depends on whether the team needs a managed model API, a packaged assistant, a task-specific AI service, or a custom ML path.
Foundation models in plain language
A foundation model is a large model trained on broad data so it can support many downstream tasks. Instead of building a different model for every workflow, a team can often start with a general model and steer it with prompts, retrieved context, examples, guardrails, or later customization. That is why foundation models show up in chat assistants, document search, summarization, extraction, image generation, and code help. The model is foundational because it provides reusable capability, not because it is automatically right for every business problem.
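To make "steer it with prompts" concrete, here is a minimal sketch of calling a foundation model through the Amazon Bedrock Converse API with boto3. The region, model ID, prompt, and inference settings are illustrative assumptions, not recommendations; the point is that the application supplies instructions at request time instead of training anything.

```python
# A minimal sketch of prompting a managed foundation model via Amazon Bedrock.
# Region, model ID, and prompt are illustrative assumptions.
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",  # assumed example model ID
    messages=[
        {
            "role": "user",
            "content": [{"text": "Summarize the refund policy below in two sentences:\n..."}],
        }
    ],
    inferenceConfig={"maxTokens": 300, "temperature": 0.2},
)

# The Converse API returns the assistant message as a list of content blocks.
print(response["output"]["message"]["content"][0]["text"])
```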
A large language model, or LLM, is a foundation model centered on language. It predicts and generates text as sequences of tokens, which can support answering questions, drafting content, translating text, adjusting tone, classifying messages, or explaining a document. A multimodal foundation model expands the pattern beyond text. Depending on the model, it may accept images, audio, video, or text, and it may produce text, images, or other outputs. A practitioner should ask which modalities are required instead of assuming every model can read every file type.
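The token-by-token idea can be shown without a real model. The toy sketch below uses a made-up lookup table of next-token scores, a stand-in for what a real LLM computes, and greedily appends the highest-scoring token until an end marker appears.

```python
# A toy sketch of how an LLM generates text one token at a time.
# The "model" is a hypothetical lookup table of next-token scores,
# not a real neural network.
TOY_NEXT_TOKEN = {
    "The": {"invoice": 0.6, "cat": 0.4},
    "invoice": {"is": 0.7, "was": 0.3},
    "is": {"overdue": 0.8, "paid": 0.2},
    "overdue": {"<end>": 1.0},
}

def generate(prompt_tokens, max_new_tokens=10):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        scores = TOY_NEXT_TOKEN.get(tokens[-1], {"<end>": 1.0})
        next_token = max(scores, key=scores.get)  # greedy: pick the highest score
        if next_token == "<end>":
            break
        tokens.append(next_token)
    return " ".join(tokens)

print(generate(["The"]))  # -> "The invoice is overdue"
```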
Transformers are the dominant architecture behind most modern LLMs. At exam depth, the key idea is attention: the model weighs relationships among tokens in the input rather than reading words as isolated items. This helps it connect pronouns to earlier nouns, follow instructions across a prompt, and summarize long passages. You do not need to calculate attention or design neural network layers for the AWS Certified AI Practitioner exam, but you should know why transformers enabled strong language and multimodal performance.
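For readers who want to see attention once rather than take it on faith, here is a minimal NumPy sketch of scaled dot-product attention, softmax(QKᵀ / √d_k) V, with made-up token vectors. The exam does not require this math; the sketch only illustrates how each output row becomes a weighted mix of information from every token.

```python
# A minimal sketch of scaled dot-product attention, purely illustrative.
# Real transformers learn separate Q, K, V projections during training;
# here the same toy vectors are reused for all three.
import numpy as np

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # how strongly each token attends to the others
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax -> attention weights
    return weights @ V  # each output row is a weighted mix of value vectors

# Three tokens, each represented by a 4-dimensional toy vector.
x = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0, 0.0]])
print(attention(x, x, x))
```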
Practitioner comparison table
| Concept | What it means | Practitioner question |
|---|---|---|
| Foundation model | Broad pretrained model reused across many tasks | Is a general model enough, or does the task need a specialized service or customization? |
| LLM | Foundation model focused on language | Are the inputs and expected outputs primarily text? |
| Multimodal model | Model that can use more than one content type | Does the workflow need images, audio, video, or document layout understanding? |
| Transformer | Architecture that uses attention across tokens | Can we discuss capability and limits without building the model internals? |
| Inference | Using a trained model to produce output | What latency, cost, accuracy, and control requirements apply when users call it? |
This table is useful because many business conversations use model terms loosely. A team may say it needs an LLM when it actually needs optical character recognition, translation, enterprise search, or a narrow classifier. For example, Amazon Textract is built for extracting text and structured data from documents, while Amazon Bedrock gives managed access to foundation models for generative AI applications. Amazon Q is a packaged generative AI assistant experience for business or developer workflows, and SageMaker AI is more relevant when a builder team needs deeper ML development, training, customization, or deployment control.
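As a contrast with the generative path, the sketch below shows the task-specific route for document text: a direct call to Amazon Textract's detect_document_text API rather than wrapping a multimodal LLM around the problem. The file name and region are placeholder assumptions.

```python
# A minimal sketch of the task-specific service path: extracting text
# from a scanned document with Amazon Textract. File name is a placeholder.
import boto3

textract = boto3.client("textract", region_name="us-east-1")

with open("invoice.png", "rb") as f:  # assumed local sample document
    result = textract.detect_document_text(Document={"Bytes": f.read()})

# Textract returns the document as blocks; print each detected line of text.
for block in result["Blocks"]:
    if block["BlockType"] == "LINE":
        print(block["Text"])
```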
Service fit and modality judgment
A good practitioner starts with the work, not the hype. If a support team wants draft replies to customer emails, a text-capable LLM through Amazon Bedrock may fit because the model can account for tone, intent, and policy context. If the team wants employees to ask questions over company documents, Amazon Q Business or a Bedrock-based retrieval augmented generation solution may be closer. If the task is to detect unsafe objects in images, Amazon Rekognition may be more direct than building a chat experience around a multimodal model.
Use cases also fail when the model capability does not match the operating risk. A general model may produce fluent output that sounds authoritative even when it lacks the required facts. A multimodal model may identify visual patterns but still miss edge cases, low-quality images, or domain-specific details. A text model may produce a good first draft but still need human review before legal, medical, financial, or public communications. Foundation models are useful accelerators, not substitutes for business ownership.
A non-builder candidate should be able to ask four practical questions before approving a generative AI path:
- What input and output modalities does the workflow require?
- Does AWS already provide a managed AI service that solves the task more directly?
- What source of truth will ground the model when factual accuracy matters?
- What review, monitoring, security, and cost controls will be used after launch?
The answer often points to a layered architecture. A web app might call Amazon Bedrock for generation, store source documents in Amazon S3, retrieve relevant chunks through a vector index, protect access with IAM, encrypt data with AWS KMS, and log activity with CloudTrail and CloudWatch. At this chapter stage, the important part is not building that architecture line by line. The important part is recognizing the vocabulary and seeing how capability, modality, service selection, and risk connect.
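A compressed, toy version of that layering is sketched below: a retrieval step grounds the request, then the grounded prompt goes to a Bedrock model. The keyword-overlap "retrieval," the in-memory documents, and the model ID are deliberate stand-ins for a real vector index, document store, and model choice; the structure, retrieve then generate, is the part that matters.

```python
# A toy sketch of the layered pattern: retrieve grounding text, then generate.
# The keyword-overlap retrieval and in-memory documents are stand-ins for a
# real vector index and document store.
import boto3

DOCUMENTS = {
    "refund-policy": "Refunds are issued within 14 days of an approved return.",
    "shipping-policy": "Standard shipping takes 5 to 7 business days.",
}

def retrieve(question):
    # Hypothetical retrieval: crude keyword overlap instead of vector search.
    def overlap(doc):
        return len(set(question.lower().split()) & set(doc.lower().split()))
    return max(DOCUMENTS.values(), key=overlap)

def answer(question):
    context = retrieve(question)
    bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")
    response = bedrock.converse(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",  # assumed example model ID
        messages=[{
            "role": "user",
            "content": [{"text": f"Using only this context:\n{context}\n\nAnswer: {question}"}],
        }],
    )
    return response["output"]["message"]["content"][0]["text"]

print(answer("How long do refunds take?"))
```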
Review questions
- A department wants an assistant that can draft text responses from policy documents, but the team does not want to train a model from scratch. Which concept best describes the likely starting point?
- Which statement best captures the practitioner-level meaning of transformer attention?
- A team needs to extract fields from scanned invoices. Which response shows the best service-selection mindset?