5.5 Transformations: ELT, Streams & Tasks, and Semi-Structured Data
Key Takeaways
- CTAS (CREATE TABLE AS SELECT) materializes a transformed result in one step; streams capture row-level changes (CDC) and tasks schedule/chain SQL for ELT.
- Snowpark lets you write transformations in Python, Java, or Scala that push down to Snowflake compute; UDFs and stored procedures extend SQL.
- Semi-structured data loads into a VARIANT column; access elements with dot (col:field) or bracket notation and flatten arrays with the FLATTEN table function.
- Streams advance their offset only when consumed in a DML transaction; tasks can be scheduled by CRON or triggered when a stream has data.
ELT Building Blocks: CTAS, Streams, and Tasks
Snowflake favors ELT — load raw data first, then transform inside the database using its compute. The core building blocks:
CTAS (CREATE TABLE AS SELECT) materializes the result of a transformation query into a new table in a single statement: CREATE TABLE sales_clean AS SELECT ... FROM sales_raw WHERE .... It is the simplest way to persist a transformed dataset and is common for batch rebuilds.
Streams implement change data capture (CDC). A stream is an object that records the row-level changes (inserts, updates, deletes) made to a source table since the stream was last read, exposing metadata columns METADATA$ACTION, METADATA$ISUPDATE, and METADATA$ROW_ID. A stream's offset advances only when it is consumed inside a DML statement (e.g., an INSERT ... SELECT FROM stream), so you process each change exactly once.
Tasks schedule and orchestrate SQL. A task runs a single SQL statement, a call to a stored procedure, or Snowpark code, on a CRON/interval schedule or after a predecessor task (building a DAG). A task can also be gated with WHEN SYSTEM$STREAM_HAS_DATA('my_stream') so it only runs when there are changes — the canonical stream + task incremental-ELT pattern.
Procedural Logic: Stored Procedures, UDFs, and Snowpark
SQL alone can't express loops, branching, or error handling — these extend it:
| Construct | Returns | Typical use |
|---|---|---|
| UDF (user-defined function) | a value (scalar) or a table (UDTF) | Reusable computation inside a query; SQL, JavaScript, Python, Java |
| Stored procedure | controls flow; can run DML/DDL and return a status | Procedural ELT, multi-step logic, transactions |
| Snowpark | a DataFrame transformed in Python/Java/Scala | Programmatic pipelines that push down to Snowflake |
A UDF is called within a query and is generally deterministic and side-effect-free — it computes and returns. A stored procedure is called on its own (CALL proc()), can execute DML/DDL and contain control flow, and is what a task usually invokes for multi-step work.
Snowpark is Snowflake's developer framework: you write DataFrame-style code in Python, Java, or Scala, and the operations are translated to SQL and executed on Snowflake's compute (no data movement to a client). Python UDFs and stored procedures run in Snowflake's secure sandbox and can use packages from the Anaconda channel. For the exam, know that Snowpark pushes computation down to the warehouse rather than pulling data out.
Querying Semi-Structured Data
Snowflake natively stores semi-structured data (JSON, Avro, ORC, Parquet, XML) in the VARIANT data type — a single column that can hold nested objects and arrays. You typically load raw JSON into a VARIANT column, then query into it without a fixed schema.
Access paths use two notations:
- Dot notation for object fields:
src:customer.name - Bracket notation for keys/array elements:
src['customer']['name']andsrc:items[0]:price
Values come back as VARIANT, so cast them to a typed value with ::, e.g. src:order_total::number. Object keys are case-sensitive in path expressions — a frequent trap.
To turn an array into rows, use the FLATTEN table function with a LATERAL join: SELECT f.value:sku::string FROM orders, LATERAL FLATTEN(input => orders.src:items) f. FLATTEN explodes each array element into its own row, exposing VALUE, INDEX, KEY, and PATH columns. This LATERAL FLATTEN pattern is how you normalize nested JSON into relational rows — a heavily tested transformation skill.
Loading Semi-Structured Data and Useful Functions
Semi-structured data usually enters Snowflake through COPY INTO from a stage. A common pattern loads the whole JSON document into a single VARIANT column with a file format of TYPE = JSON, then transforms it with SQL afterward — pure ELT. Snowflake also supports automatic schema detection (INFER_SCHEMA) and direct loading of columnar formats like Parquet into typed columns.
Beyond VARIANT, Snowflake has the related OBJECT (key-value) and ARRAY types. Helpful functions to know:
| Function | Purpose |
|---|---|
PARSE_JSON(str) | Convert a JSON string to VARIANT |
TO_JSON(variant) | Serialize VARIANT back to a JSON string |
GET(variant, 'k') / GET_PATH() | Programmatic element access |
OBJECT_KEYS(obj) | List the keys in an OBJECT |
ARRAY_SIZE(arr) | Count elements in an array |
IS_<type>() / TYPEOF() | Test or report a VARIANT's underlying type |
A missing path returns SQL NULL rather than erroring, which is why robust transformations cast explicitly and handle nulls. Because VARIANT values are untyped until cast, always apply :: to land a clean relational column.
Stream Types and Task Orchestration Details
Streams come in variants worth recognizing: a standard (delta) stream tracks inserts, updates, and deletes; an append-only stream tracks inserts only (cheaper, ideal for insert-heavy ingestion); and insert-only streams exist for external tables. Querying a stream is non-destructive — the change set is materialized only when the stream is consumed in a DML statement, which advances its offset.
Tasks can be serverless (Snowflake-managed compute, sized via USER_TASK_MANAGED_INITIAL_WAREHOUSE_SIZE) or user-managed (run on a named warehouse you specify). They schedule via SCHEDULE = 'USING CRON ...' or an interval, and chain into a DAG using AFTER predecessor_task. A root task must be resumed (ALTER TASK ... RESUME) for the DAG to run; tasks are created suspended.
The end-to-end continuous ELT pipeline the exam expects you to assemble:
- Snowpipe / COPY loads raw files into a landing table.
- A stream on the landing table captures new rows (CDC).
- A task, gated by
SYSTEM$STREAM_HAS_DATA, runs on a schedule, consuming the stream andMERGE-ing changes into a curated table — often via a stored procedure for multi-step logic.
This stream-plus-task pattern, combined with CTAS for full rebuilds and Snowpark/UDFs for custom logic, is the complete transformation toolkit tested in this domain.
You need to process only the rows that changed in a source table since the last run, exactly once, as part of an incremental pipeline. Which Snowflake object provides this?
Which statement about Snowpark is correct for the SnowPro Core exam?
A VARIANT column 'src' holds JSON with an array under the key 'items'. Which approach turns each array element into its own row?
In the canonical incremental-ELT pattern, what triggers a task to run only when new changes exist?