5.5 Transformations: ELT, Streams & Tasks, and Semi-Structured Data

Key Takeaways

CTAS (CREATE TABLE AS SELECT) materializes a transformed result in one step; streams capture row-level changes (CDC) and tasks schedule/chain SQL for ELT.
Snowpark lets you write transformations in Python, Java, or Scala that push down to Snowflake compute; UDFs and stored procedures extend SQL.
Semi-structured data loads into a VARIANT column; access elements with dot (col:field) or bracket notation and flatten arrays with the FLATTEN table function.
Streams advance their offset only when consumed in a DML transaction; tasks can be scheduled by CRON or triggered when a stream has data.

Last updated: June 2026

ELT Building Blocks: CTAS, Streams, and Tasks

Snowflake favors ELT — load raw data first, then transform inside the database using its compute. The core building blocks:

CTAS (CREATE TABLE AS SELECT) materializes the result of a transformation query into a new table in a single statement: CREATE TABLE sales_clean AS SELECT ... FROM sales_raw WHERE .... It is the simplest way to persist a transformed dataset and is common for batch rebuilds.

Streams implement change data capture (CDC). A stream is an object that records the row-level changes (inserts, updates, deletes) made to a source table since the stream was last read, exposing metadata columns METADATA$ACTION, METADATA$ISUPDATE, and METADATA$ROW_ID. A stream's offset advances only when it is consumed inside a DML statement (e.g., an INSERT ... SELECT FROM stream), so you process each change exactly once.

Tasks schedule and orchestrate SQL. A task runs a single SQL statement, a call to a stored procedure, or Snowpark code, on a CRON/interval schedule or after a predecessor task (building a DAG). A task can also be gated with WHEN SYSTEM$STREAM_HAS_DATA('my_stream') so it only runs when there are changes — the canonical stream + task incremental-ELT pattern.

Procedural Logic: Stored Procedures, UDFs, and Snowpark

SQL alone can't express loops, branching, or error handling — these extend it:

Construct	Returns	Typical use
UDF (user-defined function)	a value (scalar) or a table (UDTF)	Reusable computation inside a query; SQL, JavaScript, Python, Java
Stored procedure	controls flow; can run DML/DDL and return a status	Procedural ELT, multi-step logic, transactions
Snowpark	a DataFrame transformed in Python/Java/Scala	Programmatic pipelines that push down to Snowflake

A UDF is called within a query and is generally deterministic and side-effect-free — it computes and returns. A stored procedure is called on its own (CALL proc()), can execute DML/DDL and contain control flow, and is what a task usually invokes for multi-step work.

Snowpark is Snowflake's developer framework: you write DataFrame-style code in Python, Java, or Scala, and the operations are translated to SQL and executed on Snowflake's compute (no data movement to a client). Python UDFs and stored procedures run in Snowflake's secure sandbox and can use packages from the Anaconda channel. For the exam, know that Snowpark pushes computation down to the warehouse rather than pulling data out.

Querying Semi-Structured Data

Snowflake natively stores semi-structured data (JSON, Avro, ORC, Parquet, XML) in the VARIANT data type — a single column that can hold nested objects and arrays. You typically load raw JSON into a VARIANT column, then query into it without a fixed schema.

Access paths use two notations:

Dot notation for object fields: src:customer.name
Bracket notation for keys/array elements: src['customer']['name'] and src:items[0]:price

Values come back as VARIANT, so cast them to a typed value with ::, e.g. src:order_total::number. Object keys are case-sensitive in path expressions — a frequent trap.

To turn an array into rows, use the FLATTEN table function with a LATERAL join: SELECT f.value:sku::string FROM orders, LATERAL FLATTEN(input => orders.src:items) f. FLATTEN explodes each array element into its own row, exposing VALUE, INDEX, KEY, and PATH columns. This LATERAL FLATTEN pattern is how you normalize nested JSON into relational rows — a heavily tested transformation skill.

Loading Semi-Structured Data and Useful Functions

Semi-structured data usually enters Snowflake through COPY INTO from a stage. A common pattern loads the whole JSON document into a single VARIANT column with a file format of TYPE = JSON, then transforms it with SQL afterward — pure ELT. Snowflake also supports automatic schema detection (INFER_SCHEMA) and direct loading of columnar formats like Parquet into typed columns.

Beyond VARIANT, Snowflake has the related OBJECT (key-value) and ARRAY types. Helpful functions to know:

Function	Purpose
`PARSE_JSON(str)`	Convert a JSON string to VARIANT
`TO_JSON(variant)`	Serialize VARIANT back to a JSON string
`GET(variant, 'k')` / `GET_PATH()`	Programmatic element access
`OBJECT_KEYS(obj)`	List the keys in an OBJECT
`ARRAY_SIZE(arr)`	Count elements in an array
`IS_<type>()` / `TYPEOF()`	Test or report a VARIANT's underlying type

A missing path returns SQL NULL rather than erroring, which is why robust transformations cast explicitly and handle nulls. Because VARIANT values are untyped until cast, always apply :: to land a clean relational column.

Stream Types and Task Orchestration Details

Streams come in variants worth recognizing: a standard (delta) stream tracks inserts, updates, and deletes; an append-only stream tracks inserts only (cheaper, ideal for insert-heavy ingestion); and insert-only streams exist for external tables. Querying a stream is non-destructive — the change set is materialized only when the stream is consumed in a DML statement, which advances its offset.

Tasks can be serverless (Snowflake-managed compute, sized via USER_TASK_MANAGED_INITIAL_WAREHOUSE_SIZE) or user-managed (run on a named warehouse you specify). They schedule via SCHEDULE = 'USING CRON ...' or an interval, and chain into a DAG using AFTER predecessor_task. A root task must be resumed (ALTER TASK ... RESUME) for the DAG to run; tasks are created suspended.

The end-to-end continuous ELT pipeline the exam expects you to assemble:

Snowpipe / COPY loads raw files into a landing table.
A stream on the landing table captures new rows (CDC).
A task, gated by SYSTEM$STREAM_HAS_DATA, runs on a schedule, consuming the stream and MERGE-ing changes into a curated table — often via a stored procedure for multi-step logic.

This stream-plus-task pattern, combined with CTAS for full rebuilds and Snowpark/UDFs for custom logic, is the complete transformation toolkit tested in this domain.

Test Your Knowledge

You need to process only the rows that changed in a source table since the last run, exactly once, as part of an incremental pipeline. Which Snowflake object provides this?

A materialized view

A clustering key

The result cache

A stream (change data capture)

Test Your Knowledge

Which statement about Snowpark is correct for the SnowPro Core exam?

It exports all data to a local Python client before processing.

It only supports SQL and cannot use Python or Java.

It lets you write DataFrame-style code in Python, Java, or Scala that is pushed down and executed on Snowflake compute.

It replaces virtual warehouses with a separate serverless engine.

Test Your Knowledge

A VARIANT column 'src' holds JSON with an array under the key 'items'. Which approach turns each array element into its own row?

Use LATERAL FLATTEN(input => src:items) and select from the flattened output

Apply a clustering key to the src column

Cast src to a string with src::string

Create a materialized view joining src to itself

Test Your Knowledge

In the canonical incremental-ELT pattern, what triggers a task to run only when new changes exist?

Setting the warehouse scaling policy to ECONOMY

Gating the task with WHEN SYSTEM$STREAM_HAS_DATA on the stream

Enabling the result cache on the target table

Adding a clustering key to the source table

Up Next

6.1 Secure Data Sharing Architecture

Chapter 6: Data Collaboration

SnowPro Core Study Guide

SnowPro Core Certification

5.5 Transformations: ELT, Streams & Tasks, and Semi-Structured Data

Key Takeaways

ELT Building Blocks: CTAS, Streams, and Tasks

Procedural Logic: Stored Procedures, UDFs, and Snowpark

Querying Semi-Structured Data

Loading Semi-Structured Data and Useful Functions

Stream Types and Task Orchestration Details

SnowPro Core Study Guide

1Chapter 1: SnowPro Core Orientation and Exam Strategy

2Chapter 2: Snowflake AI Data Cloud Features and Architecture

3Chapter 3: Account Management and Data Governance

4Chapter 4: Data Loading, Unloading, and Connectivity

5Chapter 5: Performance Optimization, Querying, and Transformation

6Chapter 6: Data Collaboration

7Chapter 7: Final Review and Test Day

SnowPro Core Certification

5.5 Transformations: ELT, Streams & Tasks, and Semi-Structured Data

Key Takeaways

ELT Building Blocks: CTAS, Streams, and Tasks

Procedural Logic: Stored Procedures, UDFs, and Snowpark

Querying Semi-Structured Data

Loading Semi-Structured Data and Useful Functions

Stream Types and Task Orchestration Details