A user can select from a variety of materialization configurations when writing a dbt model. These configurations tell dbt how the results of the model query should be instantiated in the underlying database. Though this strives to be a unified interface to database abstractions, setups hitting performance constraints will generally require applying a bit of database-specific configuration.
To understand materializations, one must be familiar with the concepts of a table, a view, and a materialized view.
- Table - Created using
CREATE TABLE
, this entity stores the data directly into the database. Must be refreshed by the user when upstream data changes. - View - Created using a query, this entity is a virtual table that evaluates its contents at query time. This has the benefit of automatically including upstream data changes, but can be slow.
- Materialized View - Created using a query, this entity is a view with database-implemented performance optimization via pre-computing of the view. This provides the performance benefit of using cached data, with the ability to automatically include upstream changes.
Let’s take a look at how these concepts are applied in dbt.
We’ll use the NBA dataset loaded into Snowflake from my previous post for this example. Define a simple dbt model as follows:
{{ config(materialized='table') }}
select player_name, date_trunc('WEEK', game_date) period, avg(pts) avg_pts
from player_game_logs
group by period, player_name
order by player_name, period
The generated SQL shows dbt wrapping our query in a creation statement:
create or replace transient table dbt_snowflake_demo.PUBLIC.points_by_week
as
(
select player_name, date_trunc('WEEK', game_date) period, avg(pts) avg_pts
from player_game_logs
group by period, player_name
order by player_name, period
);
Note that by default dbt uses transient tables on Snowflake. Switching the config to view shows the creation statement is modified to create a view:
create or replace view dbt_snowflake_demo.PUBLIC.points_by_week
as (...);
These materialization methods are straightforward, but will require the user to think about keeping the model up-to-date in the table case, and potentially be inefficient in the view case. dbt handles this by exposing the “incremental” and “materialized view” materialization modes.
The incremental materialization mode exposes additional configurations that inform dbt how to filter the incoming tables for new or updated records The syntax for enabling incremental materialization uses the Jinja functionality of dbt.
{{ config(
materialized='incremental',
unique_key=['player_name', 'period']
)
}}
select player_name, date_trunc('WEEK', game_date) period, avg(pts) avg_pts
from player_game_logs
-- Tell dbt to only select new records when running an incremental build.
{% if is_incremental() %}
where game_date > (select max(period) from {{ this }})
{% endif %}
group by period, player_name
order by player_name, period
When running an incremental build, dbt will apply the filter, store the intermediate results in a temporary table, and also generate a merge that updates the model’s table:
begin;
merge into dbt_snowflake_demo.PUBLIC.points_by_week as DBT_INTERNAL_DEST
using dbt_snowflake_demo.PUBLIC.points_by_week__dbt_tmp as DBT_INTERNAL_SOURCE
on (
DBT_INTERNAL_SOURCE.player_name = DBT_INTERNAL_DEST.player_name
) and (
DBT_INTERNAL_SOURCE.period = DBT_INTERNAL_DEST.period
)
when matched then update set
"PLAYER_NAME" = DBT_INTERNAL_SOURCE."PLAYER_NAME","PERIOD" = DBT_INTERNAL_SOURCE."PERIOD","AVG_PTS" = DBT_INTERNAL_SOURCE."AVG_PTS"
when not matched then insert
("PLAYER_NAME", "PERIOD", "AVG_PTS")
values
("PLAYER_NAME", "PERIOD", "AVG_PTS");
commit;
In the incremental case, the user is then left to determine the appropriate time to run incremental builds of the model, and update the underlying table.
Snowflake’s materialized views are not currently supported by dbt. However, there is support for a similar Snowflake feature - dynamic tables. Dynamic tables require a refresh period and a Snowflake warehouse to be supplied. We can enable dynamic tables like so:
{{ config(
materialized = 'dynamic_table',
snowflake_warehouse = 'compute_wh',
target_lag = '10 minutes',
) }}
select player_name, date_trunc('WEEK', game_date) period, avg(pts) avg_pts
from player_game_logs
group by period, player_name
order by player_name, period
The generated SQL now shows dbt taking advantage of this feature:
create or replace dynamic table dbt_snowflake_demo.PUBLIC.points_by_week_dynamic
target_lag = '10 minutes'
warehouse = compute_wh
as (
select player_name, date_trunc('WEEK', game_date) period, avg(pts) avg_pts
from player_game_logs
group by period, player_name
order by player_name, period
);
alter dynamic table dbt_snowflake_demo.PUBLIC.points_by_week_dynamic refresh
This will provide a queryable materialized table that is kept up to date via the Snowflake-managed refresh process.
The last materialization type supported by dbt at time of writing is the “ephemeral” materialization. This is a special materialization that will not create any database objects. The purpose of this materialization is to provide a reusable model reference for other models, improving code reusability while not incurring any overhead in terms of data operations. Switching the configuration of our model to ephemeral, and using this model in another model, we can see that dbt generates a CTE for working with the ephemeral model:
create or replace view dbt_snowflake_demo.PUBLIC.points_by_week_consumer
as (
with __dbt__cte__points_by_week as (
select player_name, date_trunc('WEEK', game_date) period, avg(pts) avg_pts
from player_game_logs
group by period, player_name
order by player_name, period
) select * from __dbt__cte__points_by_week
);
In this post we covered a variety of materialization mechanisms. Generally, it is best not to prematurely optimize. I recommend starting your dbt work using view/table materializations, then being selective about converting to materialized or incremental builds as performance issues arise.