Traditionally, dbt models are defined using SQL. But in early 2022, dbt Labs made an internal push to extend the dbt framework to support Python models. In dbt Labs’ own words:
For teams that are doing analytics work today (all in SQL), and who have aspirations of doing data science / ML work (in Python) tomorrow, dbt can make the shift even easier. No new tools necessary.
dbt accomplishes Python support using data warehouse’s built-in support for multi-language analysis. For instance Snowflake offers Snowpark, a compute environment for Python, Java, and Scala. Snowpark users can use Python to work with data in Snowpark without ever having to export data from Snowflake.
In this post, we’ll take a look at using Snowpark directly, then see how we are able to integrate code directly into our dbt ecosystem.
For this example, we’ll use the NBA data I loaded into Snowflake in my previous post. To load the data, instantiate the Snowpark session and call the table function. In this format, we can work with the data directly using the Snowpark DataFrame API, or copy and convert the data to native pandas. A basic aggregation for both methods is shown below:
from snowflake.snowpark import Session
from snowflake.snowpark.functions import avg, date_trunc
USER = ""
ACCOUNT = ""
PASSWORD = ""
session = Session.builder.configs(
{
"account": ACCOUNT,
"user": USER,
"password": PASSWORD,
"database": "DBT_SNOWFLAKE_DEMO",
"schema": "public",
}
).create()
df = session.table("player_game_logs")
# This will work with the data directly on Snowflake.
player_points_by_week = df.group_by("PLAYER_NAME", date_trunc("WEEK", "GAME_DATE")).agg(
avg("PTS").alias("AVG_PTS")
)
# This will incur a copy on to_pandas() call.
pdf = df.to_pandas()
pplayer_points_by_week = pdf.groupby(
[pdf["PLAYER_NAME"], pdf["GAME_DATE"].dt.to_period("W")]
)["PTS"].mean()
dbt Python models are Python code snippets that are run on the data warehouse. For Snowflake, this is implemented using the procedures feature. The dbt framework will take your model, wrap it in code that provides the dbt and framework objects, and create a procedure entity on Snowflake. Evaluating the model is done by invoking the stored procedure. Python models can include a variety of packages provided by default in Snowflake. By selecting from the information_schema, we can see that there are about 1800 available packages:
phildakin#COMPUTE_WH@DBT_SNOWFLAKE_DEMO.PUBLIC>select count(distinct package_name) from information_schema.packages where language = 'python';
+------------------------------+
| COUNT(DISTINCT PACKAGE_NAME) |
|------------------------------|
| 1778 |
+------------------------------+
Define a Python model in dbt by creating a .py
file in the models/
directory containing a model function:
def model(dbt, session):
# Note - pyarrow package is needed to use to_pandas() call below on Snowflake.
dbt.config(materialized="table", packages=["pyarrow"])
df = session.table("player_game_logs")
pdf = df.to_pandas()
pplayer_points_by_week = (
pdf.groupby([pdf["PLAYER_NAME"], pdf["GAME_DATE"].dt.to_period("W")])["PTS"]
.mean()
.to_frame()
)
return pplayer_points_by_week
When run, this model will behave exactly as other dbt models.
This dbt integration allows dbt to work in mixed-language data science environments.