Databricks is a ~$40B company built around the open-source distributed computation engine Apache Spark. Their core offering is a high-level interface that allows organizations to utilize Spark without incurring the operational costs of managing Spark clusters. Databricks’ clusters use the Databricks Runtime (DBR), a fork of Spark that is API-compliant while offering improved security, optimized IO, GPU extensions, and raw engine performance improvements. Part of the broader data lakehouse initiative at Databricks, the Photon project provides a high-performance operator framework that integrates with the DBR to enable warehouse-like performance on simple data lakes. In 2021, Photon enabled Databricks to set the world record for the 100 TB TPC-DS benchmark, an industry-standard OLAP evaluation.

Motivation

Photon was built to support Databricks’ data lakehouse architecture. The motivating idea of the lakehouse architecture is to provide the performance of a data warehouse while retaining the flexibility of using a data lake as the primary storage mechanism. Before the lakehouse architecture, organizations generally needed a warehouse for their most intensive analytics workloads because data lakes could not meet their scale and performance requirements. Using a warehouse is expensive and puts organizations at risk of vendor lock-in.

The storage layer of Databricks’ lakehouse architecture is provided by the Delta Lake protocol, an interface layer that wraps raw Apache Parquet files with a transaction log and additional metadata handling. Delta Lake provides many additional features on top of the raw data, including transactions, time travel, and metadata-based optimizations. Unfortunately, these IO optimizations are insufficient to support an effective lakehouse implementation when the query workloads are compute-bound.

To handle this issue, Databricks introduced Photon, a high-performance operator framework that integrates with the DBR to enable warehouse-like performance. The main design goals of the project were to:

Have excellent performance on data conforming to the Delta Lake protocol.
Have solid support for the raw, uncurated data often found in data lakes.
Support existing Spark APIs.

Design

As Databricks has introduced low-level storage-layer optimizations such as NVMe caching and auto-optimized shuffle, more of their workloads have become CPU-bound. A key contributing factor is that the DBR, like Spark itself, runs on the Java Virtual Machine (JVM). A JVM-based runtime is hard to optimize, requiring deep knowledge of JVM internals. Additionally, it does not give developers direct control over lower-level optimizations such as hand-written SIMD code. Due to these and other limitations, Databricks decided to implement the Photon runtime entirely in C++, and to integrate it with the DBR as a shared library that communicates through the Java Native Interface (JNI).

Next, the team needed to decide whether to stick with Apache Spark’s code-generation approach, or to take a vectorized approach. Code generation allows an engine to avoid branching and virtual function call overhead by coalescing multiple steps of the execution plan into a single generated function, and invoking the compiler on that generated function at runtime. In contrast, the vectorized approach retains the use of virtual function calls but defines the operator interface to accept batches of records, which amortizes the cost of each call over many rows. Ultimately they decided to go with a vectorized approach because it is easier to develop, makes per-operator observability more straightforward, and allows runtime adaptations at the level of individual batches.
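To make the vectorized model concrete, here is a minimal sketch of a batch-at-a-time operator interface. The names (ColumnBatch, Operator, RangeScan, AddConstant, SumAll) are hypothetical and chosen for illustration; this is not Photon's actual interface, only an example of paying one virtual call per batch rather than one per row.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical batch type: a single column of 64-bit integers.
struct ColumnBatch {
    std::vector<long long> values;
};

// Pull-based operator interface. The virtual call happens once per batch,
// so its cost is spread across every row in that batch.
class Operator {
public:
    virtual ~Operator() = default;
    // Fills `out` with the next batch; returns false when input is exhausted.
    virtual bool NextBatch(ColumnBatch& out) = 0;
};

// Leaf operator that emits the integers [0, limit) in fixed-size batches.
class RangeScan : public Operator {
public:
    RangeScan(long long limit, std::size_t batch_size)
        : limit_(limit), batch_size_(batch_size) {
        assert(batch_size > 0);
    }

    bool NextBatch(ColumnBatch& out) override {
        out.values.clear();
        while (next_ < limit_ && out.values.size() < batch_size_) {
            out.values.push_back(next_++);
        }
        return !out.values.empty();
    }

private:
    long long limit_;
    std::size_t batch_size_;
    long long next_ = 0;
};

// Adds a constant to every value of each batch produced by its child.
class AddConstant : public Operator {
public:
    AddConstant(std::unique_ptr<Operator> child, long long delta)
        : child_(std::move(child)), delta_(delta) {}

    bool NextBatch(ColumnBatch& out) override {
        if (!child_->NextBatch(out)) return false;
        for (long long& v : out.values) v += delta_;  // tight loop, no virtual calls
        return true;
    }

private:
    std::unique_ptr<Operator> child_;
    long long delta_;
};

// Drains an operator and sums everything it produces.
long long SumAll(Operator& op) {
    ColumnBatch batch;
    long long total = 0;
    while (op.NextBatch(batch)) {
        for (long long v : batch.values) total += v;
    }
    return total;
}
```

Because each operator sees a whole batch, it is also a natural place to collect per-operator metrics or to switch strategy from one batch to the next, which are the observability and adaptivity benefits mentioned above.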

Then, the team decided to utilize a columnar representation of the data as opposed to Spark’s row-oriented native format. This decision brings a myriad of benefits including easier integration with SIMD instructions, better cache utilization, and potentially eliminating an expensive pivot operation if the legacy Spark engine is avoided entirely.
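The difference between the two layouts can be sketched as follows. The types and field names here (Row, ColumnarBatch, id, price, quantity) are made up for illustration and do not reflect Spark's or Photon's internal formats. The sketch shows why a scan over one field touches contiguous memory in the columnar layout, and what the row-to-column pivot involves.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Row-oriented layout: the fields of one record are adjacent in memory.
struct Row {
    std::int64_t id;
    double price;
    std::int32_t quantity;
};

// Column-oriented layout: the values of one field are adjacent in memory.
struct ColumnarBatch {
    std::vector<std::int64_t> id;
    std::vector<double> price;
    std::vector<std::int32_t> quantity;
};

// The pivot from rows to columns copies every field of every record.
// This is the conversion a fully columnar pipeline can avoid.
ColumnarBatch PivotToColumns(const std::vector<Row>& rows) {
    ColumnarBatch batch;
    batch.id.reserve(rows.size());
    batch.price.reserve(rows.size());
    batch.quantity.reserve(rows.size());
    for (const Row& r : rows) {
        batch.id.push_back(r.id);
        batch.price.push_back(r.price);
        batch.quantity.push_back(r.quantity);
    }
    return batch;
}

// Row layout: the loop strides over whole records to reach each price.
double SumPriceRows(const std::vector<Row>& rows) {
    double total = 0.0;
    for (const Row& r : rows) total += r.price;
    return total;
}

// Columnar layout: the loop reads one dense array, which is friendlier to
// caches and easier for a compiler to turn into SIMD code.
double SumPriceColumns(const ColumnarBatch& batch) {
    assert(batch.price.size() == batch.id.size());
    double total = 0.0;
    for (double p : batch.price) total += p;
    return total;
}
```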

Finally, the team decided that Photon must support partial rollout. This means that Photon operates within the query plan, and any given plan can be composed of a mix of Photon and non-Photon operators. This allows Photon to be rolled out incrementally, which is important to ensure that new Spark features are picked up in a timely manner, and that Photon reaches production workloads safely.

Implementation

Photon passes columnar data vectors between individual operators, along with information indicating row nullability, and whether the row is active (not filtered) in the current query context. Operations on these column vectors are done by execution kernels, which are optimized code sections accepting and outputting column vectors. For example, a square-root kernel takes an input vector, a null indicator per row, and a list of active row positions, and writes a result only for rows that are both active and non-null.
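A kernel of this kind might be sketched as follows. This is an illustration written for this post rather than Photon's source code: the names SquareRootKernel, SquareRoot, pos_list, kHasNulls, and kAllRowsActive are assumptions, as is the convention that a non-zero byte in nulls marks a NULL row.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// kHasNulls and kAllRowsActive are compile-time constants, so the branches on
// them are resolved by the compiler and disappear from each specialization.
//   pos_list: indices of active rows (ignored when kAllRowsActive is true)
//   num_rows: number of rows to process (active rows, or all rows)
//   nulls:    non-zero means NULL (ignored when kHasNulls is false)
template <bool kHasNulls, bool kAllRowsActive>
void SquareRootKernel(const std::int32_t* pos_list, int num_rows,
                      const double* input, const std::int8_t* nulls,
                      double* result) {
    for (int i = 0; i < num_rows; ++i) {
        const int row_idx = kAllRowsActive ? i : pos_list[i];
        if (!kHasNulls || !nulls[row_idx]) {
            result[row_idx] = std::sqrt(input[row_idx]);
        }
    }
}

// Picks a specialization once per batch from batch-level metadata. Here the
// position list is always supplied, so only the null handling is specialized.
void SquareRoot(const std::vector<std::int32_t>& pos_list,
                const std::vector<double>& input,
                const std::vector<std::int8_t>& nulls, bool has_nulls,
                std::vector<double>& result) {
    assert(result.size() == input.size());
    const int n = static_cast<int>(pos_list.size());
    if (has_nulls) {
        SquareRootKernel<true, false>(pos_list.data(), n, input.data(),
                                      nulls.data(), result.data());
    } else {
        SquareRootKernel<false, false>(pos_list.data(), n, input.data(),
                                       nulls.data(), result.data());
    }
}
```

When a batch has no NULLs and no filtered rows, the `<false, true>` specialization reduces to a plain loop over a dense array, which is the easiest case for a compiler to vectorize.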

Databricks is able to verify that the kernels compile to the intended SIMD instructions, and uses C++ templates to eliminate unnecessary branches at compile time.

Aggregation and join operators in particular have been optimized through the introduction of a vectorized hash table. In a vectorized hash table, hashing, probing, and memory accesses are each performed over a whole batch of keys at a time, so they can benefit from SIMD instructions and from memory accesses that proceed in parallel. This has a major effect on join and aggregation performance, and is a major contributor to the TPC-DS record set by Databricks.
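The batch-oriented structure of such a table can be sketched in portable C++. This is an assumption-laden illustration, not Photon's hash table: BatchHashTable, its open-addressing scheme with linear probing, and the multiplicative hash are all choices made for this example, and the code relies on the compiler rather than hand-written SIMD. What it does show is the separation into a hashing pass over every key followed by probing passes over the rows that are still unresolved.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

class BatchHashTable {
public:
    // Capacity must be a power of two so that `hash & mask_` selects a slot.
    explicit BatchHashTable(std::size_t capacity_pow2)
        : mask_(capacity_pow2 - 1), keys_(capacity_pow2),
          values_(capacity_pow2), used_(capacity_pow2, 0) {
        assert(capacity_pow2 > 0 && (capacity_pow2 & mask_) == 0);
    }

    void Insert(std::int64_t key, std::int64_t value) {
        assert(size_ < mask_);  // keep one slot empty so probes terminate
        std::size_t slot = Hash(key) & mask_;
        while (used_[slot] && keys_[slot] != key) slot = (slot + 1) & mask_;
        if (!used_[slot]) ++size_;
        keys_[slot] = key;
        values_[slot] = value;
        used_[slot] = 1;
    }

    // Looks up a whole batch. found[i] is 1 if keys[i] is present, in which
    // case out[i] holds its value.
    void ProbeBatch(const std::vector<std::int64_t>& keys,
                    std::vector<std::int64_t>& out,
                    std::vector<std::uint8_t>& found) const {
        const std::size_t n = keys.size();
        out.assign(n, 0);
        found.assign(n, 0);

        // Pass 1: hash every key in one tight, branch-free loop.
        std::vector<std::size_t> slots(n);
        for (std::size_t i = 0; i < n; ++i) slots[i] = Hash(keys[i]) & mask_;

        // Pass 2+: compare all pending rows, keep only collisions for retry.
        std::vector<std::size_t> pending(n);
        for (std::size_t i = 0; i < n; ++i) pending[i] = i;
        while (!pending.empty()) {
            std::size_t kept = 0;
            for (std::size_t p = 0; p < pending.size(); ++p) {
                const std::size_t i = pending[p];
                const std::size_t s = slots[i];
                if (!used_[s]) continue;  // empty slot: key is absent
                if (keys_[s] == keys[i]) {
                    out[i] = values_[s];
                    found[i] = 1;
                    continue;
                }
                slots[i] = (s + 1) & mask_;  // collision: try the next slot
                pending[kept++] = i;
            }
            pending.resize(kept);
        }
    }

private:
    static std::size_t Hash(std::int64_t key) {
        std::uint64_t x = static_cast<std::uint64_t>(key);
        x *= 0x9E3779B97F4A7C15ull;
        return static_cast<std::size_t>(x ^ (x >> 32));
    }

    std::size_t mask_;
    std::size_t size_ = 0;
    std::vector<std::int64_t> keys_;
    std::vector<std::int64_t> values_;
    std::vector<std::uint8_t> used_;
};
```

Splitting the work this way means the slot loads for different rows in a batch are independent of one another, so the hardware can overlap the resulting cache misses instead of waiting on each lookup in turn.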

In the data lake setting, input data is often uncurated and missing valuable metadata that can inform the query engine about the most efficient way to process a query. To handle this, the Photon implementation performs batch-level adaptations to the data. For instance, Photon can choose to compact column vectors that are sparse due to extensive filtering. This can greatly improve the efficiency of operations such as a hash table lookup, since it better exploits memory parallelism via SIMD instructions.
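Batch compaction of this kind can be sketched as follows. The Batch type, the density measure, and the threshold parameter are hypothetical; the source does not say how Photon decides when to compact, so the rule below is only an assumed example of a batch-level adaptation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical batch: one column plus the positions of rows that survived
// earlier filters (ascending indices into `values`).
struct Batch {
    std::vector<double> values;
    std::vector<std::int32_t> pos_list;
};

// Fraction of rows that are still active. An empty batch counts as dense.
double Density(const Batch& batch) {
    if (batch.values.empty()) return 1.0;
    return static_cast<double>(batch.pos_list.size()) /
           static_cast<double>(batch.values.size());
}

// Gathers the active rows into a dense vector; afterwards every row is active.
Batch Compact(const Batch& in) {
    Batch out;
    out.values.reserve(in.pos_list.size());
    out.pos_list.reserve(in.pos_list.size());
    for (std::size_t i = 0; i < in.pos_list.size(); ++i) {
        const std::int32_t src = in.pos_list[i];
        assert(src >= 0 && static_cast<std::size_t>(src) < in.values.size());
        out.values.push_back(in.values[src]);
        out.pos_list.push_back(static_cast<std::int32_t>(i));
    }
    return out;
}

// Assumed policy: compact only when the batch is sparser than `threshold`,
// since the copy is wasted work on batches that are already mostly dense.
Batch MaybeCompact(const Batch& in, double threshold) {
    return Density(in) < threshold ? Compact(in) : in;
}
```

The decision is made per batch at runtime, so it needs no table statistics, which is what makes it useful for uncurated data.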

Photon is enabled in the Databricks Runtime (DBR) by using Spark’s optimizer, Catalyst, to replace query plan nodes that have Photon equivalents when applicable. This creates a mixed plan, containing both legacy Spark and Photon operators. This replacement is done in a bottom-up fashion, starting from the scan, and stops once the first non-Photon operator in the plan is encountered. Nodes in the middle of the plan are not replaced, because switching modes between Spark and Photon requires a pivot operator that converts Photon’s columnar format to Spark’s row format. For instance, if a plan scans a file, filters it, aggregates the result, and then applies an operator that Photon does not support, the scan, filter, and aggregate become Photon operators, a single pivot follows the aggregate, and everything above it stays on Spark.
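The bottom-up conversion might be sketched as follows. Catalyst itself is written in Scala, and nothing here is its API: PlanNode, photon_supported, ConvertToPhoton, and the "Transition" node name are hypothetical stand-ins used to illustrate the rule that a node is converted only if it is supported and everything beneath it was converted too.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical plan node. `photon_supported` says whether a Photon
// equivalent exists; `is_photon` is set by the conversion pass.
struct PlanNode {
    PlanNode(std::string n, bool supported)
        : name(std::move(n)), photon_supported(supported) {}
    std::string name;
    bool photon_supported;
    bool is_photon = false;
    std::vector<std::unique_ptr<PlanNode>> children;
};

// Convenience builder for a node with zero or one child.
std::unique_ptr<PlanNode> Node(std::string name, bool supported,
                               std::unique_ptr<PlanNode> child = nullptr) {
    auto node = std::make_unique<PlanNode>(std::move(name), supported);
    if (child) node->children.push_back(std::move(child));
    return node;
}

// Converts bottom-up and returns true if `node` became a Photon operator.
// Where a Spark node sits directly above a Photon child, a Transition node
// (the column-to-row pivot) is inserted between them.
bool ConvertToPhoton(std::unique_ptr<PlanNode>& node) {
    assert(node != nullptr);
    bool all_children_photon = true;
    std::vector<bool> child_is_photon;
    for (auto& child : node->children) {
        const bool converted = ConvertToPhoton(child);
        child_is_photon.push_back(converted);
        all_children_photon = all_children_photon && converted;
    }
    if (node->photon_supported && all_children_photon) {
        node->is_photon = true;
        return true;
    }
    for (std::size_t i = 0; i < node->children.size(); ++i) {
        if (!child_is_photon[i]) continue;
        auto transition = std::make_unique<PlanNode>("Transition", false);
        transition->children.push_back(std::move(node->children[i]));
        node->children[i] = std::move(transition);
    }
    return false;
}

// Renders the plan from root to leaves; Photon operators are marked with '*'.
std::string Describe(const PlanNode& node) {
    std::string out = node.name + (node.is_photon ? "*" : "");
    if (!node.children.empty()) {
        out += "(";
        for (std::size_t i = 0; i < node.children.size(); ++i) {
            if (i > 0) out += ", ";
            out += Describe(*node.children[i]);
        }
        out += ")";
    }
    return out;
}
```

In this sketch a supported operator above an unsupported one stays on Spark, which keeps the plan to a single pivot per branch rather than converting back and forth.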

Since Photon operators execute in the broader context of a Spark query, it is critical that they share a unified view of memory. Photon is able to request memory reservations from the Spark memory manager, and participates in the same operator spill semantics. Integration here also requires the ability to transfer off-heap memory to on-heap memory for certain Spark operations, like broadcasting data to other executors.

Finally, Photon operators implement the same interfaces as native Spark operators, meaning they can export statistics for monitoring and for adaptive query execution. To ensure that all Photon operators behave the same as legacy Spark operators, a thorough testing suite was created.

Conclusion

Databricks is rewriting Apache Spark from the inside out in an effort to improve its ability to compete with Snowflake. I am curious whether Photon will be open-sourced at any point, or whether a community-oriented project similar to Photon will emerge. In general, the idea of accelerating composable data systems based on open standards (e.g. Spark) is something that will develop greatly over the next 5 years.

Accelerating Spark: Databricks Photon Runtime
