For decades, the ETL (Extract, Transform, Load) paradigm has dominated data engineering. This approach—extracting data from source systems, transforming it through a series of processing steps, and loading it into target systems for analysis—has been the backbone of data warehousing and business intelligence. Even as technology evolved from on-premises ETL tools to cloud-based ELT (Extract, Load, Transform), the fundamental concept remained: transform data before it’s queried.
But a quiet revolution is underway. Technologies such as Databricks’ Photon engine, Snowflake’s elastic compute platform and its Polaris catalog, and open-source engines like DuckDB are challenging this foundational paradigm with compute-on-query architectures that defer transformations until the moment data is needed.
This shift promises to fundamentally change how data teams work, potentially eliminating entire categories of data engineering tasks while creating new opportunities and challenges. Let’s explore how compute-on-query is reshaping the landscape and what it means for the future of data engineering.
Traditional data processing follows a pattern we’ve all come to know:
- Extract data from source systems
- Transform it into the desired format (either before or after loading)
- Load it into a target system for analysis
This approach emerged from technical limitations of earlier eras: limited storage, expensive compute, and the need to optimize for specific analytical queries. Pre-computing transformations made sense when compute was scarce and query patterns were predictable.
But this paradigm has always had significant drawbacks:
- **Rigidity:** Once a transformation pipeline is built, adding new data fields or changing transformation logic requires rebuilding processes. This creates friction between evolving business needs and established data pipelines.
- **Data duplication:** Teams often store multiple versions of the same data: raw, cleansed, transformed, and aggregated. Each layer consumes storage and introduces potential consistency issues.
- **Latency:** Traditional pipelines typically run on schedules (daily, hourly), creating latency between when events occur and when they’re available for analysis.
- **Maintenance burden:** ETL/ELT pipelines require continuous maintenance as schemas change, business rules evolve, and data volumes grow.
- **Constrained exploration:** Analysts are limited to working with precomputed transformations, restricting ad-hoc exploration and discovery.
Compute-on-query (also called query-time transformation) fundamentally changes this approach by pushing transformations to the moment of query execution. Rather than precomputing and storing transformed data, these systems apply transformations dynamically when a user or application queries the data.
Several key technological developments have made compute-on-query viable:
- **Vectorized execution:** Modern query engines process data in columnar batches rather than row by row, delivering order-of-magnitude performance improvements for analytical workloads.
- **Just-in-time compilation:** Compiling queries to optimized machine code eliminates interpretation overhead, making on-the-fly transformations much faster.
- **Adaptive optimization:** Modern optimizers can adapt execution plans based on data characteristics observed during query execution, rather than relying solely on pre-query statistics.
- **Elastic, massively parallel compute:** Cloud-native query engines can dynamically allocate massive parallel resources for individual queries, making complex transformations feasible at query time.
- **Intelligent caching:** Sophisticated caching of intermediate results balances the flexibility of on-demand computation with performance optimization.
Several platforms are pioneering this shift, each with different approaches and strengths:
Photon is Databricks’ vectorized query engine, written in C++, that powers Databricks SQL and the Databricks Runtime (originally introduced under the Delta Engine branding). It represents a significant advancement in compute-on-query capabilities:
Key features:
- Native vectorized processing optimized for modern CPU architectures
- Deep integration with Delta Lake for ACID transactions and time travel
- Adaptive query execution that optimizes based on runtime conditions
- Dynamic filter pushdown that dramatically reduces data scanning
Real-world impact: A major e-commerce platform migrated from traditional Spark ETL to Photon-powered compute-on-query and saw:
- 7x improvement in average query performance
- 65% reduction in data engineering maintenance hours
- 42% reduction in cloud compute costs despite increased query volume
Snowflake is moving in the same direction. Its Polaris catalog opens Iceberg tables to external engines, while capabilities such as Snowpark and Snowpark Container Services let computation run directly as part of the query process:
Key features:
- Elastic compute that scales based on query requirements
- Separation of storage and compute with intelligent caching
- UDFs and stored procedures that execute within the query context
- Native support for machine learning workloads
Real-world impact: A financial services firm implementing Snowflake-based compute-on-query achieved:
- 82% reduction in data pipeline complexity
- Near-real-time data availability vs. previous 4-hour latency
- 39% improvement in overall platform performance
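To make the UDF capability listed above concrete, here is a minimal sketch in Snowflake-style SQL, with hypothetical function, table, and column names: the conversion logic lives in a function that executes inside the query itself rather than in a scheduled pipeline.

```sql
-- A minimal sketch of query-time transformation via a SQL UDF
-- (Snowflake-style syntax; function, table, and column names are hypothetical).
CREATE OR REPLACE FUNCTION to_usd(amount NUMBER, fx_rate NUMBER)
RETURNS NUMBER
AS
$$
    amount * fx_rate
$$;

-- The transformation runs inside the query, not in a separate pipeline.
SELECT
    order_id,
    to_usd(order_amount, fx_rate_to_usd) AS order_amount_usd
FROM raw_orders;
```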
While enterprise platforms evolve, DuckDB has emerged as an open-source analytical database that excels at in-process compute-on-query:
Key features:
- Embedded analytical processing with minimal footprint
- OLAP capabilities in a lightweight package
- Direct query of Parquet, CSV, and JSON files without loading
- Vectorized execution engine optimized for analytical workloads
Real-world impact: A data science team at a healthcare organization adopted DuckDB for compute-on-query and reported:
- 94% reduction in data preparation time
- Ability to analyze raw data directly from their data lake
- Elimination of multiple transformation pipelines
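To make the “query files directly” capability above concrete, here is a minimal sketch of the kind of query such a team might run in DuckDB. The path, columns, and filter values are hypothetical, and remote object-store paths additionally require DuckDB’s httpfs extension.

```sql
-- Hypothetical example: DuckDB querying raw Parquet files in place, with
-- filtering and aggregation applied at query time; no load step, no pipeline.
-- Paths and column names are illustrative.
SELECT
    user_country,
    COUNT(DISTINCT user_id) AS active_users,
    SUM(order_amount)       AS revenue
FROM read_parquet('data_lake/orders/*.parquet')
WHERE order_status = 'completed'
  AND order_date > current_date - INTERVAL 30 DAY
GROUP BY user_country
ORDER BY revenue DESC;
```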
The compute-on-query revolution extends beyond these examples:
- SingleStore with its universal storage engine that handles both transactional and analytical workloads
- BigQuery’s serverless architecture that separates storage and compute
- Apache Arrow providing a standard for columnar in-memory analytics
- ClickHouse delivering extreme performance for on-the-fly analytics
Organizations adopting compute-on-query are developing new architectural patterns. Here are the most successful approaches we’ve observed:
Rather than building transformation pipelines, organizations define metrics in a semantic layer that generates SQL at query time:
```yaml
# Example metrics definition (using hypothetical syntax)
metrics:
  - name: revenue
    description: Total revenue from completed orders
    calculation: sum(order_amount)
    filters:
      - order_status = 'completed'

  - name: active_users
    description: Users who performed an action in the last 30 days
    calculation: count_distinct(user_id)
    filters:
      - activity_date > dateadd(day, -30, current_date())

  - name: conversion_rate
    description: Percentage of sessions that resulted in a purchase
    calculation: count_if(has_purchase = true) / count(*)

dimensions:
  - name: acquisition_channel
    columns: [utm_source, utm_medium, utm_campaign]

  - name: user_geography
    columns: [user_country, user_region, user_city]
```
This declarative approach means:
- No ETL jobs to maintain
- Consistent metric definitions across the organization
- The query engine optimizes the execution for each specific query
- Metrics can be used in any combination with any dimensions
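To connect the definition to execution, here is the kind of SQL a semantic layer might generate at query time when an analyst requests the `revenue` metric by the `acquisition_channel` dimension. This is purely illustrative and assumes a hypothetical `orders` table matching the definitions above.

```sql
-- Illustrative only: SQL a semantic layer might generate at query time for
-- the `revenue` metric sliced by the `acquisition_channel` dimension.
-- The orders table and its columns are hypothetical.
SELECT
    utm_source,
    utm_medium,
    utm_campaign,
    SUM(order_amount) AS revenue
FROM orders
WHERE order_status = 'completed'
GROUP BY utm_source, utm_medium, utm_campaign;
```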
Rather than making rigid decisions about what to materialize, leading organizations use a decision framework:
| Factor | Favor Materialization | Favor Compute-on-Query |
|---|---|---|
| Query frequency | High (>100/day) | Low (<100/day) |
| Data volume | Very large (>10 TB) | Small to medium (<10 TB) |
| Computation complexity | Very high (multiple joins, window functions) | Low to medium |
| Freshness requirements | Can tolerate hours of latency | Needs minutes or less |
| Query pattern | Highly predictable | Exploratory or unpredictable |
This approach leads to a hybrid architecture where:
- Only the most frequently used, complex transformations are materialized
- Everything else uses compute-on-query
- The decision is continuously re-evaluated based on actual usage patterns
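One way to keep that re-evaluation continuous is to mine query history for frequent, expensive queries. The sketch below assumes a hypothetical `query_history` table; most platforms expose comparable metadata, though view and column names differ.

```sql
-- A sketch of identifying materialization candidates from query history.
-- The query_history table and its columns are hypothetical placeholders for
-- whatever metadata your platform exposes.
SELECT
    query_fingerprint,                          -- normalized query text or hash
    COUNT(*)                  AS runs_last_30d,
    AVG(execution_seconds)    AS avg_runtime_s,
    SUM(bytes_scanned) / 1e12 AS total_tb_scanned
FROM query_history
WHERE start_time >= current_date - INTERVAL 30 DAY
GROUP BY query_fingerprint
HAVING COUNT(*) > 100                -- high frequency favors materialization
   AND AVG(execution_seconds) > 60   -- so does expensive computation
ORDER BY COUNT(*) * AVG(execution_seconds) DESC;
```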
Organizations are building data lakehouses with a rich layer of views instead of materialized tables:
```sql
-- Example of a compute-on-query view instead of a materialized table
CREATE VIEW customer_360 AS
SELECT
    c.customer_id,
    c.name,
    c.email,
    c.signup_date,
    -- Dynamically calculate customer lifetime value
    (SELECT SUM(o.order_total)
     FROM orders o
     WHERE o.customer_id = c.customer_id) AS lifetime_value,
    -- Dynamically calculate days since last purchase
    DATEDIFF(day,
        (SELECT MAX(o.order_date)
         FROM orders o
         WHERE o.customer_id = c.customer_id),
        CURRENT_DATE()) AS days_since_last_purchase,
    -- Dynamically join to get the latest marketing segment
    (SELECT s.segment
     FROM customer_segments s
     WHERE s.customer_id = c.customer_id
     ORDER BY s.effective_date DESC
     LIMIT 1) AS current_segment
FROM customers c;
```
This pattern:
- Eliminates entire categories of data pipelines
- Ensures data is always current
- Adapts automatically when source schemas change
- Applies transformations only to the rows and columns actually needed for each query
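And when usage data shows that a view like `customer_360` has become hot enough to justify precomputation, it can be promoted without changing downstream consumers. A minimal sketch; syntax and restrictions vary considerably by platform:

```sql
-- Promote a hot compute-on-query view to a precomputed object.
-- Syntax and refresh behavior vary by platform; some engines restrict what a
-- materialized view may reference, in which case a scheduled CTAS works too.
CREATE MATERIALIZED VIEW customer_360_mv AS
SELECT * FROM customer_360;
```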
This shift to compute-on-query has profound implications for data engineering as a discipline:
Traditional ETL development is giving way to:
- Data modeling and metrics definition: Creating semantic frameworks that support compute-on-query
- Query optimization: Tuning for compute-on-query performance across diverse query patterns
- Access control and governance: Ensuring appropriate access when raw data is more directly accessible
The most valuable skills are evolving:
- Deep SQL expertise becomes even more important
- Understanding query optimization and execution plans is critical
- Knowledge of specific ETL/ELT tools becomes less valuable than understanding data modeling principles
Data architectures are being reimagined:
- Data lakes store raw or lightly processed data
- Metadata and semantic layers define transformations
- Materialization becomes an optimization technique rather than the default approach
The operational aspects of data engineering are changing:
- Less focus on pipeline scheduling and monitoring
- More emphasis on query performance and resource management
- Shifting from pipeline-centric to query-centric observability
Despite its promise, compute-on-query isn’t a silver bullet. Several challenges remain:
On-the-fly transformation can introduce performance variability, especially for complex transformations. Organizations must implement:
- Performance SLAs for critical queries
- Intelligent materialization strategies for frequently accessed data
- Query complexity management to prevent resource-intensive ad-hoc queries
Compute-on-query can sometimes increase overall compute costs, particularly if:
- The same transformations are repeatedly executed
- Queries access more data than necessary
- Resource governance isn’t properly implemented
When transformation happens at query time, governance becomes more complex:
- Row and column-level security must be enforced at query time
- Data lineage is more dynamic and harder to track
- Consistent application of business rules must be ensured across all access patterns
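As a sketch of the first point, row-level security can be enforced inside the same query-time views that perform transformation. The entitlements table here is hypothetical; many platforms also provide native row access policies that achieve the same effect.

```sql
-- A minimal sketch of row-level security applied at query time.
-- region_entitlements is a hypothetical mapping of users to the regions they
-- may see; CURRENT_USER resolves to the querying user on most engines.
CREATE VIEW orders_secured AS
SELECT o.*
FROM orders o
JOIN region_entitlements e
  ON e.region    = o.region
 AND e.user_name = CURRENT_USER;
```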
Moving from traditional ETL/ELT to compute-on-query requires:
- Rethinking established data modeling patterns
- Rebuilding dashboards and reports to use new semantic layers
- Retraining teams on new architectural approaches
Looking ahead, several emerging trends suggest where compute-on-query is headed:
LLMs and other AI technologies are being integrated into compute-on-query systems to:
- Automatically suggest transformations based on query intent
- Generate optimized transformation logic from natural language descriptions
- Detect and correct data quality issues during query execution
The line between batch and streaming is blurring, with compute-on-query systems:
- Applying transformations to both historical and real-time data with unified semantics
- Enabling queries that span across different time horizons
- Supporting event-time processing in ad-hoc queries
Emerging technologies are enabling compute-on-query across multiple storage systems:
- Query engines that can process data across cloud providers
- Federated queries that join data from different platforms
- Universal compute layers that work across diverse data sources
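As one hedged illustration of the federated pattern, DuckDB can attach an operational Postgres database and join it against Parquet files in object storage within a single query-time statement. The connection string, paths, and table names below are hypothetical.

```sql
-- A sketch of a federated compute-on-query join in DuckDB: operational rows
-- in Postgres joined against historical events in Parquet, at query time.
-- Connection string, paths, and table names are hypothetical.
INSTALL postgres;
LOAD postgres;
ATTACH 'dbname=shop host=localhost' AS pg (TYPE postgres);

SELECT
    c.customer_id,
    c.email,
    SUM(e.order_amount) AS lifetime_value
FROM pg.public.customers AS c
JOIN read_parquet('data_lake/orders/*.parquet') AS e
  ON e.customer_id = c.customer_id
GROUP BY c.customer_id, c.email;
```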
Compute-on-query is enabling analytics to move closer to operational systems:
- Embedded analytical databases like DuckDB bringing OLAP to edge devices
- Serverless query engines that can spin up on demand anywhere
- Application-embedded analytics that don’t require separate data warehouses
To illustrate the transformative potential of compute-on-query, let’s examine how a global retailer with 5,000+ stores reimagined their analytics architecture.
Their traditional architecture featured:
- 200+ ETL pipelines running on Informatica
- Data refreshed every 6 hours
- 35+ TB data warehouse with multiple layers of transformations
- 7-person team dedicated to maintaining transformation logic
- 2-3 week lead time for new analytics requests
Their new architecture leverages compute-on-query principles:
- Raw data loaded directly to cloud storage
- Snowflake as the compute-on-query engine, with the Polaris catalog over open table formats
- Semantic layer defining key metrics and entities
- Materialized views only for the most critical, performance-sensitive analytics
After six months with the new architecture:
- Data freshness improved from 6 hours to near real-time
- Analytics development cycle reduced from weeks to days
- ETL maintenance team reduced from 7 to 2 people
- Query performance improved by 45% on average
- Storage requirements decreased by 64%
- Overall platform costs reduced by 37%
Most importantly, business analysts could explore data in new ways without waiting for ETL changes, leading to several unexpected insights that drove significant business value.
For organizations looking to adopt compute-on-query, here’s a practical roadmap:
1. Assess current state:
   - Inventory existing transformation pipelines
   - Identify query patterns and frequencies
   - Evaluate current pain points and limitations
2. Select technologies:
   - Choose appropriate compute-on-query platform
   - Identify semantic layer technology
   - Select governance and monitoring tools
3. Pilot project:
   - Identify a bounded use case with high business value
   - Implement proof-of-concept with compute-on-query approach
   - Measure results against traditional approach
4. Develop semantic layer:
   - Define core metrics and dimensions
   - Create testing framework for semantic accuracy
   - Implement version control for semantic definitions
5. Optimize raw data storage:
   - Convert to columnar formats like Parquet (see the sketch after this roadmap)
   - Implement intelligent partitioning
   - Ensure appropriate indexing and statistics
6. Build query optimization knowledge:
   - Train team on query performance analysis
   - Develop standards for efficient query patterns
   - Create monitoring for query performance
7. Prioritize pipeline migration:
   - Start with high-maintenance, low-complexity pipelines
   - Gradually address more complex transformations
   - Maintain hybrid approach during transition
8. Implement governance controls:
   - Ensure consistent security across access patterns
   - Develop lineage tracking for compute-on-query
   - Create audit mechanisms for transformation logic
9. Refactor analytics applications:
   - Update dashboards to use semantic layer
   - Train analysts on new query capabilities
   - Develop self-service analytics guidelines
10. Continuous evaluation:
    - Monitor query patterns and performance
    - Identify candidates for materialization
    - Track business impact and user satisfaction
11. Progressive enhancement:
    - Expand semantic layer coverage
    - Refine materialization strategies
    - Optimize storage formats and organization
12. Knowledge sharing:
    - Document best practices and patterns
    - Train new team members on compute-on-query concepts
    - Build community of practice across the organization
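As a concrete illustration of the storage-optimization step in the roadmap, here is a hedged sketch in DuckDB syntax that converts raw CSV into date-partitioned Parquet. The paths, columns, and partitioning key are hypothetical.

```sql
-- A sketch of converting raw CSV to partitioned Parquet (DuckDB syntax).
-- Paths and columns are hypothetical; partitioning by day lets query-time
-- filters prune files instead of scanning the whole dataset.
COPY (
    SELECT *, CAST(order_date AS DATE) AS order_day
    FROM read_csv_auto('raw/orders/*.csv')
) TO 'data_lake/orders' (FORMAT PARQUET, PARTITION_BY (order_day));
```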
The shift to compute-on-query doesn’t mean the complete end of ETL/ELT. Rather, it represents an evolution where transformation becomes more dynamic, more flexible, and more closely aligned with actual analytical needs.
In this new world, data engineers focus less on building and maintaining rigid transformation pipelines and more on creating the foundations for flexible, powerful analytics:
- Optimizing raw data storage for efficient access
- Defining semantic frameworks that ensure consistent business logic
- Building governance mechanisms that work across diverse access patterns
- Creating the right materialization strategies for optimal performance
The organizations that embrace compute-on-query while thoughtfully addressing its challenges will gain significant advantages in agility, freshness, and insight generation—ultimately delivering more value from their data assets with less engineering overhead.
The question isn’t whether compute-on-query will transform data engineering, but how quickly organizations will adapt to this new paradigm and how they’ll balance its benefits against its challenges.
How is your organization approaching compute-on-query? Are you seeing shifts away from traditional ETL/ELT pipelines? Share your experiences in the comments below.