3 Apr 2025, Thu

The End of ETL? How Compute-on-Query Is Changing Data Engineering Fundamentals

For decades, the ETL (Extract, Transform, Load) paradigm has dominated data engineering. This approach—extracting data from source systems, transforming it through a series of processing steps, and loading it into target systems for analysis—has been the backbone of data warehousing and business intelligence. Even as technology evolved from on-premises ETL tools to cloud-based ELT (Extract, Load, Transform), the fundamental concept remained: transform data before it’s queried.

But a quiet revolution is underway. Emerging technologies, from platform engines like Databricks' Photon and Snowflake's Polaris to open-source solutions like DuckDB, are challenging this foundational paradigm with compute-on-query architectures that defer transformations until the moment data is needed.

This shift promises to fundamentally change how data teams work, potentially eliminating entire categories of data engineering tasks while creating new opportunities and challenges. Let’s explore how compute-on-query is reshaping the landscape and what it means for the future of data engineering.

The Traditional ETL/ELT Paradigm and Its Limitations

Traditional data processing follows a pattern we’ve all come to know:

  1. Extract data from source systems
  2. Transform it into the desired format (either before or after loading)
  3. Load it into a target system for analysis

This approach emerged from technical limitations of earlier eras: limited storage, expensive compute, and the need to optimize for specific analytical queries. Pre-computing transformations made sense when compute was scarce and query patterns were predictable.

But this paradigm has always had significant drawbacks:

1. Rigid Transformation Paths

Once a transformation pipeline is built, adding new data fields or changing transformation logic requires rebuilding processes. This creates friction between evolving business needs and established data pipelines.

2. Redundant Storage

Teams often store multiple versions of the same data: raw, cleansed, transformed, and aggregated. Each layer consumes storage and introduces potential consistency issues.

3. Delayed Insights

Traditional pipelines typically run on schedules (daily, hourly), creating latency between when events occur and when they’re available for analysis.

4. High Maintenance Burden

ETL/ELT pipelines require continuous maintenance as schemas change, business rules evolve, and data volumes grow.

5. Limited Exploration

Analysts are constrained to working with precomputed transformations, limiting ad-hoc exploration and discovery.

The Compute-on-Query Revolution

Compute-on-query (also called query-time transformation) fundamentally changes this approach by pushing transformations to the moment of query execution. Rather than precomputing and storing transformed data, these systems apply transformations dynamically when a user or application queries the data.

Key Enabling Technologies

Several key technological developments have made compute-on-query viable:

1. Vectorized Query Execution

Modern query engines use vectorized execution, processing data in columnar batches rather than row by row. This delivers orders-of-magnitude performance improvements for analytical workloads.

2. Query Compilation

Just-in-time compilation of queries to optimized machine code eliminates interpretation overhead, making on-the-fly transformations much faster.

3. Adaptive Query Optimization

Modern optimizers can adapt execution plans based on data characteristics observed during query execution, rather than relying solely on pre-query statistics.

4. Massive Parallelization

Cloud-native query engines can dynamically allocate massive parallel resources for individual queries, making complex transformations feasible at query time.

5. Intelligent Caching

Sophisticated caching of intermediate results balances the flexibility of on-demand computation with performance optimization.

Leading Compute-on-Query Technologies

Several platforms are pioneering this shift, each with different approaches and strengths:

Databricks’ Photon Engine

Photon is Databricks’ vectorized query engine written in C++ that powers their Delta Engine. It represents a significant advancement in compute-on-query capabilities:

Key features:

  • Native vectorized processing optimized for modern CPU architectures
  • Deep integration with Delta Lake for ACID transactions and time travel
  • Adaptive query execution that optimizes based on runtime conditions
  • Dynamic filter pushdown that dramatically reduces data scanning

Real-world impact: A major e-commerce platform migrated from traditional Spark ETL to Photon-powered compute-on-query and saw:

  • 7x improvement in average query performance
  • 65% reduction in data engineering maintenance hours
  • 42% reduction in cloud compute costs despite increased query volume

Snowflake’s Polaris

Snowflake is moving in the same direction with Polaris and Snowpark Container Services, bringing computation directly into the query process:

Key features:

  • Elastic compute that scales based on query requirements
  • Separation of storage and compute with intelligent caching
  • UDFs and stored procedures that execute within the query context
  • Native support for machine learning workloads

Real-world impact: A financial services firm implementing Polaris-based compute-on-query achieved:

  • 82% reduction in data pipeline complexity
  • Near-real-time data availability vs. previous 4-hour latency
  • 39% improvement in overall platform performance

DuckDB

While enterprise platforms evolve, DuckDB has emerged as an open-source analytical database that excels at in-process compute-on-query:

Key features:

  • Embedded analytical processing with minimal footprint
  • OLAP capabilities in a lightweight package
  • Direct query of Parquet, CSV, and JSON files without loading (see the sketch below)
  • Vectorized execution engine optimized for analytical workloads
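
As a minimal sketch of what that looks like in practice, the following DuckDB query aggregates raw Parquet files in place; the file path and column names are hypothetical placeholders:

-- Minimal DuckDB sketch: query raw Parquet files in place, no load step required.
-- The path and column names below are hypothetical placeholders.
SELECT
  utm_source,
  COUNT(*) AS orders,
  SUM(order_amount) AS revenue
FROM read_parquet('datalake/orders/*.parquet')
WHERE order_status = 'completed'
GROUP BY utm_source
ORDER BY revenue DESC;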

Real-world impact: A data science team at a healthcare organization adopted DuckDB for compute-on-query and reported:

  • 94% reduction in data preparation time
  • Ability to analyze raw data directly from their data lake
  • Elimination of multiple transformation pipelines

Other Notable Technologies

The compute-on-query revolution extends beyond these examples:

  • SingleStore with its universal storage engine that handles both transactional and analytical workloads
  • BigQuery’s serverless architecture that separates storage and compute
  • Apache Arrow providing a standard for columnar in-memory analytics
  • ClickHouse delivering extreme performance for on-the-fly analytics

Real-World Implementation Patterns

Organizations adopting compute-on-query are developing new architectural patterns. Here are the most successful approaches we’ve observed:

Pattern 1: The Metrics Layer

Rather than building transformation pipelines, organizations define metrics in a semantic layer that generates SQL at query time:

# Example metrics definition (using hypothetical syntax)
metrics:
  - name: revenue
    description: Total revenue from completed orders
    calculation: sum(order_amount)
    filters:
      - order_status = 'completed'
    
  - name: active_users
    description: Users who performed an action in the last 30 days
    calculation: count_distinct(user_id)
    filters:
      - activity_date > dateadd(day, -30, current_date())
    
  - name: conversion_rate
    description: Percentage of sessions that resulted in a purchase
    calculation: count_if(has_purchase = true) / count(*)
    
dimensions:
  - name: acquisition_channel
    columns: [utm_source, utm_medium, utm_campaign]
  
  - name: user_geography
    columns: [user_country, user_region, user_city]

This declarative approach means:

  • No ETL jobs to maintain
  • Consistent metric definitions across the organization
  • The query engine optimizes the execution for each specific query
  • Metrics can be used in any combination with any dimensions
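
To make the "generates SQL at query time" step concrete, a request for revenue broken down by acquisition_channel might compile to something like the following; the orders table is a hypothetical stand-in for the underlying raw data:

-- Sketch of the SQL a semantic layer might generate at query time for
-- "revenue by acquisition_channel", based on the definitions above.
-- The orders table is a hypothetical stand-in for the raw source data.
SELECT
  utm_source,
  utm_medium,
  utm_campaign,
  SUM(order_amount) AS revenue
FROM orders
WHERE order_status = 'completed'
GROUP BY utm_source, utm_medium, utm_campaign;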

Pattern 2: The Materialization Decision Matrix

Rather than making rigid decisions about what to materialize, leading organizations use a decision framework:

Factor                 | Favor Materialization                        | Favor Compute-on-Query
Query Frequency        | High (>100/day)                              | Low (<100/day)
Data Volume            | Very large (>10TB)                           | Small-to-medium (<10TB)
Computation Complexity | Very high (multiple joins, window functions) | Low-to-medium
Freshness Requirements | Can tolerate hours of latency                | Needs minutes or less
Query Pattern          | Highly predictable                           | Exploratory or unpredictable

This approach leads to a hybrid architecture where:

  • Only the most frequently used, complex transformations are materialized
  • Everything else uses compute-on-query
  • The decision is continuously re-evaluated based on actual usage patterns
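
A minimal sketch of what this hybrid looks like in SQL, assuming a hypothetical orders table and an engine that supports materialized views (syntax varies by platform):

-- Hypothetical hybrid: materialize only the hot, frequently queried aggregate.
CREATE MATERIALIZED VIEW daily_revenue_by_channel AS
SELECT
  order_date,
  utm_source,
  SUM(order_amount) AS revenue
FROM orders
WHERE order_status = 'completed'
GROUP BY order_date, utm_source;

-- Everything else stays a plain view and is computed at query time.
CREATE VIEW revenue_by_campaign AS
SELECT
  utm_campaign,
  SUM(order_amount) AS revenue
FROM orders
WHERE order_status = 'completed'
GROUP BY utm_campaign;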

Pattern 3: The Data Lakehouse with Views

Organizations are building data lakehouses with a rich layer of views instead of materialized tables:

-- Example of compute-on-query view instead of materialized table
CREATE VIEW customer_360 AS
SELECT
  c.customer_id,
  c.name,
  c.email,
  c.signup_date,
  -- Dynamically calculate customer lifetime value
  (SELECT SUM(order_total) 
   FROM orders 
   WHERE customer_id = c.customer_id) AS lifetime_value,
  -- Dynamically calculate days since last purchase
  DATEDIFF(day, 
    (SELECT MAX(order_date) 
     FROM orders 
     WHERE customer_id = c.customer_id), 
    CURRENT_DATE()) AS days_since_last_purchase,
  -- Dynamically join to get latest marketing segment
  (SELECT segment
   FROM customer_segments
   WHERE customer_id = c.customer_id
   ORDER BY effective_date DESC
   LIMIT 1) AS current_segment
FROM customers c;

This pattern:

  • Eliminates entire categories of data pipelines
  • Ensures data is always current
  • Adapts automatically when source schemas change
  • Applies transformations only to the rows and columns actually needed for each query
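
For example, a downstream query against the view above might look like this; how much work the engine can skip depends on its optimizer, but column pruning and filter pushdown generally mean only the selected subqueries run, and only for matching customers:

-- Query the compute-on-query view like any table; subqueries behind
-- unselected columns (e.g., current_segment) typically need not run at all.
SELECT customer_id, name, lifetime_value
FROM customer_360
WHERE signup_date >= DATE '2024-01-01'
ORDER BY lifetime_value DESC
LIMIT 100;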

The Implications for Data Engineering

This shift to compute-on-query has profound implications for data engineering as a discipline:

1. Role Evolution

Traditional ETL development is giving way to:

  • Data modeling and metrics definition: Creating semantic frameworks that support compute-on-query
  • Query optimization: Tuning for compute-on-query performance across diverse query patterns
  • Access control and governance: Ensuring appropriate access when raw data is more directly accessible

2. Skill Shifts

The most valuable skills are evolving:

  • Deep SQL expertise becomes even more important
  • Understanding query optimization and execution plans is critical
  • Knowledge of specific ETL/ELT tools becomes less valuable than understanding data modeling principles

3. Architectural Changes

Data architectures are being reimagined:

  • Data lakes store raw or lightly processed data
  • Metadata and semantic layers define transformations
  • Materialization becomes an optimization technique rather than the default approach

4. DevOps Transformation

The operational aspects of data engineering are changing:

  • Less focus on pipeline scheduling and monitoring
  • More emphasis on query performance and resource management
  • Shifting from pipeline-centric to query-centric observability
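
Query-centric observability might start with something like the following sketch, where query_history is a hypothetical stand-in for whatever query log your platform exposes:

-- Hypothetical observability query: surface the most expensive recurring query
-- shapes over the last week as candidates for materialization or tuning.
-- query_history and its columns are placeholders for your platform's query log.
SELECT
  query_text,
  COUNT(*) AS executions,
  AVG(elapsed_ms) AS avg_elapsed_ms,
  SUM(elapsed_ms) AS total_ms
FROM query_history
WHERE start_time >= CURRENT_DATE - INTERVAL '7' DAY
GROUP BY query_text
ORDER BY total_ms DESC
LIMIT 20;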

Challenges and Limitations

Despite its promise, compute-on-query isn’t a silver bullet. Several challenges remain:

1. Performance Consistency

On-the-fly transformation can introduce performance variability, especially for complex transformations. Organizations must implement:

  • Performance SLAs for critical queries
  • Intelligent materialization strategies for frequently accessed data
  • Query complexity management to prevent resource-intensive ad-hoc queries

2. Cost Management

Compute-on-query can sometimes increase overall compute costs, particularly if:

  • The same transformations are repeatedly executed
  • Queries access more data than necessary
  • Resource governance isn’t properly implemented

3. Governance and Security

When transformation happens at query time, governance becomes more complex:

  • Row and column-level security must be enforced at query time (see the sketch below)
  • Data lineage is more dynamic and harder to track
  • Consistent application of business rules must be ensured across all access patterns
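
One common way to enforce row-level security at query time is a secured view over the raw data; the entitlements table below is a hypothetical example, and most platforms also offer native row access policies:

-- Hypothetical query-time row-level security: expose raw orders only through
-- a view that filters rows by the querying user's entitlements.
CREATE VIEW orders_secured AS
SELECT o.*
FROM orders o
JOIN user_region_entitlements e
  ON e.region = o.region
WHERE e.user_name = CURRENT_USER;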

4. Migration Challenges

Moving from traditional ETL/ELT to compute-on-query requires:

  • Rethinking established data modeling patterns
  • Rebuilding dashboards and reports to use new semantic layers
  • Retraining teams on new architectural approaches

The Future: Beyond Current Compute-on-Query

Looking ahead, several emerging trends suggest where compute-on-query is headed:

1. AI-Augmented Transformation

LLMs and other AI technologies are being integrated into compute-on-query systems to:

  • Automatically suggest transformations based on query intent
  • Generate optimized transformation logic from natural language descriptions
  • Detect and correct data quality issues during query execution

2. Real-Time Streaming Integration

The line between batch and streaming is blurring, with compute-on-query systems:

  • Applying transformations to both historical and real-time data with unified semantics
  • Enabling queries that span across different time horizons
  • Supporting event-time processing in ad-hoc queries

3. Cross-Platform Federation

Emerging technologies are enabling compute-on-query across multiple storage systems:

  • Query engines that can process data across cloud providers
  • Federated queries that join data from different platforms
  • Universal compute layers that work across diverse data sources

4. Embedded Analytics Everywhere

Compute-on-query is enabling analytics to move closer to operational systems:

  • Embedded analytical databases like DuckDB bringing OLAP to edge devices
  • Serverless query engines that can spin up on demand anywhere
  • Application-embedded analytics that don’t require separate data warehouses

Case Study: Reimagining Analytics at a Global Retailer

To illustrate the transformative potential of compute-on-query, let’s examine how a global retailer with 5,000+ stores reimagined their analytics architecture.

The Old World: ETL-Centric

Their traditional architecture featured:

  • 200+ ETL pipelines running on Informatica
  • Data refreshed every 6 hours
  • 35+ TB data warehouse with multiple layers of transformations
  • 7-person team dedicated to maintaining transformation logic
  • 2-3 week lead time for new analytics requests

The New World: Compute-on-Query

Their new architecture leverages compute-on-query principles:

  • Raw data loaded directly to cloud storage
  • Snowflake’s Polaris engine for compute-on-query
  • Semantic layer defining key metrics and entities
  • Materialized views only for the most critical, performance-sensitive analytics

The Results

After six months with the new architecture:

  • Data freshness improved from 6 hours to near real-time
  • Analytics development cycle reduced from weeks to days
  • ETL maintenance team reduced from 7 to 2 people
  • Query performance improved by 45% on average
  • Storage requirements decreased by 64%
  • Overall platform costs reduced by 37%

Most importantly, business analysts could explore data in new ways without waiting for ETL changes, leading to several unexpected insights that drove significant business value.

Implementation Roadmap: Moving to Compute-on-Query

For organizations looking to adopt compute-on-query, here’s a practical roadmap:

Phase 1: Foundation (1-2 Months)

  1. Assess current state:
    • Inventory existing transformation pipelines
    • Identify query patterns and frequencies
    • Evaluate current pain points and limitations
  2. Select technologies:
    • Choose appropriate compute-on-query platform
    • Identify semantic layer technology
    • Select governance and monitoring tools
  3. Pilot project:
    • Identify a bounded use case with high business value
    • Implement proof-of-concept with compute-on-query approach
    • Measure results against traditional approach

Phase 2: Expansion (2-3 Months)

  1. Develop semantic layer:
    • Define core metrics and dimensions
    • Create testing framework for semantic accuracy
    • Implement version control for semantic definitions
  2. Optimize raw data storage:
    • Convert to columnar formats like Parquet (see the sketch after this list)
    • Implement intelligent partitioning
    • Ensure appropriate indexing and statistics
  3. Build query optimization knowledge:
    • Train team on query performance analysis
    • Develop standards for efficient query patterns
    • Create monitoring for query performance
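
As one concrete example of the storage step above, DuckDB can rewrite raw CSV extracts into partitioned Parquet in a single statement; the paths and partition column are hypothetical:

-- Hypothetical DuckDB sketch: convert raw CSV extracts to partitioned Parquet
-- so compute-on-query engines can prune files and columns efficiently.
COPY (
  SELECT *
  FROM read_csv_auto('raw/orders/*.csv')
) TO 'lake/orders' (FORMAT PARQUET, PARTITION_BY (order_date));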

Phase 3: Migration (3-6 Months)

  1. Prioritize pipeline migration:
    • Start with high-maintenance, low-complexity pipelines
    • Gradually address more complex transformations
    • Maintain hybrid approach during transition
  2. Implement governance controls:
    • Ensure consistent security across access patterns
    • Develop lineage tracking for compute-on-query
    • Create audit mechanisms for transformation logic
  3. Refactor analytics applications:
    • Update dashboards to use semantic layer
    • Train analysts on new query capabilities
    • Develop self-service analytics guidelines

Phase 4: Optimization (Ongoing)

  1. Continuous evaluation:
    • Monitor query patterns and performance
    • Identify candidates for materialization
    • Track business impact and user satisfaction
  2. Progressive enhancement:
    • Expand semantic layer coverage
    • Refine materialization strategies
    • Optimize storage formats and organization
  3. Knowledge sharing:
    • Document best practices and patterns
    • Train new team members on compute-on-query concepts
    • Build community of practice across the organization

Conclusion: Evolution, Not Revolution

The shift to compute-on-query doesn’t mean the complete end of ETL/ELT. Rather, it represents an evolution where transformation becomes more dynamic, more flexible, and more closely aligned with actual analytical needs.

In this new world, data engineers focus less on building and maintaining rigid transformation pipelines and more on creating the foundations for flexible, powerful analytics:

  • Optimizing raw data storage for efficient access
  • Defining semantic frameworks that ensure consistent business logic
  • Building governance mechanisms that work across diverse access patterns
  • Creating the right materialization strategies for optimal performance

The organizations that embrace compute-on-query while thoughtfully addressing its challenges will gain significant advantages in agility, freshness, and insight generation—ultimately delivering more value from their data assets with less engineering overhead.

The question isn’t whether compute-on-query will transform data engineering, but how quickly organizations will adapt to this new paradigm and how they’ll balance its benefits against its challenges.


How is your organization approaching compute-on-query? Are you seeing shifts away from traditional ETL/ELT pipelines? Share your experiences in the comments below.

By Alex
