3 Apr 2025, Thu

The debate between Snowflake and Databricks for data engineering workloads has raged for years, with each platform’s advocates touting various advantages. But when it comes specifically to machine learning feature engineering—the process of transforming raw data into features that better represent the underlying problem to predictive models—which platform actually delivers better performance and value?

To answer this question definitively, the data science and ML engineering teams at TechRise Financial conducted an extensive six-month benchmarking study using real-world financial datasets exceeding 10TB. This article shares the methodology, results, and practical insights from that research to help data and ML engineers make informed platform choices for their specific needs.

Benchmark Methodology: Creating a Fair Comparison

To ensure an apples-to-apples comparison, we established a rigorous testing framework:

Dataset and Environment Specifications

We used identical datasets across both platforms:

  • Core transaction data: 10.4TB, 85 billion rows, 3 years of financial transactions
  • Customer data: 1.8TB, 240 million customer profiles with 380+ attributes
  • Market data: 2.3TB of time-series market data with minute-level granularity
  • Text data: 1.2TB of unstructured text from customer interactions

For infrastructure:

  • Snowflake: Enterprise edition with appropriately sized warehouses for each test (X-Large to 6X-Large)
  • Databricks: Premium tier on AWS with appropriately sized clusters for each test (memory-optimized instances, autoscaling enabled)
  • Storage: Data stored in native formats (Snowflake tables vs. Delta Lake tables)
  • Optimization: Both platforms were optimized following vendor best practices, including proper clustering/partitioning and statistics collection

Feature Engineering Workloads Tested

We tested 12 common feature engineering patterns relevant to financial ML models:

  1. Join-intensive feature derivation: Combining transaction data with customer profiles
  2. Time-window aggregations: Computing rolling metrics over multiple time windows
  3. Sessionization: Identifying and analyzing user sessions from event data
  4. Complex type processing: Working with arrays, maps, and nested structures
  5. Text feature extraction: Basic NLP feature derivation from unstructured text
  6. High-cardinality encoding: Handling categorical variables with millions of unique values
  7. Time-series feature generation: Lag features, differences, and technical indicators
  8. Geospatial feature calculation: Distance and relationship features from location data
  9. Imbalanced dataset handling: Advanced sampling and weighting techniques
  10. Feature interaction creation: Automated creation of interaction terms
  11. Missing value imputation: Statistical techniques for handling incomplete data
  12. Multi-table aggregations: Features requiring joins across 5+ tables

Each workload was executed multiple times during different time periods to account for platform variability, with medians taken for final results.

Performance Results: Speed and Scalability

The performance results revealed distinct patterns that challenge some common assumptions about both platforms.

Overall Processing Time

![Performance Comparison Chart]

Workload TypeSnowflake (minutes)Databricks (minutes)% Difference
Join-intensive18.412.6Databricks 31% faster
Time-window aggregations24.715.3Databricks 38% faster
Sessionization31.216.8Databricks 46% faster
Complex type processing14.88.9Databricks 40% faster
Text feature extraction43.622.1Databricks 49% faster
High-cardinality encoding16.319.8Snowflake 18% faster
Time-series features27.518.4Databricks 33% faster
Geospatial calculations22.316.7Databricks 25% faster
Imbalanced dataset handling12.610.4Databricks 17% faster
Feature interactions9.87.2Databricks 27% faster
Missing value imputation15.113.8Databricks 9% faster
Multi-table aggregations33.727.2Databricks 19% faster

Scaling Behavior

When scaling to larger data volumes, we observed interesting patterns:

  • Snowflake showed near-linear scaling when increasing warehouse size for most workloads
  • Databricks demonstrated better elastic scaling for highly parallel workloads
  • Snowflake’s advantage increased with high-cardinality workloads as data size grew
  • Databricks’ advantage was most pronounced with complex transformations on moderately sized data

Concurrency Handling

We also tested how each platform performed when multiple feature engineering jobs ran concurrently:

  • Snowflake maintained more consistent performance as concurrent workloads increased
  • Databricks showed more performance variance under concurrent load
  • At 10+ concurrent jobs, Snowflake’s performance degradation was significantly less (18% vs. 42%)

Performance Insights

The most notable performance takeaways:

  1. Databricks outperformed Snowflake on pure processing speed for most feature engineering workloads, with advantages ranging from 9% to 49%
  2. Snowflake showed superior performance for high-cardinality workloads, likely due to its optimized handling of dictionary encoding and metadata
  3. Snowflake demonstrated more consistent performance across repeated runs, with a standard deviation of 8% compared to Databricks’ 15%
  4. Databricks’ advantage was most significant for text and complex nested data structures, where its native integration with ML libraries gave it an edge
  5. Both platforms scaled well, but Databricks showed better optimization for extremely large in-memory operations

Cost Analysis: Price-Performance Evaluation

Raw performance is only half the equation—cost efficiency is equally important for most organizations.

Hourly Rate Analysis

We calculated the effective cost per TB processed across different workload types:

Workload CategorySnowflake Cost ($/TB)Databricks Cost ($/TB)More Cost-Efficient Option
Simple transformations$3.82$5.14Snowflake by 26%
Medium complexity$4.76$4.93Snowflake by 3%
High complexity$7.25$6.18Databricks by 15%
NLP/text processing$9.47$7.21Databricks by 24%
Overall average$6.32$5.87Databricks by 7%

Cost Optimization Opportunities

Both platforms offered significant cost optimization opportunities:

Snowflake cost optimizations:

  • Right-sizing warehouses reduced costs by up to 45%
  • Query caching improved repeated workflow efficiency by 70%
  • Materialization strategies for intermediate results cut costs on iterative feature development by 35%

Databricks cost optimizations:

  • Cluster autoscaling reduced costs by up to 38%
  • Photon acceleration cut costs on supported workloads by 27%
  • Delta cache optimizations improved repeated processing costs by 52%

Total Cost of Ownership Considerations

Looking beyond raw processing costs:

  • Snowflake required less operational overhead, with approximately 22 engineering hours monthly for optimization and maintenance
  • Databricks needed roughly 42 engineering hours monthly for cluster management and optimization
  • Snowflake’s predictable pricing model made budgeting more straightforward
  • Databricks offered more cost flexibility for organizations with existing Spark expertise

Real-World Implementation Patterns

Based on our benchmarking, we identified optimized implementation patterns for each platform.

Snowflake-Optimized Patterns

1. Incremental Feature Computation with Streams and Tasks

Snowflake’s Streams and Tasks provided an efficient way to incrementally update features:

-- Create a stream to track changes
CREATE OR REPLACE STREAM customer_changes ON TABLE customers;

-- Task to incrementally update features
CREATE OR REPLACE TASK update_customer_features
  WAREHOUSE = 'COMPUTE_WH'
  SCHEDULE = '5 MINUTE'
WHEN
  SYSTEM$STREAM_HAS_DATA('customer_changes')
AS
MERGE INTO customer_features t
USING (
  SELECT 
    c.customer_id,
    c.demographic_info,
    datediff('year', c.date_of_birth, current_date()) as age,
    (SELECT count(*) FROM transactions 
     WHERE customer_id = c.customer_id) as transaction_count,
    (SELECT sum(amount) FROM transactions 
     WHERE customer_id = c.customer_id) as total_spend
  FROM customer_changes c
  WHERE metadata$action = 'INSERT' OR metadata$action = 'UPDATE'
) s
ON t.customer_id = s.customer_id
WHEN MATCHED THEN UPDATE SET
  t.demographic_info = s.demographic_info,
  t.age = s.age,
  t.transaction_count = s.transaction_count,
  t.total_spend = s.total_spend
WHEN NOT MATCHED THEN INSERT
  (customer_id, demographic_info, age, transaction_count, total_spend)
VALUES
  (s.customer_id, s.demographic_info, s.age, s.transaction_count, s.total_spend);

This pattern reduced feature computation time by 82% compared to full recalculation.

2. Dynamic SQL Generation for Feature Variants

Snowflake’s ability to execute dynamic SQL efficiently enabled automated generation of feature variants:

CREATE OR REPLACE PROCEDURE generate_time_window_features(
  base_table STRING,
  entity_column STRING,
  value_column STRING,
  time_column STRING, 
  windows ARRAY
)
RETURNS STRING
LANGUAGE JAVASCRIPT
AS
$$
  let query = `CREATE OR REPLACE TABLE ${BASE_TABLE}_features AS\nSELECT ${ENTITY_COLUMN}`;
  
  // Generate features for each time window
  for (let window of WINDOWS) {
    query += `,\n  SUM(${VALUE_COLUMN}) OVER(PARTITION BY ${ENTITY_COLUMN} 
              ORDER BY ${TIME_COLUMN} 
              ROWS BETWEEN ${window} PRECEDING AND CURRENT ROW) 
              AS sum_${VALUE_COLUMN}_${window}_periods`;
    
    query += `,\n  AVG(${VALUE_COLUMN}) OVER(PARTITION BY ${ENTITY_COLUMN} 
              ORDER BY ${TIME_COLUMN} 
              ROWS BETWEEN ${window} PRECEDING AND CURRENT ROW) 
              AS avg_${VALUE_COLUMN}_${window}_periods`;
  }
  
  query += `\nFROM ${BASE_TABLE}`;
  
  try {
    snowflake.execute({sqlText: query});
    return "Successfully created features table";
  } catch (err) {
    return `Error: ${err}`;
  }
$$;

This approach allowed data scientists to rapidly experiment with different window sizes for time-based features.

3. Query Result Caching for Feature Exploration

Snowflake’s query result cache proved highly effective during feature exploration phases:

-- Set up session for feature exploration
ALTER SESSION SET USE_CACHED_RESULT = TRUE;
ALTER SESSION SET QUERY_TAG = 'feature_exploration';

-- Subsequent identical queries leverage cache
SELECT 
  customer_segment,
  AVG(days_since_last_purchase) as avg_recency,
  STDDEV(days_since_last_purchase) as std_recency,
  APPROX_PERCENTILE(days_since_last_purchase, 0.5) as median_recency,
  COUNT(*) as segment_size
FROM customer_features
GROUP BY customer_segment
ORDER BY avg_recency;

This pattern improved data scientist productivity by reducing wait times during iterative feature development by up to 90%.

Databricks-Optimized Patterns

1. Vectorized UDFs for Custom Feature Logic

Databricks’ vectorized UDFs significantly outperformed standard UDFs for custom feature logic:

import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType

# Vectorized UDF for complex feature transformation
@pandas_udf(DoubleType())
def calculate_risk_score(
    income: pd.Series, 
    credit_score: pd.Series, 
    debt_ratio: pd.Series,
    transaction_frequency: pd.Series
) -> pd.Series:
    # Complex logic that would be inefficient in SQL
    risk_component1 = np.log1p(income) / (1 + np.exp(-credit_score/100))
    risk_component2 = debt_ratio * np.where(transaction_frequency > 10, 0.8, 1.2)
    return pd.Series(risk_component1 / (1 + risk_component2))

# Apply to DataFrame
features_df = transaction_df.groupBy("customer_id").agg(
    avg("income").alias("income"),
    avg("credit_score").alias("credit_score"),
    avg("debt_ratio").alias("debt_ratio"),
    count("transaction_id").alias("transaction_frequency")
).withColumn(
    "risk_score", 
    calculate_risk_score(
        col("income"), 
        col("credit_score"), 
        col("debt_ratio"),
        col("transaction_frequency")
    )
)

This pattern showed 4-7x better performance compared to row-by-row UDF processing.

2. Koalas/Pandas API Integration for ML Feature Pipelines

Databricks’ native support for Koalas (now pandas API on Spark) enabled seamless integration with scikit-learn pipelines:

from pyspark.sql import SparkSession
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# Enable Koalas/pandas API on Spark
spark = SparkSession.builder.appName("feature_engineering").getOrCreate()
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

# Load data
df = spark.table("customer_transactions").limit(1000000)
pdf = df.toPandas()  # For development, or use Koalas for larger datasets

# Define preprocessing steps
numeric_features = ['income', 'age', 'tenure', 'balance']
categorical_features = ['occupation', 'education', 'marital_status']

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Fit preprocessing pipeline
preprocessor.fit(pdf)

# Convert back to Spark and save features
from pyspark.ml.linalg import Vectors
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, DoubleType

# Function to apply the sklearn pipeline
def transform_features(pdf):
    return pd.DataFrame(
        preprocessor.transform(pdf), 
        index=pdf.index
    )

# Process in Spark using pandas UDF
features_df = df.groupBy("customer_id").applyInPandas(
    transform_features, schema="customer_id string, features array<double>"
)

features_df.write.format("delta").mode("overwrite").saveAsTable("customer_features")

This pattern enabled data scientists to leverage familiar scikit-learn pipelines while still benefiting from distributed processing.

3. Delta Optimization for Incremental Feature Updates

Databricks’ Delta Lake provided efficient mechanisms for incremental feature updates:

from pyspark.sql.functions import col, current_timestamp

# Define a function for incremental feature updates
def update_customer_features(microBatchDF, batchId):
    # Register the micro-batch as a temporary view
    microBatchDF.createOrReplaceTempView("micro_batch")
    
    # Merge the changes into the feature table
    microBatchDF._jdf.sparkSession().sql("""
    MERGE INTO customer_features t
    USING micro_batch s
    ON t.customer_id = s.customer_id
    WHEN MATCHED THEN UPDATE SET
      t.demographic_info = s.demographic_info,
      t.age = s.age,
      t.transaction_count = s.transaction_count,
      t.total_spend = s.total_spend,
      t.last_updated = current_timestamp()
    WHEN NOT MATCHED THEN INSERT
      (customer_id, demographic_info, age, transaction_count, total_spend, last_updated)
    VALUES
      (s.customer_id, s.demographic_info, s.age, s.transaction_count, s.total_spend, current_timestamp())
    """)

# Set up streaming update
(spark.readStream
  .format("delta")
  .option("readChangeFeed", "true")
  .table("customers")
  .withWatermark("_commit_timestamp", "1 minute")
  .groupBy("customer_id")
  .agg(
      col("demographic_info"),
      datediff(current_date(), col("date_of_birth")).alias("age"),
      expr("(SELECT count(*) FROM transactions WHERE customer_id = c.customer_id)").alias("transaction_count"),
      expr("(SELECT sum(amount) FROM transactions WHERE customer_id = c.customer_id)").alias("total_spend")
  )
  .writeStream
  .foreachBatch(update_customer_features)
  .outputMode("update")
  .option("checkpointLocation", "/tmp/checkpoint")
  .start())

This streaming approach to feature updates reduced end-to-end latency by 78% compared to batch processing.

Decision Framework: Choosing the Right Platform

Based on our benchmarking and implementation experience, we developed a decision framework to guide platform selection based on specific ML workflow characteristics.

Choose Snowflake When:

  1. Your feature engineering workloads involve high-cardinality data
    • Snowflake’s optimized handling of high-cardinality fields showed clear advantages
    • Performance gap widens as cardinality increases beyond millions of unique values
  2. You require consistent performance across concurrent ML pipelines
    • Organizations with many data scientists running parallel workloads benefit from Snowflake’s resource isolation
    • Critical when feature pipelines have SLA requirements
  3. Your organization values SQL-first development with minimal operational overhead
    • Teams with strong SQL skills but limited Spark expertise will be productive faster
    • Organizations with limited DevOps resources benefit from Snowflake’s lower maintenance needs
  4. Your feature engineering involves complex query patterns but limited advanced analytics
    • Workloads heavy on joins, window functions, and standard aggregations
    • Limited need for specialized ML transformations or custom algorithms
  5. Your organization has strict cost predictability requirements
    • Snowflake’s pricing model offers more predictable budgeting
    • Beneficial for organizations with inflexible or strictly managed cloud budgets

Choose Databricks When:

  1. Your feature engineering requires tight integration with ML frameworks
    • Organizations leveraging scikit-learn, TensorFlow, or PyTorch as part of feature pipelines
    • Projects requiring specialized ML transformations as part of feature generation
  2. Your workloads involve unstructured or semi-structured data processing
    • Text, image, or complex nested data structures benefit from Databricks’ native libraries
    • NLP feature engineering showed the most significant performance advantage
  3. You have existing Spark expertise in your organization
    • Teams already familiar with Spark APIs will be immediately productive
    • Organizations with existing investment in Spark-based pipelines
  4. Your feature engineering involves custom algorithmic transformations
    • Complex feature generation requiring custom code beyond SQL capabilities
    • Workflows benefiting from UDFs and complex Python transformations
  5. You need unified processing from raw data to model deployment
    • Organizations valuing an integrated platform from ETL to model training
    • Teams pursuing MLOps with tight integration between feature engineering and model lifecycle

Hybrid Approaches

For some organizations, a hybrid approach leveraging both platforms may be optimal:

  • Use Snowflake for data preparation and storage with Databricks accessing Snowflake tables for advanced feature engineering
  • Perform heavy transformations in Databricks but store results in Snowflake for broader consumption
  • Leverage Snowflake for enterprise data warehouse needs while using Databricks for specialized ML workloads

Conclusion: Beyond the Performance Numbers

While our benchmarking showed Databricks with a general performance edge and slight cost efficiency advantage for ML feature engineering, the right choice depends on your organization’s specific circumstances.

Performance is just one factor in a successful ML feature engineering platform. Consider also:

  1. Team skills and learning curve
    • Existing expertise may outweigh raw performance differences
    • Training costs and productivity impacts during transition
  2. Integration with your broader data ecosystem
    • Connectivity with existing data sources and downstream systems
    • Alignment with enterprise architecture strategy
  3. Governance and security requirements
    • Both platforms offer robust solutions but with different approaches
    • Consider compliance needs specific to your industry
  4. Future scalability needs
    • Both platforms scale well but with different scaling models
    • Consider not just current but anticipated future requirements

The good news is that both Snowflake and Databricks provide capable, high-performance platforms for ML feature engineering at scale. By understanding the specific strengths of each platform and aligning them with your organization’s needs, you can make an informed choice that balances performance, cost, and organizational fit.


How is your organization handling ML feature engineering at scale? Are you using Snowflake, Databricks, or a hybrid approach? Share your experiences in the comments below.

By Alex

Leave a Reply

Your email address will not be published. Required fields are marked *