2 Apr 2025, Wed

After working with dozens of engineering teams who’ve seen their Snowflake bills unexpectedly balloon to six or seven figures, I’ve noticed a pattern: most teams implement the same handful of basic optimizations they find in Snowflake’s documentation, then hit a wall. They resign themselves to escalating costs as “just the price of doing business” with cloud data warehouses.

But it doesn’t have to be this way.

The reality is that the difference between a well-optimized and poorly-optimized Snowflake implementation can easily represent a 40-60% cost difference—without sacrificing performance or capabilities. The challenge? The most impactful optimization techniques rarely appear in official documentation or basic tutorials.

Let me share what I’ve learned helping teams slash their Snowflake bills while actually improving their data platform’s performance.

Beyond the Basics: Advanced Query Optimization Patterns

Everyone knows to use appropriate warehouse sizes and to leverage caching. But here are three query patterns that silently drain your credits:

1. The Hidden Cost of Subqueries

Consider this common pattern:

SELECT 
  user_id, 
  (SELECT MAX(purchase_date) FROM purchases p WHERE p.user_id = u.user_id) as last_purchase_date,
  (SELECT COUNT(*) FROM purchases p WHERE p.user_id = u.user_id) as purchase_count
FROM users u
WHERE account_status = 'active';

This innocent-looking query contains two correlated subqueries, effectively asking Snowflake to probe the purchases table twice for every user row – the classic N+1 problem expressed in SQL.

Optimized version:

SELECT 
  u.user_id,
  p_agg.last_purchase_date,
  p_agg.purchase_count
FROM users u
LEFT JOIN (
  SELECT 
    user_id, 
    MAX(purchase_date) as last_purchase_date,
    COUNT(*) as purchase_count
  FROM purchases
  GROUP BY user_id
) p_agg ON u.user_id = p_agg.user_id
WHERE account_status = 'active';

In a recent project, this single rewrite cut the cost of an analytics workload by 27%, largely by reducing the amount of data scanned.

2. The Massive JOIN Multiplier Effect

When joining large tables, many engineers forget that Snowflake bills for warehouse runtime, and runtime is driven largely by how much data a query has to scan and shuffle. Consider a fact table with 1 billion rows joining to a dimension table with 10 million rows. If the fact table isn’t filtered (and clustered so the filter can prune), every run pays to scan and join the full billion rows.

Before:

SELECT f.*, d.attribute1, d.attribute2
FROM fact_table f
JOIN dimension_table d ON f.dimension_id = d.id

After:

SELECT f.*, d.attribute1, d.attribute2
FROM fact_table f
JOIN dimension_table d ON f.dimension_id = d.id
WHERE f.event_date > DATEADD(day, -30, CURRENT_DATE())

By adding a time-based filter, you let Snowflake prune micro-partitions before the join and can reduce the data scanned by orders of magnitude – provided the table is clustered (or naturally ordered) by event_date.
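
A quick sanity check before you rely on that filter: see how well the table is actually clustered on the filter column. Here's a minimal sketch using the fact_table and event_date names from the example above:

SELECT SYSTEM$CLUSTERING_INFORMATION('fact_table', '(event_date)');
-- Poor clustering depth here means the date filter won't prune much,
-- and it may be worth adding event_date to the clustering key.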

3. The Overlooked LIMIT Clause Trap

Many assume adding a LIMIT clause will reduce processing. However, when the LIMIT is paired with an ORDER BY, Snowflake still has to scan and sort every qualifying row before it can return the top 10.

-- Still processes all 10 million rows
SELECT * FROM large_table ORDER BY process_date DESC LIMIT 10;

The fix? Narrow the scan before you rank. Pair the window function with a filter that bounds how many rows Snowflake actually has to sort:

-- Prunes to a recent window first, then ranks only those rows
WITH ranked_data AS (
  SELECT 
    *,
    ROW_NUMBER() OVER (ORDER BY process_date DESC) as row_num
  FROM large_table
  WHERE process_date >= DATEADD(day, -7, CURRENT_DATE())  -- assumes the newest rows fall in this window; widen if they don't
)
SELECT * FROM ranked_data WHERE row_num <= 10;
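
To confirm the improvement, compare bytes scanned and elapsed time for the two versions after running them both. A quick sketch against recent query history, filtering on the large_table name from the example:

SELECT query_id, bytes_scanned, total_elapsed_time
FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY())
WHERE query_text ILIKE '%large_table%'
ORDER BY start_time DESC  -- most recent matching runs first
LIMIT 5;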

ML-Specific Clustering Strategies That Pay Dividends

Machine learning workloads have unique data access patterns that benefit from specialized clustering approaches.

Feature Store Clustering for Training Data Efficiency

Most feature stores are clustered by entity_id alone, which works for single-entity lookups but falls apart during model training when you need time-sliced features.

The optimal clustering strategy for ML feature stores is a compound key of:

  1. The feature_date/timestamp (most important)
  2. The entity_id or entity_type
  3. The feature_group_id

This clustering approach can reduce training data extraction time by up to 70% since most ML training datasets pull features across a specific time window.

Here’s a real-world example from a fraud detection team that cut its daily training-data extraction costs by 56%:

CREATE OR REPLACE TABLE feature_store
CLUSTER BY (feature_date, entity_type, feature_group_id)
AS
SELECT
  entity_id,
  entity_type,
  feature_group_id,
  feature_date,
  feature_name,
  feature_value
FROM source_features;
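
And here's the shape of training-data pull that benefits: with feature_date leading the clustering key, Snowflake prunes straight to the training window instead of scanning the full feature history. (Filter values are illustrative; the table and columns follow the example above.)

SELECT entity_id, feature_name, feature_value
FROM feature_store
WHERE feature_date BETWEEN '2025-01-01' AND '2025-02-28'  -- training window
  AND entity_type = 'transaction'                         -- hypothetical entity type
  AND feature_group_id = 17;                              -- hypothetical feature group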

Dynamic Reclustering for Shifting ML Workloads

ML workloads change as models evolve. A static clustering strategy becomes suboptimal over time. Implement this automated reclustering procedure that adapts to your changing query patterns:

CREATE OR REPLACE PROCEDURE RECLUSTER_BY_USAGE(TABLE_NAME STRING)
RETURNS STRING
LANGUAGE JAVASCRIPT
AS
$$
  // Mine Snowflake's access history (Enterprise Edition and up) for the columns
  // that queries against this table touch most often. TABLE_NAME should be fully
  // qualified (DATABASE.SCHEMA.TABLE), since that's how access_history records it.
  var analysis_query = `
    SELECT 
      col.value:"columnName"::string as column_name,
      COUNT(DISTINCT ah.query_id) as query_count
    FROM snowflake.account_usage.access_history ah,
      LATERAL FLATTEN(input => ah.base_objects_accessed) obj,
      LATERAL FLATTEN(input => obj.value:"columns") col
    WHERE obj.value:"objectName"::string = UPPER('${TABLE_NAME}')
    AND ah.query_start_time > DATEADD(day, -30, CURRENT_DATE())
    GROUP BY 1
    ORDER BY 2 DESC
    LIMIT 3
  `;
  
  // Execute and analyze query patterns
  var analysis_stmt = snowflake.createStatement({sqlText: analysis_query});
  var analysis_res = analysis_stmt.execute();
  
  // Build the new clustering key from the top three most-queried columns
  var clustering_columns = [];
  while (analysis_res.next()) {
    var col = analysis_res.getColumnValue(1);
    if (col && !clustering_columns.includes(col)) {
      clustering_columns.push(col);
    }
  }
  
  // Apply new clustering if columns discovered
  if (clustering_columns.length > 0) {
    var cluster_sql = `ALTER TABLE ${TABLE_NAME} CLUSTER BY (${clustering_columns.join(', ')})`;
    var cluster_stmt = snowflake.createStatement({sqlText: cluster_sql});
    cluster_stmt.execute();
    return "Reclustered by: " + clustering_columns.join(", ");
  } else {
    return "No suitable clustering keys found";
  }
$$;

Run this procedure monthly to keep your clustering strategy aligned with your evolving ML workloads.
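
If you want that on autopilot, a scheduled task works well. A minimal sketch, assuming a maintenance_wh warehouse and a fully qualified feature_store table (swap in your own names):

CREATE OR REPLACE TASK recluster_feature_store_task
WAREHOUSE = maintenance_wh
SCHEDULE = 'USING CRON 0 3 1 * * UTC'  -- 03:00 UTC on the 1st of each month
AS CALL RECLUSTER_BY_USAGE('ANALYTICS.ML.FEATURE_STORE');

ALTER TASK recluster_feature_store_task RESUME;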

The Credit Monitoring Framework: Preventing Cost Explosions

Most teams only discover cost issues after their monthly bill arrives. Here’s how to implement a proactive monitoring system:

  1. Create a Usage Baseline and Credit Alerting System
-- Create a monitoring table to track daily usage by warehouse
CREATE TABLE credit_monitoring AS
SELECT
  warehouse_name,
  date_trunc('day', start_time) as usage_date,
  sum(credits_used) as daily_credits
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= dateadd(day, -90, current_date())
GROUP BY 1, 2;
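
-- Keep the baseline current: the alerting logic below compares today's usage
-- against the last 30 days of this table, so append yesterday's totals each day.
-- (A sketch; it reuses the monitoring_wh warehouse from the task below and assumes
-- the initial 90-day snapshot was taken before yesterday, so nothing is double-counted.)
CREATE OR REPLACE TASK refresh_credit_monitoring_task
WAREHOUSE = monitoring_wh
SCHEDULE = 'USING CRON 0 1 * * * UTC'
AS INSERT INTO credit_monitoring
SELECT
  warehouse_name,
  date_trunc('day', start_time) as usage_date,
  sum(credits_used) as daily_credits
FROM snowflake.account_usage.warehouse_metering_history
WHERE date_trunc('day', start_time) = dateadd(day, -1, current_date())
GROUP BY 1, 2;
ALTER TASK refresh_credit_monitoring_task RESUME;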

-- Set up alerts when usage exceeds moving average by 25%
CREATE OR REPLACE PROCEDURE CREDIT_ALERTING()
RETURNS VARCHAR
LANGUAGE JAVASCRIPT
AS
$$
  var check_query = `
    WITH baseline AS (
      SELECT
        warehouse_name,
        avg(daily_credits) as avg_daily_credits
      FROM credit_monitoring
      WHERE usage_date BETWEEN dateadd(day, -30, current_date()) AND dateadd(day, -1, current_date())
      GROUP BY 1
    ),
    today AS (
      SELECT
        warehouse_name,
        sum(credits_used) as todays_credits
      FROM snowflake.account_usage.warehouse_metering_history
      WHERE start_time >= date_trunc('day', current_timestamp())
      GROUP BY 1
    )
    SELECT
      t.warehouse_name,
      t.todays_credits,
      b.avg_daily_credits,
      t.todays_credits / b.avg_daily_credits as ratio
    FROM today t
    JOIN baseline b ON t.warehouse_name = b.warehouse_name
    WHERE t.todays_credits > b.avg_daily_credits * 1.25
  `;
  
  var alerts = [];
  var result_set = snowflake.execute({sqlText: check_query});
  while (result_set.next()) {
    var wh = result_set.getColumnValue(1);
    var current = result_set.getColumnValue(2);
    var baseline = result_set.getColumnValue(3);
    var ratio = result_set.getColumnValue(4);
    
    alerts.push(`Warehouse ${wh} is using ${ratio.toFixed(2)}x its baseline (${current.toFixed(2)} vs ${baseline.toFixed(2)} credits)`);
    
    // Call your notification system - Slack, email, etc.
  }
  
  return alerts.length > 0 ? alerts.join("\n") : "No alerts triggered";
$$;

-- Schedule to run hourly
CREATE OR REPLACE TASK credit_monitor_task
WAREHOUSE = monitoring_wh
SCHEDULE = '60 MINUTE'
AS CALL CREDIT_ALERTING();
ALTER TASK credit_monitor_task RESUME;
  2. Implement Query-Level Cost Attribution

The real game-changer is tracking costs by query pattern and team. Here’s a simplified implementation:

CREATE OR REPLACE VIEW query_costs AS
SELECT
  query_id,
  user_name,
  warehouse_name,
  database_name,
  query_text,
  execution_time / 1000 as execution_seconds,
  bytes_scanned / (1024*1024*1024) as gb_scanned,
  -- Estimated credit cost based on warehouse size and time
  CASE
    WHEN warehouse_size = 'X-Small' THEN execution_seconds / 3600
    WHEN warehouse_size = 'Small' THEN execution_seconds / 3600 * 2
    WHEN warehouse_size = 'Medium' THEN execution_seconds / 3600 * 4
    WHEN warehouse_size = 'Large' THEN execution_seconds / 3600 * 8
    WHEN warehouse_size = 'X-Large' THEN execution_seconds / 3600 * 16
    ELSE execution_seconds / 3600 * 32
  END as estimated_credits,
  -- Extract comment tag to attribute to team
  REGEXP_SUBSTR(query_text, '/\\*team:([^\\*]*)\\*/', 1, 1, 'e') as team
FROM snowflake.account_usage.query_history
WHERE execution_status = 'SUCCESS'
AND start_time >= dateadd(day, -30, current_date());

Now you can see which teams and query patterns are consuming the most credits, allowing for targeted optimization.
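
For example, a weekly rollup of estimated spend by team tag (untagged queries bucketed separately) might look like this:

SELECT
  COALESCE(team, 'untagged') as team,
  COUNT(*) as query_count,
  ROUND(SUM(estimated_credits), 2) as estimated_credits
FROM query_costs
GROUP BY 1
ORDER BY 3 DESC;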

Case Study: How One Team Cut Their Snowflake Bill by 40%

A fintech company I worked with was processing millions of transactions daily for fraud detection. Their monthly Snowflake bill had grown to over $120,000, primarily due to their ML feature engineering pipeline and model training.

Here’s what we discovered and optimized:

  1. Problem #1: Redundant Feature Computation. Their feature engineering pipeline was recomputing many aggregated features each time a model was trained, scanning the same historical data repeatedly. Solution: we implemented a feature store with time-travel capabilities and proper clustering, reducing feature computation costs by 65%.
  2. Problem #2: Inefficient Model Training Data Extraction. Their training data creation involved massive joins across unclustered tables. Solution: we pre-materialized training datasets with the clustering strategy described earlier, reducing extraction time from 45 minutes to 11 minutes.
  3. Problem #3: Underutilized Caching. Many similar queries were running with slight variations, preventing cache hits. Solution: we standardized query patterns and implemented a query templating system, increasing cache hit rates from 22% to 67%.
  4. Problem #4: Oversized Warehouses. They were using XL warehouses for all workloads based on peak needs. Solution: we implemented dynamic warehouse sizing based on workload complexity, using smaller warehouses for 80% of their queries (see the sketch after this list).
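
For Problem #4, a minimal sketch of what the right-sizing looked like (warehouse names are illustrative; tune sizes and auto-suspend to your own workload):

-- Routine analytics stays on a small warehouse that suspends quickly when idle
ALTER WAREHOUSE analytics_wh SET
  WAREHOUSE_SIZE = 'SMALL'
  AUTO_SUSPEND = 60
  AUTO_RESUME = TRUE;

-- Only the heavy model-training jobs run on the larger warehouse
ALTER WAREHOUSE training_wh SET
  WAREHOUSE_SIZE = 'X-LARGE'
  AUTO_SUSPEND = 120;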

The final result? Their monthly bill dropped from $120,000 to $72,000—a 40% reduction—while actually improving query performance through better optimization.

Start Your Optimization Journey

These advanced techniques require a deeper understanding of Snowflake’s architecture and your specific workload patterns. But the effort pays for itself many times over. Start by:

  1. Analyzing your top 20 most expensive queries (a starter query follows this list)
  2. Identifying suboptimal patterns using the query profile visualization
  3. Implementing proper clustering for your ML workloads
  4. Setting up the cost monitoring framework
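
For step 1, the query_costs view from the monitoring section gives you a head start:

SELECT query_id, user_name, warehouse_name, estimated_credits, query_text
FROM query_costs
ORDER BY estimated_credits DESC
LIMIT 20;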

Remember, in cloud data platforms, architecture and query patterns matter far more than you might expect. A well-optimized Snowflake implementation isn’t just cheaper—it’s faster, more reliable, and allows your team to focus on building value rather than managing costs.

What’s been your biggest challenge with Snowflake cost optimization? Have you discovered any hidden techniques I missed? I’d love to hear your experiences in the comments.


#SnowflakeDB #DataEngineering #CostOptimization #CloudCosts #DataWarehouse #MLOps #FeatureStore #SQL #DataScience #DatabaseOptimization #CloudComputing #BigData #SnowflakeTips #DataInfrastructure #QueryOptimization #TechROI #DataPerformance #CloudEfficiency #SnowflakeOptimization #DataArchitecture

By Alex
