SQL GROUP BY


SQL
POSTGRES
INTERACTIVE

SQL GROUP BY: Complete Technical Guide

1. Fundamental Syntax
SELECT column1, aggregate_function(column2) 
FROM table_name
[WHERE conditions] 
GROUP BY column1 [HAVING group_condition]
[ORDER BY columns];
2. Internal Execution Process
  1. Data Retrieval Phase
    1. Base data fetched according to WHERE conditions
    2. May use index-only scans if all columns are covered
  2. Grouping Phase
    1. Hash Aggregation (most common):
      1. Computes hash value for each group key
      2. Maintains hash table with aggregate values
      3. Memory-intensive but single-pass operation
    2. Sort-Based Grouping (legacy/fallback):
      1. Sorts data by GROUP BY columns
      2. Scans sorted data to compute aggregates
      3. Requires temporary storage for large datasets
  3. Filtering Phase
    1. HAVING conditions applied to groups
    2. Different from WHERE which filters rows before grouping
3. Performance Optimization
Index Strategies
-- Optimal index for GROUP BY + WHERE 
CREATE INDEX idx_category_status 
ON products(category, status); 

-- Covering index example 
CREATE INDEX idx_covering 
ON orders(customer_id, order_date) 
INCLUDE (total_amount);
Memory Configuration
  1. MySQL: tmp_table_size, max_heap_table_size
  2. PostgreSQL: work_mem
  3. SQL Server: query memory grant
4. Advanced GROUP BY Patterns
GROUPING SETS
SELECT department_id, job_title, COUNT(*)
FROM employees
GROUP BY GROUPING SETS (
  (department_id),
  (job_title),
  (department_id, job_title),
  ()
);
CUBE (All combinations)
SELECT brand, category, size, SUM(sales) 
FROM products
GROUP BY CUBE (brand, category, size);
ROLLUP (Hierarchical aggregation)
SELECT year, quarter, month, SUM(revenue) 
FROM sales 
GROUP BY ROLLUP (year, quarter, month);
⚠️ Common Anti-Patterns
  1. SELECTing non-aggregated columns not in GROUP BY
  2. Using HAVING for row-level filtering (use WHERE instead)
  3. Over-aggregating large datasets when sampling would suffice
  4. Ignoring NULL groups (NULLs form their own group)
5. Database-Specific Implementations
MySQL
  1. Uses temporary tables for complex GROUP BY
  2. ONLY_FULL_GROUP_BY mode enforces SQL standard
  3. Index visibility affects GROUP BY optimization
PostgreSQL
  1. Advanced hash aggregation with memory spill to disk
  2. Parallel aggregation for large datasets
  3. Supports all ANSI grouping extensions
SQL Server
  1. Stream aggregation operator for sorted data
  2. Hash match operator for unsorted data
  3. Memory grants based on cardinality estimates
Oracle
  1. HASH GROUP BY operation
  2. SORT GROUP BY for sorted data
  3. Advanced analytic functions complement GROUP BY
💡 Expert Optimization Techniques
Two-Phase Aggregation
SELECT customer_id, SUM(total)
FROM (
  SELECT customer_id, SUM(amount) AS total
  FROM orders
  GROUP BY customer_id, DATE_TRUNC('month', order_date)
) subq
GROUP BY customer_id;
Materialized View
CREATE MATERIALIZED VIEW sales_summary AS
SELECT product_id, SUM(quantity) AS total_qty, AVG(price) AS avg_price
FROM sales
GROUP BY product_id
WITH DATA;