SQL GROUP BY
SQL GROUP BY: Complete Technical Guide
1. Fundamental Syntax
SELECT column1, aggregate_function(column2) FROM table_name [WHERE conditions] GROUP BY column1 [HAVING group_condition] [ORDER BY columns];
2. Internal Execution Process
-
Data Retrieval Phase
- Base data fetched according to WHERE conditions
- May use index-only scans if all columns are covered
-
Grouping Phase
-
Hash Aggregation (most common):
- Computes hash value for each group key
- Maintains hash table with aggregate values
- Memory-intensive but single-pass operation
-
Sort-Based Grouping (legacy/fallback):
- Sorts data by GROUP BY columns
- Scans sorted data to compute aggregates
- Requires temporary storage for large datasets
-
Hash Aggregation (most common):
-
Filtering Phase
- HAVING conditions applied to groups
- Different from WHERE which filters rows before grouping
3. Performance Optimization
Index Strategies
-- Optimal index for GROUP BY + WHERE CREATE INDEX idx_category_status ON products(category, status); -- Covering index example CREATE INDEX idx_covering ON orders(customer_id, order_date) INCLUDE (total_amount);
Memory Configuration
-
MySQL:
tmp_table_size
,max_heap_table_size
- PostgreSQL:
work_mem
- SQL Server:
query memory grant
4. Advanced GROUP BY Patterns
GROUPING SETS
SELECT department_id, job_title, COUNT(*) FROM employees GROUP BY GROUPING SETS ( (department_id), (job_title), (department_id, job_title), () );
CUBE (All combinations)
SELECT brand, category, size, SUM(sales) FROM products GROUP BY CUBE (brand, category, size);
ROLLUP (Hierarchical aggregation)
SELECT year, quarter, month, SUM(revenue) FROM sales GROUP BY ROLLUP (year, quarter, month);
⚠️ Common Anti-Patterns
- SELECTing non-aggregated columns not in GROUP BY
- Using HAVING for row-level filtering (use WHERE instead)
- Over-aggregating large datasets when sampling would suffice
- Ignoring NULL groups (NULLs form their own group)
5. Database-Specific Implementations
MySQL
-
Uses
temporary tables
for complex GROUP BY -
ONLY_FULL_GROUP_BY
mode enforces SQL standard - Index visibility affects GROUP BY optimization
PostgreSQL
- Advanced hash aggregation with memory spill to disk
- Parallel aggregation for large datasets
- Supports all ANSI grouping extensions
SQL Server
- Stream aggregation operator for sorted data
- Hash match operator for unsorted data
- Memory grants based on cardinality estimates
Oracle
HASH GROUP BY
operationSORT GROUP BY
for sorted data- Advanced analytic functions complement GROUP BY
💡 Expert Optimization Techniques
Two-Phase Aggregation
SELECT customer_id, SUM(total) FROM ( SELECT customer_id, SUM(amount) AS total FROM orders GROUP BY customer_id, DATE_TRUNC('month', order_date) ) subq GROUP BY customer_id;
Materialized View
CREATE MATERIALIZED VIEW sales_summary AS SELECT product_id, SUM(quantity) AS total_qty, AVG(price) AS avg_price FROM sales GROUP BY product_id WITH DATA;