SQL SELECT DISTINCT
SELECT DISTINCT: Complete Technical Breakdown
1. Fundamental Syntax
SELECT DISTINCT column1, column2, ... FROM table_name [WHERE conditions] [ORDER BY columns];
2. Internal Execution Process
-
Data Retrieval Phase
- Full table scan or index-only scan (if covering index exists)
- All specified columns are read into memory
-
Distinct Processing Phase
-
Sort-Based Deduplication (most common):
- All rows are sorted in memory or temporary storage
- Adjacent duplicates are eliminated during final pass
-
Hash-Based Deduplication (for large
datasets):
- Hash values computed for each row's distinct columns
- Hash table eliminates duplicates on the fly
-
Sort-Based Deduplication (most common):
3. Performance Considerations
Factor | Impact |
---|---|
Number of columns | More columns = larger comparison footprint |
Column data types | Text/VARBINARY comparisons are more expensive than integers |
Dataset size | May trigger disk-based temporary tables |
Available indexes | Covering indexes can enable index-only scans |
4. Database-Specific Implementations
MySQL/InnoDB
- Uses filesort with temporary tables by default
-
Switches to hash-based when
optimizer_switch='distinct_optimization=on'
-
Memory usage controlled by
tmp_table_size
PostgreSQL
- Prefers hash aggregation for larger datasets
-
Uses
work_mem
setting for in-memory operations - Can leverage index-only scans when possible
SQL Server
- Stream aggregation operator for sorted inputs
- Hash match operator for unsorted data
- Memory grants based on estimated cardinality
Oracle
HASH UNIQUE
operation for most casesSORT UNIQUE
when data is nearly sorted- Can use bitmap indexes for distinct operations
5. Advanced Use Cases
Multiple Column Distinct
SELECT DISTINCT department_id, job_title FROM employees;
Creates unique combinations across all specified columns
COUNT(DISTINCT) Variations
SELECT COUNT(DISTINCT CASE WHEN status = 'active' THEN user_id END) FROM users;
Conditional distinct counting with filtering
Window Functions with DISTINCT
SELECT DISTINCT user_id, FIRST_VALUE(order_date) OVER (PARTITION BY user_id ORDER BY order_date) FROM orders;
Combines distinct with window function analysis
⚠️ Important Limitations
- NULL values are considered equal for DISTINCT purposes
- Some databases limit DISTINCT operations on LOB (Large Object) columns
- DISTINCT operations prevent use of certain optimizations like partition pruning
-
Memory-intensive for wide tables (consider
SELECT DISTINCT column FROM (subquery)
)
6. Optimization Techniques
-
Pre-Filter Data
SELECT DISTINCT column FROM table WHERE filter_condition;
-
Use Derived Tables
SELECT DISTINCT column FROM (SELECT * FROM table LIMIT 1000) subq;
-
Consider GROUP BY Alternative
SELECT column FROM table GROUP BY column;
Often more efficient for single-column distinct values
-
Create Covering Indexes
CREATE INDEX idx_covering ON table (distinct_column, other_columns);
Key technical differences from regular SELECT:
- Memory Usage: Requires storing comparison values for deduplication
- Execution Plans: Shows unique operations like "Hash Aggregate" or "Sort (Unique)"
- Index Utilization:Can sometimes leverage covering indexes to avoid table scans
- Sorting Behavior: Implicit sorting not guaranteed (use explicit ORDER BY if needed)
- Parallel Execution:Often limited compared to non-DISTINCT queries