SQL SELECT DISTINCT


SQL
POSTGRES
INTERACTIVE

SELECT DISTINCT: Complete Technical Breakdown

1. Fundamental Syntax
SELECT DISTINCT column1, column2, ... 
FROM table_name 
[WHERE conditions] 
[ORDER BY columns];
2. Internal Execution Process
  1. Data Retrieval Phase
    1. Full table scan or index-only scan (if covering index exists)
    2. All specified columns are read into memory
  2. Distinct Processing Phase
    1. Sort-Based Deduplication (most common):
      1. All rows are sorted in memory or temporary storage
      2. Adjacent duplicates are eliminated during final pass
    2. Hash-Based Deduplication (for large datasets):
      1. Hash values computed for each row's distinct columns
      2. Hash table eliminates duplicates on the fly
3. Performance Considerations
Factor Impact
Number of columns More columns = larger comparison footprint
Column data types Text/VARBINARY comparisons are more expensive than integers
Dataset size May trigger disk-based temporary tables
Available indexes Covering indexes can enable index-only scans
4. Database-Specific Implementations
MySQL/InnoDB
  1. Uses filesort with temporary tables by default
  2. Switches to hash-based when optimizer_switch='distinct_optimization=on'
  3. Memory usage controlled by tmp_table_size
PostgreSQL
  1. Prefers hash aggregation for larger datasets
  2. Uses work_mem setting for in-memory operations
  3. Can leverage index-only scans when possible
SQL Server
  1. Stream aggregation operator for sorted inputs
  2. Hash match operator for unsorted data
  3. Memory grants based on estimated cardinality
Oracle
  1. HASH UNIQUE operation for most cases
  2. SORT UNIQUE when data is nearly sorted
  3. Can use bitmap indexes for distinct operations
5. Advanced Use Cases
Multiple Column Distinct
SELECT DISTINCT department_id, job_title 
FROM employees;

Creates unique combinations across all specified columns

COUNT(DISTINCT) Variations
SELECT 
COUNT(DISTINCT CASE WHEN status = 'active' THEN user_id END) 
FROM users;

Conditional distinct counting with filtering

Window Functions with DISTINCT
SELECT DISTINCT user_id, 
FIRST_VALUE(order_date) 
OVER (PARTITION BY user_id ORDER BY order_date) 
FROM orders;

Combines distinct with window function analysis

⚠️ Important Limitations
  1. NULL values are considered equal for DISTINCT purposes
  2. Some databases limit DISTINCT operations on LOB (Large Object) columns
  3. DISTINCT operations prevent use of certain optimizations like partition pruning
  4. Memory-intensive for wide tables (consider SELECT DISTINCT column FROM (subquery))
6. Optimization Techniques
  1. Pre-Filter Data
    SELECT DISTINCT column FROM table WHERE filter_condition;
  2. Use Derived Tables
    SELECT DISTINCT column FROM (SELECT * FROM table LIMIT 1000) subq;
  3. Consider GROUP BY Alternative
    SELECT column FROM table GROUP BY column;

    Often more efficient for single-column distinct values

  4. Create Covering Indexes
    CREATE INDEX idx_covering ON table (distinct_column, other_columns);

Key technical differences from regular SELECT:

  1. Memory Usage: Requires storing comparison values for deduplication
  2. Execution Plans: Shows unique operations like "Hash Aggregate" or "Sort (Unique)"
  3. Index Utilization:Can sometimes leverage covering indexes to avoid table scans
  4. Sorting Behavior: Implicit sorting not guaranteed (use explicit ORDER BY if needed)
  5. Parallel Execution:Often limited compared to non-DISTINCT queries