The DISTINCT clause is an optional part of the SELECT statement and returns a set of unique records based on the columns included in the SELECT list.
When it comes to optimizing SQL Server performance, using the DISTINCT clause in queries where it isn’t needed is one of the most common and easiest performance problems to fix. Years ago, I worked with a customer where it was a programming standard to include DISTINCT for every query because of a poorly designed data model. Within a couple of days of consulting, we were able to add constraints to the data model and modify the existing queries on the customer’s tables to increase performance by 20%. So, while using DISTINCT to return a unique list of rows can sometimes be necessary, please be aware it comes at a cost to performance.
This post will discuss the DISTINCT clause in detail, including its use cases, edge cases to be aware of, and common performance considerations. Most of the examples use the AdventureWorks sample database.
General Syntax
The general syntax for DISTINCT is the following:
The DISTINCT clause may only be used with a SELECT statement and must follow the SELECT keyword. Any column names or derived columns following the DISTINCT keyword will be consolidated to return a unique set of rows across those fields. Behind the scenes, the deduplication of rows generally happens in one of two ways:
- Introducing a sort operator to sort data and remove duplicates
- SQL Server may introduce an aggregation into the query to remove the duplicate records
The execution plan below is for the simple query above, and you can see SQL Server chose to use a Hash Match Aggregation.
DISTINCT and Collations
Notably, when comparing string data types, duplicate values result from the collation used. Collations in SQL Server provide sorting rules, case, and accent sensitivity properties for your data and can be specified at the server level, database level, column level, and expression level. You can read more about collations in the Microsoft documentation here.
Consider the following example where I enter my last name twice into a table:
My SQL Server installation uses the SQL_Latin1_General_CP1_CI_AS collation, which is case-insensitive. When I run the following query against the table I created, a single record is returned.
However, if I adjust the query and force a case-sensitive expression-level collation, both records are returned. You should be especially aware of this when dealing with case-sensitive collations introducing results you don’t expect.
GROUP BY vs. DISTINCT
GROUP BY provides similar functionality to the DISTINCT operator in that it can return a unique list of rows in a result set. However, GROUP BY also allows you to perform aggregate functions (e.g., SUM, COUNT, AVG) on the groups of rows returned by the statement.
The question of, “Should I use GROUP BY or DISTINCT syntax in this query?” is one of the most common questions I’ve seen developers ask. As you may guess, the answer is “It depends!”
The main consideration is if aggregations are included in the query. If so, using GROUP BY is the correct answer. In fact, in most cases, GROUP BY will provide the same functionality when you need to remove duplicates.
One issue I still see is using the SELECT DISTINCT statement where it’s not needed. This can be resolved by simply understanding (and trusting) the data model a bit more to know if a particular query might produce duplicate values (and whether this is an issue or not).
Adding a SELECT DISTINCT clause in a query often introduces an aggregation or sort behind the scenes. This overhead can be too much if the query is operating on a large data set because of the order of operations SQL Server uses (listed below) when it comes to using DISTINCT vs. GROUP BY:
- FROM
- ON
- JOIN
- WHERE
- GROUP BY
- WITH CUBE or WITH ROLLUP
- HAVING
- SELECT
- DISTINCT
- ORDER BY
- TOP
Notice how GROUP BY is applied before the DISTINCT clause. The DISTINCT clause is almost the last clause of a SELECT statement to be processed.
My good friend Aaron Bertrand has an excellent article on comparing and contrasting the intricacies of the GROUP BY and DISTINCT statements here.
Using DISTINCT With ORDER BY
It’s often useful to order the results of a query. For a normal SELECT statement, you can include columns in the ORDER BY that aren’t part of the SELECT list, but this isn’t the case when the SELECT statement includes the DISTINCT clause.
The following query returns the OrderDate and ShipDate columns from Sales.SalesOrderDetail and then orders the result set based on the TerritoryID column. This is a perfectly valid SQL statement:
If I change the above query to include only a DISTINCT list of values, then the query fails due to the ORDER BY clause containing a column not in the SELECT list:
The way to get around this is to include the TerritoryID as part of the SELECT column list:
Using DISTINCT in Aggregate Functions
You can also include the DISTINCT clause in aggregate functions to perform aggregation on a distinct list of values. The most commonly used aggregate function with DISTINCT is the T-SQL COUNT function.
In the following query, I return a list of all ProductID values and CarrierTrackingNumber values from the Sales.SalesOrderDetail table. Notice how the count for the ProductID column is significantly higher than the count of the CarrierTrackingNumber values. This is because the CarrierTrackingNumber column contains NULL values while the ProductID column does not allow NULL values. NULL values aren’t considered in aggregate functions.
After enabling SET STATISTICS IO ON and getting the execution plan, we can see the query used a clustered index scan and read 1,246 pages:
Now, I’ll introduce the DISTINCT clause inside of a COUNT aggregation on the ProductID column:
From the returned results, we see there are 266 distinct values in the ProductID column:
Using DISTINCT inside the COUNT function is a helpful way to determine how unique the distribution of values in a column is. However, like everything else regarding computing, there is a trade-off as this aggregation is not cost-free.
Although all the data in the table is already being scanned, SQL Server still needs to reaccess a data structure to perform the DISTINCT aggregation. In this case, because a non-clustered index exists on the ProductID column, SQL Server can scan that index to return the distinct values for the column. Below we see the extra scan of the non-clustered index in the execution plan:
We also see the extra pages read from the buffer pool for the second index scan:
So, what happens if I include a COUNT DISTINCT operation on a column without an existing non-clustered index? In this case, an extra table scan must occur to perform the DISTINCT aggregation. Below I include a COUNT DISTINCT on the CarrierTrackingNumber column, and the column returns 3,806 distinct values:
And the resulting execution plan includes an extra clustered index scan:
This is a crucial point to take away. Just because your query only references a single table doesn’t mean your query will, at maximum, read all of the data in the table a single time.
The IN Clause
The IN clause allows for the specification of multiple search values in a WHERE clause. It’s an alternative to writing multiple OR statements in the predicate. Sub-queries can also be used as a source for the IN clause. This sub-query must return a single column, and the list of values returned will always return a DISTINCT list of values. Notice in the execution plan for the query below, a Stream Aggregate is performed to return a unique list of ProductID values from the sub-query:
If I alter the query a bit to include a GROUP BY instead of the simple inner query above, we can see the resulting execution plan is the same:
So, what happens if the inner query returns a list of already unique values? In such a case, SQL Server is smart enough not to perform an aggregation to return a DISTINCT list.
To show this, let’s get a unique list of ProductID values from the SalesOrderDetail table and put them in a temp table. Notice how I include a NULL value in this table (more on why I’m doing this in a moment):
Next, I’ll create a UNIQUE clustered index on the table. This will guarantee no duplicate values can exist in the table. Also, UNIQUE indexes allow for a single NULL value in SQL Server:
Now, when I execute the query and use the temp table containing unique values, we can see from the execution plan how the aggregation to return a DISTINCT list of values is no longer needed.
Things start to get interesting if I modify the above query a bit to use a NOT IN clause instead of an IN clause:
The query returns no results. Recall from the INSERT statement above where I included a single NULL value into the ProductID column of the #DistinctProducts temp table. When an inequality comparison is made against a NULL value, the result is unknown (NULL). When SET ANSI_NULLS is set to ON, which is the default, any NOT IN comparison against a sub-query with a NULL value will return a NULL result set. I’ve seen many developers run into problems with this and not fully understand why no results were returned.
Luckily, the workaround is relatively easy – just include a predicate to filter out any NULL values:
When to Use DISTINCT in SQL
The DISTINCT clause is part of the SELECT statement and returns a unique set of rows for a query. As seen from the DISTINCT examples above, there can be much overhead in the aggregation or sorting process to return the unique list of rows in the result set.
So, when a unique data set is needed, consider using GROUP BY as an alternative because of its flexibility for performing aggregate functions. Also, only consider using DISTINCT when necessary due to the overhead to remove duplicate rows from the result set.
To monitor SQL Server queries and optimize performance more easily, SolarWinds® SQL Sentry can provide you with the ability to get to the root of query problems using Plan Explorer. Plan Explorer is a built-in tool designed to score algorithms to help you determine the best index to support a given query, view recommended indexes, and more. Try SQL Sentry with Plan Explorer free for 14 days.
Paul S. Randal is the CEO of SQLskills.com, which he runs with his wife Kimberly L. Tripp. Both Paul and Kimberly are widely-known and respected experts in the SQL Server world, and both are long-time SQL Server MVPs. Paul was a Contributing Editor for TechNet Magazine, where he wrote the bi-monthly SQL Q&A column and feature articles. He also had #1 top-rated workshops and sessions at the PASS Summit and TechEd. Paul is active in the SQL Server community, from user groups to online forums to helping out on Twitter (@PaulRandal – check out the #sqlhelp tag). His popular and widely-referenced blog can be found at https://www.sqlskills.com/blogs/paul/ and he can be reached at paul@sqlskills.com.