A blog about SQL Server, SSIS, C# and whatever else I happen to be dealing with in my professional life.

Find ramblings

Thursday, October 6, 2016

UNION removes duplicates

UNION removes duplicates

When you need to combine two sets of data together, we use the UNION operator. That comes in two flavors: UNION and UNION ALL. The default is to remove duplicates between the two sets whereas UNION ALL does no filtering.

Pop quiz! Given the following sets A and B

What's the result of SELECT * FROM A UNION SELECT * FROM B;

Piece of cake, we start with everything in A and get the values in B that aren't in A.

So we're looking at 1, 5, 9 7, 3, 3, 2, 3

Except of course that's not what is actually happening. UNION is actually going to smash both sets of data together and then take the distinct results. Or it does a distinct within each result set, smashes them together and takes one last pass to remove duplicates. I don't know or care about the actual mechanics, what I care about is the final outcome.

We actually end up with a result of 1, 5, 9, 7, 3, 2. In the fifteen years I've been writing SQL statements, I don't think I ever realized that behavior of the final result set being distinct. I thought it was purely an intra set dedupe process.

I thought wrong


William Todd Salzman said...

Yep, I've told many other developers that if they use union, they really don't need to put a distinct clause in each of the selects that are being union'd. Of course I've told as many that if they want all of the results, they have to use union all. One of those little quirks I picked up a while ago and forget from time to time...

Ed Klein said...

A performance issue you might want to be aware of. I have found situations where Union ALL runs significantly faster than Union. I assumed it was because it didn't have to go through the processing of identifying the duplicates to remove them. This may vary in different releases or implementations.