SQL: Getting TOP N rows for a grouped query

Jan 07

Ben Nadel posted on his blog earlier today an entry that looked into ways of getting the TOP N rows for a grouped dataset. In his example, if you’ve got blondes and brunettes, and you want to get the top three girls for each hair color, but you want to do it in a single query, how do you do that? I posted a solution, and this post explains how that solution works in detail.

Given the following table:

Table: girl
idnamehairscore
1KimBrunette8
2AnneBrunette7
3SarahBrunette10
4DeborahBrunette9
5MiaBrunette5
6SamanthaBrunette0
7Jo AnnBlonde7
8KatieBlonde8
9BeccaBlonde9
10MiniBlonde5
11LaurenBlonde4
12KitBlonde10

The solution I posted is this:

SELECT c.*, d.ranknum
FROM girl AS c
  INNER JOIN (
    SELECT a.id, COUNT(*) AS ranknum
    FROM girl AS a
      INNER JOIN girl AS b ON (a.hair = b.hair) AND (a.score <= b.score)
    GROUP BY a.id
    HAVING COUNT(*) <= 3
  ) AS d ON (c.id = d.id)
ORDER BY c.hair, d.ranknum

Of course, I must say that while this is the solution that I presented, it is not “my” solution, which is to say that I didn’t come up with it out of thin air. I think I learned it from the excellent “Transact-SQL Cookbook” by Ales Spetic and Jonathan Gennick. A problem very similar to this is solved in Section 2.12 in that book. I highly recommend the book as a desk reference — it solves countless practical every-day SQL problems.

Enough with the disclaiming. That solution gives the following results:

idnamehairscoreranknum
12KitBlonde101
9BeccaBlonde92
8KatieBlonde83
3SarahBrunette101
4DeborahBrunette92
1KimBrunette83

As you can see, within each hair color group, we have the three girls that have the best scores, and we have their rank within their group.

Let’s build that query, starting with the self-join in the subquery. I’ll throw in a few extra columns to make the intermediate results easier to read:

SELECT a.name, a.score, b.name, b.score
FROM girl AS a INNER JOIN girl AS b ON (a.hair = b.hair)

All we’ve done so far is create a cartesian product from our original table. That is, we’ve basically squared the table — it’s gone from 12 rows to 42 rows, and every girl is now paired with another girl (including herself) with the same hair color. We’ve left out the part comparing the scores for now. Here’s a partial result:

a.namea.scoreb.nameb.score
Kim8Anne7
Kim8Deborah9
Kim8Kim8
Kim8Mia5
Kim8Samantha0
Kim8Sarah10

Let’s add in the score comparison:

SELECT a.name, a.score, b.name, b.score
FROM girl AS a INNER JOIN girl AS b ON (a.hair = b.hair) AND (a.score <= b.score)

Kim’s competition has been reduced significantly:

a.namea.scoreb.nameb.score
Kim8Deborah9
Kim8Kim8
Kim8Sarah10

But really, we just want to see each girl’s rank. We can do that by counting the number of girls that have a better or equal score. In this case, we can see that Kim’s rank would be 3. This is accomplished with the GROUP BY statement:

SELECT a.hair, a.name, a.score, COUNT(*) AS ranknum
FROM girl AS a INNER JOIN girl AS b ON (a.hair = b.hair) AND (a.score <= b.score)
GROUP BY a.hair, a.name, a.score

Now we see that each girl has her own score, and that the scores are only relevant to each hair color group. I’ll just show the brunettes here:

hairnamescoreranknum
BrunetteAnne74
BrunetteDeborah92
BrunetteKim83
BrunetteMia55
BrunetteSamantha06
BrunetteSarah101

Since we only want the top three from each group, by rank, we can use a HAVING statement to filter out the lower ranks.

SELECT a.hair, a.name, a.score, COUNT(*) AS ranknum
FROM girl AS a INNER JOIN girl AS b ON (a.hair = b.hair) AND (a.score <= b.score)
GROUP BY a.hair, a.name, a.score
HAVING COUNT(*) <= 3

This gives us our winners:

hairnamescoreranknum
BrunetteSarah101
BrunetteDeborah92
BrunetteKim83

You could stop there, but I’ve taken it a step further. In the real world, you’re probably going to want a whole bunch of information for each row, not just the basics. You have two choices for how you want to bring that information in. In our last SQL statement, we’ve taken the approach of adding each item into the GROUP BY clause. To get the results that we wanted, we’d have to add a few more:

SELECT a.id, a.hair, a.name, a.score, COUNT(*) AS ranknum
FROM girl AS a INNER JOIN girl AS b ON (a.hair = b.hair) AND (a.score <= b.score)
GROUP BY a.id, a.hair, a.name, a.score
HAVING COUNT(*) <= 3

The problem here is twofold — you have to manually add in each column in two different places, and it’s going to take longer to process. Remember that for each column in the GROUP BY, the SQL engine is going to have to do more and more comparisons to perform the grouping. Instead, since each of the rows has a primary key, why don’t we just get that?

SELECT a.id, COUNT(*) AS ranknum
FROM girl AS a INNER JOIN girl AS b ON (a.hair = b.hair) AND (a.score <= b.score)
GROUP BY a.id
HAVING COUNT(*) <= 3

Now our GROUP BY is much simpler and the query should perform much faster, and it should be much easier to maintain. However, we’re still going to need to get out the data from the table, so we wrap it in another query and we’re back to our solution:

SELECT c.*, d.ranknum
FROM girl AS c
  INNER JOIN (
    SELECT a.id, COUNT(*) AS ranknum
    FROM girl AS a
      INNER JOIN girl AS b ON (a.hair = b.hair) AND (a.score <= b.score)
    GROUP BY a.id
    HAVING COUNT(*) <= 3
  ) AS d ON (c.id = d.id)
ORDER BY c.hair, d.ranknum

The nice thing about this is that since you’re joining the subquery to the outer query on the table’s primary key, it should really fly. And no, I’m certainly not advocating that you use SELECT * in your queries — I’ve just done it here to remove visual noise from the examples.

I’m eliding some of the caveats to this technique, such as performace implications for wicked large tables and tie-breaking, but this should be a pretty solid foundation to work from.