I’m working on my curriculum for next month’s Advanced Database Structures class, and one of my examples was going to be about how to detect runs in series data using SQL. Except, I realized, that it’s probably a little over the heads of the students. But, I’ve already done all of the SQL and breakdown, so you’re going to benefit from my over-planning.
You have a table that has a bunch of numbers or dates. For example, you might have:
1, 2, 3, 4, 6, 8, 9
You’d like to group these numbers into runs—contiguous groups of numbers. Your result set should look like:
1-4, 6, 8-9
How do you do this in SQL?
Before I continue, I should point out that this is a classic SQL problem and is tackled in many SQL reference manuals, including my favorite: the SQL Cookbook by Anthony Molinaro from O’Reilly Media. My solution is not the one presented by Anthony, but his is just as valid. (Probably more so, as he presents optimized solutions for multiple different systems.) My solution is very generic SQL, unoptimized, but should work on any SQL DBMS.
My test table is going to be called Series and will have exactly one column s, which is of type tinyint. It could just as easily be a date or a character, or really any sort of countable, enumerable type. This yields the following:
CREATE TABLE Series ( s TINYINT NOT NULL )
For ease of examples, I’m only going to populate the table with numbers above for now. Later, we’ll try to break my solution by doing nasty things like adding duplicate numbers.
The solution we’re looking for is a set of start and finish numbers. Standalone numbers will occupy both sides, like so:
Looking at the given numbers and the required solution, it’s easy to visualize that what we’re looking for is really the edges of the runs:
Let’s break that down one edge at at time, starting with the leading edge. We know we’re looking for (1, 6, 8) as our result set. The definition of a leading edge is that it doesn’t have a number in the series before it. To find such numbers, we’d need to compare each number with the numbers in the set that come before it, looking for matches. The leading edges would then be the ones that don’t have a match.
This is easiest to do with a self-join:
SELECT r1.s AS Start FROM Series AS r1 LEFT JOIN Series AS r2 ON (r1.s = r2.s + 1) WHERE r2.s IS NULL
See what we did there? We join the table to itself, looking for numbers that are one more on the left (r1) side than they are on the right (r2) side. We then throw out anything that has a right-hand (r2) match by looking for where the right side is NULL. What’s left is the set of numbers where there isn’t a matching one-less number: (1, 6, 8). Visually:
Getting the trailing edges works out the same way, but this time we want the left (r1) side to be one less than the right (r2) side. All we have to do is flip a single sign:
SELECT r1.s AS Finish FROM Series AS r1 LEFT JOIN Series AS r2 ON (r1.s = r2.s - 1) WHERE r2.s IS NULL
That gives us our trailing edges: (4, 6, 9).
Mind the Gap
Let’s join those together and see where we stand. But, let’s be smart about it. Since we know that the leading edges should always come before the trailing edges, let’s make that inequality part of our join:
SELECT Start, Finish FROM ( SELECT r1.s AS Start FROM Series AS r1 LEFT JOIN Series AS r2 ON (r1.s = r2.s + 1) WHERE r2.s IS NULL ) AS v1 INNER JOIN ( SELECT r1.s AS Finish FROM Series AS r1 LEFT JOIN Series AS r2 ON (r1.s = r2.s - 1) WHERE r2.s IS NULL ) AS v2 ON (Start <= Finish)
That yields the following results:
We’ve got a few rows that we need to eliminate, namely (1, 6), (1, 9), and (6, 9). We can see that they’re wrong, but what makes them wrong?
We know that they’re wrong because they have holes in them. The (1, 6) pair is missing 5, the (1, 9) is missing 5 and 7, and the (6, 9) pair is missing 7. The difference, then, is the count of numbers that should be in the run versus the count of numbers that actually are in the run. So let’s compare the two counts. To do that, we need to join to our source table one more time: we’ll pull in any numbers between our Start and Finish edges and count them.
(To make this SQL a little friendlier, I’m going to elide the subqueries as they are unchanged.)
SELECT Start, Finish, COUNT(*) AS InRun FROM ( .. ) AS v1 INNER JOIN ( .. ) AS v2 ON (Start <= Finish) INNER JOIN Series AS v3 ON (v3.s BETWEEN Start AND Finish) GROUP BY Start, Finish
This gives us the following results with an extra column counting the numbers that do exist in the run:
We can now do a bit more math: we know that the InRun count should be one more than Finish minus Start. That is:
Finish - Start = InRun - 1
We can eliminate non-matching rows by adding a single line to the end of our query:
HAVING COUNT(*) - 1 = Finish - Start
Et voila, we have our correct result set:
One Step Beyond
As it stands, our query has one fatal flaw: duplicates. What if we had (1, 2, 2, 3, 4) with the extra 2? That would throw off our count, eliminating the set as a run. Worse, a duplicate followed by a hole masks a bad set: (1, 2, 3, 3, 4, 6) looks like 1-6 instead of just 1-4 because it starts at 1 and ends at 6 and has 6 numbers. Also, if we had (1, 1, 2, 3, 4), the extra 1 would show up as an extra leading edge, just as an extra 4 would show up as an extra trailing edge.
Luckily, this is easy to fix.
We can eliminate the duplicate edges by transforming our SELECT clauses in our subqueries into SELECT DISTINCT clauses.
Fixing the duplicate+hole problem is a little trickier, but follows the same concept. Instead of counting the rows that fall between our Start and End, we want to count the distinct values. This requires us to transform our v3 aliased table into a simple subquery:
INNER JOIN ( SELECT DISTINCT s AS Middle FROM Series ) AS v3 ON (Middle BETWEEN Start AND End)
Our solution is presented here in its entirety:
SELECT Start, Finish FROM ( SELECT DISTINCT r1.s AS Start FROM Series AS r1 LEFT JOIN Series AS r2 ON (r1.s = r2.s + 1) WHERE r2.s IS NULL ) AS v1 INNER JOIN ( SELECT DISTINCT r1.s AS Finish FROM Series AS r1 LEFT JOIN Series AS r2 ON (r1.s = r2.s - 1) WHERE r2.s IS NULL ) AS v2 ON (Start <= Finish) INNER JOIN ( SELECT DISTINCT s AS Middle FROM Series ) AS v3 ON (Middle BETWEEN Start AND Finish) GROUP BY Start, Finish HAVING COUNT(*) - 1 = Finish - Start ORDER BY Start