Pivoting Tall Tables Into Wide Tables

Via my Contact Form, mb wrote:

do you have a trick for denormalizing tall tables with monthly data into wide tables that doesn’t need 12 queries?

Short Answer: I wish.

It would be extremely nice if standard SQL had some kind of pivoting feature. Some DB systems have their own keywords for allowing you to do it, but they aren’t anywhere close to standard. I have a laundry list of things that I wish standard SQL had, and this feature is very close to the top. I’ve come across several ways to do this that I’ll share here – hopefully one of them will be useful to you.

Oh, and if I say "buckets", that’s just a synonym for the columnized values.

The Setup

Assume we’re working with a web log — a table with several value fields and a category field that will be used to pivot on. In this case, that pivotable category field is a date field and the value fields are the duration (summable) and the path (groupable). Let’s go one step beyond and say that the category (date) field contains real dates, not just aggregate-friendly dates. Like so:

ColumnTypeComments
PathvarcharThe full path to the page being requested
When_DatedatetimeSub-second resolution?
DurationmoneySeconds to 4 decimal places

Yet again, I like using the money type instead of decimal, but that’s just my personal preference.

The Goals

The boss wants the following reports:

Page Popularity Over Time
Hits per page (rows) per month (columns) for the last 6 months
Page Performance Variance
Hits per page (rows) in buckets (columns) for durations in (0-1,1-2,2-5,5-10,10+) seconds for the last month
Server Load Hotspots
Hits per hour of day (rows) for each day of the week (columns) for the last month

Ye Olde CASE Statements

This is probably the most common way of pivoting data, especially when all of the data is coming out of one place and the selection criteria isn’t too complex. It can accomplish all three of our goals with a minimum of fuss.

For our first goal, we’d use the following SQL:

SELECT
  Path,
  SUM(CASE WHEN i = 0 THEN 1 ELSE 0 END) AS M0,
  SUM(CASE WHEN i = 1 THEN 1 ELSE 0 END) AS M1,
  SUM(CASE WHEN i = 2 THEN 1 ELSE 0 END) AS M2,
  SUM(CASE WHEN i = 3 THEN 1 ELSE 0 END) AS M3,
  SUM(CASE WHEN i = 4 THEN 1 ELSE 0 END) AS M4,
  SUM(CASE WHEN i = 5 THEN 1 ELSE 0 END) AS M5
FROM Web_Hits INNER JOIN (
    SELECT MONTH(DATEADD(MONTH,-i,GETDATE())) AS m, i
    FROM Pivot100
    WHERE i BETWEEN 0 AND 5
  ) AS a ON (MONTH(When_Date) = m)
WHERE (When_Date > DATEADD(MONTH,-7,GETDATE()))
GROUP BY Path
ORDER BY Path

Let’s take this apart. Obviously, we need a column for our Path. The next 6 columns use the CASE statement to filter the hits into buckets. The i field comes from the subquery and represents the number of months into the past for the bucket. I’ve talked about the wonders of pivot tables in the past, and this is yet another scenario where they come in handy.

The subquery grabs the values 0 through 6 and puts them in i, then also figures out which month of the year that value represents and puts that in m. For this month, an i of 0 yields an m of 3 (March), back through an i of 5 yielding an m of 10 (October). We join our subquery to our Web_Hits table by the month of the year, which means that the i that represents the month offset gets dragged along with it. This is how we can then use that i month offset in the CASE statements.

Our WHERE clause just makes sure that we only get records for the last 7 months, and thus don’t have to worry about months from different years tainting our results. If your report needs to go back more than a year, then you’d have to be a bit sneakier about your join. Also, we use 7 months in the filter to account for the fact that we want the whole month on that furthest-back month, not just what is left in it. The days at the end of the 7th unused month that are brought in by the WHERE clause will just get dropped on the floor because not only do they not fit into any of our CASE buckets, but they won’t match up when we do our join.

Pretty simple, right? It turns out that if you can get used to the wacky look of a conditional (CASE) sum, then the rest of the reports won’t give you too much trouble.

The second goal would use the following SQL:

SELECT
  Path,
  SUM(CASE WHEN Duration < 1 THEN 1 ELSE 0 END) AS D_0_1,
  SUM(CASE WHEN Duration >= 1 AND Duration < 2 THEN 1 ELSE 0 END) AS D_1_2,
  SUM(CASE WHEN Duration >= 2 AND Duration < 5 THEN 1 ELSE 0 END) AS D_2_5,
  SUM(CASE WHEN Duration >= 5 AND Duration < 10 THEN 1 ELSE 0 END) AS D_5_10,
  SUM(CASE WHEN Duration >= 10 THEN 1 ELSE 0 END) AS D_10
FROM Web_Hits
WHERE (When_Date > DATEADD(MONTH,-6,GETDATE()))
GROUP BY Path
ORDER BY Path

The goal sounds harder, but it turns out to be even easier. Our CASE statements filter the hit into the correct bucket based on the duration. Since we’re not so worried about half-empty months, we can do a simple WHERE filter to grab 6 months of data. That’s it. No muss, no fuss.

Based on your success with the second goal, you might try the following for the third goal:

SELECT
  DATEPART(HOUR,When_Date) AS Hour_Of_Day,
  SUM(CASE WHEN DATEPART(WEEKDAY,When_Date) = 1 THEN 1 ELSE 0 END) AS Sun,
  SUM(CASE WHEN DATEPART(WEEKDAY,When_Date) = 2 THEN 1 ELSE 0 END) AS Mon,
  SUM(CASE WHEN DATEPART(WEEKDAY,When_Date) = 3 THEN 1 ELSE 0 END) AS Tue,
  SUM(CASE WHEN DATEPART(WEEKDAY,When_Date) = 4 THEN 1 ELSE 0 END) AS Wed,
  SUM(CASE WHEN DATEPART(WEEKDAY,When_Date) = 5 THEN 1 ELSE 0 END) AS Thu,
  SUM(CASE WHEN DATEPART(WEEKDAY,When_Date) = 6 THEN 1 ELSE 0 END) AS Fri,
  SUM(CASE WHEN DATEPART(WEEKDAY,When_Date) = 7 THEN 1 ELSE 0 END) AS Sat
FROM Web_Hits
WHERE (When_Date > DATEADD(MONTH,-6,GETDATE()))
GROUP BY DATEPART(HOUR,When_Date)
ORDER BY DATEPART(HOUR,When_Date)

This looks similar to the first two, except that instead of a simple Path column as our spine, we’re now using the hour of the day. So really, we have one part of the time in the rows, then another part of the time in the columns. It sounds hinky, but it’s really useful sometimes. The rest of it looks pretty straightforward, right?

But there’s a problem: if you are missing data for a specific hour, it won’t return a row in your report. That is, you expect that query to always return 24 rows, right? Well, if your web site never gets any hits between 3 and 4 in the morning, the query might only return 23 rows. How can we fix this? More pivot table action, of course!

SELECT
  i AS Hour_Of_Day,
  SUM(CASE WHEN DATEPART(WEEKDAY,When_Date) = 1 THEN 1 ELSE 0 END) AS Sun,
  SUM(CASE WHEN DATEPART(WEEKDAY,When_Date) = 2 THEN 1 ELSE 0 END) AS Mon,
  SUM(CASE WHEN DATEPART(WEEKDAY,When_Date) = 3 THEN 1 ELSE 0 END) AS Tue,
  SUM(CASE WHEN DATEPART(WEEKDAY,When_Date) = 4 THEN 1 ELSE 0 END) AS Wed,
  SUM(CASE WHEN DATEPART(WEEKDAY,When_Date) = 5 THEN 1 ELSE 0 END) AS Thu,
  SUM(CASE WHEN DATEPART(WEEKDAY,When_Date) = 6 THEN 1 ELSE 0 END) AS Fri,
  SUM(CASE WHEN DATEPART(WEEKDAY,When_Date) = 7 THEN 1 ELSE 0 END) AS Sat
FROM Pivot100 AS i LEFT OUTER JOIN Web_Hits
  ON (DATEPART(HOUR,When_Date) = i)
  AND (When_Date > DATEADD(MONTH,-6,GETDATE()))
WHERE (i BETWEEN 0 AND 23)
GROUP BY i
ORDER BY i

We’ve made our pivot table our spine so that we know we’ll always have 24 records. By using an OUTER join to our hit data, we ensure that even if there are holes, we won’t accidentally eliminate any data. The join itself is pretty simple: match the pivot value to the hour of day. The catch is that the date range filter now has to go in the join clause because the date is now on the right side and isn’t guaranteed to exist for every value on the left side. Which, remember, was the whole point of using the pivot table in the first place.

So, to sum up, using CASE statements to pivot tables is pretty darn useful. They are great when the logic is simple, but I think you can imagine that they would quickly get very complex as the logic gets more complex.

The Harder Way: Joins

You could do it this way:

SELECT h.Path, m0, m1, m2, m3, m4, m5
FROM (
    SELECT DISTINCT Path
    FROM Web_Hits
    WHERE (When_Date > DATEADD(MONTH,-7,GETDATE()))
  ) AS h
  LEFT OUTER JOIN (
    SELECT Path, COUNT(*) AS m0
    FROM Web_Hits
    WHERE (When_Date > DATEADD(MONTH,-7,GETDATE()))
      AND MONTH(When_Date) = MONTH(GETDATE())
    GROUP BY Path
  ) AS d0 ON (h.Path = d0.Path)
  LEFT OUTER JOIN (
    SELECT Path, COUNT(*) AS m1
    FROM Web_Hits
    WHERE (When_Date > DATEADD(MONTH,-7,GETDATE()))
      AND MONTH(When_Date) = MONTH(DATEADD(MONTH,-1,GETDATE()))
    GROUP BY Path
  ) AS d1 ON (h.Path = d1.Path)
  LEFT OUTER JOIN (
    SELECT Path, COUNT(*) AS m2
    FROM Web_Hits
    WHERE (When_Date > DATEADD(MONTH,-7,GETDATE()))
      AND MONTH(When_Date) = MONTH(DATEADD(MONTH,-2,GETDATE()))
    GROUP BY Path
  ) AS d2 ON (h.Path = d2.Path)
  LEFT OUTER JOIN (
    SELECT Path, COUNT(*) AS m3
    FROM Web_Hits
    WHERE (When_Date > DATEADD(MONTH,-7,GETDATE()))
      AND MONTH(When_Date) = MONTH(DATEADD(MONTH,-3,GETDATE()))
    GROUP BY Path
  ) AS d3 ON (h.Path = d3.Path)
  LEFT OUTER JOIN (
    SELECT Path, COUNT(*) AS m4
    FROM Web_Hits
    WHERE (When_Date > DATEADD(MONTH,-7,GETDATE()))
      AND MONTH(When_Date) = MONTH(DATEADD(MONTH,-4,GETDATE()))
    GROUP BY Path
  ) AS d4 ON (h.Path = d4.Path)
  LEFT OUTER JOIN (
    SELECT Path, COUNT(*) AS m5
    FROM Web_Hits
    WHERE (When_Date > DATEADD(MONTH,-7,GETDATE()))
      AND MONTH(When_Date) = MONTH(DATEADD(MONTH,-5,GETDATE()))
    GROUP BY Path
  ) AS d5 ON (h.Path = d5.Path)

But dang that is U-G-L-Y. Just say NO.

Seriously though, I have seen this technique used in a few places. The advantage to using joins like this is that each join can be optimized to use its particular keys and strengths, and the join happens after your tables have been squished town to their barest essentials. It’s not all that applicable here where the data is in one place and nice and homogeneous, but it can really come in handy where you have mongoloid tables with only a few distinct keys and don’t want to waste the time joining then summing. A web log with millions of records for just a few pages tied to a revision-control log for changes to those pages, for example.

There Is Strength In Numbers: UNION

This last technique is nice because it can be done in pieces and parts and doesn’t require any joins:

SELECT Path, SUM(m0) AS m0, SUM(m1) AS m1, SUM(m2) AS m2, SUM(m3) AS m3, SUM(m4) AS m4, SUM(m5) AS m5
FROM (
    SELECT Path, COUNT(*) AS m0, 0 AS m1, 0 AS m2, 0 AS m3, 0 AS m4, 0 AS m5
    FROM Web_Hits
    WHERE (When_Date > DATEADD(MONTH,-7,GETDATE()))
      AND MONTH(When_Date) = MONTH(GETDATE())
    GROUP BY Path
    UNION ALL
    SELECT Path, 0 AS m0, COUNT(*) AS m1, 0 AS m2, 0 AS m3, 0 AS m4, 0 AS m5
    FROM Web_Hits
    WHERE (When_Date > DATEADD(MONTH,-7,GETDATE()))
      AND MONTH(When_Date) = MONTH(DATEADD(MONTH,-1,GETDATE()))
    GROUP BY Path
    UNION ALL
    SELECT Path, 0 AS m0, 0 AS m1, COUNT(*) AS m2, 0 AS m3, 0 AS m4, 0 AS m5
    FROM Web_Hits
    WHERE (When_Date > DATEADD(MONTH,-7,GETDATE()))
      AND MONTH(When_Date) = MONTH(DATEADD(MONTH,-2,GETDATE()))
    GROUP BY Path
    UNION ALL
    SELECT Path, 0 AS m0, 0 AS m1, 0 AS m2, COUNT(*) AS m3, 0 AS m4, 0 AS m5
    FROM Web_Hits
    WHERE (When_Date > DATEADD(MONTH,-7,GETDATE()))
      AND MONTH(When_Date) = MONTH(DATEADD(MONTH,-3,GETDATE()))
    GROUP BY Path
    UNION ALL
    SELECT Path, 0 AS m0, 0 AS m1, 0 AS m2, 0 AS m3, COUNT(*) AS m4, 0 AS m5
    FROM Web_Hits
    WHERE (When_Date > DATEADD(MONTH,-7,GETDATE()))
      AND MONTH(When_Date) = MONTH(DATEADD(MONTH,-4,GETDATE()))
    GROUP BY Path
    UNION ALL
    SELECT Path, 0 AS m0, 0 AS m1, 0 AS m2, 0 AS m3, 0 AS m4, COUNT(*) AS m5
    FROM Web_Hits
    WHERE (When_Date > DATEADD(MONTH,-7,GETDATE()))
      AND MONTH(When_Date) = MONTH(DATEADD(MONTH,-5,GETDATE()))
    GROUP BY Path
  ) AS a
GROUP BY Path
ORDER BY Path

This is almost the same as what we did last time, but just formatted a little differently. It’s not quite as ugly as the last one, but nor is it all that pretty. You are doing basically the same thing: handling one column at a time and saving all of the merging and summing for the end. Again, this is nice for heterogeneous data because each column can include its own tables and whatnot. However, the benefit that this one has over the last one is that there aren’t any joins. This can come in handy when the subqueries that you are using to generate the data already have a lot of joins and you start getting errors from your database telling you that you have too many tables in your query. The UNION method makes it really easy to process the job in batches. Sure, you could make each subquery in the previous method a temporary table, but that’s a table for each column and a lot of joining at the end. With this method you would only need one temp table and no joining.

This method also comes in handy when you want to recalculate your totals but only one part of the source data has changed – in this example only the current month is going to change on a daily basis. With this method you can add a Source column that tells you where each piece of data is coming from. When you want to refresh your totals you can just delete rows from that source, reinsert the new rows, and rerun just the totaling query. That’s very useful when you’re doing more in the pieces than just simple COUNT statements.

Conclusion

As you can see, each method has its strengths and weaknesses. I have at least one real-world use of each method in place right now in the data-mining system I built for my job, so I’m not just blowing smoke up your skirts. If you have another way of doing the same type of thing, drop me a line and I’ll feature it in my blog.

Published by

Rick Osborne

I am a web geek who has been doing this sort of thing entirely too long. I rant, I muse, I whine. That is, I am not at all atypical for my breed.