Using QoQ to intentionally make Cartesian products

Sep 17

For those that don’t remember their database theory classes, a Cartesian product is what you get when you take every element of one matrix (query) and match it up with every element of another matrix (query) so that you end up with every combination of every element. Like so:

A
1
2
×
B
3
4
AB
13
14
23
24

When doing database work we generally try very hard to avoid making unrestricted Cartesian products. Most of the time when we do make them, it is an accident, probably because we left off a key clause in a JOIN expression. On the occasion that we do want to do them intentionally, we can do them easily with a CROSS JOIN expression — or JOIN ON (1 = 1) if your DBMS doesn’t support that.

But! Cross products can be useful things, when used correctly.

Today’s blog post by Ben Nadel is about iterating over an unknown number of nested lists. Long story short for the attention-deficit crowd, here’s the skinny:

Given an array of lists, where neither the length of the array nor the length of the lists are fixed, how can we quickly build a set of every combination of the elements on that list?

Note that last part: “every combination of the elements”. Sound familiar?

I’m going to use the same array that Ben used:

<cfset arrLists = [
    "Blue,Black",
    "Small,Medium,Large",
    "Mens,Womens",
    "Full Sleeve,Short Sleeve"
  ] />

Here’s the code:

<cfloop from="1" to="#arrayLen(arrLists)#" index="i">
    <cfset p=queryNew("")>
    <cfset queryAddColumn(p, "a#i#", "VARCHAR", listToArray(arrLists[i]))>
    <cfif NOT isDefined("q")>
        <cfset q=p>
    <cfelse>
        <cfquery dbtype="query" name="q">
        SELECT * FROM q, p
        </cfquery>
    </cfif>
</cfloop>

If we then cfdump that query “q”, we get this:

 A1A2A3A4
1BlueSmallMensFull Sleeve
2BlueSmallMensShort Sleeve
3BlueSmallWomensFull Sleeve
4BlueSmallWomensShort Sleeve
21BlackLargeMensFull Sleeve
22BlackLargeMensShort Sleeve
23BlackLargeWomensFull Sleeve
24BlackLargeWomensShort Sleeve

So what did we do? In English:

  1. For each of the lists in the array,

    1. Create a new query “p” with a no columns.

    2. Add a column to query “p”. That column is named “A1” or “A2” or whatever the number of the current list is.

    3. Set its single column equal to the items. (Such as “Blue”.) That is, we’ve taken the current list and turned it into a query resultset.

    4. If we haven’t yet created our result query “q”, and are thus on our first iteration of the loop, then just set it equal to query “p”.

    5. But if we have query “q” already, and are thus at least on our second trip through the loop, then do some magic:

      1. Make a Cartesian product of our existing query “q” and the current list query “p”. We do this by joining the two without any restrictive clauses.

      2. Assign that Cartesian product back to our result query “q” via name="q".

  2. Repeat, making the Cartesian product bigger and bigger, until you run out of lists.

Or, visually, here’s how our result query “q” looks at the end of the first two iterations through the array loop:

 A1
1Blue
2Black
 A1A2
1BlueSmall
2BlueMedium
3BlueLarge
4BlackSmall
5BlackMedium
6BlackLarge

And so on and so forth.

Now … let’s make things a little trickier …

What if we wanted to add an “id” column to our resultset that was an automatically-generated unique identifier for that combination? In real-world terms, we want to build a SKU for that particular item configuration. And, just to make it even more realistic, lets say that each item in each list has a particular key, like so:

<cfset arrLists = [
    "u:Blue,k:Black",
    "1:Small,2:Medium,3:Large",
    "m:Mens,f:Womens",
    "l:Full Sleeve,s:Short Sleeve"
  ] />

We’ve added a character to identify each option uniquely. There’s no real reason for limiting it to a single character, nor limiting it to alpha or numeric. I’ve just done so for simplicity. The modifications to make a unique SKU are then:

<cfset columns="">
<cfloop from="1" to="#arrayLen(arrLists)#" index="i">
    <cfset columns=listAppend(columns, "a#i#")>
    <cfset p=queryNew("pid,a#i#","VARCHAR,VARCHAR")>
    <cfloop list="#arrLists[i]#" index="j">
        <cfset queryAddRow(p)>
        <cfset querySetCell(p, "pid", listFirst(j,":"), p.recordCount)>
        <cfset querySetCell(p, "a#i#", listRest(j,":"), p.recordCount)>
    </cfloop>
    <cfif NOT isDefined("q")>
        <cfquery dbtype="query" name="q">
        SELECT pid AS id, #columns#
        FROM p
        </cfquery>
    <cfelse>
        <cfquery dbtype="query" name="q">
        SELECT id + pid AS id, #columns#
        FROM q, p
        </cfquery>
    </cfif>
    <cfdump var="#q#">
</cfloop>

Most of the really weird stuff is actually just working around bugs in the QoQ engine. I’ll explain the differences.

  • The columns variable is going to hold those A1, A2, etc, column names for us. We got away without needing to keep an explicit list before, but keeping one now will make things much simpler. And there’s not much to it really: just append the latest column name to the list at each iteration.

  • We’ve added a “pid” column to our query. No surprise there.

  • You’ll notice that we now need to specify in the queryNew call that we are working with VARCHAR columns. This is needed if any of your identifying character strings are numeric, such as we have done with sizes. The QoQ engine will throw an error later if we don’t do this, as it “helpfully” converts those numbers to BIGINT columns, then can’t figure out how to add a VARCHAR and BIGINT together later.

  • In the loop where we set up the list query, we now have to set both the ID and value columns.

  • Our nice, simple, cfset q=p had to go away, unfortunately. To make the query work later, the ID column in the two queries could not be named the same thing. So, when we just copy over the query without the join, we need to change the column name as we do it.

    We know that A1 is our only column at the time, but I used #columns# instead so that if we decide to use a different naming scheme this is one less place we’ll have to change it later.

  • The Cartesian product query has only gotten a little more complex. We concatenate the old ID and the new ID, then bring over all of the data columns.

Again, with the visuals:

 A1ID
1Blueu
2Blackk
 A1A2ID
1BlueSmallu1
2BlueMediumu2
3BlueLargeu3
4BlackSmallk1
5BlackMediumk2
6BlackLargek3
 A1A2A3ID
1BlueSmallMensu1m
2BlueSmallWomensu1f
3BlueMediumMensu2m
4BlueMediumWomensu2f
5BlueLargeMensu3m
6BlueLargeWomensu3f
7BlackSmallMensk1m
8BlackSmallWomensk1f
9BlackMediumMensk2m
10BlackMediumWomensk2f
11BlackLargeMensk3m
12BlackLargeWomensk3f

It’s a little contrived, but there you go.