Yes, Virginia, that’s automated SQL to MongoDB MapReduce

I have to admit, I’m pretty darn proud of this one. This is a righteous hack. I can now write SQL against MongoDB with ColdFusion.

My project today was to take this SQL statement:

SELECT "goalType",
  SUM("distancekm") AS "totalkm",
  COUNT(*) AS "workouts",
  COUNT("powerSongAlbum") AS "songcount",
  AVG("distancekm") AS "avgkm",
  MAX("distancekm") AS "maxkm",
  MIN("distancekm") AS "minkm"
FROM "workouts"
GROUP BY "goalType"

… and create a set of ColdFusion components to transform that query into a MapReduce function that would run on MongoDB. Mapping basic SQL to MongoDB MapReduce isn’t too hard. Getting the SQL parser from Derby working within ColdFusion was significantly harder.

But I did it. This is the completely automated result:

db.runCommand({
mapreduce: "workouts",
map: function () { emit(
  this.goalType,
  {
    '_cfcount': 1,
    'distancekm_cfsum': isNaN(this.distancekm) ? null : this.distancekm,
    'distancekm_cfnum': isNaN(this.distancekm) ? 0 : 1,
    'powerSongAlbum_cfcount': (this.powerSongAlbum == null) ? 0 : 1,
    'distancekm_cfmax': isNaN(this.distancekm) ? null : this.distancekm,
    'distancekm_cfmin': isNaN(this.distancekm) ? null : this.distancekm
  }
); },
reduce: function (key,vals) {
  var ret = {
    'distancekm_cfmax': null,
    'distancekm_cfsum': null,
    'distancekm_cfmin': null,
    'distancekm_cfnum': 0,
    'powerSongAlbum_cfcount': 0,
    '_cfcount': 0
  };
  for(var i = 0; i < vals.length; i++) {
    var v = vals[i];
    ret['distancekm_cfnum'] += v['distancekm_cfnum'];
    if(!isNaN(v['distancekm_cfmax'])) ret['distancekm_cfmax'] = (ret['distancekm_cfmax'] == null) ? v['distancekm_cfmax'] : (ret['distancekm_cfmax'] > v['distancekm_cfmax']) ? ret['distancekm_cfmax'] : v['distancekm_cfmax'];
    ret['_cfcount'] += v['_cfcount'];
    if(!isNaN(v['distancekm_cfmin'])) ret['distancekm_cfmin'] = (ret['distancekm_cfmin'] == null) ? v['distancekm_cfmin'] : (v['distancekm_cfmin'] > ret['distancekm_cfmin']) ? ret['distancekm_cfmin'] : v['distancekm_cfmin'];
    ret['powerSongAlbum_cfcount'] += v['powerSongAlbum_cfcount'];
    if(!isNaN(v['distancekm_cfsum'])) ret['distancekm_cfsum'] = v['distancekm_cfsum'] + (ret['distancekm_cfsum'] == null ? 0 : ret['distancekm_cfsum']);
  }
  return ret;
},
finalize: function (key,val) {
  return {
    'totalkm'   : val['distancekm_cfsum'],
    'workouts'  : val['_cfcount'],
    'songcount' : val['powerSongAlbum_cfcount'],
    'avgkm'     : (isNaN(val['distancekm_cfnum']) || isNaN(val['distancekm_cfsum'])) ? null : val['distancekm_cfsum'] / val['distancekm_cfnum'],
    'maxkm'     : val['distancekm_cfmax'],
    'minkm'     : val['distancekm_cfmin']
  };
},
out: "s2mr",
verbose: true
});

And here’s the output when I run that against my MongoDB collection:

{ "_id" : null, "value" : {
    "totalkm" : 451.6752000000001,
    "workouts" : 54,
    "songcount" : 53,
    "avgkm" : 8.364355555555557,
    "maxkm" : 19.7502,
    "minkm" : 0.0194
} }
{ "_id" : "Distance", "value" : {
    "totalkm" : 304.76879999999994,
    "workouts" : 27,
    "songcount" : 27,
    "avgkm" : 11.287733333333332,
    "maxkm" : 26.2581,
    "minkm" : 4.0486
} }
{ "_id" : "Time", "value" : {
    "totalkm" : 19.221,
    "workouts" : 2,
    "songcount" : 2,
    "avgkm" : 9.6105,
    "maxkm" : 9.9224,
    "minkm" : 9.2986
} }

w00t!

I’ve left the code modular enough that I can make a CouchDB version almost as easily.

Update: Better yet, I can now get all the way to an actual query data type:

query
  AVGKM MAXKM MINKM SONGCOUNT TOTALKM WORKOUTS _ID
1 8.3644 19.7502 0.0194 53 451.6752 54 [empty string]
2 11.2877 26.2581 4.0486 27 304.7688 27 Distance
3 9.6105 9.9224 9.2986 2 19.221 2 Time

(Note for potential cynics: no, I’m not missing the point. This is not meant as a production tool—it’s a learning tool. If my students can start with SQL and see the end result in MapReduce, then they have that much better chance of grokking all of it.)