Yes, Virginia, that’s automated SQL to MongoDB MapReduce

I have to admit, I’m pretty darn proud of this one. This is a righteous hack. I can now write SQL against MongoDB with ColdFusion.

My project today was to take this SQL statement:

SELECT "goalType",
  SUM("distancekm") AS "totalkm",
  COUNT(*) AS "workouts",
  COUNT("powerSongAlbum") AS "songcount",
  AVG("distancekm") AS "avgkm",
  MAX("distancekm") AS "maxkm",
  MIN("distancekm") AS "minkm"
FROM "workouts"
GROUP BY "goalType"

… and create a set of ColdFusion components to transform that query into a MapReduce function that would run on MongoDB. Mapping basic SQL to MongoDB MapReduce isn’t too hard. Getting the SQL parser from Derby working within ColdFusion was significantly harder.

But I did it. This is the completely automated result:

db.runCommand({
mapreduce: "workouts",
map: function () { emit(
  this.goalType,
  {
    '_cfcount': 1,
    'distancekm_cfsum': isNaN(this.distancekm) ? null : this.distancekm,
    'distancekm_cfnum': isNaN(this.distancekm) ? 0 : 1,
    'powerSongAlbum_cfcount': (this.powerSongAlbum == null) ? 0 : 1,
    'distancekm_cfmax': isNaN(this.distancekm) ? null : this.distancekm,
    'distancekm_cfmin': isNaN(this.distancekm) ? null : this.distancekm
  }
); },
reduce: function (key,vals) {
  var ret = {
    'distancekm_cfmax': null,
    'distancekm_cfsum': null,
    'distancekm_cfmin': null,
    'distancekm_cfnum': 0,
    'powerSongAlbum_cfcount': 0,
    '_cfcount': 0
  };
  for(var i = 0; i < vals.length; i++) {
    var v = vals[i];
    ret['distancekm_cfnum'] += v['distancekm_cfnum'];
    if(!isNaN(v['distancekm_cfmax'])) ret['distancekm_cfmax'] = (ret['distancekm_cfmax'] == null) ? v['distancekm_cfmax'] : (ret['distancekm_cfmax'] > v['distancekm_cfmax']) ? ret['distancekm_cfmax'] : v['distancekm_cfmax'];
    ret['_cfcount'] += v['_cfcount'];
    if(!isNaN(v['distancekm_cfmin'])) ret['distancekm_cfmin'] = (ret['distancekm_cfmin'] == null) ? v['distancekm_cfmin'] : (v['distancekm_cfmin'] > ret['distancekm_cfmin']) ? ret['distancekm_cfmin'] : v['distancekm_cfmin'];
    ret['powerSongAlbum_cfcount'] += v['powerSongAlbum_cfcount'];
    if(!isNaN(v['distancekm_cfsum'])) ret['distancekm_cfsum'] = v['distancekm_cfsum'] + (ret['distancekm_cfsum'] == null ? 0 : ret['distancekm_cfsum']);
  }
  return ret;
},
finalize: function (key,val) {
  return {
    'totalkm'   : val['distancekm_cfsum'],
    'workouts'  : val['_cfcount'],
    'songcount' : val['powerSongAlbum_cfcount'],
    'avgkm'     : (isNaN(val['distancekm_cfnum']) || isNaN(val['distancekm_cfsum'])) ? null : val['distancekm_cfsum'] / val['distancekm_cfnum'],
    'maxkm'     : val['distancekm_cfmax'],
    'minkm'     : val['distancekm_cfmin']
  };
},
out: "s2mr",
verbose: true
});

And here’s the output when I run that against my MongoDB collection:

{ "_id" : null, "value" : {
    "totalkm" : 451.6752000000001,
    "workouts" : 54,
    "songcount" : 53,
    "avgkm" : 8.364355555555557,
    "maxkm" : 19.7502,
    "minkm" : 0.0194
} }
{ "_id" : "Distance", "value" : {
    "totalkm" : 304.76879999999994,
    "workouts" : 27,
    "songcount" : 27,
    "avgkm" : 11.287733333333332,
    "maxkm" : 26.2581,
    "minkm" : 4.0486
} }
{ "_id" : "Time", "value" : {
    "totalkm" : 19.221,
    "workouts" : 2,
    "songcount" : 2,
    "avgkm" : 9.6105,
    "maxkm" : 9.9224,
    "minkm" : 9.2986
} }

w00t!

I’ve left the code modular enough that I can make a CouchDB version almost as easily.

Update: Better yet, I can now get all the way to an actual query data type:

query
  AVGKM MAXKM MINKM SONGCOUNT TOTALKM WORKOUTS _ID
1 8.3644 19.7502 0.0194 53 451.6752 54 [empty string]
2 11.2877 26.2581 4.0486 27 304.7688 27 Distance
3 9.6105 9.9224 9.2986 2 19.221 2 Time

(Note for potential cynics: no, I’m not missing the point. This is not meant as a production tool—it’s a learning tool. If my students can start with SQL and see the end result in MapReduce, then they have that much better chance of grokking all of it.)

Published by

Rick Osborne

I am a web geek who has been doing this sort of thing entirely too long. I rant, I muse, I whine. That is, I am not at all atypical for my breed.