I have to admit, I’m pretty darn proud of this one. This is a righteous hack. I can now write SQL against MongoDB with ColdFusion.
My project today was to take this SQL statement:
SELECT "goalType", SUM("distancekm") AS "totalkm", COUNT(*) AS "workouts", COUNT("powerSongAlbum") AS "songcount", AVG("distancekm") AS "avgkm", MAX("distancekm") AS "maxkm", MIN("distancekm") AS "minkm" FROM "workouts" GROUP BY "goalType"
… and create a set of ColdFusion components to transform that query into a MapReduce function that would run on MongoDB. Mapping basic SQL to MongoDB MapReduce isn’t too hard. Getting the SQL parser from Derby working within ColdFusion was significantly harder.
But I did it. This is the completely automated result:
db.runCommand({ mapreduce: "workouts", map: function () { emit( this.goalType, { '_cfcount': 1, 'distancekm_cfsum': isNaN(this.distancekm) ? null : this.distancekm, 'distancekm_cfnum': isNaN(this.distancekm) ? 0 : 1, 'powerSongAlbum_cfcount': (this.powerSongAlbum == null) ? 0 : 1, 'distancekm_cfmax': isNaN(this.distancekm) ? null : this.distancekm, 'distancekm_cfmin': isNaN(this.distancekm) ? null : this.distancekm } ); }, reduce: function (key,vals) { var ret = { 'distancekm_cfmax': null, 'distancekm_cfsum': null, 'distancekm_cfmin': null, 'distancekm_cfnum': 0, 'powerSongAlbum_cfcount': 0, '_cfcount': 0 }; for(var i = 0; i < vals.length; i++) { var v = vals[i]; ret['distancekm_cfnum'] += v['distancekm_cfnum']; if(!isNaN(v['distancekm_cfmax'])) ret['distancekm_cfmax'] = (ret['distancekm_cfmax'] == null) ? v['distancekm_cfmax'] : (ret['distancekm_cfmax'] > v['distancekm_cfmax']) ? ret['distancekm_cfmax'] : v['distancekm_cfmax']; ret['_cfcount'] += v['_cfcount']; if(!isNaN(v['distancekm_cfmin'])) ret['distancekm_cfmin'] = (ret['distancekm_cfmin'] == null) ? v['distancekm_cfmin'] : (v['distancekm_cfmin'] > ret['distancekm_cfmin']) ? ret['distancekm_cfmin'] : v['distancekm_cfmin']; ret['powerSongAlbum_cfcount'] += v['powerSongAlbum_cfcount']; if(!isNaN(v['distancekm_cfsum'])) ret['distancekm_cfsum'] = v['distancekm_cfsum'] + (ret['distancekm_cfsum'] == null ? 0 : ret['distancekm_cfsum']); } return ret; }, finalize: function (key,val) { return { 'totalkm' : val['distancekm_cfsum'], 'workouts' : val['_cfcount'], 'songcount' : val['powerSongAlbum_cfcount'], 'avgkm' : (isNaN(val['distancekm_cfnum']) || isNaN(val['distancekm_cfsum'])) ? null : val['distancekm_cfsum'] / val['distancekm_cfnum'], 'maxkm' : val['distancekm_cfmax'], 'minkm' : val['distancekm_cfmin'] }; }, out: "s2mr", verbose: true });
And here’s the output when I run that against my MongoDB collection:
{ "_id" : null, "value" : { "totalkm" : 451.6752000000001, "workouts" : 54, "songcount" : 53, "avgkm" : 8.364355555555557, "maxkm" : 19.7502, "minkm" : 0.0194 } } { "_id" : "Distance", "value" : { "totalkm" : 304.76879999999994, "workouts" : 27, "songcount" : 27, "avgkm" : 11.287733333333332, "maxkm" : 26.2581, "minkm" : 4.0486 } } { "_id" : "Time", "value" : { "totalkm" : 19.221, "workouts" : 2, "songcount" : 2, "avgkm" : 9.6105, "maxkm" : 9.9224, "minkm" : 9.2986 } }
w00t!
I’ve left the code modular enough that I can make a CouchDB version almost as easily.
Update: Better yet, I can now get all the way to an actual query data type:
query | |||||||
---|---|---|---|---|---|---|---|
AVGKM | MAXKM | MINKM | SONGCOUNT | TOTALKM | WORKOUTS | _ID | |
1 | 8.3644 | 19.7502 | 0.0194 | 53 | 451.6752 | 54 | [empty string] |
2 | 11.2877 | 26.2581 | 4.0486 | 27 | 304.7688 | 27 | Distance |
3 | 9.6105 | 9.9224 | 9.2986 | 2 | 19.221 | 2 | Time |
(Note for potential cynics: no, I’m not missing the point. This is not meant as a production tool—it’s a learning tool. If my students can start with SQL and see the end result in MapReduce, then they have that much better chance of grokking all of it.)