Yes, Virginia, that’s automated SQL to MongoDB MapReduce
Feb 19
I have to admit, I’m pretty darn proud of this one. This is a righteous hack. I can now write SQL against MongoDB with ColdFusion.
My project today was to take this SQL statement:
SELECT "goalType",
SUM("distancekm") AS "totalkm",
COUNT(*) AS "workouts",
COUNT("powerSongAlbum") AS "songcount",
AVG("distancekm") AS "avgkm",
MAX("distancekm") AS "maxkm",
MIN("distancekm") AS "minkm"
FROM "workouts"
GROUP BY "goalType"
… and create a set of ColdFusion components to transform that query into a MapReduce function that would run on MongoDB. Mapping basic SQL to MongoDB MapReduce isn’t too hard. Getting the SQL parser from Derby working within ColdFusion was significantly harder.
But I did it. This is the completely automated result:
db.runCommand({
mapreduce: "workouts",
map: function () { emit(
this.goalType,
{
'_cfcount': 1,
'distancekm_cfsum': isNaN(this.distancekm) ? null : this.distancekm,
'distancekm_cfnum': isNaN(this.distancekm) ? 0 : 1,
'powerSongAlbum_cfcount': (this.powerSongAlbum == null) ? 0 : 1,
'distancekm_cfmax': isNaN(this.distancekm) ? null : this.distancekm,
'distancekm_cfmin': isNaN(this.distancekm) ? null : this.distancekm
}
); },
reduce: function (key,vals) {
var ret = {
'distancekm_cfmax': null,
'distancekm_cfsum': null,
'distancekm_cfmin': null,
'distancekm_cfnum': 0,
'powerSongAlbum_cfcount': 0,
'_cfcount': 0
};
for(var i = 0; i < vals.length; i++) {
var v = vals[i];
ret['distancekm_cfnum'] += v['distancekm_cfnum'];
if(!isNaN(v['distancekm_cfmax'])) ret['distancekm_cfmax'] = (ret['distancekm_cfmax'] == null) ? v['distancekm_cfmax'] : (ret['distancekm_cfmax'] > v['distancekm_cfmax']) ? ret['distancekm_cfmax'] : v['distancekm_cfmax'];
ret['_cfcount'] += v['_cfcount'];
if(!isNaN(v['distancekm_cfmin'])) ret['distancekm_cfmin'] = (ret['distancekm_cfmin'] == null) ? v['distancekm_cfmin'] : (v['distancekm_cfmin'] > ret['distancekm_cfmin']) ? ret['distancekm_cfmin'] : v['distancekm_cfmin'];
ret['powerSongAlbum_cfcount'] += v['powerSongAlbum_cfcount'];
if(!isNaN(v['distancekm_cfsum'])) ret['distancekm_cfsum'] = v['distancekm_cfsum'] + (ret['distancekm_cfsum'] == null ? 0 : ret['distancekm_cfsum']);
}
return ret;
},
finalize: function (key,val) {
return {
'totalkm' : val['distancekm_cfsum'],
'workouts' : val['_cfcount'],
'songcount' : val['powerSongAlbum_cfcount'],
'avgkm' : (isNaN(val['distancekm_cfnum']) || isNaN(val['distancekm_cfsum'])) ? null : val['distancekm_cfsum'] / val['distancekm_cfnum'],
'maxkm' : val['distancekm_cfmax'],
'minkm' : val['distancekm_cfmin']
};
},
out: "s2mr",
verbose: true
});
And here’s the output when I run that against my MongoDB collection:
{ "_id" : null, "value" : {
"totalkm" : 451.6752000000001,
"workouts" : 54,
"songcount" : 53,
"avgkm" : 8.364355555555557,
"maxkm" : 19.7502,
"minkm" : 0.0194
} }
{ "_id" : "Distance", "value" : {
"totalkm" : 304.76879999999994,
"workouts" : 27,
"songcount" : 27,
"avgkm" : 11.287733333333332,
"maxkm" : 26.2581,
"minkm" : 4.0486
} }
{ "_id" : "Time", "value" : {
"totalkm" : 19.221,
"workouts" : 2,
"songcount" : 2,
"avgkm" : 9.6105,
"maxkm" : 9.9224,
"minkm" : 9.2986
} }
w00t!
I’ve left the code modular enough that I can make a CouchDB version almost as easily.
Update: Better yet, I can now get all the way to an actual query data type:
| query | |||||||
|---|---|---|---|---|---|---|---|
| AVGKM | MAXKM | MINKM | SONGCOUNT | TOTALKM | WORKOUTS | _ID | |
| 1 | 8.3644 | 19.7502 | 0.0194 | 53 | 451.6752 | 54 | [empty string] |
| 2 | 11.2877 | 26.2581 | 4.0486 | 27 | 304.7688 | 27 | Distance |
| 3 | 9.6105 | 9.9224 | 9.2986 | 2 | 19.221 | 2 | Time |
(Note for potential cynics: no, I’m not missing the point. This is not meant as a production tool—it’s a learning tool. If my students can start with SQL and see the end result in MapReduce, then they have that much better chance of grokking all of it.)