Playing around with MongoDB and MapReduce functions

For my Advanced Database Structures course, I have a lecture on non-relational databases. Right now, it’s a little too abstract for my tastes—we talk mainly about OLAP and MDX, but it’s hard to give concrete examples without setting up an OLAP cube. We’re somewhat constrained by our learning environment: instead of using a server for the class, students all have their own laptops upon which they install any software they need to use. We use MySQL (MAMP, really) to teach the relational part of the course. I’ve thought about using Mondrian to show and teach MDX, but I haven’t been able to come up with a lecture and lesson plan simple enough to make sense yet powerful enough to teach the core concepts.

One of the things I’ve had nagging at the back of my brain is that OLAP may be a bit too abstract for the course. It’s great to show that not every database works the same way, but it’s hard to give the students practical examples of OLAP without starting with “first you get a job at a large company with a data warehouse full of millions of records …”. That is, it’s too far removed from where most of the students will be starting off.

As I have this month off—in the office instead of the classroom—I’ve been able to sit down and play around in the other direction: noSQL document databases such as MongoDB and Apache’s CouchDB. CouchDB is a little easier to set up and use … but MongoDB has been more stable for me.

Instead of SQL, both systems use a MapReduce paradigm for getting results. MapReduce works somewhat like the GROUP BY aggregation in SQL, but instead of writing declarative SQL statements and letting the database do the work for you, you write your own imperative functions. This is better shown with an example:

You’ve got an ecommerce store and a nice, relational database to handle the transactions. You’ve got tables for Customers, Orders, Order Lines, Addresses, Products, and so on. If you wanted to get the total dollars that you’ve sold of a particular product, you might write the following SQL:

SELECT SUM(price * quantity) AS SalesDollars
FROM OrderLines
    INNER JOIN Products ON (OrderLines.ProductID = Products.id)
WHERE (Products.Name = 'Depleted Uranium Slugs')

If you wanted to drill down a bit and see how many dollars you sold to each country, for that same product:

SELECT CountryCode, SUM(price * quantity) AS SalesDollars
FROM OrderLines
    INNER JOIN Products ON (OrderLines.ProductID = Products.id)
    INNER JOIN Orders ON (OrderLines.OrderID = Orders.id)
    INNER JOIN Addresses ON (Orders.ShipToAddressID = Addresses.id)
WHERE (Products.Name = 'Depleted Uranium Slugs')
GROUP BY CountryCode

Point being, as you start doing interesting things with the grouping and the math, the SQL gets trickier and trickier.

Document databases such as MongoDB are different. Pretty much everything is denormalized into a single collection, the rough equivalent of a table. Documents are the equivalent of rows, and they are most easily thought of as JSON objects:

{
    orderNumber:    2378,
    lineNumber:     4,
    productSKU:     '223179-X',
    productName:    'Depleted Uranium Slugs',
    productPrice:    499.95,
    productQuantity: 5,
    productDollars:  2499.75,
    shipCountry:    'US',
    shipRegion:     'FL',
    ...
}

As with SQL, you start with a query to narrow down the set of documents/rows you want to work with. This acts like a WHERE clause. But instead of SQL, you use something closer to an ORM/ActiveRecord pattern, which again looks like JSON:

{
    productName: 'Depleted Uranium Slugs'
}

You then layer on two functions: a map and a reduce. Your map function extracts what the GROUP BY clause would need: the unique combination of attributes/columns that you want to group by. It uses the built-in magic emit() function to return that key, plus any values needed for later math. In MongoDB, the map function doesn’t take any arguments; instead, you work with the this keyword:

function () {
    emit(this.shipCountry, this.productDollars);
}

Your reduce function does any math you need, acting on the values produced by the emit calls in your map function.

function (key, values) {
    var dollars = 0.0;
    for (var i = 0; i < values.length; i++) {
        dollars += values[i];
    }
    return dollars;
}

Executing the combination of these three yields a JSON result:

[
    { '_id': 'US', 'value': 24497.55 },
    { '_id': 'CA', 'value':  1499.85 },
    ...
]
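To see how those three pieces fit together, the whole pipeline can be sketched in plain JavaScript, outside of MongoDB entirely. The sample documents below are invented for illustration; emit just collects key/value pairs into groups, and reduce collapses each group:

```javascript
// Plain-JavaScript sketch of what MongoDB does with the query,
// map, and reduce above. The documents are made-up sample data.
var docs = [
    { productName: 'Depleted Uranium Slugs', shipCountry: 'US', productDollars: 2499.75 },
    { productName: 'Depleted Uranium Slugs', shipCountry: 'US', productDollars:  499.95 },
    { productName: 'Depleted Uranium Slugs', shipCountry: 'CA', productDollars: 1499.85 },
    { productName: 'Rubber Bands',           shipCountry: 'US', productDollars:    3.50 }
];

var groups = {};  // emitted values, grouped by key
function emit(key, value) {
    var k = JSON.stringify(key);
    (groups[k] = groups[k] || []).push(value);
}

function map() {
    emit(this.shipCountry, this.productDollars);
}

function reduce(key, values) {
    var dollars = 0.0;
    for (var i = 0; i < values.length; i++) {
        dollars += values[i];
    }
    return dollars;
}

// 1. The query keeps only matching documents, like a WHERE clause.
var matched = docs.filter(function (d) {
    return d.productName === 'Depleted Uranium Slugs';
});
// 2. map is called once per document, with the document as `this`.
matched.forEach(function (d) { map.call(d); });
// 3. reduce is called once per distinct key.
var results = Object.keys(groups).map(function (k) {
    return { _id: JSON.parse(k), value: reduce(k, groups[k]) };
});
// results now groups dollars by country: US sums to roughly 2999.70,
// CA to roughly 1499.85.
```

In the mongo shell itself, the real invocation would look something like db.orderLines.mapReduce(map, reduce, { query: { productName: 'Depleted Uranium Slugs' }, out: { inline: 1 } }); the collection name orderLines here is my own assumption.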

If we wanted to do a more complex grouping, we’d extend our map function to return an object key instead of a simple value:

function () {
    emit(
        {
            countryCode: this.shipCountry,
            regionCode: this.shipRegion
        },
        this.productDollars
    );
}

Or, if we wanted to aggregate more columns, such as getting the cost dollars to calculate a margin:

function () {
    emit(
        this.shipCountry,
        {
            price: this.productDollars,
            cost:  this.productCost
        }
    );
}

For the latter case, we’d also need to change our reduce function to account for the more complex data:

function (key, values) {
    var price = 0.0, cost = 0.0, margin = 0.0, marginPercent = 0.0;
    for (var i = 0; i < values.length; i++) {
        price += values[i].price;
        cost += values[i].cost;
    }
    margin = price - cost;
    marginPercent = margin / price;
    return {
        price:         price,
        cost:          cost,
        margin:        margin,
        marginPercent: marginPercent
    };
}
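One caveat worth knowing: MongoDB makes no promise to call reduce only once per key. It may reduce partial batches of values and feed those earlier results back into a later reduce call, so a reduce function has to be able to accept its own output as input. The margin reduce above has that property, which a quick plain-JavaScript check (with made-up numbers) can demonstrate:

```javascript
// The margin-calculating reduce from above, unchanged in substance.
function reduce(key, values) {
    var price = 0.0, cost = 0.0;
    for (var i = 0; i < values.length; i++) {
        price += values[i].price;
        cost += values[i].cost;
    }
    var margin = price - cost;
    return { price: price, cost: cost, margin: margin, marginPercent: margin / price };
}

// Made-up emitted values for a single key.
var values = [
    { price: 100, cost: 60 },
    { price: 200, cost: 110 },
    { price: 50,  cost: 20 }
];

// Reducing everything at once...
var allAtOnce = reduce('US', values);
// ...must match reducing a partial batch first, then re-reducing
// that result together with the leftover value. The extra margin
// and marginPercent fields on the partial result are simply ignored.
var partial = reduce('US', values.slice(0, 2));
var reReduced = reduce('US', [partial, values[2]]);
// allAtOnce.margin === reReduced.margin  (160 both ways)
```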

It seems kind of silly to go through all this, right? SQL does all of this with much less complexity. However, this approach has some huge advantages over SQL:

  1. Programmers who don’t know SQL or relational theory may find it easier to understand and quicker to start using. (Newbies especially, such as my students.)
  2. The map and reduce functions can be heavily parallelized on commodity hardware.

It’s really that second one that is the key. If you think about how those two functions are written, each of them could be applied to half of your dataset independently of the other half, then combined at the end with a few more calls to your reduce function. That’s the big deal about MapReduce: your dataset can be divided across dozens or hundreds of servers without you having to worry about what is where.
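To make that split-and-combine idea concrete, here is the simple dollar-summing reduce from earlier, applied the way a cluster would apply it (the numbers are invented): each server reduces its own share of the emitted values in parallel, and one final reduce combines the partial results.

```javascript
// The dollar-summing reduce from earlier in the post.
function reduce(key, values) {
    var dollars = 0.0;
    for (var i = 0; i < values.length; i++) {
        dollars += values[i];
    }
    return dollars;
}

// Invented emitted values for one key, split across two servers.
var serverA = [100, 250, 75];
var serverB = [300, 25];

// Each server reduces its own share, independently...
var partialA = reduce('US', serverA);  // 425
var partialB = reduce('US', serverB);  // 325
// ...and one more reduce call combines the partial results.
var combined = reduce('US', [partialA, partialB]);  // 750
// Same answer as reducing everything in one place:
var allAtOnce = reduce('US', serverA.concat(serverB));  // 750
```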

Of course, SQL databases and tables can be partitioned across multiple servers. However, that’s generally the realm of high-end database systems, out of the range of small budgets. This new breed of noSQL databases is built to be horizontally scalable out of the box.

It’s an interesting direction for databases. It’s not a panacea, and there will be plenty of places where relational databases still make more sense. But, for some applications like data warehousing and analytics, it may be a better option than what’s available at the same (free) price range.

Published by

Rick Osborne

I am a web geek who has been doing this sort of thing entirely too long. I rant, I muse, I whine. That is, I am not at all atypical for my breed.