n-gram prediction project

I recently finished the Data Science Specialization that Coursera offers in partnership with Johns Hopkins. While I will admit that some of the initial offerings of the courses were pretty rough, it’s still an excellent series of nine courses and a capstone project.

The capstone project was about n-gram prediction: Given a text corpus, train a model that can generate reasonable predictions for the next word when supplied with one or more preceding words. That is, the model could be given text like “I like ice” and produce predictions such as “cream”, “skating”, etc.

I thought I’d document the iterations I went through for the project and what I learned.


Step 1: Bigram Predictor

Implementation Details: The original version of this bigram predictor was written in Java, and then rewritten in JavaScript/NodeJS to check the viability of a faster development cycle. The JavaScript implementation is far slower and less memory-efficient, so I switched back to the Java implementation.

  1. Ingest each line of the given text.
  2. For each consecutive word pair in the line, increment a count of the number of times the second word directly follows the second. That is, keep a large list of all of the words that have been seen, and which words follow them.
  3. For predictions, take the last word of the input (such as “ice” above), look it up in the list, and then find the word that most often follows it.

This produced objects in memory that looked like:

"ice": {          // given this word
    "cream": 1,   // we saw "ice cream" once
    "skating": 2  // and "ice skating" twice

Step 2: Bigram + Initial Letter Predictor

It turns out that bigram prediction isn’t very accurate, as you have only a single word of context. While “I like ice cream” and “I like ice skating” are probably equally likely, something like “I like ice storms” probably isn’t as likely.

The initial bigram predictor showed an accuracy of ~13%. I put some thought into how SwiftKey works: it updates predictions as you type. My next idea was to add partial-match searches to the bigram predictor. That is, if you gave the predictor “ice” and the letter “s”, it would narrow its search to following words that began with “s”, such as “skating” or “storms”, but not “cream”.

As expected, this significantly improved the accuracy of the predictor, with each additional letter of the word partial adding between 10% and 15% accuracy, but dropping off rapidly after 3 or 4 letters.

Step 3: Dictionary Predictor

I realized that the partial-letter support in the bigram predictor was good, but it relied on the word pair having been seen before. But if the initial letter(s) of the word partial were used by themselves, they could often make a prediction when the bigram predictor could not.

  1. Ingest each line of the given text.
  2. For each word in the text, keep track of how often it’s been seen.
  3. Create an index of initial letters of words, and how often their words are seen. For example, “to” might lead to “today”, “tomorrow”, “together”, or even just “to”.
  4. When given the partial for a word, look it up in the index and return the most-seen word.

The dictionary predictor could also ingest an actual dictionary instead of learning solely from the corpus. This would allow for possibilities such as misspelling detection and correction. While I eventually did use a word whitelist as a filter, I did not tackle spelling correction.

Step 4: Weighted Bigram + Dictionary Predictors

Once I had two predictors, I analyzed their accuracy and created a meta-model that used both. The meta-model obtained predictions from each, weighted the value of the predictions based on tested accuracy, and returned the most likely prediction. Again, this significantly improved the accuracy of predictions.

Tuning the weights is an interesting challenge. As I initially implemented it, weights were a simple multiplier and do not depend on the input. However, it would likely improve accuracy if the weights used the context of the given information. For example, the weight of the dictionary predictor might be increased as the number of letters in the partial increased.

Step 5: Trigram Predictor

I realized that a trigram predictor was really just an index where each item was a bigram predictor. That is, “to” in the index would point to a bigram predictor that would have its own list with words such as “the”, “me”, or “there”. I implemented it this way, and found that its overall accuracy was roughly equivalent to that of the bigram predictor. I modified the weights so that if the trigram predictor had a prediction it would be used, then the bigram’s prediction, then the dictionary’s prediction.

Step 6: Storage Engines

Storage started to become an issue: working with in-memory objects was very fast, but it meant I was limited to the 12GB of RAM in my workstation. I found that the relationship between words learned and accuracy of the model dropped off sharply: an order of magnitude increase in the number of words learned might only yield a few extra percentage points of accuracy. By my estimation, I’d need to learn over a million phrases for the accuracy to get to the point where the model would be considered usable.

This led to a huge refactoring of the code to use interfaces, dependency injection, and pluggable components.

Step 7: Relational DB with MariaDB

I created tables for each of the predictors: dictionary, bigram, and trigram. Each just had words and a count, along with the proper indexes. This slowed learning down by a factor of ~15x, but it meant that I could store far more words than I could in memory. However, I found that the speed of the model degraded too quickly for prediction performance. After a few million rows in a table, such as ('like', 'ice', 'cream', 17), read queries just weren’t very fast even with a simple query:

SELECT word3, seen
FROM trigram
WHERE (word1 = 'like')
  AND (word2 = 'ice')

Step 8: Redis

Redis is a key-value store that works by keeping its entire database in memory. I figured it would run into the same memory constraints as the naive solution, but I wanted to see how it performed. The Redis back-end required a very different data model, as Redis doesn’t have a concept of structured objects. I decided on something very simple: just store phrases as the key and values as the counts:

"like ice cream": 1
"like ice skating": 2
"ice storms": 1

I could then have Redis scan for keys that started with the phrase I had been given: "like ice *". Redis scans through keys very quickly, so this was one of the fastest solutions, both for learning and for predicting. However, it did eventually hit the memory limit and I had to move on.

Step 9: Flip it and reverse it

Before implementing the next storage engine, I had the crazy idea to start storing phrases backwards. One problem with most of the existing predictors and storage implementations was that they would have to try increasingly simpler phrases until a prediction was found. For example, maybe the original phrase given was "i never thought i would but i like ice". The sequence of attempts to find a prediction would look like:

`"i would but i like ice": (null)
"would but i like ice": (null)
"but i like ice": (null)
"i like ice": "skating"

But if I reversed the input and stored it backwards, I could walk along the sequence until I ran out of forwards context, then make my prediction there:

"ice like i but would i":
  "ice" →
    "like" →
      "i" →
        "but" (out of context)
← "skating"

This meant that I could truly move to an n-gram model where I could vary n to fit my needs. This led well to my next storage engine:

Step 10: MongoDB

MongoDB uses JavaScript objects as documents for its data store. Given that reverse storage idea, I could train my model via the following steps:

  1. Split the words into an array:
    ["i", "like", "ice", "cream"]
  2. Eliminate any stop (very common) words:
    ["like", "ice", "cream"]
  3. Reverse the order:
    ["cream", "ice", "like"]
  4. The first word, "cream", should be predicted for the second, "ice", as well as the sequence of ["ice", "like"] etc.
  5. Find or create a MongoDB document, a JavaScript object, with the key of "ice".
  6. Add a reference to "cream" as a prediction to a special object key that holds predictions and their frequency.
  7. Store "like" as a sub-key of the document, along with "cream" as a prediction under that.
  8. Recursively repeat the previous step until you have reached the end (beginning) of the sequence, or until you hit a defined n-gram length limit, such as 10.

For example, this document includes data learned from "i like ice cream" once, "i like ice skating" twice, and "ice storm" once.

{"_id": "ice",    // MongoDB Primary Key
  ">": {          // Predictions for after "ice"
    "cream": 1,   // seen once
    "skating": 2, // seen twice
    "storm": 1    // seen once
  "like": {       // next word in sequence
    ">": {        // predictions for "like ice"
      "cream": 1, // "like ice cream" seen once
      "skating": 2
      // note: "like ice storm" was not seen

Prediction is easy. Given "i like ice":

  1. Perform the same split & reverse: ["ice", "like"]
  2. Look up the document for "ice".
  3. Traverse the document’s sub-keys for words in the sequence until out of words or matches.
  4. Fetch the predictions (">" key) for the current depth.
  5. Sort by descending count, optionally filtering based on partial matches. For example, if given the first letter of the next word of "s" then "cream" could be eliminated, leaving "skating" and "storm".
  6. Return the top prediction(s).

This seemed to work out pretty well. It wasn’t learning as fast as Redis, but it was in the same ballpark and didn’t have the in-memory constraint. However, it eventually broke, too. MongoDB has a per-document limit of 16MB. While it seems crazy that a word might end up with 16MB of context, it’s entirely possible for very common words, such as “i”.

I think I could work around this by pruning such documents when they start to cause problems, but I have not yet implemented that.


It’s a fun little exercise, and I learned a ton while doing it. My implementation was kept intentionally naive, without any attempt to tackle parts of speech, misspellings, etc. But I really got to think about data structures, access patterns, and performance characteristics.

Should you be interested, my code for the project is up on GitHub: rickosborne/ngram-java.