josh's site

An over-engineered nytimes spelling bee solver

10/24 - 2021

Over the past year, I’ve gotten really into the nytimes daily spelling-bee. The premise of the game is quite simple. There are seven hexagonal tiles that form a honeycomb, each containing a letter. Your goal is to make as many words as possible such that

Every word must contain the central letter
The word only uses the seven given letters
The word must be at least four letters long
There are no restrictions on how many times each letter may be used.

This got me thinking about writing a program to solve the spelling-bee. While this problem isn’t trivial to solve, it also isn’t extremely difficult. But I thought it would be a fun exercise to try a couple unique approaches and compare their performance. Namely, I thought of a fairly simple solution which used many of the bit-manipulation techniques I learned while writing my chess engine. This will be a blog post detailing my different solutions and comparing their performance.

A level playing ground

To make comparisons between the different implementations fair, we’re going to give each of them the same input and count any initialization time against it’s runtime. I’ve decided to implement this in Rust, and so each solver must implement the following trait:

trait GridSolver {
    fn solve(&self, problem: &ProblemStatement) -> Vec<String>;
};

struct ProblemStatement {
    center: u8, // letters are 8-bit ascii
    others: Vec<u8>,
    dictionary: Vec<String>
};

Approach 1 - The “naive” solution

I’m first going to write what some may consider a “naive” solution. While there are some glaring inefficiencies, it’s also very simple to implement, meaning that we can write it very quickly and have high confidence that our solution is correct. This is often times a good way to develop algorithms - start with the “naive” implementation and use it to check your work on more complicated approaches.

The pseudo-code will look something like this:

func valid_words(d: Dictionary, c: CenterLetter, o: OtherLetters):
    return d.filter(w -> is_valid_word(w, c, o))

func is_valid_word(w: Word, c: CenterLetter, o: OtherLetters):
    if !w.contains(c):
        return false

    for each letter l in w:
        if l is not c or in o:
            return false

    return true

The code mostly speaks for itself - we iterate over each word and each letter in those words. If the word doesn’t contain the center letter, we know it’s invalid. Furthermore, if the word contains letters which aren’t the center or surrounding letters, it’s also invalid.

From the complexity standpoint, our program runs in \(O(n l)\) where \(n\) is the number of words in the dictionary and \(l\) is the length of the longest word in the dictionary. Note that since OtherLetters is always a constant length, looking up each letter as valid or not is a constant time operation. (If we wanted to be more general, we would say our algorithm runs in \(O(n l o)\), where \(o\) is the length of OtherLetters).

My final implementation ended up looking something like this:

impl GridSolver for NaiveSolver {
    fn solve(&self, problem: &ProblemStatement) -> Vec<String> {
        let v = problem.dictionary.clone();

        v.into_iter()
            .filter(|s| NaiveSolver::is_legal(s, problem.center, problem.others.clone()))
            .collect()
    }

    fn name(&self) -> String {
        "Naive Solver".to_string()
    }
}

impl NaiveSolver {
    fn is_legal(s: &str, center: u8, others: Vec<u8>) -> bool {
        if !s.contains(center as char) {
            return false;
        }

        for c in s.as_bytes() {
            if !others.contains(c) && *c != center {
                return false;
            }
        }

        true
    }
}

Approach 2 - A set-based approach

The second approach makes on slight “optimization” on the first approach - can we reduce the lookup time for whether or not a given letter is valid. Looking up characters in a vector is going to be slower than looking up characters in a set - especially a set which uses some sort of hashing implementation. The question, however, is whether or not the added overhead of making such a set noticeably affects runtime. Its worth pointing out that the size of the problem is relatively small - most words are short (just a few characters), and the total number of available words is on the scale of hundreds of thousands (using the 2019 Scrabble championship dictionary).

Thus, the only difference between the set-based implementation and the naive implementation would be the following addition to our valid_words function.

func valid_words(d: Dictionary, c: CenterLetter, o: OtherLetters):
    let o = HashSet(o);
    return d.filter(w -> is_valid_word(w, c, o))

Asymptotically, we expect the same runtime since our lookup is still constant time. In practice, however, we might expect some performance gains since we avoid any unnecessary iteration over the allowed characters.

impl GridSolver for SetSolver {
    fn solve(&self, problem: &ProblemStatement) -> Vec<String> {
        let v = problem.dictionary.clone();
        let set: HashSet<u8> = HashSet::from_iter(problem.others.iter().cloned());

        v.into_iter()
            .filter(|s| SetSolver::is_legal(s, problem.center, &set))
            .collect()
    }

    fn name(&self) -> String {
        "Set Solver".to_string()
    }
}

impl SetSolver {
    fn is_legal(s: &str, center: u8, others: &HashSet<u8>) -> bool {
        if !s.contains(center as char) {
            return false;
        }

        for c in s.as_bytes() {
            if !others.contains(c) && *c != center {
                return false;
            }
        }

        true
    }
}

Approach 3 - Using a bitvector

This approach is the main reason I wrote this article - mostly because its a really niche solution to a problem with the given constraints. In the previous solution, I used the HashSet from Rust’s standard library. But what if we could make a really small and efficient hash set using our knowledge of the problem. There are a few key observations to make:

There are only 26 possible characters (casing doesn’t matter)
Letters may be used as many times as we like

Using this information, we can create a very lightweight set of numbers using only a single 32-bit integer. Our set will work like this - the first 26 bits in our integer will represent the presence of a certain letter at it’s respective index. So, for example, if the allowed letters were abcdefg, then our bitvector would simple have 1’s in the first seven indices and be zero everywhere else. We can easily make a bitvector for a given word and compare the two using a bitwise AND.

# Bitvector for the bee [c] f i l o a y
10000100100100100000000010

# Bitvector for the word "coal"
10100000000100100000000000

In other words, our word is only valid if it’s 1’s line up with available 1’s in the bee’s bitvector. Of course, we’ll have to make a special case for the center letter, but this is the basic idea.

In theory, this should simplify the number of bit operations that we need to do, since we don’t have to use any sort of generalized hashing algorithm for our set. Will it be any faster in practice? Let’s find out.

type BitVector = u32;

pub struct BitSolver;

impl GridSolver for BitSolver {
    fn solve(&self, problem: &ProblemStatement) -> Vec<String> {
        let v = problem.dictionary.clone();

        let center_mask = BitSolver::from_char(problem.center);
        let other_mask = BitSolver::calc_mask(&problem.others);

        v.into_iter()
            .filter(|s| BitSolver::is_legal(s, center_mask, other_mask))
            .collect()
    }

    fn name(&self) -> String {
        "Bit Vector Solver".to_string()
    }
}

impl BitSolver {
    fn calc_mask(chars: &Vec<u8>) -> BitVector {
        chars
            .into_iter()
            .map(|c| BitSolver::from_char(*c))
            .reduce(|a, b| a | b)
            .unwrap()
    }

    fn from_str(s: &str) -> BitVector {
        s.as_bytes()
            .into_iter()
            .map(|c| BitSolver::from_char(*c))
            .reduce(|a, b| a | b)
            .unwrap()
    }

    fn from_char(char: u8) -> BitVector {
        1 << ((char as u32) - 97)
    }

    fn is_legal(s: &str, center_mask: BitVector, other_mask: BitVector) -> bool {
        let b = BitSolver::from_str(s);
        if b & center_mask == 0 {
            return false;
        }

        b & !(center_mask | other_mask) == 0
    }
}

Approach 4 - Using regular expressions

A few days after I wrote my bitvector implementation it occurred to me that this problem can also be solve relatively simply using regular expressions. After all, regular expressions match words based on some criteria and are optimized for this exact task. Much like the naive solution, regular expressions are also fairly easy to implement given that they’re so declarative. As such, this would serve as a good baseline test for verifying the correctness of other algorithms.

pub struct RegexSolver;

impl GridSolver for RegexSolver {
    fn solve(&self, problem: &ProblemStatement) -> Vec<String> {
        let v = problem.dictionary.clone();

        let fo = String::from_utf8(problem.others.clone()).unwrap();
        let formatted = format!(
            "^[{}{}]*{}[{}{}]*$",
            fo, problem.center as char, problem.center as char, fo, problem.center as char
        );
        let re = Regex::new(&formatted).unwrap();

        v.into_iter().filter(|s| re.is_match(s)).collect()
    }

    fn name(&self) -> String {
        "Regex Solver".to_string()
    }
}

Results

I ran the tests a few times using each solver to get accurate results and averaged them, achieving the following results:

Naive Solver
  Took        286 ms
Set Solver
  Took        243 ms
Bit Vector Solver
  Took        149 ms
Regex Solver
  Took        210 ms

The improvement between the naive and set implementations is to be expected, since set looks are going to be strictly faster than our array lookups and it adds virtually no overhead. Additionally, our bit vector performs quite well in comparison, showing how a domain-specific data structure can really help in terms of performance. To my surprise, the regex solver actually outperforms the set solver. In some ways this makes sense, given that regular expressions are designed to match strings, but it’s a testament to the regex library that it performs as well as it does.

Now, of course, any improvement in our algorithm is comically trivial. Given the nature of the problem there’s really no reason to make any optimizations beyond the initial algorithm. Nonetheless, I thought this was a very fun exercise in thinking of a fairly clever solution to a problem that really doesn’t need one.

If you’re interested in reading the source code, you can view it on github.