Rust Word Count

Last week there were two posts on the reputation of C++. In a noble effort, Jussi Pakkanen showcased modern C++. Nothing fancy, just clean and readable code. Unfortunately, around ten years after C++11, the news that simple (and safe) C++ code is now the default has not spread everywhere.

Jason Moiron answered with Go code doing the same (article). Both programs find the ten most common words in all txt files, which can be recursively found in the current directory.

Both articles are much better than this one, so I urge you to read them first.

  1. Does C++ still deserve the bad rap it has had for so long?
  2. C++ Deserves Its Bad Reputation

I wanted to see what this would look like in Rust.

The final result can be found in this Gist.

Analysis

Stand-Alone

The C++ article mentioned that C++ is the only language where all necessary features are in the standard library. And looking at my Cargo.toml file, you’ll see that I use these two external dependencies.

[dependencies]
walkdir = "2.3.1"
anyhow = "1.0.33"

I would argue that having a healthy package collection is extremely important. Rust certainly has this and using external dependencies is a join.

However, I do see a certain irony in the fact that on anything larger, I would replace anyhow with color-eyre for more functionality (like .suggestion(...) on errors). And ignore::WalkBuilder can replace walkdir while providing inbuilt parallelism.

Question-Mark Operator

fn main() -> Result<()> { ... }

In Rust, the main function can optionally return a Result, which means that the excellent bubble-up-errors operator (?)1 can be used. It is Rust’s shorthand for returning early in case of an Error.

Type Interference

let mut counter = std::collections::BTreeMap::new();

This should be simple enough. Rust uses let to declare variables with type interference. To give the best of both worlds, it is also possible to provide type annotations, although this is normally only done where automatic type annotation fails2.

Note that everything is constant unless mut is provided. I must admit that I find that sometimes annoying; even though I wholeheartedly agree with it on principle.

Functional

The next part, is in my opinion, the biggest criticism of the readability and cleanliness of this code example.

for entry in WalkDir::new(".")
    .follow_links(true)
    .into_iter()
    .filter_map(|e| e.ok())
    .filter(|e| e.file_name().to_string_lossy().ends_with(".txt"))
{
    ...
}

My first criticism is that here we mix business-logic with boiler-plate. I am talking about .into_iter(). While you’ll become accustomed to this, since it is very common in Rust code, it is nothing we care about from a functionality side.

Next, all this Functional Programming style will take some getting used to for those of us with Imperative and Object Oriented backgrounds. I do believe that FP comes with paradigms that every programmer has to know these days3.

As an exercise, do you know what filter_map(|e| e.ok()) does? I did not, but my IDE told me on hovering above it. It is fancy code to say, remove all Error’ed elements.

Reading Words

let reader = BufReader::new(File::open(entry.into_path())?);
for line in reader.lines() {
    for word in line?.split_whitespace() {
        let w = word.to_lowercase().to_string();
        *counter.entry(w).or_insert(0) += 1;
    }
}

Note the two ? operators after the functions which can fail: File::open can fail and reading a line could fail. In case they fail, main returns early and the error message is shown to the user.

This is a pattern that takes some getting used to: *counter.entry(w).or_insert(0) += 1; We find counter.entry(w) or if that returns None (because the word has not been added yet), a 0 is inserted. The whole thing is dereferenced and incremented. Not difficult but definitely a bit of a stumbling block.

Sorting

The rest is straight forward and easy to understand in my eyes.

Like the other solutions, the HashMap is converted into a vector with tuple (word_count, word) as the elements. The vector is then sorted. 4

let mut tops: Vec<_> = counter.iter().map(|(w, c)| (c, w)).collect();
tops.sort_by_key(|(&c, w)| (std::cmp::Reverse(c), *w));

for (c, w) in tops.iter().take(10) {
    println!("{} - {}", w, c);
}

The only thing of note, is that I struggled writing the right key closure. tops.sort_by_key(|(&c, w)| (std::cmp::Reverse(c), *w)); Note, my use of &c and *w. Not only did I struggle with knowing where the references are but also with a lifetime error. IDE support definitely helped here5.

Summary

In conclusion, this was a fun little exercise. It is well known that Rust targets writing CLI apps. And it does so well on this word counting exercise.

I definitely do not think that Rust is any prettier than the C++ or the Go version. A cursory look tells me that it might be the shortest solution 6. But I doubt if pure line count should ever matter. When I first started reading Rust, I definitely had to get used to some of the concepts. And Rust is by no means an easy language. Yet, once you get used to it, writing something like this goes quickly and with confidence that the result is correct.

Update: Read the next post in this series, where I try it in Swift. And finally, read me going back and improving the original C++ version.


  1. Showing a disappointing lack of creativity it is just called the question mark operator. ↩︎

  2. A typical example is when creating a vector from a .collect() call. In that case, the compiler does not know into which container type the iterator should be collected. let v: Vec<_> = 0..10.iter().collect(); ↩︎

  3. This is written from an OO perspective. I am fully aware that FP has been practiced for decades by a large number of programmers. My point is: Everyone must know about map, filter and predicate-functions. ↩︎

  4. C++ has here the benefit of std::partial_sort, which neither Go nor Rust have out of the box. ↩︎

  5. Please let me know if there’s a more elegant way of sprinkling & and * on this. ↩︎

  6. Bad news for all of us paid by number of lines. ↩︎