Rust Word Count
Last week there were two posts on the reputation of C++. In a noble effort, Jussi Pakkanen showcased modern C++. Nothing fancy, just clean and readable code. Unfortunately, around ten years after C++11, the news that simple (and safe) C++ code is now the default has not spread everywhere.
Jason Moiron answered with Go code doing the same (article). Both programs find the ten most common words in all txt files, which can be recursively found in the current directory.
Both articles are much better than this one, so I urge you to read them first.
I wanted to see what this would look like in Rust.
The final result can be found in this Gist.
Analysis
Stand-Alone
The C++ article mentioned that C++ is the only language where all necessary
features are in the standard library. And looking at my Cargo.toml
file,
you’ll see that I use these two external dependencies.
[dependencies]
walkdir = "2.3.1"
anyhow = "1.0.33"
I would argue that having a healthy package collection is extremely important. Rust certainly has this and using external dependencies is a joy.
However, I do see a certain irony in the fact that on anything larger, I would
replace anyhow with color-eyre for more functionality (like
.suggestion(...)
on errors). And ignore::WalkBuilder can replace
walkdir while providing inbuilt parallelism.
Question-Mark Operator
fn main() -> Result<()> { ... }
In Rust, the main function can optionally return a Result
, which means that
the excellent bubble-up-errors operator (?
)1 can be used. It is Rust’s
shorthand for returning early in case of an Error.
Type Interference
let mut counter = std::collections::BTreeMap::new();
This should be simple enough. Rust uses let
to declare variables with type
interference. To give the best of both worlds, it is also possible to provide
type annotations, although this is normally only done where automatic type
annotation fails2.
Note that everything is constant unless mut
is provided. I must admit that I
find that sometimes annoying; even though I wholeheartedly agree with it on
principle.
Functional
The next part, is in my opinion, the biggest criticism of the readability and cleanliness of this code example.
for entry in WalkDir::new(".")
.follow_links(true)
.into_iter()
.filter_map(|e| e.ok())
.filter(|e| e.file_name().to_string_lossy().ends_with(".txt"))
{
...
}
My first criticism is that here we mix business-logic with boiler-plate. I am
talking about .into_iter()
. While you’ll become accustomed to this, since it
is very common in Rust code, it is nothing we care about from a functionality
side.
Next, all this Functional Programming style will take some getting used to for those of us with Imperative and Object Oriented backgrounds. I do believe that FP comes with paradigms that every programmer has to know these days3.
As an exercise, do you know what filter_map(|e| e.ok())
does? I did not, but
my IDE told me on hovering above it. It is fancy code to say, remove all
Error’ed elements.
Reading Words
let reader = BufReader::new(File::open(entry.into_path())?);
for line in reader.lines() {
for word in line?.split_whitespace() {
let w = word.to_lowercase().to_string();
*counter.entry(w).or_insert(0) += 1;
}
}
Note the two ?
operators after the functions which can fail: File::open
can
fail and reading a line
could fail. In case they fail, main returns early and
the error message is shown to the user.
This is a pattern that takes some getting used to:
*counter.entry(w).or_insert(0) += 1;
We find counter.entry(w)
or if that
returns None (because the word has not been added yet), a 0 is inserted. The
whole thing is dereferenced and incremented. Not difficult but definitely a bit
of a stumbling block.
Sorting
The rest is straight forward and easy to understand in my eyes.
Like the other solutions, the HashMap is converted into a vector with tuple
(word_count, word)
as the elements. The vector is then sorted. 4
let mut tops: Vec<_> = counter.iter().map(|(w, c)| (c, w)).collect();
tops.sort_by_key(|(&c, w)| (std::cmp::Reverse(c), *w));
for (c, w) in tops.iter().take(10) {
println!("{} - {}", w, c);
}
The only thing of note, is that I struggled writing the right key closure.
tops.sort_by_key(|(&c, w)| (std::cmp::Reverse(c), *w));
Note, my use of &c
and *w
. Not only did I struggle with knowing where the references are but also
with a lifetime error. IDE support definitely helped here5.
Summary
In conclusion, this was a fun little exercise. It is well known that Rust targets writing CLI apps. And it does so well on this word counting exercise.
I definitely do not think that Rust is any prettier than the C++ or the Go version. A cursory look tells me that it might be the shortest solution 6. But I doubt if pure line count should ever matter. When I first started reading Rust, I definitely had to get used to some of the concepts. And Rust is by no means an easy language. Yet, once you get used to it, writing something like this goes quickly and with confidence that the result is correct.
Update: Read the next post in this series, where I try it in Swift. And finally, read me going back and improving the original C++ version.
-
Showing a disappointing lack of creativity it is just called the question mark operator. ↩︎
-
A typical example is when creating a vector from a
.collect()
call. In that case, the compiler does not know into which container type the iterator should be collected.let v: Vec<_> = 0..10.iter().collect();
↩︎ -
This is written from an OO perspective. I am fully aware that FP has been practiced for decades by a large number of programmers. My point is: Everyone must know about
map
,filter
and predicate-functions. ↩︎ -
C++ has here the benefit of
std::partial_sort
, which neither Go nor Rust have out of the box. ↩︎ -
Please let me know if there’s a more elegant way of sprinkling
&
and*
on this. ↩︎ -
Bad news for all of us paid by number of lines. ↩︎