Apple M1 Performance Cycles

A short update: I just published my second crate (a Rust package): https://crates.io/crates/macos-perf

I hope it can be used to accurately measure the performance of Rust code on Apple M1 devices.

Please try it and send me your feedback.

Example

Running the example code I provide gives the following results:

sudo cargo run --quiet --example main
PerformanceCounters {
    cycles: 46148.1,
    branches: 23053.4,
    missed_branches: 2.4,
    instructions: 162342.9,
}

The output shows cycles, branches, missed branches and instructions. Unfortunately, none of this is documented by the vendor (Apple). However, Dougall Johnson wrote code that interacts with the undocumented macOS syscalls, and that is what my library uses.

The above output was generated by the below code, in case you’re interested:

// You need to run this with sudo.
use criterion::black_box;
use macos_perf::{init, timeit_loops};

fn main() {
    init().unwrap();
    let pc = timeit_loops! {10, {
        let n = black_box(1000);
        let _x = (0..n).fold(0, |a, b| a ^ b);
    }}
    .unwrap();
    println!("{:#?}", pc);
}
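To make the raw counters easier to read, here is a minimal sketch that derives two ratios from them: instructions per cycle, and the share of branches that were counted as missed. The `Counters` struct and its two methods are my own stand-ins for illustration and are not part of macos-perf; the numbers are copied from the example output above.

```rust
/// Local stand-in holding the printed counter values.
/// This is NOT the crate's own type, just a hypothetical mirror of its output.
#[derive(Debug, Clone, Copy)]
struct Counters {
    cycles: f64,
    branches: f64,
    missed_branches: f64,
    instructions: f64,
}

impl Counters {
    /// Instructions divided by cycles.
    fn ipc(&self) -> f64 {
        self.instructions / self.cycles
    }

    /// Missed branches divided by all counted branches.
    fn branch_miss_rate(&self) -> f64 {
        self.missed_branches / self.branches
    }
}

fn main() {
    // Values copied from the example output above.
    let c = Counters {
        cycles: 46148.1,
        branches: 23053.4,
        missed_branches: 2.4,
        instructions: 162342.9,
    };
    println!("IPC: {:.2}", c.ipc());
    println!("branch miss rate: {:.4}%", c.branch_miss_rate() * 100.0);
}
```

For this run that works out to roughly 3.5 instructions per cycle and a miss rate of about 0.01%, which fits a simple loop with a very predictable branch.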

Initial Measurements

Performance Counters and measured time are clearly correlated

I took the matrix multiplication benchmarking code that I already had and also ran the macos-perf counter code on it.

The resulting plot shows that the performance counters and the measured runtime are correlated.

But one program, with its own characteristic effects on branches, cache misses, etc., is not enough to build a good model that predicts runtime from the measured counters. So, I clearly have more work ahead of me.
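As a rough idea of what the simplest such model could look like, here is a sketch of an ordinary least-squares fit of runtime against a single counter (cycles). This is an assumption about one possible starting point, not the model I have built: `fit_line` is a hypothetical helper, a real model would likely need several counters and many programs, and the data below is synthetic placeholder data (exactly y = 0.5 * x + 10), not measurements.

```rust
/// Fit y ≈ slope * x + intercept by ordinary least squares.
/// Returns None if the slices differ in length, hold fewer than two points,
/// or all x values are identical (the slope is undefined in that case).
fn fit_line(xs: &[f64], ys: &[f64]) -> Option<(f64, f64)> {
    if xs.len() != ys.len() || xs.len() < 2 {
        return None;
    }
    let n = xs.len() as f64;
    let mean_x = xs.iter().sum::<f64>() / n;
    let mean_y = ys.iter().sum::<f64>() / n;

    // Covariance and variance sums (the common 1/n factor cancels out).
    let sxy: f64 = xs
        .iter()
        .zip(ys)
        .map(|(x, y)| (x - mean_x) * (y - mean_y))
        .sum();
    let sxx: f64 = xs.iter().map(|x| (x - mean_x).powi(2)).sum();
    if sxx == 0.0 {
        return None;
    }

    let slope = sxy / sxx;
    Some((slope, mean_y - slope * mean_x))
}

fn main() {
    // Synthetic placeholder data, NOT real measurements.
    let cycles = [1000.0, 2000.0, 3000.0, 4000.0];
    let runtime_ns = [510.0, 1010.0, 1510.0, 2010.0];

    if let Some((slope, intercept)) = fit_line(&cycles, &runtime_ns) {
        println!("runtime_ns ≈ {slope:.3} * cycles + {intercept:.3}");
    }
}
```

With real data, the interesting part would be how badly a fit like this does on programs it has not seen, which is exactly why a single benchmark is not enough.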

Conclusion

There’s no conclusion. This is only an intermediate step which I wanted to share.