This article discusses some potential performance issues caused by CPU cache collisions.
In normal scenarios cache collisions don’t pose a problem; usually it is only in specific, high-speed
applications that they incur noticeable performance penalties, and as such, the things described
here should be considered “the last mile effort”.
As an example, I will use my laptop’s CPU, a 1.7 GHz Intel Core i5 that has a 32 kB 8-way L1 data cache per core.
address bits: | 0 - 5            | 6 - ... |
              | cacheline offset |         |

address bits: | 0 - 5            | 6 - 11          | 12 - ...                           |
              | cacheline offset | bucket selector | cacheline identifier within bucket |
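
The split above follows from the cache geometry: 32 kB divided by 8 ways and 64-byte cachelines gives 64 buckets, which is exactly what 6 selector bits can address. Here is a minimal sketch of that split in C. The constant and function names are my own for illustration, and the 64-byte cacheline size is inferred from the 6 offset bits rather than measured.

```c
#include <stdint.h>

/* Geometry assumed from the example CPU: 32 kB, 8-way, 64-byte cachelines. */
enum {
    CACHE_SIZE  = 32 * 1024,
    WAYS        = 8,
    LINE_SIZE   = 64,                              /* address bits 0-5  */
    NUM_BUCKETS = CACHE_SIZE / (WAYS * LINE_SIZE)  /* 64, address bits 6-11 */
};

/* Byte position inside the cacheline (address bits 0-5). */
static unsigned line_offset(uintptr_t addr)
{
    return (unsigned)(addr % LINE_SIZE);
}

/* Bucket the cacheline has to go into (address bits 6-11). */
static unsigned bucket_index(uintptr_t addr)
{
    return (unsigned)((addr / LINE_SIZE) % NUM_BUCKETS);
}

/* Identifier of the cacheline within its bucket (address bits 12 and up). */
static uintptr_t line_id(uintptr_t addr)
{
    return addr / ((uintptr_t)LINE_SIZE * NUM_BUCKETS);
}
```

Two addresses that differ by a multiple of 4096 get the same `bucket_index` and differ only in `line_id`, so they compete for the same 8 slots.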
To test the performance degradation I wrote a test C program
(full C source here)
that generates a number of vectors of pseudo-random integers, sums them up in a typical
parallel-optimized way, and estimates the resulting speed. The program takes a couple
of parameters from the command line so that various CPUs and scenarios can be tested.
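
To give an idea of what such a test looks like, here is a much simplified sketch. It is not the actual program: `sum_vectors`, `run_scenario` and their parameters are hypothetical names, the summing loop is plain scalar code, and timing with `clock()` is only a rough estimate. The idea is that all vectors live in one block, each starting `stride` bytes after the previous one, and are read at the same element index in turn.

```c
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Sum `count` vectors of `len` ints in lockstep. Vector v starts
 * `stride` bytes after vector v-1, so `stride` controls the alignment. */
static int64_t sum_vectors(const char *base, size_t count, size_t len, size_t stride)
{
    int64_t total = 0;
    for (size_t i = 0; i < len; i++) {          /* element index */
        for (size_t v = 0; v < count; v++) {    /* touch every vector */
            const int32_t *vec = (const int32_t *)(base + v * stride);
            total += vec[i];
        }
    }
    return total;
}

/* Fill the vectors with pseudo random integers, run `rounds` summing
 * passes and return the elapsed CPU time in seconds (-1.0 on bad input
 * or allocation failure). The grand total is stored in *checksum. */
static double run_scenario(size_t count, size_t len, size_t stride,
                           int rounds, int64_t *checksum)
{
    if (count == 0 || stride % sizeof(int32_t) != 0 ||
        stride < len * sizeof(int32_t))
        return -1.0;

    char *raw = malloc(count * stride + 4096);
    if (raw == NULL)
        return -1.0;
    /* Round up to a 4096-byte boundary so `stride` alone decides alignment. */
    char *base = (char *)(((uintptr_t)raw + 4095) & ~(uintptr_t)4095);

    srand(1);                                   /* fixed seed, repeatable data */
    for (size_t v = 0; v < count; v++) {
        int32_t *vec = (int32_t *)(base + v * stride);
        for (size_t i = 0; i < len; i++)
            vec[i] = rand() % 1000;
    }

    int64_t sum = 0;
    clock_t start = clock();
    for (int r = 0; r < rounds; r++)
        sum += sum_vectors(base, count, len, stride);
    double seconds = (double)(clock() - start) / CLOCKS_PER_SEC;

    free(raw);
    *checksum = sum;
    return seconds;
}
```

Calling `run_scenario` once with a stride of `4096` and once with `4096 + 64` compares an aligned layout against a staggered one. The checksums should match, so any difference in the returned time comes from the memory layout. An optimizing compiler may rearrange the loops, so a real measurement needs more care than this.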
Here are results of three test runs on my example CPU:
In this CPU, the L1 cache has 4 cycles of latency and the L2 cache has 12 cycles, hence
the performance drop to almost 1/3 when the alignment hit the N x 4096 condition: the CPU pretty
much fell back from L1 to L2. This is a synthetic example, and real-life applications may not be
affected this much, but I’ve seen applications lose 30-40% to this single factor.
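
The N x 4096 condition can be checked on paper with a small model. The sketch below counts how many vectors land in the same bucket when they are read at the same element index. It assumes the example geometry (64 buckets, 64-byte cachelines, 8 ways) and that the first vector starts on a 4096-byte boundary. The function names are hypothetical, and the model ignores everything else that competes for the cache.

```c
#include <stddef.h>

/* Geometry assumed from the example CPU's L1 data cache. */
enum { LINE_SIZE = 64, NUM_BUCKETS = 64, WAYS = 8 };

/* Largest number of vectors whose current element falls into one bucket
 * when vector v starts at byte offset v * stride. All vectors are read
 * at the same element index, so only the start offsets matter. */
static unsigned max_bucket_load(size_t count, size_t stride)
{
    unsigned load[NUM_BUCKETS] = {0};
    unsigned worst = 0;
    for (size_t v = 0; v < count; v++) {
        unsigned b = (unsigned)((v * stride / LINE_SIZE) % NUM_BUCKETS);
        if (++load[b] > worst)
            worst = load[b];
    }
    return worst;
}

/* More live cachelines in one bucket than there are ways means they
 * keep evicting each other from L1. */
static int thrashes_l1(size_t count, size_t stride)
{
    return max_bucket_load(count, stride) > WAYS;
}
```

With 16 vectors at a stride of 4096 all 16 share one bucket that can hold only 8 cachelines, while a stride of 4096 + 64 spreads them over 16 different buckets. Once the lines no longer fit in L1, reads are served from L2 at 12 cycles instead of 4, which matches the drop to roughly 1/3.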