
CPU cache collisions in the context of performance

This article discusses some potential performance issues caused by CPU cache collisions.

In normal scenarios cache collisions don’t pose a problem; it is usually only in specific, high-speed
applications that they incur noticeable performance penalties, so the things described
here should be considered a “last mile” effort.
As an example, I will use my laptop’s CPU, an Intel Core i5 1.7GHz with a 32kB 8-way L1 data cache per core.

  • CPUs have caches organized in cachelines. For Intel and AMD, cachelines are 64 bytes long.
    When the CPU needs to reach a byte located in memory at address 100, the whole chunk from
    addresses 64-127 is pulled into the cache. Since my example CPU has a 32kB L1 data cache
    per core, this means 512 such cachelines. The size of 64 bytes also means that the six
    least significant bits of the address index a byte within the cacheline:

    address bits:    |       0 - 5      |       6 - ...     |
                     | cacheline offset |
    
  • Cachelines are organized in buckets. “8-way” means that each bucket holds 8 cachelines.
    My CPU’s L1 data cache therefore has 512 cachelines kept in 64 buckets. To address those 64 buckets,
    the next 6 bits of the address word are used; the full address resolution within this L1 cache goes as follows:

    address bits:    |       0 - 5      |      6 - 11     |                12 - ...             |
                     | cacheline offset | bucket selector | cacheline identifier within bucket  |
    
  • The crucial thing to understand here is that, for this CPU, data separated by N x 4096 bytes
    (addresses identical in their lowest 12 bits) will always end up in the same bucket. So many data chunks
    spaced N x 4096 bytes apart and processed in a parallel manner can cause excessive evictions
    of cachelines from buckets, thereby defeating the benefit of the L1 cache, as the sketch below illustrates.
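
    To make the address breakdown concrete, here is a small illustrative sketch (not part of the
    test program linked below) that extracts the bucket selector for a few addresses spaced
    4096 bytes apart; the constants match the example CPU above.

    #include <stdio.h>
    #include <stdint.h>

    /* Constants for the example CPU above: 64-byte cachelines (6 offset bits)
       and 64 buckets (6 selector bits) in the 32kB 8-way L1 data cache. */
    #define LINE_BITS 6
    #define SET_BITS  6

    static unsigned bucket_of(uintptr_t addr)
    {
        return (addr >> LINE_BITS) & ((1u << SET_BITS) - 1);
    }

    int main(void)
    {
        /* Addresses spaced by N x 4096 bytes all report the same bucket. */
        for (int n = 0; n < 4; n++) {
            uintptr_t addr = 0x100000 + (uintptr_t)n * 4096;
            printf("addr %#lx -> bucket %u\n", (unsigned long)addr, bucket_of(addr));
        }
        return 0;
    }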

To test the performance degradation I wrote a test C program
(full C source here)
that generates a number of vectors of pseudo-random integers, sums them up in a typical
parallel-optimized way, and estimates the resulting speed. The program takes a couple
of parameters from the command line so that various CPUs and scenarios can be tested.
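
The linked program is the authoritative version; purely as an illustration of the access pattern
it measures, a minimal sketch might look like the following (the parameter names and the fixed
stride of 1024 integers, i.e. 4096 bytes, are my own choices, not the original program’s):

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        /* Illustrative parameters: 30 vectors of 1000 ints, each vector starting
           `stride` ints after the previous one. stride = 1024 ints = 4096 bytes
           reproduces the colliding case; 1010 or 1030 would avoid it. */
        const int iters = 100000, nvec = 30, len = 1000, stride = 1024;

        int *data = malloc((size_t)nvec * stride * sizeof *data);
        int *sum  = malloc((size_t)len * sizeof *sum);
        if (!data || !sum) return 1;

        for (int i = 0; i < nvec * stride; i++)
            data[i] = rand();

        for (int it = 0; it < iters; it++) {
            for (int i = 0; i < len; i++) {
                int s = 0;
                for (int v = 0; v < nvec; v++)
                    s += data[v * stride + i];  /* accesses spaced stride*4 bytes apart */
                sum[i] = s;
            }
        }

        printf("checksum %d\n", sum[0]);  /* keep the optimizer from dropping the work */
        free(data); free(sum);
        return 0;
    }

Timing the iteration loop is where the stride makes the difference: with a 4096-byte stride,
all thirty summands of a given i compete for the same 8-way bucket.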
Here are results of three test runs on my example CPU:

  1. 100000 iterations, 30 vectors, 1000 integers each, aligned to 1010 integers = 2396 MOP/s
  2. 100000 iterations, 30 vectors, 1000 integers each, aligned to 1024 integers = 890 MOP/s
  3. 100000 iterations, 30 vectors, 1000 integers each, aligned to 1030 integers = 2415 MOP/s

On this CPU the L1 cache has a latency of 4 cycles and the L2 cache a latency of 12 cycles, hence
the drop to almost 1/3 of the performance when the alignment hit the N x 4096 condition: the CPU pretty much fell
back from L1 to L2. This is a synthetic example and real-life applications may not be affected
this much, but I’ve seen applications lose 30-40% to this single factor.

Parting remarks:

  • You may need to take into consideration the structure of a cache, not only its size; as in this case,
    even data chunked into pieces small enough to fit into L1 can still fail to take full advantage of it.
  • The issue cannot be solved by rewriting the critical-section logic in C/C++/assembly or any other
    “super-fast language of your choice”; this behavior is dictated by hardware specifics.
  • Developers’ habit of aligning data to even boundaries, especially to page boundaries,
    can work against you.
  • Padding can help break out of the performance drop (see the sketch after this list).
  • Sometimes the easiest workaround is a platform change, e.g. switching from Intel to AMD
    or the other way around. Keep in mind, though, that this doesn’t really solve the issue; different platforms
    just manifest it for different data layouts.
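
As a sketch of the padding remark above (the 64-byte bump is an assumption: one extra cacheline,
to be tuned by measurement on a real workload), a helper that avoids the N x 4096 spacing could look like:

    #include <stddef.h>

    /* If a row stride in bytes is a multiple of 4096, corresponding elements of
       consecutive rows collide in the same L1 bucket on the example CPU; adding
       one cacheline of padding breaks the pattern. */
    static size_t padded_stride(size_t row_bytes)
    {
        size_t stride = row_bytes;
        if (stride % 4096 == 0)
            stride += 64;   /* one extra cacheline of padding */
        return stride;
    }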
by jarek on 08/10/13
 
 


