CPU cache collisions in the context of performance

This article discusses some potential performance issues caused by CPU cache collisions.

In normal scenarios cache collisions don’t pose a problem; it is usually only in specific, high-speed
applications that they incur noticeable performance penalties, so the techniques described
here should be considered “the last mile effort”.
As an example, I will use my laptop’s CPU, an Intel Core i5 at 1.7GHz, which has a 32kB 8-way L1 data cache per core.

  • CPUs have caches organized in cachelines. For Intel and AMD, cachelines are 64 bytes long.
    When the CPU needs to reach a byte located in memory at address 100, the whole chunk from
    addresses 64-127 is pulled into the cache. Since my example CPU has a 32kB L1 data cache
    per core, this means 512 such cachelines. The size of 64 bytes also means that the six
    least significant bits of the address index a byte within the cacheline:

    address bits:    |       0 - 5      |       6 - ...     |
                     | cacheline offset |
  • Cachelines are organized in buckets. “8-way” means that each bucket holds 8 cachelines.
    Therefore my CPU’s L1 data cache keeps its 512 cachelines in 64 buckets. In order to address those 64 buckets,
    the next 6 bits of the address word are used; full address resolution within this L1 cache goes as follows:

    address bits:    |       0 - 5      |      6 - 11     |                12 - ...             |
                     | cacheline offset | bucket selector | cacheline identifier within bucket  |
  • Crucial to understand here is that, for this CPU, data separated by N x 4096 bytes
    (addresses identical in their lowest 12 bits) will always end up in the same bucket. So many data chunks
    spaced N x 4096 bytes apart and processed in a parallel manner can cause excessive evictions
    of cachelines from buckets, thereby defeating the benefits of the L1 cache.

To test the performance degradation I wrote a test C program
(full C source here)
that generates a number of vectors of pseudo-random integers, sums them up in a typically parallel,
optimized way, and estimates the resulting speed. The program takes a couple
of parameters from the command line so that various CPUs and scenarios can be tested.
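
The kernel of such a benchmark can be sketched as follows. This is my own reconstruction, not the program’s actual source; the sizes mirror the test runs below, and `run` is a hypothetical name:

```c
#include <stdio.h>
#include <stdlib.h>

enum { NVEC = 30, LEN = 1000 };

/* Sum NVEC vectors of LEN ints each, where consecutive vectors start
   align_ints integers apart.  With align_ints = 1024 (4096 bytes),
   element i of every vector maps to the same L1 bucket, so the 30
   interleaved streams thrash the 8-way buckets. */
long run(int align_ints, int iters)
{
    int *data = malloc((size_t)NVEC * align_ints * sizeof *data);
    srand(1);                                  /* reproducible data */
    for (int i = 0; i < NVEC * align_ints; i++)
        data[i] = rand() % 100;

    long sum = 0;
    for (int it = 0; it < iters; it++)
        for (int i = 0; i < LEN; i++)          /* walk all vectors   */
            for (int v = 0; v < NVEC; v++)     /* at the same offset */
                sum += data[v * align_ints + i];
    free(data);
    return sum;
}

int main(void)
{
    printf("align 1010: %ld\n", run(1010, 100));
    printf("align 1024: %ld\n", run(1024, 100)); /* N x 4096 stride */
    return 0;
}
```
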
Here are results of three test runs on my example CPU:

  1. 100000 iterations, 30 vectors, 1000 integers each, aligned to 1010 integers = 2396 MOP/s
  2. 100000 iterations, 30 vectors, 1000 integers each, aligned to 1024 integers = 890 MOP/s
  3. 100000 iterations, 30 vectors, 1000 integers each, aligned to 1030 integers = 2415 MOP/s

On this CPU, the L1 cache has 4 cycles of latency and the L2 cache 12 cycles, hence
the drop to almost 1/3 of the performance when the alignment hit the N x 4096 condition: the CPU pretty much fell
back from L1 to L2. This is a synthetic example, and real-life applications may not be affected
quite this much, but I’ve seen applications lose 30-40% to this single factor.

Parting remarks:

  • You may need to take into consideration the structure of a cache, not only its size: as in this case,
    even data chunked into pieces small enough to fit into L1 can still fail to take full advantage of it.
  • The issue cannot be solved by rewriting the critical-section logic in C/C++/assembly or any other
    “super-fast language of your choice”; this behavior is dictated by hardware specifics.
  • Developers’ habit of aligning to even boundaries, especially to page boundaries,
    can work against you.
  • Padding can help break out of the performance drop.
  • Sometimes the easiest workaround is a platform change, e.g. switching from Intel to AMD
    or the other way round. Keep in mind, though, that this doesn’t really solve the issue; different platforms
    just manifest it for different data layouts.

Why I support the US Government making a cryptography standard weaker

Documents leaked by Edward Snowden last month reveal a $250M program by the NSA known as Operation BULLRUN, to insert vulnerabilities into encryption systems and weaken cryptography standards. It now seems nearly certain that the NIST-certified random number generator Dual_EC_DRBG, adopted as the default in RSA Security’s BSAFE toolkit, contains a back door usable only by the NSA which allows them to predict the entire future output of the generator given only 32 bytes.

So it’s not the easiest time for NIST to suggest they should make a cryptography standard weaker than it was originally proposed. Nevertheless, I support them in this and I hope they go ahead with it. Read more…

Paul Crowley

Programming as a social activity

I realised tonight something that I’d forgotten. We’re usually so busy knocking out code to fulfil our timebox commitments that it’s perhaps easy to forget something very important: to have fun.

I went to the local Smalltalk user group tonight where Jason Ayers gave a talk on simplicity: do our tools help us make simple code? For a change, there was a relative dearth of laptops in the room (and it was a rather full room – nice!) so we “triple programmed”, tasked with implementing Conway’s Game of Life.


I think I’d forgotten that programming can be fun, and not just fun in an amuse-yourself-in-a-corner-on-your-lonesome kind of way, but fun in a way where you meet new people under the guise of performing some shared task. So if there’s a local programming group near you, why not swing by. You might meet some interesting folk. And if there isn’t such a group, maybe start one? It might be fun!

Frank Shearar

Changing the Primary Key Type in Ruby on Rails Models

Ruby on Rails (RoR) likes to emphasise the concept of convention over configuration. Therefore, it seeks to minimise the amount of configuration
by resorting to some defaults. These defaults are sometimes not desirable, and RoR does not always make it easy to deviate from them.

Read more…

Yong Wen Chua

My little Backpressure: Flow Control is magic

When we’re designing systems to be robust against failure, it’s important to know how behaviour at your integration points (a term borrowed from Michael Nygard’s book Release It!) impacts the rest of the system. For example, if your database or a remote API is running slowly, then in synchronous systems, because you (usually) have a limited number of threads on which to process requests, any slowdown will naturally be pushed back onto the clients of that service.

Read more…

Ceri Storey

Testing the Reactor pattern

A good while ago I wrote a SIP stack. Like many network things, a SIP stack needs to keep track of multiple tasks – reading or writing from sockets, asking the user to respond to events, and so on. And so I naïvely added a bunch of threads. And then I spent a few weeks trying to fix nasty bugs. And then I got fed up and rewrote all the logic. It turns out that I accidentally stumbled upon the Reactor pattern… but with something that I’ve not seen elsewhere.

Read more…

Frank Shearar

Precise scheduling with RabbitMQ

On a project recently, we needed to be able to process jobs asynchronously, but we also needed to be able to specify that they should be run at a certain point in the future. We also needed to be able to implement exponential backoff on failure. We initially tried to integrate Sidekiq, but unfortunately it turned out to not be a good fit for the way we structured the code base.

Read more…

Ceri Storey

Assuming there’s a user is sometimes a bad idea

Squeak has a very strong (historic) assumption that there’s a(n interactive) user interface. I stumbled across another occurrence of this assumption the other day. Let’s take a look at the problem, and how to fix it.

Read more…

Frank Shearar

The great GC vs reference counting debate

I read a blog post recently to the effect that GC is too expensive on mobile devices, Steve Jobs was right, and reference counting is the way. It’s titled ‘why mobile web apps are slow’.

I’m inclined to take issue with this: it’s a long-since-resolved dispute, and GC won. I don’t want Steve Jobs to reach out from the grave and drag us back to the 70s. There’s nothing special about mobile phones: they are more powerful than the computers that GC won on in the first place.

Don’t get me wrong, Apple must have had a good reason to choose reference counting for memory management. That reason is actually pretty obvious: Objective-C is not a safe language, so you can’t move objects. If you can’t move objects, you have to use something like malloc to minimise fragmentation: you pay a big up-front cost to allocate an object in the right place. In most GC schemes, allocation is free – you just increment a pointer. This is fine, because you are going to iterate through all your objects every now and again, and it doesn’t cost a whole lot more to move some of them around at the same time. You get to compact away all the fragmentation this results in. This makes GC fairly miserly with memory too. In iOS, if you want to allocate something bigger than the largest free fragment, you die. With GC, you have to wait a while for the collector to compact everything, after which memory will be packed 100% efficiently and there will be a single free extent as big as all available memory. Objective-C can’t have any of these benefits, so Apple chose a different route for iOS, and put the best possible spin on it.

The amusing thing is, this leads us to why web apps are so slow. Reference counting requires that the developer take care to avoid reference loops, because they leak memory. iOS simply demands that the programmer take this care. Browsers can’t do that, because the ECMAScript standard doesn’t place this demand on programmers, so browsers must use GC. But they can’t move objects, because they are largely written in C, which is also not safe. The C code and the ECMAScript share a large data structure: the DOM. The ECMAScript can manipulate the DOM, so the DOM is managed memory. Fragmentation has to be managed on allocation, and on top of that we have to GC. A browser is the worst possible hybrid of memory management techniques.

The author of the above post doesn’t see things this way. He references Python as an example of GC, but it actually does things the same hopelessly inefficient way the web browser does, for the same reasons. He cites a 10-year-old paper for empirical evidence of how inefficient GC is. Citations are good, but I think there’s a bit of a sample bias here. Anyway, the paper has a few problems. It says malloc takes 6 clock cycles. It does, until you allocate a certain amount of memory, configured during compilation of libc. Once you pass that threshold, it starts trying to manage fragmentation, and then it takes 120 cycles. I suspect the authors assumed malloc takes constant time, and neglected to measure the cost of free at all. The GC test uses Jikes, and the authors pick their own GC strategies rather than use proven ones. None of the strategies make any sense, and no rationale for their selection is offered: you would expect the results to be poor. A test of the performance of the standard Sun JVM strategy isn’t included.

Mobile apps won’t be fast until browsers are rewritten in a fast, safe programming language that can share a good GC with ECMAScript. Of course browsers won’t be secure until then either, but I’m not holding my breath. There is no such language that is sufficiently popular, and Google fluffed its lines with Go. Maybe ECMAScript implementations will gain so much in performance that all code that manipulates the DOM will be written in ECMAScript, and we can finally use a high-performance GC.


Delimited dynamic variables from call/cc

I’m prepared to own up to my biases. I like delimited continuations. I like zippers. I like getting halfway through my work, shelving my work for a time, and coming back to it later.

We’ve seen the relationship between resumable exceptions and delimited dynamic variables before, but what about languages where you don’t have direct access to the call stack? Let’s implement delimited dynamic variables by implementing resumable exceptions with call/cc (obligatory mention of why call/cc’s a bad idea). So what’s that look like in Ruby, then?

Read more…

Frank Shearar






© 2000-14 LShift Ltd, 1st Floor, Hoxton Point, 6 Rufus Street, London, N1 6PE, UK. +44 (0)20 7729 7060.