This article discusses some potential performance issues caused by CPU cache collisions.
In normal scenarios cache collisions don’t pose a problem; it is usually only in specific, high-speed
applications that they incur noticeable performance penalties, and as such, the techniques described
here should be considered “the last mile effort”.
As an example, I will use my laptop’s CPU, an Intel Core i5 @ 1.7 GHz, which has a 32 kB, 8-way set-associative L1 data cache per core.
address bits: |       0 - 5      |     6 - ...     |
              | cacheline offset |                 |

address bits: |       0 - 5      |     6 - 11      |              12 - ...              |
              | cacheline offset | bucket selector | cacheline identifier within bucket |
To test the performance degradation I wrote a test C program
(full C source here)
that generates a number of vectors of pseudo-random integers, sums them up in a typical parallel-optimized
way, and estimates the resulting speed. The program takes a couple
of parameters on the command line so that various CPUs and scenarios can be tested.
Here are results of three test runs on my example CPU:
In this CPU, the L1 cache has a latency of 4 cycles and the L2 cache a latency of 12 cycles, hence
the drop to almost one third of the original performance when the alignment hit the N × 4096 condition – the CPU pretty much fell
back from L1 to L2. While this is a synthetic example and real-life applications may not be affected
this much, I’ve seen applications lose 30-40% to this single factor.
Documents leaked by Edward Snowden last month reveal a $250M program by the NSA known as Operation BULLRUN, to insert vulnerabilities into encryption systems and weaken cryptography standards. It now seems nearly certain that the NIST-certified random number generator Dual_EC_DRBG, adopted as the default in RSA Security’s BSAFE toolkit, contains a back door usable only by the NSA which allows them to predict the entire future output of the generator given only 32 bytes.
So it’s not the easiest time for NIST to suggest they should make a cryptography standard weaker than it was originally proposed. Nevertheless, I support them in this and I hope they go ahead with it. Read more…
I realised tonight something that I’d forgotten. We’re usually so busy knocking out code to fulfil our timebox commitments that it’s perhaps easy to forget something very important: to have fun.
I went to the local Smalltalk user group tonight where Jason Ayers gave a talk on simplicity: do our tools help us make simple code? For a change, there was a relative dearth of laptops in the room (and it was a rather full room – nice!) so we “triple programmed”, tasked with implementing Conway’s Game of Life.
I think I’d forgotten that programming can be fun, and not just fun in an amuse-yourself-in-a-corner-on-your-lonesome kind of way, but fun in a way where you meet new people under the guise of performing some shared task. So if there’s a local programming group near you, why not swing by. You might meet some interesting folk. And if there isn’t such a group, maybe start one? It might be fun!
Ruby on Rails (RoR) likes to emphasise the concept of convention over configuration. Therefore, it seeks to minimise the amount of configuration
by resorting to sensible defaults. These defaults are sometimes not desirable, and RoR does not always make it easy to deviate from them.
When we’re designing systems that are meant to be robust against failure, it’s important to know how behaviour at your integration points (a term borrowed from Michael Nygard’s book Release It!) impacts the rest of the system. For example, if your database or a remote API is running slowly, then in synchronous systems, because you (usually) have a limited number of threads on which to process requests, any slowdown will naturally be pushed back onto the clients of that service.
A good while ago I wrote a SIP stack. Like many network things, a SIP stack needs to keep track of multiple tasks – reading or writing from sockets, asking the user to respond to events, and so on. And so I naïvely added a bunch of threads. And then I spent a few weeks trying to fix nasty bugs. And then I got fed up and rewrote all the logic. It turns out that I accidentally stumbled upon the Reactor pattern… but with something that I’ve not seen elsewhere.
On a project recently, we needed to be able to process jobs asynchronously, but we also needed to be able to specify that they should be run at a certain point in the future. We also needed to be able to implement exponential backoff on failure. We initially tried to integrate Sidekiq, but unfortunately it turned out to not be a good fit for the way we structured the code base.
I read a blog post recently to the effect that GC is too expensive on mobile devices, Steve Jobs was right, reference counting is the way. It’s titled ‘why mobile web apps are slow’.
I’m inclined to take issue with this: It’s a long since resolved dispute, and GC won. I don’t want Steve Jobs to reach out from the grave and drag us back to the 70s. There’s nothing special about mobile phones: they are more powerful than the computers that GC won on in the first place.
Don’t get me wrong, Apple must have had a good reason to choose reference counting for memory management. That reason is actually pretty obvious: Objective-C is not a safe language, so you can’t move objects. If you can’t move objects, you have to use something like malloc to minimise fragmentation: you pay a big up-front cost to allocate an object in the right place. In most GC schemes, allocation is free – you just increment a pointer. This is fine, because you are going to iterate through all your objects every now and again, and it doesn’t cost a whole lot more to move some of them around at the same time: you get to compact away all the fragmentation this results in. This makes GC fairly miserly with memory too. On iOS, if you want to allocate something bigger than the largest free fragment, you die. With GC, you have to wait a while for the collector to compact everything, after which memory is packed 100% efficiently and there is a single free extent as big as all available memory. Objective-C can’t have any of these benefits, so Apple chose a different route for iOS, and put the best possible spin on it.
The author of the above post doesn’t see things this way. He references Python as an example of GC, but it actually does things the same hopelessly inefficient way the web browser does, for the same reasons. He cites a 10-year-old paper for empirical evidence on how inefficient GC is. Citations are good, but I think there’s a bit of a sample bias here. Anyway, the paper has a few problems. It says malloc takes 6 clock cycles. It does, until you allocate a certain amount of memory, configured when libc is compiled. Once you pass that threshold, it starts trying to manage fragmentation, and then it takes 120 cycles. I suspect they assumed malloc takes constant time, and neglected to measure the cost of free at all. The GC test uses Jikes, and the authors pick their own GC strategies rather than use proven ones. None of the strategies make any sense, and no rationale for their selection is offered: you would expect the results to be poor. A test of the performance of the standard Sun JVM strategy isn’t included.
Mobile apps won’t be fast until browsers are re-written in a fast, safe programming language that can share a good GC with ECMAScript. Of course browsers won’t be secure until then either, but I’m not holding my breath. There is no such language that is sufficiently popular, and Google fluffed its lines with Go. Maybe ECMAScript implementations will gain so much in performance that all code that manipulates the DOM will be written in ECMAScript, and we can finally use a high-performance GC.
I’m prepared to own up to my biases. I like delimited continuations. I like zippers. I like getting halfway through my work, shelving my work for a time, and coming back to it later.
We’ve seen the relationship between resumable exceptions and delimited dynamic variables before, but what about languages where you don’t have direct access to the call stack? Let’s implement delimited dynamic variables by implementing resumable exceptions with call/cc (obligatory mention of why call/cc’s a bad idea). So what’s that look like in Ruby, then?