We had a typical C-vs-Fortran discussion here at LShift, which prompted me to follow up on it. In this holy war I stand by C, and believe that many opinions supporting the alleged superiority of Fortran in numerical throughput come from a poor understanding of what can actually be done on the C side. So I'd like to demonstrate a couple of loop optimization techniques that can be used in C to maximize numerical throughput. Why loops? Because they often constitute the bulk of computations and have significant potential for exploiting the architecture of modern CPUs. And if your code doesn't spin in such loops, you should strive to make it do so; I hope this post provides convincing arguments as to why.
I use my laptop with a Sandy Bridge CPU, a Core i5 running at 1.7GHz, and GCC 4.8 as the compiler. As a task, I've picked the sum of squares of N vectors:
y[i] = x1[i]^2 + x2[i]^2 + ... + xN[i]^2
It nicely fits the bill: sums of squares are commonplace in numerical computations, the task uses addition and multiplication, allowing us to demonstrate the mechanisms available in modern CPUs, and it has the desired structure of:
for i ... for j ... do stuff here
I use snippets of slightly simplified C here, and although all tested code is pure C, I also use chunks of the actual assembly produced by GCC to illustrate the issues as they arise. The assembly snippets are for the SSE target, as GCC tends to produce cleaner code for SSE than for AVX.
Ok, down to business. Our naive loop summing the squares of N vectors could be:
for(i = 0; i < VECTOR_LEN; i++)       // mulss %xmm0, %xmm0
  for(j = 0; j < N; j++)              // addss %xmm0, %xmm1
    acc[i] += v[j][i] * v[j][i];      // movss %xmm1, (%rdi)
Disassembly reveals that XXXss instructions are used. This subset of SSE instructions operates on one floating-point word at a time, not four, in one go. This code is clearly not taking full advantage of the SIMD units, and throughput is still limited to 1xCPU_CLOCK. Since it logically doesn't matter which loop is inner and which is outer, we can swap them and the algorithm remains valid. Now we have the "horizontal" loop:
for(j = 0; j < N; j++)                // mulps %xmm0, %xmm0
  for(i = 0; i < VECTOR_LEN; i++)     // addps %xmm0, %xmm1
    acc[i] += v[j][i] * v[j][i];      // movups %xmm1, (%rdi, %rax)
Boom! This time GCC vectorized the inner loop using full-width XXXps SSE instructions. This single change boosts performance to the expected SIMD_WIDTHxCPU_CLOCK level, as will be shown below. Too bad GCC cannot do this simple optimization for us automatically, but as far as I remember, ICC can. Moving on, the next logical step is to unroll the calculations "vertically", which should reduce the number of memory reads and writes. The manually unrolled loop:
for(j = 0; j <= N-4; j+=4)            // movups (%rdi, %r9), %xmm0
  for(i = 0; i < VECTOR_LEN; i++) {   // mulps %xmm1, %xmm1
    acc[i] += v[j+0][i] * v[j+0][i];  // addps %xmm1, %xmm0
    acc[i] += v[j+1][i] * v[j+1][i];  // movups %xmm0, (%rdi, %r9) <== redundant write
    acc[i] += v[j+2][i] * v[j+2][i];  // movups (%rax, %r9), %xmm1
    acc[i] += v[j+3][i] * v[j+3][i];  // mulps %xmm1, %xmm1
  }                                   // addps %xmm0, %xmm1
                                      // movups %xmm1, (%rdi, %r9) <== redundant write
Here we see the infamous effect of pointer aliasing, so often brought up in C-vs-Fortran discussions. For each line of calculations the compiler produces extra read/write instructions, which defeats the intended purpose of vertical unrolling. Luckily the solution is trivial: an extra variable in the inner loop makes the compiler produce code which caches the calculations in a register. Here is the "cached" loop:
for(j = 0; j <= N-4; j+=4)            // movups (%rcx, %r9), %xmm1 <== single reads
  for(i = 0; i < VECTOR_LEN; i++) {   // movups (%r8, %r9), %xmm0
    float tmp = acc[i];               // mulps %xmm1, %xmm1 <== bulk calculations
    tmp += v[j+0][i] * v[j+0][i];     // addps %xmm4, %xmm3
    tmp += v[j+1][i] * v[j+1][i];     // mulps %xmm0, %xmm0
    tmp += v[j+2][i] * v[j+2][i];     // addps %xmm3, %xmm2
    tmp += v[j+3][i] * v[j+3][i];     // addps %xmm2, %xmm1
    acc[i] = tmp;                     // addps %xmm1, %xmm0
  }                                   // movups %xmm0, (%rdi, %r9) <== single write
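(As an aside, C99's restrict qualifier attacks the same aliasing problem at the interface level, by promising the compiler that the accumulator and the input vectors never overlap. A minimal sketch of my own, not from the tested code; the function name is illustrative and VECTOR_LEN is assumed to be a compile-time constant as in the snippets above:

/* restrict tells the compiler acc and v cannot alias, so it may
   keep acc[i] in a register across the whole unrolled block */
void sum_squares(float * restrict acc,
                 const float (* restrict v)[VECTOR_LEN],
                 int n)
{
    for (int j = 0; j <= n - 4; j += 4)
        for (int i = 0; i < VECTOR_LEN; i++)
            acc[i] += v[j+0][i] * v[j+0][i]
                    + v[j+1][i] * v[j+1][i]
                    + v[j+2][i] * v[j+2][i]
                    + v[j+3][i] * v[j+3][i];
}

Whether this helps as much as the explicit temporary depends on the compiler version, so it is worth checking the disassembly either way.)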
With the cached loop, the block of resultant SSE operations is compact and free of redundant memory accesses. The last optimization I'd like to introduce further leverages the ability of a modern CPU to parallelize independent streams of operations. To do this we need to break dependency chains; in other words, split the calculations into independent sequences executed on separate registers and execution units. Here is our "final" loop:
for(j = 0; j <= N-8; j+=8)
  for(i = 0; i < VECTOR_LEN; i++) {
    float tmp1 = acc[i];
    float tmp2 = v[j+0][i] * v[j+0][i];
    float tmp3 = v[j+1][i] * v[j+1][i];
    tmp1 += v[j+2][i] * v[j+2][i];
    tmp2 += v[j+3][i] * v[j+3][i];
    tmp3 += v[j+4][i] * v[j+4][i];
    tmp1 += v[j+5][i] * v[j+5][i];
    tmp2 += v[j+6][i] * v[j+6][i];
    tmp3 += v[j+7][i] * v[j+7][i];
    acc[i] = tmp1 + tmp2 + tmp3;
  }
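To make the shape of the test concrete, here is a minimal self-contained harness of my own devising; the sizes, repetition count and clock()-based timing are illustrative assumptions, not the benchmark code linked below:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N          16     /* number of vectors (multiple of 8)  */
#define VECTOR_LEN 480    /* 16 x 480 x 4B + acc ~= 32kB of L1  */

static float v[N][VECTOR_LEN];
static float acc[VECTOR_LEN];

static void sum_squares_final(void)
{
    for (int j = 0; j <= N - 8; j += 8)
        for (int i = 0; i < VECTOR_LEN; i++) {
            float tmp1 = acc[i];
            float tmp2 = v[j+0][i] * v[j+0][i];
            float tmp3 = v[j+1][i] * v[j+1][i];
            tmp1 += v[j+2][i] * v[j+2][i];
            tmp2 += v[j+3][i] * v[j+3][i];
            tmp3 += v[j+4][i] * v[j+4][i];
            tmp1 += v[j+5][i] * v[j+5][i];
            tmp2 += v[j+6][i] * v[j+6][i];
            tmp3 += v[j+7][i] * v[j+7][i];
            acc[i] = tmp1 + tmp2 + tmp3;
        }
}

int main(void)
{
    const int reps = 100000;

    for (int j = 0; j < N; j++)
        for (int i = 0; i < VECTOR_LEN; i++)
            v[j][i] = (float)rand() / RAND_MAX;

    clock_t t0 = clock();
    for (int r = 0; r < reps; r++)
        sum_squares_final();
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

    /* counting one multiply and one add per element of each vector */
    printf("%.1f MFLOPS (checksum %g)\n",
           2.0 * N * VECTOR_LEN * (double)reps / secs / 1e6, acc[0]);
    return 0;
}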
The C code I used for testing all the above loops is here. To rule out memory bandwidth issues as far as possible, I ran the tests on a bunch of vectors small enough to fit into L1 cache. Throughputs for a single core:
            SSE              AVX
naive:      1733.4 MFLOPS    1696.6 MFLOPS   // 1xCPU_CLOCK barrier for scalar instructions
horizontal: 5963.6 MFLOPS    9419.8 MFLOPS   // 4xCPU_CLOCK and 8xCPU_CLOCK for SSE and AVX
unrolled:   11264.8 MFLOPS   11496.6 MFLOPS
cached:     14253.7 MFLOPS   15086.5 MFLOPS
final:      17985.4 MFLOPS   18210.4 MFLOPS  // both SSE and AVX settle at around 10xCPU_CLOCK
So it seems this midrange laptop CPU could potentially give us some 35 GFLOPS out of its two cores, with nothing more than simple changes in pure C.
Things to consider:
The internet of things seems to be coming any day now, but the state of embedded development is deplorable. Almost everything is written in C or C++. Device drivers are written over and over, once for each RTOS, or worse. When high-level languages are available, they seem to be implemented directly on the hardware rather than on top of an existing RTOS, so there's another chance to practice writing device drivers. Want a file system, or a network stack? You'll need to patch one for the kernel of your choice.
This approach is no longer justifiable: the target devices typically have 64K of memory, 256K of memory-mapped flash, and a throughput of 100 MIPS. It's time embedded developers stopped playing at writing device drivers, and thought about composition.
Trying to put together an environment for experimenting with micro controllers is frustrating. For example, I have a STM32F3 Discovery board. The micro controller itself has a huge array of peripheral interfaces, and the board adds a gyroscope and accelerometer, and LEDs to light it up like a Christmas tree. It costs £9, which is cheap enough to buy several, in case you break one. I'm a software engineer, not an electrical engineer, so that's going to happen. There's 48K of RAM, and 256K of flash. Its sleep mode uses 0.6mA, so if interrupts in your application are rare, you might even use less power than an Arduino.
So, what would be a productive environment?
How do I get it? Well, now the pain begins.
LwIP is a small-footprint TCP/IP stack written in C. Almost everything mentioned below includes some support for it, so you can plug it in. Using it doesn't require anything from the language beyond support for C bindings. Some extra work might be required if you want to provide network drivers in the language of your choice.
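To give a flavour of what that C binding involves, here is a rough sketch of bringing up LwIP in its bare-metal (NO_SYS) mode. Header and type names follow lwIP 2.x and vary between versions, and ethernetif_init stands in for whatever driver-init function your port provides:

#include "lwip/init.h"
#include "lwip/netif.h"
#include "lwip/timeouts.h"
#include "netif/ethernet.h"

/* provided by your port: initialises the MAC and wires up output */
extern err_t ethernetif_init(struct netif *netif);

static struct netif eth0;

void net_setup(void)
{
    ip4_addr_t ip, mask, gw;
    IP4_ADDR(&ip,   192, 168, 0, 10);
    IP4_ADDR(&mask, 255, 255, 255, 0);
    IP4_ADDR(&gw,   192, 168, 0, 1);

    lwip_init();
    netif_add(&eth0, &ip, &mask, &gw, NULL, ethernetif_init, ethernet_input);
    netif_set_default(&eth0);
    netif_set_up(&eth0);
}

/* called from the main loop, alongside feeding received frames in */
void net_poll(void)
{
    sys_check_timeouts();
}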
Espruino and eLua are the closest thing there is to an all-inclusive solution.
Espruino doesn’t have much in the way of architecture documentation. There’s no description of the interpreter, so unless you read the code, you can’t know anything about the competence of the authors, or the sophistication of the interpreter. I’d make a guess that it’s based on Tiny-JS. There’s no intermediate code form, which guarantees your code takes up a lot of RAM.
Lua has co-routines, which is a big step up from being completely event-oriented. eLua can execute byte code straight out of flash, rather than having to use RAM to hold programs, which is a pretty useful optimisation in this context. Lua also has a great C binding.
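To illustrate that binding: exposing a C function to Lua scripts takes only a few lines. A sketch, where board_led_set is a hypothetical board-support routine rather than anything eLua actually ships:

#include "lua.h"
#include "lauxlib.h"

/* hypothetical board-support function */
extern void board_led_set(int led, int on);

static int l_led_set(lua_State *L)
{
    int led = luaL_checkinteger(L, 1);
    int on  = lua_toboolean(L, 2);
    board_led_set(led, on);
    return 0;  /* no results pushed back to Lua */
}

void lua_env_setup(lua_State *L)
{
    lua_register(L, "led_set", l_led_set);
    /* scripts can now call: led_set(3, true) */
}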
eLua runs on my device, but only to the extent that it can run a REPL on the serial port. No other peripherals are supported. eLua's concept is to be Lua as far down as possible. From the point of view of making eLua as easy to improve as possible, this is a good design decision. It's a long game though, and I don't see anything in its roadmap that suggests it's going to tackle the issue of memory management during interrupts, or compilation when higher performance is needed. I think that's going to mean device drivers keep getting written in C. Given that, hitching a ride on an RTOS which has momentum in this area – e.g. Chibios – seems like a pragmatic way forward, but that seems to get rejected on the mailing list.
That’s not to say that eLua isn’t the right starting point to tackle these problems: it may well be.
NuttX is an RTOS offering POSIX support and DLLs. This would mean, for example, that it's reasonably easy to compile various interpreters, and lots of open-source software. It has limited support for the F3 Discovery board – basically no more than eLua. I could choose the F4 Discovery board to solve this problem. There's an open source project to run (full) Lua under NuttX, which I hope to try out.
Chibios doesn't offer any sort of standard interface. It does however support a huge array of boards, including all the F3 Discovery peripherals. It also seems to get new boards and drivers supported very quickly. For example, Chibios supports the ADC/Timer/DMA feature I mentioned above, and had that support a month or so after the board was released. This is also the only thing I've actually run on the board. It's easy to set up a tool chain and build. The samples are readable, by C standards.
Because Chibios has good support for the boards I have, and because FreeRTOS (for example) appears to have very similar features to Chibios, I haven’t investigated much further in this category.
Scheme might be a good choice as an embedded interpreter. I could build a system on top of Chibios. There are at least two compilers I could choose between: Chicken and Stalin. Chicken has a REPL, so it appeals more. It lacks a really good GC, but I guess that might not be such a big problem in the short term. Chicken's first generation is on the stack, and I can see how that might make it possible to write interrupt handlers directly in Scheme, although if the stack ran out, the interrupt handler would fail.
I must admit, I'd assumed that a TCP/IP stack written in Scheme was available, but I haven't found one. Or a file system, for that matter. Still, there's LwIP, and writing a file system in Scheme isn't so daunting. I'm not sure I'll be convincing a lot of people to write electricity meter firmware in Scheme, but I could always add interpreters for other languages.
I guess I hinted at the top that there's no clear conclusion. Suggestions?
It’s nearly exactly three years since I started at LShift. I’d like to take a moment and look back at what I’ve done.
In Bigwig, in order to keep our code neat and well factored, we’ve tried to adhere to the principle of tell, don’t ask as much as we can. However, one place this can be difficult is within a handler for an HTTP request (we’re using Sinatra for that).
One possible workaround for a lack of generics is code generation. Let's look at using Go's AST manipulation to make a Maybe Int out of a generic Maybe.
It's about time someone started talking about Go again around here, so I picked up the old editor and (painlessly!) installed Go. Maybe five minutes later I had the world's fastest compiler, a test framework, a coverage analyzer, and a bunch of other stuff besides available on my machine. But what to do? Hello World is so done, so I thought I'd grab my copy of Hutton & Meijer and implement a basic parser combinator.
Someone discovered a vulnerability in Zabbix recently, and there's a lovely, detailed description of an exploit based on it on Corelan Team. It's lovely because it contains all the information I need to tell whether my site is vulnerable, and to what extent.
There’s also a really useless advisory on Packet Storm Security. Why is it useless? Because at the bottom, there’s a section called Workaround, which reads ‘No workaround available’. This is really unfair to Zabbix:
Zabbix offers a mode called ‘active agent’, in which, rather than the server querying the agent, the agent submits information to the server periodically. This means it’s code on the monitored host that determines what information is passed to the server, and this eliminates the logical possibility of an escalation attack onto monitored hosts.
The existence of this mode is why I consider Zabbix in security-sensitive applications. I pretty much assumed SQL injection attacks existed in Zabbix, because the API is written in PHP; hence I wouldn't consider using passive mode. I was a bit disappointed to find the guest account is enabled by default, but the point is, I know that Zabbix being compromised won't result in a data protection incident.
So in short, the workaround is to disable passive agents: in your /etc/zabbix/zabbix_agentd.conf, set DisablePassive=1. But that’s what you were doing anyway, right? Zabbix deserve some criticism for providing a way of configuring their product that is not reliably secure, but I don’t think it’s too much to expect security researchers to have some awareness of the architecture of the products they publish security advisories about.
I should also point out that you could equally choose collectd and Graphite to get the same result. This has the added advantage that push is the only way it works, so there won't be any irrelevant security advisories to explain to your clients.
I don’t read either of the above sites regularly, so I don’t know if this single data point reflects the overall quality of either.
This article discusses some potential performance issues caused by CPU cache collisions.
In normal scenarios cache collisions don't pose a problem; it is usually only in specific, high-speed applications that they incur noticeable performance penalties, and as such, the things described here should be considered "the last mile effort".
As an example, I will use my laptop's CPU, an Intel Core i5 running at 1.7GHz, which has a 32kB, 8-way set-associative L1 data cache per core.
With 64-byte cachelines, the lowest 6 bits of an address select a byte within a line:

address bits: |       0 - 5      | 6 - ...
              | cacheline offset |

A 32kB, 8-way cache holds 32768 / 64 = 512 lines, arranged in 512 / 8 = 64 buckets. The next 6 address bits select the bucket, and the remaining high bits identify the line within it:

address bits: |       0 - 5      |      6 - 11     | 12 - ...
              | cacheline offset | bucket selector | cacheline identifier within bucket

Addresses that agree on bits 6-11 therefore compete for the same 8 slots.
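In C, the bucket an address falls into under this geometry can be computed as follows (a sketch, using the 64-byte lines and 64 buckets derived above):

#include <stdint.h>

/* 32kB, 8-way, 64-byte lines => 32768 / 64 / 8 = 64 buckets */
#define LINE_BITS 6
#define SET_COUNT 64

static unsigned l1_bucket_of(const void *p)
{
    uintptr_t a = (uintptr_t)p;
    return (a >> LINE_BITS) & (SET_COUNT - 1);
}

Two addresses whose difference is a multiple of 64 x 64 = 4096 bytes land in the same bucket, which is exactly the N x 4096 condition measured below.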
To test the performance degradation, I wrote a test C program (full C source here) that generates a number of vectors of pseudo-random integers, sums them up in a typical parallel-optimized way, and estimates the resulting speed. The program takes a couple of parameters from the command line so that various CPUs and scenarios can be tested.
Here are results of three test runs on my example CPU:
In this CPU, the L1 cache has 4 cycles of latency and the L2 cache has 12, hence the drop to almost a third of the original performance when the alignment hits the N x 4096 condition: the CPU has effectively fallen back from L1 to L2. This is a synthetic example, and real-life applications may not be affected this much, but I've seen applications lose 30-40% to this single factor.
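A common remedy (my addition, not from the measurements above) is to pad each vector so that consecutive rows stop being congruent modulo 4096; any non-zero multiple of the line size will do:

#define VEC_LEN 1024                /* 4096 bytes of ints per row       */
#define PAD     (64 / sizeof(int))  /* one extra cacheline between rows */

/* without PAD, all rows start 4096 bytes apart and compete for the same
   8-way L1 bucket; with it, successive rows map to different buckets */
static int vectors[8][VEC_LEN + PAD];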
Documents leaked by Edward Snowden last month reveal a $250M program by the NSA known as Operation BULLRUN, to insert vulnerabilities into encryption systems and weaken cryptography standards. It now seems nearly certain that the NIST-certified random number generator Dual_EC_DRBG, adopted as the default in RSA Security's BSAFE toolkit, contains a back door usable only by the NSA which allows them to predict the entire future output of the generator given only 32 bytes.
So it's not the easiest time for NIST to suggest making a cryptography standard weaker than originally proposed. Nevertheless, I support them in this and I hope they go ahead with it.
I realised tonight something that I'd forgotten. We're usually so busy knocking out code to fulfil our timebox commitments that it's perhaps easy to forget something very important: to have fun.
I went to the local Smalltalk user group tonight where Jason Ayers gave a talk on simplicity: do our tools help us make simple code? For a change, there was a relative dearth of laptops in the room (and it was a rather full room – nice!) so we “triple programmed”, tasked with implementing Conway’s Game of Life.
I think I’d forgotten that programming can be fun, and not just fun in an amuse-yourself-in-a-corner-on-your-lonesome kind of way, but fun in a way where you meet new people under the guise of performing some shared task. So if there’s a local programming group near you, why not swing by. You might meet some interesting folk. And if there isn’t such a group, maybe start one? It might be fun!