CodeMesh 2013 Redux

Last month I attended the CodeMesh conference here in sunny London, along with a couple of my colleagues. Here are my recollections and thoughts.

The venue (the Hotel Russell on Russell Square) is a pleasantly rambling, grand old hotel, which hosted a few hundred hardcore geeks fairly well. A couple of the rooms were a bit small and not entirely suited to the task, though the main lecture hall and the exhibition/mingling/eating space alongside worked very well. The food was good and the craft beers on the first night were exactly what is missing from most such parties. Most event organisers don’t do any better than Becks, so the variety of beer here, along with the live Clojure-generated beats, made the evening stick in the memory.

The content of the conference (I didn’t just go for the beer and food) was tending strongly towards functional programming, new languages and general deep geekery. And for that I applaud it, though I found the quality of the talks variable. Putting it less politely, a few were atrocious, mainly due to scatty content and poor exposition, but there were just enough gems to make it worthwhile. I think I may have chosen badly between the available tracks as my colleagues seemed to fare better. I’ve been to enough conferences to know that this is par for the course, and like a round of golf, it just takes a couple of standouts to make it all worthwhile.

For me, Jafar Husain’s talk “End to End Reactive Programming at Netflix” was a sparkling example, and I have already presented a mini version of it back at base, introducing my colleagues to the Reactive Extensions (Rx) from a JS point of view, and including some nice examples that I found from Bacon.js, an alternative implementation based on the same principles. I strongly recommend looking at Rx for JavaScript if you’re working in the browser, and at the many other ports for use on the server or various UI frameworks. I’ve actually been exposed to the Scala version myself recently as part of the Principles of Reactive Programming course on Coursera, which was nicely timed to bolster my understanding.

More generally, the key themes that stuck out to me, with a heavy helping of personal opinion, were as follows.

  • Scala has a lot of mindshare and people are doing real things with it. It also brings with it a great deal of scepticism from those that think it’s too complicated or too impure as a functional language. I share those feelings, but I’m learning and using it all the same as it may be the best thing available right now and allows me to mix and match paradigms within a single program, bringing pragmatism to problem solving. I spoke to Bruce Tate after his talk on the evolution of programming languages and he reckoned Scala is a good bridge language to a better place that probably doesn’t exist yet (I’m paraphrasing).
  • Haskell maintains the moral high ground, and the talk about what they’re doing with it at Facebook was a much-needed example of its use in the real world. Having one of the long-time Haskell compiler engineers working on the project has clearly helped, and indeed they have made compiler changes to enable their project to succeed. Somehow Haskell leaves me cold, however, at least for now; perhaps because it’s that much more of a stretch away from the JVM, and its purity feels more like trouble than freedom. Maybe I’ll get there in the end.
  • People are really trying hard to bring new languages to the table that learn from what has gone before. Rust looks like a nice approach to systems programming and I’d definitely consider it if faced with a problem that looked like it required C or similar.
  • There was a background level of Erlang, but perhaps more interesting was Elixir, a new functional language that runs on the Erlang VM. Dave Thomas was up on stage with José Valim, Elixir’s creator, waxing hysterical about how he was as excited about it as he had been about Ruby back in the day. Personally I’m not feeling it, perhaps because I’ve been tending towards statically typed languages recently, having done plenty of Ruby and Groovy in years past.

To summarise, it was a little hit and miss, but allowed me to put my finger on the pulse of this section of the tech community, and to learn some worthwhile lessons. Many of those lessons were of the intangible, hand-wavy sort, but I value them highly as we try to navigate the currents of the fast-moving tech world. Then again, one of the things we learn is that at a higher level it doesn’t move that quickly and a lot of the stuff that’s hot right now has its origins in the 70s and 80s.

by Sam Carr on 29/12/13

Documenting an HTTP API with Swagger

I recently tried out Swagger, for documenting an HTTP API. The big win with Swagger is that it provides a sweet HTML UI to browse your API docs and experiment with sending requests and viewing responses, which is a great experience for other developers that are trying to get to grips with your API. Try out their demo of the Swagger UI, for a simple petstore example.

Swagger petstore example - screenshot

Swagger is effectively three things ‘architecturally’:

  • A specification for the JSON files, which contain your API documentation in the abstract
  • Various “generator” tools and libraries (many third party) for producing those JSON files in the first place, either statically or dynamically
  • Swagger UI, a static HTML/JS app which consumes those JSON files and presents the nice UI.

Ideally you use a generator that extracts the API documentation for free, right from your API code. Many such generators are available for various languages and HTTP/REST frameworks, so you may have to do very little to get a free lunch here. However, typically you’d expect to use extra annotations to further document the API with the human-facing semantic information that isn’t present in the raw code, and further annotations may be needed to clarify object serialisation and the like. Still, this is a pretty good facsimile of the promised land, where documentation stays in sync with the code and never goes stale!

In our specific case we were working with an app written in Go, so we could potentially have used the go-restful library for our REST services, which has Swagger support built in. However, we were already committed to another library that didn’t have that support, and being new to Swagger we couldn’t be sure whether it was worth switching libraries or wiring up our own Swagger integration. We decided to prototype a Swagger solution by hand-crafting the JSON files in the first instance, to see if we (and our users) liked the results. This showed up a particular challenge that is worth covering here.

You can’t do URL hierarchies with static file serving

A typical REST API will have URL hierarchies such as /users (listing users) and /users/fred-smith (details for a specific user), and indeed the Swagger JSON file URLs consumed by Swagger UI are assumed to be in this sort of hierarchy. Swagger UI consumes Swagger JSON files via HTTP: you give it the URL of the main JSON “resource listing” file, which provides URLs for the subordinate “API declaration” files. If that resource listing is served from /main, it expects the API declarations to be at /main/user, /main/product etc., and this is hardcoded into the way it constructs URLs. Unfortunately, if we want to provide these JSON files by simply serving them via Nginx, straight from disk with no smarts, we’re out of luck, as your average filesystem cannot have both a file “main” and a directory “main” in the same parent directory. You just can’t do it, so you can’t serve up a hierarchy like that from static files.

Obviously you could configure your web server more intricately, mapping individual URLs to individual files to construct the hierarchy. This isn’t appealing, however, especially as Swagger UI itself can be served statically (it’s just static HTML, JS, CSS etc.) and we are simply including our JSON files within its directory structure. Three simple lines of Nginx config should be enough to serve up Swagger UI and our included JSON files:

location /api-docs {
    alias /opt/swagger-ui;
}

The root problem here is that Swagger UI is extremely simplistic about how it interprets paths in the top-level resource listing JSON. It assumes that the paths to the individual API declaration files can simply be concatenated to the resource listing path, as if they are laid out in a pure hierarchy as sub-resources. If the resource listing is at /api-doc.json and it references a path “users.json” then Swagger UI concatenates these and looks for the API declaration at /api-doc.jsonusers.json. This looks especially bad if you have a .json extension and no leading / on the path. By fixing those two problems we get a bit closer but it’s still looking for /api-doc/users and as mentioned above, we can’t have both a file and a directory named “api-doc” in the filesystem so we are stuck. As an aside, losing the file extension is worth doing regardless, as Swagger UI uses the full name as the title for each section of the docs and you really want “users” rather than “users.json” as your heading.

The trick to win the day here is to use a path like “/../users” in the resource listing. Then the concatenated path is /api-doc/../users which is ultimately resolved to just /users. That being the case, we can put our JSON files “api-doc” and “users” in the same directory (even though Swagger likes to consider them as hierarchical) and they will link together correctly. If you do want the API declaration files to be down a level, you could use “/../apis/users” and put them in an “apis” directory one level deeper than the resource listing file. The key here is that we don’t have to have a file and directory with the same name.
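
To make that concrete, a hand-crafted resource listing using the trick might look something like the following. This is a sketch only: the field names follow the Swagger 1.x resource-listing format, and the file and description names are invented for illustration.

    {
        "apiVersion": "1.0",
        "swaggerVersion": "1.2",
        "apis": [
            { "path": "/../users",    "description": "Operations on users" },
            { "path": "/../products", "description": "Operations on products" }
        ]
    }

Served as /api-doc, Swagger UI resolves /api-doc/../users to /users, so the resource listing and the API declarations can sit side by side as plain files in the same directory.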

by Sam Carr on 27/11/13

Optimizing loops in C for higher numerical throughput and for fun

We had one of those typical C-vs-Fortran discussions here at LShift, which prompted me to follow up on it. In this holy war I stand by C, and believe that a lot of the opinions supporting the alleged superiority of Fortran in numerical throughput come from a poor understanding of what can actually be done on the C side. So I’d like to demonstrate a couple of loop optimization techniques that can be used in C in order to maximize numerical throughput. Why loops? Because they often constitute the bulk of computations and have significant potential for utilizing the architecture of modern CPUs. And if your code doesn’t spin in such loops, you should strive to make it do so; I hope this post provides convincing arguments as to why.

I’m using my laptop, which has a SandyBridge CPU (a Core i5 running at 1.7GHz), and GCC 4.8 as the compiler. As the task, I’ve picked the sum of squares of N vectors:

    y[i] = x1[i]^2 + x2[i]^2 + ... + xN[i]^2

It nicely fits the bill: sum of squares is commonplace in numerical computations, it uses both addition and multiplication, which lets us demonstrate the mechanisms available in modern CPUs, and it has the desired structure of:

    for i ...
        for j ...
            do stuff here

I use snippets of slightly simplified C here, and although all the tested code is pure C, I also show chunks of the actual assembly produced by GCC to illustrate the issues as they arise. The assembly snippets are for the SSE target, as GCC tends to produce cleaner code for SSE than for AVX.

OK, down to business: our naive loop summing the squares of N vectors could be:

    for(i = 0; i < VECTOR_LEN; i++)        // mulss   %xmm0, %xmm0
        for(j = 0; j < N; j++)             // addss   %xmm0, %xmm1
            acc[i] += v[j][i] * v[j][i];   // movss   %xmm1, (%rdi)

Disassembly reveals that XXXss instructions are used. This subset of SSE instructions processes one floating-point word per operation, not 4. This code is clearly not taking full advantage of the SIMD units and limits throughput to 1xCPU_CLOCK. Since it logically doesn’t matter which loop is inner and which is outer, we can swap them and the algorithm remains valid. Now we have the “horizontal” loop:

    for(j = 0; j < N; j++)                  // mulps   %xmm0, %xmm0
        for(i = 0; i < VECTOR_LEN; i++)     // addps   %xmm0, %xmm1
            acc[i] += v[j][i] * v[j][i];    // movups  %xmm1, (%rdi, %rax)

Boom! This time GCC vectorised the inner loop, using full-width XXXps SSE instructions. This single change boosts performance to the expected SIMD_WIDTHxCPU_CLOCK mark, as will be shown below. Too bad that GCC cannot do this simple optimization for us automatically, but as far as I remember, ICC can. Moving on, the next logical step is to unroll the calculations “vertically”, which should reduce the number of memory reads and writes. Here is an example of such a manually unrolled loop:

    for(j = 0; j <= N-4; j+=4)                 // movups  (%rdi, %r9), %xmm0
        for(i = 0; i < VECTOR_LEN; i++) {      // mulps   %xmm1, %xmm1
            acc[i] += v[j+0][i] * v[j+0][i];   // addps   %xmm1, %xmm0
            acc[i] += v[j+1][i] * v[j+1][i];   // movups  %xmm0, (%rdi, %r9)   <== redundant write
            acc[i] += v[j+2][i] * v[j+2][i];   // movups  (%rax, %r9), %xmm1
            acc[i] += v[j+3][i] * v[j+3][i];   // mulps   %xmm1, %xmm1
        }                                      // addps   %xmm0, %xmm1
                                               // movups  %xmm1, (%rdi, %r9)   <== redundant write

Here we see the infamous effect of pointer aliasing, so often brought up in C-vs-Fortran discussions. For each line of calculations the compiler produces extra read/write instructions, which defeats the intended purpose of the vertical unrolling. Luckily the solution is trivial: an extra variable in the inner loop makes the compiler produce code that caches the intermediate calculations in a register. Here is the “cached” loop:

    for(j = 0; j <= N-4; j+=4)                 // movups  (%rcx, %r9), %xmm1   <== single reads
        for(i = 0; i < VECTOR_LEN; i++) {      // movups  (%r8, %r9), %xmm0
            float tmp = acc[i];                // mulps   %xmm1, %xmm1         <== bulk calculations
            tmp += v[j+0][i] * v[j+0][i];      // addps   %xmm4, %xmm3
            tmp += v[j+1][i] * v[j+1][i];      // mulps   %xmm0, %xmm0
            tmp += v[j+2][i] * v[j+2][i];      // addps   %xmm3, %xmm2
            tmp += v[j+3][i] * v[j+3][i];      // addps   %xmm2, %xmm1
            acc[i] = tmp;                      // addps   %xmm1, %xmm0
        }                                      // movups  %xmm0, (%rdi, %r9)   <== single write

Now the block of resulting SSE operations is compact and doesn’t have the redundant memory accesses. The last optimization I’d like to introduce further leverages the ability of modern CPUs to parallelize independent streams of operations. To do this we need to break dependency chains: in other words, split the calculations into independent sequences executed on separate registers and execution units. Here is our “final” loop:

    for(j = 0; j <= N-8; j+=8)
        for(i = 0; i < VECTOR_LEN; i++) {
            float tmp1 = acc[i];
            float tmp2 = v[j+0][i] * v[j+0][i];
            float tmp3 = v[j+1][i] * v[j+1][i];
            tmp1      += v[j+2][i] * v[j+2][i];
            tmp2      += v[j+3][i] * v[j+3][i];
            tmp3      += v[j+4][i] * v[j+4][i];
            tmp1      += v[j+5][i] * v[j+5][i];
            tmp2      += v[j+6][i] * v[j+6][i];
            tmp3      += v[j+7][i] * v[j+7][i];
            acc[i] = tmp1 + tmp2 + tmp3;
        }

The C code I used for testing all the above loops is here. To rule out memory bandwidth issues as much as possible, I ran the tests on a set of vectors small enough to fit into the L1 cache. Throughputs for a single core:

                SSE              AVX
naive:       1733.4 MFLOPS    1696.6 MFLOPS    // 1xCPU_CLOCK barrier for scalar instructions
horizontal:  5963.6 MFLOPS    9419.8 MFLOPS    // 4xCPU_CLOCK and 8xCPU_CLOCK for SSE and AVX
unrolled:   11264.8 MFLOPS   11496.6 MFLOPS
cached:     14253.7 MFLOPS   15086.5 MFLOPS
final:      17985.4 MFLOPS   18210.4 MFLOPS    // Both SSE and AVX settle at around 10xCPU_CLOCK

So it seems this midrange laptop CPU could potentially give us some 35 GFLOPS from its two cores, without resorting to anything more than simple changes in pure C.
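
For what it’s worth, here is one way the same loop might be spread across both cores. This is my own sketch, not part of the measured code, and it reaches for OpenMP rather than plain C: each thread takes a disjoint slice of the element range, so there are no races on acc[] and the inner loop still vectorises as before.

    #include <omp.h>   /* compile with -fopenmp; VECTOR_LEN and the data layout as above */

    void sum_squares(float *acc, float (*v)[VECTOR_LEN], int N)
    {
        #pragma omp parallel
        {
            int nt = omp_get_num_threads();
            int id = omp_get_thread_num();
            int lo = id * VECTOR_LEN / nt;          /* each thread owns elements [lo, hi) */
            int hi = (id + 1) * VECTOR_LEN / nt;
            for (int j = 0; j < N; j++)
                for (int i = lo; i < hi; i++)
                    acc[i] += v[j][i] * v[j][i];
        }
    }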

Things to consider:

  • Why, for SSE, did we manage to get a throughput of 10xCPU_CLOCK even though SSE operates on 4-float chunks? The SandyBridge architecture has separate execution units for addition and multiplication, capable of operating fully in parallel; this effectively means SandyBridge can perform as if it had a fused add-mul, upping the theoretical limit in some situations to 8xCPU_CLOCK for SSE and 16xCPU_CLOCK for AVX.
  • Why, then, do we get 10xCPU_CLOCK rather than 8xCPU_CLOCK for SSE? SandyBridge CPUs have the TurboBoost feature, which provides extra headroom under certain circumstances. TurboBoost, however, may be very limited, especially once you properly harness all the cores of your CPU.
  • Why, then, didn’t we get more than 10xCPU_CLOCK, and why did we hit the same wall with AVX? We hit an L1 memory bandwidth bottleneck; further memory access optimizations are needed, and this is where carefully handcrafted assembly code may squeeze even more out of the CPU.
  • How generic and reliable are these optimization techniques? Well, the outcome is a product of code, compiler and CPU, so your mileage may vary. But if you really want a proper kick for your buck, you will have to research and experiment with your own devices.
  • Last but not least, keep in mind that such changes are likely to produce slightly different outputs due to floating-point rounding.

 

by jarek

Small shouldn’t mean primitive

The internet of things seems to be coming any day now, but the state of embedded development seems to be deplorable. Almost everything is written in C or C++. Device drivers are written over and over, once for each RTOS, or worse. When high level languages are available, they seem to be implemented directly on the hardware, rather than on top of an existing RTOS, so there’s another chance to practice writing device drivers. Want a file system, or a network stack? You’ll need to patch one for the kernel of your choice.

This approach is no longer justifiable: the target devices typically have 64K of memory, 256K of memory-mapped flash, and a throughput of around 100 MIPS. It’s time embedded developers stopped playing at writing device drivers, and thought about composition.

Trying to put together an environment for experimenting with microcontrollers is frustrating. For example, I have an STM32F3 Discovery board. The microcontroller itself has a huge array of peripheral interfaces, and the board adds a gyroscope, an accelerometer, and LEDs to light it up like a Christmas tree. It costs £9, which is cheap enough to buy several, in case you break one. I’m a software engineer, not an electrical engineer, so that’s going to happen. There’s 48K of RAM, and 256K of flash. Its sleep mode uses 0.6mA, so if interrupts in your application are rare, you might even use less power than an Arduino.

So, what would be a productive environment?

  1. Device support. I want this to run on whatever board fits my project, and support all the peripherals.
  2. High level. I don’t manage my own memory any more, and I like abstraction.
  3. Interactive. I don’t want to have to compile and install new firmware just to test a bit of wiring I’ve just done.
  4. Interoperable. This is for the internet of things. I’m going to need to implement network protocols.
  5. Composable. I want to add other people’s code to mine, and I don’t mean by forking it.

How do I get it? Well, now the pain begins.

Networking

LwIP is a small-footprint TCP/IP stack written in C. Almost everything mentioned below includes some support for it, so you can plug it in. Using it doesn’t require anything beyond supporting a C binding. Some extra work might be required if you want to provide network drivers in the language of your choice.

Embedded interpreters

These are the closest thing there is to an all-inclusive solution.

Espruino

JavaScript for a microcontroller; more specifically, it’s like Arduino, only in JavaScript. I’ve started with the best thing going, I think. It’s got my list covered except for device support, and even there it supports the F3 Discovery to a considerable extent. However, the Arduino-like programming interface is intrinsically poor: how do I do ADC without polling, for example (the F3 Discovery integrates the timer, ADC and DMA in hardware)? JavaScript also means no actual concurrency: you get one event at a time, and there’s no way to prioritize them.

Espruino doesn’t have much in the way of architecture documentation. There’s no description of the interpreter, so unless you read the code, you can’t know anything about the competence of the authors, or the sophistication of the interpreter. I’d make a guess that it’s based on Tiny-JS. There’s no intermediate code form, which guarantees your code takes up a lot of RAM.

eLua

Lua has coroutines, which is a big step up from being completely event oriented. eLua can execute byte code straight out of flash, rather than having to use RAM to hold programs, which is a pretty useful optimisation in this context. Lua also has a great C binding.

eLua runs on my device, but only to the extent that it can run a REPL on the serial port. No other peripherals are supported. eLua’s concept is to be Lua as far down as possible. From the point of view of making eLua as easy to improve as possible, this is a good design decision. It’s a long game though, and I don’t see anything in its roadmap that suggests it’s going to tackle the issue of memory management during interrupts, or compilation, when higher performance is needed. I think that’s going to mean device drivers keep getting written in C. Given that, hitching a ride on an RTOS which has momentum in this area – e.g. Chibios – seems like a pragmatic way forward, but that seems to get rejected on the mailing list.

That’s not to say that eLua isn’t the right starting point to tackle these problems: it may well be.

RTOS

Nuttx

This RTOS offers Posix support and DLLs. This would mean, for example, that it’s reasonably easy to compile various interpreters, and lots of open source software. It has limited support for the F3 Discovery board – basically no more than eLua. I could choose the F4 Discovery board to solve this problem. There’s an open source project to run (full) Lua under NuttX, which I hope to try out.

Chibios

This doesn’t offer any sort of standard interface. It does however support a huge array of boards, including support for all the F3 Discovery peripherals. It also seems to get new boards and drivers supported very quickly. For example, Chibios supports the ADC/Timer/DMA feature I mentioned above, and had that support a month or so after the board was released. This is also the only thing I’ve actually run on the board. It’s easy to set up a tool chain and build. The samples are readable, by C standards.

Other

Because Chibios has good support for the boards I have, and because FreeRTOS (for example) appears to have very similar features to Chibios, I haven’t investigated much further in this category.

None of the above?

Scheme might be a good choice as an embedded interpreter. I could build a system on top of Chibios. There are at least two compilers I could choose between: Chicken and Stalin. Chicken has a REPL, so it appeals more. It lacks a really good GC, but I guess that might not be such a big problem in the short term. Chicken’s first generation is on the stack, and I can see how that might make it possible to write interrupt handlers directly in Scheme, although if the stack ran out, the interrupt handler would fail.

I must admit I’d assumed that a TCP/IP stack written in Scheme was available, but I haven’t found one. Or a file system, for that matter. Still, there’s LwIP, and writing a file system in Scheme isn’t so daunting. I’m not sure I’ll convince a lot of people to write electricity meter firmware in Scheme, but I could always add interpreters for other languages.

Conclusion

I guess I hinted at the top that there’s no clear conclusion. Suggestions?

by david on 11/11/13

Three years on…

It’s almost exactly three years since I started at LShift. I’d like to take a moment to look back at what I’ve done.

Read more…

by Frank Shearar on 01/11/13

Tell don’t ask with Sinatra handlers

In Bigwig, in order to keep our code neat and well factored, we’ve tried to adhere to the principle of tell, don’t ask as much as we can. However, one place this can be difficult is within a handler for an HTTP request (we’re using Sinatra for that).

Read more…

by Ceri Storey on 31/10/13

Fudging generics in Go with AST rewriting

One possible workaround for a lack of generics is code generation. Let’s look at Go’s AST manipulation to make a Maybe Int out of a Maybe a.

Read more…

by Frank Shearar on 30/10/13

Going m(on)ad with parser combinators

It’s about time someone started talking about Go again around here, so I picked up the old editor, and (painlessly!) installed Go. Maybe 5 minutes later I had the world’s fastest compiler, a test framework, a coverage analyzer and a bunch of stuff besides available on my machine. But what to do? Hello World is so done, so I thought I’d grab my copy of Hutton & Meijer and implement a basic parser combinator.

Read more…

by Frank Shearar on 25/10/13

Zabbix security incidents

Someone discovered a vulnerability in Zabbix recently, and there’s this lovely, detailed description of an exploit based on it over at Corelan Team. It’s lovely because it contains all the information I need to tell whether my site is vulnerable, and to what extent.

There’s also a really useless advisory on Packet Storm Security. Why is it useless? Because at the bottom, there’s a section called Workaround, which reads ‘No workaround available’. This is really unfair to Zabbix:

Zabbix offers a mode called ‘active agent’, in which, rather than the server querying the agent, the agent submits information to the server periodically. This means it’s code on the monitored host that determines what information is passed to the server, and this eliminates the logical possibility of an escalation attack onto monitored hosts.

The existence of this mode is why I consider Zabbix in security sensitive applications. I pretty much assumed SQL injection attacks existed in Zabbix, because the API is written in PHP. Hence I wouldn’t consider using passive mode. I was a bit disappointed to find the guest account is enabled by default, but the point is, I know that Zabbix being compromised won’t result in a data protection incident.

So in short, the workaround is to disable passive agents: in your /etc/zabbix/zabbix_agentd.conf, set DisablePassive=1. But that’s what you were doing anyway, right? Zabbix deserve some criticism for providing a way of configuring their product that is not reliably secure, but I don’t think it’s too much to expect security researchers to have some awareness of the architecture of the products they publish security advisories about.

I should also point out that you could equally choose collectd and Graphite to get the same result. This has the added advantage that pushing from the monitored host is the only way it works, so there won’t be any irrelevant security advisories to explain to your clients.

I don’t read either of the above sites regularly, so I don’t know if this single data point reflects the overall quality of either.

by david on 18/10/13

CPU cache collisions in the context of performance

This article discusses some potential performance issues caused by CPU cache collisions.

In normal scenarios cache collisions don’t pose a problem; it is usually only in specific, high-speed applications that they incur noticeable performance penalties, and as such the things described here should be considered a “last mile” effort. As an example I will use my laptop’s CPU, an Intel Core i5 at 1.7GHz, which has a 32kB 8-way L1 data cache per core.

  • CPUs have caches organized in cachelines. For Intel and AMD, cachelines are 64 bytes long.
    When the CPU needs to reach a byte located at memory address 100, the whole chunk from
    addresses 64-127 is pulled into the cache. Since my example CPU has a 32kB L1 data cache
    per core, this means 512 such cachelines. The 64-byte size also means that the six
    least significant bits of an address index the byte within the cacheline:

    address bits:    |       0 - 5      |       6 - ...     |
                     | cacheline offset |
    
  • Cachelines are organized in buckets. “8-way” means that each bucket holds 8 cachelines.
    Therefore my CPU’s L1 data cache has its 512 cachelines kept in 64 buckets. In order to address
    those 64 buckets, the next 6 bits of the address word are used; full address resolution within
    this L1 cache goes as follows:

    address bits:    |       0 - 5      |      6 - 11     |                12 - ...             |
                     | cacheline offset | bucket selector | cacheline identifier within bucket  |
    
  • The crucial thing to understand here is that, for this CPU, data separated by N x 4096 bytes
    (i.e. addresses identical in their lowest 12 bits) will always end up in the same bucket. So
    many data chunks spaced N x 4096 bytes apart and processed in parallel can cause excessive
    evictions of cachelines from the buckets, thereby defeating the benefits of the L1 cache (see
    the sketch just after this list).
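
To make the address arithmetic concrete, here is a small illustrative sketch (mine, not from the original post) of how such a 64-bucket, 64-byte-cacheline L1 cache decomposes an address. Note that all three example addresses, spaced 4096 bytes apart, land in the same bucket:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* three addresses spaced by 4096 bytes */
        uintptr_t addr[] = { 0x100000, 0x100000 + 4096, 0x100000 + 2 * 4096 };
        for (int i = 0; i < 3; i++) {
            unsigned offset = addr[i] & 63;         /* bits 0-5: byte within the cacheline */
            unsigned bucket = (addr[i] >> 6) & 63;  /* bits 6-11: which of the 64 buckets  */
            printf("addr %#lx -> bucket %u, offset %u\n",
                   (unsigned long)addr[i], bucket, offset);
        }
        return 0;
    }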

To test the performance degradation I wrote a test C program (full C source here) that generates a number of vectors of pseudo-random integers, sums them up in a typical parallel-optimized way, and estimates the resulting speed. The program takes a couple of parameters from the command line so that various CPUs and scenarios can be tested. Here are the results of three test runs on my example CPU:

  1. 100000 iterations, 30 vectors, 1000 integers each, aligned to 1010 integers = 2396 MOP/s
  2. 100000 iterations, 30 vectors, 1000 integers each, aligned to 1024 integers = 890 MOP/s
  3. 100000 iterations, 30 vectors, 1000 integers each, aligned to 1030 integers = 2415 MOP/s

In this CPU the L1 cache has 4 cycles of latency and the L2 cache 12 cycles, hence the performance drop to almost a third when the alignment hit the N x 4096 condition: the CPU pretty much fell back from L1 to L2. While this is a synthetic example and real-life applications may not be affected this much, I’ve seen applications lose 30-40% to this single factor.

Parting remarks:

  • You may need to take into consideration the structure of the cache, not only its size; as in
    this case, even data chunked into pieces small enough to fit into L1 can still fail to take
    full advantage of it.
  • The issue cannot be solved by rewriting the critical-section logic in C/C++/assembly or any
    other “super-fast language of your choice”; this is behaviour dictated by hardware specifics.
  • Developers’ habit of aligning to even boundaries, especially page boundaries, can work
    against you.
  • Padding can help break out of the performance drop (see the sketch after this list).
  • Sometimes the easiest workaround is a platform change, i.e. switching from Intel to AMD or
    the other way around. Keep in mind, though, that this doesn’t really solve the issue; different
    platforms just manifest it for different data layouts.
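
As a sketch of the padding idea (my own illustration, echoing the 1010-vs-1024 alignment results above), giving each row a stride that is not a multiple of 4096 bytes stops the rows from all competing for the same L1 buckets:

    #include <stdlib.h>

    #define N_VECTORS  30
    #define VECTOR_LEN 1000
    /* A "natural" stride of 1024 ints puts rows exactly 4096 bytes apart, so they
       all map to the same L1 buckets (the 890 MOP/s case above). Padding the
       stride to 1030 ints breaks that spacing (the ~2400 MOP/s cases). */
    #define ROW_STRIDE 1030

    int main(void)
    {
        int *data = malloc((size_t)N_VECTORS * ROW_STRIDE * sizeof *data);
        int *rows[N_VECTORS];
        if (!data)
            return 1;
        for (int i = 0; i < N_VECTORS; i++)
            rows[i] = data + (size_t)i * ROW_STRIDE;  /* only the first VECTOR_LEN ints of each row are used */
        /* ... fill and sum rows[j][i] as in the test program ... */
        free(data);
        return 0;
    }
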
by jarek on 08/10/13
