technology from back to front

Archive for November, 2013

Documenting an HTTP API with Swagger

I recently tried out Swagger, for documenting an HTTP API. The big win with Swagger is that it provides a sweet HTML UI to browse your API docs and experiment with sending requests and viewing responses, which is a great experience for other developers that are trying to get to grips with your API. Try out their demo of the Swagger UI, for a simple petstore example.

Swagger petstore example - screenshot

Swagger is effectively three things ‘architecturally’:

  • A specification for the JSON files, which contain your API documentation in the abstract
  • Various “generator” tools and libraries (many third party) for producing those JSON files in the first place, either statically or dynamically
  • Swagger UI, a static HTML/JS app which consumes those JSON files and presents the nice UI.

Ideally you use a generator that extracts the API documentation for free, right from your API code. Many such generators are available for various languages and HTTP/REST frameworks, so you may have to do very little to get a free lunch here. However typically you’d expect to use extra annotations to further document the API with the useful human-facing semantic information which isn’t present in the raw code, and further annotations may be required to clarify object serialisation etc. Still, this is a pretty good facsimile of the promised land, where documentation stays in sync with the code and never goes stale!

In our specific case we were working with an app written in Go, so could potentially have used the go-restful library for our REST services, which has Swagger support built into it. However we were already committed to another library that didn’t have that support and being new to Swagger we couldn’t be sure if it was worth switching libraries or wiring up our own swagger integration. We decided to prototype a Swagger solution by hand-crafting the JSON files in the first instance, to see if we (and our users) liked the results. This showed up a particular challenge that is worth covering here.

You can’t do URL hierarchies with static file serving

A typical REST API will have URL hierarchies such as /users (that lists users) /users/fred-smith (details for a specific user) and indeed the Swagger JSON file URLs consumed by Swagger UI are assumed to be in this sort of hierarchy. Swagger UI consumes Swagger JSON files via HTTP: you give it the URL of the main JSON “resource listing” file which provides URLs for the subordinate “API declaration” files. If that resource listing is served from /main, it expects the API declarations to be at /main/user, /main/product etc. and this is hardcoded into the way it constructs URLs. Unfortunately if we want to provide these JSON files by simply serving them via Nginx, straight from disk with no smarts, we’re out of luck as your average filesystem cannot have both a file “main” and a directory “main” in the same parent directory. You just can’t do it, so you can’t serve up a hierarchy like that from static files.

Obviously you could configure your web server more intricately, mapping individual URLs to individual files to construct the hierarchy. This isn’t appealing however, especially as Swagger UI itself can be served statically (it’s just static HTML, JS, CSS etc.) and we are simply including our JSON files within its directory structure. Three simple lines of config in Nginx should be enough, to serve up swagger-ui and our included JSON files:

location /api-docs {
    alias /opt/swagger-ui;

The root problem here is that Swagger UI is extremely simplistic about how it interprets paths in the top-level resource listing JSON. It assumes that the paths to the individual API declaration files can simply be concatenated to the resource listing path, as if they are laid out in a pure hierarchy as sub-resources. If the resource listing is at /api-doc.json and it references a path “users.json” then Swagger UI concatenates these and looks for the API declaration at /api-doc.jsonusers.json. This looks especially bad if you have a .json extension and no leading / on the path. By fixing those two problems we get a bit closer but it’s still looking for /api-doc/users and as mentioned above, we can’t have both a file and a directory named “api-doc” in the filesystem so we are stuck. As an aside, losing the file extension is worth doing regardless, as Swagger UI uses the full name as the title for each section of the docs and you really want “users” rather than “users.json” as your heading.

The trick to win the day here is to use a path like “/../users” in the resource listing. Then the concatenated path is /api-doc/../users which is ultimately resolved to just /users. That being the case, we can put our JSON files “api-doc” and “users” in the same directory (even though Swagger likes to consider them as hierarchical) and they will link together correctly. If you do want the API declaration files to be down a level, you could use “/../apis/users” and put them in an “apis” directory one level deeper than the resource listing file. The key here is that we don’t have to have a file and directory with the same name.

Sam Carr

Optimizing loops in C for higher numerical throughput and for fun

We had here, in LShift, this typical C-vs-Fortran discussion which prompted me to follow up on it. In this holy war I stand by C and believe that a lot of opinions supporting the alleged superiority of Fortran in numerical throughput come from poor understanding of what actually can be done on the C side. So, I’d like to demonstrate here a couple of loop optimization techniques that can be used in C in order to maximize numerical throughput. Why loops? Because they often constitute the bulk of computations and have significant potential for utilizing architecture of modern CPUs. And if your code doesn’t spin in such loops, you should strive to make it do so, I hope, I will provide in this post convincing arguments as to why.

I use my laptop with a SandyBridge CPU, Core i5, running at 1.7GHz, and as a compiler, GCC 4.8. As a task, I’ve picked sum of squares of N vectors:

    y[i] = x1[i]^2 + x2[i]^2 + ... + xN[i]^2

It nicely fits the bill, sum of squares is a commonplace in numerical computations, uses addition and multiplication allowing to demonstrate mechanisms available in modern CPUs, and has the desired structure of:

    for i ...
        for j ...
            do stuff here

I use snippets of a bit simplified C here, and although all tested code is pure C, I also use chunks of actual assembly produced by GCC to illustrate arising issues. The assembly snippets are for SSE target as GCC tends to produce cleaner code for SSE than for AVX.

Ok, down to the business, our naive loop summing squares of N vectors could be:

    for(i = 0; i < VECTOR_LEN; i++)        // mulss   %xmm0, %xmm0
        for(j = 0; j < N; j++)             // addss   %xmm0, %xmm1
            acc[i] += v[j][i] * v[j][i];   // movss   %xmm1, (%rdi)

Disassembly reveals that XXXss instructions are used. This subset of SSE instructions is using one floating point word, not 4, in one go. This code is clearly not taking full advantage of SIMD units and still limits the throughput to 1xCPU_CLOCK. Since logically, it doesn’t matter which loop is inner, which outer, we can swap them while the algorithm remains valid. Now we have the “horizontal” loop:

    for(j = 0; j < N; j++)                  // mulps   %xmm0, %xmm0
        for(i = 0; i < VECTOR_LEN; i++)     // addps   %xmm0, %xmm1
            acc[i] += v[j][i] * v[j][i];    // movups  %xmm1, (%rdi, %rax)

Boom! This time, GCC unrolled the inner loop using full-width XXXps SSE instructions. This single change boosts performance to the expected SIMD_WIDTHxCPU_CLOCK benchmark, as it will be shown below. Too bad that GCC cannot automatically do this simple optimization for us, but as far as I remember, ICC can. Moving on, the next logical step would be to unroll calculations “vertically”, it should reduce number of memory reads/writes. An example of thus manually unrolled loop:

    for(j = 0; j <= N-4; j+=4)                 // movups  (%rdi, %r9), %xmm0
        for(i = 0; i < VECTOR_LEN; i++)        // mulps   %xmm1, %xmm1
            acc[i] += v[j+0][i] * v[j+0][i];   // addps   %xmm1, %xmm0
            acc[i] += v[j+1][i] * v[j+1][i];   // movups  %xmm0, (%rdi, %r9)   <== redundant write
            acc[i] += v[j+2][i] * v[j+2][i];   // movups  (%rax, %r9), %xmm1
            acc[i] += v[j+3][i] * v[j+3][i];   // mulps   %xmm1, %xmm1
                                               // addps   %xmm0, %xmm1
                                               // movups  %xmm1, (%rdi, %r9)   <== redundant write

Here, we see the infamous effect of pointer aliasing every so often brought up in C-vs-Fortran discussions. Compiler, for each line of calculations produces extra read / write instructions which defeat the intended purpose of vertical unrolling. Luckily, the solution is trivial, an extra variable in the inner loop, this makes compiler produce code which caches calculations in a register. Here is the “cached” loop:

    for(j = 0; j <= N-4; j+=4)             // movups  (%rcx, %r9), %xmm1   <== single reads
        for(i = 0; i < VECTOR_LEN; i++)    // movups  (%r8, %r9), %xmm0
            float tmp = acc[i];            // mulps   %xmm1, %xmm1         <== bulk calculations
            tmp += v[j+0][i] * v[j+0][i];  // addps   %xmm4, %xmm3
            tmp += v[j+1][i] * v[j+1][i];  // mulps   %xmm0, %xmm0
            tmp += v[j+2][i] * v[j+2][i];  // addps   %xmm3, %xmm2
            tmp += v[j+3][i] * v[j+3][i];  // addps   %xmm2, %xmm1
            acc[i] = tmp;                  // addps   %xmm1, %xmm0
                                           // movups  %xmm0, (%rdi, %r9)   <== single write

Now the block of resultant SSE operations is compact and doesn’t have the redundant memory accesses. The last optimization I’d like to introduce further leverages the capability of the modern CPU to parallelize independent streams of operations. In order to do this, we need to break dependency chains, in other words, split calculations into independent sequences being executed on separate registers and execution units, here is our “final” loop:

    for(j = 0; j <= N-8; j+=8)
        for(i = 0; i < VECTOR_LEN; i++)
            float tmp1 = acc[i];
            float tmp2 = v[j+0][i] * v[j+0][i];
            float tmp3 = v[j+1][i] * v[j+1][i];
            tmp1      += v[j+2][i] * v[j+2][i];
            tmp2      += v[j+3][i] * v[j+3][i];
            tmp3      += v[j+4][i] * v[j+4][i];
            tmp1      += v[j+5][i] * v[j+5][i];
            tmp2      += v[j+6][i] * v[j+6][i];
            tmp3      += v[j+7][i] * v[j+7][i];
            acc[i] = tmp1 + tmp2 + tmp3;

C code I used for testing all the above loops, is here. To rule out memory bandwidth issues as much as it was possible, I run tests for a bunch of vectors small enough to fit into L1 cache. Throughputs for single core:

                SSE              AVX
naive:       1733.4 MFLOPS    1696.6 MFLOPS    // 1xCPU_CLOCK barrier for scalar instructions
horizontal:  5963.6 MFLOPS    9419.8 MFLOPS    // 4xCPU_CLOCK and 8xCPU_CLOCK for SSE and AVX
unrolled:   11264.8 MFLOPS   11496.6 MFLOPS
cached:     14253.7 MFLOPS   15086.5 MFLOPS
final:      17985.4 MFLOPS   18210.4 MFLOPS    // Both, SSE and AVX settle at around 10xCPU_CLOCK

So it seems, this midrange laptop CPU could potentially get us some 35 GFLOPS out of its two cores without resorting to nothing more than simple changes in pure C.

Things to consider:

  • Why for SSE, we did manage to get throughput of 10xCPU_CLOCK even though SSE operates on 4-floats chunks? SandyBridge architecture has separate execution units for addition and multiplication capable of performing operations in full parallel, this effectively means, SandyBridge can perform as if it had fused add-mul, upping, in some situations the theoretical limit to 8xCPU_CLOCK for SSE and 16xCPU_CLOCK for AVX.
  • Why then, we have 10xCPU_CLOCK not 8xCPU_CLOCK for SSE? SandyBridge CPUs have TurboBoost feature which provides extra headroom under certain circumstances, TurboBoost, however, may be very limited, especially when you properly harness all cores of your CPU.
  • Why then, we didn’t get more than 10xCPU_CLOCK and why we hit the same wall for AVX? We hit an L1 memory bandwidth bottleneck, further memory access optimizations are needed, and this is when carefully handcrafted assembly code may squeeze even more out of CPU.
  • How generic / reliable these optimization techniques are? Well, it’s a product of code, compiler and CPU, so your mileage may vary. But if you really wanna get real kick for your buck, you will have to research and experiment with your own devices.
  • Last but not least, keep in mind that such changes are likely to produce slightly different outputs due to floating-point rounding.



Small shouldn’t mean primitive

The internet of things seems to be coming any day now, but the state of embedded development seems to be deplorable. Almost everything is written in C or C++. Device drivers are written over and over, once for each RTOS, or worse. When high level languages are available, they seem to be implemented directly on the hardware, rather than on top of an existing RTOS, so there’s another chance to practice writing device drivers. Want a file system, or a network stack? You’ll need to patch one for the kernel of your choice.

This approach is no longer justifiable: the target devices typically have 64K of memory, 256K of memory mapped flash , and have throughput of 100Mips. It’s time embedded developers stopped playing at writing device drivers, and thought about composition.

Trying to put together an environment for experimenting with micro controllers is frustrating. For example, I have a STM32F3 Discovery board. The micro controller itself has a huge array of peripheral interfaces, and the board adds a gyroscope and accelerometer, and LEDs to light it up like a Christmas tree. It costs £9, which is cheap enough to buy several: in case you break one. I’m a software engineer, not an electrical engineer, so that’s going to happen. There’s 48K of RAM, and 256K of flash. It’s sleep mode uses 0.6mA, so if interrupts in you application are rare, you might even use less power than an Arduino.

So, what would be a productive environment?

  1. Device support. I want this to run on whatever board fits my project, and support all the peripherals.
  2. High level. I don’t manage my own memory any more, and I like abstraction.
  3. Interactive. I don’t want to have to compile and install new firmware just to test a bit of wiring I’ve just done.
  4. Interoperable. This is for the internet of things. I’m going to need to implement network protocols.
  5. Composable. I want to add other peoples code to mine, and I don’t mean by forking it.

How do I get it? Well, now the pain begins.


LwIP is a small footprint TCP/IP stack written in C. Almost everything mentioned below includes some support, so you can plug it in. Using it doesn’t require anything beyond supporting C binding. Some extra work might be required if you want to provide network drivers in the language of your choice.

Embedded interpreters

These are the closest thing there is to an all inclusive solution.


Javascript for a micro controller, and more specifically, like Arduino, only in Javascript. I’ve started with the best thing going, I think. It’s got my list covered but for the device support, and to a considerable extent, supports the F3 discovery. However, the the Arduino like programming interface is intrinsically poor: how do I do ADC without polling, for example (the F3 discovery integrates the timer, ADC and DMA in hardware). Javascript means no actual concurrency, as well: you get one event at a time, and there’s no way to prioritize them.

Espruino doesn’t have much in the way of architecture documentation. There’s no description of the interpreter, so unless you read the code, you can’t know anything about the competence of the authors, or the sophistication of the interpreter. I’d make a guess that it’s based on Tiny-JS. There’s no intermediate code form, which guarantees your code takes up a lot of RAM.


Lua has co-routines, which is a big step up on being completely event oriented. eLua can execute byte code straight out of flash – rather than having to use RAM to hold programs, which is a pretty useful optimisation in  this context. Lua also has a great C-binding.

Elua runs on my device, but only to the extent that it can run a REPL on the serial port. No other peripherals are supported. eLua’s concept is to be Lua as far down as possible. From the point of view of making eLua as easy to improve as possible, this is good design decision. It’s a long game though, and I don’t see anything in it’s roadmap that suggests it’s going to tackle the issue of memory management during interrupts, or compilation, when higher performance is needed. I think that’s going to mean device drivers keep getting written in C. Given that, hitching a ride on an RTOS which has momentum in this area – E.g. Chibios, seems like a pragmatic way forward, but seems to get rejected on the mailing list.

That’s not to say that eLua isn’t the right starting point to tackle these problems: it may well be.



This RTOS offers Posix support and DLLs. This would mean, for example, that it’s reasonably easy to compile various interpreters, and lot’s of open source source software. It has limited support for the F3 Discovery board – basically no more than eLua. I could choose the F4 discovery board to solve this problem. There’s an open source project to run (full) Lua under NuttX, which I hope to try out.


This doesn’t offer any sort of standard interface. It does however support a huge array of boards, including support for all the F3 Discovery peripherals. It also seems to get new boards and drivers supported very quickly. For example, Chibios supports the ADC/Timer/DMA feature I mentioned above, and had that support a month or so after the board was released. This is also the only thing I’ve actually run on the board. It’s easy to set up a tool chain and build. The samples are readable, by C standards.


Because Chibios has good support for the boards I have, and because FreeRTOS (for example) appears to have very similar features to Chibios, I haven’t investigated much further in this category.

None of the above?

Scheme might be a good choice as an embedded interpreter. I could build a system on top of Chibios. There are at least two compilers I could choose between: Chicken and Stalin. Chicken has a REPL, so it appeals more. It lacks really good GC, but I guess that might not be such a big problem in the short term. Chickens first generation is on the stack, and I can see how that might make it possible to write interrupt handlers directly in scheme, although if the stack ran out, the interrupt handler would fail.

I must admit, I’d assumed that a TCP/IP stack written in scheme was available, but I haven’t found one. Or a file system, for that matter. Still, there’s LwIP, and writing a file system in Scheme isn’t so daunting. I’m not sure I’ll be convincing a lot of people to write electricity meter firmware in scheme, but I could always add interpreters for other languages.


I guess I hinted at the top that’s there’s no clear conclusion. Suggestions?


Three years on…

It’s nearly exactly three years since I started at LShift. I’d like to take a moment and look back at what I’ve done.

Read more…

Frank Shearar



You are currently browsing the LShift Ltd. blog archives for November, 2013.



2000-14 LShift Ltd, 1st Floor, Hoxton Point, 6 Rufus Street, London, N1 6PE, UK+44 (0)20 7729 7060   Contact us