technology from back to front

Blog: latest posts

Small shouldn’t mean primitive

The internet of things seems to be coming any day now, but the state of embedded development seems to be deplorable. Almost everything is written in C or C++. Device drivers are written over and over, once for each RTOS, or worse. When high level languages are available, they seem to be implemented directly on the hardware, rather than on top of an existing RTOS, so there’s another chance to practice writing device drivers. Want a file system, or a network stack? You’ll need to patch one for the kernel of your choice.

This approach is no longer justifiable: the target devices typically have 64K of memory and 256K of memory-mapped flash, and have a throughput of 100 MIPS. It’s time embedded developers stopped playing at writing device drivers, and thought about composition.

Trying to put together an environment for experimenting with micro controllers is frustrating. For example, I have an STM32F3 Discovery board. The micro controller itself has a huge array of peripheral interfaces, and the board adds a gyroscope and accelerometer, and LEDs to light it up like a Christmas tree. It costs £9, which is cheap enough to buy several in case you break one. I’m a software engineer, not an electrical engineer, so that’s going to happen. There’s 48K of RAM, and 256K of flash. Its sleep mode uses 0.6mA, so if interrupts in your application are rare, you might even use less power than an Arduino.

So, what would be a productive environment?

  1. Device support. I want this to run on whatever board fits my project, and support all the peripherals.
  2. High level. I don’t manage my own memory any more, and I like abstraction.
  3. Interactive. I don’t want to have to compile and install new firmware just to test a bit of wiring I’ve just done.
  4. Interoperable. This is for the internet of things. I’m going to need to implement network protocols.
  5. Composable. I want to add other people’s code to mine, and I don’t mean by forking it.

How do I get it? Well, now the pain begins.


LwIP is a small footprint TCP/IP stack written in C. Almost everything mentioned below includes some support, so you can plug it in. Using it doesn’t require anything beyond supporting C binding. Some extra work might be required if you want to provide network drivers in the language of your choice.

Embedded interpreters

These are the closest thing there is to an all inclusive solution.


Espruino is Javascript for a micro controller, and more specifically, like Arduino, only in Javascript. I’ve started with the best thing going, I think. It’s got my list covered but for the device support, and to a considerable extent, supports the F3 Discovery. However, the Arduino-like programming interface is intrinsically poor: how do I do ADC without polling, for example (the F3 Discovery integrates the timer, ADC and DMA in hardware)? Javascript means no actual concurrency, as well: you get one event at a time, and there’s no way to prioritize them.

Espruino doesn’t have much in the way of architecture documentation. There’s no description of the interpreter, so unless you read the code, you can’t know anything about the competence of the authors, or the sophistication of the interpreter. I’d make a guess that it’s based on Tiny-JS. There’s no intermediate code form, which guarantees your code takes up a lot of RAM.


Lua has co-routines, which is a big step up on being completely event oriented. eLua can execute byte code straight out of flash – rather than having to use RAM to hold programs, which is a pretty useful optimisation in this context. Lua also has a great C binding.

eLua runs on my device, but only to the extent that it can run a REPL on the serial port. No other peripherals are supported. eLua’s concept is to be Lua as far down as possible. From the point of view of making eLua as easy to improve as possible, this is a good design decision. It’s a long game though, and I don’t see anything in its roadmap that suggests it’s going to tackle the issue of memory management during interrupts, or compilation, when higher performance is needed. I think that’s going to mean device drivers keep getting written in C. Given that, hitching a ride on an RTOS which has momentum in this area – e.g. Chibios – seems like a pragmatic way forward, but seems to get rejected on the mailing list.

That’s not to say that eLua isn’t the right starting point to tackle these problems: it may well be.



NuttX is an RTOS that offers Posix support and DLLs. This would mean, for example, that it’s reasonably easy to compile various interpreters, and lots of open source software. It has limited support for the F3 Discovery board – basically no more than eLua. I could choose the F4 Discovery board to solve this problem. There’s an open source project to run (full) Lua under NuttX, which I hope to try out.


Chibios doesn’t offer any sort of standard interface. It does however support a huge array of boards, including support for all the F3 Discovery peripherals. It also seems to get new boards and drivers supported very quickly. For example, Chibios supports the ADC/Timer/DMA feature I mentioned above, and had that support a month or so after the board was released. This is also the only thing I’ve actually run on the board. It’s easy to set up a tool chain and build. The samples are readable, by C standards.


Because Chibios has good support for the boards I have, and because FreeRTOS (for example) appears to have very similar features to Chibios, I haven’t investigated much further in this category.

None of the above?

Scheme might be a good choice as an embedded interpreter. I could build a system on top of Chibios. There are at least two compilers I could choose between: Chicken and Stalin. Chicken has a REPL, so it appeals more. It lacks really good GC, but I guess that might not be such a big problem in the short term. Chicken’s first generation is on the stack, and I can see how that might make it possible to write interrupt handlers directly in Scheme, although if the stack ran out, the interrupt handler would fail.

I must admit, I’d assumed that a TCP/IP stack written in Scheme was available, but I haven’t found one. Or a file system, for that matter. Still, there’s LwIP, and writing a file system in Scheme isn’t so daunting. I’m not sure I’ll be convincing a lot of people to write electricity meter firmware in Scheme, but I could always add interpreters for other languages.


I guess I hinted at the top that there’s no clear conclusion. Suggestions?


Three years on…

It’s nearly exactly three years since I started at LShift. I’d like to take a moment and look back at what I’ve done.

Frank Shearar

Tell don’t ask with Sinatra handlers

In Bigwig, in order to keep our code neat and well factored, we’ve tried to adhere to the principle of tell, don’t ask as much as we can. However, one place this can be difficult is within a handler for an HTTP request (we’re using Sinatra for that).

Ceri Storey

Fudging generics in Go with AST rewriting

One possible workaround for a lack of generics is code generation. Let’s look at Go’s AST manipulation to make a Maybe Int out of a Maybe a.

Frank Shearar

Going m(on)ad with parser combinators

It’s about time someone started talking about Go again around here, so I picked up the old editor, and (painlessly!) installed Go. Maybe 5 minutes later I had the world’s fastest compiler, a test framework, a coverage analyzer and a bunch of stuff besides available on my machine. But what to do? Hello World is so done, so I thought I’d grab my copy of Hutton & Meijer and implement a basic parser combinator.

Frank Shearar

Zabbix security incidents

Someone discovered a vulnerability in Zabbix recently, and there’s this lovely, detailed description of an exploit based on it at Corelan Team. It’s lovely because it contains all the information I need to tell if my site is vulnerable, and to what extent.

There’s also a really useless advisory on Packet Storm Security. Why is it useless? Because at the bottom, there’s a section called Workaround, which reads ‘No workaround available’. This is really unfair to Zabbix:

Zabbix offers a mode called ‘active agent’, in which, rather than the server querying the agent, the agent submits information to the server periodically. This means it’s code on the monitored host that determines what information is passed to the server, and this eliminates the logical possibility of an escalation attack onto monitored hosts.

The existence of this mode is why I consider Zabbix in security sensitive applications. I pretty much assumed SQL injection attacks existed in Zabbix, because the API is written in PHP. Hence I wouldn’t consider using passive mode. I was a bit disappointed to find the guest account is enabled by default, but the point is, I know that Zabbix being compromised won’t result in a data protection incident.

So in short, the workaround is to disable passive agents: in your /etc/zabbix/zabbix_agentd.conf, set DisablePassive=1. But that’s what you were doing anyway, right? Zabbix deserve some criticism for providing a way of configuring their product that is not reliably secure, but I don’t think it’s too much to expect security researchers to have some awareness of the architecture of the products they publish security advisories about.

I should also point out that you could equally choose collectd and graphite to get the same result. This has the added advantage that it’s the only way it works, so there won’t be any irrelevant security advisories to explain to your clients.

I don’t read either of the above sites regularly, so I don’t know if this single data point reflects the overall quality of either.


CPU cache collisions in the context of performance

This article discusses some potential performance issues caused by CPU cache collisions.

In normal scenarios cache collisions don’t pose a problem. It is usually only in specific, high speed
applications that they incur noticeable performance penalties, and as such, the things described
here should be considered a “last mile effort”.
As an example, I will use my laptop’s CPU, a 1.7GHz Intel Core i5 that has a 32kB 8-way L1 data cache per core.

  • CPUs have caches organized in cachelines. For Intel and AMD, cachelines are 64 bytes long.
    When the CPU needs to reach a byte located in memory at address 100, the whole chunk at
    addresses 64-127 is pulled into the cache. Since my example CPU has a 32kB L1 data cache
    per core, this means 512 such cachelines. The size of 64 bytes also means that the six
    least significant bits of the address index the byte within the cacheline:

    address bits:    |       0 - 5      |       6 - ...     |
                     | cacheline offset |
  • Cachelines are organized in buckets. “8-way” means that each bucket holds 8 cachelines.
    Therefore my CPU’s L1 data cache has 512 cachelines kept in 64 buckets. To address those 64 buckets,
    the next 6 bits of the address word are used, so full address resolution within this L1 cache goes as follows:

    address bits:    |       0 - 5      |      6 - 11     |                12 - ...             |
                     | cacheline offset | bucket selector | cacheline identifier within bucket  |
  • The crucial point is that, for this CPU, data separated by N x 4096 bytes
    (that is, addresses whose lowest 12 bits are identical) will always end up in the same bucket.
    So many data chunks spaced N x 4096 bytes apart and processed in parallel can cause excessive
    evictions of cachelines from buckets, thereby defeating the benefits of the L1 cache.
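
The address breakdown above can be sketched in C. This is only an illustration, assuming the geometry of my example CPU (64-byte cachelines, 8 ways, 32kB in total); the names line_offset, bucket_index and line_tag are made up for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed geometry of the example L1 data cache: 32kB, 8-way, 64-byte cachelines. */
enum {
    LINE_BYTES  = 64,                                /* bytes per cacheline      */
    WAYS        = 8,                                 /* cachelines per bucket    */
    CACHE_BYTES = 32 * 1024,                         /* total L1 data cache size */
    BUCKETS     = CACHE_BYTES / (LINE_BYTES * WAYS)  /* number of buckets        */
};

static_assert(BUCKETS == 64, "example geometry should give 64 buckets");

/* Bits 0-5: byte offset within the cacheline. */
static unsigned line_offset(uintptr_t addr)
{
    return (unsigned)(addr % LINE_BYTES);
}

/* Bits 6-11: the bucket the cacheline lands in. */
static unsigned bucket_index(uintptr_t addr)
{
    return (unsigned)((addr / LINE_BYTES) % BUCKETS);
}

/* Bits 12 and up: identifies the cacheline within its bucket. */
static uintptr_t line_tag(uintptr_t addr)
{
    return addr / ((uintptr_t)LINE_BYTES * BUCKETS);
}
```

For the address 100 used earlier, this gives offset 36 in bucket 1. The address 100 + 4096 lands in the same bucket, and only the identifier differs.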

To test the performance degradation I wrote a test C program
(full C source here)
that generates a number of vectors of pseudo random integers, sums them up in a typical
parallel-optimized way, and estimates the resulting speed. The program takes a couple
of parameters from the command line so that various CPUs and scenarios can be tested.
Here are the results of three test runs on my example CPU:

  1. 100000 iterations, 30 vectors, 1000 integers each, aligned to 1010 integers = 2396 MOP/s
  2. 100000 iterations, 30 vectors, 1000 integers each, aligned to 1024 integers = 890 MOP/s
  3. 100000 iterations, 30 vectors, 1000 integers each, aligned to 1030 integers = 2415 MOP/s

In this CPU, the L1 cache has 4 cycles of latency and the L2 cache has 12 cycles of latency, hence
the performance drop to almost 1/3 when the alignment hit the N x 4096 condition: the CPU pretty much fell
back from L1 to L2. This is a synthetic example, and real life applications may not be affected
this much, but I’ve seen applications lose 30-40% to this single factor.
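
One way to see why the 1024-integer alignment stands out is to count how many of the 30 vectors start in the same bucket. The sketch below is not part of the test program. It assumes 4-byte integers, vectors laid out one after another from a base at address 0, and the cache geometry of my example CPU; max_bucket_occupancy is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>

#define LINE_BYTES 64u   /* cacheline size of the example CPU       */
#define BUCKETS    64u   /* 32kB / (64 bytes x 8 ways)              */
#define ELEM_BYTES 4u    /* assumed size of one integer, in bytes   */

/* Returns the largest number of vectors whose first element falls into
 * the same bucket, for vectors spaced stride_elems integers apart. */
static unsigned max_bucket_occupancy(unsigned vectors, size_t stride_elems)
{
    unsigned count[BUCKETS] = {0};
    unsigned worst = 0;

    assert(stride_elems > 0);
    for (unsigned v = 0; v < vectors; v++) {
        /* Start address of vector v, with the base assumed to be 0. */
        size_t addr = (size_t)v * stride_elems * ELEM_BYTES;
        unsigned bucket = (unsigned)((addr / LINE_BYTES) % BUCKETS);

        if (++count[bucket] > worst)
            worst = count[bucket];
    }
    return worst;
}
```

Under these assumptions the function returns 2 for a stride of 1010 integers, 30 for 1024 and 3 for 1030. Only the 1024 case asks a single bucket, which holds 8 cachelines, to serve all 30 vectors at once.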

Parting remarks:

  • You may need to take into consideration the structure of the cache, not only its size: as in this case,
    even data chunked into pieces small enough to fit into L1 can still fail to take full advantage of it.
  • The issue cannot be solved by rewriting critical section logic in C/C++/assembly or any other
    “super-fast language of your choice”; this is behavior dictated by hardware specifics.
  • Developers’ habit of aligning to even boundaries, especially to page boundaries,
    can work against you.
  • Padding can help break out of the performance drop.
  • Sometimes, the easiest workaround is a platform change, e.g. switching from Intel to AMD
    or the other way. Keep in mind, though, that it doesn’t really solve the issue: different platforms
    just manifest it for different data layouts.
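
As an illustration of the padding remark, one possible rule (an assumption of this sketch, not something taken from the test program) is to round each vector’s stride up to a whole number of cachelines and then make that number odd. An odd number shares no factor with the 64 buckets, so consecutive vectors step through all the buckets before any bucket repeats. The 64-byte cacheline matches my example CPU; padded_stride_bytes is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

#define LINE_BYTES 64u   /* cacheline size of the example CPU */

/* Returns a stride in bytes, at least payload_bytes long, that covers an
 * odd number of whole cachelines. */
static size_t padded_stride_bytes(size_t payload_bytes)
{
    size_t lines;

    assert(payload_bytes > 0);
    /* Round up to whole cachelines. */
    lines = (payload_bytes + LINE_BYTES - 1) / LINE_BYTES;
    /* An even line count can share a factor with the bucket count, so bump it. */
    if (lines % 2 == 0)
        lines += 1;
    return lines * LINE_BYTES;
}
```

For example, a 4096-byte vector gets a stride of 4160 bytes, while a 4000-byte vector gets 4032 bytes. The cost is at most a little under two cachelines of wasted space per vector.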

Why I support the US Government making a cryptography standard weaker

Documents leaked by Edward Snowden last month reveal a $250M program by the NSA known as Operation BULLRUN, to insert vulnerabilities into encryption systems and weaken cryptography standards. It now seems nearly certain that the NIST-certified random number generator Dual_EC_DRBG, adopted as the default in RSA Security’s BSAFE toolkit, contains a back door usable only by the NSA which allows them to predict the entire future output of the generator given only 32 bytes.

So it’s not the easiest time for NIST to suggest they should make a cryptography standard weaker than it was originally proposed. Nevertheless, I support them in this and I hope they go ahead with it.

Paul Crowley

Programming as a social activity

I realised tonight something that I’d forgotten. We’re usually so busy knocking out code to fulfil our timebox commitments that it’s perhaps easy to forget something very important: to have fun.

I went to the local Smalltalk user group tonight where Jason Ayers gave a talk on simplicity: do our tools help us make simple code? For a change, there was a relative dearth of laptops in the room (and it was a rather full room – nice!) so we “triple programmed”, tasked with implementing Conway’s Game of Life.


I think I’d forgotten that programming can be fun, and not just fun in an amuse-yourself-in-a-corner-on-your-lonesome kind of way, but fun in a way where you meet new people under the guise of performing some shared task. So if there’s a local programming group near you, why not swing by. You might meet some interesting folk. And if there isn’t such a group, maybe start one? It might be fun!

Frank Shearar

Changing the Primary Key Type in Ruby on Rails Models

Ruby on Rails (RoR) likes to emphasise the concept of convention over configuration. Therefore, it seeks to minimise the amount of configuration
by resorting to some defaults. These defaults are sometimes not desirable, and RoR does not always make it easy to deviate from them.

Yong Wen Chua






2000-14 LShift Ltd, 1st Floor, Hoxton Point, 6 Rufus Street, London, N1 6PE, UK. +44 (0)20 7729 7060