technology from back to front

A standard log file format?

mercurial-server logs every push and pull to the repository. These logs are not just informative, but part of the security it offers; a mercurial revision can trivially be attributed to any user, so if malicious code is added to a repository only the mercurial-server logs carry trustworthy information about the source. It’s therefore especially important that these logs be unambiguously parseable.

For versions up to and including 1.0, log entries looked like this:


2010-12-01_00:13:21 push key=lshift/alexander/mrnoisy changeset=b46d2312ee9ee4338d9a1a8585b2959afa178e6b
2010-12-01_10:10:22 pull key=lshift/paul/arizonabay changeset=b46d2312ee9ee4338d9a1a8585b2959afa178e6b

However, this format assumes that the “values” in the key-value pairs will never contain spaces or newlines, and two changes broke that. The first was that I lifted the rules which forbade most troublesome characters – like space, newline, or equals – from appearing in key names. The second was that I wanted to store more information in each message: in particular the SSH_CONNECTION environment variable, which also generally contains spaces. “I can’t be the first person in this situation”, I thought. “There will be a standard format for logging out there on the Internet, with libraries to write and parse it in more than one language; all I have to do is plug that format into my program and I’ll adhere to the standard.”

If it’s out there, I didn’t find it. It appears that in the forty years that we’ve been writing Unix programs, every single little program that needed to log something has cooked up its own format for doing so. If you want to actually do something with this logged information besides eyeball it, you’re going to have to write a new parser for each one. This time, I thought I’d try to write my logs in a format for which there were already parsers.

XML was obviously not an option; an XML document has to have a closing tag, but a log is constantly being appended to. Besides which, XML is very heavyweight, and poorly suited to non-textual information.

JSON is much lighter and better suited to general-purpose human-readable serialization; sadly, it has the same “closing tag” problem as XML. Making each line its own JSON document would solve the “closing tag” problem, but would then require that you had to hand-roll at least one part of your parser; trivial, but not as trivial as simply pointing a library at the log file and saying “read this”, and noticeably more difficult in for example C where all string handling is painful.

I thought about CSV, which has a quoting convention that allows it to store arbitrary strings; however, those quoting conventions are fairly weird and can include newline characters in the middle of a logical line. Also, CSV isn’t self-documenting the way JSON is, and can’t handle any more complex data structures than a list of strings.

YAML is a very flexible output format with some very nice examples showing how it can be used for logging. Unlike any of the others, YAML is most definitely appendable. However, Python doesn’t ship with a YAML parser, so if mercurial-server were to use that format it would depend on an extra library.

Or, as it turns out, not. YAML 1.2 is a superset of JSON. By using YAML format for the sequencing and JSON for everything else, I could produce a line-oriented log format that could contain everything I could throw at it. For versions of Mercurial from 1.1 on, the log file format looks like this:

– {“timestamp”: “2010-12-20_12:02:13 Z”, “nodes”: ["f800743d1c75ff925effb5ffa52dc974eff3f482"], “ssh_connection”: “192.168.23.3 55904 192.168.23.2 22″, “key”: “lshift/paul/arizonabay”, “op”: “push”}

YAML uses a hyphen at the start of a line to build a sequence, so with a YAML parser the whole thing appears to be a sequence of dictionaries. However, after the hyphen the line is pure JSON, so you can trivially write the format with only a JSON serializer, and parsing it with one is only a little harder. Also, newline characters only appear at the end of log entries, so if the program crashes mid-way through writing a log entry, re-synchronizing is easy.

I can’t claim that this is the log file format to end all log file formats. But it’s simple, self-documenting and comfortably human-readable, easy to write and parseable with existing tools. So if this isn’t the right thing, could someone please work out what is, so that applications like mercurial-server that don’t have particularly unusual needs can just stick to a common standard, rather than having us all hand-roll our own new invention for this common need?

by
Paul Crowley
on
28/02/11
  1. One thing I looked into doing a while ago was using csv for logs, but not normal csv :-)

    basically csv but with delimiter = ‘ ‘ and quotes optional, so your message would look like:

    timestamp “2010-12-2012:02:13 Z” nodes f800743d1c75ff925effb5ffa52dc974eff3f482 sshconnection “192.168.23.3 55904 192.168.23.2 22″ key lshift/paul/arizonabay op push

    the csv module(at least in python) handles this trivially, then you just pair up neighboring values as the k/v pairs. So serializing and de-serializing takes a little more code, but produces a little easier to read output. The nice thing is that simpler log messages look very normal:

    event login user bob ip 1.2.3.4 status success

 
 


1 × seven =

2000-14 LShift Ltd, 1st Floor, Hoxton Point, 6 Rufus Street, London, N1 6PE, UK+44 (0)20 7729 7060   Contact us