Google Protocol Buffers

By Paul Crowley on July 9, 2008

Google’s latest open-source initiative: Protocol Buffers. Like XML, this is a generic language-neutral hierarchical container format, but unlike XML it is a binary format designed for low size overhead and fast reading and writing; in this it resembles some other initiatives to the same end, including Facebook’s Thrift. I am keen to see a good binary alternative to XML become popular, and there are several clever and attractive aspects to PB’s design, but its wire format has some real shortcomings, and I don’t think it’s hard to design a better alternative.

A PB file consists of a sequence of numbered fields; applications that want to use PB write protocol description (“.proto”) files describing what these fields mean. One of the goals of PB is that old consumers should be able to read the data produced by newer producers, by simply skipping unexpected fields. This means in turn that consumers have to be able to find the end of a field without needing any out-of-band metadata about the field. Take a look at the wire format documentation to see how PB achieves that – seven-bit integer encoding, ZigZag, wire types, and embedded messages. Each of these is a sacrifice in both cleanliness and performance.

The “varint” encoding, which writes integers in seven-bit groups with the high bit as a stop bit, is inefficient both for reading and writing; you can’t just send the bytes you have straight to the wire, you have to measure them out seven bits at a time with shifts and check after each group whether you’ve finished, and each check can cause a pipeline stall. ZigZag is even worse – it shifts the whole number up in order to put the sign bit in the least significant bit, so there’s more work to do once the number is assembled. Wire types are a nasty wart; they’re not enough information to actually know what is being sent, and they are there only so that the length of what is being sent can be inferred where a field is unknown (and to work around the inefficiency of the “varint” encoding by providing special fixed-length integer encodings). And the only non-deprecated way of creating hierarchy is to embed a message as a string – which means that every data structure below the top level must be completely marshalled in memory before it is sent, so that its length can be measured.
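To make the cost concrete, here’s roughly what a varint reader and ZigZag decoder have to do (a minimal sketch of the documented algorithms, not Google’s actual code):

def readVarint(stream):
    # PB varint: the payload is sent in seven-bit groups, least
    # significant group first; the high bit of each byte says whether
    # another byte follows.
    result = shift = 0
    while True:
        b = stream.read(1)[0]
        result |= (b & 0x7f) << shift
        if not (b & 0x80):  # stop bit clear: this was the last byte
            return result
        shift += 7

def zigzagDecode(n):
    # ZigZag maps 0, 1, 2, 3, ... back to 0, -1, 1, -2, ...
    return (n >> 1) ^ -(n & 1)

Note the data-dependent branch inside the loop, taken once per byte.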

So here’s my alternative proposal – same logical structure, different wire format. We encode the data as a stream of “atoms”. An atom is either a hunk of binary data, or a “special symbol”. Here’s how we decode an atom:

def toUint(data):
    # Interpret bytes as a little-endian unsigned integer.
    return int.from_bytes(data, 'little')

def specialSymbol(n):
    # Tag one of the sixteen reserved symbols (field ID 0).
    return ('special', n)

def readIntOrLen(lim, stream, i):
    # Values below lim are carried in the nybble itself; values
    # lim..15 say how many bytes of the stream hold the real value.
    if i < lim:
        return i
    return toUint(stream.read(1 + i - lim))
    # Could force eg one-byte encoding to cover range [lim, lim+255]
    # instead of [0, 255] here but is it worth it?

def readAtom(stream):
    firstByte = toUint(stream.read(1))
    fieldId = readIntOrLen(12, stream, firstByte >> 4)    # high nybble
    if fieldId == 0:
        return specialSymbol(firstByte & 0x0f)            # field ID 0: symbol
    dataLen = readIntOrLen(8, stream, firstByte & 0x0f)   # low nybble
    return fieldId, stream.read(dataLen)

So we make the head byte do as much work as possible; it has one nybble for the field ID, one nybble for the data length, and no “wire type”. If a nybble is too small to hold its value, we overflow into the rest of the data stream, using as few bytes as possible and encoding in a natural 8-bit little-endian format. For structures that have fewer than 12 members, this adds only one byte of overhead on any field which is fewer than 8 bytes long. Right away, that’s varints, ZigZag, and wire types dispensed with. We’ve reserved fieldId 0 for special symbols; we can encode 16 different symbols.
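For the writer’s side, the encoder just inverts readAtom. This is a sketch of mine rather than part of the proposal; writeIntOrLen and writeAtom are names I’m inventing here, and values too big for the overflow bytes available (4 for field IDs, 8 for lengths) aren’t handled:

def writeIntOrLen(lim, value):
    # Inverse of readIntOrLen: return (nybble, overflow bytes), using
    # the nybble alone when the value fits below lim.
    if value < lim:
        return value, b''
    nbytes = max(1, (value.bit_length() + 7) // 8)
    return lim + nbytes - 1, value.to_bytes(nbytes, 'little')

def writeAtom(stream, fieldId, data):
    # fieldId must be >= 1; 0 is reserved for special symbols.
    idNyb, idOverflow = writeIntOrLen(12, fieldId)
    lenNyb, lenOverflow = writeIntOrLen(8, len(data))
    stream.write(bytes([(idNyb << 4) | lenNyb]))
    stream.write(idOverflow + lenOverflow + data)

So writeAtom(out, 3, b'hello') emits the single head byte 0x35 followed by the five data bytes.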

To avoid the pre-marshalling problem, we re-introduce the grouping symbols that Google deprecated. A “start group” symbol should be followed by a group ID; this will normally be written in the following byte, but values above 251 will take up more bytes: in other words we read the group ID with readIntOrLen(252, stream, toUint(stream.read(1))). This is then paired with an “end group” symbol. Obviously these symbols act like brackets, as in SPKI, and can be nested. So you can start writing a nested data structure just by emitting the start symbol, marshalling the structure directly to the output stream, and then emitting the end symbol; no pre-marshalling needed.
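In sketch form (the symbol numbers are placeholders of mine; nothing above assigns any):

START_GROUP, END_GROUP = 1, 2   # placeholder symbol numbers

def writeSpecial(stream, symbol):
    # A special symbol is a one-byte atom: field ID nybble 0, symbol
    # number in the low nybble.
    stream.write(bytes([symbol & 0x0f]))

def writeGroup(stream, groupId, marshalBody):
    writeSpecial(stream, START_GROUP)
    if groupId < 252:
        stream.write(bytes([groupId]))
    else:
        # first byte 252..255 announces 1..4 little-endian overflow bytes
        nbytes = (groupId.bit_length() + 7) // 8
        stream.write(bytes([252 + nbytes - 1]) + groupId.to_bytes(nbytes, 'little'))
    marshalBody(stream)   # children go straight to the stream
    writeSpecial(stream, END_GROUP)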

This format fixes the four problems I identified above, but we still have 14 special symbols left, and I see nice things to do with them.

Let specialSymbol(0) be NOP. Note that this is just the same as a zero byte, so runs of zero bytes in between proper data now have no effect. On disk, you can “zero out” bits of data you don’t need without having to move all the other data up to take their place; or you can leave zero blocks in useful places (eg at the head of a file) where you might want to write real data later.
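A reader folds this in easily (a sketch, assuming a buffered stream with peek, plus readAtom and specialSymbol from above):

NOP = specialSymbol(0)   # encodes as a single zero byte

def readAtoms(stream):
    # Yield atoms until EOF, discarding NOPs, so zeroed-out stretches
    # of a file simply vanish from the parse.
    while stream.peek(1):
        atom = readAtom(stream)
        if atom != NOP:
            yield atom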

Let’s have “begin chunked” and “end chunked” symbols. All atoms in between should be hunks of binary data with the same field ID; the symbols mean these should be concatenated to get the final answer. Now you can start writing even an individual binary field before you know how long it’s going to be, useful if you’re generating it on the fly.
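Writing a field of unknown length then looks like this (sketch, reusing writeAtom and writeSpecial from above, with placeholder symbol numbers again):

BEGIN_CHUNKED, END_CHUNKED = 3, 4   # placeholder symbol numbers

def writeChunked(stream, fieldId, blocks):
    # Emit one logical field as several same-ID atoms bracketed by the
    # chunking symbols; the reader concatenates the hunks.
    writeSpecial(stream, BEGIN_CHUNKED)
    for block in blocks:   # eg data being generated on the fly
        writeAtom(stream, fieldId, block)
    writeSpecial(stream, END_CHUNKED)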

Let’s have a “begin metadata” symbol, which should be followed by a group; the group ID can identify what sort of metadata. Parsers must ignore metadata they don’t understand, but must fail if they see special symbols they don’t understand. We can put, for example, the protocol description itself in the metadata, suitably encoded, so that smart clients can use it, eg to display the data meaningfully.

Let’s have a “version” symbol, which is followed by a version byte. If the version is unknown, the parser must fail. We should choose these symbols carefully, so that the first two bytes of the file can be an appropriate magic number not used by other formats.

We can trivially define a “canonical encoding” for signing: it is precisely the shortest encoding for the data given, with fields in sorted order by fieldID.
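Given writeAtom from the sketch above, which already picks the shortest nybble/overflow split, canonical serialization is little more than a sort (again a sketch of mine):

def writeCanonical(stream, fields):
    # fields: (fieldId, data) pairs; emit in sorted order with the
    # shortest encodings - no NOPs, no chunking, no redundant overflow.
    for fieldId, data in sorted(fields):
        writeAtom(stream, fieldId, data)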

This seems to me to be nearly always better than Google’s proposal – are there angles I’ve missed?

Thanks to Brad Fitzpatrick for bringing PB to my attention!

Update: I wasn’t happy with the two-byte overhead for eight-byte fields; here’s an encoding that has a one-byte overhead for fields of 12 bytes or fewer.

def readIntOrLen(lim, stream, i):
    # As before, but overflow sizes go up in powers of two:
    # nybbles lim, lim+1, lim+2 mean 1, 2 and 4 bytes follow.
    if i < lim:
        return i
    return toUint(stream.read(1 << (i - lim)))
    # Could force eg one-byte encoding to cover range [lim, lim+255]
    # instead of [0, 255] here but is it worth it?

def readAtom(stream):
    firstByte = toUint(stream.read(1))
    fieldId = readIntOrLen(13, stream, firstByte >> 4)    # IDs 1-12 inline
    if fieldId == 0:
        return specialSymbol(firstByte & 0x0f)
    dataLen = readIntOrLen(13, stream, firstByte & 0x0f)  # lengths 0-12 inline
    return fieldId, stream.read(dataLen)

This limits length descriptors to being 0, 1, 2, or 4 bytes long. Note that this limits individual chunks to 4GB. If you’re sending disk images, use the chunked encoding; the canonical encoding rules will need a special case for this.
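The matching change on the writer’s side (to the hypothetical writeIntOrLen sketched earlier) picks the smallest overflow size that fits:

def writeIntOrLen(lim, value):
    # Inverse of the updated readIntOrLen: nybbles lim, lim+1, lim+2
    # announce 1-, 2- and 4-byte little-endian overflows respectively.
    if value < lim:
        return value, b''
    for k, nbytes in enumerate((1, 2, 4)):
        if value < 1 << (8 * nbytes):
            return lim + k, value.to_bytes(nbytes, 'little')
    raise ValueError('over 4GB: use the chunked encoding')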


15 Comments

  1. vanort says:

    I like your proposal, but I have to say that Google’s ‘proposal’ is much better in one regard — it is actually implemented. Implementations beat proposals.

  2. Kenton Varda says:

“And the only non-deprecated way of creating hierarchy is to embed a message as a string – which means that every data structure below the top level must be completely marshalled in memory before it is sent, so that its length can be measured.”

    This is not accurate. We pre-compute all sizes; this takes much less time than the actual marshalling. This also makes it possible to skip over entire sub-messages without parsing them. This could be used, for example, to implement lazy parsing of sub-messages (something I’m thinking of doing in a future version).

    I agree that varint is not an ideal encoding. Unfortunately we’re not able to change it without breaking compatibility. I think zigzag makes perfect sense as something that works on top of varint. It’s only a couple instructions with no branches.

    Note that ReadVarint32() inlines the first branch. If most of your varints are less than 128, then this branch is predictable and thus mostly-free. Since it is inlined, this prediction applies to each field individually. Also note that the generated parsing routines attempt to predict the next tag by assuming that tags always appear in order. If this prediction works (i.e. if your messages normally fill in all the fields), then you can avoid an unpredictable branch when reading each tag even if many of your tags are two bytes. All this mitigates the problems with varint somewhat.

    All that said, your proposal sounds interesting. I think it would be cool if you implemented it and tested to see how the performance compares. You could add a new message-level option which allows users to choose an alternative wire format (just add the option to the MessageOptions message defined in descriptor.proto), then detect that option in the code generator and generate different code in that case.

  3. Evan M says:

    One thing you’ve missed about protos is that you can display one without the associated .proto file — that is, the wire format contains enough so that you know a given four-byte element is a 32-bit integer rather than a four-letter string. This comes in handy when you’re not sure of the format of a given blob of data but you are sure that it’s a proto. (This sounds like the sort of thing that should never come up in a “properly designed” system, but in practice it’s come up a ton of times. It’s one of those design tradeoffs, much like the protocol-evolution capability in the proto format adds overhead but allows communication between different versions.)

    For example, imagine protos stuffed into memcache — you don’t want the overhead of storing a metadata block alongside each stored object.

  4. Paul Crowley says:

    Thanks for the detailed responses!

    Kenton Varda: This is not accurate. We pre-compute all sizes; this takes much less time than the actual marshalling • OK, yes, that’s better. Do you cache information about the size of sub-elements when you’re measuring the size of an element, or count it again? In other words, does each byte get measured once and marshalled once, or marshalled once but measured n times where n is its distance from the root? What you describe is a big improvement for the sender, but avoiding that overhead altogether still seems desirable to me.

    This could be used, for example, to implement lazy parsing of sub-messages • are there circumstances where it would be much harder to parse the message once and store indexes into it for the bits you need?

    I think zigzag makes perfect sense as something that works on top of varint. • Agreed, but it’s nice that my alternative does away with the need for zigzag too. Because of my crypto interests I tend to consider the case where your integers are tens or hundreds of bytes long.

    Also note that the generated parsing routines attempt to predict the next tag by assuming that tags always appear in order. • Oh, cute! It sounds like getting a competitor to compete with what you have in speed would be a big job even if it started from a better wire format.

    vanort: I cannot figure out what you’re trying to say. It seems as if you read my proposal as if I were advocating that all PB users switch to it immediately, but I can’t think why you would think that. Or you just think unimplemented ideas shouldn’t be discussed; having worked with a lot of half-baked ideas hardened into implementations, I wish people would discuss their ideas more before implementing them.

    Evan M: the wire format contains enough so that you know a given four-byte element is a 32-bit integer rather than a four-letter string • I thought of that – and it’s a big improvement over something like XDR, which you can’t really infer anything about without the definition – but I don’t think what you say is very much more true of PB. Look at the table under “Message Structure” in the wire format definition – each wire type is hugely overloaded, because it only carries enough information to know how much to skip if you don’t know what this is. So if it’s wire type 0, you can’t tell whether you’ve got a signed or unsigned integer or a bool or an enum; if it’s 1 or 5, you can’t tell whether you’ve got a signed or unsigned integer or a float. Worst of all, if it’s a 2, you can’t tell if it’s a string, or a hunk of binary data, or an entire Protocol Buffers message waiting to be parsed and displayed. I think that places some strong limitations on your proposed tool to display the contents of a mystery PB.

    If you write such a tool against my proposal, you know very slightly less about the atoms that you see; you can’t tell the number/enum hunks from the string/binary hunks, unlike PB. However, you are hugely better off in that the substructures are unambiguously marked out with grouping constructs, so that’s one thing you can always get right and never have to guess. I can see how to work around the ambiguity in my proposal, but working around the string/submessage ambiguity in PB would be much harder, so I think I’m actually ahead on the criterion you discuss.

    Of course, it would be nice to do better. This is the main reason I included a metadata tag – I’d like to be able to provide enough information in the metadata for such a tool to be able to infer the types of everything and display the contents of the PB correctly. Then it would also be able to display field and type names, which would be very useful indeed. For small data structures the overhead of metadata may be too great as you say.

  5. Ade says:

    Hi Paul,
    I know that James Strachan is experimenting with using Protocol Buffers as a replacement for the OpenWire protocol in ActiveMQ. How do you think PB measures up in comparison to the AMQP wire protocol?

  6. Ben Hood says:

    Ade,

PB and AMQP are complementary. AMQP would just provide a carrier for PB-encoded messages, so the receiver would unmarshal the AMQP wire frames and then unmarshal the PB message.

    HTH,

    Ben

  7. Kenton Varda says:

“Do you cache information about the size of sub-elements when you’re measuring the size of an element, or count it again?”

    The sizes are computed once and cached.

“are there circumstances where it would be much harder to parse the message once and store indexes into it for the bits you need?”

    We’ve seen cases where e.g. proxy servers would really prefer not to even scan sub-messages that they don’t care about.

“It sounds like getting a competitor to compete with what you have in speed would be a big job even if it started from a better wire format.”

    I think you could go a long way pretty easily by just having an option that replaces varint with some faster integer encoding. Basically all you’d have to do is add the right methods for the new encoding to Coded{Input,Output}Stream and then go through the code generator and make sure that wherever it currently generates calls to the varint methods, it instead uses your new methods. You could also trivially swap in startgroup/endgroup encoding for sub-messages; all the details are already implemented (despite being deprecated).

    But, yes, I think in the end, with both sides being heavily-optimized, the difference will probably not be that big.

  8. tonyg says:

    @Ben: Yes – or you could use PB to encode AMQP methods.

  9. Aidan Skinner says:

@tonyg AMQP is already pretty spartan; there’s not a great deal of waste in the frame encoding in 0-10 (or 0-8/0-9 for that matter).

  10. tonyg says:

    @aidan: Indeed; AMQP’s various wire-protocols are, currently, much like an ad-hoc ASN.1 PER encoding. A PB encoding of AMQP would be more like a BER encoding, and would bring a measure of extensibility and self-description to the table.

  11. vanort says:

paul: you made reference to PB as “Google’s proposal” at the end of your post. I pointed out that it is not a proposal but an implementation. Your proposal is well thought out, and I like the discussion, but even if your proposal improves speed by 20%, I would still choose to use PB as it is today rather than wait for another implementation. That’s why I say implementations beat proposals.

  12. Paul Crowley says:

    The sizes are computed once and cached. • ah, thanks! Yes, I can see you’ve added a “mutable int _cached_size_” to everything you might want to send. If you wanted to avoid the per-structure memory overhead, another way to do this would be to measure the sizes last-element first and push them onto a stack, then pull sizes off the stack as you need them. This would also make it easier to adapt the framework to be able to serialize and send “foreign” classes which don’t subclass “Message”. I’m sure you already considered this of course.

    We’ve seen cases where e.g. proxy servers would really prefer not to even scan sub-messages that they don’t care about. • I can see that would be of some benefit, but the proxy server still has to scan through and discard all of the sub-message it doesn’t care about, so all it saves is the effort of parsing it all to know where the end will be. That’s not no effort, but it’s a linear speedup for one rare case, whereas eliminating the effort of measuring the message sizes before sending saves work for every producer on every message sent. It’s neat that you have other uses for that precomputed size given that you have it, but it’s a long way from being a big enough advantage by itself to warrant producers putting that effort in every time!

    But, yes, I think in the end, with both sides being heavily-optimized, the difference will probably not be that big. • I’d be curious to know what might be a good benchmark by which to measure things, so I can find out if big speed differences are possible…

  13. Ben Hood says:

    @tony: true that

  14. JohnOH says:

@paul – nice thinking. Varints are the only really ugly part of PB – esp. the fact that you don’t know how big they’ll be makes me uneasy (unless I’m missing something, is there a max length of a marshalled varint?)

@tonyg – AMQP and DIY ASN.1 PER. Very true and nice’n’compact. Also very safe in that all sizes are known in advance in the parse stream.
But we need to sort out the extensibility and user-header type system better for AMQP 1-0; some kind of TLV encoding would be beneficial (kind of DIY PER + BER, but nicer than BER).
Would like to discuss….

    PS: Read up on ASN.1 again tonight and was reminded of why we didn’t use it.

  15. 0xABADC0DA says:

    One problem with this format and pb is getting the frequent fields to be numbered in range 1..12 (this) or 1..15 (pb) so that they are ‘inlined’ into the control byte. Another use of a special symbol is to optimize a particular stream of messages by remapping the field id values so the first 12|15 are the most used fields.

For example, a MAP atom could be followed by a list of fields such as MAP, {1,55}, {2,71}, {3,4002}, {0,0} to mean that in the current message type field ID 1==55, 2==71, 3==4002 in this and later messages of the same type.

    This would almost entirely eliminate the need for the developers to pick good values for the field numbers, ‘reserve’ some of the lower ones in case some common field is added later, or to even care much about field id values. They could just be assigned in order as fields are added. This assumes some kind of persistent stream, although it could also be done on a message basis, to optimize the size for arrays of messages for instance and still maintain random access of whole messages.
