<?xml version="1.0" encoding="UTF-8"?><!-- generator="wordpress/2.0.12-alpha" -->
<rss version="2.0" 
	xmlns:content="http://purl.org/rss/1.0/modules/content/">
<channel>
	<title>Comments on: Google Protocol Buffers</title>
	<link>http://www.lshift.net/blog/2008/07/09/google-protocol-buffers</link>
	<description>What happens at LShift</description>
	<pubDate>Fri, 21 Nov 2008 20:44:11 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.0.12-alpha</generator>

	<item>
		<title>by: 0xABADC0DA</title>
		<link>http://www.lshift.net/blog/2008/07/09/google-protocol-buffers#comment-115630</link>
		<pubDate>Mon, 28 Jul 2008 20:04:40 +0000</pubDate>
		<guid>http://www.lshift.net/blog/2008/07/09/google-protocol-buffers#comment-115630</guid>
					<description>&lt;p&gt;One problem with this format and pb is getting the frequent fields to be numbered in range 1..12 (this) or 1..15 (pb) so that they are 'inlined' into the control byte.  Another use of a special symbol is to optimize a particular stream of messages by remapping the field id values so the first 12&#124;15 are the most used fields.&lt;/p&gt;

&lt;p&gt;For example, a MAP atom could be followed by a list of fields such as MAP, {1,55}, {2,71}, {3,4002}, {0,0} to mean that in the current message type field id 1==55, 2==71, 3==4002 in this and later messages of the same type.&lt;/p&gt;

&lt;p&gt;This would almost entirely eliminate the need for the developers to pick good values for the field numbers, 'reserve' some of the lower ones in case some common field is added later, or to even care much about field id values.  They could just be assigned in order as fields are added.  This assumes some kind of persistent stream, although it could also be done on a message basis, to optimize the size for arrays of messages for instance and still maintain random access of whole messages.&lt;/p&gt;
</description>
		<content:encoded><![CDATA[<p>One problem with this format and pb is getting the frequent fields to be numbered in range 1..12 (this) or 1..15 (pb) so that they are &#8216;inlined&#8217; into the control byte.  Another use of a special symbol is to optimize a particular stream of messages by remapping the field id values so the first 12|15 are the most used fields.</p>
<p>For example, a MAP atom could be followed by a list of fields such as MAP, {1,55}, {2,71}, {3,4002}, {0,0} to mean that in the current message type field id 1==55, 2==71, 3==4002 in this and later messages of the same type.</p>
<p>This would almost entirely eliminate the need for the developers to pick good values for the field numbers, &#8216;reserve&#8217; some of the lower ones in case some common field is added later, or to even care much about field id values.  They could just be assigned in order as fields are added.  This assumes some kind of persistent stream, although it could also be done on a message basis, to optimize the size for arrays of messages for instance and still maintain random access of whole messages.</p>
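<p>A minimal sketch of that remapping idea, in Python. The names <code>parse_map_entries</code> and <code>resolve</code> are hypothetical, and falling back to the id as written when no mapping exists is an assumption of this sketch, not part of either wire format:</p>

```python
def parse_map_entries(pairs):
    """Build a short-id -> real-field-id table from the pairs that follow a MAP atom.

    Reading stops at the {0,0} terminator, as in the example above.
    """
    table = {}
    for short_id, real_id in pairs:
        if (short_id, real_id) == (0, 0):
            break
        table[short_id] = real_id
    return table


def resolve(table, wire_id):
    """Translate an id seen on the wire into the real field id.

    Assumption: ids with no mapping are used as-is.
    """
    return table.get(wire_id, wire_id)


# The example from the comment: MAP, {1,55}, {2,71}, {3,4002}, {0,0}
remap = parse_map_entries([(1, 55), (2, 71), (3, 4002), (0, 0)])
```

<p>With a table like this kept per message type, a reader would call <code>resolve(remap, wire_id)</code> for each field it sees in this and later messages of that type.</p>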
]]></content:encoded>
				</item>
	<item>
		<title>by: JohnOH</title>
		<link>http://www.lshift.net/blog/2008/07/09/google-protocol-buffers#comment-112366</link>
		<pubDate>Fri, 18 Jul 2008 00:25:02 +0000</pubDate>
		<guid>http://www.lshift.net/blog/2008/07/09/google-protocol-buffers#comment-112366</guid>
					<description>&lt;p&gt;@paul - nice thinking.  Varints are the only really ugly part of PB - esp. the fact you don't know how big they'll be makes me uneasy (unless I'm missing something, is there a max length of a marshalled varint?)&lt;/p&gt;

&lt;p&gt;@tonyg - AMQP and DIY ASN1 PER.  Very true and nice'n'compact.  Also very safe in that all sizes are known in advance in the parse stream.
But, we need to sort out the extensibility and user-header type system better for AMQP1-0; some kind of TLV encoding would be beneficial (kind of DIY PER + BER, but nicer than BER).
Would like to discuss....&lt;/p&gt;

&lt;p&gt;PS: Read up on ASN.1 again tonight and was reminded of why we didn't use it.&lt;/p&gt;
</description>
		<content:encoded><![CDATA[<p>@paul - nice thinking.  Varints are the only really ugly part of PB - esp. the fact you don&#8217;t know how big they&#8217;ll be makes me uneasy (unless I&#8217;m missing something, is there a max length of a marshalled varint?)</p>
<p>@tonyg - AMQP and DIY ASN1 PER.  Very true and nice&#8217;n&#8217;compact.  Also very safe in that all sizes are known in advance in the parse stream.<br />
But, we need to sort out the extensibility and user-header type system better for AMQP1-0; some kind of TLV encoding would be beneficial (kind of DIY PER + BER, but nicer than BER).<br />
Would like to discuss&#8230;.</p>
<p>PS: Read up on ASN.1 again tonight and was reminded of why we didn&#8217;t use it.</p>
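<p>On the varint length question, a rough sketch, assuming the usual base-128 layout (7 payload bits per byte, least significant group first, high bit set on every byte except the last). Under that assumption a 64-bit value never needs more than 10 bytes, so a decoder can refuse anything longer. The function names and the <code>max_bytes</code> parameter are illustrative only:</p>

```python
def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more groups follow
        else:
            out.append(byte)         # last group, high bit clear
            return bytes(out)


def decode_varint(data, max_bytes=10):
    """Decode one varint from the start of data.

    Returns (value, bytes_consumed). Raises ValueError if no terminating
    byte appears within max_bytes (10 is enough for 64-bit values).
    """
    result = 0
    for i, byte in enumerate(data[:max_bytes]):
        result |= (byte & 0x7F) << (7 * i)
        if not byte & 0x80:
            return result, i + 1
    raise ValueError("varint too long or truncated")
```

<p>So the size is not known until the value is, but it is bounded: <code>len(encode_varint(2**64 - 1))</code> is 10.</p>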
]]></content:encoded>
				</item>
	<item>
		<title>by: Ben Hood</title>
		<link>http://www.lshift.net/blog/2008/07/09/google-protocol-buffers#comment-111943</link>
		<pubDate>Wed, 16 Jul 2008 11:08:21 +0000</pubDate>
		<guid>http://www.lshift.net/blog/2008/07/09/google-protocol-buffers#comment-111943</guid>
					<description>&lt;p&gt;@tony: true that&lt;/p&gt;
</description>
		<content:encoded><![CDATA[<p>@tony: true that</p>
]]></content:encoded>
				</item>
	<item>
		<title>by: Paul Crowley</title>
		<link>http://www.lshift.net/blog/2008/07/09/google-protocol-buffers#comment-111361</link>
		<pubDate>Mon, 14 Jul 2008 09:18:08 +0000</pubDate>
		<guid>http://www.lshift.net/blog/2008/07/09/google-protocol-buffers#comment-111361</guid>
					<description>&lt;p&gt;&lt;i&gt;The sizes are computed once and cached.&lt;/i&gt; &#8226; ah, thanks!  Yes, I can see you've added a "&lt;code&gt;mutable int _cached_size_&lt;/code&gt;" to everything you might want to send.  If you wanted to avoid the per-structure memory overhead, another way to do this would be to measure the sizes last-element first and push them onto a stack, then pull sizes off the stack as you need them.  This would also make it easier to adapt the framework to be able to serialize and send "foreign" classes which don't subclass "Message".  I'm sure you already considered this of course.&lt;/p&gt;

&lt;p&gt;&lt;i&gt;We’ve seen cases where e.g. proxy servers would really prefer not to even scan sub-messages that they don’t care about.&lt;/i&gt; &#8226; I can see that would be of some benefit, but the proxy server still has to scan through and discard all of the sub-message it doesn't care about, so all it saves is the effort of parsing it all to know where the end will be.  That's not no effort, but it's a linear speedup for one rare case, whereas eliminating the effort of measuring the message sizes before sending saves work for every producer on every message sent.  It's neat that you have other uses for that precomputed size given that you have it, but it's a long way from being a big enough advantage by itself to warrant producers putting that effort in every time!&lt;/p&gt;

&lt;p&gt;&lt;i&gt;But, yes, I think in the end, with both sides being heavily-optimized, the difference will probably not be that big.&lt;/i&gt; &#8226; I'd be curious to know what might be a good benchmark by which to measure things, so I can find out if big speed differences are possible...&lt;/p&gt;
</description>
		<content:encoded><![CDATA[<p><i>The sizes are computed once and cached.</i> &#8226; ah, thanks!  Yes, I can see you&#8217;ve added a &#8220;<code>mutable int _cached_size_</code>&#8221; to everything you might want to send.  If you wanted to avoid the per-structure memory overhead, another way to do this would be to measure the sizes last-element first and push them onto a stack, then pull sizes off the stack as you need them.  This would also make it easier to adapt the framework to be able to serialize and send &#8220;foreign&#8221; classes which don&#8217;t subclass &#8220;Message&#8221;.  I&#8217;m sure you already considered this of course.</p>
<p><i>We’ve seen cases where e.g. proxy servers would really prefer not to even scan sub-messages that they don’t care about.</i> &#8226; I can see that would be of some benefit, but the proxy server still has to scan through and discard all of the sub-message it doesn&#8217;t care about, so all it saves is the effort of parsing it all to know where the end will be.  That&#8217;s not no effort, but it&#8217;s a linear speedup for one rare case, whereas eliminating the effort of measuring the message sizes before sending saves work for every producer on every message sent.  It&#8217;s neat that you have other uses for that precomputed size given that you have it, but it&#8217;s a long way from being a big enough advantage by itself to warrant producers putting that effort in every time!</p>
<p><i>But, yes, I think in the end, with both sides being heavily-optimized, the difference will probably not be that big.</i> &#8226; I&#8217;d be curious to know what might be a good benchmark by which to measure things, so I can find out if big speed differences are possible&#8230;</p>
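<p>To make the size-stack idea concrete, here is a small sketch. It is not how the PB library works; it assumes a toy format in which a message is a Python list of raw <code>bytes</code> leaves and nested lists, and every message (including the outermost) gets a fixed 4-byte big-endian length prefix. All names are hypothetical:</p>

```python
PREFIX = 4  # assumed fixed-width length prefix, in bytes


def measure(msg, stack):
    """Measure msg last element first, pushing each message's body size.

    Because children are visited in reverse and a message is pushed after
    its children, popping later yields sizes in forward writing order.
    """
    total = 0
    for item in reversed(msg):
        if isinstance(item, bytes):
            total += len(item)
        else:
            total += PREFIX + measure(item, stack)
    stack.append(total)
    return total


def write(msg, stack, out):
    """Write msg front to back, pulling each size off the stack as needed."""
    out.extend(stack.pop().to_bytes(PREFIX, "big"))
    for item in msg:
        if isinstance(item, bytes):
            out.extend(item)
        else:
            write(item, stack, out)


def serialize(msg):
    stack = []
    measure(msg, stack)
    out = bytearray()
    write(msg, stack, out)
    assert not stack  # every measured size was consumed exactly once
    return bytes(out)
```

<p>Nothing is stored on the message objects themselves, which is why the same approach could cover &#8220;foreign&#8221; classes: the only per-message state lives on the temporary stack.</p>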
]]></content:encoded>
				</item>
	<item>
		<title>by: vanort</title>
		<link>http://www.lshift.net/blog/2008/07/09/google-protocol-buffers#comment-111044</link>
		<pubDate>Sun, 13 Jul 2008 02:58:14 +0000</pubDate>
		<guid>http://www.lshift.net/blog/2008/07/09/google-protocol-buffers#comment-111044</guid>
					<description>&lt;p&gt;paul:  you made reference to PB as "Google's proposal" at the end of your post.  I pointed out that it is not a proposal but an implementation.  Your proposal is well thought out, and I like the discussion, but even if your proposal improves speed by 20%, I would still choose to use PB as it is today rather than wait for another implementation.  That's why I say implementations beat proposals.&lt;/p&gt;
</description>
		<content:encoded><![CDATA[<p>paul:  you made reference to PB as &#8220;Google&#8217;s proposal&#8221; at the end of your post.  I pointed out that it is not a proposal but an implementation.  Your proposal is well thought out, and I like the discussion, but even if your proposal improves speed by 20%, I would still choose to use PB as it is today rather than wait for another implementation.  That&#8217;s why I say implementations beat proposals.</p>
]]></content:encoded>
				</item>
	<item>
		<title>by: tonyg</title>
		<link>http://www.lshift.net/blog/2008/07/09/google-protocol-buffers#comment-110763</link>
		<pubDate>Sat, 12 Jul 2008 00:53:04 +0000</pubDate>
		<guid>http://www.lshift.net/blog/2008/07/09/google-protocol-buffers#comment-110763</guid>
					<description>&lt;p&gt;@aidan: Indeed; AMQP's various wire-protocols are, currently, much like an ad-hoc ASN.1 PER encoding. A PB encoding of AMQP would be more like a BER encoding, and would bring a measure of extensibility and self-description to the table.&lt;/p&gt;
</description>
		<content:encoded><![CDATA[<p>@aidan: Indeed; AMQP&#8217;s various wire-protocols are, currently, much like an ad-hoc ASN.1 PER encoding. A PB encoding of AMQP would be more like a BER encoding, and would bring a measure of extensibility and self-description to the table.</p>
]]></content:encoded>
				</item>
	<item>
		<title>by: Aidan Skinner</title>
		<link>http://www.lshift.net/blog/2008/07/09/google-protocol-buffers#comment-110651</link>
		<pubDate>Fri, 11 Jul 2008 14:26:05 +0000</pubDate>
		<guid>http://www.lshift.net/blog/2008/07/09/google-protocol-buffers#comment-110651</guid>
					<description>&lt;p&gt;@tonyg AMQP is already pretty spartan, there's not a great deal of waste in the frame encoding in 0-10 (or 0-8/0-9 for that matter)&lt;/p&gt;
</description>
		<content:encoded><![CDATA[<p>@tonyg AMQP is already pretty spartan, there&#8217;s not a great deal of waste in the frame encoding in 0-10 (or 0-8/0-9 for that matter)</p>
]]></content:encoded>
				</item>
	<item>
		<title>by: tonyg</title>
		<link>http://www.lshift.net/blog/2008/07/09/google-protocol-buffers#comment-110457</link>
		<pubDate>Thu, 10 Jul 2008 20:14:06 +0000</pubDate>
		<guid>http://www.lshift.net/blog/2008/07/09/google-protocol-buffers#comment-110457</guid>
					<description>&lt;p&gt;@Ben: Yes - or you could use PB to encode AMQP methods.&lt;/p&gt;
</description>
		<content:encoded><![CDATA[<p>@Ben: Yes - or you could use PB to encode AMQP methods.</p>
]]></content:encoded>
				</item>
	<item>
		<title>by: Kenton Varda</title>
		<link>http://www.lshift.net/blog/2008/07/09/google-protocol-buffers#comment-110449</link>
		<pubDate>Thu, 10 Jul 2008 19:13:04 +0000</pubDate>
		<guid>http://www.lshift.net/blog/2008/07/09/google-protocol-buffers#comment-110449</guid>
					<description>&lt;blockquote&gt;Do you cache information about the size of sub-elements when you’re measuring the size of an element, or count it again?&lt;/blockquote&gt;

&lt;p&gt;The sizes are computed once and cached.&lt;/p&gt;

&lt;blockquote&gt;are there circumstances where it would be much harder to parse the message once and store indexes into it for the bits you need?&lt;/blockquote&gt;

&lt;p&gt;We've seen cases where e.g. proxy servers would really prefer not to even scan sub-messages that they don't care about.&lt;/p&gt;

&lt;blockquote&gt;It sounds like getting a competitor to compete with what you have in speed would be a big job even if it started from a better wire format.&lt;/blockquote&gt;

&lt;p&gt;I think you could go a long way pretty easily by just having an option that replaces varint with some faster integer encoding.  Basically all you'd have to do is add the right methods for the new encoding to Coded{Input,Output}Stream and then go through the code generator and make sure that wherever it currently generates calls to the varint methods, it instead uses your new methods.  You could also trivially swap in startgroup/endgroup encoding for sub-messages; all the details are already implemented (despite being deprecated).&lt;/p&gt;

&lt;p&gt;But, yes, I think in the end, with both sides being heavily-optimized, the difference will probably not be that big.&lt;/p&gt;
</description>
		<content:encoded><![CDATA[<blockquote><p>Do you cache information about the size of sub-elements when you’re measuring the size of an element, or count it again?</p></blockquote>
<p>The sizes are computed once and cached.</p>
<blockquote><p>are there circumstances where it would be much harder to parse the message once and store indexes into it for the bits you need?</p></blockquote>
<p>We&#8217;ve seen cases where e.g. proxy servers would really prefer not to even scan sub-messages that they don&#8217;t care about.</p>
<blockquote><p>It sounds like getting a competitor to compete with what you have in speed would be a big job even if it started from a better wire format.</p></blockquote>
<p>I think you could go a long way pretty easily by just having an option that replaces varint with some faster integer encoding.  Basically all you&#8217;d have to do is add the right methods for the new encoding to Coded{Input,Output}Stream and then go through the code generator and make sure that wherever it currently generates calls to the varint methods, it instead uses your new methods.  You could also trivially swap in startgroup/endgroup encoding for sub-messages; all the details are already implemented (despite being deprecated).</p>
<p>But, yes, I think in the end, with both sides being heavily-optimized, the difference will probably not be that big.</p>
]]></content:encoded>
				</item>
	<item>
		<title>by: Ben Hood</title>
		<link>http://www.lshift.net/blog/2008/07/09/google-protocol-buffers#comment-110416</link>
		<pubDate>Thu, 10 Jul 2008 16:24:52 +0000</pubDate>
		<guid>http://www.lshift.net/blog/2008/07/09/google-protocol-buffers#comment-110416</guid>
					<description>&lt;p&gt;Ade,&lt;/p&gt;

&lt;p&gt;PB and AMQP are complementary. AMQP would just provide a carrier for PB-encoded messages, so the receiver would unmarshal the AMQP wire frames and then unmarshal the PB message.&lt;/p&gt;

&lt;p&gt;HTH,&lt;/p&gt;

&lt;p&gt;Ben&lt;/p&gt;
</description>
		<content:encoded><![CDATA[<p>Ade,</p>
<p>PB and AMQP are complementary. AMQP would just provide a carrier for PB-encoded messages, so the receiver would unmarshal the AMQP wire frames and then unmarshal the PB message.</p>
<p>HTH,</p>
<p>Ben</p>
]]></content:encoded>
				</item>
</channel>
</rss>
