Archive for September 13th, 2007

How should JSON strings be represented in Erlang?

Erlang represents strings as lists of (ASCII, or possibly ISO 8859-1) codepoints. In this regard it’s weakly typed: there’s no hard distinction between a string, “ABC”, and a list of small integers, [65,66,67]. For example:

Eshell V5.5.4  (abort with ^G)
1> "ABC".
"ABC"
2> [65,66,67].
"ABC"
3> 

Erlang also has a binary type, a simple vector of bytes. In the rfc4627/JSON codec I made for Erlang, I chose to use binaries to represent decoded strings, as suggested by Joe Armstrong.
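
To make the difference between the two types concrete, here is a minimal sketch. The module name binary_vs_list is hypothetical, chosen only for this illustration; the BIFs it calls are standard.

```erlang
-module(binary_vs_list).
-export([demo/0]).

demo() ->
    %% A string literal and a list of small integers are the same term.
    true = ("ABC" =:= [65,66,67]),
    %% A binary is a separate type, so guards can tell it apart from a list.
    true = is_binary(<<"ABC">>),
    false = is_list(<<"ABC">>),
    false = (<<"ABC">> =:= "ABC"),
    %% Converting between the two representations is explicit.
    <<"ABC">> = list_to_binary("ABC"),
    "ABC" = binary_to_list(<<"ABC">>),
    ok.
```

Because is_binary/1 and is_list/1 never both hold for the same term, a codec that decodes strings to binaries can keep decoding arrays to plain lists without ambiguity.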

All was well, until I came to implement UTF-8 support after Sam Ruby got the ball rolling. Binaries, read as one byte per character, will no longer work as the chosen mapping for JSON strings, since strings may contain arbitrary characters, including those with codepoints greater than 255, which do not fit in a single byte.
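
The byte-range limit is easy to see with a single character. The sketch below assumes the unicode module from the standard library, which ships with later OTP releases than the shell session shown earlier; the module name codepoint_demo is hypothetical.

```erlang
-module(codepoint_demo).
-export([demo/0]).

demo() ->
    Euro = [16#20AC],   %% U+20AC EURO SIGN, codepoint 8364
    %% list_to_binary/1 accepts only integers in 0..255, so this fails.
    badarg = try list_to_binary(Euro) catch error:Reason -> Reason end,
    %% Encoded as UTF-8, the same codepoint occupies three bytes.
    <<226,130,172>> = unicode:characters_to_binary(Euro),
    %% Decoding the UTF-8 bytes recovers the original codepoint list.
    Euro = unicode:characters_to_list(<<226,130,172>>),
    ok.
```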

It has always been the case that the ideal representation for a JSON string is an Erlang string, a list of codepoints. Binaries are really a bit of a compromise. But choosing strings-for-strings puts us straight back in a weakly-typed position: it’s possible in JSON to distinguish between “ABC” and [65,66,67], but it’s not possible to make the same distinction in Erlang. We’d need to alter the way JSON arrays are represented to compensate.

Possible solutions:

  • Map strings to lists of codepoints. Map arrays to tuples rather than lists. Objects remain {obj,[…]}.
    • Pros: Terse syntax for strings and arrays, no worse than the Unicode-ignorant mapping
    • Cons: Awkward recursion over arrays, either using a counter and the element/2 BIF, or converting to a real list

  • Map strings to binaries containing UTF-8 encoded characters. Keep arrays as lists. Objects remain {obj,[…]}.

    • Pros: Keep terse syntax for strings, with the understanding that the binaries concerned must hold UTF-8 encoded text. Keeps the interface largely unchanged.
    • Cons: Codec needs to perform possibly-redundant Unicode encoding/decoding steps to ensure that the binaries hold UTF-8 even if, say, UTF-32 were the format to be used on the wire

  • Map strings to lists of codepoints. Map arrays to {arr,[…]}, as other JSON codecs do. Objects remain {obj,[…]}.

    • Pros: Natural operations on strings, natural operations on arrays (once you strip the outer {arr,…}).
    • Cons: Converting terms to JSON-encodable form is a pain, since you need to wrap each array in your term with the explicit marker atom.
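
As a rough side-by-side, here is how the JSON value {"name": "ABC", "ids": [65, 66, 67]} might look under each of the three options. This is a sketch, not the codec's actual output: it assumes the list inside {obj,…} holds {Key, Value} pairs with keys written as plain Erlang strings, and the module and function names (json_mappings, option1/0, field/2 and so on) are hypothetical.

```erlang
-module(json_mappings).
-export([demo/0]).

%% Option 1: strings as codepoint lists, arrays as tuples.
option1() -> {obj, [{"name", "ABC"}, {"ids", {65, 66, 67}}]}.

%% Option 2: strings as UTF-8 binaries, arrays as lists.
option2() -> {obj, [{"name", <<"ABC">>}, {"ids", [65, 66, 67]}]}.

%% Option 3: strings as codepoint lists, arrays wrapped in {arr, List}.
option3() -> {obj, [{"name", "ABC"}, {"ids", {arr, [65, 66, 67]}}]}.

%% Look up a key in the (assumed) pair list of an object.
field(Key, {obj, Pairs}) ->
    {Key, Value} = lists:keyfind(Key, 1, Pairs),
    Value.

demo() ->
    %% Option 1: convert the tuple to a list before using list functions.
    198 = lists:sum(tuple_to_list(field("ids", option1()))),
    %% Option 2: the string is a binary, the array an ordinary list.
    true = is_binary(field("name", option2())),
    198 = lists:sum(field("ids", option2())),
    %% Option 3: strip the arr wrapper to reach the list.
    {arr, Ids} = field("ids", option3()),
    198 = lists:sum(Ids),
    %% Without a tuple, binary, or wrapper, the string and the array collide.
    true = (field("name", option3()) =:= Ids),
    ok.
```

The last assertion is the ambiguity itself: once the wrapper is removed, the decoded string and the decoded array are the same Erlang term.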

All in all, I can’t decide which is the least distasteful option. I think I prefer the middle option, keeping strings mapped to binaries and viewing them as UTF-8 encoded text, but I really need to get some feedback on the issue.
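
The cost of the middle option is the transcoding step at the edge of the codec. A minimal sketch of that step, assuming the standard unicode module is available, could look like this; the module name wire_transcode and the function to_utf8/2 are hypothetical, as are the error atoms it returns.

```erlang
-module(wire_transcode).
-export([to_utf8/2, demo/0]).

%% Convert text received in WireEncoding (e.g. utf8, utf16, utf32)
%% into a UTF-8 binary, the form the decoded term would carry.
to_utf8(Bytes, WireEncoding) ->
    case unicode:characters_to_binary(Bytes, WireEncoding, utf8) of
        Utf8 when is_binary(Utf8) -> {ok, Utf8};
        {error, _Good, _Rest} -> {error, invalid_encoding};
        {incomplete, _Good, _Rest} -> {error, truncated_input}
    end.

demo() ->
    %% U+20AC as big-endian UTF-32 is the four bytes 00 00 20 AC.
    {ok, <<226,130,172>>} = to_utf8(<<0,0,16#20,16#AC>>, utf32),
    %% Input that is already UTF-8 comes back unchanged, but is still checked.
    {ok, <<"ABC">>} = to_utf8(<<"ABC">>, utf8),
    ok.
```

When the wire format is already UTF-8 the conversion is effectively a validation pass, which is the "possibly-redundant" work mentioned in the cons above.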

8 comments September 13th, 2007 tonyg

NDocProc: Javadoc-like documentation for .NET

There are a few tools for building javadoc-like documentation for .NET code available out there on the ‘net. Unfortunately, the major contenders (e.g. NDoc, Sandcastle) suffer from a few flaws: they are variously not free (gratis), not free (libre), not cross-platform, not maintained, and/or not easy to use. Therefore:

Presenting: NDocProc.

Download a zip snapshot from here, or use darcs to check out the repository:

darcs get http://www.lshift.net/~tonyg/ndocproc/

Update: New release available.

Add comment September 13th, 2007 tonyg
