Proper Unicode support in Erlang RFC4627 (JSON) module

October 3rd, 2007 tonyg

In a previous post I explored some of the options for handling Unicode in RFC4627 (JSON) strings when mapping to Erlang terms. In the end, I settled on keeping the interface almost unchanged: the only change is that binaries returned from rfc4627:decode are now to be interpreted as UTF-8 encoded text, whereas before their interpretation was less well defined.

The new module is Erlang-RFC4627 version 1.1.0, and is available as a tarball, as a Debian package, or by browsing online here. You can also get the code using Mercurial:

hg clone http://hg.opensource.lshift.net/erlang-rfc4627/

Here are some examples using the new module. First, let’s explore the autodetection of which encoding is being used. In the following example, we see UTF-16, both big- and little-endian, as well as ill-formed and well-formed examples of UTF-8 being passed through the autodetector. (It also supports UTF-32 big- and little-endian.)

Eshell V5.5.5  (abort with ^G)
1> rfc4627:unicode_decode([34,0,228,0,34,0]).
{'utf-16le',"\"ä\""}
2> rfc4627:unicode_decode([0,34,0,228,0,34]).
{'utf-16be',"\"ä\""}
3> rfc4627:unicode_decode([34,228,34]).
** exited: {ucs,{bad_utf8_character_code}} **
4> rfc4627:unicode_decode([34,195,164,34]).
{'utf-8',"\"ä\""}
5> 
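
For readers curious how this kind of autodetection can work: RFC 4627 (section 3) observes that a JSON text always begins with two ASCII characters, so the pattern of zero octets in the first four bytes reveals the encoding. The following is a minimal sketch of that idea only. The module name detect_sketch and the function detect/1 are hypothetical and are not part of rfc4627, whose real unicode_decode also performs the decoding and may differ in detail.

```erlang
-module(detect_sketch).
-export([detect/1]).

%% Hypothetical helper, NOT part of the rfc4627 module.
%% Guess the encoding of a JSON text (given as a list of bytes) from the
%% pattern of zero octets at its start, as described in RFC 4627 section 3.
%% The UTF-32 clauses come first because their patterns are more specific.
detect([0, 0, 0, _ | _]) -> 'utf-32be';   % 00 00 00 xx
detect([_, 0, 0, 0 | _]) -> 'utf-32le';   % xx 00 00 00
detect([0, _, 0, _ | _]) -> 'utf-16be';   % 00 xx 00 xx
detect([_, 0, _, 0 | _]) -> 'utf-16le';   % xx 00 xx 00
detect(_)                -> 'utf-8'.      % anything else
```

This sketch only classifies the input. It does not validate it, which is why the real module can still reject an ill-formed UTF-8 sequence such as [34,228,34] after choosing UTF-8.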

Now let’s look at decoding some UTF-8 encoded JSON text into Erlang terms, and vice versa.

5> rfc4627:decode([34,194,128,34]).
{ok,<<194,128>>,[]}
6> rfc4627:encode(<<194,128>>).
[34,194,128,34]
7> rfc4627:encode_noauto(<<194,128>>).
[34,128,34]
8> rfc4627:unicode_encode({'utf-32le',
        rfc4627:encode_noauto(<<194,128>>)}).
[34,0,0,0,128,0,0,0,34,0,0,0]
9> rfc4627:encode_noauto({obj, [{[27700], 123}]}).
[123,34,27700,34,58,49,50,51,125]
10> rfc4627:encode({obj, [{[27700], 123}]}).
"{\"æ°´\":123}"
11> 

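Examples 7 to 10 show the difference between the two encoders: rfc4627:encode_noauto returns a list of Unicode code points, which rfc4627:unicode_encode can then serialise in a chosen encoding, whereas rfc4627:encode returns UTF-8 encoded bytes directly. Notice, in that final example, that Erlang prints the UTF-8 encoded JSON text as if it were Latin-1. This is nothing to worry about: the numbers in the returned list/string are the correct UTF-8 encoding for Unicode code point 27700.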
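
To check that claim by hand, here is a small sketch of the three-byte UTF-8 encoding rule applied to code point 27700 (16#6C34). The module name utf8_sketch and the function encode3/1 are hypothetical, introduced only for illustration. The sketch covers just the three-byte range and does not reject surrogate code points.

```erlang
-module(utf8_sketch).
-export([encode3/1]).

%% Hypothetical helper, NOT part of the rfc4627 module.
%% Encode one code point in the range 16#0800..16#FFFF as three UTF-8
%% bytes: 1110xxxx 10xxxxxx 10xxxxxx. Surrogates are not checked here.
encode3(CP) when CP >= 16#0800, CP =< 16#FFFF ->
    [16#E0 bor (CP bsr 12),                 % top 4 bits
     16#80 bor ((CP bsr 6) band 16#3F),     % middle 6 bits
     16#80 bor (CP band 16#3F)].            % low 6 bits
```

Calling utf8_sketch:encode3(27700) gives [230,176,180]. Read as Latin-1, those three bytes are the characters æ, ° and ´, which is exactly what the shell printed between the quotes.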

Entry Filed under: Technology, Our Software, Erlang

2 Comments

  • 1. Ciaran  |  October 3rd, 2007 at 11:01 pm

    Thanks for that - I’ve been using your rfc2467 module in a little test project and came up against some unicode issues. I’ll grab the latest version and have a play.

  • 2. Ciaran  |  October 3rd, 2007 at 11:03 pm

    Hmm, 2467 - Transmission of IPv6 Packets over FDDI Networks!? You would think I would have got used to typing 4627 by now.
