Proper Unicode support in Erlang RFC4627 (JSON) module

Posted on October 3, 2007

In a previous post I explored some of the options for handling Unicode in RFC4627 (JSON) strings when mapping to Erlang terms. In the end, I settled on keeping the interface almost unchanged: the only change is that binaries returned from rfc4627:decode are now to be interpreted as UTF-8 encoded text, whereas before their interpretation was less well defined.

The new module is available as a tarball (automatically generated from the GitHub repository), and the source can be browsed online in that repository. You can also get the code using git:

git clone git://github.com/tonyg/erlang-rfc4627.git

Here are some examples using the new module. First, let’s explore the autodetection of which encoding is being used. In the following example, we see UTF-16, both big- and little-endian, as well as ill-formed and well-formed examples of UTF-8 being passed through the autodetector. (It also supports UTF-32 big- and little-endian.)

Eshell V5.5.5  (abort with ^G)
1> rfc4627:unicode_decode([34,0,228,0,34,0]).
{'utf-16le',"\"ä\""}
2> rfc4627:unicode_decode([0,34,0,228,0,34]).
{'utf-16be',"\"ä\""}
3> rfc4627:unicode_decode([34,228,34]).
** exited: {ucs,{bad_utf8_character_code}} **
4> rfc4627:unicode_decode([34,195,164,34]).
{'utf-8',"\"ä\""}
5> 
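
For the curious, here is roughly how this kind of autodetection can work. RFC 4627 section 3 observes that a JSON text always begins with two ASCII characters, so the pattern of zero octets among the first four octets gives away the encoding. The following is a minimal sketch of that idea, not the module's actual code: the module name json_enc_sketch and the function detect/1 are made up for illustration, and unlike rfc4627:unicode_decode it only names the encoding, without decoding or validating anything.

```erlang
-module(json_enc_sketch).
-export([detect/1]).

%% Hypothetical helper, not part of the rfc4627 module.
%% Guess the encoding of a JSON text (a list of octets) from the pattern
%% of zero octets in its first four octets, per RFC 4627 section 3:
%%
%%   00 00 00 xx  UTF-32BE
%%   xx 00 00 00  UTF-32LE
%%   00 xx 00 xx  UTF-16BE
%%   xx 00 xx 00  UTF-16LE
%%   xx xx xx xx  UTF-8
%%
%% Clause order matters: the UTF-32 patterns are tried before the
%% UTF-16 ones, because [X,0,0,0] would also fit the UTF-16LE shape.
detect([0, 0, 0, _ | _]) -> 'utf-32be';
detect([_, 0, 0, 0 | _]) -> 'utf-32le';
detect([0, _, 0, _ | _]) -> 'utf-16be';
detect([_, 0, _, 0 | _]) -> 'utf-16le';
%% Anything else, including inputs shorter than four octets, is
%% assumed to be UTF-8 in this sketch.
detect(Octets) when is_list(Octets) -> 'utf-8'.
```

With the inputs from the session above, json_enc_sketch:detect([34,0,228,0,34,0]) gives 'utf-16le' and json_enc_sketch:detect([0,34,0,228,0,34]) gives 'utf-16be'. Note that [34,228,34] is classified as 'utf-8' by this sketch too. It is only when the octets are actually decoded that 228 (a lead byte announcing a three-octet sequence) followed by 34 (not a continuation byte) turns out to be ill-formed.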

Now let’s look at decoding some UTF-8 encoded JSON text into Erlang terms, and vice versa.

5> rfc4627:decode([34,194,128,34]).
{ok,<<194,128>>,[]}
6> rfc4627:encode(<<194,128>>).
[34,194,128,34]
7> rfc4627:encode_noauto(<<194,128>>).
[34,128,34]
8> rfc4627:unicode_encode({'utf-32le',
        rfc4627:encode_noauto(<<194,128>>)}).
[34,0,0,0,128,0,0,0,34,0,0,0]
9> rfc4627:encode_noauto({obj, [{[27700], 123}]}).
[123,34,27700,34,58,49,50,51,125]
10> rfc4627:encode({obj, [{[27700], 123}]}).
"{\"æ°´\":123}"
11> 

Notice, in that final example, that Erlang is printing the UTF-8 encoded JSON text as if it were Latin-1. This is nothing to worry about: the numbers in the returned list/string (230, 176, 180 for the three odd-looking characters) are the correct UTF-8 encoding for Unicode code point 27700.
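
To make the arithmetic concrete, here is a small sketch of the step from a list of code points (the kind of list encode_noauto returns) to UTF-8 or little-endian UTF-32 octets. It is an illustration only: the module name encode_sketch and its functions are hypothetical, and the rfc4627 module's own implementation may well differ.

```erlang
-module(encode_sketch).
-export([utf8/1, utf32le/1]).

%% Hypothetical helpers, not part of the rfc4627 module.

%% Encode a list of Unicode code points as a list of UTF-8 octets.
utf8(CodePoints) ->
    lists:flatmap(fun utf8_cp/1, CodePoints).

%% One code point becomes one to four octets. Surrogates
%% (16#D800..16#DFFF) are not rejected here; a production encoder
%% should refuse them.
utf8_cp(C) when C >= 0, C < 16#80 ->
    [C];
utf8_cp(C) when C >= 16#80, C < 16#800 ->
    [16#C0 bor (C bsr 6),
     16#80 bor (C band 16#3F)];
utf8_cp(C) when C >= 16#800, C < 16#10000 ->
    [16#E0 bor (C bsr 12),
     16#80 bor ((C bsr 6) band 16#3F),
     16#80 bor (C band 16#3F)];
utf8_cp(C) when C >= 16#10000, C < 16#110000 ->
    [16#F0 bor (C bsr 18),
     16#80 bor ((C bsr 12) band 16#3F),
     16#80 bor ((C bsr 6) band 16#3F),
     16#80 bor (C band 16#3F)].

%% Encode a list of code points as little-endian UTF-32 octets:
%% each code point is written as a 32-bit little-endian integer.
utf32le(CodePoints) ->
    binary_to_list(<< <<C:32/little>> || C <- CodePoints >>).
```

Under these assumptions, encode_sketch:utf8([27700]) yields [230,176,180], which is exactly the "æ°´" the shell printed, and encode_sketch:utf32le([34,128,34]) yields [34,0,0,0,128,0,0,0,34,0,0,0], matching the utf-32le example above.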


2 Comments

  1. Ciaran says:

    Thanks for that – I’ve been using your rfc2467 module in a little test project and came up against some unicode issues. I’ll grab the latest version and have a play.

  2. Ciaran says:

    Hmm, 2467 – Transmission of IPv6 Packets over FDDI Networks!? You would think I would have got used to typing 4627 by now.
