
How readable are your comments?

You’ve done the Right Thing and written extensive class comments, docstrings and the like. But are they really readable?

There’s a fair amount of research into readability, at least in English. Flesch-Kincaid, SMOG (Simple Measure Of Gobbledygook), Coleman-Liau and the Automated Readability Index are among the better-known measures. Essentially, they all ask the same two questions: how long are your sentences, and how long are the words you use? Polysyllabic sesquipedalianism, let alone egregious hyperverbosity and prolixity, decreases readability. (Coleman-Liau index, or CLI, score: 34.14) Using short words makes things more readable. (CLI score: 8.5)

Flesch-Kincaid and SMOG both suffer from measuring syllables. “Suffer”, because syllable-counting in English is not trivial. Coleman-Liau and the Automated Readability Index, on the other hand, just count the characters in each word, which is far more amenable to calculation.
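To make the difference concrete, here is a small Python sketch (Python rather than Smalltalk, purely so it runs anywhere). The helper names `naive_syllables` and `letters` are made up for this illustration, and the vowel-run heuristic is deliberately simplistic. It is not what Flesch-Kincaid or SMOG implementations actually use.

```python
import re

def naive_syllables(word):
    """Count runs of consecutive vowels: a deliberately naive syllable heuristic."""
    return len(re.findall(r"[aeiouy]+", word.lower()))

def letters(word):
    """Count alphanumeric characters, which is all the letter-based measures need."""
    return sum(ch.isalnum() for ch in word)

# "make" has one syllable and "rhythm" has two, but the heuristic
# reports 2 and 1 respectively. Counting letters has no such ambiguity.
for word in ["make", "rhythm", "readable"]:
    print(word, naive_syllables(word), letters(word))
```

Getting the syllable counts right needs a pronunciation dictionary or a pile of special cases. The letter count is a one-liner.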

Some languages permit comments to be explicitly tied to things: Clojure’s docstrings, Smalltalk’s class comments. Given that, let’s look at using Coleman-Liau on a Smalltalk package’s class comments. The Coleman-Liau index maps text to a real number that represents the approximate level of education, in US school grades, required to understand the text. Thus, a score of 10 represents the reading ability of a Grade 10 student, 14 that of a second-year undergraduate, and so on.

Now, to do this properly we need to tokenise the text into words and sentences. In a production system we’d need to be careful: splitting the text on periods is insufficient, because “The U.S. Postal Service is slow.” would then parse as three sentences. But in the interests of clarity, we’ll ignore that: we treat sentences as delimited by periods, and words by spaces.
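A quick Python sketch of the problem, assuming nothing cleverer than splitting on periods (the function name `naive_sentences` is invented for this example):

```python
def naive_sentences(text):
    """Split on periods, trim whitespace, and drop empty pieces."""
    return [piece.strip() for piece in text.split(".") if piece.strip()]

pieces = naive_sentences("The U.S. Postal Service is slow.")
print(pieces)       # ['The U', 'S', 'Postal Service is slow']
print(len(pieces))  # 3
```

The Smalltalk below makes the same simplification.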

| classes cli |
classes := (PackageInfo named: 'Kernel-Classes') classes.
cli := [:str | | words sentences l s |
    words := (str splitBy: ' ') collect:
        [:each | each withoutLeadingBlanks withoutTrailingBlanks].
    sentences := (str splitBy: '.') collect:
        [:each | each withoutLeadingBlanks withoutTrailingBlanks].

    "The original formulation is 0.0588L - 0.296S - 15.8 where
       * L is the average number of letters per 100 words and
       * S is the average number of sentences per 100 words.
      We fold the constant factor into the coefficients to make
      the important things clear:
       * l measures the average word length, while
       * s measures the (reciprocal of the) average sentence length."

    l := ((str select: #isAlphaNumeric) size) / (words size).
    s := ((sentences collect: [:each | (each splitBy: ' ') size]) average) / (words size).
    (5.88 * l) - (29.6 * s) - 15.8].
classes collect:
    [:cls | {cls name. cli value: cls instanceSide organization classComment asString}]

"=>   an OrderedCollection(
    #(#BasicClassOrganizer -45.400000000000006)
    #(#Behavior 8.568979591836733)
    #(#Categorizer 18.796997792494484)
    #(#Class 10.976666666666667)
    #(#ClassBuilder 10.886810551558753)
    #(#ClassCategoryReader -0.9633333333333312)
    #(#ClassCommentReader -45.400000000000006)
    #(#ClassDescription 13.691774891774887)
    #(#ClassOrganizer 7.881632653061221)
    #(#Metaclass 19.016567834681037))"

(Side note: usually we’d say cls comment. However, when a class comment is missing, ClassDescription >> #comment returns a template encouraging the reader to fill in the blanks. That would throw out our calculations, so we route around the helper and go directly to the source of the comments.)

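But look at that first result: BasicClassOrganizer’s class comment is apparently readable by someone not even born yet! Of course, that’s because that class has no comment: with no letters to count, l is 0 and s comes out as 1, so the score collapses to -29.6 - 15.8 = -45.4. (ClassCommentReader has the same score, presumably for the same reason.) That’s handy information in itself, although we’d be better off filtering those classes out and treating them separately.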
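One way to do that filtering, sketched in Python under the assumption that the comments have already been pulled out into a mapping from class name to comment text. The function name `partition_by_comment` and the sample data are hypothetical, not taken from the Squeak image.

```python
def partition_by_comment(comments):
    """Split a {class name: comment} mapping into scoreable and uncommented classes."""
    commented = {name: text for name, text in comments.items() if text.strip()}
    uncommented = sorted(name for name, text in comments.items() if not text.strip())
    return commented, uncommented

# Made-up sample data: one empty comment, one placeholder comment.
sample = {
    "BasicClassOrganizer": "",
    "SomeCommentedClass": "A placeholder comment. It exists only for this example.",
}
commented, uncommented = partition_by_comment(sample)
print(sorted(commented))  # ['SomeCommentedClass']
print(uncommented)        # ['BasicClassOrganizer']
```

Only the first group would be scored. The second is a to-do list of classes that still need a comment.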

So what about languages other than English? I’ve seen work on a Japanese readability index and a Chinese one. Applying Coleman-Liau and ARI to agglutinative languages like isiXhosa would probably not work, because such languages have longer words than English and fewer words in a sentence. “Indoda iyambona umntwana” has a CLI of around 17.5, indicating a sentence well outside the grasp of an adult without extensive tertiary education. The English translation, “The man sees the child”, has a CLI of -0.5!
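Those two figures can be reproduced with a short Python sketch of the folded formula, 5.88l - 29.6s - 15.8, using the same naive tokenisation as before. The function name `coleman_liau` is invented for this example, and the sketch assumes the text contains at least one word.

```python
def coleman_liau(text):
    """Coleman-Liau index with naive tokenisation: spaces for words, periods for sentences."""
    words = text.split()
    # Drop empty pieces so a trailing period doesn't count as an extra sentence.
    sentences = [piece for piece in text.split(".") if piece.strip()]
    letter_count = sum(ch.isalnum() for ch in text)
    l = letter_count / len(words)    # average letters per word
    s = len(sentences) / len(words)  # sentences per word
    return 5.88 * l - 29.6 * s - 15.8

print(f"{coleman_liau('Indoda iyambona umntwana'):.2f}")  # 17.45
print(f"{coleman_liau('The man sees the child'):.2f}")    # -0.55
```

One thing to watch: this sketch takes s to be sentences divided by words, as the formula in the code comment describes, and ignores the empty piece after a final period. The Smalltalk snippet above divides the average sentence length by the word count instead, so the two won't necessarily agree on longer texts.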

It appears that English has had by far the most research into readability. Or, of course, the internet simply hasn’t overcome its English bias yet.

by Frank Shearar on 27/04/13
 
 



2000-14 LShift Ltd, 1st Floor, Hoxton Point, 6 Rufus Street, London, N1 6PE, UK, +44 (0)20 7729 7060