September 14, 2012

As students know, some things are harder to read than others. The spectrum of writing extends from the single syllable words in the Dick and Jane books of my youth, to the heady reading found in college text books and articles in scholarly journals.

In one memorable episode of The Bob Newhart Show, Bob's dentist friend, Jerry, writes a children's book. He submits his manuscript as many pages with one word per page. His rejection letter arrives in the same format, one word per page.

Many word processors have lexicographic analysis functions, such as word count, which is an important metric for student submissions. They also have a readability analysis designed to estimate the target audience for your text. The most popular of these is the Flesch-Kincaid readability test[1] that presents its results as either a percentage reading ease, or as a grade level index; viz.,
Flesch Reading Ease =
206.835 - (1.015*(words/sentences)) - (84.6*(syllables/words))

Flesch-Kincaid Grade =
(0.39*(words/sentences)) + (11.8*(syllables/words)) - 15.59,
in which words, sentences and syllables are the total counts of these objects in the manuscript. The grade level is designed to track school grade levels in the US. The reading ease score indicates the age group expected to understand the text (90-100 = 11-year-olds, 60-70 = 13- to 15-year-olds, and 0-30 = college graduates).
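
As a minimal sketch, the two formulas can be written as a pair of C functions. The function names are my own for illustration and are not taken from any particular program. The constants are the standard published ones for the Flesch tests (206.835, 1.015 and 84.6 for reading ease; 0.39, 11.8 and 15.59 for grade level), and the counts are assumed to have been tallied already.

```c
#include <assert.h>

/* Flesch Reading Ease: higher scores mean easier text.
 * (Hypothetical helper name; counts are totals for the manuscript.) */
double flesch_reading_ease(long words, long sentences, long syllables)
{
    assert(words > 0 && sentences > 0);   /* avoid division by zero */
    return 206.835
         - 1.015 * ((double)words / sentences)
         - 84.6 * ((double)syllables / words);
}

/* Flesch-Kincaid Grade: approximate US school grade level.
 * (Hypothetical helper name, same inputs as above.) */
double flesch_kincaid_grade(long words, long sentences, long syllables)
{
    assert(words > 0 && sentences > 0);   /* avoid division by zero */
    return 0.39 * ((double)words / sentences)
         + 11.8 * ((double)syllables / words)
         - 15.59;
}
```

For a text of 100 words in 5 sentences with 150 syllables, these give a reading ease of about 59.6 and a grade level of about 9.9.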

Writer's angst.

Teachers and publishers alike are happy that modern electronics have eliminated the chicken scratching they were once forced to decipher.

A manuscript page of À la recherche du temps perdu (Remembrance of Things Past) by Marcel Proust.

(Not to be confused with "La vie et l'époque de Frank Perdue."[2])

(Via Wikimedia Commons).

When my children were in elementary school and high school, we would apply the grade test to their school reports to see how they fared. The object there was to make the grade level as high as possible. The purpose of these tests is actually the opposite. Development of the grade level test was funded by the US military to ensure that their training materials and maintenance manuals were understood. It's also used by some publishers to "dumb-down" their content to make it more salable. Next time you're in the supermarket, scan the tabloids at the checkout.

I don't try to dumb-down anything in this blog, but its reading level is not that extreme. A recent, relatively low-tech article (Work, September 3, 2012) has a Flesch Reading Ease of 62% and a Flesch-Kincaid Grade Level of 8.9. The previous, more technical article (Harder than Diamond, August 31, 2012) scores 49.9% and grade 9.5. There's no reason why your high-schooler shouldn't be reading this blog!

These scores were calculated by a C language program I developed just for this purpose. You can grab the source code here. Looking at the above formulas, you would think that such a program is easy to write. Counting words and sentences is somewhat easy, but the syllable count is the hard part. An extreme program might use a dictionary for this, but many of the words in this blog would not be found there. Instead, we use a simple method that's accurate enough for our purpose.

The vowels (a,e,i,o,u,y) are the key. The number of syllables in a word is almost always equal to the number of vowels, with two exceptions. When vowels appear in pairs (diphthongs), they usually share a single sound, so we skip any vowel that follows another. Also, there are certain silent endings that must be addressed. We simply eliminate -e, -es and -ed from our count. This syllable count is not 100% accurate, but how accurate are the readability scores themselves? All scientists know that approximation is allowed in certain cases.
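
Here is one way the vowel rule might be coded in C. This is a sketch of the idea, not the program linked above. The function names are hypothetical, and it makes a few assumptions the description doesn't spell out: the endings are stripped before counting, every non-empty word is given at least one syllable, and only plain ASCII letters are handled.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Returns nonzero if c is one of a, e, i, o, u, y (either case). */
static int is_vowel(char c)
{
    c = (char)tolower((unsigned char)c);
    return c != '\0' && strchr("aeiouy", c) != NULL;
}

/* Approximate syllable count for a single word (hypothetical helper). */
int count_syllables(const char *word)
{
    assert(word != NULL);
    size_t len = strlen(word);
    if (len == 0)
        return 0;

    /* Drop the silent endings -es, -ed and -e before counting. */
    if (len > 2 && tolower((unsigned char)word[len - 2]) == 'e' &&
        (tolower((unsigned char)word[len - 1]) == 's' ||
         tolower((unsigned char)word[len - 1]) == 'd'))
        len -= 2;
    else if (len > 1 && tolower((unsigned char)word[len - 1]) == 'e')
        len -= 1;

    /* Count vowels, skipping any vowel that follows another vowel. */
    int count = 0;
    int prev_vowel = 0;
    for (size_t i = 0; i < len; i++) {
        int v = is_vowel(word[i]);
        if (v && !prev_vowel)
            count++;
        prev_vowel = v;
    }

    /* Assumption: words like "the" still count as one syllable. */
    return count > 0 ? count : 1;
}
```

The approximation shows its limits quickly. This sketch gives one syllable for "grade" and two for "reading," which are right, but also one for "wanted" and two for "diamond," which are each short by one.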

As can be seen in the readability formulas, the number of syllables per word is the most important factor. This is no surprise to children who complain about "big words," so word length is an important linguistic concept. An article about word length has recently been posted on the arXiv preprint server.[3]

The authors used the Google Books corpus to analyze the temporal trend in word length. I wrote about linguistic analysis using Google Books and the Google Ngram Viewer in two previous articles (Culturomics, January 13, 2011 and Word Extinction, August 17, 2011). Their results are shown in the graph below.

Trend in word length.

Blue=common text; green=fiction; red = British English; aqua = American English.

(arXiv Preprint Server, fig. 1 of ref. 3.)[3]
Note the recent "dumbing-down" of American English. The authors of the arXiv paper associate the decrease in average word length with a shifting political environment. I prefer my dumbing-down hypothesis. Word length is an easily understood concept, but linguistics can get into more complicated areas, as another just-published paper demonstrates.[4-5]


  1. J. Peter Kincaid, Richard Braby and John E. Mears, "Electronic authoring and delivery of technical information," Journal of Instructional Development, vol. 11, no. 2 (June, 1988), pp. 8-13.
  2. "The Life and Times of Frank Perdue" (Unpublished).
  3. Vladimir V. Bochkarev, Anna V. Shevlyakova and Valery D. Solovyev, "Average word length dynamics as indicator of cultural changes in society," arXiv Preprint Server, August 30, 2012.
  4. How language change sneaks in, Linguistic Society of America Press Release, September 4, 2012 (PDF File).
  5. Hendrik De Smet, "The Course of Actualization," Preprint of Language paper, to appear, September, 2012 (PDF File).
  6. Phonics on the Web.

