Thursday, 21 February 2013

Business Analytics - finding the balance between complexity and readability

In this blog I try to present analytic material for a non-analytic audience. I focus on point-of-sale and supply-chain analytics. It's a complex area and, frankly, whether I'm writing for a blog or presenting to a management team, it's far too easy to slip into the same language I would use with an expert.

So, I was inspired by a recent post on Nathan Yau's excellent blog FlowingData to look at the "readability" of my own posts and apply some simple analytics to the results.

I've followed Nathan's blog for a couple of years now for the many and varied examples of data visualization he builds and gathers from other sources. One that particularly caught my eye was this one, published by the Guardian just before the recent State of the Union address in the United States.
The Guardian plotted the Flesch-Kincaid grade levels for past addresses. Each circle represents a state of the union and is sized by the number of words used. Color is used to provide separation between presidents. For example, Obama's state of the union last year was around the eighth-grade level, and in contrast, James Madison's 1815 address had a reading level of 25.3. 
Neither the original post nor Nathan's goes into much detail about why the linguistic standard has declined. Over this period, the nature of the address and the intended audience have certainly changed. Frankly, having scanned a few of the earlier addresses, I think we can all be grateful not to be on the receiving end of one of them.

So, I set out to find the reading level of my own blog. It's intended to present analytic concepts to a non-analytic audience. I can probably go a little higher than recent presidential addresses (8th-10th grades, roughly ages 13-15), but I don't want to be writing college-level material either.

All the books my kids read are graded in this (or a very similar) way, but I had never thought about how such a grading system could be constructed. The Flesch-Kincaid grade level estimate is based on a simple formula:

\[ \text{grade level} = 0.39 \left( \frac{\text{total words}}{\text{total sentences}} \right) + 11.8 \left( \frac{\text{total syllables}}{\text{total words}} \right) - 15.59 \]

That's just a linear combination of:
  • average words per sentence;
  • average syllables per word;
  • a constant term.
In fact (though I have not yet found details of how it was constructed), it looks to be the result of a regression model. (Simple) data science in action from the 1970s.
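To make the formula concrete, here is a minimal Python sketch. The function names are illustrative, and the syllable counter is a crude vowel-group heuristic that I am assuming for demonstration only; real readability tools use their own counting rules, so scores from this sketch will not necessarily match theirs.

```python
import re


def count_syllables(word):
    """Rough estimate (an assumption): count vowel groups, drop a silent final 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_kincaid_grade(total_words, total_sentences, total_syllables):
    """Apply the Flesch-Kincaid grade level formula to raw counts."""
    return (0.39 * total_words / total_sentences
            + 11.8 * total_syllables / total_words
            - 15.59)


def grade_text(text):
    """Split text naively into sentences and words, then score it."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)*", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word and one sentence")
    syllables = sum(count_syllables(w) for w in words)
    return flesch_kincaid_grade(len(words), len(sentences), syllables)
```

Short words and short sentences can push the score below zero, which is a reminder that the result is an estimate from a fitted line rather than a literal school grade.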

Note that Flesch-Kincaid says nothing about the length of the book or the nature of the vocabulary it's all down to long sentences and the presence of multi-syllabic words. 

(BTW, the preceding sentence has a Flesch-Kincaid grade score of 13.63, calculated with this online utility.) Now that's pretty high, worthy of an early-1900s president and (supposedly) understandable by young college students. The sentence is longer than typical, 31 words vs. my average of 18 (see below), and words like "vocabulary", "sentences" and "multi-syllabic" are not helping me either.
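As a quick check of the arithmetic, the sentence can be run through the formula by hand. This assumes 45 syllables (a hand count, not a figure reported by the utility) and treats each of the two hyphenated terms as two words, which is what gives 31 words in a single sentence:

\[ 0.39 \left( \frac{31}{1} \right) + 11.8 \left( \frac{45}{31} \right) - 15.59 \approx 12.09 + 17.13 - 15.59 = 13.63 \]

The sentence-length term contributes about 12 points on its own, so most of the damage comes from packing 31 words into one sentence rather than from the longer words.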

Approach

I could have pasted each post into the online utility I used above, recorded the results in a spreadsheet and pulled some stats from that. That would work, but if I ever wanted to repeat the exercise or modify it, perhaps to use a different readability index, I would have to do all that work again. At the time of writing, there are 44 published posts on this blog, so there must be a better way.

Actually, there are probably many better ways, but as I also wanted to flex some R-programming muscle, I built a web scraper in R to do the work for me and analyze the results (more on this in a later post).
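For readers who want a feel for what such a scraper involves, here is a rough sketch in Python using only the standard library. It is not the R code used for this post: the function and class names are hypothetical, the sentence and word splitting is deliberately naive, and a real blog page would also need its navigation and sidebar text stripped out.

```python
import re
import urllib.request
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def fetch_html(url):
    """Download a page; the caller supplies the post URL."""
    with urllib.request.urlopen(url) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def html_to_text(html):
    """Strip tags and collapse whitespace."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def post_stats(text):
    """Count words and sentences with simple regular expressions."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)*", text)
    per_sentence = len(words) / len(sentences) if sentences else 0.0
    return {"words": len(words), "sentences": len(sentences),
            "words_per_sentence": per_sentence}
```

Looping `post_stats(html_to_text(fetch_html(url)))` over a list of post URLs would give one row of counts per post, which is the kind of table summarized in the results below.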

Results

Let's start with some simple summaries of the results I collected.
Histograms showing the % of posts from this blog (prior to 2/14/13), with the average (mean) value shown in red. The grade reading level indicated by Flesch-Kincaid varies across my blog posts, averaging around 10 but ranging from 7 through 14. I average about 750 words per post, but occasionally go much longer and have a number of very short "announcement" style posts. My sentences average 18 words.

OK, so now I know, but is that good? I don't have a definitive benchmark, but according to at least one source the target Flesch-Kincaid range for technical or industry readers is 7-12, so I'm feeling pretty good about that.

I did wonder whether there was any other, hidden structure in the data, though. I know the equation is based on words per sentence and syllables per word, so there is no point looking at those: obviously I'll find a relationship. But is my writing style influenced by anything else?
Flesch-Kincaid grade level vs. the number of words, by post on this blog. Other than a handful of long posts that rate lower, in the range 8-10, I don't see much going on here.

Flesch-Kincaid grade level vs. publication date, by post on this blog. The size of each post (in words) is shown by the area of each point; color is used purely to help differentiate the points visually. Apart from a couple of recent "complex" posts, this does seem to show a trend, so I added a regression line and labeled the more extreme posts. Point (b) is a very short "announcement" style post (you can hardly see the point at all), and I could probably ignore it completely. Point (e) is a more lighthearted piece I did on using pie charts that's probably not very representative of the general material either.

If you want to compare readability for yourself, here are the top (and bottom) posts ranked by Flesch-Kincaid grade level:

Rank  Flesch-Kincaid grade level  Words  Sentences
1     13.3                        784    33
2     13.1                        82     4
3     12.8                        676    29
4     12.8                        723    31
5     12.7                        541    29
6     12.4                        891    43
7     12.1                        478    24
8     11.9                        762    38
9     11.8                        297    16
10    11.6                        958    41
...
35    9.0                         1878   114
36    8.9                         1264   78
37    8.5                         177    10
38    8.4                         70     5
39    8.3                         651    42
40    8.2                         1395   83
41    8.1                         531    32
42    7.9                         773    44
43    7.6                         483    36
44    7.1                         1097   70
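The table gives words and sentences but not syllables. Since the formula is linear, it can be rearranged to back out the average syllables per word that each score implies. The sketch below does that for three rows copied from the table; the function name is illustrative, and because the published grades are rounded to one decimal place the results are only approximate.

```python
def implied_syllables_per_word(grade, words, sentences):
    """Rearrange the Flesch-Kincaid formula to solve for syllables per word."""
    words_per_sentence = words / sentences
    return (grade + 15.59 - 0.39 * words_per_sentence) / 11.8


# (rank, grade, words, sentences), copied from the table above
rows = [
    (1, 13.3, 784, 33),
    (2, 13.1, 82, 4),
    (44, 7.1, 1097, 70),
]

for rank, grade, words, sentences in rows:
    spw = implied_syllables_per_word(grade, words, sentences)
    print(f"rank {rank}: {words / sentences:.1f} words/sentence, "
          f"about {spw:.2f} syllables/word")
```

Comparing the two ends of the ranking this way shows how much of the gap comes from sentence length and how much from word length.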

Conclusions

It appears that my material is (largely) written at a level that should be accessible to the reader. And I am using more readable language in recent posts, which sounds like a good thing.

But there remains a key question for me that these stats can't really answer. Am I getting better at explaining the complex (my goal), or just explaining simpler things? What do you think?

In case you are wondering, this post has a Flesch-Kincaid grade level of about 8.  So if you can follow the "State of the Union" address you should have been just fine with this.

