Using Data Science Skills Now: Text Readability Analysis

This post was originally published by Dawn Moyer at Towards Data Science

How to identify reading level scores using Python

Image by Evgeni Tcherkasski from Pixabay

When marketing effectiveness analytics are developed, the reading level of the content is often overlooked. Should it be part of your analysis or your model? If you decide it should, how can you easily tag your documents or creatives with their current reading level? Today I will review how to use Python to run a reading level analysis and create a report. We will focus on English as the primary language. The same technique can be used to check the readability of your other content, such as Medium blogs.

Why readability is important

Who is your reader?

What to do if you aren’t entirely sure of your audience’s education level? Some industry rules of thumb can help.

Written text that scores at the 4th- to 6th-grade level is commonly considered ‘easy to read.’ If you are trying to explain complex concepts simply for wide audiences, this is your range. Think general blogs and educational material.

Text that scores at the 7th- to 9th-grade level is considered of average difficulty. These readers expect some more complex vocabulary, and they expect that they may need to re-read a section of text to understand it fully. How-to articles may fall into this range.

Any text rated 10th grade and above is considered very difficult. Reserve this level for white papers, technical writing, or literature, when you are sure that your audience is ‘up’ for it or expects it. Your reader is devoting quite a bit of mental effort to reading and absorbing your content. You know the feeling of reading a book that exhausts you? This is that range.
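The rules of thumb above can be turned into a small tagging helper. This is a minimal sketch: the function name `difficulty_band` is hypothetical, and the handling of edge cases is my assumption (fractional scores are grouped with their whole grade, so 6.9 counts as 6th grade, and anything below 4th grade is also treated as easy).

```python
def difficulty_band(grade: float) -> str:
    """Map a grade-level score to the rule-of-thumb bands described above.

    Assumption: a fractional score belongs to its whole grade (6.9 -> 6th),
    and scores below 4th grade are also labeled easy.
    """
    if grade < 7:
        return "easy to read"        # 6th grade and below
    if grade < 10:
        return "average difficulty"  # 7th to 9th grade
    return "very difficult"          # 10th grade and above
```

A label like this could be stored next to each document's score, which makes it easier to filter content by audience later.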

What are the limitations?

The Code

The script below automates the common tasks: it reads in all the text and Word documents in a folder and then scores them. For more information on how to scrape texts across many folders and data sources, please see this related article.

There are several packages you can use in your analysis, including textstat and readability. I will use textstat in this example. It is straightforward to use.

import glob       # used to read in many files
import os

import docx2txt   # because I'm reading in Word docs
import textstat   # https://pypi.org/project/textstat/

# Where is the folder with your content?
folderPath = '<path to your top folder>'

# I want to search in all of the folders, so I set recursive=True
docs = glob.glob(folderPath + '/**/*.txt', recursive=True)
docs.extend(glob.glob(folderPath + '/**/*.docx', recursive=True))
# ... keep adding whatever types you need
print(docs)

# The language is English by default, so no need to set the language

# Loop through my docs
for doc in docs:
    if os.path.isfile(doc):
        # glob already returns paths that include folderPath, so use doc as-is
        if doc.lower().endswith('.docx'):
            text = docx2txt.process(doc)
        else:
            # Plain text files are read directly (assumes UTF-8 encoding)
            with open(doc, encoding='utf-8') as f:
                text = f.read()

        print('Document: ' + doc)
        print('Flesch Reading Ease: ' + str(textstat.flesch_reading_ease(text)))
        print('Smog Index: ' + str(textstat.smog_index(text)))
        print('Flesch Kincaid Grade: ' + str(textstat.flesch_kincaid_grade(text)))
        print('Coleman Liau Index: ' + str(textstat.coleman_liau_index(text)))
        print('Automated Readability Index: ' + str(textstat.automated_readability_index(text)))
        print('Dale Chall Readability Score: ' + str(textstat.dale_chall_readability_score(text)))
        print('Difficult Words: ' + str(textstat.difficult_words(text)))
        print('Linsear Write Formula: ' + str(textstat.linsear_write_formula(text)))
        print('Gunning Fog: ' + str(textstat.gunning_fog(text)))
        print('Text Standard: ' + str(textstat.text_standard(text)))
        print('*********************************************************************************')

"""Flesch-Kincaid Grade Level = This is a grade formula in that a score of 9.3 means that a ninth grader would be able to read the document."""
"""Gunning Fog = This is a grade formula in that a score of 9.3 means that a ninth grader would be able to read the document."""
"""SMOG (for 30 sentences or more) = This is a grade formula in that a score of 9.3 means that a ninth grader would be able to read the document."""
"""Automated Readability Index = Returns the ARI (Automated Readability Index), which outputs a number that approximates the grade level needed to comprehend the text."""
"""Coleman Liau Index = Returns the grade level of the text using the Coleman-Liau Formula."""
"""Linsear = Returns the grade level using the Linsear Write Formula."""
"""Dale Chall = Different from the other tests, since it uses a lookup table of the 3000 most commonly used English words. It returns the grade level using the New Dale-Chall Formula."""

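To make the grade formulas less of a black box, here is a minimal sketch of the two Flesch formulas computed from raw counts. The coefficients are the published ones; the function names and the choice to pass in precomputed word, sentence, and syllable counts are my own, and textstat's counting rules may give slightly different numbers on real text.

```python
def reading_ease_from_counts(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease: higher scores mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def kincaid_grade_from_counts(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level: approximates a U.S. school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Hypothetical counts for illustration: 100 words, 5 sentences, 150 syllables
print(round(reading_ease_from_counts(100, 5, 150), 2))
print(round(kincaid_grade_from_counts(100, 5, 150), 2))
```

Both formulas depend only on average sentence length and average syllables per word, which is why long sentences and long words push the grade up so quickly.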
Here are the results for my four documents. The first is this blog. The second is this blog with the links removed. The third is a USAToday article with the header and photos removed, and the fourth is the original USAToday article.

Document:                     C:...readabilitysampleblog.docx
Flesch Reading Ease:          56.59  (Fairly Difficult)
Smog Index:                   13.8
Flesch Kincaid Grade:         11.1
Coleman Liau Index:           11.9
Automated Readability Index:  14.1
Dale Chall Readability Score: 7.71 (average 9th or 10th-grade student)
Difficult Words:              124
Linsear Write Formula:        10.833333333333334
Gunning Fog:                  12.86
Text Standard:                10th and 11th grade
*********************************************************************************
Document:                     C:...readabilitysampleblognolinks.docx
Flesch Reading Ease:          58.52 (Fairly Difficult)
Smog Index:                   12.9
Flesch Kincaid Grade:         10.3
Coleman Liau Index:           10.5
Automated Readability Index:  12.2
Dale Chall Readability Score: 7.48
Difficult Words:              101
Linsear Write Formula:        10.833333333333334
Gunning Fog:                  11.95
Text Standard:                10th and 11th grade
*********************************************************************************
Document:                     C:...readabilityusatoday article no header photos.docx
Flesch Reading Ease:          21.47 (Very Confusing)
Smog Index:                   19.8
Flesch Kincaid Grade:         24.6
Coleman Liau Index:           13.19
Automated Readability Index:  32.2
Dale Chall Readability Score: 9.49
Difficult Words:              317
Linsear Write Formula:        16.25
Gunning Fog:                  27.01
Text Standard:                24th and 25th grade
*********************************************************************************
Document:                     C:...readabilityusatoday article.docx
Flesch Reading Ease:          21.47 (Very Confusing)
Smog Index:                   19.8
Flesch Kincaid Grade:         24.6
Coleman Liau Index:           13.19
Automated Readability Index:  32.2
Dale Chall Readability Score: 9.49
Difficult Words:              317
Linsear Write Formula:        16.25
Gunning Fog:                  27.01
Text Standard:                24th and 25th grade
*********************************************************************************
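Printed output is fine for a quick look, but a report is easier to reuse as a table. The sketch below is one way to do that, not part of the original script: `build_rows` and `write_report` are hypothetical helper names, and the scoring functions are passed in as a dictionary so the helper does not depend on any one package.

```python
import csv
from typing import Callable, Dict, List


def build_rows(texts: Dict[str, str],
               scorers: Dict[str, Callable[[str], object]]) -> List[dict]:
    """Return one dict per document: its name plus one entry per score."""
    rows = []
    for name, text in texts.items():
        row = {"Document": name}
        for label, score in scorers.items():
            row[label] = score(text)
        rows.append(row)
    return rows


def write_report(rows: List[dict], csv_path: str) -> None:
    """Write the rows to a CSV file, one column per key of the first row."""
    if not rows:
        return  # nothing to write
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
```

With textstat installed, `scorers` could be something like `{'Flesch Kincaid Grade': textstat.flesch_kincaid_grade, 'Gunning Fog': textstat.gunning_fog}`, with `texts` holding each document's text keyed by its file path.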

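For this blog, removing the links changed the scores: the Flesch Kincaid Grade dropped from 11.1 to 10.3, and the Difficult Words count dropped from 124 to 101. The USAToday article was NOT affected by removing the header and photos; every score is identical.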
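If links inflate your scores, you may want to strip them before scoring. The following is a minimal sketch, assuming the links show up as bare http(s) or www URLs in the extracted text; the name `strip_urls` and the pattern are my own, and hyperlinks that carry display text in a Word document would need different handling.

```python
import re

# Assumption: a "link" is a bare http(s):// or www. URL. The final character
# class keeps trailing sentence punctuation out of the match so that
# sentence counts are not disturbed.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S*[^\s.,;:!?)\]]")


def strip_urls(text: str) -> str:
    """Remove URLs and collapse the extra spaces they leave behind."""
    without_urls = URL_PATTERN.sub(" ", text)
    # Collapse runs of spaces/tabs only, so line breaks are preserved
    return re.sub(r"[ \t]{2,}", " ", without_urls).strip()
```

The cleaned text can then be passed to the same scoring functions, which lets you compare scores with and without links in a single run.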

References and Resources

  • Editorial: Evidence-based Guidelines for Avoiding Poor Readability in Manuscripts
  • Kaggle:

There was an interesting competition in 2019 in which I applied readability scores to improve the City of LA job postings. You can see how the competitors used a variety of techniques.

Conclusion
