Reading Complexity of Indian Language Texts

Overview

Reading plays an important role in the process of learning and knowledge acquisition for both children and adults. However, not all texts are accessible to every reader. Reading difficulties can arise when there is a mismatch between a reader’s reading proficiency and the complexity of the text they read.

Extensive research has been done, and many methods developed, for calculating the reading complexity of English text. Popular methods include the Dale-Chall Readability Formula, Flesch-Kincaid Readability, the SMOG Readability Formula, the Gunning Fog Index and the COLE Readability Formula. Most of these widely used English readability measures focus on word-level and sentence-level surface features such as word length, syllable count per word, word count, words per sentence and sentences per paragraph. Some of them also consider word familiarity, based on how frequently a word occurs. There are also commercial tools, such as Lexile, that measure the reading complexity of a text.

These popular measures give minimal emphasis to linguistic features such as the lexical, phonological and orthographic aspects of words, or to sentence syntax.

While several methods have been developed for English, there are no established models for measuring the complexity and readability of Indian language texts. Many studies have tried applying English readability models to Indian languages. However, research has shown that the methods used for English may not be well suited to Indian languages, because their linguistic features differ. English is primarily an analytic language, in which grammatical relationships are indicated using auxiliary words and word order. Most Indian languages, on the other hand, are inflectional or agglutinative, which means that grammatical relationships are expressed as part of the word morphology itself. Indian languages also differ considerably from English in their phonological and orthographic features.

Hence, in the context of Indian languages, it is important to develop methods that can identify the reading complexity at a word level. Word level complexity becomes even more important for the early language learning stage, which has a strong focus on reading and understanding of words.

Considering the above, this document describes a new method for calculating word complexity that can be applied to most Indian languages. The following sections briefly describe the possible applications of the proposed complexity measure, its linguistic basis and the method of calculating it.

Possible Applications

There are several possible applications of the word complexity measures. Because these measures are specific to Indian languages, they can support the learning of those languages. Some of the possible applications are:

Identify the most complex words in a given text. This can in turn help in many ways.

The complex words can be highlighted to content creators while they create learning content, so that they can consider alternatives where the targeted learner level requires it.
The learning content can provide additional scaffolding (such as word meanings) for the complex words.

Calculate the complexity of a text based on the word complexity.

This can help content creators create learning content suitable for a given learning level.
The same measure can also be used to assess the learning levels of learners.
It can also help find a suitable set of content for a given learning level.

Use as one of the features in advanced statistical models. Word complexity calculated specifically for Indian languages can be a useful feature in advanced statistical or deep learning models for further analysis of Indian language text.
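The first two applications can be sketched in Python as follows. This is only a minimal illustration: the function names `text_complexity` and `most_complex_words` are hypothetical, the word-level scorer is passed in as a function, and averaging the word scores is just one assumed way of aggregating them into a text score.

```python
from typing import Callable, Sequence


def text_complexity(words: Sequence[str], score: Callable[[str], float]) -> float:
    """Mean word complexity of a text (assumed aggregation; 0.0 for an empty text)."""
    if not words:
        return 0.0
    return sum(score(w) for w in words) / len(words)


def most_complex_words(
    words: Sequence[str], score: Callable[[str], float], top_n: int = 3
) -> list[str]:
    """Distinct words ranked from most to least complex (ties broken alphabetically)."""
    ranked = sorted(set(words), key=lambda w: (-score(w), w))
    return ranked[:top_n]


# Stand-in scorer for demonstration only: the length of a romanized word.
# A real scorer would use the phonological and orthographic complexity of the word.
demo_words = ["ka", "stree", "ka", "amma"]
print(text_complexity(demo_words, len))
print(most_complex_words(demo_words, len, top_n=2))
```

The ranked list is what a content-creation tool could highlight, and the text-level score is what could be compared against a target learning level.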

The Linguistic Basis

Phonology

Most of the major Indian languages have a very similar phonological structure. The sounds, classified as vowels and consonants, are very similar across all Indian languages.

There are up to 16 vowels, including short and long vowels and diphthongs: for example, the short and long “a”, the short and long “e”, and diphthongs such as “ai” and “au”.

There are up to 43 consonants, starting with “ka” and going up to “ha”. They are classified into categories such as retroflex, dental, palatal and aspirated. A combination of zero or more consonants and a vowel forms an “Akshara” (a syllable), which is the basic unit of a word.
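As a rough illustration of this definition, the sketch below splits a word written in an Indian script into Aksharas using Unicode character properties. It is a simplification under stated assumptions, not a complete segmenter: the function name `split_aksharas` is hypothetical, a consonant that follows a virama (halant) is assumed to join the current Akshara, and every combining mark is assumed to belong to the letter before it.

```python
import unicodedata

VIRAMA_CLASS = 9  # Unicode canonical combining class shared by virama (halant) signs


def split_aksharas(word: str) -> list[str]:
    """Split a word into Aksharas: consonant cluster plus vowel sign and marks."""
    aksharas: list[str] = []
    prev_is_virama = False
    for ch in word:
        is_letter = unicodedata.category(ch).startswith("L")
        # A letter starts a new Akshara unless it is joined by a preceding virama.
        if (is_letter and not prev_is_virama) or not aksharas:
            aksharas.append(ch)
        else:
            aksharas[-1] += ch
        prev_is_virama = unicodedata.combining(ch) == VIRAMA_CLASS
    return aksharas


# Telugu "sthree": SA + virama + TA + virama + RA + vowel sign II (one conjunct Akshara)
print(split_aksharas("\u0C38\u0C4D\u0C24\u0C4D\u0C30\u0C40"))
# Telugu word "telugu": TA + E, LA + U, GA + U (three Aksharas)
print(split_aksharas("\u0C24\u0C46\u0C32\u0C41\u0C17\u0C41"))
```

Because the logic depends only on Unicode properties, the same sketch also applies to Devanagari and other Brahmi-derived scripts, although edge cases such as word-final consonants with a virama would need language-specific handling.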

The phonology of the Akshara was systematically organized by the grammarian Panini, who is generally dated to between the 6th and 4th centuries BCE, using a detailed set of phonetic features.

The complexity of pronunciation varies across Aksharas based on these phonetic features. Certain Aksharas (called “samyukta”) consist of more than one consonant. These are generally more difficult to pronounce than Aksharas with only one consonant or none. The difficulty of pronunciation also varies between consonants: aspirated consonants (like “kha”, “Tha”) are more difficult to pronounce than unaspirated consonants (like “ka”, “Ta”). Sounds such as the vocalic “R” (as in Rishi) and the “visarga” (the h sound as in duhkha) are also more difficult to pronounce.

The overall complexity of pronouncing and reading a word hence depends on the nature of Aksharas in it. Due to the similarity of the phonological structure across Indian languages, the same approach would work across all of them in calculating the phonological complexity of a word.

Orthography

While each of the major Indian languages has its own script, most of these scripts, including Devanagari, descend from the ancient Brahmi script. Hence they share common features such as:

Most of the Indian scripts are alphasyllabic, meaning that each written unit represents a syllable rather than a single sound. Hence the phonological “Akshara” has a one-to-one correspondence with the orthographic “character” in most cases.
Most of the Indian scripts have compound letters, or letters that are made up of multiple parts, called conjuncts. These are represented using ligatures.
Most of the Indian scripts use diacritics, or small marks placed above, below, or to the left or right of the letters, to indicate the presence of vowels.

The following shows the orthography of the Telugu letter “sthree” (meaning woman):

Since Indian letters encapsulate a complex set of modifiers for the vowel and consonant parts of the syllable, the difficulty of recognizing and reading them varies from letter to letter.
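One way to see these modifiers is to list the Unicode code points that make up a single letter. The sketch below does this for the Telugu letter “sthree” using Python's standard `unicodedata` module; the helper name `describe_akshara` is hypothetical, and the Unicode encoding is used here only as a convenient stand-in for the visual parts of the letter.

```python
import unicodedata


def describe_akshara(akshara: str) -> list[tuple[str, str]]:
    """Return (code point, Unicode name) pairs for each part of an Akshara."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")) for ch in akshara]


# Telugu "sthree": three consonants joined by two viramas, plus one vowel sign.
sthree = "\u0C38\u0C4D\u0C24\u0C4D\u0C30\u0C40"
for code_point, name in describe_akshara(sthree):
    print(code_point, name)
```

A single written letter here carries six encoded parts (three consonants, two joining signs and a vowel sign), which is the kind of variation the orthographic measure is meant to capture.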

Word Complexity Calculation

Given the linguistic properties of Indian languages described in the previous section, the complexity of a word can be defined and calculated from the phonological and orthographic features of its Aksharas.

The following diagram depicts some of the key dimensions that can be considered in calculating the complexity:

Phonological Complexity

Phonological complexity is measured based on the complexity involved in reading (pronouncing) the Aksharas in a word. This measure will be computed by using a defined set of phonetic features of an Akshara derived from the Phonology of Akshara defined by Panini.

The steps to calculate the phonological complexity of a word in a given language are as follows:

A vector of phonetic features is created for each Akshara in the given language.
A complexity weightage is given to each of these phonetic features.
The phonological complexity of an Akshara is then calculated as a weighted average of the complexities of its phonetic features.
The phonological complexity of the word is calculated as the sum of the phonological complexities of all its Aksharas.
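The steps above can be sketched as follows. The feature names, the weightages and the romanized Akshara keys are all hypothetical placeholders rather than the actual Telugu feature vectors, and the weighted average is read here as sum(weight × value) / sum(weights), which is one possible interpretation.

```python
# Hypothetical weightages per phonetic feature (to be tuned per language).
PHONETIC_WEIGHTS = {"aspirated": 2.0, "conjunct": 3.0, "long_vowel": 1.0}

# Hypothetical feature vectors keyed by romanized Akshara (1 = feature present).
PHONETIC_FEATURES = {
    "ka": {"aspirated": 0, "conjunct": 0, "long_vowel": 0},
    "khaa": {"aspirated": 1, "conjunct": 0, "long_vowel": 1},
    "stree": {"aspirated": 0, "conjunct": 1, "long_vowel": 1},
}


def akshara_complexity(features: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of the feature values: sum(w * x) / sum(w)."""
    total_weight = sum(weights.values())
    return sum(weights[name] * features.get(name, 0) for name in weights) / total_weight


def word_complexity(
    aksharas: list[str],
    feature_table: dict[str, dict[str, float]],
    weights: dict[str, float],
) -> float:
    """Sum of the phonological complexities of the Aksharas in the word."""
    return sum(akshara_complexity(feature_table[a], weights) for a in aksharas)


# A made-up word of two Aksharas, "ka" + "stree", for illustration only.
print(word_complexity(["ka", "stree"], PHONETIC_FEATURES, PHONETIC_WEIGHTS))
```

With these placeholder features a plain Akshara such as “ka” scores zero; a real feature set would likely include a base feature so that every Akshara contributes to the word total.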

Here is the set of vectors of phonetic features for Telugu Aksharas as a reference.

Orthographic Complexity

Orthographic complexity is the complexity involved in writing or recognizing an Akshara, i.e. the script complexity. This is mainly measured by the number of consonants in the Akshara as well as the various vowel and consonant markers used in it. This measure is computed by using a defined set of orthographic features of an Akshara derived based on the script of a language.

The steps to calculate the orthographic complexity of a word in a given language are as follows:

A vector of orthographic features is created for each Akshara in the given language.
A complexity weightage is given to each of these orthographic features.
The orthographic complexity of an Akshara is then calculated as a weighted average of the complexities of its orthographic features.
The orthographic complexity of the word is calculated as the sum of the orthographic complexities of all its Aksharas.
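For the orthographic measure, some features can be approximated directly from the Unicode encoding of an Akshara. The sketch below is an assumption-laden illustration: the four feature names and their weightages are hypothetical, base letters are counted without separating consonants from independent vowels, and a virama is taken as a proxy for a conjunct join.

```python
import unicodedata

VIRAMA_CLASS = 9  # canonical combining class shared by virama (halant) signs

# Hypothetical weightages per orthographic feature (to be tuned per script).
ORTHO_WEIGHTS = {"base_letters": 1.0, "viramas": 2.0, "vowel_signs": 1.0, "other_marks": 1.0}


def orthographic_features(akshara: str) -> dict[str, int]:
    """Count the orthographic parts of one Akshara from its code points."""
    counts = dict.fromkeys(ORTHO_WEIGHTS, 0)
    for ch in akshara:
        category = unicodedata.category(ch)
        if unicodedata.combining(ch) == VIRAMA_CLASS:
            counts["viramas"] += 1
        elif "VOWEL SIGN" in unicodedata.name(ch, ""):
            counts["vowel_signs"] += 1
        elif category.startswith("M"):
            counts["other_marks"] += 1  # e.g. anusvara, visarga, nukta
        elif category.startswith("L"):
            counts["base_letters"] += 1
    return counts


def orthographic_complexity(akshara: str) -> float:
    """Weighted average of the feature counts: sum(w * x) / sum(w)."""
    counts = orthographic_features(akshara)
    total_weight = sum(ORTHO_WEIGHTS.values())
    return sum(ORTHO_WEIGHTS[name] * counts[name] for name in ORTHO_WEIGHTS) / total_weight


def word_orthographic_complexity(aksharas: list[str]) -> float:
    """Sum of the orthographic complexities of the Aksharas in the word."""
    return sum(orthographic_complexity(a) for a in aksharas)


sthree = "\u0C38\u0C4D\u0C24\u0C4D\u0C30\u0C40"  # Telugu "sthree"
print(orthographic_features(sthree))
print(orthographic_complexity(sthree))
```

Deriving the counts from code points avoids hand-building a table for every Akshara, but hand-curated feature vectors may still be needed where the visual form of a letter differs from its encoding.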

Here is the set of vectors of orthographic features for Telugu Aksharas as a reference.

Note:

The creation of phonetic and orthographic feature vectors of Aksharas is a one-time activity for each language.
Arriving at a complexity weightage for each of the phonetic and orthographic features is also a one-time activity. However, these weightages can be fine-tuned based on experiments, usage and data analysis.
The calculation of phonetic and orthographic complexities can be automated through a software program based on the above logic.

Other Dimensions of Word Complexity

Phonological and orthographic complexities primarily capture the difficulty of recognizing and pronouncing a word. Other dimensions of a word, such as semantics, morphology and syntax, can also make it complex to read and understand.

The semantic complexity can be calculated based on the features such as:

The frequency of the word's occurrence - More frequently occurring words are typically easier to understand (these are called Threshold words).
Words having multiple meanings - A word with multiple meanings is more complex to understand because its meaning depends on the context.
Loan words - In many Indian languages, certain words are borrowed from other languages such as Sanskrit. Such words can be challenging to understand, as they are not native to the language.
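The frequency feature can be illustrated with a small sketch. The scoring formula below (one minus a log-scaled relative frequency) is an assumed choice made for illustration, not a formula from the method described here, and the romanized tokens stand in for a real corpus of the target language.

```python
import math
from collections import Counter


def frequency_complexity(word: str, counts: Counter) -> float:
    """Score in [0, 1]: 0.0 for the most frequent word, 1.0 for an unseen word."""
    if not counts:
        raise ValueError("counts must contain at least one word")
    max_count = max(counts.values())
    return 1.0 - math.log1p(counts.get(word, 0)) / math.log1p(max_count)


# Placeholder tokens standing in for a corpus (not real frequency data).
corpus_counts = Counter(["illu", "illu", "illu", "amma", "amma", "pustakam"])

for token in ["illu", "amma", "pustakam", "vidyalayam"]:
    print(token, round(frequency_complexity(token, corpus_counts), 3))
```

The logarithm keeps very common words from dominating the scale; a plain relative frequency or a rank-based score would be equally reasonable alternatives.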

Morphological measures are related to the structure of the word. These are mostly grammatical measures which are computed based on how the word is structured. Some of the known morphologies of words in Indian languages are:

Sandhi words (joined words)
Samasa words (compound words)
Derivational and inflectional morphology

Each type of morphology will add a complexity measure to the word.

Syntactic measures are primarily based on the parts of speech of a word. Different parts of speech would have different levels of complexity to understand. For example:

Abstract nouns like ‘Ferocious Animal’ have more complexity than specific nouns like ‘Tiger’. The abstractness of the word can be computed by the number of hypernyms and hyponyms related to the word.
Adjectives will have a higher complexity than simple nouns.
There are 7 senses of verbs, each with a different complexity. A different complexity measure can be assigned to the word based on the sense of the verb.

Calculating these additional measures requires WordNets and NLP parsers to automatically detect the various features involved in these measures.

Empirical Studies and Experiments

The following are some of the empirical studies and experiments conducted at EkStep using the phonological and orthographic complexities.

Analysis of Telugu Textbooks of Class 1 to 5

Lessons in Telugu textbooks of Class 1 to 5 were analyzed for their text complexity using the phonological and orthographic complexities. The key observations of this study are:

Lesson complexity increases gradually as the class level rises.
A Class 5 lesson is as complex as a typical news article.
A sizable set of difficult words increases the overall complexity of a lesson. These are the words to focus on while teaching.

Ad hoc experiments were also conducted with a few children using the lessons in these textbooks. These provided empirical evidence that word complexity is correlated with reading difficulty.

Analysis of Pratham Paragraphs with Bihar Children

A sample set of 10 short Hindi paragraphs of different complexity levels from Pratham was evaluated with a small group of 5 children who are at the Paragraph level as per ASER. A strong correlation was observed between the text complexity of a paragraph and reading fluency (words per second). The correlation is stronger for children with a lower overall fluency level than for children with a higher fluency level.

Sensing Errors in Reading

A sensing experiment was conducted with 19,000 children to measure their reading fluency on a Hindi paragraph of 59 words. The most frequently misread words were identified and compared with their phonological complexities. There was a significant correlation between how often a word was misread and its complexity.

Further Exploration

Key activities to further use these complexity measures for a given language are:

Calculate word complexities of words that occur in texts of that language across various learning levels - like Preschool, Class 1 to 10. The complexities can be calculated both at word level and text level. This provides a statistical measure to estimate the text complexity level for a given learning level.
Collect sufficient data from learners on their reading fluency with texts in the given language. Correlate this data with the calculated word and text complexities, and test whether the complexities predict reading fluency well. The weightages can be fine-tuned based on this analysis if required.
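The correlation step could look like the sketch below, which computes the Pearson correlation coefficient between text complexities and measured fluency. The helper name `pearson_r` is hypothetical, Pearson's coefficient is an assumed choice of statistic, and the numbers are placeholders chosen to make the example run, not measurements from any study.

```python
import math


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Placeholder values only: one complexity score and one fluency value per text.
text_complexities = [1.0, 2.0, 3.0, 4.0]
fluency_words_per_sec = [8.0, 6.0, 4.0, 2.0]
print(pearson_r(text_complexities, fluency_words_per_sec))
```

A coefficient close to -1 would indicate that fluency drops as complexity rises; with real data, a rank correlation may be a safer choice if the relationship is not linear.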

Joomla 10 March 2023