70 likes | 91 Views
Can we accurately classify English text at the appropriate reading level? This research uses statistical language models and a dataset of English novels from different reading lists to explore this problem. The approach involves building language models for each reading level and classifying new text based on the most likely generating model. Results show that language models can capture information that helps differentiate between reading levels. Additionally, using these models as features in a multinomial logistic regression model improves performance. Future work includes exploring higher-order language models and investigating language model overfitting.
E N D
Classifying Reading Levels with Statistical Language Models Johnson Hsieh Sameer Shariff
Problem • Given a passage of English text, can we classify the text at the appropriate reading level? • Dataset – English novels from various reading lists • 5th-7th grade - 15 books • Tom Sawyer, Black Beauty, etc. • 8th-10th grade - 16 books • A Tale of Two Cities, The Call of the Wild, etc. • 11th-12th grade - 17 books • Pride and Prejudice, The Awakening, etc.
Approach • Build language models for each class • Classify new text based on model that was most likely to generate this text (generative model) • Model 1 • Classify text based purely on these language models with some interesting smoothing techniques • Model 2 • Build a discriminative multinomial logistic regression model that uses these language models as just one of many features
Data Separability • A hard problem
Language Model Results Accuracy = (# of books predicted correctly)/(total # of books) Weighted Accuracy = ((# predicted correctly) + 0.5 * (# off by one))/(total # of books)
Multinomial Logistic Regression Results • Without language model: • With language model:
Conclusions and Future Work • Statistical language models do capture information that can help differentiate between different reading levels, better than traditional measures such as Flesch-Kinkaid • Multinomial logistic regression models with additional features outperform the pure language model approach, though using the language model as a feature greatly improves performance • Future Work • Explore higher order language models • Investigate language model overfitting