This study explores the classification of English text passages into reading levels using statistical language models. The dataset consists of 48 English novels drawn from reading lists for three grade bands: 5th-7th, 8th-10th, and 11th-12th grade. Two models were developed: a generative model that classifies text purely from per-class language models, and a discriminative multinomial logistic regression model that uses the language models as one of many features. Results indicate that statistical language models differentiate reading levels better than traditional measures such as Flesch-Kincaid. Future work will explore higher-order language models and investigate language model overfitting.
Classifying Reading Levels with Statistical Language Models
Johnson Hsieh, Sameer Shariff
Problem
• Given a passage of English text, can we classify the text at the appropriate reading level?
• Dataset – English novels from various reading lists
  • 5th-7th grade - 15 books
    • Tom Sawyer, Black Beauty, etc.
  • 8th-10th grade - 16 books
    • A Tale of Two Cities, The Call of the Wild, etc.
  • 11th-12th grade - 17 books
    • Pride and Prejudice, The Awakening, etc.
Approach
• Build a language model for each class
• Classify new text based on the model most likely to have generated it (generative model)
• Model 1
  • Classify text based purely on these language models with some interesting smoothing techniques
• Model 2
  • Build a discriminative multinomial logistic regression model that uses these language models as just one of many features
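The generative step in Model 1 can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes unigram models, a whitespace tokenizer, and add-one (Laplace) smoothing, since the slides do not say which model order or smoothing techniques were used. All function names, class labels, and training sentences are hypothetical.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()

def train_unigram(texts):
    """Count word occurrences over all training texts of one class."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts

def log_likelihood(text, counts, vocab_size):
    """Log-probability of text under an add-one smoothed unigram model."""
    total = sum(counts.values())
    return sum(
        math.log((counts[word] + 1) / (total + vocab_size))
        for word in tokenize(text)
    )

def classify(text, models):
    """Return the label of the class model most likely to generate text."""
    vocab = set()
    for counts in models.values():
        vocab.update(counts)
    vocab_size = len(vocab) + 1  # +1 reserves probability mass for unseen words
    return max(
        models,
        key=lambda label: log_likelihood(text, models[label], vocab_size),
    )

# Toy training data (made up for illustration; not from the study's corpus).
models = {
    "5-7": train_unigram([
        "the cat sat on the mat",
        "the dog ran to the boy",
    ]),
    "11-12": train_unigram([
        "her sentiments regarding the estate were unaltered",
        "the propriety of his conduct was questioned",
    ]),
}

predicted = classify("the dog sat on the mat", models)
```

Working in log space avoids numerical underflow on long passages, and sharing one vocabulary size across classes keeps the per-class likelihoods comparable.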
Data Separability
• A hard problem
Language Model Results
Accuracy = (# of books predicted correctly) / (total # of books)
Weighted Accuracy = ((# predicted correctly) + 0.5 * (# off by one)) / (total # of books)
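The two metrics above can be computed as in the sketch below. It assumes the three reading levels are encoded as ordinal integers (0, 1, 2) and that "off by one" means the predicted level is adjacent to the true level; both are assumptions for illustration, as are the function names.

```python
def accuracy(true_levels, predicted_levels):
    """Fraction of books whose predicted level equals the true level."""
    correct = sum(t == p for t, p in zip(true_levels, predicted_levels))
    return correct / len(true_levels)

def weighted_accuracy(true_levels, predicted_levels):
    """Full credit for exact matches, half credit for adjacent-level misses."""
    pairs = list(zip(true_levels, predicted_levels))
    correct = sum(t == p for t, p in pairs)
    off_by_one = sum(abs(t - p) == 1 for t, p in pairs)
    return (correct + 0.5 * off_by_one) / len(pairs)

# Hypothetical labels: 0 = 5th-7th, 1 = 8th-10th, 2 = 11th-12th grade.
true_levels = [0, 1, 2, 2]
predicted_levels = [0, 2, 0, 2]

acc = accuracy(true_levels, predicted_levels)            # 2 of 4 exact
w_acc = weighted_accuracy(true_levels, predicted_levels)  # 2 exact + 1 adjacent
```

With three ordered classes, weighted accuracy separates a near miss (8th-10th predicted as 11th-12th) from a two-level miss (11th-12th predicted as 5th-7th), which plain accuracy treats identically.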
Multinomial Logistic Regression Results
• Without language model:
• With language model:
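For readers unfamiliar with the model behind these results, here is a minimal sketch of multinomial logistic regression (softmax regression) trained by stochastic gradient descent. It is not the authors' setup: the slides do not list their features, so the feature vectors below are made up, with the first three entries standing in for per-class language model scores rescaled to [0, 1] and the last entry a constant bias term. Function names and hyperparameters are likewise assumptions.

```python
import math

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_proba(weights, features):
    """Class probabilities given one weight vector per class."""
    scores = [sum(w * x for w, x in zip(row, features)) for row in weights]
    return softmax(scores)

def train(examples, n_classes, learning_rate=0.5, epochs=200):
    """Fit weights by SGD on the multinomial cross-entropy loss."""
    n_features = len(examples[0][0])
    weights = [[0.0] * n_features for _ in range(n_classes)]
    for _ in range(epochs):
        for features, label in examples:
            probs = predict_proba(weights, features)
            for k in range(n_classes):
                error = probs[k] - (1.0 if k == label else 0.0)
                for j in range(n_features):
                    weights[k][j] -= learning_rate * error * features[j]
    return weights

def predict(weights, features):
    """Index of the most probable class."""
    probs = predict_proba(weights, features)
    return probs.index(max(probs))

# Made-up training examples: ([score_low, score_mid, score_high, bias], label).
examples = [
    ([0.9, 0.2, 0.1, 1.0], 0), ([0.8, 0.3, 0.2, 1.0], 0),
    ([0.2, 0.9, 0.3, 1.0], 1), ([0.3, 0.8, 0.2, 1.0], 1),
    ([0.1, 0.2, 0.9, 1.0], 2), ([0.2, 0.3, 0.8, 1.0], 2),
]
weights = train(examples, n_classes=3)
predictions = [predict(weights, features) for features, _ in examples]
```

Adding or removing the language model scores from the feature vector is the kind of comparison the two bullets above report.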
Conclusions and Future Work
• Statistical language models do capture information that helps differentiate between reading levels, better than traditional measures such as Flesch-Kincaid
• Multinomial logistic regression models with additional features outperform the pure language model approach, though using the language model as a feature greatly improves performance
• Future Work
  • Explore higher-order language models
  • Investigate language model overfitting
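For context on the traditional baseline named in the conclusions, the standard Flesch-Kincaid grade level depends only on average sentence length and average syllables per word. The sketch below takes the counts as inputs, because syllable counting is itself a heuristic step that the slides do not describe; the function name and the example counts are hypothetical.

```python
def flesch_kincaid_grade(n_words, n_sentences, n_syllables):
    """Standard Flesch-Kincaid grade level from raw text counts."""
    words_per_sentence = n_words / n_sentences
    syllables_per_word = n_syllables / n_words
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59

# Hypothetical passage: 100 words, 5 sentences, 150 syllables.
grade = flesch_kincaid_grade(100, 5, 150)
```

Because the formula sees only these two surface ratios, it cannot use vocabulary or word-order information, which is the kind of signal a language model can capture.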