
Name Extraction from Chinese Novels


Presentation Transcript


  1. Name Extraction from Chinese Novels CS224n Spring 2008 Jing Chen and Raylene Yung

  2. Problem • Given a Chinese novel, extract the names of people and locations • Differs from English NER: Chinese text has no whitespace within sentences and no capitalization cues • Other domain-specific characteristics can be exploited since the domain is limited

  3. System Outline • Extract bigrams, trigrams, and quadrigrams from text • Run logistic regression on extracted features to learn feature weights • Use weights to compute a score for each n-gram • Apply thresholding to limit the number of guessed names • Use word lists from word segmenter and dictionary • Compare output list to correct list for F1 score
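The first step of the outline — extracting bigrams, trigrams, and quadrigrams — can be sketched as follows. The slides give no implementation, so this is a minimal illustration; the sample text and the character names in it (郭靖, 黄蓉) are hypothetical stand-ins for novel text.

```python
from collections import Counter

def extract_ngrams(text, ns=(2, 3, 4)):
    """Count every character n-gram (bigram, trigram, quadrigram) in the text.

    Chinese has no word boundaries, so candidates are taken over raw
    characters rather than whitespace-delimited tokens.
    """
    counts = Counter()
    for n in ns:
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts

# Hypothetical snippet of novel text; a recurring name shows up as a
# repeated bigram.
counts = extract_ngrams("郭靖见黄蓉郭靖笑")
```

Each candidate n-gram would then be scored with the learned feature weights, and the threshold decides which candidates are emitted as names.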

  4. Features • N-gram and segmented word counts • Ratio of count of n-gram to (n-1)-gram • Transliterated characters • Prefixes and suffixes • Segmented words and dictionary • Mutual information
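Two of the count-based features above can be made concrete. The slides don't define the exact formulations, so this sketch assumes the frequency ratio is count(n-gram)/count(its prefix) and reads "mutual information" as pointwise mutual information between the first character and the remainder; the counts below are invented for illustration.

```python
import math
from collections import Counter

def freq_ratio(ngram, counts):
    """Count of the n-gram divided by the count of its (n-1)-gram prefix.

    Names recur as fixed units, so for a real name this ratio tends
    toward 1; for an accidental character pairing it is much lower.
    """
    prefix = counts[ngram[:-1]]
    return counts[ngram] / prefix if prefix else 0.0

def pmi(ngram, counts, total):
    """Pointwise mutual information between the first character and the rest
    (one plausible reading of the slides' 'mutual information' feature)."""
    p_xy = counts[ngram] / total
    p_x = counts[ngram[0]] / total
    p_y = counts[ngram[1:]] / total
    return math.log(p_xy / (p_x * p_y)) if p_xy and p_x and p_y else 0.0

# Hypothetical counts for illustration only.
counts = Counter({"郭": 3, "靖": 2, "郭靖": 2})
```

Both values would enter the logistic regression as features alongside the transliteration, affix, and dictionary signals.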

  5. Thresholding • Otsu’s method: often used in image processing; separates the data into two classes by minimizing the variance within the classes; does not depend on training data • F1 maximization: find the threshold on training data that maximizes F1 score; use the same threshold on test data
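Otsu's method treats the 1-D distribution of n-gram scores like a grayscale histogram and picks the cut that best separates the two classes. The slides contain no code, so this is a sketch of the standard algorithm applied to a score list, not the authors' implementation.

```python
def otsu_threshold(scores, bins=256):
    """Otsu's method on a 1-D list of scores: choose the cut that maximizes
    between-class variance, which is equivalent to minimizing the variance
    within the two classes. Needs no labeled training data."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return lo
    width = (hi - lo) / bins
    hist = [0] * bins
    for s in scores:
        hist[min(int((s - lo) / width), bins - 1)] += 1

    total = len(scores)
    total_sum = sum((lo + (i + 0.5) * width) * h for i, h in enumerate(hist))
    w0 = sum0 = 0.0
    best_sigma, best_t = -1.0, lo
    for i, h in enumerate(hist):
        center = lo + (i + 0.5) * width
        w0 += h                      # mass of the low-score class so far
        sum0 += center * h
        w1 = total - w0              # mass of the high-score class
        if w0 == 0 or w1 == 0:
            continue
        mu0, mu1 = sum0 / w0, (total_sum - sum0) / w1
        sigma_b = w0 * w1 * (mu0 - mu1) ** 2   # between-class variance
        if sigma_b > best_sigma:
            best_sigma, best_t = sigma_b, lo + (i + 1) * width
    return best_t
```

The F1-maximization alternative is simpler still: sweep each observed score as a candidate threshold on the training data, score the resulting name list, and keep the threshold with the best F1.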

  6. Results • No validation set, so a baseline feature set was chosen • Ablation tests show that the chosen baseline was non-optimal • Best individual scores:
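The F1 scores reported here compare the system's output list against the correct name list, as described in the system outline. A minimal version of that set-level evaluation (the example names are hypothetical):

```python
def f1_score(guessed, gold):
    """Set-level precision/recall/F1 of a guessed name list against the
    gold name list."""
    guessed, gold = set(guessed), set(gold)
    tp = len(guessed & gold)                      # correctly extracted names
    precision = tp / len(guessed) if guessed else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Two of three guesses correct against a four-name gold list, for instance, gives precision 2/3, recall 1/2, and F1 = 4/7.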

  7. Conclusion • Most useful features: • N-gram counts / frequency ratios (0.46 F1 alone) • Varies depending on type of n-gram • Thresholding • Otsu’s method yielded better overall performance • Both methods had drawbacks • Future work • More rigorous feature set testing • Larger / cleaner data sets
