
Citation Extractor


Presentation Transcript


  1. Citation Extractor
     Nguyen Bach, Sue Ann Hong, Ben Lambert

  2. Extraction Task
     AuthorOf(Author, Paper)
     PublishedAt(Paper, Conference)
     IsPaper, IsAuthor, IsConference
     • “Citation” = <Paper, Authors, Conference>
     • “Pattern”
       • regular expression

  3. Method Outline
     [Diagram; box labels: Citation DB · Seed (e.g. 5 citations) · Query · Search (WIT) · Web pages (HTML, text) · Extract Patterns using known citations · Page-specific Patterns · Extract Citations using new patterns · Citations]
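One way to read the outline is as a bootstrapping loop: known citations drive searches, each returned page yields page-specific patterns, and those patterns yield new citations that feed the next round. The sketch below is a minimal illustration under that reading, not the authors' implementation. The names `bootstrap`, `toy_search`, `toy_learn`, `toy_apply`, and the in-memory `PAGES` list are all hypothetical; `toy_search` merely stands in for the WIT search step.

```python
import re
from typing import Callable, Iterable, List, Set, Tuple

# A citation is (title, authors, conference), following slide 2.
Citation = Tuple[str, Tuple[str, ...], str]


def bootstrap(
    seeds: Iterable[Citation],
    search: Callable[[Citation], List[str]],
    learn_patterns: Callable[[str, Citation], list],
    apply_patterns: Callable[[str, list], Set[Citation]],
    max_rounds: int = 3,
) -> Set[Citation]:
    """Grow a citation set from seeds (hypothetical reading of the outline)."""
    db = set(seeds)
    frontier = set(db)
    for _ in range(max_rounds):
        found: Set[Citation] = set()
        for citation in frontier:
            for page in search(citation):
                # Patterns are learned per page and only applied to that page.
                patterns = learn_patterns(page, citation)
                found |= apply_patterns(page, patterns)
        frontier = found - db
        if not frontier:  # nothing new was extracted, so stop early
            break
        db |= frontier
    return db


# --- Toy stand-ins so the loop can be run without a search engine ---------
PAGES = [
    "A One, B Two: First Paper. CONF 2001\n"
    "C Three, D Four: Second Paper. CONF 2002",
]
LINE = re.compile(r"^(.+?): (.+?)\. (.+)$")


def toy_search(citation: Citation) -> List[str]:
    # Return every page that mentions the citation's title.
    return [page for page in PAGES if citation[0] in page]


def toy_learn(page: str, citation: Citation) -> list:
    # A real learner would derive the pattern from the page; this one is fixed.
    return [LINE] if citation[0] in page else []


def toy_apply(page: str, patterns: list) -> Set[Citation]:
    out: Set[Citation] = set()
    for line in page.splitlines():
        for pattern in patterns:
            m = pattern.match(line)
            if m:
                authors = tuple(a.strip() for a in m.group(1).split(","))
                out.add((m.group(2), authors, m.group(3)))
    return out


seed = ("First Paper", ("A One", "B Two"), "CONF 2001")
result = bootstrap([seed], toy_search, toy_learn, toy_apply)
```

With the toy page, one seed is enough to pick up the second citation on the same page; the loop then stops because the next round finds nothing new.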

  4. AUTHOR, AUTHOR: TITLE . CONF
     4 Patterns:
       AUTHOR, (A-Za-z): (A-Za-z). (A-Za-z)
       (A-Za-z), AUTHOR : (A-Za-z). (A-Za-z)
       (A-Za-z), (A-Za-z): TITLE. (A-Za-z)
       (A-Za-z), (A-Za-z): (A-Za-z). CONF
     Query: "multiple-goal recognition from low-level signals" "Xiaoyong Chai" "Qiang Yang" "AAAI 2005"
     Page: http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/y/Yang:Qiang.html
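The pattern-extraction step on this slide can be illustrated by replacing the known field values in a page line with slot markers and then turning the resulting template into a regular expression. This is a sketch only: the helper names `make_template` and `template_to_regex` are hypothetical, the exact form of the seed line on the page is assumed, and the character class used for each slot (letters, digits, spaces, hyphens) is a guess at what the slide abbreviates as `(A-Za-z)`. Unlike the slide, which lists four patterns that each pin down one field, this version captures every slot in a single expression.

```python
import re


def make_template(line, authors, title, conf):
    """Replace each known field value in `line` with a slot marker."""
    template = line
    for name in authors:
        template = template.replace(name, "AUTHOR")
    template = template.replace(title, "TITLE")
    template = template.replace(conf, "CONF")
    return template


def template_to_regex(template):
    """Turn a slot template into a regex with one named group per slot."""
    parts = re.split(r"(AUTHOR|TITLE|CONF)", template)
    counts = {}
    pieces = []
    for part in parts:
        if part in ("AUTHOR", "TITLE", "CONF"):
            counts[part] = counts.get(part, 0) + 1
            group = f"{part.lower()}{counts[part]}"
            # Assumed slot contents: letters, digits, spaces, hyphens.
            pieces.append(f"(?P<{group}>[A-Za-z0-9 -]+?)")
        else:
            # Literal separators (", ", ": ", ". ") must match exactly.
            pieces.append(re.escape(part))
    return re.compile("^" + "".join(pieces) + "$")


# Assumed appearance of the seed citation on the page.
seed_line = (
    "Xiaoyong Chai, Qiang Yang: "
    "Multiple-Goal Recognition from Low-Level Signals. AAAI 2005"
)
template = make_template(
    seed_line,
    ["Xiaoyong Chai", "Qiang Yang"],
    "Multiple-Goal Recognition from Low-Level Signals",
    "AAAI 2005",
)
pattern = template_to_regex(template)

# A made-up line in the same layout, to show the pattern generalizing.
match = pattern.match("Jane Doe, John Roe: A Study of Things. ICML 2004")
fields = match.groupdict() if match else {}
```

Because the separators are matched literally and the field values are matched case-sensitively, a pattern built this way is exactly as brittle as slide 7 describes: a third author, a lower-cased title, or different punctuation makes the match fail.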

  5. Finding New Citations
     [Figure: the patterns from slide 4 overlaid with AUTHOR, TITLE, and CONF labels; the original layout is not recoverable from the transcript]

  6. System Spits Out…
     • 6 seeds → 60 citations
     • 36 of these (partial citations)
       • "Theory and Algorithms for Plan Merging " , " Ming Li"
       • "The Expected Value of Hierarchical Problem-Solving " , " Fahiem Bacchus"
       • "Handling feature interactions in process-planning "
     • 14 of these (partial strings)
       • "On D "
       • "On t " , " John Tromp", " Elizabeth Sweedyk", " Umest Vazirani"
       • "An L " , " Ronan Sleep"
       • "To D "
     • No new conferences (end-token)

  7. Bootstrapping, Short-Lived
     • Highly restrictive regexes
     • No recovery
     • The more seeds and variety, the better
     • Stupid Little Things
       • Mis-capitalization
       • Variations in titles (‘-’ vs. ‘ ’)
       • Etc, etc, etc…

  8. Extensions ~ Improvements
     • Less strict string matching
       • Not case and punctuation sensitive
     • Better boundary detection
       • Start/end tokens, HTML wrapper detection?
     • Better pattern construction
       • e.g. n authors not 2
     • NER
       • help find the right "window"
       • A source of ENTITY marker
       • Use like ‘AUTHOR’, ‘TITLE’, ‘CONF’ but with probabilities/confidence values
     • Evaluation with DBLP?
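The first extension, matching that ignores case and punctuation, can be sketched as a normalization step applied to both the known citation field and the page text before comparing them. The function names `normalize` and `loosely_contains` are hypothetical, and treating all punctuation as whitespace is one possible choice among several.

```python
import re


def normalize(text: str) -> str:
    """Lower-case, turn punctuation into spaces, and collapse whitespace."""
    lowered = text.casefold()
    no_punct = re.sub(r"[^\w\s]|_", " ", lowered)
    return " ".join(no_punct.split())


def loosely_contains(haystack: str, needle: str) -> bool:
    """True if `needle` occurs in `haystack` as whole words after normalizing."""
    # Padding with spaces prevents matches that end in the middle of a word.
    return f" {normalize(needle)} " in f" {normalize(haystack)} "


page_line = "Xiaoyong Chai, Qiang Yang: Multiple-Goal Recognition from Low-Level Signals."
seed_title = "multiple goal recognition from low-level signals"

hit = loosely_contains(page_line, seed_title)
partial = loosely_contains(page_line, "Low-Level Sign")
```

Normalizing this way covers both problems listed on slide 7 (mis-capitalization and ‘-’ vs. ‘ ’ in titles), at the cost of losing the punctuation that the patterns on slide 4 use as field separators, so it is better suited to locating a seed on a page than to extracting fields.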

  9. NER
     • Baseline model (News corpus)
       <ENAMEX_TYPE="PERSON"> M. Woszczyna, N. Aoki-Waibel, F. D. Buo, N. Coccaro, K. Horiguchi, T. Kemp, A. Lavie, A. McNair, T. Polzin, I. Rogina, C. P. Rose, T. Schultz, B. Suhm, M. Tomita, A. Waibel, 1994, JANUS 93: </ENAMEX> Towards Spontaneous Speech Translation, Proceedings of the International Conference on Acoustics, Speech, and Signal Processing.
       <ENAMEX_TYPE="PERSON"> S. Awodey. </ENAMEX> Topological Representation of the Lambda Calculus. September <ENAMEX_TYPE="PERSON"> 1998. Math. Struct. </ENAMEX> in <ENAMEX_TYPE="LOCATION"> Comp. Sci. (2000), vol. 10, pp. 81--96. </ENAMEX>
     • Adapted model (News + citation corpus)
       <ENAMEX_TYPE="PERSON"> M. Woszczyna, N. Aoki-Waibel, F. D. Buo, N. Coccaro, K. Horiguchi, T. Kemp, A. Lavie, A. McNair, T. Polzin, I. Rogina, C. P. Rose, T. Schultz, B. Suhm, M. Tomita, A. Waibel, 1994, JANUS 93: </ENAMEX> Towards Spontaneous Speech Translation, Proceedings of the <ENAMEX_TYPE="ORGANIZATION"> International Conference on Acoustics, Speech, </ENAMEX> and Signal Processing.
       <ENAMEX_TYPE="PERSON"> L. Birkedal. </ENAMEX> A General Notion of Realizability. December 1999. Proceedings of <ENAMEX_TYPE="ORGANIZATION"> LICS 2000 </ENAMEX>
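To use tagger output like this as a source of entity markers, the tagged spans first have to be pulled out of the text. The sketch below assumes the tag syntax exactly as it appears on the slide (`<ENAMEX_TYPE="...">` ... `</ENAMEX>`), also accepts a space in place of the underscore, and assumes tags are not nested; the function name `extract_entities` is hypothetical.

```python
import re

# Matches one tagged span; group 1 is the entity type, group 2 the text.
ENAMEX = re.compile(r'<ENAMEX[_ ]TYPE="(\w+)">\s*(.*?)\s*</ENAMEX>', re.DOTALL)


def extract_entities(tagged: str):
    """Return (type, text) pairs for every ENAMEX span, in document order."""
    return [(m.group(1), m.group(2)) for m in ENAMEX.finditer(tagged)]


# The second adapted-model example from the slide.
sample = (
    '<ENAMEX_TYPE="PERSON"> L. Birkedal. </ENAMEX> A General Notion of '
    "Realizability. December 1999. Proceedings of "
    '<ENAMEX_TYPE="ORGANIZATION"> LICS 2000 </ENAMEX>'
)
entities = extract_entities(sample)
```

On this example the PERSON and ORGANIZATION spans line up with the author and the conference, which is the kind of signal slide 8 proposes to use alongside the ‘AUTHOR’ and ‘CONF’ slots.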

  10. Lessons Learned: Another Boring Text Slide
      • Semi-structured text is surprisingly difficult to read
      • Off-line training for wrappers and/or NER may help
      • Need very high-confidence rules to ensure precision
      • A continuously-running system needs robustness (internet/Google-failure, unexpected errors, …)
