
Rapid Prototyping of WWW Niche Search Engines: Document Classification and Information Extraction

Christian W. Omlin
http://www.cs.uwc.ac.za/~comlin
http://www.coe.uwc.ac.za
Telkom/Cisco Center of Excellence for IP and Internet Computing
Department of Computer Science, University of the Western Cape


Presentation Transcript


  1. Rapid Prototyping of WWW Niche Search Engines: Document Classification and Information Extraction Christian W. Omlin http://www.cs.uwc.ac.za/~comlin http://www.coe.uwc.ac.za Telkom/Cisco Center of Excellence for IP and Internet Computing Department of Computer Science University of the Western Cape

  2. Why Niche Search Engines • Technical Challenges and Issues for Niche SEs • Deadliner: A Search Engine for Conference CFP • What Does Deadliner Do? • Deadliner Architecture • Input Data Preprocessing • Document Classifier • Information Extraction: Simple Detectors, Optimal Detector Fusion, ROC curves • Neyman-Pearson • Presentation and Cataloging • Performance • Conclusions & Future Work

  3. Why Niche Search Engines (SEs)? General SE: • General-purpose • Low precision and recall; data not highly structured • High cost: large database, high bandwidth • Lots to crawl • More resources required, much of which are rarely used Niche SE (specialized SE): • Domain specific • High precision and recall; structured data • Lower cost: smaller database, lower bandwidth • More up to date • Fewer resources: local, personalized

  4. Technical Challenges & Issues for Niche SEs • Focused crawling • Page classification • Automated data and knowledge extraction and indexing: identification, extraction, summarization, presentation, integration • Creating (general-purpose) tools to help build niche search engines • → Niche SE design research and system integration

  5. What does DEADLINER do? • Domain specific document retrieval • Meta-tool (uses and integrates other existing tools: focused crawler, meta search engine, newsgroups) • Prescreens gathered text using support vector machines • Constructs detectors for target information in text (e.g. theme of a conference) • Bayesian detector fusion • Presentation and cataloging of extracted information • Places data into structured database • Allows complex queries (e.g. search by date)

  6. DEADLINER Architecture

  7. Input Stages • Builds upon other existing systems: • Newsgroup scanner • Query-modified meta-search: learned query modifications used with text derived from a database of calls for papers • Focused WWW crawler: learns the context of relevant documents; a context graph models link hierarchies

  8. Support Vector Machines (SVMs) • SVMs screen the harvested documents • Good text classifiers • Handle high-dimensional input vectors • Resist overfitting by choosing salient features • Class vectors are separated by hyperplanes • Hyperplanes maximize the margins between classes, which controls complexity and generalization
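The deck shows no code, so the following is only a minimal sketch of the idea behind the margin-maximizing classifier: a linear SVM trained by subgradient descent on the regularized hinge loss (a Pegasos-style method, not DEADLINER's actual classifier; all names and the tiny dataset are illustrative assumptions).

```python
def train_linear_svm(X, y, lam=0.01, epochs=500, lr=0.1):
    """Linear SVM via subgradient descent on (lam/2)||w||^2 + hinge loss.
    X: list of feature vectors, y: labels in {-1, +1}."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:
                # Hinge is active: step along -grad of regularizer + data term
                w = [wj - lr * (lam * wj - yi * xj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:
                # Only the L2 regularizer contributes: shrink the weights
                w = [wj * (1 - lr * lam) for wj in w]
    return w, b

def predict(w, b, x):
    """Classify by the sign of the separating hyperplane."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1
```

On linearly separable toy data the learned hyperplane ends up close to the max-margin separator, which is the property the slide refers to.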

  9. SVM Input Data Vectors • Construct an input vocabulary • A feature is a word, bigram, or trigram • Features are chosen relative to a fixed vocabulary • If a feature occurs in at least 7.5% of the true-class or false-class documents, it is a candidate for the vocabulary • Candidates are ranked by the ratio of their frequency in true-class documents to their frequency in false-class documents • The N highest-ranked features form the vocabulary • The feature vector is constructed from this vocabulary • Vector entries are binary, {0,1}
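As a rough sketch of the vocabulary construction described above (the 7.5% threshold and the frequency-ratio ranking come from the slide; the function names, word-level tokenization, and add-one smoothing of the false-class frequency are my own assumptions):

```python
from collections import Counter

def build_vocabulary(true_docs, false_docs, min_frac=0.075, n=100):
    """Pick the n features with the highest true/false frequency ratio
    among words occurring in at least min_frac of either class."""
    def doc_freq(docs):
        c = Counter()
        for doc in docs:
            c.update(set(doc.lower().split()))  # count each word once per doc
        return c
    tf, ff = doc_freq(true_docs), doc_freq(false_docs)
    nt, nf = len(true_docs), len(false_docs)
    candidates = {w for w in set(tf) | set(ff)
                  if tf[w] / nt >= min_frac or ff[w] / nf >= min_frac}
    # Add-one smoothing avoids division by zero for words absent from
    # the false class
    ranked = sorted(candidates,
                    key=lambda w: (tf[w] / nt) / ((ff[w] + 1) / (nf + 1)),
                    reverse=True)
    return ranked[:n]

def to_vector(doc, vocab):
    """Binary {0,1} feature vector over the fixed vocabulary."""
    words = set(doc.lower().split())
    return [1 if w in words else 0 for w in vocab]
```

Words characteristic of calls for papers (e.g. "deadline") rank high because they are frequent in the true class and rare in the false class.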

  10. Heuristics to Combine Text Fields in Filters • The simplest filters perform keyword matching based on the built vocabulary using regular expressions • Program committees: text matched against a dictionary of known author names & affiliations obtained from ResearchIndex (~90K entries) • Deadlines: standard date formats from sentences, with surrounding or immediately preceding text • Titles: contain city/state, date of meeting, deadline, list of sponsors, name, acronym for the conference, theme/summary of the conference • A candidate must contain at least two of these to match against the database

  11. Bayesian Detector Fusion • How do we find suitable pieces of the document to which to apply extraction rules? • Optimally integrate simpler detectors for every target field • A regular expression or formatting rule is called a filter • A filter either does or does not match; the combination of a filter and its match is called a detector • These partial detectors are combined to yield a new detector • The new detector’s precision/recall setting can be changed via a single parameter

  12. Bayesian Detector Fusion II • Every detector function is a binary classifier • We find the classifier with the highest probability of detection Pd for a given probability of false alarm Pf • The combined output of N detectors is denoted as a bit string • Assume two hypotheses: the document is either relevant or irrelevant • Therefore there are 2^N possible bit strings, and 2^(2^N) distinct possible classification rules

  13. Bayesian Detection Fusion

  14. Receiver Operating Characteristic (ROC) Curves • Graph of true positive rate vs. false positive rate • The area under the curve represents the probability of correctly distinguishing between a (normal, abnormal) pair • The set of operating points defined by the 2^(2^N) binary mappings is called the Achievable Operating Set • The ROC is the set of operating points yielding the maximal detection rate for a given false alarm rate • The Neyman-Pearson (NP) procedure can be used to construct an optimal hypothesis test for a distribution parameter • NP ranks the 2^N possible bit strings according to a likelihood ratio function
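The NP ranking step can be sketched as follows: given the probability of each detector-output bit string under the two hypotheses, sort the strings by likelihood ratio and accumulate (Pf, Pd) operating points as strings are added to the "declare relevant" region. This is an illustrative sketch, not DEADLINER's code; the independent-detector example probabilities in the usage note are invented.

```python
def np_roc(p1, p0):
    """Build the fused detector's ROC by the Neyman-Pearson procedure.
    p1[s], p0[s]: probability of bit string s under the relevant and
    irrelevant hypotheses, respectively."""
    # Rank bit strings by likelihood ratio P(s|relevant)/P(s|irrelevant)
    ranked = sorted(p1,
                    key=lambda s: p1[s] / p0[s] if p0[s] else float('inf'),
                    reverse=True)
    points, pd, pf = [(0.0, 0.0)], 0.0, 0.0
    for s in ranked:
        pd += p1[s]  # detection probability gained by accepting s
        pf += p0[s]  # false-alarm probability paid for accepting s
        points.append((round(pf, 6), round(pd, 6)))
    return points
```

For two independent detectors, each with Pd = 0.8 and Pf = 0.2, the top-ranked string (1,1) alone already yields the operating point (Pf, Pd) = (0.04, 0.64), illustrating the gain from fusing detectors.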

  15. Receiver Operating Characteristic (ROC) Curves

  16. ROC Interpretation • Selecting features (y1, y2, …, yN) yields a pair: probability of detection Pd and probability of false alarm Pf • Pd and Pf are the maxima attainable for the chosen features

  17. ROC Curve Example

  18. Constructing a ROC Curve Assume the true class-conditional density functions on the data are as given on the slide. The ROC support classifiers φi(y), i = 0, 1, 2, 3, developed via the NP design, are shown on the slide.

  19. Neyman-Pearson (often confused with an optimal Bayesian classifier) Theorem: Let X be a random sample from f_X(x; θ), θ ∈ {θ0, θ1}, where both X and θ may be multidimensional vectors. Let C be a rejection region of significance α, and let λ = P(x; θ1) / P(x; θ0), where P is the likelihood function. Then C is a best (optimal) rejection region for testing H0 against H1 if, for some positive number k: λ ≥ k for x ∈ C, and λ < k for x ∈ Complement(C)
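A concrete instance of the theorem (a toy example of my own, not from the deck): testing a coin's bias with the likelihood-ratio rule.

```python
from math import comb

def np_test(lik_h0, lik_h1, k):
    """Neyman-Pearson rule: reject H0 in favour of H1 exactly when the
    likelihood ratio P(x; theta1)/P(x; theta0) reaches the threshold k."""
    return lambda x: lik_h1(x) / lik_h0(x) >= k

# H0: coin bias 0.5 vs H1: bias 0.8; x = number of heads in 10 tosses
def binom_lik(p):
    return lambda x: comb(10, x) * p**x * (1 - p)**(10 - x)

reject = np_test(binom_lik(0.5), binom_lik(0.8), k=1.0)
```

With k = 1, observing 9 heads rejects H0 (the likelihood ratio is about 27.5), while observing 3 heads does not; varying k traces out the test's operating points, exactly as in the ROC construction.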

  20. Presentation & Cataloging • After detection, the target elements need to be extracted • A particular setting might overrule one or more constituent filters, require a certain combination of features, or enforce some other joint relationship • Matching filters and positive detectors are processed for extraction using heuristics • The match with the highest confidence is used, but for “close calls” all the matches are indexed • In principle, no irreversible decisions are made

  21. Extraction of a Target Field – Example Different filters attempt to recognize the target, here the “title”; the result reflects overall detection confidence

  22. Primitives used in DEADLINER • Primitives are used to construct filters • Most conference materials follow a block layout, e.g. a title, abstracts, program committee, affiliations, discussion topics, venue, scope, and a miscellaneous section • Filters are constructed from dictionaries and/or heuristics • Example: • A title usually contains two or more of: • a country name, city name, affiliation, date of meeting, • deadline, conference name, list of sponsors, conference acronym, theme

  23. Examples of Primitives • For a deadline, we use for example: • (cur_sentence, /deadline/i) — matches the word “deadline” • (cur_sentence, /<date>/) — matches a date • #(cur_sentence, /<date>/) ≥ 3 — there are three or more dates in the current sentence
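The primitives above can be sketched with Python regular expressions. The date pattern here is a deliberately simplified stand-in for DEADLINER's <date> primitive, whose real date grammar is richer; the function and pattern names are my own.

```python
import re

# Simplified stand-in for the <date> primitive: month name + day,
# optionally followed by a year
DATE = re.compile(
    r'\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\.?\s+'
    r'\d{1,2}(?:,\s*\d{4})?',
    re.IGNORECASE)
DEADLINE = re.compile(r'deadline', re.IGNORECASE)

def deadline_primitives(sentence):
    """Evaluate the three example primitives from the slide on one sentence."""
    dates = DATE.findall(sentence)
    return {
        'has_deadline_word': bool(DEADLINE.search(sentence)),  # /deadline/i
        'has_date': bool(dates),                               # /<date>/
        'three_or_more_dates': len(dates) >= 3,                # #(...) >= 3
    }
```

For example, "Paper submission deadline: March 15, 2024." fires the first two primitives but not the third, while a sentence listing three separate dates fires the count primitive.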

  24. Support Vector Machine Performance • 592 manually classified CFPs, 2269 “random” negative examples from Inquirus2’s logs, and 850 negative examples (conference-related, but not CFPs) • Of these, 249 positive examples and 1250 negative examples were used for training the SVM • The remaining 353 positive examples and 1019 negative examples were used for testing • Non-English and multilingual sites present major issues

  25. ROC Curve – “Deadline” Optimal detection performance for different numbers of combined detectors. Significant performance increase from integrating multiple detectors

  26. “Program Committee”

  27. “Topic”

  28. Extractor Performance for Pf of 5% • 500 training documents from DBWorld and 100 testing documents from DIKU • The DBWorld pages contain 208 lists of interesting topics, 338 conference titles, 906 deadlines, and 197 program committees • The DIKU pages contain 300 deadlines and 1455 program committee members • Extraction results for “deadline” and “program committee”:
Target                  Deadline  Committee
Total                   300       1455
Detected/Extracted      214       1252
Detected/Not Extracted  2         72
Extraneous              31        136
• Extraction results for “Titles”:
Start Date  End Date  Theme/Name  Country  Type of Meeting
73%         71%       81%         77.5%    85%

  29. Related Work • ResearchIndex.com and Cora collect and index online papers by text and citations • SoftMealy (Hsu) applies finite-state transducer rules to HTML pages • XWRAP is a wrapper construction system which transforms HTML pages into XML using generated rules • Stalker is a greedy sequential covering algorithm that uses landmark automata to generate wrappers

  30. Related Work: ResearchIndex, Cora • ResearchIndex.com and Cora are examples of document niche search engines • Both perform citation extraction from online publications, page-content and citation analysis • Cora uses trained Hidden Markov Models to extract information from citations • ResearchIndex uses hand-constructed algorithms and heuristics, and machine learning • DEADLINER’s approach: simple filters to select appropriate text blocks, followed by simple heuristics

  31. Related Work: SoftMealy (Hsu) • Trains a finite-state transducer (FST) for token extraction • Uses a heuristic to prevent non-determinism (and therefore increase efficiency) • Contextual rules are produced by an induction algorithm • The FSTs obtained are applied to HTML pages (DEADLINER uses text-only pages in this experiment)

  32. Related Work: XWRAP (Liu, Pu, Han) • XWRAP is a wrapper construction system which transforms HTML pages into XML • Rules are generated and applied to HTML, and interesting document regions are identified via an interactive interface • Similarly for semantic tokens • These steps are followed by a hierarchy determination, resulting in a context-free grammar

  33. Related Work: Stalker (Muslea, Minton, Knoblock) • An algorithm that uses landmark automata to generate wrappers • Stalker is a greedy sequential covering algorithm • Generates a landmark automaton that accepts only true positives • Does so by finding a perfect disjunct, or until it runs out of training examples • The best disjunct covers the most positive examples; new disjuncts are added iteratively to cover the remaining positive candidates

  34. Future Work • Create feature extraction tools (to decrease modeling time) and improve classification/extraction • Increase the (types of) meta-data extracted • Apply the DEADLINER architecture model to other domains • Extension to WWW image searches • eAfrica.org: repository of timely documents related to e-learning, e-commerce, e-government, e-healthcare, etc. for the African continent.

  35. Reference: A. Kruger, C.L. Giles, F. Coetzee, E. Glover, G.W. Flake, S. Lawrence, C.W. Omlin, “DEADLINER: Building a New Niche Search Engine”, Proceedings of the 9th International Conference on Information and Knowledge Management (CIKM), 2000.
