
Web-Scale Information Extraction in KNOWITALL (Preliminary Results)


Presentation Transcript


  1. Web-Scale Information Extraction in KNOWITALL (Preliminary Results) O. Etzioni, M. Cafarella, D. Downey, S. Kok, A. Popescu, T. Shaked, S. Soderland, D.S. Weld, and A. Yates Department of Computer Science and Engineering University of Washington CS583 Paper Presentation By Yan Luo and Pei Zhang

  2. Part I KNOWITALL Presented by Yan Luo

  3. Outline – Part I • Part I – KNOWITALL • Motivation and Introduction • Features and Modules • Extractor • Assessor • Bootstrapping • Extraction Focus

  4. Motivation • Example: “compiling a list of the humans who have visited space, or the cities in the world whose population is below 500,000 people”. • Manually querying search engines to accumulate a large collection of facts is a tedious and error-prone process. • Search engines retrieve and rank potentially relevant documents, but do not extract facts, assess confidence, or fuse information from multiple documents.

  5. Introduction • This paper introduces KNOWITALL, a domain-independent system that extracts information from the Web in an automated and scalable manner. • It analyzes KNOWITALL’s architecture and reports lessons learned for designing and building large-scale information extraction systems. • As preliminary experimental results, the paper reports that KNOWITALL ran for four days and extracted over 50,000 facts regarding 5 classes: cities, states, countries, actors, and films.

  6. Features • Extracts information from the Web. • Domain independent and highly automated. • Uses PMI to assess the probability of extractions. • PMI: pointwise mutual information • Uses weaker input than previous IE systems. • Employs unsupervised learning methods that extract facts by using search engines.

  7. Modules • Extractor • Instantiates a set of extraction rules from a set of generic, domain-independent templates. • Search Engine Interface • Formulates queries based on the extraction rules, sends them to search engines, and downloads the results. • Assessor • Uses statistics computed by querying search engines to assess the likelihood that the extractions are correct. • Database • Stores its information in a commercial RDBMS.

  8. High-level Pseudocode
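A minimal, illustrative Python-style sketch of the control loop described on the following slides (slides 9, 19, and 21); every class method here is a hypothetical stand-in for a KNOWITALL component, not the paper's actual pseudocode:

    SNR_STOP = 0.05  # stop a class when its signal-to-noise ratio falls below this

    def knowitall(classes, downloads_per_cycle):
        # Bootstrapping: instantiate rules from templates and train the Assessor.
        for cls in classes:
            cls.bootstrap()
        facts = []
        active = list(classes)
        while active:
            for cls in list(active):
                # Extraction Focus: a class's share of downloads is
                # proportional to its yield in the previous cycle.
                quota = cls.allocate(downloads_per_cycle)
                for instance in cls.extract(quota):      # Extractor + search engine
                    prob = cls.assess(instance)          # PMI stats, Naive Bayes
                    facts.append((cls.name, instance, prob))  # stored in the RDBMS
                if cls.snr() < SNR_STOP:                 # stop unproductive classes
                    active.remove(cls)
        return facts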

  9. Procedure • Bootstrapping instantiates a set of extraction rules for each class in the information focus and trains the Naïve Bayesian Classifier for the Assessor. • After bootstrapping, the Extractor begins finding instances on the Web, and the Assessor assigns a probability to each instance. • At each cycle, KNOWITALL allocates system resources to favor the most productive classes; this mechanism is called the Extraction Focus.

  10. Bootstrapping

  11. Extraction Cycle

  12. Extractor • Whenever a new class is added to KNOWITALL’s ontology, the Extractor uses generic, domain-independent rule templates to create a set of information extraction rules for that class. • Some rule templates are adapted from Marti Hearst’s hyponym patterns; others were developed independently. • KNOWITALL thus forms the appropriate extraction rule, generates queries, and sends them to search engines. When a search engine retrieves a web page for a query, the Extractor applies the extraction rule associated with that query to any sentences in the page that contain the query keywords.

  13. Extractor – Syntactic Patterns • A sample of the syntactic patterns that underlie KNOWITALL’s rule templates; Hearst’s hyponym patterns, for example, include: NP1 “such as” NPList2; “such” NP1 “as” NPList2; NPList2 “and other” NP1; NPList2 “or other” NP1; NP1 “including” NPList2; NP1 “especially” NPList2.

  14. Extractor – Rule Template This generic rule template is instantiated for a class in the ontology to create an extraction rule that looks for instances of that class.

  15. Extractor – Extraction Rule This extraction rule looks for web pages containing the phrase “countries such as”. It extracts any proper nouns immediately after that phrase as instances of Country.
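As a rough illustration of how such a template becomes a working rule, here is a small regex-based sketch; the function names and the crude capitalized-word stand-in for proper-noun recognition are assumptions, since the actual Extractor uses an NP chunker and head-noun constraints:

    import re

    def instantiate_rule(plural_label, class_name):
        # Instantiate the generic template '<plural class label> such as NPList'
        # for one class, e.g. plural_label="countries", class_name="Country".
        phrase = f"{plural_label} such as"
        # Crude proper-noun stand-in: a run of capitalized words.
        pattern = re.compile(re.escape(phrase) + r"\s+((?:[A-Z][a-z]+\s?)+)")
        return class_name, pattern

    def apply_rule(rule, sentence):
        class_name, pattern = rule
        match = pattern.search(sentence)
        return (class_name, match.group(1).strip()) if match else None

    rule = instantiate_rule("countries", "Country")
    print(apply_rule(rule, "He toured countries such as France and Spain."))
    # -> ('Country', 'France')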

  16. Assessor • Information extraction from the Web is a difficult, noisy process. To improve precision, the Assessor assesses the probability of every extraction generated by the Extractor. • Specifically, the Assessor measures co-occurrence statistics of the candidate extractions with a set of discriminator phrases. • Search-engine hit counts provide an efficient way to compute these co-occurrence statistics.

  17. Probabilistic Assessment • The Assessor computes mutual information statistics between each extracted instance and multiple discriminator phrases. • These mutual information statistics are combined via a Naïve Bayesian Classifier.

  18. Bootstrapping • In order to estimate the probabilities, KNOWITALL needs a training set of positive and negative instances of the target class. • Bootstrapping begins by instantiating a set of extraction rules and queries for each predicate from generic rule templates, and also generates a set of discriminator phrases from rules and class names. • Bootstrapping selects seeds by first running an extraction cycle to find a set of at least n proposed instances of the class, then selecting m instances from those with highest average PMI.

  19. Bootstrapping • The seeds are then used to train the probabilities for the discriminators, with an equal number of negative seeds drawn from the positive seeds of other classes. • Bootstrapping selects the best k discriminators for the Assessor; in the experiments, n = 200, m = 20, and k = 5. • The bootstrapping process may be iterated (see the sketch below): • finding a set of seeds with high average PMI over all generic discriminator phrases; • using these seeds to train the discriminators and selecting the k best; • finding a new set of seeds with high PMI over just those k discriminators.
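A compact sketch of this iterated bootstrapping loop; every helper function here is a hypothetical stand-in for a component named above:

    def bootstrap(cls, n=200, m=20, k=5, iterations=2):
        # Discriminator phrases are generated from the rules and class names.
        discriminators = generate_discriminators(cls)
        for _ in range(iterations):
            # Run an extraction cycle until at least n candidates are found,
            # then keep the m with the highest average PMI as positive seeds.
            candidates = extraction_cycle(cls, min_instances=n)
            seeds = sorted(candidates,
                           key=lambda i: average_pmi(i, discriminators),
                           reverse=True)[:m]
            # Negative seeds: an equal number of positive seeds of other classes.
            negatives = seeds_of_other_classes(cls, count=m)
            train_discriminators(discriminators, seeds, negatives)
            discriminators = best_k(discriminators, k)
        return discriminators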

  20. Extraction Focus • Since KNOWITALL has multiple classes in its ontology, focus of attention becomes an important issue. • Within a set of classes, some will have a large set of instances on the web and KNOWITALL can productively continue to search for a long time. • For other classes, there are a limited number of instances to find, and it is important for KNOWITALL to know when to stop searching for more instances.

  21. Extraction Focus • The number of downloads allocated to each class in a new cycle is proportional to its yield in the previous cycle (see the sketch below). • Another metric that guides KNOWITALL’s resource allocation is the SNR (signal-to-noise ratio) of each class. • In the experiments, the high-probability threshold was set to 0.90, the low-probability threshold to 0.10, and the stopping ratio to 0.05.
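A one-function sketch of that yield-proportional allocation (an assumption-level reading of the slide, not the paper's exact formula):

    def allocate_downloads(total_budget, last_yield):
        # last_yield maps each class to its count of new extractions
        # in the previous cycle.
        total = sum(last_yield.values()) or 1
        return {cls: round(total_budget * y / total)
                for cls, y in last_yield.items()}

    # allocate_downloads(1000, {"City": 300, "Film": 100})
    # -> {'City': 750, 'Film': 250}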

  22. Part II Lessons and Experiments Presented by Pei Zhang

  23. Outline – Part II • Ways to improve performance • - Termination Criterion • - Features for Assessment • Related problem • - Recursive Query Expansion • Future Work • Conclusions

  24. Ways to improve performance 1) Termination Criterion i) The more we retrieve from the Web, the more irrelevant pages we get. ii) Set a stopping criterion so that the system does not keep searching once it has already found most of the relevant information.

  25. Ways to improve performance SNR: Signal-To-Noise ratio. Simply put: Signal: relevant pages. Noise: irrelevant pages. If the ratio falls below 0.05, meaning fewer than 5 relevant pages for every 100 irrelevant ones, we can stop the search and shift focus to other classes.

  26. Ways to improve performance How to compute the SNR? • As the KnowItAll algorithm explains, extraction proceeds in cycles, and each iteration yields a number of retrievals. As retrieval goes on, more and more irrelevant pages get retrieved. • Since we cannot predict how many relevant and irrelevant extractions the current iteration will produce, we can only use the ratio from the previous round (see the sketch below).
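Putting slides 21, 25, and 26 together, the stopping test might look like this; a sketch using the thresholds reported in the experiments:

    HIGH, LOW, STOP_RATIO = 0.90, 0.10, 0.05

    def signal_to_noise(probabilities):
        # Signal: extractions the Assessor rates almost certainly correct;
        # noise: those it rates almost certainly wrong.
        signal = sum(1 for p in probabilities if p >= HIGH)
        noise = sum(1 for p in probabilities if p <= LOW)
        return signal / noise if noise else float("inf")

    def should_stop(previous_cycle_probabilities):
        # Use the previous cycle's ratio, since the current one is unknown.
        return signal_to_noise(previous_cycle_probabilities) < STOP_RATIO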

  27. Ways to improve performance Experimental results (chart).

  28. Ways to improve performance 2) Features for probabilistic assessment: • Hits vs. PMI • Density (continuous) vs. Threshold (discrete)

  29. Ways to improve performance What is PMI? Example: I1 = “New York”, I2 = “Metz”, D = “city of X”. D + I1 = “city of New York”, D + I2 = “city of Metz”.
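The PMI score itself, as the paper computes it from hit counts; hits() is a hypothetical stand-in for a search-engine hit-count query:

    def pmi(instance, discriminator):
        # PMI(I, D) = |Hits(D + I)| / |Hits(I)|: the fraction of the
        # instance's hits that co-occur with the discriminator phrase.
        return hits(discriminator.replace("X", instance)) / hits(instance)

    # pmi("New York", "city of X") compares Hits("city of New York")
    # against Hits("New York"); likewise for "Metz".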

  30. Ways to improve performance For Bayesian classification, we need to estimate the probability of each feature: P(fi = x | Φ) and P(fi = x | ¬Φ). For continuous features, x ranges over all possible hit counts (or PMI scores).
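These per-feature probabilities are combined by the Naïve Bayesian Classifier. Below is a minimal sketch of the standard combination, done in log space for numerical stability; cond_probs is a hypothetical callable mapping a feature value to its two conditional probabilities:

    import math

    def assess(features, cond_probs, prior=0.5):
        # P(class | f1..fn) = P(c) * prod P(fi|c) /
        #   (P(c) * prod P(fi|c) + P(~c) * prod P(fi|~c))
        log_pos, log_neg = math.log(prior), math.log(1 - prior)
        for f in features:
            p_pos, p_neg = cond_probs(f)   # assumed smoothed away from zero
            log_pos += math.log(p_pos)
            log_neg += math.log(p_neg)
        m = max(log_pos, log_neg)
        pos, neg = math.exp(log_pos - m), math.exp(log_neg - m)
        return pos / (pos + neg)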

  31. Ways to improve performance We can also discretize the features using one of the following methods: a) Find the threshold x0 where P(fi = x0 | Φ) = P(fi = x0 | ¬Φ). b) Select the threshold that provides the highest information gain (see the sketch below).
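A sketch of option (b), choosing the cut with the highest information gain over the labeled bootstrap seeds; this is standard entropy-based split selection, not necessarily the paper's exact procedure:

    import math

    def entropy(pos, neg):
        if pos == 0 or neg == 0:
            return 0.0
        p = pos / (pos + neg)
        return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

    def best_threshold(pos_scores, neg_scores):
        # pos_scores / neg_scores: feature values (hit counts or PMI)
        # for the positive and negative seeds.
        n = len(pos_scores) + len(neg_scores)
        base = entropy(len(pos_scores), len(neg_scores))
        best, best_gain = None, -1.0
        for t in sorted(set(pos_scores) | set(neg_scores)):
            lp = sum(1 for s in pos_scores if s <= t)
            ln = sum(1 for s in neg_scores if s <= t)
            rp, rn = len(pos_scores) - lp, len(neg_scores) - ln
            remainder = ((lp + ln) * entropy(lp, ln)
                         + (rp + rn) * entropy(rp, rn)) / n
            if base - remainder > best_gain:
                best, best_gain = t, base - remainder
        return best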

  32. Ways to improve performance Experimental results: • Assessors that use PMI scores give better overall performance than those based on raw hit counts. • For continuous vs. discrete features, neither has shown a clear advantage over the other so far.

  33. Related Problem: RQE – Recursive Query Expansion BACKGROUND • KnowItAll relies on existing search engines. • Existing search engines make only a small fraction of their results accessible to users.

  34. Related Problem: RQE – Recursive Query Expansion • In practice, we get only about 1,000 or so top-ranked URLs per query from a search engine. EXAMPLE: Searching for “cities such as”, we get the following result: 818 accessible out of about 678,000 matches for "cities such as". • However, KnowItAll cannot be restricted to examining only 1,000 web pages, so it is necessary to expand the query. • How?

  35. Related Problem: RQE – Recursive Query Expansion For each input query q, we split it into two parts: q' = q w (q AND the word w) and q'' = q -w (q NOT w), where w is drawn from the frequency-ordered word list at www.comp.lancs.ac.uk/ucrel/bncfreq/flists.html

  36. Related Problem: RQE – Recursive Query Expansion (Diagram: a binary query tree. The root is the total number of pages retrieved for query q; it splits into the totals for q w1 and q -w1, each of which splits again on w2 into the totals for q w2 and q -w2, and so on.)

  37. Related Problem: RQE – Recursive Query Expansion EXAMPLE: q = “cities such as” Result: 818 out of about 678,000 pages are retrieved

  38. Related Problem: RQE – Recursive Query Expansion EXAMPLE (cont.) q’ = “cities such as” “chicago” Result: 780 out of about 115,000 pages are retrieved

  39. Related Problem: RQE – Recursive Query Expansion EXAMPLE (cont.) q’’ = “cities such as” -“chicago” Result: 839 out of about 630,000 pages are retrieved

  40. Related Problem: RQE – Recursive Query Expansion EXAMPLE (cont.) q’ and q’’ are not mutually exclusive; however, their higher-ranked results overlap little. We can thus retrieve a little under 780 + 839 = 1,619 pages. Substituting other words for “chicago” retrieves more and more pages (see the sketch below).
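A recursive sketch of the whole expansion; hits() and retrieve() are hypothetical stand-ins for search-engine calls, and retrieve() is assumed to return the set of accessible result URLs:

    def expand(query, hits, retrieve, words, limit=1000, depth=0):
        # If the engine will already show us everything, just take it.
        if hits(query) <= limit or depth == len(words):
            return retrieve(query)
        # Otherwise split on the next frequent word w: (q AND w) union (q NOT w).
        w = words[depth]
        return (expand(f'{query} "{w}"', hits, retrieve, words, limit, depth + 1)
                | expand(f'{query} -"{w}"', hits, retrieve, words, limit, depth + 1))

    # e.g. expand('"cities such as"', hits, retrieve, ["chicago", "london"])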

  41. Related Problem: RQE – Recursive Query Expansion Extraction rate (chart).

  42. FUTURE WORK • 1) This is preliminary work: the system has only a fixed set of rules and does not yet extend to learning new ones. • 2) Ontology extension.

  43. CONCLUSION • 1) Signal-to-noise ratio as an extraction termination criterion • 2) Hit counts vs. PMI • 3) Continuous vs. discrete features • 4) Recursive Query Expansion

  44. Thank you. Questions?
