Learning to Construct Knowledge Bases from the World Wide Web
by Mark Craven, Dan DiPasquo, Dayne Freitag, Andrew McCallum, Tom Mitchell, Kamal Nigam, Sean Slattery

Presented by Igor Yakymenko, iy@cse.buffalo.edu, Department of Computer Science and Engineering, SUNY at Buffalo

Presentation Transcript


  1. Learning to Construct Knowledge Bases from the World Wide Web, by Mark Craven, Dan DiPasquo, Dayne Freitag, Andrew McCallum, Tom Mitchell, Kamal Nigam, Sean Slattery. Presenter: Igor Yakymenko, iy@cse.buffalo.edu, Department of Computer Science and Engineering, SUNY at Buffalo

  2. Fig. 1. An overview of the WebKB system

  3. Two of the entities automatically extracted from the CMU computer science department Web site after training on four other university computer science sites. These entities were extracted from Web hypertext and added to the knowledge base as new instances of faculty and project.

  4. Part II • Pages 11-29 • Appendix B • Appendix C

  5. Learning to Recognize Class Instances • Task: identify new instances of ontology classes from text sources on the Web. • Discussion: • A statistical bag-of-words approach to classifying Web pages, used with three different representations of pages. • Learning first-order rules to classify Web pages. • Evaluation of the effectiveness of combining the predictions made by all four of these classifiers.

  6. Naive Bayes • Two common approaches: • the multi-variate Bernoulli model (binary word occurrence); • the multinomial model (integer word counts). Given a set of classes C = {c_1, …, c_N} and a document consisting of n words (w_1, w_2, …, w_n), we classify the document as a member of the most probable class c* given the words in the document:

  c* = argmax_c Pr(c | w_1, …, w_n)

  Applying Bayes' rule transforms the posterior into

  Pr(c | w_1, …, w_n) = Pr(c) · Pr(w_1, …, w_n | c) / Pr(w_1, …, w_n)

  Rewriting with the product rule, dropping the denominator (it is the same for every class), and assuming the words are independent of each other given the class:

  c* = argmax_c Pr(c) · ∏_{i=1..n} Pr(w_i | c)
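As a concrete illustration of the multinomial model's argmax rule, here is a minimal Python sketch. The probability tables (prior, word_prob) are assumed to be estimated elsewhere, and the tiny floor for unseen words stands in for proper smoothing; these names and details are illustrative, not from the paper.

    import math
    from collections import Counter

    def naive_bayes_classify(words, classes, prior, word_prob):
        """Pick the class maximizing log Pr(c) + sum_i log Pr(w_i | c).

        words     : list of tokens in the document
        prior     : dict class -> Pr(c)
        word_prob : dict class -> dict word -> Pr(w | c), already smoothed
        """
        counts = Counter(words)
        best_class, best_score = None, float("-inf")
        for c in classes:
            # Work in log space to avoid floating-point underflow on long documents.
            score = math.log(prior[c])
            for w, n in counts.items():
                score += n * math.log(word_prob[c].get(w, 1e-10))  # floor for unseen words
            if score > best_score:
                best_class, best_score = c, score
        return best_class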

  7. Naive Bayes Classifier Limitations • Naive Bayes is not well suited to estimating a level of confidence for its predictions: • the winning class tends to have a posterior probability of 1 (an artifact of the independence assumption); • the losing classes tend to have posterior probabilities close to 0. The authors propose modifying the existing formulas to overcome these limitations.

  8. Modifications to Naive Bayes Goal: scores that accurately reflect the uncertainty in each prediction and make it possible to sensibly compare the scores of multiple documents (a smooth function of confidence). Begin with the naive Bayes score, rewrite the product as an equivalent expression over all words in the vocabulary T instead of just the words in the document (B.1), take the log (B.2), and divide by the number of words n in the document (B.3):

  Pr(c) · ∏_{w_i∈T} Pr(w_i | c)^{N(w_i|d)}   (B.1)

  log Pr(c) + Σ_{w_i∈T} N(w_i|d) · log Pr(w_i | c)   (B.2)

  (1/n) · log Pr(c) + Σ_{w_i∈T} (N(w_i|d)/n) · log Pr(w_i | c)   (B.3)

  where N(w_i|d) is the number of times word w_i occurs in document d.

  9. Modifications to Naive Bayes (continued) Substituting Pr(w_i|d) = N(w_i|d)/n, the authors derive the following formula:

  Score_c(d) = (1/n) · log Pr(c) + Σ_{w_i∈T} Pr(w_i|d) · log Pr(w_i | c)

  where: n – the number of words in the document; Pr(c) – the prior probability of class c; Pr(w_i|d) – the frequency of word w_i in document d; T – the whole vocabulary; Pr(w_i|c) – the frequency of word w_i in class c.

  10. Modifications to Naive Bayes (continued) Subtracting the optimal encoding length for the document, Σ_{w_i∈T} Pr(w_i|d) · log Pr(w_i|d), gives the final score for each class; the biggest score determines the class for a specific document:

  Score_c(d) = (1/n) · log Pr(c) + Σ_{w_i∈T} Pr(w_i|d) · log( Pr(w_i|c) / Pr(w_i|d) )

  The summation on the right-hand side is the negative relative (cross) entropy: • a measure of how different two probability distributions are; • the average number of bits that are "wasted" by encoding events from a distribution p with a code based on a not-quite-right distribution q.

  11. Naive Bayes Classifier (conclusion) Approach: build a probabilistic model of each class from labeled training data, then classify new pages by selecting the most probable class. Given a document d to classify, we calculate a score for each class c; the predicted class is the one with the greatest score:

  Score_c(d) = (1/n) · log Pr(c) + Σ_{w_i∈T} Pr(w_i|d) · log( Pr(w_i|c) / Pr(w_i|d) )

  • Pr(w_i|c) is the probability of word w_i given class c; • Pr(w_i|d) is the proportion of word w_i in document d; • n is the number of words in d; • T is the vocabulary; • w_i is the i-th word in the vocabulary.
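A matching sketch of the modified score, reusing the same illustrative data structures as the earlier sketch. The sum over the vocabulary T reduces to a sum over the words actually in the document, because terms with Pr(w|d) = 0 contribute nothing.

    import math
    from collections import Counter

    def modified_nb_score(words, c, prior, word_prob):
        """Score_c(d) = log(Pr(c))/n + sum over T of Pr(w|d) * log(Pr(w|c) / Pr(w|d))."""
        n = len(words)
        score = math.log(prior[c]) / n
        for w, k in Counter(words).items():
            p_wd = k / n                           # Pr(w_i | d): frequency in the document
            p_wc = word_prob[c].get(w, 1e-10)      # Pr(w_i | c): illustrative smoothing floor
            score += p_wd * math.log(p_wc / p_wd)  # negative relative (cross) entropy term
        return score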

  12. Experimental Evaluation

  13. Experimental Evaluation To gain insight into the learned classifiers, the authors ask which words contribute most highly to the quantity Score_c(d) for each class. Most of the highly weighted words are intuitively prototypical of their class. Many words that are conventionally placed on stop lists are highly weighted by the model and were therefore retained in the vocabulary.

  14. Coverage – the percentage of pages of a given class that are correctly classified. Accuracy – the percentage of pages classified into a given class that are actually members of that class.
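In modern terminology these are the recall and precision of each class. A small sketch, assuming parallel lists of predicted and true labels (an illustrative format, not the paper's):

    def coverage_and_accuracy(predicted, actual, target):
        """Coverage = recall of the target class; accuracy = precision."""
        tp = sum(1 for p, a in zip(predicted, actual) if p == target and a == target)
        n_actual = sum(1 for a in actual if a == target)
        n_predicted = sum(1 for p in predicted if p == target)
        coverage = tp / n_actual if n_actual else 0.0
        accuracy = tp / n_predicted if n_predicted else 0.0
        return coverage, accuracy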

  15. First-Order Text Classification: Quinlan's FOIL Algorithm, Introduction Two families of first-order learning systems: 1. Successive revision: a faulty theory is too general if it covers negative examples, and too specific if it does not cover all positive examples; the theory is revised until all positive examples, and no negative ones, are covered. 2. Separate-and-conquer strategy (greedy algorithms): all examples are considered together, and at each iteration a new element (a literal) is added that covers some positive examples but no negative examples. Reference: J.R. Quinlan, R.M. Cameron-Jones, "Induction of Logic Programs: FOIL and Related Systems".

  16. Description of FOIL As an example task, consider learning a definition of the membership relation on lists in a small world containing just the lists [ ], [1], [2], [3], [1,2], [2,3], and [1,2,3]. The target relation member(E,L) contains pairs whose first constant denotes an element that belongs to the list denoted by the second. In this small world there are just ten tuples in member: <1,[1]> <2,[2]> <3,[3]> <1,[1,2]> <2,[1,2]> <2,[2,3]> <3,[2,3]> <1,[1,2,3]> <2,[1,2,3]> <3,[1,2,3]> As far as FOIL is concerned, lists like [1,2,3] are just constants, so a background relation components(L,H,T) is required to show how to find the head H and tail T of a list L. The tuples making up components are: <[1],1,[ ]> <[2],2,[ ]> <[3],3,[ ]> <[1,2],1,[2]> <[2,3],2,[3]> <[1,2,3],1,[2,3]> where the first states that list [1] has head 1 and tail [ ].

  17. Description of FOIL (continued) /* top-down approach */

  Initialization:
      theory := null program                  /* learning the concept member(E,L) */
      remaining := all positive tuples of the target relation R   /* <1,[1]>, …, <3,[1,2,3]> */
  Iteration:
      while remaining is not empty:           /* some positive examples are not yet covered */
          clause := R(A,B,…) :-               /* start with an empty clause body */
          while clause covers negative tuples:
              find an appropriate literal L (e.g., a background relation)
              add L to the right-hand side of clause
          remove the positive tuples covered by clause from remaining
          add clause to theory
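A compact Python rendering of this separate-and-conquer loop. The callbacks candidate_literals and covers are illustrative placeholders, and the simple positives-minus-negatives selection below stands in for FOIL's actual information-gain heuristic.

    def foil(positives, negatives, candidate_literals, covers):
        """Learn clauses greedily until every positive tuple is covered.

        covers(body, t) -> bool : does a clause with this body cover tuple t?
        candidate_literals(body): literals that could extend the body.
        """
        theory, remaining = [], set(positives)
        while remaining:                        # some positives still unexplained
            body, pos, neg = [], set(remaining), set(negatives)
            while neg:                          # specialize until no negative is covered
                best = max(candidate_literals(body),
                           key=lambda l: sum(covers(body + [l], t) for t in pos)
                                       - sum(covers(body + [l], t) for t in neg))
                body.append(best)
                pos = {t for t in pos if covers(body, t)}
                neg = {t for t in neg if covers(body, t)}
            theory.append(body)
            remaining -= pos                    # positives explained by this clause
        return theory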

  18. Description of FOIL (continued) Initialization: We illustrate the process with the member(E,L) relation. The initial clause consists of just the head literal: member(A,B) :- (where A is the element and B the list). The examples corresponding to this initial partial clause are all possible positive and negative tuples of the relation member(A,B). All 10 positive examples: <1,[1]>(+) <2,[2]>(+) <3,[3]>(+) <1,[1,2]>(+) <2,[1,2]>(+) <2,[2,3]>(+) <3,[2,3]>(+) <1,[1,2,3]>(+) <2,[1,2,3]>(+) <3,[1,2,3]>(+) and some negative examples: <1,[ ]>(-) <1,[2]>(-) <1,[3]>(-) <1,[2,3]>(-) <2,[ ]>(-)

  19. Description of FOIL (continued) The literal components(B,A,C) is now added to the clause body, giving the intermediate theory: member(A,B) :- components(B,A,C) The new clause has three variables and is satisfied by the following tuples <A,B,C>: <1,[1],[ ]>(+) <2,[2],[ ]>(+) <3,[3],[ ]>(+) <1,[1,2],[2]>(+) <2,[2,3],[3]>(+) <1,[1,2,3],[2,3]>(+) For instance, <1,[1],[ ]> is removed from remaining because the values A=1, B=[1], C=[ ] satisfy the clause. In other words, if an element is the head of a list, it is a member of that list. Only 4 positive examples remain that are not covered by this clause: <2,[1,2]>(+) <3,[2,3]>(+) <2,[1,2,3]>(+) <3,[1,2,3]>(+)

  20. Description of FOIL (continued) Adding a further literal gives the new partial clause member(A,B) :- components(B,C,D), member(A,D) which covers the remaining 4 positive examples (as tuples <A,B,C,D>): <2,[1,2],1,[2]>(+) <3,[2,3],2,[3]>(+) <2,[1,2,3],1,[2,3]>(+) <3,[1,2,3],1,[2,3]>(+) Every positive example covered by a clause is removed from remaining, so the definition of member(E,L) is complete: member(A,B) :- components(B,A,C). member(A,B) :- components(B,C,D), member(A,D). and it generalizes to elements other than 1, 2, and 3. Example trace for member(4,[1,2,3,4]): member(4,[1,2,3,4]) ⇐ components([1,2,3,4],1,[2,3,4]), member(4,[2,3,4]) member(4,[2,3,4]) ⇐ components([2,3,4],2,[3,4]), member(4,[3,4]) member(4,[3,4]) ⇐ components([3,4],3,[4]), member(4,[4]) member(4,[4]) ⇐ components([4],4,[ ])
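The finished definition translates almost line for line into executable code; a minimal sketch, with the components relation modeled as head/tail decomposition of a Python list:

    def components(lst):
        """Background relation components(L,H,T): head and tail of a non-empty list."""
        return (lst[0], lst[1:]) if lst else None

    def member(e, lst):
        """The two learned clauses:
        member(A,B) :- components(B,A,C).
        member(A,B) :- components(B,C,D), member(A,D).
        """
        parts = components(lst)
        if parts is None:
            return False
        head, tail = parts
        return e == head or member(e, tail)

    assert member(4, [1, 2, 3, 4])      # resolves through the recursive clause
    assert not member(5, [1, 2, 3])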

  21. First-Order Text Classification • Quinlan's FOIL algorithm for WebKB • A greedy algorithm for learning function-free clauses. • Background relations (stemmed words with at least 200 occurrences): • has_word(Page): indicates which words occur in which pages; • link_to(Page, Page): represents the hyperlinks that interconnect the pages in the data set. For each FOIL class classifier, the m-estimate of a rule's accuracy is calculated to determine the winning class for each document d:

  m-estimate = (n_c + m·p) / (n + m)

  • n_c – the number of instances correctly classified by the rule; • n – the total number of instances classified by the rule; • p – a prior estimate of the rule's accuracy; • m – a constant called the equivalent sample size, which determines how heavily p is weighted relative to the observed data (m = 2).
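The m-estimate itself is a one-liner; the prior p = 0.5 and the counts in the usage line below are illustrative, not values from the paper.

    def m_estimate(n_c, n, p, m=2.0):
        """m-estimate of rule accuracy: (n_c + m*p) / (n + m).

        With m = 2 and p = 0.5 this coincides with the Laplace correction.
        """
        return (n_c + m * p) / (n + m)

    print(m_estimate(9, 10, 0.5))  # 0.833..., pulled slightly toward the prior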

  22. A few of the rules learned by FOIL • For the relation course(A), FOIL learned the following rule: • the page contains the word instructor but not the word good; • the page links to another page that contains no outgoing links; • that linked page contains the word assign.

  23. Experimental Evaluation

  24. Combining Learners • Method for combining the predictions of the classifiers: • a simple voting scheme among all four classifiers (the majority of the votes made by the individual classifiers wins); • in case of a tie, the confidence level is used as a tie-breaker. • To ensure the scores are comparable: • calibrate each classifier by inducing a mapping from its output scores to the probability of a prediction being correct; • partition the scores produced by each classifier into bins, then measure the training-set accuracy of the scores that fall into each bin.
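A minimal sketch of the bin-based calibration step. The slide does not specify the binning, so the fixed bin edges and data layout here are assumptions.

    import bisect

    def calibrate_by_bins(scores, correct, bin_edges):
        """Map a raw classifier score to the training-set accuracy of its score bin.

        scores    : training-set output scores of one classifier
        correct   : parallel booleans, whether each prediction was right
        bin_edges : ascending thresholds partitioning the score range
        """
        hits = [0] * (len(bin_edges) + 1)
        totals = [0] * (len(bin_edges) + 1)
        for s, ok in zip(scores, correct):
            b = bisect.bisect_right(bin_edges, s)
            totals[b] += 1
            hits[b] += ok
        acc = [h / t if t else 0.0 for h, t in zip(hits, totals)]
        return lambda s: acc[bisect.bisect_right(bin_edges, s)]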

  25. Experimental Evaluation

  26. Experimental Evaluation

  27. Identifying Multi-Page Segments • Goal: develop methods for identifying sets of interlinked pages that represent a single knowledge-base instance: • prior assumption: one page, one instance (a primary page plus auxiliary pages); • new approach: group related pages together using regularities in URL structure; • identify the most representative page of a group (for example, the "/~*/" naming pattern identifies a person entity); • the main page can be recognized by file names such as index, home, or cs???.

  28. The URL Grouping Algorithm (Appendix C)
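The algorithm itself appears in the paper's Appendix C. As a flavor of URL-based grouping, here is a hedged sketch that buckets pages by their "/~user/" directory and prefers index-like file names for the representative page; all patterns and heuristics here are illustrative, not the paper's.

    import re
    from collections import defaultdict
    from urllib.parse import urlparse

    USER_DIR = re.compile(r"^(/~[^/]+/)")   # hypothetical "/~user/" grouping pattern

    def group_by_url(urls):
        """Group URLs by host plus leading user directory; pick an index-like main page."""
        groups = defaultdict(list)
        for url in urls:
            parsed = urlparse(url)
            m = USER_DIR.match(parsed.path)
            key = (parsed.netloc, m.group(1) if m else parsed.path)
            groups[key].append(url)
        result = {}
        for members in groups.values():
            main = next((u for u in members
                         if re.search(r"/(index|home)\.\w+$|/$", u)), members[0])
            result[main] = members
        return result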

  29. Experimental Evaluation

  30. Future work • Methods for document classification: • Bayesian learning: Minimum Description Length (MDL); • symbolic learning: decision trees; • the k-NN (nearest neighbor) algorithm.
