
Department of Computer Science Brigham Young University Q Wang November, 2000

Ontology-Based Binary-Categorization of Multiple-Record Web Documents Using a Probabilistic Retrieval Model. Department of Computer Science Brigham Young University Q Wang November, 2000. Multiple-Record Web Documents-1. Relevant document--a chunk of Car-sale Ads.


Presentation Transcript


  1. Ontology-Based Binary-Categorization of Multiple-Record Web Documents Using a Probabilistic Retrieval Model Department of Computer Science Brigham Young University Q Wang November, 2000

  2. Multiple-Record Web Documents-1 Relevant document: a chunk of car-sale ads. Acura Integra 1990 $4,000 (1/27/00) ACURA '90 Integra, AC, AM/FM cassette, cruise, new tires. Asking $4,000. (302) 226-5444. Acura Integra 1992 $5,900 (1/27/00) ACURA '92 Integra RS, white, excellent condition. $5,900. 410-548-1353

  3. Multiple-Record Web Documents-2 Irrelevant document: a chunk of motorcycle ads. '97 HONDA ACESHADOW 1100cc 4k. Customized. $7.5K/obo 410-465-0870 '97 HONDA CR250 Exc. cond. $3300/OBO. (410) 479-4499

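  4. Application Ontology [Diagram: the Car application ontology, relating Car to the object sets Year, Price, Make, Model, Mileage, Feature, and PhoneNr; the edges carry participation constraints such as 1:*, 0:0.975:1, 0:0.8:1, 0:0.925:1, 0:0.908:1, 0:1.15:*, 0:2.2:*, and 0:0.45:1]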

  5. Document Representation • A set of <index term : term frequency> pairs A1:x1, …, An:xn • A density heuristic value • A grouping heuristic value P(R|d): P(R|(x1, …, xn)), P(R|Density), P(R|Grouping)

  6. Independence Assumption P(R|(Year, …, Make)) → independence assumption → P(R|Year), …, P(R|Make)

  7. Logistic Regression Input: data from a training set → logistic regression package → Output
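As a rough illustration of what such a package does for a single index term, the sketch below fits C0 and C1 by plain gradient ascent on the log-likelihood. The slides do not name the package or its fitting method, so the algorithm, the function names, and the training pairs here are all hypothetical stand-ins.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(xs, ys, lr=0.5, steps=5000):
    """Fit P(R|x) = sigmoid(c0 + c1*x) by gradient ascent (assumed method)."""
    c0 = c1 = 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = y - sigmoid(c0 + c1 * x)  # gradient of the log-likelihood
            g0 += err
            g1 += err * x
        c0 += lr * g0 / n
        c1 += lr * g1 / n
    return c0, c1

# Hypothetical training set: term frequency x, relevance label y (1 = relevant).
xs = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
ys = [1, 1, 1, 0, 1, 0, 0, 0]
c0, c1 = fit_logistic(xs, ys)
print(c0, c1)
```

The toy labels overlap on purpose: with perfectly separable data the maximum-likelihood coefficients grow without bound.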

  8. Probability Estimation For a test document, the term frequency of the index term Make is x_Make = 0.4514. P(R|Make) = 1/(1 + exp(-(C0 + C1 * x_Make))) = 1/(1 + exp(-(8.358 + (-1.606 * 0.4514)))) = 0.9995
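The arithmetic on this slide can be checked with a few lines of Python. This is only a sketch: the function name is illustrative, and the coefficients are the ones quoted on the slide.

```python
import math

def relevance_probability(c0, c1, x):
    """Logistic model P(R|x) = 1 / (1 + exp(-(c0 + c1 * x)))."""
    return 1.0 / (1.0 + math.exp(-(c0 + c1 * x)))

# C0, C1 and x_Make as quoted on the slide for the index term Make.
p_make = relevance_probability(8.358, -1.606, 0.4514)
print(round(p_make, 4))  # 0.9995
```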

  9. Probability Fitting Curve P(R|x) = 1/(1+exp(-(C0+C1 x))) [Plot: P against x, showing the fitted curve P(R|x) with starred data points and one point xi marked with its value P(R|xi)]

  10. Relevance Probability Calculation For a Car Sale document in a test set, we have Index = [Ye, Ma, Mo, Mi, Pr, Fe, Ph, De, Gr] C0 = [.6, 8.4, 3.7, 22.8, 15.5, 5.9, -2.5, 61.9, 29.2] C1 = [-.2, -1.6, -.9, -1.7, -3.0, -2.5, 1.1, -10.1, -20.5] X = [.26, .25, .14, .07, .23, .84, .26, .15, .33] I = [1, 1, 1, 1, 1, 1, 1, 1, 1] Y = C0 * I^T + C1 * X^T = 134.111 P(R|d) = 1/(1 + exp(-Y)) ≈ 1
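A minimal sketch of this calculation follows, using the coefficient vectors as printed on the slide. It reads the eighth C1 entry as -10.1 so that each of the nine index terms has one coefficient. With these rounded values Y comes out near 134.02 rather than 134.111; the gap is presumably because the slide computed Y from unrounded coefficients, which is an assumption.

```python
import math

index = ["Ye", "Ma", "Mo", "Mi", "Pr", "Fe", "Ph", "De", "Gr"]
c0 = [0.6, 8.4, 3.7, 22.8, 15.5, 5.9, -2.5, 61.9, 29.2]
c1 = [-0.2, -1.6, -0.9, -1.7, -3.0, -2.5, 1.1, -10.1, -20.5]  # eighth entry assumed to be -10.1
x = [0.26, 0.25, 0.14, 0.07, 0.23, 0.84, 0.26, 0.15, 0.33]

# Y = C0 * I^T + C1 * X^T, where I is a vector of ones.
y = sum(c0) + sum(c * v for c, v in zip(c1, x))

# P(R|d) = 1 / (1 + exp(-Y)); for Y this large the result rounds to 1.
p = 1.0 / (1.0 + math.exp(-y))
print(round(y, 3), p)
```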

  11. Statistical Information : P-Value • A p-value is a significance indicator. • A large p-value indicates either a bad regression model or a statistically insignificant index term. • We should keep only significant index terms.

  12. Dependent Relations • Dependent relations exist among index terms. The independence assumption oversimplifies the problem and causes distortion. For example, in the Car Ads application ontology, we expect Make and Model to appear together. • Performance can be improved by including significant dependent relations in the relevance probability calculation.

  13. Estimation of Relevance Probability-2 [Diagram: P(R|Year), …, P(R|Feature), P(R|Density), P(R|Grouping), and P(R|Correlation-1), …, P(R|Correlation-n) feed a node labeled Multiplication, which yields P(R|d)]

  14. Comparison

  15. Contribution • We propose a probabilistic model which can accurately classify multiple-record Web documents. • We will study the impact of dependent relations on the performance of our model.
