
A Text Filtering Method For Digital Libraries

Mustafa Zafer BOLAT, Hayri SEVER. Introduction: Information filtering (IF) routes incoming relevant documents to profiles (queries). Information retrieval (IR) provides a list of documents ordered by similarity to the user query.


Presentation Transcript


  1. A Text Filtering Method For Digital Libraries. Mustafa Zafer BOLAT, Hayri SEVER

  2. Introduction • Information filtering (IF): incoming relevant documents are routed to profiles (queries). • Information retrieval (IR): provides a list of documents ordered by similarity to the user query.

  3. Introduction (continued) • Linear separation: partitions relevant and nonrelevant documents into distinct blocks. • Optimal queries: all relevant documents are ranked ahead of nonrelevant ones. • Steepest Descent Algorithm (SDA)

  4. Preliminaries • An information retrieval system S can be defined as a 5-tuple: S = (T, D, Q, V, f) - T: set of ordered index terms - D: set of documents - Q: set of queries - V: set of real numbers - f: D × Q → V: retrieval function

  5. Preliminaries (continued) • Vector Space Model - Transforms raw text into more computationally useful forms - Documents and queries are represented as vectors of weighted terms • d = (t1, wd1; t2, wd2; ...; tn, wdn), ti ∈ T ∩ d • q = (q1, wq1; q2, wq2; ...; qm, wqm), qi ∈ T ∩ q
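As a minimal sketch of the vector representation above, documents and queries can be held as sparse term-to-weight mappings. The slides do not say which retrieval function f is used, so the inner product below is an assumption, and the names `inner_product` and `rank` are hypothetical.

```python
from typing import Dict, List

# Sparse vector: term -> weight. Terms that are absent have weight 0.
Vector = Dict[str, float]


def inner_product(d: Vector, q: Vector) -> float:
    """Assumed retrieval function f(d, q): sum of weight products over shared terms."""
    return sum(w * q[t] for t, w in d.items() if t in q)


def rank(docs: Dict[str, Vector], q: Vector) -> List[str]:
    """Return document names ordered by decreasing similarity to the query."""
    return sorted(docs, key=lambda name: inner_product(docs[name], q), reverse=True)


docs = {
    "d1": {"meat": 2.0, "trade": 1.0},
    "d2": {"ec": 1.0},
}
query = {"meat": 0.5, "ec": 0.25}
print(rank(docs, query))
```

Any other similarity (for example a cosine measure) could be substituted for `inner_product` without changing the ranking code.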

  6. Preliminaries (continued) • Rnorm value for effectiveness • It measures how relevant documents are distributed relative to nonrelevant ones in the ranked output. • Rank matters.
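The slide does not give a formula for Rnorm. A commonly cited definition is Rnorm = 1/2 × (1 + (S+ − S−) / S+max), where S+ counts document pairs ranked in the preferred order, S− counts pairs ranked in the reverse order, and S+max is the largest possible S+. The sketch below assumes that definition and binary relevance; the function name `rnorm` is hypothetical.

```python
from typing import Sequence


def rnorm(scores: Sequence[float], relevant: Sequence[bool]) -> float:
    """Rnorm for one ranking, assuming binary relevance judgments.

    scores[i] is the retrieval score of document i and relevant[i] its label.
    Tied pairs count toward neither S+ nor S-.
    """
    rel = [s for s, r in zip(scores, relevant) if r]
    non = [s for s, r in zip(scores, relevant) if not r]
    s_max = len(rel) * len(non)  # every (relevant, nonrelevant) pair
    if s_max == 0:
        return 1.0  # nothing to order, so the ranking is trivially perfect
    s_plus = sum(1 for a in rel for b in non if a > b)
    s_minus = sum(1 for a in rel for b in non if a < b)
    return 0.5 * (1 + (s_plus - s_minus) / s_max)
```

Under this definition a value of 1 means every relevant document is ranked ahead of every nonrelevant one, and 0 means the opposite.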

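  7. Preliminaries (continued) Contingency table (a: relevant documents retrieved, b: nonrelevant documents retrieved, c: relevant documents not retrieved) • Precision = a / (a + b) • Recall = a / (a + c) • Breakeven point: where precision and recall are equal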
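The two formulas can be sketched directly from the contingency counts. For the breakeven point, the slides do not say how it is located; one simple approach, assumed here, is to cut the ranking at R documents, where R is the number of relevant documents, because then a + b = a + c and precision equals recall. The function names are hypothetical.

```python
from typing import Sequence, Tuple


def precision_recall(a: int, b: int, c: int) -> Tuple[float, float]:
    """Precision = a / (a + b), recall = a / (a + c); 0.0 when undefined."""
    precision = a / (a + b) if a + b else 0.0
    recall = a / (a + c) if a + c else 0.0
    return precision, recall


def breakeven(scores: Sequence[float], relevant: Sequence[bool]) -> float:
    """Precision (= recall) when exactly R documents are retrieved."""
    n_rel = sum(1 for r in relevant if r)
    if n_rel == 0:
        return 0.0
    ranked = sorted(zip(scores, relevant), key=lambda p: p[0], reverse=True)
    a = sum(1 for _, r in ranked[:n_rel] if r)  # relevant and retrieved
    b = n_rel - a                               # nonrelevant and retrieved
    c = n_rel - a                               # relevant but not retrieved
    precision, _ = precision_recall(a, b, c)
    return precision
```

Other conventions, such as interpolating between the nearest precision and recall values, are also in use and may give slightly different numbers.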

  8. Overview of experiment (flow diagram) • Reuters-21578 data set, split into train and test • Preprocessing: parsing, removing stop words, stemming, transform to vectors, reducing, normalizing • Training with SDA (using category labels) • Optimal query • Effectiveness measures

  9. Overview of experiment: Reuters-21578 • Consists of 21,578 economic news stories that originally appeared on the Reuters newswire in 1987 • Each story has been manually assigned one or more indexing labels from a fixed list • There are 135 TOPICS labels for classification • To use the text corpus for machine learning research, it is split into sets of training and testing examples

  10. Overview of experiment: sample Reuters-21578 document
      <REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="9944" NEWID="5031">
      <DATE>13-MAR-1987 15:45:35.38</DATE>
      <TOPICS><D>livestock</D><D>carcass</D></TOPICS>
      <PLACES><D>usa</D></PLACES>
      <PEOPLE></PEOPLE>
      <ORGS><D>ec</D></ORGS>
      <EXCHANGES></EXCHANGES>
      <COMPANIES></COMPANIES>
      <TEXT>&#2;
      <TITLE>U.S. MEAT GROUP TO FILE TRADE COMPLAINTS</TITLE>
      <DATELINE> WASHINGTON, March 13 - </DATELINE><BODY> The American Meat Institute, AME,said it intended to ask the U.S. government to retaliate against a European Community meat inspection requirement. AME President C. Manly Molpus also said the industry would file a petition challenging Korea's ban of U.S. meat products. Molpus told a Senate Agriculture subcommittee that AME and other livestock and farm groups intended to file a petition under Section 301 of the General Agreement on Tariffs and Trade against an EC directive that, effective April 30, will require U.S. meat processing plants to comply fully with EC standards. Reuter &#3;</BODY></TEXT>
      </REUTERS>

  11. Overview of experiment: parsing • HAS TOPICS=YES • LEWISSPLIT=TRAIN • TOPICS: livestock, carcass • Body: U.S. MEAT GROUP TO FILE TRADE COMPLAINTS The American Meat Institute, AME,said it intended to ask the U.S. government to retaliate against a European Community meat inspection requirement. AME President C. Manly Molpus also said the industry would file a petition challenging Korea's ban of U.S. meat products. Molpus told a Senate Agriculture subcommittee that AME and other livestock and farm groups intended to file a petition under Section 301 of the General Agreement on Tariffs and Trade against an EC directive that, effective April 30, will require U.S. meat processing plants to comply fully with EC standards

  12. Overview of experiment: after parsing • HAS TOPICS=YES • LEWISSPLIT=TRAIN • TOPICS: livestock, carcass • Body: U S MEAT GROUP TO FILE TRADE COMPLAINTS The American Meat Institute AME said it intended to ask the U S government to retaliate against a European Community meat inspection requirement AME President C Manly Molpus also said the industry would file a petition challenging Korea's ban of U S meat products Molpus told a Senate Agriculture subcommittee that AME and other livestock and farm groups intended to file a petition under Section of the General Agreement on Tariffs and Trade against an EC directive that effective April will require U S meat processing plants to comply fully with EC standards
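The parsing step shown above can be sketched with regular expressions: read the attributes of the opening tag, collect the topic labels, and strip digits and punctuation from the title and body. This is only an illustration under assumptions. The function name `parse_story`, the shortened `SAMPLE` story, and the rule that drops the trailing "Reuter" sign-off are hypothetical and not taken from the slides.

```python
import re


def parse_story(sgml: str) -> dict:
    """Extract split, topic labels, and cleaned text from one <REUTERS> element."""
    # Attributes of the opening <REUTERS ...> tag, e.g. TOPICS="YES".
    attrs = dict(re.findall(r'(\w+)="([^"]*)"', sgml.split(">", 1)[0]))
    topics_block = re.search(r"<TOPICS>(.*?)</TOPICS>", sgml, re.S)
    topics = re.findall(r"<D>(.*?)</D>", topics_block.group(1)) if topics_block else []
    parts = [re.search(rf"<{tag}>(.*?)</{tag}>", sgml, re.S) for tag in ("TITLE", "BODY")]
    text = " ".join(m.group(1) for m in parts if m)
    text = re.sub(r"&#\d+;", " ", text)        # control-character entities
    text = re.sub(r"[^A-Za-z' ]+", " ", text)  # digits, punctuation, newlines
    text = re.sub(r"\s+", " ", text).strip()
    text = re.sub(r"\s*Reuter$", "", text)     # assumed: drop the trailing sign-off
    return {
        "has_topics": attrs.get("TOPICS") == "YES",
        "split": attrs.get("LEWISSPLIT"),
        "topics": topics,
        "text": text,
    }


# Shortened, hypothetical story in the same layout as the sample document.
SAMPLE = (
    '<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" NEWID="5031">'
    "<TOPICS><D>livestock</D><D>carcass</D></TOPICS>"
    "<TEXT>&#2;<TITLE>U.S. MEAT GROUP TO FILE TRADE COMPLAINTS</TITLE>"
    "<BODY>The industry would file a petition challenging Korea's ban of "
    "U.S. meat products under Section 301. Reuter &#3;</BODY></TEXT></REUTERS>"
)
print(parse_story(SAMPLE))
```

A dedicated SGML parser would be more robust for the full collection; regular expressions are used here only to keep the sketch short.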

  13. Overview of experiment: removing stop words • HAS TOPICS=YES • LEWISSPLIT=TRAIN • TOPICS: livestock, carcass • Body: U.S. MEAT GROUP TO FILE TRADE COMPLAINTS The American Meat Institute, AME,said it intended to ask the U.S. government to retaliate against a European Community meat inspection requirement. AME President C. Manly Molpus also said the industry would file a petition challenging Korea's ban of U.S. meat products. Molpus told a Senate Agriculture subcommittee that AME and other livestock and farm groups intended to file a petition under Section 301 of the General Agreement on Tariffs and Trade against an EC directive that, effective April 30, will require U.S. meat processing plants to comply fully with EC standards

  14. Overview of experiment: after removing stop words • HAS TOPICS=YES • LEWISSPLIT=TRAIN • TOPICS: livestock, carcass • Body: MEAT GROUP FILE TRADE COMPLAINTS American Meat Institute AME intended ask government retaliate European Community meat inspection requirement. AME President Manly Molpus industry file petition challenging Korea's ban U.S. meat products Molpus Senate Agriculture subcommittee AME livestock farm groups intended file petition Section General Agreement Tariffs Trade EC directive effective April require meat processing plants comply fully EC standards
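Stop-word removal amounts to filtering tokens against a fixed list. The slides do not show the list that was used, so `STOP_WORDS` below is a small hypothetical sample chosen to cover the words dropped in the example story.

```python
from typing import List

# Hypothetical stop list; the list used in the experiment is not given.
STOP_WORDS = frozenset({
    "a", "also", "an", "and", "against", "c", "it", "of", "other", "s",
    "said", "that", "the", "to", "told", "u", "under", "will", "with", "would",
})


def remove_stop_words(text: str) -> List[str]:
    """Split on whitespace and drop tokens that appear in the stop list."""
    return [tok for tok in text.split() if tok.lower() not in STOP_WORDS]


print(remove_stop_words(
    "The American Meat Institute AME said it intended to ask the U S government"
))
```

The comparison is case-insensitive so that headline text in capitals is filtered the same way as body text.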

  15. Overview of experiment: stemming • HAS TOPICS=YES • LEWISSPLIT=TRAIN • TOPICS: livestock, carcass • Body: MEAT GROUP FILE TRADE COMPLAINT American Meat Institute AME intended ask government retaliate European Community meat inspection requirement. AME President Manly Molpus industry file petition challeng Korea ban meat product Molpus Senate Agriculture subcommittee AME livestock farm group intended file petition Section General Agreement Tariff Trade EC direct effect April require meat process plant compli fulli EC standard

  16. Overview of experiment: transform to vectors • HAS TOPICS=YES • LEWISSPLIT=TRAIN • TOPICS: livestock, carcass

  17. Overview of experiment: create dictionary (only in training)
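Building the dictionary from the training stories only, and then mapping every story onto it, might look like the sketch below. The data layout (token lists in, sparse index-to-count vectors out) and the function names are assumptions, since the slides do not show the vector format.

```python
from collections import Counter
from typing import Dict, Iterable, List


def build_dictionary(training_docs: Iterable[List[str]]) -> Dict[str, int]:
    """Assign an index to every distinct term seen in the training documents."""
    terms = sorted({term for doc in training_docs for term in doc})
    return {term: index for index, term in enumerate(terms)}


def to_vector(doc: List[str], dictionary: Dict[str, int]) -> Dict[int, int]:
    """Map a token list to sparse term frequencies; unknown terms are dropped."""
    counts = Counter(term for term in doc if term in dictionary)
    return {dictionary[term]: n for term, n in counts.items()}


train = [["meat", "trade", "meat"], ["ec", "trade"]]
dictionary = build_dictionary(train)
print(to_vector(["meat", "meat", "korea"], dictionary))
```

Because the dictionary is fixed after training, a term that appears only in a test story (here "korea") contributes nothing to its vector.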

  18. Overview of experiment: reducing • HAS TOPICS=YES • LEWISSPLIT=TRAIN • TOPICS: livestock, carcass

  19. Overview of experiment: after reducing • HAS TOPICS=YES • LEWISSPLIT=TRAIN • TOPICS: livestock, carcass

  20. Overview of experiment: normalizing • HAS TOPICS=YES • LEWISSPLIT=TRAIN • TOPICS: livestock, carcass • wk = tk × log(ND / nk) - tk: term frequency - ND: number of documents in the collection - nk: number of documents containing term k • The normalization formula relates the normalized weight of term k to the unnormalized weight of term k (its symbols did not survive in the transcript)

  21. Overview of experiment: after normalizing • HAS TOPICS=YES • LEWISSPLIT=TRAIN • TOPICS: livestock, carcass • wk = tk × log(ND / nk) - tk: term frequency - ND: number of documents in the collection - nk: number of documents containing term k • The normalization formula relates the normalized weight of term k to the unnormalized weight of term k (its symbols did not survive in the transcript)
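The weighting wk = tk × log(ND / nk) can be sketched as follows. The normalization step is an assumption: the formula is missing from the transcript, so this example divides each vector by its Euclidean length (cosine normalization), which is a common choice but may not be the one the authors used. The function name `tfidf_normalized` is hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_normalized(tf_vectors: List[Dict[str, int]]) -> List[Dict[str, float]]:
    """Weight each term by tk * log(ND / nk), then scale vectors to unit length."""
    n_docs = len(tf_vectors)
    # nk: number of documents that contain term k.
    doc_freq = Counter(term for vec in tf_vectors for term in vec)
    result = []
    for vec in tf_vectors:
        weights = {k: tf * math.log(n_docs / doc_freq[k]) for k, tf in vec.items()}
        # Assumed normalization: divide by the Euclidean length of the vector.
        length = math.sqrt(sum(w * w for w in weights.values()))
        if length > 0:
            weights = {k: w / length for k, w in weights.items()}
        result.append(weights)
    return result


vectors = tfidf_normalized([{"meat": 2, "trade": 1}, {"trade": 1, "ec": 1}])
print(vectors)
```

A term that occurs in every document gets weight 0 because log(ND / nk) = log 1 = 0, which is why "trade" drops out in this small example.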

  22. Overview of experiment: training with SDA • Training
      1. Choose a starting query vector Q0; let k = 0.
      2. Let Qk be the query vector at the start of the (k+1)th iteration; identify the following set of difference vectors: Γ(Qk) = {b = d − d' : d ≻ d' and f(Qk, b) ≤ 0}; if Γ(Qk) = ∅, Qopt = Qk is a solution and exit; otherwise,
      3. Let Qk+1 = Qk + Σ b, summed over all b ∈ Γ(Qk).
      4. k = k + 1; go back to Step 2.
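The steps above can be sketched in a few lines. This is an illustration under assumptions, not the authors' implementation: f is taken to be the inner product, d ≻ d' is read as "d is relevant and d' is nonrelevant", vectors are dense lists of equal length, and the `max_iter` cap is an added safeguard. The function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(u, v))


def sda(relevant: List[Sequence[float]],
        nonrelevant: List[Sequence[float]],
        q0: Sequence[float],
        max_iter: int = 150) -> Tuple[List[float], bool]:
    """Return (query, converged) after the steepest descent updates."""
    q = list(q0)
    # Difference vectors b = d - d' for every preferred pair d > d'.
    diffs = [[x - y for x, y in zip(d, d2)] for d in relevant for d2 in nonrelevant]
    for _ in range(max_iter):
        # Gamma(Qk): pairs that the current query fails to order correctly.
        gamma = [b for b in diffs if dot(q, b) <= 0]
        if not gamma:
            return q, True  # every relevant document outranks every nonrelevant one
        # Qk+1 = Qk + sum of all difference vectors in Gamma(Qk).
        q = [qi + sum(b[i] for b in gamma) for i, qi in enumerate(q)]
    return q, False


query, converged = sda([[1, 0], [1, 1]], [[0, 1], [0, 0]], [0, 0])
print(query, converged)
```

When the two classes are not linearly separable the loop never empties Γ(Qk), which is why some stopping rule such as the iteration cap is needed in practice.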

  23. Overview of experiment: training with SDA • Training • All examples of the category are used as positive examples • A random 60% of the examples from other topics are used as negative examples • If the maximum Rnorm value (1) is not reached within the maximum of 150 iterations, the optimal query is set to the query that produced the highest Rnorm value obtained
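Assembling the training set for one category might look like the sketch below. It reads "random 60% from other topics" as a 60% random sample of the stories that do not carry the category label, which is an interpretation; the function name, the `(topics, vector)` data layout, and the `seed` parameter are hypothetical.

```python
import random
from typing import Dict, List, Tuple

Story = Tuple[List[str], Dict[int, float]]  # (topic labels, document vector)


def build_training_set(stories: List[Story], category: str,
                       negative_fraction: float = 0.6,
                       seed: int = 0) -> Tuple[List[Story], List[Story]]:
    """Return (positives, negatives) for one category."""
    positives = [s for s in stories if category in s[0]]
    others = [s for s in stories if category not in s[0]]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    n_negative = round(negative_fraction * len(others))
    negatives = rng.sample(others, n_negative)
    return positives, negatives


stories = [(["livestock"], {0: 1.0})] * 3 + [(["grain"], {1: 1.0})] * 10
pos, neg = build_training_set(stories, "livestock")
print(len(pos), len(neg))
```

Sampling the negatives keeps the number of difference vectors manageable, since each positive example is paired with each negative one during training.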

  24. Overview of experiment • There are 135 categories

  25. Overview of experiment: effectiveness measures • Create contingency tables • Find breakeven points

  26. Results: breakeven points

  27. Thank you!
