
Structured Querying of Web Text: A Technical Challenge

Structured Querying of Web Text: A Technical Challenge, by Cafarella, Ré, Suciu, Etzioni & Banko. Kulsawasd Jitkajornwanich, University of Texas at Arlington, kulsawasdj@hotmail.com. CSE6339 Web Mining | April 16, 2009 | 9:30 am.

Presentation Transcript


  1. Structured Querying of Web Text: A Technical Challenge, by Cafarella, Ré, Suciu, Etzioni & Banko • Kulsawasd Jitkajornwanich, University of Texas at Arlington, kulsawasdj@hotmail.com • CSE6339 Web Mining | April 16, 2009 | 9:30 am

  2. Introduction • What is a structured query? • There are 2 types of query: structured and unstructured • 1. Structured query • Has a “condition” in the query • Can express complicated queries • ex. SQL query: list employees whose name starts with 'David' and whose salary > 5000 • SELECT E.name, E.salary • FROM Employee E • WHERE E.name LIKE 'David%' AND E.salary > 5000
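
To make the structured query on this slide concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table contents and the function name `employees_named_david` are invented for illustration. The condition is written as `LIKE 'David%' AND salary > ?`, which is one standard SQL way to say "name starts with David and salary above 5000".

```python
import sqlite3

# Hypothetical sample rows; the slide does not give any table contents.
SAMPLE_EMPLOYEES = [
    ("David Lee", 6200.0),
    ("David Kim", 4800.0),
    ("Maria Diaz", 7000.0),
]


def employees_named_david(rows, min_salary=5000):
    """Run the slide's structured query against an in-memory Employee table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Employee (name TEXT, salary REAL)")
    conn.executemany("INSERT INTO Employee VALUES (?, ?)", rows)
    result = conn.execute(
        "SELECT E.name, E.salary FROM Employee E "
        "WHERE E.name LIKE 'David%' AND E.salary > ? "
        "ORDER BY E.name",
        (min_salary,),
    ).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    print(employees_named_david(SAMPLE_EMPLOYEES))
```

Both conditions must hold for a row to be returned, which is what a plain keyword search cannot express.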

  3. Introduction • What is a structured query? • 2. Unstructured query • ex. keyword search • No “condition” in the query • Simply performs string matching

  4. Introduction • So far we have covered the types of query. What about the types of data? • 2 types of data: • 1. Structured data • ex. relational tables • 2. Unstructured data • ex. web documents

  5. Introduction • Objective of the paper: • To propose a tool called ExDB that runs structured queries over web documents (unstructured data) • [Diagram: structured query (complicated query such as SQL) → relational database (structured data); unstructured query (keyword search) → search engine over web text (unstructured data); ExDB: SQL-like query → web text]

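  6. How it works: Big Picture of ExDB • [Diagram: a collection of web documents feeds the ExDB Extractor, which fills the Fact Table, Type Table and Constraint Table in an RDBMS database; the user sends a query such as q(?x,?y):- invented(?x,?y) to the ExDB Compiler, which returns the resulting table]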

  8. Outline • 1st component: ExDB Extractor • What does it do, and how? • 2nd component: ExDB Compiler • What does it do, and how? • Test your understanding! • Working on tasks • Comparing results: ExDB vs. Google • Conclusion

  9. How ExDB Works • 1st component: ExDB Extractor • What does it do? • Extracts data from the web documents and puts it into tables

  10. How ExDB Works • 2nd component: ExDB Compiler • What does it do? • Processes the user’s structured query against the tables produced by the 1st component (ExDB Extractor) and returns the resulting table to the user • ex. q(?x, ?y):- invented(?x, ?y) • (this query syntax is covered later)

  11. How it works: Big Picture of ExDB • [Diagram: 1st component, ExDB Extractor: reads the collection of web documents (e.g. “…was surprising. In 1877, Edison invented the light bulb. Although he …”) and fills the Fact Table, Type Table and Constraint Table in the RDBMS database • 2nd component, ExDB Compiler: the user makes a query using ExDB syntax]

  12. 1st Component: ExDB Extractor • What does it do? • Extracts data from the web documents and puts it into tables • There are 3 tables: • 1. Fact Table • 2. Type Table • 3. Constraint Table • Additional column: stores the tuple probability • Discussion: why do we need this column? • 0 < p < 1, Σ p_i = 1 • One way to assign probabilities: count occurrence frequencies • Assume independence among tuples

  13. 1st Component: ExDB Extractor • 1.1 Fact Table • Stores fact information • ex. “Edison invented light bulb” • Uses TextRunner to extract • What does it look like? • Probability = number of occurrences / number of predicate occurrences • [Fact Table]

  14. Example 1: how the Fact Table is built • TextRunner reads sentences such as: “…was surprising. In 1877, Edison invented the light bulb. Although he …”; “It was a big news when Edison invented the light bulb. …”; “… We all know that Edison invented light bulb.”; “not only that Edison also invented the phonograph.” • Probability = number of occurrences / number of predicate occurrences • [Fact Table with Object and Predicate columns]
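
The frequency-based probability on this slide can be sketched in a few lines of Python. This is only an illustration, not TextRunner itself: it assumes the extractor has already produced (subject, predicate, object) triples, and the four triples below are read off the four example sentences on the slide (three mentions of the light bulb, one of the phonograph). The function name `build_fact_table` is hypothetical.

```python
from collections import Counter

# Triples assumed to come out of the extractor for the slide's four sentences.
EXTRACTIONS = [
    ("Edison", "invented", "light bulb"),
    ("Edison", "invented", "light bulb"),
    ("Edison", "invented", "light bulb"),
    ("Edison", "invented", "phonograph"),
]


def build_fact_table(extractions):
    """Map each distinct triple to occurrences / occurrences of its predicate."""
    tuple_counts = Counter(extractions)
    predicate_counts = Counter(predicate for _, predicate, _ in extractions)
    return {
        triple: count / predicate_counts[triple[1]]
        for triple, count in tuple_counts.items()
    }


if __name__ == "__main__":
    for triple, prob in build_fact_table(EXTRACTIONS).items():
        print(triple, prob)
```

With this normalization, the probabilities of all tuples sharing one predicate add up to 1.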

  15. 1st Component: ExDB Extractor • Discussion: • What might be a problem with this Fact Table design? • It cannot support ternary predicates, ex. “David donates books to Child Organization.” • [Fact Table]

  16. 1st Component: ExDB Extractor • 1.2 Type Table • Stores object type information • ex. “Edison is a scientist.” • Uses KnowItAll to extract • What does it look like? • Probability = number of occurrences / number of type occurrences • [Type Table]

  17. Example 2: how the Type Table is built • KnowItAll reads sentences such as: “…As we know, Edison is a scientist. Although he …”; “… there are many world-famous scientists such as Edison, …”; “scientists such as Edison, …”; “… However, someone claims that Benjamin is also a scientist.” • Probability = number of occurrences / number of type occurrences • [Type Table with Object and Type columns]

  18. 1st Component: ExDB Extractor • 1.3 Constraint Table • Stores constraint information about objects or predicates • The paper discusses 2 types of constraints: synonym and inclusion dependency • Uses DIRT to extract • 1. Synonym • example for predicates: did-invented = invented • example for objects: Edison T. = Edison • 2. Inclusion dependency • example for predicates: be-guardian ⊇ be-parent • example for objects: relative ⊇ sister

  19. Example: how DIRT works for the synonym constraint • [Diagram: DIRT reads the collection of web documents (e.g. “…was surprising. In 1877, Edison invented the light bulb. Although he …”) and maps Edison T. → Thomas Edison and Thomas E. → Thomas Edison]

  20. Example: how DIRT works for the inclusion dependency constraint • [Diagram: DIRT reads the collection of web documents (e.g. “…was surprising. In 1877, Edison invented the light bulb. Although he …”) and relates the predicates be-parent, be-guardian and be-babysitter]

  21. 1st Component: ExDB Extractor • 1.3 Constraint Table • What does it look like? • [Constraint Table with Subset and Superset columns]

  22. 1st Component: ExDB Extractor • Key points of the 1st component (ExDB Extractor): • 1. The ExDB Extractor uses several existing extractors: TextRunner, KnowItAll and DIRT. • 2. A probability column indicates the degree of correctness and handles uncertainty. • 3. Drawback of the Fact Table: only binary predicates are allowed.

  24. 2nd Component: ExDB Compiler • What does it do? • Processes the user’s structured query against the tables from the 1st component (ExDB Extractor) • The result is a table ranked by decreasing probability. • ex. q(?x, ?y):- invented(?x, ?y) • Users are not expected to know the table schema.

  25. 2nd Component: ExDB Compiler • Example: q(?x, ?y):- invented(?x, ?y) • ExDB syntax: • ?x = variable x • w = constant value w • q(?x,?y):- = defines the resulting table q with columns x and y • invented(?x,?y) = returns the pairs of objects x and y related by the predicate “invented” • invented(<scientists> ?x,?y) = same, but restricted to objects x whose type is <scientists> • This syntax is called “Datalog-like notation” • Let’s try some examples!
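
To show how the pieces of this notation fit together, here is a small, hypothetical parser sketch in Python. It is not the ExDB Compiler: the regular expressions and the names `parse_term` and `parse_query` are assumptions made for illustration. It only handles single-token terms and silently skips comparison conditions such as (?z > 1900).

```python
import re

# predicate(arg, arg, ...) with no nested parentheses
ATOM = re.compile(r"([\w-]+)\(([^()]*)\)")
# optional <type>, optional leading "?", then a single-token name
TERM = re.compile(r"(?:<([\w-]+)>\s*)?(\?)?([\w.-]+)")


def parse_term(text):
    """Turn '?x', 'Edison' or '<scientists> ?x' into a small dict."""
    match = TERM.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"cannot parse term: {text!r}")
    type_name, question_mark, name = match.groups()
    return {"type": type_name, "variable": question_mark is not None, "name": name}


def parse_query(query):
    """Split 'q(...):- atom, atom' into result columns and body atoms."""
    head_text, body_text = query.split(":-", 1)
    head = ATOM.fullmatch(head_text.strip())
    if head is None:
        raise ValueError(f"bad query head: {head_text!r}")
    columns = [parse_term(t)["name"] for t in head.group(2).split(",")]
    atoms = [
        (predicate, [parse_term(t) for t in args.split(",")])
        for predicate, args in ATOM.findall(body_text)
    ]
    return columns, atoms


if __name__ == "__main__":
    print(parse_query("q(?x, ?y):- invented(<scientists> ?x, ?y)"))
```

The parsed form separates variables from constants and keeps the optional type restriction next to the variable it applies to, which is the information a compiler would need to pick the Fact Table and Type Table rows.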

  26. Make a Query • Example 4: list all inventions invented by Edison • Tables used: Fact Table • Answer: q(?i):- invented(Edison, ?i) • [q Table]

  27. Make a Query • Example 5: list all scientists who died in 1955 • Tables used: Fact Table, Type Table • Answer: q(?i):- died-in(<scientist> ?i, 1955)

  28. Make a Query • Example 5 (continued): list all scientists who died in 1955 • Tables used: Fact Table, Type Table • Answer: q(?i):- died-in(<scientist> ?i, 1955) • The Fact Table and Type Table are joined; a joined row gets probability 0.20 = 0.50 x 0.40 because we assume independence among tuples, i.e., P(t1, t2) = P(t1) * P(t2) • [Joining Table, q Table]
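
The join on this slide can be sketched as follows. The rows, names and probabilities are hypothetical (the slide's own tables did not survive in the transcript); only the rule P(t1, t2) = P(t1) * P(t2) and the 0.50 x 0.40 = 0.20 arithmetic come from the slide. The function name `typed_selection` is an assumption.

```python
# Hypothetical Fact Table rows: (subject, predicate, object, probability)
FACTS = [
    ("Einstein", "died-in", 1955, 0.50),
    ("Fleming", "died-in", 1955, 0.30),
    ("Parker", "died-in", 1955, 0.60),
]
# Hypothetical Type Table rows: (object, type, probability)
TYPES = [
    ("Einstein", "scientist", 0.40),
    ("Fleming", "scientist", 0.90),
]


def typed_selection(facts, types, predicate, type_name, value):
    """Answer predicate(<type_name> ?i, value), multiplying probabilities."""
    type_prob = {obj: p for obj, t, p in types if t == type_name}
    rows = [
        (subject, p * type_prob[subject])  # independence assumption
        for subject, pred, obj, p in facts
        if pred == predicate and obj == value and subject in type_prob
    ]
    # Rank by decreasing probability, as the compiler's result table does.
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    for name, prob in typed_selection(FACTS, TYPES, "died-in", "scientist", 1955):
        print(name, round(prob, 2))
```

A subject with no matching Type Table row (here "Parker") drops out of the join entirely, no matter how probable its fact tuple is.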

  29. Make a Query • Example 6: list all scientists who died after 1900, with their inventions and the year they died • Tables used: Fact Table, Type Table • Answer: q(?x, ?y, ?z):- invented(?x, ?y), died-in(<scientist> ?x, ?z), (?z > 1900)

  30. Make a Query • Example 6 (continued): list all scientists who died after 1900, with their inventions and the year they died • Tables used: Fact Table, Type Table • The joined row gets probability 0.14 = 0.50 x 0.40 x 0.70 • [Joining Table, q Table]
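
A sketch of this three-way combination in Python: two Fact Table predicates are joined on the shared variable ?x, the Type Table restricts ?x to scientists, and a comparison filters on the year. All rows and probabilities are hypothetical, chosen so that one row reproduces the slide's 0.50 x 0.40 x 0.70 = 0.14; the function name `scientists_died_after` is an assumption.

```python
# Hypothetical rows: (x, y, probability) for invented(?x, ?y)
INVENTED = [("Edison", "light bulb", 0.50), ("Franklin", "lightning rod", 0.60)]
# Hypothetical rows: (x, z, probability) for died-in(?x, ?z)
DIED_IN = [("Edison", 1931, 0.40), ("Franklin", 1790, 0.80)]
# Hypothetical Type Table rows: (object, type, probability)
TYPES = [("Edison", "scientist", 0.70), ("Franklin", "scientist", 0.90)]


def scientists_died_after(invented, died_in, types, year):
    """invented(?x, ?y), died-in(<scientist> ?x, ?z), (?z > year)."""
    scientist_prob = {name: p for name, t, p in types if t == "scientist"}
    rows = []
    for x, y, p_invented in invented:
        for x2, z, p_died in died_in:
            if x == x2 and x in scientist_prob and z > year:
                # Product of the three tuple probabilities (independence).
                rows.append((x, y, z, p_invented * p_died * scientist_prob[x]))
    return sorted(rows, key=lambda row: row[3], reverse=True)


if __name__ == "__main__":
    for row in scientists_died_after(INVENTED, DIED_IN, TYPES, 1900):
        print(row[:3], round(row[3], 2))
```

Each extra joined tuple multiplies in another factor below 1, so longer queries tend to produce lower-probability answers.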

  31. Test Your Understanding! • Problem 1: list all singers who were born in 1980, with their instruments • Tables used: Fact Table, Type Table • Answer: q(?x, ?y):- play(<singer> ?x, <instrument> ?y), born-in(<singer> ?x, 1980)

  32. Test Your Understanding! • Problem 2: list all singers whose income is higher than their producer’s • Tables used: Fact Table, Type Table • Answer: q(?x):- has-income(<singer> ?x, ?y), has-income(<producer> ?m, ?n), being-producer(?m, ?x), (?y > ?n)

  33. Make a Query • Example 7: list all inventions discovered by Edison • Tables used: Fact Table, Constraint Table • Answer: q(?i):- discovered(Edison, ?i) • [q Table] • Discussion: in this case, what can we do to answer this query?
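
One possible answer to this discussion question, sketched in Python: when the queried predicate has no rows of its own, look it up in the Constraint Table and also accept tuples of any predicate listed as its subset. The rows and probabilities are hypothetical, and weighting each answer by the constraint's probability is an assumption of this sketch (it follows the independence rule used for joins), not something the slide states.

```python
# Hypothetical Fact Table rows: (subject, predicate, object, probability)
FACTS = [
    ("Edison", "invented", "light bulb", 0.75),
    ("Edison", "invented", "phonograph", 0.25),
]
# Hypothetical Constraint Table rows: (subset, superset, probability)
CONSTRAINTS = [("invented", "discovered", 0.80)]


def query_with_constraints(subject, predicate, facts, constraints):
    """Answer predicate(subject, ?i), also using subset predicates."""
    # Predicates whose tuples count as `predicate` tuples, with a weight.
    usable = {predicate: 1.0}
    for subset, superset, p in constraints:
        if superset == predicate:
            usable[subset] = max(usable.get(subset, 0.0), p)
    rows = [
        (obj, prob * usable[pred])
        for subj, pred, obj, prob in facts
        if subj == subject and pred in usable
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    for obj, prob in query_with_constraints("Edison", "discovered", FACTS, CONSTRAINTS):
        print(obj, round(prob, 2))
```

A synonym can be handled the same way by storing it as two rows, one in each direction.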

  34. Make a Query • Problem scenario, Example 8 (this example involves PROJECTION): list the names of everyone who invented something • Answer: q(?x):- invented(?x, ?y) • [Joining Table] • Discussion: can you see something wrong in the resulting table? • 0.63 = 0.09 x 7 • [q Table, q Table 2]

  35. Solving the Problem Scenario with the “Panel of Experts” technique • The problem scenario is caused by the projection operation. • Conventional way: • newProb = Σ duplicateProb_i • New way: the “Panel of Experts” technique • Principle: • 1. Define a number n of duplicate outputs to keep, ex. n = 5 (if there are 10 duplicate outputs in total, consider only 5 and discard the other 5), to eliminate low-quality outputs. • 2. newProb is the maximum value among those n duplicate outputs. • newProb = max {duplicateProb_i}; i ≤ n
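
The two ways of scoring a projected row can be compared with a short Python sketch. The rows are hypothetical apart from the 7 duplicates at 0.09 taken from the 0.63 = 0.09 x 7 example. Which n duplicates the panel keeps is not stated on the slide; this sketch assumes the n highest-probability ones. The function names are also assumptions.

```python
from collections import defaultdict

# Projection input: one (name, probability) row per joined tuple.
ROWS = [("Edison", 0.09)] * 7 + [("Bell", 0.30)]


def project_by_sum(rows):
    """Conventional way: add up the probabilities of duplicate rows."""
    totals = defaultdict(float)
    for name, prob in rows:
        totals[name] += prob
    return dict(totals)


def project_by_panel(rows, n=5):
    """Panel-style way: keep n duplicates per name, then take the maximum."""
    groups = defaultdict(list)
    for name, prob in rows:
        groups[name].append(prob)
    return {
        name: max(sorted(probs, reverse=True)[:n])  # assumed: keep the top n
        for name, probs in groups.items()
    }


if __name__ == "__main__":
    print(project_by_sum(ROWS))
    print(project_by_panel(ROWS))
```

Under the sum, seven weak duplicates outrank one stronger tuple; under the panel rule they do not. Note that when the kept duplicates are the n most probable ones, the maximum is the same for any n ≥ 1, so n only matters if the panel is chosen some other way.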

  36. Make a Query • Problem scenario, Example 8 (problem caused by the projection operation): list the names of everyone who invented something • Answer: q(?x):- invented(?x, ?y) • Solved by the “Panel of Experts” technique • [Joining Table; q Table with 0.63 = 0.09 x 7; q Table 2]

  37. 2nd Component: ExDB Compiler • Key points of the 2nd component (ExDB Compiler): • ExDB has its own syntax. • The result is in table format. • The last column is the probability value, and rows are ranked by decreasing probability. The assumption is that the higher the probability, the more accurate the answer. • Top-K can be implemented to reduce running time (increase performance). • In case of a JOIN, the resulting probability is the product of the probabilities of the joined tuples. • In case of a PROJECTION, use the Panel of Experts technique to solve the problem. • If the user’s query contains a relation that does not exist in the Fact Table, the Constraint Table can be used to answer the query.
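
The ranking and top-K points above can be illustrated with Python's standard heapq module. This sketch only shows how the K most probable rows are selected from an already scored result table; it does not reproduce whatever top-K optimization ExDB applies during query evaluation. The rows and the function name `top_k` are hypothetical.

```python
import heapq

# Hypothetical result rows; the last column is the probability.
RESULT_ROWS = [
    ("Edison", "phonograph", 0.20),
    ("Edison", "light bulb", 0.60),
    ("Bell", "telephone", 0.45),
]


def top_k(rows, k):
    """Return the k rows with the highest probability, best first."""
    return heapq.nlargest(k, rows, key=lambda row: row[-1])


if __name__ == "__main__":
    for row in top_k(RESULT_ROWS, 2):
        print(row)
```

`heapq.nlargest` avoids sorting the whole table when K is small compared with the number of rows.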

  38. Working On Task #1 • Synthetic Table • An additional feature that combines the results of several queries q • Example: a Synthetic Table generated by MERGING the answers from died-in(?x,?y), invented(?x,?y), published(?x,?y), taught(?x,?y)

  39. Working On Task #2 • Implementing with the Google search engine • Search textbox: “list all scientist, their inventions, who died before 1955” [GO] • q(?x, ?y):- invented(<scientists> ?x, ?y), died-in(?x, ?z), (?z < 1955)

  40. Comparing Results: ExDB vs. Google • Test query: list all scientists who created something • [Output from ExDB; output from Google] • Comments: • On this test query, ExDB performs much better than Google. • For the Google results, after checking all the links, only 1 document comes close to the answer. • For ExDB, although the results contain some redundancy, the answer is still better.

  41. Conclusion • Only binary predicates are allowed. • The result is in table format (unlike the Google search engine). • The way ExDB obtains answers makes more sense, since it integrates all the data before any query is made. • The extractor has to run before users can make queries. • The information extraction (IE) systems involved are TextRunner, KnowItAll and DIRT. • Users are not expected to know the table schema; instead, the system tries to match as much as it can to answer the query (using synonyms and inclusion dependencies).

  42. Questions?

