Using Weight-controlled Token Matching to Extract Data From HTML Files


Presentation Transcript


  1. Using Weight-controlled Token Matching to Extract Data From HTML Files • Yan Xu, Tok Wang Ling • Dept. of Computer Science, National University of Singapore • (xuyan, lingtw)@comp.nus.edu.sg

  2. Outline • Motivation and background • Our approach • Generating the wrapper • Extracting data • Experimental results • Conclusion

  3. Motivation and Background • What is a wrapper? • XML and HTML • Related work • Some criteria for building wrappers for Web pages

  4. Motivation and Background: What Is a Wrapper? • A wrapper is a software component. • A wrapper extracts data from source files and converts it into a structured form. • On the Web, the source files are usually HTML files.

  5. Motivation and Background: What Is a Wrapper? (Cont…) • The source files are usually semistructured or unstructured. • This paper considers only HTML files as source files.

  6. Motivation and Background: XML and HTML • XML is better suited to organizing data than HTML; HTML is simple, widely used, and widely accepted. • More and more XML sites are appearing on the Web, but HTML files still far outnumber XML files.

  7. Motivation and Background: XML and HTML (Cont…) • Querying XML is easier, and a standard query language is on the way; HTML files are usually queried through web search engines. • XML contains no less information than HTML. • XML is easier to convert to other data models, especially semistructured data models. • XML is suitable as the output of a wrapper. • XML provides one possible semantic interpretation of a document.

  8. Motivation and Background: Related Work • Wrappers for HTML files can be constructed manually or automatically. • Some systems use specification files to extract data, such as the extraction system in TSIMMIS. • Advantages: sufficiently expressive, high precision • Limitations: must be built by experienced programmers, hard to maintain • Building a wrapper this way is very time consuming.

  9. Motivation and Background: Related Work (Cont…) • Rule-based wrappers • Use rules to extract data • Induce rules from training examples • Use delimiter-based rules • Examples: WIEN, STALKER, SoftMealy • Our wrapper is rule-based

  10. Motivation and Background: Some Criteria for Building Wrappers for Web Pages • Simple and powerful extraction rules • Few examples and little user interaction needed • Use HTML structure information as much as possible • Easy to maintain and update • Little time needed to build a wrapper

  11. Our Approach • Rule-based wrapper • Use delimiters to identify data • Use training examples to induce rules • Use a Weighted Token List to identify each delimiter • Use rules and a threshold to extract data

  12. Our Approach (Cont…) • Weighted token list (WTL): a list of vectors. Each vector contains a set of <token, weight> pairs. • Token: an HTML tag, a word, or a punctuation mark in an HTML file • Weight: how important a token is at its position; a number between 0 and 1 • WTLs are generated from labeled examples
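
A minimal sketch of how a weighted token list could be represented, assuming each position is a mapping from token to weight. The names `PositionVector`, `WeightedTokenList`, and `weight_at` are hypothetical and do not come from the presentation.

```python
from typing import Dict, List

# One position of the list: every token seen at that position,
# with its weight in [0, 1].
PositionVector = Dict[str, float]

# Hypothetical type alias: a WTL is an ordered list of position vectors.
WeightedTokenList = List[PositionVector]


def weight_at(wtl: WeightedTokenList, position: int, token: str) -> float:
    """Return the weight of `token` at `position`, or 0.0 if it was never seen there."""
    if position < 0 or position >= len(wtl):
        return 0.0
    return wtl[position].get(token, 0.0)


# Toy WTL with two positions; the second position has two alternative tokens.
example_wtl: WeightedTokenList = [
    {"(": 1.0},
    {"paperback": 0.67, "hardcover": 0.33},
]
```

Keeping one dictionary per position lets several alternative tokens share a position, each with its own weight.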

  13. Our Approach: An Example • Part of a page from Amazon.com

  14. Our Approach: An Example (Cont…) • We want the wrapper to output the following result:
      TITLE: professional xml (2nd edition)
      AUTHOR: nikola ozu, et al
      TYPE: paperback
      DATE: may 2001
      SHIPINFO: usually ships in 24 hours
      LISTPRICE: $59.99
      OURPRICE: $47.99
      Save: 20%

  15. Our Approach: An Example (Cont…) • The label information is entered by the user. • A label states the meaning of a piece of data, so the extracted data can be identified. Examples are “TITLE” and “AUTHOR” on the previous slide. • There are two kinds of users: • The user who builds the wrapper • The user who uses the generated wrapper to extract data

  16. Our Approach: An Example (Cont…) • Part of the HTML source code containing the author information in the example:
      ……
      </td> <td > <font><b> <A href="/exec/obidos/ASIN/1861005059/qid=992196848/sr=1-4/ref=sc_b_4/104-9977965-3139126"> Professional XML (2nd Edition) </A></b><br><font>by Nikola Ozu, et al</font> (Paperback – May 2001)<br>
      ……

  17. Our Approach: An Example (Cont…) • If we choose the token “by” as the left delimiter of the author data and the HTML tag “</font>” as the right delimiter, we get high recall but low precision when extracting author data. • Alternatively, we can choose a sequence of tokens as the delimiter, for example 5 tokens: • The 5 tokens before the author information: “</a>” “</b>” “<br>” “<font>” “by” • The 5 tokens after the author information: “</font>” “(” “Paperback” “-” “May”
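
The token sequences above can be reproduced with a small tokenizer. This is only a sketch under stated assumptions: tags are reduced to their lowercase name without attributes, words are lowercased, each punctuation character is its own token, the `href` is shortened to a placeholder, and a plain hyphen stands in for the dash. The helper names `tokenize` and `context` are hypothetical. The first token before the author data comes out as `</a>` because the source closes the anchor at that point.

```python
import re
from typing import List, Tuple

# A token is an HTML tag, a word, or a single punctuation character.
TOKEN_RE = re.compile(r"<[^>]+>|\w+|[^\w\s]")
TAG_NAME_RE = re.compile(r"<\s*(/?)\s*([A-Za-z][A-Za-z0-9]*)")


def tokenize(html: str) -> List[str]:
    """Split HTML into lowercase tokens; tags are reduced to <name> or </name>."""
    tokens = []
    for raw in TOKEN_RE.findall(html):
        tag = TAG_NAME_RE.match(raw)
        if tag:
            tokens.append(f"<{tag.group(1)}{tag.group(2).lower()}>")
        else:
            tokens.append(raw.lower())
    return tokens


def context(tokens: List[str], start: int, end: int, n: int = 5) -> Tuple[List[str], List[str]]:
    """Return the n tokens before tokens[start:end] and the n tokens after it."""
    return tokens[max(0, start - n):start], tokens[end:end + n]


# Shortened version of the example source; "/x" is a placeholder href.
snippet = ('<A href="/x">Professional XML (2nd Edition)</A></b><br>'
           '<font>by Nikola Ozu, et al</font> (Paperback - May 2001)<br>')
tokens = tokenize(snippet)
start = tokens.index("by") + 1   # author data starts right after "by"
end = tokens.index("</font>")    # and ends right before </font>
before, after = context(tokens, start, end)
print(before)
print(after)
```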

  18. Our Approach: An Example (Cont…) • Surveying the entire example page (25 books), we find: • The 5 tokens before the author data do not change, and they are expressive enough to serve as the left delimiter. • The 5 tokens after the author data are not precise enough to serve as the delimiter. For example, one book is a hardcover with no publication date, so the 5 tokens after its author data are: “</font>” “(” “Hardcover” “)” “<br>” • With exact 5-token delimiters we would have high precision but low recall.

  19. Our Approach: An Example (Cont…) • Surveying the example page, we find that 6 out of 25 books are hardcover and 18 out of 25 are paperback. • Using 3 books as training examples, we obtain the following token lists:

  20. Our Approach: An Example (Cont…) • The begin weighted token list (the tokens before the author data): </a>, 1.0; </b>, 1.0; <br>, 1.0; <font>, 1.0; “by”, 1.0 • The end weighted token list (the tokens after the author data): </font>, 1.0; “(”, 1.0; Paperback, 0.67 or Hardcover, 0.33; “)”, 0.67 or “-”, 0.33; <br>, 0.67 or May, 0.33

  21. Our Approach: An Example (Cont…) • Each token near the data is associated with a weight at its position. • For example, “Paperback, 0.67” means that the token “paperback” is found at that position in 2 out of 3 training examples (i.e. 67%). We assign this frequency (0.67) to the token as its weight at that position.

  22. Our Approach: An Example (Cont…) • The Weighted Token List used to identify the left delimiter is: {by,1.0} {<font>,1.0} {<br>,1.0} {</b>,1.0} {</a>,1.0}

  23. Our Approach: An Example (Cont…) • The Weighted Token List used to identify the right delimiter is: {</font>,1.0} {(,1.0} {hardcover,0.33} {paperback,0.67} {),0.67} {-,0.33} {<br>,0.67} {may,0.33} • Several tokens can share a position: here, two tokens (hardcover and paperback) are found in the third position after the author data.

  24. Our Approach: An Example (Cont…) • Using a Weighted Token List, we obtain: • A list of tokens as the delimiter • A better recall-precision tradeoff, by associating weights with tokens • Tolerance of small modifications to the HTML pages, especially modifications that do not occur near the data

  25. Our Approach: Label the Example Page • Using our GUI tool:

  26. Our Approach: Label the Example Page (Cont…) • The user highlights the data of interest • The user clicks the “input label” button • A dialog window pops up and the user enters the label • We insert the label into the HTML file following our specification

  27. Our Approach: Label the Example Page (Cont…) • After labeling, the modified HTML file is:
      …
      [LABEL:TITLE]Professional XML (2nd Edition)[INFOREND] </a></b><br><font>by [LABEL:AUTHOR]Nikola Ozu, et al[INFOREND] </font>( [LABEL:TYPE]Paperback[INFOREND] – [LABEL:DATE]May 2001[INFOREND])<br>
      …
      • The parts entered by the user are “TITLE”, “AUTHOR”, etc.
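
As an illustration of how the label markers could be read back from a labeled file, here is a small sketch. The marker syntax `[LABEL:NAME]…[INFOREND]` is taken from the slide; the regular expression and the function name `labeled_fields` are assumptions, and the sample string uses a plain hyphen in place of the dash.

```python
import re
from typing import List, Tuple

# Marker syntax taken from the slide: [LABEL:NAME]data[INFOREND]
LABEL_RE = re.compile(r"\[LABEL:([^\]]+)\](.*?)\[INFOREND\]", re.DOTALL)


def labeled_fields(labeled_html: str) -> List[Tuple[str, str]]:
    """Return (label, data) pairs in document order."""
    return [(label, data.strip()) for label, data in LABEL_RE.findall(labeled_html)]


labeled = ("[LABEL:TITLE]Professional XML (2nd Edition)[INFOREND]</a></b><br><font>by "
           "[LABEL:AUTHOR]Nikola Ozu, et al[INFOREND]</font>("
           "[LABEL:TYPE]Paperback[INFOREND] - [LABEL:DATE]May 2001[INFOREND])<br>")
fields = labeled_fields(labeled)
for label, data in fields:
    print(f"{label}: {data}")
```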

  28. Our Approach: An Extraction Rule Has… • Label information • Delimiter information: • Begin WTL (BWTL): a WTL that describes the list of tokens serving as the begin delimiter • End WTL (EWTL): a WTL that describes the list of tokens serving as the end delimiter • A rule contains enough information to extract one piece of data

  29. Our Approach: A Rule Looks Like • <LABEL, BWTL, EWTL> • LABEL is “AUTHOR” in our example • BWTL in our example is: {by,1.0} {<font>,1.0} {<br>,1.0} {</b>,1.0} {</a>,1.0} • EWTL in our example is: {</font>,1.0} {(,1.0} {hardcover,0.33} {paperback,0.67} {),0.67} {-,0.33} {<br>,0.67} {may,0.33}

  30. Our Approach: How to Generate a Rule • Read the label, i.e. the text after “[LABEL:” and before the next “]”, from the examples that the user labeled with our GUI tool • The number of tokens per delimiter is set to 5 by default. The user can generate rules with this setting and test the result; if the result is not good, the user can set the number manually • Generate the BWTL for the left delimiter • Generate the EWTL for the right delimiter • Assemble the label, BWTL, and EWTL into a rule

  31. Our Approach: How to Generate a WTL • Find the begin point and end point of the data in the labeled training example • Collect the lists of tokens before and after the data • Use the collected tokens to create a new WTL, or add them to the corresponding existing WTL, and calculate the weight of each token • The weight of a token is the number of times the token appears near the data divided by the total number of times all tokens appear near the same data in the training examples
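
The weight calculation can be sketched as follows, reading "near the data" as "at the same position relative to the data", which is an interpretation rather than something the slides state outright. The function name `build_wtl` is hypothetical, and the three training token lists are invented so that the result reproduces the weights quoted for the author example; the slides do not show which three books were actually used.

```python
from collections import Counter
from typing import Dict, List


def build_wtl(token_lists: List[List[str]]) -> List[Dict[str, float]]:
    """Build a weighted token list from one token list per training example.

    Position i of the result maps each token seen at position i to
    (times the token was seen there) / (times any token was seen there).
    """
    length = max((len(t) for t in token_lists), default=0)
    wtl = []
    for i in range(length):
        counts = Counter(t[i] for t in token_lists if i < len(t))
        total = sum(counts.values())
        wtl.append({tok: n / total for tok, n in counts.items()})
    return wtl


# Hypothetical training set: tokens after the author data for three books.
end_examples = [
    ["</font>", "(", "paperback", "-", "may"],
    ["</font>", "(", "paperback", ")", "<br>"],
    ["</font>", "(", "hardcover", ")", "<br>"],
]
ewtl = build_wtl(end_examples)
for position, vector in enumerate(ewtl, start=1):
    print(position, {tok: round(w, 2) for tok, w in vector.items()})
```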

  32. Our Approach: Extract Data Using Rules • Tokenize the target HTML file • Obtain a list of tokens and find the corresponding rule in the rule set • Obtain the data • Associate the label with the data • Output the result

  33. Our Approach: Find the Corresponding Rule • Obtain a list of tokens from the web page • Find a rule in the rule set such that the given tokens are found in its Weighted Token List and the sum of their weights is larger than the threshold multiplied by the number of tokens
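
The matching condition can be written down directly. This sketch makes two assumptions: a token that does not appear at its position contributes a weight of 0, and the end WTL of the author example is grouped by position as shown (the slides confirm the grouping only for the third position). The names `match_score` and `matches` are hypothetical.

```python
from typing import Dict, List


def match_score(tokens: List[str], wtl: List[Dict[str, float]]) -> float:
    """Sum the weight of each token at its position; unseen tokens contribute 0."""
    return sum(vec.get(tok, 0.0) for tok, vec in zip(tokens, wtl))


def matches(tokens: List[str], wtl: List[Dict[str, float]], threshold: float = 0.5) -> bool:
    """True if the summed weight exceeds threshold * number of tokens."""
    return match_score(tokens, wtl) > threshold * len(tokens)


# End WTL of the author example, with the weights quoted on the slides.
ewtl = [
    {"</font>": 1.0},
    {"(": 1.0},
    {"paperback": 0.67, "hardcover": 0.33},
    {")": 0.67, "-": 0.33},
    {"<br>": 0.67, "may": 0.33},
]
hardcover = ["</font>", "(", "hardcover", ")", "<br>"]
unrelated = ["</td>", "<td>", "<b>", "price", ":"]
print(matches(hardcover, ewtl), matches(unrelated, ewtl))
```

With 5 tokens and a threshold of 0.5, a candidate delimiter must score above 2.5 to match.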

  34. Our Approach: Threshold • The threshold is between 0 and 1 • In our tests, the result is usually good when the threshold is set between 0.4 and 0.6. We set it to 0.5 by default • The user can test the wrapper and change the threshold

  35. Our Approach: Another Example • HTML source code from Amazon.com containing the author data of a book:
      … </a></b> <br><font>by Cisco Systems (Editor), Vito Amato</font> (Hardcover) <br> …

  36. Our Approach: Another Example (Cont…) • We detect “</a></b><br><font>by” as the left delimiter of the author data; its total weight is 5, which is larger than 5*0.5 = 2.5 • We detect “</font>(Hardcover)<br>” as the right delimiter of the author data; its total weight is 3.7, which is larger than 5*0.5 = 2.5 • The author data lies between the two token lists: “Cisco Systems (Editor), Vito Amato”
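
Putting the pieces together, the following sketch runs this example end to end. It is an illustration under stated assumptions, not the authors' implementation: the tokenizer lowercases everything and strips tag attributes, both WTLs are stored in document order, unseen tokens score 0, and the first begin/end pair that passes the threshold wins. All function names are hypothetical, and mapping the tokens back to the original text is not handled.

```python
import re
from typing import Dict, List, Optional

TOKEN_RE = re.compile(r"<[^>]+>|\w+|[^\w\s]")
TAG_NAME_RE = re.compile(r"<\s*(/?)\s*([A-Za-z][A-Za-z0-9]*)")
WTL = List[Dict[str, float]]


def tokenize(html: str) -> List[str]:
    """Lowercase tokens: tags without attributes, words, single punctuation marks."""
    out = []
    for raw in TOKEN_RE.findall(html):
        tag = TAG_NAME_RE.match(raw)
        out.append(f"<{tag.group(1)}{tag.group(2).lower()}>" if tag else raw.lower())
    return out


def score(window: List[str], wtl: WTL) -> float:
    """Sum of per-position weights; unseen tokens contribute 0."""
    return sum(vec.get(tok, 0.0) for tok, vec in zip(window, wtl))


def extract(tokens: List[str], bwtl: WTL, ewtl: WTL, threshold: float = 0.5) -> Optional[List[str]]:
    """Return the tokens between the first matching begin and end delimiters."""
    n, m = len(bwtl), len(ewtl)
    for b in range(len(tokens) - n + 1):
        if score(tokens[b:b + n], bwtl) > threshold * n:
            start = b + n
            for e in range(start, len(tokens) - m + 1):
                if score(tokens[e:e + m], ewtl) > threshold * m:
                    return tokens[start:e]
    return None


# WTLs of the AUTHOR rule, in document order, with the weights from the slides.
bwtl = [{"</a>": 1.0}, {"</b>": 1.0}, {"<br>": 1.0}, {"<font>": 1.0}, {"by": 1.0}]
ewtl = [{"</font>": 1.0}, {"(": 1.0}, {"paperback": 0.67, "hardcover": 0.33},
        {")": 0.67, "-": 0.33}, {"<br>": 0.67, "may": 0.33}]

html = "</a></b> <br><font>by Cisco Systems (Editor), Vito Amato</font> (Hardcover) <br>"
author = extract(tokenize(html), bwtl, ewtl)
print("AUTHOR:", " ".join(author))
```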

  37. Result Analysis • We define: • Field: a piece of data, the smallest unit that our wrapper can handle. For example, the author data of a book • Item: a group of fields, such as all the data of a book • The unit of training for our wrapper is the item. For example, an Amazon.com page usually contains information on more than ten books (items); only a few of them (3 items), not the entire page, need to be labeled as training examples

  38. Result Analysis (Cont…) • Basic information for the ten test web sites • Test environment: Java SDK 1.3 on a PC running Windows NT 4.0 Workstation (Intel PIII 800, 128 MB RAM)
      Source          Size (KB)  No. of items  No. of fields in each item  No. of items as examples
      CNN             34         10            4                           3
      WorldFact Book  25         1             165                         1
      MSN             46         15            3                           3
      Film.com        22         10            4                           3
      Amazon          99         25            8                           3
      Google          17         10            5                           3
      ACM DL          12         10            6                           3
      Ebay.com        65         50            5                           3
      BBC             33         10            4                           3
      News.com        43         30            3                           3

  39. Result Analysis (Cont…) • Recall-precision table for the ten test web sites
      Source          Recall (%)  Precision (%)  Generation time (s)  Extraction time (s)
      CNN             100         100            3                    9
      WorldFact Book  100         100            25                   40
      MSN             100         100            2                    4
      Film.com        100         100            1                    5
      Amazon          100         88             10                   43
      Google          100         86             3                    9
      ACM DL          90          98             1                    2
      Ebay.com        100         83.3           3                    19
      BBC             100         80             1                    4
      News.com        63.3        100            1                    6

  40. Result Analysis: Recall and Precision • Recall: 80% of the sites have 100% recall • Precision: 50% of the sites have 100% precision, and all have at least 80% precision • Four sites have 100% in both recall and precision • The recall for News.com is the lowest because News.com’s web pages are assembled from several news and newspaper web sites • The results reflect the best recall-precision balance. Increasing the number of tokens gives better recall but lower precision. Increasing the threshold gives better precision but lower recall.
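
The summary figures on this slide can be checked against the recall-precision table with a few lines of arithmetic. The values below are copied from that table; the variable names are arbitrary.

```python
# (source, recall %, precision %) copied from the recall-precision table.
results = [
    ("CNN", 100, 100), ("WorldFact Book", 100, 100), ("MSN", 100, 100),
    ("Film.com", 100, 100), ("Amazon", 100, 88), ("Google", 100, 86),
    ("ACM DL", 90, 98), ("Ebay.com", 100, 83.3), ("BBC", 100, 80),
    ("News.com", 63.3, 100),
]

full_recall = sum(1 for _, r, _ in results if r == 100)
full_precision = sum(1 for _, _, p in results if p == 100)
both = sum(1 for _, r, p in results if r == 100 and p == 100)
min_precision = min(p for _, _, p in results)

print(f"100% recall:    {full_recall} of {len(results)} sites")
print(f"100% precision: {full_precision} of {len(results)} sites")
print(f"100% in both:   {both} sites")
print(f"lowest precision: {min_precision}%")
```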

  41. Result Analysis: Wrapper Generation Time • Wrapper generation time: except for two examples, all need less than 5 seconds • Labeling time is not included in the wrapper generation time; labels are entered with the help of our GUI tool. The labeling time depends on how many items are selected as training examples and how many fields each item contains. Labeling took less than 10 minutes for every example except the WorldFact Book page

  42. Result Analysis: Extraction Time • Data extraction time: 70% of the sites take less than 10 seconds • The extraction time is related to the HTML file size, and HTML files are usually not very large • The wrapper generation time and the labeling time are acceptable • The data extraction time is not too long and is tolerable for real-time web applications

  43. Result Analysis: Comparison with Other Approaches • Wrappers are generated automatically, and a friendly GUI tool helps the user enter labels and extract data • Simple and powerful rules that can deal with missing and mis-ordered items in web pages

  44. Result Analysis: Comparison with Other Approaches (Cont…) • We need fewer training examples because: • When the HTML file has no missing or mis-ordered items, we need no more examples than other methods • When there are missing or mis-ordered items, the examples need not cover every combination of missing and mis-ordered items in the web pages • Quick wrapper generation and the assignment of weights to tokens make maintenance and updating easier

  45. Conclusion • We use weighted token lists to find and extract data from HTML files • A friendly GUI tool makes wrappers easy to generate • The results are acceptable

  46. References
      [1] S. Abiteboul. Querying Semistructured Data. In Proceedings of the International Conference on Database Theory (ICDT), January 1997.
      [2] S. Abiteboul, D. Quass, J. McHugh, J. Widom, and J. Wiener. The Lorel Query Language for Semistructured Data. Journal of Digital Libraries, November 1996: 68-88.
      [3] Naveen Ashish, Craig A. Knoblock. Semi-Automatic Wrapper Generation for Internet Information Sources. CoopIS 1997: 160-169.
      [4] Naveen Ashish and Craig Knoblock. Wrapper Generation for Semi-Structured Internet Sources. SIGMOD Record 26(4): 8-15, 1997.
      [5] S. Chawathe, H. Garcia-Molina, J. Hammer, K. Ireland, Y. Papakonstantinou, J. Ullman, and J. Widom. The TSIMMIS Project: Integration of Heterogeneous Information Sources. Proceedings of Tenth Anniversary Meeting of Information Processing Society of Japan, Tokyo, Japan, 1994: 7-18.
      [6] J. Hammer, H. Garcia-Molina, J. Cho, R. Aranha, A. Crespo. Extracting Semistructured Information from the Web. In Proceedings of the Workshop on Management of Semistructured Data, Tucson, Arizona, May 1997.
      [7] Chun-nan Hsu et al. Finite-State Transducers for Semi-structured Data Extraction From the Web. Information Systems, 23(8): 521-538, 1998.
      [8] Nicholas Kushmerick, Daniel S. Weld, Robert Doorenbos. Wrapper Induction for Information Extraction. International Joint Conference on Artificial Intelligence: 729-737, 1997.
      [9] Ion Muslea, Steve Minton, Craig Knoblock. Hierarchical Wrapper Induction for Semistructured Information Sources. Journal of Autonomous Agents and Multi-Agent Systems, 4: 93-114, 2001.
      [10] Arnaud Sahuguet, Fabien Azavant. WysiWyg Web Wrapper Factory (W4F). Unpublished, 1999. http://db.cis.upenn.edu/Research/w4f.html
      [11] W3C. HTML 4.01 specification, http://www.w3.org/TR/html4/
      [12] W3C. XML 1.0, http://www.w3.org/TR/1998/REC-xml-19980210
